Tech Translated: Synthetic data

December 07, 2023

What is synthetic data?

Synthetic data is exactly what it sounds like: data that has been artificially created (usually via algorithms, statistical models, or generative AI), rather than generated directly from real-world activities. To develop synthetic data, information from almost any source is analyzed to detect structures and patterns, which then serve as the foundation for creating new datasets that mimic the core characteristics of the original.
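
To make the "detect patterns, then generate" idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular vendor's method: the toy dataset, the column meanings (age and monthly spend), and the function names `make_toy_real_data`, `fit_patterns`, and `generate_synthetic` are all hypothetical, and the only "pattern" captured is the mean, spread, and correlation of two numeric columns. Real tools model far richer structure.

```python
import math
import random
import statistics


def make_toy_real_data(n, rng):
    """Hypothetical stand-in for a real dataset: (age, monthly_spend) pairs."""
    rows = []
    for _ in range(n):
        age = rng.gauss(40, 10)
        spend = 0.5 * age + rng.gauss(0, 5)  # spend loosely depends on age
        rows.append((age, spend))
    return rows


def fit_patterns(rows):
    """Summarize two numeric columns by their means, spreads, and correlation."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    std_x, std_y = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (len(rows) - 1)
    return {
        "mean_x": mean_x, "mean_y": mean_y,
        "std_x": std_x, "std_y": std_y,
        "corr": cov / (std_x * std_y),
    }


def generate_synthetic(params, n, rng):
    """Draw new rows from a bivariate normal with the fitted summary statistics."""
    corr = params["corr"]
    residual = math.sqrt(max(0.0, 1.0 - corr ** 2))  # guard against rounding
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mean_x"] + params["std_x"] * z1
        y = params["mean_y"] + params["std_y"] * (corr * z1 + residual * z2)
        rows.append((x, y))
    return rows


if __name__ == "__main__":
    real = make_toy_real_data(500, random.Random(0))
    params = fit_patterns(real)
    synthetic = generate_synthetic(params, 500, random.Random(1))
    print("real correlation:     ", round(params["corr"], 2))
    print("synthetic correlation:", round(fit_patterns(synthetic)["corr"], 2))
```

None of the synthetic rows is copied from the original, yet the two datasets should show similar averages and a similar relationship between the columns. Anything the summary does not capture, such as rare outliers or non-linear effects, will not carry over.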

What business problems can synthetic data address?

Identifying, collecting, and structuring relevant data in ways that enable it to inform business decisions is time-consuming, expensive, and potentially risky. At the same time, “Every business is dealing with data protection rules and the right handling of sensitive data,” says Marcus Hartmann, partner and chief data officer for PwC Germany and Europe. “Synthetic data can give a clear impression of something without exposing the underlying origins and sensitive information.”

When appropriate data is inaccessible due to concerns about confidentiality, privacy, or regulatory compliance—or simply doesn’t exist in sufficient quantities to be useful—synthetic datasets can sidestep these restrictions.

How does synthetic data create value?

Synthetic data is often a faster, lower-cost way to obtain large quantities of data than traditional data collection and curation methods. It therefore has the potential to accelerate the data-driven transformation of many industries by becoming a foundation for training machine-learning and AI models, which in turn enables the development of new products, services, and ways of working, and finally delivers on the promise of “big data” that got us all so excited a few years back.

Synthetic data is already being used in many industries. Amazon used synthetic data about speech patterns, syntax, and semantics to improve multilingual speech recognition in its Alexa virtual assistant. The UK’s National Health Service (NHS) has converted real-world data on patient admissions for accident and emergency (A&E) treatment into a statistically similar but anonymized open-source dataset to help NHS care organizations better understand and meet the needs of patients and healthcare providers. Synthetic health data has also been used by Alphabet and US insurance company Anthem to improve insurance fraud detection.

However, this is still relatively early-stage tech, and as with any other machine-generated information, the output is only as good as the inputs and algorithms. Anomalies and outliers in the source data can be amplified or lost altogether; either outcome makes the end product less representative of the real data it’s meant to replace. Synthetic datasets might also accidentally retain some personally identifiable information from the source, which could violate people’s privacy and expose organizations using the data to legal action.
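
Two of the risks described above can be screened for with simple checks. The Python sketch below is a minimal illustration, assuming records are stored as tuples; the function names `find_leaked_rows` and `tail_share` and the sample records are hypothetical. An exact-match check is only a first screen and is not a privacy guarantee, since near-copies or re-identifiable combinations of fields would pass it.

```python
def find_leaked_rows(real_rows, synthetic_rows):
    """Return synthetic rows that exactly reproduce a row from the real data."""
    real_set = set(real_rows)  # rows must be hashable, e.g. tuples
    return [row for row in synthetic_rows if row in real_set]


def tail_share(values, threshold):
    """Fraction of values strictly above a threshold (a crude outlier measure)."""
    if not values:
        return 0.0
    return sum(1 for v in values if v > threshold) / len(values)


if __name__ == "__main__":
    # Hypothetical records: (postcode area, age, annual claim amount).
    real = [("AB1", 34, 1200), ("CD2", 58, 90000), ("EF3", 41, 800)]
    synthetic = [("AB1", 36, 1100), ("CD2", 58, 90000), ("GH4", 45, 950)]

    print("leaked rows:", find_leaked_rows(real, synthetic))

    real_claims = [row[2] for row in real]
    synthetic_claims = [row[2] for row in synthetic]
    print("real share above 50,000:     ", tail_share(real_claims, 50000))
    print("synthetic share above 50,000:", tail_share(synthetic_claims, 50000))
```

A large gap between the two tail shares would suggest that outliers were amplified or lost during generation, while any leaked rows would warrant a closer privacy review before the dataset is shared.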

Generative AI has been known to “hallucinate” incorrect information: it fails to recognize anomalies in the underlying model and draws conclusions that seem statistically likely but are not supported by the actual data. Any synthetic datasets created from those hallucinations are then affected. Some fear that, because of this phenomenon, the proliferation of synthetic data could over time introduce feedback loops that would make AI-generated information less reliable.
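
The feedback-loop concern can be illustrated with a toy simulation. This is a deliberately simplified sketch, not a model of any real generative AI system: the "model" is just a normal distribution described by a mean and a standard deviation, and each generation is refitted only on data sampled from the previous generation's model. The function name `simulate_feedback_loop` and its parameters are hypothetical.

```python
import random
import statistics


def simulate_feedback_loop(generations, sample_size, seed=0):
    """Refit a normal model on its own samples and record the fitted spread.

    Returns a list of standard deviations: the starting value followed by
    one entry per generation.
    """
    rng = random.Random(seed)
    mean, std = 0.0, 1.0  # generation 0 stands in for the "real" data
    history = [std]
    for _ in range(generations):
        # Generate synthetic data from the current model...
        sample = [rng.gauss(mean, std) for _ in range(sample_size)]
        # ...then train the next model on that synthetic data alone.
        mean = statistics.fmean(sample)
        std = statistics.stdev(sample)
        history.append(std)
    return history


if __name__ == "__main__":
    history = simulate_feedback_loop(generations=200, sample_size=20)
    for generation in (0, 50, 100, 200):
        print(f"generation {generation}: fitted spread = {history[generation]:.3f}")
```

Because each generation sees only a finite sample of the previous one, small estimation errors compound rather than cancel. In simulations of this kind the fitted spread tends to wander away from its starting value, and over many generations it often shrinks, which means the rare cases in the original data gradually disappear. Mixing fresh real-world data back in at each step is one commonly suggested way to dampen the effect.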

Ensuring the value of synthetic data will require robust human due diligence. Following the guidance of PwC’s “Responsible AI” toolkit can help.

Who should be paying attention to synthetic data?

There are potential applications for synthetic data in almost every business, with CIOs, CTOs, CISOs, and the research and development, data and analytics, legal and compliance, and marketing and sales departments likely already exploring their options. Industries that deal with issues of data privacy and access—notably, healthcare, pharmaceuticals and life sciences, and financial services—are likely to see the greatest benefits.

Key takeaways

Synthetic data is artificially generated to mimic real datasets while protecting privacy. It enables safe model training, analytics, and testing without exposing sensitive information, helping organizations innovate responsibly even in data-constrained environments.

Last updated on 7 December 2023

© 2017 - 2026 PwC. All rights reserved. PwC refers to the PwC network and/or one or more of its member firms, each of which is a separate legal entity. Please see www.pwc.com/structure for further details. This content is for general information purposes only, and should not be used as a substitute for consultation with professional advisors.