Synthetic Data: The Promise and Peril of AI’s Hottest Tool

From safer cars to rare disease breakthroughs, artificially generated datasets promise big wins, but new laws and ethical red lines are putting limits on their reach, and rightly so.

Synthetic data, artificially generated by algorithms rather than collected from the real world, is reshaping how industries develop AI.

In the automotive sector, it is helping simulate thousands of accident scenarios without the need for dangerous and costly real-world crashes. In healthcare, it is providing research teams with vast datasets that preserve patient privacy while enabling breakthroughs in rare disease detection and treatment. Financial institutions are using it to model fraud prevention without exposing sensitive transactions.

The advantages are clear: speed, scale, privacy. But as adoption spreads, so do questions about accuracy, bias, and ethics, particularly under the European Union’s new Artificial Intelligence Act.

When Synthetic Data Saves Lives

Automotive safety is one of the clearest success stories. Testing autonomous vehicles in the real world is expensive, slow, and sometimes impossible, especially for rare edge cases such as an animal darting into the road in poor weather. Synthetic datasets can recreate these moments in controlled environments, enabling models to learn from scenarios they might never otherwise encounter.

In medicine, synthetic patient records have become an important tool for training diagnostic algorithms without breaching privacy laws like HIPAA or GDPR. For rare diseases, where case numbers are too low to train models effectively, synthetic data can be used to “amplify” examples while protecting patient identities.
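
As a rough illustration of that amplification step, the sketch below jitters real records to produce synthetic ones. Everything here is a hypothetical stand-in, including the feature counts, the noise scale, and the jitter-based generator itself, which a real pipeline would typically replace with a stronger technique such as SMOTE-style resampling or a generative model:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def amplify_minority(X_rare, n_new, noise_scale=0.05):
    """Create synthetic minority-class records by jittering real ones.

    Each synthetic row is a randomly chosen real record plus small
    per-feature Gaussian noise, so no real record is copied verbatim.
    """
    idx = rng.integers(0, len(X_rare), size=n_new)    # sample real rows with replacement
    scale = noise_scale * X_rare.std(axis=0)          # per-feature noise magnitude
    return X_rare[idx] + rng.normal(0.0, scale, size=(n_new, X_rare.shape[1]))

# Toy example: 20 real rare-disease records, 5 features each,
# amplified to 200 synthetic training records.
X_rare = rng.normal(size=(20, 5))
X_synthetic = amplify_minority(X_rare, n_new=200)
print(X_synthetic.shape)  # (200, 5)
```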

Fraud detection teams at banks have also embraced synthetic datasets to rehearse responses to evolving criminal tactics, avoiding the legal and security risks of working with real customer data.

Where Synthetic Data Gets Risky

Despite its promise and its real, demonstrated value, synthetic data carries what researchers call a “reality gap.” No matter how advanced the generation technique, the data still originates in a simulation. Humans are inherently complex, and subtle real-world signals, such as variations in behavior, environmental unpredictability, or cultural nuance, can vanish in the abstraction.

To be clear, this is not a critique of the universality of human expressions; that science is well established. Decades of cross-cultural research, along with the proven success of facial coding and facial expression analysis in real-world contexts, show that expressions share universal, consistent similarities across the globe. These consistencies are strong enough to form the foundation of reliable emotion research and commercial applications alike.

The risk lies elsewhere: when synthetic data is used to train predictive AI models. Affectiva’s facial expression analysis, for instance, depends on capturing micro-expressions and emotional cues from the faces of real people. Training such systems on synthetic faces risks stripping out the very nuances they are designed to detect. 

Consider, for example, that in Japan a smile can often be used to mask discomfort or disapproval rather than happiness, or that in several Pacific islands raised eyebrows can indicate approval rather than surprise. Without grounding in authentic data, algorithms may appear accurate in testing but misinterpret emotion in real-life scenarios—leading to compromised research outcomes or flawed product decisions.

And as every scientist knows, bias is a hazard that must be mitigated at all costs. If the real-world data used to train a synthetic generator already contains demographic imbalances, the resulting datasets can perpetuate—or even amplify—those skews. Worse still, the apparent “cleanliness” of synthetic data can foster a false sense of neutrality, hiding biases from scrutiny and making them more dangerous than those present in messier, but authentic, human datasets.
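
One lightweight safeguard is to compare group proportions before and after generation, so that a skew the generator amplifies is visible rather than hidden behind the data's apparent cleanliness. The demographic labels and counts below are invented purely for illustration:

```python
from collections import Counter

def group_shares(labels):
    """Return each group's share of the dataset."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

# Hypothetical demographic labels for real and synthetic records.
real_labels = ["A"] * 700 + ["B"] * 300
synthetic_labels = ["A"] * 820 + ["B"] * 180  # generator has amplified the majority group

real_shares = group_shares(real_labels)
syn_shares = group_shares(synthetic_labels)
for group in real_shares:
    print(f"group {group}: real {real_shares[group]:.0%} -> synthetic {syn_shares.get(group, 0):.0%}")
# group A: real 70% -> synthetic 82%   (skew amplified)
# group B: real 30% -> synthetic 18%
```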

The Regulatory Squeeze

The EU’s AI Act, passed by the European Parliament earlier this year, puts synthetic data under closer watch. The law categorizes AI applications by risk level, with the most stringent obligations on systems that affect safety, rights, or democratic processes.

Under the Act, developers must be transparent about data sources, prove that synthetic datasets do not introduce bias, and in some high-risk sectors, maintain real-world validation datasets. In other words: synthetic data alone may not be enough to meet compliance.

For companies in sectors like healthcare or automotive, this means hybrid approaches, combining synthetic and real-world data, will become not just best practice, but a legal necessity.
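
As a minimal sketch of that hybrid discipline, the toy pipeline below (random placeholder data, scikit-learn purely for brevity) trains on a mix of real and synthetic records but validates only against held-out real-world data, so evaluation never rests on simulated examples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(seed=1)

# Hypothetical stand-ins: real labelled data is scarce, synthetic is plentiful.
X_real, y_real = rng.normal(size=(200, 8)), rng.integers(0, 2, size=200)
X_syn, y_syn = rng.normal(size=(2000, 8)), rng.integers(0, 2, size=2000)

# Hold out a purely real validation set; synthetic data never enters it.
X_val, y_val = X_real[:100], y_real[:100]
X_train = np.vstack([X_real[100:], X_syn])
y_train = np.concatenate([y_real[100:], y_syn])

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("real-world validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```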

The Case for Keeping Humans in the Loop

For scientific platforms that analyze human behavior, whether it’s facial expressions, speech, or physiological signals, there is no substitute for real-world data in training and validation.

Synthetic augmentations can help fill gaps, rebalance datasets, or simulate rare scenarios, but the “ground truth” must come from actual human observations. Without it, algorithms risk losing their sensitivity to the complexity of human behavior, a danger not only for research accuracy, but also for the trustworthiness of any commercial application.

Where the Line Should Be Drawn

Synthetic data has proven itself as a powerful ally, especially for scaling datasets, generating rare scenarios, and protecting privacy. In most industries, the strongest results will come from a hybrid approach, where synthetic and real-world data work together: synthetic for scale and diversity, real for grounding models in reality.

But some domains carry subtler demands. In areas where algorithms need to detect the intricacies of human emotion, micro-expressions, and behavior, such as facial coding, sentiment analysis, or behavioral research, only genuine human data can capture the full range of nuance.
