Truth, Lies, and Synthetic Data: Using Synthetic Data Effectively in Market Research

The recent ESOMAR Congress was illuminating on a number of topics – but particularly regarding how far the commercial research industry has already come in adopting Generative AI technology. At the conference a year ago, there were a lot of pitches by AI businesses, and a lot of anxiety about what AI would mean. Twelve months later, we saw far more pragmatism, and while there is still anxiety, the way the industry is likely to adopt these tools is taking much clearer shape.

AI seems to be being used in four broad ways:

  • Moderation tools – using Large Language Models to automate aspects of the interview process, both to write questionnaires and discussion guides and to moderate short interviews directly
  • Operational Agents – automating the research implementation process to cut costs and time
  • Automated analysis – finding themes in verbatim data, and extracting and visualising numerical findings to accelerate the process of generating insights
  • Synthetic Data – creating new data based on patterns of past responses and wider learning, to augment the data collected from real humans.

While each of these applications has pros and cons, it is synthetic data which is by far the most controversial, as we’ll see.

Synthetic data applications

Synthetic data has been used with great success in other fields. In the automotive sector, it helps simulate thousands of accident scenarios without the need for dangerous and costly real-world crashes. In healthcare, it provides research teams with vast datasets that preserve patient privacy while enabling breakthroughs in rare disease detection and treatment. Financial institutions are using it to model fraud prevention without exposing sensitive transactions.

However, the commercial research challenge is arguably different. Rather than zeroing in on a specific pattern in an image or scenario, here we are trying to predict what real people – who are notoriously chaotic, mercurial and illogical – will say and do. And the essence of the research industry is to get to the truth and the reality of behaviour, not a hypothetical model. Despite the challenge, synthetic data is already being applied in a number of ways:

  1. Building ‘personas’ – Large Language Model-based tools that can be interacted with in natural language – to mine research datasets, both within and across studies, to answer key client questions.
  2. ‘Filling in the gaps’ in datasets – estimating the answers that were not given in an interview, based on the answers from people who did give them, or from past studies that asked those questions.
  3. Boosting the sample size of studies, especially when it comes to hard-to-reach populations, by creating ‘digital twins’ of whole participants, with a view to increasing the robustness of findings from small groups.

The benefits are obvious – finding and interviewing real people can be relatively expensive and time-consuming, despite the automation journey the research industry has been on for years. It makes sense to use AI to squeeze more juice out of the data we have. And anything that makes the results more accessible and usable is a good thing.

A reality check

However, it’s important not to be swept along by the excitement around synthetic data, particularly when it comes to the creation of ‘new’ data such as digital twins or synthetic answers. To state the obvious – these are not real data. They are estimates, and estimates have errors. To treat such data the same as real human data is at best risky and at worst deceptive, and it is a relief that research industry guidelines have been updated to make transparency a necessary condition of such applications.

But there are some implications of this reality that are important to be aware of:

  1. All models tend to regress to the mean. That means synthetic data will tend towards an ‘average’ answer, rather than reflecting the range of responses real people give. We’ve seen this in practice at Affectiva / iMotions when we compare predicted eye-tracking to real eye-tracking. The prediction tends to show a strong centre bias on the screen, which is true on average, so it will produce meaningful correlations, but it misses a lot of the insight at the edges that becomes apparent when we look at data from real people.
  2. Models tend to miss the extremes, so distributions of synthetic data are often narrower than the real data – and that’s challenging given that the interesting results are often at the extremes. It’s particularly challenging in research that seeks to measure responses to novel ideas, such as ad testing or new product development, where asking AI to evaluate ideas it has never seen before may lead to more errors, not fewer.
  3. The internal relationships within the data get diluted. As all the model estimates have errors, the relationships between, say, certain brand image perceptions and purchasing can begin to weaken – which is bad news if that data is then used to understand the importance of different attributes or the drivers of behaviour.
  4. Synthetic data is only as good as the ground truth data it is trained on. AI is a numbers game – the more real data you have, the better your model (which is why the Affectiva facial coding technology works so well: we have access to data from millions of cases to train on). Biases that exist in that ground truth will be amplified, and that’s especially true when training cases are sparse.
  5. Perhaps most importantly, there is no statistical ‘magic’ in a Large Language Model or other AI tool that suspends the laws of statistics. You can create as many digital twins as you like, but you are still basing them on the data you have – and when the use case is boosting low-incidence samples, the issue is compounded. If you start with 30 people, you still have 30 people at the end, even if you created another 170 digital twins. You can’t calculate a standard error from data that includes synthetic records – the concept simply doesn’t apply, because the twins are duplicates, even if the duplication is fuzzy.
  6. Ironically, it can be more expensive to create synthetic data than to collect real data. Building synthetic twins is costly, and if it costs less than a dollar to ask a real person the questionnaire (whether that is a good thing or not is a topic for another post), why wouldn’t you? AI will always be faster, but it may not always be cheaper.
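Points 1–3 above can be seen in a toy simulation. The sketch below (Python, with entirely invented numbers – not data from any real study) imputes missing answers to one survey question with the sample mean, a crude stand-in for any model that regresses towards the average, and shows that the filled-in column has both a narrower spread and a weaker correlation with a related variable:

```python
# Toy illustration: mean-style imputation narrows the distribution of
# answers and dilutes the relationship between two survey variables.
# All numbers are made up for demonstration purposes.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
perception = rng.normal(0.0, 1.0, n)                 # e.g. a brand image score
intent = 0.6 * perception + rng.normal(0.0, 0.8, n)  # correlated purchase intent

# 'Fill in the gaps': suppose half the intent answers are missing, and
# we substitute the mean of the answers we do have.
missing = rng.random(n) < 0.5
imputed = intent.copy()
imputed[missing] = intent[~missing].mean()

corr_real = np.corrcoef(perception, intent)[0, 1]
corr_imp = np.corrcoef(perception, imputed)[0, 1]
print(f"std of answers:  real {intent.std():.2f} vs imputed {imputed.std():.2f}")
print(f"correlation:     real {corr_real:.2f} vs imputed {corr_imp:.2f}")
```

The imputed column is visibly ‘squashed’ towards the average, and the perception–intent relationship weakens – exactly the dilution described above.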

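Point 5 can also be illustrated with a small simulation. The sketch below (again with invented numbers) builds 170 fuzzy ‘digital twins’ from 30 real answers and shows that while the naive standard error of the combined 200 records looks small, the actual trial-to-trial variability of the result still behaves like a sample of 30:

```python
# Toy illustration: why 170 'digital twins' of 30 real respondents do
# not give you the precision of 200 real interviews. Numbers invented.
import numpy as np

def twin_experiment(n_real=30, n_twins=170, noise=0.2,
                    n_trials=500, seed=0):
    """Compare the naive standard error of a real+twin sample with the
    actual spread of its mean across repeated simulated studies."""
    rng = np.random.default_rng(seed)
    means, naive_ses = [], []
    for _ in range(n_trials):
        real = rng.normal(0.0, 1.0, n_real)          # 30 real answers
        # Each twin is a fuzzy duplicate of a randomly chosen real answer.
        twins = (real[rng.integers(0, n_real, n_twins)]
                 + rng.normal(0.0, noise, n_twins))
        combined = np.concatenate([real, twins])
        means.append(combined.mean())
        naive_ses.append(combined.std(ddof=1) / np.sqrt(len(combined)))
    # The naive SE pretends n=200; the true spread behaves like n=30.
    return float(np.mean(naive_ses)), float(np.std(means))

naive_se, true_spread = twin_experiment()
print(f"naive standard error (as if n=200): {naive_se:.3f}")
print(f"actual spread of the study mean:    {true_spread:.3f}")
```

The naive standard error understates the real uncertainty by more than a factor of two, because the twins carry no information beyond the 30 people they were cloned from.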
Some guidelines

All of the above is not to say that synthetic data has no place in commercial research – if it can make data more usable, accessible and faster then that is a huge step forward for what can sometimes be a very dry and slow process for the end buyers of research.  However, we need to take care, and:

  • Be realistic:  Be clear on what is and is not real data, and apply appropriate caution to augmented results.
  • Be pragmatic: There are times when you need to go deep and explore ideas with real people and take your time to think about it, but there are other cases where a fast, mostly right answer is fine.  For instance, researching the core idea and execution of a new ad campaign requires real people – but testing the 100s of iterations of that campaign that follow with an AI model may be good enough to optimise the media plan, and is better than no research at all.
  • Keep feeding the machine. The worst case is for the industry to rely only on models and feed them synthetic data, which will then inevitably diverge from reality. We can’t let our AI choke on its own exhaust.

The arrival of generative AI is a transformative moment for commercial research, and if the industry is open-minded, it can get us to better insights faster, and at greater scale; but we need to ground ourselves in the reality of these techniques, and continue to draw on real human data too.
