Online Interviews: How to Best Measure Emotional Engagement & Valence Online

A Practitioner’s Guide to AI-Powered Biometric Research

Online interviews often rely on self-report, but emotional engagement and valence are difficult to articulate. This article explores how iMotions and audEERING enable real-time, objective measurement of emotional responses through facial expression and voice analysis using standard webcams and microphones.

Introduction: The Limits of What People Say

Online interviews and focus groups have become the dominant method for qualitative research across market research, UX, communications, and social science. But there is a fundamental problem with relying solely on what respondents say: people are notoriously poor reporters of their own emotional states. They rationalize, self-censor, socially conform, and sometimes simply lack the vocabulary to describe what they actually feel.

Emotional valence, which is the degree to which a feeling is positive or negative, and emotional engagement, which is how intensely a person is involved in an experience, are two of the most important signals a researcher can capture in an interview. Yet traditional interview methodology leaves both almost entirely unmeasured, relying on post-hoc self-report or the subjective interpretation of a skilled moderator.

A new generation of AI-powered biometric tools is changing this. By combining facial expression analysis (FEA) with voice-based emotion AI, researchers can now capture objective, real-time measures of valence and engagement during online interviews, using a standard webcam and microphone.

At iMotions, we see this shift reflected in the growing integration of multimodal behavioral data with advanced voice analytics. Through our platforms, and in collaboration with partners such as audEERING, a Germany-based specialist in AI-driven audio and voice analysis, researchers can combine physiological, behavioral, and vocal signals to better understand human responses.

1. What Are Valence and Engagement?

Before diving into methodology, it is worth being precise about what these terms mean in the context of affective science.

Valence

Valence describes the positivity or negativity of an emotional state. It is one of the two core dimensions of the circumplex model of affect (alongside arousal), a widely used framework in emotion research. A person watching a heartwarming story has high positive valence. A person reading a frustrating instruction manual has negative valence. Valence is distinct from intensity: a person can feel mildly happy (positive valence, low arousal) or intensely joyful (positive valence, high arousal).

In an interview context, valence tells you whether a respondent’s emotional reaction to a topic, stimulus, or question is fundamentally pleasant or unpleasant, irrespective of what they say. This distinction matters enormously, because a respondent might describe a product as ‘fine’ while displaying negative facial valence throughout the discussion of it.

Engagement

Engagement, as measured in behavioral research, reflects the level of expressiveness and active involvement a person shows in response to a stimulus or situation. It captures how much a person is ‘in’ an experience, not just how they feel about it. High engagement may be positive or negative: a person who is furious is highly engaged; a bored person is not.

In interview research, engagement is a proxy for relevance and salience. Topics that produce high engagement are the ones that matter to participants. Topics that flatline on engagement metrics, even when respondents give detailed verbal answers, may be topics that respondents are processing intellectually rather than feeling.

Arousal and Dominance

A third dimension commonly used in affective research is arousal, which covers the physiological and psychological activation level associated with an emotional state. Relaxation and boredom sit at the low-arousal end; excitement and anger sit at the high-arousal end. Dominance, a less commonly used fourth dimension, captures the degree to which a person feels in control of a situation.

Both arousal and dominance are measurable from voice characteristics. audEERING’s devAIce technology, integrated into iMotions’ voice analysis module, outputs all three dimensions (valence, arousal, and dominance) on continuous scales in real time. This three-dimensional picture of emotional expression provides significantly more nuance than simple categorical labels like ‘happy’ or ‘sad’.
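
To make the dimensional output concrete, here is a minimal sketch of what a continuous three-dimensional stream might look like in analysis code. The `VoiceEmotionSample` record, its field names, and the scale ranges are illustrative assumptions, not the actual devAIce output schema:

```python
from dataclasses import dataclass

@dataclass
class VoiceEmotionSample:
    """One windowed voice-analysis result. Field names and scale
    ranges are illustrative assumptions, not the devAIce schema."""
    timestamp_ms: int  # position on the interview timeline
    valence: float     # e.g., -1.0 (negative) to +1.0 (positive)
    arousal: float     # e.g., 0.0 (calm) to 1.0 (activated)
    dominance: float   # e.g., 0.0 (submissive) to 1.0 (in control)

def mean_dimensions(samples):
    """Summarize a stream of samples for one interview segment."""
    n = len(samples)
    return (
        sum(s.valence for s in samples) / n,
        sum(s.arousal for s in samples) / n,
        sum(s.dominance for s in samples) / n,
    )

segment = [
    VoiceEmotionSample(0, 0.2, 0.6, 0.5),
    VoiceEmotionSample(1000, 0.1, 0.7, 0.4),
]
print(mean_dimensions(segment))  # roughly (0.15, 0.65, 0.45)
```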

2. The Two Signal Sources: Face and Voice

There are two primary non-intrusive channels through which emotional engagement and valence can be measured remotely during an online interview: the participant’s face and their voice. Both are captured via webcam and microphone.

Facial Expression Analysis (FEA)

Facial expression analysis uses computer vision to detect and quantify facial muscle movements in real time. The scientific foundation is the Facial Action Coding System (FACS), developed by psychologists Paul Ekman and Wallace Friesen, which provides an objective, anatomy-based taxonomy of all visible facial muscle movements. These are called Action Units (AUs).

Rather than simply labeling a ‘happy face,’ FACS-based systems identify the specific muscle movements that compose an expression. A cheek raise combined with a lip corner pull, for example, indicates joy. iMotions integrates Affectiva’s AFFDEX engine, one of the most widely validated automated facial coding systems available, to detect up to 20 Action Units per video frame, along with seven core emotion classifications (joy, anger, fear, surprise, sadness, contempt, and disgust) and, critically, composite metrics for valence and engagement.
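
As a concrete illustration of that FACS logic, here is a minimal sketch of a rule that flags joy from AU intensities, using the well-known AU6 (cheek raiser) plus AU12 (lip corner puller) combination; the dictionary frame format and the threshold are assumptions for illustration, not the AFFDEX API:

```python
def shows_joy(au_intensities: dict, threshold: float = 0.5) -> bool:
    """Flag joy when AU6 (cheek raiser) and AU12 (lip corner puller)
    are both active. Frame format and threshold are illustrative."""
    return (
        au_intensities.get(6, 0.0) >= threshold
        and au_intensities.get(12, 0.0) >= threshold
    )

frame = {6: 0.8, 12: 0.7, 4: 0.1}  # AU number -> intensity for one frame
print(shows_joy(frame))  # True
```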

What iMotions FEA Measures

  • Seven core emotions: Joy, Anger, Fear, Surprise, Sadness, Contempt, Disgust
  • Up to 20 Action Units (AUs) — the raw muscle movement data
  • Valence — the continuous positive-to-negative emotional tone
  • Engagement — the expressiveness and active involvement of the participant
  • Head pose, blink metrics, 3D head orientation, and attention indicators

[Figure: the seven basic emotions and their corresponding FACS Action Units]

Crucially, iMotions FEA is available both in the full desktop lab environment and through its Online and Remote Data Collection (RDC) platform, running directly in a browser via the participant’s own webcam. This means FEA can be deployed at scale, globally, without any special hardware or participant travel.

Voice Analysis: The Hidden Emotional Signal

While facial expression captures the outward emotional display, the human voice carries a parallel and complementary stream of emotional information — one that is harder for respondents to consciously control. Vocal characteristics including pitch, speaking rate, loudness, and intonation vary systematically with emotional state, and these variations can be detected and quantified by AI systems trained on large corpora of emotionally annotated speech.

iMotions’ Voice Analysis Module is powered by audEERING’s devAIce technology. Founded in 2012 as a spin-off of the Technical University of Munich, audEERING has spent over a decade building and validating AI models for vocal expression analysis. Its devAIce platform analyzes approximately 7,000 acoustic parameters covering phonatory, articulatory, and prosodic aspects of speech — making it among the most comprehensive voice analysis systems available.

What iMotions Voice Analysis (audEERING devAIce) Measures

  • Valence — the positive-to-negative emotional tone of the voice
  • Arousal — the activation or energy level present in the vocal signal
  • Dominance — the perceived control or confidence in the speaker’s voice
  • Categorical emotion states: anger, happiness, sadness, neutral
  • Prosodic features: pitch, loudness, speaking rate, and intonation
  • Speaker attributes: estimated age and gender

The devAIce system operates with two models simultaneously: a dimensional model that places the voice on continuous scales for arousal, valence, and dominance, and a categorical classifier that assigns the voice to discrete emotion categories. This dual approach provides both nuanced continuous data and interpretable categorical outputs in the same analysis stream.
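
In code, a single analysis window from such a dual-model system could be represented as one record carrying both output types. The record below and the plausibility cross-check are illustrative assumptions, not the devAIce data format:

```python
from dataclasses import dataclass

@dataclass
class DualModelResult:
    # dimensional model: continuous scales
    valence: float
    arousal: float
    dominance: float
    # categorical classifier: discrete label
    category: str  # "anger", "happiness", "sadness", or "neutral"

def streams_agree(r: DualModelResult) -> bool:
    """Rough plausibility check between the two models (illustrative)."""
    if r.category == "happiness":
        return r.valence > 0
    if r.category in ("anger", "sadness"):
        return r.valence < 0
    return True  # neutral is unconstrained

print(streams_agree(DualModelResult(0.4, 0.7, 0.6, "happiness")))  # True
```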

Why Both Channels are Important: The Multimodal Advantage

Faces and voices carry overlapping but distinct emotional information. A person can smile while speaking in a tense, high-arousal voice. A person can speak in calm, measured tones while showing a subtle brow furrow — an Action Unit associated with confusion or concern. These divergences are not methodological noise; they are meaningful data.

In communication research, the concordance or discordance between facial and vocal emotional signals is itself a research finding. A respondent whose face and voice are emotionally aligned is likely experiencing a genuine, integrated emotional response. A respondent whose face shows positive valence but whose voice shows elevated arousal and neutral categorical emotion may be performing positivity — telling you what they think you want to hear.

The iMotions platform synchronizes facial and voice data at the millisecond level, aligning both streams with stimulus events and survey responses in a single unified timeline. This means that at any moment in an interview, you can see what the participant said, what their face expressed, what their voice suggested emotionally, and what stimulus or question they were responding to.
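
A sketch of what that alignment enables in analysis, assuming hypothetical per-frame facial and per-second voice exports (the column names are assumptions); pandas’ merge_asof pairs each facial frame with the most recent voice window:

```python
import pandas as pd

face = pd.DataFrame({  # ~30 Hz facial frames (hypothetical export)
    "timestamp_ms": [0, 33, 66, 100, 1000, 1033],
    "face_valence": [0.1, 0.2, 0.15, 0.1, -0.2, -0.25],
})
voice = pd.DataFrame({  # 1 Hz voice windows (hypothetical export)
    "timestamp_ms": [0, 1000],
    "voice_valence": [0.05, -0.15],
})

# Pair each facial frame with the most recent voice window.
merged = pd.merge_asof(face, voice, on="timestamp_ms", direction="backward")
print(merged)
```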

3. The Technology Stack: iMotions + audEERING

iMotions: The Research Platform

iMotions was founded to solve a specific problem: different biometric sensors produce data in different formats, at different sampling rates, with different software interfaces. Researchers who wanted to combine, say, eye tracking with facial expression analysis and physiological sensors faced an integration nightmare. iMotions built a unified platform that ingests, synchronizes, and presents all of these signals in a single environment.

Today, iMotions is used by more than three-quarters of the world’s top 100 universities and is trusted by researchers across academic and commercial settings. Its product suite includes iMotions Lab (full desktop environment for in-lab research), iMotions Online/Education (browser-based tool for teaching and lightweight research), and the Remote Data Collection (RDC) Platform (the full lab-grade capability deployed remotely over the internet).

For online interviews, the Remote Data Collection platform is the relevant product. It captures webcam eye tracking, facial expression analysis via Affectiva AFFDEX, voice analysis via audEERING devAIce, and webcam-based respiration — all through a standard browser, with no participant installation required. Studies are designed in iMotions Lab software, distributed via a shareable link, and analyzed back in the full iMotions analytics environment.

audEERING: The Voice AI Pioneer

audEERING GmbH, headquartered in Gilching near Munich, is the market leader for AI-based audio analysis. The company builds on 20 years of combined research heritage, having grown from academic foundations at the Technical University of Munich. Its core product, devAIce, is the engine behind iMotions’ Voice Analysis Module.

devAIce is available as an SDK, a Web API, and a plugin for game engines and XR platforms. Within iMotions’ RDC environment, it operates as an integrated module — participants’ audio is processed locally on the researcher’s hardware, ensuring data sovereignty and GDPR compliance. No audio data is sent to external servers.

The partnership between iMotions and audEERING was announced in August 2023. In the words of audEERING CEO Dagmar Schuller: ‘Together, we will make a significant contribution to improving scientific processes and usher in a new era for human behavior analysis.’ The integration was a natural fit — iMotions needed a best-in-class voice AI component, and audEERING needed a world-class research platform through which to deploy its technology in scientific and commercial research contexts.

4. Designing an Online Interview Study for Emotional Measurement

Collecting facial and voice data during an online interview is technically straightforward with iMotions RDC. The methodological challenge lies in study design: structuring the interview so that the data you collect is interpretable and comparable across participants.

Stimulus Design and Standardization

One of the most important lessons from biometric interview research is that variability in the interview flow makes data comparison difficult. If every participant follows a different conversational path, it is hard to isolate what was driving an emotional response at any given moment.

Best practice recommendations from iMotions research and UX practitioners suggest structuring the interview so that key stimulus moments — concepts shown, videos played, or specific questions asked — are consistent across all participants. iMotions’ study builder allows researchers to embed stimuli (images, videos, web content) directly into the interview flow and mark these as event markers in the timeline. This means that emotional data can be time-locked to specific stimuli, so that you can see exactly what was on screen or what question was being asked when a particular emotional peak occurred.
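
Once emotional signals are time-locked to event markers, per-stimulus summaries become straightforward. A minimal sketch, under assumed export shapes (the column names and stimulus labels are illustrative):

```python
import pandas as pd

signal = pd.DataFrame({  # continuous engagement signal, 10 Hz
    "timestamp_ms": range(0, 10000, 100),
    "engagement": [0.3] * 50 + [0.7] * 50,
})
events = pd.DataFrame({  # stimulus event markers
    "stimulus": ["concept_A", "concept_B"],
    "onset_ms": [0, 5000],
    "offset_ms": [5000, 10000],
})

# Mean engagement inside each stimulus window.
for _, ev in events.iterrows():
    window = signal[(signal["timestamp_ms"] >= ev["onset_ms"]) &
                    (signal["timestamp_ms"] < ev["offset_ms"])]
    print(ev["stimulus"], round(window["engagement"].mean(), 2))
# concept_A 0.3
# concept_B 0.7
```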

Webcam and Microphone Setup

All data collection via iMotions RDC requires only a webcam and microphone. Participants access the study through a standard browser link. No software installation is needed on the participant side. The platform uses the browser’s native media APIs, with servers in both Germany and the United States for GDPR-compliant data handling.

Lighting is the most common quality issue in webcam-based FEA. Participants should be in a well-lit environment with light coming from in front of them (not behind). iMotions includes calibration steps and quality checks to flag poor tracking conditions before a study begins.
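
Quality screening of the resulting recordings can also be automated downstream. A minimal sketch, assuming a per-frame export with a face-detection flag (the column name and the 90% threshold are assumptions, not an iMotions API):

```python
import pandas as pd

def tracking_ok(frames: pd.DataFrame, min_rate: float = 0.90) -> bool:
    """Keep a recording only if the face was tracked in enough frames."""
    return bool(frames["face_detected"].mean() >= min_rate)

frames = pd.DataFrame({"face_detected": [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]})
print(tracking_ok(frames))  # True: 90% of frames tracked
```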

Integrating Surveys and Biometrics

Self-report data remains a critical complement to biometric measurement. iMotions RDC includes a built-in survey tool supporting scales, video, images, and branching logic, with integrations to third-party survey platforms. Researchers can embed survey questions before, during, and after interview segments, allowing direct comparison between what participants say they felt (explicit self-report) and what their face and voice showed implicitly.

This triangulation — explicit self-report alongside implicit biometric signal — is the gold standard in affective research. Neither channel alone is definitive. Self-report is subject to rationalization and social desirability bias; biometric signals require careful contextualization. Used together, they produce a far richer picture of the participant’s actual emotional experience.
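
A simple form of this triangulation is to correlate explicit ratings with the implicit signal across participants. A sketch with made-up numbers (the data and the 1-to-7 rating scale are purely illustrative):

```python
from statistics import correlation  # Python 3.10+

# Per-participant explicit ratings vs. mean facial valence (made up).
self_report  = [6, 2, 5, 7, 3, 4]
mean_valence = [0.4, -0.3, 0.2, 0.5, -0.1, 0.0]

r = correlation(self_report, mean_valence)
print(f"explicit-implicit agreement: r = {r:.2f}")
```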

Sample Size Considerations

Biometric research online scales in ways that in-lab research cannot. Because participants access studies from their own devices, iMotions RDC allows recruitment across geographies and time zones simultaneously. For interview research, sample sizes of 20 to 50 participants are typically sufficient for pattern identification, though larger samples improve statistical reliability for between-group comparisons.

The platform supports panel provider integrations, making it possible to recruit targeted demographic samples through standard market research infrastructure while still collecting full biometric data.

5. What the Data Looks Like: Key Metrics and Outputs

Facial Expression Metrics

The iMotions FEA module outputs time-stamped scores for each metric at the frame rate of the webcam (typically 15–30 frames per second). In the iMotions signal viewer, these appear as overlaid waveforms on the study timeline, synchronized with audio, video, and event markers. Key outputs include:

  • Valence score (continuous, negative to positive): the net emotional tone at each frame
  • Engagement score (continuous, 0 to 1): the level of facial expressiveness and involvement
  • Individual AU intensity scores: the raw muscle movement data for advanced analysis
  • Emotion probability scores: likelihood values for each of the seven core emotions
  • Head pose and attention indicators

Researchers can visualize individual participant timelines, aggregate signals across participants to identify emotional peaks and troughs, and use iMotions’ Comparison tab to compare emotional responses to different stimuli or between participant groups.
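
Aggregating across participants to find emotional peaks reduces to grouping the synchronized frames by timestamp. A sketch with hypothetical column names:

```python
import pandas as pd

df = pd.DataFrame({  # frame-level valence for two participants
    "timestamp_ms": [0, 0, 100, 100, 200, 200],
    "participant":  ["p1", "p2"] * 3,
    "valence":      [0.1, 0.2, 0.6, 0.5, -0.2, -0.1],
})

group_mean = df.groupby("timestamp_ms")["valence"].mean()
peak_ms = group_mean.idxmax()
print(f"aggregate valence peak at {peak_ms} ms ({group_mean.max():.2f})")
# aggregate valence peak at 100 ms (0.55)
```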

Voice Analysis Metrics

The Voice Analysis module outputs dimensional and categorical emotion data from the audio signal. Key outputs include:

  • Valence (continuous): the positive or negative tone of the speaker’s voice
  • Arousal (continuous): the energy or activation level of the voice
  • Dominance (continuous): the perceived confidence or control in the voice
  • Categorical emotion label: angry, happy, sad, or neutral classification
  • Prosodic features: pitch, loudness, speaking rate, and intonation richness

The iMotions platform also includes a Speech-to-Text module that transcribes interview audio and allows researchers to identify emotionally significant words and phrases. This means that a peak in vocal arousal can be pinned to the exact words a participant was saying at that moment — enabling a level of qualitative-quantitative integration that was previously impossible in remote research settings.
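
A sketch of that arousal-to-words lookup, assuming a word-level transcript with start and end times (the formats below are illustrative, not the Speech-to-Text module’s schema):

```python
import pandas as pd

arousal = pd.DataFrame({  # windowed vocal arousal (hypothetical)
    "timestamp_ms": [0, 1000, 2000, 3000],
    "arousal":      [0.3, 0.4, 0.9, 0.5],
})
transcript = pd.DataFrame({  # word-level transcript (hypothetical)
    "word":     ["the", "checkout", "kept", "failing"],
    "start_ms": [0, 800, 1900, 2400],
    "end_ms":   [400, 1800, 2300, 3100],
})

peak_ms = arousal.loc[arousal["arousal"].idxmax(), "timestamp_ms"]
spoken = transcript[(transcript["start_ms"] <= peak_ms) &
                    (transcript["end_ms"] >= peak_ms)]
print(spoken["word"].tolist())  # ['kept'] - the word at the arousal peak
```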

Cross-Modal Analysis

With facial and voice data time-synchronized in the iMotions platform, researchers can compute moment-by-moment concordance between the two channels. Common analysis questions include: When do facial and vocal valence diverge? Are there moments where high facial engagement coincides with low vocal arousal, suggesting intellectual rather than emotional processing? Do participants show consistent emotional responses across modalities, or are there systematic discordances that suggest impression management?
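
One simple concordance measure is a rolling correlation between the two valence streams on the shared timeline. A sketch with illustrative data and column names:

```python
import pandas as pd

df = pd.DataFrame({  # synchronized valence streams (illustrative)
    "face_valence":  [0.1, 0.3, 0.5, 0.4, -0.2, -0.4, -0.3, 0.0],
    "voice_valence": [0.0, 0.2, 0.4, 0.5, -0.1, -0.3, -0.4, 0.1],
})

# Rolling window correlation; sustained low or negative values flag
# moments of face-voice divergence worth reviewing qualitatively.
df["concordance"] = df["face_valence"].rolling(4).corr(df["voice_valence"])
print(df)
```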

6. Practical Applications in Interview Research

Market Research and Concept Testing

For market researchers, online interviews with FEA and voice analysis offer a way to validate or challenge what respondents say about concepts, products, or campaigns. A respondent who describes a product concept as ‘interesting’ but displays sustained neutral-to-negative facial valence and low engagement throughout the discussion may be politely disengaged rather than genuinely interested. This distinction can change the direction of a product development decision.

audEERING’s market research documentation notes that specific expression dimensions, including disinterest, irritation, excitement, and relaxation, can be derived from valence and arousal scores, providing richer market research parameters than categorical survey responses alone.
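
The derivation described there amounts to partitioning the valence-arousal plane into quadrants. A minimal sketch, with cut-offs chosen purely for illustration:

```python
def expression_label(valence: float, arousal: float) -> str:
    """Map circumplex coordinates to quadrant labels (illustrative)."""
    if valence >= 0.0:
        return "excitement" if arousal >= 0.5 else "relaxation"
    return "irritation" if arousal >= 0.5 else "disinterest"

print(expression_label(0.4, 0.8))   # excitement
print(expression_label(-0.3, 0.2))  # disinterest
```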

Communications and Message Testing

In communications research, the alignment between the intended emotional impact of a message and the actual emotional response of the audience is the key question. iMotions’ Communications Research Lab combines FEA and voice analysis to measure audience responses to messages, speeches, and campaigns. Researchers can identify which moments in a piece of communication drive positive valence and engagement, and which cause disengagement or negative affect, at a moment-by-moment level unavailable from any survey instrument.

UX Research and Think-Aloud Studies

Think-aloud protocols, where users verbalize their thoughts while interacting with a product, are a standard UX research method. Voice analysis adds a dimension that verbal content alone cannot capture: the emotional coloring of what participants say. A user who says ‘this is fine’ in a frustrated, high-arousal voice is communicating something different from a user who says the same words in a calm, positive-valence tone. iMotions’ integration of voice analysis with eye tracking and FEA makes it possible to correlate vocal emotional state with exactly where the user was looking and what they were doing at that moment.

Healthcare and Telehealth Research

Voice analysis has a long history in clinical research, where it has been used to detect voice biomarkers for conditions including depression, Parkinson’s disease, and Alzheimer’s disease. In telehealth interview contexts, the ability to passively monitor vocal characteristics during patient-clinician interactions offers potential for early detection and monitoring. audEERING’s devAIce has been used in healthcare research contexts, and iMotions’ platform provides the study design and data management infrastructure to support IRB-compliant clinical research.

7. Ethical Considerations and Data Governance

Facial expression data and voice recordings are biometric data and are subject to data protection regulation in most jurisdictions. In Europe, both are covered by the GDPR. In the United States, state-level biometric privacy laws (including Illinois BIPA) apply in many contexts. Research use of iMotions FEA and voice analysis requires explicit informed consent from participants covering the collection, storage, and analysis of both facial video and audio data.

Key ethical requirements for conducting biometric interview research include:

  • IRB or ethics board approval for academic and clinical research
  • Explicit, informed participant consent covering the specific biometric signals collected
  • Clear data retention and deletion policies communicated to participants
  • Data anonymization where possible and required
  • Transparency about the use of AI analysis tools and their limitations

iMotions’ RDC platform addresses data sovereignty concerns directly: audio and video data is processed locally on the researcher’s hardware. audEERING’s devAIce likewise processes audio locally by default in the iMotions integration, meaning no biometric data is transmitted to third-party servers during analysis. For European researchers, iMotions maintains server infrastructure in Germany in addition to the United States.

It is also important to communicate to participants and stakeholders that automated emotional measurement is probabilistic, not deterministic. FEA and voice analysis systems measure observable signals (facial muscle movements, acoustic features) and infer emotional states from them. These inferences are supported by robust scientific evidence but are not infallible. They should be interpreted alongside self-report data and qualitative interview findings, not as replacements for them.

Conclusion: Beyond Words

The future of qualitative research is not purely qualitative. As AI-powered biometric tools become more accessible and deployable at scale online, the most rigorous interview research will routinely combine the depth of human conversation with the objectivity of continuous emotional measurement.

The iMotions + audEERING stack represents the current state of the art for doing this in online interview settings. iMotions’ Remote Data Collection platform provides the study design, data collection, synchronization, and analysis infrastructure. audEERING’s devAIce technology provides the voice-based emotional intelligence layer. Together, they give researchers something that was previously available only in fully equipped labs: a real-time, millisecond-level picture of what participants feel, not just what they say.

The tools exist. The validation is there. What remains is the methodological shift: a willingness to treat emotional engagement and valence as measurable research variables rather than impressionistic moderator judgments. For researchers willing to make that shift, online interviews will never look quite the same.

Key Takeaways

  • Emotional valence (positive/negative tone) and engagement (intensity of involvement) are measurable in real time from face and voice during online interviews.
  • iMotions’ Remote Data Collection platform captures both signals using only the participant’s webcam and microphone — no special hardware or lab required.
  • Facial expression analysis is powered by Affectiva’s AFFDEX engine, grounded in the FACS framework, and outputs valence, engagement, 7 core emotions, and up to 20 Action Units per frame.
  • Voice analysis is powered by audEERING’s devAIce technology, outputting valence, arousal, and dominance on continuous scales alongside categorical emotion classification.
  • Both signals are synchronized at the millisecond level in iMotions, enabling cross-modal analysis and time-locking of emotional responses to specific stimuli, questions, or moments.
  • Biometric data should always be triangulated with self-report survey data — neither channel alone is definitive.
  • Facial and voice data are biometric data; informed consent, ethics board approval, and GDPR compliance are required.
  • Study design standardization — consistent stimuli, consistent question sequences — is critical for interpretable and comparable biometric interview data.

Resources

Free 22-page Voice Analysis Guide

For Beginners and Intermediates

  • Get a thorough understanding of the essentials
  • Valuable Voice Analysis research insight
  • Learn how to take your research to the next level

Free 42-page Facial Expression Analysis Guide

For Beginners and Intermediates

  • Get a thorough understanding of all aspects
  • Valuable facial expression analysis insights
  • Learn how to take your research to the next level

Download iMotions Software Brochure

iMotions is the world’s leading biosensor platform. Learn about how iMotions software can help you with your human behavior research.

