Multimodal Language Analysis with Recurrent Multistage Fusion

Paul Pu Liang

Ziyin Liu

Amir Zadeh

Louis-Philippe Morency

Abstract: Computational modeling of human multimodal language is an emerging research area in natural language processing spanning the language, visual, and acoustic modalities. Comprehending multimodal language requires modeling not only the interactions within each modality (intra-modal interactions) but, more importantly, the interactions between modalities (cross-modal interactions). In this paper, we propose the Recurrent Multistage Fusion Network (RMFN), which decomposes the fusion problem into multiple stages, each focused on a subset of multimodal signals for specialized, effective fusion. Cross-modal interactions are modeled using this multistage fusion approach, which builds upon the intermediate representations of previous stages. Temporal and intra-modal interactions are modeled by integrating our proposed fusion approach with a system of recurrent neural networks. The RMFN displays state-of-the-art performance in modeling human multimodal language across three public datasets relating to multimodal sentiment analysis, emotion recognition, and speaker traits recognition. We provide visualizations to show that each stage of fusion focuses on a different subset of multimodal signals, learning increasingly discriminative multimodal representations.
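The multistage idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name multistage_fusion, the softmax attention used to pick out a subset of signals, the tanh fusion step, the dimensions, and the randomly initialized weights are all assumptions made for illustration. It covers a single time step only and leaves out the recurrent networks that the abstract says handle temporal and intra-modal interactions.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def multistage_fusion(language, visual, acoustic,
                      num_stages=3, fused_dim=4, seed=0):
    """Fuse three modality vectors over several stages (illustrative only).

    Each stage attends to the concatenated signals, conditioned on the
    previous stage's fused representation, then updates that representation.
    """
    rng = random.Random(seed)
    signals = language + visual + acoustic  # concatenated multimodal signals
    n = len(signals)
    fused = [0.0] * fused_dim  # intermediate representation, starts empty
    attention_per_stage = []

    for _ in range(num_stages):
        # Assumption: each stage has its own weights (random here, not trained).
        w_attend = random_matrix(n, n + fused_dim, rng)
        w_fuse = random_matrix(fused_dim, n + fused_dim, rng)

        # Weight the signals so this stage emphasizes a subset of them.
        attention = softmax(linear(w_attend, signals + fused))
        highlighted = [a * s for a, s in zip(attention, signals)]

        # Build on the previous stage's intermediate representation.
        fused = [math.tanh(x) for x in linear(w_fuse, highlighted + fused)]
        attention_per_stage.append(attention)

    return fused, attention_per_stage


# Toy inputs: 3 language, 2 visual, and 2 acoustic features (made-up values).
fused, attention = multistage_fusion([0.2, -0.1, 0.4], [0.7, 0.3], [-0.5, 0.1])
print(len(fused), len(attention))
```

Returning the per-stage attention weights alongside the fused vector makes it possible to inspect which signals each stage emphasized, in the spirit of the visualizations the abstract mentions. With untrained random weights, the values themselves carry no meaning.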

This publication uses Facial Expression Analysis, which is fully integrated into iMotions Lab.
