MTGAT: Multimodal Temporal Graph Attention Networks for Unaligned Human Multimodal Language Sequences

Jianing Yang, Yongxin Wang, Ruitao Yi, Yuying Zhu, Azaan Rehman, Amir Zadeh, Soujanya Poria, Louis-Philippe Morency

Human communication is multimodal in nature; it is through multiple modalities, i.e., language, voice, and facial expressions, that opinions and emotions are expressed. Data in this domain exhibits complex multi-relational and temporal interactions. Learning from this data is a fundamentally challenging research problem. In this paper, we propose Multimodal Temporal Graph Attention Networks (MTGAT). MTGAT is an interpretable graph-based neural model that provides a suitable framework for analyzing this type of multimodal sequential data. We first introduce a procedure to convert unaligned multimodal sequence data into a graph with heterogeneous nodes and edges that captures the rich interactions between different modalities through time. Then, a novel graph operation, called Multimodal Temporal Graph Attention, along with a dynamic pruning and read-out technique is designed to efficiently process this multimodal temporal graph.

By learning to focus only on the important interactions within the graph, our MTGAT is able to achieve state-of-the-art performance on multimodal sentiment analysis and emotion recognition benchmarks including IEMOCAP and CMU-MOSI, while utilizing significantly fewer computations.
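To make the graph construction described above concrete, the following is a minimal illustrative sketch, not the authors' implementation: one node is created per timestep of each modality, and directed edges are typed by the (source modality, target modality) pair and by the temporal relation of their endpoints. The helper name build_multimodal_graph, the edge-type naming, and the use of sequence indices as a stand-in for timestamps are our own assumptions.

```python
# Illustrative sketch (not the authors' code): assemble a heterogeneous graph from
# unaligned multimodal sequences. Each node is one timestep of one modality; each
# directed edge is typed by modality pair and temporal order (past/present/future).
from itertools import product

def build_multimodal_graph(sequences):
    """sequences: dict such as {"text": [...], "audio": [...], "video": [...]},
    where sequence indices stand in for real timestamps in this toy version."""
    nodes = []                      # (modality, timestep index)
    for modality, seq in sequences.items():
        nodes += [(modality, i) for i in range(len(seq))]

    edges = []                      # (src_node, dst_node, edge_type)
    for (m_src, i), (m_dst, j) in product(nodes, nodes):
        if (m_src, i) == (m_dst, j):
            continue
        temporal = "present" if i == j else ("past" if i < j else "future")
        edges.append(((m_src, i), (m_dst, j), f"{m_src}->{m_dst}:{temporal}"))
    return nodes, edges

# Toy example with three unaligned modalities of different lengths.
nodes, edges = build_multimodal_graph({"text": [0] * 3, "audio": [0] * 5, "video": [0] * 4})
print(len(nodes), "nodes,", len(edges), "typed edges")
```

In this fully connected construction the edge count grows quadratically with the total number of timesteps, which is why the pruning described above matters for efficiency.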

We provide more details regarding the word-aligned and unaligned data sequences. In the word-aligned version, video and audio features are average-pooled based on word boundaries (extracted using P2FA (Yuan and Liberman 2008)), resulting in an equal sequence length of 50 for all three modalities; a sketch of this pooling is given below. In the unaligned version, the original audio and video features are used, resulting in variable sequence lengths. For both datasets, the multimodal features are extracted from the textual (GloVe word embeddings (Pennington, Socher, and Manning 2014)), visual (Facet (iMotions 2017)), and acoustic (COVAREP (Degottex et al. 2014)) data modalities.
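The sketch below illustrates the word-boundary average pooling used in the word-aligned setting. It is a simplified, hedged example rather than the benchmark toolkit's actual preprocessing code; the function name word_align, the (start, end) word-span format, and the zero-padding choice for empty spans are our own assumptions.

```python
# Illustrative sketch: average-pool frame-level audio/visual features over word
# boundaries so every modality yields one feature vector per word.
import numpy as np

def word_align(frame_feats, frame_times, word_spans):
    """frame_feats: (num_frames, dim) array of audio or visual features.
    frame_times:  (num_frames,) array of frame timestamps in seconds.
    word_spans:   list of (start, end) times per word (e.g., from a forced aligner)."""
    pooled = []
    for start, end in word_spans:
        mask = (frame_times >= start) & (frame_times < end)
        if mask.any():
            pooled.append(frame_feats[mask].mean(axis=0))
        else:                         # no frame falls inside this word: pad with zeros
            pooled.append(np.zeros(frame_feats.shape[1]))
    return np.stack(pooled)           # (num_words, dim)

# Toy example: 20 visual frames at 10 fps, three word spans.
feats = np.random.randn(20, 35)
times = np.arange(20) / 10.0
print(word_align(feats, times, [(0.0, 0.5), (0.5, 1.3), (1.3, 2.0)]).shape)  # (3, 35)
```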
