Abstract: In this paper, we propose a novel multimodal fusion framework, the locally confined modality fusion network (LMFN), which incorporates a bidirectional multiconnected LSTM (BM-LSTM) to address multimodal human affective computing. Instead of conducting fusion only at a holistic level, we propose a hierarchical fusion strategy that considers both local and global interactions to obtain a comprehensive interpretation of multimodal information. Specifically, we partition the feature vector of each modality into multiple segments and learn each local interaction through a tensor fusion procedure. Global interaction is then modeled by learning the interconnections among local interactions via the newly designed BM-LSTM architecture, which establishes direct connections between the cells and states of local interactions that are several time steps apart. LMFN offers advantages over other methods in the following respects: 1) local interactions are modeled through a feasible vector-segmentation procedure that explores cross-modal dynamics in a more specialized manner; 2) global interactions are modeled with BM-LSTM to obtain an integral view of the multimodal information, which guarantees a sufficient exchange of information; and 3) the general fusion strategy is highly extensible, as other local and global fusion methods can be substituted. Experiments show that LMFN yields state-of-the-art results. Moreover, LMFN is more efficient than other models that apply the outer product as the fusion method.
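The local-fusion step described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`segment`, `local_tensor_fusion`, `lmfn_local_interactions`), the choice of two modalities, the segment count, and the augmentation of each segment with a constant 1 (a common device in tensor-fusion-style methods to retain unimodal terms) are all illustrative assumptions. It shows only the idea of partitioning each modality's feature vector into segments and taking a small outer product per segment pair, producing a sequence of local interaction vectors that a global model such as the paper's BM-LSTM would then consume step by step.

```python
import numpy as np

def segment(v, n_seg):
    # Split a modality feature vector into n_seg equal-length local segments.
    return np.split(v, n_seg)

def local_tensor_fusion(seg_a, seg_b):
    # Model one local cross-modal interaction as the outer product of two
    # segments, each augmented with a constant 1 (an assumption here, borrowed
    # from tensor-fusion-style methods, so unimodal terms are retained).
    a = np.concatenate(([1.0], seg_a))
    b = np.concatenate(([1.0], seg_b))
    return np.outer(a, b).ravel()

def lmfn_local_interactions(feat_a, feat_b, n_seg):
    # Produce the sequence of local interaction vectors; a global model
    # (e.g. a BM-LSTM) would process these as consecutive time steps.
    return [local_tensor_fusion(sa, sb)
            for sa, sb in zip(segment(feat_a, n_seg), segment(feat_b, n_seg))]

# Example: two 8-dim modality features split into 4 segments of length 2.
feat_a = np.arange(8, dtype=float)
feat_b = np.ones(8)
local_vecs = lmfn_local_interactions(feat_a, feat_b, 4)
print(len(local_vecs), local_vecs[0].shape)  # 4 local interactions, each 9-dim
```

The sketch also hints at the efficiency claim: fusing n segments of length d/n costs n·(d/n+1)² outer-product entries, versus (d+1)² for a single holistic outer product over the full vectors.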