This study uses a multimodal data analysis approach to provide more continuous and objective insight into how students’ engagement unfolds and affects learning achievement. The learning processes of 61 nursing students working with a virtual reality (VR)-based simulation were captured through psycho-physiological data streams from facial expression, eye-tracking, and electrodermal activity (EDA) sensors, as well as through subjective self-reports. Learning achievement was evaluated with a content knowledge test administered before and after the session. Overall, while both the facial expression and self-report modalities revealed that students experienced significantly higher levels of positive than negative emotions, only the facial expression data channel detected fluctuations in engagement across the different phases of the learning session. The findings point to the VR procedural learning phase as a re-engaging learning activity that induces more facial expressions of joy and elicits higher mental effort, as measured by eye-tracking and EDA metrics. Most importantly, a regression analysis demonstrated that the combination of modalities explained 51% of the variance in post-test knowledge scores. Specifically, higher prior knowledge and self-reported enthusiasm, together with fewer angry facial expressions, a lower blink rate, and fewer visual fixations on irrelevant information, were associated with higher achievement. This study demonstrates that a methodology combining multimodal data channels, encompassing both objective and subjective measures, can support a more holistic understanding of engagement in learning and of learning achievement.
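
For illustration only, the sketch below shows what a regression of this kind could look like in Python. It is not the authors’ code or data: the column names, the simulated values, and the use of ordinary least squares via statsmodels are all assumptions made for the example; only the predictor set and the sample size (61 students) are taken from the study description above.

```python
# Minimal sketch (assumed, not the authors' analysis): a multiple linear
# regression of post-test knowledge scores on multimodal predictors,
# using a tidy per-student table with hypothetical column names.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 61  # number of students in the study

# Simulated stand-in data; in the study these values would come from the
# pre-test, self-report questionnaire, facial-expression classifier,
# eye tracker, and EDA sensor pipelines.
df = pd.DataFrame({
    "pretest_score":            rng.uniform(0, 100, n),
    "enthusiasm_selfreport":    rng.uniform(1, 5, n),
    "anger_expression_prop":    rng.uniform(0.0, 0.2, n),
    "blink_rate":               rng.uniform(5, 30, n),     # blinks per minute
    "irrelevant_fixation_prop": rng.uniform(0.0, 0.5, n),
})
# Outcome simulated with the directions of association reported above
# (positive for prior knowledge and enthusiasm, negative for the rest).
df["posttest_score"] = (
    0.4 * df["pretest_score"]
    + 8.0 * df["enthusiasm_selfreport"]
    - 50.0 * df["anger_expression_prop"]
    - 0.5 * df["blink_rate"]
    - 20.0 * df["irrelevant_fixation_prop"]
    + rng.normal(0, 10, n)
)

# Fit an OLS model and inspect coefficients and explained variance (R^2).
X = sm.add_constant(df.drop(columns="posttest_score"))
model = sm.OLS(df["posttest_score"], X).fit()

print(model.summary())                 # coefficient signs and significance
print(f"R^2 = {model.rsquared:.2f}")   # proportion of variance explained
```

In such a model, the R² value plays the role of the “51% of the variance explained” figure reported in the abstract, and the sign of each coefficient corresponds to the direction of association described for each predictor.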