Gaze attention estimation aims to determine where each person in a scene is looking. In this study, we introduce a new annotated dataset derived from medical simulation training videos, capturing diverse and authentic clinical scenarios in a practical medical environment. The annotations are grounded in eye-tracking data (iMotions [1]) recorded from devices worn by medical professionals during the procedures shown in the scenes. Most existing gaze prediction approaches rely on object detection to guide the model. This becomes problematic in medical environments, where specialised tools and equipment are typically absent from standard large-scale object detection and segmentation datasets. To address this problem, we propose a gaze prediction framework that integrates head pose information, consisting of pitch, yaw, and roll, enabling the model to rely on gaze direction even when objects are not detected. The framework builds on the self-attention mechanism of vision transformers, which we expect to strengthen the model's ability to relate gaze direction to the surrounding scene. We hope this offers a more reliable framework for real-world medical applications.
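
To illustrate the kind of fusion described above, the following is a minimal PyTorch sketch, not the paper's actual implementation: it assumes a hypothetical `HeadPoseGazeNet` module in which the head pose vector (pitch, yaw, roll) is projected into an extra token that attends to scene patch tokens through transformer self-attention, and a 2D gaze target is regressed from that token. All module names, dimensions, and the output parameterisation are illustrative assumptions.

```python
import torch
import torch.nn as nn


class HeadPoseGazeNet(nn.Module):
    """Hypothetical sketch: fuse head pose (pitch, yaw, roll) with scene
    patch tokens via transformer self-attention, then regress a 2D gaze point."""

    def __init__(self, img_size=224, patch_size=16, dim=256, depth=4, heads=8):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Patch embedding: split the scene image into non-overlapping patches.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Head pose (pitch, yaw, roll) projected into the same token space.
        self.pose_embed = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Regress a normalised (x, y) gaze target from the pose token.
        self.gaze_head = nn.Linear(dim, 2)

    def forward(self, scene, head_pose):
        # scene: (B, 3, H, W); head_pose: (B, 3) angles (pitch, yaw, roll)
        tokens = self.patch_embed(scene).flatten(2).transpose(1, 2)  # (B, N, dim)
        pose_token = self.pose_embed(head_pose).unsqueeze(1)         # (B, 1, dim)
        x = torch.cat([pose_token, tokens], dim=1) + self.pos_embed
        # Self-attention lets the pose token attend to scene patches,
        # so a gaze estimate remains available even when no object is detected.
        x = self.encoder(x)
        return torch.sigmoid(self.gaze_head(x[:, 0]))                # (B, 2) in [0, 1]


if __name__ == "__main__":
    model = HeadPoseGazeNet()
    scene = torch.randn(2, 3, 224, 224)   # dummy scene frames
    pose = torch.randn(2, 3)              # dummy head pose angles
    print(model(scene, pose).shape)       # torch.Size([2, 2])
```

Under these assumptions, the pose token acts as a query over the scene, so the prediction degrades gracefully when detectable objects are missing, which is the behaviour the proposed framework targets.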
