Understanding students’ emotions is important for creating adaptive learning environments. Advanced computer vision models such as HSEmotion and EMONET are used for real-time emotion detection, but their effectiveness in real educational settings remains insufficiently explored. Because they are typically trained on adult datasets collected in controlled environments, these models face challenges from variation in camera angle, lighting, resolution, and skin tone. This study evaluated how image variations (camera angle, lighting, resolution) and demographic factors (skin tone) affect the accuracy and fairness of emotion detection in three different learning environments. Statistical analysis revealed significant effects of these variables on the accuracy of estimated valence and arousal values.