Semi-supervised Contrastive Regression for Estimation of Eye Gaze
Abstract
With the escalating demand for human-machine interfaces in intelligent systems, the development of gaze-controlled systems has become a necessity. Gaze, being a non-intrusive form of human interaction, is one of the best-suited modalities. Appearance-based deep learning models are the most widely used for gaze estimation, but their performance is strongly influenced by the size of the labeled gaze dataset, which in turn limits generalization.
This paper develops a semi-supervised contrastive learning framework for the estimation of gaze direction. With a small labeled gaze dataset, the framework is able to find a generalized solution even for unseen face images. We propose a new contrastive loss paradigm that maximizes the similarity agreement between similar images and, at the same time, reduces the redundancy in the embedding representations. Our contrastive regression framework shows good performance in comparison with several state-of-the-art contrastive learning techniques used for gaze estimation.
Keywords: Gaze Estimation · Contrastive Regression · Semi-supervised Learning · Dilated Convolution

1 Introduction
Eye gaze is one of the most widely used human modalities for developing human-machine interface systems. Gaze-controlled interfaces have become quite popular for virtual reality (VR) applications [1], navigation control of ground and aerial robots [2], control of robotic arms for surgery [3], and other commercial applications. With the advances in deep learning methods, gaze tracking has become an achievable task.
Appearance-based models are the most prominent approach, as they do not require expensive and subject-dependent eye modeling. They mainly rely on annotated face images captured using a camera or eye-tracker and learn a model that maps the images to gaze direction. Most of the earliest methods relied on dedicated feature extraction [4][5] and feature selection steps before performing the regression task [6], which makes the models person-specific and prevents generalization; dedicated feature extraction also makes the computation time-consuming. With the advancements in deep learning [7], convolutional neural network (CNN) based frameworks [8][9] have become the most used architectures for gaze estimation. Mostly ResNet-based CNN frameworks [10] have been developed that map gaze angles from eye images. Dual-branch CNNs [11] are another adopted approach for gaze estimation, reconstructing images with the gaze direction as supervision. Sequential models have also been used to learn the variation in gaze direction across subsequent frames by applying LSTM models [12] to the residual features obtained using a CNN. Capsule networks are another promising approach for appearance-based models [13][14], as they emphasize learning equivariance instead of extracting deeper features.
Generalization of appearance-based models depends heavily on the volume of labeled data, and annotation of gaze direction in eye images is a difficult and time-consuming task. Even with the large gaze datasets available in the public domain, there are still issues in developing a domain-adaptive model due to varied background environments and illumination conditions. To deal with these problems, semi-supervised learning (SSL) is one of the most promising solutions.
SSL performs pre-training [15] on unlabeled data to learn an effective representation of the input, and then fine-tunes the solution using a small labeled dataset. Contrastive learning, an SSL technique, has been used predominantly in different computer vision applications in the past few years. Contrastive learning based methods learn a semantic embedding representation [16] of the input image by pulling similar images together and pushing disparate images apart [17]. This allows the models to learn a suitable encoding of the images [18] using only unlabeled data. Consequently, a limited set of gaze annotations is used to learn a model that performs the final task of classification, segmentation, or prediction. Wang et al. [19] first introduced the use of a contrastive learning method for unsupervised regression learning of eye gaze direction.
In this paper, we present a semi-supervised regression technique to predict the eye gaze direction, with two major contributions:
• We have developed a contrastive learning framework that learns an encoder architecture to compute embeddings and further uses the pre-trained embeddings to predict gaze directions.
• We have proposed a new form of contrastive loss to maximize the agreement between similar images, taking care of both invariance and redundancy factors simultaneously.
2 Methodology
The designed framework is based on the SimCLR framework [16], which performs the task in two stages. The first stage aims to learn a representation from the given images by maximizing the agreement between the vector representations learned from augmented views of the same image. The final stage uses the embeddings learnt in the pre-training stage and trains a model that performs the prediction task by minimizing a regression loss.
Mini-batches of batch size $N$ are sampled from the dataset $\mathcal{D} = \{(x_i, y_i)\}$, where $x_i$ represents the images and $y_i$ the gaze direction labels for those images. Random data augmentation is performed to create two varied representations $\tilde{x}_i$ and $\tilde{x}_j$ of the images in the mini-batch. We have designed an encoder module $f(\cdot)$ that learns latent space representations $h_i = f(\tilde{x}_i)$ and $h_j = f(\tilde{x}_j)$ from the augmented images. These latent embeddings capture high-level features of the images. They are then passed to a projection head $g(\cdot)$, designed as a multi-layer perceptron (MLP), to learn a non-linear projection vector for less complex processing. The learned embeddings are $z_i = g(h_i)$ and $z_j = g(h_j)$, which are then compared to minimize the contrastive loss for similar image pairs.
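As a hedged sketch of this pipeline in PyTorch (the layer dimensions and the helper names `ProjectionHead` and `pretrain_step` are illustrative assumptions, not the paper's exact code):

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """MLP projection head g(.) mapping latent features h to embeddings z."""
    def __init__(self, in_dim=512, hidden_dim=256, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        return self.net(h)

def pretrain_step(encoder, proj_head, x_i, x_j, loss_fn):
    """One pre-training step: two augmented views of the same mini-batch
    pass through the shared encoder f(.) and projection head g(.)."""
    h_i, h_j = encoder(x_i), encoder(x_j)        # latent representations
    z_i, z_j = proj_head(h_i), proj_head(h_j)    # projected embeddings
    return loss_fn(z_i, z_j)                     # contrastive loss (Eq. 1)
```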
The designed encoder can learn local as well as global dependencies in the feature maps. The architecture of the encoder is shown in Fig. 1(a). In order to learn global spatial dependency along with local spatial details, a larger kernel size would need to be considered; but a larger kernel increases the computational complexity of the model and can also lead to overfitting. Thus, we have used dilated convolutions with different dilation rates, so that local spatial dependency is learnt with each convolution while larger receptive fields are attained without the computational cost of larger kernels. A feature map of high spatial resolution is obtained by concatenating the coarse-to-fine feature maps so learned. Convolution filters for different dilation rates are shown in Fig. 1(b), where each colour indicates the kernel corresponding to each dilation rate in the concatenated feature map.
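A minimal sketch of such a dilated-convolution encoder, assuming a PyTorch implementation; the channel widths, depth, and dilation rates (1, 2, 4) are illustrative guesses rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DilatedConvBlock(nn.Module):
    """Parallel 3x3 convolutions with increasing dilation rates; their
    outputs are concatenated to fuse coarse-to-fine spatial context."""
    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3,
                      padding=d, dilation=d)  # padding=d preserves spatial size
            for d in dilations
        ])
        self.bn = nn.BatchNorm2d(out_ch * len(dilations))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Each branch sees a different receptive field at the same
        # computational cost as an ordinary 3x3 convolution.
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(self.bn(out))

class DilatedEncoder(nn.Module):
    """Stacks dilated blocks and ends with global average pooling, which
    keeps the parameter count low (cf. the flatten variant in Sec. 3.2)."""
    def __init__(self, in_ch=3, width=32):
        super().__init__()
        self.block1 = DilatedConvBlock(in_ch, width)
        self.block2 = DilatedConvBlock(width * 3, width * 2)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        h = self.block2(self.block1(x))
        return self.pool(h).flatten(1)  # latent representation h
```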
[Fig. 1: (a) Architecture of the designed encoder; (b) convolution kernels for different dilation rates in the concatenated feature map.]
An updated version of the normalized temperature-scaled cross-entropy (NT-Xent) loss [16] has been used to define our contrastive loss. The aim is to optimize invariance to augmentation and, at the same time, reduce redundancy in the feature maps. The NT-Xent loss maximizes the similarity between two positive images and minimizes the similarity between a positive and a negative image. We have used the cross-correlation matrix introduced in Barlow Twins [18] to analyze the correlation between the two sets of projections. The cross-correlation between the projection matrices $Z^A$ and $Z^B$ (stacked over the mini-batch) is computed as $\mathcal{C} = (Z^A)^{\top} Z^B$, normalized along the batch dimension, where each element $\mathcal{C}_{mn}$ represents the similarity between the $m$-th and $n$-th components of the embedding vectors. The contrastive loss defined in this paper is given in Equation (1), where the first term is the NT-Xent loss enforcing invariance, whereas the second term reduces redundant information in the output vector representation. The term $\lambda$ is a loss coefficient factor weighting the redundancy term and has been set to 0.1. The term $\mathrm{sim}(z_i, z_j)$ denotes the cosine similarity between $z_i$ and $z_j$ and is defined by Equation (2).
$\mathcal{L} = -\log \dfrac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbf{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)} + \lambda \sum_{m} \sum_{n \neq m} \mathcal{C}_{mn}^{2}$  (1)

$\mathrm{sim}(z_i, z_j) = \dfrac{z_i^{\top} z_j}{\lVert z_i \rVert \, \lVert z_j \rVert}$  (2)
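Under these definitions, the combined loss can be sketched as follows; λ = 0.1 comes from the text, while the temperature τ = 0.5 and the batch-normalization details of the cross-correlation are assumptions:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z_a, z_b, tau=0.5, lam=0.1):
    """Equation (1) as a sketch: NT-Xent (invariance) plus a
    Barlow-Twins-style redundancy term weighted by lam.
    z_a, z_b: (N, D) projections of the two augmented views."""
    N = z_a.size(0)

    # NT-Xent over the 2N augmented samples (Eq. 2 via normalized dot products)
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)       # (2N, D)
    sim = z @ z.t() / tau                                      # cosine similarities
    mask = torch.eye(2 * N, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))                 # drop self-pairs
    idx = torch.arange(2 * N, device=z.device)
    targets = (idx + N) % (2 * N)                              # positive pair index
    nt_xent = F.cross_entropy(sim, targets)

    # Redundancy term: squared off-diagonal entries of the
    # batch-normalized cross-correlation matrix C.
    z_a_n = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b_n = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a_n.t() @ z_b_n) / N                                # (D, D)
    off_diag = c - torch.diag(torch.diagonal(c))
    redundancy = (off_diag ** 2).sum()

    return nt_xent + lam * redundancy
```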
The framework's pre-training stage and fine-tuning stage for regression are shown in Fig. 2(a) and Fig. 2(b), respectively. The fine-tuning stage uses the pre-trained encoder model to determine the latent space representations $h$. The latent vectors are used as input to the regression network to predict the gaze directions $\hat{y}$. The regression fine-tuning network is optimized by minimizing the Huber loss, a step-wise amalgamation of the mean squared error and the mean absolute error designed to tackle outliers, as defined in Equation (3), where $e = y - \hat{y}$ is the deviation error. The loss parameter $\delta$, which provides a measure of the spread of the deviation error, makes the loss quadratic for small deviations ($|e| \le \delta$) and linear for large deviations, so that outliers do not dominate the training.
$L_{\delta}(e) = \begin{cases} \frac{1}{2} e^{2}, & |e| \le \delta \\ \delta \left( |e| - \frac{1}{2}\delta \right), & \text{otherwise} \end{cases}$  (3)
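A sketch of this fine-tuning stage; PyTorch's built-in `nn.HuberLoss` implements Equation (3), while the frozen encoder, learning rate, epoch count, and δ = 1.0 are assumptions:

```python
import torch
import torch.nn as nn

def finetune(encoder, reg_head, loader, delta=1.0, lr=1e-4, epochs=10):
    """Fine-tune the regression head on top of the frozen pre-trained
    encoder; delta, lr, and epochs are assumed hyper-parameters."""
    encoder.eval()                          # pre-trained encoder kept fixed here
    criterion = nn.HuberLoss(delta=delta)   # Equation (3)
    optim = torch.optim.Adam(reg_head.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:                 # y: (pitch, yaw) gaze angles
            with torch.no_grad():
                h = encoder(x)              # latent representation h
            loss = criterion(reg_head(h), y)
            optim.zero_grad()
            loss.backward()
            optim.step()
```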
[Fig. 2: (a) Pre-training stage and (b) regression fine-tuning stage of the framework.]
3 Results and Discussion
3.1 Dataset
We have used the ETH-XGaze [20] dataset for the evaluation of our model. ETH-XGaze is a large dataset for gaze estimation that includes high-resolution images with consistent label quality, captured from 110 individuals representing a wide range of ages, genders, and ethnicities. The dataset contains around 1,083,492 images of 6000×4000 resolution, captured with 18 cameras placed at different locations to obtain different views. The dataset covers a large range of head poses and gaze directions, which makes it a good choice for developing a generalized solution. Sample images with different gaze directions are displayed in Fig. 3. Images from the train subset of the dataset have first been split into train and validation sets in an 80:20 ratio. Train set images, without labels, are used to learn the encoder in the pre-training stage, while validation set images with labels are used in the fine-tuning stage. Augmented versions of the face images are obtained by applying the following transformations: horizontal mirror imaging, rescaling, zooming, and varying the brightness, contrast, hue, and saturation. We have performed augmentation of mini-batches of images by randomly applying any three of the above-mentioned operations. Fig. 4 shows a few samples of augmented images generated by weak and strong augmentation.
[Fig. 3: Sample ETH-XGaze images with different gaze directions.]
[Fig. 4: Samples of augmented face images generated by weak and strong augmentation.]
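The augmentation policy can be sketched with torchvision transforms; the operation magnitudes and output size are assumptions, while the random choice of any three operations follows the text:

```python
import random
import torchvision.transforms as T

# Candidate operations from the text; magnitudes are illustrative assumptions.
CANDIDATE_OPS = [
    T.RandomHorizontalFlip(p=1.0),               # horizontal mirror imaging
    T.Resize(192),                               # rescaling
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),  # zooming
    T.ColorJitter(brightness=0.4, contrast=0.4,
                  saturation=0.4, hue=0.1),      # brightness/contrast/hue/saturation
]

def random_augment(img, k=3):
    """Apply k randomly chosen operations to a PIL image, per the
    paper's policy of picking any three of the listed transforms."""
    ops = random.sample(CANDIDATE_OPS, k)
    pipeline = T.Compose(ops + [T.Resize((224, 224)), T.ToTensor()])
    return pipeline(img)
```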
3.2 Evaluation
The performance of the developed architecture has been evaluated by computing the mean angular error. The performance of our model has been compared with different state-of-the-art architectures, and the results are listed in Table 1. Our model generates significantly better results than SimCLR, which has been used for gaze estimation, and the Barlow Twins model. Our encoder computes features efficiently by taking local dependencies into consideration. The encoder, when designed with a flatten layer at the output, provides a slightly better result. We still prefer the model with global average pooling, as displayed in Fig. 1(a), since it reduces the number of parameters in the architecture. As shown in Table 1, the model with the flatten layer estimates gaze with a mean angular error of 2.152 degrees, whereas the model with reduced parameters generates a mean angular error of 3.212 degrees.
Table 1: Mean angular gaze error of the proposed model compared with state-of-the-art methods.

Method | Mean Angular Gaze Error
---|---
SimCLR | 9.175°
Barlow Twins | 9.390°
SimCLR with Deeplab Encoder | 2.246°
Ours | 2.152°
Ours (with reduced parameters) | 3.212°
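For reference, the mean angular error reported in the tables can be computed by converting predicted and ground-truth (pitch, yaw) angles into 3D gaze vectors and averaging the angle between them; this is a standard sketch, not code from the paper:

```python
import numpy as np

def angles_to_vector(pitch, yaw):
    """Convert (pitch, yaw) in radians to 3D unit gaze vectors."""
    return np.stack([-np.cos(pitch) * np.sin(yaw),
                     -np.sin(pitch),
                     -np.cos(pitch) * np.cos(yaw)], axis=-1)

def mean_angular_error(pred, true):
    """pred, true: (N, 2) arrays of (pitch, yaw) in radians; returns degrees."""
    v_p, v_t = angles_to_vector(*pred.T), angles_to_vector(*true.T)
    cos = np.clip((v_p * v_t).sum(-1), -1.0, 1.0)  # guard against rounding
    return np.degrees(np.arccos(cos)).mean()
```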
We have also evaluated the performance of models trained with other state-of-the-art loss functions used in contrastive training for the gaze estimation problem. The simultaneous optimization of invariance and redundancy gives our loss the upper hand in performance, as shown in Table 2.
Table 2: Mean angular gaze error with different contrastive loss functions.

Method | Mean Angular Gaze Error
---|---
NT-Xent loss | 4.606°
Barlow Twins loss | 4.3361°
Our contrastive loss | 3.212°
3.3 Ablation Study
In order to understand the role of the second term in the loss function defined in Equation (1), which aims to minimize redundancy, we have used a loss coefficient factor $\lambda$. The performance of our model is evaluated by varying the value of $\lambda$. In Table 3, we can see that the performance improves as we increase the loss coefficient up to a certain value. The coefficient has been set to 0.1 for the evaluation of our model.
Table 3: Effect of the loss coefficient $\lambda$ on mean angular gaze error.

Loss coefficient $\lambda$ | Mean Angular Gaze Error
---|---
… | 4.755°
… | 4.549°
0.1 | 3.212°
4 Conclusion
This work has explored semi-supervised learning for eye gaze prediction by developing a contrastive learning framework. It has proposed a new form of contrastive loss that optimizes the similarity agreement between augmented and captured face images while considering two factors, viz., transformation invariance and redundancy among embeddings, to predict gaze direction. The proposed model has been tested on the ETH-XGaze dataset, with the evaluation done in terms of mean angular error. The model has outperformed different state-of-the-art techniques such as SimCLR and Barlow Twins, as shown in the previous sections. In future work, we will experiment across different datasets to find a more generalized solution for cross-dataset scenarios.
4.0.1 Acknowledgements
The authors would like to acknowledge CSIR-Central Electronics Engineering Research Institute (CSIR-CEERI) for providing facilities and the CSIR-AITS mission for providing funding to conduct this research work.
References
- [1] Konrad, R., Angelopoulos, A., Wetzstein, G.: Gaze-contingent ocular parallax rendering for virtual reality. ACM Transactions on Graphics (TOG) 39(2) (2020) 1–12
- [2] Gerber, M.A., Schroeter, R., Xiaomeng, L., Elhenawy, M.: Self-interruptions of non-driving related tasks in automated vehicles: Mobile vs head-up display. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. (2020) 1–9
- [3] Ferrier-Barbut, E., Gauthier, P., Luengo, V., Canlorbe, G., Vitrani, M.A.: Measuring the quality of learning in a human–robot collaboration: A study of laparoscopic surgery. ACM Transactions on Human-Robot Interaction (THRI) 11(3) (2022) 1–20
- [4] Rattarom, S., Uttama, S., Aunsri, N.: Model construction and validation in low-cost interpolation-based gaze tracking system. Engineering Letters 27(1) (2019)
- [5] Yilmaz, C.M., Kose, C.: Local binary pattern histogram features for on-screen eye-gaze direction estimation and a comparison of appearance based methods. In: 2016 39th International Conference on Telecommunications and Signal Processing (TSP), IEEE (2016) 693–696
- [6] Aunsri, N., Rattarom, S.: Novel eye-based features for head pose-free gaze estimation with web camera: New model and low-cost device. Ain Shams Engineering Journal 13(5) (2022) 101731
- [7] Pathirana, P., Senarath, S., Meedeniya, D., Jayarathna, S.: Eye gaze estimation: A survey on deep learning-based approaches. Expert Systems with Applications 199 (2022) 116894
- [8] Cheng, Y., Wang, H., Bao, Y., Lu, F.: Appearance-based gaze estimation with deep learning: A review and benchmark. arXiv preprint arXiv:2104.12668 (2021)
- [9] Lemley, J., Kar, A., Drimbarean, A., Corcoran, P.: Convolutional neural network implementation for eye-gaze estimation on low-quality consumer imaging systems. IEEE Transactions on Consumer Electronics 65(2) (2019) 179–187
- [10] Kanade, P., David, F., Kanade, S.: Convolutional neural networks (cnn) based eye-gaze tracking system using machine learning algorithm. European Journal of Electrical Engineering and Computer Science 5(2) (2021) 36–40
- [11] Zhu, Z., Zhang, D., Chi, C., Li, M., Lee, D.J.: A complementary dual-branch network for appearance-based gaze estimation from low-resolution facial image. IEEE Transactions on Cognitive and Developmental Systems (2022)
- [12] Chong, E., Wang, Y., Ruiz, N., Rehg, J.M.: Detecting attended visual targets in video. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. (2020) 5396–5406
- [13] Bernard, V., Wannous, H., Vandeborre, J.P.: Eye-gaze estimation using a deep capsule-based regression network. In: 2021 International Conference on Content-Based Multimedia Indexing (CBMI), IEEE (2021) 1–6
- [14] Mahanama, B., Jayawardana, Y., Jayarathna, S.: Gaze-net: Appearance-based gaze estimation using capsule networks. In: Proceedings of the 11th augmented human international conference. (2020) 1–4
- [15] Crawford, E., Pineau, J.: Spatially invariant unsupervised object detection with convolutional neural networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. Volume 33. (2019) 3412–3420
- [16] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning, PMLR (2020) 1597–1607
- [17] Grill, J.B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al.: Bootstrap your own latent: A new approach to self-supervised learning. Advances in neural information processing systems 33 (2020) 21271–21284
- [18] Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: Self-supervised learning via redundancy reduction. In: International Conference on Machine Learning, PMLR (2021) 12310–12320
- [19] Wang, Y., Jiang, Y., Li, J., Ni, B., Dai, W., Li, C., Xiong, H., Li, T.: Contrastive regression for domain adaptation on gaze estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. (2022) 19376–19385
- [20] Zhang, X., Park, S., Beeler, T., Bradley, D., Tang, S., Hilliges, O.: ETH-XGaze: A large scale dataset for gaze estimation under extreme head pose and gaze variation. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, Springer (2020) 365–381