Effect Of Personalized Calibration On Gaze Estimation Using Deep-Learning
Abstract
With the increase in computation power and the development of new state-of-the-art deep learning algorithms, appearance-based gaze estimation is becoming more and more popular. It is known to work well with curated laboratory data sets, but it faces several challenges when deployed in real-world scenarios. One such challenge is estimating the gaze of a person about whom the deep learning model trained for gaze estimation has no knowledge. To analyse performance in such scenarios, we simulate a calibration mechanism. In this work we use the MPIIGaze data set. We train a multimodal convolutional neural network and analyse its performance with and without calibration; this evaluation provides clear insights into how calibration improves the performance of the deep learning model in estimating gaze in the wild.
Keywords: Gaze tracking · Deep Learning · Personal Calibration
1 Introduction
The increase in interaction between humans and computing devices has made gaze tracking a popular research topic. Research in fields such as Human-Computer Interaction and Computer Vision is exploring various applications of appearance-based gaze estimation and tracking (Morimoto and Mimica, 2005).
Machine learning methods can be used to train gaze estimators, provided we have large amounts of reliable, head-pose-independent training data (Sugano et al., 2014; Schneider et al., 2014). Given sufficient user data, these methods can bring appearance-based estimation to a point where no user- or device-specific training is required. Gaze tracking with the monocular cameras built into mobile phones, laptops and interactive displays is cost effective because of their widespread availability. While appearance-based gaze estimation performs well with machine learning methods, new techniques are still being developed, and they are typically evaluated on curated data sets collected under controlled laboratory conditions (Zhang et al., 2015). As a result, such data sets lack variability in eye shape and rely on the assumption that the head pose is accurate, a situation known to cause problems in object recognition (Torralba and Efros, 2011) and object detection (Li et al., 2014).
In our study we use the MPIIGaze data set; the data and annotations are publicly available online (Zhang et al., 2015).
In this article we show how introducing a small amount of the test subject's data into the training set can significantly increase the accuracy of the model. We call this approach personal calibration, analogous to the calibration step gaze estimation software performs when a user starts using it for the first time.
We divide our work into two parts. First, we train a CNN model for appearance-based gaze estimation. The data set we use is one order of magnitude larger than existing data sets and has more variation with respect to illumination and appearance (Zhang et al., 2015). Second, we perform experiments on the model with and without calibration to see the effect on its performance.
2 Dataset and its features
We used the MPIIGaze data set, which contains a total of 213,659 images from 15 participants. The number of images per participant varied from 1,498 to 34,745. The data set contains large variability in illumination and appearance (Zhang et al., 2015). All data were collected on laptops, which lend themselves to daily recordings and are an important platform for eye-tracking applications (Majaranta and Bulling, 2014).
3 Method
Figure 1 provides a high-level view of the gaze estimation task using a multimodal convolutional neural network (CNN). A monocular camera captures an image of the face. Face detection and facial landmark detection methods are then used to locate landmarks in the image (Zhang et al., 2015). A 3D facial shape model is used to estimate the 3D pose of the detected face, and a space normalisation technique is applied to crop and warp the head pose and eye images into the normalised training space (Sugano et al., 2014; Zhang et al., 2015). Finally, a convolutional neural network is trained to learn the mapping from head poses and eye images to 2D gaze points in the camera coordinate system.

3.1 Preprocessing: Face Alignment and 3D Head Pose Estimation
Before the images in the MPIIGaze data set can be used to train the model, they need to be processed. The user's face is detected in the image using Li and Zhang's SURF cascade method (Li and Zhang, 2013; Zhang et al., 2015). Afterwards, Baltrušaitis et al.'s constrained local model framework is used to detect facial landmarks (Baltrušaitis et al., 2014; Zhang et al., 2015). The 3D positions of six facial landmarks (eye and mouth corners, cf. Figure 1) constitute the facial model. The head coordinate system is defined according to the triangle connecting the three midpoints of the eyes and mouth. The EPnP algorithm is used to fit the model by estimating an initial solution and further refining the pose via non-linear optimisation (Lepetit et al., 2009; Zhang et al., 2015). The 3D head rotation r is defined as the rotation from the head coordinate system to the camera coordinate system, and the eye position t is defined as the midpoint of the eye corners for each eye (Zhang et al., 2015).
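As a rough illustration of this step, the sketch below fits a generic six-point 3D facial model to detected 2D landmarks with OpenCV's EPnP solver. The facial-model coordinates, landmark positions and camera intrinsics are placeholder values, and the camera is assumed to be calibrated; this is a minimal sketch, not the exact fitting and refinement procedure of Zhang et al.

```python
import cv2
import numpy as np

# Hypothetical generic 3D facial model: four eye corners and two mouth corners,
# in millimetres, expressed in the head coordinate system (placeholder values).
model_3d = np.array([
    [-45.0, -35.0, 0.0], [-15.0, -35.0, 0.0],   # right eye outer/inner corner
    [ 15.0, -35.0, 0.0], [ 45.0, -35.0, 0.0],   # left eye inner/outer corner
    [-25.0,  35.0, 0.0], [ 25.0,  35.0, 0.0],   # mouth corners
], dtype=np.float64)

# 2D landmark positions returned by the facial landmark detector (placeholders).
landmarks_2d = np.array([
    [210.0, 220.0], [250.0, 222.0], [290.0, 222.0], [330.0, 220.0],
    [235.0, 320.0], [305.0, 320.0],
], dtype=np.float64)

# Intrinsics of the (assumed calibrated) monocular camera; no lens distortion.
camera_matrix = np.array([[960.0, 0.0, 320.0],
                          [0.0, 960.0, 240.0],
                          [0.0,   0.0,   1.0]])
dist_coeffs = np.zeros(5)

# EPnP gives the rotation (head -> camera) and translation of the head model;
# a non-linear refinement step could follow (e.g. cv2.solvePnPRefineLM).
ok, rvec, tvec = cv2.solvePnP(model_3d, landmarks_2d, camera_matrix,
                              dist_coeffs, flags=cv2.SOLVEPNP_EPNP)
rotation_matrix, _ = cv2.Rodrigues(rvec)  # 3D head rotation r as a matrix
```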

3.2 Normalizing the Preprocessed Data
The normalisation is done by two camera-matrix transformation operations: scaling and rotation of the camera (Zhang et al., 2015). The x axis of the camera coordinate space is made parallel to that of the head coordinate space by pointing the camera at one of the facial landmarks, the midpoint of the eye corners. After that, the eye images are cropped at a fixed resolution W × H with a fixed focal length f in the normalised camera space, yielding a set of fixed-resolution eye images e and 2D head angle vectors h (Zhang et al., 2015). To reduce the effect of different lighting conditions, so that no extra noise enters the CNN model, the eye images e are histogram-equalised after normalisation to improve their contrast (Zhang et al., 2015). The settings for camera distance d, focal length f and resolution W × H remain the same as in Sugano et al. (2014).
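The sketch below illustrates only the last part of this step: cropping a fixed-resolution eye patch around the eye-corner midpoint and histogram-equalising it. The warp into the normalised camera space is omitted, the crop heuristic is an assumption, and the resolution W × H = 60 × 36 follows the input size used in Section 3.3; landmark coordinates are placeholders.

```python
import cv2
import numpy as np

W, H = 60, 36  # fixed resolution of the normalised eye image (Section 3.3)

def crop_and_equalize_eye(gray_frame, eye_corner_left, eye_corner_right):
    """Crop a W x H grey-scale eye patch centred on the eye-corner midpoint
    and histogram-equalise it to reduce the effect of lighting conditions."""
    cx, cy = (np.asarray(eye_corner_left) + np.asarray(eye_corner_right)) / 2.0
    # Crop width proportional to the eye-corner distance (illustrative heuristic).
    half_w = int(0.75 * abs(eye_corner_right[0] - eye_corner_left[0]))
    half_h = int(half_w * H / W)
    patch = gray_frame[int(cy) - half_h:int(cy) + half_h,
                       int(cx) - half_w:int(cx) + half_w]
    patch = cv2.resize(patch, (W, H))   # fixed-resolution eye image e
    return cv2.equalizeHist(patch)      # contrast normalisation

# Example usage with placeholder landmark positions on a dummy grey-scale frame.
frame = np.random.randint(0, 255, (480, 640), dtype=np.uint8)
eye_image = crop_and_equalize_eye(frame, (210, 220), (250, 222))
```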
3.3 Using Multimodal CNN To Learn The Mapping
The CNN learns the mapping from the input features (2D head angle h and eye image e) to 2D gaze points g in the normalised screen space (Sugano et al., 2014; Zhang et al., 2015).
Our model uses a pre-activated ResNet-like architecture consisting of one convolutional layer and three residual stages, followed by batch normalisation and a final fully connected layer. A linear regression layer on top of the fully connected layer predicts the gaze position g. The head pose information is introduced into the CNN by concatenating h with the output of the fully connected layer. The input to the network is a grey-scale eye image e with a fixed size of 60 × 36 pixels (Zhang et al., 2015). The convolutional layer has 144 parameters. For more details on the architecture, please refer to Figure 2. The output of the network is a 2D gaze position g consisting of the (normalised) x and y coordinates on the 2D screen. Our loss function is the sum of the individual losses measuring the Euclidean distance between the predicted and the actual g; training the network thus amounts to minimising this loss function.
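The PyTorch sketch below shows one way such a multimodal network could be wired up under our assumptions: a 16-channel 3 × 3 convolutional stem (1 × 16 × 3 × 3 = 144 parameters), three pre-activation residual blocks, a fully connected layer whose output is concatenated with the 2D head angle h, and a linear regression layer producing the 2D gaze point g. Layer widths and depths are illustrative and do not reproduce the exact architecture in Figure 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreActBlock(nn.Module):
    """Minimal pre-activation residual block (BN -> ReLU -> conv, twice)."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(F.relu(self.bn1(x)))
        out = self.conv2(F.relu(self.bn2(out)))
        return x + out

class GazeNet(nn.Module):
    """Multimodal CNN: grey-scale eye image e (1 x 36 x 60) plus head angle h (2,)
    mapped to a normalised 2D gaze point g (2,)."""
    def __init__(self, channels=16):
        super().__init__()
        # 1 input channel x 16 output channels x 3 x 3 kernel = 144 parameters
        self.stem = nn.Conv2d(1, channels, 3, padding=1, bias=False)
        self.blocks = nn.Sequential(*[PreActBlock(channels) for _ in range(3)])
        self.bn = nn.BatchNorm2d(channels)
        self.fc = nn.Linear(channels * 36 * 60, 128)
        self.regress = nn.Linear(128 + 2, 2)  # h is concatenated before regression

    def forward(self, eye, head_angle):
        x = self.stem(eye)
        x = F.relu(self.bn(self.blocks(x)))
        x = self.fc(torch.flatten(x, 1))
        x = torch.cat([x, head_angle], dim=1)  # inject head pose information
        return self.regress(x)                 # predicted gaze point g
```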
The normalised data set is available online for public use, so we did not have to perform the steps in Section 3.1 and Section 3.2 ourselves; our work focuses on Section 3.3.

4 Experiments
Here we discuss the person-independent gaze estimation task with and without calibration, to validate the effectiveness of the proposed CNN-based gaze estimation approach with personalized calibration. There are various ways of performing personal calibration, such as fine-tuning the model in the target domain (Krafka et al., 2016). Some researchers have added target-person-specific features for gaze tracking which are learned during fine-tuning (Linden et al., 2019). We retrain the model with the target person's data and observe whether the performance of the model changes with and without personalized calibration. Please refer to Figure 3 for the pipeline of personalized calibration.
4.1 Equipment Used
All training was done on a GeForce GTX 1650 Max-Q. The average time to train one epoch of 12 steps is around 20 seconds at roughly 90% GPU utilisation.
4.2 Methodology
We conduct the experiments using only the MPIIGaze data set. We choose 1,500 left-eye samples and 1,500 right-eye samples from each person to account for the sample-number bias in the data set. Since one participant has only 1,448 images, we randomly oversampled their data to reach 3,000, so for each person we now have 3,000 images (Zhang et al., 2015). We split these 3,000 images into 10 partitions, each containing 10%, i.e. 300 images, for calibration. We used a leave-one-out evaluation strategy: to evaluate performance on the data of person X, we train a model using the data of the remaining persons plus a small percentage of the data of person X. Here we use 10% of the test person's data for calibration. We then test the resulting model A on the remaining 90% of person X's data. We also train a model B without calibration and test it on the same 90% of data used to test model A, which gives us the difference in estimation error with and without calibration. For each person we repeat this experiment 10 times (once per partition) to obtain the mean estimation error and to see which partition provides the best calibration for each person.
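A minimal sketch of this evaluation protocol is given below, assuming the per-person samples have already been loaded into NumPy arrays; the array names and the train/evaluate callables are placeholders, not our actual training code. For each held-out person, one 10% partition is added to the training pool for the calibrated model A, model B is trained without it, and both are tested on the remaining 90%.

```python
import numpy as np

def split_calibration(person_data, n_partitions=10, partition_idx=0, seed=0):
    """Split one person's 3,000 samples into a 10% calibration partition
    (one of 10 fixed partitions) and the remaining 90% test set."""
    rng = np.random.default_rng(seed)          # fixed seed -> same partitions each run
    idx = rng.permutation(len(person_data))
    parts = np.array_split(idx, n_partitions)
    calib_idx = parts[partition_idx]
    test_idx = np.concatenate([p for i, p in enumerate(parts) if i != partition_idx])
    return person_data[calib_idx], person_data[test_idx]

def leave_one_out(all_people, test_person, partition_idx, train_fn, eval_fn):
    """Leave-one-person-out evaluation with and without calibration.
    `all_people` maps person id -> array of samples; `train_fn` and `eval_fn`
    are placeholders for model training and error computation."""
    others = np.concatenate([d for p, d in all_people.items() if p != test_person])
    calib, test = split_calibration(all_people[test_person],
                                    partition_idx=partition_idx)
    model_a = train_fn(np.concatenate([others, calib]))   # model A: with calibration
    model_b = train_fn(others)                            # model B: without calibration
    return eval_fn(model_a, test), eval_fn(model_b, test)
```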
We trained every model for 40 epochs because we found that around 40 epochs the training and validation losses converge and the loss function reaches its minimum.
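The training loop could look like the sketch below, using a model such as the GazeNet sketch in Section 3.3 and the summed Euclidean-distance loss; the data loader, learning rate and choice of optimiser are assumptions rather than our exact configuration.

```python
import torch

def euclidean_loss(pred, target):
    # Sum of per-sample Euclidean distances between predicted and true gaze points g.
    return torch.norm(pred - target, dim=1).sum()

def train(model, train_loader, epochs=40, lr=1e-3, device="cuda"):
    model.to(device)
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)  # assumed optimiser
    for epoch in range(epochs):                     # losses converge around 40 epochs
        for eye, head_angle, gaze in train_loader:  # batches of (e, h, g)
            eye, head_angle, gaze = (eye.to(device), head_angle.to(device),
                                     gaze.to(device))
            optimiser.zero_grad()
            loss = euclidean_loss(model(eye, head_angle), gaze)
            loss.backward()
            optimiser.step()
    return model
```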

5 Results
The experiments were conducted as described in Section 4.2. We conducted 150 experiments, and in each experiment we created two models, one with and one without calibration, obtaining two error values per experiment. It is important to note that the error value here is unitless because it is a relative error. Due to differences in screen size between participants, the ground-truth pixel coordinates were normalised by dividing the x and y pixel coordinates by the width and height of the screen in pixels respectively, so that the coordinate values lie between 0 and 1; hence the error value also lies between 0 and 1. This normalisation was necessary for training and testing, since without it the model failed to train due to out-of-range errors. For a real-time setup in which the model infers the gaze position on the screen, the x and y coordinate values can be multiplied by the screen dimensions to recover the exact pixel position. The results show that calibration significantly reduces the mean error compared to no calibration, because the model now has some prior information about the person (see Figure 4).
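The normalisation and the inverse mapping back to pixels at inference time are a simple rescaling, sketched below; the screen dimensions are placeholder values.

```python
def normalize_gaze(px, py, screen_w, screen_h):
    # Map ground-truth pixel coordinates to the [0, 1] range used for training.
    return px / screen_w, py / screen_h

def denormalize_gaze(gx, gy, screen_w, screen_h):
    # Map the model's normalised output back to pixel coordinates at inference time.
    return gx * screen_w, gy * screen_h

# Example with a placeholder 1440 x 900 laptop screen.
gx, gy = normalize_gaze(720, 450, 1440, 900)     # -> (0.5, 0.5)
px, py = denormalize_gaze(gx, gy, 1440, 900)     # -> (720.0, 450.0)
```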
Consider the results for p00. The mean error before calibration is 0.25, i.e. the difference between the ground-truth and inferred (normalised) pixel coordinates; as mentioned before, it has no unit because the coordinates were normalised. After calibration the mean error is 0.20, a significant improvement for p00, with the error decreasing by almost 20%.
6 Limitations
One of the major limitations is the process of head-pose estimation in real time. Since we used the normalised MPIIGaze data set, we did not have to perform head-pose estimation ourselves; however, anyone creating a new data set of their own will have to perform it. Detecting facial landmarks and obtaining eye images can also be challenging because of lighting conditions and accessories such as glasses. Again, we did not have to deal with this because we used the clean, normalised MPIIGaze data set.


7 Future Work
Using only 300 extra images (10% of the test person's data), we have seen a significant decrease in estimation error. Our next step is therefore to build real-time software around this calibration method. Before the software starts gaze tracking, it will capture images of the person's face using a webcam. Most modern webcams capture 30 frames per second, so we can collect around 450 images in 15 seconds, which is enough to calibrate a pre-trained model. After that, the software can be used to obtain the predicted 2D gaze position on the screen.
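A rough sketch of the planned calibration capture is shown below: grabbing roughly 450 webcam frames over 15 seconds with OpenCV. The camera index and frame rate are assumptions; the captured frames would then go through the preprocessing of Sections 3.1 and 3.2 before being used to fine-tune the pre-trained model.

```python
import cv2
import time

def capture_calibration_frames(duration_s=15, max_frames=450, camera_index=0):
    """Capture webcam frames for roughly `duration_s` seconds (about 30 fps
    on most modern webcams) to build a personal calibration set."""
    cap = cv2.VideoCapture(camera_index)
    frames = []
    start = time.time()
    while time.time() - start < duration_s and len(frames) < max_frames:
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames  # to be preprocessed (Sections 3.1 and 3.2) before fine-tuning
```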
We have already started building such software, which is still at the development stage. At present it does not have the calibration feature and has not been tested rigorously, but it can perform facial landmark detection and predict the x, y coordinates of the estimated eye gaze on a 2D screen. Please refer to Figure 5, which shows our software working in real time.
Apart from this, we could also try other models, such as UNet, to see whether they improve the model's performance. Thus, there is ample scope for future work based on this study.
8 Conclusion
Appearance-based gaze estimation methods have so far been evaluated mostly under controlled laboratory conditions. In this work, we presented an extensive study of appearance-based gaze estimation together with a calibration technique, using a data set containing images with wide variations. Our CNN-based estimation model shows a significant improvement in performance when calibration is performed. This work provides critical insight into addressing the performance challenges of deep learning models in daily-life gaze interaction.
References
- Morimoto and Mimica (2005) Morimoto, C.; Mimica, M. Eye gaze tracking techniques for interactive applications. Computer Vision and Image Understanding 2005, 98, 4–24.
- Sugano et al. (2014) Sugano, Y.; Matsushita, Y.; Sato, Y. Learning-by-Synthesis for Appearance-Based 3D Gaze Estimation. 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2014; pp 1821–1828.
- Schneider et al. (2014) Schneider, T.; Schauerte, B.; Stiefelhagen, R. Manifold Alignment for Person Independent Appearance-Based Gaze Estimation. 2014 22nd International Conference on Pattern Recognition. 2014; pp 1167–1172.
- Zhang et al. (2015) Zhang, X.; Sugano, Y.; Fritz, M.; Bulling, A. Appearance-based gaze estimation in the wild. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015; pp 4511–4520.
- Torralba and Efros (2011) Torralba, A.; Efros, A. A. Unbiased look at dataset bias. CVPR 2011. 2011; pp 1521–1528.
- Li et al. (2014) Li, Y.; Hou, X.; Koch, C.; Rehg, J. M.; Yuille, A. L. The secrets of salient object segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition. 2014; pp 280–287.
- Majaranta and Bulling (2014) Majaranta, P.; Bulling, A. Eye tracking and eye-based human-computer interaction. Advances in Physiological Computing; Springer, 2014; pp 39–65.
- Li and Zhang (2013) Li, J.; Zhang, Y. Learning SURF cascade for fast and accurate object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013; pp 3468–3475.
- Baltrušaitis et al. (2014) Baltrušaitis, T.; Robinson, P.; Morency, L.-P. Continuous conditional neural fields for structured regression. European conference on computer vision. 2014; pp 593–608.
- Lepetit et al. (2009) Lepetit, V.; Moreno-Noguer, F.; Fua, P. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision 2009, 81, 155.
- Cheng et al. (2021) Cheng, Y.; Wang, H.; Bao, Y.; Lu, F. Appearance-based Gaze Estimation With Deep Learning: A Review and Benchmark. arXiv preprint arXiv:2104.12668 2021.
- Krafka et al. (2016) Krafka, K.; Khosla, A.; Kellnhofer, P.; Kannan, H.; Bhandarkar, S.; Matusik, W.; Torralba, A. Eye Tracking for Everyone. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016.
- Linden et al. (2019) Linden, E.; Sjostrand, J.; Proutiere, A. Learning to personalize in appearance-based gaze tracking. Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. 2019; pp 0–0.