This paper introduces a large-scale dataset for appearance-based gaze estimation in the wild. The dataset is larger than existing datasets and more variable with respect to illumination and appearance. The authors also present a multimodal convolutional neural network (CNN) for appearance-based gaze estimation that significantly outperforms state-of-the-art methods in the most challenging cross-dataset evaluation.
2. The MPIIGaze Dataset
The authors designed the dataset around two main objectives:
1) record images of participants outside of controlled laboratory conditions.
2) record participants over several months to cover a wider range of recording locations and times, illuminations, and eye appearances (see Figure 2).
Participants ran customised software on their laptops that automatically asked them to look at a random sequence of 20 on-screen positions. They were asked to fixate on each point and confirm by pressing the spacebar. The authors collected 213,659 images in total from 15 participants. The number of images per participant ranged from 1,498 to 34,745.
3.1 Face Alignment and 3D Head
The authors first detect the face in the image using the SURF cascade method proposed by Li et al. They then use a constrained local model framework (Baltrušaitis et al.) to detect facial landmarks.
This paper uses the same definition of the face model and head coordinate system as Sugano et al. The face model consists of the 3D positions of six facial landmarks (eye and mouth corners, shown in Figure 1). The head coordinate system is defined based on the triangle connecting the three midpoints of the eyes and mouth. The 3D head rotation is then defined as the rotation from the head coordinate system to the camera coordinate system, and the eye position is taken to be the midpoint of the eye corners for each eye. The 3D positions of the six landmarks were recorded for all participants using an external stereo camera prior to the data collection, and the authors calculate the mean shape across all participants as the generic shape. This generic mean facial shape model is used to evaluate the whole gaze estimation pipeline in a practical setting.
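The head coordinate system built from the eye and mouth midpoints can be sketched as follows. This is a minimal illustration, not the authors' implementation: the landmark key names, the axis directions (x from the right-eye midpoint to the left-eye midpoint, y towards the mouth, z perpendicular to the triangle) and the helper names are all assumptions made for this example.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _mid(a, b):
    return [(a[i] + b[i]) / 2.0 for i in range(3)]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def head_axes(landmarks):
    """Return (x_axis, y_axis, z_axis) of a head coordinate system.

    `landmarks` maps six hypothetical names to 3D points:
    right_eye_outer, right_eye_inner, left_eye_inner, left_eye_outer,
    mouth_right, mouth_left.
    """
    # Midpoints of the two eyes and the mouth form the reference triangle.
    right_eye = _mid(landmarks["right_eye_outer"], landmarks["right_eye_inner"])
    left_eye = _mid(landmarks["left_eye_inner"], landmarks["left_eye_outer"])
    mouth = _mid(landmarks["mouth_right"], landmarks["mouth_left"])

    # Assumed convention: x runs from the right-eye to the left-eye midpoint.
    x_axis = _unit(_sub(left_eye, right_eye))
    # Direction from the centre of the eyes towards the mouth.
    down = _sub(mouth, _mid(right_eye, left_eye))
    # z is perpendicular to the triangle; y completes a right-handed frame.
    z_axis = _unit(_cross(x_axis, down))
    y_axis = _cross(z_axis, x_axis)
    return x_axis, y_axis, z_axis
```

Stacking the three axes as columns gives a rotation matrix for the head, expressed in whatever coordinate system the landmark positions are given in.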
3.2 Data Normalisation
The normalisation is done by scaling and rotating the camera so that the camera looks at the midpoint of the eye corners of each eye from a fixed distance, and the ‘x’ axes of the head coordinate system and the camera coordinate system become parallel. The eye images are cropped at a fixed size ‘W x H’ with a fixed focal length ‘f’ in the normalised camera space, and the authors also apply histogram equalisation to form the input eye images. The result is a set of eye images at a fixed resolution of 36 x 60 pixels, each paired with a 2D head angle vector ‘h’. The ground-truth gaze positions are also converted to the normalised camera space to give 2D gaze angle vectors (yaw and pitch). After normalisation, the eye images from all datasets have the same image size and focal length, so appearance-based methods can be evaluated across different datasets.
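The conversion between a 3D gaze direction and the 2D (pitch, yaw) angles can be sketched as below. The sign convention is an assumption made for this example (camera x to the right, y downwards, z forwards, so a gaze pointing straight at the camera is (0, 0, -1)); the review does not state which convention the authors use, and the function names are hypothetical.

```python
import math

def vector_to_angles(gaze):
    """Convert a 3D gaze direction into (pitch, yaw) in radians.

    Assumed convention: looking straight at the camera, i.e. (0, 0, -1),
    maps to pitch = 0 and yaw = 0.
    """
    x, y, z = gaze
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    # Clamp to guard against rounding slightly outside [-1, 1].
    pitch = math.asin(max(-1.0, min(1.0, -y)))
    yaw = math.atan2(-x, -z)
    return pitch, yaw

def angles_to_vector(pitch, yaw):
    """Inverse of vector_to_angles: returns a unit 3D gaze direction."""
    x = -math.cos(pitch) * math.sin(yaw)
    y = -math.sin(pitch)
    z = -math.cos(pitch) * math.cos(yaw)
    return x, y, z
```

Under this convention the two functions are inverses of each other for pitch and yaw within (-90°, 90°), which covers the gaze directions seen by a camera facing the user.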
3.3 Gaze Estimation With Multimodal CNNs
The task of the CNN is to learn the mapping from the input features, namely the 2D head angle ‘h’ and the eye image, to the gaze angles ‘g’ in the normalised space (Figure 6). Because of the physical structure of the human eyes, both eyes look in the same direction when fixating a target. The authors therefore flip the eye images horizontally and mirror ‘h’ and ‘g’ around the ‘y’ axis, so that a single regression function can predict the gaze of both eyes.
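The flipping step can be illustrated with a short sketch. It assumes that angle vectors are stored as (pitch, yaw) and that mirroring around the ‘y’ axis amounts to negating the yaw component while leaving the pitch unchanged; both points, and the function name, are assumptions for this example rather than details given in the review.

```python
def mirror_sample(eye_image, head_angles, gaze_angles):
    """Mirror one sample so that left and right eyes share one regressor.

    eye_image   : 2D list of pixel rows (grey-scale values)
    head_angles : (pitch, yaw) of the head, in radians
    gaze_angles : (pitch, yaw) of the gaze, in radians
    """
    # Horizontal flip: reverse the pixel order within every row.
    flipped = [list(reversed(row)) for row in eye_image]
    # Assumed mirroring rule: yaw changes sign, pitch is unchanged.
    h_pitch, h_yaw = head_angles
    g_pitch, g_yaw = gaze_angles
    return flipped, (h_pitch, -h_yaw), (g_pitch, -g_yaw)
```

Applying the function twice returns the original sample, which is a quick sanity check that the image and the angles are mirrored consistently.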
The CNN model is based on the LeNet architecture, as shown in Figure 6. The authors use a linear regression layer on top of the fully connected layer to predict the gaze angle vector ‘g’. A distinguishing feature of this model is that it encodes head pose information into the fully connected layer (see Figure 6). The network input is a grey-scale image with a fixed resolution of 36 x 60 pixels. The two convolutional layers use a filter size of 5 x 5 pixels, with 20 feature maps in the first layer and 50 in the second. The fully connected layer has 500 hidden units; each unit is connected to all feature maps of the preceding convolutional layer and sums the incoming activation values. The output of the network is a 2D gaze angle vector ‘g^’, which consists of the pitch and yaw angles of the eye. The loss function is the L2 distance between the predicted ‘g^’ and the actual gaze angle vector ‘g’.
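The layer sizes implied by this description can be traced with a small shape calculation. The sketch below rests on assumptions that are not stated in the review: convolutions without padding and with stride 1, a 2 x 2 pooling step after each convolutional layer (as in the original LeNet), and the head angle ‘h’ being concatenated with the 500 hidden units before the regression layer. The loss is written as a plain Euclidean distance; the function names are hypothetical.

```python
import math

def conv_out(size, kernel):
    # Output size of an unpadded, stride-1 convolution (assumption).
    return size - kernel + 1

def pool_out(size, window):
    # Output size of non-overlapping pooling (assumption).
    return size // window

def lenet_shapes(height=36, width=60, kernel=5, maps=(20, 50),
                 pool=2, hidden=500, head_dims=2, out_dims=2):
    """List the layer output shapes of the assumed LeNet-style network."""
    shapes = []
    h, w = height, width
    for n_maps in maps:
        h, w = conv_out(h, kernel), conv_out(w, kernel)
        shapes.append(("conv", n_maps, h, w))
        h, w = pool_out(h, pool), pool_out(w, pool)
        shapes.append(("pool", n_maps, h, w))
    shapes.append(("flatten", maps[-1] * h * w))
    shapes.append(("fc", hidden))
    # Head pose is appended to the hidden units (assumed placement).
    shapes.append(("fc_plus_head", hidden + head_dims))
    shapes.append(("output", out_dims))
    return shapes

def l2_distance(predicted, target):
    """Euclidean distance between predicted and true gaze angle vectors."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)))

if __name__ == "__main__":
    for shape in lenet_shapes():
        print(shape)
```

Under these assumptions the 36 x 60 input shrinks to 50 feature maps of 6 x 12 before the fully connected layer, so each hidden unit sees 3,600 values.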
The authors conducted both cross-dataset and within-dataset evaluations to compare their multimodal CNN with state-of-the-art methods on the MPIIGaze dataset and other datasets. The baselines are Random Forests (RF), k-Nearest Neighbours (kNN), Adaptive Linear Regression (ALR), Support Vector Regression (SVR) and a shape-based approach (EyeTab).
Figure 7 shows the cross-dataset evaluation on two datasets, MPIIGaze and EYEDIAP. The multimodal CNN achieves the best performance on both. Figure 8 shows the within-dataset results, in which their method also has the lowest mean error.
This paper studies appearance-based gaze estimation in unconstrained daily-life settings. The authors' CNN-based prediction model significantly outperforms state-of-the-art methods. They also compiled the MPIIGaze dataset, an in-the-wild gaze dataset built through long-term data collection.
6. Thoughts from the reviewer
With the development of deep learning, researchers are trying to solve long-standing problems with these new techniques. Appearance-based methods aim to extract features directly from images, which requires a large amount of data. Because earlier computers could not handle that amount of data, researchers tended to adopt other methods. CNNs have since become more popular thanks to the rapid development of computing hardware and cloud computing. Furthermore, CNNs are very capable of extracting visual features from images. Building on this, the paper achieves promising results.
Contributions of this paper:
1. This paper provides an in-the-wild gaze dataset that consists of a large number of daily-life eye images. Researchers can build further work on this dataset.
2. The authors use a multimodal CNN to predict the gaze angle, adding the head pose into the fully connected layer. In my view, it is very difficult to predict the pitch and yaw angles of the human eye if the head pose is not given to the CNN model, because the oval shape of the human eye makes its appearance vary more visibly along the yaw angle than along the pitch angle. The multimodal design provides extra information that helps the CNN model reach a more accurate prediction.
Li, Jianguo, and Yimin Zhang. “Learning SURF cascade for fast and accurate object detection.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013.
Baltrušaitis, Tadas, Peter Robinson, and Louis-Philippe Morency. “Continuous conditional neural fields for structured regression.” European Conference on Computer Vision. Springer International Publishing, 2014.
Sugano, Yusuke, Yasuyuki Matsushita, and Yoichi Sato. “Learning-by-synthesis for appearance-based 3D gaze estimation.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
Author: Jinpeng Cai | Reviewer: Haojin