In this paper, the authors build four supervised classifiers (logistic regression, KNN, SVM, and CNN) to discriminate between criminals and non-criminals. The data set consists of 1856 facial images of real people, controlled for race, gender, age, and facial expression. The authors find that some structural features of the face, such as the distance between the inner corners of the eyes and the curvature of the lip, help predict criminality. Upon further study, they also find that criminals and non-criminals differ markedly in how much their facial appearance varies.
The authors first point out the lack of research on the analysis and quantification of the social perception and attributes of faces [1]. While psychologists have long known that innate traits and social attributes, such as trustworthiness and dominance, can be inferred from facial appearance, the authors propose using machine learning and computer vision to find that relationship. They argue that computer vision algorithms carry no subjective baggage, which in their view all but guarantees the objectivity of the results. They also mention that the training data are standard ID photos of real people, controlled for race, gender, age, and facial expression, which they consider more realistic and of higher quality than the sample face images generated by 2D or 3D face models in earlier studies [2, 3, 4].
II. Data Preparation
A good data set is essential for this kind of experiment. The authors collected 1856 ID photos that satisfy the following criteria: Chinese, male, between 18 and 55 years old, no facial hair, and no facial scars or other markings. They denote this data set by S and divide it into two subsets, Sn for non-criminals and Sc for criminals. Sn consists of 1126 ID photos of non-criminals acquired from the Internet with a web spider tool. It is worth noting that roughly half of the individuals in Sn have university degrees. Sc contains 730 ID photos of criminals: 330 of these were published as wanted suspects, and the others were obtained under a confidentiality agreement. Sample ID photos from Sc and Sn are displayed in Figure 1.
For each ID photo, the authors extract the region of the face and upper neck and remove the background. All faces are normalized to 80 × 80 images. The authors also take extra measures to neutralize the possible effects of varying illumination conditions.
III. Implementation of Face Classifiers on Criminality and its Validation
The authors apply four classification methods to data set S to test the hypothesis that face images can be used to distinguish criminals from non-criminals. The methods are K-Nearest Neighbor, Logistic Regression, Support Vector Machine, and Convolutional Neural Network. The first three methods work on image features, and their performance is evaluated on four types of features:
- Facial landmark points, like corners of the eye and mouth
- Facial feature vector generated by modular PCA [5]
- Facial feature vector based on Local Binary Pattern (LBP) histograms [6]
- The concatenation of the above three feature vectors.
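To make the LBP feature more concrete, here is a minimal sketch of the basic 3 x 3 LBP operator on a tiny hypothetical grayscale grid. It is not the authors' pipeline: the function name `lbp_histogram`, the neighbour ordering, and the use of a single whole-image histogram (rather than one histogram per face region) are assumptions made for illustration.

```python
def lbp_histogram(image):
    """Return a 256-bin histogram of basic 3x3 LBP codes for a 2D grayscale grid.

    `image` is a list of equal-length rows of integer intensities. Border
    pixels are skipped because they lack a full 8-pixel neighbourhood.
    """
    # Clockwise neighbour offsets starting at the top-left pixel (an assumed
    # ordering; any fixed ordering gives an equivalent descriptor).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright as the centre.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist


# Hypothetical 4x4 patch: flat everywhere except one bright pixel.
patch = [
    [10, 10, 10, 10],
    [10, 50, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
hist = lbp_histogram(patch)
print(hist[0], hist[255])  # 1 3
```

The bright pixel is darker than none of its neighbours, so it gets code 0, while the three flat interior pixels get code 255. The resulting histogram is the kind of vector a feature-driven classifier would consume.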
The criminal subset Sc is defined as the positive class and the non-criminal subset Sn as the negative class. The authors run 10-fold cross-validation for every combination of the three feature-driven classifiers with the four types of feature vectors (12 cases), plus one data-driven CNN that uses no explicit feature vector, for 13 cases in total. For each case they measure the accuracy of classifying a member of S into Sn or Sc and average it over the ten runs, giving 130 experiments (13 cases × 10 runs).
Figure 2 shows the accuracies of all four classifiers across the thirteen cases. The CNN classifier achieves 89.51% accuracy. The authors also plot the ROC curves of the four classifiers in Figure 3 and give the corresponding AUC values in Table 1, which helps gauge the sensitivity of these binary face classifiers for criminality. At this point, the authors consider the predictive power of the proposed approach established.
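The experiment count can be checked with a short sketch. The case labels and the helper `k_fold_indices` are hypothetical names, and the folds here are plain contiguous splits; the review does not say whether the authors shuffled or stratified their folds.

```python
from itertools import product

classifiers = ["KNN", "LR", "SVM"]
feature_sets = ["landmarks", "modular_pca", "lbp", "concatenated"]

# 3 feature-driven classifiers x 4 feature types, plus the CNN on raw pixels.
cases = [f"{clf}+{feat}" for clf, feat in product(classifiers, feature_sets)]
cases.append("CNN")


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs that partition range(n_samples) into k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(1856, k=10))
total_runs = len(cases) * len(folds)
print(len(cases), len(folds), total_runs)  # 13 10 130
```

Each of the 1856 images lands in exactly one test fold, so every image is used for testing once per case.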
Because of the high social sensitivity and potential repercussions of this topic, the authors want to exercise maximum caution. They design the following experiment to challenge the validity of the classifiers: they randomly label the faces as negative or positive and redo all the above experiments. The outcome shows that the randomly labelled instances cannot be distinguished, with an average classification accuracy of only 48%. They also test the robustness of the results: they take 40 pictures of 10 male Chinese students in different environments, and the classification results remain above 83 percent.
According to the authors, these experiments show that the high accuracies of the four evaluated classifiers are not due to overfitting.
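The random-label control can be illustrated on toy data. The sketch below is not the authors' experiment: it uses a hypothetical one-dimensional feature, a simple nearest-centroid classifier in place of the four classifiers, and synthetic Gaussian samples. It only shows the logic of the check, namely that accuracy should fall to roughly chance once the labels carry no information.

```python
import random


def nearest_centroid_cv_accuracy(values, labels, k=10):
    """Mean k-fold accuracy of a 1D nearest-centroid classifier."""
    n = len(values)
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        # Class centroids are estimated from the training folds only.
        centroids = {}
        for cls in (0, 1):
            members = [values[i] for i in train_idx if labels[i] == cls]
            centroids[cls] = sum(members) / len(members)
        for i in test_idx:
            predicted = min(centroids, key=lambda cls: abs(values[i] - centroids[cls]))
            correct += predicted == labels[i]
    return correct / n


rng = random.Random(0)
# Synthetic feature: the two classes are centred at 0 and 3 (assumed values).
values = [rng.gauss(0, 1) for _ in range(100)] + [rng.gauss(3, 1) for _ in range(100)]
labels = [0] * 100 + [1] * 100

shuffled = labels[:]
rng.shuffle(shuffled)  # destroy any link between feature and label

true_acc = nearest_centroid_cv_accuracy(values, labels)
shuffled_acc = nearest_centroid_cv_accuracy(values, shuffled)
print(f"true labels: {true_acc:.2f}, shuffled labels: {shuffled_acc:.2f}")
```

With real labels the classifier should score well above chance; with shuffled labels it should hover around 50%, which mirrors the 48% the authors report.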
IV. Discrimination Features and Clustering of Face on Manifolds
After these experiments, the authors also want to find out which features of a human face matter most to the classifiers when deciding whether a face belongs to a criminal. Here they apply the Feature Generating Machine (FGM) of Tan et al. [7]. In Figure 4 (a), the red-marked regions are the most critical parts of the face. The authors find three areas that are especially significant for separating criminals from non-criminals. Figure 4 (b) shows the discriminating features: the curvature of the upper lip (denoted by ρ), the distance between the two inner corners of the eyes (denoted by d), and the angle between the two lines from the tip of the nose to the two corners of the mouth (denoted by θ).
The authors use the Hellinger distance [8], which measures the dissimilarity between two histograms on a scale from 0 to 1, to compare the criminal and non-criminal histograms of each feature. The histograms of the three critical features are shown in Figure 5, and their means and variances are tabulated in Table 2. For the angle θ, the average is 19.6% smaller for criminals than for non-criminals. For the upper lip curvature ρ, the average is 23.4% larger for criminals than for non-criminals. The distance d is slightly shorter for criminals.
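For reference, the Hellinger distance between two histograms over the same bins can be computed as in the sketch below. The function name and the example counts are hypothetical; only the formula itself is standard.

```python
import math


def hellinger_distance(hist_p, hist_q):
    """Hellinger distance between two histograms defined over the same bins."""
    if len(hist_p) != len(hist_q):
        raise ValueError("histograms must share the same bins")
    total_p, total_q = sum(hist_p), sum(hist_q)
    # Normalise counts to probabilities so the result lies in [0, 1].
    p = [count / total_p for count in hist_p]
    q = [count / total_q for count in hist_q]
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2)


same = hellinger_distance([5, 5], [10, 10])      # identical shapes
disjoint = hellinger_distance([4, 0], [0, 7])    # no overlapping bins
print(same, disjoint)  # 0.0 1.0
```

A value near 0 means the two feature histograms are almost indistinguishable, while a value near 1 means they barely overlap.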
The authors also generate average faces for criminals and non-criminals, shown in Figure 6. However, the average faces of the two subsets look very similar.
To explain this phenomenon, the authors assume that the faces of the two subsets populate two distinct manifolds. They compute the cross-class average manifold distance Dx between the two subsets, and the in-class average manifold distances Dc and Dn, using Function 2.
The results show that Dc > Dx > Dn, which means the two manifolds are concentric, with the non-criminal faces lying closer together than the criminal faces. Figure 7 shows the relationship between residual variance and Isomap dimensionality. It indicates that the original ultrahigh-dimensional data set can be represented well in a subspace of four to six dimensions.
Figure 8 shows the data clouds of criminals and non-criminals. According to the authors, these figures and the accompanying analysis show that there is no subjectively meaningful typical face of criminals.
Figure 9 shows four subtypes of criminal faces in Sc and three subtypes of non-criminal faces in Sn.
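The link between the ordering Dc > Dx > Dn and concentric point clouds can be seen on synthetic data. The sketch below is an illustration under loud assumptions: it uses plain Euclidean distance on hypothetical 2D points instead of geodesic distance on the face manifolds, and simple averages of pairwise distances, since the exact definition in the paper is not reproduced in this review.

```python
import math
import random


def mean_pairwise_distance(group_a, group_b, same_group=False):
    """Average Euclidean distance over pairs drawn from group_a and group_b."""
    total, count = 0.0, 0
    for i, a in enumerate(group_a):
        for j, b in enumerate(group_b):
            if same_group and j <= i:
                continue  # skip self-pairs and duplicate pairs within one group
            total += math.dist(a, b)
            count += 1
    return total / count


rng = random.Random(0)
# Hypothetical 2D stand-ins: both groups share a centre, but the "criminal"
# group is spread three times as widely as the "non-criminal" group.
non_criminal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
criminal = [(rng.gauss(0, 3), rng.gauss(0, 3)) for _ in range(200)]

d_n = mean_pairwise_distance(non_criminal, non_criminal, same_group=True)
d_c = mean_pairwise_distance(criminal, criminal, same_group=True)
d_x = mean_pairwise_distance(criminal, non_criminal)
print(d_c > d_x > d_n)  # True
```

When two clouds share a centre but differ in spread, the cross-class distance falls between the two in-class distances, which is the pattern the authors report.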
The authors also asked 50 Chinese students to separate the criminals and non-criminals in Figure 9, and the results matched what the authors expected. Figure 10 shows the relationship between the within-cluster variation and the number of clusters for the criminal and non-criminal data sets. The curve for criminals drops steeply up to K = 4, suggesting four well-separated clusters of criminal faces, whereas non-criminal faces do not form as many separable clusters under the geodesic distance. This analysis leads the authors to conclude that criminals vary more in facial appearance than the general public, although they are a small minority of the total population.
In this paper, through extensive experiments and rigorous cross-validation, the authors report that data-driven face classifiers can make reliable inferences on criminality. Additionally, they find that the facial appearance of the general public varies less than that of criminals.
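The kind of curve shown in Figure 10 can be reproduced on toy data. The sketch below is a simplified stand-in, not the authors' procedure: it runs k-means on hypothetical one-dimensional points with squared Euclidean distance instead of geodesic distance, and the quantile-based initialisation is an assumption chosen to keep the example deterministic.

```python
import random


def kmeans_1d(values, k, iterations=50):
    """Lloyd's algorithm in 1D; returns the within-cluster sum of squares."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: centres at evenly spaced quantiles of the data.
    centers = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: (v - centers[c]) ** 2)
            clusters[nearest].append(v)
        # Keep the old centre if a cluster ends up empty.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum((v - centers[i]) ** 2 for i, c in enumerate(clusters) for v in c)


rng = random.Random(0)
# Four tight synthetic groups centred at 0, 10, 20 and 30 (assumed values).
values = [center + rng.uniform(-1, 1) for center in (0, 10, 20, 30) for _ in range(25)]

wcss = {k: kmeans_1d(values, k) for k in range(1, 7)}
for k, score in wcss.items():
    print(k, round(score, 1))
```

The within-cluster variation collapses once K reaches the true number of groups (four here) and changes little afterwards, which is the "elbow" one looks for when reading such a plot.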
Before reading the whole paper, I had already seen many comments on it. One of the best-known is an article named “Physiognomy’s New Clothes”, whose authors regard this paper as scientific racism. In my opinion, from a methodological point of view, the paper has merit. After achieving relatively good results, the authors did not stop there but went on to explore the essence of the problem, and they use mathematical methods to test and support their speculations. I do not think this part of the work should be judged on an ethical level. However, I think using ID photos as the data set does not make sense. Consider the conditions under which ID photos are taken: the setting (a police office) imposes restrictions, and people tend to be more cautious there. A more effective way to detect criminals would be to focus on their actions, and on their demeanor and expressions while they carry out those actions. I think that would be more meaningful.
[1] A. Todorov and N. N. Oosterhof. Modeling social perception of faces [social sciences]. IEEE Signal Processing Magazine, 28(2):117–122, 2011.
[2] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
[3] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, pages 187–194. ACM Press/Addison-Wesley Publishing Co., 1999.
[4] V. Blanz and T. Vetter. Face recognition based on fitting a 3D morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1063–1074, 2003.
[5] R. Gottumukkal and V. K. Asari. An improved face recognition technique based on modular PCA approach. Pattern Recognition Letters, 25(4):429–436, 2004.
[6] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. In European Conference on Computer Vision, pages 469–481. Springer, 2004.
[7] M. Tan, L. Wang, and I. W. Tsang. Learning sparse SVM for feature selection on very high dimensional datasets. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1047–1054, 2010.
[8] R. Beran. Minimum Hellinger distance estimates for parametric models. The Annals of Statistics, pages 445–463, 1977.
Author: Shixin Gu | Reviewer: Haojin | Source: https://arxiv.org/pdf/1611.04135v1.pdf