Unconstrained Face Recognition using ASURF and Cloud-Forest Classifier optimized with VLAD

Vinay A, Aviral Joshi, Hardik Mahipal Surana, Harsh Garg, K N Balasubramanya Murthy, S Natarajan
Center for Pattern Recognition and Machine Intelligence, PES University, Bangalore 560085, India

Abstract

This paper proposes a computationally efficient algorithm for multi-class facial image classification in which images are subject to variations in translation, rotation, scale, color, illumination and affine distortion. The proposed method is divided into five main building blocks: Haar-Cascade for face detection, Bilateral Filter for image preprocessing to remove unwanted noise, Affine Speeded-Up Robust Features (ASURF) for keypoint detection and description, Vector of Locally Aggregated Descriptors (VLAD) for feature quantization and Cloud Forest for image classification. The method aims to improve both the accuracy and the running time of face recognition systems. Using the Cloud Forest algorithm as the classifier on three benchmark datasets, namely the FACES95, FACES96 and ORL facial datasets, showed promising results. Depending on the dataset used, the proposed methodology with Cloud Forest improves recognition accuracy by 2-12% compared with other ensemble techniques such as the Random Forest classifier.

keywords:

Face Recognition; Haar Cascade; Bilateral Filter; ASURF; Bag of Words; VLAD; Cloud Forest; Random Forest


1 Introduction

Face recognition has long been one of the prime topics of research interest in computer vision [15]. Significant work has been done in face recognition and affiliated fields for decades, driven by applications in surveillance systems, authentication systems and other security tools. Such applications can be used for authentication both in public places like theaters and airports and in private enterprises for authenticating employees. Face recognition can also be used in electronic devices with embedded cameras to authorize legitimate users. Real-time application of such face recognition models requires the development of fast and efficient models.

This paper proposes an efficient model for face recognition which is reliable and computationally inexpensive. To build such a model we use Haar-Cascade to detect faces in the image, followed by a Bilateral filter to eliminate unwanted noise from the detected face. Such a preprocessing step is an important part of any image recognition problem. Subsequently, we use ASURF [33], which makes the model invariant to translation, rotation, scale, illumination and affine distortion of images. The detected keypoints are characterized using the ASURF descriptor, which can then be used for classification. The descriptors are quantized into a compact representation using either the Bag of Words [29] or the VLAD feature aggregation technique. The resulting vector is classified using the Cloud Forest classifier [17].

2 Related Work

Face detection is a non-invasive task in the space of object detection. The Haar Cascade technique uses Haar-like features and builds on the Viola-Jones detector [11]. According to [12], rotated Haar-like features can be calculated efficiently and improve over the Viola-Jones algorithm, reducing the false alarm rate by as much as 12.5%. An evaluation study of this technique [13] reports an accuracy of 100% on the FA1 and FA2 databases and 99.25% on the FEI database according to the Criterium II benchmark.

Bilateral Filtering has been found to work well in most image processing applications [1]. It has been used in various contexts such as tone management [5], texture editing and relighting [8], image noise reduction [5, 6, 7], stylization [10] and demosaicking [9]. In [6], Bilateral Filters are used to enhance low dynamic-range, underexposed videos by varying the exposure in every photoreceptor. Bilateral filters are also widely used in medical imaging and movie restoration applications.

Speeded-up robust features (SURF) is one of the most widely used methods for keypoint detection and description. This method is invariant to in-plane rotation, contrast, scale and brightness. The keypoint detector in SURF locates and interpolates highly discriminative facial points. To extract the characteristics of the detected keypoints, each one is described by constructing a feature vector. In order to achieve faster computation time, a fast Hessian matrix approximation is used. For the detection of interest points, the scale space is built by up-scaling the sizes of the integral-image-based filters. Interest points in the facial images are computed at varied scales, where scale space is conventionally implemented through an image pyramid whose levels are generated by techniques such as sub-sampling and Gaussian smoothing. For the description of interest points, the algorithm fixes a reproducible orientation using information from the circular area around the detected keypoint. Subsequently, a square region is constructed according to the chosen orientation, and the descriptor is computed from this region. The SURF descriptor mainly emphasizes the spatial distribution of gradient information inside the immediate neighborhood of the keypoint.

Although SURF has far-reaching applications, distortions such as affine transforms and changes in camera angle tend to reduce its accuracy. To overcome these issues, a method termed Affine-SURF (ASURF) [33] was proposed by Yanwei Pang, Wei Li and Yuan Yuan. This method was effective in the major applications where SIFT or SURF typically gave poor performance. Affine SURF finds applications in robust image matching [16] and automatic identification of cloud cover regions [18].

Vector of Locally Aggregated Descriptors (VLAD) finds use in applications like weakly supervised place recognition [19], improvement of image similarity using tensors [20], fast video classification [21], large scale image retrieval [22] and event detection [23]. In [19], VLAD forms one of the important layers of a CNN architecture, NetVLAD, which produces an image representation that can be used for image retrieval, is readily pluggable into other architectures and is amenable to training. In [21], VLAD is used in combination with Fisher kernels to outperform the Bag of Words technique in terms of accuracy. [22] uses VLAD in large scale image search applications.

Bag of words finds use in image categorization [24], human action classification [25], multiple-class segmentation [26], medical image retrieval [27] and red eye detection [28]. [26] proposes multiple-class segmentation based on bags of keypoints which are combined over mean-shift patches. In [25], a hierarchical model is presented which is characterized as a constellation of bags of words.

3 Method Proposed

An efficient model is proposed with Haar-Cascade for face detection, Bilateral Filters for image preprocessing, ASURF for feature detection and description, Bag of Words and Vector of Locally Aggregated Descriptors (VLAD) for feature aggregation and Cloud Forest for classification.

3.1 Image detection using Haar-Cascade

The input image sent to the model is first passed through face detection, for which we use the Haar-Cascade detector. The detector of [11] finds faces rapidly using only the information present in grayscale images. The major aspects of Haar-Cascade are an integral image representation of the original image, which allows quick evaluation of Haar-like features; an efficient classifier built from a small subset of features selected by AdaBoost; and a method for combining increasingly complex classifiers into a cascading architecture. The integral image at location $(x,y)$ contains the sum of the pixels above and to the left of $(x,y)$, inclusive, and is given by

ii(x,y)=\sum_{x^{\prime}\leq x,y^{\prime}\leq y}i(x^{\prime},y^{\prime})

(1)

where $i(x,y)$ is the original image and $ii(x,y)$ is the integral image. Finally the detected faces are sent further into the model for image preprocessing and subsequently for classification.
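
As an illustration of Eq. (1), the following minimal sketch builds an integral image for a tiny grayscale array and uses it to sum a rectangle with at most four lookups, which is the operation that makes Haar-like features cheap to evaluate. The function names `integral_image` and `rect_sum` and the 3x3 test image are assumptions made for this example and are not part of the original implementation.

```python
def integral_image(img):
    """Return ii where ii[y][x] sums img over all rows <= y and columns <= x."""
    height, width = len(img), len(img[0])
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0  # running sum of row y up to and including column x
        for x in range(width):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii


def rect_sum(ii, top, left, bottom, right):
    """Sum of the original pixels in the inclusive rectangle, via four lookups."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total


# Hypothetical 3x3 grayscale image used only for illustration.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
ii = integral_image(image)
bottom_right_block = rect_sum(ii, 1, 1, 2, 2)  # 5 + 6 + 8 + 9
print(ii[2][2], bottom_right_block)
```

A two-rectangle Haar-like feature would then be the difference of two such `rect_sum` calls over adjacent regions.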

Algorithm 1 Face Detection

1: procedure DetectFace(image) ▷ detects face from input image
2:   face ← haar_cascade.find_face(image)
3:   return face

3.2 Image preprocessing using Bilateral Filter

Once a face has been detected, the face image must be preprocessed to reduce the noise present in it. Noise is defined as random fluctuations in brightness or color information in images. Filtering reduces noise by replacing pixel values that deviate from the underlying signal with values derived from nearby pixels.

The method proposed in this paper uses the Bilateral filter [2], a non-linear, non-iterative and simple technique that blurs an image while preserving strong edges. It smooths the image with a non-linear combination of nearby image values, averaging within smooth regions and preserving edges. Bilateral filters outperform other filtering techniques such as Gaussian blurring [3], a linear operation controlled by sigma that neglects edges, and non-linear filters like the median filter [4], which replaces each pixel value with the median of its local neighborhood. In Bilateral Filtering, the intensity value of a pixel is replaced by a weighted average of the intensity values of the surrounding pixels, with weights based on a Gaussian distribution. The filter loops over each pixel and adjusts the weights of the neighboring pixels accordingly, retaining sharp edges. Traditional filters generally operate on the three color bands separately, which produces different levels of contrast and smoothing in each band and perturbs the balance of colors and the order in which they appear. The method used here processes the three bands at once, averaging only similar colors and minimizing the artifacts highlighted above. The pixel at location $x$ is replaced with the mean of similar and nearby pixel values. The preprocessed image is sent for keypoint detection further in the model. Bilateral filtering combines the range and spatial domain filters,

h(x)=\frac{1}{n(x)}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}g(\xi)c(\xi,x)s(g(\xi),g(x))d\xi

(2)

where n(x) is the normalization factor calculated as,

n(x)=\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}c(\xi,x)s(g(\xi),g(x))d\xi

(3)

in which $c(\xi,x)$ measures the geometric closeness between the neighborhood center $x$ and a nearby point $\xi$, and $s(g(\xi),g(x))$ measures the photometric similarity between the pixel at $x$ and the pixel at $\xi$. Here $g$ denotes the input image and $h$ the filtered output.
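
A discrete version of the filter can be sketched as follows for a small grayscale image. This is only an illustration under stated assumptions: both the closeness function $c$ and the similarity function $s$ are taken to be Gaussians, the integral is replaced by a sum over a square window, and the names `radius`, `sigma_d` and `sigma_r` are parameters introduced for this example rather than values used in the paper.

```python
import math


def bilateral_filter(img, radius=1, sigma_d=1.0, sigma_r=10.0):
    """Weighted average of neighbours; weights combine spatial closeness c and
    intensity similarity s, then are normalised by their sum n(x)."""
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            weighted_sum = 0.0
            norm = 0.0  # discrete counterpart of the normalisation factor n(x)
            for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                    dist2 = (yy - y) ** 2 + (xx - x) ** 2
                    diff2 = (img[yy][xx] - img[y][x]) ** 2
                    c = math.exp(-dist2 / (2 * sigma_d ** 2))  # geometric closeness
                    s = math.exp(-diff2 / (2 * sigma_r ** 2))  # photometric similarity
                    weighted_sum += img[yy][xx] * c * s
                    norm += c * s
            out[y][x] = weighted_sum / norm  # norm >= 1 because the centre has weight 1
    return out


# Hypothetical noisy step edge: dark left half, bright right half.
step = [
    [10, 12, 200, 198],
    [11, 10, 201, 200],
    [12, 11, 199, 202],
]
smoothed = bilateral_filter(step)
print([round(v, 1) for v in smoothed[1]])
```

Because pixels on the other side of the edge receive a similarity weight close to zero, the dark and bright regions are each smoothed internally while the edge between them stays sharp.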

3.3 Keypoint detection and description using ASURF

When a camera photographs an object with smooth boundaries from varying positions, the resulting images show smooth apparent deformations. Affine transforms of the image plane approximate these deformations well locally. Computing affine-invariant local image features is therefore a central problem in object recognition. Speeded-up robust features (SURF) is invariant to in-plane rotation, contrast, scale and brightness. The keypoint detector in SURF locates and interpolates highly discriminative facial points. To extract the characteristics of the detected keypoints, each one is described by constructing a feature vector. In order to achieve faster computation time, a fast Hessian matrix approximation is used.

SURF performs well when the object is viewed under similar illumination conditions, but it suffers under extreme changes in viewing angle. The method implemented in this paper, Affine-SURF (ASURF), adds affine invariance while remaining as computationally efficient as SURF [33]. It reliably finds features under large affine distortions by simulating affine-transformed views of the image and then applying SURF to all of the resulting images. Thus, ASURF effectively overcomes all six constraints mentioned above.

3.4 Feature Aggregation using Bag of Words

After computing the descriptors with ASURF, we have enough information to distinguish relevant changes in the image parts. To create the feature vector, a codebook $C$ of clusters is built by applying the k-means clustering technique to the set of descriptors. Every descriptor in an image is then associated with the nearest cluster centroid $c_{i}$ in the vocabulary.
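
The codebook construction can be sketched with plain Lloyd iterations. This is a minimal illustration, not the implementation used in the experiments: the two-dimensional toy descriptors stand in for real ASURF descriptors, and seeding the centroids with the first $k$ descriptors is an assumption made to keep the example deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_index(x, centroids):
    """Index of the centroid closest to descriptor x."""
    return min(range(len(centroids)), key=lambda i: squared_distance(x, centroids[i]))


def kmeans(descriptors, k, iterations=10):
    """Lloyd's algorithm; the first k descriptors seed the centroids (an assumption)."""
    centroids = [list(d) for d in descriptors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest_index(d, centroids)].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[i] = [sum(m[j] for m in members) / len(members) for j in range(dim)]
    return centroids


# Hypothetical 2-D descriptors forming two well-separated groups.
descriptors = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [1.0, 0.0], [10.0, 11.0], [11.0, 10.0]]
codebook = kmeans(descriptors, k=2)
print(codebook)
```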

The Bag-of-words (BoW) model counts the number of descriptors assigned to each cluster in the vocabulary. These counts form a sparse histogram over the vocabulary, which represents the image in a compact form. Such histograms can be used by classification algorithms to categorize images into different classes.
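
The histogram step can be illustrated as below. The three-word codebook and the five descriptors are made-up values for this sketch, and `bow_histogram` is a hypothetical helper name; real descriptors would have many more dimensions and the codebook would come from k-means.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def bow_histogram(descriptors, codebook):
    """Count how many descriptors fall nearest to each visual word."""
    histogram = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)), key=lambda i: squared_distance(d, codebook[i]))
        histogram[nearest] += 1
    return histogram


# Hypothetical codebook with three visual words and descriptors from one image.
codebook = [[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]]
image_descriptors = [[1.0, 0.5], [9.0, 9.5], [0.5, 0.0], [10.0, 11.0], [0.2, 0.1]]
histogram = bow_histogram(image_descriptors, codebook)
print(histogram)
```

The histogram has one bin per visual word regardless of how many keypoints the image produced, which is what gives every image a fixed-length input for the classifier.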

3.5 Feature Aggregation using VLAD

Similar to the Bag of Words (BoW) model, the Vector of Locally Aggregated Descriptors (VLAD) represents an image in compact form based on a locality criterion. As in Bag of Words, a codebook $C$ of $k$ visual words is constructed with the help of k-means, and each local descriptor $x$ is assigned to its closest visual word. The main idea of the VLAD descriptor is to accumulate, for each visual word, the differences between the descriptors assigned to it and the word itself. Each component of $v$ is the sum of these differences over the image descriptors assigned to the corresponding visual word, given by

v_{i,j}=\sum_{x\,:\,NN(x)=c_{i}}(x_{j}-c_{i,j})

(4)

where $i$ indexes the visual words and $j$ the descriptor components, $NN(x)$ is the visual word nearest to descriptor $x$, and $x_{j}$ and $c_{i,j}$ are the $j$-th components of $x$ and $c_{i}$ respectively. Further, the vector $v$ is L2-normalised by

v=\frac{v}{||v||_{2}}

(5)

The quantized vector is sent further for multi-class image classification.
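
The aggregation and normalisation steps of this section can be sketched as follows: residuals between each descriptor and its nearest visual word are accumulated per word, the per-word sums are concatenated, and the result is L2-normalised. The codebook, the descriptors and the helper name `vlad` are assumptions made for this illustration.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def vlad(descriptors, codebook):
    """Accumulate residuals per nearest visual word, flatten, then L2-normalise."""
    k, dim = len(codebook), len(codebook[0])
    v = [[0.0] * dim for _ in range(k)]
    for x in descriptors:
        i = min(range(k), key=lambda idx: squared_distance(x, codebook[idx]))
        for j in range(dim):
            v[i][j] += x[j] - codebook[i][j]  # residual component for word i
    flat = [value for row in v for value in row]
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm > 0 else flat


# Hypothetical codebook with two visual words and three descriptors from one image.
codebook = [[0.0, 0.0], [10.0, 10.0]]
image_descriptors = [[1.0, 0.0], [0.0, 2.0], [13.0, 10.0]]
encoding = vlad(image_descriptors, codebook)
print([round(value, 4) for value in encoding])
```

Unlike the BoW histogram, which has $k$ entries, the VLAD vector has $k$ times the descriptor dimension, since it keeps a residual per component for every visual word.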

Algorithm 2 Accuracy of Classifier

1: procedure GetAccuracy(trueLabels, predictedLabels) ▷ Calculates the accuracy of the classifier
2:   CorrectlyClassified ← 0
3:   NumberOfLabels ← 0
4:   for all tl, pl in trueLabels, predictedLabels do ▷ iterate over each true (tl) and predicted (pl) label
5:     if tl = pl then
6:       CorrectlyClassified ← CorrectlyClassified + 1
7:     NumberOfLabels ← NumberOfLabels + 1
8:   return CorrectlyClassified / NumberOfLabels ▷ calculate the accuracy
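
For reference, Algorithm 2 translates almost line for line into Python. The sketch below adds two input checks (equal lengths and a non-empty label list) that the pseudocode leaves implicit; those checks and the example labels are additions for illustration.

```python
def get_accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty label list")
    correctly_classified = sum(
        1 for tl, pl in zip(true_labels, predicted_labels) if tl == pl
    )
    return correctly_classified / len(true_labels)


# Hypothetical labels: three of the four predictions are correct.
accuracy = get_accuracy(["a", "b", "c", "a"], ["a", "b", "a", "a"])
print(accuracy)
```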

Algorithm 3 Face Recognition

1: procedure FaceRecognition(ImagesData, Aggregator, Classifier) ▷ Performs face recognition
2:   ImageDescriptors ← emptyList()
3:   for all im in ImagesData do ▷ iterate over each image (im)
4:     face ← DetectFace(im)
5:     filteredFace ← BilateralFilter(face)
6:     faceKeypoints, faceDescriptor ← DetectAffineSURF(filteredFace)
7:     ImageDescriptors.append(faceDescriptor)
8:   vocabulary ← Aggregator.createVocabulary(ImageDescriptors)
9:   NewDescriptors ← emptyList()
10:  DescriptorClass ← emptyList()
11:  for all im, des in ImagesData, ImageDescriptors do ▷ iterate over each image (im) and its corresponding descriptor (des)
12:    newDescriptor ← Aggregator.computeDescriptor(des, im, vocabulary)
13:    NewDescriptors.append(newDescriptor)
14:    DescriptorClass.append(label(im)) ▷ class label of im
15:  trainData, trainLabels, testData, testLabels ← TrainTestSplit(NewDescriptors, DescriptorClass)
16:  trainingAccuracy ← Classifier.train(trainData, trainLabels)
17:  predictedOutputs ← Classifier.predict(testData)
18:  accuracy ← GetAccuracy(testLabels, predictedOutputs) ▷ calculate accuracy

3.6 Classification using Cloud Forest

CloudForest [17] is a classification algorithm that uses an ensemble of decision trees and is written in the Go programming language. Its merit lies in an efficient implementation that fully utilizes the multi-threading potential of modern machines, making it fast and flexible. Apart from classification, it also performs feature selection, regression and structure analysis on heterogeneous data with missing values.

Decision trees are often used in machine learning and data mining applications. According to [14], they are invariant under many feature transformations and robust to the addition of irrelevant features. However, trees that grow deep tend to learn highly irregular patterns, thereby overfitting the training data.

Random forests are used in all sorts of machine learning applications, ranging from classification to regression. A random forest constructs multiple decision trees with the goal of reducing the variance. It does so by taking the mean of the predictions calculated by the individual trees. This comes at the cost of an increase in bias and a loss of interpretability, but greatly boosts the performance of the final model.
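
The averaging step can be illustrated with a toy ensemble. In the sketch below each "tree" is a hand-written rule standing in for a trained decision tree, and the class names, the sample and the helper `forest_predict` are all hypothetical; it shows only how per-tree predictions are averaged, not how CloudForest or any Random Forest library trains its trees.

```python
def forest_predict(trees, sample, classes):
    """Average per-class scores over all trees; return (best class, mean scores)."""
    mean_scores = {c: 0.0 for c in classes}
    for tree in trees:
        scores = tree(sample)  # each tree returns a dict of class -> score
        for c in classes:
            mean_scores[c] += scores.get(c, 0.0) / len(trees)
    best = max(classes, key=lambda c: mean_scores[c])
    return best, mean_scores


# Three stand-in "trees", each a simple threshold rule on a 2-D feature vector.
trees = [
    lambda s: {"alice": 1.0} if s[0] > 0.5 else {"bob": 1.0},
    lambda s: {"alice": 1.0} if s[1] > 0.5 else {"bob": 1.0},
    lambda s: {"alice": 1.0} if s[0] + s[1] > 1.2 else {"bob": 1.0},
]
label, scores = forest_predict(trees, [0.9, 0.2], ["alice", "bob"])
print(label, scores)
```

Here one tree votes for "alice" and two for "bob", so the averaged scores favour "bob" even though an individual tree disagrees; this smoothing over individual trees is what reduces the variance of the ensemble.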

4 Performance Analysis

4.1 Datasets

1. FACES95 [30] contains 1440 images of 72 individuals, with 20 images per individual taken using a fixed camera with a delay of 0.5 seconds between successive frames in the sequence. Significant head movement and lighting variations were introduced between images of the same individual. Minor variations in head turn, tilt and expression are present.

2. FACES96 [31] is a significantly larger facial dataset which was constructed in a manner similar to FACES95. In total, there are 3040 images of 152 individuals with complex backgrounds.

3. The ORL [32] database is relatively small, with 40 classes and 400 images. The pictures vary with respect to time and light conditions, facial expression and facial details.

4.2 Results

Table 1: Performance of model on multiple datasets and classifiers

Dataset   Classifier      BoW accuracy   VLAD accuracy
FACES95   Cloud Forest    97.22%         97.69%
FACES95   Random Forest   97.45%         95.37%
FACES95   kNN             64.14%         62.36%
FACES96   Cloud Forest    94.08%         89.94%
FACES96   Random Forest   88.07%         77.24%
FACES96   kNN             61.54%         57.21%
ORL       Cloud Forest    77.50%         86.67%
ORL       Random Forest   84.17%         91.67%
ORL       kNN             63.89%         71.62%

Table 2: Computational efficiency of model on multiple datasets

Dataset   Classifier      Training time (s)   Testing time (s)
FACES95   Cloud Forest    11.88               1.41
FACES95   Random Forest   139.46              0.21
FACES96   Cloud Forest    103.66              5.51
FACES96   Random Forest   733.33              0.60
ORL       Cloud Forest    1.75                1.22
ORL       Random Forest   15.82               0.08

The proposed model has been evaluated on three standard datasets, namely FACES95, FACES96 and ORL. These datasets contain variations in head rotation, expression, illumination, scale and affine distortion in the facial images. The model was run with two different feature aggregators, VLAD and Bag of Words, with Cloud Forest as the classifier. Cloud Forest was further compared with other classification algorithms such as Random Forest, and the results are tabulated in Table 1. With Cloud Forest as the classifier and VLAD as the quantization method, the model achieves accuracies of 97.69%, 89.94% and 86.67% on FACES95, FACES96 and ORL respectively, in contrast to 97.22%, 94.08% and 77.50% with Bag of Words on the same datasets. With Cloud Forest, VLAD therefore performs better than Bag of Words on two of the three datasets (FACES95 and ORL). When the classifier is changed from Cloud Forest to Random Forest with VLAD as the aggregator, Table 1 shows that Random Forest does not match the performance of Cloud Forest on any dataset except ORL. It gave an accuracy of 95.37% on FACES95, whereas Cloud Forest was 97.69% accurate on the same dataset. A similar trend can be seen for FACES96, with 77.24% for Random Forest and 89.94% for Cloud Forest. The experiments were extended to other classifiers such as k-nearest neighbours (kNN), which achieves accuracies of 64.14%, 61.54% and 63.89% on FACES95, FACES96 and ORL with Bag of Words as the aggregator and 62.36%, 57.21% and 71.62% with VLAD as the aggregator.
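
The gap between the two ensemble classifiers can be checked directly from Table 1. The short sketch below copies the VLAD accuracies from the table and computes Cloud Forest minus Random Forest in percentage points; the dictionary layout and variable names are choices made for this example.

```python
# Accuracies (%) with VLAD aggregation, copied from Table 1.
vlad_accuracy = {
    "FACES95": {"Cloud Forest": 97.69, "Random Forest": 95.37},
    "FACES96": {"Cloud Forest": 89.94, "Random Forest": 77.24},
    "ORL": {"Cloud Forest": 86.67, "Random Forest": 91.67},
}

# Positive values mean Cloud Forest is more accurate than Random Forest.
gaps = {
    dataset: round(acc["Cloud Forest"] - acc["Random Forest"], 2)
    for dataset, acc in vlad_accuracy.items()
}

for dataset, gap in gaps.items():
    print(f"{dataset}: {gap:+.2f} percentage points")
```

With VLAD, Cloud Forest leads by about 2.3 points on FACES95 and 12.7 points on FACES96, while Random Forest leads by 5 points on ORL.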

In addition to accuracy, training and testing times were analysed for deeper insight into the efficiency of the proposed system; they are shown in Table 2. Training with Cloud Forest as the classifier takes much less time than training with Random Forest on all the datasets considered. This decrease can be attributed to the multi-threaded implementation of the ensemble. On FACES95, FACES96 and ORL, Cloud Forest takes 11.88, 103.66 and 1.75 seconds to train, compared with 139.46, 733.33 and 15.82 seconds respectively for Random Forest. Thus, if the application requires training on large facial datasets, the Cloud Forest classifier is the more appropriate candidate. However, for time-constrained applications like real-time surveillance systems, the Random Forest classifier is more appropriate, as its testing time is lower (for example, 0.21 seconds against 1.41 seconds on FACES95).
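
The trade-off between training and testing time can be quantified from Table 2. The sketch below copies the timings from the table and computes how many times faster Cloud Forest trains and how many times slower it tests relative to Random Forest; the variable names are choices made for this example.

```python
# (training seconds, testing seconds), copied from Table 2.
timings = {
    "FACES95": {"Cloud Forest": (11.88, 1.41), "Random Forest": (139.46, 0.21)},
    "FACES96": {"Cloud Forest": (103.66, 5.51), "Random Forest": (733.33, 0.60)},
    "ORL": {"Cloud Forest": (1.75, 1.22), "Random Forest": (15.82, 0.08)},
}

train_speedup = {}  # Random Forest training time / Cloud Forest training time
test_slowdown = {}  # Cloud Forest testing time / Random Forest testing time
for dataset, t in timings.items():
    cf_train, cf_test = t["Cloud Forest"]
    rf_train, rf_test = t["Random Forest"]
    train_speedup[dataset] = rf_train / cf_train
    test_slowdown[dataset] = cf_test / rf_test
    print(f"{dataset}: trains {train_speedup[dataset]:.1f}x faster, "
          f"tests {test_slowdown[dataset]:.1f}x slower")
```

On these figures Cloud Forest trains roughly 7 to 12 times faster than Random Forest but takes roughly 7 to 15 times longer at test time, which is the basis for recommending each classifier for a different kind of application.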

References

[1] Paris S, Kornprobst P, Tumblin J, Durand F, Curless B, Van Gool L, Szeliski R. Bilateral Filtering: Theory and Applications: Series: Foundations and Trends® in Computer Graphics and Vision.
[2] Tomasi C, Manduchi R. Bilateral filtering for gray and color images. In Computer Vision, 1998. Sixth International Conference on 1998 Jan 4 (pp. 839-846). IEEE.
[3] Mostaghim M, Ghodousi E, Tajeripoor F. Image smoothing using non-linear filters a comparative study. In Intelligent Systems (ICIS), 2014 Iranian Conference on 2014 Feb 4 (pp. 1-6). IEEE.
[4] Dreuw P, Steingrube P, Hanselmann H, Ney H, Aachen G. SURF-Face: Face Recognition Under Viewpoint Consistency Constraints. In BMVC 2009 Sep 7 (pp. 1-11).
[5] Aleksic M, Smirnov M, Goma S. Novel bilateral filter approach: Image noise reduction with sharpening. In Digital Photography II 2006 Feb 10 (Vol. 6069, p. 60690F). International Society for Optics and Photonics.
[6] Bennett EP, McMillan L. Video enhancement using per-pixel virtual exposures. ACM Transactions on Graphics (TOG). 2005 Jul 1;24(3):845-52.
[7] Liu C, Freeman WT, Szeliski R, Kang SB. Noise estimation from a single image. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on 2006 Jun 17 (Vol. 1, pp. 901-908). IEEE.
[8] Oh BM, Chen M, Dorsey J, Durand F. Image-based modeling and photo editing. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques 2001 Aug 1 (pp. 433-442). ACM.
[9] Ramanath R, Snyder WE. Adaptive demosaicking. Journal of Electronic Imaging. 2003 Oct;12(4):633-43.
[10] Winnemöller H, Olsen SC, Gooch B. Real-time video abstraction. In ACM Transactions On Graphics (TOG) 2006 Jul 30 (Vol. 25, No. 3, pp. 1221-1226). ACM.
[11] Viola P, Jones MJ. Robust real-time face detection. International journal of computer vision. 2004 May 1;57(2):137-54.
[12] Lienhart R, Maydt J. An extended set of haar-like features for rapid object detection. In Image Processing. 2002. Proceedings. 2002 International Conference on 2002 (Vol. 1, pp. I-I). IEEE.
[13] Padilla R, Costa Filho CF, Costa MG. Evaluation of haar cascade classifiers designed for face detection. World Academy of Science, Engineering and Technology. 2012 Apr 22;64:362-5.
[14] https://web.stanford.edu/~hastie/Papers/ESLII.pdf
[15] Porta M. Vision-based user interfaces: methods and applications. International Journal of Human-Computer Studies. 2002 Jul 1;57(1):27-73.
[16] Lin C, Liu J, Cao L. Image matching by affine speed-up robust features. In MIPPR 2011: Pattern Recognition and Computer Vision 2011 Dec 2 (Vol. 8004, p. 80040G). International Society for Optics and Photonics.
[17] Bressler R, Kreisberg RB, Bernard B, Niederhuber JE, Vockley JG, Shmulevich I, Knijnenburg TA. CloudForest: a scalable and efficient random forest implementation for biological data. PloS one. 2015 Dec 17;10(12):e0144820.
[18] Roomi MM, Bhargavi R, Banu TH. Automatic identification of cloud cover regions using SURF. International Journal of Computer Science, Engineering and Information Technology. 2012 Apr;2:159-75.
[19] Arandjelovic R, Gronat P, Torii A, Pajdla T, Sivic J. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2016 (pp. 5297-5307).
[20] Picard D, Gosselin PH. Improving image similarity with vectors of locally aggregated tensors. In Image Processing (ICIP), 2011 18th IEEE International Conference on 2011 Sep 11 (pp. 669-672). IEEE.
[21] Mironică I, Duţă IC, Ionescu B, Sebe N. A modified vector of locally aggregated descriptors approach for fast video classification. Multimedia Tools and Applications. 2016 Aug 1;75(15):9045-72.
[22] Amato G, Bolettieri P, Falchi F, Gennaro C. Large scale image retrieval using vector of locally aggregated descriptors. In International Conference on Similarity Search and Applications 2013 Oct 2 (pp. 245-256). Springer, Berlin, Heidelberg.
[23] Xu Z, Yang Y, Hauptmann AG. A discriminative CNN video representation for event detection. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on 2015 Jun 7 (pp. 1798-1807). IEEE.
[24] Farquhar J, Szedmak S, Meng H, Shawe-Taylor J. Improving “bag-of-keypoints” image categorisation: Generative models and pdf-kernels.
[25] Niebles JC, Fei-Fei L. A hierarchical model of shape and appearance for human action classification. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on 2007 Jun 17 (pp. 1-8). IEEE.
[26] Yang L, Meer P, Foran DJ. Multiple class segmentation using a unified framework over mean-shift patches. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on 2007 Jun 17 (pp. 1-8). IEEE.
[27] Rahman MM, Antani SK, Thoma GR. Biomedical CBIR using “bag of keypoints” in a modified inverted index. In Computer-Based Medical Systems (CBMS), 2011 24th International Symposium on 2011 Jun 27 (pp. 1-6). IEEE.
[28] Battiato S, Guarnera M, Meccio T, Messina G. Red eye detection through bag-of-keypoints classification. In International Conference on Image Analysis and Processing 2009 Sep 8 (pp. 528-537). Springer, Berlin, Heidelberg.
[29] Zhang Y, Jin R, Zhou ZH. Understanding bag-of-words model: a statistical framework. International Journal of Machine Learning and Cybernetics. 2010 Dec 1;1(1-4):43-52.
[30] FACES95:http://cswww.essex.ac.uk/mv/allfaces/faces95.html
[31] FACES96:http://cswww.essex.ac.uk/mv/allfaces/faces96.html
[32] ORL:http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
[33] Pang Y, Li W, Yuan Y, Pan J. Fully affine invariant SURF for image matching. Neurocomputing. 2012 May 15;85:6-10.
