Computer Face Recognition
The Bottom-up Stimulus for Computers in Face Recognition
The pixel is the unit of measure for a digital image.
Colors can be represented in hex, RGB, gray-scale intensity and other identification schemes.
| Base | Red | Green | Blue |
|---|---|---|---|
| Decimal | 255 | 0 | 0 |
| Octal (8) | 377 | 0 | 0 |
| Hex (16) | FF | 00 | 00 |
| Binary | 1111 1111 | 0000 0000 | 0000 0000 |
Pixels are assembled in a grid system; each pixel is specified by its position within the grid, identified by integer (x, y) coordinates.
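As a small illustration, here is a Python/NumPy sketch (the image, coordinates and values are made up for the example) showing the same pure-red channel value written in several bases and a pixel being addressed by its integer (x, y) position:

```python
import numpy as np

# The same 8-bit red channel value written in several bases.
print(int("FF", 16), int("377", 8), 0b11111111)   # 255 255 255 -- all the same value

# A digital image is a grid of pixels. With NumPy, images are stored as
# (row, column) arrays, so the pixel at coordinates (x, y) is image[y, x].
image = np.zeros((100, 100, 3), dtype=np.uint8)   # a 100x100 black RGB image
image[20, 10] = (255, 0, 0)                       # set the pixel at (x=10, y=20) to red
print(image[20, 10])                              # [255   0   0]
```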
Face detection is a two-class problem in which we must decide whether or not there is a face in a picture. Here, we introduce four different general approaches:
The first approach uses rule-based methods. They try to capture our knowledge of faces and translate it into a set of rules. It's easy to guess some simple rules: for example, a face usually has two symmetric eyes, and the eye area is darker than the cheeks. Facial features could be the distance between the eyes or the difference in color intensity between the eye area and the lower zone.
The big problem with these methods is the difficulty in building an appropriate set of rules. There could be many false positives if the rules were too general. On the other hand, there could be many false negatives if the rules were too detailed.
The second approach uses algorithms that try to find features of a face that are invariant to its angle or position. The idea is to overcome the limits of our instinctive knowledge of faces.
The third approach is template matching. Template matching methods try to define a face as a function: we try to find a standard template that fits all faces. Different features can be defined independently. For example, a face can be divided into eyes, face contour, nose and mouth. A face model can also be built from edges, or a face can be represented as a silhouette. Other templates use the relation between face regions in terms of brightness and darkness. These standard patterns are compared to the input images to detect faces.
This approach is simple to implement, but it’s inadequate for face detection. It cannot achieve good results with variations in pose, scale and shape.
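As a rough sketch of the idea (not of any particular face detection system), template matching can be tried with OpenCV's `matchTemplate`; the file names and the 0.7 score threshold below are placeholders:

```python
import cv2

# Load a scene and a face template in gray-scale (file names are placeholders).
scene = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("face_template.jpg", cv2.IMREAD_GRAYSCALE)

# Slide the template over the scene and score the similarity at each position.
scores = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_loc = cv2.minMaxLoc(scores)

# Declare a detection only if the best match is strong enough. Because the
# template is a fixed pattern, this breaks down under changes in pose or scale.
if best_score > 0.7:
    h, w = template.shape
    x, y = best_loc
    print(f"Possible face at ({x}, {y})-({x + w}, {y + h}), score {best_score:.2f}")
else:
    print("No face-like region found")
```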
The fourth approach is appearance-based methods: template matching methods whose pattern database is learned from a set of training images.
In general, appearance-based methods rely on techniques from statistical analysis and machine learning to find the relevant characteristics of face images. Some appearance-based methods work within a probabilistic framework: an image or feature vector is a random variable with some probability of belonging to a face or not. Another approach is to define a discriminant function between the face and non-face classes.
TLDR: This is a template matching algorithm based on the Histogram of Oriented Gradients (HOG). Taking a gray-scale image, we represent the entire image with arrows that, at each pixel, point in the direction in which the image is getting darker. Then we check whether the pattern of arrows resembles a face or not.
Taking the gray-scale picture above, we can draw an arrow at the marked pixel showing in which direction the image is getting darker.
If you repeat that process for every single pixel in the image, you end up with every pixel being replaced by an arrow. These arrows are called gradients and they show the flow from light to dark across the entire image:
By considering only the direction in which brightness changes, we reduce the influence of differences in illumination.
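Here is a small NumPy sketch of this step (the sample patch is made up): for each pixel we compute the direction in which brightness decreases, which is what the arrows represent:

```python
import numpy as np

def gradient_arrows(gray):
    """For every pixel, return the angle of the arrow that points from light
    toward dark (the direction in which intensity decreases) and its strength."""
    gray = gray.astype(float)
    gy, gx = np.gradient(gray)       # rate of change along rows (y) and columns (x)
    angle = np.arctan2(-gy, -gx)     # gradients point toward brighter pixels, so flip them
    magnitude = np.hypot(gx, gy)     # how quickly the brightness changes
    return angle, magnitude

# Tiny example: brightness increases from left to right, so the arrows point left.
patch = np.tile(np.arange(5), (5, 1)) * 50
angles, mags = gradient_arrows(patch)
print(np.degrees(angles[2, 2]))      # ±180 degrees: toward the darker left side
```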
Saving the gradient for every single pixel gives us way too much detail; we end up missing the forest for the trees. It would be better if we could just see the basic flow of lightness/darkness at a higher level, so we could see the basic pattern of the image.
The original image is turned into a HOG representation that captures the major features of the image regardless of image brightness.
To find faces in this HOG image, all we have to do is find the part of our image that looks the most similar to a known HOG pattern that was extracted from a bunch of other training faces:
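As a hedged sketch of what this can look like in code: recent versions of scikit-image can compute the HOG representation, and dlib ships a face detector trained on exactly this kind of HOG pattern (the file name is a placeholder and the cell/block parameters are common choices, not requirements):

```python
import dlib
from skimage import io
from skimage.color import rgb2gray
from skimage.feature import hog

image = io.imread("photo.jpg")   # placeholder file name

# Bin the gradients by orientation inside small cells: this keeps the basic
# light-to-dark flow of the image while discarding per-pixel detail.
features, hog_image = hog(rgb2gray(image), orientations=9,
                          pixels_per_cell=(8, 8), cells_per_block=(2, 2),
                          visualize=True)

# dlib's detector slides a learned face HOG pattern over the image.
detector = dlib.get_frontal_face_detector()
for rect in detector(image, 1):
    print("Face found at", rect.left(), rect.top(), rect.right(), rect.bottom())
```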
This feature extraction process can be defined as the procedure of extracting relevant information from a face image. This information must be valuable to the later step of identifying the subject with an acceptable error rate. The feature extraction process must be efficient in terms of computing time and memory usage. The output should also be optimized for the classification step.
The performance of a classifier depends on the number of sample images, the number of features and the complexity of the classifier.

There is a clear rationale for reducing dimensionality: the number of features must be chosen carefully, since too few or redundant features can lead to a loss of accuracy in the recognition system.
Features are extracted from the face images, and then an optimum subset of these features is selected. The aim is to select the subset of extracted features that causes the smallest classification error. A dimension reduction method can be embedded in the selection process.
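One way to sketch this selection step, assuming the extracted features are already stacked into a matrix with one row per face, is scikit-learn's `SelectKBest`; the shapes, labels and the simple F-test score below are illustrative choices:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Illustrative data: 200 face images, 500 extracted features each,
# with an integer subject label per image.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))
y = rng.integers(0, 10, size=200)

# Keep only the 50 features that best discriminate between subjects.
selector = SelectKBest(score_func=f_classif, k=50)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)   # (200, 50)
```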
TLDR: This feature extraction algorithm uses the numerical method of principal component analysis (PCA) to select a representative set of features that, when combined linearly, best describe faces in a low-dimensional space.
If we take small images of faces as the feature vector, without detecting local features, then, for example, a 100x100 image becomes a 10000-dimensional feature vector. Each of these vectors is a point in a 10000-dimensional vector space, and practically any vector in this space corresponds to some 100x100 image.
As these are vectors just like any other vector, we can compute difference vectors, average vectors, sums of vectors, and so on. The difference of two face-vectors is obviously not a face; it is just a difference image.
If we can find a smaller set of basis vectors that can be used to reconstruct the vectors in the dataset, we can represent them in a more compact way. This process of finding a smaller set of basis vectors to represent the data is called dimensionality reduction. For the space of faces this means: we need to find some set of vectors (corresponding to images) that can be used to mix any face image. If we can manage to find such a set, we will have a much more compact representation of faces.
Principal Component Analysis (PCA) is the tool for finding this subspace: a new set of basis vectors that can be used to mix any data vector.
Eigenfaces are the significant eigenvectors of the face dataset. Using the eigenfaces, which are really images, any face image can be mixed using a simple linear combination. This means we can represent any face with weights of the eigenfaces.
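A minimal sketch of computing eigenfaces with scikit-learn's PCA, assuming the faces have already been cropped, flattened and stacked into a matrix (the random data and the choice of K = 50 are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 400 face images of 100x100 pixels, flattened so that
# each face is one 10000-dimensional row vector.
faces = np.random.rand(400, 100 * 100)

# Keep the K eigenvectors with the largest eigenvalues: the eigenfaces.
K = 50
pca = PCA(n_components=K).fit(faces)

average_face = pca.mean_.reshape(100, 100)          # the average face image
eigenfaces = pca.components_.reshape(K, 100, 100)   # one eigenface per row
print(eigenfaces.shape)                             # (50, 100, 100)
```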
While some eigenfaces seem to correspond to real features such as a beard or glasses, in general eigenfaces are not faces. They are just difference images used to recreate any face in the dataset.
The first image is the average face; the others are the eigenfaces.
If the eigenfaces are chosen properly, e.g. we include all eigenfaces with the largest eigenvalues, any face in the dataset can be reconstructed by a linear combination of the eigenfaces. To come up with these weights for a real face image, we simply need to multiply this 10000-dimensional input vector by a rotation matrix and drop all components with indices greater than K (the number of features we want to keep).
x is an N-dimensional feature vector, x_avg is the average face, v1, v2, ..., vK are the eigenfaces, and a1, a2, ..., aK are the corresponding weights.

x = x_avg + a1*v1 + a2*v2 + ... + aK*vK
The eigenfaces need to be computed only once, and then all faces in the database can be represented by a set of weights. This is a much more compact representation.
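Continuing the same illustrative setup, projecting a face onto the eigenfaces yields its weights, and the weights plus the average face rebuild the image:

```python
import numpy as np
from sklearn.decomposition import PCA

# Same illustrative setup as above: 400 flattened 100x100 faces, K = 50 eigenfaces.
faces = np.random.rand(400, 100 * 100)
pca = PCA(n_components=50).fit(faces)

# Represent one face by its 50 eigenface weights instead of 10000 pixel values.
face = faces[0]
weights = pca.transform(face.reshape(1, -1))[0]      # a1, a2, ..., aK
print(weights.shape)                                 # (50,)

# Rebuild the face from the weights: x ~ x_avg + a1*v1 + a2*v2 + ... + aK*vK
reconstruction = pca.mean_ + weights @ pca.components_
print(reconstruction.shape)                          # (10000,), close to the original face
```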
Then, to verify whether there is a face in a new image, we simply need to compute the reconstruction with the eigenfaces and check whether the reconstruction matches the original image. If it does, we are likely dealing with a face image.
To identify the face, we need to find the closest weight vector in the database and check whether it is closer than some distance threshold.
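Putting both steps together, here is an illustrative sketch (the database, names and thresholds are hypothetical) of detection by reconstruction error and identification by nearest weight vector:

```python
import numpy as np

# Hypothetical database: one eigenface weight vector per known face.
database = {"alice": np.random.rand(50), "bob": np.random.rand(50)}

def looks_like_a_face(x, pca, error_threshold):
    """Detection: if the eigenface reconstruction is close to the original
    image, the image probably lies in the 'face space'."""
    w = pca.transform(x.reshape(1, -1))[0]
    reconstruction = pca.mean_ + w @ pca.components_
    return np.linalg.norm(x - reconstruction) < error_threshold

def identify(x, pca, distance_threshold):
    """Identification: find the closest stored weight vector and accept the
    match only if it is closer than the distance threshold."""
    w = pca.transform(x.reshape(1, -1))[0]
    name, dist = min(((n, np.linalg.norm(w - db_w)) for n, db_w in database.items()),
                     key=lambda item: item[1])
    return name if dist < distance_threshold else "unknown"
```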
Appearance-based face recognition algorithms use a wide variety of classification methods. Sometimes two or more classifiers are combined to achieve better results. On the other hand, most model-based algorithms match the samples with the model or template.
Then, a learning method can be used to improve the algorithm: supervised, unsupervised or semi-supervised. Unsupervised learning is the most difficult approach, as there are no tagged examples. However, many face recognition applications include a tagged set of subjects. Consequently, most face recognition systems implement supervised learning methods. There are also cases where the labeled data set is small, and the acquisition of new tagged samples can be infeasible; in such cases, semi-supervised learning is required.
One way or another, classifiers have a big impact on face recognition. Classification methods are also widely used in many areas such as data mining, finance, signal decoding, voice recognition, natural language processing and medicine. Two concepts are key in building a classifier: similarity and probability.
For example, a probability-based classifier assigns an image Z, represented in a reduced PCA space, to the face class wi with the highest probability of matching Z.
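To make the two ideas concrete, here is a small sketch (with made-up PCA weight vectors and labels) of a similarity-based classifier, nearest neighbor, and a probability-based classifier, Gaussian naive Bayes, from scikit-learn:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Illustrative training data: PCA weight vectors with a subject label per image.
rng = np.random.default_rng(0)
Z_train = rng.normal(size=(300, 50))
labels = rng.integers(0, 10, size=300)

# Similarity: assign the class of the most similar (nearest) training sample.
knn = KNeighborsClassifier(n_neighbors=1).fit(Z_train, labels)

# Probability: pick the class with the highest estimated probability.
gnb = GaussianNB().fit(Z_train, labels)

Z_new = rng.normal(size=(1, 50))
print(knn.predict(Z_new), gnb.predict(Z_new), gnb.predict_proba(Z_new).max())
```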
Combining the face detection and feature extraction ideas above, we see a wide variety of comprehensive face recognition algorithms applied in computer vision. We can summarize them into three broad categories: template matching, statistical approaches and neural networks. The examples above correspond to ideas within the template matching and statistical categories. Here, we want to show an example of how a neural network face recognition algorithm works.
The measurements that seem obvious to us humans (like eye color) don’t really make sense to a computer looking at individual pixels in an image. Researchers have discovered that the most accurate approach is to let the computer figure out the measurements to collect itself. Deep learning does a better job than humans at figuring out which parts of a face are important to measure.
The solution is to train a Deep Convolutional Neural Network to generate a number of different measurements for each face.
The training process works by looking at 3 face images at a time:

1. A training face image of a known person
2. Another picture of the same known person
3. A picture of a totally different person
The algorithm then looks at the measurements it is currently generating for each of those three images, and tweaks the neural network slightly so that the measurements it generates for #1 and #2 become slightly closer while the measurements for #2 and #3 become slightly further apart.
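The quantity being minimized in this tweaking step is commonly called a triplet loss; here is a toy NumPy version (the margin value and the random embeddings are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Embeddings of the same person (images #1 and #2) should end up closer
    together than embeddings of different people (here #1 and #3), by at least
    `margin`. Training nudges the network's weights to reduce this loss."""
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    return max(pos_dist - neg_dist + margin, 0.0)

# Toy 128-number embeddings; the real ones come from the trained network.
a, p, n = np.random.rand(3, 128)
print(triplet_loss(a, p, n))
```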
Machine learning people call these measurements of each face an embedding. The idea of reducing complicated raw data like a picture into a list of computer-generated numbers comes up a lot in machine learning (especially in language translation).
This process of training a convolutional neural network to output face embeddings requires a lot of data and computer power. With days of continuous training to get good accuracy, we will have a trained network that can generate measurements for any face, even ones it has never seen before! The network should generate nearly the same numbers when looking at two different pictures of the same person.
This last step is actually the easiest step in the whole process. All we have to do is find the person in our database of known people who has the closest measurements to our test image. You can do that with any basic machine learning classifier (e.g. a nearest-neighbor classifier, or any of the classifiers mentioned above).
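For example, with made-up embeddings and names, a one-nearest-neighbor classifier from scikit-learn is enough for this matching step:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical database of known people: one 128-number embedding per photo.
known_embeddings = np.random.rand(20, 128)
known_names = ["person_%d" % (i % 5) for i in range(20)]

# A basic nearest-neighbor classifier is enough for the final matching step.
clf = KNeighborsClassifier(n_neighbors=1).fit(known_embeddings, known_names)

test_embedding = np.random.rand(1, 128)   # from the same embedding network
print(clf.predict(test_embedding))        # the name of the closest known person
```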