Designing An AI-enabled Facial Recognition System For Retail

This blog is a transcription of the webinar - Designing an AI-enabled facial recognition system for retail. To watch the recorded version of the webinar, scroll down to the end of the blog.

Facial recognition is a hot topic these days. Forbes magazine and USA TODAY had recent stories about the alleged misuse of facial recognition by Facebook and Cambridge Analytica. We won't be tackling the social, legal or customer opt-in or opt-out considerations of facial recognition in this blog. Instead, we will provide a foundation in the technology and describe the mathematical and computational models and processes behind facial recognition. This blog does not provide hands-on information but high-level guidance on the technology, the processes involved in developing a facial recognition system and its practical applications in a retail or consumer-focused business.

AI-enabled facial recognition is vital for the retail industry as it can help retailers understand their customers better and deliver improved customer experiences. Recent advances in AI have significantly improved the accuracy of facial recognition, and in this blog we help you understand the design of a deep learning based facial recognition system. Deep learning is a branch of AI that has recently created significant breakthroughs in the fields of computer vision, speech and natural language processing.

Before we go into details, let us understand what machine learning is. Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed: they learn from data or experience and then make predictions on unseen input data. A machine learning model produces an output for a set of inputs, which is then compared with the desired output. Any error is fed back to adjust the parameters of the model so that its output moves closer to the desired output, or target. Machine learning is used in applications where no explicit formula relates the inputs to the output. The broader objective of AI is to build machines with human-level intelligence, though this remains unachieved.

Deep learning is based on the artificial neural network, a machine learning technique that takes inspiration from the structure and function of the human brain. It uses a large number of processing layers and large datasets to improve prediction accuracy. The concept of deep learning is not new, and data scientists have long striven to enhance prediction accuracy by adding more layers. With the confluence of several factors, including the availability of large datasets, powerful hardware such as GPUs, and better algorithms and architectures, model accuracy has improved over time.

Let’s talk about machine learning applications in a retail scenario. We can use face recognition to improve in-store personalization and provide a one-to-one personalized shopping experience. We can understand customers' buying patterns, or we could even use it in online retail to help customers perform a visual product search, provide product recommendations or identify high-traction blogs and articles. Retailers are now closer to their customers than ever, to the extent that they can identify customers' needs before the customers realize them themselves. In the olden times, the moment you walked into your local bakery, the owner would recognize you, greet you and tag your favorite pastry. Today, retailers are trying to recreate such shopping experiences with facial recognition.

Humans recognize faces through experience: the more faces we see, the faster we recognize them. A face recognition system, when fed an image or a video scene, identifies a person by matching the face against a database of facial images. The process of recognizing faces from a large image set is complicated and cannot be modeled using hand-written mathematical or empirical rules. The challenges involved in facial recognition include:

  • Information redundancy: A 100 × 100 grayscale facial image has 10,000 pixels, each taking one of 256 intensity values, which gives $256^{10000} = 2^{80000}$ possible combinations of intensity values.
  • Intrapersonal variations: Two pictures of the same person will differ from each other. The differences could be variations in pose, partial obstructions, changes in facial expression or even temporal changes like aging.
  • Interpersonal variations: Images of two different persons may look similar.
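
To get a feel for the size of that number, the arithmetic can be checked in a few lines of Python. This is only a sanity check of the figures quoted above, assuming an 8-bit grayscale image; the variable names are ours.

```python
import math

# Assumption: an 8-bit grayscale image, so each pixel has 256 possible values.
height, width = 100, 100
levels = 256

num_pixels = height * width          # 10,000 pixels
num_images = levels ** num_pixels    # 256^10000 distinct images

# 256 = 2^8, so 256^10000 = 2^(8 * 10000) = 2^80000
assert num_images == 2 ** 80000

# Number of decimal digits, computed with logarithms rather than
# by converting the huge integer to a string.
decimal_digits = math.floor(num_pixels * math.log10(levels)) + 1
print(decimal_digits)
```

Only a vanishingly small fraction of these combinations look like faces, which is why a recognition system has to learn compact features instead of working on raw pixels.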

Facial Recognition Approaches

The problems in facial authentication have been studied in detail and researchers have proposed many methods to improve its accuracy rate.

Classical Approach

This approach involves handcrafting features using domain knowledge of the data, and then classifying those features with a machine learning algorithm. It works well for small datasets but fails for larger ones. Additionally, it does not cope well with variations in pose, illumination or occlusion.

Modern Approach

In this approach, the neural network finds the features itself. This works on large datasets and is far more robust to variations in pose, illumination, occlusion, etc. Facebook’s DeepFace and Google’s FaceNet use this approach.

This is a high-level block diagram of a face recognition system.

Face detection, landmark detection and face alignment are the stages of the pre-processing step. In the face detection stage, the system detects whether there is a face in the image. If there is, the facial landmarks are located and face alignment is carried out. In the face recognition phase, the system then applies deep learning techniques to the pre-processed image to recognize who the person is.

Face Detection Using Histogram of Oriented Gradients (HOG)

The Histogram of Oriented Gradients (HOG) method describes the visual content of an image through its local gradients. It measures the image gradient, or intensity change, in localized portions of the image to extract features about edges and shapes. The HOG features are then classified with a Support Vector Machine (SVM) classifier to detect faces.
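
As a rough illustration of the gradient-histogram idea, here is a minimal NumPy sketch for a single image cell. The function name `hog_cell_histogram`, the 8 × 8 cell size and the nine orientation bins are our assumptions for illustration. A full HOG detector also normalizes the histograms over blocks of cells and passes them to an SVM, which is not shown here.

```python
import numpy as np

def hog_cell_histogram(cell, n_bins=9):
    """Histogram of unsigned gradient orientations (0 to 180 degrees)
    for one cell, weighted by gradient magnitude."""
    cell = cell.astype(float)
    # Intensity change along rows (gy) and along columns (gx).
    gy, gx = np.gradient(cell)
    magnitude = np.hypot(gx, gy)
    orientation = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(
        orientation, bins=n_bins, range=(0.0, 180.0), weights=magnitude
    )
    return hist

# Toy 8x8 cell with a vertical edge: dark on the left, bright on the right.
cell = np.zeros((8, 8))
cell[:, 4:] = 255.0

hist = hog_cell_histogram(cell)
print(hist)  # all of the weight falls into the first (0 degree) bin
```

A vertical edge produces a purely horizontal gradient, so the histogram has a single dominant bin. Cells covering eyes, nose or jawline produce different characteristic histograms, and that is what the classifier learns to separate.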

After the system has extracted the face from the larger image, the face is aligned using landmark detection. The detected landmarks are compared with the mean landmarks of a reference image, and the face is aligned using an affine transformation. Even if the subject’s face is tilted, it becomes well aligned after the affine transformation. An affine transformation is a linear mapping combined with a translation; it preserves points, straight lines and planes, and keeps parallel lines parallel. The image produced by the affine transformation is then used for facial recognition with deep learning.
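
The alignment step can be sketched with a least-squares fit. In this minimal example, the landmark coordinates, the 20-degree tilt and the helper names `estimate_affine` and `apply_affine` are all hypothetical; a production system would typically use many more landmarks and then warp the whole image with the estimated matrix.

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares 2x3 affine matrix A such that dst ~= A @ [x, y, 1]."""
    src_h = np.hstack([src, np.ones((len(src), 1))])         # (n, 3)
    solution, *_ = np.linalg.lstsq(src_h, dst, rcond=None)   # (3, 2)
    return solution.T                                        # (2, 3)

def apply_affine(A, points):
    """Apply a 2x3 affine matrix to an (n, 2) array of points."""
    points_h = np.hstack([points, np.ones((len(points), 1))])
    return points_h @ A.T

# Hypothetical mean landmarks on the reference image:
# left eye, right eye, nose tip.
reference = np.array([[30.0, 40.0], [70.0, 40.0], [50.0, 65.0]])

# Simulate a tilted face: rotate the reference by 20 degrees and shift it.
theta = np.deg2rad(20.0)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
detected = reference @ rotation.T + np.array([5.0, -3.0])

A = estimate_affine(detected, reference)
aligned = apply_affine(A, detected)
print(np.round(aligned, 3))  # matches the reference landmarks
```

Three non-collinear landmarks determine the six affine parameters exactly. With more landmarks, the same least-squares call returns the best overall fit.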

Deep learning based facial recognition involves two basic steps:

1. Facial Learning

Suppose we have a database of 1 million images of 1,000 users. A neural network with a deep learning architecture uses these images and their labels to learn features specific to each face. These features are then stored as embedding vectors, each representing the face of a user.

2. Facial Matching

When a new input image is fed to the system, it extracts features from the image and compares them with the learned feature vectors to measure similarity. Similarity may be measured with cosine similarity or Euclidean distance, or learned with a Siamese network. The output decides whether there is a match or a mismatch.
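
The matching step can be sketched as a nearest-neighbor search over stored embeddings. The four-dimensional vectors, the user names, the 0.8 threshold and the function `match_face` are made up for illustration; real embeddings typically have a hundred or more dimensions and the threshold is tuned on validation data.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (0.0 means identical)."""
    return float(np.linalg.norm(a - b))

def match_face(query, gallery, threshold=0.8):
    """Return (best_user, score), or (None, score) when nothing is close enough."""
    best_user, best_score = None, -1.0
    for user, embedding in gallery.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_user, best_score = user, score
    if best_score < threshold:
        return None, best_score
    return best_user, best_score

# Hypothetical learned embeddings, one per enrolled user.
gallery = {
    "user_a": np.array([0.9, 0.1, 0.0, 0.4]),
    "user_b": np.array([0.0, 0.8, 0.6, 0.1]),
}

query = np.array([0.85, 0.15, 0.05, 0.35])    # a new image of user_a
stranger = np.array([-0.5, 0.1, -0.7, 0.2])   # someone not enrolled

print(match_face(query, gallery))
print(match_face(stranger, gallery))
```

Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to both. When embeddings are normalized to unit length, the two measures rank candidates identically.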

Convolutional Neural Network

The Convolutional Neural Network (CNN) is the most widely used deep learning architecture in computer vision. This is because it:

  • Is robust to shifts and distortions in the image
  • Requires less memory, as the same filter coefficients are used across different locations in the image
  • Tolerates different poses, partial obstructions, and horizontal or vertical shifts
  • Has been proven to work well in vision, speech and natural language processing

It is made up of convolutional layers, non-linear activation function layers, pooling layers and fully connected layers. The function of a pooling layer is to reduce the spatial dimension of the feature maps, and the output of the final pooling layer is fed to a fully connected neural network.

How Does Learning Happen In a Neural Network?

The objective of training a neural network is to adjust its parameters so that the output for each training sample moves closer to the desired result. We measure how far the output is from the desired result with a cost function. In other words, the cost is the error, and it needs to be minimized as far as possible.


$$\text{Total cost} = \sum_{i=1}^{n} \text{cost}(i)$$

The filter parameters in the convolution layers and the synaptic weights in the fully connected layers are the parameters adjusted to minimize the cost function. Learning based on Stochastic Gradient Descent (SGD) is widely used for training CNNs. Compared with traditional methods, it enables faster training, can give better prediction accuracy and is more efficient on large datasets.
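
To make the idea of minimizing a cost with SGD concrete, here is a minimal sketch on a much simpler model than a CNN: a straight line with one weight and one bias. The data, the squared-error cost, the learning rate and the number of epochs are all assumptions chosen for illustration. A CNN is trained on the same principle, only with many more parameters and with gradients obtained by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: the desired output is y = 3x + 0.5.
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 0.5

w, b = 0.0, 0.0        # model parameters, initialized to zero
learning_rate = 0.1

for epoch in range(50):
    # "Stochastic": visit the samples one at a time, in random order.
    for i in rng.permutation(len(x)):
        error = (w * x[i] + b) - y[i]
        # cost(i) = 0.5 * error**2, so d(cost)/dw = error * x[i]
        # and d(cost)/db = error. Step against the gradient.
        w -= learning_rate * error * x[i]
        b -= learning_rate * error

# Total cost is the sum of the per-sample costs.
total_cost = 0.5 * np.sum((w * x + b - y) ** 2)
print(round(w, 3), round(b, 3), total_cost)
```

Each update uses only one sample, so the parameters start improving long before the whole dataset has been seen. That property is what makes SGD practical on large image collections.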

Let’s move on to an example to demystify the working of a CNN.

Convolution Layer

Suppose we input a 5 × 5 image, which is convolved with a 3 × 3 filter matrix. Each element of the convolved output, or feature map, is the dot product (element-wise multiplication followed by summation) of a 3 × 3 patch of the input image and the filter. In simple terms, imagine looking at an object through a small window. When you move that window in different directions, you get different views of the object. Likewise, when you slide a filter over an input image, you get a feature map: a set of responses, each specific to the area the filter covered.
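
The sliding-window computation can be written out directly. The pixel and filter values below are our own illustrative numbers, not the ones from the webinar slide, and the function name `convolve2d_valid` is hypothetical. As is usual in CNNs, the filter is applied without flipping.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image and take the sum of element-wise
    products at each position (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(window * kernel)
    return out

image = np.array([
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
])
kernel = np.array([
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
])

feature_map = convolve2d_valid(image, kernel)
print(feature_map)
# [[4. 3. 4.]
#  [2. 4. 3.]
#  [2. 3. 4.]]
```

A 5 × 5 input and a 3 × 3 filter give a 3 × 3 feature map, because the filter fits into three positions along each axis.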


ReLU Activation Function

ReLU stands for rectified linear unit. As most real-world data is non-linear in nature, ReLU introduces non-linearity into the CNN. It returns zero for negative values in its input, so the corresponding neuron is not activated, and it returns the input value unchanged if the input is greater than zero. Thus the rectified feature map has only non-negative values, as shown in the figure below.
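
In code, the rectification is a one-liner. The feature map values here are arbitrary numbers chosen to include negatives.

```python
import numpy as np

def relu(x):
    """Rectified linear unit: negative values become 0, the rest pass through."""
    return np.maximum(x, 0)

# Arbitrary example feature map containing negative responses.
feature_map = np.array([
    [-3,  5,  0],
    [ 2, -1,  4],
    [-7,  6, -2],
])

rectified = relu(feature_map)
print(rectified)
# [[0 5 0]
#  [2 0 4]
#  [0 6 0]]
```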

Max Pooling

In this layer, the spatial size of the representation is progressively reduced, which reduces the number of parameters and the amount of computation in the deep learning architecture. There are different ways of pooling.

  • Average pooling: The input is divided into smaller pooling regions and the average of the values in each region is computed.
  • Max pooling: The abstracted form of the representation is achieved by dividing the input into smaller pooling regions and taking the maximum value in each region. In the example below, if we take the 2×2 region containing 5, 11, 0 and 4, the output element is the maximum of that region, i.e. 11.
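
Both pooling variants can be sketched with a single reshape. The top-left 2×2 region reuses the values 5, 11, 0 and 4 from the example above; the remaining values and the function name `pool2d` are made up for illustration.

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping pooling over size x size regions."""
    h, w = x.shape
    # Drop any rows or columns that do not fill a complete region.
    x = x[:h - h % size, :w - w % size]
    blocks = x.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

feature_map = np.array([
    [5, 11, 1, 2],
    [0,  4, 3, 8],
    [6,  2, 7, 1],
    [1,  9, 0, 5],
])

max_pooled = pool2d(feature_map, size=2, mode="max")
avg_pooled = pool2d(feature_map, size=2, mode="average")

print(max_pooled)
# [[11  8]
#  [ 9  7]]
print(avg_pooled)
# [[5.   3.5 ]
#  [4.5  3.25]]
```

The 4 × 4 input shrinks to 2 × 2 in both cases, a fourfold reduction in the number of values passed to the next layer.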

In the same manner, if we take a real image and pass it through a filter we get a convolved output. This is then passed through a rectified linear unit and pooling is performed over each map to get an output image as shown in the figure.

A deep architecture is formed by stacking together a number of CNN building blocks. The deep learning procedure involves initializing the filters in the convolution layers randomly and letting the network learn the most important parameters automatically.


Through backpropagation using SGD, the network is trained end-to-end, over all of its global and local parameters, to recognize a subject’s face correctly. There is a natural progression from low-level to high-level structure as the image passes through the successive convolution layers. In the deeper convolution layers, the filters compute dot products with the outputs of the previous layers, grouping pixels into edges and edges into larger structures. Thus, the deep learning model performs hierarchical learning, combining the multistage outputs to detect edges and higher-level features more reliably. The deep learning architecture represents the face as a feature vector in an N × N matrix.

Facial Matching

  • Scenario 1: New input image (an image that was not used for training the neural network)
  • Scenario 2: User wearing sunglasses
  • Scenario 3: User wearing a scarf
  • Scenario 4: User wearing a scarf with a partially covered face

Even with variations in pose and illumination, the deep learning model correctly identifies the face in all four scenarios.

Integration of Facial Recognition Model With a Clienteling Software

Customers being the new market-makers, the success of retail stores depends on how rapidly retailers respond to their customers' needs. To win in this customer age, retailers need to move from their traditional retailing software and adopt clienteling software coupled with face recognition. This new generation of clienteling will help them identify their premium customers quickly, transform from an information source to points of engagement, and deliver the right product to their customers with a customized shopping experience.

Use Cases of Facial Recognition

  • Clienteling: one-to-one personalized shopping experience
  • In-store traffic analytics
  • Dwell time at an aisle
  • Visualize customer path in-store
  • Emotion recognition at point of sale
  • Order online and pick from store
  • Payment and check out through face verification

Watch the recorded session of the webinar

This webinar was part of the Applexus Experts Series delivered in partnership with SAP. Following this session, a series of webinars is lined up for the coming months, including Smart Supply Chain, Secrets of Omnichannel Excellence, Blockchain and Big Data – S/4 Data Migration. For more information, visit

About The Author

Thomas Koickal
Chief Technologist - Artificial Intelligence & Cognitive Computing
Dr. Thomas leads and drives AI product engineering efforts, technology strategy and machine learning innovation culture. For over 20 years, he has researched and worked on predictive modeling, machine learning, deep learning, neuromorphic computing and brain-inspired computing architectures. He is one of the longest-serving practitioners of machine learning and has developed AI applications for healthcare, technology, financial, and space systems. He has delivered several talks on AI and ML technologies and his works have been published in leading IEEE journals and conferences.
