IdentiFace: A Multimodal Face Recognition System That Captures Everything From Emotion To Gender And Its Potential

3 main points
✔️ Development of IdentiFace, a multimodal face recognition system: a highly accurate face recognition system combining biometric features such as gender, face shape, and emotion in addition to face recognition
✔️ Achieving high recognition accuracy: tests on FERET, the authors' dataset, and public datasets show accuracy of up to 99.4% for gender recognition, 88.03% for facial shape recognition, and 66.13% for emotion recognition
✔️ Details on applied datasets and preprocessing: the datasets and preprocessing methods used for the face recognition, gender classification, facial shape determination, and emotion recognition tasks are described.

IdentiFace : A VGG Based Multimodal Facial Biometric System
written by Mahmoud Rabea, Hanya Ahmed, Sohaila Mahmoud, Nourhan Sayed
(Submitted on 2 Jan 2024 (v1), last revised 10 Jan 2024 (this version, v2))
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)


The images used in this article are from the paper, the introductory slides, or were created based on them.


Face recognition systems have contributed greatly to progress in the computer vision field. Multimodal systems that combine multiple biometric characteristics in an efficient and effective manner are now being actively developed.

This paper presents a multimodal face recognition system called IdentiFace. The system combines face recognition with important biometric information such as gender, facial shape, and emotion; it uses a model based on the VGG-16 architecture with minor changes between different subsystems.

The system achieved a gender recognition accuracy of 99.2% on data collected from the FERET database, 99.4% on the authors' dataset, and 95.15% on the public dataset. It also achieved a test accuracy of 88.03% for facial shape recognition using the celebrity face shape dataset and 66.13% for emotion recognition using the FER2013 dataset. These results suggest that the gender recognition task is relatively easy, while the facial shape and emotion recognition tasks are prone to confusion among similar classes.

The "IdentiFace" multimodal facial recognition system has potential applications in security, surveillance, and personal identification. By utilizing facial features, more efficient and accurate biometric authentication is possible.


Here are the datasets used for each task. First, for face recognition, the authors used the Color FERET dataset from NIST. This dataset contains 11,338 face images from 994 individuals. It covers 13 different face orientations, each with an assigned degree of facial rotation. Additionally, some subjects have images with and without glasses, while others have images with a variety of hairstyles. The paper uses a compressed version of these images, with an image size of 256 x 484 pixels. This dataset was chosen for its wide variation, which helps the model generalize. In addition, the four authors themselves were added to the database as new subjects and tested in a variety of scenarios.

Next, for gender classification, the dataset was collected from the authors' faculty. Initially, it consisted of 15 males and 8 females, with multiple images per subject in multiple variations to increase the data size. The number of subjects was later increased to 31 males and 27 females, with 133 male images and 66 female images in total. The data was not split into training and validation sets during collection; the split was made in the preprocessing phase. For comparison purposes, Kaggle's Gender Classification Dataset was also used. It provides approximately 23,000 images per class for training and 5,500 images per class for validation.

Because the task requires complex manual labeling, the authors did not collect their own dataset for facial shape prediction. Instead, they used the Face Shape Dataset, which includes Celebrity Facial Shapes, the most popular facial shape dataset. This dataset was released in 2019, contains only female subjects, and includes 100 images for each of the five classes (round/oval/square/rectangular/heart-shaped).

Finally, for emotion recognition, the authors initially collected their own dataset. It included 38 subjects (22 males and 16 females). Each subject has 7 images, one per emotion, giving 38 images per class and 266 images in total. The images in each class were labeled manually. Some subjects had similar facial expressions for more than one class, which made the labeling and classification process relatively challenging. To obtain a more suitable emotion dataset, the authors therefore used FER-2013. It is publicly available and consists of more than 30,000 images in 7 classes (anger / disgust / fear / happiness / sadness / surprise / neutral). All images are 48x48 grayscale images, and all classes are approximately evenly distributed.


By slightly adjusting the network for each task, this paper builds a single architecture that can adapt to multiple face-related tasks. The VGGNet architecture serves as the primary network for the multimodal system. The authors experimented with the basic VGG-16 and simplified it by keeping only the first three main blocks and removing the last two convolutional blocks. This was done primarily to reduce the number of parameters and the overall complexity of the model, since the model was already performing well on a variety of tasks. The number of layers, output shapes, and parameter counts of the model are shown in the table below.
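To give a feel for how much the truncation saves, here is a rough sketch that counts convolutional parameters for a three-block versus a five-block VGG-16. It assumes the standard VGG-16 layout (3x3 convolutions with "same" padding, channel widths 64/128/256/512/512, 2x2 max pooling after each block) and a single-channel 128x128 input. Dense layers are ignored, and the paper's exact layer configuration may differ, so these figures are illustrative rather than taken from the paper's table.

```python
# Standard VGG-16 block layout: (number of 3x3 conv layers, output channels).
# Assumption: the truncated model keeps the first three blocks unchanged.
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus biases of one convolution layer."""
    return (kernel * kernel * in_channels + 1) * out_channels


def summarize(blocks, input_size=128, in_channels=1):
    """Return (total conv parameters, final feature-map shape)."""
    total, size, channels = 0, input_size, in_channels
    for n_convs, out_channels in blocks:
        for _ in range(n_convs):
            total += conv_params(channels, out_channels)
            channels = out_channels
        size //= 2  # each block ends with 2x2 max pooling
    return total, (size, size, channels)


full_params, full_shape = summarize(VGG16_BLOCKS)
small_params, small_shape = summarize(VGG16_BLOCKS[:3])
print("five blocks :", full_params, full_shape)
print("three blocks:", small_params, small_shape)
```

Under these assumptions, the three-block variant has 1,734,336 convolutional parameters, compared with 14,713,536 for all five blocks.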

When compiling the model, the Adam optimizer is applied with sparse categorical cross-entropy as the loss function. Early stopping is also used to prevent the model from overfitting.
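The two ideas in this step can be sketched in plain Python. Sparse categorical cross-entropy takes integer class labels (rather than one-hot vectors) and averages the negative log-probability of the true class; early stopping halts training once the validation loss stops improving. The `patience` value and the validation losses below are hypothetical, chosen only for illustration; the article does not state the settings used in the paper.

```python
import math


def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability of the true class.

    probs: list of per-class probability lists (softmax outputs).
    labels: list of integer class indices (not one-hot vectors).
    """
    eps = 1e-12  # avoid log(0)
    losses = [-math.log(max(p[y], eps)) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)


class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=5):
        self.patience = patience  # hypothetical default, not from the paper
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses, for illustration only.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print("stopped at epoch", stopped_at)
```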

In addition, the preprocessing for the face recognition task is as follows

  • Detect faces using Dlib's CNN-based face detector
  • Crop the detected faces and convert them to grayscale images
  • Resize to 128x128 pixels
  • Number of classes changed to 5 (Hanya, Mahmoud, Nourhan, Sohaila, and Other)

The following preprocessing is applied to all tasks except face recognition.

  • Detect faces with Dlib, using its 68 facial landmarks
  • All detected faces are cropped, and images without faces are filtered out
  • Faces are resized to 128x128 and converted to grayscale
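The crop, grayscale, and resize steps shared by both pipelines can be sketched without any imaging library. This is only an illustration: the bounding box is assumed to come from a face detector such as Dlib's, images are represented as nested lists of (R, G, B) tuples, and nearest-neighbour resizing stands in for whatever interpolation the authors actually used. A real pipeline would more likely rely on an image library.

```python
def to_grayscale(pixel):
    """ITU-R BT.601 luma from an (R, G, B) tuple."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def crop(image, box):
    """Cut out box = (left, top, right, bottom); right and bottom are exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def resize_nearest(image, size=128):
    """Nearest-neighbour resize of a 2-D list to size x size."""
    height, width = len(image), len(image[0])
    return [
        [image[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]


def preprocess(image, box, size=128):
    """Return a size x size grayscale face, or None if no face was detected."""
    if box is None:
        return None  # images without a detected face are filtered out
    face = crop(image, box)
    gray = [[to_grayscale(p) for p in row] for row in face]
    return resize_nearest(gray, size)


# Hypothetical 300x200 white image and a made-up detection box.
image = [[(255, 255, 255)] * 200 for _ in range(300)]
face = preprocess(image, box=(10, 20, 110, 140))
print(len(face), len(face[0]))
```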

After resizing, each dataset is augmented in the following manner so that it is balanced. To ensure a fair distribution across all classes, augmentation is applied only to the unbalanced and smaller datasets.

The datasets after augmentation are shown below. For face recognition, the Color FERET dataset originally supplied 11,338 images for the "Other" class, but this was reduced to 500 to avoid overfitting. Some datasets, such as the emotion and gender classification datasets, did not require augmentation because they had many images per class and a balanced distribution.
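Reducing the "Other" class from 11,338 to 500 images amounts to downsampling one class while leaving the rest alone. A minimal sketch, assuming random selection (the article does not say how the 500 images were chosen) and using made-up file names and a made-up count for the author class:

```python
import random


def cap_class(samples, label, max_count, seed=0):
    """Randomly keep at most max_count samples of `label`.

    samples: list of (item, label) pairs. Other classes are left untouched.
    """
    rng = random.Random(seed)
    target = [s for s in samples if s[1] == label]
    rest = [s for s in samples if s[1] != label]
    if len(target) > max_count:
        target = rng.sample(target, max_count)
    return rest + target


# Hypothetical file names; only the 11,338 and 500 figures come from the article.
samples = [(f"feret_{i}.png", "Other") for i in range(11338)]
samples += [(f"hanya_{i}.png", "Hanya") for i in range(20)]

balanced = cap_class(samples, label="Other", max_count=500)
print(sum(1 for _, lab in balanced if lab == "Other"))
```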


For face recognition, the dataset is split 80:20 into training and test sets, and the following parameters are used to train the model:

  • Learning rate (lr) = 0.0001
  • Batch size = 32
  • Test size = 0.2
  • Number of epochs = 100
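As a sketch, the settings above can be collected in a small configuration together with an 80:20 split. The split here is stratified per class, which is an assumption; the article only gives the ratio. The function name, seed, and sample data are hypothetical.

```python
import random
from collections import defaultdict

# Values taken from the list above.
TRAIN_CONFIG = {
    "learning_rate": 1e-4,
    "batch_size": 32,
    "test_size": 0.2,
    "epochs": 100,
}


def stratified_split(samples, test_size=0.2, seed=42):
    """Split (item, label) pairs so each class contributes ~test_size to the test set."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    train, test = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_size)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical data: 100 samples of class "a" and 50 of class "b".
data = [(i, "a") for i in range(100)] + [(i, "b") for i in range(50)]
train, test = stratified_split(data, test_size=TRAIN_CONFIG["test_size"])
print(len(train), len(test))
```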

The results are as follows

For gender classification, the task is treated as a classification problem with two classes, with female subjects labeled 0 and male subjects labeled 1. The following parameters are used to train both the model on the authors' dataset and the model on the public dataset:

  • Learning rate (lr) = 0.0001
  • Batch size = 128
  • Test size = 0.2

The results are as follows

The figure below shows the confusion matrix for the authors' dataset.

The figure below shows the confusion matrix for the public dataset.
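For readers unfamiliar with confusion matrices, the sketch below shows how one is built from true and predicted labels and how accuracy follows from its diagonal. The labels are invented for illustration (0 = female, 1 = male) and are not results from the paper.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels, not results from the paper.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 1, 0, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=2)
print(cm, accuracy(cm))
```

Off-diagonal cells show which classes are mistaken for each other, which is how the confusion among similar face shapes or emotions becomes visible.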

For facial shape prediction, two different models are tried: one model for all classes and one model for only three classes (rectangle/square/round). This makes it possible to observe how the model behaves with minimally overlapping classes and to compare it with the model that includes all classes. The following parameters are used in the two models.

  • Learning rate (lr) = 0.0001
  • Batch size = 128
  • Test size = 0.2

The classes are labeled as follows

The results are as follows

Support Vector Machines (SVM) and Convolutional Neural Networks (CNN) were both tried for emotion recognition. The SVM results are as follows

The CNN results are as follows

To visualize the results, the authors built the multimodal facial biometric system "IdentiFace" as a PySide-based desktop application. It can perform facial biometrics both online and offline.


The authors tried different methods for each task, using their own datasets along with publicly available datasets for face recognition, gender classification, facial shape determination, and emotion recognition. They chose the VGGNet model because it performed best on all tasks with these datasets, and combined the best-performing models into a multimodal facial biometric system called IdentiFace. This system integrates face recognition, gender classification, facial shape determination, and emotion recognition into one system.
