Manually monitoring people entering institutions such as cafeterias, restaurants, and schools is tedious and requires a dedicated workforce. In this tutorial, we will learn how to build a face mask classifier using deep learning to automatically detect people who are not wearing masks, so that their entry into such spaces can be prevented.
Introduction and Motivation
The CDC continues to monitor the spread of COVID-19 and advises both fully vaccinated people and those who are not fully vaccinated to wear face masks. When visiting doctor’s offices, hospitals, or long-term care facilities, the CDC recommends wearing a mask and keeping a safe distance.
This tutorial will walk you through the process of developing a deep learning model that automatically detects a person’s face in an image loaded from a file path. We will then use our trained model to detect whether the person is violating public health rules by not wearing a mask.
Glossary
Deep Learning: A machine learning technique that learns from data using neural networks modeled loosely on the human brain.
Prerequisites
- Programming knowledge in Python.
- Basic knowledge of Jupyter Notebook, Deep Learning, and Keras.
Creating the deep learning face mask classifier model
First, we will build a deep learning model to detect whether a person is violating the rules by not wearing a mask in public spaces. All the steps discussed below are also available in a notebook here.
Step 1: Installing and importing the necessary Python libraries
Clone this repository and install the libraries by using the command:
pip install -r requirements.txt
Note: At the time of writing, there is no stable tensorflow release for Python versions >= 3.9. If that applies to you, install the nightly build with the command:
pip install tf-nightly
Now, we will import the necessary libraries in Python.
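The original import cell is not reproduced here, so the following is a minimal sketch of the imports the rest of this tutorial assumes: OpenCV for face detection, TensorFlow/Keras for the classifier, and NumPy and Matplotlib as utilities.
# Assumed imports for the steps that follow.
import cv2
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.applications import VGG19
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.image import ImageDataGenerator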
Step 2: Getting the training data
For the training data, we are using the face mask detection dataset from here. The dataset contains 12,000 images divided into Train, Validation, and Test sets, scraped from Google and from the CelebFace dataset created by Jessica Li. To start using it, download the dataset and save it in your working directory.

We can see that the with-mask training images show people wearing masks of different types and patterns, including many worst-case scenarios. Similarly, the no-mask training images contain faces in varied lighting conditions, with and without beards, which will help the model handle different scenarios when detecting faces without masks.

Step 3: Reading a sample image and performing face detection
We will now read in a sample image of a public space and perform face detection using a Haar cascade classifier.
The Haar cascade classifier, originally known as the Viola-Jones face detection technique, is an object detection algorithm for detecting faces in images or real-time video. Viola and Jones proposed edge and line detection features in their research paper “Rapid Object Detection using a Boosted Cascade of Simple Features,” published in 2001. The algorithm is trained on a large number of positive images containing faces and a large number of negative images containing no faces. The pre-trained model resulting from this training can be found in the OpenCV GitHub repository.[1]
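Since the original code cell is not reproduced here, below is a minimal sketch of this detection step; the image path is a hypothetical placeholder.
# Load OpenCV's bundled pre-trained frontal face haar cascade.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

# Read the sample image (hypothetical path) and convert it to grayscale,
# since the cascade classifier operates on single-channel images.
img = cv2.imread("images/sample_image.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect faces; returns a list of (x, y, width, height) bounding boxes.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

# Draw a rectangle around each detected face.
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)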

Note: If you want to try using a different image, change the image path in the above code as follows:
img = cv2.imread("<path to new image file>")
Step 4: Data preprocessing for building the face mask classifier in Keras
Keras is a deep learning API written in Python, running on top of the machine learning platform TensorFlow. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result as fast as possible is key to doing good research.[2]
We will now pass our datasets into Keras’s ImageDataGenerator() to perform preliminary preprocessing steps such as rescaling.
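The exact arguments are not shown in the original, so the following sketch makes some assumptions: pixel values are rescaled from [0, 255] to [0, 1], and the directory paths, image size, and batch size are illustrative placeholders.
# One generator is used for each dataset split; rescale normalizes pixels.
datagen = ImageDataGenerator(rescale=1.0 / 255)

train_generator = datagen.flow_from_directory(
    "Dataset/Train", target_size=(128, 128),
    batch_size=32, class_mode="binary")
val_generator = datagen.flow_from_directory(
    "Dataset/Validation", target_size=(128, 128),
    batch_size=32, class_mode="binary")
test_generator = datagen.flow_from_directory(
    "Dataset/Test", target_size=(128, 128),
    batch_size=32, class_mode="binary")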
Step 5: Creating the face mask classifier transfer learning model using Keras
We are building the deep learning classifier using the VGG19 transfer learning model. VGG19 is a successor of AlexNet and a variant of the VGG model, named after the Visual Geometry Group at Oxford which created it. It is a deep CNN used to classify images, with 19 weight layers: 16 convolutional layers and 3 fully connected layers, along with 5 max-pooling layers and a softmax output layer.[4]
It has been trained on ImageNet, an image database with 14,197,122 images organized according to the WordNet hierarchy.
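A sketch of how such a transfer learning model could be assembled in Keras is shown below; the input size and classification head are assumptions, since the original model-building cell is not shown. The VGG19 convolutional base is loaded with ImageNet weights and frozen, and only a small head is trained on top.
# Load the VGG19 convolutional base, pre-trained on ImageNet.
base_model = VGG19(weights="imagenet", include_top=False,
                   input_shape=(128, 128, 3))

# Freeze the pre-trained layers so only the new head is trained.
for layer in base_model.layers:
    layer.trainable = False

# Add a small binary classification head: mask vs. no mask.
x = Flatten()(base_model.output)
output = Dense(1, activation="sigmoid")(x)

model = Model(inputs=base_model.input, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])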


Step 6: Training the model
We will now train our neural network model for 20 epochs, using the validation dataset to monitor the model’s performance during training.
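The training call is not shown in the original; assuming the generators sketched above, it could look like this:
# Train for 20 epochs, monitoring performance on the validation split.
history = model.fit(train_generator, epochs=20,
                    validation_data=val_generator)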
Epoch 1/20
9/9 [==============================] - 31s 3s/step - loss: 0.6622 - accuracy: 0.6632
Epoch 2/20
9/9 [==============================] - 30s 3s/step - loss: 0.2757 - accuracy: 0.9028
Epoch 3/20
9/9 [==============================] - 31s 3s/step - loss: 0.1645 - accuracy: 0.9444
Epoch 4/20
9/9 [==============================] - 30s 3s/step - loss: 0.1694 - accuracy: 0.9410
Epoch 5/20
9/9 [==============================] - 30s 3s/step - loss: 0.0984 - accuracy: 0.9669
Epoch 6/20
9/9 [==============================] - 31s 3s/step - loss: 0.1003 - accuracy: 0.9688
Epoch 7/20
9/9 [==============================] - 32s 3s/step - loss: 0.1194 - accuracy: 0.9444
Epoch 8/20
9/9 [==============================] - 30s 3s/step - loss: 0.0736 - accuracy: 0.9792
Epoch 9/20
9/9 [==============================] - 31s 3s/step - loss: 0.0519 - accuracy: 0.9965
Epoch 10/20
9/9 [==============================] - 31s 3s/step - loss: 0.0663 - accuracy: 0.9722
Epoch 11/20
9/9 [==============================] - 33s 4s/step - loss: 0.0799 - accuracy: 0.9653
Epoch 12/20
9/9 [==============================] - 30s 3s/step - loss: 0.0680 - accuracy: 0.9688
Epoch 13/20
9/9 [==============================] - 29s 3s/step - loss: 0.0727 - accuracy: 0.9792
Epoch 14/20
9/9 [==============================] - 28s 3s/step - loss: 0.0647 - accuracy: 0.9757
Epoch 15/20
9/9 [==============================] - 31s 3s/step - loss: 0.0680 - accuracy: 0.9826
Epoch 16/20
9/9 [==============================] - 29s 3s/step - loss: 0.0875 - accuracy: 0.9669
Epoch 17/20
9/9 [==============================] - 30s 3s/step - loss: 0.0500 - accuracy: 0.9931
Epoch 18/20
9/9 [==============================] - 30s 3s/step - loss: 0.0553 - accuracy: 0.9861
Epoch 19/20
9/9 [==============================] - 30s 3s/step - loss: 0.0504 - accuracy: 0.9792
Epoch 20/20
9/9 [==============================] - 30s 3s/step - loss: 0.0484 - accuracy: 0.9861
Step 7: Evaluating the model performance on the test set
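The evaluation cell itself is not reproduced in the original; a minimal sketch, assuming the test generator defined earlier, would be:
# Evaluate the trained model on the held-out test split.
loss, accuracy = model.evaluate(test_generator)
print(f"Model has a loss of {loss:.2f} and accuracy {accuracy * 100:.2f}%")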
25/25 [==============================] - 78s 3s/step - loss: 0.0544 - accuracy: 0.9825
Model has a loss of 0.05 and accuracy 98.25%
Step 8: Saving the trained model
We can also choose to save the trained model as an HDF5 (.h5) file for future use.
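A one-line sketch of this step (the filename is a hypothetical placeholder):
# Persist the model, architecture and weights included, to a single file.
model.save("face_mask_classifier.h5")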
Step 9: Testing the face mask classifier model on the sample image
Lastly, we will test the trained model on our use case: detecting faces and masks for a group of people. We crop each face detected in the image and then use the trained model to predict whether it is wearing a mask.
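Putting the pieces together, a sketch of this final step could look as follows; the label ordering, colors, and image path are assumptions rather than the author’s exact code.
# Map model outputs to labels; the 0/1 ordering is an assumption and
# depends on how flow_from_directory indexed the class folders.
labels = {0: "Mask", 1: "No Mask"}

img = cv2.imread("images/sample_image.jpg")  # hypothetical path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    # Crop the face, match the model's input size, and rescale to [0, 1].
    face = cv2.cvtColor(img[y:y + h, x:x + w], cv2.COLOR_BGR2RGB)
    face = cv2.resize(face, (128, 128)) / 255.0
    prob = model.predict(np.expand_dims(face, axis=0))[0][0]
    label = labels[int(prob > 0.5)]
    # Green box for a detected mask, red for no mask, label drawn above.
    color = (0, 255, 0) if label == "Mask" else (0, 0, 255)
    cv2.rectangle(img, (x, y), (x + w, y + h), color, 2)
    cv2.putText(img, label, (x, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, color, 2)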
The results for some sample images from the model we trained are shown below.




Learning References
- Haar Cascade GitHub repository
- Keras GitHub repository
- Object Detection using Haar Cascade
- Understand the VGG19 Architecture
- Sample Images Dataset from Kaggle
Learning Strategies
- Deep Learning is a type of machine learning that uses neural networks to make predictions. The key to learning about neural networks effectively is to study and visualize the whole architecture of the system. By doing this, we can easily understand how the data is processed step by step.
- It is also good practice to print or log important messages and errors to help with debugging.
- Neural networks are very sensitive to hyperparameters, so it is very important to tune them carefully to increase the model’s accuracy and improve its performance.
Reflective Analysis
Working on this project was challenging, mainly in terms of handling more than one face per image. The detection accuracy also depends on a variety of factors such as lighting, time of day, and the orientation of the person’s face in front of the camera. Thankfully, the training dataset I used covered most of these conditions, which made it easier to train the model for worst-case scenarios.
Additionally, VGG19 is a CNN architecture that can easily be modified to fit the needs of the problem, which makes it very versatile to use and build on.
Conclusions and Future Directions
In conclusion, the results generated from the model were satisfactory to an extent. We can improve the model’s performance by training it for more epochs and with more training images. Additionally, we can extend this use case to live surveillance cameras (CCTVs) to detect people without masks and monitor social distancing in real time.
The code for this project is available on GitHub and Kaggle.
Additionally, if you’re looking to do similar innovative projects in Deep Learning, you might be interested in this project on How to generate unique architectures using GANs.
Hire the author: Merishna S
Artificial Intelligence & Machine Learning engineer with strong fundamentals in machine learning algorithms (neural networks, dimensionality reduction, feature extraction, and clustering), programming, statistics, and mathematics.