Final Project
Using A Convolutional Neural Network (CNN) To Read Emotions From Facial Expressions
To view my PowerPoint and project poster: Powerpoint/Poster (PPT format) Powerpoint/Poster (PDF format)
There are two main methods of communicating: verbally and nonverbally. The nonverbal methods of communication involve visual cues like body language and facial expressions. In studies conducted in the mid-20th century, UCLA Professor Emeritus of Psychology Albert Mehrabian found that a majority of the meaning imparted by someone's message is gleaned from nonverbal means of communicating. Mehrabian is credited with the 7-38-55% rule, which states that listeners derive just 7% of a message's meaning from the actual words spoken and only 38% from the way those words are said, such as the speaker's tone of voice. The remaining 55%, over half, comes from the nonverbal, visual cues displayed as the person speaks, like facial expression.
Since visual cues impart such a large portion of the meaning in someone's message, those who are visually impaired could potentially misinterpret what is being said. In addition, those who are on the autism spectrum may have difficulty interpreting facial expressions and need some assistance. Consequently, I decided to create a convolutional neural network (CNN) that could read facial expressions from a live webcam/video stream and predict the person's emotional state. The computer would then provide both an audio output of how the person feels and a visual label of the predicted emotion. This would give the person utilizing the model a real-time indicator of the emotional state of the person they are communicating with. As a result, I believe my CNN model could meaningfully improve quality of life for a wide demographic of people by promoting positive social interaction.
In applying this model to the real world, I determined that there could be a few potential hindrances to its accuracy. For example, there are many occasions where people's facial expressions are incongruous with how they're feeling. The lighting of the person's face, as well as their distance from the camera, could also obscure their facial features. Another potential complication, one that has become more prevalent since the onset of the COVID-19 pandemic, is the wearing of face masks. Because masks obscure the face, particularly the mouth, the model could have difficulty predicting one's emotional state based solely on the eyes.
I trained and tested the model using a dataset of 2,473 images from the University of Central Florida. This dataset consisted of images of people displaying one of the seven basic emotions (happy, sad, angry, fear, disgust, surprise, and neutral) and a .csv file matching each image to its emotion label. The images were spliced from videos of sign language interpreters expressing each emotion, filmed at a German weather station for a 2014 research project at RWTH Aachen University. I selected 20 images from each emotion for the testing set, for a total of 140 images set aside for testing, which left 2,333 images for training. The dataset originally contained over 3,000 images, but during my data cleaning process I removed those labeled as potentially showing more than one of the seven emotions at the same time, or as showing none of the seven basic emotions. Because the seven basic emotion labels are categorical variables describing a set of pictures, the data were discrete. (The dataset can be found at: https://doi.org/10.7910/DVN/358QMQ.)
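A minimal sketch of this cleaning and splitting step is shown below. The CSV column names (`image`, `emotion`) and the label file name are assumptions for illustration, not the dataset's actual schema.

```python
import pandas as pd

# The seven basic emotion labels used in the project.
EMOTIONS = ["happy", "sad", "angry", "fear", "disgust", "surprise", "neutral"]

# Hypothetical label file; assumes one row per image with an "emotion" column.
labels = pd.read_csv("labels.csv")

# Keep only images labeled with exactly one of the seven basic emotions.
labels = labels[labels["emotion"].isin(EMOTIONS)]

# Set aside 20 images per emotion for testing; the rest are used for training.
test_df = labels.groupby("emotion").sample(n=20, random_state=42)
train_df = labels.drop(test_df.index)

print(len(train_df), "training images,", len(test_df), "testing images")
```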
In reviewing the dataset, I found that many of the images seemed to overlap multiple emotions, even though they were labeled with only one. In addition, the images were quite blurry, some seemed repetitive (possibly because they were extracted from nearby timestamps in the original video), and several depicted faces partially obscured by the person's hands. These issues may have negatively impacted the model's accuracy, so it would be important to retrain the model with another dataset in the future. Examples of images for each emotion are provided on the fourth slide in the attached PowerPoint/project poster.
I created a CNN with three sets of convolutional and pooling layers. The input layer was a Conv2D layer with 64 3x3 filters and an input image size of 48x48x1 (the 1 was used, as opposed to a 3, because I converted the images to grayscale to keep color from impacting the model's accuracy). The next two Conv2D layers doubled the number of filters to 128 and 256, respectively. I also included "same" padding to ensure the filters could be applied across all of the images' pixels. In addition, all three Conv2D layers used the ReLU activation function to prevent negative activations from propagating. A MaxPooling2D layer followed each of the three Conv2D layers, reducing each feature map to a quarter of its previous size (halving each spatial dimension). After these convolutional and pooling layers, I included a flattening layer to convert the feature maps into a one-dimensional array. This array was then passed to three Dense layers. The first two Dense layers both used the ReLU activation function; the first contained 256 nodes while the second contained 64. Finally, the last Dense layer contained seven nodes so the model could predict a probability for each of the seven emotion classification labels. For this layer, I used the softmax activation function, as it outputs a predicted probability distribution.
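Below is a minimal Keras sketch of this architecture. The layer sizes, activations, and padding follow the description above; the 2x2 pool size is inferred from each feature map being reduced to a quarter of its previous size, and any remaining arguments are assumptions rather than the project's exact code.

```python
from tensorflow.keras import layers, models

# Three Conv2D + MaxPooling2D blocks, followed by Flatten and three Dense layers.
model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),                       # 48x48 grayscale images
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),                           # halves each spatial dimension
    layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(256, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),                                      # convert feature maps to a 1-D array
    layers.Dense(256, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(7, activation="softmax"),                 # one probability per emotion label
])
```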
I then compiled this CNN model using categorical cross entropy loss because the model sought to solve a classification problem by labeling the images as belonging to one of seven different categories of emotion. Then, I set the optimizer to RMSProp with an initial learning rate of 0.001. To fit the data to the CNN model, I selected 30 epochs and set the training steps per epoch equal to the size of the training set divided by the batch size of 64. Similarly, I set the testing/validation steps per epoch equal to the size of the testing set divided by the batch size.
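A sketch of these compile and fit settings might look like the following; `train_generator` and `test_generator` are hypothetical placeholders for the image pipeline that yields batches of 64 grayscale 48x48 images, and the step counts use the dataset sizes given above.

```python
from tensorflow.keras.optimizers import RMSprop

BATCH_SIZE = 64

model.compile(
    loss="categorical_crossentropy",        # seven-way classification
    optimizer=RMSprop(learning_rate=0.001),
    metrics=["accuracy"],
)

history = model.fit(
    train_generator,                        # assumed training image generator
    steps_per_epoch=2333 // BATCH_SIZE,     # training set size / batch size
    epochs=30,
    validation_data=test_generator,         # assumed testing image generator
    validation_steps=140 // BATCH_SIZE,     # testing set size / batch size
)
```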
Using these settings, the model achieved approximately 93% and 68% accuracy on the training and testing sets, respectively. Thus, the model was rather overfit. Interestingly, the predictions mainly tended to be angry, happy, and neutral; very few predictions came from the other four labels. Looking at the accuracy graph, the testing/validation accuracy appears to be continuously increasing as the epochs approach 30. As such, it may be helpful to increase the number of epochs while fitting the model to the training data in order to improve its predictive power and decrease the extent of overfitting. However, after looking at the loss graph, the testing/validation loss appears to be increasing after the 25th epoch. Consequently, rather than increasing the number of epochs, it may be better to simply increase the size of the dataset.
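The accuracy and loss graphs referenced here can be reproduced from the fit history; a minimal matplotlib sketch, assuming the `history` object returned by `model.fit` above, is shown below.

```python
import matplotlib.pyplot as plt

# Plot training vs. validation accuracy and loss across the 30 epochs.
fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(10, 4))

ax_acc.plot(history.history["accuracy"], label="training")
ax_acc.plot(history.history["val_accuracy"], label="validation")
ax_acc.set_title("Accuracy per epoch")
ax_acc.legend()

ax_loss.plot(history.history["loss"], label="training")
ax_loss.plot(history.history["val_loss"], label="validation")
ax_loss.set_title("Loss per epoch")
ax_loss.legend()

plt.show()
```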
In the future, I would like to continue this project using a different dataset with more distinct displays of emotion (like this one from Kaggle). I would also like to improve the audio output. At this time, the audio report of the person's emotional state is only produced when the camera stream ends. This is not ideal, because the person utilizing the model would want to hear the output in real time while they communicate. When first writing the program, I included the audio output within the loop that reads frames from the video or webcam stream, but this caused the computer to repeat the emotion constantly. To resolve this problem, I had to save the audio output until the video or webcam stream ended. In the future, I would need to find a way to play the audio output only when the model detects that the person's emotion has changed, as sketched below. I think it would also be helpful to include facial recognition, which could inform the person utilizing the model whether they know the person approaching them and whether they have spoken to each other before. As previously mentioned, 38% of a message's meaning comes from the way it is delivered, so it would also be beneficial to include a model that makes predictions based on the speaker's tone of voice.
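One way to play the audio only on a change in emotion is to remember the last label that was spoken and compare it against each new prediction. The sketch below is a hypothetical illustration rather than the project's actual code: it assumes the trained `model` from above, an assumed label order in `EMOTIONS`, and uses pyttsx3 for text-to-speech; face detection and cropping are omitted for brevity.

```python
import cv2
import numpy as np
import pyttsx3

# Assumed label order; the dataset's actual encoding may differ.
EMOTIONS = ["happy", "sad", "angry", "fear", "disgust", "surprise", "neutral"]

engine = pyttsx3.init()
capture = cv2.VideoCapture(0)          # webcam stream
last_spoken = None

while True:
    ok, frame = capture.read()
    if not ok:
        break

    # Resize the full frame to the model's 48x48 grayscale input (no face crop here).
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    face = cv2.resize(gray, (48, 48)).reshape(1, 48, 48, 1) / 255.0
    emotion = EMOTIONS[int(np.argmax(model.predict(face, verbose=0)))]

    # Visual label on the frame.
    cv2.putText(frame, emotion, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
    cv2.imshow("Emotion", frame)

    # Speak only when the predicted emotion differs from the last one spoken.
    if emotion != last_spoken:
        engine.say(f"They look {emotion}")
        engine.runAndWait()            # blocks the loop briefly while speaking
        last_spoken = emotion

    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

capture.release()
cv2.destroyAllWindows()
```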
Previous research from the NIH (NIH article) found that adding physical hardware, like sensors that record the person's heart rate and brainwave activity, can also enhance the ability to predict one's emotional state. While such hardware could improve the model's accuracy, I think it would likely be impractical and too invasive to impose on most users. Thus, I searched for algorithms that do not require hardware beyond a camera and found that the Affdex algorithm can achieve 83% accuracy in classifying a wide range of emotions (Second NIH article). This may be a promising method of emotion detection in the future.