
LRDif: Learning Representations in Diffusion Models for Under-Display Camera Emotion Recognition

Zhifeng Wang1, Kaihao Zhang1, Ramesh Sankaranarayana1
Australian National University1
Email: {zhifeng.wang, kaihao.zhang, ramesh.sankaranarayana}@anu.edu.au
Abstract

Under-display camera (UDC) technology provides an elegant solution for full-screen smartphones. However, UDC-captured images suffer from severe degradation because the sensor lies beneath the display, making it difficult for recent neural networks to recognise emotions in UDC images.

Index Terms:
emotion recognition, graph neural network, human-centered computing

I Introduction

In the realm of digital technology, the under-display camera (UDC) represents a significant advancement, merging aesthetic elegance with functional innovation. This emerging technology, primarily seen in modern smartphones and other digital devices, cleverly embeds the camera beneath the display screen, eliminating the need for external camera notches or cutouts and thereby offering an uninterrupted, edge-to-edge screen experience. However, the incorporation of UDC technology extends beyond mere aesthetic appeal and enters the realm of advanced applications like emotion recognition.

Emotion recognition, a subset of computer vision and artificial intelligence, involves analyzing facial expressions to identify human emotions. This technology has vast applications, ranging from enhancing user experience in consumer electronics to providing critical data in fields like mental health, marketing, and human-computer interaction. Facial expression recognition (FER) has undergone remarkable advancements in recent years and is now widely used across various industries. However, implementing emotion recognition using UDC poses unique challenges, primarily due to the inherent limitations of the UDC setup.

Figure 1: Comparison between two types of camera images with their respective color histograms. (a) An image captured with an under-display camera (UDC): it is less clear, slightly blurry, noisier, and its colors are less true to life. (b) An image captured with a traditional external camera: it is much clearer, and the girl's face is sharp, with well-defined features and more natural colors.

The fundamental challenge lies in the image quality and clarity. UDC images often suffer from reduced sharpness, increased noise, and color fidelity issues compared to those captured with traditional external cameras. These quality constraints stem from the fact that the camera lens is positioned beneath the screen, which can obstruct and scatter light in unpredictable ways. For emotion recognition algorithms, which rely heavily on the nuances of facial expressions, this can lead to decreased accuracy and reliability.

Moreover, UDC images may exhibit unique artifacts and lighting inconsistencies, further complicating the task of accurately detecting and interpreting emotions. These challenges necessitate not only advancements in UDC hardware but also tailored algorithmic approaches in software. Machine learning models used for emotion recognition must be adapted or retrained to account for the peculiarities of UDC images, ensuring they can effectively discern emotional states despite the additional noise and distortions. Previous studies on FER [1, 2] have not given adequate attention to the additional noise and distortions introduced by UDC images. However, addressing this challenge is crucial for enhancing the practical applications of FER in real-world scenarios.

Figure 2: The overview of the proposed LRDif, which consists of the UDCformer, FPEN, and a denoising network. LRDif has two training stages: (a) In the first stage, FPEN_{S1} takes the ground-truth label and the UDC image as input and outputs an EPR Z to guide the UDCformer in restoring labels. We optimize FPEN_{S1} and LRDif_{S1} together so that LRDif_{S1} can fully use the extracted EPR Z. (b) In the second stage, we use the strong data-estimation ability of the DM to estimate the EPR extracted by the pretrained FPEN_{S1}. Notably, we do not input the ground-truth label into FPEN_{S2} or the denoising network. In the inference stage, we only use the reverse process of the DM.

Currently, several approaches address the noisy-label learning problem in the face recognition field. RUL [2021rul] addresses uncertainties due to ambiguous expressions and inconsistent labels by weighting facial features based on their relative difficulty, improving performance in noisy data environments. EAC [2022eac] tackles noisy labels by using flipped semantic consistency and selective erasure of input images, preventing the model from over-focusing on partial features associated with noisy labels. SCN [2020scn] addresses uncertainties in large-scale datasets by employing a self-attention mechanism to weight training samples and a relabeling mechanism to adjust the labels of low-ranked samples, thereby reducing overfitting to uncertain facial images. However, all of these methods have shortcomings when applied to UDC images. Specifically, RUL [2021rul] and EAC [2022eac] are based on the small-loss assumption [2, 48], which can confuse hard samples with noisy samples, since both have large loss values during training. Thus, learning features from noisy labels and noisy images is a challenging task.

To address this problem, we view data uncertainty from a different perspective. Instead of following the traditional path of detecting noisy samples according to their loss values and then suppressing them, we approach noisy-label learning from a feature-learning perspective and propose a novel framework that deals with all the aforementioned defects in UDC images. We aim to design a diffusion-based FER network that can fully and efficiently use the powerful distribution-mapping ability of the DM to restore noisy label and image pairs. To this end, we propose LRDif. Since transformers can model long-range pixel dependencies, we adopt transformer blocks as the basic units of LRDif. We stack dynamic transformer blocks in a U-Net shape to form the Under-Display Camera Transformer (UDCformer), which extracts and aggregates multi-level features. The UDCformer consists of two sub-networks: the DTnetwork and the DILnetwork. The DTnetwork extracts latent features of labels and UDC images at multiple levels, while the DILnetwork learns the similarities between facial landmarks and UDC images. We train LRDif in two stages: (1) In the first stage (Fig. 2 (a)), we develop a compact fusion prior extraction network (FPEN) to extract a compact emotion prior representation (EPR) from the label and the UDC image to guide the UDCformer; notably, FPEN and the UDCformer are optimized jointly. (2) In the second stage (Fig. 2 (b)), we train the DM to estimate the EPR directly from UDC images. Since the EPR is lightweight and only adds details for label restoration, the DM can estimate a fairly accurate EPR and achieve high test accuracy after only a few iterations. The main contributions of this work are summarized as follows:

  • We innovatively introduce the under-display camera (UDC) problem into facial expression recognition and propose a diffusion-based framework to reduce the impact of the additional noise and distortions in UDC images.

  • We propose LRDif, a strong, simple, and efficient DM-based baseline for FER. Unlike previous methods, which require prior knowledge of the dataset's uncertainty distribution, we use the strong mapping ability of the DM to estimate a compact EPR to guide FER, which improves the prediction efficiency and stability of the DM in FER.

  • We propose DGNet and DMNet for the dynamic UDCformer to fully exploit UDC images at multiple scales. Unlike traditional latent DMs that optimize the denoising network individually, we propose joint optimization of the denoising network and the decoder (UDCformer) to further improve robustness to estimation errors.

  • Extensive experiments show that the proposed LRDif achieves SOTA performance on UDC emotion recognition tasks across several standard FER datasets, including RAF-DB, KDEF, and FERPlus, demonstrating the powerful expression-analysis capability of LRDif.

II Related Work

II-A Facial Expression Recognition

Generally, a FER system consists of three stages: face detection, feature extraction, and expression recognition. In the face detection stage, face detectors such as MTCNN [44] and Dlib [2] are used to locate faces in complex scenes; the detected faces can optionally be further aligned. For feature extraction, various methods have been designed to capture the facial geometry and appearance changes caused by facial expressions. Wang et al. [wang2021light] leverage a spatial attention mechanism to focus on emotion-relevant image areas, enhancing accuracy under real-world conditions with a lightweight network that can be embedded in standard convolutional neural networks. MRAN [chen2023multi] improves performance under uncontrolled conditions by applying spatial attention to both global and local facial features, employing transformers to model relationships within and between these features, and leveraging a sample-relation transformer to exploit similarities among training samples, all optimized through a joint strategy. MPACNN [li2021learning] uses multiple paths to extract diverse features, which are then attentively fused for both basic and compound expression recognition; concurrently, the BS loss serves as MPACNN's objective, maximizing inter-class and minimizing intra-class distances so that informative and discriminative features are learned simultaneously for high-accuracy recognition. FG-AGR [li2023fg] addresses challenges in uncontrolled environments using Adaptive Salient Region Induction (ASRI) to highlight key facial areas, Local Fine-grained Feature Extraction (LFFE) via vision transformers for detailed feature extraction, and Adaptive Graph Association Reasoning (AGAR) with graph convolutional networks to effectively learn combinations of these fine-grained features. Zhang et al. [zhang2023transformer] propose the Transformer-based Multimodal Emotional Perception (T-MEP) framework for dynamic expression recognition, employing specialized transformer-based encoders for the audio, image, and text modalities, together with a multimodal information fusion module that uses both self-attention and cross-attention to robustly integrate and augment cross-modal representations for a comprehensive understanding of human emotions.

II-B Diffusion Models

Rombach et al. [rombach2022high] enhance the efficiency and visual fidelity of diffusion models by training them in the latent space of pre-trained autoencoders and integrating cross-attention layers for versatile conditioning inputs, enabling high-resolution image synthesis with reduced computational demands while preserving detail. DiffusionDet [chen2023diffusiondet] is an object detection framework that treats detection as a denoising diffusion from random to precise object boxes, learning to reverse the diffusion from ground truth during training and iteratively refining randomly generated boxes during inference, which offers flexibility in the number of boxes and the evaluation method. StyleSwin [zhang2022styleswin] is a high-resolution image synthesis framework based on a pure transformer generative adversarial network; it incorporates the Swin transformer into a style-based architecture with double attention for larger receptive fields and absolute position knowledge, and adds a wavelet discriminator to overcome blocking artifacts and maintain spatial coherence. Yang et al. [yang2023paint] advance exemplar-guided image editing by employing self-supervised training to disentangle and rearrange elements from a source image and an exemplar, using an information bottleneck and strong augmentations to avoid copy-paste artifacts, and introducing an arbitrary-shape mask with classifier-free guidance for enhanced control, all within a single forward pass of a diffusion model. Whang et al. [whang2022deblurring] use conditional diffusion models to train a stochastic sampler that refines outputs from a deterministic predictor, producing a diverse set of perceptually superior reconstructions for a given blurred input while remaining efficient in sampling and competitive on standard distortion metrics such as PSNR.

Figure 3: The overview of the DTnetwork, which consists of DGNet and DMNet.

III Methods

III-A Pretraining the DTnetwork

Before introducing the pretraining of the DTnetwork, we describe the two networks used in the first stage: a compact fusion prior extraction network (FPEN) and a dynamic transformer network (DTnetwork). The structure of FPEN is shown in the yellow box of Fig. 2; it mainly consists of linear layers that extract the compact emotion prior representation (EPR). The DTnetwork then uses the extracted EPR to restore the labels of UDC images. The structure of the DTnetwork is shown in the red box of Fig. 2; it consists of a dynamic gated feed-forward network (DGNet) and a dynamic multi-head transposed attention module (DMNet), which use the EPR as dynamic modulation parameters to add label-restoration details to the feature maps for emotion recognition.
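Since FPEN is described only as a stack of linear layers operating on CLIP embeddings, the following is a minimal PyTorch sketch of such a prior-extraction module; the hidden width, the EPR dimension C, and the exact input dimensions are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class FPEN(nn.Module):
    """Minimal sketch of the fusion prior extraction network (FPEN).

    Assumed layout: a small MLP that maps the concatenated CLIP label and
    image embeddings to a compact emotion prior representation (EPR) Z in R^C.
    """
    def __init__(self, in_dim=1024, hidden_dim=256, epr_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.LeakyReLU(0.1),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LeakyReLU(0.1),
            nn.Linear(hidden_dim, epr_dim),
        )

    def forward(self, feats):
        # feats: (B, in_dim) = Concat(E_L(y), E_I(x)) in stage 1,
        # or E_I(x) alone in stage 2 (with a different in_dim).
        return self.mlp(feats)

# Example: stage-1 input assumed to be a 512-d CLIP text embedding
# concatenated with a 512-d CLIP image embedding.
fpen_s1 = FPEN(in_dim=1024, epr_dim=64)
z = fpen_s1(torch.randn(8, 1024))   # EPR Z, shape (8, 64)
```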

Algorithm 1 LRDif_{S2} Training

Input: Ground-truth label y, UDC image x, label encoder E_L, UDC image encoder E_I, FPEN_{S1}, UDCformer
Output: Trained LRDif_{S2}

1: Initialize: \alpha_t = 1 - \beta_t, \bar{\alpha}_T = \prod_{i=0}^{T}\alpha_i.
2: for 1, \ldots, N do
3:     Z = \text{FPEN}_{S1}(\text{Concat}(E_L(y), E_I(x)))
4:     Diffusion Process:
5:     Sample Z_T by q(Z_T \mid Z) = \mathcal{N}(Z_T; \sqrt{\bar{\alpha}_T}\,Z, (1-\bar{\alpha}_T)I)
6:     Reverse Process:
7:     Z' = Z_T
8:     x_{S2} = \text{FPEN}_{S2}(E_I(x))
9:     for t = T to 1 do
10:        Z'_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(Z'_t - \epsilon_\theta(\text{Concat}(Z'_t, t, x_{S2}))\,\frac{1-\alpha_t}{\sqrt{1-\alpha_t}}\big)
11:    end for
12:    y' = \text{UDCformer}(x, Z'_0, x_{S2})
13:    Calculate the \mathcal{L}_d loss
14: end for
15: Output the trained model LRDif_{S2}.
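For concreteness, the sketch below mirrors Algorithm 1 in PyTorch under several assumptions: the encoders, FPEN modules, denoising network eps_net, and UDCformer are treated as black-box callables with compatible shapes, and the β schedule and optimizer handling are illustrative only.

```python
import torch
import torch.nn.functional as F

def train_lrdif_s2_step(x, y, E_L, E_I, fpen_s1, fpen_s2, eps_net, udcformer,
                        betas, optimizer):
    """One LRDif_S2 training step following Algorithm 1 (sketch only).

    y: ground-truth labels, used both as input to the CLIP label encoder E_L
    and as class indices for the cross-entropy term.
    """
    alphas = 1.0 - betas                      # alpha_t = 1 - beta_t
    alpha_bar = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t
    T = len(betas)

    with torch.no_grad():
        # Target EPR Z from the frozen, pretrained stage-1 FPEN (Eq. 1).
        z = fpen_s1(torch.cat([E_L(y), E_I(x)], dim=-1))

    # Diffusion process: sample Z_T ~ q(Z_T | Z) in closed form (Eq. 15).
    noise = torch.randn_like(z)
    z_t = torch.sqrt(alpha_bar[-1]) * z + torch.sqrt(1.0 - alpha_bar[-1]) * noise

    # Conditional vector extracted from the UDC image only (Eq. 16).
    x_s2 = fpen_s2(E_I(x))

    # Reverse process: iteratively denoise Z_T towards Z_0' (Eq. 17).
    for t in reversed(range(T)):
        t_emb = torch.full((z_t.shape[0], 1), float(t), device=z_t.device)
        eps = eps_net(torch.cat([z_t, t_emb, x_s2], dim=-1))
        z_t = (z_t - eps * (1.0 - alphas[t]) / torch.sqrt(1.0 - alphas[t])) \
              / torch.sqrt(alphas[t])

    # Decode the label with the UDCformer and jointly optimize (Eq. 19).
    logits = udcformer(x, z_t, x_s2)
    loss = F.cross_entropy(logits, y) + F.kl_div(
        F.log_softmax(z_t, dim=-1), F.softmax(z, dim=-1), reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```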

In the pretraining stage (Fig. 2 (a)), we train FPEN_{S1} and the DTnetwork together. We first concatenate the ground-truth label embedding and the UDC image embedding, obtained from the CLIP [radford2021clip] text and image encoders, to form the input to FPEN_{S1}. Then, FPEN_{S1} extracts the EPR Z \in R^C as:

Z = \text{FPEN}_{S1}(\text{Concat}(E_L(y), E_I(x)))   (1)

Then the EPR Z is sent into the DGNet and DMNet of the DTnetwork as dynamic modulation parameters to guide label restoration:

F' = W_1^l Z \circ \text{LN}(F) + W_2^l Z   (2)

where \circ indicates element-wise multiplication, LN denotes layer normalization, W^l represents the weights of a linear layer, and F and F' are the input and output feature maps.
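A minimal sketch of this EPR-conditioned modulation (Eq. 2) is given below; the feature and EPR dimensions, and the token-wise layout of F, are assumed.

```python
import torch
import torch.nn as nn

class EPRModulation(nn.Module):
    """Sketch of Eq. (2): F' = (W_1^l Z) ∘ LN(F) + (W_2^l Z)."""
    def __init__(self, epr_dim=64, feat_dim=128):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)
        self.to_scale = nn.Linear(epr_dim, feat_dim)  # W_1^l
        self.to_shift = nn.Linear(epr_dim, feat_dim)  # W_2^l

    def forward(self, feat, z):
        # feat: (B, N, feat_dim) token features, z: (B, epr_dim) EPR.
        scale = self.to_scale(z).unsqueeze(1)   # (B, 1, feat_dim)
        shift = self.to_shift(z).unsqueeze(1)
        return scale * self.norm(feat) + shift  # element-wise modulation

mod = EPRModulation()
out = mod(torch.randn(2, 49, 128), torch.randn(2, 64))   # (2, 49, 128)
```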

Then, we aggregate global spatial information in DMNet. Specifically, F' is projected into query Q, key K, and value V using a convolution layer followed by a depth-wise convolution layer. We then reshape the query Q to H''W'' \times C'', the key K to C'' \times H''W'', and the value V to H''W'' \times C''. After that, we perform a dot-product between Q and K to generate a transposed-attention map A of size C'' \times C''. The overall process of DMNet can be described as follows:

F"=WcV×softmax(K×Q/α)+FF^{"}=W_{c}V\times softmax(K\times Q/\alpha)+F (3)

where \alpha is a learnable scaling parameter. Next, in DGNet, we aggregate local features. We use a 1\times 1 convolution to aggregate information across channels and a 3\times 3 depth-wise convolution to aggregate information from spatially neighboring pixels. In addition, we adopt a gating mechanism to enhance information encoding. The overall process of DGNet is defined as:

F"=GELU(Wd1Wc1F)Wd2Wc2F+FF^{"}=GELU(W^{1}_{d}W^{1}_{c}F^{{}^{\prime}})\circ W^{2}_{d}W^{2}_{c}F^{{}^{\prime}}+F (4)

III-B Dynamic Image and Landmarks Network (DILnetwork)

Algorithm 2 LRDif_{S2} Inference

Input: Trained LRDif_{S2}, UDC image x
Output: Restored label y'

1: Initialize: \alpha_t = 1 - \beta_t, \bar{\alpha}_T = \prod_{i=0}^{T}\alpha_i.
2: Reverse Process:
3: Sample Z'_T \sim \mathcal{N}(0, 1)
4: x_{S2} = \text{FPEN}_{S2}(E_I(x))
5: for t = T to 1 do
6:     Z'_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(Z'_t - \epsilon_\theta(\text{Concat}(Z'_t, t, x_{S2}))\,\frac{1-\alpha_t}{\sqrt{1-\alpha_t}}\big)
7: end for
8: y' = \text{UDCformer}(x, Z'_0, x_{S2})
9: Output the restored label y'

In the DILnetwork, we use a window-based cross-attention mechanism to fuse facial landmark features and UDC image features. For the UDC image features X_{udc} \in R^{N \times D}, we first divide them into several non-overlapping windows x_{udc} \in R^{M \times D}. For the facial landmark features X_{flm} \in R^{C \times H \times W}, we first down-sample them to the window size x_{flm} \in R^{c \times h \times w}, where c = D and M = h \times w. We can then apply cross-attention with N heads between the facial landmarks and the UDC images.

Q = x_{flm} w_Q, \quad K = x_{udc} w_K, \quad V = x_{udc} w_V   (5)
O_i = \text{Softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{d}} + b\right) V_i, \quad i = 1, \ldots, N   (6)
O = [O_1, O_2, \ldots, O_N] W_O   (7)

where w_Q, w_K, w_V, and W_O are the projection matrices and b is the relative position bias.

We perform the above cross-attention calculation for all windows and refer to this mechanism as window-based multi-head cross-attention (MHCA). The cross-fusion transformer encoder in LRDif can thus be expressed as follows:

X'_{udc} = \text{MHCA}(X_{udc}) + X_{udc}   (8)
X''_{udc} = \text{MLP}(\text{LN}(X'_{udc})) + X'_{udc}   (9)

We merge the output features F from the DTnetwork and the output features O from the DILnetwork to obtain the fused multi-scale features x_1, x_2, and x_3, where x_1 = \text{Concat}(F_1, O_1), x_2 = \text{Concat}(F_2, O_2), and x_3 = \text{Concat}(F_3, O_3). The fused features X are then fed into vanilla transformer blocks for processing.

X = [x_1, x_2, x_3],   (10)
X' = \text{MSA}(X) + X,   (11)
y' = \text{MLP}(\text{LN}(X')) + X',   (12)

where MSA denotes the multi-head self-attention block and LN denotes layer normalization. The training loss is defined as follows:

\mathcal{L}_{ce} = -\sum_{i=1}^{N} \sum_{c=1}^{M} y_{ic} \log(p_{ic})   (14)

We use the cross-entropy loss to train our model, where N is the number of samples and M is the number of classes. y_{ic} indicates whether class c is the correct classification for sample i (1 if true, 0 otherwise), and p_{ic} is the predicted probability that sample i belongs to class c.
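A brief sketch of the multi-scale fusion, the vanilla transformer block, and the cross-entropy objective of Eqs. (10)-(14) follows; the projection to a common token width, the mean-pooled classification head, and the class count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Sketch of Eqs. (10)-(14): fuse multi-scale features, apply a vanilla
    transformer block, and classify with cross-entropy."""
    def __init__(self, dim=128, num_heads=4, num_classes=7):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                                batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x1, x2, x3):
        # x1, x2, x3: (B, N_i, dim) fused tokens from DTnetwork and DILnetwork.
        x = torch.cat([x1, x2, x3], dim=1)      # Eq. (10): X = [x1, x2, x3]
        x = self.block(x)                       # Eqs. (11)-(12): MSA + MLP
        return self.head(x.mean(dim=1))         # pooled logits

model = FusionClassifier()
logits = model(torch.randn(2, 16, 128), torch.randn(2, 16, 128),
               torch.randn(2, 16, 128))
loss = F.cross_entropy(logits, torch.tensor([0, 3]))   # Eq. (14)
```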

III-C Diffusion Models for Label Restoration

In the second stage (Fig. 2 (b)), we exploit the strong data-estimation ability of the DM to estimate the EPR. Specifically, we use the pretrained FPEN_{S1} to capture the EPR Z \in R^C. We then apply the diffusion process to Z to sample Z_T \in R^C, which can be described as:

q(Z_T \mid Z) = \mathcal{N}(Z_T; \sqrt{\bar{\alpha}_T}\,Z, (1 - \bar{\alpha}_T)I)   (15)

where T is the total number of iterations, \bar{\alpha}_T = \prod_{i=0}^{T}\alpha_i, and \alpha_t = 1 - \beta_t. \beta_t is a predefined scale factor, and \mathcal{N}(\cdot) denotes the Gaussian distribution.
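Because the forward process is Gaussian, Z_T can be sampled from Eq. (15) in a single closed-form step, as in the sketch below; the linear β schedule is an assumption, since the paper only states that β_t is predefined.

```python
import torch

def diffuse_epr(z, T=8, beta_start=0.1, beta_end=0.99):
    """Sample Z_T ~ q(Z_T | Z) from Eq. (15) in one step (sketch).

    A linear beta schedule is assumed for illustration.
    """
    betas = torch.linspace(beta_start, beta_end, T)
    alpha_bar_T = torch.prod(1.0 - betas)                 # \bar{alpha}_T
    noise = torch.randn_like(z)
    return torch.sqrt(alpha_bar_T) * z + torch.sqrt(1 - alpha_bar_T) * noise

z_T = diffuse_epr(torch.randn(8, 64))   # EPR after the diffusion process
```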

In the reverse process of the DM, we first use the CLIP image encoder E_I to encode the input UDC image x. The encoded features are then sent to FPEN_{S2} to obtain a conditional vector x_{S2} \in R^C from the UDC image:

x_{S2} = \text{FPEN}_{S2}(E_I(x))   (16)

where FPEN_{S2} has the same structure as FPEN_{S1}, except for the input dimension of the first linear layer. We then use the denoising network \epsilon_\theta to estimate the noise at each time step t as \epsilon_\theta(\text{Concat}(Z'_t, t, x_{S2})). The estimated noise is substituted into Eq. (17) to obtain Z'_{t-1} for the next iteration:

Z'_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(Z'_t - \epsilon_\theta(\text{Concat}(Z'_t, t, x_{S2}))\,\frac{1-\alpha_t}{\sqrt{1-\alpha_t}}\right)   (17)

After T iterations, we obtain the final estimated EPR Z'_0. We jointly train FPEN_{S2}, the denoising network, and the UDCformer using \mathcal{L}_{total}:

\mathcal{L}_{kl} = \sum_{i=1}^{C} Z_{\text{norm}}(i) \log\!\left(\frac{Z_{\text{norm}}(i)}{\bar{Z}_{\text{norm}}(i)}\right)   (18)
\mathcal{L}_{total} = \mathcal{L}_{ce} + \mathcal{L}_{kl}   (19)

where Z_{\text{norm}}(i) and \bar{Z}_{\text{norm}}(i) are the EPRs extracted by LRDif_{S1} and LRDif_{S2}, respectively, normalized with a softmax operation. \mathcal{L}_{kl} is a variant of the Kullback-Leibler divergence, summed over the C dimensions. We add the Kullback-Leibler divergence loss \mathcal{L}_{kl} (Eq. 18) and the cross-entropy loss \mathcal{L}_{ce} (Eq. 14) to form the total loss \mathcal{L}_{total} (Eq. 19). Since the EPR encodes the UDC image and the related emotion label via CLIP, LRDif_{S2} requires only a few iterations and a small model size to obtain accurate estimates.
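A minimal sketch of the joint objective in Eqs. (18)-(19), with both EPRs softmax-normalized before the KL term; tensor shapes are assumed.

```python
import torch
import torch.nn.functional as F

def lrdif_total_loss(logits, labels, z_s1, z_s2_hat):
    """Sketch of Eq. (19): L_total = L_ce + L_kl.

    z_s1 is the EPR from the pretrained stage-1 FPEN, z_s2_hat is the EPR
    estimated by the diffusion model; both are softmax-normalized (Eq. 18).
    """
    l_ce = F.cross_entropy(logits, labels)                        # Eq. (14)
    z_norm = F.softmax(z_s1, dim=-1)                              # Z_norm
    z_hat_log = F.log_softmax(z_s2_hat, dim=-1)                   # log of \bar{Z}_norm
    l_kl = F.kl_div(z_hat_log, z_norm, reduction="batchmean")     # Eq. (18)
    return l_ce + l_kl
```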

In the inference stage, we only use the reverse diffusion process, as described in Algorithm 2. FPEN_{S2} extracts a conditional vector x_{S2} from the CLIP-encoded UDC image, and we randomly sample Gaussian noise Z'_T. The denoising network uses Z'_T and x_{S2} to estimate the EPR Z' after T iterations. The UDCformer then exploits this EPR to restore the label y'.

IV Experiments

IV-A Datasets

RAF-DB [rafdb] is collected in the wild, with each image annotated by 40 independent raters to ensure precision. It is composed of two subsets: a basic subset with single-label annotations and a compound subset with dual-label annotations. Our study uses the basic subset, which includes seven emotion categories: surprise, fear, disgust, happiness, sadness, anger, and neutral. The dataset is partitioned into a training set of 12,271 images and a testing set of 3,068 images, both with a nearly identical distribution of expressions.

FERPlus [ferplus] is an extension of FER2013 [goodfellow2013challenges], enriched with a new set of labels created by ten annotators. The enhanced dataset includes 28,709 images for training and 7,178 for testing. FERPlus adds ‘Contempt’ as an additional emotion category, extending the classification to eight distinct emotional classes.

KDEF [kdef] is a comprehensive dataset of 4,900 images designed to depict human emotional expressions. It includes images of 70 amateur actors, balanced between 35 females and 35 males, each demonstrating seven distinct emotional expressions. A unique aspect of this dataset is its multi-angle design: each expression is captured from five different angles, offering a rich and diverse set of visual data. Participants were selected based on specific criteria: they are between 20 and 30 years of age and have a clear, unobstructed facial appearance.

The UDC-RAF-DB dataset comprises a training set of 12,271 images and a testing set of 3,068 images, offering a robust platform for developing and testing FER algorithms in UDC environments.

The UDC-FERPlus dataset extends FERPlus into the UDC domain, providing 28,709 UDC images for training and 7,178 for testing.

The UDC-KDEF dataset includes a total of 4,900 UDC images captured from five different angles. The training set comprises 3,920 UDC images and the testing set includes 980 images, providing a wide range of data for training and evaluating FER systems in UDC contexts.

IV-B Implementation Details

We trained a SOTA MPGNet [zhou2022modular] to model the UDC imaging degradation process and a U-Net-like network called DWFormer for UDC image generation. The UDC imaging degradation process consists of brightness attenuation, blurring, and noise corruption. We use the pretrained MPGNet to synthesize three benchmark facial expression recognition (FER) datasets: UDC-RAF-DB, UDC-FERPlus, and UDC-KDEF. All experiments are implemented in PyTorch, and the models are trained on an RTX 3090 GPU.
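MPGNet itself is not reproduced here, but a simple stand-in for the three stated degradation factors (brightness attenuation, blurring, and noise corruption) can be sketched as follows; the attenuation factor, blur width, and noise level are illustrative values only and do not correspond to the learned MPGNet degradation.

```python
import torch
import torch.nn.functional as F

def simulate_udc_degradation(img, attenuation=0.6, blur_sigma=2.0, noise_std=0.05):
    """Illustrative stand-in for UDC degradation (NOT MPGNet): brightness
    attenuation, Gaussian blurring, and additive Gaussian noise.

    img: (B, 3, H, W) tensor with values in [0, 1].
    """
    # 1) Brightness attenuation caused by the display covering the sensor.
    out = img * attenuation

    # 2) Blurring: separable Gaussian kernel approximating light scattering.
    k = torch.arange(-4, 5, dtype=torch.float32)
    g = torch.exp(-k ** 2 / (2 * blur_sigma ** 2))
    g = (g / g.sum()).to(img.device)
    channels = img.shape[1]
    kernel_h = g.view(1, 1, 1, -1).repeat(channels, 1, 1, 1)
    kernel_v = g.view(1, 1, -1, 1).repeat(channels, 1, 1, 1)
    out = F.conv2d(out, kernel_h, padding=(0, 4), groups=channels)
    out = F.conv2d(out, kernel_v, padding=(4, 0), groups=channels)

    # 3) Noise corruption.
    out = out + noise_std * torch.randn_like(out)
    return out.clamp(0.0, 1.0)

degraded = simulate_udc_degradation(torch.rand(1, 3, 224, 224))
```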

IV-C Comparison with SOTA FER Methods

IV-C1 Comparison with Typical FER-model

TABLE I: Comparison results with SOTA FER algorithms on RAF-DB, FERPlus and KDEF.
RAF-DB FERPlus KDEF
Methods Acc. (%) Methods Acc. (%) Methods Acc. (%)
ARM[2021arm] 90.42 DACL[2021dacl] 83.52 DACL[2021dacl] 88.61
POSTER++[2023posterv2] 92.21 POSTER++[2023posterv2] 86.46 POSTER++[2023posterv2] 94.44
RUL[2021rul] 88.98 RUL[2021rul] 85.00 RUL[2021rul] 87.83
DAN[2023dan] 89.70 DAN[2023dan] 85.48 DAN[2023dan] 88.77
SCN[2020scn] 87.03 SCN[2020scn] 83.11 SCN[2020scn] 89.55
EAC[2022eac] 90.35 EAC[2022eac] 86.18 EAC[2022eac] 72.32
MANet[2021manet] 88.42 MANet[2021manet] 85.49 MANet[2021manet] 91.75
Ours 92.24 Ours 87.13 Ours 95.75

Table I presents a comprehensive comparison of accuracy between the proposed method and current state-of-the-art (SOTA) Facial Expression Recognition (FER) algorithms [2020scn, 2021manet, 2021arm, 2022eac, 2021dacl, 2021rul, 2023dan, 2023posterv2] across three benchmark datasets: RAF-DB, FERPlus, and KDEF. For the RAF-DB dataset, the proposed method (‘Ours’) achieves an accuracy of 92.24%, surpassing several established algorithms such as ARM[2021arm], RUL[2021rul], DAN[2023dan], SCN[2020scn], EAC[2022eac], and MANet[2021manet], and is competitive with POSTER++[2023posterv2], which scores slightly lower at 92.21%. In the FERPlus dataset, the proposed method also demonstrates superior performance with an accuracy of 87.13%, outperforming other notable techniques like DACL[2021dacl], RUL[2021rul], DAN[2023dan], SCN[2020scn], and MANet[2021manet]. Remarkably, in the KDEF dataset comparison, the proposed method achieves the highest accuracy of 95.75%, indicating a significant advancement over other methodologies, including the second-highest performing POSTER++[2023posterv2] at 94.44% and the lower-scoring MANet[2021manet] at 91.75%. Overall, this table highlights the proposed method’s robustness and effectiveness in facial expression recognition tasks, as evidenced by its leading accuracy rates against a range of SOTA algorithms on diverse and challenging datasets.

IV-C2 Comparison with the UDC FER-model

We conducted a performance evaluation of our proposed facial expression recognition (FER) model, specifically tailored for under-display camera (UDC) systems, against other leading state-of-the-art FER models.

Results on UDC-RAF-DB. Table II delineates the comparative accuracy of various state-of-the-art facial expression recognition (FER) models on the UDC-RAF-DB dataset. For each method, the table reports the backbone architecture, the number of expression classes, and the accuracy (%). RUL[2021rul], DAN[2023dan], SCN[2020scn], EAC[2022eac], and MANet[2021manet] use the ResNet-18 architecture, POSTER++[2023posterv2] is based on the Vision Transformer (ViT), and ARM[2021arm] employs a modified ResNet-18 called ResNet18-ARM. The proposed model diverges from this convention by adopting a diffusion backbone, yielding an accuracy of 87.90% and a notable improvement over the other methods. All models are evaluated on the same seven expression classes, maintaining a consistent comparison framework. The results underscore the efficacy of the diffusion-based model for UDC-based FER systems.

Results on UDC-FERPlus. Table III offers an analytical comparison of accuracy for various state-of-the-art models as applied to the UDC-FERPlus Dataset. A closer comparative analysis reveals that the proposed model outperforms its counterparts with an accuracy of 84.89%, marking a significant advancement in precision over the other techniques. POSTER++ [2023posterv2] with VIT architecture shows competitive performance at 83.78%, followed closely by DAN [2023dan] with ResNet-18 at 83.25% and MANet[2021manet] with ResNet-18 at 83.19%. The remaining methods, while also based on the ResNet-18 architecture, register lower accuracy rates, with EAC [2022eac] at 82.72%, RUL [2021rul] at 82.15%, DACL [2021dacl] at 78.11%, and SCN [2020scn] at the lower end with 77.38%. In summary, the proposed model exhibits superior performance on the UDC-FERPlus Dataset, setting a new benchmark for accuracy within the domain and underscoring the potential of utilizing advanced backbone architectures for enhancing FER system efficacy in under-display camera applications.

Results on UDC-KDEF. Table IV provides an elaborate benchmarking of accuracy rates among leading facial expression recognition (FER) models on the UDC-KDEF dataset. The proposed method achieves an impressive accuracy of 94.07%, surpassing all competing methodologies. Notably, POSTER++[2023posterv2] is a strong contender, realizing an accuracy of 91.92% owing to its Vision Transformer (ViT) architecture, while MANet[2021manet] also delivers a commendable performance with an 88.48% accuracy rate. Other methods, although effective, yield relatively lower accuracy rates: DAN[2023dan] attains 85.71%, DACL[2021dacl] records 84.44%, RUL[2021rul] achieves 82.69%, and SCN[2020scn] reports 78.69%. The EAC[2022eac] method exhibits the lowest accuracy at 54.31%, which may be attributed to its lack of robustness against the varied perspectives characteristic of UDC images. The superior performance of the proposed model underscores its potential as a robust solution for FER tasks in UDC settings.

TABLE II: Comparison of accuracy (%) with state-of-the-art results on the UDC-RAF-DB Dataset.
Method Backbone Classes Acc.(%)
ARM[2021arm] ResNet18-ARM 7 86.44
POSTER++[2023posterv2] VIT 7 86.76
RUL[2021rul] ResNet-18 7 85.59
DAN[2023dan] ResNet-18 7 86.47
SCN[2020scn] ResNet-18 7 85.89
EAC[2022eac] ResNet-18 7 86.51
MANet[2021manet] ResNet-18 7 85.62
Ours Diffusion 7 88.55
TABLE III: Comparison of accuracy (%) with state-of-the-art results on the UDC-FERPlus Dataset.
Method Backbone Classes Acc.(%)
DACL[2021dacl] ResNet18 8 78.11
POSTER++[2023posterv2] VIT 8 83.78
RUL[2021rul] ResNet-18 8 82.15
DAN[2023dan] ResNet-18 8 83.25
SCN[2020scn] ResNet-18 8 77.38
EAC[2022eac] ResNet-18 8 82.72
MANet[2021manet] ResNet-18 8 83.19
Ours Diffusion 8 84.89
TABLE IV: Comparison of accuracy (%) with state-of-the-art results on the UDC-KDEF Dataset.
Method Backbone Classes Acc.(%)
DACL[2021dacl] ResNet18 7 84.44
POSTER++[2023posterv2] VIT 7 91.92
RUL[2021rul] ResNet-18 7 82.69
DAN[2023dan] ResNet-18 7 85.71
SCN[2020scn] ResNet-18 7 78.69
EAC[2022eac] ResNet-18 7 54.31
MANet[2021manet] ResNet-18 7 88.48
Ours Diffusion 7 94.07

IV-D FLOPs and Param Comparison

TABLE V: Comparison based on Param and FLOPs on UDC-RAF-DB and UDC-KDEF. Our method provides competitive performance while maintaining a manageable computational cost.
Method #Params #FLOPs RAF-DB KDEF
ARM[2021arm] 11.2M 1.8G 86.44 88.49
POSTER++[2023posterv2] 58.3M 8.5G 86.76 91.92
RUL[2021rul] 14.4M 1.8G 85.59 82.69
DAN[2023dan] 19.7M 2.2G 86.47 85.71
SCN[2020scn] 11.2M 1.8G 85.89 78.69
EAC[2022eac] 23.5M 3.9G 86.51 54.31
MANet[2021manet] 50.5M 3.7G 85.62 88.48
Ours 211.5M 8.9G 87.90 94.07

Table V provides a comparative analysis of various facial expression recognition (FER) models, focusing on parameters and FLOPs as evaluated on the UDC-RAF-DB and UDC-KDEF datasets. Our proposed model stands out with a substantial 211.5 million parameters and 8.9 billion FLOPs, indicating a more complex and computationally intensive architecture. Nevertheless, it achieves a high accuracy of 87.90% on the UDC-RAF-DB dataset and a notable 94.07% on the UDC-KDEF dataset, suggesting that the increased computational requirements are justified by improved performance. In contrast, models such as ARM[2021arm], RUL[2021rul], and SCN[2020scn], while more lightweight at around 11.2 million parameters and 1.8 billion FLOPs, exhibit moderate accuracy. POSTER++[2023posterv2], with a middle-ground design of 58.3 million parameters and 8.5 billion FLOPs, achieves substantial accuracy, particularly on the UDC-KDEF dataset with 91.92%. MANet[2021manet], another relatively large model with 50.5 million parameters and 3.7 billion FLOPs, shows competitive but slightly lower accuracy than our model. EAC[2022eac], with 23.5 million parameters and 3.9 billion FLOPs, records the lowest accuracy on the UDC-KDEF dataset, suggesting a possible inefficiency of its architecture when dealing with the specific challenges of UDC images. Overall, the table underscores the trade-offs between model size, computational cost, and performance, showing that our model achieves robust results at a manageable computational cost and thereby provides an effective solution for high-accuracy FER in UDC contexts.

Figure 5: The learned feature distributions of SCN and LRDif trained on the RAF-DB dataset: (a) SCN features on clear images; (b) SCN features on UDC images; (c) LRDif features on clear images; (d) LRDif features on UDC images.

IV-E Feature Visualization

We use t-SNE to visualise the learned feature distributions of different methods to show the effectiveness of LRDif on UDC images. The results are shown in Fig. 5. The comparison across facial expressions shows that LRDif encourages intra-class compactness and inter-class separability of the learned features. Comparing Fig. 5 (a) and Fig. 5 (b), SCN cannot separate the emotion categories well, especially on UDC images. In contrast, LRDif (Ours) can recognize expressions from mixed features and noisy images, as it is forced to learn the most discriminative features that distinguish an expression image from all other expressions. Comparing Fig. 5 (c) and Fig. 5 (d), we conclude that LRDif automatically prevents the model from memorizing noisy features and learns useful features from UDC images.
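For reference, this kind of feature visualization can be produced with scikit-learn's t-SNE on features collected from the penultimate layer; the feature-collection step and plot settings below are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title="LRDif features on UDC-RAF-DB"):
    """Project penultimate-layer features to 2-D with t-SNE and scatter-plot
    them colored by the seven expression classes (sketch)."""
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(features)
    plt.figure(figsize=(5, 5))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=4)
    plt.title(title)
    plt.tight_layout()
    plt.show()

# features: (N, D) numpy array collected from the model; labels: (N,) ints.
plot_tsne(np.random.randn(500, 128), np.random.randint(0, 7, size=500))
```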

IV-F Ablation Study

TABLE VI: Ablation study of accuracy (%) on the UDC-RAF-DB Dataset. The core components of LRDif considered are the ground truth, the diffusion model, \mathcal{L}_{ce}, \mathcal{L}_{total}, and noise insertion.
Method Acc.(%)
LRDif_{S1} 100
LRDif_{S2}-V1 86.08
LRDif_{S2}-V2 88.98
LRDif_{S2}-V3 (Ours) 88.55
LRDif_{S2}-V4 88.78

The loss functions for DM.

Impact of the number of iterations.

Figure 6: t-SNE feature visualization of the DDPM trained on the UDC-KDEF dataset for different numbers of iterations: (a) T = 1, (b) T = 2, (c) T = 4, (d) T = 8, (e) T = 16, (f) T = 32.
Figure 7: Investigation of the number of iterations in the DM.

In this part, we explore how the number of iterations in the DM affects the performance of LRDif_{S2}. We set different numbers of iterations in LRDif_{S2} and tune \beta_t (\alpha_t = 1 - \beta_t) in Eq. 15 so that Z becomes Gaussian noise Z_T \sim \mathcal{N}(0, 1) after the diffusion process (i.e., \alpha_T \rightarrow 0). The results are shown in Fig. 7. As the number of iterations increases to 4, the performance of LRDif_{S2} improves significantly. When the number of iterations exceeds 4, LRDif_{S2} remains almost stable, indicating that it has reached its upper bound. Moreover, LRDif_{S2} converges much faster than traditional DMs (which require more than 50 iterations), because we perform the DM only on the EPR, a compact one-dimensional vector.

V Conclusion

VI Acknowledgments

This work was partially supported by an Australian Government Research Training Program Scholarship.