
LRDif: Learning Representations in Diffusion Models for Under-Display Camera Emotion Recognition

Zhifeng Wang1, Kaihao Zhang1, Ramesh Sankaranarayana1
Australian National University1
Email: {zhifeng.wang, kaihao.zhang, ramesh.sankaranarayana}@anu.edu.au
Abstract

Under-display camera (UDC) technology provides an elegant solution for full-screen smartphones. However, UDC-captured images suffer from severe degradation because the sensor lies beneath the display, making it difficult for recent neural networks to recognise emotions in UDC images.

Index Terms:
emotion recognition, graph neural network, human-centered computing

I Introduction

In the realm of digital technology, the under-display camera (UDC) represents a significant advancement, merging aesthetic elegance with functional innovation. This emerging technology, primarily seen in modern smartphones and other digital devices, cleverly embeds the camera beneath the display screen, eliminating the need for external camera notches or cutouts and thereby offering an uninterrupted, edge-to-edge screen experience. However, the incorporation of UDC technology extends beyond mere aesthetic appeal and enters the realm of advanced applications like emotion recognition.

Emotion recognition, a subset of computer vision and artificial intelligence, involves analyzing facial expressions to identify human emotions. This technology has vast applications, ranging from enhancing user experience in consumer electronics to providing critical data in fields like mental health, marketing, and human-computer interaction. Facial expression recognition (FER) has undergone remarkable advancements in recent years and is now widely used across various industries. However, implementing emotion recognition using UDC poses unique challenges, primarily due to the inherent limitations of the UDC setup.

Figure 1: Comparison between two types of camera images with their respective color histograms. (a) An image captured with an under-display camera (UDC): it is less clear, slightly blurry, noisier, and its colors are less true to life. (b) An image captured with a traditional external camera: it is much clearer, and the girl's face is sharp, with well-defined features and more natural colors.

The fundamental challenge lies in the image quality and clarity. UDC images often suffer from reduced sharpness, increased noise, and color fidelity issues compared to those captured with traditional external cameras. These quality constraints stem from the fact that the camera lens is positioned beneath the screen, which can obstruct and scatter light in unpredictable ways. For emotion recognition algorithms, which rely heavily on the nuances of facial expressions, this can lead to decreased accuracy and reliability.

Moreover, UDC images may exhibit unique artifacts and lighting inconsistencies, further complicating the task of accurately detecting and interpreting emotions. These challenges necessitate not only advancements in UDC hardware but also tailored algorithmic approaches in software. Machine learning models used for emotion recognition must be adapted or retrained to account for the peculiarities of UDC images, ensuring they can effectively discern emotional states despite the additional noise and distortions. Previous studies on FER [1, 2] have not given adequate attention to the additional noise and distortions introduced by UDC images. However, addressing this challenge is crucial for enhancing the practical applications of FER in real-world scenarios.

Figure 2: The overview of the proposed LRDif, which consists of the UDCformer, FPEN, and a denoising network. LRDif has two training stages: (a) In the first stage, FPEN_{S1} takes the ground-truth label and the UDC image as input and outputs an EPR Z to guide the UDCformer in restoring labels. We optimize FPEN_{S1} and LRDif_{S1} together so that LRDif_{S1} can fully use the extracted EPR Z. (b) In the second stage, we use the strong data-estimation ability of the DM to estimate the EPR extracted by the pretrained FPEN_{S1}. Notably, we do not input the ground-truth label into FPEN_{S2} or the denoising network. In the inference stage, we only use the reverse process of the DM.

Currently, several approaches address the noisy-label learning problem in the face recognition field. RUL [2021rul] addresses uncertainties due to ambiguous expressions and inconsistent labels by weighting facial features based on their relative difficulty, improving performance in noisy data environments. EAC [2022eac] tackles noisy labels by using flipped semantic consistency and selective erasure of input images, preventing the model from over-focusing on partial features associated with noisy labels. SCN [2020scn] addresses uncertainties in large-scale datasets by employing a self-attention mechanism to weight training samples and a relabeling mechanism to adjust the labels of low-ranked samples, thereby reducing overfitting to uncertain facial images. However, all of these methods have shortcomings when applied to UDC images. Specifically, RUL [2021rul] and EAC [2022eac] are based on the small-loss assumption [2, 48], which can confuse hard samples with noisy samples, since both have large loss values during training. Thus, learning features from noisy labels and noisy images is a challenging task.

To address this problem, we view data uncertainty from a different perspective. Instead of following the traditional path of detecting noisy samples according to their loss values and then suppressing them, we approach noisy-label learning from a feature-learning perspective and propose a novel framework that deals with all the aforementioned defects in UDC images. We aim to design a diffusion-based FER network that can fully and efficiently use the powerful distribution-mapping ability of the DM to restore noisy label and image pairs. To this end, we propose LRDif. Since transformers can model long-range pixel dependencies, we adopt transformer blocks as the basic units of LRDif. We stack dynamic transformer blocks in a U-Net shape to form the Under-Display Camera Transformer (UDCformer), which extracts and aggregates multi-level features. The UDCformer consists of two sub-networks: the DTnetwork and the DILnetwork. The DTnetwork extracts latent features of labels and UDC images at multiple levels, while the DILnetwork learns the similarities between facial landmarks and UDC images. We train LRDif in two stages: (1) In the first stage (Fig. 2 (a)), we develop a compact fusion prior extraction network (FPEN) to extract a compact emotion prior representation (EPR) from the label and the UDC image to guide the UDCformer; notably, FPEN and the UDCformer are optimized jointly. (2) In the second stage (Fig. 2 (b)), we train the DM to estimate the EPR directly from UDC images. Since the EPR is lightweight and only adds details for label restoration, the DM can estimate a fairly accurate EPR and achieve high test accuracy after only a few iterations. The main contributions of this work are summarized as follows:

  • We innovatively introduce the under-display camera (UDC) problem into facial expression recognition and propose a diffusion-based framework to reduce the impact of the additional noise and distortions in UDC images.

  • We propose LRDif, a strong, simple, and efficient DM-based baseline for FER. Unlike previous methods, which require prior knowledge of the dataset's uncertainty distribution, we use the strong mapping ability of the DM to estimate a compact EPR to guide FER, which improves the prediction efficiency and stability of the DM in FER.

  • We propose DGNet and DMNet for the dynamic UDCformer to fully exploit UDC images at multiple scales. Unlike traditional latent DMs that optimize the denoising network individually, we propose joint optimization of the denoising network and the decoder (UDCformer) to further improve robustness to estimation errors.

  • Extensive experiments show that the proposed LRDif achieves SOTA performance on UDC emotion recognition tasks across several standard FER datasets, including RAF-DB, KDEF, and FERPlus, demonstrating the powerful expression-analysis capability of LRDif.

II Related Work

II-A Facial Expression Recognition

Generally, a FER system consists of three stages: face detection, feature extraction, and expression recognition. In the face detection stage, face detectors such as MTCNN [44] and Dlib [2] are used to locate faces in complex scenes; the detected faces can optionally be further aligned. For feature extraction, various methods have been designed to capture the facial geometry and appearance changes caused by facial expressions. Wang et al. [wang2021light] leverage a spatial attention mechanism to focus on emotion-relevant image areas, enhancing accuracy under real-world conditions with a lightweight network that can be embedded in standard convolutional neural networks. MRAN [chen2023multi] improves performance under uncontrolled conditions by applying spatial attention to both global and local facial features, employing transformers to model relationships within and between these features, and leveraging a sample-relation transformer to exploit similarities among training samples, all optimized through a joint strategy. MPACNN [li2021learning] uses multiple paths to extract diverse features, which are then attentively fused for both basic and compound expression recognition; concurrently, the BS loss serves as MPACNN's objective, maximizing inter-class and minimizing intra-class distances so that informative and discriminative features are learned simultaneously for high-accuracy recognition. FG-AGR [li2023fg] addresses challenges in uncontrolled environments using Adaptive Salient Region Induction (ASRI) to highlight key facial areas, Local Fine-grained Feature Extraction (LFFE) via vision transformers for detailed feature extraction, and Adaptive Graph Association Reasoning (AGAR) with graph convolutional networks to effectively learn combinations of these fine-grained features. Zhang et al. [zhang2023transformer] propose the Transformer-based Multimodal Emotional Perception (T-MEP) framework for dynamic expression recognition, employing specialized transformer-based encoders for the audio, image, and text modalities, together with a multimodal information fusion module that uses both self-attention and cross-attention to robustly integrate and augment cross-modal representations for a comprehensive understanding of human emotions.

II-B Diffusion Models

Rombach et al. [rombach2022high] enhance the efficiency and visual fidelity of diffusion models by training them in the latent space of pre-trained autoencoders and integrating cross-attention layers for versatile conditioning inputs, enabling high-resolution image synthesis with reduced computational demands while preserving detail. DiffusionDet [chen2023diffusiondet] is an object detection framework that treats detection as a denoising diffusion from random to precise object boxes, learning to reverse the diffusion from ground truth during training and iteratively refining randomly generated boxes during inference, which offers flexibility in the number of boxes and the evaluation method. StyleSwin [zhang2022styleswin] is a high-resolution image synthesis framework based on a pure transformer generative adversarial network; it incorporates the Swin transformer into a style-based architecture with double attention for larger receptive fields and absolute position knowledge, and adds a wavelet discriminator to overcome blocking artifacts and maintain spatial coherence. Yang et al. [yang2023paint] advance exemplar-guided image editing by employing self-supervised training to disentangle and rearrange elements from a source image and an exemplar, using an information bottleneck and strong augmentations to avoid copy-paste artifacts, and introducing an arbitrary-shape mask with classifier-free guidance for enhanced control, all within a single forward pass of a diffusion model. Whang et al. [whang2022deblurring] use conditional diffusion models to train a stochastic sampler that refines outputs from a deterministic predictor, producing a diverse set of perceptually superior reconstructions for a given blurred input while remaining efficient in sampling and competitive on standard distortion metrics such as PSNR.

Figure 3: The overview of the DTnetwork, which consists of DGNet and DMNet.

III Methods

III-A Pretraining the DTnetwork

Before introducing the pretraining of the DTnetwork, we describe the two networks used in the first stage: a compact fusion prior extraction network (FPEN) and a dynamic transformer network (DTnetwork). The structure of FPEN is shown in the yellow box of Fig. 2; it mainly consists of linear layers that extract the compact emotion prior representation (EPR). The DTnetwork then uses the extracted EPR to restore the labels of UDC images. The structure of the DTnetwork is shown in the red box of Fig. 2; it consists of a dynamic gated feed-forward network (DGNet) and a dynamic multi-head transposed attention module (DMNet), which use the EPR as dynamic modulation parameters to add label-restoration details to the feature maps for emotion recognition.
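Since FPEN is described only as a stack of linear layers operating on CLIP embeddings, the following is a minimal PyTorch sketch of such a prior-extraction module; the hidden width, the EPR dimension C, and the exact input dimensions are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class FPEN(nn.Module):
    """Minimal sketch of the fusion prior extraction network (FPEN).

    Assumed layout: a small MLP that maps the concatenated CLIP label and
    image embeddings to a compact emotion prior representation (EPR) Z in R^C.
    """
    def __init__(self, in_dim=1024, hidden_dim=256, epr_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.LeakyReLU(0.1),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LeakyReLU(0.1),
            nn.Linear(hidden_dim, epr_dim),
        )

    def forward(self, feats):
        # feats: (B, in_dim) = Concat(E_L(y), E_I(x)) in stage 1,
        # or E_I(x) alone in stage 2 (with a different in_dim).
        return self.mlp(feats)

# Example: stage-1 input assumed to be a 512-d CLIP text embedding
# concatenated with a 512-d CLIP image embedding.
fpen_s1 = FPEN(in_dim=1024, epr_dim=64)
z = fpen_s1(torch.randn(8, 1024))   # EPR Z, shape (8, 64)
```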

Algorithm 1 LRDif_{S2} Training

Input: Ground-truth label y, UDC image x, label encoder E_L, UDC image encoder E_I, FPEN_{S1}, UDCformer
Output: Trained LRDif_{S2}

1: Initialize: \alpha_t = 1 - \beta_t, \bar{\alpha}_T = \prod_{i=0}^{T}\alpha_i.
2: for 1, \ldots, N do
3:     Z = \text{FPEN}_{S1}(\text{Concat}(E_L(y), E_I(x)))
4:     Diffusion Process:
5:     Sample Z_T by q(Z_T \mid Z) = \mathcal{N}(Z_T; \sqrt{\bar{\alpha}_T}\,Z, (1-\bar{\alpha}_T)I)
6:     Reverse Process:
7:     Z' = Z_T
8:     x_{S2} = \text{FPEN}_{S2}(E_I(x))
9:     for t = T to 1 do
10:        Z'_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(Z'_t - \epsilon_\theta(\text{Concat}(Z'_t, t, x_{S2}))\,\frac{1-\alpha_t}{\sqrt{1-\alpha_t}}\big)
11:    end for
12:    y' = \text{UDCformer}(x, Z'_0, x_{S2})
13:    Calculate the \mathcal{L}_d loss
14: end for
15: Output the trained model LRDif_{S2}.
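For concreteness, the sketch below mirrors Algorithm 1 in PyTorch under several assumptions: the encoders, FPEN modules, denoising network eps_net, and UDCformer are treated as black-box callables with compatible shapes, and the β schedule and optimizer handling are illustrative only.

```python
import torch
import torch.nn.functional as F

def train_lrdif_s2_step(x, y, E_L, E_I, fpen_s1, fpen_s2, eps_net, udcformer,
                        betas, optimizer):
    """One LRDif_S2 training step following Algorithm 1 (sketch only).

    y: ground-truth labels, used both as input to the CLIP label encoder E_L
    and as class indices for the cross-entropy term.
    """
    alphas = 1.0 - betas                      # alpha_t = 1 - beta_t
    alpha_bar = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t
    T = len(betas)

    with torch.no_grad():
        # Target EPR Z from the frozen, pretrained stage-1 FPEN (Eq. 1).
        z = fpen_s1(torch.cat([E_L(y), E_I(x)], dim=-1))

    # Diffusion process: sample Z_T ~ q(Z_T | Z) in closed form (Eq. 15).
    noise = torch.randn_like(z)
    z_t = torch.sqrt(alpha_bar[-1]) * z + torch.sqrt(1.0 - alpha_bar[-1]) * noise

    # Conditional vector extracted from the UDC image only (Eq. 16).
    x_s2 = fpen_s2(E_I(x))

    # Reverse process: iteratively denoise Z_T towards Z_0' (Eq. 17).
    for t in reversed(range(T)):
        t_emb = torch.full((z_t.shape[0], 1), float(t), device=z_t.device)
        eps = eps_net(torch.cat([z_t, t_emb, x_s2], dim=-1))
        z_t = (z_t - eps * (1.0 - alphas[t]) / torch.sqrt(1.0 - alphas[t])) \
              / torch.sqrt(alphas[t])

    # Decode the label with the UDCformer and jointly optimize (Eq. 19).
    logits = udcformer(x, z_t, x_s2)
    loss = F.cross_entropy(logits, y) + F.kl_div(
        F.log_softmax(z_t, dim=-1), F.softmax(z, dim=-1), reduction="batchmean")

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```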

In the pretraining stage (Fig. 2 (a)), we train FPEN_{S1} and the DTnetwork together. We first concatenate the ground-truth label embedding and the UDC image embedding, obtained from the CLIP [radford2021clip] text and image encoders, to form the input to FPEN_{S1}. Then, FPEN_{S1} extracts the EPR Z \in R^C as:

Z = \text{FPEN}_{S1}(\text{Concat}(E_L(y), E_I(x)))   (1)

Then the EPR Z is sent into the DGNet and DMNet of the DTnetwork as dynamic modulation parameters to guide label restoration:

F' = W_1^l Z \circ \text{LN}(F) + W_2^l Z   (2)

where \circ indicates element-wise multiplication, LN denotes layer normalization, W^l represents the weights of a linear layer, and F and F' are the input and output feature maps.
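A minimal sketch of this EPR-conditioned modulation (Eq. 2) is given below; the feature and EPR dimensions, and the token-wise layout of F, are assumed.

```python
import torch
import torch.nn as nn

class EPRModulation(nn.Module):
    """Sketch of Eq. (2): F' = (W_1^l Z) ∘ LN(F) + (W_2^l Z)."""
    def __init__(self, epr_dim=64, feat_dim=128):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)
        self.to_scale = nn.Linear(epr_dim, feat_dim)  # W_1^l
        self.to_shift = nn.Linear(epr_dim, feat_dim)  # W_2^l

    def forward(self, feat, z):
        # feat: (B, N, feat_dim) token features, z: (B, epr_dim) EPR.
        scale = self.to_scale(z).unsqueeze(1)   # (B, 1, feat_dim)
        shift = self.to_shift(z).unsqueeze(1)
        return scale * self.norm(feat) + shift  # element-wise modulation

mod = EPRModulation()
out = mod(torch.randn(2, 49, 128), torch.randn(2, 64))   # (2, 49, 128)
```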

Then, we aggregate global spatial information in DMNet. Specifically, F' is projected into query Q, key K, and value V using a convolution layer followed by a depth-wise convolution layer. We then reshape the query Q to H''W'' \times C'', the key K to C'' \times H''W'', and the value V to H''W'' \times C''. After that, we perform a dot-product between Q and K to generate a transposed-attention map A of size C'' \times C''. The overall process of DMNet can be described as follows:

F"=WcV×softmax(K×Q/α)+FF^{"}=W_{c}V\times softmax(K\times Q/\alpha)+F (3)

where \alpha is a learnable scaling parameter. Next, in DGNet, we aggregate local features. We use a 1\times 1 convolution to aggregate information across channels and a 3\times 3 depth-wise convolution to aggregate information from spatially neighboring pixels. In addition, we adopt a gating mechanism to enhance information encoding. The overall process of DGNet is defined as:

F"=GELU(Wd1Wc1F)Wd2Wc2F+FF^{"}=GELU(W^{1}_{d}W^{1}_{c}F^{{}^{\prime}})\circ W^{2}_{d}W^{2}_{c}F^{{}^{\prime}}+F (4)

III-B Dynamic Image and Landmarks Network (DILnetwork)

Algorithm 2 LRDif_{S2} Inference

Input: Trained LRDif_{S2}, UDC image x
Output: Restored label y'

1: Initialize: \alpha_t = 1 - \beta_t, \bar{\alpha}_T = \prod_{i=0}^{T}\alpha_i.
2: Reverse Process:
3: Sample Z'_T \sim \mathcal{N}(0, 1)
4: x_{S2} = \text{FPEN}_{S2}(E_I(x))
5: for t = T to 1 do
6:     Z'_{t-1} = \frac{1}{\sqrt{\alpha_t}}\big(Z'_t - \epsilon_\theta(\text{Concat}(Z'_t, t, x_{S2}))\,\frac{1-\alpha_t}{\sqrt{1-\alpha_t}}\big)
7: end for
8: y' = \text{UDCformer}(x, Z'_0, x_{S2})
9: Output the restored label y'

In the DILnetwork, we use a window-based cross-attention mechanism to fuse facial landmark features and UDC image features. For the UDC image features X_{udc} \in R^{N \times D}, we first divide them into several non-overlapping windows x_{udc} \in R^{M \times D}. For the facial landmark features X_{flm} \in R^{C \times H \times W}, we first down-sample them to the window size x_{flm} \in R^{c \times h \times w}, where c = D and M = h \times w. We can then apply cross-attention with N heads between the facial landmarks and the UDC images.

Q = x_{flm} w_Q, \quad K = x_{udc} w_K, \quad V = x_{udc} w_V   (5)
O_i = \text{Softmax}\!\left(\frac{Q_i K_i^T}{\sqrt{d}} + b\right) V_i, \quad i = 1, \ldots, N   (6)
O = [O_1, O_2, \ldots, O_N] W_O   (7)

where w_Q, w_K, w_V, and W_O are the projection matrices and b is the relative position bias.

We perform the above cross-attention calculation for all windows and refer to this mechanism as window-based multi-head cross-attention (MHCA). The cross-fusion transformer encoder in LRDif can thus be expressed as follows:

X'_{udc} = \text{MHCA}(X_{udc}) + X_{udc}   (8)
X''_{udc} = \text{MLP}(\text{LN}(X'_{udc})) + X'_{udc}   (9)

We merge the output features F from the DTnetwork and the output features O from the DILnetwork to obtain the fused multi-scale features x_1, x_2, and x_3, where x_1 = \text{Concat}(F_1, O_1), x_2 = \text{Concat}(F_2, O_2), and x_3 = \text{Concat}(F_3, O_3). The fused features X are then fed into vanilla transformer blocks for processing.

X = [x_1, x_2, x_3],   (10)
X' = \text{MSA}(X) + X,   (11)
y' = \text{MLP}(\text{LN}(X')) + X',   (12)

where MSA denotes the multi-head self-attention block and LN denotes layer normalization. The training loss is defined as follows:

\mathcal{L}_{ce} = -\sum_{i=1}^{N} \sum_{c=1}^{M} y_{ic} \log(p_{ic})   (14)

We use the cross-entropy loss to train our model, where N is the number of samples and M is the number of classes. y_{ic} indicates whether class c is the correct classification for sample i (1 if true, 0 otherwise), and p_{ic} is the predicted probability that sample i belongs to class c.
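A brief sketch of the multi-scale fusion, the vanilla transformer block, and the cross-entropy objective of Eqs. (10)-(14) follows; the projection to a common token width, the mean-pooled classification head, and the class count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionClassifier(nn.Module):
    """Sketch of Eqs. (10)-(14): fuse multi-scale features, apply a vanilla
    transformer block, and classify with cross-entropy."""
    def __init__(self, dim=128, num_heads=4, num_classes=7):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                                batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x1, x2, x3):
        # x1, x2, x3: (B, N_i, dim) fused tokens from DTnetwork and DILnetwork.
        x = torch.cat([x1, x2, x3], dim=1)      # Eq. (10): X = [x1, x2, x3]
        x = self.block(x)                       # Eqs. (11)-(12): MSA + MLP
        return self.head(x.mean(dim=1))         # pooled logits

model = FusionClassifier()
logits = model(torch.randn(2, 16, 128), torch.randn(2, 16, 128),
               torch.randn(2, 16, 128))
loss = F.cross_entropy(logits, torch.tensor([0, 3]))   # Eq. (14)
```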

III-C Diffusion Models for Label Restoration

In the second stage (Fig. 2 (b)), we exploit the strong data-estimation ability of the DM to estimate the EPR. Specifically, we use the pretrained FPEN_{S1} to capture the EPR Z \in R^C. We then apply the diffusion process to Z to sample Z_T \in R^C, which can be described as:

q(Z_T \mid Z) = \mathcal{N}(Z_T; \sqrt{\bar{\alpha}_T}\,Z, (1 - \bar{\alpha}_T)I)   (15)

where T is the total number of iterations, \bar{\alpha}_T = \prod_{i=0}^{T}\alpha_i, and \alpha_t = 1 - \beta_t. \beta_t is a predefined scale factor, and \mathcal{N}(\cdot) denotes the Gaussian distribution.
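Because the forward process is Gaussian, Z_T can be sampled from Eq. (15) in a single closed-form step, as in the sketch below; the linear β schedule is an assumption, since the paper only states that β_t is predefined.

```python
import torch

def diffuse_epr(z, T=8, beta_start=0.1, beta_end=0.99):
    """Sample Z_T ~ q(Z_T | Z) from Eq. (15) in one step (sketch).

    A linear beta schedule is assumed for illustration.
    """
    betas = torch.linspace(beta_start, beta_end, T)
    alpha_bar_T = torch.prod(1.0 - betas)                 # \bar{alpha}_T
    noise = torch.randn_like(z)
    return torch.sqrt(alpha_bar_T) * z + torch.sqrt(1 - alpha_bar_T) * noise

z_T = diffuse_epr(torch.randn(8, 64))   # EPR after the diffusion process
```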

In the reverse process of the DM, we first use the CLIP image encoder E_I to encode the input UDC image x. The encoded features are then sent to FPEN_{S2} to obtain a conditional vector x_{S2} \in R^C from the UDC image:

x_{S2} = \text{FPEN}_{S2}(E_I(x))   (16)

where FPEN_{S2} has the same structure as FPEN_{S1}, except for the input dimension of the first linear layer. We then use the denoising network \epsilon_\theta to estimate the noise at each time step t as \epsilon_\theta(\text{Concat}(Z'_t, t, x_{S2})). The estimated noise is substituted into Eq. (17) to obtain Z'_{t-1} for the next iteration:

Z'_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(Z'_t - \epsilon_\theta(\text{Concat}(Z'_t, t, x_{S2}))\,\frac{1-\alpha_t}{\sqrt{1-\alpha_t}}\right)   (17)

After T iterations, we obtain the final estimated EPR Z'_0. We jointly train FPEN_{S2}, the denoising network, and the UDCformer using \mathcal{L}_{total}:

\mathcal{L}_{kl} = \sum_{i=1}^{C} Z_{\text{norm}}(i) \log\!\left(\frac{Z_{\text{norm}}(i)}{\bar{Z}_{\text{norm}}(i)}\right)   (18)
\mathcal{L}_{total} = \mathcal{L}_{ce} + \mathcal{L}_{kl}   (19)

where Z_{\text{norm}}(i) and \bar{Z}_{\text{norm}}(i) are the EPRs extracted by LRDif_{S1} and LRDif_{S2}, respectively, normalized with a softmax operation. \mathcal{L}_{kl} is a variant of the Kullback-Leibler divergence, summed over the C dimensions. We add the Kullback-Leibler divergence loss \mathcal{L}_{kl} (Eq. 18) and the cross-entropy loss \mathcal{L}_{ce} (Eq. 14) to form the total loss \mathcal{L}_{total} (Eq. 19). Since the EPR encodes the UDC image and the related emotion label via CLIP, LRDif_{S2} requires only a few iterations and a small model size to obtain accurate estimates.
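A minimal sketch of the joint objective in Eqs. (18)-(19), with both EPRs softmax-normalized before the KL term; tensor shapes are assumed.

```python
import torch
import torch.nn.functional as F

def lrdif_total_loss(logits, labels, z_s1, z_s2_hat):
    """Sketch of Eq. (19): L_total = L_ce + L_kl.

    z_s1 is the EPR from the pretrained stage-1 FPEN, z_s2_hat is the EPR
    estimated by the diffusion model; both are softmax-normalized (Eq. 18).
    """
    l_ce = F.cross_entropy(logits, labels)                        # Eq. (14)
    z_norm = F.softmax(z_s1, dim=-1)                              # Z_norm
    z_hat_log = F.log_softmax(z_s2_hat, dim=-1)                   # log of \bar{Z}_norm
    l_kl = F.kl_div(z_hat_log, z_norm, reduction="batchmean")     # Eq. (18)
    return l_ce + l_kl
```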

In the inference stage, we only use the reverse diffusion process, as described in Algorithm 2. FPEN_{S2} extracts a conditional vector x_{S2} from the CLIP-encoded UDC image, and we randomly sample Gaussian noise Z'_T. The denoising network uses Z'_T and x_{S2} to estimate the EPR Z' after T iterations. The UDCformer then exploits this EPR to restore the label y'.

IV Experiments

IV-A Datasets

RAF-DB [rafdb] is collected in the wild, with each image annotated by 40 independent raters to ensure precision. It is composed of two subsets: a basic subset with single-label annotations and a compound subset with dual-label annotations. Our study uses the basic subset, which includes seven emotion categories: surprise, fear, disgust, happiness, sadness, anger, and neutral. The dataset is partitioned into a training set of 12,271 images and a testing set of 3,068 images, both with a nearly identical distribution of expressions.

FERPlus [ferplus] is an extension of FER2013 [goodfellow2013challenges], enriched with a new set of labels created by ten annotators. The enhanced dataset includes 28,709 images for training and 7,178 for testing. FERPlus adds ‘Contempt’ as an additional emotion category, extending the classification to eight distinct emotional classes.

KDEF [kdef] is a comprehensive dataset of 4,900 images designed to depict human emotional expressions. It includes images of 70 amateur actors, balanced between 35 females and 35 males, each demonstrating seven distinct emotional expressions. A unique aspect of this dataset is its multi-angle design: each expression is captured from five different angles, offering a rich and diverse set of visual data. Participants were selected based on specific criteria: they are between 20 and 30 years of age and have a clear, unobstructed facial appearance.

The UDC-RAF-DB dataset comprises a training set of 12,271 images and a testing set of 3,068 images, offering a robust platform for developing and testing FER algorithms in UDC environments.

The UDC-FERPlus dataset extends FERPlus into the UDC domain, providing 28,709 UDC images for training and 7,178 for testing.

The UDC-KDEF dataset includes a total of 4,900 UDC images captured from five different angles. The training set comprises 3,920 UDC images and the testing set includes 980 images, providing a wide range of data for training and evaluating FER systems in UDC contexts.

IV-B Implementation Details

We trained a SOTA MPGNet [zhou2022modular] to model the UDC imaging degradation process and a U-Net-like network called DWFormer for UDC image generation. The UDC imaging degradation process consists of brightness attenuation, blurring, and noise corruption. We use the pretrained MPGNet to synthesize three benchmark facial expression recognition (FER) datasets: UDC-RAF-DB, UDC-FERPlus, and UDC-KDEF. All experiments are implemented in PyTorch, and the models are trained on an RTX 3090 GPU.
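MPGNet itself is not reproduced here, but a simple stand-in for the three stated degradation factors (brightness attenuation, blurring, and noise corruption) can be sketched as follows; the attenuation factor, blur width, and noise level are illustrative values only and do not correspond to the learned MPGNet degradation.

```python
import torch
import torch.nn.functional as F

def simulate_udc_degradation(img, attenuation=0.6, blur_sigma=2.0, noise_std=0.05):
    """Illustrative stand-in for UDC degradation (NOT MPGNet): brightness
    attenuation, Gaussian blurring, and additive Gaussian noise.

    img: (B, 3, H, W) tensor with values in [0, 1].
    """
    # 1) Brightness attenuation caused by the display covering the sensor.
    out = img * attenuation

    # 2) Blurring: separable Gaussian kernel approximating light scattering.
    k = torch.arange(-4, 5, dtype=torch.float32)
    g = torch.exp(-k ** 2 / (2 * blur_sigma ** 2))
    g = (g / g.sum()).to(img.device)
    channels = img.shape[1]
    kernel_h = g.view(1, 1, 1, -1).repeat(channels, 1, 1, 1)
    kernel_v = g.view(1, 1, -1, 1).repeat(channels, 1, 1, 1)
    out = F.conv2d(out, kernel_h, padding=(0, 4), groups=channels)
    out = F.conv2d(out, kernel_v, padding=(4, 0), groups=channels)

    # 3) Noise corruption.
    out = out + noise_std * torch.randn_like(out)
    return out.clamp(0.0, 1.0)

degraded = simulate_udc_degradation(torch.rand(1, 3, 224, 224))
```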

IV-C Comparison with SOTA FER Methods

IV-C1 Comparison with Typical FER-model

TABLE I: Comparison results with SOTA FER algorithms on RAF-DB, FERPlus and KDEF.
RAF-DB FERPlus KDEF
Methods Acc. (%) Methods Acc. (%) Methods Acc. (%)
ARM[2021arm] 90.42 DACL[2021dacl] 83.52 DACL[2021dacl] 88.61
POSTER++[2023posterv2] 92.21 POSTER++[2023posterv2] 86.46 POSTER++[2023posterv2] 94.44
RUL[2021rul] 88.98 RUL[2021rul] 85.00 RUL[2021rul] 87.83
DAN[2023dan] 89.70 DAN[2023dan] 85.48 DAN[2023dan] 88.77
SCN[2020scn] 87.03 SCN[2020scn] 83.11 SCN[2020scn] 89.55
EAC[2022eac] 90.35 EAC[2022eac] 86.18 EAC[2022eac] 72.32
MANet[2021manet] 88.42 MANet[2021manet] 85.49 MANet[2021manet] 91.75
Ours 92.24 Ours 87.13 Ours 95.75

Table I presents a comprehensive comparison of accuracy between the proposed method and current state-of-the-art (SOTA) Facial Expression Recognition (FER) algorithms [2020scn, 2021manet, 2021arm, 2022eac, 2021dacl, 2021rul, 2023dan, 2023posterv2] across three benchmark datasets: RAF-DB, FERPlus, and KDEF. For the RAF-DB dataset, the proposed method (‘Ours’) achieves an accuracy of 92.24%, surpassing several established algorithms such as ARM[2021arm], RUL[2021rul], DAN[2023dan], SCN[2020scn], EAC[2022eac], and MANet[2021manet], and is competitive with POSTER++[2023posterv2], which scores slightly lower at 92.21%. In the FERPlus dataset, the proposed method also demonstrates superior performance with an accuracy of 87.13%, outperforming other notable techniques like DACL[2021dacl], RUL[2021rul], DAN[2023dan], SCN[2020scn], and MANet[2021manet]. Remarkably, in the KDEF dataset comparison, the proposed method achieves the highest accuracy of 95.75%, indicating a significant advancement over other methodologies, including the second-highest performing POSTER++[2023posterv2] at 94.44% and the lower-scoring MANet[2021manet] at 91.75%. Overall, this table highlights the proposed method’s robustness and effectiveness in facial expression recognition tasks, as evidenced by its leading accuracy rates against a range of SOTA algorithms on diverse and challenging datasets.

IV-C2 Comparison with the UDC FER-model

We conducted a performance evaluation of our proposed facial expression recognition (FER) model, specifically tailored for under-display camera (UDC) systems, against other leading state-of-the-art FER models.

Results on UDC-RAF-DB. Table II delineates the comparative accuracy of various state-of-the-art facial expression recognition (FER) models on the UDC-RAF-DB dataset. For each method, the table reports the backbone architecture, the number of expression classes, and the accuracy (%). RUL[2021rul], DAN[2023dan], SCN[2020scn], EAC[2022eac], and MANet[2021manet] use the ResNet-18 architecture, POSTER++[2023posterv2] is based on the Vision Transformer (ViT), and ARM[2021arm] employs a modified ResNet-18 called ResNet18-ARM. The proposed model diverges from this convention by adopting a diffusion backbone, yielding an accuracy of 87.90% and a notable improvement over the other methods. All models are evaluated on the same seven expression classes, maintaining a consistent comparison framework. The results underscore the efficacy of the diffusion-based model for UDC-based FER systems.

Results on UDC-FERPlus. Table III offers an analytical comparison of accuracy for various state-of-the-art models as applied to the UDC-FERPlus Dataset. A closer comparative analysis reveals that the proposed model outperforms its counterparts with an accuracy of 84.89%, marking a significant advancement in precision over the other techniques. POSTER++ [2023posterv2] with VIT architecture shows competitive performance at 83.78%, followed closely by DAN [2023dan] with ResNet-18 at 83.25% and MANet[2021manet] with ResNet-18 at 83.19%. The remaining methods, while also based on the ResNet-18 architecture, register lower accuracy rates, with EAC [2022eac] at 82.72%, RUL [2021rul] at 82.15%, DACL [2021dacl] at 78.11%, and SCN [2020scn] at the lower end with 77.38%. In summary, the proposed model exhibits superior performance on the UDC-FERPlus Dataset, setting a new benchmark for accuracy within the domain and underscoring the potential of utilizing advanced backbone architectures for enhancing FER system efficacy in under-display camera applications.

Results on UDC-KDEF. Table IV provides an elaborate benchmarking of accuracy rates among leading facial expression recognition (FER) models on the UDC-KDEF dataset. The proposed method achieves an impressive accuracy of 94.07%, surpassing all competing methodologies. Notably, POSTER++[2023posterv2] is a strong contender, realizing an accuracy of 91.92% owing to its Vision Transformer (ViT) architecture, while MANet[2021manet] also delivers a commendable performance with an 88.48% accuracy rate. Other methods, although effective, yield relatively lower accuracy rates: DAN[2023dan] attains 85.71%, DACL[2021dacl] records 84.44%, RUL[2021rul] achieves 82.69%, and SCN[2020scn] reports 78.69%. The EAC[2022eac] method exhibits the lowest accuracy at 54.31%, which may be attributed to its lack of robustness against the varied perspectives characteristic of UDC images. The superior performance of the proposed model underscores its potential as a robust solution for FER tasks in UDC settings.

TABLE II: Comparison of accuracy (%) with state-of-the-art results on the UDC-RAF-DB Dataset.
Method Backbone Classes Acc.(%)
ARM[2021arm] ResNet18-ARM 7 86.44
POSTER++[2023posterv2] VIT 7 86.76
RUL[2021rul] ResNet-18 7 85.59
DAN[2023dan] ResNet-18 7 86.47
SCN[2020scn] ResNet-18 7 85.89
EAC[2022eac] ResNet-18 7 86.51
MANet[2021manet] ResNet-18 7 85.62
Ours Diffusion 7 88.55
TABLE III: Comparison of accuracy (%) with state-of-the-art results on the UDC-FERPlus Dataset.
Method Backbone Classes Acc.(%)
DACL[2021dacl] ResNet18 8 78.11
POSTER++[2023posterv2] VIT 8 83.78
RUL[2021rul] ResNet-18 8 82.15
DAN[2023dan] ResNet-18 8 83.25
SCN[2020scn] ResNet-18 8 77.38
EAC[2022eac] ResNet-18 8 82.72
MANet[2021manet] ResNet-18 8 83.19
Ours Diffusion 8 84.89
TABLE IV: Comparison of accuracy (%) with state-of-the-art results on the UDC-KDEF Dataset.
Method Backbone Classes Acc.(%)
DACL[2021dacl] ResNet18 7 84.44
POSTER++[2023posterv2] VIT 7 91.92
RUL[2021rul] ResNet-18 7 82.69
DAN[2023dan] ResNet-18 7 85.71
SCN[2020scn] ResNet-18 7 78.69
EAC[2022eac] ResNet-18 7 54.31
MANet[2021manet] ResNet-18 7 88.48
Ours Diffusion 7 94.07

IV-D FLOPs and Param Comparison

TABLE V: Comparison based on Param and FLOPs on UDC-RAF-DB and UDC-KDEF. Our method provides competitive performance while maintaining a manageable computational cost.
Method #Params #FLOPs RAF-DB KDEF
ARM[2021arm] 11.2M 1.8G 86.44 88.49
POSTER++[2023posterv2] 58.3M 8.5G 86.76 91.92
RUL[2021rul] 14.4M 1.8G 85.59 82.69
DAN[2023dan] 19.7M 2.2G 86.47 85.71
SCN[2020scn] 11.2M 1.8G 85.89 78.69
EAC[2022eac] 23.5M 3.9G 86.51 54.31
MANet[2021manet] 50.5M 3.7G 85.62 88.48
Ours 211.5M 8.9G 87.90 94.07

Table V provides a comparative analysis of various facial expression recognition (FER) models, focusing on parameters and FLOPs as evaluated on the UDC-RAF-DB and UDC-KDEF datasets. Our proposed model stands out with a substantial 211.5 million parameters and 8.9 billion FLOPs, indicating a more complex and computationally intensive architecture. Nevertheless, it achieves a high accuracy of 87.90% on the UDC-RAF-DB dataset and a notable 94.07% on the UDC-KDEF dataset, suggesting that the increased computational requirements are justified by improved performance. In contrast, models such as ARM[2021arm], RUL[2021rul], and SCN[2020scn], while more lightweight at around 11.2 million parameters and 1.8 billion FLOPs, exhibit moderate accuracy. POSTER++[2023posterv2], with a middle-ground design of 58.3 million parameters and 8.5 billion FLOPs, achieves substantial accuracy, particularly on the UDC-KDEF dataset with 91.92%. MANet[2021manet], another relatively large model with 50.5 million parameters and 3.7 billion FLOPs, shows competitive but slightly lower accuracy than our model. EAC[2022eac], with 23.5 million parameters and 3.9 billion FLOPs, records the lowest accuracy on the UDC-KDEF dataset, suggesting a possible inefficiency of its architecture when dealing with the specific challenges of UDC images. Overall, the table underscores the trade-offs between model size, computational cost, and performance, showing that our model achieves robust results at a manageable computational cost and thereby provides an effective solution for high-accuracy FER in UDC contexts.

Figure 5: The learned feature distributions of SCN and LRDif trained on the RAF-DB dataset: (a) SCN features on clear images; (b) SCN features on UDC images; (c) LRDif features on clear images; (d) LRDif features on UDC images.

IV-E Feature Visualization

We use t-SNE to visualise the learned feature distributions of different methods to show the effectiveness of LRDif on UDC images. The results are shown in Fig. 5. The comparison across facial expressions shows that LRDif encourages intra-class compactness and inter-class separability of the learned features. Comparing Fig. 5 (a) and Fig. 5 (b), SCN cannot separate the emotion categories well, especially on UDC images. In contrast, LRDif (Ours) can recognize expressions from mixed features and noisy images, as it is forced to learn the most discriminative features that distinguish an expression image from all other expressions. Comparing Fig. 5 (c) and Fig. 5 (d), we conclude that LRDif automatically prevents the model from memorizing noisy features and learns useful features from UDC images.
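For reference, this kind of feature visualization can be produced with scikit-learn's t-SNE on features collected from the penultimate layer; the feature-collection step and plot settings below are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title="LRDif features on UDC-RAF-DB"):
    """Project penultimate-layer features to 2-D with t-SNE and scatter-plot
    them colored by the seven expression classes (sketch)."""
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(features)
    plt.figure(figsize=(5, 5))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=4)
    plt.title(title)
    plt.tight_layout()
    plt.show()

# features: (N, D) numpy array collected from the model; labels: (N,) ints.
plot_tsne(np.random.randn(500, 128), np.random.randint(0, 7, size=500))
```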

IV-F Ablation Study

TABLE VI: Ablation study of accuracy (%) on the UDC-RAF-DB Dataset. The core components of LRDif considered are the ground truth, the diffusion model, \mathcal{L}_{ce}, \mathcal{L}_{total}, and noise insertion.
Method Acc.(%)
LRDif_{S1} 100
LRDif_{S2}-V1 86.08
LRDif_{S2}-V2 88.98
LRDif_{S2}-V3 (Ours) 88.55
LRDif_{S2}-V4 88.78

The loss functions for DM.

Impact of the number of iterations.

Figure 6: t-SNE feature visualization of the DDPM trained on the UDC-KDEF dataset for different numbers of iterations: (a) T = 1, (b) T = 2, (c) T = 4, (d) T = 8, (e) T = 16, (f) T = 32.
Figure 7: Investigation of the number of iterations in the DM.

In this part, we explore how the number of iterations in the DM affects the performance of LRDif_{S2}. We set different numbers of iterations in LRDif_{S2} and tune \beta_t (\alpha_t = 1 - \beta_t) in Eq. 15 so that Z becomes Gaussian noise Z_T \sim \mathcal{N}(0, 1) after the diffusion process (i.e., \alpha_T \rightarrow 0). The results are shown in Fig. 7. As the number of iterations increases to 4, the performance of LRDif_{S2} improves significantly. When the number of iterations exceeds 4, LRDif_{S2} remains almost stable, indicating that it has reached its upper bound. Moreover, LRDif_{S2} converges much faster than traditional DMs (which require more than 50 iterations), because we perform the DM only on the EPR, a compact one-dimensional vector.

V Conclusion

VI Acknowledgments

This work was partially supported by an Australian Government Research Training Program Scholarship.