
Yingjie [email protected]
Jiarui [email protected]
Tao [email protected]
Yun [email protected]
School of Computer Science, Peking University, Beijing, China

EventFormer: AU Event Transformer for Facial Action Unit Event Detection

Abstract

Facial action units (AUs) play an indispensable role in human emotion analysis. We observe that although AU-based high-level emotion analysis is urgently needed by real-world applications, frame-level AU results provided by previous works cannot be directly used for such analysis. Moreover, as AUs are dynamic processes, the utilization of global temporal information is important but has been gravely ignored in the literature. To this end, we propose EventFormer for AU event detection, which is the first work directly detecting AU events from a video sequence by viewing AU event detection as a multiple class-specific sets prediction problem. Extensive experiments conducted on a commonly used AU benchmark dataset show the superiority of EventFormer under suitable metrics.

1 Introduction

Facial expression, as the most expressive emotional signal, plays an essential role in human emotion analysis. According to the Facial Action Coding System (FACS) [Ekman and Friesen(1978)], facial action units (AUs) refer to a set of facial muscle movements and are the basic components of almost all facial behaviors. The increasing need for user emotion analysis in application scenarios such as online education and remote interviews has led to rapid growth in the field of AU analysis in recent years. With the prosperity of deep learning, the two mainstream tasks of AU analysis, AU recognition [Ciftci et al.(2017)Ciftci, Zhang, and Tin, Wang and Wang(2018), Tirupattur et al.(2021)Tirupattur, Duarte, Rawat, and Shah, Chen et al.(2021b)Chen, Wu, Wang, Wang, and Liang, Yang and Yin(2020), Chen et al.(2021a)Chen, Chen, Wang, Wang, and Liang] and AU intensity estimation [Baltrušaitis et al.(2017)Baltrušaitis, Li, and Morency, Fan et al.(2020b)Fan, Lam, and Li, Song et al.(2020)Song, Shi, Feng, Song, Lin, Lin, Fan, and Yuan, Fan et al.(2020a)Fan, Shen, Cheng, and Tian, Song et al.(2021)Song, Cui, Wang, Zheng, and Ji], have seen great improvements, both of which aim to estimate AU occurrence state or intensity for a given frame, i.e., frame-level AU results.

However, when it comes to high-level emotion analysis, frame-level analysis results are not sufficient for some real-world applications [Schmidt et al.(2006)Schmidt, Ambadar, Cohn, and Reed, Cohn and Schmidt(2003)], due to the lack of various kinds of sequence-level information needed for further analysis, such as the occurrence frequencies, durations, and chronological order of AU events, each of which is a temporal segment containing one AU’s complete temporal evolution, as shown in Fig. 1. For example, in public places such as airports, unnatural facial expressions or fleeting panic serve as key cues for identifying a suspicious passenger. In this case, sequence-level AU event results are required to capture such abnormal emotions, and based on them, warnings can be sent to officers for a further inspection of the person. Furthermore, sequence-level AU event results are also important for distinguishing between spontaneous and posed facial expressions. For happiness, the identification depends heavily on how events labeled AU6 and AU12 overlap, in which case not only the AU events’ durations but also their chronological order matters.

Some works [Chen et al.(2021c)Chen, Zhang, Chen, Wang, Wang, and Liang, Ding et al.(2013)Ding, Chu, De la Torre, Cohn, and Wang] have attempted to generate AU event results from frame-level or unit-level (a fixed number of frames centered on the current one is regarded as a unit) AU results via a series of postprocessing steps, but they suffer from the lack of global temporal information and are highly dependent on the hyper-parameters of the postprocessing. To this end, we design an AU Event TransFormer (EventFormer) architecture to directly detect AU events from a video sequence by exploiting global temporal information. EventFormer takes a video sequence as input and detects AU events for each AU class directly and simultaneously by viewing AU event detection as a multiple class-specific sets prediction problem.

Figure 1: An illustration of a potential application scenario of direct sequence-level AU event detection. AU events provide overlap as well as chronological information between different AUs, which is more suitable for further emotion analysis.

Specifically, a region-aware AU feature encoder is first used to extract fine-grained AU features as frame embeddings. Then, an event transformer encoder-decoder module generates event embeddings by learning global dependencies among the frames in a video sequence. Through the self-attention mechanism, all frame embeddings are fed to the transformer architecture simultaneously, and each frame can interact with all the others directly; in this way, a global view is maintained. After that, a classification branch and a regression branch are applied to each event embedding for event validity prediction and boundary regression, respectively.

Unlike methods such as [Li et al.(2017)Li, Abtahi, and Zhu], which model local temporal relations among several frames via an RNN [Hochreiter and Schmidhuber(1996)], or methods such as [Ding et al.(2013)Ding, Chu, De la Torre, Cohn, and Wang, Simon et al.(2010)Simon, Nguyen, De La Torre, and Cohn], which use local temporal information by extracting unit-level features, our EventFormer takes the whole video sequence as input and models global temporal relations through the transformer mechanism, which allows each frame to attend to all other frames simultaneously. Moreover, instead of generating AU events from frame-level results through postprocessing that depends heavily on manually selected hyper-parameters, EventFormer detects AU events directly and is thus able to alleviate discontinuous results.

The key contributions of our work are as follows:

  • To the best of our knowledge, this is the first work that directly detects AU events from a video sequence, which is more critical and practical for real-world applications.

  • We propose EventFormer for AU event detection, taking advantage of the mechanism of transformer to maintain a temporal global view and alleviate discontinuous results.

  • Extensive experiments conducted on a commonly used AU benchmark dataset, BP4D, show the superiority of our method under suitable metrics for AU event detection.

2 Related Work

In recent years, AU analysis tasks have drawn increasing attention as fundamental tasks in the field of affective computing. Conventional methods [Simon et al.(2010)Simon, Nguyen, De La Torre, and Cohn] mainly design hand-crafted features as the input of a classifier for AU recognition. With the development of deep learning, methods [Zhao et al.(2016)Zhao, Chu, and Zhang, Li et al.(2017)Li, Abtahi, and Zhu, Fan et al.(2020a)Fan, Shen, Cheng, and Tian, Chen et al.(2022b)Chen, Chen, Wang, Wang, and Liang, Chen et al.(2022a)Chen, Chen, Luo, Huang, Hua, Wang, and Liang] have raised the performance of AU analysis to a new height. Since AUs are dynamic processes, methods such as [Li et al.(2017)Li, Abtahi, and Zhu] employ LSTMs to model local temporal relations, but the longer the video sequence, the weaker the relationships between temporally distant frames. To obtain AU event results, Ding et al. [Ding et al.(2013)Ding, Chu, De la Torre, Cohn, and Wang] proposed a method that extracts unit-level features, predicts event-related scores, and generates AU events via a series of postprocessing steps. In contrast, we propose EventFormer to model global temporal relations among the frames in a video sequence and detect AU events directly.

Transformer [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] has attracted increasing research interest in computer vision tasks. The self-attention mechanism at the core of the transformer allows the model to aggregate information from the whole input sequence with much less memory consumption and computing time compared to RNNs. However, it was not until works such as [Wu et al.(2020)Wu, Xu, Dai, Wan, Zhang, Yan, Tomizuka, Gonzalez, Keutzer, and Vajda, Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko] succeeded that the architecture was proven effective and efficient for computer vision tasks. Our EventFormer makes full use of this mechanism to model global temporal information.

3 EventFormer for AU Event Detection

Figure 2: Architecture of EventFormer. EventFormer takes a video sequence as input; the AU feature encoder is first applied to generate frame embeddings, and positional embeddings are concatenated to each frame embedding. The event transformer encoder models global relationships among frames via the self-attention mechanism to enhance the frame embeddings. The event transformer decoder then takes the encoder output and sets of queries, i.e., Query Sets, as input, and outputs an aggregated event embedding for each query. Finally, the event embeddings are passed to two branches to obtain the final Event Sets.

3.1 Problem Definition

Consider an input video sequence $\mathcal{I}=\{I_{t}\}^{T}_{t=1}$ with $T$ frames recording facial actions, where $I_{t}$ is the $t^{\rm th}$ frame in $\mathcal{I}$. The annotations of $\mathcal{I}$ are composed of a set of ground-truth AU events ${\Phi}_{\rm{g}}=\{{\phi}_{i}=(s^{\rm{g}}_{i},e^{\rm{g}}_{i},c^{\rm{g}}_{i})\,|\,0\leq s^{\rm{g}}_{i}<e^{\rm{g}}_{i}\leq T,\ c^{\rm{g}}_{i}\in\{1,2,\dots,C\}\}^{M}_{i=1}$, where $M$ is the number of ground-truth AU events in the video sequence $\mathcal{I}$, $C$ is the number of AU classes, and $s^{\rm{g}}_{i}$, $e^{\rm{g}}_{i}$, $c^{\rm{g}}_{i}$ are the start time, end time, and AU class label of the AU event ${\phi}_{i}$, respectively. AU event detection aims to detect a set of events ${\Phi}_{\rm p}=\{{\varphi}_{i}=(s^{\rm p}_{i},e^{\rm p}_{i},c^{\rm p}_{i})\,|\,0\leq s^{\rm p}_{i}<e^{\rm p}_{i}\leq T,\ c^{\rm p}_{i}\in\{1,2,\dots,C\}\}^{N}_{i=1}$ that matches ${\Phi}_{\rm{g}}$ precisely and exhaustively. During training, the set of ground-truth AU events ${\Phi}_{\rm{g}}$ is used as supervision for the detected events ${\Phi}_{\rm p}$; during inference, ${\Phi}_{\rm p}$ can be simply filtered to produce the final results.

3.2 Multiple Class-specific Sets Prediction

Due to AU co-occurrence relationships, AU events of different AU classes can have near-identical or exactly identical temporal boundaries. Inspired by DETR [Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko], which views object detection as a single set prediction problem, we view AU event detection as a multiple class-specific sets prediction problem. Instead of predicting a class-agnostic set containing events of all classes, we predict multiple class-specific sets, each of which contains the events of a specific AU class.

As noted above, AU event detection aims to detect class-specific sets of AU events, ${\Phi}_{\rm p}=\{{\varphi}_{i}=(s^{\rm p}_{i},e^{\rm p}_{i},c^{\rm p}_{i})\}^{N}_{i=1}$, from $\mathcal{I}$. If we bind AU class labels to events and search for a permutation of ${\Phi}_{\rm p}$ to match ${\Phi}_{\rm{g}}$ directly, class mismatches caused by the multi-label property of AU event detection are hard to resolve. For example, events with identical temporal boundaries but different AU labels in ${\Phi}_{\rm{g}}$ can match any permutation of the corresponding detected events in ${\Phi}_{\rm p}$ during training, which causes unstable training and makes the class labels hard to learn. To alleviate this instability, we split ${\Phi}_{\rm{g}}$ into $C$ disjoint class-specific sets $\{{\Phi}^{c}_{\rm{g}}\}^{C}_{c=1}$ such that ${\Phi}_{\rm{g}}=\bigcup^{C}_{c=1}{\Phi}^{c}_{\rm{g}}$, where ${\Phi}^{c}_{\rm{g}}=\{{\phi}^{c}_{i}=(s^{\rm{g}}_{i},e^{\rm{g}}_{i},c^{\rm{g}}_{i})\,|\,\forall{\phi}_{i}~\mathrm{s.t.}~c^{\rm{g}}_{i}=c\}^{N_{c}}_{i=1}$ and $N_{c}$ is the number of ground-truth events belonging to class $c$. In this way, the problem becomes predicting $C$ class-specific sets ${\Phi}^{c}_{\rm p}$ (with ${\Phi}_{\rm p}=\bigcup^{C}_{c=1}{\Phi}^{c}_{\rm p}$), one for each AU class, i.e., a multiple class-specific sets prediction problem.

For the implementation of EventFormer, assuming $N_{0}$ is a number larger than any $N_{c}$, we pad ${\Phi}^{c}_{\rm{g}}$ with $\varnothing$ (no event) to obtain a set $\tilde{{\Phi}}^{c}_{\rm{g}}=\{\tilde{\phi}^{c}_{i}=(s^{\rm{g}}_{i},e^{\rm{g}}_{i},v^{\rm{g}}_{i})\}^{N_{0}}_{i=1}$ with a fixed number of events, where $v^{\rm{g}}_{i}\in\{0,1\}$: $v^{\rm{g}}_{i}=0$ indicates that $\tilde{\phi}^{c}_{i}$ is not a valid event, i.e., a padding $\varnothing$, and $v^{\rm{g}}_{i}=1$ indicates that $\tilde{\phi}^{c}_{i}$ is a valid event. We denote by $\tilde{{\Phi}}^{c}_{\rm p}=\{\tilde{\varphi}^{c}_{i}=(s^{\rm p}_{i},e^{\rm p}_{i},v^{\rm p}_{i})\}^{N_{0}}_{i=1}$ the set of $N_{0}$ detected events for class $c$, called Event Set $c$. EventFormer takes a video sequence $\mathcal{I}$ and a union of $C$ Query Sets ${\Phi}_{\rm q}=\bigcup^{C}_{c=1}{\Phi}^{c}_{\rm q}$ as inputs, and outputs a union of $C$ corresponding Event Sets $\tilde{{\Phi}}_{\rm p}=\bigcup^{C}_{c=1}\tilde{{\Phi}}^{c}_{\rm p}$.
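For concreteness, the following minimal sketch (ours, not the released implementation) illustrates how a flat set of ground-truth events can be split into class-specific sets and padded with no-event entries to a fixed size $N_{0}$; the tuple layout and function names are assumptions made for illustration.

```python
# A minimal sketch of building the padded class-specific sets described above.
from collections import defaultdict

def build_class_specific_sets(events, num_classes, n0):
    """events: list of (s, e, c) tuples with c in {1, ..., num_classes}."""
    per_class = defaultdict(list)
    for s, e, c in events:
        per_class[c].append((s, e, 1))          # v = 1: valid event
    padded = {}
    for c in range(1, num_classes + 1):
        ev = per_class[c][:n0]                  # assumes N0 >= N_c in practice
        ev += [(0.0, 0.0, 0)] * (n0 - len(ev))  # v = 0: padding ("no event")
        padded[c] = ev
    return padded

# Example: two AU12 events and one AU6 event in one video sequence.
sets = build_class_specific_sets([(10, 55, 12), (80, 120, 12), (12, 60, 6)],
                                 num_classes=12, n0=100)
print(len(sets[12]), sets[12][:2])
```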

3.3 EventFormer Architecture

As shown in Fig. 2, our EventFormer mainly consists of three parts.

AU Feature Encoder

Compared to coarse-grained body actions, AUs only cause subtle appearance changes in several local facial regions, which puts high demands on the discriminability of frame embeddings. Thus, we design a region-aware AU feature encoder that extracts local features for each AU separately to preserve more detailed information. Each frame $I_{t}\in\mathbb{R}^{3\times H\times W}$ in $\mathcal{I}$ is fed to a backbone network to extract a global feature $F^{\rm{global}}\in\mathbb{R}^{d\times H_{0}\times W_{0}}$. Then $C$ spatial attention layers [Zhao and Wu(2019)] are applied to the global feature to extract a local feature $f^{\rm local}\in\mathbb{R}^{d}$ for each AU. Finally, the concatenated local features $F^{\rm local}\in\mathbb{R}^{(d\times C)}$ are mapped to the frame embedding $E_{t}\in\mathbb{R}^{d_{\rm m}}$ for $I_{t}$ via a linear layer.
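The sketch below illustrates one possible form of such a region-aware encoder: a ResNet-50 backbone produces a global feature map, one spatial-attention branch per AU pools a local feature, and a linear layer maps the concatenated local features to the frame embedding. The 1x1-convolution attention and the exact layer shapes are assumptions of this sketch; the paper uses the spatial attention of [Zhao and Wu(2019)].

```python
import torch
import torch.nn as nn
import torchvision

class AUFeatureEncoder(nn.Module):
    """Sketch of a region-aware AU feature encoder: backbone -> global feature map,
    one spatial-attention branch per AU -> local features, concat + linear -> E_t."""
    def __init__(self, num_aus=12, d=512, d_m=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # B x 2048 x h x w
        self.reduce = nn.Conv2d(2048, d, kernel_size=1)               # B x d x h x w
        # One spatial attention map per AU (assumed form, simplified from the cited layer).
        self.attn = nn.ModuleList([nn.Conv2d(d, 1, kernel_size=1) for _ in range(num_aus)])
        self.proj = nn.Linear(d * num_aus, d_m)

    def forward(self, frames):                                        # frames: B x 3 x H x W
        g = self.reduce(self.backbone(frames))                        # global feature
        locals_ = []
        for attn in self.attn:
            a = torch.softmax(attn(g).flatten(2), dim=-1)             # B x 1 x h*w
            locals_.append((g.flatten(2) * a).sum(-1))                # attention-pooled, B x d
        return self.proj(torch.cat(locals_, dim=-1))                  # frame embedding, B x d_m

emb = AUFeatureEncoder()(torch.randn(2, 3, 256, 256))
print(emb.shape)  # torch.Size([2, 256])
```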

Event Transformer Encoder-decoder Module

We start from the original transformer encoder-decoder architecture [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] and design an event transformer encoder-decoder module specifically for AU event detection. It consists of $L$ encoder layers and $L$ decoder layers. To ensure stable training, LayerNorm [Ba et al.(2016)Ba, Kiros, and Hinton] is applied before multi-head attention and the multi-layer perceptron, following [Xiong et al.(2020)Xiong, Yang, He, Zheng, Zheng, Xing, Zhang, Lan, Wang, and Liu]. To preserve positional information along the time dimension, positional encoding is employed to generate positional embeddings $\mathcal{P}=\{P_{t}\}^{T}_{t=1}\in\mathbb{R}^{T\times d_{\rm m}}$ corresponding to the frame embeddings $\mathcal{E}=\{E_{t}\}^{T}_{t=1}\in\mathbb{R}^{T\times d_{\rm m}}$. The transformer encoder takes the concatenated frame and positional embeddings, i.e., the video embedding, as input and outputs refined frame embeddings by enabling interaction among frames via the self-attention mechanism. The event transformer decoder takes event queries $Q\in\mathbb{R}^{(CN_{0})\times d_{\rm m}}$ as input, which can be regarded as the union of $C$ Query Sets, ${\Phi}_{\rm q}=\bigcup^{C}_{c=1}{\Phi}^{c}_{\rm q}$ with ${\Phi}^{c}_{\rm q}=\{q^{c}_{i}\}^{N_{0}}_{i=1}$. Each query $q^{c}_{i}\in\mathbb{R}^{d_{\rm m}}$ is a distinct learned positional embedding. Queries first interact with each other through self-attention to alleviate event redundancy, and then interact with the encoder output, i.e., the refined frame embeddings as keys and values, to aggregate the frame embeddings relevant to each potential event into event embeddings $D\in\mathbb{R}^{(CN_{0})\times d_{\rm m}}$.
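A minimal sketch of such an encoder-decoder with learned event queries is given below, built on standard pre-LayerNorm transformer layers. Fusing the concatenated frame and positional embeddings back to $d_{\rm m}$ with a linear layer, as well as the head count and feed-forward width, are assumptions of this sketch rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class EventTransformer(nn.Module):
    """Sketch of the event transformer encoder-decoder with C*N0 learned event queries."""
    def __init__(self, d_m=256, num_layers=6, num_heads=8, num_classes=12, n0=100, max_len=512):
        super().__init__()
        self.pos = nn.Embedding(max_len, d_m)                    # positional embeddings P_t
        self.fuse = nn.Linear(2 * d_m, d_m)                      # fuse [E_t ; P_t] back to d_m
        enc_layer = nn.TransformerEncoderLayer(d_m, num_heads, 4 * d_m,
                                               batch_first=True, norm_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_m, num_heads, 4 * d_m,
                                               batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
        self.queries = nn.Embedding(num_classes * n0, d_m)       # C Query Sets of N0 queries

    def forward(self, frame_emb):                                # frame_emb: B x T x d_m
        B, T, _ = frame_emb.shape
        pos = self.pos(torch.arange(T, device=frame_emb.device)).expand(B, T, -1)
        memory = self.encoder(self.fuse(torch.cat([frame_emb, pos], dim=-1)))
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)   # B x (C*N0) x d_m
        return self.decoder(q, memory)                           # event embeddings D

d = EventTransformer()(torch.randn(2, 64, 256))
print(d.shape)  # torch.Size([2, 1200, 256])
```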

Classification and Regression Branches

The output event embeddings $D$ of the event transformer encoder-decoder module are further fed into the classification branch and the regression branch separately. For each feature vector $d_{i}\in\mathbb{R}^{d_{\rm m}}$ in $D$ representing a potential event ${\varphi}_{i}$, the regression branch estimates the start time $s_{i}$ and duration $l_{i}$ of the event, with $e_{i}=\min(T,s_{i}+l_{i})$. The classification branch estimates a one-hot vector $\hat{p}_{i}\in\mathbb{R}^{2}$ denoting the probabilities of the value of $v_{i}$. We use a linear layer to output the classification probability $\hat{p}_{i}$ and two linear layers to regress $s_{i}$ and $l_{i}$. After that, by reorganizing the events in order, $\tilde{{\Phi}}_{\rm p}=\bigcup^{C}_{c=1}\tilde{{\Phi}}^{c}_{\rm p}$ is obtained.
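A possible form of the two branches is sketched below; predicting normalized start times and durations via a sigmoid and scaling by $T$ is an assumption made for illustration, not a detail stated in the paper.

```python
import torch
import torch.nn as nn

class EventHeads(nn.Module):
    """Sketch of the classification branch (validity probabilities) and the
    regression branch (start time s_i and duration l_i) applied to each event embedding."""
    def __init__(self, d_m=256):
        super().__init__()
        self.cls = nn.Linear(d_m, 2)     # validity logits -> p_hat
        self.start = nn.Linear(d_m, 1)   # start time s_i
        self.dur = nn.Linear(d_m, 1)     # duration l_i

    def forward(self, event_emb, seq_len):                       # event_emb: B x (C*N0) x d_m
        p_hat = self.cls(event_emb).softmax(-1)                  # B x (C*N0) x 2
        s = self.start(event_emb).sigmoid().squeeze(-1) * seq_len
        l = self.dur(event_emb).sigmoid().squeeze(-1) * seq_len
        e = torch.minimum(torch.full_like(s, seq_len), s + l)    # e_i = min(T, s_i + l_i)
        return p_hat, s, e

p, s, e = EventHeads()(torch.randn(2, 1200, 256), seq_len=64)
print(p.shape, s.shape, e.shape)
```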

4 Training and Inference of EventFormer

To train EventFormer, a multiple class-specific sets matching cost is introduced for class-specific bipartite matching between each pair of $\tilde{{\Phi}}^{c}_{\rm{g}}$ and $\tilde{{\Phi}}^{c}_{\rm p}$. After the matching for each AU class, a multiple class-specific sets prediction loss is computed for back-propagation.

4.1 Multiple Class-specific Sets Matching Cost

Due to the disorder of events within one set, the loss function designed for multiple class-specific sets prediction should be invariant to any permutation of the detected events with identical class labels, i.e., within one Event Set. We apply a loss based on the Hungarian algorithm [Kuhn(1955)] to find a bipartite matching between ground-truth events and detected ones for each class.

A permutation of $N_{0}$ elements, $\sigma\in\Omega_{N_{0}}$, is found for each class by searching for a bipartite matching between $\tilde{{\Phi}}^{c}_{\rm{g}}$ and $\tilde{{\Phi}}^{c}_{\rm p}$ that minimizes the total matching cost, as shown in Eq. 1:

$$\hat{\sigma}={\arg\min}_{\sigma\in\Omega_{N_{0}}}\sum\nolimits^{N_{0}}_{i=1}\mathcal{L}_{\rm match}\big(\tilde{\phi}^{c}_{i},\tilde{\varphi}^{c}_{\sigma(i)}\big), \qquad (1)$$

where $\mathcal{L}_{\rm match}(\tilde{\phi}^{c}_{i},\tilde{\varphi}^{c}_{\sigma(i)})$ is a pair-wise matching cost between the ground-truth event $\tilde{\phi}^{c}_{i}$ and a detected event $\tilde{\varphi}^{c}_{\sigma(i)}$ with matching index $\sigma(i)$. The matching loss is designed to minimize the distance between matched pairs and, at the same time, maximize the validity of the matched events. We define $\mathcal{L}_{\rm match}(\tilde{\phi}^{c}_{i},\tilde{\varphi}^{c}_{\sigma(i)})$ as

$$\mathds{1}_{\{v^{\rm{g}}_{i}=1\}}\left(\lambda_{\rm{bound}}\,\mathcal{L}_{\rm{bound}}\big(\tilde{\phi}^{c}_{i},\tilde{\varphi}^{c}_{\sigma(i)}\big)-\lambda_{\rm{valid}}\,\hat{p}^{c}_{\sigma(i)}[v^{\rm{g}}_{i}]\right), \qquad (2)$$

where $\hat{p}^{c}_{\sigma(i)}\in\mathbb{R}^{2}$ denotes the probabilities of the value of $v^{\rm p}_{\sigma(i)}$, indicating whether $\tilde{\varphi}^{c}_{\sigma(i)}$ is a valid event, i.e., the probabilities of $v^{\rm p}_{\sigma(i)}\in\{0,1\}$; $\hat{p}^{c}_{\sigma(i)}[v^{\rm{g}}_{i}]$ denotes the probability that $v^{\rm p}_{\sigma(i)}=v^{\rm{g}}_{i}$; and $\lambda_{\rm{bound}}$ and $\lambda_{\rm{valid}}$ are balancing weights. The boundary loss $\mathcal{L}_{\rm{bound}}$ measures the similarity between a matched pair of ground-truth and detected events. The L1 loss measures the numerical difference of the regression results, and the tIoU loss measures the overlap ratio of a matched pair. Both are used because a pair of ground-truth and detected events may have a minor difference in terms of L1 but a huge difference in terms of tIoU. Thus, a linear combination of the tIoU loss (Eq. 3) and the L1 loss is used as our boundary loss $\mathcal{L}_{\rm{bound}}$, as shown in Eq. 4.

$$T_{\cap}=\max\big(0,\ \min(e_{1},e_{2})-\max(s_{1},s_{2})\big),\qquad \mathcal{L}_{\rm tIoU}\big((s_{1},e_{1}),(s_{2},e_{2})\big)=\frac{T_{\cap}}{(e_{1}-s_{1})+(e_{2}-s_{2})+T_{\cap}}. \qquad (3)$$
$$\mathcal{L}_{\rm{bound}}\big(\tilde{\phi}^{c}_{i},\tilde{\varphi}^{c}_{\sigma(i)}\big)=\lambda_{\rm tIoU}\,\mathcal{L}_{\rm tIoU}\big((s^{\rm{g}}_{i},e^{\rm{g}}_{i}),(s^{\rm p}_{\sigma(i)},e^{\rm p}_{\sigma(i)})\big)+\lambda_{\rm L1}\big(\|s^{\rm{g}}_{i}-s^{\rm p}_{\sigma(i)}\|+\|e^{\rm{g}}_{i}-e^{\rm p}_{\sigma(i)}\|\big), \qquad (4)$$

where $\lambda_{\rm tIoU}$ and $\lambda_{\rm L1}$ are balancing weights.
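To make the matching step concrete, the sketch below builds a per-class cost matrix in the spirit of Eq. 2 and solves it with the Hungarian algorithm. The overlap term is implemented as the standard $1-\mathrm{tIoU}$ with tIoU computed as intersection over union, used here only as a stand-in for Eqs. 3-4; the $\lambda$ values follow Sec. 5.1, and all array shapes are assumptions of this sketch.

```python
# Minimal sketch of the per-class bipartite matching of Eq. (1).
import numpy as np
from scipy.optimize import linear_sum_assignment

def tiou(s1, e1, s2, e2):
    inter = np.maximum(0.0, np.minimum(e1, e2) - np.maximum(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / np.maximum(union, 1e-8)

def match_one_class(gt, pred, lam_bound=5.0, lam_valid=1.0, lam_tiou=2.0, lam_l1=5.0):
    """gt: list of (s, e, v) padded to N0; pred: arrays 's', 'e', 'p_valid' of length N0."""
    n0 = len(gt)
    cost = np.zeros((n0, n0))
    for i, (sg, eg, vg) in enumerate(gt):
        if vg == 0:                                   # padded "no event": contributes no cost
            continue
        bound = (lam_tiou * (1.0 - tiou(sg, eg, pred['s'], pred['e']))
                 + lam_l1 * (np.abs(sg - pred['s']) + np.abs(eg - pred['e'])))
        cost[i] = lam_bound * bound - lam_valid * pred['p_valid']     # spirit of Eq. (2)
    row, col = linear_sum_assignment(cost)            # Hungarian assignment, sigma_hat of Eq. (1)
    return dict(zip(row.tolist(), col.tolist()))

gt = [(10, 55, 1), (80, 120, 1), (0, 0, 0), (0, 0, 0)]                # toy case, N0 = 4
pred = {'s': np.array([9., 30., 78., 5.]), 'e': np.array([54., 60., 119., 20.]),
        'p_valid': np.array([0.9, 0.2, 0.8, 0.1])}
print(match_one_class(gt, pred))                      # e.g. {0: 0, 1: 2, ...}
```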

4.2 Multiple Class-specific Sets Prediction Loss

After finding a bipartite matching that minimizes the matching cost, the loss function can be computed. A combination of $\mathcal{L}_{\rm{bound}}$ and $\mathcal{L}_{\rm class}$ (Eq. 5) forms the event detection loss $\mathcal{L}$, as shown in Eq. 6.

$$\mathcal{L}_{\rm class}\big(p^{c}_{i},\hat{p}^{c}_{\sigma(i)}\big)=-\sum\nolimits_{v\in\{0,1\}}p^{c}_{i}(v)\log\big(\hat{p}^{c}_{\sigma(i)}(v)\big). \qquad (5)$$
$$\mathcal{L}=\sum\nolimits^{C}_{c=1}\sum\nolimits^{N_{0}}_{i=1}\Big(\mathds{1}_{\{v^{\rm{g}}_{i}=1\}}\,\mathcal{L}_{\rm{bound}}\big(\tilde{\phi}^{c}_{i},\tilde{\varphi}^{c}_{\sigma(i)}\big)+\lambda_{\rm class}\,\mathcal{L}_{\rm class}\big(p^{c}_{i},\hat{p}^{c}_{\sigma(i)}\big)\Big), \qquad (6)$$

where $p^{c}_{i}\in\mathbb{R}^{2}$ is the one-hot encoding of $v^{\rm{g}}_{i}$ for $\tilde{\phi}^{c}_{i}$, and $\lambda_{\rm class}$ is a balancing weight. It is worth mentioning that all detected events are involved in the calculation of $\mathcal{L}_{\rm class}$, but only the matched events in the Event Sets are involved in the calculation of $\mathcal{L}_{\rm{bound}}$.
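The sketch below illustrates the per-class term of Eq. 6 given the matching from the previous sketch: the classification term is computed over all $N_{0}$ queries, while the boundary term is accumulated only over matched valid events. The tensor layout and the boundary_loss argument (any callable in the form of Eq. 4) are assumptions.

```python
import torch
import torch.nn.functional as F

def set_prediction_loss(p_hat, s_p, e_p, gt, assignment, boundary_loss, lam_class=1.0):
    """p_hat: N0 x 2 validity probabilities; s_p, e_p: N0 predicted boundaries;
    gt: list of (s, e, v) padded to N0; assignment: {gt index i -> pred index sigma(i)}."""
    target = torch.zeros(len(gt), dtype=torch.long)          # validity label per query
    bound = p_hat.new_zeros(())
    for i, (sg, eg, vg) in enumerate(gt):
        j = assignment[i]
        target[j] = vg
        if vg == 1:                                          # only matched valid events
            bound = bound + boundary_loss(sg, eg, s_p[j], e_p[j])
    cls = F.nll_loss(torch.log(p_hat + 1e-8), target)        # Eq. (5) over all queries
    return bound + lam_class * cls                           # one class term of Eq. (6)

# Toy usage with an L1-only boundary loss as a placeholder.
bl = lambda sg, eg, sp, ep: torch.abs(sg - sp) + torch.abs(eg - ep)
loss = set_prediction_loss(torch.tensor([[.1, .9], [.8, .2], [.3, .7], [.9, .1]]),
                           torch.tensor([9., 30., 78., 5.]), torch.tensor([54., 60., 119., 20.]),
                           [(10, 55, 1), (80, 120, 1), (0, 0, 0), (0, 0, 0)],
                           {0: 0, 1: 2, 2: 1, 3: 3}, bl)
print(loss)
```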

In the inference stage, bipartite matching is disabled. Given a threshold $\tau$, we simply filter out events whose predicted validity probability $\hat{p}_{i}$ is lower than the threshold and preserve the valid events. For each Event Set $c$, each preserved event $\tilde{\varphi}^{c}_{i}$ is assigned the AU class label $c$ to form a final detected event ${\varphi}_{i}=(s^{\rm p}_{i},e^{\rm p}_{i},c^{\rm p}_{i})$ with $c^{\rm p}_{i}=c$ in ${\Phi}_{\rm p}$. In this way, the set of final AU events ${\Phi}_{\rm p}$ is easily obtained.
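A minimal sketch of this decoding step is given below; the flat ordering of queries by class and the array layout are assumptions made for illustration.

```python
# Sketch of inference-time decoding: threshold on validity, attach Query Set class labels.
import numpy as np

def decode_events(p_valid, starts, ends, num_classes, n0, tau=0.5):
    """p_valid, starts, ends: flat arrays of length C*N0, ordered by Query Set."""
    events = []
    for c in range(num_classes):
        sl = slice(c * n0, (c + 1) * n0)
        for p, s, e in zip(p_valid[sl], starts[sl], ends[sl]):
            if p >= tau and e > s:
                events.append((float(s), float(e), c + 1))   # (s_p, e_p, c_p)
    return events

rng = np.random.default_rng(0)
print(decode_events(rng.random(1200), rng.random(1200) * 32, rng.random(1200) * 64,
                    num_classes=12, n0=100)[:3])
```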

5 Experiments

5.1 Experimental Settings

Datasets & Metrics

Extensive experiments are conducted on a commonly used benchmark dataset, BP4D [Zhang et al.(2014)Zhang, Yin, Cohn, Canavan, Reale, Horowitz, Liu, and Girard]. BP4D contains 328 videos of 41 participants, including 23 women and 18 men. Each frame is annotated by certified FACS coders with binary AU occurrence labels. We consider 12 emotion-related AUs on BP4D: AU1, 2, 4, 6, 7, 10, 12, 14, 15, 17, 23, and 24. To construct the training data, we use a sliding window of length $T$ to truncate all videos into equal-length video sequences with an overlap of $T/2$. Based on the binary AU occurrence labels, ground-truth AU events ${\Phi}_{\rm{g}}$ are obtained for each video sequence. Following the common protocol in [Zhao et al.(2016)Zhao, Chu, and Zhang], subject-exclusive 3-fold cross-validation is conducted for all experiments.
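The sketch below illustrates this data preparation: a sliding window of length $T$ with stride $T/2$, and the conversion of per-frame binary labels into ground-truth events as maximal runs of consecutive positive frames (the actual data loader of the paper may differ in details).

```python
import numpy as np

def sliding_windows(num_frames, T):
    """Window start/end indices with stride T/2, as described above."""
    stride = T // 2
    return [(s, s + T) for s in range(0, max(num_frames - T, 0) + 1, stride)]

def labels_to_events(labels, au_class):
    """labels: binary occurrence array of one AU inside one window -> list of (s, e, c)."""
    events, start = [], None
    for t, v in enumerate(labels):
        if v and start is None:
            start = t
        if not v and start is not None:
            events.append((start, t, au_class)); start = None
    if start is not None:
        events.append((start, len(labels), au_class))
    return events

print(sliding_windows(200, 64)[:3])                             # [(0, 64), (32, 96), (64, 128)]
print(labels_to_events(np.array([0, 1, 1, 1, 0, 0, 1, 1]), au_class=12))
```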

Considering that AU event detection shares some similarities with temporal action detection [Lin et al.(2020)Lin, Li, Wang, Tai, Luo, Cui, Wang, Li, Huang, and Ji], we select several suitable metrics for AU event detection, drawing on those used in that task. The goal of AU event detection is to detect AU events with not only high precision but also acceptable recall. Thus, we report mean Average Precision (mAP) and Average Recall at an average number of events (AR@AN) under different tIoU thresholds $\alpha$: $\alpha$ is set to $[0.3:0.1:0.7]$ for mAP and $[0.5:0.05:0.95]$ for AR@AN. We also report the area under the AR vs. AN curve (AUC).

Implementation Details

All facial images are aligned and cropped according to facial landmarks and resized to $256\times 256$. RN50 [He et al.(2016)He, Zhang, Ren, and Sun] without the last linear layer is used as the backbone in the AU feature encoder. Empirically, we set the local feature dimension $d=512$, $H_{0}=W_{0}=16$, the embedding dimension $d_{\rm m}=256$, and the number of encoder/decoder layers $L=6$. The other hyper-parameters $\lambda_{\rm{bound}}$, $\lambda_{\rm{valid}}$, $\lambda_{\rm tIoU}$, $\lambda_{\rm L1}$, and $\lambda_{\rm class}$ are set to 5, 1, 2, 5, and 1, respectively. The number of queries $N_{0}$ in each Query Set is set to 100, and $\tau$ is set to 0.5. We train EventFormer with the AdamW [Loshchilov and Hutter(2017)] optimizer, setting the transformer's learning rate to $10^{-4}$, the AU feature encoder's learning rate to $10^{-5}$, and the weight decay to $10^{-4}$. The batch size is 8 and the number of training epochs is 100. All models are trained on two NVIDIA Tesla V100 GPUs.

| Backbone | Scheme | mAP@0.3 | mAP@0.4 | mAP@0.5 | mAP@0.6 | mAP@0.7 | AR@10 | AR@50 | AR@100 | AUC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RN18 | Frame2Event [Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin] | 10.03 | 8.97 | 8.07 | 7.17 | 6.22 | 3.83 | 24.15 | 60.15 | 27.34 |
| RN18 | Unit2Event [Chen et al.(2021c)Chen, Zhang, Chen, Wang, Wang, and Liang] | 22.09 | 19.52 | 16.71 | 14.53 | 12.54 | 48.22 | 73.83 | 74.51 | 68.17 |
| RN18 | EventFormer | 33.82 | 29.09 | 24.34 | 20.29 | 16.39 | 51.46 | 62.27 | 68.34 | 60.06 |
| RN34 | Frame2Event [Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin] | 14.33 | 12.20 | 10.53 | 8.80 | 7.23 | 3.93 | 18.58 | 41.37 | 20.09 |
| RN34 | Unit2Event [Chen et al.(2021c)Chen, Zhang, Chen, Wang, Wang, and Liang] | 24.04 | 21.18 | 18.15 | 15.84 | 13.77 | 49.09 | 73.79 | 75.85 | 68.49 |
| RN34 | EventFormer | 36.92 | 32.06 | 26.87 | 22.37 | 18.18 | 52.83 | 64.95 | 69.62 | 62.12 |
| RN50 | Frame2Event [Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin] | 16.56 | 14.32 | 12.50 | 10.63 | 8.88 | 4.16 | 18.94 | 40.36 | 20.30 |
| RN50 | Unit2Event [Chen et al.(2021c)Chen, Zhang, Chen, Wang, Wang, and Liang] | 24.40 | 22.01 | 19.36 | 17.03 | 14.67 | 50.24 | 73.46 | 76.15 | 68.24 |
| RN50 | EventFormer | 41.41 | 35.79 | 30.10 | 25.00 | 20.32 | 53.59 | 66.72 | 72.31 | 63.76 |
Table 1: Comparison among schemes on BP4D in terms of mAP@tIoU, AR@AN, AUC.

5.2 Comparison among Schemes

We classify AU event detection schemes into three categories, which use frame-level (Frame2Event), unit-level (Unit2Event), and video-level (Video2Event) results to detect AU events, respectively. To demonstrate the effectiveness of EventFormer, which follows the Video2Event scheme, two methods following the other schemes are selected for comparison. For a fair comparison, all methods, including EventFormer, use the same AU feature encoder pre-trained with frame-level AU occurrence labels.

Comparison to Frame2Event

The Frame2Event scheme collects frame-level AU results and converts them to AU events through postprocessing such as Temporal Actionness Grouping (TAG) [Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin], which involves two hyper-parameters, a water level $\gamma$ and a union threshold $\tau$. We use the same AU feature encoder and an MLP classifier to obtain frame-level AU results. Based on the predicted AU occurrence probabilities, candidate events are generated under several combinations of $\gamma$ and $\tau$; specifically, we sample $\gamma$ and $\tau$ within the range of 0.5 to 0.95 with a step of 0.05. The confidence score of each candidate event $(s,e,c)$ is computed by averaging the probabilities of class $c$ within the segment $(s,e)$. Soft-NMS [Bodla et al.(2017)Bodla, Singh, Chellappa, and Davis] is employed to select $N_{0}$ events for each class from the candidate events.
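The sketch below gives a much simplified stand-in for this pipeline: frame-level probabilities are grouped into candidate events by a single threshold (TAG itself additionally uses the water level $\gamma$ and union threshold $\tau$), and each candidate is scored by its mean in-segment probability before Soft-NMS pruning.

```python
# Simplified threshold-grouping sketch, not the exact TAG algorithm.
import numpy as np

def group_candidates(probs, threshold):
    """probs: per-frame occurrence probabilities of one AU class -> (s, e, score) candidates."""
    candidates, start = [], None
    for t, p in enumerate(probs):
        if p >= threshold and start is None:
            start = t
        if p < threshold and start is not None:
            candidates.append((start, t, float(probs[start:t].mean())))
            start = None
    if start is not None:
        candidates.append((start, len(probs), float(probs[start:].mean())))
    return candidates

probs = np.array([0.1, 0.7, 0.9, 0.8, 0.2, 0.1, 0.6, 0.7, 0.3])
for thr in (0.5, 0.6):                          # candidates from several thresholds are
    print(thr, group_candidates(probs, thr))    # pooled and pruned with Soft-NMS
```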

As shown in Table 1, EventFormer outperforms the Frame2Event method by a large margin in terms of mAP at every tIoU threshold, regardless of the backbone used. In particular, EventFormer achieves a gain of 24.85% in mAP@0.3 over the Frame2Event method with RN50 as the backbone, which shows the superiority of EventFormer. We also notice that the Frame2Event method obtains a rather low AR at small AN, because the prediction jitters caused by the lack of a global view make it hard to balance the trade-off between AP and AR with a single set of fixed hyper-parameters. This limitation reflects the necessity of maintaining a global view and detecting events directly.

| Hyper-parameter | Value | mAP | AUC |
| --- | --- | --- | --- |
| #Queries in Query Set $N_{0}$ | 10 | 31.78 | 34.77 |
| | 50 | 31.33 | 45.77 |
| | 100 | 30.03 | 62.32 |
| | 200 | 25.04 | 63.68 |
| Embedding dimension $d_{\rm m}$ | 128 | 30.22 | 59.65 |
| | 256 | 30.03 | 62.32 |
| | 512 | 24.82 | 61.20 |
| | 1024 | 22.95 | 59.33 |
| #Encoder/decoder layers $L$ | 3 | 29.39 | 58.67 |
| | 4 | 29.58 | 59.51 |
| | 5 | 30.31 | 62.05 |
| | 6 | 30.03 | 62.32 |
Table 2: Sensitivity to hyper-parameters.

Comparison to Unit2Event

We choose AUPro [Chen et al.(2021c)Chen, Zhang, Chen, Wang, Wang, and Liang] as the representative of the Unit2Event scheme, which extracts unit-level features and predicts event-related scores to generate AU events. AUPro exhaustively estimates the start and end probabilities, $P_{\rm s}$ and $P_{\rm e}$, for each time position, and generates an action completeness map $P_{\rm c}$ consisting of the completeness score of any event $(s,e)$; the final confidence score of an event $(s,e)$ is calculated as $P_{\rm s}(s)\times P_{\rm e}(e)\times P_{\rm c}(s,e)$. Since the method only predicts class-agnostic events, we simply make it generate $C$ sets of $P_{\rm s}$, $P_{\rm e}$, and $P_{\rm c}$, one set for each class. We adopt Soft-NMS to select $N_{0}$ events out of the $T^{2}$ detected events for each class.
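The scoring step can be sketched as follows (the array shapes and the enumeration of candidates with $e>s$ are assumptions; the actual AUPro implementation may differ):

```python
# Sketch of Unit2Event-style scoring: confidence(s, e) = P_s(s) * P_e(e) * P_c(s, e).
import numpy as np

def score_candidates(p_start, p_end, p_complete):
    """p_start, p_end: length-T arrays; p_complete: T x T completeness map."""
    T = len(p_start)
    cands = [(s, e, p_start[s] * p_end[e] * p_complete[s, e])
             for s in range(T) for e in range(s + 1, T)]
    return sorted(cands, key=lambda x: -x[2])      # highest-confidence candidates first

rng = np.random.default_rng(0)
T = 8
print(score_candidates(rng.random(T), rng.random(T), rng.random((T, T)))[:3])
```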

As shown in Table 1, EventFormer outperforms the Unit2Event method with any backbone in terms of mAP at any tIoU threshold. Specifically, EventFormer achieves a gain of 17.01% in mAP@0.3 over the Unit2Event method with RN50 as the backbone. Since the Unit2Event method generates events exhaustively, it is expected to obtain better results in terms of AR. Although EventFormer performs slightly worse than the Unit2Event method in terms of AR at large AN, it outperforms it in AR@10 by 3.35%, which indicates that the events generated by EventFormer are of better quality.

5.3 Sensitivity Analysis

Table 2 shows the sensitivity analysis of hyper-parameters, including the number of queries in each Query Set $N_{0}$, the embedding dimension $d_{\rm m}$, and the number of encoder/decoder layers $L$. As the number of queries in a Query Set increases, mAP decreases while AUC increases, reflecting the trade-off between mAP and AR. We notice that mAP does not decrease much from $N_{0}=10$ to $N_{0}=100$, and AUC increases much more slowly when $N_{0}>100$; thus, we choose $N_{0}=100$ for EventFormer. As for $d_{\rm m}$, AUC reaches its peak at $d_{\rm m}=256$, where mAP is also close to its best score. As for the number of layers $L$, EventFormer achieves better performance as $L$ increases. To balance computational complexity and model performance, we choose $L=6$ for EventFormer.

5.4 Class-agnostic Set vs. Class-specific Sets

To show the superiority of viewing AU event detection as a multiple class-specific sets prediction problem instead of a single class-agnostic set prediction problem, we implement a class-agnostic version of EventFormer for comparison, which generates events with AU class labels directly and applies bipartite matching once between the ground-truth events and the detected events of all classes. From Fig. 3 we can see that the class-agnostic version obtains poor results on AU2, AU15, AU23, and AU24, for which the mAP and AR@100 are near zero. The variance of results among AU classes is large for the class-agnostic version, while the class-specific version achieves relatively balanced results. We attribute this superiority to the binding between sets and AU classes, which is essential for stabilizing training and alleviating the variance of results among AU classes caused by the data imbalance problem.

Figure 3: A comparison between multiple class-specific sets prediction and single class-agnostic set prediction.

5.5 Qualitative Results

Visualization of Attention in EventFormer

Figure 4: Visualization of attention.

To better understand how the attention mechanism takes effect in EventFormer, we visualize the attention weights of the last layer of the transformer decoder in Fig. 4. The brightest parts of the attention maps show that the cross-attention of the event transformer decoder tends to focus on the embeddings of frames where the states of AUs change. The results also show that EventFormer can capture subtle and transient appearance changes that occur within a very short duration ($\leq 5$ frames) and detect the corresponding event successfully, as shown in Fig. 4(a).

Figure 5: Visualization of detected AU events.

Visualization of Detected AU Events

Fig. 5 shows AU events detected by (a) the Frame2Event method and (b) EventFormer. AU events detected by EventFormer are of better quality, while those detected by the Frame2Event method contain several false positive events with very short durations. There is a false positive event of AU2 (Outer Brow Raiser) in Fig. 5(b); the visualized frames corresponding to this period show the subject opening her eyes, during which wrinkles appear above her eyebrows, misleading EventFormer into detecting an AU2 event.

6 Conclusion

This paper focuses on making full use of global temporal information to directly detect AU events from a whole video sequence, which is more practical and critical in some real-world application scenarios, such as financial anti-fraud. We propose EventFormer for AU event detection, which uses the transformer mechanism to model global temporal relationships among frames, and extensive experiments show the effectiveness of our EventFormer.

References

  • [Ba et al.(2016)Ba, Kiros, and Hinton] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [Baltrušaitis et al.(2017)Baltrušaitis, Li, and Morency] T. Baltrušaitis, L. Li, and L. Morency. Local-global ranking for facial expression intensity estimation. In ACII, pages 111–118, 2017. 10.1109/ACII.2017.8273587.
  • [Bodla et al.(2017)Bodla, Singh, Chellappa, and Davis] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-nms – improving object detection with one line of code. In ICCV, Oct 2017.
  • [Carion et al.(2020)Carion, Massa, Synnaeve, Usunier, Kirillov, and Zagoruyko] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229. Springer, 2020.
  • [Chen et al.(2021a)Chen, Chen, Wang, Wang, and Liang] Yingjie Chen, Diqi Chen, Yizhou Wang, Tao Wang, and Yun Liang. Cafgraph: Context-aware facial multi-graph representation for facial action unit recognition. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1029–1037, 2021a.
  • [Chen et al.(2021b)Chen, Wu, Wang, Wang, and Liang] Yingjie Chen, Han Wu, Tao Wang, Yizhou Wang, and Yun Liang. Cross-modal representation learning for lightweight and accurate facial action unit detection. IEEE Robotics and Automation Letters, 6(4):7619–7626, 2021b.
  • [Chen et al.(2021c)Chen, Zhang, Chen, Wang, Wang, and Liang] Yingjie Chen, Jiarui Zhang, Diqi Chen, Tao Wang, Yizhou Wang, and Yun Liang. Aupro: Multi-label facial action unit proposal generation for sequence-level analysis. In ICONIP, pages 88–99. Springer, 2021c.
  • [Chen et al.(2022a)Chen, Chen, Luo, Huang, Hua, Wang, and Liang] Yingjie Chen, Chong Chen, Xiao Luo, Jianqiang Huang, Xian-Sheng Hua, Tao Wang, and Yun Liang. Pursuing knowledge consistency: Supervised hierarchical contrastive learning for facial action unit recognition. In Proceedings of the 30th ACM International Conference on Multimedia, pages 111–119, 2022a.
  • [Chen et al.(2022b)Chen, Chen, Wang, Wang, and Liang] Yingjie Chen, Diqi Chen, Tao Wang, Yizhou Wang, and Yun Liang. Causal intervention for subject-deconfounded facial action unit recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 374–382, 2022b.
  • [Ciftci et al.(2017)Ciftci, Zhang, and Tin] UmurAybars Ciftci, Xing Zhang, and Lijun Tin. Partially occluded facial action recognition and interaction in virtual reality applications. In ICME, pages 715–720. IEEE, 2017.
  • [Cohn and Schmidt(2003)] Jeffrey F Cohn and Karen Schmidt. The timing of facial motion in posed and spontaneous smiles. In Active Media Technology, pages 57–69. World Scientific, 2003.
  • [Ding et al.(2013)Ding, Chu, De la Torre, Cohn, and Wang] Xiaoyu Ding, Wen-Sheng Chu, Fernando De la Torre, Jeffery F Cohn, and Qiao Wang. Facial action unit event detection by cascade of tasks. In ICCV, pages 2400–2407, 2013.
  • [Ekman and Friesen(1978)] P. Ekman and W. Friesen. Facial action coding system: A technique for the measurement of facial movement. 1978.
  • [Fan et al.(2020a)Fan, Shen, Cheng, and Tian] Yachun Fan, Jie Shen, Housen Cheng, and Feng Tian. Joint facial action unit intensity prediction and region localisation. In 2020 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2020a.
  • [Fan et al.(2020b)Fan, Lam, and Li] Yingruo Fan, Jacqueline Lam, and Victor Li. Facial action unit intensity estimation via semantic correspondence learning with dynamic graph convolution. In AAAI, volume 34, pages 12701–12708, 2020b.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [Hochreiter and Schmidhuber(1996)] Sepp Hochreiter and Jürgen Schmidhuber. LSTM can solve hard long time lag problems. In NIPS, 1996.
  • [Kuhn(1955)] H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
  • [Li et al.(2017)Li, Abtahi, and Zhu] Wei Li, Farnaz Abtahi, and Zhigang Zhu. Action unit detection with region adaptation, multi-labeling learning and optimal temporal fusing. In CVPR, 2017.
  • [Lin et al.(2020)Lin, Li, Wang, Tai, Luo, Cui, Wang, Li, Huang, and Ji] Chuming Lin, Jian Li, Yabiao Wang, Ying Tai, Donghao Luo, Zhipeng Cui, Chengjie Wang, Jilin Li, Feiyue Huang, and Rongrong Ji. Fast learning of temporal action proposal via dense boundary generator. In AAAI, volume 34, pages 11499–11506, 2020.
  • [Loshchilov and Hutter(2017)] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [Schmidt et al.(2006)Schmidt, Ambadar, Cohn, and Reed] Karen L Schmidt, Zara Ambadar, Jeffrey F Cohn, and L Ian Reed. Movement differences between deliberate and spontaneous facial expressions: Zygomaticus major action in smiling. Journal of nonverbal behavior, 30(1):37–52, 2006.
  • [Simon et al.(2010)Simon, Nguyen, De La Torre, and Cohn] T. Simon, M. H. Nguyen, F. De La Torre, and J. F. Cohn. Action unit detection with segment-based svms. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2737–2744, 2010. 10.1109/CVPR.2010.5539998.
  • [Song et al.(2021)Song, Cui, Wang, Zheng, and Ji] Tengfei Song, Zijun Cui, Yuru Wang, Wenming Zheng, and Qiang Ji. Dynamic probabilistic graph convolution for facial action unit intensity estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4845–4854, 2021.
  • [Song et al.(2020)Song, Shi, Feng, Song, Lin, Lin, Fan, and Yuan] Xinhui Song, Tianyang Shi, Zunlei Feng, Mingli Song, Jackie Lin, Chuanjie Lin, Changjie Fan, and Yi Yuan. Unsupervised learning facial parameter regressor for action unit intensity estimation via differentiable renderer. In ACM MM, pages 2842–2851, 2020.
  • [Tirupattur et al.(2021)Tirupattur, Duarte, Rawat, and Shah] Praveen Tirupattur, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. Modeling multi-label action dependencies for temporal action localization. In CVPR, pages 1460–1470, 2021.
  • [Vaswani et al.(2017)Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • [Wang and Wang(2018)] Can Wang and Shangfei Wang. Personalized multiple facial action unit recognition through generative adversarial recognition network. In ACM MM, pages 302–310, 2018.
  • [Wu et al.(2020)Wu, Xu, Dai, Wan, Zhang, Yan, Tomizuka, Gonzalez, Keutzer, and Vajda] Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph Gonzalez, Kurt Keutzer, and Peter Vajda. Visual transformers: Token-based image representation and processing for computer vision. arXiv preprint arXiv:2006.03677, 2020.
  • [Xiong et al.(2020)Xiong, Yang, He, Zheng, Zheng, Xing, Zhang, Lan, Wang, and Liu] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In ICML, pages 10524–10533. PMLR, 2020.
  • [Yang and Yin(2020)] Huiyuan Yang and Lijun Yin. Re-net: A relation embedded deep model for au occurrence and intensity estimation. In Proceedings of the Asian Conference on Computer Vision, 2020.
  • [Zhang et al.(2014)Zhang, Yin, Cohn, Canavan, Reale, Horowitz, Liu, and Girard] Xing Zhang, Lijun Yin, Jeffrey F Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeffrey M Girard. Bp4d-spontaneous: A high-resolution spontaneous 3d dynamic facial expression database. Image and Vision Computing, 32(10):692–706, 2014.
  • [Zhao et al.(2016)Zhao, Chu, and Zhang] Kaili Zhao, Wen-Sheng Chu, and Honggang Zhang. Deep region and multi-label learning for facial action unit detection. In CVPR, 2016.
  • [Zhao and Wu(2019)] Ting Zhao and Xiangqian Wu. Pyramid feature attention network for saliency detection. In CVPR, pages 3085–3094, 2019.
  • [Zhao et al.(2017)Zhao, Xiong, Wang, Wu, Tang, and Lin] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In ICCV, Oct 2017.