HIINT: Historical, Intra- and Inter- personal Dynamics Modeling with Cross-person Memory Transformer

Yubin Kim [email protected] MIT Media LabCambridgeMAUSA , Dong Won Lee [email protected] MIT Media LabCambridgeMAUSA , Paul Pu Liang [email protected] Carnegie Mellon UniversityPittsburghPAUSA , Sharifa Algohwinem [email protected] MIT Media LabCambridgeMAUSA , Cynthia Breazeal [email protected] MIT Media LabCambridgeMAUSA and Hae Won Park [email protected] MIT Media LabCambridgeMAUSA

Abstract.

Accurately modeling affect dynamics, which refers to the changes and fluctuations in emotions and affective displays during human conversations, is crucial for understanding human interactions. By analyzing affect dynamics, we can gain insights into how people communicate, respond to different situations, and form relationships. However, modeling affect dynamics is challenging due to contextual factors, such as the complex and nuanced nature of interpersonal relationships, the situation, and other factors that influence affective displays. To address this challenge, we propose a Cross-person Memory Transformer (CPM-T) framework which is able to explicitly model affective dynamics (intrapersonal and interpersonal influences) by identifying verbal and non-verbal cues, and with a large language model to utilize the pre-trained knowledge and perform verbal reasoning. The CPM-T framework maintains memory modules to store and update the contexts within the conversation window, enabling the model to capture dependencies between earlier and later parts of a conversation. Additionally, our framework employs cross-modal attention to effectively align information from multi-modalities and leverage cross-person attention to align behaviors in multi-party interactions. We evaluate the effectiveness and generalizability of our approach on three publicly available datasets for joint engagement, rapport, and human beliefs prediction tasks. Remarkably, the CPM-T framework outperforms baseline models in average F1-scores by up to 7.3%, 9.3%, and 2.0% respectively. Finally, we demonstrate the importance of each component in the framework via ablation studies with respect to multimodal temporal behavior.

1. Introduction

In social interactions, individuals rely on a combination of cues to perceive and comprehend the affective states of others, enabling them to gain insights into the contextual aspects of the interaction (Hall et al., 2019; Knapp et al., 2013; Rouast et al., 2019; Picard, 2000). This process of understanding is influenced by multiple factors, such as the specific situation at hand, the nature of the relationship between the individuals involved, and the observer’s own emotions, experiences, and expectations. Additionally, in the context of multi-party interactions, challenges arise from both inter-personal influences, which involve dynamics among participants, and intra-personal influences, which pertain to individual contributions within the group (Dowell et al., 2018; Lee et al., 2023). These challenges highlight the need for a comprehensive approach that considers the lasting impact of affects, the influences of individual and group dynamics, and the contextual nuances within social interactions.

Refer to caption — Figure 1. Best viewed zoomed in and in color. Interactive conversation scenarios from (a) DAMI-P2C, (b) MPII, and (c) BOSS datasets. For each conversation, intrapersonal influence and interpersonal influence are evident with the affective displays and this overall affective momentum with the self and interpersonal influence dynamics are depicted in (d).

To address the challenges posed by affect dynamics in interactive conversations, we propose the Crossperson Memory Transformer (CPM-T) for modeling affect dynamics in interactive conversations. Our model incorporates a cross-modal transformer (Tsai et al., 2019) to obtain fused representations of multiple modality features extracted by modality-specific backbones. Additionally, we leverage cross-person attention (Lee et al., 2023) to capture the influences of intrapersonal and interpersonal factors by encoding verbal and nonverbal cue features. The model also includes a memory network that allows for the retention of past interactions and utilizes the reasoning capabilities of a large language model to guide the interpretation of verbal cues. Given intrapersonal and interpersonal inputs from multiple modalities, CPM-T applies cross-modal and cross-person attention to encode nonverbal representations. This encoding process, guided by verbal reasoning from a large language model and supported by the memory modules, enables the model to autoregressively output an embedding that encapsulates contextualized information about the affective dynamics and interactions in the ongoing conversation. By capturing the momentum of affective states and the complex dependencies between individuals and their historical context, CPM-T enables a deeper understanding of the interplay between verbal and nonverbal cues in social interactions.

To evaluate the effectiveness of our proposed approach, we selected three complex social and affective dynamics tasks: joint engagement, rapport, and human belief prediction from DAMI-P2C (Chen et al., 2022), MPIIGroupInteraction (Müller et al., 2018), and BOSS (Duan et al., 2022) datasets, which involve long-term dependencies influenced by various intra- and inter-personal dynamics. These tasks share commonalities that involve interpreting nonverbal cues, understanding social dynamics and context, possessing empathy and theory of mind, aligning communication, demonstrating cognitive flexibility, and engaging in collaborative problem-solving. By addressing these aspects, our proposed approach aims to enhance social cognition, communication skills, and interpersonal understanding in human interactions.

To summarize, the main contributions of our work are as follows:

(1)

We propose Crossperson Memory Transformer (CPM-T), a novel transformer-based model which combines the concept of Cross-person Attention (CPA) and Memory (Slot) Attention for capturing the intra- and inter- personal relationship between pairs of people that lies in long-term dependencies with multi-modal streams.
(2)

We utilize the Large Language Model (LLM)’s reasoning as verbal context which provides guidance for nonverbal cues through the memory network to improve the model’s performance.
(3)

We successfully integrate the proposed model into the joint engagement, rapport, and belief dynamics prediction task on three publicly available datasets. Experiments, ablation studies, and qualitative analysis support the effectiveness of our model and open up new possibilities for improving social human-robot interaction in various settings.

2. Related Works

2.1. Memory Networks

Memory networks have gained considerable attention due to their ability to capture and leverage contextual information for understanding and modeling affective experiences. One specific challenge involves effectively capturing the temporal dynamics inherent in affective experiences. In response, researchers have investigated the use of memory networks for modeling interactive conversational memory networks in tasks such as emotion recognition (Koval et al., 2021; Lee and Lee, 2021; Shen et al., 2021), sentiment analysis (Xu et al., 2019), and emotion flip reasoning (Kumar et al., 2022). These models significantly enhance the comprehension and prediction of affective states within real-world interactions. Despite the promising potential of memory networks to address these challenges, further research is necessary to enhance their scalability, interpretability, and generalization capabilities in the context of affective communication tasks.

2.2. Modeling Interactive Conversations

The field of conversational modeling has increasingly recognized the importance of incorporating affect dynamics into understanding human interactions. Specifically, (Curto et al., 2021) proposes DyadFormer, multi-modal transformer architecture to model individual and interpersonal features in dyadic interactions for personality prediction. (Lee et al., 2023) present MultiPar-T, a transformer-based model that can capture the contingent behavior in a multi-party setting by conducting an engagement prediction task. (Ng et al., 2022) models interactional communication in dyadic interaction by autoregressively outputting multiple possibilities of corresponding listener motion. (Hazarika et al., 2018) proposes a multimodal emotion detection framework that extracts multimodal features from conversational videos and hierarchically models the self- and inter- speaker emotional influences into global memories. Among these prior works, only a few studies have addressed the challenges of modeling affect dynamics in more complex conversational tasks; joint engagement, rapport, and belief dynamics prediction. Previous models have been limited in their ability to capture the nuances of human interactions by only recognizing a limited range of affective states and contextual cues. Furthermore, they have not fully accounted for the complexities of multimodal features. By addressing these limitations, our proposed model represents a comprehensive approach to modeling affect dynamics in interactive conversations, building on prior work on affect dynamics, multimodal features, and complex conversational tasks. By recognizing a broader range of affective states and contextual cues, our model can capture the nuances of human interactions and enable more accurate modeling of affective dynamics.

2.3. Language Models as Multimodal Guides

Language Models (LMs) have proven to be powerful tools in various domains, including affective computing. They have been successfully applied in guiding other modalities for video segmentation (Liang et al., 2023), context-aware prompting (Rao et al., 2021), and image classification (Yang et al., 2023). These studies highlight the potential of combining verbal context with nonverbal context to enhance the understanding and generation of nonverbal behaviors in affective computing. By leveraging the capabilities of LMs, we can effectively bridge the gap between verbal and nonverbal cues, enabling a more comprehensive and nuanced understanding of affective dynamics in human interactions. This integration allows us to capture the interplay between verbal and nonverbal expressions, fusing nonverbal behaviors that align with the given verbal context.

In our work, we extend the application of LMs to the modeling of affect dynamics in interactive conversations. By utilizing a large language model as a guiding source for nonverbal cues, we aim to enhance the performance of our proposed Crossperson Memory Transformer (CPM-T) framework in capturing the intricate relationship between verbal and nonverbal aspects of affective communication. Through the integration of verbal context provided by LMs, CPM-T can generate more contextually relevant and emotionally expressive nonverbal behaviors, thereby improving the overall fidelity and naturalness of affective communication modeling. The utilization of LMs in the affective computing domain not only enriches our understanding of human interactions but also opens up new possibilities for applications in social robotics, virtual agents, and human-computer interaction. By effectively combining verbal and nonverbal cues, we can create more engaging and empathetic systems that can better understand and respond to users’ affective states and needs.

3. Methods

In this section, we describe our proposed Crossperson Memory Transformer (CPM-T) (Figure 2). At the high level, CPM-T takes fused multi-modal representation from each person using Cross-modal Transformer and utilize Crossperson Attention (CPA) to discover the self and interpersonal influences. Next, we utilize Memory (Slot) Attention modules to incorporate an external dynamic memory to encode and retrieve past information. In Section 3.3, and 3.4, we present in details about the ingredients of the CPM-T architecture (see Figure 2) and explain the importance of each component.

3.1. Problem Statement

Consider a set of video-audio pairs $\mathcal{D}=\{(X,y)\}$ where $X$ is the audio and video input and $y\in Y$ , is the label from a set of $N$ classes. We extract the features of all audio and video clips in $\mathcal{D}$ as ${X_{a}=E_{a}(X[a])\in{\mathbb{R}^{T_{a}\times{d_{a}}}}}$ and ${X_{v}=E_{v}({X[v]})\in{\mathbb{R}^{T_{a}\times{d_{a}}}}}$ , respectively (for certain models, we add extra modalities such as pose ${p}$ and text ${t}$ along with audio and video). Given the task-specific concepts $C=\{c_{1},c_{2},...,c_{N}\}$ and the task, we generate a set of reasoning sentences $s=LLM(C)$ and feed these sentences to the memory encoder $E_{mem}$ , to generate verbal memory $mem=E_{mem}(s)$ . Combined with the Cross-person Memory Transformer model, it produces a prediction, $\hat{y}=f\big{(}g(X,mem)\big{)}$ , in which $g$ is the CPM-T model, $f$ is the MLP layer, $X$ is the sequence tokens and $mem$ is the memory tokens.

3.2. Intrapersonal Input Separation

In Figure 2, we display the individuals are separated from the original videos, and this process was done by 1) video inpainting (Figure 3) and 2) speaker diarization (Figure 4).

-Video Inpainting

For video inpainting, we use a flow-guided video inpainting model, $E^{2}FGVI$ (Li et al., 2022) which can handle videos with arbitrary resolution. This model exhibits a strong ability to generalize effectively to higher resolutions, as demonstrated by experimental results and validated performance metrics such as PSNR and SSIM. The video inpainting process can be divided into three interconnected stages. Firstly, flow completion is performed to estimate the missing optical flow fields in corrupted regions, as the absence of flow information in those areas can impact subsequent processes. Secondly, pixel propagation is employed to fill the holes in corrupted videos by bi-directionally propagating pixels from visible areas, leveraging the completed optical flow as a guide. Finally, content hallucination takes place, where the remaining missing regions are generated through the use of a pre-trained image inpainting network.

-Speaker Diarization

For speaker diarization, we use the model that uses attention to localize and group sound sources, and optical flow to aggregate information overtime which is presented in (Afouras et al., 2020). The performance of the model has been validated through four downstream speech-oriented tasks: (a) multi speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection which showed the effectiveness of the model’s learned audio-visual object embeddings.

3.3. Verbal Memory from LLM Reasoning

In this paper, we harness the power of LLMs to facilitate the extraction and integration of nonverbal cues within the realm of affective dynamics in interactive conversations. Specifically, we use OpenAI’s Chat-GPT, which possesses advanced capabilities to understand and generate human-like text based on given prompts. By leveraging the reasoning abilities of LLMs, we can tap into their deep understanding of language and utilize their contextual comprehension to enhance our understanding of affective states in conversational settings. We prompt the model to generate a set of reasoning about the joint affective states, taking into account the context $c$ and the conversation history within the designated window $w$ . The choice of window size determines the amount of context we consider for analysis. For example, for the DAMI-P2C dataset, we provide the following contexts to make a final formatted prompt:

•

the type of relationship: parent-child
•

the type of activity they are doing: story reading
•

the conversation history within window size $w$
•

the label we want to predict: joint engagement
•

the entities: parent, child, and both

The prompt can be obtained by filling the contextual information to the pre-defined format (see Figure 2) and we collect responses for 1) parent, 2) child, and 3) dyad by inputting the prompt to the large language model. the encoder part from the memformer (Wu et al., 2022) which includes the memory reading and writing operations to generate verbal context. This verbal context is used to initialize the memory part of crossperson memory network which takes the nonverbal cue segments as input and updates the nonverbal context guided by the verbal context.

By incorporating LLMs into our methodology, we open up possibilities for more nuanced and insightful analysis of affective dynamics. The integration of nonverbal cues with verbal memory allows us to capture a more holistic view of affective experiences in social interactions. This approach offers valuable insights into the complex interplay between verbal and nonverbal communication, contributing to a deeper understanding of affective inertia and its temporal aspects.

3.4. Crossperson Memory Network (CPM-T)

In order to successfully address the complex affect dynamics taking place in interactive conversations, we must properly represent each person’s individual nonverbal cues, address self and interpersonal influences, then take into account the long-range dependencies that last over time.

Affect Dynamics Encoding: Cross-person Attention (CPA)

To explicitly model the self- and interpersonal influences between pairs of people, we utilized Cross-person Attention (CPA), proposed in (Lee et al., 2023). This method states that given a pair of people, the target person $self$ ’s behavior is contingent on person $other$ ’s behavior if person $self$ ’s behavior was likely to be influenced by person $other$ ’s behavior ( $other$ $\rightarrow$ $self$ ). Hence, for the target person $self$ , and another person $other$ , we utilize the multi-modal representations $Z_{self}$ , $Z_{other}\in$ $\mathbb{R}^{T\ \times\ 2d}$ obtained from the Cross-modal Transformer, where $T$ is the sequence length and $d$ is the projected feature dimension. Following how Multimodal Transformer (Tsai et al., 2019) calculated the cross-modal attention, the cross-person attention can be calculated similarly as below:

(1)			$\displaystyle\textstyle{\mathrm{CPA}_{other\rightarrow\text{s}elf}\left(Z_{other},Z_{self}\right)=\operatorname{softmax}\left(\frac{Q_{other}K_{self}^{\top}}{\sqrt{d_{x}}}\right)V_{self}}$
(1)			$\displaystyle=\operatorname{softmax}\left(\frac{Z_{other}W_{Q_{other}}\left(Z_{self}W_{K_{self}}\right)^{\top}}{\sqrt{d_{x}}}\right)Z_{self}W_{V_{self}}.$

$\text{CPA}_{other\rightarrow self}^{m,\text{multi}}$ refers to the multi-headed from person other to person self at the m-th layer. $\text{CPA}_{other\rightarrow self}(Z_{other},Z_{self})$ outputs an embedding which has captured the person $self$ ’s behavior contingent on person $other$ ’s behaviors. Note that depending on the task, we concatenate the different outputs from the Crossperson Transformers (e.g., for DAMI-P2C (child-coordinated joint engagement), we concatenate the outputs only from ${parent\rightarrow child}$ and ${child}$ whereas we concatenate the outputs from ${other\rightarrow self}$ , ${self\rightarrow other}$ and ${self}$ for rest of datasets (rapport, human belief dynamics)).

Dynamic Memory Update: Memory Slot Attention

To encode and retain important past context, we utilize the external dynamic memory method presented in (Wu et al., 2022). The model interactively encodes and retrieves the information from memory in a recurrent way by conducting memory read and write operations. At each timestep $t$ , we have $M_{t}=[m_{t}^{0},m_{t}^{1},...,m_{t}^{k}]$ . For each slot in the batch, they keep separate memory representations by working individually. For each segment sequence, the model first read the memory to retain the past important information by using cross-attention.

(2)		$\displaystyle Q_{x},K_{M},V_{M}=xW_{Q},M_{t}W_{K},M_{t}W_{K}$
		$\displaystyle A_{x,M}=\operatorname{MHAttn}(Q_{x},K_{M})$
		$\displaystyle H_{x}=\operatorname{Softmax}(A_{x,M})V_{M}$

Here, we project memory slot vectors into keys and values and input sequences into queries and use these queries to attend to all key-value pairs in the memory slots, ultimately resulting in the output of the final hidden states. This enables the model to learn the complex association of the memory. Next, memory writing happens with a slot attention module to update memory information and a forgetting method to clean up unimportant memory information. Memory writing only occurs at the last layer of the encoder and allows high-level contextual representations to be stored in memory. Slot attention happens in this stage where each memory slot only attends to itself and token representations, and this prevents each memory slot to write its own information to other slots directly, as memory slots should be independent of each other.

(3)		$\displaystyle Q_{m^{i}},K_{m^{i}}=m^{i}W_{Q},m^{i}W_{K}$
		$\displaystyle K_{x},V_{x}=xW_{K},xW_{V}$
		$\displaystyle A_{m^{i}}^{{}^{\prime}}=\operatorname{MHAttn}(A_{m^{i}},[K_{m^{i}};K_{x}])$

Each slot is separately projected into queries and keys. The segment token representations are projected into keys and values. Slot attention means that each memory slot can only attend to itself and the token representations. Thus, each memory slot cannot write its own information to other slots directly, as memory slots should not be interfering with each other. Finally, after the attention scores are calculated, the raw attention weights are divided by the temperature $\tau$ , and the next timestep’s memory is collected with attention:

(4)		$\displaystyle A_{m^{i}}={{exp(A_{i}^{{}^{\prime}}/\tau)}\over{\sum}_{j}exp(A_{j}^{{}^{\prime}}/\tau)}$
(4)		$\displaystyle m_{t+1}^{i}=\operatorname{Softmax}(A_{x,M})[m_{t}^{i};V_{x}]$

For the details of how the memory read and write operation works, we encourage the readers to refer to Appendix C.

Dataset	Model	Modality	All Joint Engagement Classes			Low Eng	Mid Eng	High Eng
			Accuracy	Weighted F1	Macro F1	F1	F1	F1
	I3D	v	$0.508\pm 0.02$	$0.548\pm 0.04$	$0.416\pm 0.02$	$0.232\pm 0.08$	$0.290\pm 0.15$	$0.724\pm 0.02$
	MulT	a+v+t	$0.564\pm 0.02$	$0.587\pm 0.04$	$0.414\pm 0.02$	$0.250\pm 0.03$	$0.273\pm 0.06$	$0.719\pm 0.02$
DAMI-P2C	DyadFormer	a+v	$0.494\pm 0.01$	$0.500\pm 0.01$	$0.362\pm 0.03$	$0.186\pm 0.04$	$0.250\pm 0.08$	$0.651\pm 0.01$
	Multipar-T	a+v+p	$0.573\pm 0.01$	$0.589\pm 0.01$	$0.417\pm 0.01$	$0.191\pm 0.01$	$0.361\pm 0.02$	$0.698\pm 0.01$
	Ours	t $\rightarrow$ a+v	$\textbf{0.634}\pm\textbf{0.00}$	$\textbf{0.677}\pm\textbf{0.01}^{\dagger}$	$\textbf{0.490}\pm\textbf{0.00}^{\dagger}$	$\textbf{0.286}\pm\textbf{0.01}$	$\textbf{0.430}\pm\textbf{0.01}$	$\textbf{0.755}\pm\textbf{0.00}$

Table 1. Results and standard deviations for the proposed and baseline models on DAMI-P2C dataset using 3 seeds. In ”Modality” column,

m_{i}

m_{j}

stands for modality fusion for modality

m_{i}

and

m_{j}

. Eng stands for Engagement.

{m_{i}\rightarrow m_{j}+m_{k}}

means

{m_{j}+m_{k}}

was guided by

{m_{i}}

modality using the memory encoder. For MulT, we provided the sentences from the original transcription as

t

, and for Ours, we used the reasoning output sentences from LLM as

t

(see Section 3.3 and Appendix B for details).

\dagger

represents statistical significance over state-of-the-art scores under the paired bootstrap test

(p<0.05)

and Bonferroni correction.

Dataset	Model	Modality	All Rapport Classes			Low Rap	Mid Rap	High Rap
			Accuracy	Weighted F1	Macro F1	F1	F1	F1
	I3D	v	$0.371\pm 0.02$	$0.333\pm 0.02$	$0.278\pm 0.01$	$0.000\pm 0.00$	$\textbf{0.260}\pm\textbf{0.06}$	$\textbf{0.575}\pm\textbf{0.05}^{\dagger}$
	MulT	a+v	$0.319\pm 0.01$	$0.298\pm 0.02$	$0.261\pm 0.01$	$0.107\pm 0.01$	$0.145\pm 0.05$	$0.531\pm 0.01$
MPII	DyadFormer	a+v	$0.298\pm 0.02$	$0.346\pm 0.02$	$0.257\pm 0.02$	$0.08\pm 0.02$	$0.248\pm 0.05$	$0.447\pm 0.02$
	Multipar-T	v+p	$0.494\pm 0.01$	$0.468\pm 0.01$	$0.315\pm 0.01$	$0.636\pm 0.02$	$0.000\pm 0.00$	$0.311\pm 0.04$
	Ours	a+v	$\textbf{0.534}\pm\textbf{0.04}$	$\textbf{0.551}\pm\textbf{0.06}$	$\textbf{0.408}\pm\textbf{0.01}^{\dagger}$	$\textbf{0.678}\pm\textbf{0.02}$	$0.01\pm 0.00$	$0.537\pm 0.02$

Dataset	Model	Modality	All Human Belief Classes			No Comm	Attn Follw	Joint Attn
			Accuracy	Weighted F1	Macro F1	F1	F1	F1
	I3D	v	$0.211\pm 0.02$	$0.282\pm 0.01$	$0.272\pm 0.02$	$\textbf{0.204}\pm\textbf{0.12}$	$\textbf{0.304}\pm\textbf{0.10}^{\dagger}$	$0.307\pm 0.10$
	MulT	a+v	$0.545\pm 0.02$	$\textbf{0.638}\pm\textbf{0.07}^{\dagger}$	$0.267\pm 0.06$	$0.000\pm 0.00$	$0.090\pm 0.01$	$0.711\pm 0.01$
BOSS	DyadFormer	a+v	$0.422\pm 0.03$	$0.446\pm 0.02$	$0.265\pm 0.01$	$0.000\pm 0.01$	$0.194\pm 0.02$	$0.600\pm 0.01$
	Multipar-T	v+p	$0.351\pm 0.03$	$0.338\pm 0.03$	$0.272\pm 0.01$	$0.098\pm 0.02$	$0.216\pm 0.03$	$0.501\pm 0.05$
	Ours	a+v	$\textbf{0.569}\pm\textbf{0.01}^{\dagger}$	$0.528\pm 0.02$	$\textbf{0.292}\pm\textbf{0.00}$	$0.157\pm 0.01$	$0.143\pm 0.02$	$\textbf{0.765}\pm\textbf{0.01}$

Table 2. Results and standard deviations for the proposed and baseline models on MPIIGroupInteraction and BOSS dataset using 3 seeds. In ”Modality” column,

m_{i}

m_{j}

stands for modality fusion for modality

m_{i}

and

m_{j}

. Rap, Comm, Follw, and Attn stands for Rapport, Communication, Following, and Attention respectively. For MPII and BOSS datasets, experiments on t modality are not reported since it was not provided in the original datasets.

\dagger

represents statistical significance over state-of-the-art scores under the paired bootstrap test

(p<0.05)

and Bonferroni correction.

4. Experiments

In this section, we empirically evaluate the Crossperson Memory Transformer (CPM-T) on three datasets that are frequently used to benchmark human affect communication tasks in prior works (Chen et al., 2022; Duan et al., 2022; Müller et al., 2018). Our goal is to compare CPM-T with prior competitive approaches on which almost all prior works employ) and unaligned(which is more challenging, and which CPM-T is generically designed for) multimodal language sequences.

4.1. Datasets and Evaluation Metrics

We utilize the DAMI-P2C (Chen et al., 2022), MPIIGroupInteraction (Müller et al., 2018), and BOSS (Duan et al., 2022) (See Appendix A.3 for more details) as benchmarks to measure the performance of our proposed method against other baselines. Each task requires the understanding of verbal and nonverbal cues from each person and modeling the affect dynamics to predict the joint labels between pairs of people.

DAMI-P2C

DAMI-P2C is a corpus of multimodal, multiparty conversational interactions in which participants followed a collaborative parent-child interaction to elicit their joint engagement. The dataset was collected in a study of 34 families, where a parent and a child (3-7 years old) engage in reading storybooks together. From the original five-point ordinal scale [-2,2], we modified the labels to three discrete categories for the classification task; Low, Mid, and High joint engagement. A ten-second window was selected as the fragment interval of target audio-visual recordings for the annotation to capture the long-range context of affect dynamics between pairs of dyads. When annotating the recordings, annotators were instructed to judge whether a given fragment contained the story-related dyadic interaction and filter out those that did not. In total, 16,593 fragments have been utilized with 488.03 ± 123.25 fragments from each family on average.

MPIIGroupInteraction

MPIIGroupInteraction is a dataset that collected audio-visual non-verbal behavior data and rapport ratings during small group interactions. It consists of 22 group discussions in German, each involving either three or four participants and each lasting about 20 minutes, resulting in a total of more than 440 minutes of audio-visual data. 78 German-speaking participants were recruited from a German university campus, resulting in 12 group interactions with four participants, and 10 interactions with three participants. Since rapport is a subjective feeling that is hard to gauge through any existing equipment, the rapport was self-reported by the participants. Responses were recorded on seven-point Likert scales and we modified the labels into three categories (Low, Mid, and High Rapport) to conduct the classification task with the same model structure across different tasks. Each participant rated each item for other individuals in the group, yielding two rapport scores for each dyad inside the larger group and we evaluate the model’s performance by averaging the results in different directions.

Dataset	Ablation	All Joint Engagement Classes			Low Eng	Mid Eng	High Eng
		Accuracy	Weighted F1	Macro F1	F1	F1	F1
	Ours w/o LLM	$0.618\pm 0.02$	$0.637\pm 0.02$	$0.476\pm 0.01$	$\textbf{0.328}\pm\textbf{0.02}$	$0.355\pm 0.02$	$0.745\pm 0.01$
DAMI-P2C	Ours w/o Memory	$0.613\pm 0.02$	$0.651\pm 0.03$	$0.478\pm 0.01$	$0.327\pm 0.02$	$0.392\pm 0.02$	$0.727\pm 0.02$
	Ours w/o Individuals	$0.609\pm 0.02$	$0.606\pm 0.03$	$0.428\pm 0.02$	$0.178\pm 0.04$	$0.365\pm 0.02$	$0.740\pm 0.01$
	Ours	$\textbf{0.634}\pm\textbf{0.00}$	$\textbf{0.677}\pm\textbf{0.01}$	$\textbf{0.490}\pm\textbf{0.00}^{\dagger}$	$0.286\pm 0.01$	$\textbf{0.430}\pm\textbf{0.01}$	$\textbf{0.755}\pm\textbf{0.00}^{\dagger}$

Table 3. Ablation. Effect of ablating key components of our method (CPM-T). We encourage the readers to see Figure 2.

w/o

LLM refers to the ablation of verbal memory initialization.

w/o

Memory refers to the ablation of memory modules in CPM-T but passes through a sequence model to do prediction.

w/o

Individuals refers to the ablation of video inpainting and speaker dimerization to separate individuals in in-person interaction videos. Results with different combinations of components are displayed, where Ours

w/

All performs well in general.

\dagger

represents statistical significance over state-of-the-art scores under the paired bootstrap test

(p<0.05)

and Bonferroni correction.

Dataset	Ablation	All Joint Engagement Classes			Low Eng	Mid Eng	High Eng
		Accuracy	Weighted F1	Macro F1	F1	F1	F1
	v	$0.552\pm 0.01$	$0.574\pm 0.02$	$0.385\pm 0.01$	$0.158\pm 0.04$	$0.310\pm 0.01$	$0.687\pm 0.01$
DAMI-P2C	a+v	$0.618\pm 0.02$	$0.637\pm 0.02$	$0.476\pm 0.01$	$\textbf{0.328}\pm\textbf{0.02}$	$0.355\pm 0.02$	$0.745\pm 0.01$
	t $\rightarrow$ v	$0.611\pm 0.00$	$0.597\pm 0.01$	$0.431\pm 0.01$	$0.226\pm 0.05$	$0.323\pm 0.06$	$0.743\pm 0.00$
	t $\rightarrow$ a+v	$\textbf{0.634}\pm\textbf{0.00}^{\dagger}$	$\textbf{0.677}\pm\textbf{0.01}^{\dagger}$	$\textbf{0.490}\pm\textbf{0.00}$	$0.286\pm 0.01$	$\textbf{0.430}\pm\textbf{0.01}$	$\textbf{0.755}\pm\textbf{0.00}^{\dagger}$

Table 4. Ablation. Effect of different combinations of modalities for our method (CPM-T). Again,

m_{i}

m_{j}

stands for modality fusion for modality

m_{i}

and

m_{j}

, and

{m_{i}\rightarrow m_{j}+m_{k}}

means

{m_{j}+m_{k}}

was guided by

{m_{i}}

modality using the memory encoder.

\dagger

represents statistical significance over state-of-the-art scores under the paired bootstrap test

(p<0.05)

and Bonferroni correction.

BOSS

BOSS is a 3D video dataset compiled from a sequence of social interactions between two individuals in an object-context scenario. The two participants are required to accomplish a collaborative task by inferring and interpreting each other’s beliefs through nonverbal communication. Individuals’ latent mental belief states were annotated, for which ground-truth labels are extremely challenging to obtain. Ten pairs of participants (five pairs of friends and five pairs of strangers) were recruited in 15 distinct contexts to compile the dataset. 900 videos from both the egocentric and third-person perspectives were gathered, totaling 347,490 frames. However, the focus on object matching alone does not capture the rich nonverbal communication that occurs during social interactions. To capture these nonverbal cues and enable a more comprehensive analysis of social interactions, the annotation in the BOSS dataset has been modified in this work to include information on participants’ joint attention, attention following, and communication. In detail, we define joint attention, attention following, and no communication using a threshold-based approach. Specifically, we considered an instance of joint attention when the number of matched objects exceeded a threshold of 30. For attention following, we set the threshold to a value between 0 and 30. Finally, we defined no communication as an instance where there were no matched objects which were inspired by (Fan et al., 2021).

4.2. Baselines

We compare our proposed model with a family of baselines in emotion recognition, action recognition, and personality recognition. We run the latest versions of these models and report their scores on a unified benchmark. For affect recognition models, we compare CPM-T to MulT (Tsai et al., 2019) and MultiPar-T (Lee et al., 2023). For the action recognition model, we compare our method with I3D (Carreira and Zisserman, 2017). Finally, for personality recognition model, we compare CPM-T to DyadFormer (Curto et al., 2021).

4.3. Implementation Details

We train our models on 4 NVIDIA GeForce GTX 2080 Ti with different training settings which are described in Appendix A. For all three datasets, we conduct cross-validation by iterating through 0.1 proportion of the groups’ data as the test, 0.2 proportion of the other groups’ data as the valid, and the rest of the other groups’ data as the train set for 3 seeds. Our code can be found in the [Anonymous] and will be shared with the dataset access link through the GitHub repository with a camera-ready version.

5. Results & Discussion

In this section, we discuss the quantitative results of our experiments. We compare our approach CPM-T with state-of-the-art baselines. Then, we discuss the importance of each component in the framework and modality used to train the model through ablation studies. Finally, we leave qualitative analysis in Appendix E to show different types of memory slots obtained from memory writer and crossmodal attention weights to show the correlation learned from audio-visual inputs. Drawing on prior research (Lee et al., 2023), we report the macro, weighted F1-score, and F1-score for every class, as well as an accuracy metric. The macro F1-score, calculated as the unweighted average of per-class F1-scores, holds significant value in our study as it signifies the model’s performance across all classes, irrespective of their representation in the dataset.

Comparison against baseline models

In Table 1 and 2, we evaluated the performance of the proposed model along with baseline models for the task of predicting joint engagement (DAMI-P2C), rapport (MPIIGroupInteraction), and human belief dynamics (BOSS) between dyads. The datasets are highly imbalanced (see Figure. 5) where predicting low engagement, low rapport, and joint attention is challenging.

For DAMI-P2C, our proposed model, which used audio and video modalities guided by verbal context, achieved the highest performance across all evaluation measures. Moreover, our model achieved a weighted F1-score of 0.677, an improvement of 8.8%, and the highest macro F1-score of 0.490, an improvement of 7.3% over the next best-performing model. Particularly, our model outperformed all baselines in predicting the low engagement class, achieving an F1-score of 0.286. This is particularly important given the imbalanced nature of the dataset, with only 493 instances of the low engagement class. Our model’s ability to accurately predict low engagement instances could help parents and clinicians identify areas of potential concern and intervene early. In contrast, the I3D model, which only used the video modality, achieved the lowest performance across all evaluation measures. This suggests that the inclusion of other modalities, such as audio and text, can improve the model’s ability to predict joint engagement.

For MPIIGroupInteraction and BOSS, our proposed model achieved the highest performance in accuracy and macro F1-score by using audio and video modality (text modality was not supported in the original dataset). As stated earlier, since the dataset is highly imbalanced, acquiring a high macro F1-score is important. Our model’s ability to accurately predict low rapport and joint attention could help teachers or professors to identify the cohesion between students and the mental states of one another. It is also interesting to see that MultiPar-T performed worse compared to the other two datasets and it might be due to the information loss that came from the blurred faces (See Figure 5). In contrast, which only used the video modality, achieved the second-best performance in macro F1-score and we assume this is due to the less meaningful information came from the audio modality (people kept repeating the objects they want to insist).

Ablation Studies

In order to compare the contribution of each component in our proposed model and the modality we used to train the model with the DAMI-P2C dataset, we performed two ablation studies (See Table 3 and 4).

We first systematically removed three components from the full model and compared the resulting performance to the baseline where all components were present. The three components that we removed were the LLM component, the Memory modules, and the separation of individuals from original videos. We measured the accuracy, weighted f1-score, macro f1-score, and f1-scores for each class. Our main interest was the macro f1-score, which provides a better indication of the overall performance of the model when the class distribution is imbalanced. Our results showed that removing any of the three components resulted in a decrease in macro f1-score compared to the baseline. Specifically, removing the LLM component resulted in a decrease of 0.014 in macro f1-score, while removing the Memory modules and the component that separates individuals from in-person interaction videos resulted in decreases of 0.012 and 0.062, respectively. Notably, the full model achieved the highest macro f1-score of 0.490, which was a statistically significant improvement over the baseline. These results suggest that all three components are important for achieving high performance in joint engagement prediction.

In addition to the first ablation study, we conducted another study to investigate the impact of different modality inputs on our model’s performance. Specifically, we tested three different input modalities: video only, audio and video, and video guided by verbal memory. We also tested the same input modalities with the addition of verbal memory guidance. Our results show that using both audio and video inputs significantly improved our model’s performance, as indicated by a macro f1-score of 0.476, which was significantly higher than using video input only (f1-score of 0.385). Guiding the video input using verbal memory also improved the performance slightly (f1-score of 0.431), but not significantly so. Our findings suggest that using both audio and video inputs is crucial for accurate joint engagement prediction, and that verbal memory guidance can further enhance the video modality’s performance.

6. Conclusion

In this paper, we presented Crossperson Memory Transformer (CPM-T), a multi-modal multi-party framework for modeling affect dynamics in interactive conversations. Our model capitalizes on modeling contextual information that incorporates self and inter-speaker influences. We accomplish this by using a memory and crossperson transformer. Experiments show that our model outperforms state-of-the-art models on three benchmark datasets. Extensive evaluations and case studies demonstrate the effectiveness of our proposed model. Additionally, the ability to visualize the attentions brings a sense of interpretability to the model, as it allows us to investigate which utterances in the conversational history provide important emotional cues for the current emotional state of the speaker. In the future, we plan to test our model on other relevant affective communication tasks and also explore more into the property of the momentum that comes from second derivatives.

Limitations & Future Works

This paper proposes a novel approach, the Cross-person Memory Transformer (CPM-T), which leverages long-range contextual information to predict affect communication tasks between individuals based on verbal and nonverbal cues. However, it is important to note that while the DAMI-P2C dataset used in this study relied on human-generated context, in real-world applications, it will be necessary for the agent to autonomously capture and reason about the context in order to produce appropriate verbal and nonverbal responses. Furthermore, affective momentum, which is a second-order derivative property that arises in the context of affect dynamics, was not explicitly considered in this study. As such, future research should focus on developing models that take into account this property and other higher-order affective phenomena.

References

(1)
Adamson et al. (2016) Lauren B Adamson, Roger Bakeman, and Katharine Suma. 2016. The joint engagement rating inventory (JERI). Technical Report. Technical report 25: Developmental Laboratory, Department of Psychology ….
Afouras et al. (2020) Triantafyllos Afouras, Andrew Owens, Joon Son Chung, and Andrew Zisserman. 2020. Self-Supervised Learning of Audio-Visual Objects from Video. In European Conference on Computer Vision.
Baltrusaitis et al. (2018) Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. Openface 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018). IEEE, 59–66.
Cao et al. (2019) Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. 2019. OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019).
Carreira and Zisserman (2017) João Carreira and Andrew Zisserman. 2017. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. CoRR abs/1705.07750 (2017). arXiv:1705.07750 http://arxiv.org/abs/1705.07750
Chen et al. (2022) Huili Chen, Sharifa Mohammed Alghowinem, Soo Jung Jang, Cynthia Breazeal, and Hae Won Park. 2022. Dyadic Affect in Parent-child Multi-modal Interaction: Introducing the DAMI-P2C Dataset and its Preliminary Analysis. IEEE Transactions on Affective Computing (2022), 1–1. https://doi.org/10.1109/TAFFC.2022.3178689
Curto et al. (2021) David Curto, Albert Clapés, Javier Selva, Sorina Smeureanu, Júlio C. S. Jacques Júnior, David Gallardo-Pujol, Georgina Guilera, David Leiva, Thomas B. Moeslund, Sergio Escalera, and Cristina Palmero. 2021. Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions. CoRR abs/2109.09487 (2021). arXiv:2109.09487 https://arxiv.org/abs/2109.09487
Dowell et al. (2018) Nia Dowell, Oleksandra Poquet, and Christopher Brooks. 2018. Applying group communication analysis to educational discourse interactions at scale. International Society of the Learning Sciences, Inc.[ISLS].
Duan et al. (2022) Jiafei Duan, Samson Yu, Nicholas Tan, Li Yi, and Cheston Tan. 2022. BOSS: A Benchmark for Human Belief Prediction in Object-context Scenarios. arXiv:2206.10665 [cs.CV]
Fan et al. (2021) Lifeng Fan, Shuwen Qiu, Zilong Zheng, Tao Gao, Song-Chun Zhu, and Yixin Zhu. 2021. Learning Triadic Belief Dynamics in Nonverbal Communication from Videos. CoRR abs/2104.02841 (2021). arXiv:2104.02841 https://arxiv.org/abs/2104.02841
Hall et al. (2019) Judith A. Hall, Terrence G. Horgan, and Nora A. Murphy. 2019. Nonverbal Communication. Annual Review of Psychology 70, 1 (2019), 271–294. https://doi.org/10.1146/annurev-psych-010418-103145 arXiv:https://doi.org/10.1146/annurev-psych-010418-103145 PMID: 30256720.
Hazarika et al. (2018) Devamanyu Hazarika, Soujanya Poria, Rada Mihalcea, Erik Cambria, and Roger Zimmermann. 2018. Icon: Interactive conversational memory network for multimodal emotion detection. In Proceedings of the 2018 conference on empirical methods in natural language processing. 2594–2604.
Hershey et al. (2016) Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, and Kevin W. Wilson. 2016. CNN Architectures for Large-Scale Audio Classification. CoRR abs/1609.09430 (2016). arXiv:1609.09430 http://arxiv.org/abs/1609.09430
Knapp et al. (2013) Mark L Knapp, Judith A Hall, and Terrence G Horgan. 2013. Nonverbal communication in human interaction. Cengage Learning.
Koval et al. (2021) Peter Koval, Patrick T Burnett, and Yixia Zheng. 2021. Emotional inertia: On the conservation of emotional momentum. In Affect dynamics. Springer, 63–94.
Kumar et al. (2022) Shivani Kumar, Anubhav Shrimal, Md Shad Akhtar, and Tanmoy Chakraborty. 2022. Discovering emotion and reasoning its flip in multi-party conversations using masked memory network and transformer. Knowledge-Based Systems 240 (2022), 108112.
Lee et al. (2023) Dong Won Lee, Yubin Kim, Rosalind Picard, Cynthia Breazeal, and Hae Won Park. 2023. Multipar-T: Multiparty-Transformer for Capturing Contingent Behaviors in Group Conversations. arXiv:2304.12204 [cs.CV]
Lee and Lee (2021) Joosung Lee and Wooin Lee. 2021. CoMPM: Context Modeling with Speaker’s Pre-trained Memory Tracking for Emotion Recognition in Conversation. arXiv preprint arXiv:2108.11626 (2021).
Li et al. (2022) Zhen Li, Cheng-Ze Lu, Jianhua Qin, Chun-Le Guo, and Ming-Ming Cheng. 2022. Towards An End-to-End Framework for Flow-Guided Video Inpainting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Liang et al. (2023) Chen Liang, Wenguan Wang, Tianfei Zhou, Jiaxu Miao, Yawei Luo, and Yi Yang. 2023. Local-global context aware transformer for language-guided video segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2023).
Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
Müller et al. (2018) Philipp Müller, Michael Xuelin Huang, and Andreas Bulling. 2018. Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behavior. In Proc. ACM International Conference on Intelligent User Interfaces (IUI). 153–164. https://doi.org/10.1145/3172944.3172969
Ng et al. (2022) Evonne Ng, Hanbyul Joo, Liwen Hu, Hao Li, Trevor Darrell, Angjoo Kanazawa, and Shiry Ginosar. 2022. Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion. arXiv:2204.08451 [cs.CV]
Picard (2000) Rosalind W Picard. 2000. Affective computing. MIT press.
Rao et al. (2021) Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. 2021. DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting. CoRR abs/2112.01518 (2021). arXiv:2112.01518 https://arxiv.org/abs/2112.01518
Rouast et al. (2019) Philipp V. Rouast, Marc T. P. Adam, and Raymond Chiong. 2019. Deep Learning for Human Affect Recognition: Insights and New Developments. CoRR abs/1901.02884 (2019). arXiv:1901.02884 http://arxiv.org/abs/1901.02884
Shen et al. (2021) Weizhou Shen, Junqing Chen, Xiaojun Quan, and Zhixian Xie. 2021. Dialogxl: All-in-one xlnet for multi-party conversation emotion recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 13789–13797.
Tran et al. (2017) Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. 2017. A Closer Look at Spatiotemporal Convolutions for Action Recognition. CoRR abs/1711.11248 (2017). arXiv:1711.11248 http://arxiv.org/abs/1711.11248
Tsai et al. (2019) Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for Computational Linguistics. Meeting, Vol. 2019. NIH Public Access, 6558.
Wu et al. (2022) Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, Alborz Geramifard, and Zhou Yu. 2022. Memformer: A Memory-Augmented Transformer for Sequence Modeling. In Findings of the Association for Computational Linguistics: AACL-IJCNLP 2022. 308–318.
Xu et al. (2019) Nan Xu, Wenji Mao, and Guandan Chen. 2019. Multi-interactive memory network for aspect based multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 371–378.
Yang et al. (2023) Yue Yang, Artemis Panagopoulou, Shenghao Zhou, Daniel Jin, Chris Callison-Burch, and Mark Yatskar. 2023. Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification. arXiv:2211.11158 [cs.CV]

Appendix A Datasets

A.1. DAMI-P2C

DAMI-P2C dataset was collected for capturing natural story-reading interactions between a parent and their child in a lab setting. The dataset consists of five major categories of content (audio and video, sociodemographic profiles, reading behavior features, affect annotations, and person identification and body tracking) necessary to understand the social-emotional behaviors and affective states of parent-child dyads in the co-reading context. This dataset focused on the parent-child co-reading interaction activity, a practice positively associated with both children’s later reading and language outcomes and their interest and enjoyment in reading later in childhood. To capture the parent-child engagement quality, the Joint Engagement Rating Inventory (JERI) (Adamson et al., 2016) was selected, as it quantitates and qualities the caregiver-child interaction during a joint activity where verbal and nonverbal behaviors related to engagement are observed and rated. Specifically, we selected to use Child Coordinated Engagement (CCE) in this work which involves the child’s engagement with the parent instead of their engagement with the activity. The child’s CCE will be rated low if the child is engaging in story listening or reading without attending to the parent and acknowledging their presence.

A.2. MPIIGroupInteraction

The data recording took place in a quiet office in which a larger area was cleared of existing furniture. To capture rich visual information and allow for natural bodily expressions, they used a 4DV camera system to record frame-synchronized video from eight ambient cameras. Specifically, two cameras were placed behind each participant and with a position slightly higher than the head of the participant. During the group forming process, experimenters ensured that participants in the same group did not know each other prior to the study. To prevent learning effects, every participant took part in only one interaction. To increase engagement, experimenters prepared a list of potential discussion topics and asked each group to choose the topic that was most controversial among group members. Afterward, the experimenter left the room and came back about 20 minutes later to end the discussion. Participants were then asked to complete several questionnaires about the other group members.

A.3. BOSS

Participants were instructed to form pairs and stand in front of a table. One table contained a list of contextual objects, and the other table contained a collection of objects that could be selected based on the presented context. Each contextual object had at least two and no more than three possible combinations of object table selections. The set of contextual objects is defined as $o_{i}^{context}=\{$ Chips, Magazine, Chocolate, Crackers, Sugar, Apple, Wine, Potato, Lemon, Orange, Sardines, TomatoCan, Walnut, Nail, and Plant $\}$ and the set of objects selected to match these contextual objects is defined as $o_{i}^{select}=\{$ Wine Opener, Knife, Mug, Peeler, Bowl, Scissors, Chips Cap, Marker, Water Spray, Hammer, and Can Opener $\}$ . This experimental design allowed for the investigation of participants’ ability to match objects with contextual information, and the BOSS dataset contains the collected data from this task.

Appendix B Training Details

	DAMI-P2C	MPII	BOSS
Batch Size	48	48	32
Initial Learning Rate	3e-3	1e-3	2e-3
Optimizer		AdamW (Loshchilov and Hutter, 2017)
Behavior Dim		640
$\#$ of Memory Slots	128	256	256
$\#$ of Epochs	20	15	15
Focusing parameter $\gamma$		10

Table 5. Hyperparameters of CPM-T model for best performance in various tasks.

Appendix C Features

The features for each modality are extracted using the following tools:

-Audio

We use Vggish (Hershey et al., 2016) for extracting low level acoustic features. The VGGish model was pre-trained on AudioSet. The extracted features are from the pre-classification layer after activation. The dimension of the feature tensor is 128.

-Vision

We use R(2+1)D (Tran et al., 2017) which the model was pre-trained on Kinetics 400. The model expects to input a stack of 16 RGB frames (112x112) and the dimension of the feature tensor is 512.

-Pose

We use OpenFace (Baltrusaitis et al., 2018) which provides normalized eye gaze direction, location of the head, location of 3D landmarks, and facial action units with a 128-dimensional vector. For the BOSS dataset, the face part was blurred due to privacy issues. However, we could utilize provided OpenPose (Cao et al., 2019) features which support 25-keypoint body/foot keypoint estimation, including 6-foot key points.

Appendix D Memory Modules

In Figure 6 (a), the input sequence $x$ undergoes an attention mechanism that encompasses all the memory slots, allowing it to retain historical information. In (b), each memory slot attends over itself and the representations of the input sequence to generate the subsequent memory slot at the next time step. This approach assumes that each memory slot independently stores information and introduces a specific form of sparse attention pattern. In this pattern, each slot in the memory has the ability to attend solely to itself and the outputs of the encoder. The primary objective is to maintain the information within each slot for an extended period throughout the time horizon. By limiting the attention to the slot itself during the writing process, the information contained within that slot remains unchanged in the subsequent timestep.

In addition, forgetting is an essential aspect of learning since it enables the filtering out of trivial and temporary information, allowing for the retention of more significant and valuable knowledge. In (Wu et al., 2022), they propose the utilization of Biased Memory Normalization (BMN), a forgetting mechanism designed specifically for slot memory representations. BMN involves the normalization of memory slots at each step, preventing the memory weights from growing infinitely and ensuring gradient stability over extended periods. To facilitate forgetting of previous information, they introduce a learnable vector bias, $v_{bias}$ . The initial state, $v_{bias}^{i}$ , is naturally incorporated after normalization.

(5)		$\displaystyle m_{t+1}^{i}\leftarrow m_{t+1}^{i}+v_{bias}^{i}$
		$\displaystyle m_{t+1}^{i}\leftarrow{{m_{t+1}^{i}}\over{\|\|m_{t+1}^{i}\|\|}}$
		$\displaystyle m_{0}^{i}\leftarrow{{v_{bias}^{i}}\over{\|\|v_{bias}^{i}\|\|}}$

The vector $v_{bias}$ serves as a control mechanism for the rate and direction of forgetting. Introducing $v_{bias}$ to the memory slot, it induces movement along the sphere, resulting in the forgetting of a portion of the stored information. If a memory slot remains unchanged for an extended period, it will eventually reach the terminal state, T, unless new information is injected. The terminal state also serves as the initial state and is subject to learning. The speed of forgetting is determined by the magnitude of $v_{bias}$ and the cosine distance between $m_{t+1}^{{}^{\prime}}$ (the updated memory slot) and $v_{bias}$ . For instance, if $m_{b}$ is nearly opposite to the terminal state, it would be challenging to forget its information. On the other hand, if $m_{a}$ is closer to the terminal state, it becomes easier to forget its information.

Appendix E Attention Analysis

To demonstrate how the memory network could be used to explain the needs for long-range dependencies, we analyzed the attention outputs from the memory writer following (Wu et al., 2022). We empirically categorized the memory slots into three different types and visualized three examples with normalized attention values in Figure 7 (a). We particularly selected memory slots with indexes of 0,17, and 22. These memory slots represent three types of memories. In 1st type of memory like $m_{22}$ , their attention focused on themselves, meaning that they were not updating for the current timestep. This suggests that memory slots can carry information from the distant past. For the second type, the memory slot $m_{17}$ had some partial attention over itself and the rest of the attention over other tokens. This type of memory slot is transformed from the first type of memory slot, and at the current timestep, they aggregate information from other tokens. The third type of memory slot looks like $m_{0}$ . It completely attended to the input tokens. In the beginning, nearly all memory slots belong to this type, but later only 5% to 10% of the total memory slots account for this type. We also found that the forgetting vector’s bias for $m_{0}$ had a larger magnitude compared to some other slots, suggesting that the information was changing rapidly for this memory slot.

In addition, to see how crossmodal attention learned the correlation between different modalities (Tsai et al., 2019), we visualize the visualize the attention activation in Figure 7 (b), which shows an example of a section of the crossmodal attention matrix on layer 3 of the $V\rightarrow A$ network (the original matrix has dimension $T_{A}\times T_{V}$ ; the figure shows the attention corresponding to approximately a 10-sec short window of that matrix). We observe that crossmodal attention has learned to attend to meaningful signals across the two modalities. For example, stronger attention is given to the intersection of story-related utterances that tend to trigger the engagement (e.g., “Only Comes”, “Big Brown”) and drastic facial expressions and body gesture change in the video. This observation advocates the advantage of crossmodal transformer over conventional alignment; crossmodal attention enables direct capture of potentially long-range signals, including those off diagonals on the attention matrix (Tsai et al., 2019).