Anchorage: Visual Analysis of Satisfaction in Customer Service Videos via Anchor Events

Kam Kwai Wong, Xingbo Wang, Yong Wang, Jianben He, Rong Zhang, and Huamin Qu KK Wong, X. Wang, J. He, R. Zhang, and H. Qu are with the Hong Kong University of Science and Technology. Email: {kkwongar, xingbo.wang, jhebtr, zhangab}@connect.ust.hk and [email protected]. Wang is with the Singapore Management University. Email: [email protected]. He is the corresponding author.Manuscript received April 19, 2005; revised August 26, 2015.

Abstract

Delivering customer services through video communications has brought new opportunities to analyze customer satisfaction for quality management. However, due to the lack of reliable self-reported responses, service providers are troubled by the inadequate estimation of customer services and the tedious investigation into multimodal video recordings. We introduce Anchorage, a visual analytics system to evaluate customer satisfaction by summarizing multimodal behavioral features in customer service videos and revealing abnormal operations in the service process. We leverage the semantically meaningful operations to introduce structured event understanding into videos which help service providers quickly navigate to events of their interest. Anchorage supports a comprehensive evaluation of customer satisfaction from the service and operation levels and efficient analysis of customer behavioral dynamics via multifaceted visualization views. We extensively evaluate Anchorage through a case study and a carefully-designed user study. The results demonstrate its effectiveness and usability in assessing customer satisfaction using customer service videos. We found that introducing event contexts in assessing customer satisfaction can enhance its performance without compromising annotation precision. Our approach can be adapted in situations where unlabelled and unstructured videos are collected along with sequential records.

Index Terms:

Video data, Video visualization, Customer satisfaction, Visual analytics.

1 Introduction

Customer satisfaction, hereafter “satisfaction,” is an important service quality metric that highly correlates with the perception of brand image, loyalty, and switching behavior [1]. Monitoring satisfaction is advocated by several international standards (e.g., ISO 10004 [2]) and helps organizations of any size to evaluate their performance and obtain managerial insights. Satisfaction is often measured directly (e.g., self-reported surveys [3]) and inferred indirectly (e.g., complaints [4]). Since customers’ direct responses are costly to acquire, the sample size of direct measurement is usually limited and thus damages its reliability. The summative post-service assessments are also weak in spontaneity and prone to cognitive biases such as leading questions and the peak-end rule [5]. Therefore, there is a pressing need for automatic methods that evaluate satisfaction efficiently and concurrently.

Services have increasingly been provided remotely because agents can serve more clients with fewer time and location constraints. Prior works on satisfaction analysis mainly focused on traditional digital delivery channels, such as text messages [6, 7] and phone calls [8, 9]. These are anticipated to transform into video-based communications because of the enhanced user experience, especially for customer services [10]. In the transformation, the supplement of video data has offered new opportunities for evaluating satisfaction, but it also comes with two main challenges.

First, the collected video data is in multimodalities, leading to difficulties in processing and comprehension. Emotions have been widely adopted to model satisfaction because of their high correlation. Therefore, previous analyses [8, 11, 9] have exploited multimodal emotional features to deduce the satisfaction level. However, emotions can have different meanings and impacts on the satisfaction dimension [12, 3]. For example, a client who complains about the service demonstrates intense negative feelings and low satisfaction, but a client who smiles and says “thank you” might only do so out of courtesy; thus, it is difficult to infer their satisfaction. The problem is further exacerbated by the two-way communication setting in customer services. Analyzing satisfaction from the client’s point of view is inadequate because the agent’s behaviors could be the root cause of the affective reactions [13]. There is a need to investigate how to combine emotional features with other behavioral cues to analyze satisfaction more effectively.

Second, customer services are characterized by sparse feature distributions and diverse event contexts. While video recordings may span several minutes, the desired features (e.g., facial expressions) could last only a few seconds. Identifying such subtle and instantaneous details from large video collections is tedious. Furthermore, deducing the event contexts from videos requires extra attention and domain expertise. The unstructured video data lacks a proper segmentation scheme to effectively summarize the whole service and its segments. The sequential and temporal relationship is seldom considered for evaluating satisfaction [9], leading to an extra cost in studying satisfaction patterns.

To address these challenges, we bring forward the use of anchors, i.e., semantically meaningful events that describe service procedures (operational anchors) and observable human behaviors (behavioral anchors), to represent critical event characteristics. The operational anchors introduce ordered event understanding into videos by offering sequential and temporal contexts. The behavioral anchors represent multimodal human behaviors compactly and straightforwardly to provide automated satisfaction evaluation. The two anchors were combined to navigate events of interest and interpret the progression of satisfaction levels with contexts.

We propose Anchorage, a visual analytics system to evaluate customer satisfaction by summarizing multimodal behavioral features in customer service videos and revealing abnormal events in service procedures. Anchorage generates potential operational and behavioral anchors based on a multi-perspective anomaly detection framework and provides a primitive satisfaction estimation. A set of coordinated visualizations is designed to analyze satisfaction contextualized by anchors, such that it magnifies a conventional satisfaction score with greater sequential and temporal resolutions. The effectiveness of Anchorage is verified through a case study and a carefully-designed user study. We found that introducing event contexts (i.e., anchors) to video analytics can enhance the performance of satisfaction evaluation tasks without compromising annotation precision. Anchorage is useful in summarizing video contents, identifying anomalous events, and understanding multimodal features. Our approach can be adapted in situations where unlabelled and unstructured videos are collected along with sequential records. In summary, our main contributions are:

$\diamond$

Problem characterization in evaluating satisfaction levels with customer service videos and machine logs through iterative discussions with domain experts. We applied the understanding to create an improvised dataset to verify the approach’s efficacy under different satisfaction scenarios.
$\diamond$

A multi-level anomaly detection framework to generate anchors for efficient event understanding in video analysis and adaptation of discrete event analysis to video visual analytics.
$\diamond$

Novel and metaphoric visualization designs that facilitate effective identification of multimodal anomalous events to evaluate satisfaction levels in customer service videos.

2 Related work

Our work is relevant to prior studies on customer satisfaction analysis, visual analytics for multimodal videos and their event understanding, and discrete event sequence visualization.

2.1 Data-driven customer satisfaction analysis

Automatic satisfaction evaluation has been conducted on diverse data types, such as texts [6, 14, 7], eye gaze trajectories [15], and electroencephalogram signals [16]. These approaches rely upon a high active participation rate and specific devices that are both uncommon in many service scenarios. Unlike other data sources, videos are more accessible, provide continuous input for in-depth satisfaction analysis [17], and do not impose an extra cognitive burden on customers [18]. For example, surveillance videos in retail stores [19] and crowd-sourced web camera videos [18, 20] have been analyzed for customer responses to products and services.

Many video-based approaches have focused on extracting facial expressions as frame-level features and evaluating overall video-level satisfaction [21, 22, 23]. Yolcu et al. [24] accumulated all the emotions of different customers as a proxy to estimate their satisfaction. They also considered the head pose in tackling poor performance when the targets’ faces are occluded. However, these methods were naively evaluated on how accurately the facial expressions in the videos were identified but not the satisfaction. They only applied a trivial model between emotion and satisfaction.

Besides the visual channels, acoustic information is also crucial for satisfaction estimation. Park and Gates [8] derived several prosodic and lexical features for SVM to model satisfaction. Seng and Ang [11] fused affective features in visual and acoustic channels with a linear model bespoke to satisfaction evaluation. Ando et al. [9] introduced a hierarchical framework to combine emotional features in the conversation and individual utterances. These approaches provided data-driven satisfaction scores for videos and were tested against real and improvised datasets. However, a video-level satisfaction score cannot identify the critical turning points and fails to distinguish the counteracted cases. Also, these approaches put little or no emphasis on the agent’s behaviors, which could be the antecedent incidents that affect the customer’s consequential reactions [1, 17]. Our work explores multimodal fusion with behavioral features and event contexts to establish background associations for interactive satisfaction analysis.

2.2 Visual analysis of multimodal videos

A plethora of visualizations has been proposed to summarize and represent video data in the community [25, 26, 27, 28, 29, 30]. Emotional features are commonly extracted from videos for many application problems. EmotionCues [25] summarized the emotional dynamics in classroom videos and highlighted model uncertainties by a stream graph design. EmotionMap [26] and E-ffective [29] proposed a map-based and spiral-based design to provide a temporal overview of the affective dimension in multimedia videos. These visualizations effectively presented visual cues for notable moments. However, videos often require more features than emotions alone to serve various domain-specific analytical purposes.

Recently, more modalities have been used to provide additional information missing from the visual channel. EmoCo [31] explored the emotional coherence across facial expressions, audio emotions, and transcript sentiments to mitigate over-reliance on a particular modality. Li et al. [32] visualized head pose with mouse movement data to connect multimodal behaviors in online proctoring. VideoModerator [33] engineered risk-related features from images and transcripts to assist live stream moderation. Besides verifying features, analyzing multimodalities has been useful in interpreting multimodal models [34, 35], querying large video collections [36], and annotating think-aloud usability test videos [37, 38]. Although integrating more relevant data sources increases the credibility of analysis results [39, 40], it inevitably increases cognitive loads in comprehending them. We utilize a set of well-coordinated views to facilitate the intuitive interpretation of multimodal features. In particular, we propose a scatter-based metaphoric visualization design to summarize the multimodal features and show satisfaction progression. We leverage machine logs to contextualize and grant procedural understanding to customer service videos. It avails a new perspective in summarizing video sequences.

2.3 Event understanding in video visual analytics

According to Höferlin et al. [41], video visual analytics has three main goals: status determination, event detection, and model generation. Status determination identifies frame-level features such as tracking objects. Contrarily, the other two goals consider a larger portion of the video. While event detection seeks to locate the moment when a specified event occurs, model generation aims at mining common patterns from video collections that can later be used to detect events. Our work focuses on model generation to make sense of unlabelled satisfaction patterns.

As the desired patterns are vaguely defined, analysts have to explore a large low-level event space to generate high-level concepts [37]. The frame-shot-scene hierarchy in movie analysis [42] and the object-event-tactic hierarchy in sports video annotation [43] can be viewed conjunctively to illustrate the complexity and interdependency of video events. To streamline the exploration, EventAnchor [44] traced visually available objects in racket sports videos and denoted their critical change of states as anchors. These anchors were plotted on the screen to indicate the objects’ locations for interactive calibration of the machine errors. However, the computer vision-based anchors are limited in numbers and cannot be trivially applied to acoustic channels and other modalities. Also, the visual effects, such as scoreboards and scene changes, are absent in many real-life video recordings to provide the event contexts. Inspired by the anchor concept, we extend its usefulness in understanding multimodal video events and generalize it into scenarios where videos are recorded along with sequential records. We leverage semantically meaningful events in service procedures to prioritize investigative efforts among different anchors. We propose a semi-automatic framework that generates anchor candidates with anomaly detection methods and supports candidate validation with intuitive visualization designs.

2.4 Discrete event sequence visualization

Discrete event analysis usually applies to log data such as computer system records [45], which have a timestamp for each event record and a semantic meaning for each event type. The multi-scale temporal structure of these events is often harnessed for visual summarization. For example, event sequences can be aggregated by multivariate regular expressions [46] and hierarchical clustering [47] to serve different level-of-detail requirements. Distinguishing branches [48] and bundling frequent patterns in event sequences [49] have facilitated a further understanding of the diverging and converging event evolution patterns. A recent survey [50] has summarized the design space for event sequences.

Through multi-scale overviews, analysts can visually compare the temporal characteristics of different event sequences to locate abnormal behaviors [51] and process drifts [52]. Guo et al. [53] proposed a VAE-based approach to detect arbitrary ordering, absence, and duplication of events. We borrow ideas from this line of research to discover unusual service procedures. We employ a similarity-based method to find temporal anomalies and a Markovian technique to detect sequential anomalies. They formulate operational anchors and contextualize the multimodal behavioral anchors for evaluating customer satisfaction.

3 Problem Characterization

This section introduces the background of satisfaction evaluation in customer services. We describe the requirement analysis and the details of the improvised dataset used for evaluation.

3.1 Satisfaction and customer service

Customer satisfaction is widely defined as the fulfillment of customers’ expectations with the perceived service quality [1, 2]. However, as suggested by the highly subjective terms “expectation” and “perceived,” satisfaction is a complicated construct with various interpretations by different people. Therefore, services are usually recorded to prevent misinformation and avoid conflicts in complaints, providing rich sources of recordings for analysis.

Customer service refers to the organization’s assistance and service for their customers before, during, and after-sales [2]. For example, contact centers provide customer services through phone calls for handling inquiries, managing orders, and troubleshooting issues [10]. While kiosks, mobile applications, and virtual assistants have been established to let customers serve themselves, staff-assisted customer services remain irreplaceable because of regulatory requirements, business procedural complexity, and insufficient machine capabilities in directing human intents [54].

Refer to caption — Figure 1: A customer service example modified from a customer service dialogue dataset in online shopping [54]. (A) The agent guidelines outline the typical procedure. (B) The service procedure is interwoven with the interactions and operations from the agent and the client. (C) The service record logs the executed operations with metadata such as timestamps.

Customer services are characterized by their goal orientation and communication dynamics. We focus on the typical setting where a customer service agent assists a client in completing domain-specific goals. We outline a typical workflow in Fig. 1 and clarify the terms used in the paper. First, the agent assists the client in specifying their intents and interpreting their needs at the beginning of a service. Second, the agent follows internal guidelines to derive a list of operations and guide the client through the sequential service procedure. Third, the agent and the client usually take turns communicating and performing the operations. For example, to verify a client’s identity, the agents can ask questions verbally, prompt for entry, or make a database query for the profile. Finally, the executed operations formulate a service record that may deviate from the agent guideline due to complex real-life situations. The service is usually recorded to be the service video.

3.2 Design requirement analysis

We adopted the design study methodology [55] to characterize the domain problems. During the past two years, we have collaborated with a domestic information technology company that digitalizes public services and develops remote service terminals for tax authorities. The terminals connect agents and clients at different locations through video communications. It facilitates essential operations such as transmitting legal documents and processing digital payments. The equipment collects video recordings and machine logs of the staff-assisted tax filing services.

To understand their workflows and satisfaction evaluation methods, we conducted contextual inquiries and semi-structured interviews with three frontline tax officers (E1-3) with over 5 years of service experience. We interviewed a professor (E4) with 10 years of research experience in customer relations to gain a second opinion from the marketing field. Under their influence, we read domain literature to understand important concepts and link them with visualization studies. We held a series of remote meetings with three business analysts (E5-7) with over 5 years of public service experience from our industry collaborator to refine design requirements, adapt previous findings to our application, and verify the iterative designs. None of E1-7 is a co-author of this paper.

Our goal was to design a satisfaction evaluation system for customer service providers to identify satisfaction patterns for improving their services and workflows. We identified six design requirements to support the development of Anchorage.

R1

Rank satisfaction by objective metrics. Clients seldom provide satisfaction feedback after services. When they do, their self-reported evaluation is prone to cognitive biases such as the peak-end rule [5] nonetheless. E1-2 added, “some clients rushed to leave, so they randomly clicked any buttons.” The large video collections also require a ranking order to prioritize videos of interest. The system should provide uniform and objective assessments based on users’ behaviors.
R2

Contextualize the satisfaction evaluation with operations. Customer behaviors should be interpreted with the antecedent events [4]. For example, expecting smooth services, clients would perceive repeated and interrupted operations as troublesome and unsatisfactory, resulting in a negative emotional response. However, clients have diverse affective reactions to provocative actions. E7 pointed out that “some people keep a poker face, but they could be furious,” suggesting the unreliability and inadequacy of using emotional features only. The system should incorporate procedural considerations as the common ground to explain and evaluate clients’ behaviors.
R3

Show satisfaction progression in a service. Automatic methods usually aggregate frame-level evaluations to model satisfaction [11]. However, the aggregated service score is inferior in differentiating counteracted cases. E5 proposed a satisfied case with a client showing unsatisfied behaviors at first but becoming more satisfied with the service at last. The case would be underrepresented in an accumulative service-level satisfaction score. Assessing satisfaction by individual operations naturally magnifies their contributions to the overall evaluation [9]. The system should visualize the dynamic satisfaction progression to reveal the causal relationships between behaviors and satisfaction.
R4

Provide an overview of the service record. The service records provide sequential and temporal information to indicate the service smoothness. The records of smooth services usually match the typical workflows described in the agent guidelines explicitly. Experienced agents (E1-3) could easily identify deviated operations when they read the records in semantically meaningful terms. The operation duration also implicitly hints on the procedural difficulties and the service status. The system should present adequate event contexts to foster procedural awareness and segment the services properly.
R5

Highlight the anomalous operations. Satisfaction generally follows a steady progression with previous states. A significant turning point could indicate a potential satisfaction pattern induced by internal factors (e.g., exceeding expectations [1]) and external factors (e.g., agents’ misconduct [4]). E4 stated that looking into the “peaks” and “troughs” of the satisfaction level would help derive more managerial insights. They are worth more attention to be further investigated. The system should distinguish uncommon satisfaction development to identify critical transition moments.
R6

Support interactive navigation of original videos. Video recordings are the strongest evidence in evaluating satisfaction. Yet, reviewing the videos from scratch is inefficient. Features extracted by machine learning models are helpful, but they might suffer from model uncertainty and multimodal interactions [35]. E6-7 expressed a need to validate the features when they convey “unreasonable and contradictory meanings.” Also, the dynamics between agents and clients are challenging to define and detect. The system should support various interactions to streamline the fast location of interested events.

3.3 Improvised dataset of customer service videos

We analyzed 20 authentic videos from E5-7 regarding tax service assistance in local government tax authorities. These videos had no self-reported satisfaction levels. Most clients showed a neutral face, and the services operated on average time. They could only reflect neutrally satisfying cases, as E6 and E7 provided their evaluations as ground truths. Obtaining more recordings was challenging because the Covid-19 pandemic lockdowns limited our presence at the local tax offices to obtain the client’s consent. To the best of our knowledge, there are no publicly available datasets that include both video recordings and service records for evaluating satisfaction. To overcome the imbalanced distribution per satisfaction level, we created an improvised dataset for proof-of-concept.

Participants and apparatus. We invited 26 employees (4 females and 22 males) of our industry collaborator to participate in improvising. These participants improvised either as agents or clients. Four business analysts were qualified as agents because they had expertise in the service workflows. The other participants were included as clients if they had visited government authorities for staff-assisted public services. The recording was taken with the collaborator’s terminals to simulate real illumination and occlusion settings. The study was approved by the internal IRB (#HREP-2021-0162), and the videos were recorded with written consent.

Designs and setup. We designed and exemplified typical service scenarios with different satisfaction levels. Four satisfaction types were deduced by observing the collected footage’s workflows and interviewing frontline agents (E1-3). They include:

ST

a SaTisfied service with a shorter completion time than expected. The agent delivers clear instructions and demonstrates proficiency in completing their tasks. The client is thankful.
NM

a NorMal service with matched expected completion time. The agent controls the time of each operation to be around average. The client is given no instructions.
DA

a Dissatisfied service about the Agent with a longer completion time than expected. The agent demonstrates inattentive behaviors (e.g., using mobile phones and chitchatting with others) to prolong the service. The client is annoyed.
DP

a Dissatisfied service about the Procedure with a longer completion time than expected. The service procedure is interrupted by the malfunctioned terminal, which requires the client to repeat certain operations. The client is annoyed.

We expressed the satisfaction types in high-level terms as guidelines. The participants created their own speech and reaction to improvise the services. The agents were asked to keep a neutral face to prevent emotional contagion [13]. The clients’ reactions were described in abstract terms such as “being thankful” and “being annoyed.” We refrained from scripting the scenarios to sustain spontaneity, enhance generalizability, and avoid the curse of knowledge.

The service scenario is about amending membership of social security insurance. Historical records show that the average time for the service is around 8 minutes. Since expectation significantly impacts satisfaction [1], we communicated this information as the expected time to control the temporal expectation for all participants. The typical process involves nine steps from initiate to close, as shown in Fig. 3B1. Our service scenario’s workflow is transferable to a task-oriented dataset about customer service [54].

Result. We collected 61 service videos, with at least twelve for each satisfaction type. We ensured no clients repeated acting in the same type. The total duration of the services is 5.8 hours, and each spans 3-12 minutes, averaging 5.7 minutes. The corresponding satisfaction type labels the ground truth of the videos.

4 Anchor generation

This section introduces the construction of anchors. Anchors refer to semantically meaningful events that describe operations (operational anchors) and behaviors (behavioral anchors). We identified anomalous events for services and operations as operational anchors with a multi-perspective anomaly detection framework (R5). We further extracted multimodal features from service videos to compute primitive satisfaction estimations for services and operations as behavioral anchors (R1). These anchors can be viewed as an interactive table of content to define the video event structure (R2) for quick navigation to desired segments of satisfaction patterns without searching the whole video (R6).

4.1 Processing multimodal features

We processed the visual and audio channels decomposed from videos separately and aligned them with the parsed machine logs.

Visual features. We detected the bounding boxes of faces on every frame with YOLO5Face [56] and applied triangular smooth to reduce glitches. Since occlusion is still challenging for facial expression recognition (FER) [57], we detected the head pose with FSA-Net [58] to validate the reliability and reduce the impact of misclassification. We adopted a FER model [59] and aggregated the output discrete emotions into three large classes (positive, neutral, and negative) because the correspondence between discrete emotions and satisfaction is unclear [1, 24]. E5-7 were confused with the role of sadness and fear in evaluating satisfaction in customer services. Moreover, sacrificing granularity for generalizability is a common approach in the affective analysis [17].

Audio features. We applied a speaker diarization algorithm [60] to remove noise, locate speech segments, and cluster utterances by speakers. Since there are two speakers in a video, we registered the agent with heuristics, such as identifying the common speaker across two videos with the same agent. The audio segments are piped into an audio emotion classification model [61]. The discrete emotion outputs are also aggregated, as in facial expressions.

Event features. The machine logs contain discrete events that describe operations in the service records. However, an operation could be represented as multiple unstructured free-form text messages due to inconsistent coding styles. We first transformed the logs into tuple representations that contain a timestamp, an event type, and a list of log parameters for analysis. A service is identified by matching a beginning and an ending log message with the same terminal request ID. We aggregated the co-occurring raw event types that are semantically related, and confirmed the aggregated event types, denoted as operations, with E5. The nine operations are used to segment the videos (Fig. 3B1). We counted the logs with the same operation $e$ to obtain a service record vector $E=[(e_{1},n_{1}),(e_{2},n_{2}),...]$ , where $n_{i}$ is the count of consecutive $e_{i}$ .

4.2 Operational anchors

The primary purpose of operational anchors is to lift the burden of status determination (as discussed in Sec. 2.3) for analysts when they watch the videos. The operational anchors can segment operations in service records and grant semantic meanings to segments (R4). They also introduce event structure to summarize video content. To prioritize anomalous events (R5), we employed a similarity-based method to find temporal anomalies and a Markovian technique to detect sequential anomalies from the service records. The service record vector $E$ is piped into the following algorithms to obtain the corresponding anomaly scores.

Temporal anomaly locates uncommon durations of operations. The Principle Component Analysis (PCA) [62] is a popular unsupervised method for system log analysis detecting anomalous discrete events. It computes the similarity between the input and the labeled sequences based on the assumption that anomalous sequences should be dissimilar to normal ones. Service records labeled as normal $E_{n}$ are further aggregated by the operations to obtain fixed-size vectors. They are reduced to $k$ principal components to formulate the normal space $S_{n}$ . A service record is said to be anomalous if $||y||^{2}>Q_{1-\alpha}$ , where $y$ is the projection length to $S_{n}$ , and $Q_{1-\alpha}$ is the confidence threshold defaulted at 95%. We had considered another popular method in system log analysis, invariant mining [63]. However, it is tailored to rigorous procedures in software systems and has limited generalizability. Meanwhile, PCA has the advantage of high interpretability and does not require a large training set.

Sequential anomaly locates uncommon chronological orders in service records. The Markov chain model [64] learns a transition probability distribution $P$ of different discrete states at each time frame in normal sequences. It assigns the service vector $E$ with:

P(E)=P(e_{1},e_{2},...,e_{T})=\prod^{T-1}_{t=1}{p_{e_{t},e_{t+1}}}

(1)

where $T$ is the fixed window size. A non-zero constant $\epsilon$ is introduced to prevent zero probability when a particular sub-sequence has not appeared in the training set. We scaled up and sampled down the event records to create fixed-sized event vectors. We set the anomalous threshold for operations at $|1/n|$ such that the transition occurs at least once in the training set and for service at a constant that captures 95% variance. The Markov chain model is chosen for three reasons: (1) It does not require large training data; (2) It has good scalability by enlarging the window size to facilitate a large number of log records; (3) It identifies the exact operation when it deviates from standard procedures (R4). Moreover, customer services have predefined procedures (agent guidelines) acting as the normal training set. Although the Markovian model is not designed to identify missing and abundance events [53], its anomaly score would still reflect these conditions as they would appear in the wrong place. The Markovian model is well-suited to detecting repeated and out-of-sync operations (R2).

4.3 Behavioral anchors

The behavioral anchors are the multimodal satisfaction evaluation. Similar to [24, 11], we adopt a linear model to generate a customer satisfaction score. We extended the model to cover event duration rather than affective status only (R2). The model combines facial expression $v$ , audio emotion $a$ , and events $e$ to evaluate satisfaction. The customer satisfaction score for a service $CS_{s}$ is calculated by:

CS_{s}=w_{v}f(\sum_{i=1}^{N}m_{v}v_{i})+w_{a}f(\sum_{j=1}^{M}m_{a}a_{j})-w_{e}\sum_{t=1}^{T}z_{e,t}

(2)

where $N$ , $M$ , and $T$ denote the total number of frames, utterances, and operations. $w$ is the weights of each channel defaulted as equally weighted. We also obtained the operation’s satisfaction score $CS_{e}$ for each modality by confining the summation scope to individual operation and modality. $f$ is the normal standardization across all services. $z_{e}$ is the z-score for the event duration. We grouped them by operations before standardizing because repeated operations are usually shorter and obfuscate the calculation. For emotional responses, we assigned a magnitude weight $m$ to each discrete emotion and adopted the scheme proposed by previous work [11]. In general, positive emotion has a value of +1.0, neutral emotions are 0.0, and negative emotions tend to -1.0. We slightly modified the weightings of anger to -1.2 and disgust to -1.0 based on the domain literature [12] and discussions with E5-7. A large value of $CS_{s}$ indicates high satisfaction and vice versa. All of the above settings can be reconfigured to adapt to other needs.

5 Visual interface

The visual interface of Anchorage supports satisfaction evaluation at multiple scales and anchor candidate validation with intuitive visualization designs. Fig. 3 shows the snapshot of the interactive system annotated with (A) the service overview, (B) the anchor exploration view, (C) the multimodal feature navigation view, and (D) the service video view. The service video view shows the original service videos and plays them in sync with other views when the corresponding video or event is selected. It also supports conventional video playback functions and other interactions described in the following sections.

5.1 Service overview and the buoy chart

The Service overview (Fig. 3A) provides an overview of all the service videos. It lists all the videos and supports fast comparison over multiple videos to search for a service of interest. Each list item (Fig. 3A1) contains three columns that show different satisfaction metrics (R1). The color encodings are unified for the visual interface (green for visual, red for audio, and purple for event). The leftmost column displays the basic information of the video. The horizontal bar chart shows the temporal and sequential anomaly scores described in Sec. 4.2. Identified anomalies are represented by filled color, and normal services are in striped color. The rightmost column is a vertical bar chart showing the satisfaction scores $CS_{s}$ of different modalities described in Sec. 4.3.

The middle column is a scatter-based design called the buoy chart (Fig. 4C). It summarizes the multimodal satisfaction scores of individual operations into a single graph. An operation is represented by two dots, which are in different colors to indicate their types (visual or audio). The vertical position encodes the dot’s satisfaction score $CS_{e}$ of its type. The dots close to the horizontal line have lower opacity, making them visually insignificant. The horizontal position encodes the reversed event score for all dots. The negation is designed to match the quadrants heuristic by converting a negative score, which denotes a shorter duration, into a positive. Moreover, two dots belonging to the same operation will be linked vertically to show a connected visual component based on the principle of continuity. The vertical and horizontal scales are centered at zero and capped within a threshold.

Justification. We developed a metaphor to flatten the buoy charts’ learning curve and lower the bar of visualization literacy [65]. The buoy chart employs the buoy metaphor as a buoy attaches to an anchor. A dot is referred to as a buoy. Normal services should have many ordinary operations; thus, dots are clustered around zero. While these buoys in proximity float on the surface imperceptibly, the significant buoys sink and rise to become visually apparent anchors. These outliers highlight anomalous operations with potential satisfaction patterns (R5). They detect counteracted cases (R3) when the polarized anchors are seen together with the near-zero aggregated scores (e.g., the audio channel of Fig. 3A1).

Moreover, the buoy chart can be interpreted with the quadrant heuristic in Fig. 4A. We estimate the overall satisfaction by checking the quadrant with the most anchors. For example, Fig. 3A3 shows many floating buoys and a few sinking anchors. The anchors’ positions suggest that the client exhibited negative emotions in a few prolonged operations. It shows how the buoy chart efficiently summarizes the distinguishing patterns (R1). The buoy chart is the visual alternative for multimodal fusion (e.g., Eq. 2) which may require extra effort to optimize the cost functions.

The buoy chart effectively encodes multimodal characteristics. Fig. 4C demonstrates the visual patterns of the three inter-modal interaction types summarized by Wang et al.[35] ( $e_{1-3}$ for dominant, $e_{4}$ for complementary, and $e_{5}$ for conflicting modals). We use the four cases in Fig. 3 as examples: A2-A3 dominantly express strong emotions in one channel; A4 conveys negative emotions by complementing both channels; A1 shows conflicting behaviors for some operations. These visual cues help analyze the clients’ emotional profiles (R2), which inform the subsequent analysis.

The buoy chart can be augmented to address various design issues. Techniques applicable to scatter-based designs are also likely effective for the buoy chart. For scalability issues, we can reduce the dot size and superpose bar charts to the sides to observe the operation’s distributions in densely populated regions. To prioritize the most important items, we can use a quadrant-based heatmap to filter and rank the videos by anchor patterns of interest.

Design alternatives. We focused on designing straightforward and standardized charts to suit the diverse background of our target users, i.e., service providers and agents. Building abstractions is a popular strategy for handling numerous videos in video collections. We implemented dimension reduction techniques to generate video clusters and visualize outliers to reduce review efforts. However, the techniques neglect temporal relationships and could not detect counteracted cases. Although we can set up exemplars to guide the clusters, it is challenging to make novices aware of the technical assumptions and avoid over-reliance on unsupervised results. We used the “view sequentially” strategy [66] instead and provided the satisfaction scores in different modalities as the ranking orders.

As an alternative to the buoy chart, we considered using the stacked bar chart to display the operations sequentially, as in Fig. 4D. It is visually apparent when the multimodal scores are dominant. However, the complementary and conflicting modalities challenge the interpretability of the chart for lacking quick decision rules. The chart also creates confusion when displaying positive and negative values together, and suffers scalability issues with more operations. Another option was the parallel coordinates (Fig. 4E). However, the stronger intra-modal scalability cannot compensate for the visual clutter of lines when performing inter-modal comparisons. It is also complicated to compare three modalities simultaneously to find anomalous operations. Fig. 4C-E share the same set of data to highlight their difference.

5.2 Anchor exploration view

This view (Fig. 3B) supports operation-level anchor exploration based on its service record. The timeline-based visualization (Fig. 3B1) represents each operation with a column of visual components. The horizontal position encodes the service time. Each column contains four rows. In the top row (Fig. 3Bi), we visualize the operations, turn-taking information, and the indicator of sequential anomalies to provide the procedural context and indicate sequential inconsistency (R4). The line is colored pink for the agent’s turn or yellow for the client’s. The triangle icon indicates that the operation is sequentially anomalous, as in Fig. 5A.

The bottom three bars (Fig. 3Bii) illustrate the statistics around the event, visual, and audio modalities, respectively. The first bar in purple shows the duration of the operation. The portion in dark purple indicates the amount of time exceeding the operation’s average. It signals a longer-than-usual operation and can be considered a temporal anomaly. The second bar in green and the third in red represent the proportion of time with detected features for the visual and audio channels. The striped pattern encodes the absence of features, such that green is for obscured client’s face and red is for silence. For example, Fig. 3Bii shows that the client’s face is not obscured for the whole operation, and the conversation lasts for about one-third of the time. These indicators provide background information about the reliability of the detected features. The visual components are associated with other views, so clicking on them can navigate to the multimodal features and the original service videos (R6).

The lateral buoy chart (Fig. 3B2) shows the satisfaction progression. The chart’s horizontal encoding follows the timeline above, so all dots are located in the middle of the operation. Its correspondence with the buoy chart is illustrated in Fig. 4B-C. While the two charts share the same metaphor, there are subtle differences. Each operation is represented by three dots, including the event. Here, the vertical position utilizes the z-score to unify all modalities. The visual and audio scores are summed over the operation and further standardized within the selected service. For example, $e_{5}$ in Fig. 4B contains vastly deviated scores for all modalities, while the $e_{3}$ counterparts have average scores. An anchor icon denotes higher values than two standard deviations. The scale helps detect the most anomalous service operations (R5). The buoy’s size encodes the absolute deviation rank to draw attention to the most influential anchor. The more significant deviation, the larger the buoy. The drawing order favors smaller buoys to prevent occlusion and visual clutters (see Fig. 5A).

Justification. The lateral buoy chart bridges the gap between the buoy chart and the timeline. We did not explicitly encode the operations shorter than average in the timeline because they could distort the layout. Also, they are less significant as shorter events usually have fewer behavioral anchors to verify. The missing temporal anomalous information is covered in the lateral buoy chart with the introduction of event buoys. Using familiar visual elements and metaphor reduce the burden of learning a new visualization. The correspondence could introduce higher efficiency when users are familiar with the system.

Design alternatives. During the design process, we referred to the event sequence design space [50] and quickly eliminated hierarchy-based, Sankey-based, and matrix-based designs because of the unfit tasks. We created the current design by combining bar charts and timelines for simplicity and familiarity. For visualizing feature progression, Zeng et al.[25] proposed five designs to show emotion flows. However, these designs are limited in visual summarization power because they lack event contexts. Our lateral buoy chart combines multiple visual elements to coherently express the satisfaction progression in service operations.

5.3 Multimodal feature navigation view

This view supports interactive navigation of the original videos (R6). We adopted the periphery plot [67] as the operation summary (Fig. 3C1). In the middle focused detail view, we fused the facial and audio features to assign an activation value $v_{i}=\{-1,0,1\}$ to frame $i$ . The fusion favors non-neutral emotions with higher priority given to negative ones because they have a greater impact on satisfaction [12]. The activation values are plotted to show an overview of the operation. Brushing selects the period for the below features. The periphery plots on both sides allow quick navigation to consecutive operations and contextualize the focused operation with neighbors. The three bars show the count of activation values.

The audio and visual channels are one-dimensional shaded matrices (Fig. 3C2) that encode the positive and negative outputs of facial expressions and audio emotions. Obscured and muted frames are encoded with the stripped pattern as before. The head pose information is shown in line charts (Fig. 3C3). The chart scales are written with the semantic meaning of the angles directly. The proximity between visual features and head pose allows users to verify emotions with occlusion from looking down. Clicking on the views (Fig. 3C4) seeks the time point in the video.

6 Evaluation

We present a case study with E5 and a structured user study to demonstrate the effectiveness and usability of Anchorage.

6.1 Case study

This study describes the satisfaction evaluation process of E5 for services provided by two of his actual subordinates, S1 and S2. He was tasked with rating the services and the agents by exploring the improvised dataset (Sec. 3.3) filtered by S1 and S2. He has been closely involved in the problem characterization and improvised dataset formulation, but he had not used Anchorage nor seen the videos before the case study. He knew about the satisfaction scenarios typically known to frontline agents. He was encouraged to follow the think-aloud protocol during his exploration.

Service exploration (R1-2). Beginning from the service overview (Fig. 3A), E5 first ranked the videos by descending service satisfaction score. ST09 (Fig. 3A2; Video names were masked during the exploration) was ranked number one on the list. He noticed that all metrics favored the service because all bars showed positive values, and there was a distinctive audio anchor in the buoy chart. Reading the basic info on the left column, he quickly determined it should be the ST type. He selected the service to see who the agent was (S1) from the service video view. Then, he continued browsing other videos. After a few attempts, his attention was caught by NM08 (Fig. 3A1). He examined the near-zero audio satisfaction score and the polarized buoys in the buoy chart and suspected it was a counteracted case. He felt interested in the case and wanted to know about the conflicting behaviors. He selected the service and proceeded to explore the service record.

Operation exploration (R3-5). Looking at the anchor exploration view (Fig. 3B), he first checked the procedure summary (Fig. 3B1) and did not find many sequential or temporal anomalies. Most of the agent’s operations finished in time, and no procedures deviated from agent guidelines. He ruled out the DP type. He turned to the lateral buoy chart (Fig. 3B2) and discovered the visual anchor indicating significant negative facial expressions. He also noticed that the previous operation of the anchor was very positive. Revisiting the operations’ names (“Execute” and “Pay”), he had a clue about the incident but needed more evidence. He clicked on the anchor icon to investigate the critical transition moment.

Feature verification (R5-6). Entering the multimodal feature navigation view (Fig. 3C), he found that the most negativity was located in the latter part of the “Pay” operation from the periphery summary and the visual feature (Fig. 3C4). He clicked on the orange frames at C4 to navigate the service videos (Fig. 3D). By watching the original video, he concluded that the negative emotions came from having to pay but not because the agent was inattentive. He rejected the DA type and declared it the NM type. However, he wondered why the client had many positive behaviors during the previous operation as he read the left periphery plot (Fig. 3C1). He clicked on the plot and repeated the feature verification analysis. He was intrigued by the fact that the client was only texting on his phone the whole time. This reinforced his NM rating, despite the high satisfaction score on the visual channel. He became confident about Anchorage’s ability to detect counteracted cases. S1 accrued a few NM and ST cases to be rated as good performance.

Satisfaction evaluation (R1-6). By ranking the videos in ascending scores, E5 found DP07 (Fig. 3A4) to have the lowest satisfaction score in both visual and audio channels. He noticed that the service was detected with both temporal and sequential anomalies. He reviewed the service in the anchor exploration view (Fig. 5A). He noticed that the client took longer than usual to upload his files. The procedure was also repeated twice such that it could annoy the client. He concluded that the service belonged to the DP type because he knew it was the only scenario.

A DAP case. The visual anchor under the “Verify” operation drew E5’s attention. He was confused because the typical DP scenario does not stage like this. He investigated the visual anchor and saw the lasting orange frames in Fig. 5B1. From the head pose chart, he observed that S2 had been looking down. He clicked on the behavior (Fig. 5B2) and derived from the videos that S2 had been playing on her phone (Fig. 5C). He speculated S2 might have combined the DA and DP types to create a more unsatisfied DAP case which annoyed the client with the prolonged procedure and the inattentive agent. Nevertheless, he rated S2 as having poor performance as an agent, but good performance as a business analyst in taking the initiative to create new requirements.

6.2 User study

We conducted the user study with a between-subject design and two conditions on the system used to evaluate the effectiveness and usability of Anchorage in evaluating customer service videos.

Apparatus. Presenting multimodal features in a VA system generally leads to better task performance than a baseline system without much computational support [32, 44, 33]. However, it could be unfair to compare Anchorage with both operational and behavioral anchors to a naive baseline because of the wide interaction gap and the compound effect. In this study, we aimed to evaluate whether the introduction of event context (i.e., operational anchors) enhances the performance of satisfaction evaluation tasks. We created the baseline system (Fig. 6) by ablating the event-based visualization components in Anchorage, namely, the buoy chart, the anchor exploration view, and the periphery plots. The remaining parts visualize multimodal features beneficial to automated satisfaction evaluation [11, 9]. The user’s mouse actions were logged for provenance analysis.

Data and tasks. We sampled videos from the dataset described in Sec. 3.3 for the satisfaction evaluation tasks. From a pilot study with E5-6, we estimated that participants could annotate three videos in ten minutes. We randomly selected 12 videos for the study in consideration of workload and duration. The videos are selected with two constraints: 1) equal coverage of all four satisfaction types, and 2) acted by different clients. They ensure the samples’ diversity and independence. We used two videos for demonstration, and participants evaluated the remaining. The remaining ten videos span 68.6 minutes and formulate the ten satisfaction evaluation tasks (T1-T10 in Fig. 7 with masked video names). In addition to rating the client satisfaction on a Likert scale from 1 (low) to 5 (high), we asked them to evaluate the agent proficiency and the service smoothness to further distinguish the videos.

Participants. We adopted snowball sampling starting from the colleagues of E5-7 to recruit 24 participants from our collaborators’ company (8 female, 16 male; age: Mean $(M)=28.4$ , Standard Deviation $(SD)=5.3$ ). While 16 participants are undergraduates in STEM disciplines, others attain diplomas in diverse backgrounds. They have 1-13 years of related experience in customer services ( $M=5.1$ , $SD=3.7$ ). They were compensated with $7.50 USD equivalent upon completion. We randomly assigned the participants to use the Anchorage system ( $P_{A}$ ) and the Baseline system ( $P_{B}$ ).

Procedure. The study was conducted remotely due to quarantine restrictions. The participants could access the assigned system deployed online. We first obtained their consent, and introduced the research background and the system’s functions via recorded videos for around 12 minutes. After three minutes of free exploration with the training examples, the participants should complete the ten tasks using the assigned system. Since we did not enforce a time limit on the tasks, we provided cash incentives to prevent low-quality responses. Each satisfaction rating is $1.5 USD (maximum five rewards), if it matches the ones by E5-6 within one point scale. Finally, they filled in a questionnaire about the assigned system and their background information. All sessions spanned between 30-90 minutes ( $M=56.5$ , $SD=18.3$ ).

6.2.1 Results

The Anchorage group had a shorter completion time than the Baseline group, while the annotations mainly stayed the same.

Completion time. We compared the completion time in minutes of the Anchorage group ( $A$ ) and the Baseline group ( $B$ ) on the ten evaluation tasks. Using Anchorage ( $M_{A}=27.6$ , $SD_{A}=19.3$ ) is, on average, 21.8% faster than using Baseline ( $M_{B}=35.3$ , $SD_{B}=17.1$ ) in evaluating the service videos, although the Mann-Whitney U test suggested that the difference is not statistically significant (W = 51.0, p = 0.118). Two individual tasks were significantly faster for the Anchorage group, namely, T2 ( $U=38$ , $p<0.05$ ) and T3 ( $U=30$ , $p<0.01$ ). Combined with the findings in the case study, the two satisfaction types (DA and ST) could be easier for Anchorage users to make preliminary decisions under event contexts. From Fig. 7, we observed that the Anchorage group has a shorter average and median completion time for all tasks except T1. Since T1 was the first task, a possible reason is that the time needed to learn the Anchorage system is longer than that of the Baseline.

The Anchorage group demonstrated a larger variance in completion time. Nine $P_{A}$ finished the ten tasks faster than average (i.e., $<27.6$ mins), while only three $P_{B}$ finished in that time. From the analytic provenance of three $P_{A}$ who needed more than 50 minutes, we found that they tended to watch the raw service videos instead of leveraging anchors to prioritize investigative efforts. This pattern only appeared in one $P_{B}$ . On the contrary, $P_{A}$ who finished in 10 minutes were observed focused on validating the anchors and verifying the features. We expect this would be the norm when the users become familiar with the system.

Annotated scores. We compared the three evaluation metrics of each satisfaction type across the Anchorage and Baseline group. Fig. 8 illustrates the concrete differences in all metrics between satisfaction types. It indicates that only the difference in client satisfaction score for NM ( $U=32.5$ , $p<0.05$ ) and ST ( $U=39.5$ , $p<0.05$ ) between the two groups has statistical significance. In general, $P_{B}$ tended to rate NM and ST with higher satisfaction than $P_{A}$ , whose ratings are closer to the labels provided by E5-6.

We further investigated whether the two systems can clearly distinguish the different satisfaction types. We performed the Wilcoxon signed-rank tests to accommodate the potential dependencies between tasks. For both groups, there is no evidence supporting the significance of the agent proficiency between NM and ST ( $W_{A}=15$ , $p_{A}=0.202$ ; $W_{B}=5$ , $p_{B}=0.068$ ). All other pair-wise comparisons among satisfaction types in the Anchorage group have $p<0.05$ , meaning that $P_{A}$ can clearly distinguish the four satisfaction types. However, for the Baseline group, the client satisfaction ( $W=10.5$ , $p=1$ ), agent proficiency ( $W=18$ , $p=1$ ), and service smoothness ( $W=7.5$ , $p=0.262$ ) for the DA and DP pair have no significance. It suggests that without the support of event contexts in customer services, users might consider the DA and DP types as equals. This annotation might not be fair to agents who have not caused unsatisfied cases. Therefore, event contexts should be considered in satisfaction evaluation, even for automatic methods, to prevent biases against agents.

Questionnaire. Fig. 9 reports the comparable questionnaire results among the two groups in three dimensions: task effectiveness, system usability, and visual designs. All the metrics listed in Fig. 9 have not been found with statistical significance between the two systems. Using multimodal behavioral features to support satisfaction evaluation tasks are welcomed by practitioners. The systems are also able to help quickly navigate anomalies. Users can distinguish more anomalies (DA vs. DP) using Anchorage with operational anchors than using the baseline system with multimodal behavioral anchors only. With visual analytics systems, they gained more trust in the automatic results. Some $P_{A}$ commented that the scores in individual operations had higher accuracy than the service ones. This confirms our hypothesis that the introduction of event structure helps evaluate satisfaction.

Participants generally have positive feedback towards the systems for system usability and visual designs. An interesting finding is that although the baseline system is easier to learn, its understandability and overwhelmingness seem worse than in Anchorage. The slight difference in the learning curve might also be reflected by the slower completion time for T1. With more visual components and novel visualization designs, the Anchorage is less overwhelming than the Baseline. It could be attributed to the familiarity of the event contexts and the intuitiveness of the metaphoric designs. Participants in the Anchorage group have reported more event-related insights, used the quadrant heuristic, and adopted the anchor analogies for the rating rationale.

7 Discussion

We summarize the lessons learned during the design study about satisfaction analysis in customer service settings as follows.

Transferability of customer service. We used the public service to characterize the problem domain. The abstracted workflow is transferable to a customer service dialogue dataset in online shopping [54], as shown in Fig. 1. It is because customer service is characterized by goal-oriented tasks and collaborative communication. Since our service type focuses on processing applications, our data has more sparse conversations than contact center calls [8, 9]. However, we also captured dense features such as the clients’ facial expressions and designed the system with both types of features. It enhances transferability to other service types regardless of feature sparsity.

Structuring video with event analysis. Videos are often classified as unstructured data because of the difficulties in understanding the states and detecting events. In situations where the videos are recorded along sequential records, we can adopt event analysis to formulate granular video segmentation schemes. These scenarios will become more prevalent as remote work environments become increasingly popular. The framework described in this work can be generalized to other applications, such as online education, smart manufacturing, system interface testing, and interactions in virtual reality. We showed that introducing operational anchors into the conventional video-based satisfaction analysis streamlines video content understanding. Video analytics should eagerly look for event structures to frame the features extracted from videos.

Improvised dataset as test cases for VAST. Collecting a real dataset is one of the most difficult challenges for VAST systems [55], including video analytics. In satisfaction analysis, extreme cases (very satisfied and very dissatisfied) are rare in real life. We used the improvised dataset to tackle the sample imbalance challenge. With close collaboration with domain experts, we define the typical satisfaction patterns and collect these patterns through improvising. They provide ground truth that can be used for pretraining models and act as the initial reference points to mitigate the cold-start problems. We can also control the environment to isolate unwanted effects and make the results comparable.

Incorporating domain experts to participate in creating such a dataset has been beneficial to our study. The dataset laid the common ground for our collaborators and us to discuss the expectation over the visual designs during the design process. However, in our case, the authenticity of the client’s response and dynamics is disputable due to improvision. For example, customers may look at their phones more often when they feel bored by a long process [15], while we asked our subjects to be more attentive. We should also be mindful of the Observer’s Paradox, which states that experimenters’ presence influences data gathering [17, 18]. More standardized protocols should be discussed and developed to promote fair and just evaluation.

Privacy and multimodal features. Recording the services is a norm in customer services to prevent conflicts resulting from misinformation [10]. Videos are particularly useful when complaints need to be investigated. In this work, the customer service videos were collected with internal IRB’s approval (#HREP-2021-0162) and written consent from participants. However, the increasingly tightening privacy policies might forbid the collection of service videos that include sensitive features such as the clients’ faces. Since Anchorage utilizes many emotional features extracted from the videos, its effectiveness would be significantly affected if the video collections were prohibited.

However, we have shown that combining operational and behavioral anchors can enhance satisfaction evaluation performance without compromising annotation precision. There are more multimodal features worth exploring that could facilitate satisfaction evaluation. For example, event features (e.g., business procedures) and other behavioral features (e.g., machine operations and agent behaviors) provide contexts for causal analysis to satisfaction. Screen-space and data-space sanitization techniques [68] could be useful in protecting clients’ privacy. We have also collected multi-view videos about the clients which can provide additional viewpoints to infer their actions. Future works can explore visualization techniques for summarizing multi-view videos. Another under-explored feature is the agents’ emotions. The emotional interaction between the agent and client could infer the other party’s affective status and evaluate whether the agent reacts appropriately.

8 Conclusion

This paper investigates customer satisfaction evaluation with customer service videos and service records. The fusion of multimodal behavioral features extracted from videos provides a primitive satisfaction evaluation. We introduce the use of machine logs to provide semantically meaningful video understanding and magnify a conventional satisfaction score with greater sequential and temporal resolutions. They both constitute the anchor concept. We constructed the anchors with a multi-perspective anomaly detection framework to help narrow down the vast event space. We developed the buoy charts and multi-faceted views to effectively summarize the services and navigate users to segments of interest.

We created an improvised dataset to show that Anchorage can detect satisfactory, unsatisfactory, and counteract normal cases. The combination of video analytics and event sequence analysis shows promising results in effectively understanding videos. We found that introducing event contexts to video analytics can enhance the performance of evaluating customer satisfaction in videos. Our approach can be adapted in situations where unlabelled and unstructured videos are collected along with sequential records.

Acknowledgments

We are grateful to Wucong Chen and the Guang Fo Zhan Qu team for their helpful feedback and coordination for the user studies. We also thank anonymous reviewers for their constructive comments. This research was supported in part by grant FSUST19-CWB09.

References

[1] R. L. Oliver, Satisfaction: A behavioral perspective on the consumer, 2nd ed. Routledge, 2010.
[2] “Quality management — Customer satisfaction — Guidelines for monitoring and measuring,” ISO 10004:2018, 2018.
[3] A. Wong, “The role of emotional satisfaction in service encounters,” Managing Service Quality: An International Journal, vol. 14, no. 5, pp. 365–376, 2004.
[4] B. Tronvoll, “Negative emotions and their effect on customer complaint behaviour,” Journal of Service Management, vol. 22, no. 1, pp. 111–134, 2011.
[5] R. A. Peterson and W. R. Wilson, “Measuring customer satisfaction: fact and artifact,” Journal of the academy of marketing science, vol. 20, no. 1, pp. 61–71, 1992.
[6] S. Al-Otaibi, A. Alnassar, A. Alshahrani, A. Al-Mubarak, S. Albugami, N. Almutiri, and A. Albugami, “Customer satisfaction measurement using sentiment analysis,” International Journal of Advanced Computer Science and Applications, vol. 9, no. 2, pp. 106–117, 2018.
[7] A. See and C. Manning, “Understanding and predicting user dissatisfaction in a neural generative chatbot,” in Proc. SIGDIAL. Singapore and Online: Association for Computational Linguistics, Jul. 2021, pp. 1–12.
[8] Y. Park and S. C. Gates, “Towards real-time measurement of customer satisfaction using automatically generated call transcripts,” in Proc. CIKM. NY: ACM, 2009, p. 1387–1396.
[9] A. Ando, R. Masumura, H. Kamiyama, S. Kobashikawa, Y. Aono, and T. Toda, “Customer satisfaction estimation in contact center calls based on a hierarchical multi-task model,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 715–728, 2020.
[10] M. Saberi, O. K. Hussain, and E. Chang, “Past, present and future of contact centers: a literature review,” Business Process Management Journal, 2017.
[11] K. P. Seng and L.-M. Ang, “Video analytics for customer emotion and satisfaction at contact centers,” IEEE Transactions on Human-Machine Systems, vol. 48, no. 3, pp. 266–278, 2018.
[12] V. Liljander and T. Strandvik, “Emotions in service satisfaction,” International Journal of Service Industry Management, vol. 8, no. 2, pp. 148–169, 1997.
[13] A. Cheshin, A. Amit, and G. A. van Kleef, “The interpersonal effects of emotion intensity in customer service: Perceived appropriateness and authenticity of attendants’ emotional displays shape customer trust and satisfaction,” Organizational Behavior and Human Decision Processes, vol. 144, pp. 97–111, 2018.
[14] Q. Zhang, W. Wang, and Y. Chen, “Frontiers: In-consumption social listening with moment-to-moment unstructured data: The case of movie appreciation and live comments,” Marketing Science, vol. 39, no. 2, pp. 285–295, 2020.
[15] Z. X. Liu, Y. Liu, and X. Gao, “Using mobile eye tracking to evaluate the satisfaction with service office,” in Design, User Experience, and Usability. Practice and Case Studies. Cham: Springer International Publishing, 2019, pp. 183–195.
[16] S. Kumar, M. Yadava, and P. P. Roy, “Fusion of eeg response and sentiment analysis of products review to predict customer satisfaction,” Information Fusion, vol. 52, pp. 41–52, 2019.
[17] H. Gunes and B. Schuller, “Categorical and dimensional affect analysis in continuous input: Current trends and future directions,” Image and Vision Computing, vol. 31, no. 2, pp. 120–136, 2013.
[18] D. McDuff, R. E. Kaliouby, and R. W. Picard, “Crowdsourcing facial responses to online videos,” IEEE Transactions on Affective Computing, vol. 3, no. 4, pp. 456–468, 2012.
[19] A. Generosi, S. Ceccacci, and M. Mengoni, “A deep learning-based system to track and analyze customer behavior in retail store,” in Proc. ICCE-Berlin, 2018, pp. 1–6.
[20] D. McDuff, R. E. Kaliouby, J. F. Cohn, and R. W. Picard, “Predicting ad liking and purchase intent: Large-scale analysis of facial responses to ads,” IEEE Transactions on Affective Computing, vol. 6, no. 3, pp. 223–235, 2015.
[21] M. Slim, R. Kachouri, and A. B. Atitallah, “Customer satisfaction measuring based on the most significant facial emotion,” in Proc. SSD, 2018, pp. 502–507.
[22] N. Sugianto, D. Tjondronegoro, and B. Tydd, “Deep residual learning for analyzing customer satisfaction using video surveillance,” in Proc. AVSS, 2018, pp. 1–6.
[23] M. González-Rodríguez, M. Díaz-Fernández, and C. Pacheco Gómez, “Facial-expression recognition: An emergent approach to the measurement of tourist satisfaction through emotions,” Telematics and Informatics, vol. 51, p. 101404, 2020.
[24] G. Yolcu, I. Oztel, S. Kazan, C. Oz, and F. Bunyak, “Deep learning-based face analysis system for monitoring customer interest,” Journal of ambient intelligence and humanized computing, vol. 11, no. 1, pp. 237–248, 2020.
[25] H. Zeng, X. Wang, A. Wu, Y. Wang, Q. Li, A. Endert, and H. Qu, “EmoCo: Visual analysis of emotion coherence in presentation videos,” IEEE Transactions on Visualization and Computer Graphics, vol. 26, no. 1, pp. 927–937, 2020.
[26] C.-X. Ma, J.-C. Song, Q. Zhu, K. Maher, Z.-Y. Huang, and H.-A. Wang, “Emotionmap: Visual analysis of video emotional content on a map,” Journal of Computer Science and Technology, vol. 35, no. 3, pp. 576–591, 2020.
[27] X. Wang, H. Zeng, Y. Wang, A. Wu, Z. Sun, X. Ma, and H. Qu, “Voicecoach: Interactive evidence-based training for voice modulation skills in public speaking,” in Proc. CHI. New York, NY, USA: ACM, 2020, p. 1–12.
[28] X. Wang, Y. Ming, T. Wu, H. Zeng, Y. Wang, and H. Qu, “Dehumor: Visual analytics for decomposing humor,” IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 12, pp. 4609–4623, 2022.
[29] K. Maher, Z. Huang, J. Song, X. Deng, Y.-K. Lai, C. Ma, H. Wang, Y.-J. Liu, and H. Wang, “E-ffective: A visual analytic system for exploring the emotion and effectiveness of inspirational speeches,” IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 1, pp. 508–517, 2022.
[30] H. Zeng, X. Wang, Y. Wang, A. Wu, T.-C. Pong, and H. Qu, “Gesturelens: Visual analysis of gestures in presentation videos,” IEEE Transactions on Visualization and Computer Graphics, pp. 1–1, 2022.
[31] H. Zeng, X. Shu, Y. Wang, Y. Wang, L. Zhang, T.-C. Pong, and H. Qu, “EmotionCues: Emotion-oriented visual summarization of classroom videos,” IEEE Transactions on Visualization and Computer Graphics, vol. 27, no. 7, pp. 3168–3181, 2021.
[32] H. Li, M. Xu, Y. Wang, H. Wei, and H. Qu, “A visual analytics approach to facilitate the proctoring of online exams,” in Proc. CHI. NY: ACM, 2021.
[33] T. Tang, Y. Wu, Y. Wu, L. Yu, and Y. Li, “Videomoderator: A risk-aware framework for multimodal video moderation in e-commerce,” IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 1, pp. 846–856, 2022.
[34] P. P. Liang, Y. Lyu, G. Chhablani, N. Jain, Z. Deng, X. Wang, L.-P. Morency, and R. Salakhutdinov, “Multiviz: An analysis benchmark for visualizing and understanding multimodal models,” ArXiv preprint ArXiv:2207.00056, 2022.
[35] X. Wang, J. He, Z. Jin, M. Yang, Y. Wang, and H. Qu, “M2Lens: Visualizing and explaining multimodal models for sentiment analysis,” IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 1, pp. 802–812, 2022.
[36] A. Wu and H. Qu, “Multimodal analysis of video collections: Visual exploration of presentation techniques in ted talks,” IEEE Transactions on Visualization and Computer Graphics, vol. 26, no. 7, pp. 2429–2442, 2020.
[37] T. Blascheck, F. Beck, S. Baltes, T. Ertl, and D. Weiskopf, “Visual analysis and coding of data-rich user behavior,” in Proc. VAST, 2016, pp. 141–150.
[38] E. J. Soure, E. Kuang, M. Fan, and J. Zhao, “Coux: Collaborative visual analysis of think-aloud usability test videos for digital interfaces,” IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 1, pp. 643–653, 2022.
[39] Y. Lin, K. Wong, Y. Wang, R. Zhang, B. Dong, H. Qu, and Q. Zheng, “Taxthemis: Interactive mining and exploration of suspicious tax evasion groups,” IEEE Transactions on Visualization and Computer Graphics, vol. 27, no. 2, pp. 849–859, 2021.
[40] W. Zhang, J. K. Wong, X. Wang, Y. Gong, R. Zhu, K. Liu, Z. Yan, S. Tan, H. Qu, S. Chen, and W. Chen, “Cohortva: A visual analytic system for interactive exploration of cohorts based on historical data,” IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 1, pp. 756–766, 2023.
[41] B. Höferlin, M. Höferlin, G. Heidemann, and D. Weiskopf, “Scalable video visual analytics,” Information Visualization, vol. 14, no. 1, pp. 10–26, 2015.
[42] K. Kurzhals, M. John, F. Heimerl, P. Kuznecov, and D. Weiskopf, “Visual movie analytics,” IEEE Transactions on Multimedia, vol. 18, no. 11, pp. 2149–2160, 2016.
[43] Z. Chen, S. Ye, X. Chu, H. Xia, H. Zhang, H. Qu, and Y. Wu, “Augmenting sports videos with viscommentator,” IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 1, pp. 824–834, 2022.
[44] D. Deng, J. Wu, J. Wang, Y. Wu, X. Xie, Z. Zhou, H. Zhang, X. L. Zhang, and Y. Wu, “Eventanchor: Reducing human interactions in event annotation of racket sports videos,” in Proc. CHI. NY: ACM, 2021.
[45] C. Shi, Y. Wu, S. Liu, H. Zhou, and H. Qu, “Loyaltracker: Visualizing loyalty dynamics in search engines,” IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 1733–1742, 2014.
[46] B. C. Cappers and J. J. van Wijk, “Exploring multivariate event sequences using rules, aggregations, and selections,” IEEE Transactions on Visualization and Computer Graphics, vol. 24, no. 1, pp. 532–541, 2018.
[47] J. Magallanes, T. Stone, P. D. Morris, S. Mason, S. Wood, and M.-C. Villa-Uriol, “Sequen-c: A multilevel overview of temporal event sequences,” IEEE Transactions on Visualization and Computer Graphics, vol. 28, no. 1, pp. 901–911, 2022.
[48] Z. Liu, B. Kerr, M. Dontcheva, J. Grover, M. Hoffman, and A. Wilson, “Coreflow: Extracting and visualizing branching patterns from event sequences,” Computer Graphics Forum, vol. 36, no. 3, pp. 527–538, 2017.
[49] P. J. Polack Jr., S.-T. Chen, M. Kahng, K. D. Barbaro, R. Basole, M. Sharmin, and D. H. Chau, “Chronodes: Interactive multifocus exploration of event sequences,” ACM Trans. Interact. Intell. Syst., vol. 8, no. 1, feb 2018.
[50] Y. Guo, S. Guo, Z. Jin, S. Kaul, D. Gotz, and N. Cao, “A survey on visual analysis of event sequence data,” IEEE Transactions on Visualization and Computer Graphics, pp. 1–1, 2021.
[51] P. H. Nguyen, C. Turkay, G. Andrienko, N. Andrienko, O. Thonnard, and J. Zouaoui, “Understanding user behaviour through action sequences: From the usual to the unusual,” IEEE Transactions on Visualization and Computer Graphics, vol. 25, no. 9, pp. 2838–2852, 2019.
[52] A. Yeshchenko, C. Di Ciccio, J. Mendling, and A. Polyvyanyy, “Visual drift detection for sequence data analysis of business processes,” IEEE Transactions on Visualization and Computer Graphics, pp. 1–1, 2021.
[53] S. Guo, Z. Jin, Q. Chen, D. Gotz, H. Zha, and N. Cao, “Interpretable anomaly detection in event sequences via sequence matching and visual comparison,” IEEE Transactions on Visualization and Computer Graphics, pp. 1–1, 2021.
[54] D. Chen, H. Chen, Y. Yang, A. Lin, and Z. Yu, “Action-based conversations dataset: A corpus for building more in-depth task-oriented dialogue systems,” in Proc. NAACL-HLT. Online: ACL, Jun. 2021, pp. 3002–3017.
[55] M. Sedlmair, M. Meyer, and T. Munzner, “Design study methodology: Reflections from the trenches and the stacks,” IEEE Transactions on Visualization and Computer Graphics, vol. 18, no. 12, pp. 2431–2440, 2012.
[56] D. Qi, W. Tan, Q. Yao, and J. Liu, “Yolo5face: Why reinventing a face detector,” ArXiv preprint ArXiv:2105.12931, 2021.
[57] S. Li and W. Deng, “Deep facial expression recognition: A survey,” IEEE Transactions on Affective Computing, pp. 1–1, 2020.
[58] T.-Y. Yang, Y.-T. Chen, Y.-Y. Lin, and Y.-Y. Chuang, “Fsa-net: Learning fine-grained structure aggregation for head pose estimation from a single image,” in Proc. CVPR, 2019, pp. 1087–1096.
[59] K. Wang, X. Peng, J. Yang, S. Lu, and Y. Qiao, “Suppressing uncertainties for large-scale facial expression recognition,” in Proc. CVPR, June 2020.
[60] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “Pyannote.audio: Neural building blocks for speaker diarization,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, 2020, pp. 7124–7128.
[61] M. G. de Pinto, M. Polignano, P. Lops, and G. Semeraro, “Emotions understanding model from spoken language using deep neural networks and mel-frequency cepstral coefficients,” in 2020 IEEE Conference on Evolving and Adaptive Intelligent Systems (EAIS), 2020, pp. 1–5.
[62] W. Xu, L. Huang, A. Fox, D. Patterson, and M. I. Jordan, “Detecting large-scale system problems by mining console logs,” in Proc. SIGOPS. New York, NY, USA: ACM, 2009, p. 117–132.
[63] J.-G. Lou, Q. Fu, S. Yang, Y. Xu, and J. Li, “Mining invariants from console logs for system problem detection,” in Proc. USENIX, ser. USENIXATC’10. USA: USENIX Association, 2010, p. 24.
[64] N. Ye, “A markov chain model of temporal behavior for anomaly detection,” in Proceedings of the IEEE Systems, Man, and Cybernetics Information Assurance and Security Workshop, vol. 166, 2000, p. 169.
[65] L. Yang, C. Xiong, J. K. Wong, A. Wu, and H. Qu, “Explaining with examples lessons learned from crowdsourced introductory description of information visualizations,” IEEE Transactions on Visualization and Computer Graphics, pp. 1–1, 2021.
[66] M. Gleicher, “Considerations for visualizing comparison,” IEEE Transactions on Visualization and Computer Graphics, vol. 24, no. 1, pp. 413–423, 2018.
[67] B. Morrow, T. Manz, A. E. Chung, N. Gehlenborg, and D. Gotz, “Periphery plots for contextualizing heterogeneous time-based charts,” in 2019 IEEE Visualization Conference (VIS), 2019, pp. 1–5.
[68] J. Zhou, X. Wang, J. K. Wong, H. Wang, Z. Wang, X. Yang, X. Yan, H. Feng, H. Qu, H. Ying, and W. Chen, “Dpviscreator: Incorporating pattern constraints to privacy-preserving visualizations via differential privacy,” IEEE Transactions on Visualization and Computer Graphics, vol. 29, no. 1, pp. 809–819, 2023.