Winning the CVPR’2022 AQTC Challenge:
A Two-stage Function-centric Approach
Abstract
Affordance-centric Question-driven Task Completion for Egocentric Assistant (AQTC) is a novel task in which an AI assistant learns from instructional videos and scripts and guides the user step-by-step. In this paper, we address AQTC via a two-stage Function-centric approach, which consists of a Question2Function Module that grounds the question in the related function and a Function2Answer Module that predicts the action based on the historical steps. We evaluated several possible solutions in each module and obtained significant gains compared to the given baselines. Our code is available at https://github.com/starsholic/LOVEU-CVPR22-AQTC.
1 Introduction
Intelligent assistants are increasingly becoming a part of users’ daily lives. Along this line, Affordance-centric Question-driven Task Completion (AQTC) [22], which aims to guide the user through unfamiliar tasks step-by-step using knowledge learned from instructional videos and scripts, has been newly introduced. Different from existing settings such as Visual Question Answering (VQA) [3] or Visual Dialog [7], the question in AQTC concerns a specific task, and the answer is multi-modal and multi-step, which makes it more challenging.
To solve this problem, we propose a novel two-stage Function-centric approach, which consists of a Question2Function Module and a Function2Answer Module. Our main motivation is that only part of the instructional video is helpful for answering the question, and taking the entire video into account could introduce unnecessary noise. Along this line, we first define several schemas to segment the scripts into textual function-paras. Then we design a text-similarity-based method to select the specific video clips and paras that are closely related to the user’s question. After obtaining the relevant context information, we formulate the multi-step QA as a classification task and leverage a neural network to retrieve the correct answer for each step.
With our model and several training tricks, we achieved a substantial performance boost compared to the given baselines.
2 Related Work
2.1 Measurement of Text Similarity
Generally speaking, to calculate text similarity it is important to represent the text as numerical features that can be compared directly. Existing methods can be categorized into two groups: string-based methods and corpus-based methods.
String-based methods aim to measure similarity between two text strings based on string sequences or character composition, including character-based methods [14, 21] and phrase-based methods [11, 15]. Different from string-based methods, corpus-based methods leverage the textual feature or co-occurrence probability to calculate the text similarity at the corpus level, which are usually achieved in three ways: bag-of-words model like Term Frequency–Inverse Document Frequency (TF-IDF) [20], distributed representation methods like Word2Vec [18] and BERT [10], and matrix factorization methods like Latent Semantic Analysis (LSA) [8] and Latent Dirichlet Allocation (LDA) [5].
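To make the two families concrete (this example is ours, not part of the challenge code), the sketch below contrasts a string-based word-overlap (Jaccard) score with a corpus-based TF-IDF cosine similarity computed with scikit-learn; the example sentences are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def jaccard_similarity(a, b):
    """String-based: overlap of the two word sets divided by their union."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)


def tfidf_cosine(corpus, query):
    """Corpus-based: TF-IDF vectors compared by cosine similarity."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(corpus)   # (num_docs, vocab_size)
    query_vec = vectorizer.transform([query])       # (1, vocab_size)
    return cosine_similarity(query_vec, doc_matrix)[0].tolist()


corpus = ["press the start button", "turn the time knob clockwise"]
query = "how to start the microwave"
print(jaccard_similarity(corpus[0], query))
print(tfidf_cosine(corpus, query))
```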
2.2 Visual Question Answering
VQA is the task of answering questions about an image [4] or video [17]; existing approaches can be roughly divided into attention-based methods [2, 25] and bilinear-pooling-based approaches [13, 16, 26]. [2] developed different attention modules to adaptively attend to the relevant image regions based on the question representation. [13] proposed to employ compact bilinear pooling to combine the visual and linguistic features. However, these tasks mainly focus on the third-person perspective, while the AQTC task concentrates on egocentric scenes.
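As a toy illustration of the question-guided attention idea in [2] (our own sketch, not the model used in this paper; all dimensions are placeholders):

```python
import torch
import torch.nn as nn


class QuestionGuidedAttention(nn.Module):
    """Minimal top-down attention: score each image region against the
    question embedding and return a question-weighted visual feature."""

    def __init__(self, region_dim, question_dim, hidden_dim=512):
        super().__init__()
        self.proj = nn.Linear(region_dim + question_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, regions, question):
        # regions: (B, K, region_dim), question: (B, question_dim)
        q = question.unsqueeze(1).expand(-1, regions.size(1), -1)
        logits = self.score(torch.tanh(self.proj(torch.cat([regions, q], dim=-1))))
        weights = torch.softmax(logits, dim=1)   # (B, K, 1), one weight per region
        return (weights * regions).sum(dim=1)    # (B, region_dim)
```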
3 Proposed Method
In this section, we first present the problem statement of the AQTC task. Then, we introduce the technical details of each module of our two-stage Function-centric framework step-by-step, as shown in Fig. 1.

3.1 Problem Statement
Given an instructional video $V$, the video’s corresponding textual script $S$, the user’s question $Q$, and the set of candidate answers $A = \{a^s_i\}$, where $a^s_i$ denotes the $i$-th candidate answer in the $s$-th step, we aim to select the correct answer at each step. Specifically, to ground the question $Q$ in the instructional video $V$ and script $S$, we first segment $V$ and $S$ into a function set $F = \{f_1, \dots, f_N\}$ and then match $Q$ with the related functions. Note that each function $f_n$ consists of a function-clip $v_n$ and a function-para $p_n$. Afterwards, taking the weighted function set $\hat{F}$, the question $Q$, and the candidate answers $A$ as input, we formulate the multi-step QA as a classification task in a supervised way.
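To make the notation concrete, below is a minimal sketch of how a sample could be organized in code; the field names are our own and do not come from the AssistQ data release.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Function:
    """One function f_n: a video clip paired with its script paragraph."""
    clip_frames: List       # visual function-clip v_n (e.g., sampled frames)
    para: str               # textual function-para p_n


@dataclass
class AQTCSample:
    video_id: str
    script: str                                   # full script S
    question: str                                 # user question Q
    functions: List[Function]                     # segmented function set F
    candidate_answers: List[List[Tuple[str, object]]]  # per step: (action text, button image)
    ground_truth: List[int]                       # correct answer index for each step
```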
3.2 Question2Function Module
We now turn to explain the technical details for the Question2Function Module. Since the instructional videos are used to guide user or AI assistant in a step-by-step manner, we first segment both script and video into individual functions, instead of sentences or frames, to insure the completeness of each step. Meanwhile, it is critical to ground the specific question with the related function as the correct answer often co-occurs with the corresponding function.
Specifically, we first segment the script into textual function-paras according to a pre-defined schema (see details in our project’s repo), and then divide the corresponding video into visual function-clips via the aligned script timestamps. In this way, the instructional video and script are divided into the function set, and each function $f_n$ not only contains a textual description $p_n$ but also includes visual guidance $v_n$.
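A simplified sketch of this segmentation step is shown below. The full schema lives in our repo; here we assume, purely for illustration, that each function begins at a script sentence starting with “How to” and that every sentence carries an aligned (start, end) timestamp in seconds.

```python
def segment_script(script_sentences, timestamps, video_frames, fps=1):
    """Split a script into function-paras and cut the aligned function-clips.

    Assumes each new function starts at a sentence beginning with 'How to',
    and timestamps[i] = (start_sec, end_sec) of script_sentences[i].
    """
    functions, para, start = [], [], None
    for sent, (s, e) in zip(script_sentences, timestamps):
        if sent.lower().startswith("how to") and para:
            # Close the previous function: its para plus the aligned clip.
            functions.append((" ".join(para),
                              video_frames[int(start * fps):int(prev_end * fps)]))
            para, start = [], None
        if start is None:
            start = s
        para.append(sent)
        prev_end = e
    if para:  # flush the last function
        functions.append((" ".join(para),
                          video_frames[int(start * fps):int(prev_end * fps)]))
    return functions  # list of (function-para, function-clip)
```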
We further match the specific question $Q$ with the function set based on text similarity. Since the dataset is small and the corresponding functions are not highly semantically similar to the given question, we calculate the similarity score between $Q$ and each $p_n$ via a TF-IDF model [19] instead of deep-learning-based methods. The ablation results in Sec. 4.2 indicate that the traditional statistics-based TF-IDF method performs much better than the deep-learning-based approach.
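A minimal sketch of this grounding step using scikit-learn’s TfidfVectorizer; the exact preprocessing and weighting scheme in our released code may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ground_question(question, function_paras):
    """Return a similarity weight for each function-para w.r.t. the question."""
    vectorizer = TfidfVectorizer(stop_words="english")
    para_matrix = vectorizer.fit_transform(function_paras)  # fit on the function-paras
    question_vec = vectorizer.transform([question])
    weights = cosine_similarity(question_vec, para_matrix)[0]
    return weights.tolist()  # used to weight the function set F
```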
3.3 Function2Answer Module
After determining the related functions with the former module, we now turn to formulating the multi-step QA as a classification task. Since the candidate answers are given as textual action descriptions paired with visual button images, we need to predict the correct action at each step, as well as the corresponding button, according to the historical steps.
Input Features. For the text embeddings, we encode the function-para, question, and candidate answers into $\mathbf{p}_n$, $\mathbf{q}$, and $\mathbf{a}$ via XL-Net [24], which performs much better than the BERT [9] backbone (see Tab. 1), since XL-Net is good at processing long contexts. For the visual part, we encode the frames of the function-clip and the button images of the candidate answers into $\mathbf{v}_n$ and $\mathbf{b}$ via a Vision Transformer (ViT) [12], following [23].
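The snippet below sketches one way to obtain these embeddings with Hugging Face checkpoints. The specific checkpoints ("xlnet-base-cased", "google/vit-base-patch16-224-in21k") and the mean/CLS pooling choices are illustrative assumptions, not necessarily the exact configuration in our released code.

```python
import torch
from transformers import AutoTokenizer, XLNetModel, ViTFeatureExtractor, ViTModel

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")
text_encoder = XLNetModel.from_pretrained("xlnet-base-cased")
image_processor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
vision_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")


@torch.no_grad()
def encode_text(sentences):
    """Mean-pool XLNet token states into one vector per sentence."""
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = text_encoder(**inputs).last_hidden_state          # (B, L, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)                 # (B, 768)


@torch.no_grad()
def encode_images(pil_images):
    """Use the ViT pooled output as the frame / button embedding."""
    inputs = image_processor(images=pil_images, return_tensors="pt")
    return vision_encoder(**inputs).pooler_output               # (B, 768)
```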
Steps Network and Prediction Head. Following the baseline [23], we use a GRU [6] to leverage the historical steps and predict the final score for each answer via a two-layer MLP followed by a softmax activation.
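A minimal PyTorch sketch of this component; the hidden size and the way the per-step features are fused are our assumptions for illustration.

```python
import torch
import torch.nn as nn


class StepsNetwork(nn.Module):
    """GRU over per-step fused features plus a two-layer MLP scoring head."""

    def __init__(self, feat_dim, hidden_dim=512):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, step_feats):
        # step_feats: (num_candidates, num_steps_so_far, feat_dim); one sequence per
        # candidate answer, where each step feature fuses the question, the weighted
        # function set, and that candidate's embedding.
        out, _ = self.gru(step_feats)
        logits = self.head(out[:, -1]).squeeze(-1)   # one logit per candidate
        return logits                                # softmax over candidates at inference
```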
Loss Function. Taking the embeddings of the question $\mathbf{q}$, the candidate answers $\mathbf{a}_i$, and the weighted function set $\hat{F}$ given by the former module as input, we train the multi-step QA classifier with the following objective:

$$\mathcal{L} = -\sum_{i} y_i \log \hat{y}_i \qquad (1)$$

where $\hat{y}_i = \mathrm{Pred\_head}\big(\mathrm{Steps\_network}(\mathbf{q}, \mathbf{a}_i, \hat{F})\big)$ and $y_i$ represents the ground truth.
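Assuming the steps network above outputs one logit per candidate answer, the per-step loss in Eq. (1) can be sketched as a standard cross-entropy; this is our reading of the classification formulation, not a verbatim excerpt of the released code.

```python
import torch
import torch.nn.functional as F


def step_loss(candidate_logits, gt_index):
    """Cross-entropy between the predicted candidate scores and the ground-truth
    answer index for one step, matching the classification view in Eq. (1)."""
    return F.cross_entropy(candidate_logits.unsqueeze(0),   # (1, num_candidates)
                           torch.tensor([gt_index]))
```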
3.4 Other Attempts
Considering the changing viewpoints within the same instructional video and the occlusion of buttons, it is challenging to link a button’s image to its function. Therefore, we tried using finger detection [1] in the video as additional information in the Function2Answer Module (see details in our project’s repo). However, due to its strong assumptions and the cumulative error introduced by the detection module, the performance of this purely inference-based method is poor, as shown in Tab. 2.
4 Experiments
4.1 Dataset and Parameters setting
Following [23], we trained our model on the training set containing 80 instructional videos and evaluated the model performance on the testing set which contains 20 instructional videos. We use the same evaluation metrics in [23], i.e., Recall@k, Mean rank (MR) and Mean reciprocal rank (MRR).
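For reference, the textbook definitions of these metrics can be computed as below; the challenge’s official scoring script may aggregate or scale them differently.

```python
def evaluate(all_rankings, all_gt):
    """all_rankings[i]: candidate indices sorted by predicted score (best first);
    all_gt[i]: index of the correct candidate for the i-th question step."""
    ranks = [ranking.index(gt) + 1 for ranking, gt in zip(all_rankings, all_gt)]
    n = len(ranks)
    return {
        "R@1": 100.0 * sum(r <= 1 for r in ranks) / n,   # recall within top-1
        "R@3": 100.0 * sum(r <= 3 for r in ranks) / n,   # recall within top-3
        "MR": sum(ranks) / n,                            # mean rank (lower is better)
        "MRR": sum(1.0 / r for r in ranks) / n,          # mean reciprocal rank
    }
```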
For a fair comparison, we train our model and all ablation variants under the same parameter settings (Adam optimizer with the same learning rate, a maximum of 100 training epochs, and a batch size of 16) and report the evaluation metrics of the best epoch on the testing set.
4.2 Ablation Study
In this section, we compare the performance of possible solutions in each module separately.
Table 1: Ablation on segment methods and grounding approaches in the Question2Function Module.

| Segment Method | Grounding Approach | R@1 | R@3 | MR | MRR |
| --- | --- | --- | --- | --- | --- |
| Baseline (raw settings) | cross-att. | 30.2 | 62.3 | 3.2 | 3.2 |
| Sentence-centric | cross-att. | 39.3 | 70.2 | 2.8 | 3.5 |
| Sentence-centric | TF-IDF | 44.4 | 75.0 | 2.7 | 3.8 |
| Function-centric | cross-att. | 38.5 | 74.2 | 2.9 | 3.5 |
| Function-centric | TF-IDF | 45.2 | 75.4 | 2.6 | 3.9 |
Specifically, in the Question2Function Module, we evaluate the different segment methods (sentence-centric vs. function-centric) as well as the grounding approaches (cross-attention vs. TF-IDF), as shown in Tab. 1. Compared with the baseline under its raw settings, we observe that adjusting the text encoder (BERT → XL-Net) and the optimizer (SGD → Adam) has a significant impact on performance (30.2 → 39.3). Meanwhile, the function-centric segment approach performs better than the sentence-centric one on most metrics, which demonstrates the importance of step completeness. For the grounding approach, the TF-IDF model performs much better than the cross-attention mechanism (39.3 → 44.4 with sentence-centric and 38.5 → 45.2 with function-centric segmentation). We also show a case comparing the two grounding approaches (see Fig. 2), which reveals that the TF-IDF model can find the related functions effectively.

Figure 2: A grounding case for the two approaches.
Question: How to defrost 2kg of fish?
Function-paras:
- How to defrost 1kg of fish? Press turbo defrost button. Turn the time knob clockwise to 1 kg. Press the start button.
- How to microwave eggplant at medium high power for 30 seconds? Press micropower button twice. Turn time knob clockwise to 30 seconds. Press the start button.
- How to set microwave to 1 minute timer? Turn time knob clockwise to 1 minute. Press the start button.
- How to stop microwave in the middle of use? Press the sensor reheat button. Press the start button. To stop, press the stop button. To resume, press the start button again.
- How to start, stop, start and stop microwave? Press the sensor reheat button. Press the start button. To stop, press the stop button. To resume, press the start button again. To stop, press the stop button.
For the Function2Answer Module, we evaluate the impact of the visual and textual parts of the functions and answers via extensive ablation studies (Tab. 2). As we can see, the visual guidance of the function helps the model choose the correct answer significantly (41.0 → 45.2). However, including the visual part of the answers brings only small gains (43.3 → 45.2), which reveals that linking a button image to its function should be further investigated.
Table 2: Ablation on the visual (V) and textual (T) parts of the functions and answers in the Function2Answer Module, together with other attempts.

| Functions (V / T) | Answers (V / T) | R@1 | R@3 | MR | MRR |
| --- | --- | --- | --- | --- | --- |
| ✓ / ✓ | ✗ / ✓ | 43.3 | 76.6 | 2.6 | 3.8 |
| ✗ / ✓ | ✓ / ✓ | 41.0 | 71.8 | 2.8 | 3.7 |
| ✓ / ✓ | ✓ / ✓ | 45.2 | 75.4 | 2.6 | 3.9 |
| Finger detection [1] | | 23.8 | 66.8 | 3.1 | 3.0 |
| Submitted version on CodaLab (https://codalab.lisn.upsaclay.fr/competitions/4642#results) | | 41.0 | 72.0 | 2.8 | 3.7 |
5 Conclusion
In this paper, we introduced a two-stage method to solve the novel AQTC task. Our model achieved a significant performance boost compared to the given baseline. In future work, we will explore other solutions for this interesting task.
6 Acknowledgement
This work was supported by the grants from the National Natural Science Foundation of China (No.62072423).
References
- [1] Mohammad Mahmudul Alam, Mohammad Tariqul Islam, and SM Mahbubur Rahman. Unified learning approach for egocentric hand gesture recognition and fingertip detection. Pattern Recognition, 121:108200, 2022.
- [2] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 6077–6086. Computer Vision Foundation / IEEE Computer Society, 2018.
- [3] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
- [4] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: visual question answering. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2425–2433. IEEE Computer Society, 2015.
- [5] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
- [6] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555, 2014.
- [7] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335, 2017.
- [8] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407, 1990.
- [9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics, 2019.
- [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- [11] Lee R Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
- [12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021.
- [13] Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. Multimodal compact bilinear pooling for visual question answering and visual grounding. In Jian Su, Xavier Carreras, and Kevin Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 457–468. The Association for Computational Linguistics, 2016.
- [14] Robert W Irving and Campbell B Fraser. Two algorithms for the longest common subsequence of three (or more) strings. In Annual Symposium on Combinatorial Pattern Matching, pages 214–229. Springer, 1992.
- [15] Paul Jaccard. The distribution of the flora in the alpine zone. 1. New phytologist, 11(2):37–50, 1912.
- [16] Jin-Hwa Kim, Jaehyun Jun, and Byoung-Tak Zhang. Bilinear attention networks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 1571–1581, 2018.
- [17] Jie Lei, Licheng Yu, Mohit Bansal, and Tamara L. Berg. TVQA: localized, compositional video question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 1369–1379. Association for Computational Linguistics, 2018.
- [18] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- [19] Juan Ramos et al. Using tf-idf to determine word relevance in document queries. In Proceedings of the first instructional conference on machine learning, volume 242, pages 29–48. Citeseer, 2003.
- [20] Stephen E Robertson and Steve Walker. Some simple effective approximations to the 2-poisson model for probabilistic weighted retrieval. In SIGIR’94, pages 232–241. Springer, 1994.
- [21] William E Winkler. String comparator metrics and enhanced decision rules in the fellegi-sunter model of record linkage. 1990.
- [22] Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, and Mike Zheng Shou. Assistq: Affordance-centric question-driven task completion for egocentric assistant. arXiv preprint arXiv:2203.04203, 2022.
- [23] Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, and Mike Zheng Shou. Assistq: Affordance-centric question-driven task completion for egocentric assistant. CoRR, abs/2203.04203, 2022.
- [24] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. Xlnet: Generalized autoregressive pretraining for language understanding. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages 5754–5764, 2019.
- [25] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alexander J. Smola. Stacked attention networks for image question answering. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 21–29. IEEE Computer Society, 2016.
- [26] Zhou Yu, Jun Yu, Jianping Fan, and Dacheng Tao. Multi-modal factorized bilinear pooling with co-attention learning for visual question answering. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1839–1848. IEEE Computer Society, 2017.