Further author information: (Send correspondence to H. Xie)
J.Weng: E-mail: [email protected]
C. Zhang: E-mail: [email protected]
X. Yang: E-mail: [email protected]
H. Xie: E-mail: [email protected], Telephone: +81-0761-51-1753
Hierarchical Visual Interface for Educational Video Retrieval and Summarization
Abstract
With the emergence of large-scale open online courses and online academic conferences, it has become increasingly feasible and convenient to access online educational resources. However, it is time-consuming and challenging for common users to effectively retrieve and browse numerous lecture videos. In this work, we propose a hierarchical visual interface for retrieving and summarizing lecture videos. Users can explore the desired video information through summaries generated at different layers of the interface. Retrieval results for the input keywords are presented across three layers: a video layer with timestamps, a frame layer with slides, and a poster layer that summarizes the lecture video. We verified the proposed interface in a user study that compared it with conventional interfaces. The results confirmed that the proposed interface achieves high retrieval accuracy and a good user experience.
Keywords: Educational video, slide-based video, user interface, information retrieval
1 INTRODUCTION
Open online courses and online academic conferences play an important role in addressing the global pandemic. There is a growing trend of courses and academic conferences being held online with publicly available video recordings. However, a major challenge raised by the emergence of large-scale online resources is how to effectively retrieve and present the desired video contents from numerous resources for end users. While conventional interfaces can help navigate educational videos, provide content-based lecture video retrieval, and use data-driven interaction techniques to improve navigation [1, 2], it remains difficult for common users to explore videos with satisfactory interaction and summarization. For example, when users search for a specific technique in an academic conference, they will likely get several results after typing a keyword. Current video retrieval systems do not provide satisfactory interaction and summarization functions to save users' time.
In this paper, we propose a hierarchical visual interface for lecture videos and academic presentations with video retrieval and summarization, as shown in Figure 1. The proposed system utilizes a slide detection algorithm to extract slides from the video and adopts a state-of-the-art image analysis approach [3] to extract visual contents from the slides. The hierarchical visual interface consists of a video layer that contains the video and groups of slides with timestamps, a frame layer that contains the speech content and visual content of one selected slide, and a poster layer that presents the summarized video in a poster style. This interface can fulfill diverse usage scenarios for different users. In the first scenario, when users want to see an overview of a specific video, they can enter the poster layer. In the second scenario, when users only want to view a specific part of the video, they can click the title in the poster layer to be redirected to the corresponding video layer. In the third scenario, when users want to view the textual matter of the speech content or the slide rather than the video, they can enter the frame layer. Finally, we verify that the proposed visual interface can help users retrieve lecture videos more conveniently and efficiently.
2 RELATED WORK
Given that academic conference presentations and lectures are increasingly being made available on YouTube and Massive Open Online Course (MOOC) websites, retrieving information from video content has become particularly common and plays an important role in personal study. One approach obtains metadata from the video and automatically estimates the places on the slide that the speaker is explaining [4]. DynamicSlide designed an interaction interface that uses the speech content in a lecture video to show where the lecturer is explaining on the slide and helps the user take notes [5]. Although these user interfaces may help users with online lecture learning, they still make it difficult to retrieve information conveniently.
A data-driven interaction technique was proposed for navigating educational videos [2]. It improves the current video user interface by introducing a timeline with interaction peaks that show points of high user activity and personal watching traces. A voice navigation method for how-to videos was also proposed [6], allowing users to search for content by saying a keyword; the system provides recommendation suggestions for each interaction scenario. With such interaction techniques, users can obtain results directly from the video metadata rather than by manually searching for a keyword. MMToC parses the content of lecture videos and indexes the topics in the video so that users can retrieve topics of interest conveniently and efficiently [7]. TalkMiner is a search engine for slide-based videos in which users can search for keywords in the video title, the speech content, or the text in the slides [8]. A visual interface for lecture videos was also proposed that extracts various semantic clues for indexing video content and providing visual assistance, based on the visual components of presentation slides, textual and mathematical phrases, speech content, and mouse and cursor pointing motions acquired throughout the presentation [9]. There are also several non-linear video search techniques; ViZig [1] identifies slide content and uses anchor points to represent and visualize it in the video timeline. Hierarchical visual interfaces have been proposed for drawing assistance [10], motion editing [11], and document retrieval [12]. In this work, we aim to propose a hierarchical user interface for video retrieval and summarization.
3 SYSTEM OVERVIEW
In this work, we propose a hierarchical visual interface for lecture video retrieval and summarization. Before designing the user interface, we conducted a preliminary study on conventional video retrieval interfaces.
3.1 Preliminary Study
We first conducted a preliminary investigation using a 7-point Likert scale questionnaire (1 for strongly disagree and 7 for strongly agree) with 15 participants (college students around 20 years old, 12 males and 3 females) and the following three questions.
Q 1. Are you satisfied with the UI when you see the slide-based video on websites like YouTube?
Q 2. Do you think you are spending too much time searching for educational videos?
Q 3. If there is a better UI for retrieving and skimming educational videos, are you willing to try it?
We found that most participants felt that searching for educational videos is time-consuming and were not satisfied with the current interfaces. As shown in Fig. 2(a), most participants are interested in trying a new interface if it can help them retrieve the desired information more conveniently and efficiently. To address these issues, we designed a hierarchical visual interface with three content layers.
3.2 System Framework
In this work, the proposed architecture consists of three parts: an offline computation process for video pre-processing, the construction of a video database, and a hierarchical user interface for educational videos, as shown in Fig. 2(b). First, we extract slides, speech contents, and outlines from the educational video resources and store them in the database. The proposed interface consists of three layers: a video layer that contains the video and groups of slides with timestamps, a frame layer that contains the speech content and visual content of one selected slide, and a poster layer that presents the summarized video in a poster style.
When a user browses a video, the client will submit an HTTP request. The back-end then processes the request and returns JSON data to the client in the HTTP response. The front-end will then render the JSON data and generate a web page with the hierarchical visual interface.
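For illustration, the following minimal sketch shows how such a request could be served, assuming a Flask back-end and a simple directory of pre-processed JSON files; the route, file layout, and helper names are assumptions for illustration, not the system's actual implementation.

```python
# A minimal sketch of the back-end request handling, assuming a Flask server and a
# directory of pre-processed JSON files; names and paths are illustrative only.
import json
from pathlib import Path

from flask import Flask, abort, jsonify

app = Flask(__name__)
DATA_DIR = Path("database")  # assumed location of the pre-processed JSON files


def load_video_json(video_id: str) -> dict:
    """Load the slides, speech contents, and outline extracted for one video."""
    path = DATA_DIR / f"{video_id}.json"
    if not path.exists():
        abort(404)  # unknown video
    return json.loads(path.read_text(encoding="utf-8"))


@app.route("/api/videos/<video_id>")
def get_video(video_id: str):
    # The front-end renders the poster, video, and frame layers from this JSON.
    return jsonify(load_video_json(video_id))
```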
3.3 Video Pre-Processing
Video pre-processing plays a critical role in this system, as it is the basic stage for listing and summarizing the video content, as shown in Fig. 2. Given an educational video, we first design an algorithm that compares the image hash of consecutive frames in the video. If the difference between two frames is larger than a threshold, the slide is considered to have changed; the slide is then extracted and named with its timestamp in the video. During extraction, we use layoutparser [3], a tool that provides a set of layout data structures with carefully designed APIs optimized for document image analysis tasks (selecting layout/textual elements in the document, performing optical character recognition (OCR) for each detected layout region, visualizing the detected layouts, and loading layout data stored in JSON, CSV, and PDF), to extract titles, figures, and tables from the extracted slides. Layoutparser enables the extraction of complicated document structures with only a few lines of code, with the help of state-of-the-art deep learning models. The extracted content is stored as a JSON file in which the "key" is the title and the "value" is the stored path of the other content.

We then download the subtitles of the corresponding video; if the video does not have subtitles, we use a state-of-the-art automatic speech recognition (ASR) service (https://cloud.google.com/speech-to-text) to extract the speech content. After that, we compare the timestamps of the speech content and the slides and put the corresponding text into the JSON file. Upon completion of pre-processing, all JSON files are stored in the database and used to render the hierarchical visual interface. The title and its corresponding speech content, figures, and tables render the poster layer page; the slides and subtitles render the frame layer page and the video layer page.
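As an illustration of the slide-change detection step described at the beginning of this subsection, the sketch below compares perceptual hashes of sampled frames; it assumes OpenCV, Pillow, and the imagehash package, and the sampling rate and threshold are illustrative values rather than the exact ones used in the system.

```python
# A sketch of slide-change detection via frame hashing, assuming OpenCV, Pillow,
# and the `imagehash` package; sampling rate and threshold are assumed values.
import cv2
import imagehash
from PIL import Image

HASH_THRESHOLD = 10     # Hamming distance between perceptual hashes (assumed)
SAMPLE_EVERY_SEC = 1.0  # how often to sample a frame (assumed)


def extract_slides(video_path: str):
    """Yield (timestamp_in_seconds, frame) whenever the slide appears to change."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(fps * SAMPLE_EVERY_SEC), 1)
    prev_hash, index = None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            cur_hash = imagehash.phash(Image.fromarray(rgb))
            # A large hash difference suggests the slide has changed.
            if prev_hash is None or cur_hash - prev_hash > HASH_THRESHOLD:
                yield index / fps, frame  # the timestamp names the extracted slide
            prev_hash = cur_hash
        index += 1
    cap.release()
```

Each extracted slide would then be passed to layoutparser's layout detection and OCR functions to recover titles, figures, and tables, which are written into the title-keyed JSON file described above.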
3.4 Hierarchical User Interface
The hierarchical visual interface can help users retrieve a video conveniently and efficiently. When users search for a keyword, the system will retrieve the video title and its corresponding speech content from the database. The contents that meet the requirements will be rendered in the web page, as shown in Fig. 1. When users browse a specific video, they will enter the hierarchical visual interface. The first page presented to the user is the poster layer.
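As an illustration of the keyword search step described above, the following minimal sketch queries a SQLite table holding the video titles and speech content; the table schema is an assumption for illustration only.

```python
# A minimal sketch of keyword retrieval, assuming the pre-processed data is
# indexed in a SQLite table `videos(id, title, speech_text)`; the schema is an
# illustrative assumption.
import sqlite3


def search_videos(db_path: str, keyword: str):
    """Return (id, title) pairs whose title or speech content contains the keyword."""
    pattern = f"%{keyword}%"
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            "SELECT id, title FROM videos WHERE title LIKE ? OR speech_text LIKE ?",
            (pattern, pattern),
        ).fetchall()
```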
In our survey of related work, we found that users tend to rely on the textual content or the images in slides when they search for a specific topic in a video. Therefore, we design a poster layer that presents the chapters of the video, showing the figures in the slides and the speech content of the video in a poster format. This design matches the browsing habits mentioned above. In case users want to see the details of a point they are interested in, we bind the timestamp to the HTML data attribute of the title, figure, and speech content. When users click these elements (see red boxes in Fig. 1), the website redirects them to the corresponding time of the video in the video layer.
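One way to realize this binding is to emit the timestamp as an HTML data attribute when the poster layer is rendered; the sketch below is a minimal illustration, and the markup, class names, and function name are assumptions rather than the actual templates used in the system.

```python
# A sketch of binding a slide timestamp to poster-layer elements through an HTML
# data attribute; the markup and class names are illustrative assumptions.
from html import escape


def render_poster_item(title: str, figure_path: str, speech: str, timestamp: float) -> str:
    """Render one poster-layer chapter; clicking it seeks the video layer to `timestamp`."""
    return (
        f'<div class="poster-item" data-timestamp="{timestamp:.1f}">'
        f'<h3>{escape(title)}</h3>'
        f'<img src="{escape(figure_path)}" alt="slide figure">'
        f'<p>{escape(speech)}</p>'
        f'</div>'
    )
```

On the client side, a click handler would read the data-timestamp attribute and seek the video player in the video layer to that position.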
The left part of the video layer is a video player. The right part shows a group of slides for the video, and users can adjust the number of slides shown. During video playback, the slides on the right change along with the video. To enable users to fast-forward the video to the part they are interested in watching, we use the same design mentioned above: users can jump to the corresponding part of the video by clicking on a slide.
The frame layer fulfills the usage scenario in which users want to view the textual matter of the speech content or the slide rather than the video. Users can enter the frame layer and view the textual content of the speech and the slide by clicking the detail button on a slide. Users can also be redirected to the other two layers by clicking the corresponding buttons in the frame layer.
4 USER STUDY
After the preliminary study, we conducted an experiment to compare conventional retrieval interfaces with our interface. We invited 15 participants (college students around 20 years old, 12 males and 3 females) divided into 3 groups. After a brief introduction of how to use our interface, we let each group use one of the following interfaces to retrieve five given keywords within a limited time: YouTube style (video only), Coursera style (see Fig. 3, video and subtitles), and our interface. For example, "Find a video talking about 'Image Synthesis'." None of the participants had seen these videos before, and the duration of each video is around 5 minutes.
After the experiment, we asked the participants about their strategies for retrieving the videos. We found that both our interface and the Coursera style provide indices that help users locate the required information more accurately. However, unlike the Coursera style, our interface has more than one index, each of which can navigate to the others, making the retrieval process more convenient. As shown in Fig. 3, our interface achieved the highest accuracy. The accuracy value is calculated by averaging the scores obtained with each interface.
4.1 User Experience
In the second experiment, we conducted a user study to verify the user experience by asking participants to complete multiple video retrieval tasks. A general evaluation consisting of a visual search task, a problem search task, and a summarization task was used to test the participants. The visual search task requires participants to find specific content in the video, for example, "Find a slide where the author talks about the framework of this research." The problem search task requires participants to answer a question relevant to the video, for example, "What is the input of the Partial-editing layer?" The summarization task requires participants to summarize the key points of the video. This time, we selected 8 participants from the previous experiment. After the experiment, they completed a questionnaire about their experience, using a 7-point Likert scale (1 for strongly disagree and 7 for strongly agree).
Tab. 1 shows the results of the experiment. Overall, our interface shows better performance. In terms of ease of use, participants were quite satisfied with our system. Regarding the usefulness of the proposed three-layer user interface, some participants felt that the frame layer and poster layer helped them retrieve information better than the video layer. In general, the results of our user study verified that the proposed interface achieves high retrieval accuracy and a good user experience.
5 CONCLUSION
This paper presented a hierarchical visual interface for retrieving and summarizing educational lecture videos. We designed a three-layer user interface to fulfill users' searching intents in various usage scenarios. We then conducted a user study to verify our interface, which demonstrated the usefulness of the three-layer design for effective video navigation. It was shown to be more efficient than the outline-based video interfaces offered by Massive Open Online Course (MOOC) websites and YouTube. Based on the feedback from our user study, this interface is expected to improve video summarization through the poster layer for long videos (around one hour). In future work, it is also promising to improve the usability of the retrieval interface to fulfill users' various retrieval intents.
Acknowledgment
We thank all participants in our user study. This work was supported by JAIST Research Grants, and JSPS KAKENHI Grant 20K19845, Japan.
References
- [1] Yadav, K., Gandhi, A., Biswas, A., Shrivastava, K., Srivastava, S., and Deshmukh, O., “Vizig: Anchor points based non-linear navigation and summarization in educational videos,” in [Proceedings of the 21st International Conference on Intelligent User Interfaces], 407–418 (2016).
- [2] Kim, J., Guo, P. J., Cai, C. J., Li, S.-W., Gajos, K. Z., and Miller, R. C., “Data-driven interaction techniques for improving navigation of educational videos,” in [Proceedings of the 27th Annual ACM Symposium on User Interface Software and Technology], 563–572 (2014).
- [3] Shen, Z., Zhang, R., Dell, M., Lee, B. C. G., Carlson, J., and Li, W., “Layoutparser: A unified toolkit for deep learning based document image analysis,” arXiv preprint arXiv:2103.15348 (2021).
- [4] Tsujimura, S., Yamamoto, K., and Nakagawa, S., “Automatic explanation spot estimation method targeted at text and figures in lecture slides,” in [Proc. Interspeech 2017], 2764–2768 (2017).
- [5] Jung, H., Shin, H. V., and Kim, J., “Dynamicslide: Reference-based interaction techniques for slide-based lecture videos,” in [The 31st Annual ACM Symposium on User Interface Software and Technology Adjunct Proceedings], 23–25 (2018).
- [6] Chang, M., Huh, M., and Kim, J., “Rubyslippers: Supporting content-based voice navigation for how-to videos,” in [Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems], 1–14 (2021).
- [7] Biswas, A., Gandhi, A., and Deshmukh, O., “Mmtoc: A multimodal method for table of content creation in educational videos,” in [Proceedings of the 23rd ACM International Conference on Multimedia], 621–630 (2015).
- [8] Adcock, J., Cooper, M., Denoue, L., Pirsiavash, H., and Rowe, L. A., “Talkminer: A lecture webcast search engine,” in [Proceedings of the 18th ACM International Conference on Multimedia], 241–250 (2010).
- [9] Zhao, B., Xu, S., Lin, S., Wang, R., and Luo, X., “A new visual interface for searching and navigating slide-based lecture videos,” in [2019 IEEE International Conference on Multimedia and Expo (ICME)], 928–933, IEEE (2019).
- [10] Huang, Z., Peng, Y., Hibino, T., Zhao, C., Xie, H., Fukusato, T., and Miyata, K., “dualface: Two-stage drawing guidance for freehand portrait sketching,” Computational Visual Media 8, 63–77 (2022).
- [11] Peng, Y., Zhao, C., Huang, Z., Fukusato, T., Xie, H., and Miyata, K., “Two-stage motion editing interface for character animation,” in [The ACM SIGGRAPH / Eurographics Symposium on Computer Animation], SCA ’21, Association for Computing Machinery, New York, NY, USA (2021).
- [12] Matejka, J., Grossman, T., and Fitzmaurice, G., “Paper forager: Supporting the rapid exploration of research document collections,” in [Proceedings of Graphics Interface 2021], GI 2021, 237–245, Canadian Information Processing Society (2021).