
HYouTube: Video Harmonization Dataset

Xinyuan Lu, Shengyuan Huang, Li Niu, Wenyan Cong, Liqing Zhang
Shanghai Jiao Tong University
Abstract

Video composition aims to generate a composite video by combining the foreground of one video with the background of another video, but the inserted foreground may be incompatible with the background in terms of color and illumination. Video harmonization aims to adjust the foreground of a composite video to make it compatible with the background. So far, video harmonization has only received limited attention and there is no public dataset for video harmonization. In this work, we construct a new video harmonization dataset HYouTube by adjusting the foreground of real videos to create synthetic composite videos. Considering the domain gap between real composite videos and synthetic composite videos, we additionally create 100 real composite videos via copy-and-paste. Datasets are available at https://github.com/bcmi/Video-Harmonization-Dataset-HYouTube.

1 Introduction

Image or video composition is a common operation for creating visual content. Given two different videos, video composition aims to generate a composite video by combining the foreground of one video with the background of another video. However, composite videos are usually not realistic enough due to the appearance (e.g., illumination, color) incompatibility between foreground and background, which is caused by the distinct capture conditions (e.g., season, weather, time of day) of the foreground and background [4, 3]. To address this issue, video harmonization [7] has been proposed to adjust the foreground appearance to make it compatible with the background, resulting in a more realistic composite video.

Figure 1: Illustration of video harmonization task (blue arrows) and dataset construction process (red arrows).

As a closely related task, image harmonization has attracted growing research interest. Traditional image harmonization methods [2, 9, 15, 18] relied on hand-crafted appearance transforms between foreground and background, but they neglected the high-level appearance gap between foreground and background. Recently, several deep learning based image harmonization methods [4, 5, 6, 3, 13, 10] have been proposed, which adjust the foreground style to be harmonious with the background.

Although deep image harmonization methods have achieved remarkable success, directly applying them to video harmonization by harmonizing each frame separately causes flickering artifacts [7], which largely degrade the harmonization quality. Thus, it is imperative to design a deep video harmonization method that takes temporal consistency into consideration. To the best of our knowledge, the only deep video harmonization method is [7], which proposed an end-to-end network to harmonize the composite frames while considering the temporal consistency between adjacent frames.

Training a deep video harmonization network requires abundant pairs of composite videos and their ground-truth harmonized videos, but manually editing composite videos to obtain their harmonized counterparts is extremely tedious and expensive. Therefore, [7] adopted an inverse approach, that is, creating synthetic composite videos from real images. In particular, they applied the traditional color transfer method [12] to the foreground of a real image to make it incompatible with the background, leading to a synthetic composite image. Then, they applied affine transformations to the foreground and background to simulate the motion between adjacent frames, through which a synthetic composite video (resp., ground-truth video) is created based on the synthetic composite image (resp., real image). Nevertheless, there is a huge gap between such simulated movement and the complex movement in realistic videos. Moreover, the dataset constructed in [7] is not publicly available. Different from [7], we create synthetic composite videos based on real videos without sacrificing realistic motion. Specifically, we apply color transfer based on lookup tables (LUTs) [8] to the foregrounds of all frames. We construct our video harmonization dataset, named HYouTube, based on YouTube-VOS 2018 [17], leading to 3194 pairs of synthetic composite videos and real videos, which will be detailed in Section 2.

2 Dataset Construction

In this section, we will describe the process of constructing our dataset HYouTube based on the large-scale video object segmentation dataset YouTube-VOS 2018 [17]. Given real videos with object masks, we first select the videos which meet our requirements and then adjust the foregrounds of these videos to produce synthetic composite videos.

2.1 Real Video Selection

Since constructing a video harmonization dataset requires foreground masks and the cost of annotating foreground masks is very high, we build our dataset upon the existing large-scale video object segmentation dataset YouTube-VOS 2018 [17]. YouTube-VOS contains 4453 YouTube video clips, and each video clip is annotated with object masks for one or multiple objects. The masks are annotated at 6 frames per second, and we only utilize these annotated frames. Then, for each annotated foreground object in each video clip, if there exist more than 20 consecutive frames containing this foreground object, we save the first 20 consecutive frames together with the corresponding 20 foreground masks as one video sample. After that, we remove the video samples whose foreground ratio (the area of the foreground over the area of the whole frame) is smaller than 1% to ensure that the foreground area is in a reasonable range. After the above filtering steps, 3194 video samples are left.
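To make the selection procedure concrete, below is a minimal Python sketch of the filtering logic described above. The data layout (per-object mask lists aligned with the annotated frames) and the choice of averaging the foreground ratio over the 20 frames are our illustrative assumptions, not part of the released pipeline.

```python
import numpy as np

def select_video_sample(annotated_frames, object_masks,
                        num_frames=20, min_fg_ratio=0.01):
    """Select a 20-frame sample for one object, following the filtering rules.

    annotated_frames: list of H x W x 3 arrays annotated at 6 fps.
    object_masks: list of binary H x W arrays for one object, aligned with
                  annotated_frames (None if the object is absent in a frame).
    Returns (frames, masks) for the first qualifying sample, or None.
    """
    # Find the first run of `num_frames` consecutive frames containing the object.
    run_start, run_len = None, 0
    for i, mask in enumerate(object_masks):
        if mask is not None and mask.any():
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len == num_frames:
                break
        else:
            run_start, run_len = None, 0
    if run_len < num_frames:
        return None  # fewer than 20 consecutive frames contain this object

    frames = annotated_frames[run_start:run_start + num_frames]
    masks = object_masks[run_start:run_start + num_frames]

    # Discard samples whose (average) foreground ratio is below 1%.
    fg_ratio = np.mean([m.sum() / m.size for m in masks])
    if fg_ratio < min_fg_ratio:
        return None
    return frames, masks
```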

2.2 Composite Video Generation

Based on real video samples, we adjust the appearance of their foregrounds to make them incompatible with the backgrounds, producing synthetic composite videos. We have tried different color transfer methods [15, 18, 9, 20] following [4, 16, 7] and 3D color lookup tables (LUTs) [11, 1] following [8] to adjust the foreground appearance. The color transfer methods [15, 18, 9, 20] need a reference image and adjust the source image appearance based on the reference image appearance, while a LUT [11, 1] realizes color mapping through a simple array indexing operation. We observe that applying [15, 18, 9, 20] requires carefully picking reference images, otherwise the transferred foreground may have obvious artifacts or look unrealistic. Thus, we employ LUTs to adjust the foreground appearance for convenience. Since one LUT corresponds to one type of color transfer, we can ensure the diversity of the composite videos by applying different LUTs to video samples. Firstly, we collect more than 400 LUTs from the Internet. Secondly, we calculate their pairwise differences. Specifically, we sample 1000 real video frames. For each real frame, we apply all the collected LUTs to transfer its foreground, obtaining a set of composite frames, and calculate the fMSE (i.e., the MSE computed within the foreground region) between every two composite frames as the pairwise difference between the two corresponding LUTs. We average the pairwise differences over all 1000 real video frames as the final pairwise difference between two LUTs. Finally, we select 100 mutually different LUTs in an iterative manner to enlarge the diversity of the synthesized composite videos: in each iteration, we find the two closest LUTs among the remaining LUTs and remove one of them, and this step is repeated until 100 LUTs are left. Thus, we obtain 100 candidate LUTs with large mutual differences, which are used to transfer the foregrounds of real video samples.
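For clarity, the following sketch illustrates how the pairwise LUT differences and the greedy selection could be computed. The function names (fmse, apply_lut) and the exact tie-breaking when removing one of the two closest LUTs are our assumptions for illustration.

```python
import numpy as np

def fmse(img_a, img_b, mask):
    """MSE restricted to the foreground region given by a binary mask."""
    fg = mask.astype(bool)
    return np.mean((img_a[fg].astype(np.float64) - img_b[fg].astype(np.float64)) ** 2)

def pairwise_lut_difference(luts, frames, masks, apply_lut):
    """Average fMSE between foregrounds transferred by every pair of LUTs.

    apply_lut(frame, lut) is assumed to return the LUT-transferred frame;
    comparing transferred frames on the foreground region is equivalent to
    comparing the corresponding composite frames there.
    """
    n = len(luts)
    diff = np.zeros((n, n))
    for frame, mask in zip(frames, masks):
        transferred = [apply_lut(frame, lut) for lut in luts]
        for i in range(n):
            for j in range(i + 1, n):
                d = fmse(transferred[i], transferred[j], mask)
                diff[i, j] += d
                diff[j, i] += d
    return diff / len(frames)

def select_diverse_luts(diff, keep=100):
    """Greedily drop one LUT from the closest pair until `keep` LUTs remain."""
    remaining = list(range(diff.shape[0]))
    while len(remaining) > keep:
        sub = diff[np.ix_(remaining, remaining)]  # distances among remaining LUTs
        np.fill_diagonal(sub, np.inf)
        _, j = np.unravel_index(np.argmin(sub), sub.shape)
        remaining.pop(j)  # remove one LUT of the closest pair
    return remaining
```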

The process of generating composite video samples is illustrated in Figure 1. Given a video sample, we first randomly select a LUT from the 100 candidate LUTs to transfer the foreground of each frame. A lookup table (LUT) records input colors and their corresponding output colors, so one LUT corresponds to one color mapping function $f$. LUTs have been applied in a variety of computer vision tasks. A 3D LUT is a lattice in RGB space, where each dimension corresponds to one color channel (e.g., red). It consists of $V=(B+1)^{3}$ entries obtained by uniformly discretizing the RGB color space, where $B$ is the number of bins in each dimension (we set $B=32$ following the convention in the image processing field). Each entry $v$ in the LUT has an indexing color $\mathbf{c}^{\prime}_{v}=(r^{\prime}_{v},g^{\prime}_{v},b^{\prime}_{v})$ and a corresponding output color $\tilde{\mathbf{c}}^{\prime}_{v}=(\tilde{r}^{\prime}_{v},\tilde{g}^{\prime}_{v},\tilde{b}^{\prime}_{v})$. The color transformation based on a LUT has two steps: lookup and trilinear interpolation. Specifically, given a color value, we first look up its eight nearest entries in the LUT, and then interpolate its transformed value from these eight entries via trilinear interpolation.
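A minimal sketch of this lookup-and-trilinear-interpolation step is given below, assuming the LUT is stored as a $(B+1)\times(B+1)\times(B+1)\times 3$ array whose entries hold output colors in [0, 255]; the array layout and function name are illustrative assumptions.

```python
import numpy as np

def apply_lut_trilinear(image, lut, bins=32):
    """Apply a 3D LUT to an RGB image via lookup and trilinear interpolation.

    image: H x W x 3 uint8 array.
    lut:   (bins+1, bins+1, bins+1, 3) float array; lut[r, g, b] is the output
           color (in [0, 255]) of the lattice point with indexing color
           255 * (r, g, b) / bins.
    """
    # Map pixel values to continuous lattice coordinates in [0, bins].
    coords = image.astype(np.float64) / 255.0 * bins
    lo = np.floor(coords).astype(int)
    hi = np.minimum(lo + 1, bins)
    frac = coords - lo  # fractional offset inside the surrounding cell

    r0, g0, b0 = lo[..., 0], lo[..., 1], lo[..., 2]
    r1, g1, b1 = hi[..., 0], hi[..., 1], hi[..., 2]
    fr, fg, fb = frac[..., 0:1], frac[..., 1:2], frac[..., 2:3]

    # Trilinear interpolation over the eight nearest lattice entries.
    out = (lut[r0, g0, b0] * (1 - fr) * (1 - fg) * (1 - fb) +
           lut[r1, g0, b0] * fr * (1 - fg) * (1 - fb) +
           lut[r0, g1, b0] * (1 - fr) * fg * (1 - fb) +
           lut[r0, g0, b1] * (1 - fr) * (1 - fg) * fb +
           lut[r1, g1, b0] * fr * fg * (1 - fb) +
           lut[r1, g0, b1] * fr * (1 - fg) * fb +
           lut[r0, g1, b1] * (1 - fr) * fg * fb +
           lut[r1, g1, b1] * fr * fg * fb)
    return np.clip(out, 0, 255).astype(np.uint8)
```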

The transferred foregrounds and the original backgrounds form the composite frames, and the composite frames form composite video samples. Following [4], we set several rules to filter out unqualified composite video samples: 1) the transferred foreground should be obviously incompatible with the background; 2) although the transferred foreground looks incompatible with the background, the transferred foreground itself should look realistic; 3) the albedo of the foreground should remain the same after color transfer. For example, transferring a red car to a blue car is not meaningful for image harmonization [4]. Given a real video sample, if the composite video sample obtained after applying one LUT does not satisfy the above criteria, we randomly choose another LUT and repeat the transfer process until the obtained composite video sample satisfies the above criteria.
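The compositing step itself is a simple masked combination of the transferred foreground and the original background; a short sketch under the assumption of binary 0/1 masks (and reusing the hypothetical apply_lut_trilinear above) is shown below.

```python
import numpy as np

def composite_frame(real_frame, transferred_frame, mask):
    """Combine the LUT-transferred foreground with the original background."""
    m = mask[..., None].astype(real_frame.dtype)  # binary H x W x 1 mask
    return transferred_frame * m + real_frame * (1 - m)

def composite_video(real_frames, masks, lut):
    """Assemble a composite video sample frame by frame with one LUT."""
    return [composite_frame(f, apply_lut_trilinear(f, lut), m)
            for f, m in zip(real_frames, masks)]
```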

We name our constructed video harmonization dataset HYouTube. The HYouTube dataset includes 3194 pairs of synthetic composite video samples and real video samples. Each video sample contains 20 consecutive frames with the foreground mask for each frame. The numbers of composite video samples created using different LUTs are shown in the left subfigure of Figure 2. We can see that all 100 LUTs have been used, but some LUTs are used more frequently because they are suitable for more real video samples. The average fMSE between composite video samples and ground-truth video samples for different LUTs is shown in the right subfigure of Figure 2. It can be seen that the average fMSE varies considerably across LUTs, which demonstrates the diversity of the selected LUTs to some extent. Finally, we show some example pairs of composite video samples and real video samples in our HYouTube dataset in Figure 3.

Figure 2: The left subfigure summarizes the numbers of composite video samples created using different LUTs. The right subfigure summarizes the average fMSE between composite video samples and ground-truth video samples for different LUTs.
Figure 3: Some example pairs of composite video samples and real video samples. Odd rows are composite video samples and even rows are real video samples. The foregrounds are highlighted with green outlines.

3 Real Composite Videos

We have created pairs of synthetic composite videos and ground-truth real videos in Section 2. However, synthetic composite videos may have a domain gap with real composite videos. To create real composite videos, we first collect 30 video foregrounds with masks from a video matting dataset [14] as well as 30 video backgrounds from the Vimeo-90k dataset [19] and the Internet. Then, we create composite videos via copy-and-paste and select 100 composite videos which look reasonable w.r.t. foreground placement but inharmonious w.r.t. color/illumination. We have released these 100 real composite videos for evaluation.

4 Conclusion

In this paper, we have constructed a new video harmonization dataset HYouTube, which consists of pairs of synthetic composite videos and ground-truth real videos. We have also released 100 real composite videos. The contributed datasets will facilitate future research in the field of video harmonization.

References

  • [1] F. Bo, F. Zhou, and H. Han. Medical image enhancement based on modified lut-mapping derivative and multi-scale layer contrast modification. IEEE, 2:696–703, 2011.
  • [2] Daniel Cohen-Or, Olga Sorkine, Ran Gal, Tommer Leyvand, and Ying-Qing Xu. Color harmonization. ACM Transactions on Graphics, 25(3):624–630, 2006.
  • [3] Wenyan Cong, Li Niu, Jianfu Zhang, Jing Liang, and Liqing Zhang. Bargainnet: Background-guided domain translation for image harmonization. In ICME, 2021.
  • [4] W. Cong, J. Zhang, L. Niu, L. Liu, and L. Zhang. Dovenet: Deep image harmonization via domain verification. In CVPR, 2020.
  • [5] X. Cun and C. M. Pun. Improving the harmony of the composite image by spatial-separated attention module. IEEE Transactions on Image Processing, PP(99):1–1, 2020.
  • [6] Zonghui Guo, Haiyong Zheng, Yufeng Jiang, Zhaorui Gu, and Bing Zheng. Intrinsic image harmonization. In CVPR, 2021.
  • [7] H. Z. Huang, S. Z. Xu, J. X. Cai, W. Liu, and S. M. Hu. Temporally coherent video harmonization using adversarial networks. IEEE Transactions on Image Processing, 29:214–224, 2019.
  • [8] Yifan Jiang, He Zhang, Jianming Zhang, Yilin Wang, Zhe Lin, Kalyan Sunkavalli, Simon Chen, Sohrab Amirghodsi, Sarah Kong, and Zhangyang Wang. Ssh: A self-supervised framework for image harmonization. arXiv, 2021.
  • [9] Jean-François Lalonde and A. A. Efros. Using color compatibility for assessing image realism. In ICCV, 2007.
  • [10] Jun Ling, Han Xue, Li Song, Rong Xie, and Xiao Gu. Region-aware adaptive instance normalization for image harmonization. In CVPR, 2021.
  • [11] Murat Mese and P. P. Vaidyanathan. Look-up table (LUT) method for inverse halftoning. IEEE Transactions on Image Processing, 10(10):1566–1578, 2001.
  • [12] E. Reinhard, M. Adhikhmin, B. Gooch, and P. Shirley. Color transfer between images. IEEE Computer Graphics and Applications, 21(5):34–41, 2001.
  • [13] K. Sofiiuk, P. Popenova, and A. Konushin. Foreground-aware semantic representations for image harmonization. arXiv, 2020.
  • [14] Yanan Sun, Guanzhi Wang, Qiao Gu, Chi-Keung Tang, and Yu-Wing Tai. Deep video matting via spatio-temporal alignment and aggregation. In CVPR, 2021.
  • [15] K. Sunkavalli, M. K. Johnson, W. Matusik, and H. Pfister. Multi-scale image harmonization. ACM Transactions on Graphics, 29(4):1–10, 2010.
  • [16] Y. H. Tsai, X. Shen, Z. Lin, K. Sunkavalli, X. Lu, and M. H. Yang. Deep image harmonization. In CVPR, 2017.
  • [17] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv, 2018.
  • [18] S. Xue, A. Agarwala, J. Dorsey, and H. Rushmeier. Understanding and improving the realism of image composites. ACM Transactions on Graphics, 2012.
  • [19] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 2019.
  • [20] Jun-Yan Zhu, Philipp Krahenbuhl, Eli Shechtman, and Alexei A Efros. Learning a discriminative model for the perception of realism in composite images. In ICCV, 2015.