
MDFlow: Unsupervised Optical Flow Learning by Reliable Mutual Knowledge Distillation

Lingtong Kong,    Jie Yang The authors are with the Institute of Image Processing and Pattern Recognition, Department of Automation, Shanghai Jiao Tong University, Shanghai 200240, China. (e-mail: [email protected], [email protected]).
Abstract

Recent works have shown that optical flow can be learned by deep networks from unlabelled image pairs based on the brightness constancy assumption and a smoothness prior. Current approaches additionally impose an augmentation regularization term for continual self-supervision, which has proved effective on difficult matching regions. However, this term also amplifies the mismatches that are inevitable in the unsupervised setting, blocking the learning process from reaching the optimal solution. To break the dilemma, we propose a novel mutual distillation framework that transfers reliable knowledge back and forth between a teacher and a student network for alternate improvement. Concretely, taking the estimation of an off-the-shelf unsupervised approach as pseudo labels, our key idea is to define a confidence selection mechanism that extracts relatively good matches, and then to apply diverse data augmentation so that adequate and reliable knowledge is distilled from teacher to student. Thanks to the decoupled nature of our method, we can choose a stronger student architecture for sufficient learning. Finally, the better student prediction is adopted to transfer knowledge back to the efficient teacher without additional cost in real deployment. Rather than formulating this step as a supervised task, we find that introducing an extra unsupervised term for multi-target learning achieves the best final results. Extensive experiments show that our approach, termed MDFlow, achieves state-of-the-art real-time accuracy and generalization ability on challenging benchmarks. Code is available at https://github.com/ltkong218/MDFlow.

Index Terms:
Optical flow, Unsupervised learning, Knowledge distillation, Real time

I Introduction

Optical flow estimation is a longstanding and fundamental task in computer vision, usually serving as a building block for a wide range of downstream video processing tasks, including video inpainting [1], frame interpolation [2, 3] and video stabilization [4]. Traditional approaches [5, 6, 7, 8] usually cast optical flow estimation as an energy optimization problem, which requires time-consuming iterations and a heavy computation burden, hindering them from real-time applications. Inspired by the great success of deep learning on various computer vision tasks, an increasing number of works [9, 10, 11, 12, 13, 14] concentrate on making architectural progress based on convolutional neural networks (CNNs), aiming at accuracy, generalization and efficiency. To achieve this goal, supervised learning methods rely on a massive amount of labeled data, which is usually synthesized by computer graphics engines [9, 15, 16], because of the prohibitive cost of obtaining ground truth in the wild. However, the inherent domain gap in image style and motion scene structure damages model performance when deploying to real applications [17].

Figure 1: Overview of the proposed MDFlow algorithm for unsupervised optical flow. Our mutual knowledge distillation approach includes an unsupervised teacher initialization stage, a reliable forward knowledge distillation stage based on confidence matching selection, and a reliable backward knowledge distillation stage based on multi-target optimization. Each stage improves the current flow prediction accuracy while maintaining good generalization ability by employing the data augmentor $\mathcal{A}$. Thanks to the decoupled nature of our framework, we can adopt any efficient teacher network $\mathcal{T}$ with any initialization method, and employ any powerful student network $\mathcal{S}$ without worrying about additional complexity in final usage. After this bidirectional mutual distillation procedure, the efficient teacher flow network is employed for real deployment.

One alternative pipeline is to leverage countless unlabeled video sequences for unsupervised learning, which could presumably produce satisfying results without suffering from any domain mismatch. These methods usually build objective functions based on brightness constancy and spatial smoothness [18, 19], and further include occlusion handling mechanisms [20, 21]. Most recently, the state-of-the-art ARFlow [22], UFlow [23] and UPFlow [24] integrate an augmentation regularization term into each iteration step for continuous improvement. Although this self-learning approach can improve flow accuracy, the unavoidable mismatches in predicted pseudo labels can conversely mislead the network. As a result, the optimization procedure is blocked by such matching noise even with more training iterations. This raises a question: Can we decouple such mismatching regions in augmentation regularization for further accuracy improvement?

On the other hand, compared with recent progress in deep flow architectures [13, 14], most recent top unsupervised approaches [25, 22, 26, 23] adopt PWC-Net [11] as the flow estimation model, for its well-behaved speed-accuracy trade-off and efficiency in real deployment. Intuitively, we could achieve better accuracy by naively replacing PWC-Net [11] with a superior flow architecture, such as RAFT [13], as done by the current state-of-the-art method SMURF [27]. However, this comes at the cost of extra computation and delay during inference. In addition, a more complex structure may lead to lower performance when guided by unsupervised losses [23], which brings another question: Can unsupervised optical flow benefit from recent and future advanced flow architectures without introducing additional cost at inference?

In this paper, we jointly address the above two questions with a novel unsupervised optical flow learning framework, termed MDFlow, which explores reliable knowledge distillation between teacher and student networks for alternate improvement. A high-level abstraction of the proposed mutual distillation framework is depicted in Figure 1. To answer the first question, we first adopt the UFlow [23] loss functions to pretrain a teacher network as an initialization, which generates pseudo labels for image pairs. Then, a confidence mask based on residuals of the census transform [28] is proposed to select reliable matches for training the student network with diverse data augmentation. This usually yields a student model that outperforms the teacher. Concerning the second question, we can select a stronger student to fully learn from our reliable proxy ground truth, which in turn transfers its better knowledge back to the teacher model. Thanks to the decoupled nature of our teacher-student networks and the mutual knowledge distillation strategy, our framework enjoys both an advanced flow architecture for sufficient learning on reliable pseudo labels and efficient inference by employing a lightweight and fast teacher network. Last but not least, we will show that the confidence mask plays a key role in improving student performance, and that formulating the knowledge distillation back to the teacher as a multi-target optimization objective yields the best final results.

Our contributions are summarized as:

  • We present a novel mutual knowledge distillation framework on unsupervised optical flow for improved performance without additional cost during inference.

  • The newly proposed reliable matching selection mechanism and multi-target joint learning pipeline are shown to be effective in the forward and backward flow knowledge distillation processes, respectively.

  • Our approach achieves state-of-the-art accuracy and generalization ability on Sintel [29] and KITTI 2015 [30] benchmarks compared with other pyramid-based methods.

II Related Work

II-A Deep Optical Flow Architecture

Finding dense correspondences between a pair of temporally adjacent frames, namely optical flow estimation, has been studied for decades for its fundamental role in many downstream vision tasks. FlowNet [9] and its successor FlowNet2 [10] are the first attempts to apply deep learning to optical flow estimation, directly regressing the flow field with an encoder-decoder U-shape network or its cascaded version. Inspired by the traditional coarse-to-fine pipeline [6, 31, 8], SPyNet [32], PWC-Net [11] and LiteFlowNet [33] employ pyramid, warping and cost volume components in end-to-end learning and achieve impressive real-time performance. After that, IRR-PWC [12] iteratively and jointly estimates residual flow and occlusion with estimators shared across pyramid levels. The work [34] introduces a dual self-attention module to improve the original pyramid flow network. MaskFlowNet [35] and OAS-Net [36] explore resolving the ambiguous matching caused by the warping operation. As for more efficient architectures, FDFlowNet [37] introduces a U-shape backbone and a partial fully connected decoder for a compact structure. FastFlowNet [38] constructs a center dense dilated correlation layer and a shuffle block decoder to reduce computation, and first achieves real-time performance on embedded systems. Recently, DICL-Flow [14] transfers the concatenated feature volume of stereo matching to optical flow and proposes displacement-invariant cost learning. Different from the above coarse-to-fine pipeline, RAFT [13] first calculates all-pairs correspondences and then introduces a recurrent module that estimates residual flow and updates multi-scale matching cost simultaneously, which achieves significant accuracy improvement. Later on, GMA [39] introduces a transformer-based global motion aggregation module to improve the recurrent flow architecture and obtains state-of-the-art accuracy.

II-B Learning Unsupervised Optical Flow

Due to the prohibitive cost of acquiring optical flow ground truth for real-world images, alternative approaches leverage countless unlabeled video sequences for unsupervised learning. These methods build objective functions based on the brightness constancy assumption and a local smoothness prior [18, 19]. Then, UnFlow [20] and OccAwareFlow [21] boost unsupervised performance by excluding residual calculation in reasoned occluded regions. However, the occluded areas are then only guided by a rigid smoothness constraint, which can damage overall accuracy. To handle this problem, DDFlow [40] and SelFlow [25] distill flow estimation from the teacher model to the student with random crop and occlusion hallucination in a data-driven manner, which further improves unsupervised accuracy. However, the teacher cannot benefit from the improved student. STFlow [41] integrates variational refinement with unsupervised learning and proposes a self-taught framework for continual improvement. SimFlow [26] explores learnable feature similarity for regularizing the previous census reconstruction loss. ARFlow [22] and UFlow [23] integrate augmentation regularization into each iteration step for continuous improvement of a single model. CoT-AMFlow [42] develops an adaptive modulation network and adopts a co-teaching strategy for better accuracy. Later on, UPFlow [24] improves the upsampling unit of PWC-Net and proposes a better pyramid distillation loss. DistillFlow [43] trains multiple teacher models and introduces a confidence-based two-stage distillation approach for improvement. OIFlow [44] proposes an occlusion-inpainting framework to make full use of occluded regions. Recently, ASFlow [45] presents content-aware pooling and adaptive flow upsampling modules to improve the pyramid-based unsupervised flow structure. MRDFlow [46] further introduces the 4D correlation layer and recurrent decoder of RAFT to strengthen the flow estimation network. At the same time, SMURF [27] replaces PWC-Net with the more powerful RAFT backbone and proposes a sequence-aware self-supervision loss to achieve state-of-the-art accuracy. However, it incurs a huge computation cost and cannot satisfy many real-time applications. Different from the above methods, we separate and recombine multiple objective parts into a mutual distillation framework for reliable performance improvement without increasing model size or inference delay.

Another line of unsupervised optical flow methods resorts to additional information beyond the two input frames. Based on the setting of stereo video systems, UnOS [47] enforces geometry constraints among stereo depth, camera ego-motion and optical flow for mutual promotion. Flow2Stereo [48] introduces data distillation into the joint learning framework of optical flow and stereo matching. Most recently, the work [49] shows that feature-level collaboration of the networks for optical flow, stereo depth and camera motion can outperform previous methods that only consider loss-level joint optimization. To facilitate unsupervised optical flow in challenging scenes, such as fog, rain and night, GyroFlow [50] converts gyroscope readings into a gyro field and fuses it with flow information to recover more motion details. Different from these methods that resort to additional information, the proposed MDFlow focuses on improving the basic setting by reducing matching noise and exploring a better flow architecture during the decoupled bidirectional distillation procedure.

II-C Knowledge Distillation and Mutual Distillation

Knowledge Distillation (KD) is usually exploited to train a compact student network by mimicking the output distribution of a pre-trained complex teacher model in addition to one-hot ground-truth labels [51]. Following variants [52, 53, 54, 55] mainly focus on utilizing intermediate information of the teacher, such as feature maps or attention masks, as extra hints for improvement. Different from the above one-way transfer between a teacher and a student, Deep Mutual Learning (DML) [56] finds that mutual learning of a collection of simple student networks can outperform distillation from a more powerful yet static teacher. EKD [57] introduces an evolutionary teacher that enables more efficient knowledge transfer by minimizing the capability gap between teacher and student. Dense Cross-layer Mutual-distillation (DCM) [58] improves the two-way Knowledge Transfer (KT) scheme by adding auxiliary classifiers to hidden layers of both teacher and student networks and building dense bidirectional KD between these classifiers. The work [59] also adopts mutual distillation to encourage information exchange between the local branch and the global branch of a single network for discovering new categories. There is also research applying mutual learning to other computer vision tasks, such as object detection [60, 61] and video-based sign language recognition [62]. Existing mutual distillation approaches have achieved success in high-level classification tasks. However, whether mutual distillation can help middle-level dense matching tasks has not been fully explored.

As for optical flow estimation, DDFlow [40] and SelFlow [25] distill flow estimation from teacher to student with several occlusion hallucination methods, which has been shown to be effective when inferring on occluded regions. DistillFlow [43] further improves this two-stage data distillation by introducing multiple teacher models and a confidence mechanism. For the related task of medical image registration, CRD [63] distills knowledge from a feature-shifted teacher model with high-resolution input to a student model with low-resolution input for better efficiency. Different from these works, our MDFlow transfers knowledge between teacher and student networks mutually, so as to decouple matching outliers in augmentation regularization and to exploit a recent advanced architecture as the student for better final prediction, while maintaining real-time inference.

III Method

In this section, we first illustrate the overall pipeline of the MDFlow algorithm for unsupervised optical flow, and then provide the notation and the unsupervised initialization approach used in our framework. Further, the reliable forward knowledge distillation process is demonstrated, where a novel confidence matching selection mechanism plays a key role. Finally, we present the reliable backward knowledge distillation process for improving the accuracy of the finally deployed teacher model.

Figure 2: Detailed framework of reliable mutual knowledge distillation for unsupervised optical flow. Green lines denote the forward path and blue lines denote the backward path. Networks with the same color share weights in each stage. Reliable masks $M_{f},M_{b}$ generated in stage 2 are shown in the bottom-left corner. Note that the inputs and outputs of the data augmentor $\mathcal{A}$ are detached from the gradient calculation graph for stable learning.

III-A Overview

As depicted in Figure 1, our proposed MDFlow algorithm trains a teacher model $\mathcal{T}$ and a student model $\mathcal{S}$ interactively for progressive improvement in a three-stage manner. In the first stage, the teacher network can be trained by any unsupervised approach to generate noisy pseudo labels, which will be used later. Specifically, we adopt PWC-Net [11] as the teacher and use the UFlow [23] loss functions for unsupervised initialization. In the second stage, we introduce a novel reliable matching selection mechanism that partly removes outliers from the generated pseudo labels, which are then used to train a student network with diverse data augmentation, constituting our forward knowledge distillation process. Note that we can select a stronger student architecture, such as RAFT [13], for sufficient learning thanks to the decoupled nature of mutual knowledge distillation. In the last stage, better pseudo labels predicted by the student are adopted to supervise the learning of the original teacher model, again enriched by diverse data augmentation. Moreover, we improve the final teacher performance by adding a weak unsupervised learning objective, formulating this step in a multi-target manner. More details are shown in Figure 2.
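To make the three-stage schedule concrete, the following is a minimal structural sketch in PyTorch (our illustration, not the released code): the loss callables are placeholders for Eqs. 4, 6 and 7 defined below, and each distillation loss is assumed to close over the frozen counterpart network and the data augmentor $\mathcal{A}$.

```python
import torch
from itertools import cycle

def train_mdflow(teacher, student, loader,
                 unsup_loss, forward_distill_loss, backward_distill_loss,
                 iters=(300_000, 100_000, 300_000)):
    """Structural outline of the three MDFlow stages; only the indicated model is updated per stage."""
    opt_t = torch.optim.Adam(teacher.parameters(), lr=1e-4)     # efficient teacher (e.g. PWC-Net)
    opt_s = torch.optim.AdamW(student.parameters(), lr=4e-4)    # powerful student (e.g. RAFT)

    def run(model, opt, n_iters, loss_fn):
        for _, batch in zip(range(n_iters), cycle(loader)):
            loss = loss_fn(model, batch)    # batch holds an unlabeled image pair (I1, I2)
            opt.zero_grad()
            loss.backward()
            opt.step()

    run(teacher, opt_t, iters[0], unsup_loss)             # stage 1: unsupervised teacher initialization (Eq. 4)
    run(student, opt_s, iters[1], forward_distill_loss)   # stage 2: teacher -> student with reliable masks (Eq. 6)
    run(teacher, opt_t, iters[2], backward_distill_loss)  # stage 3: student -> teacher, multi-target (Eq. 7)
    return teacher                                        # the efficient teacher is deployed
```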

III-B Unsupervised Initialization for Teacher Model

Given two consecutive frames $I_{1},I_{2}$ from unlabeled image sequences $\mathcal{I}$, unsupervised optical flow methods aim to train a flow network $\mathcal{N}(\cdot)$ based on the brightness constancy assumption and a spatial smoothness prior. The first goal is usually realized as a photometric reconstruction loss:

\begin{split}\mathcal{L}_{ph}(\mathcal{N})=&\frac{\sum_{p}{\rho(I_{1}-I_{2}^{w})\odot(1-O_{f})}}{\sum_{p}{(1-O_{f})}}+\\ &\frac{\sum_{p}{\rho(I_{2}-I_{1}^{w})\odot(1-O_{b})}}{\sum_{p}{(1-O_{b})}},\end{split} (1)

where $\odot$ represents element-wise multiplication, $O_{f},O_{b}$ are occlusion masks inferred by the forward-backward consistency check [64, 20], and $\rho(x)=(|x|+\epsilon)^{q}$ is a robust function with $\epsilon=0.01,q=0.4$ in all our experiments. $F_{f}=\mathcal{N}(I_{1},I_{2})$ and $F_{b}=\mathcal{N}(I_{2},I_{1})$ denote the predicted forward and backward flow fields, while $I_{2}^{w}=\textit{w}(I_{2},F_{f})$ and $I_{1}^{w}=\textit{w}(I_{1},F_{b})$ are the flow-warped reconstruction images. Following previous research [40, 23], we use the soft Hamming distance of census-transformed [28] image patches to calculate the reconstruction residual, for its robustness to illuminance variation.
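For concreteness, a minimal PyTorch sketch of this occlusion-masked photometric term is given below. It is our illustration under two simplifications: a plain intensity residual stands in for the census-based distance, and the occlusion masks $O_{f},O_{b}$ are assumed to be given by the forward-backward consistency check.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B,C,H,W) with flow (B,2,H,W) using bilinear grid_sample."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow  # absolute sampling positions
    gx = 2.0 * grid[:, 0] / (w - 1) - 1.0                            # normalize to [-1, 1]
    gy = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((gx, gy), dim=-1), align_corners=True)

def robust(x, eps=0.01, q=0.4):
    """Robust penalty rho(x) = (|x| + eps)^q."""
    return (x.abs() + eps) ** q

def photometric_loss(img1, img2, flow_f, flow_b, occ_f, occ_b):
    """Eq. 1 with a plain intensity residual in place of the census distance."""
    img2_w, img1_w = warp(img2, flow_f), warp(img1, flow_b)
    nonocc_f, nonocc_b = 1.0 - occ_f, 1.0 - occ_b                    # (B,1,H,W) non-occluded masks
    loss_f = (robust(img1 - img2_w) * nonocc_f).sum() / (nonocc_f.sum() + 1e-6)
    loss_b = (robust(img2 - img1_w) * nonocc_b).sum() / (nonocc_b.sum() + 1e-6)
    return loss_f + loss_b
```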

(a) First Input Image   (b) Flow Prediction of UFlow [23]   (c) Forward Non-Occluded Mask
(d) Ground Truth Flow   (e) Flow Error of UFlow [23]   (f) Proposed Reliable Matching Mask
Figure 3: Limitation of the current state-of-the-art approach and the proposed reliable matching mask. The forward non-occluded mask in (c) is calculated by the forward-backward consistency check. For the flow prediction error visualization in (e), deeper red means larger error and deeper blue means smaller error. Best viewed in color.

Due to the well-known aperture problem, the above solution can be ambiguous on textureless or repetitive-pattern regions, so we further introduce a loss term to constrain the spatial smoothness of the predicted flow fields, which is reweighted by local image gradients:

\mathcal{L}_{sm}(\mathcal{N})=\sum_{d\in\{x,y\}}\sum_{p}(|\nabla_{d}^{k}F_{f}|e^{-\lambda|\nabla_{d}^{1}I_{1}|}+|\nabla_{d}^{k}F_{b}|e^{-\lambda|\nabla_{d}^{1}I_{2}|}), (2)

where $\lambda$ controls the edge-aware weighting strength conditioned on the reference image, and $\nabla_{d}^{k},k=1,2$ stands for the $k$-th order gradient operator. In our experiments, we set $\lambda=150$. For smoothness, following the experience of UFlow [23], we use the first-order operator on Flying Chairs and Sintel, while adopting the second-order operator on KITTI, for their better results. The reason is that there is more motion parallel to the image plane in Chairs and Sintel, and more motion perpendicular to the image plane in the driving scenes of KITTI. Flow fields projected from motion parallel and perpendicular to the image plane satisfy first- and second-order smoothness, respectively.
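A possible implementation of this edge-aware smoothness term is sketched below (per flow field, so it is called once for $F_{f}$ and once for $F_{b}$); we replace the summation with a mean, letting the trade-off coefficient absorb the normalization.

```python
import torch

def grad_x(t):  # first-order horizontal difference
    return t[..., :, 1:] - t[..., :, :-1]

def grad_y(t):  # first-order vertical difference
    return t[..., 1:, :] - t[..., :-1, :]

def smoothness_loss(flow, img, order=1, lam=150.0):
    """Edge-aware k-th order smoothness (Eq. 2) for one flow field (B,2,H,W)."""
    # First-order image gradient weights, averaged over color channels.
    w_x = torch.exp(-lam * grad_x(img).abs().mean(dim=1, keepdim=True))
    w_y = torch.exp(-lam * grad_y(img).abs().mean(dim=1, keepdim=True))
    fx, fy = grad_x(flow), grad_y(flow)
    if order == 2:                                   # second-order: difference of differences
        fx, fy = grad_x(fx), grad_y(fy)
        w_x, w_y = w_x[..., :, 1:], w_y[..., 1:, :]  # crop weights to match shapes
    return (fx.abs() * w_x).mean() + (fy.abs() * w_y).mean()
```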

Inspired by the recent work of ARFlow [22] and UFlow [23], we also add a self-supervised loss for continual improvement. Let $\mathcal{A}$ be the data augmentor, which jointly augments the input images $I_{1},I_{2}$, pseudo labels $F_{f},F_{b}$ and valid masks $M_{f},M_{b}$ by applying spatial and color transformations. Denoting $\overline{I_{1}},\overline{I_{2}},\overline{F_{f}},\overline{F_{b}}$ and $\overline{M_{f}},\overline{M_{b}}$ as the corresponding augmented images, pseudo labels and valid masks, the surrogate supervision loss can be formulated as:

\begin{split}\mathcal{L}_{sup}(\mathcal{N}_{1}|\mathcal{N}_{2},\mathcal{A})=&\frac{\sum_{p}{|\mathcal{N}_{1}(\overline{I_{1}},\overline{I_{2}})-\overline{F_{f}}|\odot\overline{M_{f}}}}{\sum_{p}{\overline{M_{f}}}}+\\ &\frac{\sum_{p}{|\mathcal{N}_{1}(\overline{I_{2}},\overline{I_{1}})-\overline{F_{b}}|\odot\overline{M_{b}}}}{\sum_{p}{\overline{M_{b}}}}.\end{split} (3)

Note that there are two models $\mathcal{N}_{1}$ and $\mathcal{N}_{2}$ in Eq. 3, where $\mathcal{N}_{1}$ is optimized conditioned on the pseudo labels $F_{f},F_{b}$ predicted by $\mathcal{N}_{2}$ and the augmentor $\mathcal{A}$. Following previous experience, augmented training samples are detached from the gradient calculation graph for stable learning. We adopt the above loss notation for clarity of the following expressions. In the first stage of MDFlow, we employ the off-the-shelf UFlow [23] method to initialize the teacher $\mathcal{T}$ with the following objective:

\mathcal{L}_{s1}=\mathcal{L}_{ph}(\mathcal{T})+\lambda_{1}\mathcal{L}_{sm}(\mathcal{T})+\lambda_{2}\mathcal{L}_{sup}(\mathcal{T}|\mathcal{T},\mathcal{A}), (4)

where $\lambda_{1},\lambda_{2}$ are trade-off coefficients, and $\mathcal{L}_{sup}(\mathcal{T}|\mathcal{T},\mathcal{A})$ means that the second forward prediction is supervised by the augmented first prediction of the same model $\mathcal{T}$.
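A hedged PyTorch sketch of the surrogate supervision term of Eq. 3 is given below; the flow network is assumed to map an image pair to a (B,2,H,W) flow field, and the augmented pseudo labels and valid masks are assumed to already be produced by the frozen counterpart network and the augmentor $\mathcal{A}$.

```python
import torch

def surrogate_supervision_loss(net, aug_img1, aug_img2,
                               aug_flow_f, aug_flow_b, aug_mask_f, aug_mask_b):
    """Eq. 3: masked L1 distance between the network prediction on augmented
    images and the augmented (detached) pseudo labels."""
    pred_f = net(aug_img1, aug_img2)                           # (B,2,H,W) forward flow
    pred_b = net(aug_img2, aug_img1)                           # (B,2,H,W) backward flow
    flow_f, flow_b = aug_flow_f.detach(), aug_flow_b.detach()
    mask_f, mask_b = aug_mask_f.detach(), aug_mask_b.detach()
    err_f = (pred_f - flow_f).abs().sum(dim=1, keepdim=True)   # per-pixel L1 flow error
    err_b = (pred_b - flow_b).abs().sum(dim=1, keepdim=True)
    loss_f = (err_f * mask_f).sum() / (mask_f.sum() + 1e-6)
    loss_b = (err_b * mask_b).sum() / (mask_b.sum() + 1e-6)
    return loss_f + loss_b
```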

Figure 4: Sparsification Curves on Sintel and KITTI datasets.

III-C Reliable Forward Knowledge Distillation

Due to shadows, exposure, textureless and repetitive patterns existing in natural images, the optimal solution of photometric reconstruction will inevitably contain mismatched areas. On the other hand, state-of-the-art unsupervised optical flow approaches [22, 23] augment the predicted flow fields for a second round of supervision to improve performance. Concretely, they select non-occluded regions as the valid mask when calculating the above self-supervised loss. As shown in Figure 3, large-error regions in the predicted pseudo labels, for example, the background surrounded by moving persons, are not identified by the occlusion map. Thus, the augmented surrogate supervision will hinder the network from learning the true displacement. To alleviate this problem, we first propose a novel confidence matching selection mechanism to find relatively reliable predictions, and then separate $\mathcal{L}_{sup}$ in Eq. 4 from $\mathcal{L}_{ph},\mathcal{L}_{sm}$ into a supervised knowledge distillation process. Holding the opinion that better-matched image patches should have lower residuals, we set a threshold to keep only a portion of reliable pseudo labels during forward knowledge distillation. In this work, we employ the soft Hamming distance of census-transformed image patches for confidence measurement, and also include the robust function $\rho$, which does not change the confidence ranking. Denoting $M_{f}$ as the forward valid mask and $\tau$ as the threshold, the selection approach can be formulated as:

M_{f}(p)=\left\{\begin{aligned} 1&,&\rho(I_{1}(p)-I_{2}^{w}(p))\leq\tau,\\ 0&,&\rho(I_{1}(p)-I_{2}^{w}(p))>\tau.\end{aligned}\right. (5)
TABLE I: Valid mask selection threshold and endpoint error on MPI Sintel and KITTI 2015 training datasets with specific removal rates. Models are trained with loss function in Eq. 4 on Sintel and KITTI training datasets respectively.
Removal Rate MPI Sintel KITTI 2015
Threshold EPE Threshold EPE
0% 4.66 1.21 4.66 1.53
10% 3.22 0.58 3.59 1.26
20% 2.82 0.46 3.26 1.12
30% 2.54 0.40 2.99 1.03

In order to set a meaningful threshold for outlier removal, we analyze how the average endpoint error varies with the removal rate, controlled by $\tau$, on the Sintel and KITTI training sets. Similar to uncertainty estimation [65], the sparsification curves are depicted in Figure 4, while several typical removal rates with corresponding thresholds and errors are listed in Table I. As can be seen, the average endpoint error of the remaining pseudo labels decreases as we gradually remove more predictions from the original non-occluded mask, until the removal rate reaches about $90\%$. On the other hand, a larger removal percentage means fewer pseudo labels for training, which can damage performance. Therefore, setting a reasonable removal rate for a better trade-off plays a key role in the forward distillation process. We leave this question to the experiment section for detailed discussion. Guided by the above confidence matching selection masks $M_{f},M_{b}$, reliable forward knowledge distillation can be formulated as:

\mathcal{L}_{s2}=\mathcal{L}_{sup}(\mathcal{S}|\mathcal{T},\mathcal{A}). (6)
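The confidence selection of Eq. 5 can be realized either with the fixed thresholds of Table I or, for a target removal rate, with a per-batch quantile over the non-occluded residuals, as in the hedged sketch below; the quantile variant is our illustration and not necessarily the exact procedure used in the paper.

```python
import torch

def reliable_matching_mask(residual, nonocc_mask, removal_rate=0.10):
    """Eq. 5: keep the (1 - removal_rate) fraction of non-occluded pixels with
    the lowest reconstruction residual (e.g. rho of the census distance between
    I1 and the warped I2); both inputs are (B,1,H,W)."""
    valid = nonocc_mask > 0.5
    if valid.sum() == 0:
        return nonocc_mask
    # Threshold tau as a quantile of residuals inside the non-occluded region;
    # the paper instead uses fixed tau values such as 3.22 (Sintel) and 3.59 (KITTI).
    tau = torch.quantile(residual[valid], 1.0 - removal_rate)
    return (valid & (residual <= tau)).float()
```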

III-D Reliable Backward Knowledge Distillation

Due to the decoupled characteristic of the proposed mutual distillation framework, we can employ a stronger flow architecture as the student $\mathcal{S}$ than the original teacher $\mathcal{T}$, so as to learn from the reliable pseudo labels sufficiently. This process differs from traditional knowledge distillation [51, 66], where a lightweight student usually distills knowledge from a large teacher model. Therefore, our approach further includes a backward distillation stage, aiming to transfer the better student knowledge back to the efficient teacher model. As a result, the proposed MDFlow does not add extra computation cost or inference delay in real deployment compared with other unsupervised approaches.

To achieve this goal, one may follow the supervision loss of Eq. 6 used in stage 2, exchanging $\mathcal{T}$ and $\mathcal{S}$. However, due to the limited learning ability of the efficient optical flow network, and the diverse strong augmentation imposed on pseudo labels, the training data for the efficient teacher $\mathcal{T}$ is blended with augmentation noise, which deviates from the distribution of real-world dynamic scenes. On the other hand, we cannot remove the augmentor $\mathcal{A}$ during backward distillation, since data augmentation has been proved to be one crucial step for excellent performance [67, 68]. To deal with this problem, we establish a multi-target learning scheme for reliable backward knowledge distillation, integrating specific domain knowledge of the optical flow task. Specifically, unsupervised photometric and smoothness losses based on the original scene structure are introduced as regularization. In short, the reliable backward distillation of stage 3 can be written as:

\mathcal{L}_{s3}=\mathcal{L}_{sup}(\mathcal{T}|\mathcal{S},\mathcal{A})+\lambda_{3}\mathcal{L}_{ph}(\mathcal{T})+\lambda_{4}\mathcal{L}_{sm}(\mathcal{T}), (7)

where $\lambda_{3},\lambda_{4}$ are weight coefficients controlling the unsupervised regularization intensity, and the non-occluded masks $1-O_{f},1-O_{b}$ of $\mathcal{T}$ are adopted as valid masks in $\mathcal{L}_{sup}$.
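Combining the pieces sketched above, a possible form of this multi-target stage-3 objective is given below; it reuses the surrogate_supervision_loss, photometric_loss and smoothness_loss sketches from the previous subsections, with coefficients following the Chairs/Sintel setting described in Section IV-B.

```python
def backward_distillation_loss(teacher,
                               img1, img2,                 # original frames
                               aug_img1, aug_img2,         # augmented frames
                               aug_flow_f, aug_flow_b,     # augmented student pseudo labels
                               aug_mask_f, aug_mask_b,     # augmented non-occluded masks
                               occ_f, occ_b,               # teacher occlusion masks on original frames
                               lam3=0.1, lam4=0.1, order=1):
    """Eq. 7: supervised distillation from the student plus unsupervised regularization."""
    l_sup = surrogate_supervision_loss(teacher, aug_img1, aug_img2,
                                       aug_flow_f, aug_flow_b, aug_mask_f, aug_mask_b)
    flow_f = teacher(img1, img2)                           # unaugmented predictions for the
    flow_b = teacher(img2, img1)                           # photometric / smoothness terms
    l_ph = photometric_loss(img1, img2, flow_f, flow_b, occ_f, occ_b)
    l_sm = smoothness_loss(flow_f, img1, order) + smoothness_loss(flow_b, img2, order)
    return l_sup + lam3 * l_ph + lam4 * l_sm
```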

Columns: Input Frames, MDFlow, SimFlow [26], UFlow [23], UPFlow [24]
Figure 5: Qualitative results of state-of-the-art pyramid-based unsupervised optical flow methods on MPI Sintel test datasets. Colored flow fields and flow error maps are interlaced. Fine details and less artifacts can be observed in the flow fields of proposed MDFlow. Zoom in for best view.

IV Experiments

In this section, we first describe the implementation details and training strategy of the proposed MDFlow algorithm. Then, we quantitatively and qualitatively compare our results with existing state-of-the-art approaches on standard benchmark datasets, including Flying Chairs [9], MPI Sintel [29] and KITTI 2015 [30]. Further, ablation studies on the proposed reliable matching selection mechanism and the superior student architecture in forward distillation, as well as the multi-target regularized learning in backward distillation, are carried out. Finally, we show the superior generalization ability of our method.

TABLE II: Comparison of different optical flow architectures. Running time and computation complexity are measured on Sintel-resolution images with one NVIDIA GTX 1080 Ti GPU.
Model Parameters (M) Runtime (ms) FLOPs (G)
RAFT [13] 5.26 125 362.5
PWC-Net [11] 8.75 34 90.8
FastFlowNet [38] 1.37 11 12.2

IV-A Implementation Detail

We employ the efficient PWC-Net [11] as the teacher model and the powerful RAFT [13] as the student model, whose model size, running time and computation complexity are compared in Table II. It can be seen that even though RAFT contains slightly fewer parameters than PWC-Net, its running time and computation complexity are about 4$\times$ larger. All our experiments are implemented in PyTorch and conducted on 2 NVIDIA Tesla V100 GPUs. We adopt the Adam [69] and AdamW [70] optimizers to train PWC-Net and RAFT, respectively. The batch size is set to 4 for all experiments. Following previous work [40, 26, 23], we first pre-train MDFlow on Flying Chairs, and then fine-tune the above teacher and student networks on each target dataset. On Sintel, we train on the combination of the Clean and Final training parts, and adopt the fine-tuned teacher model for online evaluation. On KITTI, we train on the KITTI 2015 multi-view extension training set, removing frames 9-12 of each sequence. Results of the teacher on the KITTI 2015 test set are uploaded to the KITTI website for online comparison. Overall, it takes about 4 days to train the three-stage MDFlow, first on Chairs and then on the Sintel or KITTI datasets, which also depends on the model complexity, IO speed and degree of code parallelism.

TABLE III: Quantitative results on Flying Chairs, MPI Sintel and KITTI 2015 datasets. The metric EPE is the average endpoint error and Fl-all is the percentage of erroneous pixels over all pixels. Results in parentheses indicate evaluation on data the model was trained on. Unavailable results are marked as '-'. 'C', 'T', 'S' and 'K' stand for the Flying Chairs, FlyingThings3D, MPI Sintel and KITTI 2015 datasets, respectively. Superscript 'raw' means the raw part of the corresponding dataset. K^{vo} means the KITTI Visual Odometry dataset. (Stereo) denotes that stereo data is used during training. For each item in the supervised setting (top part), the best result is in bold. For each item in the unsupervised setting (bottom part), the best result is boldfaced, and the second best is underlined.
Data Method C-test S-train (EPE) S-test (EPE) K-15-train K-15-test
EPE Clean Final Clean Final EPE Fl-all
C+T FlowNet2 [10] - 2.02 3.14 3.96 6.02 10.06 -
S/K FlowNet2-ft [10] - (1.45) (2.01) 4.16 5.74 (2.3) 10.41%
C+T FastFlowNet [38] - 2.89 4.14 - - 12.24 -
S/K FastFlowNet-ft [38] - (2.08) (2.71) 4.89 6.08 (2.13) 11.22%
C+T PWC-Net [11] 2.30 2.55 3.93 - - 10.35 -
S/K PWC-Net-ft [11] - (1.70) (2.21) 3.86 5.13 (2.16) 9.60%
C+T RAFT [13] - 1.43 2.71 - - 5.04 -
S/K RAFT-ft [13] - (0.77) (1.20) 2.08 3.41 (1.5) 5.27%
SY/K^{raw} UnFlow-CSS [20] - - 7.91 - 10.22 8.10 -
C+S/K OccAwareFlow [21] 3.30 (4.03) (5.95) 7.95 9.15 8.88 31.20%
R+S/K MFOccFlow [71] - (3.89) (5.52) 7.23 8.81 6.59 22.94%
C+S/K EPIFlow [72] - (3.54) (4.99) 7.00 8.51 5.56 16.95%
C+S/K DDFlow [40] 2.97 (2.92) (3.98) 6.18 7.40 5.72 14.29%
S^{raw}/K SelFlow [25] - (2.88) (3.87) 6.56 6.57 4.84 14.19%
C+S/K STFlow [41] 2.53 (2.91) (3.59) 6.12 6.63 3.56 13.83%
K^{raw}(Stereo) UnOS [47] - - - - - 5.58 18.00%
K(Stereo) Flow2Stereo [48] - - - - - 3.54 11.10%
S^{raw}/K^{raw} ARFlow [22] - 2.79 3.73 4.78 5.89 2.85 11.80%
C+S/K SimFlow [26] 2.69 (2.86) (3.57) 5.92 6.92 5.19 13.38%
C+S/K UFlow [23] 2.55 (2.50) (3.39) 5.21 6.50 2.71 11.13%
S+S^{raw}/K^{raw} DistillFlow [43] - (2.61) (3.70) 4.23 5.81 2.93 10.54%
C+S/K OIFlow [44] 2.53 (2.44) (3.35) 4.26 5.71 2.57 9.81%
S/K^{raw} ASFlow [45] - (2.40) (2.89) 4.56 5.86 2.47 9.67%
S^{raw}/K^{raw} CoT-AMFlow [42] - - - 3.96 5.14 - 10.34%
K^{vo}+K(Stereo) FLC [49] - - - - - 2.35 9.70%
S/K^{raw} UPFlow [24] - (2.33) (2.67) 4.68 5.32 2.45 9.38%
C+S/K MDFlow-Fast (Ours) 2.75 (2.53) (3.47) 4.73 5.99 3.02 11.43%
C+S/K MDFlow (Ours) 2.48 (2.17) (3.14) 4.16 5.46 2.45 8.91%

IV-B Training Strategy

On each dataset, we perform the proposed three-stage training pipeline for mutual knowledge distillation, where the first and last stages both take 300k iterations to train the teacher, while the second stage takes 100k iterations to train the student. When training on Flying Chairs, the learning rate is initially set to 1e-4 and is halved at 100k, 150k, 200k and 250k iterations in both stage 1 and stage 3. In stage 2, it is initially set to 4e-4 and is halved at 20k, 40k, 60k and 80k iterations. When fine-tuning on Sintel and KITTI, we use the same training schedule as on Chairs, but the learning rate is divided by 2 in all corresponding phases. Across stages, both the teacher and the student load the latest updated checkpoint for initialization, and are randomly initialized from scratch before training on Chairs. The data augmentor $\mathcal{A}$ includes random flipping, resizing, rotating, cropping and color jittering. According to a hyperparameter search, the weight coefficients $\lambda_{1},\lambda_{2},\lambda_{3},\lambda_{4}$ in Eq. 4 and Eq. 7 are set to $1.0, 0.05, 0.1, 0.1$ for Chairs and Sintel, and to $5.0, 0.05, 0.02, 0.1$ for KITTI. Note that the smoothness coefficient $\lambda_{1}$ on KITTI is 5$\times$ that on Chairs and Sintel in the first stage, while the photometric coefficient $\lambda_{3}$ on KITTI is only $1/5$ of that on Chairs and Sintel in the third stage. This is because KITTI uses the second-order smoothness, so a larger coefficient is needed to reach a reasonable regularization strength in the first stage; and in the third stage, the non-occluded regions in KITTI benefit more from the distillation term than from the noisy reconstruction term.
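As a small illustration of this schedule, the optimizer and step-wise learning-rate decay can be set up as below (a sketch consistent with the description above, not the exact training code; milestones are iteration counts and the scheduler is stepped once per iteration).

```python
import torch

def make_optimizer_and_scheduler(model, stage, is_raft=False, finetune=False):
    """Optimizer / schedule sketch for one MDFlow stage (Flying Chairs values)."""
    if stage == 2:                                     # student stage
        lr, milestones = 4e-4, [20_000, 40_000, 60_000, 80_000]
    else:                                              # teacher stages 1 and 3
        lr, milestones = 1e-4, [100_000, 150_000, 200_000, 250_000]
    if finetune:                                       # Sintel / KITTI fine-tuning halves the rate
        lr *= 0.5
    opt_cls = torch.optim.AdamW if is_raft else torch.optim.Adam
    optimizer = opt_cls(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.5)
    return optimizer, scheduler
```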

IV-C Comparison to State-of-the-art Methods

IV-C1 Quantitative Results

As shown in Table III, we quantitatively compare MDFlow with existing supervised and unsupervised methods on three leading optical flow benchmarks with standard metrics, where our approach outperforms most pyramid-based unsupervised methods on almost all datasets. The current state-of-the-art SMURF [27] relies on the heavy recurrent RAFT model during evaluation, so it is not listed in Table III because of its large computation cost and time delay, which make a direct comparison unfair. Moreover, the proposed MDFlow framework is general and does not depend on a particular flow architecture or unsupervised initialization. Due to the diverse training configurations used in previous work, we list their training data in the first column for reference.

On Flying Chairs, MDFlow slightly reduces the previous best EPE from 2.53 to 2.48, possibly because of the relatively simple motion in this domain. On the more challenging Sintel and KITTI 2015 datasets, our method achieves an obvious improvement over the others. On Sintel training, we obtain the best EPE of 2.17 on the Clean pass, and are only slightly worse than UPFlow [24] and ASFlow [45] on the Final pass. For the online test benchmarks, our method achieves EPEs of 4.16 and 5.46 on Clean and Final respectively, even outperforming DistillFlow [43] (4.23 and 5.81), which is trained on the Sintel raw dataset that contains the test image sequences and is thus not a fair comparison. CoT-AMFlow [42] behaves best on both the Sintel Clean and Final test sets. We attribute this to the fact that it both employs a more complex adaptive modulation flow network and uses the unfair Sintel raw movie sequences.

Columns: Input Frames, MDFlow, SimFlow [26], UFlow [23], SelFlow [25]
Figure 6: Qualitative results of state-of-the-art pyramid-based unsupervised optical flow methods on KITTI 2015 test datasets. Colored flow fields and flow error maps are interlaced. Fine details and less artifacts can be observed in the flow fields of proposed MDFlow. Zoom in for best view.

In regard to KITTI 2015, we match the second-best EPE of 2.45 of UPFlow [24] on the training set, but without additional parameters or inference delay, which is a 9.2% improvement over the comparable UFlow [23] (2.71). We further upload the flow predictions of the efficient teacher on the test images to the KITTI benchmark website and achieve an 8.91% Fl-all error rate, outperforming the state-of-the-art pyramid-based unsupervised approach UPFlow [24], i.e., 8.91% vs. 9.38%. It is worth noting that the proposed MDFlow also outperforms the state-of-the-art binocular SLAM method FLC [49], which shows the room for improvement in this research area.

To facilitate a large number of downstream video tasks, especially embedded real-time applications, we replace the original PWC-Net [11] with the more efficient FastFlowNet [38] as the teacher model, and repeat the same experiments as before. The model size, running time and computation complexity of FastFlowNet are compared in Table II. As shown in Table III, MDFlow-Fast approaches state-of-the-art unsupervised flow accuracy while only containing 1.37 M parameters and consuming much less computation and inference time, demonstrating both the efficiency of the FastFlowNet architecture and the efficiency of the MDFlow learning framework.

IV-C2 Qualitative Results

Some qualitative results on the MPI Sintel and KITTI 2015 test datasets are depicted in Figure 5 and Figure 6, respectively, for visual comparison. It can be seen that the flow fields of the proposed MDFlow preserve the motion scene structure better than the other methods, especially in regions with large occlusion and complex texture.

IV-D Ablation Study

To verify the effectiveness of the proposed contributions, we conduct thorough ablation experiments on the reliable mutual knowledge distillation algorithm, considering all training stages and different training settings; our findings are reported in Table IV. All ablation experiments are trained on the Sintel training or KITTI 2015 multi-view extension datasets with the loss functions of the corresponding training stage. We first report results of the unsupervised initialization (stage 1) of the teacher model on the Sintel training and KITTI 2015 training sets using Eq. 4 as the baseline, which is denoted as E1 in Table IV.

TABLE IV: Ablation study on MPI Sintel and KITTI 2015 datasets. $\rightarrow$ and $\leftarrow$ represent the forward and backward distillation processes, and no arrow means using a single model without distillation. 'Init' stands for the checkpoints that the corresponding models are initialized with. 'C' means checkpoints pre-trained on the Flying Chairs dataset. $RR$ stands for the removal rate in Figure 4. $\mathcal{L}_{sup},\mathcal{L}_{ph},\mathcal{L}_{sm}$ refer to the different loss terms in Eq. 7. Results are tested on the model that is optimized in the corresponding ablation experiment.
Stage ID Mode Init Setting S-train (EPE) K-15-train
Clean Final EPE Fl-all
1 E1 PWC C 300k Iterations (2.39) (3.34) 2.66 9.64%
E2 PWC C 700k Iterations (2.38) (3.32) 2.64 9.66%
2 E3 PWC$\rightarrow$PWC E1$\rightarrow$C $RR=0\%$ (2.43) (3.33) 2.64 8.86%
E4 $RR=10\%$ (2.32) (3.28) 2.54 8.50%
E5 $RR=20\%$ (2.36) (3.31) 2.60 8.48%
E6 $RR=30\%$ (2.37) (3.35) 2.67 8.60%
E7 PWC E1 $RR=10\%$ (2.33) (3.31) 3.00 9.89%
E8 PWC$\rightarrow$RAFT E1$\rightarrow$C $RR=10\%$ (2.16) (3.15) 2.31 8.31%
3 E9 PWC$\leftarrow$PWC E1$\leftarrow$E4 $\mathcal{L}_{sup}$ (2.31) (3.25) 2.53 8.36%
E10 PWC$\leftarrow$RAFT E1$\leftarrow$E8 $\mathcal{L}_{sup}$ (2.27) (3.22) 2.48 8.22%
E11 PWC$\leftarrow$RAFT E1$\leftarrow$E8 $\mathcal{L}_{sup}+\mathcal{L}_{ph}+\mathcal{L}_{sm}$ (2.17) (3.14) 2.45 8.10%
4 E12 PWC$\rightarrow$RAFT E11$\rightarrow$E8 $RR=10\%$ (2.08) (3.10) 2.20 7.85%
5 E13 PWC$\leftarrow$RAFT E11$\leftarrow$E12 $\mathcal{L}_{sup}+\mathcal{L}_{ph}+\mathcal{L}_{sm}$ (2.16) (3.14) 2.36 8.04%

IV-D1 Reliable Matching Selection

As shown in the middle part of Table IV, we conduct an ablation to explore the effectiveness of the proposed confidence matching selection mechanism. According to Figure 4, the removal rate is set to four typical values of $0\%$, $10\%$, $20\%$ and $30\%$, denoted by E3, E4, E5 and E6 in Table IV, where $0\%$ means that the original non-occluded map is adopted as the valid mask during forward knowledge distillation. In this group, both the teacher and the student models are instantiated as PWC-Net. Compared with the baseline E1, forward distillation with $RR=0\%$ brings almost no improvement to the student, due to the mismatched regions in the pseudo labels. When we remove $10\%$ of the teacher's predictions according to the proposed confidence ranking mechanism in Eq. 5, the student yields the lowest endpoint error on both the Sintel and KITTI 2015 training sets. As more uncertain labels are removed, the EPE gets larger, while the Fl-all metric on KITTI 2015 behaves best with $RR=20\%$. Therefore, we use $RR=10\%$ in MDFlow, which is equivalent to adopting $\tau$ of 3.22 and 3.59 on Sintel and KITTI respectively in Eq. 5 according to Table I. At this point, one may ask whether the distillation procedure is necessary, since we could directly apply the proposed confidence mask to $\mathcal{L}_{sup}(\mathcal{T}|\mathcal{T},\mathcal{A})$ when optimizing Eq. 4. To explore this, we further carry out an experiment named 'PWC' in stage 2, marked as E7 in Table IV. As can be seen, it approaches the result of the distillation counterpart on Sintel, while the results are much worse on KITTI, the reason for which may be the instability of combining the sparse valid mask with the unsupervised loss in challenging scenes.

IV-D2 Stronger Student Model

Thanks to the decoupled nature of our mutual distillation framework, we can employ any optical flow architecture as the student without worrying about introducing additional cost in real deployment. Inspired by recent success on the supervised optical flow task, for the first time, we explore whether a stronger student model can achieve better results when training on pseudo labels with the same amount of noise. As expected, when we employ the advanced RAFT [13] as the student, the endpoint error is reduced by $6.9\%$, $4.0\%$ and $9.1\%$ on Sintel Clean, Sintel Final and KITTI 2015 respectively, as shown in E8 of stage 2. These significantly improved results demonstrate the effectiveness of the proposed reliable forward knowledge distillation process. Moreover, E9 is the backward distillation experiment corresponding to E4, which together constitute a complete mutual distillation procedure without employing a stronger student. Comparing E9 and E10, we can conclude that a stronger student improves the final performance of the teacher.

IV-D3 Multi-Target Backward Distillation

The superior accuracy in stage 2 comes at the cost of inference delay, so transferring the better student knowledge back to the efficient teacher model makes sense. One can directly regard the student's predictions as pseudo labels and train the teacher in a supervised manner, as is done in stage 2. However, as E10 in Table IV shows, the performance of the final teacher model then declines distinctly compared with E8, the reason for which may be the limited learning capability of the efficient network and the relatively difficult task caused by strong data augmentation. To deal with this problem, we formulate this step as a multi-target learning pipeline by introducing an unsupervised objective for regularization. As E11 in Table IV shows, the proposed approach obtains better results than the pure supervision counterpart, and can well maintain or even surpass the powerful student model. Furthermore, we carry out a sensitivity analysis of the photometric loss $\mathcal{L}_{ph}$ and the smoothness loss $\mathcal{L}_{sm}$ in the backward distillation stage according to Eq. 7, where the original $\lambda_{3}\mathcal{L}_{ph}$ and $\lambda_{4}\mathcal{L}_{sm}$ are multiplied by scale factors $\alpha$ and $\beta$ respectively. As shown in Table V, $\alpha$ and $\beta$ are alternately set to 0.5 and 1.5 to perturb the weighting hyperparameters $\lambda_{3}$ and $\lambda_{4}$. It can be seen that a relatively large coefficient for $\mathcal{L}_{ph}$ is more beneficial on Sintel, while a relatively large coefficient for $\mathcal{L}_{sm}$ is more helpful on KITTI. The reason may be that the Sintel dataset includes more non-rigid motion, while brightness noise is more obvious in the real-world KITTI dataset. In summary, within a reasonable range of values for $\lambda_{3}$ and $\lambda_{4}$, multi-target backward distillation behaves better than the pure supervision counterpart, demonstrating the robustness of the proposed knowledge distillation approach.

TABLE V: Sensitivity analysis of the photometric and the smoothness objective functions in the backward distillation stage.
$\mathcal{L}_{sup}+\alpha\lambda_{3}\mathcal{L}_{ph}+\beta\lambda_{4}\mathcal{L}_{sm}$ S-train (EPE) K-15-train
Clean Final EPE Fl-all
$\alpha=1.0,\,\beta=1.0$ (2.17) (3.14) 2.45 8.10%
$\alpha=0.5,\,\beta=1.0$ (2.21) (3.18) 2.46 8.13%
$\alpha=1.5,\,\beta=1.0$ (2.18) (3.16) 2.48 8.16%
$\alpha=1.0,\,\beta=0.5$ (2.19) (3.17) 2.49 8.21%
$\alpha=1.0,\,\beta=1.5$ (2.23) (3.20) 2.43 8.12%

IV-D4 Necessity of Multi-Stage Distillation

To demonstrate the necessity of multi-stage distillation, we conduct an experiment named E2 that enlarges the number of training iterations of stage 1 to equal that of the proposed three-stage learning procedure, i.e., stage 1 is optimized with 700k iterations in total for a fair comparison. Comparing E2 and E1 in Table IV, the flow accuracy on the diverse datasets shows almost no improvement. We attribute this to the fact that the error of the pseudo labels in the augmentation regularization term impedes the optimization of Eq. 4, trapping it in a local minimum. Therefore, multi-stage distillation is necessary.

IV-D5 One More Mutual Knowledge Distillation

Since the proposed approach brings progressive improvement between the teacher and the student, we carry out an experiment that performs one more round of mutual distillation between the two networks, whose forward and backward distillation stages are denoted as E12 and E13 in Table IV, respectively. As expected, comparing E12 with E8 and E13 with E11, both the student and the teacher achieve better flow accuracy than in their previous optimization stage. However, the reduction of average endpoint error on both Sintel and KITTI is less than 0.1, while it requires double the training iterations. This shows that the three-stage MDFlow already achieves relatively good results and is close to saturation.

TABLE VI: Generalization ability comparison of methods with pyramid flow architectures across different datasets.
Method C-test S-train (EPE) K-15-train
EPE Clean Final EPE Fl-all
Trained on Chairs:
DDFlow [40] 2.97 4.83 4.85 17.26 -
SimFlow [26] 2.69 3.66 4.67 16.99 -
UFlow [23] 2.55 3.43 4.17 11.27 30.31%
MDFlow (Ours) 2.48 2.89 4.00 9.60 25.87%
Trained on Sintel:
DDFlow [40] 3.46 (2.92) (3.98) 12.69 -
ARFlow [22] 3.50 2.79 3.73 9.04 -
SimFlow [26] 3.01 (2.86) (3.57) 12.75 -
UFlow [23] 3.25 (2.50) (3.39) 9.40 20.02%
MDFlow (Ours) 2.79 (2.17) (3.14) 5.92 16.00%
Trained on KITTI:
DDFlow [40] 6.35 6.20 7.08 5.72 -
SimFlow [26] 4.32 5.49 7.24 5.19 -
UFlow [23] 5.05 5.58 6.31 2.71 9.05%
MDFlow (Ours) 4.14 4.04 5.34 2.45 8.10%

In short, our proposed MDFlow (E11) improves the baseline (E1) results by $9.2\%$, $6.0\%$ and $16.0\%$ on the Sintel Clean, Sintel Final and KITTI 2015 training sets respectively, which is consistent with the improvement over the comparable state-of-the-art UFlow [23] on multiple test benchmarks.

Figure 7: Qualitative results of the teacher network on the cross domain DAVIS dataset [73]. Three pairs of image and flow examples are shown.

IV-E Cross Domain Generalization

Finally, we conduct cross-domain experiments to compare with recent top-performing unsupervised methods. As illustrated in Table VI, our method consistently performs best when training on one dataset and testing on another, which demonstrates the state-of-the-art generalization ability of MDFlow among pyramid-based approaches. We attribute this excellent generalization performance to the efficient use of data augmentation on reliable pseudo labels during the decoupled mutual knowledge distillation process. Figure 7 presents qualitative results of the efficient teacher network on the cross-domain DAVIS dataset [73]. It can be seen that the proposed unsupervised learning approach estimates relatively good optical flow with clear motion boundaries in diverse dynamic scenes.

V Conclusion

To the best of our knowledge, this is the first time a mutual knowledge distillation framework has been introduced to unsupervised optical flow, which can efficiently leverage countless unlabeled video sequences for optical flow learning. To decouple the mismatched pseudo labels that block the learning process, we propose a confidence matching selection mechanism to partly exclude the influence of outliers. Then, we use diverse augmentation, as in supervised methods, together with the above reliable pseudo labels to train a stronger student model for more accurate flow prediction. Finally, a novel multi-target backward distillation procedure is built to transfer the better student knowledge back to the efficient teacher without sacrificing generalization ability. Experiments on the Flying Chairs, MPI Sintel and KITTI 2015 datasets show that our framework achieves state-of-the-art real-time accuracy and generalization performance. In the future, we plan to explore the proposed reliable mutual knowledge distillation approach on other unsupervised matching tasks, such as Structure from Motion (SfM) and scene flow estimation.

References

  • [1] C. Wang, X. Chen, S. Min, J. Wang, and Z.-J. Zha, “Structure-guided deep video inpainting,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • [2] L. Kong, B. Jiang, D. Luo, W. Chu, X. Huang, Y. Tai, C. Wang, and J. Yang, “Ifrnet: Intermediate feature refine network for efficient frame interpolation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [3] J. Liu, L. Kong, and J. Yang, “Atca: an arc trajectory based model with curvature attention for video frame interpolation,” in 2022 IEEE International Conference on Image Processing (ICIP), 2022.
  • [4] J. Dong and H. Liu, “Video stabilization for strict real-time applications,” IEEE Transactions on Circuits and Systems for Video Technology, 2017.
  • [5] B. K. Horn and B. G. Schunck, “Determining optical flow,” Artificial Intelligence, 1981.
  • [6] T. Brox, C. Bregler, and J. Malik, “Large displacement optical flow,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • [7] D. Sun, S. Roth, and M. J. Black, “Secrets of optical flow estimation and their principles,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
  • [8] C. Bailer, B. Taetz, and D. Stricker, “Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [9] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. v. d. Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in 2015 IEEE International Conference on Computer Vision (ICCV), 2015.
  • [10] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “Flownet 2.0: Evolution of optical flow estimation with deep networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [11] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • [12] J. Hur and S. Roth, “Iterative residual refinement for joint optical flow and occlusion estimation,” in CVPR, 2019.
  • [13] Z. Teed and J. Deng, “Raft: Recurrent all-pairs field transforms for optical flow,” in European Conference on Computer Vision (ECCV), 2020.
  • [14] J. Wang, Y. Zhong, Y. Dai, K. Zhang, P. Ji, and H. Li, “Displacement-invariant matching cost learning for accurate optical flow estimation,” in Advances in Neural Information Processing Systems, 2020.
  • [15] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [16] A. Ranjan, J. Romero, and M. J. Black, “Learning human optical flow,” in 29th British Machine Vision Conference, Sep. 2018.
  • [17] N. Mayer, E. Ilg, P. Fischer, C. Hazirbas, D. Cremers, A. Dosovitskiy, and T. Brox, “What makes good synthetic training data for learning disparity and optical flow estimation?” International Journal of Computer Vision, 2018.
  • [18] J. J. Yu, A. W. Harley, and K. G. Derpanis, “Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness,” in Computer Vision – ECCV 2016 Workshops, 2016.
  • [19] Z. Ren, J. Yan, B. Ni, B. Liu, X. Yang, and H. Zha, “Unsupervised deep learning for optical flow estimation,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, ser. AAAI’17, 2017.
  • [20] S. Meister, J. Hur, and S. Roth, “UnFlow: Unsupervised learning of optical flow with a bidirectional census loss,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [21] Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang, and W. Xu, “Occlusion aware unsupervised learning of optical flow,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • [22] L. Liu, J. Zhang, R. He, Y. Liu, Y. Wang, Y. Tai, D. Luo, C. Wang, J. Li, and F. Huang, “Learning by analogy: Reliable supervision from transformations for unsupervised optical flow estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [23] R. Jonschkowski, A. Stone, J. T. Barron, A. Gordon, K. Konolige, and A. Angelova, “What matters in unsupervised optical flow,” in Computer Vision – ECCV 2020, 2020.
  • [24] K. Luo, C. Wang, S. Liu, H. Fan, J. Wang, and J. Sun, “UPFlow: Upsampling pyramid for unsupervised optical flow learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021.
  • [25] P. Liu, M. R. Lyu, I. King, and J. Xu, “SelFlow: Self-supervised learning of optical flow,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [26] W. Im, T.-K. Kim, and S.-E. Yoon, “Unsupervised learning of optical flow with deep feature similarity,” in Computer Vision – ECCV 2020, 2020.
  • [27] A. Stone, D. Maurer, A. Ayvaci, A. Angelova, and R. Jonschkowski, “SMURF: Self-teaching multi-frame unsupervised RAFT with full-image warping,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [28] R. Zabih and J. Woodfill, “Non-parametric local transforms for computing visual correspondence,” in Computer Vision — ECCV ’94, 1994.
  • [29] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in European Conf. on Computer Vision (ECCV), Oct. 2012, pp. 611–625.
  • [30] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [31] Y. Hu, R. Song, and Y. Li, “Efficient coarse-to-fine patch match for large displacement optical flow,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [32] A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyramid network,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [33] T.-W. Hui, X. Tang, and C. C. Loy, “LiteFlowNet: A lightweight convolutional neural network for optical flow estimation,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018.
  • [34] M. Zhai, X. Xiang, R. Zhang, N. Lv, and A. El Saddik, “Optical flow estimation using dual self-attention pyramid networks,” IEEE Transactions on Circuits and Systems for Video Technology, 2020.
  • [35] S. Zhao, Y. Sheng, Y. Dong, E. I.-C. Chang, and Y. Xu, “MaskFlownet: Asymmetric feature matching with learnable occlusion mask,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [36] L. Kong, X. Yang, and J. Yang, “OAS-Net: Occlusion aware sampling network for accurate optical flow,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
  • [37] L. Kong and J. Yang, “FDFlowNet: Fast optical flow estimation using a deep lightweight network,” in 2020 IEEE International Conference on Image Processing (ICIP), 2020.
  • [38] L. Kong, C. Shen, and J. Yang, “FastFlowNet: A lightweight network for fast optical flow estimation,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021.
  • [39] S. Jiang, D. Campbell, Y. Lu, H. Li, and R. Hartley, “Learning to estimate hidden motions with global motion aggregation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  • [40] P. Liu, I. King, M. R. Lyu, and J. Xu, “DDFlow: Learning optical flow with unlabeled data distillation,” in Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
  • [41] Z. Ren, W. Luo, J. Yan, W. Liao, X. Yang, A. Yuille, and H. Zha, “STFlow: Self-taught optical flow estimation using pseudo labels,” IEEE Transactions on Image Processing, 2020.
  • [42] H. Wang, R. Fan, and M. Liu, “CoT-AMFlow: Adaptive modulation network with co-teaching strategy for unsupervised optical flow estimation,” in 4th Conference on Robot Learning (CoRL), 2020.
  • [43] P. Liu, M. R. Lyu, I. King, and J. Xu, “Learning by distillation: A self-supervised learning framework for optical flow estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
  • [44] S. Liu, K. Luo, N. Ye, C. Wang, J. Wang, and B. Zeng, “OIFlow: Occlusion-inpainting optical flow estimation by unsupervised learning,” IEEE Transactions on Image Processing, 2021.
  • [45] S. Liu, K. Luo, A. Luo, C. Wang, F. Meng, and B. Zeng, “ASFlow: Unsupervised optical flow learning with adaptive pyramid sampling,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • [46] R. Zhao, R. Xiong, Z. Ding, X. Fan, J. Zhang, and T. Huang, “MRDFlow: Unsupervised optical flow estimation network with multi-scale recurrent decoder,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • [47] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu, “UnOS: Unified unsupervised optical-flow and stereo-depth estimation by watching videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [48] P. Liu, I. King, M. R. Lyu, and J. Xu, “Flow2Stereo: Effective self-supervised learning of optical flow and stereo matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [49] C. Chi, Q. Wang, T. Hao, P. Guo, and X. Yang, “Feature-level collaboration: Joint unsupervised learning of optical flow, stereo depth and camera motion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [50] H. Li, K. Luo, and S. Liu, “GyroFlow: Gyroscope-guided unsupervised optical flow learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  • [51] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015.
  • [52] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “FitNets: Hints for thin deep nets,” in International Conference on Learning Representations, 2015.
  • [53] S. Zagoruyko and N. Komodakis, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” in International Conference on Learning Representations, 2017.
  • [54] D. Sun, A. Yao, A. Zhou, and H. Zhao, “Deeply-supervised knowledge synergy,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [55] Y. Tian, D. Krishnan, and P. Isola, “Contrastive representation distillation,” in International Conference on Learning Representations, 2020.
  • [56] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [57] K. Zhang, C. Zhang, S. Li, D. Zeng, and S. Ge, “Student network learning via evolutionary knowledge distillation,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.
  • [58] A. Yao and D. Sun, “Knowledge transfer via dense cross-layer mutual-distillation,” in Computer Vision – ECCV 2020, 2020.
  • [59] B. Zhao and K. Han, “Novel visual category discovery with dual ranking statistics and mutual knowledge distillation,” in Advances in Neural Information Processing Systems, 2021.
  • [60] R. Wu, M. Feng, W. Guan, D. Wang, H. Lu, and E. Ding, “A mutual learning method for salient object detection with intertwined multi-supervision,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [61] Q. Zhai, X. Li, F. Yang, C. Chen, H. Cheng, and D.-P. Fan, “Mutual graph learning for camouflaged object detection,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  • [62] A. Hao, Y. Min, and X. Chen, “Self-mutual distillation learning for continuous sign language recognition,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  • [63] B. Hu, S. Zhou, Z. Xiong, and F. Wu, “Cross-resolution distillation for efficient 3d medical image registration,” IEEE Transactions on Circuits and Systems for Video Technology, 2022.
  • [64] N. Sundaram, T. Brox, and K. Keutzer, “Dense point trajectories by GPU-accelerated large displacement optical flow,” in Computer Vision – ECCV 2010, 2010.
  • [65] E. Ilg, Ö. Çiçek, S. Galesso, A. Klein, O. Makansi, F. Hutter, and T. Brox, “Uncertainty estimates and multi-hypotheses networks for optical flow,” in Computer Vision – ECCV 2018, 2018.
  • [66] F. Aleotti, M. Poggi, F. Tosi, and S. Mattoccia, “Learning end-to-end scene flow by distilling single tasks knowledge,” in Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
  • [67] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Models matter, so does training: An empirical study of cnns for optical flow estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [68] A. Bar-Haim and L. Wolf, “ScopeFlow: Dynamic scene scoping for optical flow,” in The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [69] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations, 2015.
  • [70] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019.
  • [71] J. Janai, F. Guney, A. Ranjan, M. Black, and A. Geiger, “Unsupervised learning of multi-frame optical flow with occlusions,” in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  • [72] Y. Zhong, P. Ji, J. Wang, Y. Dai, and H. Li, “Unsupervised deep epipolar flow for stationary or dynamic scenes,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [73] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool, “The 2017 DAVIS challenge on video object segmentation,” arXiv:1704.00675, 2017.