
DeepMI: A Mutual Information Based Framework For Unsupervised Deep Learning of Tasks

Ashish Kumar, L. Behera, Senior Member IEEE
Department of Electrical Engineering, Indian Institute of Technology, Kanpur {krashish,lbehera}@iitk.ac.in
Abstract

In this work, we propose an information theory based framework "DeepMI" to train deep neural networks (DNNs) using mutual information (\mathcal{MI}). The DeepMI framework is especially targeted at, but not limited to, the unsupervised learning of real world tasks. The primary motivation behind this work is the limitation of the traditional loss functions for unsupervised learning of a given task. Using \mathcal{MI} directly for training is quite challenging because it is unbounded above. Hence, as a part of the framework, we develop an alternative linearized representation of \mathcal{MI}. The contributions of this paper are three fold: i) an investigation of \mathcal{MI} for training deep neural networks, ii) a novel loss function \mathcal{L}_{LMI}, and iii) a fuzzy logic based end-to-end differentiable pipeline to integrate DeepMI into the deep learning framework. Due to the unavailability of a standard benchmark, we carefully design the experimental analysis and select three different tasks for the experimental study. We demonstrate that \mathcal{L}_{LMI} alone provides better gradients and achieves better neural network performance than the popular loss functions, even in cases where multiple loss functions are used for a given task.


I Introduction

Selection of a suitable loss function is crucial in order to train a neural network for a desired task. For a given neural network architecture [1, 2] and optimization procedure [3], the profile of the loss function largely governs what is learnt by the neural network and how well it generalizes to unseen data. The above statement applies equally to the choice of a similarity metric in the learning process. Deep learning based approaches involve minimizing a similarity metric between the ground truth and the predictions in one way or another, though the way it is performed may vary from task to task. For example, for visual perception tasks such as image classification [4, 2] and image segmentation [5, 6], it is performed using cross entropy, whereas for other tasks such as forecasting [7], image generation [8], and depth estimation [9], it is performed using the \mathcal{L}_{2} and \mathcal{L}_{1} metrics. Learning algorithms based on generative adversaries [8] also largely depend on the choice of similarity metric.

In the area of machine vision, there are certain tasks for which the groundtruth cannot be obtained easily, primarily due to the very high cost of measurement devices or the unavailability of a labelling process. These tasks include depth estimation from images [9] and visual / LiDAR odometry [10]. From the perspective of autonomous vehicles and robotics, the above tasks are undeniably important. For this reason, the development of unsupervised learning techniques for these tasks has recently gained attention [9, 11, 12, 13]. The methods proposed in this direction extensively rely on minimizing similarity metrics between various information sources.

From the above discussion, it is quite evident that the similarity metric plays an important role in the learning process. In general, \mathcal{L}_{1} and \mathcal{L}_{2} are the most preferred choices for this purpose. These losses, despite their popularity, do not provide the desired results in many cases, mainly because they are pointwise operators and do not account for statistical information while matching. For example, in images, an \mathcal{L}_{1} / \mathcal{L}_{2} loss penalizes the neural network on a per-pixel basis, leaving the statistical properties unaccounted for. To address this issue, the Structural Similarity (SSIM) index [14] has recently become popular and is being used as an alternative to \mathcal{L}_{1} and \mathcal{L}_{2}. The SSIM index is computed over a window instead of a pixel and is based on local statistics. In practice, the aforementioned losses are used in conjunction with each other, which increases the number of individual loss functions and leads to increased complexity in tuning the loss weights [15].

Keeping in mind the above observations, in this work we explore the potential of \mathcal{MI} [16, 17, 18] to train deep neural networks for supervised / unsupervised learning of a task. \mathcal{MI} is essentially an information theoretic measure that reasons about the statistical independence of two random variables. An interesting property of \mathcal{MI} is that it operates on probability distributions instead of the data directly. Therefore, \mathcal{MI} does not depend on the signal type, i.e. images or time-series, and proves to be a powerful measure in many areas. For this reason, we consider \mathcal{MI} as a potential alternative measure of similarity. Despite its diverse applications, the expression of \mathcal{MI} is infeasible to use directly for training a neural network (Sec. III). However, the interesting properties of \mathcal{MI} encourage us to dive deep into the problem and lead us to contribute through this paper as follows:

  • Feasibility of \mathcal{MI} formulation for deep learning tasks.

  • A novel \mathcal{MI}-inspired loss function \mathcal{L}_{LMI}.

  • DeepMI: a fuzzy logic based framework to train DNNs using \mathcal{L}_{LMI} or \mathcal{L}_{MI}.

In the next section, we discuss the related work. In Sec. III, we briefly review \mathcal{MI}, and in Sec. IV, we discuss the limitations of the regular \mathcal{MI} expression and develop \mathcal{L}_{LMI} along with the gradient calculations required for back propagation. In Sec. V, we experimentally verify the importance of DeepMI through a number of unsupervised learning tasks. Finally, Sec. VI concludes the paper.

II Related Work

The literature on mutual information is diverse and vast; therefore, we limit our discussion to the most relevant works in this area. Mutual information [18, 17, 16] is a fundamental measure of information theory which provides a sense of independence between random variables. It has been widely used in a variety of applications. The works [19, 20] are typical examples which exploit \mathcal{MI} in order to align medical images. \mathcal{MI} has also been successfully used in speech recognition [21], machine vision, and robotic applications. [22] is a typical example in the area of autonomous vehicles, registering 3D point clouds obtained by LiDARs. Apart from that, \mathcal{MI} has widely been used in independent component analysis [23] and key feature selection [24, 25]. Given its applications across such diverse areas, \mathcal{MI} can be regarded as a pivotal measure.

The works [19, 20, 22] are non-parametric approaches which maximize \mathcal{MI} to achieve the desired purpose. \mathcal{MI} is computed over the distributions of the raw signals or of extracted features. For example, [19] uses image histograms, whereas [22] uses the distributions of 3D points in a voxel. In these techniques, the feature extraction is quite important and is handcrafted. In the past decade, deep neural network architectures [1, 2] have proved to be excellent at learning high quality embeddings / features from the input data in an entirely unsupervised manner, which in turn are used for various tasks [26, 27, 28, 29, 30, 6, 8]. Therefore, we believe that bringing the deep learning framework together with \mathcal{MI} can be extremely useful. However, so far there does not exist any unified standard framework which can be used for this purpose, mainly due to the issues related with \mathcal{MI}. For example, the distributions required for mutual information are not exact; instead, they are only approximations of the true distribution [31]. Also, these approximations are not differentiable, thus making it difficult for \mathcal{MI} to be included in deep learning methods [32]. Since affordable deep learning methods have only recently emerged, the learning process is mostly based on the traditional losses [15, 9, 11, 12, 13]. A very recent work [32] proposes to use \mathcal{MI} with neural networks. However, that work mainly addresses estimating the distributions using neural networks and does not consider the per-sample \mathcal{MI} required for tasks such as [19, 20, 22].

The works [9, 11, 12, 13, 33] in the area of depth estimation and visual odometry using deep neural networks in an unsupervised fashion are typical examples where \mathcal{MI} can be employed. These works only utilize losses such as \mathcal{L}_{1}, \mathcal{L}_{2}, and \mathcal{L}_{SSIM}. We believe that since \mathcal{MI} has successfully been employed in diverse applications, it is worth developing a well defined and benchmarked \mathcal{MI} based framework for deep learning. Based on this motivation, in this paper we explore the feasibility of using \mathcal{MI} for robotics applications. Our intention is not to discard the existing losses, but to bring \mathcal{MI} into deep learning and to establish a baseline that opens the door to new research in this direction.

III Mutual Information (\mathcal{MI})

For any two random variables X and Y, the measure \mathcal{MI} is defined as:

\mathcal{I}(X;Y) = \mathcal{H}(X) + \mathcal{H}(Y) - \mathcal{H}(X,Y), \qquad \mathcal{I}(X;Y) \geq 0 (1)

\mathcal{H}(X) = -\sum_{x\in X} p_{X}^{x}\log(p_{X}^{x}), (2)
\mathcal{H}(Y) = -\sum_{y\in Y} p_{Y}^{y}\log(p_{Y}^{y}),
\mathcal{H}(X,Y) = -\sum_{x\in X}\sum_{y\in Y} p_{XY}^{xy}\log(p_{XY}^{xy})

where \mathcal{H}(X), \mathcal{H}(Y) represent the entropy [18] of X and of Y, whereas \mathcal{H}(X,Y) represents the joint entropy of X, Y when both variables are co-observed. The symbols p_{X}, p_{Y} and p_{XY} represent the marginal of X, the marginal of Y, and the joint probability density function (pdf) of X, Y respectively.

Mutual information is an important quantity in information theory as it provides a measure of statistical independence between two random variables based on their distributions. In other words, \mathcal{MI} quantifies how well one can explain a random variable X after observing another random variable Y, or vice-versa. The expression of \mathcal{MI} in Eq. 1 is defined in terms of entropies. For any random variable X, its entropy quantifies the uncertainty associated with its occurrence.

III-A \mathcal{MI} as a similarity metric

\mathcal{MI} is a convex function and attains its global minimum when the two random variables under consideration are independent. Mathematically, \mathcal{MI} \to 0 when the variables are independent, whereas \mathcal{MI} \to \{\mathcal{H}(X)=\mathcal{H}(Y)\} when both variables are statistically identical. This property of \mathcal{MI} can readily be employed to quantify the similarity between two signals. However, while doing so, the definition of \mathcal{MI} has to be interpreted in a quite different manner.

To better understand this, let us consider the example of image matching, given two images X and Y. In order to measure the similarity between the images using \mathcal{MI}, the image itself cannot be considered as a random variable, because in that case p_{X}, p_{Y} and p_{XY} would be meaningless; in other words, per-sample \mathcal{MI} is not defined. Hence, instead of an image, its pixel values are considered as a random variable over which the relevant distributions can be defined. The pixel values may refer to intensity, color, gradients, etc. In order to compute the similarity score, first the marginal and joint pdfs over the selected variable have to be computed, and the similarity can then be obtained using Eq. 1. While doing so, Eq. 2 needs to be rewritten as given below.

\mathcal{H}(X) = -\sum_{i=1}^{N} p_{X}^{i}\log(p_{X}^{i}), (3)
\mathcal{H}(Y) = -\sum_{i=1}^{N} p_{Y}^{i}\log(p_{Y}^{i}),
\mathcal{H}(X,Y) = -\sum_{i=1}^{N}\sum_{j=1}^{N} p_{XY}^{ij}\log(p_{XY}^{ij})

where N is the number of bins in the pdf.

As another example, consider matching two time series signals using \mathcal{MI}. Following the above discussion, the two signal instances under consideration cannot themselves be treated as random variables; instead, their instantaneous values are considered as the random variable. It must be noticed that the choice of random variable depends on the application.
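
To make the above concrete, the following is a minimal NumPy sketch of per-sample \mathcal{MI} between two 8-bit grayscale images, using N-bin intensity histograms for the pdfs (Eqs. 1-3); the bin count and intensity range are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def mutual_information(img_x, img_y, n_bins=16):
    """Per-sample MI of two 8-bit images via N-bin intensity histograms (Eqs. 1-3)."""
    joint, _, _ = np.histogram2d(img_x.ravel(), img_y.ravel(),
                                 bins=n_bins, range=[[0, 256], [0, 256]])
    p_xy = joint / joint.sum()                       # joint pdf
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)    # marginal pdfs

    def entropy(p):
        p = p[p > 0]                                 # 0 log 0 is taken as 0
        return -np.sum(p * np.log(p))

    return entropy(p_x) + entropy(p_y) - entropy(p_xy)   # Eq. 1
```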

IV DeepMI Framework

To understand the concept of DeepMI, consider the task of image reconstruction using autoencoders. In order to minimize the gap between an input image and the reconstructed image, the \mathcal{MI} between them has to be maximized. The regular \mathcal{MI} expression, however, cannot be used directly for this purpose. This is primarily because \mathcal{MI} attains its global minimum when the two random variables are dissimilar, and our optimal point, \sup \mathcal{MI}, is not well defined since \mathcal{MI} is unbounded above. Although various normalized versions of \mathcal{MI} have been proposed in the literature [34], the previously discussed issues remain. Hence, normalized \mathcal{MI} (\mathcal{NMI}) also cannot serve our purpose.

The above challenges encourage us to develop the linearized mutual information \mathcal{LMI}, which attains its global minimum when the two images are exactly the same. In order to achieve this, we turn towards the working of \mathcal{MI} and make the following important insight.

IV-A A key insight into \mathcal{MI}

Consider two images X and Y, with p_{X}\in\mathbb{R}^{N}, p_{Y}\in\mathbb{R}^{N} and p_{XY}\in\mathbb{R}^{N\times N} as their marginal and joint pdfs respectively. The dimension of p_{X}, p_{Y} is N\times 1, whereas it is N\times N for p_{XY}. From Eqs. 1-2, we can immediately say that \mathcal{MI}\to 0 when the two signals are dissimilar, while \mathcal{MI}\to\{\mathcal{H}(X)=\mathcal{H}(Y)\} when the signals are exactly the same. Hence, for the images X and Y to be identical, a necessary but not sufficient condition is that p_{X} and p_{Y} are the same. To guarantee it, the following has to be satisfied.

p_{X}^{i} = p_{Y}^{i} = p_{XY}^{ii}, \qquad p_{XY}^{ij}\big|_{i\neq j} = 0, \qquad i,j = 1,2,\dots,N (4)

In other words, when X\equiv Y, the off-diagonal elements of p_{XY} are zero while the diagonal elements are non-zero (depending on the distribution) and equal to p_{X} and p_{Y} simultaneously. The above insight leads us to derive an expression for the \mathcal{LMI} function to train deep networks.

IV-B \mathcal{LMI} Derivation

We know that Eqs. 5 and 6 hold for any probability density function; they correspond to a 1D and a 2D probability density function respectively.

\sum_{i=1}^{N} p^{i}_{X} = \sum_{i=1}^{N} p^{i}_{Y} = 1, (5)
\sum_{i=1}^{N}\sum_{j=1}^{N} p^{ij}_{XY} = 1 (6)

Rewriting Eq. 6 as a combination of its diagonal (i=j) and off-diagonal (i\neq j) elements, we get

\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} p_{XY}^{ij} + \sum_{i=1}^{N} p_{XY}^{ii} = 1 (7)

\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} p_{XY}^{ij} + \sum_{i=1}^{N} p_{XY}^{ii} + \sum_{i=1}^{N} p_{XY}^{ii} - \sum_{i=1}^{N} p_{XY}^{ii} = 1 (8)

\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} p_{XY}^{ij} + \sum_{i=1}^{N} p_{XY}^{ii} + \sum_{i=1}^{N} p_{XY}^{ii} - \sum_{i=1}^{N} p_{XY}^{ii} + \sum_{i=1}^{N} p_{X}^{i} - \sum_{i=1}^{N} p_{X}^{i} + \sum_{i=1}^{N} p_{Y}^{i} - \sum_{i=1}^{N} p_{Y}^{i} = 1 (9)

Grouping terms, substituting \sum_{i=1}^{N} p_{X}^{i} = \sum_{i=1}^{N} p_{Y}^{i} = 1 from Eq. 5, and using |a| \geq a, the equality relaxes to

\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} p_{XY}^{ij} + \sum_{i=1}^{N} |p_{XY}^{ii} - p_{X}^{i}| + \sum_{i=1}^{N} |p_{XY}^{ii} - p_{Y}^{i}| - \sum_{i=1}^{N} p_{XY}^{ii} + 1 + 1 \geq 1 (10)

\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} p_{XY}^{ij} + \sum_{i=1}^{N} |p_{XY}^{ii} - p_{X}^{i}| + \sum_{i=1}^{N} |p_{XY}^{ii} - p_{Y}^{i}| - \sum_{i=1}^{N} p_{XY}^{ii} + 1 \geq 0 (11)

Now, referring to Eq. 7, we can write

\sum_{i=1}^{N} p_{XY}^{ii} \leq 1 \;\;\Rightarrow\;\; 1 - \sum_{i=1}^{N} p_{XY}^{ii} \geq 0 (12)

Substituting this into Eq. 11, we get

\mathcal{L}_{LMI} = \frac{1}{3}\Big(\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} p_{XY}^{ij} + \sum_{i=1}^{N} |p_{XY}^{ii} - p_{X}^{i}| + \sum_{i=1}^{N} |p_{XY}^{ii} - p_{Y}^{i}|\Big) \geq 0 (13)

Here, the factor 1/3 is included to ensure \mathcal{L}_{LMI}\leq 1; it is obtained by replacing each of the three terms with its maximum value. Equality with 0 holds iff p_{X}^{i}=p_{Y}^{i}=p_{XY}^{ii} and p_{XY}^{ij}\big|_{i\neq j}=0 \;\forall\; i,j\in\{1,2,\dots,N\}, i.e. the two images match perfectly. Hence, the L.H.S. of Eq. 13 is treated as the objective function, which we call the \mathcal{LMI} function. The "L" stands for "linearized", because the \mathcal{LMI} formulation is linear in the elements of the pdfs, whereas the "\mathcal{MI}" part arises because, at the equality, the regular \mathcal{MI} expression is also maximized. The \mathcal{LMI} formulation is quite interesting because it is essentially a combination of three different losses weighted equally. The \mathcal{L}_{LMI} formulation is quite intuitive and its gradients are straightforward to compute.
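
As a quick illustration, the following is a minimal NumPy sketch of \mathcal{L}_{LMI} (Eq. 13), assuming the marginal and joint pdfs have already been estimated as arrays of shape (N,) and (N, N):

```python
import numpy as np

def lmi_loss(p_x, p_y, p_xy):
    """L_LMI of Eq. 13, computed from the marginal pdfs and their joint pdf."""
    diag = np.diag(p_xy)
    off_diag_mass = p_xy.sum() - diag.sum()      # sum over i != j
    return (off_diag_mass
            + np.abs(diag - p_x).sum()           # diagonal vs. marginal of X
            + np.abs(diag - p_y).sum()) / 3.0    # diagonal vs. marginal of Y
```

For two identical images, the joint mass sits entirely on the diagonal and equals both marginals, so all three terms vanish and the loss is zero.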

IV-C Fuzzy Probability Density Function

The \mathcal{LMI} function utilizes the pdfs p_{X}, p_{Y} and p_{XY}, which are discrete in nature. These are typically obtained by computing an N-bin histogram followed by a normalization step such that ||.||_{1}=1. In the standard procedure for computing a regular histogram, first a bin-id corresponding to an observation of the random variable is computed, and then the count of the respective bin is incremented by unity. The computation of the bin-id involves a ceil or floor operation, which is not differentiable, and the rounding step discards the actual contribution of the observation. Thus, both the incremental and rounding procedures prevent the gradient flow needed during training.

To better understand the above, let h_{X} be an N-bin histogram of the random variate X and x\in X be an observation. In the case of a regular histogram, the bin-id x_{b} corresponding to an observation x is computed as follows:

x_{b} = \frac{\hat{x}}{bin\_res}, \qquad \hat{x} = \frac{x - min_{X}}{max_{X} - min_{X}}, \qquad bin\_res = \frac{max_{X} - min_{X}}{N} (14)

where max_{X}, min_{X} are the maximum and minimum values which the variable X can attain. Typically, for 8-bit images, [min_{X}, max_{X}] = [0, 255]. From Eq. 14, it can be noticed that the value of x_{b} is not necessarily an integer. In this case, x_{b} is rounded to the nearest integer using ceil \equiv \lceil x_{b}\rceil or floor \equiv \lfloor x_{b}\rfloor, depending upon one's convention, and the count of bin x_{b} is then incremented by one. It thus becomes evident that the rounding procedure and the unit incremental procedure do not allow gradient computation w.r.t. the observations. To cope with this, we employ a fuzzification strategy that ensures valid gradients during back-propagation.
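
For reference, below is a minimal NumPy sketch of the regular (hard) histogram described above, assuming an 8-bit signal and interpreting bin\_res on the normalized [0, 1] scale (i.e. 1/N) so that x_{b} falls in [0, N]. The floor() makes the bin assignment piecewise constant, so its gradient w.r.t. the observations is zero almost everywhere:

```python
import numpy as np

def hard_histogram(x, n_bins, min_x=0.0, max_x=255.0):
    """Regular N-bin histogram (Eq. 14 with hard rounding); not differentiable."""
    x_hat = (x - min_x) / (max_x - min_x)                 # normalise to [0, 1]
    x_b = x_hat / (1.0 / n_bins)                          # continuous bin coordinate
    ids = np.clip(np.floor(x_b).astype(int), 0, n_bins - 1)
    h = np.bincount(ids, minlength=n_bins).astype(float)  # unit increments
    return h / h.sum()                                    # ||.||_1 = 1 normalisation
```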

IV-C1 Fuzzification of p_{X}, p_{Y}

In order to fuzzify h_{X}, instead of one bin as in Eq. 14, we compute two bins corresponding to x\in X, i.e. x_{0}=\lfloor x_{b}\rfloor and x_{1}=\lceil x_{b}\rceil. We define a membership function for each of the two bins as follows.

m_{x_{0}} = 1 - (x_{b} - x_{0}), \qquad m_{x_{1}} = x_{b} - x_{0} (15)

where m_{x_{0}}, m_{x_{1}} are the memberships of x_{0} and x_{1}; by construction, m_{x_{0}} + m_{x_{1}} = 1. During the unit incremental step, the counts of the bins x_{0} and x_{1} are incremented by m_{x_{0}} and m_{x_{1}} respectively, instead of by one. With this fuzzification, the gradients of m_{x_{0}} and m_{x_{1}} w.r.t. x_{b} are fully defined (Sec. IV-D). The above steps are followed in order to compute p_{X} and p_{Y}, with the normalization step performed at the end. As a matter of convention, the memberships corresponding to y_{b} for an observation y\in Y are denoted by m_{y_{0}} and m_{y_{1}}.
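
A minimal NumPy sketch of this fuzzy histogram is given below, under the same illustrative binning convention as the hard histogram above; each observation now spreads its unit mass over the two neighbouring bins according to Eq. 15:

```python
import numpy as np

def fuzzy_histogram(x, n_bins, min_x=0.0, max_x=255.0):
    """Fuzzified N-bin histogram (Eqs. 14-15); the bin contents vary smoothly with x."""
    x_hat = (x - min_x) / (max_x - min_x)
    x_b = x_hat / (1.0 / n_bins)                          # continuous bin coordinate
    x0 = np.clip(np.floor(x_b).astype(int), 0, n_bins - 1)
    x1 = np.clip(x0 + 1, 0, n_bins - 1)                   # x1 = ceil(x_b), clamped
    m_x1 = x_b - np.floor(x_b)                            # Eq. 15
    m_x0 = 1.0 - m_x1
    h = np.zeros(n_bins)
    np.add.at(h, x0, m_x0)                                # soft "unit increment"
    np.add.at(h, x1, m_x1)
    return h / h.sum()
```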

IV-C2 Fuzzification of p_{XY}

The fuzzification of the joint pdf p_{XY} is simply an extension of the previous steps. For a regular 2D histogram, the unit incremental procedure is applied to the bin location defined by the coordinate (x_{b}, y_{b}) (Eq. 14), which in this case as well need not be exactly integral. Therefore, to ensure valid gradient flow, four memberships are defined, corresponding to the four coordinates top-left, top-right, bottom-left, and bottom-right w.r.t. (x_{b}, y_{b}). Mathematically, these four coordinates are (x_{0}, y_{0}), (x_{1}, y_{0}), (x_{0}, y_{1}), (x_{1}, y_{1}), and their respective memberships can be written as:

m_{x_{0}y_{0}} = m_{x_{0}}m_{y_{0}}, \qquad m_{x_{1}y_{0}} = m_{x_{1}}m_{y_{0}}, (16)
m_{x_{0}y_{1}} = m_{x_{0}}m_{y_{1}}, \qquad m_{x_{1}y_{1}} = m_{x_{1}}m_{y_{1}}

During the unit incremental step, the counts of the four bins mentioned above are incremented by their respective membership values.
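
The same idea extends to the joint pdf; a minimal NumPy sketch under the same illustrative conventions:

```python
import numpy as np

def fuzzy_joint_histogram(x, y, n_bins, min_v=0.0, max_v=255.0):
    """Fuzzified N x N joint histogram; each (x, y) pair spreads over four bins (Eq. 16)."""
    def soft_bins(v):
        v_b = (v - min_v) / (max_v - min_v) / (1.0 / n_bins)
        v0 = np.clip(np.floor(v_b).astype(int), 0, n_bins - 1)
        v1 = np.clip(v0 + 1, 0, n_bins - 1)
        m1 = v_b - np.floor(v_b)
        return v0, v1, 1.0 - m1, m1                       # bins and memberships (Eq. 15)

    x0, x1, mx0, mx1 = soft_bins(np.asarray(x, dtype=float))
    y0, y1, my0, my1 = soft_bins(np.asarray(y, dtype=float))
    h = np.zeros((n_bins, n_bins))
    np.add.at(h, (x0, y0), mx0 * my0)                     # the four memberships of Eq. 16
    np.add.at(h, (x0, y1), mx0 * my1)
    np.add.at(h, (x1, y0), mx1 * my0)
    np.add.at(h, (x1, y1), mx1 * my1)
    return h / h.sum()
```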

IV-D Back-propagation through DeepMI framework

From Eq. 16, it can be inferred that the gradient of \mathcal{L}_{LMI} w.r.t. an observation x depends on p_{X}^{x_{0}}, p_{X}^{x_{1}}, p_{XY}^{x_{0}y_{0}}, p_{XY}^{x_{0}y_{1}}, p_{XY}^{x_{1}y_{0}}, p_{XY}^{x_{1}y_{1}}. Therefore, we can write:

\frac{\partial\mathcal{L}}{\partial x} = \sum_{i=0}^{1}\frac{\partial\mathcal{L}}{\partial p_{X}^{x_{i}}}\,\frac{\partial p_{X}^{x_{i}}}{\partial x} + \sum_{i=0}^{1}\sum_{j=0}^{1}\frac{\partial\mathcal{L}}{\partial p_{XY}^{x_{i}y_{j}}}\,\frac{\partial p_{XY}^{x_{i}y_{j}}}{\partial x} (17)

Using the chain rule:

\frac{\partial p_{X}^{x_{i}}}{\partial x} = \frac{\partial p_{X}^{x_{i}}}{\partial\hat{p}_{X}^{x_{i}}}\times\frac{\partial\hat{p}_{X}^{x_{i}}}{\partial m_{x_{i}}}\times\frac{\partial m_{x_{i}}}{\partial x_{b}}\times\frac{\partial x_{b}}{\partial\hat{x}}\times\frac{\partial\hat{x}}{\partial x}, \qquad p_{X}^{x_{i}} = \frac{\hat{p}_{X}^{x_{i}}}{\sum_{k=1}^{N}\hat{p}_{X}^{k}} (18)

and similarly,

\frac{\partial p_{XY}^{x_{i}y_{j}}}{\partial x} = \frac{\partial p_{XY}^{x_{i}y_{j}}}{\partial\hat{p}_{XY}^{x_{i}y_{j}}}\times\frac{\partial\hat{p}_{XY}^{x_{i}y_{j}}}{\partial m_{x_{i}y_{j}}}\times\frac{\partial m_{x_{i}y_{j}}}{\partial m_{x_{i}}}\times\frac{\partial m_{x_{i}}}{\partial x_{b}}\times\frac{\partial x_{b}}{\partial\hat{x}}\times\frac{\partial\hat{x}}{\partial x}, \qquad p_{XY}^{x_{i}y_{j}} = \frac{\hat{p}_{XY}^{x_{i}y_{j}}}{\sum_{k=1}^{N}\sum_{l=1}^{N}\hat{p}_{XY}^{kl}} (19)

Each of the above partial derivatives can be easily computed from Eqs. 13-16 via the chain rule, where \hat{p} denotes the unnormalized fuzzy histogram counts. It must be noticed that x_{0}, x_{1} do not occur in the gradient calculations, whereas they would appear in the case of regular histograms, leading to undefined derivatives. Gradients for an observation y can be computed similarly.
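
If the fuzzy pdfs are built from differentiable tensor operations, an automatic differentiation framework reproduces the chain of Eqs. 17-19 without hand-written gradients. Below is a minimal PyTorch-style sketch of the fuzzy marginal, given here only as an assumption-laden illustration (the paper's own implementation is a layer-wise C++ architecture); detaching the bin indices mirrors the observation that x_{0}, x_{1} do not enter the gradients:

```python
import torch

def fuzzy_marginal(v, n_bins, min_v=0.0, max_v=255.0):
    """Differentiable fuzzy 1D pdf; only the memberships (Eq. 15) carry gradient."""
    v_b = (v - min_v) / (max_v - min_v) * n_bins          # continuous bin coordinate
    v0 = v_b.detach().floor().clamp(0, n_bins - 1).long()
    v1 = (v0 + 1).clamp(0, n_bins - 1)
    m1 = v_b - v_b.detach().floor()                       # d m1 / d v_b = 1
    m0 = 1.0 - m1
    h = torch.zeros(n_bins, dtype=v.dtype)
    h = h.scatter_add(0, v0, m0).scatter_add(0, v1, m1)   # soft increments
    return h / h.sum()

# toy check: gradients flow back to the predicted signal y
y = (255.0 * torch.rand(1000)).requires_grad_()
target = torch.full((11,), 1.0 / 11)                      # a hypothetical target pdf
loss = (fuzzy_marginal(y, 11) - target).abs().sum()
loss.backward()                                           # y.grad is now well defined
```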

IV-E \mathcal{L}_{LMI} Implementation

The formulation of \mathcal{L}_{LMI} is quite intuitive. There is, however, a consideration which must be accounted for while using it for signal matching, especially in scenarios where one of the signals is the groundtruth and the other is its estimate produced by a neural network.

To understand this, consider two images X and Y, where X is the image to be reconstructed and Y is the reconstruction, i.e. the output of a neural network. The marginals p_{X}, p_{Y} and the joint p_{XY} of X and Y are computed using the fuzzification procedure described previously. Since these distributions are only approximations of the underlying distribution, undersampling or oversampling of the underlying distribution is possible, and in such scenarios p_{X}, p_{Y} and p_{XY} do not have well defined mathematical expressions. Moreover, both p_{X} and p_{Y} can be obtained from p_{XY}; therefore, during backpropagation, only the gradients w.r.t. p_{XY}^{ij} are backpropagated. Mathematically, gradients w.r.t. p_{Y} should also be backpropagated because the calculation of p_{Y} depends on Y. However, doing so disturbs the training process and degrades the neural network performance. Based on our experimental study, we attribute this observation to the distribution approximation procedure.

IV-F Hyperparameters

The DeepMI framework has N as its only hyperparameter. The number of bins N mainly determines how precisely \mathcal{L}_{LMI} penalizes the network while matching. For example, consider an 8-bit image with dynamic range [0, 255]. If we set N = 255, the network is penalized very strongly while matching, whereas if N is set to a small number, the network is forced to focus only on the important details. This property can be quite useful in cases where two images from different cameras, with different brightness levels, need to be matched.
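
As a small usage illustration of the effect of N, the snippet below reuses the fuzzy_joint_histogram and lmi_loss sketches from Sec. IV and compares a hypothetical 8-bit image against a slightly brightened copy; qualitatively, a coarser binning is more forgiving of the brightness shift (smaller \mathcal{L}_{LMI}), while a finer binning penalizes it more strongly:

```python
import numpy as np

rng = np.random.default_rng(0)
img_x = rng.integers(0, 256, size=(192, 640)).astype(float)   # hypothetical image
img_y = np.clip(img_x + 10.0, 0, 255)                          # brightness-shifted copy

for n_bins in (3, 11, 25):
    p_xy = fuzzy_joint_histogram(img_x.ravel(), img_y.ravel(), n_bins)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)   # marginals of the fuzzy joint
    print(n_bins, lmi_loss(p_x, p_y, p_xy))         # coarser bins give a smaller loss
```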

V Experiments

In this section, we benchmark the effectiveness and applicability of the DeepMI framework. Due to the unavailability of a standard evaluation procedure, we define three tasks on which a deep neural network is trained in an unsupervised manner. The three tasks vary in difficulty from baseline, to moderate, to extremely difficult from the learning perspective. As an experimental study, the training of each task is performed using the loss functions \mathcal{L}_{1}, \mathcal{L}_{2}, \mathcal{L}_{SSIM} along with \mathcal{L}_{LMI}. For \mathcal{L}_{SSIM}, we use a 3\times 3 block filter following [9]. All the performance metrics are provided in Table I and Table II for quantitative analysis, with the best scores marked with an asterisk. For training, we use base_lr = 0.001, lr_policy = poly, and the ADAM optimizer with \beta_{1} = 0.9 and \beta_{2} = 0.99 unless otherwise stated. The encoder-decoders [35] have a filter size of 3\times 3 and the number of filters equals 16, 32, 64, 128, 256 for the five stages of each task. The whole framework has been implemented as a layer-wise architecture in C++ and the code will be available at the link provided in the beginning.

V-A Unsupervised Bar Alignment (Exp1)

This experiment consists of two binary images where the second image is a spatially transformed version of the first, i.e. a 2D rigid body transform is defined between the two images. Both images consist of a black background and a white rectangular bar. The bar, sized 50\times 125 in a 192\times 640 image, has two degrees-of-freedom (DoF): tx and \theta, which correspond to horizontal motion and in-plane rotation. The goal of the experiment is to learn the 2-DoF parameters in an unsupervised manner. For training, we generate a dataset of 1500 images with a 1000+500 train-test split. The dataset is generated by transforming the bar in the image with randomly generated tx\in[-100, 100] pixels and \theta\in[-40, 40] degrees. The training for this experiment is performed using SGD with Nesterov momentum = 0.95 for 5 epochs.

The unsupervised training pipeline for the task is depicted in Fig. 1. During training, the neural network takes two images, source (I_{s}) and target (I_{t}), as input and predicts tx, \theta. The image I_{s} is then warped using the fully differentiable spatial transformer network (STN) [36] to obtain \hat{I}_{s}, and the neural network is penalized through back-propagation to force \hat{I}_{s}\to I_{t}. In the testing phase, the neural network predicts tx, \theta on the test data, and the Mean-Absolute-Error (MAE) is reported between the predictions and the groundtruth tx, \theta stored during the data generation process. From Table I, under the column Exp1, it can be noticed that the network trained using the \mathcal{L}_{LMI} formulation performs better than the networks trained with the rest of the loss functions. Fig. 2 shows a few qualitative results of this experiment. In this task, almost all of the losses perform equally well, which can be verified visually as well as from the quantitative results in Table I.

Figure 1: Unsupervised learning framework for Exp1. The blue box is a convolutional block and the orange box is a concatenation block; STN is the spatial transformer network [36], and \mathcal{L} is a loss function.
Figure 2: Qualitative results of Exp1 (Input, Target, \mathcal{L}_{1}, \mathcal{L}_{2}, \mathcal{L}_{SSIM}, \mathcal{L}_{LMI}).
Figure 3: Qualitative results of Exp2 (Input, Target, GT Mask, \mathcal{L}_{1}, \mathcal{L}_{2}, \mathcal{L}_{SSIM}, \mathcal{L}_{LMI}).
Figure 4: Unsupervised learning framework for Exp2.

V-B Unsupervised Mask Prediction (Exp2)

This experiment consists of two ordinary grayscale images: source (I_{s}) and target (I_{t}), where I_{t} is obtained from I_{s} by cropping out a significantly large rectangular region and filling it with zeros. The end goal of the experiment is to predict a mask M_{s} which depicts the similar and dissimilar regions between I_{s} and I_{t}. To achieve this, a neural network is trained in an unsupervised manner. For the learning process, we generate a dataset of 1500 images with a 1000+500 train-test split. The dataset is generated by randomly cropping a rectangular region of size 40\times 200 from I_{s} and filling the cropped region with zeros; the obtained image is referred to as I_{t}.

Fig. 4 shows the unsupervised training framework for this task. We use the UNet [35] architecture. The network is trained to predict M_{s} such that I_{s}*M_{s}\to I_{t}. This experiment is essentially a binary segmentation task; therefore, we adopt the intersection-over-union (IoU) metric, which is widely used to quantify segmentation performance. In order to best evaluate the different training losses, we report IoU scores obtained by thresholding the predicted mask at different confidence levels. This is done to examine the network's capability to push the feature embeddings of similar and dissimilar regions significantly apart; in other words, a perfectly trained network will exhibit the same IoU scores at all threshold levels. From Table I, under Exp2, it is clear that the proposed \mathcal{L}_{LMI} formulation shows consistent and better performance over the other loss functions across the various thresholding levels.
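
For reference, a minimal NumPy sketch of the thresholded IoU used here, assuming the predicted mask holds per-pixel confidences in [0, 1] and the groundtruth mask is binary:

```python
import numpy as np

def iou_at_threshold(pred_mask, gt_mask, thr):
    """IoU between a thresholded confidence mask and a binary groundtruth mask."""
    pred = pred_mask >= thr
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0
```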

Fig. 3 shows a few qualitative results corresponding to this experiment. The results visualized are thresholded at 0.10 (IoU.10). It can be seen that \mathcal{L}_{1} and \mathcal{L}_{SSIM} perform worst, whereas \mathcal{L}_{2} and \mathcal{L}_{LMI} perform roughly on par. The visual observation can also be verified quantitatively from Table I, under Exp2. In the quantitative results, \mathcal{L}_{1} and \mathcal{L}_{SSIM} perform worst, in increasing order, as verified visually. On the other hand, \mathcal{L}_{2} and \mathcal{L}_{LMI} show only a marginal difference, which can also be verified visually by zooming into the results and examining the boundary of the white region.

V-C Unsupervised Depth Estimation (Exp3)

Depth estimation has been a long-standing task for the computer vision community, with worldwide industrial importance because depth perception is a must for autonomous robotics and vehicles. Owing to advances in supervised learning, researchers have developed several methods to predict depth using neural networks. However, obtaining accurate groundtruth for supervised learning of this task is extremely challenging because it requires very expensive measurement instruments such as LiDARs. Hence, developing unsupervised learning frameworks for this task has gained considerable attention in recent years. The existing methods make use of several loss functions in order to learn the depth effectively.

In this experiment, we demonstrate the learning of a neural network for the task of depth estimation from stereo images in an unsupervised manner. We take the seminal work for this task [9] as our basis. Our aim is not to show improvements on the datasets; instead, we emphasize how easily the DeepMI framework can be integrated into such real world applications. Hence, instead of a bigger dataset, we use 87 grayscale rectified stereo images from KITTI [37] sequence 113. We select this sequence because it is quite a difficult sequence from the perspective of this task.

The framework used to carry out the experiment is shown in Fig. 6. It must be noticed that instead of the three different losses as in [9], we only use one loss for the evaluation. This is done in order to demonstrate that \mathcal{L}_{LMI} provides stronger gradients and alone can lead to improved results. We report the MAE between the predicted depth and the groundtruth measurements. From Table I, under Exp3, one can notice that the neural network trained using \mathcal{L}_{LMI} outperforms the other variants by a large margin. For the case of \mathcal{L}_{1}, \mathcal{L}_{2}, the base_lr is lowered to 0.00001 to prevent gradient explosion.

Table I: Quantitative analysis (* marks the best score in each column)

| \mathcal{L} | Exp1 MAEtx | Exp1 MAEθ | Exp2 IoU.05 | IoU.10 | IoU.20 | IoU.40 | IoU.50 | Exp3 MAE (mm) |
| \mathcal{L}_{1} | 1.52 | 3.44 | .48 | .61 | .65 | .68 | .68 | 105.46 |
| \mathcal{L}_{2} | 1.93 | 3.65 | .66 | .68 | .68 | .68 | .68 | 39.78 |
| \mathcal{L}_{SSIM} | 1.25 | 3.95 | .50 | .52 | .68 | .69 | .81* | .46 |
| \mathcal{L}_{SSIM}+\mathcal{L}_{2} | - | - | - | - | - | - | - | .64 |
| \mathcal{L}_{LMI} | 1.04* | 3.18* | .70* | .70* | .71* | .80* | .81* | .27* |

Fig. 5 shows a few qualitative results for this experiment. From the figure, it can be noticed that both \mathcal{L}_{1} and \mathcal{L}_{2} perform poorly, whereas \mathcal{L}_{SSIM} performs considerably better than them. This indicates the reason behind the recent adoption of \mathcal{L}_{SSIM} over \mathcal{L}_{1} and \mathcal{L}_{2} for image matching purposes.

Figure 5: Qualitative results of Exp3 (Input, Target, \mathcal{L}_{1}, \mathcal{L}_{2}, \mathcal{L}_{SSIM}, \mathcal{L}_{LMI}). The brighter the depth map, the closer the object.

Figure 6: Unsupervised learning framework for Exp3. Warp refers to the differentiable warping of [9].

Further, it can be noticed that, visually, the results of \mathcal{L}_{LMI} are the most pleasing as well as the most consistent among all. Comparing \mathcal{L}_{SSIM} and \mathcal{L}_{LMI}, we can see that the depth estimates of the former contain texture copy [9] artifacts, whereas these are absent in the latter. This can also be verified by taking a closer look at the car at the bottom right of the image: for \mathcal{L}_{SSIM}, the depth map of the car has holes near the edges and contains severe texture copy artifacts near the number plate and other body areas of the car. These artifacts, on the other hand, are not present in the case of \mathcal{L}_{LMI}. This shows the clear effectiveness of the \mathcal{L}_{LMI} loss and the DeepMI framework.

V-D Effect of the number of bins N

Table II shows the effect of the hyperparameter N on each of the three tasks. For Exp1, it can be noticed that the performance is comparable for all four values of N; this is expected because the images in this task are binary. For Exp2, there is a considerable drop in performance for N = 3 on IoU.05, because the images in this case have multiple grayscale levels whose details cannot be captured efficiently with so few bins. The same is the case for Exp3 when N = 3. Overall, we can see that \mathcal{L}_{LMI} has significant advantages over the other loss functions. Through our experiments, we find it sufficient to keep N\leq 25 for a signal with dynamic range in [0, 255].

V-E A unified discussion on the experiments

From Table I, it can be noticed that the information theory based measure (\mathcal{L}_{LMI}) is much more consistent compared to the other losses. Also, for Exp3, \mathcal{L}_{LMI} proves to be better than the case when \mathcal{L}_{SSIM} and \mathcal{L}_{2} are used together. It is also noticeable that \mathcal{L}_{LMI} shows consistent and the best scores amongst all variants of the losses. In these experiments, our intention has not been to dismiss the existing losses, but rather to mark the potential of information theory based methods in deep learning for real world applications.

Table II: Effect of bin size N on \mathcal{L}_{LMI}

| N | Exp1 MAEtx | Exp1 MAEθ | Exp2 IoU.05 | IoU.10 | IoU.20 | IoU.40 | IoU.50 | Exp3 MAE (mm) |
| 3 | 1.04 | 3.18 | .41 | .51 | .52 | .50 | .81 | .41 |
| 11 | 0.89 | 2.69 | .50 | .59 | .61 | .66 | .81 | .27 |
| 15 | 0.78 | 2.89 | .66 | .67 | .68 | .69 | .81 | .28 |
| 25 | 0.99 | 4.36 | .47 | .48 | .58 | .79 | .79 | .29 |

VI Conclusion

In this paper, we proposed an end-to-end differentiable framework, "DeepMI", and a novel similarity metric \mathcal{LMI} to train deep neural networks. The metric is inspired by mutual information (\mathcal{MI}) and copes with the difficulty of training a deep neural network directly with the \mathcal{MI} expression. The metric is based on probability density functions, which makes it signal agnostic. These density functions are discrete in nature for real world signals (images, time series) and cannot support backpropagation; therefore, a fuzzification strategy that supports smooth backward gradient flow through the density functions is also developed. We show that neural networks trained using the \mathcal{L}_{LMI} metric outperform their counterparts trained using \mathcal{L}_{1}, \mathcal{L}_{2}, \mathcal{L}_{SSIM}. Additionally, we show that it can be easily integrated into real world applications.

The DeepMI framework can be thought of as an effort to bring deep learning and mutual information together for real world applications. We believe that, through this work, learning based methods in several areas such as autonomous vehicles, robotic vision, and speech / audio can benefit greatly in terms of performance. The experimental study in this work can be used as a basis for developing extensions to the DeepMI framework in order to further improve the overall algorithmic performance. We also believe that the inclusion of DeepMI into deep learning frameworks will open the door to new applications.

References

  • [1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [3] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, IEEE, 2009.
  • [5] A. Kumar and L. Behera, “Semi supervised deep quick instance detection and segmentation,” in 2019 International Conference on Robotics and Automation (ICRA), pp. 8325–8331, IEEE, 2019.
  • [6] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [7] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.
  • [9] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279, 2017.
  • [10] A. Kumar, J. R. McBride, and G. Pandey, “Real time incremental foveal texture mapping for autonomous vehicles,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3233–3240, IEEE, 2018.
  • [11] H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal, and I. Reid, “Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 340–349, 2018.
  • [12] R. Li, S. Wang, Z. Long, and D. Gu, “Undeepvo: Monocular visual odometry through unsupervised deep learning,” in 2018 IEEE international conference on robotics and automation (ICRA), pp. 7286–7291, IEEE, 2018.
  • [13] Y. Almalioglu, M. R. U. Saputra, P. P. de Gusmao, A. Markham, and N. Trigoni, “Ganvo: Unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks,” in 2019 International Conference on Robotics and Automation (ICRA), pp. 5474–5480, IEEE, 2019.
  • [14] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [15] J. Janai, F. Guney, A. Ranjan, M. Black, and A. Geiger, “Unsupervised learning of multi-frame optical flow with occlusions,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 690–706, 2018.
  • [16] S. Kullback, Information theory and statistics. Courier Corporation, 1997.
  • [17] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012.
  • [18] C. E. Shannon, “A mathematical theory of communication,” Bell system technical journal, vol. 27, no. 3, pp. 379–423, 1948.
  • [19] P. Viola and W. M. Wells III, “Alignment by maximization of mutual information,” International journal of computer vision, vol. 24, no. 2, pp. 137–154, 1997.
  • [20] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, “Multimodality image registration by maximization of mutual information,” IEEE transactions on Medical Imaging, vol. 16, no. 2, pp. 187–198, 1997.
  • [21] L. Bahl, P. Brown, P. De Souza, and R. Mercer, “Maximum mutual information estimation of hidden markov model parameters for speech recognition,” in ICASSP’86. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 11, pp. 49–52, IEEE, 1986.
  • [22] G. Pandey, J. R. McBride, S. Savarese, and R. M. Eustice, “Toward mutual information based automatic registration of 3d point clouds,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2698–2704, IEEE, 2012.
  • [23] A. Hyvärinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural networks, vol. 13, no. 4-5, pp. 411–430, 2000.
  • [24] N. Kwak and C.-H. Choi, “Input feature selection by mutual information based on parzen window,” IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 12, pp. 1667–1671, 2002.
  • [25] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on pattern analysis and machine intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
  • [26] C. Zhu, Y. Zheng, K. Luu, and M. Savvides, “Cms-rcnn: contextual multi-scale region-based cnn for unconstrained face detection,” in Deep Learning for Biometrics, pp. 57–79, Springer, 2017.
  • [27] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 1440–1448, 2015.
  • [28] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, pp. 91–99, 2015.
  • [29] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” CVPR, 2017.
  • [30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision, pp. 21–37, Springer, 2016.
  • [31] G. A. Darbellay and I. Vajda, “Estimation of the information by an adaptive partitioning of the observation space,” IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1315–1321, 1999.
  • [32] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm, “Mine: mutual information neural estimation,” arXiv preprint arXiv:1801.04062, 2018.
  • [33] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and I. Reid, “Unsupervised scale-consistent depth and ego-motion learning from monocular video,” in Advances in Neural Information Processing Systems, pp. 35–45, 2019.
  • [34] J. P. Pluim, J. A. Maintz, and M. A. Viergever, “Image registration by maximization of combined mutual information and gradient information,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 452–461, Springer, 2000.
  • [35] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, pp. 234–241, Springer, 2015.
  • [36] M. Jaderberg, K. Simonyan, A. Zisserman, et al., “Spatial transformer networks,” in Advances in neural information processing systems, pp. 2017–2025, 2015.
  • [37] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.