
DeepMI: A Mutual Information Based Framework For Unsupervised Deep Learning of Tasks

Ashish Kumar, L. Behera, Senior Member IEEE
Department of Electrical Engineering, Indian Institute of Technology, Kanpur {krashish,lbehera}@iitk.ac.in
Abstract

In this work, we propose an information theory based framework "DeepMI" to train deep neural networks (DNNs) using mutual information (\mathcal{MI}). The DeepMI framework is especially targeted at, but not limited to, the unsupervised learning of real world tasks. The primary motivation behind this work is the limitation of the traditional loss functions for unsupervised learning of a given task. Using \mathcal{MI} directly for training is quite challenging because it is unbounded above. Hence, as a part of the framework, we develop an alternative linearized representation of \mathcal{MI}. The contributions of this paper are three fold: i) an investigation of \mathcal{MI} for training deep neural networks, ii) a novel loss function \mathcal{L}_{LMI}, and iii) a fuzzy logic based end-to-end differentiable pipeline to integrate DeepMI into the deep learning framework. Due to the unavailability of a standard benchmark, we carefully design the experimental analysis and select three different tasks for the experimental study. We demonstrate that \mathcal{L}_{LMI} alone provides better gradients and achieves better neural network performance than the popular loss functions, even in cases where multiple loss functions are used for a given task.


I Introduction

Selection of a suitable loss function is crucial in order to train a neural network for a desired task. For a given neural network architecture [1, 2] and optimization procedure [3], the profile of the loss function largely governs what is learnt by the neural network and how well it generalizes to unseen data. The above statement applies equally to the choice of a similarity metric in the learning process. Deep learning based approaches involve minimizing a similarity metric between the ground truth and the predictions in one way or another, though the way it is performed may vary from task to task. For example, for visual perception tasks such as image classification [4, 2] and image segmentation [5, 6], it is performed using cross entropy, whereas for other tasks such as forecasting [7], image generation [8], and depth estimation [9], it is performed using the \mathcal{L}_{2} and \mathcal{L}_{1} metrics. Learning algorithms based on generative adversaries [8] also largely depend on the choice of similarity metric.

In the area of machine vision, there are certain tasks for which the groundtruth cannot be obtained easily, primarily due to the very high cost of measurement devices or the unavailability of a labelling process. These tasks include depth estimation from images [9] and visual / LiDAR odometry [10]. From the perspective of autonomous vehicles and robotics, the above tasks are undeniably important. For this reason, the development of unsupervised learning techniques for these tasks has recently gained attention [9, 11, 12, 13]. The methods proposed in this direction extensively rely on minimizing similarity metrics between various information sources.

From the above discussion, it is quite evident that the similarity metric plays an important role in the learning process. In general, \mathcal{L}_{1} and \mathcal{L}_{2} are the most preferred choices for this purpose. These losses, despite their popularity, do not provide the desired results in many cases, mainly because they are pointwise operators and do not account for statistical information while matching. For example, in images, an \mathcal{L}_{1} / \mathcal{L}_{2} loss penalizes the neural network on a per-pixel basis, leaving the statistical properties unaccounted for. To address this issue, the Structural Similarity (SSIM) index [14] has recently become popular and is being used as an alternative to \mathcal{L}_{1} and \mathcal{L}_{2}. The SSIM index is computed over a window instead of a pixel and is based on local statistics. In practice, the aforementioned losses are used in conjunction with each other, which increases the number of individual loss functions and leads to increased complexity in tuning the loss weights [15].

Keeping in mind the above observations, in this work we explore the potential of \mathcal{MI} [16, 17, 18] to train deep neural networks for supervised / unsupervised learning of a task. \mathcal{MI} is essentially an information theoretic measure that reasons about the statistical independence of two random variables. An interesting property of \mathcal{MI} is that it operates on probability distributions instead of the data directly. Therefore, \mathcal{MI} does not depend on the signal type, i.e. images or time-series, and proves to be a powerful measure in many areas. For this reason, we consider \mathcal{MI} as a potential alternative measure of similarity. Despite its diverse applications, the expression of \mathcal{MI} is infeasible to use directly for training a neural network (Sec. III). However, the interesting properties of \mathcal{MI} encourage us to dive deep into the problem and lead us to contribute through this paper as follows:

  • Feasibility of \mathcal{MI} formulation for deep learning tasks.

  • A novel \mathcal{MI}-inspired loss function \mathcal{L}_{LMI}.

  • DeepMI: a fuzzy logic based framework to train DNNs using \mathcal{L}_{LMI} or \mathcal{L}_{MI}.

In the next section, we discuss the related work. In Sec. III, we briefly review \mathcal{MI}, and in Sec. IV, we discuss the limitations of the regular \mathcal{MI} expression and develop \mathcal{L}_{LMI} along with the gradient calculations required for back propagation. In Sec. V, we experimentally verify the importance of DeepMI through a number of unsupervised learning tasks. Finally, Sec. VI concludes the paper.

II Related Work

The literature on mutual information is diverse and vast; therefore, we limit our discussion to the most relevant works in this area. Mutual information [18, 17, 16] is a fundamental measure of information theory which provides a sense of independence between random variables. It has been widely used in a variety of applications. The works [19, 20] are typical examples which exploit \mathcal{MI} in order to align medical images. \mathcal{MI} has also been successfully used in speech recognition [21], machine vision, and robotic applications. [22] is a typical example in the area of autonomous vehicles, registering 3D point clouds obtained by LiDARs. Apart from that, \mathcal{MI} has widely been used in independent component analysis [23] and key feature selection [24, 25]. Given its applications across such diverse areas, \mathcal{MI} can be regarded as a pivotal measure.

The works [19, 20, 22] are non-parametric approaches which maximize \mathcal{MI} to achieve the desired purpose. \mathcal{MI} is computed over the distributions of the raw signals or of extracted features. For example, [19] uses image histograms, whereas [22] uses the distributions of 3D points in a voxel. In these techniques, the feature extraction is quite important and is handcrafted. In the past decade, deep neural network architectures [1, 2] have proved to be excellent at learning high quality embeddings / features from the input data in an entirely unsupervised manner, which in turn are used for various tasks [26, 27, 28, 29, 30, 6, 8]. Therefore, we believe that bringing the deep learning framework together with \mathcal{MI} can be extremely useful. However, so far there does not exist any unified standard framework which can be used for this purpose, mainly due to the issues related with \mathcal{MI}. For example, the distributions required for mutual information are not exact; instead, they are only approximations of the true distribution [31]. Also, these approximations are not differentiable, thus making it difficult for \mathcal{MI} to be included in deep learning methods [32]. Since affordable deep learning methods have only recently emerged, the learning process is mostly based on the traditional losses [15, 9, 11, 12, 13]. A very recent work [32] proposes to use \mathcal{MI} with neural networks. However, that work mainly addresses estimating the distributions using neural networks and does not consider the per-sample \mathcal{MI} required for tasks such as [19, 20, 22].

The works [9, 11, 12, 13, 33] in the area of depth estimation and visual odometry using deep neural networks in an unsupervised fashion are typical examples where \mathcal{MI} can be employed. These works only utilize losses such as \mathcal{L}_{1}, \mathcal{L}_{2}, and \mathcal{L}_{SSIM}. We believe that since \mathcal{MI} has successfully been employed in diverse applications, it is worth developing a well defined and benchmarked \mathcal{MI} based framework for deep learning. Based on this motivation, in this paper we explore the feasibility of using \mathcal{MI} for robotics applications. Our intention is not to discard the existing losses, but to bring \mathcal{MI} into deep learning and to establish a baseline that opens the door to new research in this direction.

III Mutual Information (\mathcal{MI})

For any two random variables X and Y, the measure \mathcal{MI} is defined as:

\mathcal{I}(X;Y) = \mathcal{H}(X) + \mathcal{H}(Y) - \mathcal{H}(X,Y), \qquad \mathcal{I}(X;Y) \geq 0 (1)

\mathcal{H}(X) = -\sum_{x\in X} p_{X}^{x}\log(p_{X}^{x}), (2)
\mathcal{H}(Y) = -\sum_{y\in Y} p_{Y}^{y}\log(p_{Y}^{y}),
\mathcal{H}(X,Y) = -\sum_{x\in X}\sum_{y\in Y} p_{XY}^{xy}\log(p_{XY}^{xy})

where \mathcal{H}(X), \mathcal{H}(Y) represent the entropy [18] of X and of Y, whereas \mathcal{H}(X,Y) represents the joint entropy of X, Y when both variables are co-observed. The symbols p_{X}, p_{Y} and p_{XY} represent the marginal of X, the marginal of Y, and the joint probability density function (pdf) of X, Y respectively.

Mutual information is an important quantity in information theory as it provides a measure of statistical independence between two random variables based on their distributions. In other words, \mathcal{MI} quantifies how well one can explain a random variable X after observing another random variable Y, or vice-versa. The expression of \mathcal{MI} in Eq. 1 is defined in terms of entropies. For any random variable X, its entropy quantifies the uncertainty associated with its occurrence.

III-A \mathcal{MI} as a similarity metric

\mathcal{MI} is a convex function and attains its global minimum when the two random variables under consideration are independent. Mathematically, \mathcal{MI} \to 0 when the variables are independent, whereas \mathcal{MI} \to \{\mathcal{H}(X)=\mathcal{H}(Y)\} when both variables are statistically identical. This property of \mathcal{MI} can readily be employed to quantify the similarity between two signals. However, while doing so, the definition of \mathcal{MI} has to be interpreted in a quite different manner.

To better understand this, let us consider the example of image matching, given two images X and Y. In order to measure the similarity between the images using \mathcal{MI}, the image itself cannot be considered as a random variable, because in that case p_{X}, p_{Y} and p_{XY} would be meaningless; in other words, per-sample \mathcal{MI} is not defined. Hence, instead of an image, its pixel values are considered as a random variable over which the relevant distributions can be defined. The pixel values may refer to intensity, color, gradients, etc. In order to compute the similarity score, first the marginal and joint pdfs over the selected variable have to be computed, and the similarity can then be obtained using Eq. 1. While doing so, Eq. 2 needs to be rewritten as given below.

\mathcal{H}(X) = -\sum_{i=1}^{N} p_{X}^{i}\log(p_{X}^{i}), (3)
\mathcal{H}(Y) = -\sum_{i=1}^{N} p_{Y}^{i}\log(p_{Y}^{i}),
\mathcal{H}(X,Y) = -\sum_{i=1}^{N}\sum_{j=1}^{N} p_{XY}^{ij}\log(p_{XY}^{ij})

where N is the number of bins in the pdf.

As another example, consider matching two time series signals using \mathcal{MI}. Following the above discussion, the two signal instances under consideration cannot themselves be treated as random variables; instead, their instantaneous values are considered as the random variable. It must be noticed that the choice of random variable depends on the application.
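
To make the above concrete, the following is a minimal NumPy sketch of per-sample \mathcal{MI} between two 8-bit grayscale images, using N-bin intensity histograms for the pdfs (Eqs. 1-3); the bin count and intensity range are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def mutual_information(img_x, img_y, n_bins=16):
    """Per-sample MI of two 8-bit images via N-bin intensity histograms (Eqs. 1-3)."""
    joint, _, _ = np.histogram2d(img_x.ravel(), img_y.ravel(),
                                 bins=n_bins, range=[[0, 256], [0, 256]])
    p_xy = joint / joint.sum()                       # joint pdf
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)    # marginal pdfs

    def entropy(p):
        p = p[p > 0]                                 # 0 log 0 is taken as 0
        return -np.sum(p * np.log(p))

    return entropy(p_x) + entropy(p_y) - entropy(p_xy)   # Eq. 1
```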

IV DeepMI Framework

To understand the concept of DeepMI, consider the task of image reconstruction using autoencoders. In order to minimize the gap between an input image and the reconstructed image, the \mathcal{MI} between them has to be maximized. The regular \mathcal{MI} expression, however, cannot be used directly for this purpose. This is primarily because \mathcal{MI} attains its global minimum when the two random variables are dissimilar, and our optimal point, \sup \mathcal{MI}, is not well defined since \mathcal{MI} is unbounded above. Although various normalized versions of \mathcal{MI} have been proposed in the literature [34], the previously discussed issues remain. Hence, normalized \mathcal{MI} (\mathcal{NMI}) also cannot serve our purpose.

The above challenges encourage us to develop the linearized mutual information \mathcal{LMI}, which attains its global minimum when the two images are exactly the same. In order to achieve this, we turn towards the working of \mathcal{MI} and make the following important insight.

IV-A A key insight into \mathcal{MI}

Consider two images X and Y, with p_{X}\in\mathbb{R}^{N}, p_{Y}\in\mathbb{R}^{N} and p_{XY}\in\mathbb{R}^{N\times N} as their marginal and joint pdfs respectively. The dimension of p_{X}, p_{Y} is N\times 1, whereas it is N\times N for p_{XY}. From Eqs. 1-2, we can immediately say that \mathcal{MI}\to 0 when the two signals are dissimilar, while \mathcal{MI}\to\{\mathcal{H}(X)=\mathcal{H}(Y)\} when the signals are exactly the same. Hence, for the images X and Y to be identical, a necessary but not sufficient condition is that p_{X} and p_{Y} are the same. To guarantee it, the following has to be satisfied.

p_{X}^{i} = p_{Y}^{i} = p_{XY}^{ii}, \qquad p_{XY}^{ij}\big|_{i\neq j} = 0, \qquad i,j = 1,2,\dots,N (4)

In other words, when X\equiv Y, the off-diagonal elements of p_{XY} are zero while the diagonal elements are non-zero (depending on the distribution) and equal to p_{X} and p_{Y} simultaneously. The above insight leads us to derive an expression for the \mathcal{LMI} function to train deep networks.

IV-B \mathcal{LMI} Derivation

We know that Eqs. 5 and 6 hold for any probability density function; they correspond to a 1D and a 2D probability density function respectively.

\sum_{i=1}^{N} p^{i}_{X} = \sum_{i=1}^{N} p^{i}_{Y} = 1, (5)
\sum_{i=1}^{N}\sum_{j=1}^{N} p^{ij}_{XY} = 1 (6)

Rewriting Eq. 6 as a combination of its diagonal (i=j) and off-diagonal (i\neq j) elements, we get

\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} p_{XY}^{ij} + \sum_{i=1}^{N} p_{XY}^{ii} = 1 (7)

\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} p_{XY}^{ij} + \sum_{i=1}^{N} p_{XY}^{ii} + \sum_{i=1}^{N} p_{XY}^{ii} - \sum_{i=1}^{N} p_{XY}^{ii} = 1 (8)

\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} p_{XY}^{ij} + \sum_{i=1}^{N} p_{XY}^{ii} + \sum_{i=1}^{N} p_{XY}^{ii} - \sum_{i=1}^{N} p_{XY}^{ii} + \sum_{i=1}^{N} p_{X}^{i} - \sum_{i=1}^{N} p_{X}^{i} + \sum_{i=1}^{N} p_{Y}^{i} - \sum_{i=1}^{N} p_{Y}^{i} = 1 (9)

Grouping terms, substituting \sum_{i=1}^{N} p_{X}^{i} = \sum_{i=1}^{N} p_{Y}^{i} = 1 from Eq. 5, and using |a| \geq a, the equality relaxes to

\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} p_{XY}^{ij} + \sum_{i=1}^{N} |p_{XY}^{ii} - p_{X}^{i}| + \sum_{i=1}^{N} |p_{XY}^{ii} - p_{Y}^{i}| - \sum_{i=1}^{N} p_{XY}^{ii} + 1 + 1 \geq 1 (10)

\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} p_{XY}^{ij} + \sum_{i=1}^{N} |p_{XY}^{ii} - p_{X}^{i}| + \sum_{i=1}^{N} |p_{XY}^{ii} - p_{Y}^{i}| - \sum_{i=1}^{N} p_{XY}^{ii} + 1 \geq 0 (11)

Now, referring to Eq. 7, we can write

\sum_{i=1}^{N} p_{XY}^{ii} \leq 1 \;\;\Rightarrow\;\; 1 - \sum_{i=1}^{N} p_{XY}^{ii} \geq 0 (12)

Substituting this into Eq. 11, we get

\mathcal{L}_{LMI} = \frac{1}{3}\Big(\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} p_{XY}^{ij} + \sum_{i=1}^{N} |p_{XY}^{ii} - p_{X}^{i}| + \sum_{i=1}^{N} |p_{XY}^{ii} - p_{Y}^{i}|\Big) \geq 0 (13)

Here, the factor 1/3 is included to ensure \mathcal{L}_{LMI}\leq 1; it is obtained by replacing each of the three terms with its maximum value. Equality with 0 holds iff p_{X}^{i}=p_{Y}^{i}=p_{XY}^{ii} and p_{XY}^{ij}\big|_{i\neq j}=0 \;\forall\; i,j\in\{1,2,\dots,N\}, i.e. the two images match perfectly. Hence, the L.H.S. of Eq. 13 is treated as the objective function, which we call the \mathcal{LMI} function. The "L" stands for "linearized", because the \mathcal{LMI} formulation is linear in the elements of the pdfs, whereas the "\mathcal{MI}" part arises because, at the equality, the regular \mathcal{MI} expression is also maximized. The \mathcal{LMI} formulation is quite interesting because it is essentially a combination of three different losses weighted equally. The \mathcal{L}_{LMI} formulation is quite intuitive and its gradients are straightforward to compute.
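
As a quick illustration, the following is a minimal NumPy sketch of \mathcal{L}_{LMI} (Eq. 13), assuming the marginal and joint pdfs have already been estimated as arrays of shape (N,) and (N, N):

```python
import numpy as np

def lmi_loss(p_x, p_y, p_xy):
    """L_LMI of Eq. 13, computed from the marginal pdfs and their joint pdf."""
    diag = np.diag(p_xy)
    off_diag_mass = p_xy.sum() - diag.sum()      # sum over i != j
    return (off_diag_mass
            + np.abs(diag - p_x).sum()           # diagonal vs. marginal of X
            + np.abs(diag - p_y).sum()) / 3.0    # diagonal vs. marginal of Y
```

For two identical images, the joint mass sits entirely on the diagonal and equals both marginals, so all three terms vanish and the loss is zero.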

IV-C Fuzzy Probability Density Function

The \mathcal{LMI} function utilizes the pdfs p_{X}, p_{Y} and p_{XY}, which are discrete in nature. These are typically obtained by computing an N-bin histogram followed by a normalization step such that ||.||_{1}=1. In the standard procedure for computing a regular histogram, first a bin-id corresponding to an observation of the random variable is computed, and then the count of the respective bin is incremented by unity. The computation of the bin-id involves a ceil or floor operation, which is not differentiable, and the rounding step discards the actual contribution of the observation. Thus, both the incremental and rounding procedures prevent the gradient flow needed during training.

To better understand the above, let h_{X} be an N-bin histogram of the random variate X and x\in X be an observation. In the case of a regular histogram, the bin-id x_{b} corresponding to an observation x is computed as follows:

x_{b} = \frac{\hat{x}}{bin\_res}, \qquad \hat{x} = \frac{x - min_{X}}{max_{X} - min_{X}}, \qquad bin\_res = \frac{max_{X} - min_{X}}{N} (14)

where max_{X}, min_{X} are the maximum and minimum values which the variable X can attain. Typically, for 8-bit images, [min_{X}, max_{X}] = [0, 255]. From Eq. 14, it can be noticed that the value of x_{b} is not necessarily an integer. In this case, x_{b} is rounded to the nearest integer using ceil \equiv \lceil x_{b}\rceil or floor \equiv \lfloor x_{b}\rfloor, depending upon one's convention, and the count of bin x_{b} is then incremented by one. It thus becomes evident that the rounding procedure and the unit incremental procedure do not allow gradient computation w.r.t. the observations. To cope with this, we employ a fuzzification strategy that ensures valid gradients during back-propagation.
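
For reference, below is a minimal NumPy sketch of the regular (hard) histogram described above, assuming an 8-bit signal and interpreting bin\_res on the normalized [0, 1] scale (i.e. 1/N) so that x_{b} falls in [0, N]. The floor() makes the bin assignment piecewise constant, so its gradient w.r.t. the observations is zero almost everywhere:

```python
import numpy as np

def hard_histogram(x, n_bins, min_x=0.0, max_x=255.0):
    """Regular N-bin histogram (Eq. 14 with hard rounding); not differentiable."""
    x_hat = (x - min_x) / (max_x - min_x)                 # normalise to [0, 1]
    x_b = x_hat / (1.0 / n_bins)                          # continuous bin coordinate
    ids = np.clip(np.floor(x_b).astype(int), 0, n_bins - 1)
    h = np.bincount(ids, minlength=n_bins).astype(float)  # unit increments
    return h / h.sum()                                    # ||.||_1 = 1 normalisation
```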

IV-C1 Fuzzification of p_{X}, p_{Y}

In order to fuzzify h_{X}, instead of one bin as in Eq. 14, we compute two bins corresponding to x\in X, i.e. x_{0}=\lfloor x_{b}\rfloor and x_{1}=\lceil x_{b}\rceil. We define a membership function for each of the two bins as follows.

m_{x_{0}} = 1 - (x_{b} - x_{0}), \qquad m_{x_{1}} = x_{b} - x_{0} (15)

where m_{x_{0}}, m_{x_{1}} are the memberships of x_{0} and x_{1}; by construction, m_{x_{0}} + m_{x_{1}} = 1. During the unit incremental step, the counts of the bins x_{0} and x_{1} are incremented by m_{x_{0}} and m_{x_{1}} respectively, instead of by one. With this fuzzification, the gradients of m_{x_{0}} and m_{x_{1}} w.r.t. x_{b} are fully defined (Sec. IV-D). The above steps are followed in order to compute p_{X} and p_{Y}, with the normalization step performed at the end. As a matter of convention, the memberships corresponding to y_{b} for an observation y\in Y are denoted by m_{y_{0}} and m_{y_{1}}.
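
A minimal NumPy sketch of this fuzzy histogram is given below, under the same illustrative binning convention as the hard histogram above; each observation now spreads its unit mass over the two neighbouring bins according to Eq. 15:

```python
import numpy as np

def fuzzy_histogram(x, n_bins, min_x=0.0, max_x=255.0):
    """Fuzzified N-bin histogram (Eqs. 14-15); the bin contents vary smoothly with x."""
    x_hat = (x - min_x) / (max_x - min_x)
    x_b = x_hat / (1.0 / n_bins)                          # continuous bin coordinate
    x0 = np.clip(np.floor(x_b).astype(int), 0, n_bins - 1)
    x1 = np.clip(x0 + 1, 0, n_bins - 1)                   # x1 = ceil(x_b), clamped
    m_x1 = x_b - np.floor(x_b)                            # Eq. 15
    m_x0 = 1.0 - m_x1
    h = np.zeros(n_bins)
    np.add.at(h, x0, m_x0)                                # soft "unit increment"
    np.add.at(h, x1, m_x1)
    return h / h.sum()
```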

IV-C2 Fuzzification of p_{XY}

The fuzzification of the joint pdf p_{XY} is simply an extension of the previous steps. For a regular 2D histogram, the unit incremental procedure is applied to the bin location defined by the coordinate (x_{b}, y_{b}) (Eq. 14), which in this case as well need not be exactly integral. Therefore, to ensure valid gradient flow, four memberships are defined, corresponding to the four coordinates top-left, top-right, bottom-left, and bottom-right w.r.t. (x_{b}, y_{b}). Mathematically, these four coordinates are (x_{0}, y_{0}), (x_{1}, y_{0}), (x_{0}, y_{1}), (x_{1}, y_{1}), and their respective memberships can be written as:

m_{x_{0}y_{0}} = m_{x_{0}}m_{y_{0}}, \qquad m_{x_{1}y_{0}} = m_{x_{1}}m_{y_{0}}, (16)
m_{x_{0}y_{1}} = m_{x_{0}}m_{y_{1}}, \qquad m_{x_{1}y_{1}} = m_{x_{1}}m_{y_{1}}

During the unit incremental step, the counts of the four bins mentioned above are incremented by their respective membership values.
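
The same idea extends to the joint pdf; a minimal NumPy sketch under the same illustrative conventions:

```python
import numpy as np

def fuzzy_joint_histogram(x, y, n_bins, min_v=0.0, max_v=255.0):
    """Fuzzified N x N joint histogram; each (x, y) pair spreads over four bins (Eq. 16)."""
    def soft_bins(v):
        v_b = (v - min_v) / (max_v - min_v) / (1.0 / n_bins)
        v0 = np.clip(np.floor(v_b).astype(int), 0, n_bins - 1)
        v1 = np.clip(v0 + 1, 0, n_bins - 1)
        m1 = v_b - np.floor(v_b)
        return v0, v1, 1.0 - m1, m1                       # bins and memberships (Eq. 15)

    x0, x1, mx0, mx1 = soft_bins(np.asarray(x, dtype=float))
    y0, y1, my0, my1 = soft_bins(np.asarray(y, dtype=float))
    h = np.zeros((n_bins, n_bins))
    np.add.at(h, (x0, y0), mx0 * my0)                     # the four memberships of Eq. 16
    np.add.at(h, (x0, y1), mx0 * my1)
    np.add.at(h, (x1, y0), mx1 * my0)
    np.add.at(h, (x1, y1), mx1 * my1)
    return h / h.sum()
```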

IV-D Back-propagation through DeepMI framework

From Eq. 16, it can be inferred that the gradient of \mathcal{L}_{LMI} w.r.t. an observation x depends on p_{X}^{x_{0}}, p_{X}^{x_{1}}, p_{XY}^{x_{0}y_{0}}, p_{XY}^{x_{0}y_{1}}, p_{XY}^{x_{1}y_{0}}, p_{XY}^{x_{1}y_{1}}. Therefore, we can write:

\frac{\partial\mathcal{L}}{\partial x} = \sum_{i=0}^{1}\frac{\partial\mathcal{L}}{\partial p_{X}^{x_{i}}}\,\frac{\partial p_{X}^{x_{i}}}{\partial x} + \sum_{i=0}^{1}\sum_{j=0}^{1}\frac{\partial\mathcal{L}}{\partial p_{XY}^{x_{i}y_{j}}}\,\frac{\partial p_{XY}^{x_{i}y_{j}}}{\partial x} (17)

Using the chain rule:

\frac{\partial p_{X}^{x_{i}}}{\partial x} = \frac{\partial p_{X}^{x_{i}}}{\partial\hat{p}_{X}^{x_{i}}}\times\frac{\partial\hat{p}_{X}^{x_{i}}}{\partial m_{x_{i}}}\times\frac{\partial m_{x_{i}}}{\partial x_{b}}\times\frac{\partial x_{b}}{\partial\hat{x}}\times\frac{\partial\hat{x}}{\partial x}, \qquad p_{X}^{x_{i}} = \frac{\hat{p}_{X}^{x_{i}}}{\sum_{k=1}^{N}\hat{p}_{X}^{k}} (18)

and similarly,

\frac{\partial p_{XY}^{x_{i}y_{j}}}{\partial x} = \frac{\partial p_{XY}^{x_{i}y_{j}}}{\partial\hat{p}_{XY}^{x_{i}y_{j}}}\times\frac{\partial\hat{p}_{XY}^{x_{i}y_{j}}}{\partial m_{x_{i}y_{j}}}\times\frac{\partial m_{x_{i}y_{j}}}{\partial m_{x_{i}}}\times\frac{\partial m_{x_{i}}}{\partial x_{b}}\times\frac{\partial x_{b}}{\partial\hat{x}}\times\frac{\partial\hat{x}}{\partial x}, \qquad p_{XY}^{x_{i}y_{j}} = \frac{\hat{p}_{XY}^{x_{i}y_{j}}}{\sum_{k=1}^{N}\sum_{l=1}^{N}\hat{p}_{XY}^{kl}} (19)

Each of the above partial derivatives can be easily computed from Eqs. 13-16 via the chain rule, where \hat{p} denotes the unnormalized fuzzy histogram counts. It must be noticed that x_{0}, x_{1} do not occur in the gradient calculations, whereas they would appear in the case of regular histograms, leading to undefined derivatives. Gradients for an observation y can be computed similarly.
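
If the fuzzy pdfs are built from differentiable tensor operations, an automatic differentiation framework reproduces the chain of Eqs. 17-19 without hand-written gradients. Below is a minimal PyTorch-style sketch of the fuzzy marginal, given here only as an assumption-laden illustration (the paper's own implementation is a layer-wise C++ architecture); detaching the bin indices mirrors the observation that x_{0}, x_{1} do not enter the gradients:

```python
import torch

def fuzzy_marginal(v, n_bins, min_v=0.0, max_v=255.0):
    """Differentiable fuzzy 1D pdf; only the memberships (Eq. 15) carry gradient."""
    v_b = (v - min_v) / (max_v - min_v) * n_bins          # continuous bin coordinate
    v0 = v_b.detach().floor().clamp(0, n_bins - 1).long()
    v1 = (v0 + 1).clamp(0, n_bins - 1)
    m1 = v_b - v_b.detach().floor()                       # d m1 / d v_b = 1
    m0 = 1.0 - m1
    h = torch.zeros(n_bins, dtype=v.dtype)
    h = h.scatter_add(0, v0, m0).scatter_add(0, v1, m1)   # soft increments
    return h / h.sum()

# toy check: gradients flow back to the predicted signal y
y = (255.0 * torch.rand(1000)).requires_grad_()
target = torch.full((11,), 1.0 / 11)                      # a hypothetical target pdf
loss = (fuzzy_marginal(y, 11) - target).abs().sum()
loss.backward()                                           # y.grad is now well defined
```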

IV-E \mathcal{L}_{LMI} Implementation

The formulation of \mathcal{L}_{LMI} is quite intuitive. There is, however, a consideration which must be accounted for while using it for signal matching, especially in scenarios where one of the signals is the groundtruth and the other is its estimate produced by a neural network.

To understand this, consider two images X and Y, where X is the image to be reconstructed and Y is the reconstruction, i.e. the output of a neural network. The marginals p_{X}, p_{Y} and the joint p_{XY} of X and Y are computed using the fuzzification procedure described previously. Since these distributions are only approximations of the underlying distribution, undersampling or oversampling of the underlying distribution is possible, and in such scenarios p_{X}, p_{Y} and p_{XY} do not have well defined mathematical expressions. Moreover, both p_{X} and p_{Y} can be obtained from p_{XY}; therefore, during backpropagation, only the gradients w.r.t. p_{XY}^{ij} are backpropagated. Mathematically, gradients w.r.t. p_{Y} should also be backpropagated because the calculation of p_{Y} depends on Y. However, doing so disturbs the training process and degrades the neural network performance. Based on our experimental study, we attribute this observation to the distribution approximation procedure.

IV-F Hyperparameters

The DeepMI framework has N as its only hyperparameter. The number of bins N mainly determines how precisely \mathcal{L}_{LMI} penalizes the network while matching. For example, consider an 8-bit image with dynamic range [0, 255]. If we set N = 255, the network is penalized very strongly while matching, whereas if N is set to a small number, the network is forced to focus only on the important details. This property can be quite useful in cases where two images from different cameras, with different brightness levels, need to be matched.
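
As a small usage illustration of the effect of N, the snippet below reuses the fuzzy_joint_histogram and lmi_loss sketches from Sec. IV and compares a hypothetical 8-bit image against a slightly brightened copy; qualitatively, a coarser binning is more forgiving of the brightness shift (smaller \mathcal{L}_{LMI}), while a finer binning penalizes it more strongly:

```python
import numpy as np

rng = np.random.default_rng(0)
img_x = rng.integers(0, 256, size=(192, 640)).astype(float)   # hypothetical image
img_y = np.clip(img_x + 10.0, 0, 255)                          # brightness-shifted copy

for n_bins in (3, 11, 25):
    p_xy = fuzzy_joint_histogram(img_x.ravel(), img_y.ravel(), n_bins)
    p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)   # marginals of the fuzzy joint
    print(n_bins, lmi_loss(p_x, p_y, p_xy))         # coarser bins give a smaller loss
```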

V Experiments

In this section, we benchmark the effectiveness and applicability of the DeepMI framework. Due to the unavailability of a standard evaluation procedure, we define three tasks on which a deep neural network is trained in an unsupervised manner. The three tasks vary in difficulty from baseline, to moderate, to extremely difficult from the learning perspective. As an experimental study, the training of each task is performed using the loss functions \mathcal{L}_{1}, \mathcal{L}_{2}, \mathcal{L}_{SSIM} along with \mathcal{L}_{LMI}. For \mathcal{L}_{SSIM}, we use a 3\times 3 block filter following [9]. All the performance metrics are provided in Table I and Table II for quantitative analysis, with the best scores marked with an asterisk. For training, we use base_lr = 0.001, lr_policy = poly, and the ADAM optimizer with \beta_{1} = 0.9 and \beta_{2} = 0.99 unless otherwise stated. The encoder-decoders [35] have a filter size of 3\times 3 and the number of filters equals 16, 32, 64, 128, 256 for the five stages of each task. The whole framework has been implemented as a layer-wise architecture in C++ and the code will be available at the link provided in the beginning.

V-A Unsupervised Bar Alignment (Exp1)

This experiment consists of two binary images where the second image is a spatially transformed version of the first, i.e. a 2D rigid body transform is defined between the two images. Both images consist of a black background and a white rectangular bar. The bar, sized 50\times 125 in a 192\times 640 image, has two degrees-of-freedom (DoF): tx and \theta, which correspond to horizontal motion and in-plane rotation. The goal of the experiment is to learn the 2-DoF parameters in an unsupervised manner. For training, we generate a dataset of 1500 images with a 1000+500 train-test split. The dataset is generated by transforming the bar in the image with randomly generated tx\in[-100, 100] pixels and \theta\in[-40, 40] degrees. The training for this experiment is performed using SGD with Nesterov momentum = 0.95 for 5 epochs.

The unsupervised training pipeline for the task is depicted in Fig. 1. During training, the neural network takes two images, source (I_{s}) and target (I_{t}), as input and predicts tx, \theta. The image I_{s} is then warped using the fully differentiable spatial transformer network (STN) [36] to obtain \hat{I}_{s}, and the neural network is penalized through back-propagation to force \hat{I}_{s}\to I_{t}. In the testing phase, the neural network predicts tx, \theta on the test data, and the Mean-Absolute-Error (MAE) is reported between the predictions and the groundtruth tx, \theta stored during the data generation process. From Table I, under the column Exp1, it can be noticed that the network trained using the \mathcal{L}_{LMI} formulation performs better than the networks trained with the rest of the loss functions. Fig. 2 shows a few qualitative results of this experiment. In this task, almost all of the losses perform equally well, which can be verified visually as well as from the quantitative results in Table I.

Figure 1: Unsupervised learning framework for Exp1. The blue box is a convolutional block and the orange box is a concatenation block; STN is the spatial transformer network [36], and \mathcal{L} is a loss function.
Figure 2: Qualitative results of Exp1 (Input, Target, \mathcal{L}_{1}, \mathcal{L}_{2}, \mathcal{L}_{SSIM}, \mathcal{L}_{LMI}).
Figure 3: Qualitative results of Exp2 (Input, Target, GT Mask, \mathcal{L}_{1}, \mathcal{L}_{2}, \mathcal{L}_{SSIM}, \mathcal{L}_{LMI}).
Figure 4: Unsupervised learning framework for Exp2.

V-B Unsupervised Mask Prediction (Exp2)

This experiment consists of two ordinary grayscale images: source (I_{s}) and target (I_{t}), where I_{t} is obtained from I_{s} by cropping out a significantly large rectangular region and filling it with zeros. The end goal of the experiment is to predict a mask M_{s} which depicts the similar and dissimilar regions between I_{s} and I_{t}. To achieve this, a neural network is trained in an unsupervised manner. For the learning process, we generate a dataset of 1500 images with a 1000+500 train-test split. The dataset is generated by randomly cropping a rectangular region of size 40\times 200 from I_{s} and filling the cropped region with zeros; the obtained image is referred to as I_{t}.

Fig. 4 shows the unsupervised training framework for this task. We use the UNet [35] architecture. The network is trained to predict M_{s} such that I_{s}*M_{s}\to I_{t}. This experiment is essentially a binary segmentation task; therefore, we adopt the intersection-over-union (IoU) metric, which is widely used to quantify segmentation performance. In order to best evaluate the different training losses, we report IoU scores obtained by thresholding the predicted mask at different confidence levels. This is done to examine the network's capability to push the feature embeddings of similar and dissimilar regions significantly apart; in other words, a perfectly trained network will exhibit the same IoU scores at all threshold levels. From Table I, under Exp2, it is clear that the proposed \mathcal{L}_{LMI} formulation shows consistent and better performance over the other loss functions across the various thresholding levels.
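
For reference, a minimal NumPy sketch of the thresholded IoU used here, assuming the predicted mask holds per-pixel confidences in [0, 1] and the groundtruth mask is binary:

```python
import numpy as np

def iou_at_threshold(pred_mask, gt_mask, thr):
    """IoU between a thresholded confidence mask and a binary groundtruth mask."""
    pred = pred_mask >= thr
    gt = gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0
```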

Fig. 3 shows a few qualitative results corresponding to this experiment. The results visualized are thresholded at 0.10 (IoU.10). It can be seen that \mathcal{L}_{1} and \mathcal{L}_{SSIM} perform worst, whereas \mathcal{L}_{2} and \mathcal{L}_{LMI} perform roughly on par. The visual observation can also be verified quantitatively from Table I, under Exp2. In the quantitative results, \mathcal{L}_{1} and \mathcal{L}_{SSIM} perform worst, in increasing order, as verified visually. On the other hand, \mathcal{L}_{2} and \mathcal{L}_{LMI} show only a marginal difference, which can also be verified visually by zooming into the results and examining the boundary of the white region.

V-C Unsupervised Depth Estimation (Exp3)

Depth estimation has been a long-standing task for the computer vision community, with worldwide industrial importance because depth perception is a must for autonomous robotics and vehicles. Owing to advances in supervised learning, researchers have developed several methods to predict depth using neural networks. However, obtaining accurate groundtruth for supervised learning of this task is extremely challenging because it requires very expensive measurement instruments such as LiDARs. Hence, developing unsupervised learning frameworks for this task has gained considerable attention in recent years. The existing methods make use of several loss functions in order to learn the depth effectively.

In this experiment, we demonstrate the learning of a neural network for the task of depth estimation from stereo images in an unsupervised manner. We take the seminal work for this task [9] as our basis. Our aim is not to show improvements on the datasets; instead, we emphasize how easily the DeepMI framework can be integrated into such real world applications. Hence, instead of a bigger dataset, we use 87 grayscale rectified stereo images from KITTI [37] sequence 113. We select this sequence because it is quite a difficult sequence from the perspective of this task.

The framework used to carry out the experiment is shown in Fig. 6. It must be noticed that instead of the three different losses as in [9], we only use one loss for the evaluation. This is done in order to demonstrate that \mathcal{L}_{LMI} provides stronger gradients and alone can lead to improved results. We report the MAE between the predicted depth and the groundtruth measurements. From Table I, under Exp3, one can notice that the neural network trained using \mathcal{L}_{LMI} outperforms the other variants by a large margin. For the case of \mathcal{L}_{1}, \mathcal{L}_{2}, the base_lr is lowered to 0.00001 to prevent gradient explosion.

Table I: Quantitative analysis (* marks the best score in each column)

| \mathcal{L} | Exp1 MAEtx | Exp1 MAEθ | Exp2 IoU.05 | IoU.10 | IoU.20 | IoU.40 | IoU.50 | Exp3 MAE (mm) |
| \mathcal{L}_{1} | 1.52 | 3.44 | .48 | .61 | .65 | .68 | .68 | 105.46 |
| \mathcal{L}_{2} | 1.93 | 3.65 | .66 | .68 | .68 | .68 | .68 | 39.78 |
| \mathcal{L}_{SSIM} | 1.25 | 3.95 | .50 | .52 | .68 | .69 | .81* | .46 |
| \mathcal{L}_{SSIM}+\mathcal{L}_{2} | - | - | - | - | - | - | - | .64 |
| \mathcal{L}_{LMI} | 1.04* | 3.18* | .70* | .70* | .71* | .80* | .81* | .27* |

Fig. 5 shows a few qualitative results for this experiment. From the figure, it can be noticed that both \mathcal{L}_{1} and \mathcal{L}_{2} perform poorly, whereas \mathcal{L}_{SSIM} performs considerably better than them. This indicates the reason behind the recent adoption of \mathcal{L}_{SSIM} over \mathcal{L}_{1} and \mathcal{L}_{2} for image matching purposes.

Figure 5: Qualitative results of Exp3 (Input, Target, \mathcal{L}_{1}, \mathcal{L}_{2}, \mathcal{L}_{SSIM}, \mathcal{L}_{LMI}). The brighter the depth map, the closer the object.

Figure 6: Unsupervised learning framework for Exp3. Warp refers to the differentiable warping of [9].

Further, it can be noticed that, visually, the results of \mathcal{L}_{LMI} are the most pleasing as well as the most consistent among all. Comparing \mathcal{L}_{SSIM} and \mathcal{L}_{LMI}, we can see that the depth estimates of the former contain texture copy [9] artifacts, whereas these are absent in the latter. This can also be verified by taking a closer look at the car at the bottom right of the image: for \mathcal{L}_{SSIM}, the depth map of the car has holes near the edges and contains severe texture copy artifacts near the number plate and other body areas of the car. These artifacts, on the other hand, are not present in the case of \mathcal{L}_{LMI}. This shows the clear effectiveness of the \mathcal{L}_{LMI} loss and the DeepMI framework.

V-D Effect of the number of bins N

Table II shows the effect of the hyperparameter N on each of the three tasks. For Exp1, it can be noticed that the performance is comparable for all four values of N; this is expected because the images in this task are binary. For Exp2, there is a considerable drop in performance for N = 3 on IoU.05, because the images in this case have multiple grayscale levels whose details cannot be captured efficiently with so few bins. The same is the case for Exp3 when N = 3. Overall, we can see that \mathcal{L}_{LMI} has significant advantages over the other loss functions. Through our experiments, we find it sufficient to keep N\leq 25 for a signal with dynamic range in [0, 255].

V-E A unified discussion on the experiments

From Table I, it can be noticed that the information theory based measure (\mathcal{L}_{LMI}) is much more consistent compared to the other losses. Also, for Exp3, \mathcal{L}_{LMI} proves to be better than the case when \mathcal{L}_{SSIM} and \mathcal{L}_{2} are used together. It is also noticeable that \mathcal{L}_{LMI} shows consistent and the best scores amongst all variants of the losses. In these experiments, our intention has not been to dismiss the existing losses, but rather to mark the potential of information theory based methods in deep learning for real world applications.

Table II: Effect of bin size N on \mathcal{L}_{LMI}

| N | Exp1 MAEtx | Exp1 MAEθ | Exp2 IoU.05 | IoU.10 | IoU.20 | IoU.40 | IoU.50 | Exp3 MAE (mm) |
| 3 | 1.04 | 3.18 | .41 | .51 | .52 | .50 | .81 | .41 |
| 11 | 0.89 | 2.69 | .50 | .59 | .61 | .66 | .81 | .27 |
| 15 | 0.78 | 2.89 | .66 | .67 | .68 | .69 | .81 | .28 |
| 25 | 0.99 | 4.36 | .47 | .48 | .58 | .79 | .79 | .29 |

VI Conclusion

In this paper, we proposed an end-to-end differentiable framework, "DeepMI", and a novel similarity metric \mathcal{LMI} to train deep neural networks. The metric is inspired by mutual information (\mathcal{MI}) and copes with the difficulty of training a deep neural network directly with the \mathcal{MI} expression. The metric is based on probability density functions, which makes it signal agnostic. These density functions are discrete in nature for real world signals (images, time series) and cannot support backpropagation; therefore, a fuzzification strategy that supports smooth backward gradient flow through the density functions is also developed. We show that neural networks trained using the \mathcal{L}_{LMI} metric outperform their counterparts trained using \mathcal{L}_{1}, \mathcal{L}_{2}, \mathcal{L}_{SSIM}. Additionally, we show that it can be easily integrated into real world applications.

The DeepMI framework can be thought of as an effort to bring deep learning and mutual information together for real world applications. We believe that, through this work, learning based methods in several areas such as autonomous vehicles, robotic vision, and speech / audio can benefit greatly in terms of performance. The experimental study in this work can be used as a basis for developing extensions to the DeepMI framework in order to further improve the overall algorithmic performance. We also believe that the inclusion of DeepMI into deep learning frameworks will open the door to new applications.

References

  • [1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
  • [2] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [3] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255, IEEE, 2009.
  • [5] A. Kumar and L. Behera, “Semi supervised deep quick instance detection and segmentation,” in 2019 International Conference on Robotics and Automation (ICRA), pp. 8325–8331, IEEE, 2019.
  • [6] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [7] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
  • [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, pp. 2672–2680, 2014.
  • [9] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with left-right consistency,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279, 2017.
  • [10] A. Kumar, J. R. McBride, and G. Pandey, “Real time incremental foveal texture mapping for autonomous vehicles,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3233–3240, IEEE, 2018.
  • [11] H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal, and I. Reid, “Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 340–349, 2018.
  • [12] R. Li, S. Wang, Z. Long, and D. Gu, “Undeepvo: Monocular visual odometry through unsupervised deep learning,” in 2018 IEEE international conference on robotics and automation (ICRA), pp. 7286–7291, IEEE, 2018.
  • [13] Y. Almalioglu, M. R. U. Saputra, P. P. de Gusmao, A. Markham, and N. Trigoni, “Ganvo: Unsupervised deep monocular visual odometry and depth estimation with generative adversarial networks,” in 2019 International Conference on Robotics and Automation (ICRA), pp. 5474–5480, IEEE, 2019.
  • [14] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [15] J. Janai, F. Guney, A. Ranjan, M. Black, and A. Geiger, “Unsupervised learning of multi-frame optical flow with occlusions,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 690–706, 2018.
  • [16] S. Kullback, Information theory and statistics. Courier Corporation, 1997.
  • [17] T. M. Cover and J. A. Thomas, Elements of information theory. John Wiley & Sons, 2012.
  • [18] C. E. Shannon, “A mathematical theory of communication,” Bell system technical journal, vol. 27, no. 3, pp. 379–423, 1948.
  • [19] P. Viola and W. M. Wells III, “Alignment by maximization of mutual information,” International journal of computer vision, vol. 24, no. 2, pp. 137–154, 1997.
  • [20] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, “Multimodality image registration by maximization of mutual information,” IEEE transactions on Medical Imaging, vol. 16, no. 2, pp. 187–198, 1997.
  • [21] L. Bahl, P. Brown, P. De Souza, and R. Mercer, “Maximum mutual information estimation of hidden markov model parameters for speech recognition,” in ICASSP’86. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 11, pp. 49–52, IEEE, 1986.
  • [22] G. Pandey, J. R. McBride, S. Savarese, and R. M. Eustice, “Toward mutual information based automatic registration of 3d point clouds,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2698–2704, IEEE, 2012.
  • [23] A. Hyvärinen and E. Oja, “Independent component analysis: algorithms and applications,” Neural networks, vol. 13, no. 4-5, pp. 411–430, 2000.
  • [24] N. Kwak and C.-H. Choi, “Input feature selection by mutual information based on parzen window,” IEEE transactions on pattern analysis and machine intelligence, vol. 24, no. 12, pp. 1667–1671, 2002.
  • [25] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on pattern analysis and machine intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
  • [26] C. Zhu, Y. Zheng, K. Luu, and M. Savvides, “Cms-rcnn: contextual multi-scale region-based cnn for unconstrained face detection,” in Deep Learning for Biometrics, pp. 57–79, Springer, 2017.
  • [27] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, pp. 1440–1448, 2015.
  • [28] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, pp. 91–99, 2015.
  • [29] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” CVPR, 2017.
  • [30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision, pp. 21–37, Springer, 2016.
  • [31] G. A. Darbellay and I. Vajda, “Estimation of the information by an adaptive partitioning of the observation space,” IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1315–1321, 1999.
  • [32] M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm, “Mine: mutual information neural estimation,” arXiv preprint arXiv:1801.04062, 2018.
  • [33] J. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and I. Reid, “Unsupervised scale-consistent depth and ego-motion learning from monocular video,” in Advances in Neural Information Processing Systems, pp. 35–45, 2019.
  • [34] J. P. Pluim, J. A. Maintz, and M. A. Viergever, “Image registration by maximization of combined mutual information and gradient information,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 452–461, Springer, 2000.
  • [35] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, pp. 234–241, Springer, 2015.
  • [36] M. Jaderberg, K. Simonyan, A. Zisserman, et al., “Spatial transformer networks,” in Advances in neural information processing systems, pp. 2017–2025, 2015.
  • [37] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.