A Residual Solver and Its Unfolding Neural Network for Total Variation Regularized Models

Yuanhao Gong Manuscript received April 19, 2005; revised September 17, 2014.

Abstract

This paper proposes to solve the Total Variation regularized models by finding the residual between the input and the unknown optimal solution. After analyzing a previous method, we developed a new iterative algorithm, named as Residual Solver, which implicitly solves the model in gradient domain. We theoretically prove the uniqueness of the gradient field in our algorithm. We further numerically confirm that the residual solver can reach the same global optimal solutions as the classical method on 500 natural images. Moreover, we unfold our iterative algorithm into a convolution neural network (named as Residual Solver Network). This network is unsupervised and can be considered as an “enhanced version” of our iterative algorithm. Finally, both the proposed algorithm and neural network are successfully applied on several problems to demonstrate their effectiveness and efficiency, including image smoothing, denoising, and biomedical image reconstruction. The proposed network is general and can be applied to solve other total variation regularized models.

Index Terms:

TV; residual; network; rsnet; solver

I Introduction

Various image processing and computer vision tasks are ill-posed, such as image denoising, biomedical image reconstruction, image smoothing, segmentation, object recognition, etc. The illness comes from the fact that it is impossible to recover the ground truth without making any assumption about it. For example, in the image denoising task, it is impossible to tell if a pixel intensity contains noise or not without making any assumption about the ground truth. Another example is the image smoothing. It is not clear what should be smoothed out without any prior knowledge. In the biomedical image reconstruction, the input signal and the desired output even do not work in the same space. And the illness comes from the fact that there are not enough observations available due to physical limitations, such as dose restriction, equipment angle limitation, temperature requirement, pressure condition, etc.

Although ill-posed problems are common, solving these problems is challenging. Usually, there are many solutions that satisfy the constraints in the ill-posed problems. In mathematical words, the solution space is too large to find the desired one solution. Moreover, the computational cost for such problems is usually very high. Therefore, it is not realistic to search the full solution space in practice.

To solve such ill-posed problems, a prior or an assumption about the ground truth must be imposed, reducing the solution space or even making the solution unique. For example, the well-known total variation regularization leads to a piecewise constant signal [1]. The ground truth signal might be assumed to be piecewise linear, leading to the famous guided filter [2]. The ground truth might be assumed to be a minimal surface, leading to the area regularization [3, 4, 5]. The ground truth might be assumed to be piecewise developable, leading to the Gaussian curvature regularization [6, 7]. Different assumptions usually lead to different resulting images, which visually reflects the imposed priors. For example, the well-known total variation regularization will make the resulting image piecewise constant, which is known as staircase artifacts. These artifacts reflect the imposed prior. Such prior is usually imposed by a regularization term, which is a mathematical model for the imposed prior or assumption.

The same prior or assumption, however, might have several different regularization forms because it can be modeled by different mathematical formula with various accuracy and computation complexity. For example, the well-known gradient distribution prior can be modeled by Gaussian distribution, Hyper Laplace distribution [8] or the derivative of $atan$ function [9, 10], etc. These models impose the same gradient distribution prior, but they have different approximation accuracy and different optimization properties, such as convexity, fast computation, explicit gradient formula, etc.

I-A Total Variation Regularization

Among all regularization terms, Total Variation (TV) might be the most popular one because of its simplicity and effectiveness [1]. Assuming the unknown ground truth signal is $U(\vec{x})$ where $\vec{x}$ is the spatial coordinate, the traditional total variation regularization term is

TV(U(\vec{x}))=\|\nabla U(\vec{x})\|_{p}\,,

(1)

where $\nabla$ is the gradient operator and $p=\{1,2\}$ indicates the standard $\ell_{p}$ norm. This term is isotropic when $p=2$ and becomes anisotropic when $p=1$ . Choosing which $\ell_{p}$ norm is problem specific. In the rest of this paper, we use $p=1$ because it has been shown preserving edges better in many practical applications [10].

Although there are some variants of TV, such as spatial weighted TV and $\ell_{0.5}$ norm TV (Hyper Laplace), the standard TV with $\ell_{1}$ norm still is the most popular regularization term. The reasons come from three aspects. First, from optimization point of view, this term is convex. This means the global optimal solution is unique. And the convex optimization algorithms can be used for this term. Second, the $\ell_{1}$ norm is a simple computation operation. Thus, it can be efficiently computed and evaluated in practice. Third, the anisotropy property of this term is important in preserving edges in image processing. In general, preserving edges is always preferred in many image processing tasks.

When TV is used as the regularization term, many image processing problems, such as denoising, segmentation, inpainting and fusion, etc, can be formulated as the following minimization problem:

\min_{U}\left\{\frac{1}{2}\|AU(\vec{x})-f(\vec{x})\|_{2}^{2}+\lambda TV(U(\vec{x}))\right\}

(2)

where $A$ is an imaging matrix, $U$ is the unknown signal to be estimated, $\vec{x}$ is a spatial coordinate, $f$ is the observed signal and $\lambda>0$ is a scalar that balances the imaging and the TV prior ( $\lambda$ usually is related with noise level). The imaging matrix $A$ models a linear imaging process, such as blurring, down sampling, up sampling, etc. A linear matrix already covers a large range of image processing problems.

I-B Iterative Solvers

Many solvers have been developed for this model (Equation 2). The earliest iterative scheme was proposed in [1]. This method is slow due to the stability constraints in the time step size. To accelerate the optimization procedure, a number of alternative methods have been proposed. One popular method is primal dual solver [11, 12]. Primal dual method simultaneously finds the optimal solution for this model and its dual, improving the numerical stability. Another technique is to split the imaging model and the TV regularization, by introducing a variable $W$ :

\min_{U}\left\{\frac{1}{2}\|AU-f\|_{2}^{2}+\lambda TV(W)\right\}~{}\mathrm{s.t.}~{}W=U\,.

(3)

This new model can be solved by alternating direction method of multipliers (ADMM) [13], split Bregman [14], etc. Equation 2 can also be solved by Iterative Shrinkage and Thresholding Algorithm (ISTA) [15]. These methods are based on nonlinear partial differential equations and thus time consuming, compared with recently developed neural network solvers.

I-C Neural Network Solvers

With the development of deep learning, several neural networks have been proposed to solve Eq. 2. One of the earliest work is the Learned ISTA (LISTA), which unfolds the ISTA as neural networks to perform the optimization algorithm [16]. One layer in the network corresponds to one iteration in ISTA. Different from ISTA, all the parameters in LISTA implicitly communicate with each other during the training process. Such communication can be considered as high order scheme from the partial differential equation point of view. Therefore, the network can solve the problem within a small number of layers, in contrast with the large number of iterations in iterative algorithm ISTA.

ISTA-net is another extension of ISTA [17]. It unfolds the ISTA but keeps the symmetric network structure, requiring the encoding and decoding are inverse functions for each other. Such constraint makes the network invertable, leading to better numerical stability than LISTA.

Another unfolding method is the primal dual network, which unfolds the primal dual iteration for Eq. 2 [18]. It has been shown converging much faster than the original primal dual iteration.

ADMM-net unfolds the classical ADMM iteration, leading to a neural network representation [19]. It first trains one layer. Then it fixes this layer, adds another layer and only trains the new added layer. By repeating this procedure, it shows good performance for MRI reconstruction problem. In this network, after 10 layers, adding more layers does not improve the result anymore.

Recently, a CNN based image reconstruction method for inverse problem is proposed in [20]. It shows that neural networks are very efficient in solving image reconstruction problems thanks to the parameter communication between different layers.

In general, neural network methods are more efficient than the iterative algorithms because of two reasons. First, the parameters in the neural networks can communicate with each other during the training process. This corresponds to the high order scheme from partial differential equation point of view. Second, the parameters in the neural networks are more adaptive to the training data. In contrast, the operations in the iterative algorithms are independent from the input image. As a result, the neural network usually only needs a small number of layers to solve Eq. 2 (computational efficiency).

I-D Motivations and Contributions

Most of these neural network algorithms are supervised, requiring the ground truth to be explicitly given. Such requirement is not easily satisfied in most practical applications. This motivates us to develop an unsupervised neural network. Our contributions fall in following two aspects:

•

We develop a new iterative algorithm that solves the Eq. 2. Instead of directly finding the optimal solution, we find its related residual (details will be explained in later sections). Therefore, our solver is called Residual Solver (RS).
•

We unfold the proposed iterative residual solver into a neural network and our network is unsupervised. Therefore, it can be trained without knowing the optimal solution. Another advantage of our network is that it can be trained on low resolution images but applied on high resolution images. Our experiments confirm this property.

II Mathematical Background

Equation 2 can be solved by many algorithms as mentioned in previous sections. Here, we give the general procedure that solves Eq. 2 and then a typical algorithm for solving the bottleneck sub problem in this procedure.

In general, Eq. 2 can be solved by the following iteration (proximal gradient method [21])

$\displaystyle V^{t+1}$	$\displaystyle=$	$\displaystyle U^{t}-\frac{1}{\alpha}A^{T}(AU^{t}-f-d^{t})$	(4)
$\displaystyle U^{t+1}$	$\displaystyle=$	$\displaystyle\arg\min_{U}\left\{\frac{\alpha}{2}\\|U-V^{t+1}\\|_{2}^{2}+\lambda TV(U)\right\}$	(5)
$\displaystyle d^{t+1}$	$\displaystyle=$	$\displaystyle d^{t}+f-AU^{t+1}\,,$	(6)

where $\alpha>2\|A^{T}A\|$ is a parameter, $d$ is the scaled dual variable and $t$ is the iteration index. The initial values are $d^{0}=0$ and $U^{0}=0$ .

II-A Dual First Iteration

Since this procedure is iterative, we can modify this iteration by updating the dual variable $d$ first, leading to a more efficient algorithm

$\displaystyle d^{t+1}$	$\displaystyle=$	$\displaystyle d^{t}+f-AU^{t}$	(7)
$\displaystyle V^{t+1}$	$\displaystyle=$	$\displaystyle U^{t}+\frac{1}{\alpha}A^{T}(2d^{t+1}-d^{t})$	(8)
$\displaystyle U^{t+1}$	$\displaystyle=$	$\displaystyle\arg\min_{U}\left\{\frac{\alpha}{2}\\|U-V^{t+1}\\|_{2}^{2}+\lambda TV(U)\right\}\,.$	(9)

In this new iteration, the term $2d^{t+1}-d^{t}$ in Eq. 8 can be considered as extrapolation. And we also avoid the matrix multiplication with $A^{T}A$ in Eq. 4. In fact, Eq. 8 is simpler to compute, compared with Eq. 4. Thus, it is faster in practice.

II-B The Bottleneck

In both iteration methods, Eq. 9 is the computation bottleneck. This equation is known as the standard ROF model [1]. Many algorithms have been developed to efficiently solve Eq. 9. Among these algorithms, following iteration was proposed in [22]

	$\displaystyle\vec{b}^{in+1}$	$\displaystyle=$	$\displaystyle cut(\nabla U^{in}+\vec{b}^{in},\beta)$		(10)
	$\displaystyle U^{in+1}$	$\displaystyle=$	$\displaystyle V^{t+1}-\lambda\beta(\nabla^{T}\vec{b}^{in+1})\,,$		(11)

where $0<\beta<1/4$ is a scalar parameter, $in$ is an inner iteration index, $\vec{b}=(b_{x},b_{y})$ corresponds to the gradient field, $\vec{b}^{0}=0$ , and $cut$ function is defined as

\displaystyle cut(x,\beta)=\begin{cases}\beta,&x>\beta\cr x,&|x|\leq\beta\cr-\beta,&x<-\beta\end{cases}\,.

(12)

This iteration can theoretically guarantee the numerical convergence and has shown good performance in theory and practice [22]. We refer this algorithm as FastSolver (FS) in this paper.

This $cut$ function proposed in FS is the residual of the well-known $shrink$ operation. As shown in Fig. 1, $cut$ function plus $shrink$ function equals the identity function. Their difference forms a nonlinear transfer function that also appears in deep learning community. Moreover, the $cut$ function is similar with several well-known activation functions from neural network field, such as sigmoid, $atan$ , soft sign, etc.

Refer to caption — Figure 1: The relationship between $cut$ and $shrink$ functions.

Until Section IV-D, we will focus on this bottleneck problem (Eq. 9). We will show a residual solver for this problem. Then we will unfold this iterative solver into a neural network, which is an enhanced version of the iterative solver. In Section IV-D, we will show how to embed this neural network into a larger network for the general TV regularized models.

III Residual Solver

In this paper, we modify above iteration from [22], leading to a faster iterative algorithm. In FastSolver, during each iteration, $U^{t}$ has to be explicitly computed by Eq. 11 since it is needed in Eq. 10. Explicitly computing $U$ and its gradient adds extra computation burden for this algorithm. As shown in the following section, this computation can be avoided by finding the residual between the input and the optimal solution. Our method does not explicitly compute $U$ or its gradient.

Different from previous approaches, we propose a residual representation of Eq. 9 and perform the optimization in gradient domain. Moreover, our method does not explicitly compute $U$ during the optimization procedure.

Specifically, from Eq. 11 we have

U^{in}=V^{t+1}-\lambda\beta(\nabla^{T}\vec{b}^{in})\,.

(13)

Taking above equation into Eq. 10, we get

\vec{b}^{in+1}=cut(\nabla V^{t+1}-\lambda\beta\nabla\nabla^{T}\vec{b}^{in}+\vec{b}^{in},\beta)\,.

(14)

For a given $\nabla V^{t+1}$ , the optimal solution $\vec{b}^{*}$ is a fixed point for this equation. After finding $\vec{b}^{*}$ , the optimal solution $U^{t+1}$ can be obtained by Eq. 11

U^{t+1}=V^{t+1}+(-\lambda\beta\nabla^{T}\vec{b}^{*})\,.

(15)

In Eq. 15, $-\lambda\beta\nabla^{T}\vec{b}^{*}$ is the residual between input $V^{t+1}$ and its optimal solution $U^{*}$ . Since only the residual is solved, we call our method as Residual Solver (RS). And the residual term is defined as

R=-\lambda\beta\nabla^{T}\vec{b}^{*}\,.

(16)

Clearly, the residual $R$ is unique if $U^{t+1}$ is unique for given $V^{t+1}$ . This algorithm is summarized in Algorithm 1.

Our RS is computationally efficient. In RS, during the iteration in Eq. 14, only one vector field $\vec{b}$ has to be updated. And the two computation operations, $cut$ and $I-\lambda\beta\nabla\nabla^{T}$ can be efficiently performed, making RS faster. This is numerically confirmed in later section.

Algorithm 1 Residual Solver

1:input

V^{t+1}

, initial

\vec{b}^{0}=0

\beta

, Iteration

N

2:compute

\nabla V^{t+1}

3:for

in=0

N-1

do,

\vec{b}^{in+1}=cut(\nabla V^{t+1}+(1-\lambda\beta\nabla\nabla^{T})\vec{b}^{in},\beta)

4:end for

U^{t+1}=V^{t+1}+(-\lambda\beta\nabla^{T}\vec{b}^{N})

6:output

U^{t+1}

III-A Uniqueness of $\vec{b}$

To show that our RS is numerical stable, we need to find a unique vector field $\vec{b}$ for the unique residual $R$ . It is well-known that Eq. 9 is convex [1]. Therefore, its optimal solution $U^{t+1}$ is unique, leading to the uniqueness of the residual $R=-\lambda\beta\nabla^{T}\vec{b}^{*}$ . Although the vector field $\vec{b}^{*}$ is not unique, it belongs to a unique set in terms of divergence $\nabla^{T}$ equivalency

{\cal S}=\left\{\vec{b}~{}|~{}\nabla^{T}\vec{b}=\nabla^{T}\vec{b}^{*}\,,\|\vec{b}\|_{2}<+\infty\right\}\,.

(17)

This set $\cal S$ is convex because $\nabla^{T}$ is a linear operation. Let $0\leq\beta\leq 1$ be a scalar. Then, for any $\vec{b}_{1}$ and $\vec{b}_{2}$ in this set, since $\nabla^{T}(\beta\vec{b}_{1}+(1-\beta)\vec{b}_{2})=\nabla^{T}\vec{b}^{*}$ , we have

\nabla^{T}(\beta\vec{b}_{1}+(1-\beta)\vec{b}_{2})\in{\cal S}=\left\{\vec{b}~{}|~{}\nabla^{T}\vec{b}=\nabla^{T}\vec{b}^{*}\right\}\,.

(18)

This result indicates that this unique set is a convex set.

Since the ground truth image $U$ has bounded intensity values (usually $U(\vec{x})\in[0,1])$ or $U(\vec{x})\in[0,255])$ ), $\nabla^{T}\vec{b}^{*}$ is also bounded ( $<+\infty$ ). Therefore, we can require $\|\vec{b}\|_{2}<+\infty\,,\forall\vec{b}\in{\cal S}$ .

Thank to the bounded norm and convexity, we can obtain a unique solution if we add an additional constraint, such as minimizing $||\vec{b}||_{1}$ . More specifically, we have

\vec{b}_{1}^{*}=\arg\min_{\vec{b}}||\vec{b}||_{1},~{}s.t.~{}\vec{b}\in{\cal S}\,.

(19)

Thanks to the convexity of $\cal S$ , $\vec{b}_{1}^{*}$ is unique.

In practice, we always initialize $\vec{b}^{0}=0$ . This initialization leads to the unique $\vec{b}_{1}^{*}$ since it is the “minimum” vector field in this convex set. We observed that the same input always find the same vector field $\vec{b}_{1}^{*}$ in practice.

The convexity of the solution set and uniqueness of vector field $\vec{b}_{1}^{*}$ provide the theoretical guarantees for our RS. They also reveal the numerical stability of our RS, which is confirmed in following experiment on BSDS500 data set.

III-B Convergence to the Optimal Solution

Our RS can reach the optimal solution. We run RS and FS with 200 iterations on 500 natural images from BSDS500 data set. Then we compute their final loss function value for each image, respectively. For the same image, we compute the loss function value difference. The distribution of such differences for all 500 images is shown in Fig. 2 (a). The average difference is about $-2\times 10^{-3}$ . The negative values indicate that our residual solver always reaches lower energy level for these images. The results confirm that RS can reach the optimal solution.

We also compute the SSIM between the original and processed images. For FastSolver and RS, the SSIM difference is small ( $\approx 10^{-8}$ ) and its distribution is shown in Fig. 2 (b).

In addition, our RS runs faster. The average running time of RS and FastSolver for the 500 color images on the same hardware with the same iteration number is 0.89 seconds and 1.07 seconds, respectively. This indicates that our RS can reduce about $17\%$ of the running time.

III-C Properties

This new residual solver has three important properties:

•

implicit: Our RS does not require to compute $U^{t}$ during the iteration. In contrast, $U^{t}$ has to be explicitly computed in Eq. 11 for each iteration in FastSolver.
•

gradient domain: Our RS solves the problem in the gradient domain, which is different from previous methods that work in the intensity domain.
•

fast: In Eq. 14, $\nabla V^{t+1}$ is fixed during the inner iterations. Therefore, $\nabla V^{t+1}$ can be computed before the loop, which reduces about $17\%$ running time in practice.

IV Residual Solver Network

In this paper, we propose a new network, based on the residual solver in the previous section. Inspired by the neural network development in the past few years (Section I-C), we unfold the RS algorithm into a neural network, called RSnet, which can be considered as an “enhanced” version of our iterative algorithm.

The enhancement comes from two aspects. First, there are only two components $\vec{b}=(b_{x},b_{y})$ in RS, covering the $x$ and $y$ axis direction respectively. In contrast, RSnet uses more channels (eight, sixteen or thirty two), which cover more orientations (including the $x$ and $y$ directions). Second, the operations (gradient and divergence) in RS are independent from the input. In contrast, these operations in RSnet are learned from training data set and thus more adaptive in solving this problem. Therefore, RS iterative algorithm can be considered as a special case of RSnet with two channels and fixed convolution kernels. And RSnet is an “enhanced” version of RS.

IV-A Network Architecture

First, the gradient operator for $V$ is relaxed to a convolution kernel in our neural network. And this convolution kernel is shared among each unfolding block. Second, the linear operator $I-\lambda\beta\nabla\nabla^{T}$ ( $I$ is the identity matrix) for $\vec{b}^{t}$ in Eq. 14 is relaxed to a convolution kernel. But it is not shared for the network blocks in order to improve the neural network’s capacity. Finally, the $-\lambda\beta\nabla^{T}$ operator in Eq. 15 is relaxed to a convolution kernel. The network architecture is shown in Fig. 3. One block corresponds to one iteration in RS.

In this network, $V$ is the input for our neural network. The convolution of $V$ is shared for all blocks, which is shown in shaded blue. The step symbol indicates the $cut$ activation function, which is implemented by clip operation. The $\otimes$ indicates a convolution kernel that is not shared between blocks. The initial $\vec{b}^{0}=0$ is the same as the iterative algorithm. Different from previous related neural networks, our network does not adopt any Batch Normalization. Therefore, it can be trained on low resolution images but applied on high resolution images.

The proposed RSnet has three important features:

•

Our network is unsupervised. It does not require the ground truth images. It is suitable for practical applications where the ground truth is unknown, such as biomedical image reconstruction.
•

Our network can be trained on low resolution images but applied on high resolution images. This property is important for training process on hardware with limited memory. Our designed loss function is independent from image resolution and our network does not contain any batch normalization nor drop out layer.
•

The result from our network is comparable with the counterpart from classical iterative algorithms. As shown in Section IV-B, we can set the parameter in our loss function for this network such that results from our method are comparable with the counterpart from classical iterative algorithms.

IV-B The Loss Function

Here, we propose a loss function that is independent from the image resolution. Although Eq. 2 is well-known from the literature, it depends on the resolution of input image because the $\lambda$ is for the whole image rather than for each pixel. To relax this constraint, we propose to use following loss function (normalized version) for our neural network

\min_{U}\left\{\frac{\frac{1}{2}\|AU(\vec{x})-f(\vec{x})\|_{2}^{2}}{\int_{\vec{x}}1\mathrm{d}\vec{x}}+{\lambda}\frac{TV(U(\vec{x}))}{\int_{\vec{x}}1\mathrm{d}\vec{x}}\right\}\,.

(20)

This loss function is independent from the image resolution. Therefore, this loss function is comparable for different resolution images. In contrast, Eq. 2 is NOT comparable for images with different resolution. When we compare the energy levels between iterative algorithms and our neural network method, we simply scale them by the total number of pixels ( $\int_{\vec{x}}1\mathrm{d}\vec{x}$ ). With this loss function and our network architecture, our neural network can be trained on low resolution images but applied on high resolution images. This property is different from traditional neural networks that involve batch normalization layers or drop out layers. They can only deal with the same resolution images (training and inference images must have the same resolution).

IV-C Training on Natural Images

We randomly crop 10,000 image patches with $128\times 128$ resolution from 500 natural images in the BSDS dataset [23]. Each patch is turned into gray scale. Eighty percent of these patches are used as training set and the rest is the validation set. During training, we fix ${\lambda}=10$ , all kernel size is set to $3\times 3$ , learning rate is 0.0003, batch size is 64, and the epoch number is 3000. We increase the number of block from 3 to 20 and set the number of channels as 8, 16 and 32, respectively.

After training, the final loss function values are shown in Fig. 4. When increasing the number of blocks or the number of channels, the loss is reduced after training. After some value, increasing the them does not improve the result anymore. As comparison, the loss function value for the optimal solutions from FastSolver is 126. This result shows that our neural network can reach the optimal results.

Keeping in mind the complexity of network and the effectiveness, we fix the number of block as 10 and the number of channel as 16 in the rest of our paper. This configuration leads to 21K learnable parameters. Such small number of parameters can be easily trained with a modern GPU hardware.

The loss function values are compared with the counterpart from FastSolver. The distribution of their difference is shown in Fig. 5 (a). We further compare their SSIM difference and the distribution is shown in Fig. 5 (b). RSnet keeps more structural information (Fig. 5 (b)).

Some learned convolution kernels are visualized in Fig. 6. These kernels already contain the classical gradient and divergence operators. Therefore, this network is an enhanced version of our iterative algorithm that only has gradient and divergence operators. As a result, this network has better performance in some applications (Section V-B).

RSnet takes only 0.064 second for each color image with 481 $\times$ 321 resolution from BSDS500 data set on a GTX 1080 Ti GPU card. In contrast, the average running time of RS and FastSolver is 0.89 seconds and 1.07 seconds, respectively.

IV-D Embedding RSnet into Other Networks

We have shown the residual solver and its unfolding RSnet that can efficiently solve the bottleneck problem in TV regularized models.

In this section, we show how to embed RSnet into a larger network that can solve the general total variation regularized models (Eq. 2). More specifically, we unfold the dual first iteration algorithm (Eq. 7 to Eq. 9) into a neural network, which contains RSnet as a sub network for Eq. 9. The network structure is shown in Fig. 7.

V Applications

In this section, the proposed RS and RSnet are applied in three scenarios, image smoothing, image denoising and ultrasonic image reconstruction. In image smoothing, we show that RS and RSnet can reach the optimal solution as the classical FastSolver [22]. This experiment also confirms that RSnet can be trained on low resolution images ( $128\times 128$ ) but used on other resolution images ( $321\times 481$ in this case). In image denoising, we show that RSnet can get better results because it is an enhanced version of RS. In this scenario, we directly adopt the already trained RSnet from the image smoothing problem and apply it on noisy images, showing the transferring ability of RSnet. In the ultrasonic image reconstruction, we show that RS and RSnet can be used for a global operation, not limited in local operations, such as smoothing and denoising. In all these scenarios, we compared the RS and RSnet results with the counterpart from the FastSolver [22]. In all these applications, we use Eq. 20 as the loss function and set ${\lambda}=10$ . All the experiments are performed on a GTX 1080 Ti GPU.

V-A Image smoothing

In image smoothing, we show that RS and RSnet can reach the optimal solution as FastSolver [22]. This experiment also confirms that RSnet can be trained on low resolution images but applied on other resolution images.

In image smoothing, the imaging matrix $A$ is simply the identity matrix. We apply FastSolver, RS or RSnet on 500 color images from BSDS dataset. And the algorithms perform on each color channel separately. The iteration number of FastSolver and RS is set to 200. The RSnet is trained on 10,000 image patches with resolution $128\times 128$ , but applied on $321\times 481$ or $481\times 321$ resolution images. Two examples from the data set are shown in Fig. 8. As can be told from Fig. 8, the smoothing results are comparable. To evaluate the quality of these results, structural similarity index measure (SSIM) is used to calculate the similarity between the input and the output images [24].

For these two examples, we also calculate the energy for both methods with first 20 iterations. The results are shown in Fig. 9. RS converges faster than FastSolver at the first few iterations. We observed that this fact is also true for other images from the BSDS500 data set. This is because the operations in RS are independent from the image content.

V-B Image denoising

In this experiment, we show the transferring ability of RSnet. We train our network on the noise free image patches with resolution $128\times 128$ , but apply the trained network on noisy images with higher resolution. This indicates that the learned RSnet can be transferred for other tasks. We add noises with Gaussian distribution ( $N(0,15^{2})$ ) on the images from BSDS dataset. We apply FastSolver, RS and RSnet on these noisy images, respectively.

Two examples are shown in Fig. 10. As can be seen in Fig. 10, the results of FastSolver and RS are comparable but the result from RSnet is better. PSNR is calculated to show this quantitatively. There are two reasons for this. First, RSnet is an enhanced version of RS because RSnet has multi channels (16 in our case) while RS only has two channels (gradient operator). Second, RSnet was trained on the noise free images, which make the learned parameters produce similar results.

We evaluate the denoising performance on 500 images of the dataset. In Fig. 11(a) and (b), we show the difference of the converged energy between RS, RSnet and FastSolver, respectively. We show the difference of their PSNR in Fig. 11 (c) and (d), respectively . As can be seen, RS can converge to a smaller energy than FastSolver. Although RSnet gets higher energy, it can reach better PSNR. In this application, RSnet has the best performance due to the two reasons mentioned above.

V-C Ultrasonic image reconstruction

In previous experiments, we have shown the performance of RS and RSnet on the problems where pixels are only affected by their local neighbors. In this section, we use our RS and RSnet to reconstruct ultrasonic images where each pixel has an influence on all other pixels (global behavior).

The imaging principle of ultrasound is shown in Fig. 12 (a) and the imaging domain is discretized, shown in Fig. 12 (b). The average speed of sound in human tissue $S$ is 1540 meters per second. We denote the contrast speed of sound image as $C(\vec{x})$ and then the speed of sound is $S+C(\vec{x})$ . Letting $P(\vec{x})$ denote the sound wave traveling path, we can calculate its traveling time by

T=\int\limits_{\vec{x}\in P}\frac{\mathrm{d}P(\vec{x})}{S+C(\vec{x})}\,.

(21)

Letting $U=\frac{1}{S+C(\vec{x})}-\frac{1}{S}$ be the time delay compared with the average speed $S=1540$ , we can get the total time delay

f=\int\limits_{\vec{x}\in P}(\frac{\mathrm{d}P(\vec{x})}{S+C(\vec{x})}-\frac{\mathrm{d}P(\vec{x})}{S})=\int\limits_{\vec{x}\in P}U\mathrm{d}P\,.

(22)

In practice, $\mathrm{d}P$ can be discretized on a mesh grid to form an imaging matrix $A$ (see Fig. 12 (b)). And the time delay $f$ can be measured by an ultrasound device.

To recover the $U$ (and $C$ ), we minimize following model

\frac{1}{2}\|AU-f\|_{2}+\lambda(w\|\nabla_{x}U\|_{1}+(1-w)\|\nabla_{y}U\|_{1})\,,

(23)

where $0<w<1$ is a weight parameter since the uncertainty of each pixel along $x$ and $y$ axes are different [25]. We set $w=0.9$ according to this work.

We use FastSolver, RS and RSnet to solve this problem, respectively. We generate 500 simulated images as ground truth. To generate one image, we first generate a random region number $n$ and then generate $n$ random seeds in 2D. We use Voronoi diagram to obtain $n$ regions from these seeds. Then, we give each region a random number in $[0,255]$ . For such image, according to the discretization in Fig. 12 (b), we obtain the corresponding observations.

Our RSnet is trained on the same 10,000 natural image patches with $128\times 128$ resolution (the same data set as described in previous sections), but the loss function is changed to Eq. 23. The learning rate, batch size, epoch, etc. are the same as described in previous sections. This also shows the transfer property of RSnet since the training data set is natural images but the testing data set is piecewise constant biomedical images.

Two examples are shown in Fig. 13, including the ground truth, observation and the reconstruction results from FastSolver, RS and RSnet, respectively. The reconstructed results from FastSolver and RS are almost the same since they reach the same global optimal solution. To evaluate the quality of the results, PSNR and SSIM are calculated and the values are shown in Fig. 13.

The reconstruction results have some issue at left and right boundaries, which is indicated by red arrows. The reason is that the ultrasound device does not have enough samples in the left and right boarder regions. This issue is not caused by solving algorithms but is the limitation from hardware. In practice, such boundary regions should be removed from the reconstructed images and the central region is the main concern.

In this application, RSnet can achieve real time performance, much faster than the other two iterative algorithms because of the small number of blocks. Therefore, RSnet can be used in real-time imaging in our future work.

VI Conclusion

We have developed a new algorithm for solving the total variation models, called residual solver. It works in gradient domain and implicitly optimizes the model. This new algorithm runs faster than the well-known fast solver [22] and can reach the same global optimal solution. Its efficiency and effectiveness has been numerically confirmed by several experiments, including image smoothing, denoising and biomedical image reconstruction.

Moreover, we unfold the new algorithm into a neural network. Our network is unsupervised and thus can be trained without knowing the ground truth. Our network does not contain any batch normalization or drop out layers. Therefore, it can be trained on low resolution images but applied on high resolution images. Our numerical experiments have confirmed this property. The results from our network can be compared with the counterpart from iterative algorithms because they use the same loss function.

A small number of blocks in our neural network can reach the global optimal solutions. This property is different from iterative algorithms, such as FastSolver and RS, which require a large number of iterations to converge. Thanks to the high performance, our network is suitable for many real-time applications in practice. Such high performance might lead to real-time high resolution video processing algorithms.

We believe that the proposed RS and RSnet can be applied on a large range of image processing problems for several reasons. First, the loss function is well established in the field and has been shown successful in the past decades. Therefore, the proposed solvers can be applied on a large range of applications. Second, our RS and RSnet can reach the global optimal solutions, as confirmed in this paper. RS is suitable for tasks with high accuracy requirement. RSnet is suitable for real-time applications.

In future work, we will investigate their applications in the tasks where total variation is used as regularization, for example, image deblurring [26], inpainting [27], image smoothing [5], super resolution, image curvature optimization [4], dehazing [28], image fusion, motion estimation [29], computed tomography in X-ray imaging [30], etc.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant 61907031.

References

[1] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D, vol. 60, no. 1, pp. 259–268, 1992.
[2] K. He, J. Sun, and X. Tang, “Guided image filtering,” ECCV 2010, pp. 1–14, 2010.
[3] Y. Gong, “Bernstein filter: A new solver for mean curvature regularized models,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 1701–1705.
[4] Y. Gong and I. F. Sbalzarini, “Curvature filters efficiently reduce certain variational energies,” IEEE Transactions on Image Processing, vol. 26, no. 4, pp. 1786–1798, 2017.
[5] Y. Gong and O. Goksel, “Weighted mean curvature,” Signal Processing, vol. 164, pp. 329 – 339, 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0165168419302282
[6] Y. Gong and I. F. Sbalzarini, “Local weighted Gaussian curvature for image processing,” Intl. Conf. Image Proc. (ICIP), pp. 534–538, 2013.
[7] Y. Gong, Q. Wang, C. Yang, Y. Gao, and C. Li, “Symmetry detection for multi-object using local polar coordinate,” Lecture Notes in Computer Science, vol. 5702, p. 277, 2009.
[8] D. Krishnan and R. Fergus, “Fast image deconvolution using hyper-Laplacian priors,” Adv. Neural Inform. Proc. Sys., vol. 22, pp. 1–9, 2009.
[9] Y. Gong and I. F. Sbalzarini, “Image enhancement by gradient distribution specification,” in In Proc. Workshop ”Emerging Topics in Image Enhancement and Restoration”, 12th Asian Conference on Computer Vision, Singapore, Nov 2014, pp. w7–p3.
[10] Y. Gong and I. Sbalzarini, “A natural-scene gradient distribution prior and its application in light-microscopy image processing,” Selected Topics in Signal Processing, IEEE Journal of, vol. 10, no. 1, pp. 99–114, 2016.
[11] T. F. Chan, G. H. Golub, and P. Mulet, “A nonlinear primal-dual method for total variation-based image restoration,” SIAM J. Sci. Comput., vol. 20, no. 6, pp. 1964–1977, May 1999. [Online]. Available: http://dx.doi.org/10.1137/S1064827596299767
[12] A. Chambolle and T. Pock, “A first-order primal-dual algorithm for convex problems with applications to imaging,” Journal of Mathematical Imaging and Vision, vol. 40, pp. 120–145, 2011.
[13] B. Wahlberg, S. Boyd, M. Annergren, and Y. Wang, “An admm algorithm for a class of total variation regularized estimation problems,” 2012.
[14] T. Goldstein and S. Osher, “The split Bregman method for L1-regularized problems,” SIAM J. Imaging Sci., vol. 2, no. 2, pp. 323–343, 2009.
[15] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM journal on imaging sciences, vol. 2, no. 1, pp. 183–202, 2009.
[16] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 399–406.
[17] J. Zhang and B. Ghanem, “Ista-net: Interpretable optimization-inspired deep network for image compressive sensing,” 2017.
[18] J. Adler and O. Öktem, “Learned primal-dual reconstruction,” IEEE Transactions on Medical Imaging, 2018.
[19] J. Sun, H. Li, Z. Xu et al., “Deep admm-net for compressive sensing mri,” in Advances in Neural Information Processing Systems, 2016, pp. 10–18.
[20] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Transactions on Image Processing, vol. 26, no. 9, pp. 4509–4522, 2017.
[21] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, 2011.
[22] R.-Q. Jia and H. Zhao, “A fast algorithm for the total variation model of image denoising,” Advances in Computational Mathematics, vol. 33, no. 2, pp. 231–241, 2010.
[23] P. Arbelaez, M.Maire, C. Fowlkes., and J. Malik, “Contour detection and hierarchical image segmentation.” IEEE TPAMI, 33:898–916, May 2011.
[24] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” Image Processing, IEEE Transactions on, vol. 13, no. 4, pp. 600–612, 2004.
[25] S. J. Sanabria and O. Goksel, “Hand-held sound-speed imaging based on ultrasound reflector delineation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 568–576.
[26] Y. Gong, B. Liu, X. Hou, and G. Qiu, “Sub-window box filter,” in Proc. IEEE Visual Communications and Image Processing (VCIP), Dec. 2018, pp. 1–4.
[27] H. Yin, Y. Gong, and G. Qiu, “Side window filtering,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[28] ——, “Side window guided filtering,” Signal Processing, vol. 165, pp. 315 – 330, 2019.
[29] Y. Gong, “Mean curvature is a good regularization for image processing,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 8, pp. 2205–2214, Aug. 2019.
[30] Y. Gong, H. Yin, J. Liu, B. Liu, and G. Qiu, “Soft tissue removal in x-ray images by half window dark channel prior,” in 2019 IEEE International Conference on Image Processing (ICIP), Sep. 2019, pp. 3576–3580.