

Send correspondence to Jonathan S. Kent
E-mail: [email protected]

Unsupervised Learning for Target Tracking and Background Subtraction in Satellite Imagery

Jonathan S. Kent, Charles C. Wamsley, Davin Flateau, and Amber Ferguson
Ball Aerospace & Technologies Corp., 2875 Presidential Drive, Fairborn, OH, USA 45324
Abstract

This paper describes an unsupervised machine learning methodology capable of target tracking and background suppression via a novel dual-model approach. “Jekyll” produces a video bit-mask describing an estimate of the locations of moving objects, and “Hyde” outputs a pseudo-background frame to subtract from the original input image sequence. These models were trained with a custom-modified version of Cross Entropy Loss.

Simulated data were used to compare the performance of Jekyll and Hyde against a more traditional supervised Machine Learning approach. The results from these comparisons show that the unsupervised methods developed are competitive in output quality with supervised techniques, without the associated cost of acquiring labeled training data.

keywords:
Unsupervised Learning, Satellite Imagery, Target Tracking, Machine Learning

1 INTRODUCTION

Target tracking and background subtraction are central tasks in the field of satellite imagery analysis [1, 2, 3], made more difficult by the existence of background motion, weather patterns, and both static and transient sensor artifacts. However, manual segmentation of available data remains costly in time, effort, and money [4], and transfers that cost to supervised learning-based projects, which require a set of manually created targets to train on. Therefore, despite the increased theoretical complexity required, it is fruitful to develop unsupervised algorithms and model architectures capable of operating in this area, which do not require explicit targets and therefore are not burdened by the cost of creating them [5]. The solution proposed here involves developing two networks, called "Jekyll" and "Hyde" [6], which respectively produce a target tracking bit-mask over a series of frames and an approximation of a background image. (In the book "The Strange Case of Dr. Jekyll and Mr. Hyde," Dr. Jekyll and Mr. Hyde are two halves of the same person, and a recurring theme involves Dr. Jekyll using his wealth and influence to mask over the crimes committed by Mr. Hyde, while they are investigated by Mr. Utterson. Similarly, Jekyll produces a bit-mask at cost to itself that covers the errors generated by Hyde, and the pair is compared to a supervised model named Utterson.) These two outputs are compared to the input used to generate them, with a new loss function defined to express the difference between the Hyde-generated pseudo-background and every frame of the input, masked out by the presence of an object detection by Jekyll.

The data domain in use for these experiments consists of Earth-imagery frames that change over time, i.e. “video” from an infrared “camera”, containing targets of interest which are spatially unresolved. These targets are observed by the sensors with variable Signal-to-Noise ratios.

The data used for the models and experiments in this work were simulated using the Air Force Institute of Technology (AFIT) Sensor and Scene Emulation Tool, or ASSET [7, 8]. However, the work presented in this paper focuses solely on developing an initial approach to an unsupervised target tracking step, rather than providing a complete solution for target tracking and background subtraction, robust to background motion, weather, and sensor artifacts. To facilitate this, those confounding effects were left out of the data generated using ASSET. Further work will be required to address them for real world applications.

The actual data-set generated consists of two $500\times 500\times 500$ (500 frames, each 500 pixels in width and height) data tensors: an input data tensor containing Gaussian-scaled [9] infrared intensity "image" or "video" data including backgrounds and target objects, and a synthetic target data tensor consisting of a corresponding pixel-by-pixel binary target mask [10] denoting the locations of target objects. (Here, Gaussian scaling of a data-set $\mathcal{D}_u$ into $\mathcal{D}_s$ consists of $\mathcal{D}_s \leftarrow \frac{\mathcal{D}_u - \mu(\mathcal{D}_u)}{\sigma(\mathcal{D}_u)}$, which gives $\mathcal{D}_s$ a mean of 0 and a standard deviation of 1. Where $\mathcal{D}_u$ may have values in the hundreds of thousands, which would break the numerics of neural network processing, $\mathcal{D}_s$ will be constrained relatively close to 0.) The synthetic target data is used for reporting numerical and graphical results, as well as to train a supervised model that can be used as a benchmark for the unsupervised models. These tensors were carved into samples consisting of 16 frames of $64\times 64$ video, which were separated into Training, Validation, and Testing subsets, with 50%, 20%, and 30% of the samples respectively, to avoid over-fitting during experimentation [11].
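For concreteness, the following is a minimal sketch of the Gaussian scaling and sample slicing described above. The array names, the stand-in data, and the non-overlapping slicing scheme are assumptions for illustration; only the scaling formula, the $16\times 64\times 64$ sample size, and the 50/20/30 split come from the text.

```python
import numpy as np

def gaussian_scale(data):
    """Scale a data-set to zero mean and unit standard deviation."""
    return (data - data.mean()) / data.std()

def slice_samples(video, n_frames=16, width=64, height=64):
    """Carve a (T, H, W) tensor into non-overlapping (n_frames, height, width) samples."""
    t, h, w = video.shape
    samples = []
    for ti in range(0, t - n_frames + 1, n_frames):
        for hi in range(0, h - height + 1, height):
            for wi in range(0, w - width + 1, width):
                samples.append(video[ti:ti + n_frames, hi:hi + height, wi:wi + width])
    return np.stack(samples)

# stand-in array; the paper's actual tensors are 500x500x500
intensity = np.random.rand(128, 128, 128).astype(np.float32)
samples = slice_samples(gaussian_scale(intensity))

# 50% / 20% / 30% split into Training, Validation, and Testing subsets
rng = np.random.default_rng(0)
idx = rng.permutation(len(samples))
n_train, n_val = int(0.5 * len(idx)), int(0.2 * len(idx))
train = samples[idx[:n_train]]
val = samples[idx[n_train:n_train + n_val]]
test = samples[idx[n_train + n_val:]]
```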

2 RELATED WORK

Modern developments in unsupervised learning were kick-started by the development of the Generative Adversarial Networks (GAN) framework [12], which was one of the first to introduce a methodology involving multiple networks “playing against” one another, hence adversarial networks. Typically, GANs are employed in the task of generating new samples from a high-dimensional distribution, by feeding a generator network a random vector and having a discriminator network attempt to determine which of a collection of samples have been generated artificially. Somewhat similar in nature to the GAN is the Variational Auto-Encoder (VAE) [13], which attaches the output of an encoding network directly to the input of a decoding network, and trains both according to reconstruction loss. In the case of VAEs, the two networks share a loss function and are encouraged to work together to minimize it. The use of multi-network systems for unsupervised learning inspired the approach used in this paper.

Previous work involving VAEs has shown that it is possible to perform image segmentation in an unsupervised setting [14]. This was achieved by minimizing both the reconstruction error of the VAE and the normalized cut [15] of the encoded state. Although image segmentation and target tracking possess theoretical similarities, in an unsupervised setting segmentation algorithms tend to segment the background into, for example, mountains and valleys, rather than separating the background from target objects. [16] introduced an extremely clever unsupervised target tracking scheme that achieved high accuracy. It produces a single target track forwards and backwards through time and takes the differences between the two tracks as its loss, assuming the only consistent solution is to track a moving target, with regularization used to rule out staying put. Unfortunately, it is unsuitable for analyzing satellite imagery, as it can neither track multiple objects simultaneously nor render the size of detected objects.

Both [17] and [18] accomplished unsupervised object detection using classical methods: [17] leverages algorithmic clustering of similar arrangements of pixels, and [18] uses a modified form of the Winnow algorithm [19]. Both predate current Computer Vision research, and as a result do not leverage the capabilities of neural Machine Learning, instead producing much less powerful linear classification models akin to a single convolutional layer. Both algorithms are also limited by their problem formulation, in that they are intended to be used with a single background; e.g. a separate model would be developed for each camera in a security system because each looks at a different scene around a house, despite all being used for the same task. This leaves them incapable of generalization, and both assume that unusual deviation from the mean value of a pixel implies an object of interest. However, by abstracting much of the algorithmic work of [17] or [18] to the Machine Learning model, their shared assumption that unusual deviation implies interest can be reused in designing a loss function.

3 Model Formulation and Loss Function

3.1 Formulation

Both Jekyll and Hyde are formulated as feed-forward, three-dimensional convolutional neural networks (CNNs) [20, 21, 22], implemented with PyTorch [23, 24]. Due to the nature of the satellite sensor, the imagery used represents an approximate intensity mapping, scaled according to mean and standard deviation, rather than color photography, and is therefore treated as single-channel rather than multi-channel imagery. Both Jekyll and Hyde accept a $K\times 1\times N\times W\times H$ tensor as their input, where $K$ is the number of samples in the batch, $N$ is the number of frames in a sample, and $W\times H$ describes the width and height of the input image.

Both models employ a number of skip/highway connections [25] in their architecture, concatenating the hidden states from the $n^{th}$ layer in the first half of the network with the inputs to the $n^{th}$-from-the-last layer in the second half of the network. This helps to combat gradient degradation in the early layers, as well as giving the later layers access to more primitive data. The structure of these networks can be analogized to an hourglass, skip connections notwithstanding: the first half of the network increases the number of filters and decreases the width and height of the hidden state, and the second half reverses the process by decreasing the number of filters and increasing the width and height. The first half consists of convolution and max-pool operations, and the latter half of transpose convolutions and max-unpool operations, with ReLU activations [26] between convolution-type operations. This is somewhat similar to the model architectures described in [27, 28, 29, 30], but simplified and using 3D convolutions.
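As an illustration of this hourglass structure, the following PyTorch sketch shows a single-level version with one skip connection. The class name, layer counts, filter counts, and kernel sizes are assumptions for brevity; the paper does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Hourglass3D(nn.Module):
    """Single-level 'hourglass' body: a 3D convolution and max-pool going down,
    a max-unpool and transpose convolution coming back up, with a skip connection
    concatenating the early features into the decoder."""
    def __init__(self, base=8):
        super().__init__()
        self.enc = nn.Conv3d(1, base, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(2, return_indices=True)       # keep indices for unpooling
        self.mid = nn.Conv3d(base, base, kernel_size=3, padding=1)
        self.unpool = nn.MaxUnpool3d(2)
        self.dec = nn.ConvTranspose3d(2 * base, base, kernel_size=3, padding=1)
        self.out = nn.Conv3d(base, 1, kernel_size=3, padding=1)

    def forward(self, x):                          # x: (K, 1, N, W, H)
        e = F.relu(self.enc(x))                    # early features, kept for the skip connection
        p, idx = self.pool(e)
        m = F.relu(self.mid(p))
        u = self.unpool(m, idx, output_size=e.shape)
        d = F.relu(self.dec(torch.cat([u, e], dim=1)))  # skip: concatenate early features
        return self.out(d)                         # (K, 1, N, W, H); output activation added below
```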

The only architectural difference between Jekyll and Hyde is their output function, applied immediately after the last layer. Jekyll possesses a final Sigmoid activation function [11], turning a hidden state into a tensor of probabilities denoting the estimated likelihood that a given pixel at a given time contains a target object. Hyde's final output is taken as the frame-wise mean over its last layer, producing a single frame to be used as an approximated background.
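Continuing the hypothetical sketch above, the two output functions could be attached as thin wrappers over the shared body (the frame-dimension index is an assumption following the $K\times 1\times N\times W\times H$ layout):

```python
class Jekyll(Hourglass3D):
    """Jekyll: a final Sigmoid turns the last layer into per-pixel, per-frame probabilities."""
    def forward(self, x):
        return torch.sigmoid(super().forward(x))              # (K, 1, N, W, H), values in (0, 1)

class Hyde(Hourglass3D):
    """Hyde: the frame-wise mean over the last layer gives a single pseudo-background frame."""
    def forward(self, x):
        return super().forward(x).mean(dim=2, keepdim=True)   # (K, 1, 1, W, H)
```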

For direct comparison between supervised and unsupervised methods in this domain, a competing supervised model, named "Utterson" [6], is trained against the synthetic target data. Utterson possesses an architecture identical to that of Jekyll, and is tested on the same task as Jekyll: producing a bit-mask denoting the positions of targets of interest. For a loss function, Utterson uses vanilla Binary Cross Entropy loss, as natively supported in PyTorch [11, 23]. Having a supervised model against which to compare the unsupervised models allows for direct analysis of the costs and benefits of using unsupervised instead of supervised learning on this task.
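A minimal sketch of Utterson's training signal, reusing the hypothetical Jekyll class above as its architecture; the stand-in tensors merely illustrate the shapes:

```python
x = torch.randn(2, 1, 16, 64, 64)                    # stand-in input batch
y = (torch.rand(2, 1, 16, 64, 64) > 0.99).float()    # stand-in synthetic target mask
utterson = Jekyll()                                  # identical architecture to Jekyll
loss = nn.BCELoss()(utterson(x), y)                  # vanilla Binary Cross Entropy loss
loss.backward()
```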

3.2 Loss Function

The goal of the loss function is to impel Hyde to produce an approximation of the background of the input image sequence sans moving target objects, and get Jekyll to label the locations of those moving objects without also including portions of the background. This will be accomplished by using a modified form of Cross Entropy Loss [11] along with a cost associated with masking. While Cross Entropy Loss is typically used for classification problems, by formulating target tracking as “classifying pixels between is-a-target and just-the-background,” it becomes applicable.

The typical formulation of Binary Cross Entropy Loss is given as:

\[ -\sum_{x\in\mathcal{X}} Y(x)\,\textbf{ln}\big(\hat{Y}(x)\big) + \big(1-Y(x)\big)\,\textbf{ln}\big(1-\hat{Y}(x)\big) \]

where $Y(x)$ is the true label of $x$, taking the value 1 where $x$ is a target and 0 where $x$ is the background, and $\hat{Y}(x)$ is the probability estimated by the model that $x$ is a target. By using the squared difference between the input and Hyde's pseudo-background as an analogue for $Y(x)$, and replacing the $\big(1-Y(x)\big)\textbf{ln}\big(1-\hat{Y}(x)\big)$ term with a simpler cost term, an alternate version of entropic loss can be used in the unsupervised case. (The replaced term breaks down for $Y(x)>1$, and so requires modification. Its original purpose was to increase loss where the model incorrectly output high values, discouraging it from simply declaring everything to be a target. As will be demonstrated, in this case that functionality can be replicated by assigning a constant cost to outputting any value at all, encouraging the model only to "spend" where it expects to lower the other term by a greater amount.) In addition, Hyde contributes to decreasing loss by producing a background that minimizes the difference between it and the input, except where Jekyll is masking it out. As a result, the same loss can be used to optimize both models. This loss function will now be constructed from its elements.

For the purposes of simplifying notation, the loss function is described as it applies to a single $N\times W\times H$ sample, rather than to the full $K\times 1\times N\times W\times H$ batch input.

The tensor describing the frame-wise linear differential between Hyde's output $H$ and the input $i$, the frames of which are $i_1, i_2, \dots, i_N$:

\[ \hat{\Delta}(\theta_h, i) = \begin{bmatrix} i_1 - H(\theta_h, i), & i_2 - H(\theta_h, i), & \dots, & i_N - H(\theta_h, i) \end{bmatrix} \]

The element-wise square of the linear differential:

\[ \hat{\Delta}^2(\theta_h, i) = \hat{\Delta}(\theta_h, i) \odot \hat{\Delta}(\theta_h, i) \]

The tensor containing the squared differentials, masked out by the negative natural logarithm of Jekyll’s bit-mask:

\[ \hat{\mathcal{L}}_h(\theta_h, \theta_j, i) = -\textbf{ln}\big(J(\theta_j, i) + \epsilon\big) \odot \hat{\Delta}^2(\theta_h, i) \]

(Consider that where $J(\theta_j,i)\approx 1$, $-\textbf{ln}\big(J(\theta_j,i)\big)\approx 0$, so where $J(\theta_j,i)$ is large, the difference $\hat{\Delta}^2(\theta_h,i)$ does not contribute significantly to loss. Alternatively, where $J(\theta_j,i)\approx 0$, $-\textbf{ln}\big(J(\theta_j,i)\big)\gg 0$, meaning that any substantial value in $\hat{\Delta}^2(\theta_h,i)$ will contribute heavily. A small additive $\epsilon = 0.001$ is used to prevent numerical errors.)

The loss term contributed by the masked squared "pseudo-background" error, taken as the element-wise mean of the masked squared differentials:

\[ \mathcal{L}_h(\theta_h, \theta_j, i) = \textbf{mean}\big(\hat{\mathcal{L}}_h(\theta_h, \theta_j, i)\big) \]

Finally, including the mean quantity masked, weighted by a multiplicative hyper-parameter $\alpha$:

\[ \mathcal{L}(\theta_h, \theta_j, i) = \mathcal{L}_h(\theta_h, \theta_j, i) + \alpha \cdot \textbf{mean}\big(J(\theta_j, i)\big) \]

(This "pixel-wise cost" addition to the entropic loss essentially forces the model to make a decision: increase loss via an expenditure associated with masking out more pixels, or increase loss by choosing not to mask out error between the background and the input at a particular location. Thus, the model will be incentivized to mask out error where and only where entropic loss would be expected to exceed $\alpha$, which, with an accurate background estimation, would only be at the location of a moving object.)

This serves as the loss function for both Jekyll and Hyde. The unified loss function allows a single back-propagation operation per training batch, which speeds up computation appreciably.
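Translating the construction above into code, a minimal PyTorch sketch of the unified loss might look as follows (the function name is ours; $\epsilon = 0.001$ and the broadcasting of Hyde's single frame over all $N$ input frames follow the definitions above, and the hypothetical Jekyll and Hyde classes are those sketched earlier):

```python
def jekyll_hyde_loss(x, jekyll_mask, hyde_background, alpha=1.0, eps=1e-3):
    """Unified loss for both models.
    x:               (K, 1, N, W, H) input video
    jekyll_mask:     (K, 1, N, W, H) Jekyll's output J
    hyde_background: (K, 1, 1, W, H) Hyde's pseudo-background H, broadcast over the N frames
    """
    delta_sq = (x - hyde_background) ** 2                  # squared frame-wise differential
    masked = -torch.log(jekyll_mask + eps) * delta_sq      # error contributes little where J ~ 1
    return masked.mean() + alpha * jekyll_mask.mean()      # plus the pixel-wise cost of masking

# single loss, single backward pass for both models
x = torch.randn(2, 1, 16, 64, 64)
jekyll, hyde = Jekyll(), Hyde()
loss = jekyll_hyde_loss(x, jekyll(x), hyde(x))
loss.backward()
```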

4 HYPER-PARAMETERS USED AND TRAINING REGIME

Initially, the simulated data is sliced into samples with $N=16$, $W=64$, $H=64$, and separated into Training, Validation, and Testing subsets with 50%, 20%, and 30% of the samples in them respectively, with a total of approximately 1,000 samples over all three subsets. It was found experimentally that $\alpha=1$ produced high quality results, where $\alpha$ is the multiplicative hyper-parameter associated with assigning pixel-wise cost to masking.

Jekyll, Hyde, and Utterson were each given 100 epochs with PyTorch's native implementation of the Adam optimizer [31, 23]. All three models had a weight decay of 0.01; Hyde and Utterson had a learning rate of $5.0\times10^{-4}$, and Jekyll had a learning rate of $5.0\times10^{-5}$. All other hyper-parameters were left at their defaults. Jekyll's learning rate was made lower than Hyde's because it was found that, learning at the same rate, Jekyll would begin masking out large errors that Hyde produced early in training, with Hyde never learning to correct the errors and Jekyll never being able to stop masking them out. Jekyll and Hyde were trained using the loss function described earlier in this paper, while Utterson was trained with vanilla Binary Cross-Entropy Loss.
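A sketch of this training regime, assuming the hypothetical classes and loss defined earlier; whether the authors used a single optimizer with parameter groups or separate optimizers is not stated, so the single-optimizer arrangement below is an assumption consistent with the single back-propagation per batch:

```python
import torch.optim as optim
from torch.utils.data import DataLoader

jekyll, hyde = Jekyll(), Hyde()
optimizer = optim.Adam(
    [{"params": hyde.parameters(), "lr": 5.0e-4},     # Hyde (and Utterson): 5.0e-4
     {"params": jekyll.parameters(), "lr": 5.0e-5}],  # Jekyll: 5.0e-5
    weight_decay=0.01,
)

# stand-in for the training subset of sliced samples
train_loader = DataLoader(torch.randn(32, 1, 16, 64, 64), batch_size=4)

for epoch in range(100):
    for x in train_loader:
        optimizer.zero_grad()
        loss = jekyll_hyde_loss(x, jekyll(x), hyde(x))
        loss.backward()                               # one back-propagation per batch for both models
        optimizer.step()
```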

5 RESULTS

Numerical results describing the accuracy of the target tracking models, the unsupervised Jekyll and the supervised Utterson, on the testing data-set are reported in Figures 1 and 2. (Due to the nature of the differences in approach to evaluation by the supervised and unsupervised models, Utterson was incapable of producing an output exceeding $\sim 0.45$. To account for this in numerical and graphical comparisons, Utterson's outputs were linearly re-scaled to $(0,1)$, similar to those of Jekyll.) These were obtained for a variety of "threshold" values: for a given threshold $t$, any model output greater than or equal to $t$ is considered a positive prediction (i.e. predicting the presence of a target), and any output less than $t$ a negative prediction (i.e. predicting the absence of a target). As the threshold increases, a model is less likely to be read as predicting the presence of a target, as it must be more "confident" for this to be the case, but it is also less likely to raise a false alarm. The numerical results include: Positive Predictive Value (PPV), the likelihood of a model's positive predictions to be correct; Negative Predictive Value (NPV), the likelihood of a model's negative predictions to be correct; Sensitivity, the likelihood for a model to pick up a particular target pixel; and Specificity, the likelihood for a model to reject a particular non-target pixel. (These numerical results are meant as a rough comparison between models, and as support for the claims made. They do not precisely represent the quality or value of the outputs of the models; e.g. both models tend to over-represent the size of the target objects, as can be seen in the included imagery, lowering the PPV significantly, while still accurately representing the locations and motions of targets.) Of these results, Sensitivity is the most meaningful, as it provides a measurement of the ability of the models to pick up targets.
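The threshold-based metrics can be computed as in the sketch below; the evaluation code itself is not published, so the function and the stand-in tensors are ours, continuing the running example:

```python
def confusion_rates(pred, truth, threshold):
    """PPV, NPV, Sensitivity, and Specificity of a probability map at one threshold."""
    pos = pred >= threshold
    tp = (pos & truth).sum().float()
    fp = (pos & ~truth).sum().float()
    tn = (~pos & ~truth).sum().float()
    fn = (~pos & truth).sum().float()
    return tp / (tp + fp), tn / (tn + fn), tp / (tp + fn), tn / (tn + fp)

pred = jekyll(x).detach()                    # Jekyll's probability output for a batch
truth = torch.rand_like(pred) > 0.99         # stand-in for the synthetic target mask
for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    ppv, npv, sens, spec = confusion_rates(pred, truth, t)
```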

Figure 1: The Positive Predictive Values (PPV) and Negative Predictive Values (NPV) by threshold, on the left and right respectively. PPV is a measure of the likelihood that, if the model has labeled a pixel as containing a target, it actually does contain a target. NPV measures the likelihood that, if the model says a pixel contains no target, it actually does not. The threshold ($t$) determines at what point the numerical output of the model is considered to be a positive prediction; e.g. for $t=0.3$, if the model outputs 0.4 it is considered to be predicting the presence of a target, and if it outputs 0.2 it is read as predicting the absence of a target.
Figure 2: The Sensitivity and Specificity by threshold on the left and right, respectively. Sensitivity measures, of the pixels that do actually contain a target object, the proportion of which that were actually picked up by the model. Specificity measures, of the pixels that do not contain a target object, the proportion that was appropriately ignored by the model.

In addition, imagery is provided in Figures 3, 4, 5, 6, and 7 demonstrating the results. Of the column titles: "Input" refers to the input passed to Jekyll, Hyde, and Utterson; "Hyde", "Jekyll", and "Utterson" each show the respective outputs of those models; "Subtr" shows the result of subtracting Hyde's pseudo-background from the input; and "Label" shows the ground truths, the locations of targets of interest that were synthetically generated alongside the data. The samples shown come from the test subset of the data, and show frames 1, 5, 9, and 13 of a 16-frame sequence, with frame 1 of each sequence at the top of the figure and time increasing in each subsequent row.

It can clearly be seen in the graphical results that the unsupervised approach demonstrated by Jekyll and Hyde is competitive with the supervised approach represented by Utterson, within this particular problem domain on this data-set. This is supported by the similarity of the Sensitivity values between Jekyll and Utterson for thresholds below 0.4.

Figure 3: Below is a figure showing each element of the Jekyll and Hyde formulation: an input series or "video" of "images" containing small moving objects, a static pseudo-background generated by one unsupervised model, "Hyde", a sequence containing the estimated objects detected by another unsupervised model, "Jekyll", Hyde's pseudo-background subtracted from the respective input frame, the results of a supervised model, "Utterson", for comparison, and the true labels/locations of the objects of interest. The first, second, third, and fourth rows relate to the $1^{st}$, $5^{th}$, $9^{th}$, and $13^{th}$ frames of a 16-frame input sequence respectively. In this instance, there are stationary features in this input sequence that closely resemble the target objects. The target tracking models demonstrate both their accuracy and value by ignoring these.
Figure 4: Here, the target tracking models manage to pick up a small, dark, very slowly moving object on a dark background near other dark objects. By looking closely at the locations marked out by the models, one can just barely make out the target, which would be otherwise unnoticed.
Figure 5: This example shows the models’ ability to track a large number of objects at the same time, while also being able to ignore parts of the background that superficially resemble the objects being tracked.
Figure 6: The models are capable not only of accurately tracking moving objects, but also of accurately producing a negative result.
Figure 7: An example of the limitations of the models being used; when two tracked objects come close to one another, the model will track them both as a single misshapen object or "blob", rather than as discrete entities. This is the result of both Jekyll and Utterson giving "slack" to target objects, masking them as larger than they are as a precaution against costly failure to mask something.

6 CONCLUSIONS AND FUTURE WORK

While this formulation of target tracking as an unsupervised learning problem is potentially extremely powerful, this particular approach is limited in scope. On its own, it is not robust to transient sensor artifacts, to background motion, or to weather. It also does not perform any sort of discrimination between targets worth tracking and those not worth tracking, classification of different kinds of targets, or prediction of trajectories.

However, this work does present an important step towards the development of a completely unsupervised pipeline, by allowing for the transformation of imagery into target locations. Artifact removal [32, 33, 34, 35], image stabilization [36, 37, 38], and saliency estimation [39, 40, 41, 42, 43] are all active research areas making great strides, and the coalescing of these under an unsupervised or semi-supervised Machine Learning paradigm would allow for an extremely robust, accurate, scalable, and inexpensive solution for a variety of challenges in satellite imagery analysis.

“A little song, a little dance, a little seltzer down your pants.” -Tim

References

  • [1] Blackman, S. S., [Multiple-target tracking with radar applications ] (1986).
  • [2] Fox, P. J., Liu, J., and Weiner, N., “Integrating out astrophysical uncertainties,” Phys. Rev. D 83, 103514 (May 2011).
  • [3] Kopsiaftis, G. and Karantzalos, K., “Vehicle detection and traffic density monitoring from very high resolution satellite video data,” in [2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) ], 1881–1884 (2015).
  • [4] Rashtchian, C., Young, P., Hodosh, M., and Hockenmaier, J., “Collecting image annotations using amazon’s mechanical turk,” in [Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk ], 139–147 (2010).
  • [5] Hastie, T., Tibshirani, R., and Friedman, J., “Unsupervised learning,” in [The elements of statistical learning ], 485–585, Springer (2009).
  • [6] Stevenson, R. L., Grennell, A., Somerset, R., Young, D., O’Connell, P., and Casson, G., [The Strange Case of Dr. Jekyll and Mr. Hyde ], Didier (1947).
  • [7] Young, S. R., Steward, B. J., and Gross, K. C., “Development and validation of the AFIT scene and sensor emulator for testing (ASSET),” 101780A (May 2017).
  • [8] AFIT, “AFIT Sensor And Scene Emulation Tool,” Air Force Institute of Technology (2020).
  • [9] Zill, D., Wright, W. S., and Cullen, M. R., [Advanced engineering mathematics ], Jones & Bartlett Learning (2011).
  • [10] Iliadis, M., Spinoulas, L., and Katsaggelos, A. K., “Deepbinarymask: Learning a binary mask for video compressive sensing,” arXiv preprint arXiv:1607.03343 (2016).
  • [11] Goodfellow, I., Bengio, Y., and Courville, A., [Deep Learning ], MIT Press (2016).
  • [12] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y., “Generative Adversarial Networks,” arXiv:1406.2661 [cs, stat] (June 2014).
  • [13] Kingma, D. P. and Welling, M., “Auto-Encoding Variational Bayes,” arXiv:1312.6114 [cs, stat] (May 2014).
  • [14] Xia, X. and Kulis, B., “W-Net: A Deep Model for Fully Unsupervised Image Segmentation,” (Nov. 2017).
  • [15] Shi, J. and Malik, J., “Normalized cuts and image segmentation,” IEEE Transactions on pattern analysis and machine intelligence 22(8), 888–905 (2000).
  • [16] Wang, N., Song, Y., Ma, C., Zhou, W., Liu, W., and Li, H., “Unsupervised Deep Tracking,” (Apr. 2019).
  • [17] Ghasemi, A. and Safabakhsh, R., “Unsupervised foreground-background segmentation using growing self organizing map in noisy backgrounds,” in [2011 3rd International Conference on Computer Research and Development ], 1, 334–338 (Mar. 2011).
  • [18] Nair, V. and Clark, J., “An unsupervised, online learning framework for moving object detection,” in [Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004. ], 2, 317–324, IEEE, Washington, DC, USA (2004).
  • [19] Littlestone, N., “Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm,” Machine learning 2(4), 285–318 (1988).
  • [20] Zhang, W., Itoh, K., Tanida, J., and Ichioka, Y., “Parallel distributed processing model with local space-invariant interconnections and its optical architecture,” Applied optics 29(32), 4790–4797 (1990).
  • [21] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D., “Backpropagation applied to handwritten zip code recognition,” Neural computation 1(4), 541–551 (1989).
  • [22] Ji, S., Xu, W., Yang, M., and Yu, K., “3d convolutional neural networks for human action recognition,” IEEE transactions on pattern analysis and machine intelligence 35(1), 221–231 (2012).
  • [23] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al., “Pytorch: An imperative style, high-performance deep learning library,” in [Advances in neural information processing systems ], 8026–8037 (2019).
  • [24] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A., “Automatic differentiation in pytorch,” (2017).
  • [25] Srivastava, R. K., Greff, K., and Schmidhuber, J., “Highway networks,” CoRR abs/1505.00387 (2015).
  • [26] Glorot, X., Bordes, A., and Bengio, Y., “Deep sparse rectifier neural networks,” in [Proceedings of the fourteenth international conference on artificial intelligence and statistics ], 315–323 (2011).
  • [27] Ronneberger, O., Fischer, P., and Brox, T., “U-net: Convolutional networks for biomedical image segmentation,” in [International Conference on Medical image computing and computer-assisted intervention ], 234–241, Springer (2015).
  • [28] Long, J., Shelhamer, E., and Darrell, T., “Fully convolutional networks for semantic segmentation,” in [2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) ], 3431–3440, IEEE, Boston, MA, USA (June 2015).
  • [29] Mao, X.-J., Shen, C., and Yang, Y.-B., “Image restoration using convolutional auto-encoders with symmetric skip connections,” arXiv preprint arXiv:1606.08921 (2016).
  • [30] Quan, T. M., Hildebrand, D. G., and Jeong, W.-K., “Fusionnet: A deep fully residual convolutional neural network for image segmentation in connectomics,” arXiv preprint arXiv:1612.05360 (2016).
  • [31] Kingma, D. P. and Ba, J., “Adam: A method for stochastic optimization,” arXiv:1412.6980 (2014). Published as a conference paper at the 3rd International Conference on Learning Representations, San Diego, 2015.
  • [32] Winkler, I., Haufe, S., and Tangermann, M., “Automatic classification of artifactual ica-components for artifact removal in eeg signals,” Behavioral and brain functions 7(1), 30 (2011).
  • [33] Allman, D., Reiter, A., and Bell, M. A. L., “A machine learning method to identify and remove reflection artifacts in photoacoustic channel data,” in [2017 IEEE International Ultrasonics Symposium (IUS) ], 1–4, IEEE (2017).
  • [34] Biswas, R., Blackburn, L., Cao, J., Essick, R., Hodge, K. A., Katsavounidis, E., Kim, K., Kim, Y.-M., Le Bigot, E.-O., Lee, C.-H., et al., “Application of machine learning algorithms to the study of noise artifacts in gravitational-wave data,” Physical Review D 88(6), 062003 (2013).
  • [35] Zhang, Y., Haghdan, M., and Xu, K. S., “Unsupervised motion artifact detection in wrist-measured electrodermal activity data,” in [Proceedings of the 2017 ACM International Symposium on Wearable Computers ], 54–57 (2017).
  • [36] Walha, A., Wali, A., and Alimi, A. M., “Video stabilization for aerial video surveillance,” AASRI Procedia 4, 72–77 (2013).
  • [37] Saitwal, K. A., Cobb, W. K., and Yang, T., “Image stabilization techniques for video surveillance systems,” (Jan. 5 2016). US Patent 9,232,140.
  • [38] Deng, H.-B., Jia, Y.-D., Xu, Y.-H., and Liang, W., “An airborne image stabilization method based on projection and the gaussian mixture model,” in [2007 International Conference on Machine Learning and Cybernetics ], 1, 345–349, IEEE (2007).
  • [39] Li, J., Tian, Y., Huang, T., and Gao, W., “Probabilistic multi-task learning for visual saliency estimation in video,” International journal of computer vision 90(2), 150–165 (2010).
  • [40] Tang, Y. and Wu, X., “Saliency detection via combining region-level and pixel-level predictions with cnns,” in [European Conference on Computer Vision ], 809–825, Springer (2016).
  • [41] Xia, C., Qi, F., and Shi, G., “Bottom–up visual saliency estimation with deep autoencoder-based sparse reconstruction,” IEEE transactions on neural networks and learning systems 27(6), 1227–1240 (2016).
  • [42] Hu, Y.-T., Huang, J.-B., and Schwing, A. G., “Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation,” in [Proceedings of the European conference on computer vision (ECCV) ], 786–802 (2018).
  • [43] Aytekin, C., Iosifidis, A., and Gabbouj, M., “Probabilistic saliency estimation,” Pattern Recognition 74, 359–372 (2018).