Tensor GP Regression
1 Bayesian Linear Model
Let $x \in \mathbb{R}^d$ be an input vector and $y \in \mathbb{R}$ be an observed noisy output value. We model the response value as

y = w^\top x + \varepsilon,    (1)

where $\varepsilon \sim \mathcal{N}(0, \sigma_n^2)$.
We can write the predictive distribution of the response vector $\mathbf{y}$ given the weights as

p(\mathbf{y} \mid X, w) = \mathcal{N}(X^\top w, \sigma_n^2 I),    (2)

where $X$ is the $d \times n$ matrix of all training inputs and $\mathbf{y}$ is the vector of response values.
Taking $w$ to have prior distribution $w \sim \mathcal{N}(0, \Sigma_p)$, the posterior distribution of $w$ with respect to the training data is

p(w \mid X, \mathbf{y}) = \mathcal{N}\!\left(\tfrac{1}{\sigma_n^2} A^{-1} X \mathbf{y},\; A^{-1}\right),    (3)

where $A = \sigma_n^{-2} X X^\top + \Sigma_p^{-1}$. Prediction of the true function value $f_* = f(x_*)$ at an unseen input $x_*$ has predictive distribution

p(f_* \mid x_*, X, \mathbf{y}) = \mathcal{N}\!\left(\tfrac{1}{\sigma_n^2} x_*^\top A^{-1} X \mathbf{y},\; x_*^\top A^{-1} x_*\right).    (4)
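As a concrete reference, here is a minimal NumPy sketch of the posterior predictive computation in Equations (3)-(4); the prior covariance, noise level, and variable names are illustrative choices rather than anything fixed above.

    import numpy as np

    def bayes_linreg_predict(X, y, x_star, sigma_n=0.1, Sigma_p=None):
        """Posterior predictive mean and variance of f(x_star), Eqs. (3)-(4)."""
        d, n = X.shape                       # columns of X are the training inputs
        if Sigma_p is None:
            Sigma_p = np.eye(d)              # unit Gaussian prior on the weights
        A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_p)
        A_inv = np.linalg.inv(A)
        w_mean = A_inv @ X @ y / sigma_n**2  # posterior mean of w, Eq. (3)
        return x_star @ w_mean, x_star @ A_inv @ x_star   # predictive mean and variance, Eq. (4)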
2 Linear Models with Input Projections
We can define a projection function $\phi: \mathbb{R}^d \to \mathbb{R}^N$ which maps an input $x$ into an $N$-dimensional feature space. Then we can generalize the linear model to

y = w^\top \phi(x) + \varepsilon.    (5)

We can define Equations (2) and (3) as before, only replacing the matrix of input data $X$ with the matrix of projected inputs $\Phi = \Phi(X)$. We can write the predictive distribution of $f_* = f(x_*)$ as

p(f_* \mid x_*, X, \mathbf{y}) = \mathcal{N}\!\left(\phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \mathbf{y},\; \phi_*^\top \Sigma_p \phi_* - \phi_*^\top \Sigma_p \Phi (K + \sigma_n^2 I)^{-1} \Phi^\top \Sigma_p \phi_*\right),    (6)

where $\phi_* = \phi(x_*)$ and $K = \Phi^\top \Sigma_p \Phi$. Noting that the feature space enters only through terms of the form $\phi(x)^\top \Sigma_p \phi(x')$, the kernel trick can be applied in this setting, and we define $k(x, x') = \phi(x)^\top \Sigma_p \phi(x')$ as a kernel function.
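To make the kernel trick concrete, here is a small self-contained check (with a hypothetical quadratic feature map and $\Sigma_p = I$) that the kernel value can be computed without ever forming the features explicitly.

    import numpy as np

    # Hypothetical quadratic feature map phi(x) = [x1^2, sqrt(2) x1 x2, x2^2]:
    # the inner product phi(x)^T phi(z) equals (x^T z)^2.
    def phi(x):
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    def k(x, z):
        return (x @ z)**2          # kernel evaluated without forming features

    x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
    assert np.isclose(phi(x) @ phi(z), k(x, z))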
3 Gaussian Process
A Gaussian process is a collection of random variables, any finite number of which has a joint Gaussian distribution. It is defined by a mean function $m(x) = \mathbb{E}[f(x)]$ and covariance function $k(x, x') = \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))]$.
We have already seen an example of a Gaussian process: the linear model with input projection $\phi(x)$. To see that this is a Gaussian process, take $f(x) = w^\top \phi(x)$ with $w \sim \mathcal{N}(0, \Sigma_p)$, which gives $m(x) = 0$ and $k(x, x') = \phi(x)^\top \Sigma_p \phi(x')$.
We now consider an additional example given by (mackay1997gaussian). Let $\phi_c(x) = \exp\!\left(-\frac{(x - c)^2}{2\ell^2}\right)$ be a radial basis function centered at $c$. Taking the number of such basis functions to infinity, with centers covering the input space, gives the squared exponential covariance function. So for some covariance functions, we need an infinite number of basis functions. A method for converting from an arbitrary GP to an equivalent linear model is given in (quinonero2005analysis). However, for covariance functions with infinitely many basis functions, the associated linear model will also be infinite. It is important to note that for the conversion to work, there must be a weight associated with each training and test input, which would necessitate an infinite number of weights.
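For reference, a minimal sketch of GP prediction under the squared exponential covariance, assuming a zero mean function and illustrative hyperparameters:

    import numpy as np

    def sq_exp_kernel(A, B, lengthscale=1.0):
        """Squared exponential covariance between the rows of A and B."""
        d2 = ((A[:, None, :] - B[None, :, :])**2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)

    def gp_predict(X, y, X_star, sigma_n=0.1):
        """Posterior mean and covariance of f(X_star) under a zero-mean GP prior."""
        K = sq_exp_kernel(X, X) + sigma_n**2 * np.eye(len(X))
        K_star = sq_exp_kernel(X_star, X)
        mean = K_star @ np.linalg.solve(K, y)
        cov = sq_exp_kernel(X_star, X_star) - K_star @ np.linalg.solve(K, K_star.T)
        return mean, cov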
4 Tensor Regression
The general form for the optimization problem is

\hat{\mathcal{W}} = \arg\min_{\mathcal{W}} \sum_{i=1}^{n} L\big(y_i, \langle \mathcal{X}_i, \mathcal{W} \rangle\big) \quad \text{subject to a low-rank constraint on } \mathcal{W},

where $\mathcal{X}_i$ are the input tensors, $y_i$ the responses, $L$ a loss function, and $\mathcal{W}$ the coefficient tensor. Several forms for the estimate of $\mathcal{W}$ have been studied. They are outlined below.
1. (zhao2011multilinear)
2. (zhou2013tensor)
3. (romera2013multilinear)
4. (bahadori2014fast)
The form of tensor regression used to derive the multi-linear GP is (3) in the list above (yu2018tensor).
5 Multi-linear Gaussian Process
Given a total of $n$ training samples drawn from a set of related tasks, we assume that each data point is drawn i.i.d. according to the probabilistic model

y = f(x) + \varepsilon,    (7)

with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$. The full model concatenates the data from all tasks (yu2018tensor). The kernel of the full model is the Kronecker product of a feature correlation matrix $K_1$, a group correlation matrix $K_2$, and a matrix $K_3$ measuring the correlation between groups. Then,

K = K_1 \otimes K_2 \otimes K_3,    (8)

and the full model is a joint Gaussian over the concatenated response vector, with covariance assembled from (8) together with block diagonal matrices built from the per-task components.
It is shown that when each covariance matrix factors as $K_i = U_i U_i^\top$, where $U_i$ is a low rank orthogonal matrix, the solution which estimates the covariance is also an approximate solution which minimizes the rank of the coefficient tensor in tensor regression. Thus there appears to be an equivalence between the covariance and the parameter tensor (yu2018tensor).
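As an illustration of the Kronecker-structured kernel in (8) with low-rank orthogonal factors, here is a small NumPy sketch; the dimensions and ranks are placeholders.

    import numpy as np

    rng = np.random.default_rng(0)

    def low_rank_cov(dim, rank):
        """K_i = U_i U_i^T with U_i having orthonormal columns (low rank, orthogonal)."""
        U, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
        return U @ U.T

    K1, K2, K3 = low_rank_cov(5, 2), low_rank_cov(4, 2), low_rank_cov(3, 1)
    K = np.kron(np.kron(K1, K2), K3)   # full-model kernel of Eq. (8), shape (60, 60)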
A limitation of the above model is that it cannot model a tensor output, i.e. the case where the response itself is multivariate.
An alternative approach is given by (angell2018inferring). In this work, let $y_{is}$ be a single radial velocity measurement, $S_i$ be the set of stations that measure radial velocities at location $x_i$, $v_i$ the latent unobserved velocity vector, and $a_{is}$ the unit vector along the radial axis from station $s$. We define the radial projection $a_{is}^\top v_i$. Then the distribution of $y_{is}$ is

y_{is} \mid v_i \sim \mathcal{N}\!\left(a_{is}^\top v_i,\; \sigma^2\right).    (9)

Here $i$ indexes over training points, and $s$ indexes over the set $S_i$.
The joint likelihood factorizes completely and can be written as the product of individual components.
The latent velocity field is modeled as a vector-valued GP with mean zero and matrix-valued kernel function

K(x, x') = k(x, x')\, I,    (10)
k(x, x') = \sigma_f^2 \exp\!\left(-\frac{\|x - x'\|^2}{2\ell^2}\right).    (11)
Note that the kernel function produces a matrix rather than a single scalar value. The covariance of the observed outputs is

\operatorname{Cov}(y_{is}, y_{jt}) = a_{is}^\top K(x_i, x_j)\, a_{jt} + \sigma^2\, \delta_{(i,s),(j,t)}.    (12)

The joint distribution of $\mathbf{y}$ and $\mathbf{v}$ given the inputs is Gaussian, given the Gaussian prior form of $\mathbf{v}$ and the noise. Then, since the joint mean is zero, it suffices to find the covariance. Let $\mathbf{y}$ be the stacked vector of measurements, $A$ the matrix stacking the projection vectors $a_{is}$, and $K_{\mathbf{v}}$ the prior covariance matrix of the stacked latent velocities $\mathbf{v}$. Then the covariance of the joint vector is

\operatorname{Cov}\!\begin{pmatrix} \mathbf{y} \\ \mathbf{v} \end{pmatrix} = \begin{pmatrix} A K_{\mathbf{v}} A^\top + \sigma^2 I & A K_{\mathbf{v}} \\ K_{\mathbf{v}} A^\top & K_{\mathbf{v}} \end{pmatrix}.    (13)
Naive exact inference can then be performed by considering the posterior mean of $\mathbf{v}$:

\mathbb{E}[\mathbf{v} \mid \mathbf{y}] = K_{\mathbf{v}} A^\top \left(A K_{\mathbf{v}} A^\top + \sigma^2 I\right)^{-1} \mathbf{y}.    (14)
To make this computationally tractable, the authors use Laplace’s method and transform the kernel to a stationary one.
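For reference, a minimal NumPy sketch of the naive posterior mean in (14); the construction of $A$ and $K_{\mathbf{v}}$ is left abstract and the variable names are illustrative.

    import numpy as np

    def latent_velocity_posterior_mean(A, K_v, y, sigma=1.0):
        """E[v | y] = K_v A^T (A K_v A^T + sigma^2 I)^{-1} y, as in Eq. (14)."""
        S = A @ K_v @ A.T + sigma**2 * np.eye(A.shape[0])   # observation covariance, cf. Eq. (12)
        return K_v @ A.T @ np.linalg.solve(S, y)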
A drawback of this model is that the authors consider a basic matrix-valued kernel that is diagonal and based on a squared exponential covariance kernel. Perhaps a more general (non-diagonal) kernel would yield improved results here.
As a general point of inquiry, can we somehow build a general framework which encompasses these methods? Are these methods similar, and how do they arise from tensor regression more generally?
6 Multivariate Generalized Gaussian Process Models
An extension to Gaussian process models considers data from a more general exponential family (chan2013multivariate). In this setting, the authors extend the GLM model to incorporate Gaussian process correlation structure. In particular, they extend the standard GLM framework to be
1. a random component: the response $y$ is drawn from an exponential family distribution,
2. a systematic component: a latent function $f$ with a Gaussian process prior in place of the linear predictor, and
3. a link between the two: $g(\mathbb{E}[y]) = f(x)$,
for a link function $g$.
Why does this model consider the expectation of $y$ conditioned on the latent function, rather than the unconditional expectation of $y$ used in the standard setting?
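As a concrete instance of this structure, the following sketch draws data from an assumed Poisson observation model with a log link and a GP-distributed latent function; the inputs and kernel are placeholders, not the specific model of (chan2013multivariate).

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(0.0, 5.0, size=(20, 1))                  # placeholder inputs
    K = np.exp(-0.5 * (X - X.T) ** 2)                        # squared exponential prior covariance
    f = rng.multivariate_normal(np.zeros(20), K + 1e-8 * np.eye(20))   # latent GP draw
    y = rng.poisson(np.exp(f))                               # exponential family response, log link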
7 Neural Network Approaches
7.1 RNN
A recurrent neural network predicts $x_{t+1}$ given its history of a fixed length $\tau$, by

\hat{x}_{t+1} = g_\theta(x_{t-\tau+1}, \ldots, x_t),    (15)

where $g_\theta$ represents the network, parameterized by weight matrices and bias vectors. For $k = t-\tau+1, \ldots, t$, with the initial hidden state $h_{t-\tau}$ being the zero matrix, the update equation is

h_k = \sigma\!\left(W_h h_{k-1} + W_x x_k + b_h\right),

where $\sigma$ denotes the sigmoid function. The hidden state is $h_k \in \mathbb{R}^m$. The final estimate of $x_{t+1}$ is

\hat{x}_{t+1} = W_o h_t + b_o.
The network parameters can be optimized by minimizing the mean squared prediction error, which is equivalent to maximizing the likelihood under spherical Gaussian noise.
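A minimal NumPy sketch of the recurrent update and readout described above; shapes and the history length are illustrative.

    import numpy as np

    def rnn_predict(x_hist, W_h, W_x, b_h, W_o, b_o):
        """One-step-ahead prediction from a history x_hist of shape (tau, d)."""
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        h = np.zeros(W_h.shape[0])               # hidden state initialized to zero
        for x_k in x_hist:                       # recurrent update over the history
            h = sigmoid(W_h @ h + W_x @ x_k + b_h)
        return W_o @ h + b_o                     # linear readout of the estimate of x_{t+1}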
7.2 CCRNN
A modification of the RNN is to first non-linearly transform the list of inputs with a shared feedforward stage and then pass the results through an individual RNN for each target of prediction.
The first non-linear transformation learns relationships between the input variables. By increasing the number of layers in this feedforward stage, non-linear operations such as multiplication can be approximated. Such relationships are then used to predict each variable separately.
If we explicitly limit the types of relationships representable in the feedforward stage, for instance by specifying a library of functional relationships, we can also study the learned weight matrices to infer the important functional relationships that are predictive of each variable.
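A minimal sketch of this structure, with an assumed shared tanh feedforward stage followed by one small RNN per target variable; all parameter shapes are illustrative.

    import numpy as np

    def ccrnn_predict(x_hist, W_s, b_s, per_var_params):
        """x_hist: (tau, d) history; returns one prediction per target variable."""
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        feats = [np.tanh(W_s @ x_k + b_s) for x_k in x_hist]   # shared feedforward stage
        preds = []
        for W_h, W_x, b_h, w_o, b_o in per_var_params:         # one RNN per target variable
            h = np.zeros(W_h.shape[0])
            for f_k in feats:
                h = sigmoid(W_h @ h + W_x @ f_k + b_h)
            preds.append(w_o @ h + b_o)
        return np.array(preds)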
8 Deep Kernel Learning
The goal of deep kernel learning is to combine the flexibility of deep neural networks for learning representations of high-dimensional data with Gaussian processes. To do this, the authors transform the inputs of the covariance kernel function with a deep network $g_{\mathbf{w}}$:

k_{\text{deep}}(x, x') = k\big(g_{\mathbf{w}}(x), g_{\mathbf{w}}(x') \mid \theta\big).    (16)

From a neural network perspective, this is itself a neural network in which the final layer (the GP) has an infinite number of basis functions. All parameters in this model, the network weights $\mathbf{w}$ and the kernel hyperparameters $\theta$, are learned jointly via gradient-based methods (wilson2016deep).
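A minimal sketch of Equation (16): the inputs are warped by a small feedforward network and a squared exponential base kernel is applied to the warped features. The two-layer warping and its weights are placeholders for whatever deep architecture is used in practice.

    import numpy as np

    def g_w(X, W1, b1, W2, b2):
        """A small feedforward warping g_w of the inputs (rows of X)."""
        return np.tanh(np.tanh(X @ W1 + b1) @ W2 + b2)

    def deep_kernel(Xa, Xb, params, lengthscale=1.0):
        """Squared exponential base kernel applied to the warped inputs, cf. Eq. (16)."""
        Za, Zb = g_w(Xa, *params), g_w(Xb, *params)
        d2 = ((Za[:, None, :] - Zb[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / lengthscale**2)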
9 Our Model
We propose extending the works of deep kernel learning (wilson2016deep) and multivariate generalized Gaussian processes (chan2013multivariate) in two ways:
1. generalizing deep kernel learning to the multivariate Gaussian process setting, and
2. generalizing the multivariate Gaussian process to a more flexible (non-diagonal) kernel with low-rank structure.
9.1 Simple Model
Notation: Let $X$ and $Y$ be the set of training observations, where $S$ is the number of observed time series (realizations of a dynamical system), $D$ is the number of observed variables in the dynamical system, and $T$ is the number of time steps for which the system is observed. We encode each input so that it identifies the series, the variable, and the time step of the corresponding observation.
We then take the kernel of the Gaussian process to have low-rank structure, $K = U U^\top$, where $U$ has one row per encoded input. We consider two possibilities for constructing $U$:
1. learning the entries of $U$ directly as free parameters, and
2. producing the rows of $U$ with a deep network applied to the encoded inputs.
We find empirically that (2) performs better than (1), while both perform worse than the CCRNN; (2) performs similarly to the RNN on the position data from Acrobot. We are unable to find any interpretability in the learned $U$ in (1).
9.2 Example
To illustrate our model, we consider the Hénon map dynamical system, which maps a point $(x_t, y_t)$ to

x_{t+1} = 1 - a x_t^2 + y_t, \qquad y_{t+1} = b x_t,

for fixed values of the parameters $a$ and $b$.
In this example, we can omit both the series index and the time index and consider our input to be of the form $(x_t, y_t)$. Then the low-rank representation could be a two-column factor $U$, where the first column of $U$ represents the influence of $x$ on each sample from the dynamical system, and the second column represents the influence of $y$ on each sample. For this small example, the size of $U$ is $n \times 2$ (for $n$ samples), and the size of the kernel of the Gaussian process is $n \times n$.
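To generate data for this example, a short sketch of iterating the Hénon map; the classical parameter values $a = 1.4$, $b = 0.3$ are an illustrative choice.

    import numpy as np

    def henon_trajectory(x0=0.0, y0=0.0, a=1.4, b=0.3, steps=1000):
        """Iterate the Henon map from (x0, y0) for a fixed number of steps."""
        xs, ys = [x0], [y0]
        for _ in range(steps):
            x, y = xs[-1], ys[-1]
            xs.append(1.0 - a * x**2 + y)
            ys.append(b * x)
        return np.array(xs), np.array(ys)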
10 Continuous-time Neural Network Approach
Initial idea:
We observe realizations of $x(t)$, a continuous-time process in $d$ variables, with $\theta$ the parameters of the system. We aim to learn the causal relationships between the variables, and to predict $x(t)$ for $t > T$. We do so by learning the time derivative with a neural network $f_\phi$, such that $x(t) = x(0) + \int_0^t f_\phi(x(s))\, ds$. The loss function is defined on the prediction of $x(t)$.
• The evaluation of the integral can follow the Monte Carlo trick in (mei2016hawkes): a single function evaluation at a random time gives an unbiased estimate of the integral, and the algorithm averages over several samples to reduce the variance of the estimator. [Drawing only a single sample suggests the network will just learn the mean of the gradient; a better scheme may be to draw multiple samples to approximate the gradient function.] A minimal sketch of this estimator is given after this list.
• Causality may be learned by enforcing sparsity in the neural network $f_\phi$. (tank2018NeuralGC) places a group lasso penalty on the weights in the first layer, where zero outgoing weights are a sufficient condition to represent Granger non-causality.
• In the case where the modeled variables are latent, we will need an additional mapping back to the observed space.
• To learn interactions between different variables, we can introduce an attention mechanism between the different components of the system, as is done in (goyal2019recurrent) and (kim2019attentive).
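A minimal sketch of the Monte Carlo estimator from the first bullet, assuming the state path can be interpolated from observed samples; the interpolation, $f_\phi$, and the number of samples are all placeholders.

    import numpy as np

    def mc_state_estimate(x0, f_phi, interp_x, t, n_samples=8, rng=None):
        """Monte Carlo estimate of x(t) = x(0) + integral_0^t f_phi(x(s)) ds."""
        if rng is None:
            rng = np.random.default_rng()
        s = rng.uniform(0.0, t, size=n_samples)                # random evaluation times
        grads = np.stack([f_phi(interp_x(si)) for si in s])    # gradient evaluations along the path
        return x0 + t * grads.mean(axis=0)                     # unbiased estimate of the integral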
Evaluation idea:
To evaluate whether the network has learned causal relationships between the variables, we can evaluate the model on dynamical systems with unknown parameters $\theta$. Specifically, consider a dataset of systems sampled (non-uniformly in time) from a parametric family of dynamical systems indexed by $\theta$. During training, we assume access to dynamical systems sampled with parameters in a training set $\Theta_{\text{train}}$. At test time, we evaluate on series sampled with parameters drawn from outside $\Theta_{\text{train}}$.
11 Zero-shot regression
The Omnipush dataset (bauza2019) collected 250 pushes for each of 250 objects. Each object has 4 possible sides (concave, triangular, circular, rectangular) with 2 types of extra weights (60g, 150g). This makes it possible to test whether the learned model can generalize across objects.
Let $s_t$ denote the state of the object at time $t$, and $a_t$ the action to be performed on the object at time $t$. In the Omnipush scenario, $s_t$ is the planar pose of the object, and $a_t$ gives the starting and ending positions of the pusher. The input can be reduced to 3 dimensions by treating $s_t$ as the origin and making use of the fact that the pusher moves at constant speed. Then $a_t$ contains the pusher location and angle with respect to the object. The target to predict is $s_{t+1}$, or equivalently the change $\Delta s_t = s_{t+1} - s_t$. We denote the characteristics, i.e. the types of sides and extra weights, of the object through a vector $c$. Additionally, RGB-D videos are captured, which we can explore adding to the input in an extension.
A predictive model normally takes the form of

\hat{s}_{t+1} = f_\theta(s_t, a_t)

or

\hat{s}_{t+1} = f_\theta(s_t, a_t, c)

for some function $f_\theta$ parameterized by $\theta$, depending on whether the characteristics $c$ are included as an input. For instance, $f_\theta$ can be a neural network. However, the learned $f_\theta$ tends to be biased towards the training samples available, and the resulting model does not generalize well to new objects. Additional procedures such as context identifiers are needed to correct for this bias (sanchezgonzalez2018).
We would like to learn the model in an end-to-end fashion so that it can generalize to new objects. We propose learning

\hat{s}_{t+1} = f_{\theta(c)}(s_t, a_t),

such that if $d(c_1, c_2) \le d(c_1, c_3)$ then $\theta(c_1)$ is closer to $\theta(c_2)$ than to $\theta(c_3)$, where $d$ is a distance function defining the difference between object characteristics. The key idea is that physical dynamics, and consequently model parameters, are more similar for objects that have more similar characteristics. This would allow the model to generalize to new objects. One possible way to impose this constraint is to first define $\theta(c) = \theta \odot m(c)$, where $\odot$ is the elementwise product and $m(c)$ acts as a mask for objects with characteristics $c$ (mallya2018). Then a loss that aligns the distances between masks with the distances between object characteristics can be included in the objective function on top of the prediction loss. The network is trained through a Siamese network structure.
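One possible sketch of the masked-parameter idea: the first-layer weights are modulated elementwise by a characteristic-dependent mask $m(c)$, and a pairwise loss (one assumed form, not prescribed above) encourages mask distances to track characteristic distances.

    import numpy as np

    def predict_delta(s, a, c, W1, b1, W2, b2, mask_fn):
        """Predict the change in object state with first-layer weights masked by m(c)."""
        h = np.maximum(0.0, (W1 * mask_fn(c)[:, None]) @ np.concatenate([s, a]) + b1)
        return W2 @ h + b2

    def mask_consistency_loss(m1, m2, d12):
        """Penalize disagreement between the mask distance and the characteristic distance d12."""
        return (np.linalg.norm(m1 - m2) - d12) ** 2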
12 Updated Zero Shot Regression
Let $s_t$ denote the state of the object at time $t$, and $a_t$ the action to be performed on the object at time $t$. The input can be reduced to 3 dimensions by treating $s_t$ as the origin and making use of the fact that the pusher moves at constant speed. Then $a_t$ contains the pusher location and angle with respect to the object. The target to predict is $s_{t+1}$, or equivalently $\Delta s_t = s_{t+1} - s_t$. Additionally, let $c$ be a characteristic vector denoting the characteristics of the object.
We propose learning the function

\Delta \hat{s}_t = W(c)\, g_\phi(s_t, a_t),

where $g_\phi$ is a deep neural network with $L$ layers and $W(c)$ is an additional linear transformation which depends on the characteristic $c$.
To learn this, we optimize over pairs of inputs $(s, a, c)$ and targets $\Delta s$, dropping the time index since we only perform one-step prediction. Denoting the first layer of $g_\phi$ as the embedding layer, we optimize a loss function whose data term is the negative log likelihood of $\Delta s$ with mean $W(c)\, g_\phi(s, a)$. The idea here is that by incorporating characteristic information, we will be able to better predict pushes on unseen objects based on objects with similar characteristics.
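A minimal sketch of this predictor, where $W(c)$ is generated by a linear map of $c$ (an assumption made only for illustration) and the loss is a Gaussian negative log likelihood up to constants.

    import numpy as np

    def predict_delta_s(s, a, c, phi_layers, W_c_gen, out_dim):
        """Shared network g_phi followed by a characteristic-dependent linear map W(c)."""
        x = np.concatenate([s, a])
        for W, b in phi_layers:                  # layers of g_phi (first layer = embedding layer)
            x = np.tanh(W @ x + b)
        W_c = (W_c_gen @ c).reshape(out_dim, x.shape[0])   # W(c), generated linearly from c
        return W_c @ x

    def gaussian_nll(delta_s, mean, sigma=1.0):
        """Negative log likelihood of delta_s under a Gaussian with the predicted mean (up to constants)."""
        return 0.5 * np.sum((delta_s - mean) ** 2) / sigma**2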