
Tensor GP Regression

1 Bayesian Linear Model

Let $x\in\mathbb{R}^{d}$ be an input vector, $w\in\mathbb{R}^{d}$ a weight vector, and $y\in\mathbb{R}$ an observed noisy output value. We model the response value $y$ as

$$y=f(x)+\epsilon \quad\text{and}\quad f(x)=x^{T}w \qquad (1)$$

where $\epsilon\sim N(0,\sigma_{n}^{2})$.

We can write the likelihood of the observed response vector $y$ as

$$p(y|X,w)=N(X^{T}w,\sigma_{n}^{2}I) \qquad (2)$$

where $X$ is the matrix of all training inputs and $y$ is the vector of response values.

Taking $w$ to have prior distribution $w\sim N(0,\Sigma_{d})$, the posterior distribution of $w$ given the training data is

$$p(w|X,y)=N\left(\frac{1}{\sigma_{n}^{2}}A^{-1}Xy,\;A^{-1}\right) \qquad (3)$$

where $A=\sigma_{n}^{-2}XX^{T}+\Sigma_{d}^{-1}$. The predictive distribution of the true function value $f_{*}$ at an unseen input $x_{*}$ is

$$p(f_{*}|x_{*},X,y)=N\left(\sigma_{n}^{-2}x_{*}^{T}A^{-1}Xy,\;x_{*}^{T}A^{-1}x_{*}\right) \qquad (4)$$
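The posterior and predictive computations of Equations (3) and (4) are direct to implement. Below is a minimal numpy sketch on synthetic data; the dimensions, noise level, and prior covariance are illustrative assumptions, not values from any experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 50
sigma_n = 0.1                      # observation noise standard deviation
Sigma_d = np.eye(d)                # prior covariance of w

# Synthetic data: columns of X are inputs, y = X^T w_true + noise
w_true = rng.normal(size=d)
X = rng.normal(size=(d, n))
y = X.T @ w_true + sigma_n * rng.normal(size=n)

# Posterior of w (Eq. 3): A = sigma_n^{-2} X X^T + Sigma_d^{-1}
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_d)
A_inv = np.linalg.inv(A)
w_mean = A_inv @ X @ y / sigma_n**2

# Predictive distribution of f_* at a test input x_* (Eq. 4)
x_star = rng.normal(size=d)
f_mean = x_star @ A_inv @ X @ y / sigma_n**2
f_var = x_star @ A_inv @ x_star

print("posterior mean of w:", w_mean)
print("predictive mean and variance at x_*:", f_mean, f_var)
```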

2 Linear Models with Input Projections

We can define a projection function $\phi(\cdot)$ which maps $x$ to a feature space such that $\phi(x)\in\mathbb{R}^{N}$. Then we can generalize the linear model to

$$y=f(x)+\epsilon \quad\text{and}\quad f(x)=\phi(x)^{T}w. \qquad (5)$$

Equations (2) and (3) hold as before, only with the matrix of input data $X$ replaced by $\Phi$. We can write the predictive distribution of $f_{*}$ as

$$p(f_{*}|x_{*},X,y)=N\left(\phi_{*}^{T}\Sigma_{d}\Phi(K+\sigma_{n}^{2}I)^{-1}y,\;\phi_{*}^{T}\Sigma_{d}\phi_{*}-\phi_{*}^{T}\Sigma_{d}\Phi(K+\sigma_{n}^{2}I)^{-1}\Phi^{T}\Sigma_{d}\phi_{*}\right) \qquad (6)$$

where $K=\Phi^{T}\Sigma_{d}\Phi$ has entries $k(x,x^{\prime})=\sum_{k=1}^{N}\sum_{\ell=1}^{N}\Sigma_{k,\ell}\,\phi_{k}(x)\phi_{\ell}(x^{\prime})$. Noting that $k(x,x^{\prime})=\psi(x)\cdot\psi(x^{\prime})$ where $\psi(x)=\Sigma_{d}^{1/2}\phi(x)$, the kernel trick can be applied in this setting, and we define $k$ as a kernel function.
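The identity $K=\Phi^{T}\Sigma_{d}\Phi$ with $\psi(x)=\Sigma_{d}^{1/2}\phi(x)$ can be checked numerically. The sketch below uses a toy polynomial feature map and a diagonal $\Sigma_{d}$; both are illustrative choices, not part of any cited model.

```python
import numpy as np

rng = np.random.default_rng(1)

N, n = 5, 4                                   # feature dimension, number of inputs
Sigma_d = np.diag(rng.uniform(0.5, 2.0, N))   # illustrative (diagonal) prior covariance

# A toy feature map phi: R -> R^N (polynomial features, for illustration only)
def phi(x):
    return np.array([x**k for k in range(N)])

X = rng.normal(size=n)
Phi = np.stack([phi(x) for x in X], axis=1)   # N x n, columns are phi(x_i)

# K = Phi^T Sigma_d Phi ...
K_feat = Phi.T @ Sigma_d @ Phi

# ... equals the Gram matrix of psi(x) = Sigma_d^{1/2} phi(x)
# (elementwise sqrt is the matrix square root here because Sigma_d is diagonal)
Psi = np.sqrt(Sigma_d) @ Phi
K_psi = Psi.T @ Psi

print(np.allclose(K_feat, K_psi))             # True
```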

3 Gaussian Process

A Gaussian process is a collection of random variables, any finite number of which has a joint Gaussian distribution. It is defined by a mean function and covariance function

$$
\begin{aligned}
m(x) &= \mathbb{E}[f(x)] \\
k(x,x^{\prime}) &= \mathbb{E}[(f(x)-m(x))(f(x^{\prime})-m(x^{\prime}))] \\
f(x) &\sim GP(m(x),k(x,x^{\prime}))
\end{aligned}
$$

We have already seen an example of a Gaussian process: the linear model with input projection $f(x)=\phi(x)^{T}w$. To see that this is a Gaussian process, take

$$
\begin{aligned}
m(x) &= \mathbb{E}[f(x)]=\phi(x)^{T}\mathbb{E}[w]=0 \\
k(x,x^{\prime}) &= \phi(x)^{T}\Sigma_{d}\phi(x^{\prime})=\sum_{k=1}^{N}\sum_{\ell=1}^{N}\Sigma_{k,\ell}\,\phi_{k}(x)\phi_{\ell}(x^{\prime})
\end{aligned}
$$

We now consider an additional example given by (mackay1997gaussian). Let $\phi_{c}(x)=\exp\{-(x-c)^{2}/(2\lambda^{2})\}$. Taking $N\rightarrow\infty$ gives the squared exponential covariance function, so for some covariance functions we need an infinite number of basis functions. A method for converting from an arbitrary GP to an equivalent linear model is given in (quinonero2005analysis). However, for covariance functions built from infinitely many basis functions, the associated linear model will also be infinite. It is important to note that for the conversion to work, there must be a weight associated with each training and test input, which would necessitate an infinite number of weights.
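This limit can be illustrated numerically: with a dense grid of centers standing in for $N\rightarrow\infty$, the normalized finite-sum kernel matches a squared exponential with lengthscale $\sqrt{2}\lambda$. The grid, lengthscale, and normalization below are our illustrative choices.

```python
import numpy as np

lam = 0.5
centers = np.linspace(-10, 10, 2001)          # dense grid approximating N -> infinity

def phi(x):
    # Radial basis functions centered at each c (MacKay's example)
    return np.exp(-(x - centers)**2 / (2 * lam**2))

def k_finite(x, xp):
    return phi(x) @ phi(xp)

x, xp = 0.3, 1.1
ratio = k_finite(x, xp) / k_finite(x, x)      # normalize away the constant factor
se = np.exp(-(x - xp)**2 / (4 * lam**2))      # squared exponential with lengthscale sqrt(2)*lam
print(ratio, se)                              # nearly identical
```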

4 Tensor Regression

The general form for the optimization problem is

$$
\begin{aligned}
W^{*} &= \arg\min_{W}\ \mathbb{L}(W;X,Y) \\
&\text{s.t. } \operatorname{rank}(W)\leq R
\end{aligned}
$$

Several forms for estimates of $Y$ have been studied. They are outlined below.

  1. $y=\left<X,W\right>+\epsilon$ (zhao2011multilinear)

  2. $y=\operatorname{vec}(X)^{T}\operatorname{vec}(W)+\epsilon$ (zhou2013tensor); see the sketch after this list

  3. $Y^{t}=X^{t}w^{t}+\epsilon^{t}$ (romera2013multilinear)

  4. $Y_{m}=X_{m}W_{m}+E_{m}$ (bahadori2014fast)

The choice of tensor regression used to derive the multi-linear GP is (3) (yu2018tensor).
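As a concrete illustration of form (2) with the rank constraint, the sketch below fits a rank-$R$ coefficient matrix by alternating least squares. The sizes, synthetic data, and fitting scheme are our own illustrative choices rather than the algorithm of any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(2)

# Tensor regression of the form y = <vec(X), vec(W)> + eps with rank(W) <= R
p1, p2, R, n = 6, 5, 2, 100

# Rank-R coefficient matrix W = U V^T
U_true, V_true = rng.normal(size=(p1, R)), rng.normal(size=(p2, R))
W_true = U_true @ V_true.T

X = rng.normal(size=(n, p1, p2))
y = np.einsum('nij,ij->n', X, W_true) + 0.01 * rng.normal(size=n)

# Alternating least squares over the two factors (a simple way to keep rank(W) <= R)
U, V = rng.normal(size=(p1, R)), rng.normal(size=(p2, R))
for _ in range(50):
    # Fix V, solve for U: y_n is linear in vec(U) with design X_n V
    A = np.einsum('nij,jr->nir', X, V).reshape(n, -1)
    U = np.linalg.lstsq(A, y, rcond=None)[0].reshape(p1, R)
    # Fix U, solve for V: y_n is linear in vec(V) with design X_n^T U
    B = np.einsum('nij,ir->njr', X, U).reshape(n, -1)
    V = np.linalg.lstsq(B, y, rcond=None)[0].reshape(p2, R)

print("relative error:", np.linalg.norm(U @ V.T - W_true) / np.linalg.norm(W_true))
```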

5 Multi-linear Gaussian Process

Given a total of $m=\sum_{t=1}^{T}n_{t}$ training samples from $T$ related tasks, we assume that each data point $(x_{t,i},y_{t,i})$ is drawn i.i.d. according to the probabilistic model

$$y_{t,i}=f(x_{t,i})+\epsilon_{t} \qquad (7)$$

with $f(x_{t,i})\sim GP(0,k)$. The full model concatenates the data from all tasks (yu2018tensor). The kernel of the full model is built from the Kronecker product of a feature correlation matrix $K_{1}$, a group correlation matrix $K_{2}$, and a matrix $K_{3}$ measuring the correlation between groups. Then,

$$K=\phi(X)\left(K_{3}\otimes K_{2}\otimes K_{1}\right)\phi(X)^{T} \qquad (8)$$

and the full model is

$$
\begin{aligned}
y &= f(X)+e \\
f(X) &\sim GP(0,K) \\
e &\sim N(0,D)
\end{aligned}
$$

where $y=[y_{1,1},y_{1,2},\dots,y_{T,n_{T}}]$, $X$ is a block diagonal matrix of $X_{1},X_{2},\dots,X_{T}$ with $X_{t}=[x_{t,1};x_{t,2};\dots;x_{t,n_{t}}]$, and $D$ is a block diagonal matrix with blocks $\sigma_{t}^{2}I_{n_{t}}$.
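A minimal sketch of constructing the Kronecker-structured kernel of Equation (8) and sampling from the full model; the dimensions, the random stand-in for $\phi(X)$, and the diagonal noise are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_psd(d):
    A = rng.normal(size=(d, d))
    return A @ A.T + 1e-3 * np.eye(d)

# Illustrative sizes for the feature, group, and between-group correlation matrices
d1, d2, d3 = 4, 3, 2
K1, K2, K3 = random_psd(d1), random_psd(d2), random_psd(d3)

m = 20                                        # total number of training samples
Phi = rng.normal(size=(m, d3 * d2 * d1))      # stands in for phi(X)

# Eq. (8): K = phi(X) (K3 kron K2 kron K1) phi(X)^T
K = Phi @ np.kron(K3, np.kron(K2, K1)) @ Phi.T
K = K + 1e-8 * np.eye(m)                      # jitter for numerical stability

# Full model: y = f(X) + e, f(X) ~ GP(0, K), e ~ N(0, D)
D = np.diag(rng.uniform(0.01, 0.1, m))        # block-diagonal noise, here fully diagonal
f = rng.multivariate_normal(np.zeros(m), K)
y = f + rng.multivariate_normal(np.zeros(m), D)
```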

When each covariance matrix $K_{m}=U_{m}U_{m}^{T}$, where $U_{m}$ is a low-rank orthogonal matrix, one can show that the solution which estimates $K$ is also an approximate solution which minimizes the rank of $W$ in tensor regression. Thus there appears to be an equivalence between the covariance $K$ and the parameter tensor $W$ (yu2018tensor).

A limitation of the above model is that it cannot model a tensor output $y$ where $y_{t,i}$ is multivariate.

An alternative approach is given by (angell2018inferring). In this work, let $y_{ij}$ be a single radial velocity measurement, $O_{i}$ the set of stations that measure radial velocities at location $x_{i}$, $z_{i}=(u_{i},v_{i},w_{i})$ the latent unobserved velocity vector, and $a_{ij}$ the radial axis. We define $y_{ij}=a_{ij}^{T}z_{i}+\epsilon_{ij}$. Then the distribution of $y|z,x$ is

$$p(y_{ij}|z_{i};x_{i})=N(y_{ij};a_{ij}^{T}z_{i},\sigma^{2}). \qquad (9)$$

Here $i$ indexes over training points, and $j$ indexes over the set $O_{i}$.

The joint likelihood factorizes completely and can be written as the product of individual components.

The latent velocity field $z_{i}$ is modeled as a vector-valued GP with $m(x)=0$ and kernel function

$$
\begin{aligned}
k_{\theta}(x,x^{\prime}) &= \operatorname{diag}\left(\exp\left(\frac{-d_{\alpha}(x,x^{\prime})}{2\beta_{u}}\right),\exp\left(\frac{-d_{\alpha}(x,x^{\prime})}{2\beta_{v}}\right),\exp\left(\frac{-d_{\alpha}(x,x^{\prime})}{2\beta_{w}}\right)\right) \qquad (10) \\
d_{\alpha}(x,x^{\prime}) &= \alpha_{1}(x_{1}-x^{\prime}_{1})^{2}+\alpha_{2}(x_{2}-x^{\prime}_{2})^{2}+\alpha_{3}(x_{3}-x^{\prime}_{3})^{2} \qquad (11)
\end{aligned}
$$

Note that the kernel function produces a $3\times 3$ matrix rather than a single scalar value. The covariance of the observed outputs is

$$\mathrm{Cov}(y,y^{\prime})=\mathbb{E}[yy^{\prime}]=a^{T}\mathbb{E}[zz^{\prime T}]a^{\prime}=a^{T}k_{\theta}(x,x^{\prime})a^{\prime} \qquad (12)$$

The joint distribution of $y$ and $z$ given $x$ is Gaussian, given the forms of $y|z,x$ and the prior $z|x$. Since the joint mean is $0$, it suffices to find the covariance. Let $q^{T}=[z^{T}\;y^{T}]$, $A=\operatorname{diag}(a_{ij}^{T})$, and $K$ be the prior covariance matrix of $z|x$. Then the joint covariance is

$$\mathbb{E}[qq^{T}]=\begin{pmatrix}K&KA^{T}\\ AK^{T}&AKA^{T}+\sigma^{2}I\end{pmatrix} \qquad (13)$$

Naive exact inference can then be performed by considering the posterior mean of $z|y,x$:

$$\mathbb{E}[z|y,x]=KA^{T}(AKA^{T}+\sigma^{2}I)^{-1}y. \qquad (14)$$
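A minimal sketch of this naive exact inference, assuming for simplicity one radial measurement per location so that $A$ is $n\times 3n$; the sizes, hyperparameters, and simulated data are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

n, sigma = 30, 0.1
locs = rng.uniform(0, 10, size=(n, 3))        # 3-d locations x_i
alpha = np.array([1.0, 1.0, 0.2])
betas = np.array([2.0, 2.0, 0.5])             # beta_u, beta_v, beta_w

# Kernel of Eqs. (10)-(11): a 3x3 diagonal matrix per pair of locations
def d_alpha(x, xp):
    return np.sum(alpha * (x - xp)**2)

def k_theta(x, xp):
    return np.diag(np.exp(-d_alpha(x, xp) / (2 * betas)))

# Prior covariance K of the stacked latent velocities z (3n x 3n)
K = np.zeros((3 * n, 3 * n))
for i in range(n):
    for j in range(n):
        K[3*i:3*i+3, 3*j:3*j+3] = k_theta(locs[i], locs[j])
K += 1e-8 * np.eye(3 * n)                     # jitter for sampling

# One radial measurement per location, so A = diag(a_i^T) is n x 3n
axes = rng.normal(size=(n, 3))
axes /= np.linalg.norm(axes, axis=1, keepdims=True)
A = np.zeros((n, 3 * n))
for i in range(n):
    A[i, 3*i:3*i+3] = axes[i]

# Simulate observations and apply the naive exact posterior mean of Eq. (14)
z = rng.multivariate_normal(np.zeros(3 * n), K)
y = A @ z + sigma * rng.normal(size=n)
z_mean = K @ A.T @ np.linalg.solve(A @ K @ A.T + sigma**2 * np.eye(n), y)
```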

To make this computationally tractable, the authors use Laplace’s method and transform the kernel to a stationary one.

A drawback of this model is that the authors consider a basic covariance kernel which is a $3\times 3$ diagonal matrix based on a squared exponential covariance kernel. Perhaps a more general kernel would yield improved results here.

As a general point of inquiry, can we somehow build a general framework which encompasses these methods? Are these methods similar and how do they come out of tensor regression generally?

6 Multivariate Generalized Gaussian Process Models

An extension to Gaussian process models considers data from a more general exponential family (chan2013multivariate). In this setting, the authors extend generalized linear models (GLMs) to incorporate Gaussian process correlation structure. In particular, they extend the standard GLM framework to be

  1. $y\sim p(y;\theta,\varphi)$

  2. $\eta\sim GP(0,K(x,x^{\prime}))$

  3. $\mathbb{E}[T(y)|\theta]=g^{-1}(\eta(X))$ for a link function $g$.

Why does this model consider the expectation of $T(y)$ rather than that of $y$, which is $\mu$ in the standard setting?
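A minimal generative sketch of this hierarchy, assuming a Poisson observation model with a log link and an RBF kernel; these specific choices are ours for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 40
X = np.sort(rng.uniform(0, 5, n))

# GP prior on the latent function eta
def rbf(x, xp, ls=1.0):
    return np.exp(-(x[:, None] - xp[None, :])**2 / (2 * ls**2))

K = rbf(X, X) + 1e-6 * np.eye(n)
eta = rng.multivariate_normal(np.zeros(n), K)

# Link: E[T(y)] = g^{-1}(eta); here T(y) = y and g = log, so the mean is exp(eta)
mu = np.exp(eta)

# Exponential-family observation model, here Poisson counts
y = rng.poisson(mu)
```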

7 Neural Network Approaches

7.1 RNN

A recurrent neural network predicts $y_{T+1,\cdot}$ given its history of fixed length $\ell$, namely $y_{T,\cdot},y_{T-1,\cdot},\dots,y_{T-\ell+1,\cdot}$, by

$$y_{T+1,\cdot}=f\left(y_{T,\cdot},y_{T-1,\cdot},\dots,y_{T-\ell+1,\cdot}\right)+\epsilon \qquad (15)$$

where $f(\cdot)$ represents the network, parameterized by weight matrices and bias vectors. For $1\leq t\leq T$, with $h_{0}$ being the zero matrix, the update equation is

$$h_{t}=\sigma\left(W_{y}\cdot y_{t,\cdot}+W_{h}\cdot h_{t-1}+b_{h}\right)$$

where $\sigma(\cdot)$ denotes the sigmoid function and $h_{t}$ is the hidden state. The final estimate of $y_{T+1,\cdot}$ is

$$\hat{y}_{T+1,\cdot}=W_{v}\cdot h_{T}+b_{v}$$

The network parameters can be optimized by minimizing the mean squared prediction error, which is equivalent to maximizing the likelihood assuming spherical Gaussian $\epsilon$.
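A minimal numpy sketch of the forward pass defined by the update and readout equations above; training by minimizing the squared prediction error would additionally require automatic differentiation, which we omit. The sizes and weight initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, hidden, ell = 2, 8, 5                      # variables, hidden units, history length
W_y = rng.normal(scale=0.1, size=(hidden, d))
W_h = rng.normal(scale=0.1, size=(hidden, hidden))
b_h = np.zeros(hidden)
W_v = rng.normal(scale=0.1, size=(d, hidden))
b_v = np.zeros(d)

def rnn_predict(history):
    """history: (ell, d) array y_{T-ell+1,.}, ..., y_{T,.}; returns y_hat_{T+1,.}."""
    h = np.zeros(hidden)                      # h_0 is zero
    for y_t in history:                       # h_t = sigma(W_y y_t + W_h h_{t-1} + b_h)
        h = sigmoid(W_y @ y_t + W_h @ h + b_h)
    return W_v @ h + b_v                      # y_hat_{T+1,.} = W_v h_T + b_v

history = rng.normal(size=(ell, d))
print(rnn_predict(history))
```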

7.2 CCRNN

A modification on the RNN is to first non-linearly transform the list of inputs $y_{T,\cdot},y_{T-1,\cdot},\dots,y_{T-\ell+1,\cdot}$ and then pass the results through an individual RNN for each target of prediction. For example, for $n_{t}=2$,

$$
\begin{aligned}
\left[g_{t,1};g_{t,2}\right] &= \sigma\left(W_{g}\cdot[y_{t,1};y_{t,2}]+b_{g}\right) \\
h_{t,1} &= \sigma\left(W_{y,1}\cdot[g_{t,1};g_{t,2}]+W_{h,1}\cdot h_{t-1,1}+b_{h,1}\right) \\
h_{t,2} &= \sigma\left(W_{y,2}\cdot[g_{t,1};g_{t,2}]+W_{h,2}\cdot h_{t-1,2}+b_{h,2}\right)
\end{aligned}
$$

The first non-linear transformation learns relationships between the variable inputs. By increasing the number of layers in this feedforward stage, non-linear operations such as multiplication can be approximated. Such relationships are then used to predict each variable separately.

If we explicitly limit the types of relationships possible in $[g_{t,1};g_{t,2}]$, for instance by specifying a library of functional relationships, we can also study the matrices $W_{y,1}$ and $W_{y,2}$ to infer the important functional relationships that are predictive of each variable.
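A minimal forward-pass sketch of this architecture for two targets; the readout weights and the single-layer feedforward transform are our illustrative assumptions, since the equations above only specify the shared transform and the hidden-state updates.

```python
import numpy as np

rng = np.random.default_rng(7)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

hidden, ell = 8, 5
W_g = rng.normal(scale=0.1, size=(2, 2)); b_g = np.zeros(2)
W_y = [rng.normal(scale=0.1, size=(hidden, 2)) for _ in range(2)]
W_h = [rng.normal(scale=0.1, size=(hidden, hidden)) for _ in range(2)]
b_h = [np.zeros(hidden) for _ in range(2)]
W_v = [rng.normal(scale=0.1, size=hidden) for _ in range(2)]   # assumed linear readouts

def ccrnn_predict(history):
    """history: (ell, 2) array; returns the two one-step-ahead predictions."""
    h = [np.zeros(hidden), np.zeros(hidden)]
    for y_t in history:
        g_t = sigmoid(W_g @ y_t + b_g)                     # shared non-linear transform
        for k in range(2):                                 # one RNN per prediction target
            h[k] = sigmoid(W_y[k] @ g_t + W_h[k] @ h[k] + b_h[k])
    return np.array([W_v[k] @ h[k] for k in range(2)])

print(ccrnn_predict(rng.normal(size=(ell, 2))))
```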

8 Deep Kernel Learning

The goal of deep kernel learning is to combine the flexibility of deep neural networks, which learn representations of high-dimensional data, with Gaussian processes. To do this, the authors transform the covariance kernel function

$$k(x,x^{\prime})\rightarrow k\left(g(x,w),g(x^{\prime},w);\theta,w\right) \qquad (16)$$

From a neural network perspective, the resulting model can be viewed as a network whose final layer has an infinite number of basis functions. All parameters in this model are learned jointly via gradient-based methods (wilson2016deep).
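A minimal sketch of how the composed kernel of Equation (16) can be formed, using a small tanh network for $g(\cdot,w)$ and an RBF base kernel; the architecture and lengthscale are our illustrative choices, and in (wilson2016deep) $w$ and $\theta$ are learned jointly through the GP marginal likelihood, which we do not show.

```python
import numpy as np

rng = np.random.default_rng(8)

# A small feature extractor g(x, w): two tanh layers (architecture is an illustrative choice)
d_in, d_hid, d_feat = 3, 16, 4
W0, b0 = rng.normal(scale=0.3, size=(d_in, d_hid)), np.zeros(d_hid)
W1, b1 = rng.normal(scale=0.3, size=(d_hid, d_feat)), np.zeros(d_feat)

def g(X):
    return np.tanh(np.tanh(X @ W0 + b0) @ W1 + b1)

# Base kernel k(., .; theta): RBF with lengthscale ls applied to the learned features
def deep_kernel(X, Xp, ls=1.0):
    GX, GXp = g(X), g(Xp)
    sq = ((GX[:, None, :] - GXp[None, :, :])**2).sum(-1)
    return np.exp(-sq / (2 * ls**2))

X = rng.normal(size=(10, d_in))
K = deep_kernel(X, X)                          # Eq. (16): k(g(x,w), g(x',w); theta)
```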

9 Our Model

We propose extending the works of deep kernel learning (wilson2016deep) and multivariate generalized Gaussian processes (chan2013multivariate) in two ways:

  1. generalizing deep kernel learning to the multivariate Gaussian process setting, and

  2. generalizing the multivariate Gaussian process to a more flexible (non-diagonal) kernel with low-rank structure.

9.1 Simple Model

Notation: Let $X\in\mathbb{R}^{Dn\times T}$ and $Y\in\mathbb{R}^{Dn\times 1}$ be the set of training observations, where $n$ is the number of observed time series (realizations of a dynamical system), $D$ is the number of observed variables in the dynamical system, and $T$ is the number of time steps for which the system is observed. We encode $X$ in the following way:

$$
\begin{aligned}
z^{(1)} &= \tanh\left(XW^{(0)}+b^{(0)}\right) \\
z^{(2)} &= \tanh\left(z^{(1)}W^{(1)}+b^{(1)}\right) \\
z^{(3)} &= \operatorname{sparsemax}\left(z^{(2)}\right)
\end{aligned}
$$

where $W^{(i)}\in\mathbb{R}^{h_{i}\times h_{i+1}}$, $b^{(i)}\in\mathbb{R}^{h_{i+1}}$, and $h_{0}=T$.
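A minimal sketch of this encoder, including a row-wise sparsemax (the Euclidean projection onto the simplex); the sizes $h_{1}$, $h_{2}$ and the random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)

def sparsemax(Z):
    """Row-wise sparsemax (Martins & Astudillo, 2016): projection onto the simplex."""
    Zs = np.sort(Z, axis=1)[:, ::-1]               # sort each row in decreasing order
    k = np.arange(1, Z.shape[1] + 1)
    Zc = np.cumsum(Zs, axis=1)
    support = 1 + k * Zs > Zc                      # which sorted entries stay nonzero
    k_z = support.sum(axis=1)
    tau = (Zc[np.arange(Z.shape[0]), k_z - 1] - 1) / k_z
    return np.maximum(Z - tau[:, None], 0)

# Encoder z^(1), z^(2), z^(3); sizes h_1 and h_2 are illustrative choices
Dn, T, h1, h2 = 12, 20, 16, 8
X = rng.normal(size=(Dn, T))
W0, b0 = rng.normal(scale=0.1, size=(T, h1)), np.zeros(h1)
W1, b1 = rng.normal(scale=0.1, size=(h1, h2)), np.zeros(h2)

z1 = np.tanh(X @ W0 + b0)
z2 = np.tanh(z1 @ W1 + b1)
z3 = sparsemax(z2)                                 # rows are sparse and sum to 1
print(z3.sum(axis=1))
```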

We then take $Y=f\left(z^{(3)}(x)\right)+\epsilon$ where $f(z^{(3)}(x))\sim GP(0,K)$ and $\epsilon\sim\mathcal{N}(0,\sigma^{2}I)$. We consider two possibilities for constructing $K$:

  1. $K\left(z_{i}^{(3)},z_{j}^{(3)}\right)=z_{i}^{(3)}\,k\,(z_{j}^{(3)})^{T}$ where $k\in\mathbb{R}^{h_{2}\times h_{2}}$.

  2. $K\left(z_{i}^{(3)},z_{j}^{(3)}\right)=\mathrm{RBF}\left(z_{i}^{(3)},z_{j}^{(3)}\right)$

We find empirically that (2) performs better than (1), while both perform worse than the CCRNN, and (2) performs similarly to the RNN on the position data from Acrobot. We are unable to find any interpretability in $k$ in (1).

9.2 Example

To illustrate our model, we consider the Hénon map dynamical system, which maps a point $(x^{(1)}_{n},x^{(2)}_{n})$ to:

$$
\begin{aligned}
x^{(1)}_{n+1} &= 1-a\,(x^{(1)}_{n})^{2}+x^{(2)}_{n} \\
x^{(2)}_{n+1} &= b\,x^{(1)}_{n}
\end{aligned}
$$

for fixed values of $a$ and $b$.
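A short sketch for generating trajectories from the Hénon map; the classical parameter values $a=1.4$, $b=0.3$ and the initial point are one common choice, used here only for illustration.

```python
import numpy as np

def henon_trajectory(n_steps, a=1.4, b=0.3, x0=(0.0, 0.0)):
    """Iterate the Henon map from x0 and return the trajectory as an (n_steps, 2) array."""
    traj = np.zeros((n_steps, 2))
    x1, x2 = x0
    for n in range(n_steps):
        traj[n] = (x1, x2)
        x1, x2 = 1 - a * x1**2 + x2, b * x1
    return traj

X = henon_trajectory(1000)                    # rows are samples [x_n^(1), x_n^(2)]
```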

In this example, we can omit both $g$ and $h$ and consider our input to be of the form $X_{i}=[x_{i}^{(1)},x_{i}^{(2)}]$. Then the low-rank representation $U$ could be

$$U=\begin{pmatrix}-&-\\ -&0\\ -&-\\ -&0\\ \vdots&\vdots\\ -&-\\ -&0\end{pmatrix}$$

where the first column of $U$ represents the influence of $x^{(1)}$ on each sample $[x^{(1)}_{i},x^{(2)}_{i}]$ from the dynamical system, and the second column represents the influence of $x^{(2)}$ on each sample. For this small example, the size of $U$ is $2n\times 2$, and the size of the kernel $K$ of the Gaussian process is $2n\times 2n$.

10 Continuous-time Neural Network Approach

Initial idea:

We observe realizations of $y_{\phi}(t)\in\mathbb{R}^{d}$, a continuous process in $d$ variables with $\phi$ the parameters of the system. We aim to learn the causal relationships between the variables, and to predict $y_{\phi}(t+s)$ for $s\in\mathbb{R}$. We do so by modeling $\nabla y_{\phi}(t)$ with a neural network $f_{\theta}(t,y(t))$, such that $y_{i}(t+s)=y_{i}(t)+\int_{t}^{t+s}\nabla y_{i}(a)\,da$. The loss function is defined on the prediction of $y_{\phi}(t+s)$.

  • The evaluation of the integral can follow the Monte Carlo trick in (mei2016hawkes): the single function evaluation $s\,\nabla y_{i}(a)$ at a random $a\sim\mathrm{Unif}(t,t+s)$ gives an unbiased estimate of the integral, and the algorithm averages over several samples to reduce the variance of the estimator (see the sketch after this list). [With a single sample, it seems the network will just learn the mean of the gradient. A better scheme may be to draw multiple samples to approximate the gradient function.]

  • Causality may be learned by enforcing sparsity in the neural network $f$. (tank2018NeuralGC) places a group lasso penalty on the weights in the first layer, where zero outgoing weights are a sufficient condition to represent Granger non-causality.

  • In the case where $y(t)$ represents latent variables, we will need an additional mapping back to the observed space, $z(t)=g(y(t))$.

  • To learn interactions between different variables, we can introduce an attention mechanism between the different components of the system as is done in (goyal2019recurrent) and (kim2019attentive).
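A minimal sketch of the Monte Carlo prediction step described in the first bullet above; evaluating the gradient network at the current state is a simplifying assumption of this sketch, and `f_theta` is a toy stand-in for the learned network.

```python
import numpy as np

rng = np.random.default_rng(10)

def predict(f_theta, y_t, t, s, n_samples=8):
    """Estimate y(t+s) = y(t) + integral of the gradient, via Monte Carlo over a ~ Unif(t, t+s).

    f_theta(a, y) returns the network's estimate of the gradient at time a; here we
    evaluate it at the current state y_t as a simple (illustrative) approximation.
    """
    a = rng.uniform(t, t + s, size=n_samples)
    grads = np.stack([f_theta(a_k, y_t) for a_k in a])   # one evaluation per sample
    return y_t + s * grads.mean(axis=0)                  # s * grad is unbiased for the integral

# Toy gradient function standing in for f_theta (a linear system, purely for illustration)
A = np.array([[0.0, 1.0], [-1.0, -0.1]])
f_theta = lambda a, y: A @ y

print(predict(f_theta, np.array([1.0, 0.0]), t=0.0, s=0.5))
```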

Evaluation idea:

To evaluate whether the network has learned causal relationships between the variables, we can evaluate the model on dynamical systems with unknown $\phi$. Specifically, consider a dataset of systems sampled (non-uniformly in time) from $y_{\phi_{k}}$ for $k\in\{1,2,\dots,K\}$. During training, we assume access to dynamical systems $y_{1},y_{2},\dots,y_{n}$ sampled from $y_{\phi_{k}}$ where $k\leq J<K$. At test time, we evaluate on series sampled from $y_{\phi_{k}}$ where $J<k\leq K$.

11 Zero-shot regression

The Omnipush dataset (bauza2019) collected 250 pushes for each of 250 objects. Each object has 4 possible sides (concave, triangular, circular, rectangular) with 2 types of extra weights (60g, 150g). This makes it possible to test whether the learned model can generalize across objects.

Let $z_{t}$ denote the state of the object at time $t$, and $a_{t}$ the action to be performed on the object at time $t$. In the Omnipush scenario, $z_{t}=[x_{t},y_{t},\theta_{t}]$, and $a_{t}=[x^{(start)}_{t},y^{(start)}_{t},x^{(end)}_{t},y^{(end)}_{t}]$ gives the starting and ending positions of the pusher. The input can be reduced to 3 dimensions by treating $z_{t}$ as the origin and making use of the fact that the pusher moves at constant speed. Then $z_{t}=[0,0,0]$, and $a_{t}=[x^{(p)}_{t},y^{(p)}_{t},a^{(p)}_{t}]$ contains the pusher location and angle with respect to the object. The target to predict is $z_{t+1}$, or equivalently $\triangle z_{t}$. We denote the characteristics of the object, i.e. types of sides and extra weights, through the vector $c$. Additionally, RGB-D videos are captured, which we can explore adding to $c$ in an extension.

A predictive model normally takes the form of

$$z_{t+1}=g(z_{t},a_{t},c;W)$$

or

$$z_{t+1}=g(z_{t},a_{t};W)$$

for some function $g$ parameterized by $W$, depending on whether $c$ is included as an input. For instance, $g$ can be a neural network. However, the learned $W$ tends to be biased towards the available training samples, and the resulting model does not generalize well to new objects. Additional procedures such as context identifiers are needed to correct for this bias (sanchezgonzalez2018).

We would like to learn the model $g$ in an end-to-end fashion to incorporate the ability to generalize to new objects. We propose learning

$$z_{t+1}=g(z_{t},a_{t};W(c))$$

such that if $d(c,c^{\prime})<d(c,c^{\prime\prime})$, then $\|W(c)-W(c^{\prime})\|_{F}^{2}<\|W(c)-W(c^{\prime\prime})\|_{F}^{2}$, where $d$ is a distance function defining the difference between object characteristics. The key idea is that physical dynamics, and consequently model parameters, are more similar for objects that have more similar characteristics. This would allow the model to generalize to new objects. One possible way to impose this constraint is to first define $W(c)=W\odot M(c)$, where $\odot$ is the elementwise product and $M(c)$ acts as a mask for objects of characteristics $c$ (mallya2018). Then the loss

$$L(c,c^{\prime})=\left(\|M(c)-M(c^{\prime})\|_{F}^{2}-d(c,c^{\prime})\right)^{2}$$

can be included in the objective function on top of the prediction loss. The network is trained through a Siamese network structure.

12 Updated Zero Shot Regression

Let $z_{t}$ denote the state of the object at time $t$, and $a_{t}$ the action to be performed on the object at time $t$. The input can be reduced to 3 dimensions by treating $z_{t}$ as the origin and making use of the fact that the pusher moves at constant speed. Then $z_{t}=[0,0,0]$, and $a_{t}=[x^{(p)}_{t},y^{(p)}_{t},a^{(p)}_{t}]$ contains the pusher location and angle with respect to the object. The target to predict is $z_{t+1}$, or equivalently $\triangle z_{t}$. Additionally, let $c$ be a vector denoting the characteristics of the object.

We propose learning the function

$$z_{t+1}=g(z_{t},a_{t},c;W),$$

where $g(z_{t},a_{t},c)=g_{L}(g_{L-1}(\dots m(c)\cdot g_{1}(z_{t},a_{t})\dots))$ is a deep neural network with $L$ layers and an additional linear transformation $m(c)$ which depends on the characteristics $c$.

To learn $g$, we optimize over pairs of inputs $q_{i}=(z_{i},a_{i},c_{i})$ and $q_{j}=(z_{j},a_{j},c_{j})$, dropping $t$ since we only perform one-step prediction. Denoting the first layer $g_{1}(\cdot)$ as the embedding layer, we optimize the loss function

$$\mathcal{L}(q_{i},q_{j};W)=L(q_{i};W)+L(q_{j};W)+\lambda_{1}\left(\|m(c_{i})-m(c_{j})\|_{2}-\lambda_{2}\|c_{i}-c_{j}\|_{2}\right)$$

where $L$ is the negative log-likelihood with mean given by $g$. The idea here is that by incorporating characteristic information, we will be able to better predict pushes on unseen objects by drawing on objects with similar characteristics.
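A minimal sketch of this pair objective, assuming a Gaussian likelihood with fixed variance for $L$ and toy stand-ins for $g$ and $m(c)$; all names, sizes, and the particular forms of the stand-ins are illustrative, not part of the proposed method.

```python
import numpy as np

rng = np.random.default_rng(11)

def pair_loss(q_i, q_j, g, m, lam1=1.0, lam2=1.0, sigma=0.05):
    """Pair objective: per-sample negative log-likelihood plus the characteristic-matching penalty.

    q = (z, a, c, z_next); g(z, a, c) predicts z_next and m(c) is the characteristic embedding.
    The Gaussian likelihood with fixed sigma is an illustrative assumption.
    """
    def nll(q):
        z, a, c, z_next = q
        resid = z_next - g(z, a, c)
        return 0.5 * np.sum(resid**2) / sigma**2
    c_i, c_j = q_i[2], q_j[2]
    penalty = lam1 * (np.linalg.norm(m(c_i) - m(c_j)) - lam2 * np.linalg.norm(c_i - c_j))
    return nll(q_i) + nll(q_j) + penalty

# Toy g and m standing in for the network and the embedding of the characteristics
M = rng.normal(size=(4, 6))                   # c has 6 entries, embedding has 4
m = lambda c: M @ c
g = lambda z, a, c: z + 0.1 * a * (1 + 0.01 * c.sum())

q_i = (np.zeros(3), rng.normal(size=3), rng.normal(size=6), 0.1 * rng.normal(size=3))
q_j = (np.zeros(3), rng.normal(size=3), rng.normal(size=6), 0.1 * rng.normal(size=3))
print(pair_loss(q_i, q_j, g, m))
```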