
Tensor GP Regression

1 Bayesian Linear Model

Let $x\in\mathbb{R}^{d}$ be an input vector, $w\in\mathbb{R}^{d}$ a weight vector, and $y\in\mathbb{R}$ an observed noisy output value. We model the response value $y$ as

$$y=f(x)+\epsilon \quad\text{and}\quad f(x)=x^{T}w \qquad (1)$$

where $\epsilon\sim N(0,\sigma_{n}^{2})$.

We can write the likelihood of the observed response vector $y$ as

$$p(y|X,w)=N(X^{T}w,\sigma_{n}^{2}I) \qquad (2)$$

where $X$ is the matrix of all training inputs and $y$ is the vector of response values.

Taking $w$ to have prior distribution $w\sim N(0,\Sigma_{d})$, the posterior distribution of $w$ given the training data is

$$p(w|X,y)=N\left(\frac{1}{\sigma_{n}^{2}}A^{-1}Xy,\;A^{-1}\right) \qquad (3)$$

where $A=\sigma_{n}^{-2}XX^{T}+\Sigma_{d}^{-1}$. The predictive distribution of the true function value $f_{*}$ at an unseen input $x_{*}$ is

$$p(f_{*}|x_{*},X,y)=N\left(\sigma_{n}^{-2}x_{*}^{T}A^{-1}Xy,\;x_{*}^{T}A^{-1}x_{*}\right) \qquad (4)$$
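The posterior and predictive computations of Equations (3) and (4) are direct to implement. Below is a minimal numpy sketch on synthetic data; the dimensions, noise level, and prior covariance are illustrative assumptions, not values from any experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 50
sigma_n = 0.1                      # observation noise standard deviation
Sigma_d = np.eye(d)                # prior covariance of w

# Synthetic data: columns of X are inputs, y = X^T w_true + noise
w_true = rng.normal(size=d)
X = rng.normal(size=(d, n))
y = X.T @ w_true + sigma_n * rng.normal(size=n)

# Posterior of w (Eq. 3): A = sigma_n^{-2} X X^T + Sigma_d^{-1}
A = X @ X.T / sigma_n**2 + np.linalg.inv(Sigma_d)
A_inv = np.linalg.inv(A)
w_mean = A_inv @ X @ y / sigma_n**2

# Predictive distribution of f_* at a test input x_* (Eq. 4)
x_star = rng.normal(size=d)
f_mean = x_star @ A_inv @ X @ y / sigma_n**2
f_var = x_star @ A_inv @ x_star

print("posterior mean of w:", w_mean)
print("predictive mean and variance at x_*:", f_mean, f_var)
```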

2 Linear Models with Input Projections

We can define a projection function $\phi(\cdot)$ which maps $x$ to a feature space such that $\phi(x)\in\mathbb{R}^{N}$. Then we can generalize the linear model to

$$y=f(x)+\epsilon \quad\text{and}\quad f(x)=\phi(x)^{T}w. \qquad (5)$$

Equations (2) and (3) hold as before, only with the matrix of input data $X$ replaced by $\Phi$. We can write the predictive distribution of $f_{*}$ as

$$p(f_{*}|x_{*},X,y)=N\left(\phi_{*}^{T}\Sigma_{d}\Phi(K+\sigma_{n}^{2}I)^{-1}y,\;\phi_{*}^{T}\Sigma_{d}\phi_{*}-\phi_{*}^{T}\Sigma_{d}\Phi(K+\sigma_{n}^{2}I)^{-1}\Phi^{T}\Sigma_{d}\phi_{*}\right) \qquad (6)$$

where $K=\Phi^{T}\Sigma_{d}\Phi$ has entries $k(x,x^{\prime})=\sum_{k=1}^{N}\sum_{\ell=1}^{N}\Sigma_{k,\ell}\,\phi_{k}(x)\phi_{\ell}(x^{\prime})$. Noting that $k(x,x^{\prime})=\psi(x)\cdot\psi(x^{\prime})$ where $\psi(x)=\Sigma_{d}^{1/2}\phi(x)$, the kernel trick can be applied in this setting, and we define $k$ as a kernel function.
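The identity $K=\Phi^{T}\Sigma_{d}\Phi$ with $\psi(x)=\Sigma_{d}^{1/2}\phi(x)$ can be checked numerically. The sketch below uses a toy polynomial feature map and a diagonal $\Sigma_{d}$; both are illustrative choices, not part of any cited model.

```python
import numpy as np

rng = np.random.default_rng(1)

N, n = 5, 4                                   # feature dimension, number of inputs
Sigma_d = np.diag(rng.uniform(0.5, 2.0, N))   # illustrative (diagonal) prior covariance

# A toy feature map phi: R -> R^N (polynomial features, for illustration only)
def phi(x):
    return np.array([x**k for k in range(N)])

X = rng.normal(size=n)
Phi = np.stack([phi(x) for x in X], axis=1)   # N x n, columns are phi(x_i)

# K = Phi^T Sigma_d Phi ...
K_feat = Phi.T @ Sigma_d @ Phi

# ... equals the Gram matrix of psi(x) = Sigma_d^{1/2} phi(x)
# (elementwise sqrt is the matrix square root here because Sigma_d is diagonal)
Psi = np.sqrt(Sigma_d) @ Phi
K_psi = Psi.T @ Psi

print(np.allclose(K_feat, K_psi))             # True
```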

3 Gaussian Process

A Gaussian process is a collection of random variables, any finite number of which has a joint Gaussian distribution. It is defined by a mean function and covariance function

$$
\begin{aligned}
m(x) &= \mathbb{E}[f(x)] \\
k(x,x^{\prime}) &= \mathbb{E}[(f(x)-m(x))(f(x^{\prime})-m(x^{\prime}))] \\
f(x) &\sim GP(m(x),k(x,x^{\prime}))
\end{aligned}
$$

We have already seen an example of a Gaussian process: the linear model with input projection $f(x)=\phi(x)^{T}w$. To see that this is a Gaussian process, take

$$
\begin{aligned}
m(x) &= \mathbb{E}[f(x)]=\phi(x)^{T}\mathbb{E}[w]=0 \\
k(x,x^{\prime}) &= \phi(x)^{T}\Sigma_{d}\phi(x^{\prime})=\sum_{k=1}^{N}\sum_{\ell=1}^{N}\Sigma_{k,\ell}\,\phi_{k}(x)\phi_{\ell}(x^{\prime})
\end{aligned}
$$

We now consider an additional example given by (mackay1997gaussian). Let $\phi_{c}(x)=\exp\{-(x-c)^{2}/(2\lambda^{2})\}$. Taking $N\rightarrow\infty$ gives the squared exponential covariance function, so for some covariance functions we need an infinite number of basis functions. A method for converting from an arbitrary GP to an equivalent linear model is given in (quinonero2005analysis). However, for covariance functions built from infinitely many basis functions, the associated linear model will also be infinite. It is important to note that for the conversion to work, there must be a weight associated with each training and test input, which would necessitate an infinite number of weights.
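This limit can be illustrated numerically: with a dense grid of centers standing in for $N\rightarrow\infty$, the normalized finite-sum kernel matches a squared exponential with lengthscale $\sqrt{2}\lambda$. The grid, lengthscale, and normalization below are our illustrative choices.

```python
import numpy as np

lam = 0.5
centers = np.linspace(-10, 10, 2001)          # dense grid approximating N -> infinity

def phi(x):
    # Radial basis functions centered at each c (MacKay's example)
    return np.exp(-(x - centers)**2 / (2 * lam**2))

def k_finite(x, xp):
    return phi(x) @ phi(xp)

x, xp = 0.3, 1.1
ratio = k_finite(x, xp) / k_finite(x, x)      # normalize away the constant factor
se = np.exp(-(x - xp)**2 / (4 * lam**2))      # squared exponential with lengthscale sqrt(2)*lam
print(ratio, se)                              # nearly identical
```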

4 Tensor Regression

The general form for the optimization problem is

$$
\begin{aligned}
W^{*} &= \arg\min_{W}\ \mathbb{L}(W;X,Y) \\
&\text{s.t. } \operatorname{rank}(W)\leq R
\end{aligned}
$$

Several forms for estimates of $Y$ have been studied. They are outlined below.

  1. $y=\left<X,W\right>+\epsilon$ (zhao2011multilinear)

  2. $y=\operatorname{vec}(X)^{T}\operatorname{vec}(W)+\epsilon$ (zhou2013tensor); see the sketch after this list

  3. $Y^{t}=X^{t}w^{t}+\epsilon^{t}$ (romera2013multilinear)

  4. $Y_{m}=X_{m}W_{m}+E_{m}$ (bahadori2014fast)

The choice of tensor regression used to derive the multi-linear GP is (3) (yu2018tensor).
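As a concrete illustration of form (2) with the rank constraint, the sketch below fits a rank-$R$ coefficient matrix by alternating least squares. The sizes, synthetic data, and fitting scheme are our own illustrative choices rather than the algorithm of any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(2)

# Tensor regression of the form y = <vec(X), vec(W)> + eps with rank(W) <= R
p1, p2, R, n = 6, 5, 2, 100

# Rank-R coefficient matrix W = U V^T
U_true, V_true = rng.normal(size=(p1, R)), rng.normal(size=(p2, R))
W_true = U_true @ V_true.T

X = rng.normal(size=(n, p1, p2))
y = np.einsum('nij,ij->n', X, W_true) + 0.01 * rng.normal(size=n)

# Alternating least squares over the two factors (a simple way to keep rank(W) <= R)
U, V = rng.normal(size=(p1, R)), rng.normal(size=(p2, R))
for _ in range(50):
    # Fix V, solve for U: y_n is linear in vec(U) with design X_n V
    A = np.einsum('nij,jr->nir', X, V).reshape(n, -1)
    U = np.linalg.lstsq(A, y, rcond=None)[0].reshape(p1, R)
    # Fix U, solve for V: y_n is linear in vec(V) with design X_n^T U
    B = np.einsum('nij,ir->njr', X, U).reshape(n, -1)
    V = np.linalg.lstsq(B, y, rcond=None)[0].reshape(p2, R)

print("relative error:", np.linalg.norm(U @ V.T - W_true) / np.linalg.norm(W_true))
```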

5 Multi-linear Gaussian Process

Given a total of $m=\sum_{t=1}^{T}n_{t}$ training samples from $T$ related tasks, we assume that each data point $(x_{t,i},y_{t,i})$ is drawn i.i.d. according to the probabilistic model

$$y_{t,i}=f(x_{t,i})+\epsilon_{t} \qquad (7)$$

with $f(x_{t,i})\sim GP(0,k)$. The full model concatenates the data from all tasks (yu2018tensor). The kernel of the full model is built from the Kronecker product of a feature correlation matrix $K_{1}$, a group correlation matrix $K_{2}$, and a matrix $K_{3}$ measuring the correlation between groups. Then,

$$K=\phi(X)\left(K_{3}\otimes K_{2}\otimes K_{1}\right)\phi(X)^{T} \qquad (8)$$

and the full model is

$$
\begin{aligned}
y &= f(X)+e \\
f(X) &\sim GP(0,K) \\
e &\sim N(0,D)
\end{aligned}
$$

where $y=[y_{1,1},y_{1,2},\dots,y_{T,n_{T}}]$, $X$ is a block diagonal matrix of $X_{1},X_{2},\dots,X_{T}$ with $X_{t}=[x_{t,1};x_{t,2};\dots;x_{t,n_{t}}]$, and $D$ is a block diagonal matrix with blocks $\sigma_{t}^{2}I_{n_{t}}$.
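A minimal sketch of constructing the Kronecker-structured kernel of Equation (8) and sampling from the full model; the dimensions, the random stand-in for $\phi(X)$, and the diagonal noise are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def random_psd(d):
    A = rng.normal(size=(d, d))
    return A @ A.T + 1e-3 * np.eye(d)

# Illustrative sizes for the feature, group, and between-group correlation matrices
d1, d2, d3 = 4, 3, 2
K1, K2, K3 = random_psd(d1), random_psd(d2), random_psd(d3)

m = 20                                        # total number of training samples
Phi = rng.normal(size=(m, d3 * d2 * d1))      # stands in for phi(X)

# Eq. (8): K = phi(X) (K3 kron K2 kron K1) phi(X)^T
K = Phi @ np.kron(K3, np.kron(K2, K1)) @ Phi.T
K = K + 1e-8 * np.eye(m)                      # jitter for numerical stability

# Full model: y = f(X) + e, f(X) ~ GP(0, K), e ~ N(0, D)
D = np.diag(rng.uniform(0.01, 0.1, m))        # block-diagonal noise, here fully diagonal
f = rng.multivariate_normal(np.zeros(m), K)
y = f + rng.multivariate_normal(np.zeros(m), D)
```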

When each covariance matrix $K_{m}=U_{m}U_{m}^{T}$, where $U_{m}$ is a low-rank orthogonal matrix, one can show that the solution which estimates $K$ is also an approximate solution which minimizes the rank of $W$ in tensor regression. Thus there appears to be an equivalence between the covariance $K$ and the parameter tensor $W$ (yu2018tensor).

A limitation of the above model is that it cannot model a tensor output $y$ where $y_{t,i}$ is multivariate.

An alternative approach is given by (angell2018inferring). In this work, let $y_{ij}$ be a single radial velocity measurement, $O_{i}$ the set of stations that measure radial velocities at location $x_{i}$, $z_{i}=(u_{i},v_{i},w_{i})$ the latent unobserved velocity vector, and $a_{ij}$ the radial axis. We define $y_{ij}=a_{ij}^{T}z_{i}+\epsilon_{ij}$. Then the distribution of $y|z,x$ is

$$p(y_{ij}|z_{i};x_{i})=N(y_{ij};a_{ij}^{T}z_{i},\sigma^{2}). \qquad (9)$$

Here $i$ indexes over training points, and $j$ indexes over the set $O_{i}$.

The joint likelihood factorizes completely and can be written as the product of individual components.

The latent velocity field $z_{i}$ is modeled as a vector-valued GP with $m(x)=0$ and kernel function

$$
\begin{aligned}
k_{\theta}(x,x^{\prime}) &= \operatorname{diag}\left(\exp\left(\frac{-d_{\alpha}(x,x^{\prime})}{2\beta_{u}}\right),\exp\left(\frac{-d_{\alpha}(x,x^{\prime})}{2\beta_{v}}\right),\exp\left(\frac{-d_{\alpha}(x,x^{\prime})}{2\beta_{w}}\right)\right) \qquad (10) \\
d_{\alpha}(x,x^{\prime}) &= \alpha_{1}(x_{1}-x^{\prime}_{1})^{2}+\alpha_{2}(x_{2}-x^{\prime}_{2})^{2}+\alpha_{3}(x_{3}-x^{\prime}_{3})^{2} \qquad (11)
\end{aligned}
$$

Note that the kernel function produces a $3\times 3$ matrix rather than a single scalar value. The covariance of the observed outputs is

$$\mathrm{Cov}(y,y^{\prime})=\mathbb{E}[yy^{\prime}]=a^{T}\mathbb{E}[zz^{\prime T}]a^{\prime}=a^{T}k_{\theta}(x,x^{\prime})a^{\prime} \qquad (12)$$

The joint distribution of $y$ and $z$ given $x$ is Gaussian, given the forms of $y|z,x$ and the prior $z|x$. Since the joint mean is $0$, it suffices to find the covariance. Let $q^{T}=[z^{T}\;y^{T}]$, $A=\operatorname{diag}(a_{ij}^{T})$, and $K$ be the prior covariance matrix of $z|x$. Then the joint covariance is

$$\mathbb{E}[qq^{T}]=\begin{pmatrix}K&KA^{T}\\ AK^{T}&AKA^{T}+\sigma^{2}I\end{pmatrix} \qquad (13)$$

Naive exact inference can then be performed by considering the posterior mean of $z|y,x$:

$$\mathbb{E}[z|y,x]=KA^{T}(AKA^{T}+\sigma^{2}I)^{-1}y. \qquad (14)$$
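A minimal sketch of this naive exact inference, assuming for simplicity one radial measurement per location so that $A$ is $n\times 3n$; the sizes, hyperparameters, and simulated data are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

n, sigma = 30, 0.1
locs = rng.uniform(0, 10, size=(n, 3))        # 3-d locations x_i
alpha = np.array([1.0, 1.0, 0.2])
betas = np.array([2.0, 2.0, 0.5])             # beta_u, beta_v, beta_w

# Kernel of Eqs. (10)-(11): a 3x3 diagonal matrix per pair of locations
def d_alpha(x, xp):
    return np.sum(alpha * (x - xp)**2)

def k_theta(x, xp):
    return np.diag(np.exp(-d_alpha(x, xp) / (2 * betas)))

# Prior covariance K of the stacked latent velocities z (3n x 3n)
K = np.zeros((3 * n, 3 * n))
for i in range(n):
    for j in range(n):
        K[3*i:3*i+3, 3*j:3*j+3] = k_theta(locs[i], locs[j])
K += 1e-8 * np.eye(3 * n)                     # jitter for sampling

# One radial measurement per location, so A = diag(a_i^T) is n x 3n
axes = rng.normal(size=(n, 3))
axes /= np.linalg.norm(axes, axis=1, keepdims=True)
A = np.zeros((n, 3 * n))
for i in range(n):
    A[i, 3*i:3*i+3] = axes[i]

# Simulate observations and apply the naive exact posterior mean of Eq. (14)
z = rng.multivariate_normal(np.zeros(3 * n), K)
y = A @ z + sigma * rng.normal(size=n)
z_mean = K @ A.T @ np.linalg.solve(A @ K @ A.T + sigma**2 * np.eye(n), y)
```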

To make this computationally tractable, the authors use Laplace’s method and transform the kernel to a stationary one.

A drawback of this model is that the authors consider a basic covariance kernel which is a $3\times 3$ diagonal matrix based on a squared exponential covariance kernel. Perhaps a more general kernel would yield improved results here.

As a general point of inquiry, can we somehow build a general framework which encompasses these methods? Are these methods similar and how do they come out of tensor regression generally?

6 Multivariate Generalized Gaussian Process Models

An extension to Gaussian process models considers data from a more general exponential family (chan2013multivariate). In this setting, the authors extend generalized linear models (GLMs) to incorporate Gaussian process correlation structure. In particular, they extend the standard GLM framework to be

  1. $y\sim p(y;\theta,\varphi)$

  2. $\eta\sim GP(0,K(x,x^{\prime}))$

  3. $\mathbb{E}[T(y)|\theta]=g^{-1}(\eta(X))$ for a link function $g$.

Why does this model consider the expectation of $T(y)$ rather than that of $y$, which is $\mu$ in the standard setting?
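A minimal generative sketch of this hierarchy, assuming a Poisson observation model with a log link and an RBF kernel; these specific choices are ours for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)

n = 40
X = np.sort(rng.uniform(0, 5, n))

# GP prior on the latent function eta
def rbf(x, xp, ls=1.0):
    return np.exp(-(x[:, None] - xp[None, :])**2 / (2 * ls**2))

K = rbf(X, X) + 1e-6 * np.eye(n)
eta = rng.multivariate_normal(np.zeros(n), K)

# Link: E[T(y)] = g^{-1}(eta); here T(y) = y and g = log, so the mean is exp(eta)
mu = np.exp(eta)

# Exponential-family observation model, here Poisson counts
y = rng.poisson(mu)
```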

7 Neural Network Approaches

7.1 RNN

A recurrent neural network predicts $y_{T+1,\cdot}$ given its history of fixed length $\ell$, namely $y_{T,\cdot},y_{T-1,\cdot},\dots,y_{T-\ell+1,\cdot}$, by

$$y_{T+1,\cdot}=f\left(y_{T,\cdot},y_{T-1,\cdot},\dots,y_{T-\ell+1,\cdot}\right)+\epsilon \qquad (15)$$

where $f(\cdot)$ represents the network, parameterized by weight matrices and bias vectors. For $1\leq t\leq T$, with $h_{0}$ being the zero matrix, the update equation is

$$h_{t}=\sigma\left(W_{y}\cdot y_{t,\cdot}+W_{h}\cdot h_{t-1}+b_{h}\right)$$

where $\sigma(\cdot)$ denotes the sigmoid function and $h_{t}$ is the hidden state. The final estimate of $y_{T+1,\cdot}$ is

$$\hat{y}_{T+1,\cdot}=W_{v}\cdot h_{T}+b_{v}$$

The network parameters can be optimized by minimizing the mean squared prediction error, which is equivalent to maximizing the likelihood assuming spherical Gaussian $\epsilon$.
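A minimal numpy sketch of the forward pass defined by the update and readout equations above; training by minimizing the squared prediction error would additionally require automatic differentiation, which we omit. The sizes and weight initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, hidden, ell = 2, 8, 5                      # variables, hidden units, history length
W_y = rng.normal(scale=0.1, size=(hidden, d))
W_h = rng.normal(scale=0.1, size=(hidden, hidden))
b_h = np.zeros(hidden)
W_v = rng.normal(scale=0.1, size=(d, hidden))
b_v = np.zeros(d)

def rnn_predict(history):
    """history: (ell, d) array y_{T-ell+1,.}, ..., y_{T,.}; returns y_hat_{T+1,.}."""
    h = np.zeros(hidden)                      # h_0 is zero
    for y_t in history:                       # h_t = sigma(W_y y_t + W_h h_{t-1} + b_h)
        h = sigmoid(W_y @ y_t + W_h @ h + b_h)
    return W_v @ h + b_v                      # y_hat_{T+1,.} = W_v h_T + b_v

history = rng.normal(size=(ell, d))
print(rnn_predict(history))
```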

7.2 CCRNN

A modification on the RNN is to first non-linearly transform the list of inputs $y_{T,\cdot},y_{T-1,\cdot},\dots,y_{T-\ell+1,\cdot}$ and then pass the results through an individual RNN for each target of prediction. For example, for $n_{t}=2$,

$$
\begin{aligned}
\left[g_{t,1};g_{t,2}\right] &= \sigma\left(W_{g}\cdot[y_{t,1};y_{t,2}]+b_{g}\right) \\
h_{t,1} &= \sigma\left(W_{y,1}\cdot[g_{t,1};g_{t,2}]+W_{h,1}\cdot h_{t-1,1}+b_{h,1}\right) \\
h_{t,2} &= \sigma\left(W_{y,2}\cdot[g_{t,1};g_{t,2}]+W_{h,2}\cdot h_{t-1,2}+b_{h,2}\right)
\end{aligned}
$$

The first non-linear transformation learns relationships between the variable inputs. By increasing the number of layers in this feedforward stage, non-linear operations such as multiplication can be approximated. Such relationships are then used to predict each variable separately.

If we explicitly limit the types of relationships possible in $[g_{t,1};g_{t,2}]$, for instance by specifying a library of functional relationships, we can also study the matrices $W_{y,1}$ and $W_{y,2}$ to infer the important functional relationships that are predictive of each variable.
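A minimal forward-pass sketch of this architecture for two targets; the readout weights and the single-layer feedforward transform are our illustrative assumptions, since the equations above only specify the shared transform and the hidden-state updates.

```python
import numpy as np

rng = np.random.default_rng(7)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

hidden, ell = 8, 5
W_g = rng.normal(scale=0.1, size=(2, 2)); b_g = np.zeros(2)
W_y = [rng.normal(scale=0.1, size=(hidden, 2)) for _ in range(2)]
W_h = [rng.normal(scale=0.1, size=(hidden, hidden)) for _ in range(2)]
b_h = [np.zeros(hidden) for _ in range(2)]
W_v = [rng.normal(scale=0.1, size=hidden) for _ in range(2)]   # assumed linear readouts

def ccrnn_predict(history):
    """history: (ell, 2) array; returns the two one-step-ahead predictions."""
    h = [np.zeros(hidden), np.zeros(hidden)]
    for y_t in history:
        g_t = sigmoid(W_g @ y_t + b_g)                     # shared non-linear transform
        for k in range(2):                                 # one RNN per prediction target
            h[k] = sigmoid(W_y[k] @ g_t + W_h[k] @ h[k] + b_h[k])
    return np.array([W_v[k] @ h[k] for k in range(2)])

print(ccrnn_predict(rng.normal(size=(ell, 2))))
```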

8 Deep Kernel Learning

The goal of deep kernel learning is to combine the flexibility of deep neural networks, which learn representations of high-dimensional data, with Gaussian processes. To do this, the authors transform the covariance kernel function

$$k(x,x^{\prime})\rightarrow k\left(g(x,w),g(x^{\prime},w);\theta,w\right) \qquad (16)$$

From a neural network perspective, the resulting model can be viewed as a network whose final layer has an infinite number of basis functions. All parameters in this model are learned jointly via gradient-based methods (wilson2016deep).
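A minimal sketch of how the composed kernel of Equation (16) can be formed, using a small tanh network for $g(\cdot,w)$ and an RBF base kernel; the architecture and lengthscale are our illustrative choices, and in (wilson2016deep) $w$ and $\theta$ are learned jointly through the GP marginal likelihood, which we do not show.

```python
import numpy as np

rng = np.random.default_rng(8)

# A small feature extractor g(x, w): two tanh layers (architecture is an illustrative choice)
d_in, d_hid, d_feat = 3, 16, 4
W0, b0 = rng.normal(scale=0.3, size=(d_in, d_hid)), np.zeros(d_hid)
W1, b1 = rng.normal(scale=0.3, size=(d_hid, d_feat)), np.zeros(d_feat)

def g(X):
    return np.tanh(np.tanh(X @ W0 + b0) @ W1 + b1)

# Base kernel k(., .; theta): RBF with lengthscale ls applied to the learned features
def deep_kernel(X, Xp, ls=1.0):
    GX, GXp = g(X), g(Xp)
    sq = ((GX[:, None, :] - GXp[None, :, :])**2).sum(-1)
    return np.exp(-sq / (2 * ls**2))

X = rng.normal(size=(10, d_in))
K = deep_kernel(X, X)                          # Eq. (16): k(g(x,w), g(x',w); theta)
```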

9 Our Model

We propose extending the works of deep kernel learning (wilson2016deep) and multivariate generalized Gaussian processes (chan2013multivariate) in two ways:

  1. generalizing deep kernel learning to the multivariate Gaussian process setting, and

  2. generalizing the multivariate Gaussian process to a more flexible (non-diagonal) kernel with low-rank structure.

9.1 Simple Model

Notation: Let $X\in\mathbb{R}^{Dn\times T}$ and $Y\in\mathbb{R}^{Dn\times 1}$ be the set of training observations, where $n$ is the number of observed time series (realizations of a dynamical system), $D$ is the number of observed variables in the dynamical system, and $T$ is the number of time steps for which the system is observed. We encode $X$ in the following way:

$$
\begin{aligned}
z^{(1)} &= \tanh\left(XW^{(0)}+b^{(0)}\right) \\
z^{(2)} &= \tanh\left(z^{(1)}W^{(1)}+b^{(1)}\right) \\
z^{(3)} &= \operatorname{sparsemax}\left(z^{(2)}\right)
\end{aligned}
$$

where $W^{(i)}\in\mathbb{R}^{h_{i}\times h_{i+1}}$, $b^{(i)}\in\mathbb{R}^{h_{i+1}}$, and $h_{0}=T$.
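A minimal sketch of this encoder, including a row-wise sparsemax (the Euclidean projection onto the simplex); the sizes $h_{1}$, $h_{2}$ and the random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)

def sparsemax(Z):
    """Row-wise sparsemax (Martins & Astudillo, 2016): projection onto the simplex."""
    Zs = np.sort(Z, axis=1)[:, ::-1]               # sort each row in decreasing order
    k = np.arange(1, Z.shape[1] + 1)
    Zc = np.cumsum(Zs, axis=1)
    support = 1 + k * Zs > Zc                      # which sorted entries stay nonzero
    k_z = support.sum(axis=1)
    tau = (Zc[np.arange(Z.shape[0]), k_z - 1] - 1) / k_z
    return np.maximum(Z - tau[:, None], 0)

# Encoder z^(1), z^(2), z^(3); sizes h_1 and h_2 are illustrative choices
Dn, T, h1, h2 = 12, 20, 16, 8
X = rng.normal(size=(Dn, T))
W0, b0 = rng.normal(scale=0.1, size=(T, h1)), np.zeros(h1)
W1, b1 = rng.normal(scale=0.1, size=(h1, h2)), np.zeros(h2)

z1 = np.tanh(X @ W0 + b0)
z2 = np.tanh(z1 @ W1 + b1)
z3 = sparsemax(z2)                                 # rows are sparse and sum to 1
print(z3.sum(axis=1))
```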

We then take $Y=f\left(z^{(3)}(x)\right)+\epsilon$ where $f(z^{(3)}(x))\sim GP(0,K)$ and $\epsilon\sim\mathcal{N}(0,\sigma^{2}I)$. We consider two possibilities for constructing $K$:

  1. $K\left(z_{i}^{(3)},z_{j}^{(3)}\right)=z_{i}^{(3)}\,k\,(z_{j}^{(3)})^{T}$ where $k\in\mathbb{R}^{h_{2}\times h_{2}}$.

  2. $K\left(z_{i}^{(3)},z_{j}^{(3)}\right)=\mathrm{RBF}\left(z_{i}^{(3)},z_{j}^{(3)}\right)$

We find empirically that (2) performs better than (1), while both perform worse than the CCRNN, and (2) performs similarly to the RNN on the position data from Acrobot. We are unable to find any interpretability in $k$ in (1).

9.2 Example

To illustrate our model, we consider the Hénon map dynamical system, which maps a point $(x^{(1)}_{n},x^{(2)}_{n})$ to:

$$
\begin{aligned}
x^{(1)}_{n+1} &= 1-a\,(x^{(1)}_{n})^{2}+x^{(2)}_{n} \\
x^{(2)}_{n+1} &= b\,x^{(1)}_{n}
\end{aligned}
$$

for fixed values of $a$ and $b$.
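A short sketch for generating trajectories from the Hénon map; the classical parameter values $a=1.4$, $b=0.3$ and the initial point are one common choice, used here only for illustration.

```python
import numpy as np

def henon_trajectory(n_steps, a=1.4, b=0.3, x0=(0.0, 0.0)):
    """Iterate the Henon map from x0 and return the trajectory as an (n_steps, 2) array."""
    traj = np.zeros((n_steps, 2))
    x1, x2 = x0
    for n in range(n_steps):
        traj[n] = (x1, x2)
        x1, x2 = 1 - a * x1**2 + x2, b * x1
    return traj

X = henon_trajectory(1000)                    # rows are samples [x_n^(1), x_n^(2)]
```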

In this example, we can omit both $g$ and $h$ and consider our input to be of the form $X_{i}=[x_{i}^{(1)},x_{i}^{(2)}]$. Then the low-rank representation $U$ could be

$$U=\begin{pmatrix}-&-\\ -&0\\ -&-\\ -&0\\ \vdots&\vdots\\ -&-\\ -&0\end{pmatrix}$$

where the first column of $U$ represents the influence of $x^{(1)}$ on each sample $[x^{(1)}_{i},x^{(2)}_{i}]$ from the dynamical system, and the second column represents the influence of $x^{(2)}$ on each sample. For this small example, the size of $U$ is $2n\times 2$, and the size of the kernel $K$ of the Gaussian process is $2n\times 2n$.

10 Continuous-time Neural Network Approach

Initial idea:

We observe realizations of $y_{\phi}(t)\in\mathbb{R}^{d}$, a continuous process in $d$ variables with $\phi$ the parameters of the system. We aim to learn the causal relationships between the variables, and to predict $y_{\phi}(t+s)$ for $s\in\mathbb{R}$. We do so by modeling $\nabla y_{\phi}(t)$ with a neural network $f_{\theta}(t,y(t))$, such that $y_{i}(t+s)=y_{i}(t)+\int_{t}^{t+s}\nabla y_{i}(a)\,da$. The loss function is defined on the prediction of $y_{\phi}(t+s)$.

  • The evaluation of the integral can follow the Monte Carlo trick in (mei2016hawkes): the single function evaluation $s\,\nabla y_{i}(a)$ at a random $a\sim\mathrm{Unif}(t,t+s)$ gives an unbiased estimate of the integral, and the algorithm averages over several samples to reduce the variance of the estimator (see the sketch after this list). [With a single sample, it seems the network will just learn the mean of the gradient. A better scheme may be to draw multiple samples to approximate the gradient function.]

  • Causality may be learned by enforcing sparsity in the neural network $f$. (tank2018NeuralGC) places a group lasso penalty on the weights in the first layer, where zero outgoing weights are a sufficient condition to represent Granger non-causality.

  • In the case where $y(t)$ represents latent variables, we will need an additional mapping back to the observed space, $z(t)=g(y(t))$.

  • To learn interactions between different variables, we can introduce an attention mechanism between the different components of the system as is done in (goyal2019recurrent) and (kim2019attentive).
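A minimal sketch of the Monte Carlo prediction step described in the first bullet above; evaluating the gradient network at the current state is a simplifying assumption of this sketch, and `f_theta` is a toy stand-in for the learned network.

```python
import numpy as np

rng = np.random.default_rng(10)

def predict(f_theta, y_t, t, s, n_samples=8):
    """Estimate y(t+s) = y(t) + integral of the gradient, via Monte Carlo over a ~ Unif(t, t+s).

    f_theta(a, y) returns the network's estimate of the gradient at time a; here we
    evaluate it at the current state y_t as a simple (illustrative) approximation.
    """
    a = rng.uniform(t, t + s, size=n_samples)
    grads = np.stack([f_theta(a_k, y_t) for a_k in a])   # one evaluation per sample
    return y_t + s * grads.mean(axis=0)                  # s * grad is unbiased for the integral

# Toy gradient function standing in for f_theta (a linear system, purely for illustration)
A = np.array([[0.0, 1.0], [-1.0, -0.1]])
f_theta = lambda a, y: A @ y

print(predict(f_theta, np.array([1.0, 0.0]), t=0.0, s=0.5))
```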

Evaluation idea:

To evaluate whether the network has learned causal relationships between the variables, we can evaluate the model on dynamical systems with unknown $\phi$. Specifically, consider a dataset of systems sampled (non-uniformly in time) from $y_{\phi_{k}}$ for $k\in\{1,2,\dots,K\}$. During training, we assume access to dynamical systems $y_{1},y_{2},\dots,y_{n}$ sampled from $y_{\phi_{k}}$ where $k\leq J<K$. At test time, we evaluate on series sampled from $y_{\phi_{k}}$ where $J<k\leq K$.

11 Zero-shot regression

The Omnipush dataset (bauza2019) collected 250 pushes for each of 250 objects. Each object has 4 possible sides (concave, triangular, circular, rectangular) with 2 types of extra weights (60g, 150g). This makes it possible to test whether the learned model can generalize across objects.

Let $z_{t}$ denote the state of the object at time $t$, and $a_{t}$ the action to be performed on the object at time $t$. In the Omnipush scenario, $z_{t}=[x_{t},y_{t},\theta_{t}]$, and $a_{t}=[x^{(start)}_{t},y^{(start)}_{t},x^{(end)}_{t},y^{(end)}_{t}]$ gives the starting and ending positions of the pusher. The input can be reduced to 3 dimensions by treating $z_{t}$ as the origin and making use of the fact that the pusher moves at constant speed. Then $z_{t}=[0,0,0]$, and $a_{t}=[x^{(p)}_{t},y^{(p)}_{t},a^{(p)}_{t}]$ contains the pusher location and angle with respect to the object. The target to predict is $z_{t+1}$, or equivalently $\triangle z_{t}$. We denote the characteristics of the object, i.e. types of sides and extra weights, through the vector $c$. Additionally, RGB-D videos are captured, which we can explore adding to $c$ in an extension.

A predictive model normally takes the form of

$$z_{t+1}=g(z_{t},a_{t},c;W)$$

or

$$z_{t+1}=g(z_{t},a_{t};W)$$

for some function $g$ parameterized by $W$, depending on whether $c$ is included as an input. For instance, $g$ can be a neural network. However, the learned $W$ tends to be biased towards the available training samples, and the resulting model does not generalize well to new objects. Additional procedures such as context identifiers are needed to correct for this bias (sanchezgonzalez2018).

We would like to learn the model $g$ in an end-to-end fashion to incorporate the ability to generalize to new objects. We propose learning

$$z_{t+1}=g(z_{t},a_{t};W(c))$$

such that if $d(c,c^{\prime})<d(c,c^{\prime\prime})$, then $\|W(c)-W(c^{\prime})\|_{F}^{2}<\|W(c)-W(c^{\prime\prime})\|_{F}^{2}$, where $d$ is a distance function defining the difference between object characteristics. The key idea is that physical dynamics, and consequently model parameters, are more similar for objects that have more similar characteristics. This would allow the model to generalize to new objects. One possible way to impose this constraint is to first define $W(c)=W\odot M(c)$, where $\odot$ is the elementwise product and $M(c)$ acts as a mask for objects of characteristics $c$ (mallya2018). Then the loss

$$L(c,c^{\prime})=\left(\|M(c)-M(c^{\prime})\|_{F}^{2}-d(c,c^{\prime})\right)^{2}$$

can be included in the objective function on top of the prediction loss. The network is trained through a Siamese network structure.

12 Updated Zero Shot Regression

Let $z_{t}$ denote the state of the object at time $t$, and $a_{t}$ the action to be performed on the object at time $t$. The input can be reduced to 3 dimensions by treating $z_{t}$ as the origin and making use of the fact that the pusher moves at constant speed. Then $z_{t}=[0,0,0]$, and $a_{t}=[x^{(p)}_{t},y^{(p)}_{t},a^{(p)}_{t}]$ contains the pusher location and angle with respect to the object. The target to predict is $z_{t+1}$, or equivalently $\triangle z_{t}$. Additionally, let $c$ be a vector denoting the characteristics of the object.

We propose learning the function

$$z_{t+1}=g(z_{t},a_{t},c;W),$$

where $g(z_{t},a_{t},c)=g_{L}(g_{L-1}(\dots m(c)\cdot g_{1}(z_{t},a_{t})\dots))$ is a deep neural network with $L$ layers and an additional linear transformation $m(c)$ which depends on the characteristics $c$.

To learn $g$, we optimize over pairs of inputs $q_{i}=(z_{i},a_{i},c_{i})$ and $q_{j}=(z_{j},a_{j},c_{j})$, dropping $t$ since we only perform one-step prediction. Denoting the first layer $g_{1}(\cdot)$ as the embedding layer, we optimize the loss function

$$\mathcal{L}(q_{i},q_{j};W)=L(q_{i};W)+L(q_{j};W)+\lambda_{1}\left(\|m(c_{i})-m(c_{j})\|_{2}-\lambda_{2}\|c_{i}-c_{j}\|_{2}\right)$$

where $L$ is the negative log-likelihood with mean given by $g$. The idea here is that by incorporating characteristic information, we will be able to better predict pushes on unseen objects by drawing on objects with similar characteristics.
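A minimal sketch of this pair objective, assuming a Gaussian likelihood with fixed variance for $L$ and toy stand-ins for $g$ and $m(c)$; all names, sizes, and the particular forms of the stand-ins are illustrative, not part of the proposed method.

```python
import numpy as np

rng = np.random.default_rng(11)

def pair_loss(q_i, q_j, g, m, lam1=1.0, lam2=1.0, sigma=0.05):
    """Pair objective: per-sample negative log-likelihood plus the characteristic-matching penalty.

    q = (z, a, c, z_next); g(z, a, c) predicts z_next and m(c) is the characteristic embedding.
    The Gaussian likelihood with fixed sigma is an illustrative assumption.
    """
    def nll(q):
        z, a, c, z_next = q
        resid = z_next - g(z, a, c)
        return 0.5 * np.sum(resid**2) / sigma**2
    c_i, c_j = q_i[2], q_j[2]
    penalty = lam1 * (np.linalg.norm(m(c_i) - m(c_j)) - lam2 * np.linalg.norm(c_i - c_j))
    return nll(q_i) + nll(q_j) + penalty

# Toy g and m standing in for the network and the embedding of the characteristics
M = rng.normal(size=(4, 6))                   # c has 6 entries, embedding has 4
m = lambda c: M @ c
g = lambda z, a, c: z + 0.1 * a * (1 + 0.01 * c.sum())

q_i = (np.zeros(3), rng.normal(size=3), rng.normal(size=6), 0.1 * rng.normal(size=3))
q_j = (np.zeros(3), rng.normal(size=3), rng.normal(size=6), 0.1 * rng.normal(size=3))
print(pair_loss(q_i, q_j, g, m))
```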