
Efficient learning of nonlinear prediction models with time-series privileged information

Bastian Jung
Chalmers University of Technology
[email protected]
Fredrik D. Johansson
Chalmers University of Technology
[email protected]
Abstract

In domains where sample sizes are limited, efficient learning algorithms are critical. Learning using privileged information (LuPI) offers increased sample efficiency by allowing prediction models access to auxiliary information at training time which is unavailable when the models are used. In recent work, it was shown that for prediction in linear-Gaussian dynamical systems, a LuPI learner with access to intermediate time series data is never worse and often better in expectation than any unbiased classical learner. We provide new insights into this analysis and generalize it to nonlinear prediction tasks in latent dynamical systems, extending theoretical guarantees to the case where the map connecting latent variables and observations is known up to a linear transform. In addition, we propose algorithms based on random features and representation learning for the case when this map is unknown. A suite of empirical results confirms the theoretical findings and shows the potential of using privileged time-series information in nonlinear prediction.

1 Introduction

In data-poor domains, making the best use of all available information is central to efficient and effective machine learning. Despite this, supervised learning is often applied in such a way that informative data is ignored. A good example of this is learning to predict the condition of a patient at a set follow-up time based on information from the first medical examination. Classical supervised learning uses only the initial data to predict disease status at the follow-up, although in many cases data on medications, laboratory tests or vital signs are routinely collected at intermediate time points as well. This information is privileged (Vapnik and Vashist, 2009): it is unavailable at the time of prediction, but can be used for training a model.

Learning using privileged information (LuPI) (Vapnik and Vashist, 2009), generalized distillation (Lopez-Paz et al., 2016) and multi-view learning (Rosenberg and Bartlett, 2007) have been proposed to increase sample efficiency by leveraging privileged information during learning. Theoretical results guarantee improved learning rates (Pechyony and Vapnik, 2010; Vapnik et al., 2015) or tighter generalization bounds (Wang, 2019) for large sample sizes under appropriate assumptions. However, privileged information is not always beneficial; it must be related to the task at hand (Jonschkowski et al., 2015), and previous works do not identify such settings. Moreover, existing analyses do not state when learning with privileged information is preferable to classical learning for problems with small sample sizes, which is where efficiency is needed the most.

Karlsson et al. (2022) studied LuPI in the context of predicting an outcome observed at the end of a time series based on variables collected at the first time step. They showed that making use of data from intermediate time steps in particular settings always leads to lower or equal prediction risk—for any sample size—compared to the best unbiased model which does not make use of this privileged information. However, their method, called learning using privileged time series (LuPTS), was limited to settings where the outcome function, and estimators of it, are linear functions of baseline features. Moreover, their analysis did not study how variance reduction behaves as a function of increased input dimension. Hayashi et al. (2019) also learned from privileged intermediate time points, but their study was limited to empirical results for classification using generalized distillation.

There is an abundance of real-world prediction tasks with fixed follow-up times which can be framed as having access to privileged time-series information. Examples include predicting 90-day patient mortality (Karhade et al., 2019) or patient readmission in healthcare (Mortazavi et al., 2016), the churn of users of an online service over a fixed period (Huang et al., 2012) or yearly crop yields from satellite imagery of farms (You et al., 2017). In these cases, privileged time-series information comprises daily patient vitals, intermediate user interactions and daily satellite imagery respectively. We are motivated by finding sample-efficient learning algorithms that utilize privileged time-series information to improve upon classical learning alternatives in such settings.

Contributions.

We extend the LuPTS framework to nonlinear models and prediction tasks in latent dynamical systems (Section 3). In this setting, we prove that learning with privileged information leads to lower risk when the nonlinear map connecting latent variables and observations is known up to a linear transform (Section 3.1). In doing so, we also find that when the representation dimension grows larger than the number of samples, the benefit of privileged information vanishes. We show that a privileged learner using random feature maps can learn optimal models consistently, even when the relationship between latent and observed variables is unknown, and give a practical algorithm based on this idea (Section 3.2). However, random feature methods may suffer from bias in small samples. As a remedy, we propose several representation learning algorithms aimed at trading off bias and variance (Section 3.3). In experiments, we find that privileged time-series learners with either random features or representation learning reduce variance and improve latent state recovery in small-sample settings (Section 4) on both synthetic and real-world regression tasks.

2 Prediction and privileged information in nonlinear time series

Figure 1: Comparison between the linear data-generating process and the latent nonlinear generalization studied in this work. (a) Linear dynamical system, fully observed. (b) Linear latent dynamics, nonlinear observations. $\Psi$ indicates the observation function of the latent system, with $\Phi$ its left inverse.

We aim to predict outcomes $Y\in\mathcal{Y}\subseteq\mathbb{R}^{q}$ based on covariates $X_{1}\in\mathcal{X}\subseteq\mathbb{R}^{k}$. $X_{1}$ is observed at a baseline time point $t=1$, starting a discrete time series of the form $X_{1},X_{2},\dots,X_{T},Y$. Outcomes are assumed to be compositions of a representation $\Phi$, a linear map $\theta$ and Gaussian noise $\epsilon$,

$$Y=h(X_{1})+\epsilon, \quad\mbox{where}\quad h(x_{1})\coloneqq\theta^{\top}\Phi(x_{1}), \quad\mbox{and}\quad \epsilon\sim\mathcal{N}(0,\tilde{\sigma}_{Y}^{2}). \tag{1}$$

In addition to observations of $X_{1}$ and $Y$, privileged information (PI) is available in the form of samples of random variables $X_{2},\dots,X_{T}$ from intermediate time points between $X_{1}$ and $Y$, all taking values in $\mathcal{X}$. Two data-generating processes (DGPs) with this structure are illustrated in Figure 1. Unlike the baseline variables $X_{1}$, privileged information is observed only at training time, not at test time; therefore, it can benefit learning, but not inference. Data sets of $m\in\mathbb{N}_{+}$ training samples, $D\coloneqq\{(x_{i,1},x_{i,2},\dots,x_{i,T},y_{i})\}_{i=1}^{m}$, are drawn independently and identically distributed from a fixed, unknown distribution $p$ over all random variables in our system. We let $\mathbf{X}_{t}=[x_{1,t},\dots,x_{m,t}]^{\top}$ denote the data matrix of features observed at time $t=1,\dots,T$ and $\mathbf{Y}=[y_{1},\dots,y_{m}]^{\top}$ the vector of all outcomes observed in $D$. A learning algorithm $\mathscr{A}:\mathcal{D}\rightarrow\mathcal{H}$ maps data sets $D\in\mathcal{D}$ to hypotheses $\hat{h}\in\mathcal{H}$. An efficient algorithm $\mathscr{A}$ has small expected risk $\overline{R}_{p}(\mathscr{A})$ with respect to a loss $L:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}$ over training sets of fixed size $m$ drawn from $p$,

$$\overline{R}_{p}(\mathscr{A})\coloneqq\mathbb{E}_{D}[R_{p}(\hat{h})] \quad\mbox{where}\quad \hat{h}=\mathscr{A}(D) \quad\mbox{and}\quad R_{p}(\hat{h})\coloneqq\mathbb{E}_{p}[L(\hat{h}(X_{1}),Y)]. \tag{2}$$

We study the regression setting with $L$ the squared error, $L(y,y^{\prime})=\|y-y^{\prime}\|_{2}^{2}$. In our analysis, we focus on univariate outcomes, $q=1$, but all results extend to multivariate outcomes, $q>1$.

Our main goal is to compare privileged learners $\mathscr{A}_{\textsc{p}}$, which make use of privileged information, to classical learners $\mathscr{A}_{\textsc{c}}$, which learn without PI. We seek to identify conditions and algorithms for which privileged learning is more efficient, i.e., leads to lower risk in expectation, $\overline{R}(\mathscr{A}_{\textsc{p}})\leq\overline{R}(\mathscr{A}_{\textsc{c}})$, for the same number of training samples. We describe such a setting next.

2.1 Privileged information in latent dynamical systems

We study tasks where data are produced by a latent dynamical process. Observations $X_{t}$ are generated from unobserved latent states $Z_{t}$ through a nonlinear transformation $\Psi$; see Figure 1(b). This means that the target of learning, $h(X_{1})=\mathbb{E}[Y\mid X_{1}]$, is nonlinear in $X_{1}$. Latent dynamical systems like these have proven successful at modelling a variety of phenomena, for example fluid dynamics in physics (Lee and Carlberg, 2019) and human brain activity in neuroscience (Kao et al., 2015).

Assumption 1 (Latent linear-Gaussian system).

Let $Z_{1},\dots,Z_{T}$ be latent states related as a linear-Gaussian dynamical system in a space $\mathcal{Z}\subseteq\mathbb{R}^{d}$, with $Z_{1}$ of arbitrary distribution. Further, let the observation function be an injective map $\Psi:\mathcal{Z}\rightarrow\mathcal{X}$ with left inverse (representation) $\Phi:\mathcal{X}\rightarrow\mathcal{Z}$. With $A_{1},\dots,A_{T-1}\in\mathbb{R}^{d\times d}$ and $\beta\in\mathbb{R}^{d}$, assume that $Z_{1},\dots,Z_{T},X_{1},\dots,X_{T},Y$ are generated as

$$Z_{t}=A_{t-1}^{\top}Z_{t-1}+\epsilon_{t} \;\mbox{ for }\; t\geq 2, \qquad \forall t: X_{t}=\Psi(Z_{t}), \qquad Y=\beta^{\top}Z_{T}+\epsilon_{Y},$$

where $\epsilon_{t}\sim\mathcal{N}(0,\sigma_{t}^{2}I)$ and $\epsilon_{Y}\sim\mathcal{N}(0,\sigma_{Y}^{2})$.
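To make the generative process concrete, the following is a minimal NumPy sketch of sampling data consistent with Assumption 1. The specific choices of $\Psi$, dimensions and noise scales are ours, for illustration only.

```python
import numpy as np

def sample_latent_lgs(m, d=2, T=3, sigma=0.1, sigma_y=0.1, seed=0):
    """Draw m sequences (X_1, ..., X_T, Y) from a latent linear-Gaussian system."""
    rng = np.random.default_rng(seed)
    A = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(T - 1)]  # dynamics A_1..A_{T-1}
    beta = rng.normal(size=d)                                         # outcome coefficients

    def psi(Z):
        # An example injective observation map (hypothetical choice).
        return np.concatenate([Z, Z ** 3], axis=-1)

    Z = [rng.normal(size=(m, d))]                                     # Z_1, arbitrary distribution
    for t in range(T - 1):
        Z.append(Z[-1] @ A[t] + sigma * rng.normal(size=(m, d)))      # Z_{t+1} = A_t^T Z_t + eps_t
    X = [psi(Zt) for Zt in Z]                                         # X_t = Psi(Z_t)
    y = Z[-1] @ beta + sigma_y * rng.normal(size=m)                   # Y = beta^T Z_T + eps_Y
    return X, y
```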

It is easy to verify that Assumption 1 is consistent with (1), but stronger: there are more systems that map $X_{1}$ to $Y$ as $\theta^{\top}\Phi(X_{1})$ than systems where, additionally, $X_{t}=\Psi(Z_{t})$. Nevertheless, it is much more general than the setting of Karlsson et al. (2022), which is limited to the linear case, i.e., $\Phi(x_{1})=x_{1}$.

Next, we present a generalized version of the LuPTS algorithm of Karlsson et al. (2022) that is provably preferable to classical learning when data is generated according to Assumption 1 and $\Phi$ is known up to a linear transform. Further, we discuss how a privileged learner can be made universally consistent for unknown representations $\Phi$ when combined with random features.

3 Efficient learning with time-series privileged information

We analyze and compare learners from the privileged and classical paradigms which produce estimates of the form $\hat{h}(x_{1})=\hat{\theta}^{\top}\hat{\Phi}(x_{1})$, as motivated by (1). We let each use representations $\hat{\Phi}:\mathcal{X}\rightarrow\hat{\mathcal{Z}}\subseteq\mathbb{R}^{\hat{d}}$ from a family $\mathcal{F}$ shared by both paradigms, so that the hypothesis class $\mathcal{H}\ni\hat{h}$ is shared as well.

Input: Data $D=(\{\mathbf{X}_{t}\},\mathbf{Y})$; representation $\hat{\Phi}$ or kernel $\kappa$
if using a fixed representation $\hat{\Phi}$ then
    $\hat{\mathbf{Z}}_{t}=[\hat{\Phi}(x_{1,t}),\dots,\hat{\Phi}(x_{m,t})]^{\top}$ for $t=1,\dots,T$
    $\hat{\theta}_{\textsc{p}}\coloneqq\big[\prod_{t=1}^{T-1}\underbrace{(\hat{\mathbf{Z}}_{t}^{\top}\hat{\mathbf{Z}}_{t})^{\dagger}\hat{\mathbf{Z}}_{t}^{\top}\hat{\mathbf{Z}}_{t+1}}_{\hat{A}_{t}}\big]\underbrace{(\hat{\mathbf{Z}}_{T}^{\top}\hat{\mathbf{Z}}_{T})^{\dagger}\hat{\mathbf{Z}}_{T}^{\top}\mathbf{Y}}_{\hat{\beta}}$
    $\hat{h}_{\textsc{p}}(\cdot)\coloneqq\hat{\theta}_{\textsc{p}}^{\top}\hat{\Phi}(\cdot)$
else if using kernel $\kappa$ then
    $\hat{\mathbf{K}}_{t}=\kappa(\mathbf{X}_{t},\mathbf{X}_{t})$ for $t=1,\dots,T$
    $\boldsymbol{\alpha}\coloneqq\hat{\mathbf{K}}_{1}^{\dagger}\big[\prod_{t=2}^{T}\hat{\mathbf{K}}_{t}\hat{\mathbf{K}}_{t}^{\dagger}\big]\mathbf{Y}$
    $\hat{h}_{\textsc{p}}(\cdot)\coloneqq\sum_{i=1}^{m}\alpha_{i}\,\kappa(x_{i,1},\cdot)$
return $\hat{h}_{\textsc{p}}$
Algorithm 1: Generalized LuPTS
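The fixed-representation branch of Algorithm 1 reduces to a chain of least-squares fits. Below is a minimal NumPy sketch of this branch, our own illustration rather than the authors' reference implementation; it pairs with the sample_latent_lgs helper above.

```python
def generalized_lupts(X_seq, y, phi_hat):
    """Fit generalized LuPTS with a fixed representation phi_hat.

    X_seq: list of T arrays of shape (m, k), observations at times 1..T.
    y:     outcomes of shape (m,). Returns a predictor of Y from X_1.
    """
    Z = [phi_hat(X) for X in X_seq]              # embed every time step
    theta = np.linalg.pinv(Z[-1]) @ y            # beta_hat = (Z_T^T Z_T)^+ Z_T^T y
    for t in range(len(Z) - 2, -1, -1):          # multiply in A_hat_t for t = T-1, ..., 1
        A_t = np.linalg.pinv(Z[t]) @ Z[t + 1]    # A_hat_t = (Z_t^T Z_t)^+ Z_t^T Z_{t+1}
        theta = A_t @ theta                      # accumulate theta_p = A_1 ... A_{T-1} beta
    return lambda X1: phi_hat(X1) @ theta        # h_p(x1) = theta_p^T phi_hat(x1)
```

For example, `generalized_lupts(*sample_latent_lgs(200), phi_hat=lambda X: X)` fits the linear variant.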
Figure 2: Two regimes of generalized LuPTS with varying feature dimension $\hat{d}$ and sample size $m=100$. When $m>\hat{d}$, generalized LuPTS is provably never worse than OLS. When $m\leq\hat{d}$, the two are equivalent. See Appendix F for details.

3.1 Learning with true representations known up to a linear transform

In this section, we assume that privileged and classical learners have access to a common, fixed representation function $\hat{\Phi}$ which is related to the true representation function through a linear transform, meaning there exists a matrix $B$ such that $\hat{\Phi}(\cdot)=B\Phi(\cdot)$.

Classical learning.

As a classical learner, serving as a strong comparison point for our privileged learners, we use the ordinary least-squares (OLS) linear estimator applied to the latent variables at the first time point, as inferred by $\hat{\Phi}$. With $\hat{\mathbf{Z}}_{1}=[\hat{\Phi}(x_{1,1}),\dots,\hat{\Phi}(x_{m,1})]^{\top}\in\mathbb{R}^{m\times\hat{d}}$,

$$\mathscr{A}_{\textsc{c}}(D)=\hat{h}_{\textsc{c}}(\cdot)\coloneqq\hat{\theta}_{\textsc{c}}^{\top}\hat{\Phi}(\cdot) \quad\mbox{where}\quad \hat{\theta}_{\textsc{c}}\coloneqq(\hat{\mathbf{Z}}_{1}^{\top}\hat{\mathbf{Z}}_{1})^{\dagger}\hat{\mathbf{Z}}_{1}^{\top}\mathbf{Y}. \tag{3}$$

When $m\geq\hat{d}$ and $\Phi$ is known up to a linear transform, $\hat{h}_{\textsc{c}}$ is the minimum-risk unbiased estimator of $h$ which does not use privileged information. To accommodate the underdetermined case where $m<\hat{d}$, the matrix inverse is replaced by the Moore-Penrose pseudoinverse $(\cdot)^{\dagger}$ (Penrose, 1956).
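For comparison, the classical learner (3) is a single pseudoinverse solve on the embedded baseline data; a sketch in the same style as before:

```python
def classical_learner(X1, y, phi_hat):
    """OLS on the embedded baseline covariates, as in eq. (3)."""
    Z1 = phi_hat(X1)                      # (m, d_hat) embedding of the first time step
    theta_c = np.linalg.pinv(Z1) @ y      # theta_c = (Z_1^T Z_1)^+ Z_1^T y
    return lambda X: phi_hat(X) @ theta_c
```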

Generalized LuPTS.

For privileged learning, we first compute $\hat{\mathbf{Z}}_{t}=[\hat{\Phi}(x_{1,t}),\dots,\hat{\Phi}(x_{m,t})]^{\top}$ for $t=1,\dots,T$. Then, we independently fit parameter estimates $\hat{A}_{1},\dots,\hat{A}_{T-1},\hat{\beta}$ of the dynamical system shown in Figure 1(b) by minimizing the squared error in single-step predictions. This amounts to a series of OLS estimates in $\hat{\mathcal{Z}}$. At test time, baseline variables $x_{1}$ are embedded with $\hat{\Phi}$ and the latent dynamical system is simulated for $T$ time steps to form a prediction $\hat{h}_{\textsc{p}}(x_{1})=(\hat{A}_{1}\cdots\hat{A}_{T-1}\hat{\beta})^{\top}\hat{\Phi}(x_{1})$. Putting this together, we arrive at Algorithm 1, which we call generalized LuPTS. Applying a simple matrix identity, we may replace the terms $\hat{\mathbf{Z}}_{t}\hat{\mathbf{Z}}_{t}^{\top}$ in Algorithm 1 by the Gram matrices $\hat{\mathbf{K}}_{t}=\kappa(\mathbf{X}_{t},\mathbf{X}_{t})$ of a reproducing kernel $\kappa$ with corresponding (implicit) feature map $\hat{\Phi}$, $\kappa(x,x^{\prime})=\langle\hat{\Phi}(x),\hat{\Phi}(x^{\prime})\rangle$ (Smola and Schölkopf, 1998). This variant allows for learning with unknown $\Phi$.

We now state a result which says that, under Assumption 1, for an appropriate fixed representation $\hat{\Phi}$ or kernel $\kappa$, generalized LuPTS is never worse in expectation than the classical learner in (3).

Theorem 1.

Let $D$ be a data set of size $m$ drawn from $p$, consistent with Assumption 1. Assume that the left inverse $\Phi:\mathcal{X}\rightarrow\mathcal{Z}$ of the observation function $\Psi$ is known up to a linear transform, explicitly or through a kernel $\kappa(x,x^{\prime})=\langle\hat{\Phi}(x),\hat{\Phi}(x^{\prime})\rangle$; i.e., there exists a matrix $B$ with linearly independent columns such that $\hat{\Phi}(x)=B\Phi(x)$ for all $x\in\mathcal{X}$, with $\hat{\Phi}:\mathcal{X}\rightarrow\hat{\mathcal{Z}}$. Then, it holds for the privileged learner $\mathscr{A}_{\textsc{p}}(D)=\hat{h}_{\textsc{p}}$ (Algorithm 1) and the classical learner $\mathscr{A}_{\textsc{c}}(D)=\hat{h}_{\textsc{c}}$ (3),

$$\overline{R}(\mathscr{A}_{\textsc{p}})=\overline{R}(\mathscr{A}_{\textsc{c}})-\mathbb{E}_{\hat{h}_{\textsc{p}},X_{1}}\big[\mathrm{Var}_{D}\big(\hat{h}_{\textsc{c}}(X_{1})\mid\hat{h}_{\textsc{p}}\big)\big]. \tag{4}$$
Proof sketch.

First, we show that the predictions made by generalized LuPTS are invariant to a linear transform $B$ applied to $\hat{\Phi}(\cdot)$ during training; see Appendix A. Then we consider two cases: (i) when $\hat{d}\leq m$, we may re-purpose the proof of Theorem 1 in Karlsson et al. (2022); (ii) when $m<\hat{d}$, Proposition 1 below directly implies $\mathbb{E}_{\hat{h}_{\textsc{p}},X_{1}}[\mathrm{Var}_{D}(\hat{h}_{\textsc{c}}(X_{1})\mid\hat{h}_{\textsc{p}})]=0$ and hence $\overline{R}(\mathscr{A}_{\textsc{p}})=\overline{R}(\mathscr{A}_{\textsc{c}})$. ∎

Theorem 1 implies that the privileged learner is at least as sample efficient as the classical one: since $\mathrm{Var}(\cdot)\geq 0$, we have $\overline{R}(\mathscr{A}_{\textsc{p}})\leq\overline{R}(\mathscr{A}_{\textsc{c}})$ for the same number of training samples $m$. The result is a direct generalization of the main result in Karlsson et al. (2022). We also observe that generalized LuPTS and the classical learner coincide under certain conditions when $m\leq\hat{d}$.

Proposition 1.

Let $\hat{\Phi}:\mathcal{X}\rightarrow\hat{\mathcal{Z}}\subseteq\mathbb{R}^{\hat{d}}$ be any map with corresponding kernel $\kappa$. Let $\hat{\mathbf{K}}_{t}$ be the Gram matrix of $\kappa$ applied to $\mathbf{X}_{t}$, and let $\hat{h}_{\textsc{c}}$ and $\hat{h}_{\textsc{p}}$ be the classical (3) and privileged (Algorithm 1) estimates. Then,

$$\hat{\mathbf{K}}_{t} \mbox{ is invertible for all } t \implies \hat{h}_{\textsc{p}}=\hat{h}_{\textsc{c}}.$$

$\hat{\mathbf{K}}_{t}$ is noninvertible whenever $m>\hat{d}$, assuming linearly independent features.

Proof sketch.

When the Gram matrices $\hat{\mathbf{K}}_{t}$ are invertible, the pseudoinverse coincides with the inverse, and the factors $\hat{\mathbf{K}}_{t}\hat{\mathbf{K}}_{t}^{\dagger}$ in the LuPTS estimator cancel, making LuPTS and the classical learner equal. ∎
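The cancellation is easy to verify numerically; a small sketch with a Gaussian-RBF Gram matrix, which is invertible for distinct inputs:

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                                  # m = 5 distinct samples
K = np.exp(-np.sum((X[:, None] - X[None]) ** 2, axis=-1))    # Gaussian-RBF Gram matrix
print(np.allclose(K @ np.linalg.pinv(K), np.eye(5)))         # True: K K^+ = I, so these
                                                             # factors drop out of Algorithm 1
```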

Remarks.

Theorem 1 extends the applicability of LuPTS to (a) nonlinear prediction through a fixed feature map, (b) kernel estimation, and (c) the underdetermined case of $m<\hat{d}$. While previous work is restricted to observed linear systems, our result covers latent linear dynamics which need only be identified up to a linear transform (see Figure 1). Proposition 1 does not claim that no preferable privileged learner exists; it is a statement only about generalized LuPTS. We may relate the result to the double-descent behaviour previously observed for other linear estimators with fixed $m$ and varying $\hat{d}$ (Loog et al., 2020): after a phase transition around $\hat{d}=m$, the variance of LuPTS decreases a second time as it becomes equivalent to the classical learner (see Figure 2).

3.2 Random feature maps for unknown representations

When the true $\Phi$ is entirely unknown, as is often the case in practice, using a poor representation $\hat{\Phi}$ may yield biased results for both classical and privileged learners. A common solution in nonlinear prediction is to use a universal kernel, such as the Gaussian-RBF kernel. These have dense reproducing-kernel Hilbert spaces which allow approximation of any continuous function. However, universal kernels also have positive-definite, and thus invertible, Gram matrices (Hofmann et al., 2008), which according to Proposition 1 eliminates any gain in sample efficiency of generalized LuPTS.

Instead, we combine our algorithm with an approximation of universal kernels: random feature maps. These methods project inputs $x$ onto $\hat{d}$ features by a random linear map $W\in\mathbb{R}^{k\times\hat{d}}$ followed by a nonlinear element-wise activation function. By choosing $\hat{d}<m$, we can benefit from the function-approximation properties of universal kernels (see the discussion below) and the variance reduction of generalized LuPTS. Popular random features include random Fourier features (RFF) (Rahimi and Recht, 2007) and random ReLU features (RRF) (Sun et al., 2018),¹

¹ For RFF, $[W_{\mathcal{N}}]_{ij}\sim\mathcal{N}(0,1)$ and $b_{i}\sim\mathcal{U}(0,2\pi)$; for RRF, $[W_{\mathcal{U}}]_{ij}\sim\mathcal{U}(-1,1)$.

$$\hat{\Phi}_{\textsc{rff}}^{\gamma}(x)=\sqrt{2/\hat{d}}\,\big[\cos(\sqrt{2\gamma}\,W_{\mathcal{N}}^{\top}x+b)\big] \quad\mbox{ and }\quad \hat{\Phi}_{\textsc{rrf}}^{\gamma}(x)=f_{+}(\gamma\,W_{\mathcal{U}}^{\top}[x;1]),$$

where $f_{+}(z)=\max(0,z)$ is the rectifier (ReLU) function and $\gamma>0$ is a bandwidth hyperparameter.
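Both feature maps are a few lines of NumPy; the following sketch matches the definitions above (bandwidths and seeds are placeholders):

```python
def make_rff(k, d_hat, gamma, seed=0):
    """Random Fourier features approximating a Gaussian-RBF kernel."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(k, d_hat))                  # [W_N]_ij ~ N(0, 1)
    b = rng.uniform(0.0, 2 * np.pi, size=d_hat)      # random phase offsets
    return lambda X: np.sqrt(2.0 / d_hat) * np.cos(np.sqrt(2 * gamma) * X @ W + b)

def make_rrf(k, d_hat, gamma, seed=0):
    """Random ReLU features on the augmented input [x; 1]."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(k + 1, d_hat))  # [W_U]_ij ~ U(-1, 1)
    return lambda X: np.maximum(0.0, gamma * (np.hstack([X, np.ones((len(X), 1))]) @ W))
```

Either map can be passed as phi_hat to the generalized_lupts sketch above, with $\hat{d}<m$ to retain the variance reduction.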

For large enough numbers of random features and training samples, any continuous function $h$ can be approximated to arbitrary precision by a linear map $\omega$ applied to the random features, e.g., $\hat{h}(x)=\omega^{\top}\hat{\Phi}_{\textsc{rrf}}^{\gamma}(x)$; see Sun et al. (2019); Rudi and Rosasco (2017). Applying the same argument to the step-wise estimators of LuPTS, we can justify using random features in Algorithm 1 by the following observation: under appropriate assumptions, we can construct a privileged learner using random feature maps which is a universally consistent estimator of $h(x_{1})=\mathbb{E}[Y\mid x_{1}]$. The precise construction deviates somewhat from Algorithm 1 but follows the same structure; we refer to Appendix B for a precise statement. As universal consistency describes asymptotic behaviour in the limit of infinite samples and random features, it offers only limited insight into the benefits of privileged information in small-sample settings, where performance is governed by a bias-variance trade-off.

Variance reduction & bias amplification.

Generalized LuPTS is guaranteed lower variance than the classical estimator only under Assumption 1, although our empirical results (Section 4) suggest that this holds more widely. When $\hat{\Phi}$ is a poor approximation of $\Phi$, generalized LuPTS may amplify bias, increasingly so with the number of privileged time points, compared to classical learning. We show this theoretically in Appendix C and empirically in Appendix F. Whether generalized LuPTS remains preferable to classical learning in terms of prediction risk appears to depend on the amount of bias that is amplified. Our experiments suggest that the variance reduction mostly dominates when using random features, whereas this is not always the case for linear LuPTS. The phenomenon of bias amplification is familiar from, e.g., model-based and model-free reinforcement learning (Kober et al., 2013). As the bias of random features may still be high in small-sample settings, we next present privileged representation learning algorithms to trade off bias and variance more effectively.

3.3 Privileged time-series representation learning

Up until now, the representation $\hat{\Phi}$ was considered fixed, either because $\Phi$ was known up to a linear transform (explicitly or implicitly) or because random feature methods were used. Generalized LuPTS (Algorithm 1) produces minimizers $\{\hat{A}_{t}\},\hat{\beta}$ of the following objective for fixed $\hat{\Phi}$,

$$\mathcal{L}_{\textsc{srl}}(\hat{\Phi},\{\hat{A}_{t}\},\hat{\beta}):=\frac{1}{NT}\sum_{i=1}^{N}\bigg[\sum_{t=1}^{T-1}\frac{1}{\hat{d}}\big\|\hat{A}_{t}^{\top}\hat{\Phi}(x_{i,t})-\hat{\Phi}(x_{i,t+1})\big\|_{2}^{2}+\frac{1}{q}\big\|\hat{\beta}^{\top}\hat{\Phi}(x_{i,T})-y_{i}\big\|_{2}^{2}\bigg]. \tag{5}$$

Objective (5) and the systems described by Assumption 1 lend themselves to methods which also learn the representation $\Phi$ in addition to the latent dynamics $\{A_{t}\},\beta$. Next, we present three algorithms which combine the ideas of generalized LuPTS with the expressiveness of deep representation learning. All learners use equivalent encoders to represent $\hat{\Phi}(\cdot)$ and linear layers to model the relations between the latent variables $\{Z_{t}\}$ and the outcome $Y$. The classical learner predicts the outcome linearly from $\hat{Z}_{1}$. All architectures under consideration are visualized jointly in Figure 3(a).

Figure 3: Representation learning architectures (left) and samples from Clocks-LGS (right). (a) Classical (left) and privileged (right) representation learners; $\hat{\Phi}$ is shared across all time steps. GRL models the direct maps $\hat{\theta}_{t}$ to the outcome; SRL models the single steps $\hat{A}_{t}$ and $\hat{\beta}$; CRL combines the two. (b) Five example sequences from the Clocks-LGS image data.

SRL.

The first privileged representation learner directly optimizes objective (5), just like generalized LuPTS, but now also fits the representation $\hat{\Phi}$, parameterized by a neural network. We refer to this model as the stepwise representation learner (SRL). As we will see in experiments, a drawback of this approach is that representations may favor predicting transitions $\hat{z}_{i,t}\rightarrow\hat{z}_{i,t+1}$ with small error while losing information relevant for the target outcome in the process. At test time, for a new input $x_{1}$, SRL composes the stepwise dynamics to output $\hat{h}_{\textsc{p}}(x_{1})=\hat{\beta}^{\top}\hat{A}_{T-1}^{\top}\cdots\hat{A}_{1}^{\top}\hat{\Phi}(x_{1})$.
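A minimal PyTorch sketch of SRL, with the encoder, layer sizes and training loop left open; module names and defaults are ours, not the reference implementation:

```python
import torch.nn as nn

class SRL(nn.Module):
    def __init__(self, encoder, d_hat, T, q=1):
        super().__init__()
        self.encoder = encoder                                # phi_hat, e.g., an MLP or LeNet-5
        self.A = nn.ModuleList([nn.Linear(d_hat, d_hat, bias=False) for _ in range(T - 1)])
        self.beta = nn.Linear(d_hat, q, bias=False)

    def loss(self, X_seq, y):                                 # objective (5), up to constants
        Z = [self.encoder(X) for X in X_seq]                  # embed all time steps
        step = sum(((self.A[t](Z[t]) - Z[t + 1]) ** 2).mean() for t in range(len(Z) - 1))
        outcome = ((self.beta(Z[-1]) - y) ** 2).mean()        # outcome supervised only at t = T
        return step + outcome

    def forward(self, x1):                                    # roll the latent dynamics forward
        z = self.encoder(x1)
        for A_t in self.A:
            z = A_t(z)
        return self.beta(z)
```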

CRL and GRL.

To make sure that the learned representation $\hat{\Phi}$ retains information about $Y$, we add linear outcome supervision on the representation $\hat{\Phi}(X_{t})$ at each time step $t$. Recall that, by Assumption 1, the expected outcome is linear in the latent state at any time step. We introduce a hyperparameter $\lambda\in[0,1]$ to trade off the two types of losses and arrive at the combined representation learner (CRL). With $\alpha=(\hat{\Phi},\{\hat{A}_{t}\},\{\hat{\theta}_{t}\})$ the entire parameter vector, CRL minimizes the objective

$$\mathcal{L}_{\textsc{crl}}(\alpha):=\frac{\lambda}{NTq}\sum_{i,t}\big\|\hat{\theta}_{t}^{\top}\hat{\Phi}(x_{i,t})-y_{i}\big\|_{2}^{2}+\frac{1-\lambda}{N(T-1)\hat{d}}\sum_{i,t}\big\|\hat{A}_{t}^{\top}\hat{\Phi}(x_{i,t})-\hat{\Phi}(x_{i,t+1})\big\|_{2}^{2}. \tag{6}$$

We make test-time predictions using $\hat{h}_{\textsc{p}}(x_{1})=\hat{\theta}_{1}^{\top}\hat{\Phi}(x_{1})$. In experiments, we highlight the case $\lambda=1$, where only outcome supervision is used, as the greedy representation learner (GRL); for a precise definition of the GRL objective, see Appendix D. GRL is related to multi-view learning, in which predictions of the same quantity are made from multiple "views" (cf. time points) (Zhao et al., 2017).
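CRL extends the SRL sketch above with per-step outcome heads and the trade-off $\lambda$; setting lam=1.0 recovers GRL (again an illustrative sketch, not the reference implementation):

```python
class CRL(nn.Module):
    def __init__(self, encoder, d_hat, T, q=1, lam=0.5):
        super().__init__()
        self.encoder, self.lam = encoder, lam
        self.A = nn.ModuleList([nn.Linear(d_hat, d_hat, bias=False) for _ in range(T - 1)])
        self.theta = nn.ModuleList([nn.Linear(d_hat, q, bias=False) for _ in range(T)])

    def loss(self, X_seq, y):                          # objective (6), up to constants
        Z = [self.encoder(X) for X in X_seq]
        sup = sum(((th(z) - y) ** 2).mean() for th, z in zip(self.theta, Z))
        step = sum(((self.A[t](Z[t]) - Z[t + 1]) ** 2).mean() for t in range(len(Z) - 1))
        return self.lam * sup + (1 - self.lam) * step

    def forward(self, x1):                             # test time: h_p(x1) = theta_1^T phi_hat(x1)
        return self.theta[0](self.encoder(x1))
```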

4 Experiments

We compare classical learning to variants of generalized LuPTS (Algorithm 1) and to the privileged representation learners of Section 3.3 on two synthetic and two real-world time-series data sets. We (i) verify our theoretical findings by analyzing the sample efficiency and bias-variance characteristics of the given algorithms; (ii) demonstrate that generalized LuPTS with random features succeeds in settings where linear LuPTS suffers from large bias; (iii) show that privileged representation learners offer even greater sample efficiency in practice; and (iv) study how well these algorithms recover the true latent variables $\{Z_{t}\}$ and how this relates to predictive accuracy.

Experimental setup.

We report the mean coefficient of determination ($R^{2}$), proportional to the squared-error risk $\overline{R}$, for varying sample sizes, sequence lengths and prediction horizons. Experiments are repeated and averaged over different random seeds. In each repetition, a given model performs hyperparameter tuning on the training data using random search and five-fold cross-validation before being re-trained on all training data. The test set comprises 1000 samples for synthetic data and 20% of all available data for the real-world data sets. We consider six privileged learners in two groups. The first group comprises generalized LuPTS with the linear kernel (LuPTS) and the two random feature maps of Section 3.2: random Fourier features (Fourier RF) and random ReLU features (ReLU RF). The classical learners for this group are OLS estimators using the same kernel or feature map. The second group consists of the representation learners SRL, CRL and GRL. For tabular data, their encoder is a multi-layer perceptron with three hidden layers of 25 neurons each; for image data, they use LeNet-5 (LeCun et al., 1989). The classical learner (Classic Rep.) uses the same encoder with a linear output layer. The results presented were found to be robust to small changes in training parameters such as the learning rate. For details on the training process, we refer to Appendix E. All experiments required less than 3000 GPU-hours to complete using NVIDIA Tesla T4 GPUs.²

² Code to reproduce all results is available at https://github.com/Healthy-AI/glupts.

Data sets.

We briefly describe the evaluation data sets and refer to Appendix E for further details. First, we create two synthetic data sets in which latent states and outcomes are generated from linear-Gaussian systems as in Assumption 1. To produce the observations $\{X_{t}\}$, we use a deterministic nonlinear function $\Psi:\mathcal{Z}\rightarrow\mathcal{X}$. In the first synthetic data set, which we call Square-Sign, the nonlinear transformation $\Psi:\mathbb{R}^{d}\rightarrow\mathbb{R}^{2d}$ maps each latent feature $Z_{(t,k)}$ to a two-dimensional vector such that

$$X_{t}\coloneqq\Psi(Z_{(t)})=[Z_{(t,1)}^{2},\ \mathrm{sgn}(Z_{(t,1)}),\ \dots,\ Z_{(t,d)}^{2},\ \mathrm{sgn}(Z_{(t,d)})]^{\top}.$$

The second synthetic data set uses the same latent linear system with $Z_{t}\in\mathbb{R}^{2}$ and produces square images (28×28 pixels) as observations. As the images are reminiscent of clocks, we refer to this data set as Clocks-LGS. Example sequences are presented in Figure 3(b). The angle, size and fill of the two clock hands encode the value of the corresponding latent variable. The outcome $Y$ is a linear function of $Z_{T}$ with $q=1$. For Square-Sign, $q=3$.
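The Square-Sign observation map is straightforward to reproduce; a sketch compatible with the sample_latent_lgs helper of Section 2.1:

```python
def psi_square_sign(Z):
    """Map each latent coordinate z to the pair (z^2, sgn(z))."""
    out = np.empty((Z.shape[0], 2 * Z.shape[1]))
    out[:, 0::2] = Z ** 2          # the square discards the sign...
    out[:, 1::2] = np.sign(Z)      # ...which the sign feature restores, keeping Psi injective
    return out
```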

The Metro Interstate traffic volume data set (Traffic) (Hogue, 2012) contains hourly records of the traffic volume on Interstate 94 between Minneapolis and St. Paul, MN. In addition, the data contains weather features and a holiday indicator. We predict the traffic volume for a fixed time horizon given the present observations. Privileged information is observed every four hours.

We also predict the progression of Alzheimer's disease (AD) as measured by the outcome of the Mini-Mental State Examination (MMSE) (Galea and Woodward, 2005). The anonymized data were obtained through the Alzheimer's Disease Neuroimaging Initiative (ADNI) (ADNI, 2004). The outcome of interest is the MMSE score 48 months after the first examination; privileged information comprises the measurements at 12, 24 and 36 months. In addition, we tested our algorithms on the $\mathbf{PM}_{2.5}$ data set (Liang et al., 2016), where we predict the air quality in five Chinese cities (see Appendix F).

4.1 Sample efficiency, bias and variance

Figure 4: Predictive accuracy of generalized LuPTS on three data sets, over 60 repetitions: (a) Traffic, $T=3$; (b) Square-Sign, $T=5$, $d=10$; (c) ADNI, $T=4$. The shaded area represents one standard deviation above and below the mean over repetitions.
Figure 5: Predictive accuracy of the representation learners over 25 repetitions: (a) Traffic, $T=3$; (b) Square-Sign, $T=5$, $d=10$; (c) Clocks-LGS, $T=6$, $q=1$. Generalized distillation is used as proposed by Hayashi et al. (2019). For details, see Appendix E.

The main goal of our work is to improve learning efficiency by incorporating privileged time-series information. Across almost all prediction tasks and sample sizes, nonlinear variants of generalized LuPTS outperform their classical counterparts in sample efficiency, as shown in Figure 4. On the Traffic prediction task, LuPTS ReLU RF outperforms linear LuPTS, as the former appears to exhibit less bias. On the synthetic Square-Sign data, linear models reach their best accuracy quickly but are limited by their lack of expressiveness; generalized LuPTS amplifies this bias, making linear LuPTS worse than OLS. Random feature methods attain higher accuracy here but are generally less sample efficient; combining generalized LuPTS with random features narrows this gap significantly. On ADNI, we see no benefit of using nonlinear models in general. We make similar observations on the $\mathbf{PM}_{2.5}$ air quality data set; see Appendix F.

The representation learners proposed in Section 3.3 are evaluated on the same data sets and on the image prediction task Clocks-LGS; example results are displayed in Figure 5. These demonstrate that directly transferring the LuPTS objective to neural networks, in the form of SRL, results in subpar performance: SRL does not seem to have a strong enough incentive to learn representations which accurately predict the outcome. Generalized distillation for privileged time series, as suggested by Hayashi et al. (2019), does not improve upon classical learning on our tasks. In all experiments displayed in Figure 5, CRL and GRL outperform the classical learner. The predictive accuracy of these two models is similar on most tasks, which may be explained by the fact that CRL reduces to GRL when choosing $\lambda=1$ in objective (6). Notably, the general observation that GRL and CRL are more sample efficient than the classical model depends neither on the neural architecture used for the encoder nor on the modality of the data, as the image prediction task Clocks-LGS (Figure 5(c)) demonstrates. For additional empirical results, we refer to Appendix F.

Figure 6: Estimated squared bias and variance of models trained on Square-Sign ($T=5$, $d=10$, $q=3$) over random training sets (60 for the left group, 25 for the right), evaluated on 1000 test points.

To analyze the bias and variance characteristics of our algorithms, we estimate the expected squared prediction bias, $\mathbb{E}_{X_{1}}[(\mathbb{E}_{D}[\mathscr{A}(D)](X_{1})-\mathbb{E}[Y\mid X_{1}])^{2}]$, by computing $\mathbb{E}[Y\mid X_{1}]$ in the synthetic DGPs, in addition to the variance of the different estimators. Figure 6 depicts bias and variance for all models on the Square-Sign data. In the left panel, all variants of generalized LuPTS exhibit lower variance than classical learning, despite being biased. This holds generally: across all experiments, we never encountered an example where the use of privileged information did not result in lower variance compared to classical learning. On the other hand, the privileged learners suffer higher bias than the comparable classical learners because bias compounds over the individual prediction steps, as shown in Appendix C. For the random feature variants, however, the bias decreases with the number of samples. The representation learners display similar characteristics in the right panel of Figure 6. Learning the transitions between latent variables $Z_{t}$ appears to be associated with low variance and high bias, as demonstrated by the results of SRL. GRL, which does not model these transitions, exhibits the lowest bias and the largest variance of all privileged learners. As CRL is able to trade off between these two objectives, the estimates of its variance and bias lie between the corresponding values of the other two privileged learners.
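The decomposition above can be estimated by refitting on independent draws from the DGP; a sketch of this procedure (helper names follow the earlier sketches, and the number of repetitions is a placeholder):

```python
def bias_variance(fit, sample_train, X1_test, bayes_test, reps=60):
    """Estimate expected squared bias and variance of a learner.

    fit:          maps a training set (X_seq, y) to a prediction function.
    sample_train: draws a fresh training set from the synthetic DGP.
    bayes_test:   E[Y | X_1] on X1_test, computable for synthetic data.
    """
    preds = np.stack([fit(*sample_train())(X1_test) for _ in range(reps)])
    sq_bias = np.mean((preds.mean(axis=0) - bayes_test) ** 2)  # E_X1[(E_D[h](X1) - E[Y|X1])^2]
    variance = np.mean(preds.var(axis=0))                      # E_X1[Var_D(h(X1))]
    return sq_bias, variance
```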

4.2 Latent variable recovery

Figure 7: Analyzing the representations learned by the models of Section 3.3 using SVCCA on the Clocks-LGS image regression data set (same setup as in Figure 5(c)). (a) Mean $R^{2}$ and SVCCA coefficient over 25 training runs; markers represent average results over repeated experiments, with marker size indicating the sample size $m$. (b) Visualization of the representations learned by Classic Rep. and CRL after applying SVCCA, with the recovery target shown in the top left corner; both estimators are best in class and trained on 2500 samples.

Under Assumption 1, it is sufficient to identify the representation function $\Phi$ up to a linear transform $B$ to have provable gains from privileged information over a classical learner. In synthetic data, we can assess to what extent a representation $\hat{\Phi}$ with this property has been found. As a proxy for the existence of such a transform, we compute a measure of correlation between $\hat{\Phi}(X)$ and $\Phi(X)$, such as Canonical Correlation Analysis (CCA) (Hotelling, 1936). Raghu et al. (2017) introduced a modified version, Singular Vector Canonical Correlation Analysis (SVCCA), to compute correlations in the presence of noisy dimensions in neural network representations.

The mean SVCCA coefficients $\bar{\rho}$ and the predictive accuracy of the representation learners are visualized in Figure 7(a). GRL and CRL produce higher correlation coefficients than the classical learner while also predicting the outcome on the Clocks-LGS task more accurately. Further comparing the representations learned by privileged and classical models, Figure 7(b) shows a visual example of the improved latent recovery of CRL on the same task. In conclusion, the use of privileged time-series information not only increases the sample efficiency of existing algorithms but can also aid the recovery of latent variables in latent dynamical systems.
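As a simplified stand-in for SVCCA, a mean CCA coefficient between learned and true latent states can be computed with scikit-learn; SVCCA additionally prunes noisy directions with an SVD step, which this sketch omits:

```python
from sklearn.cross_decomposition import CCA

def mean_cca_coefficient(Z_hat, Z_true, n_components=2):
    """Mean canonical correlation between learned and true representations."""
    cca = CCA(n_components=n_components).fit(Z_hat, Z_true)
    U, V = cca.transform(Z_hat, Z_true)
    return np.mean([np.corrcoef(U[:, j], V[:, j])[0, 1] for j in range(n_components)])
```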

5 Related work

Existing analyses of learning with privileged information guarantee improvements in asymptotic sample efficiency under strong assumptions (Vapnik and Vashist, 2009) but are insufficient to establish a clear preference for LuPI learners at a fixed sample size. For example, Pechyony and Vapnik (2010) showed that, for a specialized problem construction, empirical risk minimization (ERM) using privileged information can achieve fast learning rates, $\mathcal{O}(1/n)$, while classical (non-privileged) ERM can only achieve slow rates, $\mathcal{O}(1/\sqrt{n})$. We are not aware of any generalization theory tight enough to establish a lower bound on the risk of a classical learner that exceeds an upper bound for a PI learner. Our problem is also related to multi-task (representation) learning (Maurer et al., 2016), see especially (5)-(6); however, our goal is different in that only a single task is of interest after learning.

In the estimation of causal effects, learning from surrogate outcomes (Prentice, 1989) has been proposed as a way to increase sample efficiency. Surrogates are variables related to the outcome which may be available even when the outcome is not (Athey et al., 2019); we can view these as privileged information. While the problem shares structure with ours, its goal is to compensate for missing outcomes, and analytical results give no guarantees of improved efficiency when both surrogates and outcomes are always observed (Chen et al., 2008; Kallus and Mao, 2020). Guo and Perković (2022) showed that, in the context of a linear-Gaussian system on a directed acyclic graph, a recursive least-squares estimator is the asymptotically most efficient estimator of causal effects using only the sample covariance. In the linear case on a path graph, their estimator coincides with ours; however, no analysis is provided for the nonlinear case or for fixed sample sizes.

6 Discussion

We have presented learning algorithms for predicting nonlinear outcomes by utilizing time-series privileged information during training. We prove that our estimator is preferable to classical learning when data is generated from a latent-variable dynamical system with partially known components. The proof holds for the case when the latent dynamical system is recovered by the representation function up to a linear transform, assuming that a left inverse of the true observation generating function exists. However, this assumption does not appear to be necessary, as our empirical results demonstrate that privileged learning is preferable to classical learning even when these conditions cannot be guaranteed. Consequently, a more general theoretical result where less is known about the latent system seems attainable. For example, one might consider a case in which only a few independent components of the latent variables are recovered by the representation used by the learning algorithm, while other components are treated as noise.

When the latent dynamical system is entirely unknown, we create practical estimators using random feature embeddings which outperform the corresponding classical learner across experiments. We show that a universally consistent learner can be constructed based on this idea, albeit of a slightly different form. As a further alternative, we propose representation learning methods of related form using neural networks and demonstrate that these estimators, too, empirically outperform classical learning. In experiments, we analyze how the gap in risk between privileged and classical learning changes for different prediction horizons, as displayed in Figure 16 in Appendix F. The results suggest that the risk advantage of privileged learners grows with the sequence length of the prediction task, despite the fact that the task becomes more difficult at the same time.

Our work focuses on the setting where data is observed as a time series. This setting is chosen for its simple causal structure given by (latent) linear-Gaussian systems and because it can be motivated by many different applications. However, the ideas presented are not specifically tied to time and also apply in the case when all variables are observed simultaneously, as long as the causal structure remains sequential. Moreover, we believe that the theory presented here is not limited to sequential settings and generalizes to other causal structures, in particular directed acyclic graphs that connect the baseline covariates to the outcome. In either case, one might only have access to the baseline variable at test time; for example, the timely collection of data for all covariates at test time might be very expensive or even impossible.

As pointed out before, Theorem 1 requires the recovery of the latent variables up to a linear transformation. Nonlinear independent component analysis (ICA) (Hyvarinen and Morioka, 2016) aims to solve precisely this problem and has been applied to time series via time-contrastive learning. This makes for an interesting connection between learning using privileged information and nonlinear ICA, as the experiments of Section 4.2 suggest that privileged time-series information aids the recovery of latent variables. Other remaining challenges include providing risk guarantees for learning with biased representations (including deep neural networks), with regularized estimators, and for more general data generating processes with weaker structural assumptions. We are hopeful that the utility demonstrated in this work will inspire future research to overcome these limitations.

Acknowledgments and Disclosure of Funding

We would like to thank Anton Matsson and Rickard Karlsson for insightful feedback and the Alzheimer’s Neuroimaging Initiative (ADNI) for collecting and providing the data on Alzheimer’s disease used in this project. The present work was funded in part by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation.

References

  • ADNI (2004) ADNI. the Alzheimer’s Disease Neuroimaging Initiative (ADNI), 2004. URL http://adni.loni.usc.edu.
  • Athey et al. (2019) Susan Athey, Raj Chetty, Guido W Imbens, and Hyunseung Kang. The surrogate index: Combining short-term proxies to estimate long-term treatment effects more rapidly and precisely. Technical report, National Bureau of Economic Research, 2019.
  • Chen et al. (2008) Song Xi Chen, Denis HY Leung, and Jing Qin. Improving semiparametric estimation by using surrogate data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):803–823, 2008.
  • Dua and Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Galea and Woodward (2005) Mary Galea and Michael Woodward. Mini-mental state examination (mmse). Australian Journal of Physiotherapy, 51(3):198, 2005. ISSN 0004-9514. doi: https://doi.org/10.1016/S0004-9514(05)70034-9.
  • Guo and Perković (2022) F Richard Guo and Emilija Perković. Efficient least squares for estimating total effects under linearity and causal sufficiency. Journal of Machine Learning Research, 23(104):1–41, 2022.
  • Hayashi et al. (2019) Shogo Hayashi, Akira Tanimoto, and Hisashi Kashima. Long-term prediction of small time-series data using generalized distillation. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2019.
  • Hofmann et al. (2008) Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, 36(3), jun 2008.
  • Hogue (2012) John Hogue. Metro interstate traffic volume data set, 2012. URL https://archive.ics.uci.edu/ml/datasets/Metro+Interstate+Traffic+Volume.
  • Hotelling (1936) Harold Hotelling. Relations between two sets of variates. Biometrika, 28(3-4):321–377, 12 1936. ISSN 0006-3444. doi: 10.1093/biomet/28.3-4.321.
  • Huang et al. (2012) Bingquan Huang, Mohand Tahar Kechadi, and Brian Buckley. Customer churn prediction in telecommunications. Expert Systems with Applications, 39(1):1414–1425, 2012.
  • Hyvarinen and Morioka (2016) Aapo Hyvarinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ica. Advances in Neural Information Processing Systems, 29, 2016.
  • Jonschkowski et al. (2015) Rico Jonschkowski, Sebastian Höfer, and Oliver Brock. Patterns for learning with side information. arXiv preprint arXiv:1511.06429, 2015.
  • Kallus and Mao (2020) Nathan Kallus and Xiaojie Mao. On the role of surrogates in the efficient estimation of treatment effects with limited outcome data. arXiv preprint arXiv:2003.12408, 2020.
  • Kao et al. (2015) Jonathan C. Kao, Paul Nuyujukian, Stephen I. Ryu, Mark M. Churchland, John P. Cunningham, and Krishna V. Shenoy. Single-trial dynamics of motor cortex and their applications to brain-machine interfaces. Nature Communications, 6(1), July 2015. doi: 10.1038/ncomms8759.
  • Karhade et al. (2019) Aditya V Karhade, Quirina CBS Thio, Paul T Ogink, Christopher M Bono, Marco L Ferrone, Kevin S Oh, Philip J Saylor, Andrew J Schoenfeld, John H Shin, Mitchel B Harris, et al. Predicting 90-day and 1-year mortality in spinal metastatic disease: development and internal validation. Neurosurgery, 85(4):E671–E681, 2019.
  • Karlsson et al. (2022) Rickard Karlsson, Martin Willbo, Zeshan Hussain, Rahul G Krishnan, David Sontag, and Fredrik D Johansson. Using time-series privileged information for provably efficient learning of prediction models. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics., 2022.
  • Kingma and Ba (2017) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
  • Kober et al. (2013) Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. Int. J. Rob. Res., 32(11):1238–1274, sep 2013. ISSN 0278-3649. doi: 10.1177/0278364913495721.
  • LeCun et al. (1989) Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541–551, 12 1989. ISSN 0899-7667. doi: 10.1162/neco.1989.1.4.541.
  • Lee and Carlberg (2019) Kookjin Lee and Kevin Carlberg. Deep conservation: A latent-dynamics model for exact satisfaction of physical conservation laws, 2019.
  • Liang et al. (2016) Xuan Liang, Shuo Li, Shuyi Zhang, Hui Huang, and Song Xi Chen. Pm2.5 data reliability, consistency, and air quality assessment in five chinese cities. Journal of Geophysical Research: Atmospheres, 121(17):10220–10236, September 2016. ISSN 2169-897X. doi: 10.1002/2016JD024877.
  • Loog et al. (2020) Marco Loog, Tom Viering, Alexander Mey, Jesse H. Krijthe, and David M. J. Tax. A brief prehistory of double descent. Proceedings of the National Academy of Sciences, 117(20):10625–10626, 2020. doi: 10.1073/pnas.2001875117.
  • Lopez-Paz et al. (2016) D. Lopez-Paz, B. Schölkopf, L. Bottou, and V. Vapnik. Unifying distillation and privileged information. In International Conference on Learning Representations (ICLR), November 2016.
  • Maurer et al. (2016) Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes. The benefit of multitask representation learning. Journal of Machine Learning Research, 17(81):1–32, 2016.
  • Mortazavi et al. (2016) Bobak J Mortazavi, Nicholas S Downing, Emily M Bucholz, Kumar Dharmarajan, Ajay Manhapra, Shu-Xia Li, Sahand N Negahban, and Harlan M Krumholz. Analysis of machine learning techniques for heart failure readmissions. Circulation: Cardiovascular Quality and Outcomes, 9(6):629–640, 2016.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • Pechyony and Vapnik (2010) Dmitry Pechyony and Vladimir Vapnik. On the theory of learning with privileged information. Advances in neural information processing systems, 23, 2010.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Penrose (1956) Roger Penrose. On best approximate solutions of linear matrix equations. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 52, pages 17–19. Cambridge University Press, 1956.
  • Prentice (1989) Ross L Prentice. Surrogate endpoints in clinical trials: definition and operational criteria. Statistics in medicine, 8(4):431–440, 1989.
  • Raghu et al. (2017) Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability, 2017.
  • Rahimi and Recht (2007) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20. Curran Associates, Inc., 2007.
  • Rosenberg and Bartlett (2007) David S Rosenberg and Peter L Bartlett. The rademacher complexity of co-regularized kernel classes. In Artificial Intelligence and Statistics, pages 396–403. PMLR, 2007.
  • Rudi and Rosasco (2017) Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. Advances in neural information processing systems, 30, 2017.
  • Smola and Schölkopf (1998) Alex J Smola and Bernhard Schölkopf. Learning with kernels, volume 4. Citeseer, 1998.
  • Sun et al. (2019) Y Sun, A Gilbert, and A Tewari. On the approximation capabilities of relu neural networks and random relu features. arXiv preprint arXiv:1810.04374, 2019.
  • Sun et al. (2018) Yitong Sun, Anna Gilbert, and Ambuj Tewari. On the approximation properties of random relu features, 2018.
  • Vapnik and Vashist (2009) Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural networks, 22(5-6):544–557, 2009.
  • Vapnik et al. (2015) Vladimir Vapnik, Rauf Izmailov, et al. Learning using privileged information: similarity control and knowledge transfer. J. Mach. Learn. Res., 16(1):2023–2049, 2015.
  • Wang (2019) Weiran Wang. Everything old is new again: A multi-view learning approach to learning using privileged information and distillation. arXiv preprint arXiv:1903.03694, 2019.
  • You et al. (2017) Jiaxuan You, Xiaocheng Li, Melvin Low, David Lobell, and Stefano Ermon. Deep gaussian process for crop yield prediction based on remote sensing data. In Thirty-First AAAI conference on artificial intelligence, 2017.
  • Zhao et al. (2017) Jing Zhao, Xijiong Xie, Xin Xu, and Shiliang Sun. Multi-view learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017. ISSN 1566-2535. doi: https://doi.org/10.1016/j.inffus.2017.02.007.

Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes] The introduction refers to the section where each contribution is made.
   (b) Did you describe the limitations of your work? [Yes] See, e.g., the discussion in Section 6 and on the applicability of Theorem 1 in Sections 3.1 and 3.2.
   (c) Did you discuss any potential negative societal impacts of your work? [N/A] We have not identified any potential negative societal impact directly related to our work.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes] See Assumption 1 and Theorem 1.
   (b) Did you include complete proofs of all theoretical results? [Yes] See the Appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] A URL to a code repository is included in the paper.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 4 and Appendix E.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] Each result figure includes error regions.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 4.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] See Appendix E.2.
   (b) Did you mention the license of the assets? [Yes] See Appendix E.2.
   (c) Did you include any new assets either in the supplemental material or as a URL? [Yes] Only original code.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [N/A]
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] ADNI data is anonymized.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix

Appendix A Proof of Theorem 1

Our proof requires an additional technical assumption: that the matrix of true latent states 𝐙_t = [Φ(x_{1,t}), …, Φ(x_{m,t})]^⊤, for all t, for a random data set D, has linearly independent columns with probability 1. This implies that rank(𝐙_t) = d and m ≥ d. We consider this a minor restriction since it would only be violated if either a) two or more components of Z_t were perfectly correlated, in which case a smaller system with the same distribution over observations could always be constructed, or b) we observe fewer samples than necessary to determine the system (m < d). Note that this does not require that the dimension d̂ of the estimated representation Φ̂ is smaller than m. We begin by proving that both the classical and LuPTS estimators are invariant to a particular form of linear transformation of the representation Φ̂.

Lemma 1.

Assume we have a latent linear-Gaussian system as defined in Assumption 1 such that, for a data set D of m samples, the matrix of true latent states 𝐙_t = [Φ(X_{1,t}), …, Φ(X_{m,t})]^⊤ has linearly independent columns with probability 1. Let 𝒜_p^Φ be the LuPTS algorithm using the system’s true map Φ(·), which satisfies Φ(Ψ(z)) = z for all z ∈ 𝒵, and let 𝒜_p^Φ̂ be the same algorithm using a different map Φ̂(·) for which there exists a B such that Φ̂(x) = BΦ(x) for all x ∈ 𝒳. Analogously, we denote the classical learners 𝒜_c^Φ̂ and 𝒜_c^Φ. If B ∈ ℝ^{d̂×d} has linearly independent columns, then

\hat{h}_{\textsc{p}}^{\hat{\Phi}}(x) = \hat{h}_{\textsc{p}}^{\Phi}(x) \quad\text{and}\quad \hat{h}_{\textsc{c}}^{\hat{\Phi}}(x) = \hat{h}_{\textsc{c}}^{\Phi}(x).
Proof.

Let 𝐙_t ∈ ℝ^{m×d} have rows 𝐙_{t(i,:)} = Φ(𝐗_{t(i,:)}), where 𝐗_t ∈ ℝ^{m×k} is the design matrix belonging to data set D. In the same fashion, we define 𝐙̂_t ∈ ℝ^{m×d̂} using the map Φ̂ instead. By assumption, 𝐙_t and B ∈ ℝ^{d̂×d} have linearly independent columns, so that B^†B = I. These assumptions justify the matrix identities involving the Moore–Penrose inverse used below. We compute the prediction on a new test point x for the classical learner:

\begin{aligned}
\hat{h}_{\textsc{c}}^{\hat{\Phi}}(x) &= \big((\hat{\mathbf{Z}}_1^\top\hat{\mathbf{Z}}_1)^\dagger\hat{\mathbf{Z}}_1^\top\mathbf{Y}\big)^\top\hat{\Phi}(x) \\
&= \Big(\big((\mathbf{Z}_1 B^\top)^\top(\mathbf{Z}_1 B^\top)\big)^\dagger(\mathbf{Z}_1 B^\top)^\top\mathbf{Y}\Big)^\top B\Phi(x) \\
&= \Big((\mathbf{Z}_1 B^\top)^\dagger(B\mathbf{Z}_1^\top)^\dagger(\mathbf{Z}_1 B^\top)^\top\mathbf{Y}\Big)^\top B\Phi(x) \\
&= \Big((B^\top)^\dagger\mathbf{Z}_1^\dagger(\mathbf{Z}_1^\top)^\dagger B^\dagger(\mathbf{Z}_1 B^\top)^\top\mathbf{Y}\Big)^\top B\Phi(x) \\
&= \Big((B^\top)^\dagger(\mathbf{Z}_1^\top\mathbf{Z}_1)^\dagger\underbrace{B^\dagger B}_{=I}\mathbf{Z}_1^\top\mathbf{Y}\Big)^\top B\Phi(x) \\
&= \Big((\mathbf{Z}_1^\top\mathbf{Z}_1)^\dagger\mathbf{Z}_1^\top\mathbf{Y}\Big)^\top\underbrace{B^\dagger B}_{=I}\Phi(x) \\
&= \hat{h}_{\textsc{c}}^{\Phi}(x)
\end{aligned}

The arguments for the privileged learners are analogous:

\begin{aligned}
\hat{h}_{\textsc{p}}^{\hat{\Phi}}(x) &= \Big[(\hat{\mathbf{Z}}_1^\top\hat{\mathbf{Z}}_1)^\dagger\hat{\mathbf{Z}}_1^\top\hat{\mathbf{Z}}_2\,(\hat{\mathbf{Z}}_2^\top\hat{\mathbf{Z}}_2)^\dagger\hat{\mathbf{Z}}_2^\top\hat{\mathbf{Z}}_3\cdots(\hat{\mathbf{Z}}_T^\top\hat{\mathbf{Z}}_T)^\dagger\hat{\mathbf{Z}}_T^\top\mathbf{Y}\Big]^\top\hat{\Phi}(x) \\
&= \Big[(B^\top)^\dagger(\mathbf{Z}_1^\top\mathbf{Z}_1)^\dagger B^\dagger B\mathbf{Z}_1^\top\mathbf{Z}_2 B^\top\,(B^\top)^\dagger(\mathbf{Z}_2^\top\mathbf{Z}_2)^\dagger B^\dagger B\mathbf{Z}_2^\top\mathbf{Z}_3 B^\top \\
&\qquad\cdots\,(B^\top)^\dagger(\mathbf{Z}_T^\top\mathbf{Z}_T)^\dagger B^\dagger B\mathbf{Z}_T^\top\mathbf{Y}\Big]^\top B\Phi(x) \\
&= \Big[(\mathbf{Z}_1^\top\mathbf{Z}_1)^\dagger\mathbf{Z}_1^\top\mathbf{Z}_2\,(\mathbf{Z}_2^\top\mathbf{Z}_2)^\dagger\mathbf{Z}_2^\top\mathbf{Z}_3\cdots(\mathbf{Z}_T^\top\mathbf{Z}_T)^\dagger\mathbf{Z}_T^\top\mathbf{Y}\Big]^\top\underbrace{B^\dagger B}_{=I}\Phi(x) \\
&= \hat{h}_{\textsc{p}}^{\Phi}(x)
\end{aligned}
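Before turning to the proof of Theorem 1, the invariance stated in Lemma 1 is easy to check numerically. The following is a minimal sketch of such a check (our own, not part of the paper’s experiments); the dimensions and the random map B are arbitrary, and the privileged case follows analogously.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, d_hat = 50, 3, 5              # samples, true / estimated latent dimension

Z1 = rng.normal(size=(m, d))        # true latent states; rank d w.p. 1
Y = rng.normal(size=(m, 1))         # outcomes
B = rng.normal(size=(d_hat, d))     # linearly independent columns w.p. 1
Z1_hat = Z1 @ B.T                   # Z_hat = Z B^T, i.e. Phi_hat(x) = B Phi(x)

phi_x = rng.normal(size=(d, 1))     # Phi(x) for a new test point
phi_hat_x = B @ phi_x               # Phi_hat(x)

def predict(Z, Y, z):
    """Minimum-norm least-squares prediction ((Z^T Z)^† Z^T Y)^T z."""
    theta = np.linalg.pinv(Z.T @ Z) @ Z.T @ Y
    return (theta.T @ z).item()

# the classical learner makes identical predictions under either representation
print(np.isclose(predict(Z1, Y, phi_x), predict(Z1_hat, Y, phi_hat_x)))  # True
```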

Proof of Theorem 1.

We consider the generalized LuPTS estimator ĥ_p^Φ̂(·), treating the different cases for the number of samples m and latent state dimension d in turn. By the added technical assumption that the matrix of true latent states 𝐙_t = [Φ(x_{1,t}), …, Φ(x_{m,t})]^⊤ has rank d for all t with probability 1, and the assumption that Φ̂ is linearly close to Φ through a matrix B ∈ ℝ^{d̂×d} of rank d such that 𝐙̂_t = 𝐙_tB^⊤, we get that rank(𝐙̂_t) = d for all t. This also implies d ≤ d̂.

(i) m = d: In this case, the Gram matrix 𝐊_t = 𝐙̂_t𝐙̂_t^⊤ has full rank and is thus invertible for all t. By Proposition 1, LuPTS coincides with the classical learner. Hence, 𝔼_{ĥ_p,X_1}[Var_D(ĥ_c(X_1) | ĥ_p)] = 0 and R̄(𝒜_p) = R̄(𝒜_c).

(ii) m < d: In this case, no representation Φ̂ that is linearly close to Φ, in the sense defined above, can exist, since rank(𝐙̂_t) ≤ m < d would contradict rank(𝐙̂_t) = d. Independently, if the conditions of Proposition 1 hold, the same equivalence as in the case m = d applies.

(iii) m > d: In this case, the kernel Gram matrix 𝐊_t = 𝐙̂_t𝐙̂_t^⊤ has rank d < m and is never invertible.

Three sub-cases remain. a) When d̂ = d, the matrix B is square and invertible, the covariance matrix Σ̂_t = 𝐙̂_t^⊤𝐙̂_t is invertible for all t, and our estimator coincides with linear LuPTS (Karlsson et al., 2022) in the space implied by Φ̂. To see this, note that Lemma 1 implies that ĥ_p^Φ̂(·) makes the same predictions as a different generalized LuPTS estimator ĥ_p^Φ(·) using the true map Φ, when the two representation functions are related through B as defined in the theorem statement. Consequently, we may analyze the latter estimator instead. It uses the parameter

\hat{\theta}_{\textsc{p}} \coloneqq \Bigg[\prod_{t=1}^{T-1}\underbrace{(\mathbf{Z}_t^\top\mathbf{Z}_t)^\dagger\mathbf{Z}_t^\top\mathbf{Z}_{t+1}}_{\hat{A}_t}\Bigg]\underbrace{(\mathbf{Z}_T^\top\mathbf{Z}_T)^\dagger\mathbf{Z}_T^\top\mathbf{Y}}_{\hat{\beta}}\,.

We know by assumption that the covariance matrices Σ_t = 𝐙_t^⊤𝐙_t ∈ ℝ^{d×d} have full rank for all t. This implies that the Moore–Penrose pseudoinverse (·)^† may be replaced by the regular matrix inverse (·)^{−1} in the expression above, yielding

\hat{\theta}_{\textsc{p}} = \Bigg[\prod_{t=1}^{T-1}\underbrace{(\mathbf{Z}_t^\top\mathbf{Z}_t)^{-1}\mathbf{Z}_t^\top\mathbf{Z}_{t+1}}_{\hat{A}_t}\Bigg]\underbrace{(\mathbf{Z}_T^\top\mathbf{Z}_T)^{-1}\mathbf{Z}_T^\top\mathbf{Y}}_{\hat{\beta}}

which is equivalent to the LuPTS estimator of Karlsson et al. (2022) applied to a linear-Gaussian system in the space 𝒵 rather than in 𝒳. In this case, Theorem 1 of Karlsson et al. (2022) yields the desired result. b) If d̂ > d, Σ̂ is not invertible, but by Lemma 1 we can instead study a representation that is an appropriate linear transform B ∈ ℝ^{d̂×d} away from 𝐙̂, and apply the result of Karlsson et al. (2022) as described for the case d̂ = d; note that B is then non-square but has linearly independent columns, as required. c) If d̂ < d, the assumed matrix B cannot exist under the stated conditions (the assumptions of Theorem 1 are not satisfied).

Appendix B Universality of random features

A learning algorithm 𝒜 is said to be universally consistent if, for any continuous function h, the output of 𝒜 converges in probability to h. That is, for a random data set D_m of m i.i.d. samples drawn from a distribution p, and any ϵ > 0,

\lim_{m\to\infty}\Pr\big[\|\mathscr{A}(D_m)-h\|_{L^2(p)}>\epsilon\big]=0\,.

Sun et al. (2019) prove that (norm-bounded) linear regression applied to random ReLU features (RRF) is universally consistent. Specifically, for any ϵ, δ > 0, there is a finite number of random features d̂ and samples m such that the estimator

\hat{h}_{\mathrm{RRF}}(x) = \hat{\theta}^\top\hat{\Phi}^{\gamma,\hat{d}}_{\mathrm{RRF}}(x) \quad\text{with}\quad \hat{\theta} = \operatorname*{arg\,min}_{\theta:\|\theta\|_2^2<R^2}\frac{1}{m}\sum_{i=1}^m\big(\theta^\top\hat{\Phi}^{\gamma,\hat{d}}_{\mathrm{RRF}}(x_i)-y_i\big)^2

achieves an error of at most ϵ with probability at least 1 − δ for univariate continuous functions of x. Universal consistency (as d̂ → ∞ with m > d̂) follows as a result. We can apply the same idea to a version of generalized LuPTS by considering each parameter estimate of the latent dynamical system given Φ̂, with norms restricted by R. We drop the subscript RRF from Φ̂ moving forward and continue to use random ReLU features. For the final prediction step of Y, we let

\hat{\beta} = \operatorname*{arg\,min}_{b:\|b\|_2^2<R^2}\frac{1}{m}\sum_{i=1}^m\big(b^\top\hat{\Phi}^{\gamma,\hat{d}_T}(x_{i,T})-y_i\big)^2 \qquad (7)

Progressing recursively backward from t=Tt=T, we let

[\hat{A}_t]_{j,:} = \operatorname*{arg\,min}_{a:\|a\|_2^2<R^2}\frac{1}{m}\sum_{i=1}^m\big(a^\top\hat{\Phi}^{\gamma,\hat{d}_t}(x_{i,t})-\hat{\Phi}^{\gamma,\hat{d}_{t+1}}(x_{i,t+1})_j\big)^2 \quad\text{for } j\in[\hat{d}_{t+1}],\ t\in[T-1] \qquad (8)

where Φ̂^{γ,d̂_t} are random features specific to time step t. Since the target of each prediction at t+1 is fixed with respect to the features used as input at t, the result of Sun et al. (2019) gives a learning guarantee for each step. The construction differs from the standard generalized LuPTS formulation in that Φ̂ is not shared between time steps, so Â_t will be non-square in general. We believe that this is merely a limitation of the proof technique and that the results hold for shared random features and square transitions. A sketch of the construction is given below.
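For concreteness, the construction in (7)–(8) can be sketched as follows. This is our own illustration, not the paper’s implementation: the norm constraints ‖·‖² < R² are replaced by a small ridge penalty for simplicity, and all function and variable names are ours.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def sample_omega(d_feat, k):
    """Projection directions drawn uniformly from the unit sphere."""
    w = rng.normal(size=(d_feat, k + 1))             # +1 for the bias coordinate
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def relu_features(X, omega):
    """Random ReLU features sigma(omega^T [x; 1]) for every sample in X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return np.maximum(Xb @ omega.T, 0.0)

def fit_privileged_rrf(Xs, Y, d_feats, alpha=1e-3):
    """Per-step transitions A_t and outcome map beta as in (8) and (7);
    Xs is a list of (m, k) arrays, Y an (m, q) array."""
    omegas = [sample_omega(d_feats[t], Xs[t].shape[1]) for t in range(len(Xs))]
    Zs = [relu_features(X, om) for X, om in zip(Xs, omegas)]
    A = [Ridge(alpha=alpha, fit_intercept=False).fit(Zs[t], Zs[t + 1]).coef_.T
         for t in range(len(Xs) - 1)]                # features at t+1 from t
    beta = Ridge(alpha=alpha, fit_intercept=False).fit(Zs[-1], Y).coef_.T
    theta = np.linalg.multi_dot(A + [beta]) if A else beta
    return omegas[0], theta      # predict with relu_features(X1, omegas[0]) @ theta
```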

Let ℋ_{ReLU,d̂} denote the set of linear functions applied to d̂ random ReLU features, with uniform random projection coefficients ω_j ∼ 𝒰(𝕊^d), j = 1, …, d̂:

\mathcal{H}_{\mathrm{ReLU},\hat{d}} = \Big\{h:\mathcal{X}\to\mathbb{R},\ h(x)=\sum_{j=1}^{\hat{d}}a_j\,\sigma(\omega_j^\top[x;1])\Big\}\,.
Corollary 1 (Follows from Proposition 5 in Sun et al. (2019)).

Let Assumption 1 hold with noiseless transitions and outcomes, ϵ_t = 0 for t = 2, …, T and ϵ_Y = 0. Define g_t^*(x_t) := Ψ(A_t^⊤Φ(x_t)) = x_{t+1} and assume that, for any fixed RRF representation Φ̂^{γ,d̂_{t+1}}, each component j = 1, …, d̂_{t+1} of the transition target satisfies Φ̂^{γ,d̂_{t+1}}(g_t^*(·))(j) ∈ ℋ_{ReLU,d̂_t}. Let Â_t be the minimizer for the single-step transitions as defined in (8). Then, for any δ > 0 and ϵ > 0, there is a number of random features d̂ = d̂(ϵ, δ) and of samples m = m(ϵ, δ) such that, with probability at least 1 − δ,

\|\hat{A}_t^\top\hat{\Phi}^{\gamma,\hat{d}_t}(x_t)-\hat{\Phi}^{\gamma,\hat{d}_{t+1}}(x_{t+1})\|\leq\epsilon\,.

The result follows from Proposition 5 in Sun et al. (2019) applied to the transition functions in our problem. Putting this together for all time-steps, we get the following result.

Proposition 2 (Universal consistency of RRF privileged learner).

Let Assumption 1 hold with ϵ_t = 0 and ϵ_Y = 0. By Corollary 1 and Sun et al. (2019), for any δ > 0, ϵ > 0, γ > 0, a sequence (d̂_1, …, d̂_T) of sufficiently large numbers of random features and a sufficiently large number of samples m, it holds with probability at least 1 − δ/T for the privileged estimator defined in (7) and (8) that

\|\hat{\beta}^\top\hat{\Phi}^{\gamma,\hat{d}_T}(X_T)-Y\|_{L^2(p)}\leq\epsilon

and

\forall t=1,\dots,T-1:\ \|\hat{A}_t^\top\hat{\Phi}^{\gamma,\hat{d}_t}(X_t)-\hat{\Phi}^{\gamma,\hat{d}_{t+1}}(X_{t+1})\|_{L^2(p)}\leq\epsilon\,.

Assume further that the largest eigenvalue satisfies λ_max(Â_t) ≤ C for all t and that ‖β̂‖ ≤ C. Then, with probability at least 1 − δ,

\|(\hat{A}_1\cdots\hat{A}_{T-1}\hat{\beta})^\top\hat{\Phi}^{\gamma,\hat{d}_1}(X_1)-Y\|_{L^2(p)}\leq TC^T\epsilon
Proof.

Let ‖·‖ = ‖·‖_{L²(p)} and write Φ̂^t = Φ̂^{γ,d̂_t}. Applying a union bound over the T (ϵ, δ/T)-events above, together with the triangle and Cauchy–Schwarz inequalities,

\begin{aligned}
\|(\hat{A}_1\cdots\hat{A}_{T-1}\hat{\beta})^\top\hat{\Phi}^1(X_1)-Y\| &= \|(\hat{A}_1\cdots\hat{A}_{T-1}\hat{\beta})^\top\hat{\Phi}^1(X_1)-\hat{\beta}^\top\hat{\Phi}^T(X_T)+\hat{\beta}^\top\hat{\Phi}^T(X_T)-Y\| \\
&\leq \|(\hat{A}_1\cdots\hat{A}_{T-1}\hat{\beta})^\top\hat{\Phi}^1(X_1)-\hat{\beta}^\top\hat{\Phi}^T(X_T)\| + \underbrace{\|\hat{\beta}^\top\hat{\Phi}^T(X_T)-Y\|}_{\leq\,\epsilon} \\
&\leq \big\|(\hat{A}_1\cdots\hat{A}_{T-1}\hat{\beta})^\top\hat{\Phi}^1(X_1)-\hat{\beta}^\top\hat{A}_{T-1}^\top\hat{\Phi}^{T-1}(X_{T-1}) \\
&\qquad\quad +\hat{\beta}^\top\big(\hat{A}_{T-1}^\top\hat{\Phi}^{T-1}(X_{T-1})-\hat{\Phi}^T(X_T)\big)\big\| + \epsilon \\
&\leq \|(\hat{A}_1\cdots\hat{A}_{T-1}\hat{\beta})^\top\hat{\Phi}^1(X_1)-(\hat{A}_{T-1}\hat{\beta})^\top\hat{\Phi}^{T-1}(X_{T-1})\| \\
&\qquad\quad + \underbrace{\|\hat{\beta}^\top(\hat{A}_{T-1}^\top\hat{\Phi}^{T-1}(X_{T-1})-\hat{\Phi}^T(X_T))\|}_{\leq\,C\epsilon} + \epsilon \\
&\;\;\vdots \\
&\leq TC^T\epsilon\,.
\end{aligned}

Remark 1.

Proposition 2 shows that a privileged learner with random ReLU features can be turned into a universally consistent estimator of any (noiseless) continuous function of X_1 by letting each time step have its own random feature representation of appropriate size and adding a norm constraint to each linear transformation. The construction in (7)–(8) deviates from Algorithm 1 primarily in that the random feature representations used at each time step differ, but the overall structure is maintained: during training, predictions of Y are made from an embedding Ẑ_T = Φ̂^T(X_T) of X_T, predictions of Ẑ_T are made from an embedding Ẑ_{T−1} = Φ̂^{T−1}(X_{T−1}), and so on. At test time, the transition matrices Â_1, …, Â_{T−1} and the outcome map β̂ are multiplied together and applied to Ẑ_1 = Φ̂^1(X_1). We conjecture that a similar argument can be applied to the construction in Algorithm 1 by letting the dimension of each time step approach ∞.

Appendix C Compounding bias

We can describe the compounding bias of the LuPTS estimator due to a biased representation Φ̂, in comparison with the standard OLS estimator, by propagating the error in Φ̂ through the estimates. Assume that

Y = \theta^\top\Phi(X_1)+\epsilon = (A_1\cdots A_{T-1}\beta)^\top\Phi(X_1)+\epsilon'

Then, let Z_t = Φ(X_t) and, for an estimate Φ̂ assumed for simplicity to have the same dimension (d̂ = d), write

\hat{Z}_t = \hat{\Phi}(X_t) = \Phi(X_t) + R_t

where R_t is the residual with respect to Φ. Bold-face variables denote multi-sample equivalents of all quantities. Further, define Σ_t = 𝐙_t^⊤𝐙_t and Σ̂_t = 𝐙̂_t^⊤𝐙̂_t.

Fitting θ̂ to Φ̂ using the classical learner (OLS) yields the estimate

\hat{\theta}_c = \hat{\Sigma}_1^{-1}\hat{\mathbf{Z}}_1^\top\mathbf{Y}

Now, defining Ω_t = 𝐑_t^⊤𝐙̂_t + 𝐙_t^⊤𝐑_t, we have

\begin{aligned}
\hat{\theta}_c &= (\Sigma_1+\Omega_1)^{-1}(\mathbf{Z}_1+\mathbf{R}_1)^\top\mathbf{Y} \\
&= (\Sigma_1^{-1}+\Delta_1)(\mathbf{Z}_1+\mathbf{R}_1)^\top\mathbf{Y} \\
&= \hat{\theta}_c^* + (\Delta_1\hat{\mathbf{Z}}_1^\top+\Sigma_1^{-1}\mathbf{R}_1^\top)\mathbf{Y}
\end{aligned}

where Δ_1 = −Σ_1^{−1}Ω_1(Σ_1 + Ω_1)^{−1}, θ̂_c^* is the OLS estimate of θ for the true Φ, and the second line follows from the Woodbury matrix identity. The norm of Δ_1 is related to the condition number of Σ_1. The expectation of the first term is θ; the expectation of the remaining terms is the bias.

We can proceed in the same way for the privileged estimator, starting with T = 2:

\begin{aligned}
\hat{\theta}_p &= \hat{\Sigma}_1^{-1}\hat{\mathbf{Z}}_1^\top\hat{\mathbf{Z}}_2\hat{\Sigma}_2^{-1}\hat{\mathbf{Z}}_2^\top\mathbf{Y} \\
&= (\Sigma_1^{-1}+\Delta_1)(\mathbf{Z}_1+\mathbf{R}_1)^\top(\mathbf{Z}_2+\mathbf{R}_2)(\Sigma_2^{-1}+\Delta_2)(\mathbf{Z}_2+\mathbf{R}_2)^\top\mathbf{Y} \\
&= \hat{\theta}_p^* + \hat{A}_1(\Sigma_2^{-1}\mathbf{R}_2^\top+\Delta_2\hat{\mathbf{Z}}_2^\top)\mathbf{Y} + (\Sigma_1^{-1}\hat{\mathbf{Z}}_1^\top\mathbf{R}_2+\Sigma_1^{-1}\mathbf{R}_1^\top\hat{\mathbf{Z}}_2+\Delta_1\hat{\mathbf{Z}}_1^\top\hat{\mathbf{Z}}_2)\hat{\beta}\,.
\end{aligned}

Thus, the difference in bias between the two estimators is

\begin{aligned}
\mathbb{E}[\hat{\theta}_c-\hat{\theta}_p] =\ &\underbrace{\mathbb{E}[\hat{\theta}_c^*-\hat{\theta}_p^*]}_{=0} \\
&+ \mathbb{E}\big[(\Delta_1\hat{\mathbf{Z}}_1^\top+\Sigma_1^{-1}\mathbf{R}_1^\top)\mathbf{Y} - \hat{A}_1(\Delta_2\hat{\mathbf{Z}}_2^\top+\Sigma_2^{-1}\mathbf{R}_2^\top)\mathbf{Y} \\
&\qquad - (\Sigma_1^{-1}\hat{\mathbf{Z}}_1^\top\mathbf{R}_2+\Sigma_1^{-1}\mathbf{R}_1^\top\hat{\mathbf{Z}}_2+\Delta_1\hat{\mathbf{Z}}_1^\top\hat{\mathbf{Z}}_2)\hat{\beta}\big]
\end{aligned}

More generally, we can express this difference recursively as below.

Proposition 3.

Let θ̂_p := Â_1⋯Â_Tβ̂ be a privileged estimator using a linearly biased representation Φ̂, and let θ̂_p^* := Â_1^*⋯Â_T^*β̂^* be the same estimator using an unbiased representation Φ^*. Then, the bias of θ̂_p is

\mathbb{E}[\hat{\theta}_p-\theta] = \mathbb{E}[\hat{\theta}_p-\hat{\theta}_p^*] = \mathbb{E}\big[E_T\hat{\beta}^* + (\hat{A}_1\cdots\hat{A}_T)(\hat{\beta}-\hat{\beta}^*)\big]\,,

where EtE_{t} is the compounded error in transition dynamics, computed recursively as follows

E_t \coloneqq (\hat{A}_1\cdots\hat{A}_t)-(\hat{A}_1^*\cdots\hat{A}_t^*) = E_{t-1}\hat{A}_t^* + (\hat{A}_1\cdots\hat{A}_{t-1})(\hat{A}_t-\hat{A}_t^*)\,,

with E_0 = 0. In the worst case, the bias of θ̂_p grows exponentially with T.
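The recursion defining E_t is straightforward to verify numerically. Below is a minimal sketch (our own) that checks the recursion against the direct definition for random transition estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 5
A_hat = [rng.normal(size=(d, d)) for _ in range(T)]   # biased estimates
A_star = [rng.normal(size=(d, d)) for _ in range(T)]  # estimates under Phi*

E = np.zeros((d, d))          # E_0 = 0
prod_hat = np.eye(d)          # running product A_hat_1 ... A_hat_{t-1}
for t in range(T):
    E = E @ A_star[t] + prod_hat @ (A_hat[t] - A_star[t])
    prod_hat = prod_hat @ A_hat[t]

# direct definition: E_T = A_hat_1...A_hat_T - A*_1...A*_T
direct = np.linalg.multi_dot(A_hat) - np.linalg.multi_dot(A_star)
print(np.allclose(E, direct))  # True
```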

Appendix D Privileged time series representation learners

We expand here on the greedy representation learner (GRL), described as a special case of CRL in Section 3.3. To avoid the information loss of SRL, we consider its conceptual opposite: privileged time series information is used only to predict the outcome. To do this, a linear output layer θ̂_t^⊤Ẑ_t is used to predict Y at every time step t. Recall that, by Assumption 1, the expected outcome is linear in the latent state at any time step. The method is related to multi-view learning, in which predictions of the same quantity are made from multiple “views” (Zhao et al., 2017). The greedy privileged representation learner (GRL) minimizes the objective

\mathcal{L}_{\textsc{grl}}(\hat{\Phi},\{\hat{\theta}_t\}) \coloneqq \frac{1}{NT}\sum_{i=1}^N\sum_{t=1}^T w_t\,\big\|\hat{\theta}_t^\top\hat{\Phi}(x_{i,t})-y_i\big\|_2^2. \qquad (9)

At inference time, the algorithm returns ĥ_p(·) = θ̂_1^⊤Φ̂(·). Compared to the objective of CRL in (6), we introduce an additional hyperparameter λ ∈ (0, 1) to place more weight on the loss term that is relevant at inference time; consequently, we set w_t = λ for t = 1 and w_t = 1 − λ otherwise. We expect GRL to have less bias than SRL, but higher variance, since less structure is imposed on the representation Φ̂. A sketch of the model and objective is given below.
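The following is a minimal PyTorch sketch (our own; the encoder architecture is an arbitrary placeholder) of GRL: a shared encoder Φ̂, one linear head θ̂_t per time step, and the weighted objective of (9) up to constant normalization.

```python
import torch
import torch.nn as nn

class GRL(nn.Module):
    """Greedy privileged representation learner: a shared encoder Phi_hat and
    one linear head theta_t per time step, each predicting the outcome y."""
    def __init__(self, in_dim, rep_dim, out_dim, T):
        super().__init__()
        self.encoder = nn.Sequential(               # placeholder architecture
            nn.Linear(in_dim, 64), nn.LeakyReLU(), nn.Linear(64, rep_dim))
        self.heads = nn.ModuleList(nn.Linear(rep_dim, out_dim) for _ in range(T))

    def forward(self, x_seq):                       # x_seq: (batch, T, in_dim)
        return [head(self.encoder(x_seq[:, t])) for t, head in enumerate(self.heads)]

def grl_loss(preds, y, lam):
    """Objective (9) up to normalization: w_1 = lambda, w_t = 1 - lambda else."""
    weights = [lam] + [1.0 - lam] * (len(preds) - 1)
    return sum(w * ((p - y) ** 2).sum(dim=1).mean() for w, p in zip(weights, preds))

# At inference only the first head is used: h_p(x) = theta_1^T Phi_hat(x)
def predict(model, x1):
    return model.heads[0](model.encoder(x1))
```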

Appendix E Experiment setup & data processing

E.1 Detailed experiment setup

In the following, we give a detailed description of the experimental setup used to obtain the results presented in Section 4 as well as the additional results in this section. For a given data set, we select a combination of training set sizes and sequence lengths. For each unique combination of these parameters, the models of interest are trained repeatedly with different random sampling. In each repetition, the data is randomly split into a training and a test set before hyperparameter tuning and model training are performed. Finally, each model’s predictions on the test set are scored by computing the coefficient of determination R². On synthetic data, the test set contains 1000 samples; for real-world data, where samples are limited, we test on 20% of all available data.

The preprocessing of real-world data and the generation procedure for synthetic data are unique to each data set; we refer to the data-set-specific subsections for detailed descriptions. During the experiments, each model uses standardized data for training and inference. To perform the rescaling, we use the StandardScaler implementation that is part of scikit-learn (Pedregosa et al., 2011), as in the snippet below.
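Concretely, the rescaling step amounts to the following (a minimal sketch with placeholder data, not the exact pipeline):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train, X_test = np.random.rand(80, 5), np.random.rand(20, 5)  # placeholder data
scaler = StandardScaler().fit(X_train)   # statistics from the training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # the test data reuses those statistics
```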

Hyperparameter tuning.

Hyperparameters are tuned in each repetition using random search and five-fold cross-validation. Each hyperparameter is sampled from a fixed interval of possible values; an overview of the ranges determined through random search is provided in Table 1. For the experiments with variants of generalized LuPTS, we sample ten sets of hyperparameters before retraining on all training data with the best set. For the representation learners, we sample only five values of λ.

| Hyperparameter | Description | Used in Algorithm | Value Range |
| --- | --- | --- | --- |
| n_RF | number of random features | all random feature methods | [0.05m, 0.8m] |
| γ_RRF | bandwidth parameter | random ReLU methods | [0.01, 10] |
| γ_RFF | bandwidth parameter | random Fourier methods | [0.001, 0.1] |
| λ | loss function parameter | GRL & CRL | [0, 1] |

Table 1: Overview of all hyperparameters determined by hyperparameter tuning. m denotes the number of samples, meaning that n_RF is chosen from different ranges depending on the size of the training set.

Neural network training.

Training neural networks involves many choices and hyperparameters. We use PyTorch’s implementation (Paszke et al., 2019) of the Adam optimizer (Kingma and Ba, 2017) to train the representation learning models. Unless specified otherwise, the results shown in this work are obtained using a learning rate of 0.0001, a batch size of 30, leaky ReLU activations and a maximum of 1500 training epochs. For neural network models, the sample sizes reported as part of the experiments denote the combined size of the training and validation sets, where the validation set contains 20% of those samples. We use early stopping by tracking the validation loss: if a model does not improve the validation loss over a waiting period of 200 epochs, we stop training and reset the network parameters to the values that obtained the lowest validation loss up to that point (see the sketch below). To verify that the results do not depend on these specific choices, we performed additional experiments with different parameter settings and found the results to be robust to small changes.
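A schematic of this training loop with early stopping might look as follows (our own sketch; the data loaders, model and loss function are placeholders):

```python
import copy
import torch

def train_with_early_stopping(model, opt, loss_fn, train_loader, val_loader,
                              max_epochs=1500, patience=200):
    """Adam training with early stopping on the validation loss; the weights
    with the lowest validation loss seen so far are restored at the end."""
    best_loss, best_state, waited = float("inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for xb, yb in train_loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(xb), yb).item() for xb, yb in val_loader)
        if val_loss < best_loss:
            best_loss, waited = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            waited += 1
            if waited >= patience:      # no improvement for `patience` epochs
                break
    model.load_state_dict(best_state)
    return model
```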

Generalized distillation.

To compare our algorithms to the alternative of using generalized distillation for privileged time series, as presented by Hayashi et al. (2019), we implemented a model that (i) produces hypotheses of the same class as our other algorithms and (ii) adopts the learning paradigm of a student model incorporating soft targets produced by a teacher model into its loss function. For tabular data, our teacher model is a multi-layer perceptron (MLP) with T·k input neurons, such that all {X_t} are concatenated and used as input. The teacher MLP uses five hidden layers of 100 neurons each. For the image data generated by Clocks-LGS, the teacher applies an implementation of LeNet-5 to all variables X_t (sharing the encoder parameters) before concatenating the 25-dimensional outputs of this encoder across time steps; the combined representation is then processed by an MLP with a single hidden layer of 25 neurons. The student model producing the estimate ĥ_c is architecturally identical to the classical representation learner used in each of the experiments. When training the student, the mean squared error on the data and the error corresponding to the soft targets of the teacher are combined via a hyperparameter λ:

\mathcal{L}_{GD} \coloneqq \lambda\frac{1}{m}\sum_{i=1}^m\|\hat{h}_{\textsc{c}}(x_i)-y_i\|_2^2 + (1-\lambda)\frac{1}{m}\sum_{i=1}^m\|\hat{h}_{\textsc{c}}(x_i)-\hat{h}_{\text{teacher}}(x_i)\|_2^2

The hyperparameter λ is determined via hyperparameter tuning in all repetitions, as described for other hyperparameters in this section. All other training procedures follow the logic described above; a minimal sketch of the combined loss follows.
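In code, the combined distillation loss amounts to the following minimal sketch (`student_pred` and `teacher_pred` are assumed to be model outputs of matching shape):

```python
import torch

def gd_loss(student_pred, teacher_pred, y, lam):
    """Generalized distillation objective: lambda-weighted data term plus a
    (1 - lambda)-weighted imitation of the teacher's soft targets."""
    data_term = ((student_pred - y) ** 2).sum(dim=1).mean()
    soft_term = ((student_pred - teacher_pred.detach()) ** 2).sum(dim=1).mean()
    return lam * data_term + (1.0 - lam) * soft_term
```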

Resources.

For training the representation learning algorithms, we use a cluster of graphics processing units (GPUs) in order to reach the number of experiment repetitions required for our work. A single experiment such as the one shown in Figure 5(c) takes several hours on 100 NVIDIA Tesla T4 GPUs. While the random feature methods do not require GPU training, they still require hyperparameter tuning, which is why we compute results such as those presented in Figure 4 on many CPU cores in parallel. The neural network experiments cannot reasonably be reproduced on a single desktop machine, but this is possible within a few days for the random feature methods.

Latent variable recovery and SVCCA.

To assess the extent to which a representation has been found that is linearly related to the true latent variables, we use SVCCA as described by Raghu et al. (2017): we first apply PCA, retaining at least 99% of the variation, and then apply CCA (see the sketch below). For the visualization shown in Figure 7(b), we construct a grid of 150 × 150 points around the origin and assign each point a unique color. We then compute an image for each point using the observation generating function Ψ: 𝒵 → 𝒳 of the Clocks-LGS data set. The observations are passed to the encoders Φ̂ of the two representation learners, producing estimates of the latent variables. We map these estimates to the ground truth linearly using SVCCA before plotting the result.
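A minimal sketch of the SVCCA computation, assuming scikit-learn (the exact implementation may differ):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA

def svcca(Z_hat, Z_true, var_kept=0.99):
    """SVCCA in the spirit of Raghu et al. (2017): PCA keeping at least
    var_kept of the variance in each view, followed by CCA."""
    Zr = PCA(n_components=var_kept).fit_transform(Z_hat)
    Tr = PCA(n_components=var_kept).fit_transform(Z_true)
    k = min(Zr.shape[1], Tr.shape[1])
    U, V = CCA(n_components=k).fit(Zr, Tr).transform(Zr, Tr)
    # mean canonical correlation over the k retained directions
    return np.mean([np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(k)])
```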

E.2 Alzheimer progression

To test our algorithms on the task of predicting the progression of Alzheimer’s disease (AD), we use an anonymized data set obtained through the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (ADNI, 2004) under the LONI Research License. The initiative is a large multi-site research study of the brains of over 2000 AD patients, collecting many features such as genetic, imaging and biospecimen biomarkers. The data consists of measurements taken every 3 months, with some observations missing. The outcome of interest in our experiments is the Mini-Mental State Examination (MMSE) score 48 months after the first measurement (Galea and Woodward, 2005). The privileged information consists of the measurements taken between these time points, at 12, 24 and 36 months into the program.

Data processing.

The processing procedure used in this project is borrowed directly from the work of Karlsson et al. (2022). There is a large amount of missing information in the ADNI data set; the missingness varies with the time at which measurements were taken, and some subjects were not present at some of the follow-up examinations. To deal with this, patients without an observation for the final follow-up (the outcome Y) are excluded from our experiments. We further require that patients are present at all intermediate time steps (12, 24 and 36 months after the first measurement), which we use as privileged information. We one-hot encode categorical features and exclude features for which more than 70% of the observations are missing. The remaining missing values are handled with mean imputation. Due to this filtering, we obtain only 502 suitable sequences for our experiments.

E.3 Traffic data

The Traffic data set (Hogue, 2012), obtained through the UCI machine learning repository (Dua and Graff, 2017), contains hourly measurements of traffic volume as well as weather features and holiday information. The raw data contains 48,204 records. An overview of all available features is given in Table 2.

| Feature | Type | Description |
| --- | --- | --- |
| Date Time | Timestamp | date and time (CST) |
| Holiday | String | name of holiday if applicable |
| Weather Description | String | brief free-text description of the weather |
| Weather Main | Categorical | contains categories like clear, clouds, or rain |
| Rain_1h | Numerical | rain in L/(h·m²) |
| Snow_1h | Numerical | snow in L/(h·m²) |
| Temp | Numerical | temperature in Kelvin |
| Traffic Volume | Numerical | hourly reported westbound traffic volume |

Table 2: Features available in the Traffic data set.

Data processing.

We noticed extreme outliers in the data set as well as implausible numerical values for the temperature and rain features. Furthermore, records for some of the hours in the timeframe covered by the data set (2012–2018) are missing. To deal with the extreme outliers, we calculate the mean and standard deviation of each feature and remove records containing values further than six standard deviations from the mean of a particular feature. We remove a feature entirely if there is no variation left after this filtering; this is the case for the snowfall feature, as snow is very rare in Minneapolis. From the date and time of each record we calculate the weekday, which we add as a one-hot encoded feature, and represent the hour of the day h ∈ {0, 1, …, 23} as two periodic features given by

t_{\text{periodic}} = \Big[\sin\Big(\frac{2\pi h}{24}\Big),\ \cos\Big(\frac{2\pi h}{24}\Big)\Big]\,. \qquad (10)

This ensures that a timestamp just before midnight produces features similar to one just after midnight. We one-hot encode the holiday information, making no distinction between different types of holidays, and let this feature persist over a full calendar day; in the original data set, the holiday information is only specified for the first hour of the day. The column Weather_Main contains some weather conditions that are very rare, such as smoke and squall. As a consequence, we group the different conditions before encoding them as binary variables: drizzle, rain and squall become a single feature, and fog, haze, mist and smoke are grouped together, as they all affect visibility. A sketch of these steps is given below.
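The outlier filtering and time features just described might be implemented as follows (a minimal pandas sketch; the column name `date_time` is a placeholder):

```python
import numpy as np
import pandas as pd

def preprocess(df, num_cols):
    """Six-sigma outlier filtering, periodic time-of-day features (eq. 10)
    and a one-hot weekday; 'date_time' is a placeholder column name."""
    for col in num_cols:                       # drop extreme outliers per feature
        mu, sd = df[col].mean(), df[col].std()
        df = df[(df[col] - mu).abs() <= 6 * sd]
    stamps = pd.to_datetime(df["date_time"])
    df = df.assign(                            # hour of day as periodic features
        t_sin=np.sin(2 * np.pi * stamps.dt.hour / 24),
        t_cos=np.cos(2 * np.pi * stamps.dt.hour / 24))
    weekday = pd.get_dummies(stamps.dt.dayofweek, prefix="weekday")
    return pd.concat([df, weekday], axis=1)
```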

Time series selection.

After this preprocessing, which leaves only numerical values and one-hot encoded categorical values, we group the data into the time series used for the experiments. To do so, we specify a desired sequence length T + 1 and a sequence step size in hours. We then iterate through the data set, assembling time series with (i) no missing values, (ii) the correct length and step size, and (iii) at least a seven-hour gap between each pair of sequences. The third condition ensures that very similar cases (for short sequences in particular) do not end up in both the training and the test set. A sketch of this procedure follows.
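A sketch of the greedy sequence-assembly pass (our own; it assumes hourly records sorted by time and checks only for missing timestamps, not feature-level missingness):

```python
import pandas as pd

def assemble_sequences(df, T, step_hours, gap_hours=7):
    """Greedy pass over records sorted by time: keep series of length T + 1
    with the requested step size and >= gap_hours between sequences."""
    times = pd.to_datetime(df["date_time"])
    available = set(times)                       # timestamps present in the data
    sequences, last_end = [], pd.Timestamp.min
    for start in times:
        if start < last_end + pd.Timedelta(hours=gap_hours):
            continue                             # too close to the previous series
        stamps = [start + pd.Timedelta(hours=t * step_hours) for t in range(T + 1)]
        if all(s in available for s in stamps):  # no missing records
            sequences.append(stamps)
            last_end = stamps[-1]
    return sequences
```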

E.4 Square-Sign

The Square-Sign data set serves as a test environment for learning from privileged time series information in which the conditions of Assumption 1 are guaranteed to hold. In particular, this means creating a linear-Gaussian system which remains unobserved and combining it with an observation generating function ψ: 𝒵 → 𝒳.

Latent linear-Gaussian system.

The first component of the generation process for Square-Sign (and Clocks-LGS) is the latent linear-Gaussian system, just as depicted on the right side of Figure 1(b), with Z_t ∈ ℝ^d. The first step of the data creation process is sampling each of the d components of Z_0 from 𝒩(0, 5). The subsequent latent variables Z_{t+1} are then computed as

Z_{t+1} \coloneqq A_t^\top Z_t + \epsilon,\quad \epsilon\in\mathbb{R}^d,\ \epsilon_j\sim\mathcal{N}(0,1)\,.

For the outcome we use the same form but with different dimensionality:

Y \coloneqq \beta^\top Z_T + \epsilon_y,\quad \epsilon_y\in\mathbb{R}^q

Off-diagonal elements of the transition matrices A_t ∈ ℝ^{d×d} are sampled from a normal distribution 𝒩(0, 0.2), while the diagonal elements are set to one. In a second step, we compute the spectral radius λ_s of each randomly created matrix A_t via eigenvalue decomposition, obtaining the components UΛU^⊤. We then set the spectral radius to a predefined value λ_max = 1.3 by reassembling the matrix as

A_t \leftarrow U\,\frac{\lambda_{max}}{\lambda_s}\,\Lambda\, U^\top\,.

The coefficients of β are drawn from the same normal distribution as those of A_t but undergo no further changes.

Observation generating function.

As the dimensionality d of the latent space 𝒵 is not fixed, we use an observation generating function that is not restricted to a specific value of d. For each element of Z_t ∈ 𝒵 = ℝ^d, we create two elements in X_t ∈ 𝒳 = ℝ^{2d} by recording its sign separately from its square. This gives the following nonlinear observation generating function (a generation sketch follows below):

X_t \coloneqq \psi(Z_{(t)}) = \big[Z_{(t,1)}^2,\ \mathrm{sgn}(Z_{(t,1)}),\ \dots,\ Z_{(t,d)}^2,\ \mathrm{sgn}(Z_{(t,d)})\big]^\top
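Putting the pieces together, the Square-Sign generation process can be sketched as below (our own reconstruction: we read 𝒩(0, 5) and 𝒩(0, 0.2) as variances, and rescale the whole matrix instead of its eigenvalue factor, which yields the same spectral radius):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_transition(d, lam_max=1.3):
    """Random transition matrix rescaled to a fixed spectral radius."""
    A = rng.normal(0.0, np.sqrt(0.2), size=(d, d))   # off-diagonal coefficients
    np.fill_diagonal(A, 1.0)                         # diagonal set to one
    lam_s = np.max(np.abs(np.linalg.eigvals(A)))     # current spectral radius
    return A * (lam_max / lam_s)

def square_sign(Z):
    """Observation map psi: interleave squares and signs of the latents."""
    return np.stack([Z ** 2, np.sign(Z)], axis=-1).reshape(Z.shape[0], -1)

def generate(m, d, T, q):
    A = [sample_transition(d) for _ in range(T)]
    beta = rng.normal(0.0, np.sqrt(0.2), size=(d, q))
    Z = rng.normal(0.0, np.sqrt(5.0), size=(m, d))   # Z_0 components ~ N(0, 5)
    X = [square_sign(Z)]
    for t in range(T):
        Z = Z @ A[t] + rng.normal(size=(m, d))       # Z_{t+1} = A_t^T Z_t + eps
        X.append(square_sign(Z))
    Y = Z @ beta + rng.normal(size=(m, q))
    return X, Y
```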

E.5 Clocks-LGS

This data set serves the purpose of testing our algorithms on a different modality with high-dimensional data. In particular, the idea was to use image data, as this is a domain where neural networks have been very successful. For this reason, we combine a latent dynamical system with an image generation process, which we explain in detail in this section.

Latent linear-Gaussian system.

We use exactly the same setup as for the Square-Sign latent dynamical system described in Section E.4. The only differences are the dimensionalities of the latent variables, transition matrices and the outcome: for Clocks-LGS we generally have d = 2 and q ∈ {1, 2}.

Image generation.

The second part of Clocks-LGS is creating images from two-dimensional latent vectors Z_t = [Z_t^(1), Z_t^(2)]^⊤. The goal was to keep the process simple while using small black-and-white images of 28×28 pixels. In addition, we wanted each image to be unambiguous with respect to the latent state it represents. We represent the first component by a clock hand mounted at the image center. One can think of Z_t^(1) as the angle in radians: the hand points straight up for Z_t^(1) = 0 or Z_t^(1) = 2π and straight down for Z_t^(1) = π. To visualize full rotations, we increase the size of the circle around the image center in discrete steps for each multiple of 2π. For negative values the circle is empty (black), while it is filled (white) for positive values. For the second component we use the same logic, but instead of a clock hand we use a circle that orbits the image center. The two markers cannot obscure each other, as the orbiting circle uses a larger radius. Figure 8 shows three examples of pairs of corresponding latent vectors and generated images; a simplified rendering sketch is given after the figure.

Figure 8: Pairs of corresponding latent vectors ZtZ_{t} and images as generated in Clocks-LGS.
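A deliberately simplified rendering sketch (our own) that draws only the clock hand and the orbiting dot, omitting the discrete circle-size steps and the filled/empty sign encoding described above:

```python
import numpy as np

def render_clock(z, size=28):
    """Draws a hand at angle z[0] and a dot orbiting the center at angle z[1]
    on a black-and-white image; angles are measured from straight up."""
    img = np.zeros((size, size))
    c = size // 2
    for r in np.linspace(0.0, c - 6, 20):            # the clock hand
        img[int(round(c - r * np.cos(z[0]))), int(round(c + r * np.sin(z[0])))] = 1.0
    R = c - 3                                        # larger radius: no overlap
    y = int(round(c - R * np.cos(z[1])))
    x = int(round(c + R * np.sin(z[1])))
    img[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2] = 1.0
    return img
```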

E.6 PM2.5\textbf{PM}_{2.5} air quality

Due to health concerns, air quality in Chinese cities has become an important topic. The PM2.5 data set contains hourly meteorological information and the concentration of small particles (PM2.5) for the cities of Beijing, Shanghai, Guangzhou, Chengdu and Shenyang (Liang et al., 2016). The individual features available for all cities are listed in Table 3. In addition to the features listed, the data includes the date and time of each record. As in the preprocessing of Traffic, we compute a periodic time feature using expression (10) to represent the time of day of each record. For each numerical feature we calculate the mean and standard deviation and remove rows with values more extreme than six standard deviations from the mean. We also remove rows with missing categorical features, which are then represented as one-hot vectors. Apart from differences in preprocessing, we consider the same prediction task as Karlsson et al. (2022): predicting the future particle concentration for a fixed time horizon given current observations.

| Feature | Type | Description |
| --- | --- | --- |
| season | Numerical | season (1 to 4) of the data in this row |
| PM | Numerical | PM2.5 particle concentration in μg/m³ |
| DEWP | Numerical | dew point in °C |
| TEMP | Numerical | air temperature in °C |
| HUMI | Numerical | humidity in % |
| PRES | Numerical | atmospheric pressure in hPa |
| cbwd | Categorical | combined wind direction in {N, W, S, E, NW, SW, NE, SE} |
| Iws | Numerical | cumulated wind speed in m/s |
| precipitation | Numerical | hourly precipitation in mm |
| Iprec | Numerical | cumulated precipitation in mm |

Table 3: Features available as part of the PM2.5 data set on the air quality of five large Chinese cities.

Appendix F Additional experiment results

In this section, we present a larger scope of our experimental results. We demonstrate the predictive accuracy of the algorithms introduced in Sections 3.2 and 3.3 in terms of the mean coefficient of determination R² over different settings on five data sets. The variation of the results over repetitions is represented by the shaded areas in the visualizations, which denote one standard deviation above and below the mean. We also show empirically that the bias of generalized LuPTS increases with the number of privileged time steps when a poor representation function is used; this can be seen in Figure 16(a), where we test privileged and classical learners over 500 different systems of the Square-Sign type with different sequence lengths. In addition, we further illustrate how privileged information can improve the recovery of the latent variables of the data generating process by providing more visualizations in the style of Figure 7(a) on the Clocks-LGS and Square-Sign data sets; these can be found in Figures 13 and 16.

In the following, we first provide the experiment details for visualizing the phase transition of generalized LuPTS seen in Figure 2. The subsections thereafter are organized by data set.

F.1 Two regimes of generalized LuPTS

As seen in Figure 2 and demonstrated by Proposition 1, generalized LuPTS becomes equivalent to the corresponding classical learner when the number of features d̂ is larger than the number of samples m. In the following, we provide the experiment details that led to Figure 2.

To evaluate the dependency on the number of features, we used linear LuPTS and OLS on synthetically generated linear-Gaussian systems as displayed in Figure 1(a). We use the same setup as for the latent dynamics in Square-Sign, but without a nonlinear observation generating function. For each number of features k = d̂ = d, we sample 50 such systems with different dynamics, each producing a training set of m = 100 samples and a test set of 1000 samples. The systems are all configured with T = 3 and q = 10. We train and score both estimators on all of the data generating systems before computing the mean coefficient of determination over all systems with the same number of features.

F.2 Alzheimer progression

(a) Generalized LuPTS on the reduced setting with a single privileged time point, T = 2.
(b) Representation learners and generalized distillation on the reduced setting with one privileged time step, T = 2.
(c) Representation learners and generalized distillation on the full setting with three privileged time steps, T = 4.
Figure 9: Generalized LuPTS and the representation learning algorithms tested in terms of their predictive accuracy on different settings of the ADNI prediction task. The representation learning experiments are based on 25 repetitions, while the random feature experiment consists of 60 repetitions.

F.3 Traffic data

(a) T = 2. (b) T = 3. (c) T = 4.
Figure 10: Prediction accuracy of the different variants of generalized LuPTS on Traffic with varying sequence lengths. Based on 60 repetitions and four-hour steps in each time series.
(a) T = 2. (b) T = 3. (c) T = 4.
Figure 11: Prediction accuracy of the different representation learning algorithms and generalized distillation on Traffic with varying sequence lengths. The results are based on 25 repetitions and four-hour steps in each time series.

F.4 Clocks-LGS

(a) T = 2. (b) T = 3. (c) T = 5.
Figure 12: Prediction accuracy of the representation learning algorithms and generalized distillation on Clocks-LGS with a two-dimensional outcome (q = 2) and varying sequence lengths, based on 25 repetitions.
Figure 13: Mean SVCCA coefficients and R² for the representation learners on the Clocks-LGS prediction task. The experiment was set up for sequences of length six (T = 5), with outcomes of different dimensionality: q = 1 on the left and q = 2 on the right. Solid marks represent the mean over 25 repetitions, while faded marks denote individual training runs. The annotations refer to the size of the training sets.

F.5 Square-Sign

(a) Sequences of length six, four privileged time points, T = 5.
(b) Sequences of length seven, five privileged time points, T = 6.
(c) Sequences of length eight, six privileged time points, T = 7.
Figure 14: Predictive accuracy (R²) of the different variants of generalized LuPTS applied to the prediction task offered by Square-Sign. The DGP was configured with different sequence lengths and the experiments represent 60 repetitions.
(a) Sequences of length six, four privileged time points, T = 5.
(b) Sequences of length seven, five privileged time points, T = 6.
(c) Sequences of length eight, six privileged time points, T = 7.
Figure 15: Predictive accuracy of the different representation learners introduced in Section 3.3 when applied to the Square-Sign data set for different sequence lengths. The experiments are based on 25 repetitions.
(a) Mean R² over 500 repetitions for different sequence lengths on Square-Sign (d = 10, q = 3) with a sample size of m = 1000. Each run is performed on a different set of randomly sampled transition dynamics.
(b) Mean SVCCA coefficients and R² for the representation learners on the Square-Sign data set configured with T = 5, d = 10, q = 3. Solid marks represent the mean over 25 repetitions for a fixed training set size, while faded marks denote individual runs. The annotations refer to the number of samples in the training data.
Figure 16: Generalized LuPTS applied to the Square-Sign task for different sequence lengths (left panel), reporting the mean coefficient of determination R² over different systems. The right panel shows an analysis of the predictive accuracy and latent recovery of the representation learners in the style of Figure 7(a).

F.6 PM2.5\textbf{PM}_{2.5} air quality

(a) City of Beijing. (b) City of Shanghai. (c) City of Shenyang. (d) City of Guangzhou. (e) City of Chengdu.
Figure 17: Predictive accuracy of the different generalized LuPTS variants on the prediction task posed by the PM2.5 data set. All experiments use time series of length five (T = 4), where each time step is two hours long. The results were computed based on 60 repetitions.
(a) City of Beijing. (b) City of Shanghai. (c) City of Shenyang. (d) City of Guangzhou. (e) City of Chengdu.
Figure 18: Evaluation of the sample efficiency of the representation learning algorithms of Section 3.3 on the PM2.5 air quality prediction task. All experiments use time series of length five (T = 4), where each time step represents two hours. The results were computed based on 25 repetitions.