One-Pass Learning via Bridging Orthogonal Gradient Descent and Recursive Least-Squares

Youngjae Min, Kwangjun Ahn, and Navid Azizan The authors are with the Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. {yjm,kjahn,azizan}@mit.edu

Abstract

While deep neural networks are capable of achieving state-of-the-art performance in various domains, their training typically requires iterating for many passes over the dataset. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the increasing use of overparameterized models, we develop Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit every new datapoint while changing the parameters in a direction that causes the least change to the predictions on previous datapoints. By doing so, we bridge two seemingly distinct algorithms in adaptive filtering and machine learning, namely the recursive least-squares (RLS) algorithm and orthogonal gradient descent (OGD). Our algorithm uses the memory efficiently by exploiting the structure of the streaming data via an incremental principal component analysis (IPCA). Further, we show that, for overparameterized linear models, the parameter vector obtained by our algorithm is what stochastic gradient descent (SGD) would converge to in the standard multi-pass setting. Finally, we generalize the results to the nonlinear setting for highly overparameterized models, relevant for deep learning. Our experiments show the effectiveness of the proposed method compared to the baselines.

I Introduction

While deep neural networks have been successful in numerous domains, their training is computationally demanding and requires iterating over the entire dataset multiple times. This hinders their deployment in many real-world settings such as robot learning, autonomy, and online decision making, where new datapoints are collected over time or become available sequentially. In such settings, storing all the datapoints and retraining the model at every step on all the data is extremely costly and often not feasible. In addition, in certain applications, storing the data may be prohibited for privacy reasons.

Thus, it is very desirable to come up with algorithms that can learn incrementally or in an online fashion, rather than by iterating over the entire data many times. However, it is well-known that deep neural networks are prone to entirely forgetting past information while learning new data, which is an issue referred to as “catastrophic forgetting” [10]. This begs the question:

“Can we learn streaming data efficiently without forgetting or retraining on previous data?”

This is a setting often referred to as one-pass learning. More specifically, one-pass learning concerns the setting where the algorithm (i) makes an update for the current datapoint without direct access to previous data; (ii) the new updates do not significantly affect the predictions on the previous data; moreover, (iii) the computational and memory costs of each update must not grow with the iteration count.

There has been growing attention on one-pass learning and its variants, and several works have attempted to address them under various contexts. In particular, [8] studied learning the ImageNet dataset in a single pass by revisiting some “important” previous datapoints at each learning step, and [19] investigated learning incremental ‘batches’ on a large scale by correcting the classifier’s bias towards new data. However, both methods train on previous data and are not adequate for one-pass learning. On the other hand, [15] proposed an effective one-pass learning method for support vector machines. However, their method is tailored to the specific setting of support vector machines. Further, [16] proposed a one-pass deep learning algorithm, but it relies on a specific network architecture and is vulnerable to forgetting the previous data unless the data is consistently arriving from the same distribution.

One-pass learning is also closely related to a classical problem studied in the context of control/estimation theory. More specifically, a classical algorithm known as recursive least-squares (RLS) (see, e.g., [17]) tackles one-pass learning for linear models (as elaborated in Section II-B). However, there are two limitations of the standard RLS: (i) it suffers from high computational and memory costs, and (ii) it is not well-suited for the overparameterized setting where zero training loss is desired. The main focus of this work is to develop a method that overcomes these limitations. Related to the overparameterized setting, a few works have recently discussed utilizing RLS to train popular deep neural networks such as FNN, CNN, RNN, and LSTM [21, 20]. However, these works are empirical in nature and consider multi-pass learning with mini-batches. Moreover, the theoretical properties of RLS for one-pass learning are not studied in those works.

I-A Contributions

Our main contributions can be summarized as follows.

•

We develop Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning in the overparameterized setting which fits new data on the fly while updating the parameters in a direction that causes the least change to the predictions on previous data. Our algorithm uses memory efficiently by exploiting the structure of the streaming data via incremental principal component analysis (IPCA) to extract the essential information for the update. (§III)
•

Through the proposed method, we establish an interesting connection between two different algorithms from adaptive filtering and machine learning, namely, the recursive least-squares (RLS) algorithm and the orthogonal gradient descent (OGD). We further theoretically characterize the behavior of the proposed method in the linear overparameterized setting. (§IV)
•

We demonstrate the practicality of our approach and corroborate our theoretical findings through various experiments. (§V)
•

We discuss further extensions to overparameterized nonlinear models, relevant for deep learning. (§VI)

I-B Connections to Related Notions

One-pass learning shares some similarities with other settings that deal with streaming data such as online learning and incremental learning. While these terms are often inconsistently defined in the literature, one may distinguish them based on whether we learn a single datapoint or a batch of data at a time [14]. Although both of these approaches learn from streaming data, they typically assume that data is arriving from the same distribution. Compared to these settings, one-pass learning requires explicit efforts to not alter the predictions on the previous data while learning new data. Thus, it aims to preserve the predictions even when the new data comes from a different distribution. Another related setting is continual learning where batches of data from different tasks are sequentially learned [6].

II Preliminaries

II-A One-Pass Learning

Let $f(x;w)\in\mathbb{R}$ be a model that the agent is trying to fit, where $x\in\mathcal{X}\subset\mathbb{R}^{d}$ is the input and $w\in\mathbb{R}^{p}$ is the parameter (weight) vector. Consider a sequentially arriving stream of data $\{(x_{k},y_{k})\}_{k=1}^{K}$ where $x_{k}\in\mathcal{X}$ and $y_{k}\in\mathcal{Y}\subset\mathbb{R}$ . In overparameterized models, we have $p\geq K$ (and often $p\gg K$ ). For a loss function $\ell(\cdot,\cdot)$ , let $f_{k}(w)\vcentcolon=f(x_{k};w)$ and $\ell_{k}(w)\vcentcolon=\ell(y_{k},f_{k}(w))$ . Then, one-pass learning considers the setting where, given an initial parameter $w_{0}\in\mathbb{R}^{p}$ , it updates the parameter $w_{i}\in\mathbb{R}^{p}$ after the new data $(x_{i},y_{i})$ arrived without revisiting previous datapoints $\{(x_{k},y_{k})\}_{k=1}^{i-1}$ .

For concreteness, we first focus on a linear overparameterized model $f(x;w)=w^{\top}x$ with $p=d$ . We then discuss the extension of the results to nonlinear models (e.g. overparameterized architectures in deep learning) in Section VI.

Before introducing our algorithm and the results, we briefly review two related algorithms capable of one-pass learning, proposed in two different literatures.

II-B Recursive Least-Squares

First, we briefly review recursive least-squares (RLS) from the control/estimation theory literature (refer to [17] for details). At every step $i$ , RLS aims to find a parameter vector that solves the following regularized least-squares problem:

w^{(RLS)}_{i}=\arg\min_{w}\sum_{k=1}^{i}(y_{k}-w^{\top}x_{k})^{2}+\lVert w-w_{0}\rVert^{2}_{\Pi},

(1)

where $\lVert x\rVert_{\Pi}\vcentcolon=\sqrt{x^{\top}\Pi x}$ for a $p\times p$ positive-definite matrix $\Pi$ and $w_{0}$ is an initial parameter estimate. Note that the system is underdetermined/overparameterized, and the regularization term in (1) is necessary for the solution to be uniquely defined. While there is a closed-form solution for (1) given by $w^{(RLS)}_{i}=(\Pi+X_{i}^{\top}X_{i})^{-1}(X_{i}^{\top}Y_{i}+\Pi w_{0})$ with $X_{i}=[x_{1}\;x_{2}\dots\;x_{i}]^{\top}$ and $Y_{i}=[y_{1}\;y_{2}\dots\;y_{i}]^{\top}$ , computing the solution directly requires storing all the previous data as well as recomputing the inverse of the covariance matrix for every new datapoint. RLS bypasses this issue by computing the new solution $w^{(RLS)}_{i}$ of (1) recursively from $w^{(RLS)}_{i-1}$ and $(x_{i},y_{i})$ .

We elaborate the algorithm more formally for a general version of RLS called exponentially-weighted recursive least-squares (EW-RLS) [17]. Consider the following problem:

w^{(RLS)}_{i}=\arg\min_{w}\sum_{k=1}^{i}\lambda^{i-k}(y_{k}-w^{\top}x_{k})^{2}+\lambda^{i}\lVert w-w_{0}\rVert^{2}_{\Pi},

(2)

with a forgetting factor $0<\lambda\leq 1$ . Note that this reduces to the problem of the vanilla RLS (1) when $\lambda=1$ . The exact solution of it is recursively updated as follows:

\begin{cases}w^{(RLS)}_{i}=w^{(RLS)}_{i-1}+\dfrac{P_{i-1}x_{i}}{\lambda^{i}+x_{i}^{\top}P_{i-1}x_{i}}(y_{i}-x_{i}^{\top}w^{(RLS)}_{i-1})\,,\\ P_{i}=P_{i-1}-\dfrac{P_{i-1}x_{i}x_{i}^{\top}P_{i-1}}{\lambda^{i}+x_{i}^{\top}P_{i-1}x_{i}},\end{cases}

(3)

with $w^{(RLS)}_{0}=w_{0}$ and $P_{0}=\Pi^{-1}$ . Here, $P_{i}$ can be alternatively written as $P_{i}=[\Pi+X_{i}^{\top}\Lambda_{i}X_{i}]^{-1}$ where $\Lambda_{i}=\mathrm{diag}(\lambda^{-1},\lambda^{-2},\dots,\lambda^{-i})$ .

II-C Orthogonal Gradient Descent

An algorithm called orthogonal gradient descent (OGD) [7] has been recently proposed in the context of machine learning for a different but related problem. More specifically, [7] considers a continual learning setting in which tasks $\{T_{1},T_{2},\dots\}$ arrive sequentially, and each task consists of a set of datapoints. Continual learning can be understood as a “batch” version of one-pass learning. At a high level, when the $i$ -th task $T_{i}$ arrives, OGD updates the parameter to fit new samples from $T_{i}$ in a way that causes minimal changes to the predictions for previous tasks $\{T_{k}\}_{k=1}^{i-1}$ . The gradient of the model on a datapoint $x_{j}$ with respect to the parameter, $\nabla_{w}f(x_{j};w)$ , is the direction in the parameter space that causes the most change to the prediction on that datapoint. Thus, moving orthogonal to this direction, locally, keeps the prediction unchanged, which is the main idea behind OGD. More formally, the update direction is computed via projecting the current gradient $g$ of the loss onto the subspace orthogonal to $\mathcal{G}\vcentcolon=\text{span}\{\bigcup_{k=1}^{i-1}\{\nabla_{w}f(x;w_{k})\}_{(x,y)\in T_{k}}\}$ :

\tilde{g}=g-\sum_{v\in S}\textup{proj}_{v}(g),

(4)

where $S$ is an orthogonal basis for $\mathcal{G}$ and $\textup{proj}_{v}(u)\vcentcolon=(u^{\top}v/\lVert v\rVert^{2})v=(vv^{\top}/\lVert v\rVert^{2})u$ . The orthogonal basis $S$ is incrementally updated through the Gram-Schmidt procedure.

III Orthogonal Recursive Fitting

In this section, we propose a one-pass learning algorithm called orthogonal recursive fitting (ORFit). The algorithm consists of three main components: (i) orthogonal update of the parameter motivated by OGD; (ii) interpolation (perfect fitting) of new data in a single step; and (iii) efficient use of memory via incremental summary. We describe the details of each component below, starting from (i) and (ii). See Fig. 1 for an illustration.

III-A Orthogonal Recursive Update

We start by considering OGD directly applied to the one-pass learning setting. By treating each task $T_{k}$ in the continual learning setting as consisting of a single datapoint, OGD will run multiple gradient descent steps on the single datapoint. While perfect fitting/interpolation is desired in often highly overparameterized models [5, 4], it will take many iterations for OGD to perfectly fit the datapoint. Instead, inspired by the recent trend in meta-learning, we consider a one-step learning scheme that only runs a single gradient step to interpolate the new datapoint. We first begin with the following result that serves as a building block for our algorithm design; see Appendix -A for a proof.

Lemma 1.

Consider a linear model $f(x;w)=w^{\top}x$ , and let $\tilde{g}$ be the projection defined by (4) of any vector $g\in\mathbb{R}^{p}$ . Then, for any step size $\eta\in\mathbb{R}$ , the new parameter $w^{\prime}=w-\eta\tilde{g}$ preserves the predictions on the previous datapoints, i.e., $f(x;w^{\prime})=f(x;w)$ for all $(x,y)\in\bigcup_{k=1}^{i-1}T_{k}$ .

The main takeaway of Lemma 1 is that the predictions for the previous datapoints do not change when we update the model along the direction $\tilde{g}$ . Hence, we may choose $\eta$ so that the updated parameter $w^{\prime}$ can perfectly fit the new datapoint, say $(x^{\prime},y^{\prime})$ , as $y^{\prime}=w^{\prime\top}x^{\prime}=(w-\eta\tilde{g})^{\top}x^{\prime}$ . Following this principle, a straightforward calculation yields the following update rule for $i\geq 1$ :

\begin{cases}\tilde{g}_{i-1}=\nabla\ell_{i}(w_{i-1})-\sum_{v\in S_{i-1}}\textup{proj}_{v}(\nabla\ell_{i}(w_{i-1}))\,,\\ S_{i}=S_{i-1}\bigcup\{\nabla f_{i}(w_{i-1})-\sum_{v\in S_{i-1}}\textup{proj}_{v}(\nabla f_{i}(w_{i-1}))\}\,,\\ w_{i}=w_{i-1}-\eta_{i-1}\tilde{g}_{i-1},\end{cases}

(5)

where $S_{0}=\phi$ , $w_{0}$ is the initial weight vector, and the optimal step size is chosen as

\eta_{i-1}=\frac{1}{\nabla f_{i}(w_{i-1})^{\top}\tilde{g}_{i-1}}(f_{i}(w_{i-1})-y_{i}).

(6)

For intuition, we note that the optimal step size (6) is typically small for highly overparameterized models; there are many parameter vectors in the vicinity of the current solution that perfectly fit the new datapoint [2, 13, 1].

Refer to caption — Figure 1: An illustration of ORFit in the parameter space for a linear model. The parameter $w_{i-1}$ fits the previous datapoints $\{(x_{k},y_{k})\}_{k=1}^{i-1}$ . The set $S$ (which is updated incrementally) consists of the directions moving towards which causes the most change in the predictions on previous data, and thus, moving orthogonal to $S$ keeps the predictions intact. Given a new datapoint $(x_{i},y_{i})$ , projecting its corresponding gradient $g$ to the orthogonal complement of the subspace spanned by $S$ yields the new update direction $\tilde{g}$ . ORFit finds a new parameter $w_{i}$ along the direction of $-\tilde{g}$ which fits the new datapoint $(x_{i},y_{i})$ within a single step, while still fitting the previous data $\{(x_{k},y_{k})\}_{k=1}^{i-1}$ .

Remark 1 (Computational overhead).

It is important to note that all the quantities appearing in (6) are typically available in the gradient-based optimization setting, and hence, there is no computational overhead for computing the stepsize (6).

Remark 2 (Nonlinear models).

Although the update rule (5) is derived based on linear models, it can also be applied to highly overparameterized nonlinear models such as deep neural networks, as we will discuss in Section VI.

Another distinction between the update rule (5) and OGD lies in the update of the orthogonal basis $S_{i}$ : OGD utilizes “fresher” gradient at the updated parameter $w_{i}$ by $S_{i}=S_{i-1}\bigcup\{\nabla f_{i}(w_{i})-\sum_{v\in S_{i-1}}\textup{proj}_{v}(\nabla f_{i}(w_{i}))\}$ . Although the two bases actually span the same subspace for linear models, it turns out that ORFit in (5) leads to a natural generalization for nonlinear models; see Section VI for details.

Although the update rule (5) does not access previous datapoints, it still requires storing the orthogonal basis $S_{i}$ , whose size grows linearly in the number of visited datapoints. This is not desirable in practice when one needs to train the model on a large dataset. We address this issue next.

III-B Incremental Summary of Memory

In this section, we overcome the aforementioned memory issue by utilizing the structure of the streaming dataset. The main idea is to summarize the orthogonal basis $S$ using an incremental principal component analysis (IPCA) algorithm, known as the sequential Karhunen–Loeve (SKL) algorithm proposed in [12]. IPCA is a memory-efficient variant of PCA that enables sequential update for streaming/large datasets. Let us formally describe how IPCA incrementally summarizes the orthogonal basis.

Consider the orthogonal basis $S=\{v_{1},v_{2},\dots,v_{i}\}$ obtained by (5). Let the singular value decomposition (SVD) of $A=[v_{1}\;v_{2}\,\dots v_{i}]$ be $A=U\Sigma V^{\top}$ . Here, the crucial information of the SVD is the left-singular vectors $\mathrm{col}(U)$ that forms an orthonormal basis for $\mathrm{span}(S)$ . This information can be used to come up with a rank- $m$ approximation of $A$ by using the components corresponding to the top $m$ singular values.

Now suppose that the orthogonal basis $S$ is augmented with a newly projected gradient $v^{\prime}$ as in (5). We want to efficiently update $U$ for the new basis. Letting $u\vcentcolon=v^{\prime}/\lVert v^{\prime}\rVert$ , the new basis matrix can be represented as:

	$\displaystyle\begin{bmatrix}A&v^{\prime}\end{bmatrix}$	$\displaystyle=\begin{bmatrix}U&u\end{bmatrix}\begin{bmatrix}\Sigma&0\\ 0&u^{\top}v^{\prime}\end{bmatrix}\begin{bmatrix}V^{\top}&0\\ 0&1\end{bmatrix}$		(7)
		$\displaystyle=\left(\begin{bmatrix}U&u\end{bmatrix}\tilde{U}\right)\tilde{\Sigma}\left(\tilde{V}^{\top}\begin{bmatrix}V^{\top}&0\\ 0&1\end{bmatrix}\right),$		(8)

where $\tilde{U}\tilde{\Sigma}\tilde{V}^{\top}$ is the SVD of $\begin{bmatrix}\Sigma&0\\ 0&u^{\top}v^{\prime}\end{bmatrix}$ . Then, (8) is the SVD of the new basis matrix. Hence to update $U$ , one can directly use the information from the previous iteration, namely $U$ and $\Sigma$ . The important aspect here is that the update can be made without storing $V$ and without having to recompute the SVD of the new basis matrix. Finally, one can store only the top $m$ singular values in $\Sigma$ and their corresponding components in $U$ . By repeatedly applying this IPCA algorithm in addition to (5), we obtain orthogonal recursive fitting (ORFit). See Algorithm 1 for the detailed procedure.

Algorithm 1 Orthogonal Recursive Fitting (ORFit)

0: Data sequence

\left((x_{k},y_{k})\right)_{k=1}^{K}

, memory limit

m

0: The optimal parameter

w

1: Initialize

U\leftarrow[\;\;],\;\Sigma\leftarrow[\;\;],\;w\leftarrow w_{0}

2: for

i=1,2,3,\dots

g\leftarrow

Stochastic Gradient for

(x_{i},y_{i})

w

\tilde{g}\leftarrow g-\sum_{v\in\textup{col}(U)}\textup{proj}_{v}(g)

v^{\prime}\leftarrow\nabla f_{i}(w)-\sum_{v\in\textup{col}(U)}\textup{proj}_{v}(\nabla f_{i}(w))

6: if

i\leq m

then

U\leftarrow[U\;v^{\prime}]

8: if

i=m

then

U,\Sigma\leftarrow

Compute SVD of

U

10: end if

11: else

12:

u\leftarrow v^{\prime}/\lVert v^{\prime}\rVert

13:

\tilde{U},\Sigma\leftarrow

Compute SVD of

\begin{bmatrix}\Sigma&0\\ 0&u^{\top}v^{\prime}\end{bmatrix}

14:

U\leftarrow[U\;u]\tilde{U}

15:

U,\Sigma\leftarrow

top

m

singular vectors/values in

U,\Sigma

16: end if

17:

\eta\leftarrow(f_{i}(w)-y_{i})/(\nabla f_{i}(w)^{\top}\tilde{g})

18:

w\leftarrow w-\eta\tilde{g}

19: end for

The main advantage of ORFit is its computational/memory efficiency. By only storing the top $m$ components, we can reduce the memory size from $O(ip)$ to $O(mp)$ . Moverover, the additional computational overhead to perform IPCA as well as the total time complexity of ORFit at each step is $O(m^{2}(p+m))$ . Hence, ORFit can reduce both the computation and the memory complexity by appropriately choosing $m$ .

IV Theoretical Results

In this section, we provide the theoretical properties of the proposed method. We begin by discussing a formal connection between the proposed method and RLS.

IV-A Connection to RLS

It turns out ORFit corresponds to an extreme case of the well-known RLS method, as formally described in the following result; see Appendix -B for a proof.

Proposition 2.

Consider a linear overparameterized ( $p\geq K$ ) model. Let $w_{0}$ be the initialization and $m$ be the memory limit for ORFit. Then, at each iteration $i\leq m$ , the update rule of ORFit results in the same parameter vector as the EW-RLS update rule (3) does with $\lambda=0$ , $\Pi=I$ , and initialization $w_{0}$ . In this setting, $P_{i}$ in (3) is the projection matrix onto the subspace orthogonal to $\mathrm{span}\{\nabla f_{k}(w_{k-1})\}_{k=1}^{i}$ .

Remark 3.

The optimization problem of EW-RLS (2) is not well-defined for $\lambda=0$ . That said, the update rule (3) can be still computed for $\lambda=0$ , and ORFit finds the same solution as the limiting case of EW-RLS.

Remark 4.

ORFit in (5) has $O(ip)$ time and memory complexities, compared to those of $O(p^{2})$ for the EW-RLS update rule, where typically $i\ll p$ in the overparameterized setting.

One notable aspect of ORFit is that it bridges the two seemingly distinct algorithms OGD and RLS through Proposition 2. The connection provides us new insights into understanding the behavior of our proposed method, as we discuss next.

IV-B Characterizing the Solution of ORFit

Before presenting our main result, we first provide some intuitions. To understand the behavior of ORFit, let us first recall that the EW-RLS update rule (3) is the solution to the optimization problem (2). In light of Proposition 2, one might be tempted to claim that ORFit in (5) solves (2) with $\lambda=0$ and $\Pi=I$ . However, (2) is not well-defined for $\lambda=0$ .

Nevertheless, intuitively one can regard ORFit as solving (2) in the limit of $\lambda\to 0^{+}$ . Then, for sufficiently small $\lambda>0$ , the first term in the objective of (2) outweighs the second term, which suggests that, in the overparameterized case, the solution should enforce $y_{k}\approx w^{\top}x_{k}$ for all $k=1,2,\dots,i$ , while minimizing $\lVert w-w_{0}\rVert^{2}$ . In the following theorem, we formalize this intuition and characterize the solution of ORFit; see Appendix -C for a proof.

Theorem 3.

Consider a linear overparameterized ( $p\geq K$ ) model. Let $w_{0}$ be the initialization and $m$ be the memory limit. Then, at each iteration $i\leq m$ , the parameter vector obtained by ORFit is the solution of the following optimization problem:

	$\displaystyle w_{i}=\arg\min_{w}$	$\displaystyle\lVert w-w_{o}\rVert$		(9)
	s.t.	$\displaystyle y_{k}=w^{\top}x_{k}\;\;k=1,2,\dots,i.$		(9)

It is known that, for a linear overparameterized model, in the standard multi-pass learning setting over the dataset $\{(x_{k},y_{k})\}_{k=1}^{i}$ , as the number of iterations goes to infinity, the iterates of stochastic gradient descent (SGD) initialized at $w_{0}$ with a sufficiently small step size converge to the solution of problem (9) (see, e.g., Proposition 1 in [3]). Thus, ORFit with just an epoch of training finds the solution that SGD in the limit of infinite number of iterations converges to.

Corollary 4.

Consider a linear overparameterized ( $p\geq K$ ) model. Let $w_{0}$ be the initialization and $m$ be the memory limit. The parameter vector obtained by ORFit at each iteration $i\leq m$ is equal to what SGD would converge to by iterating over the dataset $\{(x_{k},y_{k})\}_{k=1}^{i}$ with a sufficiently small step size in the limit of infinite number of iterations.

V Experiments

In this section, we demonstrate the effectiveness of our proposed methods in the one-pass learning setting and corroborate the theoretical results presented in Section IV. We performed experiments for linear models in the Rotated MNIST setup described in [18]. In this setup, the inputs are rotated MNIST images for digit ‘ $2$ ’, whose size is $28\times 28$ . Our goal is to estimate the rotated angles in $[0,\pi]$ . For the training dataset, the angles are uniformly sampled from $[0,\pi]$ , and to introduce distribution shift, we order the dataset so that an image rotated with a smaller angle arrives earlier.

V-A Learning with Restrictions on Memory Size

In this experiment, we demonstrate the effectiveness of our proposed method in the memory-restricted setting, in which 100 datapoints are sequentially learned while we are only allowed to store up to $10$ basis vectors. For comparison, we consider the following baselines: (1) Greedy scheme: outputs only the label of the most recently learned datapoint, regardless of the input, as an extreme case of forgetting. (2) One-Step SGD: employs the one-step learning scheme with the step size in (6) but with empty orthogonal basis. (3) ORFit-random: ORFit that keeps $10$ randomly chosen basis vectors after each iteration instead of performing IPCA. (4) ORFit-latest: ORFit that keeps the latest $10$ basis vectors after each iteration instead of performing IPCA.

We first compared the test errors of the proposed method and the baselines. The test errors are measured with the test dataset that consists of $1032$ images of digit ‘ $2$ ’ in the MNIST test set of which each is rotated with a random angle in $[0,\pi]$ . As shown in Fig. 2(a), ORFit outperforms other baselines after a sufficient number of training steps. Notably, ORFit results in lower variances over the $10$ independent runs with different initialization.

Next, we compared the degrees of forgetting for the proposed method and the baselines by keeping track of the prediction errors of a training datapoint throughout the training. As shown in Fig. 2(b), ORFit successfully keeps the prediction error low throughout the training. This is in stark contrast with other methods for which the prediction error quickly increases as other datapoints are learned.

V-B Learning without Memory Restriction

In this experiment, we follow the setup in Section V-A except that this time, we do not impose any memory restrictions. For comparison, we run vanilla SGD with a fixed step size $10^{-5}$ . Vanilla SGD makes multiple passes over the entire dataset for $1000$ epochs.

Reported in Fig. 3(a) are the training and test errors of vanilla SGD, One-Step SGD, and ORFit. Note that both SGD and ORFit learn the training dataset perfectly with almost zero training error. Moreover, after the training is finished, SGD and ORFit achieve similar test errors, corroborating Theorem 3.

In addition, Fig. 3(b) shows the prediction error of the $11$ -th training datapoint throughout the training (analogous to Fig. 2(b)). Note that ORFit perfectly preserves the prediction error of the sample, while One-Step SGD quickly deteriorates it. This result demonstrates the effectiveness of the orthogonal update in ORFit for one-pass learning.

VI Extension to Deep Learning

In this section, we extend our discussion to nonlinear models under the neural tangent kernel (NTK) regime [9, 11]. The main idea behind the NTK regime is that when the width of the neural network is chosen large enough, the model is well-approximated by its first-order approximation around the initialization:

f_{k}(w)\approx f_{k}(w_{0})+\nabla f_{k}(w_{0})^{\top}(w-w_{0}).

(10)

In particular, Lee et al. [11] discussed sufficient conditions (in terms of the width of the network) for which this approximation is valid; see their Theorem 2.1 for details. Under the linearized regime, we consider expanding the model around the parameter learned at the previous step as

f_{k}(w)\approx f_{k}(w_{k-1})+\nabla f_{k}(w_{k-1})^{\top}(w-w_{k-1})=\vcentcolon f_{k|k-1}(w).

(11)

Throughout, we denote by $f_{k|k-1}(w)$ the RHS of (11).

A notable feature of ORFit is that it is applicable to any differentiable nonlinear model $f$ . This is in contrast to the EW-RLS algorithm (3), which is only applicable to linear models $f(x;w)=w^{\top}x$ . Based on this observation, we employ the step size calculated in (6) to fit the new data for a nonlinear model under the NTK regime (11). More specifically, we consider an analog of the EW-RLS (2) for nonlinear models:

w_{i}=\arg\min_{w}\sum_{k=1}^{i}\lambda^{i-k}(y_{k}-f_{k|k-1}(w))^{2}+\lambda^{i}\lVert w-w_{0}\rVert^{2}_{\Pi},

(12)

given a forgetting factor $0<\lambda\leq 1$ and $\Pi\succ 0$ . Then similarly as in (3), one can write out the solution of (12) in a recursive manner, while treating $(\nabla f_{k}(w_{k-1}),y_{k}-f_{k}(w_{k-1})+\nabla f_{k}(w_{k-1})^{\top}w_{k-1})$ as the streamed data at the $k$ -th step:

\begin{cases}w_{i}=w_{i-1}+\dfrac{P_{i-1}\nabla f_{i}(w_{i-1})(y_{i}-f_{i}(w_{i-1}))}{\lambda^{i}+\nabla f_{i}(w_{i-1})^{\top}P_{i-1}\nabla f_{i}(w_{i-1})}\,,\\ P_{i}=P_{i-1}-\dfrac{P_{i-1}\nabla f_{i}(w_{i-1})\nabla f_{i}(w_{i-1})^{\top}P_{i-1}}{\lambda^{i}+\nabla f_{i}(w_{i-1})^{\top}P_{i-1}\nabla f_{i}(w_{i-1})}\,,\end{cases}

(13)

with initialization $w_{0}$ and $P_{0}=\Pi^{-1}$ . We call this generalized update rule NTK-RLS. Following a similar argument as in Section IV, we obtain the following result; see Appendix -D for a proof.

Theorem 5.

Consider a (nonlinear) overparameterized model with $p\geq K$ . Let $w_{0}$ be the initialization and $m$ be the memory limit for ORFit. Then, at each iteration $i\leq m$ , the update rule of ORFit results in the same parameter vector as the NTK-RLS update rule (13) does with $\lambda=0$ , $\Pi=I$ , and initialization $w_{0}$ . Moreover, at each iteration $i\leq m$ , the parameter vector obtained by ORFit is the solution of the following optimization problem:

	$\displaystyle w_{i}=\arg\min_{w}$	$\displaystyle\lVert w-w_{o}\rVert$		(14)
	s.t.	$\displaystyle y_{k}=f_{k\|k-1}(w)\;\;k=1,2,\dots,i.$		(14)

Note that, under the NTK regime, for an overparameterized model, we have $f_{k|k-1}(w)\approx f(x_{k};w)$ , and the solution obtained by ORFit in one pass is the same as that of SGD in the standard multi-pass setting (as characterized in, e.g., [3]). Compared to NTK-RLS (13), ORFit in (5) greatly reduces both time and memory complexities from $O(p^{2})$ to $O(ip)$ . This is particularly important for $p\gg i$ , common to the overparameterized settings (for instance, $p\approx 11\textup{M}$ in ResNet-18, commonly used for the CIFAR-10 dataset, consisting of 50K samples).

VII Conclusion

In this paper, we proposed an algorithm called Orthognoal Recursive Fitting (ORFit) to tackle one-pass learning. We discussed the connection between the proposed method and orthogonal gradient descent (OGD), a practical algorithm in continual learning, as well as the recursive least-squares (RLS), a well-known method from adaptive filtering. Through this connection, we explained the advantages of the proposed method and theoretically characterized its behavior. Our theoretical findings reveal that ORFit attains the same solution as SGD, with much lower time and memory complexities. We validated our method and its theoretical properties through several experiments and discussed its extensions to nonlinear settings, relevant for deep learning.

We conclude with several interesting future directions. First, although ORFit exhibits outstanding performance in the memory-limited setting, some forgetting is still happening. It is caused by the information loss from summarizing the orthogonal basis via IPCA. Theoretically characterizing how much ORFit forgets would be of great importance to understand the practicality of the algorithm. Moreover, one can also come up with other methods to summarize the memory such as matrix sketching and compare them with ORFit. Next, we remark that RLS is a special case of the Kalman filter applied on a static system. Based on this connection, another interesting avenue is to build on ORFit and devise efficient learning/estimation methods for dynamic systems. Lastly, exploring the practicality of ORFit in deep learning based on a more comprehensive set of experiments would be of great interest.

References

[1] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.
[2] N. Azizan and B. Hassibi. Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. In International Conference on Learning Representations, 2018.
[3] N. Azizan, S. Lale, and B. Hassibi. Stochastic mirror descent on overparameterized nonlinear models. IEEE Transactions on Neural Networks and Learning Systems, 2021.
[4] P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
[5] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
[6] M. Delange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis, G. Slabaugh, and T. Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
[7] M. Farajtabar, N. Azizan, A. Mott, and A. Li. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pages 3762–3773. PMLR, 2020.
[8] H. Hu, A. Li, D. Calandriello, and D. Gorur. One pass ImageNet. arXiv preprint arXiv:2111.01956, 2021.
[9] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
[10] R. Kemker, M. McClure, A. Abitino, T. Hayes, and C. Kanan. Measuring catastrophic forgetting in neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
[11] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32, 2019.
[12] A. Levey and M. Lindenbaum. Sequential Karhunen-Loeve basis extraction and its application to images. IEEE Transactions on Image Processing, 9(8):1371–1374, 2000.
[13] Y. Li and Y. Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems, 31, 2018.
[14] D. Nallaperuma, R. Nawaratne, T. Bandaragoda, A. Adikari, S. Nguyen, T. Kempitiya, D. De Silva, D. Alahakoon, and D. Pothuhera. Online incremental machine learning platform for big data-driven smart traffic management. IEEE Transactions on Intelligent Transportation Systems, 20(12):4679–4690, 2019.
[15] P. Rai, H. Daumé, and S. Venkatasubramanian. Streamed learning: one-pass svms. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1211–1216, 2009.
[16] D. Sahoo, Q. Pham, J. Lu, and S. C. Hoi. Online deep learning: learning deep neural networks on the fly. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2660–2666, 2018.
[17] A. H. Sayed. Fundamentals of adaptive filtering. John Wiley & Sons, 2003.
[18] A. Sharma, N. Azizan, and M. Pavone. Sketching curvature for efficient out-of-distribution detection for deep neural networks. In Uncertainty in Artificial Intelligence, pages 1958–1967. PMLR, 2021.
[19] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu. Large scale incremental learning. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 374–382. IEEE, 2019.
[20] T. Yu, C. Zhang, Y. Wang, M. Ma, and Q. Song. Recursive least squares for training and pruning convolutional neural networks. arXiv preprint arXiv:2201.04813, 2022.
[21] C. Zhang, Q. Song, H. Zhou, Y. Ou, H. Deng, and L. T. Yang. Revisiting recursive least squares for training deep neural networks. arXiv preprint arXiv:2109.03220, 2021.

-A Proof of Lemma 1

Consider $(x,y)\in T_{k}$ for any $1\leq k\leq i-1$ .

f(x;w^{\prime})=x^{\top}w^{\prime}=x^{\top}(w-\eta\tilde{g})=f(x;w)-\eta x^{\top}\tilde{g}.

(15)

Since the orthogonal basis $S$ spans $\nabla_{w}f(x;w_{k})=x$ , $x$ can be represented as $x=\sum_{u\in S}\textup{proj}_{u}(x)$ . Then,

$\displaystyle x^{\top}\tilde{g}$	$\displaystyle=(\sum_{u\in S}\textup{proj}_{u}(x))^{\top}(g-\sum_{v\in S}\textup{proj}_{v}(g))$	(16)
	$\displaystyle=(\sum_{u\in S}\dfrac{uu^{\top}}{\lVert u\rVert^{2}}x)^{\top}(I-\sum_{v\in S}\dfrac{vv^{\top}}{\lVert v\rVert^{2}})g$	(17)
	$\displaystyle=x^{\top}(\sum_{u\in S}\dfrac{uu^{\top}}{\lVert u\rVert^{2}}-\sum_{u\in S}\sum_{v\in S}\dfrac{uu^{\top}vv^{\top}}{\lVert u\rVert^{2}\lVert v\rVert^{2}})g$	(18)
	$\displaystyle=x^{\top}(\sum_{u\in S}\dfrac{uu^{\top}}{\lVert u\rVert^{2}}-\sum_{u\in S}\dfrac{uu^{\top}}{\lVert u\rVert^{2}})g=0$	(19)

as $u^{\top}v=0$ if $u\neq v$ . Thus, $f(x;w^{\prime})=f(x;w)$ . ∎

-B Proof of Proposition 2

In the update rule of ORFit (5), the new basis vector can be represented as

	$\displaystyle v^{\prime}$	$\displaystyle=\nabla f_{i}(w_{i-1})-\sum_{v\in S_{i-1}}\textup{proj}_{v}(\nabla f_{i}(w_{i-1}))$		(20)
		$\displaystyle=(I-\sum_{v\in S_{i-1}}\frac{vv^{\top}}{\lVert v\rVert^{2}})\nabla f_{i}(w_{i-1})=Q_{i-1}x_{i},$		(21)

where $Q_{i-1}\vcentcolon=I-\sum_{v\in S_{i-1}}vv^{\top}/\lVert v\rVert^{2}$ . Similarly,

\tilde{g}_{i-1}=Q_{i-1}\nabla\ell_{i}(w_{i-1})=Q_{i-1}\ell^{\prime}(y_{i},f_{i}(w_{i-1}))x_{i},

(22)

where $\ell^{\prime}(\cdot,\cdot)$ denotes the derivative of $\ell(\cdot,\cdot)$ w.r.t. its second argument.

Putting (22) into the parameter update in (5),

w_{i}=w_{i-1}+\dfrac{Q_{i-1}x_{i}}{x_{i}^{\top}Q_{i-1}x_{i}}(y_{i}-x_{i}^{\top}w_{i-1}).

(23)

Also, note that $Q_{i-1}$ is symmetric and satisfies $Q_{i-1}^{2}=Q_{i-1}$ , and it corresponds to the projection matrix to the subspace orthogonal to $\mathrm{span}\{S_{i-1}\}=\mathrm{span}\{\nabla f_{k}(w_{k-1})\}_{k=1}^{i-1}$ . With these properties, $Q_{i}$ can be expressed in a recursive form as

Q_{i}=Q_{i-1}-\frac{v^{\prime}v^{\prime\top}}{\lVert v^{\prime}\rVert^{2}}=Q_{i-1}-\frac{Q_{i-1}x_{i}x_{i}^{\top}Q_{i-1}^{\top}}{x_{i}^{\top}Q_{i-1}x_{i}}.

(24)

Then, (23) and (24) are equivalent to the EW-RLS update rule (3) with $\lambda=0$ , $\Pi=I$ , and initialization $w_{0}$ , while $Q_{0}=I=P_{0}$ . ∎

-C Proof of Theorem 3

We prove (9) using KKT conditions. First, we transform the RHS into an equivalent convex problem:

\arg\min_{w}\frac{1}{2}\lVert w-w_{o}\rVert^{2}\;\textup{s.t.}\;y_{k}=w^{\top}x_{k}\;\;k=1,\dots,i.

(25)

Let $y\vcentcolon=[y_{1}\;\dots\;y_{i}]^{\top}\in\mathbb{R}^{i}$ and $X\vcentcolon=[x_{1}\;\dots\;x_{i}]^{\top}\in\mathbb{R}^{i\times d}$ . Then, the Lagrangian of (25) is then represented as

\mathcal{L}(w,\lambda)=\frac{1}{2}\lVert w-w_{o}\rVert^{2}+\lambda^{\top}(y-Xw).

(26)

Since (25) is a convex problem, any primal and dual variables of (26) satisfying KKT conditions are primal and dual optimal. The KKT conditions for a pair of primal and dual variables $(w^{*},\lambda^{*})$ are

\begin{cases}w^{*}-w_{0}-X^{\top}\lambda^{*}=0\\ y-Xw^{*}=0.\end{cases}

(27)

Now, we show that $w_{i}$ from the ORFit update rule (5) satisfies (27) with some $\lambda_{i}$ so that $w_{i}$ is the primal optimal solution of (25) which shows (9). From (5), $w_{i}=w_{0}-\sum_{k=0}^{i-1}\eta_{k}\tilde{g}_{k}$ . Then for each $k$ ,

	$\displaystyle\tilde{g}_{k}$	$\displaystyle=\nabla\ell_{k+1}(w_{k})-\sum_{v\in S_{k}}\textup{proj}_{v}(\nabla\ell_{k+1}(w_{k}))$		(28)
		$\displaystyle=\ell^{\prime}(y_{k+1},f_{k+1}(w_{k}))x_{k+1}-\sum_{j=1}^{k}c_{k,j}x_{j}=X^{\top}d_{k},$		(29)

where $d_{k}\vcentcolon=[-c_{k,1},\dots,-c_{k,k},\ell^{\prime}(y_{k+1},f_{k+1}(w_{k})),0,\dots,0]^{\top}\in\mathbb{R}^{i}$ . Such $c_{k,j}\in\mathbb{R}$ exists since $\mathrm{span}\{S_{k}\}=\mathrm{span}\{\nabla f_{j}(w_{j-1})\}_{j=1}^{k}=\mathrm{span}\{x_{j}\}_{j=1}^{k}$ . Then,

w_{i}=w_{0}-\sum_{k=0}^{i-1}\eta_{k}X^{\top}d_{k}=w_{0}-X^{\top}\sum_{k=0}^{i-1}\eta_{k}d_{k}.

(30)

By letting $\lambda_{i}\vcentcolon=-\sum_{k=0}^{i-1}\eta_{k}d_{k}$ , $(w_{i},\lambda_{i})$ satisfies the first KKT condition. Also, with the choice of (6) with Lemma 1, $w_{i}$ fits all the data $\{(x_{k},y_{k})\}_{k=1}^{i}$ , which implies the second KKT condition. Thus, $w_{i}$ is the solution of (25) so that (9) holds. ∎

-D Proof of Theorem 5

We first prove the connection between ORFit and NTK-RLS. As in the proof of Prop. 2, the new basis vector and the projected gradient of the loss in the update rule of ORFit (5) can be represented as

	$\displaystyle v^{\prime}$	$\displaystyle=Q_{i-1}\nabla f_{i}(w_{i-1}),$		(31)
	$\displaystyle\tilde{g}_{i-1}$	$\displaystyle=Q_{i-1}\ell^{\prime}(y_{i},f_{i}(w_{i-1}))\nabla f_{i}(w_{i-1}),$		(32)

where $Q_{i-1}\vcentcolon=I-\sum_{v\in S_{i-1}}vv^{\top}/\lVert v\rVert^{2}$ . Putting (32) into the parameter update in (5),

w_{i}=w_{i-1}+\dfrac{Q_{i-1}\nabla f_{i}(w_{i-1})(y_{i}-f_{i}(w_{i-1}))}{\nabla f_{i}(w_{i-1})^{\top}Q_{i-1}\nabla f_{i}(w_{i-1})}.

(33)

Since $Q_{i-1}$ is symmetric and satisfies $Q_{i-1}^{2}=Q_{i-1}$ , with (31), $Q_{i}$ can be expressed in a recursive form as

Q_{i}=Q_{i-1}-\frac{Q_{i-1}\nabla f_{i}(w_{i-1})\nabla f_{i}(w_{i-1})^{\top}Q_{i-1}^{\top}}{\nabla f_{i}(w_{i-1})^{\top}Q_{i-1}\nabla f_{i}(w_{i-1})}.

(34)

Then, (33) and (34) are equivalent to the NTK-RLS update rule (13) with $\lambda=0$ , $\Pi=I$ , and initialization $w_{0}$ , while $Q_{0}=I=P_{0}$ .

We then prove (14) using KKT conditions as in Thm. 3. First, we transform the RHS into an equivalent convex problem:

\arg\min_{w}\frac{1}{2}\lVert w-w_{o}\rVert^{2}\;\textup{s.t.}\;y_{k}=f_{k|k-1}(w)\;\;k=1,\dots,i.

(35)

Let $\tilde{y}_{k}\vcentcolon=y_{k}-f_{k}(w_{k-1})+\nabla f_{k}(w_{k-1})^{\top}w_{k-1}$ , $\tilde{y}\vcentcolon=[\tilde{y}_{1}\;\dots\;\tilde{y}_{i}]^{\top}\in\mathbb{R}^{i}$ , and $\tilde{X}\vcentcolon=[\nabla f_{1}(w_{0})\;\dots\;\nabla f_{i}(w_{i-1})]^{\top}\in\mathbb{R}^{i\times d}$ . Then, the Lagrangian of (35) is then represented as

\mathcal{L}(w,\lambda)=\frac{1}{2}\lVert w-w_{o}\rVert^{2}+\lambda^{\top}(\tilde{y}-\tilde{X}w).

(36)

Since (35) is a convex problem, any primal and dual variables of (36) satisfying KKT conditions are primal and dual optimal. The KKT conditions for a pair of primal and dual variables $(w^{*},\lambda^{*})$ are

\begin{cases}w^{*}-w_{0}-\tilde{X}^{\top}\lambda^{*}=0\\ \tilde{y}-\tilde{X}w^{*}=0\,.\end{cases}

(37)

Now, we show that $w_{i}$ from the ORFit update rule (5) satisfies (37) with some $\lambda_{i}$ so that $w_{i}$ is the primal optimal solution of (35) which in turn shows (14). From (5), $w_{i}=w_{0}-\sum_{k=0}^{i-1}\eta_{k}\tilde{g}_{k}$ . Then for each $k$ ,

$\displaystyle\tilde{g}_{k}$	$\displaystyle=\nabla\ell_{k+1}(w_{k})-\sum_{v\in S_{k}}\textup{proj}_{v}(\nabla\ell_{k+1}(w_{k}))$	(38)
	$\displaystyle=\ell^{\prime}(y_{k+1},f_{k+1}(w_{k}))\nabla f_{k+1}(w_{k})-\sum_{j=1}^{k}c_{k,j}\nabla f_{j}(w_{j-1})$	(39)
	$\displaystyle=\tilde{X}^{\top}d_{k},$	(40)

w_{i}=w_{0}-\sum_{k=0}^{i-1}\eta_{k}\tilde{X}^{\top}d_{k}=w_{0}-\tilde{X}^{\top}\sum_{k=0}^{i-1}\eta_{k}d_{k}.

(41)

By letting $\lambda_{i}\vcentcolon=-\sum_{k=0}^{i-1}\eta_{k}d_{k}$ , $(w_{i},\lambda_{i})$ satisfies the first KKT condition.

For the second KKT condition, we observe that for $k\leq i$ ,

$\displaystyle f_{k\|k-1}(w_{i})$	$\displaystyle=f_{k}(w_{k-1})+\nabla f_{k}(w_{k-1})^{\top}(w_{i}-w_{k-1})$	(42)
	$\displaystyle=f_{k}(w_{k-1})-\nabla f_{k}(w_{k-1})^{\top}\sum_{j=k-1}^{i-1}\eta_{j}\tilde{g}_{j}$	(43)
	$\displaystyle=f_{k}(w_{k-1})-\eta_{k-1}\nabla f_{k}(w_{k-1})^{\top}\tilde{g}_{k-1}$	(44)
	$\displaystyle=f_{k}(w_{k-1})-(f_{k}(w_{k-1})-y_{k})=y_{k},$	(45)

where (44) is satisfied as $\nabla f_{k}(w_{k-1})\perp\tilde{g}_{j}$ for $j\geq k$ , and the step-size from (6) results in (45). Then, $w_{i}$ satisfies the second KKT condition and hence is the solution of (35). ∎