
Two-way Spectrum Pursuit for $\boldsymbol{CUR}$ Decomposition and Its Application in Joint Column/Row Subset Selection

Abstract

The problem of simultaneous column and row subset selection is addressed in this paper. The column space and row space of a matrix are spanned by its left and right singular vectors, respectively. However, the singular vectors are not, in general, actual columns/rows of the matrix. In this paper, an iterative approach is proposed to capture the most important structural information of the columns/rows by selecting a subset of actual columns/rows. This algorithm, referred to as two-way spectrum pursuit (TWSP), provides an accurate solution for the CUR matrix decomposition. TWSP is applicable in a wide range of applications since its complexity is linear w.r.t. the number of original columns/rows. We demonstrate the application of TWSP to joint channel and sensor selection in cognitive radio networks, detection of informative users and contents, and efficient supervised data reduction.

Index Terms:
Column and row subset selection, CUR matrix decomposition, spectrum pursuit

1 Introduction

Matrix factorization provides a concise representation of data. Despite the desirable uniqueness conditions and computational simplicity of the well-known singular value decomposition (SVD), it comes with some fundamental shortcomings. The intrinsic structure of the data is not inherited by the singular components. Moreover, SVD imposes orthogonality on the components, which is unrelated to the underlying structure of the original data. This enforced structure makes the bases, a.k.a. singular vectors, hard to interpret [9]. On the other hand, it has been shown that borrowing bases from the actual samples of a dataset provides a robust representation, which can be employed in interesting applications where sampling is the key factor [20]. This problem is studied in the literature as the column subset selection problem (CSSP) [5, 2] and the CUR decomposition [16, 3]. A general problem covering both CSSP and CUR can be written in the following form:

(π•Šc,π•Šr)=argminπ•Šc,π•Šrβ€‹β€–π‘Ώβˆ’Ο€π•Šrr​(Ο€π•Šcc​(𝑿))β€–F2,\small(\mathbb{S}^{c},\mathbb{S}^{r})=\underset{\mathbb{S}^{c},\mathbb{S}^{r}}{\text{argmin}}\|{\boldsymbol{X}}-\pi_{\mathbb{S}^{r}}^{r}(\pi_{\mathbb{S}^{c}}^{c}({\boldsymbol{X}}))\|_{F}^{2}, (1)

where, π‘Ώβˆˆβ„NΓ—M\!\small{\boldsymbol{X}}\!\!\in\!\!\mathbb{R}^{N\times M}\! is the data matrix containing MM data points in an NN-dimensional space. Here, Ο€π•Šcc(.)\pi_{\mathbb{S}^{c}}^{c}(.) and Ο€π•Šrr(.)\pi_{\mathbb{S}^{r}}^{r}(.) indicate column space projection and row space projection, respectively. These operators project all columns (rows) to a low-dimensional subspace spanned by selected columns (rows) of matrix 𝑿\boldsymbol{X} indexed by the set π•Šc\mathbb{S}^{c} (π•Šr\mathbb{S}^{r}). The chronological order in applying Ο€π•Šcc(.)\pi_{\mathbb{S}^{c}}^{c}(.) and Ο€π•Šrr(.)\pi_{\mathbb{S}^{r}}^{r}(.) does not affect the problem since these operators are linear. Moreover, substituting Ο€π•Šrr\pi_{\mathbb{S}^{r}}^{r} with the identity projection simplifies the problem to CSSP. Fig. 1 illustrates the structure of CUR matrix decomposition as a self-representative approach.

A versatile metric for evaluating the performance of a data subset selection algorithm is the approximation error resulting from projecting the entire data onto the span of the selected rows/columns. How close an algorithm comes to the optimal selection is determined by comparing its approximation error to the best low-rank approximation error given by the spectral decomposition. Recently, we proposed a fast and accurate algorithm for solving CSSP, called spectrum pursuit (SP) [12].

Fig. 1: Two columns and three rows of matrix $\boldsymbol{X}$ are selected and organized in matrices $\boldsymbol{C}$ and $\boldsymbol{R}$. The outer product of each pair of a selected column and a selected row constructs a rank-1 matrix, i.e., $\boldsymbol{c}_{i}\boldsymbol{r}_{j}^{T}$. The contribution of each pair is reflected in the variable $u_{ij}$. The core matrix $\boldsymbol{U}$ is the collection of all $u_{ij}$'s. The goal is to minimize $\|\boldsymbol{X}-\hat{\boldsymbol{X}}\|_{F}$, where $\hat{\boldsymbol{X}}=\boldsymbol{CUR}$.

Inspired by SP, we propose a new algorithm to address the more general case of the CUR matrix decomposition. The main contributions of our paper are summarized as follows:

• A novel algorithm for CUR decomposition, referred to as two-way spectrum pursuit (TWSP), is proposed. TWSP provides an accurate solution for CUR decomposition.

• TWSP enjoys linear complexity w.r.t. the number of columns and the number of rows of a matrix.

• TWSP is a parameter-free algorithm that only requires the desired numbers of columns and rows to select. Thus, TWSP does not require any parameter fine-tuning.

• The TWSP algorithm is evaluated in a set of synthetic and real experiments.

• The role of the core matrix $\boldsymbol{U}$ in the $\boldsymbol{CUR}$ decomposition is illustrated, showing the connection between the selected columns and rows. Based on an analysis of $\boldsymbol{U}$, an interesting application to joint sensor/channel selection is presented.

Notations: Throughout this paper, vectors and matrices are denoted by bold lowercase and bold uppercase letters, respectively. Moreover, sets are indicated by blackboard bold characters. The $m^{\text{th}}$ column of matrix $\boldsymbol{X}$ is denoted by $\boldsymbol{X}(:,m)$, and the $n^{\text{th}}$ row is denoted by $\boldsymbol{X}(n,:)$.

The rest of the paper is organized as follows. Sec. 2 studies the column subset selection problem as the conventional one-way selection scheme. In Sec. 3, our main contribution, joint column and row subset selection, is presented. Experimental results are reported in Sec. 4.

2 Column Subset Selection Problem

Selecting the most diverse subset of data in an optimal sense has been studied extensively [5, 14, 13]. However, these methods do not guarantee that the unselected columns are well represented by the selected ones. Further, such algorithms select outliers with high probability due to their diversity [12]. A more effective approach is to select representatives that approximate the rest of the data accurately [7], as defined by a special case of (1). This is an NP-hard problem [18], and there have been several efforts to solve it [1, 5, 17, 21]. Approaches based on convex relaxation [7, 6] are not computationally feasible for large datasets, since their complexity is of order $O(M^{3})$, where $M$ is the number of original columns. Recently, we proposed an algorithm with linear complexity for solving CSSP, called spectrum pursuit (SP) [12]. The SP algorithm finds $K$ columns of $\boldsymbol{X}$ such that their span is close to that of the best rank-$K$ approximation of $\boldsymbol{X}$. SP is an iterative approach in which, at each iteration, one selection is optimized so that the ensemble of selected samples describes the whole dataset more accurately. SP finds representatives that span the column space accurately via consecutive rank-$1$ approximations, as theoretically analyzed in [11]. In the present paper, we extend the SP algorithm to select columns and rows jointly such that their outer product represents the whole matrix accurately. A naive approach is to apply SP on the matrix of interest to select a subset of columns and on its transpose to select a subset of rows. However, this approach is not efficient, and we will compare it with our proposed approach, which is optimized through a joint representation of the selected columns and rows.

3 Two-way Spectrum Pursuit

The joint column/row subset selection problem introduced in (1) can be written as a CUR decomposition in which the factor matrices must be drawn from actual columns/rows of the original matrix:

(π‘ͺ,𝑼,𝑹)=argminπ‘ͺ,𝑼,π‘Ήβ€‹β€–π‘Ώβˆ’π‘ͺ​𝑼​𝑹‖F2,\displaystyle\small(\boldsymbol{C},\boldsymbol{U},\boldsymbol{R})=\underset{\boldsymbol{C},\boldsymbol{U},\boldsymbol{R}}{\text{argmin}}\|{\boldsymbol{X}}-\boldsymbol{C}\boldsymbol{U}\boldsymbol{R}\|_{F}^{2}, (2)
s.t.​𝒄kβˆˆπ•c​and​𝒓kβˆˆπ•r.\displaystyle\text{s.t.}\;\;\boldsymbol{c}_{k}\in\mathbb{X}_{c}\;\text{and}\;\boldsymbol{r}_{k}\in\mathbb{X}_{r}.

In this problem, $\mathbb{X}_{c}$ and $\mathbb{X}_{r}$ denote the sets of normalized columns and rows of matrix $\boldsymbol{X}$, respectively. Here, $\boldsymbol{c}_{k}$ and $\boldsymbol{r}_{k}^{T}$ denote the $k^{\text{th}}$ column of $\boldsymbol{C}$ and the $k^{\text{th}}$ row of $\boldsymbol{R}$, respectively. In other words, $\mathbb{X}_{c}=\{\boldsymbol{X}(:,m)/\|\boldsymbol{X}(:,m)\|\}$ over all columns and $\mathbb{X}_{r}=\{\boldsymbol{X}(n,:)/\|\boldsymbol{X}(n,:)\|\}$ over all rows. Note that replacing the constraints in (2) with orthogonality constraints on the $\boldsymbol{c}_{k}$'s and $\boldsymbol{r}_{k}$'s yields the truncated singular value decomposition (SVD) with the $K$ most significant components. In this case, $\boldsymbol{C}$ and $\boldsymbol{R}$ contain the first $K$ left and right singular vectors of $\boldsymbol{X}$, respectively, and the core matrix is diagonal with the singular values as its diagonal entries. The constraints in (2), however, turn the problem into a joint row and column subset selection problem rather than a matrix low-rank approximation problem.
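The truncated-SVD limit of (2) is easy to verify numerically; the following sketch (our illustration, not part of the proposed method) checks that with orthonormal factors the core matrix $\boldsymbol{C}^{T}\boldsymbol{X}\boldsymbol{R}^{T}$ is diagonal, with the singular values on its diagonal.

```python
import numpy as np

X = np.random.randn(40, 60)
K = 5
Uk, s, Vtk = np.linalg.svd(X, full_matrices=False)
C, R = Uk[:, :K], Vtk[:K, :]   # top-K left / right singular vectors
U = C.T @ X @ R.T              # core matrix under orthogonality constraints
assert np.allclose(U, np.diag(s[:K]), atol=1e-10)
```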

To solve this complicated problem, we split it into two consecutive subproblems, each optimizing the $k^{\text{th}}$ selected column/row. Our optimization approach is alternating: a random subset of columns and rows is picked, and then, at each iteration, one column or one row is considered for replacement by a more efficient one. Since scaling a vector does not change its span, without loss of generality we assume that the column or row being optimized lies on the unit sphere. At each iteration, a rank-$1$ component, characterized by $\boldsymbol{c}\boldsymbol{g}^{T}$ or $\boldsymbol{h}\boldsymbol{r}^{T}$, is optimized:

$$\underset{\boldsymbol{c},\,\boldsymbol{W},\,\boldsymbol{g}}{\text{argmin}}\;\|\underbrace{\boldsymbol{X}-\boldsymbol{C}_{k}\boldsymbol{W}\boldsymbol{R}}_{\boldsymbol{E}^{c}}-\boldsymbol{c}\boldsymbol{g}^{T}\|_{F}^{2}\;\;\text{s.t.}\;\|\boldsymbol{c}\|_{2}=1 \qquad (3a)$$
$$\underset{\boldsymbol{h},\,\boldsymbol{Y},\,\boldsymbol{r}}{\text{argmin}}\;\|\underbrace{\boldsymbol{X}-\boldsymbol{C}\boldsymbol{Y}\boldsymbol{R}_{k}}_{\boldsymbol{E}^{r}}-\boldsymbol{h}\boldsymbol{r}^{T}\|_{F}^{2}\;\;\text{s.t.}\;\|\boldsymbol{r}\|_{2}=1 \qquad (3b)$$

Matrix π‘ͺk\boldsymbol{C}_{k} is the set of selected columns except the kthk^{\text{th}} one and 𝑹k\boldsymbol{R}_{k} is the set of selected rows except the kthk^{\text{th}} row. The first subproblem can be solved easily w.r.t. 𝒄\boldsymbol{c} using singular value decomposition. In other words, π’„β€‹π’ˆT\boldsymbol{cg}^{T} and 𝒉​𝒓T\boldsymbol{hr}^{T} are the best rank-11 approximations of the residual 𝑬c\boldsymbol{E}^{c} and 𝑬r\boldsymbol{E}^{r}, respectively. The obtained 𝒄\boldsymbol{c}/𝒓\boldsymbol{r} is the best column/row that can be added to the pool of selected columns/rows. However, the obtained vector is not available in the given dataset as a column or row since it is a singular vector which is a function of all columns/rows. The following step re-imposes the underlying constraints at each iteration,

π•Škc=argmaxπ‘šβ€‹|𝒙mT​𝒄|,βˆ€π’™mβˆˆπ•c\mathbb{S}_{k}^{c}=\underset{m}{\text{argmax}}|\boldsymbol{x}_{m}^{T}\boldsymbol{c}|,\;\;\forall{\boldsymbol{x}_{m}}\in\mathbb{X}_{c} (4a)
π•Škr=argmax𝑛​|𝒙nT​𝒓|,βˆ€π’™nβˆˆπ•r\mathbb{S}_{k}^{r}=\underset{n}{\text{argmax}}|\boldsymbol{x}_{n}^{T}\boldsymbol{r}|,\;\;\forall{\boldsymbol{x}_{n}}\in\mathbb{X}_{r} (4b)

Here, π•Škc\mathbb{S}_{k}^{c} indicates a singleton that contains kthk^{\text{th}} selected column and π•Škr\mathbb{S}_{k}^{r} corresponds to the kthk^{\text{th}} selected row. In each iteration, one column or one row is the subject of optimization. The impact of the latest estimation for that column/row on the representation is neglected. Then, an optimized replacement is found. At each iteration of TWSP, sub-problems in (3) are solved and their solutions are matched to the accessible column samples and row samples through matching equations in (4). The new selected column or row is stored in π•Škc\mathbb{S}_{k}^{c} or π•Škr\mathbb{S}_{k}^{r}, respectively. At each iteration we need to compute only the first singular vector and there are fast methods to do so [4]. The pair of (3a) and (4a) optimizes and matches a column. Similarly, performing (3b) and (4b) provides us with an optimized row. However, we do not update both of them per each iteration. In fact, we need to perform a column update or a row update in each iteration. It should be determined that which update (column or row) is more efficient in each iteration of the algorithm. To this aim, first, we choose a random previously selected column and a random previously selected row. Then, we find the best possible replacement column and the best possible replacement row. Accordingly, we choose whichever that minimizes the cost function more. The best modified column-wise subset is denoted by π•Š~c\tilde{\mathbb{S}}^{c} and π•Š~r\tilde{\mathbb{S}}^{r} denotes the best row-wise modified subset. Alg. 1 indicates the steps of TWSP algorithm. Here, †\dagger refers to the Moore–Penrose pseudo-inverse operator. Iterations can be terminated either once CUR decomposition error is saturated or when a maximum number of iterations is reached.

Fig. 2: Performance comparison in terms of the normalized error of CUR decomposition: (a) adaptive CUR [16], (b) near-optimal [20], (c) SP [12], and (d) TWSP.

Fig. 3: The convergence behavior of the proposed algorithm w.r.t. the initial selection. The initial cost function corresponds to the randomly drawn initial set. The blue curves trace the optimization path along iterations of the TWSP algorithm for $100$ different realizations.

The proposed TWSP provides the selected columns and rows that form matrices $\boldsymbol{C}$ and $\boldsymbol{R}$ in the CUR decomposition. It is then straightforward to estimate the core matrix $\boldsymbol{U}$. Mathematically,

𝑼=π‘ͺ†​𝑿​𝑹†.\boldsymbol{U}=\boldsymbol{C}^{\dagger}\boldsymbol{X}\boldsymbol{R}^{\dagger.}\vspace{-1mm} (5)

The core matrix $\boldsymbol{U}$ is a two-way compressed replica of the whole dataset, and it contains valuable information in practice, as will be discussed in Sec. 4.2. In general, the number of selected columns may differ from the number of selected rows: $K_{1}$ refers to the number of columns and $K_{2}$ to the number of rows. It is worth mentioning that the complexity of TWSP is dominated by two pseudo-inverses and two leading singular vector computations. Thus, the complexity is $O(NK_{1}^{2}+MK_{2}^{2}+MN)$ per iteration, and the algorithm needs $O(\max(K_{1},K_{2}))$ iterations. Moreover, the operator rnd($K$) in Alg. 1 returns a random integer less than or equal to $K$. In the next section, we evaluate the performance of our proposed algorithm.

The CSSP and CUR decomposition problems are NP-hard, i.e., a combinatorial search is required to find the best columns and rows. The proposed TWSP algorithm minimizes the main cost function (1) in practice; however, there is no theoretical guarantee for the convergence of Alg. 1. To improve the convergence behavior of TWSP, we avoid updating both a column and a row in each iteration. Rather, we prioritize updating a row or a column depending on which yields the smaller projection error, i.e., which is the better minimizer of the cost function in that iteration. The implementation steps of TWSP are summarized in Alg. 1. As seen in Alg. 1, the method adds a revision feature to remedy the greediness of consecutive selection: a sample is removed, the selection is re-optimized to find a replacement, and this procedure is reiterated.

Algorithm 1 Two-way spectrum pursuit (TWSP)

Input: $\boldsymbol{X}\in\mathbb{R}^{N\times M}$, $K_{1}$, and $K_{2}$. Output: $\mathbb{S}^{c}$ and $\mathbb{S}^{r}$.

Initialization:
  $\mathbb{S}^{c}\leftarrow$ a random subset of $\{1,\ldots,M\}$ with $|\mathbb{S}^{c}|=K_{1}$
  $\mathbb{S}^{r}\leftarrow$ a random subset of $\{1,\ldots,N\}$ with $|\mathbb{S}^{r}|=K_{2}$
  $\{\mathbb{S}^{c}_{k}\}_{k=1}^{K_{1}}\leftarrow$ partition of $\mathbb{S}^{c}$ into $K_{1}$ singletons; $\{\mathbb{S}^{r}_{k}\}_{k=1}^{K_{2}}\leftarrow$ partition of $\mathbb{S}^{r}$ into $K_{2}$ singletons
  $i=\text{rnd}(K_{1})$, $j=\text{rnd}(K_{2})$, and $\boldsymbol{X}=\boldsymbol{C}\boldsymbol{U}\boldsymbol{R}$ (CUR of $\boldsymbol{X}$)

while a stopping criterion is not met:
  $\mathbb{S}^{c}_{\bar{i}}=\mathbb{S}^{c}\backslash\mathbb{S}^{c}_{i}$; $\boldsymbol{C}_{i}\leftarrow$ matrix $\boldsymbol{C}$ with its $i^{\text{th}}$ column removed
  $\boldsymbol{W}=\boldsymbol{C}_{i}^{\dagger}\boldsymbol{X}\boldsymbol{R}^{\dagger}$; $\boldsymbol{E}^{c}=\boldsymbol{X}-\boldsymbol{C}_{i}\boldsymbol{W}\boldsymbol{R}$ (null-space projection)
  $\boldsymbol{c}\leftarrow$ the first left singular vector of $\boldsymbol{E}^{c}$ (3a)
  $\mathbb{S}_{i}^{c}\leftarrow$ the index of the column of $\boldsymbol{X}$ most correlated with $\boldsymbol{c}$ (4a)
  $\tilde{\mathbb{S}}^{c}\leftarrow\bigcup_{i'=1}^{K_{1}}\mathbb{S}^{c}_{i'}$; $\boldsymbol{C}=\boldsymbol{X}(:,\tilde{\mathbb{S}}^{c})$
  $e^{c}=\min_{\boldsymbol{U}}\|\boldsymbol{X}-\boldsymbol{C}\boldsymbol{U}\boldsymbol{R}\|_{F}$
  $\mathbb{S}^{r}_{\bar{j}}=\mathbb{S}^{r}\backslash\mathbb{S}^{r}_{j}$; $\boldsymbol{R}_{j}\leftarrow$ matrix $\boldsymbol{R}$ with its $j^{\text{th}}$ row removed
  $\boldsymbol{Y}=\boldsymbol{C}^{\dagger}\boldsymbol{X}\boldsymbol{R}_{j}^{\dagger}$; $\boldsymbol{E}^{r}=\boldsymbol{X}-\boldsymbol{C}\boldsymbol{Y}\boldsymbol{R}_{j}$ (null-space projection)
  $\boldsymbol{r}\leftarrow$ the first right singular vector of $\boldsymbol{E}^{r}$ (3b)
  $\mathbb{S}_{j}^{r}\leftarrow$ the index of the row of $\boldsymbol{X}$ most correlated with $\boldsymbol{r}$ (4b)
  $\tilde{\mathbb{S}}^{r}\leftarrow\bigcup_{j'=1}^{K_{2}}\mathbb{S}^{r}_{j'}$; $\boldsymbol{R}=\boldsymbol{X}(\tilde{\mathbb{S}}^{r},:)$
  $e^{r}=\min_{\boldsymbol{U}}\|\boldsymbol{X}-\boldsymbol{C}\boldsymbol{U}\boldsymbol{R}\|_{F}$
  if $e^{c}<e^{r}$: $\mathbb{S}^{c}\leftarrow\tilde{\mathbb{S}}^{c}$ and $i=\text{rnd}(K_{1})$
  else: $\mathbb{S}^{r}\leftarrow\tilde{\mathbb{S}}^{r}$ and $j=\text{rnd}(K_{2})$
end while
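For concreteness, the following compact Python sketch is one possible reading of Alg. 1. It is our illustration under stated assumptions (random initial subsets, a fixed iteration budget as the stopping criterion, and scipy's iterative solver for the leading singular vector), not the authors' reference implementation.

```python
import numpy as np
from scipy.sparse.linalg import svds

def twsp(X, K1, K2, n_iters=None, seed=0):
    """Two-way spectrum pursuit (sketch). Returns column indices S_c and row indices S_r."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    S_c = list(rng.choice(M, K1, replace=False))   # random initial columns
    S_r = list(rng.choice(N, K2, replace=False))   # random initial rows
    n_iters = n_iters or 10 * max(K1, K2)          # fixed budget as stopping criterion

    def cur_error(S_c, S_r):
        C, R = X[:, S_c], X[S_r, :]
        U = np.linalg.pinv(C) @ X @ np.linalg.pinv(R)      # eq. (5)
        return np.linalg.norm(X - C @ U @ R, 'fro')

    Xn_c = X / np.linalg.norm(X, axis=0, keepdims=True)    # normalized columns (X_c)
    Xn_r = X / np.linalg.norm(X, axis=1, keepdims=True)    # normalized rows (X_r)
    i, j = rng.integers(K1), rng.integers(K2)
    for _ in range(n_iters):
        # Candidate column update, (3a) + (4a): refit selected column i
        C_i = X[:, [s for k, s in enumerate(S_c) if k != i]]
        R = X[S_r, :]
        E_c = X - C_i @ (np.linalg.pinv(C_i) @ X @ np.linalg.pinv(R)) @ R
        u, _, _ = svds(E_c, k=1)
        S_c_new = S_c.copy()
        S_c_new[i] = int(np.argmax(np.abs(Xn_c.T @ u[:, 0])))
        e_c = cur_error(S_c_new, S_r)

        # Candidate row update, (3b) + (4b): refit selected row j
        R_j = X[[s for k, s in enumerate(S_r) if k != j], :]
        C = X[:, S_c]
        E_r = X - C @ (np.linalg.pinv(C) @ X @ np.linalg.pinv(R_j)) @ R_j
        _, _, vt = svds(E_r, k=1)
        S_r_new = S_r.copy()
        S_r_new[j] = int(np.argmax(np.abs(Xn_r @ vt[0, :])))
        e_r = cur_error(S_c, S_r_new)

        # Keep whichever single update lowers the CUR error more
        if e_c < e_r:
            S_c, i = S_c_new, rng.integers(K1)
        else:
            S_r, j = S_r_new, rng.integers(K2)
    return S_c, S_r
```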

4 Experimental Results

To evaluate TWSP in terms of CUR decomposition accuracy and on machine-learning tasks, we apply the proposed TWSP to synthetic data as well as three real applications.

4.1 CUR Decomposition on Synthetic Data

To evaluate the general performance of TWSP, we compare it with state-of-the-art methods for selecting columns and rows. To this end, we created a $1000\times 2000$ synthetic dataset, generated as a rank-$30$ matrix contaminated with random noise. In Fig. 2, we illustrate the CUR decomposition error for selecting subsets of rows and columns in the range of $2$ to $20$. The reconstruction error of CUR is normalized by $\|\boldsymbol{X}\|_{F}^{2}$. We employ SP as the state-of-the-art algorithm for column subset selection [12]: we perform SP on the data matrix $\boldsymbol{X}$ and on its transpose to select a subset of columns and a subset of rows independently, and then employ (5) to obtain a CUR decomposition. We refer to the algorithm in [16] as adaptive CUR. A more accurate CUR decomposition algorithm yields a larger blue region in Fig. 2; TWSP exhibits the best performance in this experiment. The convergence behavior of TWSP for this experiment is shown in Fig. 3 for selecting 20 columns and 20 rows. The final solution of TWSP depends on the initially selected columns and rows; however, regardless of the initial condition, TWSP decreases the cost function of the CUR decomposition. TWSP prioritizes the selection of a column or a row so as to obtain the largest decrease in projection error, which is the main feature enabling convergence in practice.
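The synthetic setup can be reproduced along the following lines, using the twsp() function sketched after Alg. 1; the noise level below is our assumption, as the paper does not report it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 30)) @ rng.standard_normal((30, 2000))  # rank-30 matrix
X += 0.1 * rng.standard_normal(X.shape)                                # additive noise (assumed level)

S_c, S_r = twsp(X, K1=20, K2=20)    # twsp() as sketched in Sec. 3
C, R = X[:, S_c], X[S_r, :]
U = np.linalg.pinv(C) @ X @ np.linalg.pinv(R)
err = np.linalg.norm(X - C @ U @ R, 'fro')**2 / np.linalg.norm(X, 'fro')**2
```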

4.2 Joint Sensor Selection and Channel Assignment

The output products of a CUR decomposition are not limited to a subset of columns and rows. In some applications, interestingly, matrix $\boldsymbol{U}$ is the most important output of a CUR decomposition. Entry $(i,j)$ of $\boldsymbol{U}$ indicates how important the contribution of the $i^{\text{th}}$ selected column and the $j^{\text{th}}$ selected row is to the reconstruction of the whole matrix $\boldsymbol{X}$. This property is utilized for the problem of joint sensor selection and channel assignment in a cognitive radio network. To this aim, the exact setup in [22] is considered, with $900$ grid points and $32$ frequency channels. The received power magnitudes are organized in a $900\times 32$ matrix.

The only difference here is that the uniform sampling pattern of sensing is replaced by selection based on the CUR decomposition; our proposed TWSP algorithm provides a fast and accurate solution for it. We select between $20$ and $80$ locations for spectrum sensing and all $32$ channels. Each row of matrix $\boldsymbol{U}$ corresponds to a selected location and has $32$ entries. We assign $F$ channels to each selected sensor; in other words, each location does not sense the whole spectrum, and only $F$ frequency channels are assigned to it. The $F$ entries with the largest absolute values in each row indicate the most important channels for the corresponding location to sense.
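This top-$F$ rule reduces to an argsort; a hypothetical helper, assuming rows of $\boldsymbol{U}$ index selected locations and columns index channels:

```python
import numpy as np

def assign_channels(U, F):
    """For sensor i (row i of U), return the F channel indices j with the largest |U[i, j]|."""
    return np.argsort(-np.abs(U), axis=1)[:, :F]
```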
Fig. 4 shows the cartography error of spectrum sensing for the conventional random selection introduced in [22] and for our proposed joint optimization of sensors and channels. For each sampled location, $F=8$ out of $32$ channels are sensed. The sampled spectrum map is interpolated using the thin-plate splines method [19] for both sampling schemes. In addition to visual superiority in reconstructing the spectrum map, channel assignment based on TWSP yields a lower quantitative error, as observed in Fig. 4.

Fig. 4: The original spectrum map compared with the maps interpolated from random sampling and from our proposed method. The interpolation error is depicted versus the number of sensed locations.

4.3 Supervised Sampling

The proposed TWSP algorithm is an unsupervised data selection algorithm. In supervised settings, labels can inject information that enables a more viable joint selection [8]. A naive approach is to select representatives from each class independently; however, considering the classes jointly is more effective for data reduction. Assume we are given two data classes, $\boldsymbol{X}_{1}\in\mathbb{R}^{N\times M_{1}}$ and $\boldsymbol{X}_{2}\in\mathbb{R}^{N\times M_{2}}$. The goal is to select $K_{1}$ samples from class 1 and $K_{2}$ samples from class 2. To this aim, we propose to construct the cross-correlation of the two classes as a kernel representing both classes jointly. The matrix $\boldsymbol{X}=\boldsymbol{X}_{2}^{T}\boldsymbol{X}_{1}$, which has $M_{1}$ columns and $M_{2}$ rows, is fed to the TWSP algorithm to select $K_{1}$ columns and $K_{2}$ rows jointly.
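A sketch of this kernel construction, with placeholder feature matrices and dimensions of our choosing (twsp() as sketched in Sec. 3):

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.standard_normal((512, 1000))   # class-1 features, N x M1 (placeholder data)
X2 = rng.standard_normal((512, 1200))   # class-2 features, N x M2 (placeholder data)

X = X2.T @ X1                 # M2 x M1 kernel: columns <-> class 1, rows <-> class 2
S1, S2 = twsp(X, K1=5, K2=5)  # S1: K1 class-1 samples, S2: K2 class-2 samples
```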

As an initial experiment, supervised sampling is performed on the Kaggle cats and dogs dataset. The features are obtained from a trained ResNet-18 deep learning model, as explained in [10]. Three mutually exclusive subsets for training, validation, and testing are partitioned randomly from $2000$ images of each class. A classification accuracy of $97.5\%$ is achieved by a fine-tuned ResNet-18 using the whole training set, containing $1000$ samples per class. Afterwards, samples are selected by applying TWSP to the kernel feature matrix, and ResNet-18 is fine-tuned using only the sampled data. The model's accuracy on the testing set is compared with that of other sampling methods. Fig. 5 shows the performance of the selection algorithms for different numbers of representatives per class. Using only two samples from each class, a classification accuracy of $82.3\%$ is achieved, which is more than $15\%$ better than random selection and more than $5\%$ better than the other competitors.

Fig. 5: The classification accuracy of a fine-tuned ResNet-18 network using a few selected data points per class.

We have conducted further experiments to study the effectiveness of the proposed algorithm on the multi-class image classification problem. For this study, we use a ResNet-34 deep learning model pre-trained on CIFAR10 and trained on a subset of the ImageNet dataset comprising $10$ classes: Tench, Goldfish, Great white shark, Tiger shark, Hammerhead, Electric ray, Stingray, Cock, Hen, and Ostrich. The original training data consists of $1300$ images per class, and the idea is to use TWSP to select $K$ samples of each class for training. After training ResNet-34 on the 10 classes of CIFAR10, the feature vectors of all training images are extracted, yielding a $1300\times 512$ matrix for each class. Since TWSP, as described above, applies to the two-class setting, we employ a one-versus-all approach. In other words, a separate cross-correlation matrix is generated for each class, where $\boldsymbol{X}_{1}$ ($512\times 1300$) contains the feature vectors of the class for which samples are to be selected and $\boldsymbol{X}_{2}$ ($512\times 11700$) contains the feature vectors of the remaining $9$ classes. Hence, the number of cross-correlation matrices equals the number of classes. Then, $K_{1}$ rows and $K_{2}$ columns are selected by applying TWSP to each cross-correlation matrix; this step differs from the binary classification case, where only one correlation matrix was formed. The ResNet-34 pre-trained on CIFAR10 is then fine-tuned using the sampled ImageNet data, and the model's accuracy is compared on the test set with other sampling methods in Fig. 6. As can be seen, our proposed TWSP algorithm shows superior performance compared to the state-of-the-art sampling methods.
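One possible reading of this one-versus-all construction is sketched below; we keep only the column selection (the samples of the class itself) and discard the row selection over the remaining classes, since the paper does not fully specify how the row selection is used. All names are ours.

```python
import numpy as np

def one_vs_all_selection(features, K, twsp_fn):
    """features: dict mapping class label -> (N x M_class) feature matrix.
    For each class, build its cross-correlation kernel against all other
    classes and select K of its samples with TWSP."""
    selected = {}
    for cls, X1 in features.items():
        X2 = np.hstack([F for c, F in features.items() if c != cls])  # rest, N x (sum of M)
        S_cols, _ = twsp_fn(X2.T @ X1, K, K)   # columns index samples of class `cls`
        selected[cls] = S_cols
    return selected
```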

Fig. 6: Test accuracy in terms of improvement over the test accuracy obtained by random selection.

4.4 Informative Users/Contents Detection

Another problem of great interest to streaming service providers is choosing a set of users whose feedback decisively reflects opinions about different products. It is therefore crucial for such companies to find a subset of users and media products whose reviews carry information about other users' unknown preferences. Each user has a limited scope of interest; for example, a user who loves only the romance and action genres corresponds to a specific profile that can represent a cluster of users accurately, and such a user's reviews are most valuable within their areas of interest rather than across all genres. Likewise, there exist movies that represent their genres well.

Fig. 7: Comparison of the normalized prediction error of the CUR decomposition with state-of-the-art algorithms, for simultaneous movie and user subset selection from the Netflix dataset.

As a result, there is a demand for a reliable algorithm that simultaneously chooses the most informative subsets of users and movies. Such a subset is desirable for streaming companies to the extent that they are willing to give users incentives to leave comprehensive reviews for certain products. To this end, we evaluated our algorithm on the Netflix Prize dataset, containing $17{,}770$ movies and $480{,}189$ users. We reduced the dataset to $990$ movies and $4{,}727$ users by keeping only the movies and users with the most reviews. Then, we completed the dataset by the method of Lin et al. [15] to obtain a ground truth, and selected a subset of rows and columns (users and movies) from the completed dataset. Fig. 7 reveals that TWSP achieves the best performance in predicting scores for all users/movies based on a few selected users/movies.

5 Conclusion

Two-way spectrum pursuit (TWSP) was proposed as an accurate and efficient algorithm for solving the CUR decomposition. TWSP can be employed for the joint selection of columns and rows such that their outer products reconstruct the whole matrix as accurately as possible. Several applications were presented to show the efficacy of the proposed method; however, its use is not limited to these applications. Moreover, the proposed algorithm can be extended to an $n$-way spectrum pursuit for efficient tensor subset selection.

References

[1] Christos Boutsidis, Petros Drineas, and Malik Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, 43(2):687–717, 2014.
[2] Christos Boutsidis, Michael W. Mahoney, and Petros Drineas. An improved approximation algorithm for the column subset selection problem. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 968–977. SIAM, 2009.
[3] Christos Boutsidis and David P. Woodruff. Optimal CUR matrix decompositions. SIAM Journal on Computing, 46(2):543–589, 2017.
[4] Pierre Comon and Gene H. Golub. Tracking a few extreme singular values and vectors in signal processing. Proceedings of the IEEE, 78(8):1327–1343, 1990.
[5] Amit Deshpande and Luis Rademacher. Efficient volume sampling for row/column subset selection. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pages 329–338. IEEE, 2010.
[6] Ehsan Elhamifar, Guillermo Sapiro, and S. Shankar Sastry. Dissimilarity-based sparse subset selection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(11):2182–2197, 2015.
[7] Ehsan Elhamifar, Guillermo Sapiro, and Rene Vidal. See all by looking at a few: Sparse modeling for finding representative objects. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012.
[8] Ashkan Esmaeili, Kayhan Behdin, Mohammad Amin Fakharian, and Farokh Marvasti. Transduction with matrix completion using smoothed rank function. arXiv preprint arXiv:1805.07561, 2018.
[9] Keaton Hamm and Longxiu Huang. Perspectives on CUR decompositions. Applied and Computational Harmonic Analysis, 48(3):1088–1099, 2020.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[11] Mohsen Joneidi, Saeed Vahidian, Ashkan Esmaeili, and Siavash Khodadadeh. Optimality of spectrum pursuit for column subset selection problem: Theoretical guarantees and applications in deep learning. 2020.
[12] Mohsen Joneidi, Saeed Vahidian, Ashkan Esmaeili, Weijia Wang, Nazanin Rahnavard, Bill Lin, and Mubarak Shah. Select to better learn: Fast and accurate deep learning using data selection from nonlinear manifolds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7819–7829, 2020.
[13] Mohsen Joneidi, Alireza Zaeemzadeh, Behzad Shahrasbi, Guojun Qi, and Nazanin Rahnavard. E-optimal sensor selection for compressive sensing-based purposes. IEEE Transactions on Big Data, 2018.
[14] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Polynomial time algorithms for dual volume sampling. In Advances in Neural Information Processing Systems, pages 5038–5047, 2017.
[15] Zhouchen Lin, Minming Chen, and Yi Ma. The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. arXiv preprint arXiv:1009.5055, 2010.
[16] Michael W. Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
[17] Saurabh Paul, Malik Magdon-Ismail, and Petros Drineas. Column selection via adaptive sampling. In Advances in Neural Information Processing Systems, pages 406–414, 2015.
[18] Yaroslav Shitov. Column subset selection is NP-complete. Linear Algebra and its Applications, 2020.
[19] Suzan Üreten, Abbas Yongaçoğlu, and Emil Petriu. A comparison of interference cartography generation techniques in cognitive radio networks. In 2012 IEEE International Conference on Communications (ICC), pages 1879–1883. IEEE, 2012.
[20] Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. The Journal of Machine Learning Research, 14(1):2729–2769, 2013.
[21] Alireza Zaeemzadeh, Mohsen Joneidi, Nazanin Rahnavard, and Mubarak Shah. Iterative projection and matching: Finding structure-preserving representatives and its application to computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5414–5423, 2019.
[22] Guoyong Zhang, Xiao Fu, Jun Wang, Xi-Le Zhao, and Mingyi Hong. Spectrum cartography via coupled block-term tensor decomposition. IEEE Transactions on Signal Processing, 2020.