
The Complexity of Learning Sparse Superposed Features with Feedback

Akash Kumar
Department of Computer Science and Engineering
University of California-San Diego
[email protected]
Abstract

The success of deep networks is crucially attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), in the form of relative triplet comparisons. These features may represent various constructs, including dictionaries in LLMs or components of a covariance matrix of Mahalanobis distances. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios when the agent’s feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from Recursive Feature Machine-trained models and dictionary extraction from sparse autoencoders trained on Large Language Models.

1 Introduction

In recent years, neural network-based models have achieved state-of-the-art performance across a wide array of tasks. These models effectively capture relevant features or concepts from samples, tailored to the specific prediction tasks they address (Yang and Hu, 2021b; Bordelon and Pehlevan, 2022a; Ba et al., 2022b). A fundamental challenge lies in understanding how these models learn such features and determining whether these features can be interpreted or even retrieved directly (Radhakrishnan et al., 2024). Recent advancements in mechanistic interpretability have opened multiple avenues for elucidating how transformer-based models, including Large Language Models (LLMs), acquire and represent features (Bricken et al., 2023; Doshi-Velez and Kim, 2017). These advances include uncovering neural circuits that encode specific concepts (Marks et al., 2024b; Olah et al., 2020), understanding feature composition across attention layers (Yang and Hu, 2021b), and revealing how models develop structured representations (Elhage et al., 2022). One line of research posits that features are encoded linearly within the latent representation space through sparse activations, a concept known as the linear representation hypothesis (LRH) (Mikolov et al., 2013; Arora et al., 2016). However, this hypothesis faces challenges in explaining how neural networks function, as models often need to represent more distinct features than their layer dimensions would theoretically allow under purely linear encoding. This phenomenon has been studied extensively in the context of large language models through the lens of superposition (Elhage et al., 2022), where multiple features share the same dimensional space in structured ways.

Recent efforts have addressed this challenge through sparse coding or dictionary learning, proposing that any layer \ell of the model learns features linearly:

$$\bm{x}\approx\bm{\textsf{D}}_{\ell}\cdot\alpha_{\ell}(\bm{x})+\epsilon_{\ell}(\bm{x}),$$

where $\bm{x}\in\mathbb{R}^{d}$, $\bm{\textsf{D}}_{\ell}\in\mathbb{R}^{d\times p}$ is a dictionary matrix (which could be both overcomplete and undercomplete), $\alpha_{\ell}(\bm{x})\in\mathbb{R}^{p}$ is a sparse representation vector, and $\epsilon_{\ell}(\bm{x})\in\mathbb{R}^{d}$ represents error terms. This approach enables retrieval of interpretable features through sparse autoencoders (Bricken et al., 2023; Huben et al., 2024; Marks et al., 2024b), allowing for targeted monitoring and modification of network behavior. The linear feature decomposition not only advances model interpretation but also suggests the potential for developing compact, interpretable models that maintain performance by leveraging universal features from larger architectures.

In this work, we explore how complex features encoded as a dictionary can be distilled through feedback from either advanced language models (e.g., ChatGPT, Claude 3.0 Sonnet) or human agents. Let us define a dictionary $\bm{\textsf{D}}\in\mathbb{R}^{d\times p}$ where each column represents an atomic feature vector. These atomic features, denoted as $\{u_{1},u_{2},\ldots,u_{p}\}\subset\mathbb{R}^{d}$, could correspond to semantic concepts like "tree", "house", or "lawn" that are relevant to the task's sample space. The core mechanism involves an agent (either AI or human) determining whether different sparse combinations of these atomic features are similar or dissimilar. Specifically, given sparse activation vectors $\alpha,\alpha^{\prime}\in\mathbb{R}^{p}$, the agent evaluates whether linear combinations such as $\alpha_{1}v(\text{"tree"})+\alpha_{2}v(\text{"car"})+\ldots+\alpha_{p}v(\text{"house"})$ are equivalent to other combinations using different activation vectors. Precisely, we formalize these feedback relationships using relative triplet comparisons $(\alpha,\beta,\zeta)\in{\mathcal{V}}^{3}$, where ${\mathcal{V}}\subseteq\mathbb{R}^{p}$ is the activation or representation space. These comparisons express that a linear combination of features using coefficients $\alpha$ is more similar to a combination using coefficients $\beta$ than to one using coefficients $\zeta$.

The objective is to determine the extent to which an oblivious learner—one who learns solely by satisfying the constraints of the feedback and randomly selecting valid features—can identify the relevant features up to normal transformation. Formally, the fundamental protocol is as follows:

  • The agent selects (constructive or distributional) sparse triplets $(\alpha,\beta,\zeta)_{\ell}\in\mathbb{R}^{3p}$ with labels $\ell\in\{+1,0,-1\}$ corresponding to a dictionary D, and provides them to the learner.

  • The learner solves for

    $$\operatorname{sgn}\left(\|\bm{\textsf{D}}(\alpha-\beta)\|-\|\bm{\textsf{D}}(\alpha-\zeta)\|\right)=\ell$$

    and outputs a solution $\hat{\bm{\textsf{D}}}^{\top}\hat{\bm{\textsf{D}}}$.

Semantically, these relative distances capture how ground-truth samples (e.g., images or text) relate to one another. We refer to the normal transformation $\bm{\textsf{D}}^{\top}\bm{\textsf{D}}$ of a given dictionary D as its feature matrix $\bm{\Phi}\in\mathbb{R}^{p\times p}$, which is exactly a covariance matrix. Alternatively, for the representation space ${\mathcal{V}}\subseteq\mathbb{R}^{p}$, this transformation defines a Mahalanobis distance function $d:{\mathcal{V}}\times{\mathcal{V}}\to\mathbb{R}$, characterized by the square symmetric linear transformation $\bm{\Phi}\succeq 0$ such that for any pair of activations $(x,y)\in{\mathcal{V}}^{2}$, their distance is given by:

$$d(x,y):=(x-y)^{\top}\bm{\Phi}(x-y).$$

When $\bm{\Phi}$ embeds samples into $\mathbb{R}^{r}$, it admits a decomposition $\bm{\Phi}=\textsf{L}^{\top}\textsf{L}$ for $\textsf{L}\in\mathbb{R}^{r\times p}$, where L serves as a dictionary for this distance function, a formulation well studied in the metric learning literature (Kulis, 2013). In this work, we study the minimal number of interactions, termed the feedback complexity, of learning feature matrices (normal transformations of a dictionary) of the form $\bm{\Phi}^{*}\in\textsf{Sym}_{+}(\mathbb{R}^{p\times p})$. We consider two types of feedback: general activations and sparse activations, examining both constructive and distributional settings. Our primary contributions are:

  I. We investigate feedback complexity in the constructive setting, where agents select activations from $\mathbb{R}^{p}$, establishing strong bounds for both general and sparse scenarios (see Section 4).

  II. We analyze the distributional setting with sampled activations, developing results for both general and sparse representations. For sparse sampling, we extend the definition of a Lebesgue measure to accommodate sparsity constraints (see Section 5).

  III. We validate our theoretical bounds through experiments with feature matrices from Recursive Feature Machine-trained kernel classifiers and dictionaries trained for sparse autoencoders in Large Language Models, including Pythia-70M (Biderman et al., 2023) and Board Game models (Karvonen et al., 2024) (see Section 6).

Table 1 summarizes our feedback complexity bounds for the various feedback settings examined in this study.

Setting | Feedback Complexity
Standard Constructive | $\Theta\left(\frac{r(r+1)}{2}+p-r+1\right)$
Sparse Constructive | $O\left(\frac{p(p+1)}{2}-1\right)$
Standard Sampling | $\Theta\left(\frac{p(p+1)}{2}-1\right)$ (a.s.)
Sparse Sampling | $cp^{2}\left(\frac{2}{p_{\textsf{s}}^{2}}\log\frac{2}{\delta}\right)^{1/p^{2}}$ (w.h.p.)
Table 1: Feedback complexity under the constructive and sampling settings. $r$ is the rank of the feature matrix, $c>0$ is a constant, and $p_{\textsf{s}}$ is a constant that depends on the activation distribution and the sparsity $s$. We abbreviate: a.s. for almost surely and w.h.p. for with high probability.

2 Related Work

Dictionary learning

Recent work has explored dictionary learning to disentangle the semanticity (mono- or polysemy) of neural network activations (Faruqui et al., 2015; Arora et al., 2018; Subramanian et al., 2018; Zhang et al., 2019; Yun et al., 2021). Dictionary learning (Mallat and Zhang, 1993; Olshausen and Field, 1997) (aka sparse coding) provides a systematic approach to decompose task-specific samples into sparse signals. The sample complexity of dictionary learning (or sparse coding) has been extensively studied as an optimization problem, typically involving non-convex objectives such as $\ell_{1}$ regularization (see Gribonval et al. (2015)). While traditional methods work directly with ground-truth samples, our approach differs fundamentally as the learner only receives feedback on sparse signals or activations. Prior work in noiseless settings has established probabilistic exact recovery up to linear transformations (permutations and sign changes) under mutual incoherence conditions (Gribonval and Schnass, 2010; Agarwal et al., 2014). Our work extends these results by proving exact recovery (both deterministic and probabilistic) up to normal transformation, which generalizes to rotational and sign changes under strong incoherence properties (see Lemma 1). In the sampling regime, we analyze $k$-sparse signals, building upon the noisy setting framework developed in Arora et al. (2013); Gribonval et al. (2015).

Feature learning in neural networks and Linear representation hypothesis

Neural networks demonstrate a remarkable capability to discover and exploit task-specific features from data (Yang and Hu, 2021b; Bordelon and Pehlevan, 2022b; Shi et al., 2022). Recent theoretical advances have significantly enhanced our understanding of feature evolution and emergence during training (Abbe et al., 2022; Ba et al., 2022a; Damian et al., 2022; Yang and Hu, 2021a; Zhu et al., 2022). Particularly noteworthy is the finding that the outer product of model weights correlates with the gradient outer product of the classifier averaged over layer preactivations (Radhakrishnan et al., 2024), which directly relates to the covariance matrices central to our investigation. Building upon these insights, Elhage et al. (2022) proposed that features in large language models follow a linear encoding principle, suggesting that the complex feature representations learned during training can be decomposed into interpretable linear components. This interpretability, in turn, could facilitate the development of simplified algorithms for complex tasks (Fawzi et al., 2022; Romera-Paredes et al., 2024). Recent research has focused on extracting these interpretable features in the form of dictionary learning by training sparse autoencoder for various language models including Board Games Models (Marks et al., 2024b; Bricken et al., 2023; Karvonen et al., 2024). Our work extends this line of inquiry by investigating whether such interpretable dictionaries can be effectively transferred to a weak learner using minimal comparative feedback.

Triplet learning a covariance matrix

Learning a feature matrix (for a dictionary) up to normal transformation can be viewed through two established frameworks: covariance estimation (Chen et al., 2013; Li and Voroninski, 2013) and learning linear distances such as Mahalanobis matrices (Kulis, 2013). While these frameworks traditionally rely on exact or noisy measurements, our work introduces a distinct mechanism based solely on relative feedback, aligning more closely with the semantic structure of Mahalanobis distances. The study of Mahalanobis distance functions has been central to metric learning research (Bellet et al., 2015; Kulis, 2013), encompassing both supervised approaches using class labels (Weinberger and Saul, 2009; Xing et al., 2002) and unsupervised methods such as LDA (Fisher, 1936) and PCA (Jolliffe, 1986). Schultz and Joachims (2003) and Kleindessner and von Luxburg (2016) have extended this framework to incorporate relative comparisons on distances. Particularly relevant to our work are studies by Schultz and Joachims (2003) and Mason et al. (2017) that employ triplet comparisons, though these typically assume i.i.d. triplets with potentially noisy measurements. Our approach differs by incorporating an active learning element: while signals are drawn i.i.d, an agent selectively provides feedback on informative instances. This constructive triplet framework for covariance estimation represents a novel direction, drawing inspiration from machine teaching, where a teaching agent provides carefully chosen examples to facilitate learning (Zhu et al., 2018).

3 Problem Setup

Figure 1: Features via Recursive Feature Machines: We consider monomial regression on 10-dimensional variables $z\sim\mathcal{N}(0,0.5\,\mathbb{I}_{10})$ with the target function $f^{*}(z)=z_{0}z_{1}\mathbf{1}(z_{5}>0)$. We train a kernel machine defined as $\hat{f}_{\bm{\Phi}}(z)=\sum_{y_{i}\in\mathcal{D}_{\text{train}}}a_{i}\cdot K_{\bm{\Phi}}(y_{i},z)$ using the RFM (Radhakrishnan et al., 2024) over 5 iterations with 4,000 training samples, resulting in the ground-truth feature matrix $\bm{\Phi}^{*}$ of rank 4 (see Section 6 for a discussion of the training process). Subsequently, we apply various feedback methods from an agent, including eigendecomposition (Theorem 1) and sparse construction (Theorem 2) in Section 4, random Gaussian sampling (Theorem 3), and sparse sampling (Theorem 4) with a sparsity probability of 0.9 as shown in Section 5. Notably, the central methods attain the same Mean Squared Error (MSE) as the ground truth $\bm{\Phi}^{*}$ with 55 feedbacks, whereas the highly sparse approach produces poor feature matrices and higher MSE.

We denote by ${\mathcal{V}}\subseteq\mathbb{R}^{p}$ the space of activations or representations and by ${\mathcal{X}}\subseteq\mathbb{R}^{d}$ the space of samples. For the space of feature matrices (for a dictionary or Mahalanobis distances), denoted as ${\mathcal{M}}_{\mathsf{F}}$, we consider the family of symmetric positive semi-definite matrices in $\mathbb{R}^{p\times p}$, i.e., ${\mathcal{M}}_{\mathsf{F}}=\left\{\bm{\Phi}\in\textsf{Sym}_{+}(\mathbb{R}^{p\times p})\right\}$. We denote a feedback set as ${\mathcal{F}}$, which consists of triplets $(x,y,z)\in{\mathcal{V}}^{3}$ with corresponding signs $\ell\in\{+1,0,-1\}$.

We use the standard notations in linear algebra over a space of matrices provided in Appendix B.

Triplet feedback

An agent provides feedback on activations in ${\mathcal{V}}$ through relative triplet comparisons $(x,y,z)\in{\mathcal{V}}^{3}$. Each comparison evaluates linear combinations of feature vectors:

$$\sum_{i=1}^{p}x_{i}u_{i}(\text{"feature $i$"})\ \text{ is more similar to }\ \sum_{i=1}^{p}y_{i}u_{i}(\text{"feature $i$"})\ \text{ than to }\ \sum_{i=1}^{p}z_{i}u_{i}(\text{"feature $i$"}).$$

We study both sparse and non-sparse activation feedbacks, where sparsity is defined as:

Definition 1 ($s$-sparse activations).

An activation $\alpha\in\mathbb{R}^{p}$ is $s$-sparse if at most $s$ indices of $\alpha$ are non-zero.

Since triplet comparisons are invariant to positive scaling of feature matrices, we define:

Definition 2 (Feature equivalence).

For a feature family ${\mathcal{M}}_{\textsf{F}}$, feature matrices $\bm{\Phi}^{\prime}$ and $\bm{\Phi}^{*}$ are equivalent if there exists $\lambda>0$ such that $\bm{\Phi}^{\prime}=\lambda\cdot\bm{\Phi}^{*}$.

We study a simple learning framework where the learner merely satisfies the constraints provided by the agent’s feedback:

Definition 3 (Oblivious learner).

A learner is oblivious if it randomly selects a feature matrix from the set of valid solutions to a given feedback set ${\mathcal{F}}$, i.e., it arbitrarily chooses $\bm{\Phi}\in{\mathcal{M}}_{\textsf{F}}({\mathcal{F}})$, where ${\mathcal{M}}_{\textsf{F}}({\mathcal{F}})$ represents the set of feature matrices satisfying the constraints in ${\mathcal{F}}$.

This framework aligns with version space learning, where $\textsf{VS}({\mathcal{F}},{\mathcal{M}}_{\textsf{F}})$ denotes the set of feature matrices in ${\mathcal{M}}_{\textsf{F}}$ compatible with the feedback set ${\mathcal{F}}$.

Prior work on dictionary learning has established recovery up to linear transformation under weak mutual incoherence (see, e.g., Gribonval and Schnass (2010)). In our setting, when the agent provides feature feedback corresponding to $\bm{\textsf{D}}$ (or $\textsf{L}$) $\in\mathbb{R}^{d\times p}$, the learner recovers L up to normal transformation. Moreover, when L has orthogonal rows (strong incoherence), we can recover L up to rotation and sign changes as stated below, with the proof deferred to Appendix C.

Lemma 1 (Recovering orthogonal representations).

Assume $\bm{\Phi}\in\textsf{Sym}_{+}(\mathbb{R}^{p\times p})$. Define the set of orthogonal Cholesky decompositions of $\bm{\Phi}$ as

$${\mathcal{W}}_{\textsf{CD}}=\left\{\textbf{U}\in\mathbb{R}^{p\times r}\ \bigg|\ \bm{\Phi}=\textbf{U}\textbf{U}^{\top}\ \&\ \textbf{U}^{\top}\textbf{U}=\text{diag}(\lambda_{1},\ldots,\lambda_{r})\right\},$$

where $r=\text{rank}(\bm{\Phi})$ and $\lambda_{1},\lambda_{2},\ldots,\lambda_{r}$ are the eigenvalues of $\bm{\Phi}$ in descending order. Then, for any two matrices $\textbf{U},\textbf{U}^{\prime}\in{\mathcal{W}}_{\textsf{CD}}$, there exists an orthogonal matrix $\textbf{R}\in\mathbb{R}^{r\times r}$ such that $\textbf{U}^{\prime}=\textbf{U}\textbf{R}$, where R is block diagonal with orthogonal blocks corresponding to any repeated diagonal entries $\lambda_{i}$ in $\textbf{U}^{\top}\textbf{U}$. Additionally, each column of $\textbf{U}^{\prime}$ can differ from the corresponding column of U by a sign change.

We note that the recovery of L relies on the assumption that all of its rows are orthogonal, so that the rank of L is $r=d$. In cases where $r<d$, one needs additional information in the form of a ground-truth sample $\bm{x}=\textsf{L}\alpha$ for some activation $\alpha$ to recover L up to a linear transformation. Finally, the interaction protocol is shown in Algorithm 1.

Given: Representation space ${\mathcal{V}}\subseteq\mathbb{R}^{p}$, feature family ${\mathcal{M}}_{\mathsf{F}}$.
In batch setting:
  1. Agent picks triplets ${\mathcal{F}}({\mathcal{V}},\bm{\Phi}^{*})=\left\{(x,y,z)\in{\mathcal{V}}^{3}\,|\,(x-y)^{\top}\bm{\Phi}^{*}(x-y)\geq(x-z)^{\top}\bm{\Phi}^{*}(x-z)\right\}$.
  2. Learner receives ${\mathcal{F}}$ and obliviously picks a feature matrix $\bm{\Phi}\in{\mathcal{M}}_{\mathsf{F}}$ that satisfies the set of constraints in ${\mathcal{F}}({\mathcal{V}},\bm{\Phi}^{*})$.
  3. Learner outputs $\bm{\Phi}$.

Algorithm 1: Model of feature learning with feedback
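To make the protocol concrete, below is a minimal sketch (in Python with NumPy and cvxpy, the solver we use for small dimensions in Section 6) of one round of Algorithm 1: a hypothetical agent with access to the ground-truth $\bm{\Phi}^{*}$ labels constructed triplets, and an oblivious learner returns any PSD matrix consistent with the induced constraints. The helper names, the trace normalization, and the random choice of triplets are illustrative assumptions, not part of the formal protocol.

```python
import numpy as np
import cvxpy as cp

def agent_label(Phi_star, x, y, z):
    # sign of ||D(x - y)||^2 - ||D(x - z)||^2, computed through Phi* = D^T D
    return int(np.sign((x - y) @ Phi_star @ (x - y) - (x - z) @ Phi_star @ (x - z)))

def oblivious_learner(triplets, labels, p):
    # return *some* PSD matrix in the version space of the feedback set
    Phi = cp.Variable((p, p), PSD=True)
    constraints = [cp.trace(Phi) == 1]  # fix the scale (feature equivalence)
    for (x, y, z), ell in zip(triplets, labels):
        S = np.outer(x - y, x - y) - np.outer(x - z, x - z)
        expr = cp.sum(cp.multiply(Phi, S))  # Frobenius inner product <Phi, S>
        constraints.append(expr == 0 if ell == 0 else (expr >= 0 if ell > 0 else expr <= 0))
    cp.Problem(cp.Minimize(0), constraints).solve()
    return Phi.value

# toy run with a rank-2 target in R^5
rng = np.random.default_rng(0)
p, r = 5, 2
L = rng.standard_normal((r, p))
Phi_star = L.T @ L
triplets = [tuple(rng.standard_normal(p) for _ in range(3)) for _ in range(60)]
labels = [agent_label(Phi_star, *t) for t in triplets]
Phi_hat = oblivious_learner(triplets, labels, p)
```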

4 Sparse Feature Learning with Constructive Feedback

Here, we study the feedback complexity in the setting where the agent is allowed to pick or construct any activation from $\mathbb{R}^{p}$.

Reduction to Pairwise Comparisons

The general triplet feedbacks in Algorithm 1, which potentially involve inequality constraints, can be simplified to pairwise comparisons with equality constraints via a simple manipulation, as follows.

Lemma 2.

Let $\bm{\Phi}^{*}\in\mathcal{M}_{\mathsf{F}}$ be a target feature matrix in representation space $\mathbb{R}^{p}$ used for oblivious learning. Given a feedback set

$$\mathcal{F}=\left\{(x,y,z)\in\mathbb{R}^{3p}\ \big|\ (x-y)^{\top}\bm{\Phi}^{*}(x-y)\geq(x-z)^{\top}\bm{\Phi}^{*}(x-z)\right\},$$

such that any $\bm{\Phi}^{\prime}\in\textsf{VS}(\mathcal{F},\mathcal{M}_{\mathsf{F}})$ is feature equivalent to $\bm{\Phi}^{*}$, there exists a pairwise feedback set

$$\mathcal{F}^{\prime}=\left\{(y^{\prime},z^{\prime})\in\mathbb{R}^{2p}\ \big|\ y^{\prime\top}\bm{\Phi}^{*}y^{\prime}=z^{\prime\top}\bm{\Phi}^{*}z^{\prime}\right\}$$

such that $\bm{\Phi}^{\prime}\in\textsf{VS}(\mathcal{F}^{\prime},\mathcal{M}_{\mathsf{F}})$.

Proof.

WLOG, assume $x\neq z$ for all $(x,y,z)\in\mathcal{F}$. For any triplet $(x,y,z)\in\mathcal{F}$: Case (i): If $(x-y)^{\top}\bm{\Phi}^{*}(x-y)=(x-z)^{\top}\bm{\Phi}^{*}(x-z)$, then $(x-y,x-z)$ satisfies the equality. Case (ii): If $(x-y)^{\top}\bm{\Phi}^{*}(x-y)>(x-z)^{\top}\bm{\Phi}^{*}(x-z)$, then for some $\lambda>0$:

$$(x-y)^{\top}\bm{\Phi}^{*}(x-y)=(1+\lambda)(x-z)^{\top}\bm{\Phi}^{*}(x-z),$$

implying that $(x-y,\sqrt{1+\lambda}(x-z))$ satisfies the equality.

Thus, each triplet in $\mathcal{F}$ maps to a pair in $\mathcal{F}^{\prime}$, preserving feature equivalence under positive scaling. ∎

This implies that if triplet comparisons are used in Algorithm 1, equivalent pairwise comparisons exist satisfying:

$$\bm{\Phi}^{\prime}=\lambda\cdot\bm{\Phi}^{*},\quad\lambda>0,$$ (1a)
$$\bm{\Phi}^{\prime}\in\left\{\bm{\Phi}\in\mathcal{M}_{\mathsf{F}}\ \big|\ \forall(y,z)\in\mathcal{F}^{\prime},\ y^{\top}\bm{\Phi}y=z^{\top}\bm{\Phi}z\right\}.$$ (1b)

Now, we show a reformulation of the oblivious learning problem for a feature matrix using pairwise comparisons, which provides a useful geometric interpretation. Consider a pair $(y,z)$ and a matrix $\bm{\Phi}$. An equality constraint implies

$$y^{\top}\bm{\Phi}y=z^{\top}\bm{\Phi}z\iff\langle\bm{\Phi},\,yy^{\top}-zz^{\top}\rangle=0,$$

where $\langle\cdot,\cdot\rangle$ denotes the Frobenius inner product. This equivalence indicates that the matrix $(yy^{\top}-zz^{\top})$ is orthogonal to $\bm{\Phi}$ in the Frobenius inner product space. Now, given a set of pairwise feedbacks

$$\mathcal{F}(\mathbb{R}^{p},\mathcal{M}_{\mathsf{F}},\bm{\Phi}^{*})=\{(y_{i},z_{i})\}_{i=1}^{k}$$

corresponding to the target feature matrix $\bm{\Phi}^{*}$, the learning problem defined by Eq. (1b) can be formulated as:

$$\forall(y,z)\in\mathcal{F}(\mathbb{R}^{p},\mathcal{M}_{\mathsf{F}},\bm{\Phi}^{*}),\quad\langle\bm{\Phi},\,yy^{\top}-zz^{\top}\rangle=0.$$ (2)

Essentially, the learner aims to select a matrix $\bm{\Phi}$ that satisfies the condition in Eq. (2). Geometrically, Eq. (2) requires any solution $\bm{\Phi}$ to annihilate the subspace spanned by the matrices $\{yy^{\top}-zz^{\top}\}_{(y,z)\in\mathcal{F}}$, which lies within the orthogonal complement of $\bm{\Phi}^{*}$. Formally, this complement is defined as:

$$\mathcal{O}_{\bm{\Phi}^{*}}:=\left\{S\in\textsf{Sym}(\mathbb{R}^{p\times p})\ \bigg|\ \langle\bm{\Phi}^{*},S\rangle=0\right\}.$$
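As a sanity check on this geometric picture, the following sketch (NumPy; the helper names are ours) stacks the constraints of Eq. (2) in half-vectorization coordinates and reads off $\bm{\Phi}^{*}$, up to positive scaling, as the one-dimensional null space once $\frac{p(p+1)}{2}-1$ generic equality pairs are available (cf. Proposition 1 below).

```python
import numpy as np

def vech_row(S, p):
    # coordinates of the functional Phi -> <Phi, S>_F on symmetric Phi:
    # diagonal entries first, then the strict upper triangle (with factor 2)
    iu = np.triu_indices(p, k=1)
    return np.concatenate([np.diag(S), 2.0 * S[iu]])

def unvech(v, p):
    Phi = np.zeros((p, p))
    np.fill_diagonal(Phi, v[:p])
    iu = np.triu_indices(p, k=1)
    Phi[iu] = v[p:]
    return Phi + np.triu(Phi, k=1).T

def recover_from_pairs(pairs, p):
    # each row annihilates vech(Phi*); the null space gives Phi* up to scale
    M = np.stack([vech_row(np.outer(y, y) - np.outer(z, z), p) for y, z in pairs])
    _, _, Vt = np.linalg.svd(M)
    Phi = unvech(Vt[-1], p)
    return Phi if np.trace(Phi) > 0 else -Phi

# sanity check: a full-rank target and p(p+1)/2 - 1 generic equality pairs
rng = np.random.default_rng(0)
p = 4
A = rng.standard_normal((p, p))
Phi_star = A @ A.T
pairs = []
for _ in range(p * (p + 1) // 2 - 1):
    y, z = rng.standard_normal(p), rng.standard_normal(p)
    lam = (y @ Phi_star @ y) / (z @ Phi_star @ z)
    pairs.append((y, np.sqrt(lam) * z))      # rescale z so the pair is an equality
Phi_hat = recover_from_pairs(pairs, p)
scale = np.trace(Phi_star) / np.trace(Phi_hat)
print(np.allclose(scale * Phi_hat, Phi_star, atol=1e-6))
```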

4.1 Constructive feedbacks: Worst-case lower bound

To learn a symmetric PSD matrix, the learner needs at most $p(p+1)/2$ constraints for linear programming, corresponding to the number of degrees of freedom. The first question, then, is whether there are pathological feature matrices in ${\mathcal{M}}_{\textsf{F}}$ that require at least $p(p+1)/2$ triplet feedbacks in Algorithm 1. This is indeed the case when the target matrix $\bm{\Phi}^{*}\in\textsf{Sym}_{+}(\mathbb{R}^{p\times p})$ is full rank.

In the following proposition, proven in Appendix D, we show a strong lower bound on the worst-case $\bm{\Phi}^{*}$ that turns out to be of order $\Omega(p^{2})$.

Proposition 1.

In the constructive setting, the worst-case feedback complexity of the class ${\mathcal{M}}_{\textsf{F}}$ with general activations is at least $\frac{p(p+1)}{2}-1$.

Proof Outline.

As discussed in Eq. (1) and Eq. (2), for a full-rank feature matrix $\bm{\Phi}^{*}\in{\mathcal{M}}_{\textsf{F}}$, the span of any feedback set ${\mathcal{F}}$, i.e., $\text{span}\langle\{xx^{\top}-yy^{\top}\}_{(x,y)\in{\mathcal{F}}}\rangle$, must lie within the orthogonal complement $\mathcal{O}_{\bm{\Phi}^{*}}$ of $\bm{\Phi}^{*}$ in the space of symmetric matrices $\textsf{Sym}(\mathbb{R}^{p\times p})$. Conversely, if $\bm{\Phi}^{*}$ has full rank, then $\mathcal{O}_{\bm{\Phi}^{*}}$ is contained within this span. This necessary condition requires the feedback set to have a size of at least $\frac{p(p+1)}{2}-1$, given that $\dim(\textsf{Sym}(\mathbb{R}^{p\times p}))=\frac{p(p+1)}{2}$. ∎

Since the worst-case bounds are pessimistic for oblivious learning of Eq. (1), a natural question is how the feedback complexity varies over the feature model ${\mathcal{M}}_{\textsf{F}}$. In the following subsection, we study the feedback complexity for the feature model based on the rank of the underlying matrix, showing that the bounds can be drastically reduced.

4.2 Feature learning of low-rank matrices

As stated in Proposition 1, the learner requires at least $\frac{p(p+1)}{2}-1$ feedback pairs to annihilate the orthogonal complement $\mathcal{O}_{\bm{\Phi}^{*}}$. However, this requirement decreases with a lower rank of $\bm{\Phi}^{*}$. We illustrate this in Fig. 1 for a feature matrix $\bm{\Phi}\in\mathbb{R}^{10\times 10}$ of rank 4 trained via Recursive Feature Machines (Radhakrishnan et al., 2024).

Consider an activation $\alpha\in\mathbb{R}^{p}$ in the nullspace of $\bm{\Phi}^{*}$. Since $\bm{\Phi}^{*}\alpha=0$, it follows that $\alpha^{\top}\bm{\Phi}^{*}\alpha=0$. Moreover, for another activation $\beta\notin\text{span}\langle\alpha\rangle$ in the nullspace, any linear combination $a\alpha+b\beta$ satisfies

$$(a\alpha+b\beta)^{\top}\bm{\Phi}^{*}(a\alpha+b\beta)=0.$$

This suggests a strategy for designing effective feedback based on the kernel $\mathrm{Ker}(\bm{\Phi}^{*})$ and the null space $\mathrm{null}(\bm{\Phi}^{*})$ of $\bm{\Phi}^{*}$ (see Appendix B for the table of notations). This intuition is formalized by the eigendecomposition of the feature matrix:

$$\bm{\Phi}^{*}=\sum_{i=1}^{r}\lambda_{i}u_{i}u_{i}^{\top},$$ (3)

where $\{\lambda_{i}\}$ are the eigenvalues and $\{u_{i}\}$ are the orthonormal eigenvectors. Since $\bm{\Phi}^{*}\succeq 0$, this decomposition is unique with non-negative eigenvalues.

To teach $\bm{\Phi}^{*}$, the agent can employ a dual approach: teaching the kernel associated with the eigenvectors in this decomposition and the null space separately. Specifically, the agent can provide feedbacks corresponding to the eigenvectors of $\bm{\Phi}^{*}$'s kernel and extend the basis $\{u_{i}\}$ for the null space. We first present the following useful result (see proof in Appendix E).

Lemma 3.

Let $\{v_{i}\}_{i=1}^{r}\subset\mathbb{R}^{p}$ be a set of orthogonal vectors. Then, the set of rank-1 matrices

$$\mathcal{B}:=\left\{v_{i}v_{i}^{\top},\ (v_{i}+v_{j})(v_{i}+v_{j})^{\top}\ \bigg|\ 1\leq i<j\leq r\right\}$$

is linearly independent in the space of symmetric matrices $\textsf{Sym}(\mathbb{R}^{p\times p})$.

Using this construction, the agent can provide feedbacks of the form $(u_{i},\sqrt{c_{i}}y)$ for some $y\in\mathbb{R}^{p}$ with $\bm{\Phi}^{*}y\neq 0$ and $u_{i}^{\top}\bm{\Phi}^{*}u_{i}=c_{i}\,y^{\top}\bm{\Phi}^{*}y$ to teach the kernel of $\bm{\Phi}^{*}$. For an orthogonal extension $\{u_{i}\}_{i=r+1}^{p}$ where $\bm{\Phi}^{*}u_{i}=0$ for all $i=r+1,\dots,p$, feedbacks of the form $(u_{i},0)$ suffice to teach the null space of $\bm{\Phi}^{*}$.
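A minimal sketch of this teaching strategy is given below (NumPy; the function name and tolerance are illustrative): the agent eigendecomposes $\bm{\Phi}^{*}$, emits pairs built from $\{u_i\}$ and $\{u_i+u_j\}$ against a common reference direction, and adds $(u_i,0)$ feedbacks for a null-space basis extension.

```python
import numpy as np

def eigendecomposition_feedbacks(Phi_star, tol=1e-10):
    """Constructive equality pairs (a, b) with a^T Phi* a = b^T Phi* b, built from
    the eigenspace/null-space strategy behind Theorem 1 (illustrative sketch)."""
    p = Phi_star.shape[0]
    lam, U = np.linalg.eigh(Phi_star)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    r = int(np.sum(lam > tol))
    y = U[:, 0]                              # reference direction with Phi* y != 0
    ref = y @ Phi_star @ y
    pairs = []
    # eigenvector feedbacks (u_i, sqrt(c_i) y); the i = 0 pair is redundant
    # (it compares y with itself), matching the "-1" in Theorem 1
    for i in range(r):
        u = U[:, i]
        c = (u @ Phi_star @ u) / ref
        pairs.append((u, np.sqrt(c) * y))
    # pairwise sums (u_i + u_j), as in Lemma 3
    for i in range(r):
        for j in range(i + 1, r):
            v = U[:, i] + U[:, j]
            c = (v @ Phi_star @ v) / ref
            pairs.append((v, np.sqrt(c) * y))
    # null-space feedbacks (u_i, 0) for the basis extension
    for i in range(r, p):
        pairs.append((U[:, i], np.zeros(p)))
    return pairs   # r(r+1)/2 + (p - r) pairs in total
```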

This is the key idea underlying our study on feedback complexity in the general constructive setting that is stated below with the full proof deferred to Appendix E and F.

Theorem 1 (General Activations).

Let $\bm{\Phi}^{*}\in{\mathcal{M}}_{\textsf{F}}$ be a target feature matrix with $\text{rank}(\bm{\Phi}^{*})=r$. Then, in the setting of constructive feedbacks with general activations, the feedback complexity has a tight bound of $\Theta\left(\frac{r(r+1)}{2}+(p-r)-1\right)$ for Eq. (1).

Proof Outline.

As discussed above, we decompose the feature matrix $\bm{\Phi}^{*}$ into its eigenspace and null space, leveraging the linear independence of the constructed feedbacks to ensure that the span covers the necessary orthogonal complements. The upper bound is established with a simple observation: $r(r+1)/2-1$ pairs composed from ${\mathcal{B}}$ are sufficient to teach $\bm{\Phi}^{*}$ if the null space of $\bm{\Phi}^{*}$ is known, whereas the agent only needs to provide $(p-r)$ feedbacks corresponding to a basis extension to cover the null space; hence the stated upper bound is achieved.

The lower bound requires showing that a valid feedback set possesses two spanning properties of $\text{span}\langle xx^{\top}-yy^{\top}\rangle$ for all $(x,y)\in{\mathcal{F}}$: (1) it must include any $\bm{\Phi}\in\mathcal{O}_{\bm{\Phi}^{*}}$ whose column vectors are within the span of eigenvectors of $\bm{\Phi}^{*}$, and (2) it must include any $vv^{\top}$ for some subset $U$ that spans the null space of $\bm{\Phi}^{*}$ and $v\in U$. ∎

Learning with sparse activations

In the discussion above, we demonstrated a strategy for reducing the feedback complexity when general activations are allowed. Now, we aim to understand how this complexity changes when activations are $s$-sparse (see Definition 1) for some $s<p$. Notably, there exists a straightforward construction of rank-1 matrices using a sparse set of activations.

Consider this sparse set of activations ${B}$ consisting of $\frac{p(p+1)}{2}$ items in $\mathbb{R}^{p}$:

$${B}=\{e_{i}\mid 1\leq i\leq p\}\cup\{e_{i}+e_{j}\mid 1\leq i<j\leq p\},$$ (4)

where $e_{i}$ is the $i$th standard basis vector. Using a similar argument to Lemma 3, it is straightforward to show that the set of rank-1 matrices

$${\mathcal{B}}_{\textsf{sparse}}:=\left\{uu^{\top}\mid u\in{B}\right\}$$

is linearly independent in the space of symmetric matrices $\textsf{Sym}(\mathbb{R}^{p\times p})$ and forms a basis. Moreover, every activation in ${B}$ is at most 2-sparse (see Definition 1). With this, we state the main result on learning with sparse constructive feedback here.

Theorem 2 (Sparse Activations).

Let $\bm{\Phi}^{*}\in{\mathcal{M}}_{\textsf{F}}$ be the target feature matrix. If an agent can construct pairs of activations from a representation space $\mathbb{R}^{p}$, then the feedback complexity of the feature model ${\mathcal{M}}_{\textsf{F}}$ with 2-sparse activations is upper bounded by $\frac{p(p+1)}{2}$.

Remark: While the lower bound from Theorem 1 applies here, sparse settings may require even more feedbacks. Consider a rank-1 matrix $\bm{\Phi}^{*}=vv^{\top}$ with $\text{sparsity}(v)=p$. By the Pigeonhole principle, representing this using $s$-sparse activations requires at least $(p/s)^{2}$ rank-1 matrices. Thus, for constant sparsity $s=O(1)$, we need $\Omega(p^{2})$ feedbacks, implying that a sparse representation of dense features might not exploit the low-rank structure to minimize feedbacks.
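The 2-sparse construction of Eq. (4) is easy to instantiate and check numerically; the short sketch below (NumPy; the helper name is ours) builds $B$ and verifies that the induced rank-1 matrices span $\textsf{Sym}(\mathbb{R}^{p\times p})$.

```python
import numpy as np
from itertools import combinations

def two_sparse_basis(p):
    """The 2-sparse activation set B of Eq. (4): e_i and e_i + e_j for i < j."""
    I = np.eye(p)
    return [I[i] for i in range(p)] + [I[i] + I[j] for i, j in combinations(range(p), 2)]

# the induced rank-1 matrices {u u^T : u in B} form a basis of Sym(R^{p x p})
p = 5
B = two_sparse_basis(p)
G = np.stack([np.outer(u, u)[np.triu_indices(p)] for u in B])  # vech coordinates
print(len(B), p * (p + 1) // 2, np.linalg.matrix_rank(G))       # 15 15 15
```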

5 Sparse Feature Learning with Sampled Feedback

Given: Representation space ${\mathcal{V}}\subset\mathbb{R}^{p}$, distribution over representations ${\mathcal{D}}_{\mathcal{V}}$, feature family ${\mathcal{M}}_{\mathsf{F}}$.
In batch setting:
  1. Agent receives sampled representations ${\mathcal{V}}_{n}\sim{\mathcal{D}}_{{\mathcal{V}}}$.
  2. Agent picks pairs ${\mathcal{F}}({\mathcal{V}}_{n},\bm{\Phi}^{*})=\left\{(x,\sqrt{\lambda_{x}}y)\,|\,(x,y)\in{\mathcal{V}}_{n}^{2},\,x^{\top}\bm{\Phi}^{*}x=\lambda_{x}\cdot y^{\top}\bm{\Phi}^{*}y\right\}$.
  3. Learner receives ${\mathcal{F}}$ and obliviously picks a feature matrix $\bm{\Phi}\in{\mathcal{M}}_{\mathsf{F}}$ that satisfies the set of constraints in ${\mathcal{F}}({\mathcal{V}}_{n},\bm{\Phi}^{*})$.
  4. Learner outputs $\bm{\Phi}$.

Algorithm 2: Feature learning with sampled representations

In general, the assumption of constructive feedback may not hold in practice, as ground truth samples from nature or induced representations from a model are typically independently sampled from the representation space. The literature on Mahalanobis distance learning and dictionary learning has explored distributional assumptions on the sample/activation space (Mason et al., 2017; Gribonval et al., 2014).

In this section, we consider a more realistic scenario where the agent observes a set of representations/activations ${\mathcal{V}}_{n}:=\{\alpha_{1},\alpha_{2},\ldots,\alpha_{n}\}\sim{\mathcal{D}}_{{\mathcal{V}}}$, with ${\mathcal{D}}_{{\mathcal{V}}}$ being an unknown measure over the continuous space ${\mathcal{V}}\subseteq\mathbb{R}^{p}$. With these observations, the agent designs pairs of activations to teach a target feature matrix $\bm{\Phi}^{*}\in\textsf{Sym}_{+}(\mathbb{R}^{p\times p})$.

As demonstrated in Lemma 2, we can reduce inequality constraints with triplet comparisons to equality constraints with pairs in the constructive setting. However, when the agent is restricted to selecting activations from the sampled set ${\mathcal{V}}_{n}$ rather than arbitrarily from ${\mathcal{V}}$, this reduction no longer holds. This is evident from the following observation: let $\alpha,\beta\sim\text{i.i.d.}\ {\mathcal{D}}_{{\mathcal{V}}}$ and let $\bm{\Phi}^{*}\neq 0$ be a non-degenerate feature matrix. Then,

$$\alpha^{\top}\bm{\Phi}^{*}\alpha=\beta^{\top}\bm{\Phi}^{*}\beta\implies\sum_{i,j}(\alpha_{i}\alpha_{j}-\beta_{i}\beta_{j})\bm{\Phi}^{*}_{ij}=0.$$

This equation represents a non-zero polynomial. According to Sard's Theorem, the zero set of a non-zero polynomial has Lebesgue measure zero. Therefore,

$${\mathcal{P}}_{(\alpha,\beta)}\left(\left\{\alpha^{\top}\bm{\Phi}^{*}\alpha=\beta^{\top}\bm{\Phi}^{*}\beta\right\}\right)=0.$$

Given this, the agent cannot reliably construct pairs that satisfy the required equality constraints from independently sampled activations. Since a general triplet feedback only provides a ternary label, exact recovery up to feature equivalence is impossible. To address these limitations, we consider rescaling the sampled activations to enable the agent to design effective pairs for the target feature matrix $\bm{\Phi}^{*}\in{\mathcal{M}}_{\textsf{F}}$.

Figure 2: Sparse sampling: We consider the same setup as Fig. 1 for the target function $f^{*}(z)=z_{0}z_{1}z_{3}\mathbf{1}(z_{5}>0)$. In these plots, we employ sparse sampling feedback methods where an agent provides feedback based on $\bm{\Phi}^{*}$ under different sparsity probabilities (mu: probability of an entry being 0). As mu decreases, the theorized complexity of $p(p+1)/2=55$ feedbacks obtains a close approximation of $\bm{\Phi}^{*}$. For mu $=0.97$, however, the agent needs to sample more activations to approximate properly: the approximation gradually improves as the feedback size grows from 55 to 110, $\ldots$, up to 1100, as shown in Theorem 4.

Rescaled Pairs

For a given matrix $\bm{\Phi}\neq 0$, a sampled input $x\sim\mathcal{D}_{\mathcal{V}}$ is almost never orthogonal, i.e., almost surely $\bm{\Phi}x\neq 0$. This property can be utilized to rescale an input and construct pairs that satisfy equality constraints. Specifically, there exists a scalar $\lambda>0$ such that (assuming without loss of generality $x^{\top}\bm{\Phi}x>y^{\top}\bm{\Phi}y$),

$$x^{\top}\bm{\Phi}x=\lambda\cdot y^{\top}\bm{\Phi}y+y^{\top}\bm{\Phi}y=(\sqrt{1+\lambda}\,y)^{\top}\bm{\Phi}(\sqrt{1+\lambda}\,y).$$

Thus, the pair $(x,\sqrt{1+\lambda}\,y)$ satisfies the equality constraints. With this understanding, we reformulate Algorithm 1 into Algorithm 2.
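A minimal sketch of this rescaling step of Algorithm 2 is given below (NumPy; the helper name and the choice of a single reference activation are illustrative assumptions). The resulting pairs can be passed to the same recovery routine sketched after Eq. (2).

```python
import numpy as np

def rescaled_pairs(activations, Phi_star, eps=1e-12):
    """Turn sampled activations into equality pairs (x, sqrt(lambda_x) y) with
    x^T Phi* x = lambda_x * y^T Phi* y, as in Algorithm 2 (illustrative sketch)."""
    # pick the first activation with non-degenerate energy as the reference y
    y = next(a for a in activations if a @ Phi_star @ a > eps)
    ref = y @ Phi_star @ y
    pairs = []
    for x in activations:
        if x is y:
            continue
        lam_x = (x @ Phi_star @ x) / ref
        pairs.append((x, np.sqrt(lam_x) * y))   # equality holds by construction
    return pairs
```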

In this section, we analyze the feedback complexity in terms of the minimum number of sampled activations required for the agent to construct an effective feedback set achieving feature equivalence, as illustrated in Fig. 2. Our first result establishes complexity bounds for general activations (without sparsity constraints) sampled from a Lebesgue distribution, with the complete proof provided in Appendix G.

Theorem 3 (General Sampled Activations).

Consider a representation space $\mathcal{V}\subseteq\mathbb{R}^{p}$. Assume that the agent receives activations sampled i.i.d. from a Lebesgue distribution $\mathcal{D}_{\mathcal{V}}$. Then, for any target feature matrix $\bm{\Phi}^{*}\in\mathcal{M}_{\textsf{F}}$, with a tight bound of $n=\Theta\left(\frac{p(p+1)}{2}\right)$ on the feedback complexity, the oblivious learner (almost surely) learns $\bm{\Phi}^{*}$ up to feature equivalence using the feedback set $\mathcal{F}(\mathcal{V}_{n},\bm{\Phi}^{*})$, i.e.,

$$\mathcal{P}_{\mathcal{V}}\left(\forall\,\bm{\Phi}^{\prime}\in\mathcal{F}(\mathcal{V}_{n},\bm{\Phi}^{*}),\,\exists\lambda>0,\ \bm{\Phi}^{\prime}=\lambda\cdot\bm{\Phi}^{*}\right)=1.$$
Proof Outline.

The key observation is that, almost surely, for any $n\leq p(p+1)/2$ randomly sampled activations on a unit sphere $\mathbb{S}^{p-1}$ under the Lebesgue measure, the corresponding rank-1 matrices are linearly independent. This is a direct application of Sard's theorem on the zero set of a non-zero polynomial equation, yielding the upper bound. For the lower bound, we use some key necessary properties of a feedback set as elucidated in the proof of Theorem 1. This result essentially fixes certain activations that need to be spanned by a feedback set, but under a Lebesgue measure on a continuous domain, we know that the probability of sampling a direction is zero. ∎

We consider a fairly general distribution over sparse activations similar to the signal model in Gribonval et al. (2015).

Assumption 1 (Sparse-Distribution).

Each index of a sparse activation vector $\alpha\in\mathbb{R}^{p}$ is sampled i.i.d. from a sparse distribution defined as: for all $i$,

$$\mathcal{P}(\alpha_{i}=0)=p_{i},\quad\alpha_{i}\,|\,\alpha_{i}\neq 0\sim\text{Lebesgue}((0,1]).$$
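For the experiments below, one concrete instance of Assumption 1 takes every coordinate to be zero with a common probability mu and otherwise uniform on $(0,1]$ (one Lebesgue-type choice); a sketch of such a sampler is given below, with the parameter names assumed rather than taken from the analysis.

```python
import numpy as np

def sample_sparse_activations(n, p, mu, rng=None):
    """Sample n activations under (one instance of) Assumption 1: each coordinate
    is 0 with probability mu, and otherwise uniform on (0, 1]."""
    rng = np.random.default_rng(rng)
    mask = rng.random((n, p)) >= mu          # nonzero with probability 1 - mu
    vals = 1.0 - rng.random((n, p))          # values in (0, 1]
    return mask * vals
```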

With this we state the main theorem of the section with the proof deferred to Appendix H.

Theorem 4 (Sparse Sampled Activations).

Consider a representation space $\mathcal{V}\subseteq\mathbb{R}^{p}$. Assume that the agent receives representations sampled i.i.d. from a sparse distribution $\mathcal{D}_{\mathcal{V}}$. Fix a threshold $\delta>0$ and a sparsity parameter $s<p$. Then, for any target feature matrix $\bm{\Phi}^{*}\in\mathcal{M}_{\textsf{F}}$, with a bound of $n=O\left(p^{2}\left(\frac{2}{p_{\textsf{s}}^{2}}\log\frac{2}{\delta}\right)^{1/p^{2}}\right)$ on the feedback complexity using $s$-sparse feedbacks, the oblivious learner learns $\bm{\Phi}^{*}$ up to feature equivalence with high probability using the feedback set $\mathcal{F}(\mathcal{V}_{n},\bm{\Phi}^{*})$, i.e.,

$$\mathcal{P}_{\mathcal{V}}\left(\forall\,\bm{\Phi}^{\prime}\in\mathcal{F}(\mathcal{V}_{n},\bm{\Phi}^{*}),\,\exists\lambda>0,\ \bm{\Phi}^{\prime}=\lambda\cdot\bm{\Phi}^{*}\right)\geq 1-\delta,$$

where $p_{\textsf{s}}$ depends on ${\mathcal{D}}_{\mathcal{V}}$ and the sparsity parameter $s$.

Proof Outline.

Using the formulation of Eq. (2), we need to estimate the number of activations the agent needs to receive/sample before an induced set of $p(p+1)/2$ rank-1 linearly independent matrices is found. To estimate this, we first generalize the construction of the set ${\mathcal{B}}$ from the proof of Theorem 2 to

$$\widehat{U}_{g}=\left\{\lambda_{i}^{2}e_{i}e_{i}^{\top}:1\leq i\leq p\right\}\cup\left\{(\lambda_{iji}e_{i}+\lambda_{ijj}e_{j})(\lambda_{iji}e_{i}+\lambda_{ijj}e_{j})^{\top}:1\leq i<j\leq p\right\}.$$

We then analyze a design matrix $\mathbb{M}$ of rank-1 matrices from sampled activations and compute the probability of finding columns with entries semantically similar to those in $\widehat{U}_{g}$ (as vectorizations), ensuring a non-trivial determinant. The final complexity bound is derived using Hoeffding's inequality and Stirling's approximation. ∎

6 Experimental Setup

We empirically validate our theoretical framework for learning feature matrices. Our experiments (code: https://anonymous.4open.science/r/learnsparsefeatureswithfeedback-688E) examine different feedback mechanisms and teaching strategies across both synthetic tasks and large-scale neural networks.

Feedback Methods:

We evaluate four feedback mechanisms: (1) Eigendecomposition uses Lemma 3 to construct feedback based on $\bm{\Phi}$'s low-rank structure, (2) Sparse Constructive builds 2-sparse feedbacks using the basis in Eq. (4), (3) Random Sampling generates feedbacks spanning $\mathcal{O}_{\bm{\Phi}^{*}}$ from a Lebesgue distribution, and (4) Sparse Sampling creates feedbacks using $s$-sparse samples drawn from a sparse distribution (see Definition 1).

Teaching Agent:

We implement a teaching agent with access to the target feature matrix to enable numerical analysis. The agent constructs either specific basis vectors or receives activations from distributions (Lebesgue or sparse) based on the chosen feedback method. For problems with small dimensions, we utilize the cvxpy package to solve constraints of the form $\left\{\alpha\alpha^{\top}-yy^{\top}\right\}$. When handling larger dimensional features ($5000\times 5000$), where constraints scale to millions ($p(p+1)/2\approx 12.5$M), we employ batch-wise gradient descent for matrix regression.
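For the large-dictionary regime, a hedged sketch of such a batch-wise gradient descent is given below (NumPy): $\bm{\Phi}$ is parametrized as $WW^{\top}$ and the squared residuals $\langle\bm{\Phi},aa^{\top}-bb^{\top}\rangle^{2}$ over feedback pairs are driven to zero. The step size, batch size, and normalization are illustrative hyperparameters, not the exact settings used in our experiments.

```python
import numpy as np

def fit_feature_matrix(pairs, p, rank, steps=2000, lr=0.05, batch=256, seed=0):
    """Batch-wise gradient descent for the matrix regression over feedback pairs:
    parametrize Phi = W W^T (PSD) and minimize sum_i <Phi, a_i a_i^T - b_i b_i^T>^2
    (illustrative sketch; hyperparameters are assumptions)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((p, rank)) / np.sqrt(p)
    A = np.stack([a for a, _ in pairs])
    B = np.stack([b for _, b in pairs])
    for _ in range(steps):
        idx = rng.integers(0, len(pairs), size=min(batch, len(pairs)))
        Wa, Wb = A[idx] @ W, B[idx] @ W                    # (batch, rank)
        resid = (Wa ** 2).sum(1) - (Wb ** 2).sum(1)        # <Phi, a a^T - b b^T>
        grad = 4 * (A[idx] * resid[:, None]).T @ Wa / len(idx) \
             - 4 * (B[idx] * resid[:, None]).T @ Wb / len(idx)
        W -= lr * grad
        W /= max(np.linalg.norm(W), 1e-8)                  # keep the scale fixed
    return W @ W.T
```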

Features via RFM:

RFM (Radhakrishnan et al., 2024) considers a trainable shift-invariant kernel $K_{\bm{\Phi}}:{\mathcal{X}}\times{\mathcal{X}}\to\mathbb{R}$ corresponding to a symmetric PSD matrix $\bm{\Phi}$. At each iteration, the matrix $\bm{\Phi}_{(t)}$ is updated for the classifier $f_{\bm{\Phi}}(z)=\sum_{y_{i}\in\mathcal{D}_{\text{train}}}a_{i}K_{\bm{\Phi}}(y_{i},z)$ as follows: $\bm{\Phi}_{(t+1)}=\sum_{z\in{\mathcal{D}}_{\textsf{train}}}\big(\frac{\partial f_{\bm{\Phi}_{(t)}}}{\partial z}\big)\big(\frac{\partial f_{\bm{\Phi}_{(t)}}}{\partial z}\big)^{\top}$. We train target functions corresponding to monomials over samples in $\mathbb{R}^{10}$ using 4000 training samples. The feature matrix $\bm{\Phi}_{(t)}$ obtained after $t$ iterations is used as the ground truth against learning with feedbacks. Plots are shown in Fig. 1 and Fig. 2.
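The following is an illustrative, simplified RFM-style iteration (NumPy) on the monomial task of Fig. 1: solve the kernel regression, then replace $\bm{\Phi}$ by the averaged gradient outer product of the fitted predictor. We use a Gaussian Mahalanobis kernel here only because it has a compact closed-form gradient; the actual RFM of Radhakrishnan et al. (2024) and our experiments use its trained kernel machine, so treat this as a structural sketch under these assumptions.

```python
import numpy as np

def mahalanobis_gauss_kernel(X, Z, Phi, bw=1.0):
    # K(x, z) = exp(-(x - z)^T Phi (x - z) / (2 bw^2))
    d = X[:, None, :] - Z[None, :, :]
    return np.exp(-np.einsum('ijk,kl,ijl->ij', d, Phi, d) / (2 * bw ** 2))

def rfm_like_iteration(X, y, Phi, reg=1e-3, bw=1.0):
    """One simplified step: fit f(z) = sum_i a_i K(x_i, z), then return the
    averaged gradient outer product of f over the training points (the scale
    is immaterial up to feature equivalence)."""
    K = mahalanobis_gauss_kernel(X, X, Phi, bw)
    a = np.linalg.solve(K + reg * np.eye(len(X)), y)       # kernel weights a_i
    d = X[:, None, :] - X[None, :, :]                       # x_i - z_j
    # grad f(z_j) = sum_i a_i K(x_i, z_j) Phi (x_i - z_j) / bw^2
    G = np.einsum('i,ij,kl,ijl->jk', a, K, Phi, d) / bw ** 2
    return G.T @ G / len(X)

# usage on the monomial task of Fig. 1 (sketch)
rng = np.random.default_rng(0)
X = rng.normal(0, np.sqrt(0.5), size=(500, 10))
y = X[:, 0] * X[:, 1] * (X[:, 5] > 0)
Phi = np.eye(10)
for _ in range(5):
    Phi = rfm_like_iteration(X, y, Phi)
```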

SAE features of Large-Scale Models:

We analyze dictionaries from trained sparse autoencoders on Pythia-70M (Biderman et al., 2023) and Board Game Models (Karvonen et al., 2024), with dictionary dimensions of $32k\times 512$ and $4096\times 512$ respectively. These large-scale dimensions result in millions of feedback parameters, necessitating computationally efficient implementations of the teaching agent and constraint optimization. Plots for Pythia-70M and detailed implementation specifications are provided in Appendix I.

Since there could be numerical issues in computation for these large dictionaries, we compute the Pearson Correlation Coefficient (PCC) of the trained feature matrix $\bm{\Phi}^{\prime}$ with the target feature matrix $\bm{\Phi}^{*}$ to show their closeness:

$$\rho(\bm{\Phi}^{\prime},\bm{\Phi}^{*})=\frac{\sum_{i,j}(\bm{\Phi}^{\prime}_{ij}-\bar{\bm{\Phi}}^{\prime})(\bm{\Phi}^{*}_{ij}-\bar{\bm{\Phi}}^{*})}{\sqrt{\sum_{i,j}(\bm{\Phi}^{\prime}_{ij}-\bar{\bm{\Phi}}^{\prime})^{2}}\sqrt{\sum_{i,j}(\bm{\Phi}^{*}_{ij}-\bar{\bm{\Phi}}^{*})^{2}}},$$

where $\bar{\bm{\Phi}}^{\prime}$ and $\bar{\bm{\Phi}}^{*}$ represent the means of all elements in matrices $\bm{\Phi}^{\prime}$ and $\bm{\Phi}^{*}$ respectively. Note that the highest value of $\rho$ is 1.

Dictionary features of MLP layers of ChessGPT and OthelloGPT:

We use the SAEs trained for various MLP layers of the Board Game models ChessGPT and OthelloGPT considered in Karvonen et al. (2024). We analyze the dictionaries from MLP layers with dimensions $4096\times 512$. Note that $p(p+1)/2\approx 8.3$M. For the experiments, we use 3-sparse activations drawn from uniform sparse distributions. We present the plots for ChessGPT in Fig. 3 and Fig. 4 for different feedback methods. Additionally, we provide a table showing the Pearson Correlation Coefficient between the learned feature matrix and the target $\bm{\Phi}^{*}$ in Table 2.

Method | Eigendecomp. | Sparse Cons. | Sparse Samp. | Sparse Samp. | Sparse Samp. | Sparse Samp.
PCC | .9427 | .9773 | .9741 | .9625 | .8256 | .7152
Feedbacks | 134912 | 8390656 ($\approx$ 8M) | 10M | 4M | 2M | 1M
Table 2: Feedback sizes and Pearson Correlation Coefficients for the SAE dictionary of ChessGPT under different feedback methods. Dimension of the dictionary: $4096\times 512$; rank of the feature matrix $\textsf{D}\textsf{D}^{\top}$: 512; $s$-sparsity: 3; sparse distribution: uniform.
Figure 3: Visualization of 100 dimensions: Feature learning on a dictionary retrieved for an MLP layer of ChessGPT of dimension $4096\times 512$ (Karvonen et al., 2024). The plots show the feedback methods studied in Section 4 and Section 5: (Top leftmost): Ground Truth, (Top rightmost): Eigendecomposition, (Bottom leftmost): Sparse Constructive, and (Bottom rightmost): Sparse Sampling. Note that $p(p+1)/2$ is roughly 8M. Eigendecomposition and Sparse Constructive achieve very high PCC with the feedback complexity stated in Theorem 1 and Theorem 2. With 3-sparse sampled activations, the Sparse Sampling method performs poorly with just 10000 feedbacks, which gradually improves with the feedback size as shown in Fig. 4.
Figure 4: Sparse sampling: Plots show the improvement in PCC with 3-sparse uniformly sampled activations with respect to the ground truth shown above.

Acknowledgments

The author thanks the National Science Foundation for support under grant IIS-2211386 during this project. The author thanks Sanjoy Dasgupta (UCSD) for helping to develop the preliminary ideas of the work, Geelon So (UCSD) for many extended discussions on the project, and Mikhail (Misha) Belkin (UCSD) and Enric Boix-Adserà (MIT) for a helpful discussion during a visit to the Simons Institute. The general idea of the writing was developed while the author was visiting the Simons Institute (UC Berkeley) for a workshop, for which the travel was supported by the National Science Foundation (NSF) and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS-2031883 and #814639.

References

  • Abbe et al. (2022) Emmanuel Abbe, Emmanuel Boix-Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, pages 4782–4887. PMLR, 2022.
  • Agarwal et al. (2014) Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli, and Rashish Tandon. Learning sparsely used overcomplete dictionaries. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 123–137, Barcelona, Spain, 13–15 Jun 2014. PMLR. URL https://proceedings.mlr.press/v35/agarwal14a.html.
  • Arora et al. (2013) Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In Annual Conference Computational Learning Theory, 2013. URL https://api.semanticscholar.org/CorpusID:6978132.
  • Arora et al. (2016) Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385–399, 2016. doi: 10.1162/tacl_a_00106. URL https://aclanthology.org/Q16-1028/.
  • Arora et al. (2018) Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018.
  • Ba et al. (2022a) Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. arXiv preprint arXiv:2205.01445, 2022a.
  • Ba et al. (2022b) Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022b. URL https://openreview.net/forum?id=akddwRG6EGi.
  • Bellet et al. (2015) Aurélien Bellet, Amaury Habrard, and Marc Sebban. Metric learning, volume 30 of Synth. Lect. Artif. Intell. Mach. Learn. San Rafael, CA: Morgan & Claypool Publishers, 2015. ISBN 978-1-62705-365-5; 978-1-627-05366-2. doi: 10.2200/S00626ED1V01Y201501AIM030.
  • Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar Van Der Wal. Pythia: a suite for analyzing large language models across training and scaling. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  • Bordelon and Pehlevan (2022a) Blake Bordelon and Cengiz Pehlevan. Self-consistent dynamical field theory of kernel evolution in wide neural networks. Journal of Statistical Mechanics: Theory and Experiment, 2023, 2022a. URL https://api.semanticscholar.org/CorpusID:248887466.
  • Bordelon and Pehlevan (2022b) Blake Bordelon and Cengiz Pehlevan. Self-consistent dynamical field theory of kernel evolution in wide neural networks. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 32240–32256. Curran Associates, Inc., 2022b.
  • Bricken et al. (2023) Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
  • Chen et al. (2013) Yuxin Chen, Yuejie Chi, and Andrea J. Goldsmith. Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory, 61:4034–4059, 2013. URL https://api.semanticscholar.org/CorpusID:210337.
  • Damian et al. (2022) Andrei Damian, Jason Lee, and Mahdi Soltanolkotabi. Neural networks can learn representations with gradient descent. In Conference on Learning Theory, pages 5413–5452. PMLR, 2022.
  • Doshi-Velez and Kim (2017) Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning, 2017. URL https://arxiv.org/abs/1702.08608.
  • Elhage et al. (2022) Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/toy_model/index.html.
  • Faruqui et al. (2015) Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A Smith. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004, 2015.
  • Fawzi et al. (2022) Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, October 2022. ISSN 1476-4687. doi: 10.1038/s41586-022-05172-4. URL https://doi.org/10.1038/s41586-022-05172-4.
  • Fisher (1936) R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936. doi: https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
  • Gribonval et al. (2014) Rémi Gribonval, Rodolphe Jenatton, and Francis R. Bach. Sparse and spurious: Dictionary learning with noise and outliers. IEEE Transactions on Information Theory, 61:6298–6319, 2014. URL https://api.semanticscholar.org/CorpusID:217787222.
  • Gribonval and Schnass (2010) Rémi Gribonval and Karin Schnass. Dictionary identification—sparse matrix-factorization via 1\ell_{1} -minimization. IEEE Transactions on Information Theory, 56(7):3523–3539, 2010. doi: 10.1109/TIT.2010.2048466.
  • Gribonval et al. (2015) Rémi Gribonval, Rodolphe Jenatton, and Francis Bach. Sparse and spurious: Dictionary learning with noise and outliers. IEEE Transactions on Information Theory, 61(11):6298–6319, 2015. doi: 10.1109/TIT.2015.2472522.
  • Huben et al. (2024) Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK.
  • Jolliffe (1986) Ian T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
  • Karvonen et al. (2024) Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. Measuring progress in dictionary learning for language model interpretability with board game models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=SCEdoGghcw.
  • Kleindessner and von Luxburg (2016) Matthäus Kleindessner and Ulrike von Luxburg. Kernel functions based on triplet comparisons. In Neural Information Processing Systems, 2016.
  • Kulis (2013) Brian Kulis. Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364, 2013. ISSN 1935-8237. doi: 10.1561/2200000019. URL http://dx.doi.org/10.1561/2200000019.
  • Li and Voroninski (2013) Xiaodong Li and Vladislav Voroninski. Sparse signal recovery from quadratic measurements via convex programming. SIAM Journal on Mathematical Analysis, 45(5):3019–3033, 2013. doi: 10.1137/120893707. URL https://doi.org/10.1137/120893707.
  • Mallat and Zhang (1993) S.G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993. doi: 10.1109/78.258082.
  • Marks et al. (2024a) Samuel Marks, Adam Karvonen, and Aaron Mueller. dictionary_learning. https://github.com/saprmarks/dictionary_learning, 2024a.
  • Marks et al. (2024b) Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2024b. URL https://arxiv.org/abs/2403.19647.
  • Mason et al. (2017) Blake Mason, Lalit P. Jain, and Robert D. Nowak. Learning low-dimensional metrics. In Neural Information Processing Systems, 2017.
  • Mikolov et al. (2013) Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL https://aclanthology.org/N13-1090/.
  • Olah et al. (2020) Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.
  • Olshausen and Field (1997) Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision Research, 37(23):3311–3325, 1997. ISSN 0042-6989. doi: https://doi.org/10.1016/S0042-6989(97)00169-7. URL https://www.sciencedirect.com/science/article/pii/S0042698997001697.
  • Radhakrishnan et al. (2024) Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, and Mikhail Belkin. Mechanism for feature learning in neural networks and backpropagation-free machine learning models. Science, 383(6690):1461–1467, 2024. doi: 10.1126/science.adi5639. URL https://www.science.org/doi/abs/10.1126/science.adi5639.
  • Romera-Paredes et al. (2024) Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, January 2024. ISSN 1476-4687. doi: 10.1038/s41586-023-06924-6. URL https://doi.org/10.1038/s41586-023-06924-6.
  • Schultz and Joachims (2003) Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003.
  • Shi et al. (2022) Zhengdao Shi, Jerry Wei, and Yucheng Lian. A theoretical analysis on feature learning in neural networks: Emergence from inputs and advantage over fixed features. In International Conference on Learning Representations, 2022.
  • Subramanian et al. (2018) Anant Subramanian, Danish Pruthi, Harsh Jhamtani, Taylor Berg-Kirkpatrick, and Eduard Hovy. Spine: Sparse interpretable neural embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Weinberger and Saul (2009) Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, June 2009. ISSN 1532-4435.
  • Xing et al. (2002) Eric P. Xing, A. Ng, Michael I. Jordan, and Stuart J. Russell. Distance metric learning with application to clustering with side-information. In Neural Information Processing Systems, 2002.
  • Yang and Hu (2021a) Greg Yang and Edward J. Hu. Tensor Programs IV: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, 2021a.
  • Yang and Hu (2021b) Greg Yang and Edward J. Hu. Tensor programs iv: Feature learning in infinite-width neural networks. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11727–11737. PMLR, 18–24 Jul 2021b. URL https://proceedings.mlr.press/v139/yang21c.html.
  • Yun et al. (2021) Zeyu Yun, Yiren Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. arXiv preprint arXiv:2103.15949, 2021.
  • Zhang et al. (2019) Jeffrey Zhang, Yiren Chen, Brian Cheung, and Bruno A Olshausen. Word embedding visualization via dictionary learning. arXiv preprint arXiv:1910.03833, 2019.
  • Zhu et al. (2022) Lingxiao Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, and Mikhail Belkin. Quadratic models for understanding neural network dynamics. arXiv preprint arXiv:2205.11787, 2022.
  • Zhu et al. (2018) Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Rafferty. An overview of machine teaching, 2018. URL https://arxiv.org/abs/1801.05927.

Appendix A Table of Contents

Here we provide the table of contents for the supplementary appendices.

  • Appendix B provides a comprehensive table of additional notations used throughout the paper and supplementary material.
  • Appendix C contains the proof for Lemma 1, establishing conditions for recovering orthogonal representations.
  • Appendix D completes the proof of Proposition 1, establishing a worst-case lower bound on feedback complexity in the constructive setting.
  • Appendix E presents the proof for the upper bound in Theorem 1 for low-rank feature matrices.
  • Appendix F establishes the proof for the lower bound in Theorem 1 for low-rank feature matrices.
  • Appendix G details the proof of Theorem 3, which asserts tight bounds on feedback complexity for general sampled activations.
  • Appendix H demonstrates the proof of Theorem 4, establishing an upper bound on the feedback complexity for sparse sampled activations.
  • Appendix I provides supplementary experimental results validating our theoretical findings.

Appendix B Notations

Here we provide a glossary of the notation used in the supplementary material.

Symbol Description
α,β,x,y,z\alpha,\beta,x,y,z Activations
col(𝚽)\textsf{col}({\bm{\Phi}}) Set of columns of matrix 𝚽\bm{\Phi}
𝒟,𝒟sparse{\mathcal{D}},{\mathcal{D}}_{\textsf{sparse}} Distributions over activations
dd Dimension of ground-truth sample space
D Dictionary matrix
γ,λ,γi,λi\gamma,\lambda,\gamma_{i},\lambda_{i} Eigenvalues of a matrix
𝚽,𝚽\left<\bm{\Phi}^{\prime},\bm{\Phi}\right> Element-wise product (inner product) of matrices
Ker(𝚽)\mathrm{Ker}({\bm{\Phi}}) Kernel of matrix 𝚽\bm{\Phi}
μi,ui,vi\mu_{i},u_{i},v_{i} Eigenvectors (orthogonal vectors)
null(𝚽)\mathrm{null}({\bm{\Phi}}) Null set of matrix 𝚽\bm{\Phi}
𝒪𝚽\mathcal{O}_{\bm{\Phi}^{*}} Orthogonal complement of 𝚽\bm{\Phi}^{*} in Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p})
pp Dimension of representation space
𝚽,Σ\bm{\Phi},\Sigma Feature matrix
𝚽ij\bm{\Phi}_{ij} Entry at iith row and jjth column of 𝚽\bm{\Phi}
𝚽\bm{\Phi}^{*} Target feature matrix
rr Rank of a feature matrix
Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}) Space of symmetric matrices
Sym+(p×p)\textsf{Sym}_{+}(\mathbb{R}^{p\times p}) Space of PSD, symmetric matrices
VS(,F)\textsf{VS}({\mathcal{F}},{\mathcal{M}}_{\textsf{F}}) Version space of F{\mathcal{M}}_{\textsf{F}} wrt feedback set {\mathcal{F}}
V[r]V_{\left[r\right]} The set {v1,v2,,vr}\left\{v_{1},v_{2},\ldots,v_{r}\right\}
V[pr]V_{\left[p-r\right]} The set {vr+1,,vp}\left\{v_{r+1},\ldots,v_{p}\right\}
V[p]V_{\left[p\right]} Complete orthonormal basis {v1,v2,,vp}\left\{v_{1},v_{2},\ldots,v_{p}\right\}
𝒱p{\mathcal{V}}\subset\mathbb{R}^{p} Activation/Representation space
𝒳d{\mathcal{X}}\subset\mathbb{R}^{d} Ground truth sample space

Appendix C Proof of Lemma 1

In this appendix we restate and provide the proof of Lemma 1.

Lemma 1 (Recovering orthogonal atoms).

Let 𝚽p×p\bm{\Phi}\in\mathbb{R}^{p\times p} be a symmetric positive semi-definite matrix. Define the set of orthogonal Cholesky decompositions of 𝚽\bm{\Phi} as

𝒲CD={Up×r|𝚽=UU and UU=diag(λ1,,λr)},{\mathcal{W}}_{\textsf{CD}}=\left\{\textbf{U}\in\mathbb{R}^{p\times r}\,\bigg{|}\,\bm{\Phi}=\textbf{U}\textbf{U}^{\top}\text{ and }\textbf{U}^{\top}\textbf{U}=\text{diag}(\lambda_{1},\ldots,\lambda_{r})\right\},

where r=rank(𝚽)r=\text{rank}(\bm{\Phi}) and λ1,λ2,,λr\lambda_{1},\lambda_{2},\ldots,\lambda_{r} are the eigenvalues of 𝚽\bm{\Phi} in descending order. Then, for any two matrices U,U𝒲CD\textbf{U},\textbf{U}^{\prime}\in{\mathcal{W}}_{\textsf{CD}}, there exists an orthogonal matrix Rr×rR\in\mathbb{R}^{r\times r} such that

U=UR,\textbf{U}^{\prime}=\textbf{U}\textbf{R},

where R is block diagonal with orthogonal blocks corresponding to any repeated diagonal entries λi\lambda_{i} in UU\textbf{U}^{\top}\textbf{U}. Additionally, for non-repeated entries, each column of U\textbf{U}^{\prime} can differ from the corresponding column of U only by a sign change.

Proof.

Let U,U𝒲CD\textbf{U},\textbf{U}^{\prime}\in{\mathcal{W}}_{\textsf{CD}} be two orthogonal Cholesky decompositions of 𝚽\bm{\Phi}. Define R=Udiag(1/λ1,,1/λr)U\textbf{R}=\textbf{U}^{\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}^{\prime}. We will show that this matrix satisfies our requirements through the following steps:

First, we show that R is orthogonal. Note,

RR\displaystyle\textbf{R}^{\top}\textbf{R} =(Udiag(1/λ1,,1/λr)U)(Udiag(1/λ1,,1/λr)U)\displaystyle=(\textbf{U}^{\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}^{\prime})^{\top}(\textbf{U}^{\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}^{\prime})
=Udiag(1/λ1,,1/λr)UUdiag(1/λ1,,1/λr)U\displaystyle=\textbf{U}^{\prime\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}\textbf{U}^{\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}^{\prime}
=Udiag(1/λ1,,1/λr)𝚽diag(1/λ1,,1/λr)U\displaystyle=\textbf{U}^{\prime\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\bm{\Phi}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}^{\prime}
=Udiag(1/λ1,,1/λr)UUdiag(1/λ1,,1/λr)U\displaystyle=\textbf{U}^{\prime\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}^{\prime}\textbf{U}^{\prime\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}^{\prime}
=Udiag(1/λ1,,1/λr)U\displaystyle=\textbf{U}^{\prime\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}^{\prime}
=Ir\displaystyle=\textbf{I}_{r}

Similarly,

RR\displaystyle\textbf{R}\textbf{R}^{\top} =Udiag(1/λ1,,1/λr)U(U)diag(1/λ1,,1/λr)U\displaystyle=\textbf{U}^{\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}^{\prime}(\textbf{U}^{\prime})^{\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}
=Udiag(1/λ1,,1/λr)𝚽diag(1/λ1,,1/λr)U\displaystyle=\textbf{U}^{\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\bm{\Phi}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}
=Udiag(1/λ1,,1/λr)UUU\displaystyle=\textbf{U}^{\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}\textbf{U}^{\top}\textbf{U}
=Udiag(1/λ1,,1/λr)Udiag(λ1,,λr)\displaystyle=\textbf{U}^{\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}\text{diag}(\lambda_{1},\ldots,\lambda_{r})
=Ir\displaystyle=\textbf{I}_{r}

Now we show that U=UR\textbf{U}^{\prime}=\textbf{U}\textbf{R}.

UR =UUdiag(1/λ1,,1/λr)U\displaystyle=\textbf{U}\textbf{U}^{\top}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}^{\prime}
=𝚽diag(1/λ1,,1/λr)U\displaystyle=\bm{\Phi}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})\textbf{U}^{\prime}
=UUUdiag(1/λ1,,1/λr)\displaystyle=\textbf{U}^{\prime}\textbf{U}^{\prime\top}\textbf{U}^{\prime}\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})
=Udiag(λ1,,λr)diag(1/λ1,,1/λr)\displaystyle=\textbf{U}^{\prime}\text{diag}(\lambda_{1},\ldots,\lambda_{r})\text{diag}(1/\lambda_{1},\ldots,1/\lambda_{r})
=U\displaystyle=\textbf{U}^{\prime}

To show that 𝐑\mathbf{R} is block diagonal with orthogonal blocks corresponding to repeated eigenvalues, consider the partitioning based on distinct eigenvalues. Let k={iλi=γk}\mathcal{I}_{k}=\{i\mid\lambda_{i}=\gamma_{k}\} be the set of indices corresponding to the kk-th distinct eigenvalue γk\gamma_{k} of 𝚽\bm{\Phi}, for k=1,,Kk=1,\ldots,K, where KK is the number of distinct eigenvalues. Let mk=|k|m_{k}=|\mathcal{I}_{k}| denote the multiplicity of γk\gamma_{k}.

Define 𝐔k\mathbf{U}_{k} and 𝐔k\mathbf{U}^{\prime}_{k} as the submatrices of 𝐔\mathbf{U} and 𝐔\mathbf{U}^{\prime} consisting of columns indexed by k\mathcal{I}_{k}, respectively.

Now, consider the block 𝐑k\mathbf{R}_{k\ell} of 𝐑\mathbf{R} corresponding to eigenvalues γk\gamma_{k} and γ\gamma_{\ell}. For kk\neq\ell, 𝐔k\mathbf{U}_{k} and 𝐔\mathbf{U}^{\prime}_{\ell} correspond to different eigenspaces (as γkγ\gamma_{k}\neq\gamma_{\ell}), and thus their inner product is zero. Hence,

𝐔kdiag(1λ1,,1λr)𝐔=𝟎mk×m.\mathbf{U}_{k}^{\top}\text{diag}\left(\frac{1}{\lambda_{1}},\ldots,\frac{1}{\lambda_{r}}\right)\mathbf{U}^{\prime}_{\ell}=\mathbf{0}_{m_{k}\times m_{\ell}}.

This implies 𝐑k=𝟎mk×mfork.\mathbf{R}_{k\ell}=\mathbf{0}_{m_{k}\times m_{\ell}}\quad\text{for}\quad k\neq\ell.

But then 𝐑\mathbf{R} must be block diagonal:

𝐑=[𝐑1𝟎𝟎𝟎𝐑2𝟎𝟎𝟎𝐑K],\mathbf{R}=\begin{bmatrix}\mathbf{R}_{1}&\mathbf{0}&\cdots&\mathbf{0}\\ \mathbf{0}&\mathbf{R}_{2}&\cdots&\mathbf{0}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{0}&\mathbf{0}&\cdots&\mathbf{R}_{K}\\ \end{bmatrix},

where each 𝐑kmk×mk\mathbf{R}_{k}\in\mathbb{R}^{m_{k}\times m_{k}} is an orthogonal matrix. For eigenvalues with multiplicity one (mk=1m_{k}=1), the corresponding block 𝐑k\mathbf{R}_{k} is a 1×11\times 1 orthogonal matrix. The only possibilities are:

𝐑k=[1]or𝐑k=[1],\mathbf{R}_{k}=[1]\quad\text{or}\quad\mathbf{R}_{k}=[-1],

representing a sign change in the corresponding column of 𝐔\mathbf{U}. For eigenvalues with multiplicity greater than one (mk>1m_{k}>1), each block 𝐑k\mathbf{R}_{k} can be any mk×mkm_{k}\times m_{k} orthogonal matrix. This allows for rotations within the eigenspace corresponding to the repeated eigenvalue γk\gamma_{k}.

Combining all steps, we have shown that:

𝐔=𝐔𝐑,\mathbf{U}^{\prime}=\mathbf{U}\mathbf{R},

where 𝐑\mathbf{R} is an orthogonal, block-diagonal matrix. Each block 𝐑k\mathbf{R}_{k} corresponds to a distinct eigenvalue γk\gamma_{k} of 𝚽\bm{\Phi} and is either a 1×11\times 1 matrix with entry ±1\pm 1 (for unique eigenvalues) or an arbitrary orthogonal matrix of size equal to the multiplicity of γk\gamma_{k} (for repeated eigenvalues). This completes the proof of the lemma.
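As a quick numerical illustration of Lemma 1 (an added NumPy sketch for intuition, not part of the original argument, with illustrative variable names and parameters), the snippet below builds a rank-r PSD matrix with distinct eigenvalues, forms two decompositions in 𝒲_CD that differ by per-column sign flips, and verifies that they are related by an orthogonal matrix which reduces to ±1 entries on the diagonal.

# --- illustrative NumPy sanity check for Lemma 1 ---
import numpy as np

rng = np.random.default_rng(0)
p, r = 6, 3

# Rank-r PSD matrix Phi with distinct eigenvalues.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
V = Q[:, :r]                            # orthonormal eigenvectors
lam = np.array([3.0, 2.0, 1.0])         # distinct positive eigenvalues
Phi = V @ np.diag(lam) @ V.T

# Two decompositions in W_CD: eigenvectors scaled by sqrt(lambda_i),
# the second with some columns sign-flipped.
U = V * np.sqrt(lam)                    # U U^T = Phi and U^T U = diag(lam)
signs = np.diag([1.0, -1.0, -1.0])
U2 = U @ signs
assert np.allclose(U @ U.T, Phi) and np.allclose(U2 @ U2.T, Phi)
assert np.allclose(U.T @ U, np.diag(lam)) and np.allclose(U2.T @ U2, np.diag(lam))

# The matrix relating the two decompositions is orthogonal, and since the
# eigenvalues are distinct it reduces to per-column sign changes.
R = np.linalg.pinv(U) @ U2
print(np.allclose(R.T @ R, np.eye(r)))  # True
print(np.allclose(R, signs))            # True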

Appendix D Worst-case bounds: Constructive case

In this Appendix, we provide the proof of the lower bound as stated in Proposition 1. Before we prove this lower bound, we state a useful property of the sum of a symmetric, PSD matrix and a general symmetric matrix in Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}).

Lemma 5.

Let 𝚽Sym+(p×p)\bm{\Phi}\in\textsf{Sym}_{+}(\mathbb{R}^{p\times p}) be a symmetric matrix with full rank, i.e., rank(𝚽)=p\text{rank}({\bm{\Phi}})=p. For any arbitrary symmetric matrix 𝚽Sym(p×p)\bm{\Phi}^{\prime}\in\textsf{Sym}(\mathbb{R}^{p\times p}), there exists a positive scalar λ>0\lambda>0 such that the matrix (𝚽+λ𝚽)(\bm{\Phi}+\lambda\bm{\Phi}^{\prime}) is positive semidefinite.

Proof.

Since 𝚽\bm{\Phi} is symmetric and has full rank, it admits an eigendecomposition:

𝚽=i=1pλiuiui,\bm{\Phi}=\sum_{i=1}^{p}\lambda_{i}u_{i}u_{i}^{\top},

where {λi}i=1p\{\lambda_{i}\}_{i=1}^{p} are the positive eigenvalues and {ui}i=1p\{u_{i}\}_{i=1}^{p} are the corresponding orthonormal eigenvectors of 𝚽\bm{\Phi}.

Define the constant γ\gamma as the spectral norm of 𝚽\bm{\Phi}^{\prime}:

\gamma:=\|\bm{\Phi}^{\prime}\|_{2}=\max_{\|x\|_{2}=1}\left|x^{\top}\bm{\Phi}^{\prime}x\right|.

If γ=0\gamma=0 then 𝚽=0\bm{\Phi}^{\prime}=0 and any λ>0\lambda>0 suffices, so assume γ>0\gamma>0 and choose

\lambda:=\frac{\min_{1\leq i\leq p}\lambda_{i}}{\gamma}.

For any vector xpx\in\mathbb{R}^{p}, expanding x=i=1paiuix=\sum_{i=1}^{p}a_{i}u_{i} in the orthonormal eigenbasis gives

x^{\top}\bm{\Phi}x=\sum_{i=1}^{p}a_{i}^{2}\lambda_{i}\geq\left(\min_{1\leq i\leq p}\lambda_{i}\right)\|x\|_{2}^{2},

while the definition of the spectral norm yields |x𝚽x|γx22|x^{\top}\bm{\Phi}^{\prime}x|\leq\gamma\|x\|_{2}^{2}. Combining the two bounds,

x^{\top}(\bm{\Phi}+\lambda\bm{\Phi}^{\prime})x\geq\left(\min_{1\leq i\leq p}\lambda_{i}\right)\|x\|_{2}^{2}-\lambda\gamma\|x\|_{2}^{2}=0.

Therefore, (𝚽+λ𝚽)(\bm{\Phi}+\lambda\bm{\Phi}^{\prime}) is positive semidefinite. ∎
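As a numerical sanity check of Lemma 5 (an added sketch, not part of the original proof; the setup and seeds are illustrative), the following NumPy snippet picks the scalar λ as the smallest eigenvalue of Φ divided by the spectral norm of Φ′ and verifies that Φ + λΦ′ remains positive semidefinite.

# --- illustrative NumPy sanity check for Lemma 5 ---
import numpy as np

rng = np.random.default_rng(1)
p = 5

A = rng.standard_normal((p, p))
Phi = A @ A.T + p * np.eye(p)           # full-rank PSD matrix
B = rng.standard_normal((p, p))
Phi_prime = (B + B.T) / 2               # arbitrary symmetric (generally indefinite)

lam_min = np.linalg.eigvalsh(Phi)[0]    # smallest eigenvalue of Phi
gamma = np.linalg.norm(Phi_prime, 2)    # spectral norm of Phi_prime
lam = lam_min / gamma                   # the scalar from Lemma 5

# The perturbed matrix stays (numerically) positive semidefinite.
print(np.linalg.eigvalsh(Phi + lam * Phi_prime)[0] >= -1e-10)   # True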

We now provide the proof of Proposition 1.

Proof of Proposition 1.

Assume, for contradiction, that there exists a feedback set (𝒱,F,𝚽){\mathcal{F}}({\mathcal{V}},{\mathcal{M}}_{\textsf{F}},\bm{\Phi}^{*}) for Eq. (1) with size ||<(p(p+1)21)|{\mathcal{F}}|<\left(\frac{p(p+1)}{2}-1\right).

For each pair (y,z)(y,z)\in{\mathcal{F}}, 𝚽\bm{\Phi}^{*} is orthogonal to (yyzz)(yy^{\top}-zz^{\top}), implying that (yyzz)𝒪𝚽(yy^{\top}-zz^{\top})\in\mathcal{O}_{\bm{\Phi}^{*}}, the orthogonal complement of 𝚽\bm{\Phi}^{*}. Therefore,

span{yyzz}(y,z)𝒪𝚽.\text{span}\left<\{yy^{\top}-zz^{\top}\}_{(y,z)\in{\mathcal{F}}}\right>\subset\mathcal{O}_{\bm{\Phi}^{*}}.

This leads to

𝚽span{yyzz}(y,z).\bm{\Phi}^{*}\perp\text{span}\left<\{yy^{\top}-zz^{\top}\}_{(y,z)\in{\mathcal{F}}}\right>.

Since ||<p(p+1)21|{\mathcal{F}}|<\frac{p(p+1)}{2}-1, we have

dim(span{yyzz})<p(p+1)21.\dim\left(\text{span}\left<\{yy^{\top}-zz^{\top}\}\right>\right)<\frac{p(p+1)}{2}-1.

Adding 𝚽\bm{\Phi}^{*} to this span increases the dimension by at most one:

dim(span𝚽{yyzz}(y,z))p(p+1)21.\dim\left(\text{span}\left<\bm{\Phi}^{*}\cup\{yy^{\top}-zz^{\top}\}_{(y,z)\in{\mathcal{F}}}\right>\right)\leq\frac{p(p+1)}{2}-1.

Since Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}) is a vector space with dim(Sym(p×p))=p(p+1)2\dim(\textsf{Sym}(\mathbb{R}^{p\times p}))=\frac{p(p+1)}{2}, there exists a nonzero symmetric matrix 𝚽𝒪𝚽\bm{\Phi}^{\prime}\in\mathcal{O}_{\bm{\Phi}^{*}} such that

𝚽(yyzz)(y,z).\bm{\Phi}^{\prime}\perp(yy^{\top}-zz^{\top})\quad\forall\,(y,z)\in{\mathcal{F}}.

By Lemma 5, there exists λ>0\lambda>0 such that 𝚽+λ𝚽\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime} is PSD and symmetric. Since 𝚽𝒪𝚽\bm{\Phi}^{\prime}\in\mathcal{O}_{\bm{\Phi}^{*}} and 𝚽\bm{\Phi}^{\prime} is not a scalar multiple of 𝚽\bm{\Phi}^{*}, the matrix 𝚽+λ𝚽\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime} is not related to 𝚽\bm{\Phi}^{*} via linear scaling. However, it still satisfies Eq. (1), contradicting the assumption that {\mathcal{F}} is a valid feedback set identifying 𝚽\bm{\Phi}^{*} up to feature equivalence.

Thus, any feedback set must satisfy

||p(p+1)21.|{\mathcal{F}}|\geq\frac{p(p+1)}{2}-1.

This establishes the stated lower bound on the feedback complexity of the feedback set. ∎
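To illustrate the counting argument behind Proposition 1 (an added numerical sketch under our own illustrative setup; all names are assumptions), the snippet below generates fewer than p(p+1)/2 − 1 valid feedback pairs for a random full-rank PSD target and exhibits a nonzero symmetric matrix that is Frobenius-orthogonal to the target and to every feedback constraint; by Lemma 5, a small multiple of it can be added to the target without violating any constraint.

# --- illustrative NumPy sketch for the Proposition 1 counting argument ---
import numpy as np

rng = np.random.default_rng(2)
p = 4
dim_sym = p * (p + 1) // 2

# Full-rank PSD target Phi_star.
A = rng.standard_normal((p, p))
Phi_star = A @ A.T + p * np.eye(p)

# k < p(p+1)/2 - 1 valid feedback pairs: rescale z so that
# y^T Phi* y = z^T Phi* z, hence <Phi*, yy^T - zz^T> = 0.
k = dim_sym - 2
constraints = [Phi_star]
for _ in range(k):
    y, z = rng.standard_normal(p), rng.standard_normal(p)
    z = z * np.sqrt((y @ Phi_star @ y) / (z @ Phi_star @ z))
    constraints.append(np.outer(y, y) - np.outer(z, z))

# A basis of Sym(R^{p x p}).
basis = []
for i in range(p):
    for j in range(i, p):
        E = np.zeros((p, p))
        E[i, j] = E[j, i] = 1.0
        basis.append(E)

# Linear map from symmetric-basis coefficients to Frobenius inner products
# with the target and the feedback constraints.
M = np.array([[np.sum(C * B) for B in basis] for C in constraints])
_, s, Vt = np.linalg.svd(M)
coeffs = Vt[-1]                       # a null-space direction (M has fewer rows than dim_sym)
Phi_perp = sum(c * B for c, B in zip(coeffs, basis))   # nonzero symmetric matrix

# Phi_perp is orthogonal to Phi* and to every feedback constraint, so
# Phi* + lambda * Phi_perp (lambda small, via Lemma 5) is an alternative solution.
print(max(abs(np.sum(C * Phi_perp)) for C in constraints) < 1e-8)   # True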

Appendix E Proof of Theorem 1: Upper bound

Below we provide proof of the upper bound stated in Theorem 1.

Consider the eigendecomposition of the matrix 𝚽\bm{\Phi}^{*}. There exists a set of orthonormal vectors {v1,v2,,vr}\left\{v_{1},v_{2},\ldots,v_{r}\right\} with corresponding eigenvalues {γ1,γ2,,γr}\left\{\gamma_{1},\gamma_{2},\ldots,\gamma_{r}\right\} such that

𝚽=i=1rγivivi\displaystyle\bm{\Phi}^{*}=\sum_{i=1}^{r}\gamma_{i}v_{i}v_{i}^{\top} (5)

Denote the set of orthogonal vectors {v1,v2,,vr}\left\{v_{1},v_{2},\ldots,v_{r}\right\} as V[r]V_{\left[r\right]}.

Let {vr+1,,vp}\left\{v_{r+1},\dots,v_{p}\right\}, denoted as V[pr]V_{\left[p-r\right]}, be an orthogonal extension to the vectors in V[r]V_{\left[r\right]} such that

V[r]V[pr]={v1,v2,,vp}V_{\left[r\right]}\cup V_{\left[p-r\right]}=\left\{v_{1},v_{2},\ldots,v_{p}\right\}

forms an orthonormal basis for p\mathbb{R}^{p}. Denote the complete basis {v1,v2,,vp}\left\{v_{1},v_{2},\ldots,v_{p}\right\} as V[p]V_{\left[p\right]}.

Note that {vr+1,,vp}\left\{v_{r+1},\ldots,v_{p}\right\} precisely defines the null space of 𝚽\bm{\Phi}^{*}, i.e.,

null(𝚽)=span{vr+1,,vp}.\mathrm{null}({\bm{\Phi}^{*}})=\text{span}\left<\left\{v_{r+1},\ldots,v_{p}\right\}\right>.

The key idea of the proof is to manipulate this null space to satisfy the feedback set condition in Eq. (2) for the target matrix 𝚽\bm{\Phi}^{*}. Since 𝚽\bm{\Phi}^{*} has rank rpr\leq p, the number of degrees of freedom is exactly r(r+1)2\frac{r(r+1)}{2}. Alternatively, the span of the null space of 𝚽\bm{\Phi}^{*}, which has dimension exactly prp-r, fixes the remaining entries in 𝚽\bm{\Phi}^{*}.

Using this intuition, the teacher can provide pairs (y,z)𝒱2(y,z)\in{\mathcal{V}}^{2} to teach the null space and the eigenvectors {v1,v2,,vr}\left\{v_{1},v_{2},\ldots,v_{r}\right\} separately. However, it is necessary to ensure that this strategy is optimal in terms of sample efficiency. We confirm the optimality of this strategy in the next two lemmas.

E.1 Feedback set for the null space of 𝚽\bm{\Phi}^{*}

Our first result concerns enforcing the null space of 𝚽\bm{\Phi}^{*} via Eq. (2). Consider the partial feedback set

null={(0,vi)}i=r+1p\displaystyle{\mathcal{F}}_{\textsf{null}}=\left\{(0,v_{i})\right\}_{i=r+1}^{p}
Lemma 6.

If the teacher provides the set null{\mathcal{F}}_{\textsf{null}}, then the null space of any PSD symmetric matrix 𝚽\bm{\Phi}^{\prime} that satisfies Eq. (2) contains the span of {vr+1,,vp}\{v_{r+1},\ldots,v_{p}\}, i.e.,

{vr+1,,vp}null(𝚽).\{v_{r+1},\ldots,v_{p}\}\subseteq\mathrm{null}({\bm{\Phi}^{\prime}}).
Proof.

Let 𝚽Sym+(p×p)\bm{\Phi}^{\prime}\in\textsf{Sym}_{+}(\mathbb{R}^{p\times p}) be a matrix that satisfies Eq. (2) (note that 𝚽\bm{\Phi}^{*} satisfies Eq. (2)). Thus, we have the following equality constraints:

(0,v)null,v𝚽v=0.\forall(0,v)\in{\mathcal{F}}_{\textsf{null}},\quad v^{\top}\bm{\Phi}^{\prime}v=0.

Since {vr+1,,vp}\left\{v_{r+1},\ldots,v_{p}\right\} is a set of linearly independent vectors, it suffices to show that

vV[pr],v𝚽v=0𝚽v=0.\forall v\in V_{\left[p-r\right]},\quad v^{\top}\bm{\Phi}^{\prime}v=0\implies\bm{\Phi}^{\prime}v=0. (6)

To prove Eq. (6), we utilize general properties of the eigendecomposition of a symmetric, positive semi-definite matrix. We express 𝚽\bm{\Phi}^{\prime} in its eigendecomposition as

𝚽=i=1sγiuiui,\bm{\Phi}^{\prime}=\sum_{i=1}^{s}\gamma_{i}^{\prime}u_{i}u_{i}^{\top},

where {ui}i=1s\left\{u_{i}\right\}_{i=1}^{s} are the eigenvectors and {γi}i=1s\left\{\gamma_{i}^{\prime}\right\}_{i=1}^{s} are the corresponding eigenvalues of 𝚽\bm{\Phi}^{\prime}. Assume that x0px\neq 0\in\mathbb{R}^{p} satisfies

x𝚽x=0.x^{\top}\bm{\Phi}^{\prime}x=0.

Consider the decomposition x=i=1saiui+vx=\sum_{i=1}^{s}a_{i}u_{i}+v^{\prime} for scalars aia_{i} and v{ui}i=1sv^{\prime}\bot\{u_{i}\}_{i=1}^{s} . Now, expanding the equation above, we get

x𝚽x\displaystyle x^{\top}\bm{\Phi}^{\prime}x =(i=1saiui+v)𝚽(i=1saiui+v)\displaystyle=\left({\sum_{i=1}^{s}a_{i}u_{i}+v^{\prime}}\right)^{\top}\bm{\Phi}^{\prime}\left({\sum_{i=1}^{s}a_{i}u_{i}+v^{\prime}}\right)
=(i=1saiui)𝚽(i=1saiui)+v𝚽(i=1saiui)+(i=1saiui)𝚽v+v𝚽v\displaystyle=\left({\sum_{i=1}^{s}a_{i}u_{i}}\right)^{\top}\bm{\Phi}^{\prime}\left({\sum_{i=1}^{s}a_{i}u_{i}}\right)+v^{\prime\top}\bm{\Phi}^{\prime}\left({\sum_{i=1}^{s}a_{i}u_{i}}\right)+\left({\sum_{i=1}^{s}a_{i}u_{i}}\right)\bm{\Phi}^{\prime}v^{\prime}+v^{\prime\top}\bm{\Phi}^{\prime}v^{\prime}
=(i=1saiui)(i=1sγiuiui)(i=1saiui)+2v(i=1sγiuiui)(i=1saiui)+v(i=1sγiuiui)v= 0 as v{ui}\displaystyle=\left({\sum_{i=1}^{s}a_{i}u_{i}}\right)^{\top}\left({\sum_{i=1}^{s}\gamma_{i}^{\prime}u_{i}u_{i}^{\top}}\right)\left({\sum_{i=1}^{s}a_{i}u_{i}}\right)+\underbrace{2v^{\prime\top}\left({\sum_{i=1}^{s}\gamma_{i}^{\prime}u_{i}u_{i}^{\top}}\right)\left({\sum_{i=1}^{s}a_{i}u_{i}}\right)+v^{\prime\top}\left({\sum_{i=1}^{s}\gamma_{i}^{\prime}u_{i}u_{i}^{\top}}\right)v^{\prime}}_{=\,0\textnormal{ as }v^{\prime}\bot\left\{u_{i}\right\}}
=i,j,kaiui(γjujuj)akuk\displaystyle=\sum_{i,j,k}a_{i}u_{i}^{\top}(\gamma_{j}^{\prime}u_{j}u_{j}^{\top})a_{k}u_{k}
=i=1sai2γi=0\displaystyle=\sum_{i=1}^{s}a_{i}^{2}\gamma_{i}^{\prime}=0

Since γi>0\gamma_{i}^{\prime}>0 for all i=1,,si=1,\ldots,s (because 𝚽\bm{\Phi}^{\prime} is PSD), it follows that each ai=0a_{i}=0. Therefore,

𝚽x=𝚽v=0.\bm{\Phi}^{\prime}x=\bm{\Phi}^{\prime}v^{\prime}=0.

This implies that xnull(𝚽)x\in\mathrm{null}({\bm{\Phi}^{\prime}}), thereby proving Eq. (6).

Hence, if the teacher provides null{\mathcal{F}}_{\textsf{null}}, any solution 𝚽\bm{\Phi}^{\prime} to Eq. (2) must satisfy

{vr+1,,vp}null(𝚽).\{v_{r+1},\ldots,v_{p}\}\subseteq\mathrm{null}({\bm{\Phi}^{\prime}}).

With this, we argue that the feedback setup in Eq. (2) can be decomposed into two parts: first, teaching the null space null(𝚽):=span{vi}i=r+1p\mathrm{null}({\bm{\Phi}^{*}}):=\text{span}\left<\{v_{i}\}_{i=r+1}^{p}\right>, and second, teaching 𝒮𝚽=span{vi}i=1r\mathcal{S}_{\bm{\Phi}^{*}}=\text{span}\left<\{v_{i}\}_{i=1}^{r}\right> in the form 𝚽=i=1rγivivi\bm{\Phi}^{*}=\sum_{i=1}^{r}\gamma_{i}v_{i}v_{i}^{\top}.

Lemma 6 implies that, using a feedback set of the form null{\mathcal{F}}_{\textsf{null}}, any solution 𝚽Sym+(p×p)\bm{\Phi}^{\prime}\in\textsf{Sym}_{+}(\mathbb{R}^{p\times p}) to Eq. (2) satisfies V[pr]null(𝚽)V_{\left[p-r\right]}\subset\mathrm{null}({\bm{\Phi}^{\prime}}). Furthermore, |null|=pr|{\mathcal{F}}_{\textsf{null}}|=p-r.
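The key implication Eq. (6) behind Lemma 6 can be illustrated numerically as follows (an added NumPy sketch with illustrative parameters, not part of the original analysis): for a PSD matrix, a vanishing quadratic form forces the vector into the null space, which is exactly what each pair (0, v_i) in F_null enforces.

# --- illustrative NumPy sanity check for Lemma 6 / Eq. (6) ---
import numpy as np

rng = np.random.default_rng(3)
p, s = 6, 3

# A PSD matrix Phi' of rank s; its null space is the orthogonal complement of col(G).
G = rng.standard_normal((p, s))
Phi_prime = G @ G.T

# A vector x orthogonal to col(G): its quadratic form vanishes and, as in Eq. (6),
# Phi' x = 0 as well.
Q, _ = np.linalg.qr(G, mode='complete')
x = Q[:, s:] @ rng.standard_normal(p - s)
print(abs(x @ Phi_prime @ x) < 1e-10, np.linalg.norm(Phi_prime @ x) < 1e-10)  # True True

# Conversely, any vector with a component along col(G) has a strictly positive
# quadratic form, so it cannot satisfy the feedback pair (0, x).
x2 = x + 0.1 * G[:, 0]
print(x2 @ Phi_prime @ x2 > 0)  # True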

E.2 Feedback set for the kernel of 𝚽\bm{\Phi}^{*}

Next, we discuss how to teach V[r]V_{\left[r\right]}, i.e., how to ensure that V[r]V_{\left[r\right]} spans the rows of any solution 𝚽Sym+(p×p)\bm{\Phi}^{\prime}\in\textsf{Sym}_{+}(\mathbb{R}^{p\times p}) to Eq. (2) with the corresponding eigenvalues {γi}i=1r\left\{\gamma_{i}\right\}_{i=1}^{r}. We show that if the search space of metrics in Eq. (2) is the version space VS(F,null)\textsf{VS}({\mathcal{M}}_{\textsf{F}},{\mathcal{F}}_{\textsf{null}}), which is the restriction of the space F{\mathcal{M}}_{\textsf{F}} to the feedback set null{\mathcal{F}}_{\textsf{null}}, then a feedback set of size at most r(r+1)21\frac{r(r+1)}{2}-1 is sufficient to teach 𝚽\bm{\Phi}^{*} up to feature equivalence. Thus, we consider the reformulation of the problem in Eq. (2) as

(y,z)(𝒳,VS(F,null),𝚽),𝚽(yyzz)=0\displaystyle\forall(y,z)\in{\mathcal{F}}({\mathcal{X}},\textsf{VS}({\mathcal{M}}_{\textsf{F}},{\mathcal{F}}_{\textsf{null}}),\bm{\Phi}^{*}),\quad\bm{\Phi}\bm{\cdot}(yy^{\top}-zz^{\top})=0 (7)

where the feedback set (𝒳,VS(F,null),𝚽){\mathcal{F}}({\mathcal{X}},\textsf{VS}({\mathcal{M}}_{\textsf{F}},{\mathcal{F}}_{\textsf{null}}),\bm{\Phi}^{*}) is devised for the smaller search space VS(F,null):={𝚽F|𝚽v=0,(0,v)null}\textsf{VS}({\mathcal{M}}_{\textsf{F}},{\mathcal{F}}_{\textsf{null}}):=\left\{\bm{\Phi}\in{\mathcal{M}}_{\textsf{F}}\,|\,\bm{\Phi}v=0,\forall(0,v)\in{\mathcal{F}}_{\textsf{null}}\right\}. With this, we state the following useful lemma on the size of the restricted feedback set (𝒳,VS(F,null),𝚽){\mathcal{F}}({\mathcal{X}},\textsf{VS}({\mathcal{M}}_{\textsf{F}},{\mathcal{F}}_{\textsf{null}}),\bm{\Phi}^{*}).

Lemma 7.

Consider the problem as formulated in Eq. (7), in which the null space null(𝚽)\mathrm{null}({\bm{\Phi}^{*}}) of the target matrix 𝚽\bm{\Phi}^{*} is known. Then a feedback set (𝒳,VS(null),𝚽){\mathcal{F}}({\mathcal{X}},\textsf{VS}({\mathcal{F}}_{\textsf{null}}),\bm{\Phi}^{*}) of size r(r+1)21\frac{r(r+1)}{2}-1 is both sufficient and necessary for the teacher to achieve oblivious learning up to feature equivalence.

Proof.

Note that any solution 𝚽\bm{\Phi}^{\prime} of Eq. (7) has its columns spanned exactly by V[r]V_{\left[r\right]}. Equivalently, if we consider the eigendecomposition of 𝚽\bm{\Phi}^{\prime}, then the corresponding eigenvectors lie in spanV[r]span\left<V_{\left[r\right]}\right>. Furthermore, note that 𝚽\bm{\Phi}^{*} is of rank rr, which implies there are only r(r+1)2\frac{r(r+1)}{2} degrees of freedom, i.e., entries of the matrix 𝚽\bm{\Phi}^{*}, that need to be fixed.

Thus, there are exactly rr linearly independent columns of 𝚽\bm{\Phi}^{*}, indexed as {j1,j2,,jr}\{j_{1},j_{2},\ldots,j_{r}\}. Now, consider the set of matrices

{𝚽(i,j)|i[p],j{j1,j2,,jr},𝚽ij(i,j)=𝟙[i{i,j},j{i,j}{i}]}\left\{\bm{\Phi}^{(i,j)}\,|\,i\in\left[p\right],j\in\{j_{1},j_{2},\ldots,j_{r}\},\bm{\Phi}^{(i,j)}_{i^{\prime}j^{\prime}}=\mathds{1}[i^{\prime}\in\{i,j\},j^{\prime}\in\{i,j\}\setminus\{i^{\prime}\}]\right\}

This forms a basis that generates any matrix whose independent columns lie along the indexed set. Hence, the span of 𝒮𝚽\mathcal{S}_{\bm{\Phi}^{*}} induces a subspace of symmetric matrices of dimension r(r+1)2\frac{r(r+1)}{2} in the vector space Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}), i.e., the column vectors along the indexed set are spanned by elements of 𝒮𝚽\mathcal{S}_{\bm{\Phi}^{*}}. Thus, picking a feedback set of size r(r+1)21\frac{r(r+1)}{2}-1 in the orthogonal complement 𝒪𝚽\mathcal{O}_{\bm{\Phi}^{*}} of 𝚽\bm{\Phi}^{*}, restricted to this span, sufficiently teaches 𝚽\bm{\Phi}^{*} when null(𝚽)\mathrm{null}({\bm{\Phi}^{*}}) is known. One exact form of this set is given in Lemma 3. Since any solution 𝚽\bm{\Phi}^{\prime} is determined only up to a scaling of the target matrix 𝚽\bm{\Phi}^{*}, this establishes the sufficiency of the stated feedback complexity for 𝚽\bm{\Phi}^{*} up to feature equivalence.

Now, we show that the stated feedback set size is necessary. The argument parallels the proof of Proposition 1 and relies on the perturbation idea from Lemma 5.

For the sake of contradiction, assume that there is a smaller feedback set small{\mathcal{F}}_{\textsf{small}}. This implies that there is some matrix 𝚽\bm{\Phi}^{\prime} in VS(F,null)\textsf{VS}({\mathcal{M}}_{\textsf{F}},{\mathcal{F}}_{\textsf{null}}), a subspace induced by span 𝒮𝚽\mathcal{S}_{\bm{\Phi}^{*}}, that is orthogonal to 𝚽\bm{\Phi}^{*} and not in the span of small{\mathcal{F}}_{\textsf{small}}. If 𝚽\bm{\Phi}^{\prime} is PSD then it is a solution to Eq. (7) and is not a scalar multiple of 𝚽\bm{\Phi}^{*}. Now, if 𝚽\bm{\Phi}^{\prime} is not PSD, we show that there exists a scalar λ>0\lambda>0 such that

𝚽+λ𝚽Sym+(p×p),\displaystyle\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime}\in\textsf{Sym}_{+}(\mathbb{R}^{p\times p}),

i.e., the sum is PSD. Consider the eigendecomposition of 𝚽\bm{\Phi}^{\prime} (assume rank(𝚽)=r\text{rank}({\bm{\Phi}^{\prime}})=r^{\prime})

𝚽=i=1rδiμiμi\displaystyle\bm{\Phi}^{\prime}=\sum_{i=1}^{r^{\prime}}\delta_{i}\mu_{i}\mu_{i}^{\top}

for orthogonal eigenvectors {μi}i=1r\left\{\mu_{i}\right\}_{i=1}^{r^{\prime}} and the corresponding eigenvalues {δi}i=1r\left\{\delta_{i}\right\}_{i=1}^{r^{\prime}}. Assume that r0rr_{0}\leq r^{\prime} of the eigenvalues are negative; ordering them first, we can rewrite 𝚽\bm{\Phi}^{\prime} as

𝚽=i=1r0δiμiμi+j=r0+1rδjμjμj\displaystyle\bm{\Phi}^{\prime}=\sum_{i=1}^{r_{0}}\delta_{i}\mu_{i}\mu_{i}^{\top}+\sum_{j=r_{0}+1}^{r^{\prime}}\delta_{j}\mu_{j}\mu_{j}^{\top}

Thus, if we can control the values of μi𝚽μi\mu^{\top}_{i}\bm{\Phi}^{*}\mu_{i} for all i=1,2,,r0i=1,2,\ldots,r_{0} (these are strictly positive, since each μi\mu_{i} lies in the span of V[r]V_{\left[r\right]}, on which 𝚽\bm{\Phi}^{*} is positive definite), then we can find an appropriate scalar λ>0\lambda>0. Let m:=mini[r0]μi𝚽μim^{*}:=\min_{i\in[r_{0}]}\mu_{i}^{\top}\bm{\Phi}^{*}\mu_{i} and :=maxi[r0]|δi|\ell^{*}:=\max_{i\in[r_{0}]}|\delta_{i}|. Now, setting λm\lambda\leq\frac{m^{*}}{\ell^{*}} achieves the desired property of 𝚽+λ𝚽\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime}, as shown in the proof of Lemma 5.

Note that both 𝚽\bm{\Phi}^{\prime} and 𝚽\bm{\Phi}^{*} are orthogonal to every element in the feedback set small{\mathcal{F}}_{\textsf{small}}, and hence so is 𝚽+λ𝚽\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime}. This implies that 𝚽\bm{\Phi}^{*} is not the unique solution to Eq. (7) up to a positive scaling factor.

Therefore, we have demonstrated that when the null set null(𝚽)\mathrm{null}({\bm{\Phi}^{*}}) of the target matrix 𝚽\bm{\Phi}^{*} is known, a feedback set of size exactly r(r+1)21\frac{r(r+1)}{2}-1 is both necessary and sufficient. ∎

E.3 Proof of Lemma 3 and construction of feedback set for Ker(𝚽)\mathrm{Ker}({\bm{\Phi}^{*}})

Up to this point we have not shown how to construct this feedback set of size r(r+1)21\frac{r(r+1)}{2}-1. Consider the following union:

{v1v1}{v2v2,(v2+v1)(v2+v1)}{vrvr,(v1+vr)(v1+vr),,(vr1+vr)(vr1+vr)}\displaystyle\left\{v_{1}v_{1}^{\top}\right\}\cup\left\{v_{2}v_{2}^{\top},(v_{2}+v_{1})(v_{2}+v_{1})^{\top}\right\}\cup\ldots\cup\left\{v_{r}v_{r}^{\top},(v_{1}+v_{r})(v_{1}+v_{r})^{\top},\ldots,(v_{r-1}+v_{r})(v_{r-1}+v_{r})^{\top}\right\}

We can show that this union is a set of linearly independent matrices of rank 1 as stated in Lemma 3 below.

Lemma 3.

Let {vi}i=1rp\{v_{i}\}_{i=1}^{r}\subset\mathbb{R}^{p} be a set of orthogonal vectors. Then, the set of rank-1 matrices

:={vivi,(vi+vj)(vi+vj)| 1i<jr}\mathcal{B}:=\left\{v_{i}v_{i}^{\top},\ (v_{i}+v_{j})(v_{i}+v_{j})^{\top}\ \bigg{|}\ 1\leq i<j\leq r\right\}

is linearly independent in the space of symmetric matrices Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}).

Proof.

We prove the claim by considering two separate cases. For the sake of contradiction, suppose that the set {\mathcal{B}} is linearly dependent. This implies that there exists at least one matrix of the form viviv_{i}v_{i}^{\top} or (vi+vj)(vi+vj)(v_{i}+v_{j})(v_{i}+v_{j})^{\top} that can be expressed as a linear combination of the other matrices in {\mathcal{B}}. We now examine these two cases individually.

Case 1: First, we assume that for some i[r]i\in[r], viviv_{i}v_{i}^{\top} can be written as a linear combination of the other elements of ℬ{\mathcal{B}}. Thus, there exist scalars that satisfy the following properties

vivi=j=1rαjvijvij+k=1r′′βk(vlk+vmk)(vlk+vmk)\displaystyle v_{i}v_{i}^{\top}=\sum_{j=1}^{r^{\prime}}\alpha_{j}v_{i_{j}}v_{i_{j}}^{\top}+\sum_{k=1}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{m_{k}})(v_{l_{k}}+v_{m_{k}})^{\top} (8)
j,k,αj,βk0,iji,lk<mk\forall j,k,\quad\alpha_{j},\beta_{k}\neq 0,i_{j}\neq i,l_{k}<m_{k} (9)

Now, note that we can write

k=1r′′βk(vlk+vmk)(vlk+vmk)=k=1,lk=ir′′βk(vlk+vmk)vlk+k=1,lkir′′βk(vlk+vmk)vlk+k=1r′′βk(vlk+vmk)vmk\displaystyle\sum_{k=1}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{m_{k}})(v_{l_{k}}+v_{m_{k}})^{\top}=\sum_{k=1,l_{k}=i}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{m_{k}})v_{l_{k}}^{\top}+\sum_{k=1,l_{k}\neq i}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{m_{k}})v_{l_{k}}^{\top}+\sum_{k=1}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{m_{k}})v_{m_{k}}^{\top}

But the following sum

j=1rαjvijvij+k=1,lkir′′βk(vlk+vmk)vlk+k=1r′′βk(vlk+vmk)vmk\displaystyle\sum_{j=1}^{r^{\prime}}\alpha_{j}v_{i_{j}}v_{i_{j}}^{\top}+\sum_{k=1,l_{k}\neq i}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{m_{k}})v_{l_{k}}^{\top}+\sum_{k=1}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{m_{k}})v_{m_{k}}^{\top}

doesn’t span (as column vectors) a subspace that contains the column vector viv_{i} because {vi}i=1r\left\{v_{i}\right\}_{i=1}^{r} is a set of orthogonal vectors. Thus, we can write

vivi=k=1,lk=ir′′βk(vlk+vmk)vlk=(k=1,lk=ir′′βkvlk+k=1,lk=ir′′βkvmk)vi\displaystyle v_{i}v_{i}^{\top}=\sum_{k=1,l_{k}=i}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{m_{k}})v_{l_{k}}^{\top}=\left({\sum_{k=1,l_{k}=i}^{r^{\prime\prime}}\beta_{k}v_{l_{k}}+\sum_{k=1,l_{k}=i}^{r^{\prime\prime}}\beta_{k}v_{m_{k}}}\right)v_{i}^{\top} (10)

This implies that

k=1,lk=ir′′βkvmk=0 if lk=i,βk=0\displaystyle\sum_{k=1,l_{k}=i}^{r^{\prime\prime}}\beta_{k}v_{m_{k}}=0\implies\textnormal{ if }l_{k}=i,\beta_{k}=0 (11)

Since not all βk=0\beta_{k}=0 corresponding to lk=il_{k}=i (otherwise k=1,lk=ir′′βkvlk=0\sum_{k=1,l_{k}=i}^{r^{\prime\prime}}\beta_{k}v_{l_{k}}=0 ) we have shown that viviv_{i}v_{i}^{\top} can not be written as a linear combination of elements in {vivi}{\mathcal{B}}\setminus\left\{v_{i}v_{i}^{\top}\right\}.

Case 2: Now, we consider the second case where there exists some indices i,ji,j such that (vi+vj)(vi+vj)(v_{i}+v_{j})(v_{i}+v_{j})^{\top} is a sum of linear combination of elements in {\mathcal{B}}. Note that this linear combination can’t have an element of type vkvkv_{k}v_{k}^{\top} as it contradicts the first case. So, there are scalars such that

(vi+vj)(vi+vj)=k=1r′′βk(vlk+vmk)(vlk+vmk)\displaystyle(v_{i}+v_{j})(v_{i}+v_{j})^{\top}=\sum_{k=1}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{m_{k}})(v_{l_{k}}+v_{m_{k}})^{\top} (12)
k,lk<mk\displaystyle\forall k,\quad l_{k}<m_{k} (13)

But we rewrite this as

(vi+vj)vi+(vi+vj)vj\displaystyle(v_{i}+v_{j})v_{i}^{\top}+(v_{i}+v_{j})v_{j}^{\top}
=\displaystyle= k=1,lk=ir′′βk(vi+vmk)vi+k=1,mk=jr′′βk(vlk+vj)vj+k=1,lki,mkjr′′βk(vlk+vmk)(vlk+vmk)\displaystyle\sum_{k=1,l_{k}=i}^{r^{\prime\prime}}\beta_{k}(v_{i}+v_{m_{k}})v_{i}^{\top}+\sum_{k=1,m_{k}=j}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{j})v_{j}^{\top}+\sum_{\begin{subarray}{c}k=1,l_{k}\neq i,\\ m_{k}\neq j\end{subarray}}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{m_{k}})(v_{l_{k}}+v_{m_{k}})^{\top}

Note that if lk=il_{k}=i then the corresponding mkjm_{k}\neq j and vice versa. Since {vi}i=1r\left\{v_{i}\right\}_{i=1}^{r} are orthogonal, the decomposition above implies

(vi+vj)vi=k=1,lk=ir′′βk(vi+vmk)vi\displaystyle(v_{i}+v_{j})v_{i}^{\top}=\sum_{k=1,l_{k}=i}^{r^{\prime\prime}}\beta_{k}(v_{i}+v_{m_{k}})v_{i}^{\top} (14)
(vi+vj)vj=k=1,mk=jr′′βk(vlk+vj)vj\displaystyle(v_{i}+v_{j})v_{j}^{\top}=\sum_{k=1,m_{k}=j}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{j})v_{j}^{\top} (15)
k=1,lki,mkjr′′βk(vlk+vmk)(vlk+vmk)=0\displaystyle\sum_{\begin{subarray}{c}k=1,l_{k}\neq i,\\ m_{k}\neq j\end{subarray}}^{r^{\prime\prime}}\beta_{k}(v_{l_{k}}+v_{m_{k}})(v_{l_{k}}+v_{m_{k}})^{\top}=0 (16)

But by the same arguments as in Eq. (10) and Eq. (11), neither Eq. (14) nor Eq. (15) can hold, which yields a contradiction.

Thus, we have shown that the set of rank-1 matrices described in {\mathcal{B}} is linearly independent. ∎
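As a numerical check of Lemma 3 (an added sketch with illustrative parameters, not part of the original proof), the snippet below builds the set ℬ from random orthonormal vectors and verifies that its r(r+1)/2 elements are linearly independent.

# --- illustrative NumPy sanity check for Lemma 3 ---
import numpy as np

rng = np.random.default_rng(4)
p, r = 7, 4

V, _ = np.linalg.qr(rng.standard_normal((p, r)))       # orthonormal v_1, ..., v_r

# The rank-1 matrices of Lemma 3: v_i v_i^T and (v_i + v_j)(v_i + v_j)^T, i < j.
mats = [np.outer(V[:, i], V[:, i]) for i in range(r)]
mats += [np.outer(V[:, i] + V[:, j], V[:, i] + V[:, j])
         for i in range(r) for j in range(i + 1, r)]

# Linear independence: the stacked vectorizations have rank r(r+1)/2.
M = np.array([m.ravel() for m in mats])
print(len(mats) == r * (r + 1) // 2)                    # True
print(np.linalg.matrix_rank(M) == r * (r + 1) // 2)     # True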

In Lemma 7, we showed that in order to teach 𝚽\bm{\Phi}^{*} the agent needs a feedback set of size r(r+1)21\frac{r(r+1)}{2}-1 when the null space of 𝚽\bm{\Phi}^{*} is known. We can construct this feedback set using the basis from Lemma 3. We state this result in the following lemma.

Lemma 9.

For a given target matrix 𝚽=i=1rγivivi\bm{\Phi}^{*}=\sum_{i=1}^{r}\gamma_{i}v_{i}v_{i}^{\top} and basis set of matrices {\mathcal{B}} as shown in Lemma 3, the following set spans a subspace of dimension r(r+1)21\frac{r(r+1)}{2}-1 in Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}).

𝒪:={v1v1λ11yy,v2v2λ22yy,(v1+v2)(v1+v2)λ12yy,,vrvrλrryy,(v1+vr)(v1+vr)λ1ryy,,(vr1+vr)(vr1+vr)λ(r1)ryy}\mathcal{O}_{{\mathcal{B}}}:=\left\{\begin{aligned} &v_{1}v_{1}^{\top}-\lambda_{11}yy^{\top},v_{2}v_{2}^{\top}-\lambda_{22}yy^{\top},(v_{1}+v_{2})(v_{1}+v_{2})^{\top}-\lambda_{12}yy^{\top},\ldots,\\ &v_{r}v_{r}^{\top}-\lambda_{rr}yy^{\top},(v_{1}+v_{r})(v_{1}+v_{r})^{\top}-\lambda_{1r}yy^{\top},\ldots,\\ &(v_{r-1}+v_{r})(v_{r-1}+v_{r})^{\top}-\lambda_{(r-1)r}yy^{\top}\end{aligned}\right\}
where ypy\in\mathbb{R}^{p} satisfies y𝚽y0y^{\top}\bm{\Phi}^{*}y\neq 0, and
\forall i,j,\quad\lambda_{ii}=\frac{v_{i}^{\top}\bm{\Phi}^{*}v_{i}}{y^{\top}\bm{\Phi}^{*}y},\quad\lambda_{ij}=\frac{(v_{i}+v_{j})^{\top}\bm{\Phi}^{*}(v_{i}+v_{j})}{y^{\top}\bm{\Phi}^{*}y}\quad(i\neq j)
Proof.

Since 𝚽\bm{\Phi}^{*} has exactly rr positive eigenvalues, there exists a vector ypy\in\mathbb{R}^{p} such that y𝚽y0y^{\top}\bm{\Phi}^{*}y\neq 0. It is straightforward to verify that every element of 𝒪\mathcal{O}_{{\mathcal{B}}} is orthogonal to 𝚽\bm{\Phi}^{*}. As 𝒪span\mathcal{O}_{{\mathcal{B}}}\subset\text{span}\langle{\mathcal{B}}\rangle, 𝚽span\bm{\Phi}^{*}\in\text{span}\langle{\mathcal{B}}\rangle, and 𝚽𝒪\bm{\Phi}^{*}\bot\mathcal{O}_{{\mathcal{B}}}, we get dim(span𝒪)=r(r+1)21\dim(\text{span}\langle\mathcal{O}_{{\mathcal{B}}}\rangle)=\frac{r(r+1)}{2}-1. ∎
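The construction of Lemma 9 can be checked numerically as follows (an added sketch; the helper name corrected, the parameters, and the seed are illustrative assumptions): every element of O_B is Frobenius-orthogonal to Φ*, and the set spans a subspace of dimension r(r+1)/2 − 1.

# --- illustrative NumPy sanity check for Lemma 9 ---
import numpy as np

rng = np.random.default_rng(5)
p, r = 6, 3

V, _ = np.linalg.qr(rng.standard_normal((p, r)))      # orthonormal v_1, ..., v_r
gam = np.array([2.5, 1.5, 0.5])
Phi_star = V @ np.diag(gam) @ V.T                      # rank-r target

y = V @ rng.standard_normal(r)                         # a vector with y^T Phi* y != 0
yy = np.outer(y, y)
q = y @ Phi_star @ y

def corrected(a):
    # a a^T recentred by the appropriate lambda so that it becomes
    # Frobenius-orthogonal to Phi_star (as in the definition of O_B).
    return np.outer(a, a) - ((a @ Phi_star @ a) / q) * yy

O_B = [corrected(V[:, i]) for i in range(r)]
O_B += [corrected(V[:, i] + V[:, j]) for i in range(r) for j in range(i + 1, r)]

print(max(abs(np.sum(Phi_star * A)) for A in O_B) < 1e-10)        # orthogonality: True
M = np.array([A.ravel() for A in O_B])
print(np.linalg.matrix_rank(M) == r * (r + 1) // 2 - 1)           # dimension: True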

We now complete the proof of the main result of this appendix.

Proof of Theorem 1.

Combining the results from Lemma 6, Lemma 7, and Lemma 9, we conclude that the feedback setup in Eq. (2) can be effectively decomposed into teaching the null space and the span of the eigenvectors of 𝚽\bm{\Phi}^{*}. The constructed feedback sets ensure that 𝚽\bm{\Phi}^{*} is uniquely identified up to a linear scaling factor with optimal sample efficiency. ∎

Appendix F Proof of Theorem 1: Lower bound

In this appendix, we provide the proof of the lower bound as stated in Theorem 1. We proceed by first showing some useful properties on a valid feedback set (p,F,𝚽){\mathcal{F}}(\mathbb{R}^{p},{\mathcal{M}}_{\textsf{F}},\bm{\Phi}^{*}) for a target feature matrix 𝚽\bm{\Phi}^{*}. They are stated in Lemma 10 and Lemma 11.

First, we consider a basic spanning property of matrices (xxyy)(xx^{\top}-yy^{\top}) for any pair (x,y)(x,y)\in{\mathcal{F}} in the space of symmetric matrices Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}).

Lemma 10.

If 𝚽𝒪𝚽\bm{\Phi}\in\mathcal{O}_{\bm{\Phi}^{*}} such that spancol(𝚽)spanV[r]\text{span}\left<\textsf{col}({\bm{\Phi}})\right>\subset\text{span}\left<V_{\left[r\right]}\right> then 𝚽span\bm{\Phi}\in span\left<{\mathcal{F}}\right>.

Proof.

Consider a matrix 𝚽𝒪𝚽\bm{\Phi}\in\mathcal{O}_{\bm{\Phi}^{*}} such that spancol(𝚽)spanV[r]\text{span}\left<\textsf{col}({\bm{\Phi}})\right>\subset\text{span}\left<V_{\left[r\right]}\right>. Note that the eigendecomposition of 𝚽\bm{\Phi} (assume rank(𝚽)=r<r\text{rank}({\bm{\Phi}})=r^{\prime}<r)

𝚽=i=1rδiμiμi\displaystyle\bm{\Phi}=\sum_{i=1}^{r^{\prime}}\delta_{i}\mu_{i}\mu_{i}^{\top}

for orthogonal eigenvectors {μi}i=1r\left\{\mu_{i}\right\}_{i=1}^{r^{\prime}} and the corresponding eigenvalues {δi}i=1r\left\{\delta_{i}\right\}_{i=1}^{r^{\prime}} has the property that span{μi}i=1rspanV[r]span\left<\left\{\mu_{i}\right\}_{i=1}^{r^{\prime}}\right>\subset\text{span}\left<V_{\left[r\right]}\right>. Using arguments exactly as in the second half of the proof of Lemma 7, we can show there exists λ>0\lambda>0 such that 𝚽+λ𝚽VS(,F)\bm{\Phi}^{*}+\lambda\bm{\Phi}\in\textsf{VS}({\mathcal{F}},{\mathcal{M}}_{\textsf{F}}). But 𝚽+λ𝚽\bm{\Phi}^{*}+\lambda\bm{\Phi} is not feature equivalent to 𝚽\bm{\Phi}^{*}, which contradicts the assumption that {\mathcal{F}} is a valid feedback set. ∎

Lemma 11.

There exist vectors U[pr]null(𝚽)U_{\left[p-r\right]}\subset\mathrm{null}({\bm{\Phi}^{*}}) (of size prp-r) such that spanU[pr]=null(𝚽)\text{span}\left<U_{\left[p-r\right]}\right>=\mathrm{null}({\bm{\Phi}^{*}}) and for any vector vU[pr]v\in U_{\left[p-r\right]}, vvspanvv^{\top}\in\text{span}\left<{\mathcal{F}}\right>.

Proof.

Assuming the contrary, there exists vspannull(𝚽)v\in\text{span}\left<\mathrm{null}({\bm{\Phi}^{*}})\right> such that vvspanvv^{\top}\notin\text{span}\left<{\mathcal{F}}\right>.

Now, if vvvv^{\top}\,\bot\,{\mathcal{F}}, then for any scalar λ>0\lambda>0, 𝚽+λvv\bm{\Phi}^{*}+\lambda vv^{\top} is both symmetric and positive semi-definite and satisfies all the conditions in Eq. (1) with respect to {\mathcal{F}}; this is a contradiction, as 𝚽+λvv\bm{\Phi}^{*}+\lambda vv^{\top} is not feature equivalent to 𝚽\bm{\Phi}^{*}.

So, consider the case when vv⟂̸vv^{\top}\,\not\perp\,{\mathcal{F}}. Let {vr+1,,vp1}\left\{v_{r+1},\ldots,v_{p-1}\right\} be an orthogonal extension of vv (if this set is empty, the proof follows directly) such that {vr+1,,vp1,v}\left\{v_{r+1},\ldots,v_{p-1},v\right\} forms a basis of null(𝚽)\mathrm{null}({\bm{\Phi}^{*}}), i.e.,

v{vr+1,,vp1}&span{vr+1,,vp1,v}=null(𝚽).\displaystyle v\bot\left\{v_{r+1},\ldots,v_{p-1}\right\}\quad\&\quad\text{span}\left<\left\{v_{r+1},\ldots,v_{p-1},v\right\}\right>=\mathrm{null}({\bm{\Phi}^{*}}).

We first show that there exists some 𝚽Sym(p×p)\bm{\Phi}^{\prime}\in\textsf{Sym}(\mathbb{R}^{p\times p}), not equal to λ𝚽\lambda\bm{\Phi}^{*} for any λ>0\lambda>0, that is orthogonal to {\mathcal{F}} and satisfies {vr+1,,vp1}null(𝚽)\left\{v_{r+1},\ldots,v_{p-1}\right\}\subset\mathrm{null}({\bm{\Phi}^{\prime}}).

Consider the intersection (in the space Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p})) of the orthogonal complement of the matrices {vr+1vr+1,,vp1vp1}\left\{v_{r+1}v_{r+1}^{\top},\ldots,v_{p-1}v_{p-1}^{\top}\right\}, denote it as 𝒪rest\mathcal{O}_{\textsf{rest}}, i.e.,

𝒪rest:=i=r+1p1𝒪vivi\displaystyle\mathcal{O}_{\textsf{rest}}:=\bigcap_{i=r+1}^{p-1}\mathcal{O}_{v_{i}v_{i}^{\top}}

Note that

dim(𝒪rest)=p(p+1)/2p+r\displaystyle\dim(\mathcal{O}_{\textsf{rest}})=p(p+1)/2-p+r

Since vvvv^{\top} is in 𝒪rest\mathcal{O}_{\textsf{rest}} and dim(𝒪rest)>1\dim(\mathcal{O}_{\textsf{rest}})>1 there exists some 𝚽\bm{\Phi}^{\prime} such that 𝚽𝚽\bm{\Phi}^{\prime}\perp\bm{\Phi}^{*}, and also orthogonal to elements in the feedback set {\mathcal{F}}. Thus, 𝚽\bm{\Phi}^{\prime} has a null set which includes the subset {vr+1,,vp1}\left\{v_{r+1},\ldots,v_{p-1}\right\}.

Now, the rest of the proof involves showing the existence of some scalar λ>0\lambda>0 such that 𝚽+λ𝚽\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime} satisfies the conditions of Eq. (1) for the feedback set {\mathcal{F}}. Note that if v𝚽v=0v^{\top}\bm{\Phi}^{\prime}v=0 then the proof is straightforward, as span{vr+1,,vp1,v}null(𝚽)\text{span}\left<\left\{v_{r+1},\ldots,v_{p-1},v\right\}\right>\subset\mathrm{null}({\bm{\Phi}^{\prime}}), which implies spancol(𝚽)spanV[r]\text{span}\left<\textsf{col}({\bm{\Phi}^{\prime}})\right>\subset\text{span}\left<V_{[r]}\right>. But this is precisely the condition for Lemma 10 to hold.

Without loss of generality, assume that v𝚽v>0v^{\top}\bm{\Phi}^{\prime}v>0. First, note that the eigenvectors of 𝚽\bm{\Phi}^{\prime} are contained in spanV[r]{v}\text{span}\left<V_{[r]}\cup\left\{v\right\}\right>. Consider some arbitrary choice of λ>0\lambda>0; we will fix its value later. It is straightforward that 𝚽+λ𝚽\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime} is symmetric, since 𝚽\bm{\Phi}^{*} and 𝚽\bm{\Phi}^{\prime} are symmetric. In order to show it is positive semi-definite, it suffices to show that

up,u(𝚽+λ𝚽)u0\displaystyle\forall u\in\mathbb{R}^{p},u^{\top}(\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime})u\geq 0 (17)

Since {vr+1,,vp1}(null(𝚽)null(𝚽))\left\{v_{r+1},\ldots,v_{p-1}\right\}\subset\left({\mathrm{null}({\bm{\Phi}^{*}})\cap\mathrm{null}({\bm{\Phi}^{\prime}})}\right) we can simplify Eq. (17) to

uspanV[r]{v},u(𝚽+λ𝚽)u0\displaystyle\forall u\in\text{span}\left<V_{[r]}\cup\left\{v\right\}\right>,u^{\top}(\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime})u\geq 0 (18)

Consider the decomposition of any arbitrary vector uspanV[r]{v}u\in\text{span}\left<V_{[r]}\cup\left\{v\right\}\right> as follows:

u=u[r]+v, such that u[r]spanV[r],vspan{v}\displaystyle u=u_{[r]}+v^{\prime},\textnormal{ such that }u_{[r]}\in\text{span}\left<V_{[r]}\right>,v^{\prime}\in\text{span}\left<\{v\}\right> (19)
u[r]:=i=1rαivi,iαi\displaystyle u_{[r]}:=\sum_{i=1}^{r}\alpha_{i}v_{i},\;\;\forall i\;\alpha_{i}\in\mathbb{R} (20)

From here on we assume that u[r]0u_{[r]}\neq 0. The alternate case is trivial as v𝚽v>0v^{\prime\top}\bm{\Phi}^{\prime}v^{\prime}>0.

Now, we write the vectors as scalar multiples of their corresponding unit vectors

u_{[r]}=\delta_{r}\cdot\hat{u}_{r},\;\;\hat{u}_{r}:=\frac{u_{[r]}}{||u_{[r]}||_{V_{[r]}}},\quad||u_{[r]}||^{2}_{V_{[r]}}:=\sum_{i=1}^{r}\alpha_{i}^{2} (21)
v^{\prime}=\delta_{v^{\prime}}\cdot\hat{v},\;\;\hat{v}:=\frac{v}{||v||_{2}} (22)

Remark: Although we compute the norm of u[r]u_{[r]} as u[r]V[r]||u_{[r]}||_{V_{[r]}} in the orthonormal basis V[r]V_{[r]}, it coincides with the 2\ell_{2} norm; this form is used for ease of analysis later on.

Using the decomposition in Eq. (19)-(20), we can write Eq. (18) as

u(𝚽+λ𝚽)u\displaystyle u^{\top}(\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime})u =(u[r]+v)(𝚽+λ𝚽)(u[r]+v)\displaystyle=(u_{[r]}+v^{\prime})^{\top}(\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime})(u_{[r]}+v^{\prime})
=u[r]𝚽u[r]+λ(u[r]+v)𝚽(u[r]+v)\displaystyle=u_{[r]}^{\top}\bm{\Phi}^{*}u_{[r]}+\lambda(u_{[r]}+v^{\prime})^{\top}\bm{\Phi}^{\prime}(u_{[r]}+v^{\prime})
=δr2u^r𝚽u^r+λ(δr2u^r𝚽u^r+2δrδvu^r𝚽v^+δv2v^𝚽v^)\displaystyle=\delta_{r}^{2}\cdot\hat{u}_{r}^{\top}\bm{\Phi}^{*}\hat{u}_{r}+\lambda\big{(}\delta_{r}^{2}\cdot\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{u}_{r}+2\delta_{r}\delta_{v^{\prime}}\cdot\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{v}+\delta^{2}_{v^{\prime}}\cdot\hat{v}^{\top}\bm{\Phi}^{\prime}\hat{v}\big{)} (23)

Since we want u(𝚽+λ𝚽)u0u^{\top}(\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime})u\geq 0 we can further simplify Eq. (23) as

\hat{u}_{r}^{\top}\bm{\Phi}^{*}\hat{u}_{r}+\lambda\left({\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{u}_{r}+2\frac{\delta_{r}\delta_{v^{\prime}}}{\delta_{r}^{2}}\cdot\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{v}+\frac{\delta^{2}_{v^{\prime}}}{\delta^{2}_{r}}\cdot\hat{v}^{\top}\bm{\Phi}^{\prime}\hat{v}}\right)\underset{?}{\geq}0 (24)
\Longleftrightarrow\underbrace{\hat{u}_{r}^{\top}\bm{\Phi}^{*}\hat{u}_{r}}_{(1)}+\lambda\left({\underbrace{\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{u}_{r}}_{(3)}+\underbrace{2\xi\cdot\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{v}+\xi^{2}\cdot\hat{v}^{\top}\bm{\Phi}^{\prime}\hat{v}}_{(2)}}\right)\underset{?}{\geq}0 (25)

where we have used ξ=δvδr\xi=\frac{\delta_{v^{\prime}}}{\delta_{r}}. In the next part of the proof, we show that (1) is lower bounded by a positive constant, (2) is bounded in magnitude by a positive constant, and there is a choice of λ\lambda so that the contribution of (3) is always smaller than (1).

Considering (1), we note that u^r\hat{u}_{r} is a unit vector with respect to the orthonormal basis V[r]V_{[r]}. Expanding using the eigendecomposition in Eq. (5),

u^r𝚽u^r=i=1rαi2i=1rαi2γiminiγi>0\displaystyle\hat{u}_{r}^{\top}\bm{\Phi}^{*}\hat{u}_{r}=\sum_{i=1}^{r}\frac{\alpha^{2}_{i}}{\sum_{i=1}^{r}\alpha_{i}^{2}}\cdot\gamma_{i}\geq\min_{i}\gamma_{i}>0

The last inequality follows as all the eigenvalues in the eigendecomposition are strictly positive. Denote this minimum eigenvalue as γmin:=miniγi\gamma_{\min}:=\min_{i}\gamma_{i}.

Considering (2), note that the only variable quantity is ξ\xi: the vector v^\hat{v} is fixed and u^r\hat{u}_{r} has a fixed (unit) norm by Eq. (21)-(22), so |u^r𝚽v^|C|\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{v}|\leq C for some bounded constant C>0C>0, whereas v^𝚽v^\hat{v}^{\top}\bm{\Phi}^{\prime}\hat{v} is already a constant. Now, |2ξu^r𝚽v^||2\xi\cdot\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{v}| exceeds ξ2v^𝚽v^\xi^{2}\cdot\hat{v}^{\top}\bm{\Phi}^{\prime}\hat{v} only if

|2\xi\cdot\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{v}|\geq|\xi^{2}\cdot\hat{v}^{\top}\bm{\Phi}^{\prime}\hat{v}|\Longleftrightarrow\frac{2|\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{v}|}{\hat{v}^{\top}\bm{\Phi}^{\prime}\hat{v}}\geq\xi\implies\frac{2C}{\hat{v}^{\top}\bm{\Phi}^{\prime}\hat{v}}\geq\xi

The rightmost inequality implies that 2ξu^r𝚽v^+ξ2v^𝚽v^2\xi\cdot\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{v}+\xi^{2}\cdot\hat{v}^{\top}\bm{\Phi}^{\prime}\hat{v} can be negative only for ξ\xi bounded from above by a positive constant. But since ξ\xi is non-negative,

|2\xi\cdot\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{v}+\xi^{2}\cdot\hat{v}^{\top}\bm{\Phi}^{\prime}\hat{v}|\leq C^{\prime}\quad(\textnormal{a bounded constant})

Now, using an argument similar to the second half of the proof of Lemma 7, it is straightforward to show that there is a choice of λ>0\lambda^{\prime}>0 so that (3) is always smaller than (1).

Now, for λ=λ2Cλ′′\lambda=\frac{\lambda^{\prime}}{2\lceil C^{\prime}\rceil\lambda^{\prime\prime}}, where λ′′\lambda^{\prime\prime} is chosen so that γminλλ′′\gamma_{\min}\geq\frac{\lambda^{\prime}}{\lambda^{\prime\prime}}, we note that

\hat{u}_{r}^{\top}\bm{\Phi}^{*}\hat{u}_{r}+\lambda\left({\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{u}_{r}+2\xi\cdot\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{v}+\xi^{2}\cdot\hat{v}^{\top}\bm{\Phi}^{\prime}\hat{v}}\right)\geq\gamma_{\min}+\frac{\lambda^{\prime}}{2\lceil C^{\prime}\rceil\lambda^{\prime\prime}}\hat{u}_{r}^{\top}\bm{\Phi}^{\prime}\hat{u}_{r}-\frac{\lambda^{\prime}}{2\lambda^{\prime\prime}}>0.

Using the equivalence in Eq. (23), Eq. (24) and Eq. (25), we have a choice of λ>0\lambda>0 such that u(𝚽+λ𝚽)u0u^{\top}(\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime})u\geq 0 for any arbitrary vector uspanV[r]{v}u\in\text{span}\left<V_{[r]}\cup\left\{v\right\}\right>. Hence, we have achieved the conditions in Eq. (18), which is the simplification of Eq. (17). This implies that 𝚽+λ𝚽\bm{\Phi}^{*}+\lambda\bm{\Phi}^{\prime} is positive semi-definite.

This implies that there does not exist a vnull(𝚽)v\in\mathrm{null}({\bm{\Phi}^{*}}) such that vvspanvv^{\top}\notin\text{span}\left<{\mathcal{F}}\right>; otherwise, the assumption that {\mathcal{F}} is an oblivious feedback set for 𝚽\bm{\Phi}^{*} is violated. Thus, the statement of Lemma 11 holds. ∎

F.1 Proof of lower bound in Theorem 1

In the following, we provide proof of the main statement on the lower bound of the size of a feedback set.

If the spanning conditions of either of the two lemmas (10-11) are violated, we can show there exist λ>0\lambda>0 and 𝚽\bm{\Phi} such that 𝚽+λ𝚽VS(,F)\bm{\Phi}^{*}+\lambda\bm{\Phi}\in\textsf{VS}({\mathcal{F}},{\mathcal{M}}_{\textsf{F}}). To satisfy both conditions, the feedback set must contain at least (r(r+1)2+(pr)1)\left({\frac{r(r+1)}{2}+(p-r)-1}\right) elements, which proves the lower bound on |||{\mathcal{F}}|.

Using Lemma 7 and Lemma 9, we know that the dimension of the span of matrices that satisfy the condition in Lemma 10 is at least r(r+1)21\frac{r(r+1)}{2}-1. We can apply Lemma 9 with y=i=1rviy=\sum_{i=1}^{r}v_{i} (note y𝚽y0y^{\top}\bm{\Phi}^{*}y\neq 0). Thus, any basis matrix in 𝒪\mathcal{O}_{{\mathcal{B}}} satisfies the conditions of Lemma 10.

Since the dimension of null(𝚽)\mathrm{null}({\bm{\Phi}^{*}}) is at least (dr)(d-r) thus there are at least (dr)(d-r) directions or linearly independent matrices (in Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p})) that need to be spanned by {\mathcal{F}}.

Thus, Lemma 10 implies that there are r(r+1)21\frac{r(r+1)}{2}-1 linearly independent matrices (in 𝒪𝚽\mathcal{O}_{\bm{\Phi}^{*}}) that need to be spanned by {\mathcal{F}}. Similarly, Lemma 11 implies that there are prp-r linearly independent matrices (in 𝒪𝚽\mathcal{O}_{\bm{\Phi}^{*}}) that need to be spanned by {\mathcal{F}}. Note that the column vectors of the matrices from the two statements are spanned by orthogonal sets of vectors, namely V[r]V_{[r]} and null(𝚽)\mathrm{null}({\bm{\Phi}^{*}}) respectively. Thus, these r(r+1)21+(pr)\frac{r(r+1)}{2}-1+(p-r) matrices are linearly independent in Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}), and this forces a lower bound on the size of {\mathcal{F}}: a lower-dimensional span cannot contain a set of vectors spanning a higher-dimensional space. This completes the proof of the lower bound in Theorem 1. ∎

Appendix G Proof of Theorem 3: General Activations Sampling

We aim to establish both upper and lower bounds on the feedback complexity for oblivious learning in Algorithm 2. The proof revolves around the linear independence of certain symmetric matrices derived from random representations and the dimensionality required to span a target feature matrix.

Define P=p(p+1)2P=\frac{p(p+1)}{2}, the dimension of Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}). The agent receives PP representations:

𝒱n:={v1,v2,,vP}𝒟𝒱.\mathcal{V}_{n}:=\{v_{1},v_{2},\ldots,v_{P}\}\sim\mathcal{D}_{\mathcal{V}}.

For each ii, we define the symmetric matrix Vi=viviV_{i}=v_{i}v_{i}^{\top}.

Consider the matrix 𝕄\mathbb{M} formed by concatenating the vectorized ViV_{i}:

𝕄=[vec(V1)vec(V2)vec(VP)],\mathbb{M}=\begin{bmatrix}\text{vec}(V_{1})&\text{vec}(V_{2})&\cdots&\text{vec}(V_{P})\end{bmatrix},

where each vec(Vi)\text{vec}(V_{i}) is treated as a column vector in P\mathbb{R}^{P}. The vectorization operation for a symmetric matrix ASym(p×p)A\in\textsf{Sym}(\mathbb{R}^{p\times p}) is defined as:

vec(A)k={Aiiif k corresponds to (i,i),Aij+Ajiif k corresponds to (i,j),i<j.\text{vec}(A)_{k}=\begin{cases}A_{ii}&\text{if }k\text{ corresponds to }(i,i),\\ A_{ij}+A_{ji}&\text{if }k\text{ corresponds to }(i,j),\ i<j.\end{cases}

The determinant det(𝕄)\det(\mathbb{M}) is a non-zero polynomial in the entries of v1,v2,,vPv_{1},v_{2},\ldots,v_{P}, and the zero set of a non-zero polynomial has Lebesgue measure zero (e.g., by Sard’s theorem). Since the vectors viv_{i} are drawn from a continuous distribution 𝒟𝒱\mathcal{D}_{\mathcal{V}}, the probability that det(𝕄)=0\det(\mathbb{M})=0 is zero, i.e.,

𝒫𝒱n(det(𝕄)=0)=0.\mathcal{P}_{\mathcal{V}_{n}}(\det(\mathbb{M})=0)=0.

This implies that, with probability 1, the set {V1,V2,,VP}\{V_{1},V_{2},\ldots,V_{P}\} is linearly independent in Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}):

𝒫𝒱n({vivi} is linearly independent in Sym(p×p))=1.\displaystyle\mathcal{P}_{\mathcal{V}_{n}}\left(\{v_{i}v_{i}^{\top}\}\text{ is linearly independent in }\textsf{Sym}(\mathbb{R}^{p\times p})\right)=1. (26)
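This linear-independence argument can be checked numerically. Below is a minimal NumPy sketch (illustrative only, not part of the paper's experiments) that builds the design matrix from P = p(p+1)/2 randomly drawn activations, using the vectorization defined above, and verifies that it has full rank:

```python
import numpy as np

def vec_sym(A):
    """Vectorization of a symmetric matrix as defined above:
    diagonal entries A_ii, followed by A_ij + A_ji for i < j."""
    p = A.shape[0]
    iu = np.triu_indices(p, k=1)
    return np.concatenate([np.diag(A), A[iu] + A.T[iu]])

p = 6
P = p * (p + 1) // 2                      # dimension of Sym(R^{p x p})
rng = np.random.default_rng(0)

# P activations drawn from a continuous distribution.
V = rng.standard_normal((P, p))

# Columns of the design matrix are vec(v_i v_i^T).
M = np.column_stack([vec_sym(np.outer(v, v)) for v in V])

# With probability 1 the columns are linearly independent (Eq. (26)).
print(np.linalg.matrix_rank(M), P)        # expected: both equal P = 21
```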

Next, let Σ0\Sigma^{*}\neq 0 be an arbitrary target feature matrix for learning with feedback in Algorithm 2. Without loss of generality, assume v:=v10v:=v_{1}\neq 0. Define the set \mathcal{F} of rescaled pairs as:

={(v,γivi)|Σ(vvγivivi)=0,γi>0},\mathcal{F}=\left\{\left(v,\sqrt{\gamma_{i}}v_{i}\right)\,\bigg{|}\,\Sigma^{*}\cdot\left(vv^{\top}-\gamma_{i}v_{i}v_{i}^{\top}\right)=0,\ \sqrt{\gamma_{i}}>0\right\},

noting that ||=P1|\mathcal{F}|=P-1.

Assume, for contradiction, that the matrices induced by \mathcal{F}, i.e. {vvγivivi}\left\{vv^{\top}-\gamma_{i}v_{i}v_{i}^{\top}\right\}, are linearly dependent in Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}). Then, there exist scalars {ai}\{a_{i}\} (not all zero) such that:

i=2Pai(vvγivivi)=0(i=2Pai)vv=i=2Paiγivivi.\sum_{i=2}^{P}a_{i}\left(vv^{\top}-\gamma_{i}v_{i}v_{i}^{\top}\right)=0\quad\Rightarrow\quad\left(\sum_{i=2}^{P}a_{i}\right)vv^{\top}=\sum_{i=2}^{P}a_{i}\gamma_{i}v_{i}v_{i}^{\top}.

However, since {vivi}\{v_{i}v_{i}^{\top}\} are linearly independent with probability 1, it must be that:

i=2Pai=0andaiγi=0i.\sum_{i=2}^{P}a_{i}=0\quad\text{and}\quad a_{i}\gamma_{i}=0\quad\forall i.

Given that γi>0\gamma_{i}>0, this implies ai=0a_{i}=0 for all ii, contradicting the assumption of linear dependence. Therefore, matrices induced by \mathcal{F} are linearly independent.

This means that {\mathcal{F}} induces a set of P1P-1 linearly independent matrices, i.e., {vvγivivi}\left\{vv^{\top}-\gamma_{i}v_{i}v_{i}^{\top}\right\}, in the orthogonal complement 𝒪Σ\mathcal{O}_{\Sigma^{*}}. Since Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}) has exactly PP degrees of freedom, any matrix ΣSym(p×p)\Sigma^{\prime}\in\textsf{Sym}(\mathbb{R}^{p\times p}) satisfying:

Σ(vvγivivi)=0i\Sigma^{\prime}\cdot\left(vv^{\top}-\gamma_{i}v_{i}v_{i}^{\top}\right)=0\quad\forall i

must be a positive scalar multiple of Σ\Sigma^{*}.

Thus, using Eq. (26), with probability 1, the feedback set \mathcal{F} is valid:

𝒫𝒱n( is a valid feedback set)=1.\mathcal{P}_{\mathcal{V}_{n}}\left(\mathcal{F}\text{ is a valid feedback set}\right)=1.

Since Σ\Sigma^{*} was arbitrary, the worst-case feedback complexity is almost surely upper bounded by P1P-1 for achieving feature equivalence.
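As a numerical sanity check on this construction, the sketch below (illustrative only; it interprets the condition \Sigma^{*}\cdot(vv^{\top}-\gamma_{i}v_{i}v_{i}^{\top})=0 as the trace inner product vanishing, an interpretation we assume here since it matches the relative-comparison feedback) builds the P-1 constraint matrices for a random target and verifies that the symmetric solutions of the resulting linear system form a one-dimensional space spanned by \Sigma^{*}:

```python
import numpy as np
from scipy.linalg import null_space

def vec_sym(A):
    # Same vectorization as before: diagonal, then A_ij + A_ji for i < j.
    p = A.shape[0]
    iu = np.triu_indices(p, k=1)
    return np.concatenate([np.diag(A), A[iu] + A.T[iu]])

p = 5
P = p * (p + 1) // 2
rng = np.random.default_rng(1)

# Random symmetric positive definite target Sigma*.
B = rng.standard_normal((p, p))
Sigma_star = B @ B.T

# P random activations; v := v_1.
V = rng.standard_normal((P, p))
v = V[0]

# gamma_i chosen so that <Sigma*, v v^T> = gamma_i <Sigma*, v_i v_i^T>.
rows = []
for vi in V[1:]:
    gamma = (v @ Sigma_star @ v) / (vi @ Sigma_star @ vi)
    rows.append(vec_sym(np.outer(v, v) - gamma * np.outer(vi, vi)))

# Symmetric solutions Sigma' of <Sigma', v v^T - gamma_i v_i v_i^T> = 0 for all i.
A = np.vstack(rows)                      # (P - 1) x P linear system
ns = null_space(A)
print(ns.shape[1])                       # expected: 1 (a single direction)

# The surviving direction is Sigma* itself, up to scale.
x_star = np.concatenate([np.diag(Sigma_star), Sigma_star[np.triu_indices(p, 1)]])
print(abs(ns[:, 0] @ x_star) / np.linalg.norm(x_star))   # expected: close to 1
```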

For the lower bound, consider the proof of the lower bound in Theorem 1, specifically Lemma 10, which asserts that for any feedback set \mathcal{F} in Algorithm 1 and any target matrix ΣSym(p×p)\Sigma^{*}\in\textsf{Sym}(\mathbb{R}^{p\times p}), if Σ𝒪Σ\Sigma\in\mathcal{O}_{\Sigma^{*}} is such that spancol(Σ)spanZ[r]\text{span}\left<\textsf{col}({\Sigma})\right>\subset\text{span}\left<Z_{\left[r\right]}\right>, then Σspan\Sigma\in\text{span}\left<{\mathcal{F}}\right>, where Z[r]Z_{[r]} (rdr\leq d) is the set of eigenvectors in the eigendecomposition of Σ\Sigma^{*} (see Eq. (5)).

This implies that any feedback set (𝒱n,Σ)\mathcal{F}(\mathcal{V}_{n},\Sigma^{*}) must span certain matrices ΣSym(p×p)\Sigma^{\prime}\in\textsf{Sym}(\mathbb{R}^{p\times p}). Suppose the agent receives \ell representations v1,v2,,v𝒟𝒱v_{1},v_{2},\ldots,v_{\ell}\sim\mathcal{D}_{\mathcal{V}} and constructs:

𝕄=[vec(Σ)vec(V1)vec(V)].\mathbb{M}=\begin{bmatrix}\text{vec}(\Sigma^{\prime})&\text{vec}(V_{1})&\cdots&\text{vec}(V_{\ell})\end{bmatrix}.

Now, consider when the columns of 𝕄\mathbb{M} are linearly dependent. Since the entries of 𝕄\mathbb{M} involve distinct continuous random variables, every maximal minor of 𝕄\mathbb{M} is a non-zero polynomial in these entries, and there are p(p+1)2\frac{p(p+1)}{2} degrees of freedom for the rows. Thus, the event that the columns of 𝕄\mathbb{M} are linearly dependent has Lebesgue measure zero if <p(p+1)2\ell<\frac{p(p+1)}{2}; in other words, 𝕄\mathbb{M} must contain at least p(p+1)2\frac{p(p+1)}{2} columns of the form vec(Vi)\text{vec}(V_{i}) for such a dependence to arise generically. But this implies that the set {vivi}i=1\left\{v_{i}v_{i}^{\top}\right\}_{i=1}^{\ell} cannot span Σ\Sigma^{\prime} (almost surely) if p(p+1)21\ell\leq\frac{p(p+1)}{2}-1. Hence, the agent almost surely cannot devise a feedback set for oblivious learning in Algorithm 2. In other words, if p(p+1)21\ell\leq\frac{p(p+1)}{2}-1,

𝒫𝒱(agent devises a feedback set  up to feature equivalence)=0\displaystyle{\mathcal{P}}_{{\mathcal{V}}_{\ell}}\left({\textnormal{agent devises}\textnormal{ a feedback set }{\mathcal{F}}\textnormal{ up to feature equivalence}}\right)=0

Hence, to span Σ\Sigma^{\prime}, the agent almost surely requires at least p(p+1)2\frac{p(p+1)}{2} representations. Therefore, the feedback complexity is lower bounded by Ω(p(p+1)2)\Omega\left(\frac{p(p+1)}{2}\right).

Combining the upper and lower bounds, we conclude that the feedback complexity for oblivious learning in Algorithm 2 is tightly bounded by Θ(p(p+1)2)\Theta\left(\frac{p(p+1)}{2}\right).

Appendix H Proof of Theorem 4: Sparse Activations Sampling

Here we analyze the case in which the activations 𝒱{\mathcal{V}} are sampled from the sparse distribution stated in Definition 1.

In the proof of Theorem 3, we assumed that the activations are sampled from a Lebesgue distribution. This suffices to ensure that (almost surely) any random sample of PP activations induces a set of linearly independent rank-1 matrices. Since the distribution in Assumption 1 is not a Lebesgue distribution over the entire support [0,1]\left[0,1\right], we need to identify the events under which the sampled activations induce linearly independent rank-1 matrices.

In the proof of Theorem 2, we used a set of sparse activations built from the standard basis of the vector space p\mathbb{R}^{p}. We note that this idea generalizes to an arbitrary choice of non-zero scalars as well, i.e.,

Ug={λiei:λi0,1ip}{(λijiei+λijjej):λiji,λijj0,1i<jp}.\displaystyle U_{g}=\{\lambda_{i}e_{i}:\lambda_{i}\neq 0,1\leq i\leq p\}\cup\{(\lambda_{iji}e_{i}+\lambda_{ijj}e_{j}):\lambda_{iji},\lambda_{ijj}\neq 0,1\leq i<j\leq p\}.

Here eie_{i} is the iith standard basis vector. Note that the corresponding set of rank-1 matrices, denoted as U^g\widehat{U}_{g}

U^g={λi2eieiT:1ip}{(λijiei+λijjej)(λijiei+λijjej)T:1i<jp}\displaystyle\widehat{U}_{g}=\left\{\lambda_{i}^{2}e_{i}e_{i}^{T}:1\leq i\leq p\right\}\cup\left\{(\lambda_{iji}e_{i}+\lambda_{ijj}e_{j})(\lambda_{iji}e_{i}+\lambda_{ijj}e_{j})^{T}:1\leq i<j\leq p\right\}

is linearly independent in the space of symmetric matrices on p\mathbb{R}^{p}, i.e., Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}).
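For concreteness, the following sketch (illustrative only) builds such a set for a small p with random non-zero scalars and confirms that the resulting p(p+1)/2 rank-1 matrices are linearly independent in Sym(R^{p×p}):

```python
import numpy as np

def vec_sym(A):
    p = A.shape[0]
    iu = np.triu_indices(p, k=1)
    return np.concatenate([np.diag(A), A[iu] + A.T[iu]])

p = 6
rng = np.random.default_rng(2)
e = np.eye(p)

# 1-sparse activations lambda_i e_i and 2-sparse activations
# lambda_iji e_i + lambda_ijj e_j with arbitrary non-zero scalars.
U_g = [rng.uniform(0.1, 1.0) * e[i] for i in range(p)]
U_g += [rng.uniform(0.1, 1.0) * e[i] + rng.uniform(0.1, 1.0) * e[j]
        for i in range(p) for j in range(i + 1, p)]

M = np.column_stack([vec_sym(np.outer(u, u)) for u in U_g])
print(np.linalg.matrix_rank(M), p * (p + 1) // 2)   # expected: equal values
```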

Assume that activations are sampled PP times, denoted as 𝒱P{\mathcal{V}}_{P}. Now, consider the design matrix 𝕄=[V1V2VP]\mathbb{M}=\left[V_{1}\quad V_{2}\quad\ldots V_{P}\right] as in the proof of Theorem 3. We know that if det(𝕄)\det(\mathbb{M}) is non-zero, then the matrices {Vi}\left\{V_{i}\right\} are linearly independent in Sym(p×p)\textsf{Sym}(\mathbb{R}^{p\times p}). To show that a sampled set 𝒱P{\mathcal{V}}_{P} exhibits this property, we need to show that det(𝕄)\det(\mathbb{M}) is not identically zero, which is possible for activations sampled from the sparse distribution in Assumption 1, since 𝒫v𝒟sparse(vi0)>0{\mathcal{P}}_{v\sim{\mathcal{D}}_{\textsf{sparse}}}(v_{i}\neq 0)>0.

Note that det(𝕄)=σPPsgn(σ)i𝕄iσ(i)\det(\mathbb{M})=\sum_{\sigma\in\textsf{P}_{P}}\text{sgn}(\sigma)\prod_{i}\mathbb{M}_{i\sigma(i)}. Consider the diagonal of 𝕄\mathbb{M} in the situation where all of its entries are non-zero; this corresponds to sampling a set of activations of the form U^g\widehat{U}_{g}. The corresponding random design matrix 𝕄\mathbb{M} takes the following form.

𝕄=[λ12λ1212λ22λ1222λ32λ(p1)p(p1)2λp2λ(p1)pp2λ121λ122λ(p1)p(p1)λ(p1)pp].\mathbb{M}=\begin{bmatrix}\lambda_{1}^{2}&\cdot&\cdot&\cdots&\cdot&\lambda_{121}^{2}&\cdots&\cdot\\ \cdot&\lambda_{2}^{2}&\cdot&\cdots&\cdot&\lambda_{122}^{2}&\cdots&\vdots\\ \cdot&\cdot&\lambda_{3}^{2}&\cdots&\cdot&\cdot&\cdots&\lambda_{(p-1)p(p-1)}^{2}\\ \vdots&\cdots&\cdots&\cdots&\lambda_{p}^{2}&\cdots&\cdots&\lambda_{(p-1)pp}^{2}\\ \vdots&\cdots&\cdots&\cdots&\cdot&\lambda_{121}\lambda_{122}&\cdots&\cdots\\ \vdots&\cdots&\cdots&\cdots&\cdot&\cdot&\cdots&\cdots\\ \cdot&\cdots&\cdots&\cdots&\cdot&\cdot&\cdot&\lambda_{(p-1)p(p-1)}\lambda_{(p-1)pp}\\ \end{bmatrix}.

Now, det(𝕄)\det(\mathbb{M}) is not identically zero for any set of PP randomly sampled activations that satisfies the following indexing property:

:={v:vi0,1ip}{v:vi,vj0,1i<jp}.\displaystyle{\mathcal{R}}:=\{v:v_{i}\neq 0,1\leq i\leq p\}\cup\{v:v_{i},v_{j}\neq 0,1\leq i<j\leq p\}. (27)

This is because, for the identity permutation, we have i𝕄ii0\prod_{i}\mathbb{M}_{ii}\neq 0. Now, we compute the probability that {\mathcal{R}} is sampled from 𝒟sparse{\mathcal{D}}_{\textsf{sparse}}. Using the independence of the sampling of each index of an activation, the probabilities for the two subsets of {\mathcal{R}} can be computed as follows:

  • pp activations {α1,α2,,αp}𝒟sparsep\left\{\alpha_{1},\alpha_{2},\cdots,\alpha_{p}\right\}\sim{\mathcal{D}}_{\textsf{sparse}}^{p} such that the iith entry of each αi\alpha_{i} is non-zero, i.e. (αi)i0(\alpha_{i})_{i}\neq 0. Using independence, we have

    𝒫1=i=0s1(p1i)pnzi+1(1pnz)p1i,{\mathcal{P}}_{1}=\sum_{i=0}^{s-1}\binom{p-1}{i}p_{nz}^{i+1}(1-p_{nz})^{p-1-i},
  • The remaining p(p1)/2p(p-1)/2 activations of {\mathcal{R}} in Eq. (27) require the two designated indices to be non-zero. The corresponding probability is

    𝒫2=i=0s2(p2i)pnzi+2(1pnz)p2i.{\mathcal{P}}_{2}=\sum_{i=0}^{s-2}\binom{p-2}{i}p_{nz}^{i+2}(1-p_{nz})^{p-2-i}.

Now, note that these PP activations can be permuted in P!P! ways and thus

𝒫𝒱p(𝒱P)P!𝒫1𝒫2=P!(i=0s1(p1i)pnzi+1(1pnz)p1i)(i=0s2(p2i)pnzi+2(1pnz)p2i)ps\displaystyle{\mathcal{P}}_{{\mathcal{V}}_{p}}({\mathcal{V}}_{P}\equiv{\mathcal{R}})\geq P!\cdot{\mathcal{P}}_{1}\cdot{\mathcal{P}}_{2}=\underbrace{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}P!\cdot\left({\sum_{i=0}^{s-1}\binom{p-1}{i}p_{nz}^{i+1}(1-p_{nz})^{p-1-i}}\right)\left({\sum_{i=0}^{s-2}\binom{p-2}{i}p_{nz}^{i+2}(1-p_{nz})^{p-2-i}}\right)}}_{p_{\textsf{s}}} (28)
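For concreteness, the per-activation probabilities appearing above can be evaluated directly. The sketch below uses hypothetical parameter values; p_nz denotes the per-coordinate non-zero probability and s the sparsity level from Assumption 1:

```python
from math import comb

def P1(p, s, p_nz):
    """Probability that a designated coordinate is non-zero and at most
    s - 1 of the remaining p - 1 coordinates are non-zero."""
    return sum(comb(p - 1, i) * p_nz ** (i + 1) * (1 - p_nz) ** (p - 1 - i)
               for i in range(s))

def P2(p, s, p_nz):
    """Probability that two designated coordinates are non-zero and at most
    s - 2 of the remaining p - 2 coordinates are non-zero."""
    return sum(comb(p - 2, i) * p_nz ** (i + 2) * (1 - p_nz) ** (p - 2 - i)
               for i in range(s - 1))

# Hypothetical parameters: dimension p, sparsity level s, non-zero probability p_nz.
p, s, p_nz = 20, 3, 0.1
print(P1(p, s, p_nz), P2(p, s, p_nz))
```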

Now, we complete the proof of the theorem using Hoeffding’s inequality. Assume that the agent samples NN activations; we compute the probability that 𝒱N{\mathcal{R}}\subset{\mathcal{V}}_{N}. Consider all possible PP-subsets of the NN items, enumerated as {1,2,,(NP)}\left\{1,2,\ldots,\binom{N}{P}\right\}. Now, define random variables XiX_{i} as

Xi={1 if ith subset equals ,0 o.w.\displaystyle X_{i}=\begin{cases}1\textnormal{ if $i$th subset equals }{\mathcal{R}},\\ 0\textnormal{ o.w.}\end{cases}

Now, define the sum random variable X=i=1(NP)XiX=\sum_{i=1}^{\binom{N}{P}}X_{i}. We want to lower bound the probability 𝒫𝒱N(X1){\mathcal{P}}_{{\mathcal{V}}_{N}}(X\geq 1). Note that

𝔼𝒱N𝒟sparse[X]=i𝔼[Xi]=(NP)𝒫𝒱P(𝒱P)\displaystyle\mathop{\mathbb{E}}_{{\mathcal{V}}_{N}\sim{\mathcal{D}}_{\textsf{sparse}}}\!\left[X\right]=\sum_{i}\mathbb{E}\left[X_{i}\right]=\binom{N}{P}\cdot{\mathcal{P}}_{{\mathcal{V}}_{P}}({\mathcal{V}}_{P}\equiv{\mathcal{R}})

Now, using Hoeffding’s inequality

𝒫𝒱N(X>0)12exp(2𝔼[X]2)12exp(2(NP)2ps2)\displaystyle{\mathcal{P}}_{{\mathcal{V}}_{N}}(X>0)\geq 1-2\exp\left({-2\mathbb{E}\left[X\right]^{2}}\right)\geq 1-2\exp\left({-2\binom{N}{P}^{2}p_{\textsf{s}}^{2}}\right)

Now, for a given choice of δ>0\delta>0, we want δ2exp(2(NP)2ps2)\delta\geq 2\exp\left({-2\binom{N}{P}^{2}p_{\textsf{s}}^{2}}\right). Using Stirling’s approximation,

(NP)1pslog4δ2(eNP)P1pslog4δ2NPe(1ps2log4δ2)1/2P\displaystyle\binom{N}{P}\geq\frac{1}{p_{\textsf{s}}}\sqrt{\log\frac{4}{\delta^{2}}}\implies\left({\frac{eN}{P}}\right)^{P}\geq\frac{1}{p_{\textsf{s}}}\sqrt{\log\frac{4}{\delta^{2}}}\implies N\geq\frac{P}{e}\left({\frac{1}{p_{\textsf{s}}^{2}}\log\frac{4}{\delta^{2}}}\right)^{1/2P}

Hence, for this choice of NN, with probability at least 1δ1-\delta the sampled activations contain a subset realizing the pattern {\mathcal{R}}, which completes the proof of Theorem 4. ∎
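As a quick numerical illustration, the final bound on N can be evaluated for given P, p_s and δ; the values below are hypothetical, and p_s would be computed from Eq. (28) for the chosen sparse distribution:

```python
from math import e, log

def sample_size_bound(p, p_s, delta):
    """Evaluate N >= (P / e) * ((1 / p_s^2) * log(4 / delta^2))^(1 / (2P)),
    with P = p(p+1)/2, as in the last display."""
    P = p * (p + 1) // 2
    return (P / e) * ((1.0 / p_s ** 2) * log(4.0 / delta ** 2)) ** (1.0 / (2 * P))

# Hypothetical values for illustration only.
print(sample_size_bound(p=50, p_s=1e-6, delta=0.05))
```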

Appendix I Additional Experiments

We evaluate our methods on large-scale models: Pythia-70M (Biderman et al., 2023) and Board Games Models (Karvonen et al., 2024). These models present significant computational challenges, as different feedback methods generate millions of constraints. This scale necessitates specialized approaches for both memory management and computational efficiency.

Memory-efficient constraint storage

The high dimensionality of model dictionaries makes storing complete activation indices for each feature prohibitively memory-intensive. We address this by enforcing constant sparsity constraints, limiting activations to a maximum sparsity of 3. This constraint enables efficient storage of large-dimensional arrays while preserving the essential characteristics of the features.
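A minimal sketch of this storage scheme is given below (illustrative; the array names are not from our codebase): each activation is kept as a fixed-width (index, value) pair rather than as a dense p-dimensional vector.

```python
import numpy as np

p = 32768                 # dictionary size (number of atoms)
sparsity = 3              # maximum number of non-zero coordinates per activation
n_constraints = 1_000_000

rng = np.random.default_rng(0)
# Keep only the non-zero coordinates of each activation: int32 indices, float32 values.
idx = rng.integers(0, p, size=(n_constraints, sparsity), dtype=np.int32)
val = rng.standard_normal((n_constraints, sparsity)).astype(np.float32)

# Sparse storage vs. dense storage (bytes); dense would need n_constraints * p floats.
print(idx.nbytes + val.nbytes, n_constraints * p * 4)
```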

Computational optimization

To efficiently handle constraint satisfaction at scale, we reformulate the problem as a matrix regression task, as detailed in Fig. 5. The learner maintains a low-rank decomposition of the feature matrix 𝚽\bm{\Phi}, assuming 𝚽=UU\bm{\Phi}=UU^{\top}, where UU represents the learned dictionary. This formulation allows for efficient batch-wise optimization over the constraint set while maintaining feasible memory requirements.
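A minimal PyTorch sketch of the procedure in Fig. 5 is given below (a simplified reimplementation rather than the actual experimental code; the dimensions, learning rate, and synthetic batch loader are illustrative assumptions):

```python
import torch

p, r = 1024, 64                       # dictionary dimensions (illustrative)
lam = 1e-4                            # regularization coefficient lambda
device = "cuda" if torch.cuda.is_available() else "cpu"

U = torch.randn(p, r, device=device, requires_grad=True)
opt = torch.optim.Adam([U], lr=1e-3)
y = torch.zeros(p, device=device)
y[0] = 1.0                            # fixed unit vector y = e_1

def batch_loss(U, idx, val, c):
    """MSE + regularization loss of Eqs. (29)-(31); idx, val hold the sparse
    activations as (index, value) pairs and c holds the feedback targets."""
    B = idx.shape[0]
    u = torch.zeros(B, p, device=U.device).scatter_(1, idx, val)  # dense u_i
    proj_u = u @ U                    # U^T u_i for the whole batch
    proj_y = y @ U                    # U^T y
    resid = (proj_u ** 2).sum(dim=1) - c * (proj_y ** 2).sum()
    return (resid ** 2).mean() + lam * (U ** 2).sum()

def loader(n_batches=100, B=256, s=3):
    """Synthetic 3-sparse constraint batches, standing in for the stored
    (index, value, target) triplets."""
    for _ in range(n_batches):
        idx = torch.randint(0, p, (B, s), device=device)
        val = torch.randn(B, s, device=device)
        c = torch.rand(B, device=device) + 0.5
        yield idx, val, c

for idx, val, c in loader():
    opt.zero_grad()
    loss = batch_loss(U, idx, val, c)
    loss.backward()
    torch.nn.utils.clip_grad_norm_([U], max_norm=1.0)  # gradient clipping
    opt.step()
    with torch.no_grad():
        U[0, 0] = 1.0                 # enforce the fixed entry after each update
```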

Dictionary features of Pythia-70M

We use the publicly available repository for dictionary learning via sparse autoencoders on neural network activations (Marks et al., 2024a). We consider the dictionaries trained for Pythia-70M (Biderman et al., 2023) (a general-purpose LLM trained on publicly available datasets). We retrieve the corresponding autoencoders for the attention output layers, which have dimensions 32768×51232768\times 512. Note that p(p+1)/2512Mp(p+1)/2\approx 512M. For the experiments, we use 33-sparse activations from uniform sparse distributions. We present the plots for Pythia-70M in two parts, in Fig. 6 and Fig. 7, for different feedback methods.

1. Given a dictionary Up×rU\in\mathbb{R}^{p\times r}, minimize the loss (U)\mathcal{L}(U):
(U)=MSE(U)+reg(U)\mathcal{L}(U)=\mathcal{L}_{\text{MSE}}(U)+\mathcal{L}_{\text{reg}}(U) (29)
where the MSE loss is
MSE(U)=1|B|iB(Uui2ciUy2)2\mathcal{L}_{\text{MSE}}(U)=\frac{1}{|B|}\sum_{i\in B}(\|U^{\top}u_{i}\|^{2}-c_{i}\|U^{\top}y\|^{2})^{2} (30)
and the regularization term is
reg(U)=λUF2\mathcal{L}_{\text{reg}}(U)=\lambda\|U\|_{F}^{2} (31)
2. For each batch containing indices ii, values vv, and targets cc:
(a) Construct sparse vectors uiu_{i} using the (i,v)(i,v) pairs.
(b) Compute the projected values UuiU^{\top}u_{i} and UyU^{\top}y, where y=e1y=e_{1}.
(c) Calculate the residuals ri=Uui2ciUy2r_{i}=\|U^{\top}u_{i}\|^{2}-c_{i}\|U^{\top}y\|^{2}.
3. Update UU using the Adam optimizer with gradient clipping.
4. Enforce the fixed entries in UU after each update (U[0,0]U[0,0] is fixed to 1).
Here BB denotes the batch of samples, λ=104\lambda=10^{-4} is the regularization coefficient, and y=e1y=e_{1} is the fixed unit vector.
Algorithm 3 Optimization via Gradient Descent
Figure 5: Gradient-based optimization procedure for learning a dictionary decomposition UU with fixed entries.
Figure 6: Feature learning on a subsampled dictionary of dimension 4500×5124500\times 512 of the SAE trained for Pythia-70M. Theorem 1 states that the eigendecomposition method requires 135316 constructive feedbacks. After a few hundred iterations of gradient descent, as shown in Fig. 5, a PCC of 93% with the ground truth is achieved. For visualization, only the first 100 dimensions are used.
Figure 7: Sparse sampling for Pythia-70M. The feature matrix has dimension 32768×51232768\times 512 and rank 215. Plots are shown for varying feedback set sizes. Note that p(p+1)/2p(p+1)/2\approx 512M. We run experiments with 3-sparse activations from uniform sparse distributions. The Pearson correlation coefficient (PCC) improves with the feedback size as follows, reported as (feedback size, PCC): (200k,.0242),(2M,.38),(5M,.54),(10M,.65)(200k,.0242),(2M,.38),(5M,.54),(10M,.65), and (20M,.77)(20M,.77).