The Complexity of Learning Sparse Superposed Features with Feedback
Abstract
The success of deep networks is crucially attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), in the form of relative triplet comparisons. These features may represent various constructs, including dictionaries in LLMs or components of a covariance matrix of Mahalanobis distances. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios when the agent’s feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from Recursive Feature Machine-trained models and dictionary extraction from sparse autoencoders trained on Large Language Models.
1 Introduction
In recent years, neural network-based models have achieved state-of-the-art performance across a wide array of tasks. These models effectively capture relevant features or concepts from samples, tailored to the specific prediction tasks they address (Yang and Hu, 2021b; Bordelon and Pehlevan, 2022a; Ba et al., 2022b). A fundamental challenge lies in understanding how these models learn such features and determining whether these features can be interpreted or even retrieved directly (Radhakrishnan et al., 2024). Recent advancements in mechanistic interpretability have opened multiple avenues for elucidating how transformer-based models, including Large Language Models (LLMs), acquire and represent features (Bricken et al., 2023; Doshi-Velez and Kim, 2017). These advances include uncovering neural circuits that encode specific concepts (Marks et al., 2024b; Olah et al., 2020), understanding feature composition across attention layers (Yang and Hu, 2021b), and revealing how models develop structured representations (Elhage et al., 2022). One line of research posits that features are encoded linearly within the latent representation space through sparse activations, a concept known as the linear representation hypothesis (LRH) (Mikolov et al., 2013; Arora et al., 2016). However, this hypothesis faces challenges in explaining how neural networks function, as models often need to represent more distinct features than their layer dimensions would theoretically allow under purely linear encoding. This phenomenon has been studied extensively in the context of large language models through the lens of superposition (Elhage et al., 2022), where multiple features share the same dimensional space in structured ways.
Recent efforts have addressed this challenge through sparse coding or dictionary learning, proposing that the activations of any layer of the model decompose linearly as
$$a \approx D x + \epsilon,$$
where $a$ is an activation, $D$ is a dictionary matrix (which could be either overcomplete or undercomplete), $x$ is a sparse representation vector, and $\epsilon$ represents error terms. This approach enables retrieval of interpretable features through sparse autoencoders (Bricken et al., 2023; Huben et al., 2024; Marks et al., 2024b), allowing for targeted monitoring and modification of network behavior. The linear feature decomposition not only advances model interpretation but also suggests the potential for developing compact, interpretable models that maintain performance by leveraging universal features from larger architectures.
In this work, we explore how complex features encoded as a dictionary can be distilled through feedback from either advanced language models (e.g., ChatGPT, Claude 3.0 Sonnet) or human agents. Consider a dictionary D in which each column represents an atomic feature vector. These atomic features could correspond to semantic concepts like "tree", "house", or "lawn" that are relevant to the task's sample space. The core mechanism involves an agent (either AI or human) determining whether different sparse combinations of these atomic features are similar or dissimilar. Specifically, given sparse activation vectors, the agent evaluates whether the linear combination of dictionary atoms weighted by one activation is closer to the combination weighted by a second activation than to the combination weighted by a third. Precisely, we formalize this feedback using relative triplet comparisons $(x_1, x_2, x_3)$ over the activation (representation) space: such a comparison expresses that the combination of features with coefficients $x_1$ is more similar to the combination with coefficients $x_2$ than to the one with coefficients $x_3$.
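To make the feedback mechanism concrete, the following minimal sketch shows how an agent with access to a dictionary could label a relative triplet; the function name, the toy dictionary, and the sign convention are our own illustrative choices rather than the paper's exact protocol.

```python
import numpy as np

def label_triplet(D, x1, x2, x3):
    """Label a relative triplet (x1, x2, x3) of sparse activations:
    +1 if the combination D @ x1 is closer to D @ x2 than to D @ x3,
    -1 if it is farther, and 0 if the two distances coincide."""
    d12 = np.linalg.norm(D @ x1 - D @ x2) ** 2
    d13 = np.linalg.norm(D @ x1 - D @ x3) ** 2
    return int(np.sign(d13 - d12))

# Toy example: 3 atomic features in R^5 and (at most) 2-sparse activations.
rng = np.random.default_rng(0)
D = rng.standard_normal((5, 3))              # columns are atomic feature vectors
x1 = np.array([1.0, 0.0, 0.0])
x2 = np.array([0.9, 0.1, 0.0])
x3 = np.array([0.0, 0.0, 1.0])
print(label_triplet(D, x1, x2, x3))          # typically +1: x2 gives the closer combination
```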
The objective is to determine the extent to which an oblivious learner—one who learns solely by satisfying the constraints of the feedback and randomly selecting valid features—can identify the relevant features up to normal transformation. Formally, the fundamental protocol is as follows:
- The agent selects (constructive or distributional) sparse triplets with labels corresponding to a dictionary D, and provides them to the learner.
- The learner solves the feasibility problem defined by the feedback constraints and outputs a solution.
Semantically, these relative distances provide information on how ground-truth samples (e.g., images, text, among others) relate to each other. We term the normal transformation of a given dictionary D the feature matrix, which is exactly a covariance-type matrix. Alternatively, for the representation space, this transformation defines a Mahalanobis distance function characterized by a square symmetric linear transformation $M$ such that for any pair of activations $b, b'$, their distance is given by:
$$d_M(b, b') = (b - b')^\top M \,(b - b').$$
Such a transformation admits a decomposition $M = L^\top L$, where L serves as a dictionary for this distance function, a formulation well studied in the metric learning literature (Kulis, 2013). In this work, we study the minimal number of interactions, termed the feedback complexity, of learning feature matrices (normal transformations of a dictionary). We consider two types of feedback: general activations and sparse activations, examining both constructive and distributional settings. Our primary contributions are:
- I. We investigate feedback complexity in the constructive setting, where agents select activations from the representation space, establishing strong bounds for both general and sparse scenarios (see Section 4).
- II. We analyze the distributional setting with sampled activations, developing results for both general and sparse representations. For sparse sampling, we extend the definition of a Lebesgue measure to accommodate sparsity constraints (see Section 5).
- III. We validate our theoretical bounds through experiments with feature matrices from Recursive Feature Machine trained kernel classifiers and dictionaries trained for sparse autoencoders in Large Language Models, including Pythia-70M (Biderman et al., 2023) and Board Game models (Karvonen et al., 2024) (see Section 6).
Table 1 summarizes our feedback complexity bounds for the various feedback settings examined in this study.
| Setting | Feedback Complexity |
|---|---|
| Standard Constructive | tight bound (Theorem 1) |
| Sparse Constructive | upper bound (Theorem 2) |
| Standard Sampling | tight bound, a.s. (Theorem 3) |
| Sparse Sampling | upper bound, w.h.p. (Theorem 4) |
2 Related Work
Dictionary learning
Recent work has explored dictionary learning to disentangle the semanticity (mono- or polysemy) of neural network activations (Faruqui et al., 2015; Arora et al., 2018; Subramanian et al., 2018; Zhang et al., 2019; Yun et al., 2021). Dictionary learning (Mallat and Zhang, 1993; Olshausen and Field, 1997) (aka sparse coding) provides a systematic approach to decompose task-specific samples into sparse signals. The sample complexity of dictionary learning (or sparse coding) has been extensively studied as an optimization problem, typically involving non-convex objectives such as regularization (see Gribonval et al. (2015)). While traditional methods work directly with ground-truth samples, our approach differs fundamentally as the learner only receives feedback on sparse signals or activations. Prior work in noiseless settings has established probabilistic exact recovery up to linear transformations (permutations and sign changes) under mutual incoherence conditions (Gribonval and Schnass, 2010; Agarwal et al., 2014). Our work extends these results by proving exact recovery (both deterministic and probabilistic) up to normal transformation, which generalizes to rotational and sign changes under strong incoherence properties (see Lemma 1). In the sampling regime, we analyze -sparse signals, building upon the noisy setting framework developed in Arora et al. (2013); Gribonval et al. (2015).
Feature learning in neural networks and Linear representation hypothesis
Neural networks demonstrate a remarkable capability to discover and exploit task-specific features from data (Yang and Hu, 2021b; Bordelon and Pehlevan, 2022b; Shi et al., 2022). Recent theoretical advances have significantly enhanced our understanding of feature evolution and emergence during training (Abbe et al., 2022; Ba et al., 2022a; Damian et al., 2022; Yang and Hu, 2021a; Zhu et al., 2022). Particularly noteworthy is the finding that the outer product of model weights correlates with the gradient outer product of the classifier averaged over layer preactivations (Radhakrishnan et al., 2024), which directly relates to the covariance matrices central to our investigation. Building upon these insights, Elhage et al. (2022) proposed that features in large language models follow a linear encoding principle, suggesting that the complex feature representations learned during training can be decomposed into interpretable linear components. This interpretability, in turn, could facilitate the development of simplified algorithms for complex tasks (Fawzi et al., 2022; Romera-Paredes et al., 2024). Recent research has focused on extracting these interpretable features in the form of dictionary learning by training sparse autoencoder for various language models including Board Games Models (Marks et al., 2024b; Bricken et al., 2023; Karvonen et al., 2024). Our work extends this line of inquiry by investigating whether such interpretable dictionaries can be effectively transferred to a weak learner using minimal comparative feedback.
Triplet learning a covariance matrix
Learning a feature matrix (for a dictionary) up to normal transformation can be viewed through two established frameworks: covariance estimation (Chen et al., 2013; Li and Voroninski, 2013) and learning linear distances such as Mahalanobis matrices (Kulis, 2013). While these frameworks traditionally rely on exact or noisy measurements, our work introduces a distinct mechanism based solely on relative feedback, aligning more closely with the semantic structure of Mahalanobis distances. The study of Mahalanobis distance functions has been central to metric learning research (Bellet et al., 2015; Kulis, 2013), encompassing both supervised approaches using class labels (Weinberger and Saul, 2009; Xing et al., 2002) and unsupervised methods such as LDA (Fisher, 1936) and PCA (Jolliffe, 1986). Schultz and Joachims (2003) and Kleindessner and von Luxburg (2016) have extended this framework to incorporate relative comparisons on distances. Particularly relevant to our work are studies by Schultz and Joachims (2003) and Mason et al. (2017) that employ triplet comparisons, though these typically assume i.i.d. triplets with potentially noisy measurements. Our approach differs by incorporating an active learning element: while signals are drawn i.i.d, an agent selectively provides feedback on informative instances. This constructive triplet framework for covariance estimation represents a novel direction, drawing inspiration from machine teaching, where a teaching agent provides carefully chosen examples to facilitate learning (Zhu et al., 2018).
3 Problem Setup

We denote by the space of activations or representations and by the space of samples. For the space of feature matrices (for a dictionary or Mahalanobis distances), denoted as , we consider the family of symmetric positive semi-definite matrices in , i.e. . We denote a feedback set as which consists of triplets with corresponding signs .
We use the standard notations in linear algebra over a space of matrices provided in Appendix B.
Triplet feedback
An agent provides feedback on activations through relative triplet comparisons $(x_1, x_2, x_3)$. Each comparison evaluates linear combinations of feature vectors: the triplet is labeled according to whether the combination associated with $x_1$ is closer to that of $x_2$ than to that of $x_3$, i.e., by the sign of $d_M(x_1, x_3) - d_M(x_1, x_2)$ for the underlying feature matrix $M$.
We study both sparse and non-sparse activation feedbacks, where sparsity is defined as:
Definition 1 (-sparse activations).
An activation is -sparse if at most many indices of are non-zero.
Since triplet comparisons are invariant to positive scaling of feature matrices, we define:
Definition 2 (Feature equivalence).
For a feature family, feature matrices $M$ and $M'$ are equivalent if there exists a scalar $c > 0$ such that $M = c\, M'$.
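A short numerical sketch (variable names and the sign convention are ours) of the two facts used here: a triplet label depends on the feature matrix only up to positive scaling, and a feature matrix of the form $M = L^\top L$ induces the Euclidean distance between the embedded activations.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
L = rng.standard_normal((3, d))
M = L.T @ L                                   # feature matrix M = L^T L (symmetric PSD)

def d_sq(M, b1, b2):
    """Squared Mahalanobis distance d_M(b1, b2) = (b1 - b2)^T M (b1 - b2)."""
    diff = b1 - b2
    return float(diff @ M @ diff)

x1, x2, x3 = (rng.standard_normal(d) for _ in range(3))
label = np.sign(d_sq(M, x1, x3) - d_sq(M, x1, x2))
# Triplet labels are invariant under positive rescaling of the feature matrix,
# so M and c * M are indistinguishable from feedback alone (feature equivalence).
print(label == np.sign(d_sq(7.0 * M, x1, x3) - d_sq(7.0 * M, x1, x2)))    # True
# The distance induced by M = L^T L equals the Euclidean distance between the L b's.
print(np.isclose(d_sq(M, x1, x2), np.linalg.norm(L @ x1 - L @ x2) ** 2))  # True
```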
We study a simple learning framework where the learner merely satisfies the constraints provided by the agent’s feedback:
Definition 3 (Oblivious learner).
A learner is oblivious if it randomly selects a feature matrix from the set of valid solutions to a given feedback set , i.e., arbitrarily chooses , where represents the set of feature matrices satisfying the constraints in .
This framework aligns with version space learning, where denotes the set of feature matrices in compatible with feedback set .
Prior work on dictionary learning has established recovery up to linear transformation under weak mutual incoherence (e.g., cf Gribonval and Schnass (2010)). In our setting, when the agent provides feature feedback corresponding to , the learner recovers L up to normal transformation. Moreover, when L has orthogonal rows (strong incoherence), we can recover L up to rotation and sign changes as stated below, with proof deferred to Appendix C.
Lemma 1 (Recovering orthogonal representations).
Assume . Define the set of orthogonal Cholesky decompositions of as
where and are the eigenvalues of in descending order. Then, for any two matrices , there exists an orthogonal matrix such that , where R is block diagonal with orthogonal blocks corresponding to any repeated diagonal entries in . Additionally, each column of can differ from the corresponding column of U by a sign change.
We note that the recovery of L pertains to the assumption that all rows of L are orthogonal, so that L has full row rank. When this is not the case, one needs additional information in the form of a ground-truth sample for some activation to recover L up to a linear transformation. Finally, the interaction protocol is shown in Algorithm 1.
1. Agent picks a set of triplets with labels corresponding to the target feature matrix.
2. Learner receives the feedback set, and obliviously picks a feature matrix that satisfies the set of constraints it induces.
3. Learner outputs the chosen feature matrix.
4 Sparse Feature Learning with Constructive Feedback
Here, we study the feedback complexity in the setting where the agent is allowed to pick or construct any activation from the representation space.
Reduction to Pairwise Comparisons
The general triplet feedbacks in Algorithm 1, which may induce inequality constraints, can be simplified to pairwise comparisons with equality constraints via a simple manipulation, as follows.
Lemma 2.
Let be a target feature matrix in representation space used for oblivious learning. Given a feedback set
such that any is feature equivalent to , there exists a pairwise feedback set
such that .
Proof.
WLOG, assume for all . For any triplet : Case (i): If , then satisfies the equality. Case (ii): If , then for some :
implying satisfies the equality.
Thus, each triplet in maps to a pair in , preserving feature equivalence under positive scaling. ∎
This implies that if triplet comparisons are used in Algorithm 1, equivalent pairwise comparisons exist satisfying:
$$u^\top M^\star u = v^\top M^\star v \quad \text{for every pair } (u, v) \text{ in the pairwise feedback set}, \tag{1a}$$
$$\text{find } M \in \mathcal{S}_{\mathrm{sym}}^{+} \ \text{ such that } \ u^\top M u = v^\top M v \ \text{ for every such pair } (u, v). \tag{1b}$$
Now, we show a reformulation of the oblivious learning problem for a feature matrix using pairwise comparisons that provides a useful geometric interpretation. Consider a pair $(u, v)$ and a matrix $M \in \mathcal{S}_{\mathrm{sym}}^{+}$. An equality constraint $u^\top M u = v^\top M v$ implies
$$\langle M,\; u u^\top - v v^\top \rangle_F = 0,$$
where $\langle \cdot, \cdot \rangle_F$ denotes the Frobenius inner product. This equivalence indicates that the matrix $u u^\top - v v^\top$ is orthogonal to $M$ in the Frobenius inner product space. Now, given a set of pairwise feedbacks $\{(u_i, v_i)\}_i$ corresponding to the target feature matrix $M^\star$, the learning problem defined by Eq. (1b) can be formulated as:
$$\text{find } M \in \mathcal{S}_{\mathrm{sym}}^{+} \ \text{ such that } \ \langle M,\; u_i u_i^\top - v_i v_i^\top \rangle_F = 0 \ \text{ for all } i. \tag{2}$$
Essentially, the learner aims to select a matrix that satisfies the condition in Eq. (2). Geometrically, this condition implies that any solution should annihilate the subspace of the orthogonal complement of $M^\star$ that is spanned by the matrices $u_i u_i^\top - v_i v_i^\top$. Formally, this complement is defined as:
$$(M^\star)^{\perp} = \{\, A \in \mathcal{S}_{\mathrm{sym}} : \langle A,\, M^\star \rangle_F = 0 \,\}.$$
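The sketch below (a toy instance with our own variable names) checks this equivalence numerically: a pair satisfying the equality constraint yields a difference matrix $u u^\top - v v^\top$ that is Frobenius-orthogonal to $M$.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 5
A = rng.standard_normal((d, d))
M = A @ A.T                                    # a symmetric PSD "feature matrix"

u = rng.standard_normal(d)
v = rng.standard_normal(d)
v *= np.sqrt((u @ M @ u) / (v @ M @ v))        # rescale v so the pair satisfies equality

frob = np.sum(M * (np.outer(u, u) - np.outer(v, v)))    # <M, u u^T - v v^T>_F
print(np.isclose(u @ M @ u, v @ M @ v), np.isclose(frob, 0.0))    # True True
```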
4.1 Constructive feedbacks: Worst-case lower bound
To learn a symmetric PSD matrix, the learner needs at most $\frac{d(d+1)}{2}$ constraints for the linear program, corresponding to the number of degrees of freedom of a symmetric $d \times d$ matrix. So, the first question is whether there are pathological feature matrices that would require on this order of triplet feedbacks in Algorithm 1. This is indeed the case if the target matrix is full rank.
In the following proposition, proven in Appendix D, we show a strong lower bound on the worst case, which turns out to be of order $d^2$.
Proposition 1.
In the constructive setting, the worst-case feedback complexity of the class of symmetric PSD feature matrices with general activations is at least of order $d^2$.
Proof Outline.
As discussed in Eq. (1) and Eq. (2), for a full-rank feature matrix $M^\star$, the span of the matrices induced by any feedback set must lie within the orthogonal complement of $M^\star$ in the space of symmetric matrices $\mathcal{S}_{\mathrm{sym}}$. Conversely, if $M^\star$ has full rank, this orthogonal complement must be contained within the span. This necessary condition requires the feedback set to have a size of at least the dimension of the complement, namely $\frac{d(d+1)}{2} - 1$. ∎
Since the worst-case bounds are pessimistic for oblivious learning of Eq. (1), a natural question is how the feedback complexity varies over the feature model. In the following subsection, we study the feedback complexity based on the rank of the underlying matrix, showing that the bounds can be drastically reduced.
4.2 Feature learning of low-rank matrices
As stated in Proposition 1, the learner requires at least feedback pairs to annihilate the orthogonal complement . However, this requirement decreases with a lower rank of . We illustrate this in Fig. 1 for a feature matrix of rank 4 trained via Recursive Feature Machines (Radhakrishnan et al., 2024).
Consider an activation $u$ in the null space of $M^\star$. Since $M^\star u = 0$, it follows that $u^\top M^\star u = 0$. Moreover, for another activation $u'$ in the null space, any linear combination satisfies $(\alpha u + \beta u')^\top M^\star (\alpha u + \beta u') = 0$.
This suggests a strategy for designing effective feedback based on the kernel and the null space of $M^\star$ (see Appendix B for a table of notations). This intuition is formalized by the eigendecomposition of the feature matrix:
$$M^\star = \sum_{i=1}^{r} \lambda_i\, v_i v_i^\top, \tag{3}$$
where $\lambda_1 \geq \cdots \geq \lambda_r > 0$ are the non-zero eigenvalues, $v_1, \ldots, v_r$ are the corresponding orthonormal eigenvectors, and $r$ is the rank of $M^\star$. Since $M^\star$ is symmetric PSD, this decomposition is unique with non-negative eigenvalues.
To teach , the agent can employ a dual approach: teaching the kernel associated with the eigenvectors in this decomposition and the null space separately. Specifically, the agent can provide feedbacks corresponding to the eigenvectors of ’s kernel and extend the basis for the null space. We first present the following useful result (see proof in Appendix E).
Lemma 3.
Let be a set of orthogonal vectors. Then, the set of rank-1 matrices
is linearly independent in the space of symmetric matrices.
Using this construction, the agent can provide feedbacks of the form for some with and to teach the kernel of . For an orthogonal extension where for all , feedbacks of the form suffice to teach the null space of .
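The following sketch illustrates this dual strategy numerically; the specific layout of eigen-pair and null-space matrices is our illustrative reading of the construction, not the exact feedback set used in the proof.

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 6, 3
V = np.linalg.qr(rng.standard_normal((d, d)))[0]    # orthonormal basis of R^d
eigvecs, null_basis = V[:, :r], V[:, r:]            # eigenvectors of M* vs. its null space
M_star = eigvecs @ np.diag([3.0, 2.0, 1.0]) @ eigvecs.T   # rank-3 target these feedbacks teach

# Rank-1 matrices from the eigenvectors (Lemma 3-style: v_i v_i^T and (v_i + v_j)(v_i + v_j)^T)
# plus one rank-1 matrix per null-space direction.
mats = []
for i in range(r):
    mats.append(np.outer(eigvecs[:, i], eigvecs[:, i]))
    for j in range(i + 1, r):
        w = eigvecs[:, i] + eigvecs[:, j]
        mats.append(np.outer(w, w))
for k in range(null_basis.shape[1]):
    mats.append(np.outer(null_basis[:, k], null_basis[:, k]))

# Linear independence: the vectorized matrices have full column rank,
# here r(r+1)/2 + (d - r) = 9 for this construction.
X = np.stack([m.ravel() for m in mats], axis=1)
print(len(mats), np.linalg.matrix_rank(X))          # 9 9
```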
This is the key idea underlying our study on feedback complexity in the general constructive setting that is stated below with the full proof deferred to Appendix E and F.
Theorem 1 (General Activations).
Let be a target feature matrix with . Then, in the setting of constructive feedbacks with general activations, the feedback complexity has a tight bound of for Eq. (1).
Proof Outline.
As discussed above we decompose the feature matrix into its eigenspace and null space, leveraging the linear independence of the constructed feedbacks to ensure that the span covers the necessary orthogonal complements. The upper bound is established with a simple observation: many pairs composed of are sufficient to teach if the null space of is known, whereas the agent only needs to provide many feedbacks corresponding to a basis extension to cover the null space, and hence the stated upper bound is achieved.
The lower bound requires showing that a valid feedback set possesses two spanning properties of for all : (1) it must include any whose column vectors are within the span of eigenvectors of , and (2) it must include any for some subset that spans the null space of and . ∎
Learning with sparse activations
In the discussion above, we demonstrated a strategy for reducing the feedback complexity when general activations are allowed. Now, we aim to understand how this complexity changes when activations are -sparse (see Definition 1) for some . Notably, there exists a straightforward construction of rank-1 matrices using a sparse set of activations.
Consider the following sparse set of activations consisting of $\frac{d(d+1)}{2}$ items in the representation space:
$$\mathcal{B} = \{\, e_i : i \in [d] \,\} \cup \{\, e_i + e_j : 1 \leq i < j \leq d \,\}, \tag{4}$$
where $e_i$ is the $i$-th standard basis vector. Using a similar argument to Lemma 3, it is straightforward to show that the set of rank-1 matrices
$$\{\, x x^\top : x \in \mathcal{B} \,\}$$
is linearly independent in the space of symmetric matrices and forms a basis. Moreover, every activation in $\mathcal{B}$ is at most 2-sparse (see Definition 1). With this construction in hand (verified numerically in the sketch below), we state the main result on learning with sparse constructive feedback.
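A minimal sketch, assuming the set in Eq. (4) consists of the standard basis vectors together with their pairwise sums as written above:

```python
import numpy as np
from itertools import combinations

d = 4
I = np.eye(d)

# The 2-sparse activation set of Eq. (4): standard basis vectors e_i and sums e_i + e_j.
B = [I[i] for i in range(d)] + [I[i] + I[j] for i, j in combinations(range(d), 2)]

# The induced rank-1 matrices x x^T form a basis of the d(d+1)/2-dimensional
# space of symmetric matrices: their vectorizations have full rank.
X = np.stack([np.outer(x, x).ravel() for x in B], axis=1)
print(len(B), d * (d + 1) // 2, np.linalg.matrix_rank(X))    # 10 10 10
```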
Theorem 2 (Sparse Activations).
Let be the target feature matrix. If an agent can construct pairs of activations from a representation space , then the feedback complexity of the feature model with 2-sparse activations is upper bounded by .
Remark: While the lower bound from Theorem 1 applies here, sparse settings may require even more feedbacks. Consider a rank-1 matrix $u u^\top$ in which every entry of $u$ is non-zero. Since a rank-1 matrix built from an $s$-sparse activation has at most $s^2$ non-zero entries, the pigeonhole principle implies that representing $u u^\top$ requires on the order of $(d/s)^2$ such matrices. Thus, for constant sparsity $s$, we need $\Omega(d^2)$ feedbacks, implying that sparse representations of dense features might not exploit the low-rank structure to minimize feedbacks; the counting sketch below illustrates this.
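A one-line counting sketch of the remark, with illustrative values of $d$ and $s$:

```python
# Counting argument behind the remark (illustrative numbers): a dense rank-1 matrix
# u u^T has d^2 non-zero entries, while x x^T for an s-sparse x has at most s^2,
# so at least ceil(d^2 / s^2) sparse rank-1 matrices are needed to cover its support.
d, s = 16, 2
print((d * d + s * s - 1) // (s * s))    # 64 matrices, even though rank(u u^T) = 1
```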
5 Sparse Feature Learning with Sampled Feedback
In general, the assumption of constructive feedback may not hold in practice, as ground truth samples from nature or induced representations from a model are typically independently sampled from the representation space. The literature on Mahalanobis distance learning and dictionary learning has explored distributional assumptions on the sample/activation space (Mason et al., 2017; Gribonval et al., 2014).
In this section, we consider a more realistic scenario where the agent observes a set of representations/activations , with being an unknown measure over the continuous space . With these observations, the agent designs pairs of activations to teach a target feature matrix .
As demonstrated in Lemma 2, we can reduce inequality constraints with triplet comparisons to equality constraints with pairs in the constructive setting. However, when the agent is restricted to selecting activations from the sampled set rather than arbitrarily from the representation space, this reduction no longer holds. This is evident from the following observation: let $u, v$ be independently sampled activations and let $M$ be a non-degenerate feature matrix. Then $u^\top M u - v^\top M v$ is a non-zero polynomial in the entries of $(u, v)$. According to Sard's theorem, the zero set of a non-zero polynomial has Lebesgue measure zero. Therefore, the equality $u^\top M u = v^\top M v$ holds with probability zero.
Given this, the agent cannot reliably construct pairs that satisfy the required equality constraints from independently sampled activations. Since a general triplet feedback only provides 3 bits of information, exact recovery up to feature equivalence is impossible. To address these limitations, we consider rescaling the sampled activations to enable the agent to design effective pairs for the target feature matrix .

Rescaled Pairs
For a given matrix $M$, a sampled input $u$ almost never satisfies $u^\top M u = 0$, i.e., almost surely $u^\top M u \neq 0$. This property can be utilized to rescale an input and construct pairs that satisfy equality constraints. Specifically, for sampled activations $u, v$ there exists a scalar $c > 0$ such that (assuming without loss of generality $u^\top M u \geq v^\top M v > 0$)
$$(c\, u)^\top M (c\, u) = v^\top M v, \qquad c = \sqrt{\tfrac{v^\top M v}{u^\top M u}}.$$
Thus, the rescaled pair $(c\, u, v)$ satisfies the equality constraint. With this understanding, we reformulate Algorithm 1 into Algorithm 2.
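A minimal sketch of the rescaling step (rescaling only one side of the pair is our simplification; Algorithm 2 may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
A = rng.standard_normal((d, d))
M_star = A @ A.T                          # target feature matrix known to the teaching agent

def rescaled_pair(M, u, v):
    """Rescale u so that (c*u, v) satisfies (c*u)^T M (c*u) = v^T M v.
    For Lebesgue-sampled activations both quadratic forms are non-zero almost surely."""
    c = np.sqrt((v @ M @ v) / (u @ M @ u))
    return c * u, v

u, v = rng.standard_normal(d), rng.standard_normal(d)    # i.i.d. sampled activations
cu, v = rescaled_pair(M_star, u, v)
print(np.isclose(cu @ M_star @ cu, v @ M_star @ v))      # True
```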
In this section, we analyze the feedback complexity in terms of the minimum number of sampled activations required for the agent to construct an effective feedback set achieving feature equivalence which is illustrated in Fig. 2. Our first result establishes complexity bounds for general activations (without sparsity constraints) sampled from a Lebesgue distribution, with the complete proof provided in Appendix G.
Theorem 3 (General Sampled Activations).
Consider a representation space. Assume that the agent receives activations sampled i.i.d. from a Lebesgue distribution. Then, for any target feature matrix, with a tight bound on the feedback complexity, the oblivious learner almost surely learns the target up to feature equivalence using the feedback set.
Proof Outline.
The key observation is that almost surely for any randomly sampled activations on a unit sphere under Lebesgue measure, the corresponding rank-1 matrices are linearly independent. This is a direct application of Sard’s theorem on the zero set of a non-zero polynomial equation, yielding the upper bound. For the lower bound, we use some key necessary properties of a feedback set as elucidated in the proof of Theorem 1. This result essentially fixes certain activations that need to be spanned by a feedback set, but under a Lebesgue measure on a continuous domain, we know that the probability of sampling a direction is zero. ∎
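The key observation in the proof outline is easy to check numerically; the sketch below uses our own toy dimensions.

```python
import numpy as np

rng = np.random.default_rng(8)
d = 5
m = d * (d + 1) // 2                                   # dimension of the symmetric matrices
X = rng.standard_normal((m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)          # i.i.d. activations on the unit sphere

# Almost surely the induced rank-1 matrices x x^T are linearly independent,
# so their vectorizations span the full d(d+1)/2-dimensional space.
V = np.stack([np.outer(x, x).ravel() for x in X], axis=1)
print(np.linalg.matrix_rank(V), m)                     # 15 15
```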
We consider a fairly general distribution over sparse activations similar to the signal model in Gribonval et al. (2015).
Assumption 1 (Sparse-Distribution).
Each index of a sparse activation vector is sampled i.i.d from a sparse distribution defined as: for all ,
With this we state the main theorem of the section with the proof deferred to Appendix H.
Theorem 4 (Sparse Sampled Activations).
Consider a representation space. Assume that the agent receives representations sampled i.i.d. from a sparse distribution. Fix a threshold and a sparsity parameter. Then, for any target feature matrix, with a bound on the feedback complexity of sparse feedbacks that depends on the threshold and the sparsity parameter, the oblivious learner learns the target up to feature equivalence with high probability using the feedback set.
Proof Outline.
Using the formulation of Eq. (2), we need to estimate the number of activations the agent must receive/sample before an induced set of sufficiently many linearly independent rank-1 matrices is found. To estimate this, we first generalize the construction of the set from the proof of Theorem 2 to the sparse setting.
We then analyze a design matrix of rank-1 matrices obtained from the sampled activations and compute the probability of finding columns with entries playing the same role as those in the constructive basis (as vectorizations), ensuring a non-trivial determinant. The final complexity bound is derived using Hoeffding's inequality and Stirling's approximation. ∎
6 Experimental Setup
We empirically validate our theoretical framework for learning feature matrices. Our experiments (code: https://anonymous.4open.science/r/learnsparsefeatureswithfeedback-688E) examine different feedback mechanisms and teaching strategies across both synthetic tasks and large-scale neural networks.
Feedback Methods:
We evaluate four feedback mechanisms: (1) Eigendecomposition uses Lemma 3 to construct feedback based on ’s low rank structure, (2) Sparse Constructive builds 2-sparse feedbacks using the basis in Eq. (4), (3) Random Sampling generates feedbacks spanning from a Lebesgue distribution, and (4) Sparse Sampling creates feedbacks using -sparse samples drawn from a sparse distribution (see Definition 1).
Teaching Agent:
We implement a teaching agent with access to the target feature matrix to enable numerical analysis. The agent either constructs specific basis vectors or receives activations from distributions (Lebesgue or sparse), based on the chosen feedback method. For problems with small dimensions, we utilize the cvxpy package to solve constraints of the form given in Eq. (2). When handling larger-dimensional features, where the number of constraints scales to millions, we employ batch-wise gradient descent for matrix regression.
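As an illustration of the small-dimension pipeline, the hedged sketch below combines rescaled sampled pairs with a cvxpy feasibility program; the solver setup, the trace normalization, and all constants are our assumptions rather than the exact configuration used in the experiments.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(5)
d = 4
L = rng.standard_normal((d, d))
M_star = L @ L.T                                     # hypothetical target feature matrix

# Equality-satisfying pairs: rescaled i.i.d. samples (cf. the rescaled-pairs construction).
pairs = []
for _ in range(2 * d * d):                           # comfortably more than d(d+1)/2 - 1
    u, v = rng.standard_normal(d), rng.standard_normal(d)
    u *= np.sqrt((v @ M_star @ v) / (u @ M_star @ u))
    pairs.append((u, v))

# Oblivious learner: any PSD matrix satisfying the Frobenius-orthogonality constraints
# of Eq. (2); a trace normalization removes the harmless scale ambiguity.
M = cp.Variable((d, d), PSD=True)
constraints = [cp.trace(M) == 1]
constraints += [cp.sum(cp.multiply(M, np.outer(u, u) - np.outer(v, v))) == 0
                for u, v in pairs]
cp.Problem(cp.Minimize(0), constraints).solve()

print(np.abs(M.value / np.trace(M.value) - M_star / np.trace(M_star)).max())  # ~0
```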
Features via RFM:
RFM (Radhakrishnan et al., 2024) considers a trainable shift-invariant kernel corresponding to a symmetric, PSD matrix $M$. At each iteration, $M$ is updated via the average gradient outer product of the trained classifier $f$ over the training inputs, $M \leftarrow \frac{1}{n}\sum_{i=1}^{n} \nabla f(x_i)\, \nabla f(x_i)^\top$. We train target functions corresponding to monomials over sampled inputs using 4000 training samples. The feature matrix obtained after training is used as the ground truth for learning with feedbacks. Plots are shown in Fig. 1 and Fig. 2.
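For context, here is a minimal sketch of the AGOP-style update that, on our reading of Radhakrishnan et al. (2024), RFM uses to form its feature matrix; differentiating the true monomial instead of a trained kernel predictor is a simplification for illustration.

```python
import numpy as np

def agop(grad_f, X):
    """Average gradient outer product: M = (1/n) * sum_i grad f(x_i) grad f(x_i)^T."""
    G = np.stack([grad_f(x) for x in X])      # n x d matrix of gradients
    return G.T @ G / len(X)

rng = np.random.default_rng(6)
d, n = 10, 4000
X = rng.uniform(-1.0, 1.0, size=(n, d))

def grad_f(x):                                # gradient of the monomial f(x) = x_0 * x_1
    g = np.zeros(d)
    g[0], g[1] = x[1], x[0]
    return g

M = agop(grad_f, X)
print(np.linalg.matrix_rank(M, tol=1e-8))     # 2: the resulting feature matrix is low-rank
```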
SAE features of Large-Scale Models:
We analyze dictionaries from trained sparse autoencoders on Pythia-70M (Biderman et al., 2023) and Board Game Models (Karvonen et al., 2024), with dictionary dimensions of and respectively. These large-scale dimensions result in millions of feedback parameters, necessitating computationally efficient implementations of the teaching agent and constraint optimization. Plots for Pythia-70M and detailed implementation specifications are provided in Appendix I.
Since there could be numerical issues in computation for these large dictionaries, we compute the Pearson correlation coefficient (PCC) between the learned feature matrix $M$ and the target feature matrix $M^\star$ to show their closeness:
$$\mathrm{PCC}(M, M^\star) = \frac{\sum_{i,j} (M_{ij} - \bar{M})(M^\star_{ij} - \bar{M}^\star)}{\sqrt{\sum_{i,j} (M_{ij} - \bar{M})^2}\,\sqrt{\sum_{i,j} (M^\star_{ij} - \bar{M}^\star)^2}},$$
where $\bar{M}$ and $\bar{M}^\star$ represent the means of all elements in the matrices $M$ and $M^\star$, respectively. Note that the highest value of the PCC is 1.
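A minimal sketch of this metric (implementation details are ours); the check also illustrates why PCC is compatible with feature equivalence, since it is invariant to positive rescaling.

```python
import numpy as np

def pcc(M, M_star):
    """Pearson correlation coefficient between all entries of two feature matrices."""
    a, b = M.ravel(), M_star.ravel()
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(7)
A = rng.standard_normal((8, 8))
M_star = A @ A.T
print(pcc(3.0 * M_star, M_star))    # ~1.0: PCC is unaffected by positive rescaling
```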
Dictionary features of MLP layers of ChessGPT and OthelloGPT:
We use the SAEs trained for various MLP layers of the Board Games models ChessGPT and OthelloGPT considered in Karvonen et al. (2024), and analyze the corresponding dictionaries. For the experiments, we use sparse feedbacks drawn from uniform sparse distributions. We present the plots for ChessGPT in Fig. 4 for different feedback methods. Additionally, we provide a table showing the Pearson correlation coefficient between the learned feature matrix and the target in Table 2.
| Method | Eigendecomp. | Sparse Cons. | Sparse Samp. | Sparse Samp. | Sparse Samp. | Sparse Samp. |
|---|---|---|---|---|---|---|
| PCC | .9427 | .9773 | .9741 | .9625 | .8256 | .7152 |
| Feedbacks | 134912 | 8390656 (8 mil) | 10 mil | 4 mil | 2 mil | 1 mil |
Acknowledgments
The author thanks the National Science Foundation for support under grant IIS-2211386 during the course of this project. The author thanks Sanjoy Dasgupta (UCSD) for helping to develop the preliminary ideas of the work, Geelon So (UCSD) for many extended discussions on the project, and Mikhail (Misha) Belkin (UCSD) and Enric Boix-Adserà (MIT) for a helpful discussion during a visit to the Simons Institute. The general idea of the writing was developed while the author was visiting the Simons Institute (UC Berkeley) for a workshop; the travel was supported by the National Science Foundation (NSF) and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS-2031883 and #814639.
References
- Abbe et al. (2022) Emmanuel Abbe, Emmanuel Boix-Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for sgd learning of sparse functions on two-layer neural networks. In Conference on Learning Theory, pages 4782–4887. PMLR, 2022.
- Agarwal et al. (2014) Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli, and Rashish Tandon. Learning sparsely used overcomplete dictionaries. In Maria Florina Balcan, Vitaly Feldman, and Csaba Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 123–137, Barcelona, Spain, 13–15 Jun 2014. PMLR. URL https://proceedings.mlr.press/v35/agarwal14a.html.
- Arora et al. (2013) Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In Annual Conference Computational Learning Theory, 2013. URL https://api.semanticscholar.org/CorpusID:6978132.
- Arora et al. (2016) Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385–399, 2016. doi: 10.1162/tacl_a_00106. URL https://aclanthology.org/Q16-1028/.
- Arora et al. (2018) Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018.
- Ba et al. (2022a) Jimmy Ba, Murat A. Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. arXiv preprint arXiv:2205.01445, 2022a.
- Ba et al. (2022b) Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022b. URL https://openreview.net/forum?id=akddwRG6EGi.
- Bellet et al. (2015) Aurélien Bellet, Amaury Habrard, and Marc Sebban. Metric learning, volume 30 of Synth. Lect. Artif. Intell. Mach. Learn. San Rafael, CA: Morgan & Claypool Publishers, 2015. ISBN 978-1-62705-365-5; 978-1-627-05366-2. doi: 10.2200/S00626ED1V01Y201501AIM030.
- Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar Van Der Wal. Pythia: a suite for analyzing large language models across training and scaling. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
- Bordelon and Pehlevan (2022a) Blake Bordelon and Cengiz Pehlevan. Self-consistent dynamical field theory of kernel evolution in wide neural networks. Journal of Statistical Mechanics: Theory and Experiment, 2023, 2022a. URL https://api.semanticscholar.org/CorpusID:248887466.
- Bordelon and Pehlevan (2022b) Blake Bordelon and Cengiz Pehlevan. Self-consistent dynamical field theory of kernel evolution in wide neural networks. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 32240–32256. Curran Associates, Inc., 2022b.
- Bricken et al. (2023) Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Christopher Olah. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- Chen et al. (2013) Yuxin Chen, Yuejie Chi, and Andrea J. Goldsmith. Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory, 61:4034–4059, 2013. URL https://api.semanticscholar.org/CorpusID:210337.
- Damian et al. (2022) Andrei Damian, Jason Lee, and Mahdi Soltanolkotabi. Neural networks can learn representations with gradient descent. In Conference on Learning Theory, pages 5413–5452. PMLR, 2022.
- Doshi-Velez and Kim (2017) Finale Doshi-Velez and Been Kim. Towards a rigorous science of interpretable machine learning, 2017. URL https://arxiv.org/abs/1702.08608.
- Elhage et al. (2022) Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/toy_model/index.html.
- Faruqui et al. (2015) Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah A Smith. Sparse overcomplete word vector representations. arXiv preprint arXiv:1506.02004, 2015.
- Fawzi et al. (2022) Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J. R. Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, David Silver, Demis Hassabis, and Pushmeet Kohli. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, October 2022. ISSN 1476-4687. doi: 10.1038/s41586-022-05172-4. URL https://doi.org/10.1038/s41586-022-05172-4.
- Fisher (1936) R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2):179–188, 1936. doi: https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.
- Gribonval et al. (2014) Rémi Gribonval, Rodolphe Jenatton, and Francis R. Bach. Sparse and spurious: Dictionary learning with noise and outliers. IEEE Transactions on Information Theory, 61:6298–6319, 2014. URL https://api.semanticscholar.org/CorpusID:217787222.
- Gribonval and Schnass (2010) Rémi Gribonval and Karin Schnass. Dictionary identification—sparse matrix-factorization via -minimization. IEEE Transactions on Information Theory, 56(7):3523–3539, 2010. doi: 10.1109/TIT.2010.2048466.
- Gribonval et al. (2015) Rémi Gribonval, Rodolphe Jenatton, and Francis Bach. Sparse and spurious: Dictionary learning with noise and outliers. IEEE Transactions on Information Theory, 61(11):6298–6319, 2015. doi: 10.1109/TIT.2015.2472522.
- Huben et al. (2024) Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=F76bwRSLeK.
- Jolliffe (1986) Ian T. Jolliffe. Principal Component Analysis. Springer-Verlag, New York, 1986.
- Karvonen et al. (2024) Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Riggs Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. Measuring progress in dictionary learning for language model interpretability with board game models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=SCEdoGghcw.
- Kleindessner and von Luxburg (2016) Matthäus Kleindessner and Ulrike von Luxburg. Kernel functions based on triplet comparisons. In Neural Information Processing Systems, 2016.
- Kulis (2013) Brian Kulis. Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364, 2013. ISSN 1935-8237. doi: 10.1561/2200000019. URL http://dx.doi.org/10.1561/2200000019.
- Li and Voroninski (2013) Xiaodong Li and Vladislav Voroninski. Sparse signal recovery from quadratic measurements via convex programming. SIAM Journal on Mathematical Analysis, 45(5):3019–3033, 2013. doi: 10.1137/120893707. URL https://doi.org/10.1137/120893707.
- Mallat and Zhang (1993) S.G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993. doi: 10.1109/78.258082.
- Marks et al. (2024a) Samuel Marks, Adam Karvonen, and Aaron Mueller. dictionary_learning. https://github.com/saprmarks/dictionary_learning, 2024a.
- Marks et al. (2024b) Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models, 2024b. URL https://arxiv.org/abs/2403.19647.
- Mason et al. (2017) Blake Mason, Lalit P. Jain, and Robert D. Nowak. Learning low-dimensional metrics. In Neural Information Processing Systems, 2017.
- Mikolov et al. (2013) Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL https://aclanthology.org/N13-1090/.
- Olah et al. (2020) Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 2020. doi: 10.23915/distill.00024.001. https://distill.pub/2020/circuits/zoom-in.
- Olshausen and Field (1997) Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision Research, 37(23):3311–3325, 1997. ISSN 0042-6989. doi: https://doi.org/10.1016/S0042-6989(97)00169-7. URL https://www.sciencedirect.com/science/article/pii/S0042698997001697.
- Radhakrishnan et al. (2024) Adityanarayanan Radhakrishnan, Daniel Beaglehole, Parthe Pandit, and Mikhail Belkin. Mechanism for feature learning in neural networks and backpropagation-free machine learning models. Science, 383(6690):1461–1467, 2024. doi: 10.1126/science.adi5639. URL https://www.science.org/doi/abs/10.1126/science.adi5639.
- Romera-Paredes et al. (2024) Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Matej Balog, M. Pawan Kumar, Emilien Dupont, Francisco J. R. Ruiz, Jordan S. Ellenberg, Pengming Wang, Omar Fawzi, Pushmeet Kohli, and Alhussein Fawzi. Mathematical discoveries from program search with large language models. Nature, 625(7995):468–475, January 2024. ISSN 1476-4687. doi: 10.1038/s41586-023-06924-6. URL https://doi.org/10.1038/s41586-023-06924-6.
- Schultz and Joachims (2003) Matthew Schultz and Thorsten Joachims. Learning a distance metric from relative comparisons. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems, volume 16. MIT Press, 2003.
- Shi et al. (2022) Zhengdao Shi, Jerry Wei, and Yucheng Lian. A theoretical analysis on feature learning in neural networks: Emergence from inputs and advantage over fixed features. In International Conference on Learning Representations, 2022.
- Subramanian et al. (2018) Anant Subramanian, Danish Pruthi, Harsh Jhamtani, Taylor Berg-Kirkpatrick, and Eduard Hovy. Spine: Sparse interpretable neural embeddings. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- Weinberger and Saul (2009) Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res., 10:207–244, June 2009. ISSN 1532-4435.
- Xing et al. (2002) Eric P. Xing, A. Ng, Michael I. Jordan, and Stuart J. Russell. Distance metric learning with application to clustering with side-information. In Neural Information Processing Systems, 2002.
- Yang and Hu (2021a) Greg Yang and Edward J. Hu. Tensor Programs IV: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, 2021a.
- Yang and Hu (2021b) Greg Yang and Edward J. Hu. Tensor programs iv: Feature learning in infinite-width neural networks. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 11727–11737. PMLR, 18–24 Jul 2021b. URL https://proceedings.mlr.press/v139/yang21c.html.
- Yun et al. (2021) Zeyu Yun, Yiren Chen, Bruno A Olshausen, and Yann LeCun. Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors. arXiv preprint arXiv:2103.15949, 2021.
- Zhang et al. (2019) Jeffrey Zhang, Yiren Chen, Brian Cheung, and Bruno A Olshausen. Word embedding visualization via dictionary learning. arXiv preprint arXiv:1910.03833, 2019.
- Zhu et al. (2022) Lingxiao Zhu, Chaoyue Liu, Adityanarayanan Radhakrishnan, and Mikhail Belkin. Quadratic models for understanding neural network dynamics. arXiv preprint arXiv:2205.11787, 2022.
- Zhu et al. (2018) Xiaojin Zhu, Adish Singla, Sandra Zilles, and Anna N. Rafferty. An overview of machine teaching, 2018. URL https://arxiv.org/abs/1801.05927.
Appendix A Table of Contents
Here, we provide the table of contents for the appendix of the supplementary.
Appendix B Notations
Here we provide the glossary of notations followed in the supplementary material.
| Symbol | Description |
|---|---|
| | Activations |
| | Set of columns of a matrix |
| | Distributions over activations |
| | Dimension of ground-truth sample space |
| D | Dictionary matrix |
| | Eigenvalues of a matrix |
| | Element-wise product (inner product) of matrices |
| | Kernel of a matrix |
| | Eigenvectors (orthogonal vectors) |
| | Null set of a matrix |
| | Orthogonal complement of a matrix in the space of symmetric matrices |
| | Dimension of representation space |
| | Feature matrix |
| | Entry at the i-th row and j-th column of a matrix |
| | Target feature matrix |
| | Rank of a feature matrix |
| | Space of symmetric matrices |
| | Space of PSD, symmetric matrices |
| | Version space with respect to a feedback set |
| | The set |
| | The set |
| | Complete orthonormal basis |
| | Activation/Representation space |
| | Ground-truth sample space |
Appendix C Proof of Lemma 1
In this appendix we restate and provide the proof of Lemma 1.
Lemma 1 (Recovering orthogonal atoms).
Let be a symmetric positive semi-definite matrix. Define the set of orthogonal Cholesky decompositions of as
where and are the eigenvalues of in descending order. Then, for any two matrices , there exists an orthogonal matrix such that
where R is block diagonal with orthogonal blocks corresponding to any repeated diagonal entries in . Additionally, each column of can differ from the corresponding column of U by a sign change.
Proof.
Let be two orthogonal Cholesky decompositions of . Define . We will show that this matrix satisfies our requirements through the following steps:
First, we show that R is orthogonal. Note,
Similarly,
Now we show that $UR = V$.
To show that is block diagonal with orthogonal blocks corresponding to repeated eigenvalues, consider the partitioning based on distinct eigenvalues. Let be the set of indices corresponding to the -th distinct eigenvalue of , for , where is the number of distinct eigenvalues. Let denote the multiplicity of .
Define and as the submatrices of and consisting of columns indexed by , respectively.
Now, consider the block of corresponding to eigenvalues and . For , and correspond to different eigenspaces (as ), and thus their inner product is zero. Hence,
This implies
But then must be block diagonal:
where each is an orthogonal matrix. For eigenvalues with multiplicity one (), the corresponding block is a orthogonal matrix. The only possibilities are:
representing a sign change in the corresponding column of . For eigenvalues with multiplicity greater than one (), each block can be any orthogonal matrix. This allows for rotations within the eigenspace corresponding to the repeated eigenvalue .
Combining all steps, we have shown that:
where is an orthogonal, block-diagonal matrix. Each block corresponds to a distinct eigenvalue of and is either a matrix with entry (for unique eigenvalues) or an arbitrary orthogonal matrix of size equal to the multiplicity of (for repeated eigenvalues). This completes the proof of the lemma.
∎
Appendix D Worst-case bounds: Constructive case
In this Appendix, we provide the proof of the lower bound as stated in Proposition 1. Before we prove this lower bound, we state a useful property of the sum of a symmetric, PSD matrix and a general symmetric matrix in .
Lemma 5.
Let be a symmetric matrix with full rank, i.e., . For any arbitrary symmetric matrix , there exists a positive scalar such that the matrix is positive semidefinite.
Proof.
Since is symmetric and has full rank, it admits an eigendecomposition:
where are the positive eigenvalues and are the corresponding orthonormal eigenvectors of .
Define the constant as the maximum absolute value of the quadratic forms of with respect to the eigenvectors of :
Let be chosen as:
For each eigenvector , consider the quadratic form of :
This shows that each eigenvector satisfies:
Since forms an orthonormal basis for , for any vector , we can express as . Then:
since each term in the sum is non-negative.
Therefore, is positive semidefinite. ∎
Now, we provide the proof of Proposition 1 in the following:
Proof of Proposition 1.
Assume, for contradiction, that there exists a feedback set for Eq. (1) with size .
For each pair , is orthogonal to , implying that , the orthogonal complement of . Therefore,
This leads to
Since , we have
Adding to this span increases the dimension by at most one:
Since is a vector space with , there exists a symmetric matrix such that
By Lemma 5, there exists such that is PSD and symmetric. Since and is not a scalar multiple of , the matrix is not related to via linear scaling. However, it still satisfies Eq. (1), contradicting the minimality of .
Thus, any feedback set must satisfy
This establishes the stated lower bound on the feedback complexity of the feedback set. ∎
Appendix E Proof of Theorem 1: Upper bound
Below we provide proof of the upper bound stated in Theorem 1.
Consider the eigendecomposition of the matrix . There exists a set of orthonormal vectors with corresponding eigenvalues such that
(5) |
Denote the set of orthogonal vectors as .
Let , denoted as , be an orthogonal extension to the vectors in such that
forms an orthonormal basis for . Denote the complete basis as .
Note that precisely defines the null space of , i.e.,
The key idea of the proof is to manipulate this null space to satisfy the feedback set condition in Eq. (2) for the target matrix . Since has rank , the number of degrees of freedom is exactly . Alternatively, the span of the null space of , which has dimension exactly , fixes the remaining entries in .
Using this intuition, the teacher can provide pairs to teach the null space and the eigenvectors separately. However, it is necessary to ensure that this strategy is optimal in terms of sample efficiency. We confirm the optimality of this strategy in the next two lemmas.
E.1 Feedback set for the null space of
Our first result is on nullifying the null set of in the Eq. (2). Consider a partial feedback set
Lemma 6.
If the teacher provides the set , then the null space of any PSD symmetric matrix that satisfies Eq. (2) contains the span of , i.e.,
Proof.
Let be a matrix that satisfies Eq. (2) (note that satisfies Eq. (2)). Thus, we have the following equality constraints:
Since is a set of linearly independent vectors, it suffices to show that
(6) |
To prove Eq. (6), we utilize general properties of the eigendecomposition of a symmetric, positive semi-definite matrix. We express in its eigendecomposition as
where are the eigenvectors and are the corresponding eigenvalues of . Assume that satisfies
Consider the decomposition for scalars and . Now, expanding the equation above, we get
Since for all (because is PSD), it follows that each . Therefore,
This implies that , thereby proving Eq. (6).
With this we will argue that the feedback setup in Eq. (2) can be decomposed in two parts: first is teaching the null set , and second is teaching in the form of .
E.2 Feedback set for the kernel of
Next, we discuss how to teach , i.e. span the rows of any solution to Eq. (2) with the corresponding eigenvalues . We show that if the search space of metrics in Eq. (2) is the version space which is a restriction of the space to feedback set , then a feedback set of size at most is sufficient to teach up to feature equivalence. Thus, we consider the reformation of the problem in Eq. (2) as
(7) |
where the feedback set is devised to solve over a smaller space. With this, we state the following useful lemma on the size of the restricted feedback set.
Lemma 7.
Consider the problem as formulated in Eq. (7) in which the null set of the target matrix is known. Then, the teacher sufficiently and necessarily finds a set of size for oblivious learning up to feature equivalence.
Proof.
Note that any solution of Eq. (7) has its columns spanned exactly by the eigenvectors of the target. Alternatively, if we consider the eigendecomposition of such a solution, then the corresponding eigenvectors exist in this span. Furthermore, note that the target is of rank $r$, which implies there are only $\frac{r(r+1)}{2}$ degrees of freedom, i.e., entries in the matrix, that need to be fixed.
Thus, there are exactly linearly independent columns of , indexed as . Now, consider the set of matrices
This forms a basis to generate any matrix with independent columns along the indexed set. Hence, the span of induces a subspace of symmetric matrices of dimension in the vector space , i.e. the column vectors along the indexed set is spanned by elements of . Thus, it is clear that picking a feedback set of size in the orthogonal complement of , i.e. restricted by this span sufficiently teaches if is known. One exact form of this set is proven in Lemma 3. Since any solution is agnostic to the scaling of the target matrix , we have shown that the sufficiency on the feedback complexity for up to feature equivalence.
Now, we show that the stated feedback set size is necessary. The argument is similar to the proof of Lemma 5.
For the sake of contradiction assume that there is a smaller sized feedback set . This implies that there is some matrix in , a subspace induced by span , orthogonal to is not in the span of , denoted as . If is PSD then it is a solution to Eq. (7) and is not a scalar multiple of . Now, if is not PSD we show that there exists scalar such that
i.e. the sum is PSD. Consider the eigendecomposition of (assume )
for orthogonal eigenvectors and the corresponding eigenvalues . Since (assume) of the eigenvalues are negative we can rewrite as
Thus, if we can regulate the values of , for all , noting they are positive, then we can find an appropriate scalar . Let and . Now, setting achieves the desired property of as shown in the proof of Lemma 5.
Consider that both and are orthogonal to every element in the feedback set . This orthogonality implies that is not a unique solution to equation Eq. (7) up to a positive scaling factor.
Therefore, we have demonstrated that when the null set of the target matrix is known, a feedback set of size exactly is both necessary and sufficient. ∎
E.3 Proof of Lemma 3 and construction of feedback set for
Up until this point we haven't shown how to construct a feedback set of this size. Consider the following union:
We can show that this union is a set of linearly independent matrices of rank 1 as stated in Lemma 3 below.
Lemma 3.
Let be a set of orthogonal vectors. Then, the set of rank-1 matrices
is linearly independent in the space of symmetric matrices .
Proof.
We prove the claim by considering two separate cases. For the sake of contradiction, suppose that the set is linearly dependent. This implies that there exists at least one matrix of the form or that can be expressed as a linear combination of the other matrices in . We now examine these two cases individually.
Case 1: First, we assume that for some , can be written as a linear combination. Thus, there exists scalars that satisfy the following property
(8) | |||
(9) |
Now, note that we can write
But the following sum
doesn’t span (as column vectors) a subspace that contains the column vector because is a set of orthogonal vectors. Thus, we can write
(10) |
This implies that
(11) |
Since not all corresponding to (otherwise ) we have shown that can not be written as a linear combination of elements in .
Case 2: Now, we consider the second case where there exists some indices such that is a sum of linear combination of elements in . Note that this linear combination can’t have an element of type as it contradicts the first case. So, there are scalars such that
(12) | |||
(13) |
But we rewrite this as
Note that if then the corresponding and vice versa. Since are orthogonal, the decomposition above implies
(14) | |||
(15) | |||
(16) |
But using the arguments in Eq. (10) and Eq. (11), we can achieve Eq. (14) or Eq. (15).
Thus, we have shown that the set of rank-1 matrices as described in are linearly independent. ∎
In Lemma 7, we discussed that in order to teach sufficiently agent needs a feedback set of size if the null set of is known. We can establish this feedback set using the basis shown in Lemma 3. We state this result in the following lemma.
Lemma 9.
For a given target matrix and basis set of matrices as shown in Lemma 3, the following set spans a subspace of dimension in .
Proof.
Since has at least positive eigenvalues there exists a vector such that . It is straightforward to note that is orthogonal to . As and , . ∎
Now, we will complete the proof of the main result of the appendix here.
Proof of Theorem 1.
Combining the results from Lemma 6, Lemma 7, and Lemma 9, we conclude that the feedback setup in Eq. (2) can be effectively decomposed into teaching the null space and the span of the eigenvectors of . The constructed feedback sets ensure that is uniquely identified up to a linear scaling factor with optimal sample efficiency. ∎
Appendix F Proof of Theorem 1: Lower bound
In this appendix, we provide the proof of the lower bound as stated in Theorem 1. We proceed by first showing some useful properties of a valid feedback set for a target feature matrix . They are stated in Lemma 10 and Lemma 11.
First, we consider a basic spanning property of matrices for any pair in the space of symmetric matrices .
Lemma 10.
If such that then .
Proof.
Consider an such that . Note that the eigendecomposition of (assume )
for orthogonal eigenvectors and the corresponding eigenvalues has the property that . Using the arguments exactly as shown in the second half of the proof of Lemma 7, we can show there exists such that . But then is not feature equivalent to , which contradicts the assumption that is a valid feedback set. ∎
Lemma 11.
There exist vectors (of size ) such that and for any vector , .
Proof.
Assuming the contrary, there exists such that .
Now, if , then for any scalar , is both symmetric and positive semi-definite and satisfies all the conditions in Eq. (1) with respect to , a contradiction, as is not feature equivalent to .
So, consider the case when . Let be an orthogonal extension (the set is not trivially empty, in which case the proof follows easily) of such that forms a basis of , i.e.,
We will first show that there exists some orthogonal to and furthermore .
Consider the intersection (in the space ) of the orthogonal complement of the matrices , denote it as , i.e.,
Note that
Note that is in , and there exists some such that which is also orthogonal to the elements of the feedback set . Thus, has a null set which includes the subset .
Now, the rest of the proof involves showing existence of some scalar such that satisfies the conditions of Eq. (1) for the feedback set . Note that if then the proof is straightforward as , which implies . But this is precisely the condition for Lemma 10 to hold.
Without loss of generality assume that . First note that the eigendecomposition of has eigenvectors that are contained in . Consider some arbitrary choice of ; we will fix its value later. It is straightforward that is symmetric, since and are symmetric. In order to show that it is positive semi-definite, it suffices to show that
(17)
Since we can simplify Eq. (17) to
(18)
Consider the decomposition of any arbitrary vector as follows:
(19)
(20)
From here on we assume that . The alternate case is trivial as .
Now, we write the vectors as scalar multiples of their corresponding unit vectors
(21)
(22)
Remark: Although we have computed the norm of as in the orthonormal basis , note that the norm remains unchanged (the same as the ); this form is used for ease of analysis later on.
Using the decomposition in Eq. (19)-(20), we can write Eq. (18) as
(23)
Since we want we can further simplify Eq. (23) as
(24)
(25)
where we have used . In the next part of the proof, we show that is lower bounded by a positive constant, whereas is upper bounded by a positive constant, and that there is a choice of so that is always smaller than .
Considering , we note that is a unit vector with respect to the orthonormal basis . Expanding using the eigendecomposition of Eq. (5),
The last inequality follows as all the eigenvalues in the eigendecomposition are (strictly) positive. Denote this minimum eigenvalue as .
Considering , note that the only term that is variable (i.e. could change value) is , as is
Note that is a fixed vector and has a fixed norm (using Eq. (21)-(22)), so for some bounded constant whereas is already a constant. Now, exceeds only if
The rightmost inequality implies that is negative only for an bounded from above by a positive constant. But since is non-negative,
Now, using an argument similar to the second half of the proof of Lemma 7, it is straightforward to show that there is a choice of so that is always smaller than .
Now, for where is chosen so that , we note that
Using the equivalence in Eq. (23), Eq. (24) and Eq. (25), we have a choice of such that for any arbitrary vector . Hence, we have achieved the conditions in Eq. (18), which is the simplification of Eq. (17). This implies that is positive semi-definite.
This implies that there does not exist a such that , otherwise the assumption that is an oblivious feedback set for is violated. Thus, the statement of Lemma 11 has to hold. ∎
F.1 Proof of lower bound in Theorem 1
In the following, we provide the proof of the main statement on the lower bound on the size of a feedback set.
If either of the two lemmas (10–11) is violated, we can show that there exist and such that . In order to ensure these statements, the feedback set should have many elements, which proves the lower bound on .
But using Lemma 7 and Lemma 9, we know that the dimension of the span of matrices that satisfy the condition in Lemma 10 is at least . We can use Lemma 9 where (note ). Thus, any basis matrix in satisfies the conditions in Lemma 10.
Since the dimension of is at least , there are at least directions, or linearly independent matrices (in ), that need to be spanned by .
Thus, Lemma 10 implies there are linearly independent matrices (in ) that need to be spanned by . Similarly, Lemma 11 implies there are linearly independent matrices (in ) that need to be spanned by . Note that the column vectors of these matrices from the two statements are spanned by orthogonal sets of vectors, i.e. one by and the other by respectively. Thus, these are linearly independent in , which forces a lower bound on the size of (a lower-dimensional span cannot contain a set of vectors spanning a higher-dimensional space). This completes the proof of the lower bound in Theorem 1.
Appendix G Proof of Theorem 3: General Activations Sampling
We aim to establish both upper and lower bounds on the feedback complexity for oblivious learning in Algorithm 2. The proof revolves around the linear independence of certain symmetric matrices derived from random representations and the dimensionality required to span a target feature matrix.
Let us define a positive index . The agent receives representations:
For each , we define the symmetric matrix .
Consider the matrix formed by concatenating the vectorized :
where each is treated as a column vector in . The vectorization operation for a symmetric matrix is defined as:
The determinant is a non-zero polynomial in the entries of . Since the vectors are drawn from a continuous distribution , using Sard’s theorem the probability that is zero, i.e.,
This implies that, with probability 1, the set is linearly independent in :
(26)
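A minimal numerical sketch of this almost-sure linear independence follows. It is illustrative only: it assumes standard Gaussian activations (one example of a continuous distribution) and takes exactly as many samples as the dimension of the space of symmetric matrices, d(d+1)/2.

import numpy as np

rng = np.random.default_rng(2)
d = 5
m = d * (d + 1) // 2     # dimension of the space of d x d symmetric matrices

def vech(S):
    # Vectorize the upper triangle (including the diagonal) of a symmetric matrix.
    return S[np.triu_indices_from(S)]

X = rng.standard_normal((m, d))                            # activations x_1, ..., x_m
Phi = np.stack([vech(np.outer(x, x)) for x in X], axis=1)  # columns are vech(x_i x_i^T)

# With probability 1 the square design matrix has full rank, i.e. the
# rank-1 matrices x_i x_i^T are linearly independent.
assert np.linalg.matrix_rank(Phi) == m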
Next, let be an arbitrary target feature matrix for learning with feedback in Algorithm 2. Without loss of generality, assume . Define the set of rescaled pairs as:
noting that .
Assume, for contradiction, that the elements of are linearly dependent in . Then, there exist scalars (not all zero) such that:
However, since are linearly independent with probability 1, it must be that:
Given that , this implies for all , contradicting the assumption of linear dependence. Therefore, matrices induced by are linearly independent.
This implies that induces a set of linearly independent matrices, i.e., in the orthogonal complement , and since has at most degrees of freedom, any matrix satisfying:
must be a positive scalar multiple of .
Thus, using Eq. (26), with probability 1, the feedback set is valid:
Since was arbitrary, the worst-case feedback complexity is almost surely upper bounded by for achieving feature equivalence.
For the lower bound, consider the proof of the lower bound in Theorem 1, specifically Lemma 10, which asserts that for any feedback set in Algorithm 1, given any target matrix , if such that then where () is defined as the set of eigenvectors in the eigendecomposition of (see Eq. (5)).
This implies that any feedback set must span certain matrices . Suppose the agent receives representations and constructs:
Now, consider the polynomial equation . Since every entry of is semantically different, the determinant is a non-zero polynomial. Note that there are many degrees of freedom for the rows. Thus, it is clear that the zero set has Lebesgue measure zero if , i.e. at least columns are required for not to be identically zero. But this implies that the set cannot span (almost surely) if . Hence, (almost surely) the agent cannot devise a feedback set for oblivious learning in Algorithm 2. In other words, if ,
Hence, to span , it almost surely requires at least representations. Therefore, the feedback complexity cannot be lower than .
Combining the upper and lower bounds, we conclude that the feedback complexity for oblivious learning in Algorithm 2 is tightly bounded by .
Appendix H Proof of Theorem 4: Sparse Activations Sampling
Here we consider the analysis for the case when the activations are sampled from the sparse distribution as stated in Definition 1.
In Theorem 3, we assumed that the activations are sampled from a Lebesgue distribution. This suffices to ensure that (almost surely) any random sampling of activations induces a set of linearly independent rank-1 matrices. Since the distribution in Assumption 1 is not a Lebesgue distribution over the entire support , we require an understanding of certain events in the sampling of activations which lead to linearly independent rank-1 matrices.
In the proof of Theorem 2, we used a set of sparse activations based on the standard basis of the vector space . We note that the idea can be generalized to an arbitrary choice of scalars as well, i.e.,
Here is the th standard basis vector. Note that the corresponding set of rank-1 matrices, denoted as
is linearly independent in the space of symmetric matrices on , i.e., .
Assume that activations are sampled times, denoted as . Now, consider the design matrix as shown in the proof of Theorem 3. We know that if is non-zero then are linearly independent in . To show that a sampled set exhibits this property, we need to show that is not identically zero, which is possible for activations sampled from sparse distributions as stated in Assumption 1, i.e. .
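This event can also be probed empirically. The following sketch is illustrative only: it assumes s-sparse activations with supports drawn uniformly at random and non-zero values uniform on [-1, 1] (one instance consistent in spirit with Assumption 1), and estimates how often a batch of samples already induces d(d+1)/2 linearly independent rank-1 matrices.

import numpy as np

rng = np.random.default_rng(3)
d, s = 6, 2                    # ambient dimension and sparsity level (illustrative)
target = d * (d + 1) // 2      # dimension of the space of symmetric matrices

def sparse_activation():
    # s-sparse activation: uniform support, continuous non-zero values.
    x = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    x[support] = rng.uniform(-1.0, 1.0, size=s)
    return x

def vech(S):
    return S[np.triu_indices_from(S)]

def prob_full_rank(n, trials=200):
    # Fraction of trials in which n sampled activations induce target-many
    # linearly independent rank-1 matrices.
    hits = 0
    for _ in range(trials):
        Phi = np.stack([vech(np.outer(x, x))
                        for x in (sparse_activation() for _ in range(n))], axis=1)
        hits += int(np.linalg.matrix_rank(Phi) == target)
    return hits / trials

# Sparse sampling typically needs noticeably more than target-many activations.
print(prob_full_rank(2 * target), prob_full_rank(8 * target))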
Note that . Consider the diagonal of , and the situation where all of its entries are non-zero. This corresponds to sampling a set of activations of the form . Now, consider the following random design matrix .
Now, a random design matrix is not identically zero for any set of randomly sampled activations that satisfy the following indexing property:
(27)
This is so because, for the identity permutation, we have . Now, we will compute the probability that is sampled from . Using the independence of the sampling of each index of an activation, the probabilities for the two subsets of can be computed as follows:
• activations such that . Using independence, we have
• The rest of the activations in Eq. (27) require at least two indices to be non-zero. This can be computed as
Now, note that these activations can be permuted in ways and thus
(28)
Now, we will complete the proof of the theorem using Hoeffding's inequality. Assume that the agent samples activations; we will compute the probability that . Consider all possible -subsets of items, enumerated as . Now, define random variables as
Now, define the sum random variable . We want to understand the probability . Note that
Now, using Hoeffding’s inequality
Now, for a given choice of of , we want . Using Stirling's approximation,
Appendix I Additional Experiments
We evaluate our methods on large-scale models: Pythia-70M (Biderman et al., 2023) and Board Games Models (Karvonen et al., 2024). These models present significant computational challenges, as different feedback methods generate millions of constraints. This scale necessitates specialized approaches for both memory management and computational efficiency.
Memory-efficient constraint storage
The high dimensionality of model dictionaries makes storing complete activation indices for each feature prohibitively memory-intensive. We address this by enforcing constant sparsity constraints, limiting activations to a maximum sparsity of 3. This constraint enables efficient storage of large-dimensional arrays while preserving the essential characteristics of the features.
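A minimal sketch of this storage scheme (names and shapes below are illustrative, not the exact implementation): each activation is kept as fixed-width (index, value) arrays with at most three slots instead of a dense vector.

import numpy as np

MAX_SPARSITY = 3   # at most three non-zero entries are kept per activation

def compress(x, max_s=MAX_SPARSITY):
    # Store a d-dimensional activation as fixed-width (indices, values) arrays.
    nz = np.flatnonzero(x)[:max_s]
    idx = np.full(max_s, -1, dtype=np.int32)      # -1 marks an unused slot
    val = np.zeros(max_s, dtype=np.float32)
    idx[: len(nz)] = nz
    val[: len(nz)] = x[nz]
    return idx, val

def decompress(idx, val, d):
    # Recover the dense activation from its compressed form.
    x = np.zeros(d, dtype=np.float32)
    mask = idx >= 0
    x[idx[mask]] = val[mask]
    return x

x = np.zeros(512, dtype=np.float32)
x[[7, 19, 305]] = [0.4, -1.2, 0.9]
idx, val = compress(x)
assert np.allclose(decompress(idx, val, 512), x)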
Computational optimization
To efficiently handle constraint satisfaction at scale, we reformulate the problem as a matrix regression task, as detailed in Fig. 5. The learner maintains a low-rank decomposition of the feature matrix , assuming , where represents the learned dictionary. This formulation allows for efficient batch-wise optimization over the constraint set while maintaining feasible memory requirements.
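The sketch below illustrates one way such a batch-wise low-rank regression could look; it is not the exact objective of Fig. 5 and assumes, for concreteness, triplet feedback of the form "b is closer to a than c" under the learned metric, enforced with a hinge loss on the low-rank factor.

import torch

d, r, batch = 64, 16, 256                      # activation dim, assumed rank, batch size
D = torch.randn(d, r, requires_grad=True)      # learned factor; the feature matrix is D D^T
opt = torch.optim.Adam([D], lr=1e-2)

def metric_sq(D, x, y):
    # (x - y)^T (D D^T) (x - y), computed without forming D D^T explicitly.
    z = (x - y) @ D
    return (z * z).sum(dim=1)

# A stand-in batch of triplet constraints (a, b, c): "b is closer to a than c".
a, b, c = (torch.randn(batch, d) for _ in range(3))

for _ in range(100):
    opt.zero_grad()
    margin = metric_sq(D, a, b) - metric_sq(D, a, c)
    loss = torch.relu(margin + 1.0).mean()     # hinge loss on violated comparisons
    loss.backward()
    opt.step()          # in practice, a fresh mini-batch of constraints is used per step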
Dictionary features of Pythia-70M
We use the publicly available repository for dictionary learning via sparse autoencoders on neural network activations (Marks et al., 2024a). We consider the dictionaries trained for Pythia-70M (Biderman et al., 2023) (a general-purpose LLM trained on publicly available datasets). We retrieve the corresponding autoencoders for attention output layers which have dimensions . Note that . For the experiments, we use -sparsity on uniform sparse distributions. We present the plots for ChessGPT in two parts in Fig. 6 and Fig. 7 for different feedback methods.
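For illustration, the activation sampling used in these experiments can be mimicked as follows; the dictionary below is a random stand-in (the real dictionaries come from the trained autoencoders), and the shapes and sparsity level are assumptions for the sketch rather than the exact experimental values.

import numpy as np

rng = np.random.default_rng(4)
d_model, n_features, k = 512, 4096, 3          # illustrative shapes and sparsity level

# Random stand-in for a trained dictionary (columns are feature directions).
D = rng.standard_normal((d_model, n_features)) / np.sqrt(d_model)

def sample_sparse_code():
    # k-sparse code with a uniformly chosen support and non-negative values.
    s = np.zeros(n_features, dtype=np.float32)
    support = rng.choice(n_features, size=k, replace=False)
    s[support] = rng.uniform(0.0, 1.0, size=k)
    return s

s = sample_sparse_code()
x = D @ s        # synthetic layer activation induced by the sparse code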