
Rethinking the Construction of Effective Metrics for Understanding the Mechanisms of Pretrained Language Models

You Li*  Jinhui Yin*  Yuming Lin†
Guangxi Key Laboratory of Trusted Software, Guilin University of Electronic Technology
[email protected][email protected][email protected]
Abstract

Pretrained language models are expected to effectively map input text to a set of vectors while preserving the inherent relationships within the text. Consequently, designing a white-box model to compute metrics that reflect the presence of specific internal relations in these vectors has become a common approach for post-hoc interpretability analysis of pretrained language models. However, achieving interpretability in white-box models and ensuring the rigor of metric computation becomes challenging when the source model lacks inherent interpretability. Therefore, in this paper, we discuss striking a balance in this trade-off and propose a novel line of constructing metrics for understanding the mechanisms of pretrained language models. We have specifically designed a family of metrics along this line of investigation, and the model used to compute these metrics is referred to as the tree topological probe. We conducted measurements on BERT-large by using these metrics. Based on the experimental results, we propose a speculation regarding the working mechanism of BERT-like pretrained language models, as well as a strategy for enhancing fine-tuning performance by leveraging the topological probe to improve specific submodules. Our code is available at https://github.com/cclx/Effective_Metrics.

* Equal contribution.  † Corresponding Author.

1 Introduction

Pretrained language models consisting of stacked transformer blocks (Vaswani et al., 2017) are commonly expected to map input text to a set of vectors, such that any relationship in the text corresponds to some algebraic operation on these vectors. However, it is generally unknown whether such operations exist. Therefore, designing a white-box model that computes a metric for a given set of vectors corresponding to a text, which reflects to some extent the existence of operations extracting specific information from the vectors, is a common approach for post-hoc interpretability analysis of such models (Maudslay et al., 2020; Limisiewicz and Marecek, 2021; Chen et al., 2021; White et al., 2021; Immer et al., 2022). However, even though we may desire strong interpretability from a white-box model and metrics computed by it that rigorously reflect the ability to extract specific information from a given set of vectors, it can be challenging to achieve both of these aspects simultaneously when the source model lacks inherent interpretability. Therefore, making implicit assumptions during metric computation is common (Kornblith et al., 2019; Wang et al., 2022). A simple example is the cosine similarity of contextual embeddings. This metric is straightforward and has an intuitive geometric interpretation, making it easy to explain, but it tends to underestimate the similarity of high-frequency words (Zhou et al., 2022).

On the other hand, due to the intuition that ’if a white-box model cannot distinguish embeddings that exhibit practical differences (such as context embeddings and static embeddings), it should be considered ineffective,’ experimental validation of a white-box model’s ability to effectively distinguish between embeddings with evident practical distinctions is a common practice in research. Furthermore, if the magnitude of metrics computed by a white-box model strongly correlates with the quality of different embeddings in practical settings, researchers usually trust its effectiveness. Therefore, in practice, traditional white-box models actually classify sets of vectors from different sources.

Taking the structural probe proposed by Hewitt and Manning as an example, they perform a linear transformation on the embedding of each complete word in the text and use the square of the L2 norm of the transformed vector as a prediction for the depth of the corresponding word in the dependency tree (Hewitt and Manning, 2019). In this way, the linear transformation matrix serves as a learning parameter, and the minimum risk loss between the predicted and true depths is used as a metric. Intuitively, the smaller the metric is, the more likely the embedding contains complete syntax relations. The experimental results indeed align with this intuition, showing that contextual embeddings (such as those generated by BERT (Devlin et al., 2019)) outperform static embeddings. However, because the true depth distribution is unknown, it is challenging to deduce from the setup of the structural probe which geometric features within the representations influence the magnitude of its measurements. Overall, while the results of the structural probe provide an intuition that contextual embeddings, such as those generated by BERT, capture richer syntactic relations than traditional embeddings, it is currently impossible to know what the geometric structure of a "good" embedding is for the metric defined by the structural probe.

In addition, to enhance the interpretability and flexibility of white-box models, it is common to include assumptions that are challenging to empirically validate. For example, Ethayarajh proposed to use anisotropy-adjusted self-similarity to measure the context-specificity of embeddings (Ethayarajh, 2019). Since the computation of this metric doesn’t require the introduction of additional human labels, it is theoretically possible to conduct further analysis, such as examining how fundamental geometric features in the representation (e.g., rank) affect anisotropy-adjusted self-similarity, or simply consider this metric as defining a new geometric feature. Overall, this is a metric that can be discussed purely at the mathematical level. However, verifying whether the measured context-specificity in this metric aligns well with context-specificity in linguistics, without the use of, or with only limited additional human labels, may be challenging. Additionally, confirming whether the model leverages the properties of anisotropy-adjusted self-similarity during actual inference tasks might also be challenging.

There appears to be a trade-off here between two types of metrics:

1. Metrics that are constrained by supervised signals with ground truth labels, which provide more practical intuition.

2. Metrics that reflect the geometric properties of the vector set itself, which provide a more formal representation.

Therefore, we propose a new line that takes traditional supervised probes as the structure of the white-box model and then self-supervises it, trying to preserve both of the abovementioned properties as much as possible. The motivation behind this idea is that any feature that is beneficial for interpretability has internal constraints. If a certain feature has no internal constraints, it must be represented by a vector set without geometric constraints, which does not contain any interpretable factors. Therefore, what is important for interpretability is the correspondence between the internal constraints of the probed features and the vector set, which can describe the geometric structure of the vector set to some extent. In the case where the internal constraints of the probed features are well defined, a probe that detects these features can naturally induce a probe that detects the internal constraints, which is self-supervised.

In summary, the contributions of this work include:

1. We propose a novel self-supervised probe, referred to as the tree topological probe, to probe the hierarchical structure of sentence representations learned by pretrained language models like BERT.

2. We discuss the theoretical relationship between the tree topological probe and the structural probe, with the former bounding the latter.

3. We measure the metrics constructed based on the tree topological probe on BERT-large. According to the experimental results, we propose a speculation regarding the working mechanism of a BERT-like pretrained language model.

4. We utilize metrics constructed by the tree topological probe to enhance BERT's submodules during fine-tuning and observe that enhancing certain modules can improve the fine-tuning performance. We also propose a strategy for selecting submodules.

2 Related Work

The probe is the most common approach for associating neural network representations with linguistic properties (Voita and Titov, 2020). This approach is widely used to explore part-of-speech knowledge (Belinkov and Glass, 2019; Voita and Titov, 2020; Pimentel et al., 2020b; Hewitt et al., 2021) and sentence and dependency structures (Hewitt and Manning, 2019; Maudslay et al., 2020; White et al., 2021; Limisiewicz and Marecek, 2021; Chen et al., 2021). These studies demonstrate that many important aspects of linguistic information are encoded in pretrained representations. However, in some probe experiments, researchers have found that the probe precision obtained with random representations and with pretrained representations was quite close (Zhang and Bowman, 2018; Hewitt and Liang, 2019). This demonstrates that probe precision alone is not sufficient to measure whether representations contain specific linguistic information. To improve the reliability of probes, some researchers have proposed the use of control tasks in probe experiments (Hewitt and Liang, 2019). In recent research, Lovering et al. realized that inductive bias can be used to describe the ease of extracting relevant features from representations. Immer et al. further proposed a Bayesian framework for quantifying inductive bias with probes, and they used the model evidence maximum instead of trivial precision.

3 Methodology

As the foundation of the white-box model proposed in this paper is built upon the traditional probe, we will begin by providing a general description of the probe based on the definition presented in Ivanova et al. (2021). Additionally, we will introduce some relevant notation for better understanding.

3.1 General Form of the Probe

Given a character set, in a formal language, the generation rules uniquely determine the properties of the language. We assume that there also exists a set of generation rules \mathcal{R} implicitly in natural language, and the language objects derived from these rules exhibit a series of features. Among these features, a subset Y is selected as the probed feature, for which the properties represent the logical constraints of the generation rule set. Assuming there is another model \mathcal{M} that can assign a suitable representation vector to the generated language objects, the properties of Y are then represented by the intrinsic geometric constraints of the vector set. By studying the geometric constraints that are implicit in the vector set and that correspond to Y, especially when Y is expanded to all features of the language object, we can determine the correspondence between \mathcal{M} and \mathcal{R}. The probe is a model that investigates the relationship between the geometric constraints of the vector set and Y. It is composed of a function set F and a metric E_Y defined on Y. The input of a function in F is the representation vector of a language object, and the output is the predicted Y feature of the input language object. The distance between the predicted feature and the true feature is calculated by using the metric E_Y, and a function f in F that minimizes the distance is determined. Here, F limits the range of geometric constraints, and E_Y limits the selection of a "good" geometry. Notably, this definition seems very similar to that of learning. Therefore, the larger the scope of F is, the harder it is to discern the form of the geometric constraints, especially when F is a neural network (Pimentel et al., 2020b; White et al., 2021). However, the purpose of the probe is different from that of learning. The goal of learning is to construct a model \mathcal{M} (usually a black box), which may have multiple construction methods, while the purpose of the probe is to analyze the relationship between \mathcal{M} and \mathcal{R}.

3.2 The Design Scheme for the Topological Probe

One of the goals of topology is to find homeomorphic or homotopic invariants (including invariant quantities, algebraic structures, functors, etc.) and then to characterize the intrinsic structure of a topological space with these invariants. Analogously, we can view R as a geometric object and Y as its topology. Can we then define a concept similar to topological invariants with respect to Y?

We define a feature invariant for Y as a set of conditions C_Y such that any element in Y satisfies C_Y. C_Y reflects the internal constraints of the probed feature, as well as a part of the logical constraints of R. Furthermore, if C_Y is well defined, it induces a set X_{C_Y} consisting of all objects satisfying C_Y, which naturally extends the metric defined on Y to X_{C_Y}.

Furthermore, just as the distance measure between two points can induce a distance measure between a point and a plane, the distance measure between the predicted feature px and X_{C_Y} can also be induced by E_Y (denoted as E_{C_Y}):

E_{C_Y}(px, X_{C_Y}) = \min_{x \in X_{C_Y}} E_Y(px, x)    (1)

It can be easily verified that if E_Y is a well-defined distance metric on Y, then E_{C_Y} should also be a well-defined distance metric on px. Once we have E_{C_Y}, the supervised probe (F, E_Y, Y) can naturally induce a self-supervised probe (F, E_{C_Y}, C_Y). We refer to (F, E_{C_Y}, C_Y) as the self-supervised version of (F, E_Y, Y), also known as the topological probe.

Notably, the prerequisite for obtaining (F, E_{C_Y}, C_Y) is that C_Y must be well defined, so C_Y should not be a black box. Figure 1 shows an intuitive illustration.

Figure 1: The relationship between the distance from the predicted feature A to X_{C_Y} and the distance from A to X_Y.

Next, we present a specific topological probe that is based on the previously outlined design scheme and serves as a self-supervised variant of the structural probe.

3.3 The Self-supervised Tree Topological Probe

Given a sentence W, it is represented by a model M as a set (or sequence) of vectors, denoted as H = M(W). The number of vectors in H is denoted as L_H, and we assign an index (1, 2, \ldots, L_H) to each vector in H so that the order of the indices matches the order of the corresponding tokens in the sentence. Additionally, we denote the dimension of the vectors as n. For each W, there exists a syntax tree T_W, where each complete word in W corresponds to a node in T_W.

The probed feature Y that the structural probe defines is the depth of the nodes corresponding to complete words. Following the work in (Hewitt and Manning, 2019), we set the parameter space of F for the structural probe to be all real matrices of size m*n, where m < n. The specific form for predicting the depth is as follows: given p \in R, \forall 1 \leq i \leq L_H,

pdep(h_i) = \|f * h_i\|^p    (2)

where pdep(h_i) is the predicted tree depth of w_i in T_W and f is a real matrix of size m*n. Because for any p < 2 there is a tree that cannot be embedded as above (Reif et al., 2019), p is usually taken as 2. pdep(h_1), pdep(h_2), \cdots, pdep(h_{L_H}) form a sequence denoted as pdep_H.

Moreover, we denote the true depth of w_i as dep(w_i). Hence, dep(w_1), dep(w_2), \cdots, dep(w_{L_H}) also form a sequence denoted as dep_W. The metric E in the structural probe is defined as follows:

E(pdep_H, dep_W) = \frac{1}{L_H} \sum_{i=1}^{L_H} (pdep(h_i) - dep(w_i))^2    (3)

Therefore, the structural probe is defined as (\|f*\|^2, E, dep).
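For concreteness, the following is a minimal sketch of this computation for a single sentence (PyTorch assumed; the function and tensor names are ours and not part of the original probe implementation):

```python
import torch

def structural_probe_metric(H, dep_W, f):
    """Metric E of Eq. 3 for one sentence, a minimal sketch.

    H:     (L_H, n) tensor of embeddings produced by the probed model.
    dep_W: (L_H,)   tensor of gold tree depths dep(w_i).
    f:     (m, n)   probe matrix with m < n (the learnable parameter).
    """
    # Eq. 2 with p = 2: the predicted depth is the squared L2 norm of f * h_i.
    p_dep = (H @ f.T).pow(2).sum(dim=-1)          # shape (L_H,)
    # Eq. 3: mean squared distance between predicted and true depths.
    return ((p_dep - dep_W) ** 2).mean()
```

Training the probe then amounts to minimizing this quantity over f, e.g., with gradient descent, as in Hewitt and Manning (2019).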

Now we provide the constraints C_dep for dep. An important limitation of dep_W is that it is an integer sequence. Based on the characteristics of the tree structure, it is naturally determined that dep_W must satisfy the following two conditions:

(Boundary condition). If L_H \geq 1, there is exactly one minimum element in dep_W, and it is equal to 1; if L_H \geq 2, at least one element in dep_W is equal to 2.

(Recursion condition). If we sort dep_W in ascending order to obtain the sequence asdep_W, then \forall 1 \leq i \leq L_H - 1,

asdep(w_{i+1}) = asdep(w_i)  or  asdep(w_{i+1}) = asdep(w_i) + 1
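The two conditions amount to a simple membership test; the sketch below (the function name is ours) checks whether an integer depth sequence satisfies C_dep:

```python
def satisfies_C_dep(dep):
    """Check the boundary and recursion conditions on an integer depth sequence dep_W."""
    L = len(dep)
    if L == 0:
        return True
    # Boundary condition: the minimum element is 1 and occurs exactly once ...
    if min(dep) != 1 or dep.count(1) != 1:
        return False
    # ... and, when L_H >= 2, at least one element equals 2.
    if L >= 2 and 2 not in dep:
        return False
    # Recursion condition: after sorting ascending, consecutive elements differ by 0 or 1.
    asdep = sorted(dep)
    return all(asdep[i + 1] - asdep[i] in (0, 1) for i in range(L - 1))

# For example, [1, 2, 2, 3, 4] satisfies C_dep, while [1, 2, 4] violates the recursion condition.
```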

We denote the set of all sequences that conform to C_dep as X_{C_dep}. From Equation 1, we can induce a metric E_{C_dep}:

E_{C_dep}(pdep_H, X_{C_dep}) = \min_{x \in X_{C_dep}} E(pdep_H, x)    (4)

Assuming we can construct an explicit sequence mins_W such that

mins_W = \arg\min_{x \in X_{C_dep}} \sum_{i=1}^{L_H} (pdep(h_i) - x(w_i))^2    (5)

we can obtain an analytical expression for E_{C_dep} as follows:

E_{C_dep}(pdep_H, X_{C_dep}) = E(pdep_H, mins_W)    (6)

Consider the following two examples:

1. When pdep_H = 0.8, 1.5, 1.8, 2.4, 4.5, then mins_W = 1, 2, 2, 3, 4.

2. When pdep_H = 0.8, 1.5, 1.8, 2.4, 7.5, then mins_W = 1, 2, 3, 4, 5.

It can be observed that the predicted depths for nodes further down the hierarchy can also influence the values of mins_W for nodes higher up in the hierarchy. In the examples provided, due to the change from 4.5 to 7.5, the mins_W value corresponding to 1.8 changes from 2 to 3. Therefore, a straightforward local greedy approach may not yield an exact calculation of mins_W, and if a simple enumeration method is employed, the computational complexity becomes exponential.

However, while a local greedy approach may not always provide an exact computation of mins_W, it can still maintain a certain degree of accuracy for reasonable pdep_H. This is because cases like the jump from 2.4 to 7.5 should be infrequent in the sequence of predicted depths computed by a well-trained probe, unless the probed representation does not encode the tree structure well and exhibits a disruption in the middle.

Before delving into that, we first introduce some notations:

  • apdep_H denotes the sequence obtained by sorting pdep_H in ascending order.

  • apdep_i denotes the i-th element of apdep_H.

  • pre_W is a sequence in X_{C_dep}.

Here, we introduce a simple method for constructing mins_W from a local greedy perspective.

(Initialization). If L_H \geq 1, let pre(w_1) = 1; if L_H \geq 2, let pre(w_2) = 2.

(Recurrence). If L_H \geq 3 and 3 \leq i \leq L_H, let

pre(w_i) = pre(w_{i-1}) + bias_{i-1}    (7)

where bias_{i-1} is determined by apdep_H as follows: if

|pre(w_{i-1}) + 1 - apdep_i| \leq |pre(w_{i-1}) - apdep_i|,

then bias_{i-1} = 1; otherwise, bias_{i-1} = 0.

(Alignment). Let a_i (1 \leq i \leq L_H) denote the index of apdep_i in pdep_H. Then, let

pesu(w_{a_i}) = pre(w_i)    (8)

It can be shown that pesu_W constructed in the above manner satisfies the following theorem:

Theorem 1.

If \forall i = 1, 2, \cdots, L_H - 1, apdep_{i+1} - apdep_i \leq 1, then

E(pdep_H, pesu_W) = E(pdep_H, mins_W)

Therefore, pesu_W can be considered an approximation to mins_W. Appendix A contains the proof of this theorem. In the subsequent sections of this paper, we replace E_{C_dep}(pdep_H, X_{C_dep}) with E(pdep_H, pesu_W).
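The construction above is easy to implement; the sketch below (names are ours) computes pesu_W from a predicted depth sequence and, by Theorem 1, attains the same E value as mins_W whenever consecutive sorted predictions differ by at most 1:

```python
def pesu_sequence(p_dep):
    """Local greedy approximation pesu_W of mins_W for a list of predicted depths pdep_H."""
    L = len(p_dep)
    # Alignment bookkeeping: order[i] is the index a_{i+1} of the (i+1)-th smallest prediction.
    order = sorted(range(L), key=lambda i: p_dep[i])
    ap_dep = [p_dep[i] for i in order]          # apdep_H, the predictions sorted ascending
    # Initialization.
    pre = [1] if L >= 1 else []
    if L >= 2:
        pre.append(2)
    # Recurrence: increase the depth by 1 only when that brings it closer to apdep_i.
    for i in range(2, L):
        bias = 1 if abs(pre[-1] + 1 - ap_dep[i]) <= abs(pre[-1] - ap_dep[i]) else 0
        pre.append(pre[-1] + bias)
    # Alignment: scatter pre_W back to the original token positions.
    pesu = [0] * L
    for i, a_i in enumerate(order):
        pesu[a_i] = pre[i]
    return pesu
```

Plugging pesu_W into Equation 3 then yields the approximation of E_{C_dep}(pdep_H, X_{C_dep}) used below.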

Additionally, an important consideration is determining the appropriate value of the minimum element of dep_W in the boundary condition. In the preceding content, we assumed a root depth of 1 for the syntactic tree. However, in the traditional structural probe (Hewitt and Manning, 2019; Maudslay et al., 2020; Limisiewicz and Marecek, 2021; Chen et al., 2021; White et al., 2021), the root depth is typically assigned as 0 due to the annotation conventions of syntactic tree datasets. From a logical perspective, these two choices may appear indistinguishable.

However, in Appendix B, we demonstrate that the choice of whether the root depth is 0 has a significant impact on the geometry defined by the tree topological probe. Furthermore, we can prove that as long as the assigned root depth is greater than 0, the optimal geometry defined by the tree topological probe remains the same to a certain extent. Therefore, in the subsequent sections of this paper, we adopt the setting where the value of the minimum element of dep_W is 1.

3.4 Enhancements to the Tree Topological Probe

Let the set of all language objects generated by rule R be denoted as \mathcal{X}_R, and the cardinality of \mathcal{X}_R be denoted as |\mathcal{X}_R|. The structural probe induces a metric that describes the relationship between model M and dep:

\mathcal{X}_{sp}(M) = \min_{f \in F} \frac{1}{|\mathcal{X}_R|} \sum_{W \in \mathcal{X}_R} E(pdep_{M(W)}, dep_W)    (9)

The tree topological probe can also induce a similar metric:

\mathcal{X}_{ssp}(M) = \min_{f \in F} \frac{1}{|\mathcal{X}_R|} \sum_{W \in \mathcal{X}_R} E(pdep_{M(W)}, mins_W)    (10)

On the other hand, we let

maxs_W = \arg\max_{x \in X_{C_dep}} \sum_{i=1}^{L_H} (pdep(h_i) - x(w_i))^2    (11)

similar to mins_W; maxs_W then induces the following metric:

\mathcal{X}_{essp}(M) = \min_{f \in F} \frac{1}{|\mathcal{X}_R|} \sum_{W \in \mathcal{X}_R} E(pdep_{M(W)}, maxs_W)    (12)

Since dep_W \in X_{C_dep}, when f is given, we have

E(pdep_{M(W)}, dep_W) \leq \max_{x \in X_{C_dep}} E(pdep_{M(W)}, x)

Furthermore, as \mathcal{X}_{sp}(M) and \mathcal{X}_{essp}(M) share the same set of probing functions F, we have

\mathcal{X}_{sp}(M) \leq \mathcal{X}_{essp}(M)

Therefore, \mathcal{X}_{essp}(M) provides us with an upper bound for the structural probe metric. Similarly, for \mathcal{X}_{ssp}(M), we also have

\mathcal{X}_{ssp}(M) \leq \mathcal{X}_{sp}(M)

Therefore, \mathcal{X}_{ssp}(M) provides us with a lower bound for the structural probe metric. In summary, we have the following:

\mathcal{X}_{ssp}(M) \leq \mathcal{X}_{sp}(M) \leq \mathcal{X}_{essp}(M)

If \mathcal{X}_{ssp}(M) = \mathcal{X}_{essp}(M), then there is no difference between the tree topological probe and the structural probe. On the other hand, if it is believed that a smaller \mathcal{X}_{sp}(M) is desirable, then estimating \mathcal{X}_{sp}(M) within the range [\mathcal{X}_{ssp}(M), \mathcal{X}_{essp}(M)] becomes an interesting problem. We consider the following:

\theta_W = \frac{E(pdep_{M(W)}, dep_W) - E(pdep_{M(W)}, mins_W)}{E(pdep_{M(W)}, maxs_W) - E(pdep_{M(W)}, mins_W)}    (13)

This leads to an intriguing linguistic distribution: the distribution of \theta_W \in [0, 1] when W is sampled uniformly from \mathcal{X}_R. We suppose the density function of this distribution is denoted as P_\theta, and the expectation with respect to \theta is denoted as E_{P_\theta}. Then we can approximate \mathcal{X}_{sp}(M) as follows:

\mathcal{X}_{sp}(M) = E_{P_\theta} \mathcal{X}_{essp}(M) + (1 - E_{P_\theta}) \mathcal{X}_{ssp}(M)    (14)

While the analysis of P_\theta is not the primary focus of this paper, in the absence of any other constraints or biases on model M, we conjecture that the distribution curve of \theta may resemble a uniform bell curve. Hence, we consider the following distribution approximation:

P_\theta(x) = 6(x - x^2), \quad x \in [0, 1]
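Under this density, the expectation of \theta works out to exactly one half (a one-line check of our own):

E_{P_\theta} = \int_{0}^{1} x \cdot 6(x - x^{2})\,dx = 6\left(\frac{1}{3} - \frac{1}{4}\right) = \frac{1}{2}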

At this point:

\mathcal{X}_{sp}(M) = \frac{1}{2}\left(\mathcal{X}_{essp}(M) + \mathcal{X}_{ssp}(M)\right)    (15)

Therefore, utilizing a self-supervised metric can approximate the unbiased optimal geometry defined by the structural probe:

M_G = \arg\min_M \frac{1}{2}\left(\mathcal{X}_{essp}(M) + \mathcal{X}_{ssp}(M)\right)    (16)

Moreover, M_G is an analytically tractable object, implying that the metrics induced by the tree topological probe preserve to a certain extent the two metric properties discussed in the introduction. However, there is a crucial issue that remains unresolved: can we explicitly construct maxs_W? Currently, we have not found a straightforward method, similar to the construction of pesu_W, for approximating maxs_W. However, based on the rearrangement inequality, we can construct a sequence that approximates maxs_W from pre_W. Let d_i (1 \leq i \leq L_H) denote L_H - i + 1. Then, let

xpesu(w_{a_i}) = pre(w_{d_i})    (17)

In our subsequent experiments, we approximate E(pdep_H, maxs_W) with E(pdep_H, xpesu_W).
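A minimal sketch of this approximation (names are ours; pesu_sequence is the sketch from Section 3.3): xpesu_W pairs the pre_W values with the predicted depths in the opposite order, and E then gives the two quantities underlying \mathcal{X}_{ssp} and \mathcal{X}_{essp}.

```python
def xpesu_sequence(p_dep):
    """Approximation xpesu_W of maxs_W (Eq. 17): reverse-pair pre_W with the sorted predictions."""
    L = len(p_dep)
    order = sorted(range(L), key=lambda i: p_dep[i])   # order[i] = a_{i+1}
    pesu = pesu_sequence(p_dep)
    pre = [pesu[a_i] for a_i in order]                 # recover pre_W in ascending order
    xpesu = [0] * L
    for i, a_i in enumerate(order):
        xpesu[a_i] = pre[L - 1 - i]                    # d_i = L_H - i + 1 in 1-based indexing
    return xpesu

def E(p_dep, target):
    """Metric E of Eq. 3 between a predicted depth sequence and an integer depth sequence."""
    return sum((p - t) ** 2 for p, t in zip(p_dep, target)) / len(p_dep)

# E(p_dep, pesu_sequence(p_dep)) approximates the lower-bound term used for X_ssp, and
# E(p_dep, xpesu_sequence(p_dep)) approximates the upper-bound term used for X_essp.
```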

4 Experiments

In this section, we delve into a range of experiments conducted with the tree topological probe, along with the underlying motivations behind them. To accommodate space limitations, we include many specific details of the experimental settings in Appendices C and D. We focus our experiments on BERT-large and its submodules; conducting similar experiments on other models is also straightforward (refer to Appendix F for supplementary results of experiments conducted using RoBERTa-large).

4.1 Measuring \mathcal{X}_{ssp} and \mathcal{X}_{essp} on BERT

We denote the model consisting of the input layer and the first i transformer blocks of BERT-large as M_i (0 \leq i \leq 24). Since the input of M_i consists of tokenized units, including the special tokens [CLS], [SEP], [PAD], and [MASK], we can conduct at least four types of measurement experiments (a sketch of how the corresponding vector sets can be extracted follows the list):

e1. Measurement of the vector set formed by token embeddings and special token embeddings.

e2. Measurement of the vector set formed solely by token embeddings.

e3. Measurement of the vector set formed by estimated embeddings of complete words, using both token embeddings and special token embeddings.

e4. Measurement of the vector set formed solely by estimated embeddings of complete words, using token embeddings.
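A minimal sketch of how the vector set for e1 can be extracted, assuming the HuggingFace Transformers implementation mentioned in Appendix D (the checkpoint name and helper function are our own choices):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")
model = BertModel.from_pretrained("bert-large-uncased", output_hidden_states=True)
model.eval()

def e1_vectors(sentence, i):
    """Return the vector set of M_i for e1: all token and special-token embeddings."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states[0] is the input-layer output (M_0); hidden_states[i] is the output of M_i.
    return outputs.hidden_states[i].squeeze(0)    # shape (sequence_length, 1024)
```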

Due to space constraints, we focus on discussing e1 in this paper. The measurement results are shown in Tables 1 and 2, and the precise measurement values can be found in Appendix E. Furthermore, as shown in Figure 2, we present the negative logarithm curves of three measurement values as a function of M_i.

\mathcal{X}_{ssp}    M
0.01\sim0.05    M_0\sim M_{11}
0.05\sim0.1    M_{12}\sim M_{21}
0.1\sim0.15    M_{22}\sim M_{23}
Table 1: Grouping M_i based on \mathcal{X}_{ssp}. M_l\sim M_r denotes M_l, M_{l+1}, M_{l+2}, \ldots, M_r. For example, the first row of the table indicates that the exact values of \mathcal{X}_{ssp} for M_0, M_1, M_2, \ldots, M_{11} fall within the range of 0.01 to 0.05.
\mathcal{X}_{essp}    M
0.3\sim0.4    M_3\sim M_4, M_7\sim M_{12}
0.4\sim0.5    M_{13}\sim M_{14}
0.5\sim1.0    M_1\sim M_2, M_5\sim M_6, M_{15}\sim M_{19}
1.0\sim2.0    M_{20}\sim M_{24}
\geq 4.0    M_0
Table 2: Grouping M_i based on \mathcal{X}_{essp}. Similar to the explanation in the caption of Table 1.
Figure 2: Negative logarithm of \mathcal{X}_{ssp}, \mathcal{X}_{essp}, unbiased \mathcal{X}_{sp} and true \mathcal{X}_{sp} across M_i.

By examining the experimental results presented above, we can ascertain the following findings:

f1. \mathcal{X}_{ssp} and \mathcal{X}_{essp} indeed bound the actual \mathcal{X}_{sp}, and for M_{14} to M_{18}, the true \mathcal{X}_{sp} is very close to \mathcal{X}_{ssp}.

f2. M_0 serves as a good baseline model. Furthermore, using \mathcal{X}_{essp} and the unbiased \mathcal{X}_{sp} allows for effective differentiation between embeddings generated by models consisting solely of the regular input layer and those generated by models incorporating transformer blocks.

f3. For M_1 to M_6, the true \mathcal{X}_{sp} is very close to the unbiased \mathcal{X}_{sp}.

f4. Both the curve of -log(\mathcal{X}_{essp}) and the curve of the true -log(\mathcal{X}_{sp}) follow an ascending-then-descending pattern. However, the models corresponding to their highest points are different, namely, M_8 and M_{16}, respectively.

f5. For the curve of -log(\mathcal{X}_{ssp}), the overall trend also shows an ascending-then-descending pattern, but with some fluctuations in the range of M_3 to M_6. However, the model corresponding to its highest point is consistent with that of -log(\mathcal{X}_{essp}), namely M_8.

f6. The true \mathcal{X}_{sp} does not effectively distinguish between M_0 and M_1.

Based on the above findings, we can confidently draw the following rigorous conclusions:

c1. Based on f1, we can almost infer that dep_W \in \arg\min_{x \in X_{C_dep}} \sum_{i=1}^{L_H} (pdep(h_i) - x(w_i))^2 for M_{14} to M_{18}. This implies that they memorize the preferences of the real data and minimize as much as possible to approach the theoretical boundary. Building upon f5, we can further conclude that the cost of memorizing dep_W is an increase in \mathcal{X}_{ssp}, which leads to a decrease in the accuracy of the embedding's linear encoding of tree structures.

c2. Based on f1, we can conclude that there exists a model M for which the true \mathcal{X}_{sp}(M) aligns with the \mathcal{X}_{ssp}(M) determined by C_dep. This indicates that C_dep serves as a sufficiently tight condition.

c3. Based on f3, we can infer that M_1 to M_6 may not capture the distributional information of the actual syntactic trees, so their generated embeddings consider only the most general case for the linear encoding of tree structures. This implies that the distribution curve of their \theta_W parameters is uniformly bell-shaped.

c4. Based on f2 and f6, we can conclude that the tree topological probe provides a more fine-grained evaluation of the ability to linearly encode tree structures in embedding vectors than the structural probe does.

c5. Based on f3, f4 and f5, we can conclude that in BERT-large, the embeddings generated by M_8 and its neighboring models exhibit the strongest ability to linearly encode tree structures. Moreover, these models gradually start to consider the distribution of real dependency trees, so the true \mathcal{X}_{sp}(M) approaches \mathcal{X}_{ssp}(M) until reaching M_{16}.

c6. Based on f4 and f5, we can conclude that starting from M_{16}, the embeddings generated by M_i gradually lose their ability to linearly encode tree structures. The values of \mathcal{X}_{ssp} and \mathcal{X}_{essp} for these models are generally larger than those of the models before M_{16}. However, they still retain some distributional information about the depth of dependency trees. This means that despite having a higher unbiased \mathcal{X}_{sp}, their true \mathcal{X}_{sp} is still smaller than that of the M_i before M_8.

From the above conclusions, we can further speculate about the workings of pretrained language models such as BERT, and we identify some related open problems.

Based on c5 and c6, we can speculate that the final layer of a pretrained language model needs to consider language information at various levels, but its memory capacity is limited. Therefore, it relies on preceding submodules to filter the information. The earlier submodules in the model encode the most generic (unbiased) structures present in the language features. As the model advances, the intermediate submodules start incorporating preferences for general structures based on actual data. Once a certain stage is reached, the later submodules in the model start to loosen their encoding of generic structures. However, due to the preference information passed from the intermediate submodules, the later submodules can still outperform the earlier submodules in encoding real structures, rather than generic ones.

Based on c3 and c6, it appears that true \mathcal{X}_{sp} \leq unbiased \mathcal{X}_{sp} < \mathcal{X}_{essp}. This suggests that for BERT, the unbiased \mathcal{X}_{sp} serves as a tighter upper bound for the true \mathcal{X}_{sp}, and there exists a submodule that achieves this upper bound. Now, the question arises: Is this also the case for general pretrained models? If so, what are the underlying reasons?

4.2 Using \mathcal{X}_{ssp} and \mathcal{X}_{essp} as Regularization Loss in Fine-tuning BERT

Let us denote the downstream task loss as T(M_{24}). Taking \mathcal{X}_{ssp} as an example, using \mathcal{X}_{ssp} as a regularizing loss during fine-tuning refers to replacing the task loss with:

T(M_{24}) + \lambda * \mathcal{X}_{ssp}(M_i) \quad (1 \leq i \leq 24)

where \lambda is a regularization parameter. The purpose of this approach is to explore the potential for enhancing the fine-tuning performance by improving the submodules of BERT in their ability to linearly encode tree structures. If there exists a submodule that achieves both enhancement in linear encoding capabilities and improved fine-tuning performance, it implies that the parameter space of this submodule, which has better linear encoding abilities, overlaps with the optimization space of fine-tuning. This intersection is smaller than the optimization space of direct fine-tuning, reducing susceptibility to local optima and leading to improved fine-tuning results.
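A minimal sketch of this regularized objective (names are ours; pesu_sequence and the probe matrix follow the sketches in Section 3; padding and batching details are omitted). pesu_W acts as a detached self-supervised target, so gradients flow only through the predicted depths:

```python
import torch

def regularized_loss(task_loss, hidden_states, probe_f, i, lam):
    """T(M_24) + lambda * X_ssp(M_i) for one batch, a minimal sketch."""
    H = hidden_states[i]                                   # (batch, L_H, n), output of M_i
    p_dep = (H @ probe_f.T).pow(2).sum(-1)                 # predicted depths, (batch, L_H)
    # Build the pesu_W target for each sentence; it is a fixed integer sequence, not a parameter.
    targets = torch.tensor(
        [pesu_sequence(row.tolist()) for row in p_dep],
        dtype=p_dep.dtype, device=p_dep.device,
    )
    ssp_term = ((p_dep - targets) ** 2).mean()             # E(pdep_H, pesu_W), batch average
    return task_loss + lam * ssp_term
```

Appendix D describes how \lambda is chosen dynamically after one epoch so that the second term is roughly 10% of the first.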

Conversely, if enhancing certain submodules hinders fine-tuning or even leads to its failure, it suggests that the submodule's parameter space with better linear encoding abilities does not overlap with the optimization space of fine-tuning. This indicates that the submodule has already attained the smallest \mathcal{X}_{ssp} value that greatly benefits BERT's performance.

Based on f1, we can infer that M_{14} to M_{18} are not suitable as enhanced submodules. According to c5, the submodules most likely to improve fine-tuning performance after enhancement should be near M_8. We conducted experiments on a single-sentence task called the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019), which is part of the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019).

The test results are shown in Table 3. As predicted earlier, enhancing the submodules around M_{14} to M_{18} (now expanded to M_{12} to M_{19}) proves to be detrimental to fine-tuning, resulting in failed performance. However, we did observe an improvement in fine-tuning performance for the submodule M_{10} near M_8 after enhancement. This gives us an intuition that if we have additional topological probes and metrics similar to \mathcal{X}_{ssp} and \mathcal{X}_{sp}, we can explore enhancing submodules that are in the rising phase of the true \mathcal{X}_{sp}, away from the boundary of the unbiased \mathcal{X}_{sp} and \mathcal{X}_{ssp}, in an attempt to improve fine-tuning outcomes.

Method    mean    std    max
DF    63.34    1.71    66.54
EH M_3    63.90    2.66    68.73
EH M_5    63.90    1.36    66.04
EH M_{10}    64.87    2.07    68.47
EH M_{12}\sim M_{19}    0.00    0.00    0.00
EH M_{20}    5.48    16.46    54.87
EH M_{24}    40.43    26.60    62.52
Table 3: Direct fine-tuning and submodule enhancement test scores. Here, "DF" denotes direct fine-tuning, while "EH M_i" represents fine-tuning with the enhancement of M_i based on \mathcal{X}_{ssp}. The evaluation metric used in CoLA is the Matthews correlation coefficient, where a higher value indicates better performance.

5 Conclusion

Consider a thought experiment in which there is a planet in a parallel universe called "Vzjgs" with a language called "Vmtprhs". Like "English", "Vmtprhs" comprises 26 letters as basic units, and there is a one-to-one correspondence between the letters of "Vmtprhs" and "English". Moreover, these two languages are isomorphic under letter permutation operations. In other words, sentences in "English" can be rearranged so that they are equivalent to sentences in "Vmtprhs", while preserving the same meaning. If there were models like BERT or GPT on the planet "Vzjgs", perhaps called "YVJIG" and "TLG," would the pretraining process of "YVJIG" on "Vmtprhs" be the same as BERT's pretraining on "English"?

In theory, there should be no means to differentiate between these two pretraining processes. For a blank model (without any training), extracting useful information from "Vmtprhs" and "English" would pose the same level of difficulty. However, it is true that "Vmtprhs" and "English" are distinct, with the letters of "Vmtprhs" possibly having different shapes or being in the reverse order of the "English" alphabet. Therefore, we can say that they have different letter features, although this feature seems to be a mere coincidence. In natural language, there are many such features created by historical contingencies, such as slang or grammatical exceptions. Hence, when we aim to interpret the mechanisms of these black-box models by studying how language models represent language-specific features, we must consider which features are advantageous for interpretation and what we ultimately hope to gain from this research.

This paper presents a thorough exploration of a key issue, specifically examining the articulation of internal feature constraints. By enveloping the original feature within a feature space that adheres to such constraints, it is possible to effectively eliminate any unintended or accidental components. Within this explicitly defined feature space, metrics such as \mathcal{X}_{ssp} and \mathcal{X}_{essp} can be defined. We can subsequently examine the evolution of these metrics within the model to gain a deeper understanding of the encoding strategies employed by the model for the original feature, as described in the experimental section of this paper. Once we understand the encoding strategies employed by the model, we can investigate the reasons behind their formation and the benefits they bring to the model. By conducting studies on multiple similar features, we can gain a comprehensive understanding of the inner workings of the black box.

Limitations

The main limitation of this research lies in the approximate construction of mins_W and maxs_W, which leads to the true -log(\mathcal{X}_{sp}) surpassing -log(\mathcal{X}_{ssp}) near M_{16} to some extent. However, this may also be due to their proximity, resulting in fluctuations within the training error. On the other hand, the proposed construction scheme for the topological probe discussed in this paper lacks sufficient mathematical formalization. One possible approach is to restate it using the language of category theory.

Acknowledgements

We thank the anonymous reviewers for their helpful comments and suggestions. This work was supported by National Natural Science Foundation of China (Nos. 62362015, 62062027 and U22A2099) and the project of Guangxi Key Laboratory of Trusted Software.

References

Appendix A Proof of Theorem 1

Proof.

For any sequence x \in X_{C_dep} that is in the same order as pdep_H, according to the rearrangement inequality, for any permutation \pi_x of x we have

\sum_{i=1}^{L_H} \pi_x(w_i) * pdep(h_i) \leq \sum_{i=1}^{L_H} x(w_i) * pdep(h_i)

Therefore,

\sum_{i=1}^{L_H} (\pi_x(w_i) - pdep(h_i))^2 \geq \sum_{i=1}^{L_H} (x(w_i) - pdep(h_i))^2

Since pesu_W and pdep_H are in the same order, we just need to prove that any sequence x \in X_{C_dep} that is in the same order as pdep_H satisfies

\sum_{i=1}^{L_H} (x(w_i) - pdep(h_i))^2 \geq \sum_{i=1}^{L_H} (pesu(w_i) - pdep(h_i))^2

and then the theorem is automatically established. Because

\sum_{i=1}^{L_H} (pesu(w_i) - pdep(h_i))^2 = \sum_{i=1}^{L_H} (pre(w_i) - apdep_i)^2    (18)

we can, without loss of generality, assume that pesu_W and x are ascending sequences that are not equal, and that there exists a k such that when 1 \leq i \leq k - 1,

pesu(w_i) = x(w_i)    (19)

and

pesu(w_k) \neq x(w_k)    (20)

Based on the recursion condition, we can infer that

|pesu(w_k) - x(w_k)| = 1    (21)

Combined with the value condition of bias_{k-1}, we further find that

|pesu(w_k) - apdep_k| \leq |x(w_k) - apdep_k|

The inductive hypothesis when i = m is

|pesu(w_m) - apdep_m| \leq |x(w_m) - apdep_m|

Due to the condition apdep_{m+1} - apdep_m \leq 1 and the value condition of bias_m, it still holds when i = m + 1 that

|pesu(w_{m+1}) - apdep_{m+1}| \leq |x(w_{m+1}) - apdep_{m+1}|

Thus, when i \geq k,

(x(w_i) - pdep(h_i))^2 \geq (pesu(w_i) - pdep(h_i))^2. ∎

Appendix B Analysis of Tree Depth Minimum

The minimum of pesu_W is denoted as dep_{min}. Fixing pesu_W, we let all sets (or sequences) of vectors satisfying the following conditions compose a set denoted by \Omega_{pesu_W}:

\exists P \in R^{m*n}, \forall i (i = 1, 2, \cdots, L_H):
pesu(w_i) - \sqrt{\epsilon_i} < h_i^T P^T P h_i < pesu(w_i) + \sqrt{\epsilon_i}

Here, \epsilon_i \ll pesu(w_i)^2. Without loss of generality, let pesu(w_1) be dep_{min} and pesu(w_i) \leq pesu(w_{i+1}) (i = 1, 2, \cdots, L_H - 1); the following theorem can then be obtained.

Theorem 2.

For any two different sequences pesu_W and pesu'_W, if pesu(w_1) > 0 and pesu'(w_1) > 0, there is a one-to-one mapping \phi between \Omega_{pesu_W} and \Omega_{pesu'_W}.

Proof.

We construct \phi such that

\forall H \in \Omega_{pesu_W}:
\phi(H) = (h'_1, h'_2, \cdots, h'_{L_H}) = H' \in \Omega_{pesu'_W}

Here, h'_1 = h_1, and when i = 2, 3, \cdots, L_H,

h'_i = \frac{\sqrt{pesu'(w_i) * pesu(w_1)}}{\sqrt{pesu(w_i) * pesu'(w_1)}} h_i

Since

\exists P \in R^{m*n}, \forall i (i = 1, 2, \cdots, L_H):
pesu(w_i) - \sqrt{\epsilon_i} < h_i^T P^T P h_i < pesu(w_i) + \sqrt{\epsilon_i},
\quad \epsilon_i \ll pesu(w_i)^2

let P' = \frac{\sqrt{pesu'(w_1)}}{\sqrt{pesu(w_1)}} P, and when i = 1, 2, \cdots, L_H,

\epsilon'_i = \left(\frac{pesu'(w_i)}{pesu(w_i)}\right)^2 \epsilon_i,

then

\epsilon'_i \ll \left(\frac{pesu'(w_i)}{pesu(w_i)}\right)^2 pesu(w_i)^2 = pesu'(w_i)^2

After calculation,

\forall i (i = 1, 2, \cdots, L_H):
pesu'(w_i) - \sqrt{\epsilon'_i} < (h'_i)^T (P')^T P' h'_i < pesu'(w_i) + \sqrt{\epsilon'_i}

Therefore, \phi is well defined, and \forall H_i, H_j \in \Omega_{pesu_W}, when H_i \neq H_j,

\phi(H_i) \neq \phi(H_j)

Therefore, \phi is an injective function. It is easy to prove that the inverse map \phi^{-1} of \phi is also injective and satisfies the above conditions. ∎

The proof of the theorem above does not apply to the cases where pesu(w_1) = 0 or pesu'(w_1) = 0. If dep_{min} is greater than 0, then the results of the tree topological probe do not necessarily depend on the selection of dep_{min}, and we may set it to 1. However, we have not further explored whether Theorem 2 necessarily fails in these cases. Nevertheless, we can examine the drawbacks that arise from setting dep_{min} to 0 from another perspective.

When i \geq 2, h_i is projected by P near the m-dimensional sphere with a radius of \sqrt{pesu(w_i)}:

\forall i = 1, 2, \cdots, L_H:
|h_i^T P^T P h_i - pesu(w_i)| < \epsilon_i

If dep_{min} = 0, then the topology of the geometric space composed of all vectors P h_1 satisfying |h_1^T P^T P h_1 - dep_{min}| < \epsilon_1 is homeomorphic to an m-dimensional open ball. This may result in probes exhibiting different preferences for the root and other nodes. However, if dep_{min} > 0, the topology of the geometric space is an m-dimensional annulus, which is the same for all nodes, thus avoiding the issue of preference.

Appendix C Data for Training and Evaluating Probes

To ensure the reliability and diversity of the data sources (appropriate sentences), we drew the sentences participating in the probe experiments from the training, validation and test data sets of several tasks of the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019).

We selected four small-sample text classification tasks in GLUE with reference to (Hua et al., 2021), namely, the Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019), the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett, 2005), Recognizing Textual Entailment (RTE) (Wang et al., 2019) and the Semantic Textual Similarity Benchmark (STS-B) (Cer et al., 2017), which cover the three major task types of SINGLE-SENTENCE, SIMILARITY AND PARAPHRASE and INFERENCE in GLUE. MRPC, RTE and STS-B are all two-sentence tasks, and the experiment only needs BERT to represent a single sentence; thus, we treat the two sentences belonging to the same data instance independently, without splicing them together.

After the data sets of the four tasks are processed as above, the remaining sentences are merged into a raw text data set rtd_{mix}, which contains 47,136 sentences. This is close to the size of the Penn Treebank (Marcus et al., 1993) used by the structural probe (Hewitt and Manning, 2019); short and long sentences are evenly distributed.

Appendix D Experimental Setup for Training Probes and Fine-tuning

We use the BERT implementation of Wolf et al. and set the rank of the probe matrix to be half the embedding dimension. The probe matrix is randomly initialized following a uniform distribution U(-0.05, 0.05).

We employ the AdamW optimizer with the warmup technique, where the initial learning rate is set to 2e-5 and the epsilon value is set to 1e-8. The training stops after 10 epochs. The training setup for the fine-tuning experiments is similar to that of training probes. One notable difference is the regularization coefficient \lambda, which is dynamically determined after one epoch of training, ensuring that \frac{\lambda * \mathcal{X}_{ssp}(M_i)}{T(M_{24})} \approx 0.1, without any manual tuning.
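A minimal sketch of this dynamic choice (variable names are hypothetical):

```python
# epoch_task_loss and epoch_ssp_loss are the task loss T(M_24) and the probe loss X_ssp(M_i)
# averaged over the first epoch; lambda is then fixed for the remaining epochs.
lam = 0.1 * epoch_task_loss / epoch_ssp_loss
```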

We conduct experiments on each fine-tuning method by using 10 different random seeds, and we compute the mean, the standard deviation (std), and the maximum values.

Appendix E Supplementary Chart Materials

Table 4 lists the exact measurements of \mathcal{X}_{ssp}, \mathcal{X}_{essp}, and the true \mathcal{X}_{sp} for BERT-large.

M    \mathcal{X}_{ssp}    \mathcal{X}_{essp}    \mathcal{X}_{tsp}
M_0    0.039    5.382    0.3084
M_1    0.017    0.536    0.2644
M_2    0.017    0.526    0.244
M_3    0.018    0.348    0.2016
M_4    0.033    0.351    0.1701
M_5    0.025    0.52    0.1622
M_6    0.023    0.52    0.1559
M_7    0.013    0.345    0.14
M_8    0.01    0.347    0.1424
M_9    0.011    0.352    0.1577
M_{10}    0.013    0.359    0.1415
M_{11}    0.021    0.375    0.1128
M_{12}    0.054    0.391    0.0975
M_{13}    0.076    0.42    0.0764
M_{14}    0.084    0.467    0.0651
M_{15}    0.088    0.525    0.0616
M_{16}    0.09    0.663    0.0656
M_{17}    0.086    0.785    0.0808
M_{18}    0.09    0.883    0.1155
M_{19}    0.09    0.999    0.1416
M_{20}    0.092    1.045    0.1615
M_{21}    0.094    1.447    0.2468
M_{22}    0.102    1.715    0.28634
M_{23}    0.107    1.709    0.3171
M_{24}    0.113    1.837    0.328
Table 4: Exact values of \mathcal{X}_{ssp}, \mathcal{X}_{essp}, and the true \mathcal{X}_{sp} (denoted \mathcal{X}_{tsp}) for each M_i.

Appendix F Experimental data for RoBERTa-large

Figure 3 shows the negative logarithm curves of three measurement values as a function of M_i for RoBERTa-large. Table 5 lists the exact measurements of \mathcal{X}_{ssp}, \mathcal{X}_{essp}, and the true \mathcal{X}_{sp} for RoBERTa-large.

Figure 3: Negative logarithm of \mathcal{X}_{ssp}, \mathcal{X}_{essp}, unbiased \mathcal{X}_{sp} and true \mathcal{X}_{sp} across M_i.

From the experimental data, it is evident that the overall pattern of evolution in the graphs for RoBERTa-large and BERT-large is consistent: a slight initial increase followed by a decline. However, the bounds on \mathcal{X}_{sp} in the case of RoBERTa-large are much tighter, especially in the earlier modules.

M    \mathcal{X}_{ssp}    \mathcal{X}_{essp}    \mathcal{X}_{tsp}
M_0    0.008    3.532    0.991
M_1    0.145    0.515    0.243
M_2    0.137    0.469    0.446
M_3    0.139    0.470    0.331
M_4    0.131    0.493    0.257
M_5    0.132    0.500    0.199
M_6    0.123    0.494    0.153
M_7    0.117    0.491    0.110
M_8    0.117    0.542    0.109
M_9    0.114    0.491    0.087
M_{10}    0.113    0.555    0.091
M_{11}    0.113    0.567    0.091
M_{12}    0.110    0.598    0.094
M_{13}    0.114    0.667    0.101
M_{14}    0.113    0.675    0.102
M_{15}    0.124    0.738    0.120
M_{16}    0.133    0.797    0.136
M_{17}    0.136    0.789    0.131
M_{18}    0.137    0.807    0.136
M_{19}    0.136    0.831    0.135
M_{20}    0.136    0.880    0.145
M_{21}    0.150    1.086    0.169
M_{22}    0.153    1.277    0.176
M_{23}    0.190    2.006    0.197
M_{24}    0.117    0.711    0.201
Table 5: Exact values of \mathcal{X}_{ssp}, \mathcal{X}_{essp}, and the true \mathcal{X}_{sp} (denoted \mathcal{X}_{tsp}) for each M_i.