
A Holistic Approach to Predicting Top Quark Kinematic
Properties with the Covariant Particle Transformer

Shikai Qiu [email protected] Department of Physics, University of California, Berkeley, Berkeley, CA 94720, USA    Shuo Han [email protected] Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA    Xiangyang Ju [email protected] Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA    Benjamin Nachman [email protected] Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA Berkeley Institute for Data Science, University of California, Berkeley, CA 94720, USA    Haichen Wang [email protected] Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA Department of Physics, University of California, Berkeley, Berkeley, CA 94720, USA
Abstract

Precise reconstruction of top quark properties is a challenging task at the Large Hadron Collider due to combinatorial backgrounds and missing information. We introduce a physics-informed neural network architecture called the Covariant Particle Transformer (CPT) for directly predicting the top quark kinematic properties from reconstructed final state objects. This approach is permutation invariant and partially Lorentz covariant and can account for a variable number of input objects. In contrast to previous machine learning-based reconstruction methods, CPT is able to predict top quark four-momenta regardless of the jet multiplicity in the event. Using simulations, we show that the CPT performs favorably compared with other machine learning top quark reconstruction approaches. We make our code available at https://github.com/shikaiqiu/Covariant-Particle-Transformer.

I Introduction

For the Large Hadron Collider (LHC) experiments, the kinematic reconstruction of top quarks is critical to many precision tests of the Standard Model (SM) as well as direct searches for physics beyond the SM. Once produced, the top quark decays to a bottom quark (b-quark) and a W boson, with a branching ratio close to 100% [1]. Subsequently, the W boson decays into a lepton or quark pair. In the final state, quarks originating from top quark decays and other colored partons hadronize, resulting in collimated sprays of hadrons, known as jets. Conventional top quark reconstruction methods assume that a hadronically decaying top quark produces three jets in the final state. These methods are therefore tuned to identify triplets of jets, which are considered proxies for the three quarks originating directly from the top quark and W boson decays. The estimated top quark four-momentum is computed as the sum of the measured four-momenta of the jets in the triplet. Essentially, top quark reconstruction is treated as a combinatorial problem of sorting jets, and most methods use jet kinematic and flavor-tagging information to construct likelihood-based [2] or machine learning-based [3, 4, 5, 6, 7, 8, 9, 10] metrics to identify triplets of jets as proxies for top quarks and similar particles.

While the conventional top quark reconstruction approaches have been implemented in a variety of forms and extensively used at hadron collider experiments, they have fundamental shortcomings. The one-to-one correspondence between a parton (quark or gluon) and a jet, assumed by the conventional approaches, is only an approximation. Partons carry color charge, but jets consist only of colorless hadrons, so multiple partons necessarily contribute to the formation of any single jet. Conversely, a single parton may contribute to the formation of multiple jets, particularly when the parton is highly energetic. In addition, triplet-based top quark reconstruction requires the presence of a certain number of jets in the final state. This jet multiplicity requirement can be inefficient because of kinematic thresholds, limited detector coverage, or the merging of highly collimated parton showers.

In this paper, we propose a new machine learning-enabled approach to determine top quark properties through a holistic processing of the event final state. Our goal is to predict the top quark four-momenta in a collision event with a given number of top quarks. The number of top quarks can itself be learned from the final state, or it can be posited for a given hypothesis. As discussed above, the kinematic information of a top quark is not localized in a triplet of jets; rather, it is carried collectively by all particles in the event. This motivates using the particle identification (ID) and kinematic information of all detectable final state particles as input for determining the top quark four-momenta. Specifically, the four-momenta and ID of all detectable final state particles are fed to a deep neural network regression model, which is constructed and trained to predict the four-momenta of a given number of top quarks. This approach offers four major advantages compared to conventional approaches. First, we no longer deal with the conceptually ill-defined jet-triplet identification process. Second, we can account for noisy or missing observations due to limited acceptance, detector inefficiency, and resolution, as the regression model can learn such effects from Monte Carlo (MC) simulations. Third, the holistic processing of the event final state offers a unified approach to determining the top quark properties for both hadronic and semi-leptonic top quark decays, which may simplify analysis workflows. Finally, our approach has a runtime polynomial in the number of final state objects, as opposed to super-exponential for standard reconstruction-based approaches, which need to consider all possible permutations; this makes ours the first tractable method for processes with high-multiplicity final states such as t\bar{t}t\bar{t}.

To realize the holistic approach of top quark property determination, we propose a physics-informed transformer [11] architecture termed Covariant Particle Transformer (CPT). CPT takes as input properties of the final state objects in a collision event and outputs predictions for the top quark kinematic properties. Like other recent top reconstruction proposals [7, 9, 8], CPT is permutation invariant under exchange of the inputs. A novel attention mechanism [11, 12], referred to as covariant attention, is designed to learn the predicted kinematic properties as a function of the set of final state objects as a whole, and guarantees that the predictions transform covariantly under rotation and/or boosts of the event along the beamline. While not fully Lorentz-covariant like Ref. [13], our approach captures the most important covariances relevant to hadron collider physics with minimal computational overhead and enjoys a much simpler implementation, which allows it to be easily adopted for a broad range of tasks in collider physics.

This paper is organized as follows. Section II introduces the construction and properties of CPT. Synthetic datasets used for demonstrating the performance of CPT are introduced in Sec. III. Numerical results illustrating the performance of CPT are presented in Sec. IV. In Sec. V, we explore which aspects of CPT give rise to its performance. The paper ends with conclusions and outlook in Sec. VI.

II Covariant Particle Transformer

II.1 Symmetries and covariance

At the LHC, the beamline determines a special direction and reduces the relevant symmetry group of collision events from the proper orthochronous Lorentz group \mathrm{SO}^{+}(1,3) to \mathrm{SO}(2)\times\mathrm{SO}^{+}(1,1), which contains products of azimuthal rotations and longitudinal boosts along the beamline. The Covariant Particle Transformer extends the original transformer architecture to properly account for these symmetry transformations, by ensuring that if the four-momenta of all final state objects undergo such a transformation, the resulting prediction of the top quark four-momenta undergoes the same transformation. At its core, this is achieved through the novel covariant attention mechanism, which modifies the standard attention mechanism to ensure that all intermediate learned features have well-defined transformation properties.
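For concreteness, the following minimal sketch (not part of the paper's implementation) spells out how an element of this symmetry group acts on a set of final state objects in the (p_{\mathrm{T}}, y, \cos\phi, \sin\phi, m) representation used in Sec. II.2; the model mentioned in the closing comment is a hypothetical CPT-like predictor, introduced only to state the covariance condition.

import numpy as np

def transform(particles, delta_y, delta_phi):
    # particles: array of shape (n, 5) with columns (pT, y, cos_phi, sin_phi, m).
    # pT and m are invariant; y shifts additively under a longitudinal boost;
    # the azimuthal unit vector rotates by delta_phi.
    out = particles.copy()
    out[:, 1] = particles[:, 1] + delta_y
    c, s = np.cos(delta_phi), np.sin(delta_phi)
    out[:, 2] = c * particles[:, 2] - s * particles[:, 3]
    out[:, 3] = s * particles[:, 2] + c * particles[:, 3]
    return out

# Covariance of the network means model(transform(x, dy, dphi)) equals
# transform(model(x), dy, dphi) for any dy and dphi, where model is a
# hypothetical CPT-like predictor returning the same representation for
# each predicted top quark.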

Covariance (called equivariance in the machine learning literature) under rotations and boosts [13, 14] and under input permutations [15] has been studied in a variety of recent High Energy Physics (HEP) papers. A number of additional studies have explored permutation invariant architectures [16, 17, 18, 19, 20] (see also other graph network approaches [21, 22]). Compared to prior works in this direction, we make the following important contributions:

  • We develop the first transformer architecture that enforces Lorentz covariance. Transformers are a powerful class of neural networks that have revolutionized many areas of machine learning applications, such as natural language processing [11, 23], computer vision [24], and recently protein folding [25]. By integrating the transformer architecture with Lorentz covariance, CPT combines the current state-of-the-art of machine learning with physics-specific knowledge to become a powerful tool for applications in collider physics, as we will illustrate in this work.

  • We develop a simple, efficient, and effective way of achieving partial Lorentz covariance. While previous works have developed Lorentz covariant neural networks using customized architectures, they incur significant computational overhead compared to a standard neural network due to computations of continuous group convolutions [14] or irreducible representations of the Lorentz group [13]. By contrast, CPT only requires a simple modification to the standard attention mechanism with minimal computational overhead.

  • We are the first to demonstrate the benefit of using a Lorentz covariant architecture for regression problems where the targets are the four-momenta of particles. Previous works on Lorentz covariant neural networks were evaluated only on classification problems such as jet tagging, where the Lorentz group acts trivially (i.e., as the identity) on the targets. In that setting, Lorentz symmetry plays a less significant role, since the neural network only needs to be Lorentz invariant rather than covariant.

II.2 Architecture

The Covariant Particle Transformer consists of an encoder and a decoder. To ensure permutation invariance, we remove the positional encoding [11] used in the original transformer encoder. The encoder produces learned features of the final state objects, which include jets, photons, electrons, muons, and missing transverse energy (E_{\mathrm{T}}^{\text{miss}}, implemented as a massless particle with zero longitudinal momentum component).

Each object is represented by its transverse momentum p_{\mathrm{T}}, rapidity y, azimuthal angle \phi expressed as a unit vector (\cos\phi,\sin\phi) to avoid modular-arithmetic ambiguities, mass m, and particle identification (ID). The encoder uses six covariant self-attention layers to update the feature vectors of the final state objects. The decoder uses 12 covariant attention layers to produce learned features of the top quarks. Six of these layers use self-attention, which updates the feature vector of each top quark as a function of itself and the feature vectors of the other top quarks, and the other six layers use cross-attention, which updates the feature vector of each top quark as a function of itself and the feature vectors of the final state objects. Finally, the feature vectors of the top quarks are converted to predicted physics variables, namely the top quark four-momenta expressed in transverse momentum p_{\mathrm{T}}, rapidity y, azimuthal angle unit vector, and mass m. Figure 1 illustrates the architecture of the Covariant Particle Transformer. Detailed descriptions of the input featurization, the CPT architecture, and the covariant attention mechanism are provided in Appendix A.
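As an illustration of the input featurization described above, a minimal sketch is given below. The particle-type vocabulary and its ordering are assumptions made here for the example; the paper only specifies that the ID is encoded as part of the invariant features.

import numpy as np

# Particle-type vocabulary assumed for illustration; the paper lists jets,
# photons, electrons, muons, and missing transverse energy, and additionally
# labels b-jets, but the exact encoding and ordering are not specified here.
PARTICLE_TYPES = ["jet", "b-jet", "photon", "electron", "muon", "met"]

def featurize(pt, y, phi, mass, ptype):
    # Per-object input feature vector described in the text:
    # (pT, y, cos(phi), sin(phi), m) followed by a one-hot particle-ID block.
    one_hot = np.zeros(len(PARTICLE_TYPES))
    one_hot[PARTICLE_TYPES.index(ptype)] = 1.0
    return np.concatenate(([pt, y, np.cos(phi), np.sin(phi), mass], one_hot))

# Example: a 50 GeV b-tagged jet at y = 0.3, phi = 1.2, with mass 8 GeV.
x = featurize(50.0, 0.3, 1.2, 8.0, "b-jet")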

(a) CPT  (b) Encoder  (c) Decoder
Figure 1: An illustration of the Covariant Particle Transformer (CPT) architecture. The encoder consists of six covariant self-attention layers, while the decoder consists of six covariant cross-attention layers and six covariant self-attention layers interleaved.

II.3 Loss function

The model is trained to minimize a supervised learning objective that measures the distance between the true and predicted values of the target variables (note that learning the true value from reconstructed quantities introduces a prior dependence [26]; this is true for nearly all regression approaches in HEP). Auxiliary losses are included to stabilize the training. We provide a detailed description of the loss function in Appendix A.6.

III Datasets

We use MadGraph@NLO (v2.3.7) [27] to generate pp collision events at next-to-leading order (NLO) in QCD. The decays of top quarks and W bosons are performed by MadSpin [28]. We generate 9.2 million t\bar{t}H events, 5.4 million t\bar{t}t\bar{t} events, 1.3 million t\bar{t} events, 1.3 million t\bar{t}W events, and 1 million t\bar{t}H events with a CP-odd top-Yukawa coupling (t\bar{t}H_{\text{CP-odd}}). In our generation, Higgs bosons decay through the diphoton channel for simplicity, and all other particles such as top quarks and W bosons decay inclusively. The Higgs Characterization model [29] is used to generate the t\bar{t}H_{\text{CP-odd}} events. The generated events are interfaced with Pythia 8.235 [30] for parton showering. We do not emulate detector effects, as the salient features of the problem are already present after parton shower and hadronization. The generated hadrons are used to construct anti-k_{t} [31] R=0.4 jets using FastJet 3.3.2 [32, 33].

Jets are required to have |y|\leq 2.5 and p_{\mathrm{T}}\geq 25 GeV, while leptons are required to have |y|\leq 2.5 and p_{\mathrm{T}}\geq 10 GeV. A jet is removed if its distance in \Delta R to a photon or a lepton is less than 0.4 (\Delta R is defined as \sqrt{\Delta y^{2}+\Delta\phi^{2}}, where \Delta y and \Delta\phi are the differences in rapidity and azimuthal angle between the two particles). Jets that are \Delta R matched to b-quarks at the parton level are labeled as b-jets; this label is removed randomly for 30% of the b-jets to mimic the inefficiency of realistic b-tagging [34, 35]. We further apply a preselection on the testing set of N_{\mathrm{b\text{-}jet}}>0 and ((N_{\mathrm{jet}}\geq 3 and N_{\mathrm{lepton}}=0) or N_{\mathrm{lepton}}>0), to mimic realistic data analysis requirements. The t\bar{t}H and t\bar{t}t\bar{t} samples are each divided into training, validation, and testing sets with a 75%:12.5%:12.5% split. The other samples (t\bar{t}, t\bar{t}W, and t\bar{t}H_{\text{CP-odd}}) are used only for testing. While a single model could be trained on a mixture of processes such as t\bar{t}H and t\bar{t}t\bar{t} for greater generality, we leave this exciting direction to future work.
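A minimal sketch of the object selection and event preselection described above is given below; the dictionary-based object format is an illustrative convention, not the data format used in the paper.

import math

def delta_r(y1, phi1, y2, phi2):
    # Delta R = sqrt(dy^2 + dphi^2), with the azimuthal difference wrapped to [-pi, pi].
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(y1 - y2, dphi)

def select_jets(jets, photons_and_leptons):
    # Jet selection from the text: |y| <= 2.5, pT >= 25 GeV, and removal of any
    # jet within Delta R < 0.4 of a photon or lepton.  Objects are assumed to be
    # dicts with "pt", "y", "phi" keys.
    selected = []
    for jet in jets:
        if abs(jet["y"]) > 2.5 or jet["pt"] < 25.0:
            continue
        if any(delta_r(jet["y"], jet["phi"], o["y"], o["phi"]) < 0.4
               for o in photons_and_leptons):
            continue
        selected.append(jet)
    return selected

def passes_preselection(n_bjets, n_jets, n_leptons):
    # Event preselection: at least one b-jet, and either (>= 3 jets with no
    # lepton) or at least one lepton.
    return n_bjets > 0 and ((n_jets >= 3 and n_leptons == 0) or n_leptons > 0)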

As we compare the performance of CPT to that of a conventional approach, we refer to top quarks that can be matched to a triplet of jets as “truth-matched” and those that cannot as “unmatched”. Specifically, a top quark is considered “truth-matched” if it decays hadronically and each of the three quarks originating from its decay is matched (\Delta R<0.4) to exactly one jet. According to this definition, semi-leptonically decaying tops are always unmatched, motivated by the fact that the neutrino cannot be detected directly (at best its kinematics, such as p_{\mathrm{T}}, can be estimated). The vast majority (e.g., 76% for t\bar{t}H) of tops are unmatched and therefore cannot be fully reconstructed, due to incomplete information about their decay products. For events passing the preselection, the fraction of hadronically decaying top quarks that can be truth-matched is 36% for t\bar{t}H, 37% for t\bar{t}, 38% for t\bar{t}W, and 38% for t\bar{t}t\bar{t}.
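The truth-matching criterion can be made concrete with the following sketch (an illustration under stated assumptions, not the paper's code); inputs are assumed to be lists of (y, \phi) pairs for the parton-level decay quarks and for the jets.

import math

def _delta_r(y1, phi1, y2, phi2):
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(y1 - y2, dphi)

def is_truth_matched(decay_quarks, jets, dr_max=0.4):
    # A top is "truth-matched" if it decays hadronically (three decay quarks)
    # and each quark is matched (Delta R < 0.4) to exactly one jet, with no jet
    # shared between quarks.
    if len(decay_quarks) != 3:
        return False  # semi-leptonically decaying tops are always unmatched
    used = set()
    for qy, qphi in decay_quarks:
        matches = [i for i, (jy, jphi) in enumerate(jets)
                   if i not in used and _delta_r(qy, qphi, jy, jphi) < dr_max]
        if len(matches) != 1:
            return False
        used.add(matches[0])
    return True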

IV Performance

We study three different performance aspects of CPT. First, we evaluate the resolution of the predictions of the individual top quark kinematic variables. Second, we compare the correlations among the predicted variables to those among the true top quark properties. Finally, we assess the process dependence of CPT by applying the model trained on t\bar{t}H events to alternative processes. We study these metrics inclusively for events passing the preselection, and we also break down the performance into top quarks for which a matching triplet of jets can be identified using truth information and top quarks for which no matching triplet can be identified. For the former case, we also compare the CPT prediction with the calculation from a triplet-based reconstruction method. The latter scenario corresponds to the case where the conventional triplet-based reconstruction method does not apply.

Figure 2: Top row: distributions of the truth and predicted top quark four-momentum components p_{\mathrm{T}}, y, and \phi in the t\bar{t}H sample. Bottom row: distributions of the dimensionless errors \Delta p_{\mathrm{T}}/p_{\mathrm{T}}, \Delta y, and \Delta\phi, where \Delta denotes prediction minus truth. The area under each histogram is normalized to unity. As expected, CPT's performance is worse for unmatched tops due to incomplete information. Over all tops (truth-matched and unmatched) in the test set, the median values of \Delta p_{\mathrm{T}}/p_{\mathrm{T}}, \Delta y, and \Delta\phi are -0.02, 0.002, and -0.002, showing that there is no significant bias in CPT's prediction.

Resolution: Figure 2 shows the predicted and truth distributions of p_{\mathrm{T}}, y, and \phi for the top quarks in the t\bar{t}H sample. To quantify the resolution, we calculate the width of \Delta p_{\mathrm{T}}/p_{\mathrm{T,truth}}, \Delta y, and \Delta\phi, the model's prediction errors for the three variables (relative error for p_{\mathrm{T}}). The width is quantified as half of the 68% inter-quantile range, which corresponds to one standard deviation in the Gaussian case. The top quark mass is part of the four-momentum prediction, but we do not show it here as its distribution is nearly a delta function. Since the model predicts the four-momenta of two top quarks, the predicted top quarks are matched to the truth top quarks in the resolution calculation so as to minimize the sum of \Delta R over all matched pairs. Table 1 summarizes the prediction resolutions for all top quarks in the predicted t\bar{t}H events, separated into “truth-matched” and “unmatched” top quarks. As expected, CPT's performance is worse for unmatched tops due to incomplete information. Over all tops (truth-matched and unmatched) in the test set, the median values of \Delta p_{\mathrm{T}}/p_{\mathrm{T}}, \Delta y, and \Delta\phi are -0.02, 0.002, and -0.002, showing that there is no significant statistical bias in CPT's prediction.
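For reference, a minimal sketch of the width measure used throughout the paper (half of the 68% inter-quantile range) is given below; the variable names are illustrative.

import numpy as np

def resolution(errors):
    # Half of the 68% inter-quantile range of the prediction errors,
    # which equals one standard deviation for a Gaussian distribution.
    q_low, q_high = np.quantile(errors, [0.16, 0.84])
    return 0.5 * (q_high - q_low)

# Example usage on the dimensionless errors of Fig. 2, e.g.
# resolution(delta_pt / pt_truth), resolution(delta_y), resolution(delta_phi).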

Relative performance: The model prediction resolutions are compared to the intrinsic resolutions of reconstructing top quarks from jet triplets. The intrinsic resolutions are calculated from truth-matched triplets of jets, where the four-momentum of the truth-matched jet triplet is taken as the prediction. In this case, the resolution arises from the effects of quark hadronization and jet reconstruction. For truth-matched top quarks, the ratio of the CPT prediction resolution to the intrinsic resolution is 1.5 for p_{\mathrm{T}}, 2.3 for the rapidity y, and 2.0 for the azimuthal angle \phi.

To compare CPT with a strong baseline, we also evaluate a triplet-based reconstruction method in which a neural network is trained to identify the triplet associated with each top quark. The baseline resolutions have prediction-to-intrinsic ratios of 2.2 for p_{\mathrm{T}}, 2.8 for y, and 3.1 for \phi. Therefore, even when evaluated on truth-matched top quarks, CPT achieves significantly better resolution than the triplet-based method. The comparison is visualized in Figure 3. Details on the baseline implementation are available in Appendix B.

Figure 3: Resolution (smaller is better) achieved by CPT and by the triplet-based reconstruction (baseline), normalized by the intrinsic resolution arising from the effects of quark hadronization and jet reconstruction, evaluated on truth-matched tops in t\bar{t}H events. CPT achieves significantly better resolution than the reconstruction-based approach.

In the preselected t\bar{t}H events, 76% of the top quarks are unmatched: 43% of all tops decay hadronically (out of the 67% that do so) but do not have a matching triplet, and 33% of all tops decay semi-leptonically. For these unmatched top quarks, CPT achieves a prediction-to-intrinsic resolution ratio of 2.5 for p_{\mathrm{T}}, 6.5 for y, and 3.6 for \phi. Due to incomplete information about the tops' decay products, CPT's performance degrades as expected for unmatched top quarks, though the absolute resolutions remain below 30%. Note that these top quarks cannot otherwise be fully reconstructed by reconstruction-based alternatives, owing to the incomplete information about their decay products. While there exist procedures to approximately recover some of the missing information, such as the neutrino kinematics, combining these additional estimators with a reconstruction-based method to handle unmatched tops introduces additional complexity and sources of error, and it is highly unlikely that the resulting approach would outperform a regression model.

Table 1: Summary of the resolutions of the top quark four-momentum components in various scenarios for the t\bar{t}H, t\bar{t}, and t\bar{t}W processes.

                              \sigma_{p_{\mathrm{T}}}   \sigma_{y}   \sigma_{\phi}
t\bar{t}H    Intrinsic                0.10                 0.04          0.07
             Truth-matched            0.15                 0.09          0.14
             Unmatched                0.27                 0.25          0.26
t\bar{t}     Intrinsic                0.11                 0.04          0.09
             Truth-matched            0.19                 0.11          0.20
             Unmatched                0.31                 0.32          0.37
t\bar{t}W    Intrinsic                0.12                 0.04          0.08
             Truth-matched            0.27                 0.15          0.28
             Unmatched                0.45                 0.36          0.50

Correlation: Among the six variables of interest, only three pairs of variables have a linear correlation beyond 5% in the truth sample. These correlations are 74\% for (p_{\mathrm{T},1}, p_{\mathrm{T},2}), 50\% for (y_{1}, y_{2}), and -31\% for (\phi_{1}, \phi_{2}). The corresponding correlations in the Covariant Particle Transformer predictions are 75\% for (p_{\mathrm{T},1}, p_{\mathrm{T},2}), 43\% for (y_{1}, y_{2}), and -34\% for (\phi_{1}, \phi_{2}). The correlation between the top quarks is thus well reproduced in CPT's predictions.

Process dependence: We assess the process dependence of CPT by applying the model trained on t\bar{t}H to t\bar{t}W, t\bar{t}, and t\bar{t}H_{\text{CP-odd}} events. Table 1 compares the intrinsic and prediction resolutions for the t\bar{t}H, t\bar{t}W, and t\bar{t} processes. CPT trained exclusively on the t\bar{t}H sample can be applied without any retraining to yield a similar level of performance on t\bar{t} events. This level of generalization is not trivial, since the two processes induce different statistics in the final state objects and top quarks. The t\bar{t}W events constitute a much more challenging test set, since additional jets, leptons, and neutrinos produced in the W decay introduce more complex correlations among the objects that are not present in CPT's training set. Consequently, CPT yields larger resolutions on the t\bar{t}W test set. The process dependence can be mitigated by a number of strategies, such as training CPT with a more representative sample or possibly active decorrelation strategies [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], which we defer to future studies. Figure 4 shows distributions of system-level observables constructed from the individual top quark four-momenta for the t\bar{t}H and t\bar{t}H_{\text{CP-odd}} samples. Reasonable agreement between the predictions and the ground truth is observed for these observables, indicating that CPT captures the subtle kinematic differences between the two processes and reproduces the correlations in the four-momenta of the two top quarks. The agreement can be improved by applying a preselection such as requiring at least one truth-matched top. Importantly, although the model prediction is not perfect, the separation between t\bar{t}H and t\bar{t}H_{\text{CP-odd}} events is preserved by the CPT predictions, showing the promise of applying CPT to produce discriminating kinematic variables.

Figure 4: Predicted and truth distributions of the system-level observables |\Delta y| (top) and m_{t\bar{t}H} (bottom) in the t\bar{t}H sample (orange) and the t\bar{t}H_{\text{CP-odd}} sample (blue). |\Delta y| is the absolute difference between the rapidities of the two tops, and m_{t\bar{t}H} is the invariant mass of the t\bar{t}H system, where the Higgs four-momentum is taken to be its ground-truth value. The area under each histogram is normalized to unity. As CPT is not trained on the t\bar{t}H_{\text{CP-odd}} sample, its predictions for t\bar{t}H_{\text{CP-odd}} events are worse, as expected.

High-multiplicity final state: CPT can predict the four-momenta of an arbitrary (fixed) number of top quarks in a collision event. We test this ability in the extreme case at the LHC in which four top quarks are produced in the same event. We configure CPT to predict the four-momenta of four top quarks and train it with the t\bar{t}t\bar{t} sample described in Section III. Table 2 shows the intrinsic and prediction resolutions for this test. Compared to the prediction for the t\bar{t}H sample, the prediction for t\bar{t}t\bar{t} is worse. However, the intrinsic resolution in the t\bar{t}t\bar{t} sample is also worse than that in the t\bar{t}H sample, suggesting that the top quarks in t\bar{t}t\bar{t} events are inherently more complex and challenging to reconstruct. We expect the gap between the intrinsic resolution and CPT's resolution to shrink with further architectural improvements and more training data. We stress that the exploding combinatorics in t\bar{t}t\bar{t} events render reconstruction-based methods prohibitively expensive in this setting, whereas CPT can be applied without any modification. To predict the top quark kinematics from N jets, a standard reconstruction-based method has a super-exponential computational complexity of O(N!), the number of possible permutations of N objects, while CPT has only a polynomial complexity of O(N^{2}), since the attention mechanism involves only pairwise interactions among the objects.

Table 2: Summary of the resolutions of the top quark four-momentum components in various scenarios in the t\bar{t}t\bar{t} sample.

                   \sigma_{p_{\mathrm{T}}}   \sigma_{y}   \sigma_{\phi}
Intrinsic                  0.19                 0.05          0.09
Truth-matched              0.29                 0.16          0.24
Unmatched                  0.42                 0.32          0.36

V Ablation studies

We demonstrate the effects of removing important components of CPT to show how they contribute to the final performance. All comparisons are done on the t\bar{t}H dataset. Resolutions are reported on all top quarks passing the preselection, regardless of truth-matching status.

Attention mechanism: The attention mechanism is an important part of the model, as it allows the model to selectively focus on a subset of the final state objects when determining the four-momentum of each top quark. We demonstrate its benefit by training an otherwise identical model with all attention weights set to the constant \frac{1}{N_{\mathrm{in}}}, where N_{\mathrm{in}} is the number of final state objects in the event. Comparisons between the resolutions achieved by this model and by the nominal model are shown in Table 3. The model with uniform attention achieves worse resolutions, which demonstrates the benefit of the attention mechanism.

Table 3: Comparison of the resolutions of the top quark four-momentum components in the t\bar{t}H sample achieved by CPT and by its variant with uniform attention weights over the final state objects.

                              \sigma_{p_{\mathrm{T}}}   \sigma_{y}   \sigma_{\phi}
CPT                                  0.24                 0.21          0.23
CPT (uniform attention)              0.27                 0.23          0.28

Covariant attention: CPT employs a covariant attention mechanism to exploit the symmetries in collision data. When the covariant attention is replaced by a regular attention mechanism, which does not guarantee covariance, we observe an increasing degradation in performance as the training sample becomes smaller. Figure 5 compares the resolutions achieved by CPT and by its variant using regular attention, as a function of the number of training events. For example, the increase in p_{\mathrm{T}} resolution can be as large as 16% when only 0.1% of the events in the nominal training sample are used. This shows that the covariant attention makes CPT more data-efficient and yields more accurate predictions in the low-data regime compared to non-covariant models.

Figure 5: Resolution in the t\bar{t}H sample achieved using the covariant attention and the non-covariant attention. The covariant attention offers a clear benefit, particularly in the low-data regime.

Alternative architectures: Finally, we compare with two alternative permutation-invariant architectures, Graph Convolutional Networks (GCNs) [50] and DeepSets [51]. Applied to this task, GCNs use graph convolutions to process the final state objects represented as a complete graph, while DeepSets uses a fully connected neural network encoder to learn the feature vector of each final state object individually. In both cases, the feature vectors of all final state objects are then summed and fed into a fully connected neural network that predicts the top quark four-momenta. The Covariant Particle Transformer differs from these two architectures mainly by utilizing an attention mechanism, implementing partial Lorentz covariance, and using a decoder module. We use six graph convolutional layers for the GCN, six encoder layers for DeepSets, and a feature dimension of 128 for both. A comparison of the resolutions is shown in Table 4. CPT significantly outperforms the other two methods, showing its effectiveness on this task. We did not perform extensive hyperparameter optimization for any of the three architectures; however, given the magnitude of the observed differences, we hypothesize that the performance ordering would persist after such an optimization. We defer this study to future work.

Table 4: Comparison of the resolutions of the top quark four-momentum components in the t\bar{t}H sample achieved by CPT, GCN, and DeepSets.

              \sigma_{p_{\mathrm{T}}}   \sigma_{y}   \sigma_{\phi}
CPT                  0.24                 0.21          0.23
GCN                  0.38                 0.35          0.42
DeepSets             0.36                 0.32          0.36

VI Conclusion

In this paper, we propose a new machine learning-enabled approach to determining top quark kinematic properties by processing the full event information holistically. Our approach offers four major advantages compared to conventional approaches. First, we no longer deal with the conceptually ill-defined jet-triplet identification process. Second, we can account for noisy or missing observations due to limited detector acceptance, inefficiency, and resolution, as the regression model can learn such effects from simulations. Third, the holistic processing of the event final state offers a unified approach to determining the top quark properties for both hadronic and semi-leptonic top quark decays, which simplifies the analysis workflow. Finally, our approach has a runtime polynomial in the number of final state objects, as opposed to super-exponential for reconstruction-based approaches, which need to consider all possible permutations; this makes ours the first tractable method for processes with high-multiplicity final states such as t\bar{t}t\bar{t}.

To realize this holistic approach to predicting top quark kinematic properties, we propose the Covariant Particle Transformer (CPT). CPT takes as input the properties of the final state objects in a collision event and outputs predictions for the top quark kinematic properties. Using a novel covariant attention mechanism, the CPT prediction is invariant under permutations of the inputs and covariant under rotations and/or boosts of the event along the beamline. CPT can recover the 76% (75%) of the top quarks produced in t\bar{t}H (t\bar{t}t\bar{t}) events that cannot be truth-matched to a jet triplet and thus cannot be fully reconstructed by conventional methods. For t\bar{t}H events, CPT achieves a resolution close to the intrinsic resolution of the jet triplet and outperforms a carefully tuned triplet-based top reconstruction method on top quarks that can be matched to a jet triplet. In addition, we demonstrate that CPT can generalize to top production processes not seen during training, though its performance degrades as the test process becomes more complex and distinct from the training process. Finally, we demonstrate that by building Lorentz covariance into CPT, it achieves higher data efficiency and outperforms the non-covariant alternative when the training set is small.

In the future, it may be possible to improve and extend CPT. CPT training uses simulation to learn to invert parton shower and hadronization (and in the future, detector effects). Training strategies that rely less on parton shower and hadronization simulations like those in Ref. [52] may be able to improve the robustness of CPT. Furthermore, as a direct regression approach, CPT is prior dependent. A variety of domain adaptation and other strategies may be able to further improve the resilience of CPT. It may also be possible to include lower-level, higher-dimensional inputs directly into CPT instead of first clustering jets.

As it uses a generic representation of collision events as sets of particles, CPT can be directly applied to predict the kinematic properties of other heavy decaying particles, such as the W, Z, and Higgs bosons, and of potential heavy particles beyond the SM. The predicted kinematics of these heavy decaying particles can be used to construct discriminating variables for searches or observables for differential cross-section measurements. The ability to predict the properties of heavy decaying particles through a holistic analysis of the collision event can enable measurements that would otherwise suffer extreme inefficiencies with conventional reconstruction methods.

Note added: While this paper was being finalized, we became aware of Ref. [53], which proposes another Lorentz equivariant architecture. In contrast to that paper, we integrate Lorentz covariance with the Transformer, a state-of-the-art neural network architecture that revolutionized many areas of machine learning applications such as natural language processing, computer vision, and protein folding. We have also considered a completely different application: namely regression instead of classification, where the Lorentz group acts non-trivially (not an identity) on the target variables.

Acknowledgements

This work is supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-05CH11231. H.W.’s work is partly supported by the U.S. National Science Foundation under the Award No. 2046280.

Appendix A CPT Implementation

A.1 Attention mechanism

Attention mechanisms are a way to update a set of n feature vectors \{x_{i}\}_{i=1}^{n} given a context \{c_{j}\}_{j=1}^{m}. Learnable query, key, and value matrices \{W_{Q},W_{K},W_{V}\} are used to generate d-dimensional query, key, and value vectors \{q_{i}\}_{i=1}^{n}, \{k_{j}\}_{j=1}^{m}, and \{v_{j}\}_{j=1}^{m} via

q_{i} = W_{Q}x_{i}, (1)
k_{j} = W_{K}c_{j}, (2)
v_{j} = W_{V}c_{j}. (3)

The inner product between q_{i} and k_{j} is used to compute the attention weights \alpha_{ij} through

\alpha_{ij} = \frac{\exp(q_{i}^{\top}k_{j}/\sqrt{d})}{\sum_{j}\exp(q_{i}^{\top}k_{j}/\sqrt{d})}, (4)

where \sqrt{d} is a normalization factor. A weighted sum of the value vectors is then used to compute the update vectors \{m_{i}\}_{i=1}^{n},

m_{i} = \sum_{j}\alpha_{ij}v_{j}, (5)

which are then used to update x_{i}, for example by addition: x^{\prime}_{i}=x_{i}+m_{i}. Intuitively, the attention weight \alpha_{ij} represents how important the information contained in c_{j} is to x_{i}. When the context \{c_{j}\} is simply \{x_{i}\}, this is termed self-attention; otherwise it is cross-attention. It is common to use a slight extension of the method above, called multi-headed attention, in which H different query, key, and value matrices \{(W^{h}_{Q},W^{h}_{K},W^{h}_{V})\}_{h=1}^{H} are learned. Each head follows the above procedure to independently produce attention weights \{\alpha_{ij}^{h}\} and update vectors \{m^{h}_{i}\}. The H update vectors \{m^{h}_{i}\}_{h=1}^{H} received by each x_{i} are concatenated to produce a final update vector

m_{i} = \bigoplus_{h=1}^{H}m^{h}_{i}, (6)

which is then used to update x_{i} as before.
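A minimal numpy sketch of the single-head attention update in Eqs. (1)-(5) is given below; the matrix shapes are illustrative assumptions.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, c, W_Q, W_K, W_V):
    # Queries from the features x (shape n x d_in), keys and values from the
    # context c (shape m x d_in).  Returns the update vectors
    # m_i = sum_j alpha_ij v_j of Eq. (5).
    q = x @ W_Q.T                                    # (n, d), Eq. (1)
    k = c @ W_K.T                                    # (m, d), Eq. (2)
    v = c @ W_V.T                                    # (m, d), Eq. (3)
    d = q.shape[-1]
    alpha = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (n, m), Eq. (4)
    return alpha @ v

# For multi-headed attention, H independent (W_Q, W_K, W_V) triplets are used
# and the resulting update vectors are concatenated, as in Eq. (6).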

A.2 Particle representation

We represent each particle with a feature vector h_{i}=(x_{i},\omega_{i}), consisting of an invariant feature vector x_{i} and a covariant feature vector \omega_{i}. x_{i} is invariant under rotations and boosts along the beamline, while \omega_{i}=(y_{i},\cos(\phi_{i}),\sin(\phi_{i})) represents the flight direction of the object and is covariant. As input to the Covariant Particle Transformer, x_{i}=(p_{\mathrm{T},i},m_{i},\mathrm{id}_{i}), where \mathrm{id}_{i} is a one-hot vector indicating the particle identity. The model learns to update these feature vectors while maintaining their invariance/covariance properties through the covariant attention.

A.3 Covariant attention

To update the learned feature vectors of each object in the event, we use covariant attention, an extension of the regular attention mechanism that processes kinematic information and guarantees the covariance properties of the predictions. In general, covariant attention updates the feature vectors \{h_{i}\} of a subset of the objects in the event using the feature vectors \{h_{j}\} of a (potentially different) subset as context. First, it computes the flight direction of each context object as viewed in object i's frame, \omega_{ij}=(y_{j}-y_{i},\cos(\phi_{j}-\phi_{i}),\sin(\phi_{j}-\phi_{i})), which is invariant under longitudinal boosts and azimuthal rotations. Then it computes the d-dimensional query, key, and value vectors as follows:

\hat{x}_{i} = \mathrm{LayerNorm}(x_{i}), (7)
v_{ij} = W_{V}(\hat{x}_{j}+\mathrm{MLP}(\omega_{ij})), (8)
k_{ij} = W_{K}(\hat{x}_{j}+\mathrm{MLP}(\omega_{ij})), (9)
q_{i} = W_{Q}\hat{x}_{i}, (10)

where W_{V}, W_{K}, W_{Q} are learned matrices and \mathrm{MLP} is a multilayer perceptron. The inner products between q_{i} and k_{ij} are passed through a softmax operator to weight the value vectors. The weighted sum produces an aggregated message vector m^{x}_{i}, which is added to x_{i}:

\alpha_{ij} = \frac{\exp(q_{i}^{\top}k_{ij}/\sqrt{d})}{\sum_{j}\exp(q_{i}^{\top}k_{ij}/\sqrt{d})}, (11)
\tilde{m}^{x}_{i} = \sum_{j}\alpha_{ij}v_{ij}, (12)
m^{x}_{i} = \sigma(\mathrm{Linear}(x_{i},\tilde{m}^{x}_{i}))\odot\tilde{m}^{x}_{i}, (13)
x^{\prime}_{i} = x_{i}+m^{x}_{i}, (14)

where \sigma is the sigmoid function and \odot denotes the elementwise (Hadamard) product. Gating is applied following the Gated Attention Network [54]. A multi-headed version of covariant attention can be constructed in the same way as in regular attention and is omitted here. x^{\prime}_{i} is then passed through a feed-forward network, as in the original transformer. When it is desirable to also update the covariant feature \omega_{i}, we produce another update vector m^{\omega}_{i} from m^{x}_{i} via

\tilde{m}^{\omega}_{i} = \mathrm{MLP}(m^{x}_{i}), (15)
m^{\omega}_{i} = \sigma(\mathrm{Linear}(x_{i},\tilde{m}^{\omega}_{i}))\odot\tilde{m}^{\omega}_{i}, (16)

where m^{\omega}_{i} is a three-dimensional vector. Its first component is used as a boost with rapidity \delta y_{i}, while its last two components v_{i} are converted to a rotation matrix R(v_{i}), which is used to rotate the azimuthal angle \phi_{i}:

y^{\prime}_{i} = y_{i}+\delta y_{i}, (18)
\begin{pmatrix}\cos(\phi^{\prime}_{i})\\ \sin(\phi^{\prime}_{i})\end{pmatrix} = R(v_{i})\begin{pmatrix}\cos(\phi_{i})\\ \sin(\phi_{i})\end{pmatrix}, (19)
\omega^{\prime}_{i} = (y^{\prime}_{i},\cos(\phi^{\prime}_{i}),\sin(\phi^{\prime}_{i})), (20)

where R(v_{i}) is obtained as follows:

u_{i} = v_{i}+(1,0), (21)
w_{i} = \frac{u_{i}}{\|u_{i}\|}=(\cos(\theta_{i}),\sin(\theta_{i})), (22)
R(v_{i}) = \begin{pmatrix}\cos(\theta_{i})&-\sin(\theta_{i})\\ \sin(\theta_{i})&\cos(\theta_{i})\end{pmatrix}, (23)

where (1,0) is added to v_{i} to bias the rotation matrix toward the identity for stability. The covariance of \{\omega^{\prime}_{i}\} follows from the fact that only invariant information is used to construct its update and that, prior to the update, \{\omega_{i}\} are themselves covariant. An inductive argument establishes the end-to-end covariance of compositions of covariant attention updates. We denote the above covariant attention update as h_{i}\leftarrow\mathcal{A}^{x\omega}_{x\omega}(h_{i},\{h_{j}\}), where the subscript indicates that it makes use of both the invariant and covariant feature vectors, and the superscript indicates that it updates both. The following variants are used to build the full model:

  • x_{i}\leftarrow\mathcal{A}^{x}_{x\omega}(h_{i},\{h_{j}\}): the covariant feature vector is not updated.

  • x_{i}\leftarrow\mathcal{A}^{x}_{x}(x_{i},\{x_{j}\}): the covariant feature vector is neither updated nor used to construct the key and value vectors. This reduces to the regular attention mechanism.
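To make the mechanism concrete, the following is a minimal, single-head numpy sketch of the invariant-feature update of Eqs. (7)-(14) in the self-attention case. LayerNorm, multiple heads, the feed-forward network, and the covariant (y, \phi) update of Eqs. (15)-(23) are omitted, and W_Q, W_K, W_V, W_gate, and mlp_pair are stand-in parameters rather than the paper's exact parameterization.

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def covariant_self_attention(x, y, phi, W_Q, W_K, W_V, W_gate, mlp_pair):
    # x: (n, d) invariant features; y, phi: (n,) rapidities and azimuthal angles.
    # mlp_pair: any callable mapping the (n, n, 3) relative directions omega_ij
    # to (n, n, d) feature offsets, standing in for the MLP in Eqs. (8)-(9).
    # W_gate is assumed to have shape (d, 2d).
    n, d = x.shape
    # Pairwise relative directions omega_ij = (y_j - y_i, cos(phi_j - phi_i),
    # sin(phi_j - phi_i)); invariant under longitudinal boosts and azimuthal
    # rotations applied to the whole event.
    dy = y[None, :] - y[:, None]
    dphi = phi[None, :] - phi[:, None]
    omega = np.stack([dy, np.cos(dphi), np.sin(dphi)], axis=-1)       # (n, n, 3)

    ctx = x[None, :, :] + mlp_pair(omega)                             # (n, n, d)
    q = x @ W_Q.T                                                     # (n, d)
    k = ctx @ W_K.T                                                   # (n, n, d)
    v = ctx @ W_V.T                                                   # (n, n, d)

    scores = np.einsum("id,ijd->ij", q, k) / np.sqrt(d)
    alpha = softmax(scores)                                           # Eq. (11)
    m_tilde = np.einsum("ij,ijd->id", alpha, v)                       # Eq. (12)
    gate = sigmoid(np.concatenate([x, m_tilde], axis=-1) @ W_gate.T)  # Eq. (13)
    return x + gate * m_tilde                                         # Eq. (14)

Because the kinematics enter only through the relative directions omega_ij, the updated invariant features are unchanged when the whole event is boosted or rotated, which is the property the covariant attention is designed to enforce.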

A.4 Encoder

The encoder uses six layers of covariant attention to update the input invariant features, x^{\mathrm{in}}_{i}\leftarrow\mathcal{A}^{x}_{x\omega}(h^{\mathrm{in}}_{i},\{h^{\mathrm{in}}_{j}\}). The covariant features of the input objects, \{\omega^{\mathrm{in}}_{i}\}, are not updated.

A.5 Decoder

A.5.1 Initialization

The decoder first initializes the invariant feature vectors associated with the top quarks using the Set2Set module [55], which takes in the set \{x^{\mathrm{in}}_{i}\} and outputs \{x^{\mathrm{out}}_{i}\}, the initial invariant feature vectors of the output objects. The decoder then updates \{x^{\mathrm{out}}_{i}\} by having each output attend to the input objects using invariant features only, x^{\mathrm{out}}_{i}\leftarrow\mathcal{A}^{x}_{x}(x^{\mathrm{out}}_{i},\{x^{\mathrm{in}}_{j}\}). The attention weights \alpha_{ij} computed in this attention update are used to initialize the output covariant feature vectors:

y^{\mathrm{out}}_{i} = \sum_{j}\alpha_{ij}y^{\mathrm{in}}_{j}, (24)
\begin{pmatrix}\cos(\phi^{\mathrm{out}}_{i})\\ \sin(\phi^{\mathrm{out}}_{i})\end{pmatrix} = \frac{\sum_{j}\alpha_{ij}\begin{pmatrix}\cos(\phi^{\mathrm{in}}_{j})\\ \sin(\phi^{\mathrm{in}}_{j})\end{pmatrix}}{\left\|\sum_{j}\alpha_{ij}\begin{pmatrix}\cos(\phi^{\mathrm{in}}_{j})\\ \sin(\phi^{\mathrm{in}}_{j})\end{pmatrix}\right\|}. (25)

The covariance of y^{\mathrm{out}}_{i} follows from the fact that \sum_{j}\alpha_{ij}=1 and that \{y^{\mathrm{in}}_{j}\} transform by an overall additive constant under a boost. The covariance of \phi^{\mathrm{out}}_{i} follows from the fact that its unit-vector representation is a linear combination of the input unit vectors \begin{pmatrix}\cos(\phi^{\mathrm{in}}_{j})\\ \sin(\phi^{\mathrm{in}}_{j})\end{pmatrix}, each of which transforms linearly under a rotation.
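A minimal numpy sketch of the initialization in Eqs. (24)-(25) is given below; the array shapes are illustrative.

import numpy as np

def init_output_directions(alpha, y_in, phi_in):
    # Eqs. (24)-(25): the output rapidities are attention-weighted averages of
    # the input rapidities, and the output azimuthal unit vectors are the
    # normalized attention-weighted averages of the input unit vectors.
    # alpha: (n_out, n_in) attention weights, each row summing to one.
    y_out = alpha @ y_in
    u = alpha @ np.stack([np.cos(phi_in), np.sin(phi_in)], axis=-1)  # (n_out, 2)
    u = u / np.linalg.norm(u, axis=-1, keepdims=True)
    phi_out = np.arctan2(u[:, 1], u[:, 0])
    return y_out, phi_out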

A.5.2 Interleaved covariant cross- and self-attention

After initialization, the decoder consists of L_{\mathrm{out}}=6 decoder blocks. In each block, the output invariant and covariant feature vectors are updated using two covariant attention layers:

h^{\mathrm{out}}_{i} \leftarrow \mathcal{A}^{x\omega}_{x\omega}(h^{\mathrm{out}}_{i},\{h^{\mathrm{out}}_{j}\}) \quad \forall i, (26)
h^{\mathrm{out}}_{i} \leftarrow \mathcal{A}^{x\omega}_{x\omega}(h^{\mathrm{out}}_{i},\{h^{\mathrm{in}}_{j}\}) \quad \forall i. (27)

After each decoder block, indexed by \ell\in\{1,...,L_{\mathrm{out}}\}, an intermediate set of predictions \{p^{\ell}_{i}\}_{i} for the top quark four-momenta is constructed as follows:

(p^{\ell}_{T_{i}}/\mathrm{GeV}, y^{\ell}_{i}, \phi^{\ell}_{i}, m^{\ell}_{i}/\mathrm{GeV}) = (100\,(x^{\ell}_{i})_{0}, y_{i}, \phi_{i}, 5\,(x^{\ell}_{i})_{1}+173), (28)

where (x^{\ell}_{i})_{0} and (x^{\ell}_{i})_{1} denote the first and second entries of the invariant feature vector associated with each top quark at the \ell-th block. The shift and scaling keep the feature values small and centered, which facilitates training.

A.6 Loss function and optimization details

For each event, the main component of the loss function is the L_{2} norm of the difference between the model prediction and the ground truth for the top quark four-momenta in (p_{x}/100\,\mathrm{GeV}, p_{y}/100\,\mathrm{GeV}, y, m/5\,\mathrm{GeV}) coordinates, averaged over the N top quarks present in the event:

\mathcal{L}_{\mathrm{final}}=\frac{1}{N}\sum_{i=1}^{N}\|p_{i}-p^{*}_{i}\|, (29)

where \{p_{i}\} are the model predictions at the final decoder block and \{p^{*}_{i}\} are the ground truths. We chose this set of coordinates so that each component of the four-momenta has a standard deviation of O(1), encouraging the model to pay equal attention to each of them. The N predictions from the model are matched to the N ground truths through a permutation \pi^{*} that minimizes the average \Delta R between the matched pairs:

\pi^{*}=\operatorname*{arg\,min}_{\pi:\,\mathrm{permutations}}\frac{1}{N}\sum_{i=1}^{N}\sqrt{(y_{\pi(i)}-y^{*}_{i})^{2}+(\phi_{\pi(i)}-\phi^{*}_{i})^{2}}. (30)

We add two auxiliary losses, \mathcal{L}_{\mathrm{intermediate}} and \mathcal{L}_{\mathrm{unit\text{-}norm}}, to stabilize training models with many layers. The intermediate loss \mathcal{L}_{\mathrm{intermediate}} measures the prediction errors at the earlier decoder blocks,

\mathcal{L}_{\mathrm{intermediate}}=\frac{1}{L_{\mathrm{out}}-1}\sum_{\ell=1}^{L_{\mathrm{out}}-1}\left(\frac{1}{N}\sum_{i=1}^{N}\|p^{\ell}_{i}-p^{*}_{i}\|\right), (31)

where \{p^{\ell}_{i}\}_{i=1}^{N} are the intermediate predictions at the \ell-th decoder block. The unit-norm loss \mathcal{L}_{\mathrm{unit\text{-}norm}} encourages the vectors u_{i} to have unit norm before being normalized and converted to rotation matrices in Equation (21) in each decoder block:

\mathcal{L}_{\mathrm{unit\text{-}norm}}=\frac{1}{L_{\mathrm{out}}}\sum_{\ell=1}^{L_{\mathrm{out}}}\left(\frac{1}{N}\sum_{i=1}^{N}\left|\,\|u^{\ell}_{i}\|-1\,\right|\right). (32)

The two auxiliary losses are inspired by similar auxiliary losses in AlphaFold2 [25]. The total loss is a weighted combination of the three terms,

\mathcal{L}_{\mathrm{total}}=\lambda_{1}\mathcal{L}_{\mathrm{final}}+\lambda_{2}\mathcal{L}_{\mathrm{intermediate}}+\lambda_{3}\mathcal{L}_{\mathrm{unit\text{-}norm}}. (33)

We use \lambda_{1}=\lambda_{2}=1 and \lambda_{3}=0.02. All models used to report our results are trained using the Lamb optimizer [56] with a batch size of 256 and a learning rate of 10^{-4} for 30 epochs, followed by 10^{-5} for another 10 epochs. A weight decay [57] of 0.01 is applied. The model from the epoch achieving the minimum validation loss is used for the final evaluation. This training protocol is sufficient to saturate the validation performance for all variants of the model and for the datasets of the various processes and sizes used to present our results.
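For illustration, a minimal numpy sketch of the permutation matching of Eq. (30) and the main loss term of Eq. (29) is given below (a brute-force search over permutations, sufficient for the small N considered here); the array layouts are assumptions for the example.

import numpy as np
from itertools import permutations

def match_predictions(y_pred, phi_pred, y_true, phi_true):
    # Eq. (30): assign the N predicted tops to the N truth tops by minimizing
    # the average Delta R.  As in the equation, the azimuthal difference is
    # used without the 2*pi wrap.
    n = len(y_true)
    best_cost, best_perm = np.inf, None
    for perm in permutations(range(n)):
        idx = list(perm)
        cost = np.mean(np.sqrt((y_pred[idx] - y_true) ** 2 +
                               (phi_pred[idx] - phi_true) ** 2))
        if cost < best_cost:
            best_cost, best_perm = cost, idx
    return best_perm

def final_loss(p_pred, p_true):
    # Eq. (29): mean L2 distance between predicted and true four-momenta in the
    # scaled (p_x/100 GeV, p_y/100 GeV, y, m/5 GeV) coordinates, averaged over
    # the tops in the event; inputs are (N, 4) arrays, already matched.
    return np.mean(np.linalg.norm(p_pred - p_true, axis=-1))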

Appendix B Baseline

We train a neural network to identify triplets of jets that originate from top decays. This task can be formulated as a link prediction problem on a graph, where the nodes are the detected jets in an event and any two jets that belong to the same triplet are connected by a link. Specifically, every event is represented by a fully connected graph using the four-momenta and particle types as node features, and a graph neural network (GNN) predicts a probability p_{ij}\in(0,1) that a link exists between jet i and jet j for every pair of jets in the event. The particular architecture we use is the Interaction Network [58], followed by an MLP applied per edge to output the per-edge probabilities. The GNN is trained to minimize the cross-entropy loss so that p_{ij} is encouraged to be 1 if the jets belong to the same triplet and 0 otherwise. It uses the same training, validation, and test sets as CPT. We tuned the hyperparameters to maximize validation accuracy and settled on four Interaction Network blocks, two layers and 128 hidden units for all MLPs, the Adam optimizer, and a learning rate of 0.001. At test time, we sort all possible links (i,j) in decreasing order of p_{ij} and sequentially form one or two predicted triplets, depending on the number of available jets in the event. Each predicted top four-vector is the system four-vector of the predicted triplet. The predicted tops are \Delta R matched to the true tops following the same procedure used for CPT, defined in Equation (30). We note that this method provides a strong baseline, as it uses a neural network architecture that has demonstrated state-of-the-art performance in reasoning about objects and relations in a wide range of complex problems, such as N-body dynamics and estimating physical quantities [58].
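As an illustration of how predicted triplets can be formed from the per-edge probabilities, the following sketch scores each jet triplet by the sum of its three link probabilities and greedily selects non-overlapping triplets. The paper describes sorting individual links and forming triplets sequentially; this greedy triplet scoring is one simple way to realize that idea, not necessarily the exact procedure used.

from itertools import combinations

def predict_triplets(edge_probs, n_jets, n_tops):
    # edge_probs maps unordered jet-index pairs (i, j) with i < j to p_ij.
    def p(a, b):
        return edge_probs[(min(a, b), max(a, b))]

    scored = sorted(((p(a, b) + p(a, c) + p(b, c), (a, b, c))
                     for a, b, c in combinations(range(n_jets), 3)),
                    reverse=True)
    triplets, used = [], set()
    for _, trip in scored:
        if used.isdisjoint(trip):
            triplets.append(trip)
            used.update(trip)
        if len(triplets) == n_tops:
            break
    return triplets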

References