
A Holistic Approach to Predicting Top Quark Kinematic
Properties with the Covariant Particle Transformer

Shikai Qiu [email protected] Department of Physics, University of California, Berkeley, Berkeley, CA 94720, USA    Shuo Han [email protected] Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA    Xiangyang Ju [email protected] Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA    Benjamin Nachman [email protected] Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA Berkeley Institute for Data Science, University of California, Berkeley, CA 94720, USA    Haichen Wang [email protected] Physics Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA Department of Physics, University of California, Berkeley, Berkeley, CA 94720, USA
Abstract

Precise reconstruction of top quark properties is a challenging task at the Large Hadron Collider due to combinatorial backgrounds and missing information. We introduce a physics-informed neural network architecture called the Covariant Particle Transformer (CPT) for directly predicting the top quark kinematic properties from reconstructed final state objects. This approach is permutation invariant and partially Lorentz covariant and can account for a variable number of input objects. In contrast to previous machine learning-based reconstruction methods, CPT is able to predict top quark four-momenta regardless of the jet multiplicity in the event. Using simulations, we show that the CPT performs favorably compared with other machine learning top quark reconstruction approaches. We make our code available at https://github.com/shikaiqiu/Covariant-Particle-Transformer.

I Introduction

For the Large Hadron Collider (LHC) experiments, the kinematic reconstruction of top quarks is critical to many precision tests of the Standard Model (SM) as well as direct searches for physics beyond the SM. Once produced, the top quark decays to a bottom quark (b-quark) and a W boson, with a branching ratio close to 100% [1]. Subsequently, the W boson decays into a lepton or quark pair. In the final state, quarks originating from top quark decays and other colored partons hadronize, resulting in collimated sprays of hadrons, known as jets. Conventional top quark reconstruction methods assume that a hadronically decaying top quark produces three jets in the final state. These methods are therefore tuned to identify triplets of jets, which are considered proxies for the three quarks originating directly from the top quark and W boson decays. The estimated top quark four-momentum is computed as the sum of the measured four-momenta of the jets in the triplet. Essentially, top quark reconstruction is treated as a combinatorial problem of sorting jets, and most methods use jet kinematic and flavor-tagging information to construct likelihood-based [2] or machine learning-based [3, 4, 5, 6, 7, 8, 9, 10] metrics to identify triplets of jets as proxies for top quarks and similar particles.

While the conventional top quark reconstruction approaches have been implemented in a variety of forms and extensively used at hadron collider experiments, they have fundamental shortcomings. The one-to-one correspondence between a parton (quark or gluon) and a jet, assumed by the conventional approaches, is only an approximation. Partons carry color charge, but jets consist only of colorless hadrons, so multiple partons necessarily contribute to the formation of any single jet. Conversely, a single parton may contribute to the formation of multiple jets, particularly when the parton is highly energetic. In addition, triplet-based top quark reconstruction requires the presence of a certain number of jets in the final state. This jet multiplicity requirement can be inefficient because of kinematic thresholds, limited detector coverage, or the merging of highly collimated parton showers.

In this paper, we propose a new machine learning-enabled approach to determine top quark properties through a holistic processing of the event final state. Our goal is to predict the top quark four-momenta in a collision event with a given number of top quarks. The number of top quarks can itself be learned from the final state, or it can be posited for a given hypothesis. As discussed above, the kinematic information of a top quark is not localized in a triplet of jets; rather, it is carried collectively by all particles in the event. This motivates using the particle identification (ID) and kinematic information of all detectable final state particles as input for determining the top quark four-momenta. Specifically, the four-momenta and ID of all detectable final state particles are fed to a deep neural network regression model, which is constructed and trained to predict the four-momenta of a given number of top quarks. This approach offers four major advantages compared to conventional approaches. First, we no longer deal with the conceptually ill-defined jet-triplet identification process. Second, we can account for noisy or missing observations due to limited acceptance, detector inefficiency, and resolution, as the regression model can learn such effects from Monte Carlo (MC) simulations. Third, the holistic processing of the event final state offers a unified approach to determining the top quark properties for both hadronic and semi-leptonic top quark decays, which may simplify analysis workflows. Finally, our approach has a runtime polynomial in the number of final state objects, as opposed to super-exponential for standard reconstruction-based approaches, which need to consider all possible permutations; this makes ours the first tractable method for processes with high-multiplicity final states such as t\bar{t}t\bar{t}.

To realize the holistic approach of top quark property determination, we propose a physics-informed transformer [11] architecture termed Covariant Particle Transformer (CPT). CPT takes as input properties of the final state objects in a collision event and outputs predictions for the top quark kinematic properties. Like other recent top reconstruction proposals [7, 9, 8], CPT is permutation invariant under exchange of the inputs. A novel attention mechanism [11, 12], referred to as covariant attention, is designed to learn the predicted kinematic properties as a function of the set of final state objects as a whole, and guarantees that the predictions transform covariantly under rotation and/or boosts of the event along the beamline. While not fully Lorentz-covariant like Ref. [13], our approach captures the most important covariances relevant to hadron collider physics with minimal computational overhead and enjoys a much simpler implementation, which allows it to be easily adopted for a broad range of tasks in collider physics.

This paper is organized as follows. Section II introduces the construction and properties of CPT. Synthetic datasets used for demonstrating the performance of CPT are introduced in Sec. III. Numerical results illustrating the performance of CPT are presented in Sec. IV. In Sec. V, we explore which aspects of CPT give rise to its performance. The paper ends with conclusions and outlook in Sec. VI.

II Covariant Particle Transformer

II.1 Symmetries and covariance

At the LHC, the beamline determines a special direction and reduces the relevant symmetry group of collision events from the proper orthochronous Lorentz group \mathrm{SO}^{+}(1,3) to \mathrm{SO}(2)\times\mathrm{SO}^{+}(1,1), which contains products of azimuthal rotations and longitudinal boosts along the beamline. The Covariant Particle Transformer extends the original transformer architecture to properly account for these symmetry transformations, by ensuring that if the four-momenta of all final state objects undergo such a transformation, the resulting prediction of the top quark four-momenta undergoes the same transformation. At its core, this is achieved through the novel covariant attention mechanism, which modifies the standard attention mechanism to ensure that all intermediate learned features have well-defined transformation properties.
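For concreteness, the following minimal sketch (not part of the paper's implementation) spells out how an element of this symmetry group acts on a set of final state objects in the (p_{\mathrm{T}}, y, \cos\phi, \sin\phi, m) representation used in Sec. II.2; the model mentioned in the closing comment is a hypothetical CPT-like predictor, introduced only to state the covariance condition.

import numpy as np

def transform(particles, delta_y, delta_phi):
    # particles: array of shape (n, 5) with columns (pT, y, cos_phi, sin_phi, m).
    # pT and m are invariant; y shifts additively under a longitudinal boost;
    # the azimuthal unit vector rotates by delta_phi.
    out = particles.copy()
    out[:, 1] = particles[:, 1] + delta_y
    c, s = np.cos(delta_phi), np.sin(delta_phi)
    out[:, 2] = c * particles[:, 2] - s * particles[:, 3]
    out[:, 3] = s * particles[:, 2] + c * particles[:, 3]
    return out

# Covariance of the network means model(transform(x, dy, dphi)) equals
# transform(model(x), dy, dphi) for any dy and dphi, where model is a
# hypothetical CPT-like predictor returning the same representation for
# each predicted top quark.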

Covariance (called equivariance in the machine learning literature) under rotations and boosts [13, 14] and under input permutations [15] has been studied in a variety of recent High Energy Physics (HEP) papers. A number of additional studies have explored permutation invariant architectures [16, 17, 18, 19, 20] (see also other graph network approaches [21, 22]). Compared to prior works in this direction, we make the following important contributions:

  • We develop the first transformer architecture that enforces Lorentz covariance. Transformers are a powerful class of neural networks that have revolutionized many areas of machine learning applications, such as natural language processing [11, 23], computer vision [24], and recently protein folding [25]. By integrating the transformer architecture with Lorentz covariance, CPT combines the current state-of-the-art of machine learning with physics-specific knowledge to become a powerful tool for applications in collider physics, as we will illustrate in this work.

  • We develop a simple, efficient, and effective way of achieving partial Lorentz covariance. While previous works have developed Lorentz covariant neural networks using customized architectures, they incur significant computational overhead compared to a standard neural network due to computations of continuous group convolutions [14] or irreducible representations of the Lorentz group [13]. By contrast, CPT only requires a simple modification to the standard attention mechanism with minimal computational overhead.

  • We are the first to demonstrate the benefit of using a Lorentz covariant architecture for regression problems where the targets are the four-momenta of particles. Previous works on Lorentz covariant neural networks were evaluated only on classification problems such as jet tagging, where the Lorentz group acts trivially (i.e., as the identity) on the targets. In that setting, Lorentz symmetry plays a less significant role, since the neural network only needs to be Lorentz invariant rather than covariant.

II.2 Architecture

The Covariant Particle Transformer consists of an encoder and a decoder. To ensure permutation invariance, we remove the positional encoding [11] used in the original transformer encoder. The encoder produces learned features of the final state objects, which include jets, photons, electrons, muons, and missing transverse energy (E_{\mathrm{T}}^{\text{miss}}, implemented as a massless particle with zero longitudinal momentum component).

Each object is represented by its transverse momentum p_{\mathrm{T}}, rapidity y, azimuthal angle \phi expressed as a unit vector (\cos\phi,\sin\phi) to avoid modular-arithmetic ambiguities, mass m, and particle identification (ID). The encoder uses six covariant self-attention layers to update the feature vectors of the final state objects. The decoder uses 12 covariant attention layers to produce learned features of the top quarks. Six of these layers use self-attention, which updates the feature vector of each top quark as a function of itself and the feature vectors of the other top quarks, and the other six layers use cross-attention, which updates the feature vector of each top quark as a function of itself and the feature vectors of the final state objects. Finally, the feature vectors of the top quarks are converted to predicted physics variables, namely the top quark four-momenta expressed in transverse momentum p_{\mathrm{T}}, rapidity y, azimuthal angle unit vector, and mass m. Figure 1 illustrates the architecture of the Covariant Particle Transformer. Detailed descriptions of the input featurization, the CPT architecture, and the covariant attention mechanism are provided in Appendix A.
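As an illustration of the input featurization described above, a minimal sketch is given below. The particle-type vocabulary and its ordering are assumptions made here for the example; the paper only specifies that the ID is encoded as part of the invariant features.

import numpy as np

# Particle-type vocabulary assumed for illustration; the paper lists jets,
# photons, electrons, muons, and missing transverse energy, and additionally
# labels b-jets, but the exact encoding and ordering are not specified here.
PARTICLE_TYPES = ["jet", "b-jet", "photon", "electron", "muon", "met"]

def featurize(pt, y, phi, mass, ptype):
    # Per-object input feature vector described in the text:
    # (pT, y, cos(phi), sin(phi), m) followed by a one-hot particle-ID block.
    one_hot = np.zeros(len(PARTICLE_TYPES))
    one_hot[PARTICLE_TYPES.index(ptype)] = 1.0
    return np.concatenate(([pt, y, np.cos(phi), np.sin(phi), mass], one_hot))

# Example: a 50 GeV b-tagged jet at y = 0.3, phi = 1.2, with mass 8 GeV.
x = featurize(50.0, 0.3, 1.2, 8.0, "b-jet")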

(a) CPT  (b) Encoder  (c) Decoder
Figure 1: An illustration of the Covariant Particle Transformer (CPT) architecture. The encoder consists of six covariant self-attention layers, while the decoder consists of six covariant cross-attention layers and six covariant self-attention layers interleaved.

II.3 Loss function

The model is trained to minimize a supervised learning objective that measures the distance between the true and predicted values of the target variables (note that learning the true value from reconstructed quantities introduces a prior dependence [26]; this is true for nearly all regression approaches in HEP). Auxiliary losses are included to stabilize the training. We provide a detailed description of the loss function in Appendix A.6.

III Datasets

We use MadGraph@NLO (v2.3.7) [27] to generate pp collision events at next-to-leading order (NLO) in QCD. The decays of top quarks and W bosons are performed by MadSpin [28]. We generate 9.2 million t\bar{t}H events, 5.4 million t\bar{t}t\bar{t} events, 1.3 million t\bar{t} events, 1.3 million t\bar{t}W events, and 1 million t\bar{t}H events with a CP-odd top-Yukawa coupling (t\bar{t}H_{\text{CP-odd}}). In our generation, Higgs bosons decay through the diphoton channel for simplicity, and all other particles such as top quarks and W bosons decay inclusively. The Higgs Characterization model [29] is used to generate the t\bar{t}H_{\text{CP-odd}} events. The generated events are interfaced with Pythia 8.235 [30] for parton showering. We do not emulate detector effects, as the salient features of the problem are already present after parton shower and hadronization. The generated hadrons are used to construct anti-k_{t} [31] R=0.4 jets using FastJet 3.3.2 [32, 33].

Jets are required to have |y|\leq 2.5 and p_{\mathrm{T}}\geq 25 GeV, while leptons are required to have |y|\leq 2.5 and p_{\mathrm{T}}\geq 10 GeV. A jet is removed if its distance in \Delta R to a photon or a lepton is less than 0.4 (\Delta R is defined as \sqrt{\Delta y^{2}+\Delta\phi^{2}}, where \Delta y and \Delta\phi are the differences in rapidity and azimuthal angle between the two particles). Jets that are \Delta R matched to b-quarks at the parton level are labeled as b-jets; this label is removed randomly for 30% of the b-jets to mimic the inefficiency of realistic b-tagging [34, 35]. We further apply a preselection on the testing set of N_{\mathrm{b\text{-}jet}}>0 and ((N_{\mathrm{jet}}\geq 3 and N_{\mathrm{lepton}}=0) or N_{\mathrm{lepton}}>0), to mimic realistic data analysis requirements. The t\bar{t}H and t\bar{t}t\bar{t} samples are each divided into training, validation, and testing sets with a 75%:12.5%:12.5% split. The other samples (t\bar{t}, t\bar{t}W, and t\bar{t}H_{\text{CP-odd}}) are used only for testing. While a single model could be trained on a mixture of processes such as t\bar{t}H and t\bar{t}t\bar{t} for greater generality, we leave this exciting direction to future work.
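A minimal sketch of the object selection and event preselection described above is given below; the dictionary-based object format is an illustrative convention, not the data format used in the paper.

import math

def delta_r(y1, phi1, y2, phi2):
    # Delta R = sqrt(dy^2 + dphi^2), with the azimuthal difference wrapped to [-pi, pi].
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(y1 - y2, dphi)

def select_jets(jets, photons_and_leptons):
    # Jet selection from the text: |y| <= 2.5, pT >= 25 GeV, and removal of any
    # jet within Delta R < 0.4 of a photon or lepton.  Objects are assumed to be
    # dicts with "pt", "y", "phi" keys.
    selected = []
    for jet in jets:
        if abs(jet["y"]) > 2.5 or jet["pt"] < 25.0:
            continue
        if any(delta_r(jet["y"], jet["phi"], o["y"], o["phi"]) < 0.4
               for o in photons_and_leptons):
            continue
        selected.append(jet)
    return selected

def passes_preselection(n_bjets, n_jets, n_leptons):
    # Event preselection: at least one b-jet, and either (>= 3 jets with no
    # lepton) or at least one lepton.
    return n_bjets > 0 and ((n_jets >= 3 and n_leptons == 0) or n_leptons > 0)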

As we compare the performance of CPT to that of a conventional approach, we refer to top quarks that can be matched to a triplet of jets as “truth-matched” and those that cannot as “unmatched”. Specifically, a top quark is considered “truth-matched” if it decays hadronically and each of the three quarks originating from its decay is matched (\Delta R<0.4) to exactly one jet. According to this definition, semi-leptonically decaying tops are always unmatched, motivated by the fact that the neutrino cannot be detected directly (at best its kinematics, such as p_{\mathrm{T}}, can be estimated). The vast majority (e.g., 76% for t\bar{t}H) of tops are unmatched and therefore cannot be fully reconstructed, due to incomplete information about their decay products. For events passing the preselection, the fraction of hadronically decaying top quarks that can be truth-matched is 36% for t\bar{t}H, 37% for t\bar{t}, 38% for t\bar{t}W, and 38% for t\bar{t}t\bar{t}.
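The truth-matching criterion can be made concrete with the following sketch (an illustration under stated assumptions, not the paper's code); inputs are assumed to be lists of (y, \phi) pairs for the parton-level decay quarks and for the jets.

import math

def _delta_r(y1, phi1, y2, phi2):
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(y1 - y2, dphi)

def is_truth_matched(decay_quarks, jets, dr_max=0.4):
    # A top is "truth-matched" if it decays hadronically (three decay quarks)
    # and each quark is matched (Delta R < 0.4) to exactly one jet, with no jet
    # shared between quarks.
    if len(decay_quarks) != 3:
        return False  # semi-leptonically decaying tops are always unmatched
    used = set()
    for qy, qphi in decay_quarks:
        matches = [i for i, (jy, jphi) in enumerate(jets)
                   if i not in used and _delta_r(qy, qphi, jy, jphi) < dr_max]
        if len(matches) != 1:
            return False
        used.add(matches[0])
    return True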

IV Performance

We study three different performance aspects of CPT. First, we evaluate the resolution of the predictions of the individual top quark kinematic variables. Second, we compare the correlations among the predicted variables to those among the true top quark properties. Finally, we assess the process dependence of CPT by applying the model trained on t\bar{t}H events to alternative processes. We study these metrics inclusively for events passing the preselection, and we also break down the performance into top quarks for which a matching triplet of jets can be identified using truth information and top quarks for which no matching triplet can be identified. For the former case, we also compare the CPT prediction with the calculation from a triplet-based reconstruction method. The latter scenario corresponds to the case where the conventional triplet-based reconstruction method does not apply.

Figure 2: Top row: distributions of the truth and predicted top quark four-momentum components p_{\mathrm{T}}, y, and \phi in the t\bar{t}H sample. Bottom row: distributions of the dimensionless errors \Delta p_{\mathrm{T}}/p_{\mathrm{T}}, \Delta y, and \Delta\phi, where \Delta denotes prediction minus truth. The area under each histogram is normalized to unity. As expected, CPT's performance is worse for unmatched tops due to incomplete information. Over all tops (truth-matched and unmatched) in the test set, the median values of \Delta p_{\mathrm{T}}/p_{\mathrm{T}}, \Delta y, and \Delta\phi are -0.02, 0.002, and -0.002, showing that there is no significant bias in CPT's prediction.

Resolution: Figure 2 shows the predicted and truth distributions of p_{\mathrm{T}}, y, and \phi for the top quarks in the t\bar{t}H sample. To quantify the resolution, we calculate the width of \Delta p_{\mathrm{T}}/p_{\mathrm{T,truth}}, \Delta y, and \Delta\phi, the model's prediction errors for the three variables (relative error for p_{\mathrm{T}}). The width is quantified as half of the 68% inter-quantile range, which corresponds to one standard deviation in the Gaussian case. The top quark mass is part of the four-momentum prediction, but we do not show it here as its distribution is nearly a delta function. Since the model predicts the four-momenta of two top quarks, the predicted top quarks are matched to the truth top quarks in the resolution calculation so as to minimize the sum of \Delta R over all matched pairs. Table 1 summarizes the prediction resolutions for all top quarks in the predicted t\bar{t}H events, separated into “truth-matched” and “unmatched” top quarks. As expected, CPT's performance is worse for unmatched tops due to incomplete information. Over all tops (truth-matched and unmatched) in the test set, the median values of \Delta p_{\mathrm{T}}/p_{\mathrm{T}}, \Delta y, and \Delta\phi are -0.02, 0.002, and -0.002, showing that there is no significant statistical bias in CPT's prediction.
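For reference, a minimal sketch of the width measure used throughout the paper (half of the 68% inter-quantile range) is given below; the variable names are illustrative.

import numpy as np

def resolution(errors):
    # Half of the 68% inter-quantile range of the prediction errors,
    # which equals one standard deviation for a Gaussian distribution.
    q_low, q_high = np.quantile(errors, [0.16, 0.84])
    return 0.5 * (q_high - q_low)

# Example usage on the dimensionless errors of Fig. 2, e.g.
# resolution(delta_pt / pt_truth), resolution(delta_y), resolution(delta_phi).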

Relative performance: The model prediction resolutions are compared to the intrinsic resolutions of reconstructing top quarks from jet triplets. The intrinsic resolutions are calculated from truth-matched triplets of jets, where the four-momentum of the truth-matched jet triplet is taken as the prediction. In this case, the resolution arises from the effects of quark hadronization and jet reconstruction. For truth-matched top quarks, the ratio of the CPT prediction resolution to the intrinsic resolution is 1.5 for p_{\mathrm{T}}, 2.3 for the rapidity y, and 2.0 for the azimuthal angle \phi.

To compare CPT with a strong baseline, we also evaluate a triplet-based reconstruction method in which a neural network is trained to identify the triplet associated with each top quark. The baseline resolutions have prediction-to-intrinsic ratios of 2.2 for p_{\mathrm{T}}, 2.8 for y, and 3.1 for \phi. Therefore, even when evaluated on truth-matched top quarks, CPT achieves significantly better resolution than the triplet-based method. The comparison is visualized in Figure 3. Details on the baseline implementation are available in Appendix B.

Figure 3: Resolution (smaller is better) achieved by CPT and by the triplet-based reconstruction (baseline), normalized by the intrinsic resolution arising from the effects of quark hadronization and jet reconstruction, evaluated on truth-matched tops in t\bar{t}H events. CPT achieves significantly better resolution than the reconstruction-based approach.

In the preselected t\bar{t}H events, 76% of the top quarks are unmatched: 43% of all tops decay hadronically (out of the 67% that do so) but do not have a matching triplet, and 33% of all tops decay semi-leptonically. For these unmatched top quarks, CPT achieves a prediction-to-intrinsic resolution ratio of 2.5 for p_{\mathrm{T}}, 6.5 for y, and 3.6 for \phi. Due to incomplete information about the tops' decay products, CPT's performance degrades as expected for unmatched top quarks, though the absolute resolutions remain below 30%. Note that these top quarks cannot otherwise be fully reconstructed by reconstruction-based alternatives, owing to the incomplete information about their decay products. While there exist procedures to approximately recover some of the missing information, such as the neutrino kinematics, combining these additional estimators with a reconstruction-based method to handle unmatched tops introduces additional complexity and sources of error, and it is highly unlikely that the resulting approach would outperform a regression model.

Table 1: Summary of the resolutions of the top quark four-momentum components in various scenarios for the t\bar{t}H, t\bar{t}, and t\bar{t}W processes.

                              \sigma_{p_{\mathrm{T}}}   \sigma_{y}   \sigma_{\phi}
t\bar{t}H    Intrinsic                0.10                 0.04          0.07
             Truth-matched            0.15                 0.09          0.14
             Unmatched                0.27                 0.25          0.26
t\bar{t}     Intrinsic                0.11                 0.04          0.09
             Truth-matched            0.19                 0.11          0.20
             Unmatched                0.31                 0.32          0.37
t\bar{t}W    Intrinsic                0.12                 0.04          0.08
             Truth-matched            0.27                 0.15          0.28
             Unmatched                0.45                 0.36          0.50

Correlation: Among the six variables of interest, only three pairs of variables have a linear correlation beyond 5% in the truth sample. These correlations are 74\% for (p_{\mathrm{T},1}, p_{\mathrm{T},2}), 50\% for (y_{1}, y_{2}), and -31\% for (\phi_{1}, \phi_{2}). The corresponding correlations in the Covariant Particle Transformer predictions are 75\% for (p_{\mathrm{T},1}, p_{\mathrm{T},2}), 43\% for (y_{1}, y_{2}), and -34\% for (\phi_{1}, \phi_{2}). The correlation between the top quarks is thus well reproduced in CPT's predictions.

Process dependence: We assess the process dependence of CPT by applying the model trained on t\bar{t}H to t\bar{t}W, t\bar{t}, and t\bar{t}H_{\text{CP-odd}} events. Table 1 compares the intrinsic and prediction resolutions for the t\bar{t}H, t\bar{t}W, and t\bar{t} processes. CPT trained exclusively on the t\bar{t}H sample can be applied without any retraining to yield a similar level of performance on t\bar{t} events. This level of generalization is not trivial, since the two processes induce different statistics in the final state objects and top quarks. The t\bar{t}W events constitute a much more challenging test set, since additional jets, leptons, and neutrinos produced in the W decay introduce more complex correlations among the objects that are not present in CPT's training set. Consequently, CPT yields larger resolutions on the t\bar{t}W test set. The process dependence can be mitigated by a number of strategies, such as training CPT with a more representative sample or possibly active decorrelation strategies [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49], which we defer to future studies. Figure 4 shows distributions of system-level observables constructed from the individual top quark four-momenta for the t\bar{t}H and t\bar{t}H_{\text{CP-odd}} samples. Reasonable agreement between the predictions and the ground truth is observed for these observables, indicating that CPT captures the subtle kinematic differences between the two processes and reproduces the correlations in the four-momenta of the two top quarks. The agreement can be improved by applying a preselection such as requiring at least one truth-matched top. Importantly, although the model prediction is not perfect, the separation between t\bar{t}H and t\bar{t}H_{\text{CP-odd}} events is preserved by the CPT predictions, showing the promise of applying CPT to produce discriminating kinematic variables.

Figure 4: Predicted and truth distributions of the system-level observables |\Delta y| (top) and m_{t\bar{t}H} (bottom) in the t\bar{t}H sample (orange) and the t\bar{t}H_{\text{CP-odd}} sample (blue). |\Delta y| is the absolute difference between the rapidities of the two tops, and m_{t\bar{t}H} is the invariant mass of the t\bar{t}H system, where the Higgs four-momentum is taken to be its ground-truth value. The area under each histogram is normalized to unity. As CPT is not trained on the t\bar{t}H_{\text{CP-odd}} sample, its predictions for t\bar{t}H_{\text{CP-odd}} events are worse, as expected.

High-multiplicity final state: CPT can predict the four-momenta of an arbitrary (fixed) number of top quarks in a collision event. We test this ability in the extreme case at the LHC in which four top quarks are produced in the same event. We configure CPT to predict the four-momenta of four top quarks and train it with the t\bar{t}t\bar{t} sample described in Section III. Table 2 shows the intrinsic and prediction resolutions for this test. Compared to the prediction for the t\bar{t}H sample, the prediction for t\bar{t}t\bar{t} is worse. However, the intrinsic resolution in the t\bar{t}t\bar{t} sample is also worse than that in the t\bar{t}H sample, suggesting that the top quarks in t\bar{t}t\bar{t} events are inherently more complex and challenging to reconstruct. We expect the gap between the intrinsic resolution and CPT's resolution to shrink with further architectural improvements and more training data. We stress that the exploding combinatorics in t\bar{t}t\bar{t} events render reconstruction-based methods prohibitively expensive in this setting, whereas CPT can be applied without any modification. To predict the top quark kinematics from N jets, a standard reconstruction-based method has a super-exponential computational complexity of O(N!), the number of possible permutations of N objects, while CPT has only a polynomial complexity of O(N^{2}), since the attention mechanism involves only pairwise interactions among the objects.

Table 2: Summary of the resolutions of the top quark four-momentum components in various scenarios in the t\bar{t}t\bar{t} sample.

                   \sigma_{p_{\mathrm{T}}}   \sigma_{y}   \sigma_{\phi}
Intrinsic                  0.19                 0.05          0.09
Truth-matched              0.29                 0.16          0.24
Unmatched                  0.42                 0.32          0.36

V Ablation studies

We demonstrate the effects of removing important components of CPT to show how they contribute to the final performance. All comparisons are done on the t\bar{t}H dataset. Resolutions are reported on all top quarks passing the preselection, regardless of truth-matching status.

Attention mechanism: The attention mechanism is an important part of the model, as it allows the model to selectively focus on a subset of the final state objects when determining the four-momentum of each top quark. We demonstrate its benefit by training an otherwise identical model with all attention weights set to the constant \frac{1}{N_{\mathrm{in}}}, where N_{\mathrm{in}} is the number of final state objects in the event. Comparisons between the resolutions achieved by this model and by the nominal model are shown in Table 3. The model with uniform attention achieves worse resolutions, which demonstrates the benefit of the attention mechanism.

Table 3: Comparison of the resolutions of the top quark four-momentum components in the t\bar{t}H sample achieved by CPT and by its variant with uniform attention weights over the final state objects.

                              \sigma_{p_{\mathrm{T}}}   \sigma_{y}   \sigma_{\phi}
CPT                                  0.24                 0.21          0.23
CPT (uniform attention)              0.27                 0.23          0.28

Covariant attention: CPT employs a covariant attention mechanism to exploit the symmetries in collision data. When the covariant attention is replaced by a regular attention mechanism, which does not guarantee covariance, we observe an increasing degradation in performance as the training sample becomes smaller. Figure 5 compares the resolutions achieved by CPT and by its variant using regular attention, as a function of the number of training events. For example, the increase in p_{\mathrm{T}} resolution can be as large as 16% when only 0.1% of the events in the nominal training sample are used. This shows that the covariant attention makes CPT more data-efficient and yields more accurate predictions in the low-data regime compared to non-covariant models.

Figure 5: Resolution in the t\bar{t}H sample achieved using the covariant attention and the non-covariant attention. The covariant attention offers a clear benefit, particularly in the low-data regime.

Alternative architectures: Finally, we compare with two alternative permutation-invariant architectures, Graph Convolutional Networks (GCNs) [50] and DeepSets [51]. Applied to this task, GCNs use graph convolutions to process the final state objects represented as a complete graph, while DeepSets uses a fully connected neural network encoder to learn the feature vector of each final state object individually. In both cases, the feature vectors of all final state objects are then summed and fed into a fully connected neural network that predicts the top quark four-momenta. The Covariant Particle Transformer differs from these two architectures mainly by utilizing an attention mechanism, implementing partial Lorentz covariance, and using a decoder module. We use six graph convolutional layers for the GCN, six encoder layers for DeepSets, and a feature dimension of 128 for both. A comparison of the resolutions is shown in Table 4. CPT significantly outperforms the other two methods, showing its effectiveness on this task. We did not perform extensive hyperparameter optimization for any of the three architectures; however, given the magnitude of the observed differences, we hypothesize that the performance ordering would persist after such an optimization. We defer this study to future work.

Table 4: Comparison of the resolutions of the top quark four-momentum components in the t\bar{t}H sample achieved by CPT, GCN, and DeepSets.

              \sigma_{p_{\mathrm{T}}}   \sigma_{y}   \sigma_{\phi}
CPT                  0.24                 0.21          0.23
GCN                  0.38                 0.35          0.42
DeepSets             0.36                 0.32          0.36

VI Conclusion

In this paper, we propose a new machine learning-enabled approach to determining top quark kinematic properties by processing the full event information holistically. Our approach offers four major advantages compared to conventional approaches. First, we no longer deal with the conceptually ill-defined jet-triplet identification process. Second, we can account for noisy or missing observations due to limited detector acceptance, inefficiency, and resolution, as the regression model can learn such effects from simulations. Third, the holistic processing of the event final state offers a unified approach to determining the top quark properties for both hadronic and semi-leptonic top quark decays, which simplifies the analysis workflow. Finally, our approach has a runtime polynomial in the number of final state objects, as opposed to super-exponential for reconstruction-based approaches, which need to consider all possible permutations; this makes ours the first tractable method for processes with high-multiplicity final states such as t\bar{t}t\bar{t}.

To realize this holistic approach to predicting top quark kinematic properties, we propose the Covariant Particle Transformer (CPT). CPT takes as input the properties of the final state objects in a collision event and outputs predictions for the top quark kinematic properties. Using a novel covariant attention mechanism, the CPT prediction is invariant under permutations of the inputs and covariant under rotations and/or boosts of the event along the beamline. CPT can recover the 76% (75%) of the top quarks produced in t\bar{t}H (t\bar{t}t\bar{t}) events that cannot be truth-matched to a jet triplet and thus cannot be fully reconstructed by conventional methods. For t\bar{t}H events, CPT achieves a resolution close to the intrinsic resolution of the jet triplet and outperforms a carefully tuned triplet-based top reconstruction method on top quarks that can be matched to a jet triplet. In addition, we demonstrate that CPT can generalize to top production processes not seen during training, though its performance degrades as the test process becomes more complex and distinct from the training process. Finally, we demonstrate that by building Lorentz covariance into CPT, it achieves higher data efficiency and outperforms the non-covariant alternative when the training set is small.

In the future, it may be possible to improve and extend CPT. CPT training uses simulation to learn to invert parton shower and hadronization (and in the future, detector effects). Training strategies that rely less on parton shower and hadronization simulations like those in Ref. [52] may be able to improve the robustness of CPT. Furthermore, as a direct regression approach, CPT is prior dependent. A variety of domain adaptation and other strategies may be able to further improve the resilience of CPT. It may also be possible to include lower-level, higher-dimensional inputs directly into CPT instead of first clustering jets.

As it uses a generic representation of collision events as sets of particles, CPT can be directly applied to predict the kinematic properties of other heavy decaying particles, such as the W, Z, and Higgs bosons, and of potential heavy particles beyond the SM. The predicted kinematics of these heavy decaying particles can be used to construct discriminating variables for searches or observables for differential cross-section measurements. The ability to predict the properties of heavy decaying particles through a holistic analysis of the collision event can enable measurements that would otherwise suffer extreme inefficiencies with conventional reconstruction methods.

Note added: While this paper was being finalized, we became aware of Ref. [53], which proposes another Lorentz equivariant architecture. In contrast to that paper, we integrate Lorentz covariance with the Transformer, a state-of-the-art neural network architecture that revolutionized many areas of machine learning applications such as natural language processing, computer vision, and protein folding. We have also considered a completely different application: namely regression instead of classification, where the Lorentz group acts non-trivially (not an identity) on the target variables.

Acknowledgements

This work is supported by the U.S. Department of Energy, Office of Science under contract DE-AC02-05CH11231. H.W.’s work is partly supported by the U.S. National Science Foundation under the Award No. 2046280.

Appendix A CPT Implementation

A.1 Attention mechanism

Attention mechanisms are a way to update a set of n feature vectors \{x_{i}\}_{i=1}^{n} given a context \{c_{j}\}_{j=1}^{m}. Learnable query, key, and value matrices \{W_{Q},W_{K},W_{V}\} are used to generate d-dimensional query, key, and value vectors \{q_{i}\}_{i=1}^{n}, \{k_{j}\}_{j=1}^{m}, and \{v_{j}\}_{j=1}^{m} via

q_{i} = W_{Q}x_{i}, (1)
k_{j} = W_{K}c_{j}, (2)
v_{j} = W_{V}c_{j}. (3)

The inner product between q_{i} and k_{j} is used to compute the attention weights \alpha_{ij} through

\alpha_{ij} = \frac{\exp(q_{i}^{\top}k_{j}/\sqrt{d})}{\sum_{j}\exp(q_{i}^{\top}k_{j}/\sqrt{d})}, (4)

where \sqrt{d} is a normalization factor. A weighted sum of the value vectors is then used to compute the update vectors \{m_{i}\}_{i=1}^{n},

m_{i} = \sum_{j}\alpha_{ij}v_{j}, (5)

which are then used to update x_{i}, for example by addition: x^{\prime}_{i}=x_{i}+m_{i}. Intuitively, the attention weight \alpha_{ij} represents how important the information contained in c_{j} is to x_{i}. When the context \{c_{j}\} is simply \{x_{i}\}, this is termed self-attention; otherwise it is cross-attention. It is common to use a slight extension of the method above, called multi-headed attention, in which H different query, key, and value matrices \{(W^{h}_{Q},W^{h}_{K},W^{h}_{V})\}_{h=1}^{H} are learned. Each head follows the above procedure to independently produce attention weights \{\alpha_{ij}^{h}\} and update vectors \{m^{h}_{i}\}. The H update vectors \{m^{h}_{i}\}_{h=1}^{H} received by each x_{i} are concatenated to produce a final update vector

m_{i} = \bigoplus_{h=1}^{H}m^{h}_{i}, (6)

which is then used to update x_{i} as before.
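A minimal numpy sketch of the single-head attention update in Eqs. (1)-(5) is given below; the matrix shapes are illustrative assumptions.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, c, W_Q, W_K, W_V):
    # Queries from the features x (shape n x d_in), keys and values from the
    # context c (shape m x d_in).  Returns the update vectors
    # m_i = sum_j alpha_ij v_j of Eq. (5).
    q = x @ W_Q.T                                    # (n, d), Eq. (1)
    k = c @ W_K.T                                    # (m, d), Eq. (2)
    v = c @ W_V.T                                    # (m, d), Eq. (3)
    d = q.shape[-1]
    alpha = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (n, m), Eq. (4)
    return alpha @ v

# For multi-headed attention, H independent (W_Q, W_K, W_V) triplets are used
# and the resulting update vectors are concatenated, as in Eq. (6).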

A.2 Particle representation

We represent each particle with a feature vector h_{i}=(x_{i},\omega_{i}), consisting of an invariant feature vector x_{i} and a covariant feature vector \omega_{i}. x_{i} is invariant under rotations and boosts along the beamline, while \omega_{i}=(y_{i},\cos(\phi_{i}),\sin(\phi_{i})) represents the flight direction of the object and is covariant. As input to the Covariant Particle Transformer, x_{i}=(p_{\mathrm{T},i},m_{i},\mathrm{id}_{i}), where \mathrm{id}_{i} is a one-hot vector indicating the particle identity. The model learns to update these feature vectors while maintaining their invariance/covariance properties through the covariant attention.

A.3 Covariant attention

To update the learned feature vectors of each object in the event, we use covariant attention, an extension of the regular attention mechanism that processes kinematic information and guarantees the covariance properties of the predictions. In general, covariant attention updates the feature vectors \{h_{i}\} of a subset of the objects in the event using the feature vectors \{h_{j}\} of a (potentially different) subset as context. First, it computes the flight direction of each context object as viewed in object i's frame, \omega_{ij}=(y_{j}-y_{i},\cos(\phi_{j}-\phi_{i}),\sin(\phi_{j}-\phi_{i})), which is invariant under longitudinal boosts and azimuthal rotations. Then it computes the d-dimensional query, key, and value vectors as follows:

\hat{x}_{i} = \mathrm{LayerNorm}(x_{i}), (7)
v_{ij} = W_{V}(\hat{x}_{j}+\mathrm{MLP}(\omega_{ij})), (8)
k_{ij} = W_{K}(\hat{x}_{j}+\mathrm{MLP}(\omega_{ij})), (9)
q_{i} = W_{Q}\hat{x}_{i}, (10)

where W_{V}, W_{K}, W_{Q} are learned matrices and \mathrm{MLP} is a multilayer perceptron. The inner products between q_{i} and k_{ij} are passed through a softmax operator to weight the value vectors. The weighted sum produces an aggregated message vector m^{x}_{i}, which is added to x_{i}:

\alpha_{ij} = \frac{\exp(q_{i}^{\top}k_{ij}/\sqrt{d})}{\sum_{j}\exp(q_{i}^{\top}k_{ij}/\sqrt{d})}, (11)
\tilde{m}^{x}_{i} = \sum_{j}\alpha_{ij}v_{ij}, (12)
m^{x}_{i} = \sigma(\mathrm{Linear}(x_{i},\tilde{m}^{x}_{i}))\odot\tilde{m}^{x}_{i}, (13)
x^{\prime}_{i} = x_{i}+m^{x}_{i}, (14)

where \sigma is the sigmoid function and \odot denotes the elementwise (Hadamard) product. Gating is applied following the Gated Attention Network [54]. A multi-headed version of covariant attention can be constructed in the same way as in regular attention and is omitted here. x^{\prime}_{i} is then passed through a feed-forward network, as in the original transformer. When it is desirable to also update the covariant feature \omega_{i}, we produce another update vector m^{\omega}_{i} from m^{x}_{i} via

\tilde{m}^{\omega}_{i} = \mathrm{MLP}(m^{x}_{i}), (15)
m^{\omega}_{i} = \sigma(\mathrm{Linear}(x_{i},\tilde{m}^{\omega}_{i}))\odot\tilde{m}^{\omega}_{i}, (16)

where m^{\omega}_{i} is a three-dimensional vector. Its first component is used as a boost with rapidity \delta y_{i}, while its last two components v_{i} are converted to a rotation matrix R(v_{i}), which is used to rotate the azimuthal angle \phi_{i}:

y^{\prime}_{i} = y_{i}+\delta y_{i}, (18)
\begin{pmatrix}\cos(\phi^{\prime}_{i})\\ \sin(\phi^{\prime}_{i})\end{pmatrix} = R(v_{i})\begin{pmatrix}\cos(\phi_{i})\\ \sin(\phi_{i})\end{pmatrix}, (19)
\omega^{\prime}_{i} = (y^{\prime}_{i},\cos(\phi^{\prime}_{i}),\sin(\phi^{\prime}_{i})), (20)

where R(v_{i}) is obtained as follows:

u_{i} = v_{i}+(1,0), (21)
w_{i} = \frac{u_{i}}{\|u_{i}\|}=(\cos(\theta_{i}),\sin(\theta_{i})), (22)
R(v_{i}) = \begin{pmatrix}\cos(\theta_{i})&-\sin(\theta_{i})\\ \sin(\theta_{i})&\cos(\theta_{i})\end{pmatrix}, (23)

where (1,0) is added to v_{i} to bias the rotation matrix toward the identity for stability. The covariance of \{\omega^{\prime}_{i}\} follows from the fact that only invariant information is used to construct its update and that, prior to the update, \{\omega_{i}\} are themselves covariant. An inductive argument establishes the end-to-end covariance of compositions of covariant attention updates. We denote the above covariant attention update as h_{i}\leftarrow\mathcal{A}^{x\omega}_{x\omega}(h_{i},\{h_{j}\}), where the subscript indicates that it makes use of both the invariant and covariant feature vectors, and the superscript indicates that it updates both. The following variants are used to build the full model:

  • x_{i}\leftarrow\mathcal{A}^{x}_{x\omega}(h_{i},\{h_{j}\}): the covariant feature vector is not updated.

  • x_{i}\leftarrow\mathcal{A}^{x}_{x}(x_{i},\{x_{j}\}): the covariant feature vector is neither updated nor used to construct the key and value vectors. This reduces to the regular attention mechanism.
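To make the mechanism concrete, the following is a minimal, single-head numpy sketch of the invariant-feature update of Eqs. (7)-(14) in the self-attention case. LayerNorm, multiple heads, the feed-forward network, and the covariant (y, \phi) update of Eqs. (15)-(23) are omitted, and W_Q, W_K, W_V, W_gate, and mlp_pair are stand-in parameters rather than the paper's exact parameterization.

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def covariant_self_attention(x, y, phi, W_Q, W_K, W_V, W_gate, mlp_pair):
    # x: (n, d) invariant features; y, phi: (n,) rapidities and azimuthal angles.
    # mlp_pair: any callable mapping the (n, n, 3) relative directions omega_ij
    # to (n, n, d) feature offsets, standing in for the MLP in Eqs. (8)-(9).
    # W_gate is assumed to have shape (d, 2d).
    n, d = x.shape
    # Pairwise relative directions omega_ij = (y_j - y_i, cos(phi_j - phi_i),
    # sin(phi_j - phi_i)); invariant under longitudinal boosts and azimuthal
    # rotations applied to the whole event.
    dy = y[None, :] - y[:, None]
    dphi = phi[None, :] - phi[:, None]
    omega = np.stack([dy, np.cos(dphi), np.sin(dphi)], axis=-1)       # (n, n, 3)

    ctx = x[None, :, :] + mlp_pair(omega)                             # (n, n, d)
    q = x @ W_Q.T                                                     # (n, d)
    k = ctx @ W_K.T                                                   # (n, n, d)
    v = ctx @ W_V.T                                                   # (n, n, d)

    scores = np.einsum("id,ijd->ij", q, k) / np.sqrt(d)
    alpha = softmax(scores)                                           # Eq. (11)
    m_tilde = np.einsum("ij,ijd->id", alpha, v)                       # Eq. (12)
    gate = sigmoid(np.concatenate([x, m_tilde], axis=-1) @ W_gate.T)  # Eq. (13)
    return x + gate * m_tilde                                         # Eq. (14)

Because the kinematics enter only through the relative directions omega_ij, the updated invariant features are unchanged when the whole event is boosted or rotated, which is the property the covariant attention is designed to enforce.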

A.4 Encoder

The encoder uses six layers of covariant attention to update the input invariant features, x^{\mathrm{in}}_{i}\leftarrow\mathcal{A}^{x}_{x\omega}(h^{\mathrm{in}}_{i},\{h^{\mathrm{in}}_{j}\}). The covariant features of the input objects, \{\omega^{\mathrm{in}}_{i}\}, are not updated.

A.5 Decoder

A.5.1 Initialization

The decoder first initializes the invariant feature vectors associated with the top quarks using the Set2Set module [55], which takes in the set \{x^{\mathrm{in}}_{i}\} and outputs \{x^{\mathrm{out}}_{i}\}, the initial invariant feature vectors of the output objects. The decoder then updates \{x^{\mathrm{out}}_{i}\} by having each output attend to the input objects using invariant features only, x^{\mathrm{out}}_{i}\leftarrow\mathcal{A}^{x}_{x}(x^{\mathrm{out}}_{i},\{x^{\mathrm{in}}_{j}\}). The attention weights \alpha_{ij} computed in this attention update are used to initialize the output covariant feature vectors:

y^{\mathrm{out}}_{i} = \sum_{j}\alpha_{ij}y^{\mathrm{in}}_{j}, (24)
\begin{pmatrix}\cos(\phi^{\mathrm{out}}_{i})\\ \sin(\phi^{\mathrm{out}}_{i})\end{pmatrix} = \frac{\sum_{j}\alpha_{ij}\begin{pmatrix}\cos(\phi^{\mathrm{in}}_{j})\\ \sin(\phi^{\mathrm{in}}_{j})\end{pmatrix}}{\left\|\sum_{j}\alpha_{ij}\begin{pmatrix}\cos(\phi^{\mathrm{in}}_{j})\\ \sin(\phi^{\mathrm{in}}_{j})\end{pmatrix}\right\|}. (25)

The covariance of y^{\mathrm{out}}_{i} follows from the fact that \sum_{j}\alpha_{ij}=1 and that \{y^{\mathrm{in}}_{j}\} transform by an overall additive constant under a boost. The covariance of \phi^{\mathrm{out}}_{i} follows from the fact that its unit-vector representation is a linear combination of the input unit vectors \begin{pmatrix}\cos(\phi^{\mathrm{in}}_{j})\\ \sin(\phi^{\mathrm{in}}_{j})\end{pmatrix}, each of which transforms linearly under a rotation.
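A minimal numpy sketch of the initialization in Eqs. (24)-(25) is given below; the array shapes are illustrative.

import numpy as np

def init_output_directions(alpha, y_in, phi_in):
    # Eqs. (24)-(25): the output rapidities are attention-weighted averages of
    # the input rapidities, and the output azimuthal unit vectors are the
    # normalized attention-weighted averages of the input unit vectors.
    # alpha: (n_out, n_in) attention weights, each row summing to one.
    y_out = alpha @ y_in
    u = alpha @ np.stack([np.cos(phi_in), np.sin(phi_in)], axis=-1)  # (n_out, 2)
    u = u / np.linalg.norm(u, axis=-1, keepdims=True)
    phi_out = np.arctan2(u[:, 1], u[:, 0])
    return y_out, phi_out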

A.5.2 Interleaved covariant cross- and self-attention

After initialization, the decoder consists of L_{\mathrm{out}}=6 decoder blocks. In each block, the output invariant and covariant feature vectors are updated using two covariant attention layers:

h^{\mathrm{out}}_{i} \leftarrow \mathcal{A}^{x\omega}_{x\omega}(h^{\mathrm{out}}_{i},\{h^{\mathrm{out}}_{j}\}) \quad \forall i, (26)
h^{\mathrm{out}}_{i} \leftarrow \mathcal{A}^{x\omega}_{x\omega}(h^{\mathrm{out}}_{i},\{h^{\mathrm{in}}_{j}\}) \quad \forall i. (27)

After each decoder block, indexed by \ell\in\{1,...,L_{\mathrm{out}}\}, an intermediate set of predictions \{p^{\ell}_{i}\}_{i} for the top quark four-momenta is constructed as follows:

(p^{\ell}_{T_{i}}/\mathrm{GeV}, y^{\ell}_{i}, \phi^{\ell}_{i}, m^{\ell}_{i}/\mathrm{GeV}) = (100\,(x^{\ell}_{i})_{0}, y_{i}, \phi_{i}, 5\,(x^{\ell}_{i})_{1}+173), (28)

where (x^{\ell}_{i})_{0} and (x^{\ell}_{i})_{1} denote the first and second entries of the invariant feature vector associated with each top quark at the \ell-th block. The shift and scaling keep the feature values small and centered, which facilitates training.

A.6 Loss function and optimization details

For each event, the main component of the loss function is the L_{2} norm of the difference between the model prediction and the ground truth for the top quark four-momenta in (p_{x}/100\,\mathrm{GeV}, p_{y}/100\,\mathrm{GeV}, y, m/5\,\mathrm{GeV}) coordinates, averaged over the N top quarks present in the event:

\mathcal{L}_{\mathrm{final}}=\frac{1}{N}\sum_{i=1}^{N}\|p_{i}-p^{*}_{i}\|, (29)

where \{p_{i}\} are the model predictions at the final decoder block and \{p^{*}_{i}\} are the ground truths. We chose this set of coordinates so that each component of the four-momenta has a standard deviation of O(1), encouraging the model to pay equal attention to each of them. The N predictions from the model are matched to the N ground truths through a permutation \pi^{*} that minimizes the average \Delta R between the matched pairs:

\pi^{*}=\operatorname*{arg\,min}_{\pi:\,\mathrm{permutations}}\frac{1}{N}\sum_{i=1}^{N}\sqrt{(y_{\pi(i)}-y^{*}_{i})^{2}+(\phi_{\pi(i)}-\phi^{*}_{i})^{2}}. (30)

We add two auxiliary losses, \mathcal{L}_{\mathrm{intermediate}} and \mathcal{L}_{\mathrm{unit\text{-}norm}}, to stabilize training models with many layers. The intermediate loss \mathcal{L}_{\mathrm{intermediate}} measures the prediction errors at the earlier decoder blocks,

\mathcal{L}_{\mathrm{intermediate}}=\frac{1}{L_{\mathrm{out}}-1}\sum_{\ell=1}^{L_{\mathrm{out}}-1}\left(\frac{1}{N}\sum_{i=1}^{N}\|p^{\ell}_{i}-p^{*}_{i}\|\right), (31)

where \{p^{\ell}_{i}\}_{i=1}^{N} are the intermediate predictions at the \ell-th decoder block. The unit-norm loss \mathcal{L}_{\mathrm{unit\text{-}norm}} encourages the vectors u_{i} to have unit norm before being normalized and converted to rotation matrices in Equation (21) in each decoder block:

\mathcal{L}_{\mathrm{unit\text{-}norm}}=\frac{1}{L_{\mathrm{out}}}\sum_{\ell=1}^{L_{\mathrm{out}}}\left(\frac{1}{N}\sum_{i=1}^{N}\left|\,\|u^{\ell}_{i}\|-1\,\right|\right). (32)

The two auxiliary losses are inspired by similar auxiliary losses in AlphaFold2 [25]. The total loss is a weighted combination of the three terms,

\mathcal{L}_{\mathrm{total}}=\lambda_{1}\mathcal{L}_{\mathrm{final}}+\lambda_{2}\mathcal{L}_{\mathrm{intermediate}}+\lambda_{3}\mathcal{L}_{\mathrm{unit\text{-}norm}}. (33)

We use \lambda_{1}=\lambda_{2}=1 and \lambda_{3}=0.02. All models used to report our results are trained using the Lamb optimizer [56] with a batch size of 256 and a learning rate of 10^{-4} for 30 epochs, followed by 10^{-5} for another 10 epochs. A weight decay [57] of 0.01 is applied. The model from the epoch achieving the minimum validation loss is used for the final evaluation. This training protocol is sufficient to saturate the validation performance for all variants of the model and for the datasets of the various processes and sizes used to present our results.
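For illustration, a minimal numpy sketch of the permutation matching of Eq. (30) and the main loss term of Eq. (29) is given below (a brute-force search over permutations, sufficient for the small N considered here); the array layouts are assumptions for the example.

import numpy as np
from itertools import permutations

def match_predictions(y_pred, phi_pred, y_true, phi_true):
    # Eq. (30): assign the N predicted tops to the N truth tops by minimizing
    # the average Delta R.  As in the equation, the azimuthal difference is
    # used without the 2*pi wrap.
    n = len(y_true)
    best_cost, best_perm = np.inf, None
    for perm in permutations(range(n)):
        idx = list(perm)
        cost = np.mean(np.sqrt((y_pred[idx] - y_true) ** 2 +
                               (phi_pred[idx] - phi_true) ** 2))
        if cost < best_cost:
            best_cost, best_perm = cost, idx
    return best_perm

def final_loss(p_pred, p_true):
    # Eq. (29): mean L2 distance between predicted and true four-momenta in the
    # scaled (p_x/100 GeV, p_y/100 GeV, y, m/5 GeV) coordinates, averaged over
    # the tops in the event; inputs are (N, 4) arrays, already matched.
    return np.mean(np.linalg.norm(p_pred - p_true, axis=-1))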

Appendix B Baseline

We train a neural network to identify triplets of jets that originate from top decays. This task can be formulated as a link prediction problem on a graph, where the nodes are the detected jets in an event and any two jets that belong to the same triplet are connected by a link. Specifically, every event is represented by a fully connected graph using the four-momenta and particle types as node features, and a graph neural network (GNN) predicts a probability p_{ij}\in(0,1) that a link exists between jet i and jet j for every pair of jets in the event. The particular architecture we use is the Interaction Network [58], followed by an MLP applied per edge to output the per-edge probabilities. The GNN is trained to minimize the cross-entropy loss so that p_{ij} is encouraged to be 1 if the jets belong to the same triplet and 0 otherwise. It uses the same training, validation, and test sets as CPT. We tuned the hyperparameters to maximize validation accuracy and settled on four Interaction Network blocks, two layers and 128 hidden units for all MLPs, the Adam optimizer, and a learning rate of 0.001. At test time, we sort all possible links (i,j) in decreasing order of p_{ij} and sequentially form one or two predicted triplets, depending on the number of available jets in the event. Each predicted top four-vector is the system four-vector of the predicted triplet. The predicted tops are \Delta R matched to the true tops following the same procedure used for CPT, defined in Equation (30). We note that this method provides a strong baseline, as it uses a neural network architecture that has demonstrated state-of-the-art performance in reasoning about objects and relations in a wide range of complex problems, such as N-body dynamics and estimating physical quantities [58].
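As an illustration of how predicted triplets can be formed from the per-edge probabilities, the following sketch scores each jet triplet by the sum of its three link probabilities and greedily selects non-overlapping triplets. The paper describes sorting individual links and forming triplets sequentially; this greedy triplet scoring is one simple way to realize that idea, not necessarily the exact procedure used.

from itertools import combinations

def predict_triplets(edge_probs, n_jets, n_tops):
    # edge_probs maps unordered jet-index pairs (i, j) with i < j to p_ij.
    def p(a, b):
        return edge_probs[(min(a, b), max(a, b))]

    scored = sorted(((p(a, b) + p(a, c) + p(b, c), (a, b, c))
                     for a, b, c in combinations(range(n_jets), 3)),
                    reverse=True)
    triplets, used = [], set()
    for _, trip in scored:
        if used.isdisjoint(trip):
            triplets.append(trip)
            used.update(trip)
        if len(triplets) == n_tops:
            break
    return triplets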

References