
Optimizing measurement-based cooling by reinforcement learning

Jia-shun Yan, School of Physics, Zhejiang University, Hangzhou 310027, Zhejiang, China
Jun Jing, School of Physics, Zhejiang University, Hangzhou 310027, Zhejiang, China
Abstract

Conditional cooling-by-measurement holds a significant advantage over its unconditional (nonselective) counterpart in the average-population-reduction rate. However, its success probability, i.e., the probability of finding the detector in the measured state, is limited. In this work, we propose an optimized architecture to cool down a target resonator, initialized in a thermal state, using an interpolation of conditional and unconditional measurement strategies. An optimal measurement interval $\tau_{\rm opt}^{u}$ for unconditional measurement is analytically derived for the first time; it is inversely proportional to the collective dominant Rabi frequency $\Omega_{d}$, which is a function of the resonator's population at the end of the last round. A cooling algorithm globally optimized by reinforcement learning maximizes the cooperative cooling performance, an indicator that measures the comprehensive cooling efficiency of an arbitrary cooling-by-measurement architecture. In particular, only $16$ rounds of measurements reduce the average population of the target resonator by four orders of magnitude with a success probability of about $30\%$.

I Introduction

Cooling mesoscopic and microscopic resonators down to their minimum-energy state is fundamental to observing the classical-quantum transition and to exploiting quantum advantage in nanoscience Milburn and Woolley (2008); Aspelmeyer et al. (2014). Ground-state preparation is a crucial and implicit step in quantum information processes, including but not limited to continuous-variable quantum computation Lloyd and Braunstein (1999); You and Nori (2011); Toyoda et al. (2015); Um et al. (2016), ultrahigh-precision measurement Bocko and Onofrio (1996); Caves et al. (1980), and the construction of quantum interfaces Sharma et al. (2018). Various strategies have been designed to reach an effective temperature as low as possible in trapped atom and ion systems Wilson-Rae et al. (2007); Gigan et al. (2006); Wang et al. (2011). In atomic laser cooling, popular strategies include laser Doppler cooling Sharma et al. (2018); Zhang et al. (2013); Epstein et al. (1995), resolved-sideband cooling, and electromagnetically induced transparency (EIT) cooling Morigi et al. (2000); Roos et al. (2000).

Beyond the paradigms that extract the system energy through dissipative channels based on the blue-shifted (anti-Stokes) sidebands, a versatile approach to cooling the mechanical states of motion is provided by the interaction with electromagnetic radiation or by quantum measurement. Back-action-evading measurement techniques that can surpass the standard quantum limit have attracted enormous interest. Through pulsed measurement processes in optomechanics Vanner et al. (2011, 2013); Bennett et al. (2016); Rossi et al. (2018); Brunelli et al. (2020), they can dramatically change the mechanical thermal occupation with no initial cooling. A genuine quantum mechanical cooling engine has been proposed Buffoni et al. (2019), whereby the fuel is the energy exchanged with an apparatus performing invasive quantum measurements.

Among these measurement-based techniques, quantum state engineering based on measurements of ancillary systems has been proposed in theory Nakazato et al. (2003); Li et al. (2011) and demonstrated in experiment Xu et al. (2014). Rather than directly detecting the target system, a net nonunitary propagator is realized by inserting projective measurements on the ground state of the detector system between the joint unitary-evolution segments of target and detector. The induced postselection of the ground state of the target system (typically modelled as a resonator) reduces its high-energy distribution in the ensemble. In other words, the resonator is gradually steered by the outcomes of the conditional measurement (CM) to its ground state by dynamically filtering out its vibrational modes. Existing CM proposals range from cooling nonlinear mechanical resonators Puebla et al. (2020) and cooling by one-shot measurement Pyshkin et al. (2016) to expanding the cooling range by an external driving Yan and Jing (2021) and accelerating the cooling rate by optimized measurement intervals Yan and Jing (2022). An unexplored weakness of all these CM strategies is their limited success probability inherited from the projective operation: the experimental overhead unavoidably rises, because more samples are needed in the ensemble. In sharp contrast to CM, the unconditional measurement (UM) strategy performs a nonselective and impulsive measurement in all the bases of the bare Hamiltonian of the detector at the end of each round of the joint evolution Zhang et al. (2019a); Harel and Kurizki (1996). It realizes cooling with unit success probability but suffers from a much slower cooling rate than CM, so that many more measurements are needed to reach the ground state. Balancing the cooling rate against the success probability by interpolating conditional and unconditional measurements constitutes an optimization problem.

The integration of a small-scale quantum circuit with a classical optimizer, e.g., a neural network, provides a paradigm for designing sequences of parametrized quantum operations that are well suited to implementing robust and high-fidelity algorithms. Many reinforcement learning (RL) algorithms built on neural networks, which have demonstrated remarkable capabilities in board and video games Silver et al. (2016, 2017, 2018); Mnih et al. (2015), have attracted wide and timely interest in quantum physics Carleo et al. (2019), for example in quantum error correction Convy et al. (2022); Fösel et al. (2018), quantum simulation Bolens and Heyl (2021); Yuan et al. (2021), and quantum state preparation Guo et al. (2021); Bukov et al. (2018); Zhang et al. (2019b). The proximal policy optimization (PPO) algorithm, a typical RL algorithm with favorable sample complexity, scalability, and robustness to hyperparameters, has proven to be a fruitful tool in quantum optimal control Sivak et al. (2022); Kim and Jeong (2021); Yao et al. (2021).

In this work, we propose a measurement-based cooling architecture as a hybrid sequence of UM and CM strategies. It involves a double optimization: at each step along the sequence, either UM or CM is considerably improved by using a locally optimized measurement interval; and for the global efficiency of the sequence, its arrangement is separately optimized through reinforcement learning. In particular, in a typical measurement-based cooling model, i.e., the Jaynes-Cummings (JC) model, where a mechanical resonator (the target system) is coupled to a qubit (the detector system), conditional and unconditional measurements are interleaved to cool the resonator down to its ground state. A feedback scheme is triggered upon calling a CM to determine, according to the measurement outcome, whether or not to launch the next round of evolution and measurement. Analogous to the optimized measurement interval obtained for CM Yan and Jing (2022), we analytically derive an optimized interval for UM. The free-evolution intervals between any neighboring measurements, either UM or CM, can then be optimized for cooling. The global sequence of measurements, i.e., the order in which UM and CM are implemented, is further optimized with reinforcement learning. The optimizer is fed with the cooperative cooling performance, a function of the average population of the resonator, the success probability of the detector in the measured subspace, and the fidelity of the resonator in the ground state. Eventually we find an optimal sequence that holds an overwhelming advantage over all the others.

The rest of this work is structured as follows. We briefly revisit the general framework for the cooling protocols based on conditional and unconditional measurements in Secs. II.1 and II.2, respectively. In Sec. II.2, an analytical expression of the optimized measurement-interval is obtained for UM. In Sec. III, we introduce the interpolation diagram for the cooling architecture based on these two measurements, define the cooperative cooling performance to comprehensively quantify various strategies, and present the optimized result through reinforcement learning. The PPO algorithm and the optimal-control procedure are provided in Appendixes A and B, respectively. The whole work is discussed and summarized in Sec. IV.

II Conditional and unconditional measurements

II.1 Conditional Measurement

Consider a JC model used for cooling-by-measurement protocols, whose Hamiltonian in the rotating frame with respect to $H_{0}=\omega_{a}(|e\rangle\langle e|+a^{\dagger}a)$ reads

$$H=\Delta|e\rangle\langle e|+g(a^{\dagger}\sigma_{-}+a\sigma_{+}). \tag{1}$$

Here $\Delta\equiv\omega_{e}-\omega_{a}$ is the detuning between the level spacing $\omega_{e}$ of the atomic detector and the frequency $\omega_{a}$ of the target resonator, with $|\Delta|\ll\omega_{e},\omega_{a}$, and $g$ is the coupling strength between the detector (qubit) and the target resonator. The Pauli operators $\sigma_{-}$ and $\sigma_{+}$ denote the transition operators of the qubit, and $a$ ($a^{\dagger}$) represents the annihilation (creation) operator of the resonator.

The conditional measurement-based cooling is described by a sequence of piecewise joint evolutions of the resonator and the detector, which are interrupted by instantaneous projective measurements on a particular subspace of the detector. Initially, the resonator is in a thermal-equilibrium state $\rho_{a}^{\rm th}$ with a finite temperature $T$, and the detector qubit starts from the ground state. The overall initial state then has the form $\rho_{\rm tot}(0)=|g\rangle\langle g|\otimes\rho_{a}^{\rm th}$. To cool down the resonator, a conditional or selective measurement $M_{g}=|g\rangle\langle g|$ is implemented on the detector after a free evolution of duration $\tau$, when the overall state has become $\rho_{\rm tot}(\tau)=\exp(-iH\tau)\rho_{\rm tot}(0)\exp(iH\tau)$. The conditional measurement then yields a probabilistic result:

$$\rho_{a}(\tau)=\frac{\langle g|\rho_{\rm tot}(\tau)|g\rangle}{{\rm Tr}\left[\langle g|\rho_{\rm tot}(\tau)|g\rangle\right]}. \tag{2}$$

Based on the time dependence of the interval $\tau$, conditional cooling protocols can be categorized into equal-time-spacing and unequal-time-spacing strategies Li et al. (2011); Yan and Jing (2022). The unequal-time-spacing strategy has demonstrated a dramatic cooling efficiency by setting the measurement interval as the inverse of the time-evolved thermal Rabi frequency, $\tau_{\rm opt}^{c}(t)=1/\Omega_{\rm th}(t)$, where $\Omega_{\rm th}(t)\equiv g\sqrt{\bar{n}(t)}=g\sqrt{\sum_{n}np_{n}(t)}$ with $p_{n}(t)$ denoting the current population of the resonator on the Fock state $|n\rangle$. To optimize the cooling performance, our cooling architecture in this work employs the unequal-time-spacing strategy. After $N$ rounds of free evolution and instantaneous measurement described by an ordered time sequence $\{\tau_{1}(t_{1}),\tau_{2}(t_{2}),\cdots,\tau_{N}(t_{N})\}$ with $t_{i>1}=\sum_{j=1}^{i-1}\tau_{j}$ and $\tau_{1}\equiv 1/[g\sqrt{{\rm Tr}(\hat{n}\rho_{a}^{\rm th})}]$, the resonator state becomes

$$\rho_{a}\left(t=\sum_{i=1}^{N}\tau_{i}\right)=\frac{\sum_{n}\prod_{i=1}^{N}|\alpha_{n}(\tau_{i})|^{2}p_{n}|n\rangle\langle n|}{P_{g}(N)}, \tag{3}$$

where

$$p_{n}=\frac{e^{-n\hbar\omega_{a}/k_{B}T}}{Z},\quad Z\equiv\frac{1}{1-e^{-\hbar\omega_{a}/k_{B}T}} \tag{4}$$

is the initial population,

$$P_{g}(N)=\sum_{n}\prod_{i=1}^{N}|\alpha_{n}(\tau_{i})|^{2}p_{n} \tag{5}$$

is the survival or success probability of CM, and

$$\left|\alpha_{n}(\tau_{i})\right|^{2}=\frac{\Omega_{n}^{2}-g^{2}n\sin^{2}(\Omega_{n}\tau_{i})}{\Omega_{n}^{2}} \tag{6}$$

is the cooling coefficient with $\Omega_{n}=\sqrt{g^{2}n+\Delta^{2}/4}$ denoting the $n$-photon Rabi frequency. The cooling coefficient in Eq. (3) determines the average population

$$\bar{n}(t)={\rm Tr}\left[\hat{n}\rho_{a}(t)\right],\quad\hat{n}\equiv a^{\dagger}a, \tag{7}$$

by reshaping the population distribution over all the Fock states. Note that in Eq. (6) the cooling coefficient for $|0\rangle$ is unity, $|\alpha_{0}(\tau_{i})|^{2}=1$, meaning that the ground-state population is always protected during the cooling process. The populations on the excited Fock states are gradually reduced by $|\alpha_{n}(\tau_{i})|^{N}<1$ with increasing $N$ unless $\sin(\Omega_{n}\tau_{i})=0$, i.e., $\Omega_{n}\tau_{i}=j\pi$ with integer $j$.
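The filtering described by Eqs. (3)-(6) can be sketched numerically. The following Python snippet is a minimal illustration rather than the authors' code: the helper names (`thermal_populations`, `alpha_sq`, `mean_population`, `conditional_round`), the Fock-space cutoff `nmax`, and the choice of units ($\omega_{a}=1$, so that $g$ and $\Delta$ are given in units of $\omega_{a}$ and $\tau$ in units of $1/\omega_{a}$) are assumptions made here for illustration.

```python
import math

def thermal_populations(nbar_th, nmax=400):
    """Thermal populations p_n of Eq. (4), truncated at nmax levels and renormalized."""
    r = nbar_th / (nbar_th + 1.0)            # r = exp(-hbar*omega_a / k_B T)
    p = [r**n for n in range(nmax)]
    z = sum(p)
    return [pn / z for pn in p]

def alpha_sq(n, tau, g, delta):
    """Cooling coefficient |alpha_n(tau)|^2 of Eq. (6)."""
    if n == 0:
        return 1.0                           # the ground state is always protected
    omega_n = math.sqrt(g * g * n + delta * delta / 4.0)
    return 1.0 - g * g * n * math.sin(omega_n * tau) ** 2 / omega_n**2

def mean_population(p):
    """Average population nbar = sum_n n p_n, Eq. (7) for a diagonal state."""
    return sum(n * pn for n, pn in enumerate(p))

def conditional_round(p, g, delta):
    """Free evolution for tau = 1/(g*sqrt(nbar)), then a projection onto |g>."""
    tau = 1.0 / (g * math.sqrt(mean_population(p)))
    q = [alpha_sq(n, tau, g, delta) * pn for n, pn in enumerate(p)]
    survival = sum(q)                        # probability that this projection succeeds
    return [qn / survival for qn in q], survival

# Parameters quoted in the text: g = 0.04, Delta = 0.01 (units of omega_a), nbar_th = 8.85.
p0 = thermal_populations(8.85)
p, P_g = list(p0), 1.0
for _ in range(16):
    p, s = conditional_round(p, g=0.04, delta=0.01)
    P_g *= s                                 # accumulated success probability, Eq. (5)
print(mean_population(p), p[0], P_g)
```

Multiplying the per-round survival probabilities reproduces $P_{g}(N)$ of Eq. (5), because each round renormalizes the populations before the next filter is applied.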

II.2 Unconditional Measurement

Unconditional-measurement cooling is a statistical mixture of the conditional-measurement counterpart, obtained by expanding the measurement subspace to the whole space of the detector system. After a period of joint unitary evolution under the Hamiltonian (1), the overall state can be written as

$$\rho_{\rm tot}(\tau)=\bigoplus_{n}p_{n}\begin{pmatrix}|\alpha_{n}(\tau)|^{2}&\chi_{n}(\tau)\\ \chi^{*}_{n}(\tau)&|\beta_{n}(\tau)|^{2}\end{pmatrix}, \tag{8}$$

where

$$\begin{aligned}\chi_{n}(\tau)&\equiv\frac{-g\sqrt{n}\left[\Delta\sin^{2}(\Omega_{n}\tau)-i\Omega_{n}\sin(2\Omega_{n}\tau)\right]}{2\Omega_{n}^{2}},\\ |\beta_{n}(\tau)|^{2}&\equiv\frac{g^{2}n\sin^{2}(\Omega_{n}\tau)}{\Omega_{n}^{2}}.\end{aligned}$$

UM can be implemented by tracing out the degrees of freedom of the detector, ${\rm Tr}_{d}[\rho_{\rm tot}(\tau)]$. Then the resonator state reads

$$\rho_{a}(\tau)=\sum_{n\geq 0}\left[|\alpha_{n}(\tau)|^{2}p_{n}+|\beta_{n+1}(\tau)|^{2}p_{n+1}\right]|n\rangle\langle n|. \tag{9}$$

Thus, after a nonselective measurement, i.e., a measurement without recording the result, a population transfer in the target resonator occurs as

$$p_{n}\rightarrow|\alpha_{n}(\tau)|^{2}p_{n}+|\beta_{n+1}(\tau)|^{2}p_{n+1}. \tag{10}$$

In contrast to the CM strategy, which is characterized by the single cooling coefficient $|\alpha_{n}|^{2}$ in Eq. (6), the UM strategy depends subtly on an extra coefficient $|\beta_{n}|^{2}$. According to Eq. (10), the initial population on the ground state $p_{0}$ becomes $|\alpha_{0}(\tau)|^{2}p_{0}+|\beta_{1}(\tau)|^{2}p_{1}=p_{0}+|\beta_{1}(\tau)|^{2}p_{1}$, indicating that a part of the population on the first excited state is transferred to the ground state. Under rounds of nonselective measurements, it is intuitive to expect that the populations on the higher states of the resonator keep moving to the lower states and eventually to the ground state. In practice, however, the cooling is constrained and can even be reversed for individual levels, since the population on a certain excited state is fixed or enhanced when $|\alpha_{n}(\tau)|^{2}=1$ and $|\beta_{n+1}(\tau)|^{2}\geq 0$, i.e., when $\Omega_{n}\tau=j\pi$ with integer $j$ [see Eq. (6)]. This problem can be addressed by employing the unequal-time-spacing strategy: a time-varying $\tau$ can ensure that the populations on all excited states are gradually reduced.
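The population map of Eq. (10) is simple enough to check numerically. The sketch below is illustrative only: the function names, the cutoff of 300 Fock levels, the interval `tau=5.0`, and the units ($g$, $\Delta$ in units of $\omega_{a}$, $\tau$ in units of $1/\omega_{a}$) are assumptions made here, and $|\alpha_{n}|^{2}=1-|\beta_{n}|^{2}$ is used as implied by Eq. (6).

```python
import math

def beta_sq(n, tau, g, delta):
    """Transfer coefficient |beta_n(tau)|^2 = 1 - |alpha_n(tau)|^2."""
    if n == 0:
        return 0.0
    omega_n = math.sqrt(g * g * n + delta * delta / 4.0)
    return g * g * n * math.sin(omega_n * tau) ** 2 / omega_n**2

def unconditional_round(p, tau, g, delta):
    """Apply the population map of Eq. (10) to a truncated distribution p."""
    nmax = len(p)
    b = [beta_sq(n, tau, g, delta) for n in range(nmax)]
    new_p = []
    for n in range(nmax):
        stay = (1.0 - b[n]) * p[n]                           # |alpha_n|^2 p_n
        gain = b[n + 1] * p[n + 1] if n + 1 < nmax else 0.0  # |beta_{n+1}|^2 p_{n+1}
        new_p.append(stay + gain)
    return new_p

# Thermal state with nbar_th = 8.85; tau = 5.0 is an arbitrary illustrative interval.
r = 8.85 / 9.85
p_old = [(1.0 - r) * r**n for n in range(300)]
p_new = unconditional_round(p_old, tau=5.0, g=0.04, delta=0.01)
nbar_old = sum(n * pn for n, pn in enumerate(p_old))
nbar_new = sum(n * pn for n, pn in enumerate(p_new))
print(nbar_old, nbar_new, p_new[0] - p_old[0])
```

Each transfer moves population from $|n+1\rangle$ to $|n\rangle$, so the map conserves the total probability and lowers the average population by $\sum_{n}|\beta_{n}(\tau)|^{2}p_{n}$, even though the population of an individual excited level may stay fixed or grow.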

Figure 1: Average population of the resonator after a single unconditional measurement as a function of the measurement interval $\tau$ under various initial temperatures: (a) $T=0.01$ K, (b) $T=0.1$ K, (c) $T=1.0$ K, and (d) $T=10$ K. The vertical black-dashed lines indicate the analytical results for the optimized intervals given by Eq. (14). The parameters for the blue-solid curves are set as $g=0.04\omega_{a}$ and $\Delta=0.01\omega_{a}$.

The cooling efficiency of the UM strategy depends sensitively on the interval $\tau$ between neighboring measurements, analogous to that of CM Yan and Jing (2022). This can be observed in Fig. 1 through the average population $\bar{n}$ of the resonator after one measurement on the detector. The $\tau$-dependence of $\bar{n}$ shows similar patterns across four orders of magnitude in initial temperature. The average population declines gradually to a minimum (the relative reduction becomes smaller with increasing temperature) at an optimized measurement interval $\tau_{\rm opt}^{u}$, then rebounds quickly and ends up fluctuating irregularly around a value slightly lower than its initial thermal occupation $\bar{n}_{\rm th}\equiv{\rm Tr}(\hat{n}\rho_{a}^{\rm th})$.

To make full use of the cooling strategy, it is desirable to find the optimized interval $\tau_{\rm opt}^{u}$ analytically as a function of the current state and the model parameters. By virtue of Eq. (9) and under the resonant condition, the average population after a single unconditional measurement reads

$$\begin{aligned}\bar{n}&=\sum_{n\geq 0}n\left(p_{n}\cos^{2}\Omega_{n}\tau+p_{n+1}\sin^{2}\Omega_{n+1}\tau\right)\\ &=\eta+\frac{1}{2Z}\sum_{n\geq 0}ne^{-nx}\left(\cos 2\Omega_{n}\tau-e^{-x}\cos 2\Omega_{n+1}\tau\right),\end{aligned} \tag{11}$$

where $\eta\equiv(\bar{n}_{\rm th}+2\bar{n}^{2}_{\rm th})/(2+2\bar{n}_{\rm th})$ and $x\equiv\hbar\omega_{a}/k_{B}T$. Since the weight function $ne^{-nx}$ in Eq. (11) is dominant around $n_{d}\equiv k_{B}T/\hbar\omega_{a}=1/x$, the variables $\Omega_{n}$ and $\Omega_{n+1}$ can be expanded around $n=n_{d}$. To the first order of $n-n_{d}$, we have

$$\begin{aligned}&\cos 2\Omega_{n}\tau-e^{-x}\cos 2\Omega_{n+1}\tau\\ &\approx\cos 2\Omega_{d}\tau-e^{-x}\cos 2\Omega_{d+1}\tau+(n-n_{d})\\ &\quad\times\left(-\frac{\Omega_{d}\tau\sin 2\Omega_{d}\tau}{n_{d}}+e^{-x}\frac{\Omega_{d+1}\tau\sin 2\Omega_{d+1}\tau}{n_{d}+1}\right),\end{aligned}$$

where

$$\Omega_{d}\equiv g\sqrt{n_{d}},\quad\Omega_{d+1}\equiv g\sqrt{n_{d}+1} \tag{12}$$

define the dominant Rabi frequencies. Under the approximations that are appropriate for a moderate temperature, $e^{-x}=\bar{n}_{\rm th}/(\bar{n}_{\rm th}+1)\approx 1$ and $\Omega_{d+1}/(n_{d}+1)\approx\Omega_{d}/n_{d}$, the average population in Eq. (11) can be expressed by

$$\bar{n}\approx\eta+\sin\Omega_{-}\tau\left(\bar{n}_{\rm th}\sin\Omega_{+}\tau+\eta^{\prime}\Omega_{d}\tau\cos\Omega_{+}\tau\right), \tag{13}$$

where $\Omega_{\pm}\equiv\Omega_{d+1}\pm\Omega_{d}$ and $\eta^{\prime}\equiv\bar{n}_{\rm th}(1+2\bar{n}_{\rm th}-n_{d})/n_{d}$. Note that we have used the series formulas $\sum_{n=0}^{\infty}ne^{-nx}=e^{x}/(e^{x}-1)^{2}$ and $\sum_{n=0}^{\infty}n^{2}e^{-nx}=e^{x}(1+e^{x})/(e^{x}-1)^{3}$. Within a moderate time step $\tau$, Eq. (13) depends predominantly on the high-frequency terms characterized by $\Omega_{+}$. In the regime of $T\sim 0.1$ to $10$ K, the term weighted by $\eta^{\prime}\Omega_{d}\tau$ overwhelms that weighted by $\bar{n}_{\rm th}$. As evidenced by Fig. 1, this dominance grows as the initial or effective temperature of the resonator becomes lower and $\tau_{\rm opt}^{u}$ becomes larger. We can therefore focus on the last term in Eq. (13) to minimize $\bar{n}$. Subsequently, $\cos\Omega_{+}\tau=-1$ yields

$$\tau_{\rm opt}^{u}=\frac{\pi}{\Omega_{d}+\Omega_{d+1}}. \tag{14}$$

This result can be extended to the near-resonant situation by modifying the definition of $\Omega_{d}$ in Eq. (12) to $\sqrt{g^{2}n_{d}+\Delta^{2}/4}$. The vertical black-dashed lines in Fig. 1 denote the measurement intervals optimized by Eq. (14). The analytical expression estimates the minimum of the average population well over a wide range of temperatures. As demonstrated by both the analytical and the numerical results, a shorter measurement interval is demanded to cool down a higher-temperature resonator. In JC-like models, coupling a qubit to a high-temperature resonator induces a faster transition between the ground state and the excited state of the qubit. Although a quick measurement would interrupt this process, an inappropriate time interval would have a negative effect on cooling Zhang et al. (2019a).
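As a rough numerical companion to Eq. (14), the sketch below evaluates $\tau_{\rm opt}^{u}$ for the four temperatures of Fig. 1 and the average population left after a single UM with that interval. It is a sketch under stated assumptions: $\omega_{a}=1.4$ GHz is treated as an angular frequency, the near-resonant form $\sqrt{g^{2}n+\Delta^{2}/4}$ is applied to both $\Omega_{d}$ and $\Omega_{d+1}$, times are in units of $1/\omega_{a}$, and all function names and the Fock-space cutoff are hypothetical.

```python
import math

HBAR = 1.054571817e-34   # reduced Planck constant, J s
K_B = 1.380649e-23       # Boltzmann constant, J / K
OMEGA_A = 1.4e9          # resonator frequency, assumed to be an angular frequency (rad/s)

def thermal_nbar(temperature):
    """Thermal occupation 1 / (exp(x) - 1) with x = hbar*omega_a / (k_B T)."""
    return 1.0 / math.expm1(HBAR * OMEGA_A / (K_B * temperature))

def tau_opt_u(temperature, g, delta):
    """Optimized UM interval of Eq. (14), in units of 1/omega_a."""
    n_d = K_B * temperature / (HBAR * OMEGA_A)          # dominant Fock number n_d = 1/x
    omega_d = math.sqrt(g * g * n_d + delta * delta / 4.0)
    omega_d1 = math.sqrt(g * g * (n_d + 1.0) + delta * delta / 4.0)
    return math.pi / (omega_d + omega_d1)

def nbar_after_single_um(nbar_th, tau, g, delta):
    """Mean population after one UM: nbar_th - sum_n |beta_n|^2 p_n, from Eq. (10)."""
    r = nbar_th / (nbar_th + 1.0)
    nmax = int(40 * (nbar_th + 1.0)) + 10               # cutoff where r**n is negligible
    removed, pn = 0.0, 1.0 - r
    for n in range(1, nmax):
        pn *= r                                         # p_n = (1 - r) r**n
        omega_n = math.sqrt(g * g * n + delta * delta / 4.0)
        removed += pn * g * g * n * math.sin(omega_n * tau) ** 2 / omega_n**2
    return nbar_th - removed

results = []
for T in (0.01, 0.1, 1.0, 10.0):
    tau = tau_opt_u(T, g=0.04, delta=0.01)
    nth = thermal_nbar(T)
    results.append((T, tau, nth, nbar_after_single_um(nth, tau, 0.04, 0.01)))
for row in results:
    print(row)
```

Because $n_{d}$ grows linearly with $T$, the interval returned by `tau_opt_u` shrinks as the temperature rises, in line with the trend described above. Scanning `nbar_after_single_um` over a grid of `tau` values is one way to compare Eq. (14) with the numerical minimum.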

Similar to the optimized interval $\tau_{\rm opt}^{c}(t)$ for the conditional-measurement strategy Yan and Jing (2022), $\tau_{\rm opt}^{u}$ can also be updated by substituting the time-varying $\Omega_{d}$ and $\Omega_{d+1}$ into Eq. (14). The dominant Fock-state number $n_{d}$ determining $\Omega_{d}$ in Eq. (12) can be understood as a function of the effective temperature during the cooling procedure, which depends only on $\bar{n}(t)$ or $p_{n}(t)$.

III Measurement optimization

A thermal resonator can be steadily yet slowly cooled down by the unconditional measurement strategy equipped with the optimized measurement interval in Eq. (14). This strategy succeeds with unit probability, since there is no postselection on the measurement outcome. In sharp contrast, the conditional measurement strategy is a more efficient cooling protocol but has a poor success probability. It is therefore desirable to find an optimized sequence of measurements, a hybrid of UM and CM, that performs well when both cooling efficiency and experimental overhead are taken into account. In this section, we present an algorithm that employs reinforcement learning to generate the optimized control sequence indicating when and which measurement is performed.

The performance of any cooling-by-measurement strategy can be characterized by the cooling ratio $\bar{n}(t)/\bar{n}_{\rm th}$, the success probability $P_{g}$ of the detector in the measured subspace, and the fidelity of the resonator in its ground state $F=\langle n=0|\rho_{a}(t)|n=0\rangle$ Li et al. (2011). To compare various interpolation sequences of UM and CM in cooling performance and to provide the figure of merit for the reinforcement learning, we define a cooperative cooling quantifier as

$$\mathcal{C}=FP_{g}\log_{10}{\frac{\bar{n}_{\rm th}}{\bar{n}(t)}}. \tag{15}$$

Notably, the logarithm is used to obtain a positive value with almost the same order of magnitude as $F$ and $P_{g}$, so that $\bar{n}(t)$, $P_{g}$, and $F$ are considered in a balanced manner. In fact, the average population can be reduced by several (normally fewer than $10$) orders of magnitude under an efficient cooling protocol. In EIT cooling Feng et al. (2020), $\log_{10}[\bar{n}_{\rm th}/\bar{n}(t)]\sim(2,3)$; and in resolved-sideband cooling Triana et al. (2016), $\log_{10}[\bar{n}_{\rm th}/\bar{n}(t)]\sim(4,5)$. Although Eq. (15) is not a unique choice, it captures the requirement that a lower average population, a larger success probability, and a higher ground-state fidelity yield a better cooling performance.
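Equation (15) translates directly into a short function. In the sketch below the function name and the three sets of input numbers are hypothetical; the inputs are chosen only to illustrate how the quantifier trades cooling depth against success probability and are not data from this work.

```python
import math

def cooperative_cooling(fidelity, success_probability, nbar_th, nbar):
    """Cooperative cooling performance C = F * P_g * log10(nbar_th / nbar), Eq. (15)."""
    if nbar <= 0.0 or nbar_th <= 0.0:
        raise ValueError("average populations must be positive")
    return fidelity * success_probability * math.log10(nbar_th / nbar)

# Hypothetical inputs (not results from the paper):
c_fast = cooperative_cooling(0.9999, 0.10, 10.0, 1.0e-4)  # deep cooling, low success probability
c_slow = cooperative_cooling(0.78, 1.00, 10.0, 3.0)       # shallow cooling, unit success probability
c_mix = cooperative_cooling(0.9999, 0.30, 10.0, 1.0e-3)   # intermediate compromise
print(c_fast, c_slow, c_mix)
```

With these illustrative inputs the intermediate case scores highest, which is the kind of compromise the quantifier is designed to reward.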

Figure 2: (a) RL-optimization diagram for cooling by measurement. An agent constructed from neural networks interacts with an environment. The agent chooses an action (CM or UM strategy) according to the current state of the resonator. The environment then takes this action and returns both the state after the measurement and the reward $R$ based on the cooperative cooling performance $\mathcal{C}$ in Eq. (15). (b) Circuit model for our cooling algorithm based on the locally optimized UM and CM strategies. Starting from a thermal state, the resonator (the upper line) is gradually cooled down to its ground state by measurements on the detector (the lower line), which starts from the ground state. The measurement sequence is obtained by reinforcement learning.

The RL optimization is shown in Fig. 2(a). It consists of an "agent" part based on a series of neural networks and an "environment" part that performs the cooling-by-measurement actions on the quantum system. In reinforcement learning, the agent has a set of parameters, which are trained using the data collected through its interaction with the environment. In our architecture, the agent chooses an action, i.e., a conditional or an unconditional measurement, given the current state of the resonator. The environment then takes this action and returns the updated resonator state $\rho_{a}$ and a "reward" $R$ after the measurement. The reward is generated by the indicator in Eq. (15) to estimate whether the action is good or bad, and it is used to update the agent's parameters. During one "episode", the agent interacts with the environment $N$ times, where $N$ is the number of measurements in the whole sequence and is fixed from the beginning. A total reward is then accumulated, and the agent is trained to maximize the total reward over simulated episodes until it converges. The trained agent then provides a realistic control sequence of measurement strategies with their own (optimized) measurement intervals. The cooling-by-measurement sequence can be realized in the circuit model of Fig. 2(b), in which rounds of free evolution and measurement are successively arranged. The evolution time between two neighboring measurements depends on the measurement strategy and on the resonator state at the end of the last round. We follow the PPO algorithm for the agent structure, the data-collecting method, and the parameter updates, whose details can be found in Appendix A. The interpolation algorithm of UM and CM and the implementation of the measurement sequence are illustrated by pseudocode in Appendix B.
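The environment side of this loop can be sketched in a few lines. The example below is not the PPO agent of Appendix A: it replaces the learning agent with an exhaustive search over all binary sequences, which is feasible only for a small number of rounds (here `N = 6`, whereas `N = 16` would require $2^{16}$ evaluations), and it scores each sequence by the final value of Eq. (15) rather than by a per-round reward. The function names, the Fock cutoff, the units ($\omega_{a}=1$), and the mapping from $\bar{n}$ to an effective $n_{d}=1/\ln(1+1/\bar{n})$ are assumptions made here for illustration.

```python
import itertools
import math

G, DELTA = 0.04, 0.01          # coupling and detuning in units of omega_a (values from the text)

def rabi(n):
    """n-photon Rabi frequency sqrt(g^2 n + Delta^2 / 4)."""
    return math.sqrt(G * G * n + DELTA * DELTA / 4.0)

def beta_sq(n, tau):
    """|beta_n(tau)|^2; the CM coefficient is |alpha_n|^2 = 1 - |beta_n|^2."""
    return 0.0 if n == 0 else G * G * n * math.sin(rabi(n) * tau) ** 2 / rabi(n) ** 2

def mean_population(p):
    return sum(n * pn for n, pn in enumerate(p))

def step(p, action):
    """One round (1 = CM, 0 = UM). Returns the new populations and the success probability."""
    nbar = mean_population(p)
    if action == 1:
        tau = 1.0 / (G * math.sqrt(nbar))                 # CM interval 1 / Omega_th
        q = [(1.0 - beta_sq(n, tau)) * pn for n, pn in enumerate(p)]
        s = sum(q)
        return [qn / s for qn in q], s
    n_d = 1.0 / math.log(1.0 + 1.0 / nbar)                # assumed effective dominant Fock number
    tau = math.pi / (rabi(n_d) + rabi(n_d + 1.0))         # UM interval, Eq. (14)
    b = [beta_sq(n, tau) for n in range(len(p))] + [0.0]
    q = [(1.0 - b[n]) * p[n] + b[n + 1] * (p[n + 1] if n + 1 < len(p) else 0.0)
         for n in range(len(p))]
    return q, 1.0                                         # UM involves no postselection

def performance(sequence, nbar_th=8.85, nmax=200):
    """Cooperative cooling performance, Eq. (15), at the end of a measurement sequence."""
    r = nbar_th / (nbar_th + 1.0)
    p = [r**n for n in range(nmax)]
    z = sum(p)
    p = [pn / z for pn in p]
    nbar0, p_g = mean_population(p), 1.0
    for action in sequence:
        p, s = step(p, action)
        p_g *= s
    return p[0] * p_g * math.log10(nbar0 / mean_population(p))

N = 6
best_seq = max(itertools.product((0, 1), repeat=N), key=performance)
c_best = performance(best_seq)
c_all_um = performance((0,) * N)
c_all_cm = performance((1,) * N)
print(best_seq, c_best, c_all_um, c_all_cm)
```

In an RL setting, `step` would play the role of the environment transition and `performance` the source of the reward, while the enumeration would be replaced by a trained policy that scales to longer sequences.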

Figure 3: (a) Average population, (b) fidelity of the resonator in its ground state, (c) success probability, and (d) cooperative cooling performance under various sequences of cooling-by-measurement. The blue-solid lines with circle markers labeled by $S_{u}$ and the orange-dotted lines labeled by $S_{c}$ indicate the sequences consisting entirely of UM and CM strategies, respectively. The green-solid lines, the red-dashed lines, and the purple-dot-dashed lines describe the hybrid sequences shown in (e), (f), and (g), labeled by $S_{1}$, $S_{2}$, and $S_{4}$, respectively. The brown-solid lines with triangle markers labeled by $S_{\rm opt}$ describe the RL-optimized sequence presented in (h). For all the sequences in (e), (f), (g), and (h), $1$ and $0$ indicate the CM and UM strategies, respectively. The parameters are set as $\omega_{a}=1.4$ GHz, $T=0.1$ K, $g=0.04\omega_{a}$, and $\Delta=0.01\omega_{a}$.

We consider cooling down a gigahertz mechanical microresonator Ding et al. (2011); Chan et al. (2011) with various interpolation sequences of UM and CM. Using the resonator frequency $\omega_{a}=1.4$ GHz, the coupling strength between resonator and detector $g=0.04\omega_{a}$, and the initial temperature of the resonator $T=0.1$ K, the average population starts from $\bar{n}_{\rm th}=8.85$. The cooling performances under the sequences consisting entirely of UM and CM are shown by the blue-solid lines with circle markers and the orange-dotted lines in Figs. 3(a)-(d), labeled by $S_{u}$ and $S_{c}$, respectively. Under the conditional measurement strategy with $N=16$, the average population $\bar{n}$ is reduced by five orders of magnitude [see Fig. 3(a)] and the ground-state fidelity exceeds $F>0.9999$ [see Fig. 3(b)], with a success probability of less than $10\%$ [see Fig. 3(c)]. In sharp contrast, under the same number of unconditional measurements, $\bar{n}$ is merely reduced to $\bar{n}\approx 3.36$ and the ground-state fidelity is $F\approx 0.78$, albeit with unit success probability. In terms of all the individual quantifiers, i.e., $\bar{n}$, $F$, and $P_{g}$, the results under the hybrid sequences of UM and CM labelled by $S_{k}$, $k=1,2,4$, lie between the two limits $S_{u}$ and $S_{c}$. As illustrated by Figs. 3(e), (f), and (g), the three sequences start from a CM (indicated by $1$), switch to UM (indicated by $0$) after $k$ rounds of free evolution and measurement, switch back to CM after a single round, and then repeat the preceding arrangement. In comparison to the entire UM sequence, the interpolation with CM promotes the cooling efficiency in $\bar{n}$. A larger $k$ gives rise to a smaller proportion of unconditional measurements and a lower probability $P_{g}$ that the detector remains in its measured subspace.
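The quoted initial occupation can be checked from the Bose-Einstein distribution. The snippet below is a minimal sketch that assumes $\omega_{a}=1.4$ GHz is meant as an angular frequency ($1.4\times 10^{9}$ rad/s); the function name is hypothetical and the physical constants are the SI values of $\hbar$ and $k_{B}$.

```python
import math

HBAR = 1.054571817e-34   # reduced Planck constant, J s
K_B = 1.380649e-23       # Boltzmann constant, J / K

def thermal_occupation(omega_a, temperature):
    """Mean occupation 1 / (exp(x) - 1) with x = hbar*omega_a / (k_B T)."""
    x = HBAR * omega_a / (K_B * temperature)
    return 1.0 / math.expm1(x)

# Assumption: 1.4 GHz is interpreted as an angular frequency in rad/s.
nbar_th = thermal_occupation(1.4e9, 0.1)
print(round(nbar_th, 2))
```

Under this assumption the result is about $8.86$, close to the quoted $\bar{n}_{\rm th}=8.85$; interpreting $1.4$ GHz as an ordinary frequency instead would give a much smaller occupation.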

With respect to the cooperative cooling performance given by Eq. (15), it is found [see Fig. 3(d)] that $\mathcal{C}(S_{1})>\mathcal{C}(S_{2})>\mathcal{C}(S_{4})>\mathcal{C}(S_{u})$ and yet $\mathcal{C}(S_{2})\approx\mathcal{C}(S_{c})$. A regular interpolation sequence can therefore have a better cooperative cooling performance than the entire CM sequence, although the dependence of $\mathcal{C}$ for an arbitrary hybrid sequence on its proportion of CM strategies might not be monotonic. We are thus motivated to find an optimized sequence by virtue of the PPO algorithm. A typical RL-optimized sequence of cooling strategies labeled by $S_{\rm opt}$ is described in Fig. 3(h). With a four-order reduction in the average population (close to the cooling efficiency provided by $S_{c}$), an almost unit ground-state fidelity $F>0.9999$, and a moderate success probability $P_{g}\approx 30\%$ (much larger than that of $S_{c}$), the optimized sequence achieves a cooperative cooling performance $\mathcal{C}(S_{\rm opt})=2.73$ according to Eq. (15), which overwhelms that of all the other measurement sequences. We have therefore achieved a compromise between cooling rate and success probability through reinforcement learning, with much less overhead than a brute-force search. The RL-optimized sequence is not unique, yet the current results for $\bar{n}$, $F$, $P_{g}$, and $\mathcal{C}$ in Fig. 3 are almost invariant as long as there is one CM in the first several rounds.

Figure 4: (a) Average populations and (b) cooperative cooling performance under the RL-optimized cooling algorithm with various initial temperatures. (c), (d), (e), and (f) describe the optimized sequences of UM and CM with $T=0.05$ K, $T=0.1$ K, $T=0.2$ K, and $T=0.3$ K, respectively. The other parameters are the same as those in Fig. 3.

The RL-optimized algorithm applies to a wide range of initial temperatures of the resonator. Starting from various $\bar{n}_{\rm th}$ determined by the temperature, the average populations can be reduced by three to five orders of magnitude under the optimized measurement sequences, as demonstrated in Fig. 4(a). At a higher temperature, it is harder to suppress the transitions between the ground state and the excited states of the detector. Both the relative magnitude of the population reduction [see Fig. 4(a)] and the cooperative cooling performance [see Fig. 4(b)] therefore decrease monotonically as the temperature increases.

Similar to Fig. 3(h), we present in Figs. 4(c), (d), (e), and (f) the optimized sequences fully determined by the PPO algorithm, which still outperform any regular interpolated sequence in the cooling quantifier $\mathcal{C}$. Comparing these four sub-figures corresponding to various temperatures, it is interesting to find that a larger portion of unconditional measurements is required along the optimized sequence for a higher temperature. This is consistent with the fact that under CM the success probability $P_{g}$ of finding the detector in its ground state decreases exponentially with increasing temperature of the target resonator. More UMs are then used to rescue a rapidly declining $P_{g}$ and obtain a larger $\mathcal{C}$. In addition, for $T>0.05$ K, the RL-optimized sequence always starts from a conditional measurement, which is important for a significant cooling rate of $\bar{n}$ during the first several rounds of the whole sequence.

The profiles shown in Fig. 3(h) and Figs. 4(d), (e), and (f) manifest a common pattern for all the RL-optimized sequences. In the first several rounds, when the resonator is normally in a comparatively high-temperature state, a conditional or projective measurement should be performed on the detector, followed by several unconditional measurements before further cooling. This pattern is consistent with the variations of both energy and entropy in nonunitary controls Gherardini et al. (2020). The energy variation induced by a projective measurement is $k_{B}TH(\rho)$ on average, where $H(\rho)$ is the Shannon entropy of the whole system after a free evolution. At the end of the first round, a projective measurement is therefore desired to remove as much energy as possible, and it is followed by several rounds of unconditional measurements to preserve the success probability. Thus, in general, we expect to see more UMs than CMs in the first several rounds and more CMs than UMs in the remaining rounds.

IV Discussion and conclusion

Figure 5: (a) Average population and (b) cooperative cooling performance of the resonator coupled to a thermal environment under the optimized cooling strategy with various dissipation rates. The dissipation-free results are those labeled by $S_{\rm opt}$ in Figs. 3(a) and (d).

The preceding analysis of the cooling performance neglects the environment-induced dissipation. We now consider the cooling process in an open-quantum-system scenario, in which the free evolution between neighboring measurements is influenced by a finite-temperature environment. The dynamics is then described by the master equation

$$\dot{\rho}(t)=-i[H,\rho(t)]+\gamma(\bar{n}_{\rm th}+1)\mathcal{D}[a]\rho(t)+\gamma\bar{n}_{\rm th}\mathcal{D}[a^{\dagger}]\rho(t), \tag{16}$$

where $\mathcal{D}[A]$ represents the Lindblad superoperator

$$\mathcal{D}[A]\rho(t)\equiv A\rho(t)A^{\dagger}-\frac{1}{2}\left\{A^{\dagger}A,\rho(t)\right\}. \tag{17}$$
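As a minimal numerical sketch of Eqs. (16) and (17), the following Python snippet builds the Lindblad superoperator as a matrix function and applies only the dissipative part of Eq. (16) to a thermal state of the resonator. The Fock-space cutoff `dim`, the value of `n_th`, and the helper names are assumptions chosen for illustration; the Hamiltonian term and the detector are deliberately left out.

```python
import numpy as np

def dissipator(A, rho):
    """Lindblad superoperator D[A]rho = A rho A^dag - (1/2){A^dag A, rho}, Eq. (17)."""
    AdA = A.conj().T @ A
    return A @ rho @ A.conj().T - 0.5 * (AdA @ rho + rho @ AdA)

def thermal_dissipation(rho, a, gamma, n_th):
    """Dissipative part of Eq. (16) only (the commutator with H is omitted)."""
    return (gamma * (n_th + 1.0) * dissipator(a, rho)
            + gamma * n_th * dissipator(a.conj().T, rho))

dim = 30                                        # hypothetical Fock-space cutoff
a = np.diag(np.sqrt(np.arange(1, dim)), k=1)    # truncated annihilation operator
n_th = 2.0                                      # hypothetical thermal occupation

# Thermal populations p_n proportional to [n_th/(n_th+1)]^n, normalized on the cutoff space
p = (n_th / (n_th + 1.0)) ** np.arange(dim)
rho_th = np.diag(p / p.sum())

# The thermal state is a fixed point of the dissipative part (detailed balance)
drho = thermal_dissipation(rho_th, a, gamma=1.0, n_th=n_th)
print(np.max(np.abs(drho)))
```

The check illustrates that the two dissipators in Eq. (16) drive the resonator toward the thermal state with occupation $\bar{n}_{\rm th}$, which is the heating channel competing with the measurement-induced cooling.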

In Figs. 5(a) and 5(b), we present the average population $\bar{n}$ and the cooperative cooling performance $\mathcal{C}$, respectively, for various dissipation rates. To compare the cooling performance in the presence of thermal decoherence with the dissipation-free situation, we apply the RL-optimized sequence provided in Fig. 3(h). A larger dissipation rate gives rise to a weaker cooling performance in terms of both $\bar{n}$ and $\mathcal{C}$, exhibiting the competition between the cooling effect of the measurements and the accumulated heating effect of the environment. Nevertheless, for typical gigahertz mechanical resonators with $\gamma/\omega_{a}\sim 10^{-5}$ Ding et al. (2011); Chan et al. (2011), our optimized cooling protocol is still capable of reducing $\bar{n}$ by three orders of magnitude with about $N=10$ measurements [see the green dashed line in Fig. 5(a)]. Meanwhile, the asymptotic value of $\mathcal{C}$ still exceeds that of the CM strategy labeled by $S_{c}$ in Fig. 3(d).

Even in the absence of thermal decoherence, $\bar{n}$ does not keep decreasing. Fundamentally, it is constrained by the third law of thermodynamics: absolute zero cannot be attained within a finite number of operations. Indeed, both $\tau_{\rm opt}^{c}$ and $\tau_{\rm opt}^{u}$ approach infinity as $\bar{n}\rightarrow 0$, which indicates that the whole cooling process has to be truncated at a maximum timescale.

We emphasize again that the preceding hybrid cooling sequences based on the conditional and unconditional measurements are optimized from both global and local perspectives. Globally, we use reinforcement learning to find the optimized order of UM and CM. The local optimization relies on selecting the measurement interval that yields a minimum average population $\bar{n}$ under one measurement. For UM in Eq. (14), $\tau_{\rm opt}^{u}(t)$ does not have to be obtained by an instant feedback mechanism in a realistic implementation. The measurement sequence $\{\tau_{1}(t_{1}),\tau_{2}(t_{2}),\cdots,\tau_{N}(t_{N})\}$ can actually be obtained prior to the cooling measurements. The first interval $\tau_{1}(t_{1})$ depends on the initial population distribution $p_{n}$, and $\tau_{k}(t_{k})$, $k\geq 2$, can be calculated from the effective temperature that is uniquely determined by the dynamics of $p_{n}(t)$ through Eq. (12). In other words, we can avoid the feedback error and imprecision induced by detecting the resonator states during the experiment.

In summary, we present an optimized cooling architecture based on a sequential arrangement of both conditional and unconditional measurements. We analyze and compare the advantages and disadvantages of CM and UM in terms of cooling rate and success probability. We obtain for the first time an analytical expression for the optimized unconditional measurement interval $\tau_{\rm opt}^{u}=\pi/(\Omega_{d}+\Omega_{d+1})$, in parallel to that for conditional measurement Yan and Jing (2022). Here the dominant Rabi frequency $\Omega_{d}$ depends on the dominant Fock state of the resonator distribution, with $n_{d}=k_{B}T/(\hbar\omega_{a})$, and on the coupling strength between target and detector. Combining the advantages of both measurement strategies gives rise to an optimized hybrid cooling algorithm assisted by reinforcement learning. It is justified by the cooperative cooling performance, which we define to quantify the comprehensive cooling efficiency of an arbitrary cooling-by-measurement strategy. Our work therefore pushes cooling-by-measurement to a previously unattained degree of efficiency and feasibility. It offers an appealing interdisciplinary application of quantum control and artificial intelligence.

Acknowledgments

We acknowledge financial support from the National Science Foundation of China (Grants No. 11974311 and No. U1801661).

Appendix A Proximal Policy Optimization

This appendix provides more details on proximal policy optimization (PPO), a typical reinforcement learning algorithm that we use to optimize the measurement sequence for cooling. The PPO algorithm follows an “actor-critic” framework, in which an actor receives the current state as an input and then outputs an action according to an updatable policy, and a critic evaluates this action to determine whether it should be encouraged or not. In the following, we do not distinguish between “actor” and “policy” for simplicity.

Figure 6: Diagram of proximal policy optimization algorithm.

As shown in Fig. 6, the PPO algorithm has two actors (policies), $\pi_{\rm old}(\{\theta\})$ and $\pi_{\rm new}(\{\theta^{\prime}\})$, and one critic. Each of them is an agent built from neural networks (see Fig. 2) characterized by a set of parameters $\{\theta\}$. The two policies have the same structure in PPO. The old policy collects the sampling data through interaction with the environment, and the new one uses these data, stored in a buffer, to update $\{\theta\}$ to $\{\theta^{\prime}\}$. At first, the environment initializes and delivers the state $s_{1}$ of the target system to the old policy $\pi_{\rm old}(\{\theta\})$; the old policy then generates an action $a_{1}$ according to $s_{1}$ and $\{\theta\}$. In the environment, the action $a_{1}$ is taken and the system state becomes $s_{2}$. The environment also provides a reward $R_{1}$ indicating how good the action is. The reward is generated by a task-specific reward function. At this stage, one interaction between the policy and the environment is completed and one “trajectory” (or return) $\{s_{1},a_{1},R_{1}\}$ is collected. $N$ trajectories are collected in one episode, where $N$ is the number of actions required to complete the task. The critic takes both actions and states as input and outputs an advantage $A_{i}$ representing the contribution of the current action $a_{i}$ on the current state $s_{i}$. After collecting a sufficient amount of data, the critic estimates the actions’ contributions as precisely as possible. Meanwhile, the new policy updates its parameters by maximizing a clipped surrogate objective function $L^{\rm CLIP}(\{\theta\})$ Schulman et al. (2017) built from the advantages, and then transfers its parameters $\{\theta^{\prime}\}$ to the old one.
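For concreteness, the clipped surrogate objective of Schulman et al. (2017) can be sketched in a few lines of Python. This is an illustrative sketch rather than the training code used for the results here: the function name and the clipping parameter `eps=0.2` (a common default in the PPO literature) are assumptions, and the log-probabilities are taken as plain arrays instead of neural-network outputs.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective: mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    r = pi_new(a|s) / pi_old(a|s) is the probability ratio between the two policies
    and A is the advantage supplied by the critic. eps is a hypothetical choice.
    """
    ratio = np.exp(np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))

# Two sampled actions: one favored more strongly by the new policy, one less
logp_old = np.log([0.5, 0.5])
logp_new = np.log([0.9, 0.2])
advantages = [1.0, -1.0]
objective = clipped_surrogate(logp_new, logp_old, advantages)
```

The clipping removes the incentive to move the probability ratio outside the interval $[1-\epsilon,1+\epsilon]$, so that the new policy stays close to the old one that collected the data.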

In our application for optimizing the cooling sequence, the allowed inputs of the system states are defined as the populations in the Fock states, i.e., the diagonal elements of the target-resonator state $\rho_{a}$,

$$s_{i}=\{p_{0}(t),p_{1}(t),p_{2}(t),\cdots,p_{n_{c}}(t)\}, \tag{18}$$

where $n_{c}$ indicates the cutoff Fock state of the resonator. The actions applied to the environment are selected from the set

$$a_{i}\in\{0,1\}, \tag{19}$$

where $0$ and $1$ represent unconditional and conditional measurements, respectively. The two policies decide which type of measurement is to be performed according to the current state of the resonator. The environment represents the quantum devices that perform the measurements, obtain the updated states, and return the rewards. When an action is selected and sent to the environment, the optimized measurement interval is calculated according to the measurement type. After a unitary evolution lasting $\tau_{\rm opt}\in\{\tau_{\rm opt}^{c},\tau_{\rm opt}^{u}\}$, a measurement is performed on the detector. Then the average population $\bar{n}$, the ground-state fidelity $F$, and the success probability $P_{g}$ are obtained to calculate the cooperative cooling performance $\mathcal{C}$ given by Eq. (15). The reward function is set as a fixed multiple of $\mathcal{C}$, $R_{i}(s_{i},a_{i})=100\times\mathcal{C}(s_{i},a_{i})$. After the measurement, the environment returns the resonator state and the reward to the policies. When the training is completed, a policy $\pi(\{\theta_{\rm opt}\})$ with a set of optimized parameters is obtained. The neural network equipped with $\{\theta_{\rm opt}\}$ can then be used to generate the optimized actions to cool down the current state.

Appendix B Generation of optimized sequence

Input: Temperature $T$
Output: $S_{\rm opt}=\{\mathcal{M}_{1},\mathcal{M}_{2},\cdots,\mathcal{M}_{N}\}$ and $\mathcal{T}=\{\tau_{\rm opt}(t_{1}),\tau_{\rm opt}(t_{2}),\cdots,\tau_{\rm opt}(t_{N})\}$
Initialize the thermal state $\rho_{a}=\sum_{n}p_{n}|n\rangle\langle n|$ with $T$
Use PPO to train an optimized policy $\pi(\{\theta_{\rm opt}\})$
for $i=1,2,\cdots,N$ do
    Run the policy $\pi(\{\theta_{\rm opt}\})$ on $\rho_{a}$ to generate $\mathcal{M}_{i}$
    Attain $T_{\rm eff}=\hbar\omega_{a}/[k_{B}\ln(1+1/\bar{n})]$ on $\bar{n}(\rho_{a})$
    if $\mathcal{M}_{i}=0$ then
        Calculate $\tau_{\rm opt}(t_{i})=\pi/(\Omega_{d}+\Omega_{d+1})$ on $T_{\rm eff}$
        Get the cooling coefficients $|\alpha_{n}|^{2}$ and $|\beta_{n}|^{2}$
        UM: $\rho_{a}\leftarrow\sum_{n}(|\alpha_{n}|^{2}p_{n}+|\beta_{n+1}|^{2}p_{n+1})|n\rangle\langle n|$
    else if $\mathcal{M}_{i}=1$ then
        Calculate $\tau_{\rm opt}(t_{i})=1/\Omega_{\rm th}$ on $T_{\rm eff}$
        Get the cooling coefficients $|\alpha_{n}|^{2}$
        CM: $\rho_{a}\leftarrow\sum_{n}|\alpha_{n}|^{2}p_{n}|n\rangle\langle n|/(\sum_{n}|\alpha_{n}|^{2}p_{n})$
    end
end
Algorithm 1 RL-optimized cooling procedure
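The two state updates labeled UM and CM in Algorithm 1 act only on the Fock-state populations, so they can be sketched with plain arrays. In the Python sketch below, the function names are hypothetical, and the coefficients used in the demonstration, $|\alpha_{n}|^{2}=\cos^{2}(\sqrt{n}\,g\tau)$ and $|\beta_{n}|^{2}=1-|\alpha_{n}|^{2}$, are an assumption made purely for illustration (a resonant Jaynes-Cummings-like form); the actual coefficients are those defined in the main text. The normalization factor of the CM update is returned as well, since it plays the role of the success probability of that round.

```python
import numpy as np

def unconditional_update(p, alpha2, beta2):
    """UM step: p_n <- |alpha_n|^2 p_n + |beta_{n+1}|^2 p_{n+1}."""
    p, alpha2, beta2 = (np.asarray(x, dtype=float) for x in (p, alpha2, beta2))
    from_above = np.append(beta2[1:] * p[1:], 0.0)  # nothing flows in from beyond the cutoff
    return alpha2 * p + from_above

def conditional_update(p, alpha2):
    """CM step: p_n <- |alpha_n|^2 p_n / sum_n |alpha_n|^2 p_n. Returns (p_new, normalization)."""
    unnormalized = np.asarray(alpha2, dtype=float) * np.asarray(p, dtype=float)
    norm = unnormalized.sum()
    return unnormalized / norm, norm

def mean_population(p):
    """Average population n_bar = sum_n n p_n."""
    return float(np.dot(np.arange(len(p)), p))

# Illustrative input: a geometric (thermal-like) distribution on 20 Fock states
n = np.arange(20)
p0 = 0.5 ** (n + 1.0)
p0 /= p0.sum()

# Hypothetical coefficients, assumed only for this demonstration
g_tau = 1.0
alpha2 = np.cos(g_tau * np.sqrt(n)) ** 2
beta2 = 1.0 - alpha2

p_um = unconditional_update(p0, alpha2, beta2)
p_cm, norm_cm = conditional_update(p0, alpha2)
print(mean_population(p0), mean_population(p_um), mean_population(p_cm), norm_cm)
```

Under these assumed coefficients the UM update keeps the distribution normalized without discarding any outcome, whereas the CM update renormalizes the surviving populations at the cost of a normalization factor smaller than one, which mirrors the trade-off between cooling rate and success probability.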

Both the order of measurements and the sequence of measurement intervals can be regarded as outputs of our RL-optimized cooling algorithm, as shown in Algorithm 1. The input information is the initial temperature $T$, which fully determines the thermal state of the resonator. Once the reinforcement learning process has been completed by the PPO algorithm (see Appendix A), the parameters $\{\theta\}$ of the neural network (policy $\pi$) have been trained to select, for the current state, the one of the two measurement strategies that maximizes the cooperative cooling performance. The cooling procedure is then formally launched. We run the policy $\pi(\{\theta_{\rm opt}\})$ on $\rho_{a}(0)=\rho_{a}^{\rm th}$, which generates the first measurement strategy $\mathcal{M}_{1}\in\{0,1\}$. Here $0$ and $1$ indicate UM and CM, respectively. If $\mathcal{M}_{1}=0$, then $\tau_{\rm opt}(t_{1})=\tau_{\rm opt}^{u}$ in Eq. (14), which can be obtained from the effective temperature $T_{\rm eff}$ of the resonator (initially $T_{\rm eff}=T$, and it is updated by the state at the end of the last round). Subsequently, the cooling coefficients $|\alpha_{n}|^{2}$ and $|\beta_{n}|^{2}$ are calculated and the resonator state is modified according to Eq. (9). Otherwise, if $\mathcal{M}_{1}=1$, a conditional measurement is implemented after an interval $\tau_{\rm opt}(t_{1})=\tau_{\rm opt}^{c}=1/\Omega_{\rm th}(t)$ and the resonator state is modified according to Eq. (3). At the end of this round, one can calculate $T_{\rm eff}$ from the current $p_{n}(t)$ and then go to the next round. After $N$ iterations, the optimized measurement sequence characterized by $S_{\rm opt}=\{\mathcal{M}_{1},\mathcal{M}_{2},\cdots,\mathcal{M}_{N}\}$ and $\mathcal{T}=\{\tau_{\rm opt}(t_{1}),\tau_{\rm opt}(t_{2}),\cdots,\tau_{\rm opt}(t_{N})\}$ is obtained, as described in Fig. 3(h) and Figs. 4(c), (d), (e), (f). In practical implementations, the measurements prescribed by $S_{\rm opt}$ and $\mathcal{T}$ can be performed on the detector without knowledge of the target-resonator state.
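The effective-temperature step of Algorithm 1 is the inverse of the Bose-Einstein relation between temperature and thermal occupation, which the following Python sketch makes explicit. The function names and the resonator frequency in the example are hypothetical, chosen only to illustrate the round trip between $T$ and $\bar{n}$.

```python
import math

HBAR = 1.054571817e-34  # reduced Planck constant, J s
K_B = 1.380649e-23      # Boltzmann constant, J / K

def thermal_occupation(T, omega_a):
    """Bose-Einstein occupation n_th = 1 / (exp(hbar omega_a / (k_B T)) - 1)."""
    return 1.0 / math.expm1(HBAR * omega_a / (K_B * T))

def effective_temperature(n_bar, omega_a):
    """T_eff = hbar omega_a / [k_B ln(1 + 1/n_bar)], as used in Algorithm 1."""
    return HBAR * omega_a / (K_B * math.log1p(1.0 / n_bar))

omega_a = 2.0 * math.pi * 1.0e9   # hypothetical resonator frequency (rad/s)
T = 0.1                           # initial temperature in kelvin
n_th = thermal_occupation(T, omega_a)
T_back = effective_temperature(n_th, omega_a)   # recovers T up to rounding
T_cooled = effective_temperature(1.0e-3 * n_th, omega_a)
print(n_th, T_back, T_cooled)
```

Because $T_{\rm eff}$ is a function of $\bar{n}$ alone, it can be tabulated from the simulated populations before the experiment, in line with the statement that no knowledge of the target-resonator state is needed during the measurements.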

References

  • Milburn and Woolley (2008) G. J. Milburn and M. J. Woolley, Quantum nanoscience, Contemp. Phys. 49, 413 (2008).
  • Aspelmeyer et al. (2014) M. Aspelmeyer, T. J. Kippenberg,  and F. Marquardt, Cavity optomechanics, Rev. Mod. Phys. 86, 1391 (2014).
  • Lloyd and Braunstein (1999) S. Lloyd and S. L. Braunstein, Quantum computation over continuous variables, Phys. Rev. Lett. 82, 1784 (1999).
  • You and Nori (2011) J. Q. You and F. Nori, Atomic physics and quantum optics using superconducting circuits, Nature 474, 589 (2011).
  • Toyoda et al. (2015) K. Toyoda, R. Hiji, A. Noguchi, and S. Urabe, Hong–Ou–Mandel interference of two phonons in trapped ions, Nature 527, 74 (2015).
  • Um et al. (2016) M. Um, J. Zhang, D. Lv, Y. Lu, S. An, J.-N. Zhang, H. Nha, M. S. Kim,  and K. Kim, Phonon arithmetic in a trapped ion system, Nat. Commun. 7, 11410 (2016).
  • Bocko and Onofrio (1996) M. F. Bocko and R. Onofrio, On the measurement of a weak classical force coupled to a harmonic oscillator: experimental progress, Rev. Mod. Phys. 68, 755 (1996).
  • Caves et al. (1980) C. M. Caves, K. S. Thorne, R. W. P. Drever, V. D. Sandberg, and M. Zimmermann, On the measurement of a weak classical force coupled to a quantum-mechanical oscillator. I. Issues of principle, Rev. Mod. Phys. 52, 341 (1980).
  • Sharma et al. (2018) S. Sharma, Y. M. Blanter,  and G. E. W. Bauer, Optical cooling of magnons, Phys. Rev. Lett. 121, 087205 (2018).
  • Wilson-Rae et al. (2007) I. Wilson-Rae, N. Nooshi, W. Zwerger,  and T. J. Kippenberg, Theory of ground state cooling of a mechanical oscillator using dynamical backaction, Phys. Rev. Lett. 99, 093901 (2007).
  • Gigan et al. (2006) S. Gigan, H. R. Böhm, M. Paternostro, F. Blaser, G. Langer, J. B. Hertzberg, K. C. Schwab, D. Bäuerle, M. Aspelmeyer,  and A. Zeilinger, Self-cooling of a micromirror by radiation pressure, Nature 444, 67 (2006).
  • Wang et al. (2011) X. Wang, S. Vinjanampathy, F. W. Strauch,  and K. Jacobs, Ultraefficient cooling of resonators: Beating sideband cooling with quantum control, Phys. Rev. Lett. 107, 177204 (2011).
  • Zhang et al. (2013) J. Zhang, D. Li, R. Chen,  and Q. Xiong, Laser cooling of a semiconductor by 40 kelvin, Nature 493, 504 (2013).
  • Epstein et al. (1995) R. I. Epstein, M. I. Buchwald, B. C. Edwards, T. R. Gosnell,  and C. E. Mungan, Observation of laser-induced fluorescent cooling of a solid, Nature 377, 500 (1995).
  • Morigi et al. (2000) G. Morigi, J. Eschner,  and C. H. Keitel, Ground state laser cooling using electromagnetically induced transparency, Phys. Rev. Lett. 85, 4458 (2000).
  • Roos et al. (2000) C. F. Roos, D. Leibfried, A. Mundt, F. Schmidt-Kaler, J. Eschner,  and R. Blatt, Experimental demonstration of ground state laser cooling with electromagnetically induced transparency, Phys. Rev. Lett. 85, 5547 (2000).
  • Vanner et al. (2011) M. R. Vanner, I. Pikovski, G. D. Cole, M. S. Kim, C. Brukner, K. Hammerer, G. J. Milburn,  and M. Aspelmeyer, Pulsed quantum optomechanics, Proc. Natl. Acad. Sci. 108, 16182 (2011).
  • Vanner et al. (2013) M. R. Vanner, J. Hofer, G. D. Cole,  and M. Aspelmeyer, Cooling-by-measurement and mechanical state tomography via pulsed optomechanics, Nat. Commun. 4, 2295 (2013).
  • Bennett et al. (2016) J. S. Bennett, K. Khosla, L. S. Madsen, M. R. Vanner, H. Rubinsztein-Dunlop,  and W. P. Bowen, A quantum optomechanical interface beyond the resolved sideband limit, New J. Phys. 18, 053030 (2016).
  • Rossi et al. (2018) M. Rossi, D. Mason, J. Chen, Y. Tsaturyan,  and A. Schliesser, Measurement-based quantum control of mechanical motion, Nature 563, 53 (2018).
  • Brunelli et al. (2020) M. Brunelli, D. Malz, A. Schliesser,  and A. Nunnenkamp, Stroboscopic quantum optomechanics, Phys. Rev. Research 2, 023241 (2020).
  • Buffoni et al. (2019) L. Buffoni, A. Solfanelli, P. Verrucchi, A. Cuccoli,  and M. Campisi, Quantum measurement cooling, Phys. Rev. Lett. 122, 070603 (2019).
  • Nakazato et al. (2003) H. Nakazato, T. Takazawa, and K. Yuasa, Purification through Zeno-like measurements, Phys. Rev. Lett. 90, 060401 (2003).
  • Li et al. (2011) Y. Li, L.-A. Wu, Y.-D. Wang,  and L.-P. Yang, Nondeterministic ultrafast ground-state cooling of a mechanical resonator, Phys. Rev. B 84, 094502 (2011).
  • Xu et al. (2014) J.-S. Xu, M.-H. Yung, X.-Y. Xu, S. Boixo, Z.-W. Zhou, C.-F. Li, A. Aspuru-Guzik,  and G.-C. Guo, Demon-like algorithmic quantum cooling and its realization with quantum optics, Nat. Photonics 8, 113 (2014).
  • Puebla et al. (2020) R. Puebla, O. Abah,  and M. Paternostro, Measurement-based cooling of a nonlinear mechanical resonator, Phys. Rev. B 101, 245410 (2020).
  • Pyshkin et al. (2016) P. V. Pyshkin, D.-W. Luo, J. Q. You,  and L.-A. Wu, Ground-state cooling of quantum systems via a one-shot measurement, Phys. Rev. A 93, 032120 (2016).
  • Yan and Jing (2021) J.-s. Yan and J. Jing, External-level assisted cooling by measurement, Phys. Rev. A 104, 063105 (2021).
  • Yan and Jing (2022) J.-s. Yan and J. Jing, Simultaneous cooling by measuring one ancillary system, Phys. Rev. A 105, 052607 (2022).
  • Zhang et al. (2019a) J.-M. Zhang, J. Jing, L.-A. Wu, L.-G. Wang,  and S.-Y. Zhu, Measurement-induced cooling of a qubit in structured environments, Phys. Rev. A 100, 022107 (2019a).
  • Harel and Kurizki (1996) G. Harel and G. Kurizki, Fock-state preparation from thermal cavity fields by measurements on resonant atoms, Phys. Rev. A 54, 5410 (1996).
  • Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529, 484 (2016).
  • Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, Mastering the game of Go without human knowledge, Nature 550, 354 (2017).
  • Silver et al. (2018) D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science 362, 1140 (2018).
  • Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg,  and D. Hassabis, Human-level control through deep reinforcement learning, Nature 518, 529 (2015).
  • Carleo et al. (2019) G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto,  and L. Zdeborová, Machine learning and the physical sciences, Rev. Mod. Phys. 91, 045002 (2019).
  • Convy et al. (2022) I. Convy, H. Liao, S. Zhang, S. Patel, W. P. Livingston, H. N. Nguyen, I. Siddiqi,  and K. B. Whaley, Machine learning for continuous quantum error correction on superconducting qubits, New J. Phys. 24, 063019 (2022).
  • Fösel et al. (2018) T. Fösel, P. Tighineanu, T. Weiss,  and F. Marquardt, Reinforcement learning with neural networks for quantum feedback, Phys. Rev. X 8, 031084 (2018).
  • Bolens and Heyl (2021) A. Bolens and M. Heyl, Reinforcement learning for digital quantum simulation, Phys. Rev. Lett. 127, 110502 (2021).
  • Yuan et al. (2021) X. Yuan, J. Sun, J. Liu, Q. Zhao,  and Y. Zhou, Quantum simulation with hybrid tensor networks, Phys. Rev. Lett. 127, 040501 (2021).
  • Guo et al. (2021) S.-F. Guo, F. Chen, Q. Liu, M. Xue, J.-J. Chen, J.-H. Cao, T.-W. Mao, M. K. Tey,  and L. You, Faster state preparation across quantum phase transition assisted by reinforcement learning, Phys. Rev. Lett. 126, 060401 (2021).
  • Bukov et al. (2018) M. Bukov, A. G. R. Day, D. Sels, P. Weinberg, A. Polkovnikov,  and P. Mehta, Reinforcement learning in different phases of quantum control, Phys. Rev. X 8, 031086 (2018).
  • Zhang et al. (2019b) X.-M. Zhang, Z. Wei, R. Asad, X.-C. Yang,  and X. Wang, When does reinforcement learning stand out in quantum control? a comparative study on state preparation, npj Quantum Inf. 5, 85 (2019b).
  • Sivak et al. (2022) V. V. Sivak, A. Eickbusch, H. Liu, B. Royer, I. Tsioutsios,  and M. H. Devoret, Model-free quantum control with reinforcement learning, Phys. Rev. X 12, 011059 (2022).
  • Kim and Jeong (2021) D.-K. Kim and H. Jeong, Deep reinforcement learning for feedback control in a collective flashing ratchet, Phys. Rev. Research 3, L022002 (2021).
  • Yao et al. (2021) J. Yao, L. Lin,  and M. Bukov, Reinforcement learning for many-body ground-state preparation inspired by counterdiabatic driving, Phys. Rev. X 11, 031070 (2021).
  • Feng et al. (2020) L. Feng, W. L. Tan, A. De, A. Menon, A. Chu, G. Pagano,  and C. Monroe, Efficient ground-state cooling of large trapped-ion chains with an electromagnetically-induced-transparency tripod scheme, Phys. Rev. Lett. 125, 053001 (2020).
  • Triana et al. (2016) J. F. Triana, A. F. Estrada, and L. A. Pachón, Ultrafast optimal sideband cooling under non-Markovian evolution, Phys. Rev. Lett. 116, 183602 (2016).
  • Ding et al. (2011) L. Ding, C. Baker, P. Senellart, A. Lemaitre, S. Ducci, G. Leo, and I. Favero, Wavelength-sized GaAs optomechanical resonators with gigahertz frequency, Appl. Phys. Lett. 98, 113108 (2011).
  • Chan et al. (2011) J. Chan, T. P. M. Alegre, A. H. Safavi-Naeini, J. T. Hill, A. Krause, S. Gröblacher, M. Aspelmeyer,  and O. Painter, Laser cooling of a nanomechanical oscillator into its quantum ground state, Nature 478, 89 (2011).
  • Gherardini et al. (2020) S. Gherardini, F. Campaioli, F. Caruso,  and F. C. Binder, Stabilizing open quantum batteries by sequential measurements, Phys. Rev. Research 2, 013095 (2020).
  • Schulman et al. (2017) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal policy optimization algorithms, arXiv:1707.06347 (2017).