Optimizing measurement-based cooling by reinforcement learning

Jia-shun Yan School of Physics, Zhejiang University, Hangzhou 310027, Zhejiang, China Jun Jing Email address: [email protected] School of Physics, Zhejiang University, Hangzhou 310027, Zhejiang, China

Abstract

Conditional cooling-by-measurement holds a significant advantage over its unconditional (nonselective) counterpart in the average-population-reduction rate. However, its clear weakness is the limited success probability of finding the detector in the measured state. In this work, we propose an optimized architecture to cool down a target resonator, initialized in a thermal state, using an interpolation of conditional and unconditional measurement strategies. An optimal measurement-interval $\tau_{\rm opt}^{u}$ for unconditional measurement is analytically derived for the first time; it is inversely proportional to the collective dominant Rabi frequency $\Omega_{d}$ , which is a function of the resonator’s population at the end of the previous round. A cooling algorithm globally optimized by reinforcement learning attains the maximum value of the cooperative cooling performance, an indicator of the comprehensive cooling efficiency for an arbitrary cooling-by-measurement architecture. In particular, the average population of the target resonator can be reduced by four orders of magnitude under only $16$ rounds of measurements, with a success probability of about $30\%$ .

I Introduction

Cooling mesoscopic and microscopic resonators down to their minimum-energy state is fundamental to observing the classical-quantum transition and to exploiting quantum advantage in nanoscience Milburn and Woolley (2008); Aspelmeyer et al. (2014). Ground-state preparation is a crucial and implicit step in quantum information processes, including but not limited to continuous-variable quantum computation Lloyd and Braunstein (1999); You and Nori (2011); Toyoda et al. (2015); Um et al. (2016), ultrahigh-precision measurement Bocko and Onofrio (1996); Caves et al. (1980), and quantum interface construction Sharma et al. (2018). Various strategies have been designed to reach an effective temperature as low as possible in trapped atom and ion systems Wilson-Rae et al. (2007); Gigan et al. (2006); Wang et al. (2011). In atomic laser cooling, popular strategies consist of laser Doppler cooling Sharma et al. (2018); Zhang et al. (2013); Epstein et al. (1995), resolved-sideband cooling, and electromagnetically induced transparency (EIT) cooling Morigi et al. (2000); Roos et al. (2000).

Beyond the paradigms that extract the system energy through dissipative channels based on the blue-shifted (anti-Stokes) sidebands, a versatile approach to cooling the mechanical states of motion is provided by the interaction with electromagnetic radiation or by quantum measurement. Back-action-evading measurement techniques that can surpass the standard quantum limit have attracted enormous interest. Through pulsed measurement processes in optomechanics Vanner et al. (2011, 2013); Bennett et al. (2016); Rossi et al. (2018); Brunelli et al. (2020), they can dramatically change the mechanical thermal occupation with no initial cooling. A genuine quantum mechanical cooling engine has been proposed Buffoni et al. (2019), whereby the fuel is the energy exchanged with an apparatus performing invasive quantum measurements.

Among these measurement-based techniques, quantum state engineering based on measurements on ancillary systems has recently been proposed in theory Nakazato et al. (2003); Li et al. (2011) and demonstrated in experiment Xu et al. (2014). Rather than directly detecting the target system, a net nonunitary propagator is realized by inserting projective measurements on the ground state of the detector system in between the joint unitary-evolution segments of target and detector. The induced postselection of the ground state of the target system (typically modelled as a resonator) reduces its high-energy distribution in the ensemble. In other words, the resonator is gradually steered by the outcomes of the conditional measurement (CM) to its ground state via dynamically filtering out its vibrational modes. Across proposals ranging from cooling nonlinear mechanical resonators Puebla et al. (2020), cooling by one-shot measurement Pyshkin et al. (2016), and expanding the cooling range by an external driving Yan and Jing (2021), to accelerating the cooling rate by optimized measurement intervals Yan and Jing (2022), an unexplored weakness of the CM strategies is their limited success probability inherited from the projective operation. The experimental overhead rises unavoidably because more samples are required in the ensemble. In sharp contrast to CM, the unconditional measurement (UM) strategy performs a nonselective and impulsive measurement in all the bases of the bare Hamiltonian of the detector at the end of each round of the joint evolution Zhang et al. (2019a); Harel and Kurizki (1996). It can realize cooling with unit success probability but suffers from a much slower cooling rate than CM, so that many more measurements are needed to approach the ground state. Balancing the cooling rate against the success probability by interpolating conditional and unconditional measurements therefore constitutes an optimization problem.

The integration of a small-scale quantum circuit with a classical optimizer, e.g., a neural network, provides a paradigm for designing sequences of parametrized quantum operations that are well suited to implementing robust and high-fidelity algorithms. Many reinforcement learning (RL) algorithms built on neural networks, which have demonstrated remarkable capabilities in board and video games Silver et al. (2016, 2017, 2018); Mnih et al. (2015), have attracted wide and timely interest in the study of quantum physics Carleo et al. (2019), such as quantum error correction Convy et al. (2022); Fösel et al. (2018), quantum simulation Bolens and Heyl (2021); Yuan et al. (2021), and quantum state preparation Guo et al. (2021); Bukov et al. (2018); Zhang et al. (2019b), to name a few. The proximal policy optimization (PPO) algorithm, a typical RL algorithm with favorable sample complexity, scalability, and robustness to hyperparameters, has proven to be a fruitful tool in quantum optimization control Sivak et al. (2022); Kim and Jeong (2021); Yao et al. (2021).

In this work, we propose a measurement-based cooling architecture as a hybrid sequence of UM and CM strategies. It involves a double optimization: for each step along the sequence, either UM or CM can be considerably improved by using a locally optimized measurement interval; and for the global efficiency of the sequence, its arrangement can be separately optimized through reinforcement learning. In particular, in a typical measurement-based cooling model, i.e., the Jaynes-Cummings (JC) model, where a mechanical resonator (the target system) is coupled to a qubit (the detector system), conditional and unconditional measurements are interleaved to cool down the resonator to its ground state. A feedback scheme is triggered upon calling a CM to determine, according to the measurement outcome, whether or not to launch the next round of evolution-and-measurement. Analogous to the optimized measurement-interval obtained for CM Yan and Jing (2022), we analytically derive an optimized interval for UM. The free-evolution intervals between any neighboring measurements, either UM or CM, can then be optimized for cooling. The global sequence of measurements, i.e., the order in which UM and CM are implemented, can be further optimized with reinforcement learning. The optimizer is fed with the cooperative cooling performance, a function of the average population of the resonator, the success probability of the detector in the measured subspace, and the fidelity of the resonator in the ground state. Eventually we find an optimal sequence holding an overwhelming advantage over all the others.

The rest of this work is structured as follows. We briefly revisit the general framework for the cooling protocols based on conditional and unconditional measurements in Secs. II.1 and II.2, respectively. In Sec. II.2, an analytical expression of the optimized measurement-interval is obtained for UM. In Sec. III, we introduce the interpolation diagram for the cooling architecture based on these two measurements, define the cooperative cooling performance to comprehensively quantify various strategies, and present the optimized result through reinforcement learning. The PPO algorithm and the optimal-control procedure are provided in Appendixes A and B, respectively. The whole work is discussed and summarized in Sec. IV.

II Conditional and unconditional measurements

II.1 Conditional Measurement

Consider a JC model used for cooling-by-measurement protocols, whose Hamiltonian in the rotating frame with respect to $H_{0}=\omega_{a}(|e\rangle\langle e|+a^{\dagger}a)$ reads

H=\Delta|e\rangle\langle e|+g(a^{\dagger}\sigma_{-}+a\sigma_{+}).

(1)

Here $\Delta\equiv\omega_{e}-\omega_{a}$ is the detuning between the level-spacing of the atomic detector $\omega_{e}$ and the frequency of the target resonator $\omega_{a}$ and $|\Delta|\ll\omega_{e},\omega_{a}$ . $g$ is the coupling strength between the detector (qubit) and the target resonator. Pauli matrices $\sigma_{-}$ and $\sigma_{+}$ denote the transition operators of the qubit; and $a$ ( $a^{\dagger}$ ) represents annihilation (creation) operator of the resonator.
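As an illustrative sketch (not part of the original analysis), the Hamiltonian in Eq. (1) can be represented numerically on a truncated Fock space. The function name `jc_hamiltonian`, the cutoff `n_max`, and the basis ordering (qubit first, with $|g\rangle$ at index 0) are assumptions made here for illustration; energies are in units of $\omega_{a}$ with $\hbar=1$.

```python
import numpy as np

def jc_hamiltonian(n_max, g, delta):
    """Rotating-frame JC Hamiltonian of Eq. (1) on a truncated Fock space.

    Basis ordering is qubit (x) resonator with |g> = index 0, |e> = index 1.
    n_max is the number of retained Fock states (a numerical assumption).
    """
    a = np.diag(np.sqrt(np.arange(1, n_max)), k=1)   # a|n> = sqrt(n)|n-1>
    sm = np.array([[0.0, 1.0], [0.0, 0.0]])          # sigma_- = |g><e|
    pe = np.array([[0.0, 0.0], [0.0, 1.0]])          # |e><e|
    H = delta * np.kron(pe, np.eye(n_max))
    H += g * (np.kron(sm, a.T) + np.kron(sm.T, a))   # a^dag sigma_- + a sigma_+
    return H

# Example with the parameters quoted later in the text (units of omega_a).
H = jc_hamiltonian(10, 0.04, 0.01)

# Each pair {|g,n>, |e,n-1>} forms a closed 2x2 block whose eigenvalues are
# delta/2 -/+ Omega_n, with Omega_n the n-photon Rabi frequency.
n = 3
idx = [n, 10 + n - 1]
block_levels = np.linalg.eigvalsh(H[np.ix_(idx, idx)])
omega_n = np.sqrt(0.04**2 * n + 0.01**2 / 4)
```

The block structure checked at the end is what allows the closed-form cooling coefficients used in the following analysis.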

The conditional measurement-based cooling is described by a sequence of piecewise joint evolutions of the resonator and the detector, that are interrupted by instantaneous projective measurements on a particular subspace of the detector. Initially, the resonator is in a thermal-equilibrium state $\rho_{a}^{\rm th}$ with a finite temperature $T$ , and the detector qubit starts from the ground state. Then the overall initial state has the form of $\rho_{\rm tot}(0)=|g\rangle\langle g|\otimes\rho_{a}^{\rm th}$ . To cool down the resonator, a conditional or selective measurement $M_{g}=|g\rangle\langle g|$ is implemented on the detector after the free-evolution with an interval $\tau$ , when the overall state becomes $\rho_{\rm tot}(\tau)=\exp(-iH\tau)\rho_{\rm tot}(0)\exp(iH\tau)$ . And then conditional measurement yields a probabilistic result:

\rho_{a}(\tau)=\frac{\langle g|\rho_{\rm tot}(\tau)|g\rangle}{{\rm Tr}\left[\langle g|\rho_{\rm tot}(\tau)|g\rangle\right]}.

(2)

Based on the time-dependence of the interval $\tau$ , conditional cooling protocols can be categorized into the equal-time-spacing and unequal-time-spacing strategies Li et al. (2011); Yan and Jing (2022). The unequal-time-spacing strategy has demonstrated a dramatic cooling efficiency by setting the measurement interval as the inverse of the time-evolved thermal Rabi frequency $\tau_{\rm opt}^{c}(t)=1/\Omega_{\rm th}(t)$ , where $\Omega_{\rm th}(t)\equiv g\sqrt{\bar{n}(t)}=g\sqrt{\sum_{n}np_{n}(t)}$ with $p_{n}(t)$ denoting the current population of the resonator on the Fock state $|n\rangle$ . To optimize the cooling performance, our cooling architecture in this work employs the unequal-time-spacing strategy. After $N$ rounds of free-evolution and instantaneous-measurement described by an ordered time sequence $\{\tau_{1}(t_{1}),\tau_{2}(t_{2}),\cdots,\tau_{N}(t_{N})\}$ with $t_{i>1}=\sum_{j=1}^{j=i-1}\tau_{j}$ and $\tau_{1}\equiv 1/[g\sqrt{{\rm Tr}(\hat{n}\rho_{a}^{\rm th})}]$ , the resonator state becomes

\rho_{a}\left(t=\sum_{i=1}^{N}\tau_{i}\right)=\frac{\sum_{n}\prod_{i=1}^{N}|\alpha_{n}(\tau_{i})|^{2}p_{n}|n\rangle\langle n|}{P_{g}(N)},

(3)

where

p_{n}=\frac{e^{-n\hbar\omega_{a}/k_{B}T}}{Z},\quad Z\equiv\frac{1}{1-e^{-\hbar\omega_{a}/k_{B}T}}

(4)

is the initial population,

P_{g}(N)=\sum_{n}\prod_{i=1}^{N}|\alpha_{n}(\tau_{i})|^{2}p_{n}

(5)

is the survival or success probability of CM, and

\left|\alpha_{n}(\tau_{i})\right|^{2}=\frac{\Omega_{n}^{2}-g^{2}n\sin^{2}(\Omega_{n}\tau_{i})}{\Omega_{n}^{2}}

(6)

is the cooling coefficient with $\Omega_{n}=\sqrt{g^{2}n+\Delta^{2}/4}$ denoting the $n$ -photon Rabi frequency. The cooling coefficient in Eq. (3) determines the average population

\bar{n}(t)={\rm Tr}\left[\hat{n}\rho_{a}(t)\right],\quad\hat{n}\equiv a^{\dagger}a,

(7)

by reshaping the population distributions over all the Fock states. Note that in Eq. (6) the cooling coefficient for $|0\rangle$ is unity, $|\alpha_{0}(\tau_{i})|^{2}=1$ , meaning that the ground-state population is always protected during the cooling process. The populations on the higher Fock states are gradually reduced by the accumulated factor $\prod_{i=1}^{N}|\alpha_{n}(\tau_{i})|^{2}<1$ with increasing $N$ unless $\sin(\Omega_{n}\tau_{i})=0$ , i.e., $\Omega_{n}\tau_{i}=j\pi$ with integer $j$ , in every round.
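The conditional protocol of Eqs. (3)-(7) can be sketched numerically as follows. This is a minimal illustration under stated assumptions: the Fock space is truncated at a hypothetical cutoff `n_max`, the thermal state is parametrized by $\bar{n}_{\rm th}$ through $e^{-\hbar\omega_{a}/k_{B}T}=\bar{n}_{\rm th}/(\bar{n}_{\rm th}+1)$, and the function names are ours rather than the authors'.

```python
import numpy as np

def thermal_populations(nbar_th, n_max):
    """Thermal Fock populations p_n of Eq. (4), parametrized by nbar_th."""
    q = nbar_th / (1.0 + nbar_th)            # q = exp(-hbar*omega_a / k_B T)
    p = (1.0 - q) * q ** np.arange(n_max)
    return p / p.sum()                       # renormalize after truncation

def alpha_sq(n, tau, g, delta=0.0):
    """Cooling coefficient |alpha_n(tau)|^2 of Eq. (6)."""
    omega_n = np.sqrt(g**2 * n + delta**2 / 4.0)
    safe = np.where(omega_n > 0, omega_n, 1.0)   # avoids 0/0 for n = 0, delta = 0
    return 1.0 - g**2 * n * np.sin(omega_n * tau) ** 2 / safe**2

def cm_cooling(nbar_th, g, n_rounds, delta=0.0, n_max=400):
    """Unequal-time-spacing CM cooling; returns (nbar, P_g, F) after each round."""
    n = np.arange(n_max)
    p = thermal_populations(nbar_th, n_max)
    success = 1.0
    history = []
    for _ in range(n_rounds):
        nbar = np.sum(n * p)
        tau = 1.0 / (g * np.sqrt(nbar))      # tau_opt^c = 1 / Omega_th(t)
        p = alpha_sq(n, tau, g, delta) * p   # unnormalized post-measurement weights
        prob = p.sum()                       # probability of finding |g> this round
        success *= prob                      # accumulates to P_g(N), Eq. (5)
        p = p / prob                         # normalization of Eq. (2)
        history.append((np.sum(n * p), success, p[0]))
    return history

history = cm_cooling(nbar_th=8.85, g=0.04, n_rounds=16)
```

Each entry of `history` holds the average population, the accumulated success probability, and the ground-state population after the corresponding round, so the trade-off between cooling and survival can be read off directly.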

II.2 Unconditional Measurement

Unconditional-measurement cooling is a statistical mixture of the conditional-measurement counterpart, by expanding the measurement subspace to the whole space of the detector system. After a period of joint unitary evolution under the Hamiltonian (1), the overall state can be written as

\rho_{\rm tot}(\tau)=\bigoplus_{n}p_{n}\begin{pmatrix}|\alpha_{n}(\tau)|^{2}&\chi_{n}(\tau)\\ \chi^{*}_{n}(\tau)&|\beta_{n}(\tau)|^{2}\end{pmatrix},

(8)

where

\begin{aligned}\chi_{n}(\tau)&\equiv\frac{-g\sqrt{n}\left[\Delta\sin^{2}(\Omega_{n}\tau)-i\Omega_{n}\sin(2\Omega_{n}\tau)\right]}{2\Omega_{n}^{2}},\\ |\beta_{n}(\tau)|^{2}&\equiv\frac{g^{2}n\sin^{2}(\Omega_{n}\tau)}{\Omega_{n}^{2}}.\end{aligned}

UM can be implemented by tracing out the degrees of freedom of the detector ${\rm Tr}_{d}[\rho_{\rm tot}(\tau)]$ . Then the resonator state reads

\rho_{a}(\tau)=\sum_{n\geq 0}\left[|\alpha_{n}(\tau)|^{2}p_{n}+|\beta_{n+1}(\tau)|^{2}p_{n+1}\right]|n\rangle\langle n|.

(9)

So that after a nonselective measurement, i.e., a measurement without recording the result, a population transfer in the target resonator occurs as

p_{n}\rightarrow|\alpha_{n}(\tau)|^{2}p_{n}+|\beta_{n+1}(\tau)|^{2}p_{n+1}.

(10)

In contrast to the CM strategy, which is characterized by a single cooling coefficient $|\alpha_{n}|^{2}$ in Eq. (6), the UM strategy depends subtly on an extra cooling coefficient $|\beta_{n}|^{2}$ . According to Eq. (10), the initial population on the ground state $p_{0}$ becomes $|\alpha_{0}(\tau)|^{2}p_{0}+|\beta_{1}(\tau)|^{2}p_{1}=p_{0}+|\beta_{1}(\tau)|^{2}p_{1}$ , indicating that a part of the population on the first excited state is transferred to the ground state. Under rounds of nonselective measurements, it is intuitive to expect that the populations on the higher states of the resonator keep moving to the lower states and eventually to the ground state. In practice, however, the cooling is constrained and can even be reversed, since the populations on certain excited states are fixed or enhanced when $|\alpha_{n}(\tau)|^{2}=1$ and $|\beta_{n+1}(\tau)|^{2}\geq 0$ , i.e., when $\Omega_{n}\tau=j\pi$ with integer $j$ [see Eq. (6)]. This problem can be addressed by employing the unequal-time-spacing strategy. A time-varying $\tau$ could ensure that populations on all excited states are gradually reduced.
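The population map in Eq. (10) is straightforward to sketch numerically. The snippet below is an illustration under our own assumptions (a truncated Fock space, the hypothetical helper name `um_step`, and a detector that starts each round in $|g\rangle$); it shows that one nonselective measurement conserves the total population while shifting weight toward lower Fock states.

```python
import numpy as np

def um_step(p, tau, g, delta=0.0):
    """Apply the population map of Eq. (10) for one unconditional measurement."""
    p = np.asarray(p, dtype=float)
    n = np.arange(len(p))
    omega = np.sqrt(g**2 * n + delta**2 / 4.0)
    safe = np.where(omega > 0, omega, 1.0)                     # avoids 0/0 at n = 0
    beta_sq = g**2 * n * np.sin(omega * tau) ** 2 / safe**2    # |beta_n|^2
    alpha_sq = 1.0 - beta_sq                                   # |alpha_n|^2, Eq. (6)
    new_p = alpha_sq * p                                       # part that stays on |n>
    new_p[:-1] += beta_sq[1:] * p[1:]                          # inflow from |n+1> to |n>
    return new_p

# Thermal state with nbar_th = 8.85, truncated at 400 Fock states.
levels = np.arange(400)
q = 8.85 / 9.85
p_th = (1.0 - q) * q**levels
p_th /= p_th.sum()

p_new = um_step(p_th, tau=12.5, g=0.04)
nbar_before = np.sum(levels * p_th)
nbar_after = np.sum(levels * p_new)
```

Because every transferred quantum moves exactly one level down, the average population after the step equals the previous one minus $\sum_{n\geq 1}|\beta_{n}(\tau)|^{2}p_{n}$, so a single UM from a detector in $|g\rangle$ never heats the resonator.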

Figure 1: Average population of the resonator after a single unconditional measurement as a function of the measurement-interval $\tau$ under various initial temperatures. (a) $T=0.01$ K, (b) $T=0.1$ K, (c) $T=1.0$ K and (d) $T=10$ K. The vertical black-dashed lines indicate the analytical results for the optimized intervals given by Eq. (14). The parameters for the blue-solid curves are set as $g=0.04\omega_{a}$ and $\Delta=0.01\omega_{a}$ .

The cooling efficiency of the UM strategy depends strongly on the choice of the interval $\tau$ between neighboring measurements, analogous to that of CM Yan and Jing (2022). This can be observed in Fig. 1 through the average population of the resonator $\bar{n}$ after one measurement on the detector. The $\tau$ -dependence of $\bar{n}$ demonstrates similar patterns across four orders of magnitude in the initial temperature. It is found that the average population declines gradually to a minimal point (the relative reduction becomes smaller with increasing temperature) at an optimized measurement-interval $\tau_{\rm opt}^{u}$ , then rebounds quickly and ends up with a random fluctuation around a value slightly lower than its initial thermal occupation $\bar{n}_{\rm th}\equiv{\rm Tr}(\hat{n}\rho_{a}^{\rm th})$ .

To make full use of the cooling strategy, it is desirable to find analytically the optimized interval $\tau_{\rm opt}^{u}$ as a function of the current state and the model parameters. By virtue of Eq. (9) and under the resonant condition $\Delta=0$ , the average population after a single unconditional measurement reads

\begin{aligned}\bar{n}&=\sum_{n\geq 0}n\left(p_{n}\cos^{2}\Omega_{n}\tau+p_{n+1}\sin^{2}\Omega_{n+1}\tau\right)\\ &=\eta+\frac{1}{2Z}\sum_{n\geq 0}ne^{-nx}(\cos 2\Omega_{n}\tau-e^{-x}\cos 2\Omega_{n+1}\tau),\end{aligned}

(11)

where $\eta\equiv(\bar{n}_{\rm th}+2\bar{n}^{2}_{\rm th})/(2+2\bar{n}_{\rm th})$ and $x\equiv\hbar\omega_{a}/k_{B}T$ . Since the weight function $ne^{-nx}$ in Eq. (11) is dominant around $n_{d}\equiv k_{B}T/\hbar\omega_{a}=1/x$ , the variables $\Omega_{n}$ and $\Omega_{n+1}$ could thus be expanded around $n=n_{d}$ . To the first order of $n-n_{d}$ , we have

\begin{aligned}&\cos 2\Omega_{n}\tau-e^{-x}\cos 2\Omega_{n+1}\tau\\ &\approx\cos 2\Omega_{d}\tau-e^{-x}\cos 2\Omega_{d+1}\tau+(n-n_{d})\\ &\quad\times\left(-\frac{\Omega_{d}\tau\sin 2\Omega_{d}\tau}{n_{d}}+e^{-x}\frac{\Omega_{d+1}\tau\sin 2\Omega_{d+1}\tau}{n_{d}+1}\right),\end{aligned}

where

\Omega_{d}\equiv g\sqrt{n_{d}},\quad\Omega_{d+1}\equiv g\sqrt{n_{d}+1}

(12)

define the dominant Rabi frequencies. Under the approximations that are appropriate for a moderate temperature, $e^{-x}=\bar{n}_{\rm th}/(\bar{n}_{\rm th}+1)\approx 1$ and $\Omega_{d+1}/(n_{d}+1)\approx\Omega_{d}/n_{d}$ , the average population in Eq. (11) can be expressed by

\bar{n}\approx\eta+\sin\Omega_{-}\tau\left(\bar{n}_{\rm th}\sin\Omega_{+}\tau+\eta^{\prime}\Omega_{d}\tau\cos\Omega_{+}\tau\right),

(13)

where $\Omega_{\pm}\equiv\Omega_{d+1}\pm\Omega_{d}$ and $\eta^{\prime}\equiv\bar{n}_{\rm th}(1+2\bar{n}_{\rm th}-n_{d})/n_{d}$ . Note that we have used the series sums $\sum_{n=0}^{\infty}ne^{-nx}=e^{x}/(e^{x}-1)^{2}$ and $\sum_{n=0}^{\infty}n^{2}e^{-nx}=e^{x}(1+e^{x})/(e^{x}-1)^{3}$ . Within a moderate time step $\tau$ , Eq. (13) depends predominantly on the high-frequency terms characterized by $\Omega_{+}$ . In the regime of $T\sim 0.1-10$ K, the term weighted by $\eta^{\prime}\Omega_{d}\tau$ overwhelms that weighted by $\bar{n}_{\rm th}$ . As evidenced by Fig. 1, this advantage grows with a larger $\tau_{\rm opt}^{u}$ as the initial or effective temperature of the resonator becomes lower. We can therefore focus on the last term in Eq. (13) to minimize $\bar{n}$ . Subsequently, $\cos\Omega_{+}\tau=-1$ yields

\tau_{\rm opt}^{u}=\frac{\pi}{\Omega_{d}+\Omega_{d+1}}.

(14)

This result can be extended to the near-resonant situation by modifying the definition of $\Omega_{d}$ in Eq. (12) to $\sqrt{g^{2}n_{d}+\Delta^{2}/4}$ . The vertical black-dashed lines in Fig. 1 denote the measurement-intervals optimized by Eq. (14). It is found that the analytical expression is well suited to estimating the position of the minimum of the average population over a wide range of temperatures. As demonstrated by both analytical and numerical results, a shorter measurement-interval is required to cool down a higher-temperature resonator. In JC-like models, coupling a qubit to a high-temperature resonator induces a faster transition between the ground state and the excited state of the qubit. Although a quick measurement would interrupt this process, an inappropriate time-interval would have a negative effect on cooling Zhang et al. (2019a).
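The comparison between Eq. (14) and a direct evaluation of Eq. (11) can be reproduced with a short numerical sketch. It assumes the resonant case $\Delta=0$, a hypothetical Fock cutoff `n_max`, a scan window chosen by us, and the relation $x=\ln(1+1/\bar{n}_{\rm th})$ that follows from $e^{-x}=\bar{n}_{\rm th}/(\bar{n}_{\rm th}+1)$; times are in units of $1/\omega_{a}$.

```python
import numpy as np

def nbar_after_um(tau, nbar_th, g, n_max=400):
    """Resonant average population after one UM, first line of Eq. (11)."""
    n = np.arange(n_max)
    q = nbar_th / (1.0 + nbar_th)
    p = (1.0 - q) * q ** np.arange(n_max + 1)       # p_0 ... p_{n_max}
    stay = p[:-1] * np.cos(g * np.sqrt(n) * tau) ** 2
    inflow = p[1:] * np.sin(g * np.sqrt(n + 1) * tau) ** 2
    return np.sum(n * (stay + inflow))

def tau_opt_um(nbar_th, g):
    """Analytical interval of Eq. (14) with n_d = 1/x."""
    x = np.log(1.0 + 1.0 / nbar_th)                 # x = hbar*omega_a / (k_B T)
    n_d = 1.0 / x
    return np.pi / (g * (np.sqrt(n_d) + np.sqrt(n_d + 1.0)))

nbar_th, g = 8.85, 0.04
taus = np.linspace(0.1, 40.0, 4000)                 # scan window (our choice)
curve = np.array([nbar_after_um(t, nbar_th, g) for t in taus])
tau_numeric = taus[np.argmin(curve)]                # brute-force minimizer
tau_analytic = tau_opt_um(nbar_th, g)               # Eq. (14)
```

Comparing `tau_numeric` with `tau_analytic` gives a quick check of how closely the analytical estimate tracks the numerical minimum for a given $\bar{n}_{\rm th}$; the scan window is deliberately restricted to short times so that late-time revivals are not included.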

Similar to the optimized interval $\tau_{\rm opt}^{c}(t)$ for the conditional-measurement strategy Yan and Jing (2022), here $\tau_{\rm opt}^{u}$ can also be updated by substituting the time-varying $\Omega_{d}$ and $\Omega_{d+1}$ into Eq. (14). The dominant Fock-state number $n_{d}$ determining $\Omega_{d}$ in Eq. (12) can be understood as a function of the effective temperature during the cooling procedure, which is determined uniquely by $\bar{n}(t)$ or $p_{n}(t)$ .
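For reference, the step from Eq. (11) to Eq. (13) can be retraced as follows. This is our own spelled-out version of the algebra, using only the approximations $e^{-x}\approx 1$ and $\Omega_{d+1}/(n_{d}+1)\approx\Omega_{d}/n_{d}$ stated above together with the thermal moments $\langle n\rangle=\bar{n}_{\rm th}$ and $\langle n^{2}\rangle=\bar{n}_{\rm th}(1+2\bar{n}_{\rm th})$.

```latex
\begin{aligned}
\cos 2\Omega_{d}\tau-\cos 2\Omega_{d+1}\tau &= 2\sin\Omega_{+}\tau\,\sin\Omega_{-}\tau,\\
\sin 2\Omega_{d+1}\tau-\sin 2\Omega_{d}\tau &= 2\cos\Omega_{+}\tau\,\sin\Omega_{-}\tau,\\
\frac{1}{2Z}\sum_{n\geq 0}ne^{-nx} &= \frac{\bar{n}_{\rm th}}{2},\qquad
\frac{1}{2Z}\sum_{n\geq 0}n(n-n_{d})e^{-nx} = \frac{\bar{n}_{\rm th}(1+2\bar{n}_{\rm th}-n_{d})}{2},\\
\bar{n} &\approx \eta+\bar{n}_{\rm th}\sin\Omega_{+}\tau\,\sin\Omega_{-}\tau
+\frac{\bar{n}_{\rm th}(1+2\bar{n}_{\rm th}-n_{d})}{n_{d}}\,\Omega_{d}\tau\cos\Omega_{+}\tau\,\sin\Omega_{-}\tau .
\end{aligned}
```

The prefactor of the last term is $\eta^{\prime}$, so the final line coincides with Eq. (13).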

III Measurement optimization

A thermal resonator can be steadily yet slowly cooled down by the unconditional measurement strategy equipped with the optimized measurement-interval in Eq. (14). This strategy is performed with unit probability in the absence of postselection over the measurement outcome. In sharp contrast, the conditional measurement strategy is a more efficient cooling protocol but has a poor success probability. It is therefore desirable to find an optimized sequence of measurements as a hybrid of UM and CM that performs well when both cooling efficiency and experimental overhead are taken into account. In this section, we present an algorithm that employs reinforcement learning to generate the optimized control sequence indicating when and which measurement is performed.

The performance of any cooling-by-measurement strategy can be characterized or evaluated by the cooling ratio $\bar{n}(t)/\bar{n}_{\rm th}$ , the success probability $P_{g}$ of the detector in the measured subspace, and the fidelity of the resonator in its ground state $F=\langle n=0|\rho_{a}(t)|n=0\rangle$ Li et al. (2011). To compare various interpolation sequences of UM and CM in cooling performance and to evaluate the figure of merit for the reinforcement learning, we can define a cooperative cooling quantifier as

\mathcal{C}=FP_{g}\log_{10}{\frac{\bar{n}_{\rm th}}{\bar{n}(t)}}.

(15)

Notably, the logarithm function is used to obtain a positive value of almost the same order of magnitude as $F$ and $P_{g}$ . Then $\bar{n}(t)$ , $P_{g}$ , and $F$ can be considered in a balanced manner. In fact, the average population can be reduced by several (normally fewer than $10$ ) orders of magnitude under an efficient cooling protocol. In the EIT cooling Feng et al. (2020), $\log_{10}[\bar{n}_{\rm th}/\bar{n}(t)]\sim(2,3)$ ; and in the resolved sideband cooling Triana et al. (2016), $\log_{10}[\bar{n}_{\rm th}/\bar{n}(t)]\sim(4,5)$ . Although Eq. (15) is not a unique choice, it captures the expectation that a lower average population, a larger success probability, and a higher ground-state fidelity yield a better cooling performance.
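A direct transcription of Eq. (15) makes the balance between the three quantifiers explicit. The function name and the input values below are hypothetical and chosen only to illustrate the trade-off; they are not results from this work.

```python
import math

def cooperative_cooling(fidelity, p_success, nbar_th, nbar):
    """Cooperative cooling quantifier C of Eq. (15)."""
    return fidelity * p_success * math.log10(nbar_th / nbar)

# Hypothetical outcomes: deeper cooling with rare success versus
# shallower cooling with frequent success.
c_deep = cooperative_cooling(fidelity=0.99, p_success=0.05, nbar_th=10.0, nbar=1e-4)
c_balanced = cooperative_cooling(fidelity=0.95, p_success=0.50, nbar_th=10.0, nbar=1e-2)
```

In this made-up comparison the second protocol scores higher although it cools by fewer orders of magnitude, which is the kind of compromise the quantifier is designed to reward.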

The RL-optimization is shown in Fig. 2(a). It consists of the “agent” part, based on a series of neural networks, and the “environment” part, which performs the cooling-by-measurement actions on the quantum system. In reinforcement learning, the agent has a set of parameters that are learned and trained using the data collected through its interaction with the environment. In our architecture, the agent chooses an action, i.e., a conditional or unconditional measurement, given the current state of the resonator. The environment then takes this action and returns the updated resonator-state $\rho_{a}$ and a “reward” $R$ after the measurement. The reward is generated by the indicator in Eq. (15) to estimate whether the action is good or bad, and it is used to update the agent’s parameters. During one “episode”, the agent interacts with the environment $N$ times, i.e., the number of measurements during the whole sequence, which is fixed from the beginning. A total reward is eventually counted, and the agent is trained to maximize the total reward through artificial episodes until it converges. The agent can then provide a realistic control sequence of the measurement strategies with their own (optimized) measurement intervals. The cooling-by-measurement sequence can be realized in the circuit model in Fig. 2(b). Rounds of free evolution and measurement are arranged successively. The evolution time between two neighboring measurements depends on the measurement strategy and the resonator state at the end of the previous round. We follow the PPO algorithm in the agent structure, the data-collecting methods, and the parameter updates, whose details can be found in Appendix A. The interpolation algorithm of UM and CM and the implementation of the measurement sequence are illustrated by a pseudocode in Appendix B.
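The “environment” part of this loop can be sketched as a small class with `reset` and `step` methods. This is a toy model under several assumptions of ours, not the authors' implementation: the resonant case $\Delta=0$, a truncated Fock space, a detector that is reset to $|g\rangle$ before every round, the effective temperature inferred from $\bar{n}(t)$ via $n_{d}=1/\ln(1+1/\bar{n})$, and a per-step reward equal to the current value of Eq. (15). The PPO agent itself is omitted.

```python
import numpy as np

class CoolingEnv:
    """Toy cooling environment: action 1 = CM, action 0 = UM (Delta = 0)."""

    def __init__(self, nbar_th=8.85, g=0.04, n_rounds=16, n_max=400):
        self.nbar_th, self.g, self.n_rounds = nbar_th, g, n_rounds
        self.n = np.arange(n_max)
        self.reset()

    def reset(self):
        q = self.nbar_th / (1.0 + self.nbar_th)
        p = (1.0 - q) * q ** self.n
        self.p = p / p.sum()          # thermal populations, Eq. (4)
        self.success = 1.0            # accumulated P_g
        self.round = 0
        return self.p.copy()

    def nbar(self):
        return float(np.sum(self.n * self.p))

    def step(self, action):
        g, nbar = self.g, self.nbar()
        if action == 1:               # conditional measurement
            tau = 1.0 / (g * np.sqrt(nbar))                  # tau_opt^c
            new_p = np.cos(g * np.sqrt(self.n) * tau) ** 2 * self.p
            prob = new_p.sum()
            self.success *= prob
            self.p = new_p / prob
        else:                         # unconditional measurement
            n_d = 1.0 / np.log(1.0 + 1.0 / nbar)             # assumed mapping
            tau = np.pi / (g * (np.sqrt(n_d) + np.sqrt(n_d + 1.0)))  # Eq. (14)
            beta = np.sin(g * np.sqrt(self.n) * tau) ** 2
            new_p = (1.0 - beta) * self.p
            new_p[:-1] += beta[1:] * self.p[1:]              # Eq. (10)
            self.p = new_p
        self.round += 1
        # Assumed reward: the quantifier of Eq. (15) evaluated after this round.
        reward = self.p[0] * self.success * np.log10(self.nbar_th / self.nbar())
        return self.p.copy(), float(reward), self.round >= self.n_rounds

def run_sequence(actions, **kwargs):
    """Play a fixed action list and return the final environment and total reward."""
    env = CoolingEnv(n_rounds=len(actions), **kwargs)
    total = 0.0
    for act in actions:
        _, reward, _ = env.step(act)
        total += reward
    return env, total

env_u, total_u = run_sequence([0] * 16)   # all unconditional
env_c, total_c = run_sequence([1] * 16)   # all conditional
```

An agent would replace the fixed action lists by a learned policy that maps the returned populations to the next action; any binary sequence of length $N$ can be scored with `run_sequence` in the same way.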

We consider cooling a gigahertz mechanical microresonator Ding et al. (2011); Chan et al. (2011) with various interpolation sequences of UM and CM. Using the resonator frequency $\omega_{a}=1.4$ GHz, the coupling strength between resonator and detector $g=0.04\omega_{a}$ , and the initial temperature of the resonator $T=0.1$ K, the average population starts from $\bar{n}_{\rm th}=8.85$ . The cooling performances under the sequences consisting entirely of UM and CM are shown by the blue-solid lines with circle markers and the orange-dotted lines in Figs. 3(a)-(d), labeled by $S_{u}$ and $S_{c}$ , respectively. It is found that under the conditional measurement strategy with $N=16$ , the average population $\bar{n}$ is reduced by five orders of magnitude [see Fig. 3(a)] and the ground-state fidelity exceeds $0.9999$ [see Fig. 3(b)], with a success probability below $10\%$ [see Fig. 3(c)]. In sharp contrast, under the same number of unconditional measurements, $\bar{n}$ is merely reduced to $\bar{n}\approx 3.36$ and the ground-state fidelity is $F\approx 0.78$ , albeit with unit success probability. In terms of all the individual quantifiers, i.e., $\bar{n}$ , $F$ , and $P_{g}$ , the results under the hybrid sequences of UM and CM labelled by $S_{k}$ , $k=1,2,4$ , lie between the two limits $S_{u}$ and $S_{c}$ . As illustrated by Figs. 3(e), (f), and (g), the three sequences start from a CM (indicated by $1$ ), switch to the UM (indicated by $0$ ) after $k$ rounds of free-evolution and measurement, switch back to CM after a single round, and then repeat the preceding arrangement. In comparison to the entire UM sequence, the interpolation with CM promotes the cooling efficiency in $\bar{n}$ . A larger $k$ gives rise to a smaller proportion of unconditional measurements and a lower probability $P_{g}$ that the detector remains in its measured subspace.
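The initial occupation and the regular sequences $S_{k}$ can be reproduced with a few lines. Two assumptions are made in this sketch: the quoted $\omega_{a}=1.4$ GHz is treated as an angular frequency of $1.4\times 10^{9}$ rad/s, which is the reading that recovers the quoted $\bar{n}_{\rm th}$ to within rounding, and $S_{k}$ is read as $k$ conditional measurements followed by one unconditional measurement, repeated. The function names are ours.

```python
import math

HBAR = 1.054571817e-34   # reduced Planck constant, J s
K_B = 1.380649e-23       # Boltzmann constant, J / K

def thermal_occupation(omega, temperature):
    """Bose-Einstein occupation for angular frequency omega (rad/s)."""
    x = HBAR * omega / (K_B * temperature)
    return 1.0 / math.expm1(x)

def regular_sequence(k, n_rounds):
    """S_k: k conditional rounds (1), then one unconditional round (0), repeated."""
    return [0 if i % (k + 1) == k else 1 for i in range(n_rounds)]

nbar_th = thermal_occupation(1.4e9, 0.1)   # assumed angular-frequency reading
s1 = regular_sequence(1, 16)
s2 = regular_sequence(2, 16)
s4 = regular_sequence(4, 16)
```

The lists `s1`, `s2`, and `s4` use the same binary encoding as Figs. 3(e)-(g), so they can be fed directly to any routine that applies CM for `1` and UM for `0`.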

With respect to the cooperative cooling performance given by Eq. (15), it is found [see Fig. 3(d)] that $\mathcal{C}(S_{1})>\mathcal{C}(S_{2})>\mathcal{C}(S_{4})>\mathcal{C}(S_{u})$ and yet $\mathcal{C}(S_{2})\approx\mathcal{C}(S_{c})$ . Thus a regular interpolation sequence can have a better cooperative cooling performance than the entire CM sequence, while the dependence of $\mathcal{C}$ for an arbitrary hybrid sequence on its proportion of CM strategies might not be monotonic. We are then motivated to find an optimized sequence by virtue of the PPO algorithm. A typical RL-optimized sequence of cooling strategies labeled by $S_{\rm opt}$ is described in Fig. 3(h). With a four-order reduction in the average population (close to the cooling efficiency provided by $S_{c}$ ), an almost unit ground-state fidelity $F>0.9999$ , and a moderate success probability $P_{g}\approx 30\%$ (much larger than that by $S_{c}$ ), the optimized sequence achieves a cooperative cooling performance $\mathcal{C}(S_{\rm opt})=2.73$ according to Eq. (15), exceeding that of all the other measurement sequences. We have therefore achieved a compromise between cooling rate and success probability through the reinforcement learning method, with much less overhead than a brute-force search. The RL-optimized sequence is not unique, yet the current results of $\bar{n}$ , $F$ , $P_{g}$ , and $\mathcal{C}$ in Fig. 3 are almost invariant as long as there is one CM in the first several rounds.

The RL-optimized algorithm applies to a wide range of initial temperatures of the resonator. Starting from various $\bar{n}_{\rm th}$ determined by the temperature, the average populations can be reduced by three to five orders of magnitude under the optimized measurement sequences, as demonstrated in Fig. 4(a). It is found that at a higher temperature, it is harder to suppress the transitions between the ground state and the excited states of the detector. Both the relative magnitude of the population reduction [see Fig. 4(a)] and the cooperative cooling performance [see Fig. 4(b)] then decrease monotonically as the temperature increases.

Similar to Fig. 3(h), here we present in Figs. 4(c), (d), (e), and (f) the optimized sequences fully determined by the PPO algorithm, which still outperform any regular interpolated sequence in the cooling quantifier $\mathcal{C}$ . Comparing these four sub-figures corresponding to various temperatures, it is interesting to find that a larger portion of the unconditional measurements is required along the optimized sequence for a higher temperature. It is consistent with the fact that under CM the success probability $P_{g}$ to find a detector in its ground state decreases exponentially with increasing temperature of the target resonator. Then more UMs are used to save a rapidly declining $P_{g}$ for obtaining a larger $\mathcal{C}$ . In addition, for $T>0.05$ K, RL-optimized sequence always starts from a conditional measurement, which is important to have a significant cooling rate for $\bar{n}$ during the first several rounds of the whole sequence.

The profiles shown in Fig. 3(h) and Figs. 4(d), (e), and (f) manifest a common pattern for all the RL-optimized sequences. It is found that in the first several rounds, when the resonator is normally in a comparatively high-temperature state, a conditional or projective measurement should be performed on the detector, and several unconditional measurements ensue before further cooling. This pattern is consistent with the variations of both energy and entropy in nonunitary controls Gherardini et al. (2020). The energy variation induced by a projective measurement is $k_{B}TH(\rho)$ on average, where $H(\rho)$ is the Shannon entropy of the whole system after a free evolution. At the end of the first round, a projective measurement is therefore desired to cut down as much energy as possible, and it is followed by several rounds of unconditional measurements to save the success probability. Thus in general we expect to see more UMs than CMs in the first several rounds and more CMs than UMs in the remaining rounds.

IV Discussion and conclusion

The preceding analysis of the cooling performance neglects environment-induced dissipation. We now consider the cooling process in an open-quantum-system scenario, in which the free evolution between neighboring measurements is influenced by a finite-temperature environment. The dynamics is then described by the master equation

\begin{aligned}\dot{\rho}(t)=&-i[H,\rho(t)]\\ &+\gamma(\bar{n}_{\rm th}+1)\mathcal{D}[a]\rho(t)+\gamma\bar{n}_{\rm th}\mathcal{D}[a^{\dagger}]\rho(t),\end{aligned}

(16)

where $\mathcal{D}[A]$ represents the Lindblad superoperator

\mathcal{D}[A]\rho(t)\equiv A\rho(t)A^{\dagger}-\frac{1}{2}\left\{A^{\dagger}A,\rho(t)\right\}.

(17)
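The right-hand side of Eqs. (16) and (17) can be written compactly with dense matrices. The sketch below is illustrative only: the helper names, the small Fock cutoff, and the test state are our assumptions, with $\hbar=1$ and energies in units of $\omega_{a}$. It evaluates $\dot{\rho}$ for the initial product state and can be passed to any standard ODE integrator.

```python
import numpy as np

def dissipator(A, rho):
    """Lindblad superoperator D[A]rho of Eq. (17)."""
    AdA = A.conj().T @ A
    return A @ rho @ A.conj().T - 0.5 * (AdA @ rho + rho @ AdA)

def master_rhs(rho, H, a, gamma, nbar_env):
    """Right-hand side of Eq. (16); nbar_env plays the role of nbar_th there."""
    unitary = -1j * (H @ rho - rho @ H)
    cooling = gamma * (nbar_env + 1.0) * dissipator(a, rho)
    heating = gamma * nbar_env * dissipator(a.conj().T, rho)
    return unitary + cooling + heating

# Operators on qubit (x) resonator with a small, assumed Fock cutoff.
n_max = 6
a = np.kron(np.eye(2), np.diag(np.sqrt(np.arange(1, n_max)), k=1))
sm = np.kron(np.array([[0.0, 1.0], [0.0, 0.0]]), np.eye(n_max))   # sigma_-
H = 0.01 * sm.conj().T @ sm + 0.04 * (a.conj().T @ sm + a @ sm.conj().T)

# Initial state |g><g| (x) truncated thermal state with nbar_th = 8.85.
q = 8.85 / 9.85
p = q ** np.arange(n_max)
p /= p.sum()
rho0 = np.kron(np.diag([1.0, 0.0]), np.diag(p)).astype(complex)

drho = master_rhs(rho0, H, a, gamma=1e-5, nbar_env=8.85)
```

Since both the commutator and each dissipator are traceless, `drho` has zero trace, which is a convenient sanity check before integrating the equation between measurements.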

In Figs. 5(a) and (b), we present the average population $\bar{n}$ and the cooperative cooling performance $\mathcal{C}$ , respectively, for various dissipation rates. To compare the cooling performances in the presence of thermal decoherence with the dissipation-free situation, we apply the RL-optimized sequence provided in Fig. 3(h). It is found that a larger dissipation rate gives rise to a weaker cooling performance in terms of both $\bar{n}$ and $\mathcal{C}$ , exhibiting the competition between the cooling effect of measurement and the accumulated heating effect of the environment. Nevertheless, for typical gigahertz mechanical resonators with $\gamma/\omega_{a}\sim 10^{-5}$ Ding et al. (2011); Chan et al. (2011), our optimized cooling protocol is still capable of reducing $\bar{n}$ by three orders of magnitude with about $N=10$ measurements [see the green dashed line in Fig. 5(a)]. Meanwhile, the asymptotic value of $\mathcal{C}$ still exceeds that of the CM strategy labeled by $S_{c}$ in Fig. 3(d).

Even in the absence of thermal decoherence, $\bar{n}$ does not keep decreasing. Fundamentally, the process is constrained by the third law of thermodynamics: absolute zero cannot be attained within a finite number of operations. Indeed, both $\tau_{\rm opt}^{c}$ and $\tau_{\rm opt}^{u}$ approach infinity as $\bar{n}\rightarrow 0$, which indicates that the whole cooling process has to be truncated at a maximum timescale.

We emphasize again that the preceding hybrid cooling sequences based on the conditional and unconditional measurements are optimized from both global and local perspectives. Globally, we use reinforcement learning to find the optimized order of UM and CM. The local optimization lies in selecting the measurement interval that yields a minimum average population $\bar{n}$ under a single measurement. For UM in Eq. (14), $\tau_{\rm opt}^{u}(t)$ does not have to be obtained by an instant feedback mechanism in a realistic experiment. The measurement sequence $\{\tau_{1}(t_{1}),\tau_{2}(t_{2}),\cdots,\tau_{N}(t_{N})\}$ can actually be obtained prior to the cooling measurements. $\tau_{1}(t_{1})$ depends on the initial population distribution $p_{n}$, and $\tau_{k}(t_{k})$, $k\geq 2$, can be calculated from the effective temperature that is uniquely determined by the dynamics of $p_{n}(t)$ through Eq. (12). In other words, we can avoid the feedback error and imprecision induced by detecting the resonator state during the experiment.
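The precomputation of the effective temperature from the populations can be sketched as follows. This is an illustrative example only: temperatures are expressed in the dimensionless unit $k_{B}T/(\hbar\omega_{a})$, the function names and the cutoff are assumptions, and the relation $T_{\rm eff}=\hbar\omega_{a}/[k_{B}\ln(1+1/\bar{n})]$ is the one used in Algorithm 1.

```python
import numpy as np

def thermal_populations(t, n_c):
    """Populations p_0..p_{n_c} of a thermal state.

    t is the dimensionless temperature k_B*T/(hbar*omega_a);
    n_c is an assumed Fock-state cutoff.
    """
    n = np.arange(n_c + 1)
    p = np.exp(-n / t)
    return p / p.sum()

def mean_population(p):
    """Average population nbar = sum_n n * p_n."""
    return float(np.dot(np.arange(len(p)), p))

def effective_temperature(nbar):
    """T_eff in units of hbar*omega_a/k_B: 1 / ln(1 + 1/nbar)."""
    return 1.0 / np.log1p(1.0 / nbar)

p = thermal_populations(t=2.0, n_c=200)
nbar = mean_population(p)
print(nbar, effective_temperature(nbar))
```

For a thermal input the round trip returns the input temperature; after a measurement the distribution is in general no longer thermal, and $T_{\rm eff}$ is then simply the temperature assigned to the current $\bar{n}$.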

In summary, we present an optimized cooling architecture based on a sequential arrangement of both conditional and unconditional measurements. We analyze and compare the advantages and disadvantages of CM and UM with respect to cooling rate and success probability. We obtain for the first time an analytical expression for the optimized unconditional measurement-interval $\tau_{\rm opt}^{u}=\pi/(\Omega_{d}+\Omega_{d+1})$, in parallel to that for the conditional measurement Yan and Jing (2022). Here the dominant Rabi frequency $\Omega_{d}$ depends on the dominantly populated Fock state of the resonator, with $n_{d}=k_{B}T/(\hbar\omega_{a})$, and on the coupling strength between the target and the detector. Combining the advantages of both measurement strategies gives rise to an optimized hybrid cooling algorithm assisted by reinforcement learning. It is justified by the cooperative cooling performance, which we define to quantify the comprehensive cooling efficiency of an arbitrary cooling-by-measurement strategy. Our work therefore pushes cooling-by-measurement to a previously unattained degree of efficiency and feasibility. It offers an appealing interdisciplinary application of quantum control and artificial intelligence.

Acknowledgments

We acknowledge financial support from the National Science Foundation of China (Grants No. 11974311 and No. U1801661).

Appendix A Proximal Policy Optimization

This appendix provides more details on proximal policy optimization (PPO), a typical reinforcement learning algorithm that we use to optimize the measurement sequence for cooling. The PPO algorithm follows an “actor-critic” framework, in which an actor receives the current state as input and outputs an action according to an updatable policy, and a critic evaluates this action to determine whether it should be encouraged or not. In the following, we do not distinguish between “actor” and “policy” for simplicity.

As shown in Fig. 6, the PPO algorithm has two actors (policies), $\pi_{\rm old}(\{\theta\})$ and $\pi_{\rm new}(\{\theta^{\prime}\})$, and one critic. Each of them is an agent constructed from neural networks (see Fig. 2) characterized by a set of parameters $\{\theta\}$. The two policies have the same structure in PPO. The old policy collects the sampling data through interaction with the environment, and the new one uses these data, stored in a buffer, to update $\{\theta\}$ to $\{\theta^{\prime}\}$. At first, the environment is initialized and delivers the state $s_{1}$ of the target system to the old policy $\pi_{\rm old}(\{\theta\})$; the old policy then generates an action $a_{1}$ according to $s_{1}$ and $\{\theta\}$. In the environment, the action $a_{1}$ is taken and the system state becomes $s_{2}$. The environment also provides a reward $R_{1}$ indicating how good the action is. The reward is generated by a task-specific reward function. At this stage, one interaction between the policy and the environment is completed and one set of “trajectory” or return $\{s_{1},a_{1},R_{1}\}$ is collected. $N$ trajectories are collected in one episode, where $N$ amounts to the number of actions required to complete the task. The critic takes both actions and states as input and outputs an advantage $A_{i}$ representing the contribution of the current action $a_{i}$ on the current state $s_{i}$. After a sufficient amount of data is collected, the critic can estimate the contributions of the actions as precisely as possible. Meanwhile, the new policy is updated by maximizing a clipped surrogate objective function $L^{\rm CLIP}(\{\theta\})$ Schulman et al. (2017) constructed from the advantages, and then transfers its parameters $\{\theta^{\prime}\}$ to the old one.
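For concreteness, the clipped surrogate objective of Schulman et al. (2017), $L^{\rm CLIP}=\mathbb{E}\left[\min\left(r_{i}A_{i},{\rm clip}(r_{i},1-\epsilon,1+\epsilon)A_{i}\right)\right]$ with the probability ratio $r_{i}=\pi_{\rm new}(a_{i}|s_{i})/\pi_{\rm old}(a_{i}|s_{i})$, can be evaluated on a batch as in the following sketch. The function name, the use of log-probabilities as input, and the clipping parameter $\epsilon=0.2$ are illustrative assumptions and are not claimed to be the settings used in this work.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Batch estimate of the PPO clipped surrogate objective L^CLIP.

    logp_new, logp_old: log-probabilities of the sampled actions under
    the new and old policies; advantages: critic estimates A_i.
    eps is the clipping parameter (0.2 is an assumed, common choice).
    """
    ratio = np.exp(np.asarray(logp_new, dtype=float)
                   - np.asarray(logp_old, dtype=float))
    adv = np.asarray(advantages, dtype=float)
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    # The elementwise minimum removes the incentive to move the ratio
    # outside [1 - eps, 1 + eps] in the direction favored by the advantage.
    return float(np.mean(np.minimum(unclipped, clipped)))

# Identical policies give the plain mean advantage.
print(clipped_surrogate([-0.7, -0.7], [-0.7, -0.7], [1.0, -0.5]))
```

In an actual training loop this quantity would be maximized with respect to the parameters of the new policy by a gradient-based optimizer, which is outside the scope of this sketch.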

In our application to optimizing the cooling sequence, the inputs describing the system state are defined as the populations in the Fock states, i.e., the diagonal elements of the density matrix $\rho_{a}$ of the target resonator,

s_{i}=\{p_{0}(t),p_{1}(t),p_{2}(t),\cdots,p_{n_{c}}(t)\},

(18)

where $n_{c}$ indicates the cutoff Fock state of the resonator. The actions executed in the environment are selected from the set

a_{i}\in\{0,1\},

(19)

where $0$ and $1$ represent unconditional and conditional measurements, respectively. The two policies are used to decide which type of measurement is to be performed according to the current state of the resonator. The environment represents the quantum devices performing the measurements, obtaining the updated states, and returning the rewards. When an action is selected and sent to the environment, the optimized measurement interval is calculated according to the measurement type. After a unitary evolution lasting $\tau_{\rm opt}\in\{\tau_{\rm opt}^{c},\tau_{\rm opt}^{u}\}$, a measurement is performed on the detector. Then the average population $\bar{n}$, the ground-state fidelity $F$, and the success probability $P_{g}$ are obtained to calculate the cooperative cooling performance $\mathcal{C}$ given by Eq. (15). The reward function is set as a fixed multiple of $\mathcal{C}$, $R_{i}(s_{i},a_{i})=100\times\mathcal{C}(s_{i},a_{i})$. After the measurement, the environment returns the resonator state and the reward to the policies. When the training is completed, a policy $\pi(\{\theta_{\rm opt}\})$ with a set of optimized parameters is obtained. The neural network equipped with $\{\theta_{\rm opt}\}$ can then be used to generate the optimized actions to cool down the current state.

Appendix B Generation of optimized sequence

Algorithm 1 RL-optimized cooling procedure

Input: Temperature $T$
Output: $S_{\rm opt}=\{\mathcal{M}_{1},\mathcal{M}_{2},\cdots,\mathcal{M}_{N}\}$ and $\mathcal{T}=\{\tau_{\rm opt}(t_{1}),\tau_{\rm opt}(t_{2}),\cdots,\tau_{\rm opt}(t_{N})\}$

Initialize the thermal state $\rho_{a}=\sum_{n}p_{n}|n\rangle\langle n|$ with $T$
Use PPO to train an optimized policy $\pi(\{\theta_{\rm opt}\})$
for $i=1,2,\cdots,N$ do
    Run the policy $\pi(\{\theta_{\rm opt}\})$ on $\rho_{a}$ to generate $\mathcal{M}_{i}$
    Attain $T_{\rm eff}=\hbar\omega_{a}/[k_{B}\ln(1+1/\bar{n})]$ from $\bar{n}(\rho_{a})$
    if $\mathcal{M}_{i}=0$ then
        Calculate $\tau_{\rm opt}(t_{i})=\pi/(\Omega_{d}+\Omega_{d+1})$ from $T_{\rm eff}$
        Get the cooling coefficients $|\alpha_{n}|^{2}$ and $|\beta_{n}|^{2}$
        UM: $\rho_{a}\leftarrow\sum_{n}(|\alpha_{n}|^{2}p_{n}+|\beta_{n+1}|^{2}p_{n+1})|n\rangle\langle n|$
    else if $\mathcal{M}_{i}=1$ then
        Calculate $\tau_{\rm opt}(t_{i})=1/\Omega_{\rm th}$ from $T_{\rm eff}$
        Get the cooling coefficients $|\alpha_{n}|^{2}$
        CM: $\rho_{a}\leftarrow\sum_{n}|\alpha_{n}|^{2}p_{n}|n\rangle\langle n|/(\sum_{n}|\alpha_{n}|^{2}p_{n})$
    end
end

Both the order of measurements and the sequence of measurement intervals can be regarded as the output of our RL-optimized cooling algorithm, as shown in Algorithm 1. The input information is the initial temperature $T$, which fully determines the thermal state of the resonator. Once the reinforcement learning process is completed by the PPO algorithm (see Appendix A), the parameters $\{\theta\}$ of the neural network (policy $\pi$) have been trained to select, for the current state, the one of the two measurement strategies that maximizes the cooperative cooling performance. The cooling procedure is then formally launched. We run the policy $\pi(\{\theta_{\rm opt}\})$ on $\rho_{a}(0)=\rho_{a}^{\rm th}$, which generates the first measurement strategy $\mathcal{M}_{1}$, $\mathcal{M}_{1}\in\{0,1\}$. Here $0$ and $1$ indicate UM and CM, respectively. If $\mathcal{M}_{1}=0$, then $\tau_{\rm opt}(t_{1})=\tau_{\rm opt}^{u}$ in Eq. (14), which can be obtained from the effective temperature $T_{\rm eff}$ of the resonator (initially $T_{\rm eff}=T$, and afterwards it is updated by the state at the end of the last round). Subsequently, the cooling coefficients $|\alpha_{n}|^{2}$ and $|\beta_{n}|^{2}$ are calculated and the resonator state is modified according to Eq. (9). Otherwise, if $\mathcal{M}_{1}=1$, a conditional measurement is implemented after an interval $\tau_{\rm opt}(t_{1})=\tau_{\rm opt}^{c}=1/\Omega_{\rm th}(t)$ and the resonator state is modified according to Eq. (3). At the end of this round, one can calculate $T_{\rm eff}$ from the current $p_{n}(t)$ and then go to the next round. After $N$ iterations, the optimized measurement sequence characterized by $S_{\rm opt}=\{\mathcal{M}_{1},\mathcal{M}_{2},\cdots,\mathcal{M}_{N}\}$ and $\mathcal{T}=\{\tau_{\rm opt}(t_{1}),\tau_{\rm opt}(t_{2}),\cdots,\tau_{\rm opt}(t_{N})\}$ appears as described in Fig. 3(h) and Figs. 4(c), (d), (e), (f), respectively. In practical implementations, the measurements specified by $S_{\rm opt}$ and $\mathcal{T}$ can be applied to the detector without knowledge of the target-resonator state.
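The population updates in the two branches of Algorithm 1 can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the code used for the results of this work: the cooling coefficients are taken in a resonant Jaynes-Cummings-like form $|\alpha_{n}|^{2}=\cos^{2}(\Omega_{n}\tau)$, $|\beta_{n}|^{2}=\sin^{2}(\Omega_{n}\tau)$ with $\Omega_{n}=g\sqrt{n}$, which is an assumption made here for definiteness, and the action list and intervals are supplied by hand rather than by a trained policy or by Eq. (14). All function names are hypothetical.

```python
import numpy as np

def unconditional_update(p, alpha2, beta2):
    """UM: p_n <- |alpha_n|^2 p_n + |beta_{n+1}|^2 p_{n+1}."""
    shifted = np.append(beta2[1:] * p[1:], 0.0)
    return alpha2 * p + shifted

def conditional_update(p, alpha2):
    """CM: p_n <- |alpha_n|^2 p_n / P_g; returns (new p, P_g)."""
    unnormalized = alpha2 * p
    p_g = unnormalized.sum()
    return unnormalized / p_g, p_g

def assumed_coefficients(dim, g, tau):
    """ASSUMPTION: resonant form with Omega_n = g*sqrt(n)."""
    omega = g * np.sqrt(np.arange(dim))
    return np.cos(omega * tau) ** 2, np.sin(omega * tau) ** 2

def run_sequence(p0, actions, intervals, g=1.0):
    """Apply a given list of actions (0: UM, 1: CM) and intervals."""
    p = np.array(p0, dtype=float)
    success = 1.0
    for m, tau in zip(actions, intervals):
        alpha2, beta2 = assumed_coefficients(p.size, g, tau)
        if m == 0:
            p = unconditional_update(p, alpha2, beta2)
        else:
            p, p_g = conditional_update(p, alpha2)
            success *= p_g  # overall success probability of the CMs
    return p, success

# Hand-picked (assumed) sequence on a thermal state with nbar close to 1
n = np.arange(40)
p0 = 0.5 ** n
p0 /= p0.sum()
p, success = run_sequence(p0, actions=[1, 0, 0], intervals=[0.7, 0.5, 0.4])
print(np.dot(n, p), success)
```

Under the assumed coefficients the UM step conserves the total population and cannot increase $\bar{n}$, whereas the CM step renormalizes the state and costs a factor $P_{g}$ in success probability, which is the trade-off the RL agent balances.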

References

Milburn and Woolley (2008) G. J. Milburn and M. J. Woolley, Quantum nanoscience, Contemp. Phys. 49, 413 (2008).
Aspelmeyer et al. (2014) M. Aspelmeyer, T. J. Kippenberg, and F. Marquardt, Cavity optomechanics, Rev. Mod. Phys. 86, 1391 (2014).
Lloyd and Braunstein (1999) S. Lloyd and S. L. Braunstein, Quantum computation over continuous variables, Phys. Rev. Lett. 82, 1784 (1999).
You and Nori (2011) J. Q. You and F. Nori, Atomic physics and quantum optics using superconducting circuits, Nature 474, 589 (2011).
Toyoda et al. (2015) K. Toyoda, R. Hiji, A. Noguchi, and S. Urabe, Hong–Ou–Mandel interference of two phonons in trapped ions, Nature 527, 74 (2015).
Um et al. (2016) M. Um, J. Zhang, D. Lv, Y. Lu, S. An, J.-N. Zhang, H. Nha, M. S. Kim, and K. Kim, Phonon arithmetic in a trapped ion system, Nat. Commun. 7, 11410 (2016).
Bocko and Onofrio (1996) M. F. Bocko and R. Onofrio, On the measurement of a weak classical force coupled to a harmonic oscillator: experimental progress, Rev. Mod. Phys. 68, 755 (1996).
Caves et al. (1980) C. M. Caves, K. S. Thorne, R. W. P. Drever, V. D. Sandberg, and M. Zimmermann, On the measurement of a weak classical force coupled to a quantum-mechanical oscillator. I. Issues of principle, Rev. Mod. Phys. 52, 341 (1980).
Sharma et al. (2018) S. Sharma, Y. M. Blanter, and G. E. W. Bauer, Optical cooling of magnons, Phys. Rev. Lett. 121, 087205 (2018).
Wilson-Rae et al. (2007) I. Wilson-Rae, N. Nooshi, W. Zwerger, and T. J. Kippenberg, Theory of ground state cooling of a mechanical oscillator using dynamical backaction, Phys. Rev. Lett. 99, 093901 (2007).
Gigan et al. (2006) S. Gigan, H. R. Böhm, M. Paternostro, F. Blaser, G. Langer, J. B. Hertzberg, K. C. Schwab, D. Bäuerle, M. Aspelmeyer, and A. Zeilinger, Self-cooling of a micromirror by radiation pressure, Nature 444, 67 (2006).
Wang et al. (2011) X. Wang, S. Vinjanampathy, F. W. Strauch, and K. Jacobs, Ultraefficient cooling of resonators: Beating sideband cooling with quantum control, Phys. Rev. Lett. 107, 177204 (2011).
Zhang et al. (2013) J. Zhang, D. Li, R. Chen, and Q. Xiong, Laser cooling of a semiconductor by 40 kelvin, Nature 493, 504 (2013).
Epstein et al. (1995) R. I. Epstein, M. I. Buchwald, B. C. Edwards, T. R. Gosnell, and C. E. Mungan, Observation of laser-induced fluorescent cooling of a solid, Nature 377, 500 (1995).
Morigi et al. (2000) G. Morigi, J. Eschner, and C. H. Keitel, Ground state laser cooling using electromagnetically induced transparency, Phys. Rev. Lett. 85, 4458 (2000).
Roos et al. (2000) C. F. Roos, D. Leibfried, A. Mundt, F. Schmidt-Kaler, J. Eschner, and R. Blatt, Experimental demonstration of ground state laser cooling with electromagnetically induced transparency, Phys. Rev. Lett. 85, 5547 (2000).
Vanner et al. (2011) M. R. Vanner, I. Pikovski, G. D. Cole, M. S. Kim, C. Brukner, K. Hammerer, G. J. Milburn, and M. Aspelmeyer, Pulsed quantum optomechanics, Proc. Natl. Acad. Sci. 108, 16182 (2011).
Vanner et al. (2013) M. R. Vanner, J. Hofer, G. D. Cole, and M. Aspelmeyer, Cooling-by-measurement and mechanical state tomography via pulsed optomechanics, Nat. Commun. 4, 2295 (2013).
Bennett et al. (2016) J. S. Bennett, K. Khosla, L. S. Madsen, M. R. Vanner, H. Rubinsztein-Dunlop, and W. P. Bowen, A quantum optomechanical interface beyond the resolved sideband limit, New J. Phys. 18, 053030 (2016).
Rossi et al. (2018) M. Rossi, D. Mason, J. Chen, Y. Tsaturyan, and A. Schliesser, Measurement-based quantum control of mechanical motion, Nature 563, 53 (2018).
Brunelli et al. (2020) M. Brunelli, D. Malz, A. Schliesser, and A. Nunnenkamp, Stroboscopic quantum optomechanics, Phys. Rev. Research 2, 023241 (2020).
Buffoni et al. (2019) L. Buffoni, A. Solfanelli, P. Verrucchi, A. Cuccoli, and M. Campisi, Quantum measurement cooling, Phys. Rev. Lett. 122, 070603 (2019).
Nakazato et al. (2003) H. Nakazato, T. Takazawa, and K. Yuasa, Purification through Zeno-like measurements, Phys. Rev. Lett. 90, 060401 (2003).
Li et al. (2011) Y. Li, L.-A. Wu, Y.-D. Wang, and L.-P. Yang, Nondeterministic ultrafast ground-state cooling of a mechanical resonator, Phys. Rev. B 84, 094502 (2011).
Xu et al. (2014) J.-S. Xu, M.-H. Yung, X.-Y. Xu, S. Boixo, Z.-W. Zhou, C.-F. Li, A. Aspuru-Guzik, and G.-C. Guo, Demon-like algorithmic quantum cooling and its realization with quantum optics, Nat. Photonics 8, 113 (2014).
Puebla et al. (2020) R. Puebla, O. Abah, and M. Paternostro, Measurement-based cooling of a nonlinear mechanical resonator, Phys. Rev. B 101, 245410 (2020).
Pyshkin et al. (2016) P. V. Pyshkin, D.-W. Luo, J. Q. You, and L.-A. Wu, Ground-state cooling of quantum systems via a one-shot measurement, Phys. Rev. A 93, 032120 (2016).
Yan and Jing (2021) J.-s. Yan and J. Jing, External-level assisted cooling by measurement, Phys. Rev. A 104, 063105 (2021).
Yan and Jing (2022) J.-s. Yan and J. Jing, Simultaneous cooling by measuring one ancillary system, Phys. Rev. A 105, 052607 (2022).
Zhang et al. (2019a) J.-M. Zhang, J. Jing, L.-A. Wu, L.-G. Wang, and S.-Y. Zhu, Measurement-induced cooling of a qubit in structured environments, Phys. Rev. A 100, 022107 (2019a).
Harel and Kurizki (1996) G. Harel and G. Kurizki, Fock-state preparation from thermal cavity fields by measurements on resonant atoms, Phys. Rev. A 54, 5410 (1996).
Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529, 484 (2016).
Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, Mastering the game of Go without human knowledge, Nature 550, 354 (2017).
Silver et al. (2018) D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science 362, 1140 (2018).
Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, Human-level control through deep reinforcement learning, Nature 518, 529 (2015).
Carleo et al. (2019) G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, Machine learning and the physical sciences, Rev. Mod. Phys. 91, 045002 (2019).
Convy et al. (2022) I. Convy, H. Liao, S. Zhang, S. Patel, W. P. Livingston, H. N. Nguyen, I. Siddiqi, and K. B. Whaley, Machine learning for continuous quantum error correction on superconducting qubits, New J. Phys. 24, 063019 (2022).
Fösel et al. (2018) T. Fösel, P. Tighineanu, T. Weiss, and F. Marquardt, Reinforcement learning with neural networks for quantum feedback, Phys. Rev. X 8, 031084 (2018).
Bolens and Heyl (2021) A. Bolens and M. Heyl, Reinforcement learning for digital quantum simulation, Phys. Rev. Lett. 127, 110502 (2021).
Yuan et al. (2021) X. Yuan, J. Sun, J. Liu, Q. Zhao, and Y. Zhou, Quantum simulation with hybrid tensor networks, Phys. Rev. Lett. 127, 040501 (2021).
Guo et al. (2021) S.-F. Guo, F. Chen, Q. Liu, M. Xue, J.-J. Chen, J.-H. Cao, T.-W. Mao, M. K. Tey, and L. You, Faster state preparation across quantum phase transition assisted by reinforcement learning, Phys. Rev. Lett. 126, 060401 (2021).
Bukov et al. (2018) M. Bukov, A. G. R. Day, D. Sels, P. Weinberg, A. Polkovnikov, and P. Mehta, Reinforcement learning in different phases of quantum control, Phys. Rev. X 8, 031086 (2018).
Zhang et al. (2019b) X.-M. Zhang, Z. Wei, R. Asad, X.-C. Yang, and X. Wang, When does reinforcement learning stand out in quantum control? a comparative study on state preparation, npj Quantum Inf. 5, 85 (2019b).
Sivak et al. (2022) V. V. Sivak, A. Eickbusch, H. Liu, B. Royer, I. Tsioutsios, and M. H. Devoret, Model-free quantum control with reinforcement learning, Phys. Rev. X 12, 011059 (2022).
Kim and Jeong (2021) D.-K. Kim and H. Jeong, Deep reinforcement learning for feedback control in a collective flashing ratchet, Phys. Rev. Research 3, L022002 (2021).
Yao et al. (2021) J. Yao, L. Lin, and M. Bukov, Reinforcement learning for many-body ground-state preparation inspired by counterdiabatic driving, Phys. Rev. X 11, 031070 (2021).
Feng et al. (2020) L. Feng, W. L. Tan, A. De, A. Menon, A. Chu, G. Pagano, and C. Monroe, Efficient ground-state cooling of large trapped-ion chains with an electromagnetically-induced-transparency tripod scheme, Phys. Rev. Lett. 125, 053001 (2020).
Triana et al. (2016) J. F. Triana, A. F. Estrada, and L. A. Pachón, Ultrafast optimal sideband cooling under non-Markovian evolution, Phys. Rev. Lett. 116, 183602 (2016).
Ding et al. (2011) L. Ding, C. Baker, P. Senellart, A. Lemaitre, S. Ducci, G. Leo, and I. Favero, Wavelength-sized GaAs optomechanical resonators with gigahertz frequency, Appl. Phys. Lett. 98, 113108 (2011).
Chan et al. (2011) J. Chan, T. P. M. Alegre, A. H. Safavi-Naeini, J. T. Hill, A. Krause, S. Gröblacher, M. Aspelmeyer, and O. Painter, Laser cooling of a nanomechanical oscillator into its quantum ground state, Nature 478, 89 (2011).
Gherardini et al. (2020) S. Gherardini, F. Campaioli, F. Caruso, and F. C. Binder, Stabilizing open quantum batteries by sequential measurements, Phys. Rev. Research 2, 013095 (2020).
Schulman et al. (2017) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal policy optimization algorithms, arXiv:1707.06347 (2017).