Optimizing measurement-based cooling by reinforcement learning
Abstract
Conditional cooling-by-measurement holds a significant advantage over its unconditional (nonselective) counterpart in the average-population-reduction rate. However, it has a clear weakness: the limited success probability of finding the detector in the measured state. In this work, we propose an optimized architecture to cool down a target resonator, which is initialized in a thermal state, using an interpolation of conditional and unconditional measurement strategies. An optimal measurement interval for unconditional measurement is analytically derived for the first time; it is inversely proportional to the collective dominant Rabi frequency, which is a function of the resonator’s population at the end of the previous round. A cooling algorithm under global optimization by reinforcement learning yields the maximum value of the cooperative cooling performance, an indicator that measures the comprehensive cooling efficiency of an arbitrary cooling-by-measurement architecture. In particular, the average population of the target resonator under only rounds of measurements can be reduced by four orders of magnitude with a success probability about .
I Introduction
Cooling mesoscopic and microscopic resonators down to their minimum-energy state is fundamental to observing the classical-quantum transition and to exploiting quantum advantage in nanoscience Milburn and Woolley (2008); Aspelmeyer et al. (2014). Ground-state preparation is a crucial and implicit step in quantum information processes, including but not limited to continuous-variable quantum computation Lloyd and Braunstein (1999); You and Nori (2011); Toyoda et al. (2015); Um et al. (2016), ultrahigh-precision measurement Bocko and Onofrio (1996); Caves et al. (1980), and the construction of quantum interfaces Sharma et al. (2018). Various strategies have been designed to reach an effective temperature as low as possible in trapped-atom and trapped-ion systems Wilson-Rae et al. (2007); Gigan et al. (2006); Wang et al. (2011). In atomic laser cooling, popular strategies consist of laser Doppler cooling Sharma et al. (2018); Zhang et al. (2013); Epstein et al. (1995), resolved-sideband cooling, and electromagnetically induced transparency (EIT) cooling Morigi et al. (2000); Roos et al. (2000).
Beyond the paradigms that extract the system energy through dissipative channels based on the blue-shifted (anti-Stokes) sidebands, a versatile approach to cooling the mechanical states of motion is provided by the interaction with electromagnetic radiation or by quantum measurement. Back-action-evading measurement techniques that can surpass the standard quantum limit have attracted enormous interest. Through pulsed measurement processes in optomechanics Vanner et al. (2011, 2013); Bennett et al. (2016); Rossi et al. (2018); Brunelli et al. (2020), they can dramatically change the mechanical thermal occupation with no initial cooling. A genuinely quantum-mechanical cooling engine has also been proposed Buffoni et al. (2019), whereby the fuel is the energy exchanged with an apparatus performing invasive quantum measurements.
Among these measurement-based techniques, quantum state engineering based on measurements on ancillary systems has been proposed recently in theory Nakazato et al. (2003); Li et al. (2011) and demonstrated in experiment Xu et al. (2014). Rather than directly detecting the target system, a net nonunitary propagator is realized by inserting projective measurements on the ground state of the detector system in between the joint unitary-evolution segments of target and detector. The induced postselection of the ground state of the target system (typically modelled as a resonator) reduces its high-energy distribution in the ensemble. In other words, the resonator is gradually steered by the outcomes of the conditional measurement (CM) to its ground state via dynamically filtering out its vibrational modes. Across applications ranging from cooling nonlinear mechanical resonators Puebla et al. (2020), cooling by one-shot measurement Pyshkin et al. (2016), and expanding the cooling range by an external driving Yan and Jing (2021), to accelerating the cooling rate by optimized measurement intervals Yan and Jing (2022), an unexplored weakness of the CM strategies is their limited success probability, which is inherited from the projective operation. The experimental overhead unavoidably rises, because more samples are required in the ensemble. In sharp contrast to CM, the unconditional measurement (UM) strategy performs a nonselective and impulsive measurement in all the bases of the bare Hamiltonian of the detector at the end of each round of the joint evolution Zhang et al. (2019a); Harel and Kurizki (1996). It realizes cooling with a unit success probability but suffers from a much slower cooling rate than CM, and therefore requires many more measurements to reach the ground state. Balancing the cooling rate against the success probability through an interpolated configuration of conditional and unconditional measurements constitutes an optimization problem.
The integration of a small-scale quantum circuit with a classical optimizer, e.g., a neural network, provides a paradigm for designing sequences of parametrized quantum operations that are well suited to implementing robust and high-fidelity algorithms. Many reinforcement learning (RL) algorithms built on neural networks, which have demonstrated remarkable capabilities in board and video games Silver et al. (2016, 2017, 2018); Mnih et al. (2015), have attracted wide and timely interest in the study of quantum physics Carleo et al. (2019), such as quantum error correction Convy et al. (2022); Fösel et al. (2018), quantum simulation Bolens and Heyl (2021); Yuan et al. (2021), and quantum state preparation Guo et al. (2021); Bukov et al. (2018); Zhang et al. (2019b), to name a few. The proximal policy optimization (PPO) algorithm, a typical RL algorithm with favorable sample complexity, scalability, and robustness to hyperparameters, has proven to be a fruitful tool in quantum optimization control Sivak et al. (2022); Kim and Jeong (2021); Yao et al. (2021).
In this work, we propose a measurement-based cooling architecture as a hybrid sequence of UM and CM strategies. It involves a double optimization: for each step along the sequence, either UM or CM can be considerably improved by using a locally optimized measurement interval; and for the global efficiency of the sequence, its arrangement can be separately optimized through reinforcement learning. In particular, in a typical measurement-based cooling model, i.e., the Jaynes-Cummings (JC) model, where a mechanical resonator (the target system) is coupled to a qubit (the detector system), conditional and unconditional measurements are interleaved to cool down the resonator to its ground state. A feedback scheme is triggered upon calling a CM to determine whether or not to launch the next round of evolution-and-measurement according to the measurement outcome. Analogous to the optimized measurement-interval obtained for CM Yan and Jing (2022), we analytically derive an optimized interval for UM. The free-evolution intervals between any neighboring measurements, either UM or CM, can then be optimized for cooling. The global sequence of measurements, i.e., the order in which UM and CM are implemented, can be further optimized with reinforcement learning. The optimizer is fed with the cooperative cooling performance, a function of the average population of the resonator, the success probability of the detector in the measured subspace, and the fidelity of the resonator in the ground state. Eventually we find an optimal sequence holding an overwhelming advantage over all the others.
The rest of this work is structured as follows. We briefly revisit the general framework for the cooling protocols based on conditional and unconditional measurements in Secs. II.1 and II.2, respectively. In Sec. II.2, an analytical expression of the optimized measurement-interval is obtained for UM. In Sec. III, we introduce the interpolation diagram for the cooling architecture based on these two measurements, define the cooperative cooling performance to comprehensively quantify various strategies, and present the optimized result through reinforcement learning. The PPO algorithm and the optimal-control procedure are provided in Appendixes A and B, respectively. The whole work is discussed and summarized in Sec. IV.
II Conditional and unconditional measurements
II.1 Conditional Measurement
Consider a JC model used for cooling-by-measurement protocols, whose Hamiltonian in the rotating frame with respect to reads
(1) |
Here is the detuning between the level-spacing of the atomic detector and the frequency of the target resonator and . is the coupling strength between the detector (qubit) and the target resonator. Pauli matrices and denote the transition operators of the qubit; and () represents annihilation (creation) operator of the resonator.
The conditional measurement-based cooling is described by a sequence of piecewise joint evolutions of the resonator and the detector, which are interrupted by instantaneous projective measurements on a particular subspace of the detector. Initially, the resonator is in a thermal-equilibrium state with a finite temperature , and the detector qubit starts from the ground state. Then the overall initial state has the form of . To cool down the resonator, a conditional or selective measurement is implemented on the detector after the free evolution with an interval , when the overall state becomes . The conditional measurement then yields a probabilistic result:
(2) |
Based on the time-dependence of the interval , conditional cooling protocols can be categorized into the equal-time-spacing and unequal-time-spacing strategies Li et al. (2011); Yan and Jing (2022). The unequal-time-spacing strategy has demonstrated a dramatic cooling efficiency by setting the measurement interval as the inverse of the time-evolved thermal Rabi frequency , where with denoting the current population of the resonator on the Fock state . To optimize the cooling performance, our cooling architecture in this work employs the unequal-time-spacing strategy. After rounds of free-evolution and instantaneous-measurement described by an ordered time sequence with and , the resonator state becomes
(3) |
where
(4) |
is the initial population,
(5) |
is the survival or success probability of CM, and
(6) |
is the cooling coefficient with denoting the -photon Rabi frequency. The cooling coefficient in Eq. (3) determines the average population
(7) |
by reshaping the population distributions over all the Fock states. Note that in Eq. (6) the cooling coefficient for is unity, , meaning that the ground-state population is always protected during the cooling process. The populations on the Fock states with high occupation numbers are gradually reduced by with increasing unless or with integer .
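The filtering action of repeated conditional measurements can be illustrated numerically. The following is a minimal sketch, not the authors' implementation. It assumes the resonant case, in which the cooling coefficient of one round reduces to cos^2(g sqrt(n) tau), and it assumes that the unequal-time-spacing rule takes the form tau = 1/(g sqrt(mean population)). The function names, the cutoff, and the values g = 1 and a thermal occupation of 5 are hypothetical choices made only for illustration.

```python
import numpy as np

def thermal_populations(n_th, cutoff):
    """Truncated, renormalized thermal distribution with mean occupation ~ n_th."""
    n = np.arange(cutoff + 1)
    p = (n_th / (1.0 + n_th)) ** n / (1.0 + n_th)
    return p / p.sum()

def conditional_measurement(p, g, tau):
    """One CM round, assuming resonance: p_n is filtered by cos^2(g*sqrt(n)*tau)."""
    n = np.arange(p.size)
    filtered = np.cos(g * np.sqrt(n) * tau) ** 2 * p  # the n = 0 term is untouched
    success = filtered.sum()                           # success probability of this round
    return filtered / success, success

g = 1.0                                   # coupling strength (sets the time unit)
p = thermal_populations(n_th=5.0, cutoff=400)
n = np.arange(p.size)
mean_history = [float(n @ p)]             # average population after each round
fidelity_history = [float(p[0])]          # ground-state population after each round
total_success = 1.0
for _ in range(5):
    # Assumed interval rule: inverse of g*sqrt(current average population).
    tau = 1.0 / (g * np.sqrt(max(mean_history[-1], 1e-12)))
    p, success = conditional_measurement(p, g, tau)
    total_success *= success
    mean_history.append(float(n @ p))
    fidelity_history.append(float(p[0]))

print("average population per round:", mean_history)
print("ground-state population per round:", fidelity_history)
print("overall success probability:", total_success)
```

Within this sketch the ground-state population can only grow from round to round, because each renormalization divides by a success probability that cannot exceed one; the price is the shrinking overall success probability.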
II.2 Unconditional Measurement
Unconditional-measurement cooling is a statistical mixture of the conditional-measurement counterpart, by expanding the measurement subspace to the whole space of the detector system. After a period of joint unitary evolution under the Hamiltonian (1), the overall state can be written as
(8) |
where
UM can be implemented by tracing out the degrees of freedom of the detector . Then the resonator state reads
(9) |
Thus, after a nonselective measurement, i.e., a measurement without recording the result, a population transfer in the target resonator occurs as
(10) |
In contrast to the CM strategy, which is characterized by a single cooling coefficient in Eq. (6), the UM strategy depends subtly on an extra cooling coefficient . According to Eq. (10), the initial population on the ground state becomes , indicating that a part of the population on the first excited state is transferred to the ground state. Under rounds of nonselective measurements, it is intuitive to expect that the populations on the higher states of the resonator keep moving to the lower states and eventually to the ground state. In practice, however, the cooling is constrained and can even be reversed, since the populations on certain excited states can be fixed or enhanced when and , i.e., and . This problem can be addressed by employing the unequal-time-spacing strategy. A time-varying could ensure that the populations on all excited states are gradually reduced.
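The population transfer under one nonselective measurement can be sketched in a few lines. This is an illustration under stated assumptions rather than a reproduction of Eq. (10): it takes the resonant case, in which a Fock state n either keeps its excitation with weight cos^2(g sqrt(n) tau) or passes one excitation to the detector with weight sin^2(g sqrt(n) tau). The function name and the numerical values are hypothetical.

```python
import numpy as np

def unconditional_measurement(p, g, tau):
    """One UM round, assuming resonance and a detector reset to its ground state."""
    n = np.arange(p.size)
    stay = np.cos(g * np.sqrt(n) * tau) ** 2   # detector found in its ground state
    drop = np.sin(g * np.sqrt(n) * tau) ** 2   # detector excited, resonator n -> n - 1
    new_p = stay * p
    new_p[:-1] += drop[1:] * p[1:]             # population arriving from the level above
    return new_p

g, tau = 1.0, 0.4                              # illustrative coupling and interval
n = np.arange(201)
p = (5.0 / 6.0) ** n / 6.0                     # thermal populations, mean occupation ~ 5
p /= p.sum()

q = unconditional_measurement(p, g, tau)
removed = float((np.sin(g * np.sqrt(n) * tau) ** 2 * p).sum())

print("total population after UM:", q.sum())
print("drop in average population:", float(n @ p - n @ q), "expected:", removed)
```

In this simplified setting the total population is conserved (unit success probability), and the average population falls by exactly the probability of exciting the detector. Individual excited-state populations can still stay fixed or grow, which is the constraint discussed above.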

The cooling efficiency of the UM strategy depends strongly on the spacing between neighboring measurements, analogous to that of CM Yan and Jing (2022). This can be observed in Fig. 1 through the average population of the resonator under one measurement on the detector. The -dependence of demonstrates similar patterns across four orders of magnitude in the initial temperature. It is found that the average population declines gradually to a minimal point (the relative reduction becomes smaller with increasing temperature) at an optimized measurement-interval , then rebounds quickly and ends up with a random fluctuation around a value slightly lower than its initial thermal occupation .
To make full use of the cooling strategy, it is desired to analytically find the optimized interval as a functional of the current state and the model parameters. By virtue of Eq. (9) and under the resonant condition, the average population after a single unconditional measurement reads
(11) | ||||
where and . Since the weight function in Eq. (11) is dominant around , the variables and could thus be expanded around . To the first order of , we have
where
(12) |
define the dominant Rabi frequencies. Under the approximations that are appropriate for a moderate temperature and , the average population in Eq. (11) can be expressed by
(13) |
where and . Note we have used the formulas about the geometric series and . Within a moderate time step , Eq. (13) depends predominantly on the high-frequency terms characterized by . In the regime of K, the term weighted by overwhelms that weighted by . And as evidenced by Fig. 1, this advantage expands with a larger given the initial or effective temperature of the resonator becomes lower. We can therefore focus on the last term in Eq. (13) to minimize . Subsequently, yields
(14) |
This result can be extended to the near-resonant situation by modifying the definition of in Eq. (12) to . The vertical black-dashed lines in Fig. 1 denote the measurement-intervals optimized by Eq. (14). It is found that the analytical expression is well suited to estimating the minimum values of the average population over a wide range of temperatures. As demonstrated by both analytical and numerical results, a shorter measurement-interval is required to cool down a higher-temperature resonator. In JC-like models, coupling a qubit to a high-temperature resonator induces a faster transition between the ground state and the excited state of the qubit. Although a quick measurement would interrupt this process, an inappropriate time interval would have a negative effect on cooling Zhang et al. (2019a).
Similar to the optimized interval for the conditional-measurement strategy Yan and Jing (2022), here is also updatable by substituting time-varied and to Eq. (14). The dominant Fock-state-number determining in Eq. (12) could be understood as a function of the effective temperature during the cooling procedure, which relies uniquely on or .
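The dependence of the optimized interval on temperature can be checked with a brute-force scan. The sketch below does not evaluate Eq. (14); as a stand-in, it scans the interval on a grid and locates the first minimum of the average population after a single unconditional measurement, assuming the resonant case in which that population equals its initial value minus the thermally weighted sum of sin^2(g sqrt(n) tau). All names, grid sizes, and thermal occupations are hypothetical.

```python
import numpy as np

def thermal_populations(n_th, cutoff):
    """Truncated, renormalized thermal distribution with mean occupation ~ n_th."""
    n = np.arange(cutoff + 1)
    p = (n_th / (1.0 + n_th)) ** n / (1.0 + n_th)
    return p / p.sum()

def mean_after_um(p, g, taus):
    """Average population after one UM for every interval in taus (resonant case)."""
    n = np.arange(p.size)
    phases = g * np.outer(taus, np.sqrt(n))
    return n @ p - (np.sin(phases) ** 2) @ p

def first_minimum_interval(p, g, tau_max, steps=2000):
    """Grid search for the first local minimum of the post-measurement population."""
    taus = np.linspace(tau_max / steps, tau_max, steps)
    means = mean_after_um(p, g, taus)
    rising = np.flatnonzero(np.diff(means) > 0)    # first point where the curve turns up
    idx = rising[0] if rising.size else steps - 1
    return float(taus[idx]), float(means[idx])

g = 1.0
p_cold = thermal_populations(5.0, 400)     # colder resonator
p_hot = thermal_populations(50.0, 1500)    # hotter resonator
tau_cold, mean_cold = first_minimum_interval(p_cold, g, tau_max=2.0)
tau_hot, mean_hot = first_minimum_interval(p_hot, g, tau_max=2.0)

print("colder resonator: tau =", tau_cold, "mean population =", mean_cold)
print("hotter resonator: tau =", tau_hot, "mean population =", mean_hot)
```

The scan gives a shorter interval for the hotter resonator, in line with the trend stated in the text. In an iterative protocol the same scan would simply be repeated on the updated populations of each round.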
III Measurement optimization
A thermal resonator can be steadily yet slowly cooled down by the unconditional measurement strategy equipped with the optimized measurement-interval in Eq. (14). This strategy is performed with unit probability, since there is no postselection over the measurement outcome. In sharp contrast, the conditional measurement strategy is a more efficient cooling protocol but has a poor success probability. It is therefore desirable to find an optimized sequence of measurements, as a hybrid of UM and CM, that performs well when both cooling efficiency and experimental overhead are taken into account. In this section, we present an algorithm that employs reinforcement learning to generate the optimized control sequence, indicating when and which measurement is performed.
The performance of any cooling-by-measurement strategy can be characterized or evaluated by the cooling ratio , the success probability of the detector in the measured subspace, and the fidelity of the resonator in its ground state Li et al. (2011). To compare various interpolation sequences of UM and CM in cooling performance and to evaluate the figure of merit for the reinforcement learning, we can define a cooperative cooling quantifier as
(15) |
Notably, the logarithm function is used to obtain a positive value with almost the same order as and in magnitude. Then , , and can be considered in a balanced manner. In fact, the average population could be reduced by several (normally less than ) orders of magnitude under an efficient cooling protocol. In the EIT cooling Feng et al. (2020), ; and in the resolved sideband cooling Triana et al. (2016), . Although Eq. (15) is not a unique choice, it is instructive to find that a lower average population, a larger success probability, and a higher ground-state fidelity yield a better cooling performance.
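To make the trade-off concrete, the sketch below evaluates one form of a cooperative quantifier. Eq. (15) is not reproduced in this text, so the product of the ground-state fidelity, the success probability, and the base-10 logarithm of the cooling ratio is only an assumption that is consistent with the description above. The input numbers are invented for illustration and are not results from the paper.

```python
import math

def cooperative_performance(mean_initial, mean_final, success, fidelity):
    """Assumed form: fidelity * success * log10(initial / final average population)."""
    return fidelity * success * math.log10(mean_initial / mean_final)

# Hypothetical inputs: a hybrid sequence with a moderate success probability...
hybrid = cooperative_performance(1.0e2, 1.0e-2, success=0.3, fidelity=0.99)
# ...versus a purely conditional sequence that cools further but rarely succeeds.
pure_cm = cooperative_performance(1.0e2, 1.0e-3, success=0.02, fidelity=0.999)

print("hybrid sequence:", hybrid)
print("pure CM sequence:", pure_cm)
```

With these made-up inputs the hybrid sequence scores higher even though it removes one order of magnitude less population, because the logarithm keeps the cooling ratio on the same scale as the two probabilities.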

The RL-optimization is shown in Fig. 2(a). It consists of the “agent” part, based on a series of neural networks, and the “environment” part, which performs the cooling-by-measurement actions on the quantum system. In reinforcement learning, the agent has a set of parameters that are learned and trained using the data collected through its interaction with the environment. In our architecture, the agent chooses an action, i.e., a conditional or unconditional measurement, on the resonator, given its current state. Then the environment takes this action and returns the updated resonator state and a “reward” after the measurement. The reward is generated by the indicator in Eq. (15) to estimate whether the action is good or bad, and it is used to update the agent’s parameters. During one “episode”, the agent interacts with the environment for times, i.e., the number of measurements during the whole sequence, which is fixed from the beginning. A total reward is eventually counted. The agent is trained to maximize the total reward through artificial episodes until it converges. Then the agent can provide a realistic control sequence of the measurement strategies with their own (optimized) measurement intervals. The cooling-by-measurement sequence can be realized in the circuit model in Fig. 2(b). Rounds of free evolutions and measurements are successively arranged. The evolution time between two neighboring measurements depends on the measurement strategy and the resonator state at the end of the previous round. We follow the PPO algorithm in the agent structure, the data-collecting methods, and the updating of parameters, whose details can be found in Appendix A. The interpolation algorithm of UM and CM and the implementation of the measurement sequence are illustrated by a pseudocode in Appendix B.

We consider cooling a mechanical microresonator in the gigahertz regime Ding et al. (2011); Chan et al. (2011) with various interpolation sequences of UM and CM. Using the resonator-frequency GHz, the coupling strength between resonator and detector and the initial temperature of resonator K, it is found that the average population starts from . The cooling performances under the sequences consisting entirely of UM and CM are shown by the blue-solid lines with circle markers and the orange-dotted lines in Figs. 3(a)-(d), labeled by and , respectively. It is found that under the conditional measurement strategy with , the average population is reduced by five orders of magnitude [see Fig. 3(a)] and the ground-state fidelity is over [see Fig. 3(b)] with a success probability of less than [see Fig. 3(c)]. In sharp contrast, under the same number of unconditional measurements, is merely reduced to and the ground-state fidelity , despite a unit success probability. In terms of all the individual quantifiers, i.e., , , and , the results under the hybrid sequences of UM and CM labelled by , , lie between the former two limits and . As illustrated by Figs. 3(e), (f), and (g), the three sequences start from a CM (indicated by ), switch to the UM (indicated by ) after rounds of free-evolution and measurement, switch back to CM after a single round, and then repeat the preceding arrangement. In comparison to the entire UM sequence, the interpolation with CM promotes the cooling efficiency in . A larger gives rise to a smaller proportion of the unconditional measurements and a lower probability that the detector remains in its measured subspace.
With respect to the cooperative cooling performance given by Eq. (15), it is found [see Fig. 3(d)] that and yet . A regular interpolation sequence can therefore have a better cooperative cooling performance than the entire CM sequence, although the dependence of for an arbitrary hybrid sequence on its proportion of CM strategies might not be monotonic. We are then motivated to find an optimized sequence by virtue of the PPO algorithm. A typical RL-optimized sequence of cooling strategies labeled by is described in Fig. 3(h). With a reduction of four orders of magnitude in the average population (close to the cooling efficiency provided by ), an almost unit ground-state fidelity , and a moderate success probability (much larger than that by ), the optimized sequence achieves an overwhelming cooperative cooling performance according to Eq. (15) over all the other measurement sequences. Therefore, we have achieved a compromise between cooling rate and success probability through the reinforcement learning method with much less overhead than a brute-force search. The RL-optimized sequence is not unique, yet the current results of , , , and in Fig. 3 are almost invariant as long as there is one CM in the first several rounds.

The RL-optimized algorithm applies to a wide range of initial temperature for the resonator. Starting from various determined by the temperature, the average populations could be reduced by three to five orders in magnitude under the optimized measurement sequences, as demonstrated in Fig. 4(a). It is found that under a higher temperature, it is harder to suppress the transitions between the ground state and the excited states of the detector. Then both the relative magnitude in the population reduction [see Fig. 4(a)] and the cooperative cooling performance [see Fig. 4(b)] manifest a monotonically decreasing behavior as temperature increases.
Similar to Fig. 3(h), here we present in Figs. 4(c), (d), (e), and (f) the optimized sequences fully determined by the PPO algorithm, which still outperform any regular interpolated sequence in the cooling quantifier . Comparing these four sub-figures corresponding to various temperatures, it is interesting to find that a larger portion of the unconditional measurements is required along the optimized sequence for a higher temperature. It is consistent with the fact that under CM the success probability to find a detector in its ground state decreases exponentially with increasing temperature of the target resonator. Then more UMs are used to save a rapidly declining for obtaining a larger . In addition, for K, RL-optimized sequence always starts from a conditional measurement, which is important to have a significant cooling rate for during the first several rounds of the whole sequence.
The profiles shown in Fig. 3(h) and Figs. 4(d), (e), and (f) manifest a common pattern for all the RL-optimized sequences. It is found that in the first several rounds, when the resonator is normally in a comparatively high-temperature state, a conditional or projective measurement should be performed on the detector, followed by several unconditional measurements before further cooling. This pattern is consistent with the variations of both energy and entropy in nonunitary controls Gherardini et al. (2020). The energy variation induced by a projective measurement is on average, where is the Shannon entropy of the whole system after a free evolution. Then at the end of the first round, a projective measurement is desired to cut down as much energy as it can, and it is followed by several rounds of unconditional measurements to save the success probability. Thus in general we expect to see more UMs than CMs in the first several rounds and more CMs than UMs in the remaining rounds.
IV Discussion and conclusion

The preceding analysis of the cooling performance neglects the environment-induced dissipation. We now consider the cooling process in an open-quantum-system scenario, in which the free evolution between neighboring measurements is influenced by a finite-temperature environment. The dynamics is then described by the master equation
(16) | ||||
where represents the Lindblad superoperator
(17) |
In Figs. 5(a) and (b), we present the average population and the cooperative cooling performance , respectively, for various dissipation rates. To compare the cooling performances in the presence of thermal decoherence to the dissipation-free situation, we apply the RL-optimized sequence provided in Fig. 3(h). It is found that a larger dissipation rate gives rise to a weaker cooling performance in terms of both and , exhibiting the competition between the cooling effect of the measurements and the accumulated heating effect of the environment. Nevertheless, for typical mechanical resonators in the gigahertz regime with Ding et al. (2011); Chan et al. (2011), our optimized cooling protocol is still capable of reducing by three orders of magnitude with about measurements [see the green dashed line in Fig. 5(a)]. Meanwhile, the asymptotic value of still exceeds that of the CM strategy labeled by in Fig. 3(d).
Even in the absence of thermal decoherence, does not keep decreasing. Fundamentally, it is under the constraint of the third law of thermodynamics, that the absolute zero cannot be attained within a finite number of operations. Actually, either or approaches infinity as , which indicates that the whole cooling process has to be truncated by a maximum timescale.
We emphasize again that the preceding hybrid cooling sequences based on the conditional and unconditional measurements are optimized in both global and local perspectives. Globally, we use the reinforcement learning to find the optimized order for UM and CM. The local optimization depends on the selected measurement interval to obtain a minimum average-population under one measurement. For UM in Eq. (14), is not necessarily obtained by an instant feedback mechanism during a realistic practice. The measurement sequence can be actually obtained prior to the cooling measurements. depends on the initial population-distribution , and , , can be calculated on the effective temperature that is uniquely determined by the dynamics of through Eq. (12). In other words, we can avoid the feedback error and imprecision induced by detecting the resonator states during the experiment.
In summary, we present an optimized cooling architecture based on a sequential arrangement of both conditional and unconditional measurements. We analyse and compare the advantages and disadvantages of CM and UM in terms of cooling rate and success probability. We obtain for the first time an analytical expression for the optimized unconditional measurement-interval in parallel to that for conditional measurement Yan and Jing (2022). Here the dominant Rabi frequency depends on the dominant distribution of resonator in its Fock state with and the coupling strength between target and detector. The combination of the advantages of both measurement strategies gives rise to an optimized hybrid cooling algorithm assisted by reinforcement learning. It is justified by the cooperative cooling performance, which we define to quantify the comprehensive cooling efficiency of an arbitrary cooling-by-measurement strategy. Our work therefore pushes cooling-by-measurement to a previously unattained degree with regard to efficiency and feasibility. It offers an appealing interdisciplinary application of quantum control and artificial intelligence.
Acknowledgments
We acknowledge financial support from the National Science Foundation of China (Grants No. 11974311 and No. U1801661).
Appendix A Proximal Policy Optimization
This appendix provides more details on proximal policy optimization, a typical reinforcement learning algorithm that we use to optimize the measurement sequence for cooling. The PPO algorithm follows an “actor-critic” framework, in which an actor receives the current state as an input and then outputs an action according to an updatable policy, and a critic evaluates this action to determine whether the action should be encouraged or not. In the following, we do not distinguish between “actor” and “policy” for simplicity.

As shown in Fig. 6, the PPO algorithm has two actors (policies) and and one critic. Each of them is an agent constructed from neural networks (see Fig. 2) and characterized by a set of parameters . The two policies have the same structure in PPO. The old policy collects the sampling data through interaction with the environment, and the new one uses these data stored in a buffer to update to be . At first, the environment initializes and delivers the state of the target system to the old policy ; then the old policy generates an action according to and . In the environment, the action is taken and the system state becomes . The environment also provides a reward indicating how good the action is. The reward is generated by a task-specific reward function. At this stage, an interaction between the policy and the environment is completed and one set of “trajectory” or return is collected. trajectories are collected in one episode, where amounts to the number of actions required to complete the task. The critic takes both actions and states as input and outputs an advantage representing the contribution of the current action on the current state . After collecting a sufficient amount of data, the critic estimates the actions’ contribution as precisely as possible. Meanwhile, the new policy is updated according to the advantages to maximize a clipped surrogate objective function Schulman et al. (2017), and it then transfers its parameters to the old one.
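The clipped surrogate objective of Schulman et al. (2017) that the new policy maximizes can be written compactly. The following is a generic sketch of that objective and not the authors' training code: the function name, the clipping parameter 0.2, and the two sample entries are illustrative assumptions.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A), with r the probability ratio."""
    ratio = np.exp(logp_new - logp_old)              # new-policy / old-policy probability
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# Two illustrative samples: ratio 1.5 with advantage +1, ratio 0.5 with advantage -1.
logp_old = np.log(np.array([0.4, 0.6]))
logp_new = np.log(np.array([0.6, 0.3]))
advantages = np.array([1.0, -1.0])
objective = clipped_surrogate(logp_new, logp_old, advantages)
print("clipped surrogate objective:", objective)
```

In the first sample the gain is capped at 1.2 rather than 1.5, and in the second the penalty is taken as -0.8 rather than -0.5, so the objective stops rewarding the new policy for moving far away from the old one within a single update.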
In our application for optimizing the cooling sequence, the allowed inputs of the system states are defined as the populations in the Fock states, i.e., the diagonal elements of the target resonator
(18) |
where indicates the cutoff Fock-state for the resonator. The actions taken by the environment are selected from the set
(19) |
where and represent unconditional and conditional measurements, respectively. Two policies are used to decide which type of measurement is to be performed based on the current state of the resonator. The environment represents the quantum devices performing measurements, obtaining the updated states, and returning the rewards. When an action is selected and sent to the environment, the optimized measurement interval is calculated according to the measurement type. After a unitary evolution lasting , a measurement is performed on the detector. Then the average population , the ground-state fidelity , and the success probability are obtained to calculate the cooperative cooling performance given by Eq. (15). The reward function is set as a certain multiple of , . After the measurement, the environment returns the resonator state and the reward to the policies. When the training is completed, a policy with a set of optimized parameters is achieved. The neural network equipped with can then be used to generate the optimized actions to cool down the current state.
Appendix B Generation of optimized sequence
Both the order of measurements and the sequence of measurement-intervals can be regarded as the output of our RL-optimized cooling algorithm, as shown in Algorithm 1. The input information is the initial temperature , which fully determines the thermal state of the resonator. Once the reinforcement learning process has been completed by the PPO algorithm (see Appendix A), the parameters of the neural network (policy ) have been trained to select, for the current state, the one of the two measurement strategies that maximizes the cooperative cooling performance. The cooling procedure is then formally launched. We run the policy on , which generates the first measurement strategy , . Here and indicate UM and CM, respectively. If , then in Eq. (14) that could be obtained by the effective temperature of the resonator (initially , and it is updated by the current state of the last round). Subsequently, the cooling coefficients and are calculated and the resonator state is modified according to Eq. (9). Otherwise if , a conditional measurement will be implemented after an interval and the resonator state is modified according to Eq. (3). At the end of this round, one can calculate by the current and then go to the next round. After iterations, the optimized measurement sequence characterized by and appears as respectively described in Fig. 3(h) and Figs. 4(c), (d), (e), (f). In practical implementations, the measurements by and can be applied to the detector without knowledge of the target-resonator state.
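The loop described above can be sketched as follows. This is a simplified stand-in for Algorithm 1 under several assumptions. It takes the resonant case and the action labels 0 for UM and 1 for CM. It assumes the CM interval rule tau = 1/(g sqrt(mean population)), and for UM it replaces Eq. (14) with a grid search over the interval. The trained PPO policy is replaced by a hand-written toy rule. All function and variable names are hypothetical.

```python
import numpy as np

UM, CM = 0, 1                                  # assumed action labels

def thermal_populations(n_th, cutoff):
    """Truncated, renormalized thermal distribution with mean occupation ~ n_th."""
    n = np.arange(cutoff + 1)
    p = (n_th / (1.0 + n_th)) ** n / (1.0 + n_th)
    return p / p.sum()

def run_sequence(p, g, policy, rounds):
    """Apply `rounds` measurements chosen by `policy`; return state and bookkeeping."""
    n = np.arange(p.size)
    sqrt_n = np.sqrt(n)
    actions, intervals, success = [], [], 1.0
    for k in range(rounds):
        action = policy(p, k)
        if action == CM:
            # Assumed CM interval: inverse of g*sqrt(current average population).
            tau = 1.0 / (g * np.sqrt(max(float(n @ p), 1e-12)))
            filtered = np.cos(g * sqrt_n * tau) ** 2 * p
            prob = filtered.sum()
            p, success = filtered / prob, success * prob
        else:
            # Grid search standing in for the analytical UM interval of Eq. (14).
            taus = np.linspace(0.002, 2.0, 1000) / g
            removed = (np.sin(g * np.outer(taus, sqrt_n)) ** 2) @ p
            tau = float(taus[np.argmax(removed)])
            phase = g * sqrt_n * tau
            new_p = np.cos(phase) ** 2 * p
            new_p[:-1] += np.sin(phase[1:]) ** 2 * p[1:]
            p = new_p
        actions.append(action)
        intervals.append(float(tau))
    return p, actions, intervals, success

def toy_policy(p, k):
    """Placeholder for the trained policy: one CM every fourth round, UM otherwise."""
    return CM if k % 4 == 0 else UM

p_init = thermal_populations(5.0, 400)
levels = np.arange(p_init.size)
p_final, actions, intervals, success = run_sequence(p_init, 1.0, toy_policy, rounds=10)

print("actions:", actions)
print("intervals:", intervals)
print("average population:", float(levels @ p_init), "->", float(levels @ p_final))
print("ground-state population:", p_final[0], "success probability:", success)
```

Because every interval here is computed from the populations tracked in the simulation, the whole list of actions and intervals is available before any measurement is carried out, which mirrors the remark that the sequence can be applied without knowledge of the target-resonator state.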
References
- Milburn and Woolley (2008) G. J. Milburn and M. J. Woolley, Quantum nanoscience, Contemp. Phys. 49, 413 (2008).
- Aspelmeyer et al. (2014) M. Aspelmeyer, T. J. Kippenberg, and F. Marquardt, Cavity optomechanics, Rev. Mod. Phys. 86, 1391 (2014).
- Lloyd and Braunstein (1999) S. Lloyd and S. L. Braunstein, Quantum computation over continuous variables, Phys. Rev. Lett. 82, 1784 (1999).
- You and Nori (2011) J. Q. You and F. Nori, Atomic physics and quantum optics using superconducting circuits, Nature 474, 589 (2011).
- Toyoda et al. (2015) K. Toyoda, R. Hiji, A. Noguchi, and S. Urabe, Hong–Ou–Mandel interference of two phonons in trapped ions, Nature 527, 74 (2015).
- Um et al. (2016) M. Um, J. Zhang, D. Lv, Y. Lu, S. An, J.-N. Zhang, H. Nha, M. S. Kim, and K. Kim, Phonon arithmetic in a trapped ion system, Nat. Commun. 7, 11410 (2016).
- Bocko and Onofrio (1996) M. F. Bocko and R. Onofrio, On the measurement of a weak classical force coupled to a harmonic oscillator: Experimental progress, Rev. Mod. Phys. 68, 755 (1996).
- Caves et al. (1980) C. M. Caves, K. S. Thorne, R. W. P. Drever, V. D. Sandberg, and M. Zimmermann, On the measurement of a weak classical force coupled to a quantum-mechanical oscillator. I. Issues of principle, Rev. Mod. Phys. 52, 341 (1980).
- Sharma et al. (2018) S. Sharma, Y. M. Blanter, and G. E. W. Bauer, Optical cooling of magnons, Phys. Rev. Lett. 121, 087205 (2018).
- Wilson-Rae et al. (2007) I. Wilson-Rae, N. Nooshi, W. Zwerger, and T. J. Kippenberg, Theory of ground state cooling of a mechanical oscillator using dynamical backaction, Phys. Rev. Lett. 99, 093901 (2007).
- Gigan et al. (2006) S. Gigan, H. R. Böhm, M. Paternostro, F. Blaser, G. Langer, J. B. Hertzberg, K. C. Schwab, D. Bäuerle, M. Aspelmeyer, and A. Zeilinger, Self-cooling of a micromirror by radiation pressure, Nature 444, 67 (2006).
- Wang et al. (2011) X. Wang, S. Vinjanampathy, F. W. Strauch, and K. Jacobs, Ultraefficient cooling of resonators: Beating sideband cooling with quantum control, Phys. Rev. Lett. 107, 177204 (2011).
- Zhang et al. (2013) J. Zhang, D. Li, R. Chen, and Q. Xiong, Laser cooling of a semiconductor by 40 kelvin, Nature 493, 504 (2013).
- Epstein et al. (1995) R. I. Epstein, M. I. Buchwald, B. C. Edwards, T. R. Gosnell, and C. E. Mungan, Observation of laser-induced fluorescent cooling of a solid, Nature 377, 500 (1995).
- Morigi et al. (2000) G. Morigi, J. Eschner, and C. H. Keitel, Ground state laser cooling using electromagnetically induced transparency, Phys. Rev. Lett. 85, 4458 (2000).
- Roos et al. (2000) C. F. Roos, D. Leibfried, A. Mundt, F. Schmidt-Kaler, J. Eschner, and R. Blatt, Experimental demonstration of ground state laser cooling with electromagnetically induced transparency, Phys. Rev. Lett. 85, 5547 (2000).
- Vanner et al. (2011) M. R. Vanner, I. Pikovski, G. D. Cole, M. S. Kim, C. Brukner, K. Hammerer, G. J. Milburn, and M. Aspelmeyer, Pulsed quantum optomechanics, Proc. Natl. Acad. Sci. 108, 16182 (2011).
- Vanner et al. (2013) M. R. Vanner, J. Hofer, G. D. Cole, and M. Aspelmeyer, Cooling-by-measurement and mechanical state tomography via pulsed optomechanics, Nat. Commun. 4, 2295 (2013).
- Bennett et al. (2016) J. S. Bennett, K. Khosla, L. S. Madsen, M. R. Vanner, H. Rubinsztein-Dunlop, and W. P. Bowen, A quantum optomechanical interface beyond the resolved sideband limit, New J. Phys. 18, 053030 (2016).
- Rossi et al. (2018) M. Rossi, D. Mason, J. Chen, Y. Tsaturyan, and A. Schliesser, Measurement-based quantum control of mechanical motion, Nature 563, 53 (2018).
- Brunelli et al. (2020) M. Brunelli, D. Malz, A. Schliesser, and A. Nunnenkamp, Stroboscopic quantum optomechanics, Phys. Rev. Research 2, 023241 (2020).
- Buffoni et al. (2019) L. Buffoni, A. Solfanelli, P. Verrucchi, A. Cuccoli, and M. Campisi, Quantum measurement cooling, Phys. Rev. Lett. 122, 070603 (2019).
- Nakazato et al. (2003) H. Nakazato, T. Takazawa, and K. Yuasa, Purification through Zeno-like measurements, Phys. Rev. Lett. 90, 060401 (2003).
- Li et al. (2011) Y. Li, L.-A. Wu, Y.-D. Wang, and L.-P. Yang, Nondeterministic ultrafast ground-state cooling of a mechanical resonator, Phys. Rev. B 84, 094502 (2011).
- Xu et al. (2014) J.-S. Xu, M.-H. Yung, X.-Y. Xu, S. Boixo, Z.-W. Zhou, C.-F. Li, A. Aspuru-Guzik, and G.-C. Guo, Demon-like algorithmic quantum cooling and its realization with quantum optics, Nat. Photonics 8, 113 (2014).
- Puebla et al. (2020) R. Puebla, O. Abah, and M. Paternostro, Measurement-based cooling of a nonlinear mechanical resonator, Phys. Rev. B 101, 245410 (2020).
- Pyshkin et al. (2016) P. V. Pyshkin, D.-W. Luo, J. Q. You, and L.-A. Wu, Ground-state cooling of quantum systems via a one-shot measurement, Phys. Rev. A 93, 032120 (2016).
- Yan and Jing (2021) J.-s. Yan and J. Jing, External-level assisted cooling by measurement, Phys. Rev. A 104, 063105 (2021).
- Yan and Jing (2022) J.-s. Yan and J. Jing, Simultaneous cooling by measuring one ancillary system, Phys. Rev. A 105, 052607 (2022).
- Zhang et al. (2019a) J.-M. Zhang, J. Jing, L.-A. Wu, L.-G. Wang, and S.-Y. Zhu, Measurement-induced cooling of a qubit in structured environments, Phys. Rev. A 100, 022107 (2019a).
- Harel and Kurizki (1996) G. Harel and G. Kurizki, Fock-state preparation from thermal cavity fields by measurements on resonant atoms, Phys. Rev. A 54, 5410 (1996).
- Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, Mastering the game of Go with deep neural networks and tree search, Nature 529, 484 (2016).
- Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, Mastering the game of Go without human knowledge, Nature 550, 354 (2017).
- Silver et al. (2018) D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis, A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science 362, 1140 (2018).
- Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, Human-level control through deep reinforcement learning, Nature 518, 529 (2015).
- Carleo et al. (2019) G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, Machine learning and the physical sciences, Rev. Mod. Phys. 91, 045002 (2019).
- Convy et al. (2022) I. Convy, H. Liao, S. Zhang, S. Patel, W. P. Livingston, H. N. Nguyen, I. Siddiqi, and K. B. Whaley, Machine learning for continuous quantum error correction on superconducting qubits, New J. Phys. 24, 063019 (2022).
- Fösel et al. (2018) T. Fösel, P. Tighineanu, T. Weiss, and F. Marquardt, Reinforcement learning with neural networks for quantum feedback, Phys. Rev. X 8, 031084 (2018).
- Bolens and Heyl (2021) A. Bolens and M. Heyl, Reinforcement learning for digital quantum simulation, Phys. Rev. Lett. 127, 110502 (2021).
- Yuan et al. (2021) X. Yuan, J. Sun, J. Liu, Q. Zhao, and Y. Zhou, Quantum simulation with hybrid tensor networks, Phys. Rev. Lett. 127, 040501 (2021).
- Guo et al. (2021) S.-F. Guo, F. Chen, Q. Liu, M. Xue, J.-J. Chen, J.-H. Cao, T.-W. Mao, M. K. Tey, and L. You, Faster state preparation across quantum phase transition assisted by reinforcement learning, Phys. Rev. Lett. 126, 060401 (2021).
- Bukov et al. (2018) M. Bukov, A. G. R. Day, D. Sels, P. Weinberg, A. Polkovnikov, and P. Mehta, Reinforcement learning in different phases of quantum control, Phys. Rev. X 8, 031086 (2018).
- Zhang et al. (2019b) X.-M. Zhang, Z. Wei, R. Asad, X.-C. Yang, and X. Wang, When does reinforcement learning stand out in quantum control? A comparative study on state preparation, npj Quantum Inf. 5, 85 (2019b).
- Sivak et al. (2022) V. V. Sivak, A. Eickbusch, H. Liu, B. Royer, I. Tsioutsios, and M. H. Devoret, Model-free quantum control with reinforcement learning, Phys. Rev. X 12, 011059 (2022).
- Kim and Jeong (2021) D.-K. Kim and H. Jeong, Deep reinforcement learning for feedback control in a collective flashing ratchet, Phys. Rev. Research 3, L022002 (2021).
- Yao et al. (2021) J. Yao, L. Lin, and M. Bukov, Reinforcement learning for many-body ground-state preparation inspired by counterdiabatic driving, Phys. Rev. X 11, 031070 (2021).
- Feng et al. (2020) L. Feng, W. L. Tan, A. De, A. Menon, A. Chu, G. Pagano, and C. Monroe, Efficient ground-state cooling of large trapped-ion chains with an electromagnetically-induced-transparency tripod scheme, Phys. Rev. Lett. 125, 053001 (2020).
- Triana et al. (2016) J. F. Triana, A. F. Estrada, and L. A. Pachón, Ultrafast optimal sideband cooling under non-Markovian evolution, Phys. Rev. Lett. 116, 183602 (2016).
- Ding et al. (2011) L. Ding, C. Baker, P. Senellart, A. Lemaitre, S. Ducci, G. Leo, and I. Favero, Wavelength-sized GaAs optomechanical resonators with gigahertz frequency, Appl. Phys. Lett. 98, 113108 (2011).
- Chan et al. (2011) J. Chan, T. P. M. Alegre, A. H. Safavi-Naeini, J. T. Hill, A. Krause, S. Gröblacher, M. Aspelmeyer, and O. Painter, Laser cooling of a nanomechanical oscillator into its quantum ground state, Nature 478, 89 (2011).
- Gherardini et al. (2020) S. Gherardini, F. Campaioli, F. Caruso, and F. C. Binder, Stabilizing open quantum batteries by sequential measurements, Phys. Rev. Research 2, 013095 (2020).
- Schulman et al. (2017) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal policy optimization algorithms, arXiv:1707.06347 (2017).