A Dual-Critic Reinforcement Learning Framework for Frame-level Bit Allocation in HEVC/H.265
Abstract
This paper introduces a dual-critic reinforcement learning (RL) framework to address frame-level bit allocation in HEVC/H.265. The objective is to minimize the distortion of a group of pictures (GOP) under a rate constraint. Previous RL-based methods tackle such a constrained optimization problem by maximizing a single reward function that often combines a distortion reward and a rate reward. However, the way these rewards are combined is usually ad hoc and may not generalize well to various coding conditions and video sequences. To overcome this issue, we adapt the deep deterministic policy gradient (DDPG) algorithm to work with two critics, one learning to predict the distortion reward and the other the rate reward. In particular, the distortion critic updates the agent when the rate constraint is satisfied, whereas the rate critic makes the rate constraint a priority when the agent goes over the bit budget. Experimental results on commonly used datasets show that our method outperforms the bit allocation scheme in x265 and a single-critic baseline by a significant margin in rate-distortion performance while offering fairly precise rate control.
1 Introduction
Frame-level bit allocation is a key issue in video rate control [1, 2, 3]. The task is to minimize the distortion of a group of pictures (GOP) under a rate constraint. In allocating a proper number of bits to each video frame in a GOP, it is crucial to account for inter-frame dependencies, because the quality of a frame being coded depends heavily on that of its reference frames. Frame-level bit allocation within a GOP can thus be viewed as a dependent decision-making process with a long-term goal.
Recently, deep reinforcement learning (RL) has emerged as a promising technique for dependent decision-making, and some early attempts apply it to constrained optimization problems in video coding [1, 4, 5, 2, 3]. Chung et al. [4] utilized RL to determine the partitioning of coding tree units in HEVC/H.265. Hu et al. [2] cast the choice of quantization parameters for intra-frame coding as an RL problem. Chen et al. [1] extended the idea to frame-level bit allocation for inter-frame coding with hierarchical bi-prediction. Lately, Zhou et al. [3] took this line of research one step further to combined intra- and inter-frame bit allocation, focusing particularly on low-delay coding scenarios. Instead of optimizing video quality for human perception, Shi et al. [5] proposed RL-based intra-frame bit allocation for image detection, classification, and segmentation.
In the context of frame-level bit allocation in a GOP, the reward is often formulated as a single function blending the resulting distortion of the GOP and the deviation of the actual bit rate from the target. Meeting the distortion and rate requirements simultaneously then calls for a hyper-parameter that balances the two terms, and its choice is usually ad hoc.
To address this problem, we propose a dual-critic RL framework. We adapt the deep deterministic policy gradient (DDPG) [6] algorithm for use with two critics: a rate critic and a distortion critic. The rate critic learns to predict the rate deviation upon the completion of encoding a GOP, while the distortion critic estimates the distortion-to-go along the coding process. Instead of using one single reward function to guide the learning of the agent, which performs bit allocation by choosing a quantization parameter (QP) for every video frame, we apply the two critics alternately to train the agent. Extensive experiments with x265 show that our method outperforms the bit allocation scheme in x265 and the single-critic method [1] by a significant margin in rate-distortion (R-D) performance, when evaluated on Class B and Class C test sequences in the JCT-VC dataset, which are not seen during training. It also offers fairly precise rate control.
Our contributions include: (1) to the best of our knowledge, this is the first work that introduces a dual-critic-based RL framework to address constrained optimization problems; (2) we apply it to frame-level bit allocation for GOP coding with hierarchical bi-prediction; (3) our scheme outperforms the bit allocation scheme in x265 and the single-critic method [1] in terms of R-D performance and rate control accuracy.
2 Proposed Method

2.1 System Overview
The goal of frame-level bit allocation is to allocate a proper number of bits to every video frame by choosing a QP for its encoding, in order to minimize the distortion of a GOP subject to a rate constraint. In symbols, we have

$$\min_{q_1,\ldots,q_N}\ \sum_{t=1}^{N} D_t(q_t) \quad \text{subject to} \quad \sum_{t=1}^{N} R_t(q_t) \leq R_{GOP}, \qquad (1)$$

where $q_t$ denotes the QP chosen for the $t$-th frame in a GOP of $N$ frames, $D_t(q_t)$ is the distortion incurred by encoding frame $t$ with $q_t$, $R_t(q_t)$ is the actual number of bits produced by the encoder, and $R_{GOP}$ is the GOP-level bit budget.
In this paper, we tackle the learning of an agent that performs bit allocation among the video frames in a GOP from an RL perspective. Frame-level bit allocation for a GOP is regarded as an episodic task, with the learning process given as follows and illustrated in Fig. 1. (1) First, we evaluate a state signal (Section 2.2) for an input frame. (2) Second, the state signal is fed to the agent, implemented by a neural network (Section 2.5), to determine a QP for encoding the current frame. (3) Third, based on the chosen QP, the current frame is encoded and decoded by the specified codec (e.g. x265) to update the state signal, from which the agent decides the QP for the next video frame. Moreover, two types of rewards (the distortion and rate rewards) are computed to indicate how good the choice of QP is (Section 2.3). These steps repeat until a terminal state (the end of an episode) is reached. During training, the agent interacts with the codec many times so that it learns to maximize the rewards (Section 2.4).
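To make the interaction concrete, the following minimal sketch renders steps (1) to (3) in Python. The `select_qp` and `encode` callables and the simplified state dictionary are hypothetical placeholders for our agent and the codec, not the actual implementation.

```python
from typing import Callable, List, Tuple

def run_episode(
    select_qp: Callable[[dict], int],                 # agent: state -> QP
    encode: Callable[[int, int], Tuple[float, int]],  # (frame_idx, qp) -> (mse, bits)
    num_frames: int,
    bit_budget: int,
) -> Tuple[List[tuple], float]:
    """One episode = encoding one GOP frame by frame (steps (1)-(3))."""
    transitions: List[tuple] = []
    used_bits = 0
    for t in range(num_frames):
        # Step (1): evaluate the state signal (placeholder components).
        state = {
            "frame_index": t,
            "remaining_frames": num_frames - t,
            "remaining_bits_pct": 1.0 - used_bits / bit_budget,
            "target_bits": bit_budget,
        }
        qp = select_qp(state)      # step (2): the agent picks a QP
        mse, bits = encode(t, qp)  # step (3): the codec encodes/decodes the frame
        used_bits += bits
        transitions.append((state, qp, mse, bits))
    # Terminal signal: deviation of the bits actually spent from the budget.
    rate_bias = abs(used_bits - bit_budget) / bit_budget
    return transitions, rate_bias
```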
2.2 State Signals
The state signal serves as the input to the agent for its decision making. We largely follow the state design in [1], summarized in Table 1 for easy reference. It involves several hand-crafted intra- and inter-frame features that characterize the statistics of all the video frames in a GOP and of their residual frames, which are produced with zero motion vectors for simplicity. That the state depends on the statistics of the remaining video frames in a GOP implies a look-ahead strategy. When a reference frame is not yet coded, we therefore turn to its original, uncompressed version to compute the residual frame (see the sketch after Table 1).
| # | State component |
|---|---|
1 | Intra-frame feature (the mean and variance of pixel values in a video frame) |
2 | Inter-frame feature (the mean and variance of pixel values in a residual frame) |
3 | Average of intra-frame features over the remaining frames |
4 | Average of inter-frame features over the remaining frames |
5 | Percentage of the remaining bits |
6 | Number of remaining frames in the GOP |
7 | Temporal identification of the current frame |
8 | Target bit of the GOP |
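For illustration, the hand-crafted features in rows 1-4 of Table 1 can be computed as in the sketch below; this is our reading of the feature design, assuming 2D luma arrays, rather than the exact implementation.

```python
import numpy as np

def intra_feature(frame: np.ndarray) -> tuple:
    """Row 1: mean and variance of the pixel values in a video frame."""
    return float(frame.mean()), float(frame.var())

def inter_feature(frame: np.ndarray, reference: np.ndarray) -> tuple:
    """Row 2: mean and variance of the residual frame produced with zero
    motion vectors. An uncoded reference is used in its original form."""
    residual = frame.astype(np.float64) - reference.astype(np.float64)
    return float(residual.mean()), float(residual.var())

def lookahead_features(frames: list, references: list) -> tuple:
    """Rows 3-4: averages of the intra-/inter-frame features over the
    remaining (not yet coded) frames in the GOP."""
    intra = np.array([intra_feature(f) for f in frames])
    inter = np.array([inter_feature(f, r) for f, r in zip(frames, references)])
    return tuple(intra.mean(axis=0)), tuple(inter.mean(axis=0))
```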
2.3 Distortion and Rate Rewards
The reward signal plays a central role in shaping the agent’s behavior. The agent is usually trained to maximize a single reward. Because our task is to minimize the distortion of a GOP subject to a rate constraint, we develop two rewards: the distortion and the rate rewards. The former is evaluated upon the completion of encoding a video frame as an immediate reward, reflecting how good the current choice of QP is. It is defined as the normalized negative distortion of a compressed video frame in mean square error (MSE):
$$r^D_t = \frac{-\,\mathrm{MSE}_t}{\mathrm{MSE}_{max} - \mathrm{MSE}_{min}}, \qquad (2)$$

where $\mathrm{MSE}_t$ is the MSE of the current frame $t$, and $\mathrm{MSE}_{min}$ and $\mathrm{MSE}_{max}$ are the MSE's of the entire GOP obtained by encoding every video frame with the two extreme QP values, i.e. QP 0 and QP 51, respectively. The two normalization factors cause the magnitude of the sum of $r^D_t$ over the video frames in a GOP to fall approximately in the interval from 0 to 1. Unlike the distortion reward, which is evaluated for every video frame, the rate reward is given at the end of an episode, capturing the absolute deviation of the actual bit rate from the target bit rate. It is specified by
$$r^R = -\,\frac{\left|\sum_{t=1}^{N} R_t(q_t) - R_{GOP}\right|}{R_{GOP}}. \qquad (3)$$
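In code, the two rewards amount to the following sketch, which directly transcribes Eqs. (2) and (3); the per-GOP normalization constants are assumed to be pre-computed.

```python
def distortion_reward(mse_t: float, gop_mse_qp0: float, gop_mse_qp51: float) -> float:
    """Eq. (2): normalized negative MSE of the current frame. The GOP-level
    MSEs at the two extreme QPs (0 and 51) are pre-computed per GOP."""
    return -mse_t / (gop_mse_qp51 - gop_mse_qp0)

def rate_reward(actual_bits: int, target_bits: int) -> float:
    """Eq. (3): given only at the episode end; penalizes the absolute
    deviation of the actual GOP bits from the target."""
    return -abs(actual_bits - target_bits) / target_bits
```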
2.4 Dual Critics
To train the proposed agent with the aforementioned state and reward design, we depart from the single-critic approach and learn two critics, denoted $Q^D$ and $Q^R$, with one used for predicting the distortion reward (more precisely, the distortion-to-go) and the other for predicting the rate reward. The agent is updated adaptively with one of the critics. Specifically, we regularly apply the agent under training to encode a GOP. If it produces a bit rate exceeding the target, we use the rate critic to update its network parameters, steering the agent towards precise rate control. Otherwise, when the actual bit rate is below the target, the agent is updated with the distortion critic to minimize the distortion of the reconstructed GOP.
Algorithm 1 details the training process, which divides into two major parts: Part I (lines 5 to 18) and Part II (lines 19 to 30). Part I corresponds to a rollout of the policy together with a noise process, collecting transitions of states, actions, and rewards in a replay buffer $B_1$, whose data are used to update the two critics $Q^D$ and $Q^R$ with the ordinary DDPG algorithm. Note that a zero immediate rate reward is recorded at every step until the very last step (the terminal step), at which the bit rate deviation of the entire GOP is given as the immediate rate reward. Part II is another rollout of the policy, but without the noise process; that is, the agent runs as it would at inference time. The state transitions experienced are stored in a separate replay buffer $B_2$. At the end of this rollout, the actual bit rate is compared against the target to decide which of the two critics updates the agent, as sketched below.
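The critic-selection rule at the heart of Part II reduces to a few lines. The sketch below assumes DDPG-style PyTorch modules (`actor`, `critic_d`, `critic_r`) and omits the critic updates of Part I; it illustrates the selection logic rather than the full Algorithm 1.

```python
def update_actor(actor, critic_d, critic_r, states, actual_bits, target_bits, optimizer):
    """One actor update in the dual-critic scheme: the noise-free rollout
    (Part II) decides which critic provides the policy gradient."""
    critic = critic_r if actual_bits > target_bits else critic_d
    actions = actor(states)
    # Deterministic policy gradient: ascend the selected critic's value.
    loss = -critic(states, actions).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```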
2.5 Network Architectures
As Fig. 1 illustrates, our agent consists of two modules: a base module and an actor module. The base module is pre-trained to mimic the QP control of the selected codec (e.g. x265), while the actor module is learned with our dual-critic RL framework to refine the QP of the base module by a bounded delta QP. In training the actor module, the base module is frozen. The introduction of the base module restricts the exploration space for learning the actor module, which proves beneficial because convergence can be an issue for dual-critic training over a vast exploration space of QP options. The architectures of our actor and critic networks are detailed in Table 2, with an illustrative sketch given after it. Note that the base module shares the same architecture as the actor module, except that it excludes the base QP from its input (Fig. 1).
| Layer | Actor | Critics ($Q^D$ and $Q^R$): state branch | Critics: action branch |
|---|---|---|---|
| Input | State | State | Action |
| 1 | fc, 800, elu | fc, 500, leaky-relu | fc, 500, leaky-relu |
| 2 | fc, 500, elu | fc, 300, leaky-relu | fc, 300, none |
| 3 | fc, 1, sigmoid | fc, 300, none | - |
| 4 | - | add (state + action branches), leaky-relu | |
| 5 | - | fc, 100, leaky-relu | |
| 6 | - | fc, 1, none | |
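A PyTorch rendering of Table 2 might look as follows; the layer widths and activations are taken from the table, while details such as the fusion of the two critic branches and the scaling of the actor output are our assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 800), nn.ELU(),
            nn.Linear(800, 500), nn.ELU(),
            nn.Linear(500, 1), nn.Sigmoid(),  # output later scaled to a bounded delta QP
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class Critic(nn.Module):
    """Two-branch critic: rows 1-3 process state and action separately,
    row 4 adds the branches, and rows 5-6 regress the scalar value."""
    def __init__(self, state_dim: int, action_dim: int = 1):
        super().__init__()
        self.state_branch = nn.Sequential(
            nn.Linear(state_dim, 500), nn.LeakyReLU(),
            nn.Linear(500, 300), nn.LeakyReLU(),
            nn.Linear(300, 300),                  # row 3: no activation
        )
        self.action_branch = nn.Sequential(
            nn.Linear(action_dim, 500), nn.LeakyReLU(),
            nn.Linear(500, 300),                  # row 2: no activation
        )
        self.head = nn.Sequential(
            nn.LeakyReLU(),                       # row 4: applied after the add
            nn.Linear(300, 100), nn.LeakyReLU(),  # row 5
            nn.Linear(100, 1),                    # row 6: value estimate
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.head(self.state_branch(state) + self.action_branch(action))
```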
2.6 Comparison with Single-Critic Design
Our scheme differs from the single-critic design [1] in two respects. First, it learns two separate critics, whereas Chen et al. [1] use one critic to learn a single reward function for a GOP formed as the weighted combination of the distortion and rate rewards, namely $r = r^D + \omega\, r^R$, where $\omega$ is a hyper-parameter chosen empirically to trade the distortion of a GOP against the rate deviation from the target. We argue that a fixed $\omega$, as adopted in [1], can hardly yield an agent that works well on different types of videos. Second, we require our agent to determine QP's for the remaining frames even when the target is exceeded, whereas Chen et al. [1] propose encoding the remaining frames in a terminal mode with a fixed QP derived from the I-frame's QP.
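The first difference can be stated in two lines of code: the single-critic design collapses both signals into one return before any learning takes place, while ours never blends them (a sketch; the exact weighting form in [1] may differ).

```python
# Single-critic design [1]: one blended return with a fixed, hand-tuned weight.
def blended_reward(r_dist: float, r_rate: float, omega: float) -> float:
    return r_dist + omega * r_rate

# Dual-critic design (ours): no blending. Each critic regresses its own
# return, and the rollout outcome selects which one updates the actor
# (see update_actor in Section 2.4).
```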
3 Experimental Results
3.1 Settings
We assess the objective and subjective compression performance of the proposed method, comparing the results against those produced by x265 [7], the base module alone (hereafter referred to as base), and the single-critic method [1]. The baselines do not include [3] because it is optimized specifically for low-delay P-frame coding, as opposed to our GOP coding with hierarchical bi-prediction.
All the competing methods are evaluated on x265 with a hierarchical GOP coding structure, as depicted in Fig. 2. Because our focus is on bit allocation inside a GOP, all the tested methods follow the same GOP-level bit allocation as x265. This is achieved by first encoding every test sequence with fixed QP 22, 27, 32, and 37 to establish four sequence-level target bit rates. Each target bit rate $R_s$ is then utilized as the sequence-level rate constraint to encode the test sequence again with the ABR rate control of x265 turned on (`--bitrate` $R_s$, `--vbv-bufsize` $2R_s$, `--vbv-maxrate` $R_s$), the results of which serve as our x265 baseline. For a fair comparison, the GOP-level bit rates produced by x265 in this ABR mode are taken as the GOP rate constraints for the other methods, and the encoding is restricted to a single pass.
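For concreteness, the baseline encode can be scripted as below; apart from the three rate control flags quoted above, the input/output handling is an assumption about a typical x265 invocation rather than our exact setup.

```python
import subprocess

def encode_x265_abr(input_yuv: str, output: str, bitrate_kbps: int,
                    width: int, height: int, fps: int) -> None:
    """Encode one sequence with x265's ABR rate control and the VBV
    settings quoted above (bufsize = 2x bitrate, maxrate = bitrate)."""
    cmd = [
        "x265",
        "--input", input_yuv,
        "--input-res", f"{width}x{height}",
        "--fps", str(fps),
        "--bitrate", str(bitrate_kbps),
        "--vbv-bufsize", str(2 * bitrate_kbps),
        "--vbv-maxrate", str(bitrate_kbps),
        "-o", output,
    ]
    subprocess.run(cmd, check=True)
```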
We train separate models for the four target bit rates. The base module approximately clones the rate control behavior of x265, using its QP choices as the supervision ground truth. For a fair comparison, both our scheme and the single-critic method [1] operate with the same base module. Our training data include UVG [8], MCL-JCV [9], and the Class A sequences in the JCT-VC dataset. At test time, we use the Class B and Class C sequences in the JCT-VC dataset, which are not seen during training. To speed up RL training, the training and test videos are downscaled. Our feature extraction and actor network forwarding take only 4% of the run time of x265's ABR rate control.
The objective compression performance is reported in terms of BD-PSNR gains and BD-rate savings, with x265 in ABR mode serving as the anchor. The quality metrics are Y-PSNR and YUV-PSNR, evaluated and averaged over individual frames, where YUV-PSNR is given by $(6 \cdot \mathrm{PSNR}_Y + \mathrm{PSNR}_U + \mathrm{PSNR}_V)/8$. The rate control accuracy is quantified by averaging, over all the GOPs in a test sequence, the absolute rate deviations from the GOP-level targets in percentage terms.
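The two evaluation metrics reduce to the following sketch, assuming per-frame channel PSNRs and per-GOP bit counts are already collected; the 6:1:1 weighting is the common convention.

```python
import numpy as np

def yuv_psnr(psnr_y: float, psnr_u: float, psnr_v: float) -> float:
    """Weighted YUV-PSNR with the standard 6:1:1 channel weighting."""
    return (6.0 * psnr_y + psnr_u + psnr_v) / 8.0

def rate_deviation_pct(actual_bits: np.ndarray, target_bits: np.ndarray) -> float:
    """Average absolute deviation of the per-GOP bits from their targets,
    in percent, as reported for the rate control accuracy in Table 3."""
    return float(100.0 * np.mean(np.abs(actual_bits - target_bits) / target_bits))
```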
| Sequences | BD-rate Y (%) | | | BD-rate YUV (%) | | | BD-PSNR Y (dB) | | | BD-PSNR YUV (dB) | | | Rate deviation (%) | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Base | [1] | Ours | Base | [1] | Ours | Base | [1] | Ours | Base | [1] | Ours | Base | [1] | Ours |
BasketballDrill | 3.2 | 6.6 | -9.9 | 3.4 | 6.4 | -11.1 | -0.14 | -0.39 | 0.53 | -0.14 | -0.37 | 0.57 | 7.7 | 5.6 | 3.6 |
BasketballDrive | -0.3 | 3.1 | -7.6 | -0.1 | 3.1 | -10.7 | 0.02 | -0.15 | 0.40 | 0.02 | -0.15 | 0.53 | 11.3 | 4.4 | 5.0 |
BQMall | -7.9 | -4.7 | -17.0 | -8.6 | -5.5 | -22.7 | 0.40 | 0.22 | 0.92 | 0.41 | 0.25 | 1.18 | 25.3 | 9.6 | 11.9 |
BQTerrace | -12.5 | -19.7 | -17.6 | -13.3 | -23.1 | -24.5 | 0.59 | 0.97 | 0.88 | 0.57 | 1.07 | 1.20 | 16.8 | 6.0 | 7.7 |
Cactus | -4.7 | -13.4 | -16.1 | -4.9 | -15.5 | -19.8 | 0.24 | 0.71 | 0.85 | 0.23 | 0.77 | 0.99 | 5.6 | 11.4 | 4.4 |
Kimono | -7.6 | -16.3 | -17.4 | -7.2 | -18.0 | -22.5 | 0.35 | 0.78 | 0.84 | 0.30 | 0.76 | 0.98 | 11.5 | 6.4 | 4.9 |
ParkScene | -2.1 | -13.9 | -16.3 | -1.8 | -15.8 | -20.9 | 0.09 | 0.62 | 0.73 | 0.06 | 0.64 | 0.87 | 7.7 | 6.0 | 4.9 |
PartyScene | -10.1 | -15.5 | -21.8 | -11.0 | -17.0 | -25.9 | 0.51 | 0.81 | 1.19 | 0.52 | 0.83 | 1.33 | 24.9 | 5.8 | 8.5 |
RaceHorses | -1.8 | 5.0 | -8.6 | -1.0 | 6.2 | -11.1 | 0.08 | -0.22 | 0.41 | 0.04 | -0.25 | 0.49 | 18.6 | 6.1 | 5.1 |
Average | -4.9 | -7.6 | -14.7 | -5.0 | -8.8 | -18.8 | 0.24 | 0.37 | 0.75 | 0.22 | 0.39 | 0.91 | 14.4 | 6.8 | 6.2 |
3.2 Compression Performance and Rate Control Accuracy
Fig. 3 compares the rate-distortion (R-D) curves of different methods for three selected sequences, with the complete BD-rate and BD-PSNR results summarized in Table 3. Also compared in the same table is the rate control accuracy averaged over all the rate points. From these results, the following observations are immediate:
(1) Our method outperforms the competing methods in terms of R-D performance. Fig. 3 shows that the proposed method achieves superior R-D performance to x265, as is confirmed by its BD-PSNR and BD-rate numbers in Table 3. It improves Y-PSNR by 0.75dB and YUV-PSNR by 0.91dB, as compared to 0.37dB and 0.39dB improvements by Chen et al. [1], respectively. The corresponding BD-rate savings range from 14.7% to 18.8%, which nearly double the rate savings by Chen et al. [1] (7.6% to 8.8%).
(2) The single-critic method shows inconsistent R-D performance. From Fig. 3, it performs better than x265 on slow-motion sequences (e.g. BQTerrace) but worse on fast-motion ones (e.g. BasketballDrill, RaceHorses). The degradation, where present, is more obvious at higher bit rates. Its inconsistent behavior across video types and bit rates may be attributed to the use of a fixed hyper-parameter (Section 2.6) for trading off video quality against the rate penalty, which cannot generalize well to all scenarios.
(3) The base module is unable to meet the GOP-level rate constraints, although it shows comparable or better R-D performance than x265. Table 3 reveals that its average deviation from the GOP-level bit rate constraint is around 14% and reaches nearly 25% on some sequences. With RL training, however, both the single-critic method [1] and ours match the GOP-level bit rates of x265 much more closely (left plot of Fig. 4), reducing the rate deviations to 6.8% and 6.2% (Table 3), respectively. From Table 3, the base module achieves 4.9-5.0% BD-rate savings and 0.22-0.24dB BD-PSNR gains over x265. This is because the ABR rate control may suffer from unstable transient behavior when encoding the first few GOPs, as indicated by the developers of x265 [7]. By excluding these GOPs from behavior cloning, the base module learns a fairly fixed and regular QP pattern across GOPs, as shown in the right plot of Fig. 4. Nevertheless, the higher rate control accuracy and much improved R-D performance of the proposed method relative to the base module underscore the contribution of our RL framework (Table 3).
3.3 Frame-level QP Assignment
Fig. 5 visualizes the frame-level QP assignment within GOPs, along with the corresponding coded frame sizes and YUV-PSNR. We make the following observations:
(4) Our RL agent tends to choose smaller QP values for I- and B-frames, allocating more bits to their coding. In slow-motion sequences (e.g. BQTerrace), our approach encodes I-frames at an even higher quality using an extremely small QP. Such a policy is intuitively reasonable, since I- and B-frames serve as reference frames for the remaining non-reference b-frames; in slow-motion sequences, the quality of I-frames is even more critical to the reconstruction quality of a GOP.
(5) x265 and its behavioral clone, the base module, favor I-frames without showing a drastic difference in QP selection between B- and b-frames. A similar policy is observed in both fast- and slow-motion sequences. As a result, B- and b-frames attain similar YUV-PSNR, and B-frames use slightly more bits in fast-motion sequences (e.g. RaceHorses) due to longer-distance prediction.
(6) The single-critic method [1] learns a subtle QP selection policy. In slow-motion sequences (e.g. BQTerrace), it behaves similarly to our scheme, except that it does not particularly favor B-frames and enters the terminal state early, encoding the remaining frames at the terminal-mode QP (Section 2.6). Interestingly, such a policy shows comparable R-D performance to ours (BQTerrace in Table 3). In fast-motion sequences, however, I-frames are coded with relatively larger QP's than B- and b-frames, which explains its poorer R-D performance on these sequences.
(7) Our RL agent treats b-frames differently in fast-motion sequences. It shows uneven QP assignment among b-frames, with their QP's increasing with the coding order (Fig. 2). Recall that our objective is to minimize the GOP-level distortion; there is no constraint on the frame-level quality distribution within a GOP. From Fig. 2, b-frames are coded independently of each other as non-reference frames. The QP assignment in Fig. 5 turns out to be a reasonably good solution, since the resulting GOP-level distortion is smaller than those of the competing methods.
3.4 Subjective Quality Comparison
Fig. 6 presents randomly chosen sample images produced by the competing methods. We observe that the proposed method retains more texture details, e.g. the wooden stripe in BasketballDrill, the crease of the shirt in BQMall, and the grass region in RaceHorses. Videos are provided at http://mapl.nctu.edu.tw/RL_Rate_Control/.

4 Conclusion
This paper introduces a dual-critic RL framework for frame-level bit allocation in HEVC/H.265. It obviates the need to combine the rate and distortion rewards heuristically, as is done in the single-critic design. The proposed method achieves promising R-D performance and fairly precise rate control. Although we are not currently aware of convergence guarantees for dual-critic training, it arrives at reasonably good solutions in practice. This remains an open issue for our future work.
5 Acknowledgements
We are grateful to Yen-Kuang Chen and Minghai Qin for discussions and helpful feedback on the manuscript.
6 References
- [1] L. Chen et al., “Reinforcement learning for HEVC/H.265 frame-level bit allocation,” in 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP). IEEE, 2018, pp. 1–5.
- [2] J. Hu et al., “Reinforcement learning for HEVC/H.265 intra-frame rate control,” in 2018 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2018, pp. 1–5.
- [3] M. Zhou et al., “Rate control method based on deep reinforcement learning for dynamic video sequences in HEVC,” IEEE Transactions on Multimedia, 2020.
- [4] C. Chung et al., “HEVC/H.265 coding unit split decision using deep reinforcement learning,” in 2017 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS). IEEE, 2017, pp. 570–575.
- [5] J. Shi and Z.-B. Chen, “Reinforced bit allocation under task-driven semantic distortion metrics,” in 2020 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 2020.
- [6] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” in Proc. International Conference on Learning Representations (ICLR), 2016.
- [7] “x265 Version 2.7,” http://x265.org, February 2018.
- [8] A. Mercat et al., “UVG dataset: 50/120fps 4K sequences for video codec analysis and development,” in Proceedings of the 11th ACM Multimedia Systems Conference, 2020.
- [9] H. Wang et al., “MCL-JCV: A JND-based H.264/AVC video quality assessment dataset,” in 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 1509–1513.