Rate-Distortion Optimized Skip Coding of Region Adaptive Hierarchical Transform Coefficients for MPEG G-PCC

Zehan Wang, Yuxuan Wei, Hui Yuan Wei Zhang, and Peng Li This work was supported in part by the National Natural Science Foundation of China under Grants 62222110, 62172259, and 62311530104, the Taishan Scholar Project of Shandong Province (tsqn202103001), the Natural Science Foundation of Shandong Province under Grant ZR2022ZD38, the State Grid Corporation Headquarters Technology Project-5500-202424166A-1-1-ZN and the OPPO Research Fund. (Corresponding author: Hui Yuan.)Zehan Wang, Yuxuan Wei and Hui Yuan are with the School of Control Science and Engineering, Shandong University, Jinan 250061, China (E-mail: [email protected]; [email protected]; [email protected]).Wei Zhang and Peng Li are with the School of Telecommunications Engineering, Xidian University, Xi’an, 710071, China (E-mail: [email protected]; [email protected]).

Abstract

Three-dimensional (3D) point clouds are becoming more and more popular for representing 3D objects and scenes. Due to limited network bandwidth, efficient compression of 3D point clouds is crucial. To tackle this challenge, the Moving Picture Experts Group (MPEG) is actively developing the Geometry-based Point Cloud Compression (G-PCC) standard, incorporating innovative methods to optimize compression, such as the Region-Adaptive Hierarchical Transform (RAHT) nestled within a layer-by-layer octree-tree structure. Nevertheless, a notable problem still exists in RAHT, i.e., the proportion of zero residuals in the last few RAHT layers leads to unnecessary bitrate consumption. To address this problem, we propose an adaptive skip coding method for RAHT, which adaptively determines whether to encode the residuals of the last several layers or not, thereby improving the coding efficiency. In addition, we propose a rate-distortion cost calculation method associated with an adaptive Lagrange multiplier. Experimental results demonstrate that the proposed method achieves average Bjøntegaard rate improvements of -3.50%, -5.56%, and -4.18% for the Luma, Cb, and Cr components, respectively, on dynamic point clouds, when compared with the state-of-the-art G-PCC reference software under the common test conditions recommended by MPEG.

Index Terms:

point cloud compression, dynamic point cloud, region-adaptive hierarchical transform, rate distortion optimization, skip coding.

I Introduction

Athree-dimensional (3D) point cloud is composed of a large number of unordered points with 3D coordinates and their associated attributes (color, reflectance, etc.). It can be used to represent 3D shape and structure of objects and scenes, and can be extensively used in immersive communication, cultural heritage preservation, and autonomous driving [1], etc. However, the data volume of 3D point cloud is extremely large, leading to a great challenge of efficient storage and transmission. Therefore, 3D point cloud compression was put on the agenda.

Point cloud coding methods developed significantly in recent years, including transformation-based methods [2], 3D-to-2D projection methods [3], and deep learning-based approaches [4, 5]. To promote the application of 3D point clouds, the Moving Picture Experts Group (MPEG) started to establish coding standards for 3D point cloud in 2017 [6], and proposed three test models, i.e., test model category 1 (TMC1) [7] for static point clouds, test model category 2 (TMC2) [8] for dynamic point clouds, and test model category 3 (TMC3) [9] for radar point clouds. Later, as both TMC1 and TMC3 directly encode point clouds in 3D space, they are merged together and renamed as geometry-based point cloud compression (G-PCC) (the corresponding test model is named as TMC13) [10], while TMC2 compresses 3D point clouds by converting them into 2D geometry and attribute videos and is named as video-based point cloud compression (V-PCC) [11]. Based on the requirement of immersive communication, to further improve the compression efficiency of solid (denser) and dynamic point clouds [12], MPEG proposed a new branch of G-PCC, namely, geometry-based solid content test model (GeS-TM), in 2023. This paper focuses on this new branch of G-PCC.

Refer to caption — Figure 1: G-PCC encoding and decoding frameworks. The yellow, green, and blue colors represent data processing, geometry processing, and attribute processing, respectively.

The encoding and decoding frameworks of G-PCC are illustrated in Fig. 1. In the encoding stage, the encoder first converts the coordinates of the input 3D point cloud to normalized coordinates that can be uniformly processed by the encoder. Subsequently, after geometry quantization and removal of redundant points, the point cloud is voxelized, and then encoded by using methods such as trisoup [13], octree [14], or predictive tree [15], to generate the corresponding geometry bitstream. Next, the reconstructed geometry information is used for attribute encoding in which three methods, i.e., region-adaptive hierarchical transform (RAHT) [16], levels of detail (LoD)-based predictive transform (PT) [17], and lifting transform (LT) [18], can be selected. Arithmetic entropy coding is then applied to the transformed coefficients to generate the corresponding attribute bitstream. In the decoding stage, the geometry bitstream is first subjected to arithmetic decoding to obtain the reconstructed geometry. The reconstructed geometry is then transformed back through inverse coordinate conversion. Simultaneously, the reconstructed geometry and attribute bitstreams are jointly fed into the decoder for arithmetic decoding and inverse quantization. Furthermore, based on the attribute encoding methods (i.e, RAHT, PT, and LT), the decoder performs corresponding inverse transformations to reconstruct the attribute information [19, 20]. In this paper, we focus on RAHT which is the unique transform method in GeS-TM.

RAHT is based on 3D Haar wavelet transform with an octree structure and proceeds from the root node (the upmost layer) to the leaf nodes (the bottommost layer) of the octree. The transform generates alternating-current (AC) coefficients and direct-current (DC) coefficients. The encoder encodes AC coefficients layer by layer. To reduce temporal and spatial redundancy of AC coefficients, G-PCC introduces intra-frame and inter-frame prediction for the transformed coefficients, therefore the encoder only needs to encode the residuals between the coefficients and its predicted counterpart [21, 22]. However, there are a large number of near-zero AC residuals in the lower layers of the octree, leading to significant redundancy in the bitstream. To further improve encoding efficiency, we propose a skip coding method for the residuals of RAHT by using Rate-Distortion (RD) Optimization (RDO) to adaptively determine whether to encode the residuals of each layer or not. Additionally, we also introduce a new rate-distortion cost calculation method associated with an adaptive Lagrange multiplier.

The rest of the paper is organized as follows. In Section II, we briefly review related work. In Section III, we describe the technical details of the proposed method. Experimental results and conclusions are given in Sections IV and V, respectively.

II Related Work

Research on point cloud compression has made notable progress in recent years, accompanied by ongoing enhancements and optimizations in G-PCC standard. Queiroz et al proposed region-adaptive hierarchical transform in [16] which was adopted by G-PCC as a transform coding method for dense point clouds. RAHT relies on a pre-partitioned octree structure, where each non-empty voxel block (namely a transform block hereafter) contains 2×2×2 sub-blocks (namely sub-blocks hereafter). Within each transform block, RAHT uses a Haar wavelet transform in the X, Y, and Z directions, respectively. The specific transformation formulas for the Haar transform of 2 sub-blocks are as follows:

\begin{bmatrix}DC\\ AC\end{bmatrix}=\begin{bmatrix}a&-b\\ a&b\end{bmatrix}\begin{bmatrix}A_{1}\\ A_{2}\end{bmatrix},

(1)

where $a=\sqrt{w_{1}/(w_{1}+w_{2})}$ , $b=\sqrt{w_{2}/(w_{1}+w_{2})}$ , $A_{1},A_{2}$ represent the sum of attributes within the two transform sub-blocks, $w_{1},w_{2}$ represent the number of points within the two transform sub-blocks. For each transform block with 8 sub-blocks, we can obtain one DC coefficient and seven AC coefficients. The process of RAHT is illustrated in Fig. 2.

In the current G-PCC, the transform starts from the root node and proceeds from top to bottom. Except for the bottommost leaf nodes, the nodes in a layer can be regarded as a transform block, which contains 2×2×2 sub-blocks. Each sub-block in the current layer corresponds to the transform block of the next layer. Blocks within the same layer are transformed according to the ascending order of their block coordinates in Morton code until all 8 sub-blocks are traversed, as shown in Fig. 3 (a) and (b). The overall transform procedure within a transform block contains 2×2×2 sub-blocks can be simplified to a matrix operation

\begin{bmatrix}DC\\ AC_{1}\\ \vdots\\ AC_{N-1}\end{bmatrix}=T(w_{1},w_{1},\dots w_{N})\begin{bmatrix}A_{1}/\sqrt{w_{1}}\\ A_{2}/\sqrt{w_{2}}\\ \vdots\\ A_{N}/\sqrt{w_{N}}\end{bmatrix}.

(2)

where $A_{1},A_{1},\dots A_{N}$ represent the sum of attributes within each sub-block, $w_{1},w_{1},\dots w_{N}$ represent the number of points within each sub-block, and $T(w_{1},w_{1},\dots w_{N})$ represents the simplified transform matrix of the transform block, $DC,AC_{1},AC_{2},\dots A_{N-1}$ represent the transformed coefficients, as shown in Fig. 3(c). During the encoding procedure, only the DC coefficient of the root node is encoded, while the DC coefficients of the other transform blocks are ignored as they can be derived from the upper layer. Therefore, only N-1 transformed AC coefficients are encoded for the transform block.

Beyond that, recent research on RAHT has achieved significant progress. By exploring the spatial correlation of RAHT coefficients, Lasserre et al. proposed an up-sampling-based RAHT predictive encoding method, i.e., using the attributes of the parent node and the parent neighboring node to predict the attribute of the current node [23]. However, when there is a significant attribute variation between neighbor nodes and the current node, up-sampling-based prediction may result in significant prediction residuals. To address this problem, Zhang et al. proposed a threshold-based up-sampling prediction improvement in [24]. Wang et al. proposed to use co-plane or co-line peer neighbor attributes from sub-blocks at the same layer during the up-sampling prediction process instead of directly using the attributes of the parent neighboring nodes, to further improve prediction accuracy [25]. To further reduce temporal redundancy of dynamic multi-frame point clouds, Xu et al. proposed an inter-frame predictive method for RAHT transform coefficients. In this method, the corresponding node of the current node is found in the reference frame, and the attribute of this corresponding node are used to predict the attribute of the current node [26]. When intra-frame prediction and inter-frame prediction are introduced in the RAHT algorithm, the predicted attributes are first subjected to RAHT transform to obtain the corresponding $AC_{pre}$ , the original transform coefficients $AC_{org}$ are subtracted from the predicted transform coefficients $AC_{pre}$ , resulting in coefficient residuals $AC_{res}$ . These residuals are then quantized and subjected to RD optimized quantization (RDOQ) [27] followed by entropy coding.

In lossy compression, RDO plays a pivotal role to improve the compression efficiency. It is also commonly used as an indispensable technique in V-PCC [28, 29, 30] and G-PCC [31, 32, 33] reference softwares. In RAHT, to better compare the performance of intra-frame prediction and inter-frame prediction, Zhang et al. proposed an RDO-based prediction mode selection method to encode the transformed coefficients of a RAHT layer [34], which further improves the accuracy of RAHT coefficient prediction. To enhance the accuracy of inter-frame prediction, Ma et al. proposed a multi-reference frame prediction scheme based on RDO selection [35]. They adaptively select K reference frames for the current frame. These reference frames are then used to perform weighted inter-frame prediction of the RAHT coefficients for the current frame. Xu et al. introduced a weighted average prediction between intra-frame and inter-frame predictions, and employed RDO to adaptively select the optimal prediction mode from inter-frame prediction, intra-frame prediction, and average prediction [36].

III Proposed Method

III-A Analysis of RAHT coefficients

When conducting RAHT from top to bottom, the number of transform blocks gradually increases with the depth of layers, leading to an increasing number of AC coefficients that need to be encoded. Taking the point cloud “dancer_vox11_00000001” as an example, Table I summarizes the total number of AC coefficients that need to be encoded for each layer. Herein, layer 0 represents the root node. From the table, it is evident that, for all bitrate configurations (r01, r02, r03, r04, r05, r06, from low bitrate to high bitrate), the number of coefficients in deep layers occupy the majority of the total coefficients. Additionally, as the layers deepen, a large transform block is progressively divided into several small blocks by the octree structure, and the attributes in a large block are allocated to the small blocks, resulting in a gradual decrease of attributes within the small blocks. This, in turn, leads to a decrease of AC coefficients after RAHT of the small blocks. After the prediction and quantization, in the last few layers near the leaf nodes, the proportion of zero residuals exceeds 95%. Fig. 4 illustrates the variations of zero values in each layer of the dense point cloud “dancer_vox11_00000001” and the sparse point cloud “Arco_Valentino_Dense_vox12”. We can see that both dense and sparse point clouds exhibit the aforementioned trend. Particularly, for bitrates, i.e., r01 and r02, the proportion of zero values in the last two layers of Luma components and the last four layers of Chroma components exceeds 99%. This suggests that the proportion of zero values within a layer at low bitrates is larger than that at high bitrates, and the proportion of zero values in the last few layers of Chroma components is also higher than that in Luma component. Therefore, encoding a significant amount of residuals of so many zero values in the last few layers leads to unnecessary waste of bitrate.

TABLE I: The total number of AC coefficients to be encoded in each layer at different bitrates

dancer	r01	r02	r03	r04	r05	r06
layer 0	2	2	2	2	2	2
layer 1	14	14	14	14	4	6
layer 2	29	29	29	29	12	10
layer 3	123	122	123	122	50	25
layer 4	511	547	549	549	204	135
layer 5	2099	2177	2171	2168	843	493
layer 6	8783	8681	8649	8610	3470	1811
layer 7	-	34575	34633	34657	13813	7538
layer 8	-	-	135811	135844	54325	28866
layer 9	-	-	-	520522	211127	111548
layer 10	-	-	-	-	788745	410150

III-B Rate and distortion estimation

To reduce the bit rate waste of the last few layers in RAHT, we propose an adaptive skip coding method for the transformed coefficients by using RDO. As the proportion of zero residuals in the last several layers is large, we first estimate the attribute bitrates of skipping all AC residuals of the last one, the last two, the last three, and the last four layers, and calculate the corresponding distortion, thus calculate the RD cost of the four cases, respectively. Then, we estimate the attribute bitrate for encoding all the AC residuals and also calculate the corresponding distortion to obtain the corresponding RD cost. Subsequently, the RD costs of the five cases are compared, and the case which gives the minimum RD cost is selected and indicated in the bit stream for the decoder. Besides, the skip coding method is independently applied to Luma, Cb, and Cr color components.

From Parseval’s theorem [37], the distortion in the transform domain is equal to that in the attribute domain [38]. Therefore, to save computational complexity, distortion is calculated directly in the transform domain to avoid the additional computational complexity induced by inverse RAHT transform. For the reconstructed point cloud obtained by encoding all residuals, the distortion can be calculated as

D_{org}=\sum_{i=1}^{M_{t}}(AC_{org}^{i}-AC_{recon}^{i})^{2},

(3)

AC_{recon}^{i}=AC_{res_{q}}^{i}\times Q+AC_{pre}^{i}.

(4)

where $AC_{org}^{i}$ represents the i-th AC coefficient, $AC_{recon}^{i}$ represents the reconstructed i-th AC coefficient, $AC_{pre}^{i}$ is the i-th predicted AC coefficient, $AC_{res_{q}}^{i}$ is the i-th quantized AC residual, $Q$ denotes the quantization step size, and $M_{t}$ is the total number of AC coefficients. When all residuals of the last $k$ , $k\in\left\{1,2,3,4\right\}$ , layers are skipped, all residuals of the AC coefficients in the last k layers are ignored during encoding, they are reconstructed as zero in the decoder. Therefore, these reconstructed AC coefficients are replaced by their predicted values, and their distortion can be calculated by ${\textstyle\sum_{i=1}^{M_{k}}}(AC_{org}^{i}-AC_{pre}^{i})^{2}$ . Therefore, the distortion in this case, denoted as $D_{k}$ can be represented as

D_{k}=\sum_{i=1}^{M_{k}}(AC_{org}^{i}-AC_{pre}^{i})^{2}+\sum_{i=M_{k}+1}^{M_{t}}(AC_{org}^{i}-AC_{recon}^{i})^{2}.

(5)

where $M_{k}$ represents the number of AC coefficients in the last k layers of RAHT.

Due to the fact that most of the residuals after RAHT are zero, zero-run length coding [39] is employed in G-PCC to encode non-zero residuals and run lengths of consecutive zeros. In practical implementation, three flags, i.e., utilizing_sign, iszero, and isone are used for non-zero residuals to distinguish negative values, 1, or 2. The remaining values are encoded using an exponential golomb encoder [40]. For run length of consecutive zeros, truncated unary coding combined with exponential golomb encoder is used. The bitrate can be estimated by the cumulative number of bits required for encoding. For truncated unary coding, the bitrate can be estimated by the length of the codeword. For exponential golomb coding, the number of bits can be estimated by the probability of the remaining values, which is updated in real time. The procedure of encoding and bitrate estimation of quantized coefficient’ residuals are illustrated in Fig. 5. In the proposed method, we use $R_{org}$ to denote the attribute bits when all residuals are encoded and $R_{k}$ to represent the attribute bits (including the flags) when all AC residuals of the last $k$ , $k\in\left\{1,2,3,4\right\}$ layers are skipped. Therefore, the RD cost can be calculated by

RDcost=D+\lambda\cdot R,

(6)

where $D$ represents the distortion ( $D_{k}$ resp. $D_{org}$ ), $R$ represents the bits ( $R_{k}$ resp. $R_{org}$ ), and $\lambda$ is the Lagrange multiplier which follows the basic style of H.266/VVC, i.e.,

\lambda=c\times 2^{\frac{QP-12}{3}},

(7)

where $c$ is a content dependent constant and $QP$ is the quantization parameter.

TABLE II: The

BDBR_{Total}

with respective to

c

for different test point clouds

constant c	queen	redandblack	longdress	basketball	average
0.05	-20.69%	-0.80%	-5.87%	-3.92%	-7.82%
0.1	-138.95%	0.53%	-8.76%	-53.53%	-50.18%
0.15	-139.17%	1.31%	-8.37%	-54.61%	-50.21%
0.2	-137.20%	0.12%	-9.42%	-59.13%	-51.41%
0.25	-130.40%	-11.14%	-9.47%	-59.58%	-52.65%
0.26	-130.44%	-11.26%	-9.48%	-59.83%	-52.75%
0.27	-127.86%	-1.60%	-9.41%	-59.56%	-49.61%
0.28	-126.50%	-1.69%	-7.00%	-59.23%	-48.61%
0.29	-126.38%	-0.37%	1.57%	-59.33%	-46.13%
0.3	-126.38%	-0.19%	10.31%	-59.62%	-43.97%
0.35	-125.62%	-0.18%	13.29%	-59.10%	-42.90%
0.4	-125.64%	-1.36%	13.10%	-63.73%	-44.41%
0.45	-123.05%	-1.20%	12.50%	-65.32%	-44.27%
0.5	-120.78%	-0.90%	12.50%	-65.32%	-43.62%

In our study, we did extensive statistical experiments to obtain $c$ . First, we implemented the proposed method onto the latest software platform GeS-TM v4.0 [41] and encoded some typical test point clouds “queen”, “8ivfbv2_redandblack_vox10” in the dynamic point cloud category Cat2-A, “8ivfbv2_longdress_vox10” in the dynamic point cloud category Cat2-B, and “basketball_player_vox11” in the dynamic point cloud category Cat2-C provided by MPEG to evaluate the coding performance with respect to $c$ . During the experiment, $c$ ranges from 0.05 to 0.5 with an incremental step size of 0.01. To find the optimal $c$ , we use the End-to-End BD‑Attribute Rate (BDBR) [42] to quantitatively represents the attribute bitrate savings at the same distortion when comparing the proposed method with GeS-TM v4.0. A negative BDBR indicates a positive gain compared to G-PCC, whereas a positive BDBR indicates a negative gain. We tested all the bitrates (from the lowest bitrate to the highest bitrate) by following the common test condition (CTC) [43] of Ges-TM, and presented the average performance. By taking all the color components into account, we also define $BDBR_{Total}$ to represent the sum of BDBRs for Luma and Chroma components,

BDBR_{Total}=a\times BDBR_{Luma}+BDBR_{Cb}+BDBR_{Cr}.

(8)

where $BDBR_{Luma}$ , $BDBR_{Cb}$ and $BDBR_{Cr}$ represent the BDBR of Luma, Cb, and Cr components, respectively, compared to GeS-TM v4.0. The parameter $a$ equals to 7 because the importance of Luma is roughly 7 times greater than that of Chroma [44]. The statistical results are given in Table II, and the variation of average $BDBR_{Total}$ with respect to $c$ is shown in Fig. 6. Finally, we can conclude from Table II and Fig. 6 that the optimal $c$ can be set as 0.26.

III-C Implementation details

By using the optimal $c$ , we can compare the $RDcost_{k}$ of skipping the residuals of the last $k$ , $k\in\left\{1,2,3,4\right\}$ , layers with the $RDcost_{org}$ of normally encoding all the coefficients by using (6) and (7), and pre-set a flag to indicate the best case for the decoder. If $RDcost_{org}$ is smaller than all $RDcost_{k}$ , we set the flag to be $k=0$ . In this case, the encoder will encode all RAHT residuals as their original values. If $RDcost_{k}$ is smaller than $RDcost_{org}$ , we find the smallest $RDcost_{k}$ , $k\in\left\{1,2,3,4\right\}$ , and set the flag as $k$ to skip the residuals of the last $k$ layers of RAHT.

The decoder first extracts the flag from the attribute bitstream. If the flag is $k=0$ , the residuals are decoded normally. If the flag is not zero, the decoder will deem all the residuals of the last $k$ layers of RAHT as zero. Subsequently, following the original G-PCC decoding process, the decoded residuals are inversely quantized to obtain $AC_{res}$ . If intra-frame or inter-frame prediction was performed at the encoder side, the obtained $AC_{res}$ is added to the predicted coefficients $AC_{pre}$ to obtain the reconstructed coefficients $AC_{recon}$ . Finally, inverse RAHT is applied to obtain the reconstructed attributes. The encoding and decoding framework for the proposed method can be illustrated in Fig. 7. Moreover, as there are three color components, we perform the proposed method for the three color components independently and use three flags, namely flag_luma, flag_cb, and flag_cr, to indicate how to decode the residuals, respectively.

IV Experimental Results and Analyses

To verify the effectiveness of the proposed method, we implemented the proposed method in the latest test model of G-PCC reference software GeS-TMv4.0 [41]. This test model is especially for dynamic object point clouds (categorized as Category 2, i.e., Cat2 in MPEG) which are further divided into Cat2-A, Cat2-B, and Cat2-C based on their content complexity.

The experiments were conducted according to the CTC [43] of the testing model. Octree encoding is used for geometry, while RAHT is used for attribute encoding. We tested two conditions: lossless geometry lossy attributes (defined as C1 condition) and lossy geometry lossy attributes (defined as C2 condition), by following the CTC. When using GeS-TM for dynamic point cloud compression, inter-frame prediction was also enabled, namely “Octree-RAHT-inter”. We also conducted the experiment when inter-frame prediction was disabled under the C1 and C2 condition, namely “Octree-RAHT-intra” for extensive comparison and analyses. The hardware platform was an Intel I7-8700K CPU with 16GB of memory, and the software platform was the Windows 10 operating system.

TABLE III: BD-rates when comparing with GeS-TMv4.0-Octree-RAHT-inter under C1 and C2 condition

Octree-RAHT-inter	End-to-End BD‑AttrRate (%)
	C1 condition			C2 condition
Class	Luma	Cb	Cr	Luma	Cb	Cr
loot	-0.23%	0.29%	-0.18%	-0.33%	-0.57%	-0.83%
redandblack	0.16%	0.09%	0.13%	-0.97%	-2.49%	-0.59%
soldier	-0.03%	-0.10%	-0.05%	-1.23%	-3.71%	-1.97%
queen	-3.04%	-10.30%	-4.01%	-14.35%	-20.68%	-18.15%
longdress	0.21%	-0.17%	-0.39%	-1.04%	-1.71%	-1.89%
basketball_player	-0.35%	-0.47%	-1.58%	-1.44%	1.54%	1.04%
dancer_player	0.08%	0.21%	1.29%	-5.14%	-11.27%	-6.87%
Cat2-A average	-0.78%	-2.50%	-1.03%	-4.22%	-6.86%	-5.39%
Cat2-B average	0.21%	-0.17%	-0.39%	-1.04%	-1.71%	-1.89%
Cat2-C average	-0.13%	-0.13%	-0.14%	-3.29%	-4.87%	-2.91%
Overall average	-0.46%	-1.49%	-0.68%	-3.50%	-5.56%	-4.18%

IV-A BD-rate Comparison under CTC

In the field of 3D point cloud compression standardization, Bits Per Output Point (BPOP) is usually adopted to measure the average bitrate required for each point. For the same reconstruction quality, a lower BPOP indicates higher encoding efficiency. The reconstruction quality of a point cloud is usually measured by Peak Signal-to-Noise Ratio (PSNR), where a higher PSNR indicates higher reconstruction quality. In the CTC, each point cloud is first encoded at 6 different quantization parameters, resulting in 6 bitrates (i.e., r01, r02, r03, r04, r05, and r06) and PSNRs. Then, Bjøntegaard rate (BD-rate, in percentage) can be calculated to quantitatively evaluate the RD performance [42]. A negative BD-rate denotes a decrease in BPOP at the same reconstruction quality, indicating better encoding efficiency.

TABLE IV: BD-rates when comparing with GeS-TMv4.0-Octree-RAHT-intra under C1 and C2 condition

Octree-RAHT-intra	End-to-End BD‑AttrRate (%)
	C1 condition			C2 condition
Class	Luma	Cb	Cr	Luma	Cb	Cr
loot	0.04%	-0.02%	-0.02%	-0.10%	-0.15%	-0.14%
redandblack	0.16%	-0.04%	-0.04%	0.12%	-0.42%	-0.27%
soldier	-0.01%	-0.04%	-0.04%	-0.15%	-0.22%	-0.24%
queen	0.81%	-3.34%	0.61%	1.91%	-6.62%	-3.74%
longdress	0.46%	-0.64%	-0.20%	-0.04%	-1.23%	-0.04%
basketball_player	-0.05%	-0.04%	-0.07%	0.08%	-3.23%	-2.92%
dancer_player	-0.11%	-0.21%	-0.25%	-0.76%	-4.97%	-4.11%
Cat2-A average	0.25%	-0.86%	0.13%	0.45%	-1.85%	-1.10%
Cat2-B average	0.46%	-0.64%	-0.20%	-0.04%	-1.23%	-0.04%
Cat2-C average	-0.08%	-0.12%	-0.16%	-0.34%	-4.10%	-3.52%
Overall average	0.18%	-0.62%	0.00%	0.15%	-2.41%	-1.64%

Table III shows the performance of dynamic point clouds of the proposed method compared to GeS-TMv4.0 under the C1 and C2 conditions, where End-to-End BD-AttrRate (%) denotes the change in attribute BPOP at the same reconstruction quality. Table IV gives the results when attribute inter-frame prediction is disabled and only intra-frame prediction is used under the C1 and C2 conditions.

From Table III, we can see that the RD performance gain of C2 condition is larger than that of C1 condition. This is because the number of points under the lossless geometry compression of C1 condition leads to larger total coding bits, and therefore, the ratio of saved bits by the proposed method is not so significant. Besides, in C1 condition, the values of AC residuals in the last few RAHT layers are also large, leading to more reconstructed distortion and PSNR loss if they are skipped. Specifically, at C1 condition, the average BD-rates for Luma, Cb, and Cr components achieves -0.46%, -1.49%, and -0.68%, respectively, while at C2 condition, the corresponding average BD-rates are -3.50%, -5.56%, and -4.18%, respectively. It is noteworthy that the average BD-rates of the three color components achieve -14.35%, -20.68%, and -18.15%, respectively, for “queen”. This is because “queen” is a computer-generated imagery (CGI) sequence, which has stronger inter-frame correlation compared to regular sequences, making the color attributes much easier to be predicted within the RAHT. Therefore, when inter-frame prediction is enabled, a larger proportion of the quantized AC coefficient’ residuals are zero values. Accordingly, using the proposed method achieves for significant bitrate savings. From Table IV, we can also get the conclusion that the RD performance gain of C2 condition is larger than that of C1 condition.

By comparing Table III with Table IV, we can observe that the proposed method achieves better performance when inter-frame prediction is enabled. This is because, when we skipped the residuals in the last few layers, $AC_{pre}$ is directly used to represent $AC_{recon}$ on the decoder side. The accuracy of inter-prediction is typically better than intra-prediction, resulting in higher PSNR and RD performance.

IV-B RD Comparison

To analyze the reconstruction quality of point clouds at different bitrates, we conducted an analysis on the dynamic point cloud “queen”, “8ivfbv2_soldier_vox10”, “8ivfbv2_redandblack_vox10” in Cat2-A category, “8ivfbv2_longdress_vox10” in Cat2-B category, and “dancer_player_vox11” in Cat2-C category under the “Octree-RAHT-inter-C2” condition. Fig. 8 depicts the RD curves of the proposed method and G-PCC, in which the weighted average (7:1:1) of PSNR for Luma, Cb, and Cr components is used. We can see that the proposed method exhibits better performance at lower bitrates. This is because lower bitrates entail a larger quantization step, leading to a substantial number of zero values in the last few layers of AC residuals in RAHT. Therefore, significant bit rate saving and minimal PSNR loss can be achieved by the proposed method. At high bit rates, as the significant distortion leads to a high RD cost, the proposed method is hard to be selected by RDO.

IV-C Snapshots of Dynamic Sequences

Due to the difficulty of visualizing dynamic point cloud sequences, we randomly selected certain snapshots from the dynamic point cloud sequences. By comparing the subjective quality of these snapshots, we aim to approximately evaluate the performance of our method. Fig. 9 shows some snapshots of the point clouds reconstructed by the proposed method and GesTMv4.0. Specifically, we selected the point cloud “queen_frame_1.ply”, “redandblack_frame_8.ply”, “longdress_vox10_frame_6.ply” and “dancer_vox11_frame_2.ply” for illustration, as shown in Fig. 9 (a), (b), (c), and (d). We can see that the proposed method can achieve similar subject quality by consuming much smaller bit rates. For instance, under the C2, r02 condition, the point cloud “queen_frame_1.ply” was reconstructed by GPCC (GeS-TMv4.0) with a color BPOP of 27.6 $\times 10^{3}$ , whereas our method only required 4.7 $\times 10^{3}$ color BPOP. We saved 82% of the bitrate while achieving almost the same subjective visual quality. Besides, the proposed method can also eliminates block artifacts effectively and thus producing smoother visual appearance compared with GeS-TMv4.0. This is because the residuals in the last few layers are skipped and the AC prediction is directly used as the reconstruction, resulting in more smooth reconstruction quality.

TABLE V: Complexity Ratio comparison with GeS-TMv4.0

Complexity Ratio (%)		Octree-RAHT-inter-C1		Octree-RAHT-intra-C1		Octree-RAHT-inter-C2		Octree-RAHT-intra-C2
Class		C_enc	C_dec	C_enc	C_dec	C_enc	C_dec	C_enc	C_dec
cat2-A	8ivfbv2_loot_vox10	136.10%	101.38%	165.43%	100.43%	112.57%	100.33%	119.25%	100.73%
	8ivfbv2_redandblack_vox10	134.04%	100.02%	165.05%	100.92%	112.53%	100.81%	119.08%	99.66%
	8ivfbv2_soldier_vox10	140.66%	99.73%	165.55%	100.57%	114.62%	100.37%	119.36%	100.26%
	queen	142.51%	100.10%	165.68%	101.19%	113.86%	100.78%	118.50%	100.94%
cat2-B	8ivfbv2_longdress_vox10	133.13%	99.56%	165.76%	100.47%	112.20%	100.42%	119.60%	100.84%
cat2-C	basketball_player_vox11	127.36%	99.93%	165.51%	100.25%	108.39%	100.28%	115.02%	101.52%
	dancer_player_vox11	129.62%	99.57%	166.33%	100.32%	109.68%	101.56%	115.31%	100.43%
Overall average		134.87%	100.04%	165.62%	100.59%	112.34%	100.59%	113.17%	100.10%

IV-D Complexity Comparison

For multi-frame dynamic point clouds, the average encoding and decoding complexity $C$ is formulated according to:

C_{enc/dec}=\frac{T_{end/dec}^{pro}}{T_{end/dec}^{anc}}\times 100\%,

(9)

where $T_{end/dec}^{pro}$ and $T_{end/dec}^{anc}$ represent the average encoding/decoding time of the proposed method and anchor (i.e., G-PCC), respectively. Table V shows the compares time complexity of the proposed method with GeS-TMv4.0 under the C1 and C2 conditions, respectively. Compared to the GeS-TMv4.0, there is a certain increase in encoding time for the proposed method, however, the decoding time remains more or less the same. We can also see that the encoding time increase of the proposed method under C1 condition is greater than that under C2 condition. This is because, during RDO, distortion and bitrate are calculated block by block. As the number of blocks is quite large in lossless geometry conditions (C1 condition), leading to higher time complexity. Additionally, the increase of encoding time under inter-frame coding configuration is less than that under intra-frame coding configuration, as inter-frame prediction results in smaller AC coefficients’ residuals (i.e., more zero residuals), making faster distortion and bitrate estimation. Furthermore, when inter-frame prediction is enabled, the proposed method is used more frequently, reducing more time complexity of attribute reconstruction at the encoder side. We also observe a slight decrease in decoding complexity for certain point clouds. This is because the proposed method does not need to decode the residuals in the last $k$ layers.

V Conclusion

We proposed a skip coding method for RAHT coefficients by adaptively determining whether to encode the residuals in the last $k$ ( $k\in\left\{1,2,3,4\right\}$ ) layers. Specifically, we proposed a comprehensive RD framework including bitrate calculation, distortion estimation, and skip coding condition for the coefficients’ residuals to guarantee the RD performance. Additionally, we present an experimental method for determining the Lagrange multiplier. Experimental results demonstrate that the proposed method achieves significant improvements in encoding efficiency compared with the CTC (recommended by MPEG) of state-of-the-art G-PCC reference software (GeS-TMv4.0), especially at low bitrates. Specifically, the average BD-rates for the Luma, Cb, and Cr channels achieve -3.50%, -5.56%, and -4.18%, respectively, under the configuration of lossy geometry and lossy attribute compression for inter-frame compression of dynamic point clouds. Moreover, the proposed method does not induce additional decoding complexity, which is friendly for practical applications. In the future, we will keep developing optimization technologies for MPEG G-PCC to further improve its coding efficiency at affordable complexity consumption.

References

[1] S. Chen, S. Niu, T. Lan and B. Liu, “PCT: Large-Scale 3d Point Cloud Representations Via Graph Inception Networks with Applications to Autonomous Driving,” in 2019 IEEE International Conference on Image Processing , Taipei, Taiwan, 2019, pp. 4395-4399, DOI: 10.1109/ICIP.2019.8803525.
[2] B. Zhao, W. Lin and C. Lv, “Fine-Grained Patch Segmentation and Rasterization for 3-D Point Cloud Attribute Compression,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 12, pp. 4590-4602, Dec. 2021.
[3] H. Liu, H. Yuan, Q. Liu, J. Hou, H. Zeng and S. Kwong, “A Hybrid Compression Framework for Color Attributes of Static 3D Point Clouds,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 3, pp. 1564-1577, Mar 2022.
[4] D. T. Nguyen and A. Kaup, “Lossless Point Cloud Geometry and Attribute Compression Using a Learned Conditional Probability Model,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 8, pp. 4337-4348, Aug. 2023.
[5] H. Liu, H. Yuan, R. Hamzaoui, Q. Liu and S. Li, “PU-Mask: 3D Point Cloud Upsampling via an Implicit Virtual Mask,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 7, pp. 6489-6502, July 2024.
[6] S. Schwarz et al., “Emerging MPEG Standards for Point Cloud Compression,” in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 1, pp. 133-148, Mar. 2019.
[7] MPEG 3D Graphics Coding and Haptics Coding, “Point cloud compressiontest model for category 1 V0,” in ISO/IEC JTC1/SC29/WG11 MPEG output document w17223, Macau, Oct. 2017.
[8] MPEG 3D Graphics Coding and Haptics Coding, “PCC test model category 2 V0,” in ISO/IEC JTC1/SC29/WG11 MPEG output document w17248, Macau, Oct. 2017.
[9] MPEG 3D Graphics Coding and Haptics Coding, “PCC test model category 3 V0,” in ISO/IEC JTC1/SC29/WG11 MPEG output document w17249, Macau, Oct. 2017.
[10] MPEG 3D Graphics Coding and Haptics Coding, “G-PCC Codec Description,” in ISO/IEC JTC1/SC29/WG11 MPEG output document w19331, Alpbach, Apr. 2020.
[11] MPEG 3D Graphics Coding and Haptics Coding, “V-PCC Codec Description,” in ISO/IEC JTC1/SC29/WG07 MPEG output document N00100, Online, Apr. 2021.
[12] MPEG 3D Graphics Coding and Haptics Coding, “EE 13.60 on dynamic solid coding with G-PCC,” in ISO/IEC JTC1/SC29/WG07 MPEG output document N00528, Online, Jan. 2023.
[13] S. Lasserre, “Improving TriSoup summary, results and perspective,” in ISO/IEC JTC1/SC29/WG07 MPEG input document m59288, Online, Apr. 2022.
[14] X. Zhang and W. Gao, “Adaptive Geometry Partition for Point Cloud Compression,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 12, pp. 4561-4574, Dec. 2021
[15] Zhenzhen Gao, David Flynn, Alexis Tourapis, and Khaled Mammou, “Predictive Geometry Coding,” in ISO/IEC JTC1/SC29/WG11 MPEG input document m51012, Geneva, Oct. 2019.
[16] R. L. de Queiroz and P. A. Chou, “Compression of 3D Point Clouds Using a Region-Adaptive Hierarchical Transform,” in IEEE Transactions on Image Processing, vol. 25, no. 8, pp. 3947-3956, Aug. 2016.
[17] Toshiyasu Sugio, “Reference structure modification on attribute predicting transform in TMC13,” in ISO/IEC JTC1/SC29/WG11 MPEG input document m46107, Marrakech, Jan. 2019.
[18] Khaled Mammou, Alexis Tourapis, Jungsun Kim, Fabrice Robinet, Valery Valentin and Yeping Su, “Lifting Scheme for Lossy Attribute Encoding in TMC1,” in ISO/IEC JTC1/SC29/WG11 MPEG input document m42640, San Diego, Apr. 2018.
[19] D. Graziosi, O. Nakagami, S. Kuma, A. Zaghetto, T. Suzuki, and A. Tabatabai, “An overview of ongoing point cloud compression standardization activities: Video-based (V-PCC) and geometry-based (G-PCC),” in APSIPA Transactions on Signal and Information Processing, vol. 9, p. E13, 2020.
[20] H. Liu, H. Yuan, Q. Liu, J. Hou and J. Liu, “A Comprehensive Study and Comparison of Core Technologies for MPEG 3-D Point Cloud Compression,” in IEEE Transactions on Broadcasting, vol. 66, no. 3, pp. 701-717, Sept. 2020.
[21] MPEG 3D Graphics Coding and Haptics Coding, “CE 13.18 on RAHT AC prediction,” in ISO/IEC JTC1/SC29/WG11 MPEG output document w18503, Geneva, Mar. 2019.
[22] MPEG 3D Graphics Coding and Haptics Coding, “EE4FE 13.2 on inter prediction,” in ISO/IEC JTC1/SC29/WG07 MPEG output document N00020, Online, Oct. 2020.
[23] S. Lasserre, D. Flynn, “On an improvement of RAHT to exploit attribute correlation,” in ISO/IEC JTC1/SC29/WG11 MPEG input document m47378, Geneva, Mar. 2019.
[24] Wei Zhang, Na Dai, Mary-Luc Champel, “RAHT upsampled prediction improvement,” in ISO/IEC JTC1/SC29/WG11 MPEG input document m54607, Online, Jun. 2020.
[25] W. Wang, Y. Xu, K. Zhang and L. Zhang, “Peer Upsampled Transform Domain Prediction for G-PCC,” in 2023 IEEE International Conference on Multimedia and Expo, Brisbane, Australia, 2023, pp. 708-713, DOI: 10.1109/ICME55011.2023.00127.
[26] Yingzhan Xu, Wenyi Wang, Kai Zhang and Li Zhang, “Inter-Prediction for RAHT Attribute Coding,” in ISO/IEC JTC1/SC29/WG07 MPEG input document m61083, Online, Oct. 2020.
[27] Guo T, Yuan H, Wang L, et al., “Rate-distortion optimized quantization for geometry-based point cloud compression,” in Journal of Electronic Imaging, vol. 32, no. 1, article no. 013047, Feb. 2023.
[28] Q. Liu, H. Yuan, J. Hou, R. Hamzaoui and H. Su, “Model-Based Joint Bit Allocation Between Geometry and Color for Video-Based 3D Point Cloud Compression,” in IEEE Transactions on Multimedia, vol. 23, pp. 3278-3291, 2021.
[29] Q. Liu, H. Yuan, R. Hamzaoui, H. Su, J. Hou and H. Yang, “Reduced Reference Perceptual Quality Model With Application to Rate Control for Video-Based Point Cloud Compression,” in IEEE Transactions on Image Processing, vol. 30, pp. 6623-6636, Jul. 2021.
[30] C. Herglotz, N. Genser and A. Kaup, “Rate-Distortion Optimal Transform Coefficient Selection for Unoccupied Regions in Video-Based Point Cloud Compression,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 7996-8009, Nov. 2022.
[31] P. Gao, S. Luo and M. Paul, “Rate-Distortion Modeling for Bit Rate Constrained Point Cloud Compression,” in IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 5, pp. 2424-2438, May. 2023.
[32] Tian Guo, Hui Yuan, Raouf Hamzaoui, Xiaohui Wang, and Lu Wang, “Dependency-based coarse-to-fine approach for reducing distortion accumulation in G-PCC attribute compression,” in IEEE Transactions on Industrial Informatics, vol. 20, no. 9, pp. 11393-11403, Sept. 2024.
[33] L. Wei, S. Wan, Z. Wang and F. Yang, “Near-Lossless Compression of Point Cloud Attribute Using Quantization Parameter Cascading and Rate-Distortion Optimization,” in IEEE Transactions on Multimedia, vol. 26, pp. 3317-3330, 2024.
[34] W. Zhang, J. Wang, X. Liu, F. Yang and N. Wang, “On RAHT prediction mode selection,” in ISO/IEC JTC1/SC29/WG07 MPEG input document m64112, Geneva, Jul. 2023.
[35] Chuang Ma, Yue Yu, Haoping Yu, Dong Wang, “Adaptive selection of Multi-reference frames for RAHT attribute inter coding,” in ISO/IEC JTC1/SC29/WG07 MPEG input document m66318, Online, Jan. 2024.
[36] Yingzhan Xu, Bharath Vishwanath, Kai Zhang, Li Zhang, “Report on Average Prediction for RAHT Attribute Coding,” in ISO/IEC JTC1/SC29/WG07 MPEG input document m66274, Online, Jan. 2024.
[37] A. Iwasaki, “Deriving the Variance of the Discrete Fourier Transform Test Using Parseval’s Theorem,” in IEEE Transactions on Information Theory, vol. 66, no. 2, pp. 1164-1170, Feb. 2020.
[38] Bharath Vishwanath, Yingzhan Xu, Kai Zhang, Li Zhang, “Transform Domain Distortion Estimation for RAHT,” in ISO/IEC JTC1/SC29/WG07 MPEG input document m66308, Online, Jan. 2024.
[39] K. J. Hole and O. Ytrehus, “Cosets of convolutional codes with least possible maximum zero- and one-run lengths,” in IEEE Transactions on Information Theory, vol. 44, no. 1, pp. 423-431, Jan. 1998.
[40] M. F. Brejza et al., “Exponential Golomb and Rice Error Correction Codes for Generalized Near-Capacity Joint Source and Channel Coding,” in IEEE Access, vol. 4, pp. 7154-7175, 2016.
[41] MPEG 3D Graphics Coding and Haptics Coding, “Test model for geometry-based solid point cloud - GeS TM 3.0,” in ISO/IEC JTC1/SC29/WG07 MPEG output document N00750, Hannover, Oct. 2023.
[42] Bjontegaard G, “Calculation of average PSNR differences between RD-curves,” in ITU SG16 Doc. VCEG-M33, 2001.
[43] MPEG 3D Graphics Coding and Haptics Coding, “Common Test Conditions for G-PCC,” in ISO/IEC JTC1/SC29/WG07 MPEG output document N00722, Hannover, Oct. 2023.
[44] S. Lasserre, J. Taquet, “On balancing attribute QPs for GeS-TM,” in ISO/IEC JTC1/SC29/WG07 MPEG input document m65830, Online, Jan. 2024.