
A Multi-Scale Spatial-Temporal Network for Wireless Video Transmission

Xinyi Zhou, Beijing University of Posts and Telecommunications, Beijing, China; Danlan Huang (corresponding author, [email protected]); Zhixin Qi, Beijing University of Posts and Telecommunications, Beijing, China; Liang Zhang, Beijing University of Posts and Telecommunications, Beijing, China; Ting Jiang, Beijing University of Posts and Telecommunications, Beijing, China
Abstract

Deep joint source-channel coding (DeepJSCC) has shown promise in wireless transmission of text, speech, and images within the realm of semantic communication. However, wireless video transmission presents greater challenges due to the difficulty of extracting and compactly representing both spatial and temporal features, as well as its significant bandwidth and computational resource requirements. In response, we propose a novel video DeepJSCC (VDJSCC) approach to enable end-to-end video transmission over a wireless channel. Our approach involves the design of a multi-scale vision Transformer encoder and decoder to effectively capture spatial-temporal representations over long-term frames. Additionally, we propose a dynamic token selection module to mask less semantically important tokens from spatial or temporal dimensions, allowing for content-adaptive variable-length video coding by adjusting the token keep ratio. Experimental results demonstrate the effectiveness of our VDJSCC approach compared to digital schemes that use separate source and channel codes, as well as other DeepJSCC schemes, in terms of reconstruction quality and bandwidth reduction.

Index Terms:
Video deep joint source-channel coding, multi-scale Transformer, spatial-temporal network, dynamic token selection.

I Introduction

With the rapid growth of video traffic from applications such as video conferencing and virtual reality, wireless video transmission technology has gained widespread attention as a way to handle large volumes of data. Digital wireless video transmission schemes are based on separate source and channel coding. Source codes, such as the industry-standard video codecs H.264/H.265, attempt to eliminate redundant information, while channel codes, such as low-density parity-check (LDPC) codes, add redundant bits to overcome distortions in the imperfect wireless channel. However, the separate coding scheme is sub-optimal for video transmission, since it suffers from the cliff effect and struggles to meet the low-latency requirement of video applications.

To address the aforementioned issues, joint source-channel coding (JSCC) has been proposed to achieve system-level optimality in the coding process [1]. Owing to the rapid advancement of deep learning (DL) in recent years, an increasing number of DL models are being used in JSCC systems to enhance compact feature representation and noise resilience. The deep joint source-channel coding (DeepJSCC) scheme has demonstrated significant success in various wireless data transmission tasks, including image [2], text [3], and speech [4]. Nevertheless, there is still limited research on the wireless video transmission task. The core challenge is how to capture dynamic temporal features across frames, instead of merely considering static spatial features within a frame.

To facilitate the wireless transmission of video, an image DeepJSCC method [2, 5] can be applied to process each frame individually. However, this approach overlooks the temporal correlations among sequential frames and applies an identical compression ratio to each frame, resulting in unnecessary redundancy. Alternatively, previous research [6, 7] has focused on separately encoding key frames and residual motion information in order to account for temporal correlations. For instance, MGCNet [7] extracts and aggregates context features at per-frame, short-term, and long-term granularities. Nevertheless, these approaches often rely on computing motion information with optical flow algorithms, which are burdened by heavy computational complexity and limit their practical generalization.

Recently, the Vision Transformer (ViT) [8] has demonstrated promise for image analysis tasks. Additionally, the Video Swin Transformer [9] has expanded the applicability of local attention computation from spatial to spatial-temporal domains by incorporating 3D patch and shifted window techniques. The robust feature representation abilities of the Video Swin Transformer [9] enable effective management of correlations between temporal and spatial dimensions. Nevertheless, the transmission of 3D tokens presents a considerable bandwidth challenge for wireless communication.

Figure 1: The architecture of the proposed VDJSCC scheme, where ST/TT refer to spatial/temporal Transformer, respectively.

To address the aforementioned challenges, this paper presents a novel video DeepJSCC scheme (VDJSCC) that incorporates spatial-temporal attention to encode and transmit video frames. We adopt the Vision Transformer to capture semantic information for video transmission. The contributions can be summarized as follows:

  • We introduce a novel video DeepJSCC scheme, referred to as VDJSCC, which utilizes the multi-scale vision Transformer for encoding spatial-temporal features. Within the VDJSCC model, we incorporate a spatial-temporal separation attention mechanism to capture comprehensive semantic information.

  • In order to optimize computational efficiency and reduce bandwidth burden, we have developed a dynamic token selection module, which identifies and masks tokens with lower semantic importance. This module also adjusts the encoding length based on the token keep ratio, enabling content-adaptive variable-length coding.

  • Our experimental findings demonstrate that the proposed VDJSCC scheme outperforms traditional methods, such as H.264 combined with LDPC and digital modulation schemes, across established performance metrics including peak signal-to-noise ratio (PSNR) and multiscale structural similarity index (MS-SSIM). Furthermore, VDJSCC offers significant advantages in terms of transmission resource savings.

II The proposed VDJSCC scheme

In this section, we will first present the overall structure of the proposed VDJSCC scheme. After that, the detailed content of each module will be introduced separately.

II-A The Overall Architecture of VDJSCC

The architecture of the proposed video wireless transmission scheme, denoted as VDJSCC, is illustrated in Fig. 1. We consider the wireless transmission of videos over the additive white Gaussian noise (AWGN) channel. Let $T$ denote the number of frames, $C$ the number of color channels, and $H$ and $W$ the height and width of each frame. A video sequence is represented as $\mathbf{X}_{i}=\{\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{T}\}$, where $\mathbf{X}_{i}\in\mathbb{R}^{T\times C\times H\times W}$. The proposed VDJSCC model consists of a pair of trainable encoder $\mathbf{E}_{\phi}$ and decoder $\mathbf{D}_{\theta}$ and a non-trainable physical channel, where $\phi$ and $\theta$ are the parameters of the encoder and decoder, respectively.

Inspired by the patch embedding in ViT [8], the video vision Transformer (ViViT) [10] first used tubelet embedding to extract non-overlapping spatial-temporal tubes from the input video sequence. The video sequence $\mathbf{X}_{i}$ is initially split into video tubes. These tubes are then flattened and converted to tokens $\mathbf{z}\in\mathbb{R}^{n_{t}\times n_{h}\times n_{w}\times K}$ through a trainable linear projection, where $K$ is the hidden dimension. For a video tube of dimension $t\times h\times w$, we have $n_{t}=\lfloor T/t\rfloor$, $n_{h}=\lfloor H/h\rfloor$, and $n_{w}=\lfloor W/w\rfloor$ tokens extracted along the temporal, height, and width dimensions, respectively. The tokens $\mathbf{z}$ are processed by the spatial Transformer (ST) and the temporal Transformer (TT) in sequence, resulting in the feature $\mathbf{f}\in\mathbb{R}^{n_{t}\times(n_{h}\times n_{w})\times K}$. This approach allows for the effective capture of multi-scale features aimed at enhancing the representation of details. The down-scaled feature $\mathbf{f}_{s}\in\mathbb{R}^{n_{t}\times(\frac{n_{h}}{2}\times\frac{n_{w}}{2})\times 2K}$ is obtained through a patch merging process, followed by further processing with ST and TT. Ultimately, the outputs of the two branches are aggregated by averaging, resulting in the feature $\mathbf{s}\in\mathbb{R}^{n_{t}\times(n_{h}\times n_{w})\times K}$.
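For concreteness, a minimal PyTorch sketch of the tubelet embedding under the shapes defined above is given below; the module name and the use of a strided 3D convolution are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    def __init__(self, in_channels=3, t=2, h=16, w=16, K=768):
        super().__init__()
        # A 3D convolution with kernel size equal to stride implements the
        # non-overlapping tube extraction plus the trainable linear projection.
        self.proj = nn.Conv3d(in_channels, K, kernel_size=(t, h, w), stride=(t, h, w))

    def forward(self, x):                        # x: (B, C, T, H, W)
        z = self.proj(x)                         # (B, K, n_t, n_h, n_w)
        B, K, n_t, n_h, n_w = z.shape
        # Flatten the spatial grid: tokens shaped (B, n_t, n_h * n_w, K).
        return z.permute(0, 2, 3, 4, 1).reshape(B, n_t, n_h * n_w, K)

# Example: 16 frames of 224x224 RGB video -> 8 temporal slots of 14*14 = 196 tokens each.
video = torch.randn(1, 3, 16, 224, 224)
tokens = TubeletEmbedding()(video)               # shape (1, 8, 196, 768)
```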

In order to reduce the number of tokens to be transmitted, we design a dynamic token selection module, which masks tokens of lower semantic importance at a given token keep ratio $\gamma$. We denote the pruned tokens to be transmitted as $\tilde{\mathbf{s}}$, and the mask matrix as $\Omega\in\{0,1\}^{M}$, where $M$ is the number of video tubes. Additionally, a power normalization operation ensures that $\tilde{\mathbf{s}}$ satisfies the average power constraint before transmission over the channel. The encoder can be formulated as:

$(\tilde{\mathbf{s}},\Omega)=\mathbf{E}_{\phi}(\mathbf{X}_{i},\gamma).$ (1)

The wireless channel can be formulated as $\hat{\mathbf{s}}=h\tilde{\mathbf{s}}+\mathbf{n}$, where $h$ denotes the channel gain coefficient and $\mathbf{n}\sim\mathcal{CN}(0,\sigma^{2}\mathbf{I})$ denotes independent and identically distributed (i.i.d.) AWGN samples with power $\sigma^{2}$. Assuming the dimension of the original data is $N=T\times C\times H\times W$, the channel bandwidth ratio (CBR) is defined as $R=(\gamma\times K)/N$.
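For illustration, the following sketch shows the average-power normalization, an AWGN channel with $h=1$, and the CBR expression stated above, assuming PyTorch; real-valued symbols are used here for simplicity, whereas the paper's channel model assumes complex Gaussian noise.

```python
import torch

def power_normalize(s):
    # Scale the pruned tokens so the average per-symbol power equals 1.
    k = s.numel()
    return s * (k ** 0.5) / s.flatten().norm()

def awgn(s, snr_db):
    # s_hat = h * s_tilde + n with h = 1; noise power sigma^2 is set from the
    # target SNR, given unit signal power after normalization.
    sigma2 = 10.0 ** (-snr_db / 10.0)
    return s + torch.randn_like(s) * sigma2 ** 0.5

# Channel bandwidth ratio R = (gamma * K) / N, transcribing the definition above.
T, C, H, W, K, gamma = 16, 3, 224, 224, 768, 0.8
N = T * C * H * W
R = (gamma * K) / N
```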

At the receiver, the VDJSCC decoder $\mathbf{D}_{\theta}$ expands the received tokens $\hat{\mathbf{s}}$ with zero elements to maintain dimensional consistency. Finally, the TT and ST are utilized to reconstruct the input video sequence $\hat{\mathbf{X}}_{i}\in\mathbb{R}^{T\times C\times H\times W}$, which can be formulated as:

$\hat{\mathbf{X}}_{i}=\mathbf{D}_{\theta}(\hat{\mathbf{s}},\Omega,\gamma).$ (2)

The VDJSCC model is trained in an end-to-end manner. We optimize the model with the mean square error (MSE) between $\mathbf{X}_{i}$ and $\hat{\mathbf{X}}_{i}$. The training loss function is formulated as:

$\mathcal{L}_{\phi,\theta,\gamma}=\left\|\mathbf{X}_{i}-\hat{\mathbf{X}}_{i}\right\|_{2}^{2}.$ (3)

The optimal model parameters can be obtained by:

$\underset{\phi,\theta,\gamma}{\mathrm{argmin}}\ \mathcal{L}_{\phi,\theta,\gamma}\quad \mathrm{s.t.}\quad r(\tilde{\mathbf{s}})+r(\Omega)\leq\Gamma,$ (4)

where $r(\tilde{\mathbf{s}})$ and $r(\Omega)$ denote the coding rates of the tokens and the mask matrix, respectively, and $\Gamma$ denotes a given transmission coding rate budget.
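To make the end-to-end objective concrete, a minimal training-step sketch under the MSE loss of Eq. (3) is shown below; `encoder`, `decoder`, and `channel` are hypothetical stand-ins for the VDJSCC modules and the AWGN channel, and the rate constraint of Eq. (4) is imposed implicitly through the keep ratio $\gamma$.

```python
import torch
import torch.nn.functional as F

def train_step(encoder, decoder, channel, optimizer, X, gamma):
    s_tilde, mask = encoder(X, gamma)     # Eq. (1): encode and prune tokens at ratio gamma
    s_hat = channel(s_tilde)              # AWGN corruption of the transmitted tokens
    X_hat = decoder(s_hat, mask, gamma)   # Eq. (2): zero-fill masked positions and reconstruct
    loss = F.mse_loss(X_hat, X)           # Eq. (3): pixel-wise mean square error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```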

In the video reconstruction task, accounting for temporal correlations is essential to minimize redundant information. The proposed VDJSCC scheme effectively captures multi-scale spatial-temporal representations, encompassing temporal information over long-term frames and spatial information within individual frames. Since multi-scale feature extraction demands more computational resources, this paper restricts itself to two feature scales. Additionally, the dynamic token selection block generates an adaptive mask matrix $\Omega$ based on the video content to discard less significant tokens, thereby conserving transmission resources. By employing these approaches, VDJSCC effectively reduces information redundancy and saves transmission resources.

II-B Spatial-Temporal Transformer Module

Figure 2: The structure of spatial-temporal Transformer encoder.

1) Spatial-Temporal Transformer Encoder and Decoder

As shown in Fig. 2, the spatial-temporal Transformer encoder consists of two separate Transformer encoders, the ST encoder and the TT encoder. First, a spatial position embedding is added to retain spatial positional information. After the $L$-layer ST, we rearrange the tokens from $n_{t}\times(n_{h}\times n_{w})\times K$ to $(n_{h}\times n_{w})\times n_{t}\times K$ so that the model pays more attention to temporal connections. For the TT, a temporal position embedding is likewise added to capture temporal positional information. Finally, the tokens are processed through the $L$-layer TT to obtain the final output.

In the spatial-temporal Transformer encoder, each layer $l$ contains multi-head self-attention (MSA), layer normalization (LN), and multilayer perceptron (MLP) blocks. Taking the feature $\mathbf{f}$ as an example, this process can be written as follows:

$\mathbf{y}^{l}=\mathrm{MSA}\left(\mathrm{LN}\left(\mathbf{f}^{l}\right)\right)+\mathbf{f}^{l},$ (5)
$\mathbf{f}^{l+1}=\mathrm{MLP}\left(\mathrm{LN}\left(\mathbf{y}^{l}\right)\right)+\mathbf{y}^{l}.$ (6)

In the MSA, we perform the attention operation for each head as follows:

$\mathrm{Attention}(\mathrm{Q},\mathrm{K},\mathrm{V})=\mathrm{Softmax}\left(\frac{\mathrm{Q}\mathrm{K}^{\top}}{\sqrt{d_{k}}}\right)\mathrm{V}.$ (7)
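A minimal sketch of one such layer (Eqs. (5)-(7)) and of the spatial-to-temporal token rearrangement described above is shown below, assuming PyTorch; the layer widths are illustrative, and position embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, K=768, heads=12, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(K)
        self.attn = nn.MultiheadAttention(K, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(K)
        self.mlp = nn.Sequential(nn.Linear(K, mlp_ratio * K), nn.GELU(),
                                 nn.Linear(mlp_ratio * K, K))

    def forward(self, f):                                    # f: (batch, tokens, K)
        x = self.ln1(f)
        y = self.attn(x, x, x, need_weights=False)[0] + f    # Eq. (5)
        return self.mlp(self.ln2(y)) + y                     # Eq. (6)

# Spatial attention runs over the n_h*n_w axis per temporal slot; temporal
# attention runs over the n_t axis per spatial location.
st, tt = Block(), Block()
B, n_t, n_s, K = 1, 8, 196, 768
z = torch.randn(B, n_t, n_s, K)
f = st(z.reshape(B * n_t, n_s, K)).reshape(B, n_t, n_s, K)
f = f.permute(0, 2, 1, 3).reshape(B * n_s, n_t, K)           # rearrange for the TT
f = tt(f).reshape(B, n_s, n_t, K).permute(0, 2, 1, 3)        # back to (B, n_t, n_s, K)
```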

In the spatial-temporal Transformer decoder, we incorporate both spatial and temporal position embeddings to capture positional information. Initially, the tokens $\hat{\mathbf{s}}$ and the corresponding downscaled features are processed by the TT, after which they are rearranged and fed into the ST. Subsequently, the two multi-scale branches are merged and processed by another TT and ST, resulting in the reconstructed tokens $\hat{\mathbf{z}}$.

2) Multi-Scale Transformer Encoder and Decoder

In order to enhance the feature representation capability, we downsample the video frames to obtain multi-scale features. Inspired by the downsampling method in the Swin Transformer [11], we use patch merging to downsample the video frames. The patch merging layer concatenates the features of each $2\times 2$ group of neighboring patches and applies a linear layer to the $4K$-dimensional concatenated features. This reduces the number of tokens by a factor of $2\times 2=4$ (a $2\times$ downsampling of resolution), and the output dimension is set to $2K$. Additionally, a patch reverse merging process reverts the downsampled tokens to their original dimensions, thereby preserving dimensional consistency between the input and output tokens. At the receiver, we likewise use patch merging to downsample the tokens $\hat{\mathbf{s}}$ and employ patch reverse merging to restore the original dimensions.
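A minimal sketch of the $2\times 2$ patch merging described above is given below (following the Swin-style layer); the reverse merging is the analogous linear expansion back to the original grid and is omitted here.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    def __init__(self, K=768):
        super().__init__()
        self.norm = nn.LayerNorm(4 * K)
        self.reduce = nn.Linear(4 * K, 2 * K, bias=False)    # 4K -> 2K per merged token

    def forward(self, f, n_h, n_w):              # f: (B, n_t, n_h*n_w, K)
        B, n_t, _, K = f.shape
        f = f.reshape(B, n_t, n_h, n_w, K)
        # Concatenate the features of each 2x2 group of neighboring patches.
        merged = torch.cat([f[:, :, 0::2, 0::2], f[:, :, 1::2, 0::2],
                            f[:, :, 0::2, 1::2], f[:, :, 1::2, 1::2]], dim=-1)
        merged = merged.reshape(B, n_t, (n_h // 2) * (n_w // 2), 4 * K)
        return self.reduce(self.norm(merged))    # (B, n_t, n_h/2 * n_w/2, 2K)
```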

Figure 3: The structure of dynamic token selection module.

II-C Dynamic Token Selection Module

In long-term video sequences, the static background in each frame remains largely consistent, leading to substantial redundancy during transmission. Consequently, this background can be regarded as content with lower semantic significance. In contrast, it is essential to prioritize the capture of motion information within the video, as this represents content with higher semantic importance. A token selection network can be developed to dynamically identify and prioritize the transmission of semantically significant content, optimizing transmission resources.

In VDJSCC, we design a dynamic token selection module to reduce this redundancy, as illustrated in Fig. 3. The module dynamically generates a mask matrix $\Omega$ based on the input video content and adjusts the encoding length according to the token keep ratio $\gamma$, enabling content-adaptive variable-length coding. To improve the mask decisions, we take into account both local and global features: the local features contain the information of a specific token, while the global feature captures the context of the whole video sequence. We first utilize an MLP to split the features $\mathbf{s}$ into local features $\mathbf{s}^{local}\in\mathbb{R}^{n_{t}\times(n_{h}\times n_{w})\times K^{\prime}}$ and global features $\mathbf{s}^{global}\in\mathbb{R}^{n_{t}\times 1\times K^{\prime}}$. The process can be formulated as:

$\mathbf{s}^{local}=\mathrm{MLP}(\mathbf{s}),$ (8)
$\mathbf{s}^{global}=\mathrm{Agg}(\mathrm{MLP}(\mathbf{s})),$ (9)

where $\mathrm{Agg}$ is a function that aggregates all tokens and can be simply implemented as average pooling, and $K^{\prime}$ is the dimension associated with token splitting. In this paper, we set $K^{\prime}=K/2$.

Subsequently, we concatenate the local features $\mathbf{s}^{local}$ and global features $\mathbf{s}^{global}$ to obtain the features $\mathbf{S}\in\mathbb{R}^{n_{t}\times(n_{h}\times n_{w})\times K}$,

$\mathbf{S}=\left[\mathbf{s}^{local},\mathbf{s}^{global}\right].$ (10)

Then we feed $\mathbf{S}$ to another MLP to predict the probability of keeping each token, $P_{keep}\in\mathbb{R}^{n_{t}\times(n_{h}\times n_{w})\times 1}$:

$P_{keep}=\mathrm{SoftMax}(\mathrm{MLP}(\mathbf{S})).$ (11)

Ultimately, we compute the mask matrix $\Omega$ from $P_{keep}$ and $\gamma$ to select the more semantically significant tokens from $\mathbf{S}$ for transmission. The process is written as:

$\Omega=\mathbf{S}\odot\mathbb{I}(P_{keep}>1-\gamma),$ (12)

where $\odot$ is element-wise multiplication and $\mathbb{I}$ is the indicator function. Specifically, $\Omega$ is determined by the input video content, with “0” marking discarded features of lower semantic importance and “1” marking retained features of higher semantic importance.

Moreover, the token keep ratio $\gamma$ can be adjusted to control the number of retained features, thereby enabling dynamic control over the encoding rate. A lower $\gamma$ results in a lower coding rate, since more tokens are discarded. At the receiver, zero padding is employed to restore the tokens to their original dimensions, yielding the feature $\hat{\mathbf{s}}$.
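A minimal sketch of the selection logic in Eqs. (8)-(12) is given below, assuming PyTorch; the MLP widths are illustrative, and the threshold test of Eq. (12) is replaced here by a top-$\gamma$ selection over the predicted keep probabilities so that exactly a $\gamma$ fraction of tokens is retained, a simplification rather than the exact trained behavior.

```python
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    def __init__(self, K=768):
        super().__init__()
        self.split = nn.Linear(K, K // 2)        # K' = K/2 local/global feature width
        self.score = nn.Sequential(nn.Linear(K, K // 4), nn.GELU(),
                                   nn.Linear(K // 4, 1))

    def forward(self, s, gamma):                 # s: (B, n_t, n_s, K)
        local_f = self.split(s)                                        # Eq. (8)
        global_f = local_f.mean(dim=2, keepdim=True)                   # Eq. (9): average pooling
        S = torch.cat([local_f, global_f.expand_as(local_f)], dim=-1)  # Eq. (10)
        p_keep = torch.softmax(self.score(S).squeeze(-1), dim=-1)      # Eq. (11)
        # Keep the gamma fraction of tokens with the highest keep probability
        # (a top-k variant of the indicator test in Eq. (12)).
        n_keep = max(1, int(gamma * p_keep.shape[-1]))
        idx = p_keep.topk(n_keep, dim=-1).indices
        mask = torch.zeros_like(p_keep).scatter_(-1, idx, 1.0)
        s_tilde = s * mask.unsqueeze(-1)         # masked tokens are zeroed (not transmitted)
        # At the receiver, the zeroed positions simply remain zero before decoding.
        return s_tilde, mask
```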

III Experiments

In this section, we first introduce the experimental settings and then present the performance of the proposed VDJSCC scheme.

III-A Experimental Setup

1) Datasets

The VDJSCC model is trained and evaluated on the UCF101 dataset [12], which consists of 13,320 video clips across 101 action classes. All clips have a fixed frame rate of 25 fps and a resolution of $320\times 240$. In this paper, we use the first train/test list to split the dataset into a training set and a test set. For each video frame, we first resize the frame to $300\times 225$ and then randomly crop it to $224\times 224$. Additionally, we randomly select 16 consecutive frames of a video to serve as the input to the VDJSCC model.
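As a concrete example of this preprocessing, the sketch below resizes each frame, applies one shared random $224\times 224$ crop across a clip of 16 consecutive frames, and stacks the result, assuming torchvision; the frame loader itself is left as a hypothetical input.

```python
import random
import torch
from torchvision import transforms
from torchvision.transforms import functional as TF

resize = transforms.Resize((225, 300))           # (height, width) = 225 x 300

def sample_clip(frames, clip_len=16, crop=224):
    # Pick 16 consecutive frames at a random start position.
    start = random.randint(0, len(frames) - clip_len)
    clip = [resize(TF.to_tensor(f)) for f in frames[start:start + clip_len]]
    # Share one crop window across the clip so frames stay spatially aligned.
    i, j, h, w = transforms.RandomCrop.get_params(clip[0], (crop, crop))
    return torch.stack([TF.crop(f, i, j, h, w) for f in clip])   # (16, 3, 224, 224)
```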

2) Training Details

In all experiments, the image patch size ($h$ and $w$) is set to 16, the frame patch size ($t$) is set to 2, and the channel dimension $K$ is set to 768. The spatial-temporal Transformer encoder, with a depth of $L=5$ and 12 heads, is used to extract high-dimensional features via multi-head self-attention. In the dynamic token selection module, the token keep ratio $\gamma$ is set to 0.8 during training. Furthermore, we train a separate model at each channel SNR and evaluate it at the same SNR, with the SNR drawn from $[1,4,7,10,13]$ dB. For each model, we use the Adam optimizer [13] with a learning rate of $10^{-4}$. The batch size is set to 4, and it takes about 1.5 weeks to train each model on a single RTX 3090 GPU.
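A minimal sketch tying these hyperparameters to the training step outlined in Sec. II-A is given below; `VDJSCC`, `AWGNChannel`, `loader`, and `train_step` are hypothetical placeholders rather than the actual implementation.

```python
import torch

snr_db = 13                                      # one model per training SNR in {1, 4, 7, 10, 13} dB
gamma = 0.8                                      # token keep ratio during training
model = VDJSCC(patch_hw=16, patch_t=2, K=768, depth=5, heads=12)   # hypothetical constructor
channel = AWGNChannel(snr_db)                    # hypothetical AWGN channel module
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for clips in loader:                             # batches of 4 clips, 16 frames each
    train_step(model.encoder, model.decoder, channel, optimizer, clips, gamma)
```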

3) Evaluation Metrics

In this paper, we quantify the end-to-end video transmission performance of the proposed VDJSCC model using the pixel-wise metric PSNR and the perceptual metric MS-SSIM [14]. MS-SSIM evaluates similarity at multiple scales, providing more comprehensive similarity information.
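For reference, PSNR can be computed directly from the pixel-wise MSE as sketched below; MS-SSIM would be computed analogously with a standard multi-scale SSIM implementation (not shown).

```python
import torch

def psnr(x, x_hat, max_val=255.0):
    # PSNR = 10 * log10(MAX^2 / MSE), with frames assumed to lie in [0, 255].
    mse = torch.mean((x.float() - x_hat.float()) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```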

Figure 4: Performance comparison of VDJSCC to other schemes over the AWGN channel (CBR=0.031): (a) PSNR, (b) MS-SSIM.
Figure 5: Examples of visual comparison under the AWGN channel at SNR=13dB.
Figure 6: Visualization of token selection under the AWGN channel at SNR=13dB.
Figure 7: Performance for different channel bandwidth ratios (CBR) over the AWGN channel at SNR=13dB, SNR=7dB, and SNR=1dB.

4) Comparison Schemes

We compare the performance of the proposed VDJSCC model with classical coded video transmission schemes and a deep-learning-based JSCC model. Specifically, we utilize the standard video codec H.264 [15] for source coding and practical LDPC codes [16] for channel coding. FFmpeg [17] is used for H.264 encoding.

In addition, we also compare the model with a state-of-the-art DeepJSCC model named WITT [5]. WITT uses Swin Transformers as the backbone to extract long-range information and shows strong performance in image reconstruction tasks. In our comparison with WITT, we first downsample the video at intervals of 10 frames and save the frames as images, and then feed all images into the WITT model.

III-B Experimental Results

1) Analysis of Video Reconstruction Performance Under Different SNR.

Fig. 4 shows the PSNR and MS-SSIM performance versus SNR over the AWGN channel. For the H.264+LDPC, we compare the impact of different modulation methods and code rates. The proposed VDJSCC model significantly outperforms classical video transmission schemes, avoiding the cliff-effect. When the SNR is 1dB, VDJSCC achieves a PSNR of 31.17dB, compared to WITT’s 29.07dB, indicating a notable improvement in reconstruction quality. However, WITT slightly surpasses VDJSCC in PSNR when the SNR exceeds 10dB due to its focus on individual frame processing, which neglects temporal correlations and increases computational resources. In terms of MS-SSIM, VDJSCC consistently outperforms other schemes, particularly at low SNR levels, demonstrating superior perceptual quality with higher visual similarity.

Fig. 5 shows a visual comparison between VDJSCC and the classical video transmission scheme under the AWGN channel at SNR=13dB. The proposed VDJSCC preserves clearer details and achieves higher reconstruction quality.

2) Analysis of Video Compression Performance.

Next, we explore the impact of the CBR on the performance of the proposed VDJSCC scheme. In this part, we set different token keep ratios to alter the CBR; the CBR equals 0.031 when the token keep ratio $\gamma=1.0$. Fig. 6 provides visual results of token selection for the same input video under the AWGN channel at SNR=13dB. The model dynamically generates the mask matrix based on the input video and the token keep ratio, effectively masking less semantically important background information while retaining key portrait details. These results demonstrate the model's ability to achieve high-quality video reconstruction.

Furthermore, Fig. 7 presents the PSNR and MS-SSIM performance for different CBRs. The models are trained under the AWGN channel at SNR=13dB, SNR=7dB, and SNR=1dB. Both PSNR and MS-SSIM increase as the CBR increases. Specifically, with a CBR of 0.019, the PSNR exceeds 30dB, indicating good video reconstruction quality even with nearly half of the tokens masked. Notably, token selection allows for dynamic adjustment of the coding length based on the token keep ratio, enabling variable-length encoding and efficient bandwidth utilization.

3) Ablation Study and Computational Cost Comparison.

Last but not least, we present the results of an ablation study to evaluate the influence of the different modules. Table I reports the performance of the base models, all trained under an AWGN channel at SNR=13dB with CBR=0.031. For models with the token selection module, the token keep ratio $\gamma$ is set to 0.8. “VDJSCC w/o multi-scale” refers to the model without patch merging, where the video frames are fed directly into a 10-layer spatial-temporal Transformer encoder. “VDJSCC w/o token selection” refers to the model without dynamic token selection, where all tokens are transmitted. The comparative results clearly show that the proposed multi-scale method improves the reconstruction quality by 3.52dB. The token selection module slightly reduces PSNR by 0.62dB, an acceptable trade-off given its bandwidth savings.

With respect to complexity, the average inference time of the proposed VDJSCC is approximately 77.42ms on an RTX 3090 GPU. By comparison, the classical H.264 scheme runs at speeds ranging from 1fps to over 100fps depending on coding settings, with typical encoding rates between 10 and 20fps. Although the multi-scale method requires more computational resources, it yields an increase in PSNR of approximately 10%. The ablation study indicates that the proposed VDJSCC method achieves high-quality video reconstruction while optimizing bandwidth and computational resources.

TABLE I: Ablation study results. The base models are all trained under the AWGN channel at SNR=13dB, CBR=0.031, and $\gamma=0.8$.
Base Model | PSNR (dB) | MS-SSIM | Inference time (ms) | FLOPs (G)
VDJSCC | 34.54 | 0.9906 | 77.42 | 321.8
w/o multi-scale | 31.02 (-3.52) | 0.9796 (-0.0110) | 60.05 | 252.7
w/o token selection | 35.16 (+0.62) | 0.9923 (+0.0017) | 78.09 | 320.2
w/o multi-scale & token selection | 32.42 (-2.12) | 0.9843 (-0.0063) | 61.89 | 251.2

IV Conclusion

This paper has proposed a novel DeepJSCC method to achieve end-to-end wireless video transmission. The proposed VDJSCC scheme utilizes a multi-scale spatial-temporal Transformer encoder to explore multi-scale semantic information and obtain a new spatial-temporal representation. Moreover, a dynamic token selection algorithm has been employed to mask tokens of lower semantic importance, facilitating content-adaptive variable-length encoding. Extensive experimental results have shown that the proposed VDJSCC scheme outperforms the classical separation-based video transmission scheme H.264+LDPC by a large margin and overcomes the cliff effect. We have also shown that the multi-scale method greatly enhances video reconstruction quality. Furthermore, the token selection module saves bandwidth resources and slightly reduces the model's inference time while maintaining reconstruction quality.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 62101307, U23B2052, and 62071061, and in part by the Fundamental Research Funds for the Central Universities under Grant 2023RC78.

References

  • [1] V. Kostina and S. Verdú, “Lossy joint source-channel coding in the finite blocklength regime,” IEEE Transactions on Information Theory, vol. 59, no. 5, pp. 2545–2575, 2013.
  • [2] J. Dai, S. Wang, K. Tan, Z. Si, X. Qin, K. Niu, and P. Zhang, “Nonlinear transform source-channel coding for semantic communications,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 8, pp. 2300–2316, 2022.
  • [3] N. Farsad, M. Rao, and A. Goldsmith, “Deep learning for joint source-channel coding of text,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2326–2330.
  • [4] Z. Weng and Z. Qin, “Semantic communication systems for speech transmission,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 8, pp. 2434–2444, 2021.
  • [5] K. Yang, S. Wang, J. Dai, K. Tan, K. Niu, and P. Zhang, “WITT: A wireless image transmission transformer for semantic communications,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
  • [6] T.-Y. Tung and D. Gündüz, “DeepWiVe: Deep-learning-aided wireless video transmission,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 9, pp. 2570–2583, 2022.
  • [7] Z. Liang, X. Dai, Y. Wu, X. Jin, and J. Shen, “Multi-granularity context network for efficient video semantic segmentation,” IEEE Transactions on Image Processing, vol. 32, pp. 3163–3175, 2023.
  • [8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in 2021 International Conference on Learning Representations (ICLR), 2021.
  • [9] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video Swin Transformer,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 3192–3201.
  • [10] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “ViViT: A video vision transformer,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 6816–6826.
  • [11] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin Transformer: Hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10002.
  • [12] K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human action classes from videos in the wild,” Center for Research in Computer Vision, vol. 2, no. 11, pp. 1–7, 2012.
  • [13] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), San Diego, CA, 2015.
  • [14] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2. IEEE, 2003, pp. 1398–1402.
  • [15] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
  • [16] T. Richardson and S. Kudekar, “Design of low-density parity check codes for 5G New Radio,” IEEE Communications Magazine, vol. 56, no. 3, pp. 28–34, 2018.
  • [17] S. Tomar, “Converting video formats with ffmpeg,” Linux journal, vol. 2006, no. 146, p. 10, 2006.