A Multi-Scale Spatial-Temporal Network for Wireless Video Transmission
Abstract
Deep joint source-channel coding (DeepJSCC) has shown promise in wireless transmission of text, speech, and images within the realm of semantic communication. However, wireless video transmission presents greater challenges due to the difficulty of extracting and compactly representing both spatial and temporal features, as well as its significant bandwidth and computational resource requirements. In response, we propose a novel video DeepJSCC (VDJSCC) approach to enable end-to-end video transmission over a wireless channel. Our approach involves the design of a multi-scale vision Transformer encoder and decoder to effectively capture spatial-temporal representations over long-term frames. Additionally, we propose a dynamic token selection module to mask less semantically important tokens from spatial or temporal dimensions, allowing for content-adaptive variable-length video coding by adjusting the token keep ratio. Experimental results demonstrate the effectiveness of our VDJSCC approach compared to digital schemes that use separate source and channel codes, as well as other DeepJSCC schemes, in terms of reconstruction quality and bandwidth reduction.
Index Terms:
Video deep joint source-channel coding, multi-scale Transformer, spatial-temporal network, dynamic token selection.

I Introduction
With the rapid growth of video traffic such as video conferencing and virtual reality, wireless video transmission technology has gained widespread attention for handling large volumes of data. Digital wireless video transmission schemes are based on separate source and channel coding. Source codes, such as the industry-standard video codecs H.264/H.265, attempt to eliminate redundant information, while channel codes, such as low-density parity-check (LDPC) codes, add redundant bits to overcome distortions in the imperfect wireless channel. However, the separate coding scheme is sub-optimal for video transmission since it suffers from the cliff effect and can hardly meet the low-latency requirements of video applications.
To address the aforementioned issues, joint source-channel coding (JSCC) has been proposed to achieve system-level optimality in the coding process [1]. Owing to the rapid advancement of deep learning (DL) in recent years, an increasing number of DL models are being used in JSCC systems to enhance compact feature representation and noise-resilience capabilities. The deep joint source-channel coding (DeepJSCC) scheme has demonstrated significant success in various wireless data transmission tasks, including image [2], text [3], and speech [4]. Nevertheless, there is still limited research on the wireless video transmission task. The core challenge is how to capture the dynamic temporal features across frames, instead of merely considering the static spatial features within a frame.
In order to facilitate the wireless transmission of video, image DeepJSCC methods [2, 5] can be applied to process each frame individually. However, this approach overlooks the temporal correlations among sequential frames and applies an identical compression ratio to each frame, resulting in unnecessary redundancy. Alternatively, previous research [6, 7] has focused on separately encoding key frames and residual motion information in order to account for temporal correlations. For instance, MGCNet [7] extracts and aggregates context features at per-frame, short-term, and long-term granularities. Nevertheless, these approaches often rely on computing motion information with optical flow algorithms, which incur heavy computational complexity and limit the methods' practical generalization.
Recently, the Vision Transformer (ViT) [8] has demonstrated promise for image analysis tasks. Additionally, the Video Swin Transformer [9] has expanded the applicability of local attention computation from spatial to spatial-temporal domains by incorporating 3D patch and shifted window techniques. The robust feature representation abilities of the Video Swin Transformer [9] enable effective management of correlations between temporal and spatial dimensions. Nevertheless, the transmission of 3D tokens presents a considerable bandwidth challenge for wireless communication.

To address the aforementioned challenges, this paper presents a novel video DeepJSCC scheme (VDJSCC) that incorporates spatial-temporal attention to encode and transmit video frames. We adopt the Vision Transformer to capture semantic information for video transmission. The contributions can be summarized as follows:
• We introduce a novel video DeepJSCC scheme, referred to as VDJSCC, which utilizes the multi-scale vision Transformer for encoding spatial-temporal features. Within the VDJSCC model, we incorporate a spatial-temporal separation attention mechanism to capture comprehensive semantic information.
• In order to optimize computational efficiency and reduce bandwidth burden, we have developed a dynamic token selection module, which identifies and masks tokens with lower semantic importance. This module also adjusts the encoding length based on the token keep ratio, enabling content-adaptive variable-length coding.
• Our experimental findings demonstrate that the proposed VDJSCC scheme outperforms traditional methods, such as H.264 combined with LDPC and digital modulation schemes, across established performance metrics including peak signal-to-noise ratio (PSNR) and multiscale structural similarity index (MS-SSIM). Furthermore, VDJSCC offers significant advantages in terms of transmission resource savings.
II The proposed VDJSCC scheme
In this section, we will first present the overall structure of the proposed VDJSCC scheme. After that, the detailed content of each module will be introduced separately.
II-A The Overall Architecture of VDJSCC
The architecture of the proposed video wireless transmission scheme, denoted as VDJSCC, is illustrated in Fig. 1. We consider the wireless transmission of videos over the additive white Gaussian noise (AWGN) channel. Denote $N$ as the number of frames, $C$ as the number of color channels, and $H$ and $W$ as the height and the width of each frame. The video sequence can be represented as $\mathbf{x} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$, where $\mathbf{x}_i \in \mathbb{R}^{C \times H \times W}$ denotes the $i$-th frame. The proposed VDJSCC model includes a pair of trainable encoder $f_{\boldsymbol{\theta}}$ and decoder $g_{\boldsymbol{\phi}}$ and a non-trainable physical channel, where $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ are the parameters of the encoder and decoder, respectively.
Inspired by the patch embedding in ViT [8], the video vision Transformer (ViViT) [10] first used tubelet embedding to extract non-overlapping spatial-temporal tubes from the input video sequence. The video sequence is initially split into video tubes. These tubes are then flattened and converted to tokens $\mathbf{z}_0 \in \mathbb{R}^{L \times D}$ through a trainable linear projection, where $D$ is the hidden dimension. For a video tube of dimension $t \times h \times w$, $n_t = \lfloor N/t \rfloor$, $n_h = \lfloor H/h \rfloor$, and $n_w = \lfloor W/w \rfloor$ tokens are extracted from the temporal, height, and width dimensions, respectively, giving $L = n_t \cdot n_h \cdot n_w$ tokens in total. The tokens $\mathbf{z}_0$ are processed by a spatial Transformer (ST) and a temporal Transformer (TT) in sequence, resulting in the feature $\mathbf{z}_1$. This approach allows for the effective capture of multi-scale features aimed at enhancing the representation of details. A down-scaled feature $\mathbf{z}_2$ is obtained through a patch merging process, followed by further processing with ST and TT. Ultimately, the outputs from these separate branches are aggregated through averaging, resulting in the feature $\mathbf{z}$.
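As a concrete illustration, the tubelet embedding can be realized with a 3D convolution whose kernel and stride equal the tube size; the sketch below is a minimal PyTorch version under that assumption (the module name and default sizes are ours, chosen to match the settings reported later, not code from the paper).

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Split a video into non-overlapping t x h x w tubes and project each to a D-dim token."""
    def __init__(self, in_channels=3, tube_size=(2, 16, 16), embed_dim=768):
        super().__init__()
        # A Conv3d with kernel_size == stride extracts non-overlapping tubes
        # and applies the trainable linear projection in one step.
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=tube_size, stride=tube_size)

    def forward(self, x):
        # x: (B, C, N, H, W) -> (B, D, n_t, n_h, n_w)
        x = self.proj(x)
        _, _, n_t, n_h, n_w = x.shape
        # Flatten the tube grid into a token sequence: (B, n_t * n_h * n_w, D)
        return x.flatten(2).transpose(1, 2), (n_t, n_h, n_w)

# Example: 16 frames of 224x224 RGB -> 8 * 14 * 14 = 1568 tokens of dimension 768
video = torch.randn(1, 3, 16, 224, 224)
tokens, grid = TubeletEmbedding()(video)
print(tokens.shape, grid)  # torch.Size([1, 1568, 768]) (8, 14, 14)
```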
In order to reduce the number of tokens to be transmitted, we design a dynamic token selection module, which masks tokens of less semantic importance at a given token keep ratio $\rho$. We denote the pruned tokens that need to be transmitted as $\mathbf{s}$, and the mask matrix as $\mathbf{m} \in \{0, 1\}^{L}$, where $L$ is the number of video tubes. Additionally, a power normalization operation enables $\mathbf{s}$ to satisfy the average power constraint before transmission into the channel. The encoder can be formulated as:
$[\mathbf{s}, \mathbf{m}] = f_{\boldsymbol{\theta}}(\mathbf{x}). \qquad (1)$
The wireless channel can be formulated as $\hat{\mathbf{s}} = h \cdot \mathbf{s} + \mathbf{n}$, where $h$ denotes the channel gain coefficient, which equals $1$ for the AWGN channel. In this formula, $\mathbf{n}$ denotes independent and identically distributed (i.i.d.) AWGN samples with power $\sigma^2$. Assuming that the dimension of the original data is $n = N \times C \times H \times W$ and the number of transmitted channel symbols is $k$, the channel bandwidth ratio (CBR) is $R = k/n$.
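For clarity, a minimal sketch of the average power normalization and the AWGN channel described above is given below; deriving the noise power from the SNR and the function names are our assumptions for illustration, not code from the paper.

```python
import torch

def power_normalize(s: torch.Tensor) -> torch.Tensor:
    """Scale the symbol tensor s so that the average power per complex channel use is 1."""
    k = s.numel() // 2                      # real/imag pairs form k complex channel uses
    power = torch.sum(s ** 2)
    return s * torch.sqrt(k / power)

def awgn_channel(s: torch.Tensor, snr_db: float) -> torch.Tensor:
    """Pass symbols through an AWGN channel: s_hat = s + n, with noise power set by the SNR."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    sigma2 = 1.0 / snr_linear               # average signal power is 1 after normalization
    noise = torch.randn_like(s) * (sigma2 / 2.0) ** 0.5  # split power over real/imag parts
    return s + noise

s = power_normalize(torch.randn(1, 1254, 32))
s_hat = awgn_channel(s, snr_db=13.0)
```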
At the receiver, the VDJSCC decoder expands the received tokens $\hat{\mathbf{s}}$ with zero elements according to the mask matrix $\mathbf{m}$ to maintain dimensional consistency. Finally, the TT and ST are utilized to reconstruct the input video sequence as $\hat{\mathbf{x}}$, which can be formulated as:
$\hat{\mathbf{x}} = g_{\boldsymbol{\phi}}(\hat{\mathbf{s}}, \mathbf{m}). \qquad (2)$
The VDJSCC model is trained in an end-to-end manner. We optimize the model by the mean square error (MSE) between the original video $\mathbf{x}$ and the reconstructed video $\hat{\mathbf{x}}$. The training loss function is formulated as:
$\mathcal{L} = \frac{1}{n} \left\| \mathbf{x} - \hat{\mathbf{x}} \right\|_2^2. \qquad (3)$
The optimal model parameters can be obtained by:
$(\boldsymbol{\theta}^{*}, \boldsymbol{\phi}^{*}) = \arg\min_{\boldsymbol{\theta}, \boldsymbol{\phi}} \mathbb{E}\big[\mathcal{L}(\mathbf{x}, \hat{\mathbf{x}})\big], \qquad (4)$

s.t. $R \le R_{0}$,

where $R$ and $\mathbf{m}$ denote the coding rate of the tokens and the mask matrix, respectively, and $R_{0}$ denotes a given transmission coding rate.
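A minimal sketch of the end-to-end optimization in Eqs. (1)-(4) is given below, reusing the `power_normalize` and `awgn_channel` helpers sketched above; the encoder/decoder call signatures and the simplification of enforcing the rate constraint by fixing the keep ratio are our assumptions.

```python
import torch.nn.functional as F

def train_step(encoder, decoder, optimizer, x, snr_db, keep_ratio=0.8):
    """One end-to-end update: encode, transmit over AWGN, decode, minimize MSE."""
    s, mask = encoder(x, keep_ratio)       # pruned tokens and mask matrix, Eq. (1)
    s = power_normalize(s)                 # satisfy the average power constraint
    s_hat = awgn_channel(s, snr_db)        # non-trainable physical channel
    x_hat = decoder(s_hat, mask)           # reconstruction, Eq. (2)
    loss = F.mse_loss(x_hat, x)            # MSE distortion, Eq. (3)
    optimizer.zero_grad()
    loss.backward()                        # gradients flow through the channel noise addition
    optimizer.step()
    return loss.item()
```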
In the video reconstruction task, accounting for temporal correlations is essential to minimize redundant information. The proposed VDJSCC scheme effectively captures multi-scale spatial-temporal representations, encompassing temporal information over long-term frames and spatial information within individual frames. While multi-scale feature extraction demands more computational resources, this paper focuses on two specific feature scales. Additionally, the dynamic token selection block generates an adaptive mask matrix based on the video content to discard less significant tokens, thereby conserving transmission resources. By employing these approaches, VDJSCC can effectively reduce the information redundancy and save the transmission resources.
II-B Spatial-Temporal Transformer Module

1) Spatial-Temporal Transformer Encoder and Decoder
As shown in Fig. 2, the spatial-temporal Transformer encoder consists of two separate Transformer encoders, the ST encoder and the TT encoder. Firstly, a spatial position embedding is added to retain spatial positional information. After the multi-layer ST, we rearrange the tokens from the spatial-major layout $(n_t, n_h \cdot n_w, D)$ to the temporal-major layout $(n_h \cdot n_w, n_t, D)$ so that the model pays more attention to temporal connections. For the TT, a temporal position embedding is also added to provide temporal positional information. Finally, the tokens are processed through the multi-layer TT to obtain the final output.
In the spatial-temporal Transformer encoder, each layer contains multi-head self-attention (MSA), layer normalization (LN), and multilayer perceptron (MLP) blocks. Taking the feature $\mathbf{z}^{(l-1)}$ output by layer $l-1$ as an example, this process can be written as follows:
$\tilde{\mathbf{z}}^{(l)} = \mathrm{MSA}\big(\mathrm{LN}(\mathbf{z}^{(l-1)})\big) + \mathbf{z}^{(l-1)}, \qquad (5)$

$\mathbf{z}^{(l)} = \mathrm{MLP}\big(\mathrm{LN}(\tilde{\mathbf{z}}^{(l)})\big) + \tilde{\mathbf{z}}^{(l)}. \qquad (6)$
In the MSA, we perform the attention operation for each head as follows:
$\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{Softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V}. \qquad (7)$
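The factorized ST-then-TT processing can be sketched as follows: standard Transformer encoder layers attend over the spatial tokens of each frame, the token grid is then rearranged, and the same type of layers attend over the temporal axis. The reshape convention, layer counts, and the use of `nn.TransformerEncoderLayer` (with position embeddings omitted) are our assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class SpatialTemporalEncoder(nn.Module):
    def __init__(self, dim=768, heads=12, spatial_layers=4, temporal_layers=2):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                                   batch_first=True, norm_first=True)
        self.spatial = nn.TransformerEncoder(layer(), spatial_layers)    # ST, Eqs. (5)-(7)
        self.temporal = nn.TransformerEncoder(layer(), temporal_layers)  # TT, Eqs. (5)-(7)

    def forward(self, tokens, grid):
        # tokens: (B, n_t * n_h * n_w, D); grid = (n_t, n_h, n_w)
        B, _, D = tokens.shape
        n_t, n_h, n_w = grid
        # Spatial attention: the n_h * n_w tokens of each temporal index attend to each other.
        x = tokens.reshape(B * n_t, n_h * n_w, D)
        x = self.spatial(x)
        # Rearrange so that tokens at the same spatial location across time form a sequence.
        x = x.reshape(B, n_t, n_h * n_w, D).permute(0, 2, 1, 3).reshape(B * n_h * n_w, n_t, D)
        # Temporal attention over the n_t temporal positions.
        x = self.temporal(x)
        return x.reshape(B, n_h * n_w, n_t, D).permute(0, 2, 1, 3).reshape(B, n_t * n_h * n_w, D)

# Usage: out = SpatialTemporalEncoder()(torch.randn(1, 8 * 14 * 14, 768), (8, 14, 14))
```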
In the spatial-temporal Transformer decoder, we incorporate both spatial and temporal position embeddings to capture positional information. Initially, the received tokens and the corresponding down-scaled features are processed by the TT, after which they are rearranged and fed into the ST. Subsequently, the two multi-scale branches are merged and processed by another TT and ST, resulting in the reconstructed tokens $\hat{\mathbf{z}}_0$.
2) Multi-Scale Transformer Encoder and Decoder
In order to enhance the feature representation capability, we downsample the video frames to obtain multi-scale features. Inspired by the downsampling method in the Swin Transformer [11], we use patch merging to downsample the video frames. The patch merging layer concatenates the features of each group of $2 \times 2$ neighboring patches and applies a linear layer on the $4D$-dimensional concatenated features. This patch merging process reduces the number of tokens by a factor of $4$ ($2\times$ downsampling of resolution), and the output dimension is set to $2D$. Additionally, the patch reverse merging process reverts the downsampled tokens to their original dimensions, thereby preserving the dimensional consistency between the input and output tokens. At the receiver, we continue to use patch merging to downsample the tokens and employ patch reverse merging to restore the original dimensions.
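A minimal sketch of Swin-style patch merging applied to the spatial token grid of one frame is shown below; the frame-wise 2x2 merging and the class name are our reading of the design, given under that assumption.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Concatenate each 2x2 group of neighboring spatial tokens and project 4D -> 2D."""
    def __init__(self, dim=768):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x, n_h, n_w):
        # x: (B, n_h * n_w, D) spatial tokens of one frame; n_h and n_w assumed even
        B, _, D = x.shape
        x = x.reshape(B, n_h, n_w, D)
        # Gather the four tokens of every 2x2 neighborhood and concatenate along channels.
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)   # (B, n_h/2, n_w/2, 4D)
        x = self.reduction(self.norm(x))                              # (B, n_h/2, n_w/2, 2D)
        return x.reshape(B, (n_h // 2) * (n_w // 2), 2 * D)
```

The patch reverse merging at the receiver can mirror this operation, e.g. a linear projection from $2D$ back to $4D$ followed by rearranging the four channel groups into the original $2 \times 2$ spatial neighborhoods.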

II-C Dynamic Token Selection Module
In long-term video sequences, the static background in each frame remains largely consistent, leading to substantial redundancy during transmission. Consequently, this background can be regarded as content with lower semantic significance. In contrast, it is essential to prioritize the capture of motion information within the video, as this represents content with higher semantic importance. A token selection network can be developed to dynamically identify and prioritize the transmission of semantically significant content, optimizing transmission resources.
In VDJSCC, we design a dynamic token selection module to reduce redundancy, as illustrated in Fig. 3. This module dynamically generates a mask matrix $\mathbf{m}$ based on the input video content and adjusts the encoding length according to the token keep ratio $\rho$, enabling content-adaptive variable-length coding. To enhance the quality of the mask decisions, we take into account the influence of both local and global features. The local features contain the information of a single token, while the global feature contains the context of the whole video sequence. We first utilize an MLP to divide the feature $\mathbf{z}$ into the local features $\mathbf{z}^{\mathrm{local}}$ and the global feature $\mathbf{z}^{\mathrm{global}}$. The process can be formulated as:
$\mathbf{z}^{\mathrm{local}} = \mathrm{MLP}(\mathbf{z})[:, \, 0\!:\!D'], \qquad (8)$

$\mathbf{z}^{\mathrm{global}} = \mathrm{Agg}\big(\mathrm{MLP}(\mathbf{z})[:, \, D'\!:\!D]\big), \qquad (9)$
where $\mathrm{Agg}(\cdot)$ is the function which aggregates all tokens and can be simply implemented as average pooling, and $D'$ is the dimension associated with the token splitting. In this paper, we set $D'$ as $D/2$.
Subsequently, we combine the local features and the global feature to obtain the concatenated feature $\mathbf{z}^{\mathrm{cat}}$,
$\mathbf{z}^{\mathrm{cat}}_i = \big[\mathbf{z}^{\mathrm{local}}_i, \, \mathbf{z}^{\mathrm{global}}\big], \quad i = 1, \dots, L. \qquad (10)$
Then we feed $\mathbf{z}^{\mathrm{cat}}$ to another MLP to predict the probability $\boldsymbol{\pi}$ of keeping each token:
$\boldsymbol{\pi} = \mathrm{Softmax}\big(\mathrm{MLP}(\mathbf{z}^{\mathrm{cat}})\big). \qquad (11)$
Ultimately, we calculate the mask matrix $\mathbf{m}$ using $\boldsymbol{\pi}$ and the token keep ratio $\rho$, and select the more semantically significant tokens from $\mathbf{z}$ for transmission. The process is written as:
$\mathbf{s} = \mathbf{m} \odot \mathbf{z}, \qquad m_i = \mathbb{1}\big(\pi_i \ge \pi_{(\lceil \rho L \rceil)}\big), \qquad (12)$
where $\odot$ is the element-wise product, $\mathbb{1}(\cdot)$ is the indicator function, and $\pi_{(\lceil \rho L \rceil)}$ is the $\lceil \rho L \rceil$-th largest keep probability. Specifically, $\mathbf{m}$ is determined by the input video content, with "0" representing the discarded features of lower semantic importance, and "1" indicating the retained features of higher semantic importance.
Moreover, the token keep ratio $\rho$ can be adjusted to control the number of retained features, thereby enabling dynamic control over the encoding rate. A lower $\rho$ results in a lower coding rate since more tokens are discarded. At the receiver, zero padding is employed to restore the tokens to their original dimensions, resulting in the feature $\hat{\mathbf{z}}$.
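Putting Eqs. (8)-(12) together, the token selection can be sketched as follows; the exact layer sizes, the two-class scoring head, and the hard top-$\lceil \rho L \rceil$ mask are assumptions consistent with the description above rather than the paper's exact design. For clarity the sketch keeps the full token layout and zeroes out masked positions; in practice only the kept tokens would be transmitted, and zero padding at the receiver restores the original shape.

```python
import torch
import torch.nn as nn

class DynamicTokenSelection(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.split = nn.Linear(dim, dim)                      # local/global halves, Eqs. (8)-(9)
        self.score = nn.Sequential(nn.Linear(dim, dim // 4), nn.GELU(),
                                   nn.Linear(dim // 4, 2))    # logits for {drop, keep}, Eq. (11)

    def forward(self, z, keep_ratio=0.8):
        # z: (B, L, D) tokens
        B, L, D = z.shape
        feat = self.split(z)
        local_feat = feat[..., : D // 2]                            # per-token information
        global_feat = feat[..., D // 2:].mean(dim=1, keepdim=True)  # average-pooled context
        cat = torch.cat([local_feat, global_feat.expand(-1, L, -1)], dim=-1)  # Eq. (10)
        pi = self.score(cat).softmax(dim=-1)[..., 1]                # (B, L) keep probabilities
        # Hard mask: keep the ceil(rho * L) highest-scoring tokens, Eq. (12).
        k = max(1, int(round(keep_ratio * L)))
        idx = pi.topk(k, dim=1).indices
        mask = torch.zeros(B, L, device=z.device)
        mask.scatter_(1, idx, 1.0)
        s = z * mask.unsqueeze(-1)                                  # element-wise masking
        return s, mask
```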
III Experiments
In this section, we first introduce the experimental settings and then present the performance of the proposed VDJSCC scheme.
III-A Experimental Setup
1) Datasets
The VDJSCC model is trained and evaluated on the UCF101 dataset [12], which consists of 13,320 video clips across 101 action classes. All clips have a fixed frame rate of 25 fps and a resolution of 320 × 240. In this paper, we use the first train/test list to split the dataset into a training set and a test set. Each video frame is first resized and then randomly cropped to a fixed spatial size. Additionally, we randomly select consecutive frames from each video to serve as the input of the VDJSCC model.
2) Training Details
In all experiments, the image patch size ($h$ and $w$) is set to 16, the frame patch size ($t$) is set to 2, and the channel dimension $D$ is set to 768. The spatial-temporal Transformer encoder, with 12 attention heads, is used to extract high-dimensional features via multi-head self-attention. In the dynamic token selection module, the token keep ratio $\rho$ is set to 0.8 during training. Furthermore, we train a model at each channel SNR and evaluate it at the same SNR, with the SNR sampled uniformly over the considered range. For each model, we use the Adam optimizer [13] with a fixed learning rate. The batch size is set to 4, and it takes about 1.5 weeks to train the model on a single RTX 3090 GPU.
3) Evaluation Metrics
In this paper, we quantify the end-to-end video transmission performance of the proposed VDJSCC model using the pixel-wise metric PSNR and the perceptual metric MS-SSIM [14]. MS-SSIM evaluates similarity at multiple scales, providing more comprehensive similarity information.
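For reference, PSNR follows directly from the per-clip MSE, while MS-SSIM is available in common libraries; the sketch below assumes pixel values in [0, 1] and the third-party `pytorch_msssim` package, which is our choice for illustration rather than a dependency stated in the paper.

```python
import torch
from pytorch_msssim import ms_ssim  # third-party package, assumed available

def psnr(x: torch.Tensor, x_hat: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB for tensors with values in [0, max_val]."""
    mse = torch.mean((x - x_hat) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def video_metrics(x, x_hat):
    # x, x_hat: (N, C, H, W) frames of one clip, values in [0, 1]
    return psnr(x, x_hat).item(), ms_ssim(x, x_hat, data_range=1.0).item()
```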

4) Comparison Schemes
We compare the performance of the proposed VDJSCC model with classical video coded transmission schemes and a deep-learning-based JSCC model. Specifically, we utilize the standard video codec H.264 [15] for source coding and practical LDPC codes [16] for channel coding. FFmpeg [17] is used to perform H.264 encoding.
In addition, we also compare the model with a state-of-the-art DeepJSCC model named WITT [5]. WITT uses Swin Transformers as the backbone to extract long-range information and shows great performance in image reconstruction tasks. In our comparison with WITT, we first downsample the video at intervals of 10 frames and save the frames as images, and then input all the images into the WITT model.
III-B Experimental Results
1) Analysis of Video Reconstruction Performance Under Different SNR.
Fig. 4 shows the PSNR and MS-SSIM performance versus SNR over the AWGN channel. For the H.264+LDPC, we compare the impact of different modulation methods and code rates. The proposed VDJSCC model significantly outperforms classical video transmission schemes, avoiding the cliff-effect. When the SNR is 1dB, VDJSCC achieves a PSNR of 31.17dB, compared to WITT’s 29.07dB, indicating a notable improvement in reconstruction quality. However, WITT slightly surpasses VDJSCC in PSNR when the SNR exceeds 10dB due to its focus on individual frame processing, which neglects temporal correlations and increases computational resources. In terms of MS-SSIM, VDJSCC consistently outperforms other schemes, particularly at low SNR levels, demonstrating superior perceptual quality with higher visual similarity.
Fig. 5 provides a visual comparison of VDJSCC and the classical video transmission scheme over the AWGN channel at SNR = 13 dB. The proposed VDJSCC preserves clearer details and achieves higher reconstruction quality.
2) Analysis of Video Compression Performance.
Next, we explore the impact of CBR on the performance of the proposed VDJSCC scheme. In this part, we set different token keep ratios $\rho$ to alter the CBR; the CBR equals 0.031 when $\rho = 1.0$. Fig. 6 provides visual results of token selection for the same input video over the AWGN channel at SNR = 13 dB. The model dynamically generates the mask matrix based on the input videos and the token keep ratio, effectively masking less semantically important background information while retaining key portrait details. These results demonstrate the model's ability to achieve high-quality video reconstruction.
Furthermore, in Fig. 7, we present the PSNR and MS-SSIM performance under different CBRs. The models are trained under the AWGN channel at SNR = 13 dB, SNR = 7 dB, and SNR = 1 dB. It is shown that PSNR and MS-SSIM increase with CBR. Specifically, with a CBR of 0.019, the PSNR exceeds 30 dB, indicating good video reconstruction quality even with nearly half of the tokens masked. Notably, token selection allows for dynamic adjustment of the coding length based on the token keep ratio, enabling variable-length encoding and efficient bandwidth utilization.
3) Ablation Study and Computational Cost Comparison.
Last but not least, we present the results of an ablation study to evaluate the influence of different modules. Table I lists the performance of the base models, all trained under an AWGN channel at SNR = 13 dB with CBR = 0.031. For models with the token selection module, the token keep ratio is set to 0.8. "VDJSCC w/o multi-scale" refers to the model without patch merging, where video frames are fed directly into the 10-layer spatial-temporal Transformer encoder. "VDJSCC w/o token selection" refers to the model without dynamic token selection, where all tokens are transmitted. The comparative results clearly show that the proposed multi-scale method improves the reconstruction quality by 3.52 dB. The token selection module slightly reduces PSNR by 0.62 dB, an acceptable trade-off given its bandwidth-saving benefits.
With respect to complexity, the average inference time of the proposed VDJSCC is approximately 77.42 ms on an RTX 3090 GPU. By comparison, the classical H.264 scheme runs at speeds ranging from 1 fps to over 100 fps depending on the coding settings, with typical encoding rates between 10 and 20 fps. Although the multi-scale method requires more computational resources, it yields an increase of approximately 3.52 dB in PSNR. The ablation study indicates that the proposed VDJSCC method achieves high-quality video reconstruction while optimizing bandwidth and computational resources.
| Base Model | PSNR (dB) | MS-SSIM | Inference time (ms) | FLOPs (G) |
|---|---|---|---|---|
| VDJSCC | 34.54 | 0.9906 | 77.42 | 321.8 |
| w/o multi-scale | 31.02 (-3.52) | 0.9796 (-0.0110) | 60.05 | 252.7 |
| w/o token selection | 35.16 (+0.62) | 0.9923 (+0.0017) | 78.09 | 320.2 |
| w/o multi-scale & token selection | 32.42 (-2.12) | 0.9843 (-0.0063) | 61.89 | 251.2 |
IV Conclusion
This paper has proposed a novel DeepJSCC method to achieve end-to-end wireless video transmission. The proposed VDJSCC scheme utilizes a multi-scale spatial-temporal Transformer encoder to explore multi-scale semantic information and obtain a new spatial-temporal representation. Moreover, a dynamic token selection algorithm has been employed to mask tokens of less semantic importance, facilitating content-adaptive variable-length encoding. Extensive experimental results have shown that the proposed VDJSCC scheme outperforms the classical separation-based video transmission scheme H.264+LDPC by a large margin and overcomes the cliff effect. We have also shown that the multi-scale method greatly enhances video reconstruction quality. Furthermore, the token selection module saves bandwidth resources and slightly accelerates inference while ensuring the reconstruction quality.
Acknowledgment
This work is supported in part by the National Natural Science Foundation of China (NSFC) under Grant 62101307, U23B2052, 62071061 and in part by the Fundamental Research Funds for the Central Universities 2023RC78.
References
- [1] V. Kostina and S. Verdú, “Lossy joint source-channel coding in the finite blocklength regime,” IEEE Transactions on Information Theory, vol. 59, no. 5, pp. 2545–2575, 2013.
- [2] J. Dai, S. Wang, K. Tan, Z. Si, X. Qin, K. Niu, and P. Zhang, “Nonlinear transform source-channel coding for semantic communications,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 8, pp. 2300–2316, 2022.
- [3] N. Farsad, M. Rao, and A. Goldsmith, “Deep learning for joint source-channel coding of text,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2326–2330.
- [4] Z. Weng and Z. Qin, “Semantic communication systems for speech transmission,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 8, pp. 2434–2444, 2021.
- [5] K. Yang, S. Wang, J. Dai, K. Tan, K. Niu, and P. Zhang, “WITT: A wireless image transmission transformer for semantic communications,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
- [6] T.-Y. Tung and D. Gündüz, “Deepwive: Deep-learning-aided wireless video transmission,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 9, pp. 2570–2583, 2022.
- [7] Z. Liang, X. Dai, Y. Wu, X. Jin, and J. Shen, “Multi-granularity context network for efficient video semantic segmentation,” IEEE Transactions on Image Processing, vol. 32, pp. 3163–3175, 2023.
- [8] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in 2021 International Conference on Learning Representations (ICLR), 2021.
- [9] Z. Liu, J. Ning, Y. Cao, Y. Wei, Z. Zhang, S. Lin, and H. Hu, “Video swin transformer,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 3192–3201.
- [10] A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, and C. Schmid, “ViViT: A video vision transformer,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 6816–6826.
- [11] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9992–10002.
- [12] K. Soomro, A. R. Zamir, and M. Shah, “A dataset of 101 human action classes from videos in the wild,” Center for Research in Computer Vision, vol. 2, no. 11, pp. 1–7, 2012.
- [13] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), San Diego, CA, 2015.
- [14] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, vol. 2. IEEE, 2003, pp. 1398–1402.
- [15] T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the H.264/AVC video coding standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, 2003.
- [16] T. Richardson and S. Kudekar, “Design of low-density parity check codes for 5g new radio,” IEEE Communications Magazine, vol. 56, no. 3, pp. 28–34, 2018.
- [17] S. Tomar, “Converting video formats with ffmpeg,” Linux journal, vol. 2006, no. 146, p. 10, 2006.