
Latent Semantic Diffusion-based Channel Adaptive De-Noising SemCom for Future 6G Systems

Bingxuan Xu1, Rui Meng1, Yue Chen1, Xiaodong Xu12, Chen Dong12, and Hao Sun3 1 State Key Laboratory of Networking and Switching Technology, BUPT, Beijing, China 2 Department of Broadband Communication, Peng Cheng Laboratory, Shenzhen, China 3 Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge, U. K. {xubingxuan, buptmengrui, mars_ch, xuxiaodong, dongchen}@bupt.edu.cn, [email protected]
Abstract

Compared with the current Shannon's Classical Information Theory (CIT) paradigm, semantic communication (SemCom) has recently attracted more attention, since it aims to transmit the meaning of information rather than performing bit-by-bit transmission, thus enhancing data transmission efficiency and supporting future human-centric, data-, and resource-intensive intelligent services in 6G systems. Nevertheless, channel noise is common and even severe in 6G-empowered scenarios, limiting the communication performance of SemCom, especially when the Signal-to-Noise Ratio (SNR) levels during the training and deployment stages differ, while training multiple networks to cover a broad range of SNRs is computationally inefficient. Hence, we develop a novel De-Noising SemCom (DNSC) framework, where the designed de-noiser module eliminates noise interference from semantic vectors. Upon the designed DNSC architecture, we further combine adversarial learning, the variational autoencoder, and the diffusion model to propose the Latent Diffusion DNSC (Latent-Diff DNSC) scheme to realize intelligent online de-noising. During the offline training phase, noise is added to latent semantic vectors in a forward Markov diffusion manner and then eliminated in a reverse diffusion manner through the posterior distribution approximated by the U-shaped Network (U-Net), where the semantic de-noiser is optimized by maximizing the evidence lower bound (ELBO). Such a design can model real noisy channel environments with various SNRs and enables adaptive removal of noise from noisy semantic vectors during the online transmission phase. Simulations on open-source image datasets demonstrate the superiority of the proposed Latent-Diff DNSC scheme in PSNR and SSIM over different SNRs compared with state-of-the-art schemes, including JPEG, Deep JSCC, and ADJSCC.

Index Terms:
Semantic communication, sixth-generation (6G), diffusion model, image transmission.

I Introduction

To support the transition from the Internet of Things (IoT) to the Internet of Everything (IoE), the future sixth-generation (6G) communication systems are expected to enable a wide range of services, including multisensory Extended Reality (XR), connected robotics and autonomous systems, wireless brain-computer interactions, and more. This will be achieved through emerging techniques such as higher millimeter wave (mmWave) frequencies, large intelligent surfaces, edge Artificial Intelligence (AI), and integrated terrestrial, airborne, and satellite networks [1]. However, the current Shannon's Classical Information Theory (CIT) paradigm faces several challenges in supporting human-centric, data-, and resource-intensive intelligent services. These challenges include the wireless transmission of large amounts of data, rapid system response as well as reliable and efficient information interaction, and the consumption of more network resources for real-time updating of information and analysis of user data [2].

In response, semantic communication (SemCom) is a promising revolutionary paradigm in 6G to break out of the "Shannon trap" by filtering redundant information and extracting the meaning of effective information. Owing to the development of distributed computation and the wide connectivity of ubiquitous intelligent devices, SemCom is expected to be deployed on a large scale in 6G. In particular, SemCom offers the following advantages: relieving the pressure of data transmission, improving network management efficiency, and enhancing resource allocation effectiveness [3][4].

Thanks to the rapid progress of Deep Learning (DL) techniques in comprehending language, audio, and images, DL-based communication architectures have recently been developed. O'Shea et al. [5] design an Auto-Encoder (AE)-based end-to-end wireless communication system, where the channel AE includes the encoder, channel regularizer, and decoder, and convolutional layers and domain-specific regularizing effects are introduced to reduce the number of parameters and avoid over-fitting, respectively. O'Shea et al. [5] further verify that the proposed system not only has performance potential comparable with modern systems, but also almost reaches the Shannon capacity, while maintaining universality and low complexity.

More recently, joint source-channel coding (JSCC) schemes for structured sources in SemCom have attracted more attention, since such designs are more feasible for actual communication systems with constrained block lengths. Xie et al. [6] present a SemCom framework based on the Transformer and transfer learning to extract the semantic information of texts in noisy environments. Xie et al. [6] also consider cross-entropy and mutual information in the loss function to maximize the system capacity and propose a novel metric to measure the semantic error of sentences. Weng et al. [7] exploit squeeze-and-excitation (SE) networks to learn essential speech semantic information and employ the attention mechanism to enhance the accuracy of signal recovery. Bourtsoulatze et al. [8] first develop a SemCom architecture based on Convolutional Neural Networks (CNNs) for radio transmission of high-resolution pictures over additive white Gaussian noise (AWGN) and fading channels. Dong et al. [4] first design semantic slice-models to adaptively realize semantic transmission tasks in different circumstances. Du et al. [9] propose an AI-generated incentive mechanism based on contract theory to facilitate semantic information sharing among users, where the diffusion model is utilized to generate the optimal contract design, effectively promoting the sharing of semantic information between users.

Although many researchers pay attention to DL-enabled SemCom and exploit its potential to enhance communication efficiency, SemCom still faces some challenges. For example, channel noise severely affects the recovery and reception of semantic information. Hence, suppressing channel noise in the restoration of semantic information should be appropriately addressed to enhance transmission performance.

To resist channel noise, the schemes proposed in [6] and [8] introduce channels when training models to recover signals at a specific SNR. However, the above-mentioned schemes do not consider that the SNR levels during the training and deployment stages may differ. On the other hand, it is computationally inefficient to train multiple networks to cover scenarios with a wide range of SNRs. To realize highly robust SemCom across different SNRs, Hu et al. [10] develop a vector quantization-based semantic communication system, which utilizes adversarial networks to perform semantic de-noising for image classification tasks. For image generation, Xu et al. [11] propose an Attention DL-based JSCC (ADJSCC) scheme that uses the attention mechanism to dynamically adjust SNRs during the training phase. Such a design can capture inherent channel characteristics at different SNRs and remove channel noise during actual deployment. Nevertheless, adaptive de-noising in different SNR environments remains a major challenge.

To support highly reliable image transmission for future 6G systems, we combine the Variational Autoencoder (VAE), adversarial learning, and the diffusion model to realize de-noising semantic communication under noisy channel environments. The main contributions are described as follows:

  • We develop the De-Noising SemCom (DNSC) framework consisting of the encoder and decoder to achieve highly robust transmission for future 6G systems, where the proposed semantic de-noiser module in the decoder can relieve the effects of channel noises on semantic transmission.

  • To realize adaptive online semantic de-noising under various SNR environments, we further propose the Latent Diffusion DNSC (Latent-Diff DNSC) scheme upon the designed DNSC system. The constructed objective function for joint training of the encoder and decoder considers the reconstruction loss, the adversarial loss, and the regularization loss for optimizing the VAEs, optimizing the discriminator that differentiates original and generated images, and avoiding arbitrarily scaled latent semantic spaces, respectively.

  • The semantic de-noiser in the proposed Latent-Diff DNSC scheme is obtained through forward and reverse diffusion processes. The gradually added noises in the forward phase to model real channel noises are iteratively eliminated in the reverse phase by maximizing the logarithmic likelihood of the distribution predicted by the DL model. Under such design, the channel noises under different SNR conditions can be adaptively removed in the online inference stage without knowing channel information in the offline training stage.

  • The simulation results on the open-source LAION2B-EN dataset [12] demonstrate that, under the same compression ratio, the proposed Latent-Diff DNSC model achieves $20\%\sim 67\%$ higher Peak Signal-to-Noise Ratio (PSNR) and $4\%\sim 68\%$ higher Structural SIMilarity (SSIM) compared to the ADJSCC and JSCC models.

II System Model

As shown in Fig. 1, a typical SemCom architecture is considered, including the transmitter, physical channels, and the receiver, which are described as follows in detail.

Figure 1: Illustration of the proposed DNSC architecture.

II-A Transmitter

The input data $\mathbf{x}$ is processed by the joint semantic-channel encoder, producing a semantic latent vector $\mathbf{z}$ as

\mathbf{z}=\mathcal{E}_{\delta}(\mathbf{x}) \qquad (1)

As shown in (2), the transmitted signal is subject to a power constraint of magnitude $P$ before being transmitted through the physical channel.

\mathbf{z}_{nor}=\sqrt{kP}\,\frac{\mathbf{z}}{\sqrt{\mathbf{z}\mathbf{z}^{*}}} \qquad (2)
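For concreteness, a minimal numpy sketch of the normalization in (2) is given below; it assumes $k$ denotes the number of transmitted symbols and $\mathbf{z}\mathbf{z}^{*}$ the total signal energy, which the paper does not state explicitly.

```python
import numpy as np

def power_normalize(z: np.ndarray, P: float = 1.0) -> np.ndarray:
    """Scale the latent vector so that its average symbol power equals P, as in (2)."""
    k = z.size                                   # number of transmitted symbols (assumed meaning of k)
    energy = np.sum(np.abs(z) ** 2)              # z z^* interpreted as the total signal energy
    return np.sqrt(k * P) * z / np.sqrt(energy)

# toy check: the normalized vector has average symbol power ~P
z = np.random.randn(64) + 1j * np.random.randn(64)
print(np.mean(np.abs(power_normalize(z)) ** 2))  # ~1.0
```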

II-B Physical Channel

The physical channel is usually modeled as additive white Gaussian noise (AWGN), denoted by $\boldsymbol{n}\sim\mathcal{CN}(0,\sigma_{w}^{2})$, where $\sigma_{w}^{2}$ represents the noise power. Hence, the received signal $\mathbf{\tilde{z}}$ at the receiver is denoted as

\mathbf{\tilde{z}}=\boldsymbol{h}*\mathbf{z}_{nor}+\boldsymbol{n} \qquad (3)

where $\boldsymbol{h}$ represents the coefficients of the physical channel between the transmitter and receiver.
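A short sketch of how (3) could be simulated for the AWGN case (i.e., $\boldsymbol{h}=1$) follows; it assumes unit average signal power after the normalization in (2), so that $\sigma_{w}^{2}=10^{-\mathrm{SNR}/10}$ for a given SNR in dB.

```python
import numpy as np

def awgn_channel(z_nor: np.ndarray, snr_db: float) -> np.ndarray:
    """Apply (3) with h = 1 (AWGN); sigma_w^2 follows from the SNR in dB,
    assuming unit average signal power after the normalization in (2)."""
    sigma2 = 10 ** (-snr_db / 10)                # noise power sigma_w^2
    noise = np.sqrt(sigma2 / 2) * (np.random.randn(*z_nor.shape)
                                   + 1j * np.random.randn(*z_nor.shape))
    return z_nor + noise
```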

II-C Receiver

Definition 1 (Semantic De-Noiser): Unlike existing SemCom frameworks, in the proposed DNSC architecture, the semantic de-noiser module at the receiver is defined to remove noise components from the noisy semantic vector. Such a design can be achieved by neural networks parameterized by $\theta$.

Through the prediction of the trained semantic de-noiser, the latent semantic variable $\mathbf{\tilde{z}}$ is progressively de-noised and eventually transformed into $\mathbf{\tilde{z}}_{0}$. Then, the de-noised semantic vector $\mathbf{\tilde{z}}_{0}$ is sent to the joint semantic-channel decoder to generate $\mathbf{\tilde{x}}$ as

\mathbf{\tilde{x}}=\mathcal{D}_{\varphi}(\mathbf{\tilde{z}}_{0}) \qquad (4)

Overall, the proposed semantic de-noiser can accurately model the noise characteristics of physical channels, leading to effective de-noising of semantic information. In this way, the proposed DNSC architecture can intelligently remove noise from latent semantic vectors to improve the robustness and generalization ability of SemCom models. The specific implementation of the semantic de-noiser is provided in Section III.

III The Proposed Latent-Diff DNSC system

As illustrated in Fig. 2, upon the designed DNSC system, we present the Latent-Diff DNSC scheme, including the offline training and online inference phases.

Figure 2: Illustration of the proposed Latent-Diff DNSC model. (a): offline training; (b): online inference.

III-A Offline Training phase

As illustrated in Algorithm 1, the offline training phase includes two training stages.

Algorithm 1 The Offline Training Algorithm for Latent-Diff DNSC

The first training stage

1:  Input: Training dataset $\mathcal{X}$, batch size $N$, learning rate $\gamma$;
2:  repeat
3:     Sample a data batch $\mathbf{x}=[\boldsymbol{x}_{1},...,\boldsymbol{x}_{N}]$ from $\mathcal{X}$;
4:     for each data sample $\boldsymbol{x}_{i}\in\mathbb{R}^{H\times W\times 3}$ in $\mathbf{x}$ do
5:        The encoder $\mathcal{E}_{\delta}$ encodes $\boldsymbol{x}_{i}$ into a latent representation $\boldsymbol{z}_{i}\in\mathbb{R}^{h\times w\times c}$, $\boldsymbol{z}_{i}=\mathcal{E}_{\delta}(\boldsymbol{x}_{i})$;
6:        The decoder $\mathcal{D}_{\varphi}$ reconstructs the image $\boldsymbol{\tilde{x}}_{i}$ from the latent $\boldsymbol{z}_{i}$, $\boldsymbol{\tilde{x}}_{i}=\mathcal{D}_{\varphi}(\boldsymbol{z}_{i})=\mathcal{D}_{\varphi}(\mathcal{E}_{\delta}(\boldsymbol{x}_{i}))$;
7:     end for
8:     Compute the mean $\mu_{z}$ and variance $\sigma_{z}$ of the latent $\mathbf{z}$;
9:     Compute the reconstruction loss via (6);
10:     Compute the adversarial loss via (7);
11:     Compute the regularization loss via (8);
12:     Compute the average loss via (5);
13:     Update the parameters $\delta$ of the encoder $\mathcal{E}$, $\varphi$ of the decoder $\mathcal{D}$, and $\omega$ of the discriminator $D$ by gradient descent;
14:  until Converged

The second training stage

1:  Input: Hyperparameters of the Gaussian distribution variance $\{\beta_{1},\beta_{2},...,\beta_{T}\}$;
2:  repeat
3:     Sample a batch of latent representations $\mathbf{z}$ from the first stage, $\mathbf{z}_{0}\sim q(\mathbf{z}_{0})$;
4:     $t\sim$ Uniform$(\{1,...,T\})$;
5:     $\boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I})$;
6:     Compute $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha}_{t}=\alpha_{1}\times\alpha_{2}\times\cdots\times\alpha_{t}$;
7:     Take a gradient descent step on (15);
8:  until Converged

III-A1 The First Training Stage

During the first training stage, the encoder $\mathcal{E}_{\delta}$ and decoder $\mathcal{D}_{\varphi}$ are jointly trained to obtain the distribution of the semantic latent space. As illustrated in Table I, the encoder-decoder structure in [13] is adopted.

Proposition 1: To learn the probability distribution of the latent semantic space, the encoder and decoder are jointly trained with a loss function consisting of the reconstruction loss $\mathcal{L}_{recon}$, the adversarial loss $\mathcal{L}_{adv}$, and the regularized Kullback-Leibler (KL) divergence loss $\mathcal{L}_{reg}$ as

\mathcal{L}_{1}=\mathcal{L}_{\text{recon}}+\lambda_{\text{adv}}\mathcal{L}_{\text{adv}}+\lambda_{\text{reg}}\mathcal{L}_{\text{reg}} \qquad (5)

where $\lambda_{adv}$ and $\lambda_{reg}$ denote the weights of the corresponding loss terms, $\mathcal{L}_{recon}$ quantifies the disparity between the original and reconstructed images, $\mathcal{L}_{adv}$, obtained through adversarial training, preserves local consistency, and $\mathcal{L}_{reg}$, computed in the latent space, encourages the retention of maximal information while encoding images. These losses can be expressed as

\mathcal{L}_{\text{recon}}=\|\mathbf{\tilde{x}}-\mathbf{x}\|_{2}^{2} \qquad (6)
\mathcal{L}_{adv}=[\log{D_{\omega}}(\mathbf{x})]+[\log(1-{D}_{\omega}(\mathbf{\tilde{x}}))] \qquad (7)

where $D_{\omega}$ denotes the discriminator with parameters $\omega$.

\mathcal{L}_{\text{reg}}=KL(\mathcal{N}(\mu_{z},\sigma_{z})\,\|\,\mathcal{N}(0,1)) \qquad (8)
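The following PyTorch-style sketch illustrates how the three terms in (5)-(8) could be combined in one training step; the discriminator interface, the non-saturating generator form of the adversarial term, and the loss weights are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def first_stage_loss(x, x_rec, mu_z, logvar_z, disc,
                     lambda_adv=0.5, lambda_reg=1e-6):
    """Sketch of the joint loss (5); weights and the adversarial form are assumptions."""
    # (6): pixel-wise reconstruction loss between original and reconstructed images
    l_recon = F.mse_loss(x_rec, x)
    # adversarial term seen by the encoder/decoder: encourage D_omega to rate the
    # reconstruction as real ((7) itself is the discriminator's objective)
    l_adv = -torch.mean(torch.log(torch.sigmoid(disc(x_rec)) + 1e-8))
    # (8): KL(N(mu_z, sigma_z^2) || N(0, 1)), averaged over the batch and latent dims
    l_reg = 0.5 * torch.mean(mu_z.pow(2) + logvar_z.exp() - 1.0 - logvar_z)
    return l_recon + lambda_adv * l_adv + lambda_reg * l_reg
```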
TABLE I: The structure of the Encoder-Decoder
Encoder | Decoder
$\mathbf{x}\in\mathbb{R}^{H\times W\times C}$ | $\mathbf{z}\in\mathbb{R}^{h\times w\times c}$
Conv2D: $\mathbb{R}^{H\times W\times C^{\prime}}$ | Conv2D: $\mathbb{R}^{h\times w\times C^{\prime\prime}}$
$m\times$ ResBlock, Downsample: $\mathbb{R}^{h\times w\times C^{\prime\prime}}$ | ResBlock: $\mathbb{R}^{h\times w\times C^{\prime\prime}}$
ResBlock: $\mathbb{R}^{h\times w\times C^{\prime\prime}}$ | Non-Local: $\mathbb{R}^{h\times w\times C^{\prime\prime}}$
Non-Local: $\mathbb{R}^{h\times w\times C^{\prime\prime}}$ | ResBlock: $\mathbb{R}^{h\times w\times C^{\prime\prime}}$
ResBlock: $\mathbb{R}^{h\times w\times C^{\prime\prime}}$ | $m\times$ ResBlock, Upsample: $\mathbb{R}^{H\times W\times C^{\prime}}$
GroupNorm, Swish, Conv2D: $\mathbb{R}^{h\times w\times c}$ | GroupNorm, Swish, Conv2D: $\mathbb{R}^{H\times W\times C}$

III-A2 The Second Training Stage

A forward diffusion chain progressively adds noise to the semantic latent vector $\mathbf{z}_{0}=\mathcal{E}_{\delta}(\mathbf{x})$ via a Markov chain with an ascending variance schedule $\{\beta_{1},\beta_{2},...,\beta_{T}\}$ as

q\left(\mathbf{z}_{1:T}\mid\mathbf{z}_{0}\right)=\prod_{t=1}^{T}q\left(\mathbf{z}_{t}\mid\mathbf{z}_{t-1}\right) \qquad (9)

where

q\left(\mathbf{z}_{t}\mid\mathbf{z}_{t-1}\right)=\mathcal{N}\left(\mathbf{z}_{t};\sqrt{1-\beta_{t}}\,\mathbf{z}_{t-1},\beta_{t}\mathbf{I}\right) \qquad (10)

After reparameterization, $\mathbf{z}_{t}$ can be obtained as

\mathbf{z}_{t}=\sqrt{\bar{\alpha}_{t}}\,\mathbf{z}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\boldsymbol{\epsilon} \qquad (11)

where $\bar{\alpha}_{t}=\alpha_{1}\times\alpha_{2}\times\cdots\times\alpha_{t}$ and $\alpha_{t}=1-\beta_{t}$.
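A minimal PyTorch sketch of the closed-form forward diffusion in (11) is shown below, assuming `alpha_bar` is a precomputed tensor of the cumulative products $\bar{\alpha}_{1},...,\bar{\alpha}_{T}$.

```python
import torch

def q_sample(z0: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    """Closed-form forward diffusion (11): z_t = sqrt(a_bar_t) z_0 + sqrt(1 - a_bar_t) eps."""
    eps = torch.randn_like(z0)
    a_bar = alpha_bar[t].view(-1, *([1] * (z0.dim() - 1)))   # broadcast over latent dims
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps
    return z_t, eps
```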

The training data are transformed into semantic vectors $\mathbf{z}_{0}$ with distribution $q(\mathbf{z}_{0})$ through the trained encoder. Gaussian noise is then added to model real-world noise conditions: $q(\mathbf{z}_{0})$ is iteratively corrupted with noise for $T$ steps and gradually becomes a Gaussian noise distribution $\boldsymbol{\epsilon}\sim q(\mathbf{z}_{T})$. In the reverse process, our goal is to train a U-shaped Network (U-Net) with parameters $\theta$ to learn the inverse distribution $p_{\theta}$, which enables us to gradually recover the original latent distribution of $\mathbf{z}_{0}$ from the diffused latent $\mathbf{z}_{T}$ through the reverse diffusion process. The reverse process is given by

p_{\theta}\left(\mathbf{z}_{0}\right)=\int p_{\theta}\left(\mathbf{z}_{T}\right)\prod_{t=1}^{T}p_{\theta}\left(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}\right)\,d\mathbf{z}_{1:T} \qquad (12)

where $p_{\theta}\left(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}\right)=\mathcal{N}\left(\mathbf{z}_{t-1};\mu_{\theta}\left(\mathbf{z}_{t},t\right),\Sigma_{\theta}\left(\mathbf{z}_{t},t\right)\right)$. Although it is difficult to obtain the probability distribution $q(\mathbf{z}_{t-1}\mid\mathbf{z}_{t})$ of the reverse process, $q(\mathbf{z}_{t-1}\mid\mathbf{z}_{t},\mathbf{z}_{0})$ can be obtained by (13) if $\mathbf{z}_{0}$ is known.

q\left(\mathbf{z}_{t-1}\mid\mathbf{z}_{t},\mathbf{z}_{0}\right)=\mathcal{N}\left(\mathbf{z}_{t-1};\tilde{\mu}\left(\mathbf{z}_{t},\mathbf{z}_{0}\right),\tilde{\beta}_{t}\mathbf{I}\right) \qquad (13)

where $\tilde{\beta}_{t}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$ denotes the posterior variance. The parameters can be optimized by maximizing the evidence lower bound (ELBO) as

\log p_{\theta}\left(\mathbf{z}_{0}\right)\geq-\mathbb{KL}\left(q\left(\mathbf{z}_{T}\mid\mathbf{z}_{0}\right)\,\|\,p\left(\mathbf{z}_{T}\right)\right)-\sum_{t=1}^{T}\mathbb{E}_{q\left(\mathbf{z}_{t}\mid\mathbf{z}_{0}\right)}\mathbb{KL}\left(q\left(\mathbf{z}_{t-1}\mid\mathbf{z}_{t},\mathbf{z}_{0}\right)\,\|\,p\left(\mathbf{z}_{t-1}\mid\mathbf{z}_{t}\right)\right) \qquad (14)

Proposition 2: Essentially, the proposed semantic de-noiser is composed of a set of neural networks $\boldsymbol{\epsilon}_{\theta}(\mathbf{z}_{t},t)$ with shared parameters and weights, trained to estimate the noise added to the input $\mathbf{z}_{t}$. Besides, since $\mathbf{z}_{0}$ is unknown at the receiver, it is replaced by rearranging (11), which yields the loss function in (15).

\mathcal{L}_{2}=\mathbb{E}_{\mathcal{E}_{\delta}(\mathbf{x}),\,\boldsymbol{\epsilon}\sim\mathcal{N}(0,1),\,t}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}\left(\mathbf{z}_{t},t\right)\right\|_{2}^{2}\right] \qquad (15)

Hence, the semantic vector $\mathbf{z}_{t}$ is the input of the semantic de-noiser at time $t$, and the corresponding noise variables are estimated and output. After the training process converges by minimizing the loss function in (15), the trained semantic de-noiser can accurately eliminate noise.
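The second training stage then reduces to the following sketch of one optimization step on (15), where `eps_model` is a placeholder for the U-Net de-noiser $\boldsymbol{\epsilon}_{\theta}$ and `alpha_bar` holds the precomputed $\bar{\alpha}_{t}$.

```python
import torch

def diffusion_training_step(eps_model, z0, alpha_bar, optimizer):
    """One gradient step on the simplified objective (15); eps_model stands in for the U-Net."""
    T = alpha_bar.numel()
    t = torch.randint(0, T, (z0.size(0),), device=z0.device)       # t ~ Uniform({1, ..., T})
    eps = torch.randn_like(z0)
    a_bar = alpha_bar[t].view(-1, *([1] * (z0.dim() - 1)))
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps            # forward diffusion (11)
    loss = torch.mean((eps - eps_model(z_t, t)) ** 2)               # ||eps - eps_theta(z_t, t)||^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```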

III-B Online inference phase

The online inference process is illustrated in Algorithm 2. Adding noise during the training phase results in a higher overall noise level, which can negatively impact transmission accuracy. To mitigate this problem, power normalization is employed to ensure that the noise variance is proportional to the signal power, leading to a final noise power level of $1$ after $T$ diffusion steps. Therefore, to eliminate semantic noise effectively during the inference phase, the normalization factor $\phi$ is obtained at the receiver through SNR estimation, and the power of the received signal is normalized as

\mathbf{\tilde{z}}_{T}=\phi\mathbf{\tilde{z}}=\frac{\mathbf{\tilde{z}}}{\sqrt{1/10^{SNR/10}}\sqrt{\mathbf{\tilde{z}}\mathbf{\tilde{z}}^{*}}} \qquad (16)

where $\mathbf{\tilde{z}}$ is the semantic vector transmitted through the channel in (3). Subsequently, the signal $\mathbf{\tilde{z}}_{T}$ is fed into the de-noiser for noise elimination. After $T$ iterations, the final output is the recovered semantic vector $\mathbf{\tilde{z}}_{0}$, which is then input into the decoder to obtain $\mathbf{\tilde{x}}$.

Algorithm 2 The Online Inference Algorithm for Latent-Diff DNSC
1:  Input: Image data $\mathbf{x}$ to be transmitted, hyperparameters of the Gaussian distribution variance $\{\beta_{1},\beta_{2},...,\beta_{T}\}$;
2:  The encoder $\mathcal{E}_{\delta}$ encodes $\mathbf{x}$ into a latent representation $\mathbf{z}$ via (1);
3:  Apply the power constraint via (2);
4:  $\mathbf{z}$ undergoes channel transmission and becomes $\mathbf{\tilde{z}}$ at the receiver via (3);
5:  The SNR estimation module receives $\mathbf{\tilde{z}}$ and normalizes its power via (16);
6:  for de-noising steps $t=T,...,1$ do
7:     $\mathbf{y}\sim\mathcal{N}(0,\mathbf{I})$ if $t>1$, else $\mathbf{y}=0$;
8:     $\mathbf{\tilde{z}}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\mathbf{\tilde{z}}_{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\boldsymbol{\epsilon}_{\theta}\left(\mathbf{\tilde{z}}_{t},t\right)\right)+\sigma_{t}\mathbf{y}$;
9:  end for
10:  The decoder $\mathcal{D}_{\varphi}$ reconstructs the image $\mathbf{\tilde{x}}$ from the de-noised latent $\mathbf{\tilde{z}}_{0}$ via (4);
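A hedged PyTorch sketch of steps 5-9 of Algorithm 2 is given below; it assumes real-valued latents, uses $\sigma_{t}=\sqrt{\beta_{t}}$ as in standard DDPM sampling (the paper does not specify $\sigma_{t}$), and treats `eps_model` as the trained U-Net de-noiser.

```python
import torch

@torch.no_grad()
def denoise(eps_model, z_rx, snr_db, betas):
    """Sketch of Algorithm 2, steps 5-9: power normalization via (16) followed by
    T reverse diffusion updates; sigma_t = sqrt(beta_t) is an assumption."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    # (16): normalization factor from the estimated SNR (real-valued latents assumed)
    sigma = 10 ** (-snr_db / 20)
    z_t = z_rx / (sigma * torch.sqrt(torch.sum(z_rx ** 2)))
    for t in reversed(range(len(betas))):
        y = torch.randn_like(z_t) if t > 0 else torch.zeros_like(z_t)
        t_batch = torch.full((z_t.size(0),), t, device=z_t.device, dtype=torch.long)
        eps_hat = eps_model(z_t, t_batch)                            # predicted noise
        z_t = (z_t - (1 - alphas[t]) / torch.sqrt(1 - alpha_bar[t]) * eps_hat) \
              / torch.sqrt(alphas[t]) + torch.sqrt(betas[t]) * y
    return z_t                                                       # de-noised latent, fed to the decoder
```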

IV Simulation Results and Analysis

In this section, we evaluate the performance of the proposed Latent-Diff DNSC scheme on the AWGN channel with perfect SNR estimation. The proposed model is initialized with the pre-trained model from [13]. The number of diffusion steps $T$ for the pre-trained model is $1000$, the learning rate is $9.6\times 10^{-5}$, the training batch size is $8$, and the noise schedule $\{\beta_{1},\beta_{2},...,\beta_{T}\}$ increases linearly.
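For reference, a linear schedule of this form can be generated as in the short sketch below; the endpoint values are illustrative assumptions, since the paper does not report $\beta_{1}$ and $\beta_{T}$.

```python
import torch

T = 1000                                      # diffusion steps, as described above
betas = torch.linspace(1e-4, 2e-2, T)         # beta_1 < beta_2 < ... < beta_T (endpoints assumed)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)      # cumulative products \bar{alpha}_t used in (11)
```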

IV-A Open-Source Image Dataset

We select 660 images from the LAION2B-EN dataset [12] to compare the transmission performance of the proposed scheme with baseline schemes. The LAION2B-EN dataset is an image-text dataset, and all images are resized to the shape of $3\times 512\times 512$, where the dimensions correspond to the RGB channels, width, and height, respectively.

IV-B Baseline Schemes

We compare the performance of our proposed scheme with DeepJSCC [8], ADJSCC [11], and the traditional JPEG image compression algorithm. DeepJSCC is trained at two SNRs, 10 dB and 20 dB, while ADJSCC is trained with SNRs ranging from 0 to 28 dB. The Latent-Diff DNSC scheme without the de-noiser module is also provided as a baseline to highlight the significance of the de-noiser.

IV-C Performance Comparison

PSNR and SSIM [14] are used as metrics, representing image quality and similarity, respectively. Fig. 3 visualizes a subset of images recovered under different SNRs using the proposed Latent-Diff DNSC scheme with a compression ratio $r_{s}$ of $1/48$.

Figure 3: Examples of reconstructed images produced by the proposed Latent-Diff DNSC scheme.
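For completeness, the two metrics can be computed per image pair with scikit-image as sketched below (an implementation assumption; the paper does not state which library is used).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(img_true: np.ndarray, img_rec: np.ndarray):
    """PSNR and SSIM for one pair of H x W x 3 uint8 images."""
    psnr = peak_signal_noise_ratio(img_true, img_rec, data_range=255)
    ssim = structural_similarity(img_true, img_rec, channel_axis=-1, data_range=255)
    return psnr, ssim
```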

Fig. 4 presents the PSNR performance of the proposed Latent-Diff DNSC scheme and four baseline schemes. The PSNR of the proposed Latent-Diff DNSC scheme is significantly better than that of all baseline schemes over different SNRs. Specifically, at high SNRs, DeepJSCC performs better when trained at an SNR of 20 dB than at 10 dB, while at low SNRs, it performs better when trained at 10 dB than at 20 dB. The reason is that DeepJSCC has difficulty capturing varying channel features online when trained at a fixed SNR offline. ADJSCC addresses this issue and performs better than JSCC, whether the latter is trained at an SNR of 10 or 20 dB. Furthermore, our proposed Latent-Diff DNSC scheme, with its superior ability to capture channel characteristics and its de-noising capability, outperforms ADJSCC. Under low SNRs, the Latent-Diff DNSC improves PSNR by approximately 3 dB compared to ADJSCC. As the SNR increases, this advantage becomes more pronounced, with a PSNR improvement of approximately 5.7 dB at an SNR of 30 dB. Additionally, the Latent-Diff DNSC scheme without the de-noiser performs poorly at low SNRs. This can be explained by the absence of added noise during the first training stage, which prevents the encoder and decoder from learning an accurate distribution of channel properties under various SNRs. However, it performs well at high SNRs, verifying the excellent image reconstruction ability of the Latent-Diff DNSC scheme under the high compression ratio $r_{s}$.

Figure 4: PSNR of the proposed Latent-Diff DNSC scheme and baseline schemes on LAION2B-EN images versus different SNRs.
Figure 5: SSIM of the proposed Latent-Diff DNSC scheme and baseline schemes on LAION2B-EN images versus different SNRs.

Fig. 5 provides the SSIM performance of the proposed Latent-Diff DNSC scheme and baseline schemes versus different SNRs. Similar to Fig. 4, the proposed Latent-Diff DNSC scheme outperforms all baseline schemes under different channel conditions. Furthermore, under low SNRs, the SSIM of the proposed Latent-Diff DNSC is approximately 0.1 higher than that of ADJSCC. As the SNR decreases, this advantage becomes even more pronounced.

IV-D De-noising Trade-off Between Semantic and Noise Components

Fig. 6 and Fig. 7 show the PSNR and SSIM performance of the Latent-Diff DNSC system under different numbers of de-noising steps. When the number of de-noising steps is 15, the system performance is severely degraded, indicating that a large amount of noise in the semantic vectors has not been eliminated. When the number of de-noising steps is 45 or 60, the system performance is lower than that with 30 de-noising steps. This is because, in the iterative de-noising process, each step subtracts the estimated noise component from the received semantic vector; excessive de-noising steps may remove components of the semantic vector that do not belong to the noise, resulting in a decrease in system performance.

Figure 6: PSNR of the proposed Latent-Diff DNSC scheme on LAION2B-EN images versus different SNRs.
Figure 7: SSIM of the proposed Latent-Diff DNSC scheme on LAION2B-EN images versus different SNRs.

V Conclusions

In this paper, we propose the DNSC system to combat channel noise by introducing the semantic de-noiser. Based on the DNSC framework, we further design the Latent-Diff DNSC scheme, where the VAE serves as the encoder and decoder, and the semantic de-noiser is realized by a diffusion process in the latent space obtained by the VAE and adversarial learning. In this way, channel noise is modeled in a diffusion manner and gradually eliminated by variational inference. Such a design enables adaptive online de-noising under different SNRs estimated by an SNR estimator. Experimental results on open-source datasets show that the proposed Latent-Diff DNSC approach achieves better transmission performance in PSNR and SSIM than four baseline schemes at a high compression ratio of $1/48$. Therefore, the proposed method is a promising solution for image semantic communication under dynamic channel environments in future 6G systems.

References

  • [1] W. Saad, M. Bennis, and M. Chen, “A vision of 6G wireless systems: Applications, trends, technologies, and open research problems,” IEEE Network, vol. 34, no. 3, pp. 134–142, 2019.
  • [2] W. Yang, H. Du, Z. Q. Liew, W. Y. B. Lim, Z. Xiong, D. Niyato, X. Chi, X. S. Shen, and C. Miao, “Semantic communications for future internet: Fundamentals, applications, and challenges,” IEEE Communications Surveys & Tutorials, 2022.
  • [3] W. Yang, Z. Q. Liew, W. Y. B. Lim, Z. Xiong, D. Niyato, X. Chi, X. Cao, and K. B. Letaief, “Semantic communication meets edge intelligence,” IEEE Wireless Communications, vol. 29, no. 5, pp. 28–35, 2022.
  • [4] C. Dong, H. Liang, X. Xu, S. Han, B. Wang, and P. Zhang, “Semantic communication system based on semantic slice models propagation,” IEEE Journal on Selected Areas in Communications, vol. 41, no. 1, pp. 202–213, 2022.
  • [5] T. J. O’Shea, K. Karra, and T. C. Clancy, “Learning to communicate: Channel auto-encoders, domain specific regularizers, and attention,” in 2016 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT).   IEEE, 2016, pp. 223–228.
  • [6] H. Xie, Z. Qin, G. Y. Li, and B.-H. Juang, “Deep learning enabled semantic communication systems,” IEEE Transactions on Signal Processing, vol. 69, pp. 2663–2675, 2021.
  • [7] Z. Weng and Z. Qin, “Semantic communication systems for speech transmission,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 8, pp. 2434–2444, 2021.
  • [8] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, “Deep joint source-channel coding for wireless image transmission,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567–579, 2019.
  • [9] H. Du, J. Wang, D. Niyato, J. Kang, Z. Xiong, and D. I. Kim, “AI-generated incentive mechanism and full-duplex semantic communications for information sharing,” arXiv preprint arXiv:2303.01896, 2023.
  • [10] Q. Hu, G. Zhang, Z. Qin, Y. Cai, G. Yu, and G. Y. Li, “Robust semantic communications with masked VQ-VAE enabled codebook,” arXiv preprint arXiv:2206.04011, 2022.
  • [11] J. Xu, B. Ai, W. Chen, A. Yang, P. Sun, and M. Rodrigues, “Wireless image transmission using deep source channel coding with attention modules,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 4, pp. 2315–2328, 2021.
  • [12] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” arXiv preprint arXiv:2210.08402, 2022.
  • [13] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10 674–10 685.
  • [14] A. Horé and D. Ziou, “Image quality metrics: PSNR vs. SSIM,” in 2010 20th International Conference on Pattern Recognition, 2010, pp. 2366–2369.