
Diffusion-Aided Joint Source Channel Coding For High Realism Wireless Image Transmission

Mingyu Yang, Bowen Liu, Boyang Wang, and Hun-Seok Kim

This work was funded in part by NSF CAREER #1942806. M. Yang, B. Liu, B. Wang, and H. Kim are with the Department of Electrical and Computer Engineering, University of Michigan, Ann Arbor, MI, 48109 USA (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Abstract

Deep learning-based joint source-channel coding (deep JSCC) has been demonstrated to be an effective approach for wireless image transmission. However, many current approaches utilize an autoencoder framework to optimize conventional metrics such as Mean Squared Error (MSE) and Structural Similarity Index (SSIM), which are inadequate for preserving the perceptual quality of reconstructed images. Such an issue is more prominent under stringent bandwidth constraints or low signal-to-noise ratio (SNR) conditions. To tackle this challenge, we propose DiffJSCC, a novel framework that leverages the prior knowledge of the pre-trained Stable Diffusion model to produce high-realism images via the conditional diffusion denoising process. First, our DiffJSCC employs an autoencoder structure similar to prior deep JSCC works to generate an initial image reconstruction from the noisy channel symbols. This preliminary reconstruction serves as an intermediate step where robust multimodal spatial and textual features are extracted. In the following diffusion step, DiffJSCC uses the derived multimodal features, together with channel state information such as the SNR and channel gain. These serve as conditions to guide the diffusion denoising process through a novel control module, which adjusts the output according to the multimodal conditions. To maintain the balance between realism and fidelity, an optional intermediate guidance approach using the initial image reconstruction is implemented. Extensive experiments on diverse datasets reveal that our method significantly surpasses prior deep JSCC approaches on both perceptual metrics and downstream task performance, showcasing its ability to preserve the semantics of the original transmitted images. Notably, DiffJSCC can achieve highly realistic reconstructions for $768\times 512$ pixel Kodak images with only 3072 symbols ($<0.008$ symbols per pixel) under 1dB SNR channels.
The source code is available at https://github.com/mingyuyng/DiffJSCC.

Index Terms:
deep joint source-channel coding, text-to-image diffusion models, stable diffusion, image captioning.

I Introduction

Robust wireless image and video transmission has become crucial for emerging applications such as augmented reality (AR), virtual reality (VR), and autonomous driving, which commonly offload computation to remote servers. Conventional wireless image transmission utilizes a two-step scheme that first applies compression using codecs like Joint Photographic Experts Group (JPEG) or Better Portable Graphics (BPG), followed by a separate channel coding scheme such as a low-density parity-check (LDPC) code or Polar code for error protection. Nonetheless, these standard compression algorithms are sensitive to bit errors and suffer dramatic quality degradation (the so-called “cliff effect”) when channel coding fails to correct all codeword errors. To address this challenge, joint source-channel coding (JSCC) has been proposed to improve error resilience and maintain image integrity during wireless transmission [1, 2].

Figure 1: Overall framework of the proposed DiffJSCC (e) and comparison with other existing deep JSCC structures (a-d).

Recent advancements in Deep Neural Networks (DNNs) have significantly impacted fields including computer vision [3, 4, 5], natural language processing [6, 7], and wireless communications [8, 9, 10], demonstrating their power in addressing complex problems including JSCC. Bourtsoulatze et al. [11] have introduced a groundbreaking deep learning-based JSCC (deep JSCC) framework that unifies image compression and error correction through an autoencoder architecture. This holistic end-to-end design, shown in Figure 1a, surpasses conventional separate source-channel coding transmission techniques, especially in challenging wireless environments characterized by the Additive White Gaussian Noise (AWGN) channel and the Rayleigh Fading channel.

The scope of deep JSCC has been considerably expanded by recent developments that tailor its capabilities to various communication scenarios. For example, Kurka et al. [12] and Wu et al. [13] integrated feedback mechanisms into JSCC. Yang et al. [14] introduced a JSCC variant compatible with orthogonal frequency division multiplexing (OFDM), which is adept at managing multipath channel effects. Bian et al. [15] have extended JSCC’s reach to multiple-input multiple-output (MIMO) channels. Additionally, Lee et al. [16] refined image reconstruction with a pre-trained channel denoiser at the receiver. Innovative neural network architectures, such as the Transformer [7] and the recently proposed Mamba [17], have also been adapted for JSCC [18, 19, 20], promising further advancements.

Simultaneously, research has advanced techniques to enhance deep JSCC’s adaptability to varying channel conditions. Xu et al. developed ADJSCC [21], which employs a Squeeze-Excitation structure [22] to adaptively respond to signal-to-noise ratio (SNR) variations. Yang et al. [23] introduced a policy network designed to modulate transmission rates in accordance with various channel statuses and image contents. Dai et al. [24] optimized rate adjustment using a learned entropy model, and Zhang et al. [25] leveraged a predictive oracle network for informed rate selection.

Although these prior innovations mark significant progress, their focus on minimizing image distortion through metrics such as Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) does not necessarily translate to maintaining perceptual quality, particularly under bandwidth constraints or adverse channel conditions where images may lose texture and structural clarity. Image compression and restoration studies have long faced the issue of blurry reconstructions. Blau et al. [26] highlighted the perception-distortion trade-off, revealing that optimizing rate-distortion criteria alone does not guarantee the preservation of natural image characteristics. To close this gap, integrating adversarial loss into generative models has become prevalent, substantially improving the visual quality of compressed images [27, 28]. With the advent of diffusion models [29] that excel in producing realistic textures in image generation, various diffusion models are now being applied to enhance the quality of reconstructed images even under high compression scenarios, promising a new frontier in image compression [30, 31].

The application of generative deep learning models has recently been explored in the domain of deep JSCC. Building on prior research, Yang et al. [32] and Wang et al. [33] incorporated adversarial loss using PatchGAN [34] into deep JSCC frameworks (shown in Figure 1.b), aiming to improve visual quality. Despite their efforts, the performance was hindered by the difficulties inherent in training effective Generative Adversarial Networks (GANs) from scratch. In a different approach, Ecenaz et al. [35] leveraged the pre-trained StyleGAN-v2 generator [36] to enhance the perceptual quality of transmitted facial images through deep JSCC (shown in Figure 1.c). However, the reliance on traditional distortion metrics such as mean squared error (MSE) during optimization tempered the gains in practice. Yilmaz et al. [37] recently pioneered the integration of deep JSCC with pre-trained pixel-space diffusion models based on the zero-shot image restoration framework DDNM [38] (as shown in Figure 1.d), achieving notable improvements in distortion and perceptual quality compared to Ecenaz et al.’s results [35]. Their approach involves initially generating an image reconstruction using a conventional JSCC autoencoder. Subsequently, this reconstructed image is utilized to guide the denoising process of the unconditional diffusion model via Range-Null space decomposition. However, since the diffusion process is regulated solely by the intermediate guidance, this method struggles to achieve high-quality image reconstructions, especially when the initial image reconstruction undergoes significant distortion under challenging channel conditions. Additionally, such a zero-shot generation process only supports images as the unimodal condition.

Figure 2: The general framework of the proposed DiffJSCC. In the transmitter, the image $x$ is encoded by a JSCC encoder to yield transmission symbols $y$. In the receiver, a preliminary image reconstruction $\hat{x}$ is generated by a JSCC decoder, which is used to produce multimodal (visual and text) conditions. The final step is the conditional latent diffusion process, which uses these multimodal conditions in the conditional denoiser to guide the image generation procedure.

Recent progress in image restoration [39] and compression [40, 30] has shifted away from solely depending on intermediate guidance for harnessing the prior knowledge in pre-trained diffusion models. Instead, these advancements explore the direct learning of conditional diffusion models through approaches like ControlNet [41] and T2I-Adapter [42], demonstrating state-of-the-art performance and surpassing prior works such as DDNM [38]. These techniques involve fine-tuning the pre-trained diffusion model with additional lightweight control modules or adapters, and considerably reduce the dependence on intermediate guidance. In addition, these fine-tuning methods offer potential support for multimodal conditions to further enhance the control over the image generation process.

Leveraging the latest breakthroughs, we introduce DiffJSCC, a new framework devised to improve the perceptual quality of wirelessly transmitted images through conditional diffusion models as depicted in Figure 1.e. Similar to Yilmaz et al.’s method [37], our approach begins with initial image reconstruction via a standard autoencoder framework. In contrast, DiffJSCC focuses on directly learning a conditional diffusion model by fine-tuning the existing pre-trained diffusion model. On one hand, the distorted initial image reconstruction is treated as an input to the conditional denoiser instead of directly guiding the diffusion process, which ensures high-quality image generation even if the initial image reconstruction is highly distorted. On the other hand, this fine-tuning approach allows for multimodal conditions beyond the reconstructed image. Within DiffJSCC, we include text and channel state information (CSI) as additional conditions alongside the visual condition. Both visual and textual conditions are derived via pre-trained neural networks, obviating the need to develop new neural architectures and simplifying the training process. To fuse these multimodal conditions, DiffJSCC utilizes a controller module similar to that in Lin et al. [39]. The benefit of incorporating supplementary textual and CSI conditions has been demonstrated through our experiments. Instead of using pixel-domain diffusion models as in [37], DiffJSCC adopts the advanced latent diffusion model known as Stable Diffusion [43]. The denoising procedure is aligned with the original denoising diffusion probabilistic model (DDPM) [29]. Since DiffJSCC directly learns a conditional diffusion model, intermediate guidance is no longer necessary. However, we demonstrate that mild intermediate guidance remains beneficial for balancing image realism with fidelity.
Empirical evaluations across diverse image datasets reveal that DiffJSCC surpasses traditional separate source-channel coding schemes and prior deep JSCC benchmarks, showing superior results in human-perceptual quality and practical task efficacy, particularly in scenarios of restricted bandwidth and challenging channel conditions.

We summarize our primary contributions as follows:

  • Novel Diffusion-based JSCC Framework: We present a novel JSCC framework that utilizes a pre-trained text-to-image Stable Diffusion model to produce realistic image reconstructions. We develop a custom control module to adapt the Stable Diffusion model for conditional generation for particular datasets. To the best of our knowledge, this is the first instance of integrating the Stable Diffusion model into deep JSCC.

  • Enhanced Multimodal Conditioning: Our framework uniquely extracts multimodal conditions to better guide the Stable Diffusion model by including visual and textual features and the channel information. We demonstrate that these diverse conditions substantially improve the performance.

  • Balancing Fidelity and Realness: We present a denoising guidance process that derives intermediate features with an initial JSCC reconstruction. This approach adeptly balances the image fidelity with perceptual realness.

  • Flexible and Versatile Design: DiffJSCC’s model-agnostic design is compatible with various advanced deep JSCC architectures for initial image reconstruction. Besides, DiffJSCC omits any additional components at the transmitter, suiting mobile and IoT applications that offload heavy computations to remote servers.

  • Improved Image Quality with Low Transmission Rates: Our comprehensive experimentation with multiple datasets confirms that our approach significantly surpasses benchmarks in perceptual quality and task performance, particularly in bandwidth-constrained and noisy channel situations. Notably, DiffJSCC can achieve realistic reconstructions of $768\times 512$ pixel images with fewer than $0.008$ symbols (channel uses) per pixel over a 1dB SNR AWGN channel.

II Proposed Method

The general architecture of our proposed DiffJSCC is illustrated in Figure 2, which contains three main components. First, we perform Initial Deep JSCC Reconstruction (Sec. II-A) using traditional autoencoder structures. Then, we perform Multimodal Condition Generation (Sec. II-B) to extract multimodal spatial and textual features from the initial JSCC reconstruction. Lastly, the multimodal features, together with the channel state information, are treated as the conditions and fed to the Conditional Latent Diffusion Process (Sec. II-C) to generate the final refined image reconstruction. More details about the three components are introduced in the following sections.

II-A Initial Deep JSCC Reconstruction

The input image is denoted as $x\in\mathbb{R}^{C\times H\times W}$, with $C$ indicating the number of color channels, and $H$ and $W$ representing the height and width, respectively. Within the transmitter, a JSCC encoder, $E_{c}$, encodes the image $x$ into a sequence of complex-valued channel input symbols $y\in\mathbb{C}^{K}$. These $K$ symbols are subsequently transmitted through a noisy communication channel, with each symbol using the channel once. The transmission rate is quantified by the channel bandwidth ratio (CBR), denoted $\rho$ and defined as:

$$\rho=\frac{K}{C\times H\times W}. \tag{1}$$
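As a quick worked instance of Eq. (1), the sketch below evaluates $\rho$ for the Kodak setting quoted in the abstract (3072 symbols for a $768\times 512$ RGB image). The helper name `bandwidth_ratio` is illustrative and is not taken from the released code.

```python
from fractions import Fraction

def bandwidth_ratio(num_symbols: int, channels: int, height: int, width: int) -> Fraction:
    """Channel bandwidth ratio rho = K / (C * H * W), as in Eq. (1)."""
    return Fraction(num_symbols, channels * height * width)

# Kodak example from the abstract: K = 3072 symbols, 3 x 512 x 768 image.
rho = bandwidth_ratio(3072, 3, 512, 768)
# Symbols per pixel (rather than per color component) is C times larger.
symbols_per_pixel = rho * 3
print(rho, float(symbols_per_pixel))  # 1/384 0.0078125
```

The result $\rho=1/384$, i.e. $1/128\approx 0.0078$ symbols per pixel, agrees with the "fewer than 0.008 symbols per pixel" figure stated in the abstract.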

To comply with power constraints, the final encoder layer normalizes the transmitted symbols $y$ to ensure an average power output of $\bar{P}$. Let $\tilde{y}$ represent the complex symbols prior to normalization; the normalization procedure is expressed mathematically as:

$$y=\sqrt{K\bar{P}}\frac{\tilde{y}}{\sqrt{\tilde{y}^{\prime}\tilde{y}}}. \tag{2}$$
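A minimal NumPy sketch of the normalization in Eq. (2) follows. It assumes that $\tilde{y}^{\prime}$ denotes the conjugate transpose, so that $\tilde{y}^{\prime}\tilde{y}$ is the total symbol energy; the function name and the random test vector are illustrative only.

```python
import numpy as np

def power_normalize(y_tilde: np.ndarray, avg_power: float = 1.0) -> np.ndarray:
    """Scale complex symbols so that mean(|y|^2) equals avg_power (Eq. 2)."""
    k = y_tilde.size
    # np.vdot conjugates its first argument, giving sum(|y_tilde|^2).
    energy = np.vdot(y_tilde, y_tilde).real
    return np.sqrt(k * avg_power) * y_tilde / np.sqrt(energy)

rng = np.random.default_rng(0)
y_tilde = rng.normal(size=8) + 1j * rng.normal(size=8)  # stand-in encoder output
y = power_normalize(y_tilde, avg_power=1.0)
```

After this step the average per-symbol power of `y` equals `avg_power` regardless of the scale of the encoder output.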

The wireless channel in our study is modeled as a conditional distribution, with the received symbols being drawn according to $\hat{y}\sim p(y|\varepsilon)$, where $\varepsilon$ encapsulates the channel parameters. In this work, we consider both the AWGN channel and the Rayleigh Fading channel, and the channel transfer function can be represented as:

$$\hat{y}=hy+n,\;\;n\sim\mathcal{CN}(0,\sigma^{2}\mathbf{I}_{K}). \tag{3}$$

Here, the noise $n\in\mathbb{C}^{K}$ is assumed to have a circularly symmetric complex Gaussian distribution, with $\sigma^{2}$ denoting the noise power, and $h\in\mathbb{C}$ denotes the channel gain. For the AWGN channel, $h$ is set to 1. For the Rayleigh Fading channel, the channel gain $h$ is a random variable that follows $h\sim\mathcal{CN}(0,H)$. The channel’s SNR, denoted as $\gamma$, is defined by the ratio $\gamma=\bar{P}\mathbb{E}[|h|^{2}]/\sigma^{2}$. In our experiments, we standardize the average power $\bar{P}$ to 1 for consistency. Following prior works [21, 44], both the transmitter and receiver are assumed to have perfect knowledge of the channel gain $h$ and SNR $\gamma$. Thus, at the receiver side, we first equalize the received symbols $\hat{y}$:

$$\tilde{y}=\frac{h^{*}}{|h|^{2}}\hat{y}, \tag{4}$$

where $h^{*}$ denotes the complex conjugate of $h$.
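The channel model of Eq. (3) and the equalizer of Eq. (4) can be sketched as below. This is an illustration under stated assumptions: the average power is 1, and the Rayleigh gain is drawn with unit variance so that $\mathbb{E}[|h|^{2}]=1$. The function names `transmit` and `equalize` are hypothetical.

```python
import numpy as np

def transmit(y, snr_db, fading=False, rng=None):
    """Simulate Eq. (3) for unit average power; returns (y_hat, h)."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: Rayleigh gain drawn as CN(0, 1), so E[|h|^2] = 1.
    h = (rng.normal() + 1j * rng.normal()) / np.sqrt(2) if fading else 1.0 + 0j
    # gamma = P * E[|h|^2] / sigma^2 with P = 1 gives sigma^2 = 1 / gamma.
    sigma2 = 1.0 / (10.0 ** (snr_db / 10.0))
    # Circularly symmetric noise: variance sigma2 / 2 per real dimension.
    n = np.sqrt(sigma2 / 2.0) * (rng.normal(size=y.shape) + 1j * rng.normal(size=y.shape))
    return h * y + n, h

def equalize(y_hat, h):
    """Eq. (4): multiply the received symbols by h* / |h|^2."""
    return np.conj(h) / np.abs(h) ** 2 * y_hat

y = np.ones(16, dtype=complex)
y_hat, h = transmit(y, snr_db=1.0, fading=True, rng=np.random.default_rng(0))
y_eq = equalize(y_hat, h)
```

Equalization removes the gain $h$ exactly but also scales the noise by $1/|h|$, which is one reason the receiver benefits from knowing both $h$ and $\gamma$.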

After that, a JSCC decoder ($D_{c}$) takes the equalized channel symbols $\tilde{y}$ as input and generates the initial reconstruction $\hat{x}\in\mathbb{R}^{C\times H\times W}$. In this paper, we employ a JSCC architecture akin to the one described in Yang et al. [23]. The encoder and decoder can both take the CSI ($\gamma$ and $h$) as input and are trained jointly. The encoding and decoding processes are expressed as:

$$y=E_{c}(x,h,\gamma),\;\;\hat{x}=D_{c}(\hat{y},h,\gamma). \tag{5}$$

The aim is to minimize the MSE between the original image $x$ and the decoder’s output $\hat{x}$, with the following loss function deployed for optimization:

$$\mathcal{L}_{JSCC}=\mathbb{E}_{x,h,n}\left[\|x-\hat{x}\|_{2}^{2}\right]. \tag{6}$$

A key aspect of our approach is its model-agnosticism, facilitating seamless integration with more advanced deep JSCC frameworks. While we demonstrate the effectiveness of DiffJSCC using one JSCC structure, the design is inherently adaptable to other more recent architectures such as SwinJSCC [19] or NTSCC [24].

Rather than directly using the received symbols $\hat{y}$ as conditioning inputs for the conditional latent diffusion model, the initial JSCC reconstruction acts as an intermediate step and exhibits three primary benefits: First, it enables separate training of the JSCC encoder and the conditional denoiser, which allows more stable training compared to a joint training approach. Second, the initial reconstruction $\hat{x}$ harnesses the pre-trained LDM encoder to capture salient spatial features, thus avoiding the need to develop an encoder from scratch as originally proposed by the ControlNet framework [41]. Last, the initial reconstruction $\hat{x}$ facilitates the extraction of supplementary textual descriptors using pre-trained image captioning models such as BLIPv2 [45] and the CLIP encoder [46], enriching the input conditions for the diffusion model.

II-B Multimodal Condition Generation

Leveraging the initial reconstruction $\hat{x}$, our framework extracts multimodal features with the help of pre-trained neural networks, eliminating the need for additional training specific to our approach. The spatial features $f_{v}$ with a size of $4\times H/8\times W/8$ are extracted by the image encoder $E_{ldm}$ from the pre-trained Stable Diffusion model:

$$f_{v}=E_{ldm}(\hat{x}). \tag{7}$$

In parallel, given that the Stable Diffusion model is inherently a text-to-image generative model, it is natural to leverage textual information as a supplementary condition. Noticing the presence of an initial image reconstruction, we propose to utilize the pre-trained image captioning model BLIPv2 [45], denoted as $E_{blip}$, to generate textual descriptions for $\hat{x}$. Following the initial captioning step, these descriptions are further encoded by the pre-trained CLIP encoder $E_{clip}$ [46] to obtain textual features $f_{t}\in\mathbb{R}^{77\times 768}$:

$$f_{t}=E_{clip}(E_{blip}(\hat{x})). \tag{8}$$

An alternative approach to leveraging textual data is by deploying image captioning models at the transmitter end and sending the generated captions along with the image. Nevertheless, since image captioning models often involve significant computational requirements, this method could substantially increase the transmitter’s processing load, making it impractical for mobile systems. It also incurs the overhead of sending text captions through the noisy channel. Within our method, despite the potential quality degradation in captions due to the distortion of initial image reconstructions, the inclusion of $f_{t}$ generated from $\hat{x}$ has been shown to be beneficial as it encapsulates content-related keywords.

The CSI, described by the channel SNR $\gamma$ and the channel gain $h$, plays a crucial role alongside the multimodal features $f_{v}$ and $f_{t}$ as it provides insight into the extent of distortion that the initial image reconstruction experiences. After integrating the CSI, the set of conditions $\bar{c}$ utilized by our conditional diffusion model is formulated as:

$$\bar{c}=\{f_{v},f_{t},h,\gamma\}. \tag{9}$$

These diverse conditions work together to strengthen the conditional diffusion process, improving the model’s ability to produce reconstructions that are both realistic and faithful to the originally transmitted content.

Figure 3: Network structure of the proposed denoiser $\epsilon_{\theta}$ in the conditional latent diffusion model. The pre-trained Stable Diffusion model is shown in blue and the fine-tuned control module is shown in green.

II-C Conditional Latent Diffusion Model

The proposed DiffJSCC framework takes the set of conditions $\bar{c}$ as the input of the conditional latent diffusion process as depicted in Fig. 2. The diffusion process starts with a random input $z_{T}$ sampled from $\mathcal{N}(0,\mathbf{I})$ and iterates the denoising process with the multimodal condition set $\bar{c}$ using the proposed denoiser (Fig. 3) until the denoised output $z_{0}$ is obtained. The final output image $\hat{x}_{diff}$ is generated using an LDM decoder on $z_{0}$: $\hat{x}_{diff}=D_{ldm}(z_{0})$.

II-C1 Preliminary of Latent Diffusion Model

Ho et al. introduced the Denoising Diffusion Probabilistic Model (DDPM), a generative approach that iteratively refines a sample from a noise distribution towards a target data distribution using a sequence of denoising operations [29]. To improve this process for efficiency and training stability, Rombach et al. proposed the Stable Diffusion method in which a variational autoencoder (VAE) initially maps images into a lower-dimensional latent space, followed by diffusion denoising steps [43]. The forward diffusion process employs a Markov chain that incrementally introduces Gaussian noise to the latents at each time step $t$, with a variance denoted by $\beta_{t}\in(0,1)$, applied to the state $z_{t-1}$ from the previous step, which is:

$$z_{t}=\sqrt{1-\beta_{t}}z_{t-1}+\sqrt{\beta_{t}}\epsilon, \tag{10}$$

where $\epsilon$ denotes the noise sampled from a standard Gaussian distribution $\mathcal{N}(0,\mathbf{I})$. Summarized over the time steps from $1$ to $t$, the forward process is represented as:

$$z_{t}=\sqrt{\bar{\alpha}_{t}}z_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon. \tag{11}$$

Here, $z_{0}=E_{ldm}(x)$ is the latent representation of the original image encoded by the VAE, $\alpha_{t}=1-\beta_{t}$, and $\bar{\alpha}_{t}=\prod_{i=1}^{t}\alpha_{i}$ is the cumulative product of $\alpha_{i}$ from the initial time step to time step $t$, reflecting the progression of the diffusion process.
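To make the link between the step-wise form (10) and the closed form (11) concrete, the sketch below tracks the signal scale and the accumulated noise variance while iterating Eq. (10), and compares them with $\sqrt{\bar{\alpha}_{T}}$ and $1-\bar{\alpha}_{T}$. The linear $\beta$ schedule with $T=1000$ is an assumption chosen for illustration, not a setting stated in the text.

```python
import numpy as np

# Assumption: a linear beta schedule with T = 1000 steps, for illustration only.
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(z0, t, eps):
    """Closed-form forward sample z_t from Eq. (11); t is 1-indexed."""
    a_bar = alpha_bars[t - 1]
    return np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps

# Iterate Eq. (10): each step scales the signal by sqrt(1 - beta_t) and
# mixes in fresh noise of variance beta_t.
scale, var = 1.0, 0.0
for beta in betas:
    scale *= np.sqrt(1.0 - beta)
    var = (1.0 - beta) * var + beta

# With zero noise, the closed form returns the scaled signal only.
z_demo = q_sample(np.ones(4), T, np.zeros(4))
```

The recursion `var = (1 - beta) * var + beta` follows because independent Gaussian noise terms add in variance, which is what allows Eq. (11) to collapse $t$ steps into one draw.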

The aim of the denoising process is to estimate the previous latent state $z_{t-1}$ from the current state $z_{t}$. According to [29], the distribution $p_{\theta}(z_{t-1}|z_{t})$ can be learned by maximizing the Evidence Lower Bound (ELBO), which is equivalent to minimizing the KL-divergence between $p_{\theta}(z_{t-1}|z_{t})$ and the posterior distribution $q(z_{t-1}|z_{t},z_{0})$. The posterior $q(z_{t-1}|z_{t},z_{0})$ can be modeled as a Gaussian $\mathcal{N}(z_{t-1};\mu_{t}(z_{t},z_{0}),\sigma_{t}^{2}\mathbf{I})$. The mean of this distribution, $\mu_{t}(z_{t},z_{0})$, can be parameterized as:

$$\begin{split}\mu_{t}(z_{t},z_{0})&=\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_{t}}{1-\bar{\alpha}_{t}}z_{0}+\frac{\sqrt{1-\beta_{t}}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}z_{t}\\ &=\frac{1}{\sqrt{\alpha_{t}}}\left(z_{t}-\epsilon\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\right).\end{split} \tag{12}$$
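The two lines of Eq. (12) are related by substituting $z_{0}=(z_{t}-\sqrt{1-\bar{\alpha}_{t}}\,\epsilon)/\sqrt{\bar{\alpha}_{t}}$ from Eq. (11). The short numerical check below confirms the equivalence for one time step; the schedule, the time step $t=500$, and the tensor shape are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 2e-2, 1000)  # assumed schedule, for illustration only
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500  # 1-indexed time step
a_t, b_t = alphas[t - 1], betas[t - 1]
ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]

z0 = rng.normal(size=(4, 8, 8))
eps = rng.normal(size=z0.shape)
zt = np.sqrt(ab_t) * z0 + np.sqrt(1.0 - ab_t) * eps  # Eq. (11)

# First line of Eq. (12): weighted combination of z_0 and z_t.
mu_a = (
    np.sqrt(ab_prev) * b_t / (1.0 - ab_t) * z0
    + np.sqrt(1.0 - b_t) * (1.0 - ab_prev) / (1.0 - ab_t) * zt
)
# Second line of Eq. (12): the same mean expressed through the noise eps.
mu_b = (zt - eps * (1.0 - a_t) / np.sqrt(1.0 - ab_t)) / np.sqrt(a_t)

print(np.max(np.abs(mu_a - mu_b)))  # difference at floating-point precision
```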

By approximating $p_{\theta}(z_{t-1}|z_{t})$ with a Gaussian distribution $\mathcal{N}(z_{t-1};\mu_{\theta}(z_{t},t),\sigma_{t}^{2}\mathbf{I})$, minimizing the KL-divergence reduces to minimizing the difference between $\mu_{t}(z_{t},z_{0})$ and $\mu_{\theta}(z_{t},t)$. Parameterizing $\mu_{\theta}(z_{t},t)$ as

$$\mu_{\theta}(z_{t},t)=\frac{1}{\sqrt{\alpha_{t}}}\left(z_{t}-\epsilon_{\theta}(z_{t},t)\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\right), \tag{13}$$

the problem is further simplified to accurately approximating the noise component $\epsilon$, which is done by a neural network denoted as $\epsilon_{\theta}$. The network is typically conditioned on a context variable $c$ that can embed additional information such as text descriptions. Consequently, the network $\epsilon_{\theta}$ is trained to minimize the discrepancy between the true noise $\epsilon$ and its estimated value, using the loss function:

$$\mathcal{L}_{ldm}=\mathbb{E}_{z_{0},c,t,\epsilon}\|\epsilon-\epsilon_{\theta}(\sqrt{\bar{\alpha}_{t}}z_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon,c,t)\|_{2}^{2}. \tag{14}$$

This training paradigm ensures that over time, the network becomes proficient at separating the noise from the corrupted data, steering the reverse diffusion towards accurate data reconstruction.
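A single-sample Monte Carlo estimate of the loss in Eq. (14) might look like the sketch below. The denoiser is passed in as an arbitrary callable `eps_theta(z_t, cond, t)`; sampling $t$ uniformly from $\{1,\dots,T\}$ is an assumption borrowed from common DDPM practice, and the function name `ldm_loss` is hypothetical.

```python
import numpy as np

def ldm_loss(eps_theta, z0, cond, alpha_bars, rng):
    """One-sample Monte Carlo estimate of the noise-prediction loss, Eq. (14)."""
    # Assumption: the time step is drawn uniformly from 1..T.
    t = int(rng.integers(1, len(alpha_bars) + 1))
    eps = rng.normal(size=z0.shape)
    a_bar = alpha_bars[t - 1]
    zt = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * eps  # Eq. (11)
    return float(np.sum((eps - eps_theta(zt, cond, t)) ** 2))

alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 2e-2, 1000))  # assumed schedule
z0 = np.random.default_rng(0).normal(size=(4, 8, 8))

# A predictor that always outputs zero leaves the full noise energy as loss.
zero_predictor = lambda zt, cond, t: np.zeros_like(zt)
loss_zero = ldm_loss(zero_predictor, z0, None, alpha_bars, np.random.default_rng(1))

# An oracle that knows z0 can invert Eq. (11) and recover the noise exactly.
oracle = lambda zt, cond, t: (zt - np.sqrt(alpha_bars[t - 1]) * z0) / np.sqrt(1.0 - alpha_bars[t - 1])
loss_oracle = ldm_loss(oracle, z0, None, alpha_bars, np.random.default_rng(1))
```

The oracle case illustrates why predicting $\epsilon$ is equivalent to predicting $z_{0}$: given $z_{t}$, each determines the other through Eq. (11).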

II-C2 The Denoiser Structure

The structure of the proposed denoiser, $\epsilon_{\theta}$, is illustrated in Figure 3. The UNet architecture [47] of the Stable Diffusion denoiser is depicted in blue and remains unchanged during training. This architecture comprises four encoder blocks, four decoder blocks, and one middle block. Each encoder/decoder block includes two main blocks and one convolutional block for scaling operations. The middle block contains two main blocks. Each main block consists of 4 ResNets and 2 Vision Transformers (ViTs). The ViTs integrate self-attention and cross-attention mechanisms to facilitate interaction with external prompts, such as text encoded by the CLIP text encoder.

The control module highlighted in green in Figure 3 is a pivotal component of the proposed denoiser architecture. Following the ControlNet structure [41], this module adopts the UNet encoder and middle block of the Stable Diffusion denoiser and is initialized with the pre-trained weights. The control module comprises four encoder blocks in addition to the middle block to handle four distinct resolutions ($64\times 64$, $32\times 32$, $16\times 16$, and $8\times 8$), capturing a wide range of feature abstraction scales. The outputs from these blocks are fed into the four decoder blocks and the middle block of the Stable Diffusion UNet. Before being fed into the UNet, the control module’s outputs go through a series of $1\times 1$ convolutional layers (‘zero conv’ layers in Figure 3).

It is worth noting that the pre-trained Stable Diffusion UNet (in blue in Figure 3) in the denoiser is adopted without retraining, whereas the control module (in green) is trained for the proposed control module input. We obtain the control module input by concatenating the latent representation $z_{t}$ with the spatial features $f_{v}$ from the LDM encoder $E_{ldm}$ as in [39]. This concatenation enhances the input’s representational capacity, but it extends the channel depth of the control module’s input, necessitating additional parameters beyond those present in the pre-trained model. To handle this extension, the extra parameters are set to zero at initialization to maintain a neutral effect at the start of the training process.
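The effect of zero-initializing the extra input parameters can be illustrated with a toy layer. The sketch assumes a $1\times 1$ convolution and 6 output channels purely for brevity (the actual first layer of the control module may differ); it shows that widening the input from 4 to 8 channels with zero weights leaves the output identical to that of the pre-trained layer, whatever $f_{v}$ contains.

```python
import numpy as np

rng = np.random.default_rng(0)
c_lat, c_out, height, width = 4, 6, 8, 8  # illustrative sizes

# Stand-in for a pre-trained first-layer weight (1x1 convolution for brevity).
w_pre = rng.normal(size=(c_out, c_lat))
# Widen the input from 4 to 8 channels; the new columns start at zero.
w_ext = np.concatenate([w_pre, np.zeros((c_out, c_lat))], axis=1)

def conv1x1(weight, x):
    """Apply a 1x1 convolution: mix channels independently at each pixel."""
    return np.einsum("oc,chw->ohw", weight, x)

z_t = rng.normal(size=(c_lat, height, width))
f_v = rng.normal(size=(c_lat, height, width))

out_pre = conv1x1(w_pre, z_t)
out_ext = conv1x1(w_ext, np.concatenate([z_t, f_v], axis=0))
```

Because `out_ext` equals `out_pre` at initialization, training starts from the behavior of the pre-trained encoder and learns to use $f_{v}$ gradually.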

To encode the CSI, we first apply a sinusoidal encoding to the channel SNR $\gamma$, under the assumption that it lies within a fixed range. We then concatenate it with the real and imaginary parts of the channel gain $h$ (characteristic of a Rayleigh Fading channel) and pass the result through a multi-layer perceptron (MLP) to produce the CSI embeddings. These embeddings share the same dimensionality as the time embeddings. In the control module, we integrate the CSI embeddings with the time embeddings by addition to inject the channel information. Furthermore, textual features $f_{t}$ serve as crucial prompts infusing contextual information into both the pre-trained UNet and the control module. As shown in Figure 3, textual features $f_{t}$ are used as text prompts and are passed to the cross-attention layers of ViTs within each encoder/middle/decoder block as keys and values.
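The CSI embedding path described above might be sketched as follows. Every concrete number here is an assumption for illustration: the SNR range of 0 to 20 dB, the 64-dimensional sinusoidal code, the two-layer MLP with SiLU activation, and the embedding width of 1280. The weights are random placeholders rather than trained parameters.

```python
import numpy as np

def sinusoidal_encoding(value, dim=64, max_period=10000.0):
    """Transformer-style sinusoidal features for a scalar."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def csi_embedding(snr_db, h, params, snr_range=(0.0, 20.0)):
    """Map (SNR, channel gain) to a vector with the time-embedding width."""
    lo, hi = snr_range
    # Assumption: rescale the SNR from its fixed range to [0, 1000].
    scaled = 1000.0 * (snr_db - lo) / (hi - lo)
    feats = np.concatenate([sinusoidal_encoding(scaled), [h.real, h.imag]])
    hidden = feats @ params["w1"] + params["b1"]
    hidden = hidden / (1.0 + np.exp(-hidden))  # SiLU: x * sigmoid(x)
    return hidden @ params["w2"] + params["b2"]

rng = np.random.default_rng(0)
emb_dim = 1280  # assumed time-embedding width
params = {
    "w1": rng.normal(scale=0.02, size=(66, 256)), "b1": np.zeros(256),
    "w2": rng.normal(scale=0.02, size=(256, emb_dim)), "b2": np.zeros(emb_dim),
}

csi = csi_embedding(10.0, 0.6 - 0.8j, params)
time_emb = rng.normal(size=emb_dim)  # placeholder for the real time embedding
combined = time_emb + csi            # injection by addition
```

Adding the CSI embedding to the time embedding means every block that already consumes the time embedding also sees the channel state without any architectural change.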

Similar to the LDM loss (14), our approach focuses on reducing the Mean Square Error (MSE) of the noise prediction while adhering to the condition set $\bar{c}$ given in equation (9). Our loss function is thus represented by:

$$\mathcal{L}_{diff}=\mathbb{E}_{z_{0},\bar{c},t,\epsilon}\|\epsilon-\epsilon_{\theta}(\sqrt{\bar{\alpha}_{t}}z_{0}+\sqrt{1-\bar{\alpha}_{t}}\epsilon,\bar{c},t)\|_{2}^{2}. \tag{15}$$

This specially designed denoiser fine-tuned with a variety of conditioning variables is the key component in our framework to enhance the resilience of image reconstruction.

Algorithm 1 Latent diffusion denoising process with intermediate latent guidance
  Input: Time steps $T$, spatial features $f_{v}$, textual features $f_{t}$, channel SNR $\gamma$, channel gain $h$, scale factor $\lambda$, conditional denoiser $\epsilon_{\theta}$
  Output: Generated image $\hat{x}_{diff}$
  $\bar{c}\leftarrow\{f_{v},f_{t},h,\gamma\}$
  Sample $z_{T}$ from $\mathcal{N}(0,\mathbf{I})$
  for $t$ from $T$ to $1$
       $\tilde{z}_{0}\leftarrow\frac{z_{t}-\sqrt{1-\bar{\alpha}_{t}}\epsilon_{\theta}(z_{t},\bar{c},t)}{\sqrt{\bar{\alpha}_{t}}}$
       $\hat{z}_{0}\leftarrow\tilde{z}_{0}-\frac{\lambda}{C_{l}H_{l}W_{l}}(\tilde{z}_{0}-f_{v})$
       $\mu(z_{t},\hat{z}_{0})\leftarrow\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_{t}}{1-\bar{\alpha}_{t}}\hat{z}_{0}+\frac{\sqrt{1-\beta_{t}}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}z_{t}$
       Sample $z_{t-1}$ from $\mathcal{N}(\mu(z_{t},\hat{z}_{0}),\sigma_{t}^{2}\mathbf{I})$
  end for
  $\hat{x}_{diff}\leftarrow D_{ldm}(z_{0})$
  return $\hat{x}_{diff}$

II-C3 Adding Intermediate Guidance

Though the proposed DiffJSCC does not inherently rely on intermediate guidance throughout the diffusion process, as articulated by Lin et al. [39], diffusion models strive to carefully navigate the delicate trade-off between the realism of the generated images and their fidelity to the original content. It is worth mentioning that these models might unintentionally introduce mismatched elements that may look realistic but different from the actual target. This issue is more prevalent when there is a significant difference between the target and training domains. In scenarios where it is crucial to ensure that reconstructed images faithfully reflect the original content, we incorporate an optional guidance mechanism that leverages the initial JSCC reconstruction x^\hat{x} to direct the intermediate latent state z~0\tilde{z}_{0} during the denoising process. Prior to applying guidance, the intermediate state z~0\tilde{z}_{0} at a particular time tt can be determined equation (11):

$$\tilde{z}_{0}=\frac{z_{t}-\sqrt{1-\bar{\alpha}_{t}}\,\epsilon_{\theta}(z_{t},\bar{c},t)}{\sqrt{\bar{\alpha}_{t}}}$$ (16)

Having estimated $\tilde{z}_{0}$, the subsequent latent state $z_{t-1}$ is inferred according to equation (12). To encourage closer correspondence between the denoised image and the initial JSCC reconstruction, $\tilde{z}_{0}$ is nudged toward the spatial features $f_{v}$, yielding a revised estimate $\hat{z}_{0}$:

$$\hat{z}_{0}=\tilde{z}_{0}-\frac{\lambda}{C_{l}H_{l}W_{l}}(\tilde{z}_{0}-f_{v}),$$ (17)

where $C_{l}$, $H_{l}$, and $W_{l}$ are the dimensions of the latent space, and $\lambda$ is a scalar that regulates the level of alignment between $\hat{z}_{0}$ and $f_{v}$. The posterior distribution $q(z_{t-1}|z_{t},\hat{z}_{0})$ then determines the sampled state $z_{t-1}$, as given by equation (12) with $z_{0}$ replaced by $\hat{z}_{0}$. This entire process is captured in Algorithm 1. The intermediate guidance is optional and can be disabled by setting $\lambda$ to zero, which recovers the standard diffusion sampling process.
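One way to read equation (17), offered here as an interpretation rather than a statement from the original derivation, is as a single gradient-descent step of size $\lambda$ on a normalized squared error between $\tilde{z}_{0}$ and $f_{v}$. Equivalently, it is a linear interpolation between the two. The loss $\mathcal{L}$ and the weight $\kappa$ below are symbols introduced only for this illustration.

```latex
\begin{align}
\mathcal{L}(\tilde{z}_{0}) &= \frac{1}{2\,C_{l}H_{l}W_{l}}\,\lVert \tilde{z}_{0}-f_{v}\rVert_{2}^{2},
\qquad
\nabla_{\tilde{z}_{0}}\mathcal{L} = \frac{1}{C_{l}H_{l}W_{l}}\,(\tilde{z}_{0}-f_{v}), \\
\hat{z}_{0} &= \tilde{z}_{0}-\lambda\,\nabla_{\tilde{z}_{0}}\mathcal{L}
 = (1-\kappa)\,\tilde{z}_{0}+\kappa\,f_{v},
\qquad
\kappa = \frac{\lambda}{C_{l}H_{l}W_{l}}.
\end{align}
```

Under this reading, $\lambda=0$ gives $\kappa=0$ and $\hat{z}_{0}=\tilde{z}_{0}$. As a rough illustration, assuming a $4\times 64\times 64$ latent (the usual Stable Diffusion latent size for a $512\times 512$ input), $\lambda=100$ corresponds to $\kappa\approx 0.006$ per denoising step, so the pull toward $f_{v}$ is gentle at each step and accumulates over the sampling trajectory.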

From the denoised $z_{0}$ (or $\hat{z}_{0}$), the final output image $\hat{x}_{diff}$ is generated using the LDM decoder: $\hat{x}_{diff}=D_{ldm}(z_{0})$.
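The sampling loop of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `eps_model` is a placeholder for the conditional denoiser $\epsilon_{\theta}$, `f_v` is assumed to have the same shape as the latent, the variance $\sigma_{t}$ is assumed to be the standard DDPM posterior variance, and the LDM decoder is omitted.

```python
import numpy as np

def guided_ddpm_sample(eps_model, f_v, cond, betas, lam=0.0, rng=None):
    """DDPM ancestral sampling with optional intermediate guidance.

    eps_model(z_t, cond, t): placeholder denoiser returning predicted noise.
    f_v: spatial features, assumed to share the latent shape (C_l, H_l, W_l).
    betas: 1-D array holding the noise schedule beta_1..beta_T.
    lam: guidance scale lambda; 0 disables guidance.
    """
    rng = np.random.default_rng() if rng is None else rng
    # alpha_bar[0] = 1 so that alpha_bar[t] is the cumulative product up to step t.
    alpha_bar = np.concatenate(([1.0], np.cumprod(1.0 - betas)))
    z = rng.standard_normal(f_v.shape)  # z_T ~ N(0, I)
    for t in range(len(betas), 0, -1):
        beta_t, ab_t, ab_prev = betas[t - 1], alpha_bar[t], alpha_bar[t - 1]
        eps = eps_model(z, cond, t)
        # Estimate of the clean latent from the current noisy latent.
        z0_tilde = (z - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
        # Guidance: move the estimate toward f_v by lambda / (C_l * H_l * W_l).
        z0_hat = z0_tilde - lam / f_v.size * (z0_tilde - f_v)
        # Posterior mean with the guided estimate in place of z_0.
        mean = (np.sqrt(ab_prev) * beta_t / (1.0 - ab_t)) * z0_hat \
            + (np.sqrt(1.0 - beta_t) * (1.0 - ab_prev) / (1.0 - ab_t)) * z
        # Assumed variance: the DDPM posterior variance (zero at t = 1).
        var = beta_t * (1.0 - ab_prev) / (1.0 - ab_t)
        z = mean + np.sqrt(var) * rng.standard_normal(z.shape)
    return z  # z_0, to be passed to the LDM decoder
```

With `lam=0` the loop reduces to plain ancestral sampling. With this variance choice the final step is deterministic and returns the guided estimate $\hat{z}_{0}$ directly.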

III Training and Evaluation

Figure 4: Evaluation of the proposed approach versus baselines across different transmission rates under 1dB (top) and 10dB SNR (bottom) on the Kodak dataset under the AWGN channel. Lower LPIPS/FID scores indicate higher perceptual quality.
Figure 5: Evaluation of the proposed approach versus baselines across different SNR levels with transmission rates $\rho$ of $1/384$ (top) and $1/96$ (bottom) using the Kodak dataset under the AWGN channel. Lower LPIPS/FID scores indicate higher perceptual quality.
Figure 6: Evaluation of the proposed approach versus baselines across different SNR levels with transmission rates $\rho$ of $1/192$ (top) and $1/48$ (bottom) using the Kodak dataset under the Rayleigh fading channel. Lower LPIPS/FID scores indicate higher perceptual quality.
Figure 7: Visual comparison between DiffJSCC and baselines on the Kodak dataset at 1dB SNR under the AWGN channel. For DiffJSCC, we show both the final reconstructed images and the captions generated from the initial reconstruction. The keywords are marked in red. Zoom in for better perception.

III-A Datasets and Implementation

III-A1 Dataset

Our evaluation of the DiffJSCC framework covers two principal applications: general image transmission and specialized facial image transmission. In the general image transmission scenario, DiffJSCC is fine-tuned on a selected dataset of approximately 300,000 high-resolution images from the OpenImages collection (https://storage.googleapis.com/openimages/web/index.html). During training, these images are randomly cropped to $256\times 256$ pixels. For performance evaluation, we use the standard Kodak dataset, which consists of 24 curated high-fidelity images with dimensions of $768\times 512$ pixels. In addition, the evaluation is extended to the ADE20K validation set of 2,000 images to assess image reconstruction quality and semantic segmentation accuracy.

In the specialized context of facial image transmission, DiffJSCC is trained on the CelebAHQ dataset. We use a training set of 27,000 images and a separate test set of 3,000 images, where all images are resized to a standardized resolution of $512\times 512$ pixels for consistent evaluation.

III-A2 Implementation

Within the proposed framework, we use an SNR-adaptive JSCC architecture for both the encoder and decoder, as detailed by Yang et al. [23]. This architecture incorporates the CSI into the feature processing pipeline via a Squeeze-and-Excitation mechanism [22], enhancing adaptability to varying SNR conditions. The transmission rate $\rho$ is fixed for each model configuration, and the policy network introduced in [23] is not used. Within this network structure, $\rho$ is determined by the output channel count ($C_{out}$) and the downsampling factor ($D$), satisfying $\rho=C_{out}/(3\times 2^{2D+1})$. Training for general image transmission uses random $256\times 256$ patches and a batch size of 32, while training on the facial image dataset uses full $512\times 512$ CelebAHQ images with a smaller batch size of 16. We use the Adam optimizer with an initial learning rate of $10^{-3}$, decayed over 100,000 iterations following a cosine schedule.
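The rate formula can be checked with a few lines of Python. The sketch below assumes that $\rho$ counts complex channel symbols per source value of an RGB image (that is, $\rho = k/(3HW)$), which is consistent with the 3072 symbols quoted in the abstract for a $768\times 512$ Kodak image. The values $D=4$ and $C_{out}=4$ are hypothetical and chosen only because they reproduce $\rho=1/384$.

```python
from fractions import Fraction

def transmission_rate(c_out: int, d: int) -> Fraction:
    """rho = C_out / (3 * 2^(2D+1)), the relation given in the text."""
    return Fraction(c_out, 3 * 2 ** (2 * d + 1))

def num_channel_symbols(height: int, width: int, rho: Fraction) -> int:
    """Complex channel symbols for an RGB image, assuming rho = k / (3 * H * W)."""
    return int(3 * height * width * rho)

# Hypothetical configuration: D = 4 downsampling stages, C_out = 4 channels.
rho = transmission_rate(c_out=4, d=4)          # Fraction(1, 384)
symbols = num_channel_symbols(512, 768, rho)   # 3072 symbols for a Kodak image
```

Under the same assumption, doubling $C_{out}$ doubles $\rho$, which is one way the four rates $1/384$, $1/192$, $1/96$, and $1/48$ could be obtained from a single downsampling depth.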

Our control module is based on Stable Diffusion 2.1-base, and we adapt it to images resized to $512\times 512$ pixels. It is trained with a batch size of 16 for 25,000 iterations at a reduced learning rate of $10^{-4}$. Note that the JSCC components remain frozen (with fixed parameters) during the control module’s training phase. The SNR $\gamma$ is uniformly sampled from $[0,14]$ dB for general images and $[-5,5]$ dB for facial images, ensuring adaptability to SNR fluctuations.

We used 2 NVIDIA A40 GPUs to accelerate training. Although the Stable Diffusion model is trained with 1000 time steps, we use spaced DDPM sampling with 50 steps during inference, a common strategy [43, 39] for improving sampling efficiency. Unless specified otherwise, the scaling factor $\lambda$ is set to zero, which is the baseline configuration without the intermediate guidance introduced in Section II-C3.
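As an illustration of spaced sampling, the sketch below selects 50 evenly spaced timesteps out of the 1000 training steps. The even spacing and the 0-based indexing are assumptions made for this example; the exact spacing rule used in the experiments is not specified here.

```python
def spaced_timesteps(num_train_steps=1000, num_sample_steps=50):
    """Return evenly spaced 0-based timesteps, ordered from noisiest to cleanest."""
    stride = num_train_steps / num_sample_steps
    # A set removes accidental duplicates when the stride is not an integer.
    steps = {int(round(i * stride)) for i in range(num_sample_steps)}
    return sorted(steps, reverse=True)

timesteps = spaced_timesteps()  # starts at 980 and ends at 0, in strides of 20
```

When sampling on such a subsequence, the noise schedule is typically recomputed so that the cumulative products $\bar{\alpha}_{t}$ at the retained steps match those of the original 1000-step schedule.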

III-B Metrics

For evaluation, we adopt standard distortion metrics, PSNR and MS-SSIM [48], to measure image fidelity in terms of pixel intensity and structural details, respectively. To evaluate the perceptual quality of reconstructed images, we employ the more recently proposed Learned Perceptual Image Patch Similarity (LPIPS) [49] and Fréchet Inception Distance (FID) [50] metrics. These metrics measure visual similarity and how well the distribution of generated images aligns with that of real images. Since the Kodak dataset contains only 24 images, we transmit each image five times to measure our performance metrics reliably. For Kodak images, the FID score is calculated over half-overlapping patches of $256\times 256$ pixels. Our evaluation also includes semantic segmentation performance on the generated and original images, measured by the mean Intersection over Union (mIoU) metric.
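The patch extraction used for the Kodak FID computation can be sketched as below. This assumes that "half-overlapping" means a stride of half the patch size in both directions; the function name is illustrative and is not taken from any released code.

```python
import numpy as np

def half_overlapping_patches(image, patch=256):
    """Split an H x W x C image into patches with a stride of patch // 2."""
    stride = patch // 2
    h, w = image.shape[:2]
    patches = [
        image[y:y + patch, x:x + patch]
        for y in range(0, h - patch + 1, stride)
        for x in range(0, w - patch + 1, stride)
    ]
    return np.stack(patches)

# A 768 x 512 Kodak image (stored as 512 rows by 768 columns) yields 3 x 5 = 15 patches.
kodak_like = np.zeros((512, 768, 3), dtype=np.uint8)
patches = half_overlapping_patches(kodak_like)
```

Splitting each image this way gives the FID estimator many more samples than the 24 full-resolution images alone would provide.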

III-C Performance on General Image Transmission

Figure 4 compares our DiffJSCC framework against traditional separate source-channel coding schemes as well as the foundational DeepJSCC approach, which is also used for DiffJSCC’s initial image reconstruction. The evaluation covers various transmission rates with the SNR fixed at 1dB or 10dB. The benchmark separate coding scheme uses the BPG codec for image compression paired with LDPC codes conforming to the IEEE 802.11n WiFi standard with a specific block length. In the 1dB SNR scenario, the LDPC code adopts a $2/3$ coding rate with BPSK modulation, while in the 10dB SNR case the coding rate is $1/2$ with 16-QAM modulation. To adjust the transmission rate, we vary BPG’s quantization parameter over the range from 51 to 0. ‘BPG-Capacity’ is the upper bound of the separate source-channel coding scheme (i.e., LDPC is replaced by ideal channel coding), where channel coding is assumed error-free whenever the rate is below the channel capacity $C=\log_{2}(1+\gamma)$.
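To put the two operating points in context, the following sketch evaluates the capacity formula $C=\log_{2}(1+\gamma)$ at the two SNRs and compares it with the information rate of each LDPC and modulation pair. It assumes $\gamma$ is the linear-scale SNR and that rates are counted in bits per channel symbol; the function names are illustrative.

```python
import math

def awgn_capacity_bits(snr_db: float) -> float:
    """Capacity C = log2(1 + gamma) in bits per channel use, gamma in linear scale."""
    gamma = 10 ** (snr_db / 10)
    return math.log2(1 + gamma)

def coded_bits_per_symbol(code_rate: float, bits_per_constellation_point: int) -> float:
    """Information bits carried per channel symbol by a coded modulation scheme."""
    return code_rate * bits_per_constellation_point

cap_1db = awgn_capacity_bits(1.0)             # about 1.18 bits per channel use
cap_10db = awgn_capacity_bits(10.0)           # log2(11), about 3.46 bits per channel use
ldpc_1db = coded_bits_per_symbol(2 / 3, 1)    # rate-2/3 LDPC with BPSK
ldpc_10db = coded_bits_per_symbol(1 / 2, 4)   # rate-1/2 LDPC with 16-QAM
```

Both practical configurations operate below the corresponding capacity, which is why the capacity-based variant serves as an upper bound for the separate coding scheme.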

The DeepJSCC and DiffJSCC methods support four distinct rates: $\rho=1/384$, $1/192$, $1/96$, and $1/48$. Notably, the basic DeepJSCC method surpasses traditional separate coding methods across multiple metrics, consistent with previous studies. Although the diffusion-based refinement in DiffJSCC may degrade traditional distortion metrics such as PSNR and MS-SSIM, as also observed in [30, 39], DiffJSCC achieves remarkable improvements in visual quality as assessed by LPIPS and FID. These metrics capture the capability of DiffJSCC to produce more photorealistic reconstructions even at compression levels unachievable by the other schemes (including the theoretical ‘BPG-Capacity’ bound).

Notably, at 1dB SNR, DiffJSCC attains a low FID score of $117.9$ at a transmission rate of merely $\rho=1/384\approx 0.0026$, outperforming the ‘BPG-Capacity’ baseline’s FID score of $165.4$ at a substantially higher rate of $\rho=0.0377$. In other words, DiffJSCC achieves a $28.7\%$ lower FID while using $14\times$ fewer symbols. When the SNR is increased to 10dB, DiffJSCC continues to hold a significant advantage over the other methods, although the performance gap in terms of LPIPS and FID is reduced. At a rate of $\rho=0.0026$, the FID score of DiffJSCC shows a $64.3\%$ improvement over DeepJSCC and a $70.7\%$ improvement over ‘BPG-Capacity’. This confirms the effectiveness of the DiffJSCC framework in reconstructing highly realistic images under stringent rate and channel constraints.
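The two headline figures for the 1dB case follow directly from the reported numbers, as the short check below shows. It assumes the improvement is computed as the relative reduction in FID with respect to the capacity-based separate coding baseline.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Fractional reduction of a lower-is-better score relative to a baseline."""
    return (baseline - ours) / baseline

fid_gain = relative_reduction(baseline=165.4, ours=117.9)  # about 0.287, i.e. 28.7%
rate_ratio = 0.0377 / (1 / 384)                            # about 14.5 times more symbols
```

The rate ratio comes out slightly above 14, so the quoted factor of 14 is a rounded-down figure.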

In Figures 5 and 6, we compare DiffJSCC with the baseline approaches across various SNR conditions for the AWGN channel and the Rayleigh fading channel, respectively. The DeepJSCC-$adv$ model from Yang et al. [32] is included for comparison; it uses the hyperparameter $\alpha$ to adjust the weight of the adversarial loss. Various $\alpha$ values were tested to explore the trade-off between distortion and visual quality. Similarly, we experimented with different $\lambda$ values in DiffJSCC with the same goal. These evaluations were conducted at transmission rates of $\rho=1/384$ and $\rho=1/96$, with SNR values set at $\gamma=1,4,7,10,13$ dB. All benchmarks, including ours, are based on the same deep JSCC network architecture. The results in Figure 5 demonstrate that although the DeepJSCC-$adv$ baseline successfully lowers LPIPS, it fails to substantially reduce FID scores regardless of the $\alpha$ value used: the FID score remained above $190$ at $\rho=1/384$ and above $110$ at $\rho=1/96$. Conversely, DiffJSCC significantly improves both LPIPS and FID scores over the entire range of SNR values, while maintaining similar PSNR and MS-SSIM distortion metrics. At $\lambda=0$ and $\rho=1/384$, DiffJSCC shows an average improvement of $41.4\%$ in LPIPS and $56.7\%$ in FID compared to the standard DeepJSCC. Compared with DeepJSCC-$adv$ using $\alpha=5\times 10^{-3}$, it achieves $21.1\%$ better LPIPS and $52.2\%$ better FID. At $\rho=1/96$, DiffJSCC again shows a significant gain, improving average LPIPS and FID scores by $43\%$ and $53.2\%$ over DeepJSCC, respectively, and by $18.9\%$ and $48.4\%$ over DeepJSCC-$adv$. Increasing the scale factor $\lambda$ from 0 to 100 gradually aligns the generated image with the initial JSCC reconstruction, enhancing PSNR and MS-SSIM but reducing perceptual quality. However, even with this trade-off, DiffJSCC still achieves better perceptual metrics (LPIPS and FID) than the baselines. This significant improvement in perceived visual quality highlights the effectiveness of our fine-tuned control module combined with the Stable Diffusion model. A similar trend can be observed under the Rayleigh fading channel, as shown in Figure 6.

Figure 7 provides visual examples of the superior perceptual quality achieved by DiffJSCC at a challenging SNR of 1dB under the AWGN channel. At a transmission rate of $\rho=1/384$, the reconstructions produced by the conventional DeepJSCC framework exhibit significant distortion and fail to recover important image details. Conversely, DiffJSCC restores crucial elements of the image, such as the structures of the lighthouse and house, offering significantly enhanced perceptual quality despite slight deviations from the original. With the transmission rate raised to $1/96$, DiffJSCC’s capabilities become more evident, resulting in reconstructions that accurately reflect the original images. Furthermore, we show the captions generated by the BLIPv2 model from the initial JSCC reconstructions. These captions prove to be robust against variations in reconstruction quality, consistently capturing the important content keywords of each image (highlighted in red). This confirms the ability of DiffJSCC to convey essential semantic information even at reduced transmission rates.

Figure 8: Results on the ADE20K dataset with $\rho=1/384$ and $\rho=1/96$. Higher mIoU scores indicate better segmentation.
Figure 9: Visualizations on ADE20K with $\rho=1/384$ and 1dB SNR under the AWGN channel.
Figure 10: Comparison between the proposed method and baselines at various SNRs on CelebAHQ under the AWGN channel.
Figure 11: Visualizations on CelebAHQ with $\rho=1/768$ in both high and low SNR scenarios under the AWGN channel.

In Figure 8, we assess the DiffJSCC framework on the ADE20K dataset at transmission rates $\rho=1/384$ and $\rho=1/96$ under the AWGN channel, examining FID scores and mIoU for semantic segmentation performance. For mIoU evaluation, semantic masks are inferred from the reconstructed images using a pre-trained ViT-Adapter [51]. The results indicate that while the DeepJSCC-$adv$ approach can effectively reduce FID scores, it does so at the cost of decreased mIoU, which limits its practicality for subsequent image processing tasks. In contrast, DiffJSCC notably improves both FID and mIoU. At $\rho=1/384$, it achieves an average improvement (reduction) in FID of $69.2\%$ and a $52.5\%$ increase in mIoU over DeepJSCC, along with improvements of $75.6\%$ in FID and $161\%$ in mIoU over DeepJSCC-$adv$. At the higher transmission rate $\rho=1/96$, DiffJSCC remains superior, with a mean improvement of $42.9\%$ in FID and $8.4\%$ in mIoU over DeepJSCC, and gains of $33.3\%$ in FID and $19.1\%$ in mIoU over DeepJSCC-$adv$. As the SNR increases, the mIoU obtained by DiffJSCC approaches that of the original undistorted images. Figure 9 presents a visual example from the ADE20K dataset. With DeepJSCC, elements such as grass and buildings are somewhat recognizable, yet roads and lamps appear unclear. The reconstructions obtained from DeepJSCC-$adv$ significantly lack authenticity, and the ViT-Adapter fails to generate accurate segmentation masks. In contrast, images produced by DiffJSCC have higher visual quality and yield more accurate segmentation labels in most regions, highlighting the framework’s applicability to practical JSCC scenarios.
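For reference, the mIoU metric can be sketched for a single pair of label maps as follows. This is a simplified per-image version written for illustration; segmentation benchmarks usually accumulate intersections and unions over the whole dataset before averaging, and this is not the evaluation code used for the reported numbers.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union between two integer label maps."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps: skip it
        ious.append(np.logical_and(pred_c, target_c).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)  # (1/2 + 2/3) / 2 = 7/12
```

In the setting described above, `pred` would be the mask inferred from a reconstructed image and `target` the ground-truth annotation.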

III-D Performance on Facial Image Transmission

Figure 10 compares our DiffJSCC framework with DeepJSCC and other generative JSCC methods, including InverseJSCC [35] and the approach presented by Yilmaz et al. [37], on the CelebAHQ dataset. For DiffJSCC, we show results with $\lambda$ values of 0, 10, and 50. The results for InverseJSCC and Yilmaz et al. are taken from their original papers. We evaluate only the PSNR and LPIPS metrics since MS-SSIM and FID evaluations were not conducted in those works. Overall, DiffJSCC shows superior performance, surpassing InverseJSCC in both PSNR and LPIPS. Compared with the Yilmaz et al. method, DiffJSCC has a slightly lower PSNR but achieves better LPIPS scores. Unlike the Kodak results, where $\lambda=0$ provided the best LPIPS, this configuration introduces artifacts on facial images because of the significant domain gap between the Stable Diffusion training dataset and the CelebAHQ dataset. However, with direct guidance from the initial JSCC reconstructions ($\lambda=50$), these artifacts are effectively mitigated, resulting in improved LPIPS performance. Quantitatively, at $\lambda=50$, DiffJSCC achieves an average LPIPS improvement of $15.9\%$ and $28.6\%$ over InverseJSCC for $\rho=1/768$ and $\rho=1/192$, respectively, and an improvement of $10.3\%$ and $9.4\%$ for $\rho=1/768$ and $\rho=1/192$ over the Yilmaz et al. method. Figure 11 visualizes examples from the CelebAHQ dataset, highlighting DiffJSCC’s ability to reconstruct high-quality images in extremely challenging conditions such as a very low transmission rate ($\rho=1/768$) and a low SNR ($-5$dB). As the SNR increases, the quality of the reconstructions improves, and they more closely resemble the original images.

III-E Ablation Studies

III-E1 The effect of different conditions

In this analysis, we investigate the impact of various conditional signals on the control module’s performance on the Kodak dataset under the AWGN channel, as shown in Figure 12 (right), emphasizing the importance of including multimodal information during the image restoration process. We observe that relying exclusively on spatial features extracted from the initial JSCC reconstruction offers limited effectiveness, whereas augmenting the model with channel SNR information yields an average FID improvement of $2\%$, a benefit that becomes more pronounced at lower SNR levels. Incorporating captions as textual prompts into the denoising framework improves FID by an additional $4.3\%$. Figure 12 (left) shows that despite a natural decline in caption quality, measured by the CLIP score [46], as SNR and transmission rate decrease, the presence of image keywords ensures sustained improvements in FID, as shown in Figure 12 (right). This outcome illustrates the resilience and utility of textual cues in the denoising process. The benefit is further demonstrated by the marked performance increase when using captions derived from undistorted images, as indicated by the orange line in Figure 12 (right).

Figure 12: Left: Analysis of the quality of the caption generated by BLIPv2 on the initial JSCC reconstruction. Right: Ablation experiments under various conditions for the control module.
Figure 13: Example of generated image captions and their effect on the final image reconstruction.
Figure 14: Effect of DDPM sampling steps with $\rho=1/384$.

Figure 13 illustrates the influence of textual cues. Although the initial JSCC reconstruction is affected by blur and distortion, the BLIPv2 model employed in our framework can generate coherent text prompts that capture the main context of the image. Without textual cues (Fig. 13 bottom left), our approach improves the general image quality but struggles to accurately recreate the parrot on the left. Incorporating text prompts largely resolves this issue (Fig. 13 bottom right), enabling the realistic restoration of intricate details such as the parrot’s head and demonstrating the importance of textual information in the image reconstruction process.

III-E2 The effect of the number of sampling steps

This section investigates the impact of the number of denoising sampling steps on LPIPS and FID scores using the Kodak dataset with a transmission rate of $\rho=1/384$. The performance across different SNR levels is illustrated in Figure 14. Both LPIPS and FID improve considerably as the number of sampling steps increases. After 50 steps, however, these improvements level off, with only marginal gains from additional steps across all evaluated SNR values. At an SNR of 13dB, for example, increasing the sampling from 25 to 50 steps improves LPIPS by $0.02$ and FID by $6.5$, whereas increasing it from 50 to 100 steps yields smaller improvements of $0.004$ in LPIPS and $2.5$ in FID. These findings led to our decision to use 50 sampling steps in the experiments in earlier sections. Using 50 steps balances the efficiency and effectiveness of our method in terms of computational resources, inference time, and reconstructed image quality.

IV Conclusion

In this study, we introduce DiffJSCC, a novel deep JSCC framework that capitalizes on the robust priors of the pre-trained Stable Diffusion model to substantially improve the perceptual quality of reconstructed images. Unlike the traditional end-to-end learning approach, DiffJSCC initially derives multimodal conditions from the received symbols. For enhanced training stability and efficiency, DiffJSCC employs a JSCC decoder to create an initial reconstruction of the transmitted image, which is then utilized to generate multimodal features. DiffJSCC effectively integrates multimodal conditions that consist of spatial, textual, and channel state information to steer the image generation process. Our framework employs a new control module to fine-tune the Stable Diffusion model based on the generated conditions, and it leverages the initial JSCC reconstruction to guide the denoising diffusion process, balancing fidelity and realism. Extensive experiments on both general and facial image transmission cases demonstrate that DiffJSCC considerably outperforms the latest generative JSCC methods, providing superior human-perceptual image quality and downstream task performance, particularly in environments with low transmission rates and poor SNRs. The proposed DiffJSCC is model-agnostic and can be extended with more advanced deep JSCC architectures and/or improved conditional diffusion models. Furthermore, DiffJSCC introduces no extra computations on the transmitter side, which is particularly beneficial for scenarios with energy-limited image-capturing devices.

References

  • [1] V. Bozantzis and F. Ali, “Combined vector quantisation and index assignment with embedded redundancy for noisy channels,” Electronics Letters, vol. 36, no. 20, p. 1, 2000.
  • [2] D. Goodman and T. Moulsley, “Using simulated annealing to design digital transmission codes for analogue sources,” Electronics Letters, vol. 24, no. 10, pp. 617–619, 1988.
  • [3] S. Shoouri, M. Yang, Z. Fan, and H.-S. Kim, “Efficient computation sharing for multi-task visual scene understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
  • [4] B. Wang, F. Yang, X. Yu, C. Zhang, and H. Zhao, “Apisr: Anime production inspired real-world anime super-resolution,” arXiv preprint arXiv:2403.01598, 2024.
  • [5] M. Yang, Y. Chen, and H.-S. Kim, “Efficient deep visual and inertial odometry with adaptive visual modality selection,” in European Conference on Computer Vision.   Springer, 2022, pp. 233–250.
  • [6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.
  • [7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [8] H. Ye, G. Y. Li, and B.-H. Juang, “Power of deep learning for channel estimation and signal detection in ofdm systems,” IEEE Wireless Communications Letters, vol. 7, no. 1, pp. 114–117, 2017.
  • [9] M. Yang, L.-X. Chuo, K. Suri, L. Liu, H. Zheng, and H.-S. Kim, “ilps: Local positioning system with simultaneous localization and wireless communication,” in IEEE INFOCOM 2019-IEEE Conference on Computer Communications.   IEEE, 2019, pp. 379–387.
  • [10] Y.-S. Hsiao, M. Yang, and H.-S. Kim, “Super-resolution time-of-arrival estimation using neural networks,” in 2020 28th European Signal Processing Conference (EUSIPCO).   IEEE, 2021, pp. 1692–1696.
  • [11] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, “Deep joint source-channel coding for wireless image transmission,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 3, pp. 567–579, 2019.
  • [12] D. B. Kurka and D. Gündüz, “Deepjscc-f: Deep joint source-channel coding of images with feedback,” IEEE Journal on Selected Areas in Information Theory, vol. 1, no. 1, pp. 178–193, 2020.
  • [13] H. Wu, Y. Shao, E. Ozfatura, K. Mikolajczyk, and D. Gündüz, “Transformer-aided wireless image transmission with channel feedback,” IEEE Transactions on Wireless Communications, 2024.
  • [14] M. Yang, C. Bian, and H.-S. Kim, “Deep joint source channel coding for wireless image transmission with ofdm,” in ICC 2021-IEEE International Conference on Communications.   IEEE, 2021, pp. 1–6.
  • [15] C. Bian, Y. Shao, H. Wu, and D. Gündüz, “Space-time design for deep joint source channel coding of images over mimo channels,” in 2023 IEEE 24th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC).   IEEE, 2023, pp. 616–620.
  • [16] C. Lee, X. Hu, and H.-S. Kim, “Deep joint source-channel coding with iterative source error correction,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2023, pp. 3879–3902.
  • [17] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
  • [18] K. Yang, S. Wang, J. Dai, K. Tan, K. Niu, and P. Zhang, “Witt: A wireless image transmission transformer for semantic communications,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2023, pp. 1–5.
  • [19] K. Yang, S. Wang, J. Dai, X. Qin, K. Niu, and P. Zhang, “Swinjscc: Taming swin transformer for deep joint source-channel coding,” arXiv preprint arXiv:2308.09361, 2023.
  • [20] T. Wu, Z. Chen, M. Tao, X. Xu, W. Zhang, and P. Zhang, “Mambajscc: Deep joint source-channel coding with visual state space model,” arXiv preprint arXiv:2405.03125, 2024.
  • [21] J. Xu, B. Ai, W. Chen, A. Yang, P. Sun, and M. Rodrigues, “Wireless image transmission using deep source channel coding with attention modules,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 4, pp. 2315–2328, 2021.
  • [22] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [23] M. Yang and H.-S. Kim, “Deep joint source-channel coding for wireless image transmission with adaptive rate control,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 5193–5197.
  • [24] J. Dai, S. Wang, K. Tan, Z. Si, X. Qin, K. Niu, and P. Zhang, “Nonlinear transform source-channel coding for semantic communications,” IEEE Journal on Selected Areas in Communications, vol. 40, no. 8, pp. 2300–2316, 2022.
  • [25] W. Zhang, H. Zhang, H. Ma, H. Shao, N. Wang, and V. C. Leung, “Predictive and adaptive deep coding for wireless image transmission in semantic communication,” IEEE Transactions on Wireless Communications, vol. 22, no. 8, pp. 5486–5501, 2023.
  • [26] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6228–6237.
  • [27] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. V. Gool, “Generative adversarial networks for extreme learned image compression,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 221–231.
  • [28] F. Mentzer, G. D. Toderici, M. Tschannen, and E. Agustsson, “High-fidelity generative image compression,” Advances in Neural Information Processing Systems, vol. 33, pp. 11 913–11 924, 2020.
  • [29] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in neural information processing systems, vol. 33, pp. 6840–6851, 2020.
  • [30] M. Careil, M. J. Muckley, J. Verbeek, and S. Lathuilière, “Towards image compression with perfect realism at ultra-low bitrates,” in The Twelfth International Conference on Learning Representations, 2023.
  • [31] L. Theis, T. Salimans, M. D. Hoffman, and F. Mentzer, “Lossy compression with gaussian diffusion,” arXiv preprint arXiv:2206.08889, 2022.
  • [32] M. Yang, C. Bian, and H.-S. Kim, “Ofdm-guided deep joint source channel coding for wireless multipath fading channels,” IEEE Transactions on Cognitive Communications and Networking, vol. 8, no. 2, pp. 584–599, 2022.
  • [33] J. Wang, S. Wang, J. Dai, Z. Si, D. Zhou, and K. Niu, “Perceptual learned source-channel coding for high-fidelity image semantic transmission,” in GLOBECOM 2022-2022 IEEE Global Communications Conference.   IEEE, 2022, pp. 3959–3964.
  • [34] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1125–1134.
  • [35] E. Erdemir, T.-Y. Tung, P. L. Dragotti, and D. Gündüz, “Generative joint source-channel coding for semantic image transmission,” IEEE Journal on Selected Areas in Communications, 2023.
  • [36] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila, “Analyzing and improving the image quality of stylegan,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 8110–8119.
  • [37] S. F. Yilmaz, X. Niu, B. Bai, W. Han, L. Deng, and D. Gündüz, “High perceptual quality wireless image delivery with denoising diffusion models,” arXiv preprint arXiv:2309.15889, 2023.
  • [38] Y. Wang, J. Yu, and J. Zhang, “Zero-shot image restoration using denoising diffusion null-space model,” in The Eleventh International Conference on Learning Representations, 2022.
  • [39] X. Lin, J. He, Z. Chen, Z. Lyu, B. Fei, B. Dai, W. Ouyang, Y. Qiao, and C. Dong, “Diffbir: Towards blind image restoration with generative diffusion prior,” arXiv preprint arXiv:2308.15070, 2023.
  • [40] E. Lei, Y. B. Uslu, H. Hassani, and S. S. Bidokhti, “Text+ sketch: Image compression at ultra low rates,” arXiv preprint arXiv:2307.01944, 2023.
  • [41] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847.
  • [42] C. Mou, X. Wang, L. Xie, Y. Wu, J. Zhang, Z. Qi, and Y. Shan, “T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4296–4304.
  • [43] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 10 684–10 695.
  • [44] T.-Y. Tung, D. B. Kurka, M. Jankowski, and D. Gündüz, “Deepjscc-q: Constellation constrained deep joint source-channel coding,” IEEE Journal on Selected Areas in Information Theory, vol. 3, no. 4, pp. 720–731, 2022.
  • [45] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in International conference on machine learning.   PMLR, 2023, pp. 19 730–19 742.
  • [46] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [47] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18.   Springer, 2015, pp. 234–241.
  • [48] Z. Wang, E. P. Simoncelli, and A. C. Bovik, “Multiscale structural similarity for image quality assessment,” in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2.   IEEE, 2003, pp. 1398–1402.
  • [49] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586–595.
  • [50] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
  • [51] Z. Chen, Y. Duan, W. Wang, J. He, T. Lu, J. Dai, and Y. Qiao, “Vision transformer adapter for dense predictions,” arXiv preprint arXiv:2205.08534, 2022.