
Compression Ratio Learning and Semantic Communications for Video Imaging

Bowen Zhang1, Zhijin Qin2, and Geoffrey Ye Li1
1 Department of Electrical and Electronic Engineering, Imperial College London, London, UK
2 Department of Electronic Engineering, Tsinghua University, Beijing, China
{k.zhang21, geoffrey.li}@imperial.ac.uk, [email protected]
Abstract

Camera sensors have been widely used in intelligent robotic systems. Developing camera sensors with high sensing efficiency has always been important for reducing power, memory, and other related resources. Inspired by the recent success of programmable sensors and deep optic methods, we design a novel video compressed sensing system with spatially-variant compression ratios, which achieves higher imaging quality than existing snapshot compressive imaging methods at the same sensing cost. In this article, we also investigate data transmission methods for programmable sensors, where the performance of the communication system is evaluated by the reconstructed images or videos rather than by the fidelity of the transmitted sensor data itself. Usually, different reconstruction algorithms are designed for applications in high dynamic range imaging, video compressive sensing, or motion deblurring. This task-aware property inspires a semantic communication framework for programmable sensors. In this work, a policy-gradient-based reinforcement learning method is introduced to achieve an explicit trade-off between the compression (or transmission) rate and the image distortion. Numerical results show the superiority of the proposed methods over existing baselines.

I Introduction

Cameras have become ubiquitous in robotic systems and enable robots to evaluate the environment, infer their status, and make intelligent decisions. However, modern cameras still suffer from the age-old problems of limited dynamic range, caused by the full-well capacity, dark-current noise, and read noise [1], and of motion blur, resulting from either object motion during the exposure or camera motion due to robot movement.

To overcome some of these challenges, computational imaging techniques have been widely used to generate high dynamic range (HDR) images [2, 3, 4] or high-speed videos [5, 6]. These methods use spatially-varying pixel exposure or pixel-wise coded exposure, along with optimization techniques, to improve imaging performance. Among them, snapshot compressive imaging (SCI) [7, 8, 9] captures a set of consecutive video frames with a single exposure and is popular for realizing sparse measurements with low requirements on memory, bandwidth, and power [10].

Due to the hardware limitations of conventional cameras, existing SCI systems capture an image by exposing photo-sensitive elements for a fixed exposure time, leading to a fixed temporal compression ratio for all pixels. The fixed compression ratio of current SCI systems severely limits their ability to record natural scenes in a measurement-efficient way. With a small compression ratio, more measurements are sampled, increasing the power, memory, and other related resources. With a large compression ratio, fewer samples are needed but the reconstructed video frames have poor quality. If different pixels within an SCI system could be sampled under different compression ratios, the video compressive sensing system could maintain high-quality reconstruction while achieving efficient measurement. However, such a video compressive sensing system not only has special hardware requirements, but also requires an optimal pixel-wise compression ratio assignment policy, which is non-trivial because both shot/read noise and object/camera motion affect the choice of compression ratios.

Fortunately, programmable sensors or focal-plane sensor–processors [11, 12, 13] can vary compression ratios spatially through pixel-level control of exposure time and read-out operations. Recent works on deep optics, on the other hand, demonstrate the superiority of jointly learning optic parameters and image processing methods over traditional computational imaging techniques with heuristic optics designs, for applications in HDR imaging [14], video compressive sensing [15, 16] (in these deep optic-based works on video compressive sensing, the coded aperture and the exposure time for generating a snapshot image are optimized but the compression ratio, i.e., the number of measurements over a time window, is fixed; by contrast, we study the shutter speed and read-out frequencies to adjust the per-pixel compression ratio, so that some pixel locations have more measurements than others in the captured sensor data), and motion deblurring [17]. Inspired by these pioneering works, we focus on minimizing the number of measurements for various applications and develop a novel video compressive sensing system with pixel-wise compression ratios, where the ratio allocation policy and the video reconstruction algorithm are optimized through a combination of supervised learning and policy-gradient reinforcement learning (RL) [18, 19].

In addition to sensor development, we also study the optimal transmission method for sensor data generated by programmable sensors, in terms of source and channel coding, since the raw sensor data collected by the sensors at the robots need to be sent to administrators for human-robot interaction. For sensing systems with image-type data [17, 15, 20], off-the-shelf codecs (e.g., BPG [21] or JPEG [22]) as well as deep image compression methods [23] can serve as the source coding, and the fifth-generation (5G)-standardized LDPC [24] or polar codes [25] can be chosen for channel coding. However, designing the source and channel coding for sensor data independently from the deep optics and video reconstruction algorithms is sub-optimal for two reasons. First, under the influence of deep optics, the distribution of sensor data differs from that of regular images. Second, different pixels in the sensor data contribute differently to the subsequent video reconstruction process and should therefore be compressed differently.

To address these challenges, semantic communications [26, 27, 28, 29] are promising solutions. The deep optics and video reconstruction networks define special message generation and interpretation processes between transceivers, which are called semantic encoders and decoders in semantic communications, respectively. The communication systems should be optimized to ensure the correct interpretation of the messages by the semantic decoders rather than the exact delivery of the raw sensor data itself. To realize semantic communications, one potential way is to introduce task-aware compression [30] for programmable sensors and design the compression methods under the guidance of the video reconstruction process. In this case, however, the channel coding is still designed separately, whereas earlier studies on joint source and channel coding (JSCC) [31] have demonstrated the benefits of co-designing the source and channel coding process. In this work, a semantic communication framework that achieves the joint optimization of deep optics, data compression, channel coding, and reconstruction is designed, which significantly improves the transmission efficiency of sensor data.

Specifically, our contributions can be summarized as follows:

  • We introduce an RL-based method for adjusting the parameters of programmable sensors, which differs from existing deep optic methods based on differentiable models or functions [17].

  • We build a novel video compressive sensing system with spatially-variant compression ratios, where the ratio allocation policy is learned through an explicit rate-distortion function.

  • We introduce an RL-based method for the explicit trade-off between transmission rate and task accuracy in semantic communications, where the rate allocation policy is trained jointly with the coding modules.

  • We propose a semantic communication system for programmable sensors, realizing the co-design of deep optics, data compression, channel coding, and reconstruction algorithms.

II Video Compressed Sensing with Spatially-variant Ratios

Figure 1: The illustration of spatially-variant compression ratios in video compressed sensing.

As shown in Fig. 1, the principle of the proposed video compressed sensing system is to learn spatially-variant compression ratios for better imaging quality with fewer measurements. This idea can be further combined with existing deep optic-based SCI systems with pixel-wise coded aperture and shuttering functions [20]. In SCI systems with a fixed compression ratio, a fixed number of measurements is generated for all spatial locations, while in the proposed system with spatially-variant ratios, some locations have more measurements than others. For example, given four video frames, two snapshots ($m_1$, $m_2$) are captured by current SCI systems with a $1/2$ ratio. By contrast, the proposed system generates one snapshot with dense measurements ($m_1$) and three other snapshots with sparse measurements ($m_2$, $m_3$, $m_4$). Specifically, spatial locations with a $1/4$ ratio only have measurements on $m_1$ after one readout operation, while locations with a ratio of $1$ have values on all four snapshots after four readout operations. To improve the sensing efficiency, the ratio map should be jointly designed with the deep optics and video reconstruction algorithms. In this section, we introduce the details of the proposed system, including the forward model and the training losses.

Figure 2: The overview architecture of the proposed video compressed sensing system with spatially-variant ratios.

II-A System overview

We show the overall pipeline of the proposed sensing system in Fig. 2. Denote $H$ and $W$ as the height and width of the video frames. We first generate a compression ratio map, $M\in\mathcal{R}^{H\times W\times 1}$, from a small trainable matrix using a ratio generation network with one convolution layer, three residual blocks, and one transposed convolution layer. The spatial size of the trainable matrix is set to 1/8 that of the ratio map. We then simulate the capture of a scene, $S\in\mathcal{R}^{H\times W\times T}$, with a programmable sensor using $M$ and possibly an extra random binary mask, $B\in\mathcal{R}^{H\times W\times T}$, which is referred to as a coded aperture in earlier works [20, 10]. The captured sensor data, $I$, stacked with $M$, is then fed into a video reconstruction network to produce a reconstructed video, $\hat{V}\in\mathcal{R}^{H\times W\times T}$. Two training losses ($\mathcal{L}_{1}$, $\mathcal{L}_{2}$) are designed to update the learnable matrix, the ratio generation network, and the video reconstruction network, so as to improve the reconstruction quality while reducing the average compression ratio.

Figure 3: The exposure time and readout operations in different compression ratios.

II-B Compression ratio generation

Given $T$ video frames, five kinds of compression ratios are considered during the ratio map design, i.e., $0$, $1/T$, $2/T$, $4/T$, and $8/T$. Therefore, each pixel in $M$ takes one of five discrete values. To generate the ratio map, the feature maps produced by the ratio generation network are designed to have five channels. After applying a Softmax function along the channel dimension of the feature maps, each channel indicates the discrete probability of taking the corresponding ratio. Denote the feature maps representing the discrete probability distributions as $P\in\mathcal{R}^{H\times W\times 5}$. The final ratio map is obtained through a sampling operation according to $P$.
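As a concrete illustration, the sampling of the ratio map can be sketched in a few lines (a minimal PyTorch sketch with assumed shapes; the convolutional and residual layers of the actual ratio generation network are omitted):

import torch
import torch.nn.functional as F

T = 16
RATIOS = torch.tensor([0.0, 1 / T, 2 / T, 4 / T, 8 / T])  # five candidate ratios

def sample_ratio_map(logits):
    """logits: (H, W, 5) raw outputs of the ratio generation network."""
    P = F.softmax(logits, dim=-1)                  # per-pixel categorical distribution
    idx = torch.multinomial(P.reshape(-1, 5), 1)   # sample one ratio index per pixel
    idx = idx.reshape(logits.shape[:2])            # (H, W)
    M = RATIOS[idx]                                # map indices to ratio values
    return M, P, idx

logits = torch.randn(256, 256, 5)                  # stand-in for network outputs
M, P, idx = sample_ratio_map(logits)
print(M.shape, M.unique())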

II-C Sensor forward model with varying ratios

The learned ratio map guides the behavior of the programmable sensor. The exposure time and readout operations under different ratios are summarized in Fig. 3. We consider that each measurement is read out after a continuous exposure ends and before a new round of exposure begins. Also, the generated "signal" electrons are cleared after each readout operation. For example, when the ratio is $2/T$, each exposure lasts $T/2$ frame times, generating $2$ measurements in total. In each exposure, the signal integrates from $0$. For a specific ratio, $r$, the equivalent measurement matrix, $A_{r}$, is of size $rT\times T$ with each row representing one measurement base. For the $i$-th ($i=1,2,\ldots,rT$) row, $[A_{r}]_{i:}$ is a binary vector whose entries from column $(i-1)/r+1$ to column $i/r$ take the value 1 while the other columns are 0. For example, $A_{2/T}=\begin{bmatrix}1&\cdots&1&1&0&\cdots&0&0\\ 0&\cdots&0&0&1&\cdots&1&1\end{bmatrix}$. The first row indicates that the first measurement is the signal integrated from frame time $1$ to $T/2$, while the second row indicates that the second measurement integrates from time $T/2+1$ to $T$. With the measurement matrices, the sensing process at spatial location $p$ with ratio $r_{p}$ can be modelled as,

$I_{p}=\mathcal{U}(A_{r_{p}}S_{p})$, (1)

where $S_{p}\in\mathcal{R}^{T\times 1}$ is the signal at spatial location $p$, $I_{p}\in\mathcal{R}^{r_{p}T\times 1}$ is the captured sensor data at location $p$, and $\mathcal{U}(\cdot)$ represents the camera exposure function capturing the noise effects of the camera [17].

If a random binary mask, $B\in\{0,1\}^{H\times W\times T}$, is used to modulate the signal [20], the sensing process can be modelled as,

$I_{p}=\mathcal{U}(A_{r_{p}}(S_{p}\cdot B_{p}))$, (2)

where $\cdot$ represents element-wise multiplication and $B_{p}\in\{0,1\}^{T\times 1}$ is the binary mask at spatial location $p$.

Following [17], we consider two kinds of noise, i.e., shot noise $n_{s}\sim\mathcal{N}(0,\sigma_{s})$ and read noise $n_{r}\sim\mathcal{N}(0,\sigma_{r})$, in the camera exposure function. The level of shot noise is proportional to the strength of the signal electrons. For a signal $e$ in the $[0,1]$ range, $\sigma_{s}=\sqrt{e}\,\sigma_{ss}$, where $\sigma_{ss}$ is independent of the signal. By contrast, the level of read noise is fixed for a camera and independent of the photon flux. We use the same settings for $\sigma_{ss}$ and $\sigma_{r}$ as [17], that is, $\sigma_{ss}=4.95\times 10^{-3}$ and $\sigma_{r}=7.25\times 10^{-3}$. With these noise definitions, $\mathcal{U}(e)=e+n_{s}+n_{r}$.
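To make the forward model concrete, the sketch below (toy scaling; the noise parameters are those stated above, other names are illustrative) builds the measurement matrix $A_{r}$ for one ratio and applies the noisy exposure function of Eq. (1) at a single pixel:

import torch

T = 16
SIGMA_SS, SIGMA_R = 4.95e-3, 7.25e-3

def measurement_matrix(r, T=T):
    """A_r of size (rT, T); row i integrates frames (i-1)/r+1, ..., i/r."""
    m = int(round(r * T))                 # number of measurements
    A = torch.zeros(m, T)
    span = T // m                         # frames integrated per measurement
    for i in range(m):
        A[i, i * span:(i + 1) * span] = 1.0
    return A

def exposure(e, sigma_ss=SIGMA_SS, sigma_r=SIGMA_R):
    """Camera exposure U(e): signal-dependent shot noise plus read noise."""
    shot = torch.sqrt(e.clamp(min=0)) * sigma_ss * torch.randn_like(e)
    read = sigma_r * torch.randn_like(e)
    return e + shot + read

S_p = torch.rand(T, 1) / T                # toy signal kept roughly in [0, 1] after integration
A = measurement_matrix(2 / T)             # ratio 2/T -> two measurements
I_p = exposure(A @ S_p)
print(A)
print(I_p.shape)                          # (2, 1)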

Note that increasing $r$ decreases the signal strength of each measurement because of the shorter exposure. Since the distortion effects of shot noise and read noise on the signal increase as the signal strength decreases, the imaging quality will not increase indefinitely as $r$ grows. Also, a long exposure is more beneficial for stationary scenes, as the signal strength can increase, but not for moving objects. The best choice of ratios should be a mix of small and large compression ratios across locations, so that the captured sensor data can benefit both from the long exposure with a larger signal-to-noise ratio and from the short exposure with less motion blur.

Figure 4: The architectural details of video reconstruction network.

II-D Video reconstruction network

After obtaining $I$, the next step is to reconstruct the target video using a deep neural network. The overall architecture of the proposed video reconstruction network is shown in Fig. 4. Our network consists of three components: an initial reconstruction stage (IR), a fusion network (FN), and a deep reconstruction network (DRN) built on the single-stage spectral-wise transformers (SST) proposed in [32]. The IR and FN are introduced to mitigate the influence of the spatially-variant $M$ and $B$. Specifically, if $B$ is not considered, we obtain the initial reconstruction, $\hat{V}_{0}\in\mathcal{R}^{H\times W\times T}$, in IR from Eq. (1) as,

$\hat{V}_{0_{p}}=A_{r_{p}}^{T}(A_{r_{p}}A_{r_{p}}^{T})^{-1}I_{p}$, (3)

where $(A_{r_{p}}A_{r_{p}}^{T})^{-1}$ is fortunately a diagonal matrix, as defined in Sec. II-C. By contrast, if $B$ is considered, we first rewrite Eq. (2) as,

$I_{p}=\mathcal{U}(\hat{A}_{r_{p}}S_{p})$, (4)

where $\hat{A}_{r_{p}}=A_{r_{p}}\,\text{diag}(B_{p})$ and $\text{diag}(B_{p})$ denotes the $T\times T$ diagonal matrix constructed from $B_{p}$. In this case, some rows of $\hat{A}_{r_{p}}$ may be all zero, making $\hat{A}_{r_{p}}\hat{A}_{r_{p}}^{T}$ non-invertible. Since the row vectors of $\hat{A}_{r_{p}}$ are orthogonal to each other, we reconstruct $\hat{V}_{0}$ by using the non-zero row vectors of $\hat{A}_{r_{p}}$ independently,

$\hat{V}_{0_{p}}=\sum_{i=1,\,[\hat{A}_{r_{p}}]_{i:}\neq\mathbf{0}_{1\times T}}^{r_{p}T}[\hat{A}_{r_{p}}]_{i:}^{T}\left([\hat{A}_{r_{p}}]_{i:}[\hat{A}_{r_{p}}]_{i:}^{T}\right)^{-1}[I_{p}]_{i:}+\mathbf{0}_{T\times 1}$. (5)
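The initial reconstruction of Eq. (5) back-projects each non-zero row of the masked measurement matrix independently, so locations that were never exposed stay at zero. A minimal sketch for a single spatial location (toy values with $T=4$) follows:

import torch

def initial_reconstruction(A_hat, I_p):
    """A_hat: (rT, T) masked measurement matrix; I_p: (rT, 1) measurements."""
    T = A_hat.shape[1]
    V0 = torch.zeros(T, 1)
    for i in range(A_hat.shape[0]):
        a = A_hat[i:i + 1, :]                     # i-th row as a (1, T) vector
        energy = (a @ a.t()).item()               # scalar a a^T
        if energy > 0:                            # skip all-zero rows
            V0 += a.t() @ I_p[i:i + 1, :] / energy
    return V0

A = torch.tensor([[1., 1., 0., 0.], [0., 0., 1., 1.]])   # toy A_{2/T} with T = 4
B_p = torch.tensor([1., 0., 1., 1.])                      # toy coded aperture at p
A_hat = A * B_p                                           # A_r diag(B_p)
I_p = A_hat @ torch.tensor([[0.2], [0.4], [0.6], [0.8]])  # noiseless measurements
print(initial_reconstruction(A_hat, I_p).squeeze())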

Following the coarse-to-fine criterion, we further use a shallow FN to deal with the spatially-variant coded exposure and ratios, which takes $\hat{V}_{0}$, $M$, and possibly $B$ as inputs and outputs the second-level reconstruction, $\hat{V}_{1}\in\mathcal{R}^{H\times W\times T}$. The architecture of FN is shown in Fig. 4.

Next, as illustrated in Fig. 4, the DRN takes $\hat{V}_{1}$ as input and reconstructs the final video, $\hat{V}$, by cascading three SSTs [32]. Each SST follows the design of U-Net [33] with an encoder, a bottleneck, and a decoder. The embedding and mapping blocks are convolutional layers (conv) with 3×3 kernels. The feature maps in the encoder sequentially pass through one spectral-wise attention block (SAB), one conv with stride 2 (for downsampling), another SAB, and another conv with stride 2. The bottleneck is one SAB. The decoder has an architecture symmetric to the encoder. Following the spirit of U-Net, skip connections are used for feature aggregation between the encoder and decoder to alleviate the information loss caused by downsampling. The basic unit of SST is the SAB, whose architecture is also shown in Fig. 4. It contains one feed-forward network (FFN), one spectral-wise multi-head self-attention (S-MSA) module, and two layer-normalization layers. Different from the original MSA that calculates self-attention along the spatial dimension, S-MSA regards each feature map as a token and calculates self-attention along the channel dimension, making it computationally efficient. More details of S-MSA are given in [32] (as the network design of the DRN is not the main focus of this paper, we omit some details due to page limits; interested readers can refer to the original paper).

II-E Training losses

We use $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ to train the ratio generation parts and the video reconstruction network, respectively. Specifically, the learnable matrix and the ratio generation network are trained based on policy gradients. In our system, each spatial location in $P$ is regarded as an agent whose action space is the set of available compression ratios. Earlier works have proved the global convergence of policy-gradient RL in the multi-agent setting [34]. We define the reward of each location, $Q_{p}(r_{p})$, under action $r_{p}$ according to the rate-distortion trade-off,

$Q_{p}(r_{p})=\log(1/||V_{p}-\hat{V}_{p}||^{2})-\lambda r_{p}T$, (6)

where $V$ denotes the target video, $||V_{p}-\hat{V}_{p}||^{2}$ denotes the mean-squared error (MSE), or distortion, of the reconstructed video at spatial location $p$, $r_{p}T$ denotes the number of measurements used at location $p$, which can also be viewed as the compression rate, and $\lambda$ is an introduced parameter for the rate-distortion trade-off. Increasing $\lambda$ penalizes the compression rate more, leading to a smaller average compression ratio. As this is not a sequential decision problem, there is no need to define future rewards. Although $||V_{p}-\hat{V}_{p}||^{2}$ only approximately evaluates the effect of action $r_{p}$ on $V_{p}$, since the DRN aggregates information from neighbouring pixels, it is still the most direct way to evaluate action $r_{p}$.

At the same time, the expected reward, $J_{p}$, is the value function for spatial location $p$, where the expectation is taken over $r_{p}$ with probability $P_{p}$. Note that minimizing $\mathcal{L}_{1}$ is equivalent to maximizing $\sum_{p}J_{p}$. Following [19], we can approximate the gradient of $J_{p}$ with respect to the parameters $\theta$ of the ratio generation parts using samples drawn from $P_{p}$,

$\nabla_{\theta}J_{p}=\mathbb{E}_{r_{p}}\left[\nabla_{\theta}\log P_{p}(r_{p})\times Q_{p}(r_{p})\right]$, (7)

where $P_{p}(r_{p})$ denotes the probability of choosing action $r_{p}$ from distribution $P_{p}$. Note that the update of $\theta$ is based on the average value of $\nabla_{\theta}J_{p}$ over different $p$.
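A hedged sketch of how Eq. (7) can be turned into a REINFORCE-style training loss in PyTorch is given below; the tensor shapes, the clamping constant, and the value of $\lambda$ are assumptions for illustration only:

import torch
import torch.nn.functional as F

T, lam = 16, 0.05
RATIOS = torch.tensor([0.0, 1 / T, 2 / T, 4 / T, 8 / T])

def ratio_policy_loss(logits, idx, mse_map):
    """logits: (H, W, 5) outputs of the ratio generation network;
    idx: (H, W) sampled ratio indices; mse_map: (H, W) per-pixel ||V_p - V_hat_p||^2."""
    log_P = F.log_softmax(logits, dim=-1)
    log_p_action = log_P.gather(-1, idx.unsqueeze(-1)).squeeze(-1)             # log P_p(r_p)
    reward = torch.log(1.0 / mse_map.clamp(min=1e-9)) - lam * RATIOS[idx] * T  # Eq. (6)
    # maximizing sum_p J_p corresponds to minimizing the negative reward-weighted log-prob
    return -(reward.detach() * log_p_action).mean()

H = W = 64
loss_1 = ratio_policy_loss(torch.randn(H, W, 5, requires_grad=True),
                           torch.randint(0, 5, (H, W)),
                           torch.rand(H, W))
loss_1.backward()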

On the other hand, the video reconstruction network is trained in a supervised way based on the MSE between $V$ and $\hat{V}$,

$\mathcal{L}_{2}=||V-\hat{V}||^{2}$. (8)

III Efficient transmission of data from programmable sensors

Sometimes the sensor data captured by a robot's on-device sensors needs to be sent to remote servers/devices through wireless channels for data storage or video reconstruction. Efficient transmission of these sensor data requires advanced data compression techniques to reduce the data rate and bandwidth requirement. As discussed above, existing communication systems that focus on reproducing the raw sensor data are sub-optimal. As a part of semantic communications, designing the compression method with the video reconstruction network in mind, based on task-aware compression [30], is promising. After compressing the original sensor data into compact bit streams, the transmitter adds parity bits and modulates the streams for robust transmission over unreliable channels using off-the-shelf methods. The number of modulated symbols after channel coding and modulation is the communication cost.

Nevertheless, there is an increasing belief in the communication community that the classic framework based on the Shannon separation theory needs to be upgraded for joint designs [29]. For semantic communication systems, jointly optimizing the channel coding and modulation with the other components may lead to better video reconstruction quality with fewer communication resources.

In this section, we design semantic communication frameworks for the proposed video compressed sensing system, based on task-aware compression and on semantic communications with joint designs, respectively.

Figure 5: The overall framework of task-aware compression methods.

III-A Task-aware compression

III-A1 Architecture

The overall architecture of the deployed deep compression method for the proposed video compressed sensing system is shown in Fig. 5. It is mainly based on the architecture in [30], with the basic feature extraction unit changed from convolutional blocks to SABs. Some other changes are also made for simplicity (as the learned compression ratio is universal to all scenes, the spatial importance of $I$ is also independent of the content of $I$; therefore, the side links conditioned on a quality map used in [30] can be safely removed; also, instead of estimating the quality map by a task loss function, we directly change the distortion metric in the training loss of [30] to the task loss, which describes the spatial importance more precisely). For sensor data $I$, we first reshape it to dimension $H\times W\times 8$, as each spatial location has at most eight measurements; zero-padding is used for locations with fewer measurements. An encoder, $g_{a}$, then takes the reshaped sensor data as input and generates a latent representation $Y\in\mathcal{R}^{\frac{H}{32}\times\frac{W}{32}\times 192}$, given by $Y=g_{a}(I)$. $Y$ is then quantized to $\hat{Y}$ by a quantizer. The embedding and mapping blocks are 3×3 convs. The features inside the encoder subsequently go through five SAB$\times 4$-downsampling$\times 1$ pairs. Each downsampling operation reduces the spatial size of the features to $1/4$ but doubles the channel dimension. Next, a hyper-encoder, $h_{a}$, takes $Y$ as input and generates image-specific side information $Z\in\mathcal{R}^{\frac{H}{128}\times\frac{W}{128}\times 192}$ via $Z=h_{a}(Y)$. The channel dimension of the features is kept constant in $h_{a}$. The quantized side information $\hat{Z}=Q(Z)$ is then saved as a lossless bitstream through a factorized entropy model and entropy coding. After that, $\hat{Z}$ is forwarded to the hyper-decoder, $h_{s}$, to produce the parameters $(\mu,\sigma)$ of a Gaussian entropy model, which approximates the distribution of $\hat{Y}$ and is used to save $\hat{Y}$ as a lossless bitstream. To reconstruct $I$, a decoder, $g_{s}$, operates on $\hat{Y}$ and generates $\hat{I}$ by $\hat{I}=g_{s}(\hat{Y})$. Finally, the reconstructed $\hat{I}$ is used for video reconstruction. Note that $g_{s}$ and $g_{a}$, and $h_{s}$ and $h_{a}$, have symmetric architectures, respectively.

III-A2 Training losses

The goal of task-aware compression is to minimize the length of the bitstreams and the distortion between $V$ and $\hat{V}$. This leads to the optimization problem of minimizing $||V-\hat{V}||^{2}+\beta(-\log_{2}P_{\hat{Y}}-\log_{2}P_{\hat{Z}})$, where $||V-\hat{V}||^{2}$ denotes the video reconstruction distortion, $-\log_{2}P_{\hat{Y}}-\log_{2}P_{\hat{Z}}$ represents the number of bits used to encode $\hat{Y}$ and $\hat{Z}$, and $\beta$ is an introduced trade-off parameter between the coding rate and the video distortion. Increasing $\beta$ increases the compression ratio but decreases the video reconstruction quality.
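A minimal sketch of this objective follows; the variable names are assumptions, and the likelihood tensors stand for the outputs of the factorized and Gaussian entropy models:

import torch

def task_aware_rd_loss(V, V_hat, p_y_hat, p_z_hat, beta=1e-6):
    """V, V_hat: target and reconstructed videos; p_y_hat, p_z_hat: per-element
    likelihoods of the quantized latents under the learned entropy models."""
    distortion = torch.mean((V - V_hat) ** 2)                    # task (video) distortion
    rate_bits = -torch.log2(p_y_hat.clamp(min=1e-12)).sum() \
                - torch.log2(p_z_hat.clamp(min=1e-12)).sum()     # estimated bitstream length
    return distortion + beta * rate_bits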

III-A3 Communication costs

After obtaining compression systems with different compression ratios, we can estimate the number of modulated symbols required to transmit these bitstreams under different channel conditions. Specifically, the number of (complex) modulated symbols used for transmitting a bitstream depends on the channel coding rate $r_{c}$ (e.g., 1/3, 1/2, 2/3) and the modulation order $r_{m}$ (e.g., 4, 16, 64). Suppose the length of the bitstream is $l_{b}$; the number of modulated symbols can then be calculated as $l_{s}=\frac{l_{b}}{r_{c}\log_{2}(r_{m})}$. Another way to evaluate the communication cost is to use the Shannon capacity theorem, $l_{s}=\frac{l_{b}}{\log_{2}(1+snr)}$, if the signal-to-noise ratio (SNR) is $snr$.
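For example, the two cost estimates can be computed as follows (a short sketch; the bitstream length is a placeholder value):

import math

def symbols_with_coding(l_b, r_c, r_m):
    """l_b bits with channel coding rate r_c and modulation order r_m."""
    return l_b / (r_c * math.log2(r_m))

def symbols_with_capacity(l_b, snr_db):
    """Ideal symbol count from the Shannon capacity at a given SNR in dB."""
    snr = 10 ** (snr_db / 10)
    return l_b / math.log2(1 + snr)

l_b = 10_000
print(symbols_with_coding(l_b, 2 / 3, 16))   # 16QAM with rate-2/3 LDPC
print(symbols_with_capacity(l_b, 10))        # AWGN channel at 10 dB SNR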

Figure 6: The overall framework of semantic communications with joint designs.

III-B Semantic communications with joint design

Although the aforementioned deep compression method enables the co-design of deep optics, algorithms, and source coding, the widely-used entropy coding methods [30, 23] prohibit the co-design of communication-specific components with deep compression methods. When entropy coding is used, each bit must be transmitted without error; otherwise, error propagation occurs in the entropy decoding process. This property leaves the communication system no choice but to treat each bit equally and carefully. Semantic communications with joint design, on the other hand, allow the effect of channel noise to be considered during the compression process.

III-B1 Architecture

As shown in Fig. 6, the framework consists of three components: semantic coders, semantic-channel coders, and a rate allocation network (RAN). Semantic coders define a special message generation and interpretation method between transceivers based on a shared knowledge base. Semantic-channel coders directly learn the end-to-end mappings between semantic messages and modulated symbols. The RAN is responsible for controlling the transmission rates. Different from deep compression methods that focus only on the source coding rate, the transmission rates in semantic communications depend on the source coding rate, the channel coding rate, and the modulation order.

Specifically, the deep optic methods are special semantic encoders, which encode a natural scene $S$ into sensor data $I$ in a predefined way, and the video reconstruction networks, which decode the target video $V$ from $I$, are special semantic decoders. The sensor data, $I$, is the semantic message of a scene to be shared between transmitter and receiver. Note that conventional communication systems emphasize the accurate transmission of $I$. With the semantic decoder (defining $I\rightarrow V$), the semantic communication system should instead be optimized to maximize the quality of $V$ under limited communication costs. Also, this is the first attempt to define the semantic coders within the generation process of the source data.

Given $I$, the transmitter uses a semantic-channel encoder (SCE) to generate a predefined maximum number of modulated symbols, $X\in\mathcal{R}^{\frac{H}{8}\times\frac{W}{8}\times 48}$, from which some symbols are chosen by rate control techniques to transmit $I$ through noisy channels (otherwise, if all messages were transmitted with the same number of modulated symbols, this would contradict the Shannon source coding theorem, which states that the minimal expected codeword length is a function of the entropy of the input). This process can be modelled as,

$X=\text{SCE}(I)$, (9)

where the number of modulated symbols is measured in real numbers. The SCE is composed of an embedding layer, three consecutive SAB$\times 4$-downsampling$\times 1$ pairs, two stacked SAB$\times 1$-3$\times$3 conv$\times 1$ pairs, and a mapping layer. The size of the feature maps after each operation is shown in Fig. 6.

To adjust the communication costs according to the semantic content of message $I$, the RAN takes $X$ as input, which is not only the generated modulated symbols but also a high-level representation of $I$ in feature space, and generates a spatially-variant coding rate map $F\in\mathcal{R}^{\frac{H}{8}\times\frac{W}{8}\times 1}$. Each element of $F$ indicates how many of the $48$ symbols in $X$ at the same spatial location will be kept for transmission. Specifically, $F$ takes four discrete values $f\in\{1,2,3,4\}$, and only the first $12f$ symbols of $X$ in the channel dimension are transmitted. The generation of $F$ is similar to the generation of $M$ in Sec. II-B: the RAN generates a feature map $P_{F}\in\mathcal{R}^{\frac{H}{8}\times\frac{W}{8}\times 4}$ representing the discrete probability distribution over the values in $\{1,2,3,4\}$, and a sampling operation is conducted to generate $F$. With $F$, a masking operation is applied to $X$ to delete unnecessary symbols. The remaining symbols $\hat{X}$ are then reshaped into a vector $Z\in\mathcal{R}^{n_{z}\times 1}$, where $n_{z}$ denotes the number of remaining symbols. Next, power normalization is applied to $Z$ to satisfy the power constraint,

$\hat{Z}=Z/||Z||^{2}$. (10)

Finally, $\hat{Z}$ is directly transmitted through the wireless environment. The effect of channel noise on $\hat{Z}$ can be represented as,

$\tilde{Z}=h\hat{Z}+n$, (11)

where $h\sim\mathcal{CN}(0,1)$ is the multiplicative fading coefficient and $n\sim\mathcal{N}(0,\sigma_{n}I_{n_{z}\times n_{z}})$ is the additive noise. In an additive white Gaussian noise (AWGN) channel, $h=1$ and the value of $\sigma_{n}$ depends on the current signal-to-noise ratio (SNR). Simultaneously, each element of $F$ is quantized using 2 bits, and the resulting bitstream of $F$ is also transmitted. Different from the transmission of $\hat{Z}$, the bitstream of $F$ is transmitted in an error-free way, as in the deep compression methods, because a minor error in $F$ would make it impossible to reshape the flattened vector $Z$ back to $\hat{X}$.
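The transmitter-side steps described above can be sketched as follows; the shapes and the SNR-to-noise-variance mapping are illustrative assumptions, and the fading case with $h\neq 1$ is omitted:

import torch

def transmit(X, F_map, snr_db=10.0):
    """X: (H/8, W/8, 48) modulated symbols; F_map: (H/8, W/8) integers in {1,...,4}."""
    h, w, c = X.shape
    keep = 12 * F_map.unsqueeze(-1)                        # symbols kept per location
    mask = torch.arange(c).view(1, 1, c) < keep            # (H/8, W/8, 48) boolean mask
    Z = X[mask]                                            # flattened kept symbols
    Z_hat = Z / Z.norm() ** 2                              # power normalization, Eq. (10)
    noise_std = (10 ** (-snr_db / 10)) ** 0.5              # assumed SNR-to-sigma mapping
    Z_tilde = Z_hat + noise_std * torch.randn_like(Z_hat)  # AWGN channel, h = 1
    return Z_tilde, mask

X = torch.randn(8, 8, 48)
F_map = torch.randint(1, 5, (8, 8))
Z_tilde, mask = transmit(X, F_map)
print(Z_tilde.shape)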

At the receiver side, $\tilde{X}$ and $F$ are reshaped from $\tilde{Z}$ and the bitstream, respectively. After concatenation, they are fed into a semantic-channel decoder (SCD), which has an architecture symmetric to the SCE. This process can be modelled as,

$\tilde{I}=\text{SCD}(\tilde{X},F)$. (12)

The recovered sensor data $\tilde{I}$ is then used to generate the video using the video reconstruction network.

III-B2 Training losses

Similar to Sec. II-E, we use $\mathcal{L}_{1}$ and $\mathcal{L}_{2}$ to train the RAN and the SCE/SCD, respectively. Specifically, the RAN is trained based on policy-gradient RL, where $X$ is the state, $F$ describes the actions, and each spatial location of $F$ is an agent. Note that the spatial size of $\hat{V}$ is 8 times that of $F$ and $X$, so the action taken at each spatial location of $F$ has a strong effect on the reconstruction quality of an $8\times 8$ area of $\hat{V}$. Considering this, we first define $U=(V-\hat{V})^{2}$ and apply $8\times 8$ average pooling to $U$, obtaining $\tilde{U}\in\mathcal{R}^{\frac{H}{8}\times\frac{W}{8}\times 1}$. We then define the reward of location $p$, $Q_{F}^{p}(f_{p})$, under action $f_{p}$ as follows,

$Q_{F}^{p}(f_{p})=\log(1/\tilde{U}_{p})-\mu f_{p}$, (13)

where $\log(1/\tilde{U}_{p})$ describes the distortion caused by the agent at spatial location $p$, $f_{p}$ is the transmission rate (also the action) at $p$, and $\mu$ is an introduced parameter for the rate-distortion trade-off. Increasing $\mu$ penalizes the transmission rate more, leading to fewer modulated symbols being transmitted. We adjust the communication costs by tuning $\mu$.

At the same time, the expected reward, $J_{F}^{p}$, is the value function for spatial location $p$. Minimizing $\mathcal{L}_{1}$ is equivalent to maximizing $\sum_{p}J_{F}^{p}$, and the gradient of $J_{F}^{p}$ with respect to the parameters $\delta$ of the RAN can be approximated as follows,

$\nabla_{\delta}J_{F}^{p}=\mathbb{E}_{f}\left[\nabla_{\delta}\log P_{F}^{p}(f)\times Q_{F}^{p}(f)\right]$. (14)

Note that the update of $\delta$ is based on the average value of $\nabla_{\delta}J_{F}^{p}$ over different locations.

On the other hand, the SCE and SCD are trained in a supervised way based on the MSE between $V$ and $\hat{V}$,

$\mathcal{L}_{2}=||V-\hat{V}||^{2}$. (15)

III-B3 Communication costs

The communication costs consist of two parts: the transmission of $\hat{Z}$ and the transmission of $F$. As $\hat{Z}$ is of shape $n_{z}\times 1$, the number of complex modulated symbols used to transmit $\hat{Z}$ is $l_{s}^{(1)}=n_{z}/2$. On the other hand, $F$ has $n_{F}=\frac{HW}{64}$ elements, each represented by 2 bits; therefore, the bitstream length from $F$ is $l_{b}=2n_{F}$. If the channel coding rate is $r_{c}$ and the modulation order is $r_{m}$, the number of modulated symbols for $F$ can be calculated as $l_{s}^{(2)}=\frac{l_{b}}{r_{c}\log_{2}(r_{m})}$. The total communication cost is $l_{s}^{(1)}+l_{s}^{(2)}$.
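As a small worked example of this cost accounting (the number of kept symbols $n_z$ is a placeholder):

import math

H, W = 256, 256
n_F = H * W // 64                         # number of elements in F
l_b = 2 * n_F                             # 2 bits per element of F
r_c, r_m = 2 / 3, 16                      # rate-2/3 LDPC with 16QAM
l_s2 = l_b / (r_c * math.log2(r_m))       # symbols for the error-free rate map F
n_z = 30_000                              # example number of kept real-valued symbols
l_s1 = n_z / 2                            # complex symbols for Z_hat
print(l_s1 + l_s2)                        # total communication cost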

IV Experiments

In this section, we first demonstrate the superiority of the proposed video compressed sensing system in terms of sampling rate. Based on the developed sensors, we then evaluate the performance of different communication systems regarding communication costs.

IV-A Video imaging with spatially-variant compression ratios

The following experiments are conducted to evaluate the effectiveness of the method proposed in Sec. II.

IV-A1 Dataset

Following [17], we use the Need for Speed (NfS) dataset [35] to train the network and evaluate its performance. The NfS dataset was collected with significant camera motion and is suitable for representing scenes captured by a moving robot's on-board camera. The dataset consists of 100 videos obtained from the internet, of which 80 are used for training and 20 for testing. Each video is captured at 240 frames per second (fps) with a $1280\times 720$ resolution, which we center-crop to $256\times 256$. For each video, we select 80 random $16$-frame-long segments, so $T=16$ in our experiments. The images are converted to grayscale and normalized to $[0,1]$. Both the input and the output of the end-to-end model are 16-frame video segments.
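A hedged sketch of this preprocessing is given below; the video loading step is replaced by a random dummy array, and the grayscale conversion weights are a common assumption not stated in the paper:

import numpy as np

def preprocess_frames(frames, crop=256):
    """frames: (N, H, W, 3) uint8 RGB frames -> (N, crop, crop) float32 in [0, 1]."""
    gray = frames.astype(np.float32) @ np.array([0.299, 0.587, 0.114], np.float32)
    h, w = gray.shape[1:3]
    top, left = (h - crop) // 2, (w - crop) // 2
    return gray[:, top:top + crop, left:left + crop] / 255.0

def random_segments(video, T=16, n_segments=80, rng=np.random.default_rng(0)):
    """Draw random T-frame-long segments from a (N, H, W) video."""
    starts = rng.integers(0, len(video) - T, size=n_segments)
    return np.stack([video[s:s + T] for s in starts])

frames = np.random.randint(0, 256, (20, 720, 1280, 3), dtype=np.uint8)  # dummy clip
video = preprocess_frames(frames)
print(random_segments(video, n_segments=8).shape)   # (8, 16, 256, 256)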

IV-A2 Implementation details

Our model is implemented in PyTorch. The ratio generation parts are trained with the SGD optimizer (momentum 0.9) at a learning rate of $5\times 10^{-3}$. The video reconstruction network is trained with the Adam optimizer at a learning rate of $5\times 10^{-5}$. The training process is as follows. We first fix the ratio at $8/T$ for all spatial locations and train the video reconstruction network for 100 epochs. After that, we gradually increase $\lambda$ in Eq. (6) from $5\times 10^{-3}$ to $0.5$ and jointly train the ratio generation parts and the video reconstruction network. For each $\lambda$, we train for about $100$ epochs. For baselines with a fixed compression ratio for all locations, we set the ratio from $1/T$ to $8/T$ and train the video reconstruction network for $100$ epochs at each fixed ratio. Furthermore, we consider both cases where the binary mask, $B$, is used or not used. When $B$ is used, it is randomly initialized and fixed during the experiments.

IV-A3 Results without binary mask

Figure 7: The performance comparison between the learned spatially-variant-ratio method and fixed-ratio methods in video imaging systems without coded aperture.

We first compare our method with its fixed-ratio version when $B$ is not used. We use the peak signal-to-noise ratio (PSNR), $10\log_{10}(1/\text{MSE}(V,\hat{V}))$, as the performance metric. The results are shown in Fig. 7, where $r_{avg}=\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}M_{i,j}$ denotes the average compression ratio over all spatial locations. As shown in Fig. 7, the imaging quality of both methods increases with $r_{avg}$ until $r_{avg}$ reaches $0.5$ ($8/T$), where the effects of shot noise and read noise surpass the benefit of additional measurements. From the figure, the proposed method with learned ratios has a significant performance gain over the method with fixed ratios. For example, when $r_{avg}=0.0625$ ($1/T$), the learned-ratio method has a gain of nearly $5$ dB over the fixed-ratio method, demonstrating the superiority of learning a compression ratio map for reducing the number of measurement samples.
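For reference, the two quantities used in this comparison can be computed as follows (a short sketch with dummy tensors):

import torch

def psnr(V, V_hat):
    """PSNR in dB for signals normalized to [0, 1]."""
    return 10 * torch.log10(1.0 / torch.mean((V - V_hat) ** 2))

def average_ratio(M):
    """r_avg: mean of the learned ratio map M."""
    return M.mean()

V, V_hat = torch.rand(16, 256, 256), torch.rand(16, 256, 256)
print(psnr(V, V_hat).item(), average_ratio(torch.full((256, 256), 1 / 16)).item())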

IV-A4 Results with binary mask

Figure 8: The performance comparison between the learned spatially-variant-ratio method and fixed-ratio methods in video imaging systems with coded aperture.

We then consider $B$ and show the performance comparison in Fig. 8. Due to the use of the binary mask, the performance of the fixed-ratio method increases at $r_{avg}=0.0625$ ($1/T$), showing the positive effect of coded apertures. However, as using $B$ decreases the signal strength, the performance of the fixed-ratio method at larger $r_{avg}$ decreases compared with the same method without $B$ in Fig. 7. From the figure, the proposed method with learned ratios again shows a steady performance gain over the fixed-ratio method, further proving the effectiveness of the developed sensor.

IV-A5 Learned compression ratio maps

(a) $r_{avg}=0.998$
(b) $r_{avg}=2.31$
Figure 9: The learned ratio maps in a $50\times 50$ area of an image when $B$ is used.

We show the learned ratio maps when $B$ is used in Fig. 9. As discussed above, there are only five ratio choices, from $0$ to $8/T$, and their colors are shown in the color bar. From the figure, the learned ratio maps have fixed patterns within a local area, and the patterns differ for different $r_{avg}$. Besides, the learned ratio maps are a mix of low and high ratios, so that the benefits of both long and short exposures can be exploited.

IV-A6 Visual results

Figure 10: Examples of restored video frames from sensing methods with or without learned ratios when the compression ratio equals 1/16.

Fig. 10 shows the restored video frames from different methods. We use the fixed-ratio method with $r_{avg}=1$ and the learned-ratio method with $r_{avg}\simeq 1$ in Fig. 8. From the figure, the frames from the learned-ratio method have more texture details and fewer artifacts.

IV-B Semantic communications for programmable sensors

In this subsection, we will first evaluate semantic communication frameworks based on the task-aware compression and the joint design, respectively. After that, we will compare semantic communications with the existing transmission methods.

In the following experiments, we first restore the network parameters related to the sensors from the pretrained models in the previous subsection and fix them when training the semantic communication frameworks. We use the sensor network with $r_{avg}=0.156$ in Fig. 7. These experiments are called fixed-sensing experiments. Next, we jointly train the sensing network with the communication parts, which we call joint-sensing experiments.

IV-B1 Channel condition

We assume the sensor data is transmitted through an AWGN channel with $\text{SNR}=10$ dB.

IV-B2 Implementation details of different semantic communication frameworks

We now describe the implementations of the different semantic communication frameworks in more detail.

  • Task-aware compression plus capacity-achieving channel coding (Compr+Cap): We first convert the sensor data into bitstreams and then assume the transmission of the bitstreams reaches the Shannon channel capacity. Under the considered channel condition, $\frac{l_{s}}{l_{b}}=0.289$. Note that it is hard to achieve the Shannon capacity in real systems, so this performance can only be regarded as an ideal reference for compression methods. To adjust the communication costs, we train the compression network under different $\beta$. During training, we first set $\beta=10^{-7}$ and gradually decrease $\beta$ to increase the bitstream length. Under each $\beta$, the network is first trained with the Adam optimizer at a learning rate of $5\times 10^{-5}$ for about $50$ epochs and then at $5\times 10^{-6}$ for another $50$ epochs.

  • Task-aware compression plus LDPC plus QAM (Compr+LDPC): The bitstreams from the task-aware compression method are transmitted with LDPC channel coding and quadrature amplitude modulation (QAM). As stated in earlier work [27], when $\text{SNR}=10$ dB for AWGN channels, we should use 16QAM and rate-2/3 LDPC. In this case, $\frac{l_{s}}{l_{b}}=0.376$.

  • Semantic communications without RAN (SemCom+noRAN): In earlier task-oriented semantic communication systems [29], the RAN is not implemented and all source data is transmitted at the same communication cost. We also remove the RAN from the proposed framework and report its performance as a reference. To adjust the communication costs, we change the channel dimension of $X$ in Eq. (9) from $16$ to $48$. Under each dimension of $X$, the network is trained for about $50$ epochs at a learning rate of $5\times 10^{-5}$ and then for another $50$ epochs at $5\times 10^{-6}$.

  • Semantic communications (SemCom): This is the proposed end-to-end semantic communication framework with rate control. To adjust the transmission rate, we train the network under different $\mu$ in Eq. (13). The training consists of two steps: we first set $f=4$ for all locations and train the SCE/SCD with all 48 symbols used for about $80$ epochs. After that, we set $\mu=10^{-3}$ and gradually increase it to decrease the communication costs. The RAN is trained with the SGD optimizer and the other parts are trained with the Adam optimizer, at a learning rate of $5\times 10^{-5}$ for $50$ epochs and then $5\times 10^{-6}$ for another $50$ epochs under each $\mu$. For low-$l_{s}$ situations, we transmit the first $6f$ symbols of $X$ rather than $12f$. Also, an exploration strategy is used when sampling $f$: the sampling probability is set to $0.6P_{F}+0.4\hat{p}$, where $\hat{p}(1)=\hat{p}(2)=\hat{p}(3)=\hat{p}(4)=0.25$, which encourages the RAN to explore other actions so that the SCE/SCD performs well on all of them (a sketch of this exploration strategy is given after this list).
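A brief sketch of the exploration strategy mentioned in the last item follows (probabilities as stated above; shapes assumed):

import torch

def sample_actions(P_F, explore=0.4):
    """P_F: (H/8, W/8, 4) action probabilities from the RAN."""
    uniform = torch.full_like(P_F, 1.0 / P_F.shape[-1])
    mixed = (1 - explore) * P_F + explore * uniform        # 0.6 * P_F + 0.4 * p_hat
    idx = torch.multinomial(mixed.reshape(-1, P_F.shape[-1]), 1)
    return idx.reshape(P_F.shape[:2]) + 1                  # actions f in {1, 2, 3, 4}

P_F = torch.softmax(torch.randn(8, 8, 4), dim=-1)
print(sample_actions(P_F))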

IV-B3 Comparison of semantic communications in fixed-sensing experiments

Figure 11: The performance comparison among different semantic communication frameworks in fixed-sensing experiments.

We show the performance comparison of the different semantic communication methods in the fixed-sensing experiment in Fig. 11, where $l_{avg}$ denotes the average number of modulated symbols $l_{s}$ used for the video clips in the test dataset. From the figure, 'Compr+LDPC' performs slightly better than 'SemCom+noRAN', while the proposed 'SemCom' method outperforms both by a relatively large margin, showing the advantage of jointly designing the channel coding and modulation. It also demonstrates the effectiveness of directly implementing the rate-distortion trade-off on the modulated symbols through the proposed RAN. Furthermore, the proposed 'SemCom' achieves performance similar to 'Compr+Cap', showing that semantic communications with joint designs are a promising way to approach the Shannon capacity.

IV-B4 Comparison of semantic communications in joint-sensing experiments

Figure 12: The performance comparison among different semantic communication frameworks in joint-sensing experiments.

The performance comparison of the different semantic communication methods in the joint-sensing experiment is shown in Fig. 12. From the figure, 'SemCom' performs significantly better than 'Compr+LDPC' and even surpasses 'Compr+Cap' for large $l_{avg}$, further proving the benefits of joint designs. However, we cannot conclude for certain that semantic communications surpass the Shannon capacity, as the performance of 'Compr+Cap' depends on our implementation of the task-aware compression method.

IV-B5 Implementation of conventional communication methods

We now explain how the sensor data can be transmitted in conventional communication systems and introduce their implementation details.

  • Transmit sensor data by joint source and channel coding (Sensordata+JSCC): In conventional communication systems, the most straightforward way is to transmit the raw sensor data directly, regardless of its usage. To simulate this process, the raw sensor data needs to be compressed and channel-coded. Recent work on deep JSCC [31] has shown that training a network for joint source and channel coding can perform better than standardized image compression methods combined with channel coding. Therefore, we follow its design and build a deep network similar to the SCE/SCD to transmit the raw sensor data. The network is optimized with the mean-squared error (MSE) between the original and reconstructed sensor data.

  • Transmit reconstructed video by H.264 video coding plus LDPC plus QAM (Video+H.264+LDPC): Another choice is to reconstruct the video first via a locally-deployed video reconstruction network and then transmit the reconstructed video. In this case, the computationally expensive reconstruction network needs to be run at the transmitter. To transmit the video, we use H.264 [36] for video source coding, LDPC for channel coding, and QAM for modulation.

IV-B6 Comparison of semantic communications with conventional communication systems

Figure 13: The performance comparison among different transmission methods.

The performance comparison between semantic communications and conventional communication systems is shown in Fig. 13. From the figure, 'Sensordata+JSCC' performs the worst because the data that is important for the video reconstruction network does not receive targeted protection during transmission. 'Video+H.264+LDPC' performs better than 'Sensordata+JSCC' as the videos have already been reconstructed at the transmitter side. However, this method is still not as effective as 'SemCom', which shows that we can achieve efficient transmission of sensor data without running the time-consuming and resource-intensive reconstruction algorithm at the transmitter side.

V Conclusion

In this work, we propose a novel video imaging system with learned compression ratios. We also propose a semantic communication framework for programmable sensors with task-oriented reconstruction algorithms. From the perspective of algorithm development, we show that by combining policy-gradient reinforcement learning and supervised learning, we can achieve an explicit (compression or transmission) rate-distortion trade-off in different settings. The proposed training pipelines can be extended to many other applications.

References

  • [1] P. E. Debevec and J. Malik, “Recovering high dynamic range radiance maps from photographs,” in Seminal Graph. Pap.: Pushing Boundaries, Vol. 2, 2023, pp. 643–652.
  • [2] S. W. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T. Barron, F. Kainz, J. Chen, and M. Levoy, “Burst photography for high dynamic range and low-light imaging on mobile cameras,” ACM Trans. Graph., vol. 35, no. 6, pp. 1–12, 2016.
  • [3] S. K. Nayar and T. Mitsunaga, “High dynamic range imaging: Spatially varying pixel exposures,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., vol. 1.   IEEE, 2000, pp. 472–479.
  • [4] Nayar and Branzoi, “Adaptive dynamic range imaging: Optical control of pixel exposures over space and time,” in Proc. IEEE Int. Conf. Comput. Vis.   IEEE, 2003, pp. 1168–1175.
  • [5] J. Gu, Y. Hitomi, T. Mitsunaga, and S. Nayar, “Coded rolling shutter photography: Flexible space-time sampling,” in Proc. IEEE Int. Conf. Comput. Photography.   IEEE, 2010, pp. 1–8.
  • [6] A. C. Sankaranarayanan, P. K. Turaga, R. G. Baraniuk, and R. Chellappa, “Compressive acquisition of dynamic scenes,” in Proc. Eur. Conf. Comput. Vis.   Springer, 2010, pp. 129–142.
  • [7] P. Llull, X. Liao, X. Yuan, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady, “Coded aperture compressive temporal imaging,” Opt. Express, vol. 21, no. 9, pp. 10 526–10 545, 2013.
  • [8] A. Wagadarikar, R. John, R. Willett, and D. Brady, “Single disperser design for coded aperture snapshot spectral imaging,” Appl. Opt., vol. 47, no. 10, pp. B44–B51, 2008.
  • [9] A. A. Wagadarikar, N. P. Pitsianis, X. Sun, and D. J. Brady, “Video rate spectral imaging using a coded aperture snapshot spectral imager,” Opt. Express, vol. 17, no. 8, pp. 6368–6388, 2009.
  • [10] X. Yuan, Y. Liu, J. Suo, and Q. Dai, “Plug-and-play algorithms for large-scale snapshot compressive imaging,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 1447–1457.
  • [11] S. J. Carey, A. Lopich, D. R. Barr, B. Wang, and P. Dudek, “A 100,000 fps vision sensor with embedded 535 GOPS/W 256×256 SIMD processor array,” in 2013 Symp. VLSI Circuits.   IEEE, 2013, pp. C182–C183.
  • [12] J. Chen, S. J. Carey, and P. Dudek, “Feature extraction using a portable vision system,” in IEEE/RSJ Int. Conf. Intell. Robots Syst., Workshop Vis.-based Agile Auton. Navigation UAVs, vol. 2, 2017.
  • [13] M. Z. Wong, B. Guillard, R. Murai, S. Saeedi, and P. H. Kelly, “Analognet: Convolutional neural network inference on analog focal plane sensor processors,” arXiv preprint arXiv:2006.01765, 2020.
  • [14] C. A. Metzler, H. Ikoma, Y. Peng, and G. Wetzstein, “Deep optics for single-shot high-dynamic-range imaging,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2020, pp. 1375–1385.
  • [15] J. N. Martel, L. K. Mueller, S. J. Carey, P. Dudek, and G. Wetzstein, “Neural sensors: Learning pixel exposures for hdr imaging and video compressive sensing with programmable sensors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 7, pp. 1642–1653, 2020.
  • [16] M. Iliadis, L. Spinoulas, and A. K. Katsaggelos, “Deepbinarymask: Learning a binary mask for video compressive sensing,” Digit. Signal Process., vol. 96, p. 102591, 2020.
  • [17] C. M. Nguyen, J. N. Martel, and G. Wetzstein, “Learning spatially varying pixel exposures for motion deblurring,” in Proc. IEEE Int. Conf. Comput. Photogr. (ICCP).   IEEE, 2022, pp. 1–11.
  • [18] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Adv. Neural Inf. Process. Syst., vol. 12, 1999.
  • [19] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy, “Improved image captioning via policy gradient optimization of spider,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 873–881.
  • [20] E. Vargas, J. N. Martel, G. Wetzstein, and H. Arguello, “Time-multiplexed coded aperture imaging: Learned coded aperture and pixel exposures for compressive imaging systems,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 2692–2702.
  • [21] F. Bellard. (2018) BPG image format. [Online]. Available: https://bellard.org/bpg/
  • [22] G. K. Wallace, “The jpeg still picture compression standard,” Commun. ACM, vol. 34, no. 4, pp. 30–44, 1991.
  • [23] J. Ballé, V. Laparra, and E. P. Simoncelli, “End-to-end optimized image compression,” in Int. Conf. Learn. Represent., 2017.
  • [24] J. Thorpe, “Low-density parity-check (ldpc) codes constructed from protographs,” IPN Prog. Rep., vol. 42, no. 154, pp. 42–154, 2003.
  • [25] H. Vangala, E. Viterbo, and Y. Hong, “A comparative study of polar code constructions for the awgn channel,” arXiv preprint arXiv:1501.02473, 2015.
  • [26] Z. Qin, X. Tao, W. Tong, J. Lu, and G. Y. Li, “Semantic communications: Principles and challenges,” arXiv preprint arXiv:2201.01389, 2021.
  • [27] B. Zhang, Z. Qin, and G. Y. Li, “Semantic communications with variable-length coding for extended reality,” IEEE J. Sel. Top., pp. 1–14, 2023.
  • [28] Q. Lan, D. Wen, Z. Zhang, Q. Zeng, X. Chen, P. Popovski, and K. Huang, “What is semantic communication? a view on conveying meaning in the era of machine intelligence,” J. Commun. Inf. Netw., vol. 6, no. 4, pp. 336–371, 2021.
  • [29] M. Jankowski, D. Gündüz, and K. Mikolajczyk, “Wireless image retrieval at the edge,” IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 89–100, 2020.
  • [30] M. Song, J. Choi, and B. Han, “Variable-rate deep image compression through spatially-adaptive feature transform,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 2380–2389.
  • [31] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, “Deep joint source-channel coding for wireless image transmission,” IEEE Trans. Cogn. Commun. Netw., vol. 5, no. 3, pp. 567–579, 2019.
  • [32] Y. Cai, J. Lin, Z. Lin, H. Wang, Y. Zhang, H. Pfister, R. Timofte, and L. M. Van Gool, “Multi-stage spectral-wise transformer for efficient spectral reconstruction,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 19–20.
  • [33] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proc. Med. Image Comput. Comput.-Assist. Interv.   Springer, 2015, pp. 234–241.
  • [34] S. Leonardos, W. Overman, I. Panageas, and G. Piliouras, “Global convergence of multi-agent policy gradient in markov potential games,” in Proc. Int. Conf. Learn. Represent., 2021.
  • [35] H. Kiani Galoogahi, A. Fagg, C. Huang, D. Ramanan, and S. Lucey, “Need for speed: A benchmark for higher frame rate object tracking,” in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 1125–1134.
  • [36] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview of the h. 264/avc video coding standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, 2003.