

Robust Message Embedding via Attention Flow-Based Steganography

Huayuan Ye1, Shenzhuo Zhang1, Shiqi Jiang1, Jing Liao2, Shuhang Gu3, Dejun Zheng4,
Changbo Wang1, Chenhui Li1∗
1 East China Normal University         2 City University of Hong Kong
3 University of Electronic Science and Technology of China, Chengdu, China
4 Zhijiang College of Zhejiang University of Technology, Zhejiang, China
[email protected]; {10195102459, 52265901032}@stu.ecnu.edu.cn; [email protected];
[email protected]; [email protected]; {cbwang, chli}@cs.ecnu.edu.cn
Abstract

Image steganography can hide information in a host image and obtain a stego image that is perceptually indistinguishable from the original one. This technique has tremendous potential in scenarios like copyright protection, information retrospection, etc. Some previous studies have proposed to enhance the robustness of the methods against image disturbances to increase their applicability. However, they generally cannot achieve a satisfactory balance between steganography quality and robustness. Instead of image-in-image steganography, we focus on the issue of message-in-image embedding that is robust to various real-world image distortions. This task aims to embed information into a natural image, and the decoding result is required to be completely accurate, which increases the difficulty of data concealing and revealing. Inspired by recent developments in transformer-based vision models, we discover that the tokenized representation of an image is naturally suitable for the steganography task. In this paper, we propose a novel message embedding framework, called Robust Message Steganography (RMSteg), which hides a message in a host image via QR Code based on a normalizing flow-based model. The stego image derived by our method has imperceptible changes, and the encoded message can be accurately restored even if the image is printed out and photographed. To the best of our knowledge, this is the first work that integrates the advantages of transformer models into normalizing flow. Our experimental results show that RMSteg has great potential for robust and high-quality message embedding.

1 Introduction

Figure 1: Compared with previous methods that can only embed limited bit-level information, RMSteg achieves a much higher embedding capacity and better steganography quality, and it can survive various real-world distortions.

Steganography, the art of hiding secret information in a carrier, has long been an active research topic. This technique can embed information like images and text into target containers, enabling copyright protection [10, 35], information retrospection [46, 50], etc. Unlike cryptography, steganography aims to prevent people from discovering the existence of secret data rather than understanding its meaning. Specifically, image steganography uses an image as the carrier to hide secret information.

Traditional image steganography methods mainly modify the image in the spatial domain [14, 17, 25, 28, 29] or a transform domain [2, 34, 53]. Such methods are easily detected by steganalysis techniques [9, 48], which makes them insecure. Recently, with the development of deep learning, several deep steganography methods have been proposed. Most of them are based on autoencoders [3, 42, 50, 52] and normalizing flow [5, 12, 16, 24, 45, 46].

Images can undergo various digital or real-world disturbances during dissemination. To enable stego images to survive these distortions, several robust steganography methods have been proposed. They consider various distortion situations like light field messaging (LFM) [42], JPEG compression [45], etc. In the field of robust steganography, robust message embedding is very promising for application scenarios like hyperlink hiding [35], metadata embedding [13, 50], etc. However, this task requires the decoding result to be completely error-free, which poses a challenge to the balance between stego image quality and decoding accuracy, especially when facing real-world distortions. Although some studies [10, 35] have proposed to hide messages in host images and try to make them survive printing and photography, which is among the most demanding situations requiring high steganography robustness, they cannot achieve sufficient steganography quality and capacity at the same time.

As one of the most widely utilized architectures, the normalizing flow-based model [6, 7, 18] has achieved impressive performance in various steganography tasks. Existing methods [5, 12, 16, 24, 45, 46] generally incorporate normalizing flow with a convolutional neural network (CNN) based backbone. However, according to our experiments, this kind of model design can lead to obvious artifacts in stego images when handling robust steganography tasks due to the lack of inner-channel feature fusion. Inspired by transformer-based vision models, we discover that the tokenized representation of an image is naturally suitable for robust steganography, which requires highly abstract feature learning. As a result, we aim to take advantage of it to address the robust message-in-image steganography problem.

In this paper, we propose a new framework for message embedding, called Robust Message Steganography (RMSteg); a simple demo is shown in Figure 1. We use QR Code as the message carrier and encode it into the host image. Unlike previous methods that directly encode the secret image, we propose an invertible QR Code transition as a preprocessing step, which transforms the QR Code based on the features of the host image, reducing artifacts in stego images while maintaining high decoding accuracy. We design a steganography model called AttnFlow, which integrates tokenized image representations into normalizing flow. We propose an attention affine coupling block (AACB) that leverages the attention mechanism [39, 8] instead of a traditional CNN for invertible steganography function learning, thus significantly improving the stego image quality. Compared with previous methods, our method can overcome the aforementioned difficulties and achieve robust, high-quality and high-capacity message-in-image steganography. The main contributions of this paper include three aspects:

  • We use QR Code as the message carrier and propose a transition scheme to transform the QR Code before steganography. This process can improve the stego image quality while maintaining decoding accuracy.

  • We propose an invertible token fusion module that can effectively improve the steganography quality by simply including a small learnable matrix.

  • We propose a normalizing flow-based steganography network that integrates the tokenized image representation. Our network can generate stego images with significantly higher quality and can survive extreme distortions. We use the case of printing and photography to validate our method’s effectiveness.

2 Related Work

2.1 Image Steganography

Traditional Steganography Image steganography hides information in an image by performing imperceptible changes on a host image. Traditional methods modify the image in the spatial or a transform domain [3]. Spatial-domain steganography generally leverages least-significant-bit (LSB) replacement [25], bit plane complexity segmentation (BPCS) [17, 28] and palette reordering [14, 29] to conceal information. However, this kind of scheme may introduce statistical anomalies that can be detected by steganalysis techniques [9, 48]. Some methods utilize high-dimensional features [30] and distortion constraints [20] to improve steganography security and quality. Transform-based steganography hides data in a transformed domain using the discrete cosine transform (DCT) [2] and discrete wavelet transform (DWT) [34, 53]. Due to their limited ability of feature representation and transformation, traditional methods generally cannot achieve satisfactory quality.

Deep Steganography Recently, various deep learning-based image steganography schemes have been proposed and have achieved impressive performance. HiDDeN [52] adopted the autoencoder (AE) to embed binary messages. Baluja [3] first utilized an end-to-end network to hide a color image in another. Some studies [10, 31, 32, 37, 36, 49] incorporated the generative adversarial network (GAN) [11] to reduce image artifacts and defend against steganalysis. More recently, the invertible neural network (INN) [6, 7, 18] has been widely used for steganography. These methods successfully hide single [5, 16, 24] or multiple [12, 43, 46] images in a carrier image. There are also some studies focusing on coverless steganography [21, 23, 26, 47], which directly transforms the secret information into a cover image. These methods mainly focus on improving the embedding capacity instead of robustness; as a result, they generally cannot survive image distortions.

Robust Steganography Robust steganography allows information decoding even if the images are interfered with by digital transmission or real-world distortions, which is meaningful for scenarios like copyright protection, secret communication, etc. VisCode [50] hides QR Codes in host images and can survive image brightness changes and slight tampering. LFM [42] is robust to light field messaging. RIIS [45] considers JPEG compression and various kinds of noise separately based on a conditional network. StegaStamp [35] and ChartStamp [10] take printing and photography into account but can only embed very little information at the cost of visual quality loss. As far as we know, existing methods cannot achieve both high-quality and high-capacity message steganography that is robust to extreme image distortions. In this paper, we aim to take advantage of transformer-based models to address this problem.

2.2 Normalizing Flow-Based Models

The normalizing flow model was first proposed as a generative model by Dinh et al. [6]. With further improvements by RealNVP [7] and GLOW [18], it is also known as the invertible neural network (INN). An INN learns an invertible function using a set of affine coupling layers with shared parameters to map the original data distribution to a simple distribution (e.g., a Gaussian distribution). Chen et al. [4] proposed an unbiased estimation for the normalizing flow model. i-RevNet [15] utilizes an explicit inversion to improve the invertible architecture.

Recently, normalizing flow has been applied to various downstream tasks in computer vision, such as image [44] and video [54] super-resolution, image-to-image translation [38], etc. In particular, in the field of steganography, normalizing flow-based methods [5, 24, 46] have shown promising performance. HiNet [16] introduces the discrete wavelet transform (DWT) to guide channel squeezing and improve the steganography quality. DeepMIH [12] hides single or multiple images with a saliency detection module. Mou et al. [27] incorporated a key-controllable network design to implement secure video steganography. Xu et al. [45] simulated distortions during model training to improve the robustness and security of their method. Although previous studies have leveraged various methods to improve the network architecture for better performance, they cannot attend to both image quality and steganography robustness simultaneously.

3 Method

Figure 2: The pipeline of RMSteg. We first transform the QR Code encoded with the secret message to make it easier to hide through an invertible neural network (a). After that, we perform invertible token fusion (ITF) (b) on the tokenized QR Code. We then use a normalizing flow-based model with attention affine coupling blocks (AACBs) to implement data concealing and revealing (c). During training, we employ a distortion simulation module (d) to simulate real-world image disturbances.

3.1 Overview

Given a secret message T_s, we first encode it into a QR Code image I_q. The concealing procedure aims to embed I_q into a host image I_h and derive a stego image I_s that is perceptually similar to I_h. Then, I_s may suffer from various real-world image distortions, resulting in a distorted image I_s'. After that, the revealing procedure aims to restore a QR Code \hat{I}_q from I_s' that can be successfully recognized to obtain the original message.

To achieve the aforementioned targets, we first leverage a QR Code transition scheme (Sec. 3.3) to transform the original QR Code according to the host image, reducing the artifacts it causes in the subsequent steganography process. Then, we use an invertible token fusion (ITF) module (Sec. 3.4) to improve the stego image quality. After that, we propose an AttnFlow model (Sec. 3.5) to perform message embedding. To make our method robust to real-world distortions, we incorporate a distortion simulation module during the training stage, which is described in detail in Sec. 3.6. Fig. 2 shows an overview of the RMSteg pipeline.
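The sketch below (not the authors' released code) illustrates how the concealing and revealing passes could be chained; the module and function names (iqrt, itf, attnflow, the tokenizers and detokenizers) are hypothetical placeholders for the components described in Sec. 3.3-3.5.

```python
# Hypothetical orchestration of the RMSteg pipeline; each module is assumed
# to expose a forward/inverse pair as described in Sec. 3.3-3.5.
import torch

def conceal(I_h, I_q, iqrt, tokenizer_h, tokenizer_q, itf, attnflow, detokenizer):
    I_q_star = iqrt.forward(I_q, I_h)            # invertible QR Code transition
    T_q0 = itf.forward(tokenizer_q(I_q_star))    # tokenize + invertible token fusion
    T_h0 = tokenizer_h(I_h)
    T_hn, _ = attnflow.forward(T_h0, T_q0)       # stacked AACBs
    return detokenizer(T_hn)                     # stego image I_s

def reveal(I_s_dist, iqrt, tokenizer_h, itf, attnflow, detokenizer_q):
    T_hn = tokenizer_h(I_s_dist)                 # tokenize the distorted stego image
    T_qn = torch.randn_like(T_hn)                # lost tokens sampled from N(0, I)
    _, T_q0 = attnflow.inverse(T_hn, T_qn)
    I_q_star = detokenizer_q(itf.inverse(T_q0))
    return iqrt.inverse(I_q_star, I_s_dist)      # decoded QR Code
```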

3.2 Preliminary: Normalizing Flow

Normalizing Flow [6, 7, 18], also called the invertible neural network (INN), is proposed to model a bijective projection from a complex distribution (e.g., images) to a tractable distribution (e.g., a Gaussian or Dirac distribution). This kind of model generally comprises several invertible affine coupling blocks (ACBs). The most basic ACB architecture is proposed by NICE [6], in which the input u^{i} of the i-th ACB is split into two parts, u^{i}_{1} and u^{i}_{2}, whose corresponding outputs are u^{i+1}_{1} and u^{i+1}_{2}, respectively. For the forward process, the following transformation is performed:

u^{i+1}_{1} = u^{i}_{1} + \sigma(u^{i}_{2}), \quad u^{i+1}_{2} = u^{i}_{2} + \delta(u^{i+1}_{1}),    (1)

where \sigma(\cdot) and \delta(\cdot) are arbitrary functions. Obviously, the backward process can be formulated as:

u^{i}_{2} = u^{i+1}_{2} - \delta(u^{i+1}_{1}), \quad u^{i}_{1} = u^{i+1}_{1} - \sigma(u^{i}_{2}).    (2)

In the normalizing flow architecture, \delta(\cdot) and \sigma(\cdot) in Equation 1 and Equation 2 can be implemented by neural network modules with shared parameters and an inverse calculation manner. By stacking multiple ACBs, the network can learn an invertible transformation between two distributions. Since this scheme is inherently suitable for steganography, many studies have utilized it for data hiding and proposed various improvements. In this paper, we further extend the ability of normalizing flow and propose a new network architecture for our robust message embedding task.
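As a concrete illustration of Equations 1 and 2, the following minimal PyTorch sketch implements an additive coupling block; the sub-networks sigma and delta are arbitrary small MLPs here, not the specific modules used by RMSteg.

```python
import torch
import torch.nn as nn

class AdditiveCouplingBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # sigma() and delta() can be any functions; tiny MLPs for illustration.
        self.sigma = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.delta = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, u1, u2):
        # Equation (1)
        v1 = u1 + self.sigma(u2)
        v2 = u2 + self.delta(v1)
        return v1, v2

    def inverse(self, v1, v2):
        # Equation (2): exact inversion, no learned decoder needed
        u2 = v2 - self.delta(v1)
        u1 = v1 - self.sigma(u2)
        return u1, u2
```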

3.3 Invertible QR Code Transition

For the message embedding task in this paper, the hidden QR code needs to be restored with enough accuracy to be identified by common devices like cell phones, webcams, etc. To balance the trade-off between the stego image quality and decoding accuracy, VisCode [50] obtains a visual saliency map to guide the QR Code embedding while ChartStamp [10] utilizes the semantic segmentation result as the training loss guidance. Although this kind of rule-based strategy can improve the visual quality of the stego image, it does not consider the inherent relationship between QR Codes and the host image.

In our method, we adopt a more direct approach: modifying the QR Code image according to the host image (shown in Figure 2 (a)). We call it invertible QR Code transition (IQRT). The key idea of IQRT is that the QR Code used for steganography does not need to stay black-and-white to keep its information. Thus, a learnable transformation can be applied to the QR Code for better steganography quality as long as the transformed code is still identifiable. Formally, given a host image I_h and a QR Code I_q of the same size, we use an off-the-shelf INN architecture proposed by ISN [24] (we only use 2 invertible blocks instead of the 16 in the original paper) to learn an invertible function f(\cdot) that derives the transformed QR Code I_{q}^{*} by I_{q}^{*} = f(I_q, I_h). In the reverse process, the restored QR Code \hat{I}_q can be obtained by \hat{I}_q = f^{-1}(I_{q}^{*}, I_s'), where f^{-1}(\cdot) is the inverse function of f(\cdot) defined by normalizing flow and I_s' is the distorted stego image. Here I_s' is used instead of I_h since the latter is unknown in the decoding procedure.

During network training, the transition network is jointly trained with the subsequent steganography network. We employ the same constraint as ArtCoder [33] on the transformed QR Code to ensure that it is still identifiable. Specifically, a Gaussian convolution kernel is applied to each code module to simulate the QR Code scanning procedure. For more details, we suggest referring to the original paper [33]; a detailed explanation is also provided in the appendix. We do not apply extra constraints to the transition network so that it can learn the best transition strategy according to the overall optimization targets. Figure 3 shows some transition results. It can be observed that the transformed QR Codes have obviously lower brightness. However, with the aforementioned constraint, the transformed QR Codes are still identifiable, guaranteeing almost no information loss.

Figure 3: Some QR Code transition results; the transformed QR Codes are still identifiable.

3.4 Invertible Token Fusion

With the transformed QR Code, we first use a ViT [8] to obtain a tokenized representation T_q \in \mathbb{R}^{N \times D}, in which N is the number of tokens and D represents the token dimensionality. Inspired by the invertible 1x1 convolution proposed by GLOW [18], before feeding T_q to the subsequent steganography network, we put forth an invertible token fusion (ITF) module (as shown in Figure 2 (b)) to transform the QR Code tokens for better steganography quality.

Formally, we use a learnable matrix \mathcal{M} \in \mathbb{R}^{N \times N}, which is initialized as an orthogonal matrix using Cholesky decomposition [19], as a transform matrix for T_q. In the steganography process, T_q is transformed by performing a matrix multiplication: T_q' = \mathcal{M} \cdot T_q. Obviously, in the decoding procedure, the restored tokens \hat{T}_q can be obtained by \hat{T}_q = \mathcal{M}^{-1} \cdot \hat{T}_q', where \mathcal{M}^{-1} is the inverse matrix.

Different from GLOW [18], which utilizes the invertible convolution to learn a channel-wise fusion strategy, our ITF module learns a patch-wise transformation that enables inner-channel feature interaction. Our experiments also show that ITF can efficiently and effectively improve the steganography quality by simply introducing the aforementioned learnable matrix.
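A minimal sketch of the ITF idea is given below; the orthogonal initialization uses torch.nn.init.orthogonal_ as a stand-in for the Cholesky-based initialization mentioned above, and the matrix inverse is computed explicitly for the decoding direction.

```python
import torch
import torch.nn as nn

class InvertibleTokenFusion(nn.Module):
    def __init__(self, num_tokens):
        super().__init__()
        M = torch.empty(num_tokens, num_tokens)
        nn.init.orthogonal_(M)                   # invertible at initialization
        self.M = nn.Parameter(M)                 # learnable N x N fusion matrix

    def forward(self, T_q):
        # T_q: (B, N, D); mix information across the token (patch) dimension.
        return torch.einsum('ij,bjd->bid', self.M, T_q)

    def inverse(self, T_q_prime):
        # Exact inversion via the matrix inverse of the learned fusion matrix.
        return torch.einsum('ij,bjd->bid', torch.linalg.inv(self.M), T_q_prime)
```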

3.5 Steganography with AttnFlow

Previous steganography studies based on normalizing flow generally adopt a convolutional neural network (CNN) based backbone, mostly DenseNet [40], to construct the affine coupling blocks (ACBs). This kind of design only considers channel-wise feature fusion and can lead to perceptible artifacts in stego images, especially in the robust steganography task. Motivated by the impressive performance recently achieved by transformer-based [8, 39] vision models, we propose a model called AttnFlow that introduces the attention mechanism into normalizing flow to implement robust steganography.

As shown in Figure 2 (c), similar to an ordinary normalizing flow, AttnFlow contains several attention affine coupling blocks (AACBs) for invertible function learning. Assume that the input of the i-th AACB is split into T_h^{(i-1)} and T_q^{(i-1)}, corresponding to the host image tokens and QR Code tokens, respectively. Specifically, T_h^{(0)} is the tokenized host image obtained with a basic ViT [8] and T_q^{(0)} represents the QR Code image tokens output by the ITF module. For the i-th AACB, we perform the following affine transformation:

T_h^{(i)} = T_h^{(i-1)} + \phi(T_q^{(i-1)}) + \mathcal{C}(T_q^{(i-1)}, T_h^{(0)}) \times \alpha_i,
T_q^{(i)} = \eta(T_h^{(i)}) + T_q^{(i-1)} \odot \exp(\rho(T_h^{(i)})),    (3)

in which \phi(\cdot), \eta(\cdot), \rho(\cdot) are self-attention blocks [39] followed by a feedforward multilayer perceptron (MLP), \mathcal{C}(q, kv) represents the cross-attention block [39], \exp(\cdot) is the exponential function, \odot indicates the Hadamard product and \alpha_i is an independent trainable coefficient for each AACB. We calculate the attention value with:

Attn(Q, K, V) = M \cdot V, \quad M = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right),    (4)

where Q, K, V are derived from the learned projections and d is the dimension of the projected tokens. As described in Equation 3, in addition to the self-attention value, we also calculate the cross-attention value of the initial host image tokens T_h^{(0)} upon the QR Code tokens T_q^{(i-1)} for each AACB. Then, we add these values to T_h^{(i-1)} to help the AACBs gradually integrate the information from the QR Code into the image. For the QR Code tokens, we adopt the generally incorporated [5, 16, 24, 46] affine transformation and replace the original convolutional blocks with \eta(\cdot) and \rho(\cdot). After n AACBs, T_h^{(n)} further goes through a detokenizer (the detailed architecture of the tokenizers and detokenizers is described in the appendix), resulting in the final stego image. Although some methods further map T_h^{(n)} and T_q^{(n)} as a conditional distribution for better performance, here we adopt the same assumption as HiNet [16], simply positing that T_q^{(n)} obeys a Gaussian distribution.

In the revealing process, we aim to restore the original QR Code from a distorted stego image I_s'. We first tokenize it and derive \hat{T}_h^{(n)}. Then, we obtain \hat{T}_q^{(n)} by sampling from a standard Gaussian distribution. After that, we perform the inverse AACB transformation by going through the AACBs in an inverse calculation manner:

\hat{T}_q^{(i-1)} = (\hat{T}_q^{(i)} - \eta(\hat{T}_h^{(i)})) \odot \exp(-\rho(\hat{T}_h^{(i)})),
\hat{T}_h^{(i-1)} = \hat{T}_h^{(i)} - \phi(\hat{T}_q^{(i-1)}) - \mathcal{C}(\hat{T}_q^{(i-1)}, \hat{T}_h^{(0)}) \times \alpha_i,    (5)

in which \hat{T}_h^{(0)} is obtained by tokenizing I_s' since I_h is unknown in the decoding process. Then, \hat{T}_q^{(0)} is detokenized and fed into the reversed QR Code transition (introduced in Sec. 3.3) together with I_s' to get the final decoded QR Code image.
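To make Equations 3 and 5 concrete, the following simplified sketch implements one AACB; the self- and cross-attention sub-blocks are plain nn.MultiheadAttention layers followed by a small MLP, which simplifies the transformer blocks actually used, and tensors are assumed to have shape (batch, tokens, dim).

```python
import torch
import torch.nn as nn

class SelfAttnBlock(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, x):
        return self.mlp(self.attn(x, x, x)[0])

class AACB(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.phi, self.eta, self.rho = SelfAttnBlock(dim), SelfAttnBlock(dim), SelfAttnBlock(dim)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)  # C(q, kv)
        self.alpha = nn.Parameter(torch.tensor(0.01))                     # trainable coefficient

    def forward(self, T_h, T_q, T_h0):                  # Equation (3)
        T_h = T_h + self.phi(T_q) + self.cross(T_q, T_h0, T_h0)[0] * self.alpha
        T_q = self.eta(T_h) + T_q * torch.exp(self.rho(T_h))
        return T_h, T_q

    def inverse(self, T_h, T_q, T_h0):                  # Equation (5)
        T_q = (T_q - self.eta(T_h)) * torch.exp(-self.rho(T_h))
        T_h = T_h - self.phi(T_q) - self.cross(T_q, T_h0, T_h0)[0] * self.alpha
        return T_h, T_q
```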

3.6 Optimization Target and Training Strategy

Distortion Simulation Module We use a module to simulate the distortions that stego images may undergo during printing and photography. In this paper, we adopt the simulation module proposed by StegaStamp [35], which considers color shifting, blurring, noising, etc. We mainly modify the standard deviation of the Gaussian noise from 0.02 to 0.07 and increase the JPEG compression quality from 25 to 60 for our task. During training, we apply random distortion combinations to stego images to simulate real-world image disturbances.
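A hedged sketch of such a simulation pipeline is given below; it applies Gaussian noise (sigma = 0.07 as stated above), blurring and simple brightness/contrast shifts, and it omits the differentiable JPEG approximation used by StegaStamp for brevity, so it is an illustration rather than the exact module.

```python
import torch
import torchvision.transforms.functional as TF

def simulate_distortions(stego, sigma=0.07):
    # stego: (B, 3, H, W) in [0, 1]
    x = stego + torch.randn_like(stego) * sigma                 # Gaussian noise
    x = TF.gaussian_blur(x, kernel_size=7)                      # defocus / motion proxy
    b = 1.0 + (torch.rand(1, device=x.device) - 0.5) * 0.6      # brightness in [0.7, 1.3]
    c = 0.5 + torch.rand(1, device=x.device)                    # contrast in [0.5, 1.5]
    x = ((x - 0.5) * c + 0.5) * b                                # simple color shifting
    return x.clamp(0.0, 1.0)
```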

Loss Function The aforementioned three networks (IQRT, ITF and AttnFlow) are trained jointly. We use the following loss functions to guide the training process:

\mathcal{L}_{steg}^{L1} = \left\| I_{h} - I_{s} \right\|_{1},    (6)
\mathcal{L}_{steg}^{ssim} = ssim(I_{h}, I_{s}),    (7)
\mathcal{L}_{steg}^{lpips} = lpips(I_{h}, I_{s}),    (8)
\mathcal{L}_{qr} = \left\| I_{q} - \hat{I}_{q} \right\|_{1},    (9)

in which ssim(\cdot) represents the structural similarity index [41] and lpips(\cdot) indicates the perceptual loss [51]. Besides, as introduced in Sec. 3.3, an additional QR Code transition loss \mathcal{L}_{t} is incorporated. The overall loss function is the weighted sum of the above functions:

\mathcal{L}_{total} = \alpha \mathcal{L}_{steg}^{L1} + \beta \mathcal{L}_{steg}^{ssim} + \gamma \mathcal{L}_{steg}^{lpips} + \delta \mathcal{L}_{qr} + \epsilon \mathcal{L}_{t},    (10)

where \alpha, \beta, \gamma, \delta, \epsilon are weight coefficients.
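A minimal composition of Equations 6-10 is sketched below; ssim_fn and lpips_fn stand for any standard SSIM / LPIPS implementations (e.g., from the pytorch-msssim and lpips packages), the SSIM term is implemented as 1 - SSIM so that minimizing it increases similarity, and the weights shown are the values reported in the appendix.

```python
import torch.nn.functional as F

def total_loss(I_h, I_s, I_q, I_q_hat, L_t, ssim_fn, lpips_fn,
               alpha=5.0, beta=0.2, gamma=3.5, delta=16.0, eps=3.0):
    l_l1 = F.l1_loss(I_s, I_h)                      # Eq. (6)
    l_ssim = 1.0 - ssim_fn(I_s, I_h)                # Eq. (7), implemented as 1 - SSIM
    l_lpips = lpips_fn(I_s, I_h).mean()             # Eq. (8)
    l_qr = F.l1_loss(I_q_hat, I_q)                  # Eq. (9)
    return (alpha * l_l1 + beta * l_ssim + gamma * l_lpips
            + delta * l_qr + eps * L_t)             # Eq. (10)
```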

Figure 4: Stego images and decoded QR Codes under different distortions. QR Codes with green borders can be recognized while those with red borders cannot. Zoom in for better observation.
Table 1: Steganography quality under different situations. Here σ represents the standard deviation of the Gaussian noise (given that image pixel values range in [0, 1]). The best and second-best results are marked in red and blue, respectively.
Method | Stego Image (PSNR↑ / SSIM↑ / LPIPS↓) | σ=0.1 (TRA↑ / EMR↓) | σ=0.15 (TRA↑ / EMR↓) | JPEG Q=20 (TRA↑ / EMR↓) | JPEG Q=40 (TRA↑ / EMR↓) | Mixed (TRA↑ / EMR↓) | Printing (TRA↑ / EMR↓)
ISN† | 32.175 / 0.8765 / 0.3266 | 0.728 / 1.563 | 0.178 / 5.020 | 0.991 / 0.721 | 0.999 / 0.184 | 0.713 / 3.131 | 0.960 / 1.125
HiNet† | 31.629 / 0.8662 / 0.3423 | 0.827 / 1.077 | 0.162 / 3.724 | 0.986 / 0.573 | 0.997 / 0.099 | 0.677 / 3.426 | 0.970 / 1.619
StegaStamp | 21.215 / 0.7027 / 0.3055 | 0.051 / 6.152 | 0.000 / 10.57 | 0.951 / 1.259 | 0.977 / 0.798 | 0.557 / 3.843 | 0.750 / 3.214
StegaStamp† | 21.173 / 0.6903 / 0.3418 | 0.481 / 3.298 | 0.015 / 6.500 | 0.953 / 1.104 | 0.969 / 0.833 | 0.693 / 2.975 | 0.900 / 1.917
Ours | 32.883 / 0.9109 / 0.0707 | 0.794 / 1.235 | 0.216 / 3.306 | 0.995 / 0.117 | 1.000 / 0.038 | 0.859 / 0.861 | 1.000 / 0.606

4 Experiment

4.1 Experimental Settings

Datasets Our training and testing datasets of host images are the train2017 (118K) and test2017 (41K) sets of COCO [22], respectively. For QR Code images, we manually construct the training (50K) and testing (41K) datasets with randomly encoded messages. We generate the QR Code images using QR Code version 5 [1] with the highest error correction (ECC) level 'H'. We use this code version for most of our experiments except the evaluation in Sec. 4.3. The images used for training and testing are 224 x 224 and the patch size of the ViT [8] is 16.

Metrics Our experiments focus on two aspects: stego image quality and decoding accuracy. For stego image quality, we use the peak signal-to-noise ratio (PSNR), SSIM [41] and LPIPS [51] to measure the difference between host images and stego images. For decoding accuracy, we adopt the text recovery accuracy (TRA) [46, 50], which is the ratio of the successfully decoded QR Codes. In addition, we calculate the error module rate (EMR), which represents the error rate (in percentage) of the modules in the QR Code.

Figure 5: Stego images generated by different methods.

Baselines We compare our method with several state-of-the-art methods, including ISN [24], HiNet [16] and StegaStamp [35] (since RIIS [45] has not released its source code or pre-trained model, we cannot compare with it). Since these methods are not designed specifically for our task, we train them on our datasets for a fair comparison. Moreover, since ISN and HiNet are not robust steganography methods, we incorporate the distortion simulation module when training them. The re-trained models of these two methods are denoted as ISN† and HiNet†, respectively. For StegaStamp, we also additionally train it using the same distortion level as our method, denoted as StegaStamp†.

Figure 6: Photos of printed stego images and their decoding errors under different shooting situations. The actual image used for decoding is cropped out and resized from the photo. Photos shown in this figure are intended for an intuitive demonstration of the shooting results. QR Codes with green borders can be recognized while those with red borders cannot. Zoom in for better observation.
Figure 7: TRA and EMR under different levels of distortions.

4.2 Steganography Quality

Steganography quality indicates both stego image quality and decoding accuracy. To compare the robustness of different methods against image distortions, we first consider several manually created situations (experiments under more situations are presented in the appendix): Gaussian noise, JPEG compression and random noise combinations. We generate random noise combinations (denoted Mixed) with the distortion simulation module introduced in Sec. 3.6. We then consider the printing-and-photography case to validate the methods' robustness against real-world distortions, since it is one of the most severe distortion situations and contains mixed disturbance factors. We randomly select 100 host images and embed random messages in them. We then use an inkjet printer to print the encoded images and take photos with a cell phone. To eliminate potential errors caused by factors like print quality, we repeat the experiment 5 times and choose the best result.

Table 1 shows the experimental results; Figure 4 and Figure 5 show some qualitative results. It can be observed that our method achieves higher stego image quality, especially for LPIPS, which represents perceptual similarity. StegaStamp incorporates adversarial training [11] to make the generated stego image more realistic, and it does work when facing low-level distortions. However, as the distortion level used during training increases, StegaStamp† fails to preserve a sound visual quality and instead introduces hue shifting and artifacts. In addition, adversarial training can sometimes bring severe artifacts in some regions, as shown in the 3rd and 4th rows of Figure 4. For HiNet and ISN, which both leverage normalizing flow, the lack of inner-channel interaction in their CNN-based affine blocks leads to obvious QR Code-like artifacts in the stego images, making the existence of the secret message easy to detect. In terms of decoding accuracy, although HiNet† outperforms our method in some cases, we achieve the best performance in the mixed noise and printing tests, which are closer to real-world application scenarios.

Figure 7 shows the decoding accuracy under more levels of distortions: Gaussian noise with standard deviation ranging from 0.02 to 0.2 and JPEG compression with quality ranging from 10 to 90. It can be observed that our method demonstrates stable and good performance under these situations.

Figure 8: Stego images (the residual is shown in the upper right) and decoding results for printing situation with different code versions.

In practical application scenarios, the shooting conditions may vary from time to time, and a good message embedding method should keep its robustness in most cases. As a result, we further measure the decoding accuracy under different shooting situations. We mainly consider the shooting distance and angle (the offset relative to vertical shooting). Given the default shooting distance and angle of this paper as 11 cm and 0°, we gradually increase these two values; the test results are shown in Table 2. It can be found that, as the shooting distance and angle grow, the EMRs exhibit a significant increase for all methods. However, for TRA, which directly reflects the identifiability of the QR Codes, our method maintains a fair performance. The comparison shown in Figure 6 also indicates that RMSteg can achieve high decoding accuracy under different shooting situations.

Table 2: Decoding accuracy under different shooting situations. Here d indicates the shooting distance (in cm) and α is the shooting angle offset (in degrees). The best and second-best results are marked in red and blue, respectively.
Method | d=11, α=0 (TRA↑ / EMR↓) | d=13 (TRA↑ / EMR↓) | d=15 (TRA↑ / EMR↓) | α=10 (TRA↑ / EMR↓) | α=20 (TRA↑ / EMR↓) | α=30 (TRA↑ / EMR↓)
ISN† | 0.960 / 1.125 | 0.920 / 2.173 | 0.790 / 2.780 | 0.910 / 1.859 | 0.820 / 2.394 | 0.590 / 3.389
HiNet† | 0.970 / 1.619 | 0.850 / 3.033 | 0.680 / 3.909 | 0.880 / 2.396 | 0.800 / 3.362 | 0.660 / 3.828
StegaStamp | 0.750 / 3.214 | 0.610 / 4.131 | 0.350 / 5.321 | 0.540 / 3.784 | 0.360 / 4.258 | 0.040 / 5.916
StegaStamp† | 0.900 / 1.917 | 0.790 / 2.469 | 0.710 / 2.922 | 0.860 / 2.272 | 0.840 / 2.413 | 0.570 / 3.978
Ours | 1.000 / 0.606 | 1.000 / 0.891 | 0.980 / 1.269 | 1.000 / 0.953 | 1.000 / 1.163 | 0.960 / 1.680
Table 3: Model performance under different QR Code versions. The numbers in parentheses indicate the encoding capacity in bits.
Version | Stego Image (PSNR↑ / SSIM↑ / LPIPS↓) | Mixed (TRA↑ / EMR↓) | Printing (TRA↑ / EMR↓)
v5 (368) | 32.883 / 0.9109 / 0.0707 | 0.859 / 0.861 | 1.000 / 0.606
v6 (480) | 31.363 / 0.8903 / 0.0892 | 0.859 / 0.934 | 1.000 / 0.877
v7 (528) | 31.167 / 0.8880 / 0.0902 | 0.782 / 1.216 | 0.890 / 1.290
v8 (688) | 30.765 / 0.8762 / 0.1020 | 0.743 / 1.370 | 0.820 / 1.476

4.3 Quality under Different Embedding Capacity

To validate the generality of our method, we test it under different embedding capacities, i.e., using QR Codes of different versions for training. We report the model performance on code versions from v5 to v8 in Table 3. Although the steganography quality degrades as the embedding capacity increases, the artifacts in the stego images are still imperceptible, especially when the image is printed out, as shown in Figure 8. In addition, RMSteg keeps a TRA of more than 0.8 even for code v8, whose embedding capacity is nearly twice that of v5. For StegaStamp [35], which is also designed for robust message embedding, the PSNR is lower than 25 when encoding 200 bits in a 400 x 400 image according to the original paper. In contrast, our method keeps a PSNR of around 30 when encoding more than 600 bits in a 224 x 224 image. Thus, RMSteg can achieve higher steganography quality while the embedding capacity is much larger.

4.4 Ablation Study

We conduct an ablation study to validate the effectiveness of the invertible QR Code transition (IQRT), the invertible token fusion (ITF) module and the AttnFlow model. The result is shown in Table 4.

IQRT The model without IQRT performs slightly better than the full model in decoding accuracy. This is because IQRT may sometimes cause information loss, e.g., some code modules could be wrongly transformed during this procedure, although the QR Code is still identifiable. On the other hand, IQRT largely improves the stego image quality.

ITF Module As discussed in Sec. 3.4, the ITF module can learn a transformation for image tokens, thereby leading to better stego image quality. We also find that the ITF module helps derive a better distribution of the artifacts brought by message embedding. As shown in Figure 9, the stego image generated with ITF has much less distortion in homogeneous regions (i.e., the sky), achieving better visual quality.

Table 4: Ablation study results. Here CAT indicates cross attention and TR represents the tokenized representation. The best and second-best results are marked in red and blue, respectively.
Method | Stego Image (PSNR↑ / SSIM↑ / LPIPS↓) | Mixed (TRA↑ / EMR↓)
w/o IQRT | 30.662 / 0.8651 / 0.1059 | 0.871 / 0.828
w/o ITF | 31.422 / 0.8771 / 0.0919 | 0.833 / 0.947
w/o CAT | 32.221 / 0.9036 / 0.0859 | 0.845 / 0.954
w/o TR + ISN | 32.444 / 0.8856 / 0.3209 | 0.692 / 3.698
w/o TR + HiNet | 31.513 / 0.8674 / 0.3370 | 0.714 / 3.321
1 AACB | 30.426 / 0.8728 / 0.1076 | 0.798 / 1.135
2 AACBs | 31.083 / 0.8972 / 0.0831 | 0.819 / 0.924
3 AACBs | 31.649 / 0.9008 / 0.0796 | 0.819 / 0.924
Ours (Full Model) | 32.883 / 0.9109 / 0.0707 | 0.856 / 0.861

Cross Attention in AACB We train our model by removing the cross attention blocks defined in Equation 3 and the result shows that this design can improve the overall model performance.

Tokenized Representation We replace the AttnFlow model with ISN and HiNet (two CNN-based normalizing flow models), respectively, to validate the effectiveness of introducing the tokenized image representation (the IQRT module is retained). The results show that, compared with the CNN-based scheme, incorporating the tokenized image representation makes normalizing flow more competent for the robust steganography task.

AACB Number The results show that increasing the number of AACBs yields better model performance, which is consistent with the fundamentals of normalizing flow.

Figure 9: Stego images generated with and without ITF.

5 Conclusion

We propose a robust message embedding framework based on an attention flow-based model, called RMSteg. Our method is capable of generating stego images that can survive various real-world distortions, especially printing and photography. To the best of our knowledge, RMSteg is the first method that introduces the transformer-based attention mechanism into normalizing flow. Our experiments show that this scheme is competent for steganography tasks. Compared with existing methods, RMSteg achieves better performance in robust and high-quality message embedding. We believe this is to a large extent due to the incorporation of the tokenized image representation, and we hope this scheme can inspire subsequent studies.

References

  • qrs [2015] Information technology — automatic identification and data capture techniques — qr code bar code symbology specification. ISO/IEC 18004:2015, 2015.
  • Almohammad et al. [2008] Adel Almohammad, Robert M Hierons, and Gheorghita Ghinea. High capacity steganographic method based upon jpeg. In Intl. Conf. Availability Reliability Security, pages 544–549. IEEE, 2008.
  • Baluja [2017] Shumeet Baluja. Hiding images in plain sight: Deep steganography. Adv. Neural Inf. Process. Syst. (NIPS), 30, 2017.
  • Chen et al. [2019] Ricky TQ Chen, Jens Behrmann, David K Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. Advances in Neural Information Processing Systems, 32, 2019.
  • Cheng et al. [2021] Ka Leong Cheng, Yueqi Xie, and Qifeng Chen. Iicnet: A generic framework for reversible image conversion. In Proc. IEEE/CVF Intl. Conf. Comput. Vis., pages 1991–2000, 2021.
  • Dinh et al. [2014] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. In ICLR, 2014.
  • Dinh et al. [2016] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. In ICLR, 2016.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Fridrich et al. [2001] Jessica Fridrich, Miroslav Goljan, and Rui Du. Detecting lsb steganography in color, and gray-scale images. IEEE Multimed., 8(4):22–28, 2001.
  • Fu et al. [2022] Jiayun Fu, Bin B Zhu, Haidong Zhang, Yayi Zou, Song Ge, Weiwei Cui, Yun Wang, Dongmei Zhang, Xiaojing Ma, and Hai Jin. Chartstamp: Robust chart embedding for real-world applications. In Proc. 30th ACM Intl. Conf. Multimedia, pages 2786–2795, 2022.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Guan et al. [2022] Zhenyu Guan, Junpeng Jing, Xin Deng, Mai Xu, Lai Jiang, Zhou Zhang, and Yipeng Li. Deepmih: Deep invertible network for multiple image hiding. IEEE Trans. Pattern Anal. Mach. Intell., 45(1):372–390, 2022.
  • Hota and Huang [2019] Alok Hota and Jian Huang. Embedding meta information into visualizations. IEEE Transactions on Visualization and Computer Graphics, 26(11):3189–3203, 2019.
  • Imaizumi and Ozawa [2014] Shoko Imaizumi and Kei Ozawa. Multibit embedding algorithm for steganography of palette-based images. In Image and Video Technology: 6th Pacific-Rim Symposium, PSIVT 2013, Guanajuato, Mexico, October 28-November 1, 2013. Proceedings 6, pages 99–110, 2014.
  • Jacobsen et al. [2018] Jörn-Henrik Jacobsen, Arnold Smeulders, and Edouard Oyallon. i-revnet: Deep invertible networks. arXiv preprint arXiv:1802.07088, 2018.
  • Jing et al. [2021] Junpeng Jing, Xin Deng, Mai Xu, Jianyi Wang, and Zhenyu Guan. Hinet: Deep image hiding by invertible network. In Proc. IEEE/CVF Intl. Conf. Comput. Vis., pages 4733–4742, 2021.
  • Kawaguchi and Eason [1999] Eiji Kawaguchi and Richard O Eason. Principles and applications of bpcs steganography. In Multimedia Syst. Appl., pages 464–473. SPIE, 1999.
  • Kingma and Dhariwal [2018] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In NeurIPS, 2018.
  • Krishnamoorthy and Menon [2013] Aravindh Krishnamoorthy and Deepak Menon. Matrix inversion using cholesky decomposition. In 2013 signal processing: Algorithms, architectures, arrangements, and applications (SPA), pages 70–72. IEEE, 2013.
  • Li et al. [2014] Bin Li, Ming Wang, Jiwu Huang, and Xiaolong Li. A new cost function for spatial image steganography. In ICIP, 2014.
  • Li et al. [2022] Yung-Hui Li, Ching-Chun Chang, Guo-Dong Su, Kai-Lin Yang, Muhammad Saqlain Aslam, and Yanjun Liu. Coverless image steganography using morphed face recognition based on convolutional neural network. EURASIP Journal on Wireless Communications and Networking, 2022(1):1–21, 2022.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • Liu et al. [2020] Qiang Liu, Xuyu Xiang, Jiaohua Qin, Yun Tan, and Yao Qiu. Coverless image steganography based on densenet feature mapping. EURASIP Journal on Image and Video Processing, 2020:1–18, 2020.
  • Lu et al. [2021] Shao-Ping Lu, Rong Wang, Tao Zhong, and Paul L Rosin. Large-capacity image steganography based on invertible neural networks. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 10816–10825, 2021.
  • Mielikainen [2006] Jarno Mielikainen. Lsb matching revisited. IEEE Signal Process. Lett., 13(5):285–287, 2006.
  • Mohamed et al. [2021] Mohammed Saad Mohamed, EH Hafez, et al. Coverless image steganography based on jigsaw puzzle image generation. Computers, Materials and Continua, 67(2):2077–2091, 2021.
  • Mou et al. [2023] Chong Mou, Youmin Xu, Jiechong Song, Chen Zhao, Bernard Ghanem, and Jian Zhang. Large-capacity and flexible video steganography via invertible neural network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22606–22615, 2023.
  • Nguyen et al. [2006] Bui Cong Nguyen, Sang Moon Yoon, and Heung-Kyu Lee. Multi bit plane image steganography. In Digital Watermarking: 5th International Workshop, IWDW 2006, Jeju Island, Korea, November 8-10, 2006. Proceedings 5, pages 61–70, 2006.
  • Niimi et al. [2002] Michiharu Niimi, Hideki Noda, Eiji Kawaguchi, and Richard O Eason. High capacity and secure digital steganography to palette-based images. In Proceedings. International conference on image processing, pages II–II. IEEE, 2002.
  • Pevnỳ et al. [2010] Tomáš Pevnỳ, Tomáš Filler, and Patrick Bas. Using high-dimensional image models to perform highly undetectable steganography. In Information Hiding, IH, pages 161–177. Springer, 2010.
  • Qin et al. [2020] Jiaohua Qin, Jing Wang, Yun Tan, Huajun Huang, Xuyu Xiang, and Zhibin He. Coverless image steganography based on generative adversarial network. Math., 8(9):1394, 2020.
  • Shi et al. [2017] Haichao Shi, Jing Dong, Wei Wang, Yinlong Qian, and Xiaoyu Zhang. Ssgan: secure steganography based on generative adversarial networks. In Pacific Rim Conference on Multimedia, 2017.
  • Su et al. [2021] Hao Su, Jianwei Niu, Xuefeng Liu, Qingfeng Li, Ji Wan, Mingliang Xu, and Tao Ren. Artcoder: An end-to-end method for generating scanning-robust stylized qr codes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2277–2286, 2021.
  • Swanson et al. [1997] Mitchell D Swanson, Bin Zhu, Benson Chau, and Ahmed H Tewfik. Multiresolution video watermarking using perceptual models and scene segmentation. In Proc. Intl. Conf. Image Process., pages 558–561. IEEE, 1997.
  • Tancik et al. [2020] Matthew Tancik, Ben Mildenhall, and Ren Ng. Stegastamp: Invisible hyperlinks in physical photographs. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 2117–2126, 2020.
  • Tang et al. [2017] Weixuan Tang, Shunquan Tan, Bin Li, and Jiwu Huang. Automatic steganographic distortion learning using a generative adversarial network. IEEE Signal Processing Letters, 2017.
  • Tang et al. [2019] Weixuan Tang, Bin Li, Shunquan Tan, Mauro Barni, and Jiwu Huang. Cnn-based adversarial embedding for image steganography. TIFS, 2019.
  • van der Ouderaa and Worrall [2019] Tycho FA van der Ouderaa and Daniel E Worrall. Reversible gans for memory-efficient image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4720–4728, 2019.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. [2018] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proc. Eur. Conf. Comput. Vis. (ECCV) Workshops, pages 0–0, 2018.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, 2004.
  • Wengrowski and Dana [2019] Eric Wengrowski and Kristin Dana. Light field messaging with deep photographic steganography. In Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., pages 1515–1524, 2019.
  • Wu et al. [2021] Yue Wu, Guotao Meng, and Qifeng Chen. Embedding novel views in a single jpeg image. In Proc. IEEE/CVF Intl. Conf. Comput. Vis., pages 14519–14527, 2021.
  • Xiao et al. [2020] Mingqing Xiao, Shuxin Zheng, Chang Liu, Yaolong Wang, Di He, Guolin Ke, Jiang Bian, Zhouchen Lin, and Tie-Yan Liu. Invertible image rescaling. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 126–144. Springer, 2020.
  • Xu et al. [2022] Youmin Xu, Chong Mou, Yujie Hu, Jingfen Xie, and Jian Zhang. Robust invertible image steganography. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7875–7884, 2022.
  • Ye et al. [2023] Huayuan Ye, Chenhui Li, Yang Li, and Changbo Wang. Invvis: Large-scale data embedding for invertible visualization. IEEE Transactions on Visualization and Computer Graphics, 2023.
  • Yu et al. [2024] Jiwen Yu, Xuanyu Zhang, Youmin Xu, and Jian Zhang. Cross: Diffusion model makes controllable, robust and secure image steganography. Advances in Neural Information Processing Systems, 36, 2024.
  • Yu et al. [2004] Xiaoyi Yu, Tieniu Tan, and Yunhong Wang. Reliable detection of bpcs-steganography in natural images. In Intl. Conf. Image Graphics (ICIG), pages 333–336. IEEE, 2004.
  • Zhang et al. [2019] Kevin Alex Zhang, Alfredo Cuesta-Infante, Lei Xu, and Kalyan Veeramachaneni. Steganogan: High capacity image steganography with gans. arXiv:1901.03892, 2019.
  • Zhang et al. [2020] Peiying Zhang, Chenhui Li, and Changbo Wang. Viscode: Embedding information in visualization images using encoder-decoder network. IEEE Trans. Visual. Comput. Graph., 27(2):326–336, 2020.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pages 586–595, 2018.
  • Zhu et al. [2018] Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei. Hidden: Hiding data with deep networks. In Proc. Eur. Conf. Comput. Vis. (ECCV), pages 657–672, 2018.
  • Zhu et al. [1999] Wenwu Zhu, Zixiang Xiong, and Ya-Qin Zhang. Multiresolution watermarking for images and video. IEEE Trans. Circ. Syst. Video Technol., 9(4):545–550, 1999.
  • Zhu et al. [2019] Xiaobin Zhu, Zhuangzi Li, Xiao-Yu Zhang, Changsheng Li, Yaqi Liu, and Ziyu Xue. Residual invertible spatio-temporal network for video super-resolution. In Proceedings of the AAAI conference on artificial intelligence, pages 5981–5988, 2019.

Supplementary Material

Appendix A Details on Model Implementation

A.1 Invertible QR Code Transition

We directly adopt the invertible neural network (INN) architecture of ISN [24] to implement the QR Code transition procedure. Instead of the 16 invertible blocks used in the original paper, we only use 2 of them to lower the model complexity, since we empirically find that this transition does not require that many parameters.

As mentioned in Sec. 3.3, we employ a constraint on the transformed QR Code to ensure that it can be identified to restore the secret message. We adopt the same strategy as ArtCoder [33], which simulates the widely used Google ZXing rules that read only the center pixel of each module in a QR Code. According to Xu et al., pixels closer to the module center should have a higher probability of being sampled. As a result, this sampling procedure can be modeled by performing a Gaussian convolution operation on each code module. Specifically, given a QR Code with n x n modules of size m x m, a Gaussian convolution kernel of size m x m is used to convolve each module with a stride of m and derive an n x n sample result. The feature map is then binarized with a threshold, which is empirically set to 0.02 (given that pixel values range in [0, 1]) in this paper, since we find this threshold value can guarantee the identifiability of the transformed QR Code. This means that pixels with values greater than 0.02 are regarded as white modules and the rest as black modules. During training, we obtain an error map \xi that indicates the wrongly transformed code modules and backpropagate the gradient for model optimization. This process is demonstrated in Fig. S1 and the optimization target can be formulated as the following loss function:

\mathcal{L}_{t} = \left\| qc(I^{*}_{q}) \cdot \xi - qc(I_{q}) \cdot \xi \right\|_{1},
\xi = \left\| bin_{k}(qc(I^{*}_{q})) - qc(I_{q}) \right\|_{1},    (S1)

in which I_{q} and I^{*}_{q} represent the original QR Code and the transformed result, respectively, qc(\cdot) indicates the Gaussian kernel convolution and bin_{k}(\cdot) is the binarization operation with threshold k. Since the aforementioned calculation is differentiable, it can be optimized jointly with the other network modules during training.

In our implementation, we resize the QR Code to a size of 5n x 5n and use a 5 x 5 kernel to perform the convolution operation. The value of n depends on the version of the QR Code; it is 37 for version 5, which we adopt in this paper. For each subsequent version, the value of n is 4 greater than that of the previous version, e.g., it is 41 for version 6 and so forth. Fig. S2 shows some transition results; we also provide the QR Code and its error map after the aforementioned convolution and binarization operations. Although the transition can sometimes lead to some wrongly transformed code modules, it does not affect the identifiability.
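A sketch of this module-wise sampling and of the loss in Eq. S1 is given below; the Gaussian kernel construction is left to the caller, the QR Codes are assumed to be single-channel tensors of shape (B, 1, 5n, 5n), and the exact normalization of the error map may differ from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def module_sample(img, kernel):
    # img: (B, 1, 5n, 5n); kernel: (1, 1, 5, 5) normalized Gaussian weights.
    # Convolving with stride = kernel size samples one value per code module,
    # giving a (B, 1, n, n) map, i.e. qc(.) in Eq. (S1).
    return F.conv2d(img, kernel, stride=kernel.shape[-1])

def transition_loss(I_q_star, I_q, kernel, k=0.02):
    s_star = module_sample(I_q_star, kernel)
    s_orig = module_sample(I_q, kernel)
    # Error map xi: modules whose binarized sample disagrees with the original code.
    xi = (s_star.gt(k).float() - s_orig).abs()
    # L_t: L1 distance between the two sample maps, restricted to wrong modules.
    return ((s_star - s_orig) * xi).abs().mean()
```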

Figure S1: Demonstration of the QR Code scanning simulation.
Figure S2: Some transition results. Here Conv. and Bin. indicate convolution and binarization, respectively.

A.2 AttnFlow Model

We use tokenizers to convert images to tokenized representations and detokenizers to transform them back in the AttnFlow model. In our implementation, all the tokenizers are based on the vision transformer (ViT) [8] architecture. We make some slight changes to the ViT-Base model, which contains 12 transformer blocks with a token dimensionality of 768 and a multi-layer perceptron (MLP) size of 3072. Specifically, we reduce the model complexity in our implementation by lowering the block number to 2 and the MLP size to 2048. The patch size used is 16 x 16. For the detokenizer, we simply use an MLP for dimension projection, followed by a reshaping operation and two convolutional layers with GELU as the activation function, to convert the tokens back to an image.
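A possible realization of this detokenizer is sketched below; the intermediate channel width and the exact reshaping are illustrative guesses rather than the authors' configuration.

```python
import torch
import torch.nn as nn

class Detokenizer(nn.Module):
    def __init__(self, dim=768, patch=16, img_size=224, channels=3):
        super().__init__()
        self.patch, self.grid, self.channels = patch, img_size // patch, channels
        self.proj = nn.Linear(dim, patch * patch * channels)      # token -> flat patch
        self.refine = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.GELU(),
            nn.Conv2d(32, channels, 3, padding=1))

    def forward(self, tokens):
        # tokens: (B, N, dim) with N = grid * grid
        B = tokens.shape[0]
        x = self.proj(tokens)
        x = x.view(B, self.grid, self.grid, self.channels, self.patch, self.patch)
        x = x.permute(0, 3, 1, 4, 2, 5).reshape(
            B, self.channels, self.grid * self.patch, self.grid * self.patch)
        return self.refine(x)
```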

The self-attention and cross-attention blocks used in the AACBs have the same token dimensionality and MLP size as the tokenizers. We use 4 AACBs in our full model since we find this block number strikes a good balance between model performance and training cost.

Appendix B Experiment Details

B.1 Metrics Calculation

LPIPS We use LPIPS [51] as part of the optimization target during model training and as one of the metrics for stego image quality evaluation. We calculate it with a VGG model pre-trained on ImageNet.

TRA We calculate the text recovery accuracy (TRA) with the following scheme:

TRA(\hat{I}_{qr}) = \begin{cases} 1.0 & \text{if } \hat{I}_{qr} \text{ is identifiable} \\ 0.0 & \text{otherwise} \end{cases},    (S2)

where \hat{I}_{qr} indicates the decoded QR Code. The final TRA is the average value over the whole testing dataset.

EMR We calculate the error module rate (EMR) by measuring the wrongly decoded QR Code modules. Given a decoded QR Code, we binarize and compare it with the ground truth to derive the error rate.

It is worth noting that, although a low EMR can generally guarantee a high TRA, these two metrics are not necessarily positively correlated. An example is shown in Fig. S3: the second and third decoded QR Codes have lower EMR than the first one, but they are not identifiable. This can be caused by two reasons. The first is that the finder and alignment patterns are damaged, so the code cannot be detected, corresponding to the second case in Fig. S3 (the finder pattern in the lower left is damaged). The second is that a high error rate is concentrated in a small area, making the error correction (ECC) scheme of the QR Code fail to restore the errors, corresponding to the third case in Fig. S3.
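A sketch of how the two metrics could be computed per image is given below; decode_qr is a hypothetical wrapper around any QR decoder (e.g., pyzbar) that returns the decoded string or None, and the module grids are assumed to be already extracted from the decoded and ground-truth codes.

```python
import numpy as np

def tra(decoded_img, decode_qr):
    # Eq. (S2): 1.0 if the decoded code is identifiable, 0.0 otherwise.
    return 1.0 if decode_qr(decoded_img) is not None else 0.0

def emr(decoded_modules, gt_modules, threshold=0.5):
    # decoded_modules, gt_modules: (n, n) arrays of module intensities in [0, 1].
    pred = (decoded_modules > threshold).astype(np.uint8)
    gt = (gt_modules > threshold).astype(np.uint8)
    return 100.0 * np.mean(pred != gt)   # error module rate, in percent
```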

Figure S3: An example of decoded QR Code with higher EMR but lower TRA.

B.2 Training Details

Our model is implemented with PyTorch and trained on 4 NVIDIA GeForce 3090 GPUs. We set the batch size to 8 per GPU and the model is trained for 50K iterations. The initial learning rate is 0.0001, which decays by 10% after each epoch until it reaches 0.00001. The model is optimized with the AdamW optimizer with \beta_1 = 0.9 and \beta_2 = 0.999. The hyper-parameters in the loss function (Equation 10) are set to \alpha = 5.0, \beta = 0.2, \gamma = 3.5, \delta = 16 and \epsilon = 3.0. The trainable parameters \alpha_i introduced in Equation 3 are initialized to 0.01. The training host images are randomly cropped from the original images as 224 x 224 patches. The overall training process takes about 13 hours.
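The optimizer and learning-rate schedule described above can be reproduced with a few lines; the placeholder model below stands for the combined IQRT + ITF + AttnFlow networks, and the 10% per-epoch decay with a 1e-5 floor is expressed through a LambdaLR multiplier.

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)   # placeholder for the combined IQRT + ITF + AttnFlow networks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
# Multiply the base lr (1e-4) by 0.9 each epoch, floored at 0.1 (i.e. lr >= 1e-5).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: max(0.9 ** epoch, 0.1))
```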

B.3 Distortion Simulation

We adopt the same types of distortions as StegaStamp [35] with some changes in the hyper-parameters. We provide a comparison of the settings of StegaStamp and ours in Tab. S1. We largely increase the Gaussian noise level to fit our task. We lower the distortion level of JPEG compression since we find that there is no need to use a very low JPEG compression quality during training to achieve sufficient robustness when a high level of Gaussian noise is employed. For the transition parameter, since we manually crop the stego image out of the photo, rather than using object detection as StegaStamp does, we do not need such a high parameter setting to guarantee sufficient robustness. As for how the distortion simulation is implemented and how these parameters work, we strongly suggest referring to the description in the original paper [35].

Table S1: Comparison of distortion parameters used by StegaStamp and RMSteg. Here Bri. is brightness, Sat. is saturation, JPEG is the JPEG compression quality, Noi. is the Gaussian noise level and Tra. is transition.
Method Bri. Hue Sat. Contrast JPEG Noi. Blur Tra.
StegaStamp 0.3 0.1 1.0 [0.5,1.5][0.5,1.5] 25 0.02 7 0.10
Ours 0.3 0.1 1.0 [0.5,1.5][0.5,1.5] 60 0.07 7 0.02
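The exact simulation follows StegaStamp; the sketch below only approximates the evaluation-time perturbations with our parameter settings using torchvision and PIL. The mapping from each parameter to a concrete transform (e.g., interpreting brightness 0.3 as a multiplicative jitter) is an assumption for illustration, not our exact implementation:

import io
import torch
from PIL import Image
import torchvision.transforms.functional as TF

def distort(img: torch.Tensor) -> torch.Tensor:
    """img: (3, H, W) float tensor in [0, 1]; parameter values follow Tab. S1 (Ours)."""
    # Color jitter: brightness 0.3, hue 0.1, saturation 1.0, contrast [0.5, 1.5].
    img = TF.adjust_brightness(img, 1.0 + (torch.rand(1).item() * 2 - 1) * 0.3)
    img = TF.adjust_hue(img, (torch.rand(1).item() * 2 - 1) * 0.1)
    img = TF.adjust_saturation(img, 1.0 + (torch.rand(1).item() * 2 - 1) * 1.0)
    img = TF.adjust_contrast(img, 0.5 + torch.rand(1).item() * 1.0)
    # Gaussian blur with a 7x7 kernel, then Gaussian noise with sigma = 0.07.
    img = TF.gaussian_blur(img, kernel_size=7)
    img = (img + 0.07 * torch.randn_like(img)).clamp(0, 1)
    # JPEG compression with quality 60 via a PIL round-trip.
    buf = io.BytesIO()
    TF.to_pil_image(img).save(buf, format='JPEG', quality=60)
    img = TF.to_tensor(Image.open(buf))
    # Small random translation (up to 2% of the image size).
    h, w = img.shape[-2:]
    tx = int(0.02 * w * (torch.rand(1).item() * 2 - 1))
    ty = int(0.02 * h * (torch.rand(1).item() * 2 - 1))
    return TF.affine(img, angle=0.0, translate=[tx, ty], scale=1.0, shear=[0.0])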
Refer to caption
Figure S4: How we derive the stego image in the printing test.

B.4 Printing and Photography

In this paper, we use the case of printing and photography to measure the robustness of methods under extreme real-world distortions. In the experiment, we first print the stego images out and take photos of them. Each stego image is printed multiple times on the same paper to mitigate fluctuations in the printer's output quality. Then, we manually crop the stego images out of the photos using CamScanner \citesupcamscanner_sup and use them for the decoding test. A demonstration of the above workflow is shown in Fig. S4.

The printer used in our experiments is an HP OfficeJet Pro 8710 inkjet printer. We choose the ‘Normal’ printing quality (among ‘Draft’, ‘Normal’ and ‘High’). The printed image size on paper is 5.4cm ×\times 5.4cm. We take photos with an iPhone 13 Pro indoors under regular illumination. We take 5 photos of each image and choose the best value for the final metrics calculation. For the experiments under different shooting angles, we maintain a vertical distance of 11 cm between the lens and the paper, which is our default shooting distance.

Appendix C Further Experiment and Discussion

C.1 Trainable Coefficients in AACB

We set the coefficient αi\alpha_{i} of the cross-attention item in the AACB transformation function (Equation 3) as a trainable parameter so that the network can learn it by itself. We initialize this parameter to 0.01 and optimize it during training. Here we provide the final converged values of αi\alpha_{i} in models with different numbers of AACBs. The results are shown in Tab. S2.
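As a rough illustration of how such a coefficient can be declared, the toy block below scales a cross-attention branch by a learnable scalar initialized to 0.01; the block structure, tensor shapes and attention modules are assumptions for illustration and do not reproduce Equation 3 exactly:

import torch
import torch.nn as nn

class CouplingWithCrossAttention(nn.Module):
    """Toy coupling branch where a trainable alpha scales a cross-attention term."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Trainable coefficient alpha_i, initialized to 0.01 as in the paper.
        self.alpha = nn.Parameter(torch.tensor(0.01))

    def forward(self, host_tokens, qr_tokens):
        h, _ = self.self_attn(qr_tokens, qr_tokens, qr_tokens)
        c, _ = self.cross_attn(qr_tokens, host_tokens, host_tokens)
        return h + self.alpha * c  # cross-attention contribution scaled by alpha_i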

Table S2: The αi\alpha_{i} values when using different numbers of AACBs.
Block Number α1\alpha_{1} α2\alpha_{2} α3\alpha_{3} α4\alpha_{4}
1 0.1354 – – –
2 0.0062 0.1425 – –
3 0.0070 0.0915 0.0087 –
4 0.0022 0.1350 0.0745 0.0070

C.2 Anti-Distortion Ability

We evaluate the anti-distortion ability of our method in the main paper. Here we further consider more real-world image distortions.

Tampering During image transmission, tampering is one of the most common and most severe distortions. We randomly tamper with a certain ratio of the stego image area (in our implementation, we mask the stego images with black squares) and calculate the decoding accuracy to measure the robustness against this kind of distortion. The results are shown in Tab. S3. It can be observed that our method has significantly better robustness against tampering than previous methods. We speculate that, after introducing the transformer architecture into normalizing flow, the secret message can be embedded into host images in a manner similar to redundant coding, thanks to the inner-channel feature interaction brought by the attention mechanism. This allows the information to be correctly recovered even when some areas of the image are tampered with. Taking the two cases shown in Fig. S5 as examples, our method achieves a high decoding accuracy in both cases, whereas the remaining four CNN-based methods have high error rates in the tampered regions. This to some extent shows that CNN-based methods tend to hide the secret message in spatially corresponding areas; therefore, when a certain area is damaged, the corresponding area of the secret message will also fail to decode correctly. More results are provided in Sec. C.5. A rough sketch of the tampering simulation is given below.
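The sketch below approximates the masking procedure; the square size and placement strategy are illustrative assumptions rather than our exact settings:

import torch

def tamper(stego: torch.Tensor, ratio: float, square: int = 32) -> torch.Tensor:
    """Mask random black squares until roughly `ratio` of the image area is covered.
    stego: (3, H, W) float tensor in [0, 1]."""
    out = stego.clone()
    _, h, w = out.shape
    covered = torch.zeros(h, w, dtype=torch.bool)
    while covered.float().mean() < ratio:
        y = torch.randint(0, h - square + 1, (1,)).item()
        x = torch.randint(0, w - square + 1, (1,)).item()
        out[:, y:y + square, x:x + square] = 0.0
        covered[y:y + square, x:x + square] = True
    return out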

Table S3: Decoding accuracy under tampering rate rr. The best and second-best results are marked in red and blue colors.
Method r=5%r=5\% r=10%r=10\% r=15%r=15\% r=20%r=20\% r=25%r=25\% r=30%r=30\% r=35%r=35\%
TRA\uparrow EMR\downarrow TRA\uparrow EMR\downarrow TRA\uparrow EMR\downarrow TRA\uparrow EMR\downarrow TRA\uparrow EMR\downarrow TRA\uparrow EMR\downarrow TRA\uparrow EMR\downarrow
ISN\dagger 0.571 2.878 0.349 5.008 0.110 7.137 0.005 9.259 0.000 11.39 0.000 14.24 0.000 16.37
HiNet\dagger 0.575 2.698 0.354 4.725 0.105 6.723 0.007 8.736 0.000 10.78 0.000 13.48 0.000 15.53
StegaStamp 0.915 3.221 0.673 5.080 0.245 6.902 0.029 8.716 0.001 10.54 0.000 12.89 0.000 14.68
StegaStamp\dagger 0.918 3.038 0.732 4.801 0.336 6.539 0.050 8.277 0.001 9.994 0.000 12.30 0.000 14.05
Ours 0.996 0.119 0.995 0.340 0.966 0.755 0.818 1.324 0.533 2.059 0.233 3.192 0.113 4.195
Refer to caption
Figure S5: Decoding results under image tampering. QR Codes with green borders can be successfully identified while those with red borders cannot. Zoom in for better observation.

Light Field Messaging Wengrowski et al. \citesupwengrowski2019light_sup consider the robustness of message embedding against on-screen shooting, e.g., taking a photo of the stego image displayed on a PC screen. Since a quantitative study on this topic requires a large number of experiments, such as on the influence of different displays and camera lenses, which are not the main contribution of this paper, we only provide some qualitative results here. We use an iPhone 13 Pro as the shooting camera, and the display used to show the stego images is a BenQ EW2770QZ with 2560×\times1440 resolution. As shown in Fig. S6, we choose different shooting distances for a comprehensive demonstration. It can be observed that, as the shooting distance grows, the Moiré pattern distortion becomes more and more obvious. However, compared with other methods, ours keeps a high decoding accuracy against this kind of distortion. We believe this again demonstrates the superiority of the transformer-based scheme in handling the robust steganography task. More results are provided in Sec. C.5.

Refer to caption
Figure S6: Stego images and decoding results under the distortion of light field messaging. QR Codes with green borders can be successfully identified while those with red borders cannot. Zoom in for better observation.

C.3 Ablation Study

Since the quantitative results have been provided in Sec. 4.4, here we mainly focus on the implementation details of the ablation experiments and the qualitative comparison results.

Refer to caption
Figure S7: Qualitative results of the ablation study. Zoom in for better observation.

IQRT We validate the effectiveness of the invertible QR Code transition by removing it from the pipeline, which means we directly tokenize the QR Code image and feed the tokens to the ITF module and then to the AttnFlow model. As shown in Fig. S7, the stego images generated without IQRT have more artifacts. As discussed in the paper, since we only apply one constraint on the transformed QR Code to guarantee its identifiability, the learned transition strategy tends to favor a better steganography quality. However, as shown in Tab. 4, IQRT leads to a slightly worse decoding accuracy. This is due to the information loss that sometimes occurs during the transition process; two examples can be found in Fig. S2. Overall, we believe this module effectively improves the stego image quality.

ITF Module The invertible token fusion module learns a transform matrix for QR Code image tokens. Compared with the invertible 1×\times1 convolution (IConv) proposed by GLOW \citesupkingma2018glow_sup, our ITF learns a patch-wise (or token-wise) transformation instead of a channel-wise one. Just as IConv re-permutes channels, we believe ITF can rearrange tokens to compensate for the limited distribution transformation ability of normalizing flow, which stems from the affine-formed functions that must be adopted to keep the model invertible. The experiment results prove the competence of the ITF module, and as introduced in the paper, we empirically find that this module helps derive a better distribution of the artifacts in the stego image. Some results are also shown in Fig. S7; we believe this is, to some extent, due to the token rearranging brought by ITF. A rough sketch of a token-wise invertible transform is given below.
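The toy module below illustrates the idea of mixing along the token axis with a learned invertible matrix, in the spirit of GLOW's invertible 1×\times1 convolution; the initialization and inversion strategy are assumptions and do not reproduce our exact parameterization:

import torch
import torch.nn as nn

class InvertibleTokenFusion(nn.Module):
    """Toy token-wise invertible transform: mixes tokens with a learned square matrix."""
    def __init__(self, num_tokens: int):
        super().__init__()
        # Initialize with a random rotation so the matrix starts out invertible.
        w, _ = torch.linalg.qr(torch.randn(num_tokens, num_tokens))
        self.weight = nn.Parameter(w)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D); mix along the token axis N instead of the channel axis.
        return torch.einsum('mn,bnd->bmd', self.weight, tokens)

    def inverse(self, tokens: torch.Tensor) -> torch.Tensor:
        return torch.einsum('mn,bnd->bmd', torch.inverse(self.weight), tokens)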

Cross Attention in AACB We introduce an extra cross-attention item in Equation 3. We make this design because we hope more feature interactions can happen between host image tokens and QR Code tokens. As a result, we incorporate the cross-attention mechanism and add it to the affine transformation function. As shown in Fig. S7, this module can improve the visual quality of the stego image. The results in Tab. 4 also show that the model with cross-attention achieves an around 15%15\% improvement in LPIPS.

Tokenized Representation We incorporate the tokenized representation (TR) in the ITF module and the AttnFlow model. To validate the effectiveness of TR, we remove the ITF module and replace AttnFlow with CNN-based normalizing flow models (ISN \citesuplu2021large_sup and HiNet \citesupjing2021hinet_sup). As shown in Fig. S7, the generated stego images are similar to those of the original ISN and HiNet, containing obvious QR Code-like artifacts. This shows that CNN-based normalizing flow struggles to learn high-quality features in the robust steganography task. In contrast, our transformer-based scheme extends the model's ability and helps generate stego images with high visual similarity.

C.4 Limitation Analysis and Future Work

Although our RMSteg achieves state-of-the-art performance in robust message embedding, it still has some limitations. First, as mentioned in Sec. 4.4, our method can distribute the steganography residual in heterogeneous regions to avoid perceptible artifacts. However, when facing host images with many homogeneous regions, our method can fail to preserve a good visual quality. It can be observed from Fig. S8 that, although the artifacts are mainly concentrated in heterogeneous areas, the proportion of such regions is too small to preserve the overall visual quality. Secondly, although the embedding capacity of RMSteg far exceeds that of previous methods, it still cannot conceal large-scale secret information, e.g., multiple images.

In summary, we will focus on the two aforementioned limitations, i.e., better steganography quality and higher embedding capacity, in future work. We will explore more schemes to extend the performance of transformer-based steganography methods. In addition, as mentioned in Sec. C.2, we will consider more kinds of real-world image distortions to improve the method's applicability.

Refer to caption
Figure S8: Generated stego images when facing host images with many homogeneous areas.

C.5 Additional Results

In this section, we provide more qualitative results of the experiments mentioned in the paper and appendix.

IQRT Results More QR Code transition results (corresponding to Fig. 3) are provided in Fig. S9 - Fig. S10.

Steganography Results More stego images generated by different methods (corresponding to Fig. 5) are provided in Fig. S11 - Fig. S13.

Print-Proof Robustness More decoding results under different shooting situations (corresponding to Fig. 6) are provided in Fig. S14 - Fig. S16.

Anti-Tampering More decoding results under image tampering (corresponding to Fig. S5) are provided in Fig. S18.

Anti-Light Field Messaging More decoding results under light field messaging (corresponding to Fig. S6) are provided in Fig. S17.

Refer to caption
Figure S9: QR Code transition results, the transformed QR Codes are still identifiable.
Refer to caption
Figure S10: QR Code transition results, the transformed QR Codes are still identifiable.
Refer to caption
Figure S11: Stego images generated by different methods. Zoom in for better observation.
Refer to caption
Figure S12: Stego images generated by different methods. Zoom in for better observation.
Refer to caption
Figure S13: Stego images generated by different methods. Zoom in for better observation.
Refer to caption
Figure S14: Decoding results under different shooting distances and angles. QR Codes with green borders can be successfully identified while those with red borders cannot. Zoom in for better observation.
Refer to caption
Figure S15: Decoding results under different shooting distances and angles. QR Codes with green borders can be successfully identified while those with red borders cannot. Zoom in for better observation.
Refer to caption
Figure S16: Decoding results under different shooting distances and angles. QR Codes with green borders can be successfully identified while those with red borders cannot. Zoom in for better observation.
Refer to caption
Figure S17: Stego images and decoding results under the distortion of light field messaging. QR Codes with green borders can be successfully identified while those with red borders cannot. Zoom in for better observation.
Refer to caption
Figure S18: Decoding results under image tampering. QR Codes with green borders can be successfully identified while those with red borders cannot. Zoom in for better observation.
\bibliographystylesup{ieeenat_fullname}
\bibliographysup{sup}