
Towards Robust Image-in-Audio Deep Steganography

Jaume Ros¹,*   Margarita Geleta²,*   Jordi Pons³   Xavier Giro-i-Nieto¹
¹Universitat Politècnica de Catalunya   ²UC Berkeley   ³Dolby Laboratories
Abstract

The field of steganography has experienced a surge of interest due to recent advancements in AI-powered techniques, particularly in multimodal setups that enable the concealment of signals within signals of a different nature. The primary objectives of all steganographic methods are perceptual transparency, robustness, and large embedding capacity, goals that often conflict and that classical methods have struggled to reconcile. This paper extends and enhances an existing image-in-audio deep steganography method by focusing on improving its robustness. The proposed enhancements include modifications to the loss function, utilization of the short-time Fourier transform (STFT), introduction of redundancy in the encoding process for error correction, and buffering of additional information in the pixel subconvolution operation. The results demonstrate that our approach outperforms the existing method in terms of robustness and perceptual transparency**.

Figure 1: Components of the steganographic pipeline. On the left, the secret image to be embedded and the revealed image at the decoder end are shown together with their RGB density histograms, followed by the cover log spectrogram and the stego (cover + embedded secret) log spectrogram. On the right, the cover and stego waveforms are shown after applying the inverse STFT, and their $L_1$ distance is computed.
* Indicates equal contribution.
** The code is available for download at https://github.com/migamic/PixInWav2.

1 Introduction

The use of deep learning in steganography is a relatively new area of research, but it has already improved the performance of traditional steganographic techniques and opened up new possibilities for covert communication [2, 3, 39, 42, 41, 16]. In particular, steganography has evolved to multimodal setups, enabling the embedding of signals of one modality within signals of a different modality [43, 32, 19]. In this paper, we present a novel approach to improving an existing image-in-audio deep steganography method [19]. Our approach involves a series of key enhancements, including the use of short-time Fourier transform (STFT) instead of the short-time discrete cosine transform (STDCT), introducing redundancy in the encoding process for error correction, and buffering additional information in the pixel subconvolution operation, among others. Our enhanced method effectively equips the steganographic agent with a powerful new set of tools to operate with.

Our training pipeline hides ImageNet images [15] within audio samples from the FSDnoisy18k dataset [18] (Figure 1). To evaluate the effectiveness of our approach, we conducted comprehensive experiments and comparisons with the baseline model using perceptual metrics for image quality (SSIM and PSNR) and audio quality (SNR). The results demonstrate that our approach outperforms the existing method in terms of robustness and perceptual transparency.

To summarize, we make the following contributions:

  1. We improve the performance of the steganographic method in [19] by replacing the real short-time discrete cosine transform (STDCT) with the complex short-time Fourier transform (STFT), and we show that increasing the resolution of the spectral representation of the audio improves model performance.

  2. We improve the secret image reconstruction by introducing new image-in-audio replication-based embedding methods. Additionally, introducing redundancy via replication serves as error correction, enhancing the robustness of the method.

  3. We enhance the architectural design of the model by buffering the luma component of the YCbCr color representation of the secret image in the subpixel convolution operation.

2 Related Work

Steganography is the practice of concealing a secret signal (which may be covert communication or a watermark) within a cover (or host) signal; the medium containing both signals is called the stego signal. The objectives are: (1) maximizing perceptual transparency, i.e. the similarity between the cover and stego signals; (2) maximizing robustness, the ability to withstand intentional or accidental attacks; and (3) maximizing embedding capacity, the secret message size per unit of time or space [4, 25, 32, 31]. The earliest known use of steganography dates back to ancient Greece [10, 35]. In the modern era, steganography has been used for a variety of purposes, including copyright protection and watermarking [4, 1, 12, 6, 22, 14, 28, 30], military communications [38, 35, 31, 26], and feature tagging [29, 4, 7, 5, 36]. Recently, steganography has found a connectionist approach, giving rise to a new area of deep learning known as deep steganography.

First attempts at deep steganography concerned unimodal setups, in which both the secret and cover signals are of the same modality, such as image-in-image or audio-in-audio. There are numerous examples of image-in-image steganography. Baluja [2, 3] used a convolutional neural network with inception-like modules to encode a secret full-sized color image in a dispersed manner throughout the bits of a cover image of the same size. Inspired by auto-encoding networks for image compression, the system learns to compress and place the secret image into the least noticeable portions of the cover image: the network first extracts features from the secret image and then merges them with the cover image in the hiding step. The author also mentions that this technique could be applied to audio samples by interpreting their spectrograms as images. Rehman et al. [39] employ a two-branch encoder to gradually extract features from both secret and cover images and synchronize them at several stages to produce stego images. StegNet [42, 41] continues in the encoder-decoder line but improves the perceptual transparency of the method by introducing skip connections [23] and separable convolutions in the architecture to improve convergence [9]. Duan et al. [16] propose concatenating both secret and cover signals and using U-Nets as the hiding and revealing networks. To a lesser extent, we also find audio-in-audio steganography [27].

More recently, researchers have also explored the use of deep learning for multimodal steganography, in which the secret message and the cover media are of different modalities. This approach has the potential to further improve the transparency and robustness of secret messages, as well as to expand the range of cover media that can be used for steganography. HiDDeN [43] uses a convolutional encoder-decoder architecture to embed a secret string message within an image, which does not constrain the secret to be a specific kind of signal, since a string of bits can represent any type of data. There have been attempts to address the specific case of image-in-audio steganography with classical methods [24, 33, 37], but leveraging the audio representation for embedding image data has not been explored extensively in the deep learning context. One such deep steganographic model is PixInWav [19].

3 Preliminaries

In this section we detail specific parts of interest of the PixInWav [19] model, which serves as the baseline for this work. PixInWav is a deep steganographic model for image-in-audio concealment. The pipeline is trained end-to-end, and the trainable part consists of two U-Net-style networks: a hiding network and a revealing network.

The pipeline takes a secret image $s$ and a cover audio waveform $w$. First, $w$ is transformed into a spectrogram $M$ using the short-time discrete cosine transform (STDCT). Then, the hiding network is applied on the pixel-shuffled $s$ to convert it into a low-power spectrogram watermark, which is residually added onto $M$, resulting in the stego spectrogram $M'$. $M'$ can be transformed back to the temporal domain, $w'$, via the inverse STDCT for transmission. At the decoder end, the revealing network takes $M'$ as input and the revealed image $s'$ is extracted by pixel-unshuffling the network output.
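As a rough illustration, the following PyTorch-style sketch traces this data flow; the module names (`stdct`, `istdct`, `hide_net`, `reveal_net`) and their signatures are hypothetical placeholders, not the actual PixInWav implementation.

```python
# Minimal sketch of the PixInWav-style pipeline (hypothetical module names).
import torch
import torch.nn.functional as F

def hide_and_reveal(secret_rgb, cover_wave, stdct, istdct, hide_net, reveal_net):
    # secret_rgb: (B, 3, 256, 256); cover_wave: (B, T) audio waveform
    M = stdct(cover_wave)                                    # cover spectrogram
    pad = torch.zeros_like(secret_rgb[:, :1])                # zero-padded 4th value per pixel
    flat = F.pixel_shuffle(torch.cat([secret_rgb, pad], dim=1), 2)   # (B, 1, 512, 512)
    M_stego = M + hide_net(flat, M)                          # residual embedding -> M'
    stego_wave = istdct(M_stego)                             # w' for transmission
    revealed = F.pixel_unshuffle(reveal_net(M_stego), 2)[:, :3]      # s', (B, 3, 256, 256)
    return stego_wave, revealed
```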

The loss function is a convex combination of image and spectrogram reconstruction terms, with the addition of the soft dynamic time warping (DTW) discrepancy [13] (with a smoothing parameter $\gamma$) between the cover waveform and the stego waveform for better temporal alignment:

\mathcal{L}(s, s', w, w', M, M') = \beta\,\|s - s'\|_1 + \lambda\,\mathrm{dtw}(w, w', \gamma=1) + (1 - \beta)\,\|M - M'\|_2    (1)

We refer the interested reader to the PixInWav paper [19] for details.

Limitations. Even though PixInWav has shown the feasibility of image-in-audio connectionist steganography, it suffers from a lack of robustness. This weakness stems from several architectural decisions.

The first is the wasted zero-padding in the pixel-shuffle operation. In PixInWav [19], pixel-shuffle (or sub-pixel convolution [34]) is used to flatten the RGB image into a single channel by arranging the 3 color channels of each pixel side by side in a $2\times 2$ grid, padding an empty element of value 0 as the fourth element. We show that this fourth element can serve as a buffer to transmit useful information about the encoded image, allowing for better image reconstruction.

The second weakness is forgoing image replication in the stego spectrogram. The shape of the cover spectrogram does not need to match the shape of the image. While images in the dataset are of size $256\times 256$ ($512\times 512$ after applying the pixel-shuffle operation), the spectrograms are larger and, in general, have a non-square shape (Geleta et al. [19] used spectrograms of shape $4096\times 1024$; as mentioned in Section 4.3, these values can be arbitrarily increased or decreased, and we chose a shape of $1024\times 512$ for easier comparison with our STFT implementation and a reduced computational load). This mismatch is overcome by stretching the image via bilinear interpolation, a reversible operation that allows the image to be easily resized to any desired shape. However, the extra space can be exploited to replicate the secret image, a procedure that can be regarded as an error correction technique for improved robustness.

Finally, the basis function of the short-time transform is the type-2 discrete cosine transform (DCT) and the resolution is fixed. The choice of the STDCT over the short-time Fourier transform (STFT) was motivated in [19] by its smaller set of components: the STDCT, being a real transform, results in a single real-valued spectrogram, in contrast to the complex STFT, which decomposes into both a magnitude and a phase. However, increasing the set of components increases the number of strategies available to embed our secret image. Additionally, the resolution of the transform can be increased arbitrarily, up to computational constraints.

4 Enhancements

We propose several enhancements to the image-in-audio steganographic method proposed by PixInWav [19].

Figure 2: Architecture of the proposed model. The legend shows the three basic components of the steganographic pipeline that have been improved in our work: (pink) change of the audio transform, (green) buffering of luma in the pixel shuffle operation, and (orange) addition of different embedding methods based on replication.
Figure 3: Comparison of different embedding methods operating on the STFT magnitude. The first row shows the reconstructed image, while the second row shows the log spectrogram of the $2048\times 1024$ stego signal. Each of these models used $\beta=0.75$.

4.1 STFT instead of STDCT

Given a time-domain sequence $x[n]$, its discrete short-time transform is given by Equation 2:

X_{\mathcal{T}}\{x[n]\}(m, F_k) = \sum_{n=mr}^{mr+N-1} x[n] \cdot h[n-mr] \cdot \mathcal{T}[n, F_k]    (2)

which is a bivariate function representing the energy of the $F_k$-th frequency component (equispaced frequency samples $F_k = k/N$) in the $m$-th frame. The time index is represented by $n$; $h$ is the low-pass window function, and $r$ is the hop size of the short-time transform. The function $\mathcal{T}[n, F_k]$ is the basis function of the transformation.

The STDCT, used by PixInWav, is a real transform using the type-2 DCT basis function as $\mathcal{T}[n, F_k]$, which produces a single 2D spectrogram from a 1D audio waveform. On the other hand, the STFT is a complex transform using the complex exponential as its basis function, $\mathcal{T}[n, F_k] = e^{-i 2\pi F_k n}$, which results in a complex signal that can be split into two 2D signals: the magnitude and the phase. We propose using the STFT instead of the STDCT, where this duality of the cover signal allows for more possibilities in how the secret signal can be embedded onto the cover signal. Both the magnitude and the phase can be used as stego signals in the same manner that the single spectrogram from the STDCT has been used. In this work we consider using either of the two signals as a single stego signal, or using both of them at the same time.
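A minimal sketch of how the two STFT components can be obtained and recombined with `torch.stft`/`torch.istft` is given below; the frame and hop sizes are illustrative and not the exact values used in our experiments.

```python
import torch

wave = torch.randn(1, 67_522)                          # ~1.5 s at 44.1 kHz
window = torch.hann_window(2046)
spec = torch.stft(wave, n_fft=2046, hop_length=132,
                  window=window, return_complex=True)  # complex (B, F, T) tensor
magnitude, phase = spec.abs(), spec.angle()            # two real-valued 2D signals
# After embedding into either (or both) components, recombine and invert:
stego_wave = torch.istft(torch.polar(magnitude, phase),
                         n_fft=2046, hop_length=132, window=window)
```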

STFT magnitude as a single stego signal. One can approach embedding the information in the STFT magnitude in the same way as using a single STDCT spectrogram, with the main difference being that the phase of the cover signal remains unmodified. Since we are not directly distorting the phase component, the alignment performed by DTW can be deemed redundant. We propose replacing the soft DTW used in [19] with a simpler $L_1$ distance between the cover and stego waveforms (Equation 3):

\mathcal{L}(s, s', w, w', M, M') = \beta\,\|s - s'\|_1 + \lambda\,\|w - w'\|_1 + (1 - \beta)\,\|M - M'\|_2    (3)

STFT phase as a single stego signal. Since the phase has the same spatial dimensions as the magnitude signal, the same methods can be applied to the phase component alone. In this case, only the phase component is modified.

STFT magnitude and phase as stego signals. A more advanced setup has been developed in which both the STFT magnitude and phase jointly serve as stego signals. To handle multiple stego components, the architecture requires an adaptation: the two stego components should be treated separately due to their very different structure. The proposed architecture uses separate encoders and decoders for each stego component. The two revealed images are fed into a third network that processes them to obtain a single image as output. Out of the multiple solutions tried, a simple trained weighted average worked best. The loss function has been adapted to accommodate multiple containers (Equation 4):

\mathcal{L}(s, s', w, w', M, M', P, P') = \beta\,\|s - s'\|_1 + \lambda\,\|w - w'\|_1 + (1 - \beta)\left[(1 - \theta)\,\|M - M'\|_2 + \theta\,\|P - P'\|_2\right]    (4)

where $M$ and $P$ now denote the magnitude and phase signals, respectively, and $M'$ and $P'$ correspond to their respective stego components. These components are weighted by a new hyperparameter $\theta$, which controls the trade-off between magnitude and phase distortion. Notice that the waveform $w$ is still unique.
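The coupling step can be as small as a single trainable mixing weight; the sketch below is a minimal stand-in for the third network mentioned above (the sigmoid squashing that keeps the weight in $[0, 1]$ is our assumption, not the exact formulation of the trained weighted average).

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fuses the images revealed from the magnitude and phase containers."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.0))   # trainable mixing logit

    def forward(self, from_magnitude, from_phase):
        a = torch.sigmoid(self.alpha)                  # mixing weight in [0, 1]
        return a * from_magnitude + (1 - a) * from_phase
```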

4.2 Spectrogram replicas

We consider the case of encoding a $256\times 256\times 3$ RGB image (flattened to $512\times 512$ after applying the pixel-shuffle operation) onto a $1024\times 512$ spectrogram.

PixInWav [19] made use of bilinear interpolation to upsample the encoded image to match the spectrogram shape, only to downsample it back to its original size before decoding. This strategy, called Stretch from now on, makes the encoding and decoding processes independent of the stego size; a minimal sketch is given below. However, other options can be devised. In this section we propose alternative architectures to address this problem (Figure 3) that make better use of the available space by encoding multiple copies of the secret image, improving the secret reconstruction and increasing the robustness of the steganographic method.
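A sketch of the Stretch baseline with `torch.nn.functional.interpolate`; the exact resizing options (e.g. `align_corners`) are assumptions.

```python
import torch
import torch.nn.functional as F

flat_secret = torch.rand(1, 1, 512, 512)                 # pixel-shuffled secret image
stretched = F.interpolate(flat_secret, size=(1024, 512),
                          mode='bilinear', align_corners=False)
# ... embedding, transmission and decoding happen at the stego resolution ...
downsampled = F.interpolate(stretched, size=(512, 512),
                            mode='bilinear', align_corners=False)
```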

Replicate method. Our simplest approach exploits the fact that the cover spectrogram is significantly larger than the secret image, which allows for a natural replication of the encoded image that is added onto the host signal. When decoding, the two copies are jointly forwarded through the network, then split and averaged to produce the final revealed image.

Weighted Replicate method. Weighted Replicate (W-Replicate) improves the previous method by scaling each replica by a trainable weight before adding it onto the container spectrogram, and also when merging the replicas back into a single one (essentially a trained weighted average), resulting in a total of four trainable weights added to the model. This change allows the model to learn in which half of the STFT spectrogram (high or low frequencies) information can be added while causing the least distortion.
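A minimal sketch of the W-Replicate idea for the base two-replica setup is given below; the softmax parametrization of the merging weights is an assumption, not the exact formulation used in our model.

```python
import torch
import torch.nn as nn

class WReplicate(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc_w = nn.Parameter(torch.ones(2))   # scales each embedded replica
        self.dec_w = nn.Parameter(torch.ones(2))   # merges the two decoded halves

    def replicate(self, residual):                 # residual: (B, 1, 512, 512)
        return torch.cat([self.enc_w[0] * residual,
                          self.enc_w[1] * residual], dim=2)   # (B, 1, 1024, 512)

    def merge(self, decoded):                      # decoded: (B, C, 1024, 512)
        top, bottom = decoded.chunk(2, dim=2)
        w = torch.softmax(self.dec_w, dim=0)
        return w[0] * top + w[1] * bottom          # (B, C, 512, 512)
```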

Weighted & Split Replicate method. The previous two methods decode the spectrogram directly, i.e. a tensor of shape $1024\times 512$ with the two replicas side by side; this forces the network to treat both replicas equally. Weighted & Split Replicate (WS-Replicate) improves upon this issue by first splitting the container signal and decoding the two replicas separately (i.e. concatenated along a third dimension, resulting in a tensor of shape $512\times 512\times 2$). The encoder structure is the same as in W-Replicate.

Multichannel method. All the previous methods rely on the pixel-shuffle operation to flatten the image into a single color channel. Multichannel, however, omits this step and has the model learn to encode the three color channels across the different replicas. Thus, the encoded image is of shape $256\times 256\times C$, where $C$ is the desired number of output channels. Since eight $256\times 256$ replicas can fit into a $1024\times 512$ host signal, we set $C=8$; the replicas are arranged in a $4\times 2$ grid. As with WS-Replicate, the decoder is fed the replicas already split and concatenated, only that this time the output is directly the final $256\times 256\times 3$ RGB image.
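The tiling itself is pure reshaping; a sketch for the base $C=8$ case follows (the exact row/column ordering of the grid is an assumption).

```python
import torch

encoded = torch.rand(1, 8, 256, 256)                  # (B, C=8, 256, 256) encoder output
b = encoded.shape[0]
tiled = (encoded.view(b, 4, 2, 256, 256)              # 4 rows x 2 columns of tiles
                .permute(0, 1, 3, 2, 4)
                .reshape(b, 1, 1024, 512))            # residual added onto the host
# Decoder side: undo the tiling before the revealing network.
untiled = (tiled.reshape(b, 4, 256, 2, 256)
                .permute(0, 1, 3, 2, 4)
                .reshape(b, 8, 256, 256))
```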

| # | Model | Embedding method | $\beta$ | Luma | Container size | Revealed SSIM ↑ | Revealed PSNR ↑ | Color restoration | Stego SNR ↑ | Waveform loss ↓ |
| 1 | Baseline PixInWav [19] (STDCT, DTW $\lambda=10^{-4}$) | Stretch | 0.05 | | $1024\times 512$ | 0.84 | 23.80 | Full color | -3.08 | 10.53 |
| 2 | PixInWav [19] (STDCT, DTW $\lambda=10^{-4}$) | Stretch | 0.01 | | $1024\times 512$ | 0.76 | 20.45 | Partial color | 8.72 | 1.05 |
| 3 | PixInWav [19] (STDCT, DTW $\lambda=10^{-4}$) | Stretch | 0.5 | | $1024\times 512$ | 0.86 | 25.29 | Full color | -14.90 | 89.11 |
| 4 | PixInWav [19] (STDCT, DTW $\lambda=1$) | Stretch | 0.05 | | $1024\times 512$ | 0.39 | 11.17 | No color | 45.98 | $1.1\times 10^{-4}$ |
| 5 | Modified PixInWav (STDCT, $L_1$ $\lambda=1$) | Stretch | 0.05 | | $1024\times 512$ | 0.37 | 10.84 | None | 34.65 | $2\times 10^{-5}$ |
| 6 | Ours (STFT: magnitude, DTW $\lambda=10^{-4}$) | Stretch | 0.75 | | $1024\times 512$ | 0.73 | 20.06 | No color | 41.17 | $4.8\times 10^{-3}$ |
| 7 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | Stretch | 0.85 | | $1024\times 512$ | 0.73 | 20.09 | No color | 43.27 | $1.3\times 10^{-4}$ |
| 8 | Ours (STFT: magnitude, $L_1$ $\lambda=1/2$) | Stretch | 0.75 | | $1024\times 512$ | 0.69 | 20.58 | No color | 44.72 | $1.2\times 10^{-4}$ |
| 9 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | Stretch | 0.75 | | $1024\times 512$ | 0.64 | 20.95 | Partial color | 44.66 | $1.2\times 10^{-4}$ |
| 10 | Ours (STFT: phase, $L_1$ $\lambda=1$) | Stretch | 0.75 | | $1024\times 512$ | 0.52 | 14.89 | No color | 21.84 | $2.1\times 10^{-3}$ |
| 11 | Ours (STFT: magnitude + phase, $L_1$ $\lambda=1$) | Stretch | 0.75 | | $1024\times 512$ | 0.87 | 26.27 | Partial color | 22.19 | $7.2\times 10^{-4}$ |
| 12 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | Replicate | 0.75 | | $1024\times 512$ | 0.71 | 22.84 | Full color | 42.76 | $1.6\times 10^{-4}$ |
| 13 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | W-Replicate | 0.75 | | $1024\times 512$ | 0.64 | 20.00 | Full color | 38.25 | $2.6\times 10^{-4}$ |
| 14 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | WS-Replicate | 0.75 | | $1024\times 512$ | 0.81 | 25.33 | Full color | 40.60 | $1.9\times 10^{-4}$ |
| 15 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | Multichannel | 0.75 | | $1024\times 512$ | 0.87 | 24.08 | Partial color | 15.83 | $3.3\times 10^{-3}$ |
| 16 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | Stretch | 0.75 | | $2048\times 1024$ | 0.70 | 19.92 | No color | 51.43 | $5.6\times 10^{-5}$ |
| 17 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | Stretch | 0.75 | | $2048\times 1024$ | 0.71 | 19.99 | No color | 50.94 | $6.2\times 10^{-5}$ |
| 18 | Ours (STFT: magnitude, DTW $\lambda=10^{-4}$) | Stretch | 0.75 | | $2048\times 1024$ | 0.79 | 20.63 | No color | 52.84 | $3.7\times 10^{-4}$ |
| 19 | Ours (STFT: magnitude + phase, $L_1$ $\lambda=1$) | Stretch | 0.75 | | $2048\times 1024$ | 0.91 | 28.35 | Full color | 8.14 | $1.3\times 10^{-3}$ |
| 20 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | Replicate | 0.75 | | $2048\times 1024$ | 0.68 | 22.30 | Partial color | 53.14 | $4.6\times 10^{-5}$ |
| 21 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | Replicate | 0.75 | | $2048\times 1024$ | 0.73 | 20.27 | No color | 49.49 | $6.9\times 10^{-5}$ |
| 22 | Ours (STFT: magnitude, DTW $\lambda=10^{-4}$) | Replicate | 0.75 | | $2048\times 1024$ | 0.64 | 19.67 | No color | 51.94 | $4.3\times 10^{-4}$ |
| 23 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | W-Replicate | 0.75 | | $2048\times 1024$ | 0.83 | 26.09 | Full color | 55.22 | $4.1\times 10^{-5}$ |
| 24 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | W-Replicate | 0.75 | | $2048\times 1024$ | 0.88 | 23.81 | Full color | 50.46 | $6.8\times 10^{-5}$ |
| 25 | Ours (STFT: magnitude, DTW $\lambda=10^{-4}$) | W-Replicate | 0.75 | | $2048\times 1024$ | 0.79 | 21.05 | No color | 37.62 | $1.2\times 10^{-2}$ |
| 26 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | W-Replicate | 0.5 | | $2048\times 1024$ | 0.77 | 20.65 | No color | 54.12 | $3.8\times 10^{-5}$ |
| 27 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | W-Replicate | 0.9 | | $2048\times 1024$ | 0.82 | 25.24 | Full color | 43.31 | $1.6\times 10^{-5}$ |
| 28 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | WS-Replicate | 0.75 | | $2048\times 1024$ | 0.85 | 26.15 | Full color | 31.24 | $3.9\times 10^{-4}$ |
| 29 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | WS-Replicate | 0.75 | | $2048\times 1024$ | 0.87 | 26.88 | Full color | 31.61 | $3.9\times 10^{-4}$ |
| 30 | Ours (STFT: magnitude, DTW $\lambda=10^{-4}$) | WS-Replicate | 0.75 | | $2048\times 1024$ | 0.84 | 26.20 | Full color | 30.46 | $5.1\times 10^{-2}$ |
| 31 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | WS-Replicate | 0.5 | | $2048\times 1024$ | 0.82 | 25.50 | Full color | 34.96 | $2.6\times 10^{-4}$ |
| 32 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | WS-Replicate | 0.9 | | $2048\times 1024$ | 0.86 | 26.73 | Full color | 27.90 | $5.8\times 10^{-4}$ |
| 33 | Ours (STFT: magnitude, $L_1$ $\lambda=1$) | Multichannel | 0.75 | | $2048\times 1024$ | 0.83 | 23.26 | Partial color | 20.02 | $2\times 10^{-3}$ |
Table 1: Results of the ablation study. The metrics reported are SSIM and PSNR for image quality and SNR for audio quality. We also include qualitative information on color reconstruction (full color means that the whole spectrum of colors can be reconstructed, partial color means that the color spectrum is only partially reconstructed, no color refers to black-and-white reconstruction of RGB images, and none refers to no meaningful image reconstruction, as hinted by the low SSIM values), and the waveform loss values (soft DTW or $L_1$, depending on the model). For reference, a signal with an SNR of 30 decibels (dB) or higher can be considered a perceptually clean signal [17].

4.3 Higher stego resolution

Our baseline system assumes a stego signal of size $1024\times 512$, determined by the STFT applied to the input audio waveform with a given set of hyperparameters (frame length and hop size). However, these values are arbitrary and can be changed, specifically to increase the resolution of the stego spectrogram. In this section we explore the possibilities offered by a larger spectrogram, which should increase the embedding capacity of the stego signal (and the robustness of the whole system if replication is used). Increasing the frame length of the STFT results in a larger size in the frequency dimension, while reducing the hop size of the overlapping windows causes the container to grow in the time dimension; a short sketch follows the list below. We applied both modifications to obtain a container of size $2048\times 1024$, preserving the property of the dimensions being powers of two, which allows for efficient computations. Some adaptations have been required to accommodate the larger container size:

The Stretch method needs to interpolate to a larger target size, the same as that of the stego spectrogram.

Replicate-based methods use 8 replicas instead of 2, arranged in a $4\times 2$ grid. As a consequence, W-Replicate and WS-Replicate use 8 weights instead of 2 to scale each copy individually. WS-Replicate's decoder also needs to accept an input of depth 8 instead of 2.

Multichannel’s encoder outputs 32 replicas instead of 8; the decoder also expects an input of depth 32. These are arranged in an $8\times 4$ grid.
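A sketch of how the container resolution can be changed through the STFT hyperparameters; the frame and hop values below are illustrative, chosen only so that the output shapes are roughly $1024\times 512$ and $2048\times 1024$ for a ~1.5 s clip, and are not taken from our exact configuration.

```python
import torch

wave = torch.randn(1, 67_522)   # ~1.5 s at 44.1 kHz

base = torch.stft(wave, n_fft=2046, hop_length=132,
                  window=torch.hann_window(2046), return_complex=True)
large = torch.stft(wave, n_fft=4094, hop_length=66,
                   window=torch.hann_window(4094), return_complex=True)
print(base.shape, large.shape)  # roughly (1, 1024, 512) and (1, 2048, 1024)
```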

Figure 4: Examples. Several reconstruction samples with different images and audios using model #29 from Table 1.

4.4 Pixel-shuffle RGB channels with luma

The pixel-shuffle operation [34] used to flatten the image arranges each $1\times 1\times 3$ pixel into a $2\times 2\times 1$ grid. Thus, for every RGB pixel, we obtain a $2\times 2$ grid in which we can buffer values. PixInWav [19] padded a value of 0 into the fourth component of every grid. Contrary to this zero-padding approach, we propose padding with the luma component of the pixel in question, as a way to add redundancy to the signal that can later be used for error correction on the decoder side. This modified pixel-shuffle step then outputs, for every pixel, a $2\times 2$ grid of 4 values $[R, G, B, Y]$, where $Y$ represents the luma component of the pixel, computed from the RGB components by a standardized transformation to the YCbCr color space [21]. On the decoder side, for each received $[R, G, B, Y]$ pixel, the YCbCr representation is computed from the RGB values. The newly computed $Y$ value is then averaged with the received $Y$, and the whole pixel is transformed back to the RGB color space to yield the final image.
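A sketch of the luma-buffered pixel shuffle; the YCbCr constants follow the JPEG/JFIF convention [21], and the simple 50/50 averaging of the two luma estimates is an assumption about the decoder-side merge.

```python
import torch
import torch.nn.functional as F

def shuffle_with_luma(rgb):                      # rgb: (B, 3, H, W) in [0, 1]
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    y = 0.299 * r + 0.587 * g + 0.114 * b        # luma buffered as the 4th value
    return F.pixel_shuffle(torch.cat([rgb, y], dim=1), 2)   # (B, 1, 2H, 2W)

def unshuffle_with_luma(flat):                   # flat: (B, 1, 2H, 2W)
    rgby = F.pixel_unshuffle(flat, 2)            # (B, 4, H, W)
    r, g, b, y_rx = rgby[:, 0:1], rgby[:, 1:2], rgby[:, 2:3], rgby[:, 3:4]
    y = 0.5 * (y_rx + (0.299 * r + 0.587 * g + 0.114 * b))  # average both luma estimates
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b              # chroma from the received RGB
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b
    return torch.cat([y + 1.402 * cr,                        # back to RGB
                      y - 0.344136 * cb - 0.714136 * cr,
                      y + 1.772 * cb], dim=1)
```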

5 Experiments

5.1 Datasets

We have used a subset of 10,000 color images from the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) [15], sampling 10 images per ImageNet class. Every image has been cropped and scaled to $256\times 256\times 3$, normalized, and paired with the STFT of an audio clip (roughly 1.5 s) sampled at 44,100 Hz from the FSDnoisy18k dataset [18]. The audio dataset contains a variety of different sounds, ranging across 20 different classes, among which we can find voice, music and noise.
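A sketch of the image/audio pairing step using standard torchvision/torchaudio calls; the file paths and the exact crop, normalization, and STFT settings are placeholders rather than our precise preprocessing.

```python
import torch
import torchaudio
from PIL import Image
from torchvision import transforms

img_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),                          # scales pixel values to [0, 1]
])
image = img_tf(Image.open('example.jpg').convert('RGB'))   # (3, 256, 256)

wave, sr = torchaudio.load('example.wav')                  # e.g. an FSDnoisy18k clip
wave = wave[:1, : int(1.5 * sr)]                           # mono, ~1.5 s segment
spec = torch.stft(wave, n_fft=2046, hop_length=132,
                  window=torch.hann_window(2046), return_complex=True)
```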

5.2 Ablation Study

In this section, we present the results of our ablation study to determine the impact of the proposed enhancements (Figure 2). Table 1 summarizes the results of our experiments and Figure 4 displays a selection of visual examples. We provide a numerical identifier for each of the models for reference.

STFT instead of STDCT. We assessed the impact of using the magnitude of the STFT as the stego signal instead of the STDCT spectrogram as in [19]. Comparing STDCT models #1–#5 against STFT models #6–#10, we can see that STDCT models struggle to find a balance point with good reconstruction of both image and audio, while STFT models perform well in both.

Modification of the loss function. As a result of our choice of the STFT over the STDCT, we compared the results of computing the waveform loss using the $L_1$ distance instead of the soft DTW discrepancy. In Table 1 we can compare DTW models #6, #18, #22, #25 and #30 with $L_1$ models #7, #17, #21, #24 and #29. Using the $L_1$ distance, results were slightly better in most cases. Note that the soft DTW loss proved superior when using the STDCT, as seen when comparing models #4 and #5, which is explained by the reasoning in Section 4.1.

Figure 5: Comparison of different embedding methods with respect to robustness. We have tried to decode smaller temporal segments containing image information to see how much distortion is induced if some part of the stego signal is lost. We explored decoding different sequential and random percentages of the spectrogram on the temporal axis.

Type of STFT stego signal. Next, we compared the performance of the steganographic operation based on the kind of stego signal used: just the magnitude, just the phase, or a combination of both. We find that using the phase as the sole container is clearly inferior to using the magnitude (compare models #9 and #10 in Table 1), as there is a very significant drop in both image and audio reconstruction quality. Our reasoning for these results is twofold. Firstly, the phase is a much noisier signal in nature, which makes the task of hiding information more difficult. Secondly, minor modifications to the phase component result in a more perceptible distortion in the reconstructed audio, thus rendering the task of concealing the secret signal more challenging. Finally, we compared using both the magnitude and phase as stego signals simultaneously (model #11). The results from Table 1 show that while using both stego signals does substantially increase image quality, there is a significant drop in audio quality, possibly as a consequence of additionally distorting the phase. In conclusion, our study suggests that it is not worth using the phase as a stego signal, since it does not improve the metrics obtained with the baseline model that only uses the magnitude, and therefore the added overhead in the model is not justified.

Comparison of embedding methods. A comparison of different embedding methods has been conducted, and the results are presented in Table 1 across models #11–#15 (stego signal of size $1024\times 512$) and models #16, #20, #23, #28, #33 (stego signal of size $2048\times 1024$). The Multichannel method exhibits a considerable enhancement in image quality, albeit at the cost of a significant decrease in the audio metrics, thereby rendering it less practical for most real-world applications. Conversely, all replicate-based embedding methods outperform the baseline Stretch approach. The qualitative assessment in Figure 3 demonstrates that WS-Replicate generates a superior reconstruction of the original image, as it is the only method capable of preserving the authentic color.

Buffering the luma component in the pixel shuffle operation. Experiments with the pairs of models #16 and #17, #20 and #21, #23 and #24, and #28 and #29 show that this addition does improve the quality of the revealed images while maintaining a comparable audio quality with respect to the baseline model.

Higher resolution stego signal. The values from Table 1 show a very significant improvement when using a larger resolution of the stego signal (compare models #6–#15 against #16–#33), both in image and audio quality, as is expected from having more capacity for carrying information. This increase in performance comes at the cost of increased memory usage and longer training times. Note that, for training purposes, the audio transform can be precomputed for every audio clip; however, the inverse transform is still needed to compute the waveform loss.

5.3 Embedding method effect on robustness

To evaluate the effect of different embedding methods on the robustness of the steganographic method, we attempted to decode smaller temporal segments of the stego signal. In our experiment, we selectively zero out the spectral content at different time frames of the stego spectrograms, simulating a scenario where some data is lost during transmission [40, 20], either in large contiguous chunks or at random positions.
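A sketch of the corruption applied in this test is given below; the dropped fraction and whether it is contiguous are the experiment's variables, and the helper name is ours.

```python
import torch

def drop_time_frames(stego_spec, fraction=0.3, contiguous=True):
    """Zero out a fraction of the time frames of a (B, F, T) stego spectrogram."""
    corrupted = stego_spec.clone()
    T = stego_spec.shape[-1]
    n_drop = int(fraction * T)
    if contiguous:
        start = int(torch.randint(0, T - n_drop + 1, (1,)))
        corrupted[..., start:start + n_drop] = 0.0
    else:
        idx = torch.randperm(T)[:n_drop]
        corrupted[..., idx] = 0.0
    return corrupted
```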

The qualitative results can be appreciated in Figure 5. They show that methods using replication are more robust than the baseline Stretch approach, allowing most of the image to be recovered even if a large part of the stego signal is lost.

5.4 Computational cost study

| Enhancement | # params | GMAC ↓ |
| Baseline | 962128 | 34.6 |
| STFT (magnitude) | +0 | +0% |
| STFT (phase) | +0 | +0% |
| STFT (magnitude + phase) | +962131 | +100.00% |
| $L_1$ loss | +0 | +0% |
| Replicate | +0 | +0% |
| W-Replicate | +4 | +0% |
| WS-Replicate | +584 | -32.89% |
| Multichannel | +12735 | -81.68% |
| Stretch (large) | +0 | +200.00% |
| Replicate (large) | +0 | +200.00% |
| W-Replicate (large) | +4 | +200.00% |
| WS-Replicate (large) | +4103 | -30.23% |
| Multichannel (large) | +67695 | -73.20% |
| Luma | +0 | +0% |
Table 2: Breakdown of computational costs. In this table, we show the relative increment in the number of parameters and Giga-Multiply–accumulate operations (GMAC) of each of the proposed enhancements with respect to the baseline model.

The results from the previous sections show that some setups obtain better performance than others. However, in some cases this comes at the cost of an increased computational load, during both training and inference. The trade-off between these two factors should take into account the available resources and the intended use of the model. Table 2 presents the results of this analysis. An increase in the number of parameters implies higher memory usage and longer execution times, and an increase in Giga Multiply-accumulate operations (GMAC) generally indicates longer execution times and higher energy consumption. Among models with similar performance, lower values in both metrics should be preferred.

Cost of using the STFT instead of the STDCT. For both transforms there exist efficient algorithms with equal asymptotic complexity [11, 8]. We thus consider the two options to be equal in this regard. However, there is an additional cost if we use the magnitude and phase together as stego signals: this scenario doubles the number of parameters and MAC operations, since two separate encoder and decoder networks are used (plus a small coupling network).

Choice of $L_1$ over DTW. The usage of the $L_1$ loss instead of dynamic time warping cannot be directly assessed, since it is only used during the training process and lies outside the model. It should be noted, however, that the $L_1$ loss is generally much more efficient to compute ($\mathcal{O}(n)$ time) than (soft-)DTW ($\mathcal{O}(n^2)$ time) [13].

Cost of embedding methods. The different embedding methods can also be compared in terms of computational load (Table 2). Replicate does not add any parameters with respect to the baseline Stretch method; the only difference is that the information is duplicated and concatenated instead of being upsampled. W-Replicate adds four parameters that scale each of the two replicas (at both the encoder and decoder ends); the effect on the load is negligible. WS-Replicate and Multichannel use deeper convolution kernels over smaller-resolution tensors, thus increasing the number of parameters while decreasing the total number of operations. This is especially noticeable in Multichannel.

Using a higher resolution of the stego signal. When using a larger stego spectrogram, the number of parameters remains the same, except for WS-Replicate and Multichannel, which use even deeper kernels to process a larger number of replicas. However, the number of floating-point operations triples.

Cost of buffering the luma component. The usage of the luma channel in the pixel-shuffle operation only entails a color space change (done through a single matrix multiplication) and averaging the two luma values. These operations do not add any extra trainable parameters, and the computational cost is negligible.

6 Discussion and Conclusion

We have presented a set of key enhancements for an existing image-in-audio deep steganography method, including the use of the STFT, the introduction of redundancy in the encoding and decoding steps for error correction, and the buffering of additional information in the pixel subconvolution operation. Our experiments have demonstrated that our approach outperforms the existing method in terms of robustness and perceptual transparency. Our approach thus represents a significant step forward in the field of multimodal deep steganography, promising improved security and confidentiality in a wide range of applications.

Our qualitative results show a clear system bias to distort those parts of the image where the cover spectrogram exhibits high magnitudes. Although redundancy through replication ameliorates this issue partially, primarily by concealing a substantial portion of the information in the higher frequencies, where the degree of distortion is typically lower, it proves inadequate in certain scenarios where the cover spectrogram manifests high values across all frequencies for a brief duration.

Future work can explore new techniques to increase the system's robustness in these rare cases, as well as the applicability of our approach in real-world scenarios, exposing the stego signal to acoustic alterations such as ambient noise and reverberation.

Acknowledgements

We express our gratitude to Pau Bernat Rodríguez for his discussions throughout this study and his contributions to the project codebase.

References

  • [1] M.M. Amin, M. Salleh, S. Ibrahim, M.R. Katmin, and M.Z.I. Shamsuddin. Information hiding using steganography. In 4th National Conference of Telecommunication Technology, 2003. NCTT 2003 Proceedings., pages 21–25, 2003.
  • [2] Shumeet Baluja. Hiding images in plain sight: Deep steganography. Advances in neural information processing systems, 30, 2017.
  • [3] Shumeet Baluja. Hiding images within images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42:1685–1697, 2020.
  • [4] W. Bender, D. Gruhl, N. Morimoto, and A. Lu. Techniques for data hiding. IBM Systems Journal, 35(3.4):313–336, 1996.
  • [5] J. Brassil, S. Low, N. Maxemchuk, and L. O’Gorman. Electronic marking and identification techniques to discourage document copying. In Proceedings of INFOCOM ’94 Conference on Computer Communications, pages 1278–1287 vol.3, 1994.
  • [6] Jack Brassil, Steven Low, Nicholas Maxemchuk, and Larry O’Gorman. Hiding information in document images. In Proc. Conf. Information Sciences and Systems (CISS-95), pages 482–489. Citeseer, 1995.
  • [7] Germano Caronni. Assuring ownership rights for digital images. In Verläßliche IT-Systeme, pages 251–263. Springer, 1995.
  • [8] Wen-Hsiung Chen, C. Smith, and S. Fralick. A fast computational algorithm for the discrete cosine transform. IEEE Transactions on Communications, 25(9):1004–1009, 1977.
  • [9] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
  • [10] Maura Conway. Code wars: steganography, signals intelligence, and terrorism. Knowledge, Technology & Policy, 16(2):45–62, 2003.
  • [11] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series, 1965.
  • [12] Ingemar J Cox, Joe Kilian, F Thomson Leighton, and Talal Shamoon. Secure spread spectrum watermarking for multimedia. IEEE transactions on image processing, 6(12):1673–1687, 1997.
  • [13] Marco Cuturi and Mathieu Blondel. Soft-dtw: A differentiable loss function for time-series. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, page 894–903. JMLR.org, 2017.
  • [14] V. Darmstaedter, J.-F. Delaigle, D. Nicholson, and B. Macq. A block based watermarking technique for mpeg2 signals: Optimization and validation on real digital tv distribution links. In David Hutchison and Ralf Schäfer, editors, Multimedia Applications, Services and Techniques — ECMAST’98, pages 190–206, Berlin, Heidelberg, 1998. Springer Berlin Heidelberg.
  • [15] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • [16] Xintao Duan, Kai Jia, Baoxia Li, Daidou Guo, En Zhang, and Chuan Qin. Reversible image steganography scheme based on a u-net structure. IEEE Access, 7:9314–9323, 2019.
  • [17] Dan Ellis. Berkeley international computer science institute (icsi) speech faq. https://www1.icsi.berkeley.edu/Speech/faq/speechSNR.html, 2009.
  • [18] Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra. Learning sound event classifiers from web audio with noisy labels, 2019.
  • [19] Margarita Geleta, Cristina Puntí, Kevin McGuinness, Jordi Pons, Cristian Canton, and Xavier Giro-i Nieto. Pixinwav: Residual steganography for hiding pixels in audio. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2485–2489, 2022.
  • [20] Kaliappan Gopalan and Stanley J. Wenndt. Steganography for covert data transmission by imperceptible tone insertion. 2004.
  • [21] Eric Hamilton. Jpeg file interchange format, version 1.02. https://www.w3.org/Graphics/JPEG/jfif3.pdf, 1992.
  • [22] Frank H Hartung, Jonathan K Su, and Bernd Girod. Spread spectrum watermarking: Malicious attacks and counterattacks. In Security and Watermarking of Multimedia Contents, volume 3657, pages 147–158. SPIE, 1999.
  • [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [24] Dalal Hmood, Khamael Abbas, and Mohammed Altaei. A new steganographic method for embedded image in audio file, 04 2012.
  • [25] Neil F. Johnson and Sushil Jajodia. Exploring steganography: Seeing the unseen. Computer, 31(2):26–34, 1998.
  • [26] David Kahn. The codebreakers : the story of secret writing. The American Historical Review, 74:537, 1968.
  • [27] Felix Kreuk, Yossi Adi, Bhiksha Raj, Rita Singh, and Joseph Keshet. Hide and speak: Towards deep neural networks for speech steganography, 2019.
  • [28] Deepa Kundur and Dimitrios Hatzinakos. Digital watermarking using multiresolution wavelet decomposition. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP’98 (Cat. No. 98CH36181), volume 5, pages 2969–2972. IEEE, 1998.
  • [29] Eugene T Lin and Edward J Delp. A review of data hiding in digital images. In PICS, volume 299, pages 274–278, 1999.
  • [30] Nikos Nikolaidis and Ioannis Pitas. Copyright protection of images using robust digital signatures. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 4, pages 2168–2171. IEEE, 1996.
  • [31] F.A.P. Petitcolas, R.J. Anderson, and M.G. Kuhn. Information hiding-a survey. Proceedings of the IEEE, 87(7):1062–1078, 1999.
  • [32] N. Provos and P. Honeyman. Hide and seek: an introduction to steganography. IEEE Security & Privacy, 1(3):32–44, 2003.
  • [33] Kriti Saroha and Pradeep Singh. A variant of lsb steganography for hiding images in audio, 12 2010.
  • [34] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, 2016.
  • [35] Alan Siper, Roger Farley, and Craig Lombardo. The rise of steganography. Proceedings of student/faculty research day, CSIS, Pace University, 2005.
  • [36] Kiyoshi Tanaka, Yasuhiro Nakamura, and Kineo Matsui. Embedding secret information into a dithered multi-level image. In IEEE Military Communications Conference, volume 1, pages 216–220, 1990.
  • [37] Najiya Thasneem and Renjith V. Ravi. An effective technique for hiding image in audio, 2015.
  • [38] Philip Thicknesse. A treatise on the art of decyphering, and of writing in cypher: With an harmonic alphabet. W. Brown, 1772.
  • [39] Atique ur Rehman, Rafia Rahim, Shahroz Nadeem, and Sibt ul Hussain. End-to-end trained cnn encoder-decoder networks for image steganography. In Laura Leal-Taixé and Stefan Roth, editors, Computer Vision – ECCV 2018 Workshops, pages 723–729, Cham, 2019. Springer International Publishing.
  • [40] Zhiyi Wang, Mingcheng Zhou, Boji Liu, and Taiyong Li. Deep image steganography using transformer and recursive permutation. Entropy, 24(7), 2022.
  • [41] Pin Wu, Yang Yang, and Xiaoqiang Li. Image-into-Image Steganography Using Deep Convolutional Network: 19th Pacific-Rim Conference on Multimedia, Hefei, China, September 21-22, 2018, Proceedings, Part II, pages 792–802. 09 2018.
  • [42] Pin Wu, Yang Yang, and Xiaoqiang Li. Stegnet: Mega image steganography capacity with deep convolutional network. Future Internet, 10(6), 2018.
  • [43] Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei. Hidden: Hiding data with deep networks, 2018.