Your Text Encoder Can Be An Object-Level Watermarking Controller

Naresh Kumar Devulapally¹ Mingzhen Huang¹ Vishal Asnani²
Shruti Agarwal² Siwei Lyu¹ Vishnu Suresh Lokhande¹
¹University at Buffalo, SUNY ²Adobe Research
{devulapa, vishnulo, mhuang33, siweilyu}@buffalo.edu {shragarw, vasnani}@adobe.com

Abstract

Invisible watermarking of AI-generated images can help with copyright protection, enabling detection and identification of AI-generated media. In this work, we present a novel approach to watermark images of T2I Latent Diffusion Models (LDMs). By only fine-tuning text token embeddings $\mathcal{W}_{*}$ , we enable watermarking in selected objects or parts of the image, offering greater flexibility compared to traditional full-image watermarking. Our method leverages the text encoder’s compatibility across various LDMs, allowing plug-and-play integration for different LDMs. Moreover, introducing the watermark early in the encoding stage improves robustness to adversarial perturbations in later stages of the pipeline. Our approach achieves $99\%$ bit accuracy ( $48$ bits) with a $10^{5}\times$ reduction in model parameters, enabling efficient watermarking. Code: GitHub.

1 Introduction

As debates around copyright and intellectual property intensify, the need for effective watermarking solutions increases [37]. With generative AI models producing indistinguishable images from human-made art, maintaining authorship provenance in digital media is crucial. Efforts are underway, from legislative measures like the Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence [2], to industry initiatives that aim to watermark all AI-generated content [27] [34]. These developments highlight the growing importance of watermarking as a key area of study.

As a result, multiple Generative AI (GenAI) watermarking methods have been proposed [10, 37, 8] to invisibly watermark the entire image while generating the image. However, lots of current watermark methods [7, 4] can only encode a watermark to an open-box generative model as they need to access the latent space. Moreover, similar to other invisible watermarking techniques [3, 8, 7], these methods also suffer with the inherent trade-off between quality and robustness of the embedded signal. Owing to the high imperceptibility requirements for the high-quality image generation, these watermarks more often lacks robustness [10, 40]. To improve the watermark imperceptibility, partial watermarking has been proposed to limit the watermark related changes only to selected “non-salient” objects or regions in the image [25]. In our paper, we bring such partial watermarking to text-to-image generation pipeline. Our partial watermarking is not only for improving the quality of the watermarked image, but also to give the user the flexibility to watermark only selected object of interest in the generated image without accessing the latent space. This is crucial for scenarios where the user would like to protect the unique object in the image, see LABEL:fig:teaser.

Specifically, we propose a token-based watermarking approach that can embed watermarks into selected objects or partial regions of an image. Unlike previous works that have explored architectural modifications, such as the addition of secret encoders [4], distortion layers [7], and adapters [6], our watermarking method focuses solely on learning a watermark embedding. This allows for a prêt-à-porter style of training [31], leveraging pre-trained models with minimal adjustments. Additionally, compared with prior works [4, 6, 7], our model is blind, meaning that the entire generative model is neither modified nor requires additional training. A simple watermark token is learned by our model and can be plugged into any diffusion model for watermark generation. Previous works [8, 37] have investigated image watermarking without altering the architecture. Although effective, they lack the convenience of prêt-à-porter training, which offers the key advantage of personalizing large models with minimal retraining while using the model’s core components as-is. In contrast, our approach enables users to apply watermarking directly at the prompt stage without modifying core model components such as the UNet or image decoder. We introduce new pseudo-tokens $\bm{\mathcal{W}_{*}}$ into the model’s vocabulary, where the $\bm{\mathcal{W}_{*}}$ can be used as an input text prompt to be applied on any T2I diffusion models for generation watermarked images. This also enables users to selectively watermark specific image regions by leveraging the cross-attention between the special token and visual region in the image where the watermark needs to be embedded. By integrating watermarking functionality early in the text-to-image pipeline, our approach performs in-generation watermarking, offering greater robustness against image manipulation attacks compared to post-processing techniques while improving the overall quality of the watermarked images.

We make the following key contributions:

1.

We introduce a novel pseudo-token with watermarking capabilities, learning a new embedding vector for it. We investigate the impact of applying this pseudo-token conditioning at different timesteps of the diffusion process, finding that timesteps closer to the VAE encoder in LDMs during forward process offer enhanced image generation quality.
2.

We introduce object-level watermarking, allowing for selective watermarking with greater precision than traditional whole-image approaches.
3.

We offer plug-and-play integration with various Stable Diffusion (SD) variants by embedding the watermark directly via the text encoder. Our approach achieves $99\%$ bit accuracy with a $10^{5}\times$ reduction in model parameters, enabling efficient watermarking with a throughput of $48$ bits.

2 Related Works

Text-Guided Image Generation Diffusion Models: Diffusion models have revolutionized text-to-image (T2I) generation, surpassing GANs in fidelity and diversity. By iteratively denoising random noise into images conditioned on text prompts, these models enable fine-grained control over generation. Frameworks like DALL-E 2, Imagen [14], and Stable Diffusion [35] democratize high-quality image synthesis, while diffusion-based editing methods [13] enable localized modifications. However, their widespread adoption raises challenges, including copyright infringement and harmful content generation, necessitating controllable synthesis mechanisms such as watermarking.

Textual Inversion for Personalized Image Generation: Textual Inversion [9] embeds visual or stylistic concepts into T2I models through learnable tokens, enabling personalization without altering model parameters. This lightweight approach has been extended to style transfer [38]. In this work, we find that training the special token not only allows for flexible concept learning but also could acts as a watermark controller.

Constraint-Based Control in Text-to-Image Generation: Text-to-image generation has gained significant popularity, especially to incorporate constraints that enhance control over generated content. Common constraints include spatial or layout constraints, which dictate object placement and dimensions within the generated image to meet predefined spatial requirements [19]. Another type, multi-view consistency constraints, ensures scene consistency across perspectives, preserving layout, lighting, and depth even in complex views like outdoor or stylized scenes [41]. Attribute preservation constraints maintain prompt-specified attributes (e.g., color, texture, size) in the generated image, ensuring semantic alignment. Finally, watermarking constraints embed an invisible watermark to verify ownership while preserving image quality. Watermarks can be embedded through polytope constraints defined by orthogonal Gaussian vectors or high-frequency components, allowing watermarking without image degradation. Constraint-based generation methods, which offer explicit control over outputs, draw inspiration from classifier training techniques that optimize using soft penalties [5]. Recent approaches such as [12] have achieved notable success with convergence guarantees and adaptability across data regimes .

In-generation Image watermarking in T2I generation: With the increasing popularity of diffusion models, recent research has explored embedding watermarks within the diffusion process, either by manipulating the noise schedule to encode watermark information or by conditioning the model to output watermarked images. Tree-ring [37] proposed to encode a watermark into the noise space, it can be simply detected by applying a DDIM [35] inversion to obtain the noise. However, current in-generation watermark methods [29, 8, 7] cannot inject a watermark in a local region, e.g. an object. As many diffusion-based image editing method have been emerging for a long time, the object-level watermarking is long overdue. Recently, WAM [33] discusses localized watermarking and proposes a training mechanism for watermarked region detection from watermarked image. However, WAM still requires segmentation masks to localize watermark during watermark embedding in both training and inference.

Blind v/s Non-Blind Watermarking: Mentioned in [18], blind detection methods verify ownership using only the watermarked image, eliminating reliance on auxiliary metadata. Our work adopts this practical approach, ensuring compatibility with standard detection workflows.

Motivated by these insights, we explore similar methods in text-to-image generation, where constraints are often enforced as soft penalties via gradient-based optimization. Current diffusion-based watermarking methods lack spatial control, limiting object-centric applications. Recent advances in localized diffusion editing [13] remain unexplored for watermark integration. By unifying textual inversion with constraint-based control, we propose the first framework for object-level in-generation watermarking in T2I generation, addressing traceability and granularity.

Refer to caption — Figure 1: $\bm{\mathcal{W}_{*}}$ training pipeline. (Left) To find $\mathcal{W}_{*}$ token embeddings, we use an Img2Img generation pipeline. $\mathcal{D}$ and $\mathcal{D}_{w}$ represent VAE decoder in the LDM, and Watermark Detector respectively. While training, we send the input image through LDM encoder to retrieve the latent $z_{0}$ , we then add a forward diffusion noise of $\tau^{*}$ timesteps, followed by iteratively denoising $z_{\tau^{*}}$ using Classifier-Free Guidance [12] from $[\tau^{*}\to 0]$ to retrieve $z^{\prime}_{0,w}$ . During the denoising process, we train for $\mathcal{W}_{*}$ token embeddings. (Middle) We use latent matching loss to control the trajectory of watermarked latents and bit loss to find $\mathcal{W}_{*}$ token embeddings. (Right) We then use trained $\mathcal{W}_{*}$ embeddings to generate watermarked images.

3 Problem Statement

Our problem setting uses Latent Diffusion Models (LDMs) [30] for image generation. We consider an in-generation watermarking scenario, as defined in Sec. 2, where the aim is to utilize existing modules within the LDM pipeline to watermark images while generation.

Watermark Embedder: Given an input text prompt ( $\mathcal{P}=\{p_{0},p_{1},...p_{n}\}$ ) and/or an image $\mathcal{I}$ , the aim of an in-generation watermark embedder ( $W_{e}$ ) is to generate a watermarked image $\mathcal{I}_{w}=LDM(\mathcal{I}\mid W_{e},\mathcal{P}$ ). Each embedder $W_{e}$ is associated with a watermark key $(m)$ (a bit string containing $0$ , $1$ similar with prior works [8, 7]) of length $k$ , i.e., $m\in\{0,1\}^{k}$ . From here we shall use $W_{e,m}$ to denote a watermark embedder and $\mathcal{I}_{w,m}$ to denote a watermarked image.

Watermark Detector: Given a watermarked image $\mathcal{I}_{w,m}$ , watermark detector ${D}_{w}(\cdot)$ is a neural network that predicts the key $m$ from $\mathcal{I}_{w,m}$ . Our method utilizes Blind Watermarking scenario mentioned in Sec. 2 where ${D}_{w}(\cdot)$ only requires the watermarked image $\mathcal{I}_{w,m}$ as input without the need for any additional metadata.

Attack Module: Attack module $\mathcal{A}(\cdot)$ represents unintentional (or) intentional transforms applied to the watermarked image $\mathcal{I}_{w,m}$ that could result in the loss of watermark key during watermark detection. A successful attack module results in imperfect watermark detection, i.e., ${D}_{w}(\mathcal{A}(\mathcal{I}_{w,m}))\neq m$ . This module can usually be seen in the form of image compression by public platforms or an intentional malicious attacker that aims to remove watermark from $\mathcal{I}_{w,m}$ . Resistance to an attack module is a fundamental requirement of a robust watermark embedder $W_{e}$ .

Full-Image and Object-Level Watermarking: Given a text prompt for generation, full-image watermarking refers watermarking entire image and object-level watermarking refers to the scenario that watermarks specific objects ${O_{i},...O_{j}}$ in the image selected by a user. Information about $O_{i}$ can be provided in various ways. For example [8] uses post-processing techniques to localize watermark on various regions using masks. However, using such post-processing methods introduces additional computational overhead and may be susceptible to removal or modification. Instead, integrating object-level watermarking into the generative process enables seamless and robust embedding within the chosen object ${O_{i},...O_{j}}$ regions.

Our problem setting considers both image and object-level watermarking scenarios with a constraint that the information for watermarking a specific object $O_{i}$ shall not be provided in the form of any additional information such as segmentation masks. This information can only be obtained right from the text prompt. Specifically, for our setting, a text prompt $``\text{[A photo of a cat $\mathcal{W}_{*}$]}"$ denotes full-image watermarking and a text prompt $``\text{A photo of a [cat $\mathcal{W}_{*}$]}"$ denotes that the object cat is to be watermarked.

4 Method

Given an image, $\mathcal{I}$ , our training pipeline aims to generate $\mathcal{I}_{w,m}$ where $m$ is the watermark key, $m\in\{0,1\}^{k}$ . We utilize a differentiable watermark detector from [7] ${D_{w}}(I_{w,m})$ that permits gradient flow.

It is known that training larger modules of an LDM pipeline, such as VAE Encoder/Decoder or the UNet, is computationally expensive and do not provide a lightweight, seamless way to integrate watermarking into various other LDM pipelines. There is a need for a unified watermark embedder that can be integrated with various LDM variants.

4.1 Token Embeddings for Watermarking

Our method utilizes the relatively under-explored Text Encoder of the LDM pipeline [21]. We introduce a new token in the Text Encoder, denoted by $\bm{\mathcal{W}_{*}}$ , and fine-tune the text embeddings of $\bm{\mathcal{W}_{*}}$ to watermark $\mathcal{I}$ . In addition to significant lower parameter requirement compared with prior works [8, 7], this token $\bm{\mathcal{W}_{*}}$ act as the watermark trigger that can be seamlessly integrated into the text encoder of any LDM pipeline.

Our method initially aims to train $\bm{\mathcal{W}_{*}}$ token embeddings, similar with Textual Inversion [9], to generate watermarked latents $z_{t,w}=\epsilon_{\theta}(z_{t},t,\{\mathcal{P},\mathcal{W}_{*}\})$ . However, we identify the need to carefully select the optimal noise timesteps during $\bm{\mathcal{W}*}$ training as different choices impact image quality, watermark robustness and image generation time.

4.2 Optimal Timestep and Latent Loss

We perform empirical studies to observe the effect of different noise timesteps during forward diffusion process while $\bm{\mathcal{W}_{*}}$ training. As depicted in Fig. 1 (middle) (and in the section H. of the supplement), we observe a trade-off between image quality and watermarking performance when choosing different noise timesteps. A relatively large timestep $(t\sim T)$ would degrade the image quality while a relatively small timestep $(t\sim 0)$ would lead a lower bit accuracy but enhanced image generation quality. This observation is consistent with prior findings, as mentioned in [31, 17], which highlight the impact of noise strength on conditioning fidelity. We identify an optimal timestep $\tau^{*}=8$ that balances this trade-off between image quality and watermark bit accuracy. During $\bm{\mathcal{W}_{*}}$ training, we add a forward noise of $\tau^{*}$ to the image followed by iterative de-noising to generate watermarked latent $z^{\prime}_{0,w}$ . $z^{\prime}_{0,w}$ is then passed into the VAE decoder $Dec(\cdot)$ to generate a watermarked image $I_{w,m}$ . $D_{w}$ takes $I_{w,m}$ as input, ${D_{w}}(I_{w,m})=m^{\prime}$ , to train $\bm{\mathcal{W}_{*}}$ token embeddings on watermark loss $\mathcal{L}_{w}$ . We utilize $BCE$ loss to embed a specific bit key $m$ using $\bm{\mathcal{W}_{*}}$ .

\mathcal{L}_{w}=BCE(D_{w}(Dec(z^{\prime}_{0,w})),m).

(1)

While the optimal timestep $\tau^{*}$ preserves overall information in $\mathcal{I}$ and significantly reduces the time taken for $I_{w,m}$ generation, we still observe visible corruption in $I_{w,m}$ that hurts imperceptibility watermarking constraint. As a remedy, we employ latent matching loss $\mathcal{L}_{z}$ that aids our method to perform invisible watermarking.

\mathcal{L}_{z}=\min_{\mathcal{W}_{*}}\mathbb{E}_{t}\left[\left\|z^{*}_{t}-z^{\prime}_{t}(\mathcal{W}_{*})\right\|_{2}^{2}\right].

(2)

Our method uses $\mathcal{L}=\alpha\mathcal{L}_{w}+\beta\mathcal{L}_{z}$ to train $\bm{\mathcal{W}_{*}}$ token embeddings for watermarking where $\alpha$ and $\beta$ are tunable hyperparameters.

4.3 Object-level watermarking

Algorithm 1 Object-level Watermarking during Text-to-Image Generation in Latent Diffusion Models

1:Input: LDM Decoder

\mathcal{D}

; Watermark Token

\mathcal{W}^{i}_{*}

for object

i

; Text Prompt

\mathcal{P}

; overlay strength

\pi(t)

2:Select objects to watermark, defining the subset

\{\mathcal{P}^{*}_{0},\mathcal{P}^{*}_{1},\dots,\mathcal{P}^{*}_{k}\}

with corresponding watermarking tokens

\{\mathcal{W}^{*}_{0},\mathcal{W}^{*}_{1},\dots,\mathcal{W}^{*}_{k}\}

from the text prompt

\mathcal{P}=\{\mathcal{P}_{0},\mathcal{P}_{1},\dots,\mathcal{P}_{n}\}

z_{T}\in\mathcal{N}(0,I)

4:for

t=T,\dots,0

5: for each token

\mathcal{P}^{*}_{i}\in\{\mathcal{P}^{*}_{0},\mathcal{P}^{*}_{1},\dots,\mathcal{P}^{*}_{k}\}

\mathcal{M}_{\mathcal{P}^{*}_{i}}^{(t)}/\mathcal{M}_{\mathcal{W}^{*}_{i}}^{(t)}\leftarrow

Attention map of

\mathcal{P}^{*}_{i}/\mathcal{W}^{*}_{i}

\mathcal{M}_{\mathcal{W}_{*}}^{(t)}\leftarrow\pi(t)\cdot\mathcal{M}_{\mathcal{W}_{*}}^{(t)}

(Adjust watermark-strength per timestep)

\mathcal{M}_{\mathcal{P}_{i}}^{(t)}\leftarrow(1-\alpha)\cdot\mathcal{M}_{\mathcal{P}_{i}}^{(t)}+\alpha\cdot\mathcal{M}_{\mathcal{W}_{*}}^{(t)}

(Overlay attention map of

\mathcal{W}^{i}_{*}

on that of

\mathcal{P}^{i}_{*}

)

9: end for

10:

\hat{\epsilon}=\epsilon_{\theta}(z_{t},t,\psi(\mathcal{P}),\mathcal{M}_{\mathcal{P}^{*}_{0:k}}^{(t)},\mathcal{M}_{\mathcal{W}^{*}_{0:k}}^{(t)})

11:

z_{t-1}=\sqrt{\frac{\alpha_{t-1}}{\alpha_{t}}}z_{t}+\left(\sqrt{\frac{1}{\alpha_{t-1}}-1}-\sqrt{\frac{1}{\alpha_{t}}-1}\right)\hat{\epsilon}

12:end for

13:Output: Decoder output of

z_{0}

, that is

\mathcal{D}(z_{0})

In the original LDM [30], the query feature $Q$ and key feature $K$ from cross attention operation defines a token-wise attention mask $M$ where $M=\text{softmax}(Q\cdot K^{T})/\sqrt{d})$ . As the architecture of the LDM remains unchanged in our model, we inherit this capability to generate corresponding attention mask from text tokens. Such feature is the key in our model unlocking new possibilities for object-level watermarking utilizing these cross-attention maps.

Once the watermarking token embeddings are fine-tuned, we proceed to generate images using a standard text-to-image LDM model, but with an ability to perform object-level watermarking. We utilize cross-attention maps at each timestep $t$ to localize watermark on any chosen object $O_{i}$ right from the text prompt $\mathcal{P}$ as a token $\mathcal{P}_{i}(\in\mathcal{P})$ . Specifically, we utilize [11] to obtain cross-attention map $M_{t}$ of $O_{i}$ given by:

z^{\prime}_{t-1},M_{t,O_{i}}\leftarrow\text{LDM}(z^{\prime}_{t},\mathcal{P}_{i},t)

(3)

Given a text prompt with multiple tokens $\mathcal{P}=\{\mathcal{P}_{0},\mathcal{P}_{1},\dots,\mathcal{P}_{n}\}$ , the user has the flexibility to choose specific object tokens $\{\mathcal{P}_{i},\mathcal{P}_{j},\dots\}$ to watermark. During the image generation process, we extract the cross-attention maps for the objects intended for watermarking from the UNet. These are then combined with the attention maps of the corresponding watermarking tokens $\mathcal{W}_{*}^{i},\mathcal{W}_{*}^{j},\dots$ to precisely localize the watermark on each object allowing us to seamlessly perform multi-object watermarking.

To bring in the effect of optimal timestep $\tau^{*}$ (seen during training Sec. 4.2), we introduce a watermark overlay strength controller $\pi(t)$ . $\pi(t)$ could be set to a step function with values $0\enspace(\text{ if }t>\tau^{*})$ and $1\enspace(\text{ if }t\leq\tau^{*})$ which exactly mimics the generation scenario seen during training. In addition to the step function, we also test with a smoothing function that brings the effect of $\tau^{*}$ by giving more weight to timesteps closer to the VAE decoder. At these timesteps we observe increased reliability of the attention maps enhancing the precision of the watermark placement. The complete procedure is outlined in Algorithm 1.

5 Experiments

Method	I.W.	O.W.	L.P.	# Params $\downarrow$	Imperceptibility			Robustness to basic attacks (BA):							Adversarial attacks (BA):
					PSNR $\uparrow$	SSIM $\uparrow$	FID $\downarrow$	None $\uparrow$	Brightness $\uparrow$	Contrast $\uparrow$	Blur $\uparrow$	Crop $\uparrow$	Rot. $\uparrow$	JPEG $\uparrow$	SDEdit $\uparrow$	WMAttacker $\uparrow$	DiffPure $\uparrow$
WikiArt (48 bits) - In-Generation Watermarking
Stable Sig. [8]	✓	✗	✗	$10^{5}+$	31.57	0.88	24.71	0.99	0.93	0.87	0.78	0.79	0.70	0.55	0.58	0.53	0.52
LaWa [29]	✓	✗	✗	$10^{5}+$	32.52	0.93	18.23	0.99	0.99	0.98	0.94	0.93	0.83	0.89	0.68	0.76	0.78
TrustMark [3]	✗	✗	✗	$10^{5}+$	39.90	0.97	15.83	0.99	0.98	0.99	0.93	0.89	0.87	0.86	0.68	0.77	0.73
RoSteALS [4]	✓	✗	✗	$10^{5}+$	32.68	0.88	16.63	0.98	0.96	0.94	0.88	0.88	0.75	0.80	0.75	0.72	0.72
AquaLoRA [7]	✓	✗	✗	$10^{5}+$	31.46	0.92	17.27	0.94	0.91	0.91	0.81	0.90	0.58	0.76	0.68	0.67	0.66
WAM [33]	✓	✓	✗	$10^{5}+$	36.46	0.97	16.27	0.97	0.93	0.92	0.84	0.92	0.76	0.84	0.72	0.71	0.72
Ours + SD ${}_{\text{style}}$	✓	✓	✓	768	35.88	0.93	16.72	0.99	0.97	0.99	0.94	0.97	0.94	0.93	0.81	0.83	0.84
Ours + TI	✓	✓	✓	768	36.89	0.94	15.98	0.99	0.96	0.99	0.95	0.98	0.95	0.92	0.82	0.84	0.86
Ours + SD	✓	✓	✓	768	39.92	0.97	14.89	0.99	0.98	0.99	0.97	0.98	0.95	0.95	0.85	0.87	0.88
MS-COCO (48 bits) - In-Generation Watermarking
Stable Sig. [8]	✓	✗	✗	$10^{5}+$	31.93	0.88	24.74	0.99	0.94	0.89	0.81	0.82	0.72	0.59	0.62	0.63	0.57
LaWa [29]	✓	✗	✗	$10^{5}+$	33.53	0.95	17.45	0.99	0.99	0.98	0.94	0.93	0.86	0.90	0.71	0.79	0.79
TrustMark [3]	✗	✗	✗	$10^{5}+$	40.98	0.98	14.89	0.99	0.98	0.99	0.95	0.91	0.89	0.88	0.69	0.78	0.74
RoSteALS [4]	✓	✗	✗	$10^{5}+$	32.68	0.88	16.63	0.99	0.98	0.93	0.89	0.87	0.78	0.81	0.76	0.72	0.75
AquaLoRA [7]	✓	✗	✗	$10^{5}+$	31.46	0.92	17.27	0.95	0.91	0.90	0.80	0.92	0.68	0.78	0.67	0.70	0.68
WAM [33]	✓	✓	✗	$10^{5}+$	36.98	0.97	16.85	0.98	0.94	0.92	0.86	0.94	0.77	0.86	0.73	0.73	0.72
Ours + SD ${}_{\text{style}}$	✓	✓	✓	768	36.99	0.93	16.72	0.99	0.98	0.99	0.94	0.97	0.94	0.93	0.81	0.83	0.84
Ours + TI	✓	✓	✓	768	36.89	0.94	15.23	0.99	0.96	0.99	0.95	0.98	0.96	0.94	0.83	0.84	0.86
Ours + SD	✓	✓	✓	768	40.92	0.97	14.83	0.99	0.98	0.99	0.97	0.98	0.97	0.96	0.86	0.88	0.89

Table 1: Comparison to watermark baselines. (I.W.: In-generation Watermarking, O.W.: Object-level Watermarking, L.P.: Less than $10^{5}$ parameters, BA: Bit Accuracy) We compare our method several baselines. In addition to watermark invisibility and robustness of watermarking, we present the number of parameters used for training. We see that our method uses $10^{5}\times$ lower parameters. We see that early integration for watermarking improves robustness to attacks, we see a consistent trend of this improvements in basic image processing attacks and adversarial attacks. We also present the performance of our method in the presence of personalization using fine-tuned UNet and Textual Inversion. From the above table we see that our method can be plugged into various LDM pipelines. More details on specific implementation of attacks can be found in the supplement.

Datasets: We evaluate our method on two datasets, namely, MS-COCO [20] and WikiArt [36]. We use a subset of 2,000 images from MS-COCO dataset, similar to [8], for training $\mathcal{W}_{*}$ token embeddings. Our method is evaluated on 1000 validation captions from MS-COCO and WikiArt each via image-to-image generation and on 100 prompts from [7] for text-to-image generation.

Evaluation Metrics: In line with prior work [8, 7], we assess our watermarking method using the following metrics:

(a) Robustness, measured by Bit Accuracy under various attacks (mentioned in Tab. 1). Bit accuracy represents the percentage of correctly detected bits in the watermark key $m$ output from the Detector $D_{w}(\cdot)$ . In addition to Bit Accuracy, we also compare our method with techniques [37] that use True Positive Rate as the metrics in supplement (section D). These methods do not embed a bit string $m$ .

(b) Imperceptibility, measured by Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) [21, 26], and Fréchet inception distance (FID) [15], comparing watermarked and non-watermarked images.

(c) Parameter Efficiency, where our method trains only the textual embeddings, with parameter usage significantly reduced ( $10^{5}\times$ ) compared to baselines [8, 7], as the T2I pipeline and watermark decoder remain frozen.

Watermark Detector: For our experiments we utilize two watermark detectors from [7] and [8].

5.1 Full-Image Watermarking results

We evaluate full-image watermarking by integrating $\bm{\mathcal{W}_{*}}$ into the Stable-Diffusion v $1.5$ [35], using the AquaLora watermark detector [7]. Our method’s performance is compared to existing watermarking baselines in LDMs, focusing on robustness and imperceptibility. Tab. 1 presents the results, showing that our approach requires significantly fewer parameters while maintaining high bit accuracy under attacks. We categorize attacks encountered after watermarking into two categories, namely, basic image processing attacks such as Rotation, Resize, Crop, JPEG compression etc as listed in Tab. 1, and adversarial attacks such as DiffPure [24], WMAttacker [40] etc. We provide implementation details of attack module in supplement (section I).

We consider that robustness to adversarial attacks including Diff-Pure [24] and SDEdit [22], while beneficial, is not essential for watermarking techniques. These attack methods are complex and go beyond common attacks like crop, resize, and rotations. However, our watermarking method integrated into the denoising process demonstrates enhanced robustness to both basic and adversarial attacks outperforming several in-generation watermarking techniques thereby pushing the benchmark for robust watermarking. When compared to AquaLORA [7], while using the same watermark detector, we see an increase in bit accuracy on both common attacks by over $20\%$ , and over $7\text{dB}$ in PSNR. Additionally, our method retains high robustness and personalization when used alongside a personalized textual inversion token and/or a personalized UNet.

Summary. Our Full-image watermarking achieves improved robustness, lower parameter requirements, and enhanced imperceptibility compared to baselines.

5.2 Key results on Object Watermarking

Object Watermarking and Identification: We present the effectiveness of object-level watermarking, as illustrated in Fig. 2, where we apply watermarks to up to three distinct objects within a single image, mentioned in algorithm Algorithm 1. Using the watermark detector $D_{w}$ , and without the need for any additional information such as segmentation masks, heatmaps are generated by evaluating the bit accuracy of small patches, each covering approximately $10\%$ of the image area. Each patch’s bit accuracy score is determined by the detector, and these scores are aggregated across all patches to produce a comprehensive heatmap. In this map, a score of $1$ represents a bit accuracy of $100\%$ , indicating high fidelity in watermark retrieval, while a score of $0.5$ indicates retrieved watermark key does not match $m$ indicating no watermark. When watermarking a single object, the resulting heatmap shows precise retrieval of the watermark. For images with two or three watermarked objects, the detector reliably identifies each watermark, and accurate retrieval is observed. In cases involving four or more objects (bottom-right of Fig. 2), the heatmap confirms that the watermark is detected across most of the image, likely due to the accuracy of attention maps during inference. For images containing multiple objects, separate heatmaps can be generated for each object to enhance clarity in visualization. The overall heatmap presented in Fig. 2 is an aggregate of individual heatmaps detailed in Fig. 5. Further technical details on the heatmap generation process are available in the supplement (section B).

Single object
	Bit Accuracy
	None	Bright.	Cont.	Blur	Rot.	JPEG
Segment + white bg	$0.99$	$0.97$	$0.96$	$0.97$	$0.96$	$0.97$
Segment + style bg	$0.95$	$0.92$	$0.90$	$0.96$	$0.94$	$0.93$
Crop object (0.8 $\times$ size)	$0.96$	$0.94$	$0.92$	$0.95$	$0.92$	$0.93$
Crop object (0.5 $\times$ size)	$0.92$	$0.91$	$0.90$	$0.91$	$0.92$	$0.90$
Crop object (0.4 $\times$ size)	$0.90$	$0.89$	$0.88$	$0.89$	$0.89$	$0.90$
Multiple objects
1 object	$0.99$	$0.97$	$0.96$	$0.97$	$0.95$	$0.98$
2 objects (no overlap)	$0.94$	$0.93$	$0.95$	$0.95$	$0.94$	$0.96$
3 objects (no overlap)	$0.90$	$0.89$	$0.90$	$0.90$	$0.90$	$0.99$
2 objects (overlap $\geq 40\%$ )	$0.79$	$0.76$	$0.70$	$0.80$	$0.74$	$0.74$

Table 2: Object-level watermarking robustness performance with attacks. We evaluate the performance of our method to perform object-level watermarking in the presence of attacks. We clearly see that our method achieves robust watermarking performance on rotate and crop attacks for different crops performed within watermarked objects. We present our results while watermarking single and up to three objects.

Stress Tests: Object Size and Multiple Object Watermarking: We evaluate the sensitivity of the watermark detection to small object sizes and multiple objects. As shown in Tab. 2, even with a $40\%$ crop of the object, the detector achieves $89\%$ bit accuracy. This accuracy remains robust to transformations. We also test watermark detection across multiple objects, with and without overlap. For non-overlapping objects, detection remains above $90\%$ without attacks and above $89\%$ under attacks. However, performance decreases when the overlap exceeds $40\%$ , particularly in the case of multiple overlapping objects. Our qualitative results indicate that single-object watermarking is not affected by object overlap, but bit accuracy decreases with significant overlap due to multiple layers of interference.

Summary. Object-level watermarking achieves high detection accuracy across multiple objects and varying sizes, with robust performance even under transformations, although overlap between objects during multi-object watermarking reduces bit accuracy.

5.3 Applications: Integration with LDM Pipelines and Textual Inversion Compatibility

We demonstrate the versatility of our method by integrating the plug-in watermark token $\mathcal{W}_{*}$ across diverse text-to-image (T2I) generation pipelines, assessing both watermark robustness and imperceptibility. The schematic in Fig. 1 illustrates this integration process, where $\mathcal{W}_{*}$ is loaded into the text encoder of different LDMs, including pipelines that employ personalized or fine-tuned diffusion models. Additionally, $\mathcal{W_{*}}$ can be combined with Textual Inversion tokens, as shown in qualitative results in Fig. 2. Our method achieves a high PSNR of $35$ dB and bit accuracy above $92\%$ , outperforming the Stable Signature baseline [8] in robustness while maintaining watermark invisibility and resistance to attacks.

5.4 Ablation Studies

Training with SAM Segmentation Masks: Our watermarking pipeline relies on attention maps generated by prompts, though these maps can sometimes lack precision (e.g., spillover beyond the intended object, seen in bottom-right scenario of Fig. 2). This can pose challenges for applications like medical imaging, where precise watermark localization is crucial. In case of a need for such high precision localization of watermark target, our method is flexible to utilize segmentations masks within watermark generation and need not rely on post-generation localization. In this ablation study, we incorporate SAM (Segment-Anything) segmentation masks directly into our generation process as an alternative to attention maps, particularly in Step $4$ of Algorithm 1. Our results (Fig. 4) show that SAM masks significantly improve watermark localization accuracy.

Ablation on Number of bits for Watermarking. We perform an ablation on number of bits that can be embedded into image by our method and present our results as a plot in the supplement (section C.). We observe that our method can watermark with $>91\%$ bit accuracy for $128$ bits under various attacks.

6 Limitations

As our method performs in-generation watermarking. For localizing the watermark on an object, the method relies on Cross Attention maps for each token in the input prompt $\mathcal{P}$ . Hence, the method relies strongly on the accuracy of these cross attention maps. Our ablation studies conducted on using the segmentation masks from SAM enhance the localization of watermark on top of the object. However, relying on an external segmentation model such as SAM could be undesirable and overly relying on Cross Attention maps could be a limitation when the attention maps are not properly defined.

7 Conclusions

In this paper, we propose a novel in-generation watermarking technique to integrate watermarking into the latent within the denoising process of T2I generation. Our watermarking technique provides watermarking control directly from text and fine-tunes token embeddings of a single token. Our method contributes to a novel application of object-level watermarking within T2I generation. We show that our early watermarking technique shows improvements in watermarking robustness across several post generation attacks. Our method aims to motivate future research towards training-free watermarking, controllable watermarking with any T2I generation pipeline.

References

An et al. [2024] Bang An, Mucong Ding, Tahseen Rabbani, Aakriti Agrawal, Yuancheng Xu, Chenghao Deng, Sicheng Zhu, Abdirisak Mohamed, Yuxin Wen, Tom Goldstein, et al. Benchmarking the robustness of image watermarks. arXiv preprint arXiv:2401.08573, 2024.
Biden [2023] Joseph R Biden. Executive order on the safe, secure, and trustworthy development and use of artificial intelligence. 2023.
Bui et al. [2023a] Tu Bui, Shruti Agarwal, and John Collomosse. Trustmark: Universal watermarking for arbitrary resolution images, 2023a.
Bui et al. [2023b] Tu Bui, Shruti Agarwal, Ning Yu, and John Collomosse. Rosteals: Robust steganography using autoencoder latent space, 2023b.
Daras et al. [2022] Giannis Daras, Mauricio Delbracio, Hossein Talebi, Alexandros G. Dimakis, and Peyman Milanfar. Soft diffusion: Score matching for general corruptions, 2022.
Feng et al. [2023] Weitao Feng, Jiyan He, Jie Zhang, Tianwei Zhang, Wenbo Zhou, Weiming Zhang, and Nenghai Yu. Catch you everything everywhere: Guarding textual inversion via concept watermarking, 2023.
Feng et al. [2024] Weitao Feng, Wenbo Zhou, Jiyan He, Jie Zhang, Tianyi Wei, Guanlin Li, Tianwei Zhang, Weiming Zhang, and Nenghai Yu. AquaLoRA: Toward white-box protection for customized stable diffusion models via watermark LoRA. In Proceedings of the 41st International Conference on Machine Learning, pages 13423–13444. PMLR, 2024.
Fernandez et al. [2023] Pierre Fernandez, Guillaume Couairon, Hervé Jégou, Matthijs Douze, and Teddy Furon. The stable signature: Rooting watermarks in latent diffusion models. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 22409–22420, 2023.
Gal et al. [2022] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion, 2022.
Haden [2023] David Haden. Now you see it… now you don’t: Detecting ai use via image watermarking. Information Today, 40(9):31–33, 2023.
Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control, 2022.
Ho and Salimans [2022] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022.
Huang et al. [2024] Mingzhen Huang, Jialing Cai, Shan Jia, Vishnu Suresh Lokhande, and Siwei Lyu. Paralleledits: Efficient multi-aspect text-driven image editing with attention grouping, 2024.
Imagen-Team-Google [2024] Imagen-Team-Google. Imagen 3, 2024.
Jayasumana et al. [2024] Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit, Daniel Glasner, Ayan Chakrabarti, and Sanjiv Kumar. Rethinking fid: Towards a better evaluation metric for image generation, 2024.
Lei et al. [2024] Liangqi Lei, Keke Gai, Jing Yu, Liehuang Zhu, and Qi Wu. Conceptwm: A diffusion model watermark for concept protection, 2024.
Li et al. [2025] Maomao Li, Yu Li, Yunfei Liu, and Dong Xu. Exploring iterative manifold constraint for zero-shot image editing, 2025.
Li et al. [2021] Yue Li, Hongxia Wang, and Mauro Barni. A survey of deep neural network watermarking techniques, 2021.
Liang et al. [2024] Hanwen Liang, Yuyang Yin, Dejia Xu, Hanxue Liang, Zhangyang Wang, Konstantinos N. Plataniotis, Yao Zhao, and Yunchao Wei. Diffusion4d: Fast spatial-temporal consistent 4d generation via video diffusion models, 2024.
Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In ECCV, 2014.
Ma et al. [2020] Yuan Ma, Kewen Liu, Hongxia Xiong, Panpan Fang, Xiaojun Li, Yalei Chen, and Chaoyang Liu. Perception-oriented single image super-resolution via dual relativistic average generative adversarial networks, 2020.
Meng et al. [2022] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations, 2022.
Min et al. [2024] Rui Min, Sen Li, Hongyang Chen, and Minhao Cheng. A watermark-conditioned diffusion model for ip protection. arXiv preprint arXiv:2403.10893, 2024.
Nie et al. [2022] Weili Nie, Brandon Guo, Yujia Huang, Chaowei Xiao, Arash Vahdat, and Anima Anandkumar. Diffusion models for adversarial purification, 2022.
Nikolaidis and Pitas [2001] Athanasios Nikolaidis and Ioannis Pitas. Region-based image watermarking. IEEE Transactions on image processing, 10(11):1726–1740, 2001.
Nilsson and Akenine-Möller [2020] Jim Nilsson and Tomas Akenine-Möller. Understanding ssim, 2020.
Noti-Victor [2024] Jacob Noti-Victor. Regulating hidden ai authorship. 2024.
O et al. [2024] M. Mehdi Owrang O, Fariba Jafari Horestani, and Ginger Schwarz. Survival analysis of young triple-negative breast cancer patients, 2024.
Rezaei et al. [2024] Ahmad Rezaei, Mohammad Akbari, Saeed Ranjbar Alvar, Arezou Fatemi, and Yong Zhang. Lawa: Using latent space for in-generation image watermarking, 2024.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
Rout et al. [2024] Litu Rout, Yujia Chen, Nataniel Ruiz, Abhishek Kumar, Constantine Caramanis, Sanjay Shakkottai, and Wen-Sheng Chu. Rb-modulation: Training-free personalization of diffusion models using stochastic optimal control, 2024.
Sander et al. [2024a] Tom Sander, Pierre Fernandez, Alain Durmus, Teddy Furon, and Matthijs Douze. Watermark anything with localized messages. arXiv preprint arXiv:2411.07231, 2024a.
Sander et al. [2024b] Tom Sander, Pierre Fernandez, Alain Durmus, Teddy Furon, and Matthijs Douze. Watermark anything with localized messages, 2024b.
Smuha et al. [2021] Nathalie A Smuha, Emma Ahmed-Rengers, Adam Harkens, Wenlong Li, James MacLaren, Riccardo Piselli, and Karen Yeung. How the eu can achieve legally trustworthy ai: a response to the european commission’s proposal for an artificial intelligence act. Available at SSRN 3899991, 2021.
Song et al. [2022] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2022.
Tan et al. [2019] Wei Ren Tan, Chee Seng Chan, Hernan Aguirre, and Kiyoshi Tanaka. Improved artgan for conditional synthesis of natural image and artwork. IEEE Transactions on Image Processing, 28(1):394–409, 2019.
Wen et al. [2024] Yuxin Wen, John Kirchenbauer, Jonas Geiping, and Tom Goldstein. Tree-rings watermarks: Invisible fingerprints for diffusion images. Advances in Neural Information Processing Systems, 36, 2024.
Zhang et al. [2023] Yuxin Zhang, Nisha Huang, Fan Tang, Haibin Huang, Chongyang Ma, Weiming Dong, and Changsheng Xu. Inversion-based style transfer with diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10146–10156, 2023.
Zhao et al. [2024a] Chengxin Zhao, Hefei Ling, Sijing Xie, Han Fang, Yaokun Fang, and Nan Sun. Ssyncoa: Self-synchronizing object-aligned watermarking to resist cropping-paste attacks. arXiv preprint arXiv:2405.03458, 2024a.
Zhao et al. [2024b] Xuandong Zhao, Kexun Zhang, Zihao Su, Saastha Vasan, Ilya Grishchenko, Christopher Kruegel, Giovanni Vigna, Yu-Xiang Wang, and Lei Li. Invisible image watermarks are provably removable using generative ai, 2024b.
Zhou et al. [2023] Jinghao Zhou, Tomas Jakab, Philip Torr, and Christian Rupprecht. Scene-conditional 3d object stylization and composition, 2023.
Zhu et al. [2018] Jiren Zhu, Russell Kaplan, Justin Johnson, and Li Fei-Fei. Hidden: Hiding data with deep networks, 2018.

\thetitle

Supplementary Material

Appendix A Post-generation vs. In-generation Watermarking: Limitations and Advances

In post-generation watermarking, also known as post-hoc watermarking, watermarks are injected into images after generation [1]. This approach, while straightforward, incurs additional computational overhead and is vulnerable to circumvention. For example, in cases of model leakage, attackers can easily detect and bypass the postprocessing module [37]. Additionally, post-diffusion methods often result in poorer image quality, introducing artifacts, and their separable nature makes them easily removable in open-source models, such as by commenting out a single line of code in Stable Diffusion code base [8].

In contrast, in-generation watermarking integrates the watermarking process into the image generation pipeline, improving stealthiness and computational efficiency. This approach embeds watermarks directly into generated images without requiring separate post-hoc steps. It is also less susceptible to diffusion denoising or removal via simple modifications.

The closest related work is [23], which also embeds watermarking bits during generation. However, their method requires retraining a UNet layer, limiting its plug-and-play capability, as discussed in Section 2 and Figure 4. Our approach, by comparison, is more convenient, embedding watermarks directly within the text prompt as a watermarking token.

Focusing on object watermarking, our method differs from [39] and [32] in several key aspects. While these methods rely on segmentation maps for supervised watermarking of specific objects in pre-existing images, ours is unsupervised, leveraging attention maps generated automatically during text-to-image generation. Unlike post-hoc methods, which watermark only pre-existing images, our approach simultaneously performs text-to-image generation and watermarking. This enables compatibility with techniques like textual inversion and ensures robustness against attacks, overcoming significant limitations of post-hoc methods.

Method	Imperceptibility			Robustness to Basic Attacks (BA):
Method	PSNR $\uparrow$	SSIM $\uparrow$	FID $\downarrow$	None $\uparrow$	Bright. $\uparrow$	Contrast $\uparrow$	Blur $\uparrow$	JPEG $\uparrow$
Dct-Dwt	39.50	0.97	15.93	0.96	0.89	0.91	0.90	0.55
SSL Watermark	31.50	0.86	21.82	0.95	0.91	0.84	0.88	0.55
HiDDeN	31.57	0.88	22.67	0.99	0.93	0.88	0.80	0.88
Ours with [42]	37.92	0.95	15.83	0.99	0.99	0.97	0.96	0.96
Ours with [7]	40.92	0.97	14.83	0.99	0.98	0.99	0.97	0.96

Table 3: Performance comparison of our method with post generation watermarking methods. We evaluate imperceptibility and robustness against basic attacks. Higher PSNR and SSIM, and lower FID, indicate better imperceptibility. Higher robustness scores indicate greater resistance to attacks.

Appendix B Algorithm for Watermark Heatmap Generation

We present an algorithm for generating the heatmaps depicted in Figure 4 of the main paper. The process involves iterating a defined patch across the image to construct the heatmap. Details regarding the minimum allowable size of the patch can be found in Section 4.4. The patch represents the smallest region of the image that can reliably extract bits using a message detector with high confidence; for further explanation, refer to Section 4 in the main paper. Notably, the detector employed is an off-the-shelf solution. The resulting heatmap is normalized to a range of $[0,1]$ , where a value of $0$ indicates regions with $0\%$ bit accuracy, and a value of $1$ corresponds to regions achieving $100\%$ bit accuracy. Lastly, we apply Gaussian blur that preserves the edges and boundaries better than other uniform blurring filters, which is important for maintaining the structure of the heatmap.

Algorithm 2 Watermark Heatmap Generation

1:Input: Watermarked image

\mathbf{I}

; Watermark Detector

\mathcal{D}_{w}

; Ground Truth Key

m

2:Select a patch dimensions

h\times w

3:Divide

\mathbf{I}

into overlapping patches

\{P_{ij}\}

, where

P_{ij}

is the patch at location

(i,j)

of size

h\times w

4:Initialize a heatmap matrix

\mathbf{H}

of zeros with the same spatial resolution as

\mathbf{I}

5:for each patch

P_{ij}

\mathbf{I}

6: Extract watermark string from the patch:

m^{\prime}_{ij}=\mathcal{D}_{w}(P_{ij})

7: Compute the bit accuracy for the patch as (where

|m|

is the length of the bit string):

A_{ij}=\frac{1}{|m|}\sum_{k=1}^{|m|}[m_{k}=m^{\prime}_{ij,k}]

8: Assign the bit accuracy value to the center pixel of the patch in the heatmap

\mathcal{H}

\mathbf{H}(i,j)\leftarrow A_{ij}

9:end for

10:Normalize the heatmap values to the range

[0,1]

\mathbf{H}_{\text{norm}}=\frac{\mathbf{H}-\min(\mathbf{H})}{\max(\mathbf{H})-\min(\mathbf{H})}.

11:Apply a smoothing filter (Gaussian blur) to

\mathbf{H}_{\text{norm}}

\mathbf{H}_{\text{smooth}}=\text{GaussianBlur}(\mathbf{H}_{\text{norm}},\sigma).

12:Generate a heatmap image by overlaying

\mathbf{H}_{\text{smooth}}

on the original watermarked image

\mathbf{I}

13:Output: Heatmap image highlighting regions with watermark accuracy.

Appendix C Number of Bits v/s Bit Accuracy

We perform an ablation study on the number of bits that can be embedded into images by our method, without sacrificing on bit accuracy. Methods such as [4] can embed upto $100$ bits during watermarking. Extending the effort to push the benchmark for number of bits, we show our results on watermarking upto 128 bits.

Appendix D Comparison to methods that use TPR as evaluation metric

We report True Positive Rate of our watermark detection in comparison with baselines including [37, 16]. We threshold the FPR to 0.1% and report TPR:0.1%

Method	Imperceptibility			Robustness to Basic Attacks (TPR):
Method	PSNR $\uparrow$	SSIM $\uparrow$	FID $\downarrow$	None $\uparrow$	Bright. $\uparrow$	Contrast $\uparrow$	Blur $\uparrow$	JPEG $\uparrow$
TreeRings	33.75	0.91	18.93	1.00	0.91	0.90	0.92	0.89
ConceptWM	32.89	0.89	24.58	0.98	0.90	0.84	0.83	0.85
Ours	40.92	0.97	14.83	0.99	0.99	0.98	0.99	0.97

Table 4: Performance comparison of our method with methods that use TPR as evaluation metric. We evaluate imperceptibility and robustness against basic attacks. Higher PSNR and SSIM, and lower FID, indicate better imperceptibility. Higher robustness scores indicate greater resistance to attacks.

Appendix E Empirical study to find the optimal timestep for watermarking

As discussed in the ablation studies section of our paper, we perform empiracal study to find the optimal timestep $\tau^{*}$ that provides an balance between the trade off between watermark invisibility and bit accuracy.

Appendix F Details on Training and Inference time for watermarking

As discussed in our method adds $\tau^{*}$ timesteps of noise during the forward processing and performs denoising to train $\bm{\mathcal{W}_{*}}$ token embeddings. We use 2000 images from MS COCO dataset for training and the entire training process takes about $2$ GPU hours when benchmarked on NVIDIA RTX A6000 GPU.

During inference, the time overhead is minimal, the significant storage savings (a $10^{5}\times$ reduction in parameters) are noteworthy. Benchmarking on an RTX A6000, processing a single image (averaged over 1000 images) shows that for a T2I model, processing 50 diffusion steps takes 5.2 seconds without watermarking and 5.4 seconds with a minimal 0.2-second overhead due to the watermarking token.

Appendix G Attention Map of Watermarking Token

Our method appends a token $\mathcal{W}_{*}$ to the text prompt to integrate watermarking within the T2I generation. We utilize attention maps of each token in the text prompt to overlay the attention map of watermarking token $\mathcal{W}_{*}$ to control object-level watermarking. Here we provide the attention map of $\mathcal{W}_{*}$ token during T2I generation. We see that the difference image of watermarked and non-watermarked images is negligible. We amplify the difference image by $10$ and provide as the image in the last column.

Appendix H Choice of Latent loss for Trajectory Alignment

Our method uses Latent loss $\|z_{t}(\mathcal{W})-z_{t}^{*}\|^{2}$ (where $z_{t}(\mathcal{W})$ represent watermarked latents with $\mathcal{W}_{*}$ appended to the text prompt and $z_{t}^{*}$ denote non-watermarked latents with good image quality) for trajectory alignment during watermarking. In this section, we provide comparative of different loss functions to ensure trajectory alignment (preserve image quality) during watermarking while denoising. We see that in Fig. 9 loss maintains the best invisibility for watermarking. While diffusion loss performs better the case without any image loss, we observe that, for black-box watermarking, diffusion loss performs sub-par compared to the chosen latent loss that controls the trajectory of the latents.

Appendix I Robustness attacks implementation details

We follow standard implementation for simulation of various attacks for watermarking as mentioned in [8] and [7]. Crop 0.1 removes 10% from the image retaining the center. Resize 0.2 scales the image to 20% of its original size. Rotate 25 rotates the image by 25 degrees. Operations are performed using PIL and torchvision utilizing standard implementations from [8, 7].

Appendix J Medical image watermarking

We tested our method to watermark medical images where invisibility of watermark is critical. Our method performs invisible watermarking our medical images with a high PSNR of 35.89 on the TNBC-Seg [28] dataset while maintaining a bit accuracy of $\bf 0.99$ .

Appendix K Qualitative results compared to baselines

In this section, we provide additional qualitative results compared to different watermarking methods that perform in-generation watermarking. We consider Stable Signature [8], AquaLoRA [7], and RoSteALS [4] for comparing qualitative results.

Appendix L Heatmap detection based on user provided key

We present the ability of our watermark heatmap generation module’s ability to retrieve correct watermarked regions based on user key provided.

Original	AquaLoRA	Stable Signature	RoSteALS	Ours (Object)
Prompt: An old-fashioned [drink $\mathcal{W}_{*}$ ] next to a napkin


Prompt: A toy car in front of a [teddy bear $\mathcal{W}_{*}$ ]


Prompt: A [woman $\mathcal{W}_{*}$ ] running on a trail

Original	AquaLoRA	Stable Signature	RoSteALS	Ours (Object)
Prompt: The [Great Wall $\mathcal{W}_{*}$ ]


Prompt: Brown white and black white [guinea pigs $\mathcal{W}_{*}$ ] eating parsley handed to them


Prompt: A shiba inu wearing a beret and [black turtleneck $\mathcal{W}_{*}$ ]

Original	AquaLoRA	Stable Signature	RoSteALS	Ours (Object)
Prompt: A photo of a Ming Dynasty vase on a leather topped [table $\mathcal{W}_{*}$ ].


Prompt: A snowy [owl $\mathcal{W}_{*}$ ] standing in a grassy field


Prompt: Background pattern with alternating roses and [skulls $\mathcal{W}_{*}$ ]

Original	AquaLoRA	Stable Signature	RoSteALS	Ours (Object)
Prompt: A close-up of the keys of a [piano $\mathcal{W}_{*}$ ]


Prompt: A map of [Manhattan $\mathcal{W}_{*}$ ]


Prompt: A [sword $\mathcal{W}_{*}$ ] in a stone