
MEATRD: Multimodal Anomalous Tissue Region Detection
Enhanced with Spatial Transcriptomics

Kaichen Xu1\equalcontrib, Qilong Wu 1\equalcontrib, Yan Lu 1, Yinan Zheng 1, Wenlin Li 1, Xingjie Tang 1, Jun Wang 2, Xiaobo Sun 1 Corresponding author.
Abstract

The detection of anomalous tissue regions (ATRs) within affected tissues is crucial in clinical diagnosis and pathological studies. Conventional automated ATR detection methods, primarily based on histology images alone, falter in cases where ATRs and normal tissues have subtle visual differences. The recent spatial transcriptomics (ST) technology profiles gene expressions across tissue regions, offering a molecular perspective for detecting ATRs. However, there is a dearth of ATR detection methods that effectively harness complementary information from both histology images and ST. To address this gap, we propose MEATRD, a novel ATR detection method that integrates histology image and ST data. MEATRD is trained to reconstruct image patches and gene expression profiles of normal tissue spots (inliers) from their multimodal embeddings, followed by learning a one-class classification AD model based on latent multimodal reconstruction errors. This strategy harmonizes the strengths of reconstruction-based and one-class classification approaches. At the heart of MEATRD is an innovative masked graph dual-attention transformer (MGDAT) network, which not only facilitates cross-modality and cross-node information sharing but also addresses the model over-generalization issue commonly seen in reconstruction-based AD methods. Additionally, we demonstrate that modality-specific, task-relevant information is collated and condensed in multimodal bottleneck encoding generated in MGDAT, marking the first theoretical analysis of the informational properties of multimodal bottleneck encoding. Extensive evaluations across eight real ST datasets reveal MEATRD’s superior performance in ATR detection, surpassing various state-of-the-art AD methods. Remarkably, MEATRD also proves adept at discerning ATRs that only show slight visual deviations from normal tissues. Our code is available at https://github.com/wqlzuel/MEATRD.

Introduction

Detecting anomalous tissue regions (ATRs) within tissues from affected individuals is essential in clinical diagnostics, pathological studies, and targeted therapies (Srinidhi, Ciga, and Martel 2021). Traditionally, automated ATR detection, which typically applies computer vision techniques to histology images, e.g., whole-slide images (WSI) stained with hematoxylin and eosin (H&E) (Zingman et al. 2023), is a specialized task of image anomaly detection (AD). However, histology images, unlike natural images (e.g., those in the ImageNet dataset) (Bergmann et al. 2019), present unique challenges for AD due to their inherent high complexity (Zehnder et al. 2022), subtle differences between ATRs and normal tissues (Shenkar and Wolf 2022), the diverse manifestations of ATRs (Komura and Ishikawa 2018), and variability in staining quality (Zingman et al. 2023). These complexities demand information beyond visual cues for accurate ATR detection.

Spatial transcriptomics (ST) meets this need by providing spatial gene expression data. To date, a total of 1,033 publicly available human ST datasets span 56 diseases and 35 tissues, providing a rich resource for investigating ATRs at the molecular level (Wang et al. 2024). A typical ST dataset is structured as a matrix $\mathbf{X}\in\mathbb{R}^{N\times G}$, where $\mathbf{X}_{i,j}$ denotes the expression read counts of the $j$-th gene mapped to the $i$-th tissue spot. As illustrated in Figure 1, these spots, ranging in size from 10 to 200 $\mu$m depending on the sequencing technology, are spatially arranged in arrays to cover the entire tissue slice (Hu et al. 2023), thereby characterizing gene expression profiles across the tissue. This molecular-level data can significantly aid ATR detection, especially in cases where ATRs are visually similar to normal tissues (Hu et al. 2021). However, due to limitations inherent to sequencing technology, ST data suffer from severe noise and substantial missing values in gene expression measurements (Wang et al. 2022), compromising the precision with which tissue regions can be demarcated (Wang, Maletic-Savatic, and Liu 2022). Integrating histology images with ST data presents a promising solution to these challenges. As illustrated in our toy example in Figure 1, the blank spots in the ST dataset's spatial map, which represent tumor core locations with missing gene expression data, are visually identifiable in the accompanying histology image. Conversely, the tumor edge region, which may not be easily distinguishable from normal tissues visually, is detectable in the ST data. Therefore, the information from the two modalities can complement each other, greatly enhancing the precision of ATR detection. Fortunately, ST technologies like 10x Visium (Moses and Pachter 2022) provide accompanying histology images, allowing concurrent analysis of visual and genetic information for ATR detection.
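For readers who want to inspect such data directly, below is a minimal sketch (not part of MEATRD) of loading a 10x Visium dataset with the Scanpy library (Wolf, Angerer, and Theis 2018); the dataset path is a placeholder.

```python
import scanpy as sc

# Load a 10x Visium dataset (placeholder path); the result is an AnnData object
# whose .X matrix corresponds to the N x G expression matrix described above.
adata = sc.read_visium("path/to/visium_dataset")

n_spots, n_genes = adata.shape        # N tissue spots, G genes
coords = adata.obsm["spatial"]        # (N, 2) spot coordinates on the slide
# The accompanying H&E image is stored under adata.uns["spatial"]; the exact key
# depends on the dataset's library id.
print(n_spots, n_genes, coords.shape)
```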

Refer to caption
Figure 1: Detecting ATRs with histology images and ST data. ATRs include both tumor core and edge regions, as delineated by red and blue outlines in the histology image, respectively. The tumor edge region visually resembles the adjacent normal tissues. In the spatial map of the ST dataset, the ATRs encompass both red and blank spots, with blank spots indicating locations of missing gene expression data.

Given the rarity and unpredictable heterogeneity of anomalies, AD in images is often framed as an unsupervised learning problem, where anomalies are not known a priori. Models are trained exclusively on reference datasets comprising inliers to understand "normality" at training time and identify deviations from this norm as anomalies at inference time (Liu et al. 2023; Bergmann et al. 2019). Contemporary image AD methods, which use deep learning to learn initial representations of normal images (Shvetsova et al. 2021; Liu et al. 2023), often involve an encoder pre-trained on large natural image datasets (Deng and Li 2022; Roth et al. 2022). These representations are then used either to model the inlier distribution in latent space, as seen in one-class classification methods (Ruff et al. 2018), or to reconstruct inliers in reconstruction-based methods (Schlegl et al. 2019). Instances in the target dataset that exhibit low probability under the inlier distribution or larger-than-expected reconstruction errors are deemed anomalous.

Despite the successes of these methods in areas such as manufacturing defect inspection and financial fraud detection (Sohn et al. 2020), the unique challenges posed by ATR detection require more specialized methods (Riasatian et al. 2021; Tschuchnig and Gadermayr 2022). To meet this need, adaptations of image AD methods focus on representation learning and anomaly discrimination techniques. For example, image encoders pre-trained on natural images are replaced with those tailored for histology images, such as U-Net (Zehnder et al. 2022), DenseNet (Riasatian et al. 2021), and s2-AnoGAN (Pocevičiūtė, Eilertsen, and Lundström 2021). In addition, anomaly scoring is adapted to use perceptual loss instead of the pixel-wise reconstruction errors commonly used for natural images (Shvetsova et al. 2021; Zehnder et al. 2022). However, these methods may struggle when ATRs visually resemble normal tissues (Bejnordi et al. 2017). In contrast, ST differentiates tissue regions at the gene expression level (Hu et al. 2021; Dong and Zhang 2022), providing a remedy for ATR detection involving such complexities. Currently, Spatial-ID (Shen et al. 2022) is the only method that uses ST for ATR detection, employing a DNN classifier that assigns spots in the ST dataset to known regions while flagging those with uncertain assignments as anomalies. However, this classification-based approach can induce high false positive rates, as uncertainties in assignment could stem from similarities among normal tissues rather than the presence of ATRs (Li et al. 2022). Its sole reliance on ST data also makes it vulnerable to noise and dropouts in gene expression measurements, even for detecting visually recognizable ATRs.

In this study, we propose Multimodality Enhanced Anomalous Tissue Region Detection (MEATRD), the first method that integrates histology images and ST data for enhanced ATR detection. MEATRD conceptualizes tissue spots as nodes within an attributed graph, leveraging a reconstruction-based graph model for inlier node reconstruction from the dual perspectives of image and gene expression. During inference, the discrepancies between the reconstruction errors of inliers (i.e., normal tissues) and anomalies (i.e., ATRs) can be exploited by a discriminative model for accurate ATR detection. As shown in Figure 2, MEATRD involves three stages. Stage I focuses on extracting visual features of histology images. The histology image is segmented into patches centered around each spot, which are processed into imagery embeddings. Stage II aims to reconstruct the gene expression profiles and image patches of each spot from their fused embeddings, obtained using our innovative masked graph dual-attention transformer (MGDAT) network. MGDAT allows concurrent cross-node and cross-modal attention calculations, promoting efficient cross-modality information sharing and the incorporation of spatial relationships among spots. Additionally, to counter potential model over-generalization, a common pitfall of reconstruction-based methods where anomalies might also yield low reconstruction errors (Liu et al. 2023; Ristea et al. 2022), we employ a node-feature masking strategy, which forces the model to rely more on the surrounding context and cross-modal information. Stage III focuses on learning a one-class classification model to identify anomalies. Unlike existing one-class classification AD methods that use instance deep embeddings and are prone to reference-target domain shifts (Ouardini et al. 2019), our model pioneers the use of domain-shift-robust latent multimodal reconstruction losses (Donahue, Krähenbühl, and Darrell 2016; Schlegl et al. 2019) for more reliable anomaly detection. By collapsing inliers' reconstruction losses into a compact hypersphere, our model increases the reconstruction error discrepancy between inliers and anomalies, thereby further mitigating model over-generalization. In summary, our main contributions include:

  • We propose MEATRD, a pioneering multimodal method that integrates spatial transcriptomics with histology images for enhanced ATR detection.

  • MEATRD simultaneously addresses the over-generalization in reconstruction-based AD methods and the domain shift issue in one-class classification, leading to significant performance improvement.

  • We design an MGDAT network as the core component of MEATRD to facilitate cross-modality and cross-node information exchange while ameliorating model over-generalization. We also demonstrate the theoretical foundation for this information exchange, which is grounded in MGDAT’s ability to generate inclusive and condensed encoding of modality-specific, task-relevant information (supplementary material D).

  • Extensive benchmarks on eight breast cancer ST datasets demonstrate MEATRD’s superiority over nine state-of-the-art (SOTA) AD methods in accurately detecting ATRs, including those with subtle visual deviations from surrounding normal tissues.

Preliminary

Refer to caption
Figure 2: The workflow of MEATRD.
Definition D.1.

ST Dataset and Associated Histology Image. Let $\bm{X}\in\mathbb{R}^{N\times G}$ be an ST dataset, where $N$ is the number of tissue spots and $G$ is the number of genes. $S_N$ and $S_G$ denote the sets of spots and genes, respectively. $\bm{X}_{i,j}$ represents the read counts of gene $j$ at spot $i$, and $\bm{x}_i\in\mathbb{R}^{G}$ represents the gene expression profile at spot $i$. Let $\bm{P}\in\mathbb{R}^{H\times W\times C}$ be the associated histology image, where $H$, $W$, and $C$ are the height, width, and number of channels, respectively.

Definition D.2.

Graph Representation of ST Dataset and Histology Image. For a given ST dataset $\bm{X}$, the associated histology image $\bm{P}$ is segmented into $N$ patches, where $\bm{P}_i\in\mathbb{R}^{h\times w\times C}$ denotes the patch centered around spot $i\in S_N$, with height $h$ and width $w$. The spots are then modeled as nodes on an unweighted, attributed graph $G(S_N, A, \mathcal{\bm{Z}})$, where $A\in\{0,1\}^{N\times N}$ is the adjacency matrix and $\mathcal{\bm{Z}}\coloneqq[\mathcal{\bm{Z}}_{img}\,||\,\mathcal{\bm{Z}}_{gene}]$ is the node feature matrix. $\mathcal{\bm{Z}}_{img}\in\mathbb{R}^{N\times D}$ and $\mathcal{\bm{Z}}_{gene}\in\mathbb{R}^{N\times D}$ are the embeddings of image patches and gene expression profiles of spots, respectively. $A(i,j)=1$ if node $j\in n(i)$, where $n(i)$ is the set of $k$-nearest neighbors of node $i$, and $A(i,j)=0$ otherwise. $k$ is typically set to six due to the hexagonal arrangement of spots (Xu et al. 2024).
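As an illustration of Definition D.2, the adjacency matrix $A$ can be assembled from spot coordinates with a simple $k$-nearest-neighbor search; the following is a minimal sketch (the function name is ours, not from the paper).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_spot_graph(coords: np.ndarray, k: int = 6) -> np.ndarray:
    """Build the unweighted adjacency matrix A from (N, 2) spot coordinates.

    Each spot is connected to its k nearest neighbors; k = 6 matches the
    hexagonal arrangement of Visium spots."""
    n = coords.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(coords)  # k + 1: each spot is its own nearest neighbor
    _, idx = nbrs.kneighbors(coords)

    adj = np.zeros((n, n), dtype=np.uint8)
    for i in range(n):
        adj[i, idx[i, 1:]] = 1   # skip the self-neighbor in column 0
    return adj

# Toy example with 100 random spot locations.
coords = np.random.rand(100, 2)
A = build_spot_graph(coords, k=6)
print(A.sum(axis=1)[:5])  # each spot has 6 outgoing edges
```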

Definition D.3.

Problem Definition. Let $\mathcal{X}$ and $\mathcal{P}$ denote the target ST dataset and its associated histology image, respectively. Similarly, let $\bm{X}$ and $\bm{P}$ denote the reference ST dataset and its associated histology image. We define $y_i\in\{0,1\}$ as the label for spot $i$, where $y_i=1$ indicates an anomalous spot and $y_i=0$ otherwise. Note that $y_i=0,\ \forall i\in\bm{X}$, whereas $y_j\in\{0,1\},\ \forall j\in\mathcal{X}$. The task of ATR detection is defined as identifying the subset of anomalous spots within the target dataset, $\mathbb{S}=\{\mathcal{x}_i\,|\,y_{\mathcal{x}_i}=1,\ \mathcal{x}_i\in\mathcal{X}\}$, using a model trained exclusively on $\bm{X}$ and $\bm{P}$. (Related work is in supplementary material A due to space limitations.)

Method

As shown in Figure 2, the workflow of MEATRD comprises three stages: Stage I extracts visual features from the histology image patch of each spot; Stage II learns to reconstruct image patches and gene expression profiles from multimodal embeddings generated by an MGDAT network; Stage III trains an anomaly discriminator based on latent multimodal reconstruction errors.

Extracting Visual Features of Histology Image Patches (Stage I)

Initially, whole slide images are segmented into 32x32 patches centered around each spot in the ST dataset (Zong et al. 2022). The visual manifolds of these image patches are obtained using a Mobile-Unet, with an encoder consisting of downsampling convolutional layers and inverted residual blocks. Its decoder comprises upsampling deconvolutional layers and inverted residual blocks, connected to the encoder via shortcut connections.

This design not only inherits the merits of U-Net in extracting visual features from histology images but also boosts computational efficiency by reducing the model's parameters. Given a histology image patch $\bm{P}_i$ for spot $i\in S_N$, the Mobile-Unet is pretrained to reconstruct it as $\widehat{\bm{P}}_i$, with a pretraining loss that is a mix of a perceptual loss, based on the Structural Similarity Index (SSIM), and an $L_1$ reconstruction loss:

\widehat{\bm{P}}_{i} \coloneqq D_{1}(E_{1}(\bm{P}_{i})),\quad \bm{z}_{i}\in\mathbb{R}^{D} \coloneqq E_{1}(\bm{P}_{i})   (1)
\mathcal{L}_{perc} = -\mathrm{SSIM}(\bm{P}_{i},\widehat{\bm{P}}_{i}),\quad \mathcal{L}_{1} = ||\bm{P}_{i}-\widehat{\bm{P}}_{i}||_{1}   (2)
\mathrm{SSIM}(\bm{X},\bm{Y}) = \frac{(2\mu_{\bm{X}}\mu_{\bm{Y}}+C_{1})(2\sigma_{\bm{X},\bm{Y}}+C_{2})}{(\mu_{\bm{X}}^{2}+\mu_{\bm{Y}}^{2}+C_{1})(\sigma_{\bm{X}}^{2}+\sigma_{\bm{Y}}^{2}+C_{2})}   (3)
\mathcal{L}_{pretrain} = \mathcal{L}_{perc} + \mathcal{L}_{1}.   (4)

where $\mu_{*}$ and $\sigma_{*}^{2}$ are the average intensity and variance of $*\in\{\bm{X},\bm{Y}\}$, respectively, and $\sigma_{\bm{X},\bm{Y}}$ is their covariance. $C_1$ and $C_2$ are two constants that stabilize the division when the denominator is weak. $\mathrm{SSIM}$ and $\mathcal{L}_{1}$ measure the structural similarities and pixel-by-pixel discrepancies between the original and reconstructed images, respectively. The pretraining loss thus enhances the representation learning of complex histology images by accounting for both contextual integrity, via $\mathcal{L}_{perc}$, and local details, via $\mathcal{L}_{1}$ (Okada and Taniguchi 2021). After training, $E_1$ is used to yield image patch embeddings for each spot $i\in S_N$. Finally, unlike complex tissue images, which must first be converted into semantically meaningful representations, gene data have much clearer semantics. Therefore, MEATRD does not require a pretext representation learning stage for gene data. Instead, we use a two-layer MLP in Stage II to rasterize gene data before feeding them into the MGDAT blocks, where graph-based gene encoding takes place.
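For concreteness, the Stage I pretraining objective (Equations 1-4) can be sketched as follows. This is a minimal PyTorch example, not the released implementation: it assumes patch intensities in [0, 1], uses the common SSIM constants $C_1=0.01^2$ and $C_2=0.03^2$ (the paper does not specify their values), computes global rather than windowed SSIM statistics, and mean-reduces the $L_1$ term.

```python
import torch

def ssim(x: torch.Tensor, y: torch.Tensor, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Global SSIM between two image batches of shape (B, C, H, W), following Eq. (3)."""
    dims = (1, 2, 3)
    mu_x, mu_y = x.mean(dim=dims), y.mean(dim=dims)
    var_x, var_y = x.var(dim=dims, unbiased=False), y.var(dim=dims, unbiased=False)
    cov_xy = ((x - mu_x.view(-1, 1, 1, 1)) * (y - mu_y.view(-1, 1, 1, 1))).mean(dim=dims)
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

def pretrain_loss(p: torch.Tensor, p_hat: torch.Tensor) -> torch.Tensor:
    """L_pretrain = L_perc + L_1 = -SSIM(P, P_hat) + ||P - P_hat||_1 (Eqs. 2 and 4)."""
    return -ssim(p, p_hat) + (p - p_hat).abs().mean()

# Toy usage with random 32x32 RGB patches standing in for P_i and its reconstruction.
p, p_hat = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)
print(pretrain_loss(p, p_hat))
```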

Masked Graph Dual-Attention Transformer Network (Stage II)

To generate information-rich multimodal spot embeddings for reconstruction, we fuse histology image patches and gene expression profiles while incorporating contextual inter-dependencies among spots to reveal their biological characteristics. This is achieved by modeling spots as nodes in an attributed graph $G(V, A, \mathcal{\bm{Z}})$, as described in Definition D.2, on top of which node representations are learned using an innovative masked graph attention network, termed MGDAT. This network, comprising a series of MGDAT blocks, allows information sharing across both data modalities and graph nodes. Within each MGDAT block, nodes to be reconstructed are masked before aggregating the fused gene and imagery attributes of their neighboring nodes via an attention-based mechanism.

Formally, let $G_i(V_i, A_i, \mathcal{\bm{Z}}_i)$ denote the subgraph of a target node $i$ that covers up to its 3-hop neighbors, where $V_i$, $A_i$, and $\mathcal{\bm{Z}}_i$ denote the node set, adjacency matrix, and node attribute matrix of $G_i$, respectively. We set the number of hops to three because using more hops would result in over-smoothing, while fewer hops would significantly limit the information spread in the graph. $\bm{z}_i\in\mathbb{R}^{D}$ represents node $i$'s imagery attribute derived from Stage I, and $\bm{\zeta}_i\in\mathbb{R}^{D}$ represents node $i$'s gene attribute rasterized from its gene expression vector $\bm{x}_i$ using a two-layer MLP. For the target node $i$, $\bm{z}_i$ and $\bm{\zeta}_i$ are substituted with learnable mask tokens $\bm{z}_{[M]}\in\mathbb{R}^{D}$ and $\bm{\zeta}_{[M]}\in\mathbb{R}^{D}$.

This target-node masking serves to prevent leakage of the target node's own information into the embedding used for its reconstruction, thus alleviating the potential model over-generalization issue. $G_i$ is processed by the MGDAT network through its series of MGDAT blocks. For the $l$-th block, $l\in\{0,1,2\}$, the inputs are the embeddings of the image patches, $\mathcal{\bm{Z}}_{img,i}^{(l)}\in\mathbb{R}^{|V_i|\times D}$, and of the gene expression profiles, $\mathcal{\bm{Z}}_{gene,i}^{(l)}\in\mathbb{R}^{|V_i|\times D}$, of $V_i$. The initial embeddings are defined as $\mathcal{\bm{Z}}_{img,i}^{(0)}\coloneqq[\bm{z}_1, .., \bm{z}_{[M]}, .., \bm{z}_{V_i}]^{\top}$ and $\mathcal{\bm{Z}}_{gene,i}^{(0)}\coloneqq[\bm{\zeta}_1, .., \bm{\zeta}_{[M]}, .., \bm{\zeta}_{V_i}]^{\top}$. The $l$-th MGDAT block yields fused bottleneck embeddings $\mathcal{\bm{Z}}_{fb,i}^{(l)}\in\mathbb{R}^{|V_i|\times D^{\prime}}$, $D^{\prime}\ll D$, as follows:

\mathcal{\bm{Z}}_{fb,i}^{(l)} = \mathrm{Trm}\left([\mathcal{\bm{Z}}_{img,i}^{(l)} || \mathcal{\bm{Z}}_{gene,i}^{(l)}];\ W_{Q}^{(l)}, W_{K}^{(l)}, W_{V}^{(l)}\right)   (5)

where $\mathrm{Trm}$ denotes a Transformer, and $W_{Q}^{(l)}, W_{K}^{(l)}, W_{V}^{(l)}\in\mathbb{R}^{2D\times D^{\prime}}$ are the query, key, and value parameters, respectively. $\mathcal{\bm{Z}}_{fb,i}^{(l)}$ serves as a bottleneck that collates and condenses modality-specific, task-relevant information from the image and ST data (Nagrani et al. 2021), as theoretically demonstrated in supplementary material D. By concatenating $\mathcal{\bm{Z}}_{fb,i}^{(l)}$ with $\mathcal{\bm{Z}}_{img,i}^{(l)}$ and $\mathcal{\bm{Z}}_{gene,i}^{(l)}$, the two data modalities are bridged, facilitating access to their complementary task-relevant information. Next, multimodal information of $l$-hop neighbors is aggregated as follows:

h_{*,i}^{(l)} = [\mathcal{\bm{Z}}_{*,i}^{(l)} || \mathcal{\bm{Z}}_{fb,i}^{(l)}], \quad \text{where}\ * \in \{img, gene\},   (6)
\alpha_{*,ij}^{(l)} = \frac{\exp\left(w_{att}^{(l)}\,\sigma\left(W^{(l)}[h_{*,i}^{(l)} || h_{*,j}^{(l)}]\right)\right)}{\sum_{k\in\mathcal{N}_{i}}\exp\left(w_{att}^{(l)}\,\sigma\left(W^{(l)}[h_{*,i}^{(l)} || h_{*,k}^{(l)}]\right)\right)},   (7)
\mathcal{\bm{Z}}_{*,i}^{(l+1)} = \sigma\left(\sum_{j\in\mathcal{N}_{i}}\alpha_{*,ij}^{(l)} W^{(l)} h_{*,j}^{(l)}\right),   (8)

where $\sigma$ denotes LeakyReLU, and $w_{att}^{(l)}\in\mathbb{R}^{D}$ and $W^{(l)}\in\mathbb{R}^{D\times(D+D^{\prime})}$ denote the attention weight vector and regular weight matrix of the $l$-th MGDAT block, respectively.
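A simplified PyTorch sketch of a single MGDAT block is given below. It is not the authors' implementation: the bottleneck of Equation 5 is produced by one TransformerEncoderLayer applied to the concatenated modality embeddings, the neighborhood aggregation of Equations 6-8 follows the standard GAT parameterization (Veličković et al. 2018), so parameter shapes may differ from the released code, and target-node masking is assumed to have been applied to the inputs beforehand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGDATBlock(nn.Module):
    """One MGDAT block (sketch): bottleneck fusion (Eq. 5) followed by
    GAT-style neighborhood aggregation for each modality (Eqs. 6-8)."""

    def __init__(self, d: int, d_bottleneck: int, n_heads: int = 2):
        super().__init__()
        self.to_bottleneck = nn.Linear(2 * d, d_bottleneck)  # fuse [Z_img || Z_gene]
        self.trm = nn.TransformerEncoderLayer(d_model=d_bottleneck, nhead=n_heads, batch_first=True)
        # Per-modality GAT parameters: W projects [Z_* || Z_fb] to d dims, a scores node pairs.
        self.w = nn.ModuleDict({m: nn.Linear(d + d_bottleneck, d, bias=False) for m in ("img", "gene")})
        self.a = nn.ParameterDict({m: nn.Parameter(torch.randn(2 * d)) for m in ("img", "gene")})

    def forward(self, z_img, z_gene, adj):
        # z_img, z_gene: (n, d) node embeddings of the subgraph; adj: (n, n) 0/1 adjacency.
        z_fb = self.trm(self.to_bottleneck(torch.cat([z_img, z_gene], -1)).unsqueeze(0)).squeeze(0)  # Eq. 5
        out = {}
        for m, z in (("img", z_img), ("gene", z_gene)):
            h = self.w[m](torch.cat([z, z_fb], -1))                  # Eq. 6 plus projection
            n = h.size(0)
            pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                              h.unsqueeze(0).expand(n, n, -1)], -1)  # all [h_i || h_j] pairs
            scores = F.leaky_relu(pair @ self.a[m])                  # unnormalized attention (Eq. 7)
            scores = scores.masked_fill(adj == 0, float("-inf"))     # restrict to neighbors
            alpha = torch.softmax(scores, dim=-1)
            out[m] = F.leaky_relu(alpha @ h)                         # Eq. 8
        return out["img"], out["gene"], z_fb

# Toy usage on a fully connected 7-node subgraph.
A = torch.ones(7, 7); A.fill_diagonal_(0)
blk = MGDATBlock(d=32, d_bottleneck=16)
z_img_out, z_gene_out, z_fb = blk(torch.randn(7, 32), torch.randn(7, 32), A)
```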

The histology image patches of $V_i$ are reconstructed from the final image embeddings, $\mathcal{\bm{Z}}_{img,i}$, through a ResNet-based deconvolutional decoder $D_2$, while the gene expression profiles of $V_i$ are reconstructed from the final gene embeddings, $\mathcal{\bm{Z}}_{gene,i}$, through a single-layer GNN-based decoder $D_3$ (Hou et al. 2023):

\widetilde{\bm{P}}_{i} \coloneqq D_{2}(\mathcal{\bm{Z}}_{img,i}),\quad \widetilde{\bm{x}}_{i} \coloneqq D_{3}(\mathcal{\bm{Z}}_{gene,i})   (9)

The training loss of Stage II includes an image-level loss, identical in form to that defined in Equation 4, and a gene-level reconstruction loss measured in scaled cosine error:

\mathcal{L}_{rec} = \alpha\cdot\sum_{i}^{N}\left(-\mathrm{SSIM}(\bm{P}_{i},\widetilde{\bm{P}}_{i}) + ||\bm{P}_{i}-\widetilde{\bm{P}}_{i}||_{1}\right) + (1-\alpha)\cdot\sum_{i}^{N}\mathcal{L}_{SCE}(\bm{x}_{i},\widetilde{\bm{x}}_{i}),   (10)
\mathcal{L}_{SCE}(\bm{x}_{i},\widetilde{\bm{x}}_{i}) = \left(1-\frac{\bm{x}_{i}^{\top}\widetilde{\bm{x}}_{i}}{||\bm{x}_{i}||\cdot||\widetilde{\bm{x}}_{i}||}\right)^{\gamma},\quad \gamma\geq 1,   (11)

where $0<\alpha<1$ is the weight assigned to the image reconstruction loss and $\gamma$ is a scaling factor. The workflow of Stage II is illustrated in Figure 2 and Algorithm 1 in supplementary material C.
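The Stage II objective can then be assembled as in the sketch below, which reuses the ssim() helper from the Stage I example above; $\alpha=0.5$ and $\gamma=2$ are illustrative defaults (the paper only requires $0<\alpha<1$ and $\gamma\geq 1$), and both terms are mean-reduced rather than summed over spots.

```python
import torch
import torch.nn.functional as F

def sce_loss(x: torch.Tensor, x_rec: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Scaled cosine error between original and reconstructed gene profiles (Eq. 11)."""
    cos = F.cosine_similarity(x, x_rec, dim=-1)
    return ((1.0 - cos) ** gamma).mean()

def stage2_loss(p, p_rec, x, x_rec, alpha: float = 0.5, gamma: float = 2.0) -> torch.Tensor:
    """L_rec = alpha * (image SSIM + L1 terms) + (1 - alpha) * SCE term (Eq. 10)."""
    img_term = -ssim(p, p_rec) + (p - p_rec).abs().mean()  # ssim() from the Stage I sketch
    return alpha * img_term + (1.0 - alpha) * sce_loss(x, x_rec, gamma)
```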

Methods compared (table columns): MEATRD and M3DM (multimodal); SimpleNet, f-AnoGAN, and Patch SVDD (image-based); DOMINANT, PREM, Spatial-ID, scmap, and CAMLU (ST-based).

| Target Dataset | Metric | MEATRD | M3DM | SimpleNet | f-AnoGAN | Patch SVDD | DOMINANT | PREM | Spatial-ID | scmap | CAMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10x-hBC-A1 | AUC | **0.756±0.007** | 0.520±0.046 | 0.543±0.095 | *0.642±0.109* | 0.614±0.005 | 0.488±0.117 | 0.211±0.004 | 0.463±0.067 | 0.500±0.000 | 0.516±0.021 |
| 10x-hBC-A1 | F1 | 0.892±0.007 | 0.875±0.0013 | 0.875±0.011 | 0.892±0.017 | *0.892±0.005* | 0.885±0.017 | 0.865±0.000 | 0.870±0.004 | **0.934±0.000** | 0.376±0.383 |
| 10x-hBC-B1 | AUC | **0.920±0.028** | 0.505±0.029 | 0.554±0.135 | *0.736±0.144* | 0.442±0.025 | 0.698±0.077 | 0.288±0.006 | 0.195±0.083 | 0.504±0.000 | 0.667±0.160 |
| 10x-hBC-B1 | F1 | **0.841±0.022** | 0.210±0.027 | 0.302±0.127 | *0.568±0.176* | 0.225±0.025 | 0.352±0.143 | 0.073±0.008 | 0.067±0.064 | 0.354±0.000 | 0.428±0.365 |
| 10x-hBC-C1 | AUC | **0.715±0.017** | 0.540±0.034 | 0.501±0.099 | 0.485±0.035 | 0.401±0.0032 | 0.633±0.101 | 0.419±0.004 | 0.384±0.055 | 0.500±0.000 | *0.660±0.156* |
| 10x-hBC-C1 | F1 | **0.842±0.021** | 0.735±0.028 | 0.735±0.024 | 0.713±0.021 | 0.661±0.005 | 0.769±0.040 | 0.695±0.006 | 0.687±0.013 | *0.838±0.000* | 0.808±0.021 |
| 10x-hBC-D1 | AUC | **0.803±0.017** | 0.488±0.011 | 0.485±0.111 | 0.276±0.072 | 0.377±0.005 | 0.530±0.172 | 0.380±0.003 | 0.469±0.007 | 0.503±0.000 | *0.649±0.066* |
| 10x-hBC-D1 | F1 | **0.698±0.016** | 0.443±0.017 | 0.433±0.072 | 0.253±0.085 | 0.373±0.010 | 0.478±0.123 | 0.344±0.010 | 0.410±0.011 | *0.626±0.000* | 0.465±0.158 |
| 10x-hBC-E1 | AUC | **0.553±0.046** | *0.536±0.014* | 0.465±0.119 | 0.369±0.034 | 0.300±0.009 | 0.475±0.083 | 0.429±0.006 | 0.449±0.082 | 0.500±0.000 | 0.405±0.047 |
| 10x-hBC-E1 | F1 | **0.739±0.029** | 0.598±0.009 | 0.542±0.077 | 0.492±0.021 | 0.443±0.006 | 0.570±0.058 | 0.528±0.008 | 0.542±0.054 | *0.734±0.000* | 0.081±0.095 |
| 10x-hBC-F1 | AUC | **0.667±0.009** | 0.485±0.046 | 0.476±0.017 | 0.493±0.011 | 0.483±0.005 | 0.477±0.074 | 0.379±0.004 | 0.380±0.074 | *0.500±0.000* | 0.409±0.051 |
| 10x-hBC-F1 | F1 | *0.858±0.003* | 0.832±0.009 | 0.835±0.002 | 0.842±0.004 | 0.840±0.003 | 0.834±0.018 | 0.825±0.001 | 0.820±0.005 | **0.910±0.000** | 0.036±0.022 |
| 10x-hBC-G2 | AUC | **0.640±0.079** | 0.524±0.016 | 0.482±0.074 | 0.457±0.016 | 0.430±0.008 | *0.576±0.107* | 0.430±0.006 | 0.312±0.024 | 0.500±0.000 | 0.518±0.001 |
| 10x-hBC-G2 | F1 | **0.544±0.045** | 0.366±0.016 | 0.333±0.068 | 0.295±0.002 | 0.294±0.018 | 0.434±0.095 | 0.273±0.006 | 0.214±0.029 | *0.510±0.000* | 0.070±0.005 |
| 10x-hBC-H1 | AUC | **0.732±0.064** | 0.474±0.023 | 0.443±0.099 | *0.625±0.083* | 0.415±0.005 | 0.521±0.105 | 0.370±0.009 | 0.319±0.061 | 0.500±0.000 | 0.515±0.010 |
| 10x-hBC-H1 | F1 | **0.516±0.029** | 0.273±0.029 | 0.186±0.080 | 0.359±0.080 | 0.066±0.003 | 0.297±0.060 | 0.209±0.018 | 0.179±0.038 | *0.467±0.000* | 0.418±0.113 |
| Mean | AUC | **0.723** | 0.509 | 0.494 | 0.510 | 0.433 | *0.550* | 0.363 | 0.371 | 0.501 | 0.542 |
| Mean | F1 | **0.741** | 0.542 | 0.530 | 0.552 | 0.474 | 0.577 | 0.476 | 0.474 | *0.672* | 0.335 |
Table 1: Performance evaluation of anomalous tissue region detection across eight human breast cancer ST datasets. The table presents the results in terms of AUC and F1 scores, with each cell showing the average score from five independent runs and the corresponding standard deviation. The best score for each dataset is shown in bold, and the second-best in italics.

Latent Multimodal Reconstruction Loss-based Anomaly Discriminator (Stage III)

Following Stage II, the original and reconstructed image patches of any spot $i$ are processed by a ResNet to generate their respective latent manifolds, denoted as $e_{img,i}\coloneqq\mathrm{ResNet}(\bm{P}_i)$ and $\tilde{e}_{img,i}\coloneqq\mathrm{ResNet}(\widetilde{\bm{P}}_i)$, respectively. Here, we employ a lightweight ResNet as the encoder, since this stage focuses on calculating a latent loss rather than the more involved tissue image reconstruction task. Similarly, the manifolds of the original and reconstructed gene expression profiles of spot $i$ are generated by an MLP, denoted as $e_{gene,i}\coloneqq\mathrm{MLP}(\bm{x}_i)$ and $\tilde{e}_{gene,i}\coloneqq\mathrm{MLP}(\widetilde{\bm{x}}_i)$, respectively. Next, these manifolds are normalized, and a feed-forward network (FFN) maps their weighted averages to a latent space where the multimodal reconstruction error, $\ell_{rec,i}$, is calculated as follows:

\bm{Z}_{fused,i} = \mathrm{FFN}\left(\beta\cdot\frac{e_{img,i}}{||e_{img,i}||} + (1-\beta)\cdot\frac{e_{gene,i}}{||e_{gene,i}||}\right)   (12)
\widetilde{\bm{Z}}_{fused,i} = \mathrm{FFN}\left(\beta\cdot\frac{\tilde{e}_{img,i}}{||\tilde{e}_{img,i}||} + (1-\beta)\cdot\frac{\tilde{e}_{gene,i}}{||\tilde{e}_{gene,i}||}\right)   (13)
\ell_{rec,i} = \bm{Z}_{fused,i} - \widetilde{\bm{Z}}_{fused,i}   (14)

where $0<\beta<1$ represents the relative weight assigned to the histology image. We then train a one-class classifier to collapse the latent reconstruction errors of inliers into a compact hypersphere using the loss function:

\mathcal{L}_{occ} = \left\|\ell_{rec,i} - c\right\|^{2}   (15)

where $c=\frac{1}{N}\sum_{k=1}^{N}\ell_{rec,k}$. The training workflow of Stage III is also illustrated in Algorithm 2 of supplementary material C. At inference time, the anomaly score (AS) of a query spot $j$ is computed as the distance of its latent reconstruction loss to $c$:

AS_{j} \coloneqq \left\|\ell_{rec,j} - c\right\|^{2}   (16)

Given the observation that a gap exists between anomaly scores of inliers and anomalies (Figure 1 in supplementary material B), the AS threshold for discriminating inliers and anomalies is automatically determined using a Maximum A Posteriori-Expectation-Maximization (MAP-EM)-based mixture model, as detailed in supplementary material B.
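A minimal sketch of the Stage III scoring and thresholding is shown below; it is not the released implementation, and a two-component Gaussian mixture fitted with scikit-learn stands in for the MAP-EM mixture model detailed in supplementary material B.

```python
import numpy as np
import torch
from sklearn.mixture import GaussianMixture

def anomaly_scores(err_latent: torch.Tensor, center: torch.Tensor) -> torch.Tensor:
    """Anomaly score of each spot: squared distance of its latent multimodal
    reconstruction error to the inlier center c (Eqs. 15-16)."""
    return ((err_latent - center) ** 2).sum(dim=-1)

def flag_anomalies(scores: np.ndarray) -> np.ndarray:
    """Split scores with a two-component mixture model; the component with the
    larger mean is treated as the anomalous one."""
    gmm = GaussianMixture(n_components=2, random_state=0).fit(scores.reshape(-1, 1))
    anomalous = int(np.argmax(gmm.means_.ravel()))
    return gmm.predict(scores.reshape(-1, 1)) == anomalous

# Toy usage: inlier errors define the center c, target spots are then scored.
ref_err = torch.randn(500, 64)     # latent reconstruction errors of reference (inlier) spots
c = ref_err.mean(dim=0)            # c = mean of inlier reconstruction errors
tgt_err = torch.randn(200, 64)
scores = anomaly_scores(tgt_err, c).numpy()
print(flag_anomalies(scores).sum(), "spots flagged as anomalous")
```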

Refer to caption
Figure 3: Visualized detection results of tumor edge regions that visually resemble the adjacent normal tissues in the 10x-hBC-I1 dataset. The first row, from left to right, displays the original histology image, the image annotated with ground truth region labels, the image highlighting the tumor edge region (in red) and the adjacent healthy region (in blue), and the image annotated with ATRs identified by DOMINANT. The second row presents images annotated with ATRs identified by the respective methods. The performance of each method is also quantified using mean precision and recall scores over five independent runs. These metrics, along with their standard deviations, are displayed to the right of each method's panel.

Experiments

Experimental Settings

Datasets. MEATRD is extensively evaluated across eight breast cancer datasets and four primary sclerosing cholangitis (PSC) datasets (see supplementary material E for data description and preprocessing).

Baselines. We select nine SOTA image-based, ST-based, and multimodal AD methods as baselines. Image-based methods include two one-class classification methods, Patch SVDD (Yi and Yoon 2020) and SimpleNet (Liu et al. 2023), alongside a reconstruction-based method, f-AnoGAN (Schlegl et al. 2019). ST-based methods include scmap (Kiselev, Yiu, and Hemberg 2018), a classification-based method; CAMLU (Li et al. 2022), a reconstruction-based method; PREM (Pan et al. 2023), a discriminative graph method; DOMINANT (Ding et al. 2019), a generative graph method; and Spatial-ID (Shen et al. 2022), a classification-based method tailored for ST data. Additionally, M3DM (Wang et al. 2023) is chosen as a representative multimodal baseline.

Evaluation Protocols. AUC and F1 scores are used to evaluate the accuracy of ATR detection. For a fair comparison, the F1 score is calculated with the threshold matching the actual proportion of true anomalies (Shenkar and Wolf 2021). Reported metrics are averaged over five independent runs and accompanied by their standard deviations.
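For reference, this F1 protocol can be sketched as follows (a minimal example; the helper name is ours): the threshold is set so that the number of flagged spots equals the number of true anomalies.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def f1_at_true_proportion(y_true: np.ndarray, scores: np.ndarray) -> float:
    """F1 computed at the threshold that flags as many spots as there are true anomalies."""
    n_anoms = max(int(y_true.sum()), 1)
    thresh = np.sort(scores)[::-1][n_anoms - 1]   # score of the n_anoms-th highest spot
    y_pred = (scores >= thresh).astype(int)
    return f1_score(y_true, y_pred)

# Toy usage with random labels and scores.
y = np.random.binomial(1, 0.2, size=300)
s = np.random.rand(300)
print(roc_auc_score(y, s), f1_at_true_proportion(y, s))
```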

Anomalous Tissue Region Detection

In this experiment, as listed in supplementary material F, MEATRD is trained on eight human normal breast ST datasets (i.e., 10x-hNB-{v03-v10}) and tested on eight human breast cancer ST datasets (i.e., 10x-hBC-{A1-H1}). Table 1 showcases MEATRD's superiority over baselines in detecting ATRs across datasets, consistently ranking first in AUC scores and six times in F1 scores. It outperforms the second-best performing method with an average leap of 17.45% in AUC scores and 10.31% in F1 scores. Furthermore, Table 4 in supplementary material F indicates that our model also performs well on the PSC datasets, demonstrating its generalization to other disease types. Generally, methods that use ST data, for example, DOMINANT, scmap, and CAMLU, tend to outperform those that rely solely on histology images, indicating the pivotal role of gene expression information provided by ST data in aiding the detection of ATRs, especially those visually similar to normal tissues. Moreover, we find that DOMINANT, a graph-based AD method, prevails over other baselines, and that M3DM, a multimodal method that utilizes both image and ST data yet fails to account for spatial relationships among spots, does not perform as well as MEATRD. These observations emphasize the value of spatial contextual information in accurate ATR detection.

Discovering Anomalous Tissue Regions Visually Similar to Normal Tissues

To evaluate the efficacy of MEATRD in detecting ATRs with minimal visual distinctions from normal tissues, we conduct a comparative analysis on the 10x-hBC-I1 ST dataset, which encompasses a tumor edge region that visually blends with the adjacent normal tissues, as indicated in red in the annotated histology image in Figure 3. Our analysis includes: the standard MEATRD implementation (MEATRD-standard); MEATRD-$\beta$, a variant that downplays the influence of the histology image by decreasing $\beta$ from 0.5 to 0.1 in Equations 12 and 13; DOMINANT, the top-performing ST-based baseline from the previous section; and two leading image-based AD methods, f-AnoGAN and SimpleNet. The results, visually presented in Figure 3, demonstrate that MEATRD-$\beta$ more accurately identifies spots within the tumor edge region as anomalous compared to the other competing methods. This finding is quantitatively supported by its highest precision (0.693) and recall (0.895) scores. The observation that MEATRD-standard, MEATRD-$\beta$, and DOMINANT prevail over the image-based AD methods underscores the value of using ST data for pinpointing pathogenic tissue regions that visually resemble normal tissues. Furthermore, DOMINANT's marginal performance edge over MEATRD-standard suggests that, in this specific context, the histology image contributes very limited additional information. Indeed, MEATRD-$\beta$, which places greater emphasis on ST data, improves on MEATRD-standard by 3.1% in precision and 34.4% in recall. Nonetheless, for scenarios involving low-quality ST data and visually traceable ATRs, incorporating visual cues from histology images is undoubtedly beneficial, as established in our prior analysis and ablation study.

(a)
| case | AUC | F1 |
|---|---|---|
| 0.9 | 0.678 | 0.696 |
| 0.5* | 0.723 | 0.741 |
| 0.1 | 0.709 | 0.725 |

(b)
| case | AUC | F1 |
|---|---|---|
| 0.9 | 0.654 | 0.668 |
| 0.5* | 0.723 | 0.741 |
| 0.1 | 0.699 | 0.718 |

(c)
| dim | AUC | F1 |
|---|---|---|
| 128 | 0.705 | 0.726 |
| 256* | 0.723 | 0.741 |
| 512 | 0.721 | 0.735 |

(d)
| dim | AUC | F1 |
|---|---|---|
| 16* | 0.723 | 0.741 |
| 64 | 0.715 | 0.728 |
| 256 | 0.682 | 0.711 |

(e)
| dim | AUC | F1 |
|---|---|---|
| 64 | 0.606 | 0.623 |
| 128 | 0.720 | 0.732 |
| 256* | 0.723 | 0.741 |

(f)
| blocks | AUC | F1 |
|---|---|---|
| 2 | 0.694 | 0.719 |
| 3* | 0.723 | 0.741 |
| 4 | 0.533 | 0.565 |

(g)
| case | AUC | F1 |
|---|---|---|
| 1 | 0.718 | 0.730 |
| 2* | 0.723 | 0.741 |
| 4 | 0.721 | 0.737 |
Table 2: Sensitivity analysis of hyperparameters in MEATRD across eight human breast cancer datasets. Default settings are marked with an asterisk (*).

Ablation Studies

We conduct ablation studies over the eight human breast cancer ST datasets (i.e., 10x-hBC-{A1-H1}) to investigate the effects of MEATRD's key components on ATR detection. These components include: using multiple data modalities; multimodal data fusion using the fused bottleneck embedding; masking for target node reconstruction; the latent multimodal reconstruction losses fed to the one-class classifier in Stage III; enlarging the anomaly score discrepancy between inliers and anomalies using a one-class classifier; and using Mobile-Unet as the pretraining backbone in Stage I. The results, detailed in the Ablation Studies section of supplementary material F, demonstrate that removing any of these components leads to suboptimal performance, owing to inefficient use of cross-modal complementary information, less effective mitigation of model over-generalization, and increased sensitivity to reference-target domain shifts.

| Metric | ST only | Image only | w/o MGDAT | w/o TNM | w/o RE | w/o OC | Full |
|---|---|---|---|---|---|---|---|
| AUC | 0.631 | 0.497 | 0.639 | 0.655 | 0.642 | 0.584 | 0.723 |
| F1 | 0.667 | 0.544 | 0.689 | 0.699 | 0.685 | 0.631 | 0.741 |
Table 3: Ablation study of key components in MEATRD across eight human breast cancer datasets. Method performance is gauged through average AUC and F1 scores. "Full" represents the complete MEATRD model. "ST only" and "Image only" utilize only ST data or histology images, respectively. "w/o MGDAT" omits the MGDAT block. "w/o TNM" omits the target-node-masking technique. "w/o RE" substitutes the latent multimodal reconstruction errors with direct spot embeddings as input to the discriminative model in Stage III. "w/o OC" discards the entire Stage III and utilizes spot reconstruction errors as anomaly scores for ATR detection.

Sensitivity Analysis

Here, we conduct sensitivity analyses on the eight 10x-hBC datasets to examine the effects of MEATRD's key hyperparameters, including $\alpha$ and $\beta$, which control the relative weights between the gene and image modalities in Stages II and III; the dimensions of the visual and gene embeddings from Stage I, the bottleneck embedding in Stage II, and the inputs to the one-class classifier in Stage III; and the number of MGDAT layers and attention heads. The effect of these hyperparameters on MEATRD's performance, measured by AUC and F1 scores, is presented in Table 2. Detailed results are provided in supplementary material F.3.

Complexity Analysis

We analyze the model complexity of MEATRD across its three stages by evaluating the number of parameters, computational performance (MFlops), time complexity, training time, and inference time. We also compare these metrics with the nine baseline methods. Detailed results are provided in supplementary material F.4. In summary, MEATRD is scalable to the number of spots and edges (proportional to the number of spots due to the adjacency matrix setting) and demonstrates good efficiency in our experiments.

Conclusion

In this paper, we propose MEATRD, a pioneering method that integrates histology images and ST data to enhance ATR detection at both the visual and molecular levels. MEATRD treats tissue spots as nodes in an attributed graph to embed spatial relationships into their representations. The MGDAT network, a key innovation of MEATRD, facilitates effective cross-node and cross-modality information exchange, enabling comprehensive graph representation learning. MEATRD harmonizes one-class classification with reconstruction-based AD, simultaneously addressing the challenges of reference-target domain shift and model over-generalization. Rigorous evaluations on a suite of real ST datasets have demonstrated MEATRD's superiority over various SOTA AD methods in detecting ATRs, including those that are visually akin to contextual normal tissues. Furthermore, MEATRD offers a framework generalizable to other multimodal AD tasks involving compatible imagery and graph data modalities.

Acknowledgments

The project is funded by the Excellent Young Scientist Fund of Wuhan City (Grant No. 21129040740) to X.S.

References

  • Andersson et al. (2021) Andersson, A.; Larsson, L.; Stenbeck, L.; et al. 2021. Spatial deconvolution of HER2-positive breast cancer delineates tumor-associated cell type interactions. Nat Commun, 12: 6012.
  • Arjovsky, Chintala, and Bottou (2017) Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In International conference on machine learning, 214–223. PMLR.
  • Bejnordi et al. (2017) Bejnordi, B. E.; Veta, M.; Van Diest, P. J.; Van Ginneken, B.; Karssemeijer, N.; Litjens, G.; Van Der Laak, J. A.; Hermsen, M.; Manson, Q. F.; Balkenhol, M.; et al. 2017. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama, 318(22): 2199–2210.
  • Bergmann et al. (2019) Bergmann, P.; Fauser, M.; Sattlegger, D.; and Steger, C. 2019. MVTec AD – A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Cover (1999) Cover, T. M. 1999. Elements of information theory. John Wiley & Sons.
  • Deng and Li (2022) Deng, H.; and Li, X. 2022. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9737–9746.
  • Ding et al. (2019) Ding, K.; Li, J.; Bhanushali, R.; and Liu, H. 2019. Deep anomaly detection on attributed networks. In Proceedings of the 2019 SIAM International Conference on Data Mining, 594–602. SIAM.
  • Donahue, Krähenbühl, and Darrell (2016) Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. arXiv preprint arXiv:1605.09782.
  • Dong and Zhang (2022) Dong, K.; and Zhang, S. 2022. Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder. Nature communications, 13(1): 1739.
  • He et al. (2020) He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9729–9738.
  • He and Sun (2015) He, K.; and Sun, J. 2015. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5353–5360.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Hou et al. (2023) Hou, Z.; He, Y.; Cen, Y.; Liu, X.; Dong, Y.; Kharlamov, E.; and Tang, J. 2023. GraphMAE2: A Decoding-Enhanced Masked Self-Supervised Graph Learner. In Proceedings of the ACM Web Conference 2023, 737–746.
  • Hu et al. (2023) Hu, J.; Coleman, K.; Zhang, D.; Lee, E. B.; Kadara, H.; Wang, L.; and Li, M. 2023. Deciphering tumor ecosystems at super resolution from spatial transcriptomics with TESLA. Cell systems, 14(5): 404–417.
  • Hu et al. (2021) Hu, J.; Li, X.; Coleman, K.; Schroeder, A.; Ma, N.; Irwin, D. J.; Lee, E. B.; Shinohara, R. T.; and Li, M. 2021. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nature methods, 18(11): 1342–1351.
  • Kipf and Welling (2016) Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • Kiselev, Yiu, and Hemberg (2018) Kiselev, V. Y.; Yiu, A.; and Hemberg, M. 2018. scmap: projection of single-cell RNA-seq data across data sets. Nature methods, 15(5): 359–362.
  • Komura and Ishikawa (2018) Komura, D.; and Ishikawa, S. 2018. Machine learning methods for histopathological image analysis. Computational and structural biotechnology journal, 16: 34–42.
  • Kumar et al. (2023) Kumar, G.; Pandurengan, R. K.; Parra, E. R.; Kannan, K.; and Haymaker, C. 2023. Spatial modelling of the tumor microenvironment from multiplex immunofluorescence images: methods and applications. Frontiers in immunology, 14: 1288802.
  • Li et al. (2022) Li, Z.; Wang, Y.; Ganan-Gomez, I.; Colla, S.; and Do, K.-A. 2022. A machine learning-based method for automatically identifying novel cells in annotating single-cell RNA-seq data. Bioinformatics, 38(21): 4885–4892.
  • Liu et al. (2023) Liu, Z.; Zhou, Y.; Xu, Y.; and Wang, Z. 2023. Simplenet: A simple network for image anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20402–20411.
  • Moses and Pachter (2022) Moses, L.; and Pachter, L. 2022. Museum of spatial transcriptomics. Nat Methods, 19: 534–546.
  • Nagrani et al. (2021) Nagrani, A.; Yang, S.; Arnab, A.; Jansen, A.; Schmid, C.; and Sun, C. 2021. Attention bottlenecks for multimodal fusion. Advances in Neural Information Processing Systems, 34: 14200–14213.
  • Okada and Taniguchi (2021) Okada, M.; and Taniguchi, T. 2021. Dreaming: Model-based reinforcement learning by latent imagination without reconstruction. In 2021 ieee international conference on robotics and automation (icra), 4209–4215. IEEE.
  • Ouardini et al. (2019) Ouardini, K.; Yang, H.; Unnikrishnan, B.; Romain, M.; Garcin, C.; Zenati, H.; Campbell, J. P.; Chiang, M. F.; Kalpathy-Cramer, J.; Chandrasekhar, V.; et al. 2019. Towards practical unsupervised anomaly detection on retinal images. In Domain Adaptation and Representation Transfer and Medical Image Learning with Less Labels and Imperfect Data: First MICCAI Workshop, DART 2019, and First International Workshop, MIL3ID 2019, Shenzhen, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13 and 17, 2019, Proceedings 1, 225–234. Springer.
  • Pan et al. (2023) Pan, J.; Liu, Y.; Zheng, Y.; and Pan, S. 2023. PREM: A Simple Yet Effective Approach for Node-Level Graph Anomaly Detection. arXiv preprint arXiv:2310.11676.
  • Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
  • Pocevičiūtė, Eilertsen, and Lundström (2021) Pocevičiūtė, M.; Eilertsen, G.; and Lundström, C. 2021. Unsupervised anomaly detection in digital pathology using GANs. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), 1878–1882. IEEE.
  • Riasatian et al. (2021) Riasatian, A.; Babaie, M.; Maleki, D.; Kalra, S.; Valipour, M.; Hemati, S.; Zaveri, M.; Safarpoor, A.; Shafiei, S.; Afshari, M.; et al. 2021. Fine-tuning and training of densenet for histopathology image representation using tcga diagnostic slides. Medical Image Analysis, 70: 102032.
  • Ristea et al. (2022) Ristea, N.-C.; Madan, N.; Ionescu, R. T.; Nasrollahi, K.; Khan, F. S.; Moeslund, T. B.; and Shah, M. 2022. Self-supervised predictive convolutional attentive block for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 13576–13586.
  • Roth et al. (2022) Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; and Gehler, P. 2022. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14318–14328.
  • Rudolph et al. (2023) Rudolph, M.; Wehrbein, T.; Rosenhahn, B.; and Wandt, B. 2023. Asymmetric student-teacher networks for industrial anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2592–2602.
  • Ruff et al. (2018) Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S. A.; Binder, A.; Müller, E.; and Kloft, M. 2018. Deep one-class classification. In International conference on Machine Learning, 4393–4402. PMLR.
  • Sabour, Frosst, and Hinton (2017) Sabour, S.; Frosst, N.; and Hinton, G. E. 2017. Dynamic routing between capsules. Advances in neural information processing systems, 30.
  • Schlegl et al. (2019) Schlegl, T.; Seeböck, P.; Waldstein, S. M.; Langs, G.; and Schmidt-Erfurth, U. 2019. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis, 54: 30–44.
  • Shen et al. (2022) Shen, R.; Liu, L.; Wu, Z.; Zhang, Y.; Yuan, Z.; Guo, J.; Yang, F.; Zhang, C.; Chen, B.; Feng, W.; et al. 2022. Spatial-ID: a cell typing method for spatially resolved transcriptomics via transfer learning and spatial embedding. Nature communications, 13(1): 7640.
  • Shenkar and Wolf (2021) Shenkar, T.; and Wolf, L. 2021. Anomaly detection for tabular data with internal contrastive learning. In International Conference on Learning Representations.
  • Shenkar and Wolf (2022) Shenkar, T.; and Wolf, L. 2022. Anomaly detection for tabular data with internal contrastive learning. In International conference on learning representations.
  • Shvetsova et al. (2021) Shvetsova, N.; Bakker, B.; Fedulova, I.; Schulz, H.; and Dylov, D. V. 2021. Anomaly detection in medical imaging with deep perceptual autoencoders. IEEE Access, 9: 118571–118583.
  • Simonyan and Zisserman (2014) Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Sohn et al. (2020) Sohn, K.; Li, C.-L.; Yoon, J.; Jin, M.; and Pfister, T. 2020. Learning and evaluating representations for deep one-class classification. arXiv preprint arXiv:2011.02578.
  • Srinidhi, Ciga, and Martel (2021) Srinidhi, C. L.; Ciga, O.; and Martel, A. L. 2021. Deep neural network models for computational histopathology: A survey. Medical image analysis, 67: 101813.
  • Tian et al. (2020) Tian, Y.; Sun, C.; Poole, B.; Krishnan, D.; Schmid, C.; and Isola, P. 2020. What makes for good views for contrastive learning? Advances in neural information processing systems, 33: 6827–6839.
  • Tishby, Pereira, and Bialek (2000) Tishby, N.; Pereira, F. C.; and Bialek, W. 2000. The information bottleneck method. arXiv preprint physics/0004057.
  • Tschuchnig and Gadermayr (2022) Tschuchnig, M. E.; and Gadermayr, M. 2022. Anomaly detection in medical imaging-a mini review. In Data Science–Analytics and Applications: Proceedings of the 4th International Data Science Conference–iDSC2021, 33–38. Springer.
  • Veličković et al. (2018) Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph Attention Networks. In International Conference on Learning Representations.
  • Wang et al. (2024) Wang, G.; Wu, S.; Xiong, Z.; Qu, H.; Fang, X.; and Bao, Y. 2024. CROST: a comprehensive repository of spatial transcriptomics. Nucleic Acids Research, 52(D1): D882–D890.
  • Wang, Maletic-Savatic, and Liu (2022) Wang, L.; Maletic-Savatic, M.; and Liu, Z. 2022. Region-specific denoising identifies spatial co-expression patterns and intra-tissue heterogeneity in spatially resolved transcriptomics data. Nature Communications, 13(1): 6912.
  • Wang et al. (2023) Wang, Y.; Peng, J.; Zhang, J.; Yi, R.; Wang, Y.; and Wang, C. 2023. Multimodal Industrial Anomaly Detection via Hybrid Fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8032–8041.
  • Wang et al. (2022) Wang, Y.; Song, B.; Wang, S.; Chen, M.; Xie, Y.; Xiao, G.; Wang, L.; and Wang, T. 2022. Sprod for de-noising spatially resolved transcriptomics data based on position and image information. Nature methods, 19(8): 950–958.
  • Wolf, Angerer, and Theis (2018) Wolf, F. A.; Angerer, P.; and Theis, F. J. 2018. SCANPY: large-scale single-cell gene expression data analysis. Genome biology, 19: 1–5.
  • Xu et al. (2024) Xu, K.; Lu, Y.; Hou, S.; Liu, K.; Du, Y.; Huang, M.; Feng, H.; Wu, H.; and Sun, X. 2024. Detecting anomalous anatomic regions in spatial transcriptomics with STANDS. Nature Communications, 15(1): 8223.
  • Yi and Yoon (2020) Yi, J.; and Yoon, S. 2020. Patch svdd: Patch-level svdd for anomaly detection and segmentation. In Proceedings of the Asian conference on computer vision.
  • Zehnder et al. (2022) Zehnder, P.; Feng, J.; Fuji, R. N.; Sullivan, R.; and Hu, F. 2022. Multiscale generative model using regularized skip-connections and perceptual loss for anomaly detection in toxicologic histopathology. Journal of Pathology Informatics, 13: 100102.
  • Zingman et al. (2023) Zingman, I.; Stierstorfer, B.; Lempp, C.; and Heinemann, F. 2023. Learning image representations for anomaly detection: application to discovery of histological alterations in drug development. Medical Image Analysis, 103067.
  • Zong et al. (2022) Zong, Y.; Yu, T.; Wang, X.; Wang, Y.; Hu, Z.; and Li, Y. 2022. conST: an interpretable multi-modal contrastive learning framework for spatial transcriptomics. bioRxiv, 2022–01.

Supplementary Material

Appendix A Related Work

Localized Anomaly Detection in Image

Related works in this field can be broadly divided into two categories: one-class classification methods and reconstruction-based methods. The former aims to delineate normal data distributions and boundaries in a latent space at training time, labeling instances occurring in low-probability density regions (i.e., falling outside the boundary) as anomalies at test time (Shvetsova et al. 2021). For example, Patch SVDD (Yi and Yoon 2020) assesses anomalies according to their proximity to the nearest inlier in a latent space that is learned by minimizing distances between nearby inliers. Another example, SimpleNet (Liu et al. 2023) creates pseudo-anomalies by introducing random noises to extracted visual features of inliers, and trains a separating hyperplane-based discriminator for anomaly differentiation. The main limitation of these methods is their dependency on effective representation learning (Sohn et al. 2020), which may be compromised by batch effects between reference and target datasets (Ouardini et al. 2019).

On the other hand, reconstruction-based methods, trained on normal data only, posit that inliers can be reconstructed more faithfully from their latent manifolds than anomalies. For instance, f-AnoGAN (Schlegl et al. 2019), a WGAN (Arjovsky, Chintala, and Bottou 2017)-based method for AD in medical images, employs a discriminator-guided encoder to obtain reconstruction residuals as anomaly scores. While theoretically more robust to batch effects, since anomalies are identified based on reconstruction errors within the same batch, these methods may suffer from model over-generalization, yielding small reconstruction errors even for anomalies (Liu et al. 2023; Ristea et al. 2022). Overall, methods for localized AD in images often overlook the contextual surroundings (Sabour, Frosst, and Hinton 2017; Ristea et al. 2022), although some, such as SSPCAB (Ristea et al. 2022) and PatchCore (Roth et al. 2022), attempt to aggregate information from neighboring patches through simplified means such as adaptive average pooling. In contrast, our method, by virtue of the MGDAT network, can comprehensively harness contextual information for improved AD.

Anomaly Detection using Gene Expression Data

Tissue spots in ST closely resemble single cells in single-cell RNA sequencing (scRNA-seq), augmented with spatial location information. This similarity offers an opportunity to apply anomalous cell (AC) detection methods to ATR detection in ST. Traditional AC detection methods treat scRNA-seq data as tabular, identifying ACs through cell type classification tasks. For example, scmap (Kiselev, Yiu, and Hemberg 2018) computes gene expression similarities between query cells and centroids of known cell types, designating those below a threshold as anomalies. Such classification-based methods depend heavily on labeled references, often a scarce and costly resource. To circumvent this limitation, CAMLU (Li et al. 2022), a reconstruction-based method utilizing unlabeled reference data only, identifies ACs in the target dataset using informative genes selected as per their reconstruction deviations. However, these methods neglect spatial information inherent to ST data, which is crucial for accurate ATR detection. To bridge this gap, specialized methods have been developed, typically leveraging graph neural networks (GNN) to incorporate spatial relationships among spots (Hu et al. 2021; Dong and Zhang 2022). Among these, to the best of our knowledge, Spatial-ID (Shen et al. 2022) is currently the sole method capable of identifying ATRs by utilizing a classifier, pre-trained on labeled scRNA-seq data, to classify spots based on their latent manifolds generated via a variational graph autoencoder. Spots with uncertain soft assignments are labeled as anomalies. However, Spatial-ID, like many other gene-oriented AD methods, is prone to high false positive rates due to its reliance on assignment uncertainties, often arising from inlier similarities rather than genuine anomalies (Li et al. 2022).

An alternative strategy, bypassing the classification framework, involves modeling ST data as an attributed graph and applying node-level graph anomaly detection (GAD) methods, which can be generative or discriminative (Pan et al. 2023). For instance, PREM (Pan et al. 2023) determines anomalous nodes based on their anomaly scores calculated as the dissimilarity between ego and neighbor node embeddings, which are generated through graph contrastive learning. DOMINANT (Ding et al. 2019), a generative GAD method, leverages a graph convolutional network (GCN) (Kipf and Welling 2016) to reconstruct both nodal attributes and topological structure, using combined reconstruction errors as anomaly scores. Generally, all methods discussed in this section are limited by their heavy dependence on the quality of ST data and by their failure to exploit the visual information available in histology images to improve the accuracy of ATR detection.

Multimodal Anomaly Detection

To date, the development of multimodal AD methods has been predominantly focused on industrial AD scenarios involving the simultaneous use of 2D and 3D data. Recent methods in this field include M3DM (Wang et al. 2023) and AST (Rudolph et al. 2023). M3DM adopts a contrastive learning-based approach to fuse manifolds of segmented patches from 3D point clouds and RGB images, based on which a discriminative model is trained for anomaly decisions. AST concatenates features extracted from RGB images and 3D depth maps, serving as inputs to asymmetric student and teacher networks that determine anomaly scores as per their output discrepancies. However, there is a significant gap in developing multimodal ATR detection methods that combine gene expression data and histology images.

Appendix B Determining anomaly score threshold

Based on the observation that there is a gap between the anomaly scores of inliers and those of true anomalies, as shown in Figure 5, we designed a two-component mixture model to automatically determine the anomaly score threshold that discriminates between inliers and anomalies. Specifically, the distribution of anomaly scores is modeled as a univariate Gaussian Mixture Model (GMM) with two components corresponding to anomalous and normal instances, respectively. We specify the prior for anomaly abundance as a Beta distribution and the priors for the mean and variance of the two Gaussian components as a Normal Inverse Chi-squared (NIX) distribution. The parameters of these priors are estimated from the inlier anomaly scores in the reference dataset. Utilizing the Maximum A Posteriori (MAP)-EM algorithm, we infer the parameters of both Gaussian components and then assign spots to either the normal or the anomalous group based on their probabilities under each component. Specifically, let \Theta=\left\{\pi,\mu_{k},\sigma_{k}^{2},\forall k\in\left\{1,2\right\}\right\} represent the GMM parameters, where \pi\in\left[0,1\right] represents the proportion of anomalies, and \mu_{k},\sigma_{k}^{2} represent the mean and variance of the k-th component, respectively, with the constraint that \mu_{1}>\mu_{2}. Then, the probability density function of an anomaly score d_{i} can be formulated as:

P(di|Θ)=π𝒩(di|μ1,σ12)+(1π)𝒩(di|μ2,σ22)P\left(d_{i}\middle|\Theta\right)=\pi\mathcal{N}\left(d_{i}\middle|\mu_{1},\sigma_{1}^{2}\right)+(1-\pi)\mathcal{N}\left(d_{i}\middle|\mu_{2},\sigma_{2}^{2}\right) (17)
πBeta(π|a,b)\pi\sim\mathrm{Beta}\left(\pi\middle|a,b\right) (18)
\mu_{k},\sigma_{k}^{2}\sim\mathrm{NIX}\left(\mu_{k},\sigma_{k}^{2}\middle|m_{0},\kappa_{0},s_{0}^{2},\nu_{0}\right) (19)

Parameters for the priors in the GMM are empirically set based on the reference dataset’s anomaly scores δi,i{1,2,,Nref}\delta_{i},{\forall}i\in\{1,2,\cdots,N_{ref}\} :

m_{0}=\frac{\sum_{i=1}^{N_{ref}}\delta_{i}}{N_{ref}},\ \kappa_{0}=0.01,\ \nu_{0}=3,\ s_{0}^{2}=\frac{\sum_{i=1}^{N_{ref}}\left(\delta_{i}-m_{0}\right)^{2}}{N_{ref}} (20)
a=1,b=10a=1,\ b=10 (21)

The values of aa and bb can be adjusted if prior knowledge about anomaly abundance is available. The complete data log likelihood for the posterior, denoted as c(Θ)\ell_{c}\left(\Theta\right), is expressed as:

c(Θ)\displaystyle\ell_{c}\left(\Theta\right) =logP(𝒟|Θ)\displaystyle=\mathrm{log}P\left(\mathcal{D}\ \middle|\Theta\right) (22)
=i[𝕀(zi=1)(logπ+log𝒩(di|μ1,σ12))\displaystyle=\sum_{i}\Big{[}\mathbb{I}\left(z_{i}=1\right)\left(\mathrm{log}\pi+\mathrm{log}\mathcal{N}\left(d_{i}\middle|\mu_{1},\sigma_{1}^{2}\right)\right)
+𝕀(zi=2)(log(1π)+log𝒩(di|μ2,σ22))]\displaystyle+\mathbb{I}\left(z_{i}=2\right)\left(\mathrm{log}(1-\pi)+\mathrm{log}\mathcal{N}\left(d_{i}\middle|\mu_{2},\sigma_{2}^{2}\right)\right)\Big{]}
+logBeta(π|a,b)\displaystyle+\mathrm{log}\mathrm{Beta}\left(\pi\middle|a,b\right)
+k=12logNIX(μk,σk2|m0,κ0,s02,ν0)\displaystyle+\sum_{k=1}^{2}{\mathrm{log}\mathrm{NIX}\left(\mu_{k},\sigma_{k}^{2}\middle|m_{0},\kappa_{0},s_{0}^{2},\nu_{0}\right)}

Here, ziz_{i} denotes the component membership of spot ii. In the tt-th iteration of the E-step, the expected sufficient statistics zi¯(t){\overline{z_{i}}}^{(t)} is derived from Θ(t1)\Theta^{(t-1)}. In the subsequent M-step, Θ(t1)\Theta^{(t-1)} is updated to Θ(t)\Theta^{(t)} by maximizing the auxiliary function Q(Θ,Θ(t1))=E(c(Θ)|Θ(t1))Q\left(\Theta,\Theta^{(t-1)}\right)=E\left({\ell}_{c}\left(\Theta\right)\big{|}\Theta^{(t-1)}\right). We elaborate our MAP-EM algorithm below:

MAP-EM inference of GMM parameters.

We first list the mathematical notations used in the inference below:

  Notation Description
 \mathcal{D}=\left\{d_{i},\forall i\in\{1,2,\cdots,N\}\right\} Set of anomaly scores of target spots.
Δ={δi,i{1,2,,Nref}}\Delta\ =\left\{\delta_{i},\forall i\in\{1,2,\cdots,N_{ref}\}\right\} Set of anomaly scores of reference spots.
NN Number of target spots.
NrefN_{ref} Number of reference spots.
π1\pi_{1} Anomaly abundance among the target spots.
π2=1π1\pi_{2}=1-\pi_{1} Abundance of normal spots among the target spots.
\Theta=\left\{\pi_{k},\mu_{k},\sigma_{k}^{2},\forall k\in\left\{1,2\right\}\right\} Parameters of the two GMM components.
zi{1,2}z_{i}\in\left\{1,2\right\} GMM component membership of the spot ii.
 
Table 4: Overview of notations in MAP-EM inference.

Initially, we introduce a prior on π1\pi_{1} as a Beta distribution, and a conjugate joint prior on μk\mu_{k},σk2\sigma_{k}^{2} as a normal inverse chi-squared (NIX) distribution:

π1Beta(π|a,b)\pi_{1}\sim\mathrm{Beta}\left(\pi\middle|a,b\right) (23)
μk,σk2\displaystyle\mu_{k},\sigma_{k}^{2} NIX(μk,σk2|m0,κ0,s02,ν0)\displaystyle\sim\mathrm{NIX}\left(\mu_{k},\sigma_{k}^{2}\middle|m_{0},\kappa_{0},s_{0}^{2},\nu_{0}\right) (24)
=𝒩(μk|m0,σk2/κ0)χ2(σk2|s02,ν0)\displaystyle=\mathcal{N}\left(\mu_{k}\middle|m_{0},\sigma_{k}^{2}/\kappa_{0}\right)\chi^{-2}\left(\sigma_{k}^{2}\middle|s_{0}^{2},\nu_{0}\right)

Here, we set the parameters of the prior distributions based on the anomaly scores of spots in the reference dataset:

m_{0}=\frac{\sum_{i=1}^{N_{ref}}\delta_{i}}{N_{ref}},\kappa_{0}=0.01,\nu_{0}=3,s_{0}^{2}=\frac{\sum_{i=1}^{N_{ref}}\left(\delta_{i}-m_{0}\right)^{2}}{N_{ref}} (25)
a=1,b=10a=1,b=10 (26)

Note that the values of a and b can be set to more appropriate values if prior knowledge about the abundance of anomalies is available. The posterior complete data log likelihood can be written as:

c(Θ)\displaystyle\ell_{c}\left(\Theta\right) =logP(𝒟|Θ)\displaystyle=\mathrm{log}P\left(\mathcal{D}\ \middle|\Theta\right) (27)
=ik𝕀(zi=k)(logπk+log𝒩(di|μk,σk2))\displaystyle=\sum_{i}\sum_{k}\mathbb{I}\left(z_{i}=k\right)\left(\mathrm{log}\pi_{k}+\mathrm{log}\mathcal{N}\left(d_{i}\middle|\mu_{k},\sigma_{k}^{2}\right)\right)
+logBeta(π|a,b)\displaystyle+\mathrm{log}\mathrm{Beta}\left(\pi\middle|a,b\right)
+k=12logNIX(μk,σk2|m0,κ0,s02,ν0)\displaystyle+\sum_{k=1}^{2}{\mathrm{log}\mathrm{NIX}\left(\mu_{k},\sigma_{k}^{2}\middle|m_{0},\kappa_{0},s_{0}^{2},\nu_{0}\right)}

E step. In the tt-th iteration, we have the auxiliary function QQ as:

Q(Θ,Θ(t1))=𝔼[c(Θ)|Θ(t1)]\displaystyle Q\left(\Theta,\Theta^{(t-1)}\right)=\mathbb{E}\left[\ell_{c}(\Theta)\big{|}\Theta^{(t-1)}\right]
=ik=12P(zi=k|di,Θ(t1))\displaystyle=\sum_{i}\sum_{k=1}^{2}P\left(z_{i}=k\big{|}d_{i},\Theta^{(t-1)}\right)
[logπk(t1)+𝔼(logN(di|μk(t1),(σk2)(t1)))]\displaystyle\qquad\left[\mathrm{log}\pi_{k}^{(t-1)}+\mathbb{E}\left(\mathrm{log}N\left(d_{i}\big{|}\mu_{k}^{(t-1)},{(\sigma_{k}^{2})}^{(t-1)}\right)\right)\right]
+logBeta(π|a,b)+k=12logNIX(μk,σk2|m0,κ0,s02,ν0)\displaystyle+\mathrm{log}\mathrm{Beta}\left(\pi|a,b\right)+\sum_{k=1}^{2}\mathrm{log}\mathrm{NIX}\left(\mu_{k},\sigma_{k}^{2}\big{|}m_{0},\kappa_{0},s_{0}^{2},\nu_{0}\right)

The expected sufficient statistics (ESS) are:

zi,k¯\displaystyle\overline{z_{i,k}} =P(zi=k|di,Θ(t1))\displaystyle=P\left(z_{i}=k\big{|}d_{i},\Theta^{(t-1)}\right) (28)
=πk(t1)𝒩(di|μk(t1),(σk2)(t1))kπk(t1)𝒩(di|μk(t1),(σk2)(t1))\displaystyle=\frac{\pi_{k}^{(t-1)}\mathcal{N}\left(d_{i}\big{|}\mu_{k}^{(t-1)},(\sigma_{k}^{2})^{(t-1)}\right)}{\sum_{k^{\prime}}\pi_{k^{\prime}}^{(t-1)}\mathcal{N}\left(d_{i}\big{|}\mu_{k^{\prime}}^{(t-1)},(\sigma_{k^{\prime}}^{2})^{(t-1)}\right)}

M step. In the tt-th iteration, the expected complete posterior data log likelihood is:

Q(Θ,Θ(t1))k=12i[zi,k¯(t)(logπklog(σk2)2(diμk)22σk2)]+logBeta(π|a,b)+k=12[log𝒩(μk|m0,σk2/κ0)+logχ2(σk2|s02,ν0)]\begin{split}&Q\left(\Theta,\Theta^{\left(t-1\right)}\right)\propto\\ &\sum_{k=1}^{2}\sum_{i}\left[{\overline{z_{i,k}}}^{\left(t\right)}\left(\mathrm{log}\pi_{k}-\frac{\mathrm{log}\left(\sigma_{k}^{2}\right)}{2}-\frac{\left(d_{i}-\mu_{k}\right)^{2}}{2\sigma_{k}^{2}}\right)\right]\\ &\quad+\mathrm{log}\mathrm{Beta}\left(\pi\middle|a,b\right)\\ &\quad+\sum_{k=1}^{2}\left[\mathrm{log}\mathcal{N}\left(\mu_{k}\middle|m_{0},\sigma_{k}^{2}/\kappa_{0}\right)+\mathrm{log}\chi^{-2}\left(\sigma_{k}^{2}\middle|s_{0}^{2},\nu_{0}\right)\right]\end{split} (29)

We maximize Q(Θ,Θ(t1))Q\left(\Theta,\Theta^{\left(t-1\right)}\right) with respect to Θ\Theta. The posterior distribution of π1\pi_{1}\ and {μk,σk2}\left\{\mu_{k},\sigma_{k}^{2}\right\} are:

π1Beta(π|a(t),b(t))\pi_{1}\sim\mathrm{Beta}\left(\pi\middle|a^{\left(t\right)},b^{\left(t\right)}\right) (30)
a(t)=a+izi,1¯(t)a^{\left(t\right)}=a+\sum_{i}{\overline{z_{i,1}}}^{\left(t\right)} (31)
b(t)=b+izi,2¯(t)b^{\left(t\right)}=b+\sum_{i}{\overline{z_{i,2}}}^{\left(t\right)} (32)
μk,σk2NIX(μk,σk2|mk(t),κk(t),(sk2)(t),νk(t))\mu_{k},\sigma_{k}^{2}\sim\mathrm{NIX}\left(\mu_{k},\sigma_{k}^{2}\middle|m_{k}^{\left(t\right)},\kappa_{k}^{\left(t\right)},\left(s_{k}^{2}\right)^{\left(t\right)},\nu_{k}^{\left(t\right)}\right) (33)
zk¯(t)=izi,k¯(t){\overline{z_{k}}}^{\left(t\right)}=\sum_{i}{\overline{z_{i,k}}}^{\left(t\right)} (34)
dk¯(t)=i(zi,k¯(t)di)zk¯(t){\bar{d_{k}}}^{\left(t\right)}=\frac{\sum_{i}{({\overline{z_{i,k}}}^{\left(t\right)}}d_{i})}{{\overline{z_{k}}}^{\left(t\right)}} (35)
νk(t)=ν0+zk¯(t),κk(t)=κ0+zk¯(t)\nu_{k}^{\left(t\right)}=\nu_{0}+{\overline{z_{k}}}^{\left(t\right)},\ \kappa_{k}^{\left(t\right)}=\kappa_{0}+{\overline{z_{k}}}^{\left(t\right)} (36)
mk(t)=zk¯(t)dk¯(t)+m0κ0κk(t)m_{k}^{\left(t\right)}=\frac{{\overline{z_{k}}}^{\left(t\right)}{\bar{d_{k}}}^{\left(t\right)}+m_{0}\kappa_{0}}{\kappa_{k}^{\left(t\right)}} (37)
(sk2)(t)=ν0s02+i(zi,k¯(t)di2)+κ0m02zk¯(t)(mk(t))2\left(s_{k}^{2}\right)^{\left(t\right)}={\nu_{0}s}_{0}^{2}+\sum_{i}{({\overline{z_{i,k}}}^{\left(t\right)}}d_{i}^{2})+\kappa_{0}m_{0}^{2}-{\overline{z_{k}}}^{\left(t\right)}\left(m_{k}^{\left(t\right)}\right)^{2}

Then we have the MAP estimates of π1\pi_{1}, μk\mu_{k}\ and σk2\sigma_{k}^{2} as π1(t)\pi_{1}^{(t)},μk(t)\mu_{k}^{\left(t\right)} and (σk2)(t)\left(\sigma_{k}^{2}\right)^{\left(t\right)}:

π1(t)=a(t)1a(t)+b(t)2\pi_{1}^{\left(t\right)}=\frac{a^{\left(t\right)}-1}{a^{\left(t\right)}+b^{\left(t\right)}-2} (38)
μk(t)=mk(t)\mu_{k}^{\left(t\right)}=m_{k}^{\left(t\right)} (39)
(σk2)(t)=νk(t)(sk2)(t)νk(t)+3\left(\sigma_{k}^{2}\right)^{\left(t\right)}=\frac{{\nu_{k}^{\left(t\right)}\left(s_{k}^{2}\right)}^{\left(t\right)}}{\nu_{k}^{\left(t\right)}+3} (40)

Next, the EM algorithm proceeds to the E-step of the \left(t+1\right)-th iteration to update {\overline{z_{i,k}}}^{(t+1)}, \forall i\in\left[1,N\right], \forall k\in\left\{1,2\right\}, until either convergence is achieved or a pre-specified number of iterations is reached. Finally, the soft assignment of spot i to the anomalous group (\mathcal{Q}_{i,1}) is calculated by plugging in \Theta:

qi,1=π1𝒩(di|μ1,σ12)q_{i,1}=\pi_{1}\mathcal{N}\left(d_{i}\middle|\mu_{1},\sigma_{1}^{2}\right) (41)
𝒬i,1=qi,1kqi,k,i{1,2,,N},k{1,2}\mathcal{Q}_{i,1}=\frac{q_{i,1}}{\sum_{k}q_{i,k}},\forall i\in\{1,2,\cdots,N\},\forall k\in\left\{1,2\right\} (42)

If 𝒬i,1>0.5\mathcal{Q}_{i,1}>0.5, then spot ii is determined as an anomaly.
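
For concreteness, a minimal Python sketch of this MAP-EM procedure is shown below. It is an illustrative implementation under the notation above rather than our exact code; in particular, the variance update follows the standard normal-inverse-chi-squared posterior mode, and the initialization of the component means is a heuristic choice.

```python
import numpy as np
from scipy.stats import norm

def map_em_anomaly_assignment(d, delta, a=1.0, b=10.0, kappa0=0.01, nu0=3.0, n_iter=100):
    """Sketch of the MAP-EM threshold procedure for the two-component GMM.
    d: anomaly scores of target spots; delta: anomaly scores of reference (inlier) spots."""
    m0 = delta.mean()                                 # empirical prior mean from reference scores
    s0_sq = ((delta - m0) ** 2).mean()                # empirical prior scale
    pi1 = a / (a + b)                                 # initial anomaly abundance
    mu = np.array([np.quantile(d, 0.9), np.quantile(d, 0.5)])   # enforce mu_1 > mu_2 at start
    var = np.full(2, d.var() + 1e-8)

    for _ in range(n_iter):
        # E-step: responsibilities z_bar_{i,k}
        pis = np.array([pi1, 1.0 - pi1])
        lik = pis * np.stack([norm.pdf(d, mu[k], np.sqrt(var[k])) for k in range(2)], axis=1)
        resp = lik / lik.sum(axis=1, keepdims=True)

        # M-step: MAP updates under the Beta and NIX priors
        zk = resp.sum(axis=0)
        dk = (resp * d[:, None]).sum(axis=0) / zk
        kappa_k, nu_k = kappa0 + zk, nu0 + zk
        m_k = (zk * dk + kappa0 * m0) / kappa_k
        ss_k = nu0 * s0_sq + (resp * d[:, None] ** 2).sum(axis=0) \
               + kappa0 * m0 ** 2 - kappa_k * m_k ** 2
        pi1 = (a + zk[0] - 1.0) / (a + b + zk.sum() - 2.0)
        mu, var = m_k, ss_k / (nu_k + 3.0)            # posterior-mode estimates

    # Soft assignment to the anomalous component and thresholding at 0.5
    q1 = pi1 * norm.pdf(d, mu[0], np.sqrt(var[0]))
    q2 = (1.0 - pi1) * norm.pdf(d, mu[1], np.sqrt(var[1]))
    return q1 / (q1 + q2) > 0.5                       # True => spot classified as anomaly
```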

Appendix C Algorithm for MEATRD

Algorithm 1 Stage II training.
Input: Gene expression profiles \bm{X}\in\mathbb{R}^{N\times G}; image patches \bm{P}\in\mathbb{R}^{N\times h\times w\times c}; attributed graph G(V,A,\bm{\mathcal{Z}}); number of nodes N; image-modality weight \lambda; gene-modality weight \alpha.
Modules: Pre-trained Mobile-UNet encoder E_{1}; pre-trained Mobile-UNet decoder D_{1}; gene encoder f_{E}; gene decoder f_{D}; MGDAT network \mathcal{F}; ResNet-based image decoder D_{2}; GNN-based gene decoder D_{3}; feed-forward network f; L1 reconstruction loss \mathcal{L}_{1}; SSIM loss \mathcal{L}_{SSIM}; SCE loss \mathcal{L}_{SCE}.
Output: Reconstructed ST data \hat{\mathbf{X}}_{b} and reconstructed histology image patches \hat{\mathbf{P}}_{b} of the query spots.
1: for \mathbf{X}_{b}, \mathbf{P}_{b} in \bm{X}, \bm{P} do \triangleright Stage II
2:     Z_{gene}=f_{E}(\mathbf{X}_{b}), Z_{img}=E_{1}(\mathbf{P}_{b})
3:     Z_{gene},Z_{img}=\mathcal{F}(Z_{gene},Z_{img})
4:     \hat{\mathbf{P}}_{b}=D_{2}(Z_{img}), \hat{\mathbf{X}}_{b}=D_{3}(Z_{gene})
5:     \mathcal{L}_{rec}=\mathcal{L}_{SSIM}(\mathbf{P}_{b},\hat{\mathbf{P}}_{b})+\lambda\mathcal{L}_{1}(\mathbf{P}_{b},\hat{\mathbf{P}}_{b})+\alpha\mathcal{L}_{SCE}(\mathbf{X}_{b},\hat{\mathbf{X}}_{b})
6:     Update the parameters of E_{1},f_{E},\mathcal{F},D_{2},D_{3} using \mathcal{L}_{rec}
7: end for
8: return \hat{\mathbf{P}}_{b}, \hat{\mathbf{X}}_{b}
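
To make the Stage II objective in Algorithm 1 concrete, a minimal PyTorch-style sketch of the reconstruction loss is given below. The ssim_loss callable is assumed to be supplied by an external library, and the scaled cosine error (SCE) term follows a common GraphMAE-style formulation, which is an assumption of this sketch rather than the exact definition given in the main text.

```python
import torch
import torch.nn.functional as F

def sce_loss(x, x_hat, gamma=2.0):
    """Scaled cosine error between original and reconstructed gene expression profiles.
    (GraphMAE-style formulation; assumed here for illustration.)"""
    cos = F.cosine_similarity(x, x_hat, dim=-1)
    return ((1.0 - cos).clamp(min=1e-8) ** gamma).mean()

def stage2_reconstruction_loss(P_b, P_hat_b, X_b, X_hat_b, ssim_loss, lam=1.0, alpha=0.5):
    """Combined Stage II loss: SSIM + lambda * L1 on image patches, plus alpha * SCE on expression."""
    l_img = ssim_loss(P_b, P_hat_b) + lam * F.l1_loss(P_b, P_hat_b)
    l_gene = alpha * sce_loss(X_b, X_hat_b)
    return l_img + l_gene
```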
Algorithm 2 Stage III training.
Input: Gene expression profiles \bm{X}\in\mathbb{R}^{N\times G}; image patches \bm{P}\in\mathbb{R}^{N\times H\times W\times C}; maximum number of epochs E_{max}.
Modules: Feed-forward network f; Stage III image encoder E_{2}; Stage III gene encoder E_{3}; latent reconstruction error \ell_{rec,b}; dimension of the hyperspherical space D.
Output: SVDD center c\in\mathbb{R}^{N\times D}.
1: \hat{\mathbf{P}}, \hat{\mathbf{X}} = Learning of Spot Reconstruction(\bm{P}, \bm{X}) \triangleright Stage III
2: Initialize E_{2}, E_{3}, and f
3: while epoch<E_{max} do
4:     Compute the center c
5:     for \mathbf{P}_{b}, \mathbf{X}_{b}, \hat{\mathbf{P}}_{b}, \hat{\mathbf{X}}_{b} in \bm{P}, \bm{X}, \hat{\bm{P}}, \hat{\bm{X}} do
6:         Z_{fused}=f(\beta\,\mathrm{norm}(E_{2}(\mathbf{P}_{b}))+\mathrm{norm}(E_{3}(\mathbf{X}_{b})))
7:         \hat{Z}_{fused}=f(\beta\,\mathrm{norm}(E_{2}(\hat{\mathbf{P}}_{b}))+\mathrm{norm}(E_{3}(\hat{\mathbf{X}}_{b})))
8:         \ell_{rec,b}=Z_{fused}-\hat{Z}_{fused}
9:         \mathcal{L}_{SVDD}=\left\|\ell_{rec,b}-c\right\|^{2}
10:        Update the parameters of E_{2},E_{3},f using \mathcal{L}_{SVDD}
11:     end for
12: end while
13: return c
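
Similarly, a minimal PyTorch-style sketch of the Stage III Deep SVDD objective on latent multimodal reconstruction errors (Algorithm 2) might look as follows; the encoders E2 and E3, the fusion network f, and the center c are placeholders for the modules and quantities defined above.

```python
import torch
import torch.nn.functional as F

def stage3_svdd_loss(P_b, X_b, P_hat_b, X_hat_b, E2, E3, f, center, beta=1.0):
    """Deep SVDD loss on latent multimodal reconstruction errors (sketch of Algorithm 2).
    E2/E3: Stage III image/gene encoders; f: feed-forward fusion network; center: hypersphere center c."""
    # Fused embeddings of the original and the reconstructed spot
    z_fused = f(beta * F.normalize(E2(P_b), dim=-1) + F.normalize(E3(X_b), dim=-1))
    z_hat_fused = f(beta * F.normalize(E2(P_hat_b), dim=-1) + F.normalize(E3(X_hat_b), dim=-1))
    # Latent multimodal reconstruction error of the batch
    err = z_fused - z_hat_fused
    # Collapse the (inlier) reconstruction errors onto the hypersphere center c
    return ((err - center) ** 2).sum(dim=-1).mean()
```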

Appendix D Theoretical Analysis

Fused Bottleneck Encoding as a Minimally Sufficient Representation of Modality-Specific, Task-Relevant Information

In this section, we begin with the mathematical notations (Table 5), properties (D.1), definitions (Definition D.4), and assumptions (D.1 and D.2) pertinent to our theoretical analysis. We then prove that the fused bottleneck encoding serves as a sufficient statistic (Tian et al. 2020) for capturing complementary task-relevant information across data modalities (Proposition D.1), as illustrated in Supplementary Figure 4. Finally, we prove that the fused bottleneck encoding is the most informationally compact among all sufficient encodings (Proposition D.2).

  Notation Description
  viv_{i} The view associated with the ii-th data modality.
b0b_{0} The biological contents shared between data modalities.
bib_{i} The biological contents specific to the ii-th data modality.
I()I(*) The information set inherent to *.
MM The mutual information function.
HH The entropy function.
f1f_{1} and z1z_{1} The encoder and encoding for view v1v_{1}.
f2f_{2} and z2z_{2} The encoder and encoding for view v2v_{2}.
f3f_{3} and z3z_{3} The fusion bottleneck encoder and fused encoding.
 
Table 5: Summary of notation.
Definition D.4.

Information function I(x)I(x) denotes the information set inherent in xx, e.g., I(x)=H(x)I(x)=H(x) when xx is a variable. Also, we have I(v1,v2)=I(v1)I(v2)I(v_{1},v_{2})=I(v_{1})\cup\ I(v_{2}).

Definition D.5.

The relative mutual information between two variables v1v_{1} and v2v_{2} is defined as the ratio of their mutual information to their total information:

M^(v1,v2)=M(v1,v2)I(v1)I(v2)=M(v1,v2)H(v1)+H(v2)M(v1,v2)\widehat{M}(v_{1},v_{2})=\frac{M(v_{1},v_{2})}{I(v_{1})\cup I(v_{2})}=\frac{M(v_{1},v_{2})}{H(v_{1})+H(v_{2})-M(v_{1},v_{2})}

Relative mutual information is more effective in highlighting the significance of shared information between two variables compared to conventional mutual information.
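
As a small numerical illustration of Definition D.5, the following sketch computes the relative mutual information of two discrete variables from their joint distribution; the toy joint distribution is chosen purely for illustration.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def relative_mutual_information(joint):
    """M_hat(v1, v2) = M(v1, v2) / (H(v1) + H(v2) - M(v1, v2)),
    computed from a discrete joint distribution (2D array summing to 1)."""
    p1 = joint.sum(axis=1)                 # marginal of v1
    p2 = joint.sum(axis=0)                 # marginal of v2
    mi = entropy(p1) + entropy(p2) - entropy(joint.flatten())   # mutual information
    return mi / (entropy(p1) + entropy(p2) - mi)                # normalize by the information union

# Toy example: two partially dependent binary variables
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(relative_mutual_information(joint))  # ~0.16, i.e. ~16% of the total information is shared
```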

Properties D.1.

Properties of Mutual Information and Entropy:

\textbf{i})\ M(x;y)\geq 0,\ M(x;y|z)\geq 0.
\textbf{ii})\ M(x;y,z)=M(x;y)+M(x;z|y).
\textbf{iii})\ M(x_{1};x_{2};\cdots;x_{n+1})=M(x_{1};\cdots;x_{n})-M(x_{1};\cdots;x_{n}|x_{n+1}).
\textbf{iv})\ \text{If}\ I(v_{2})\subseteq I(v_{1}),\ \text{then}\ M(v_{1},v_{2})=H(v_{2})\ \text{and}\ I(v_{1},v_{2})=I(v_{1})\cup I(v_{2})=I(v_{1})=H(v_{1}).
\textbf{v})\ \text{If}\ I(v_{2})\cap I(v_{1})=\varnothing,\ \text{then}\ I(v_{1},v_{2})=H(v_{1},v_{2})=H(v_{1})+H(v_{2})=I(v_{1})+I(v_{2}).
Proof.

The proofs of properties i, ii, and iii can be found in (Cover 1999). For property iv:

M(v1,v2)=v1,v2p(v1,v2)log(p(v1,v2)p(v1)p(v2))=v1,v2p(v1,v2)log(p(v2|v1)=1asI(v2)I(v1)p(v1)p(v1)p(v2))=v2p(v2)log(p(v2))=H(v2).\begin{split}M(v_{1},v_{2})&=\underset{v_{1},v_{2}}{\iint}p(v_{1},v_{2})\mathrm{log}(\frac{p(v_{1},v_{2})}{p(v_{1})p(v_{2})})\\ &=\underset{v_{1},v_{2}}{\iint}p(v_{1},v_{2})\mathrm{log}(\frac{\overbrace{p(v_{2}|v_{1})}^{=1\ \text{as}\ I(v_{2})\subseteq I(v_{1})}p(v_{1})}{p(v_{1})p(v_{2})})\\ &=\underset{v_{2}}{\int}-p(v_{2})\mathrm{log}(p(v_{2}))=H(v_{2}).\end{split} (43)

In addition, for I(v1,v2)I(v_{1},v_{2}), we have:

\begin{split}I(v_{1},v_{2})&=I(v_{1})\cup I(v_{2})=H(v_{1},v_{2})\\ &=\underset{v_{1},v_{2}}{\iint}-p(v_{1},v_{2})\mathrm{log}(p(v_{1},v_{2}))\\ &=\underset{v_{1},v_{2}}{\iint}-p(v_{1},v_{2})\mathrm{log}(p(v_{2}|v_{1})p(v_{1}))\\ &=\underset{v_{1}}{\int}-p(v_{1})\mathrm{log}(p(v_{1}))=H(v_{1})=I(v_{1}).\end{split} (44)

For property v, we first note that:

I(v2)I(v1)=p(v1,v2)=p(v1)p(v2)I(v_{2})\cap I(v_{1})=\varnothing\longrightarrow p(v_{1},v_{2})=p(v_{1})p(v_{2}) (45)

Therefore, we have:

H(v1,v2)=v1,v2p(v1,v2)log(p(v1,v2))=v1,v2p(v1)p(v2)log(p(v1)p(v2))=v1p(v1)log(p(v1))+v2p(v2)log(p(v2))=H(v1)+H(v2)\begin{split}H(v_{1},v_{2})&=\underset{v_{1},v_{2}}{\iint}-p(v_{1},v_{2})\mathrm{log}(p(v_{1},v_{2}))\\ &=\underset{v_{1},v_{2}}{\iint}-p(v_{1})p(v_{2})\mathrm{log}(p(v_{1})p(v_{2}))\\ &=\underset{v_{1}}{\int}-p(v_{1})\mathrm{log}(p(v_{1}))+\underset{v_{2}}{\int}-p(v_{2})\mathrm{log}(p(v_{2}))\\ &=H(v_{1})+H(v_{2})\end{split} (46)

Assumption D.1.

Assume that histology image and ST represent two views (v1v_{1} and v2v_{2}) of the biological information (bb) inherent in the studied tissue. Let yy be an indicator of the normality of regions across the tissue, which is essentially determined by bb. Then, we have:

I(y)={b}={b0,b1,b2},{b0}{b1}=,{b0}{b2}=,{b1}{b2}=,M(y;v1)={b}I(v1)={b0,b1},M(y;v2)={b}I(v2)={b0,b2}.\begin{split}&I(y)=\{b\}=\{b_{0},b_{1},b_{2}\},\\ &\{b_{0}\}\cap\{b_{1}\}=\varnothing,\{b_{0}\}\cap\{b_{2}\}=\varnothing,\{b_{1}\}\cap\{b_{2}\}=\varnothing,\\ &M(y;v_{1})=\{b\}\cap I(v_{1})=\{b_{0},b_{1}\},\\ &M(y;v_{2})=\{b\}\cap I(v_{2})=\{b_{0},b_{2}\}.\\ \end{split}

Here, b0b_{0} represents the common task-relevant information, while b1b_{1} and b2b_{2} represent the task-relevant information specific to v1v_{1} and v2v_{2}, respectively.

Assumption D.2.

The encodings z_{1}=f_{1}(v_{1}), z_{2}=f_{2}(v_{2}), and z_{3}=f_{3}(z_{1},z_{2}) are generated by the respective encoders. We define z_{4}=\{z_{1},z_{3}\} and z_{5}=\{z_{2},z_{3}\} as per equation (6) in the main text. We assume that f_{1} and f_{2} are information-lossless encoders that, together with the fusion bottleneck encoder f_{3}, follow the information bottleneck theory proposed by Tishby, Pereira, and Bialek (2000). That is, z_{4} and z_{5} should be maximally informative about y under an information constraint on the bottleneck z_{3}. We use relative mutual information in place of conventional mutual information for a more accurate reflection of the significance of shared information. The optimization problems are defined as:

maxf1,f3M^(z4;y|f1)s.t.M^(z3;v1|f3)Ic,\max_{f_{1},f_{3}}\ \widehat{M}(z_{4};y|f_{1})\quad\mathrm{s.t.}\ \widehat{M}(z_{3};v_{1}|f_{3})\leq I_{c},
maxf2,f3M^(z5;y|f2)s.t.M^(z3;v2|f3)Ic,\max_{f_{2},f_{3}}\ \widehat{M}(z_{5};y|f_{2})\quad\mathrm{s.t.}\ \widehat{M}(z_{3};v_{2}|f_{3})\leq I_{c},

where IcI_{c} is the information constraint. These can be converted into the following objective functions by introducing a Lagrange multiplier β>0\beta>0:

minz3,z4(z3,z4)=minz3,z4M^(z4;y)+βM^(z3;v1),\min_{z_{3},z_{4}}\ell(z_{3},z_{4})=\min_{z_{3},z_{4}}-\widehat{M}(z_{4};y)+\beta\widehat{M}(z_{3};v_{1}),
minz3,z5(z3,z5)=minz3,z5M^(z5;y)+βM^(z3;v2).\min_{z_{3},z_{5}}\ell(z_{3},z_{5})=\min_{z_{3},z_{5}}-\widehat{M}(z_{5};y)+\beta\widehat{M}(z_{3};v_{2}).
Refer to caption
Figure 4: Information diagrams of the two data modalities v1v_{1} and v2v_{2}. The bottleneck encoding, generated by the MGDAT block, embodies the minimally sufficient representation for modality-specific, task-relevant information (i.e., b1+b2b_{1}+b_{2}).
Proposition D.1.

Inclusiveness of complementary task-relevant information. The objective functions in D.2 are optimized when the bottleneck encoding z3z_{3} encompasses all task-relevant information specific to v1v_{1} and v2v_{2}:

I(z3){b1,b2}I(z_{3})\supseteq\{b_{1},b_{2}\}
Proof.

Given that f1f_{1} and f2f_{2} are information lossless encoders of v1v_{1} and v2v_{2}, we have:

I(v1)=I(z1)={b0,b1,z~1}I(v_{1})=I(z_{1})=\{b_{0},b_{1},\tilde{z}_{1}\} (47)
I(v2)=I(z2)={b0,b2,z~2}I(v_{2})=I(z_{2})=\{b_{0},b_{2},\tilde{z}_{2}\} (48)

Here, \tilde{z}_{1} and \tilde{z}_{2} represent the task-irrelevant information specific to v_{1} and v_{2}, respectively. The sets \{b_{0}\}, \{b_{1}\}, \{b_{2}\}, \{\tilde{z}_{1}\}, and \{\tilde{z}_{2}\} are mutually exclusive, i.e.,

{bi}{z~1}=,{bi}{z~2}=,\displaystyle\{b_{i}\}\cap\{\tilde{z}_{1}\}=\varnothing,\{b_{i}\}\cap\{\tilde{z}_{2}\}=\varnothing, (49)
{z~1}{z~2}=,{bi}{bj}=,\displaystyle\{\tilde{z}_{1}\}\cap\{\tilde{z}_{2}\}=\varnothing,\{b_{i}\}\cap\{b_{j}\}=\varnothing,
i,j{0,1,2},ij\displaystyle\forall i,j\in\{0,1,2\},i\neq j

Let {zˇ3}=({b0,b1,b2}/({b0,b1,b2}I(z3))){b2}\{\check{z}_{3}\}=(\{b_{0},b_{1},b_{2}\}/(\{b_{0},b_{1},b_{2}\}\cap I(z_{3})))\cap\{b_{2}\} represent the task-relevant information in b2b_{2} that is not included in z3z_{3}. It is obvious:

{zˇ3}I(y),{zˇ3}I(z3)=,\displaystyle\{\check{z}_{3}\}\subset I(y),\{\check{z}_{3}\}\cap I(z_{3})=\varnothing, (50)
{zˇ3}{b0}=,{zˇ3}{b1}=,\displaystyle\{\check{z}_{3}\}\cap\{b_{0}\}=\varnothing,\{\check{z}_{3}\}\cap\{b_{1}\}=\varnothing,
{zˇ3}I(v1)=,{zˇ3}I(z1)=.\displaystyle\{\check{z}_{3}\}\cap I(v_{1})=\varnothing,\{\check{z}_{3}\}\cap I(z_{1})=\varnothing.

If {zˇ3}\{\check{z}_{3}\}\neq\varnothing, we have:

(z3,z4)=M^(z4;y)+βM^(z3;v1)=M^(z1,z3;y)+βM^(v1;z3)=M^(z1,z3;y)+β(M(v1;z3)I(v1)I(z3)+M(v1;zˇ3|z3)=0I(v1)I(z3))>M^(z1,z3;y)+βM(z3,zˇ3;v1)I(v1)I(z3,zˇ3)=H(z3)+H(zˇ3)>H(z3)=I(z3)=M^(z1,z3;y)+βM^(z3,zˇ3;v1)\begin{split}&\ell(z_{3},z_{4})=-\widehat{M}(z_{4};y)+\beta\widehat{M}(z_{3};v_{1})\\ &=-\widehat{M}(z_{1},z_{3};y)+\beta\widehat{M}(v_{1};z_{3})\\ &=-\widehat{M}(z_{1},z_{3};y)+\beta(\frac{M(v_{1};z_{3})}{I(v_{1})\cup I(z_{3})}+\frac{\overbrace{M(v_{1};\check{z}_{3}|z_{3})}^{=0}}{I(v_{1})\cup I(z_{3})})\\ &>-\widehat{M}(z_{1},z_{3};y)+\beta\frac{M(z_{3},\check{z}_{3};v_{1})}{I(v_{1})\cup\underbrace{I(z_{3},\check{z}_{3})}_{=H(z_{3})+H(\check{z}_{3})>H(z_{3})=I(z_{3})}}\\ &=-\widehat{M}(z_{1},z_{3};y)+\beta\widehat{M}(z_{3},\check{z}_{3};v_{1})\\ \end{split}

For M^(z1,z3;y)\widehat{M}(z_{1},z_{3};y), we have:

M^(z1,z3;y)=M(z1,z3;y)I(z1,z3)I(y)=M(z1,z3;y)I(z1,z3)(I(y)I(zˇ3))=H(y)=I(y)as{zˇ3}I(y)<M(y;zˇ3|z1,z3)>0+M(y;z1,z3)I(z1,z3,zˇ3)I(y)=M(y;z1,z3,zˇ3)I(z1,z3,zˇ3)I(y)=M^(z1,z3,zˇ3;y)\begin{split}\widehat{M}(z_{1},z_{3};y)&=\frac{M(z_{1},z_{3};y)}{I(z_{1},z_{3})\cup I(y)}\\ &=\frac{M(z_{1},z_{3};y)}{I(z_{1},z_{3})\cup\underbrace{(I(y)\cup I(\check{z}_{3}))}_{=H(y)=I(y)\ \text{as}\ \{\check{z}_{3}\}\subset I(y)}}\\ &<\frac{\overbrace{M(y;\check{z}_{3}|z_{1},z_{3})}^{>0}+M(y;z_{1},z_{3})}{I(z_{1},z_{3},\check{z}_{3})\cup I(y)}\\ &=\frac{M(y;z_{1},z_{3},\check{z}_{3})}{I(z_{1},z_{3},\check{z}_{3})\cup I(y)}=\widehat{M}(z_{1},z_{3},\check{z}_{3};y)\\ \end{split}

Thus, f3f_{3} will be updated to generate z3withI(z3)={I(z3),zˇ3}z^{\prime}_{3}\ \text{with}\ I(z^{\prime}_{3})=\{I(z_{3}),\check{z}_{3}\} so that:

(z3,z4)=M^(z1,z3;y)+βM^(z3;v1)=M^(z1,z3,zˇ3;y)+βM^(z3,zˇ3;v1)<(z3,z4)\begin{split}\ell(z^{\prime}_{3},z_{4})&=-\widehat{M}(z_{1},z^{\prime}_{3};y)+\beta\widehat{M}(z^{\prime}_{3};v_{1})\\ &=-\widehat{M}(z_{1},z_{3},\check{z}_{3};y)+\beta\widehat{M}(z_{3},\check{z}_{3};v_{1})\\ &<\ell(z_{3},z_{4})\end{split} (51)

This update continues until {zˇ3}={b2}I(z3)\{\check{z}_{3}\}=\varnothing\rightarrow\{b_{2}\}\subseteq I(z_{3}). Similarly, using (z3,z5)\ell(z_{3},z_{5}), we can show that {z^3}=({b0,b1,b2}/({b0,b1,b2}I(z3)){b1}={b1}I(z3)\{\hat{z}_{3}\}=(\{b_{0},b_{1},b_{2}\}/(\{b_{0},b_{1},b_{2}\}\cap I(z_{3}))\cap\{b_{1}\}=\varnothing\rightarrow\{b_{1}\}\subseteq I(z_{3}). Therefore, I(z3){b1,b2}I(z_{3})\supseteq\{b_{1},b_{2}\}, completing the proof. ∎

Proposition D.2.

Compactness of complementary task-relevant information. The objective functions in D.2 are minimized when:

I(z3)={b1,b2}I(z_{3})=\{b_{1},b_{2}\}
Proof.

As proved in Proposition D.1, I(z3){b1,b2}I(z_{3})\supseteq\{b_{1},b_{2}\}. We start with I(z3)={b0,b1,b2}I(z_{3})=\{b_{0},b_{1},b_{2}\}, and then expand z3z_{3} to encompass additional information from v1v_{1}, denoted as {zˇ3}\{\check{z}_{3}\}, where:

{zˇ3}I(v1)=I(z1),M(zˇ3,v1)>0,\displaystyle\{\check{z}_{3}\}\subset I(v_{1})=I(z_{1}),M(\check{z}_{3},v_{1})>0, (52)
M(zˇ3,y)=0,{z3}{zˇ3}=.\displaystyle M(\check{z}_{3},y)=0,\{z_{3}\}\cap\{\check{z}_{3}\}=\varnothing.

Let z¨3={z3,zˇ3}\ddot{z}_{3}=\{z_{3},\check{z}_{3}\}. The objective function becomes:

(z¨3,z4)=M^(z4;y)+βM^(z¨3;v1)=M^(z1,z3,zˇ3;y)+βM^(zˇ3,z3;v1)\begin{split}\ell(\ddot{z}_{3},z_{4})&=-\widehat{M}(z_{4};y)+\beta\widehat{M}(\ddot{z}_{3};v_{1})\\ &=-\widehat{M}(z_{1},z_{3},\check{z}_{3};y)+\beta\widehat{M}(\check{z}_{3},z_{3};v_{1})\\ \end{split} (53)

For M^(zˇ3,z3;v1)\widehat{M}(\check{z}_{3},z_{3};v_{1}), we have:

M^(zˇ3,z3;v1)=M(v1;z3)+M(v1;zˇ3|z3)>0I(v1)I(z3)I(z3,zˇ3)=I(z3)I(zˇ3);I(zˇ3)I(v1)=I(v1)>M(z3;v1)I(v1)I(z3)=M^(z3;v1)\begin{split}\widehat{M}(\check{z}_{3},z_{3};v_{1})&=\frac{M(v_{1};z_{3})+\overbrace{M(v_{1};\check{z}_{3}|z_{3})}^{>0}}{\underbrace{I(v_{1})\cup I(z_{3})}_{\because I(z_{3},\check{z}_{3})=I(z_{3})\cup I(\check{z}_{3});\ I(\check{z}_{3})\cup I(v_{1})=I(v_{1})}}\\ &>\frac{M(z_{3};v_{1})}{I(v_{1})\cup I(z_{3})}=\widehat{M}(z_{3};v_{1})\end{split} (54)

For M^(z1,z3,zˇ3;y)\widehat{M}(z_{1},z_{3},\check{z}_{3};y), we have:

M^(z1,z3,zˇ3;y)=M(y;zˇ3|z1,z3)=0+M(y;z1,z3)I(z1,z3,zˇ3)I(y)=M(y;z1,z3)I(z1,z3){zˇ3}I(z1)I(y)=M^(z1,z3;y)\begin{split}\widehat{M}(z_{1},z_{3},\check{z}_{3};y)&=\frac{\overbrace{M(y;\check{z}_{3}|z_{1},z_{3})}^{=0}+M(y;z_{1},z_{3})}{I(z_{1},z_{3},\check{z}_{3})\cup I(y)}\\ &=\frac{M(y;z_{1},z_{3})}{\underbrace{I(z_{1},z_{3})}_{\because\{\check{z}_{3}\}\subset I(z_{1})}\cup I(y)}=\widehat{M}(z_{1},z_{3};y)\end{split} (55)

Thus, (z¨3,z4)>M^(z1,z3;y)+βM^(z3;v1)=(z3,z4),β>0\ell(\ddot{z}_{3},z_{4})>-\widehat{M}(z_{1},z_{3};y)+\beta\widehat{M}(z_{3};v_{1})=\ell(z_{3},z_{4}),\forall\beta>0. To minimize the objective function, zˇ3\check{z}_{3} is excluded. Similarly, from (z3,z5)\ell(z_{3},z_{5}), we know z3z_{3} should not expand to encompass additional information from v2v_{2}. Hence, optimal z3z_{3} must satisfy I(z3){b0,b1,b2}I(z_{3})\subseteq\{b_{0},b_{1},b_{2}\}.

Furthermore, suppose we shrink the information of z_{3} to \dot{z}_{3} by removing \{\hat{z}_{3}\}\subseteq\{b_{0}\}\subset I(z_{1})=I(v_{1}), so that \{\dot{z}_{3}\}\cap\{\hat{z}_{3}\}=\varnothing. The objective function becomes:

(z3,z4)=M^(z4;y)+βM^(z3;v1)=M^(z1,z˙3,z^3;y)+βM^(z˙3,z^3;v1)=M(y;z1,z˙3)+M(y;z^3|z1,z˙3)=0asz^3I(z1)I(z1,z˙3,z^3)I(y)+βM(z˙3,z^3;v1)(I(z˙3)I(z^3))I(v1)=M(z1,z˙3;y)I(z1,z˙3){z^3}I(z1)I(y)+βM(v1;z˙3)+M(v1;z^3|z˙3)>0I(z˙3)I(v1)>M(z1,z˙3;y)I(z1,z˙3)I(y)+βM(z˙3;v1)I(z˙3)I(v1)=(z˙3,z4)\begin{split}&\ell(z_{3},z_{4})=-\widehat{M}(z_{4};y)+\beta\widehat{M}(z_{3};v_{1})\\ &=-\widehat{M}(z_{1},\dot{z}_{3},\hat{z}_{3};y)+\beta\widehat{M}(\dot{z}_{3},\hat{z}_{3};v_{1})\\ &=-\frac{M(y;z_{1},\dot{z}_{3})+\overbrace{M(y;\hat{z}_{3}|z_{1},\dot{z}_{3})}^{=0\ as\ \hat{z}_{3}\subset I(z_{1})}}{I(z_{1},\dot{z}_{3},\hat{z}_{3})\cup I(y)}\\ &+\beta\frac{M(\dot{z}_{3},\hat{z}_{3};v_{1})}{(I(\dot{z}_{3})\cup I(\hat{z}_{3}))\cup I(v_{1})}\\ &=-\frac{M(z_{1},\dot{z}_{3};y)}{\underbrace{I(z_{1},\dot{z}_{3})}_{\because\{\hat{z}_{3}\}\subset I(z_{1})}\cup I(y)}+\beta\frac{M(v_{1};\dot{z}_{3})+\overbrace{M(v_{1};\hat{z}_{3}|\dot{z}_{3})}^{>0}}{I(\dot{z}_{3})\cup I(v_{1})}\\ &>-\frac{M(z_{1},\dot{z}_{3};y)}{I(z_{1},\dot{z}_{3})\cup I(y)}+\beta\frac{M(\dot{z}_{3};v_{1})}{I(\dot{z}_{3})\cup I(v_{1})}=\ell(\dot{z}_{3},z_{4})\end{split}

Therefore, if M(v1;z^3)>0M(v_{1};\hat{z}_{3})>0, the objective function can be further optimized by reducing information from {b0}\{b_{0}\} until I(z3){b0}=I(z3)={b1,b2}I(z_{3})\cap\{b_{0}\}=\varnothing\rightarrow I(z_{3})=\{b_{1},b_{2}\}. This completes the proof.

In summary, f3f_{3} effectively captures the view-specific, task-relevant information in the bottleneck encoding z3z_{3}, which embodies an inclusive and condensed representation of the complementary information, biologically relevant for determining tissue region normality, between the two data modalities. Thus, z3z_{3} serves as an informational bridge connecting the two data modalities.

  Dataset Tissue (ATR Type) Total Number of Spots Anomaly Proportion
  10x-hNB-v03 Normal human breast 2364 0.00%
  10x-hNB-v04 Normal human breast 2504 0.00%
  10x-hNB-v05 Normal human breast 2224 0.00%
  10x-hNB-v06 Normal human breast 3037 0.00%
  10x-hNB-v07 Normal human breast 2086 0.00%
  10x-hNB-v08 Normal human breast 2801 0.00%
  10x-hNB-v09 Normal human breast 2694 0.00%
  10x-hNB-v10 Normal human breast 2473 0.00%
  10x-hBC-A1 Human breast cancer (Cancer in situ, Invasive cancer) 346 12.43%
  10x-hBC-B1 Human breast cancer (Invasive cancer) 295 78.64%
  10x-hBC-C1 Human breast cancer (Invasive cancer) 176 27.84%
  10x-hBC-D1 Human breast cancer (Invasive cancer) 306 54.58%
  10x-hBC-E1 Human breast cancer (Invasive cancer) 587 42.08%
  10x-hBC-F1 Human breast cancer (Invasive cancer) 691 16.50%
  10x-hBC-G2 Human breast cancer (Cancer in situ, Invasive cancer) 467 65.74%
  10x-hBC-H1 Human breast cancer (Cancer in situ, Invasive cancer) 613 69.49%
  10x-hBC-I1 Human breast cancer (Ductal carcinoma in situ, Lobular carcinoma in situ, Invasive Carcinoma) 1308 62.92%
  10x-hLiver-A1 Healthy human liver 2378 0.00%
  10x-hLiver-B1 Healthy human liver 2349 0.00%
  10x-hLiver-C1 Healthy human liver 2277 0.00%
  10x-hLiver-D1 Healthy human liver 2265 0.00%
  10x-PSC-A1 PSC human liver (native cell, intrahepatic cholangiocyte) 3118 26.36%
  10x-PSC-B1 PSC human liver (PSC fibrotic region) 2670 24.91%
  10x-PSC-C1 PSC human liver (native cell, intrahepatic cholangiocyte) 3322 25.89%
  10x-PSC-D1 PSC human liver (PSC fibrotic region) 3174 25.65%
 
Table 6: Overview of the experimental datasets.
  Target Dataset     Metric     Method
                                Multimodal-based        Image-based                             ST-based
                                MEATRD      M3DM        SimpleNet   f-AnoGAN    PatchSVDD       DOMINANT   PREM   Spatial-ID   scmap   CAMLU
  10x-PSC-A1     AUC     0.657±0.073\mathbf{0.657}_{\pm 0.073} 0.475±0.0060.475_{\pm 0.006}     0.483±0.1290.483_{\pm 0.129} 0.647¯±0.001\underline{0.647}_{\pm 0.001} -     0.590±0.0430.590_{\pm 0.043} 0.567±0.0080.567_{\pm 0.008} 0.486±0.0060.486_{\pm 0.006} 0.500±0.0000.500_{\pm 0.000} 0.537±0.0740.537_{\pm 0.074}
    F1     0.629±0.085\mathbf{0.629}_{\pm 0.085} 0.235±0.0070.235_{\pm 0.007}     0.284±0.1140.284_{\pm 0.114} 0.479¯±0.001\underline{0.479}_{\pm 0.001} -     0.356±0.0460.356_{\pm 0.046} 0.291±0.0100.291_{\pm 0.010} 0.449±0.0090.449_{\pm 0.009} 0.415±0.0000.415_{\pm 0.000} 0.217±0.1180.217_{\pm 0.118}
  10x-PSC-B1     AUC     0.675±0.092\mathbf{0.675}_{\pm 0.092} 0.475±0.0060.475_{\pm 0.006}     0.481±0.1610.481_{\pm 0.161} 0.655¯±0.001\underline{0.655}_{\pm 0.001} -     0.547±0.0400.547_{\pm 0.040} 0.520±0.0300.520_{\pm 0.030} 0.519±0.0300.519_{\pm 0.030} 0.500±0.0000.500_{\pm 0.000} 0.503±0.0040.503_{\pm 0.004}
    F1     0.646±0.101\mathbf{0.646}_{\pm 0.101} 0.235±0.0070.235_{\pm 0.007}     0.274±0.1460.274_{\pm 0.146} 0.482±0.0010.482_{\pm 0.001} -     0.307±0.0310.307_{\pm 0.031} 0.231±0.0050.231_{\pm 0.005} 0.464±0.0370.464_{\pm 0.037} 0.625¯±0.000\underline{0.625}_{\pm 0.000} 0.017±0.0180.017_{\pm 0.018}
  10x-PSC-C1     AUC     0.664¯±0.069\underline{0.664}_{\pm 0.069} 0.497±0.0130.497_{\pm 0.013}     0.468±0.1560.468_{\pm 0.156} 0.683±0.001\mathbf{0.683}_{\pm 0.001} -     0.595±0.0600.595_{\pm 0.060} 0.508±0.0080.508_{\pm 0.008} 0.517±0.0150.517_{\pm 0.015} 0.500±0.0000.500_{\pm 0.000} 0.501±0.0020.501_{\pm 0.002}
    F1     0.631±0.078\mathbf{0.631}_{\pm 0.078} 0.259±0.0120.259_{\pm 0.012}     0.268±0.1310.268_{\pm 0.131} 0.530¯±0.001\underline{0.530}_{\pm 0.001} -     0.344±0.0570.344_{\pm 0.057} 0.245±0.0060.245_{\pm 0.006} 0.470±0.0120.470_{\pm 0.012} 0.411±0.0000.411_{\pm 0.000} 0.008±0.0060.008_{\pm 0.006}
  10x-PSC-D1     AUC     0.655±0.071\mathbf{0.655}_{\pm 0.071} 0.498±0.0120.498_{\pm 0.012}     0.473±0.1820.473_{\pm 0.182} 0.639¯±0.001\underline{0.639}_{\pm 0.001} -     0.534±0.0910.534_{\pm 0.091} 0.519±0.0010.519_{\pm 0.001} 0.575±0.0180.575_{\pm 0.018} 0.500±0.0000.500_{\pm 0.000} 0.508±0.0110.508_{\pm 0.011}
    F1     0.627±0.082\mathbf{0.627}_{\pm 0.082} 0.266±0.0130.266_{\pm 0.013}     0.287±0.1720.287_{\pm 0.172} 0.464±0.0010.464_{\pm 0.001} -     0.290±0.0710.290_{\pm 0.071} 0.229±0.0010.229_{\pm 0.001} 0.512¯±0.014\underline{0.512}_{\pm 0.014} 0.408±0.0000.408_{\pm 0.000} 0.042±0.0360.042_{\pm 0.036}
  Mean     AUC     0.663\mathbf{0.663} 0.4860.486     0.4760.476 0.656¯\underline{0.656} -     0.5670.567 0.5290.529 0.5240.524 0.5000.500 0.5120.512
    F1     0.633\mathbf{0.633} 0.2490.249     0.2780.278 0.489¯\underline{0.489} -     0.3240.324 0.2490.249 0.4740.474 0.4650.465 0.0710.071
     
Table 7: Performance evaluation of anomalous tissue region detection across four primary sclerosing cholangitis (PSC) liver ST datasets. The table presents the results in terms of AUC and F1 scores, with each cell showing the average score from five independent runs and the corresponding standard deviation. The best score for each dataset is bolded, and the second-best score is underlined.

Appendix E Implementation

Dataset Descriptions

As detailed in Table 6, we conducted extensive experiments on two types of disease datasets to validate the generalizability of MEATRD:
Breast cancer: The breast cancer ST datasets used in this study include eight reference 10x Visium datasets (Kumar et al. 2023) derived from healthy human breast tissues, denoted as 10x-hNB-{v03-v10} (https://cellxgene.cziscience.com/collections/4195ab4c-20bd-4cd3-8b3d-65601277e731), and nine target 10x Visium datasets (Andersson et al. 2021) derived from human breast cancer tissues, denoted as 10x-hBC-{A1-I1} (https://github.com/almaan/her2st and https://zenodo.org/records/10437391).
Primary sclerosing cholangitis (PSC) (https://cellxgene.cziscience.com/collections/0c8a364b-97b5-4cc8-a593-23c38c6f0ac5): The reference 10x Visium datasets, denoted as 10x-hLiver-{A1-D1}, comprise four healthy human liver slices, and the target 10x Visium datasets, denoted as 10x-PSC-{A1-D1}, are collected from four primary sclerosing cholangitis slices.
The reference datasets are collectively used during training, and the target datasets are used during inference only.

Data Preprocessing

For each ST dataset, genes detected in fewer than 10 spots are excluded. Then, raw gene expression counts are normalized by library size and log-transformed. 3000 highly variable genes (HVGs) are selected as inputs to the model using the SCANPY package (Wolf, Angerer, and Theis 2018).
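
These preprocessing steps correspond to standard SCANPY calls; the sketch below assumes a 10x Visium dataset loaded into an AnnData object, with the data path being a placeholder.

```python
import scanpy as sc

# Load one 10x Visium ST dataset (placeholder path) into an AnnData object: spots x genes
adata = sc.read_visium("path/to/visium_dir")
adata.var_names_make_unique()

sc.pp.filter_genes(adata, min_cells=10)            # drop genes detected in fewer than 10 spots
sc.pp.normalize_total(adata)                       # library-size normalization per spot
sc.pp.log1p(adata)                                 # log-transform the normalized counts
sc.pp.highly_variable_genes(adata, n_top_genes=3000, subset=True)  # keep 3000 HVGs as model input
```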

Implementation Details

MEATRD is implemented using PyTorch (Paszke et al. 2019). In Stage I, we adopt the default architecture of the Mobile-UNet with an output embedding dimension of 256. In Stage II, the gene encoder is a two-layer MLP, while the image encoder is the frozen Mobile-UNet encoder from Stage I. The MGDAT network includes three MGDAT blocks, each having a two-layer, four-headed transformer to generate 16-dimensional fused bottleneck embeddings in equation (5), and an attention layer with an input dimension of 272 and an output dimension of 256 in equation (8). The ResNet-based image decoder in this stage comprises eight residual blocks, while the gene decoder is a single-layer GNN with an output dimension of 3000. In Stage III, the image encoder is an eight-layer ResNet, while the gene encoder is consistent with that in Stage II. The FFN for multimodal data fusion is structured as a two-layer MLP with an output dimension of 256. In all stages, training is conducted with the Adam optimizer, a batch size of 128, and a learning rate of 1e-4. Stage I is trained for 30 epochs, Stage II for 10 epochs, and Stage III for 5 epochs. Finally, the weight parameters are set to α=0.5 and β=1. For all baselines but M3DM and Spatial-ID, we adopted the recommended or default settings in the original studies. M3DM, designed for natural images and point clouds, has an encoder unsuitable for ST data and histology images. Therefore, we replaced its encoders with MEATRD's encoders for a fair comparison. Since Spatial-ID is pretrained using single-cell sequencing (scRNA-seq) data, we skipped its pretraining step and directly utilized the pretrained model.

Appendix F Further Experiments

Supplement Results of Anomalous Tissue Region Detection

In this section, we present the performance of MEATRD in testing for primary sclerosing cholangitis. MEATRD is trained on four healthy human liver ST datasets (i.e., 10x-hLiver-{A1-D1}) and tested on four human PSC ST datasets (i.e., 10x-PSC-{A1-D1}). Table 7 highlights MEATRD's superiority over baseline models in detecting ATRs across datasets, ranking first in AUC scores on three of the four datasets and achieving the highest F1 scores on all of them. The experiment yields similar results in terms of AP scores, as shown in Table 3 in supplementary material F.

Ablation Studies

 Backbones Params AUC F1 Training time (min) Inference time (s)
  MobileUNet 46.17M 0.723 0.741 11.984 0.354
ResNet-18 53.71M 0.717 0.729 12.927 0.455
VGG-19 133.402M 0.696 0.703 19.887 0.489
  MoCo 53.71M 0.712 0.726 26.435 0.449
 
Table 8: Ablation study of backbones in MEATRD across eight human breast cancer datasets.
                                                                 Model Params MFlops Complexity Training time (s) Inference time (s) Memory Usage (GB)
     MEATRD Stage I 0.73M 32.746 𝒪(c1|V|)\mathcal{O}(c_{1}|V|) 100.02 - 1.70
Stage II 46.17M 405.909 𝒪(c2|V|+c3|E|+c4|D|2)\mathcal{O}(c_{2}|V|+c_{3}|E|+c_{4}|D|^{2}) 469.02 0.15 4.65
Stage III 5.22M 29.993 𝒪(c5|V|)\mathcal{O}(c_{5}|V|) 150.00 0.21 5.15
                                                                 M3DM 97.37M 8,028.937 - 468.00 20.75 1.60
SimpleNet 72.82M 238.954 - 0.72 7.72 0.57
f-AnoGAN 1.30M 118.496 - 1322.80 3.82 0.14
PatchSVDD 0.17M 1.083 - 5761.08 584.33 0.16
PREM 0.38M 0.768 - 326.58 0.01 0.02
DOMINANT 0.40M 0.396 - 0.83 0.01 11.34
Spatial-ID 4.36M 20.555 - 501.07 0.64 0.21
Scmap - - - 2.02 0.16 -
CAMLU 1.62M 1.619 - 125.57 0.64 -
 
Table 9: The overall training time on eight 10x-hNB datasets, including a total of 20,183 spots. Each spot is associated with a 3000-dimensional gene expression vector and a histology image patch of size 32x32.

Using multiple data modalities. In this evaluation, MEATRD is adapted to use either histology images or ST data alone, by skipping the multimodal data fusion step and omitting the branch for learning the other data modality. Using only histology images reduces MEATRD's average AUC scores by 31.26% and F1 scores by 26.59%, while using ST data alone lowers its average AUC scores by 12.72% and F1 scores by 9.99%. These findings corroborate the synergistic effect of histology images and ST data in enhancing ATR detection.
Multimodal data fusion using fused bottleneck embedding. To assess the impact of multimodal data fusion on ATR detection, we substitute the cross-modal bottleneck embedding-guided fusion with a direct concatenation of image and gene embeddings. The comparison reveals that using bottleneck embedding for multimodal data fusion contributes to an average increase of 13.15% in AUC scores and 7.55% in F1 scores over the simple concatenation method. This improvement underscores our approach’s efficacy in enhancing multimodal embeddings by collating and condensing the most relevant information from each data modality.
Masking for target node reconstruction. Introducing target-node-masking, which strategically omits self-information during the target node reconstruction, theoretically mitigates the model's over-generalization issue: it prevents the model from "learning too well" to replicate its input, which would otherwise lead to minimal reconstruction errors even for anomalies. To validate the effectiveness of this technique, we compare the latent multimodal reconstruction errors, defined in equation (13), with and without target-node-masking during inter-node message passing. The violin plots in Figure 5 show that this masking not only increases reconstruction errors for both inliers and anomalies but also amplifies the discrepancy between their reconstruction errors. This enhanced discrepancy in turn aids MEATRD's discriminative model in Stage III to separate anomalies from inliers, contributing to an average increase of 10.38% in AUC scores and 6.01% in F1 scores when compared to the omission of target-node-masking.

Refer to caption
Figure 5: Violin plots illustrating the distributions of latent multimodal reconstruction errors for inliers (blue) and anomalies (yellow) with (“Full”) and without (“w/o TNM”) the implementation of target node masking (TNM).

Multimodal reconstruction losses in one-class classifier. Most one-class classification methods directly utilize inlier embeddings in the reference to determine the normal data distribution and thus rely heavily on the quality of instance embeddings, which, however, is sensitive to batch effects across datasets (Ouardini et al. 2019). Here, we investigate whether using latent multimodal reconstruction losses, which avoid cross-batch comparisons, as an alternative input to MEATRD's one-class classifier can improve the accuracy of ATR detection. For this purpose, instance embeddings generated by the MGDAT network instead of the latent reconstruction losses are input to the one-class classifier in Stage III. We find that this modification reduces MEATRD's performance, with an average decline of 11.20% in AUC scores and 7.56% in F1 scores, demonstrating the necessity of using latent reconstruction losses in this context.
One-class classifier. Finally, to assess the effect of one-class classification in ATR detection, we remove the entire Stage III and use the weighted sum of image and gene reconstruction errors, defined in equation (10) in the main text, as anomaly scores. The direct use of reconstruction errors leads to suboptimal performance, as indicated by an average decrease of 19.23% in AUC scores and 14.84% in F1 scores. This finding suggests that collapsing the multimodal reconstruction losses of inliers into a compact hypersphere in the latent space boosts the separation of inliers and anomalies, mitigating model over-generalization.
Mobile-UNet as the pretrained visual feature extractor. Here, we replace Mobile-UNet with three pretrained visual feature extractors widely used for natural images, namely VGG-19 (Simonyan and Zisserman 2014), ResNet-18 (He et al. 2016), and MoCo (He et al. 2020), to extract visual features from histology images of the eight human breast cancer datasets in Stage I. As shown in Table 8, MEATRD's performance declines with these networks, as indicated by lower AUC and F1 scores. This is likely because tissue images contain complex patterns and features specific to biological structures, which may not be effectively captured by networks optimized for natural image recognition. In particular, data augmentation techniques, e.g., blurring and resizing, used by contrastive learning approaches can generate "positive" images whose semantics significantly deviate from the original image.

Sensitivity Analysis

Table 2 shows the average model performance over five independent runs across the eight 10x-hBC datasets. We observe that heavily weighting image data (e.g., α=0.9 or β=0.9) compromises model performance due to inadequate utilization of gene information for ATRs that are visually indistinguishable in histology images. On the other hand, overly weighting ST data (α=0.1 or β=0.1) also reduces performance, though less severely than overweighting image data, likely owing to the higher signal-to-noise ratio of ST data. Additionally, optimal model performance is achieved with a small bottleneck embedding dimension (e.g., 16), aligning with our theoretical analysis that nuisance information is minimized in the condensed bottleneck embedding. In contrast, a larger dimension for the one-class classifier (e.g., 256) is beneficial, providing more flexibility for collapsing inlier embeddings into a hypersphere. The best results are obtained with three MGDAT layers, balancing message passing within the graph against over-smoothing. Lastly, variations in the number of attention heads in MGDAT and the visual feature dimension have a relatively minor impact on model performance.

Complexity Analysis

In this section, we first theoretically analyze MEATRD’s model complexity. As shown in Table 9, Stage I is built on a 36-layer CNN network comprising lightweight inverted residual blocks, with 0.73M parameters and a time complexity of 𝒪(c1|V|)\mathcal{O}(c_{1}|V|) (He and Sun 2015), where |V||V| denotes the number of instances. In Stage II, the main complexity arises from the MGDAT blocks, which compute feature-level and node-level attentions with complexities of 𝒪(c2|V|)\mathcal{O}(c_{2}|V|) and 𝒪(c3|E|)\mathcal{O}(c_{3}|E|), respectively  (Veličković et al. 2018). Here, |E||E| denotes the number of graph edges and |E||V|2|E|\ll|V|^{2}. This stage has 46.17M parameters. Stage III mainly involves the lightweight ResNet, which has a time complexity of 𝒪(c5|V|)\mathcal{O}(c_{5}|V|)  (He and Sun 2015), and it has 5.22M parameters.

Empirical training on 20,183 image patches shows that Stage I has 32.746 MFlops and a training time of 100.02 s, Stage II has 405.909 MFlops and a training time of 469.02 s, and Stage III has 29.993 MFlops and a training time of 150 s. Inference time is negligible compared to training time. MEATRD's total time cost is comparable to that of the well-established methods M3DM and Spatial-ID.

Robustness to Noisy Data

The quality of ST data is crucial for the model's performance, and β can be used to balance the influence of the two data sources accordingly. Specifically, when the quality of ST data is low relative to that of image data, we set a higher β to increase the model's reliance on image data, and vice versa. In our experiments with human breast tissue datasets (10x-hNB), where image and ST data have comparable quality, we set β to 0.5. Typically, ST data quality is assessed using the average location-wise zero proportion \bar{z} (Zhu et al. 2023), i.e., the average proportion of zero-read-count genes per spot, with lower values indicating higher signal-to-noise ratios. To explore a more systematic setting of β, we altered \bar{z} of the 10x-hNB datasets by randomly masking gene read counts and tested various β values, as shown in the table below:

\bar{z} \ β   0.1     0.3     0.5     0.7     0.9
0.925        0.699   0.713   0.723   0.707   0.654
0.950        0.655   0.696   0.712   0.701   0.657
0.975        0.644   0.652   0.658   0.697   0.655

We find that, when \bar{z}\leq 0.95, the default setting β=0.5 consistently yields the best results; only in the extreme case of \bar{z}=0.975, where the gene expression data are too noisy, is it necessary to set β=0.7. Considering this relationship, we also provide an adaptive strategy that sets β as a heuristic function of \bar{z}:

β={0.5,z¯0.950.5+0.5sigmoid(200(z¯0.975)),z¯>0.95\beta=\begin{cases}0.5,&\bar{z}\leq 0.95\\ 0.5+0.5\textrm{sigmoid}\left(200(\bar{z}-0.975)\right),&\bar{z}>0.95\end{cases}
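
For reference, a small Python sketch of this adaptive strategy is given below; the function names are illustrative, and the read-count matrix X is assumed to be a spots-by-genes array.

```python
import numpy as np

def mean_zero_proportion(X):
    """Average location-wise zero proportion z_bar of an N x G read-count matrix
    (proportion of zero-read-count genes per spot, averaged over spots)."""
    return float(np.mean((X == 0).mean(axis=1)))

def adaptive_beta(z_bar):
    """Heuristic mapping from ST data quality (z_bar) to the image weight beta,
    following the piecewise function above."""
    if z_bar <= 0.95:
        return 0.5
    return 0.5 + 0.5 / (1.0 + np.exp(-200.0 * (z_bar - 0.975)))  # 0.5 + 0.5 * sigmoid(200 (z_bar - 0.975))
```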