
MEATRD: Multimodal Anomalous Tissue Region Detection
Enhanced with Spatial Transcriptomics

Kaichen Xu1\equalcontrib, Qilong Wu 1\equalcontrib, Yan Lu 1, Yinan Zheng 1, Wenlin Li 1, Xingjie Tang 1, Jun Wang 2, Xiaobo Sun 1 Corresponding author.
Abstract

The detection of anomalous tissue regions (ATRs) within affected tissues is crucial in clinical diagnosis and pathological studies. Conventional automated ATR detection methods, primarily based on histology images alone, falter in cases where ATRs and normal tissues have subtle visual differences. The recent spatial transcriptomics (ST) technology profiles gene expressions across tissue regions, offering a molecular perspective for detecting ATRs. However, there is a dearth of ATR detection methods that effectively harness complementary information from both histology images and ST. To address this gap, we propose MEATRD, a novel ATR detection method that integrates histology image and ST data. MEATRD is trained to reconstruct image patches and gene expression profiles of normal tissue spots (inliers) from their multimodal embeddings, followed by learning a one-class classification AD model based on latent multimodal reconstruction errors. This strategy harmonizes the strengths of reconstruction-based and one-class classification approaches. At the heart of MEATRD is an innovative masked graph dual-attention transformer (MGDAT) network, which not only facilitates cross-modality and cross-node information sharing but also addresses the model over-generalization issue commonly seen in reconstruction-based AD methods. Additionally, we demonstrate that modality-specific, task-relevant information is collated and condensed in multimodal bottleneck encoding generated in MGDAT, marking the first theoretical analysis of the informational properties of multimodal bottleneck encoding. Extensive evaluations across eight real ST datasets reveal MEATRD’s superior performance in ATR detection, surpassing various state-of-the-art AD methods. Remarkably, MEATRD also proves adept at discerning ATRs that only show slight visual deviations from normal tissues. Our code is available at https://github.com/wqlzuel/MEATRD.

Introduction

Detecting anomalous tissue regions (ATRs) within tissues from affected individuals is essential in clinical diagnostics, pathological studies, and targeted therapies (Srinidhi, Ciga, and Martel 2021). Traditionally, automated ATR detection, which typically applies computer vision techniques to histology images, e.g., whole-slide images (WSI) stained with hematoxylin and eosin (H&E) (Zingman et al. 2023), is a specialized task of image anomaly detection (AD). However, histology images, unlike natural images (e.g., those in the ImageNet dataset) (Bergmann et al. 2019), present unique challenges for AD due to their inherent high complexity (Zehnder et al. 2022), subtle differences between ATRs and normal tissues (Shenkar and Wolf 2022), the diverse manifestations of ATRs (Komura and Ishikawa 2018), and variability in staining quality (Zingman et al. 2023). These complexities demand information beyond visual cues for accurate ATR detection.

Spatial transcriptomics (ST) meets this need by providing spatial gene expression data. To date, a total of 1,033 publicly available human ST datasets span 56 diseases and 35 tissues, providing a rich resource for investigating ATRs at the molecular level (Wang et al. 2024). A typical ST dataset is structured as a matrix $\mathbf{X}\in\mathbb{R}^{N\times G}$, where $\mathbf{X}_{i,j}$ denotes the expression read counts of the $j$-th gene mapped to the $i$-th tissue spot. As illustrated in Figure 1, these spots, ranging in size from 10 to 200 $\mu$m depending on the sequencing technology, are spatially arranged in arrays to cover the entire tissue slice (Hu et al. 2023), thereby characterizing gene expression profiles across the tissue. This molecular-level data can significantly aid ATR detection, especially in cases where ATRs are visually similar to normal tissues (Hu et al. 2021). However, due to limitations inherent to sequencing technology, ST data suffer from severe noise and substantial missing values in gene expression measurements (Wang et al. 2022), compromising the precision with which tissue regions can be demarcated (Wang, Maletic-Savatic, and Liu 2022). Integrating histology images with ST data presents a promising solution to these challenges. As illustrated in our toy example in Figure 1, the blank spots in the ST dataset's spatial map, which represent tumor core locations with missing gene expression data, are visually identifiable in the accompanying histology image. Conversely, the tumor edge region, which may not be easily distinguishable from normal tissues visually, is detectable in the ST data. Therefore, the information from the two modalities can complement each other, greatly enhancing the precision of ATR detection. Fortunately, ST technologies like 10x Visium (Moses and Pachter 2022) provide accompanying histology images, allowing concurrent analysis of visual and genetic information for ATR detection.
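For readers who want to inspect such data directly, below is a minimal sketch (not part of MEATRD) of loading a 10x Visium dataset with the Scanpy library (Wolf, Angerer, and Theis 2018); the dataset path is a placeholder.

```python
import scanpy as sc

# Load a 10x Visium dataset (placeholder path); the result is an AnnData object
# whose .X matrix corresponds to the N x G expression matrix described above.
adata = sc.read_visium("path/to/visium_dataset")

n_spots, n_genes = adata.shape        # N tissue spots, G genes
coords = adata.obsm["spatial"]        # (N, 2) spot coordinates on the slide
# The accompanying H&E image is stored under adata.uns["spatial"]; the exact key
# depends on the dataset's library id.
print(n_spots, n_genes, coords.shape)
```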

Refer to caption
Figure 1: Detecting ATRs with histology images and ST data. ATRs include both tumor core and edge regions, as delineated by red and blue outlines in the histology image, respectively. The tumor edge region visually resembles the adjacent normal tissues. In the spatial map of the ST dataset, the ATRs encompass both red and blank spots, with blank spots indicating locations of missing gene expression data.

Given the rarity and unpredictable heterogeneity of anomalies, AD in images is often framed as an unsupervised learning problem, where anomalies are not known a priori. Models are trained exclusively on reference datasets comprising inliers to understand "normality" at training time and identify deviations from this norm as anomalies at inference time (Liu et al. 2023; Bergmann et al. 2019). Contemporary image AD methods, which use deep learning to learn initial representations of normal images (Shvetsova et al. 2021; Liu et al. 2023), often involve an encoder pre-trained on large natural image datasets (Deng and Li 2022; Roth et al. 2022). These representations are then used either to model the inlier distribution in latent space, as seen in one-class classification methods (Ruff et al. 2018), or to reconstruct inliers in reconstruction-based methods (Schlegl et al. 2019). Instances in the target dataset that exhibit low probability under the inlier distribution or larger-than-expected reconstruction errors are deemed anomalous.

Despite the successes of these methods in areas such as manufacturing defect inspection and financial fraud detection (Sohn et al. 2020), the unique challenges posed by ATR detection require more specialized methods (Riasatian et al. 2021; Tschuchnig and Gadermayr 2022). To meet this need, adaptations of image AD methods focus on representation learning and anomaly discrimination techniques. For example, image encoders pre-trained on natural images are replaced with those tailored for histology images, such as U-Net (Zehnder et al. 2022), DenseNet (Riasatian et al. 2021), and s2-AnoGAN (Pocevičiūtė, Eilertsen, and Lundström 2021). In addition, anomaly scoring is adapted to use perceptual loss instead of the pixel-wise reconstruction errors commonly used for natural images (Shvetsova et al. 2021; Zehnder et al. 2022). However, these methods may struggle when ATRs visually resemble normal tissues (Bejnordi et al. 2017). In contrast, ST differentiates tissue regions at the gene expression level (Hu et al. 2021; Dong and Zhang 2022), providing a remedy for ATR detection involving such complexities. Currently, Spatial-ID (Shen et al. 2022) is the only method that uses ST for ATR detection, employing a DNN classifier that assigns spots in the ST dataset to known regions while flagging those with uncertain assignments as anomalies. However, this classification-based approach can induce high false positive rates, as uncertainties in assignment could stem from similarities among normal tissues rather than the presence of ATRs (Li et al. 2022). Its sole reliance on ST data also makes it vulnerable to noise and dropouts in gene expression measurements, even for detecting visually recognizable ATRs.

In this study, we propose Multimodality Enhanced Anomalous Tissue Region Detection (MEATRD), the first method that integrates histology images and ST data for enhanced ATR detection. MEATRD conceptualizes tissue spots as nodes within an attributed graph, leveraging a reconstruction-based graph model for inlier node reconstruction from the dual perspectives of image and gene expression. During inference, the discrepancies between the reconstruction errors of inliers (i.e., normal tissues) and anomalies (i.e., ATRs) can be exploited by a discriminative model for accurate ATR detection. As shown in Figure 2, MEATRD involves three stages. Stage I focuses on extracting visual features of histology images. The histology image is segmented into patches centered around each spot, which are processed into imagery embeddings. Stage II aims to reconstruct the gene expression profiles and image patches of each spot from their fused embeddings, obtained using our innovative masked graph dual-attention transformer (MGDAT) network. MGDAT allows concurrent cross-node and cross-modal attention calculations, promoting efficient cross-modality information sharing and the incorporation of spatial relationships among spots. Additionally, to counter potential model over-generalization, a common pitfall of reconstruction-based methods where anomalies might also yield low reconstruction errors (Liu et al. 2023; Ristea et al. 2022), we employ a node-feature masking strategy, which forces the model to rely more on the surrounding context and cross-modal information. Stage III focuses on learning a one-class classification model to identify anomalies. Unlike existing one-class classification AD methods that use instance deep embeddings and are prone to reference-target domain shifts (Ouardini et al. 2019), our model pioneers the use of domain-shift-robust latent multimodal reconstruction losses (Donahue, Krähenbühl, and Darrell 2016; Schlegl et al. 2019) for more reliable anomaly detection. By collapsing inliers' reconstruction losses into a compact hypersphere, our model increases the reconstruction error discrepancy between inliers and anomalies, thereby further mitigating model over-generalization. In summary, our main contributions include:

  • We propose MEATRD, a pioneering multimodal method that integrates spatial transcriptomics with histology images for enhanced ATR detection.

  • MEATRD simultaneously addresses the over-generalization in reconstruction-based AD methods and the domain shift issue in one-class classification, leading to significant performance improvement.

  • We design an MGDAT network as the core component of MEATRD to facilitate cross-modality and cross-node information exchange while ameliorating model over-generalization. We also demonstrate the theoretical foundation for this information exchange, which is grounded in MGDAT’s ability to generate inclusive and condensed encoding of modality-specific, task-relevant information (supplementary material D).

  • Extensive benchmarks on eight breast cancer ST datasets demonstrate MEATRD’s superiority over nine state-of-the-art (SOTA) AD methods in accurately detecting ATRs, including those with subtle visual deviations from surrounding normal tissues.

Preliminary

Refer to caption
Figure 2: The workflow of MEATRD.
Definition D.1.

ST Dataset and Associated Histology Image. Let $\bm{X}\in\mathbb{R}^{N\times G}$ be an ST dataset, where $N$ is the number of tissue spots and $G$ is the number of genes. $S_N$ and $S_G$ denote the sets of spots and genes, respectively. $\bm{X}_{i,j}$ represents the read counts of gene $j$ at spot $i$, and $\bm{x}_i\in\mathbb{R}^{G}$ represents the gene expression profile at spot $i$. Let $\bm{P}\in\mathbb{R}^{H\times W\times C}$ be the associated histology image, where $H$, $W$, and $C$ are the height, width, and number of channels, respectively.

Definition D.2.

Graph Representation of ST Dataset and Histology Image. For a given ST dataset $\bm{X}$, the associated histology image $\bm{P}$ is segmented into $N$ patches, where $\bm{P}_i\in\mathbb{R}^{h\times w\times C}$ denotes the patch centered around spot $i\in S_N$, with height $h$ and width $w$. The spots are then modeled as nodes on an unweighted, attributed graph $G(S_N, A, \mathcal{\bm{Z}})$, where $A\in\{0,1\}^{N\times N}$ is the adjacency matrix and $\mathcal{\bm{Z}}\coloneqq[\mathcal{\bm{Z}}_{img}\,||\,\mathcal{\bm{Z}}_{gene}]$ is the node feature matrix. $\mathcal{\bm{Z}}_{img}\in\mathbb{R}^{N\times D}$ and $\mathcal{\bm{Z}}_{gene}\in\mathbb{R}^{N\times D}$ are the embeddings of image patches and gene expression profiles of spots, respectively. $A(i,j)=1$ if node $j\in n(i)$, where $n(i)$ is the set of $k$-nearest neighbors of node $i$, and $A(i,j)=0$ otherwise. $k$ is typically set to six due to the hexagonal arrangement of spots (Xu et al. 2024).
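As an illustration of Definition D.2, the adjacency matrix $A$ can be assembled from spot coordinates with a simple $k$-nearest-neighbor search; the following is a minimal sketch (the function name is ours, not from the paper).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_spot_graph(coords: np.ndarray, k: int = 6) -> np.ndarray:
    """Build the unweighted adjacency matrix A from (N, 2) spot coordinates.

    Each spot is connected to its k nearest neighbors; k = 6 matches the
    hexagonal arrangement of Visium spots."""
    n = coords.shape[0]
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(coords)  # k + 1: each spot is its own nearest neighbor
    _, idx = nbrs.kneighbors(coords)

    adj = np.zeros((n, n), dtype=np.uint8)
    for i in range(n):
        adj[i, idx[i, 1:]] = 1   # skip the self-neighbor in column 0
    return adj

# Toy example with 100 random spot locations.
coords = np.random.rand(100, 2)
A = build_spot_graph(coords, k=6)
print(A.sum(axis=1)[:5])  # each spot has 6 outgoing edges
```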

Definition D.3.

Problem Definition. Let $\mathcal{X}$ and $\mathcal{P}$ denote the target ST dataset and its associated histology image, respectively. Similarly, let $\bm{X}$ and $\bm{P}$ denote the reference ST dataset and its associated histology image. We define $y_i\in\{0,1\}$ as the label for spot $i$, where $y_i=1$ indicates an anomalous spot and $y_i=0$ otherwise. Note that $y_i=0,\ \forall i\in\bm{X}$, whereas $y_j\in\{0,1\},\ \forall j\in\mathcal{X}$. The task of ATR detection is defined as identifying the subset of anomalous spots within the target dataset, $\mathbb{S}=\{\mathcal{x}_i\,|\,y_{\mathcal{x}_i}=1,\ \mathcal{x}_i\in\mathcal{X}\}$, using a model trained exclusively on $\bm{X}$ and $\bm{P}$. (Related work is in supplementary material A due to space limitations.)

Method

As shown in Figure 2, the workflow of MEATRD comprises three stages: Stage I extracts visual features from the histology image patch of each spot; Stage II learns to reconstruct image patches and gene expression profiles from multimodal embeddings generated by an MGDAT network; Stage III trains an anomaly discriminator based on latent multimodal reconstruction errors.

Extracting Visual Features of Histology Image Patches (Stage I)

Initially, whole slide images are segmented into 32x32 patches centered around each spot in the ST dataset (Zong et al. 2022). The visual manifolds of these image patches are obtained using a Mobile-Unet, with an encoder consisting of downsampling convolutional layers and inverted residual blocks. Its decoder comprises upsampling deconvolutional layers and inverted residual blocks, connected to the encoder via shortcut connections.

This design not only inherits the merits of U-Net in extracting visual features from histology images but also boosts computational efficiency by reducing the model's parameters. Given a histology image patch $\bm{P}_i$ for spot $i\in S_N$, the Mobile-Unet is pretrained to reconstruct it as $\widehat{\bm{P}}_i$, with a pretraining loss that is a mix of a perceptual loss, based on the Structural Similarity Index (SSIM), and an $L_1$ reconstruction loss:

\widehat{\bm{P}}_{i} \coloneqq D_{1}(E_{1}(\bm{P}_{i})),\quad \bm{z}_{i}\in\mathbb{R}^{D} \coloneqq E_{1}(\bm{P}_{i})   (1)
\mathcal{L}_{perc} = -\mathrm{SSIM}(\bm{P}_{i},\widehat{\bm{P}}_{i}),\quad \mathcal{L}_{1} = ||\bm{P}_{i}-\widehat{\bm{P}}_{i}||_{1}   (2)
\mathrm{SSIM}(\bm{X},\bm{Y}) = \frac{(2\mu_{\bm{X}}\mu_{\bm{Y}}+C_{1})(2\sigma_{\bm{X},\bm{Y}}+C_{2})}{(\mu_{\bm{X}}^{2}+\mu_{\bm{Y}}^{2}+C_{1})(\sigma_{\bm{X}}^{2}+\sigma_{\bm{Y}}^{2}+C_{2})}   (3)
\mathcal{L}_{pretrain} = \mathcal{L}_{perc} + \mathcal{L}_{1}.   (4)

where $\mu_{*}$ and $\sigma_{*}^{2}$ are the average intensity and variance of $*\in\{\bm{X},\bm{Y}\}$, respectively, and $\sigma_{\bm{X},\bm{Y}}$ is their covariance. $C_1$ and $C_2$ are two constants that stabilize the division when the denominator is weak. $\mathrm{SSIM}$ and $\mathcal{L}_{1}$ measure the structural similarities and pixel-by-pixel discrepancies between the original and reconstructed images, respectively. The pretraining loss thus enhances the representation learning of complex histology images by accounting for both contextual integrity, via $\mathcal{L}_{perc}$, and local details, via $\mathcal{L}_{1}$ (Okada and Taniguchi 2021). After training, $E_1$ is used to yield image patch embeddings for each spot $i\in S_N$. Finally, unlike complex tissue images, which must first be converted into semantically meaningful representations, gene data have much clearer semantics. Therefore, MEATRD does not require a pretext representation learning stage for gene data. Instead, we use a two-layer MLP in Stage II to rasterize gene data before feeding them into the MGDAT blocks, where graph-based gene encoding takes place.
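For concreteness, the Stage I pretraining objective (Equations 1-4) can be sketched as follows. This is a minimal PyTorch example, not the released implementation: it assumes patch intensities in [0, 1], uses the common SSIM constants $C_1=0.01^2$ and $C_2=0.03^2$ (the paper does not specify their values), computes global rather than windowed SSIM statistics, and mean-reduces the $L_1$ term.

```python
import torch

def ssim(x: torch.Tensor, y: torch.Tensor, c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Global SSIM between two image batches of shape (B, C, H, W), following Eq. (3)."""
    dims = (1, 2, 3)
    mu_x, mu_y = x.mean(dim=dims), y.mean(dim=dims)
    var_x, var_y = x.var(dim=dims, unbiased=False), y.var(dim=dims, unbiased=False)
    cov_xy = ((x - mu_x.view(-1, 1, 1, 1)) * (y - mu_y.view(-1, 1, 1, 1))).mean(dim=dims)
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return (num / den).mean()

def pretrain_loss(p: torch.Tensor, p_hat: torch.Tensor) -> torch.Tensor:
    """L_pretrain = L_perc + L_1 = -SSIM(P, P_hat) + ||P - P_hat||_1 (Eqs. 2 and 4)."""
    return -ssim(p, p_hat) + (p - p_hat).abs().mean()

# Toy usage with random 32x32 RGB patches standing in for P_i and its reconstruction.
p, p_hat = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)
print(pretrain_loss(p, p_hat))
```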

Masked Graph Dual-Attention Transformer Network (Stage II)

To generate information-rich multimodal spot embeddings for reconstruction, we fuse histology image patches and gene expression profiles while incorporating contextual inter-dependencies among spots to reveal their biological characteristics. This is achieved by modeling spots as nodes in an attributed graph $G(V, A, \mathcal{\bm{Z}})$, as described in Definition D.2, on top of which node representations are learned using an innovative masked graph attention network, termed MGDAT. This network, comprising a series of MGDAT blocks, allows information sharing across both data modalities and graph nodes. Within each MGDAT block, nodes to be reconstructed are masked before aggregating the fused gene and imagery attributes of their neighboring nodes via an attention-based mechanism.

Formally, let $G_i(V_i, A_i, \mathcal{\bm{Z}}_i)$ denote the subgraph of a target node $i$ that covers up to its 3-hop neighbors, where $V_i$, $A_i$, and $\mathcal{\bm{Z}}_i$ denote the node set, adjacency matrix, and node attribute matrix of $G_i$, respectively. We set the number of hops to three because using more hops would result in over-smoothing, while fewer hops would significantly limit the information spread in the graph. $\bm{z}_i\in\mathbb{R}^{D}$ represents node $i$'s imagery attribute derived from Stage I, and $\bm{\zeta}_i\in\mathbb{R}^{D}$ represents node $i$'s gene attribute rasterized from its gene expression vector $\bm{x}_i$ using a two-layer MLP. For the target node $i$, $\bm{z}_i$ and $\bm{\zeta}_i$ are substituted with learnable mask tokens $\bm{z}_{[M]}\in\mathbb{R}^{D}$ and $\bm{\zeta}_{[M]}\in\mathbb{R}^{D}$.

This target-node masking serves to prevent leakage of the target node's own information into the embedding used for its reconstruction, thus alleviating the potential model over-generalization issue. $G_i$ is processed by the MGDAT network through its series of MGDAT blocks. For the $l$-th block, $l\in\{0,1,2\}$, the inputs are the embeddings of the image patches, $\mathcal{\bm{Z}}_{img,i}^{(l)}\in\mathbb{R}^{|V_i|\times D}$, and of the gene expression profiles, $\mathcal{\bm{Z}}_{gene,i}^{(l)}\in\mathbb{R}^{|V_i|\times D}$, of $V_i$. The initial embeddings are defined as $\mathcal{\bm{Z}}_{img,i}^{(0)}\coloneqq[\bm{z}_1, .., \bm{z}_{[M]}, .., \bm{z}_{V_i}]^{\top}$ and $\mathcal{\bm{Z}}_{gene,i}^{(0)}\coloneqq[\bm{\zeta}_1, .., \bm{\zeta}_{[M]}, .., \bm{\zeta}_{V_i}]^{\top}$. The $l$-th MGDAT block yields fused bottleneck embeddings $\mathcal{\bm{Z}}_{fb,i}^{(l)}\in\mathbb{R}^{|V_i|\times D^{\prime}}$, $D^{\prime}\ll D$, as follows:

\mathcal{\bm{Z}}_{fb,i}^{(l)} = \mathrm{Trm}\left([\mathcal{\bm{Z}}_{img,i}^{(l)} || \mathcal{\bm{Z}}_{gene,i}^{(l)}];\ W_{Q}^{(l)}, W_{K}^{(l)}, W_{V}^{(l)}\right)   (5)

where $\mathrm{Trm}$ denotes a Transformer, and $W_{Q}^{(l)}, W_{K}^{(l)}, W_{V}^{(l)}\in\mathbb{R}^{2D\times D^{\prime}}$ are the query, key, and value parameters, respectively. $\mathcal{\bm{Z}}_{fb,i}^{(l)}$ serves as a bottleneck that collates and condenses modality-specific, task-relevant information from the image and ST data (Nagrani et al. 2021), as theoretically demonstrated in supplementary material D. By concatenating $\mathcal{\bm{Z}}_{fb,i}^{(l)}$ with $\mathcal{\bm{Z}}_{img,i}^{(l)}$ and $\mathcal{\bm{Z}}_{gene,i}^{(l)}$, the two data modalities are bridged, facilitating access to their complementary task-relevant information. Next, multimodal information of $l$-hop neighbors is aggregated as follows:

h_{*,i}^{(l)} = [\mathcal{\bm{Z}}_{*,i}^{(l)} || \mathcal{\bm{Z}}_{fb,i}^{(l)}], \quad \text{where}\ * \in \{img, gene\},   (6)
\alpha_{*,ij}^{(l)} = \frac{\exp\left(w_{att}^{(l)}\,\sigma\left(W^{(l)}[h_{*,i}^{(l)} || h_{*,j}^{(l)}]\right)\right)}{\sum_{k\in\mathcal{N}_{i}}\exp\left(w_{att}^{(l)}\,\sigma\left(W^{(l)}[h_{*,i}^{(l)} || h_{*,k}^{(l)}]\right)\right)},   (7)
\mathcal{\bm{Z}}_{*,i}^{(l+1)} = \sigma\left(\sum_{j\in\mathcal{N}_{i}}\alpha_{*,ij}^{(l)} W^{(l)} h_{*,j}^{(l)}\right),   (8)

where $\sigma$ denotes LeakyReLU, and $w_{att}^{(l)}\in\mathbb{R}^{D}$ and $W^{(l)}\in\mathbb{R}^{D\times(D+D^{\prime})}$ denote the attention weight vector and regular weight matrix of the $l$-th MGDAT block, respectively.
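A simplified PyTorch sketch of a single MGDAT block is given below. It is not the authors' implementation: the bottleneck of Equation 5 is produced by one TransformerEncoderLayer applied to the concatenated modality embeddings, the neighborhood aggregation of Equations 6-8 follows the standard GAT parameterization (Veličković et al. 2018), so parameter shapes may differ from the released code, and target-node masking is assumed to have been applied to the inputs beforehand.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGDATBlock(nn.Module):
    """One MGDAT block (sketch): bottleneck fusion (Eq. 5) followed by
    GAT-style neighborhood aggregation for each modality (Eqs. 6-8)."""

    def __init__(self, d: int, d_bottleneck: int, n_heads: int = 2):
        super().__init__()
        self.to_bottleneck = nn.Linear(2 * d, d_bottleneck)  # fuse [Z_img || Z_gene]
        self.trm = nn.TransformerEncoderLayer(d_model=d_bottleneck, nhead=n_heads, batch_first=True)
        # Per-modality GAT parameters: W projects [Z_* || Z_fb] to d dims, a scores node pairs.
        self.w = nn.ModuleDict({m: nn.Linear(d + d_bottleneck, d, bias=False) for m in ("img", "gene")})
        self.a = nn.ParameterDict({m: nn.Parameter(torch.randn(2 * d)) for m in ("img", "gene")})

    def forward(self, z_img, z_gene, adj):
        # z_img, z_gene: (n, d) node embeddings of the subgraph; adj: (n, n) 0/1 adjacency.
        z_fb = self.trm(self.to_bottleneck(torch.cat([z_img, z_gene], -1)).unsqueeze(0)).squeeze(0)  # Eq. 5
        out = {}
        for m, z in (("img", z_img), ("gene", z_gene)):
            h = self.w[m](torch.cat([z, z_fb], -1))                  # Eq. 6 plus projection
            n = h.size(0)
            pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                              h.unsqueeze(0).expand(n, n, -1)], -1)  # all [h_i || h_j] pairs
            scores = F.leaky_relu(pair @ self.a[m])                  # unnormalized attention (Eq. 7)
            scores = scores.masked_fill(adj == 0, float("-inf"))     # restrict to neighbors
            alpha = torch.softmax(scores, dim=-1)
            out[m] = F.leaky_relu(alpha @ h)                         # Eq. 8
        return out["img"], out["gene"], z_fb

# Toy usage on a fully connected 7-node subgraph.
A = torch.ones(7, 7); A.fill_diagonal_(0)
blk = MGDATBlock(d=32, d_bottleneck=16)
z_img_out, z_gene_out, z_fb = blk(torch.randn(7, 32), torch.randn(7, 32), A)
```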

The histology image patches of $V_i$ are reconstructed from the final image embeddings, $\mathcal{\bm{Z}}_{img,i}$, through a ResNet-based deconvolutional decoder $D_2$, while the gene expression profiles of $V_i$ are reconstructed from the final gene embeddings, $\mathcal{\bm{Z}}_{gene,i}$, through a single-layer GNN-based decoder $D_3$ (Hou et al. 2023):

\widetilde{\bm{P}}_{i} \coloneqq D_{2}(\mathcal{\bm{Z}}_{img,i}),\quad \widetilde{\bm{x}}_{i} \coloneqq D_{3}(\mathcal{\bm{Z}}_{gene,i})   (9)

The training loss of Stage II includes an image-level loss, identical in form to that defined in Equation 4, and a gene-level reconstruction loss measured in scaled cosine error:

\mathcal{L}_{rec} = \alpha\cdot\sum_{i}^{N}\left(-\mathrm{SSIM}(\bm{P}_{i},\widetilde{\bm{P}}_{i}) + ||\bm{P}_{i}-\widetilde{\bm{P}}_{i}||_{1}\right) + (1-\alpha)\cdot\sum_{i}^{N}\mathcal{L}_{SCE}(\bm{x}_{i},\widetilde{\bm{x}}_{i}),   (10)
\mathcal{L}_{SCE}(\bm{x}_{i},\widetilde{\bm{x}}_{i}) = \left(1-\frac{\bm{x}_{i}^{\top}\widetilde{\bm{x}}_{i}}{||\bm{x}_{i}||\cdot||\widetilde{\bm{x}}_{i}||}\right)^{\gamma},\quad \gamma\geq 1,   (11)

where $0<\alpha<1$ is the weight assigned to the image reconstruction loss and $\gamma$ is a scaling factor. The workflow of Stage II is illustrated in Figure 2 and Algorithm 1 in supplementary material C.
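The Stage II objective can then be assembled as in the sketch below, which reuses the ssim() helper from the Stage I example above; $\alpha=0.5$ and $\gamma=2$ are illustrative defaults (the paper only requires $0<\alpha<1$ and $\gamma\geq 1$), and both terms are mean-reduced rather than summed over spots.

```python
import torch
import torch.nn.functional as F

def sce_loss(x: torch.Tensor, x_rec: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """Scaled cosine error between original and reconstructed gene profiles (Eq. 11)."""
    cos = F.cosine_similarity(x, x_rec, dim=-1)
    return ((1.0 - cos) ** gamma).mean()

def stage2_loss(p, p_rec, x, x_rec, alpha: float = 0.5, gamma: float = 2.0) -> torch.Tensor:
    """L_rec = alpha * (image SSIM + L1 terms) + (1 - alpha) * SCE term (Eq. 10)."""
    img_term = -ssim(p, p_rec) + (p - p_rec).abs().mean()  # ssim() from the Stage I sketch
    return alpha * img_term + (1.0 - alpha) * sce_loss(x, x_rec, gamma)
```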

Methods compared (table columns): MEATRD and M3DM (multimodal); SimpleNet, f-AnoGAN, and Patch SVDD (image-based); DOMINANT, PREM, Spatial-ID, scmap, and CAMLU (ST-based).

| Target Dataset | Metric | MEATRD | M3DM | SimpleNet | f-AnoGAN | Patch SVDD | DOMINANT | PREM | Spatial-ID | scmap | CAMLU |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10x-hBC-A1 | AUC | **0.756±0.007** | 0.520±0.046 | 0.543±0.095 | *0.642±0.109* | 0.614±0.005 | 0.488±0.117 | 0.211±0.004 | 0.463±0.067 | 0.500±0.000 | 0.516±0.021 |
| 10x-hBC-A1 | F1 | 0.892±0.007 | 0.875±0.0013 | 0.875±0.011 | 0.892±0.017 | *0.892±0.005* | 0.885±0.017 | 0.865±0.000 | 0.870±0.004 | **0.934±0.000** | 0.376±0.383 |
| 10x-hBC-B1 | AUC | **0.920±0.028** | 0.505±0.029 | 0.554±0.135 | *0.736±0.144* | 0.442±0.025 | 0.698±0.077 | 0.288±0.006 | 0.195±0.083 | 0.504±0.000 | 0.667±0.160 |
| 10x-hBC-B1 | F1 | **0.841±0.022** | 0.210±0.027 | 0.302±0.127 | *0.568±0.176* | 0.225±0.025 | 0.352±0.143 | 0.073±0.008 | 0.067±0.064 | 0.354±0.000 | 0.428±0.365 |
| 10x-hBC-C1 | AUC | **0.715±0.017** | 0.540±0.034 | 0.501±0.099 | 0.485±0.035 | 0.401±0.0032 | 0.633±0.101 | 0.419±0.004 | 0.384±0.055 | 0.500±0.000 | *0.660±0.156* |
| 10x-hBC-C1 | F1 | **0.842±0.021** | 0.735±0.028 | 0.735±0.024 | 0.713±0.021 | 0.661±0.005 | 0.769±0.040 | 0.695±0.006 | 0.687±0.013 | *0.838±0.000* | 0.808±0.021 |
| 10x-hBC-D1 | AUC | **0.803±0.017** | 0.488±0.011 | 0.485±0.111 | 0.276±0.072 | 0.377±0.005 | 0.530±0.172 | 0.380±0.003 | 0.469±0.007 | 0.503±0.000 | *0.649±0.066* |
| 10x-hBC-D1 | F1 | **0.698±0.016** | 0.443±0.017 | 0.433±0.072 | 0.253±0.085 | 0.373±0.010 | 0.478±0.123 | 0.344±0.010 | 0.410±0.011 | *0.626±0.000* | 0.465±0.158 |
| 10x-hBC-E1 | AUC | **0.553±0.046** | *0.536±0.014* | 0.465±0.119 | 0.369±0.034 | 0.300±0.009 | 0.475±0.083 | 0.429±0.006 | 0.449±0.082 | 0.500±0.000 | 0.405±0.047 |
| 10x-hBC-E1 | F1 | **0.739±0.029** | 0.598±0.009 | 0.542±0.077 | 0.492±0.021 | 0.443±0.006 | 0.570±0.058 | 0.528±0.008 | 0.542±0.054 | *0.734±0.000* | 0.081±0.095 |
| 10x-hBC-F1 | AUC | **0.667±0.009** | 0.485±0.046 | 0.476±0.017 | 0.493±0.011 | 0.483±0.005 | 0.477±0.074 | 0.379±0.004 | 0.380±0.074 | *0.500±0.000* | 0.409±0.051 |
| 10x-hBC-F1 | F1 | *0.858±0.003* | 0.832±0.009 | 0.835±0.002 | 0.842±0.004 | 0.840±0.003 | 0.834±0.018 | 0.825±0.001 | 0.820±0.005 | **0.910±0.000** | 0.036±0.022 |
| 10x-hBC-G2 | AUC | **0.640±0.079** | 0.524±0.016 | 0.482±0.074 | 0.457±0.016 | 0.430±0.008 | *0.576±0.107* | 0.430±0.006 | 0.312±0.024 | 0.500±0.000 | 0.518±0.001 |
| 10x-hBC-G2 | F1 | **0.544±0.045** | 0.366±0.016 | 0.333±0.068 | 0.295±0.002 | 0.294±0.018 | 0.434±0.095 | 0.273±0.006 | 0.214±0.029 | *0.510±0.000* | 0.070±0.005 |
| 10x-hBC-H1 | AUC | **0.732±0.064** | 0.474±0.023 | 0.443±0.099 | *0.625±0.083* | 0.415±0.005 | 0.521±0.105 | 0.370±0.009 | 0.319±0.061 | 0.500±0.000 | 0.515±0.010 |
| 10x-hBC-H1 | F1 | **0.516±0.029** | 0.273±0.029 | 0.186±0.080 | 0.359±0.080 | 0.066±0.003 | 0.297±0.060 | 0.209±0.018 | 0.179±0.038 | *0.467±0.000* | 0.418±0.113 |
| Mean | AUC | **0.723** | 0.509 | 0.494 | 0.510 | 0.433 | *0.550* | 0.363 | 0.371 | 0.501 | 0.542 |
| Mean | F1 | **0.741** | 0.542 | 0.530 | 0.552 | 0.474 | 0.577 | 0.476 | 0.474 | *0.672* | 0.335 |
Table 1: Performance evaluation of anomalous tissue region detection across eight human breast cancer ST datasets. The table presents the results in terms of AUC and F1 scores, with each cell showing the average score from five independent runs and the corresponding standard deviation. The best score for each dataset is shown in bold, and the second-best in italics.

Latent Multimodal Reconstruction Loss-based Anomaly Discriminator (Stage III)

Following Stage II, the original and reconstructed image patches of any spot $i$ are processed by a ResNet to generate their respective latent manifolds, denoted as $e_{img,i}\coloneqq\mathrm{ResNet}(\bm{P}_i)$ and $\tilde{e}_{img,i}\coloneqq\mathrm{ResNet}(\widetilde{\bm{P}}_i)$, respectively. Here, we employ a lightweight ResNet as the encoder, since this stage focuses on calculating a latent loss rather than the more involved tissue image reconstruction task. Similarly, the manifolds of the original and reconstructed gene expression profiles of spot $i$ are generated by an MLP, denoted as $e_{gene,i}\coloneqq\mathrm{MLP}(\bm{x}_i)$ and $\tilde{e}_{gene,i}\coloneqq\mathrm{MLP}(\widetilde{\bm{x}}_i)$, respectively. Next, these manifolds are normalized, and a feed-forward network (FFN) maps their weighted averages to a latent space where the multimodal reconstruction error, $\ell_{rec,i}$, is calculated as follows:

\bm{Z}_{fused,i} = \mathrm{FFN}\left(\beta\cdot\frac{e_{img,i}}{||e_{img,i}||} + (1-\beta)\cdot\frac{e_{gene,i}}{||e_{gene,i}||}\right)   (12)
\widetilde{\bm{Z}}_{fused,i} = \mathrm{FFN}\left(\beta\cdot\frac{\tilde{e}_{img,i}}{||\tilde{e}_{img,i}||} + (1-\beta)\cdot\frac{\tilde{e}_{gene,i}}{||\tilde{e}_{gene,i}||}\right)   (13)
\ell_{rec,i} = \bm{Z}_{fused,i} - \widetilde{\bm{Z}}_{fused,i}   (14)

where $0<\beta<1$ represents the relative weight assigned to the histology image. We then train a one-class classifier to collapse the latent reconstruction errors of inliers into a compact hypersphere using the loss function:

\mathcal{L}_{occ} = \left\|\ell_{rec,i} - c\right\|^{2}   (15)

where $c=\frac{1}{N}\sum_{k=1}^{N}\ell_{rec,k}$. The training workflow of Stage III is also illustrated in Algorithm 2 of supplementary material C. At inference time, the anomaly score (AS) of a query spot $j$ is computed as the distance of its latent reconstruction loss to $c$:

AS_{j} \coloneqq \left\|\ell_{rec,j} - c\right\|^{2}   (16)

Given the observation that a gap exists between anomaly scores of inliers and anomalies (Figure 1 in supplementary material B), the AS threshold for discriminating inliers and anomalies is automatically determined using a Maximum A Posteriori-Expectation-Maximization (MAP-EM)-based mixture model, as detailed in supplementary material B.
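A minimal sketch of the Stage III scoring and thresholding is shown below; it is not the released implementation, and a two-component Gaussian mixture fitted with scikit-learn stands in for the MAP-EM mixture model detailed in supplementary material B.

```python
import numpy as np
import torch
from sklearn.mixture import GaussianMixture

def anomaly_scores(err_latent: torch.Tensor, center: torch.Tensor) -> torch.Tensor:
    """Anomaly score of each spot: squared distance of its latent multimodal
    reconstruction error to the inlier center c (Eqs. 15-16)."""
    return ((err_latent - center) ** 2).sum(dim=-1)

def flag_anomalies(scores: np.ndarray) -> np.ndarray:
    """Split scores with a two-component mixture model; the component with the
    larger mean is treated as the anomalous one."""
    gmm = GaussianMixture(n_components=2, random_state=0).fit(scores.reshape(-1, 1))
    anomalous = int(np.argmax(gmm.means_.ravel()))
    return gmm.predict(scores.reshape(-1, 1)) == anomalous

# Toy usage: inlier errors define the center c, target spots are then scored.
ref_err = torch.randn(500, 64)     # latent reconstruction errors of reference (inlier) spots
c = ref_err.mean(dim=0)            # c = mean of inlier reconstruction errors
tgt_err = torch.randn(200, 64)
scores = anomaly_scores(tgt_err, c).numpy()
print(flag_anomalies(scores).sum(), "spots flagged as anomalous")
```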

Refer to caption
Figure 3: Visualized detection results of tumor edge regions that visually resemble the adjacent normal tissues in the 10x-hBC-I1 dataset. The first row, from left to right, displays the original histology image, the image annotated with ground truth region labels, the image highlighting the tumor edge region (in red) and the adjacent healthy region (in blue), and the image annotated with ATRs identified by DOMINANT. The second row presents images annotated with ATRs identified by the respective methods. The performance of each method is also quantified using mean precision and recall scores over five independent runs. These metrics, along with their standard deviations, are displayed to the right of each method's panel.

Experiments

Experimental Settings

Datasets. MEATRD is extensively evaluated across eight breast cancer datasets and four primary sclerosing cholangitis (PSC) datasets (see supplementary material E for data description and preprocessing).

Baselines. We select nine SOTA image-based, ST-based, and multimodal AD methods as baselines. Image-based methods include two one-class classification methods, Patch SVDD (Yi and Yoon 2020) and SimpleNet (Liu et al. 2023), alongside a reconstruction-based method, f-AnoGAN (Schlegl et al. 2019). ST-based methods include scmap (Kiselev, Yiu, and Hemberg 2018), a classification-based method; CAMLU (Li et al. 2022), a reconstruction-based method; PREM (Pan et al. 2023), a discriminative graph method; DOMINANT (Ding et al. 2019), a generative graph method; and Spatial-ID (Shen et al. 2022), a classification-based method tailored for ST data. Additionally, M3DM (Wang et al. 2023) is chosen as a representative multimodal baseline.

Evaluation Protocols. AUC and F1 scores are used to evaluate the accuracy of ATR detection. For a fair comparison, the F1 score is calculated with the threshold matching the actual proportion of true anomalies (Shenkar and Wolf 2021). Reported metrics are averaged over five independent runs and accompanied by their standard deviations.
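For reference, this F1 protocol can be sketched as follows (a minimal example; the helper name is ours): the threshold is set so that the number of flagged spots equals the number of true anomalies.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def f1_at_true_proportion(y_true: np.ndarray, scores: np.ndarray) -> float:
    """F1 computed at the threshold that flags as many spots as there are true anomalies."""
    n_anoms = max(int(y_true.sum()), 1)
    thresh = np.sort(scores)[::-1][n_anoms - 1]   # score of the n_anoms-th highest spot
    y_pred = (scores >= thresh).astype(int)
    return f1_score(y_true, y_pred)

# Toy usage with random labels and scores.
y = np.random.binomial(1, 0.2, size=300)
s = np.random.rand(300)
print(roc_auc_score(y, s), f1_at_true_proportion(y, s))
```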

Anomalous Tissue Region Detection

In this experiment, as listed in supplementary material F, MEATRD is trained on eight human normal breast ST datasets (i.e., 10x-hNB-{v03-v10}) and tested on eight human breast cancer ST datasets (i.e., 10x-hBC-{A1-H1}). Table 1 showcases MEATRD's superiority over baselines in detecting ATRs across datasets, consistently ranking first in AUC scores and six times in F1 scores. It outperforms the second-best performing method with an average leap of 17.45% in AUC scores and 10.31% in F1 scores. Furthermore, Table 4 in supplementary material F indicates that our model also performs well on the PSC datasets, demonstrating its generalization to other disease types. Generally, methods that use ST data, for example, DOMINANT, scmap, and CAMLU, tend to outperform those that rely solely on histology images, indicating the pivotal role of gene expression information provided by ST data in aiding the detection of ATRs, especially those visually similar to normal tissues. Moreover, we find that DOMINANT, a graph-based AD method, prevails over other baselines, and that M3DM, a multimodal method that utilizes both image and ST data yet fails to account for spatial relationships among spots, does not perform as well as MEATRD. These observations emphasize the value of spatial contextual information in accurate ATR detection.

Discovering Anomalous Tissue Regions Visually Similar to Normal Tissues

To evaluate the efficacy of MEATRD in detecting ATRs with minimal visual distinctions from normal tissues, we conduct a comparative analysis on the 10x-hBC-I1 ST dataset, which encompasses a tumor edge region that visually blends with the adjacent normal tissues, as indicated in red in the annotated histology image in Figure 3. Our analysis includes: the standard MEATRD implementation (MEATRD-standard); MEATRD-$\beta$, a variant that downplays the influence of the histology image by decreasing $\beta$ from 0.5 to 0.1 in Equations 12 and 13; DOMINANT, the top-performing ST-based baseline from the previous section; and two leading image-based AD methods, f-AnoGAN and SimpleNet. The results, visually presented in Figure 3, demonstrate that MEATRD-$\beta$ more accurately identifies spots within the tumor edge region as anomalous compared to the other competing methods. This finding is quantitatively supported by its highest precision (0.693) and recall (0.895) scores. The observation that MEATRD-standard, MEATRD-$\beta$, and DOMINANT prevail over the image-based AD methods underscores the value of using ST data for pinpointing pathogenic tissue regions that visually resemble normal tissues. Furthermore, DOMINANT's marginal performance edge over MEATRD-standard suggests that, in this specific context, the histology image contributes very limited additional information. Indeed, MEATRD-$\beta$, which places greater emphasis on ST data, improves on MEATRD-standard by 3.1% in precision and 34.4% in recall. Nonetheless, for scenarios involving low-quality ST data and visually traceable ATRs, incorporating visual cues from histology images is undoubtedly beneficial, as established in our prior analysis and ablation study.

(a)
| case | AUC | F1 |
|---|---|---|
| 0.9 | 0.678 | 0.696 |
| 0.5* | 0.723 | 0.741 |
| 0.1 | 0.709 | 0.725 |

(b)
| case | AUC | F1 |
|---|---|---|
| 0.9 | 0.654 | 0.668 |
| 0.5* | 0.723 | 0.741 |
| 0.1 | 0.699 | 0.718 |

(c)
| dim | AUC | F1 |
|---|---|---|
| 128 | 0.705 | 0.726 |
| 256* | 0.723 | 0.741 |
| 512 | 0.721 | 0.735 |

(d)
| dim | AUC | F1 |
|---|---|---|
| 16* | 0.723 | 0.741 |
| 64 | 0.715 | 0.728 |
| 256 | 0.682 | 0.711 |

(e)
| dim | AUC | F1 |
|---|---|---|
| 64 | 0.606 | 0.623 |
| 128 | 0.720 | 0.732 |
| 256* | 0.723 | 0.741 |

(f)
| blocks | AUC | F1 |
|---|---|---|
| 2 | 0.694 | 0.719 |
| 3* | 0.723 | 0.741 |
| 4 | 0.533 | 0.565 |

(g)
| case | AUC | F1 |
|---|---|---|
| 1 | 0.718 | 0.730 |
| 2* | 0.723 | 0.741 |
| 4 | 0.721 | 0.737 |
Table 2: Sensitivity analysis of hyperparameters in MEATRD across eight human breast cancer datasets. Default settings are marked with an asterisk (*).

Ablation Studies

We conduct ablation studies over the eight human breast cancer ST datasets (i.e., 10x-hBC-{A1-H1}) to investigate the effects of MEATRD's key components on ATR detection. These components include: using multiple data modalities; multimodal data fusion using the fused bottleneck embedding; masking for target node reconstruction; the latent multimodal reconstruction losses fed to the one-class classifier in Stage III; enlarging the anomaly score discrepancy between inliers and anomalies using a one-class classifier; and using Mobile-Unet as the pretraining backbone in Stage I. The results, detailed in the Ablation Studies section of supplementary material F, demonstrate that removing any of these components leads to suboptimal performance, owing to inefficient use of cross-modal complementary information, less effective mitigation of model over-generalization, and increased sensitivity to reference-target domain shifts.

| Metric | ST only | Image only | w/o MGDAT | w/o TNM | w/o RE | w/o OC | Full |
|---|---|---|---|---|---|---|---|
| AUC | 0.631 | 0.497 | 0.639 | 0.655 | 0.642 | 0.584 | 0.723 |
| F1 | 0.667 | 0.544 | 0.689 | 0.699 | 0.685 | 0.631 | 0.741 |
Table 3: Ablation study of key components in MEATRD across eight human breast cancer datasets. Method performance is gauged through average AUC and F1 scores. "Full" represents the complete MEATRD model. "ST only" and "Image only" utilize only ST data or histology images, respectively. "w/o MGDAT" omits the MGDAT block. "w/o TNM" omits the target-node-masking technique. "w/o RE" substitutes the latent multimodal reconstruction errors with direct spot embeddings as input to the discriminative model in Stage III. "w/o OC" discards the entire Stage III and utilizes spot reconstruction errors as anomaly scores for ATR detection.

Sensitivity Analysis

Here, we conduct sensitivity analyses on the eight 10x-hBC datasets to examine the effects of MEATRD's key hyperparameters, including $\alpha$ and $\beta$, which control the relative weights between the gene and image modalities in Stages II and III; the dimensions of the visual and gene embeddings from Stage I, the bottleneck embedding in Stage II, and the inputs to the one-class classifier in Stage III; and the number of MGDAT layers and attention heads. The effect of these hyperparameters on MEATRD's performance, measured by AUC and F1 scores, is presented in Table 2. Detailed results are provided in supplementary material F.3.

Complexity Analysis

We analyze the model complexity of MEATRD across its three stages by evaluating the number of parameters, computational performance (MFlops), time complexity, training time, and inference time. We also compare these metrics with the nine baseline methods. Detailed results are provided in supplementary material F.4. In summary, MEATRD is scalable to the number of spots and edges (proportional to the number of spots due to the adjacency matrix setting) and demonstrates good efficiency in our experiments.

Conclusion

In this paper, we propose MEATRD, a pioneering method that integrates histology images and ST data to enhance ATR detection at both the visual and molecular levels. MEATRD treats tissue spots as nodes in an attributed graph to embed spatial relationships into their representations. The MGDAT network, a key innovation of MEATRD, facilitates effective cross-node and cross-modality information exchange, enabling comprehensive graph representation learning. MEATRD harmonizes one-class classification with reconstruction-based AD, simultaneously addressing the challenges of reference-target domain shift and model over-generalization. Rigorous evaluations on a suite of real ST datasets have demonstrated MEATRD's superiority over various SOTA AD methods in detecting ATRs, including those that are visually akin to contextual normal tissues. Furthermore, MEATRD offers a framework generalizable to other multimodal AD tasks involving compatible imagery and graph data modalities.

Acknowledgments

The project is funded by the Excellent Young Scientist Fund of Wuhan City (Grant No. 21129040740) to X.S.

References

  • Andersson et al. (2021) Andersson, A.; Larsson, L.; Stenbeck, L.; et al. 2021. Spatial deconvolution of HER2-positive breast cancer delineates tumor-associated cell type interactions. Nat Commun, 12: 6012.
  • Arjovsky, Chintala, and Bottou (2017) Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In International conference on machine learning, 214–223. PMLR.
  • Bejnordi et al. (2017) Bejnordi, B. E.; Veta, M.; Van Diest, P. J.; Van Ginneken, B.; Karssemeijer, N.; Litjens, G.; Van Der Laak, J. A.; Hermsen, M.; Manson, Q. F.; Balkenhol, M.; et al. 2017. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. Jama, 318(22): 2199–2210.
  • Bergmann et al. (2019) Bergmann, P.; Fauser, M.; Sattlegger, D.; and Steger, C. 2019. MVTec AD – A Comprehensive Real-World Dataset for Unsupervised Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • Cover (1999) Cover, T. M. 1999. Elements of information theory. John Wiley & Sons.
  • Deng and Li (2022) Deng, H.; and Li, X. 2022. Anomaly detection via reverse distillation from one-class embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9737–9746.
  • Ding et al. (2019) Ding, K.; Li, J.; Bhanushali, R.; and Liu, H. 2019. Deep anomaly detection on attributed networks. In Proceedings of the 2019 SIAM International Conference on Data Mining, 594–602. SIAM.
  • Donahue, Krähenbühl, and Darrell (2016) Donahue, J.; Krähenbühl, P.; and Darrell, T. 2016. Adversarial feature learning. arXiv preprint arXiv:1605.09782.
  • Dong and Zhang (2022) Dong, K.; and Zhang, S. 2022. Deciphering spatial domains from spatially resolved transcriptomics with an adaptive graph attention auto-encoder. Nature communications, 13(1): 1739.
  • He et al. (2020) He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 9729–9738.
  • He and Sun (2015) He, K.; and Sun, J. 2015. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5353–5360.
  • He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
  • Hou et al. (2023) Hou, Z.; He, Y.; Cen, Y.; Liu, X.; Dong, Y.; Kharlamov, E.; and Tang, J. 2023. GraphMAE2: A Decoding-Enhanced Masked Self-Supervised Graph Learner. In Proceedings of the ACM Web Conference 2023, 737–746.
  • Hu et al. (2023) Hu, J.; Coleman, K.; Zhang, D.; Lee, E. B.; Kadara, H.; Wang, L.; and Li, M. 2023. Deciphering tumor ecosystems at super resolution from spatial transcriptomics with TESLA. Cell systems, 14(5): 404–417.
  • Hu et al. (2021) Hu, J.; Li, X.; Coleman, K.; Schroeder, A.; Ma, N.; Irwin, D. J.; Lee, E. B.; Shinohara, R. T.; and Li, M. 2021. SpaGCN: Integrating gene expression, spatial location and histology to identify spatial domains and spatially variable genes by graph convolutional network. Nature methods, 18(11): 1342–1351.
  • Kipf and Welling (2016) Kipf, T. N.; and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  • Kiselev, Yiu, and Hemberg (2018) Kiselev, V. Y.; Yiu, A.; and Hemberg, M. 2018. scmap: projection of single-cell RNA-seq data across data sets. Nature methods, 15(5): 359–362.
  • Komura and Ishikawa (2018) Komura, D.; and Ishikawa, S. 2018. Machine learning methods for histopathological image analysis. Computational and structural biotechnology journal, 16: 34–42.
  • Kumar et al. (2023) Kumar, G.; Pandurengan, R. K.; Parra, E. R.; Kannan, K.; and Haymaker, C. 2023. Spatial modelling of the tumor microenvironment from multiplex immunofluorescence images: methods and applications. Frontiers in immunology, 14: 1288802.
  • Li et al. (2022) Li, Z.; Wang, Y.; Ganan-Gomez, I.; Colla, S.; and Do, K.-A. 2022. A machine learning-based method for automatically identifying novel cells in annotating single-cell RNA-seq data. Bioinformatics, 38(21): 4885–4892.
  • Liu et al. (2023) Liu, Z.; Zhou, Y.; Xu, Y.; and Wang, Z. 2023. Simplenet: A simple network for image anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20402–20411.
  • Moses and Pachter (2022) Moses, L.; and Pachter, L. 2022. Museum of spatial transcriptomics. Nat Methods, 19: 534–546.
  • Nagrani et al. (2021) Nagrani, A.; Yang, S.; Arnab, A.; Jansen, A.; Schmid, C.; and Sun, C. 2021. Attention bottlenecks for multimodal fusion. Advances in Neural Information Processing Systems, 34: 14200–14213.
  • Okada and Taniguchi (2021) Okada, M.; and Taniguchi, T. 2021. Dreaming: Model-based reinforcement learning by latent imagination without reconstruction. In 2021 ieee international conference on robotics and automation (icra), 4209–4215. IEEE.
  • Ouardini et al. (2019) Ouardini, K.; Yang, H.; Unnikrishnan, B.; Romain, M.; Garcin, C.; Zenati, H.; Campbell, J. P.; Chiang, M. F.; Kalpathy-Cramer, J.; Chandrasekhar, V.; et al. 2019. Towards practical unsupervised anomaly detection on retinal images. In Domain Adaptation and Representation Transfer and Medical Image Learning with Less Labels and Imperfect Data: First MICCAI Workshop, DART 2019, and First International Workshop, MIL3ID 2019, Shenzhen, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 13 and 17, 2019, Proceedings 1, 225–234. Springer.
  • Pan et al. (2023) Pan, J.; Liu, Y.; Zheng, Y.; and Pan, S. 2023. PREM: A Simple Yet Effective Approach for Node-Level Graph Anomaly Detection. arXiv preprint arXiv:2310.11676.
  • Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
  • Pocevičiūtė, Eilertsen, and Lundström (2021) Pocevičiūtė, M.; Eilertsen, G.; and Lundström, C. 2021. Unsupervised anomaly detection in digital pathology using GANs. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), 1878–1882. IEEE.
  • Riasatian et al. (2021) Riasatian, A.; Babaie, M.; Maleki, D.; Kalra, S.; Valipour, M.; Hemati, S.; Zaveri, M.; Safarpoor, A.; Shafiei, S.; Afshari, M.; et al. 2021. Fine-tuning and training of densenet for histopathology image representation using tcga diagnostic slides. Medical Image Analysis, 70: 102032.
  • Ristea et al. (2022) Ristea, N.-C.; Madan, N.; Ionescu, R. T.; Nasrollahi, K.; Khan, F. S.; Moeslund, T. B.; and Shah, M. 2022. Self-supervised predictive convolutional attentive block for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 13576–13586.
  • Roth et al. (2022) Roth, K.; Pemula, L.; Zepeda, J.; Schölkopf, B.; Brox, T.; and Gehler, P. 2022. Towards total recall in industrial anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14318–14328.
  • Rudolph et al. (2023) Rudolph, M.; Wehrbein, T.; Rosenhahn, B.; and Wandt, B. 2023. Asymmetric student-teacher networks for industrial anomaly detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2592–2602.
  • Ruff et al. (2018) Ruff, L.; Vandermeulen, R.; Goernitz, N.; Deecke, L.; Siddiqui, S. A.; Binder, A.; Müller, E.; and Kloft, M. 2018. Deep one-class classification. In International conference on Machine Learning, 4393–4402. PMLR.
  • Sabour, Frosst, and Hinton (2017) Sabour, S.; Frosst, N.; and Hinton, G. E. 2017. Dynamic routing between capsules. Advances in neural information processing systems, 30.
  • Schlegl et al. (2019) Schlegl, T.; Seeböck, P.; Waldstein, S. M.; Langs, G.; and Schmidt-Erfurth, U. 2019. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical image analysis, 54: 30–44.
  • Shen et al. (2022) Shen, R.; Liu, L.; Wu, Z.; Zhang, Y.; Yuan, Z.; Guo, J.; Yang, F.; Zhang, C.; Chen, B.; Feng, W.; et al. 2022. Spatial-ID: a cell typing method for spatially resolved transcriptomics via transfer learning and spatial embedding. Nature communications, 13(1): 7640.
  • Shenkar and Wolf (2021) Shenkar, T.; and Wolf, L. 2021. Anomaly detection for tabular data with internal contrastive learning. In International Conference on Learning Representations.
  • Shenkar and Wolf (2022) Shenkar, T.; and Wolf, L. 2022. Anomaly detection for tabular data with internal contrastive learning. In International conference on learning representations.
  • Shvetsova et al. (2021) Shvetsova, N.; Bakker, B.; Fedulova, I.; Schulz, H.; and Dylov, D. V. 2021. Anomaly detection in medical imaging with deep perceptual autoencoders. IEEE Access, 9: 118571–118583.
  • Simonyan and Zisserman (2014) Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • Sohn et al. (2020) Sohn, K.; Li, C.-L.; Yoon, J.; Jin, M.; and Pfister, T. 2020. Learning and evaluating representations for deep one-class classification. arXiv preprint arXiv:2011.02578.
  • Srinidhi, Ciga, and Martel (2021) Srinidhi, C. L.; Ciga, O.; and Martel, A. L. 2021. Deep neural network models for computational histopathology: A survey. Medical image analysis, 67: 101813.
  • Tian et al. (2020) Tian, Y.; Sun, C.; Poole, B.; Krishnan, D.; Schmid, C.; and Isola, P. 2020. What makes for good views for contrastive learning? Advances in neural information processing systems, 33: 6827–6839.
  • Tishby, Pereira, and Bialek (2000) Tishby, N.; Pereira, F. C.; and Bialek, W. 2000. The information bottleneck method. arXiv preprint physics/0004057.
  • Tschuchnig and Gadermayr (2022) Tschuchnig, M. E.; and Gadermayr, M. 2022. Anomaly detection in medical imaging-a mini review. In Data Science–Analytics and Applications: Proceedings of the 4th International Data Science Conference–iDSC2021, 33–38. Springer.
  • Veličković et al. (2018) Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph Attention Networks. In International Conference on Learning Representations.
  • Wang et al. (2024) Wang, G.; Wu, S.; Xiong, Z.; Qu, H.; Fang, X.; and Bao, Y. 2024. CROST: a comprehensive repository of spatial transcriptomics. Nucleic Acids Research, 52(D1): D882–D890.
  • Wang, Maletic-Savatic, and Liu (2022) Wang, L.; Maletic-Savatic, M.; and Liu, Z. 2022. Region-specific denoising identifies spatial co-expression patterns and intra-tissue heterogeneity in spatially resolved transcriptomics data. Nature Communications, 13(1): 6912.
  • Wang et al. (2023) Wang, Y.; Peng, J.; Zhang, J.; Yi, R.; Wang, Y.; and Wang, C. 2023. Multimodal Industrial Anomaly Detection via Hybrid Fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8032–8041.
  • Wang et al. (2022) Wang, Y.; Song, B.; Wang, S.; Chen, M.; Xie, Y.; Xiao, G.; Wang, L.; and Wang, T. 2022. Sprod for de-noising spatially resolved transcriptomics data based on position and image information. Nature methods, 19(8): 950–958.
  • Wolf, Angerer, and Theis (2018) Wolf, F. A.; Angerer, P.; and Theis, F. J. 2018. SCANPY: large-scale single-cell gene expression data analysis. Genome biology, 19: 1–5.
  • Xu et al. (2024) Xu, K.; Lu, Y.; Hou, S.; Liu, K.; Du, Y.; Huang, M.; Feng, H.; Wu, H.; and Sun, X. 2024. Detecting anomalous anatomic regions in spatial transcriptomics with STANDS. Nature Communications, 15(1): 8223.
  • Yi and Yoon (2020) Yi, J.; and Yoon, S. 2020. Patch svdd: Patch-level svdd for anomaly detection and segmentation. In Proceedings of the Asian conference on computer vision.
  • Zehnder et al. (2022) Zehnder, P.; Feng, J.; Fuji, R. N.; Sullivan, R.; and Hu, F. 2022. Multiscale generative model using regularized skip-connections and perceptual loss for anomaly detection in toxicologic histopathology. Journal of Pathology Informatics, 13: 100102.
  • Zingman et al. (2023) Zingman, I.; Stierstorfer, B.; Lempp, C.; and Heinemann, F. 2023. Learning image representations for anomaly detection: application to discovery of histological alterations in drug development. Medical Image Analysis, 103067.
  • Zong et al. (2022) Zong, Y.; Yu, T.; Wang, X.; Wang, Y.; Hu, Z.; and Li, Y. 2022. conST: an interpretable multi-modal contrastive learning framework for spatial transcriptomics. bioRxiv, 2022–01.

Supplementary Material

Appendix A Related Work

Localized Anomaly Detection in Image

Related works in this field can be broadly divided into two categories: one-class classification methods and reconstruction-based methods. The former aims to delineate normal data distributions and boundaries in a latent space at training time, labeling instances occurring in low-probability density regions (i.e., falling outside the boundary) as anomalies at test time (Shvetsova et al. 2021). For example, Patch SVDD (Yi and Yoon 2020) assesses anomalies according to their proximity to the nearest inlier in a latent space that is learned by minimizing distances between nearby inliers. Another example, SimpleNet (Liu et al. 2023) creates pseudo-anomalies by introducing random noises to extracted visual features of inliers, and trains a separating hyperplane-based discriminator for anomaly differentiation. The main limitation of these methods is their dependency on effective representation learning (Sohn et al. 2020), which may be compromised by batch effects between reference and target datasets (Ouardini et al. 2019).

On the other hand, reconstruction-based methods, trained on normal data only, posit that inliers can be reconstructed more faithfully from their latent manifolds than anomalies. For instance, f-AnoGAN (Schlegl et al. 2019), a WGAN (Arjovsky, Chintala, and Bottou 2017)-based method for AD in medical images, employs a discriminator-guided encoder to obtain reconstruction residuals as anomaly scores. While theoretically more robust to batch effects, since anomalies are identified based on reconstruction errors within the same batch, these methods may suffer from model over-generalization, yielding small reconstruction errors even for anomalies (Liu et al. 2023; Ristea et al. 2022). Overall, methods for localized AD in images often overlook the contextual surroundings (Sabour, Frosst, and Hinton 2017; Ristea et al. 2022), although some, such as SSPCAB (Ristea et al. 2022) and PatchCore (Roth et al. 2022), attempt to aggregate information from neighboring patches through simplified means such as adaptive average pooling. In contrast, our method, by virtue of the MGDAT network, can comprehensively harness contextual information for improved AD.

Anomaly Detection using Gene Expression Data

Tissue spots in ST closely resemble single cells in single-cell RNA sequencing (scRNA-seq), augmented with spatial location information. This similarity offers an opportunity to apply anomalous cell (AC) detection methods to ATR detection in ST. Traditional AC detection methods treat scRNA-seq data as tabular, identifying ACs through cell type classification tasks. For example, scmap (Kiselev, Yiu, and Hemberg 2018) computes gene expression similarities between query cells and centroids of known cell types, designating those below a threshold as anomalies. Such classification-based methods depend heavily on labeled references, often a scarce and costly resource. To circumvent this limitation, CAMLU (Li et al. 2022), a reconstruction-based method utilizing unlabeled reference data only, identifies ACs in the target dataset using informative genes selected as per their reconstruction deviations. However, these methods neglect spatial information inherent to ST data, which is crucial for accurate ATR detection. To bridge this gap, specialized methods have been developed, typically leveraging graph neural networks (GNN) to incorporate spatial relationships among spots (Hu et al. 2021; Dong and Zhang 2022). Among these, to the best of our knowledge, Spatial-ID (Shen et al. 2022) is currently the sole method capable of identifying ATRs by utilizing a classifier, pre-trained on labeled scRNA-seq data, to classify spots based on their latent manifolds generated via a variational graph autoencoder. Spots with uncertain soft assignments are labeled as anomalies. However, Spatial-ID, like many other gene-oriented AD methods, is prone to high false positive rates due to its reliance on assignment uncertainties, often arising from inlier similarities rather than genuine anomalies (Li et al. 2022).

An alternative strategy, bypassing the classification framework, involves modeling ST data as an attributed graph and applying node-level graph anomaly detection (GAD) methods, which can be generative or discriminative (Pan et al. 2023). For instance, PREM (Pan et al. 2023) determines anomalous nodes based on their anomaly scores calculated as the dissimilarity between ego and neighbor node embeddings, which are generated through graph contrastive learning. DOMINANT (Ding et al. 2019), a generative GAD method, leverages a graph convolutional network (GCN) (Kipf and Welling 2016) to reconstruct both nodal attributes and topological structure, using combined reconstruction errors as anomaly scores. Generally, all methods discussed in this section are limited by their heavy dependence on the quality of ST data and by their failure to exploit the visual information available in histology images to improve the accuracy of ATR detection.

Multimodal Anomaly Detection

To date, the development of multimodal AD methods has been predominantly focused on industrial AD scenarios involving the simultaneous use of 2D and 3D data. Recent methods in this field include M3DM (Wang et al. 2023) and AST (Rudolph et al. 2023). M3DM adopts a contrastive learning-based approach to fuse manifolds of segmented patches from 3D point clouds and RGB images, based on which a discriminative model is trained for anomaly decisions. AST concatenates features extracted from RGB images and 3D depth maps, serving as inputs to asymmetric student and teacher networks that determine anomaly scores as per their output discrepancies. However, there is a significant gap in developing multimodal ATR detection methods that combine gene expression data and histology images.

Appendix B Determining anomaly score threshold

Based on the observation that there is a gap between the anomaly scores of inliers and those of true anomalies, as shown in Figure 5, we designed a two-component mixture model to automatically determine the anomaly score threshold that discriminates between inliers and anomalies. Specifically, the distribution of anomaly scores is modeled as a univariate Gaussian Mixture Model (GMM) with two components corresponding to anomalous and normal instances, respectively. We specify the prior for anomaly abundance as a Beta distribution and the priors for the mean and variance of the two Gaussian components as a Normal Inverse Chi-squared (NIX) distribution. The parameters of these priors are estimated from the inlier anomaly scores in the reference dataset. Utilizing the Maximum A Posteriori (MAP)-EM algorithm, we infer the parameters of both Gaussian components and then assign spots to either the normal or the anomalous group based on their probabilities under each component. Specifically, let \Theta=\left\{\pi,\mu_{k},\sigma_{k}^{2},\forall k\in\left\{1,2\right\}\right\} represent the GMM parameters, where \pi\in\left[0,1\right] represents the proportion of anomalies, and \mu_{k},\sigma_{k}^{2} represent the mean and variance of the k-th component, respectively, with the constraint that \mu_{1}>\mu_{2}. Then, the probability density function of an anomaly score d_{i} can be formulated as:

P(di|Θ)=π𝒩(di|μ1,σ12)+(1π)𝒩(di|μ2,σ22)P\left(d_{i}\middle|\Theta\right)=\pi\mathcal{N}\left(d_{i}\middle|\mu_{1},\sigma_{1}^{2}\right)+(1-\pi)\mathcal{N}\left(d_{i}\middle|\mu_{2},\sigma_{2}^{2}\right) (17)
πBeta(π|a,b)\pi\sim\mathrm{Beta}\left(\pi\middle|a,b\right) (18)
\mu_{k},\sigma_{k}^{2}\sim\mathrm{NIX}\left(\mu_{k},\sigma_{k}^{2}\middle|m_{0},\kappa_{0},s_{0}^{2},\nu_{0}\right) (19)

Parameters for the priors in the GMM are empirically set based on the reference dataset’s anomaly scores δi,i{1,2,,Nref}\delta_{i},{\forall}i\in\{1,2,\cdots,N_{ref}\} :

m_{0}=\frac{\sum_{i=1}^{N_{ref}}\delta_{i}}{N_{ref}},\ \kappa_{0}=0.01,\ \nu_{0}=3,\ s_{0}^{2}=\frac{\sum_{i=1}^{N_{ref}}\left(\delta_{i}-m_{0}\right)^{2}}{N_{ref}} (20)
a=1,b=10a=1,\ b=10 (21)

The values of aa and bb can be adjusted if prior knowledge about anomaly abundance is available. The complete data log likelihood for the posterior, denoted as c(Θ)\ell_{c}\left(\Theta\right), is expressed as:

c(Θ)\displaystyle\ell_{c}\left(\Theta\right) =logP(𝒟|Θ)\displaystyle=\mathrm{log}P\left(\mathcal{D}\ \middle|\Theta\right) (22)
=i[𝕀(zi=1)(logπ+log𝒩(di|μ1,σ12))\displaystyle=\sum_{i}\Big{[}\mathbb{I}\left(z_{i}=1\right)\left(\mathrm{log}\pi+\mathrm{log}\mathcal{N}\left(d_{i}\middle|\mu_{1},\sigma_{1}^{2}\right)\right)
+𝕀(zi=2)(log(1π)+log𝒩(di|μ2,σ22))]\displaystyle+\mathbb{I}\left(z_{i}=2\right)\left(\mathrm{log}(1-\pi)+\mathrm{log}\mathcal{N}\left(d_{i}\middle|\mu_{2},\sigma_{2}^{2}\right)\right)\Big{]}
+logBeta(π|a,b)\displaystyle+\mathrm{log}\mathrm{Beta}\left(\pi\middle|a,b\right)
+k=12logNIX(μk,σk2|m0,κ0,s02,ν0)\displaystyle+\sum_{k=1}^{2}{\mathrm{log}\mathrm{NIX}\left(\mu_{k},\sigma_{k}^{2}\middle|m_{0},\kappa_{0},s_{0}^{2},\nu_{0}\right)}

Here, ziz_{i} denotes the component membership of spot ii. In the tt-th iteration of the E-step, the expected sufficient statistics zi¯(t){\overline{z_{i}}}^{(t)} is derived from Θ(t1)\Theta^{(t-1)}. In the subsequent M-step, Θ(t1)\Theta^{(t-1)} is updated to Θ(t)\Theta^{(t)} by maximizing the auxiliary function Q(Θ,Θ(t1))=E(c(Θ)|Θ(t1))Q\left(\Theta,\Theta^{(t-1)}\right)=E\left({\ell}_{c}\left(\Theta\right)\big{|}\Theta^{(t-1)}\right). We elaborate our MAP-EM algorithm below:

MAP-EM inference of GMM parameters.

We first list the mathematical notations used in the inference below:

  Notation Description
 \mathcal{D}=\left\{d_{i},\forall i\in\{1,2,\cdots,N\}\right\} Set of anomaly scores of target spots.
Δ={δi,i{1,2,,Nref}}\Delta\ =\left\{\delta_{i},\forall i\in\{1,2,\cdots,N_{ref}\}\right\} Set of anomaly scores of reference spots.
NN Number of target spots.
NrefN_{ref} Number of reference spots.
π1\pi_{1} Anomaly abundance among the target spots.
π2=1π1\pi_{2}=1-\pi_{1} Abundance of normal spots among the target spots.
\Theta=\left\{\pi_{k},\mu_{k},\sigma_{k}^{2},\forall k\in\left\{1,2\right\}\right\} Parameters of the two GMM components.
zi{1,2}z_{i}\in\left\{1,2\right\} GMM component membership of the spot ii.
 
Table 4: Overview of notations in MAP-EM inference.

Initially, we introduce a prior on π1\pi_{1} as a Beta distribution, and a conjugate joint prior on μk\mu_{k},σk2\sigma_{k}^{2} as a normal inverse chi-squared (NIX) distribution:

π1Beta(π|a,b)\pi_{1}\sim\mathrm{Beta}\left(\pi\middle|a,b\right) (23)
μk,σk2\displaystyle\mu_{k},\sigma_{k}^{2} NIX(μk,σk2|m0,κ0,s02,ν0)\displaystyle\sim\mathrm{NIX}\left(\mu_{k},\sigma_{k}^{2}\middle|m_{0},\kappa_{0},s_{0}^{2},\nu_{0}\right) (24)
=𝒩(μk|m0,σk2/κ0)χ2(σk2|s02,ν0)\displaystyle=\mathcal{N}\left(\mu_{k}\middle|m_{0},\sigma_{k}^{2}/\kappa_{0}\right)\chi^{-2}\left(\sigma_{k}^{2}\middle|s_{0}^{2},\nu_{0}\right)

Here, we set the parameters of the prior distributions based on the anomaly scores of spots in the reference dataset:

m_{0}=\frac{\sum_{i=1}^{N_{ref}}\delta_{i}}{N_{ref}},\kappa_{0}=0.01,\nu_{0}=3,s_{0}^{2}=\frac{\sum_{i=1}^{N_{ref}}\left(\delta_{i}-m_{0}\right)^{2}}{N_{ref}} (25)
a=1,b=10a=1,b=10 (26)

Note that the values of a and b can be set to more appropriate values if prior knowledge about the abundance of anomalies is available. The posterior complete data log likelihood can be written as:

c(Θ)\displaystyle\ell_{c}\left(\Theta\right) =logP(𝒟|Θ)\displaystyle=\mathrm{log}P\left(\mathcal{D}\ \middle|\Theta\right) (27)
=ik𝕀(zi=k)(logπk+log𝒩(di|μk,σk2))\displaystyle=\sum_{i}\sum_{k}\mathbb{I}\left(z_{i}=k\right)\left(\mathrm{log}\pi_{k}+\mathrm{log}\mathcal{N}\left(d_{i}\middle|\mu_{k},\sigma_{k}^{2}\right)\right)
+logBeta(π|a,b)\displaystyle+\mathrm{log}\mathrm{Beta}\left(\pi\middle|a,b\right)
+k=12logNIX(μk,σk2|m0,κ0,s02,ν0)\displaystyle+\sum_{k=1}^{2}{\mathrm{log}\mathrm{NIX}\left(\mu_{k},\sigma_{k}^{2}\middle|m_{0},\kappa_{0},s_{0}^{2},\nu_{0}\right)}

E step. In the tt-th iteration, we have the auxiliary function QQ as:

Q(Θ,Θ(t1))=𝔼[c(Θ)|Θ(t1)]\displaystyle Q\left(\Theta,\Theta^{(t-1)}\right)=\mathbb{E}\left[\ell_{c}(\Theta)\big{|}\Theta^{(t-1)}\right]
=ik=12P(zi=k|di,Θ(t1))\displaystyle=\sum_{i}\sum_{k=1}^{2}P\left(z_{i}=k\big{|}d_{i},\Theta^{(t-1)}\right)
[logπk(t1)+𝔼(logN(di|μk(t1),(σk2)(t1)))]\displaystyle\qquad\left[\mathrm{log}\pi_{k}^{(t-1)}+\mathbb{E}\left(\mathrm{log}N\left(d_{i}\big{|}\mu_{k}^{(t-1)},{(\sigma_{k}^{2})}^{(t-1)}\right)\right)\right]
+logBeta(π|a,b)+k=12logNIX(μk,σk2|m0,κ0,s02,ν0)\displaystyle+\mathrm{log}\mathrm{Beta}\left(\pi|a,b\right)+\sum_{k=1}^{2}\mathrm{log}\mathrm{NIX}\left(\mu_{k},\sigma_{k}^{2}\big{|}m_{0},\kappa_{0},s_{0}^{2},\nu_{0}\right)

The expected sufficient statistics (ESS) are:

zi,k¯\displaystyle\overline{z_{i,k}} =P(zi=k|di,Θ(t1))\displaystyle=P\left(z_{i}=k\big{|}d_{i},\Theta^{(t-1)}\right) (28)
=πk(t1)𝒩(di|μk(t1),(σk2)(t1))kπk(t1)𝒩(di|μk(t1),(σk2)(t1))\displaystyle=\frac{\pi_{k}^{(t-1)}\mathcal{N}\left(d_{i}\big{|}\mu_{k}^{(t-1)},(\sigma_{k}^{2})^{(t-1)}\right)}{\sum_{k^{\prime}}\pi_{k^{\prime}}^{(t-1)}\mathcal{N}\left(d_{i}\big{|}\mu_{k^{\prime}}^{(t-1)},(\sigma_{k^{\prime}}^{2})^{(t-1)}\right)}

M step. In the tt-th iteration, the expected complete posterior data log likelihood is:

Q(Θ,Θ(t1))k=12i[zi,k¯(t)(logπklog(σk2)2(diμk)22σk2)]+logBeta(π|a,b)+k=12[log𝒩(μk|m0,σk2/κ0)+logχ2(σk2|s02,ν0)]\begin{split}&Q\left(\Theta,\Theta^{\left(t-1\right)}\right)\propto\\ &\sum_{k=1}^{2}\sum_{i}\left[{\overline{z_{i,k}}}^{\left(t\right)}\left(\mathrm{log}\pi_{k}-\frac{\mathrm{log}\left(\sigma_{k}^{2}\right)}{2}-\frac{\left(d_{i}-\mu_{k}\right)^{2}}{2\sigma_{k}^{2}}\right)\right]\\ &\quad+\mathrm{log}\mathrm{Beta}\left(\pi\middle|a,b\right)\\ &\quad+\sum_{k=1}^{2}\left[\mathrm{log}\mathcal{N}\left(\mu_{k}\middle|m_{0},\sigma_{k}^{2}/\kappa_{0}\right)+\mathrm{log}\chi^{-2}\left(\sigma_{k}^{2}\middle|s_{0}^{2},\nu_{0}\right)\right]\end{split} (29)

We maximize Q(Θ,Θ(t1))Q\left(\Theta,\Theta^{\left(t-1\right)}\right) with respect to Θ\Theta. The posterior distribution of π1\pi_{1}\ and {μk,σk2}\left\{\mu_{k},\sigma_{k}^{2}\right\} are:

π1Beta(π|a(t),b(t))\pi_{1}\sim\mathrm{Beta}\left(\pi\middle|a^{\left(t\right)},b^{\left(t\right)}\right) (30)
a(t)=a+izi,1¯(t)a^{\left(t\right)}=a+\sum_{i}{\overline{z_{i,1}}}^{\left(t\right)} (31)
b(t)=b+izi,2¯(t)b^{\left(t\right)}=b+\sum_{i}{\overline{z_{i,2}}}^{\left(t\right)} (32)
μk,σk2NIX(μk,σk2|mk(t),κk(t),(sk2)(t),νk(t))\mu_{k},\sigma_{k}^{2}\sim\mathrm{NIX}\left(\mu_{k},\sigma_{k}^{2}\middle|m_{k}^{\left(t\right)},\kappa_{k}^{\left(t\right)},\left(s_{k}^{2}\right)^{\left(t\right)},\nu_{k}^{\left(t\right)}\right) (33)
zk¯(t)=izi,k¯(t){\overline{z_{k}}}^{\left(t\right)}=\sum_{i}{\overline{z_{i,k}}}^{\left(t\right)} (34)
dk¯(t)=i(zi,k¯(t)di)zk¯(t){\bar{d_{k}}}^{\left(t\right)}=\frac{\sum_{i}{({\overline{z_{i,k}}}^{\left(t\right)}}d_{i})}{{\overline{z_{k}}}^{\left(t\right)}} (35)
νk(t)=ν0+zk¯(t),κk(t)=κ0+zk¯(t)\nu_{k}^{\left(t\right)}=\nu_{0}+{\overline{z_{k}}}^{\left(t\right)},\ \kappa_{k}^{\left(t\right)}=\kappa_{0}+{\overline{z_{k}}}^{\left(t\right)} (36)
mk(t)=zk¯(t)dk¯(t)+m0κ0κk(t)m_{k}^{\left(t\right)}=\frac{{\overline{z_{k}}}^{\left(t\right)}{\bar{d_{k}}}^{\left(t\right)}+m_{0}\kappa_{0}}{\kappa_{k}^{\left(t\right)}} (37)
(sk2)(t)=ν0s02+i(zi,k¯(t)di2)+κ0m02zk¯(t)(mk(t))2\left(s_{k}^{2}\right)^{\left(t\right)}={\nu_{0}s}_{0}^{2}+\sum_{i}{({\overline{z_{i,k}}}^{\left(t\right)}}d_{i}^{2})+\kappa_{0}m_{0}^{2}-{\overline{z_{k}}}^{\left(t\right)}\left(m_{k}^{\left(t\right)}\right)^{2}

Then we have the MAP estimates of π1\pi_{1}, μk\mu_{k}\ and σk2\sigma_{k}^{2} as π1(t)\pi_{1}^{(t)},μk(t)\mu_{k}^{\left(t\right)} and (σk2)(t)\left(\sigma_{k}^{2}\right)^{\left(t\right)}:

π1(t)=a(t)1a(t)+b(t)2\pi_{1}^{\left(t\right)}=\frac{a^{\left(t\right)}-1}{a^{\left(t\right)}+b^{\left(t\right)}-2} (38)
μk(t)=mk(t)\mu_{k}^{\left(t\right)}=m_{k}^{\left(t\right)} (39)
(σk2)(t)=νk(t)(sk2)(t)νk(t)+3\left(\sigma_{k}^{2}\right)^{\left(t\right)}=\frac{{\nu_{k}^{\left(t\right)}\left(s_{k}^{2}\right)}^{\left(t\right)}}{\nu_{k}^{\left(t\right)}+3} (40)

Next, the EM algorithm proceeds to the E-step of the \left(t+1\right)-th iteration to update {\overline{z_{i,k}}}^{(t+1)}, \forall i\in\left[1,N\right], \forall k\in\left\{1,2\right\}, until either convergence is achieved or a pre-specified number of iterations is reached. Finally, the soft assignment of spot i to the anomalous group (\mathcal{Q}_{i,1}) is calculated by plugging in \Theta:

qi,1=π1𝒩(di|μ1,σ12)q_{i,1}=\pi_{1}\mathcal{N}\left(d_{i}\middle|\mu_{1},\sigma_{1}^{2}\right) (41)
𝒬i,1=qi,1kqi,k,i{1,2,,N},k{1,2}\mathcal{Q}_{i,1}=\frac{q_{i,1}}{\sum_{k}q_{i,k}},\forall i\in\{1,2,\cdots,N\},\forall k\in\left\{1,2\right\} (42)

If 𝒬i,1>0.5\mathcal{Q}_{i,1}>0.5, then spot ii is determined as an anomaly.
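
For concreteness, a minimal Python sketch of this MAP-EM procedure is shown below. It is an illustrative implementation under the notation above rather than our exact code; in particular, the variance update follows the standard normal-inverse-chi-squared posterior mode, and the initialization of the component means is a heuristic choice.

```python
import numpy as np
from scipy.stats import norm

def map_em_anomaly_assignment(d, delta, a=1.0, b=10.0, kappa0=0.01, nu0=3.0, n_iter=100):
    """Sketch of the MAP-EM threshold procedure for the two-component GMM.
    d: anomaly scores of target spots; delta: anomaly scores of reference (inlier) spots."""
    m0 = delta.mean()                                 # empirical prior mean from reference scores
    s0_sq = ((delta - m0) ** 2).mean()                # empirical prior scale
    pi1 = a / (a + b)                                 # initial anomaly abundance
    mu = np.array([np.quantile(d, 0.9), np.quantile(d, 0.5)])   # enforce mu_1 > mu_2 at start
    var = np.full(2, d.var() + 1e-8)

    for _ in range(n_iter):
        # E-step: responsibilities z_bar_{i,k}
        pis = np.array([pi1, 1.0 - pi1])
        lik = pis * np.stack([norm.pdf(d, mu[k], np.sqrt(var[k])) for k in range(2)], axis=1)
        resp = lik / lik.sum(axis=1, keepdims=True)

        # M-step: MAP updates under the Beta and NIX priors
        zk = resp.sum(axis=0)
        dk = (resp * d[:, None]).sum(axis=0) / zk
        kappa_k, nu_k = kappa0 + zk, nu0 + zk
        m_k = (zk * dk + kappa0 * m0) / kappa_k
        ss_k = nu0 * s0_sq + (resp * d[:, None] ** 2).sum(axis=0) \
               + kappa0 * m0 ** 2 - kappa_k * m_k ** 2
        pi1 = (a + zk[0] - 1.0) / (a + b + zk.sum() - 2.0)
        mu, var = m_k, ss_k / (nu_k + 3.0)            # posterior-mode estimates

    # Soft assignment to the anomalous component and thresholding at 0.5
    q1 = pi1 * norm.pdf(d, mu[0], np.sqrt(var[0]))
    q2 = (1.0 - pi1) * norm.pdf(d, mu[1], np.sqrt(var[1]))
    return q1 / (q1 + q2) > 0.5                       # True => spot classified as anomaly
```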

Appendix C Algorithm for MEATRD

Algorithm 1 Stage II training.
Input: Gene expression profiles \bm{X}\in\mathbb{R}^{N\times G}; image patches \bm{P}\in\mathbb{R}^{N\times h\times w\times c}; attributed graph G(V,A,\bm{\mathcal{Z}}); number of nodes N; image-modality weight \lambda; gene-modality weight \alpha.
Modules: Pre-trained Mobile-UNet encoder E_{1}; pre-trained Mobile-UNet decoder D_{1}; gene encoder f_{E}; gene decoder f_{D}; MGDAT network \mathcal{F}; ResNet-based image decoder D_{2}; GNN-based gene decoder D_{3}; feed-forward network f; L1 reconstruction loss \mathcal{L}_{1}; SSIM loss \mathcal{L}_{SSIM}; SCE loss \mathcal{L}_{SCE}.
Output: Reconstructed ST data \hat{\mathbf{X}}_{b} and reconstructed histology image patches \hat{\mathbf{P}}_{b} of the query spots.
1: for \mathbf{X}_{b}, \mathbf{P}_{b} in \bm{X}, \bm{P} do \triangleright Stage II
2:     Z_{gene}=f_{E}(\mathbf{X}_{b}), Z_{img}=E_{1}(\mathbf{P}_{b})
3:     Z_{gene},Z_{img}=\mathcal{F}(Z_{gene},Z_{img})
4:     \hat{\mathbf{P}}_{b}=D_{2}(Z_{img}), \hat{\mathbf{X}}_{b}=D_{3}(Z_{gene})
5:     \mathcal{L}_{rec}=\mathcal{L}_{SSIM}(\mathbf{P}_{b},\hat{\mathbf{P}}_{b})+\lambda\mathcal{L}_{1}(\mathbf{P}_{b},\hat{\mathbf{P}}_{b})+\alpha\mathcal{L}_{SCE}(\mathbf{X}_{b},\hat{\mathbf{X}}_{b})
6:     Update the parameters of E_{1},f_{E},\mathcal{F},D_{2},D_{3} using \mathcal{L}_{rec}
7: end for
8: return \hat{\mathbf{P}}_{b}, \hat{\mathbf{X}}_{b}
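
To make the Stage II objective in Algorithm 1 concrete, a minimal PyTorch-style sketch of the reconstruction loss is given below. The ssim_loss callable is assumed to be supplied by an external library, and the scaled cosine error (SCE) term follows a common GraphMAE-style formulation, which is an assumption of this sketch rather than the exact definition given in the main text.

```python
import torch
import torch.nn.functional as F

def sce_loss(x, x_hat, gamma=2.0):
    """Scaled cosine error between original and reconstructed gene expression profiles.
    (GraphMAE-style formulation; assumed here for illustration.)"""
    cos = F.cosine_similarity(x, x_hat, dim=-1)
    return ((1.0 - cos).clamp(min=1e-8) ** gamma).mean()

def stage2_reconstruction_loss(P_b, P_hat_b, X_b, X_hat_b, ssim_loss, lam=1.0, alpha=0.5):
    """Combined Stage II loss: SSIM + lambda * L1 on image patches, plus alpha * SCE on expression."""
    l_img = ssim_loss(P_b, P_hat_b) + lam * F.l1_loss(P_b, P_hat_b)
    l_gene = alpha * sce_loss(X_b, X_hat_b)
    return l_img + l_gene
```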
Algorithm 2 Stage III training.
Input: Gene expression profiles \bm{X}\in\mathbb{R}^{N\times G}; image patches \bm{P}\in\mathbb{R}^{N\times H\times W\times C}; maximum number of epochs E_{max}.
Modules: Feed-forward network f; Stage III image encoder E_{2}; Stage III gene encoder E_{3}; latent reconstruction error \ell_{rec,b}; dimension of the hyperspherical space D.
Output: SVDD center c\in\mathbb{R}^{N\times D}.
1: \hat{\mathbf{P}}, \hat{\mathbf{X}} = Learning of Spot Reconstruction(\bm{P}, \bm{X}) \triangleright Stage III
2: Initialize E_{2}, E_{3}, and f
3: while epoch<E_{max} do
4:     Compute the center c
5:     for \mathbf{P}_{b}, \mathbf{X}_{b}, \hat{\mathbf{P}}_{b}, \hat{\mathbf{X}}_{b} in \bm{P}, \bm{X}, \hat{\bm{P}}, \hat{\bm{X}} do
6:         Z_{fused}=f(\beta\,\mathrm{norm}(E_{2}(\mathbf{P}_{b}))+\mathrm{norm}(E_{3}(\mathbf{X}_{b})))
7:         \hat{Z}_{fused}=f(\beta\,\mathrm{norm}(E_{2}(\hat{\mathbf{P}}_{b}))+\mathrm{norm}(E_{3}(\hat{\mathbf{X}}_{b})))
8:         \ell_{rec,b}=Z_{fused}-\hat{Z}_{fused}
9:         \mathcal{L}_{SVDD}=\left\|\ell_{rec,b}-c\right\|^{2}
10:        Update the parameters of E_{2},E_{3},f using \mathcal{L}_{SVDD}
11:     end for
12: end while
13: return c
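
Similarly, a minimal PyTorch-style sketch of the Stage III Deep SVDD objective on latent multimodal reconstruction errors (Algorithm 2) might look as follows; the encoders E2 and E3, the fusion network f, and the center c are placeholders for the modules and quantities defined above.

```python
import torch
import torch.nn.functional as F

def stage3_svdd_loss(P_b, X_b, P_hat_b, X_hat_b, E2, E3, f, center, beta=1.0):
    """Deep SVDD loss on latent multimodal reconstruction errors (sketch of Algorithm 2).
    E2/E3: Stage III image/gene encoders; f: feed-forward fusion network; center: hypersphere center c."""
    # Fused embeddings of the original and the reconstructed spot
    z_fused = f(beta * F.normalize(E2(P_b), dim=-1) + F.normalize(E3(X_b), dim=-1))
    z_hat_fused = f(beta * F.normalize(E2(P_hat_b), dim=-1) + F.normalize(E3(X_hat_b), dim=-1))
    # Latent multimodal reconstruction error of the batch
    err = z_fused - z_hat_fused
    # Collapse the (inlier) reconstruction errors onto the hypersphere center c
    return ((err - center) ** 2).sum(dim=-1).mean()
```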

Appendix D Theoretical Analysis

Fused Bottleneck Encoding as a Minimally Sufficient Representation of Modality-Specific, Task-Relevant Information

In this section, we begin with the mathematical notations (Table 5), properties (D.1), definitions (Definition D.4), and assumptions (D.1 and D.2) pertinent to our theoretical analysis. We then prove that the fused bottleneck encoding serves as a sufficient statistic (Tian et al. 2020) for capturing complementary task-relevant information across data modalities (Proposition D.1), as illustrated in Supplementary Figure 4. Finally, we prove that the fused bottleneck encoding is the most informationally compact among all sufficient encodings (Proposition D.2).

  Notation Description
  viv_{i} The view associated with the ii-th data modality.
b0b_{0} The biological contents shared between data modalities.
bib_{i} The biological contents specific to the ii-th data modality.
I()I(*) The information set inherent to *.
MM The mutual information function.
HH The entropy function.
f1f_{1} and z1z_{1} The encoder and encoding for view v1v_{1}.
f2f_{2} and z2z_{2} The encoder and encoding for view v2v_{2}.
f3f_{3} and z3z_{3} The fusion bottleneck encoder and fused encoding.
 
Table 5: Summary of notation.
Definition D.4.

Information function I(x)I(x) denotes the information set inherent in xx, e.g., I(x)=H(x)I(x)=H(x) when xx is a variable. Also, we have I(v1,v2)=I(v1)I(v2)I(v_{1},v_{2})=I(v_{1})\cup\ I(v_{2}).

Definition D.5.

The relative mutual information between two variables v1v_{1} and v2v_{2} is defined as the ratio of their mutual information to their total information:

M^(v1,v2)=M(v1,v2)I(v1)I(v2)=M(v1,v2)H(v1)+H(v2)M(v1,v2)\widehat{M}(v_{1},v_{2})=\frac{M(v_{1},v_{2})}{I(v_{1})\cup I(v_{2})}=\frac{M(v_{1},v_{2})}{H(v_{1})+H(v_{2})-M(v_{1},v_{2})}

Relative mutual information is more effective in highlighting the significance of shared information between two variables compared to conventional mutual information.
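
As a small numerical illustration of Definition D.5, the following sketch computes the relative mutual information of two discrete variables from their joint distribution; the toy joint distribution is chosen purely for illustration.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def relative_mutual_information(joint):
    """M_hat(v1, v2) = M(v1, v2) / (H(v1) + H(v2) - M(v1, v2)),
    computed from a discrete joint distribution (2D array summing to 1)."""
    p1 = joint.sum(axis=1)                 # marginal of v1
    p2 = joint.sum(axis=0)                 # marginal of v2
    mi = entropy(p1) + entropy(p2) - entropy(joint.flatten())   # mutual information
    return mi / (entropy(p1) + entropy(p2) - mi)                # normalize by the information union

# Toy example: two partially dependent binary variables
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
print(relative_mutual_information(joint))  # ~0.16, i.e. ~16% of the total information is shared
```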

Properties D.1.

Properties of Mutual Information and Entropy:

\textbf{i})\ M(x;y)\geq 0,\ M(x;y|z)\geq 0.
\textbf{ii})\ M(x;y,z)=M(x;y)+M(x;z|y).
\textbf{iii})\ M(x_{1};x_{2};\cdots;x_{n+1})=M(x_{1};\cdots;x_{n})-M(x_{1};\cdots;x_{n}|x_{n+1}).
\textbf{iv})\ \text{If}\ I(v_{2})\subseteq I(v_{1}),\ \text{then}\ M(v_{1},v_{2})=H(v_{2})\ \text{and}\ I(v_{1},v_{2})=I(v_{1})\cup I(v_{2})=I(v_{1})=H(v_{1}).
\textbf{v})\ \text{If}\ I(v_{2})\cap I(v_{1})=\varnothing,\ \text{then}\ I(v_{1},v_{2})=H(v_{1},v_{2})=H(v_{1})+H(v_{2})=I(v_{1})+I(v_{2}).
Proof.

The proofs of properties i, ii, and iii can be found in (Cover 1999). For property iv:

M(v1,v2)=v1,v2p(v1,v2)log(p(v1,v2)p(v1)p(v2))=v1,v2p(v1,v2)log(p(v2|v1)=1asI(v2)I(v1)p(v1)p(v1)p(v2))=v2p(v2)log(p(v2))=H(v2).\begin{split}M(v_{1},v_{2})&=\underset{v_{1},v_{2}}{\iint}p(v_{1},v_{2})\mathrm{log}(\frac{p(v_{1},v_{2})}{p(v_{1})p(v_{2})})\\ &=\underset{v_{1},v_{2}}{\iint}p(v_{1},v_{2})\mathrm{log}(\frac{\overbrace{p(v_{2}|v_{1})}^{=1\ \text{as}\ I(v_{2})\subseteq I(v_{1})}p(v_{1})}{p(v_{1})p(v_{2})})\\ &=\underset{v_{2}}{\int}-p(v_{2})\mathrm{log}(p(v_{2}))=H(v_{2}).\end{split} (43)

In addition, for I(v1,v2)I(v_{1},v_{2}), we have:

\begin{split}I(v_{1},v_{2})&=I(v_{1})\cup I(v_{2})=H(v_{1},v_{2})\\ &=\underset{v_{1},v_{2}}{\iint}-p(v_{1},v_{2})\mathrm{log}(p(v_{1},v_{2}))\\ &=\underset{v_{1},v_{2}}{\iint}-p(v_{1},v_{2})\mathrm{log}(p(v_{2}|v_{1})p(v_{1}))\\ &=\underset{v_{1}}{\int}-p(v_{1})\mathrm{log}(p(v_{1}))=H(v_{1})=I(v_{1}).\end{split} (44)

For property v, we first note that:

I(v2)I(v1)=p(v1,v2)=p(v1)p(v2)I(v_{2})\cap I(v_{1})=\varnothing\longrightarrow p(v_{1},v_{2})=p(v_{1})p(v_{2}) (45)

Therefore, we have:

H(v1,v2)=v1,v2p(v1,v2)log(p(v1,v2))=v1,v2p(v1)p(v2)log(p(v1)p(v2))=v1p(v1)log(p(v1))+v2p(v2)log(p(v2))=H(v1)+H(v2)\begin{split}H(v_{1},v_{2})&=\underset{v_{1},v_{2}}{\iint}-p(v_{1},v_{2})\mathrm{log}(p(v_{1},v_{2}))\\ &=\underset{v_{1},v_{2}}{\iint}-p(v_{1})p(v_{2})\mathrm{log}(p(v_{1})p(v_{2}))\\ &=\underset{v_{1}}{\int}-p(v_{1})\mathrm{log}(p(v_{1}))+\underset{v_{2}}{\int}-p(v_{2})\mathrm{log}(p(v_{2}))\\ &=H(v_{1})+H(v_{2})\end{split} (46)

Assumption D.1.

Assume that histology image and ST represent two views (v1v_{1} and v2v_{2}) of the biological information (bb) inherent in the studied tissue. Let yy be an indicator of the normality of regions across the tissue, which is essentially determined by bb. Then, we have:

I(y)={b}={b0,b1,b2},{b0}{b1}=,{b0}{b2}=,{b1}{b2}=,M(y;v1)={b}I(v1)={b0,b1},M(y;v2)={b}I(v2)={b0,b2}.\begin{split}&I(y)=\{b\}=\{b_{0},b_{1},b_{2}\},\\ &\{b_{0}\}\cap\{b_{1}\}=\varnothing,\{b_{0}\}\cap\{b_{2}\}=\varnothing,\{b_{1}\}\cap\{b_{2}\}=\varnothing,\\ &M(y;v_{1})=\{b\}\cap I(v_{1})=\{b_{0},b_{1}\},\\ &M(y;v_{2})=\{b\}\cap I(v_{2})=\{b_{0},b_{2}\}.\\ \end{split}

Here, b0b_{0} represents the common task-relevant information, while b1b_{1} and b2b_{2} represent the task-relevant information specific to v1v_{1} and v2v_{2}, respectively.

Assumption D.2.

The encodings z_{1}=f_{1}(v_{1}), z_{2}=f_{2}(v_{2}), and z_{3}=f_{3}(z_{1},z_{2}) are generated by the respective encoders. We define z_{4}=\{z_{1},z_{3}\} and z_{5}=\{z_{2},z_{3}\} as per equation (6) in the main text. We assume that f_{1} and f_{2} are information-lossless encoders that, together with the fusion bottleneck encoder f_{3}, follow the information bottleneck theory proposed by Tishby, Pereira, and Bialek (2000). That is, z_{4} and z_{5} should be maximally informative about y under an information constraint on the bottleneck z_{3}. We use relative mutual information in place of conventional mutual information for a more accurate reflection of the significance of shared information. The optimization problems are defined as:

maxf1,f3M^(z4;y|f1)s.t.M^(z3;v1|f3)Ic,\max_{f_{1},f_{3}}\ \widehat{M}(z_{4};y|f_{1})\quad\mathrm{s.t.}\ \widehat{M}(z_{3};v_{1}|f_{3})\leq I_{c},
maxf2,f3M^(z5;y|f2)s.t.M^(z3;v2|f3)Ic,\max_{f_{2},f_{3}}\ \widehat{M}(z_{5};y|f_{2})\quad\mathrm{s.t.}\ \widehat{M}(z_{3};v_{2}|f_{3})\leq I_{c},

where IcI_{c} is the information constraint. These can be converted into the following objective functions by introducing a Lagrange multiplier β>0\beta>0:

minz3,z4(z3,z4)=minz3,z4M^(z4;y)+βM^(z3;v1),\min_{z_{3},z_{4}}\ell(z_{3},z_{4})=\min_{z_{3},z_{4}}-\widehat{M}(z_{4};y)+\beta\widehat{M}(z_{3};v_{1}),
minz3,z5(z3,z5)=minz3,z5M^(z5;y)+βM^(z3;v2).\min_{z_{3},z_{5}}\ell(z_{3},z_{5})=\min_{z_{3},z_{5}}-\widehat{M}(z_{5};y)+\beta\widehat{M}(z_{3};v_{2}).
Refer to caption
Figure 4: Information diagrams of the two data modalities v1v_{1} and v2v_{2}. The bottleneck encoding, generated by the MGDAT block, embodies the minimally sufficient representation for modality-specific, task-relevant information (i.e., b1+b2b_{1}+b_{2}).
Proposition D.1.

Inclusiveness of complementary task-relevant information. The objective functions in D.2 are optimized when the bottleneck encoding z3z_{3} encompasses all task-relevant information specific to v1v_{1} and v2v_{2}:

I(z3){b1,b2}I(z_{3})\supseteq\{b_{1},b_{2}\}
Proof.

Given that f1f_{1} and f2f_{2} are information lossless encoders of v1v_{1} and v2v_{2}, we have:

I(v1)=I(z1)={b0,b1,z~1}I(v_{1})=I(z_{1})=\{b_{0},b_{1},\tilde{z}_{1}\} (47)
I(v2)=I(z2)={b0,b2,z~2}I(v_{2})=I(z_{2})=\{b_{0},b_{2},\tilde{z}_{2}\} (48)

Here, \tilde{z}_{1} and \tilde{z}_{2} represent the task-irrelevant information specific to v_{1} and v_{2}, respectively. The sets \{b_{0}\}, \{b_{1}\}, \{b_{2}\}, \{\tilde{z}_{1}\}, and \{\tilde{z}_{2}\} are mutually exclusive, i.e.,

{bi}{z~1}=,{bi}{z~2}=,\displaystyle\{b_{i}\}\cap\{\tilde{z}_{1}\}=\varnothing,\{b_{i}\}\cap\{\tilde{z}_{2}\}=\varnothing, (49)
{z~1}{z~2}=,{bi}{bj}=,\displaystyle\{\tilde{z}_{1}\}\cap\{\tilde{z}_{2}\}=\varnothing,\{b_{i}\}\cap\{b_{j}\}=\varnothing,
i,j{0,1,2},ij\displaystyle\forall i,j\in\{0,1,2\},i\neq j

Let {zˇ3}=({b0,b1,b2}/({b0,b1,b2}I(z3))){b2}\{\check{z}_{3}\}=(\{b_{0},b_{1},b_{2}\}/(\{b_{0},b_{1},b_{2}\}\cap I(z_{3})))\cap\{b_{2}\} represent the task-relevant information in b2b_{2} that is not included in z3z_{3}. It is obvious:

{zˇ3}I(y),{zˇ3}I(z3)=,\displaystyle\{\check{z}_{3}\}\subset I(y),\{\check{z}_{3}\}\cap I(z_{3})=\varnothing, (50)
{zˇ3}{b0}=,{zˇ3}{b1}=,\displaystyle\{\check{z}_{3}\}\cap\{b_{0}\}=\varnothing,\{\check{z}_{3}\}\cap\{b_{1}\}=\varnothing,
{zˇ3}I(v1)=,{zˇ3}I(z1)=.\displaystyle\{\check{z}_{3}\}\cap I(v_{1})=\varnothing,\{\check{z}_{3}\}\cap I(z_{1})=\varnothing.

If {zˇ3}\{\check{z}_{3}\}\neq\varnothing, we have:

(z3,z4)=M^(z4;y)+βM^(z3;v1)=M^(z1,z3;y)+βM^(v1;z3)=M^(z1,z3;y)+β(M(v1;z3)I(v1)I(z3)+M(v1;zˇ3|z3)=0I(v1)I(z3))>M^(z1,z3;y)+βM(z3,zˇ3;v1)I(v1)I(z3,zˇ3)=H(z3)+H(zˇ3)>H(z3)=I(z3)=M^(z1,z3;y)+βM^(z3,zˇ3;v1)\begin{split}&\ell(z_{3},z_{4})=-\widehat{M}(z_{4};y)+\beta\widehat{M}(z_{3};v_{1})\\ &=-\widehat{M}(z_{1},z_{3};y)+\beta\widehat{M}(v_{1};z_{3})\\ &=-\widehat{M}(z_{1},z_{3};y)+\beta(\frac{M(v_{1};z_{3})}{I(v_{1})\cup I(z_{3})}+\frac{\overbrace{M(v_{1};\check{z}_{3}|z_{3})}^{=0}}{I(v_{1})\cup I(z_{3})})\\ &>-\widehat{M}(z_{1},z_{3};y)+\beta\frac{M(z_{3},\check{z}_{3};v_{1})}{I(v_{1})\cup\underbrace{I(z_{3},\check{z}_{3})}_{=H(z_{3})+H(\check{z}_{3})>H(z_{3})=I(z_{3})}}\\ &=-\widehat{M}(z_{1},z_{3};y)+\beta\widehat{M}(z_{3},\check{z}_{3};v_{1})\\ \end{split}

For M^(z1,z3;y)\widehat{M}(z_{1},z_{3};y), we have:

M^(z1,z3;y)=M(z1,z3;y)I(z1,z3)I(y)=M(z1,z3;y)I(z1,z3)(I(y)I(zˇ3))=H(y)=I(y)as{zˇ3}I(y)<M(y;zˇ3|z1,z3)>0+M(y;z1,z3)I(z1,z3,zˇ3)I(y)=M(y;z1,z3,zˇ3)I(z1,z3,zˇ3)I(y)=M^(z1,z3,zˇ3;y)\begin{split}\widehat{M}(z_{1},z_{3};y)&=\frac{M(z_{1},z_{3};y)}{I(z_{1},z_{3})\cup I(y)}\\ &=\frac{M(z_{1},z_{3};y)}{I(z_{1},z_{3})\cup\underbrace{(I(y)\cup I(\check{z}_{3}))}_{=H(y)=I(y)\ \text{as}\ \{\check{z}_{3}\}\subset I(y)}}\\ &<\frac{\overbrace{M(y;\check{z}_{3}|z_{1},z_{3})}^{>0}+M(y;z_{1},z_{3})}{I(z_{1},z_{3},\check{z}_{3})\cup I(y)}\\ &=\frac{M(y;z_{1},z_{3},\check{z}_{3})}{I(z_{1},z_{3},\check{z}_{3})\cup I(y)}=\widehat{M}(z_{1},z_{3},\check{z}_{3};y)\\ \end{split}

Thus, f3f_{3} will be updated to generate z3withI(z3)={I(z3),zˇ3}z^{\prime}_{3}\ \text{with}\ I(z^{\prime}_{3})=\{I(z_{3}),\check{z}_{3}\} so that:

(z3,z4)=M^(z1,z3;y)+βM^(z3;v1)=M^(z1,z3,zˇ3;y)+βM^(z3,zˇ3;v1)<(z3,z4)\begin{split}\ell(z^{\prime}_{3},z_{4})&=-\widehat{M}(z_{1},z^{\prime}_{3};y)+\beta\widehat{M}(z^{\prime}_{3};v_{1})\\ &=-\widehat{M}(z_{1},z_{3},\check{z}_{3};y)+\beta\widehat{M}(z_{3},\check{z}_{3};v_{1})\\ &<\ell(z_{3},z_{4})\end{split} (51)

This update continues until {zˇ3}={b2}I(z3)\{\check{z}_{3}\}=\varnothing\rightarrow\{b_{2}\}\subseteq I(z_{3}). Similarly, using (z3,z5)\ell(z_{3},z_{5}), we can show that {z^3}=({b0,b1,b2}/({b0,b1,b2}I(z3)){b1}={b1}I(z3)\{\hat{z}_{3}\}=(\{b_{0},b_{1},b_{2}\}/(\{b_{0},b_{1},b_{2}\}\cap I(z_{3}))\cap\{b_{1}\}=\varnothing\rightarrow\{b_{1}\}\subseteq I(z_{3}). Therefore, I(z3){b1,b2}I(z_{3})\supseteq\{b_{1},b_{2}\}, completing the proof. ∎

Proposition D.2.

Compactness of complementary task-relevant information. The objective functions in D.2 are minimized when:

I(z3)={b1,b2}I(z_{3})=\{b_{1},b_{2}\}
Proof.

As proved in Proposition D.1, I(z3){b1,b2}I(z_{3})\supseteq\{b_{1},b_{2}\}. We start with I(z3)={b0,b1,b2}I(z_{3})=\{b_{0},b_{1},b_{2}\}, and then expand z3z_{3} to encompass additional information from v1v_{1}, denoted as {zˇ3}\{\check{z}_{3}\}, where:

{zˇ3}I(v1)=I(z1),M(zˇ3,v1)>0,\displaystyle\{\check{z}_{3}\}\subset I(v_{1})=I(z_{1}),M(\check{z}_{3},v_{1})>0, (52)
M(zˇ3,y)=0,{z3}{zˇ3}=.\displaystyle M(\check{z}_{3},y)=0,\{z_{3}\}\cap\{\check{z}_{3}\}=\varnothing.

Let z¨3={z3,zˇ3}\ddot{z}_{3}=\{z_{3},\check{z}_{3}\}. The objective function becomes:

(z¨3,z4)=M^(z4;y)+βM^(z¨3;v1)=M^(z1,z3,zˇ3;y)+βM^(zˇ3,z3;v1)\begin{split}\ell(\ddot{z}_{3},z_{4})&=-\widehat{M}(z_{4};y)+\beta\widehat{M}(\ddot{z}_{3};v_{1})\\ &=-\widehat{M}(z_{1},z_{3},\check{z}_{3};y)+\beta\widehat{M}(\check{z}_{3},z_{3};v_{1})\\ \end{split} (53)

For M^(zˇ3,z3;v1)\widehat{M}(\check{z}_{3},z_{3};v_{1}), we have:

M^(zˇ3,z3;v1)=M(v1;z3)+M(v1;zˇ3|z3)>0I(v1)I(z3)I(z3,zˇ3)=I(z3)I(zˇ3);I(zˇ3)I(v1)=I(v1)>M(z3;v1)I(v1)I(z3)=M^(z3;v1)\begin{split}\widehat{M}(\check{z}_{3},z_{3};v_{1})&=\frac{M(v_{1};z_{3})+\overbrace{M(v_{1};\check{z}_{3}|z_{3})}^{>0}}{\underbrace{I(v_{1})\cup I(z_{3})}_{\because I(z_{3},\check{z}_{3})=I(z_{3})\cup I(\check{z}_{3});\ I(\check{z}_{3})\cup I(v_{1})=I(v_{1})}}\\ &>\frac{M(z_{3};v_{1})}{I(v_{1})\cup I(z_{3})}=\widehat{M}(z_{3};v_{1})\end{split} (54)

For M^(z1,z3,zˇ3;y)\widehat{M}(z_{1},z_{3},\check{z}_{3};y), we have:

M^(z1,z3,zˇ3;y)=M(y;zˇ3|z1,z3)=0+M(y;z1,z3)I(z1,z3,zˇ3)I(y)=M(y;z1,z3)I(z1,z3){zˇ3}I(z1)I(y)=M^(z1,z3;y)\begin{split}\widehat{M}(z_{1},z_{3},\check{z}_{3};y)&=\frac{\overbrace{M(y;\check{z}_{3}|z_{1},z_{3})}^{=0}+M(y;z_{1},z_{3})}{I(z_{1},z_{3},\check{z}_{3})\cup I(y)}\\ &=\frac{M(y;z_{1},z_{3})}{\underbrace{I(z_{1},z_{3})}_{\because\{\check{z}_{3}\}\subset I(z_{1})}\cup I(y)}=\widehat{M}(z_{1},z_{3};y)\end{split} (55)

Thus, (z¨3,z4)>M^(z1,z3;y)+βM^(z3;v1)=(z3,z4),β>0\ell(\ddot{z}_{3},z_{4})>-\widehat{M}(z_{1},z_{3};y)+\beta\widehat{M}(z_{3};v_{1})=\ell(z_{3},z_{4}),\forall\beta>0. To minimize the objective function, zˇ3\check{z}_{3} is excluded. Similarly, from (z3,z5)\ell(z_{3},z_{5}), we know z3z_{3} should not expand to encompass additional information from v2v_{2}. Hence, optimal z3z_{3} must satisfy I(z3){b0,b1,b2}I(z_{3})\subseteq\{b_{0},b_{1},b_{2}\}.

Furthermore, suppose we shrink the information of z_{3} to \dot{z}_{3} by removing \{\hat{z}_{3}\}\subseteq\{b_{0}\}\subset I(z_{1})=I(v_{1}), so that \{\dot{z}_{3}\}\cap\{\hat{z}_{3}\}=\varnothing. The objective function becomes:

(z3,z4)=M^(z4;y)+βM^(z3;v1)=M^(z1,z˙3,z^3;y)+βM^(z˙3,z^3;v1)=M(y;z1,z˙3)+M(y;z^3|z1,z˙3)=0asz^3I(z1)I(z1,z˙3,z^3)I(y)+βM(z˙3,z^3;v1)(I(z˙3)I(z^3))I(v1)=M(z1,z˙3;y)I(z1,z˙3){z^3}I(z1)I(y)+βM(v1;z˙3)+M(v1;z^3|z˙3)>0I(z˙3)I(v1)>M(z1,z˙3;y)I(z1,z˙3)I(y)+βM(z˙3;v1)I(z˙3)I(v1)=(z˙3,z4)\begin{split}&\ell(z_{3},z_{4})=-\widehat{M}(z_{4};y)+\beta\widehat{M}(z_{3};v_{1})\\ &=-\widehat{M}(z_{1},\dot{z}_{3},\hat{z}_{3};y)+\beta\widehat{M}(\dot{z}_{3},\hat{z}_{3};v_{1})\\ &=-\frac{M(y;z_{1},\dot{z}_{3})+\overbrace{M(y;\hat{z}_{3}|z_{1},\dot{z}_{3})}^{=0\ as\ \hat{z}_{3}\subset I(z_{1})}}{I(z_{1},\dot{z}_{3},\hat{z}_{3})\cup I(y)}\\ &+\beta\frac{M(\dot{z}_{3},\hat{z}_{3};v_{1})}{(I(\dot{z}_{3})\cup I(\hat{z}_{3}))\cup I(v_{1})}\\ &=-\frac{M(z_{1},\dot{z}_{3};y)}{\underbrace{I(z_{1},\dot{z}_{3})}_{\because\{\hat{z}_{3}\}\subset I(z_{1})}\cup I(y)}+\beta\frac{M(v_{1};\dot{z}_{3})+\overbrace{M(v_{1};\hat{z}_{3}|\dot{z}_{3})}^{>0}}{I(\dot{z}_{3})\cup I(v_{1})}\\ &>-\frac{M(z_{1},\dot{z}_{3};y)}{I(z_{1},\dot{z}_{3})\cup I(y)}+\beta\frac{M(\dot{z}_{3};v_{1})}{I(\dot{z}_{3})\cup I(v_{1})}=\ell(\dot{z}_{3},z_{4})\end{split}

Therefore, if M(v1;z^3)>0M(v_{1};\hat{z}_{3})>0, the objective function can be further optimized by reducing information from {b0}\{b_{0}\} until I(z3){b0}=I(z3)={b1,b2}I(z_{3})\cap\{b_{0}\}=\varnothing\rightarrow I(z_{3})=\{b_{1},b_{2}\}. This completes the proof.

In summary, f3f_{3} effectively captures the view-specific, task-relevant information in the bottleneck encoding z3z_{3}, which embodies an inclusive and condensed representation of the complementary information, biologically relevant for determining tissue region normality, between the two data modalities. Thus, z3z_{3} serves as an informational bridge connecting the two data modalities.

  Dataset Tissue (ATR Type) Total Number of Spots Anomaly Proportion
  10x-hNB-v03 Normal human breast 2364 0.00%
  10x-hNB-v04 Normal human breast 2504 0.00%
  10x-hNB-v05 Normal human breast 2224 0.00%
  10x-hNB-v06 Normal human breast 3037 0.00%
  10x-hNB-v07 Normal human breast 2086 0.00%
  10x-hNB-v08 Normal human breast 2801 0.00%
  10x-hNB-v09 Normal human breast 2694 0.00%
  10x-hNB-v10 Normal human breast 2473 0.00%
  10x-hBC-A1 Human breast cancer (Cancer in situ, Invasive cancer) 346 12.43%
  10x-hBC-B1 Human breast cancer (Invasive cancer) 295 78.64%
  10x-hBC-C1 Human breast cancer (Invasive cancer) 176 27.84%
  10x-hBC-D1 Human breast cancer (Invasive cancer) 306 54.58%
  10x-hBC-E1 Human breast cancer (Invasive cancer) 587 42.08%
  10x-hBC-F1 Human breast cancer (Invasive cancer) 691 16.50%
  10x-hBC-G2 Human breast cancer (Cancer in situ, Invasive cancer) 467 65.74%
  10x-hBC-H1 Human breast cancer (Cancer in situ, Invasive cancer) 613 69.49%
  10x-hBC-I1 Human breast cancer (Ductal carcinoma in situ, Lobular carcinoma in situ, Invasive Carcinoma) 1308 62.92%
  10x-hLiver-A1 Healthy human liver 2378 0.00%
  10x-hLiver-B1 Healthy human liver 2349 0.00%
  10x-hLiver-C1 Healthy human liver 2277 0.00%
  10x-hLiver-D1 Healthy human liver 2265 0.00%
  10x-PSC-A1 PSC human liver (native cell, intrahepatic cholangiocyte) 3118 26.36%
  10x-PSC-B1 PSC human liver (PSC fibrotic region) 2670 24.91%
  10x-PSC-C1 PSC human liver (native cell, intrahepatic cholangiocyte) 3322 25.89%
  10x-PSC-D1 PSC human liver (PSC fibrotic region) 3174 25.65%
 
Table 6: Overview of the experimental datasets.
  Target Dataset     Metric     Method
                                Multimodal-based        Image-based                             ST-based
                                MEATRD      M3DM        SimpleNet   f-AnoGAN    PatchSVDD       DOMINANT   PREM   Spatial-ID   scmap   CAMLU
  10x-PSC-A1     AUC     0.657±0.073\mathbf{0.657}_{\pm 0.073} 0.475±0.0060.475_{\pm 0.006}     0.483±0.1290.483_{\pm 0.129} 0.647¯±0.001\underline{0.647}_{\pm 0.001} -     0.590±0.0430.590_{\pm 0.043} 0.567±0.0080.567_{\pm 0.008} 0.486±0.0060.486_{\pm 0.006} 0.500±0.0000.500_{\pm 0.000} 0.537±0.0740.537_{\pm 0.074}
    F1     0.629±0.085\mathbf{0.629}_{\pm 0.085} 0.235±0.0070.235_{\pm 0.007}     0.284±0.1140.284_{\pm 0.114} 0.479¯±0.001\underline{0.479}_{\pm 0.001} -     0.356±0.0460.356_{\pm 0.046} 0.291±0.0100.291_{\pm 0.010} 0.449±0.0090.449_{\pm 0.009} 0.415±0.0000.415_{\pm 0.000} 0.217±0.1180.217_{\pm 0.118}
  10x-PSC-B1     AUC     0.675±0.092\mathbf{0.675}_{\pm 0.092} 0.475±0.0060.475_{\pm 0.006}     0.481±0.1610.481_{\pm 0.161} 0.655¯±0.001\underline{0.655}_{\pm 0.001} -     0.547±0.0400.547_{\pm 0.040} 0.520±0.0300.520_{\pm 0.030} 0.519±0.0300.519_{\pm 0.030} 0.500±0.0000.500_{\pm 0.000} 0.503±0.0040.503_{\pm 0.004}
    F1     0.646±0.101\mathbf{0.646}_{\pm 0.101} 0.235±0.0070.235_{\pm 0.007}     0.274±0.1460.274_{\pm 0.146} 0.482±0.0010.482_{\pm 0.001} -     0.307±0.0310.307_{\pm 0.031} 0.231±0.0050.231_{\pm 0.005} 0.464±0.0370.464_{\pm 0.037} 0.625¯±0.000\underline{0.625}_{\pm 0.000} 0.017±0.0180.017_{\pm 0.018}
  10x-PSC-C1     AUC     0.664¯±0.069\underline{0.664}_{\pm 0.069} 0.497±0.0130.497_{\pm 0.013}     0.468±0.1560.468_{\pm 0.156} 0.683±0.001\mathbf{0.683}_{\pm 0.001} -     0.595±0.0600.595_{\pm 0.060} 0.508±0.0080.508_{\pm 0.008} 0.517±0.0150.517_{\pm 0.015} 0.500±0.0000.500_{\pm 0.000} 0.501±0.0020.501_{\pm 0.002}
    F1     0.631±0.078\mathbf{0.631}_{\pm 0.078} 0.259±0.0120.259_{\pm 0.012}     0.268±0.1310.268_{\pm 0.131} 0.530¯±0.001\underline{0.530}_{\pm 0.001} -     0.344±0.0570.344_{\pm 0.057} 0.245±0.0060.245_{\pm 0.006} 0.470±0.0120.470_{\pm 0.012} 0.411±0.0000.411_{\pm 0.000} 0.008±0.0060.008_{\pm 0.006}
  10x-PSC-D1     AUC     0.655±0.071\mathbf{0.655}_{\pm 0.071} 0.498±0.0120.498_{\pm 0.012}     0.473±0.1820.473_{\pm 0.182} 0.639¯±0.001\underline{0.639}_{\pm 0.001} -     0.534±0.0910.534_{\pm 0.091} 0.519±0.0010.519_{\pm 0.001} 0.575±0.0180.575_{\pm 0.018} 0.500±0.0000.500_{\pm 0.000} 0.508±0.0110.508_{\pm 0.011}
    F1     0.627±0.082\mathbf{0.627}_{\pm 0.082} 0.266±0.0130.266_{\pm 0.013}     0.287±0.1720.287_{\pm 0.172} 0.464±0.0010.464_{\pm 0.001} -     0.290±0.0710.290_{\pm 0.071} 0.229±0.0010.229_{\pm 0.001} 0.512¯±0.014\underline{0.512}_{\pm 0.014} 0.408±0.0000.408_{\pm 0.000} 0.042±0.0360.042_{\pm 0.036}
  Mean     AUC     0.663\mathbf{0.663} 0.4860.486     0.4760.476 0.656¯\underline{0.656} -     0.5670.567 0.5290.529 0.5240.524 0.5000.500 0.5120.512
    F1     0.633\mathbf{0.633} 0.2490.249     0.2780.278 0.489¯\underline{0.489} -     0.3240.324 0.2490.249 0.4740.474 0.4650.465 0.0710.071
     
Table 7: Performance evaluation of anomalous tissue region detection across four primary sclerosing cholangitis (PSC) liver ST datasets. The table presents the results in terms of AUC and F1 scores, with each cell showing the average score from five independent runs and the corresponding standard deviation. The best score for each dataset is bolded, and the second-best score is underlined.

Appendix E Implementation

Dataset Descriptions

As detailed in Table 6, we conducted extensive experiments on two types of disease datasets to validate the generalizability of MEATRD:
Breast cancer: The breast cancer ST datasets used in this study include eight reference 10x Visium datasets (Kumar et al. 2023) derived from healthy human breast tissues, denoted as 10x-hNB-{v03-v10} (https://cellxgene.cziscience.com/collections/4195ab4c-20bd-4cd3-8b3d-65601277e731), and nine target 10x Visium datasets (Andersson et al. 2021) derived from human breast cancer tissues, denoted as 10x-hBC-{A1-I1} (https://github.com/almaan/her2st and https://zenodo.org/records/10437391).
Primary sclerosing cholangitis (PSC) (https://cellxgene.cziscience.com/collections/0c8a364b-97b5-4cc8-a593-23c38c6f0ac5): The reference 10x Visium datasets, denoted as 10x-hLiver-{A1-D1}, comprise four healthy human liver slices, and the target 10x Visium datasets, denoted as 10x-PSC-{A1-D1}, are collected from four primary sclerosing cholangitis slices.
The reference datasets are collectively used during training, and the target datasets are used during inference only.

Data Preprocessing

For each ST dataset, genes detected in fewer than 10 spots are excluded. Then, raw gene expression counts are normalized by library size and log-transformed. 3000 highly variable genes (HVGs) are selected as inputs to the model using the SCANPY package (Wolf, Angerer, and Theis 2018).
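
These preprocessing steps correspond to standard SCANPY calls; the sketch below assumes a 10x Visium dataset loaded into an AnnData object, with the data path being a placeholder.

```python
import scanpy as sc

# Load one 10x Visium ST dataset (placeholder path) into an AnnData object: spots x genes
adata = sc.read_visium("path/to/visium_dir")
adata.var_names_make_unique()

sc.pp.filter_genes(adata, min_cells=10)            # drop genes detected in fewer than 10 spots
sc.pp.normalize_total(adata)                       # library-size normalization per spot
sc.pp.log1p(adata)                                 # log-transform the normalized counts
sc.pp.highly_variable_genes(adata, n_top_genes=3000, subset=True)  # keep 3000 HVGs as model input
```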

Implementation Details

MEATRD is implemented using PyTorch (Paszke et al. 2019). In Stage I, we adopt the default architecture of the Mobile-UNet with an output embedding dimension of 256. In Stage II, the gene encoder is a two-layer MLP, while the image encoder is the frozen Mobile-UNet encoder from Stage I. The MGDAT network includes three MGDAT blocks, each having a two-layer, four-headed transformer to generate 16-dimensional fused bottleneck embeddings in equation (5), and an attention layer with an input dimension of 272 and an output dimension of 256 in equation (8). The ResNet-based image decoder in this stage comprises eight residual blocks, while the gene decoder is a single-layer GNN with an output dimension of 3000. In Stage III, the image encoder is an eight-layer ResNet, while the gene encoder is consistent with that in Stage II. The FFN for multimodal data fusion is structured as a two-layer MLP with an output dimension of 256. In all stages, training is conducted with the Adam optimizer, a batch size of 128, and a learning rate of 1e-4. Stage I is trained for 30 epochs, Stage II for 10 epochs, and Stage III for 5 epochs. Finally, the weight parameters are set to α=0.5 and β=1. For all baselines but M3DM and Spatial-ID, we adopted the recommended or default settings in the original studies. M3DM, designed for natural images and point clouds, has an encoder unsuitable for ST data and histology images. Therefore, we replaced its encoders with MEATRD's encoders for a fair comparison. Since Spatial-ID is pretrained using single-cell sequencing (scRNA-seq) data, we skipped its pretraining step and directly utilized the pretrained model.

Appendix F Further Experiments

Supplement Results of Anomalous Tissue Region Detection

In this section, we present the performance of MEATRD in testing for primary sclerosing cholangitis. MEATRD is trained on four healthy human liver ST datasets (i.e., 10x-hLiver-{A1-D1}) and tested on four human PSC ST datasets (i.e., 10x-PSC-{A1-D1}). Table 7 highlights MEATRD's superiority over baseline models in detecting ATRs across datasets, ranking first in AUC scores on three of the four datasets and achieving the highest F1 scores on all of them. The experiment yields similar results in terms of AP scores, as shown in Table 3 in supplementary material F.

Ablation Studies

 Backbones Params AUC F1 Training time (min) Inference time (s)
  MobileUNet 46.17M 0.723 0.741 11.984 0.354
ResNet-18 53.71M 0.717 0.729 12.927 0.455
VGG-19 133.402M 0.696 0.703 19.887 0.489
  MoCo 53.71M 0.712 0.726 26.435 0.449
 
Table 8: Ablation study of backbones in MEATRD across eight human breast cancer datasets.
                                                                 Model Params MFlops Complexity Training time (s) Inference time (s) Memory Usage (GB)
     MEATRD Stage I 0.73M 32.746 𝒪(c1|V|)\mathcal{O}(c_{1}|V|) 100.02 - 1.70
Stage II 46.17M 405.909 𝒪(c2|V|+c3|E|+c4|D|2)\mathcal{O}(c_{2}|V|+c_{3}|E|+c_{4}|D|^{2}) 469.02 0.15 4.65
Stage III 5.22M 29.993 𝒪(c5|V|)\mathcal{O}(c_{5}|V|) 150.00 0.21 5.15
                                                                 M3DM 97.37M 8,028.937 - 468.00 20.75 1.60
SimpleNet 72.82M 238.954 - 0.72 7.72 0.57
f-AnoGAN 1.30M 118.496 - 1322.80 3.82 0.14
PatchSVDD 0.17M 1.083 - 5761.08 584.33 0.16
PREM 0.38M 0.768 - 326.58 0.01 0.02
DOMINANT 0.40M 0.396 - 0.83 0.01 11.34
Spatial-ID 4.36M 20.555 - 501.07 0.64 0.21
Scmap - - - 2.02 0.16 -
CAMLU 1.62M 1.619 - 125.57 0.64 -
 
Table 9: The overall training time on eight 10x-hNB datasets, including a total of 20,183 spots. Each spot is associated with a 3000-dimensional gene expression vector and a histology image patch of size 32x32.

Using multiple data modalities. In this evaluation, MEATRD is adapted to use either histology images or ST data alone, by skipping the multimodal data fusion step and omitting the branch for learning the other data modality. Using only histology images reduces MEATRD's average AUC scores by 31.26% and F1 scores by 26.59%, while using ST data alone lowers its average AUC scores by 12.72% and F1 scores by 9.99%. These findings corroborate the synergistic effect of histology images and ST data in enhancing ATR detection.
Multimodal data fusion using fused bottleneck embedding. To assess the impact of multimodal data fusion on ATR detection, we substitute the cross-modal bottleneck embedding-guided fusion with a direct concatenation of image and gene embeddings. The comparison reveals that using bottleneck embedding for multimodal data fusion contributes to an average increase of 13.15% in AUC scores and 7.55% in F1 scores over the simple concatenation method. This improvement underscores our approach’s efficacy in enhancing multimodal embeddings by collating and condensing the most relevant information from each data modality.
Masking for target node reconstruction. Introducing target-node-masking, which strategically omits self-information during the target node reconstruction, theoretically mitigates the model's over-generalization issue: it prevents the model from "learning too well" to replicate its input, which would otherwise lead to minimal reconstruction errors even for anomalies. To validate the effectiveness of this technique, we compare the latent multimodal reconstruction errors, defined in equation (13), with and without target-node-masking during inter-node message passing. The violin plots in Figure 5 show that this masking not only increases reconstruction errors for both inliers and anomalies but also amplifies the discrepancy between their reconstruction errors. This enhanced discrepancy in turn aids MEATRD's discriminative model in Stage III to separate anomalies from inliers, contributing to an average increase of 10.38% in AUC scores and 6.01% in F1 scores when compared to the omission of target-node-masking.

Refer to caption
Figure 5: Violin plots illustrating the distributions of latent multimodal reconstruction errors for inliers (blue) and anomalies (yellow) with (“Full”) and without (“w/o TNM”) the implementation of target node masking (TNM).

Multimodal reconstruction losses in one-class classifier. Most one-class classification methods directly utilize inlier embeddings in the reference to determine the normal data distribution and thus rely heavily on the quality of instance embeddings, which, however, is sensitive to batch effects across datasets (Ouardini et al. 2019). Here, we investigate whether using latent multimodal reconstruction losses, which avoid cross-batch comparisons, as an alternative input to MEATRD's one-class classifier can improve the accuracy of ATR detection. For this purpose, instance embeddings generated by the MGDAT network instead of the latent reconstruction losses are input to the one-class classifier in Stage III. We find that this modification reduces MEATRD's performance, with an average decline of 11.20% in AUC scores and 7.56% in F1 scores, demonstrating the necessity of using latent reconstruction losses in this context.
One-class classifier. Finally, to assess the effect of one-class classification in ATR detection, we remove the entire Stage III and use the weighted sum of image and gene reconstruction errors, defined in equation (10) in the main text, as anomaly scores. The direct use of reconstruction errors leads to suboptimal performance, as indicated by an average decrease of 19.23% in AUC scores and 14.84% in F1 scores. This finding suggests that collapsing the multimodal reconstruction losses of inliers into a compact hypersphere in the latent space boosts the separation of inliers and anomalies, mitigating model over-generalization.
Mobile-UNet as the pretrained visual feature extractor. Here, we replace Mobile-UNet with three pretrained visual feature extractors widely used for natural images, namely VGG-19 (Simonyan and Zisserman 2014), ResNet-18 (He et al. 2016), and MoCo (He et al. 2020), to extract visual features from histology images of the eight human breast cancer datasets in Stage I. As shown in Table 8, MEATRD's performance declines with these networks, as indicated by lower AUC and F1 scores. This is likely because tissue images contain complex patterns and features specific to biological structures, which may not be effectively captured by networks optimized for natural image recognition. In particular, data augmentation techniques, e.g., blurring and resizing, used by contrastive learning approaches can generate "positive" images whose semantics significantly deviate from the original image.

Sensitivity Analysis

Table 2 shows the average model performance over five independent runs across the eight 10x-hBC datasets. We observe that heavily weighting image data (e.g., α=0.9 or β=0.9) compromises model performance due to inadequate utilization of gene information for ATRs that are visually indistinguishable in histology images. On the other hand, overly weighting ST data (α=0.1 or β=0.1) also reduces performance, though less severely than overweighting image data, likely owing to the higher signal-to-noise ratio of ST data. Additionally, optimal model performance is achieved with a small bottleneck embedding dimension (e.g., 16), aligning with our theoretical analysis that nuisance information is minimized in the condensed bottleneck embedding. In contrast, a larger dimension for the one-class classifier (e.g., 256) is beneficial, providing more flexibility for collapsing inlier embeddings into a hypersphere. The best results are obtained with three MGDAT layers, balancing message passing within the graph against over-smoothing. Lastly, variations in the number of attention heads in MGDAT and the visual feature dimension have a relatively minor impact on model performance.

Complexity Analysis

In this section, we first theoretically analyze MEATRD’s model complexity. As shown in Table 9, Stage I is built on a 36-layer CNN network comprising lightweight inverted residual blocks, with 0.73M parameters and a time complexity of 𝒪(c1|V|)\mathcal{O}(c_{1}|V|) (He and Sun 2015), where |V||V| denotes the number of instances. In Stage II, the main complexity arises from the MGDAT blocks, which compute feature-level and node-level attentions with complexities of 𝒪(c2|V|)\mathcal{O}(c_{2}|V|) and 𝒪(c3|E|)\mathcal{O}(c_{3}|E|), respectively  (Veličković et al. 2018). Here, |E||E| denotes the number of graph edges and |E||V|2|E|\ll|V|^{2}. This stage has 46.17M parameters. Stage III mainly involves the lightweight ResNet, which has a time complexity of 𝒪(c5|V|)\mathcal{O}(c_{5}|V|)  (He and Sun 2015), and it has 5.22M parameters.

Empirical training on 20,183 image patches shows that Stage I has 32.746 MFlops and a training time of 100.02 s, Stage II has 405.909 MFlops and a training time of 469.02 s, and Stage III has 29.993 MFlops and a training time of 150 s. Inference time is negligible compared to training time. MEATRD's total time cost is comparable to that of the well-established methods M3DM and Spatial-ID.

Robustness to Noisy Data

The quality of ST data is crucial for the model's performance, and β can be used to balance the influence of the two data sources accordingly. Specifically, when the quality of ST data is low relative to that of image data, we set a higher β to increase the model's reliance on image data, and vice versa. In our experiments with human breast tissue datasets (10x-hNB), where image and ST data have comparable quality, we set β to 0.5. Typically, ST data quality is assessed using the average location-wise zero proportion \bar{z} (Zhu et al. 2023), i.e., the average proportion of zero-read-count genes per spot, with lower values indicating higher signal-to-noise ratios. To explore a more systematic setting of β, we altered \bar{z} of the 10x-hNB datasets by randomly masking gene read counts and tested various β values, as shown in the table below:

\bar{z} \ β   0.1     0.3     0.5     0.7     0.9
0.925        0.699   0.713   0.723   0.707   0.654
0.950        0.655   0.696   0.712   0.701   0.657
0.975        0.644   0.652   0.658   0.697   0.655

We find that, when \bar{z}\leq 0.95, the default setting β=0.5 consistently yields the best results; only in the extreme case of \bar{z}=0.975, where the gene expression data are too noisy, is it necessary to set β=0.7. Considering this relationship, we also provide an adaptive strategy that sets β as a heuristic function of \bar{z}:

β={0.5,z¯0.950.5+0.5sigmoid(200(z¯0.975)),z¯>0.95\beta=\begin{cases}0.5,&\bar{z}\leq 0.95\\ 0.5+0.5\textrm{sigmoid}\left(200(\bar{z}-0.975)\right),&\bar{z}>0.95\end{cases}
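
For reference, a small Python sketch of this adaptive strategy is given below; the function names are illustrative, and the read-count matrix X is assumed to be a spots-by-genes array.

```python
import numpy as np

def mean_zero_proportion(X):
    """Average location-wise zero proportion z_bar of an N x G read-count matrix
    (proportion of zero-read-count genes per spot, averaged over spots)."""
    return float(np.mean((X == 0).mean(axis=1)))

def adaptive_beta(z_bar):
    """Heuristic mapping from ST data quality (z_bar) to the image weight beta,
    following the piecewise function above."""
    if z_bar <= 0.95:
        return 0.5
    return 0.5 + 0.5 / (1.0 + np.exp(-200.0 * (z_bar - 0.975)))  # 0.5 + 0.5 * sigmoid(200 (z_bar - 0.975))
```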