
Magnifying Networks for Images with Billions of Pixels

Neofytos Dimitriou   Ognjen Arandjelović
 Department of Computer Science
University of St Andrews
KY16 9SX, United Kingdom
{neofytosd,ognjen.arandjelovic}@gmail.com
Abstract

The shift towards end–to–end deep learning has brought unprecedented advances in many areas of computer vision. However, deep neural networks are typically trained on images with resolutions that rarely exceed 1,000×1,000 pixels. The growing use of scanners that create images with extremely high resolutions (on average around 100,000×100,000 pixels) therefore presents novel challenges to the field. Most of the published methods preprocess high–resolution images into a set of smaller patches, imposing an a priori belief on the best properties of the extracted patches (magnification, field of view, location, etc.). Herein, we introduce Magnifying Networks (MagNets) as an alternative deep learning solution for gigapixel image analysis that neither relies on a preprocessing stage nor requires the processing of billions of pixels. MagNets can learn to dynamically retrieve any part of a gigapixel image, at any magnification level and field of view, in an end–to–end fashion with minimal ground truth (a single “global”, slide–level label). Our results on the publicly available Camelyon16 and Camelyon17 datasets corroborate the effectiveness and efficiency of MagNets and the proposed optimization framework for whole slide image classification. Importantly, MagNets process far fewer patches from each slide than any of the existing approaches (10 to 300 times fewer).

Keywords Computational Pathology, Digital Pathology, Multi–instance learning, Weak supervision, Camelyon16, Camelyon17

1 Introduction

Figure 1: An illustration of the architecture of MagNet. The depicted model consists of four magnifying layers and a classification layer. For each ROI of each magnifying layer, the appropriate image resolution is set based on the level of magnification thus far. Note that the ROIs of the last layer can span different magnification levels, with varying levels of fidelity, thereby providing information across multiple resolutions and multiple fields of view.

Convolutional neural networks (CNNs) have been the major drive behind the paradigm shift of computer vision towards deep learning. Over the years, a cornucopia of convolution-based network topologies have been described with key differences in their depth, width, and connectivity patterns [42, 18, 45, 51, 20, 6, 29, 21, 39]. As a consequence of the continuous improvements and the widespread availability of standardized data sets [11, 24, 28, 13, 32, 1, 10], end-to-end trained CNNs dominate the area of visual object recognition.

In parallel with the increase in data availability, hardware advancements have made it possible to both capture and store higher resolution images. One of the most extreme yet practically important examples can be found in digital pathology and in particular, in the task of whole slide image (WSI) classification [12, 4]. WSIs are digitized microscope slides, often several gigabytes in size, that have a typical resolution of 100,000×100,000 pixels [50, 12]. Conventional neural network optimization using such images is practically infeasible considering the associated memory and compute requirements.

Herein, we introduce a new family of neural networks, henceforth referred to as Magnifying Networks (MagNets), for images that have billions of pixels. MagNets use an attention based mechanism to decide, on a coarse to fine basis, which regions of the gigapixel image need to be analysed at an increasingly fine scale, processing extremely large images in a task and data specific manner. Incidentally, this is conceptually similar to a pathologist’s knowledge and attention based use of magnification with a brightfield microscope. A microscope has multiple magnification settings that enable the user to view a specimen at different scales. Starting at the lowest magnification setting, the entire specimen can be observed. As the magnification is increased, finer detail becomes accessible, while at the same time a smaller part of the specimen is displayed. During a visual examination, the clinician finds areas of interest at lower magnification levels and then examines them further at higher and higher magnification levels, accruing in the process information from all magnification levels that collectively enables a clinical decision to be made. Similarly, a MagNet starts at the lowest magnification level and recursively identifies, magnifies, and analyses areas of interest in increasingly fine-grained detail (see Figure 1). While remaining within the realm of weakly supervised learning, we extend the spatial transformer module [21] with a differentiable upsampling mechanism. Depending on the amount of magnification at the current magnifying layer, a version of the WSI at a higher resolution can be accessed by the subsequent layer, as illustrated in Figure 1. MagNets provide a novel way of solving both the “where” and the “what” problems of gigapixel image analysis in an end-to-end fashion [12]. Importantly, as we show in our experiments, our models can be optimized without the need for extra supervision for the “where” problem (e.g. bounding boxes).

In this work, we focus on training and evaluating MagNets for WSI classification [12]. In particular, we conduct experiments by benchmarking MagNets on the Camelyon data sets. The Camelyon challenge provides a fitting problem for MagNets, considering the varying granularity that needs to be accessed in order to predict macro–metastases, micro–metastases, and isolated tumour cells (ITC) from WSIs. Given the innate transparency of a model with hard attention, the absence of preprocessing requirements, and the ability to perform both localization and classification tasks with no additional information (only slide-level labels are used [12]), it would be no exaggeration to say that MagNets have the potential to revolutionize fields such as Digital Pathology. Our contributions are:

  • We propose the possibility of identifying and magnifying ROIs starting from a very low resolution downsampled version of the WSI (56×56×3 dimensions). We experimentally show that recursively identifying and magnifying ROIs allows for the extraction of informative areas across magnification levels, without having to preprocess billions of pixels.

  • We introduce a novel form of the spatial transformer module so that we can explore the possibility of “learning to zoom” for gigapixel images, without leaving the weakly supervised paradigm.

  • To the best of our knowledge, this is the first work that automatically finds, and fuses information from, multiple learnt magnifications on WSIs. The proposed method is able to exploit rich contextual and salient features, overcoming the typical problem of patch–based processing, which poorly captures information that is distributed beyond the patch size. This is an important step towards creating a network architecture that can generalize across different modalities in computational pathology.

The rest of the paper is organised as follows. In Section 2 we provide further context for gigapixel image analysis by discussing the related work in the field, and in particular, the existing ideas for tackling the “where” problem. In Section 3, we introduce the key novelty of the present work, that is, explain the technical underpinnings of MagNets. In Section 4, we provide more details on the data, as well as the optimization framework. Finally, in Section 5, we discuss our results, and conclude our work with Section 6.

2 Related Work

Figure 2: An illustration of a single magnifying layer that outputs two patches. The convolutional layers are independent between the two patches. The red squares illustrate the affine transformation based on the predicted theta parameters. Note that if this were the last magnifying layer, the image size of the patches would be 224×224.

Although other computer vision approaches exist, CNN–based methodologies have emerged as the most effective and popular choice for automatically learning image features rather than handcrafting them [13, 1]. Therefore, we focus on CNN–based methodologies for WSI classification. Moreover, in order to better highlight the contribution of MagNets, we organise the related work according to how it approaches the “where” problem.

A CNN typically excels with image sizes of less than one million pixels [42, 18, 45, 51, 20]. Moreover, although there have been a few recent works that explored the use of higher resolution images (e.g. up to 8192×8192 [36]), the current state of the hardware does not enable CNN–based learning directly from images with billions of pixels. Therefore, on top of optimizing for better visual understanding, practitioners also need to come up with ways of either approximating the spatial distribution of the signal from gigapixel images, or performing some form of dimensionality reduction on the WSIs themselves.

2.1 Patch extraction

2.1.1 Strongly supervised

One way of identifying and extracting the signal from gigapixel images relies on the use of annotations from domain–specific experts. More specifically, for WSIs, patches based on annotations by pathologists can be extracted in such a way as to ensure a balanced training data set. In essence, the availability of pixel–level annotations turns the “where” problem into a trivial challenge that is often addressed with some type of preprocessing pipeline. There exists a relatively large body of work that follows this paradigm [33, 49, 31, 26, 25, 52, 27, 44]. Most of these approaches extract the ROIs from a single magnification level, e.g. the largest available at 20× or 40×. A few, such as the approach of Sui et al. [44], instead extract patches by tiling annotated areas at multiple magnification levels. The work of Gecer et al. [14] is also of particular interest as they built a system conceptually similar to MagNet. Four separate fully convolutional networks were trained to imitate the actions of pathologists at selecting the right magnification. The training data set was constructed from recordings of pathologists zooming into WSIs to carry out a specific clinical task. The fully–supervised nature of this approach, however, limits its applicability, since for many clinical tasks annotations of this extent are either extremely laborious and expensive to obtain, or simply infeasible.

2.1.2 Weakly supervised

In the absence of pixel–level annotations, the literature is divided into three main ways of tackling the “where” problem. The most prominent approach is to tile the entirety of a WSI, perhaps only excluding patches that do not meet certain criteria (e.g. Otsu thresholding, entropy, HSV colour space transformation, etc.) [37, 30, 48, 34, 9, 40, 47, 5, 19]. The second approach involves random sampling from a grid–like patch population [16, 7]. Methodologies that use either of the above two approaches typically compensate for the simplistic solution to the “where” problem in the later parts of their pipelines. For example, a few recent studies have employed instance–level self–supervision, under the multi–instance learning paradigm, to mitigate the highly unbalanced nature of tiling (the noise–to–signal ratio can be extremely high in WSIs) [30, 9].

More related to our proposed methodology is the body of work using the third approach. The idea is to allow the right patches to be decided by the model without having to first process them in one way or another. This has been attempted by using different types of attention network [3, 38, 35]. BenTaieb and Hamarneh [3] employed a recurrent visual attention network to find sub-regions to analyse within 5,000×5,000 resolution patches. Notably, these high–resolution images were tiled from fixed magnification scales. In fact, processing higher resolution patches came as a consequence of not employing any type of upsampling within the method. On the other hand, Qaiser and Rajpoot [38] used a non–differentiable attention network on 1,024×1,024 resolution patches at 2.5× magnification scale to identify, extract, and process patches from higher, predefined magnification scales (10× or 20×). In general, we find that all of the existing attention–based methodologies for WSI classification still employ heavy preprocessing steps that inevitably impose a priori beliefs on the best magnification scale, field of view, location, etc. of the extracted patches, not to mention the computational burden that is associated with them.

The framework we introduce in this paper allows for a multi-resolution, multi-field of view neural network to be optimized in an end-to-end fashion with gigapixel images without requiring initial preprocessing pipelines, and without having to process billions of pixels.

2.2 Spatial Transformers

The most prominent use cases of spatial transformers (STs) are spatial invariance [41] and supervised semantic segmentation [23, 8, 17]. To the best of our knowledge, there is no published work at the intersection of WSI classification and STs. Even beyond WSI classification, STs have rarely been trained in a weakly–supervised fashion [43, 15, 2]. However, the setting of our work differs significantly from these works, and therefore no meaningful comparison can be made.

3 Magnifying Networks

A MagNet consists of $N$ magnifying layers followed by a classification layer. The magnifying layers are responsible for identifying the signal within a WSI, whereas the classification layer is concerned with the visual understanding of the extracted signal in relation to the task at hand. Consider a single gigapixel image $I_0$ that will pass through a MagNet.

3.1 Magnifying Layer

Resize and pad.

As we subsequently employ convolutional layers expecting 56×56 pixel images, the input $I$, either a single image (e.g. $I_0$) or a set of images, is resized to a $56\times h_i$ or $w_i\times 56$ resolution using bilinear interpolation, with the smaller side, $h_i$ or $w_i$, then symmetrically padded (new pixels are black) so that $h_i=56$ or $w_i=56$ accordingly. For the purpose of up-sampling (see the “Sampling” paragraph), a larger version $I'$ (112×112 resolution) is also generated using the same protocol.

Note that although preliminary experiments were conducted using larger images as input to the magnifying layers ($I$ and $I'$ with 112×112 and 224×224 resolutions respectively), single GPU training of multiple, stacked magnifying layers was not possible at these resolutions.
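As a concrete illustration, the following is a minimal PyTorch sketch of the resize-and-pad step described above, assuming the input arrives as a (B, 3, H, W) tensor; the helper name and exact calls are illustrative rather than the authors’ implementation.

```python
import torch.nn.functional as F

def resize_and_pad(img, target=56):
    """Resize the longer side to `target` with bilinear interpolation, then
    symmetrically pad the shorter side with black pixels up to `target`."""
    _, _, h, w = img.shape                      # (B, 3, H, W)
    scale = target / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    img = F.interpolate(img, size=(new_h, new_w), mode="bilinear", align_corners=False)
    pad_h, pad_w = target - new_h, target - new_w
    # F.pad order for the last two dims is (left, right, top, bottom)
    return F.pad(img, (pad_w // 2, pad_w - pad_w // 2, pad_h // 2, pad_h - pad_h // 2))

# I = resize_and_pad(x, 56); I_prime = resize_and_pad(x, 112)   # the two copies used per layer
```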

Convolutional layers.

The salient parts of each image (e.g. areas with tissue at the lowest magnification level) vary significantly in size, as can be observed in Figure 3. Therefore, the right kernel size for the convolutional operations varies depending on $I$. As such, we stack convolutional operations with different kernel sizes, similarly to InceptionNet–v3 [46]. Further information is provided in the Appendix. Our MagNets employ three stacked convolutional layers for each patch independently, e.g. a 2-layer MagNet with two patches extracted at each magnifying layer has six such stacks (two at the first layer, and four at the second).
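The exact configuration of these layers is given in the authors’ appendix; the block below is only a generic sketch of the multi-kernel idea (Inception-style parallel convolutions concatenated along the channel axis), with assumed branch and channel counts.

```python
import torch
import torch.nn as nn

class MultiKernelBlock(nn.Module):
    """Parallel convolutions with different kernel sizes, concatenated along the
    channel axis. Branch count and channel widths here are illustrative only."""
    def __init__(self, in_ch=3, branch_ch=8):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, branch_ch, kernel_size=k, padding=k // 2)
            for k in (1, 3, 5)                  # small and large receptive fields
        ])
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                       # x: (B, in_ch, 56, 56)
        return self.act(torch.cat([b(x) for b in self.branches], dim=1))
```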

Spatial Transformer.

An ST consists of three parts: a localization network, a grid generator, and a sampler [21].

The localization network is typically an FCNN or a recurrent neural network [43] that receives an input from a CNN, and its role is to output the parameters of a spatial transformation. With scalability in mind, both options were prohibitive (in terms of GPU VRAM) for MagNet, as each ST would have introduced a large number of parameters to be optimized.

Instead, MagNets utilize a spatial sparsemax at the last convolutional layer, whose output can be used to infer the affine transformation parameters ($s$, $t_x$, $t_y$) directly. In particular, the dimensions of the output of the last convolutional layer are the same as the input image, i.e. in our case 56×56 pixels. The output can therefore be thought of as a probability mass function, for which, following the spatial sparsemax operation, the expected value translates to the scaling parameter ($s$), and the expected $L_2$ operation over both the x–axis and the y–axis corresponds to the translation parameters ($t_x$, $t_y$). More information is provided in the Appendix.

Given the transformation parameters $s$ for isotropic scaling and $t_x$, $t_y$ for translation on each axis, we further constrain the parameters as follows:

$$s = \max(s, 0.05) \qquad (1)$$
$$t_x = \tanh(t_x) \qquad (2)$$
$$t_y = \tanh(t_y) \qquad (3)$$

for the spatial (affine) transformation $\theta$,

$$\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix} \qquad (4)$$

The $\tanh$ constraint on the translation parameters implicitly forces the network to favour centre extraction, whereas the minimum bound imposed on the scaling helped ensure that the gradients of some STs do not vanish during the early stages of training.

The grid generator then creates the desired grid by applying $\theta$ to a meshgrid with dimensions 56×56. Using a differentiable sampler, e.g. bilinear sampling, an image can be transformed by $\theta$ by interpolating it onto the grid.
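A minimal PyTorch sketch of Eqs. (1)–(4) together with the grid generator and a bilinear sampler is shown below. It assumes that the raw parameters have already been produced by the spatial sparsemax described above, and uses the standard affine-grid/grid-sample API; the function names are ours, not the authors’.

```python
import torch
import torch.nn.functional as F

def build_theta(s_raw, tx_raw, ty_raw):
    """Constrain the raw ST outputs as in Eqs. (1)-(3) and assemble theta (Eq. 4)."""
    s = torch.clamp(s_raw, min=0.05)            # Eq. (1): lower-bounded isotropic scale
    tx = torch.tanh(tx_raw)                     # Eq. (2): translation constrained to (-1, 1)
    ty = torch.tanh(ty_raw)                     # Eq. (3)
    zeros = torch.zeros_like(s)
    row1 = torch.stack([s, zeros, tx], dim=-1)
    row2 = torch.stack([zeros, s, ty], dim=-1)
    return torch.stack([row1, row2], dim=-2)    # (B, 2, 3) affine matrices

def extract_roi(image, theta, out_hw=(56, 56)):
    """Grid generator plus a differentiable sampler (bilinear here, for brevity)."""
    B, C = image.shape[:2]
    grid = F.affine_grid(theta, size=(B, C, *out_hw), align_corners=False)
    return F.grid_sample(image, grid, mode="bilinear", align_corners=False)
```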

Sampler

Early on, while experimenting with bilinear interpolation, it became obvious that it was a poor choice of sampler for our work. For more information we direct the reader to the excellent work of Jiang et al. [22], which later came out corroborating our empirical analysis and providing an alternative sampler, called Linearized Multi-Sampling, whose gradients are not affected by the amount of scaling performed (i.e. how much the network has zoomed in); part of our empirical analysis is included in the Appendix. We use the original implementation of this sampler, as provided by Jiang et al. [22].

Sampling

This is the part that makes each layer "magnifying", and constitutes our main technical contribution to the field. Given $\theta$ and $I$, we can transform $I$ by $\theta$ (a simple matrix operation) to retrieve an image containing only the detected ROI. MagNet instead applies the transformation $\theta$ to $I'$, thereby allowing the output to contain finer-grained information that was not present in $I$. An example of a magnifying layer that outputs two patches is shown in Figure 2. The introduction of new information by transforming images that come from a higher magnification level is what enables the idea of stacking multiple magnifying layers together.
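Under the same assumptions as the previous sketch, the magnifying step amounts to applying the learnt $\theta$ to the higher-resolution copy $I'$ rather than to $I$; the snippet below is illustrative only.

```python
# Sketch of the magnifying sampling step (reuses extract_roi from the previous
# snippet). I is the 56x56 layer input; I_prime is the 112x112 copy of the same
# content produced during resize-and-pad.
def magnify(I_prime, theta, last_layer=False):
    # Sampling I_prime with the same normalized affine grid keeps the ROI
    # identical but injects finer-grained pixels that were absent from I.
    out_hw = (224, 224) if last_layer else (56, 56)   # the last layer feeds the classifier
    return extract_roi(I_prime, theta, out_hw=out_hw)
```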

3.2 Classification Layer

The images of the last magnifying layer are sampled using a grid with 224×224 pixel resolution (instead of 56×56 pixels). These images are forwarded through an ImageNet–pretrained CNN (InceptionNet-v3) whose feature maps are fed into a Gated Recurrent Unit (GRU) network. The output of the GRU is passed through an FCNN (two layers with 512 and 256 hidden neurons respectively) to output a slide–level estimate $\hat{y}$ of whether the given WSI contains cancer or not.
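A sketch of this classification layer is given below, assuming 2048-dimensional pooled InceptionNet-v3 features and a 512-unit GRU; these sizes, and the helper class itself, are assumptions rather than the authors’ exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class ClassificationLayer(nn.Module):
    """ImageNet-pretrained InceptionNet-v3 features per patch, aggregated by a
    GRU and mapped to a slide-level prediction by a two-layer FCNN (512, 256)."""
    def __init__(self, hidden=512):
        super().__init__()
        self.backbone = models.inception_v3(weights="IMAGENET1K_V1")
        self.backbone.fc = nn.Identity()            # keep the 2048-d pooled features
        self.gru = nn.GRU(2048, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, patches):                     # patches: (B, T, 3, 224, 224)
        B, T = patches.shape[:2]
        feats = self.backbone(patches.flatten(0, 1))
        if isinstance(feats, tuple):                # train mode returns (logits, aux_logits)
            feats = feats[0]
        out, _ = self.gru(feats.view(B, T, -1))
        return torch.sigmoid(self.head(out[:, -1])) # slide-level estimate y_hat
```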

3.3 Auxiliary classifiers

A form of both self–supervision and weak–supervision is introduced by using two auxiliary classifiers. These are ImageNet–pretrained ResNet–18 networks [18] that output a slide-level prediction using the extracted images from magnifying layer 1 and layer 3 respectively. Cross-entropy between the slide-level labels and the ResNet-18 outputs is used in a weakly–supervised fashion. In addition, a paradoxical loss is also employed as a form of self–supervision [35]. The premise of the paradoxical loss is that the information presented in the layer 3 images should provide an equally good, or better, prediction than that from layer 1.
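As a loose sketch only: the exact paradoxical-loss formulation follows Maksoud et al. [35]; below it is approximated as a hinge penalty incurred whenever the layer-3 auxiliary classifier performs worse than the layer-1 one.

```python
import torch
import torch.nn.functional as F

def auxiliary_losses(logits_l1, logits_l3, label):
    """label: float tensor of slide-level ground truth (0 = normal, 1 = cancerous)."""
    ce_l1 = F.binary_cross_entropy_with_logits(logits_l1, label)  # weak supervision, layer 1
    ce_l3 = F.binary_cross_entropy_with_logits(logits_l3, label)  # weak supervision, layer 3
    paradoxical = torch.relu(ce_l3 - ce_l1)   # penalize layer 3 doing worse than layer 1
    return ce_l1 + ce_l3 + paradoxical
```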

3.4 Configurations

The network consists of $L$ magnifying layers, each of which can access increasingly higher magnification scales as determined dynamically from the degree of zooming (i.e. $s$) thus far. At each layer $l$, $P_l$ patches (ROIs) are extracted.

A consequence of the recurrent nature of MagNets is that an exponential number of patches are extracted and analysed from a single gigapixel image if more than one patch is extracted per layer. In particular, given a constant $P_l = P$ across the layers:

$$\text{Total patches} = \begin{cases} L, & \text{if } P = 1,\\ P^{L}, & \text{otherwise}.\end{cases}$$

We find that a combination of $[2,3]$ for $P_l$ (i.e. 2 ROIs are extracted in some magnifying layers, whereas 3 are extracted in others) provides a balance between a sufficient rate of expansion (breadth) and the ability to train MagNets of up to 4 layers (depth) on a GPU with 24 GB of VRAM.
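For reference, the closed form above and its per-layer generalization can be computed as follows; the [2, 3, 3] ordering shown is one configuration consistent with the 18 last-layer patches mentioned in the Figure 4 caption and is given purely for illustration.

```python
from math import prod

def total_patches(P, L):
    """Closed form from the text for a constant P patches per magnifying layer."""
    return L if P == 1 else P ** L

def last_layer_patches(P_per_layer):
    """Per-layer branching factors P_l; the last layer holds their product."""
    return prod(P_per_layer)

# total_patches(1, 4)            -> 4
# total_patches(2, 4)            -> 16
# last_layer_patches([2, 3, 3])  -> 18
```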

4 Evaluation

4.1 Camelyon data sets

The Camelyon data sets contain WSIs from surgically resected lymph nodes of breast cancer patients. These WSIs were independently curated across multiple hospitals [13, 32]. Camelyon16 includes images from 238 normal and 160 cancerous tissue sections, whereas the publicly available portion of Camelyon17 has a total of 500 WSIs (318 normal, 182 cancerous). In addition, in the case of metastasis, metadata is available on the extent of the metastasis (macrometastasis, micrometastasis, or isolated tumour cells (ITC)). Since only a small portion of the data set contains the much more difficult ITC cases, we exclude such cases from the training data set.

We follow the protocol described on the Camelyon competition website and, in addition, set aside 25% of Camelyon17 as a testing set. We shuffle the remaining WSIs from Camelyon17 with the Camelyon16 WSIs, train on 80% of them, and validate the better models on the remaining 20%. The best MagNets (based on the validation set) are retrained on both the training and validation data, and evaluated on the testing set.

The pixel-level annotations that are available for some of the WSIs of Camelyon are not used in our work. Instead, we only use the binary slide-level label that indicates the presence, or lack thereof, of cancerous cells somewhere in the gigapixel image.

Figure 3: Three WSIs from different hospitals showcasing the differences in the scanning process. The top row shows the unfiltered versions of the WSIs, whereas the bottom row shows the WSIs after the “grey” filter has been applied.
Figure 4: Examples from the testing set against the 3–layer MagNet. WSIs in the top square (blue) do not contain any malignancies, whereas the WSIs on the bottom (red) contain macro-metastasis, micro-metastasis, and ITC, in order from top to bottom. The arrows and red circles indicate the cancer based on the annotations provided by the pathologists. Note that only 10 out of the last layer’s 18 patches are shown for space efficiency.
Figure 5: A visualization of a forward pass of a WSI with micro-metastasis (from the testing set) through a 4-layer MagNet.

4.2 Data Augmentation

For any given image, we apply a filter that removes the “grey” pixels of the image. This includes any white background, as well as gradients and other artifacts whose red, green, and blue channel values are close together (threshold set to 15). The effects of the filter are illustrated in Figure 3. We employ neither colour normalization nor random colour perturbation [33]. The synthetic data augmentation we perform is based on horizontal and vertical mirroring, and rotation by 90, 180, and 270 degrees.
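A minimal sketch of the “grey” filter, assuming a (3, H, W) uint8 PyTorch tensor and that filtered pixels are simply zeroed out (the paper does not specify the masking mechanics).

```python
def remove_grey(img, threshold=15):
    """Mask out pixels whose red, green, and blue values lie within `threshold`
    of one another (white background, grey gradients, scanner artifacts)."""
    spread = img.max(dim=0).values.int() - img.min(dim=0).values.int()
    mask = (spread > threshold).unsqueeze(0)    # keep only chromatic (tissue) pixels
    return img * mask
```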

4.3 Training

The networks were trained using the Adam optimizer for 200 epochs. A batch size of 16 and 8 was employed for MagNet networks with 3 and 4 layers respectively. The initial learning rate was set to $3\times 10^{-5}$ and was decayed using a cosine annealing scheduler. Both the ResNet and InceptionNet networks are initialized with weights pretrained on ImageNet. The ST convolutional layers are randomly initialized.
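A sketch of this optimization setup is shown below; `magnet`, `train_loader`, and the `compute_loss` helper are hypothetical placeholders for the model, data pipeline, and combined loss described in this section.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 200
optimizer = Adam(magnet.parameters(), lr=3e-5)        # initial learning rate 3e-5
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)

for epoch in range(EPOCHS):
    for slides, labels in train_loader:               # batch size 16 (3-layer) or 8 (4-layer)
        optimizer.zero_grad()
        loss = magnet.compute_loss(slides, labels)    # hypothetical helper combining the losses
        loss.backward()
        optimizer.step()
    scheduler.step()
```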

“Frozen” patch

Some WSIs have already been preprocessed so that they only contain regions with tissue, whereas others depict the whole tissue slide (see Figure 3). This diversity comes as a consequence of the differences in the clinical pipelines leading to the creation of WSIs, e.g. due to different scanning profiles. In order to mitigate this intra–data variance, we freeze the 1st patch of the 2nd layer so that it always attends to the whole input image. This allows the image to catch up in quality in cases where a large amount of zooming was required at the first magnifying layer, i.e. when the WSI shows the whole tissue slide.

Loss Functions

We employ the paradoxical loss function as described by Maksoud et al. [35] as a form of self–supervision for the convolutional layers in the weakly–supervised STs. In addition, cross-entropy is used between the slide-level labels and both the last output of the GRU and the ResNet-18 outputs.

Table 1: The results (AUROC) of MagNets on the testing set containing 25% of the publicly available Camelyon17 data set.

Method                                  Macro-   Micro-   ITC   Macro- & Micro-   All
Mean RGB Baseline                        59%      57%      -         58%           -
Baseline w/ patch–level supervision      91%      63%      -         77%           -
3-layer MagNet                           95%      71%     57%        84%          71%
4-layer MagNet                           91%      76%     63%        84%          75%

5 Results & Discussion

To evaluate the proposed method, using the optimization framework described in the previous section, we trained 3-layer and 4-layer MagNets on the task of cancer detection from WSIs. A summary of the results is presented in Table 1, which shows the area under the receiver operating characteristic curve (AUROC) – the standard evaluation metric used in the related literature [49, 47, 13, 32] – obtained using different configurations of the proposed method. Note that we only present our own results, since we were unable to identify any existing work on Camelyon17 that employs a more sophisticated approach to the “where” problem than tiling in a weakly–supervised fashion.

Interrogating our findings further, we note that although the 3–layer MagNet achieves better separation between macro-metastasis and normal cases, the 4–layer MagNet does better on micro-metastasis, as well as on the ITC cases. The latter is unsurprising given that a 4-layer MagNet makes use of finer-grained detail (cells are visible in Figure 5). Perhaps more importantly, the 4–layer MagNet performs well on ITC cases with a 63% AUROC despite the lack of ITC examples in our training set.

MagNets exhibit robust and effective exploration capabilities, namely attending to image content in an attention driven manner, exploring slides at the varying magnification levels best suited to the task at hand, and learning how to fuse relevant information both within the same WSI region and across different regions and magnification levels. In addition, the classifier (in the form of InceptionNet) demonstrates an excellent ability to distinguish normal from cancerous tissue irrespective of the magnification scale. Examples corroborating this are shown in Figure 4. It is particularly remarkable to observe cases such as those shown in rows 3–4 of the bottom group of examples in Figure 4, wherein the classifier can be seen to change its decision from “normal” to “cancerous” when sufficient evidence for cancer is accumulated at any magnification level. These examples were verified using the annotations provided by pathologists.

Figure 5 shows the forward pass of a single example through a 4-layer MagNet. The figure is made up of three images. Image A contains 10×4 sub-images, wherein the first column contains the input $I$ to a magnifying layer along with red squares indicating the patches to be magnified next, as selected by the MagNet. The rest of the images visualize the last magnifying layer along with its magnified patches of 224×224 pixel resolution that are forwarded to the classification layer. Since a GRU is employed in the classification layer, we are able to output a prediction with each new patch, allowing for trivial post-processing analysis. For instance, in the example shown, we observe that the model did not find anything that could be considered a malignancy for the better part of the last layer’s patches. The patches responsible for the change in malignancy probability (from 0.19 to 0.30 and then 0.43) contain a region that was not seen in previous patches and which contains unusual looking tissue. As the final patches are forwarded through the GRU, we observe that the model successfully classifies the WSI as cancerous.

6 Conclusions

In this work, we introduced the MagNet – a neural network consisting of fully-connected, convolutional, and recurrent layers that employ spatial transformers (ST) in a novel manner so as to facilitate attention and data driven recurrent exploration and, ultimately, end-to-end learning over gigapixel images. The built-in hard attention mechanism of MagNets makes them perfectly suited for clinical use. In particular, any machine learning system that is deployed in clinical practice must generate accurate explanations for its decisions that are easily interpretable by a medical expert. The explanations generated by MagNets are visually intuitive for a domain-specific expert to interpret and can be generated on the go without any additional overhead. Moreover, MagNets can be optimized without extra supervision (e.g. bounding boxes) for the task at hand. This is of high significance since for most clinical tasks, collecting ground truth data required for a higher degree of supervision is either extremely laborious and expensive, or simply not possible, e.g. in the case of patient prognosis.

References

  • [1] G. Aresta, T. Araújo, S. Kwok, S. S. Chennamsetty, M. Safwan, V. Alex, B. Marami, M. Prastawa, M. Chan, M. Donovan, G. Fernandez, J. Zeineh, M. Kohl, C. Walz, F. Ludwig, S. Braunewell, M. Baust, Q. Dang Vu, M. Nguyen Nhat To, and P. Aguiar. Bach: Grand challenge on breast cancer histology images. arXiv:1808.04277, 2018.
  • [2] Marc Aubreville, Maximilian Krappmann, Christof Bertram, Robert Klopfleisch, and Andreas Maier. A guided spatial transformer network for histology cell differentiation. In Proceedings of the Eurographics Workshop on Visual Computing for Biology and Medicine, VCBM ’17, page 21–25, Goslar, DEU, 2017. Eurographics Association.
  • [3] A. BenTaieb and G. Hamarneh. Predicting cancer with a recurrent visual attention model for histopathology images. In A. F. Frangi, J. A. Schnabel, C. Davatzikos, C. Alberola-López, and G. Fichtinger, editors, Medical Image Computing and Computer Assisted Intervention, pages 129–137. Springer International Publishing, 2018.
  • [4] Peter D. Caie, Neofytos Dimitriou, and Ognjen Arandjelović. Chapter 8 - precision medicine in digital pathology via image analysis and machine learning. In Stanley Cohen, editor, Artificial Intelligence and Deep Learning in Pathology, pages 149–173. Elsevier, 2021.
  • [5] G. Campanella, W. K. V. Silva, and J. T. Fuchs. Terabyte-scale deep multiple instance learning for classification and localization in pathology. arXiv:1805.06983, 2018.
  • [6] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. CoRR, abs/1707.01629, 2017.
  • [7] Philip Chikontwe, Meejeong Kim, Soo Jeong Nam, Heounjeong Go, and Sang Hyun Park. Multiple instance learning with center embeddings for histopathology classification. In Anne L. Martel, Purang Abolmaesumi, Danail Stoyanov, Diana Mateus, Maria A. Zuluaga, S. Kevin Zhou, Daniel Racoceanu, and Leo Joskowicz, editors, Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, pages 519–528, Cham, 2020. Springer International Publishing.
  • [8] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. CoRR, abs/1512.04412, 2015.
  • [9] Olivier Dehaene, Axel Camara, Olivier Moindrot, Axel de Lavergne, and Pierre Courtiol. Self-Supervision Closes the Gap Between Weak and Strong Supervision in Histology. arXiv, Dec. 2020.
  • [10] I. Demir, K. Koperski, D. Lindenbaum, G. Pang, J. Huang, S. Basu, F. Hughes, D. Tuia, and R. Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. arXiv:1805.06561, 2017.
  • [11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • [12] Neofytos Dimitriou, Ognjen Arandjelović, and Peter D. Caie. Deep learning for whole slide image analysis: An overview. Frontiers in Medicine, 6:264, 2019.
  • [13] B. B. Ehteshami, M. Veta, P. Johannes van Diest, and et al. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA, 318(22):2199–2210, 2017.
  • [14] B. Gecer, S. Aksoy, E. Mercan, G. L. Shapiro, L. D. Weaver, and G. J. Elmore. Detection and classification of cancer in whole slide breast histopathology images using deep convolutional networks. Pattern Recognition, 84:345 – 356, 2018.
  • [15] S. Guo, L. Liu, W. Wang, S. Lao, and L. Wang. An attention model based on spatial transformers for scene recognition. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 3757–3762, 2016.
  • [16] Noriaki Hashimoto, Daisuke Fukushima, Ryoichi Koga, Yusuke Takagi, Kaho Ko, Kei Kohno, Masato Nakaguro, Shigeo Nakamura, Hidekata Hontani, and Ichiro Takeuchi. Multi-scale domain-adversarial multiple-instance cnn for cancer subtype classification with unannotated histopathological images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [17] K. He, G. Gkioxari, P. Dollár, and B. R. Girshick. Mask R-CNN. CoRR, abs/1703.06870, 2017.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [19] L. Hou, D. Samaras, M. T. Kurc, Y. Gao, E. J. Davis, and H. J. Saltz. Patch-based convolutional neural network for whole slide tissue image classification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, jun 2016.
  • [20] G. Huang, Z. Liu, and Q. K. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
  • [21] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. CoRR, abs/1506.02025, 2015.
  • [22] Wei Jiang, Weiwei Sun, Andrea Tagliasacchi, Eduard Trulls, and Kwang Moo Yi. Linearized multi-sampling for differentiable image transformation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • [23] J. Johnson, A. Karpathy, and F. F. Li. Densecap: Fully convolutional localization networks for dense captioning. CoRR, abs/1511.07571, 2015.
  • [24] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  • [25] Mahendra Khened, Avinash Kori, Haran Rajkumar, Ganapathy Krishnamurthi, and Balaji Srinivasan. A generalized deep learning framework for whole-slide image segmentation and analysis. Sci. Rep., 11(1):11579, June 2021.
  • [26] Bin Kong, Xin Wang, Zhongyu Li, Qi Song, and Shaoting Zhang. Cancer metastasis detection via spatially structured deep network. In Marc Niethammer, Martin Styner, Stephen Aylward, Hongtu Zhu, Ipek Oguz, Pew-Thian Yap, and Dinggang Shen, editors, Information Processing in Medical Imaging, pages 236–248. Springer International Publishing, 2017.
  • [27] Navid Alemi Koohbanani, Balagopal Unnikrishnan, Syed Ali Khurram, Pavitra Krishnaswamy, and Nasir Rajpoot. Self-path: Self-supervision for classification of pathology images with limited annotations. IEEE Transactions on Medical Imaging, 40(10):2845–2856, 2021.
  • [28] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, and V. Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982, 2018.
  • [29] G. Larsson, M. Maire, and G. Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. CoRR, abs/1605.07648, 2016.
  • [30] Bin Li, Yin Li, and Kevin W. Eliceiri. Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14318–14328, June 2021.
  • [31] Y. Li and W. Ping. Cancer metastasis detection with neural conditional random field. arXiv:1806.07064, 2018.
  • [32] G. Litjens, P. Bandi, B. Ehteshami Bejnordi, O. Geessink, M. Balkenhol, P. Bult, A. Halilovic, M. Hermsen, R. van de Loo, R. Vogels, F. Q. Manson, N. Stathonikos, A. Baidoshvili, P. van Diest, C. Wauters, M. van Dijk, and J. van der Laak. 1399 h&e-stained sentinel lymph node sections of breast cancer patients: the CAMELYON dataset. GigaScience, 7(6), May 2018.
  • [33] Y. Liu, K. Gadepalli, M. Norouzi, E. G. Dahl, T. Kohlberger, A. Boyko, S. Venugopalan, A. Timofeev, Q. P. Nelson, S. G. Corrado, D. J. Hipp, L. Peng, and C. M. Stumpe. Detecting cancer metastases on gigapixel pathology images. CoRR, abs/1703.02442, 2017.
  • [34] Ming Y Lu, Drew F K Williamson, Tiffany Y Chen, Richard J Chen, Matteo Barbieri, and Faisal Mahmood. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng., 5(6):555–570, June 2021.
  • [35] Sam Maksoud, Kun Zhao, Peter Hobson, Anthony Jennings, and Brian C. Lovell. Sos: Selective objective switch for rapid immunofluorescence whole slide image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [36] Hans Pinckaers, Bram van Ginneken, and Geert Litjens. Streaming convolutional neural networks for end-to-end learning with multi-megapixel images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1581–1590, 2022.
  • [37] Antoine Pirovano, Hippolyte Heuberger, Sylvain Berlemont, SaÏd Ladjal, and Isabelle Bloch. Automatic feature selection for improved interpretability on whole slide imaging. Machine Learning and Knowledge Extraction, 3(1):243–262, 2021.
  • [38] T. Qaiser and M. N. Rajpoot. Learning where to see: A novel attention model for automated immunohistochemical scoring. arXiv, 2019.
  • [39] S. Sabour, N. Frosst, and E. G. Hinton. Dynamic routing between capsules. CoRR, abs/1710.09829, 2017.
  • [40] Yash Sharma, Aman Shrivastava, Lubaina Ehsan, Christopher A. Moskaluk, Sana Syed, and Donald E. Brown. Cluster-to-conquer: A framework for end-to-end multi-instance learning for whole slide image classification. In MIDL, 2021.
  • [41] C. Shu, X. Chen, Q. Xie, and H. Han. Hierarchical spatial transformer network. CoRR, abs/1801.09467, 2018.
  • [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [43] K. S. Sønderby, K. C. Sønderby, L. Maaløe, and O. Winther. Recurrent spatial transformer networks. CoRR, abs/1509.05329, 2015.
  • [44] Dong Sui, Weifeng Liu, Jing Chen, Chunxiao Zhao, Xiaoxuan Ma, Maozu Guo, and Zhaofeng Tian. A pyramid architecture-based deep learning framework for breast cancer detection. Biomed Res. Int., 2021:2567202, Oct. 2021.
  • [45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, E. R. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
  • [46] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision, 2015.
  • [47] David Tellez, Geert Litjens, Jeroen van der Laak, and Francesco Ciompi. Neural image compression for gigapixel histopathology image analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(2):567–578, 2021.
  • [48] Hiroki Tokunaga, Yuki Teramoto, Akihiko Yoshizawa, and Ryoma Bise. Adaptive weighting multi-field-of-view cnn for semantic segmentation in pathology. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12589–12598, 2019.
  • [49] D. Wang, A. Khosla, R. Gargeya, H. Irshad, and H. A. Beck. Deep learning for identifying metastatic breast cancer. arXiv:1606.05718, 2016.
  • [50] X. Yue, N. Dimitriou, and O. Arandjelović. Colorectal cancer outcome prediction from H&E whole slide images using machine learning and automatically inferred phenotype profiles. BICOB, pages 139–149, 2019.
  • [51] S. Zagoruyko and N. Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.
  • [52] Yu Zhao, Fan Yang, Yuqi Fang, Hailing Liu, Niyun Zhou, Jun Zhang, Jiarui Sun, Sen Yang, Bjoern Menze, Xinjuan Fan, and Jianhua Yao. Predicting lymph node metastasis using histopathological images based on multiple instance learning with deep graph convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.