
Ladder Siamese Network: a Method and Insights
for Multi-level Self-Supervised Learning

Ryota Yoshihashi    Shuhei Nishimura    Dai Yonebayashi    Yuya Otsuka    Tomohiro Tanaka    Takashi Miyazaki
Yahoo Japan Corporation
[email protected]
Abstract

Siamese-network-based self-supervised learning (SSL) suffers from slow convergence and instability in training. To alleviate this, we propose a framework to exploit intermediate self-supervisions in each stage of deep nets, called the Ladder Siamese Network. Our self-supervised losses encourage the intermediate layers to be consistent across different data augmentations of single samples, which facilitates training progress and enhances the discriminative ability of the intermediate layers themselves. While some existing work has already utilized multi-level self-supervision in SSL, ours differs in that 1) we reveal its usefulness with non-contrastive Siamese frameworks from both theoretical and empirical viewpoints, and 2) ours improves image-level classification, instance-level detection, and pixel-level segmentation simultaneously. Experiments show that the proposed framework can improve BYOL baselines by 1.0% points in ImageNet linear classification, 1.2% points in COCO detection, and 3.1% points in PASCAL VOC segmentation. In comparison with state-of-the-art methods, our Ladder-based model achieves competitive and balanced performances on all tested benchmarks without causing large degradation on any of them.

Figure 1: Illustration of a) existing major Siamese SSL methods and b) our Ladder Siam framework. We exploit intermediate-layer self-supervision to stabilize the training process and to enhance intermediate-layer reusability in downstream tasks.

1 Introduction

Conventional deep neural networks are notoriously label-hungry, requiring massive human-annotated training data to demonstrate their full performance [59]. Self-supervised learning (SSL) [9] is a promising approach to reduce this annotation dependency of deep nets, and to make machine learning more autonomous toward enabling more human-like learning mechanisms.

Among various SSL methods, one of the promising approaches is cross-view learning with Siamese networks [9, 24, 22]. This approach is further divided into two types: contrastive and non-contrastive methods. Contrastive methods include the pioneering work SimCLR [9], which makes pairs of representations from the same instance similar and pairs from different instances dissimilar. While its training using positive and negative pairs is intuitive, the selection of negative pairs may affect performance, and the training is more sensitive to batch size [2, 52]. Non-contrastive methods, including the commonly used BYOL [22] and SimSiam [11], eliminate the necessity of negative pairs by defining loss functions that depend only on positive pairs, which empirically improves learned representations.

A common problem in Siamese SSL is slow convergence and instability during training. In non-contrastive methods, the existence of trivial solutions exacerbates the problem: training may fall into trivial solutions that map every input signal to a constant vector, which is called a collapse. While existing studies observed that the collapse can be avoided in carefully designed training frameworks [9, 31, 62], it remains a problem in certain settings depending on model choices and training-dataset sizes [38].

To alleviate this difficulty of Siamese SSL, we propose a training framework that exploits multi-level self-supervision. The multi-level self-supervision encourages each stage of representations in hierarchical networks to be consistent across different data augmentations of the same images. This is expected to 1) enhance the training progress of the earlier stages by directly exposing them to the loss, and 2) as a side effect, improve the discriminative ability of the middle layers themselves. One might be concerned that such aggressive addition of losses at all levels is risky in non-contrastive frameworks, which are known to be prone to collapse. However, from a theoretical viewpoint, we argue that the losses added by multi-level supervision do not enlarge the set of trivial solutions, which might be counterintuitive. Figure 1 shows the overview of the architecture. We name our framework Ladder Siamese Network after an autoencoder-based un- and semi-supervised learning method [56] with multi-level losses from the pre-SSL era.

In the context of cross-view SSL, multi-level self-supervision has been partly examined as a component of task-specialized SSL methods. For example, DetCo [74] is a contrastive method that exploits multi-level self-supervision and local-patch-based training for detection. In contrast, we explore multi-level self-supervision as a pretraining method for general-purpose backbones. We find that, in combination with non-contrastive losses, multi-level self-supervision (MLS) can improve classification, detection, and segmentation performance simultaneously, whereas DetCo had to sacrifice classification accuracy to improve detection performance with multi-level self-supervision. In addition, we find that dense self-supervised losses [69, 75], which boost detection and segmentation performance at the cost of classification, can be exploited with much less classification degradation in the Ladder Siam framework than in conventional top-level-loss-only settings.

Our contributions are summarized as follows. First, we propose a non-contrastive multi-level SSL framework, Ladder Siam. Second, in experiments, we show that Ladder Siam is useful for building a representation hierarchy that maintains competitive performance with state-of-the-art methods in classification, detection, and segmentation simultaneously. Third, we extensively analyze the role of multi-level self-supervision in training from theoretical and experimental viewpoints and show how the intermediate losses work to complement the top-level supervision. Code and pretrained weights will be released upon acceptance.

2 Related Work

2.1 Siamese self-supervised learning

Siamese SSL was derived from the line of instance-discrimination-based SSL [18, 72]. Instance discrimination is a proxy task of identifying differently data-augmented versions of single images, which is conceptually simpler than previous proxy tasks [51, 21, 54]. The earliest instance-discrimination method [18] is parametric, performing a classification in which each training instance is treated as its own class. Afterwards, a non-parametric memory bank that stores per-instance weights [72] was introduced to improve scalability. Finally, the memory bank was replaced by Siamese-net-style dual encoders [6, 36] that compute per-instance weights on the fly from the second view of the input [9].

Many of the Siamese SSL methods after the pioneering SimCLR [9] share a similar overall architecture [11] and explore various loss functions, prediction modules, and network update rules. For example, MoCo [24, 10, 12] introduced momentum encoders, which are slowly updated during training to increase the targets' temporal consistency. SwAV [7] exploits online clustering and predicts cluster assignments, rather than the representation vectors themselves, to improve stability. BYOL [22] adopts an asymmetric predictor placed on only one side of the Siamese net and incorporates a non-contrastive loss that does not rely on negative samples. Further sophistication of prediction modules [32, 53], loss-function design [82, 3, 61], and data augmentation [63, 50, 23] within the Siamese architecture is ongoing. However, we argue that architectural changes, such as the addition of intermediate losses, have not been deeply investigated.

Nevertheless, we are aware of a few studies that adopted multi-level self-supervision (MLS) in Siamese SSL. DetCo [74] used MLS in combination with local patch-wise contrastive learning. CsMl [76] combined MLS with nearest-neighbor-based positive-pair augmentation. HCCL [8] incorporated MLS in its deep projection heads rather than in the backbone. Hierarchical Augmentation Invariance [83] assigned specific data-augmentation types to each level to learn invariance against them. Remarkably, all of these works presented MLS bundled with other ideas to boost overall system performance. We instead focus on analyzing vanilla MLS and show that even a straightforward implementation based on BYOL can outperform the preceding MLS methods. Ideas based on MLS have also been examined in other domains, such as video [79] and medical images [34].

A number of Siamese SSL methods incorporate region- or pixel-wise learning, which is useful for improving the locality awareness and spatial granularity of representations. Region-based methods often use an extra region-proposal module. For example, DetCon [28] utilizes multiscale combinatorial grouping [1], SoCo [71] and UniVIP [41] utilize selective search [64], and CYBORGS [67] and Odin [29] utilize region grouping by k-means [45] to define region-to-region losses. While they are effective especially in object detection, the use of a handcrafted region-proposal module may cause implementation complexity and loss of generality, for example, when applied to non-object image datasets such as scenes or textures. In contrast, we explore a method that does not rely on extra modules yet can improve detection and segmentation.

Pixel-wise methods eliminate global pooling and aim to define dense supervision. For example, DenseCL [69] exploits dense correspondence between feature maps. PixPro [75] utilizes coordinate-based alignment by tracing the cropping-based data augmentation. VICRegL [4] boosts dense SSL with VIC regularization [3], and there are further studies along this line that improve the matching strategy [39, 70]. DenseSiam [85] incorporates both dense and region-based learning. We incorporate DenseCL, the simplest of these, into our Ladder framework with a non-contrastive modification.

2.2 Intermediate-layer supervision

Intermediate-layer supervision has been examined in various areas since the advent of deep neural networks to alleviate their training difficulty. Our direct source of inspiration is LadderNet and its variants [65, 56, 80], which perform autoencoder-based denoising [66] of intermediate representations as additional supervision. Deeply-supervised nets [37] exploit classification losses on intermediate layers, which has been incorporated into further supervised methods [60, 86]. Deep contrastive supervision [84] is a supervised learning method that exploits SSL-like contrastive losses on intermediate layers as regularizers. While it is related to our method in terms of intermediate-loss usage, we investigate purely self-supervised settings. Knowledge distillation is another area where intermediate-layer supervision is common, to give student models more hints for mimicking teacher models [57, 78]. However, these methods use teachers trained with supervised learning and are largely different from our SSL setting, where the dual encoders are updated simultaneously. In the broadest sense, methods that encourage reuse of intermediate layers via lateral connections [58, 42, 49, 35] could be seen as forms of intermediate-layer supervision.

3 Method

3.1 Preparation: Siamese SSL

First, we briefly review the Siamese SSL framework [22, 11] as a background and introduce notations. Given an input $\bm{x}$, a deep network with $N$ stages that maps $\bm{x}$ to the output $\bm{y}$ can generally be written as

\bm{y} = \bm{f}_{N}(\bm{z}_{N-1})
\bm{z}_{i} = \bm{f}_{i}(\bm{z}_{i-1}) \qquad (i = 1, 2, \ldots, N-1)   (1)
\bm{z}_{0} = \bm{x},

where $\bm{f}_{i}$ denotes the $i$-th stage of the network and $\bm{z}_{i}$ denotes the intermediate representation produced by $\bm{f}_{i}$. Here, stages mean certain groups of layers in networks (e.g., conv1, res2, res3, ... in ResNets [27]), typically grouped by their resolutions and divided by downsampling layers. The composite function

\bm{f}(\bm{x}) = (\bm{f}_{N} \circ \bm{f}_{N-1} \circ \ldots \circ \bm{f}_{1})(\bm{x})   (2)

denotes the whole network as a single function.

In supervised learning, the loss function that compares the output $\bm{y}$ with the annotated labels drives the training forward. However, in SSL, we need an alternative to the label. Here, Siamese frameworks exploit two views of single instances, which are two versions of input images differently augmented by random transformations. Given an input $\bm{x}$, using its two views $\bm{x}^{a}$, $\bm{x}^{b}$ and their corresponding outputs $\bm{y}^{a} = \bm{f}(\bm{x}^{a})$, $\bm{y}^{b} = \hat{\bm{f}}(\bm{x}^{b})$, a self-supervised loss function is defined as $L(\bm{y}^{a}, \bm{y}^{b})$. The network for the second view $\hat{\bm{f}}$ may be identical to $\bm{f}$ [9], or a slowly updated momentum version of $\bm{f}$ [22].

An example of a concrete form of $L(\bm{y}^{a}, \bm{y}^{b})$ is the mean-squared error (MSE) with a predictor, introduced by BYOL [22], which is denoted by

L_{\text{BYOL}}(\bm{y}^{a}, \bm{y}^{b}) = |\bm{q}(\bm{y}^{a}) - \bm{y}^{b}|_{2}^{2},   (3)

where $\bm{q}$ is a multi-layer perceptron (MLP) called a predictor, whose output is normalized. The predictor gives the loss asymmetry, which is empirically beneficial for overall performance. Note that previous work [22] refers to the final MLP of the trained network as the projector; while we follow the backbone-projector-predictor setting, in our formulation we include the projector in $\bm{f}$ for notational simplicity.
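To make the loss concrete, below is a minimal PyTorch sketch of the normalized MSE in Eq. 3, written from the description above rather than from the authors' code; the function name and tensor shapes are our assumptions.

```python
# A minimal PyTorch sketch of the BYOL-style loss in Eq. 3 (our reading of the
# description above, not the authors' released code). Both sides are
# L2-normalized and the target branch receives no gradient.
import torch
import torch.nn.functional as F

def byol_loss(pred_a: torch.Tensor, target_b: torch.Tensor) -> torch.Tensor:
    """pred_a: q(y^a) from the online branch, shape (B, D).
    target_b: y^b from the (momentum) target branch, shape (B, D)."""
    pred_a = F.normalize(pred_a, dim=-1)               # normalize the predictor output
    target_b = F.normalize(target_b.detach(), dim=-1)  # stop-gradient on the target
    return (pred_a - target_b).pow(2).sum(dim=-1).mean()
```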

In gradient-based optimization of the loss in Eq. 3, updates of the intermediate layers $\bm{f}_{N-1}, \bm{f}_{N-2}, \ldots, \bm{f}_{1}$ are purely based on backpropagation, which might be indirect. Here, our motivation is to expose the intermediate layers directly to their own learning objectives.

3.2 Ladder Siamese Network

To enhance learning of intermediate layers, we add losses on the basis of the intermediate representations in deep nets. We denote the intermediate representations corresponding to the two views $\bm{x}^{a}$ and $\bm{x}^{b}$ by $\bm{z}^{a}_{1}, \bm{z}^{a}_{2}, \ldots, \bm{z}^{a}_{N-1}$ and $\bm{z}^{b}_{1}, \bm{z}^{b}_{2}, \ldots, \bm{z}^{b}_{N-1}$, respectively. Using these, the overall loss is defined by

L_{\text{all}} = L(\bm{y}^{a}, \bm{y}^{b}) + \sum_{i=1}^{N-1} w_{i} L_{i}(\bm{z}^{a}_{i}, \bm{z}^{b}_{i}),   (4)

where $L$ denotes the final-layer loss and $L_{i}$ denotes the $i$-th intermediate loss. We introduce loss weights $w_{i}$ to control the balance between the losses. Usual Siamese SSL can be seen as a special case of Ladder Siamese SSL where $w_{1} = w_{2} = \ldots = w_{N-1} = 0$.

For the concrete form of $L_{i}(\bm{z}^{a}_{i}, \bm{z}^{b}_{i})$, we use an adaptation of the BYOL loss (Eq. 3) to the intermediate layers, defined by

L_{i} = |\bm{q}_{i}(\bm{y}^{a}_{i}) - \bm{y}^{b}_{i}|_{2}^{2},   (5)
\bm{y}^{k}_{i} = \bm{p}_{i}(\text{avgpool}(\bm{z}^{k}_{i})) \quad (k = a, b).

This is nearly identical to Eq. 3, except that each level's loss has its own projector $\bm{p}_{i}$ and predictor $\bm{q}_{i}$, and global-pooling layers are added as side branches apart from the main stream of the backbone network (note that this is still equivalent to the BYOL loss that reuses the global pooling in the backbone network). Figure 2a illustrates this intermediate-layer predictor.
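As an illustration, the following is a hedged sketch of one such intermediate branch (Fig. 2a): global average pooling on the stage output, followed by a per-level projector and predictor. The class name and the MLP dimensions are our assumptions, not values from the paper.

```python
# A hedged sketch of one intermediate-level branch (Fig. 2a): avgpool on the
# stage output z_i, then a per-level projector p_i and predictor q_i.
# The class name and MLP dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLevelHead(nn.Module):
    def __init__(self, in_channels: int, hidden_dim: int = 4096, out_dim: int = 256):
        super().__init__()
        def mlp(d_in: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(d_in, hidden_dim),
                nn.BatchNorm1d(hidden_dim),
                nn.ReLU(inplace=True),
                nn.Linear(hidden_dim, out_dim),
            )
        self.projector = mlp(in_channels)  # p_i
        self.predictor = mlp(out_dim)      # q_i, applied on the online branch only

    def forward(self, z: torch.Tensor, online: bool = True) -> torch.Tensor:
        # z: (B, C, H, W) stage output; side-branch global average pooling
        y = self.projector(torch.flatten(F.adaptive_avg_pool2d(z, 1), 1))
        return self.predictor(y) if online else y
```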

A concern in this multi-loss setup is the number of hyperparameters, which increases the cost of hyperparameter searches for optimal training. However, we empirically show that a simple heuristic can reduce the number of hyperparameters:

w_{i} = 2^{i-N} w,   (6)

where $w$ is a single loss-weight coefficient newly introduced in place of $w_{1}, w_{2}, \ldots, w_{N-1}$. This simply halves the loss weight for every one-stage-shallower part of the network.

For implementation, we set the intermediate losses on res2, res3, and res4 in addition to the final stage res5 of ResNets [27]. We do not set a loss on conv1, the first block consisting of a convolution and a pooling, because it appears too weak to learn consistency against data augmentations.
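To make the weighting concrete, here is a minimal sketch of the total loss of Eq. 4 under the heuristic of Eq. 6, applied to res2-res5 as described above. The function and dictionary keys are illustrative; under this indexing, w = 0.5 reproduces the default weights 1/16, 1/8, 1/4, and 1 used later in the experiments.

```python
# A minimal sketch of the total Ladder loss (Eq. 4) with the weight heuristic
# of Eq. 6: every stage one level shallower gets half the weight. The dict of
# per-stage losses and the value w = 0.5 are illustrative assumptions; with
# this indexing w = 0.5 yields the default weights [1/16, 1/8, 1/4, 1].
from typing import Dict
import torch

def ladder_total_loss(level_losses: Dict[str, torch.Tensor], w: float = 0.5) -> torch.Tensor:
    stages = ["res2", "res3", "res4", "res5"]      # shallow -> deep; res5 carries the final loss
    n = len(stages)
    total = level_losses["res5"]                   # final-layer loss L, weight 1
    for i, stage in enumerate(stages[:-1], start=1):
        total = total + (2.0 ** (i - n)) * w * level_losses[stage]  # w_i = 2^{i-N} w
    return total
```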

3.3 Dense loss for lower-layer supervision

Figure 2: Structures of our intermediate predictors and losses. a) global version. b) dense version.

While a naive configuration in which all levels' losses are set to the same form as Eq. 3 is possible, and a later section shows that it is suitable for image-level classification, there remains a design space for varying the loss at each level. We exploit this to enable the coexistence of global and local factors within a single network.

We place dense losses on the lower (input-side) parts of a network and global losses on the higher (output-side) parts. In the literature, dense losses [69, 75, 85], equipped with stronger locality-aware supervisory signals, have advantages in object detection and segmentation, while they degrade classification accuracy. Here, our intention is to enhance the division of roles between the lower and higher layers. Such differentiation in hierarchical networks may emerge naturally [55, 48], and we aim to strengthen it to improve locality awareness without largely sacrificing classification accuracy.

Inspired by DenseCL [69], which is built on MoCo-style contrastive learning, we newly design the DenseBYOL loss as its non-contrastive counterpart. The purpose of this re-invention is to avoid potential ill effects of combining contrastive and non-contrastive losses and to maintain the conciseness of our BYOL-based codebase. We define our DenseBYOL loss by

L_{\text{Dense}}(\bm{y}^{a}, \bm{y}^{b}) = |\bm{q}^{\text{conv}}(\bm{y}^{a}) - \text{align}(\bm{y}^{b}; \bm{y}^{a})|_{2}^{2},   (7)
\text{align}(\bm{y}^{b}; \bm{y}^{a}) = [\bm{y}^{b}_{u,v} \mid (u,v) = \text{argmax}_{u,v} \langle \bm{y}^{a}_{i,j}, \bm{y}^{b}_{u,v} \rangle]_{i,j},
\bm{y}^{k} = \bm{p}^{\text{conv}}(\bm{z}^{k}) \quad (k = a, b),

where align is a spatial resampling operator that picks up the corresponding point $(u,v)$ and its feature vector $\bm{y}^{b}_{u,v}$ for every $\bm{y}^{a}_{i,j}$ on the basis of the cosine similarity denoted by $\langle \cdot, \cdot \rangle$. The projector and predictor are replaced by $\bm{p}^{\text{conv}}$ and $\bm{q}^{\text{conv}}$, their $1 \times 1$ convolution-based counterparts, which no longer require global pooling. Figure 2b illustrates this dense version of the predictor. Pseudo code of Eq. 7 is shown in the Supplementary Material.
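Below is a hedged sketch of Eq. 7 written directly from the equations (the pseudo code in the Supplementary Material may differ): matching is done between the normalized projector outputs of the two views, and the MSE is taken between the normalized predictor output and the aligned target features. The tensor layout and function name are our assumptions.

```python
# A hedged sketch of the DenseBYOL loss (Eq. 7), written from the equations
# rather than the authors' pseudo code. Feature maps are flattened to
# (B, HW, C); each online location is matched to its most cosine-similar
# target location (the "align" operator), and a normalized MSE follows.
import torch
import torch.nn.functional as F

def dense_byol_loss(proj_a: torch.Tensor, pred_a: torch.Tensor, proj_b: torch.Tensor) -> torch.Tensor:
    """proj_a = y^a = p_conv(z^a); pred_a = q_conv(y^a); proj_b = y^b from the
    target branch. All tensors have shape (B, C, H, W)."""
    def flat(t: torch.Tensor) -> torch.Tensor:
        return F.normalize(t.flatten(2).transpose(1, 2), dim=-1)  # -> (B, HW, C), L2-normalized

    ya, qa, yb = flat(proj_a), flat(pred_a), flat(proj_b.detach())
    sim = torch.bmm(ya, yb.transpose(1, 2))           # cosine similarities <y^a_ij, y^b_uv>
    match = sim.argmax(dim=-1)                        # align: best target location per online location
    aligned = torch.gather(yb, 1, match.unsqueeze(-1).expand(-1, -1, yb.shape[-1]))
    return (qa - aligned).pow(2).sum(dim=-1).mean()
```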

In our Ladder Siam framework, we set the dense losses in the lower half of the network, i.e., res2 and res3 of a ResNet, and the global losses in the others (i.e., res4 and res5). Following DenseCL [69], we used the dense losses in combination with the global ones by averaging.

3.4 Do More Intermediate-layer Losses Mean More Risks of a Collapse?

Non-contrastive Siamese learning carries a risk of collapse, a phenomenon in which networks fall into trivial solutions by learning a constant function, which is useless for representation learning. Ladder Siam adds intermediate losses and therefore needs to avoid collapse at every loss for successful training, which might seem intuitively difficult.

However, we show that, in fact, the intermediate losses do not introduce new trivial solutions beyond those of the final loss. In other words, every trivial solution of an intermediate loss is also a trivial solution of the final loss. This can be formally stated as follows:

Theorem 1.

In a deep net denoted by Eq. 1, the set of parameters $\Theta_{l}$ that causes the collapse of the intermediate representation $\bm{z}_{l}$ is a subset of the set of parameters $\Theta_{m}$ that causes the collapse of $\bm{z}_{m}$ when $l \leq m$.

Sketch of proof.

When $l \leq m$, $\bm{z}_{m}$ can be written using $\bm{z}_{l}$ and a part of the net as

\bm{z}_{m} = \bm{f}_{m} \circ \bm{f}_{m-1} \circ \ldots \circ \bm{f}_{l+1}(\bm{z}_{l}).   (8)

Given $\bm{z}_{l}$ collapsed,

\bm{z}_{l} = \text{const.}   (9)
\Rightarrow \bm{z}_{m} = \bm{f}_{m} \circ \bm{f}_{m-1} \circ \ldots \circ \bm{f}_{l+1}(\text{const.}) = \text{const.},   (10)

which is summarized as $\bm{z}_{l} = \text{const.} \Rightarrow \bm{z}_{m} = \text{const.}$ In the parameter space, this means

\Theta_{m} = \{\bm{\theta} \mid \bm{z}_{m} = \text{const.}\} \supseteq \Theta_{l} = \{\bm{\theta} \mid \bm{z}_{l} = \text{const.}\}.

This means that the possible ranges of the trivial solutions have a nested structure $\Theta_{1} \subseteq \Theta_{2} \subseteq \ldots \subseteq \Theta_{N}$. Thus, we only have to avoid the collapse in $\Theta_{N}$ to avoid the collapse in all the intermediate representations, which standard BYOL training already achieves. In experiments, we did not observe any hyperparameter setting in which Ladder BYOL collapsed but BYOL did not, which agrees with this analysis.

As a limitation of this analysis, our statement is only applicable when the collapse is complete; studies have indicated that a collapse can also be dimensional [33] or partially dimensional [38], where the representations are not constant but strongly correlated.

4 Experiments

We evaluate Ladder Siam’s effectiveness as a versatile representation learner in various vision tasks. Following the standard protocol in prior work [24, 85], we first pretrained our networks with the proposed method using the ImageNet dataset, and then finetuned them or built classifiers on them as feature extractors with frozen parameters in the downstream tasks.

4.1 Pretraining

We pretrained Ladder Siam on the ImageNet-1k [16] dataset (also known as ILSVRC2012) in an unsupervised fashion, i.e., without using labels. We used 100-epoch and 200-epoch training with cosine annealing [47] without restart as defaults because these schedules are the most widely used. We followed BYOL [22] in other settings; as an optimizer, we used LARS [81] with a batch size of 4,096, an initial learning rate of 7.2, and a weight decay of 0.000001.

Method configuration

We trained two types of Ladder Siam variations. Ladder-BYOL is the simpler one, in which all intermediate losses are the BYOL-style global loss described in Eq. 3. We followed BYOL [22] in other implementation details, including the application of stop-grad and the momentum encoder. Ladder-DenseBYOL is the dense-loss-equipped alternative, more oriented toward dense-prediction tasks, e.g., segmentation; it replaces the intermediate losses on the earlier-half stages with the dense loss described in Eq. 7. For comparisons, we additionally implemented DenseBYOL, which has no intermediate losses but uses the loss of Eq. 7 at the top level. As a backbone architecture, we used ResNet50 [27] by default, since it is widely used and allows the most direct comparison with other methods. We set the loss weights at res2, res3, res4, and res5 to 1/16, 1/8, 1/4, and 1, respectively, as defaults. The impact of this setting is investigated in a later section.
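For reference, the settings described above can be condensed into a small configuration sketch; this is only a restatement of the hyperparameters in the text, not a runnable script, and the key names are our own.

```python
# A compact restatement of the pretraining and method settings described
# above, as a plain dictionary; key names are ours, values are from the text.
pretrain_cfg = {
    "dataset": "ImageNet-1k (no labels)",
    "schedule_epochs": [100, 200],                 # 400 epochs also reported for classification
    "lr_schedule": "cosine annealing, no restart",
    "optimizer": "LARS",
    "batch_size": 4096,
    "initial_lr": 7.2,
    "weight_decay": 1e-6,
    "backbone": "ResNet-50",
    "loss_weights": {"res2": 1 / 16, "res3": 1 / 8, "res4": 1 / 4, "res5": 1},  # defaults
}
```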

Hardware and time consumption

We used KVM virtual machines on our private cloud infrastructure. Each machine has eight NVIDIA A100-80GB-SXM GPUs, 252 vCPUs, and 1 TB memory. They took around 60 hours for our 200-epoch pretraining.

Table 1: Results of BYOL and our Ladder-BYOL.
BYOL Ladder-BYOL
100-epoch pretraining
IN acc@1 67.4 68.3 (+0.9)
CC box mAP 39.3 40.5 (+1.2)
VOC mIoU 63.8 66.6 (+2.8)
200-epoch pretraining
IN acc@1 71.7 72.8 (+1.1)
CC box mAP 40.9 41.4 (+0.5)
VOC mIoU 64.3 67.4 (+3.1)
400-epoch pretraining
IN acc@1 73.1 73.6 (+0.5)
Table 2: Performance comparison in various vision tasks by state-of-the-art SSL methods and ours. Bold indicates the best and underline indicates the second-best results. †: our reproduced results using released pretrained weights. *: minor differences in pretraining settings; see the main text for details.
Classification Detection in COCO Semantic segmentation Avg. rank
Method IN linear acc. box mAP mask mAP VOC mIoU Cityscapes mIoU
ReSim [73] 66.1 40.0 36.1 76.8
DetCo [74] 68.6 40.1 36.4 76.5
DenseCL [69] 63.3 40.3 36.4 69.4 75.7 6.4
PixPro [75] 66.3* 40.5 36.6 76.3
HAI-SimSiam [83] 70.1
LEWEL-BYOL [32] 72.8 41.3 37.4 65.7\dagger 71.3\dagger 3.4
RegionCL-SimSiam* [77] 71.3 38.8 35.2
RegionCL-DenseCL [77] 68.5 40.4 36.7 64.8\dagger 74.1\dagger 6.2
CsMl [76] 71.6 40.3 36.6
DenseSiam [85] 40.8 36.8 77.0
Ladder-BYOL (ours) 72.8 41.4 37.2 67.4 73.9 3
Ladder-DenseBYOL (ours) 72.0 41.1 37.0 68.6 75.2 3.4

4.2 Downstream tasks

Classification

We conducted linear probing using ImageNet-1k. Linear classifiers were trained on the frozen representations using SGD, for 100 epochs with cosine annealing. We used the mmselfsup [14] codebase.
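As a rough illustration of this protocol (the actual runs use the mmselfsup recipe), a linear probe amounts to freezing the backbone and training only a linear layer; the learning rate and momentum below are placeholders, not the values used in the experiments.

```python
# A rough sketch of the linear-probing setup: the backbone is frozen and only
# a linear classifier is trained with SGD and cosine annealing. The learning
# rate and momentum are placeholders; the experiments follow mmselfsup.
import torch
import torch.nn as nn

def build_linear_probe(backbone: nn.Module, feat_dim: int = 2048, num_classes: int = 1000):
    for p in backbone.parameters():
        p.requires_grad = False                    # keep the representation frozen
    backbone.eval()
    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)  # 100 epochs
    return head, optimizer, scheduler
```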

Detection

We finetuned Mask R-CNN [26] with FPN [42] on the COCO dataset [43], using train2017 for training and val2017 for evaluation. Since Mask R-CNN jointly solves box-based detection and instance segmentation, we trained single Mask R-CNN models for both box-based and mask-based evaluation. The training schedule was set to the 1× schedule, since longer training schedules tend to make detection performances similar regardless of initialization with pretrained models or random ones [25].

Segmentation

We finetuned FCNs [46] on PASCAL VOC [20] and Cityscapes [15]. While we did not see a major consensus in the SSL literature on the segmentation-evaluation protocol, we used FCN-D8, an FCN modified to have an eight-pixel stride via dilated convolutions, as provided by mmsegmentation [13], as the simplest option. On PASCAL VOC, we used train_aug2012 for training, with an input resolution of 512×512 and 20k training iterations. On Cityscapes, we used the train_fine subset for training, with an input resolution of 769×769 and 40k training iterations. This setting is the same as that of [24, 85].

4.3 Results

Comparisons with the baseline

We first compare our Ladder-BYOL model with BYOL, which our implementation is based on and we regard as a baseline. The results are shown in Table 1. We observed improvements over the baselines with our Ladder version on all datasets and training schedules we used. The relative improvements were 0.9 % points in ImageNet linear classification (IN), 1.2 % points in COCO (CC) detection, and 2.8 % points in VOC segmentation when we adopted 100-epoch pretraining. With 200-epoch pretraining, the improvements were 1.0 % points in IN, 0.5 % points in CC detection, and 3.1 % points in VOC segmentation, which shows that our Ladder Siam training framework is consistently beneficial in combination with BYOL. The improvement is + 0.5 % points in IN with the longer 400-epoch pretraining. This relative improvement is a bit smaller than in shorter-term training, and we regard this as the result of faster convergence.

Comparisons with state of the art methods

We show the results in Table 2. We selected recent ResNet50-based SSL methods that do not rely on extra region extractors or multi-crop strategies, so that we can draw a fair comparison. Unless otherwise noted, the reported scores of the compared methods are from their original papers and from the DenseSiam [85] paper, which took great care to provide fair 200-epoch-pretraining comparisons based on its reimplementations. In the same way, we report the 200-epoch results of our models. As marked by * in the table, we list the classification accuracy of the 400-epoch-pretrained model for PixPro due to the unavailability of 200-epoch weights, and the classification accuracy of the 100-epoch model for RegionCL-SimSiam, which we expect to be similar to its 200-epoch result due to the fast convergence and saturation of SimSiam-based methods [9].

Given the diversity of the downstream tasks, we do not see a single clear winner. However, our Ladder-BYOL maintains a balance of downstream performances at a high level, being the best in ImageNet-1k (IN) linear classification, the best in COCO (CC) box-based detection, and the second best in instance segmentation. For example, LEWEL-BYOL [32] performed similarly well to ours in classification and detection, but was found to be less generalizable to segmentation. In contrast, DenseCL [69] was the best in VOC segmentation, but at the cost of classification accuracy. Our Ladder-DenseBYOL is the second best in VOC semantic segmentation, while it performs similarly to but slightly worse than Ladder-BYOL in the other tasks. Thus, it can be regarded as a still-versatile but somewhat segmentation-oriented backbone.

Component-wise comparisons

We further show component-wise comparisons focusing on methods that are connected to ours in their underlying ideas and consist of similar components. First, we compare the effect of adding intermediate losses with DetCo [74]. Table 3 shows the changes in ImageNet classification performance caused by adding MLS via the intermediate losses; the DetCo numbers were provided by the original paper [74] as part of an ablative study, and ours were computed by us. While the two DetCo variants degraded their classification performance with MLS, ours improved. A possible cause of this reversal is the difference between contrastive and non-contrastive losses; the contrastive loss used in DetCo can be bottlenecked by its reliance on negative pairs, which are sometimes too hard to distinguish from positives [38]. This might be more harmful when the losses are assigned to less powerful intermediate layers.

Table 4 compares the effect of incorporating dense SSL losses [69, 75] into various base SSL methods. Note that PixPro and DenseCL used dense losses as their top-level supervision, while our Ladder models exploit dense losses on intermediate layers. Regardless of the base method or dense-loss type, adding dense losses degraded classification and improved segmentation in all examined conditions. However, the classification degradation in our Ladder-DenseBYOL, -0.7 % points, is milder than in the others. This observation suggests that using dense losses as intermediate supervision is a reasonable way to relieve classification degradation.

Table 3: Effect of adding multi-level supervision on ImageNet accuracy.
Baseline + Multi-level sup.
DetCo w/o GLS [74] 64.3 63.2 (- 1.1)
DetCo w/ GLS [74] 67.1 66.6 (- 0.5)
Ours 71.7 72.8 (+ 1.1)
Table 4: Effect of adding dense SSL losses.
Dense IN acc@1 VOC mIoU
base MoCov2 [10] 67.6 67.5
DenseCL [69] 63.3 (- 4.3) 69.4 (+ 1.9)
base BYOL [22] 67.4 63.8
PixPro [75] 66.3 (- 1.1) 65.0 (+ 1.2)
DenseBYOL 65.2 (- 2.2) 65.2 (+ 1.4)
base Ladder-BYOL 72.8 67.4
Ladder-DenseBYOL 72.0 (- 0.8) 68.6 (+ 1.2)
Table 5: Effect of replacing global losses with dense losses. “G” and “D” denote the usage of global and dense losses respectively. We used 100-epoch pretraining.
IN VOC
res2 res3 res4 res5 acc@1 mIoU
BYOL G 67.4 64.3
DenseBYOL D 65.2 65.1
Ladder-B G G G G 68.8 66.6
Ladder-DB D D G G 68.2 67.1
D D D D 67.7 66.0
Table 6: Effect of the loss-weight hyperparameters.
Method Epochs Loss weights Acc@1 mAP
Ladder- 100 [0, 0, 0, 1] 67.4 39.3
BYOL [1/32, 1/16, 1/8, 1] 68.8 40.2
[1/16, 1/8, 1/4, 1] 68.3 40.5
[1/8, 1/4, 1/2, 1] 66.9 40.7
[1/8, 1/8, 1/8, 1] 68.5 40.2
Ladder- 200 [0, 0, 0, 1] 71.7 40.7
BYOL [1/32, 1/16, 1/8, 1] 71.7 41.1
[1/16, 1/8, 1/4, 1] 72.8 41.4
[1/8, 1/4, 1/2, 1] 72.5 40.9
Table 7: Results of classification using intermediate layers.
Method res2 res3 res4 res5
BYOL 28.5 40.9 56.8 68.1
Ladder-BYOL 31.9 47.2 61.7 68.7
Figure 3: Distribution of the Euclidean distance between the two data-augmented views, measured at each stage of the network.
Figure 4: Visualization of gradients, i.e., supervisory signals during training, at the earliest stage, as provided by each intermediate loss.
Figure 5: Correlations between downstream performances. The classification-vs.-detection correlation is positive, while the correlations between segmentation and the other tasks are negative.

Hyperparameters and ablations

We conducted ablation analyses of the intermediate losses; the results are summarized in Table 5. Ladder-BYOL and Ladder-DenseBYOL were confirmed to outperform their single-loss counterparts. We additionally tested a Ladder-DenseBYOL variation in which all losses are dense, but it was suboptimal in both classification and segmentation, offering further evidence of the effectiveness of mixing dense and global losses.

Next, we investigated the impact of the intermediate-loss weights as hyperparameters. The results are summarized in Table 6. In setting the loss weights, we followed Eq. 6 and modified the coefficient ww to control the strength of the overall intermediate losses. An interesting trend was seen in the 100-epoch pretraining; larger intermediate loss weights degraded classification and improved detection. This implies that models could be tuned to be classification-oriented or detection-oriented just by the hyperparameter. However, the same trend was not observed in the 200-epoch pretraining. This might be related to the stronger convergence by longer training, which results in a single good setting rather than selectable variations.

Analyses on intermediate representations

We investigated how intermediate representations differ when directly exposed to supervision in Ladder models. Table 7 summarizes the linear-probing results of the intermediate representations at each stage. We used IN classification here, and improvements in all intermediate layers were confirmed. Figure 3 shows, as violin plots, the distributions of the Euclidean distance between two randomly data-augmented views measured at each level. The distance was computed using representation vectors after global average pooling. Ladder training provided stronger consistency against the data augmentation to the intermediate layers lower than res4, which is seemingly the source of the improved intermediate-layer discriminability.
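A sketch of how such per-stage distances can be measured is given below; the dictionary of stage modules is an assumed interface, and the details differ per codebase.

```python
# A hedged sketch of the per-stage measurement behind Fig. 3: run two augmented
# views through the backbone stage by stage, global-average-pool each stage's
# output, and record the Euclidean distance. `backbone_stages` (an ordered
# dict of stage modules, e.g. conv1, res2, ..., res5) is an assumed interface.
import torch
import torch.nn.functional as F

@torch.no_grad()
def stagewise_view_distances(backbone_stages, view_a: torch.Tensor, view_b: torch.Tensor):
    dists, za, zb = {}, view_a, view_b
    for name, stage in backbone_stages.items():
        za, zb = stage(za), stage(zb)                         # propagate both views
        ga = torch.flatten(F.adaptive_avg_pool2d(za, 1), 1)   # (B, C) pooled vectors
        gb = torch.flatten(F.adaptive_avg_pool2d(zb, 1), 1)
        dists[name] = (ga - gb).norm(dim=1)                   # per-sample Euclidean distance
    return dists
```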

Gradient visualization

After confirming the effectiveness of Ladder Siam, we further investigated whether the roles of the intermediate losses as supervisors are similar or whether some division of roles emerges. We observed signs of role division in Fig. 4, which visualizes the gradients of each level's loss with respect to an intermediate representation. Viewing the losses $L_{\text{res3}}$, $L_{\text{res4}}$, and $L_{\text{res5}}$ as functions of the representation $\bm{z}_{\text{res2}}$, we computed $\frac{\partial L_{\text{res-}i}}{\partial \bm{z}_{\text{res2}}}$, the contribution of each $L_{\text{res-}i}$ to the total gradient of $\bm{z}_{\text{res2}}$, for $i = 3, 4, 5$. We excluded $L_{\text{res2}}$ from the visualization because the global average pooling on $\bm{z}_{\text{res2}}$ provides spatially uniform gradients, which are unsuitable for visualization. For plotting, we took the absolute sum along the channel axis and visualized the 2-D patterns of gradient magnitude. In Fig. 4, the later-level gradients are more focused on objects while the earlier-level ones are distributed more globally over both foregrounds and backgrounds, which can be useful for widely collecting learnable factors. At the same time, the earlier losses might become non-object-centric when disrupted by background clutter; we hypothesize that the later-level and earlier-level losses work complementarily.
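The computation behind this visualization can be sketched as follows; the construction of each loss is abstracted away, and the function name is illustrative.

```python
# A hedged sketch of the gradient maps in Fig. 4: treat a later-level loss
# L_res-i as a function of z_res2, take its gradient with autograd, and reduce
# it to a 2-D magnitude map by summing absolute values over the channel axis.
import torch

def gradient_magnitude_map(level_loss: torch.Tensor, z_res2: torch.Tensor) -> torch.Tensor:
    """level_loss: scalar L_res-i computed from z_res2 (which requires grad)."""
    (grad,) = torch.autograd.grad(level_loss, z_res2, retain_graph=True)
    return grad.abs().sum(dim=1)   # (B, H, W) pattern of gradient magnitude
```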

Meta-analyses: how do downstream-task performances correlate?

Finally, we are interested in the correlation patterns among the various SSL methods' downstream-task performances summarized in Table 2. Is good performance in one downstream task a sign of good performance in another? To answer this question, we conducted correlation-coefficient testing over the scores in Table 2 as a post-hoc analysis. We calculated Pearson's correlation coefficients for a) ImageNet linear accuracy vs. COCO detection box mAP, b) ImageNet linear accuracy vs. Cityscapes segmentation mIoU, and c) COCO detection box mAP vs. Cityscapes segmentation mIoU. For segmentation, we selected Cityscapes because more data points (i.e., reported scores) are available. Due to the triple comparison, we applied Bonferroni correction [19, 5] in the statistical testing. As a result, a positive correlation of $\rho = 0.84$ was observed in the classification-detection comparison, and negative correlations in the classification-segmentation ($\rho = -0.70$) and detection-segmentation ($\rho = -0.82$) comparisons, as shown in Fig. 5. While the positive correlation is consistent with SSL's goal of pursuing generally reusable representations, the negative correlations between Cityscapes segmentation and the other tasks are worth noting in future research designs.
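The post-hoc analysis itself is simple and can be sketched as below; the score lists are placeholders to be filled from Table 2, and the helper name is ours.

```python
# A sketch of the post-hoc correlation analysis: Pearson correlation between
# two lists of downstream scores, with a Bonferroni-corrected significance
# threshold for the three pairwise comparisons. Score lists are placeholders.
from scipy.stats import pearsonr

def correlate(scores_x, scores_y, n_tests: int = 3, alpha: float = 0.05):
    rho, p_value = pearsonr(scores_x, scores_y)
    significant = p_value < alpha / n_tests        # Bonferroni correction
    return rho, p_value, significant

# e.g. correlate(imagenet_acc, coco_box_map) for the classification-detection pair
```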

5 Conclusion

In this paper, we presented the Ladder Siamese Network, a conceptually simple yet effective framework for stably learning versatile self-supervised representations. Beyond its effectiveness, Ladder Siam's advantage is its flexibility to incorporate various learning mechanisms at each level, which may inspire more sophisticated designs of self-supervised learning objectives. In future work, we will explore further effective combinations of loss functions, such as region-proposal-based and unsupervised-segmentation-based ones, within our Ladder Siam framework.

Limitations

While Ladder Siam worked well with the hierarchical representations of conv nets, its applicability to Vision Transformers [17] remains an open question. Hierarchical Transformers [68, 44, 30] are promising in vision tasks and would be compatible with MLS. However, non-hierarchical Transformers have also been found to be competitive [40]. The question of whether we should apply MLS to Transformers interacts with the question of whether Transformers should be hierarchical, and the two might need to be considered in parallel.

Acknowledgements

The authors would like to thank the members of the Tech Lab and the Image-processing Group, Yahoo Japan Corporation, for their helpful comments and discussion.

References

  • [1] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In CVPR, pages 328–335, 2014.
  • [2] Sanjeev Arora, Hrishikesh Khandeparkar, Mikhail Khodak, Orestis Plevrakis, and Nikunj Saunshi. A theoretical analysis of contrastive unsupervised representation learning. In ICML, pages 9904–9923. International Machine Learning Society (IMLS), 2019.
  • [3] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In ICLR, 2022.
  • [4] Adrien Bardes, Jean Ponce, and Yann LeCun. VICRegL: Self-supervised learning of local visual features. In NeurIPS, 2022.
  • [5] Frank Bretz, Torsten Hothorn, and Peter Westfall. Multiple comparisons using R. Chapman and Hall/CRC, 2016.
  • [6] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. NeurIPS, 6, 1993.
  • [7] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS, 33:9912–9924, 2020.
  • [8] Hesen Chen, Ming Lin, Xiuyu Sun, and Rong Jin. Hierarchical cross contrastive learning of visual representations. preprint on OpenReview, 2021.
  • [9] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607, 2020.
  • [10] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  • [11] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, pages 15750–15758, 2021.
  • [12] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, pages 9640–9649, 2021.
  • [13] MMSegmentation Contributors. MMSegmentation: Openmmlab semantic segmentation toolbox and benchmark. https://github.com/open-mmlab/mmsegmentation, 2020.
  • [14] MMSelfSup Contributors. MMSelfSup: Openmmlab self-supervised learning toolbox and benchmark. https://github.com/open-mmlab/mmselfsup, 2021.
  • [15] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.
  • [16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.
  • [17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
  • [18] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. NeurIPS, 27, 2014.
  • [19] Olive Jean Dunn. Multiple comparisons among means. Journal of the American statistical association, 56(293):52–64, 1961.
  • [20] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
  • [21] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
  • [22] Jean-Bastien Grill et al. Bootstrap your own latent-a new approach to self-supervised learning. NeurIPS, 33:21271–21284, 2020.
  • [23] Tomohiro Hayase, Suguru Yasutomi, and Nakamasa Inoue. Downstream augmentation generation for contrastive learning. ICASSP, 2022.
  • [24] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.
  • [25] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. In ICCV, pages 4918–4927, 2019.
  • [26] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, pages 2961–2969, 2017.
  • [27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [28] Olivier J Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron Van den Oord, Oriol Vinyals, and João Carreira. Efficient visual pretraining with contrastive detection. In ICCV, pages 10086–10096, 2021.
  • [29] Olivier J. Hénaff, Skanda Koppula, Evan Shelhamer, Daniel Zoran, Andrew Jaegle, João Carreira, Andrew Zisserman, and Relja Arandjelović. Object discovery and representation networks. In ECCV, pages 260–277. Springer, 2022.
  • [30] Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, and Seong Joon Oh. Rethinking spatial dimensions of vision transformers. In ICCV, pages 11936–11945, 2021.
  • [31] Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. On feature decorrelation in self-supervised learning. In ICCV, pages 9598–9608, 2021.
  • [32] Lang Huang, Shan You, Mingkai Zheng, Fei Wang, Chen Qian, and Toshihiko Yamasaki. Learning where to learn in cross-view self-supervised learning. In CVPR, pages 14451–14460, 2022.
  • [33] Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. In ICLR, 2021.
  • [34] Aakash Kaku, Sahana Upadhya, and Narges Razavian. Intermediate layers matter in momentum contrastive self supervised learning. In NeurIPS, volume 34, 2021.
  • [35] Rei Kawakami, Ryota Yoshihashi, Seiichiro Fukuda, Shaodi You, Makoto Iida, and Takeshi Naemura. Cross-connected networks for multi-task learning of detection and segmentation. In ICIP, pages 3636–3640. IEEE, 2019.
  • [36] Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, et al. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2. Lille, 2015.
  • [37] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial intelligence and statistics, pages 562–570. PMLR, 2015.
  • [38] Alexander C Li, Alexei A Efros, and Deepak Pathak. Understanding collapse in non-contrastive learning. In ECCV, 2022.
  • [39] Xiaoni Li, Yu Zhou, Yifei Zhang, Aoting Zhang, Wei Wang, Ning Jiang, Haiying Wu, and Weiping Wang. Dense semantic contrast for self-supervised visual representation learning. In ACM MM, pages 1368–1376, 2021.
  • [40] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. In ECCV, 2022.
  • [41] Zhaowen Li, Yousong Zhu, Fan Yang, Wei Li, Chaoyang Zhao, Yingying Chen, Zhiyang Chen, Jiahao Xie, Liwei Wu, Rui Zhao, et al. UniVIP: A unified framework for self-supervised visual pre-training. In CVPR, pages 14627–14636, 2022.
  • [42] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, pages 2117–2125, 2017.
  • [43] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755. Springer, 2014.
  • [44] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, pages 10012–10022, 2021.
  • [45] Stuart Lloyd. Least squares quantization in PCM. IEEE transactions on information theory, 28(2):129–137, 1982.
  • [46] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
  • [47] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
  • [48] Chao Ma, Jia-Bin Huang, Xiaokang Yang, and Ming-Hsuan Yang. Hierarchical convolutional features for visual tracking. In ICCV, pages 3074–3082, 2015.
  • [49] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. Cross-stitch networks for multi-task learning. In CVPR, pages 3994–4003, 2016.
  • [50] Atsuyuki Miyai, Qing Yu, Daiki Ikami, Go Irie, and Kiyoharu Aizawa. Rethinking rotation in self-supervised contrastive learning: Adaptive positive or negative data augmentation. In WACV, 2023.
  • [51] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, pages 69–84. Springer, 2016.
  • [52] Kento Nozawa and Issei Sato. Understanding negative samples in instance discriminative self-supervised representation learning. NeurIPS, 34:5784–5797, 2021.
  • [53] Bo Pang, Yifan Zhang, Yaoyi Li, Jia Cai, and Cewu Lu. Unsupervised visual representation learning by synchronous momentum grouping. In ECCV, 2022.
  • [54] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, pages 2536–2544, 2016.
  • [55] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? NeurIPS, 34:12116–12128, 2021.
  • [56] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. NeurIPS, 28, 2015.
  • [57] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. ICLR, 2015.
  • [58] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (MICCAI), pages 234–241. Springer, 2015.
  • [59] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, pages 843–852, 2017.
  • [60] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
  • [61] Chenxin Tao, Honghui Wang, Xizhou Zhu, Jiahua Dong, Shiji Song, Gao Huang, and Jifeng Dai. Exploring the equivalence of siamese self-supervised learning via a unified gradient framework. In CVPR, pages 14431–14440, 2022.
  • [62] Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning dynamics without contrastive pairs. In ICML, pages 10268–10278. PMLR, 2021.
  • [63] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? NeurIPS, 33:6827–6839, 2020.
  • [64] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. IJCV, 104(2):154–171, 2013.
  • [65] Harri Valpola. From neural PCA to deep unsupervised learning. In Advances in independent component analysis and learning machines, pages 143–171. Elsevier, 2015.
  • [66] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research, 11(12), 2010.
  • [67] Renhao Wang, Hang Zhao, and Yang Gao. CYBORGS: Contrastively bootstrapping object representations by grounding in segmentation. In ECCV, pages 260–277. Springer, 2022.
  • [68] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In ICCV, pages 568–578, 2021.
  • [69] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense contrastive learning for self-supervised visual pre-training. In CVPR, pages 3024–3033, 2021.
  • [70] Zhaoqing Wang, Qiang Li, Guoxin Zhang, Pengfei Wan, Wen Zheng, Nannan Wang, Mingming Gong, and Tongliang Liu. Exploring set similarity for dense self-supervised representation learning. In CVPR, pages 16590–16599, 2022.
  • [71] Fangyun Wei, Yue Gao, Zhirong Wu, Han Hu, and Stephen Lin. Aligning pretraining for detection via object-level contrastive learning. In NeurIPS, volume 34, pages 22682–22694, 2021.
  • [72] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pages 3733–3742, 2018.
  • [73] Tete Xiao, Colorado J. Reed, Xiaolong Wang, Kurt Keutzer, and Trevor Darrell. Region similarity representation learning. In ICCV, pages 8392–8401, 2021.
  • [74] Enze Xie, Jian Ding, Wenhai Wang, Xiaohang Zhan, Hang Xu, Peize Sun, Zhenguo Li, and Ping Luo. Detco: Unsupervised contrastive learning for object detection. In ICCV, 2021.
  • [75] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In CVPR, pages 16684–16693, 2021.
  • [76] Haohang Xu, Xiaopeng Zhang, Hao Li, Lingxi Xie, Wenrui Dai, Hongkai Xiong, and Qi Tian. Seed the views: Hierarchical semantic alignment for contrastive representation learning. IEEE TPAMI, 2022.
  • [77] Yufei Xu, Qiming Zhang, Jing Zhang, and Dacheng Tao. RegionCL: Can simple region swapping contribute to contrastive learning? In ECCV, 2022.
  • [78] Chuanguang Yang, Zhulin An, Linhang Cai, and Yongjun Xu. Hierarchical self-supervised augmented knowledge distillation. IJCAI, 2021.
  • [79] Xitong Yang, Xiaodong Yang, Sifei Liu, Deqing Sun, Larry Davis, and Jan Kautz. Hierarchical contrastive motion learning for video action recognition. In BMVC, 2021.
  • [80] Ryota Yoshihashi, Wen Shao, Rei Kawakami, Shaodi You, Makoto Iida, and Takeshi Naemura. Classification-reconstruction learning for open-set recognition. In CVPR, pages 4016–4025, 2019.
  • [81] Yang You, Igor Gitman, and Boris Ginsburg. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017.
  • [82] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In ICML, pages 12310–12320. PMLR, 2021.
  • [83] Junbo Zhang and Kaisheng Ma. Rethinking the augmentation module in contrastive learning: Learning hierarchical augmentation invariance with expanded views. In CVPR, pages 16650–16659, 2022.
  • [84] Linfeng Zhang, Xin Chen, Junbo Zhang, Runpei Dong, and Kaisheng Ma. Contrastive deep supervision. In ECCV, 2022.
  • [85] Wenwei Zhang, Jiangmiao Pang, Kai Chen, and Chen Change Loy. Dense siamese network for dense unsupervised learning. In ECCV, 2022.
  • [86] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In CVPR, pages 2881–2890, 2017.