2021
Jo Plested [1]
[1] School of Engineering and Information Technology, University of New South Wales, Northcott Drive, Campbell, 2612, ACT, Australia
[2] Optus Centre for Artificial Intelligence, Curtin University, Kent Street, Bentley, 6102, WA, Australia
Deep transfer learning for image classification: a survey
Abstract
Deep neural networks such as convolutional neural networks (CNNs) and transformers have achieved many successes in image classification in recent years. It has been consistently demonstrated that best practice for image classification is when large deep models can be trained on abundant labelled data. However there are many real world scenarios where the requirement for large amounts of training data to get the best performance cannot be met. In these scenarios transfer learning can help improve performance. To date there have been no surveys that comprehensively review deep transfer learning as it relates to image classification overall. However, several recent general surveys of deep transfer learning and ones that relate to particular specialised target image classification tasks have been published. We believe it is important for the future progress in the field that all current knowledge is collated and the overarching patterns analysed and discussed. In this survey we formally define deep transfer learning and the problem it attempts to solve in relation to image classification. We survey the current state of the field and identify where recent progress has been made. We show where the gaps in current knowledge are and make suggestions for how to progress the field to fill in these knowledge gaps. We present a new taxonomy of the applications of transfer learning for image classification. This taxonomy makes it easier to see overarching patterns of where transfer learning has been effective and where it has failed to fulfill its potential. This also allows us to suggest where the problems lie and how it could be used more effectively. We demonstrate that under this new taxonomy, many of the applications where transfer learning has been shown to be ineffective or even to hinder performance are to be expected when taking into account the source and target datasets and the techniques used. In many of these cases, the key problem is that methods and hyperparameter settings designed for large and very similar target datasets are used for smaller and much less similar target datasets. We identify alternative choices that could lead to better outcomes.
keywords:
Deep Transfer Learning, Image Classification, Convolutional Neural Networks, Deep Learning

1 Introduction
Deep neural network architectures such as convolutional neural networks (CNNs) and more recently transformers have achieved many successes in image classification krizhevsky2012imagenet ; girshick2014rich ; li2020deep ; masi2018deep ; mazurowski2019deep . It has been consistently demonstrated that these models perform best when there is abundant labelled data available for the task and large models can be trained ngiam2018domain ; mahajan2018exploring ; kolesnikov2019big . However there are many real world scenarios where the requirement for large amounts of training data cannot be met. Some of these are:
1. Insufficient data because the data is very rare or there are issues with privacy, etc. For example, new and rare disease diagnosis tasks in the medical domain have limited training data both because the examples themselves are rare and because of privacy concerns.
2. It is prohibitively expensive to collect and/or label data, for example when labelling can only be done by highly qualified experts in the field.
3. The long tail distribution, where a small number of objects/words/classes are very frequent and thus easy to model, while many more are rare and thus hard to model bengio2015sharing . For example, most language generation problems.
There are several other reasons why we may want to learn from a small number of training examples:
• It is interesting from a cognitive science perspective to attempt to mimic the human ability to learn general concepts from a small number of examples.
• There may be constraints on compute resources that limit training a large model from random initialisation with large amounts of data, for example environmental concerns strubell2019energy .
In all these scenarios transfer learning can often greatly improve performance. In this paradigm the model is trained on a related dataset and task for which more data is available, and the trained weights are used to initialise a model for the target task. In order for this process to improve rather than harm performance, the source dataset must be related closely enough to the target dataset and best practice methods must be used.
In this survey we review recent progress in deep transfer learning for image classification and highlight areas where knowledge is lacking and could be improved. With the exponentially increasing demand for the application of modern deep CNN models to a wider array of real world application areas, work in transfer learning has increased at a commensurate pace. It is important to regularly take stock and survey the current state of the field, where recent progress has been made and where the gaps in current knowledge are. We also make suggestions for how to progress the field to fill in these knowledge gaps. While there are many surveys in related domains and specific sub areas, to the best of our knowledge there are none that focus on deep transfer learning for image classification in general. We believe it is important for the future progress in the field that all current knowledge is collated and the overarching patterns analysed and discussed.
We make the following contributions:
1. formally defining deep transfer learning and the problem it attempts to solve as it relates to image classification;
2. performing a thorough review of recent progress in the field;
3. presenting a taxonomy of source and target dataset relationships in transfer learning applications that helps highlight why transfer learning does not perform as expected in certain application areas;
4. giving a detailed summary of source and target datasets commonly used in the area, to provide an easy reference for the reader looking to understand the relationships between where transfer learning has performed best and where results have been less consistent;
5. summarizing current knowledge in the area, as well as pointing out knowledge gaps and suggesting directions for future research.
In Section 2 we review all surveys in the area, from general transfer learning to more closely related domains. In Section 3 we introduce the problem domain and formalise the difficulties with learning from small datasets that transfer learning attempts to solve. This section includes terminology and definitions that are used throughout this paper. Section 4 details the source and target datasets commonly used in deep transfer learning for image classification. Section 5 provides a detailed analysis of all recent advances and improvements to transfer learning and specific application areas, and highlights gaps in current knowledge. In Section 6 we give an overview of other problem domains that are closely related to deep transfer learning for image classification, including the similarities and differences of each. Finally, Section 7 summarises all current knowledge, gaps and problems and recommends directions for future work in the area.
2 Related work
Many reviews related to deep transfer learning have been published in the past decade and the pace has only increased in the last few years. However, they differ from ours in two main ways. The first group consists of more general reviews that provide a high level overview of transfer learning and attempt to include all machine learning sub-fields and all task sub-fields. Reviews in this group are covered in Section 2.1. The second group is more specific, with reviews providing a comprehensive breakdown of the progress on a particular narrow domain specific task. They are discussed in the relevant parts of Section 5.7. There are a few surveys that are more closely related to ours, with differences discussed in Section 2.2.
2.1 General transfer learning surveys
The most recent general transfer learning survey jiang2022transferability is an extremely broad overview of most areas related to deep transfer learning, including the areas related to deep transfer learning for image classification outlined in Section 6. As it is a broad general survey there is no emphasis on how deep transfer learning applies to image classification, and thus the trends seen in this area are not covered.
A thorough theoretical analysis of general transfer learning techniques is given in zhuang2020comprehensive . Transfer learning techniques are split into data-based and model-based, then further divided into subcategories. Deep learning models are explicitly discussed as a subsection of the model-based categorisation. The focus is on generative models such as auto-encoders and Generative Adversarial Networks (GANs) and several papers are reviewed. Neural networks are also mentioned briefly under the Parameter Control Strategy and Feature Transformation Strategy sections. However, the focus is on unsupervised pretraining strategies, rather than best practice for transfer learning.
Zhang et al. zhang2019recent take the approach to categorizing the transfer learning task space that is most similar to ours. They divide transfer learning into 17 categories based on source and target dataset and label attributes, and then review approaches taken within each category. Since it is a general transfer learning survey with no focus on deep learning and image classification, the trends in this area are not covered.
Weiss et al. weiss2016survey divide general transfer learning into homogeneous, where the source and target dataset distributions are the same, and heterogeneous, where they are not, and give a thorough description of each. They review many different approaches in each category, but few of them are related to deep neural networks.
2.2 Closely related work
There are some recent review papers that, based on their titles, seem to be more closely related. However, they are short summary papers containing limited detail on the subject matter rather than full review papers.
A Survey on Deep Transfer Learning tan2018survey defines deep transfer learning and separates it into four categories based on the subset of techniques used. The focus is more on showing a broad selection of methods rather than providing much detail or focusing particularly on deep transfer learning methods. Most major works in the area from the past decade are missing.
Deep Learning and Transfer Learning Approaches for Image Classification krishna2019deep focuses on defining CNNs along with some of the major architectures and results from the past decade. The paper includes a few brief paragraphs defining transfer learning and some of the image classification results incorporate transfer learning, but no review of the topic is performed.
A Survey of Transfer Learning for Convolutional Neural Networks ribani2019survey is a short paper which briefly introduces the transfer learning task and settings, and introduces general categories of approaches and applications. It does not review any specific approaches or applications.
Transfer Learning for Visual Categorization: A Survey shao2014transfer is a full review paper, but it is older and includes no deep learning techniques.
In Small Sample Learning in Big Data Era shu2018small deep transfer learning is a large part of the work, but not the focus. Some examples of deep learning applied to image classification domains are mentioned, but there is no discussion of methods for improving deep transfer learning as it relates to image classification.
3 Overview
3.1 Problem Definition
In this section we introduce the definitions used throughout the paper. Transfer learning can be categorised by both the task and the model. We start by defining the model, then the task, and finally how they interact in this setting.
Deep learning is a modern name for neural networks with more than one hidden layer. Neural networks are themselves a sub-area of machine learning. Mitchell mitchell1997machine provides a succinct definition of machine learning:
Definition 1.
”A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Neural networks are defined by Gurney gurney1997introduction as:
Definition 2.
”A neural network is an interconnected assembly of simple processing elements, units or nodes, called neurons, whose functionality is loosely based on the animal neuron. The processing ability of the network is stored in the inter unit connection strengths, or weights, obtained by a process of adaptation to, or learning from, a set of training patterns.”
The neurons in a multilayer feed forward neural network of the type that we consider in this review have nonlinear activation functions goodfellow2016deep and are arranged in layers with weights feeding forward from one layer to the next.
Generally, a neural network learns to improve its performance at task T from experience E, being the set of training patterns, via gradient descent and backpropagation. Backpropagation is an application of the chain rule used to propagate derivatives from the final layers of the neural network back to the hidden and input weights rumelhart1986learning . There are other, less frequently used ways to train neural networks, such as genetic algorithms, that have been shown to be successful in particular applications. In this paper we assume training is done via backpropagation for generality.
While it has been proven that neural networks with one hidden layer are universal approximators hornik1989multilayer , in practice the loss function is non-convex with respect to the weights and is difficult to optimise. For this reason modern models are often arranged as very deep networks with task specific architectures, like CNNs and transformers for images, to allow for easier training of parameters.
The hierarchical structure of these networks allows ever more complex patterns to be learned. This is one of the things that has allowed deep learning to be successful at many different tasks in recent years when compared to other machine learning algorithms. However, this only applies if there is enough data to train them. Figure 1 shows the increase in ImageNet 1K performance with the number of model parameters. Figure 2 shows that for large modern CNN models in general the performance on ImageNet 1K increases with the number of training examples in the source dataset. This suggests that large modern CNNs are likely overfitting when trained from random initialization on ImageNet 1K. Of course there are some outliers, as the increase in performance from additional source data also depends on how related the source data is to the target data. This is discussed further in Section 5.3.1. These two results combine to show the stated effect that deep learning performance scales with the size of the dataset and model.
[Figure 1: ImageNet 1K performance versus number of model parameters.]
[Figure 2: ImageNet 1K performance versus number of training examples in the source dataset.]
As noted in Section 1, there are many real world scenarios where large amounts of data are unavailable or where we are interested in training a model on a small amount of data for other reasons.
3.2 Learning from small vs large datasets
A thorough review of the problems of learning from a small number of training examples is given in wang2020generalizing .
3.2.1 Empirical Risk Minimization
We are interested in finding a function $h$ that minimises the expected risk:
$$R(h) = \int \ell(h(x), y)\,\mathrm{d}p(x, y) = \mathbb{E}\left[\ell(h(x), y)\right],$$
with $p(x, y)$ the joint probability distribution over inputs $x$ and labels $y$, and $\ell$ the loss function.
$R(h)$ is the true risk if we have access to an infinite set of all possible data and labels, with
$$h^{*} = \arg\min_{h} R(h)$$
being the function that minimizes the true risk. In practical applications, however, the joint probability distribution is unknown and the only available information is contained in the training set $D_{train} = \{(x_i, y_i)\}_{i=1}^{I}$. For this reason the true risk is replaced by the empirical risk, which is the average of the sample losses over the training set:
$$R_{I}(h) = \frac{1}{I}\sum_{i=1}^{I} \ell(h(x_i), y_i),$$
leading to empirical risk minimisation vapnik1992principles .
Before we begin training our model we must choose a family of candidate functions $\mathcal{H}$. In the case of CNNs this involves choosing the relevant hyperparameters that determine our model architecture, including the number of layers, the number and shape of filters in each convolutional layer, whether and where to include features like residual connections and normalization layers, and many more. This constrains our final function to the family of candidate functions $\mathcal{H}$ defined by the free parameters that make up the given architecture. We are then attempting to find the function in $\mathcal{H}$ which minimises the empirical risk:
$$h_{I} = \arg\min_{h \in \mathcal{H}} R_{I}(h).$$
Since the optimal function $h^{*}$ is unlikely to be in $\mathcal{H}$ we also define:
$$h^{*}_{\mathcal{H}} = \arg\min_{h \in \mathcal{H}} R(h)$$
to be the function in $\mathcal{H}$ that minimises the true risk. We can then decompose the excess error that comes from choosing the function in $\mathcal{H}$ that minimizes $R_I(h)$:
$$\mathbb{E}\left[R(h_{I}) - R(h^{*})\right] = \underbrace{\mathbb{E}\left[R(h^{*}_{\mathcal{H}}) - R(h^{*})\right]}_{\mathcal{E}_{app}(\mathcal{H})} + \underbrace{\mathbb{E}\left[R(h_{I}) - R(h^{*}_{\mathcal{H}})\right]}_{\mathcal{E}_{est}(\mathcal{H}, I)}.$$
The approximation error $\mathcal{E}_{app}(\mathcal{H})$ measures how closely functions in $\mathcal{H}$ can approximate the optimal solution $h^{*}$. The estimation error $\mathcal{E}_{est}(\mathcal{H}, I)$ measures the effect of minimizing the empirical risk $R_I(h)$ instead of the expected risk $R(h)$ bottou2008tradeoffs . So finding a function $h_I$ that is as close as possible to $h^{*}$ can be broken down into:
1. choosing a class of models $\mathcal{H}$ that is more likely to contain the optimal function $h^{*}$, and
2. having a large and broad range of training examples in $D_{train}$ to better approximate an infinite set of all possible data and labels.
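As a concrete illustration of the quantity being minimised, the short sketch below (our own example, not from the cited works) computes the empirical risk $R_I(h)$ of a PyTorch classifier as the average sample loss over a finite training set. With few samples this average can be driven low without the expected risk being low, which is the overfitting problem discussed next.

```python
import torch
import torch.nn.functional as F

def empirical_risk(model, data_loader, device="cpu"):
    """Average per-sample loss over a finite training set: the empirical risk R_I(h).

    With unlimited data drawn from p(x, y) this average would approach the expected
    (true) risk R(h); with a small training set it can be a poor, easily over-fit
    approximation of R(h)."""
    model.eval()
    total_loss, total_examples = 0.0, 0
    with torch.no_grad():
        for inputs, labels in data_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            logits = model(inputs)
            # Sum rather than mean so batches of different sizes are weighted correctly.
            total_loss += F.cross_entropy(logits, labels, reduction="sum").item()
            total_examples += labels.size(0)
    return total_loss / total_examples
```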
3.2.2 Unreliable Empirical Risk Minimizer
In general, the estimation error $\mathcal{E}_{est}(\mathcal{H}, I)$ can be reduced by having a larger number of examples $I$ wang2020generalizing . Thus, when there are sufficient and varied labelled training examples in $D_{train}$, the empirical risk minimizer $h_I$ can provide a good approximation to the optimal function $h^{*}_{\mathcal{H}}$ in $\mathcal{H}$. When the number of training examples in $D_{train}$ is small, the empirical risk $R_I(h)$ may not be a good approximation of the expected risk $R(h)$. In this case the empirical risk minimizer $h_I$ overfits.
To alleviate the problem of having an unreliable empirical risk minimizer when $D_{train}$ is not sufficient, prior knowledge can be used. Prior knowledge can be used to augment the data in $D_{train}$, to constrain the candidate functions in $\mathcal{H}$, or to constrain the parameters of $h$ via initialization or regularization wang2020generalizing . Task specific deep neural network architectures such as CNNs and Recurrent Neural Networks (RNNs) are examples of constraining the candidate functions through prior knowledge of what the optimal function form may be.
In this review we focus on transfer learning as a form of constraining the parameters of $h$ to address the unreliable empirical risk minimizer problem. Section 6 discusses how deep transfer learning relates to other techniques that use prior knowledge to solve the small dataset problem.
3.3 Deep transfer learning
Deep transfer learning is transfer learning applied to deep neural networks. Pan and Yang pan2009survey define transfer learning as:
Definition 3.
“Given a source domain $\mathcal{D}_S$ and learning task $\mathcal{T}_S$, a target domain $\mathcal{D}_T$ and learning task $\mathcal{T}_T$, transfer learning aims to help improve the learning of the target predictive function $f_T(\cdot)$ in $\mathcal{D}_T$ using the knowledge in $\mathcal{D}_S$ and $\mathcal{T}_S$, where $\mathcal{D}_S \neq \mathcal{D}_T$, or $\mathcal{T}_S \neq \mathcal{T}_T$.”
For the purposes of this paper we define deep transfer learning as follows:
Definition 4.
Given a source domain $\mathcal{D}_S$ and learning task $\mathcal{T}_S$, and a target domain $\mathcal{D}_T$ and learning task $\mathcal{T}_T$, deep transfer learning aims to improve the performance of the target model on the target task $\mathcal{T}_T$ by initialising it with weights $W$ that are trained on the source task $\mathcal{T}_S$ using the source dataset $D_S$ (pretraining), where $\mathcal{D}_S \neq \mathcal{D}_T$ or $\mathcal{T}_S \neq \mathcal{T}_T$.
Some or all of the pretrained weights $W$ are retained when the model is “transferred” to the target task $\mathcal{T}_T$ and dataset $D_T$. The model is used for prediction on $\mathcal{T}_T$ after fully training any reinitialised weights, with or without continuing training on the pretrained weights (fine-tuning). Figure 3 shows the pretraining and fine-tuning pipeline when applying transfer learning with a deep neural network.
[Figure 3: The pretraining and fine-tuning pipeline for deep transfer learning with a deep neural network.]
Combining the discussion from Section 3.2.2 with Definition 4, using deep transfer learning techniques to pretrain the weights $W$ can be thought of as regularizing $W$. Initialising $W$ with weights that have been well trained on a large source dataset, rather than with very small random values, results in a flatter loss surface and smaller gradients, which in turn results in more stable updates neyshabur2020being ; liu2019towards . In the classic transfer learning setting the source dataset is many orders of magnitude larger than the target dataset. One example is pretraining on ImageNet 1K with 1.3 million training images and transferring to medical imaging tasks, which often have only hundreds of labelled examples. So even with the same learning rate and number of epochs, the number of updates to the weights while training on the target dataset will be orders of magnitude smaller than during pretraining. This also prevents the model from creating large weights that are based on noise or idiosyncrasies in the small target dataset.
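As a concrete example of this pipeline, the sketch below uses an ImageNet 1K pretrained ResNet-50 from torchvision as the source model, reinitialises the final classification layer for a hypothetical 37-class target task, and fine-tunes all weights with a small learning rate so that the pretrained weights stay close to their initial values. It is a minimal illustration of the standard recipe rather than a prescription; the class count and hyperparameter values are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_TARGET_CLASSES = 37  # hypothetical target task, e.g. a small pets dataset

# 1. Pretraining: start from weights W trained on the source task (ImageNet 1K).
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# 2. Transfer: retain the pretrained backbone, reinitialise the task-specific head.
model.fc = nn.Linear(model.fc.in_features, NUM_TARGET_CLASSES)

# 3. Fine-tuning: continue training all weights on the (small) target dataset.
#    A small learning rate keeps W near its pretrained values, which acts as an
#    implicit regularizer as discussed above.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(inputs, labels):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Whether to freeze some of the pretrained layers, and how many to reinitialise, are the initialization choices discussed in the categorisation below.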
Advances in transfer learning can be categorized based on the ways of constraining the parameters $W$ as follows:
1. Initialization. Answering questions like (a minimal sketch of partial layer transfer is given after this list):
• how much pretraining should be done?
• is more source data or more closely related source data better?
• which pretrained parameters should be transferred vs reinitialized?
2. Parameter regularization. Regularizing the weights, with the assumption that if the parameters are constrained to be close to a set point, they will be less likely to overfit.
3. Feature regularization. Regularizing the features for each training example that are produced by the weights, based on the assumption that if the features stay close to those trained by the large source dataset the model will be less likely to overfit.
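The initialization questions in point 1 are often answered in practice by copying only part of the pretrained network, as in the sketch below. This is our own illustration: the hyperparameter n_stages_to_transfer and the target class count are placeholders, and the stage names assume a torchvision ResNet-50.

```python
from torchvision import models

def partial_transfer(n_stages_to_transfer=3, num_target_classes=100):
    """Transfer the stem and the first n residual stages of an ImageNet-pretrained
    ResNet-50; leave the remaining stages and the classifier randomly initialised."""
    pretrained = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    target = models.resnet50(weights=None, num_classes=num_target_classes)

    transferred = ["conv1", "bn1"] + [f"layer{i}" for i in range(1, n_stages_to_transfer + 1)]
    target_state = target.state_dict()
    for name, tensor in pretrained.state_dict().items():
        if any(name.startswith(prefix) for prefix in transferred):
            target_state[name] = tensor  # keep the pretrained weights for these stages
    target.load_state_dict(target_state)
    return target
```

Transferring fewer stages retains less source-specific knowledge, which, as discussed in Section 3.4, can reduce the risk of negative transfer when the source and target tasks are less related.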
We describe progress and problems with deep transfer learning under these categories, as well as based on the relationship between the source and target datasets, in Section 5. Then, in Section 6, we describe how deep transfer learning relates to other methods.
3.4 Negative Transfer
The stated goal of transfer learning as per Definition 3 is to improve the learning of the target predictive function $f_T(\cdot)$ in $\mathcal{D}_T$ using the knowledge in $\mathcal{D}_S$ and $\mathcal{T}_S$. To achieve this goal the source dataset must be similar enough to the target dataset to ensure that the features learned in pretraining are relevant to the target task. If the source dataset is not well related to the target dataset the target model can be negatively impacted by pretraining. This is negative transfer rosenstein2005transfer ; pan2009survey . Wang et al. wang2019characterizing ; wang2019transferable define the negative transfer gap (NTG) as follows:
Definition 5.
“Let $\epsilon_\tau(\cdot)$ represent the test error on the target domain, $A$ a specific transfer learning algorithm under which the negative transfer gap is defined, and $\emptyset$ the case where the source domain data/information are not used by the target domain learner. Then, negative transfer happens when the error using the source data is larger than the error without using the source data, $\epsilon_\tau(A(S, T)) > \epsilon_\tau(A(\emptyset, T))$, and the degree of negative transfer can be evaluated by the negative transfer gap $\mathrm{NTG} = \epsilon_\tau(A(S, T)) - \epsilon_\tau(A(\emptyset, T))$.”
From this definition we see that negative transfer occurs when the negative transfer gap is positive. Wang et al. elaborate on the factors that affect negative transfer wang2019characterizing ; wang2019transferable :
• Divergence between the source and target domains. Transfer learning makes the assumption that there is some similarity between the joint distributions $P_S(X, Y)$ in the source domain and $P_T(X, Y)$ in the target domain. The higher the divergence between these distributions, the less information there is in the source domain that can be exploited to improve performance in the target domain. In the extreme case, if there is no similarity it is not possible for transfer learning to improve performance.
• Negative transfer is relative to the size and quality of the source and target datasets. For example, if labelled target data is abundant enough, a model trained on this data only may perform well. In this example, transfer learning methods are more likely to impair the target learning performance. Conversely, if there is no labelled target data, a bad transfer learning method would perform better than a random guess, which means negative transfer would not happen.
With deep neural networks, once the weights have been pretrained to respond to particular features in a large source dataset, the weights will not move far from their pretrained values during fine-tuning neyshabur2020being . This is particularly so if the target dataset is orders of magnitude smaller, as is often the case. This premise allows transfer learning to improve performance, and it also allows for negative transfer. If the transferred weights are pretrained to respond to unsuitable features then this training will not be fully reversed during the fine-tuning phase and the model could be more likely to overfit to these inappropriate features. Scenarios such as this usually lead to overfitting the idiosyncrasies of the target training set plested2019analysis ; kornblith2021better . A related scenario is explored in kornblith2021better where it is shown that alternative loss functions that improve how well the pretrained features fit the source dataset lead to a reduction in performance on the target dataset. The authors state that “... there exists a trade-off between learning invariant features for the original task and features relevant for transfer tasks.”
In image classification models, features learned in lower layers are more general, and those learned in higher layers are more task specific yosinski2014transferable . It is likely that if fewer layers are transferred, negative transfer should be less prevalent, with training all layers from random initialization being the extreme end of this. There has been limited work to test this; however, it is shown to an extent in abnar2021exploring .
4 Datasets commonly used in transfer learning for image classification
4.1 Source
ImageNet 1K, 5K, 9K, 21K
ImageNet is an image database organized according to the WordNet hierarchy imagenet_cvpr09 . ImageNet 1K or ILSVRC2012 is a well known subset of ImageNet that is used for an annual challenge. ImageNet 1K consists of 1,000 common image classes with at least 1,000 total images in each class for a total of just over 1.3 million images in the training set. ImageNet 5K, 9K and 21K are larger subsets of the full ImageNet dataset containing the most common 5,000, 9,000 and 21,000 image classes respectively. All three ImageNet datasets have been used as both source and target datasets, depending on the type of experiments being performed. They are most commonly used as a source dataset because of their large sizes and general classes.
JFT dataset
JFT is an internal Google dataset for large-scale image classification, which comprises over 300 million high-resolution images hinton2015distilling . Images are annotated with labels from a set of 18,291 categories. For example, 1,165 types of animals and 5,720 types of vehicles are labelled in the dataset sun2017revisiting . There are 375M labels in total and on average each image has 1.26 labels.
Instagram hashtag datasets
Mahajan et al. mahajan2018exploring collected a weakly labelled image dataset with a maximum size of 3.5 billion labelled images from Instagram, over 3,000 times larger than the commonly used large source dataset ImageNet 1K. The hashtags were used as labels for training and evaluation. By varying the selected hashtags and the number of images sampled, a variety of datasets of different sizes and visual distributions were created. One of the datasets created contained 1,500 hashtags that closely matched the 1,000 ImageNet 1K classes.
Places365
Places365 (Places) zhou2017places contains 365 categories of scenes, collected by counting all the entries in the WordNet English dictionary that corresponded to names of scenes, places and environments. The authors included any concrete noun which could reasonably complete the phrase “I am in a place” or “let’s go to the place”. There are two datasets:
• Places365-Standard has 1.8 million training examples in total, with a minimum of 3,068 images per class.
• Places365-Challenge has 8 million training examples.
Places365 is generally used as a source dataset when the target dataset is scene based such as SUN.
iNaturalist
iNaturalist van2018inaturalist consists of 859,000 images from over 5,000 different species of plants and animals. iNaturalist is generally used as a source dataset when the target dataset contains fine-grained plant or animal classes.
4.2 Target
General
General image classification datasets contain a variety of classes with a mixture of superordinate and subordinate classes from many different categories in WordNet miller1995wordnet . ImageNet is a canonical example of a general image classification dataset.
Examples of general image classification datasets commonly used as target datasets are:
• CIFAR-10 and CIFAR-100 krizhevsky2009learning : Each has a total of 50,000 training and 10,000 test examples of 32x32 colour images, from 10 and 100 classes respectively.
• PASCAL VOC 2007 pascal-voc-2007 : Has 20 classes belonging to the superordinate categories of person, animal, vehicle, and indoor objects. It contains 9,963 images with 24,640 annotated objects and a 50/50 train/test split.
• Caltech-101 fei2004learning : Has pictures of objects belonging to 101 categories, with about 40 to 800 images per category and most categories having around 50 images. The size of each image is roughly 300x200 pixels.
• Caltech-256 griffin2007caltech : An extension of Caltech-101 with 256 categories and a minimum of 80 images per category. It includes a large clutter category for testing background rejection.
Fine-grained
Fine-grained image classification datasets contain subordinate classes from one particular superordinate class. Examples are:
• Food-101 (Food) bossard14 : Contains 101 different classes of food with 75,750 training examples and 25,250 test examples.
• Birdsnap (Birds) berg2014birdsnap : Contains 500 different species of birds, with 47,386 training examples and 2,443 test examples.
• Stanford Cars (Cars) KrauseStarkDengFei-Fei_3DRR2013 : Contains 196 different makes and models of cars with 8,144 training examples and 8,041 test examples.
• FGVC Aircraft (Aircraft) maji13fine-grained : Contains 100 different makes and models of aircraft with 6,667 training examples and 3,333 test examples.
• Oxford-IIIT Pets (Pets) parkhi2012cats : Contains 37 different breeds of cats and dogs with 3,680 training examples and 3,369 test examples.
• Oxford 102 Flowers (Flowers) nilsback2008automated : Contains 102 different types of flowers with 2,040 training examples and 6,149 test examples.
• Caltech-UCSD Birds 200 (CUB) wah2011caltech : Contains 200 different species of birds with around 60 training examples per class.
• Stanford Dogs (Dogs) khosla2011novel : Contains 20,580 images of 120 breeds of dogs.
Scenes
Scene datasets contain examples of different indoor and/or outdoor scene settings. Examples are:
• SUN397 (SUN) xiao2010sun : Contains 397 categories of scenes. This dataset preceded Places365 and used the same techniques for data collection. The scenes with at least 100 training examples were included in the final dataset.
• MIT 67 Indoor Scenes quattoni2009recognizing : Contains 67 indoor categories and a total of 15,620 images, with at least 100 images per category.
Others
There are a number of other datasets that have less of an overarching theme and are less related to the common source datasets. These are often used in conjunction with deep transfer learning for image classification to show models and techniques are widely applicable. Examples of these are:
• Describable Textures (DTD) cimpoi2014describing : Consists of 3,760 training examples of texture images with 47 classes of texture adjectives.
• Daimler pedestrian classification munder2006experimental : Contains 23,520 training images with two classes: contains pedestrians and does not contain pedestrians.
• German road signs (GTSRB) stallkamp2012man : Contains 39,209 training images of German road signs in 43 classes.
• Omniglot lake2015human : Contains over 1.2 million training examples of 1,623 different handwritten characters from 50 writing systems.
• SVHN digits in the wild (SVHN) netzer2011reading : Contains 73,257 training examples of labelled digits cropped from Street View images.
• UCF101 Dynamic Images (UCF101) soomro2012ucf101 : Contains 9,537 static frames of 101 classes of actions cropped from action videos.
• Visual Decathlon Challenge (Decathlon) rebuffi2017learning : A challenge designed to simultaneously solve 10 image classification problems: ImageNet, CIFAR-100, Aircraft, Daimler pedestrian, Describable Textures, German traffic signs, Omniglot, SVHN, UCF101 and VGG-Flowers. All images are resized to have a shorter side of 72 pixels.
5 Deep transfer learning progress and areas for improvement
In the past decade, the successes of CNNs on image classification tasks have inspired many researchers to apply them to an increasingly wide range of domains. Model performance is strongly affected by the relationship between the amount of training data and the number of trainable parameters in a model as shown in Figures 1 and 2. As a result there has been ever growing interest in using transfer learning to allow large CNN models to be trained in domains where there is only limited training data available or other constraints exist.
As deep learning gained popularity from 2012 to 2016, the transferability of features and best practices for performing deep transfer learning were explored agrawal2014analyzing ; azizpour2015factors ; huh2016makes ; sharif2014cnn ; yosinski2014transferable . While there are some recent works that have introduced improvements to transfer learning techniques and insights, there are many more that have focused on best practice for either general kornblith2019better ; mahajan2018exploring ; li2020rethinking ; plested2021non or specific raghu2019transfusion ; heker2020joint application domains rather than on techniques. We fully review both.
When reviewing the application of deep transfer learning for image classification we divide applications into categories. We split tasks in two directions: small versus large target datasets, and closely versus loosely related source and target datasets. For example, using ImageNet imagenet_cvpr09 as a source dataset to pretrain a model for classifying tumours in medical images is a loosely related transfer and is likely to involve a small target dataset due to privacy concerns and the scarcity of the disease. This category division aligns with the factors that affect negative transfer outlined in wang2019characterizing ; wang2019transferable .
The distinction between target dataset sizes is useful as it has been shown that small target datasets are much more sensitive to changes in transfer learning hyperparameters plested2019analysis . It has also been shown that standard transfer learning hyperparameters do not perform as well when transferring to a less related target task he2018rethinking ; kornblith2019better ; plested2021non , with negative transfer being an extreme example of this wang2019characterizing ; wang2019transferable , and that the similarity between datasets should be considered when deciding on hyperparameters li2020rethinking ; plested2021non . These distinctions go some way to explaining the conflicting performance of deep transfer learning methods in recent years he2018rethinking ; li2020rethinking ; wan2019towards ; zoph2020rethinking .
We start this section by describing general studies on deep transfer learning techniques, including recent advances. Then we review work in each of the application areas described by our split above. Section 7 summarizes current knowledge and makes final recommendations for future directions of research in the field.
5.1 General deep transfer learning for image classification
Early work on deep transfer learning showed that:
1. Deep transfer learning results in performance comparable to or above the state of the art in many different tasks, particularly when compared to shallow machine learning methods sharif2014cnn ; azizpour2015factors .
2. More pretraining, both in terms of the number of training examples and the number of iterations, tends to result in better performance agrawal2014analyzing ; azizpour2015factors ; huh2016makes .
3. Fine-tuning the weights on the target task tends to result in better performance, particularly when the target dataset is larger and less similar to the source dataset agrawal2014analyzing ; yosinski2014transferable ; azizpour2015factors .
4. Transferring more layers tends to result in better performance when the source and target dataset and task are closely matched, but fewer layers are better when they are less related agrawal2014analyzing ; yosinski2014transferable ; azizpour2015factors ; chu2016best ; abnar2021exploring .
5. Deeper networks result in better performance azizpour2015factors .
It should be noted that all the studies referenced above were completed prior to the advances in residual networks he2016deep and other modern very deep CNNs. It has been argued that residual networks, when combined with fine-tuning, make features more transferable he2016deep . As many of the above studies were carried out within a similar time period, some results have not been combined. For instance, most were done with AlexNet, a relatively shallow network, as a base, and many did not perform fine-tuning and/or simply used the deep neural network as a feature detector at whatever layer it was transferred. It has since been shown that when fine-tuning is used effectively, transferring less than the maximum number of layers can result in better performance. This applies even when the source and target datasets are highly related, particularly with smaller target datasets plested2019analysis ; plested2021non ; abnar2021exploring .
More recently it has been shown that the performance of models on ImageNet 1K correlates well with performance when the pretrained model is transferred to other tasks kornblith2019better . The authors additionally demonstrate that the increase in performance of deep transfer learning over random initialization is highly dependent on both the target dataset size and the relationship between the classes in the source and target datasets. This will be discussed more in the following sections.
5.2 Recent advances
Recent advances in the body of knowledge related to deep transfer learning for image classification can be divided into advances in techniques, and general insights on best practice. We describe advances in transfer learning techniques here and insights on best practice in Section 5.2.4. Recent advances in techniques are divided into regularization, hyperparameter based, matching the source domain to the target domain, and a few others that do not fit the previous categories. We discuss matching the source domain to the target domain under the relevant source versus task domains in Sections 5.3.1 and 5.6.1 and the rest below. In our reviews of recent work we attempt to present a balanced view of the evidence for the improvements offered by newer models compared to prior ones and the limitations of those improvements. However, in some of the more recent cases this is difficult as the original papers provide limited evidence and new work showing the limitations of the methods has not yet been done.
5.2.1 Regularization based technique advances
Most regularization based techniques aim to solve the problem of the unreliable empirical risk minimizer (Section 3.2.2) by restricting the model weights, or the features produced by them, so that they cannot fit small idiosyncrasies in the data. They achieve this by adding a regularization term to the loss function to make it:
$$\tilde{R}_I(h) = \frac{1}{I}\sum_{i=1}^{I} \ell(h(x_i), y_i) + \lambda\,\Omega(\cdot),$$
with the first term being the empirical loss and the second term $\Omega(\cdot)$ being the regularization term. The tuning parameter $\lambda$ balances the trade-off between the two.
Weight regularization directly restricts how much the model weights can move.
Knowledge distillation or feature based regularization uses the distance between the feature maps output from one or more layers of the source and target networks to regularize the model:
$$\Omega(W, W^0, x) = \sum_{j=1}^{N} d\!\left(F_j(W, x), F_j(W^0, x)\right),$$
where $F_j(W, x)$ is the feature map output by the $j$th filter of the target network defined by weights $W$ for input value $x$, $F_j(W^0, x)$ is the corresponding feature map of the source network with pretrained weights $W^0$, and $d(\cdot, \cdot)$ is a measure of dissimilarity between two feature maps.
The success of regularization based techniques for deep transfer learning relies heavily on the assumption that the source and target datasets are closely related. This is required to ensure that the optimal weights or features for the target dataset are not far from those trained on the source dataset.
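To make these two families concrete, the sketch below implements both as extra loss terms during fine-tuning: an L2-SP style penalty that pulls the current weights towards a pretrained snapshot li2018explicit , and an unweighted feature-map distance between the fine-tuned network and a frozen copy of the source network, in the spirit of knowledge distillation methods such as DELTA li2019delta but without its attention weighting. This is our own minimal sketch; the lambda values and the way the snapshot and source features are obtained are placeholders to be adapted per task.

```python
def l2_sp_penalty(model, pretrained_params):
    """L2-SP style weight regularizer: squared distance between the current weights
    and their pretrained starting point, summed over the transferred parameters."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if name in pretrained_params:  # only penalise weights that were transferred
            penalty = penalty + ((param - pretrained_params[name]) ** 2).sum()
    return penalty

def feature_distance_penalty(target_features, source_features):
    """Unweighted feature-map regularizer: mean squared distance between the feature
    maps of the fine-tuned network and the frozen source network for the same input."""
    return ((target_features - source_features) ** 2).mean()

def regularized_loss(empirical_loss, model, pretrained_params,
                     target_features, source_features,
                     lam_weights=0.01, lam_features=0.01):
    # Overall objective: empirical loss + lambda * Omega, as in the equation above.
    return (empirical_loss
            + lam_weights * l2_sp_penalty(model, pretrained_params)
            + lam_features * feature_distance_penalty(target_features, source_features))
```

Here pretrained_params is a snapshot of the weights taken immediately after pretraining, for example {name: p.detach().clone() for name, p in model.named_parameters()}, and source_features are obtained by running a frozen copy of the pretrained network on the same mini-batch.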
There have been many new regularization based techniques introduced in the last three years. We review major new techniques in chronological order.
1. L2-SP li2018explicit ; li2020baseline is a form of weight regularization. The aim of transfer learning is to create models that are regularized by keeping features reasonably close to those trained on a source dataset for which overfitting is not as much of a problem. The authors argue that, because of this, during the target dataset training phase the fine-tuned weights should be decayed towards the pretrained weights, not towards zero. Several regularizers that decay weights towards their starting point, denoted SP regularizers, were tested in the original papers. The L2-SP regularizer, which is the L2 loss between the source weights and the current weights, is shown to significantly outperform the standard L2 loss on the four target datasets shown in the paper with a ResNet-101 model. The original paper showed results for transferring to four small target datasets that were very similar to the two source datasets used for pretraining. It has since been shown that the L2-SP regularizer can result in minimal improvement or even negative transfer when the source and target datasets are less related li2020rethinking ; wan2019towards ; plested2021non ; chen2019catastrophic . More recent work has shown that in some cases using L2-SP regularization for lower layers and L2 regularization for higher layers can improve performance plested2021non .
2. DELTA li2019delta is an example of knowledge distillation or feature map based regularization. It is based on the idea of re-using CNN channels that are not useful to the target task while not changing channels that are useful. Training on the target task is regularized by the attention weighted L2 loss between the final layer feature maps of the source and target models:
$$\Omega(W, W^0, x, y) = \sum_{j=1}^{N} A_j(W^0, x, y)\,\big\lVert F_j(W, x) - F_j(W^0, x) \big\rVert_2^2,$$
where $F_j(W, x)$ is the output from the $j$th filter applied to the input $x$ and $A_j(W^0, x, y)$ is the attention weight assigned to that filter. The attention weights are calculated by removing the model's filters one by one (setting their output weights to 0) and calculating the increase in loss. Filters resulting in a high increase in loss are then given a higher weight for regularization, encouraging them to stay similar to those trained on the source task. Others that are not as useful for the target task are less regularized and can change more. This regularization resulted in performance that was slightly better than L2-SP regularization in most cases with ResNet-101 and Inception-v3 models, ImageNet 1K as the source dataset and a variety of target datasets. The original paper showed state of the art performance for DELTA on Caltech 256-30; however, the authors used mostly the same datasets as the original L2-SP paper li2018explicit and, for the two additional datasets used, they showed that L2-SP outperformed the baseline L2 regularization. It has since been shown that, like L2-SP, DELTA can also hinder performance when the source and target datasets are less similar kou2020stochastic ; chen2019catastrophic ; jeon2020sample .
3. Wan et al. wan2019towards propose decomposing the transfer learning gradient update into the empirical loss and regularization loss gradient vectors. When the angle between the two vectors is greater than 90 degrees, they further decompose the regularization loss gradient vector into the portion perpendicular to the empirical loss gradient and the remaining portion in the opposite direction to the empirical loss gradient. They remove the latter term, in the hope that not allowing the regularization term to move the weights in the opposite direction to the empirical loss term will prevent negative transfer. They show that their proposal improves performance slightly with a ResNet-18 on four different datasets. However, their results are poor compared to the state of the art as they do not test on modern very deep models. For this reason it is difficult to judge how well their regularization method performs in general.
4. Batch spectral shrinkage (BSS) chen2019catastrophic introduces a loss penalty applied to the smaller singular values of the channel-wise features in each batch update during fine-tuning so that untransferable spectral components are suppressed (a minimal sketch of the BSS penalty is given after this list). The method is tested using a ResNet-50 pretrained on ImageNet 1K and fine-tuned on a range of different target datasets. The results show that the method never hurts performance on the given datasets and often produces significant performance gains over L2, L2-SP and DELTA regularization for smaller target datasets. They also show that BSS can improve performance for less similar target datasets where L2-SP hinders performance.
5. Sample-based regularization jeon2020sample proposes regularization using the distance between feature maps of pairs of inputs in the same class, as well as weight regularization. The model was tested using a ResNet-50 and transferring from ImageNet 1K and Places365 to a number of different fine-grained classification tasks. The authors report an improvement over L2-SP, DELTA and BSS in all tests. Their results reconfirm that BSS performs better than DELTA and L2-SP in most cases and that in some cases DELTA and L2-SP decrease performance compared to the standard L2 regularization baseline.
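As referenced in item 4 above, here is a minimal sketch of the batch spectral shrinkage penalty under our reading of chen2019catastrophic : the singular values of the mini-batch feature matrix are computed and the k smallest are penalised, suppressing the spectral components that transfer poorly. The penalty weight and k are hyperparameters; this is an illustration, not the authors' reference implementation.

```python
import torch

def bss_penalty(features, k=1):
    """Batch spectral shrinkage: sum of squares of the k smallest singular values of
    the (batch_size x feature_dim) feature matrix produced by the backbone."""
    singular_values = torch.linalg.svdvals(features)  # returned in descending order
    return (singular_values[-k:] ** 2).sum()

# Usage during fine-tuning (the 0.001 weight is a placeholder to be tuned):
# loss = criterion(logits, labels) + 0.001 * bss_penalty(penultimate_features, k=1)
```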
5.2.2 Normalization based technique advances
Further to regularization based methods, there are several recent techniques that attempt to better align fine-tuning in the target domain with the source domain. This is achieved by making adjustments to the standard batch normalization or other forms of normalization that are used between layers in modern CNNs.
1. Sharing batch normalization hyperparameters across the source and target domains has been shown to be more effective than having separate ones across many domain adaptation tasks wang2019transferable ; maria2017autodial . Wang et al. wang2019transferable introduce an additional, domain adaptive batch normalization parameter. This takes standard batch normalization, with the scale and shift parameters $\gamma$ and $\beta$ shared across the source and target domains, and scales them based on the transferability value of each channel, calculated using the mean and variance statistics prior to normalization. As far as we are aware these techniques have not been applied to the general supervised transfer learning case.
2. Stochastic normalization kou2020stochastic samples batch normalization based on either mini-batch statistics or moving statistics for each filter, with probability hyperparameter $p$ (a minimal sketch is given after this list). At the start of fine-tuning on the target dataset the moving statistics are initialised with those calculated during pretraining in order to act as a regularizer. This is designed to overcome the problems of small batch sizes resulting in noisy batch statistics and of the collapse in training associated with using moving statistics to normalize all feature maps ioffe2017batch ; ioffe2015batch . The authors’ results show that their method improves over BSS, DELTA and L2-SP for low-sampling versions of three standard target datasets and improves over all but BSS for larger versions of the same datasets. Their results again show that BSS performs better than DELTA and L2-SP in most cases and that in many cases DELTA and L2-SP decrease performance compared to the standard L2 regularization baseline.
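Below is a minimal sketch of the stochastic choice at the core of stochastic normalization, under our reading of kou2020stochastic : each training forward pass normalizes either with mini-batch statistics or with the moving statistics (which would be initialised from the pretrained model), selected with probability p. The original method makes this choice per channel and includes further details that are simplified away here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StochasticNorm2d(nn.BatchNorm2d):
    """With probability p, normalize using the moving statistics; otherwise use the
    standard mini-batch statistics. At evaluation time it behaves like BatchNorm2d."""

    def __init__(self, num_features, p=0.5, **kwargs):
        super().__init__(num_features, **kwargs)
        self.p = p

    def forward(self, x):
        if self.training and torch.rand(1).item() < self.p:
            # Normalize with the moving statistics (initialised from pretraining),
            # which acts as a regularizer towards the source statistics.
            return F.batch_norm(x, self.running_mean, self.running_var,
                                self.weight, self.bias, training=False, eps=self.eps)
        return super().forward(x)  # standard BN with mini-batch statistics
```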
5.2.3 Other recent new techniques
Guo et al. guo2019spottune make two copies of their ResNet models pretrained on ImageNet 1K. One model is used as a fixed feature selector with the pretrained layers frozen and the other model is fine-tuned. They reinitialize the final classification layer in both. A policy net trained with reinforcement learning is then used to create a mask to combine layers from each model together in a unique way for each target example. They show that their SpotTune model improves performance compared to fine-tuning with an equivalent size single model (double the size of the two individual models within the SpotTune architecture) and achieves close to or better than state of the art in most cases. MultiTune simplifies SpotTune by removing the policy network and concatenating the features from each model prior to the final classification layer rather than selecting layers. It also improves on SpotTune by using two different non-binary fine-tuning hyperparameter settings plested2021non rather than one fine-tuned and one frozen model. The results show that MultiTune improves or equals accuracy compared to SpotTune in most cases tested, with significantly less training time.
Co-tuning for transfer learning you2020co uses a probabilistic mapping between the hard labels of the source dataset and the labels of the target dataset. This mapping allows the final (source) classification layer of a ResNet-50 to be kept and trained using both the target data and soft labels derived from the source dataset. As with many other recent results, they show that their algorithm improves on all others, including BSS, DELTA and L2-SP, but their results are significantly below state of the art for identical model sizes and source and target datasets. They do show the same ordering across the target datasets: BSS improves on DELTA, which improves on L2-SP.
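A minimal sketch of a co-tuning style objective, under our reading of you2020co , is given below: the retained source head is trained against the soft source-label distribution associated with each target label, alongside the usual target cross-entropy. The mapping tensor target_to_source is assumed to have been estimated beforehand, as in the original work; the names and the loss weight are placeholders.

```python
import torch.nn.functional as F

def co_tuning_loss(target_logits, source_logits, target_labels,
                   target_to_source, lam=1.0):
    """target_logits: (B, C_target) from the new target head.
    source_logits: (B, C_source) from the retained pretrained source head.
    target_to_source: (C_target, C_source), each row a probability distribution
    mapping a target class to soft source labels (estimated beforehand)."""
    hard_loss = F.cross_entropy(target_logits, target_labels)
    soft_source_labels = target_to_source[target_labels]             # (B, C_source)
    log_probs = F.log_softmax(source_logits, dim=1)
    soft_loss = -(soft_source_labels * log_probs).sum(dim=1).mean()  # soft cross-entropy
    return hard_loss + lam * soft_loss
```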
5.2.4 Insights on best practice
Further to advances in techniques and models, there has been a large body of recent research that extends the early work on best practice for deep transfer learning for image classification described in Section 5.1. These studies give insights on the following decisions that need to be made when performing deep transfer learning for image classification:
• Selecting the best model for the task. Models that perform better on ImageNet were found to perform better on a range of target datasets in kornblith2019better , however this effect eventually saturates abnar2021exploring . Given a set of models with similar accuracy on a source task, the best model for target tasks can vary between target datasets abnar2021exploring .
• Choosing the best data for pretraining. In many cases pretraining with smaller, more closely related source datasets was found to produce better results on target datasets than pretraining with larger, less closely related source datasets mensink2021factors ; ngiam2018domain ; mahajan2018exploring ; cui2016fine ; cui2018large ; puigcerver2020scalable . For best results the source dataset should include the image domain of the target dataset mensink2021factors . For example, ImageNet 1K contains more classes of pets than Oxford Pets, making them an ideal source and target dataset combination. The various measures of similarity used to define closely related are outlined in Section 5.3.1.
• Finding the best hyperparameters for fine-tuning. Several studies include extensive hyperparameter searches over learning rate, learning rate decay, weight decay, and momentum kornblith2019better ; li2018explicit ; mahajan2018exploring ; li2020rethinking ; plested2021non . These studies show the relationship between the size of the target dataset and its similarity to the source dataset on the one hand, and the fine-tuning hyperparameter settings on the other. The optimal learning rate and momentum are both shown to be lower for more related source and target datasets li2020rethinking ; plested2021non . The number of layers to reinitialise from random weights is also strongly related to the optimal learning rate neyshabur2020being ; plested2021non (a minimal sketch of setting layer-wise learning rates is given after this list).
• Whether a multi-step transfer process is better than a single-step process. A multi-step pretraining process, where the intermediate dataset is smaller and more closely related to the target dataset, often outperforms a single-step pretraining process when originating from a very different, large source dataset mensink2021factors ; ng2015deep ; puigcerver2020scalable ; gonthier2020analysis . Related to this, using a self-supervised learning technique for pretraining on a more closely related source dataset can outperform using a supervised learning technique on a less closely related dataset zoph2020rethinking .
• Which type of regularization to use. L2-SP and other more recent transfer learning specific regularization techniques like DELTA, BSS and stochastic normalization improve performance when the source and target datasets are closely related, but often hinder it when they are less related li2018explicit ; li2019delta ; wan2019towards ; plested2021non . These regularization techniques are discussed in more detail in Section 5.2.1.
• Which loss function to use. Alternatives to the cross-entropy loss function are shown in kornblith2021better to produce representations with higher class separation that obtain higher accuracy on the source task but are less useful for target tasks. The results show a trade-off between learning features that perform better on the source task and features relevant for the target task.
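Several of the findings above, in particular that the optimal learning rate depends on how many layers are reinitialised and on how related the datasets are, are commonly implemented with per-group learning rates during fine-tuning. The sketch below is our own illustration of that recipe, not a recommendation from the cited studies; the specific learning rate values, the dataset and the class count are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained backbone plus a reinitialised head for a hypothetical 102-class target task.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, 102)  # e.g. a flowers dataset; placeholder

head_params = list(model.fc.parameters())
head_param_ids = {id(p) for p in head_params}
backbone_params = [p for p in model.parameters() if id(p) not in head_param_ids]

# Pretrained weights move slowly; the randomly initialised head moves faster.
optimizer = torch.optim.SGD([
    {"params": backbone_params, "lr": 1e-3},
    {"params": head_params, "lr": 1e-2},
], momentum=0.9, weight_decay=1e-4)
```

Lower backbone learning rates (or freezing the backbone entirely) are the natural choice when the source and target datasets are closely related, while larger values are typically needed when they are less related, consistent with the studies cited above.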
In an attempt to generalize hyperparameters and protocols when pretraining with source datasets that are larger than ImageNet 1K, Kolesnikov et al. created Big Transfer (BiT) kolesnikov2019big . They pretrain various sizes of ResNet on ImageNet 1K, ImageNet 21K and JFT and transfer them to four small to medium, closely related image classification target datasets as well as the COCO-2017 object detection dataset lin2014microsoft . Based on these experiments they make a number of general claims about deep transfer learning when pretraining on very large datasets, including:
1. Batch normalization (BN) ioffe2015batch is detrimental to BiT, and Group Normalization wu2018group combined with Weight Standardization performs well with large batches.
2. MixUp zhang2017mixup is not useful for pretraining on large source datasets and is only useful during fine-tuning for mid-sized target datasets (20-500K training examples).
3. Regularization (L2, L2-SP, dropout) does not enhance performance in the fine-tuning phase, even with very large models (the largest model used for the experiments has 928 million parameters). Adjusting the training time and learning rate decay schedule based on the size of the target dataset, longer for larger datasets, provides sufficient regularization.
The authors use general fine-tuning hyperparameters for learning rate scheduling, training time and the amount/usage of MixUp that are adjusted only based on the target dataset size, not for individual target datasets. They achieve performance that is comparable to models with selectively tuned hyperparameters for their model pretrained on ImageNet, and state of the art, or close to it in many cases, for their model pretrained on the 300 times larger source dataset JFT. However, their target datasets, ImageNet, CIFAR-10 and CIFAR-100, and Pets, are very closely related to their source datasets, making them easier to transfer to. Their final target dataset, Flowers, is also known to be well suited as a transfer target for their source datasets. See Section 5.6 for further discussion of which target datasets are easier to transfer to.
We expect that best practice recommendations developed for closely related datasets will not be applicable to less closely related target datasets as has been shown for many other methods and recommendations li2018explicit ; li2019delta ; wan2019towards ; plested2021non ; li2020rethinking ; chen2019catastrophic ; kou2020stochastic . To test this hypothesis we reran a selection of the experiments in BiT using Stanford Cars as the target dataset which is very different from the source dataset ImageNet 21K and known to be more difficult to transfer to kornblith2019better ; plested2021non . We first confirmed that we could reproduce their state of the art results for the datasets listed in the paper, then produced the results in Table 1 using Stanford Cars. These results show that BiT produces far below state of the art results for this less related dataset. The first column shows the results with all the recommended hyperparameters from the paper. While the performance can be improved with increases in learning rates and number of epochs before the learning rate is decayed, final results are still well below state of the art for a comparable model, source and target dataset. The fine grained classification task in Stanford Cars is known to be less similar to the more general ImageNet and JFT datasets. Because of this it is not surprising that recommendations developed for more closely related target datasets do not apply.
Table 1: Accuracy (%) on Stanford Cars using BiT with the recommended learning rate (0.003) and higher learning rates, each with the default learning rate decay schedule and with the number of epochs before decay doubled (x2).

Dataset | lr 0.003 (default), default decay | lr 0.003, x2 decay | lr 0.01, default decay | lr 0.01, x2 decay | lr 0.03, default decay | lr 0.03, x2 decay | lr 0.1, default decay | lr 0.1, x2 decay | State of the art
---|---|---|---|---|---|---|---|---|---
Cars | 86.20 | 86.15 | 85.81 | 87.49 | 81.41 | 88.96 | 27.51 | 5.22 | 95.3 ngiam2018domain
5.2.5 Insights on transferability
Here we review works that give more general insight as to what is happening with model weights, representations and the loss landscape when transfer learning is performed as well as measures of transferability of pretrained weights to target tasks.
Several methods for analysing the feature space were used in neyshabur2020being . They found that models trained from pretrained weights make similar mistakes on the target domain, have similar features and are surprisingly close in distance in the parameter space. They are in the same basins of the loss landscape. Models trained from random initialization do not live in the same basin, make different mistakes, have different features and are farther away in distance in the parameter space.
A flatter and easier to navigate loss landscape for pretrained models compared to their randomly initialized counterparts was also shown in liu2019towards . They showed improved Lipschitzness and that this accelerates and stabilizes training substantially. In particular, in pretrained weight matrices the singular vectors of the weight gradient associated with large singular values are shrunk, so the magnitude of the gradient back-propagated through a pretrained layer is controlled. Pretrained weight matrices thus stabilize the magnitude of the gradient, especially in lower layers, leading to more stable training.
Several recent techniques have been proposed for measuring the transferability of pretrained weights:
-
1.
H-score bao2019information is a measure of how well a pretrained representation is likely to perform on a new task with input space $\mathcal{X}$ and output space $\mathcal{Y}$, based on the inter-class covariance and the feature redundancy. For a feature function $f$ it can be written as $\mathcal{H}(f) = \operatorname{tr}\left(\operatorname{cov}(f(X))^{-1}\,\operatorname{cov}\left(\mathbb{E}[f(X)\mid Y]\right)\right)$, so the H-score increases as the inter-class covariance increases and the feature redundancy decreases. The authors show that H-score has a strong correlation with target task performance. They also show that it can be used to rank transferability and create minimum spanning trees of task transferability. The latter may be useful in guiding multi-step transfer learning for less related tasks as discussed in Section 5.2.4.
-
2.
Transferability and negative conditional entropy (NCE) for transfer learning tasks where the source and target datasets are the same, but the tasks differ, are defined in tran2019transferability . The authors define transferability as the log-likelihood $l(\theta_S, w_T)$ of the target labels, where $\theta_S$ is the weights of the model backbone pretrained on the source task and $w_T$ is the weights of the classifier trained on the target task. They then define negative conditional entropy (NCE) as another measure of transferability, being the negative empirical conditional entropy of a label from the target domain given a label from the source domain. To empirically demonstrate the effectiveness of the NCE measure a ResNet18 backbone was paired with an SVM classifier. NCE was demonstrated to have strong correlation with accuracy on the target tasks over combinations of 437 source and target tasks.
-
3.
LEEP nguyen2020leep is another measure of transferability. Using the pretrained model, the joint distribution over source labels and target labels is estimated on the target data and used to construct an empirical predictor. LEEP is the log expectation of the empirical predictor:
$$\mathrm{LEEP}(\theta, \mathcal{D}) = \frac{1}{n}\sum_{i=1}^{n}\log\left(\sum_{z \in \mathcal{Z}} \hat{P}(y_i \mid z)\,\theta(x_i)_z\right),$$
where $\theta(x_i)_z$ is the probability of source label $z$ for target input $x_i$ predicted using the pretrained model $\theta$, and $\hat{P}(y \mid z)$ is the empirical conditional probability of target label $y$ given source label $z$. LEEP is shown to have good theoretical properties and empirically it is demonstrated to have strong correlation with the performance gain from pretraining weights on the source tasks. This is shown for source tasks ImageNet 1K and CIFAR10 and 200 random target tasks taken from the closely related CIFAR100 and less closely related FashionMNIST. The authors expand NCE to the case where the source and target datasets are different by creating dummy labels for the target data based on the source task using the pretrained model $\theta$. They show that LEEP has a stronger correlation with performance gain than the expanded NCE measure and H-score.
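To make the LEEP computation concrete, the following is a minimal NumPy sketch that follows the formula above. The names `source_probs` (softmax outputs of the pretrained source model over the target inputs) and `target_labels` (integer target labels) are illustrative assumptions, not part of any published implementation.

```python
import numpy as np

def leep(source_probs: np.ndarray, target_labels: np.ndarray) -> float:
    """Compute the LEEP score from a pretrained source model's predictions.

    source_probs: (n, |Z|) array, row i holds theta(x_i), the predicted
                  distribution over source labels for target input x_i.
    target_labels: (n,) array of integer target labels y_i.
    """
    n, num_source = source_probs.shape
    num_target = int(target_labels.max()) + 1

    # Empirical joint distribution P_hat(y, z) = (1/n) sum_i theta(x_i)_z * 1[y_i = y]
    joint = np.zeros((num_target, num_source))
    for y in range(num_target):
        joint[y] = source_probs[target_labels == y].sum(axis=0) / n

    # Empirical conditional P_hat(y | z) = P_hat(y, z) / P_hat(z)
    marginal_z = joint.sum(axis=0, keepdims=True)
    conditional = joint / np.clip(marginal_z, 1e-12, None)

    # LEEP = (1/n) sum_i log( sum_z P_hat(y_i | z) * theta(x_i)_z )
    eep = (conditional[target_labels] * source_probs).sum(axis=1)
    return float(np.log(np.clip(eep, 1e-12, None)).mean())
```

In practice `source_probs` could be obtained by running an ImageNet-pretrained classifier in evaluation mode over the target training set.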
5.3 Large closely related target datasets
An early, systematic and extremely thorough treatment of deep transfer learning for large closely related image datasets was performed by Yosinski et al. yosinski2014transferable . They did two experiments using AlexNet krizhevsky2012imagenet . One split ImageNet 1K imagenet_cvpr09 into two randomly chosen sets of 500 object classes. The other split ImageNet into two sets of classes that were as different as possible to make up the source and target dataset; these two sets were natural object classes and man-made object classes. In both experiments they used the full 1,000 plus training examples per class, so both are examples of a very large target dataset. After creating the source and target datasets they pretrained an AlexNet krizhevsky2012imagenet on the source dataset, then transferred varying numbers of layers of weights to the target dataset, reinitializing the remaining layers with small random weights. They then trained either the entire model or just the reinitialised layers, applying the standard training hyperparameters of the day to fine-tuning: a learning rate of 0.1 decayed by a factor of 0.1 every 30 epochs.
Yosinski et al. yosinski2014transferable concluded that weights from the lower layers of a CNN trained on an image task are more general and easily transferable between tasks, while the upper layers are more task specific. They showed that fine-tuning is important in transfer learning to allow fragile co-adapted features from the middle layers to retrain together for the new task. Finally, they demonstrated that for the more related source and target datasets (the randomly chosen classes), the more layers transferred the better the final performance, whereas for the less related datasets the upper layers were less transferable. It should be noted that experiments with fine-tuning were not done for the less related source and target datasets; the only experiments performed froze the transferred weights to show their raw transferability.
The finding that transferring more layers results in better performance yosinski2014transferable has heavily influenced transfer learning practice. It is important to note that their results were only shown to hold for large and closely related target datasets and their specific fine-tuning hyperparameters. In fact they demonstrated that higher layers are less transferable when the source and target datasets are less related. Despite this, their results and training hyperparameters have been used as a template to inform transfer learning procedures on a wide variety of datasets and tasks he2018rethinking ; scott2018adapted ; wu2018facial ; tan2019efficientnet ; mormont2018comparison ; kornblith2019better .
The original experiments performed by Yosinski et al. on closely related datasets (the randomly chosen classes from ImageNet 1K) have since been repeated while varying the size of the target dataset and fine-tuning hyperparameters plested2019analysis . These experiments established that transferring more layers did not generally result in better performance when optimal fine-tuning hyperparameters were used. They also demonstrated that transferring the correct number of layers and using optimal fine-tuning hyperparameters had a much larger impact on performance as the size of the target dataset was reduced. Recent work plested2021non has expanded these experiments to more current and deeper models that are known to perform better for transfer learning he2016deep and to less closely related source and target datasets. This work has confirmed that for some datasets transferring more layers is better. However, for others, reinitialising multiple layers, or even whole blocks of layers, of an Inception v4 model with random weights improves performance over the baseline of transferring all but the final classification layer.
5.3.1 More versus better matched pretraining data
Recent works have tested the limits of the effectiveness of transfer learning with large source and target datasets by pretraining on datasets that are 6 times he2017mask , 300 times sun2017revisiting ; ngiam2018domain ; xie2020self , and even 3,000 times mahajan2018exploring larger than ImageNet 1K, and target datasets that are up to 9 times larger mahajan2018exploring .
In general, increasing the size of the source dataset increases performance on the target dataset. This occurs even when the target dataset is large, such as ImageNet 1K (1K classes, 1.3M training images), ImageNet 5K (5K classes, 6.6M training images) or ImageNet 9K (9K classes, 10.5M training images) ngiam2018domain ; mahajan2018exploring ; kolesnikov2019big . However, source data that is carefully curated to match the target data more closely can perform better than pretraining on larger, more general source datasets ngiam2018domain ; mahajan2018exploring .
Early work in huh2016makes showed that additional pretraining data is useful only if it is well correlated to the target task. In some cases adding additional unrelated training data can actually hinder performance.
In more recent work a ResNeXt-101 32×16d was pretrained on various large Instagram source datasets mahajan2018exploring . When ImageNet 1K was the target dataset, the model pretrained on a source dataset of 960M images with hashtags that most closely matched the ImageNet 1K classes performed as well as the model pretrained on 3.5B images with 17K different hashtags. However, pretraining with the smaller source dataset produced significantly worse performance when the target dataset was ImageNet 5K or 9K, or the more task specific Caltech Birds wah2011caltech and Places365 zhou2017places .
A similar result was found in ngiam2018domain . Pretraining using data from subgroups of classes that closely matched the target classes consistently achieved better performance than using the entire JFT dataset with 300 million images and 18,291 classes. Similar performance was achieved using their technique of reweighting classes during source pretraining so that the class distribution statistics more closely matched the target class distribution. Interestingly though, applying the same technique to pretraining with the much smaller ImageNet 1K dataset resulted in minimal performance gains in most cases and a significant decrease in performance for one target dataset. This suggests a minimum threshold on the number of source training examples above which better matched training data is better than more training data. In contradiction to mahajan2018exploring , these results showed that pretraining with the entire JFT dataset resulted in worse performance than pretraining with just ImageNet 1K for most target tasks. In some cases performance was worse with pretraining on the entire JFT dataset than with initialising with random weights and no pretraining. This may indicate issues with the suitability of JFT, where image labels are not mutually exclusive and each image has on average 1.26 labels, as a source task for a mutually exclusive target classification task.
With a similar set up to ngiam2018domain , Puigcerver et al. puigcerver2020scalable produced a large number of different “expert” models by starting from one model pretrained on all of JFT then fine-tuning copies of this model on different subclasses. Performance proxies such as K nearest neighbours were then used to select the relevant expert for each target task. They found that selecting the best expert pretrained on the whole of JFT then fine-tuned on just a subset of classes resulted in better performance than just pretraining on the whole of JFT. These results are consistent with those from ng2015deep outlined in Section 5.7.2 in showing that a multi-stage fine tuning pipeline with an intermediate dataset that is more closely matched to the target dataset can produce better performance than transferring straight from a larger, less related source dataset.
Sun et al. sun2017revisiting found that pretraining a ResNet 101 he2016deep with the 300 million images from the full JFT dataset resulted in significantly better performance on ImageNet 1K classification compared to random initialisation. When the target task was object detection on the COCO-2017 dataset lin2014microsoft , pretraining with the full JFT dataset or JFT plus ImageNet 1K produced a much larger increase in performance than pretraining with only ImageNet 1K as the source dataset. Both sets of results seem to indicate that the large ResNet 101 model generally overfits to the ImageNet 1K classification task when trained from random initialization, despite the dataset having over 1 million labelled training examples. They found that larger ResNet models produced greater performance gains than smaller models on the COCO object detection task, with ResNet 152 outperforming both ResNet 101 and ResNet 50. Given that the difference between the mAP@[0.5,0.95] of ResNet 101 and ResNet 152 was not significant when both were pretrained on ImageNet 1K, but was significant when they were pretrained on the 300 times larger JFT dataset, this again indicates that larger models may overfit to ImageNet 1K.
Yalniz et al. yalniz2019billion created a multi-step semi-supervised training procedure that consisted of:
-
1.
training a model on ImageNet 1K
-
2.
using that model to label a much larger dataset of up to one billion social media images with hashtags related to the ImageNet 1K classes.
-
3.
using these weak self-labelled examples to train a new model
-
4.
finally, fine-tuning the new model with the ImageNet 1K training set.
They showed a significant increase in ImageNet 1K performance across a range of smaller ResNet architectures.
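The procedure can be summarised in a short schematic sketch. All callables below (train, predict_labels, select_most_confident, finetune) are placeholders standing in for ordinary supervised training and inference code, not an actual API, and the architecture names are illustrative.

```python
from typing import Callable

def billion_scale_semi_supervised(
    train: Callable,                 # train(arch, dataset) -> model
    predict_labels: Callable,        # predict_labels(model, images) -> labels
    select_most_confident: Callable, # keep the most confidently labelled images per class
    finetune: Callable,              # finetune(model, dataset) -> model
    labelled_imagenet,
    unlabelled_images,
    student_arch: str = "resnet50",
):
    """Schematic sketch of the multi-step semi-supervised procedure described above."""
    # Step 1: train a teacher model on the labelled ImageNet 1K training set.
    teacher = train("resnet50", labelled_imagenet)

    # Step 2: use the teacher to weakly (self-)label a much larger unlabelled
    # collection, e.g. social media images with hashtags related to the
    # ImageNet 1K classes, keeping only the most confident examples per class.
    weak_labels = predict_labels(teacher, unlabelled_images)
    weakly_labelled = select_most_confident(unlabelled_images, weak_labels)

    # Step 3: train a new student model on the weakly self-labelled examples.
    student = train(student_arch, weakly_labelled)

    # Step 4: fine-tune the student on the original labelled ImageNet 1K training set.
    return finetune(student, labelled_imagenet)
```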
Xie et al. xie2020self extended the training procedure outlined above yalniz2019billion to iterate between training a large EfficientNet tan2019efficientnet model on ImageNet 1K, then using that model to label the 300 million images from the full JFT dataset, and training the model using both the weak self-labelled images from JFT and real labelled images from ImageNet 1K. This resulted in state of the art performance on ImageNet 1K. They also found that testing this model on more difficult ImageNet test sets, namely ImageNet-A (particularly difficult ImageNet 1K test examples) hendrycks2021natural as well as ImageNet-C and ImageNet-P (test images with corruptions and perturbations such as blurring, fogging, rotation and scaling) hendrycks2019benchmarking , resulted in large increases in state of the art performance. State of the art accuracy was doubled on two out of the three more difficult ImageNet test sets.
5.3.2 Similarity measures for matching source and target datasets
In light of the findings in the previous section, the question of how to measure similarity between source and target datasets is important. In some cases source and target domain similarity can be effectively estimated through human intuition and/or similarity between class labels mahajan2018exploring ; ngiam2018domain ; puigcerver2020scalable . Beyond this, there are many methods that use calculations of domain similarity to find the best source data for pretraining:
-
•
Cui et al. cui2018large showed that performance on the target dataset increases as a measure of similarity between the source and target domains based on the Earth Mover's Distance increases.
-
•
Domain adaptive transfer learning (DATL) ngiam2018domain uses importance weights based on the ratio $P_T(y)/P_S(y)$ to reweight classes during source pretraining so that the class distribution statistics match the target statistics, where $P_T(y)$ and $P_S(y)$ describe the distribution of labels in the target and source datasets respectively. They show that using adaptive transfer, or a subset of the source dataset that more closely matches the target dataset, improves performance over using the entire unweighted 300 million JFT training examples. A minimal sketch of this reweighting is given after this list.
-
•
Puigcerver et al. puigcerver2020scalable train a large selection of expert models on particular subsets of JFT source data based on class labels. They then determine the best model to use for each prediction based on performance proxies such as k nearest neighbours. This technique was shown to increase performance on several target datasets compared to training on the whole JFT dataset.
-
•
Reinforcement learning has been used to learn a weight for each class in the source dataset so as to rely more heavily on more effective training examples during pretraining zhu2019learning , and to value individual training samples for a particular task so as to rely more heavily on more valuable examples during training yoon2020data .
-
•
Ge et al. ge2017borrowing create histograms for images from the source and target domain using the activation maps from the first two layers of filters in a CNN. They then use nearest neighbour ranking to find the images in the source dataset that are most similar to those in the target dataset. Samples in the source domain are then weighted based on their KL-divergence from a given target sample.
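As a concrete illustration of the class-reweighting idea used by DATL above, the sketch below computes per-class importance weights $P_T(y)/P_S(y)$ and applies them as a weighted cross-entropy during source pretraining. This is a minimal PyTorch sketch rather than the authors' implementation; in practice the target label distribution over the source label space would typically be estimated by running a source-trained classifier over the (possibly unlabelled) target data.

```python
import numpy as np
import torch
import torch.nn.functional as F

def class_importance_weights(source_label_dist, target_label_dist, eps=1e-12):
    """Per-class importance weights P_T(y) / P_S(y) over the shared source label space."""
    p_s = np.asarray(source_label_dist, dtype=float)
    p_t = np.asarray(target_label_dist, dtype=float)
    return p_t / (p_s + eps)

def reweighted_pretraining_loss(logits, labels, class_weights):
    """Cross-entropy over a batch of source examples, with each example weighted by
    the importance weight of its class, so the effective source label distribution
    more closely matches the target one."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    w = torch.as_tensor(class_weights, dtype=per_example.dtype, device=per_example.device)
    return (w[labels] * per_example).mean()
```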
5.4 Large target datasets with less similar tasks
There are very few large image datasets that are not closely related to the image classification tasks that are usually used for pretraining. Places365, with 1.8 million training images, was used as a target task in mahajan2018exploring . It was shown that for less related source and target datasets, the bigger and more diverse the source training dataset, the better the results on the target dataset.
The largest object detection dataset is COCO lin2014microsoft with 135,000 training images across 80 classes. There have been mixed results on whether pretraining with ImageNet 1K and other common source datasets improves object detection performance on the COCO dataset. These results are discussed in detail in Section 5.7.5.
5.5 Smaller target datasets
It is often not possible to train deep neural networks on small datasets from random initialization mazurowski2019deep ; mormont2018comparison ; kraus2017automated ; heker2020joint . This means that transfer learning becomes more heavily relied on as the size of the target dataset decreases. It has also been shown that transfer learning hyperparameters have a much greater impact on performance as the size of the target dataset reduces plested2019analysis . As the size of the target dataset decreases, two competing factors affect transfer learning performance:
-
1.
The empirical risk estimate becomes less reliable, as detailed in Section 3.2.2, making overfitting idiosyncrasies in the target dataset more likely.
-
2.
The pretrained weights implicitly regularize the fine-tuned model and the final weights do not move far from their pretrained values neyshabur2020being ; liu2019towards ; raghu2019transfusion .
The outcome of Point 1 is an increasing need to use transfer learning and other methods to reduce overfitting. The implicit regularization noted in Point 2 can have a positive impact in reducing overfitting the empirical risk estimate as per Point 1. It can also have a negative impact (negative transfer) if the weights transferred from the source dataset are far from optimal for the target dataset. When the weights, and thus the features produced, are restricted to being far from optimal the negative effect on performance can be compounded by Point 1 plested2019analysis .
5.6 Smaller target datasets with similar tasks
The most well known study on transfer learning yosinski2014transferable was performed with an AlexNet krizhevsky2012imagenet on strongly related and very large source and target datasets. The same experiment was repeated in plested2019analysis , but included a range of smaller target dataset sizes. They demonstrated a significant improvement in performance using more optimal transfer learning hyperparameters compared to common hyperparameters from yosinski2014transferable . This improvement increased considerably as the size of the dataset decreased. By using optimal rather than commonly used hyperparameters, the average accuracy increased from 20.86% to 30.12% for the smallest target dataset of just 10 examples for each of the 500 classes. The commonly used practice of transferring all layers except the final classification layer was shown not to be optimal in any of the experiments.
When pretraining CNN models on ImageNet 1K imagenet_cvpr09 and transferring to significantly smaller target datasets, the improvement of deep transfer learning over random initialization correlates positively with how closely the target dataset relates to the source dataset, and negatively with the size of the target dataset. Both correlations are seen strongly in kornblith2019better . The positive correlation of performance improvement using transfer learning with the similarity between source and target datasets is most obvious. The authors state that on Stanford Cars and FGVC Aircraft the improvement was unexpectedly small. The improvement over equivalent models trained from scratch is marginal for these two datasets, at 0.6% and 0.2%. Stanford Cars and FGVC Aircraft both contain fine-grained makes and models of cars and aircraft respectively, whereas ImageNet 1K contains no makes and models, just a few coarse categories for each. This means that the similarity between the source and target datasets for these two tasks is much lower than, for example, for the more general CIFAR-10/100 or Oxford-IIIT Pets, for which there are actually slightly more fine-grained classes in the source than in the target dataset.
The improvement in using fine-tuning over fixed pretrained features for Stanford Cars and FGVC Aircraft is significantly larger, at 25.4% and 25.7%, than for any of the 10 other datasets used in the experiments. This indicates that the pretrained features are not well suited to the task. For comparison, at the other end of the similarity scale, Oxford-IIIT Pets shows one of the largest improvements from 83.2% accuracy for training from random initialisation to 94.5% accuracy for fine tuning a pretrained model. For this dataset the increase in performance using fine-tuning compared to fixed pretrained weights is marginal at 1.1%. This shows that the pretrained features are well suited to the task without fine-tuning. This difference in performance is likely compounded by the fact that Oxford-IIIT Pets is around half the size of both Stanford Cars and FGVC Aircraft. More recent work has shown that the difference between performance on the target task using fixed pretrained weights versus fine-tuning with even very poor hyperparameters can be used to predict transfer learning hyperparameters for a given source and target task and model plested2021non .
The negative correlation between the size of the target dataset and the improvement over the baselines in kornblith2019better can be seen clearly when the increase in performance is compared to the target training set size, as presented in Table 2. Another obvious correlation is that target datasets with lower baseline accuracy figures increase by a larger amount, as there is more room for improvement. If we remove all the datasets where the baseline performance was above 88%, which would likely limit the performance increase, the increases are in close to reverse order of target dataset size. This highlights the negative correlation between target dataset size and transfer learning performance. The only discrepancies in the ordering come from general and scene tasks getting larger performance increases than fine-grained and texture tasks (Tables 3 and 4). We would expect the former to be more closely related to the ImageNet 1K source dataset than the latter.
Both Flowers and Food-101 are interesting outliers when looking at the performance increase compared to the baseline performance, and the size of the target datasets. They are both fine-grained tasks, and ImageNet 1K has very few classes of each. A future research direction could involve looking at the reliance on colours in features for both ImageNet 1K pretrained models and models fine-tuned on these two target datasets as we would expect colour to be useful in classifying both.
Table 2: Performance increase from transfer learning relative to baseline performance (training from random initialisation) for the target datasets in kornblith2019better , ordered by target dataset size.
Dataset | Type | Size in 1,000s | Performance increase / baseline performance | Performance increase rank
---|---|---|---|---|
Food-101 | fine-grained | 78 | 3 / 87 | 8 |
CIFAR-10 | general | 50 | 1.98 / 96.06 | 10 |
CIFAR-100 | general | 50 | 6.7 / 81 | 6 |
Birdsnap | fine-grained | 47 | 2.5 / 75.9 | 9 |
SUN397 | scenes | 20 | 11.4 / 55 | 3 |
Stanford Cars | fine-grained | 8 | 0.6 / 92.7 | 11 |
FGVC Aircraft | fine-grained | 7 | 0.2 / 88.8 | 12 |
PASCAL VOC 2007 | general | 5 | 16.5 / 70.9 | 2 |
Describable Textures | textures | 4 | 11.3 / 66.8 | 4 |
Oxford-IIIT Pets | fine-grained | 4 | 11.3 / 83.2 | 4 |
Caltech-101 | general | 3 | 17.9 / 77 | 1 |
Oxford 102 Flowers | fine-grained | 2 | 4.55 / 93.9 | 7 |
Table 3: As Table 2, for general and scene target datasets only.
Dataset | Type | Size in 1,000s | Performance increase / baseline performance | Performance increase rank
---|---|---|---|---|
CIFAR-100 | general | 50 | 6.7 / 81 | 4 |
SUN397 | scenes | 20 | 11.4 / 55 | 3 |
PASCAL VOC 2007 | general | 5 | 16.5 / 70.9 | 2 |
Caltech-101 | general | 3 | 17.9 / 77 | 1 |
Table 4: As Table 2, for fine-grained and texture target datasets only.
Dataset | Type | Size in 1,000s | Performance increase / baseline performance | Performance increase rank
---|---|---|---|---|
Food-101 | fine-grained | 78 | 3 / 87 | 3 |
Birdsnap | fine-grained | 47 | 2.5 / 75.9 | 4 |
Describable Textures | textures | 4 | 11.3 / 66.8 | 2 |
Oxford-IIIT Pets | fine-grained | 4 | 11.3 / 83.2 | 1 |
5.6.1 More vs better matched pretraining data part 2
Similarly to large target tasks, better matched pretraining data has been shown to produce better performance on the target task than more pretraining data.
Pretraining on the scene classification task Places365 achieved considerably better performance on the scene classification dataset MIT Indoor 67 than pretraining with the smaller and less related ImageNet 1K in li2018explicit . However, the reverse was true for target tasks more related to ImageNet 1K, namely Stanford Dogs and Caltech256-30 and -60.
Source and target data classes were matched using the Earth Mover’s Distance peleg1989unified ; rubner2000earth in cui2018large . Pretraining with well matched subsets was shown to perform better than with the largest source dataset on a variety of fine-grained visual classification (FGVC) tasks. They also showed a strong correlation between source and target domain similarity as calculated by the Earth Mover’s Distance and performance on the target task for all but one target task.
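The following is a minimal sketch of an Earth Mover's Distance based domain similarity in the spirit of cui2018large : each domain is summarised by per-class feature centroids (for example, mean penultimate-layer activations of a pretrained network) with weights proportional to class frequency. It assumes the POT (Python Optimal Transport) package, and the exponential mapping and the value of gamma are illustrative choices rather than the authors' exact settings.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def domain_similarity(src_centroids, src_weights, tgt_centroids, tgt_weights, gamma=0.01):
    """Similarity between two domains as a decaying function of the EMD between their
    weighted class centroids. Centroids are (num_classes, feature_dim) arrays and
    weights are 1-D arrays proportional to class frequency."""
    cost = ot.dist(np.asarray(src_centroids), np.asarray(tgt_centroids), metric="euclidean")
    a = np.asarray(src_weights, dtype=float)
    b = np.asarray(tgt_weights, dtype=float)
    a, b = a / a.sum(), b / b.sum()
    emd = ot.emd2(a, b, cost)           # optimal transport cost between the two domains
    return float(np.exp(-gamma * emd))  # higher value means more similar domains
```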
Ge and Yu ge2017borrowing found that model performance was improved by fine-tuning a model on the target task along with the most related data from the source task. Related data was found using nearest neighbour ranking on the activations of Gabor filters. Sabatelli et al. sabatelli2018deep found that performance was improved by pretraining on a smaller art classification source dataset rather than the larger, unrelated ImageNet 1K when the target task was art classification. In the same domain Gonthier et al. gonthier2020analysis used a multi-stage training process consisting of:
-
1.
pretraining on ImageNet 1K
-
2.
fine tuning on an art classification dataset that was an order of magnitude larger than the target task
-
3.
fine-tuning on the final target art classification dataset.
They found that this improved performance over pretraining on only ImageNet 1K or only the intermediate art classification dataset.
5.7 Smaller target datasets with less similar tasks
Transfer learning has been shown to be effective in many areas where the target datasets are small and less related to ImageNet 1K and other common source datasets. However, there have also been several recent results in this category where deep transfer learning has shown little or no improvement over random initialization zoph2020rethinking ; he2018rethinking ; raghu2019transfusion . In general, transfer learning shows better performance on smaller target datasets that are more closely related to the source dataset than on larger, less related datasets (see Section 5.6 and kornblith2019better ). Task specific self-supervised learning methods applied to source datasets that are more closely related but unlabelled often perform better than supervised learning methods applied to less closely related source datasets zoph2020rethinking ; azizi2021big . Recent work has shown that even when the target dataset is very dissimilar to the source dataset and transfer learning brings no performance gain, it can still accelerate convergence he2018rethinking ; raghu2019transfusion .
5.7.1 Face recognition
Face recognition often relies on pretraining of deep CNN architectures on general image classification datasets like ImageNet 1K. However this area of research presents its own unique challenges masi2018deep :
-
1.
With each class representing only one individual there are often only slight differences between classes and a small number of training images per class.
-
2.
There can be many more classes than is common for an image classification task, with common face recognition datasets having tens or hundreds of thousands, or even millions, of different subjects.
Related to Point 1, a common challenge for facial recognition models is to pretrain them on a large number of publicly available faces of celebrities, then use them to quickly learn to classify new faces with limited examples via transfer learning techniques. Several additions to the typically used cross-entropy loss have been proposed to help with this challenge. They:
-
•
explicitly minimize the intraclass variance wen2016discriminative ,
-
•
increase the margin between classes liu2016large ; liu2017sphereface ,
-
•
encourage features to lie on a hypersphere ranjan2017l2 ; zheng2018ring .
These methods are all designed to produce a well defined feature space during pretraining so that there is likely to be good separation between both source and target subjects. When the number of subjects in face recognition datasets gets very large, deep metric learning losses are generally used instead of classification losses schroff2015facenet ; wang2018cosface ; deng2019arcface . It has recently been shown that losses designed specifically for face recognition, such as CosFace wang2018cosface and ArcFace deng2019arcface , can be more successful when applied to common deep metric learning benchmark datasets or small fine-grained image classification tasks musgrave2020metric . Deep metric learning is discussed further in Section 6.3.
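As an illustration of the margin-based modifications to cross-entropy listed above, the following is a minimal PyTorch sketch of a CosFace-style large margin cosine loss wang2018cosface . It is a simplified stand-in rather than the authors' implementation, and the scale and margin values are illustrative defaults, not a prescription.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarginCosineLoss(nn.Module):
    """CosFace-style loss sketch: logits are cosine similarities between L2-normalised
    features and class weight vectors; a margin m is subtracted from the target-class
    cosine and the result is scaled by s before a standard cross-entropy."""

    def __init__(self, feature_dim: int, num_classes: int, s: float = 30.0, m: float = 0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, feature_dim))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between normalised features and normalised class weights.
        cosine = F.linear(F.normalize(features), F.normalize(self.weight))  # (B, C)
        one_hot = F.one_hot(labels, num_classes=cosine.size(1)).to(cosine.dtype)
        # Subtract the margin on the true class only, then scale.
        logits = self.s * (cosine - self.m * one_hot)
        return F.cross_entropy(logits, labels)
```

During transfer, the backbone producing `features` can be pretrained on a large celebrity face dataset and the class weights re-initialised for the new subjects.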
5.7.2 Facial expression recognition
Like many fine-grained classification problems facial expression recognition (FER) datasets are often challenging because they are small. Most well known FER datasets have less than 10,000 images or videos. Even the larger ones often have only around 100 different subjects, making the individual images highly correlated. An additional challenge unique to facial expression recognition is that high intersubject (intraclass) variations exist due to different personal attributes, such as age, gender, ethnic backgrounds and level of expressiveness li2020deep .
Again, for this task it has been shown that pretraining with source data more closely matched to the target dataset results in better performance. Pretraining on a large facial recognition dataset has been shown to perform better than pretraining on the more general and less closely related ImageNet 1K imagenet_cvpr09 . A multi-stage pretraining pipeline, using a larger FER dataset for interim fine-tuning prior to the final fine-tuning on the smaller target dataset, was shown to improve performance in ng2015deep .
5.7.3 Pretraining with general image classification datasets for medical imaging classification tasks
The two major problems presented when using deep neural networks for medical imaging tasks that are most relevant to this review are:
-
1.
Scarcity of data. Training data in medical image datasets often numbers in just the hundreds or thousands of examples, as opposed to the hundreds of thousands, millions or even billions of examples often available in more general image datasets mazurowski2019deep .
-
2.
Severely unbalanced classes. There are often many more examples of healthy images as opposed to those with a rare disease mazurowski2019deep .
Training very deep networks from scratch is problematic in many medical imaging cases due to the problems listed above. Deep transfer learning is adopted in almost every modality of medical imaging, including X-rays, CT scans, pathological images, positron emission tomography (PET), and MRI mazurowski2019deep . Despite this there has been limited work addressing best practice for deep transfer learning in the medical imaging classification domain.
Early experiments tajbakhsh2016convolutional compared an AlexNet pretrained on ImageNet 1K with and without fine-tuning, an AlexNet trained from scratch, and traditional models with hand crafted features. They demonstrated that:
-
1.
The use of a pretrained AlexNet CNN with adequate fine-tuning consistently improved on or matched training from random initialisation and traditional methods.
-
2.
While the increase in performance using a pretrained and fine-tuned AlexNet was marginal for relatively larger target datasets, the performance improvement was much more significant as the size of the target datasets was reduced.
More recent experiments in deep transfer learning for digital pathology classification, using a range of modern models with residual connections and four different target datasets, were performed by Mormont et al. mormont2018comparison . The models were pretrained on ImageNet 1K. When using the pretrained models as fixed feature extractors, the last layer features were always outperformed by features taken from an inner layer of the network. In keeping with previous results in the field they found that fine-tuning improves on fixed feature extraction. However, they did not combine fine-tuning with testing different layers in the network for optimal performance, which we suggest for future work.
5.7.4 More vs better matched pretraining data in the medical image domain
There have been a number of studies in the last two years examining whether a very large, less related source dataset is better for pretraining for image classification in the medical domain than a within-domain dataset that is orders of magnitude smaller raghu2019transfusion ; heker2020joint . Both these scenarios have also been compared to self-supervised pretraining on a medium sized unlabelled within-domain dataset azizi2021big .
Previous results showing that pretraining with much larger source datasets increases performance on the target task were extended to the medical image domain in mustafa2021supervised . Performance on three well known small medical image classification target tasks was shown to improve as the size of the source dataset increased from ImageNet 1K with 1.3 million training examples to JFT with 300 million training examples. All pretraining produced better performance than random initialization. There has also been some work showing that performing supervised pretraining on a moderately sized medical image dataset and transferring to a smaller one can increase performance compared to training from random initialisation kraus2017automated ; heker2020joint , other shallow machine learning methods kraus2017automated , and even pretraining with ImageNet 1K heker2020joint .
Raghu et al. raghu2019transfusion showed that when medical image datasets are large (around two hundred thousand training examples), pretraining on ImageNet 1K results in limited improvements over random initialisation. This applies to both large CNN models (ResNet50 and Inception v3) and small CNN models designed for better performance on medical datasets. When the number of training examples was reduced to a small dataset size of 5,000, pretraining with ImageNet 1K resulted in small improvements over random initialisation. They also reaffirm that lower layers are more transferable than higher, more task specific layers as per yosinski2014transferable , even when the source and target tasks are very different.
A two step self-supervised pretraining process was used on two different medical imaging classification target tasks in azizi2021big . This involved:
-
1.
self-supervised pretraining on ImageNet 1K
-
2.
followed by self-supervised fine-tuning on a large unlabelled medical dataset from the same source as the target task
-
3.
then fine-tuning on the final labelled medical image classification task.
This was found to produce significantly better results on the target task compared to self-supervised pretraining on only ImageNet and slightly better results than pretraining on the unlabelled medical dataset only or pretraining on ImageNet 1K with supervised methods.
Using a multi-stage supervised pretraining pipeline such as in ng2015deep does not appear to have been applied to classifying medical images with deep CNNs. This could be a useful area of further research.
5.7.5 Pretraining with image classification datasets for object detection tasks
Although it is somewhat beyond the scope of this paper, we briefly review the evidence on whether pretraining on image classification datasets is beneficial for object detection tasks.
Starting with girshick2014rich many studies have shown improved performance on object detection tasks when ImageNet pretraining is used ren2015faster ; redmon2016you . It has become the conventional wisdom that pretraining is needed to achieve top results on object detection tasks as the datasets tend to be smaller than image classification datasets like ImageNet imagenet_cvpr09 . However, a number of recent results have challenged this idea.
Comparable results on the COCO object detection task lin2014microsoft are shown with random initialisation of weights compared to pretraining on ImageNet 1K imagenet_cvpr09 when training protocols are adjusted to be optimal for training from random initialization he2018rethinking . This result is repeated even when the target dataset is reduced to only 10K images in total, or 10% of the full COCO size. A number of other experiments show similar results with and without pretraining shen2017dsod ; shen2017learning ; zhu2019scratchdet , and performance is often better without pretraining when stricter measures of bounding box overlap are used he2018rethinking ; zhu2019scratchdet .
Mixed results from pretraining on the 300 times larger Instagram hashtag dataset for the same COCO task are reported in mahajan2018exploring . The improvements were marginal at best, whereas the improvements reported on the ImageNet and CUB2011 classification tasks were larger. However, in sun2017revisiting performance on COCO was significantly improved when pretraining on the large JFT source dataset with 300 million training examples, compared to ImageNet 1K. Again, these mixed results seem to indicate that deep transfer learning training hyperparameters can have a large influence on results when target datasets are smaller and less similar to source datasets.
A multi-stage pretraining pipeline ng2015deep may again be useful to consider in this domain. Developing more domain specific self-supervised pretraining techniques could also be considered.
5.7.6 Semantic Image segmentation
Due to the inherent difficulty of gathering and creating per pixel labelled segmentation datasets, their scale is not as large as that of classification datasets such as ImageNet. For this reason semantic image segmentation models have traditionally been pretrained on ImageNet 1K and other image classification datasets such as JFT garcia2018survey ; ghosh2019understanding ; minaee2020image . The same pattern is noted here as with other transfer learning applications, in that more training data tends to produce better results chen2018encoder . Recently it has been shown that self-training on unlabelled but more closely related data consistently improves performance over training the same model from scratch, whereas pretraining on ImageNet 1K can reduce performance (negative transfer), particularly when the target dataset is larger zoph2020rethinking .
5.8 Comparison to label based taxonomy
Transfer learning is often categorised based on the labels available for the source and target task, as well as whether the source and target domains are from the same distribution, as follows pan2009survey :
-
1.
Inductive transfer learning: In this case the label information of the target-domain instances is available. The source and target tasks are different but related and the data can be from the same or a different but related domain. This is the broader category and is generally the case with image classification. For this reason, it is the primary focus of this review.
-
2.
Transductive transfer learning: In transductive transfer learning the label information only comes from the source domain. The tasks are the same, but the source and target domains are different. A common example of this is domain adaptation which is covered in Section 6.1. Transductive transfer learning is not a focus of this review as it is rare in image classification that the source and target domain have the exact same class labels. This situation is more common in natural language processing where you may find for example a sentiment analysis model transferred between two slightly different product review domains.
-
3.
Unsupervised or semi-supervised transfer learning: Pure unsupervised learning where both the source and target domains have no labels is rarely useful in transfer learning. However, domain adaptation with no target task labels is sometimes referred to as unsupervised domain adaptation. Semi-supervised learning is the more common scenario and refers to when the source domain does not have labels but the target domain does. Semi-supervised learning is most commonly used when there are orders of magnitude more unlabelled data for the target domain or a closely related domain srivastava2015unsupervised ; pathak2016context ; jing2020self ; radford2015unsupervised ; ledig2017photo . In this situation an unsupervised or self-supervised learning algorithm is used to train on the unlabelled data before fine-tuning on the labelled data. This technique tends to result in negative transfer if the amount of unlabelled data is not significantly larger than the labelled data paine2014analysis .
We argue that for modern very deep CNNs with millions of parameters in hundreds of layers used for image classification, the size of and similarity between source and target domains are much more important than the labels used in pretraining. The robustness and generality of the features learned are similarly much more important than the final classification task they are trained on. For example, a source domain that is very large and identical or similar to the target domain but without labels, or with weak labels, will produce better results than a fully labelled source dataset that is very different to the target domain zoph2020rethinking ; azizi2021big .
5.9 Discussion
There are two overarching themes throughout this review of transfer learning techniques and applications:
-
1.
More source data is better in general, but a more closely related source dataset for pretraining will often produce better performance on the target task than a larger source dataset.
-
2.
The size of the target dataset and how closely related it is to the source dataset strongly impacts the performance of transfer learning. In particular using sub-optimal transfer learning hyperparameters can result in negative transfer when the target dataset is less related and large enough to be trained from random initialisation.
Currently, there tends to be an all or nothing approach to transfer learning: either transferring all layers improves performance or it doesn't, and possibly decreases performance (negative transfer). The same approach is often taken to freezing layers (either layers are frozen or they are fine-tuned at the same learning rate as the rest of the model) and to weight regularization (either all transferred weights are decayed towards their pretrained values, as in L2-SP and other recent regularization techniques, or towards zero, or sometimes not at all). We advocate that the all or nothing approach should be discarded and that instead all decisions about how to perform transfer learning should be thought of as a sliding scale. Transferring all layers, freezing layers and decaying all weights towards pretrained values is at one extreme of the scale and is likely to only be optimal for source and target datasets that are extremely similar. Training from scratch is at the other end of the scale and is likely to only be optimal if the source and target domain have no similarities. This lines up with the observations in yosinski2014transferable that “first-layer features appear not to be specific to a particular dataset or task, but general in that they are applicable to many datasets and tasks”. Recent work by Abnar et al. has shown the potential for improvements in transfer learning by taking into account the similarities of the source and target task abnar2021exploring . More work is needed to show how to perform transfer learning optimally for a given source and target dataset relationship and target dataset size, rather than showing whether transfer learning is effective at all.
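To make the sliding-scale idea concrete, the sketch below shows an L2-SP style penalty in which transferred weights are decayed towards their pretrained values while newly initialised weights (such as a classifier head) are decayed towards zero; which parameters go into each group, and the coefficients, can then be varied with the similarity of the source and target datasets. This is a minimal PyTorch sketch under those assumptions, not a specific published implementation, and the coefficient values are illustrative.

```python
import torch

def l2_sp_penalty(model, pretrained_state, alpha=0.01, beta=0.01):
    """L2-SP style regularizer: decay transferred weights towards their pretrained
    values and the remaining (newly initialised) weights towards zero.
    `pretrained_state` maps parameter names to pretrained tensors for the
    transferred layers; alpha and beta are illustrative coefficients."""
    towards_pretrained = torch.zeros((), dtype=torch.float32)
    towards_zero = torch.zeros((), dtype=torch.float32)
    for name, param in model.named_parameters():
        if name in pretrained_state:
            ref = pretrained_state[name].to(param.device)
            towards_pretrained = towards_pretrained + (param - ref).pow(2).sum()
        else:
            towards_zero = towards_zero + param.pow(2).sum()
    return alpha * towards_pretrained + beta * towards_zero

# During fine-tuning: total_loss = task_loss + l2_sp_penalty(model, pretrained_state)
```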
6 Areas related to deep transfer learning for image classification
There are several different problem domains that are either closely related to deep transfer learning or that use it as a standard part of state of the art models. We briefly examine how they relate to deep transfer learning for image classification below.
6.1 Domain adaptation
Domain adaptation (DA) could be considered the most closely related problem domain. In domain adaptation the task remains the same, but the domain differs between source and target. According to Definition 3 by Pan and Yang pan2009survey , domain adaptation falls under general transfer learning as transductive transfer learning, as described in Section 5.8. Others define it as a separate area goodfellow2016deep . Either way, domain adaptation is not a focus of this review as it is rare in image classification that the source and target domain have the exact same class labels. This situation is more common in natural language processing where you may find, for example, a sentiment analysis model transferred between two slightly different product review domains csurka2017domain ; wang2018deep1 ; zhang2017transfer . The few datasets that are used in domain adaptation for image classification are carefully curated to have identical object classes in different domains gong2012geodesic ; saenko2010adapting ; tommasi2014testbed .
Despite there being few examples of domain adaptation problems in image classification, there are techniques developed for these tasks that could be more broadly useful to deep transfer learning for image classification. Wang and Deng state that “When the labeled samples from the target domain are available in supervised DA, soft label and metric learning are always effective” wang2018deep1 . Metric learning has been shown to be effective in K-shot learning as described in Section 6.3. It has recently been shown to improve performance with slightly larger but still small fine-grained target datasets with up to 42 examples per class ridnik2020tresnet . It is also common in domain adaptation to use a multi-stage adaptation similar to that shown to be effective for facial expression recognition in ng2015deep .
6.2 Concept drift and multitask learning
The concept drift problem is related to domain adaptation in that it deals with adapting models to distributions that change over time. Multitask learning is another related problem domain where the focus is on learning one model that can perform well at multiple tasks in multiple domains. While both can be seen as types of transfer learning, the distinction between pure transfer learning focused tasks and concept drift or multitask learning is an important one. The focus in the former is to learn just the target task as well as possible and performance is only judged on this. Whereas in both concept drift and multitask learning the focus is to learn more than one task well:
-
•
in concept drift learning the new distribution needs to be modelled well without forgetting all learning about the old distribution
-
•
in multitask learning many tasks need to be learned well. This needs to be achieved without new tasks overwriting previously learned tasks.
The issues described above are examples of catastrophic forgetting. Catastrophic forgetting is defined as the scenario where learning a new set of patterns suddenly and completely erases a network’s knowledge of what it has already learned french1999catastrophic ; mccloskey1989catastrophic ; ratcliff1990connectionist . The major difference between general transfer learning and concept drift or multitask learning is that usually catastrophic forgetting is not a consideration in the former.
6.3 K-shot or few shot learning
The K-shot or few shot (K-shot) learning problem domain is specifically focused on learning from very few or even zero labelled examples. The focus of new work in this domain is generally on improving metrics for defining and learning the best embedding spaces and comparisons to new examples, plus tricks of data augmentation. Transfer learning is implicit in most of the techniques used as weights are pretrained on a related task wang2020generalizing ; kaya2019deep .
Methods for refining deep transfer learning methods for k-shot image classification tasks can be divided into:
-
•
data augmentation and other methods for improving the source dataset
-
•
metrics that define the “goodness” of the embedding space and how that information should be used to classify small target classes
-
•
methods that look at the best way to update weights based on the target dataset.
We focus on methods that fall under the last heading, as they are most relevant to this review. However, as stated previously, we note that metric methods have recently been shown to improve transfer learning performance for some small datasets that fall outside of the standard K-shot learning definition ridnik2020tresnet . More research is recommended to extend these results.
Weight update methods
Weight update methods look at the best way to update weights from the source or other related tasks, including the decision not to update any weights. Many models across a broad spectrum of K-shot learning techniques do not perform any updates to weights using the target dataset. There is some recent evidence that shows that weight updates are superior in a variety of cases scott2018adapted , but again given the small target dataset sizes the right transfer learning hyperparameters must be used. Many models that do not update weights to the target task do simulate the few-shot learning scenario in training on related datasets. They train on a subset of the available classes, then optimize the model by maximising performance on the unseen classes. This results in a model that generalises well to unseen examples vinyals2016matching ; snell2017prototypical .
In Generalizing from a Few Examples: A Survey on Few-shot Learning wang2020generalizing Wang et al. define transfer learning as falling under the informing the parameter space umbrella of few-shot learning techniques. Under this general umbrella transfer learning and related topics can be split into several different strategies for dealing with very small target datasets:
-
1.
Refining pretrained parameters: these methods explicitly look at ways for improving deep transfer learning methods for very small target datasets. They are grouped into ways of explicitly regularizing to avoid overfitting, including:
-
•
early stopping
-
•
selectively updating only certain weights
-
•
updating certain groups of weights together
-
•
weight regularization.
-
2.
Refining meta-learned parameters: these methods are drawn from multitask learning. Parameters are learned from several related tasks, then again fine-tuned on the target task. The models differ in the way the weights are updated from the multitask to the task specific model. Regularizing methods that use task specific information or model the uncertainty of using meta-learned parameters can also be included in this heading.
-
3.
Learning the optimizer: These methods work on learning optimizers that can perform better than hand designed optimizers in a specific problem domain when fine-tuning a pretrained model with a small number of training examples.
While there is a lot of cross-over between deep transfer learning for image classification and the refining pretrained parameters category of K-shot learning, the latter focuses only on very small target datasets. This means it does not consider how the size of the target dataset and the relationship between the source and target datasets affect the optimal transfer learning methods and hyperparameters.
6.4 Unsupervised or self-supervised learning
In unsupervised or self-supervised deep learning (self-supervised), a model of a dataset distribution is learned by using an aspect of the data itself as a training signal. The most general example is autoencoders, which consist of an encoder and decoder with the target being the same as the input. The middle encoding is regularized in some way, in the hope of producing a meaningful semantic encoding. Other examples of task specific self-supervised learning include predicting the next frame of a video srivastava2015unsupervised , predicting the next word in a sentence brown2020language , image inpainting or upsampling pathak2016context ; ledig2017photo , and many more. For a detailed treatment of self-supervised methods for learning image representations see jing2020self . Generative Adversarial Networks (GANs) are also an example of self-supervised learning radford2015unsupervised ; goodfellow2014generative . In learning to classify real versus fake training examples, the discriminator in a GAN learns useful features of the data distribution and its weights can then be used to initialise a classification model.
Self-supervised learning relates closely to transfer learning as it is often used for pretraining when there is limited labelled data for a source task, but a large amount of related unlabelled data. Self-supervised learning has been shown to be beneficial to performance when the ratio of unlabelled to labelled training examples is high, but may harm performance when the ratio is low paine2014analysis . This relates back to the question of more versus better related source training data that comes up regularly in deep transfer learning and is covered in sections 5.3.1 and 5.6.1.
7 Discussion and suggestions for future directions
7.1 Summary of current knowledge
Early work in deep transfer learning for image classification showed that transfer learning is effective compared to training from randomly initialised weights, particularly when it involves small target datasets and the source and target datasets are similar agrawal2014analyzing ; yosinski2014transferable ; azizpour2015factors .
Our review has highlighted the limits of current knowledge in each area and suggested future research directions to expand the current body of knowledge:
-
1.
Fine tuning tends to work better than freezing weights in most cases agrawal2014analyzing ; yosinski2014transferable ; azizpour2015factors . Freezing lower layers may work better in limited cases where the target domain is small and the tasks are extremely similar plested2019analysis . More work is needed to see whether this applies with modern deep CNNs.
-
2.
Recent regularization techniques such as L2-SP, DELTA, BSS and stochastic normalization tend to improve performance when source and target tasks are very similar. However, it has been shown that L2-SP and DELTA particularly can result in minimal improvement or even worse performance when the source and target datasets are less related li2020rethinking ; wan2019towards ; plested2021non ; chen2019catastrophic . Most work so far has focused on how these techniques compare when the source and target dataset are very closely related. More work is needed to show how each of these methods perform when they are applied to less similar datasets. Recent work has shown that in some cases using L2-SP regularization for lower layers and L2 regularization for higher layers can improve performance over using either one for all layers plested2021non . While the evidence for this is so far limited it does align with observations from yosinski2014transferable and abnar2021exploring that lower layers are more transferable than higher layers.
3. Both the learning rate and momentum should be lower during fine-tuning for more similar source and target tasks, and higher for less closely related datasets li2020rethinking ; plested2021non . The learning rate should also be decayed more quickly the more similar the source and target tasks are, so as not to change the pretrained parameters as much kolesnikov2019big . Similarly, the learning rate should be decayed more quickly with smaller target datasets, where the empirical risk estimate is likely to be less reliable and overfitting more of a problem plested2019analysis ; kolesnikov2019big . However, when the target dataset is small, it must be taken into account that the number of weight updates per epoch will be lower, so it is the number of updates that should be reduced, not necessarily the number of epochs. When the source and target datasets are less similar it may be optimal to fine-tune higher layers at a higher learning rate than lower layers plested2021non . More work is needed to show how the learning rate, momentum and number of updates before decaying the learning rate should change when the source and target tasks are very different.
4. Transferring more layers is better than fewer layers when the source and target datasets are large and very similar and learning rates are high for all layers yosinski2014transferable . This does not necessarily hold true when target datasets are smaller and fine-tuning hyperparameters are tuned more carefully plested2019analysis ; plested2021non ; abnar2021exploring . This relates back to Point 3: the learning rate and momentum should be lower for more similar tasks. As the empirical risk minimizer is unreliable for small datasets, it is more prone to overfitting. Pretraining limits how far the weights can move based on this unreliable risk minimizer neyshabur2020being ; liu2019towards , so it acts as a regularizer to prevent overfitting. An additional way to limit overfitting is to use a smaller learning rate and momentum, and to decay the learning rate quickly to prevent weights becoming too large based on unreliable statistics. Taken together, these two observations mean it is likely that a larger fine-tuning learning rate is optimal when more layers are pretrained and a lower learning rate when fewer layers are pretrained plested2019analysis ; plested2021non . For this reason it is important to tune the number of pretrained layers and the learning rate together. Recent work has also shown that using fixed pretrained features from lower layers without fine-tuning can result in better performance than features from higher layers when the source and target datasets are less similar and the target dataset is small mormont2018comparison ; abnar2021exploring .
5. More source and target training data is better in general. Improvements in performance from pretraining on datasets up to 3,000 times larger than ImageNet 1K have shown that the largest modern models are overfitting on ImageNet 1K, and pretraining with much larger datasets helps prevent this ngiam2018domain ; mahajan2018exploring ; kolesnikov2019big . However, closely related data is better than more data when data is abundant ngiam2018domain ; mahajan2018exploring . Self-supervised learning on a more related unlabelled source dataset has been shown to improve performance over supervised learning on a less related labelled dataset in some cases heker2020joint ; zoph2020rethinking . More work is needed to identify under what circumstances the former is better than the latter. A multi-step fine-tuning process including a more closely related intermediate task has been shown to improve performance over a single-step transfer from ImageNet 1K in a number of specialised tasks with target datasets that are very different from ImageNet 1K ng2015deep ; gonthier2020analysis ; azizi2021big . More work is needed to see whether this technique could benefit other tasks where limited closely related data is available.
6. Measures of transferability can predict the performance of a pretrained model on a particular target task bao2019information ; tran2019transferability ; nguyen2020leep (a sketch of the LEEP measure follows this list). The performance of a simple classification model with frozen weights can give some insight into transfer learning best practice for the target task plested2021non . More work is needed to determine whether transferability measures in general can help determine optimal transfer learning hyperparameters across a wide range of source and target datasets.
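As a concrete illustration of points 1 to 4 above, the sketch below shows one hedged PyTorch-style fine-tuning recipe: the pretrained layers receive a smaller learning rate than the new classification head, the learning rate decay is defined in updates rather than epochs, and an optional L2-SP-style penalty pulls the transferred weights back towards their pretrained values. The model choice, step counts, learning rates and penalty strength are illustrative assumptions rather than recommended settings, and target_loader and num_target_classes are hypothetical.

import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True)
model.fc = torch.nn.Linear(model.fc.in_features, num_target_classes)  # new head

# Keep a copy of the pretrained weights as the anchor for the L2-SP penalty.
anchor = {n: p.detach().clone() for n, p in model.named_parameters()
          if not n.startswith("fc")}

# Layer-wise learning rates: lower for pretrained layers, higher for the new head.
# To freeze the lower layers entirely instead, set requires_grad=False on them.
opt = torch.optim.SGD(
    [{"params": [p for n, p in model.named_parameters() if not n.startswith("fc")],
      "lr": 1e-3},
     {"params": model.fc.parameters(), "lr": 1e-2}],
    momentum=0.9,
)
# Decay measured in updates, not epochs (important for small target datasets).
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1000, gamma=0.1)

def l2_sp_penalty(model, anchor, alpha=0.01):
    # Penalise distance from the pretrained weights rather than from zero.
    return alpha * sum(((p - anchor[n]) ** 2).sum()
                       for n, p in model.named_parameters() if n in anchor)

for images, labels in target_loader:  # hypothetical labelled target DataLoader
    opt.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    (loss + l2_sp_penalty(model, anchor)).backward()
    opt.step()
    sched.step()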
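Point 6 refers to transferability measures such as LEEP nguyen2020leep , which scores a pretrained source model on a labelled target dataset without any fine-tuning. The function below is a minimal NumPy sketch of the LEEP computation as we read it from the original paper; the argument names are ours and a small epsilon is added for numerical safety.

import numpy as np

def leep_score(source_probs, target_labels, num_target_classes):
    # source_probs: (n, num_source_classes) softmax outputs of the pretrained
    # source model on the target images; target_labels: (n,) integer labels.
    n, num_source_classes = source_probs.shape
    # Empirical joint distribution over (target label y, source label z).
    joint = np.zeros((num_target_classes, num_source_classes))
    for y in range(num_target_classes):
        joint[y] = source_probs[target_labels == y].sum(axis=0) / n
    # Conditional P(y | z) from the joint and the marginal over z.
    cond = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
    # Average log-likelihood of the true target labels under the composed model.
    probs = (cond[target_labels] * source_probs).sum(axis=1)
    return np.log(probs + 1e-12).mean()

Higher scores indicate that the source model's predictions are more informative about the target labels; in nguyen2020leep the score correlates with accuracy after fine-tuning.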
7.2 Recommendations for best practice
Our recommendations for best practice in deep transfer learning for image classification, broken down by target dataset size and similarity to the source dataset, are summarized below.
Larger, similar target datasets
Most techniques will work well in this case, so less time needs to be spent finding the best technique for the given task. A lower learning rate and momentum are better for fine-tuning when the source and target datasets are similar. This may not be necessary when the target dataset is very large, for example the size of ImageNet 1K, and overfitting a poor empirical risk minimizer is less of a problem. All weights should be fine-tuned, not frozen. L2-SP regularization or other more recent techniques like DELTA, BSS and stochastic normalization are likely to work reasonably well. Generally, less regularization will be needed with a very large target dataset.
Larger, less similar target datasets
Pretrained weights may not improve performance and negative transfer is more likely in this scenario, so care needs to be taken. When fine-tuning, a higher learning rate and momentum should be used, and training should run for longer before decaying the learning rate. This is to attempt to move the weights out of the flat basin of the loss landscape created by using pretrained weights, and further away from their pretrained values neyshabur2020being ; liu2019towards . Recent regularization techniques like L2-SP, DELTA, BSS, stochastic normalization, etc. should not be used in this case, as the weights should be allowed to move freely away from their pretrained values based on the larger target dataset. Reinitializing more layers of weights could also be attempted in order to take advantage of the more general lower layers while allowing the more task-specific higher layers to train from random initialization yosinski2014transferable ; plested2021non ; abnar2021exploring .
Smaller, more similar datasets
This scenario is where transfer learning really excels, although more care needs to be taken to use effective hyperparameters when the dataset is small plested2019analysis . The optimal learning rate and momentum will likely be low, as the weights will not need to move far from their pretrained values. If they do move far, overfitting the unreliable empirical risk minimizer is more likely with a small dataset. Recent regularization techniques like L2-SP, DELTA, BSS, stochastic normalization, etc. are likely to improve performance significantly in this case.
Smaller, less similar datasets
In this case transfer learning can be very useful if done well, but can also lead to poor results. It is difficult to strike an optimal balance between:
1. Allowing the weights to move far enough from their pretrained values that the model is not using inappropriate features for the classification task, and
2. Not allowing the weights to overfit the unreliable empirical risk minimizer.
Given this, if there is any way to use a more closely related dataset it is likely to improve results. This could be:
1. Doing unsupervised pretraining on a large, unlabelled, more closely related dataset instead of a large, labelled, less similar dataset.
2. Using a moderately sized, closely related dataset either instead of a large source dataset for pretraining, or as an intermediate fine-tuning step when transferring weights from a less related source task to the final target task (a sketch of this multi-step approach follows at the end of this section).
If it is not possible to use a more related dataset, then a high initial learning rate and momentum should be used, but the learning rate should be decayed quickly. Recent regularization techniques like L2-SP, DELTA, BSS, stochastic normalization, etc. should not be used on all layers, but in some cases may be beneficial on lower layers.
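Where a moderately sized, more closely related intermediate dataset is available, the multi-step option above can be expressed as a simple pipeline. The sketch below is a hedged PyTorch-style illustration that reuses one generic fine-tuning helper for each stage and applies a quick learning rate decay at the final, small and less similar target stage; all loaders, class counts and hyperparameters are illustrative assumptions.

import torch
import torchvision

def fine_tune(model, loader, lr, updates, decay_every):
    # Generic fine-tuning loop; the schedule is measured in updates, not epochs.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=decay_every, gamma=0.1)
    step = 0
    while step < updates:
        for images, labels in loader:
            opt.zero_grad()
            torch.nn.functional.cross_entropy(model(images), labels).backward()
            opt.step()
            sched.step()
            step += 1
            if step >= updates:
                break
    return model

model = torchvision.models.resnet50(pretrained=True)  # step 0: generic ImageNet 1K source

# Step 1: fine-tune on an intermediate task more closely related to the target.
model.fc = torch.nn.Linear(model.fc.in_features, num_intermediate_classes)
model = fine_tune(model, intermediate_loader, lr=1e-2, updates=5000, decay_every=2000)

# Step 2: fine-tune on the small, less similar target task with quick decay.
model.fc = torch.nn.Linear(model.fc.in_features, num_target_classes)
model = fine_tune(model, target_loader, lr=1e-2, updates=1000, decay_every=300)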
References
- [1] Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, and Hanie Sedghi. Exploring the limits of large scale pre-training. arXiv preprint arXiv:2110.02095, 2021.
- [2] Pulkit Agrawal, Ross Girshick, and Jitendra Malik. Analyzing the performance of multilayer neural networks for object recognition. In European conference on computer vision, pages 329–344. Springer, 2014.
- [3] Shekoofeh Azizi, Basil Mustafa, Fiona Ryan, Zachary Beaver, Jan Freyberg, Jonathan Deaton, Aaron Loh, Alan Karthikesalingam, Simon Kornblith, Ting Chen, et al. Big self-supervised models advance medical image classification. arXiv preprint arXiv:2101.05224, 2021.
- [4] Hossein Azizpour, Ali Sharif Razavian, Josephine Sullivan, Atsuto Maki, and Stefan Carlsson. Factors of transferability for a generic convnet representation. IEEE transactions on pattern analysis and machine intelligence, 38(9):1790–1802, 2015.
- [5] Yajie Bao, Yang Li, Shao-Lun Huang, Lin Zhang, Lizhong Zheng, Amir Zamir, and Leonidas Guibas. An information-theoretic approach to transferability in task transfer learning. In 2019 IEEE International Conference on Image Processing (ICIP), pages 2309–2313. IEEE, 2019.
- [6] Samy Bengio. Sharing representations for long tail computer vision problems. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 1–1, 2015.
- [7] Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L Alexander, David W Jacobs, and Peter N Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2011–2018, 2014.
- [8] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014.
- [9] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in neural information processing systems, pages 161–168, 2008.
- [10] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
- [11] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pages 801–818, 2018.
- [12] Xinyang Chen, Sinan Wang, Bo Fu, Mingsheng Long, and Jianmin Wang. Catastrophic forgetting meets negative transfer: Batch spectral shrinkage for safe transfer learning. openreview, 2019.
- [13] Brian Chu, Vashisht Madhavan, Oscar Beijbom, Judy Hoffman, and Trevor Darrell. Best practices for fine-tuning visual classifiers to new domains. In European conference on computer vision, pages 435–442. Springer, 2016.
- [14] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3606–3613, 2014.
- [15] Gabriela Csurka. Domain adaptation for visual applications: A comprehensive survey. arXiv preprint arXiv:1702.05374, 2017.
- [16] Yin Cui, Yang Song, Chen Sun, Andrew Howard, and Serge Belongie. Large scale fine-grained categorization and domain-specific transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4109–4118, 2018.
- [17] Yin Cui, Feng Zhou, Yuanqing Lin, and Serge Belongie. Fine-grained categorization and dataset bootstrapping using deep metric learning with humans in the loop. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1153–1162, 2016.
- [18] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
- [19] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.
- [20] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
- [21] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004.
- [22] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
- [23] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, Victor Villena-Martinez, Pablo Martinez-Gonzalez, and Jose Garcia-Rodriguez. A survey on deep learning techniques for image and video semantic segmentation. Applied Soft Computing, 70:41–65, 2018.
- [24] Weifeng Ge and Yizhou Yu. Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1086–1095, 2017.
- [25] Swarnendu Ghosh, Nibaran Das, Ishita Das, and Ujjwal Maulik. Understanding deep learning techniques for image segmentation. ACM Computing Surveys (CSUR), 52(4):1–35, 2019.
- [26] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
- [27] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073. IEEE, 2012.
- [28] Nicolas Gonthier, Yann Gousseau, and Saïd Ladjal. An analysis of the transfer learning of convolutional neural networks for artistic images. arXiv preprint arXiv:2011.02727, 2020.
- [29] Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. Deep learning, volume 1. MIT press Cambridge, 2016.
- [30] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
- [31] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. authors.library.caltech.edu, 2007.
- [32] Yunhui Guo, Honghui Shi, Abhishek Kumar, Kristen Grauman, Tajana Rosing, and Rogerio Feris. Spottune: transfer learning through adaptive fine-tuning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4805–4814, 2019.
- [33] Kevin Gurney. An introduction to neural networks. CRC press, 1997.
- [34] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. arXiv preprint arXiv:1811.08883, 2018.
- [35] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
- [36] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [37] Michal Heker and Hayit Greenspan. Joint liver lesion segmentation and classification via transfer learning. arXiv preprint arXiv:2004.12352, 2020.
- [38] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
- [39] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15262–15271, 2021.
- [40] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [41] Kurt Hornik, Maxwell Stinchcombe, Halbert White, et al. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
- [42] Minyoung Huh, Pulkit Agrawal, and Alexei A Efros. What makes imagenet good for transfer learning? arXiv preprint arXiv:1608.08614, 2016.
- [43] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. arXiv preprint arXiv:1702.03275, 2017.
- [44] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- [45] Yunho Jeon, Yongseok Choi, Jaesun Park, Subin Yi, Dongyeon Cho, and Jiwon Kim. Sample-based regularization: A transfer learning strategy toward better generalization. arXiv preprint arXiv:2007.05181, 2020.
- [46] Junguang Jiang, Yang Shu, Jianmin Wang, and Mingsheng Long. Transferability in deep learning: A survey. arXiv preprint arXiv:2201.05867, 2022.
- [47] Longlong Jing and Yingli Tian. Self-supervised visual feature learning with deep neural networks: A survey. IEEE transactions on pattern analysis and machine intelligence, 2020.
- [48] Mahmut Kaya and Hasan Şakir Bilge. Deep metric learning: A survey. Symmetry, 11(9):1066, 2019.
- [49] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. Novel dataset for fine-grained image categorization: Stanford dogs. In Proc. CVPR Workshop on Fine-Grained Visual Categorization (FGVC), volume 2, 2011.
- [50] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. Big transfer (bit): General visual representation learning. arXiv preprint arXiv:1912.11370, 2019.
- [51] Simon Kornblith, Ting Chen, Honglak Lee, and Mohammad Norouzi. Why do better loss functions lead to less transferable features? Advances in Neural Information Processing Systems, 34, 2021.
- [52] Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better imagenet models transfer better? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2661–2671, 2019.
- [53] Zhi Kou, Kaichao You, Mingsheng Long, and Jianmin Wang. Stochastic normalization. Advances in Neural Information Processing Systems, 33, 2020.
- [54] Oren Z Kraus, Ben T Grys, Jimmy Ba, Yolanda Chong, Brendan J Frey, Charles Boone, and Brenda J Andrews. Automated analysis of high-content microscopy data with deep learning. Molecular systems biology, 13(4):924, 2017.
- [55] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
- [56] Sajja Tulasi Krishna and Hemantha Kumar Kalluri. Deep learning and transfer learning approaches for image classification. International Journal of Recent Technology and Engineering (IJRTE), 7(5S4):427–432, 2019.
- [57] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. cs.toronto.edu, 2009.
- [58] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- [59] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
- [60] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
- [61] Hao Li, Pratik Chaudhari, Hao Yang, Michael Lam, Avinash Ravichandran, Rahul Bhotika, and Stefano Soatto. Rethinking the hyperparameters for fine-tuning. arXiv preprint arXiv:2002.11770, 2020.
- [62] Shan Li and Weihong Deng. Deep facial expression recognition: A survey. IEEE Transactions on Affective Computing, 2020.
- [63] Xingjian Li, Haoyi Xiong, Hanchao Wang, Yuxuan Rao, Liping Liu, and Jun Huan. Delta: Deep learning transfer using feature map with attention for convolutional networks. arXiv preprint arXiv:1901.09229, 2019.
- [64] Xuhong Li, Yves Grandvalet, and Franck Davoine. Explicit inductive bias for transfer learning with convolutional networks. arXiv preprint arXiv:1802.01483, 2018.
- [65] Xuhong Li, Yves Grandvalet, and Franck Davoine. A baseline regularization scheme for transfer learning with convolutional neural networks. Pattern Recognition, 98:107049, 2020.
- [66] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- [67] Hong Liu, Mingsheng Long, Jianmin Wang, and Michael I Jordan. Towards understanding the transferability of deep representations. arXiv preprint arXiv:1909.12031, 2019.
- [68] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 212–220, 2017.
- [69] Weiyang Liu, Yandong Wen, Zhiding Yu, and Meng Yang. Large-margin softmax loss for convolutional neural networks. In ICML, volume 2, page 7, 2016.
- [70] Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens van der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pages 181–196, 2018.
- [71] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, Toyota Technological Institute at Chicago, 2013.
- [72] Fabio Maria Carlucci, Lorenzo Porzi, Barbara Caputo, Elisa Ricci, and Samuel Rota Bulo. Autodial: Automatic domain alignment layers. In Proceedings of the IEEE International Conference on Computer Vision, pages 5067–5075, 2017.
- [73] Iacopo Masi, Yue Wu, Tal Hassner, and Prem Natarajan. Deep face recognition: A survey. In 2018 31st SIBGRAPI conference on graphics, patterns and images (SIBGRAPI), pages 471–478. IEEE, 2018.
- [74] Maciej A Mazurowski, Mateusz Buda, Ashirbani Saha, and Mustafa R Bashir. Deep learning in radiology: An overview of the concepts and a survey of the state of the art with focus on mri. Journal of magnetic resonance imaging, 49(4):939–954, 2019.
- [75] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
- [76] Thomas Mensink, Jasper Uijlings, Alina Kuznetsova, Michael Gygli, and Vittorio Ferrari. Factors of influence for transfer learning across diverse appearance domains and task types. arXiv preprint arXiv:2103.13318, 2021.
- [77] George A Miller. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39–41, 1995.
- [78] Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. Image segmentation using deep learning: A survey. arXiv preprint arXiv:2001.05566, 2020.
- [79] Tom M Mitchell. Machine learning. McGraw Hill, Burr Ridge, IL, 1997.
- [80] Romain Mormont, Pierre Geurts, and Raphaël Marée. Comparison of deep transfer learning strategies for digital pathology. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2262–2271, 2018.
- [81] Stefan Munder and Dariu M Gavrila. An experimental study on pedestrian classification. IEEE transactions on pattern analysis and machine intelligence, 28(11):1863–1868, 2006.
- [82] Kevin Musgrave, Serge Belongie, and Ser-Nam Lim. A metric learning reality check. In European Conference on Computer Vision, pages 681–699. Springer, 2020.
- [83] Basil Mustafa, Aaron Loh, Jan Freyberg, Patricia MacWilliams, Megan Wilson, Scott Mayer McKinney, Marcin Sieniek, Jim Winkens, Yuan Liu, Peggy Bui, et al. Supervised transfer learning at scale for medical imaging. arXiv preprint arXiv:2101.05913, 2021.
- [84] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. research.google, 2011.
- [85] Behnam Neyshabur, Hanie Sedghi, and Chiyuan Zhang. What is being transferred in transfer learning? arXiv preprint arXiv:2008.11687, 2020.
- [86] Hong-Wei Ng, Viet Dung Nguyen, Vassilios Vonikakis, and Stefan Winkler. Deep learning for emotion recognition on small datasets using transfer learning. In Proceedings of the 2015 ACM on international conference on multimodal interaction, pages 443–449, 2015.
- [87] Jiquan Ngiam, Daiyi Peng, Vijay Vasudevan, Simon Kornblith, Quoc V Le, and Ruoming Pang. Domain adaptive transfer learning with specialist models. arXiv preprint arXiv:1811.07056, 2018.
- [88] Cuong Nguyen, Tal Hassner, Matthias Seeger, and Cedric Archambeau. Leep: A new measure to evaluate transferability of learned representations. In International Conference on Machine Learning, pages 7294–7305. PMLR, 2020.
- [89] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
- [90] Tom Le Paine, Pooya Khorrami, Wei Han, and Thomas S Huang. An analysis of unsupervised pre-training in light of recent advances. arXiv preprint arXiv:1412.6597, 2014.
- [91] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359, 2009.
- [92] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012.
- [93] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2536–2544, 2016.
- [94] Shmuel Peleg, Michael Werman, and Hillel Rom. A unified approach to the change of resolution: Space and gray-level. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):739–742, 1989.
- [95] Jo Plested and Tom Gedeon. An analysis of the interaction between transfer learning protocols in deep neural networks. In International Conference on Neural Information Processing, pages 312–323. Springer, 2019.
- [96] Jo Plested, Xuyang Shen, and Tom Gedeon. Non-binary deep transfer learning for image classification. arXiv preprint arXiv:2107.08585, 2021.
- [97] Joan Puigcerver, Carlos Riquelme, Basil Mustafa, Cedric Renggli, André Susano Pinto, Sylvain Gelly, Daniel Keysers, and Neil Houlsby. Scalable transfer learning with expert models. arXiv preprint arXiv:2009.13239, 2020.
- [98] Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 413–420. IEEE, 2009.
- [99] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- [100] Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, and Samy Bengio. Transfusion: Understanding transfer learning for medical imaging. arXiv preprint arXiv:1902.07208, 2019.
- [101] Rajeev Ranjan, Carlos D Castillo, and Rama Chellappa. L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507, 2017.
- [102] Roger Ratcliff. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review, 97(2):285, 1990.
- [103] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems, pages 506–516, 2017.
- [104] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
- [105] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28:91–99, 2015.
- [106] Ricardo Ribani and Mauricio Marengoni. A survey of transfer learning for convolutional neural networks. In 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images Tutorials (SIBGRAPI-T), pages 47–57. IEEE, 2019.
- [107] Tal Ridnik, Hussam Lawen, Asaf Noy, and Itamar Friedman. Tresnet: High performance gpu-dedicated architecture. arXiv preprint arXiv:2003.13630, 2020.
- [108] Michael T Rosenstein, Zvika Marx, Leslie Pack Kaelbling, and Thomas G Dietterich. To transfer or not to transfer. In NIPS 2005 workshop on transfer learning, volume 898, pages 1–4, 2005.
- [109] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
- [110] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986.
- [111] Matthia Sabatelli, Mike Kestemont, Walter Daelemans, and Pierre Geurts. Deep transfer learning for art classification problems. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.
- [112] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European conference on computer vision, pages 213–226. Springer, 2010.
- [113] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
- [114] Tyler Scott, Karl Ridgeway, and Michael C Mozer. Adapted deep embeddings: A synthesis of methods for k-shot inductive transfer learning. In Advances in Neural Information Processing Systems, pages 76–85, 2018.
- [115] Ling Shao, Fan Zhu, and Xuelong Li. Transfer learning for visual categorization: A survey. IEEE transactions on neural networks and learning systems, 26(5):1019–1034, 2014.
- [116] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 806–813, 2014.
- [117] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. Dsod: Learning deeply supervised object detectors from scratch. In Proceedings of the IEEE international conference on computer vision, pages 1919–1927, 2017.
- [118] Zhiqiang Shen, Honghui Shi, Rogerio Feris, Liangliang Cao, Shuicheng Yan, Ding Liu, Xinchao Wang, Xiangyang Xue, and Thomas S Huang. Learning object detectors from scratch with gated recurrent feature pyramids. arXiv preprint arXiv:1712.00886, 1, 2017.
- [119] Jun Shu, Zongben Xu, and Deyu Meng. Small sample learning in big data era. arXiv preprint arXiv:1808.04572, 2018.
- [120] Jake Snell, Kevin Swersky, and Richard S Zemel. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175, 2017.
- [121] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- [122] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852, 2015.
- [123] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and Christian Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural networks, 32:323–332, 2012.
- [124] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in nlp. arXiv preprint arXiv:1906.02243, 2019.
- [125] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE international conference on computer vision, pages 843–852, 2017.
- [126] Nima Tajbakhsh, Jae Y Shin, Suryakanth R Gurudu, R Todd Hurst, Christopher B Kendall, Michael B Gotway, and Jianming Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE transactions on medical imaging, 35(5):1299–1312, 2016.
- [127] Chuanqi Tan, Fuchun Sun, Tao Kong, Wenchang Zhang, Chao Yang, and Chunfang Liu. A survey on deep transfer learning. In International conference on artificial neural networks, pages 270–279. Springer, 2018.
- [128] Mingxing Tan and Quoc V Le. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
- [129] Tatiana Tommasi and Tinne Tuytelaars. A testbed for cross-dataset analysis. In European Conference on Computer Vision, pages 18–31. Springer, 2014.
- [130] Anh T Tran, Cuong V Nguyen, and Tal Hassner. Transferability and hardness of supervised classification tasks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1395–1405, 2019.
- [131] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8769–8778, 2018.
- [132] Vladimir Vapnik. Principles of risk minimization for learning theory. In Advances in neural information processing systems, pages 831–838, 1992.
- [133] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29:3630–3638, 2016.
- [134] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. vision.caltech.edu, 2011.
- [135] Ruosi Wan, Haoyi Xiong, Xingjian Li, Zhanxing Zhu, and Jun Huan. Towards making deep transfer learning never hurt. In 2019 IEEE International Conference on Data Mining (ICDM), pages 578–587. IEEE, 2019.
- [136] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5265–5274, 2018.
- [137] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
- [138] Ximei Wang, Ying Jin, Mingsheng Long, Jianmin Wang, and Michael Jordan. Transferable normalization: Towards improving transferability of deep neural networks. openreview.net, 2019.
- [139] Yaqing Wang, Quanming Yao, James T Kwok, and Lionel M Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR), 53(3):1–34, 2020.
- [140] Zirui Wang, Zihang Dai, Barnabás Póczos, and Jaime Carbonell. Characterizing and avoiding negative transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11293–11302, 2019.
- [141] Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. A survey of transfer learning. Journal of Big data, 3(1):9, 2016.
- [142] Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. A discriminative feature learning approach for deep face recognition. In European conference on computer vision, pages 499–515. Springer, 2016.
- [143] Yue Wu, Tal Hassner, KangGeon Kim, Gerard Medioni, and Prem Natarajan. Facial landmark detection with tweaked convolutional neural networks. IEEE transactions on pattern analysis and machine intelligence, 40(12):3067–3074, 2018.
- [144] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European conference on computer vision (ECCV), pages 3–19, 2018.
- [145] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010.
- [146] Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10687–10698, 2020.
- [147] I Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
- [148] Jinsung Yoon, Sercan Arik, and Tomas Pfister. Data valuation using reinforcement learning. In International Conference on Machine Learning, pages 10842–10851. PMLR, 2020.
- [149] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in neural information processing systems, pages 3320–3328, 2014.
- [150] Kaichao You, Zhi Kou, Mingsheng Long, and Jianmin Wang. Co-tuning for transfer learning. Advances in Neural Information Processing Systems, 33, 2020.
- [151] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
- [152] Jing Zhang, Wanqing Li, and Philip Ogunbona. Transfer learning for cross-dataset recognition: a survey. arXiv preprint arXiv:1705.04396, 2017.
- [153] Jing Zhang, Wanqing Li, Philip Ogunbona, and Dong Xu. Recent advances in transfer learning for cross-dataset visual recognition: A problem-oriented perspective. ACM Computing Surveys (CSUR), 52(1):1–38, 2019.
- [154] Yutong Zheng, Dipan K Pal, and Marios Savvides. Ring loss: Convex feature normalization for face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5089–5097, 2018.
- [155] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence, 40(6):1452–1464, 2017.
- [156] Linchao Zhu, Sercan O Arik, Yi Yang, and Tomas Pfister. Learning to transfer learn. openreview.net, 2019.
- [157] Rui Zhu, Shifeng Zhang, Xiaobo Wang, Longyin Wen, Hailin Shi, Liefeng Bo, and Tao Mei. Scratchdet: Training single-shot object detectors from scratch. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2268–2277, 2019.
- [158] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive survey on transfer learning. Proceedings of the IEEE, 2020.
- [159] Barret Zoph, Golnaz Ghiasi, Tsung-Yi Lin, Yin Cui, Hanxiao Liu, Ekin D Cubuk, and Quoc V Le. Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882, 2020.