Class-incremental Learning using a Sequence of Partial Implicitly Regularized Classifiers
Abstract
In class-incremental learning, the objective is to learn a number of classes sequentially without access to the whole training data at once. However, due to a problem known as catastrophic forgetting, neural networks suffer a substantial performance drop in such settings. The problem is often approached with experience replay, a method that stores a limited number of samples to be replayed in future steps to reduce forgetting of the learned classes. We show that, when a pretrained network is used as a feature extractor, it is better to train a number of specialized classifiers that do not interfere with each other yet can cooperatively predict a single class, rather than training a single classifier incrementally. Our experiments on the CIFAR100 dataset show that the proposed method improves performance over the state of the art by a large margin.
Index Terms:
continual-learning, class-incremental learning, catastrophic forgetting
I Introduction
Artificial neural networks (ANNs) have been at the top of the machine learning landscape for a while. They have achieved impressive performance across various applications, including object recognition[1], anomaly detection[2, 3], accident detection[4, 5], action recognition[6, 7, 8], scene classification[9], hyperspectral image classification[10, 11], medical image analysis[12, 13], and machine translation[14, 15]. Their success is mainly attributed to the availability of large amounts of data and sufficient computational power. While ANNs are inspired by the biological brain, some shortcomings still set them apart. In particular, they are not designed to learn incrementally, as humans do. Humans keep acquiring new knowledge throughout their lives, whereas experiments show that ANNs almost completely forget their previous knowledge when trained on a new task[16]. This is termed “Catastrophic Forgetting”[17].
The common way to train ANNs is to provide them with the whole dataset at once and let them iterate through it multiple times. Unfortunately, there are situations where this is not feasible: the learning device may have memory limits that make it impossible to store all the data, or there may be security concerns about storing data that contains sensitive information. Catastrophic forgetting in ANNs was first addressed by McCloskey and Cohen[17] in 1989, but it remains an unsolved problem hindering progress towards AI agents that can learn continuously.
Three main continual learning scenarios have been identified: task-incremental learning, class-incremental learning, and domain-incremental learning [18]. In task-incremental learning, the model learns a sequence of tasks and is provided with the task identity at inference time. In class-incremental learning, no such information is provided at inference time, and the model is required to predict both the correct class and the task. A slightly different scenario is domain-incremental learning, which, unlike class-incremental learning, does not require predicting the task identity.
This paper is concerned with class-incremental learning, where at each training session multiple new classes are to be learned while the knowledge of previous classes is maintained. Despite many proposed solutions in the literature, baseline methods like Experience Replay (ER)[19] and GDumb[20] have been shown to be as effective as the state of the art. Given that each training session contains examples of only a few classes, deep networks are prone to overfitting; it is therefore plausible to use a frozen feature extractor with smaller classifiers on top. ER trains only a single network, extending its output units for the new classes. In each training session, the network is trained on the examples of the new classes as well as a small number of examples from previous classes stored in a limited memory buffer. While retraining on these few stored examples helps retain the knowledge of past classes, there is still no guarantee that the new knowledge will not interfere with it. To solve this problem, we propose to train a separate classifier for each group of new classes. These classifiers do not share any weights with each other, so the newly acquired knowledge is stored separately without overriding what was learned before.
II Related Work
In the following, we present some of the continual learning methods existing in the literature. We divide them into three groups according to the main idea they rely on: replay, regularization, and architectural techniques, each described in the following subsections.
II-A Replay Methods
A straightforward solution to prevent catastrophic forgetting is to revisit the previous tasks. Rehearsal methods accomplish this by storing a small portion of the seen examples for later retraining. When the model is faced with new training data, the data is augmented with memory samples to reduce forgetting of previous knowledge.
iCaRL[21] uses the nearest-mean-of-exemplars classification approach, assigning items to the class with the nearest class mean. It uses a heuristic approach for updating the memory buffer, prioritizing items based on their proximity to their class mean. Together with network distillation, it is able to learn in the class-incremental scenario. Gradient Episodic Memory (GEM)[22] is designed for task-incremental learning from streaming data. In this setting, the model receives a series of tuples consisting of an input, a task label, and a target. GEM uses the memory samples not for replay but to build inequality constraints that prevent the loss on past tasks from increasing. Replay using Memory Indexing (REMIND)[23] quantizes the feature maps extracted from the deeper layers of a CNN and stores their indices in memory. This results in a compressed representation and allows a much larger number of stored examples. REMIND’s architecture consists of a frozen feature extractor followed by a trainable classifier. It was shown to outperform existing approaches in a streaming setup on the ImageNet and CORe50[24] datasets.
Rethinking Experience Replay[19]
The authors emphasize that simple experience replay (or simple rehearsal)[25] is still as effective as the state of the art if certain issues are resolved. As stated by the authors, the three important issues with this approach are: overfitting to the stored examples, prediction and accuracy bias towards the later classes, and non-i.i.d. stored data when the buffer is small. Their proposals include an additional bias correction layer (BiC)[26], exponential decay of the learning rate, and balanced sampling for the memory buffer. In their reported results, ER outperforms sophisticated SOTA replay methods such as iCaRL[21], GEM[22], A-GEM[27], and HAL[28], sometimes by a very large margin.
GDumb [20]
is a recent paper that questions our progress in continual learning. It proposes a simple and generic baseline: it maintains a balanced memory buffer and trains only on the samples contained in it. The reported results show that it outperforms many SOTA methods in the respective settings for which they were designed. GDumb thus serves as a strong baseline for all continual learning settings, including class-incremental learning.
II-B Regularization Methods
These methods impose a form of regularization, implemented as additional terms in the loss function, that helps retain the learned knowledge of past tasks.
Learning without Forgetting (LwF)[29] presents a modification of standard fine-tuning. A new network is initialized as a copy of the old one, with the output layer extended for the new task (the multi-head approach). A distillation loss between the old and new task heads reminds the new network of the old tasks indirectly. In the backward pass, only the new network is updated.
Elastic Weight Consolidation (EWC)[30] and Synaptic Intelligence (SI) [31] both try to approximate the importance of each of the parameters for previous tasks and selectively apply regularization to limit their change. Regularization methods alone, as several papers demonstrate, are not sufficient for proper class-incremental learning[16, 32].
II-C Architectural Methods
Architectural methods manipulate the weights, neurons, layers, or overall architecture of the network to protect the learned knowledge while acquiring new knowledge. They use either fixed or dynamic architectures. These methods have the advantage of completely eliminating interference between tasks while still allowing knowledge transfer between them. In practice, however, they suffer from scalability issues as the number of tasks grows.
“Progressive Neural Networks” (PNNs)[33] inherently target task-incremental learning. Their architecture grows laterally, adding a new neural network, called a “column”, for each new task. The new column has connections to the layers of the previous networks and thus benefits strongly from knowledge transfer. Once a column is trained, it is kept frozen, making PNNs completely immune to catastrophic forgetting. A big issue with PNNs is that they grow too large, limiting their practical use to only a small number of tasks. “Compacting, Picking and Growing” (CPG)[34] tries to overcome this issue by dynamically controlling the architecture of the network in an efficient way. The weights of the network are grouped into “compact” and “free” weights, where newly added weights are considered free until they are trained and then compacted. For each new task, a learnable binary mask is created and applied to the compact weights to select a subset of them. The task is then learned using this subset of compact weights together with the free weights. If performance does not reach a satisfactory level, the network is grown during training to provide more free weights. Training is followed by pruning to compact the newly learned weights. This method is useful in scenarios with fewer but larger tasks.
II-C1 Similarity with other works
Here we mention some similar works in the literature and point out the differences from ours. Our main difference is that we train a set of specialized classifiers, each dedicated to one class group, while also making them able to detect whether a sample belongs to previous classes.
Aljundi et al. [35] train a specialized model for each new task and propose a method for choosing the relevant one at inference time. By training a separate autoencoder describing each task’s data distribution, at inference time they choose the model whose corresponding autoencoder has the smallest reconstruction error. The autoencoders are also used to select the task most related to the current one, and the new classifier is then trained based on the most related tasks.
AR1[36] trains a linear classifier on top of a feature extractor. In each phase, a new linear classifier is trained for the current group of classes, and mean-normalization is applied to its weights to eliminate prediction bias. At test time, the prediction is the softmax over the outputs of all classifiers. The differences from our method are that we use deeper classifiers and a memory buffer, whereas AR1 uses linear classifiers and no memory.
III Methods
Our purpose is to sequentially train on a number of disjoint datasets containing different classes. Relying on a pretrained deep feature extractor, we only consider training small networks on top of it. The traditional way is to construct and train a single head whose output units are expanded to accommodate new classes. However, despite the replay mechanism, a single head is still not sufficiently equipped against catastrophic forgetting. We improve the situation by instead training much smaller classifiers, one at a time, each specific to the new classes of its training session. For a descriptive diagram of our method, see Fig. 1. In the following, we formalize the training setting and the details of our method.
In class-incremental learning we aim to train a network on a sequence of datasets $D_1, D_2, \ldots, D_T$, where $D_t$ contains the inputs and labels of the $t$-th session and the label sets are disjoint, i.e. $Y_i \cap Y_j = \emptyset$ for any $i \neq j$. Note that in the $t$-th training session only the dataset $D_t$ is available. Suppose we are allowed to store a small number of examples from the current training session in a limited memory buffer. We denote the samples in the memory buffer by $\mathcal{M}$ and its capacity by $M$, thus $|\mathcal{M}| \leq M$. The memory can be updated with new examples at each training session. Having access to past examples through the memory buffer, the training data for the $t$-th session then consists of $D_t \cup \mathcal{M}$.
Let $f$ denote the feature extraction network. At the $t$-th training session we create a new classifier network $g_t$ with $|Y_t|$ output units that classifies among the new classes. The final classification is done through a final layer that simply concatenates the outputs of all classifiers and applies the softmax. The objective is to minimize the cross-entropy loss between the outputs of the final layer and the true targets over all training samples:

$$\mathcal{L} = -\sum_{(x, y) \in D_t \cup \mathcal{M}} \log p_y(x), \quad \text{where} \quad p(x) = \mathrm{softmax}\big(\big[\, g_1(f(x)),\, g_2(f(x)),\, \ldots,\, g_t(f(x)) \,\big]\big).$$
Note that only the last classifier is trained, while the loss is a function of the outputs of all classifiers. In this way, the last classifier adjusts itself to respect the predictions of the previous classifiers. In other words, it learns to produce high output values for samples belonging to its own feature space (the new classes of the current training session) while suppressing its outputs for samples belonging to previous classifiers. Although all classifiers are in a sense “partial”, i.e. they can only predict the classes assigned to them, they also see previous classes during training as “negative” examples, which acts as a regularizer and makes them more robust. We also emphasize that, since the previous classifiers are frozen, the problem of “forgetting” is eliminated for them.
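To make the mechanism concrete, below is a minimal PyTorch sketch of this idea. The paper does not specify a framework, so the framework, class names, and signatures here are our assumptions: earlier classifiers stay frozen, the newest classifier is trainable, and the cross-entropy loss is computed over the concatenated logits of all classifiers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialClassifierEnsemble(nn.Module):
    """A frozen feature extractor with a growing list of partial classifiers."""

    def __init__(self, feature_extractor: nn.Module):
        super().__init__()
        self.feature_extractor = feature_extractor
        for p in self.feature_extractor.parameters():
            p.requires_grad = False          # backbone stays frozen
        self.classifiers = nn.ModuleList()   # one classifier per class group

    def add_classifier(self, classifier: nn.Module) -> None:
        # Freeze every previously trained classifier so its knowledge cannot
        # be overwritten; only the newest one will receive gradients.
        for clf in self.classifiers:
            for p in clf.parameters():
                p.requires_grad = False
        self.classifiers.append(classifier)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.feature_extractor(x)
        # Concatenate the logits of all partial classifiers; the softmax is
        # applied implicitly inside the cross-entropy loss below.
        return torch.cat([clf(feats) for clf in self.classifiers], dim=1)

def session_loss(model: PartialClassifierEnsemble,
                 x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # y holds global class indices, ordered consistently with the concatenation.
    return F.cross_entropy(model(x), y)
```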
III-A Memory buffer
As the memory has to be kept updated with new samples, a sampling strategy must decide which examples are selected or removed. We adopt the greedy sampling approach described in [20]: it randomly replaces some of the old examples in memory with new ones while trying to satisfy the balancing constraint, that is, maintaining an equal number of examples per class.
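A possible implementation of such a greedy balanced buffer is sketched below; the quota logic follows our reading of GDumb’s sampler, and the class and method names are hypothetical.

```python
import random
from collections import defaultdict

class GreedyBalancedBuffer:
    """Greedy class-balanced memory buffer (a sketch in the spirit of [20])."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = defaultdict(list)   # class label -> stored samples

    def __len__(self) -> int:
        return sum(len(v) for v in self.data.values())

    def update(self, sample, label) -> None:
        num_classes = len(self.data) + (0 if label in self.data else 1)
        quota = self.capacity // num_classes     # target samples per class
        if len(self) < self.capacity:
            self.data[label].append(sample)      # free space: just add
        elif len(self.data[label]) < quota:
            # Buffer full but this class is under-represented: evict a random
            # sample from the currently largest class and store the new one.
            largest = max(self.data, key=lambda c: len(self.data[c]))
            self.data[largest].pop(random.randrange(len(self.data[largest])))
            self.data[label].append(sample)
```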
III-B Training
Training consists of multiple sessions. At the beginning of each training session, a classifier is constructed with an output size equal to the number of classes in the incoming training data. The training objective is to learn the parameters of this new classifier. We train it by mini-batch gradient descent, where each mini-batch contains an equal number of samples from the current dataset and from the memory buffer. At the end of each training session the memory buffer is updated with new samples.
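The per-session training loop can then be sketched as follows, reusing `PartialClassifierEnsemble`, `session_loss`, and `GreedyBalancedBuffer` from the sketches above; the batch size, optimizer, and the way the two data streams are combined are our assumptions.

```python
import torch
from torch.utils.data import DataLoader

def train_session(model, new_classifier, current_data, memory_data,
                  batch_size=64, lr=0.01, epochs=50):
    """One incremental training session: only the new classifier is optimized."""
    model.add_classifier(new_classifier)
    optimizer = torch.optim.SGD(new_classifier.parameters(), lr=lr)

    # Each mini-batch is half new data, half memory samples.
    new_loader = DataLoader(current_data, batch_size=batch_size // 2, shuffle=True)
    mem_loader = DataLoader(memory_data, batch_size=batch_size // 2, shuffle=True)

    for _ in range(epochs):
        # zip() stops at the shorter loader; in practice the memory loader
        # would be cycled so that all new data is visited every epoch.
        for (x_new, y_new), (x_mem, y_mem) in zip(new_loader, mem_loader):
            x = torch.cat([x_new, x_mem], dim=0)
            y = torch.cat([y_new, y_mem], dim=0)
            loss = session_loss(model, x, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Early stopping on the validation split (Sec. III-C) would go here.
```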
III-C Early Stopping
As the number of examples per class in the memory keeps decreasing over time, the model becomes prone to overfitting, so an early stopping mechanism is essential. We propose to hold out a portion of the data for validation. Apart from splitting the incoming data into training and validation parts, we also dedicate a part of the memory to validation samples. The training part of the memory is updated only with samples from the training split, and the validation part of the memory is updated only with samples from the validation split.
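The sketch below illustrates this setup, reusing the buffer class from Section III-A; the 10-epoch patience follows the criterion used later in Section IV, while the 90/10 split of the memory budget is an assumption for illustration.

```python
class EarlyStopper:
    """Signals when the validation loss has stopped improving."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

# Two disjoint buffers: each is updated only from its own split of the data.
train_memory = GreedyBalancedBuffer(capacity=1800)  # assumed 90% of a 2000 budget
val_memory = GreedyBalancedBuffer(capacity=200)     # assumed 10% held out for validation
```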
III-D Bias correction layer
Since the mini-batches contain unequal numbers of samples from the old and new classes, a class imbalance problem arises. This problem was first addressed in [26], where the authors proposed a simple yet effective solution called Bias Correction (BiC). It is a layer containing only two parameters that applies a linear transformation to the output logits belonging to the new classes:

$$q_k = \begin{cases} \alpha \, o_k + \beta & \text{if class } k \text{ is new}, \\ o_k & \text{otherwise}, \end{cases}$$

where $\alpha$ and $\beta$ are the bias parameters and $o_k$ is the $k$-th output logit of the final layer. The BiC layer is trained separately at the end of each training session with a small amount of data, as it contains only two parameters.
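A minimal implementation of such a layer could look as follows; the class name and the split of the logit vector into old and new parts are our assumptions, and the two parameters are initialized to the identity transformation.

```python
import torch
import torch.nn as nn

class BiCLayer(nn.Module):
    """Two-parameter bias correction applied only to the new-class logits [26]."""

    def __init__(self, num_old_classes: int):
        super().__init__()
        self.num_old = num_old_classes
        self.alpha = nn.Parameter(torch.ones(1))   # multiplicative correction
        self.beta = nn.Parameter(torch.zeros(1))   # additive correction

    def forward(self, logits: torch.Tensor) -> torch.Tensor:
        old, new = logits[:, :self.num_old], logits[:, self.num_old:]
        # Old-class logits pass through unchanged; new-class logits are rescaled.
        return torch.cat([old, self.alpha * new + self.beta], dim=1)
```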
IV Experimental Results
We consider the class-incremental learning scenario on the CIFAR100 dataset by splitting it into 5, 10, and 20 disjoint parts, with each part containing 20, 10, and 5 classes, respectively. In all experiments we use the same class ordering, obtained by random shuffling. We also run experiments with different memory sizes. We compare our method against two state-of-the-art baselines: ER equipped with BiC, and GDumb.
IV-A Implementation details
For feature extraction, we use an EfficientNet-B0 network[37] pretrained on ImageNet, from which we remove the final convolution layer, the output layer, and the final MBConv block. Since this feature extractor accepts inputs of size 224×224, we resize the 32×32 CIFAR100 images to match the input size. The feature extractor is kept frozen in all experiments, for both our method and the baselines. For the classifiers, we use a single convolution layer followed by global average pooling and a dense layer. In all experiments the classifiers have the same architecture except for the number of filters in the convolution layer and the number of units in the final dense layer. Overall, we have made sure that our method does not use more trainable parameters in total than the baselines (see Table II).
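The sketch below shows one way to build these components with PyTorch/torchvision. The framework, the exact slice of EfficientNet-B0 blocks to drop, the ReLU activation, and the kernel size are assumptions on our part; the paper does not state them.

```python
import torch.nn as nn
from torchvision.models import efficientnet_b0

def build_feature_extractor() -> nn.Module:
    # ImageNet-pretrained EfficientNet-B0; dropping the last two entries of
    # `features` is our guess at "final conv layer + final MBConv block",
    # and the classification head is discarded by simply not using it.
    backbone = efficientnet_b0(weights="IMAGENET1K_V1")
    extractor = nn.Sequential(*list(backbone.features.children())[:-2])
    return extractor.eval()   # kept frozen; see PartialClassifierEnsemble above

def build_partial_classifier(in_channels: int, num_filters: int,
                             num_classes: int) -> nn.Module:
    # One convolution, global average pooling, and a dense output layer.
    return nn.Sequential(
        nn.Conv2d(in_channels, num_filters, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(num_filters, num_classes),
    )
```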
In all experiments, we stop a training session when the validation loss does not improve for 10 consecutive epochs, and we dedicate 10% of the data to validation. For all methods we start with a learning rate of 0.01. For our method we decrease the learning rate when the validation loss does not improve for 3 consecutive epochs; for GDumb and ER we apply exponential decay to the learning rate with a factor of 0.95. At the end of each training session, just before testing, we train the BiC layer to remove the prediction bias towards later classes. Since the whole purpose of BiC is to remove prediction bias, we train it only on the validation part of the memory buffer (after it has been updated to include all seen classes), which the model itself has never been trained on.
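For reference, the two learning-rate schedules can be expressed with standard PyTorch schedulers; this is only a sketch of the schedule described above, and the helper name is hypothetical.

```python
import torch

def make_schedulers(optimizer: torch.optim.Optimizer):
    # Our method: reduce the LR when the validation loss plateaus for 3 epochs.
    plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", patience=3)
    # GDumb / ER baselines: exponential decay with factor 0.95 per epoch.
    exponential = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
    return plateau, exponential

# Per epoch:  plateau.step(val_loss)  for our method,  exponential.step()  for the baselines.
```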
Table I: Accuracy (%) at the end of training on CIFAR100 for different numbers of splits and memory buffer sizes.

| Memory (Splits) | 2000 (5) | 500 (10) | 1000 (10) | 2000 (10) | 1000 (20) | 2000 (20) |
|---|---|---|---|---|---|---|
| GDumb | 48.24 | 23.79 | 39.89 | 48.74 | 39.57 | 47.83 |
| ER w/ BiC | 57.89 | 42.46 | 48.97 | 54.68 | 48.78 | 55.23 |
| Ours | 67.83 | 56.45 | 63.48 | 65.79 | 61.60 | 63.25 |
Table II: Number of trainable parameters (for our method: per classifier / all classifiers combined).

| Method | 5 splits | 10 splits | 20 splits |
|---|---|---|---|
| GDumb | 118K | 118K | 118K |
| ER | 118K | 118K | 118K |
| Ours | 17K / 86K | 8K / 82K | 4K / 80K |
IV-B Baselines
GDumb and ER use the same architecture; they differ only in training. GDumb trains only on memory samples: it updates the memory buffer at the beginning of each training session and then trains solely on the buffer, discarding the rest of the data. For ER, we also train a BiC layer at the end of each training session. A BiC layer is not necessary for GDumb, since it trains on a balanced dataset, namely the memory buffer.
IV-C Analysis
We test the models at the end of each training session on all classes encountered so far. In Table I we report the accuracy at the end of training in three different settings, with varying memory buffer sizes (500, 1000, and 2000) and with the dataset split into 5 (Fig. 2), 10 (Fig. 3), and 20 (Fig. 4) parts. In all settings we see a large gap in accuracy between our method and the second best method, ER. ER always outperforms GDumb, as expected, because it trains on all available data and because the early stopping mechanism we introduce here prevents unnecessary training that would cause more forgetting of past data.
We observe that the gap in accuracy tends to grow as the memory buffer shrinks. Our method still reaches a remarkable accuracy of 56.45% with a memory size of 500, a gap of 14% over the next best method, ER. GDumb, on the other hand, shows a large performance drop, mainly because it trains only on the memory samples and is therefore highly dependent on the memory size. However, GDumb shows the smallest difference in performance across the different dataset splittings, meaning that it might have an advantage in settings closer to online learning.
IV-D Ablation studies
We conduct experiments to verify that all components of our method are indeed important. The first component is freezing the classifiers after they are trained, which we believe eliminates the problem of “forgetting”; we therefore run an experiment in which our method keeps updating all classifiers. The second component is the additional BiC layer. Looking at the confusion matrices, we observed a prediction bias towards the last group of classes, which was the signal to incorporate the BiC layer; we also report results for our method without it. These experiments are conducted on CIFAR100 split into 10 parts with a memory buffer of size 2000, and the results are reported in Table III. We can see the benefit of the BiC layer, confirming that our method would have some prediction bias without it. We also see a large drop in accuracy (65.79% → 57.04%) when we allow all classifiers to be updated. Nevertheless, the performance still stays above the other two methods (54.68% and 48.74% for ER and GDumb, respectively, in Table I).
Table III: Ablation results on CIFAR100 with 10 splits and a memory buffer of size 2000.

| Method | Accuracy (%) |
|---|---|
| Ours w/out freezing | 57.04 |
| Ours w/out BiC | 62.86 |
| Ours | 65.79 |
V Conclusion
In this paper, we proposed a new approach for class-incremental learning. We tackled the problem by training a separate classifier for each new group of classes. By freezing these classifiers after they have been trained, we limited the problem of “forgetting” and achieved large improvements over strong SOTA baselines. Our method consistently achieved better performance on CIFAR100 when learning different numbers of classes at a time.
References
- [1] A. Khan and K. Fraz, “Post-training iterative hierarchical data augmentation for deep networks,” Advances in Neural Information Processing Systems, vol. 33, 2020.
- [2] A. R. Rivera, A. Khan, I. E. I. Bekkouch, and T. S. Sheikh, “Anomaly detection based on zero-shot outlier synthesis and hierarchical feature distillation,” IEEE Transactions on Neural Networks and Learning Systems, 2020.
- [3] K. Yakovlev, I. E. I. Bekkouch, A. M. Khan, and A. M. Khattak, “Abstraction-based outlier detection for image data,” in Proceedings of SAI Intelligent Systems Conference. Springer, 2020, pp. 540–552.
- [4] E. Batanina, I. E. I. Bekkouch, Y. Youssry, A. Khan, A. M. Khattak, and M. Bortnikov, “Domain adaptation for car accident detection in videos,” in 2019 Ninth International Conference on Image Processing Theory, Tools and Applications (IPTA). IEEE, 2019, pp. 1–6.
- [5] M. Bortnikov, A. Khan, A. M. Khattak, and M. Ahmad, “Accident recognition via 3d cnns for automated traffic monitoring in smart cities,” in Science and Information Conference. Springer, 2019, pp. 256–264.
- [6] Y. Gavrilin and A. Khan, “Across-sensor feature learning for energy-efficient activity recognition on mobile devices,” in 2019 International Joint Conference on Neural Networks (IJCNN). IEEE, 2019, pp. 1–7.
- [7] K. Sozykin, S. Protasov, A. Khan, R. Hussain, and J. Lee, “Multi-label class-imbalanced action recognition in hockey videos via 3d convolutional neural networks,” in 2018 19th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD). IEEE, 2018, pp. 146–151.
- [8] A. M. Khan, Y.-K. Lee, S. Lee, and T.-S. Kim, “Accelerometer’s position independent physical activity recognition system for long-term activity monitoring in the elderly,” Medical & biological engineering & computing, vol. 48, no. 12, pp. 1271–1279, 2010.
- [9] S. Protasov, A. M. Khan, K. Sozykin, and M. Ahmad, “Using deep features for video scene detection and annotation,” Signal, Image and Video Processing, vol. 12, no. 5, pp. 991–999, 2018.
- [10] M. Ahmad, A. M. Khan, M. Mazzara, S. Distefano, M. Ali, and M. S. Sarfraz, “A fast and compact 3-d cnn for hyperspectral image classification,” IEEE Geoscience and Remote Sensing Letters, 2020.
- [11] M. Ahmad, A. M. Khan, M. Mazzara, and S. Distefano, “Multi-layer extreme learning machine-based autoencoder for hyperspectral image classification.” in VISIGRAPP (4: VISAPP), 2019, pp. 75–82.
- [12] M. Gusarev, R. Kuleev, A. Khan, A. R. Rivera, and A. M. Khattak, “Deep learning models for bone suppression in chest radiographs,” in 2017 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). IEEE, 2017, pp. 1–7.
- [13] A. Dobrenkii, R. Kuleev, A. Khan, A. R. Rivera, and A. M. Khattak, “Large residual multiple view 3d cnn for false positive reduction in pulmonary nodule detection,” in 2017 IEEE conference on computational intelligence in bioinformatics and computational biology (CIBCB). IEEE, 2017, pp. 1–6.
- [14] A. Khusainova, A. Khan, and A. R. Rivera, “Sart-similarity, analogies, and relatedness for tatar language: New benchmark datasets for word embeddings evaluation,” arXiv preprint arXiv:1904.00365, 2019.
- [15] A. Valeev, I. Gibadullin, A. Khusainova, and A. Khan, “Application of low-resource machine translation techniques to russian-tatar language pair,” arXiv preprint arXiv:1910.00368, 2019.
- [16] R. Kemker, M. McClure, A. Abitino, T. Hayes, and C. Kanan, “Measuring catastrophic forgetting in neural networks,” arXiv preprint arXiv:1708.02072, 2017.
- [17] M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” Psychology of Learning and Motivation, vol. 24, pp. 109–165, 1989.
- [18] G. M. van de Ven and A. S. Tolias, “Three scenarios for continual learning,” 2019.
- [19] P. Buzzega, M. Boschini, A. Porrello, and S. Calderara, “Rethinking experience replay: a bag of tricks for continual learning,” arXiv preprint arXiv:2010.05595, 2020.
- [20] A. Prabhu, P. Torr, and P. Dokania, “Gdumb: A simple approach that questions our progress in continual learning,” in The European Conference on Computer Vision (ECCV), August 2020.
- [21] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl: Incremental classifier and representation learning,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5533–5542.
- [22] D. Lopez-Paz and M. Ranzato, “Gradient episodic memory for continual learning,” in Advances in neural information processing systems, 2017, pp. 6467–6476.
- [23] T. L. Hayes, K. Kafle, R. Shrestha, M. Acharya, and C. Kanan, “Remind your neural network to prevent catastrophic forgetting,” in Proceedings of the European Conference on Computer Vision (ECCV), 2020.
- [24] V. Lomonaco and D. Maltoni, “Core50: a new dataset and benchmark for continuous object recognition,” 2017.
- [25] R. Ratcliff, “Connectionist models of recognition memory: constraints imposed by learning and forgetting functions.” Psychological review, vol. 97, no. 2, p. 285, 1990.
- [26] Y. Wu, Y. Chen, L. Wang, Y. Ye, Z. Liu, Y. Guo, and Y. Fu, “Large scale incremental learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 374–382.
- [27] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny, “Efficient lifelong learning with a-gem,” arXiv preprint arXiv:1812.00420, 2018.
- [28] A. Chaudhry, A. Gordo, P. K. Dokania, P. Torr, and D. Lopez-Paz, “Using hindsight to anchor past knowledge in continual learning,” arXiv preprint arXiv:2002.08165, 2020.
- [29] Z. Li and D. Hoiem, “Learning without forgetting,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 12, pp. 2935–2947, 2017.
- [30] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
- [31] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” Proceedings of machine learning research, vol. 70, p. 3987, 2017.
- [32] T. Lesort, A. Stoian, and D. Filliat, “Regularization shortcomings for continual learning,” 2020.
- [33] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.
- [34] C.-Y. Hung, C.-H. Tu, C.-E. Wu, C.-H. Chen, Y.-M. Chan, and C.-S. Chen, “Compacting, picking and growing for unforgetting continual learning,” in Advances in Neural Information Processing Systems, 2019, pp. 13669–13679.
- [35] R. Aljundi, P. Chakravarty, and T. Tuytelaars, “Expert gate: Lifelong learning with a network of experts,” 2017.
- [36] D. Maltoni and V. Lomonaco, “Continuous learning in single-incremental-task scenarios,” Neural Networks, vol. 116, pp. 56–73, 2019.
- [37] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.