Partial Wasserstein Covering
Abstract
We consider a general task called partial Wasserstein covering with the goal of providing information on what patterns are not being taken into account in a dataset (e.g., a dataset used during development) compared with another dataset (e.g., a dataset obtained from actual applications). We model this task as a discrete optimization problem with the partial Wasserstein divergence as an objective function. Although this problem is NP-hard, we prove that it satisfies the submodular property, allowing us to use a greedy algorithm with a 0.63 approximation guarantee. However, the greedy algorithm is still inefficient because it requires solving a linear program for each evaluation of the objective function. To overcome this inefficiency, we propose quasi-greedy algorithms that consist of a series of acceleration techniques, such as sensitivity analysis based on strong duality and the so-called C-transform from the optimal transport field. Experimentally, we demonstrate that we can efficiently fill in the gaps between two datasets and find missing scenes in real driving-scene datasets.
1 Introduction
A major challenge in real-world machine learning applications is coping with mismatches between the data distribution obtained in real-world applications and the one used for development. Regions in the real-world data distribution that are not well supported in the development data distribution (i.e., regions with low relative densities) result in potential risks such as a lack of evaluation or high generalization error, which in turn leads to low product quality. Our motivation is to provide developers with information on what patterns are not being taken into account when developing products by selecting some of the (usually unlabeled) real-world data, also referred to as the application dataset (more precisely, in this study we call a finite set of data sampled from the real-world data distribution the application dataset), to fill in the gaps in the development dataset. Note that the term "development" includes choosing models to use, designing fail-safe subroutines, and training/testing models.
Our research question is formulated as follows. To resolve the lack of data density in development datasets, how, and using which metric, can we select a limited amount of data from application datasets for developers to understand? One reasonable approach is to define the discrepancy between the data distributions and select data to minimize this discrepancy. The Wasserstein distance has attracted significant attention as a metric for data distributions [1, 2]. However, the Wasserstein distance is not capable of representing the asymmetric relationship between application datasets and development datasets, i.e., parts that are over-represented in the development dataset also increase the Wasserstein distance.

In this paper, we propose partial Wasserstein covering (PWC), which selects a limited amount of data from the application dataset by minimizing the partial Wasserstein divergence [3] between the application dataset and the union of the development dataset and the selected data. PWC, as illustrated in Fig. 1, selects data from areas with fewer development data than application data in the data distributions (the lower-right area of the blue distribution in Fig. 1) while ignoring areas with sufficient development data (the upper-middle of the orange distribution). We also highlight the data selected through an anomaly detection method, LOF [4], where irregular data (upper-right points) were selected, but the major density difference was ignored. Furthermore, we show the selection obtained using an active learning method, coreset with a k-center [5], where the data are chosen to improve the accuracy rather than to fill the gap in terms of the distribution mismatch.
Our main contributions are summarized as follows.
• We propose PWC, which extracts data that are lacking in a development dataset from an application dataset by minimizing the partial Wasserstein divergence between the application dataset and the development dataset.
• We prove that PWC is a maximization problem involving a submodular function whose input is the set of selected data. This allows us to use a greedy algorithm with a guaranteed approximation ratio.
• Additionally, we propose fast heuristics based on sensitivity analysis and the Sinkhorn derivative for accelerated computation.
• Experimentally, we demonstrate that, compared with baselines, PWC extracts data lacking in the development data distribution from the application distribution more efficiently.
The remainder of this paper is organized as follows. Section 2 briefly introduces the partial Wasserstein divergence, linear programming (LP), and submodular functions. In Sect. 3, we detail the proposed PWC, fast algorithms for approximating the optimization problem, and some theoretical results. Section 4 presents a literature review of related work. We demonstrate PWC in numerical experiments in Sect. 5. In Sect. 6, we present our conclusions.
2 Preliminaries
In this section, we introduce the notations used throughout this paper. We then describe partial Wasserstein divergences, sensitivity analysis for LP, and submodular functions.
Notations
Vectors and matrices are denoted in bold (e.g., $\mathbf{a}$ and $\mathbf{P}$), where $a_i$ and $P_{ij}$ denote the $i$th and $(i,j)$th elements, respectively. To clarify the elements, we use notations such as $(a_i)_i$ and $(P_{ij})_{ij}$, where the expressions inside the parentheses specify the element values depending on their subscripts. $\langle \cdot, \cdot \rangle$ denotes the inner product of matrices or vectors. $\mathbf{1}_n$ is an $n$-dimensional vector with all elements equal to one. For a natural number $n$, we define $[n] = \{1, \dots, n\}$. For a finite set $\mathcal{X}$, we denote its power set as $2^{\mathcal{X}}$ and its cardinality as $|\mathcal{X}|$. The norm of a vector is denoted as $\|\cdot\|$. The delta function centered at $x$ is denoted as $\delta_x$. $\mathbb{R}_{>0}$ is the set of positive real numbers, and $[l, u]$ denotes a closed interval. $\mathbb{1}[\cdot]$ denotes an indicator function (zero for false and one for true).
Partial Wasserstein divergence
In this paper, we consider the partial Wasserstein divergence [6, 3, 7] as an objective function. The partial Wasserstein divergence is designed to measure the discrepancy between two distributions with different total masses by considering variations of optimal transport. Throughout this paper, datasets are modeled as empirical distributions represented by mixtures of delta functions that do not necessarily have the same total mass. Suppose two empirical distributions (or datasets) $\mu$ and $\nu$ with probability masses $\mathbf{a}$ and $\mathbf{b}$, denoted as $\mu = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} b_j \delta_{y_j}$. Without loss of generality, the total mass of $\nu$ is greater than or equal to that of $\mu$ (i.e., $\sum_j b_j \ge \sum_i a_i$).
Based on the definitions above, we define the partial Wasserstein divergence as follows (the partial optimal transport problems in the literature include a wider class of problems than Eq. (1), as summarized in Table 1 (b) of [3]; this paper employs the one-side relaxed Wasserstein divergence corresponding to Table 1 (c) in [3] without loss of generality):
$\mathrm{PW}(\mu, \nu) = \min_{\mathbf{P} \in U(\mathbf{a}, \mathbf{b})} \langle \mathbf{C}, \mathbf{P} \rangle, \quad U(\mathbf{a}, \mathbf{b}) = \{\mathbf{P} \in \mathbb{R}_{\geq 0}^{n \times m} : \mathbf{P}\mathbf{1}_m = \mathbf{a},\; \mathbf{P}^\top \mathbf{1}_n \leq \mathbf{b}\},$ (1)
where $C_{ij}$ is the transport cost between $x_i$ and $y_j$, and $P_{ij}$ is the amount of mass flowing from $x_i$ to $y_j$ (to be optimized). Unlike the standard Wasserstein distance, the second constraint in Eq. (1) is not defined with "=", but with "≤". This modification allows us to treat distributions with different total masses. The condition $\mathbf{P}^\top \mathbf{1}_n \leq \mathbf{b}$ indicates that the mass in $\nu$ does not need to be transported, whereas the condition $\mathbf{P}\mathbf{1}_m = \mathbf{a}$ indicates that all of the mass in $\mu$ should be transported without excess or deficiency (just as in the original Wasserstein distance). This property is useful for the problem defined below, which treats datasets with vastly different sizes.
To compute the partial Wasserstein divergence, we must solve the minimization problem in Eq. (1). In this paper, we consider the following two methods: (i) LP using the simplex method, and (ii) generalized Sinkhorn iterations with entropy regularization (with a small regularization parameter $\varepsilon$) [8, 9].
As will be detailed later, an element of the mass $\mathbf{b}$ varies when adding data to the development dataset. A key to our algorithm is to quickly estimate the extent to which the partial Wasserstein divergence will change when an element of $\mathbf{b}$ varies. If we compute the divergence using LP, we can employ a sensitivity analysis, which is described in the following paragraph. If we use generalized Sinkhorn iterations, we can use automatic differentiation techniques to obtain the partial derivative with respect to $\mathbf{b}$ (see Sect. 3.3).
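As a concrete illustration of method (i), the following is a minimal sketch (ours, not the authors' implementation) that casts Eq. (1) as a linear program and solves it with SciPy; the function name and the toy data are placeholders.

```python
# One-side relaxed partial OT of Eq. (1) as an LP: rows are fully transported
# (mass a), columns may be only partially used (mass b, with sum(b) >= sum(a)).
import numpy as np
from scipy.optimize import linprog

def partial_wasserstein_lp(C, a, b):
    n, m = C.shape
    # Equality constraints: each row of P sums to a_i.
    A_eq = np.zeros((n, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    # Inequality constraints: each column of P sums to at most b_j.
    A_ub = np.zeros((m, n * m))
    for j in range(m):
        A_ub[j, j::m] = 1.0
    res = linprog(C.ravel(), A_ub=A_ub, b_ub=b, A_eq=A_eq, b_eq=a,
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, m)

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 2)), rng.normal(size=(8, 2))
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared-distance costs
a = np.full(5, 1 / 5)                                 # must be fully transported
b = np.full(8, 1 / 5)                                 # may be partially used
val, P = partial_wasserstein_lp(C, a, b)
print(val, P.sum(axis=1), P.sum(axis=0))              # row sums equal a
```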
LP and sensitivity analysis
The sensitivity analysis of LP plays an important role in our algorithm. Given a variable $\mathbf{x} \in \mathbb{R}^n$ and parameters $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{b} \in \mathbb{R}^m$, and $\mathbf{c} \in \mathbb{R}^n$, the standard form of LP can be written as follows: $\min_{\mathbf{x}} \mathbf{c}^\top \mathbf{x}$ subject to $\mathbf{A}\mathbf{x} = \mathbf{b}$, $\mathbf{x} \geq \mathbf{0}$. Sensitivity analysis is a framework for estimating changes in the solution when the parameters $\mathbf{A}$, $\mathbf{b}$, and $\mathbf{c}$ of the problem vary. We consider a sensitivity analysis for the right-hand side of the constraint (i.e., $\mathbf{b}$). When $\mathbf{b}$ changes to $\mathbf{b} + \Delta\mathbf{b}$, the optimal value changes by $\mathbf{y}^{*\top}\Delta\mathbf{b}$ if the change in $\mathbf{b}$ lies within an allowable range, where $\mathbf{y}^*$ is the optimal solution of the dual problem corresponding to the primal problem: $\max_{\mathbf{y}} \mathbf{b}^\top \mathbf{y}$ subject to $\mathbf{A}^\top \mathbf{y} \leq \mathbf{c}$. We refer readers to [10] for the details of calculating the upper and lower bounds of this range.
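The following toy example (ours) illustrates the statement above: the optimal dual solution of a small standard-form LP predicts the change of the optimal value under a right-hand-side perturbation.

```python
# For min c^T x s.t. Ax = b, x >= 0 with optimal dual y*, the optimal value
# changes by approximately y*^T (Delta b) when b moves to b + Delta b.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([4.0, 3.0])

primal = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
# Dual: max b^T y s.t. A^T y <= c, y free (solved as a minimization of -b^T y).
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=(None, None), method="highs")
y_star = dual.x

delta = np.array([0.1, 0.0])                    # perturb the right-hand side
perturbed = linprog(c, A_eq=A, b_eq=b + delta, bounds=(0, None), method="highs")
print(perturbed.fun - primal.fun, y_star @ delta)   # the two numbers agree
```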
Submodular function
Our covering problem is modeled as a discrete optimization problem involving submodular functions, which are a subclass of set functions that play an important role in discrete optimization. A set function $f : 2^{\mathcal{V}} \to \mathbb{R}$ is called submodular iff $f(X \cup \{v\}) - f(X) \geq f(Y \cup \{v\}) - f(Y)$ for all $X \subseteq Y \subseteq \mathcal{V}$ and $v \in \mathcal{V} \setminus Y$. A submodular function is monotone iff $f(X) \leq f(Y)$ for $X \subseteq Y$. An important property of monotone submodular functions is that a greedy algorithm provides a $(1 - 1/e) \approx 0.63$-approximate solution to the maximization problem under a budget constraint [11]. The contraction of a (monotone) submodular function $f$, which is defined as $f_W(X) = f(X \cup W) - f(W)$, where $W \subseteq \mathcal{V}$, is also a (monotone) submodular function.
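For concreteness, the following generic sketch (ours) shows the greedy rule that attains the $(1 - 1/e) \approx 0.63$ guarantee for maximizing a monotone submodular function under a cardinality budget; any set function, including the PWC objective of Sect. 3, can be plugged in as F.

```python
# Greedy maximization of a monotone submodular set function under |S| <= budget.
def greedy_maximize(F, ground_set, budget):
    selected = set()
    for _ in range(budget):
        remaining = [v for v in ground_set if v not in selected]
        if not remaining:
            break
        # Pick the element with the largest marginal gain F(S + v) - F(S).
        best = max(remaining, key=lambda v: F(selected | {v}) - F(selected))
        selected.add(best)
    return selected

# Toy usage with a (modular, hence submodular) additive score over items 0..9.
weights = {v: (v % 3) + 1 for v in range(10)}
F = lambda S: sum(weights[v] for v in S)
print(greedy_maximize(F, list(range(10)), budget=3))
```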
3 Partial Wasserstein covering problem
3.1 Formulation
As discussed in Sect. 1, our goal is to fill in the gap with respect to the application dataset $\mathcal{D}_{app}$ by adding some data from a candidate dataset $\mathcal{D}_{cand}$ to a (small) development dataset $\mathcal{D}_{dev}$. We consider $\mathcal{D}_{cand} = \mathcal{D}_{app}$ (i.e., we copy some data from $\mathcal{D}_{app}$ and add them to $\mathcal{D}_{dev}$) in the above-mentioned scenarios, but we herein consider the most general formulation. We model this task as a discrete optimization problem called the partial Wasserstein covering problem.
Given a dataset for development $\mathcal{D}_{dev}$, a dataset obtained from an application $\mathcal{D}_{app}$, and a dataset containing candidates for selection $\mathcal{D}_{cand}$, the PWC problem is defined as the following optimization (we herein consider the partial Wasserstein divergence between unlabeled datasets, because the application and candidate datasets are usually unlabeled; if they are labeled, we can use the label information as in [2]):
$\max_{S \subseteq \mathcal{D}_{cand},\, |S| \leq K} \; -\mathrm{PW}(\mu_{\mathcal{D}_{app}}, \mu_{\mathcal{D}_{dev} \cup S}) + \mathrm{PW}(\mu_{\mathcal{D}_{app}}, \mu_{\mathcal{D}_{dev}}).$ (2)
We select a subset $S$ from the candidate dataset under the budget constraint $|S| \leq K$, and then add that subset to the development data to minimize the partial Wasserstein divergence between the two datasets $\mathcal{D}_{app}$ and $\mathcal{D}_{dev} \cup S$ (here, $\mu_X$ denotes the empirical distribution corresponding to a dataset $X$). The second term is a constant with respect to $S$, which is included to make the objective non-negative.
In Eq. (2), we must specify the probability mass (i.e., $\mathbf{a}$ and $\mathbf{b}$ in Eq. (1)) for each data point. Here, we employ uniform masses, which is a natural choice because we do not have prior information regarding each data point. Specifically, we set the weights to $1/|\mathcal{D}_{app}|$ for $x_i \in \mathcal{D}_{app}$ and $1/|\mathcal{D}_{dev}|$ for $y_j \in \mathcal{D}_{dev} \cup S$. With this choice of masses, we can easily show that $\mathrm{PW}(\mu_{\mathcal{D}_{app}}, \mu_{\mathcal{D}_{dev} \cup S}) = W(\mu_{\mathcal{D}_{app}}, \mu_{\mathcal{D}_{dev}})$ when $S = \emptyset$, where $W$ is the Wasserstein distance. Therefore, based on the monotone property (Sect. 3.2), the objective value is non-negative for any $S$.
The PWC problem in Eq. (2) can be written as a mixed integer linear program (MILP) as follows. Instead of using the divergence between $\mathcal{D}_{app}$ and $\mathcal{D}_{dev} \cup S$, we consider the divergence between $\mathcal{D}_{app}$ and $\mathcal{D}_{dev} \cup \mathcal{D}_{cand}$. Hence, the mass $\mathbf{b}$ is a $(|\mathcal{D}_{dev}| + |\mathcal{D}_{cand}|)$-dimensional vector, where the first $|\mathcal{D}_{dev}|$ elements correspond to the data in $\mathcal{D}_{dev}$ and the remaining elements correspond to the data in $\mathcal{D}_{cand}$. In this case, we never transport mass to data points in $\mathcal{D}_{cand} \setminus S$, meaning we use the following mass that depends on $S$:
$b_j(S) = \begin{cases} 1/|\mathcal{D}_{dev}| & \text{if } y_j \in \mathcal{D}_{dev} \text{ or } y_j \in S, \\ 0 & \text{otherwise.} \end{cases}$ (3)
As a result, the problem is an MILP with an objective function defined w.r.t. $\mathbf{P}$ and the selection variables, with $\mathbf{a} = (1/|\mathcal{D}_{app}|)\mathbf{1}$ and $\mathbf{b} = \mathbf{b}(S)$.
One may wonder why we use the partial Wasserstein divergence in Eq. (2) instead of the standard Wasserstein distance. This is because the asymmetry in the constraints of the partial Wasserstein divergence enables us to extract only the regions where the development dataset has lower density than the application dataset, whereas the standard Wasserstein distance also becomes large for regions with sufficient density in the development dataset. Furthermore, the guaranteed approximation algorithms that we describe later can be utilized only for the partial Wasserstein divergence version.
3.2 Submodularity of the PWC problem
In this section, we prove the following theorem to guarantee the approximation ratio of the proposed algorithms.
Theorem 1.
Given the datasets $\mathcal{D}_{app}$ and $\mathcal{D}_{dev}$, and a subset $S$ of a candidate dataset $\mathcal{D}_{cand}$, the set function $F(S) = -\mathrm{PW}(\mu_{\mathcal{D}_{app}}, \mu_{\mathcal{D}_{dev} \cup S}) + \mathrm{PW}(\mu_{\mathcal{D}_{app}}, \mu_{\mathcal{D}_{dev}})$ is a monotone submodular function.
To prove Theorem 1, we reduce our problem to the partial maximum weight matching problem [12], which is known to be submodular. First, we present the following lemmas.
Lemma 1.
Let $\mathbb{Q}$ be the set of all rational numbers, and let $n$ and $m$ be positive integers. Consider $\mathbf{A} \in \mathbb{Q}^{m \times n}$ and $\mathbf{b} \in \mathbb{Q}^m$. Then, the extreme points of the convex polytope defined by the linear inequalities $\mathbf{A}\mathbf{x} \leq \mathbf{b}$ are also rational, meaning they lie in $\mathbb{Q}^n$.
Proof of Lemma 1. It is well known [10] that any extreme point of the polytope is obtained by (i) choosing $n$ linearly independent inequalities from $\mathbf{A}\mathbf{x} \leq \mathbf{b}$ and (ii) solving the corresponding $n$-variable linear system $\tilde{\mathbf{A}}\mathbf{x} = \tilde{\mathbf{b}}$, where $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{b}}$ are the corresponding sub-matrix/vector (obtained by replacing the inequalities with equalities). As all elements of $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{b}}$ are rational, the solution is rational (this immediately follows from the fact that $\tilde{\mathbf{A}}^{-1} = \mathrm{adj}(\tilde{\mathbf{A}})/\det(\tilde{\mathbf{A}})$, where $\mathrm{adj}(\tilde{\mathbf{A}})$ is the adjugate matrix of $\tilde{\mathbf{A}}$ and $\det(\tilde{\mathbf{A}})$ is its determinant; since the entries of $\tilde{\mathbf{A}}$ are rational, both are rational). We conclude that any extreme point is rational. ∎
Lemma 2.
Let $n$, $m$, and $N$ be positive integers. Given a positive-valued $n$-by-$m$ matrix $\mathbf{C}$, the following set function is a submodular function:
(4)
Here, the feasible set is defined by replacing the corresponding constraint in Eq. (1) accordingly.
Proof of Lemma 2. For any fixed subset, there is an optimal solution whose elements are all rational, because all the coefficients in the constraints are rational (Lemma 1). Thus, there is a natural number $N$ such that the optimum is attained within the set of transport plans whose masses are restricted to rational numbers with denominator $N$.
(5)
The restricted problem can be written as a network flow problem on a bipartite network, following Sect. 3.4 in [9]. Since the scaled transport masses are integers, we can reformulate the problem as a maximum-weight bipartite matching (i.e., an assignment problem) on a bipartite graph whose vertices are generated by copying the original vertices an appropriate number of times, with modified edge weights. For any transport plan, the transport mass is encoded by the copied vertices, and the weights of the edges among them are set to represent the transport costs. The partial maximum-weight matching function for a graph with positive edge weights is a submodular function [12]. Therefore, the set function above is a submodular function. ∎
Lemma 3.
If , there exists an optimal solution in the maximization problem of satisfying (i.e., ).
Proof of Lemma 3. Given an optimal solution , suppose there exists satisfying . This implies that . We denote the gap as . Based on the assumption of , we have . Hence, there must exist such that . We denote the gap as . We also define . Then, a matrix is still a feasible solution. However, leads to , which contradicts being the optimal solution. Thus, . ∎
Proof of Theorem 1. Given , where , the following is satisfied:
(6)
where , , , and . Here, and Lemma 3 yield . Since is a contraction of the submodular function (Lemma 2), is a submodular function.
We now prove the monotonicity of . For , we have because . This implies that . ∎
Finally, we prove Proposition 1 to clarify the computational intractability of the PWC problem.
Proposition 1.
The PWC problem with the marginal masses defined above is NP-hard.
Proof of Proposition 1. The decision version of the PWC problem asks whether there exists a subset $S \subseteq \mathcal{D}_{cand}$ with $|S| \leq K$ whose objective value in Eq. (2) is at least a given threshold. We discuss a reduction to PWC from the well-known NP-complete set cover problem [13]. Given a family $\mathcal{F}$ of subsets of a finite set $U$ and an integer $k$, the set cover problem asks whether there exists a sub-family $\mathcal{F}' \subseteq \mathcal{F}$ such that $|\mathcal{F}'| \leq k$ and the union of $\mathcal{F}'$ equals $U$. Our reduction is defined as follows: each element of $U$ corresponds to an application data point and each subset in $\mathcal{F}$ corresponds to a candidate data point, with transport costs chosen so that the given set cover instance is a Yes-instance iff the reduced PWC instance is a Yes-instance. This polynomial reduction means that the decision version of PWC is NP-complete, and hence Proposition 1 holds. ∎
3.3 Algorithms
MILP and greedy algorithms
The PWC problem in Eq. (2) can be solved directly as an MILP problem. We refer to this method as PW-MILP. However, this is extremely inefficient when $|\mathcal{D}_{app}|$, $|\mathcal{D}_{dev}|$, and $|\mathcal{D}_{cand}|$ are large.
As a consequence of Theorem 1, a simple greedy algorithm provides a 0.63 approximation because the problem is a type of monotone submodular maximization with a budget constraint [11]. Specifically, this greedy algorithm selects data from the candidate dataset sequentially, in each iteration maximizing the objective of Eq. (2). We consider two such methods, called PW-greedy-LP and PW-greedy-ent. PW-greedy-LP solves the LP problem using the simplex method. PW-greedy-ent computes the optimal transport plan using Sinkhorn iterations [8].
Even for the greedy algorithms mentioned above, the computational cost is still high because we need to calculate the partial Wasserstein divergences for all candidates in $\mathcal{D}_{cand} \setminus S$ in each iteration, yielding a time complexity of $O(K\,|\mathcal{D}_{cand}|\,T)$. Here, $T$ is the complexity of computing the partial Wasserstein divergence with the simplex method for the LP or with the Sinkhorn iterations.
To reduce the computational cost, we propose heuristic approaches called quasi-greedy algorithms. A high-level description of the quasi-greedy algorithms is provided below. In each step of greedy selection, we use a type of sensitivity value that allows us to estimate how much the partial Wasserstein divergence varies when adding candidate data, instead of performing the actual computation of the divergence. Specifically, if we add a data point $y_k$ to $S$, denoted as $S \cup \{y_k\}$, we have $\mathbf{b}(S \cup \{y_k\}) = \mathbf{b}(S) + \frac{1}{|\mathcal{D}_{dev}|}\mathbf{e}_k$, where $\mathbf{e}_k$ is a one-hot vector with the corresponding $k$th element activated. Hence, we can estimate the change in the divergence for a data addition by computing the sensitivity with respect to $\mathbf{b}$. If efficient sensitivity computation is possible, we can speed up the greedy algorithms. We describe the concrete methods for each of the approaches, the simplex method and the Sinkhorn iterations, in the following paragraphs.
Quasi-greedy algorithm for the simplex method
When we use the simplex method for the partial Wasserstein divergence computation, we employ a sensitivity analysis for LP. The dual problem of the partial Wasserstein divergence computation is defined as follows:
$\max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} \; \mathbf{a}^\top \boldsymbol{\alpha} + \mathbf{b}(S)^\top \boldsymbol{\beta} \quad \text{s.t.} \;\; \alpha_i + \beta_j \leq C_{ij} \;\; (\forall i, j), \quad \beta_j \leq 0 \;\; (\forall j),$ (7)
where $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are dual variables. Using sensitivity analysis, the change in the partial Wasserstein divergence can be estimated as $\boldsymbol{\beta}^{*\top}\Delta\mathbf{b}$, where $(\boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)$ is an optimal dual solution of Eq. (7), and $\Delta\mathbf{b}$ is the change in $\mathbf{b}$. It is worth noting that $\mathbf{b}$ now corresponds to $\mathbf{b}(S)$ in Eq. (3), and a smaller value of Eq. (7) results in a larger value of Eq. (2). Thus, we propose a heuristic algorithm that iteratively selects $k^{(t)} = \operatorname{argmin}_{k \in \mathcal{D}_{cand} \setminus S} \beta^{*(t)}_k$ at iteration $t$, where $\boldsymbol{\beta}^{*(t)}$ is the optimal dual solution at iteration $t$. We emphasize that this coincides with the greedy selection as long as the mass change $1/|\mathcal{D}_{dev}|$ does not exceed the upper bound obtained from the sensitivity analysis. The computational complexity of the heuristic algorithm is smaller than that of the naive greedy algorithm by a factor of $|\mathcal{D}_{cand}|$, because we obtain $\boldsymbol{\beta}^*$ for all candidates by solving the dual simplex once per iteration. It should be noted that we add a small value to $\mathbf{b}$ when an element is zero to avoid undefined values in the sensitivity computation. We refer to this algorithm as PW-sensitivity-LP and summarize the overall algorithm in Algorithm 1.
For computational efficiency, we use the solution matrix from the previous step as the initial value for the simplex algorithm in the current step. Any previously feasible solution remains feasible because $\mathbf{b}$ only increases elementwise when data are added. The previous solutions of the dual variables $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are also utilized for the initialization of the dual simplex and the Sinkhorn iterations.
Faster heuristic using C-transform
The heuristics above still require solving a large LP in each step, where the number of dual constraints is $O(|\mathcal{D}_{app}|\,(|\mathcal{D}_{dev}| + |\mathcal{D}_{cand}|))$. Here, we aim to reduce the size of the LP to $O(|\mathcal{D}_{app}|\,(|\mathcal{D}_{dev}| + |S|))$ constraints using the C-transform (Sect. 3.1 in [9]).
To derive the algorithm, we consider the dual problem in Eq. (7). Considering the fact that $b_j(S) = 0$ for $y_j \in \mathcal{D}_{cand} \setminus S$, we first solve the smaller LP that ignores the columns with $b_j(S) = 0$, whose system size is $O(|\mathcal{D}_{app}|\,(|\mathcal{D}_{dev}| + |S|))$. Let the optimal dual solution of this smaller LP be $(\boldsymbol{\alpha}^*, \boldsymbol{\beta}^*)$. For each candidate $y_k$, we consider an LP in which $y_k$ is added to $S$. Instead of solving each LP as in PW-greedy-LP, we approximate the optimal solution using a technique called the C-transform. More specifically, for each $y_k$ such that $b_k(S) = 0$, we only optimize $\beta_k$ and fix the other variables to the dual optimum above. This is done by setting $\beta_k = \min_i (C_{ik} - \alpha^*_i)$, which is the largest value of $\beta_k$ satisfying the dual constraints. This gives the estimated change in the objective function when $y_k$ is added to $S$. Finally, we select the instance $k = \operatorname{argmin}_k \beta_k$. As shown later, this heuristic works efficiently in terms of computational time. We refer to this algorithm as PW-sensitivity-Ctrans.
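The following runnable sketch (our own simplification, not the authors' code) combines the quasi-greedy loop with the C-transform estimate: each iteration solves the small dual restricted to positive-mass columns and scores every remaining candidate by $\min_i (C_{ik} - \alpha^*_i)$. The helper names and the toy data are ours.

```python
# C-transform-based quasi-greedy selection for the PWC problem.
import numpy as np
from scipy.optimize import linprog

def dual_of_small_lp(C_pos, a, b_pos):
    """Dual of Eq. (7) restricted to positive-mass columns:
       max a^T alpha + b_pos^T beta  s.t.  alpha_i + beta_j <= C_ij,  beta <= 0."""
    n, m = C_pos.shape
    A_ub = np.zeros((n * m, n + m))
    rows = np.arange(n * m)
    A_ub[rows, rows // m] = 1.0          # coefficient of alpha_i
    A_ub[rows, n + rows % m] = 1.0       # coefficient of beta_j
    bounds = [(None, None)] * n + [(None, 0)] * m
    res = linprog(-np.concatenate([a, b_pos]), A_ub=A_ub, b_ub=C_pos.ravel(),
                  bounds=bounds, method="highs")
    return res.x[:n]                      # alpha*

def quasi_greedy_ctransform(C, a, n_dev, candidates, budget):
    """C has one column per point of D_dev ∪ D_cand; the first n_dev columns are D_dev."""
    selected = []
    for _ in range(budget):
        support = list(range(n_dev)) + selected
        alpha = dual_of_small_lp(C[:, support], a,
                                 np.full(len(support), 1.0 / n_dev))
        remaining = [k for k in candidates if k not in selected]
        # C-transform estimate of the dual value of each zero-mass candidate column.
        scores = {k: np.min(C[:, k] - alpha) for k in remaining}
        selected.append(min(scores, key=scores.get))
    return selected

rng = np.random.default_rng(0)
C = rng.random((6, 10))                  # 6 application points, 10 dev∪cand points
print(quasi_greedy_ctransform(C, np.full(6, 1 / 6), n_dev=4,
                              candidates=list(range(4, 10)), budget=2))
```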
Quasi-greedy algorithm for Sinkhorn iteration
When we use the generalized Sinkhorn iterations, we can easily obtain the derivative $\partial \mathrm{PW}_\varepsilon / \partial \mathbf{b}$ of the entropy-regularized partial Wasserstein divergence $\mathrm{PW}_\varepsilon$ using automatic differentiation techniques such as those provided by PyTorch [14]. Based on this derivative, we can derive a quasi-greedy algorithm for the Sinkhorn iterations as $k = \operatorname{argmin}_k \partial \mathrm{PW}_\varepsilon / \partial b_k$ (i.e., we replace Lines 6 and 7 of Algorithm 1 with this formula). Because the derivative can be obtained with the same computational complexity as the Sinkhorn iterations, the overall complexity is of the same order as running the Sinkhorn iterations once per greedy step. We refer to this algorithm as PW-sensitivity-ent.
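The sketch below (ours) illustrates this idea in PyTorch with simplified scaling updates in place of the exact generalized Sinkhorn iterations of [8]; it is meant only to show how the derivative with respect to $\mathbf{b}$ is obtained by automatic differentiation and used as the selection score. The function name and the toy sizes are our assumptions, and the candidate columns receive a tiny mass so that the derivative is defined, following the remark above.

```python
import torch

def entropic_relaxed_ot(C, a, b, eps=0.05, n_iter=300):
    """Entropy-regularized, one-side relaxed OT cost <C, P> via simple scaling updates."""
    K = torch.exp(-C / eps)
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                                        # rows: P 1 = a
        v = torch.minimum(b / (K.T @ u), torch.ones_like(b))   # cols: P^T 1 <= b
    P = u[:, None] * K * v[None, :]
    return (P * C).sum()

torch.manual_seed(0)
n_app, n_dev, n_cand = 6, 4, 6
C = torch.rand(n_app, n_dev + n_cand)
a = torch.full((n_app,), 1.0 / n_app)
# Candidate columns get a tiny mass so that the derivative is well defined.
b = torch.cat([torch.full((n_dev,), 1.0 / n_dev), torch.full((n_cand,), 1e-6)])
b.requires_grad_(True)

loss = entropic_relaxed_ot(C, a, b)
loss.backward()
scores = b.grad[n_dev:]              # sensitivities of the candidate columns
print(int(torch.argmin(scores)))     # quasi-greedy pick: most negative derivative
```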
4 Related work
Instance selection and data summarization
Tasks for selecting an important portion of a large dataset are common in machine learning. Instance selection involves extracting a subset of a training dataset while maintaining the target task performance. According to [15], two types of instance selection methods have been studied: model-dependent methods [16, 17, 18, 19, 20] and label-dependent methods [21, 22, 23, 24, 25, 26]. Unlike instance selection methods, PWC depends on neither models nor ground-truth labels. Data summarization involves finding representative elements in a large dataset [27]. Here, submodularity plays an important role [28, 29, 30, 31]. Unlike data summarization, PWC focuses on filling in gaps between datasets.
Anomaly detection
Our covering problem can be considered as the task of selecting a subset of a dataset that is not included in the development dataset. In this regard, our method is related to anomaly detectors, e.g., LOF [4], one-class SVM [32], and deep-learning-based methods [33, 34]. However, our goal is to extract patterns that are less included in the development dataset but have a certain volume in the application dataset, rather than to select rare data that would be considered observation noise, or infrequent patterns that are not as important as frequent patterns.
Active learning
The problem of selecting data from an unlabeled dataset is also considered in active learning. A typical approach uses the output of the target model, e.g., the entropy-based method [35, 36], whereas some methods are independent of the model output, e.g., the k-center-based approach [5] and the Wasserstein distance-based approach [37]. In contrast, our goal is to find unconsidered patterns during development, even if they do not contribute to improving the prediction performance by directly adding them to the training data; developers may use these unconsidered data to test the generalization performance, design subroutines to process the patterns, redevelop a new model, or alert users not to use the products in certain special situations. Furthermore, we emphasize that the submodularity and guaranteed approximation algorithms can be used only with the partial Wasserstein divergence, not with the vanilla Wasserstein distance.
Facility location problem (FLP) and -median
The FLP and the related k-median problem [38] are similar to our problem in the sense that we select data to minimize an objective function. FLP is a mathematical model for finding desirable locations for facilities (e.g., stations, schools, and hospitals). While FLP models facility opening costs, our data covering problem does not consider data selection costs, making it more similar to k-median problems. However, unlike the k-median, our covering problem allows relaxed transportation between distributions, which is known as Kantorovich relaxation in the optimal transport field [9]. Hence, our approach using the partial Wasserstein divergence enables us to model more flexible assignments among distributions beyond naïve assignments among data.
5 Experiments
In our experiments, we considered a scenario in which we wished to select some data from the application dataset for the PWC problem (i.e., $\mathcal{D}_{cand} = \mathcal{D}_{app}$). For PW-*-LP and PW-sensitivity-Ctrans, we used IBM ILOG CPLEX 20.01 [39] for the dual simplex algorithm, where sensitivity analysis for LP is also available. For PW-*-ent, we computed the entropy-regularized version of the partial Wasserstein divergence using PyTorch [14] on a GPU, and computed the sensitivity using automatic differentiation. We used a small coefficient for the entropy regularization and terminated the Sinkhorn iterations when the divergence changed only negligibly compared with the previous step, or when the number of iterations reached 50,000. All experiments were conducted with an Intel® Xeon® Gold 6142 CPU and an NVIDIA® TITAN RTX™ GPU.
5.1 Comparison of algorithms
First, we compare our algorithms, namely the greedy-based PW-greedy-*, the sensitivity-based PW-sensitivity-*, and random selection as a baseline. For the evaluation, we generated two-dimensional random values following normal distributions for $\mathcal{D}_{dev}$ and $\mathcal{D}_{app}$, respectively. Figure 2 (left) presents the partial Wasserstein divergences between $\mathcal{D}_{app}$ and $\mathcal{D}_{dev} \cup S$ while varying the number of selected data, and Fig. 2 (right) presents the computational times while varying the dataset sizes. As shown in Fig. 2 (left), PW-greedy-LP and PW-sensitivity-LP select exactly the same data as the global optimal solution obtained by PW-MILP. As shown in Fig. 2 (right), considering the logarithmic scale, PW-sensitivity-* significantly reduces the computational time compared with PW-MILP, while the naïve greedy algorithms do not scale. In particular, PW-sensitivity-ent can be computed quickly as long as the GPU memory is sufficient, whereas its CPU version (PW-sensitivity-ent (CPU)) is significantly slower. PW-sensitivity-Ctrans is the fastest among the methods without GPUs; it is 8.67 and 3.07 times faster than PW-MILP and PW-sensitivity-LP, respectively. For a quantitative evaluation, we define the empirical approximation ratio as the ratio between the objective value attained by each method and the optimal value obtained by PW-MILP. Figure 3 shows the empirical approximation ratios for 50 different random seeds. Both PW-*-LP variants achieve approximation ratios close to one, whereas PW-*-ent and PW-sensitivity-Ctrans have slightly lower approximation ratios (an important feature of PW-*-ent is that it can be executed without any commercial solver).


5.2 Finding a missing category
For a quantitative evaluation of missing pattern finding, we herein consider a scenario in which a category (i.e., a label) is less included in the development dataset than in the application dataset (note that in real scenarios, missing patterns do not always correspond to labels). We employ subsets of the MNIST dataset [40], where the development dataset contains the target label at a rate of only 0.5%, whereas all labels are included in the application data in equal ratios (i.e., 10%). We randomly sampled 500 images from the validation split for each of $\mathcal{D}_{dev}$ and $\mathcal{D}_{app}$. We employ the squared distance in pixel space as the ground metric for our method. As baseline methods, we employed LOF [4] novelty detection provided by scikit-learn [41] with default parameters; for the LOF, we selected the top-$K$ data according to the anomaly scores. We also employed active learning approaches based on the entropies of the predictions [35, 36] and coreset with the k-center and k-center greedy [5, 42]. For the entropy-based method, we train a LeNet-like neural network using the training split, and then select data with high prediction entropies from the application dataset. The LOF and coreset algorithms are run in pixel space.
We conducted 10 trials using different random seeds for the proposed algorithms and baselines, except for the PW-greedy-* algorithms, because they took over 24 h. Figure 4 presents a histogram of the selected labels, where the horizontal axis indicates whether a selected data point has the under-represented label and the vertical axis shows the relative frequencies of the selected data. The proposed methods (i.e., PW-*) extract more data corresponding to the under-represented label (0.71 for PW-MILP) than the baselines. We can conclude that the proposed methods successfully selected data from the candidate dataset to fill in the missing areas of the distribution.

5.3 Missing scene extraction in driving datasets
Finally, we demonstrate PWC in a realistic scenario using driving scene images. We adopted two datasets, BDD100K [43] and KITTI (Object Detection Evaluation 2012) [44], as the application and development datasets, respectively. The major difference between these two datasets is that KITTI contains only daytime images, whereas BDD100K contains both daytime and nighttime images. To reduce the computational time, we randomly selected 1,500 data points for the development dataset from the test split of KITTI and 3,000 data points for the application dataset from the test split of BDD100K. To compute the transport costs between images, we calculated the squared norms between the feature vectors extracted by a pretrained ResNet with 50 layers [45] obtained from Torchvision [46]. Before inputting the images into the ResNet, each image was resized to a height of 224 pixels and then center-cropped to a width of 224 pixels, followed by normalization. As baseline methods, we adopted LOF [4] and coreset with the k-center greedy [5]. Figure 5 presents the top-3 images obtained by each method. PWC (PW-sensitivity-LP) selects the major pattern (i.e., nighttime scenes) that is not included in the development data, whereas LOF and coreset mainly extract specific rare scenes (e.g., a truck crossing a street or an out-of-focus scene). The coreset selects a single nighttime image; however, we emphasize that providing multiple images is essential for developers to understand which patterns in the images (e.g., nighttime, type of cars, or roadside) are less included in the development dataset.
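For reference, the following is a sketch (ours) of such a feature-extraction step with Torchvision, assuming a recent Torchvision release and the standard ImageNet preprocessing; the exact resizing convention used in the original experiments may differ slightly.

```python
# Extract 2048-dimensional penultimate features from a pretrained ResNet-50 by
# replacing its classification head with the identity.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

preprocess = T.Compose([
    T.Resize(224),            # resize so the shorter side is 224 pixels
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()   # drop the classifier; keep penultimate features
model.eval()

@torch.no_grad()
def extract_feature(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return model(x).squeeze(0)   # feature vector used for squared-distance costs
```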
The above results indicate that the PWC problem enables us to accelerate updating ML-based systems when the distributions of application and development data differ, because our method focuses not on isolated anomalies but on major missing patterns in the application data, as shown in Fig. 1 and Fig. 5. An ML workflow using our method can efficiently find such patterns and allow developers to incrementally evaluate and update ML-based systems (e.g., test the generalization performance, redevelop a new model, or design a special subroutine for the pattern). Identifying patterns that are not taken into account during development and addressing them individually can improve the reliability and generalization ability of ML-based systems, such as image recognition in autonomous vehicles.



6 Conclusion
In this paper, we proposed PWC, which fills in the gaps between development datasets and application datasets based on the partial Wasserstein divergence. We also proved the submodularity of PWC, leading to a greedy algorithm with a guaranteed approximation ratio. In addition, we proposed quasi-greedy algorithms based on sensitivity analysis and derivatives of Sinkhorn iterations. Experimentally, we demonstrated that the proposed method covers the major areas of development datasets whose densities are lower than those of the corresponding areas in application datasets. The main limitation of PWC is scalability; the space complexity of the dual simplex or the Sinkhorn iterations is at least proportional to the size of the cost matrix, which might require approximations, e.g., stochastic optimization [47], slicing [3], or neural networks [48]. As a potential risk, if infrequent patterns in application datasets are as important as frequent patterns, our proposed method may ignore some important cases, and it may be desirable to combine our covering with methods focusing on fairness.
References
- [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223. PMLR, 06–11 Aug 2017.
- [2] David Alvarez-Melis and Nicolo Fusi. Geometric dataset distances via optimal transport. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33, volume 33, pages 21428–21439. Curran Associates, Inc., 2020.
- [3] Nicolas Bonneel and David Coeurjolly. Spot: sliced partial optimal transport. ACM Transactions on Graphics (TOG), 38(4):1–13, 2019.
- [4] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. Lof: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104, 2000.
- [5] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
- [6] Alessio Figalli. The optimal partial transport problem. Archive for rational mechanics and analysis, 195(2):533–560, 2010.
- [7] Laetitia Chapel, Mokhtar Alaya, and Gilles Gasso. Partial optimal transport with applications on positive-unlabeled learning. In Advances in Neural Information Processing Systems 33, pages 2903–2913, 2020.
- [8] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
- [9] Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
- [10] Robert J Vanderbei. Linear programming, volume 3. Springer, 2015.
- [11] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming, 14(1):265–294, 1978.
- [12] Amotz Bar-Noy and George Rabanca. Tight approximation bounds for the seminar assignment problem. In International Workshop on Approximation and Online Algorithms, pages 170–182. Springer, 2016.
- [13] Richard M Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85–103. Springer, 1972.
- [14] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.
- [15] J Arturo Olvera-López, J Ariel Carrasco-Ochoa, J Francisco Martínez-Trinidad, and Josef Kittler. A review of instance selection methods. Artificial Intelligence Review, 34(2):133–143, 2010.
- [16] Peter Hart. The condensed nearest neighbor rule (corresp.). IEEE transactions on information theory, 14(3):515–516, 1968.
- [17] GL Ritter. An algorithm for a selective nearest neighbor decision rule. IEEE Trans. Information Theory, 21(11):665–669, 1975.
- [18] Chien-Hsing Chou, Bo-Han Kuo, and Fu Chang. The generalized condensed nearest neighbor rule as a data reduction method. In Proceedings of the 18th International Conference on Pattern Recognition, volume 2, pages 556–559. IEEE, 2006.
- [19] Dennis L Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, (3):408–421, 1972.
- [20] Fernando Vázquez, J Salvador Sánchez, and Filiberto Pla. A stochastic approach to wilson’s editing algorithm. In Iberian conference on pattern recognition and image analysis, pages 35–42. Springer, 2005.
- [21] D Randall Wilson and Tony R Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3):257–286, 2000.
- [22] Henry Brighton and Chris Mellish. Advances in instance selection for instance-based learning algorithms. Data Mining and Knowledge Discovery, 6(2):153–172, 2002.
- [23] José C Riquelme, Jesús S Aguilar-Ruiz, and Miguel Toro. Finding representative patterns with ordered projections. Pattern Recognition, 36(4):1009–1018, 2003.
- [24] James C Bezdek and Ludmila I Kuncheva. Nearest prototype classifier designs: An experimental study. International Journal of Intelligent Systems, 16(12):1445–1473, 2001.
- [25] Huan Liu and Hiroshi Motoda. On issues of instance selection. Data Mining and Knowledge Discovery, 6(2):115–130, 2002.
- [26] Barbara Spillmann, Michel Neuhaus, Horst Bunke, Elżbieta Pekalska, and Robert PW Duin. Transforming strings to vector spaces using prototype selection. In Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR), pages 287–296. Springer, 2006.
- [27] Mohiuddin Ahmed. Data summarization: a survey. Knowledge and Information Systems, 58(2):249–273, 2019.
- [28] Marko Mitrovic, Ehsan Kazemi, Morteza Zadimoghaddam, and Amin Karbasi. Data summarization at scale: A two-stage submodular approach. In Proceedings of the 35th International Conference on Machine Learning, pages 3596–3605. PMLR, 2018.
- [29] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 510–520, 2011.
- [30] Baharan Mirzasoleiman, Morteza Zadimoghaddam, and Amin Karbasi. Fast distributed submodular cover: Public-private data summarization. In Advances in Neural Information Processing Systems 29, pages 3594–3602, 2016.
- [31] Sebastian Tschiatschek, Rishabh K Iyer, Haochen Wei, and Jeff A Bilmes. Learning mixtures of submodular functions for image collection summarization. In Advances in Neural Information Processing Systems 27, pages 1413–1421, 2014.
- [32] Bernhard Schölkopf, Robert C Williamson, Alexander J Smola, John Shawe-Taylor, John C Platt, et al. Support vector method for novelty detection. In Advances in Neural Information Processing Systems 12, volume 12, pages 582–588. Citeseer, 1999.
- [33] Ki Hyun Kim, Sangwoo Shim, Yongsub Lim, Jongseob Jeon, Jeongwoo Choi, Byungchan Kim, and Andre S Yoon. Rapp: Novelty detection with reconstruction along projection pathway. In International Conference on Learning Representations, 2019.
- [34] Jinwon An and Sungzoon Cho. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2(1):1–18, 2015.
- [35] Alex Holub, Pietro Perona, and Michael C Burl. Entropy-based active learning for object recognition. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8. IEEE, 2008.
- [36] Luiz FS Coletta, Moacir Ponti, Eduardo R Hruschka, Ayan Acharya, and Joydeep Ghosh. Combining clustering and active learning for the detection and learning of new image classes. Neurocomputing, 358:150–165, 2019.
- [37] Changjian Shui, Fan Zhou, Christian Gagné, and Boyu Wang. Deep active learning: Unified and principled method for query and training. In International Conference on Artificial Intelligence and Statistics, pages 1308–1318. PMLR, 2020.
- [38] Charles S ReVelle and Ralph W Swain. Central facilities location. Geographical analysis, 2(1):30–42, 1970.
- [39] IBM ILOG CPLEX Optimization Studio. Cplex user’s manual. Vers, 12:207–258, 2013.
- [40] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
- [41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- [42] Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
- [43] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.
- [44] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research, 2013.
- [45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [46] Sébastien Marcel and Yann Rodriguez. Torchvision the machine-vision package of torch. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1485–1488, 2010.
- [47] Genevay Aude, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for large-scale optimal transport. arXiv preprint arXiv:1605.08527, 2016.
- [48] Yujia Xie, Minshuo Chen, Haoming Jiang, Tuo Zhao, and Hongyuan Zha. On scalable and efficient computation of large scale optimal transport. In International Conference on Machine Learning, pages 6882–6892. PMLR, 2019.
- [49] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300, 2013.
Appendix A Reformulation of Eq.(2) as a MILP
The MILP formulation of Eq. (2) with marginal masses $\mathbf{a}$ and $\mathbf{b}(S)$ can be defined based on the partial Wasserstein divergence in Eq. (1), i.e., the marginal mass depends on which subset $S$ is selected. Precisely, the marginal mass of the $\mathcal{D}_{app}$ side (i.e., source) is $\mathbf{a} = (1/|\mathcal{D}_{app}|)\mathbf{1}$. The mass of the $\mathcal{D}_{dev} \cup \mathcal{D}_{cand}$ side (i.e., target) is given by Eq. (3), meaning that $b_j(S)$ represents the mass that can be transported into $y_j$. This parameterization enables us to define Eq. (2) as an MILP.
Let $z_k$ be a 0-1 decision variable. The value $z_k = 1$ means that the data point $y_k \in \mathcal{D}_{cand}$ is included in $S$ (i.e., $y_k \in S$), and $z_k = 0$ means $y_k \notin S$. Continuous decision variables $P_{ij}$ represent the transportation masses between $x_i \in \mathcal{D}_{app}$ and $y_j \in \mathcal{D}_{dev} \cup \mathcal{D}_{cand}$ for each data pair. The minimization problem between the two sets with marginal masses $\mathbf{a}$ and $\mathbf{b}(S)$ is written as the following MILP problem:
$\min_{\mathbf{P}, \mathbf{z}} \; \sum_{i,j} C_{ij} P_{ij}$ (8)
subject to
$\sum_{j} P_{ij} = 1/|\mathcal{D}_{app}| \quad (\forall x_i \in \mathcal{D}_{app}),$ (9)
$\sum_{i} P_{ij} \leq 1/|\mathcal{D}_{dev}| \quad (\forall y_j \in \mathcal{D}_{dev}),$ (10)
$\sum_{i} P_{ik} \leq z_k/|\mathcal{D}_{dev}| \quad (\forall y_k \in \mathcal{D}_{cand}),$ (11)
$\sum_{k} z_k \leq K,$ (12)
$z_k \in \{0, 1\} \quad (\forall y_k \in \mathcal{D}_{cand}),$ (13)
$P_{ij} \geq 0 \quad (\forall i, j).$ (14)
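The following compact sketch (ours) instantiates this MILP with PuLP and its bundled CBC solver (the original experiments use CPLEX); the problem sizes and the random cost matrix are placeholders.

```python
# MILP formulation of the PWC problem: binary z selects candidates, continuous P
# carries transport mass; candidate columns can receive mass only if selected.
import numpy as np
import pulp

rng = np.random.default_rng(0)
n_app, n_dev, n_cand, K = 6, 4, 5, 2
C = rng.random((n_app, n_dev + n_cand))        # transport costs
cand = range(n_dev, n_dev + n_cand)

prob = pulp.LpProblem("partial_wasserstein_covering", pulp.LpMinimize)
P = {(i, j): pulp.LpVariable(f"P_{i}_{j}", lowBound=0)
     for i in range(n_app) for j in range(n_dev + n_cand)}
z = {k: pulp.LpVariable(f"z_{k}", cat="Binary") for k in cand}

prob += pulp.lpSum(float(C[i, j]) * P[i, j] for (i, j) in P)      # objective
for i in range(n_app):                                            # rows fully transported
    prob += pulp.lpSum(P[i, j] for j in range(n_dev + n_cand)) == 1.0 / n_app
for j in range(n_dev):                                            # dev columns
    prob += pulp.lpSum(P[i, j] for i in range(n_app)) <= 1.0 / n_dev
for k in cand:                                                    # candidate columns
    prob += pulp.lpSum(P[i, k] for i in range(n_app)) <= (1.0 / n_dev) * z[k]
prob += pulp.lpSum(z[k] for k in cand) <= K                       # budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([k for k in cand if z[k].value() > 0.5])                    # selected candidates
```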
Appendix B Pseudocodes of proposed algorithms
This section explains the methods evaluated in the main text: the greedy-based algorithms PW-greedy-*, the sensitivity-based algorithms PW-sensitivity-*, and the random-selection baseline.
B.1 Greedy-based algorithms
We explain the proposed greedy-based algorithm PW-greedy-LP, which is based on the LP formulation of the partial Wasserstein divergence. The pseudocode is given below. In each insertion iteration, up to the budget $K$, we compute the partial Wasserstein divergence obtained when each not-yet-selected candidate is tentatively added. We then select the new data point whose addition maximizes the selection objective (i.e., minimizes the divergence between the two datasets).
Note that PW-greedy-ent is obtained by simply replacing the LP-based computation of the partial Wasserstein divergence with (partial) Sinkhorn iterations that compute the entropy-regularized partial Wasserstein divergence $\mathrm{PW}_\varepsilon$.
B.2 C-transform-based sensitivity algorithm
The developed sensitivity-based algorithms (e.g., PW-sensitivity-LP) avoid the cost of evaluating all the divergences required by the greedy-based algorithms by estimating the importance of the $k$-th candidate from its sensitivity value. In the LP-based method, these values are obtained from the sensitivity-analysis routines included in the LP solver. Furthermore, gradients of the partial Wasserstein divergence can be used when the entropy-regularized version is employed. Note that these methods are explained in the main text (Algorithm 2).
Here, we give further details for the case in which the C-transform is adopted to accelerate the sensitivity-based algorithm. The pseudocode is given below. In each iteration, up to the budget $K$, we compute the dual optimal solutions of the LP formulation restricted to the current support instead of the full system size, for practical acceleration. Once the dual optimal solutions are computed by a solver, we can estimate the effect of adding a new data point using the computed solutions, because they remain feasible even after a candidate is added to the LP in the next step. The effect of adding a candidate can be estimated by the C-transform as discussed in the main text (see also Chapter 3 in [9]), and this estimation can be computed cheaply (see Line 6). Although this C-transform-based algorithm has the same worst-case computational complexity as PW-sensitivity-LP, this computational trick reduces the total computation time, as evaluated in the experiments.
B.3 Sinkhorn algorithms
The entropy-regularized Wasserstein divergence [49] has attracted much attention because it can be computed at high speed on GPUs using simple matrix multiplications, known as the Sinkhorn iterations. The entropy-regularized Wasserstein distance is formally defined as
$W_\varepsilon(\mu, \nu) = \min_{\mathbf{P}} \; \langle \mathbf{C}, \mathbf{P} \rangle - \varepsilon H(\mathbf{P})$ (15)
$\text{s.t.} \quad \mathbf{P}\mathbf{1}_m = \mathbf{a}, \quad \mathbf{P}^\top \mathbf{1}_n = \mathbf{b},$ (16)
$\mathbf{P} \geq \mathbf{0},$ (17)
where $H(\mathbf{P}) = -\sum_{i,j} P_{ij}(\log P_{ij} - 1)$ is the entropy of the transport plan. An important property of the Sinkhorn iterations is that the result is differentiable with respect to the inputs and the constraints; for example, $\partial W_\varepsilon/\partial \mathbf{a}$ and $\partial W_\varepsilon/\partial \mathbf{b}$ can be obtained by automatic differentiation techniques. It is noteworthy that we add a small value to $\mathbf{b}$ for the computation of the derivative. We can also employ Sinkhorn-like iterations on GPUs to solve the entropy-regularized partial Wasserstein divergence [8].
The (partial) derivatives $\partial \mathrm{PW}_\varepsilon/\partial \mathbf{a}$ and $\partial \mathrm{PW}_\varepsilon/\partial \mathbf{b}$ can be computed after evaluating $\mathrm{PW}_\varepsilon$. For example, in PW-greedy-LP we can replace the LP computation with $\mathrm{PW}_\varepsilon$, leading to another proposed method, PW-greedy-ent. Furthermore, instead of selecting the candidate that maximizes the objective directly, we can select the candidate that minimizes the partial derivative of $\mathrm{PW}_\varepsilon$ with respect to its mass, leading to another proposed method, PW-sensitivity-ent.
Appendix C Details of Experiments
C.1 LeNet-like network
We show the PyTorch code of the LeNet-like network used for the entropy-based active learning approach in the MNIST experiment. The network is trained with the SGD optimizer (learning rate 0.001, momentum 0.9) to minimize the cross-entropy loss for 20 epochs.
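The original listing is not reproduced in this text; the following is a generic LeNet-style network of the kind described, with layer sizes that are our assumption rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class LeNetLike(nn.Module):
    """A small LeNet-style CNN for 28x28 grayscale MNIST images."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(), nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNetLike()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
criterion = nn.CrossEntropyLoss()
```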
C.2 Finding a completely missing category
In Sect. 5.2, we considered the situation where the frequency of a label in the development dataset is very low compared with that in the application dataset. We herein consider another situation, where a label is completely missing in the development dataset. Fig. 6 illustrates the relative frequency of the missing category in the selected data. In this situation, the partial Wasserstein covering methods (i.e., PW-*) outperform the baselines, including the active learning methods (i.e., k-center, k-center-greedy, and active-ent) and the anomaly detection method (i.e., LOF).

C.3 Finding a missing category on CIFAR-10
We perform the missing-category-finding experiment using the CIFAR-10 dataset. We herein consider the airplane label (the first label in CIFAR-10) as the missing label. As the ground metric, we employ the distance between the feature vectors extracted by a neural network. We obtain the network architecture from torchhub (https://github.com/chenyaofo/pytorch-cifar-models) and train it from scratch on a subset of the CIFAR-10 training data using the Adam optimizer (with the default PyTorch parameters) for 10 epochs. For this training subset, we remove the missing-category data so that the percentage of the missing category is the same as in the development data (i.e., 0.5%). The other settings are the same as in Sect. 5.2. Fig. 7 illustrates the results, indicating that our covering algorithms outperform the baselines, while PW-sensitivity-ent failed to find the missing category. This could be because the Sinkhorn iterations did not converge sufficiently.

C.4 Missing scene extraction in real driving datasets varying number of data
In Sect. 5.3, we showed the results with 1,500 data points for the development dataset and 3,000 data points for the application dataset. We herein show additional results while varying the number of data points, showing the top-8 images for each setting. These results show that partial Wasserstein covering successfully extracts the major difference (i.e., night scenes).












Appendix D Discussion
D.1 General partial OT and unbalanced OT
We only consider the formulation of the partial Wasserstein divergence (Eq. 1), while the other formulations of the optimal transport can be considered. We note that the general partial OT [6] for discrete distributions is actually the same as Eq.(1) when . A covering problem using the unbalanced OT (i.e., ) can be considered, while the submodularity may not be hold in this case. We also emphasize that we need to adjust the strength of the penalty term (i.e., ) to select data points within the budgets.