
Partial Wasserstein Covering

Keisuke Kawano, Satoshi Koide, Keisuke Otaki
Toyota Central R&D Labs., Inc.
{kawano, koide, otaki}@mosk.tytlabs.co.jp
Abstract

We consider a general task called partial Wasserstein covering with the goal of providing information on what patterns are not being taken into account in a dataset (e.g., a dataset used during development) compared with another dataset (e.g., a dataset obtained from actual applications). We model this task as a discrete optimization problem with the partial Wasserstein divergence as an objective function. Although this problem is NP-hard, we prove that it satisfies the submodular property, allowing us to use a greedy algorithm with a 0.63 approximation ratio. However, the greedy algorithm is still inefficient because it requires solving a linear program for each evaluation of the objective function. To overcome this inefficiency, we propose quasi-greedy algorithms that consist of a series of acceleration techniques, such as sensitivity analysis based on strong duality and the so-called C-transform in the optimal transport field. Experimentally, we demonstrate that we can efficiently fill in the gaps between the two datasets and find missing scenes in real driving scene datasets.

1 Introduction

A major challenge in real-world machine learning applications is coping with mismatches between the data distribution obtained in real-world applications and the one used for development. Regions of the real-world data distribution that are not well supported by the development data distribution (i.e., regions with low relative densities) result in potential risks such as a lack of evaluation or high generalization error, which in turn leads to low product quality. Our motivation is to provide developers with information on what patterns are not being taken into account when developing products by selecting some data from the (usually unlabeled) real-world data distribution, also referred to as the application dataset (more precisely, in this study we call a finite set of data sampled from the real-world data distribution the application dataset), to fill in the gaps in the development dataset. Note that the term “development” includes choosing models to use, designing fail-safe subroutines, and training/testing models.

Our research question is formulated as follows: to resolve the lack of data density in development datasets, how, and using which metric, can we select a limited amount of data from application datasets for developers to examine? One reasonable approach is to define a discrepancy between the data distributions and select data to minimize this discrepancy. The Wasserstein distance has attracted significant attention as a metric for data distributions [1, 2]. However, the Wasserstein distance is not capable of representing the asymmetric relationship between application datasets and development datasets, i.e., parts that are over-represented in the development dataset also increase the Wasserstein distance.

Figure 1: Concept of PWC. PWC extracts some data from an unlabeled application dataset by minimizing the partial Wasserstein divergence between the application dataset and the union of the selected data and a development dataset. PWC focuses on regions in which development data are lacking compared with the application dataset, whereas anomaly detection extracts irregular data, and active learning selects data to improve accuracy.

In this paper, we propose partial Wasserstein covering (PWC), which selects a limited amount of data from the application dataset by minimizing the partial Wasserstein divergence [3] between the application dataset and the union of the development dataset and the selected data. PWC, as illustrated in Fig. 1, selects data from areas with fewer development data than application data in the data distributions (lower-right area of the blue distribution in Fig. 1) while ignoring areas with sufficient development data (upper-middle of the orange distribution). We also highlight the data selected by an anomaly detection method, LOF [4], where irregular data (upper-right points) were selected but the major density difference was ignored. Furthermore, we show the selection obtained using an active learning method, coreset with a $k$-center [5], where the data are chosen to improve the accuracy rather than fill the gap in terms of the distribution mismatch.

Our main contributions are summarized as follows.

  • We propose PWC that extracts data that are lacking in a development dataset from an application dataset by minimizing the partial Wasserstein divergence between an application dataset and the development dataset.

  • We prove that PWC is a maximization problem involving a submodular function whose inputs are the set of selected data. This allows us to use a greedy algorithm with a guaranteed approximation ratio.

  • Additionally, we propose fast heuristics based on sensitivity analysis and the Sinkhorn derivative for an accelerated computation.

  • Experimentally, we demonstrate that compared with baselines, PWC extracts data lacking in the development data distribution from the application distribution more efficiently.

The remainder of this paper is organized as follows. Section 2 briefly introduces the partial Wasserstein divergence, linear programming (LP), and submodular functions. In Sect. 3, we detail the proposed PWC, fast algorithms for approximating the optimization problem, and some theoretical results. Section 4 presents a literature review of related work. Section 5 demonstrates PWC through numerical experiments. In Sect. 6, we present our conclusions.

2 Preliminaries

In this section, we introduce the notations used throughout this paper. We then describe partial Wasserstein divergences, sensitivity analysis for LP, and submodular functions.

Notations

Vectors and matrices are denoted in bold (e.g., $\mathbf{x}$ and $\mathbf{A}$), where $x_i$ and $A_{ij}$ denote the $i$th and $(i,j)$th elements, respectively. To clarify the elements, we use notations such as $\mathbf{x}=(f(i))_i$ and $\mathbf{A}=(g(i,j))_{ij}$, where $f$ and $g$ specify the element values depending on their subscripts. $\left<\cdot,\cdot\right>$ denotes the inner product of matrices or vectors. $\mathbbm{1}_n\in\mathbb{R}^n$ is a vector with all elements equal to one. For a natural number $n$, we define $[\![n]\!]\coloneqq\{1,\cdots,n\}$. For a finite set $V$, we denote its power set as $2^V$ and its cardinality as $|V|$. The $L_2$-norm is denoted as $\|\cdot\|$. The delta function is denoted as $\delta(\cdot)$. $\mathbb{R}_+$ is the set of positive real numbers, and $[a,b]\subset\mathbb{R}$ denotes a closed interval. $I[\cdot]$ denotes an indicator function (zero for false and one for true).

Partial Wasserstein divergence

In this paper, we consider the partial Wasserstein divergence [6, 3, 7] as an objective function. Partial Wasserstein divergence is designed to measure the discrepancy between two distributions with different total masses by considering variations in optimal transport. Throughout this paper, datasets are modeled as empirical distributions that are represented by mixtures of delta functions without necessarily having the same total mass. Suppose two empirical distributions (or datasets) $X$ and $Y$ with probability masses $\mathbf{a}\in\mathbb{R}^m_+$ and $\mathbf{b}\in\mathbb{R}^n_+$, which are denoted as $X=\sum_{i=1}^m a_i\delta(\mathbf{x}^{(i)})$ and $Y=\sum_{j=1}^n b_j\delta(\mathbf{y}^{(j)})$. Without loss of generality, the total mass of $Y$ is greater than or equal to that of $X$ (i.e., $\sum_{i=1}^m a_i=1$ and $\sum_{j=1}^n b_j\geq 1$).

Based on the definitions above, we define the partial Wasserstein divergence as follows (the partial optimal transport problems in the literature contain a wider range of problem definitions than Eq. (1), as summarized in Table 1 (b) of [3]; this paper employs the one-side relaxed Wasserstein divergence corresponding to Table 1 (c) in [3] without loss of generality):

$$\mathcal{PW}^{2}(X,Y)\coloneqq\min_{\mathbf{P}\in U(\mathbf{a},\mathbf{b})}\left<\mathbf{P},\mathbf{C}\right>,\quad\text{where } U(\mathbf{a},\mathbf{b})=\{\mathbf{P}\in[0,1]^{m\times n}\mid\mathbf{P}\mathbbm{1}_{n}=\mathbf{a},\ \mathbf{P}^{\top}\mathbbm{1}_{m}\leq\mathbf{b}\}, \quad (1)$$

where $C_{ij}\coloneqq\|\mathbf{x}^{(i)}-\mathbf{y}^{(j)}\|^2$ is the transport cost between $\mathbf{x}^{(i)}$ and $\mathbf{y}^{(j)}$, and $P_{ij}$ is the amount of mass flowing from $\mathbf{x}^{(i)}$ to $\mathbf{y}^{(j)}$ (to be optimized). Unlike the standard Wasserstein distance, the second constraint in Eq. (1) is defined not with "$=$" but with "$\leq$". This modification allows us to treat distributions with different total masses. The condition $\mathbf{P}^{\top}\mathbbm{1}_m\leq\mathbf{b}$ indicates that the mass in $Y$ does not need to be transported, whereas the condition $\mathbf{P}\mathbbm{1}_n=\mathbf{a}$ indicates that all of the mass in $X$ should be transported without excess or deficiency (just as in the original Wasserstein distance). This property is useful for the problem defined below, which treats datasets with vastly different sizes.

To compute the partial Wasserstein divergence, we must solve the minimization problem in Eq. (1). In this paper, we consider the following two methods: (i) LP using the simplex method, and (ii) generalized Sinkhorn iteration with entropy regularization (with a small regularization parameter $\varepsilon>0$) [8, 9].
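As a concrete illustration, Eq. (1) can be solved with an off-the-shelf LP solver. The following is a minimal sketch (our own illustration, not the authors' implementation; the helper name partial_wasserstein_lp and the toy data are assumptions) that flattens $\mathbf{P}$ into a vector and feeds the marginal constraints to SciPy's linprog:

import numpy as np
from scipy.optimize import linprog

def partial_wasserstein_lp(C, a, b):
    """Solve Eq. (1): min <P, C> s.t. P 1 = a, P^T 1 <= b, P >= 0."""
    m, n = C.shape
    A_eq = np.zeros((m, m * n))      # row sums of P must equal a
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0
    A_ub = np.zeros((n, m * n))      # column sums of P must not exceed b
    for j in range(n):
        A_ub[j, j::n] = 1.0
    res = linprog(C.ravel(), A_ub=A_ub, b_ub=b, A_eq=A_eq, b_eq=a,
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(m, n), res   # divergence, plan, full result

# Toy example: two small point clouds with the masses used in Sect. 3.1.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 2)), rng.normal(size=(8, 2))
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared L2 costs
a, b = np.full(5, 1 / 5), np.full(8, 1 / 5)          # sum(a) = 1 <= sum(b)
pw2, P, _ = partial_wasserstein_lp(C, a, b)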

As will be detailed later, an element of the mass $\mathbf{b}$ varies when adding data to the development datasets. A key to our algorithm is to quickly estimate the extent to which the partial Wasserstein divergence will change when an element of $\mathbf{b}$ varies. If we compute $\mathcal{PW}^2$ using LP, we can employ a sensitivity analysis, which will be described in the following paragraph. If we use generalized Sinkhorn iterations, we can use automatic differentiation techniques to obtain a partial derivative with respect to $b_j$ (see Sect. 3.3).

LP and sensitivity analysis

The sensitivity analysis of LP plays an important role in our algorithm. Given a variable $\mathbf{x}\in\mathbb{R}^m$ and parameters $\mathbf{c}\in\mathbb{R}^m$, $\mathbf{d}\in\mathbb{R}^n$, and $\mathbf{A}\in\mathbb{R}^{n\times m}$, the standard form of LP can be written as follows: $\min\ \mathbf{c}^{\top}\mathbf{x}$, s.t. $\mathbf{A}\mathbf{x}\leq\mathbf{d}$, $\mathbf{x}\geq 0$. Sensitivity analysis is a framework for estimating changes in a solution when the parameters $\mathbf{c}$, $\mathbf{A}$, and $\mathbf{d}$ of the problem vary. We consider a sensitivity analysis for the right-hand side of the constraint (i.e., $\mathbf{d}$). When $d_j$ changes to $d_j+\Delta d_j$, the optimal value changes by $y^{*}_{j}\Delta d_j$ if the change $\Delta d_j$ in $d_j$ lies within $(\underline{d}_j,\overline{d}_j)$, where $\mathbf{y}^*$ is the optimal solution of the following dual problem corresponding to the primal problem: $\max\ \mathbf{d}^{\top}\mathbf{y}$ s.t. $\mathbf{A}^{\top}\mathbf{y}\geq\mathbf{c},\ \mathbf{y}\geq 0$. We refer readers to [10] for the details of calculating the upper bound $\overline{d}_j$ and the lower bound $\underline{d}_j$.
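To make the statement concrete, the following small sketch (our own illustration; it assumes the dual values reported by SciPy's HiGHS backend in res.ineqlin.marginals, which may differ across SciPy versions) checks that the dual solution predicts the change in the optimal value when one right-hand side $d_j$ is perturbed within its allowable range:

import numpy as np
from scipy.optimize import linprog

# Tiny LP in the standard form above: min c^T x  s.t.  A x <= d,  x >= 0.
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0],
              [1.0, 3.0]])
d = np.array([4.0, 6.0])

res = linprog(c, A_ub=A, b_ub=d, bounds=(0, None), method="highs")
duals = res.ineqlin.marginals        # dual values (shadow prices) of A x <= d

delta = 0.1                          # small perturbation of d_0
res2 = linprog(c, A_ub=A, b_ub=d + np.array([delta, 0.0]),
               bounds=(0, None), method="highs")
print(duals[0] * delta, res2.fun - res.fun)   # predicted vs. actual change in the optimum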

Submodular function

Our covering problem is modeled as a discrete optimization problem involving submodular functions, which are a subclass of set functions that play an important role in discrete optimization. A set function $\phi:2^V\to\mathbb{R}$ is called submodular iff $\phi(S\cup T)+\phi(S\cap T)\leq\phi(S)+\phi(T)$ for all $S,T\subseteq V$. A submodular function is monotone iff $\phi(S)\leq\phi(T)$ for $S\subseteq T$. An important property of monotone submodular functions is that a greedy algorithm provides a $(1-1/e)\approx 0.63$-approximate solution to the maximization problem under a budget constraint $|S|\leq K$ [11]. The contraction $\tilde{\phi}:2^V\to\mathbb{R}$ of a (monotone) submodular function $\phi$, defined as $\tilde{\phi}(S)\coloneqq\phi(S\cup T)-\phi(T)$ for a fixed $T\subseteq V$, is also a (monotone) submodular function.
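For reference, a minimal sketch of the standard greedy rule for monotone submodular maximization under a cardinality budget (generic, not specific to PWC; the function name greedy_max is ours):

def greedy_max(phi, candidates, K):
    """Greedily build S with |S| <= K by adding the element with the largest marginal gain.

    phi: a set function over Python sets, assumed monotone submodular.
    """
    S, current = set(), phi(set())
    for _ in range(K):
        best, best_gain = None, 0.0
        for c in candidates - S:
            gain = phi(S | {c}) - current
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:              # no remaining positive marginal gain
            break
        S.add(best)
        current += best_gain
    return S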

3 Partial Wasserstein covering problem

3.1 Formulation

As discussed in Sect. 1, our goal is to fill in the gap between the application dataset $\mathcal{D}_{\text{app}}$ and the development dataset $\mathcal{D}_{\text{dev}}$ by adding some data from a candidate dataset $\mathcal{D}_{\text{cand}}$ to the small dataset $\mathcal{D}_{\text{dev}}$. We consider $\mathcal{D}_{\text{cand}}=\mathcal{D}_{\text{app}}$ (i.e., we copy some data in $\mathcal{D}_{\text{app}}$ and add them to $\mathcal{D}_{\text{dev}}$) in the scenarios mentioned above, but we herein consider the more general formulation. We model this task as a discrete optimization problem called the partial Wasserstein covering problem.

Given a dataset for development $\mathcal{D}_{\text{dev}}=\{\mathbf{y}^{(j)}\}_{j=1}^{N_{\text{dev}}}$, a dataset obtained from an application $\mathcal{D}_{\text{app}}=\{\mathbf{x}^{(i)}\}_{i=1}^{N_{\text{app}}}$, and a dataset containing candidates for selection $\mathcal{D}_{\text{cand}}=\{\mathbf{s}^{(j)}\}_{j=1}^{N_{\text{cand}}}$, where $N_{\text{app}}\geq N_{\text{dev}}$, the PWC problem is defined as the following optimization (we consider the partial Wasserstein divergence between the unlabeled datasets because the application and candidate datasets are usually unlabeled; if they are labeled, we can use the label information as in [2]):

$$\max_{\substack{S\subseteq\mathcal{D}_{\text{cand}}\\ \text{s.t. }|S|\leq K}}\ -\mathcal{PW}^{2}(\mathcal{D}_{\text{app}},S\cup\mathcal{D}_{\text{dev}})+\mathcal{PW}^{2}(\mathcal{D}_{\text{app}},\mathcal{D}_{\text{dev}}) \quad (2)$$

We select a subset $S$ from the candidate dataset $\mathcal{D}_{\text{cand}}$ under the budget constraint $|S|\leq K~(\ll N_{\text{cand}})$, and then add that subset to the development data $\mathcal{D}_{\text{dev}}$ to minimize the partial Wasserstein divergence between the two datasets $\mathcal{D}_{\text{app}}$ and $S\cup\mathcal{D}_{\text{dev}}$. The second term is a constant with respect to $S$, which is included to make the objective non-negative.

In Eq. (2), we must specify the probability mass (i.e., $\mathbf{a}$ and $\mathbf{b}$ in Eq. (1)) for each data point. Here, we employ a uniform mass, which is a natural choice because we do not have prior information regarding each data point. Specifically, we set the weights to $a_i=1/N_{\text{app}}$ for $\mathcal{D}_{\text{app}}$ and $b_j=1/N_{\text{dev}}$ for $S\cup\mathcal{D}_{\text{dev}}$. With this choice of masses, we can easily show that $\mathcal{PW}^2(\mathcal{D}_{\text{app}},S\cup\mathcal{D}_{\text{dev}})=\mathcal{W}^2(\mathcal{D}_{\text{app}},\mathcal{D}_{\text{dev}})$ when $S=\emptyset$, where $\mathcal{W}^2(\mathcal{D}_{\text{app}},\mathcal{D}_{\text{dev}})$ is the Wasserstein distance. Therefore, based on the monotone property (Sect. 3.2), the objective value is non-negative for any $S$.

The PWC problem in Eq. (2) can be written as a mixed integer linear program (MILP) as follows. Instead of using the divergence between $\mathcal{D}_{\text{app}}$ and $S\cup\mathcal{D}_{\text{dev}}$, we consider the divergence between $\mathcal{D}_{\text{app}}$ and $\mathcal{D}_{\text{cand}}\cup\mathcal{D}_{\text{dev}}$. Hence, the mass $\mathbf{b}$ is an $(N_{\text{cand}}+N_{\text{dev}})$-dimensional vector, where the first $N_{\text{cand}}$ elements correspond to the data in $\mathcal{D}_{\text{cand}}$ and the remaining elements correspond to the data in $\mathcal{D}_{\text{dev}}$. In this case, we never transport to data points in $\mathcal{D}_{\text{cand}}\setminus S$, meaning we use the following mass that depends on $S$:

$$b_{j}(S)=\begin{cases}\frac{I[\mathbf{s}^{(j)}\in S]}{N_{\text{dev}}},&\text{if }1\leq j\leq N_{\text{cand}}\\ \frac{1}{N_{\text{dev}}},&\text{if }N_{\text{cand}}+1\leq j.\end{cases} \quad (3)$$

As a result, the problem is an MILP problem with an objective function $\langle\mathbf{P},\mathbf{C}\rangle$ w.r.t. $S\subseteq\mathcal{D}_{\text{cand}}$ and $\mathbf{P}\in U(\mathbf{a},\mathbf{b}(S))$ with $|S|\leq K$ and $a_i=1/N_{\text{app}}$.

One may wonder why we use the partial Wasserstein divergence in Eq.(2) instead of the standard Wasserstein distance. This is because the asymmetry in the conditions of the partial Wasserstein divergence enables us to extract only the parts with a density lower than that of the application dataset, whereas the standard Wasserstein distance becomes large for parts with sufficient density in the development dataset. Furthermore, the guaranteed approximation algorithms that we describe later can be utilized only for the partial Wasserstein divergence version.

3.2 Submodularity of the PWC problem

In this section, we prove the following theorem to guarantee the approximation ratio of the proposed algorithms.

Theorem 1.

Given the datasets $X=\{\mathbf{x}^{(i)}\}_{i=1}^{N_x}$ and $Y=\{\mathbf{y}^{(j)}\}_{j=1}^{N_y}$, and a subset of a dataset $S\subseteq\{\mathbf{s}^{(j)}\}_{j=1}^{N_s}$, $\phi(S)=-\mathcal{PW}^2(X,S\cup Y)+\mathcal{PW}^2(X,Y)$ is a monotone submodular function.

To prove Theorem 1, we reduce our problem to the partial maximum weight matching problem [12], which is known to be submodular. First, we present the following lemmas.

Lemma 1.

Let $\mathbb{Q}$ be the set of all rational numbers, and $m$ and $n$ be positive integers with $n\leq m$. Consider $\mathbf{A}\in\mathbb{Q}^{m\times n}$ and $\mathbf{b}\in\mathbb{Q}^m$. Then, the extreme points $\mathbf{x}^*$ of the convex polytope defined by the linear inequalities $\mathbf{A}\mathbf{x}\leq\mathbf{b}$ are also rational, meaning $\mathbf{x}^*\in\mathbb{Q}^n$.

Proof of Lemma 1. It is well known [10] that any extreme point of the polytope is obtained by (i) choosing $n$ inequalities from $\mathbf{A}\mathbf{x}\leq\mathbf{b}$ and (ii) solving the corresponding $n$-variable linear system $\mathbf{B}\mathbf{x}=\tilde{\mathbf{b}}$, where $\mathbf{B}$ and $\tilde{\mathbf{b}}$ are the corresponding sub-matrix/vector (when replacing $\leq$ with $=$). As all elements in $\mathbf{B}^{-1}$ and $\tilde{\mathbf{b}}$ are rational (this immediately follows from the fact that $\mathbf{B}^{-1}=\tilde{\mathbf{B}}/\det\mathbf{B}$, where $\tilde{\mathbf{B}}$ is the adjugate matrix of $\mathbf{B}$ and $\det\mathbf{B}$ is the determinant of $\mathbf{B}$; since $\mathbf{B}\in\mathbb{Q}^{n\times n}$, $\tilde{\mathbf{B}}$ and $\det\mathbf{B}$ are both rational), we conclude that any extreme point is rational. ∎

Lemma 2.

Let $l$, $m$, and $n$ be positive integers and $Z=[\![n]\!]$. Given a positive-valued $m$-by-$n$ matrix $\mathbf{R}>0$, the following set function $\psi:2^Z\to\mathbb{R}$ is a submodular function:

$$\psi(S)=\max_{\mathbf{P}\in U^{\leq}(\mathbbm{1}_{m}/m,\,\mathbf{b}(S))}\left<\mathbf{R},\mathbf{P}\right>,\quad\text{where }b_{j}(S)=\tfrac{I[j\in S]}{l}\quad\forall j\in[\![n]\!]. \quad (4)$$

Here, $U^{\leq}(\mathbf{a},\mathbf{b}(S))$ is the set defined by replacing the constraint $\mathbf{P}\mathbbm{1}_n=\mathbf{a}$ in Eq. (1) with $\mathbf{P}\mathbbm{1}_n\leq\mathbf{a}$.

Proof of Lemma 2. For any fixed $S\subseteq Z$, there is an optimal solution $\mathbf{P}^*$ of $\psi(S)$ where all elements $P^*_{ij}$ are rational, because all the coefficients in the constraints are in $\mathbb{Q}$ ($\because$ Lemma 1). Thus, there is a natural number $M\in\mathbb{Z}_+$ that satisfies $\psi(S)=\frac{1}{mlM}\max_{\mathbf{P}\in U_{\mathbb{Q}}(S)}\left<\mathbf{R},\mathbf{P}\right>$, where $U_{\mathbb{Q}}(S)$ is the set of possible transports when the transport mass is restricted to the rational numbers $\mathbb{Q}$.

$$U_{\mathbb{Q}}(S)=\left\{\mathbf{P}\in\mathbb{Z}_{+}^{m\times n}\ \middle|\ \mathbf{P}\mathbbm{1}_{n}\leq lM\cdot\mathbbm{1}_{m},\ \mathbf{P}^{\top}\mathbbm{1}_{m}\leq mM\cdot I[j\in S]\cdot\mathbbm{1}_{n}\right\}. \quad (5)$$

The problem of $\psi(S)$ can be written as a network flow problem on a bipartite network following Sect. 3.4 in [9]. Since the elements of $\mathbf{P}$ are integers, we can reformulate the problem of $\psi(S)$ as maximum-weight bipartite matching (i.e., an assignment problem) on a bipartite graph whose vertices are generated by copying the original vertices $lM$ and $mM$ times with modified edge weights. Therefore, $G(V\cup W,E)$ is a bipartite graph, where $V=\{v_{1,1},\dots,v_{1,lM},\dots,v_{m,1},\dots,v_{m,lM}\}$ and $W=\{w_{1,1},\dots,w_{1,mM},\dots,w_{n,1},\dots,w_{n,mM}\}$. For any $i\in[\![m]\!],j\in[\![n]\!]$, the transport mass $P_{ij}\in\mathbb{Z}_+$ is encoded using the copied vertices $\{v_{i,1},\dots,v_{i,lM}\}$ and $\{w_{j,1},\dots,w_{j,mM}\}$, and the weights of the edges among them are $\frac{1}{mlM}R_{ij}>0$ to represent $R_{ij}$. The partial maximum weight matching function for a graph with positive edge weights is a submodular function [12]. Therefore, $\psi(S)$ is a submodular function. ∎

Lemma 3.

If $|S|\geq l$, there exists an optimal solution $\mathbf{P}^*$ in the maximization problem of $\psi$ satisfying $\mathbf{P}^*\mathbbm{1}_n=\frac{\mathbbm{1}_m}{m}$ (i.e., $\mathbf{P}^*\coloneqq\operatorname*{arg\,max}_{\mathbf{P}\in U^{\leq}(\frac{\mathbbm{1}_m}{m},\mathbf{b}(S))}\left<\mathbf{R},\mathbf{P}\right>=\operatorname*{arg\,max}_{\mathbf{P}\in U(\frac{\mathbbm{1}_m}{m},\mathbf{b}(S))}\left<\mathbf{R},\mathbf{P}\right>$).

Proof of Lemma 3. Given an optimal solution $\mathbf{P}^*$, suppose there exists $i'$ satisfying $\sum_{j=1}^n P^*_{i'j}<\frac{1}{m}$. This implies that $\sum_{i,j}P^*_{ij}<1$. We denote the gap as $\delta\coloneqq\frac{1}{m}-\sum_{j=1}^n P^*_{i'j}>0$. Based on the assumption $|S|\geq l$, we have $\sum_{j=1}^n b_j(S)\geq 1$. Hence, there must exist $j'$ such that $\sum_{i=1}^m P^*_{ij'}<b_{j'}$. We denote this gap as $\delta'\coloneqq b_{j'}-\sum_{i=1}^m P^*_{ij'}>0$. We also define $\tilde{\delta}=\min\{\delta,\delta'\}>0$. Then, the matrix $\mathbf{P}'=(P^*_{ij}+\tilde{\delta}I[i=i',j=j'])_{ij}$ is still a feasible solution. However, $R_{ij}>0$ leads to $\left<\mathbf{R},\mathbf{P}'\right>>\left<\mathbf{R},\mathbf{P}^*\right>$, which contradicts $\mathbf{P}^*$ being an optimal solution. Thus, $\sum_{j=1}^n P^*_{ij}=\frac{1}{m},\ \forall i$. ∎

Proof of Theorem 1. Given $C_{\text{max}}=\max_{i,j}C_{ij}+\gamma$, where $\gamma>0$, the following is satisfied:

$$\phi(S)=\Bigl(-C_{\text{max}}+\max_{\mathbf{P}\in U(\mathbf{a},\mathbf{b}(S))}\left<\mathbf{R},\mathbf{P}\right>\Bigr)-\Bigl(-C_{\text{max}}+\max_{\mathbf{P}\in U(\mathbf{a},\mathbf{b}(\emptyset))}\left<\mathbf{R},\mathbf{P}\right>\Bigr), \quad (6)$$

where $\mathbf{R}=(C_{\text{max}}-C_{ij})_{ij}>0$, $m=N_x$, $n=N_s+N_y$, and $l=N_y$. Here, $|S\cup Y|\geq N_y=l$ and Lemma 3 yield $\phi(S)=\psi(S\cup Y)-\psi(Y)$. Since $\phi(S)$ is a contraction of the submodular function $\psi(S)$ (Lemma 2), $\phi(S)=-\mathcal{PW}^2(X,S\cup Y)+\mathcal{PW}^2(X,Y)$ is a submodular function.

We now prove the monotonicity of $\phi$. For $S\subseteq T$, we have $U(\mathbf{a},\mathbf{b}(S))\subseteq U(\mathbf{a},\mathbf{b}(T))$ because $\mathbf{b}(S)\leq\mathbf{b}(T)$. This implies that $\phi(S)\leq\phi(T)$. ∎

Finally, we prove Proposition 1 to clarify the computational intractability of the PWC problem.

Proposition 1.

The PWC problem with marginals $\mathbf{a}$ and $\mathbf{b}(S)$ is NP-hard.

Proof of Proposition 1. The decision version of the PWC problem, denoted as $\mathtt{PWC}(\mathbf{a},\mathbf{b}(S),C,K,x)$, asks whether a subset $S$, $|S|\leq K$, such that $\phi(S)\leq x\in\mathbb{R}$ exists. We discuss a reduction to PWC from the well-known NP-complete set cover problem [13]. Given a family $\mathcal{G}=\{G_j\subseteq\mathcal{I}\}_{j=1}^M$ over the set $\mathcal{I}=[\![N]\!]$ and an integer $k$, $\mathtt{SetCover}(\mathcal{G},\mathcal{I},k)$ is a decision problem that answers $\mathtt{Yes}$ iff there exists a sub-family $\mathcal{G}'\subseteq\mathcal{G}$ such that $|\mathcal{G}'|\leq k$ and $\cup_{G\in\mathcal{G}'}G=\mathcal{I}$. Our reduction is defined as follows. Given $\mathtt{SetCover}(\mathcal{G},\mathcal{I},k)$, for $i\in[\![N]\!]$ and $j\in[\![M]\!]$ we set $a_i=1$, $b_j(S)=|G_j|\cdot I[j\in S]$, $C_{ij}=I[i\in G_j]$, $K=k$, and $x=0$. Based on this polynomial reduction, the given $\mathtt{SetCover}$ instance is $\mathtt{Yes}$ iff the reduced $\mathtt{PWC}$ instance is $\mathtt{Yes}$. This means that the decision version of PWC is NP-complete, and Proposition 1 holds. ∎

3.3 Algorithms

MILP and greedy algorithms

The PWC problem in Eq. (2) can be solved directly as an MILP problem. We refer to this method as PW-MILP. However, this is extremely inefficient when $N_{\text{app}}$, $N_{\text{dev}}$, and $N_{\text{cand}}$ are large.

As a consequence of Theorem 1, a simple greedy algorithm provides a 0.63 approximation because the problem is a type of monotone submodular maximization with a budget constraint $|S|\leq K$ [11]. Specifically, this greedy algorithm selects data from the candidate dataset $\mathcal{D}_{\text{cand}}$ sequentially in each iteration to maximize $\phi$. We consider two baseline methods, called PW-greedy-LP and PW-greedy-ent. PW-greedy-LP solves the LP problem using the simplex method. PW-greedy-ent computes the optimal $\mathbf{P}^{\star}$ using Sinkhorn iterations [8].

Even for the greedy algorithms mentioned above, the computational cost is still high because we need to calculate the partial Wasserstein divergences for all candidates on $\mathcal{D}_{\text{app}}$ in each iteration, yielding a computational time complexity of $O(K\cdot N_{\text{cand}}\cdot C_{PW})$. Here, $C_{PW}$ is the complexity of the partial Wasserstein divergence with the simplex method for the LP or Sinkhorn iteration.

To reduce the computational cost, we propose heuristic approaches called quasi-greedy algorithms. A high-level description of the quasi-greedy algorithms is provided below. In each step of greedy selection, we use a type of sensitivity value that allows us to estimate how much the partial Wasserstein divergence varies when adding candidate data, instead of performing the actual computation of the divergence. Specifically, if we add a data point $\mathbf{s}^{(j)}$ to $S$, denoted as $T=\{\mathbf{s}^{(j)}\}\cup S$, we have $\mathbf{b}(T)=\mathbf{b}(S)+\frac{\mathbf{e}_j}{N_{\text{dev}}}$, where $\mathbf{e}_j$ is a one-hot vector with the corresponding $j$th element activated. Hence, we can estimate the change in the divergence for data addition by computing the sensitivity with respect to $\mathbf{b}(S)$. If efficient sensitivity computation is possible, we can speed up the greedy algorithms. We describe the concrete methods for each approach, the simplex method and Sinkhorn iterations, in the following paragraphs.

Quasi-greedy algorithm for the simplex method

When we use the simplex method for the partial Wasserstein divergence computation, we employ a sensitivity analysis for LP. The dual problem of partial Wasserstein divergence computation is defined as follows.

$$\max_{\mathbf{f}\in\mathbb{R}^{N_{\text{app}}},\ \mathbf{g}\in\mathbb{R}^{N_{\text{cand}}+N_{\text{dev}}}}\left<\mathbf{f},\mathbf{a}\right>+\left<\mathbf{g},\mathbf{b}(S)\right>,\quad\text{s.t. }g_{j}\leq 0,\ f_{i}+g_{j}\leq C_{ij},\ \forall i,j, \quad (7)$$

where $\mathbf{f}$ and $\mathbf{g}$ are dual variables. Using sensitivity analysis, the changes in the partial Wasserstein divergence can be estimated as $g_j^*\Delta b_j$, where $g_j^*$ is an optimal dual solution of Eq. (7), and $\Delta b_j$ is the change in $b_j$. It is worth noting that $\Delta b_j$ now corresponds to $I[\mathbf{s}^{(j)}\in S]$ in Eq. (3), and a smaller value of Eq. (7) results in a larger value of Eq. (2). Thus, we propose a heuristic algorithm that iteratively selects $\mathbf{s}^{(j^*)}$ satisfying $j^*=\operatorname*{arg\,min}_{j\in[\![N_{\text{cand}}]\!],\,\mathbf{s}^{(j)}\notin S}\ g_j^{*(t)}$ at iteration $t$, where $g_j^{*(t)}$ is the optimal dual solution at $t$. We emphasize that we have $-\frac{1}{N_{\text{dev}}}g_j^*=\phi(\{\mathbf{s}^{(j)}\}\cup S)-\phi(S)$ as long as $\overline{b}_j-b_j\geq 1/N_{\text{dev}}$ holds, where $\overline{b}_j$ is the upper bound obtained from the sensitivity analysis, leading to the same selection as that of the greedy algorithm. The computational complexity of the heuristic algorithm is $O(K\cdot C_{PW})$ because we can obtain $\mathbf{g}^*$ by solving the dual simplex. It should be noted that we add a small value to $b_j$ when $b_j=0$ to avoid undefined values in $\mathbf{g}$. We refer to this algorithm as PW-sensitivity-LP and summarize the overall algorithm in Algorithm 1.

Algorithm 1 Data selection for the PWC problem with sensitivity analysis (PW-sensitivity-LP)
1: Input: $\mathcal{D}_{\text{app}},\mathcal{D}_{\text{dev}},\mathcal{D}_{\text{cand}}$
2: Output: $S$
3: $S\leftarrow\{\}$, $t\leftarrow 0$
4: while $|S|<K$ do
5:     Calculate $\mathcal{PW}^2(\mathcal{D}_{\text{app}},S\cup\mathcal{D}_{\text{dev}})$, and obtain $g_j^{*(t)}$ from the sensitivity analysis.
6:     $j^*=\operatorname*{arg\,min}_{j\in[\![N_{\text{cand}}]\!],\,\mathbf{s}^{(j)}\notin S}\ g_j^{*(t)}$
7:     $S\leftarrow S\cup\{\mathbf{s}^{(j^*)}\}$
8:     $t\leftarrow t+1$
9: end while

For computational efficiency, we use the solution matrix $\mathbf{P}$ from the previous step as the initial value for the simplex algorithm in the current step. Any feasible solution $\mathbf{P}\in U(\mathbf{a},\mathbf{b}(S))$ remains feasible after adding data because $U(\mathbf{a},\mathbf{b}(S))\subseteq U(\mathbf{a},\mathbf{b}(T))$ for all $S\subseteq T$. The previous solutions of the dual variables $\mathbf{f}$ and $\mathbf{g}$ are also utilized for the initialization of the dual simplex and Sinkhorn iterations.
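As a rough illustration of this selection loop, the following sketch reuses the hypothetical partial_wasserstein_lp helper from Sect. 2 and reads the duals of the inequality constraints from SciPy's result as $g_j^*$ (our own simplification, not the authors' CPLEX-based implementation, and without the warm-start described above):

import numpy as np

EPS = 1e-6   # small mass used when b_j = 0, as noted in the text

def pw_sensitivity_lp(C, N_app, N_dev, N_cand, K):
    """Quasi-greedy selection: pick the candidate with the most negative dual g_j^*."""
    a = np.full(N_app, 1.0 / N_app)
    selected = []
    for _ in range(K):
        # b(S): development points and already-selected candidates receive 1/N_dev.
        b = np.full(N_cand + N_dev, EPS)
        b[N_cand:] = 1.0 / N_dev
        b[selected] = 1.0 / N_dev
        _, _, res = partial_wasserstein_lp(C, a, b)   # LP sketch from Sect. 2
        g = res.ineqlin.marginals.copy()              # duals of the P^T 1 <= b rows
        g[N_cand:] = np.inf                           # never "select" development data
        g[selected] = np.inf                          # nor already-selected candidates
        selected.append(int(np.argmin(g)))            # most negative sensitivity wins
    return selected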

Faster heuristic using C-transform

The heuristics above still require solving a large LP in each step, where the number of dual constraints is $N_{\text{app}}\times(N_{\text{cand}}+N_{\text{dev}})$. Here, we aim to reduce the number of constraints in the LP to $N_{\text{app}}\times(|S|+N_{\text{dev}})\ll N_{\text{app}}\times(N_{\text{cand}}+N_{\text{dev}})$ using the C-transform (Sect. 3.1 in [9]).

To derive the algorithm, we consider the dual problem in Eq. (7). Considering the fact that $b_j(S)=0$ for $\mathbf{s}^{(j)}\notin S$, we first solve the smaller LP problem that ignores all $j$ such that $\mathbf{s}^{(j)}\notin S$, whose system size is $N_{\text{app}}\times(|S|+N_{\text{dev}})$. Let the optimal dual solution of this smaller LP be $(\mathbf{f}^*,\mathbf{g}^*)$, where $\mathbf{f}^*\in\mathbb{R}^{N_{\text{app}}}$ and $\mathbf{g}^*\in\mathbb{R}^{|S|+N_{\text{dev}}}$. For each $\mathbf{s}^{(j)}\notin S$, we consider an LP in which $\mathbf{s}^{(j)}$ is added to $S$. Instead of solving each LP as in PW-greedy-LP, we approximate the optimal solution using a technique called the C-transform. More specifically, for each $j$ such that $\mathbf{s}^{(j)}\notin S$, we only optimize $g_j$ and fix the other variables to the dual optimum above. This is done by $g_j^{\mathbf{C}}\coloneqq\min\{0,\min_{i\in[\![N_{\text{app}}]\!]}C_{ij}-f_i^*\}$. Note that this is the largest value of $g_j$ satisfying the dual constraints. This $g_j^{\mathbf{C}}$ gives the estimated increase in the objective function when $\mathbf{s}^{(j)}\notin S$ is added to $S$. Finally, we select the instance $j^*=\operatorname*{arg\,min}_{j\in[\![N_{\text{cand}}]\!],\,\mathbf{s}^{(j)}\notin S}g^{\mathbf{C}}_j$. As shown later, this heuristic is experimentally efficient in terms of computation time. We refer to this algorithm as PW-sensitivity-Ctrans.
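A hedged NumPy sketch of this step (our own illustration; f_star denotes the dual solution $\mathbf{f}^*$ of the smaller LP restricted to $S\cup\mathcal{D}_{\text{dev}}$, and C_cand the cost columns of the not-yet-selected candidates):

import numpy as np

def c_transform_scores(C_cand, f_star):
    """g_j^C = min(0, min_i C_ij - f_i^*): the largest g_j feasible for the dual
    constraints when all other dual variables are kept fixed."""
    return np.minimum(0.0, (C_cand - f_star[:, None]).min(axis=0))

# The next data point to add is the candidate with the smallest (most negative) score:
# j_star = np.argmin(c_transform_scores(C_cand, f_star))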

Quasi-greedy algorithm for Sinkhorn iteration

When we use the generalized Sinkhorn iterations, we can easily obtain the derivative of the entropy-regularized partial Wasserstein divergence, $\nabla_j\coloneqq\partial\mathcal{PW}^2_\epsilon(\mathcal{D}_{\text{app}},S\cup\mathcal{D}_{\text{dev}})/\partial b_j$, where $\mathcal{PW}^2_\epsilon$ denotes the entropy-regularized partial Wasserstein divergence, using automatic differentiation techniques such as those provided by PyTorch [14]. Based on this derivative, we can derive a quasi-greedy algorithm for the Sinkhorn iteration as $j^*=\operatorname*{arg\,min}_{j\in[\![N_{\text{cand}}]\!],\,\mathbf{s}^{(j)}\notin S}\nabla_j$ (i.e., we replace Lines 6 and 7 of Algorithm 1 with this formula). Because the derivative can be obtained with the same computational complexity as the Sinkhorn iterations, the overall complexity is $O(K\cdot C_{PW})$. We refer to this algorithm as PW-sensitivity-ent.
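The following is a minimal PyTorch sketch of this idea under our own assumptions: instead of the generalized Sinkhorn iteration of [8], it reduces the one-side relaxed problem to a balanced entropic OT problem by appending a dummy source point with mass $\sum_j b_j-1$ and zero cost, runs a plain Sinkhorn loop, and differentiates through it with respect to $\mathbf{b}$ (a log-domain implementation is preferable for small $\epsilon$):

import torch

def entropic_pw2_grad(C, a, b, eps=0.1, n_iter=1000):
    """Entropy-regularized partial Wasserstein divergence and its gradient w.r.t. b.

    C: (m, n) cost matrix, a: (m,) with sum 1, b: (n,) with sum >= 1.
    """
    b = b.clone().requires_grad_(True)
    a_aug = torch.cat([a, (b.sum() - 1.0).reshape(1)])                        # dummy source mass
    C_aug = torch.cat([C, torch.zeros(1, C.shape[1], dtype=C.dtype)], dim=0)  # zero-cost dummy row
    K = torch.exp(-C_aug / eps)
    u, v = torch.ones_like(a_aug), torch.ones_like(b)
    for _ in range(n_iter):                   # Sinkhorn fixed-point updates
        u = a_aug / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]
    loss = (P[:-1] * C).sum()                 # transported cost, excluding the dummy row
    loss.backward()
    return loss.detach(), b.grad              # divergence estimate and nabla_j = d loss / d b_j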

4 Related work

Instance selection and data summarization

Tasks for selecting an important portion of a large dataset are common in machine learning. Instance selection involves extracting a subset of a training dataset while maintaining the target task performance. According to [15], two types of instance selection methods have been studied: model-dependent methods [16, 17, 18, 19, 20] and label-dependent methods [21, 22, 23, 24, 25, 26]. Unlike instance selection methods, PWC depends on neither a model nor ground-truth labels. Data summarization involves finding representative elements in a large dataset [27]. Here, submodularity plays an important role [28, 29, 30, 31]. Unlike data summarization, PWC focuses on filling in gaps between datasets.

Anomaly detection

Our covering problem can be considered as the task of selecting a subset of a dataset that is not included in the development dataset. In this regard, our method is related to anomaly detectors, e.g., LOF [4], one-class SVM [32], and deep learning-based methods [33, 34]. However, our goal is to extract patterns that are less included in the development dataset but have a certain volume in the application dataset, rather than selecting rare data that would be considered observation noise, or infrequent patterns that are not as important as frequent patterns.

Active learning

The problem of selecting data from an unlabeled dataset is also considered in active learning. A typical approach uses the output of the target model, e.g., the entropy-based method [35, 36], whereas some methods are independent of the model output, e.g., the $k$-center-based [5] and Wasserstein distance-based [37] approaches. In contrast, our goal is to find patterns that were not considered during development, even if they do not contribute to improving the prediction performance when directly added to the training data; developers may use these unconsidered data to test the generalization performance, design subroutines to process the patterns, redevelop a new model, or alert users not to use the products in certain special situations. Furthermore, we emphasize that the submodularity and guaranteed approximation algorithms can be used only with the partial Wasserstein divergence, not with the vanilla Wasserstein distance.

Facility location problem (FLP) and kk-median

The FLP and the related $k$-median [38] are similar to our problem in the sense that we select data $S\subseteq\mathcal{D}_{\text{cand}}$ to minimize an objective function. FLP is a mathematical model for finding a desirable location for facilities (e.g., stations, schools, and hospitals). While FLP models facility opening costs, our data covering problem does not consider data selection costs, making it more similar to $k$-median problems. However, unlike the $k$-median, our covering problem allows relaxed transportation for distributions, which is known as Kantorovich relaxation in the optimal transport field [9]. Hence, our approach using partial Wasserstein divergence enables us to model more flexible assignments among distributions beyond naïve assignments among data.

5 Experiments

In our experiments, we considered a scenario in which we wished to select some data from the application dataset $\mathcal{D}_{\text{app}}$ for the PWC problem (i.e., $\mathcal{D}_{\text{app}}=\mathcal{D}_{\text{cand}}$). For PW-*-LP and PW-sensitivity-Ctrans, we used IBM ILOG CPLEX 20.01 [39] for the dual simplex algorithm, where the sensitivity analysis for LP is also available. For PW-*-ent, we computed the entropy-regularized version of the partial Wasserstein divergence using PyTorch [14] on a GPU. We computed the sensitivity using automatic differentiation. We set the coefficient for entropy regularization to $\epsilon=0.01$ and terminated the Sinkhorn iterations when the divergence did not change by at least $\max_{i,j}C_{ij}\times 10^{-12}$ compared with the previous step, or the number of iterations reached 50,000. All experiments were conducted with an Intel® Xeon® Gold 6142 CPU and an NVIDIA® TITAN RTX™ GPU.

5.1 Comparison of algorithms

First, we compare our algorithms, namely, the greedy-based PW-greedy-*, the sensitivity-based PW-sensitivity-*, and random selection as a baseline. For the evaluation, we generated two-dimensional random values following normal distributions for $\mathcal{D}_{\text{app}}$ and $\mathcal{D}_{\text{dev}}$, respectively. Figure 2 (left) presents the partial Wasserstein divergences between $\mathcal{D}_{\text{app}}$ and $S\cup\mathcal{D}_{\text{dev}}$ while varying the number of selected data when $N_{\text{app}}=N_{\text{dev}}=30$, and Fig. 2 (right) presents the computational times when $K=30$ and $N_{\text{app}}=N_{\text{dev}}=N_{\text{cand}}$ varied. As shown in Fig. 2 (left), PW-greedy-LP and PW-sensitivity-LP select exactly the same data as the global optimal solution obtained by PW-MILP. As shown in Fig. 2 (right), considering the logarithmic scale, PW-sensitivity-* significantly reduces the computational time compared with PW-MILP, while the naïve greedy algorithms do not scale. In particular, PW-sensitivity-ent can be computed quickly as long as the GPU memory is sufficient, whereas its CPU version (PW-sensitivity-ent (CPU)) is significantly slower. PW-sensitivity-Ctrans is the fastest among the methods without GPUs; it is 8.67 and 3.07 times faster than PW-MILP and PW-sensitivity-LP, respectively. For quantitative evaluation, we define the empirical approximation ratio as $\phi(S_K)/\phi(S^*_K)$, where $S^*_K$ is the optimal solution obtained by PW-MILP when $|S|=K$. Figure 3 shows the empirical approximation ratios ($N_{\text{app}}=N_{\text{dev}}=30$, $K=15$) for 50 different random seeds. Both PW-*-LP variants achieve approximation ratios close to one, whereas PW-*-ent and PW-sensitivity-Ctrans have slightly lower approximation ratios (an important feature of PW-*-ent is that they can be executed without using any commercial solvers).

Figure 2: Partial Wasserstein divergence $\mathcal{PW}^2(\mathcal{D}_{\text{app}},S\cup\mathcal{D}_{\text{dev}})$ when data are sequentially selected and added to $S$ (left). Computational time, where colored areas indicate standard deviations (right).
Figure 3: Approximation ratios evaluated empirically for $K=15$ and 50 random instances.

5.2 Finding a missing category

For the quantitative evaluation of missing pattern findings, we herein consider a scenario in which a category (i.e., label) is less included in the development dataset than in the application dataset (note that in real scenarios, missing patterns do not always correspond to labels). We employ subsets of the MNIST dataset [40], where the development dataset contains the label 0 at a rate of 0.5%, whereas all labels are included in the application data in equal ratios (i.e., 10%). We randomly sampled 500 images from the validation split for each of $\mathcal{D}_{\text{app}}$ and $\mathcal{D}_{\text{dev}}$. We employ the squared $L^2$ distance on the pixel space as the ground metric for our method. As a baseline, we employed LOF [4] novelty detection provided by scikit-learn [41] with default parameters. For LOF, we selected the top-$K$ data according to the anomaly scores. We also employed active learning approaches based on the entropies of the predictions [35, 36] and coreset with the $k$-center and $k$-center greedy [5, 42]. For the entropy-based method, we train a LeNet-like neural network using the training split, and then select the $K$ data with the highest prediction entropies from the application dataset. The LOF and coreset algorithms are conducted on the pixel space.

We conducted 10 trials using different random seeds for the proposed algorithms and baselines, except for the PW-greedy-* algorithms because they took over 24 h. Figure 4 presents a histogram of the selected labels when $K=30$, where the $x$-axis corresponds to the labels of the selected data (i.e., 0 or not 0) and the $y$-axis shows the relative frequencies of the selected data. The proposed methods (i.e., PW-*) extract more data corresponding to the label 0 (0.71 for PW-MILP) than the baselines. We can conclude that the proposed methods successfully selected data from the candidate dataset $\mathcal{D}_{\text{cand}}(=\mathcal{D}_{\text{app}})$ to fill in the missing areas of the distribution.

Figure 4: Relative frequency of missing category in selected data. The black bars denote the standard deviations.

5.3 Missing scene extraction in driving datasets

Finally, we demonstrate PWC in a realistic scenario using driving scene images. We adopted two datasets, BDD100K [43] and KITTI (Object Detection Evaluation 2012) [44], as the application and development datasets, respectively. The major difference between these two datasets is that KITTI ($\mathcal{D}_{\text{dev}}$) contains only daytime images, whereas BDD100K ($\mathcal{D}_{\text{app}}$) contains both daytime and nighttime images. To reduce the computational time, we randomly selected 1,500 data points for the development dataset from the test split of the KITTI dataset and 3,000 data points for the application dataset from the test split of BDD100K. To compute the transport costs between images, we calculated the squared $L_2$ norms between the feature vectors extracted by a pretrained 50-layer ResNet [45] obtained from Torchvision [46]. Before inputting the images into the ResNet, each image was resized to a height of 224 pixels and then center-cropped to a width of 224, followed by normalization. As baseline methods, we adopted LOF [4] and coreset with the $k$-center greedy [5]. Figure 5 presents the top-3 images obtained. One can see that PWC (PW-sensitivity-LP) selects the major pattern (i.e., nighttime scenes) that is not included in the development data, whereas LOF and coreset mainly extract specific rare scenes (e.g., a truck crossing a street or an out-of-focus scene). The coreset selects a single nighttime image; however, we emphasize that providing multiple images is essential for developers to understand which patterns in the images (e.g., nighttime, types of cars, or roadside) are less included in the development dataset.

The above results indicate that the PWC problem enables us to accelerate updating ML-based systems when the distributions of the application and development data differ, because our method focuses not on isolated anomalies but on major missing patterns in the application data, as shown in Fig. 1 and Fig. 5. An ML workflow using our method can efficiently find such patterns and allow developers to incrementally evaluate and update ML-based systems (e.g., test the generalization performance, redevelop a new model, and design a special subroutine for the pattern). Identifying patterns that are not taken into account during development and addressing them individually can improve the reliability and generalization ability of ML-based systems, such as image recognition in autonomous vehicles.

(a) Partial Wasserstein covering
(b) LOF
(c) Coreset ($k$-center greedy)
Figure 5: Covering results when the application dataset (BDD100K) contains nighttime scenes, but the development dataset (KITTI) does not. PWC (PW-sensitivity-LP) extracts the major difference between the two datasets (i.e., nighttime images), whereas LOF and coreset ($k$-center greedy) mainly extract some rare cases.

6 Conclusion

In this paper, we proposed PWC, which fills in the gaps between development datasets and application datasets based on the partial Wasserstein divergence. We also proved the submodularity of the PWC problem, leading to a greedy algorithm with a guaranteed approximation ratio. In addition, we proposed quasi-greedy algorithms based on sensitivity analysis and derivatives of Sinkhorn iterations. Experimentally, we demonstrated that the proposed method covers the major areas of development datasets whose densities are lower than those of the corresponding areas in application datasets. The main limitation of PWC is scalability; the space complexity of the dual simplex or Sinkhorn iteration is at least $O(N_{\text{app}}\cdot N_{\text{dev}})$, which might require approximations, e.g., stochastic optimization [47], slicing [3], and neural networks [48]. As a potential risk, if infrequent patterns in application datasets are as important as frequent patterns, our proposed method may ignore some important cases, and it may be desirable to combine our covering with methods focusing on fairness.

References

  • [1] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223. PMLR, 06–11 Aug 2017.
  • [2] David Alvarez-Melis and Nicolo Fusi. Geometric dataset distances via optimal transport. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33, volume 33, pages 21428–21439. Curran Associates, Inc., 2020.
  • [3] Nicolas Bonneel and David Coeurjolly. SPOT: Sliced partial optimal transport. ACM Transactions on Graphics (TOG), 38(4):1–13, 2019.
  • [4] Markus M Breunig, Hans-Peter Kriegel, Raymond T Ng, and Jörg Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104, 2000.
  • [5] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
  • [6] Alessio Figalli. The optimal partial transport problem. Archive for rational mechanics and analysis, 195(2):533–560, 2010.
  • [7] Laetitia Chapel, Mokhtar Alaya, and Gilles Gasso. Partial optimal transport with applications on positive-unlabeled learning. In Advances in Neural Information Processing Systems 33, pages 2903–2913, 2020.
  • [8] Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré. Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
  • [9] Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning, 11(5-6):355–607, 2019.
  • [10] Robert J Vanderbei. Linear programming, volume 3. Springer, 2015.
  • [11] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions—i. Mathematical programming, 14(1):265–294, 1978.
  • [12] Amotz Bar-Noy and George Rabanca. Tight approximation bounds for the seminar assignment problem. In International Workshop on Approximation and Online Algorithms, pages 170–182. Springer, 2016.
  • [13] Richard M Karp. Reducibility among combinatorial problems. In Complexity of computer computations, pages 85–103. Springer, 1972.
  • [14] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.
  • [15] J Arturo Olvera-López, J Ariel Carrasco-Ochoa, J Francisco Martínez-Trinidad, and Josef Kittler. A review of instance selection methods. Artificial Intelligence Review, 34(2):133–143, 2010.
  • [16] Peter Hart. The condensed nearest neighbor rule (corresp.). IEEE transactions on information theory, 14(3):515–516, 1968.
  • [17] GL Ritter. An algorithm for a selective nearest neighbor decision rule. IEEE Trans. Information Theory, 21(11):665–669, 1975.
  • [18] Chien-Hsing Chou, Bo-Han Kuo, and Fu Chang. The generalized condensed nearest neighbor rule as a data reduction method. In Proceedings of the 18th International Conference on Pattern Recognition, volume 2, pages 556–559. IEEE, 2006.
  • [19] Dennis L Wilson. Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, (3):408–421, 1972.
  • [20] Fernando Vázquez, J Salvador Sánchez, and Filiberto Pla. A stochastic approach to wilson’s editing algorithm. In Iberian conference on pattern recognition and image analysis, pages 35–42. Springer, 2005.
  • [21] D Randall Wilson and Tony R Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3):257–286, 2000.
  • [22] Henry Brighton and Chris Mellish. Advances in instance selection for instance-based learning algorithms. Data Mining and Knowledge Discovery, 6(2):153–172, 2002.
  • [23] José C Riquelme, Jesús S Aguilar-Ruiz, and Miguel Toro. Finding representative patterns with ordered projections. Pattern Recognition, 36(4):1009–1018, 2003.
  • [24] James C Bezdek and Ludmila I Kuncheva. Nearest prototype classifier designs: An experimental study. International Journal of Intelligent Systems, 16(12):1445–1473, 2001.
  • [25] Huan Liu and Hiroshi Motoda. On issues of instance selection. Data Mining and Knowledge Discovery, 6(2):115–130, 2002.
  • [26] Barbara Spillmann, Michel Neuhaus, Horst Bunke, Elżbieta Pekalska, and Robert PW Duin. Transforming strings to vector spaces using prototype selection. In Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR), pages 287–296. Springer, 2006.
  • [27] Mohiuddin Ahmed. Data summarization: a survey. Knowledge and Information Systems, 58(2):249–273, 2019.
  • [28] Marko Mitrovic, Ehsan Kazemi, Morteza Zadimoghaddam, and Amin Karbasi. Data summarization at scale: A two-stage submodular approach. In Proceedings of the 35th International Conference on Machine Learning, pages 3596–3605. PMLR, 2018.
  • [29] Hui Lin and Jeff Bilmes. A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 510–520, 2011.
  • [30] Baharan Mirzasoleiman, Morteza Zadimoghaddam, and Amin Karbasi. Fast distributed submodular cover: Public-private data summarization. In Advances in Neural Information Processing Systems 29, pages 3594–3602, 2016.
  • [31] Sebastian Tschiatschek, Rishabh K Iyer, Haochen Wei, and Jeff A Bilmes. Learning mixtures of submodular functions for image collection summarization. In Advances in Neural Information Processing Systems 27, pages 1413–1421, 2014.
  • [32] Bernhard Schölkopf, Robert C Williamson, Alexander J Smola, John Shawe-Taylor, John C Platt, et al. Support vector method for novelty detection. In Advances in Neural Information Processing Systems 12, volume 12, pages 582–588. Citeseer, 1999.
  • [33] Ki Hyun Kim, Sangwoo Shim, Yongsub Lim, Jongseob Jeon, Jeongwoo Choi, Byungchan Kim, and Andre S Yoon. Rapp: Novelty detection with reconstruction along projection pathway. In International Conference on Learning Representations, 2019.
  • [34] Jinwon An and Sungzoon Cho. Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE, 2(1):1–18, 2015.
  • [35] Alex Holub, Pietro Perona, and Michael C Burl. Entropy-based active learning for object recognition. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8. IEEE, 2008.
  • [36] Luiz FS Coletta, Moacir Ponti, Eduardo R Hruschka, Ayan Acharya, and Joydeep Ghosh. Combining clustering and active learning for the detection and learning of new image classes. Neurocomputing, 358:150–165, 2019.
  • [37] Changjian Shui, Fan Zhou, Christian Gagné, and Boyu Wang. Deep active learning: Unified and principled method for query and training. In International Conference on Artificial Intelligence and Statistics, pages 1308–1318. PMLR, 2020.
  • [38] Charles S ReVelle and Ralph W Swain. Central facilities location. Geographical analysis, 2(1):30–42, 1970.
  • [39] IBM ILOG CPLEX Optimization Studio. Cplex user’s manual. Vers, 12:207–258, 2013.
  • [40] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
  • [41] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • [42] Olivier Bachem, Mario Lucic, and Andreas Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
  • [43] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, June 2020.
  • [44] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research, 2013.
  • [45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [46] Sébastien Marcel and Yann Rodriguez. Torchvision the machine-vision package of torch. In Proceedings of the 18th ACM International Conference on Multimedia, pages 1485–1488, 2010.
  • [47] Genevay Aude, Marco Cuturi, Gabriel Peyré, and Francis Bach. Stochastic optimization for large-scale optimal transport. arXiv preprint arXiv:1605.08527, 2016.
  • [48] Yujia Xie, Minshuo Chen, Haoming Jiang, Tuo Zhao, and Hongyuan Zha. On scalable and efficient computation of large scale optimal transport. In International Conference on Machine Learning, pages 6882–6892. PMLR, 2019.
  • [49] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems 26, pages 2292–2300, 2013.

Appendix A Reformulation of Eq.(2) as a MILP

The MILP formulation of Eq. (2) is obtained from the partial Wasserstein divergence in Eq. (1) with marginal masses $\mathbf{a}$ and $\mathbf{b}(S)$, where the marginal mass $\mathbf{b}(S)$ depends on which subset $S\subseteq\mathcal{D}_{\mathrm{cand}}$ is selected. Precisely, the marginal mass on the $\mathcal{D}_{\mathrm{app}}$ side (i.e., the source) is $a_{i}=\frac{1}{N_{\mathrm{app}}}$. The mass on the $\mathcal{D}_{\mathrm{cand}}\cup\mathcal{D}_{\mathrm{dev}}$ side (i.e., the target) is given by Eq. (3), meaning that $b_{j}(S)$ represents the mass that can be transported to $j$. This parameterization allows Eq. (2) to be written as an MILP.

Let $y_{j}\in\{0,1\}$, $j\in[\![N_{\mathrm{cand}}]\!]$, be 0–1 decision variables. The value $y_{j}=1$ means that the data point $\mathbf{s}^{(j)}\in\mathcal{D}_{\mathrm{cand}}$ is included in $S$ (i.e., $\mathbf{s}^{(j)}\in S$), and $y_{j}=0$ means $\mathbf{s}^{(j)}\notin S$. Continuous decision variables $P_{ij}\in[0,1]$ represent the transportation mass between $\mathcal{D}_{\mathrm{app}}$ and $\mathcal{D}_{\mathrm{cand}}\cup\mathcal{D}_{\mathrm{dev}}$ for each data pair $i\in[\![N_{\mathrm{app}}]\!]$ and $j\in[\![N_{\mathrm{cand}}+N_{\mathrm{dev}}]\!]$. The minimization problem between the two sets with marginal masses $\mathbf{a}$ and $\mathbf{b}(S)$ is written as the following MILP:

$\displaystyle\min_{\mathbf{P}\in[0,1]^{N_{\mathrm{app}}\times(N_{\mathrm{cand}}+N_{\mathrm{dev}})},\ \{y_{j}\}_{j=1}^{N_{\mathrm{cand}}}}\ \langle\mathbf{P},\mathbf{C}\rangle$ (8)

subject to

$\displaystyle\sum_{j\in[\![N_{\mathrm{cand}}+N_{\mathrm{dev}}]\!]}P_{ij}=\frac{1}{N_{\mathrm{app}}}$ for all $i\in[\![N_{\mathrm{app}}]\!]$ (9)
$\displaystyle\sum_{i\in[\![N_{\mathrm{app}}]\!]}P_{ij}\leq\frac{y_{j}}{N_{\mathrm{dev}}}$ for all $1\leq j\leq N_{\mathrm{cand}}$ (10)
$\displaystyle\sum_{i\in[\![N_{\mathrm{app}}]\!]}P_{ij}\leq\frac{1}{N_{\mathrm{dev}}}$ for all $N_{\mathrm{cand}}+1\leq j\leq N_{\mathrm{cand}}+N_{\mathrm{dev}}$ (11)
$\displaystyle\sum_{j\in[\![N_{\mathrm{cand}}]\!]}y_{j}\leq K$ (12)
$y_{j}\in\{0,1\}$ for all $j\in[\![N_{\mathrm{cand}}]\!]$ (13)
$P_{ij}\in[0,1]$ for all $i\in[\![N_{\mathrm{app}}]\!]$, $j\in[\![N_{\mathrm{cand}}+N_{\mathrm{dev}}]\!]$ (14)

Constraints (9)–(11) enforce the marginal requirements of the partial Wasserstein divergence in Eq. (1). Constraints (10) and (12) together ensure that at most $K$ data points are selected from $\mathcal{D}_{\mathrm{cand}}$, since a candidate column $j$ can receive mass only if $y_{j}=1$. Constraints (13)–(14) define the domains of the decision variables.
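For illustration, the following is a minimal Python sketch of the MILP (8)–(14) using the open-source PuLP modeling library (any MILP solver, e.g., CPLEX [39], could be used instead); the function name and its arguments are ours and not part of the paper's implementation.

import pulp

def solve_pwc_milp(C, N_app, N_cand, N_dev, K):
    """Sketch of the MILP (8)-(14); C is the (N_app x (N_cand + N_dev)) cost matrix."""
    prob = pulp.LpProblem("partial_wasserstein_covering", pulp.LpMinimize)
    # Transport-plan variables P_ij in [0, 1] and selection variables y_j in {0, 1}.
    P = [[pulp.LpVariable(f"P_{i}_{j}", lowBound=0, upBound=1)
          for j in range(N_cand + N_dev)] for i in range(N_app)]
    y = [pulp.LpVariable(f"y_{j}", cat=pulp.LpBinary) for j in range(N_cand)]
    # Objective (8): total transportation cost <P, C>.
    prob += pulp.lpSum(C[i][j] * P[i][j]
                       for i in range(N_app) for j in range(N_cand + N_dev))
    # (9): each application point distributes exactly 1/N_app of mass.
    for i in range(N_app):
        prob += pulp.lpSum(P[i][j] for j in range(N_cand + N_dev)) == 1.0 / N_app
    # (10): candidate columns can receive mass only when selected.
    for j in range(N_cand):
        prob += pulp.lpSum(P[i][j] for i in range(N_app)) <= (1.0 / N_dev) * y[j]
    # (11): development columns receive at most 1/N_dev of mass.
    for j in range(N_cand, N_cand + N_dev):
        prob += pulp.lpSum(P[i][j] for i in range(N_app)) <= 1.0 / N_dev
    # (12): select at most K candidates.
    prob += pulp.lpSum(y) <= K
    prob.solve()
    selected = [j for j in range(N_cand) if pulp.value(y[j]) > 0.5]
    return selected, pulp.value(prob.objective)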

Appendix B Pseudocodes of proposed algorithms

This section provides pseudocode for the proposed methods evaluated in the main text: the greedy-based algorithms PW-greedy-* and the sensitivity-based algorithms PW-sensitivity-*.

B.1 Greedy-based algorithms

We first explain the proposed greedy algorithm PW-greedy-LP, which is based on the LP formulation of the partial Wasserstein divergence; the pseudocode is given in Algorithm 2 below. In each of the up to $K$ insertion iterations, we compute the partial Wasserstein divergence obtained by tentatively adding each candidate $\mathbf{s}^{(j)}$, $j\in[\![N_{\text{cand}}]\!]$, with $\mathbf{s}^{(j)}\notin S$. We then add to $S$ the candidate whose addition minimizes the divergence between the two datasets (i.e., maximizes the covering objective).

Algorithm 2 Greedy data selection for the PWC problem (PW-greedy-LP)
1:Input: $\mathcal{D}_{\text{app}}$, $\mathcal{D}_{\text{dev}}$, $\mathcal{D}_{\text{cand}}$, $K$
2:Output: $S$
3:$S\leftarrow\{\}$
4:while $|S|<K$ do \triangleright Iterate until $K$ data points are selected
5:     Compute $d_{j}\coloneqq\mathcal{PW}^{2}(\mathcal{D}_{\text{app}},S\cup\{\mathbf{s}^{(j)}\}\cup\mathcal{D}_{\text{dev}})$ for each $j\in[\![N_{\text{cand}}]\!]$ with $\mathbf{s}^{(j)}\notin S$ \triangleright $O(N_{\text{cand}}\cdot C_{\mathit{PW}})$ per iteration: each divergence costs $O(C_{\mathit{PW}})$ and is evaluated $N_{\text{cand}}$ times
6:     Select $j^{\star}=\operatorname*{arg\,min}_{j\in[\![N_{\text{cand}}]\!],\,\mathbf{s}^{(j)}\notin S}d_{j}$
7:     $S\leftarrow S\cup\{\mathbf{s}^{(j^{\star})}\}$
8:end while

Note that PW-greedy-ent is obtained by simply replacing the LP computation of $\mathcal{PW}^{2}(\cdot,\cdot)$ with (partial) Sinkhorn iterations that compute the entropy-regularized partial Wasserstein divergence $\mathcal{PW}^{2}_{\epsilon}(\cdot,\cdot)$.
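For concreteness, the following is a minimal Python sketch of this greedy loop. It represents datasets as NumPy feature matrices and uses the POT library's ot.partial.partial_wasserstein2 as the $\mathcal{PW}^{2}$ oracle; this solver choice and the function names are our assumptions for the sketch, not the paper's exact implementation.

import numpy as np
import ot  # POT; we assume ot.partial.partial_wasserstein2 is available

def pw_divergence(X_app, X_tgt, n_dev):
    """Evaluate PW^2: source mass 1/N_app per point, target mass 1/N_dev per point."""
    n_app = X_app.shape[0]
    a = np.full(n_app, 1.0 / n_app)
    b = np.full(X_tgt.shape[0], 1.0 / n_dev)
    C = ot.dist(X_app, X_tgt)          # squared Euclidean cost matrix
    # Transport all of the source mass (m = 1); the target side may hold extra mass.
    return ot.partial.partial_wasserstein2(a, b, C, m=a.sum())

def greedy_pwc(X_app, X_dev, X_cand, K):
    """PW-greedy-LP-style selection (Algorithm 2), written as plain Python."""
    n_dev = X_dev.shape[0]
    selected = []
    for _ in range(K):
        best_j, best_d = None, np.inf
        for j in range(X_cand.shape[0]):
            if j in selected:
                continue
            X_tgt = np.vstack([X_dev, X_cand[selected + [j]]])
            d = pw_divergence(X_app, X_tgt, n_dev)
            if d < best_d:
                best_j, best_d = j, d
        selected.append(best_j)
    return selected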

B.2 $C$-transform-based sensitivity algorithm

The developed sensitivity-based algorithms (e.g., PW-sensitivity-LP) reduce the cost of evaluating all divergences $d_{j}$ in the greedy algorithms by estimating the importance of the $j$-th data point from its sensitivity value. In the LP-based method, these values are obtained from the sensitivity-analysis functions provided by the LP solver. Furthermore, gradients of the partial Wasserstein divergence can be used when the entropy-regularized formulation is employed. These methods are described in the main text.

Here, we give further details on how the $C$-transform is used to accelerate the sensitivity-based algorithm; the pseudocode is given in Algorithm 3 below. In each of the up to $K$ iterations, we compute the optimal dual solutions $\mathbf{f}^{\star},\mathbf{g}^{\star}$ of an LP of size $N_{\mathrm{app}}\times(|S|+N_{\mathrm{dev}})$ instead of the full size $N_{\mathrm{app}}\times(N_{\mathrm{cand}}+N_{\mathrm{dev}})$, which yields a practical speed-up. Once the dual optimal solutions have been computed by the solver, we can estimate the effect of adding a new data point $\mathbf{s}^{(j)}$ using these solutions, because they remain feasible even after $\mathbf{s}^{(j)}$ is added to the LP of the next step. The effect of adding $\mathbf{s}^{(j)}$ is estimated via the $C$-transform as discussed in the main text (see also Chapter 3 of [9]), and this estimate is available in constant time per candidate (see Line 6). Although this $C$-transform-based algorithm has the same computational complexity as PW-sensitivity-LP, the trick reduces the total computation time, as evaluated in our experiments.

Algorithm 3 Data selection for the PWC problem with sensitivity analysis and $C$-transform (PW-sensitivity-Ctrans)
1:Input: $\mathcal{D}_{\text{app}}$, $\mathcal{D}_{\text{dev}}$, $\mathcal{D}_{\text{cand}}$, $K$
2:Output: $S$
3:$S\leftarrow\{\}$
4:while $|S|<K$ do
5:     Compute $\mathcal{PW}^{2}(\mathcal{D}_{\text{app}},S\cup\mathcal{D}_{\text{dev}})$ with an LP of size $N_{\text{app}}\times(|S|+N_{\text{dev}})$, and obtain the optimal duals $\mathbf{f}^{\star}\in\mathbb{R}^{N_{\text{app}}}$ and $\mathbf{g}^{\star}\in\mathbb{R}^{|S|+N_{\text{dev}}}$
6:     Compute $C$-transforms: for each $j\in[\![N_{\text{cand}}]\!]$ with $\mathbf{s}^{(j)}\notin S$, evaluate the effect of adding $\mathbf{s}^{(j)}$ as $g_{j}^{\mathbf{C}}\coloneqq\min\{0,\min_{i\in[\![N_{\text{app}}]\!]}C_{i,j}-f_{i}^{\star}\}$
7:     $j^{\star}=\operatorname*{arg\,min}_{j\in[\![N_{\text{cand}}]\!],\,\mathbf{s}^{(j)}\notin S}g^{\mathbf{C}}_{j}$
8:     $S\leftarrow S\cup\{\mathbf{s}^{(j^{\star})}\}$
9:end while
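Line 6 of Algorithm 3 is a simple vectorized operation. A minimal NumPy sketch is given below; C_cand and f_star denote the application-to-candidate cost columns and the dual potentials obtained in Line 5, and these names are ours.

import numpy as np

def c_transform_scores(C_cand, f_star):
    """Estimate the effect of adding each candidate via the C-transform.

    C_cand : (N_app, N_cand) costs between application data and candidates.
    f_star : (N_app,) optimal dual potentials on the application side.
    Returns g_j^C = min(0, min_i C[i, j] - f*_i) for every candidate j.
    """
    # Column-wise minimum of C_ij - f_i, clipped at zero as in Algorithm 3, Line 6.
    return np.minimum(0.0, (C_cand - f_star[:, None]).min(axis=0))

# One quasi-greedy step then picks j_star = argmin_j of these scores,
# restricted to candidates s^(j) not yet in S.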

B.3 Sinkhorn algorithms

The entropy-regularized Wasserstein divergence [49] has attracted much attention because it can be computed quickly on GPUs using simple matrix multiplications, known as the Sinkhorn iterations. The entropy-regularized Wasserstein distance is formally defined as

$\displaystyle\mathcal{W}^{2}_{\epsilon}(X,Y)=\min_{\mathbf{P}\in U(\mathbf{a},\mathbf{b})}\langle\mathbf{P},\mathbf{C}\rangle-\epsilon H(\mathbf{P}),$ (15)
$\displaystyle U(\mathbf{a},\mathbf{b})=\{\mathbf{P}\in[0,1]^{m\times n}\ |\ \mathbf{P}\mathbbm{1}_{n}=\mathbf{a},\ \mathbf{P}^{\top}\mathbbm{1}_{m}=\mathbf{b}\},$ (16)
$\displaystyle H(\mathbf{P})=-\sum_{i=1}^{m}\sum_{j=1}^{n}P_{ij}(\log P_{ij}-1),$ (17)

where $\epsilon>0$. An important property of the Sinkhorn iterations is that they are differentiable with respect to the inputs and the constraints; for example, $\partial\mathcal{W}^{2}_{\epsilon}(X,Y)/\partial X$ and $\partial\mathcal{W}^{2}_{\epsilon}(X,Y)/\partial b_{j}$ can be obtained by automatic differentiation. Note that we add a small value to $b_{j}$ when computing this derivative. Sinkhorn-like iterations on GPUs can also be employed to solve the entropy-regularized partial Wasserstein divergence [8].

The partial derivatives $\partial\mathcal{W}^{2}_{\epsilon}(X,Y)/\partial X$ and $\partial\mathcal{W}^{2}_{\epsilon}(X,Y)/\partial b_{j}$ can be computed after evaluating $\mathcal{PW}^{2}_{\epsilon}$. For example, replacing $\mathcal{PW}^{2}$ with $\mathcal{PW}^{2}_{\epsilon}$ in PW-greedy-LP yields another proposed method, PW-greedy-ent. Furthermore, instead of selecting $j^{\star}=\arg\min_{j\in[\![N_{\mathrm{cand}}]\!],\mathbf{s}^{(j)}\notin S}d_{j}$ with $d_{j}:=\mathcal{PW}^{2}(\mathcal{D}_{\mathrm{app}},S\cup\{\mathbf{s}^{(j)}\}\cup\mathcal{D}_{\mathrm{dev}})$, we can select the $j^{\star}$ that minimizes the corresponding partial derivative of the entropy-regularized divergence, which yields another proposed method, PW-sensitivity-ent.
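To illustrate how such derivatives are obtained by automatic differentiation, the following is a minimal PyTorch sketch of a plain (balanced) Sinkhorn loop in which the gradient with respect to the target marginal b is recovered by backpropagation. It is a simplified stand-in for the entropic partial solver used in the paper, with arbitrary toy data and hyperparameters.

import torch

def sinkhorn_loss(C, a, b, eps=0.05, n_iter=200):
    """Entropy-regularized OT cost <P, C> via Sinkhorn iterations (balanced case)."""
    K = torch.exp(-C / eps)                 # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iter):
        u = a / (K @ (b / (K.t() @ u)))     # alternate scaling updates
    v = b / (K.t() @ u)
    P = u[:, None] * K * v[None, :]         # transport plan
    return (P * C).sum()

# Example: gradient of the cost with respect to the target marginal b.
m, n = 5, 7
X, Y = torch.randn(m, 2), torch.randn(n, 2)
C = torch.cdist(X, Y) ** 2                  # squared Euclidean ground cost
a = torch.full((m,), 1.0 / m)
b = torch.full((n,), 1.0 / n, requires_grad=True)
loss = sinkhorn_loss(C, a, b)
loss.backward()                             # b.grad approximates dW_eps^2 / db_j
print(b.grad)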

Appendix C Details of Experiments

C.1 LeNet-like network

We show the PyTorch code of the LeNet-like network used for the entropy-based active-learning baseline in the MNIST experiment. The network is trained with the SGD optimizer (learning rate 0.001, momentum 0.9) to minimize the cross-entropy loss for 20 epochs; a minimal sketch of this training loop follows the network definition below.

import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3)   # 28x28x1  -> 26x26x32
        self.conv2 = nn.Conv2d(32, 64, 3)  # 26x26x32 -> 24x24x64
        self.pool = nn.MaxPool2d(2, 2)     # 24x24x64 -> 12x12x64
        self.dropout1 = nn.Dropout2d()
        self.fc1 = nn.Linear(12 * 12 * 64, 128)
        self.dropout2 = nn.Dropout2d()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.dropout1(x)
        x = x.view(-1, 12 * 12 * 64)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        x = self.fc2(x)
        return x
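For completeness, a minimal sketch of the training loop under the stated hyperparameters is given below; train_loader denotes a hypothetical DataLoader over the MNIST training set and is not part of the original code.

import torch.nn as nn
import torch.optim as optim

model = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

for epoch in range(20):                  # 20 epochs, as stated above
    for images, labels in train_loader:  # hypothetical MNIST DataLoader
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()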

C.2 Finding a completely missing category

In Sec. 5.2, we consider the situation where the frequency of a label in the development dataset is very low compared with that in the application dataset. Here, we consider another situation, in which a label is completely missing from the development dataset. Fig. 6 illustrates the relative frequency of the missing category in the selected data. In this situation, the partial Wasserstein covering methods (i.e., PW-$\star$) outperform the baselines, including the active learning methods (i.e., $k$-center, $k$-center-greedy, and active-ent) and the anomaly detection method (i.e., LOF).

Figure 6: Relative frequency of completely missing category in selected data (MNIST dataset). The black bars denote the standard deviations.

C.3 Finding a missing category on CIFAR-10

We perform the missing-category finding experiment using the CIFAR-10 dataset, considering the airplane label (the first label in CIFAR-10) as the missing label. As the ground metric, we employ the $L^{2}$-norm between feature vectors extracted by a neural network. We obtain the network architecture from torchhub (https://github.com/chenyaofo/pytorch-cifar-models) and train it from scratch on a subset of the CIFAR-10 training data using the Adam optimizer (with the default PyTorch parameters) for 10 epochs. From this training subset, we remove missing-category data so that the percentage of the missing category matches that of the development data (i.e., 0.5%). The other settings are the same as in Sec. 5.2. Fig. 7 illustrates the results, indicating that our covering algorithms outperform the baselines, although PW-sensitivity-ent failed to find the missing category; this could be because the Sinkhorn iterations did not converge sufficiently.
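The ground cost can be assembled as follows; this is a minimal sketch in which feature_extractor is a hypothetical module returning the network's feature vectors (e.g., its penultimate layer), not the exact code used in the experiments.

import torch

def ground_cost(feature_extractor, X_app, X_tgt):
    """Squared L2 distances between extracted features, used as the cost matrix C."""
    with torch.no_grad():
        F_app = feature_extractor(X_app)   # (N_app, d) feature vectors
        F_tgt = feature_extractor(X_tgt)   # (N_tgt, d) feature vectors
    return torch.cdist(F_app, F_tgt) ** 2  # C_ij = ||f_i - f_j||_2^2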

Figure 7: Relative frequency of missing category in selected data (CIFAR-10 dataset). The black bars denote the standard deviations.

C.4 Missing scene extraction in real driving datasets with varying numbers of data

In Sec. 5.3, we showed results with 1,500 data points in the development dataset and 3,000 in the application dataset. Here, we show additional results for varying numbers of data, together with the top-8 selected images for each method. These results show that partial Wasserstein covering successfully extracts the major difference between the datasets (i.e., night scenes).

(a) Partial Wasserstein covering (PW-sensitivity-LP)
(b) LOF
(c) Coreset (k-center greedy)
(d) Random
Figure 8: Development dataset of 500 images and application dataset of 1,000 images.
(a) Partial Wasserstein covering (PW-sensitivity-LP)
(b) LOF
(c) Coreset (k-center greedy)
(d) Random
Figure 9: Development dataset of 1,000 images and application dataset of 2,000 images.
(a) Partial Wasserstein covering (PW-sensitivity-LP)
(b) LOF
(c) Coreset (k-center greedy)
(d) Random
Figure 10: Development dataset of 1,500 images and application dataset of 3,000 images.

Appendix D Discussion

D.1 General partial OT and unbalanced OT

We have only considered the formulation of the partial Wasserstein divergence in Eq. (1), although other optimal transport formulations could be used. We note that general partial OT [6] for discrete distributions is actually identical to Eq. (1) when $\mathbf{a}^{\top}\mathbbm{1}_{n}\leq\mathbf{b}^{\top}\mathbbm{1}_{m}$. A covering problem based on unbalanced OT, i.e., $\min_{\mathbf{P}}\langle\mathbf{P},\mathbf{C}\rangle+\tau_{1}D_{1}(\mathbf{P}\mathbbm{1}_{n}\,|\,\mathbf{a})+\tau_{2}D_{2}(\mathbf{P}^{\top}\mathbbm{1}_{m}\,|\,\mathbf{b})$, could also be considered, although submodularity may not hold in this case. We also emphasize that the strengths of the penalty terms (i.e., $\tau_{1},\tau_{2}$) would need to be tuned so that the number of selected data points stays within the budget.
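As an illustration of the unbalanced alternative, the snippet below evaluates an entropy-regularized unbalanced OT objective with KL marginal penalties using the POT library; the function ot.sinkhorn_unbalanced2, the toy data, and the penalty values are our assumptions for this sketch and are not used in the paper.

import numpy as np
import ot  # POT; we assume ot.sinkhorn_unbalanced2 with KL marginal penalties is available

# Toy source/target point clouds and uniform masses.
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(30, 2)), rng.normal(size=(40, 2))
a = np.full(30, 1.0 / 30)
b = np.full(40, 1.0 / 40)
C = ot.dist(X, Y)  # squared Euclidean ground cost

# reg: entropy regularization; reg_m: strength of the KL marginal penalties (tau_1 = tau_2).
loss = ot.sinkhorn_unbalanced2(a, b, C, reg=0.05, reg_m=1.0)
print(loss)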