

Information Elicitation Meets Clustering

Yuqing Kong
The Center on Frontiers of Computing Studies,
Peking University
[email protected]
Abstract

In the setting where we want to aggregate people’s subjective evaluations, plurality vote may be meaningless when a large number of low-effort people always report “good” regardless of the true quality. The “surprisingly popular” method, which picks the most surprising answer compared to the prior, handles this issue to some extent. However, it is still not fully robust to people’s strategies. Here, in the setting where a large number of people are asked to answer a small number of multi-choice questions (multi-task, large group), we propose an information aggregation method that is robust to people’s strategies. Interestingly, this method can be seen as a rotated “surprisingly popular”. It is based on a new clustering method, Determinant MaxImization (DMI) clustering, and a key conceptual idea: information elicitation without ground truth can be seen as a clustering problem. Of independent interest, DMI-clustering is a general clustering method that aims to maximize the volume of the simplex consisting of each cluster’s mean, multiplied by the product of the cluster sizes. We show that DMI-clustering is invariant to any non-degenerate affine transformation applied to all data points. When the data points’ dimension is a constant, DMI-clustering can be solved in polynomial time. In general, we present a simple heuristic for DMI-clustering which is very similar to Lloyd’s algorithm for k-means. Additionally, we apply the clustering idea in the single-task setting and use a spectral method to propose a new aggregation method that utilizes the second-moment information elicited from the crowd.

1 Introduction

When we want to forecast an event, make a decision, evaluate a product, or test a policy, we need to elicit information from the crowd and then aggregate it. In many scenarios, we may not have a ground-truth answer to verify the elicited information. For example, when we poll people’s voting intentions or elicit people’s subjective evaluations, there is no ground truth for people’s feedback. Recently, a line of work [17, 18, 20, 14, 15, 13] has explored how to incentivize people to be honest without ground truth. Some work [19, 15] also studies how to aggregate information without ground truth.

A natural way to aggregate information without ground truth is “plurality vote”, i.e., output the most popular option. For example, in product evaluation, when the feedback is 56% “good”, 40% “so so”, 4% “bad”, the plurality answer is “good”. However, if there is a large number of people who always choose “good” regardless of the quality, plurality vote will always output “good” and thus becomes meaningless. A more robust method is “surprisingly popular”, proposed by Prelec et al. [19]. In the above example, if the prior is 70% “good”, 26% “so so”, 4% “bad”, “surprisingly popular” will aggregate the feedback to “so so”, which is the most surprisingly popular answer (\frac{40}{26}>\frac{56}{70},\frac{4}{4}). This handles the above issue to a certain degree. However, “surprisingly popular” may not work in some scenarios.

Example 1.1.

When the ground truth state is “good”, the distribution over people’s feedback is (56% “good”, 40% “so so”, 4% “bad”); when the ground truth state is “so so”, the distribution is (56% “good”, 34% “so so”, 10% “bad”); ground truth state “bad” corresponds to (46% “good”, 44% “so so”, 10% “bad”). Each state shows up with equal probability. Thus, the prior is approximately (52.7% “good”, 39.3% “so so”, 8% “bad”).

In the above example, for both (56% “good”, 34% “so so”, 10% “bad”) and (46% “good”, 44% “so so”, 10% “bad”), the “surprisingly popular” answer is “bad”, while they correspond to different ground truth states.
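To make the issue concrete, the following numpy sketch (an illustration we add here, not part of the original mechanism) applies the surprisingly popular rule to the two feedback distributions above and returns “bad” in both cases:

    import numpy as np

    options = ["good", "so so", "bad"]
    # Prior of Example 1.1: the average of the three state-conditional distributions.
    prior = np.array([0.527, 0.393, 0.08])
    # Observed feedback under ground-truth states "so so" and "bad" respectively.
    feedback_soso = np.array([0.56, 0.34, 0.10])
    feedback_bad = np.array([0.46, 0.44, 0.10])

    def surprisingly_popular(feedback, prior):
        # Pick the option whose observed frequency exceeds its prior frequency the most.
        return options[int(np.argmax(feedback / prior))]

    # Both calls print "bad", although the two ground-truth states differ.
    print(surprisingly_popular(feedback_soso, prior))
    print(surprisingly_popular(feedback_bad, prior))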

Therefore, “surprisingly popular” requires an additional prior assumption about the relationship between the ground truth state and people’s answers [19]. However, even if people’s honest answers satisfy the assumption, people can apply a strategy to their answers that breaks the assumption. Here is an example.

Example 1.2.

When the ground truth state is “good”, the distribution over people’s feedback is (100%, 0%, 0%); when the ground truth state is “so so”, the distribution is (0%, 100%, 0%); ground truth state “bad” corresponds to (0%, 0%, 100%). All people perform the same strategy: given that my private answer is “good”, I report “good” with probability .56, “so so” with probability .4, “bad” with probability .04; given that my answer is “so so”, I report … The strategy is represented by a row-stochastic matrix \begin{bmatrix}.56&.4&.04\\ .56&.34&.1\\ .46&.44&.1\end{bmatrix} where each row is the reporting distribution for the corresponding private answer. In this case, people’s strategic behaviors change the report distributions to those of the previous example, and “surprisingly popular” will not work here.
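The effect of such a common strategy is simply a right-multiplication of the honest report distributions by the strategy matrix; a small numpy check (ours, for illustration) reproduces the distributions of Example 1.1:

    import numpy as np

    # Honest report distributions of Example 1.2: each state yields one answer with certainty.
    honest = np.eye(3)  # rows correspond to states "good", "so so", "bad"

    # The common strategy matrix of Example 1.2 (row-stochastic: each row is the
    # reporting distribution given the private answer).
    S = np.array([
        [0.56, 0.40, 0.04],
        [0.56, 0.34, 0.10],
        [0.46, 0.44, 0.10],
    ])

    # After everyone applies S, each state's report distribution becomes the
    # corresponding row of honest @ S, which is exactly Example 1.1.
    print(honest @ S)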

Thus, a fundamental problem of the “surprisingly popular” method is that, though it is more robust than the plurality vote method, it is not fully robust to people’s strategies.

To this end, we can naturally ask for an information aggregation method that is robust to people’s strategies, that is, regardless of people’s strategies, the aggregation result will not change.

However, this may ask too much since people can permute their answers and no method is robust to this permutation strategy. Thus we relax the requirement: the aggregation result should not change up to a permutation, and we use side information (e.g. the ground truth of a small set of questions) to determine the permutation.

Figure 1: Reduction to clustering: we map each task to a data point that represents the task’s collected feedback (e.g. 30% “good”, 20% “so so”, 50% “bad”) and then cluster the data points to assign each task a final answer.

It turns out that when we have a large number of people and a small set of a priori similar tasks (each task is a multi-choice question), the answer is yes. A key conceptual idea is to reduce information aggregation without ground truth to a clustering/unsupervised learning problem. That is, we use a data point to represent a task’s feedback. Assigning answers to the tasks can then be understood as clustering the data points (Figure 1). Recall the previous example: when people apply a strategy to their reports, all feedback is transformed by the same affine transformation (this holds even if people adopt different strategies, as long as each task is performed by a large number of random people). Therefore, designing a robust information aggregation method is equivalent to designing an affine-invariant clustering method. We then propose an affine-invariant clustering method, Determinant MaxImization (DMI) clustering, and apply it to information elicitation without verification.

DMI-clustering is the technical highlight of this paper. Previous affine-invariant clustering work either uses a different definition of affine-invariance from ours [2, 10] (they require that items that are the same up to affine transformations, e.g. images of the same chair observed from different angles, be clustered together, whereas we require the clustering result to remain the same when all items are transformed by the same affine transformation) or has an underlying probabilistic model assumption and requires a sufficient number of data points [5, 12]. In contrast, DMI-clustering is a non-parametric affine-invariant method that applies to any finite number of data points and does not require any probabilistic model assumption. DMI-clustering provides a relatively affine-invariant evaluation of each clustering, the volume of the simplex consisting of each cluster’s mean multiplied by the product of the cluster sizes, and aims to find the clustering with the highest evaluation.

When we apply DMI-clustering to information elicitation, the dimension of the data points is the number of options, which can be seen as a constant. In this case, we show that there exists a polynomial-time algorithm for DMI-clustering. In general, we present a simple local-maximum-searching heuristic for DMI-clustering, k-cofactors, which is similar to Lloyd’s algorithm [16] for k-means. We illustrate the geometric interpretation of DMI-clustering and k-cofactors for 2d points in Figure 2.

To apply DMI-clustering to multi-task information elicitation, we normalize each question’s feedback as a data point (e.g. (46% A, 44% B, 10% C)). Given multiple questions’ feedback, we apply DMI-clustering to obtain the final answers (to determine each cluster’s label, we can use side information like the ground truth of a small set of questions).

We also compare DMI-clustering with the surprisingly popular method. Interestingly, we find that they are equivalent for binary questions, and in general, DMI-clustering can be understood as a “rotated” surprisingly popular method.

Figure 2: Geometric interpretation of DMI-clustering’s goal and algorithm in 2d. Left: DMI-clustering aims to maximize the area of the triangle that consists of the clusters’ means, multiplied by the product of the cluster sizes. Right: we can update the current partition (orange dashed lines) by 1) assigning the points into clusters based on the current partition; 2) extending the line segments between each cluster’s mean (star) and all points’ mean (grey dashed lines) to obtain the new partition (black dashed lines). When we apply DMI-clustering to multi-task information elicitation, a point here represents a task’s collected feedback.

In addition to the multi-task setting, a clustering view is also very useful in the single-task setting, where we only ask people a single multi-choice question. The “surprisingly popular” method can be seen as a clustering method that only employs the first moment. To obtain the first moment, we can elicit people’s second-moment information (e.g. their predictions of other people’s answers) and reconstruct the first moment from the second moment [19]. From a clustering view, it is very natural to apply a spectral method, which uses the second-moment information, to design a new information aggregation method in the single-task case. If we can use a hypothetical question to elicit higher-order-moment information, we can apply a higher-order-moment clustering method in the single-task setting.

Our results

We connect information elicitation without ground truth with clustering, and

Multi-task & Affine-invariance

we propose an affine-invariant clustering method and use it to construct a multi-task information elicitation mechanism that 1) aggregates information in a way that is robust to people’s strategies when the number of agents is infinite, and 2) reduces the payment variance in some scenarios while still preserving previous multi-task mechanisms’ incentive properties;

Single-task & Methods of moments

we apply the method-of-moments clustering idea in the single-task setting. With only second-moment information, we use a spectral method to propose a new aggregation mechanism that is better than the “surprisingly popular” method in some scenarios.

The main conceptual contribution of this work is the connection between clustering and information elicitation without ground truth. This connection not only brings clustering techniques to information elicitation without verification, but also brings attention to the clustering setting where the input data points, and properties of the data (e.g. moment information), can be elicited from the crowd. We hope this connection will generate more results for information elicitation/aggregation without ground truth in the future.

1.1 Related Work

Information elicitation without ground truth (peer prediction) was proposed by Miller et al. [17]. The core problem is how to incentivize people to report high-quality answers even if the answers cannot be verified. The key idea is to reward each agent based on her report and her peers’ reports. A series of works [18, 8, 20, 14, 15, 13] have focused on this problem and developed reward schemes in a variety of settings. However, in many scenarios, we want to not only incentivize high-quality answers but also aggregate the answers. Thus, some information elicitation work also explores how to aggregate information without ground truth, and there are two approaches.

One approach is to use the plurality of the high-payment agents’ answers. Liu et al. [15] use this approach to aggregate people’s reports based on various payment schemes. However, incentive-compatible payment schemes only guarantee that high-quality answers receive high payments in expectation. If the payment has a high variance, which happens especially when the number of tasks is small, a high-payment agent may not provide high-quality answers. Thus, each agent needs to perform a sufficient number of tasks to reduce the payment variance. A large literature on extracting wisdom from the crowd also uses the idea of first estimating the crowd’s quality and then using the estimated quality to aggregate their opinions (e.g. [6, 22]). The quality can either be estimated from the crowd’s previous performance or from an EM framework proposed by Dawid and Skene [9]. Though the EM-framework-based estimation does not require historical data, it still requires each person to answer a sufficient number of questions to get a good estimation. Moreover, it usually requires an underlying probabilistic model: there exists a ground truth and, conditioning on the ground truth, people’s reports are independent.

Another approach is to directly evaluate the answer itself. A popular method is the “surprisingly popular” method. Prelec et al. [19] use this method in the single-task setting by eliciting people’s predictions of other people’s answers and reconstructing the prior from the predictions. In the setting where the prior is already known, this method can be directly implemented. However, as we mentioned in the introduction, this method requires an additional assumption on the prior.

More importantly, none of these aggregation approaches is robust to people’s strategies. Though for the Determinant based Mutual Information (DMI)-Mechanism [13] the expected payment is relatively robust to other people’s strategies, we still need each agent to perform a sufficient number of tasks to reduce the payment variance. However, the setting where we have a large number of agents and each agent performs a small number of tasks may be more practical. The multi-task setting in this work focuses on this more practical setting and proposes an aggregation method, DMI-clustering, which is robust to people’s strategies. DMI-clustering is inspired by DMI-Mechanism [13]. Unlike the DMI-Mechanism-based aggregation approach, which uses the plurality of the high-payment agents’ answers, DMI-clustering learns the answers that have the highest determinant mutual information with the collection of people’s answers. When we have a large number of participants, the learning process is independent of their strategies, even if each participant only performs a small number of tasks. Xu et al. [24] also employ the properties of Determinant based Mutual Information to design a robust loss function. Unlike this work, Xu et al. [24] focus on the noisy-labels setting rather than the unsupervised clustering setting.

DMI-clustering is a new general clustering method. The most important property of DMI-clustering is affine-invariance. Affine-invariant clustering is well studied in image classification [23, 10, 2]. However, that line of work uses a different definition of affine-invariance from ours: images of the same item observed from different angles should be classified into the same cluster.

Unlike the above work on affine-invariance, we focus on the setting where all data points are transformed by the same affine transformation and we want the clustering of these data points to remain invariant to the transformation. Brubaker and Vempala [5] and Huang and Yang [12] also use this definition and propose affine-invariant clustering methods. However, Brubaker and Vempala [5] focus on the application to mixture models and require knowledge of the number of components and a sufficient number of sample data points. Huang and Yang [12] have an underlying probabilistic model and develop a maximum-likelihood-estimator-based algorithm. In contrast, DMI-clustering is a non-parametric affine-invariant method that applies to any finite number of data points and does not require any probabilistic model assumption. Moreover, DMI-clustering provides a relatively affine-invariant evaluation of each clustering. Thus, when we have side information that constrains the set of clusterings (e.g. we need to pick one of clustering I or clustering II), the DMI-clustering method is still affine-invariant.

DMI-clustering is more similar to k-means clustering. Like k-means, it has a natural geometric interpretation and a simple local-maximum-searching heuristic that uses each cluster’s mean. One difference from k-means is that DMI-clustering cannot pick any k it wants, since it needs to satisfy affine-invariance (if we have d linearly independent d-dimensional data points, k-means can pick any k; however, DMI-clustering must pick k=d and make each individual data point a cluster by itself, since these data points can be mapped to any other linearly independent d-dimensional data points by a proper affine transformation).

The definition of DMI-clustering uses the determinant. The definition of the log-determinant divergence [7] also uses the determinant and is affine-invariant. However, the affine-invariance of the log-determinant divergence again differs from ours: it refers to the property that the divergence between two positive definite matrices is invariant to affine transformations. The geometric interpretation of DMI-clustering is maximizing the volume of the simplex constituted by the clusters’ means, multiplied by the product of the cluster sizes. Thurau et al. [21] also connect simplex volume maximization with clustering. However, their problem is very different from ours and irrelevant to affine-invariance. They aim to design an efficient algorithm for matrix factorization by greedily picking the basis vectors such that the simplex formed by these basis vectors has maximal volume.

2 Determinant MaxImization (DMI) Clustering

In this section, we will introduce a new clustering method, DMI-clustering, that is based on determinant maximization. We will show that this method is invariant to any affine transformation which is performed on all data points due to the multiplicative property of determinant. We will also show that this clustering method will generate a clustering which has a natural geometric interpretation: there will be a representative vector for each cluster and each data point is mapped to the cluster whose representative vector has the highest inner-product with the data point.

Moreover, we will introduce a heuristic algorithm, k-cofactors, which is similar to Lloyd’s algorithm for k-means. When the data points are 2-dimensional, we visualize the algorithm and introduce an interesting geometric way to perform it. All synthetic data used for demonstration is attached in the appendix. We will also show that there exists a polynomial-time algorithm for DMI-clustering when d is a constant. We first introduce the clustering method.

Input: \mathbf{A}\in\mathbb{R}^{n\times d}
  /* clustering the rows of \mathbf{A} */
Output: k, \mathbf{C}^{*}\in\{0,1\}^{n\times k}, L, \mathbf{D}^{*}\in\mathbb{R}^{k\times k}
  /* k clusters, map row i to cluster c if C^{*}_{ic}=1 */
Add column (1,1,\cdots,1)^{\top} to \mathbf{A}, i.e., \mathbf{A}^{+1}=[\mathbf{A}\quad\mathbf{1}]
Find the rank k of \mathbf{A}^{+1}
Pick k linearly independent columns of \mathbf{A}^{+1} arbitrarily to construct matrix \tilde{\mathbf{A}}
L= the set of picked column indices
Let \mathcal{C} be \{\mathbf{C}\in\{0,1\}^{n\times k}|\forall i,\sum_{c}C_{ic}=1\} (we can pick another constraint set \mathcal{C} here; for example, we can use \mathcal{C}^{soft}:=\{\mathbf{C}\in[0,1]^{n\times k}|\forall i,\sum_{c}C_{ic}=1\})
Find \mathbf{C}^{*}\in\mathrm{argmax}_{\mathbf{C}\in\mathcal{C}}\det[\mathbf{C}^{\top}\tilde{\mathbf{A}}]
  /* we call |\det[\mathbf{C}^{\top}\tilde{\mathbf{A}}]| the DMI-score of \mathbf{C}, \text{DMI-S}(\mathbf{C},\mathbf{A}) */
\mathbf{D}^{*}=(\mathbf{C}^{*\top}\tilde{\mathbf{A}})^{-1}
Algorithm 1 DMI-clustering \text{DMI-C}(\mathbf{A})

The core of DMI-clustering is an evaluation metric for each clustering, the DMI-score. We will show that, intuitively, the DMI-score is the volume of the simplex formed by the clusters’ means, multiplied by the product of the cluster sizes. DMI-clustering aims to find the clustering that has the highest DMI-score. A naive attempt to solve the optimization problem in DMI-clustering is to enumerate all possible \mathbf{C}. However, it needs exponential running time in n, even when d is a constant. To obtain a better algorithm, in addition to showing that DMI-clustering is affine-invariant, we also analyze the clustering result of DMI-clustering and find that it is linearly-partitioned. This linearly-partitioned property directly induces a polynomial-time algorithm when d is a constant and a simple local-maximum-searching heuristic in general. We will use the following notation to describe the clustering result of DMI-clustering.
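For intuition, the following is a hedged brute-force sketch of Algorithm 1 in numpy (the function name dmi_clustering_bruteforce is ours); it enumerates every hard assignment and is therefore only feasible for very small n:

    import itertools
    import numpy as np

    def dmi_clustering_bruteforce(A):
        # Brute-force sketch of Algorithm 1: append the all-one column, pick k linearly
        # independent columns, then enumerate every hard assignment C and keep the one
        # with the largest DMI-score |det(C^T A_tilde)|.
        A = np.asarray(A, dtype=float)
        n = A.shape[0]
        A_plus = np.hstack([A, np.ones((n, 1))])
        k = np.linalg.matrix_rank(A_plus)

        # Greedily pick k linearly independent columns of A_plus.
        picked = []
        for j in range(A_plus.shape[1]):
            if np.linalg.matrix_rank(A_plus[:, picked + [j]]) == len(picked) + 1:
                picked.append(j)
            if len(picked) == k:
                break
        A_tilde = A_plus[:, picked]

        best_C, best_score = None, -np.inf
        for labels in itertools.product(range(k), repeat=n):
            C = np.zeros((n, k))
            C[np.arange(n), labels] = 1.0
            score = abs(np.linalg.det(C.T @ A_tilde))  # DMI-score of this assignment
            if score > best_score:
                best_C, best_score = C, score
        return best_C, best_score, picked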

Definition 2.1 (Index of row’s max).

For a matrix \mathbf{M}\in\mathbb{R}^{n\times k}, we define \mathrm{idxmax}(\mathbf{M})\in\mathbb{R}^{n\times k} as a matrix such that for all i, \mathrm{idxmax}(\mathbf{M})’s i^{th} row is a one-hot vector that indicates the index of the largest entry of \mathbf{M}’s i^{th} row. When there is a tie, we define \mathrm{idxmax}(\mathbf{M}) as the set of all matrices whose rows indicate the index of a largest entry of \mathbf{M}’s rows. For example,

\mathrm{idxmax}\left(\begin{bmatrix}.3&.4&.1\\ 0&-1&.1\end{bmatrix}\right)=\begin{bmatrix}0&1&0\\ 0&0&1\end{bmatrix}.
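A small numpy helper matching this definition (breaking ties by the first maximal index, whereas the definition keeps the whole set of matrices) might look like:

    import numpy as np

    def idxmax(M):
        # One-hot matrix marking the largest entry of each row of M.
        # Ties are broken by the first maximal index; the definition above instead
        # returns the set of all such matrices.
        M = np.asarray(M, dtype=float)
        one_hot = np.zeros_like(M)
        one_hot[np.arange(M.shape[0]), np.argmax(M, axis=1)] = 1.0
        return one_hot

    # Reproduces the example above.
    print(idxmax([[0.3, 0.4, 0.1], [0.0, -1.0, 0.1]]))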

We now introduce the theoretical properties of DMI-clustering and visualize them in Figure 3.

Figure 3: Geometric interpretation of DMI-clustering’s properties: 1) legal: when the set of points consists of k\leq d+1 distinct points (e.g. k=3, d=2, \mathbf{A}^{\top}=[\mathbf{a}_{1}^{\top},\mathbf{a}_{1}^{\top},\mathbf{a}_{2}^{\top},\mathbf{a}_{3}^{\top},\mathbf{a}_{2}^{\top}]), each of the distinct points will form a cluster; 2) affine-invariant: when we perform the same affine transformation on all points, the clustering result will not change; 3) linearly-partitioned: for all i\in[n], if row i is mapped to cluster c, then the row vector \tilde{\mathbf{A}}_{i} has the highest inner product with column vector \mathbf{D}_{c} among all column vectors of \mathbf{D}^{*}. Thus, \mathbf{D}^{*} determines a partition (the black dashed lines) and the mean of all points lies at the intersection of the partition lines.
Theorem 2.2.

DMI-clustering is

Legal

when there exists a full-rank \mathbf{C}^{*}\in\mathcal{C} such that \mathbf{A}=\mathbf{C}^{*\top}\mathbf{B} where \mathbf{B} is invertible, for both the hard constraint \mathcal{C}=\{\mathbf{C}\in\{0,1\}^{n\times k}|\forall i,\sum_{c}C_{ic}=1\} and the soft constraint \mathcal{C}^{soft}, \text{DMI-C}(\mathbf{A}) equals \mathbf{C}^{*} (up to column permutation).

Affine-invariant

the DMI-score of \mathbf{C} is relatively invariant to any invertible affine transformation of the data points, i.e., for all invertible \mathbf{T}\in\mathbb{R}^{d\times d} and all \mathbf{b}\in\mathbb{R}^{1\times d}, for all \mathbf{C}_{1},\mathbf{C}_{2} which have non-zero DMI-score with respect to \mathbf{A},

\frac{\text{DMI-S}(\mathbf{C}_{1},\mathbf{A})}{\text{DMI-S}(\mathbf{C}_{2},\mathbf{A})}=\frac{\text{DMI-S}(\mathbf{C}_{1},\mathbf{A}\mathbf{T}+\mathbf{b})}{\text{DMI-S}(\mathbf{C}_{2},\mathbf{A}\mathbf{T}+\mathbf{b})};

thus, for any constraint set \mathcal{C}, \text{DMI-C}(\mathbf{A}\mathbf{T}+\mathbf{b}) is equivalent to \text{DMI-C}(\mathbf{A}) in the sense that any solution of \text{DMI-C}(\mathbf{A}\mathbf{T}+\mathbf{b}) is also a solution of \text{DMI-C}(\mathbf{A}) and vice versa;

Linearly-partitioned

when we maximize over \mathcal{C}=\{\mathbf{C}\in\{0,1\}^{n\times k}|\forall i,\sum_{c}C_{ic}=1\}, \mathbf{C}^{*} is a local maximum of \det[\mathbf{C}^{\top}\tilde{\mathbf{A}}] if and only if \mathbf{C}^{*}\in\mathrm{idxmax}(\tilde{\mathbf{A}}\mathbf{D}^{*}) where \mathbf{D}^{*}=(\mathbf{C}^{*\top}\tilde{\mathbf{A}})^{-1}.

Geometrically, the average of \tilde{\mathbf{A}}’s row vectors lies at the intersection of all partitions, i.e., \bar{\tilde{\mathbf{a}}}*\mathbf{D}^{*}\propto\mathbf{1}, and if \mathbf{A}-\bar{\mathbf{a}}’s rank is d, the optimization goal is equivalent to maximizing the volume of the simplex consisting of the clusters’ means, multiplied by the product of the cluster sizes.

For the simple one-dimensional case, we show that DMI-clustering partitions the numbers based on the mean (Corollary 2.3). For other dimensions, the linearly-partitioned property directly induces an algorithm (Algorithm 2) for finding a local optimum. This algorithm is very similar to Lloyd’s algorithm for k-means: 1. assignment: assign the points according to the current partition; 2. update: update the partition using the current clusters. We call this algorithm k-cofactors since the update step employs the cofactors of the matrix consisting of the current cluster means. In the 2d case, we translate the operations of k-cofactors into natural geometric operations (Algorithm 3, Figure 4). At the end of this section, we also show that the linearly-partitioned property induces a polynomial-time algorithm when the dimension d is a constant, based on properties of shatter functions.

Corollary 2.3 (1d points).

For n real numbers a_{1},a_{2},\cdots,a_{n} where there exist i\neq j such that a_{i}\neq a_{j}, DMI-clustering will partition the numbers into two clusters: above the average and below the average.

Proof of Corollary 2.3.

Since DMI-clustering is affine-invariant, we subtract the average from each number and define the vector \mathbf{v}:=(a_{1}-\bar{a},a_{2}-\bar{a},\cdots,a_{n}-\bar{a}).

\det[[\mathbf{p},\mathbf{1}-\mathbf{p}]^{\top}[\mathbf{v}\quad\mathbf{1}]] =\det\begin{bmatrix}\mathbf{p}^{\top}\mathbf{v}&\mathbf{p}^{\top}\mathbf{1}\\ (\mathbf{1}-\mathbf{p})^{\top}\mathbf{v}&(\mathbf{1}-\mathbf{p})^{\top}\mathbf{1}\end{bmatrix}
\propto\det\begin{bmatrix}\mathbf{p}^{\top}\mathbf{v}&\mathbf{p}^{\top}\mathbf{1}\\ \mathbf{1}^{\top}\mathbf{v}&\mathbf{1}^{\top}\mathbf{1}\end{bmatrix} (add the first row to the second row)
=\det\begin{bmatrix}\mathbf{p}^{\top}\mathbf{v}&\mathbf{p}^{\top}\mathbf{1}\\ 0&\mathbf{1}^{\top}\mathbf{1}\end{bmatrix} (the sum of \mathbf{v} is 0)
\propto\mathbf{p}^{\top}\mathbf{v}

To maximize \mathbf{p}^{\top}\mathbf{v}, the 0-1 vector \mathbf{p} will pick all positive entries of \mathbf{v}. Thus, DMI-clustering partitions the numbers into two clusters: above the average and below the average. ∎
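As a sanity check of Corollary 2.3, the following small numpy experiment (synthetic numbers of our choosing) brute-forces the best two-cluster split and confirms that it coincides with the above/below-the-mean partition (up to swapping the two labels):

    import numpy as np

    # Synthetic 1d example for Corollary 2.3.
    a = np.array([1.0, 2.0, 7.0, 3.0, 10.0])
    v = a - a.mean()
    A_tilde = np.column_stack([a, np.ones_like(a)])  # here k = 2

    best_p, best_score = None, -np.inf
    for mask in range(1, 2 ** len(a) - 1):           # all nontrivial 0-1 vectors p
        p = np.array([(mask >> i) & 1 for i in range(len(a))], dtype=float)
        C = np.column_stack([p, 1 - p])
        score = abs(np.linalg.det(C.T @ A_tilde))
        if score > best_score:
            best_p, best_score = p, score

    # The optimal p (or its complement) marks exactly the entries above the mean.
    above = (v > 0).astype(float)
    assert np.array_equal(best_p, above) or np.array_equal(best_p, 1.0 - above)
    print(best_p, above)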

Input: \tilde{\mathbf{A}}\in\mathbb{R}^{n\times k}
Output: \mathbf{C}\in\{0,1\}^{n\times k}
Initialize \mathbf{C}
\mathbf{D}:=(\mathbf{C}^{\top}\tilde{\mathbf{A}})^{-1}
while \mathbf{C}\notin\mathrm{idxmax}(\tilde{\mathbf{A}}\mathbf{D}) do
       \mathbf{C}\leftarrow\mathrm{idxmax}(\tilde{\mathbf{A}}\mathbf{D}) (update \mathbf{C}’s row only if it does not already indicate the index of the highest entry of the corresponding row of \tilde{\mathbf{A}}\mathbf{D})
        /* assignment step */
       \mathbf{D}:=(\mathbf{C}^{\top}\tilde{\mathbf{A}})^{-1}
        /* update step */

end while
return \mathbf{C}
Algorithm 2 k-cofactors: finding a local maximum of \max_{\mathbf{C}\in\mathcal{C}}\det[\mathbf{C}^{\top}\tilde{\mathbf{A}}]
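A compact numpy sketch of Algorithm 2 (a heuristic rendering that assumes the matrices \mathbf{C}^{\top}\tilde{\mathbf{A}} encountered along the way stay invertible; initialization, e.g. by \mathrm{idxmax}(\tilde{\mathbf{A}}-\bar{\tilde{\mathbf{a}}}) as in Figure 5, is left to the caller):

    import numpy as np

    def k_cofactors(A_tilde, C_init, max_iter=100, tol=1e-12):
        # Heuristic sketch of Algorithm 2; assumes the matrices C^T A_tilde encountered
        # along the way are invertible.
        n, k = C_init.shape
        C = C_init.astype(float)
        for _ in range(max_iter):
            D = np.linalg.inv(C.T @ A_tilde)          # update step
            scores = A_tilde @ D
            row_max = scores.max(axis=1, keepdims=True)
            # Per the remark in Algorithm 2, a row keeps its current cluster if that
            # cluster already attains the row maximum; otherwise it moves to an argmax.
            keep = (scores * C).sum(axis=1, keepdims=True) >= row_max - tol
            moved = np.zeros_like(C)
            moved[np.arange(n), np.argmax(scores, axis=1)] = 1.0
            C_new = np.where(keep, C, moved)          # assignment step
            if np.array_equal(C_new, C):              # C is in idxmax(A_tilde D): stop
                return C
            C = C_new
        return C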
Figure 4: Demonstration of the k-cofactors algorithm: the figure shows a running example for 2d points and the repeated steps during the process. Left: the current \mathbf{D} determines a partition (the orange dashed lines); middle (assignment step): points are clustered by the partition; right (update step): a new \mathbf{D} is generated from the current clusters.
Figure 5: An empirical run of the k-cofactors algorithm: the initial k clusters are created by \mathbf{C}\leftarrow\mathrm{idxmax}(\tilde{\mathbf{A}}-\bar{\tilde{\mathbf{a}}}). Each subfigure shows the current cluster assignment \mathbf{C} and its generated partition \mathbf{D}=(\mathbf{C}^{\top}\tilde{\mathbf{A}})^{-1} (the black dashed lines). Note that the process stops as soon as the generated partition \mathbf{D}=(\mathbf{C}^{\top}\tilde{\mathbf{A}})^{-1} is consistent with the current clusters, i.e., \mathbf{C}\in\mathrm{idxmax}(\tilde{\mathbf{A}}\mathbf{D}). The red circle marks the inconsistent part.
Input: (\mathbf{a}_{1},\mathbf{a}_{2},\cdots,\mathbf{a}_{n})
  /* n 2d points */
Output: \mathbf{C}\in\{0,1\}^{n\times 3}
  /* 3 clusters, \mathbf{a}_{i} belongs to cluster c if C_{ic}=1 */
Initialize \mathbf{C}
Calculate the mean \bar{\mathbf{a}} of all points
  /* the center intersection point in Figure 4 */
Calculate the mean \bar{\mathbf{a}}_{c} of each cluster c=1,2,3
  /* the stars in Figure 4 */
Draw partition lines, starting from \bar{\mathbf{a}}, by extending the line segment [\bar{\mathbf{a}}_{c},\bar{\mathbf{a}}] for all c
  /* the black dashed lines in Figure 4 */
while the partition is not consistent with the current clusters do
       Assign points to clusters based on the current partition
        /* assignment step */
       Update the partition based on the current clusters
        /* update step */

end while
return \mathbf{C}
Algorithm 3 k-cofactors for 2d points
Corollary 2.4 (2d points).

For any \mathbf{A}\in\mathbb{R}^{n\times 2}, if \mathbf{A}-\bar{\mathbf{a}}’s rank is two (otherwise it reduces to the 1d case), then running \text{DMI-C}(\mathbf{A}) (Algorithm 1) with the k-cofactors algorithm (Algorithm 2) is equivalent to Algorithm 3. Geometrically, the optimization goal is equivalent to maximizing the area of the triangle \Delta_{\bar{\mathbf{a}}_{1},\bar{\mathbf{a}}_{2},\bar{\mathbf{a}}_{3}} multiplied by n_{1}n_{2}n_{3}, where n_{c} is the number of points in cluster c.

Proof of Corollary 2.4.

If 𝐀𝐚¯\mathbf{A}-\bar{\mathbf{a}}’s rank is two, then [𝐀𝟏][\mathbf{A}\quad\mathbf{1}] is rank three. Therefore, we will pick 𝐀~=[𝐀𝟏]\tilde{\mathbf{A}}=[\mathbf{A}\quad\mathbf{1}].

𝐂𝐀~=diag(n1,n2,n3)[𝐚¯11𝐚¯21𝐚¯31]\mathbf{C}^{\top}\tilde{\mathbf{A}}=\mathrm{diag}(n_{1},n_{2},n_{3})\begin{bmatrix}&\bar{\mathbf{a}}_{1}&\quad 1\\ &\bar{\mathbf{a}}_{2}&\quad 1\\ &\bar{\mathbf{a}}_{3}&\quad 1\\ \end{bmatrix} where ncn_{c} is the number of points in cluster cc. Thus maximizing det[𝐂𝐀~]\det[\mathbf{C}^{\top}\tilde{\mathbf{A}}] is equivalent to maximizing the area of the triangles Δ𝐚¯1,𝐚¯2,𝐚¯3\Delta_{\bar{\mathbf{a}}_{1},\bar{\mathbf{a}}_{2},\bar{\mathbf{a}}_{3}} multiplying the n1n2n3n_{1}n_{2}n_{3}.

For 𝐃=(𝐂𝐀~)1=[𝐝1,𝐝2,𝐝3]\mathbf{D}=(\mathbf{C}^{\top}\tilde{\mathbf{A}})^{-1}=[\mathbf{d}_{1},\mathbf{d}_{2},\mathbf{d}_{3}], we have

[𝐚¯11]𝐝1=1n1[𝐚¯11]𝐝2=[𝐚¯11]𝐝3=0[\bar{\mathbf{a}}_{1}\quad 1]*\mathbf{d}_{1}=\frac{1}{n_{1}}\quad[\bar{\mathbf{a}}_{1}\quad 1]*\mathbf{d}_{2}=[\bar{\mathbf{a}}_{1}\quad 1]*\mathbf{d}_{3}=0

We also have that

[𝐚¯𝟏]𝐝1=(cncn[𝐚¯c𝟏])𝐝1=1n[\bar{\mathbf{a}}\quad\mathbf{1}]*\mathbf{d}_{1}=(\sum_{c}\frac{n_{c}}{n}[\bar{\mathbf{a}}_{c}\quad\mathbf{1}])*\mathbf{d}_{1}=\frac{1}{n}

and by similar analysis, [𝐚¯𝟏]𝐝2=[𝐚¯𝟏]𝐝3=1n[\bar{\mathbf{a}}\quad\mathbf{1}]*\mathbf{d}_{2}=[\bar{\mathbf{a}}\quad\mathbf{1}]*\mathbf{d}_{3}=\frac{1}{n}

For point 𝐚=λ𝐚¯+(1λ)𝐚¯1\mathbf{a}=\lambda\bar{\mathbf{a}}+(1-\lambda)\bar{\mathbf{a}}_{1},

[𝐚𝟏]𝐝1=λ1n+(1λ)1n1[\mathbf{a}\quad\mathbf{1}]*\mathbf{d}_{1}=\lambda\frac{1}{n}+(1-\lambda)\frac{1}{n_{1}}

and

[𝐚𝟏]𝐝2=[𝐚𝟏]𝐝3=λ1n[\mathbf{a}\quad\mathbf{1}]*\mathbf{d}_{2}=[\mathbf{a}\quad\mathbf{1}]*\mathbf{d}_{3}=\lambda\frac{1}{n}

when λ<1\lambda<1, 𝐚\mathbf{a} should be mapped to cluster 1 and when λ1\lambda\geq 1, 𝐚\mathbf{a} can be mapped to either cluster 2 or cluster 3. Thus, we can obtain the partition line for cluster 2 and 3, call it line 2-3, by extending line segment [𝐚¯1,𝐚¯][\bar{\mathbf{a}}_{1},\bar{\mathbf{a}}]. By analogous analysis for 𝐚¯2,𝐚¯3\bar{\mathbf{a}}_{2},\bar{\mathbf{a}}_{3}, we obtain three partition lines like Figure 4.

For each point 𝐱\mathbf{x} between line 1-2 and 2-3, we can represent it as 𝐱=𝐚¯+s(𝐚¯𝐚¯3)+t(𝐚¯𝐚¯1)\mathbf{x}=\bar{\mathbf{a}}+s(\bar{\mathbf{a}}-\bar{\mathbf{a}}_{3})+t(\bar{\mathbf{a}}-\bar{\mathbf{a}}_{1}) where s,t>0s,t>0. Then [𝐱𝟏]𝐝2=1n(1+s+t)[\mathbf{x}\quad\mathbf{1}]*\mathbf{d}_{2}=\frac{1}{n}(1+s+t), [𝐱𝟏]𝐝1=1n(1+s+t)t1n1[\mathbf{x}\quad\mathbf{1}]*\mathbf{d}_{1}=\frac{1}{n}(1+s+t)-t\frac{1}{n_{1}} and [𝐱𝟏]𝐝3=1n(1+s+t)s1n3[\mathbf{x}\quad\mathbf{1}]*\mathbf{d}_{3}=\frac{1}{n}(1+s+t)-s\frac{1}{n_{3}}. Thus, 𝐱\mathbf{x} belongs to cluster 2. By analogous analysis for other points, we finish the proof.

Proof of Theorem 2.2.

We first prove the affine-invariance. Note that

[𝐀𝐓+𝐛𝟏]=[𝐀𝟏][𝐓0𝐛1].[\mathbf{A}\mathbf{T}+\mathbf{b}\quad\mathbf{1}]=[\mathbf{A}\quad\mathbf{1}]\begin{bmatrix}\mathbf{T}&0\\ \mathbf{b}&1\end{bmatrix}.

Thus, for invertible 𝐓\mathbf{T}, both [𝐀𝐓+𝐛𝟏][\mathbf{A}\mathbf{T}+\mathbf{b}\quad\mathbf{1}] and [𝐀𝟏][\mathbf{A}\quad\mathbf{1}] have the same rank kk.

Since 𝐀~\tilde{\mathbf{A}} are kk linearly independent columns of [𝐀𝟏][\mathbf{A}\quad\mathbf{1}], there exists 𝐏k×(d+1)\mathbf{P}\in\mathbb{R}^{k\times(d+1)} such that [𝐀𝟏]=𝐀~𝐏[\mathbf{A}\quad\mathbf{1}]=\tilde{\mathbf{A}}\mathbf{P}. Therefore, we have [𝐀𝐓+𝐛𝟏]=𝐀~𝐏[𝐓0𝐛1][\mathbf{A}\mathbf{T}+\mathbf{b}\quad\mathbf{1}]=\tilde{\mathbf{A}}\mathbf{P}\begin{bmatrix}\mathbf{T}&0\\ \mathbf{b}&1\end{bmatrix}.

Thus, for any 𝐀~\tilde{\mathbf{A}}^{\prime} that consists of kk linearly independent columns in [𝐀𝐓+𝐛𝟏][\mathbf{A}\mathbf{T}+\mathbf{b}\quad\mathbf{1}], we pick corresponded column vectors of 𝐏[𝐓0𝐛1]\mathbf{P}\begin{bmatrix}\mathbf{T}&0\\ \mathbf{b}&1\end{bmatrix} to construct matrix 𝐐k×k\mathbf{Q}\in\mathbb{R}^{k\times k}. We have 𝐀~=𝐀~𝐐\tilde{\mathbf{A}}^{\prime}=\tilde{\mathbf{A}}\mathbf{Q} and 𝐐\mathbf{Q} is invertible (otherwise 𝐀~\tilde{\mathbf{A}}^{\prime}’s rank will be less than kk which is a contradiction).

Therefore,

det[𝐂𝐀~]=det[𝐂𝐀~]det[𝐐]±det[𝐂𝐀~]\det[\mathbf{C}^{\top}\tilde{\mathbf{A}^{\prime}}]=\det[\mathbf{C}^{\top}\tilde{\mathbf{A}}]\det[\mathbf{Q}]\propto\pm\det[\mathbf{C}^{\top}\tilde{\mathbf{A}}]

due to the multiplicative property of determinant. The relative-invariance of DMI-score and affine-invariance of DMI-clustering directly follows from the above formula.

We then show that DMI-clustering is legal. Note that since it is affine-invariant, we only need to analyze DMI-C(𝐂)\text{DMI-C}(\mathbf{C}^{*}). Since the sum of the column vectors of 𝐂\mathbf{C}^{*} is all one vector 𝟏\mathbf{1} and the rank of 𝐂\mathbf{C}^{\top} is dd, we can pick 𝐂\mathbf{C}^{*} itself as the matrix that consists of dd linearly independent columns of [𝐂𝟏][\mathbf{C}^{*}\quad\mathbf{1}].

We normalize 𝐂\mathbf{C}^{*} by dividing each column vector by its sum to obtain ncol(𝐂)\mathrm{ncol}(\mathbf{C}^{*}). Note that this normalization is equivalent to multiplying a diagonal matrix on the right. Thus, maximizing det[𝐂𝐂]\det[\mathbf{C}^{\top}\mathbf{C}^{*}] over 𝐂𝒞soft\mathbf{C}\in\mathcal{C}^{soft} is equivalent to max𝐂𝒞softdet[𝐂ncol(𝐂)]\max_{\mathbf{C}\in\mathcal{C}^{soft}}\det[\mathbf{C}^{\top}\mathrm{ncol}(\mathbf{C}^{*})].

Moreover, \mathbf{C}^{*\top}\mathrm{ncol}(\mathbf{C}^{*}) is an identity matrix whose determinant is one. Since both \mathbf{C}^{\top} and \mathrm{ncol}(\mathbf{C}^{*}) are column-stochastic matrices (each entry is in [0,1] and each column sums to 1), their product is still a column-stochastic matrix. When \mathbf{C} does not equal \mathbf{C}^{*} even up to column permutation, i.e., there does not exist a permutation matrix \pi such that \mathbf{C}=\pi\mathbf{C}^{*}, the product of \mathbf{C}^{\top} and \mathrm{ncol}(\mathbf{C}^{*}) will be a non-permutation column-stochastic matrix whose determinant is strictly less than 1. Therefore, when there exists a full-rank \mathbf{C}^{*}\in\mathcal{C} such that \mathbf{A}=\mathbf{C}^{*\top}\mathbf{B} where \mathbf{B} is invertible, \text{DMI-C}(\mathbf{A}) equals \mathbf{C}^{*} (up to column permutation).

Finally, we prove that the clustering result is linearly-partitioned. 𝐀~\tilde{\mathbf{A}}’s row vectors are 𝐚~1,𝐚~2,,𝐚~n\tilde{\mathbf{a}}_{1},\tilde{\mathbf{a}}_{2},\cdots,\tilde{\mathbf{a}}_{n}. 𝐂𝐀~\mathbf{C}^{*\top}\tilde{\mathbf{A}}’s row vectors are denoted by 𝐯1,𝐯2,,𝐯k\mathbf{v}_{1},\mathbf{v}_{2},\cdots,\mathbf{v}_{k}.

For all i[n]i\in[n], we denote the cluster it is mapped to by cic_{i}^{*}, i.e., 𝐂i,ci=1\mathbf{C}^{*}_{i,c_{i}^{*}}=1. For all c[k]c\in[k], when 𝐂\mathbf{C}^{*} is a local maximum, the current assignment is better than moving ii to cluster cc, that is,

det[𝐂𝐀~]=det[[𝐯1𝐯2𝐯k]]det[[𝐯ci𝐚~i𝐯c+𝐚~i]]\det[\mathbf{C}^{*\top}\tilde{\mathbf{A}}]=\det[\begin{bmatrix}\mathbf{v}_{1}\\ \mathbf{v}_{2}\\ \vdots\\ \mathbf{v}_{k}\end{bmatrix}]\geq\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}-\tilde{\mathbf{a}}_{i}\\ \vdots\\ \mathbf{v}_{c}+\tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]

Due to the multilinear property of determinant,

det[[𝐯ci𝐚~i𝐯c+𝐚~i]]=\displaystyle\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}-\tilde{\mathbf{a}}_{i}\\ \vdots\\ \mathbf{v}_{c}+\tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]= det[[𝐯ci𝐯c+𝐚~i]]det[[𝐚~i𝐯c+𝐚~i]]\displaystyle\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}\\ \vdots\\ \mathbf{v}_{c}+\tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]-\det[\begin{bmatrix}\vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\\ \mathbf{v}_{c}+\tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]
=\displaystyle= det[[𝐯ci𝐯c]]+det[[𝐯ci𝐚~i]]det[[𝐚~i𝐯c]]\displaystyle\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}\\ \vdots\\ \mathbf{v}_{c}\\ \vdots\end{bmatrix}]+\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}\\ \vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]-\det[\begin{bmatrix}\vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\\ \mathbf{v}_{c}\\ \vdots\end{bmatrix}]

Thus, we have

det[[𝐚~i𝐯c]]det[[𝐯ci𝐚~i]]\det[\begin{bmatrix}\vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\\ \mathbf{v}_{c}\\ \vdots\end{bmatrix}]\geq\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}\\ \vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]

for all cc.

Since 𝐃:=(𝐂𝐀~)1\mathbf{D}^{*}:=(\mathbf{C}^{*\top}\tilde{\mathbf{A}})^{-1}, det[𝐂𝐀~]Dij\det[\mathbf{C}^{*\top}\tilde{\mathbf{A}}]D^{*}_{ij} is the cofactors of matrix 𝐂𝐀~\mathbf{C}^{*\top}\tilde{\mathbf{A}}. We denote 𝐃\mathbf{D}^{*}’s column vectors by 𝐝1,𝐝2,,𝐝k\mathbf{d}^{*}_{1},\mathbf{d}^{*}_{2},\cdots,\mathbf{d}^{*}_{k}.

Due to Laplace expansion, we have

𝐚~i(det[𝐂𝐀~]𝐝ci)=det[[𝐚~i𝐯c]]det[[𝐯ci𝐚~i]]=𝐚~i(det[𝐂𝐀~]𝐝c)\tilde{\mathbf{a}}_{i}*(\det[\mathbf{C}^{*\top}\tilde{\mathbf{A}}]\mathbf{d}^{*}_{c_{i}^{*}})=\det[\begin{bmatrix}\vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\\ \mathbf{v}_{c}\\ \vdots\end{bmatrix}]\geq\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}\\ \vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]=\tilde{\mathbf{a}}_{i}*(\det[\mathbf{C}^{*\top}\tilde{\mathbf{A}}]*\mathbf{d}^{*}_{c})

Since det[𝐂𝐀~]\det[\mathbf{C}^{*\top}\tilde{\mathbf{A}}] is the maximum that must be positive, we have for all ii, for all cc,

𝐚~i𝐝ci𝐚~i𝐝c.\tilde{\mathbf{a}}_{i}*\mathbf{d}^{*}_{c_{i}^{*}}\geq\tilde{\mathbf{a}}_{i}*\mathbf{d}^{*}_{c}.

Thus, 𝐂idxmax(𝐀~𝐃)\mathbf{C}^{*}\in\mathrm{idxmax}(\tilde{\mathbf{A}}\mathbf{D}^{*}). Moreover,

𝟏𝐃1=𝟏𝐂𝐀~=𝟏𝐀~𝐚~¯\displaystyle\mathbf{1}\mathbf{D}^{*-1}=\mathbf{1}\mathbf{C}^{*\top}\tilde{\mathbf{A}}=\mathbf{1}\tilde{\mathbf{A}}\propto\bar{\tilde{\mathbf{a}}}

which shows that 𝐚~¯𝐃𝟏\bar{\tilde{\mathbf{a}}}*\mathbf{D}^{*}\propto\mathbf{1}, i.e., the mean of all data points is in the intersection of all partitions. We also have

𝐂𝐀~=diag(n1,n2,,nk)[𝐚¯11𝐚¯21𝐚¯k1]\mathbf{C}^{\top}\tilde{\mathbf{A}}=\mathrm{diag}(n_{1},n_{2},\cdots,n_{k})\begin{bmatrix}&\bar{\mathbf{a}}_{1}&\quad 1\\ &\bar{\mathbf{a}}_{2}&\quad 1\\ &\vdots&\quad\vdots\\ &\bar{\mathbf{a}}_{k}&\quad 1\\ \end{bmatrix}

whose determinant is proportional to the volume of the simplex formed by (𝐚¯1,𝐚¯2,,𝐚¯k)(\bar{\mathbf{a}}_{1},\bar{\mathbf{a}}_{2},\cdots,\bar{\mathbf{a}}_{k}), multiplying the product of cluster sizes n1,n2,,nkn_{1},n_{2},\cdots,n_{k}. We finish the proof.

The linearly-partitioned part of the above theorem shows that instead of trying all possible n\times k matrices \mathbf{C}, we can try all possible k\times k matrices \mathbf{D} and check whether the corresponding \mathbf{C} is optimal to obtain the result of DMI-clustering. In particular, when the dimension d is a constant, the number of distinct linear partitions is polynomial in the number of data points n due to the theory of VC dimension [4]. This implies the following corollary.

Corollary 2.5.

When the dimension d is a constant and we maximize over the default constraint \mathcal{C}=\{\mathbf{C}\in\{0,1\}^{n\times k}|\forall i,\sum_{c}C_{ic}=1\}, there exists a polynomial-time algorithm for DMI-clustering.

Proof.

When d is a constant, the number of clusters k\leq d+1 is a constant as well. Given n data points, for any linear partition \mathbf{D}^{*}, the set of points that belong to cluster 1 is \{\tilde{a}|\tilde{a}*\mathbf{d}^{*}_{1}>\tilde{a}*\mathbf{d}^{*}_{c},c\neq 1\}, which is an intersection of k-1 half-spaces. Since the VC dimension of half-spaces is a constant (k+1 here), there are at most O(n^{(k-1)(k+1)}) distinct possibilities for cluster 1. Moreover, the number of clusters k is a constant. Thus there are at most O(n^{k(k-1)(k+1)}) distinct linear partitions. We can therefore enumerate these linear partitions in polynomial time and check which one gives the cluster matrix \mathbf{C}^{*} that maximizes the determinant.

In the next section, we will apply DMI-clustering to extract knowledge from elicited information without ground truth. We will show that this method will be robust to people’s strategies and expertise.

3 Multi-task & Invariant Clustering

This section will introduce a multi-task information elicitation mechanism which is not only dominantly-truthful but also extracts knowledge from the collected answers. The extraction is robust when each task is answered by a large number of randomly selected people.

3.1 Knowledge-Extracted DMI (K-DMI)-Mechanism

Setting

There are n a priori similar tasks and m agents. Each task is a multi-choice question with C options. Each agent is independently assigned a random set of tasks (of size \geq 2C) in random order. Agents finish the tasks without communication. Each agent i’s reports are represented as an n\times C 0-1 report matrix \mathbf{R}_{i} where the (t,c) entry of \mathbf{R}_{i} is 1 if and only if agent i reports option c for task t. The all-zero rows of \mathbf{R}_{i} correspond to the tasks that agent i does not perform.
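For concreteness, a report matrix \mathbf{R}_{i} in this setting can be assembled as follows (a minimal helper of ours; task_answers maps each performed task to the reported option):

    import numpy as np

    def report_matrix(task_answers, n, C):
        # Build agent i's n x C 0-1 report matrix R_i from a mapping
        # {task index: reported option index}; unperformed tasks stay all-zero rows.
        R = np.zeros((n, C), dtype=int)
        for t, c in task_answers.items():
            R[t, c] = 1
        return R

    # Example: 5 tasks, 3 options; the agent answered tasks 0, 2, and 4.
    print(report_matrix({0: 1, 2: 0, 4: 2}, n=5, C=3))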

Model

Tasks have states which are drawn i.i.d. from an unknown distribution. Each task t’s state is a vector \mathbf{a}_{t}\in[0,1]^{C}, and each agent receives signal c with probability \mathbf{a}_{t}(c). Each agent i is associated with a strategy \mathbf{S}_{i}\in[0,1]^{C\times C} where \mathbf{S}_{i}(c,c^{\prime}) is the probability that she receives c and reports c^{\prime}. We assume that each agent performs the same strategy for all tasks. This assumption is reasonable since the tasks are a priori similar and agents receive tasks in a random order independently. A payment scheme is dominantly truthful if for each agent, regardless of other people’s strategies, telling the truth pays her at least as much as any other strategy in expectation (not necessarily strictly more).

Input: \mathbf{R}_{i}\in\{0,1\}^{n\times C}, i\in[m]
Output: extracted knowledge \mathbf{C}^{*}, payments p_{i}, i\in[m]
/* Knowledge extraction */
\mathbf{A}=\sum_{i}\mathbf{R}_{i}
\forall t,c, A_{tc}=\frac{A_{tc}}{\sum_{c^{\prime}}A_{tc^{\prime}}}
Get \mathbf{C}^{*} from \text{DMI-C}(\mathbf{A})
/* Payment calculation */
for i\in[m] do
       \mathbf{A}=\sum_{j\neq i}\mathbf{R}_{j}
       \forall t,c, A_{tc}=\frac{A_{tc}}{\sum_{c^{\prime}}A_{tc^{\prime}}}
       T_{i} is the set of tasks agent i performed, T_{-i}=[n]\setminus T_{i}

      /* Use T_{-i} to obtain the partition to guarantee independence */
       Get \mathbf{D}^{*}_{-i} and L from \text{DMI-C}(\mathbf{A}_{T_{-i},:})
       /* Use the obtained partition to extract knowledge for tasks in T_{i} */
       \mathbf{C}^{*}_{i}\leftarrow\mathrm{idxmax}(\mathbf{A}_{T_{i},L}\mathbf{D}^{*}_{-i})
       Divide T_{i} into two disjoint parts T_{1},T_{2} such that both |T_{1}|,|T_{2}|\geq C
       Restrict \mathbf{C}^{*}_{i} and \mathbf{R}_{i} to T_{\ell} to get \mathbf{C}^{*}_{\ell} and \mathbf{R}_{i}^{\ell} respectively for \ell=1,2
       /* Reward agent i using the extracted knowledge */
       p_{i}=\det(\mathbf{C}_{1}^{*\top}\mathbf{R}^{1}_{i})\det(\mathbf{C}_{2}^{*\top}\mathbf{R}^{2}_{i})
end for
Algorithm 4 Knowledge-Extracted DMI (K-DMI)-Mechanism
Figure 6: Knowledge extraction: we aggregate all agents’ answers and normalize each row to obtain the answer matrix \mathbf{A}. When \mathbf{A}’s rank is C, \text{DMI-C}(\mathbf{A}) is equivalent to \max_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}].
Figure 7: Payment calculation: for each agent, we use the tasks she does not perform to learn the partition \mathbf{D} and induce a clustering result for the tasks she performs. To pay her, we divide the tasks she performs into two disjoint parts T_{1},T_{2}. We calculate the determinant of the product of the two matrices in each green rectangle and multiply the two determinants. Note that this is the same as the original DMI-Mechanism if we replace the clustering result with a peer agent’s report.

K-DMI-Mechanism is a modification of DMI-Mechanism which pays each agent the determinant mutual information between her report and her peer’s report in expectation.

Definition 3.1 (Determinant based Mutual Information (DMI) [13]).

For two random variables X,Y whose joint distribution is denoted by the matrix \mathbf{U}_{X,Y},

\textsc{DMI}(X;Y):=|\det(\mathbf{U}_{X,Y})|

Since DMI satisfies the data processing inequality and its square has a polynomial format, we can use \geq 2C tasks to implement it. Compared to DMI-Mechanism, K-DMI-Mechanism adds a clustering process, which can be interpreted as finding a clustering that has the highest determinant mutual information with the collection of all agents’ answers. For the payment calculation, K-DMI-Mechanism pays each agent the determinant mutual information between her report and the clustering result to reduce variance. The original DMI-Mechanism’s payments may not be fair to hard-working agents when there are a lot of low-effort agents. The current payment calculation method may be more robust to other people’s strategies. For example, in the simple case where the answer distributions only have C distinct formats, we can use the other tasks to learn the correct partition, and in K-DMI each agent is effectively rewarded based on her report and the underlying ground truth.
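A minimal sketch of the knowledge-extraction step of Algorithm 4 (the payment loop and the T_{i}/T_{-i} split are omitted; dmi_clustering stands for any implementation of DMI-C, e.g. the sketches in Section 2):

    import numpy as np

    def extract_knowledge(reports, dmi_clustering):
        # Knowledge-extraction step of Algorithm 4 (payments omitted).
        # `reports` is a list of n x C 0-1 report matrices R_i; `dmi_clustering` is any
        # routine implementing DMI-C. Assumes every task received at least one report.
        A = np.sum(reports, axis=0).astype(float)     # aggregate answers per task/option
        A = A / A.sum(axis=1, keepdims=True)          # normalize each task's feedback
        return dmi_clustering(A)                      # cluster the tasks = assign answers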

Theorem 3.2.

K-DMI-Mechanism is dominantly-truthful. If the number of agents m goes to infinity and \lim_{m\rightarrow\infty}\frac{1}{m}\sum_{i\in[m]}\mathbf{S}_{i} exists and is invertible, the clustering result is invariant to people’s strategies. When \mathbf{A}’s rank is C, \text{DMI-C}(\mathbf{A}) is equivalent to \mathrm{argmax}_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}], and v:=\max_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}] also indicates the quality of the answers: v decreases when agents perform strategies.

Proof of Theorem 3.2.

From agent i’s view, each task’s elicited answers are drawn i.i.d. from a distribution. Thus, the partition \mathbf{D}_{-i}^{*} learned from the set of tasks T_{-i} is independent of the set of tasks T_{i} she performs. Therefore, after applying \mathbf{D}_{-i}^{*} to T_{i}’s answers (excluding agent i’s own answer) to obtain a clustering result, the i.i.d. assumption is still satisfied for the pair consisting of agent i’s answer and the clustering result \mathbf{C}_{i}^{*}. We can think of the clustering result \mathbf{C}_{i}^{*} as the answer of a virtual peer agent. The payment calculation then reduces to DMI-Mechanism. Since DMI-Mechanism is dominantly-truthful, K-DMI-Mechanism is also dominantly-truthful.

For each task t, in expectation, the honest answer distribution will be \mathbf{a}_{t}. Note that each agent is assigned a random set of tasks independently. We denote by p the probability that each agent is assigned task t. In expectation, there will be pm agents who are assigned task t. Moreover, after agents perform strategies, before normalization, the expected answer feedback will be \sum_{i\in[m]}p\mathbf{a}_{t}\mathbf{S}_{i}=\mathbf{a}_{t}\sum_{i\in[m]}p\mathbf{S}_{i}. Thus, when m goes to infinity, the expected answer distribution will be \mathbf{a}_{t}\lim_{m\rightarrow\infty}\frac{1}{m}\sum_{i\in[m]}\mathbf{S}_{i}. Since DMI-clustering is affine-invariant, the clustering result is invariant to people’s strategies.

When \mathbf{A}’s rank is C, we can pick \tilde{\mathbf{A}}=\mathbf{A} (the all-one column is the sum of \mathbf{A}’s columns and thus cannot be added here). Thus, \text{DMI-C}(\mathbf{A}) is equivalent to \mathrm{argmax}_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}].

For v:=\max_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}], when agents perform strategies, it becomes

v =\max_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}(\lim_{m\rightarrow\infty}\frac{1}{m}\sum_{i\in[m]}\mathbf{S}_{i})]
  \leq\max_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}]

since the absolute value of any transition matrix’s determinant is at most one.

Thus, we finish the proof.

Discussion for the practical implementation of K-DMI

In DMI-clustering, we cannot distinguish two clusterings if they are equivalent up to a permutation of the options. However, if we have a small set of labeled questions, given an agreement measure, we can pick the option permutation that has the highest agreement with the labeled questions. In addition to the default constraint, we can also pick another constraint set \mathcal{C} without losing the strategy-invariance property. For example, we can use other aggregation methods’ results as candidates and determine the best candidate using DMI-clustering. In the payment calculation part, if we have some assumption about each agent’s honest answer and the obtained clustering, e.g. that the determinant of their product is positive in expectation, we do not need to divide the tasks into two parts: we can just pay the agent the determinant of the product of her report matrix and the clustering result. When running DMI-clustering, we can employ some easy heuristics to obtain an initialization. For example, we can run the surprisingly popular method or use the plurality vote of high-payment agents to get an initialization.

3.2 DMI-clustering vs. Surprisingly Popular

The surprisingly popular method picks the option that is the most surprisingly popular compared to the prior. In the multi-task setting, we can use the average to estimate the prior.

Definition 3.3 (Surprisingly popular).

For a task whose answer distribution is \mathbf{a}, we map it to cluster \mathrm{argmax}_{c}\frac{a_{c}}{\bar{a}_{c}}.
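In code, Definition 3.3 amounts to one line of numpy (a sketch of ours, using the column means of the normalized answer matrix as the estimated prior):

    import numpy as np

    def surprisingly_popular_clusters(A):
        # Definition 3.3: map each task (row of the n x C answer matrix A) to the
        # cluster argmax_c a_c / abar_c, where abar is the column mean (estimated prior).
        A = np.asarray(A, dtype=float)
        prior = A.mean(axis=0)
        return np.argmax(A / prior, axis=1)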

The surprisingly popular method shows a certain amount of robustness to people’s strategies compared to plurality vote. For example, if a large set of people always report one option with 100% probability, the plurality vote method will be distorted, but the surprisingly popular method will decrease this distortion. In fact, in the binary-question case, we will show that the surprisingly popular method is equivalent to DMI-clustering and thus is robust to people’s strategies. However, in general, without any prior assumption, the surprisingly popular method satisfies neither the legal nor the affine-invariant property and thus is not robust to people’s strategies even if the number of agents goes to infinity. We show a visual comparison between DMI-clustering and the surprisingly popular method for the three-option case in Figure 8.

Corollary 3.4 (Binary question: DMI-clustering = Surprisingly popular).

In the binary case, DMI-clustering will map \mathbf{a}=(a_{1},a_{2}) to cluster i if a_{i}>\bar{a}_{i} for i=1,2.

Proof.

In this case, \mathbf{A} is an n\times 2 row-stochastic matrix. Since DMI-clustering is affine-invariant, we can subtract the row average from each row: \text{DMI-C}(\mathbf{A})=\text{DMI-C}(\mathbf{A}-\bar{\mathbf{a}}). Since [\mathbf{A}-\bar{\mathbf{a}}\quad\mathbf{1}] has rank 2, we pick \tilde{\mathbf{A}} as the first and third columns [\mathbf{A}_{:,1}-\bar{a}_{1}\quad\mathbf{1}] and use \mathbf{v} as a shorthand for \mathbf{A}_{:,1}-\bar{a}_{1}. Note that the sum of \mathbf{v} is 0. Then, by the same analysis as in Corollary 2.3, the optimal 0-1 vector \mathbf{p} will pick all positive entries of \mathbf{v}. Thus, DMI-clustering will map \mathbf{a}=(a_{1},a_{2}) to cluster i if a_{i}-\bar{a}_{i}>0 for i=1,2. ∎

Figure 8: Surprisingly popular vs. DMI-clustering: both methods have the mean of all points as the center of the partitions. Left (surprisingly popular): the center determines the cluster, the partition lines (the black dashed lines) are generated by extending the line segment between the extreme point and the center. Right (DMI-clustering): the partition lines extend the line segments between each cluster’s mean and the center. Thus, this partition can be seen as a “rotated” version of the surprisingly popular method’s partition.

Both the surprisingly popular method and DMI-clustering set the mean of all points as the intersection of the partitions. If we normalize each column of \mathbf{A} by dividing it by its sum, we will see that DMI-clustering can be seen as a “rotated” version of surprisingly popular.

Corollary 3.5 (DMI-clustering = “rotated” surprisingly popular).

With the column-normalized \mathbf{A}, \mathrm{ncol}(\mathbf{A}), the surprisingly popular method is represented as \mathbf{C}\leftarrow\mathrm{idxmax}(\mathrm{ncol}(\mathbf{A})). If \mathrm{ncol}(\mathbf{A}) has rank C, there exists \mathbf{D} such that DMI-clustering’s result is represented as \mathbf{C}\leftarrow\mathrm{idxmax}(\mathrm{ncol}(\mathbf{A})\mathbf{D}).

Proof.

The first statement follows directly from the definition of the surprisingly popular method. It remains to prove the second statement. Since DMI-clustering is invariant to right affine transformations of \mathbf{A}, we can use the column-normalized \mathbf{A} as input. For [\mathrm{ncol}(\mathbf{A})\quad\mathbf{1}], the last column \mathbf{1} is a linear combination of \mathrm{ncol}(\mathbf{A})’s columns since \mathbf{A}’s columns sum to \mathbf{1}. If \mathrm{ncol}(\mathbf{A}) has rank C, then we can pick \tilde{\mathbf{A}}=\mathrm{ncol}(\mathbf{A}). Thus, there exists \mathbf{D} such that DMI-clustering’s result is represented as \mathbf{C}\leftarrow\mathrm{idxmax}(\mathrm{ncol}(\mathbf{A})\mathbf{D}).

4 Single-task & Method of Moments

The biggest advantage of the surprisingly popular method is that it only uses first-moment information; accordingly, it was originally proposed for the single-task setting. In this section, we introduce the single-task setting and show that, from a clustering view, we can utilize second-moment information to propose a spectral-clustering-based mechanism that can be better than surprisingly popular in some settings.

Setting

Agents are assigned a single task (e.g., “Have you ever texted while driving?”). Agents finish the task without any communication. Each agent i is asked to report her signal c_i (e.g., “No”) and her prediction \mathbf{p}_i of other people’s signals (e.g., “I think 10% of people will say yes and 90% will say no”).

Input: (c_i,\mathbf{p}_i)_i
Output: c
\mathbf{a}\in\mathbb{R}^{1\times C} where, for all c, a_c is the fraction of agents who answer c
\forall c,c':\ \Pr[c'|c]=\frac{\sum_i \mathbbm{1}(c_i=c)\,\mathbf{p}_i(c')}{\sum_i \mathbbm{1}(c_i=c)}
\forall c:\ \Pr[c]=\big(\sum_{c'}\frac{\Pr[c'|c]}{\Pr[c|c']}\big)^{-1};\quad \forall c,c':\ \Pr[c',c]=\Pr[c'|c]\Pr[c]
  /* constructing the first moment using the second moment */
\mathbf{v}\in\mathbb{R}^{1\times C} where, for all c, v_c:=\Pr[c]
  /* first moment */
Return \mathrm{argmax}_c\ \frac{a_c}{v_c}
Algorithm 5 Surprisingly popular (1st moment) [19]
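The following numpy sketch of Algorithm 5 is for illustration only (the function name is ours); it assumes every option is reported by at least one agent and that all estimated conditionals are positive, so the ratios are well defined:

```python
import numpy as np

def surprisingly_popular(reports, predictions, C):
    """reports: length-n array of signals in {0,...,C-1};
    predictions: (n, C) array, predictions[i] = agent i's forecast of others'
    answer distribution. Returns the aggregated answer (an index in {0,...,C-1})."""
    reports = np.asarray(reports)
    predictions = np.asarray(predictions, dtype=float)
    a = np.bincount(reports, minlength=C) / len(reports)       # answer shares a_c
    # cond[c, c'] = estimated Pr[c' | c]: mean prediction of agents reporting c
    cond = np.vstack([predictions[reports == c].mean(axis=0) for c in range(C)])
    # Pr[c] = ( sum_{c'} Pr[c'|c] / Pr[c|c'] )^{-1}
    prior = 1.0 / (cond / cond.T).sum(axis=1)
    return int(np.argmax(a / prior))            # most surprisingly popular answer

# Toy usage: three agents answer a binary question and report predictions.
reports = [0, 0, 1]
predictions = [[0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
print(surprisingly_popular(reports, predictions, C=2))   # prints 0 here
```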
Input: (c_i,\mathbf{p}_i)_i
Output: + or −
Compute the answer shares \mathbf{a}, the prior \Pr[c] for all c, and the joint distribution \Pr[c,c'] for all c,c' as in Algorithm 5
\mathbf{v}\in\mathbb{R}^{1\times C} where, for all c, v_c:=\Pr[c]
  /* first moment */
\mathbf{M}\in\mathbb{R}^{C\times C} where, for all c,c', M_{c,c'}=\Pr[c,c']
  /* second moment */
\mathbf{Cov}=\mathbf{M}-\mathbf{v}^{\top}\mathbf{v}
Compute the first eigenvector \mathbf{e}^* of \mathbf{Cov}
Return the sign of the inner product between \mathbf{e}^* and \mathbf{a}-\mathbf{v}
Algorithm 6 Spectral truth serum (2nd moment)
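A corresponding sketch of Algorithm 6, under the same assumptions as the previous sketch; we symmetrize the empirical covariance before the eigendecomposition (a choice of ours, since the empirical \mathbf{M} need not be exactly symmetric), and the degenerate case of a zero inner product is not handled:

```python
import numpy as np

def spectral_truth_serum(reports, predictions, C):
    """Returns +1 or -1: the sign of <e*, a - v>, where e* is the top
    eigenvector of the empirical covariance Cov = M - v^T v."""
    reports = np.asarray(reports)
    predictions = np.asarray(predictions, dtype=float)
    a = np.bincount(reports, minlength=C) / len(reports)       # answer shares
    cond = np.vstack([predictions[reports == c].mean(axis=0) for c in range(C)])
    v = 1.0 / (cond / cond.T).sum(axis=1)                      # prior (1st moment)
    M = cond * v[:, None]                                      # M[c, c'] = Pr[c, c']
    cov = M - np.outer(v, v)                                   # covariance matrix
    # top eigenvector of the (symmetrized) covariance matrix
    _, vecs = np.linalg.eigh((cov + cov.T) / 2)
    e_star = vecs[:, -1]
    return int(np.sign(e_star @ (a - v)))
```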
Figure 9: Surprisingly popular vs. spectral truth serum: the black dashed lines are the partition boundaries. In the right figure, the grey dashed line is \mathbf{e}^*. If the ground-truth clustering looks more like the left case, the surprisingly popular method is better; if it looks more like the right case (two worlds, spherical Gaussians), the spectral truth serum is better.
Lemma 4.1.

[3] The first eigenvector \mathbf{e}^* of the covariance matrix of a mixture of two spherical Gaussians points along the direction through the two means.

The above lemma shows that if the two Gaussians are sufficiently separated, we can project all points onto \mathbf{e}^* and use the overall mean \mathbf{v} as the separator. This leads to the spectral truth serum. Thus, in scenarios where the ground-truth clustering looks like a well-separated mixture of two spherical Gaussians, the spectral truth serum is better than surprisingly popular. Intuitively, the spectral truth serum is applicable to settings where there are only two underlying ground-truth states (e.g., fake news or not) and we elicit many signals from the crowds. The eigenvector of the covariance matrix tells us which signals are important and which are not. We then project people’s feedback onto the eigenvector, the important direction, and compare it to the average.
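As a quick numerical illustration of Lemma 4.1 (with arbitrarily chosen means, variance, and sample size), the top eigenvector of the sample covariance of a well-separated two-Gaussian mixture is nearly parallel to the difference of the two means:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, sigma, n = np.array([0.0, 0.0]), np.array([5.0, 5.0]), 1.0, 5000
X = np.vstack([rng.normal(mu1, sigma, size=(n, 2)),     # component 1
               rng.normal(mu2, sigma, size=(n, 2))])    # component 2

_, vecs = np.linalg.eigh(np.cov(X.T))     # eigenvectors of the sample covariance
e_star = vecs[:, -1]                      # top eigenvector
direction = (mu2 - mu1) / np.linalg.norm(mu2 - mu1)
print(abs(e_star @ direction))            # close to 1: e* points through the means
```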

Hsu and Kakade [11] show that with third-order moments, a mixture of k spherical Gaussians can be learned. A future direction is to use a hypothetical question (e.g., “would you update your prediction of other people’s answers if you saw someone answer A?”) to elicit a third-order moment and obtain better information aggregation results.

5 Conclusion and Discussion

We provide a new clustering perspective on information elicitation. With this perspective, we propose a new information elicitation mechanism that is not only dominantly truthful but also aggregates information robustly when there is a large number of people in the multi-task setting. We also propose a spectral-method-based information aggregation mechanism in the single-task setting. Of independent interest, we propose a novel affine-invariant clustering method, DMI-clustering.

The geometric interpretation of DMI-clustering can be easily extended to any other k\leq d. For example, in a three-dimensional space, when we pick k=2, the clustering goal is to find two clusters such that the distance between their means, multiplied by the product of the cluster sizes, is maximized (see the brute-force sketch after this paragraph). Picking other values of k loses the affine-invariance property, but we can still explore their properties in the future.
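The following brute-force sketch (our own helper, exponential in the number of points and only meant to make the k=2 objective precise) enumerates all 2-partitions and maximizes the distance between the cluster means times the product of the cluster sizes:

```python
import itertools
import numpy as np

def best_two_partition(points):
    """Brute force over all 2-partitions: maximize
    ||mean(S1) - mean(S2)|| * |S1| * |S2|."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    best_score, best_mask = -np.inf, None
    for size in range(1, n // 2 + 1):
        for S1 in itertools.combinations(range(n), size):
            mask = np.zeros(n, dtype=bool)
            mask[list(S1)] = True
            gap = np.linalg.norm(points[mask].mean(axis=0) - points[~mask].mean(axis=0))
            score = gap * mask.sum() * (~mask).sum()
            if score > best_score:
                best_score, best_mask = score, mask
    return best_score, best_mask

points = np.random.default_rng(1).random((8, 3))     # 8 points in 3 dimensions
score, mask = best_two_partition(points)
print(score, mask.astype(int))
```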

Another interesting future direction is to apply other clustering techniques, such as non-linear ones (e.g., kernel methods), to information elicitation. This work focuses on the discrete setting; we can also consider the continuous setting or study new information elicitation scenarios that provide new settings for clustering. Empirical validation is also an important future direction.

References

  • Begelfor and Werman [2006] Evgeni Begelfor and Michael Werman. 2006. Affine invariance revisited. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2. IEEE, 2087–2094.
  • Blum et al. [2016] Avrim Blum, John Hopcroft, and Ravindran Kannan. 2016. Foundations of data science. Vorabversion eines Lehrbuchs 5 (2016), 5.
  • Blumer et al. [1989] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. 1989. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM) 36, 4 (1989), 929–965.
  • Brubaker and Vempala [2008] S Charles Brubaker and Santosh S Vempala. 2008. Isotropic PCA and affine-invariant clustering. In Building Bridges. Springer, 241–281.
  • Budescu and Chen [2015] David V Budescu and Eva Chen. 2015. Identifying expertise to extract the wisdom of crowds. Management Science 61, 2 (2015), 267–280.
  • Cichocki et al. [2015] Andrzej Cichocki, Sergio Cruces, and Shun-ichi Amari. 2015. Log-determinant divergences revisited: Alpha-beta and gamma log-det divergences. Entropy 17, 5 (2015), 2988–3034.
  • Dasgupta and Ghosh [2013] Anirban Dasgupta and Arpita Ghosh. 2013. Crowdsourced judgement elicitation with endogenous proficiency. In Proceedings of the 22nd international conference on World Wide Web. International World Wide Web Conferences Steering Committee, 319–330.
  • Dawid and Skene [1979] Alexander Philip Dawid and Allan M Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics) 28, 1 (1979), 20–28.
  • Fitzgibbon and Zisserman [2002] Andrew Fitzgibbon and Andrew Zisserman. 2002. On affine invariant clustering and automatic cast listing in movies. In European Conference on Computer Vision. Springer, 304–320.
  • Hsu and Kakade [2013] Daniel Hsu and Sham M Kakade. 2013. Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science. 11–20.
  • Huang and Yang [2020] Hsin-Hsiung Huang and Jie Yang. 2020. Affine-transformation invariant clustering models. Journal of Statistical Distributions and Applications 7, 1 (2020), 1–24.
  • Kong [2020] Yuqing Kong. 2020. Dominantly Truthful Multi-task Peer Prediction with a Constant Number of Tasks. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2398–2411.
  • Kong and Schoenebeck [2019] Yuqing Kong and Grant Schoenebeck. 2019. An Information Theoretic Framework For Designing Information Elicitation Mechanisms That Reward Truth-telling. ACM Trans. Econ. Comput. 7, 1, Article 2 (Jan. 2019), 33 pages. https://doi.org/10.1145/3296670
  • Liu et al. [2020] Y. Liu, J. Wang, and Y. Chen. 2020. Surrogate Scoring Rules. In EC ’20: The 21st ACM Conference on Economics and Computation.
  • Lloyd [1982] Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE transactions on information theory 28, 2 (1982), 129–137.
  • Miller et al. [2005] N. Miller, P. Resnick, and R. Zeckhauser. 2005. Eliciting informative feedback: The peer-prediction method. Management Science (2005), 1359–1373.
  • Prelec [2004] D. Prelec. 2004. A Bayesian Truth Serum for subjective data. Science 306, 5695 (2004), 462–466.
  • Prelec et al. [2017] Dražen Prelec, H Sebastian Seung, and John McCoy. 2017. A solution to the single-question crowd wisdom problem. Nature 541, 7638 (2017), 532–535.
  • Shnayder et al. [2016] Victor Shnayder, Arpit Agarwal, Rafael Frongillo, and David C Parkes. 2016. Informed truthfulness in multi-task peer prediction. In Proceedings of the 2016 ACM Conference on Economics and Computation. ACM, 179–196.
  • Thurau et al. [2010] Christian Thurau, Kristian Kersting, and Christian Bauckhage. 2010. Yes we can: simplex volume maximization for descriptive web-scale matrix factorization. In Proceedings of the 19th ACM international conference on information and knowledge management. 1785–1788.
  • Weller [2007] Susan C Weller. 2007. Cultural consensus theory: Applications and frequently asked questions. Field methods 19, 4 (2007), 339–368.
  • Werman and Weinshall [1995] Michael Werman and Daphna Weinshall. 1995. Similarity and affine invariant distances between 2D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence 17, 8 (1995), 810–814.
  • Xu et al. [2019] Yilun Xu, Peng Cao, Yuqing Kong, and Yizhou Wang. 2019. L_DMI: A Novel Information-theoretic Loss Function for Training Deep Nets Robust to Label Noise. In Advances in Neural Information Processing Systems. 6222–6233.

Appendix A Synthetic Data Used for Demonstration

Affine-invariance in Figure 3

\mathbf{A}=\begin{bmatrix}0.96536243&0.83582504\\ 0.02223537&0.97962069\\ 0.18576474&0.09306992\\ 0.60073919&0.06909198\\ 0.21115965&0.04303247\\ 0.24518684&0.46305449\\ 0.0045001&0.73335878\\ \end{bmatrix}\quad\mathbf{T}=\begin{bmatrix}0.7384872&0.39051911\\ 0.75115812&0.90684574\\ \end{bmatrix}\quad\mathbf{b}=[.05,.02]

An empirical run of k-cofactors in Figure 5

\mathbf{A}=\begin{bmatrix}0.32786192&0.75198752\\ 0.24789876&0.40110656\\ 0.64860833&0.05877561\\ 0.13688932&0.89837639\\ 0.51958647&0.69553312\\ 0.57106034&0.39534018\\ 0.25602628&0.5299542\\ 0.66521762&0.25370157\\ 0.49647253&0.51707767\\ 0.00649304&0.78141629\\ 0.93642378&0.00550147\\ 0.08014735&0.94850578\\ 0.54978178&0.66296321\\ 0.57201864&0.7952191\\ 0.11499522&0.59186407\\ 0.11681198&0.90701099\\ 0.01623413&0.59313684\\ 0.31841007&0.43924738\\ 0.60377183&0.96489354\\ 0.48738826&0.17097179\\ 0.49460288&0.81759015\\ 0.27831212&0.76869342\\ 0.22953897&0.43809347\\ 0.03654411&0.6203217\\ 0.90628651&0.45924219\\ 0.92525839&0.69213278\\ 0.17723574&0.12487727\\ 0.21277346&0.34931579\\ 0.84758762&0.05452689\\ 0.65204294&0.82808225\\ \end{bmatrix}

DMI-clustering in Figure 8

\mathbf{A}=\begin{bmatrix}0.20727033&0.56209307&0.2306366\\ 0.55235357&0.26311054&0.18453589\\ 0.06826729&0.51504916&0.41668355\\ 0.40863481&0.45383908&0.13752611\\ 0.30463115&0.19875226&0.49661659\\ 0.40387463&0.1261351&0.46999028\\ 0.30097293&0.32668147&0.3723456\\ 0.0530016&0.41863456&0.52836383\\ 0.53632014&0.08514494&0.37853491\\ 0.34180366&0.57497414&0.08322221\\ 0.20603456&0.67360041&0.12036503\\ 0.35331477&0.28461216&0.36207308\\ 0.23868633&0.38453375&0.37677992\\ 0.098419&0.67327932&0.22830168\\ 0.33219443&0.0159078&0.65189777\\ 0.40376774&0.39908311&0.19714915\\ 0.08966334&0.40876422&0.50157244\\ 0.51224362&0.01623191&0.47152447\\ 0.04907822&0.30059226&0.65032952\\ 0.50445751&0.20957023&0.28597227\\ \end{bmatrix}