

Information Elicitation Meets Clustering

Yuqing Kong
The Center on Frontiers of Computing Studies,
Peking University
[email protected]
Abstract

In the setting where we want to aggregate people’s subjective evaluations, plurality vote may be meaningless when a large number of low-effort people always report “good” regardless of the true quality. The “surprisingly popular” method, which picks the most surprising answer compared to the prior, handles this issue to some extent. However, it is still not fully robust to people’s strategies. Here, in the setting where a large number of people are asked to answer a small number of multi-choice questions (multi-task, large group), we propose an information aggregation method that is robust to people’s strategies. Interestingly, this method can be seen as a rotated “surprisingly popular”. It is based on a new clustering method, Determinant MaxImization (DMI) clustering, and a key conceptual idea: information elicitation without ground truth can be seen as a clustering problem. Of independent interest, DMI-clustering is a general clustering method that aims to maximize the volume of the simplex consisting of each cluster’s mean, multiplied by the product of the cluster sizes. We show that DMI-clustering is invariant to any non-degenerate affine transformation applied to all data points. When the data points’ dimension is a constant, DMI-clustering can be solved in polynomial time. In general, we present a simple heuristic for DMI-clustering which is very similar to Lloyd’s algorithm for k-means. Additionally, we apply the clustering idea in the single-task setting and use a spectral method to propose a new aggregation method that utilizes the second-moment information elicited from the crowd.

1 Introduction

When we want to forecast an event, make a decision, evaluate a product, or test a policy, we need to elicit information from the crowd and then aggregate it. In many scenarios, we may not have a ground-truth answer to verify the elicited information. For example, when we poll people’s voting intentions or elicit people’s subjective evaluations, there is no ground truth for people’s feedback. Recently, a line of work [17, 18, 20, 14, 15, 13] has explored how to incentivize people to be honest without ground truth. Some work [19, 15] also studies how to aggregate information without ground truth.

A natural way to aggregate information without ground truth is “plurality vote”, i.e., output the most popular option. For example, in product evaluation, when the feedback is 56% “good”, 40% “so so”, 4% “bad”, the plurality answer is “good”. However, if there is a large number of people who always choose “good” regardless of the quality, plurality vote will always output “good” and thus becomes meaningless. A more robust method is “surprisingly popular”, proposed by Prelec et al. [19]. In the above example, if the prior is 70% “good”, 26% “so so”, 4% “bad”, “surprisingly popular” will aggregate the feedback to “so so”, which is the most surprisingly popular answer (\frac{40}{26}>\frac{56}{70},\frac{4}{4}). This handles the above issue to a certain degree. However, “surprisingly popular” may not work in some scenarios.

Example 1.1.

When the ground truth state is “good”, the distribution over people’s feedback is (56% “good”, 40% “so so”, 4% “bad”); when the ground truth state is “so so”, the distribution is (56% “good”, 34% “so so”, 10% “bad”); ground truth state “bad” corresponds to (46% “good”, 44% “so so”, 10% “bad”). Each state shows up with equal probability. Thus, the prior is approximately (52.7% “good”, 39.3% “so so”, 8% “bad”).

In the above example, for both (56% “good”, 34% “so so”, 10% “bad”) and (46% “good”, 44% “so so”, 10% “bad”), the “surprisingly popular” answer is “bad”, while they correspond to different ground truth states.
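To make the issue concrete, the following numpy sketch (an illustration we add here, not part of the original mechanism) applies the surprisingly popular rule to the two feedback distributions above and returns “bad” in both cases:

    import numpy as np

    options = ["good", "so so", "bad"]
    # Prior of Example 1.1: the average of the three state-conditional distributions.
    prior = np.array([0.527, 0.393, 0.08])
    # Observed feedback under ground-truth states "so so" and "bad" respectively.
    feedback_soso = np.array([0.56, 0.34, 0.10])
    feedback_bad = np.array([0.46, 0.44, 0.10])

    def surprisingly_popular(feedback, prior):
        # Pick the option whose observed frequency exceeds its prior frequency the most.
        return options[int(np.argmax(feedback / prior))]

    # Both calls print "bad", although the two ground-truth states differ.
    print(surprisingly_popular(feedback_soso, prior))
    print(surprisingly_popular(feedback_bad, prior))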

Therefore, “surprisingly popular” requires an additional prior assumption about the relationship between the ground truth state and people’s answers [19]. However, even if people’s honest answers satisfy the assumption, people can apply a strategy to their answers that breaks the assumption. Here is an example.

Example 1.2.

When the ground truth state is “good”, the distribution over people’s feedback is (100%, 0%, 0%); when the ground truth state is “so so”, the distribution is (0%, 100%, 0%); ground truth state “bad” corresponds to (0%, 0%, 100%). All people perform the same strategy: given that my private answer is “good”, I report “good” with probability .56, “so so” with probability .4, “bad” with probability .04; given that my answer is “so so”, I report … The strategy is represented by a row-stochastic matrix \begin{bmatrix}.56&.4&.04\\ .56&.34&.1\\ .46&.44&.1\end{bmatrix} where each row is the reporting distribution for the corresponding private answer. In this case, people’s strategic behaviors change the report distributions to those of the previous example, and “surprisingly popular” will not work here.
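The effect of such a common strategy is simply a right-multiplication of the honest report distributions by the strategy matrix; a small numpy check (ours, for illustration) reproduces the distributions of Example 1.1:

    import numpy as np

    # Honest report distributions of Example 1.2: each state yields one answer with certainty.
    honest = np.eye(3)  # rows correspond to states "good", "so so", "bad"

    # The common strategy matrix of Example 1.2 (row-stochastic: each row is the
    # reporting distribution given the private answer).
    S = np.array([
        [0.56, 0.40, 0.04],
        [0.56, 0.34, 0.10],
        [0.46, 0.44, 0.10],
    ])

    # After everyone applies S, each state's report distribution becomes the
    # corresponding row of honest @ S, which is exactly Example 1.1.
    print(honest @ S)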

Thus, a fundamental problem of the “surprisingly popular” method is that, though it is more robust than the plurality vote method, it is not fully robust to people’s strategies.

To this end, we can naturally ask for an information aggregation method that is robust to people’s strategies, that is, regardless of people’s strategies, the aggregation result will not change.

However, this may ask too much since people can permute their answers and no method is robust to this permutation strategy. Thus we relax the requirement: the aggregation result should not change up to a permutation, and we use side information (e.g. the ground truth of a small set of questions) to determine the permutation.

Figure 1: Reduction to clustering: we map each task to a data point that represents the task’s collected feedback (e.g. 30% “good”, 20% “so so”, 50% “bad”) and then cluster the data points to assign each task a final answer.

It turns out that when we have a large number of people and a small set of a priori similar tasks (each task is a multi-choice question), the answer is yes. A key conceptual idea is to reduce information aggregation without ground truth to a clustering/unsupervised learning problem. That is, we use a data point to represent a task’s feedback. Assigning answers to the tasks can then be understood as clustering the data points (Figure 1). Recall the previous example: when people apply a strategy to their reports, all feedback is transformed by the same affine transformation (this holds even if people adopt different strategies, as long as each task is performed by a large number of random people). Therefore, designing a robust information aggregation method is equivalent to designing an affine-invariant clustering method. We then propose an affine-invariant clustering method, Determinant MaxImization (DMI) clustering, and apply it to information elicitation without verification.

DMI-clustering is the technical highlight of this paper. Previous affine-invariant clustering work either uses a different definition of affine-invariance from ours [2, 10] (they require that items that are the same up to affine transformations, e.g. images of the same chair observed from different angles, be clustered together, whereas we require the clustering result to remain the same when all items are transformed by the same affine transformation) or has an underlying probabilistic model assumption and requires a sufficient number of data points [5, 12]. In contrast, DMI-clustering is a non-parametric affine-invariant method that applies to any finite number of data points and does not require any probabilistic model assumption. DMI-clustering provides a relatively affine-invariant evaluation of each clustering, the volume of the simplex consisting of each cluster’s mean multiplied by the product of the cluster sizes, and aims to find the clustering with the highest evaluation.

When we apply DMI-clustering to information elicitation, the dimension of the data points is the number of options, which can be seen as a constant. In this case, we show that there exists a polynomial-time algorithm for DMI-clustering. In general, we present a simple local-maximum-searching heuristic for DMI-clustering, k-cofactors, which is similar to Lloyd’s algorithm [16] for k-means. We illustrate the geometric interpretation of DMI-clustering and k-cofactors for 2d points in Figure 2.

To apply DMI-clustering to multi-task information elicitation, we normalize each question’s feedback as a data point (e.g. (46% A, 44% B, 10% C)). Given multiple questions’ feedback, we apply DMI-clustering to obtain the final answers (to determine each cluster’s label, we can use side information like the ground truth of a small set of questions).

We also compare DMI-clustering with the surprisingly popular method. Interestingly, we find that they are equivalent for binary questions, and in general, DMI-clustering can be understood as a “rotated” surprisingly popular method.

Figure 2: Geometric interpretation of DMI-clustering’s goal and algorithm in 2d. Left: DMI-clustering aims to maximize the area of the triangle that consists of the clusters’ means, multiplied by the product of the cluster sizes. Right: we can update the current partition (orange dashed lines) by 1) assigning the points into clusters based on the current partition; 2) extending the line segments between each cluster’s mean (star) and all points’ mean (grey dashed lines) to obtain the new partition (black dashed lines). When we apply DMI-clustering to multi-task information elicitation, a point here represents a task’s collected feedback.

In addition to the multi-task setting, a clustering view is also very useful in the single-task setting, where we only ask people a single multi-choice question. The “surprisingly popular” method can be seen as a clustering method that only employs the first moment. To obtain the first moment, we can elicit people’s second-moment information (e.g. their predictions of other people’s answers) and reconstruct the first moment from the second moment [19]. From a clustering view, it is very natural to apply a spectral method, which uses the second-moment information, to design a new information aggregation method in the single-task case. If we can use a hypothetical question to elicit higher-order-moment information, we can apply a higher-order-moment clustering method in the single-task setting.

Our results

We connect information elicitation without ground truth with clustering, and

Multi-task & Affine-invariance

we propose an affine-invariant clustering method and use it to construct a multi-task information elicitation mechanism that 1) aggregates information in a way that is robust to people’s strategies when the number of agents is infinite, and 2) reduces the payment variance in some scenarios while still preserving previous multi-task mechanisms’ incentive properties;

Single-task & Methods of moments

we apply the method-of-moments clustering idea in the single-task setting. With only second-moment information, we use a spectral method to propose a new aggregation mechanism that is better than the “surprisingly popular” method in some scenarios.

The main conceptual contribution of this work is the connection between clustering and information elicitation without ground truth. This connection not only brings clustering techniques to information elicitation without verification, but also brings attention to the clustering setting where the input data points, and properties of the data (e.g. moment information), can be elicited from the crowd. We hope this connection will generate more results for information elicitation/aggregation without ground truth in the future.

1.1 Related Work

Information elicitation without ground truth (peer prediction) was proposed by Miller et al. [17]. The core problem is how to incentivize people to report high-quality answers even if the answers cannot be verified. The key idea is to reward each agent based on her report and her peers’ reports. A series of works [18, 8, 20, 14, 15, 13] have focused on this problem and developed reward schemes in a variety of settings. However, in many scenarios, we want to not only incentivize high-quality answers but also aggregate the answers. Thus, some information elicitation work also explores how to aggregate information without ground truth, and there are two approaches.

One approach is to use the plurality of the high-payment agents’ answers. Liu et al. [15] use this approach to aggregate people’s reports based on various payment schemes. However, incentive-compatible payment schemes only guarantee that high-quality answers receive high payments in expectation. If the payment has a high variance, which happens especially when the number of tasks is small, a high-payment agent may not provide high-quality answers. Thus, each agent needs to perform a sufficient number of tasks to reduce the payment variance. A large literature on extracting wisdom from the crowd also uses the idea of first estimating the crowd’s quality and then using the estimated quality to aggregate their opinions (e.g. [6, 22]). The quality can either be estimated from the crowd’s previous performance or from an EM framework proposed by Dawid and Skene [9]. Though the EM-framework-based estimation does not require historical data, it still requires each person to answer a sufficient number of questions to get a good estimation. Moreover, it usually requires an underlying probabilistic model: there exists a ground truth and, conditioning on the ground truth, people’s reports are independent.

Another approach is to directly evaluate the answer itself. A popular method is the “surprisingly popular” method. Prelec et al. [19] use this method in the single-task setting by eliciting people’s predictions of other people’s answers and reconstructing the prior from the predictions. In the setting where the prior is already known, this method can be directly implemented. However, as we mentioned in the introduction, this method requires an additional assumption on the prior.

More importantly, none of these aggregation approaches is robust to people’s strategies. Though for the Determinant based Mutual Information (DMI)-Mechanism [13] the expected payment is relatively robust to other people’s strategies, we still need each agent to perform a sufficient number of tasks to reduce the payment variance. However, the setting where we have a large number of agents and each agent performs a small number of tasks may be more practical. The multi-task setting in this work focuses on this more practical setting and proposes an aggregation method, DMI-clustering, which is robust to people’s strategies. DMI-clustering is inspired by DMI-Mechanism [13]. Unlike the DMI-Mechanism-based aggregation approach, which uses the plurality of the high-payment agents’ answers, DMI-clustering learns the answers that have the highest determinant mutual information with the collection of people’s answers. When we have a large number of participants, the learning process is independent of their strategies, even if each participant only performs a small number of tasks. Xu et al. [24] also employ the properties of Determinant based Mutual Information to design a robust loss function. Unlike this work, Xu et al. [24] focus on the noisy-labels setting rather than the unsupervised clustering setting.

DMI-clustering is a new general clustering method. The most important property of DMI-clustering is affine-invariance. Affine-invariant clustering is well studied in image classification [23, 10, 2]. However, that line of work uses a different definition of affine-invariance from ours: images of the same item observed from different angles should be classified into the same cluster.

Unlike the above work on affine-invariance, we focus on the setting where all data points are transformed by the same affine transformation and we want the clustering of these data points to remain invariant to the transformation. Brubaker and Vempala [5] and Huang and Yang [12] also use this definition and propose affine-invariant clustering methods. However, Brubaker and Vempala [5] focus on the application to mixture models and require knowledge of the number of components and a sufficient number of sample data points. Huang and Yang [12] have an underlying probabilistic model and develop a maximum-likelihood-estimator-based algorithm. In contrast, DMI-clustering is a non-parametric affine-invariant method that applies to any finite number of data points and does not require any probabilistic model assumption. Moreover, DMI-clustering provides a relatively affine-invariant evaluation of each clustering. Thus, when we have side information that constrains the set of clusterings (e.g. we need to pick one of clustering I or clustering II), the DMI-clustering method is still affine-invariant.

DMI-clustering is more similar to k-means clustering. Like k-means, it has a natural geometric interpretation and a simple local-maximum-searching heuristic that uses each cluster’s mean. One difference from k-means is that DMI-clustering cannot pick any k it wants, since it needs to satisfy affine-invariance (if we have d linearly independent d-dimensional data points, k-means can pick any k; however, DMI-clustering must pick k=d and make each individual data point a cluster by itself, since these data points can be mapped to any other linearly independent d-dimensional data points by a proper affine transformation).

The definition of DMI-clustering uses the determinant. The definition of the log-determinant divergence [7] also uses the determinant and is affine-invariant. However, the affine-invariance of the log-determinant divergence again differs from ours: it refers to the property that the divergence between two positive definite matrices is invariant to affine transformations. The geometric interpretation of DMI-clustering is maximizing the volume of the simplex constituted by the clusters’ means, multiplied by the product of the cluster sizes. Thurau et al. [21] also connect simplex volume maximization with clustering. However, their problem is very different from ours and irrelevant to affine-invariance. They aim to design an efficient algorithm for matrix factorization by greedily picking the basis vectors such that the simplex formed by these basis vectors has maximal volume.

2 Determinant MaxImization (DMI) Clustering

In this section, we will introduce a new clustering method, DMI-clustering, that is based on determinant maximization. We will show that this method is invariant to any affine transformation which is performed on all data points due to the multiplicative property of determinant. We will also show that this clustering method will generate a clustering which has a natural geometric interpretation: there will be a representative vector for each cluster and each data point is mapped to the cluster whose representative vector has the highest inner-product with the data point.

Moreover, we will introduce a heuristic algorithm, k-cofactors, which is similar to Lloyd’s algorithm for k-means. When the data points are 2-dimensional, we visualize the algorithm and introduce an interesting geometric way to perform it. All synthetic data used for demonstration is attached in the appendix. We will also show that there exists a polynomial-time algorithm for DMI-clustering when d is a constant. We first introduce the clustering method.

Input: \mathbf{A}\in\mathbb{R}^{n\times d}
  /* clustering the rows of \mathbf{A} */
Output: k, \mathbf{C}^{*}\in\{0,1\}^{n\times k}, L, \mathbf{D}^{*}\in\mathbb{R}^{k\times k}
  /* k clusters, map row i to cluster c if C^{*}_{ic}=1 */
Add column (1,1,\cdots,1)^{\top} to \mathbf{A}, i.e., \mathbf{A}^{+1}=[\mathbf{A}\quad\mathbf{1}]
Find the rank k of \mathbf{A}^{+1}
Pick k linearly independent columns of \mathbf{A}^{+1} arbitrarily to construct matrix \tilde{\mathbf{A}}
L= the set of picked column indices
Let \mathcal{C} be \{\mathbf{C}\in\{0,1\}^{n\times k}|\forall i,\sum_{c}C_{ic}=1\} (we can pick another constraint set \mathcal{C} here; for example, we can use \mathcal{C}^{soft}:=\{\mathbf{C}\in[0,1]^{n\times k}|\forall i,\sum_{c}C_{ic}=1\})
Find \mathbf{C}^{*}\in\mathrm{argmax}_{\mathbf{C}\in\mathcal{C}}\det[\mathbf{C}^{\top}\tilde{\mathbf{A}}]
  /* we call |\det[\mathbf{C}^{\top}\tilde{\mathbf{A}}]| the DMI-score of \mathbf{C}, \text{DMI-S}(\mathbf{C},\mathbf{A}) */
\mathbf{D}^{*}=(\mathbf{C}^{*\top}\tilde{\mathbf{A}})^{-1}
Algorithm 1 DMI-clustering \text{DMI-C}(\mathbf{A})

The core of DMI-clustering is an evaluation metric for each clustering, the DMI-score. We will show that, intuitively, the DMI-score is the volume of the simplex formed by the clusters’ means, multiplied by the product of the cluster sizes. DMI-clustering aims to find the clustering that has the highest DMI-score. A naive attempt to solve the optimization problem in DMI-clustering is to enumerate all possible \mathbf{C}. However, it needs exponential running time in n, even when d is a constant. To obtain a better algorithm, in addition to showing that DMI-clustering is affine-invariant, we also analyze the clustering result of DMI-clustering and find that it is linearly-partitioned. This linearly-partitioned property directly induces a polynomial-time algorithm when d is a constant and a simple local-maximum-searching heuristic in general. We will use the following notation to describe the clustering result of DMI-clustering.
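For intuition, the following is a hedged brute-force sketch of Algorithm 1 in numpy (the function name dmi_clustering_bruteforce is ours); it enumerates every hard assignment and is therefore only feasible for very small n:

    import itertools
    import numpy as np

    def dmi_clustering_bruteforce(A):
        # Brute-force sketch of Algorithm 1: append the all-one column, pick k linearly
        # independent columns, then enumerate every hard assignment C and keep the one
        # with the largest DMI-score |det(C^T A_tilde)|.
        A = np.asarray(A, dtype=float)
        n = A.shape[0]
        A_plus = np.hstack([A, np.ones((n, 1))])
        k = np.linalg.matrix_rank(A_plus)

        # Greedily pick k linearly independent columns of A_plus.
        picked = []
        for j in range(A_plus.shape[1]):
            if np.linalg.matrix_rank(A_plus[:, picked + [j]]) == len(picked) + 1:
                picked.append(j)
            if len(picked) == k:
                break
        A_tilde = A_plus[:, picked]

        best_C, best_score = None, -np.inf
        for labels in itertools.product(range(k), repeat=n):
            C = np.zeros((n, k))
            C[np.arange(n), labels] = 1.0
            score = abs(np.linalg.det(C.T @ A_tilde))  # DMI-score of this assignment
            if score > best_score:
                best_C, best_score = C, score
        return best_C, best_score, picked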

Definition 2.1 (Index of row’s max).

For a matrix \mathbf{M}\in\mathbb{R}^{n\times k}, we define \mathrm{idxmax}(\mathbf{M})\in\mathbb{R}^{n\times k} as a matrix such that for all i, \mathrm{idxmax}(\mathbf{M})’s i^{th} row is a one-hot vector that indicates the index of the largest entry of \mathbf{M}’s i^{th} row. When there is a tie, we define \mathrm{idxmax}(\mathbf{M}) as the set of all matrices whose rows indicate the index of a largest entry of \mathbf{M}’s rows. For example,

\mathrm{idxmax}\left(\begin{bmatrix}.3&.4&.1\\ 0&-1&.1\end{bmatrix}\right)=\begin{bmatrix}0&1&0\\ 0&0&1\end{bmatrix}.
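A small numpy helper matching this definition (breaking ties by the first maximal index, whereas the definition keeps the whole set of matrices) might look like:

    import numpy as np

    def idxmax(M):
        # One-hot matrix marking the largest entry of each row of M.
        # Ties are broken by the first maximal index; the definition above instead
        # returns the set of all such matrices.
        M = np.asarray(M, dtype=float)
        one_hot = np.zeros_like(M)
        one_hot[np.arange(M.shape[0]), np.argmax(M, axis=1)] = 1.0
        return one_hot

    # Reproduces the example above.
    print(idxmax([[0.3, 0.4, 0.1], [0.0, -1.0, 0.1]]))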

We now introduce the theoretical properties of DMI-clustering and visualize them in Figure 3.

Figure 3: Geometric interpretation of DMI-clustering’s properties: 1) legal: when the set of points consists of k\leq d+1 distinct points (e.g. k=3, d=2, \mathbf{A}^{\top}=[\mathbf{a}_{1}^{\top},\mathbf{a}_{1}^{\top},\mathbf{a}_{2}^{\top},\mathbf{a}_{3}^{\top},\mathbf{a}_{2}^{\top}]), each of the distinct points will form a cluster; 2) affine-invariant: when we perform the same affine transformation on all points, the clustering result will not change; 3) linearly-partitioned: for all i\in[n], if row i is mapped to cluster c, then the row vector \tilde{\mathbf{A}}_{i} has the highest inner product with column vector \mathbf{D}_{c} among all column vectors of \mathbf{D}^{*}. Thus, \mathbf{D}^{*} determines a partition (the black dashed lines) and the mean of all points lies at the intersection of the partition lines.
Theorem 2.2.

DMI-clustering is

Legal

when there exists a full-rank \mathbf{C}^{*}\in\mathcal{C} such that \mathbf{A}=\mathbf{C}^{*\top}\mathbf{B} where \mathbf{B} is invertible, for both the hard constraint \mathcal{C}=\{\mathbf{C}\in\{0,1\}^{n\times k}|\forall i,\sum_{c}C_{ic}=1\} and the soft constraint \mathcal{C}^{soft}, \text{DMI-C}(\mathbf{A}) equals \mathbf{C}^{*} (up to column permutation).

Affine-invariant

the DMI-score of \mathbf{C} is relatively invariant to any invertible affine transformation of the data points, i.e., for all invertible \mathbf{T}\in\mathbb{R}^{d\times d} and all \mathbf{b}\in\mathbb{R}^{1\times d}, for all \mathbf{C}_{1},\mathbf{C}_{2} which have non-zero DMI-score with respect to \mathbf{A},

\frac{\text{DMI-S}(\mathbf{C}_{1},\mathbf{A})}{\text{DMI-S}(\mathbf{C}_{2},\mathbf{A})}=\frac{\text{DMI-S}(\mathbf{C}_{1},\mathbf{A}\mathbf{T}+\mathbf{b})}{\text{DMI-S}(\mathbf{C}_{2},\mathbf{A}\mathbf{T}+\mathbf{b})};

thus, for any constraint set \mathcal{C}, \text{DMI-C}(\mathbf{A}\mathbf{T}+\mathbf{b}) is equivalent to \text{DMI-C}(\mathbf{A}) in the sense that any solution of \text{DMI-C}(\mathbf{A}\mathbf{T}+\mathbf{b}) is also a solution of \text{DMI-C}(\mathbf{A}) and vice versa;

Linearly-partitioned

when we maximize over \mathcal{C}=\{\mathbf{C}\in\{0,1\}^{n\times k}|\forall i,\sum_{c}C_{ic}=1\}, \mathbf{C}^{*} is a local maximum of \det[\mathbf{C}^{\top}\tilde{\mathbf{A}}] if and only if \mathbf{C}^{*}\in\mathrm{idxmax}(\tilde{\mathbf{A}}\mathbf{D}^{*}) where \mathbf{D}^{*}=(\mathbf{C}^{*\top}\tilde{\mathbf{A}})^{-1}.

Geometrically, the average of \tilde{\mathbf{A}}’s row vectors lies at the intersection of all partitions, i.e., \bar{\tilde{\mathbf{a}}}*\mathbf{D}^{*}\propto\mathbf{1}, and if \mathbf{A}-\bar{\mathbf{a}}’s rank is d, the optimization goal is equivalent to maximizing the volume of the simplex consisting of the clusters’ means, multiplied by the product of the cluster sizes.

For the simple one-dimensional case, we show that DMI-clustering partitions the numbers based on the mean (Corollary 2.3). For other dimensions, the linearly-partitioned property directly induces an algorithm (Algorithm 2) for finding a local optimum. This algorithm is very similar to Lloyd’s algorithm for k-means: 1. assignment: assign the points according to the current partition; 2. update: update the partition using the current clusters. We call this algorithm k-cofactors since the update step employs the cofactors of the matrix consisting of the current cluster means. In the 2d case, we translate the operations of k-cofactors into natural geometric operations (Algorithm 3, Figure 4). At the end of this section, we also show that the linearly-partitioned property induces a polynomial-time algorithm when the dimension d is a constant, based on properties of shatter functions.

Corollary 2.3 (1d points).

For n real numbers a_{1},a_{2},\cdots,a_{n} where there exist i\neq j such that a_{i}\neq a_{j}, DMI-clustering will partition the numbers into two clusters: above the average and below the average.

Proof of Corollary 2.3.

Since DMI-clustering is affine-invariant, we subtract the average from each number and define the vector \mathbf{v}:=(a_{1}-\bar{a},a_{2}-\bar{a},\cdots,a_{n}-\bar{a}).

\det[[\mathbf{p},\mathbf{1}-\mathbf{p}]^{\top}[\mathbf{v}\quad\mathbf{1}]] =\det\begin{bmatrix}\mathbf{p}^{\top}\mathbf{v}&\mathbf{p}^{\top}\mathbf{1}\\ (\mathbf{1}-\mathbf{p})^{\top}\mathbf{v}&(\mathbf{1}-\mathbf{p})^{\top}\mathbf{1}\end{bmatrix}
\propto\det\begin{bmatrix}\mathbf{p}^{\top}\mathbf{v}&\mathbf{p}^{\top}\mathbf{1}\\ \mathbf{1}^{\top}\mathbf{v}&\mathbf{1}^{\top}\mathbf{1}\end{bmatrix} (add the first row to the second row)
=\det\begin{bmatrix}\mathbf{p}^{\top}\mathbf{v}&\mathbf{p}^{\top}\mathbf{1}\\ 0&\mathbf{1}^{\top}\mathbf{1}\end{bmatrix} (the sum of \mathbf{v} is 0)
\propto\mathbf{p}^{\top}\mathbf{v}

To maximize \mathbf{p}^{\top}\mathbf{v}, the 0-1 vector \mathbf{p} will pick all positive entries of \mathbf{v}. Thus, DMI-clustering partitions the numbers into two clusters: above the average and below the average. ∎
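As a sanity check of Corollary 2.3, the following small numpy experiment (synthetic numbers of our choosing) brute-forces the best two-cluster split and confirms that it coincides with the above/below-the-mean partition (up to swapping the two labels):

    import numpy as np

    # Synthetic 1d example for Corollary 2.3.
    a = np.array([1.0, 2.0, 7.0, 3.0, 10.0])
    v = a - a.mean()
    A_tilde = np.column_stack([a, np.ones_like(a)])  # here k = 2

    best_p, best_score = None, -np.inf
    for mask in range(1, 2 ** len(a) - 1):           # all nontrivial 0-1 vectors p
        p = np.array([(mask >> i) & 1 for i in range(len(a))], dtype=float)
        C = np.column_stack([p, 1 - p])
        score = abs(np.linalg.det(C.T @ A_tilde))
        if score > best_score:
            best_p, best_score = p, score

    # The optimal p (or its complement) marks exactly the entries above the mean.
    above = (v > 0).astype(float)
    assert np.array_equal(best_p, above) or np.array_equal(best_p, 1.0 - above)
    print(best_p, above)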

Input: \tilde{\mathbf{A}}\in\mathbb{R}^{n\times k}
Output: \mathbf{C}\in\{0,1\}^{n\times k}
Initialize \mathbf{C}
\mathbf{D}:=(\mathbf{C}^{\top}\tilde{\mathbf{A}})^{-1}
while \mathbf{C}\notin\mathrm{idxmax}(\tilde{\mathbf{A}}\mathbf{D}) do
       \mathbf{C}\leftarrow\mathrm{idxmax}(\tilde{\mathbf{A}}\mathbf{D}) (update \mathbf{C}’s row only if it does not already indicate the index of the highest entry of the corresponding row of \tilde{\mathbf{A}}\mathbf{D})
        /* assignment step */
       \mathbf{D}:=(\mathbf{C}^{\top}\tilde{\mathbf{A}})^{-1}
        /* update step */

end while
return \mathbf{C}
Algorithm 2 k-cofactors: finding a local maximum of \max_{\mathbf{C}\in\mathcal{C}}\det[\mathbf{C}^{\top}\tilde{\mathbf{A}}]
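A compact numpy sketch of Algorithm 2 (a heuristic rendering that assumes the matrices \mathbf{C}^{\top}\tilde{\mathbf{A}} encountered along the way stay invertible; initialization, e.g. by \mathrm{idxmax}(\tilde{\mathbf{A}}-\bar{\tilde{\mathbf{a}}}) as in Figure 5, is left to the caller):

    import numpy as np

    def k_cofactors(A_tilde, C_init, max_iter=100, tol=1e-12):
        # Heuristic sketch of Algorithm 2; assumes the matrices C^T A_tilde encountered
        # along the way are invertible.
        n, k = C_init.shape
        C = C_init.astype(float)
        for _ in range(max_iter):
            D = np.linalg.inv(C.T @ A_tilde)          # update step
            scores = A_tilde @ D
            row_max = scores.max(axis=1, keepdims=True)
            # Per the remark in Algorithm 2, a row keeps its current cluster if that
            # cluster already attains the row maximum; otherwise it moves to an argmax.
            keep = (scores * C).sum(axis=1, keepdims=True) >= row_max - tol
            moved = np.zeros_like(C)
            moved[np.arange(n), np.argmax(scores, axis=1)] = 1.0
            C_new = np.where(keep, C, moved)          # assignment step
            if np.array_equal(C_new, C):              # C is in idxmax(A_tilde D): stop
                return C
            C = C_new
        return C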
Figure 4: Demonstration of the k-cofactors algorithm: the figure shows a running example for 2d points and the repeated steps during the process. Left: the current \mathbf{D} determines a partition (the orange dashed lines); middle (assignment step): points are clustered by the partition; right (update step): a new \mathbf{D} is generated from the current clusters.
Figure 5: An empirical run of the k-cofactors algorithm: the initial k clusters are created by \mathbf{C}\leftarrow\mathrm{idxmax}(\tilde{\mathbf{A}}-\bar{\tilde{\mathbf{a}}}). Each subfigure shows the current cluster assignment \mathbf{C} and its generated partition \mathbf{D}=(\mathbf{C}^{\top}\tilde{\mathbf{A}})^{-1} (the black dashed lines). Note that the process stops as soon as the generated partition \mathbf{D}=(\mathbf{C}^{\top}\tilde{\mathbf{A}})^{-1} is consistent with the current clusters, i.e., \mathbf{C}\in\mathrm{idxmax}(\tilde{\mathbf{A}}\mathbf{D}). The red circle marks the inconsistent part.
Input: (\mathbf{a}_{1},\mathbf{a}_{2},\cdots,\mathbf{a}_{n})
  /* n 2d points */
Output: \mathbf{C}\in\{0,1\}^{n\times 3}
  /* 3 clusters, \mathbf{a}_{i} belongs to cluster c if C_{ic}=1 */
Initialize \mathbf{C}
Calculate the mean \bar{\mathbf{a}} of all points
  /* the center intersection point in Figure 4 */
Calculate the mean \bar{\mathbf{a}}_{c} of each cluster c=1,2,3
  /* the stars in Figure 4 */
Draw partition lines, starting from \bar{\mathbf{a}}, by extending the line segment [\bar{\mathbf{a}}_{c},\bar{\mathbf{a}}] for all c
  /* the black dashed lines in Figure 4 */
while the partition is not consistent with the current clusters do
       Assign points to clusters based on the current partition
        /* assignment step */
       Update the partition based on the current clusters
        /* update step */

end while
return \mathbf{C}
Algorithm 3 k-cofactors for 2d points
Corollary 2.4 (2d points).

For any \mathbf{A}\in\mathbb{R}^{n\times 2}, if \mathbf{A}-\bar{\mathbf{a}}’s rank is two (otherwise it reduces to the 1d case), then running \text{DMI-C}(\mathbf{A}) (Algorithm 1) with the k-cofactors algorithm (Algorithm 2) is equivalent to Algorithm 3. Geometrically, the optimization goal is equivalent to maximizing the area of the triangle \Delta_{\bar{\mathbf{a}}_{1},\bar{\mathbf{a}}_{2},\bar{\mathbf{a}}_{3}} multiplied by n_{1}n_{2}n_{3}, where n_{c} is the number of points in cluster c.

Proof of Corollary 2.4.

If 𝐀𝐚¯\mathbf{A}-\bar{\mathbf{a}}’s rank is two, then [𝐀𝟏][\mathbf{A}\quad\mathbf{1}] is rank three. Therefore, we will pick 𝐀~=[𝐀𝟏]\tilde{\mathbf{A}}=[\mathbf{A}\quad\mathbf{1}].

𝐂𝐀~=diag(n1,n2,n3)[𝐚¯11𝐚¯21𝐚¯31]\mathbf{C}^{\top}\tilde{\mathbf{A}}=\mathrm{diag}(n_{1},n_{2},n_{3})\begin{bmatrix}&\bar{\mathbf{a}}_{1}&\quad 1\\ &\bar{\mathbf{a}}_{2}&\quad 1\\ &\bar{\mathbf{a}}_{3}&\quad 1\\ \end{bmatrix} where ncn_{c} is the number of points in cluster cc. Thus maximizing det[𝐂𝐀~]\det[\mathbf{C}^{\top}\tilde{\mathbf{A}}] is equivalent to maximizing the area of the triangles Δ𝐚¯1,𝐚¯2,𝐚¯3\Delta_{\bar{\mathbf{a}}_{1},\bar{\mathbf{a}}_{2},\bar{\mathbf{a}}_{3}} multiplying the n1n2n3n_{1}n_{2}n_{3}.

For 𝐃=(𝐂𝐀~)1=[𝐝1,𝐝2,𝐝3]\mathbf{D}=(\mathbf{C}^{\top}\tilde{\mathbf{A}})^{-1}=[\mathbf{d}_{1},\mathbf{d}_{2},\mathbf{d}_{3}], we have

[𝐚¯11]𝐝1=1n1[𝐚¯11]𝐝2=[𝐚¯11]𝐝3=0[\bar{\mathbf{a}}_{1}\quad 1]*\mathbf{d}_{1}=\frac{1}{n_{1}}\quad[\bar{\mathbf{a}}_{1}\quad 1]*\mathbf{d}_{2}=[\bar{\mathbf{a}}_{1}\quad 1]*\mathbf{d}_{3}=0

We also have that

[𝐚¯𝟏]𝐝1=(cncn[𝐚¯c𝟏])𝐝1=1n[\bar{\mathbf{a}}\quad\mathbf{1}]*\mathbf{d}_{1}=(\sum_{c}\frac{n_{c}}{n}[\bar{\mathbf{a}}_{c}\quad\mathbf{1}])*\mathbf{d}_{1}=\frac{1}{n}

and by similar analysis, [𝐚¯𝟏]𝐝2=[𝐚¯𝟏]𝐝3=1n[\bar{\mathbf{a}}\quad\mathbf{1}]*\mathbf{d}_{2}=[\bar{\mathbf{a}}\quad\mathbf{1}]*\mathbf{d}_{3}=\frac{1}{n}

For point 𝐚=λ𝐚¯+(1λ)𝐚¯1\mathbf{a}=\lambda\bar{\mathbf{a}}+(1-\lambda)\bar{\mathbf{a}}_{1},

[𝐚𝟏]𝐝1=λ1n+(1λ)1n1[\mathbf{a}\quad\mathbf{1}]*\mathbf{d}_{1}=\lambda\frac{1}{n}+(1-\lambda)\frac{1}{n_{1}}

and

[𝐚𝟏]𝐝2=[𝐚𝟏]𝐝3=λ1n[\mathbf{a}\quad\mathbf{1}]*\mathbf{d}_{2}=[\mathbf{a}\quad\mathbf{1}]*\mathbf{d}_{3}=\lambda\frac{1}{n}

when λ<1\lambda<1, 𝐚\mathbf{a} should be mapped to cluster 1 and when λ1\lambda\geq 1, 𝐚\mathbf{a} can be mapped to either cluster 2 or cluster 3. Thus, we can obtain the partition line for cluster 2 and 3, call it line 2-3, by extending line segment [𝐚¯1,𝐚¯][\bar{\mathbf{a}}_{1},\bar{\mathbf{a}}]. By analogous analysis for 𝐚¯2,𝐚¯3\bar{\mathbf{a}}_{2},\bar{\mathbf{a}}_{3}, we obtain three partition lines like Figure 4.

For each point 𝐱\mathbf{x} between line 1-2 and 2-3, we can represent it as 𝐱=𝐚¯+s(𝐚¯𝐚¯3)+t(𝐚¯𝐚¯1)\mathbf{x}=\bar{\mathbf{a}}+s(\bar{\mathbf{a}}-\bar{\mathbf{a}}_{3})+t(\bar{\mathbf{a}}-\bar{\mathbf{a}}_{1}) where s,t>0s,t>0. Then [𝐱𝟏]𝐝2=1n(1+s+t)[\mathbf{x}\quad\mathbf{1}]*\mathbf{d}_{2}=\frac{1}{n}(1+s+t), [𝐱𝟏]𝐝1=1n(1+s+t)t1n1[\mathbf{x}\quad\mathbf{1}]*\mathbf{d}_{1}=\frac{1}{n}(1+s+t)-t\frac{1}{n_{1}} and [𝐱𝟏]𝐝3=1n(1+s+t)s1n3[\mathbf{x}\quad\mathbf{1}]*\mathbf{d}_{3}=\frac{1}{n}(1+s+t)-s\frac{1}{n_{3}}. Thus, 𝐱\mathbf{x} belongs to cluster 2. By analogous analysis for other points, we finish the proof.

Proof of Theorem 2.2.

We first prove the affine-invariance. Note that

[𝐀𝐓+𝐛𝟏]=[𝐀𝟏][𝐓0𝐛1].[\mathbf{A}\mathbf{T}+\mathbf{b}\quad\mathbf{1}]=[\mathbf{A}\quad\mathbf{1}]\begin{bmatrix}\mathbf{T}&0\\ \mathbf{b}&1\end{bmatrix}.

Thus, for invertible 𝐓\mathbf{T}, both [𝐀𝐓+𝐛𝟏][\mathbf{A}\mathbf{T}+\mathbf{b}\quad\mathbf{1}] and [𝐀𝟏][\mathbf{A}\quad\mathbf{1}] have the same rank kk.

Since 𝐀~\tilde{\mathbf{A}} are kk linearly independent columns of [𝐀𝟏][\mathbf{A}\quad\mathbf{1}], there exists 𝐏k×(d+1)\mathbf{P}\in\mathbb{R}^{k\times(d+1)} such that [𝐀𝟏]=𝐀~𝐏[\mathbf{A}\quad\mathbf{1}]=\tilde{\mathbf{A}}\mathbf{P}. Therefore, we have [𝐀𝐓+𝐛𝟏]=𝐀~𝐏[𝐓0𝐛1][\mathbf{A}\mathbf{T}+\mathbf{b}\quad\mathbf{1}]=\tilde{\mathbf{A}}\mathbf{P}\begin{bmatrix}\mathbf{T}&0\\ \mathbf{b}&1\end{bmatrix}.

Thus, for any 𝐀~\tilde{\mathbf{A}}^{\prime} that consists of kk linearly independent columns in [𝐀𝐓+𝐛𝟏][\mathbf{A}\mathbf{T}+\mathbf{b}\quad\mathbf{1}], we pick corresponded column vectors of 𝐏[𝐓0𝐛1]\mathbf{P}\begin{bmatrix}\mathbf{T}&0\\ \mathbf{b}&1\end{bmatrix} to construct matrix 𝐐k×k\mathbf{Q}\in\mathbb{R}^{k\times k}. We have 𝐀~=𝐀~𝐐\tilde{\mathbf{A}}^{\prime}=\tilde{\mathbf{A}}\mathbf{Q} and 𝐐\mathbf{Q} is invertible (otherwise 𝐀~\tilde{\mathbf{A}}^{\prime}’s rank will be less than kk which is a contradiction).

Therefore,

det[𝐂𝐀~]=det[𝐂𝐀~]det[𝐐]±det[𝐂𝐀~]\det[\mathbf{C}^{\top}\tilde{\mathbf{A}^{\prime}}]=\det[\mathbf{C}^{\top}\tilde{\mathbf{A}}]\det[\mathbf{Q}]\propto\pm\det[\mathbf{C}^{\top}\tilde{\mathbf{A}}]

due to the multiplicative property of determinant. The relative-invariance of DMI-score and affine-invariance of DMI-clustering directly follows from the above formula.

We then show that DMI-clustering is legal. Note that since it is affine-invariant, we only need to analyze DMI-C(𝐂)\text{DMI-C}(\mathbf{C}^{*}). Since the sum of the column vectors of 𝐂\mathbf{C}^{*} is all one vector 𝟏\mathbf{1} and the rank of 𝐂\mathbf{C}^{\top} is dd, we can pick 𝐂\mathbf{C}^{*} itself as the matrix that consists of dd linearly independent columns of [𝐂𝟏][\mathbf{C}^{*}\quad\mathbf{1}].

We normalize 𝐂\mathbf{C}^{*} by dividing each column vector by its sum to obtain ncol(𝐂)\mathrm{ncol}(\mathbf{C}^{*}). Note that this normalization is equivalent to multiplying a diagonal matrix on the right. Thus, maximizing det[𝐂𝐂]\det[\mathbf{C}^{\top}\mathbf{C}^{*}] over 𝐂𝒞soft\mathbf{C}\in\mathcal{C}^{soft} is equivalent to max𝐂𝒞softdet[𝐂ncol(𝐂)]\max_{\mathbf{C}\in\mathcal{C}^{soft}}\det[\mathbf{C}^{\top}\mathrm{ncol}(\mathbf{C}^{*})].

Moreover, \mathbf{C}^{*\top}\mathrm{ncol}(\mathbf{C}^{*}) is an identity matrix whose determinant is one. Since both \mathbf{C}^{\top} and \mathrm{ncol}(\mathbf{C}^{*}) are column-stochastic matrices (each entry is in [0,1] and each column sums to 1), their product is still a column-stochastic matrix. When \mathbf{C} does not equal \mathbf{C}^{*} even up to column permutation, i.e., there does not exist a permutation matrix \pi such that \mathbf{C}=\pi\mathbf{C}^{*}, the product of \mathbf{C}^{\top} and \mathrm{ncol}(\mathbf{C}^{*}) will be a non-permutation column-stochastic matrix whose determinant is strictly less than 1. Therefore, when there exists a full-rank \mathbf{C}^{*}\in\mathcal{C} such that \mathbf{A}=\mathbf{C}^{*\top}\mathbf{B} where \mathbf{B} is invertible, \text{DMI-C}(\mathbf{A}) equals \mathbf{C}^{*} (up to column permutation).

Finally, we prove that the clustering result is linearly-partitioned. 𝐀~\tilde{\mathbf{A}}’s row vectors are 𝐚~1,𝐚~2,,𝐚~n\tilde{\mathbf{a}}_{1},\tilde{\mathbf{a}}_{2},\cdots,\tilde{\mathbf{a}}_{n}. 𝐂𝐀~\mathbf{C}^{*\top}\tilde{\mathbf{A}}’s row vectors are denoted by 𝐯1,𝐯2,,𝐯k\mathbf{v}_{1},\mathbf{v}_{2},\cdots,\mathbf{v}_{k}.

For all i[n]i\in[n], we denote the cluster it is mapped to by cic_{i}^{*}, i.e., 𝐂i,ci=1\mathbf{C}^{*}_{i,c_{i}^{*}}=1. For all c[k]c\in[k], when 𝐂\mathbf{C}^{*} is a local maximum, the current assignment is better than moving ii to cluster cc, that is,

det[𝐂𝐀~]=det[[𝐯1𝐯2𝐯k]]det[[𝐯ci𝐚~i𝐯c+𝐚~i]]\det[\mathbf{C}^{*\top}\tilde{\mathbf{A}}]=\det[\begin{bmatrix}\mathbf{v}_{1}\\ \mathbf{v}_{2}\\ \vdots\\ \mathbf{v}_{k}\end{bmatrix}]\geq\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}-\tilde{\mathbf{a}}_{i}\\ \vdots\\ \mathbf{v}_{c}+\tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]

Due to the multilinear property of determinant,

det[[𝐯ci𝐚~i𝐯c+𝐚~i]]=\displaystyle\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}-\tilde{\mathbf{a}}_{i}\\ \vdots\\ \mathbf{v}_{c}+\tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]= det[[𝐯ci𝐯c+𝐚~i]]det[[𝐚~i𝐯c+𝐚~i]]\displaystyle\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}\\ \vdots\\ \mathbf{v}_{c}+\tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]-\det[\begin{bmatrix}\vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\\ \mathbf{v}_{c}+\tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]
=\displaystyle= det[[𝐯ci𝐯c]]+det[[𝐯ci𝐚~i]]det[[𝐚~i𝐯c]]\displaystyle\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}\\ \vdots\\ \mathbf{v}_{c}\\ \vdots\end{bmatrix}]+\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}\\ \vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]-\det[\begin{bmatrix}\vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\\ \mathbf{v}_{c}\\ \vdots\end{bmatrix}]

Thus, we have

det[[𝐚~i𝐯c]]det[[𝐯ci𝐚~i]]\det[\begin{bmatrix}\vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\\ \mathbf{v}_{c}\\ \vdots\end{bmatrix}]\geq\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}\\ \vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]

for all cc.

Since 𝐃:=(𝐂𝐀~)1\mathbf{D}^{*}:=(\mathbf{C}^{*\top}\tilde{\mathbf{A}})^{-1}, det[𝐂𝐀~]Dij\det[\mathbf{C}^{*\top}\tilde{\mathbf{A}}]D^{*}_{ij} is the cofactors of matrix 𝐂𝐀~\mathbf{C}^{*\top}\tilde{\mathbf{A}}. We denote 𝐃\mathbf{D}^{*}’s column vectors by 𝐝1,𝐝2,,𝐝k\mathbf{d}^{*}_{1},\mathbf{d}^{*}_{2},\cdots,\mathbf{d}^{*}_{k}.

Due to Laplace expansion, we have

𝐚~i(det[𝐂𝐀~]𝐝ci)=det[[𝐚~i𝐯c]]det[[𝐯ci𝐚~i]]=𝐚~i(det[𝐂𝐀~]𝐝c)\tilde{\mathbf{a}}_{i}*(\det[\mathbf{C}^{*\top}\tilde{\mathbf{A}}]\mathbf{d}^{*}_{c_{i}^{*}})=\det[\begin{bmatrix}\vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\\ \mathbf{v}_{c}\\ \vdots\end{bmatrix}]\geq\det[\begin{bmatrix}\vdots\\ \mathbf{v}_{c_{i}^{*}}\\ \vdots\\ \tilde{\mathbf{a}}_{i}\\ \vdots\end{bmatrix}]=\tilde{\mathbf{a}}_{i}*(\det[\mathbf{C}^{*\top}\tilde{\mathbf{A}}]*\mathbf{d}^{*}_{c})

Since det[𝐂𝐀~]\det[\mathbf{C}^{*\top}\tilde{\mathbf{A}}] is the maximum that must be positive, we have for all ii, for all cc,

𝐚~i𝐝ci𝐚~i𝐝c.\tilde{\mathbf{a}}_{i}*\mathbf{d}^{*}_{c_{i}^{*}}\geq\tilde{\mathbf{a}}_{i}*\mathbf{d}^{*}_{c}.

Thus, 𝐂idxmax(𝐀~𝐃)\mathbf{C}^{*}\in\mathrm{idxmax}(\tilde{\mathbf{A}}\mathbf{D}^{*}). Moreover,

𝟏𝐃1=𝟏𝐂𝐀~=𝟏𝐀~𝐚~¯\displaystyle\mathbf{1}\mathbf{D}^{*-1}=\mathbf{1}\mathbf{C}^{*\top}\tilde{\mathbf{A}}=\mathbf{1}\tilde{\mathbf{A}}\propto\bar{\tilde{\mathbf{a}}}

which shows that 𝐚~¯𝐃𝟏\bar{\tilde{\mathbf{a}}}*\mathbf{D}^{*}\propto\mathbf{1}, i.e., the mean of all data points is in the intersection of all partitions. We also have

𝐂𝐀~=diag(n1,n2,,nk)[𝐚¯11𝐚¯21𝐚¯k1]\mathbf{C}^{\top}\tilde{\mathbf{A}}=\mathrm{diag}(n_{1},n_{2},\cdots,n_{k})\begin{bmatrix}&\bar{\mathbf{a}}_{1}&\quad 1\\ &\bar{\mathbf{a}}_{2}&\quad 1\\ &\vdots&\quad\vdots\\ &\bar{\mathbf{a}}_{k}&\quad 1\\ \end{bmatrix}

whose determinant is proportional to the volume of the simplex formed by (𝐚¯1,𝐚¯2,,𝐚¯k)(\bar{\mathbf{a}}_{1},\bar{\mathbf{a}}_{2},\cdots,\bar{\mathbf{a}}_{k}), multiplying the product of cluster sizes n1,n2,,nkn_{1},n_{2},\cdots,n_{k}. We finish the proof.

The linearly-partitioned part of the above theorem shows that instead of trying all possible n\times k matrices \mathbf{C}, we can try all possible k\times k matrices \mathbf{D} and check whether the corresponding \mathbf{C} is optimal to obtain the result of DMI-clustering. In particular, when the dimension d is a constant, the number of distinct linear partitions is polynomial in the number of data points n due to the theory of VC dimension [4]. This implies the following corollary.

Corollary 2.5.

When the dimension d is a constant and we maximize over the default constraint \mathcal{C}=\{\mathbf{C}\in\{0,1\}^{n\times k}|\forall i,\sum_{c}C_{ic}=1\}, there exists a polynomial-time algorithm for DMI-clustering.

Proof.

When d is a constant, the number of clusters k\leq d+1 is a constant as well. Given n data points, for any linear partition \mathbf{D}^{*}, the set of points that belong to cluster 1 is \{\tilde{a}|\tilde{a}*\mathbf{d}^{*}_{1}>\tilde{a}*\mathbf{d}^{*}_{c},c\neq 1\}, which is an intersection of k-1 half-spaces. Since the VC dimension of half-spaces is a constant (k+1 here), there are at most O(n^{(k-1)(k+1)}) distinct possibilities for cluster 1. Moreover, the number of clusters k is a constant. Thus there are at most O(n^{k(k-1)(k+1)}) distinct linear partitions. We can therefore enumerate these linear partitions in polynomial time and check which one gives the cluster matrix \mathbf{C}^{*} that maximizes the determinant.

In the next section, we will apply DMI-clustering to extract knowledge from elicited information without ground truth. We will show that this method will be robust to people’s strategies and expertise.

3 Multi-task & Invariant Clustering

This section will introduce a multi-task information elicitation mechanism which is not only dominantly-truthful but also extracts knowledge from the collected answers. The extraction is robust when each task is answered by a large number of randomly selected people.

3.1 Knowledge-Extracted DMI (K-DMI)-Mechanism

Setting

There are n a priori similar tasks and m agents. Each task is a multi-choice question with C options. Each agent is independently assigned a random set of tasks (of size \geq 2C) in random order. Agents finish the tasks without communication. Each agent i’s reports are represented as an n\times C 0-1 report matrix \mathbf{R}_{i} where the (t,c) entry of \mathbf{R}_{i} is 1 if and only if agent i reports option c for task t. The all-zero rows of \mathbf{R}_{i} correspond to the tasks that agent i does not perform.
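For concreteness, a report matrix \mathbf{R}_{i} in this setting can be assembled as follows (a minimal helper of ours; task_answers maps each performed task to the reported option):

    import numpy as np

    def report_matrix(task_answers, n, C):
        # Build agent i's n x C 0-1 report matrix R_i from a mapping
        # {task index: reported option index}; unperformed tasks stay all-zero rows.
        R = np.zeros((n, C), dtype=int)
        for t, c in task_answers.items():
            R[t, c] = 1
        return R

    # Example: 5 tasks, 3 options; the agent answered tasks 0, 2, and 4.
    print(report_matrix({0: 1, 2: 0, 4: 2}, n=5, C=3))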

Model

Tasks have states which are drawn i.i.d. from an unknown distribution. Each task t’s state is a vector \mathbf{a}_{t}\in[0,1]^{C}, and each agent receives signal c with probability \mathbf{a}_{t}(c). Each agent i is associated with a strategy \mathbf{S}_{i}\in[0,1]^{C\times C} where \mathbf{S}_{i}(c,c^{\prime}) is the probability that she receives c and reports c^{\prime}. We assume that each agent performs the same strategy for all tasks. This assumption is reasonable since the tasks are a priori similar and agents receive tasks in a random order independently. A payment scheme is dominantly truthful if for each agent, regardless of other people’s strategies, telling the truth pays her at least as much as any other strategy in expectation (not necessarily strictly more).

Input: \mathbf{R}_{i}\in\{0,1\}^{n\times C}, i\in[m]
Output: extracted knowledge \mathbf{C}^{*}, payments p_{i}, i\in[m]
/* Knowledge extraction */
\mathbf{A}=\sum_{i}\mathbf{R}_{i}
\forall t,c, A_{tc}=\frac{A_{tc}}{\sum_{c^{\prime}}A_{tc^{\prime}}}
Get \mathbf{C}^{*} from \text{DMI-C}(\mathbf{A})
/* Payment calculation */
for i\in[m] do
       \mathbf{A}=\sum_{j\neq i}\mathbf{R}_{j}
       \forall t,c, A_{tc}=\frac{A_{tc}}{\sum_{c^{\prime}}A_{tc^{\prime}}}
       T_{i} is the set of tasks agent i performed, T_{-i}=[n]\setminus T_{i}

      /* Use T_{-i} to obtain the partition to guarantee independence */
       Get \mathbf{D}^{*}_{-i} and L from \text{DMI-C}(\mathbf{A}_{T_{-i},:})
       /* Use the obtained partition to extract knowledge for tasks in T_{i} */
       \mathbf{C}^{*}_{i}\leftarrow\mathrm{idxmax}(\mathbf{A}_{T_{i},L}\mathbf{D}^{*}_{-i})
       Divide T_{i} into two disjoint parts T_{1},T_{2} such that both |T_{1}|,|T_{2}|\geq C
       Restrict \mathbf{C}^{*}_{i} and \mathbf{R}_{i} to T_{\ell} to get \mathbf{C}^{*}_{\ell} and \mathbf{R}_{i}^{\ell} respectively for \ell=1,2
       /* Reward agent i using the extracted knowledge */
       p_{i}=\det(\mathbf{C}_{1}^{*\top}\mathbf{R}^{1}_{i})\det(\mathbf{C}_{2}^{*\top}\mathbf{R}^{2}_{i})
end for
Algorithm 4 Knowledge-Extracted DMI (K-DMI)-Mechanism
Figure 6: Knowledge extraction: we aggregate all agents’ answers and normalize each row to obtain the answer matrix \mathbf{A}. When \mathbf{A}’s rank is C, \text{DMI-C}(\mathbf{A}) is equivalent to \max_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}].
Figure 7: Payment calculation: for each agent, we use the tasks she does not perform to learn the partition \mathbf{D} and induce a clustering result for the tasks she performs. To pay her, we divide the tasks she performs into two disjoint parts T_{1},T_{2}. We calculate the determinant of the product of the two matrices in each green rectangle and multiply the two determinants. Note that this is the same as the original DMI-Mechanism if we replace the clustering result with a peer agent’s report.

K-DMI-Mechanism is a modification of DMI-Mechanism which pays each agent the determinant mutual information between her report and her peer’s report in expectation.

Definition 3.1 (Determinant based Mutual Information (DMI) [13]).

For two random variables X,Y whose joint distribution is denoted by the matrix \mathbf{U}_{X,Y},

\textsc{DMI}(X;Y):=|\det(\mathbf{U}_{X,Y})|

Since DMI satisfies the data processing inequality and its square has a polynomial format, we can use \geq 2C tasks to implement it. Compared to DMI-Mechanism, K-DMI-Mechanism adds a clustering process, which can be interpreted as finding a clustering that has the highest determinant mutual information with the collection of all agents’ answers. For the payment calculation, K-DMI-Mechanism pays each agent the determinant mutual information between her report and the clustering result to reduce variance. The original DMI-Mechanism’s payments may not be fair to hard-working agents when there are a lot of low-effort agents. The current payment calculation method may be more robust to other people’s strategies. For example, in the simple case where the answer distributions only have C distinct formats, we can use the other tasks to learn the correct partition, and in K-DMI each agent is effectively rewarded based on her report and the underlying ground truth.
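A minimal sketch of the knowledge-extraction step of Algorithm 4 (the payment loop and the T_{i}/T_{-i} split are omitted; dmi_clustering stands for any implementation of DMI-C, e.g. the sketches in Section 2):

    import numpy as np

    def extract_knowledge(reports, dmi_clustering):
        # Knowledge-extraction step of Algorithm 4 (payments omitted).
        # `reports` is a list of n x C 0-1 report matrices R_i; `dmi_clustering` is any
        # routine implementing DMI-C. Assumes every task received at least one report.
        A = np.sum(reports, axis=0).astype(float)     # aggregate answers per task/option
        A = A / A.sum(axis=1, keepdims=True)          # normalize each task's feedback
        return dmi_clustering(A)                      # cluster the tasks = assign answers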

Theorem 3.2.

K-DMI-Mechanism is dominantly-truthful. If the number of agents m goes to infinity and \lim_{m\rightarrow\infty}\frac{1}{m}\sum_{i\in[m]}\mathbf{S}_{i} exists and is invertible, the clustering result is invariant to people’s strategies. When \mathbf{A}’s rank is C, \text{DMI-C}(\mathbf{A}) is equivalent to \mathrm{argmax}_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}], and v:=\max_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}] also indicates the quality of the answers: v decreases when agents perform strategies.

Proof of Theorem 3.2.

From agent i’s view, each task’s elicited answers are drawn i.i.d. from a distribution. Thus, the partition \mathbf{D}_{-i}^{*} learned from the set of tasks T_{-i} is independent of the set of tasks T_{i} she performs. Therefore, after applying \mathbf{D}_{-i}^{*} to T_{i}’s answers (excluding agent i’s own answer) to obtain a clustering result, the i.i.d. assumption is still satisfied for the pair consisting of agent i’s answer and the clustering result \mathbf{C}_{i}^{*}. We can think of the clustering result \mathbf{C}_{i}^{*} as the answer of a virtual peer agent. The payment calculation then reduces to DMI-Mechanism. Since DMI-Mechanism is dominantly-truthful, K-DMI-Mechanism is also dominantly-truthful.

For each task t, in expectation, the honest answer distribution will be \mathbf{a}_{t}. Note that each agent is assigned a random set of tasks independently. We denote by p the probability that each agent is assigned task t. In expectation, there will be pm agents who are assigned task t. Moreover, after agents perform strategies, before normalization, the expected answer feedback will be \sum_{i\in[m]}p\mathbf{a}_{t}\mathbf{S}_{i}=\mathbf{a}_{t}\sum_{i\in[m]}p\mathbf{S}_{i}. Thus, when m goes to infinity, the expected answer distribution will be \mathbf{a}_{t}\lim_{m\rightarrow\infty}\frac{1}{m}\sum_{i\in[m]}\mathbf{S}_{i}. Since DMI-clustering is affine-invariant, the clustering result is invariant to people’s strategies.

When \mathbf{A}’s rank is C, we can pick \tilde{\mathbf{A}}=\mathbf{A} (the all-one column is the sum of \mathbf{A}’s columns and thus cannot be added here). Thus, \text{DMI-C}(\mathbf{A}) is equivalent to \mathrm{argmax}_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}].

For v:=\max_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}], when agents perform strategies, it becomes

v =\max_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}(\lim_{m\rightarrow\infty}\frac{1}{m}\sum_{i\in[m]}\mathbf{S}_{i})]
  \leq\max_{\mathbf{C}}\det[\mathbf{C}^{\top}\mathbf{A}]

since the absolute value of any transition matrix’s determinant is at most one.

Thus, we finish the proof.

Discussion for the practical implementation of K-DMI

In DMI-clustering, we cannot distinguish two clusterings if they are equivalent up to a permutation of the options. However, if we have a small set of labeled questions, given an agreement measure, we can pick the option permutation that has the highest agreement with the labeled questions. In addition to the default constraint, we can also pick another constraint set \mathcal{C} without losing the strategy-invariance property. For example, we can use other aggregation methods’ results as candidates and determine the best candidate using DMI-clustering. In the payment calculation part, if we have some assumption about each agent’s honest answer and the obtained clustering, e.g. that the determinant of their product is positive in expectation, we do not need to divide the tasks into two parts: we can just pay the agent the determinant of the product of her report matrix and the clustering result. When running DMI-clustering, we can employ some easy heuristics to obtain an initialization. For example, we can run the surprisingly popular method or use the plurality vote of high-payment agents to get an initialization.

3.2 DMI-clustering vs. Surprisingly Popular

The surprisingly popular method picks the option that is the most surprisingly popular compared to the prior. In the multi-task setting, we can use the average to estimate the prior.

Definition 3.3 (Surprisingly popular).

For a task whose answer distribution is \mathbf{a}, we map it to cluster \mathrm{argmax}_{c}\frac{a_{c}}{\bar{a}_{c}}.
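In code, Definition 3.3 amounts to one line of numpy (a sketch of ours, using the column means of the normalized answer matrix as the estimated prior):

    import numpy as np

    def surprisingly_popular_clusters(A):
        # Definition 3.3: map each task (row of the n x C answer matrix A) to the
        # cluster argmax_c a_c / abar_c, where abar is the column mean (estimated prior).
        A = np.asarray(A, dtype=float)
        prior = A.mean(axis=0)
        return np.argmax(A / prior, axis=1)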

The surprisingly popular method shows a certain amount of robustness to people’s strategies compared to plurality vote. For example, if a large set of people always report one option with 100% probability, the plurality vote method will be distorted, but the surprisingly popular method will decrease this distortion. In fact, in the binary-question case, we will show that the surprisingly popular method is equivalent to DMI-clustering and thus is robust to people’s strategies. However, in general, without any prior assumption, the surprisingly popular method satisfies neither the legal nor the affine-invariant property and thus is not robust to people’s strategies even if the number of agents goes to infinity. We show a visual comparison between DMI-clustering and the surprisingly popular method for the three-option case in Figure 8.

Corollary 3.4 (Binary question: DMI-clustering = Surprisingly popular).

In the binary case, DMI-clustering will map \mathbf{a}=(a_{1},a_{2}) to cluster i if a_{i}>\bar{a}_{i} for i=1,2.

Proof.

In this case, \mathbf{A} is an n\times 2 row-stochastic matrix. Since DMI-clustering is affine-invariant, we can subtract the row average from each row: \text{DMI-C}(\mathbf{A})=\text{DMI-C}(\mathbf{A}-\bar{\mathbf{a}}). Since [\mathbf{A}-\bar{\mathbf{a}}\quad\mathbf{1}] has rank 2, we pick \tilde{\mathbf{A}} as the first and third columns [\mathbf{A}_{:,1}-\bar{a}_{1}\quad\mathbf{1}] and use \mathbf{v} as a shorthand for \mathbf{A}_{:,1}-\bar{a}_{1}. Note that the sum of \mathbf{v} is 0. Then, by the same analysis as in Corollary 2.3, the optimal 0-1 vector \mathbf{p} will pick all positive entries of \mathbf{v}. Thus, DMI-clustering will map \mathbf{a}=(a_{1},a_{2}) to cluster i if a_{i}-\bar{a}_{i}>0 for i=1,2. ∎

Figure 8: Surprisingly popular vs. DMI-clustering: both methods have the mean of all points as the center of the partitions. Left (surprisingly popular): the center determines the cluster, the partition lines (the black dashed lines) are generated by extending the line segment between the extreme point and the center. Right (DMI-clustering): the partition lines extend the line segments between each cluster’s mean and the center. Thus, this partition can be seen as a “rotated” version of the surprisingly popular method’s partition.

Both the surprisingly popular method and DMI-clustering set the mean of all points as the intersection of the partitions. If we normalize each column of \mathbf{A} by dividing it by its sum, we will see that DMI-clustering can be seen as a “rotated” version of surprisingly popular.

Corollary 3.5 (DMI-clustering = “rotated” surprisingly popular).

With the column-normalized \mathbf{A}, \mathrm{ncol}(\mathbf{A}), the surprisingly popular method is represented as \mathbf{C}\leftarrow\mathrm{idxmax}(\mathrm{ncol}(\mathbf{A})). If \mathrm{ncol}(\mathbf{A}) has rank C, there exists \mathbf{D} such that DMI-clustering’s result is represented as \mathbf{C}\leftarrow\mathrm{idxmax}(\mathrm{ncol}(\mathbf{A})\mathbf{D}).

Proof.

The first statement follows directly from the definition of the surprisingly popular method. It remains to prove the second statement. Since DMI-clustering is invariant to right affine transformations of \mathbf{A}, we can use the column-normalized \mathbf{A} as input. For [\mathrm{ncol}(\mathbf{A})\quad\mathbf{1}], the last column \mathbf{1} is a linear combination of \mathrm{ncol}(\mathbf{A})’s columns since \mathbf{A}’s columns sum to \mathbf{1}. If \mathrm{ncol}(\mathbf{A}) has rank C, then we can pick \tilde{\mathbf{A}}=\mathrm{ncol}(\mathbf{A}). Thus, there exists \mathbf{D} such that DMI-clustering’s result is represented as \mathbf{C}\leftarrow\mathrm{idxmax}(\mathrm{ncol}(\mathbf{A})\mathbf{D}).

4 Single-task & Method of Moments

The biggest advantage of the surprisingly popular method is that it only uses first-moment information; accordingly, it was originally proposed for the single-task setting. In this section, we introduce the single-task setting and show that, from a clustering view, we can utilize second-moment information to propose a spectral-clustering-based mechanism that can be better than surprisingly popular in some settings.

Setting

Agents are assigned a single task (e.g., “Have you ever texted while driving?”). Agents finish the task without any communication. Each agent i is asked to report her signal c_i (e.g., “No”) and her prediction \mathbf{p}_i of other people’s signals (e.g., “I think 10% of people will say yes and 90% will say no”).

Input: (c_i,\mathbf{p}_i)_i
Output: c
\mathbf{a}\in\mathbb{R}^{1\times C} where, for all c, a_c is the fraction of agents who answer c
\forall c,c':\ \Pr[c'|c]=\frac{\sum_i \mathbbm{1}(c_i=c)\,\mathbf{p}_i(c')}{\sum_i \mathbbm{1}(c_i=c)}
\forall c:\ \Pr[c]=\big(\sum_{c'}\frac{\Pr[c'|c]}{\Pr[c|c']}\big)^{-1};\quad \forall c,c':\ \Pr[c',c]=\Pr[c'|c]\Pr[c]
  /* constructing the first moment using the second moment */
\mathbf{v}\in\mathbb{R}^{1\times C} where, for all c, v_c:=\Pr[c]
  /* first moment */
Return \mathrm{argmax}_c\ \frac{a_c}{v_c}
Algorithm 5 Surprisingly popular (1st moment) [19]
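The following numpy sketch of Algorithm 5 is for illustration only (the function name is ours); it assumes every option is reported by at least one agent and that all estimated conditionals are positive, so the ratios are well defined:

```python
import numpy as np

def surprisingly_popular(reports, predictions, C):
    """reports: length-n array of signals in {0,...,C-1};
    predictions: (n, C) array, predictions[i] = agent i's forecast of others'
    answer distribution. Returns the aggregated answer (an index in {0,...,C-1})."""
    reports = np.asarray(reports)
    predictions = np.asarray(predictions, dtype=float)
    a = np.bincount(reports, minlength=C) / len(reports)       # answer shares a_c
    # cond[c, c'] = estimated Pr[c' | c]: mean prediction of agents reporting c
    cond = np.vstack([predictions[reports == c].mean(axis=0) for c in range(C)])
    # Pr[c] = ( sum_{c'} Pr[c'|c] / Pr[c|c'] )^{-1}
    prior = 1.0 / (cond / cond.T).sum(axis=1)
    return int(np.argmax(a / prior))            # most surprisingly popular answer

# Toy usage: three agents answer a binary question and report predictions.
reports = [0, 0, 1]
predictions = [[0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
print(surprisingly_popular(reports, predictions, C=2))   # prints 0 here
```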
Input: (c_i,\mathbf{p}_i)_i
Output: + or −
Compute the answer shares \mathbf{a}, the prior \Pr[c] for all c, and the joint distribution \Pr[c,c'] for all c,c' as in Algorithm 5
\mathbf{v}\in\mathbb{R}^{1\times C} where, for all c, v_c:=\Pr[c]
  /* first moment */
\mathbf{M}\in\mathbb{R}^{C\times C} where, for all c,c', M_{c,c'}=\Pr[c,c']
  /* second moment */
\mathbf{Cov}=\mathbf{M}-\mathbf{v}^{\top}\mathbf{v}
Compute the first eigenvector \mathbf{e}^* of \mathbf{Cov}
Return the sign of the inner product between \mathbf{e}^* and \mathbf{a}-\mathbf{v}
Algorithm 6 Spectral truth serum (2nd moment)
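A corresponding sketch of Algorithm 6, under the same assumptions as the previous sketch; we symmetrize the empirical covariance before the eigendecomposition (a choice of ours, since the empirical \mathbf{M} need not be exactly symmetric), and the degenerate case of a zero inner product is not handled:

```python
import numpy as np

def spectral_truth_serum(reports, predictions, C):
    """Returns +1 or -1: the sign of <e*, a - v>, where e* is the top
    eigenvector of the empirical covariance Cov = M - v^T v."""
    reports = np.asarray(reports)
    predictions = np.asarray(predictions, dtype=float)
    a = np.bincount(reports, minlength=C) / len(reports)       # answer shares
    cond = np.vstack([predictions[reports == c].mean(axis=0) for c in range(C)])
    v = 1.0 / (cond / cond.T).sum(axis=1)                      # prior (1st moment)
    M = cond * v[:, None]                                      # M[c, c'] = Pr[c, c']
    cov = M - np.outer(v, v)                                   # covariance matrix
    # top eigenvector of the (symmetrized) covariance matrix
    _, vecs = np.linalg.eigh((cov + cov.T) / 2)
    e_star = vecs[:, -1]
    return int(np.sign(e_star @ (a - v)))
```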
Figure 9: Surprisingly popular vs. spectral truth serum: the black dashed lines are the partition boundaries. In the right figure, the grey dashed line is \mathbf{e}^*. If the ground-truth clustering looks more like the left case, the surprisingly popular method is better; if it looks more like the right case (two worlds, spherical Gaussians), the spectral truth serum is better.
Lemma 4.1.

[3] The first eigenvector \mathbf{e}^* of the covariance matrix of a mixture of two spherical Gaussians points along the direction through the two means.

The above lemma shows that if the two Gaussians are sufficiently separated, we can project all points onto \mathbf{e}^* and use the overall mean \mathbf{v} as the separator. This leads to the spectral truth serum. Thus, in scenarios where the ground-truth clustering looks like a well-separated mixture of two spherical Gaussians, the spectral truth serum is better than surprisingly popular. Intuitively, the spectral truth serum is applicable to settings where there are only two underlying ground-truth states (e.g., fake news or not) and we elicit many signals from the crowds. The eigenvector of the covariance matrix tells us which signals are important and which are not. We then project people’s feedback onto the eigenvector, the important direction, and compare it to the average.
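As a quick numerical illustration of Lemma 4.1 (with arbitrarily chosen means, variance, and sample size), the top eigenvector of the sample covariance of a well-separated two-Gaussian mixture is nearly parallel to the difference of the two means:

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, sigma, n = np.array([0.0, 0.0]), np.array([5.0, 5.0]), 1.0, 5000
X = np.vstack([rng.normal(mu1, sigma, size=(n, 2)),     # component 1
               rng.normal(mu2, sigma, size=(n, 2))])    # component 2

_, vecs = np.linalg.eigh(np.cov(X.T))     # eigenvectors of the sample covariance
e_star = vecs[:, -1]                      # top eigenvector
direction = (mu2 - mu1) / np.linalg.norm(mu2 - mu1)
print(abs(e_star @ direction))            # close to 1: e* points through the means
```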

Hsu and Kakade [11] show that with third-order moments, a mixture of k spherical Gaussians can be learned. A future direction is to use a hypothetical question (e.g., “would you update your prediction of other people’s answers if you saw someone answer A?”) to elicit a third-order moment and obtain better information aggregation results.

5 Conclusion and Discussion

We provide a new clustering perspective on information elicitation. With this perspective, we propose a new information elicitation mechanism that is not only dominantly truthful but also aggregates information robustly when there is a large number of people in the multi-task setting. We also propose a spectral-method-based information aggregation mechanism in the single-task setting. Of independent interest, we propose a novel affine-invariant clustering method, DMI-clustering.

The geometric interpretation of DMI-clustering can be easily extended to any other k\leq d. For example, in a three-dimensional space, when we pick k=2, the clustering goal is to find two clusters such that the distance between their means, multiplied by the product of the cluster sizes, is maximized (see the brute-force sketch after this paragraph). Picking other values of k loses the affine-invariance property, but we can still explore their properties in the future.
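The following brute-force sketch (our own helper, exponential in the number of points and only meant to make the k=2 objective precise) enumerates all 2-partitions and maximizes the distance between the cluster means times the product of the cluster sizes:

```python
import itertools
import numpy as np

def best_two_partition(points):
    """Brute force over all 2-partitions: maximize
    ||mean(S1) - mean(S2)|| * |S1| * |S2|."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    best_score, best_mask = -np.inf, None
    for size in range(1, n // 2 + 1):
        for S1 in itertools.combinations(range(n), size):
            mask = np.zeros(n, dtype=bool)
            mask[list(S1)] = True
            gap = np.linalg.norm(points[mask].mean(axis=0) - points[~mask].mean(axis=0))
            score = gap * mask.sum() * (~mask).sum()
            if score > best_score:
                best_score, best_mask = score, mask
    return best_score, best_mask

points = np.random.default_rng(1).random((8, 3))     # 8 points in 3 dimensions
score, mask = best_two_partition(points)
print(score, mask.astype(int))
```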

Another interesting future direction is to apply other clustering techniques, such as non-linear ones (e.g., kernel methods), to information elicitation. This work focuses on the discrete setting; we can also consider the continuous setting or study new information elicitation scenarios that provide new settings for clustering. Empirical validation is also an important future direction.

References

  • Begelfor and Werman [2006] Evgeni Begelfor and Michael Werman. 2006. Affine invariance revisited. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2. IEEE, 2087–2094.
  • Blum et al. [2016] Avrim Blum, John Hopcroft, and Ravindran Kannan. 2016. Foundations of data science. Vorabversion eines Lehrbuchs 5 (2016), 5.
  • Blumer et al. [1989] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. 1989. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM) 36, 4 (1989), 929–965.
  • Brubaker and Vempala [2008] S Charles Brubaker and Santosh S Vempala. 2008. Isotropic PCA and affine-invariant clustering. In Building Bridges. Springer, 241–281.
  • Budescu and Chen [2015] David V Budescu and Eva Chen. 2015. Identifying expertise to extract the wisdom of crowds. Management Science 61, 2 (2015), 267–280.
  • Cichocki et al. [2015] Andrzej Cichocki, Sergio Cruces, and Shun-ichi Amari. 2015. Log-determinant divergences revisited: Alpha-beta and gamma log-det divergences. Entropy 17, 5 (2015), 2988–3034.
  • Dasgupta and Ghosh [2013] Anirban Dasgupta and Arpita Ghosh. 2013. Crowdsourced judgement elicitation with endogenous proficiency. In Proceedings of the 22nd international conference on World Wide Web. International World Wide Web Conferences Steering Committee, 319–330.
  • Dawid and Skene [1979] Alexander Philip Dawid and Allan M Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics) 28, 1 (1979), 20–28.
  • Fitzgibbon and Zisserman [2002] Andrew Fitzgibbon and Andrew Zisserman. 2002. On affine invariant clustering and automatic cast listing in movies. In European Conference on Computer Vision. Springer, 304–320.
  • Hsu and Kakade [2013] Daniel Hsu and Sham M Kakade. 2013. Learning mixtures of spherical gaussians: moment methods and spectral decompositions. In Proceedings of the 4th conference on Innovations in Theoretical Computer Science. 11–20.
  • Huang and Yang [2020] Hsin-Hsiung Huang and Jie Yang. 2020. Affine-transformation invariant clustering models. Journal of Statistical Distributions and Applications 7, 1 (2020), 1–24.
  • Kong [2020] Yuqing Kong. 2020. Dominantly Truthful Multi-task Peer Prediction with a Constant Number of Tasks. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SIAM, 2398–2411.
  • Kong and Schoenebeck [2019] Yuqing Kong and Grant Schoenebeck. 2019. An Information Theoretic Framework For Designing Information Elicitation Mechanisms That Reward Truth-telling. ACM Trans. Econ. Comput. 7, 1, Article 2 (Jan. 2019), 33 pages. https://doi.org/10.1145/3296670
  • Liu et al. [2020] Y. Liu, J. Wang, and Y. Chen. 2020. Surrogate Scoring Rules. In EC ’20: The 21st ACM Conference on Economics and Computation.
  • Lloyd [1982] Stuart Lloyd. 1982. Least squares quantization in PCM. IEEE transactions on information theory 28, 2 (1982), 129–137.
  • Miller et al. [2005] N. Miller, P. Resnick, and R. Zeckhauser. 2005. Eliciting informative feedback: The peer-prediction method. Management Science (2005), 1359–1373.
  • Prelec [2004] D. Prelec. 2004. A Bayesian Truth Serum for subjective data. Science 306, 5695 (2004), 462–466.
  • Prelec et al. [2017] Dražen Prelec, H Sebastian Seung, and John McCoy. 2017. A solution to the single-question crowd wisdom problem. Nature 541, 7638 (2017), 532–535.
  • Shnayder et al. [2016] Victor Shnayder, Arpit Agarwal, Rafael Frongillo, and David C Parkes. 2016. Informed truthfulness in multi-task peer prediction. In Proceedings of the 2016 ACM Conference on Economics and Computation. ACM, 179–196.
  • Thurau et al. [2010] Christian Thurau, Kristian Kersting, and Christian Bauckhage. 2010. Yes we can: simplex volume maximization for descriptive web-scale matrix factorization. In Proceedings of the 19th ACM international conference on information and knowledge management. 1785–1788.
  • Weller [2007] Susan C Weller. 2007. Cultural consensus theory: Applications and frequently asked questions. Field methods 19, 4 (2007), 339–368.
  • Werman and Weinshall [1995] Michael Werman and Daphna Weinshall. 1995. Similarity and affine invariant distances between 2D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence 17, 8 (1995), 810–814.
  • Xu et al. [2019] Yilun Xu, Peng Cao, Yuqing Kong, and Yizhou Wang. 2019. L_DMI: A Novel Information-theoretic Loss Function for Training Deep Nets Robust to Label Noise. In Advances in Neural Information Processing Systems. 6222–6233.

Appendix A Synthetic Data Used for Demonstration

Affine-invariance in Figure 3

\mathbf{A}=\begin{bmatrix}0.96536243&0.83582504\\ 0.02223537&0.97962069\\ 0.18576474&0.09306992\\ 0.60073919&0.06909198\\ 0.21115965&0.04303247\\ 0.24518684&0.46305449\\ 0.0045001&0.73335878\\ \end{bmatrix}\quad\mathbf{T}=\begin{bmatrix}0.7384872&0.39051911\\ 0.75115812&0.90684574\\ \end{bmatrix}\quad\mathbf{b}=[.05,.02]

An empirical run of k-cofactors in Figure 5

\mathbf{A}=\begin{bmatrix}0.32786192&0.75198752\\ 0.24789876&0.40110656\\ 0.64860833&0.05877561\\ 0.13688932&0.89837639\\ 0.51958647&0.69553312\\ 0.57106034&0.39534018\\ 0.25602628&0.5299542\\ 0.66521762&0.25370157\\ 0.49647253&0.51707767\\ 0.00649304&0.78141629\\ 0.93642378&0.00550147\\ 0.08014735&0.94850578\\ 0.54978178&0.66296321\\ 0.57201864&0.7952191\\ 0.11499522&0.59186407\\ 0.11681198&0.90701099\\ 0.01623413&0.59313684\\ 0.31841007&0.43924738\\ 0.60377183&0.96489354\\ 0.48738826&0.17097179\\ 0.49460288&0.81759015\\ 0.27831212&0.76869342\\ 0.22953897&0.43809347\\ 0.03654411&0.6203217\\ 0.90628651&0.45924219\\ 0.92525839&0.69213278\\ 0.17723574&0.12487727\\ 0.21277346&0.34931579\\ 0.84758762&0.05452689\\ 0.65204294&0.82808225\\ \end{bmatrix}

DMI-clustering in Figure 8

\mathbf{A}=\begin{bmatrix}0.20727033&0.56209307&0.2306366\\ 0.55235357&0.26311054&0.18453589\\ 0.06826729&0.51504916&0.41668355\\ 0.40863481&0.45383908&0.13752611\\ 0.30463115&0.19875226&0.49661659\\ 0.40387463&0.1261351&0.46999028\\ 0.30097293&0.32668147&0.3723456\\ 0.0530016&0.41863456&0.52836383\\ 0.53632014&0.08514494&0.37853491\\ 0.34180366&0.57497414&0.08322221\\ 0.20603456&0.67360041&0.12036503\\ 0.35331477&0.28461216&0.36207308\\ 0.23868633&0.38453375&0.37677992\\ 0.098419&0.67327932&0.22830168\\ 0.33219443&0.0159078&0.65189777\\ 0.40376774&0.39908311&0.19714915\\ 0.08966334&0.40876422&0.50157244\\ 0.51224362&0.01623191&0.47152447\\ 0.04907822&0.30059226&0.65032952\\ 0.50445751&0.20957023&0.28597227\\ \end{bmatrix}