
A local geometry of hyperedges in hypergraphs, and its applications to social networks

Dong Quan Ngoc Nguyen and Lin Xing Department of Applied and Computational Mathematics and Statistics
University of Notre Dame
Notre Dame, Indiana 46556, USA
[email protected] http://nd.edu/ dnguye15 [email protected]
(Date: September 29, 2020)
Abstract.

In many real world datasets arising from social networks, there are hidden higher order relations among data points which cannot be captured using graph modeling. It is natural to use the more general notion of hypergraphs to model such social networks. In this paper, we introduce a new local geometry of hyperedges in hypergraphs which allows us to capture higher order relations among data points. Furthermore, based on this new geometry, we also introduce a new methodology, the nearest neighbors method in hypergraphs, for analyzing datasets arising from sociology.

1. Introduction

One of the challenges in the modern age is to classify data arising from many sources; for example, following the rapid developments of several areas in mathematics, the large number of publications in mathematics creates a tremendous amount of data, which carries useful information such as relationships (or collaborations) among authors and their publications, and their influences on the development of mathematics. Analyzing such data is often not straightforward, and is a very difficult task because of the extremely fast growth of relations among data, and of the data itself.

One of the main problems in machine learning is the weight (respectively, label) prediction problem, in which each instance is associated with a set of weights (respectively, labels), and the aim is to propose predictive models for predicting weights (respectively, labels) of unobserved data. These learning problems have important applications in many areas such as protein function classification [4], text categorization [17], and semantic scene classification [2], to cite a few. For networks which can be modeled as graphs, many approaches to the prediction problem have been proposed (see, for example, [7], [5], [6], [13], [14], [15]).

In real world datasets, for example, in social networks, it often occurs that higher order relations among data points are present which cannot be captured by graph modeling, since graphs only encode binary relations. So it is natural to consider the weight (respectively, label) prediction problem for a generalization of graphs: hypergraphs. Recall that a hypergraph $X$ is a pair $(\mathcal{V}(X),\mathcal{E}(X))$, where $\mathcal{V}(X)$ is the set of data points (called vertices of $X$), and $\mathcal{E}(X)$ is a subset of the power set of $\mathcal{V}(X)$ which represents higher order relations among data points. Each element in $\mathcal{E}(X)$ is called a hyperedge. Since each hyperedge can consist of a collection of data points, it captures the higher order correlation information among these data points. Recent trends in learning problems have focused on hypergraph datasets (see, for example, [1], [16]).

In this paper, we propose a geometric approach to the weight (respectively, label) prediction problem on hypergraph data, which we then apply to analyzing networks arising from sociology. Traditionally, for datasets which can be modeled as graphs, there is a standard method, $k$ nearest neighbors, which applies to the weight (respectively, label) prediction problem. The classical $k$ nearest neighbors method relies on Euclidean metrics. In contrast with many traditional geometric approaches based on Euclidean metrics (or distances), we propose a new methodology of nearest neighbors in hypergraphs, using metrics modulo equivalence relations, a weaker notion than that of usual metrics, but one which is natural for learning from data. An equivalence relation signifies the correlation among data points, and a metric modulo such an equivalence relation signifies how different data points are with respect to such correlation information. In this work, we suggest that equivalence relations are a natural mathematical notion for modeling correlation information in datasets, and that metrics modulo equivalence relations are more suitable for learning correlation in datasets.

The structure of our paper is as follows. In Subsection 2.1, we introduce a notion of metric spaces with respect to equivalence relations. In Subsection 2.2, based on this notion, we propose our methodology for studying the weight (respectively, label) prediction problem. In Subsection 2.3, we construct a class of metrics modulo equivalence relations on the set of hyperedges of a hypergraph which can be combined with the methodology proposed in Subsection 2.2 for studying prediction problems. In Section 3, we apply our methodology to real world datasets from sociology, and report our experimental results.

2. A class of metrics modulo certain equivalence relations for the set of hyperedges of a hypergraph, and methodologies

In the weight (respectively, label) prediction problem, one is given a dataset consisting of data points $x_{1},\ldots,x_{n}$ such that there is a subset, say $\{x_{i_{1}},\ldots,x_{i_{k}}\}$, each of whose elements is associated to a weight (respectively, label). The aim is to propose a predictive model for predicting the weight (respectively, label) of each element in the set $\{x_{1},\ldots,x_{n}\}\setminus\{x_{i_{1}},\ldots,x_{i_{k}}\}$. In this paper, we propose a new approach, using insights from metric geometry, to the weight (respectively, label) prediction problem for datasets which can be modeled as hypergraphs.

A hypergraph $X$ is a pair $(\mathcal{V}(X),\mathcal{E}(X))$, where $\mathcal{V}(X)$ is the set of points, called vertices of $X$, and $\mathcal{E}(X)$ is a subset of the power set of $\mathcal{V}(X)$, each of whose elements is called a hyperedge of $X$. Each hyperedge $\mathfrak{e}$ is a subset of $\mathcal{V}(X)$. If $\mathfrak{e}$ has exactly $m$ elements, $\mathfrak{e}$ is called an $m$-hyperedge of $X$. In this paper, we are interested in the weight (respectively, label) prediction problem for hypergraphs, and so the following notion is natural for modeling such datasets.
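For concreteness, a hypergraph in this sense translates directly into code. The following Python sketch is our own illustration (not code from the paper); the names `vertices`, `hyperedges`, and `is_m_hyperedge` are ours, and hyperedges are stored as frozensets of vertices:

```python
# Illustrative sketch: a hypergraph as a vertex set together with
# hyperedges stored as frozensets of vertices.
vertices = {1, 2, 3, 4, 5}
hyperedges = [frozenset({1, 2, 3}), frozenset({3, 4}), frozenset({2, 4, 5})]

def is_m_hyperedge(e, m):
    """A hyperedge with exactly m elements is an m-hyperedge."""
    return len(e) == m

print(is_m_hyperedge(hyperedges[0], 3))  # -> True (a 3-hyperedge)
```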

Definition 2.1.

(incomplete hypergraphs)

Let $X=(\mathcal{V}(X),\mathcal{E}(X))$ be a hypergraph, and let $A$ be either an interval $[a,b]$ in $\mathbb{R}$ or a set of labels $\{1,\ldots,q\}$ for some positive integer $q$. $X$ is called an incomplete hypergraph with respect to $A$ if there exists a map $F:\mathcal{E}_{0}(X)\to A$ for some subset $\mathcal{E}_{0}(X)$ of $\mathcal{E}(X)$.

Remark 2.2.

In this paper, we consider two types of incomplete hypergraphs which are specified in the following.

  • (i)

    In the weight prediction problem, $A$ is an interval $[a,b]\subset\mathbb{R}$, and a subset $\mathcal{E}_{0}(X)$ of the set of hyperedges in $X$ is given such that there is a map $W:\mathcal{E}_{0}(X)\to[a,b]$. In this case $W$ is the map $F$ in Definition 2.1. Each value $W(\mathfrak{e})$ for $\mathfrak{e}\in\mathcal{E}_{0}(X)$ is called the weight of $\mathfrak{e}$. The weight prediction problem asks for a predictive model $\overline{W}$ such that $\overline{W}$ is a map from $\mathcal{E}(X)$ to $[a,b]$ for which

    \overline{W}(\mathfrak{e}_{0})=W(\mathfrak{e}_{0})

    for every $\mathfrak{e}_{0}\in\mathcal{E}_{0}(X)$.

  • (ii)

    In the label prediction problem, $A$ is a set of labels $\{1,2,\ldots,q\}$ for some positive integer $q$, and a subset $\mathcal{E}_{0}(X)$ of the set of hyperedges in $X$ is given such that there is a map $L:\mathcal{E}_{0}(X)\to\{1,\ldots,q\}$. In this case $L$ is the map $F$ in Definition 2.1. Each value $L(\mathfrak{e})$ for $\mathfrak{e}\in\mathcal{E}_{0}(X)$ is called the label of $\mathfrak{e}$. The label prediction problem asks for a predictive model $\overline{L}$ such that $\overline{L}$ is a map from $\mathcal{E}(X)$ to $\{1,\ldots,q\}$ for which

    \overline{L}(\mathfrak{e}_{0})=L(\mathfrak{e}_{0})

    for every $\mathfrak{e}_{0}\in\mathcal{E}_{0}(X)$.

In the classical weight (respectively, label) prediction problem for datasets which can be modeled as graphs, one can apply the classical $k$ nearest neighbors method. The main idea of this method is that the dataset $X=\{x_{1},\ldots,x_{n}\}$ is equipped with a Euclidean metric (or distance), say $d$. For each $x\in X$, let $x_{i_{1}},\ldots,x_{i_{k}}$ be the $k$ nearest points in $X$ to $x$ with respect to the metric $d$, so that

d(x,x_{i_{1}})\leq d(x,x_{i_{2}})\leq\cdots\leq d(x,x_{i_{k}}).

The $k$ nearest neighbors method then expects that the weight (respectively, label) of $x$ should be in proximity with those of the $x_{i_{m}}$ for $1\leq m\leq k$.
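For comparison with the modified methods introduced below, the classical Euclidean kNN rule can be sketched as follows. This is our own toy illustration, not code from our experiments; the name `knn_predict_label` and the data are assumptions:

```python
import math
from collections import Counter

def knn_predict_label(points, labels, x, k):
    """Classical kNN: pick the k points nearest to x in Euclidean
    distance and return the most common label among them."""
    order = sorted(range(len(points)), key=lambda i: math.dist(points[i], x))
    nearest = order[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Toy example: two clusters of 1-dimensional points.
points = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]
labels = [1, 1, 1, 2, 2]
print(knn_predict_label(points, labels, (0.15,), k=3))  # -> 1
```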

Our main contribution in this paper is to introduce modified $k$ nearest neighbors methods for learning hypergraphs, especially for the weight (respectively, label) prediction problem. In order to do that, we introduce a class of metrics modulo equivalence relations on the set of hyperedges in a hypergraph. Note that in contrast with the usual notion of a metric, a metric $d$ modulo an equivalence relation $\cong$ allows $d(\mathfrak{e},\mathfrak{f})=0$ as long as $\mathfrak{e}\cong\mathfrak{f}$ (which means that $\mathfrak{e}$ is equivalent to $\mathfrak{f}$ with respect to certain properties, for example, similar weights or labels). Thus such a metric is more suitable for studying the weight (respectively, label) prediction problem since two distinct points in a dataset may have the same weight (respectively, label). In this case, the equivalence relation $\cong$ signifies the relation that certain distinct points in datasets share similar properties with respect to the weight (respectively, label) map, and $d(\mathfrak{e},\mathfrak{f})=0$ encodes the information that the points $\mathfrak{e},\mathfrak{f}$ are expected to have the same weight (respectively, label).

We first introduce a notion of metric spaces modulo equivalence relations which is used in many places in this paper.

2.1. Metric spaces modulo equivalence relations

We present in this subsection a notion of metric spaces modulo equivalence relations. We also explain why this notion is suitable for applying to geometrical structures of hypergraphs. We first recall a notion of equivalence relations on sets.

Definition 2.3.

(Equivalence relation)

Let $X$ be a set. An equivalence relation, denoted by $\cong$, on $X$ is a subset of $X\times X$ such that the following are true:

  • (i)

    (Reflexivity) $(a,a)\in\;\cong$ for every $a\in X$.

  • (ii)

    (Symmetry) $(a,b)\in\;\cong$ if and only if $(b,a)\in\;\cong$.

  • (iii)

    (Transitivity) if $(a,b)\in\;\cong$ and $(b,c)\in\;\cong$, then $(a,c)\in\;\cong$.

When $(a,b)\in\;\cong$, we say that $a$ is $\cong$-equivalent to $b$. Throughout this paper, in order to signify this relation, we write $a\cong b$ whenever $(a,b)\in\;\cong$.

An equivalence relation $\cong$ on a set provides a way to identify similar elements in the set. Equivalently, if one can find a measurement of how similar elements of a set are, then one can use this measurement to introduce an equivalence relation on the set.

Reflexivity in Definition 2.3 says that an element of $X$ should naturally be considered similar to itself. If an element $a$ is similar to an element $b$, then $b$ should naturally be considered similar to $a$, which is exactly the symmetry condition in Definition 2.3. For the last condition, transitivity, if an element $a$ is similar to an element $b$ and $b$ is similar to an element $c$, it is natural to view $a$ as similar to $c$.

An equivalence relation is a weaker notion than the identity relation, and it is more suitable and natural for studying hypergraph data from the metric-geometry viewpoint. The main aim and intuition behind our paper is that in order to study questions in machine learning on a hypergraph $X$, it is not necessary to construct a metric on the hypergraph, i.e., a mapping $d:X\times X\to\mathbb{R}_{\geq 0}$ which satisfies conditions similar to those of the usual absolute value on $\mathbb{R}$, for example, $d(x,y)=0$ if and only if $x=y$. Instead one only needs to construct a metric up to an equivalence relation $\cong$, i.e., a mapping $d:X\times X\to\mathbb{R}_{\geq 0}$ which satisfies all the conditions of a metric except that the identity relation is replaced by the weaker condition that $d(x,y)=0$ if and only if $x\cong y$. It turns out that such a metric exists naturally in a very general class of hypergraphs, and can be exploited to introduce several modified methodologies in machine learning which apply to the weight prediction problem and the multi-label classification problem. In particular, in this paper we introduce a class of metrics for the set of hyperedges of a hypergraph which will be used to provide a modified kNN method for predicting labels of hyperedges. To the best of the authors' knowledge, this paper is the first one to realize the existence of such metrics for the set of hyperedges in a hypergraph. In order to obtain such metrics, we introduce a notion of neighbors of hyperedges in a hypergraph which captures a topological feature of the hyperedges. It seems to the authors that this notion of neighbors of hyperedges has never been introduced before in the literature.

We now define the notion of a metric modulo an equivalence relation on a set.

Definition 2.4.

(Metric modulo an equivalence relation)

Let $X$ be a set, and $\cong$ an equivalence relation on $X$. A mapping $d:X\times X\to\mathbb{R}$ is said to be a metric on $X$ modulo the equivalence relation $\cong$ if the following conditions are satisfied:

  • (i)

    $d(a,b)\geq 0$ for all $a,b\in X$.

  • (ii)

    $d(a,b)=0$ if and only if $a\cong b$.

  • (iii)

    (Symmetry) $d(a,b)=d(b,a)$ for all $a,b\in X$.

  • (iv)

    (Triangle inequality) for any $a,b,c\in X$,

    d(a,b)\leq d(a,c)+d(c,b).

A metric modulo an equivalence relation $\cong$ on a set $X$ behaves almost like a metric. The only difference between the two notions is that in a metric, condition (ii) in Definition 2.4 is replaced by the stronger condition that $d(a,b)=0$ if and only if $a=b$. But in studying some properties, say $(P_{i})_{i}$, associated to a given dataset, where each $P_{i}$ denotes a property of interest, it is often the case that distinct elements of the dataset share exactly the same properties $(P_{i})_{i}$. In this case, it is natural to view the distance between these distinct elements as zero, since they are considered equivalent with respect to the properties $(P_{i})_{i}$. Note that if one uses a usual metric on this dataset, then one cannot identify similarities among distinct elements sharing the same properties. Thus it is more natural to use a metric modulo an equivalence relation to study the structure of the dataset related to a given set of properties. From this point of view, one can apply this idea to a variety of problems such as weight prediction or multi-label classification, as shown in this paper.
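As a toy illustration of Definition 2.4 (our own example, not a construction from the paper): if a map $f$ assigns to each element a property value, then $d(x,y)=|f(x)-f(y)|$ is a metric modulo the relation $x\cong y$ if and only if $f(x)=f(y)$. The sketch below checks the four axioms exhaustively on a small set:

```python
import itertools

# Hypothetical ground set and a feature shared by "equivalent" elements.
X = ["a", "b", "c", "d"]
f = {"a": 0, "b": 0, "c": 2, "d": 5}          # a and b share the same feature

def d(x, y):
    """d(x, y) = |f(x) - f(y)| is a metric on X modulo the relation
    x ~ y iff f(x) = f(y)."""
    return abs(f[x] - f[y])

# The axioms of Definition 2.4, checked exhaustively on the toy set.
for x, y, z in itertools.product(X, repeat=3):
    assert d(x, y) >= 0                        # (i)   non-negativity
    assert (d(x, y) == 0) == (f[x] == f[y])    # (ii)  zero iff equivalent
    assert d(x, y) == d(y, x)                  # (iii) symmetry
    assert d(x, y) <= d(x, z) + d(z, y)        # (iv)  triangle inequality

print(d("a", "b"))  # -> 0, although a and b are distinct elements
```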

2.2. Methodology

In this subsection, we introduce our methodology for the weight (respectively, label) prediction problem for hypergraph datasets. Throughout this subsection, let $X=(\mathcal{V}(X),\mathcal{E}(X))$ be a hypergraph. The main contributions of our work are the constructions of several special metric structures on the set of hyperedges $\mathcal{E}(X)$. The main motivation for equipping $\mathcal{E}(X)$ with such metric structures is that we want to modify the classical $k$ nearest neighbors (kNN) method (which only works if the underlying space is Euclidean) and apply it to the weight (respectively, label) prediction problem for elements of $\mathcal{E}(X)$. In this work, we introduce two different paths toward this modification of the classical kNN method: the first, called modified kNN, directly uses the metrics we construct on $\mathcal{E}(X)$; the second, called embedded kNN, does not use the metrics, but only the local geometry around each hyperedge in $\mathcal{E}(X)$, which will also be introduced in this work. In the following, we describe our modified and embedded kNN methods in detail.

2.2.1. Modified kNN methods

Assume that there is a metric $d$ on $\mathcal{E}(X)$ modulo an equivalence relation $\cong$.

\star Weight Prediction Problem.

In the weight prediction problem, we assume that there is a subset $\mathcal{E}_{0}(X)$ of $\mathcal{E}(X)$ such that there is a map $W:\mathcal{E}_{0}(X)\to[a,b]$ for some interval $[a,b]$ in $\mathbb{R}$. The value $W(\mathfrak{e})$ is called the weight of $\mathfrak{e}$. The aim of the weight prediction problem is to predict the possible values of $W(\mathfrak{e})$ for $\mathfrak{e}\in\mathcal{E}(X)\setminus\mathcal{E}_{0}(X)$. We propose a modified kNN method for predicting weights of elements in $\mathcal{E}(X)\setminus\mathcal{E}_{0}(X)$ as follows.

Fix a positive integer $k\geq 1$. Take an arbitrary element $\mathfrak{e}\in\mathcal{E}(X)\setminus\mathcal{E}_{0}(X)$. Let $\mathrm{DIS}(\mathfrak{e})$ denote the set of all distances from $\mathfrak{e}$ to elements in $\mathcal{E}_{0}(X)$, i.e.,

\mathrm{DIS}(\mathfrak{e})=\{d(\mathfrak{e},\mathfrak{e}_{0})\;|\;\mathfrak{e}_{0}\in\mathcal{E}_{0}(X)\}.

Note that the set $\mathrm{DIS}(\mathfrak{e})$ consists of finitely many distinct nonnegative real numbers, say $d_{1}(\mathfrak{e}),\ldots,d_{r_{\mathfrak{e}}}(\mathfrak{e})$ for some positive integer $r_{\mathfrak{e}}\geq 1$, such that

d_{1}(\mathfrak{e})<d_{2}(\mathfrak{e})<\cdots<d_{r_{\mathfrak{e}}}(\mathfrak{e}).

In experimental analysis, we always choose $k$ such that $k\leq r_{\mathfrak{e}}$ for every $\mathfrak{e}\in\mathcal{E}(X)\setminus\mathcal{E}_{0}(X)$.

We choose the $k$ smallest values in $\mathrm{DIS}(\mathfrak{e})$, say $d_{1}(\mathfrak{e})<d_{2}(\mathfrak{e})<\cdots<d_{k}(\mathfrak{e})$, and let $\mathrm{kNN}(\mathfrak{e})$ denote the set of all elements $\mathfrak{e}_{0}\in\mathcal{E}_{0}(X)$ such that there exists an integer $\ell_{\mathfrak{e}_{0}}$ with $1\leq\ell_{\mathfrak{e}_{0}}\leq k$ for which $d_{\ell_{\mathfrak{e}_{0}}}(\mathfrak{e})=d(\mathfrak{e},\mathfrak{e}_{0})$, i.e.,

(1) \mathrm{kNN}(\mathfrak{e})=\{\mathfrak{e}_{0}\in\mathcal{E}_{0}(X)\;|\;d(\mathfrak{e},\mathfrak{e}_{0})=d_{\ell_{\mathfrak{e}_{0}}}(\mathfrak{e})\text{ for some }1\leq\ell_{\mathfrak{e}_{0}}\leq k\}.

We propose a predictive model $\overline{W}$ for $W$ as follows:

\overline{W}(\mathfrak{e})=\dfrac{\sum_{\mathfrak{e}_{0}\in\mathrm{kNN}(\mathfrak{e})}W(\mathfrak{e}_{0})}{\mathrm{card}(\mathrm{kNN}(\mathfrak{e}))},

where $\mathrm{card}(\mathrm{kNN}(\mathfrak{e}))$ denotes the number of elements in $\mathrm{kNN}(\mathfrak{e})$.
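The modified kNN rule above can be sketched as follows. This is a minimal illustration under our own assumptions: the toy data and the name `modified_knn_weight` are ours, and the toy metric `d` (difference of hyperedge sizes) is not one of the metrics constructed in Subsection 2.3:

```python
def modified_knn_weight(e, observed, W, d, k):
    """Modified kNN of Subsection 2.2.1: 'observed' plays the role of
    E_0(X), W maps observed hyperedges to weights, and d is a metric
    modulo an equivalence relation on hyperedges."""
    # DIS(e): the distinct distances from e to the observed hyperedges.
    dis = sorted({d(e, e0) for e0 in observed})
    cutoff = set(dis[:k])                      # the k smallest distinct values
    knn = [e0 for e0 in observed if d(e, e0) in cutoff]
    return sum(W[e0] for e0 in knn) / len(knn)

# Hypothetical toy data: hyperedges as frozensets.
observed = [frozenset({1, 2}), frozenset({2, 3}), frozenset({1, 2, 3, 4})]
W = {observed[0]: 0.2, observed[1]: 0.4, observed[2]: 0.9}
d = lambda e, f: abs(len(e) - len(f))          # toy metric for illustration
# Average weight over the hyperedges at the single smallest distinct distance.
print(modified_knn_weight(frozenset({4, 5}), observed, W, d, k=1))
```

Note that, exactly as in (1), all observed hyperedges realizing one of the $k$ smallest distinct distances belong to $\mathrm{kNN}(\mathfrak{e})$, so the set may contain more than $k$ elements.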

In experimental analysis, we perform the modified kNN method introduced above with $d$ replaced by the metrics modulo certain equivalence relations which we construct on the set of hyperedges $\mathcal{E}(X)$ in Subsection 2.3. In Table 1, we list the metrics used in the modified kNN methods.

Table 1. Metrics modulo certain equivalence relations used in modified kNN methods
The metric $d$
$\mathcal{D}_{h,\mathcal{B},+\infty}$ defined in (2.11)
$\mathcal{D}_{h,\mathcal{B},\epsilon}$ defined in (2.13)

\star Label Prediction Problem

In the label prediction problem, we assume that there is a subset, say $\mathcal{E}_{0}(X)$, of $\mathcal{E}(X)$ such that there is a map $L:\mathcal{E}_{0}(X)\to\{1,\ldots,q\}$ for some integer $q\geq 2$. The value $L(\mathfrak{e}_{0})$ is called the label of $\mathfrak{e}_{0}$. The aim of the label prediction problem is to predict the possible values of $L(\mathfrak{e})$ for $\mathfrak{e}\in\mathcal{E}(X)\setminus\mathcal{E}_{0}(X)$. We propose a modified kNN method for predicting labels of elements in $\mathcal{E}(X)\setminus\mathcal{E}_{0}(X)$ as follows.

For each $\mathfrak{e}\in\mathcal{E}(X)\setminus\mathcal{E}_{0}(X)$, we define the set $\mathrm{kNN}(\mathfrak{e})$ as in (1).

We propose a predictive model $\overline{L}$ for $L$ as follows. Let $\mathcal{L}(\mathrm{kNN})(\mathfrak{e})$ be the multiset of labels $L(\mathfrak{e}_{0})$ for $\mathfrak{e}_{0}\in\mathrm{kNN}(\mathfrak{e})$. If there is a mode, say $L(\mathfrak{e}_{0})$, in $\mathcal{L}(\mathrm{kNN})(\mathfrak{e})$ for some $\mathfrak{e}_{0}\in\mathrm{kNN}(\mathfrak{e})$, we set $\overline{L}(\mathfrak{e})=L(\mathfrak{e}_{0})$; otherwise we let $\overline{L}(\mathfrak{e})$ be the closest integer to the average $\dfrac{\sum_{\mathfrak{e}_{0}\in\mathrm{kNN}(\mathfrak{e})}L(\mathfrak{e}_{0})}{\mathrm{card}(\mathrm{kNN}(\mathfrak{e}))}$.
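This label rule can be sketched as follows (our illustration; `predict_label` is a hypothetical name, "mode" is read as a label with strictly largest multiplicity, and Python's `round` stands in for "closest integer"):

```python
from collections import Counter

def predict_label(neighbor_labels):
    """Label rule of Subsection 2.2.1: return the mode of the labels in
    kNN(e) when one label has strictly largest multiplicity; otherwise
    fall back to the rounded average of the labels."""
    counts = Counter(neighbor_labels).most_common()
    if len(counts) == 1 or counts[0][1] > counts[1][1]:
        return counts[0][0]                    # a unique mode exists
    avg = sum(neighbor_labels) / len(neighbor_labels)
    return round(avg)

print(predict_label([2, 2, 3]))     # unique mode -> 2
print(predict_label([1, 1, 2, 2]))  # tie: falls back to the rounded average
```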

2.2.2. Embedded kNN methods

In this subsection, we introduce our embedded kNN methods. In the construction of each metric in this work, we need to introduce a local geometry at each element of the set $\mathcal{E}(X)$. The local geometry at an element needs to capture the local structure around that element within the set of hyperedges $\mathcal{E}(X)$. Thus the local geometry at a hyperedge gathers the local hypergraph structure around the hyperedge with respect to certain properties of hypergraphs that we want to investigate. In this work, we take the viewpoint that the local geometry at a hyperedge $\mathfrak{e}$ is a set of certain hyperedges in $\mathcal{E}(X)$ which we consider to share similar features with $\mathfrak{e}$ with respect to certain properties. We introduce two different types of local geometry on the set of hyperedges, which we call type I and type II neighborhoods of hyperedges, respectively.

Table 2 lists types of neighborhoods for hyperedges introduced in this paper.

Table 2. Types of neighborhoods used in embedded kNN methods
Types of neighborhoods
Type I neighborhoods $\mathcal{N}_{h,\mathcal{B},+\infty}$, defined in (2.10)
Type II neighborhoods $\mathcal{N}_{h,\mathcal{B},\epsilon}$, defined in (2.12)

We first consider the label prediction problem.

\star Label Prediction Problem

Let $\mathcal{E}_{0}(X)$ be a subset of $\mathcal{E}(X)$ for which each hyperedge in $\mathcal{E}_{0}(X)$ is equipped with a label in $\{1,\ldots,q\}$, i.e., there is a map $L:\mathcal{E}_{0}(X)\to\{1,\ldots,q\}$.

In this case, let $\mathcal{N}_{\text{edge}}$ denote either the type I neighborhood $\mathcal{N}_{h,\mathcal{B},+\infty}$ or the type II neighborhood $\mathcal{N}_{h,\mathcal{B},\epsilon}$ in Table 2. We define the map $T_{\mathcal{E}(X)}:\mathcal{E}(X)\to\mathbb{R}^{q}$ which represents each hyperedge as a point in $\mathbb{R}^{q}$ as follows. For each $\mathfrak{e}\in\mathcal{E}(X)$,

T_{\mathcal{E}(X)}(\mathfrak{e})=(\eta_{1}(\mathcal{N}_{\text{edge}}(\mathfrak{e})),\ldots,\eta_{q}(\mathcal{N}_{\text{edge}}(\mathfrak{e}))),

where for each $1\leq i\leq q$, $\eta_{i}(\mathcal{N}_{\text{edge}}(\mathfrak{e}))$ denotes the number of hyperedges $\mathfrak{f}\in\mathcal{N}_{\text{edge}}(\mathfrak{e})$ such that $L(\mathfrak{f})=i$. We call $T_{\mathcal{E}(X)}$ the feature map of $\mathcal{E}(X)$. Note that the map $T_{\mathcal{E}(X)}$ is well-defined since, by our notion of neighborhoods of hyperedges, each hyperedge $\mathfrak{f}$ in the neighborhood $\mathcal{N}_{\text{edge}}(\mathfrak{e})$ belongs to the set $\mathcal{E}_{0}(X)$, and thus one can associate a label $L(\mathfrak{f})$ to it.

Under the map $T_{\mathcal{E}(X)}$, the set of hyperedges $\mathcal{E}(X)$ can be represented as a subset, say $T_{\mathcal{E}(X)}(\mathcal{E}(X))$, of $\mathbb{R}^{q}$. For the label prediction problem, we view the label of each element $T_{\mathcal{E}(X)}(\mathfrak{e})\in\mathbb{R}^{q}$ for a given hyperedge $\mathfrak{e}$ as the same as that of $\mathfrak{e}\in\mathcal{E}(X)$. In order to predict the label of a hyperedge $\mathfrak{e}$, we perform the classical kNN method on the set $T_{\mathcal{E}(X)}(\mathcal{E}(X))\subseteq\mathbb{R}^{q}$ to predict the label of $T_{\mathcal{E}(X)}(\mathfrak{e})$, which we view as the label of the hyperedge $\mathfrak{e}$.
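The feature map $T_{\mathcal{E}(X)}$ can be sketched as follows. This is our own illustration: the toy neighborhood rule (labeled hyperedges sharing a vertex with $\mathfrak{e}$) is a hypothetical stand-in for $\mathcal{N}_{\text{edge}}$, whose actual definitions are those of Table 2:

```python
def feature_map(e, neighborhood, L, q):
    """T_{E(X)}: represent a hyperedge e by the vector (eta_1, ..., eta_q),
    where eta_i counts the neighbors of e carrying label i."""
    nbrs = neighborhood(e)                     # all neighbors lie in E_0(X)
    return tuple(sum(1 for f in nbrs if L[f] == i) for i in range(1, q + 1))

# Hypothetical labeled hyperedges and a toy neighborhood rule.
labeled = {frozenset({1, 2}): 1, frozenset({2, 3}): 2, frozenset({3, 4}): 2}
neighborhood = lambda e: [f for f in labeled if e & f]
print(feature_map(frozenset({2, 3, 5}), neighborhood, labeled, q=2))  # -> (1, 2)
```

The classical kNN method can then be run on these points of $\mathbb{R}^{q}$.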

\star Weight Prediction Problem

In this case, let $\mathcal{E}_{0}(X)$ be a subset of $\mathcal{E}(X)$ for which each hyperedge in $\mathcal{E}_{0}(X)$ is equipped with a weight in an interval $[a,b]$ in $\mathbb{R}$, i.e., there is a map $W:\mathcal{E}_{0}(X)\to[a,b]$. For an integer $q\geq 2$, we associate a map $L:\mathcal{E}_{0}(X)\to\{1,\ldots,q\}$ to $W$ as follows. We first divide the interval $[a,b]$ into $q$ sub-intervals of equal length, namely $[a_{0},a_{1}]$, $(a_{1},a_{2}]$, …, $(a_{q-1},a_{q}]$, where $a_{0}=a$, $a_{q}=b$, and $a_{i}=a_{i-1}+(b-a)/q$ for each $1\leq i\leq q$. For each $\mathfrak{e}\in\mathcal{E}_{0}(X)$, we let $L(\mathfrak{e})=1$ if $W(\mathfrak{e})\in[a_{0},a_{1}]$, and $L(\mathfrak{e})=i$ if $i$ is the unique integer in $\{2,\ldots,q\}$ such that $W(\mathfrak{e})\in(a_{i-1},a_{i}]$. Thus one obtains the map $L:\mathcal{E}_{0}(X)\to\{1,\ldots,q\}$.
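This binning of weights into labels can be sketched as follows (a minimal illustration; `weight_to_label` is our own name, and floating-point boundary cases are only guarded approximately):

```python
import math

def weight_to_label(w, a, b, q):
    """Bin a weight w in [a, b] into q equal-length sub-intervals:
    label 1 for [a_0, a_1], and label i for (a_{i-1}, a_i] when i >= 2."""
    step = (b - a) / q
    if w <= a + step:
        return 1
    # smallest i with w in (a_{i-1}, a_i], capped at q to guard rounding
    return min(q, math.ceil((w - a) / step))

print(weight_to_label(0.0, 0.0, 1.0, 2))  # -> 1
print(weight_to_label(0.7, 0.0, 1.0, 2))  # -> 2
```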

Repeating the same arguments as in the label prediction problem, one obtains the feature map defined by

T_{\mathcal{E}(X)}(\mathfrak{e})=(\eta_{1}(\mathcal{N}_{\text{edge}}(\mathfrak{e})),\ldots,\eta_{q}(\mathcal{N}_{\text{edge}}(\mathfrak{e}))).

Under the map $T_{\mathcal{E}(X)}$, the set of hyperedges $\mathcal{E}(X)$ can be represented as a subset, say $T_{\mathcal{E}(X)}(\mathcal{E}(X))$, of $\mathbb{R}^{q}$. For the weight prediction problem, we view the weight of each element $T_{\mathcal{E}(X)}(\mathfrak{e})\in\mathbb{R}^{q}$ for a given hyperedge $\mathfrak{e}$ as the same as that of $\mathfrak{e}\in\mathcal{E}(X)$. In order to predict the weight of a hyperedge $\mathfrak{e}$, we perform the classical kNN method on the set $T_{\mathcal{E}(X)}(\mathcal{E}(X))\subseteq\mathbb{R}^{q}$ to predict the weight of $T_{\mathcal{E}(X)}(\mathfrak{e})$, which we view as the weight of the hyperedge $\mathfrak{e}$.

In our experimental analysis, for simplicity, we always let $q=2$.

2.3. A class of metrics for the set of hyperedges of a hypergraph

In this subsection, we introduce a class of metrics modulo certain equivalence relations on the set of hyperedges in a hypergraph. We consider the incomplete hypergraphs introduced in Definition 2.1, which are suitable for studying the weight (respectively, label) prediction problem for the set of hyperedges.

For the rest of this section, we fix an incomplete hypergraph $X=(\mathcal{V}(X),\mathcal{E}(X))$ such that there is a subset $\mathcal{E}_{0}(X)$ of $\mathcal{E}(X)$ and a map $F:\mathcal{E}_{0}(X)\to A$, where $A$ is either an interval $[a,b]$ in $\mathbb{R}$ or a set of labels $\{1,\ldots,q\}$ for some integer $q\geq 2$. In the weight prediction problem, $F$ is the weight map $W$ and $A$ is an interval $[a,b]$, whereas in the label prediction problem, $F$ is the label map $L$ and $A=\{1,\ldots,q\}$.

We first introduce a notion of neighborhoods of hyperedges in $X$. For each $\mathfrak{e}\in\mathcal{E}(X)$, denote by $s(\mathfrak{e})$ the size of $\mathfrak{e}$, i.e., the number of vertices in $\mathfrak{e}$. Let $\mathcal{S}$ denote the set consisting of the sizes of all hyperedges in $X$, i.e., $\mathcal{S}=\{s(\mathfrak{e})\;|\;\mathfrak{e}\in\mathcal{E}(X)\}$. Let $\mathcal{M}:\mathcal{S}\to\mathbb{R}_{>0}\cup\{+\infty\}$ be a mapping which will be used to control the sizes of neighbors of hyperedges.

Definition 2.5.

($(\mathcal{E}_{0}(X),\mathcal{M})$-neighborhoods)

Let $\mathfrak{e}$ be a hyperedge in $\mathcal{E}(X)$. The $(\mathcal{E}_{0}(X),\mathcal{M})$-neighborhood of $\mathfrak{e}$ is defined by

\mathcal{N}_{\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})=\{\mathfrak{f}\in\mathcal{E}_{0}(X)\;|\;\mathrm{card}(\mathfrak{e}\cap\mathfrak{f})\geq\lfloor s(\mathfrak{e})/2\rfloor\text{ and }s(\mathfrak{f})\leq\mathcal{M}(s(\mathfrak{e}))\},

where $\mathfrak{e}\cap\mathfrak{f}$ is the set of vertices that $\mathfrak{e}$ and $\mathfrak{f}$ have in common, and $\lfloor\cdot\rfloor$ denotes the floor function.
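Definition 2.5 translates directly into code. The following sketch is our own illustration (the name `neighborhood` and the toy data are assumptions), using the size-control map $\mathcal{M}(s)=\frac{4}{3}s$ that also appears in Example 2.8:

```python
def neighborhood(e, observed, M):
    """(E_0(X), M)-neighborhood of Definition 2.5: hyperedges f in E_0(X)
    sharing at least floor(s(e)/2) vertices with e and with s(f) <= M(s(e))."""
    bound = M(len(e))
    return [f for f in observed
            if len(e & f) >= len(e) // 2 and len(f) <= bound]

# Toy data (our own): E_0(X) as a list of frozensets, with M(s) = (4/3) s.
observed = [frozenset({1, 2, 3}), frozenset({2, 3, 4, 5, 6}), frozenset({7, 8})]
e = frozenset({1, 2, 3, 4})
print(len(neighborhood(e, observed, lambda s: 4 * s / 3)))  # -> 2
```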

For the construction of a metric on an incomplete hypergraph, for each hyperedge $\mathfrak{e}$ we only focus on $(\mathcal{E}_{0}(X),\mathcal{M})$-neighbors $\mathfrak{f}$ of $\mathfrak{e}$ such that the differences between the values $F(\mathfrak{f})$ and the predicted value of $F(\mathfrak{e})$ are very small; these differences can be controlled using a tuning parameter $h$. The smaller these differences become, the more precise the predicted value of $F(\mathfrak{e})$ is.

We now introduce a notion of neighborhoods of hyperedges, depending on a tuning parameter hh.

Definition 2.6.

Let $h>0$ be a tuning parameter. For each hyperedge $\mathfrak{e}\in\mathcal{E}(X)$, set

\mathrm{AVG}_{\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})=\dfrac{\sum_{\mathfrak{f}\in\mathcal{N}_{\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})}F(\mathfrak{f})}{\mathrm{card}(\mathcal{N}_{\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e}))}.

Set

\mathcal{N}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})=\{\mathfrak{f}\in\mathcal{N}_{\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})\;|\;|F(\mathfrak{f})-\mathrm{AVG}_{\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})|\leq h\}.
Figure 1. Example of hypergraph with equivalent hyperedges.

For each $\mathfrak{e}\in\mathcal{E}(X)$, set

\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})=\mathrm{card}(\mathcal{N}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})),

that is, $\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})$ is the number of elements in $\mathcal{N}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})$.
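The computation of $\mathrm{AVG}_{\mathcal{E}_{0}(X),\mathcal{M}}$ and of the count $\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}$ can be sketched as follows (an illustration on hypothetical neighbor data; the $(\mathcal{E}_{0}(X),\mathcal{M})$-neighborhood is assumed to be precomputed, and `count_C_h` is our own name):

```python
def count_C_h(nbrs, F, h):
    """C_{h,E_0(X),M}(e): among the (E_0(X), M)-neighbors of e, count those
    whose F-value lies within h of the neighborhood average AVG."""
    avg = sum(F[f] for f in nbrs) / len(nbrs)  # AVG_{E_0(X),M}(e)
    return sum(1 for f in nbrs if abs(F[f] - avg) <= h)

# Hypothetical F-values of the neighbors of a single hyperedge e.
F = {"f1": 0.4, "f2": 0.5, "f3": -0.2}
print(count_C_h(["f1", "f2", "f3"], F, h=0.25))  # -> 1
```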

We introduce an equivalence relation on the set of hyperedges $\mathcal{E}(X)$ which allows us to identify certain hyperedges in $\mathcal{E}(X)$. Note that if two hyperedges, say $\mathfrak{e}$ and $\mathfrak{f}$, satisfy $\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})=\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{f})$, then it is natural to view $\mathfrak{e}$ and $\mathfrak{f}$ as similar hyperedges in the weight (respectively, label) prediction problem, since their neighborhood structures around the average of the weights (respectively, labels) are very similar.

Hence it is natural to define a binary relation on $\mathcal{E}(X)$ as follows: two hyperedges $\mathfrak{e}$ and $\mathfrak{f}$ are equivalent, denoted by $\mathfrak{e}\cong\mathfrak{f}$, if $\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})=\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{f})$. One obtains the following.

Proposition 2.7.

The binary relation “\cong” is an equivalence relation.

Example 2.8.

Figure 1 is an example of a hypergraph that indicates which hyperedges are $\cong_{3}$-equivalent.

In this example, 0(X)={𝔞,𝔟,𝔠,𝔡,𝔤,𝔥}\mathcal{E}_{0}(X)=\{\mathfrak{a},\mathfrak{b},\mathfrak{c},\mathfrak{d},\mathfrak{g},\mathfrak{h}\}. The weights of 𝔞,𝔟,𝔠,𝔡,𝔤,𝔥\mathfrak{a},\mathfrak{b},\mathfrak{c},\mathfrak{d},\mathfrak{g},\mathfrak{h} are 0.4,0.5,0.35,0.46,0.2,0.150.4,0.5,0.35,0.46,-0.2,-0.15, respectively. Let (s(𝔢))=43s(𝔢)\mathcal{M}(s(\mathfrak{e}))=\dfrac{4}{3}s(\mathfrak{e}) for each 𝔢(X)\mathfrak{e}\in\mathcal{E}(X). Then ℭh,0(X),(𝔢)=ℭh,0(X),(𝔣)=2\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})=\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{f})=2 for the equivalent hyperedges 𝔢\mathfrak{e} and 𝔣\mathfrak{f} shown in Figure 1, which indicates 𝔢𝔣\mathfrak{e}\cong\mathfrak{f}.

Now we define a mapping 𝒟h,0(X),:(X)×(X)0\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}}:\mathcal{E}(X)\times\mathcal{E}(X)\to\mathbb{R}_{\geq 0} as follows. For hyperedges 𝔢,𝔣\mathfrak{e},\mathfrak{f}, define

(2) 𝒟h,0(X),(𝔢,𝔣)=|h,0(X),(𝔢)h,0(X),(𝔣)|.\displaystyle\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e},\mathfrak{f})=|\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})-\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{f})|.
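As a sketch (hypothetical names, assuming neighborhoods are given as plain sets and F is the weight function), the counting function ℭ and the distance (2) can be computed as:

```python
def count_C(neighbors, F, h):
    # C_h: number of neighbors whose weight is within h of the neighborhood average
    avg = sum(F[f] for f in neighbors) / len(neighbors)
    return sum(1 for f in neighbors if abs(F[f] - avg) <= h)

def D(nbrs_e, nbrs_f, F, h):
    # D_h(e, f) = |C_h(e) - C_h(f)|; D_h(e, f) = 0 exactly when e ≅ f
    return abs(count_C(nbrs_e, F, h) - count_C(nbrs_f, F, h))

# toy data
F = {"a": 0.0, "b": 1.0, "c": 0.1}
nbrs_e = {"a", "c"}   # average 0.05; both neighbors lie within h = 0.1
nbrs_f = {"a", "b"}   # average 0.5; neither neighbor lies within h = 0.1
```

Symmetry and the triangle inequality are inherited from the absolute value.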
Theorem 2.9.

𝒟h,0(X),\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}} is a metric modulo the equivalence relation \cong.

Proof.

It is clear that 𝒟h,0(X),(𝔢,𝔣)0\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e},\mathfrak{f})\geq 0 which shows that (i) in Definition 2.4 holds.

Assume that 𝒟h,0(X),(𝔢,𝔣)=0\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e},\mathfrak{f})=0 for some 𝔢,𝔣(X)\mathfrak{e},\mathfrak{f}\in\mathcal{E}(X). Then it follows from (2) that ℭh,0(X),(𝔢)=ℭh,0(X),(𝔣)\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})=\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{f}), which implies that 𝔢𝔣\mathfrak{e}\cong\mathfrak{f}. Thus (ii) in Definition 2.4 holds.

It is obvious that 𝒟h,0(X),(𝔢,𝔣)=𝒟h,0(X),(𝔣,𝔢)\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e},\mathfrak{f})=\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{f},\mathfrak{e}), and thus (iii) in Definition 2.4 holds.

Let 𝔢,𝔣,𝔤\mathfrak{e},\mathfrak{f},\mathfrak{g} be hyperedges in (X)\mathcal{E}(X). We see that

𝒟h,0(X),(𝔢,𝔤)\displaystyle\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e},\mathfrak{g}) =|h,0(X),(𝔢)h,0(X),(𝔤)|\displaystyle=|\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})-\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{g})|
=|(h,0(X),(𝔢)h,0(X),(𝔣))+(h,0(X),(𝔣)h,0(X),(𝔤))|\displaystyle=|(\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})-\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{f}))+(\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{f})-\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{g}))|
|h,0(X),(𝔢)h,0(X),(𝔣)|+|h,0(X),(𝔣)h,0(X),(𝔤)|\displaystyle\leq|\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e})-\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{f})|+|\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{f})-\mathfrak{C}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{g})|
=𝒟h,0(X),(𝔢,𝔣)+𝒟h,0(X),(𝔣,𝔤).\displaystyle=\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e},\mathfrak{f})+\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{f},\mathfrak{g}).

Therefore (iv) in Definition 2.4 holds, and thus 𝒟h,0(X),\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}} is a metric modulo the equivalence relation \cong.

Depending on the range of the values of \mathcal{M}, one can obtain different types of neighborhoods of hyperedges, and different versions of the metric 𝒟h,0(X),\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}}. Two distinguished cases which we study in the paper are as follows.

Definition 2.10.

(Type I neighborhoods of hyperedges)

Let (s(𝔢))=+\mathcal{M}(s(\mathfrak{e}))=+\infty for each hyperedge 𝔢(X)\mathfrak{e}\in\mathcal{E}(X). For 𝔢(X)\mathfrak{e}\in\mathcal{E}(X), following the construction of the neighborhood 𝒩h,0(X),(𝔢)\mathcal{N}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e}) with (s(𝔢))\mathcal{M}(s(\mathfrak{e})) replaced by ++\infty, one obtains a new type of neighborhood of 𝔢\mathfrak{e}, denoted by 𝒩h,0(X),+(𝔢)\mathcal{N}_{h,\mathcal{E}_{0}(X),+\infty}(\mathfrak{e}), which we call the Type I neighborhood of 𝔢\mathfrak{e}.

Definition 2.11.

(A first metric on (X)\mathcal{E}(X))

Let (s(𝔢))=+\mathcal{M}(s(\mathfrak{e}))=+\infty for each hyperedge 𝔢(X)\mathfrak{e}\in\mathcal{E}(X). Following the construction of the metric 𝒟h,0(X),\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}} with (s(𝔢))\mathcal{M}(s(\mathfrak{e})) replaced by ++\infty, one obtains a metric modulo the equivalence relation \cong which we denote by 𝒟h,0(X),+\mathcal{D}_{h,\mathcal{E}_{0}(X),+\infty}.

The key feature of the metric 𝒟h,0(X),+\mathcal{D}_{h,\mathcal{E}_{0}(X),+\infty} is that it places no restriction on the sizes of the neighborhoods of hyperedges. It therefore retains all weight information from the neighborhoods of hyperedges in its definition, without filtering which weights are possibly important for prediction.

Definition 2.12.

(Type II neighborhoods of hyperedges)

Let (s(𝔢))=ϵs(𝔢)\mathcal{M}(s(\mathfrak{e}))=\epsilon s(\mathfrak{e}) for each hyperedge 𝔢(X)\mathfrak{e}\in\mathcal{E}(X), where ϵ\epsilon is a constant in the interval (0,2](0,2]. For 𝔢(X)\mathfrak{e}\in\mathcal{E}(X), following the construction of the neighborhood 𝒩h,0(X),(𝔢)\mathcal{N}_{h,\mathcal{E}_{0}(X),\mathcal{M}}(\mathfrak{e}) with (s(𝔢))\mathcal{M}(s(\mathfrak{e})) replaced by ϵs(𝔢)\epsilon s(\mathfrak{e}), one obtains a new type of neighborhood of 𝔢\mathfrak{e}, denoted by 𝒩h,0(X),ϵ(𝔢)\mathcal{N}_{h,\mathcal{E}_{0}(X),\epsilon}(\mathfrak{e}), which we call the Type II neighborhood of 𝔢\mathfrak{e}.

Definition 2.13.

(A second metric on (X)\mathcal{E}(X))

Let (s(𝔢))=ϵs(𝔢)\mathcal{M}(s(\mathfrak{e}))=\epsilon s(\mathfrak{e}) for each hyperedge 𝔢(X)\mathfrak{e}\in\mathcal{E}(X), where ϵ\epsilon is a constant in the interval (0,2](0,2]. Following the construction of the metric 𝒟h,0(X),\mathcal{D}_{h,\mathcal{E}_{0}(X),\mathcal{M}} with (s(𝔢))\mathcal{M}(s(\mathfrak{e})) replaced by ϵs(𝔢)\epsilon s(\mathfrak{e}), one obtains a metric modulo the equivalence relation \cong which we denote by 𝒟h,0(X),ϵ\mathcal{D}_{h,\mathcal{E}_{0}(X),\epsilon}.
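The two neighborhood types differ only in the bound (s(𝔢))\mathcal{M}(s(\mathfrak{e})). As an illustration only (the paper's actual membership condition for 𝒩\mathcal{N} is defined earlier; the closeness measure `close` below, the size of the symmetric difference, is a hypothetical stand-in), the role of \mathcal{M} can be sketched as:

```python
import math

def close(e, f):
    # hypothetical stand-in for the paper's closeness condition between
    # two hyperedges: size of their symmetric difference
    return len(set(e) ^ set(f))

def neighborhood(e, observed, bound):
    # observed hyperedges whose closeness to e is within the bound M(s(e))
    return [f for f in observed if close(e, f) <= bound]

e = {1, 2, 3}
observed = [{1, 2}, {1, 2, 3, 4, 5, 6, 7}]

type_I  = neighborhood(e, observed, math.inf)      # Type I:  M(s(e)) = +inf
type_II = neighborhood(e, observed, 1.0 * len(e))  # Type II: M(s(e)) = eps*s(e), eps = 1
```

With \mathcal{M} set to ++\infty every observed hyperedge qualifies; with ϵs(𝔢)\epsilon s(\mathfrak{e}) the neighborhood is capped relative to the size of 𝔢\mathfrak{e}.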

3. Experimental analysis

In this section, we apply the methodology proposed in Subsection 2.2 combined with the class of metrics in Subsection 2.3 to construct predictive models for weight (respectively, label) prediction problem for hypergraph datasets.

We are not aware of any hypergraph datasets containing weights (respectively, labels) of hyperedges. In our experimental analysis, we therefore use datasets of weighted bipartite networks and transform them into hypergraphs.

Let G=(U,V,E)G=(U,V,E) be a bipartite network, where UU and VV are disjoint sets of vertices and EE is the set of edges. Each edge eEe\in E is formed by a unique pair of vertices (u,v)(u,v) with uUu\in U and vVv\in V. A hypergraph X=(𝒱(X),(X))X=(\mathcal{V}(X),\mathcal{E}(X)) induced by the bipartite graph GG is defined as follows.

  • (i)

    Without loss of generality (exchanging the roles of UU and VV if necessary), let 𝒱(X)=U\mathcal{V}(X)=U; that is, the set of vertices in XX is defined to be one of the two vertex sets in GG.

  • (ii)

    For each vVv\in V, let 𝔢v\mathfrak{e}_{v} denote the set of all vertices uu in UU such that (u,v)E(u,v)\in E. We define the set of hyperedges of XX as

    (X)={𝔢v|vV}.\displaystyle\mathcal{E}(X)=\{\mathfrak{e}_{v}\;|\;v\in V\}.

    In this case, each hyperedge 𝔢\mathfrak{e} in (X)\mathcal{E}(X) uniquely corresponds to a vertex in VV; thus (X)\mathcal{E}(X) is in bijection with VV.
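This transformation can be sketched in a few lines of Python (hypothetical names; edges are given as (user, item) pairs):

```python
from collections import defaultdict

def bipartite_to_hypergraph(edges):
    # Build the induced hypergraph: vertices are the left set U, and each
    # right vertex v in V yields the hyperedge e_v = {u in U : (u, v) in E}.
    hyperedges = defaultdict(set)
    for u, v in edges:
        hyperedges[v].add(u)
    vertices = set().union(*hyperedges.values()) if hyperedges else set()
    return vertices, dict(hyperedges)

# toy bipartite rating network: users 1-3 rate items "a" and "b"
E = [(1, "a"), (2, "a"), (2, "b"), (3, "b"), (3, "a")]
V_X, E_X = bipartite_to_hypergraph(E)
# E_X == {"a": {1, 2, 3}, "b": {2, 3}}: one hyperedge per item
```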

In our experimental analysis, we use two real-world social network datasets: the Epinions network and the MovieLens network. Both are bipartite rating networks in which one vertex set consists of users and the other consists of items. The weight of an edge represents a rating score from a user to an item. In the hypergraphs induced by these two networks, the set of vertices 𝒱(X)\mathcal{V}(X) is the set of users, and each hyperedge is formed by the set of all users that rate the same item. Therefore, each hyperedge corresponds to an item in the bipartite graph. An example of transforming a bipartite graph into a hypergraph is shown in Figures 2 and 3. Descriptions of the two datasets are as follows.

  • Epinions. This dataset was collected by Paolo Massa in a 5-week crawl (November/December 2003) from the Epinions.com website (the dataset is available at http://www.trustlet.org/downloaded_epinions.html; see [9]). In Epinions, the vertices are users and reviews; each user rates the helpfulness of a review on a 1–5 scale, where 1 means totally not helpful and 5 means totally helpful.

  • MovieLens. This dataset contains movie ratings collected from https://movielens.org/ (see [10]). In this dataset, the vertices are users and movies; each edge connects a user with a movie and represents a rating score. The rating scores are integers from 0 to 5.

Table 3 contains descriptions of the hypergraphs used in our computation. The weights of hyperedges are computed using the edge weights from the bipartite network datasets and the notion of goodness introduced in [7]. In both hypergraphs, each hyperedge corresponds to an item in the bipartite graph, and the weight of each hyperedge, which is computed using goodness, indicates how good an item is. In order to associate each hyperedge with a label, we divide the range of weights, which is the closed interval from the minimum weight to the maximum weight, into qq intervals of equal length. The label of a hyperedge is ii if its weight belongs to the iith interval. The set of labels is the set of integers from 1 to qq.
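The labeling step can be sketched as follows (hypothetical function name; we assume the maximum weight is assigned to the last interval, which the paper leaves implicit):

```python
def weight_to_label(w, w_min, w_max, q):
    # Divide [w_min, w_max] into q equal-length intervals; the label of a
    # hyperedge is i (1..q) when its weight falls in the i-th interval.
    if w >= w_max:          # boundary convention: the maximum weight gets label q
        return q
    width = (w_max - w_min) / q
    return int((w - w_min) // width) + 1
```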

According to the definitions of neighborhoods, one needs to select the tuning parameter hh for 𝒩h,0(X),+(𝔢)\mathcal{N}_{h,\mathcal{E}_{0}(X),+\infty}(\mathfrak{e}) and 𝒩h,0(X),ϵ(𝔢)\mathcal{N}_{h,\mathcal{E}_{0}(X),\epsilon}(\mathfrak{e}). In our computation, we set the value of hh based on the neighborhood of each hyperedge: for each 𝔢(X)\mathfrak{e}\in\mathcal{E}(X), hh equals the standard deviation of the weights of all hyperedges that belong to the neighborhood of 𝔢\mathfrak{e}. The choices of kk for kkNN are integers from 1 to 20, and the values of ϵ\epsilon are set to 53,43,1,23\frac{5}{3},\frac{4}{3},1,\frac{2}{3}, or 12\frac{1}{2}. The performance of weight prediction is evaluated using mean absolute error (MAE) and root mean squared error (RMSE). The performance of label prediction is evaluated using the error rate, i.e. the proportion of incorrect predictions. We select the values of kk and ϵ\epsilon that correspond to the smallest MAE or error rate.
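A minimal sketch of these evaluation measures and of the per-hyperedge choice of hh (hypothetical names; we assume the population standard deviation, since the paper does not specify sample versus population):

```python
import statistics

def adaptive_h(neighbor_weights):
    # h for a hyperedge: standard deviation of the weights of all
    # hyperedges in its neighborhood (population version assumed)
    return statistics.pstdev(neighbor_weights)

def mae(y_true, y_pred):
    # mean absolute error over predicted weights
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # root mean squared error over predicted weights
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5

def error_rate(labels_true, labels_pred):
    # proportion of incorrectly predicted labels
    return sum(t != p for t, p in zip(labels_true, labels_pred)) / len(labels_true)
```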

Figures 4 and 5 present the results of predicting weights of hyperedges using the Epinions dataset. In this example, the smallest MAE and RMSE values are obtained by setting ϵ\epsilon equal to 53\frac{5}{3}. In general, using the second type of neighborhood (𝒩h,0(X),ϵ(𝔢)\mathcal{N}_{h,\mathcal{E}_{0}(X),\epsilon}(\mathfrak{e})) leads to better prediction results than using the first type of neighborhood (𝒩h,0(X),+(𝔢)\mathcal{N}_{h,\mathcal{E}_{0}(X),+\infty}(\mathfrak{e})).

Tables 4 and 5 present the results of predicting weights of hyperedges using the modified kkNN method and the embedded kkNN method. In our experimental analysis, for simplicity, we set q=2q=2 for the computation of T(X)(𝔢)=(η1(𝒩edge(𝔢)),,ηq(𝒩edge(𝔢)))T_{\mathcal{E}(X)}(\mathfrak{e})=(\eta_{1}(\mathcal{N}_{\text{edge}}(\mathfrak{e})),\ldots,\eta_{q}(\mathcal{N}_{\text{edge}}(\mathfrak{e}))) in the embedded kkNN method, where 𝒩edge(𝔢)\mathcal{N}_{\text{edge}}(\mathfrak{e}) is either the Type I or the Type II neighborhood of 𝔢\mathfrak{e}. Each cell reports a pair of numbers (mean absolute error (MAE), root mean squared error (RMSE)).

Table 6 presents the error rates of predicting labels of hyperedges using modified kkNN methods. Table 7 presents the error rates of predicting labels of hyperedges using embedded kkNN methods.

Refer to caption

This is a subgraph of the Epinions network. UU stands for the set of users from 1 to 9, and BB stands for the set of books from aa to dd.


Figure 2. Example of bipartite graph
Refer to caption
Figure 3. Hypergraph based on the bipartite graph in Figure 2
Refer to caption
Figure 4. MAE of predicting weights of hyperedges using different values of ϵ\epsilon.
Refer to caption
Figure 5. RMSE of predicting weights of hyperedges using different values of ϵ\epsilon.
Table 3. Descriptions of hypergraphs constructed using each dataset
number of hyperedges number of vertices s(𝔢)s(\mathfrak{e}) :Size of hyperedge 𝔢\mathfrak{e}
Epinions 33774 32610 2s(𝔢)9182\leq s(\mathfrak{e})\leq 918
MovieLens 7498 75106 2s(𝔢)6872\leq s(\mathfrak{e})\leq 687
Table 4. Prediction of weights of hyperedges using modified kkNN
Metric 𝒟h,0(X),+\mathcal{D}_{h,\mathcal{E}_{0}(X),+\infty} 𝒟h,0(X),ϵ\mathcal{D}_{h,\mathcal{E}_{0}(X),\epsilon}
Epinions (0.236, 0.301) (0.234, 0.299)
MovieLens (0.172, 0.217) (0.184, 0.232)
Table 5. Prediction of weights of hyperedges using embedded kkNN
Neighborhood 𝒩0(X),+(𝔢)\mathcal{N}_{\mathcal{E}_{0}(X),+\infty}(\mathfrak{e}) 𝒩0(X),ϵ(𝔢)\mathcal{N}_{\mathcal{E}_{0}(X),\epsilon}(\mathfrak{e})
Epinions (0.234, 0.298) (0.232, 0.296)
MovieLens (0.171, 0.217) (0.183, 0.231)
Table 6. Error rates of predicting labels using modified kkNN
Dataset Epinions MovieLens
Metric 𝒟h,0(X),+\mathcal{D}_{h,\mathcal{E}_{0}(X),+\infty} 𝒟h,0(X),ϵ\mathcal{D}_{h,\mathcal{E}_{0}(X),\epsilon} 𝒟h,0(X),+\mathcal{D}_{h,\mathcal{E}_{0}(X),+\infty} 𝒟h,0(X),ϵ\mathcal{D}_{h,\mathcal{E}_{0}(X),\epsilon}
2 labels 0.182 (k=1) 0.176 (k=1) 0.250 (k=1) 0.250 (k=4)
3 labels 0.316 (k=3) 0.306 (k=2) 0.296 (k=1) 0.296 (k=1)
4 labels 0.406 (k=4) 0.412 (k=1) 0.303 (k=6) 0.303 (k=4)
5 labels 0.507 (k=3) 0.512 (k=4) 0.535 (k=5) 0.532 (k=1)
Table 7. Error rates of predicting labels using embedded kkNN
Dataset Epinions MovieLens
Neighborhood 𝒩0(X),+(𝔢)\mathcal{N}_{\mathcal{E}_{0}(X),+\infty}(\mathfrak{e}) 𝒩0(X),ϵ(𝔢)\mathcal{N}_{\mathcal{E}_{0}(X),\epsilon}(\mathfrak{e}) 𝒩0(X),+(𝔢)\mathcal{N}_{\mathcal{E}_{0}(X),+\infty}(\mathfrak{e}) 𝒩0(X),ϵ(𝔢)\mathcal{N}_{\mathcal{E}_{0}(X),\epsilon}(\mathfrak{e})
2 labels 0.182 (k=2) 0.176 (k=2) 0.252 (k=19) 0.249 (k=13)
3 labels 0.323 (k=2) 0.319 (k=5) 0.295 (k=10) 0.296 (k=3)
4 labels 0.406 (k=10) 0.412 (k=7) 0.304 (k=15) 0.303 (k=10)
5 labels 0.511 (k=5) 0.520 (k=2) 0.519 (k=15) 0.527 (k=1)

References

  • [1] S. Agarwal, K. Branson, and S. Belongie, Higher order learning with graphs. In ICML, pages 17–24, 2006.
  • [2] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, Learning multi-label scene classification. Pattern Recognition, 37(9):1757–1771, 2004.
  • [3] G. Chen, J. Zhang, F. Wang, C. Zhang, and Y. Gao, Efficient multi- label classification with hypergraph regularization, Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, vol. 0, pp. 1658–1665, 2009.
  • [4] A. Elisseeff and J. Weston, A kernel method for multi-labelled classification, In Advances in Neural Information Processing Systems 14, pages 681–687, 2001.
  • [5] F. Kang, R. Jin, and R. Sukthankar, Correlated label propagation with application to multi-label learning. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1719–1726, 2006.
  • [6] H. Kazawa, T. Izumitani, H. Taira, and E. Maeda, Maximal margin labeling for multi-topic text categorization. In Advances in Neural Information Processing Systems 17, pages 649–656. 2005.
  • [7] Srijan Kumar, Francesca Spezzano, V. S Subrahmanian, and Christos Faloutsos, Edge Weight Prediction in Weighted Signed Networks. 2016 IEEE 16th International Conference on Data Mining (ICDM) (2016): 221–230.
  • [8] Dong Li, Zhiming Xu, Sheng Li, and Xin Sun, Link prediction in social networks based on hypergraph, Proceedings of the 22nd International Conference on World Wide Web, pages 41–42, 2013.
  • [9] P. Massa and P. Avesani, Trust-aware Recommender Systems. Proceedings of the 2007 ACM Conference on Recommender Systems (2007): 17–24.
  • [10] F. Maxwell Harper and Joseph A. Konstan, The MovieLens Datasets: History and Context, ACM Transactions on Interactive Intelligent Systems, 5(4): 19:1–19:19, 2015.
  • [11] J. Silva and R. Willett, Hypergraph-based anomaly detection of high-dimensional co-occurrences, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, pp. 563–569, 2009.
  • [12] Z. Tian, T. Hwang, and R. Kuang, A hypergraph- based learning algorithm for classifying gene expression and arraycgh data with prior knowledge, Bioinformatics, Volume 25, Issue 21, 1 November 2009, Pages 2831–2838.
  • [13] A. Torralba, K. P. Murphy, and W. T. Freeman, Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):854–869, 2007.
  • [14] N. Ueda and K. Saito, Parametric mixture models for multi-labeled text. In Advances in Neural Information Processing Systems 16, pages 721–728. 2003.
  • [15] M.-L. Zhang and Z.-H. Zhou. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10):1338–1351, 2006.
  • [16] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf, Learning with hypergraphs: Clustering, classification, and embedding, In NIPS, 2006.
  • [17] S. Yu, K. Yu, V. Tresp, and H.-P. Kriegel, Multi-output regularized feature projection, IEEE Transactions on Knowledge and Data Engineering, 18(12):1600–1613, 2006.