
¹ Department of Computer Science and Information Technology, Amirkabir University of Technology, Iran; [email protected]
² Department of Research Methodology, Measurement and Data Analysis, University of Twente, Enschede, Netherlands; [email protected]

MULTI-LABEL CLASSIFICATION USING LINK PREDICTION

Seyed Amin Fadaee¹    Maryam Amir Haeri²
Abstract

Solving classification with graph methods has gained considerable popularity in recent years. This is due to the fact that the data can be intuitively modeled as a graph, allowing high-level features to aid in solving the classification problem. CULP, short for Classification Using Link Prediction, is a graph-based classifier. This classifier utilizes the graph representation of the data and transforms the problem into a link prediction task, in which we try to find the link between an unlabeled node and the proper class node for it. CULP proved to be a highly accurate classifier, and it can predict labels in near-constant time. A variant of the classification problem is multi-label classification, which tackles data where an instance can have multiple labels associated with it. In this work, we extend the CULP algorithm to address this problem. Our proposed extensions carry the strengths of CULP and its intuitive representation of the data into the multi-label domain and, in comparison to some cutting-edge multi-label classifiers, yield competitive results.

1 Introduction

Classification is a supervised learning problem with the goal of learning a model from labeled data. This model can then be used to predict the labels of new data. Binary classification refers to the most basic variant of this problem, where a data point can take one of two different labels. In multi-class classification, a data point can take a label from more than two choices [1]. Multi-label classification is another variant of the classification problem, in which each data point can have multiple labels. It has been utilized in applications such as labeling of multimedia resources, text categorization, genetics and biology [2].

There are different approaches to solving the multi-label classification problem, one of which is transforming the data. Data transformation changes the data in a way that allows an ordinary classifier to be used. Two main methods using this technique are Binary Relevance [3], which trains a separate classifier for each label, and Label Powerset [4], which treats each combination of labels as a single class during training. The other technique is method adaptation, which changes a method used for multi-class classification to make it suitable for learning from multi-label data. Ensemble algorithms, being transformation methods by nature, are considered another technique for tackling this problem; a notable example of this category is the RAkEL [5] algorithm.

An approach which has been researched extensively for solving multi-class classification, but has not been studied for learning from multi-label data, is complex networks. Complex networks is a domain of data mining that deals with graphs of huge dimensionality. When solving a machine learning task in this setting, a graph representation of the data is used to solve the problem. These methods usually convert the original data into one or multiple graphs (with the kNN or r-radius method [6]) and, using one or more specific graph properties, find the labels of the unlabeled data [7, 8, 9, 10, 11, 12].

One of the algorithms in this domain is CULP (Classification Using Link Prediction), proposed in [7] as a highly accurate classifier for solving the problem of multi-class classification. This algorithm utilizes complex networks, specifically the problem of link prediction, to aid in solving the classification problem. CULP models the data and labels as a graph structure called a LEG (Label Embedded Graph), which uses links between data nodes and class nodes to define class membership; this in turn converts the problem of classification to a link prediction task where the link between an unlabeled data node and a class node must be determined.

In this work, we present two extensions to the CULP algorithm, namely MiCULP (CULP with MultI-label embedding) and BiCULP (CULP with BInarily dissected labels embedding), which adapt the concept of CULP to solve the problem of multi-label classification. These algorithms fall into the method-adaptation category mentioned earlier. As elaborated in the experiments section, these extensions yield competitive results in comparison to cutting-edge algorithms for dealing with multi-label data.

In the next section, a more detailed review of CULP is presented, along with the complete algorithms for MiCULP and BiCULP. After that, we review some of the recent work on multi-label classification and introduce the algorithms used in our experiments. We then describe the experiments conducted to evaluate the performance of the extensions on benchmark data sets, and the last section concludes the paper and sets some goals for future work.

2 Proposed Method

In our previous work [7], we devised an algorithm to solve the classification problem using the similarities between instances. CULP (Classification Using Link Prediction) utilizes the power of complex networks, link prediction in particular, to determine the label of a data point. From the inception of this algorithm, its extension to the problem of multi-label classification was apparent.

CULP converts the task of classification into a link prediction problem. This is done by utilizing the LEG data structure (short for Label Embedded Graph). A LEG is a heterogeneous graph with two sets of edges and three sets of nodes, which represents the data in the form of a graph suitable for classification.

Given the original labeled data $X^{(l)}$ with labels $y$, where $y_i \in \{1, 2, \ldots, C\}$ and $C$ is the number of classes, along with the unlabeled data $X^{(u)}$, we can create the LEG nodes and edges.

A node can belong to the training node set ($V_l$), the testing node set ($V_u$) or the class node set ($V_c$). Each of the instances in $X^{(l)}$ is represented by a training node and each instance in $X^{(u)}$ is represented by a testing node. Considering there are $C$ labels for the instances of $X^{(l)}$, $C$ class nodes are created, each of which represents one of the labels.

The edges of the LEG comprise similarity edges ($E_s$) and membership edges ($E_c$). A similarity edge connects node $i$ and node $j$ (where $i, j \in V_l \cup V_u$), and a membership edge connects a training node $i \in V_l$ to a class node $c \in V_c$; this means that if $X^{(l)}_i$ has label $y_i = c$, there is an edge from $i$ to $c$.

Edges in $E_s$ incorporate the similarities between instances of our data, and the edges in this set are obtained by using a graph conversion algorithm. CULP uses an undirected version of the kNN graph conversion [7], but any other method is applicable as needed. The graph conversion method uses a vector similarity function $s$, which can be a typical measure such as Euclidean, Manhattan or cosine similarity.
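As a concrete illustration, a minimal Python sketch of such an undirected kNN conversion is given below; the name knn_convert, its default similarity and its 0-based node indices are our own illustrative choices, not the authors' implementation.

import numpy as np

def knn_convert(X, k, s=None):
    """Sketch of kNN-CONVERT: return the similarity-edge set E_s of an
    undirected kNN graph over the stacked data X (0-based node indices)."""
    n = X.shape[0]
    if s is None:
        # Negative Euclidean distance serves as a similarity score.
        s = lambda a, b: -np.linalg.norm(a - b)
    edges = set()
    for i in range(n):
        # Score every other node and keep the k most similar neighbors.
        scores = sorted(((s(X[i], X[j]), j) for j in range(n) if j != i),
                        reverse=True)
        for _, j in scores[:k]:
            # Store each undirected edge once, as an ordered pair.
            edges.add((min(i, j), max(i, j)))
    return edges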

An example of creating a LEG is depicted in Figure 1. In this example, the black and white dots represent $X^{(l)}$ and the red square is an unlabeled data point. The instances in $X^{(l)}$ have two labels: the white dots are labeled $1$ and the black ones $2$, and as can be seen in the graph, the corresponding training nodes are connected to their label nodes $j_1$ and $j_2$ accordingly.

Figure 1: Toy example demonstrating a LEG. A: the set of data belonging to 2 classes, with a test point in red; B: the LEG representation of the data [7]

Despite the heterogeneous nature of a LEG, we can define it as a simple undirected graph $G(V, E)$ where $V = V_l \cup V_u \cup V_c$ and $E = E_s \cup E_c$. The procedure for creating $G$ is summarized in Algorithm 1.

Algorithm 1 Constructing the LEG using $X^{(l)}$, the labels $y$ and the unlabeled data $X^{(u)}$, with parameter $k$ and the similarity function $s$
function LEG($X^{(l)}$, $X^{(u)}$, $y$, $s$, $k$)
     $X \leftarrow X^{(l)} \cup X^{(u)}$
     $V_l \leftarrow \{1, 2, \ldots, n\}$ //Nodes are represented by numbers
     $V_u \leftarrow \{n+1, n+2, \ldots, n+m\}$
     $V_c \leftarrow \{n+m+1, n+m+2, \ldots, n+m+C\}$
     $E_c \leftarrow \{\}$
     for $i \in \{1, 2, \ldots, n\}$ do
          $E_c \leftarrow E_c \cup \{(i, n+m+y_i)\}$
     end for
     $E_s \leftarrow$ kNN-CONVERT($X$, $s$, $k$)
     $V \leftarrow V_l \cup V_u \cup V_c$
     $E \leftarrow E_s \cup E_c$
     return $G(V, E)$
end function
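Assuming the knn_convert sketch above, Algorithm 1 can be transcribed into Python roughly as follows. This is a sketch with 1-based node ids matching the paper's numbering and with labels assumed to be coded 1..C; it is not the authors' code.

def build_leg(X_l, X_u, y, s, k):
    """Sketch of Algorithm 1: build the LEG G(V, E) with 1-based node ids."""
    n, m, C = len(X_l), len(X_u), max(y)  # assumes labels are 1..C
    X = np.vstack([X_l, X_u])
    V_l = set(range(1, n + 1))                  # training nodes 1..n
    V_u = set(range(n + 1, n + m + 1))          # testing nodes n+1..n+m
    V_c = set(range(n + m + 1, n + m + C + 1))  # class nodes
    # Membership edges: training node i links to the node of its label y_i.
    E_c = {(i, n + m + y[i - 1]) for i in V_l}
    # Similarity edges from the kNN conversion (shifted to 1-based ids).
    E_s = {(a + 1, b + 1) for a, b in knn_convert(X, k, s)}
    return V_l | V_u | V_c, E_s | E_c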

After creating the LEG, we can predict the labels of the unlabeled instances by finding the most probable class node to connect to each of the test nodes.

Using link prediction, we can easily solve this problem. By using a local similarity measure $\lambda$ as a link predictor (e.g. the Adamic-Adar index [13]), the labels are obtained as follows:

$$\forall X^{(u)}_{i}\in X^{(u)}:\quad y_{i}=c^{*},\qquad c^{*}=\underset{c\in V_{c}}{\operatorname{argmax}}\;\lambda_{i,c}$$ (1)

In the formula above, $c^{*}$ is derived by finding the class node which has the highest probability of connecting to node $i$.

Using simple link prediction schemes such as Common Neighbors [14], the Adamic-Adar index and the Resource Allocation index [15], CULP proved to be competitive against typical classification algorithms and some more powerful graph-based classifiers. The procedure of CULP is given in Algorithm 2.

Algorithm 2 CULP Algorithm
function CULP($X^{(l)}$, $X^{(u)}$, $y$, $s$, $k$, $\lambda$)
     $G \leftarrow$ LEG($X^{(l)}$, $X^{(u)}$, $y$, $s$, $k$)
     $\hat{y} \leftarrow \{\}$
     for $i \in V_u$ do
          $c^{*} \leftarrow \underset{c\in V_{c}}{\operatorname{argmax}}\;\lambda_{i,c}$
          $\hat{y}_i \leftarrow c^{*} - (n+m)$
     end for
     return $\hat{y}$
end function
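To make the scoring step concrete, the following sketch implements two of the local predictors named above together with the argmax of Equation (1). The helper names are ours and the code is illustrative only.

import math
from collections import defaultdict

def neighbors(E):
    """Adjacency map of an undirected edge set."""
    adj = defaultdict(set)
    for a, b in E:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def common_neighbors(adj, i, c):
    """Common Neighbors score between nodes i and c."""
    return len(adj[i] & adj[c])

def adamic_adar(adj, i, c):
    """Adamic-Adar score; degree-1 common neighbors are skipped
    because 1/log(1) is undefined."""
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[i] & adj[c] if len(adj[z]) > 1)

def culp_predict(V_u, V_c, E, n, m, score=adamic_adar):
    """Equation (1): label each test node with its best-scoring class node."""
    adj = neighbors(E)
    return {i - n: max(V_c, key=lambda c: score(adj, i, c)) - (n + m)
            for i in V_u}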

In [7], we also extended this algorithm with CULM (CULP with Majority vote), an ensemble algorithm that uses multiple link predictors along with a typical classifier and a majority vote scheme to achieve the best accuracy scores compared to other methods.

By extending the same principles used in CULP, we devised two algorithms for solving the problem of multi-label classification, which are discussed in the following sections.

2.1 MiCULP Algorithm

In the setting of multi-label classification, given the data $X^{(l)}$, each instance in $X^{(l)}$ can have multiple labels from the label set $L = \{1, 2, \ldots, C\}$, so essentially we have a label matrix $Y$ instead of a label vector. Given a training instance $X^{(l)}_i$, $Y_i$ is the label vector of $X^{(l)}_i$, where $\forall Y_{i,j} \in Y_i : Y_{i,j} \in \{0, 1\}$.

As previously discussed, CULP derives a score for each pair $(i, c)$ where $i \in V_u$ and $c \in V_c$. For multi-label data, we can tweak the LEG and CULP algorithms to apply a threshold to these scores and thereby pick multiple labels.

The extension of the LEG used for this task is called the MiLEG, which is short for MultI-Label Embedded Graph. The main difference between a LEG and a MiLEG is that in a MiLEG structure, a training node $i$ can be connected to multiple class nodes. The procedure for creating a MiLEG is captured in Algorithm 3.¹

¹ The complete code for both extensions in Python can be found at github.com/aminfadaee/mlculp.

Algorithm 3 Constructing the MiLEG using $X^{(l)}$, the labels $Y$, the label set $L$ and the unlabeled data $X^{(u)}$, with parameter $k$ and the similarity function $s$
function MiLEG($X^{(l)}$, $X^{(u)}$, $Y$, $L$, $s$, $k$)
     $X \leftarrow X^{(l)} \cup X^{(u)}$
     $V_l \leftarrow \{1, 2, \ldots, n\}$ //Nodes are represented by numbers
     $V_u \leftarrow \{n+1, n+2, \ldots, n+m\}$
     $V_c \leftarrow \{n+m+1, n+m+2, \ldots, n+m+C\}$
     $E_c \leftarrow \{\}$
     for $i \in \{1, 2, \ldots, n\}$ do
          for $y_{i,c} \in Y_i$ do
               if $y_{i,c} = 1$ then
                    $E_c \leftarrow E_c \cup \{(i, n+m+c)\}$
               end if
          end for
     end for
     $E_s \leftarrow$ kNN-CONVERT($X$, $s$, $k$)
     $V \leftarrow V_l \cup V_u \cup V_c$
     $E \leftarrow E_s \cup E_c$
     return $G(V, E)$
end function
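Mirroring the build_leg sketch above, a hedged Python rendering of Algorithm 3 could read as follows, assuming Y is an n-by-C binary numpy array; again, this is our illustration, not the reference implementation.

def build_mileg(X_l, X_u, Y, s, k):
    """Sketch of Algorithm 3: build the MiLEG with 1-based node ids."""
    n, m, C = len(X_l), len(X_u), Y.shape[1]
    X = np.vstack([X_l, X_u])
    V_l = set(range(1, n + 1))
    V_u = set(range(n + 1, n + m + 1))
    V_c = set(range(n + m + 1, n + m + C + 1))
    # One membership edge for every label c with Y[i, c] = 1.
    E_c = {(i, n + m + c)
           for i in V_l for c in range(1, C + 1) if Y[i - 1, c - 1] == 1}
    E_s = {(a + 1, b + 1) for a, b in knn_convert(X, k, s)}
    return V_l | V_u | V_c, E_s | E_c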

As for the prediction procedure, we need to pick all labels whose scores exceed a defined threshold, as follows:

$$\forall X^{(u)}_{i}\in X^{(u)},\;\forall y_{i,c}\in Y_{i}:\quad y_{i,c}=\begin{cases}1 & \lambda_{i,c}>t\\ 0 & \lambda_{i,c}\leq t\end{cases}$$ (2)

In the formula above, $t$ is used as a threshold for picking labels. Its optimal value can be found using a validation scheme, or it can be given a pre-defined value depending on the data. Using this, we can completely define the MiCULP algorithm, which is depicted in Algorithm 4.

Algorithm 4 MiCULP Algorithm
function MiCULP($X^{(l)}$, $X^{(u)}$, $Y$, $L$, $s$, $k$, $\lambda$, $t$)
     $G \leftarrow$ MiLEG($X^{(l)}$, $X^{(u)}$, $Y$, $L$, $s$, $k$)
     $\hat{Y} \leftarrow [\;]$
     for $i \in V_u$ do
          for $c \in V_c$ do
               $\hat{Y}_{i,c-(n+m)} \leftarrow 1$ if $\lambda_{i,c} > t$ else $0$
          end for
     end for
     return $\hat{Y}$
end function

In this algorithm, after creating the MiLEG representation, each of the unlabeled data points is assigned a label vector by comparing the score of the link between its node and every label node against $t$. For simplicity, we can normalize these scores so that $t$ is picked from the $[0, 1]$ range.
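Putting Equation (2) and this normalization together, a sketch of the prediction loop of Algorithm 4, reusing the neighbors and adamic_adar helpers from earlier, might look like this (illustrative only):

def miculp_predict(V_u, V_c, E, n, m, t, score=adamic_adar):
    """Equation (2) with per-node score normalization, so t lies in [0, 1]."""
    adj = neighbors(E)
    Y_hat = {}
    for i in V_u:
        scores = {c: score(adj, i, c) for c in V_c}
        hi = max(scores.values()) or 1.0  # guard against all-zero scores
        # Keep every class whose normalized link score exceeds t.
        Y_hat[i - n] = {c - (n + m) for c, v in scores.items() if v / hi > t}
    return Y_hat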

2.2 BiCULP Algorithm

The other extension of the CULP algorithm to multi-label classification is BiCULP. This extension is inspired by the binary relevance method and uses the BiLEG data structure, which is short for BInarily dissected Label Embedded Graph.

The BiLEG structure creates two nodes for each of the labels $c \in L$. As stated before, each label $c$ can have a value of $0$ or $1$; in a BiLEG, these values are denoted by the nodes $c_0$ and $c_1$. Given this, node $i$ is connected to $c_1$ if $y_{i,c} = 1$ and to $c_0$ otherwise. The complete procedure for creating this data structure is defined in Algorithm 5.

Algorithm 5 Constructing the BiLEG using $X^{(l)}$, the label matrix $Y$, the label set $L$ and the unlabeled data $X^{(u)}$, with parameter $k$ and the similarity function $s$
function BiLEG($X^{(l)}$, $X^{(u)}$, $Y$, $L$, $s$, $k$)
     $X \leftarrow X^{(l)} \cup X^{(u)}$
     $V_l \leftarrow \{1, 2, \ldots, n\}$ //Nodes are represented by numbers
     $V_u \leftarrow \{n+1, n+2, \ldots, n+m\}$
     $V_c \leftarrow \{n+m+1, n+m+2, \ldots, n+m+2C\}$
     $E_c \leftarrow \{\}$
     for $i \in \{1, 2, \ldots, n\}$ do
          for $y_{i,c} \in Y_i$ do
               $E_c \leftarrow E_c \cup \{(i, n+m+2c-y_{i,c})\}$
          end for
     end for
     $E_s \leftarrow$ kNN-CONVERT($X$, $s$, $k$)
     $V \leftarrow V_l \cup V_u \cup V_c$
     $E \leftarrow E_s \cup E_c$
     return $G(V, E)$
end function

In this algorithm, $2C$ class nodes are present, and the nodes for label $c$ are $2c$ and $2c-1$ (offset by $n+m$), corresponding respectively to its $0$ ($c_0$) and $1$ ($c_1$) values.

Having the BiLEG, the BiCULP algorithm easily determines the labels of node $i$ in the following way:

$$\forall X^{(u)}_{i}\in X^{(u)},\;\forall y_{i,c}\in Y_{i}:\quad y_{i,c}=\begin{cases}1 & \lambda_{i,c_{1}}>\lambda_{i,c_{0}}\\ 0 & \text{otherwise}\end{cases}$$ (3)

An apparent gain of this approach compared to that of MiCULP is that finding a suitable threshold is no longer needed. Furthermore, this approach also exploits the absence of labels in prediction, which can be beneficial in the presence of imbalanced classes. The complete algorithm for BiCULP is presented in Algorithm 6.

Algorithm 6 BiCULP Algorithm
function BiCULP($X^{(l)}$, $X^{(u)}$, $Y$, $L$, $s$, $k$, $\lambda$)
     $G \leftarrow$ BiLEG($X^{(l)}$, $X^{(u)}$, $Y$, $L$, $s$, $k$)
     $\hat{Y} \leftarrow [\;]$
     for $i \in V_u$ do
          for pairs $(c-1, c) \in V_c$ do
               $\hat{Y}_{i,(c-(n+m))/2} \leftarrow 1$ if $\lambda_{i,c-1} > \lambda_{i,c}$ else $0$
          end for
     end for
     return $\hat{Y}$
end function
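Analogously, a sketch of the BiCULP prediction of Equation (3), using the $2c$ and $2c-1$ node indexing of Algorithm 5 and the earlier helpers, could look as follows (illustrative names, not the authors' code):

def biculp_predict(V_u, C, E, n, m, score=adamic_adar):
    """Equation (3): per label c, compare its presence node c1 = n+m+2c-1
    against its absence node c0 = n+m+2c."""
    adj = neighbors(E)
    Y_hat = {}
    for i in V_u:
        labels = set()
        for c in range(1, C + 1):
            c1, c0 = n + m + 2 * c - 1, n + m + 2 * c
            if score(adj, i, c1) > score(adj, i, c0):
                labels.add(c)
        Y_hat[i - n] = labels
    return Y_hat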

2.3 Toy Example

In this section we introduce a toy example to elaborate the concepts discussed in the MiCULP and BiCULP extensions. As can be seen in Table 1, there are 6 points, each having at least one of the three labels $a$, $b$ and $c$, and a seventh point (denoted by $i$ from here on) whose labels are unknown.

Table 1: A simple set of multi-label data (the last column is the unlabeled point $i$).
$x_1$: 1 0 1 1 0 0 0
$x_2$: 0 1 2 -2 3 -3 -1
$a$: 1 1 1 0 1 0 ?
$b$: 1 1 1 1 0 0 ?
$c$: 1 0 0 1 0 1 ?

The first step in predicting $i$'s labels with the proposed algorithms is the creation of the MiLEG and BiLEG structures, which are shown in Figure 2. The left graph, which depicts the MiLEG, has 3 class nodes, and the BiLEG on the right has 3 pairs of class nodes. In both graphs, each training point is connected to its corresponding label nodes, and the non-class nodes are connected to each other via the similarity links.

Given a common-neighbors link prediction strategy, we can find the scores of $(i,a)$, $(i,b)$ and $(i,c)$ in the MiLEG as $2$, $3$ and $3$ respectively. Choosing $t$ between these scores (e.g. $t = 2.5$), we find the labels of $i$ to be $\{b, c\}$, which matches our expectations for this node.

Using the same strategy in the BiLEG results in $\lambda_{i,a_0} = 2$, $\lambda_{i,a_1} = 2$, $\lambda_{i,b_0} = 1$, $\lambda_{i,b_1} = 3$, $\lambda_{i,c_0} = 1$, $\lambda_{i,c_1} = 3$. Comparing each pair of values yields the same result of $\{b, c\}$ as the labels for $i$.

Figure 2: Toy examples of the LEG extensions. A: the MiLEG representation; B: the BiLEG representation
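For illustration, the toy data of Table 1 can be pushed through the sketches above. The exact kNN graph, and hence the raw scores, depends on $k$ and on tie-breaking, so it need not coincide with the graph of Figure 2; with the hypothetical choices below, however, the sketch recovers the same label set $\{b, c\}$.

# Table 1 as arrays: six labeled points and the unlabeled point i.
X_l = np.array([[1, 0], [0, 1], [1, 2], [1, -2], [0, 3], [0, -3]], dtype=float)
X_u = np.array([[0, -1]], dtype=float)
Y = np.array([[1, 1, 1],   # point 1: labels a, b, c
              [1, 1, 0],   # point 2: labels a, b
              [1, 1, 0],   # point 3: labels a, b
              [0, 1, 1],   # point 4: labels b, c
              [1, 0, 0],   # point 5: label a
              [0, 0, 1]])  # point 6: label c

n, m, C = len(X_l), len(X_u), Y.shape[1]
V, E = build_mileg(X_l, X_u, Y, s=None, k=2)
V_u = set(range(n + 1, n + m + 1))
V_c = set(range(n + m + 1, n + m + C + 1))
print(miculp_predict(V_u, V_c, E, n, m, t=0.5, score=common_neighbors))
# With these choices this prints {1: {2, 3}}, i.e. labels b and c.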

3 Related Work

The MiCULP and BiCULP algorithms are extended versions of the CULP algorithm [7] for multi-label classification. CULP proved to be a highly accurate classifier, able to utilize the high-level features in data and fast in predicting the labels of unlabeled data. The inner logic of CULP makes it intuitively extendable to multi-label classification, which allows its strengths to carry over to this problem. In the rest of this section, some of the state-of-the-art algorithms in multi-label classification are reviewed to provide alternatives to compare against our work.

ACkEL [16] is an improvement on the RAkEL algorithm, which trains different classifiers on subsets of the data using the label powerset method. The ACkEL algorithm uses kernel support vector machines as the classifiers, Fisher's linear discriminant ratio to determine the separability of the classes, and joint entropy to describe the imbalance level of the data. These two quantities are linearly combined to evaluate the quality of a subset of the data. Furthermore, this algorithm uses an active learning scheme to select samples based on the discriminant ratio and the joint entropy. In the experiments, we used the version of ACkEL that works with overlapping subsets of the data, namely ACkELo, because it outperforms the other variant, ACkELd.

In [17], a framework is proposed for optimizing different multi-label learning loss functions by finding rule sets with gradient boosting, along with BOOMER, an instantiation of this framework. This algorithm is able to minimize both decomposable and non-decomposable loss functions and is another algorithm used in our experiments.

MLWSE [18] is a stacked ensemble algorithm that uses pairwise label correlations and label-specific meta-features to facilitate the classification process. In this algorithm, two classifiers with highly correlated labels share highly similar weights, as opposed to two classifiers with weakly correlated labels. In our work we chose to compare against the MLWSE-L2 variant of this algorithm, which uses $L_2$ sparsity regularization; this regularization is used to prevent the stacked ensemble from combining all the base classifiers.

In [19], an ordering scheme for classifier chains is proposed. This approach finds a single order for the chain based on the conditional entropy of the class labels and does not train any additional classifier to aid in this process. The authors showed that their method can outperform many ordering schemes in the literature.

4 Experimental Results

In order to evaluate the MiCULP and BiCULP algorithms on a practical level, in this section we apply them to some benchmark data sets: birds [20], emotions [21], enron [22], flags [23] and scene [24]. These data sets, along with some of their properties, are listed in Table 2.

Table 2: Data sets used for the experiments. $n$ is the number of samples, $f$ the number of features, $C$ the number of labels, and $|L|$ the total number of label combinations.
Dataset n f C |L|
Birds 645 260 19 133
Emotions 593 72 6 27
Enron 1702 1001 53 15806
Flags 194 19 7 54
Scene 2407 294 6 15

As mentioned previously, the algorithms used for comparing our results are ACkELo [16], MLWSE-L2 [18] and BOOMER [17]. As evaluation criteria, we used Hamming loss, example-based F1 score (F1 for short), micro-averaged F1 score (micro-F1) and macro-averaged F1 (macro-F1). The results for each of these metrics are presented in Tables 3, 4, 5 and 6.

Table 3: Hamming loss results (mean ± standard deviation; the number in parentheses is the method's rank on that data set, here and in Tables 4-6)
Dataset MiCULP BiCULP ACkELo MLWSE-L2 BOOMER
Birds 0.050±0.007 (5) 0.047±0.007 (4) 0.040±0.001 (1) 0.045±0.001 (3) 0.041±0.007 (2)
Emotions 0.188±0.029 (2) 0.175±0.019 (1) 0.199±0.008 (5) 0.193±0.007 (4) 0.189±0.020 (3)
Enron 0.051±0.002 (4) 0.048±0.003 (3) 0.057±0.000 (5) 0.046±0.000 (1) 0.047±0.002 (2)
Flags 0.288±0.026 (5) 0.264±0.028 (3) 0.268±0.012 (4) 0.257±0.014 (2) 0.240±0.040 (1)
Scene 0.070±0.006 (1) 0.073±0.008 (2) 0.079±0.001 (3) 0.085±0.003 (5) 0.081±0.006 (4)
Rank 3.4 2.6 3.6 3.0 2.4
Table 4: F1 results
Dataset MiCULP BiCULP ACkELo MLWSE-L2 BOOMER
Birds 0.547±0.045 (3) 0.608±0.040 (2) 0.149±0.006 (4) 0.140±0.009 (5) 0.619±0.058 (1)
Emotions 0.683±0.033 (2) 0.654±0.040 (3) 0.718±0.008 (1) 0.614±0.014 (4) 0.607±0.044 (5)
Enron 0.579±0.022 (1) 0.503±0.031 (3) 0.437±0.014 (5) 0.576±0.006 (2) 0.493±0.024 (4)
Flags 0.728±0.030 (2) 0.712±0.036 (5) 0.736±0.013 (1) 0.721±0.025 (3) 0.720±0.051 (4)
Scene 0.805±0.014 (1) 0.738±0.020 (3) 0.788±0.003 (2) 0.672±0.010 (4) 0.652±0.027 (5)
Rank 1.8 3.2 2.6 3.6 3.8
Table 5: Micro-F1 results
Dataset MiCULP BiCULP ACkELo MLWSE-L2 BOOMER
Birds 0.418±0.070 (3) 0.437±0.076 (1) 0.408±0.011 (4) 0.359±0.027 (5) 0.421±0.069 (2)
Emotions 0.704±0.025 (2) 0.698±0.037 (3) 0.715±0.005 (1) 0.658±0.013 (5) 0.667±0.040 (4)
Enron 0.576±0.020 (1) 0.521±0.029 (4) 0.443±0.017 (5) 0.566±0.004 (2) 0.537±0.020 (3)
Flags 0.740±0.031 (2) 0.731±0.038 (5) 0.733±0.012 (4) 0.737±0.017 (3) 0.751±0.046 (1)
Scene 0.794±0.017 (1) 0.774±0.027 (3) 0.777±0.002 (2) 0.733±0.009 (5) 0.739±0.023 (4)
Rank 1.8 3.2 3.2 4.0 2.8
Table 6: Macro-F1 results
Dataset MiCULP BiCULP ACkELo MLWSE-L2 BOOMER
Birds 0.351±0.073 (2) 0.352±0.075 (1) 0.283±0.013 (4) 0.133±0.010 (5) 0.331±0.084 (3)
Emotions 0.688±0.027 (2) 0.667±0.041 (3) 0.701±0.007 (1) 0.584±0.013 (5) 0.641±0.042 (4)
Enron 0.298±0.032 (3) 0.301±0.023 (2) 0.115±0.004 (5) 0.547±0.005 (1) 0.272±0.042 (4)
Flags 0.630±0.066 (4) 0.605±0.080 (5) 0.640±0.021 (3) 0.711±0.025 (1) 0.648±0.067 (2)
Scene 0.800±0.018 (1) 0.782±0.023 (3) 0.782±0.002 (2) 0.665±0.010 (5) 0.742±0.022 (4)
Rank 2.6 2.8 3.0 3.4 3.4

The parameters used in the MiCULP and BiCULP algorithms are $k$ ($1 \leq k \leq 45$), the link predictor $\lambda$, which is one of Common Neighbors, Resource Allocation, Adamic-Adar or Compatibility Score [7], and the vector similarity function, which can be cosine, Manhattan or Euclidean. For each data set, the parameters are tuned via 5-fold cross-validation. After finding the parameters, 30 runs of 10-fold cross-validation, amounting to 300 total runs, are performed.
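As an illustration of this tuning procedure, a minimal sketch of the grid-search loop is given below; evaluate_fold is a hypothetical callback that trains and scores MiCULP or BiCULP on one split, and none of this is the authors' experimental code.

from itertools import product
from sklearn.model_selection import KFold

def tune(X, Y, predictors, similarities, evaluate_fold, ks=range(1, 46)):
    """Grid-search k, the predictor lam and the similarity s by 5-fold CV;
    evaluate_fold(X_tr, Y_tr, X_te, Y_te, k, lam, s) must return a score."""
    best, best_score = None, -1.0
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for k, lam, s in product(ks, predictors, similarities):
        scores = [evaluate_fold(X[tr], Y[tr], X[te], Y[te], k, lam, s)
                  for tr, te in kf.split(X)]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best, best_score = (k, lam, s), mean
    return best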

For the competing methods, if a metric on a data set is reported in the corresponding work, that score is presented as is. In the other cases, the default parameters of each algorithm are used.

As can be seen from the rankings, MiCULP has the fourth-best Hamming loss, while in the F1, micro-F1 and macro-F1 metrics it outperforms all the other algorithms. MiCULP is also the best-performing method on the scene data set. Another thing that can be noticed from the results is that although BiCULP could not outperform MiCULP, it holds the second-best rank for Hamming loss and macro-F1.

5 Conclusion and Future Works

In this work we extended the CULP classifier to solve the problem of multi-label classification. Our method uses a graph representation of the data and the similarities between instances as high-level features to classify a multi-label instance effectively. As complex data needs complex features, the graph representation can capture these features intuitively and use them as an advantage to aid the classification. We proposed the MiCULP and BiCULP algorithms, which differ in the modeling of their graph representation, and our experiments demonstrated the effectiveness of these algorithms, especially MiCULP, in dealing with multi-label data.

We plan to extend the work done here in multiple ways. First, we can use multiple link predictors to classify the test nodes (similar to CULM, another extension of CULP [7]).

Another idea is to extend the BiCULP algorithm to solve the multi-dimensional classification problem, a variant of multi-label classification where a label can take more than two values. As BiCULP creates a node for each value, creating nodes for values beyond zero and one can easily be achieved.

Finally, by defining a scheme to connect the class nodes together, we can incorporate the label correlations of the data into the MiLEG or BiLEG data structures. This can lead to even higher accuracy and a model that captures the inner relations of the labels.

References

  • [1] K. P. Murphy, Machine learning: a probabilistic perspective. MIT press, 2012.
  • [2] F. Herrera, F. Charte, A. J. Rivera, and M. J. del Jesus, Multilabel Classification, pp. 17–31. Cham: Springer International Publishing, 2016.
  • [3] T. Gonçalves and P. Quaresma, “A preliminary approach to the multilabel classification problem of portuguese juridical documents,” in Portuguese Conference on Artificial Intelligence, pp. 435–444, Springer, 2003.
  • [4] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label scene classification,” Pattern recognition, vol. 37, no. 9, pp. 1757–1771, 2004.
  • [5] G. Tsoumakas and I. Vlahavas, “Random k-labelsets: An ensemble method for multilabel classification,” in European conference on machine learning, pp. 406–417, Springer, 2007.
  • [6] M. Belkin and P. Niyogi, “Laplacian eigenmaps for dimensionality reduction and data representation,” Neural computation, vol. 15, no. 6, pp. 1373–1396, 2003.
  • [7] S. A. Fadaee and M. A. Haeri, “Classification using link prediction,” Neurocomputing, vol. 359, pp. 395–407, 2019.
  • [8] T. H. Cupertino, M. G. Carneiro, Q. Zheng, J. Zhang, and L. Zhao, “A scheme for high level data classification using random walk and network measures,” Expert Systems with Applications, vol. 92, pp. 289–303, 2018.
  • [9] M. G. Carneiro and L. Zhao, “Organizational data classification based on the importance concept of complex networks,” IEEE Transactions on Neural Networks and Learning Systems, 2017.
  • [10] T. C. Silva and L. Zhao, “Network-based high level data classification,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 6, pp. 954–970, 2012.
  • [11] T. F. Covões and Z. Liang, “Low and high level classification using stacking,” in Neural Networks (IJCNN), 2017 International Joint Conference on, pp. 2525–2532, IEEE, 2017.
  • [12] M. G. Carneiro and L. Zhao, “High level classification totally based on complex networks,” in Computational Intelligence and 11th Brazilian Congress on Computational Intelligence (BRICS-CCI & CBIC), 2013 BRICS Congress on, pp. 507–514, IEEE, 2013.
  • [13] L. A. Adamic and E. Adar, “Friends and neighbors on the web,” Social networks, vol. 25, no. 3, pp. 211–230, 2003.
  • [14] D. Liben-Nowell and J. Kleinberg, “The link-prediction problem for social networks,” journal of the Association for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.
  • [15] T. Zhou, L. Lü, and Y.-C. Zhang, “Predicting missing links via local information,” The European Physical Journal B, vol. 71, no. 4, pp. 623–630, 2009.
  • [16] R. Wang, S. Kwong, X. Wang, and Y. Jia, “Active k-labelsets ensemble for multi-label classification,” Pattern Recognition, vol. 109, p. 107583.
  • [17] M. Rapp, E. L. Mencía, J. Fürnkranz, V.-L. Nguyen, and E. Hüllermeier, “Learning gradient boosted multi-label classification rules,” arXiv preprint arXiv:2006.13346, 2020.
  • [18] Y. Xia, K. Chen, and Y. Yang, “Multi-label classification with weighted classifier selection and stacked ensemble,” Information Sciences, 2020.
  • [19] X. Jun, Y. Lu, Z. Lei, and D. Guolun, “Conditional entropy based classifier chains for multi-label classification,” Neurocomputing, vol. 335, pp. 185–194, 2019.
  • [20] F. Briggs, Y. Huang, R. Raich, K. Eftaxias, Z. Lei, W. Cukierski, S. F. Hadley, A. Hadley, M. Betts, X. Z. Fern, J. Irvine, L. Neal, A. Thomas, G. Fodor, G. Tsoumakas, H. W. Ng, T. N. T. Nguyen, H. Huttunen, P. Ruusuvuori, T. Manninen, A. Diment, T. Virtanen, J. Marzat, J. Defretin, D. Callender, C. Hurlburt, K. Larrey, and M. Milakov, “The 9th annual mlsp competition: New methods for acoustic classification of multiple simultaneous bird species in a noisy environment,” in 2013 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–8, 2013.
  • [21] K. Trohidis, G. Tsoumakas, G. Kalliris, and I. P. Vlahavas, “Multi-label classification of music into emotions.,” in ISMIR, vol. 8, pp. 325–330, 2008.
  • [22] B. Klimt and Y. Yang, “The enron corpus: A new dataset for email classification research,” in Machine Learning: ECML 2004 (J.-F. Boulicaut, F. Esposito, F. Giannotti, and D. Pedreschi, eds.), (Berlin, Heidelberg), pp. 217–226, Springer Berlin Heidelberg, 2004.
  • [23] E. C. Goncalves, A. Plastino, and A. A. Freitas, “A genetic algorithm for optimizing the label ordering in multi-label classifier chains,” in 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, pp. 469–476, 2013.
  • [24] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown, “Learning multi-label scene classification,” Pattern recognition, vol. 37, no. 9, pp. 1757–1771, 2004.