Efficient decision tree training with new data structure for secure multi-party computation
December 17, 2021
Abstract
We propose a secure multi-party computation (MPC) protocol that constructs a secret-shared decision tree for a given secret-shared dataset. The previous MPC-based decision tree training protocol (Abspoel et al. 2021) requires $O(2^h m n \log n)$ comparisons, which is exponential in the tree height $h$, where $n$ and $m$ are the number of rows and the number of attributes in the dataset, respectively. The cause of the exponential number of comparisons in $h$ is that the decision tree training algorithm is based on the divide-and-conquer paradigm, where dummy rows are added after each split in order to hide the number of rows in the dataset. We resolve this issue via a secure data structure that enables us to compute an aggregate value for every group while hiding the grouping information. By using this data structure, we can train a decision tree without adding dummy rows while hiding the size of the intermediate data. We specifically describe a decision tree training protocol that requires only $O(h m n \log n)$ comparisons when the input attributes are continuous and the output attribute is binary. Note that the order is now linear in the tree height $h$. To demonstrate the practicality of our protocol, we implement it in an MPC framework based on a three-party secret sharing scheme. Our implementation results show that our protocol trains a decision tree with a height of 5 in 33 seconds for a dataset of 100,000 rows and 10 attributes.
1 Introduction
Secure multi-party computation (MPC) [Yao86] allows parties to jointly compute any function while keeping its inputs private. Its large computational overhead has long been a barrier to practical use. In recent years, however, efficient MPC protocols have been proposed even for machine learning methods such as neural network training [WGC19, RWT+18, MR18].
The decision tree is one of the classical machine learning methods. It is still widely used due to its computational simplicity and ease of interpretation. It is also an important component of other machine learning methods that have been successful in recent years, such as gradient boosting decision trees [Fri01] and random forests [Bre01].
Since the work of Lindell and Pinkas [LP00] in the early days of privacy-preserving data mining, there has been a lot of research on MPC protocols for decision tree training. In order to be used as a component of MPC protocols for other machine learning methods, it is desirable to keep all the information, from the input to the trained decision tree, private. However, only a few protocols [dHSCodA14, AEV21, ACC+21] with such a property have been proposed. This is mainly due to two kinds of computational difficulties in MPC.
The first difficulty is the computation on real numbers. Decision tree training requires the computation of evaluation functions. Although there are many types of evaluation functions, all commonly used ones involve division or logarithm. Therefore, naive MPC protocols for decision tree training involve computation on real numbers, which increases the computational cost. On the contrary, de Hoogh et al. [dHSCodA14] cleverly avoided computation on real numbers by replacing fractional number comparisons with integer comparisons, and proposed an efficient protocol for the case where the inputs are categorical values. Abspoel et al. [AEV21] presented an efficient protocol that can be applied to the case where the input contains numerical values. The number of candidates to which the evaluation functions are applied per attribute is the number $c$ of possible values of a categorical attribute when the input consists only of categorical values, whereas it can be as large as the number of possible threshold values when the input contains numerical values, where $n$ is the number of samples in the input. They used a sorting protocol to reduce the number of candidates per attribute to the $n-1$ splits between adjacent sorted values, and also extended the technique of de Hoogh et al. to the numerical case to avoid computation on real numbers. Adams et al. [ACC+21] dealt with the case where the input contains numerical values by a different approach: discretizing the numerical attributes of the input. Although the trained tree is slightly different from the one trained without discretization, this approach avoids the use of sorting, which is relatively computationally expensive, and allows the use of the efficient protocol of de Hoogh et al. [dHSCodA14].
The second difficulty is the protection of the intermediate data size. In decision tree training, the data is split recursively from the root node to the leaf nodes in a top-down fashion. As the tree height increases, the number of nodes increases exponentially. On the other hand, the size of the intermediate data processed at each node decreases exponentially on average, hence the overall computational cost is linear in the tree height. When this is implemented in MPC, the size of the intermediate data after each split has to be hidden, so the existing protocols [dHSCodA14, AEV21, ACC+21] process, at each node, a dataset of the same size as the original one, padded with dummy entries. Therefore, they cannot benefit from the size reduction caused by data splitting, and as a result, the overall computational cost is exponential in the tree height.
1.1 Our contribution
| Method | Number of operations |
|---|---|
| Trivial [AEV21] | $O(2^h m n^2)$ |
| Abspoel et al. [AEV21] | $O(2^h m n \log^2 n)$ |
| Abspoel et al. [AEV21] with efficient sort [HKI+12] | $O(2^h m n \log n)$ |
| Ours | $O(h m n \log n)$ |
We propose an MPC protocol for decision tree training whose computational cost is linear in the tree height, which is the first protocol to solve the second problem above. It trains a binary decision tree under the assumption that all input attributes are numerical and the output attribute is binary. As in the protocol by Abspoel et al. [AEV21], it does not reveal any information other than the size of the input and the upper bound on the tree height.
The computational cost of our protocol is $O(h m n \log n)$, assuming that the comparison and multiplication protocols are unit operations, where $m$ is the number of input attributes and $n$ is the number of samples in the dataset. This is an exponential improvement with respect to $h$ over the $O(2^h m n \log n)$ computational cost of the protocol by Abspoel et al. (Actually, Abspoel et al. [AEV21] claimed only a computational cost of $O(2^h m n \log^2 n)$, but their protocol can easily be implemented to run in $O(2^h m n \log n)$ by replacing the sorting protocol with an efficient one such as [HKI+12].) A comparison of the computational costs is shown in Table 1.
Our approach of exponential improvement in computational cost with respect to the tree height is general. For completeness, our protocol is instantiated with all input attributes being numeric, the output attribute being binary, and the evaluation function being the Gini index; however, it is easy to extend. In fact, the main protocol (Algorithm 5), which plays a central role in the exponential improvement of the computational cost, describes a process common to the major decision tree training algorithms CART [BFOS84], ID3 [Qui86], and C4.5 [Qui14].
Our protocol is built on top of a set of basic protocols, such as multiplication and comparison, provided by many recent MPC frameworks, so it can be used on top of various implementations. More specifically, we build our protocol on top of an MPC model called arithmetic black box (ABB), which consists of a set of basic operations described in Section 2.2.1.
As a byproduct of our decision tree training protocol, we also propose a secure data structure that can compute aggregate values, such as sums and maximums, within each group of grouped values while keeping the grouping information private. This data structure may also be useful in applications other than decision tree training.
To see the practicality of our decision tree training protocol, we implemented it on an MPC framework based on a 2-out-of-3 secret sharing scheme. Our protocol trained a decision tree of height 5 for a dataset of 100,000 samples with 10 input attributes in 33 seconds.
1.2 Overview of our techniques
In general, MPC protocols are incompatible with divide-and-conquer algorithms. In divide-and-conquer algorithms, the problem is divided into smaller subproblems and solved recursively, but MPC protocols also need to hide the sizes of the subproblems. A common way to hide the size of a problem is to add dummies: we hide the actual size of the data by adding dummies (or leaving samples that should have been removed) to the split data so that it appears to be of the same size as the original. The disadvantage of this method is that it is computationally expensive, since it loses the property that the data size becomes smaller after each split. For this reason, the previous study [AEV21] required a cost exponential in the height of the tree.
We use the property that the total number of samples over all nodes of the same height is invariant during decision tree training. We keep the data of all nodes of the same height together, and train them all at once without adding any dummies. This allows our protocol to process only $n$ samples in total at each height, while the previous study [AEV21] processes an exponentially larger number of samples, including dummies.
To implement this idea, we first define a data structure that looks like a private vector of length $n$ but is internally grouped. Specifically, we place the grouped elements in a private vector of length $n$ so that elements of the same group appear next to each other, and we additionally keep a private vector of length $n$ of flags marking the first element of each group. This allows us to detect the boundaries of groups internally by referring to the flags, while no grouping is visible from the outside.
In decision tree training, each group needs to be split when moving to the next height. We accomplish this within our data structure by stably sorting the elements using the binary branching result, which is computed for each element, as a key. Stability of the sort ensures that elements that are in the same group and have the same branching result will be placed sequentially after the sort. Since this split requires only one-bit-key sorting, it can be very efficient depending on the underlying MPC implementation.
We build the group-wise sum, maximum, and prefix sum computations on our data structure. We then use them to build a decision tree training algorithm similar to [AEV21] on our data structure.
2 Preliminaries
In this section, we introduce a typical decision tree training algorithm in the clear and secure multi-party computation.
Before that, we introduce some notation. Throughout this paper, the index of a vector starts at $1$. We refer to the $i$-th element of a vector $\boldsymbol{x}$ by $\boldsymbol{x}[i]$. That is, if $\boldsymbol{x}$ is a vector of length $n$, then $\boldsymbol{x} = (\boldsymbol{x}[1], \ldots, \boldsymbol{x}[n])$. In logical operations, $0$ represents false and $1$ represents true.
2.1 Decision tree training
Decision tree training is a method in machine learning. The goal is to obtain a model called a decision tree that predicts a value of an output attribute, given values of input attributes. There are several famous algorithms for decision tree training, such as CART [BFOS84], ID3 [Qui86], and C4.5 [Qui14]. The general framework of these algorithms is the same, and in fact they are all greedy algorithms based on the divide-and-conquer paradigm. In this section, we present a typical algorithm, for which we plan to construct a secure version, for training a two-class classification binary tree, where all input attributes are numerical.
2.1.1 Typical decision tree training algorithm
Let us start by defining notation. Consider a dataset with $m$ input attributes and one output attribute. Suppose there are $n$ samples, each sample being a pair $(\boldsymbol{x}, y)$ of an input tuple $\boldsymbol{x}$ and a class label $y$. Here, $\boldsymbol{x}$ is an $m$-tuple whose $j$-th element $x^{(j)}$ is a value of the $j$-th input attribute, and $y$ is a value of the output attribute. A decision tree consists of a binary tree and some additional information. Each internal node (non-leaf node) has a condition called a test, of the form $x^{(j)} < t$; it asks whether the $j$-th element of a given input tuple is less than a threshold $t$ or not. Each edge is assigned a possible outcome of its source node's test, that is, true or false. An edge whose assigned outcome is true (false) is called a true edge (false edge, respectively). A child node whose incoming edge is a true edge (false edge) is called a true-child node (false-child node, respectively). Each leaf node is assigned a class label called its leaf label. This information is used to predict a class label for a given input tuple as follows. Starting from the root node, we repeatedly evaluate the test of the internal node we reach and trace the outgoing edge that is assigned the same value as the test outcome. When we reach a leaf node, we output its leaf label as the predicted class label.
A typical decision tree training algorithm is shown in Algorithm 1. It trains a tree recursively from the root node to the leaf nodes in a top-down fashion. At each node, it checks whether the stopping criterion is satisfied on the given training dataset to determine the node type. If the stopping criterion is satisfied, the current node is set to be a leaf node; the most frequent class label in the dataset becomes the leaf label of the current node, and the algorithm outputs a tree whose root is this node. If the stopping criterion is not satisfied, the current node is set to be an internal node. In this case, we select a test of the form $x^{(j)} < t$ that gives the best data split with respect to a predetermined criterion, and split the training dataset $D$ into $D_{\mathrm{true}}$ and $D_{\mathrm{false}}$ according to this test, where $D_{\mathrm{true}}$ is the set of samples whose input tuples satisfy the test and $D_{\mathrm{false}} = D \setminus D_{\mathrm{true}}$. The algorithm then recursively trains two decision trees with $D_{\mathrm{true}}$ and $D_{\mathrm{false}}$ as training data, sets the roots of these trees as the true-child and false-child nodes of the current node, and outputs a tree whose root is the current node.
We use the commonly used stopping criterion: (1) the height of the node is $h$, or (2) the dataset cannot be split further (i.e., (i) all class labels are the same, or (ii) all input tuples are the same), where $h$ is an upper bound on the tree height, typically given as a hyperparameter.
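For concreteness, the following is a minimal cleartext sketch of this recursive procedure (plain Python, not the MPC protocol; the helper select_best_test, which returns the attribute index and threshold optimizing the chosen measure, is an assumed placeholder):

```python
def train_tree(D, height, max_height):
    """Cleartext sketch of Algorithm 1. D is a list of (x, y) pairs, where x is a
    tuple of numerical attribute values and y is a binary class label."""
    labels = [y for _, y in D]
    inputs = [x for x, _ in D]
    if (height == max_height                  # stopping criterion (1)
            or len(set(labels)) == 1          # (2)(i): all class labels equal
            or len(set(inputs)) == 1):        # (2)(ii): all input tuples equal
        return ("leaf", max(set(labels), key=labels.count))  # most frequent label
    j, t = select_best_test(D)                # assumed helper, e.g. minimizing Gini_t(D)
    D_true = [(x, y) for x, y in D if x[j] < t]
    D_false = [(x, y) for x, y in D if not x[j] < t]
    return ("node", j, t,
            train_tree(D_true, height + 1, max_height),
            train_tree(D_false, height + 1, max_height))
```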
2.1.2 Test selection measure
The size and shape of the decision tree depends on which tests are selected at the internal nodes. In general, it is desirable to make the tree as small as possible, but the problem of constructing a decision tree that minimizes the sum of the lengths of the paths from the root to each leaf is known to be NP-hard [HR76]. Therefore, we usually define a measure for goodness of local splitting and select a test that maximizes this measure.
Commonly used measures for goodness of split include the information gain used in ID3 [Qui86] and the Gini index used in CART [BFOS84]. We use the Gini index, which is also used in previous studies such as [dHSCodA14, AEV21] due to its ease of computation in MPC.
Two types of Gini indices are defined: one for a dataset and one for a dataset and a test. The Gini index for a dataset $D$, which we denote by $\mathrm{Gini}(D)$, is defined as follows:
$$\mathrm{Gini}(D) = 1 - \sum_{k} \left( \frac{|D^{(k)}|}{|D|} \right)^2,$$
where $D^{(k)}$ is the subset of $D$ whose class label is $k$. Intuitively, the smaller $\mathrm{Gini}(D)$ is, the purer $D$ is in terms of class labels.
The Gini index for a dataset $D$ and a test $t$, which we denote by $\mathrm{Gini}_t(D)$, is defined using $\mathrm{Gini}(\cdot)$ as follows:
$$\mathrm{Gini}_t(D) = \frac{|D_{\mathrm{true}}|}{|D|}\,\mathrm{Gini}(D_{\mathrm{true}}) + \frac{|D_{\mathrm{false}}|}{|D|}\,\mathrm{Gini}(D_{\mathrm{false}}),$$
where $D_{\mathrm{true}}$ and $D_{\mathrm{false}}$ are the subsets of $D$ on which the test $t$ is true and false, respectively. Intuitively, the smaller $\mathrm{Gini}_t(D)$ is, the purer each split dataset becomes (and hence the better the test $t$ is). Therefore, to find the best test for splitting a dataset $D$, we compute a test $t$ that minimizes $\mathrm{Gini}_t(D)$ [HKP11].
Abspoel et al. [AEV21] showed that minimization of $\mathrm{Gini}_t(D)$ is equivalent to maximization of $\widetilde{\mathrm{Gini}}_t(D)$ defined as
$$\widetilde{\mathrm{Gini}}_t(D) = \frac{|D_{\mathrm{true}}^{(0)}|^2 + |D_{\mathrm{true}}^{(1)}|^2}{|D_{\mathrm{true}}|} + \frac{|D_{\mathrm{false}}^{(0)}|^2 + |D_{\mathrm{false}}^{(1)}|^2}{|D_{\mathrm{false}}|}, \qquad (1)$$
where $D_{\mathrm{true}}^{(k)}$ and $D_{\mathrm{false}}^{(k)}$ are the subsets of $D_{\mathrm{true}}$ and $D_{\mathrm{false}}$, respectively, whose class label is $k \in \{0, 1\}$. We refer to it as the modified Gini index and use it as the measure in our protocol.
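For completeness, the equivalence can be checked directly from the definitions above, using that $|D| = |D_{\mathrm{true}}| + |D_{\mathrm{false}}|$ does not depend on the test:
$$
\begin{aligned}
\mathrm{Gini}_t(D)
  &= \frac{|D_{\mathrm{true}}|}{|D|}\,\mathrm{Gini}(D_{\mathrm{true}})
   + \frac{|D_{\mathrm{false}}|}{|D|}\,\mathrm{Gini}(D_{\mathrm{false}}) \\
  &= \frac{|D_{\mathrm{true}}| + |D_{\mathrm{false}}|}{|D|}
   - \frac{1}{|D|}\left(
       \frac{|D_{\mathrm{true}}^{(0)}|^2 + |D_{\mathrm{true}}^{(1)}|^2}{|D_{\mathrm{true}}|}
     + \frac{|D_{\mathrm{false}}^{(0)}|^2 + |D_{\mathrm{false}}^{(1)}|^2}{|D_{\mathrm{false}}|}
     \right) \\
  &= 1 - \frac{1}{|D|}\,\widetilde{\mathrm{Gini}}_t(D),
\end{aligned}
$$
so, for a fixed dataset $D$, minimizing $\mathrm{Gini}_t(D)$ over tests $t$ is the same as maximizing $\widetilde{\mathrm{Gini}}_t(D)$.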
2.2 Secure multi-party computation
We model secure multi-party computation (MPC) with an ideal functionality called an arithmetic black box (ABB). This ideal functionality allows a set of parties to store values, operate on the stored values, and retrieve the stored values. We build our protocol on top of an ABB. This allows our protocol to run on any MPC implementation that realizes the ABB, since the concrete ABB implementation is separated from the protocol construction.
2.2.1 Arithmetic black box
Fig. 1: The commands of the arithmetic black box.
• A command $\mathsf{Input}(x)$: Receive $x$ from a party and store it as $[\![x]\!]$.
• A command $\mathsf{Output}([\![x]\!])$: Send $x$ to every party, who stores it in the local variable $x$.
• A command $\mathsf{Add}([\![x]\!], [\![y]\!])$: Compute $z = x + y$ and store it as $[\![z]\!]$.
• A command $\mathsf{Mul}([\![x]\!], [\![y]\!])$: Compute $z = x y$ and store it as $[\![z]\!]$.
• A command $\mathsf{LT}([\![x]\!], [\![y]\!])$: If $x < y$ then set $z = 1$, otherwise set $z = 0$. Store it as $[\![z]\!]$.
• A command $\mathsf{EQ}([\![x]\!], [\![y]\!])$: If $x = y$ then set $z = 1$, otherwise set $z = 0$. Store it as $[\![z]\!]$.
We assume that the commands $\mathsf{Add}$, $\mathsf{Mul}$, $\mathsf{LT}$, and $\mathsf{EQ}$ are also defined in the same way when one of the inputs is a public value.
We assume a simple ABB over a ring $\mathbb{Z}_M$ for some integer $M$, as shown in Fig. 1. We denote a value referred to by a name $x$ stored in the ABB by $[\![x]\!]$. In the typical case where the ABB is realized by a secret-sharing based MPC, $[\![x]\!]$ means that $x$ is secret shared. We say a value is private if it is stored in the ABB.
We identify residue classes in $\mathbb{Z}_M$ with their representatives in $\{0, 1, \ldots, M-1\}$. We assume $M$ is sufficiently large that vector indices can be stored in $\mathbb{Z}_M$. We also assume that the number of parties is constant.
For notational simplicity, $\mathsf{Add}([\![x]\!], [\![y]\!])$, $\mathsf{Mul}([\![x]\!], [\![y]\!])$, $\mathsf{LT}([\![x]\!], [\![y]\!])$, and $\mathsf{EQ}([\![x]\!], [\![y]\!])$ are also written as $[\![x]\!] + [\![y]\!]$, $[\![x]\!] \times [\![y]\!]$, $[\![x]\!] < [\![y]\!]$, and $[\![x]\!] = [\![y]\!]$, respectively. Furthermore, we denote a vector of stored values $([\![x_1]\!], \ldots, [\![x_n]\!])$ by $[\![\boldsymbol{x}]\!]$.
2.2.2 Cost of MPC protocols
We define the cost of an MPC protocol as the number of invocations of ABB operations other than linear combinations of private values. That is, we assume that the parties can compute $[\![x]\!] + [\![y]\!]$, $[\![x]\!] - [\![y]\!]$, $c \cdot [\![x]\!]$, $[\![x]\!] + c$, and $[\![x]\!] - c$ for free, where $[\![x]\!]$ and $[\![y]\!]$ are private values and $c$ is a public value. This cost models the communication complexity of a typical MPC based on a linear secret sharing scheme, in which the parties can locally compute linear combinations of secret-shared values. We refer to ABB operations other than linear combinations of private values as non-free operations.
2.2.3 Known protocols
We show known protocols that we will use as building blocks for our protocols. For completeness, the protocols shown here are limited to those that can be built on top of the ABB. Some MPC implementations may provide the same functionality more efficiently, in which case we can use them instead of the protocols listed here to run our protocol more efficiently.
We start by defining some simple protocols.
• $[\![z]\!] \leftarrow \mathsf{OR}([\![x]\!], [\![y]\!])$ computes the logical disjunction of bits $x$ and $y$ as $[\![x]\!] + [\![y]\!] - [\![x]\!] \times [\![y]\!]$, using one non-free operation in one round.
• $[\![z]\!] \leftarrow \mathsf{Not}([\![x]\!])$ computes the negation of a bit $x$ as $1 - [\![x]\!]$, using no non-free operations.
• $[\![z]\!] \leftarrow \mathsf{IfElse}([\![c]\!], [\![x]\!], [\![y]\!])$ receives a bit $[\![c]\!]$ and two values $[\![x]\!]$ and $[\![y]\!]$, and computes $x$ if $c = 1$ and $y$ otherwise, as $[\![y]\!] + [\![c]\!] \times ([\![x]\!] - [\![y]\!])$, using one non-free operation in one round.
• $[\![z]\!] \leftarrow \mathsf{Max}([\![x]\!], [\![y]\!])$ computes the maximum of $x$ and $y$ as $\mathsf{IfElse}([\![x]\!] < [\![y]\!], [\![y]\!], [\![x]\!])$, using two non-free operations in two rounds.
We require an extended max protocol, which we call $\mathsf{VectMax}$. We let $[\![z]\!] \leftarrow \mathsf{VectMax}([\![\boldsymbol{x}]\!], [\![\boldsymbol{y}]\!])$ denote the operation that computes a private value $[\![z]\!]$ such that $[\![\boldsymbol{x}]\!]$ and $[\![\boldsymbol{y}]\!]$ are private vectors of the same length $n$, $i$ is an index at which $\boldsymbol{x}[i]$ is maximal, and $z = \boldsymbol{y}[i]$. We use the construction by Abspoel et al. [AEV21], which uses $O(n)$ non-free operations in $O(\log n)$ rounds.
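For reference, the cleartext semantics of this functionality is simply the following (a Python sketch with 0-based indices; the MPC construction of [AEV21] is of course different):

```python
def vect_max(x, y):
    """Return the element of y at a position where x is maximal,
    i.e. the cleartext semantics of VectMax(x, y)."""
    i = max(range(len(x)), key=lambda j: x[j])
    return y[i]
```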
We require three permutation-related protocols: one that generates a random private permutation, and the protocols $\mathsf{Apply}$ and $\mathsf{Unapply}$. Let $S_n$ be the symmetric group on $\{1, \ldots, n\}$; that is, $S_n$ is the set of all bijective functions from $\{1, \ldots, n\}$ to itself. A permutation is an element of $S_n$. Applying a permutation $\pi$ to a vector $\boldsymbol{x}$ of length $n$ is the operation of rearranging $\boldsymbol{x}$ into a vector $\boldsymbol{y}$ satisfying $\boldsymbol{y}[\pi(i)] = \boldsymbol{x}[i]$ for all $i$; we denote this operation by $\pi(\boldsymbol{x})$. Now we define the three protocols. The first, given an integer $n$, computes a private permutation $[\![\pi]\!]$ such that $\pi$ is a uniformly randomly chosen element of $S_n$. The protocol $\mathsf{Apply}$, given a private permutation $[\![\pi]\!]$ and a private vector $[\![\boldsymbol{x}]\!]$ of length $n$, computes $[\![\pi(\boldsymbol{x})]\!]$. The protocol $\mathsf{Unapply}$, given $[\![\pi]\!]$ and $[\![\boldsymbol{x}]\!]$, computes $[\![\pi^{-1}(\boldsymbol{x})]\!]$. We use the construction by Falk and Ostrovsky [FO21], in which a private permutation is represented as a set of private control bits for the Waksman permutation network [Wak68]. All these protocols use $O(n \log n)$ non-free operations in $O(\log n)$ rounds. Note that we do not use the random-permutation protocol directly in our protocols, but it is required as a component of the construction of the $\mathsf{SortPerm}$ protocol shown below.
We also require a protocol to compute a permutation that stably sorts given keys. We let $\mathsf{SortPerm}$ denote the operation that, given one or more private key vectors of length $n$, computes a private permutation $[\![\pi]\!]$ such that applying $\pi$ to the key vectors sorts them lexicographically and stably. We use the construction by Laud and Willemson [LW14]. The protocol uses non-free operations in rounds. Note that we can construct the composition of private and public permutations that is needed for this construction, since our private permutations are sets of control bits for the Waksman permutation network.
To simplify the description, we introduce a small subprotocol for sorting private vectors, which we denote by $\mathsf{Sort}$. Given private key vectors and private data vectors, it denotes the following procedure:
1. compute a private permutation that stably sorts the key vectors lexicographically, using the $\mathsf{SortPerm}$ protocol;
2. apply this permutation to each of the data vectors using the $\mathsf{Apply}$ protocol.
We sometimes use similar notation when the same operation is applied to multiple inputs, meaning that the operation is executed in parallel for each of them. If vectors are given to a protocol defined for scalar values, the protocol is applied on an element-by-element basis, that is, executed in parallel for each position of the vectors. If some of the inputs are scalar, the same scalar value is used in all of the parallel executions.
3 Our secure group-wise aggregation protocols
In this section, we propose group-wise aggregation protocols that compute aggregate values (sum, prefix sum, and maximum) for each group without revealing the grouping information of the input grouped values. These are executed on grouped values stored in our data structure. These protocols and the data structure play a central role in the construction of our decision tree training protocol proposed in Section 4.
3.1 Our data structure for privately grouped values
We propose a data structure that stores grouped values without revealing any information about the grouping. We store $n$ values, divided into several groups, in a private vector of length $n$, called the internally grouped vector. Here, elements of the same group are stored as consecutive elements in the vector; that is, for any indices $i \le j \le k$, if the $i$-th and $k$-th elements are in the same group, then the $j$-th and $k$-th elements are also in the same group. Along with such a vector, we maintain a private bit vector of length $n$, called the group flag vector, which indicates the boundaries between groups. Namely, the $i$-th flag is set to $1$ if the $i$-th element of the internally grouped vector is the first element of a group, and to $0$ otherwise. By definition, the first flag is always true.
We show an example. Suppose that six values are stored in an internally grouped vector as $(3, 1, 2, 2, 3, 2)$ and the corresponding group flag vector is $(1, 0, 1, 1, 0, 0)$. Then, this means that the six values are divided into three groups, $(3, 1)$, $(2)$, and $(2, 3, 2)$.
For the sake of simplicity, we introduce some notation. With respect to the grouping represented by a group flag vector, let $\mathrm{first}_i$ ($\mathrm{last}_i$) be the index of the first (last, respectively) element of the group that contains the $i$-th element. For example, for the group flag vector $(1, 0, 1, 1, 0, 0)$ above, $\mathrm{first}_2 = 1$, $\mathrm{last}_2 = 2$, $\mathrm{first}_5 = 4$, and $\mathrm{last}_5 = 6$.
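The following cleartext snippet illustrates the data structure on the example above (plain Python with 0-based indices, whereas the text uses 1-based vectors; the helper name is illustrative):

```python
x = [3, 1, 2, 2, 3, 2]   # internally grouped vector
g = [1, 0, 1, 1, 0, 0]   # group flag vector: groups (3, 1), (2), (2, 3, 2)

def group_bounds(g, i):
    """Return the indices of the first and last element of the group containing i."""
    first = max(j for j in range(i + 1) if g[j] == 1)
    nxt = [j for j in range(i + 1, len(g)) if g[j] == 1]
    last = nxt[0] - 1 if nxt else len(g) - 1
    return first, last

assert group_bounds(g, 4) == (3, 5)   # position 4 lies in the last group (2, 3, 2)
```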
3.2 Our protocol for group-wise sum
| Input: group flag | Input: value | Output: sum | Output: prefix sum | Output: max |
|---|---|---|---|---|
| 1 | 3 | 4 | 3 | 3 |
| 0 | 1 | 4 | 4 | 3 |
| 1 | 2 | 2 | 2 | 2 |
| 1 | 2 | 7 | 2 | 3 |
| 0 | 3 | 7 | 5 | 3 |
| 0 | 2 | 7 | 7 | 3 |
The group-wise sum protocol privately computes the sum of each group in our data structure. It receives a private group flag vector of length $n$ and a private internally grouped vector of length $n$, and outputs a private vector of length $n$ whose $i$-th element is the sum of the elements of the group containing the $i$-th element. Note that the same value is computed for all elements of the same group. An example is shown in Table 2: columns 1 and 2 are the inputs, and column 3 is the output.
Before presenting our protocol, let us define some operations related to the computation of prefix sums. Given a vector $\boldsymbol{x}$ of length $n$, $\mathsf{PrefixSum}(\boldsymbol{x})$ computes a vector $\boldsymbol{y}$ of length $n$ such that $\boldsymbol{y}[i] = \sum_{j \le i} \boldsymbol{x}[j]$ for all $i$. We also define an inverse operation $\mathsf{InvPrefixSum}$, which computes the vector $\boldsymbol{x}$ such that $\mathsf{PrefixSum}(\boldsymbol{x}) = \boldsymbol{y}$; this is easily computed as $\boldsymbol{x}[1] = \boldsymbol{y}[1]$ and $\boldsymbol{x}[i] = \boldsymbol{y}[i] - \boldsymbol{y}[i-1]$ for all $i > 1$. We further define reverse-ordered versions of these operations: $\mathsf{SuffixSum}(\boldsymbol{x})$ computes $\boldsymbol{y}$ with $\boldsymbol{y}[i] = \sum_{j \ge i} \boldsymbol{x}[j]$, and $\mathsf{InvSuffixSum}(\boldsymbol{y})$ computes the vector $\boldsymbol{x}$ with $\mathsf{SuffixSum}(\boldsymbol{x}) = \boldsymbol{y}$, which is computed as $\boldsymbol{x}[n] = \boldsymbol{y}[n]$ and $\boldsymbol{x}[i] = \boldsymbol{y}[i] - \boldsymbol{y}[i+1]$ for all $i < n$. Note that all of these operations are linear and therefore free in our cost model.
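In cleartext these four operations are simply the following (Python, 0-based indices; these helpers are reused in the sketches throughout this section):

```python
def prefix_sum(x):
    """y[i] = x[0] + ... + x[i]."""
    out, s = [], 0
    for v in x:
        s += v
        out.append(s)
    return out

def inv_prefix_sum(y):
    """Inverse of prefix_sum: x[0] = y[0], x[i] = y[i] - y[i-1]."""
    return [y[i] - (y[i - 1] if i > 0 else 0) for i in range(len(y))]

def suffix_sum(x):
    """Reverse-ordered prefix sum: y[i] = x[i] + ... + x[n-1]."""
    return prefix_sum(x[::-1])[::-1]

def inv_suffix_sum(y):
    """Inverse of suffix_sum: x[n-1] = y[n-1], x[i] = y[i] - y[i+1]."""
    return [y[i] - (y[i + 1] if i + 1 < len(y) else 0) for i in range(len(y))]
```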
The protocol for group-wise sum is shown in Algorithm 2. In the first part of the protocol, we compute the sum of each group. This follows from the fact that the values of the first elements of the groups in $\mathsf{SuffixSum}$ of the input are exactly the reverse-ordered prefix sums of the group sums, so the group sums can be recovered from them by $\mathsf{InvSuffixSum}$. Next, we copy the sum of each group to every element of that group. To do this, we apply the technique used by Laud in his parallel reading protocol [Lau15]: when a prefix sum is computed, an element with value zero simply takes over the running value of the preceding element. Specifically, the group sums are placed back in their original order so that the first element of each group holds the sum of that group and all other elements are zero, and then the prefix sum of the entire vector is computed. However, this would copy the prefix sums of the group sums instead of the group sums themselves; therefore, $\mathsf{InvPrefixSum}$ is applied to the group sums beforehand.
Note that this protocol is also useful for copying a particular element of each group to all elements of the group: we clear all but the source elements to zero and then apply this protocol. This technique will be used in the following two protocols (Algorithms 3 and 4).
The protocol uses non-free operations in rounds.
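The following cleartext sketch mirrors the idea of Algorithm 2, using the helpers defined above; the oblivious steps of the protocol (gathering and redistributing the first elements of the groups without revealing their positions) are replaced here by direct index access:

```python
def group_sum(g, x):
    """Cleartext sketch of the group-wise sum: every output position receives
    the sum of the group that contains it."""
    n = len(x)
    s = suffix_sum(x)                              # reverse-ordered prefix sum of x
    firsts = [i for i in range(n) if g[i] == 1]    # first index of each group
    # The values s[first] are the reverse-ordered prefix sums of the group sums,
    # so the group sums are recovered by taking differences (InvSuffixSum).
    sums = [s[f] - (s[firsts[j + 1]] if j + 1 < len(firsts) else 0)
            for j, f in enumerate(firsts)]
    # Place the inverse prefix sum of the group sums at the group heads, zero
    # elsewhere, so that one global prefix sum broadcasts each group's sum to
    # every element of that group (Laud's copying trick).
    u = [0] * n
    for j, f in enumerate(firsts):
        u[f] = sums[j] - (sums[j - 1] if j > 0 else 0)
    return prefix_sum(u)

# group_sum([1, 0, 1, 1, 0, 0], [3, 1, 2, 2, 3, 2]) == [4, 4, 2, 7, 7, 7]   (Table 2)
```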
3.3 Our protocol for group-wise prefix sum
The group-wise prefix sum protocol privately computes the prefix sums within each group in our data structure. It receives a private group flag vector of length $n$ and a private internally grouped vector of values to be summed, and outputs a private vector of prefix sums for each group; that is, the $i$-th output element is the sum of the elements from the first element of the group containing the $i$-th element up to the $i$-th element. An example of input/output is shown in Table 2: columns 1 and 2 are the inputs, and column 4 is the output.
The protocol is shown in Algorithm 3. We first compute the prefix sum of the whole input vector. This looks almost done, but each value exceeds the desired one by the partial sum from the first element of the input up to the last element of the preceding group. Therefore, we subtract these partial sums to obtain the desired output. The element preceding the first element of a group in the global prefix sum is exactly the partial sum from the first element of the input up to the last element of the preceding group. Using this property, we construct a vector that contains these values at the first element of each group and zeros elsewhere. We then copy the first element of each group to the other elements of the group by applying the group-wise sum protocol, as described in Section 3.2. Finally, we subtract this vector from the global prefix sum to obtain the prefix sum for each group.
The protocol uses non-free operations in rounds.
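In cleartext, the idea of Algorithm 3 looks as follows (the broadcast of the per-group offsets, written here as a simple scan, is realized in the protocol with the group-wise copy technique of Section 3.2):

```python
def group_prefix_sum(g, x):
    """Cleartext sketch of the group-wise prefix sum: the global prefix sum
    minus, for each element, the total of everything before its group."""
    n = len(x)
    p = prefix_sum(x)
    offset = [0] * n
    for i in range(n):
        if g[i] == 1:
            offset[i] = p[i] - x[i]     # total of all elements before this group
        else:
            offset[i] = offset[i - 1]   # copied to the rest of the group
    return [p[i] - offset[i] for i in range(n)]

# group_prefix_sum([1, 0, 1, 1, 0, 0], [3, 1, 2, 2, 3, 2]) == [3, 4, 2, 2, 5, 7]
```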
3.4 Our protocol for group-wise max
The group-wise max protocol privately computes the maximum value of each group in our data structure. It receives a private group flag vector of length $n$ and a private internally grouped vector of length $n$, and outputs a private vector of the maximum values for each group; that is, the $i$-th output element is the maximum of the elements of the group containing the $i$-th element. An example is shown in Table 2: columns 1 and 2 are the inputs, and column 5 is the output.
The protocol is shown in Algorithm 4. First, we compute a vector whose $i$-th element is the maximum of the group elements up to the $i$-th element, i.e., a group-wise running maximum. The underlying idea is a doubling technique: suppose that, for each $i$, we know the maximum over a window of some width ending at $i$; then, by taking the maximum of this value and the value stored one window-width to the left, we double the width of the window. Starting from windows of width one, any group is covered after a logarithmic number of rounds. Since we want the maximum within the group rather than within a fixed window, we additionally keep, for each element, a flag indicating whether the first element of its group is already contained in the window; if the flag is set, the value is not updated further. Then, as in Algorithm 3, we copy the last element of each group, which now holds the group maximum, to the other elements of the group; here, an element is the last of its group if and only if it is the last element of the vector or the next group flag is $1$.
The protocol uses non-free operations in rounds, since each iteration in Algorithm 4 requires non-free operations in rounds.
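A cleartext sketch of the doubling idea of Algorithm 4 follows; the final broadcast of each group's last element is written here as a backward scan, whereas the protocol uses the copy technique of Section 3.2:

```python
def group_max(g, x):
    """Cleartext sketch of the group-wise max."""
    n = len(x)
    m = list(x)        # m[i]: max over the current window of the group ending at i
    f = list(g)        # f[i] = 1 once the window reaches the group's first element
    step = 1
    while step < n:    # O(log n) doubling rounds
        new_m, new_f = list(m), list(f)
        for i in range(n):
            if f[i] == 0:                        # group head not reached yet
                new_m[i] = max(m[i], m[i - step])
                new_f[i] = f[i - step]
            # if f[i] == 1, m[i] is already the running max of the whole group
        m, f = new_m, new_f
        step *= 2
    out = [0] * n
    for i in reversed(range(n)):                 # broadcast each group's last element
        is_last = (i == n - 1) or (g[i + 1] == 1)
        out[i] = m[i] if is_last else out[i + 1]
    return out

# group_max([1, 0, 1, 1, 0, 0], [3, 1, 2, 2, 3, 2]) == [3, 3, 2, 3, 3, 3]
```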
4 Our efficient decision tree training protocol
In this section, we present our decision tree training protocol. Given a private training dataset, it outputs the trained decision tree in a private form. Since the output decision tree is normalized for efficiency, we first describe the normalization in Section 4.1. Then, in Section 4.2, we explain how our protocol trains the tree in a layer-by-layer manner. This part contains the main idea that reduces the $2^h$ factor in the computational cost to $h$. The details of the batch test selection are deferred to Section 4.3. Note that the group-wise operations described in Section 3 are used throughout the protocols presented in this section.
4.1 Decision tree normalization for efficiency
Our training protocol outputs an equivalent normalized decision tree instead of the one that should have been obtained when training in the clear. Equivalent in this case means that the output for any given input is the same. Roughly speaking, our normalization aligns the heights of all leaf nodes to the upper bound on the tree height by inserting internal nodes that forward any sample to the false child node. Although this increases the number of nodes in the tree, in MPC, it reduces the data size and computational cost (though by just a constant factor), and simplifies the protocol.
Before describing the details of our normalization, let us recall the decision tree we originally wanted to compute, which is the output of Algorithm 1. It is a binary tree of height at most $h$, and all tests of its internal nodes are of the form $x^{(j)} < t$, where $j$ is an attribute number and $t$ is a threshold.
Our normalization consists of two modifications. The first modification is to change each test into an equivalent test that can be computed without division. In the original algorithm, computing a threshold involves a division, which is a costly operation in MPC; we avoid this by using an equivalent division-free test instead.
Our second modification is to align the heights of all leaf nodes without changing the tree's output. The modification is simple. Wherever a leaf node of height less than $h$ would appear, we insert an internal node that does not actually split: its test compares an attribute value with a sufficiently small public constant, so it always returns false, and its false edge leads to the original leaf, pushed one level down. Any input tuple that reaches this node passes through the false branch and reaches the original leaf, so the predicted label does not change. In the normalized tree, all nodes of height less than $h$ are internal nodes, and all nodes of height $h$ are leaf nodes. This makes our protocol simple and efficient.
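The second modification can be pictured as the following cleartext post-processing of a tree produced by Algorithm 1, using the tuple encoding from the sketch in Section 2.1.1 (our protocol produces the normalized tree directly; the choice of attribute 0 for the dummy test is illustrative):

```python
def normalize(tree, height, max_height, c_small):
    """Pad every leaf to depth max_height with dummy internal nodes whose test
    'x[0] < c_small' always fails (c_small is below any attribute value), so
    every sample follows the false branch down to the original leaf."""
    if tree[0] == "leaf":
        if height == max_height:
            return tree
        padded_leaf = normalize(tree, height + 1, max_height, c_small)
        return ("node", 0, c_small, None, padded_leaf)   # true branch never taken
    _, j, t, true_child, false_child = tree
    return ("node", j, t,
            normalize(true_child, height + 1, max_height, c_small),
            normalize(false_child, height + 1, max_height, c_small))
```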
4.2 Our layer-by-layer training protocol
This section describes the main part of our decision tree training protocol. We construct a decision tree by training all nodes of the same height in a batch, layer by layer, while keeping the input and output secret. Training samples assigned to different nodes of the same layer are processed as internally separate groups using the protocols proposed in Section 3. This improves the $2^h$ factor of the communication complexity in [AEV21] to $h$.
4.2.1 Encoding of inputs and outputs
Our decision tree training protocol receives a private training dataset and a public upper bound $h$ on the height of the tree, and outputs a private decision tree of height $h$. The training dataset consists of $n$ samples, each of which consists of an input tuple and a binary value called a class label. Each input tuple consists of $m$ numerical input attribute values. Our protocol receives the dataset as $m$ private vectors of length $n$, one per input attribute, and one private vector of length $n$ containing the class labels. That is, the $i$-th input tuple of the training dataset and its associated class label correspond to the $i$-th elements of these vectors.
The output tree is a normalized binary tree as described in Section 4.1. It is stored in private vectors. Since all nodes of height less than $h$ are internal nodes, the node number, attribute number, and threshold of each such node are stored in three vectors, one entry per node. Since all nodes of height $h$ are leaf nodes, the node number and leaf label of each leaf are stored in two vectors. The length of each vector for the $k$-th layer is $\min(2^k, n)$, the maximum possible number of nodes in that layer. If the actual number of nodes is smaller than the length of the vector, the remaining entries are filled with a dummy value. The vectors of each layer are collectively called the layer information of that layer.
In order to represent the edges between nodes, each node is assigned an integer node number. The only node of height $0$ is the root, and its node number is $0$. For each child node (of height $k+1$) of a node of height $k$ with node number $v$, we assign node number $2v$ to the false child (if any) and node number $2v + 1$ to the true child (if any). With this numbering scheme, the node numbers in the $k$-th layer are distinct values from $\{0, 1, \ldots, 2^k - 1\}$.
4.2.2 The main protocol of our decision tree training
The main protocol of our decision tree training is shown in Algorithm 5. It trains the decision tree layer by layer, from the $0$-th layer down to the $h$-th layer. Samples and associated values are stored in our private grouping data structure as internally grouped vectors. Throughout the training, a group flag vector represents the grouping of the samples into the nodes of the current layer, and internally grouped vectors grouped by it store the node numbers, the input attribute values, and the class labels. At the $0$-th layer, all samples are initialized to be assigned to the root node, whose node number is $0$. Then, each layer is trained iteratively.
At each iteration, we first train the nodes of the current layer and compute the test result of each sample. This is done by the internal-node training protocol, which we will describe in Section 4.2.3. The test result of each sample is a bit, where $0$ and $1$ denote false and true, respectively.
The node numbers and group flags for the next layer are computed next. The vectors of the next layer are then obtained by stably sorting the node numbers, the group flags, the input attribute values, and the class labels, using the test results as the sort key. Thanks to the stability of the sort, both the correspondence between the values of each sample and the contiguity of elements of the same group are maintained.
Let us verify the correctness of the node numbers and group flags for the next layer. Let $u$ be a node at the $k$-th layer with node number $v$. The node number of a child of $u$ is $2v$ for a false child and $2v + 1$ for a true child. Thus, twice the node number plus the test result, computed for each sample, is that sample's node number in the next layer. As for the group flags, since the splitting of the groups is stable, the first element with test result $0$ and the first element with test result $1$ in each group become the first elements of the corresponding groups after the split. These positions can be detected by a subprotocol that finds the first $1$ of each group, applied to the test results and to their negations. We can find the first $1$ of each group by detecting the elements whose value is $1$ and whose group-wise prefix sum is also $1$, as shown in Algorithm 6. The protocol uses non-free operations in rounds.
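The layer transition can be sketched in cleartext as follows (Python, 0-based indices; we assume the root-0 numbering above and that the new group flags are computed before the sort and carried through it; group_prefix_sum is from the sketch in Section 3.3):

```python
def group_first_one(g, b):
    """Flag the first element with b == 1 in each group: b[i] == 1 and the
    group-wise prefix sum of b at position i equals 1 (cf. Algorithm 6)."""
    ps = group_prefix_sum(g, b)
    return [1 if b[i] == 1 and ps[i] == 1 else 0 for i in range(len(b))]

def next_layer(g, u, b, payload):
    """One layer transition of Algorithm 5: g group flags, u node numbers,
    b test results, payload further per-sample vectors (attributes, labels)."""
    n = len(g)
    u_next = [2 * u[i] + b[i] for i in range(n)]            # child node numbers
    first1 = group_first_one(g, b)                          # first true sample per group
    first0 = group_first_one(g, [1 - v for v in b])         # first false sample per group
    g_next = [first0[i] | first1[i] for i in range(n)]      # new group boundaries
    order = sorted(range(n), key=lambda i: b[i])            # stable one-bit-key sort
    pick = lambda v: [v[i] for i in order]
    return pick(g_next), pick(u_next), [pick(v) for v in payload]
```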
In the last step of Algorithm 5, we train the leaf nodes in a batch by invoking the leaf-node training protocol, which we will describe in Section 4.2.4, and obtain the output vectors for height $h$.
The decision tree training protocol uses non-free operations in rounds, since the protocol uses non-free operations in , and the protocol uses non-free operations in rounds, as we will show in the following sections.
4.2.3 Batch training for internal nodes
This section describes the protocol for training the internal nodes of a layer in a batch, which we have been putting off. It receives the privately grouped dataset of the $k$-th layer, computes the best test for each node, and outputs the test results and the layer information of the $k$-th layer.
In Algorithm 7, we first compute the best test for each group using the global test selection protocol, which will be shown in Section 4.3.1. We then determine whether all class labels are the same within each group, which is part of the stopping criterion described in Section 2.1.1. This is computed by the protocol shown in Algorithm 8, which counts, for each group, the number of elements, the number of $0$ labels, and the number of $1$ labels, and decides from these counts whether all labels in the group are equal. It uses non-free operations in rounds. Next, for each group in which all labels are equal, we replace the selected test with the always-false dummy test of Section 4.1; specifically, the attribute number and the threshold of such a group are overwritten with the corresponding dummy values. The test results of all samples are then computed from the selected tests by the protocol of Algorithm 9, and finally the layer information is formatted by the protocol of Algorithm 10.
The protocol is shown in Algorithm 9. It computes the private results of applying the selected tests, given by the attribute numbers and thresholds, to the input tuples. For each sample and each input attribute, it computes a flag indicating whether that attribute is the selected one by an equality test, and uses the flags to select the value of the chosen attribute. It then compares the selected value with the threshold to obtain the test result of each sample. It uses non-free operations in rounds.
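In cleartext, the test application of Algorithm 9 amounts to the following (X is the list of the $m$ attribute vectors; attr and thr hold, for every sample, the attribute number and threshold selected for its node; attribute numbers are 0-based here):

```python
def apply_tests(attr, thr, X):
    """Select each sample's tested attribute value via equality flags,
    then compare it with the threshold."""
    n = len(thr)
    results = []
    for i in range(n):
        value = sum(int(attr[i] == j) * X[j][i] for j in range(len(X)))
        results.append(1 if value < thr[i] else 0)
    return results
```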
The protocol is shown in Algorithm 10. It removes redundant values from the given vectors. Since the node numbers, attribute numbers, thresholds, and leaf labels are all identical within a group, it is sufficient to keep only one element per group. Therefore, we clear all but the first element of each group with the dummy value and move the first elements to the front of the vector. Since the number of nodes in the $k$-th layer is at most $\min(2^k, n)$, we delete the trailing elements so that each vector has this length. It uses non-free operations in rounds.
The protocol uses non-free operations in rounds, since the protocol uses non-free operations in rounds, as we will show in Section 4.3.1.
4.2.4 Batch training for leaf nodes
This section describes the protocol for training the leaf nodes in a batch, which we have been putting off. It receives the privately grouped dataset of the $h$-th layer, computes the leaf label for each node, and outputs the layer information of the $h$-th layer. The protocol is shown in Algorithm 11. We compute the most frequent value of the output attribute in each group as the leaf label, which is the typical way to define leaf labels. The layer information (node numbers and leaf labels) is then formatted using the same formatting protocol as in the previous section (Algorithm 10).
The protocol uses non-free operations in rounds.
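With binary class labels, the most frequent label of each group can be obtained from group-wise sums alone, for example as in the following cleartext sketch (one natural realization; ties are broken towards label 0 here, and the protocol's exact tie-breaking is not specified above). It reuses group_sum from Section 3.2:

```python
def leaf_labels(g, y):
    """Leaf label of each group: 1 iff more than half of the group's labels are 1."""
    ones = group_sum(g, y)
    size = group_sum(g, [1] * len(y))
    return [1 if 2 * ones[i] > size[i] else 0 for i in range(len(y))]
```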
4.3 Our batch test selection protocol
In this section, we explain how to perform batch node-wise test selection on a dataset that is grouped by nodes and stored in our private grouping data structure. Like the standard test selection method in the clear, our batch test selection consists of three levels of components: one that selects the best test across all attributes, called the global test selection protocol; one that selects the best test for splitting by a specific attribute, called the attribute-wise test selection protocol; and one that computes the measure, called the modified Gini index protocol. We introduce them in order. Thanks to our group-wise operations, all of them are almost straightforward to construct.
4.3.1 Global test selection
The global test selection protocol computes the best test across all attributes for each node in a batch. The algorithm is straightforward: it calls the attribute-wise test selection protocol to compute the best test for each attribute and then selects the best test among them. Since the attribute-wise test selection protocol assumes that the data is already sorted within each group for the given attribute, this protocol is responsible for sorting within the groups before calling attribute-wise test selection.
The protocol is shown in Algorithm 12. It receives the training data (input attribute values and class labels) privately grouped by nodes, and outputs the information (attribute number and threshold) of the best test for each group. For each input attribute, it sorts the attribute values and class labels within each group and selects the best test of each group for splitting on that attribute. It then selects, for each element, the best test among all attributes. Since the output of the attribute-wise test selection protocol is identical for all elements of a group, it is sufficient to do this for each element independently.
The protocol is almost identical to the algorithm in the clear and to the protocol in [AEV21]. The difference is that we need to sort within each group. Recalling that the group flag vector is a bit vector in which only the first element of each group is $1$, we can see that its prefix sum assigns a distinct, ascending value to each group. Thus, we can sort within each group by using this prefix sum as the primary key and the attribute values as the secondary key in lexicographic order.
The protocol uses non-free operations in rounds, since protocol uses non-free operations in rounds, as we will show in the next section.
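In cleartext, this within-group sort is simply a stable lexicographic sort on the pair (group identifier, attribute value), where the group identifier is the prefix sum of the group flags (prefix_sum as in Section 3.2):

```python
def sort_within_groups(g, x, y):
    """Sort x (and carry y along) inside each group without mixing groups."""
    gid = prefix_sum(g)                                   # distinct, ascending per group
    order = sorted(range(len(x)), key=lambda i: (gid[i], x[i]))
    return [x[i] for i in order], [y[i] for i in order]
```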
4.3.2 Attribute-wise test selection
The attribute-wise test selection protocol computes the best test in each group for a given numerical input attribute. It assumes that the input attribute values and class labels are sorted with respect to the attribute values within each group. It implements, on our data structure and using the group-wise operations proposed in Section 3, the technique by Abspoel et al. [AEV21] that reduces the candidate thresholds to the splits between adjacent sorted elements. We use the modified Gini index protocol, which will be described in Section 4.3.3, to compute the measure.
The protocol is shown in Algorithm 13. It receives a private group flag vector, a private numerical input attribute vector, and a private class label vector; the attribute values and class labels are sorted with respect to the attribute values within each group. The outputs are the thresholds and scores of the best tests in each group.
We show that the protocol computes the best test in each group for the given numerical input attribute. Since the attribute values and class labels are sorted within each group with respect to the attribute values, it is sufficient to consider only splits between two adjacent elements in each group [AEV21]. First, the threshold and the score of the split between the $i$-th and $(i+1)$-th elements are computed for every $i$. If the $i$-th element is the last element of its group, or if it has the same attribute value as the next element, we cannot split between the $i$-th and $(i+1)$-th elements; in this case, the score and the threshold are overwritten with dummy values that can never be selected. Finally, the score and the threshold of the element whose score is maximal within a group are copied to the other elements of the group.
The protocol uses non-free operations in rounds, since protocol uses non-free operations in rounds, as we will show in the next section.
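A cleartext sketch of the attribute-wise selection follows. For readability the threshold is taken to be the midpoint of two adjacent values (the protocol instead uses an equivalent division-free test, cf. Section 4.1), the group-wise vector-max of the protocol is emulated with group_max from Section 3.4, and group_modified_gini is sketched after Section 4.3.3:

```python
def attrwise_test_selection(g, x, y):
    """Best split of each group for one attribute; x, y sorted by x within groups.
    Returns, for every element, the best threshold and score of its group."""
    n = len(x)
    score = group_modified_gini(g, y)          # score of the split between i and i+1
    thr = [(x[i] + x[i + 1]) / 2 if i + 1 < n else 0 for i in range(n)]
    for i in range(n):
        last_in_group = (i == n - 1) or (g[i + 1] == 1)
        if last_in_group or x[i] == x[i + 1]:
            score[i], thr[i] = -1, 0           # splitting between i and i+1 not allowed
    best_score = group_max(g, score)
    marked = [thr[i] if score[i] == best_score[i] else float("-inf") for i in range(n)]
    best_thr = group_max(g, marked)            # threshold of a maximal-score element
    return best_thr, best_score
```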
4.3.3 Modified Gini index
We present a protocol to compute the modified Gini index for a privately grouped dataset. Thanks to the group-wise operations proposed in Section 3, the formula by Abspoel et al. [AEV21] in Equation 1 can be used almost directly.
The protocol is shown in Algorithm 14. The input is a private group flag vector and a private class label vector, which is sorted by an input attribute within each group. The output is a private vector in which each element represents the modified Gini index for the split between that element and the next one within its group.
Since the class label vector is a bit vector, its group-wise prefix sum at position $i$ is the number of $1$'s up to the $i$-th element in the group, and the group-wise sum minus this prefix sum is the number of $1$'s after the $i$-th element in the group. In the same way we obtain the number of $0$'s and the number of elements up to and after the $i$-th element in the group. These six quantities are exactly the ones appearing in Equation 1, so we can evaluate the modified Gini index for the split between the $i$-th and $(i+1)$-th elements.
The protocol uses non-free operations in rounds.
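Finally, a cleartext sketch of the modified Gini index computation; it evaluates the fraction form of Equation 1 directly, whereas the MPC protocol avoids the division (cf. Section 1 and [AEV21]). It reuses group_sum and group_prefix_sum from Sections 3.2 and 3.3:

```python
def group_modified_gini(g, y):
    """Score of the split between positions i and i+1 within each group,
    for a binary label vector y sorted by the attribute within each group."""
    n = len(y)
    ones_up = group_prefix_sum(g, y)             # number of 1's up to i in the group
    cnt_up = group_prefix_sum(g, [1] * n)        # number of elements up to i in the group
    ones_all = group_sum(g, y)                   # number of 1's in the whole group
    cnt_all = group_sum(g, [1] * n)              # size of the whole group
    score = []
    for i in range(n):
        zeros_up = cnt_up[i] - ones_up[i]
        ones_after = ones_all[i] - ones_up[i]
        zeros_after = (cnt_all[i] - cnt_up[i]) - ones_after
        n_left, n_right = cnt_up[i], cnt_all[i] - cnt_up[i]
        if n_left == 0 or n_right == 0:
            score.append(0.0)                    # no split possible at this position
            continue
        score.append((ones_up[i] ** 2 + zeros_up[i] ** 2) / n_left
                     + (ones_after ** 2 + zeros_after ** 2) / n_right)
    return score
```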
5 Demonstration of our protocol’s practicality
In order to show the practicality of our decision tree training protocol, we implemented it and measured the running time.
5.1 Implementation methods
| OS | CentOS Linux release 7.3.1611 |
|---|---|
| CPU | Intel Xeon Gold 6144 (3.50 GHz, 8 cores / 16 threads) × 2 |
| Memory | 768 GB |
We implemented our protocols on a Shamir's secret-sharing based three-party computation over a prime field. This 3PC scheme is secure against a single static corruption. For the ABB implementation, we used the comparison protocols by Kikuchi et al. [KIM+18] and the multiplication protocol by Chida et al. [CHI+19]. We also replaced some of the protocols built on the ABB with more efficient ones: the inner product, apply, unapply, and sortperm protocols are based on the ones by Chida et al. [CHI+19]. Our implementation includes several optimizations.
The protocols were implemented in C++. We measured on three servers with the same configuration, connected in a ring by Intel X710/X557-AT 10G network interfaces. The configuration of the servers is shown in Table 3.
5.2 Benchmarking results
| $h$ | Time [s] |
|---|---|
| 1 | 1.342 |
| 2 | 2.203 |
| 5 | 4.432 |
| 10 | 8.810 |
| 20 | 16.633 |
| 50 | 40.891 |
| $n$ | Time [s] |
|---|---|
| | 1.811 |
| | 2.092 |
| | 2.569 |
| | 4.432 |
| | 32.035 |
| $m$ | Time [s] |
|---|---|
| 1 | 2.469 |
| 2 | 2.641 |
| 5 | 3.069 |
| 10 | 4.432 |
| 20 | 7.790 |
| 50 | 19.111 |
| 100 | 39.401 |
To show the scalability of our protocol, we measured the running time for different values of the parameters $n$, $m$, and $h$. Starting from a base setting with $m = 10$, $h = 5$, and a fixed base value of $n$, we measured the execution time of training while varying $n$, $m$, and $h$. The results are shown in Tables 5, 6, and 4, respectively. Each runtime is the average of three measurements. The results show that the running time is approximately linear with respect to $n$, $m$, and $h$.
References
- [ACC+21] Samuel Adams, Chaitali Choudhary, Martine De Cock, Rafael Dowsley, David Melanson, Anderson C. A. Nascimento, Davis Railsback, and Jianwei Shen. Privacy-preserving training of tree ensembles over continuous data. CoRR, Vol. abs/2106.02769, 2021.
- [AEV21] Mark Abspoel, Daniel Escudero, and Nikolaj Volgushev. Secure training of decision trees with continuous attributes. Proc. Priv. Enhancing Technol., Vol. 2021, No. 1, pp. 167–187, 2021.
- [BFOS84] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.
- [Bre01] Leo Breiman. Random forests. Mach. Learn., Vol. 45, No. 1, pp. 5–32, 2001.
- [CHI+19] Koji Chida, Koki Hamada, Dai Ikarashi, Ryo Kikuchi, Naoto Kiribuchi, and Benny Pinkas. An efficient secure three-party sorting protocol with an honest majority. IACR Cryptol. ePrint Arch., p. 695, 2019.
- [dHSCodA14] Sebastiaan de Hoogh, Berry Schoenmakers, Ping Chen, and Harm op den Akker. Practical secure decision tree learning in a teletreatment application. In Nicolas Christin and Reihaneh Safavi-Naini, editors, FC 2014, March 3-7, 2014, Christ Church, Barbados, Vol. 8437 of Lecture Notes in Computer Science, pp. 179–194. Springer, 2014.
- [FO21] Brett Hemenway Falk and Rafail Ostrovsky. Secure merge with O(n log log n) secure operations. In Stefano Tessaro, editor, ITC 2021, July 23-26, 2021, Virtual Conference, Vol. 199 of LIPIcs, pp. 7:1–7:29. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
- [Fri01] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232, 2001.
- [HKI+12] Koki Hamada, Ryo Kikuchi, Dai Ikarashi, Koji Chida, and Katsumi Takahashi. Practically efficient multi-party sorting protocols from comparison sort algorithms. In Taekyoung Kwon, Mun-Kyu Lee, and Daesung Kwon, editors, ICISC, Vol. 7839 of LNCS, pp. 202–216. Springer, 2012.
- [HKP11] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann, 2011.
- [HR76] Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Inf. Process. Lett., Vol. 5, No. 1, pp. 15–17, 1976.
- [KIM+18] Ryo Kikuchi, Dai Ikarashi, Takahiro Matsuda, Koki Hamada, and Koji Chida. Efficient bit-decomposition and modulus-conversion protocols with an honest majority. In Willy Susilo and Guomin Yang, editors, ACISP 2018, July 11-13, 2018, Wollongong, NSW, Australia, Vol. 10946 of Lecture Notes in Computer Science, pp. 64–82. Springer, 2018.
- [Lau15] Peeter Laud. Parallel oblivious array access for secure multiparty computation and privacy-preserving minimum spanning trees. Proc. Priv. Enhancing Technol., Vol. 2015, No. 2, pp. 188–205, 2015.
- [LP00] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. In Mihir Bellare, editor, CRYPTO, Vol. 1880 of LNCS, pp. 36–54. Springer, 2000.
- [LW14] Peeter Laud and Jan Willemson. Composable oblivious extended permutations. In Frédéric Cuppens, Joaquín García-Alfaro, A. Nur Zincir-Heywood, and Philip W. L. Fong, editors, FPS 2014, November 3-5, 2014, Montreal, QC, Canada, Vol. 8930 of Lecture Notes in Computer Science, pp. 294–310. Springer, 2014.
- [MR18] Payman Mohassel and Peter Rindal. ABY3: A mixed protocol framework for machine learning. In David Lie, Mohammad Mannan, Michael Backes, and XiaoFeng Wang, editors, CCS 2018, October 15-19, 2018, Toronto, ON, Canada, pp. 35–52. ACM, 2018.
- [Qui86] J. Ross Quinlan. Induction of decision trees. Mach. Learn., Vol. 1, No. 1, pp. 81–106, 1986.
- [Qui14] J Ross Quinlan. C4.5: programs for machine learning. Elsevier, 2014.
- [RWT+18] M. Sadegh Riazi, Christian Weinert, Oleksandr Tkachenko, Ebrahim M. Songhori, Thomas Schneider, and Farinaz Koushanfar. Chameleon: A hybrid secure computation framework for machine learning applications. In Jong Kim, Gail-Joon Ahn, Seungjoo Kim, Yongdae Kim, Javier López, and Taesoo Kim, editors, AsiaCCS 2018, June 04-08, 2018, Incheon, Republic of Korea, pp. 707–721. ACM, 2018.
- [Wak68] Abraham Waksman. A permutation network. J. ACM, Vol. 15, No. 1, pp. 159–163, 1968.
- [WGC19] Sameer Wagh, Divya Gupta, and Nishanth Chandran. Securenn: 3-party secure computation for neural network training. Proc. Priv. Enhancing Technol., Vol. 2019, No. 3, pp. 26–49, 2019.
- [Yao86] Andrew Chi-Chih Yao. How to generate and exchange secrets (extended abstract). In 27th Annual Symposium on Foundations of Computer Science, 27–29 October 1986, Toronto, Canada, pp. 162–167. IEEE Computer Society, 1986.