Efficient decision tree training with new data structure for secure multi-party computation
December 17, 2021
Abstract
We propose a secure multi-party computation (MPC) protocol that constructs a secret-shared decision tree for a given secret-shared dataset. The previous MPC-based decision tree training protocol (Abspoel et al. 2021) requires $O(2^h m n \log n)$ comparisons, which is exponential in the tree height $h$, where $n$ and $m$ are the number of rows and the number of attributes in the dataset, respectively. The cause of the exponential number of comparisons in $h$ is that the decision tree training algorithm is based on the divide-and-conquer paradigm, where dummy rows are added after each split in order to hide the number of rows in the dataset. We resolve this issue via a secure data structure that enables us to compute an aggregate value for every group while hiding the grouping information. By using this data structure, we can train a decision tree without adding dummy rows while hiding the size of the intermediate data. We specifically describe a decision tree training protocol that requires only $O(h m n \log n)$ comparisons when the input attributes are continuous and the output attribute is binary. Note that the order is now linear in the tree height $h$. To demonstrate the practicality of our protocol, we implement it in an MPC framework based on a three-party secret sharing scheme. Our implementation results show that our protocol trains a decision tree with a height of 5 in 33 seconds for a dataset of 100,000 rows and 10 attributes.
1 Introduction
Secure multi-party computation (MPC) [Yao86] allows parties to jointly compute any function while keeping its inputs private. Its large computational overhead has long been a barrier to practical use. In recent years, however, efficient MPC protocols have been proposed even for machine learning methods such as neural network training [WGC19, RWT+18, MR18].
The decision tree is one of the classical machine learning methods. It is still widely used due to its computational simplicity and ease of interpretation. It is also an important component of other machine learning methods that have been successful in recent years, such as gradient boosting decision trees [Fri01] and random forests [Bre01].
Since the work of Lindell and Pinkas [LP00] in the early days of privacy-preserving data mining, there has been a lot of research on MPC protocols for decision tree training. In order to be used as a component of MPC protocols for other machine learning methods, it is desirable to keep all the information, from the input to the trained decision tree, private. However, only a few protocols [dHSCodA14, AEV21, ACC+21] with such a property have been proposed. This is mainly due to two kinds of computational difficulties in MPC.
The first difficulty is the computation on real numbers. Decision tree training requires the computation of evaluation functions. Although there are many types of evaluation functions, all commonly used ones involve division or logarithm. Therefore, naive MPC protocols for decision tree training involve computation on real numbers, which increases the computational cost. On the contrary, de Hoogh et al. [dHSCodA14] cleverly avoided computation on real numbers by replacing fractional number comparisons with integer comparisons, and proposed an efficient protocol for the case where the inputs are categorical values. Abspoel et al. [AEV21] presented an efficient protocol that can be applied to the case where the input contains numerical values. The number of candidates to which the evaluation functions are applied per attribute is the number $c$ of possible values of a categorical attribute when the input consists only of categorical values, whereas it can be as large as the number of possible threshold values when the input contains numerical values, where $n$ is the number of samples in the input. They used a sorting protocol to reduce the number of candidates per attribute to the $n-1$ splits between adjacent sorted values, and also extended the technique of de Hoogh et al. to the numerical case to avoid computation on real numbers. Adams et al. [ACC+21] dealt with the case where the input contains numerical values by a different approach: discretizing the numerical attributes of the input. Although the trained tree is slightly different from the one trained without discretization, this approach avoids the use of sorting, which is relatively computationally expensive, and allows the use of the efficient protocol of de Hoogh et al. [dHSCodA14].
The second difficulty is the protection of the intermediate data size. In decision tree training, the data is split recursively from the root node to the leaf nodes in a top-down fashion. As the tree height increases, the number of nodes increases exponentially. On the other hand, the size of the intermediate data processed at each node decreases exponentially on average, hence the overall computational cost is linear in the tree height. When this is implemented in MPC, the size of the intermediate data after each split has to be hidden, so the existing protocols [dHSCodA14, AEV21, ACC+21] process, at each node, a dataset of the same size as the original one, padded with dummy entries. Therefore, they cannot benefit from the size reduction caused by data splitting, and as a result, the overall computational cost is exponential in the tree height.
1.1 Our contribution
| Method | Number of operations |
|---|---|
| Trivial [AEV21] | $O(2^h m n^2)$ |
| Abspoel et al. [AEV21] | $O(2^h m n \log^2 n)$ |
| Abspoel et al. [AEV21] with efficient sort [HKI+12] | $O(2^h m n \log n)$ |
| Ours | $O(h m n \log n)$ |
We propose an MPC protocol for decision tree training whose computational cost is linear in the tree height, which is the first protocol to solve the second problem above. It trains a binary decision tree under the assumption that all input attributes are numerical and the output attribute is binary. As in the protocol by Abspoel et al. [AEV21], it does not reveal any information other than the size of the input and the upper bound on the tree height.
The computational cost of our protocol is $O(h m n \log n)$, assuming that the comparison and multiplication protocols are unit operations, where $m$ is the number of input attributes and $n$ is the number of samples in the dataset. This is an exponential improvement with respect to $h$ over the $O(2^h m n \log n)$ computational cost of the protocol by Abspoel et al. (Actually, Abspoel et al. [AEV21] claimed only a computational cost of $O(2^h m n \log^2 n)$, but their protocol can easily be implemented to run in $O(2^h m n \log n)$ by replacing the sorting protocol with an efficient one such as [HKI+12].) A comparison of the computational costs is shown in Table 1.
Our approach of exponential improvement in computational cost with respect to the tree height is general. For completeness, our protocol is instantiated with all input attributes being numeric, the output attribute being binary, and the evaluation function being the Gini index; however, it is easy to extend. In fact, the main protocol (Algorithm 5), which plays a central role in the exponential improvement of the computational cost, describes a process common to the major decision tree training algorithms CART [BFOS84], ID3 [Qui86], and C4.5 [Qui14].
Our protocol is built on top of a set of basic protocols, such as multiplication and comparison, provided by many recent MPC frameworks, so it can be used on top of various implementations. More specifically, we build our protocol on top of an MPC model called arithmetic black box (ABB), which consists of a set of basic operations described in Section 2.2.1.
As a byproduct of our decision tree training protocol, we also propose a secure data structure that can compute aggregate values, such as sums and maximums, within each group of grouped values while keeping the grouping information private. This data structure may also be useful in applications other than decision tree training.
To see the practicality of our decision tree training protocol, we implemented it on an MPC framework based on a 2-out-of-3 secret sharing scheme. Our protocol trained a decision tree of height 5 for a dataset of 100,000 samples with 10 input attributes in 33 seconds.
1.2 Overview of our techniques
In general, MPC protocols are incompatible with divide-and-conquer algorithms. In divide-and-conquer algorithms, the problem is divided into smaller subproblems and solved recursively, but MPC protocols also need to hide the sizes of the subproblems. A common way to hide the size of a problem is to add dummies: we hide the actual size of the data by adding dummies (or leaving samples that should have been removed) to the split data so that it appears to be of the same size as the original. The disadvantage of this method is that it is computationally expensive, since it loses the property that the data size becomes smaller after each split. For this reason, the previous study [AEV21] required a cost exponential in the height of the tree.
We use the property that the total number of samples over all nodes of the same height is invariant during decision tree training. We keep the data of all nodes of the same height together, and train them all at once without adding any dummies. This allows our protocol to process only $n$ samples in total at each height, while the previous study [AEV21] processes an exponentially larger number of samples, including dummies.
To implement this idea, we first define a data structure that looks like a private vector of length $n$ but is internally grouped. Specifically, we place the grouped elements in a private vector of length $n$ so that elements of the same group appear next to each other, and we additionally keep a private vector of length $n$ of flags marking the first element of each group. This allows us to detect the boundaries of groups internally by referring to the flags, while no grouping is visible from the outside.
In decision tree training, each group needs to be split when moving to the next height. We accomplish this within our data structure by stably sorting the elements using the binary branching result, which is computed for each element, as a key. Stability of the sort ensures that elements that are in the same group and have the same branching result will be placed sequentially after the sort. Since this split requires only one-bit-key sorting, it can be very efficient depending on the underlying MPC implementation.
We build the group-wise sum, maximum, and prefix sum computations on our data structure. We then use them to build a decision tree training algorithm similar to [AEV21] on our data structure.
2 Preliminaries
In this section, we introduce a typical decision tree training algorithm in the clear and secure multi-party computation.
Before that, we introduce some notation. Throughout this paper, the index of a vector starts at $1$. We refer to the $i$-th element of a vector $\boldsymbol{x}$ by $\boldsymbol{x}[i]$. That is, if $\boldsymbol{x}$ is a vector of length $n$, then $\boldsymbol{x} = (\boldsymbol{x}[1], \ldots, \boldsymbol{x}[n])$. In logical operations, $0$ represents false and $1$ represents true.
2.1 Decision tree training
Decision tree training is a method in machine learning. The goal is to obtain a model called a decision tree that predicts a value of an output attribute, given values of input attributes. There are several famous algorithms for decision tree training, such as CART [BFOS84], ID3 [Qui86], and C4.5 [Qui14]. The general framework of these algorithms is the same, and in fact they are all greedy algorithms based on the divide-and-conquer paradigm. In this section, we present a typical algorithm, for which we plan to construct a secure version, for training a two-class classification binary tree, where all input attributes are numerical.
2.1.1 Typical decision tree training algorithm
Let us start by defining notation. Consider a dataset with $m$ input attributes and one output attribute. Suppose there are $n$ samples, each sample being a pair $(\boldsymbol{x}, y)$ of an input tuple $\boldsymbol{x}$ and a class label $y$. Here, $\boldsymbol{x}$ is an $m$-tuple whose $j$-th element $x^{(j)}$ is a value of the $j$-th input attribute, and $y$ is a value of the output attribute. A decision tree consists of a binary tree and some additional information. Each internal node (non-leaf node) has a condition called a test, of the form $x^{(j)} < t$; it asks whether the $j$-th element of a given input tuple is less than a threshold $t$ or not. Each edge is assigned a possible outcome of its source node's test, that is, true or false. An edge whose assigned outcome is true (false) is called a true edge (false edge, respectively). A child node whose incoming edge is a true edge (false edge) is called a true-child node (false-child node, respectively). Each leaf node is assigned a class label called its leaf label. This information is used to predict a class label for a given input tuple as follows. Starting from the root node, we repeatedly evaluate the test of the internal node we reach and trace the outgoing edge that is assigned the same value as the test outcome. When we reach a leaf node, we output its leaf label as the predicted class label.
A typical decision tree training algorithm is shown in Algorithm 1. It trains a tree recursively from the root node to the leaf nodes in a top-down fashion. At each node, it checks whether the stopping criterion is satisfied on the given training dataset to determine the node type. If the stopping criterion is satisfied, the current node is set to be a leaf node; the most frequent class label in the dataset becomes the leaf label of the current node, and the algorithm outputs a tree whose root is this node. If the stopping criterion is not satisfied, the current node is set to be an internal node. In this case, we select a test of the form $x^{(j)} < t$ that gives the best data split with respect to a predetermined criterion, and split the training dataset $D$ into $D_{\mathrm{true}}$ and $D_{\mathrm{false}}$ according to this test, where $D_{\mathrm{true}}$ is the set of samples whose input tuples satisfy the test and $D_{\mathrm{false}} = D \setminus D_{\mathrm{true}}$. The algorithm then recursively trains two decision trees with $D_{\mathrm{true}}$ and $D_{\mathrm{false}}$ as training data, sets the roots of these trees as the true-child and false-child nodes of the current node, and outputs a tree whose root is the current node.
We use the commonly used stopping criterion: (1) the height of the node is $h$, or (2) the dataset cannot be split further (i.e., (i) all class labels are the same, or (ii) all input tuples are the same), where $h$ is an upper bound on the tree height, typically given as a hyperparameter.
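For concreteness, the following is a minimal cleartext sketch of this recursive procedure (plain Python, not the MPC protocol; the helper select_best_test, which returns the attribute index and threshold optimizing the chosen measure, is an assumed placeholder):

```python
def train_tree(D, height, max_height):
    """Cleartext sketch of Algorithm 1. D is a list of (x, y) pairs, where x is a
    tuple of numerical attribute values and y is a binary class label."""
    labels = [y for _, y in D]
    inputs = [x for x, _ in D]
    if (height == max_height                  # stopping criterion (1)
            or len(set(labels)) == 1          # (2)(i): all class labels equal
            or len(set(inputs)) == 1):        # (2)(ii): all input tuples equal
        return ("leaf", max(set(labels), key=labels.count))  # most frequent label
    j, t = select_best_test(D)                # assumed helper, e.g. minimizing Gini_t(D)
    D_true = [(x, y) for x, y in D if x[j] < t]
    D_false = [(x, y) for x, y in D if not x[j] < t]
    return ("node", j, t,
            train_tree(D_true, height + 1, max_height),
            train_tree(D_false, height + 1, max_height))
```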
2.1.2 Test selection measure
The size and shape of the decision tree depends on which tests are selected at the internal nodes. In general, it is desirable to make the tree as small as possible, but the problem of constructing a decision tree that minimizes the sum of the lengths of the paths from the root to each leaf is known to be NP-hard [HR76]. Therefore, we usually define a measure for goodness of local splitting and select a test that maximizes this measure.
Commonly used measures for goodness of split include the information gain used in ID3 [Qui86] and the Gini index used in CART [BFOS84]. We use the Gini index, which is also used in previous studies such as [dHSCodA14, AEV21] due to its ease of computation in MPC.
Two types of Gini indices are defined: one for a dataset and one for a dataset and a test. The Gini index for a dataset $D$, which we denote by $\mathrm{Gini}(D)$, is defined as follows:
$$\mathrm{Gini}(D) = 1 - \sum_{k} \left( \frac{|D^{(k)}|}{|D|} \right)^2,$$
where $D^{(k)}$ is the subset of $D$ whose class label is $k$. Intuitively, the smaller $\mathrm{Gini}(D)$ is, the purer $D$ is in terms of class labels.
The Gini index for a dataset $D$ and a test $t$, which we denote by $\mathrm{Gini}_t(D)$, is defined using $\mathrm{Gini}(\cdot)$ as follows:
$$\mathrm{Gini}_t(D) = \frac{|D_{\mathrm{true}}|}{|D|}\,\mathrm{Gini}(D_{\mathrm{true}}) + \frac{|D_{\mathrm{false}}|}{|D|}\,\mathrm{Gini}(D_{\mathrm{false}}),$$
where $D_{\mathrm{true}}$ and $D_{\mathrm{false}}$ are the subsets of $D$ on which the test $t$ is true and false, respectively. Intuitively, the smaller $\mathrm{Gini}_t(D)$ is, the purer each split dataset becomes (and hence the better the test $t$ is). Therefore, to find the best test for splitting a dataset $D$, we compute a test $t$ that minimizes $\mathrm{Gini}_t(D)$ [HKP11].
Abspoel et al. [AEV21] showed that minimization of $\mathrm{Gini}_t(D)$ is equivalent to maximization of $\widetilde{\mathrm{Gini}}_t(D)$ defined as
$$\widetilde{\mathrm{Gini}}_t(D) = \frac{|D_{\mathrm{true}}^{(0)}|^2 + |D_{\mathrm{true}}^{(1)}|^2}{|D_{\mathrm{true}}|} + \frac{|D_{\mathrm{false}}^{(0)}|^2 + |D_{\mathrm{false}}^{(1)}|^2}{|D_{\mathrm{false}}|}, \qquad (1)$$
where $D_{\mathrm{true}}^{(k)}$ and $D_{\mathrm{false}}^{(k)}$ are the subsets of $D_{\mathrm{true}}$ and $D_{\mathrm{false}}$, respectively, whose class label is $k \in \{0, 1\}$. We refer to it as the modified Gini index and use it as the measure in our protocol.
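For completeness, the equivalence can be checked directly from the definitions above, using that $|D| = |D_{\mathrm{true}}| + |D_{\mathrm{false}}|$ does not depend on the test:
$$
\begin{aligned}
\mathrm{Gini}_t(D)
  &= \frac{|D_{\mathrm{true}}|}{|D|}\,\mathrm{Gini}(D_{\mathrm{true}})
   + \frac{|D_{\mathrm{false}}|}{|D|}\,\mathrm{Gini}(D_{\mathrm{false}}) \\
  &= \frac{|D_{\mathrm{true}}| + |D_{\mathrm{false}}|}{|D|}
   - \frac{1}{|D|}\left(
       \frac{|D_{\mathrm{true}}^{(0)}|^2 + |D_{\mathrm{true}}^{(1)}|^2}{|D_{\mathrm{true}}|}
     + \frac{|D_{\mathrm{false}}^{(0)}|^2 + |D_{\mathrm{false}}^{(1)}|^2}{|D_{\mathrm{false}}|}
     \right) \\
  &= 1 - \frac{1}{|D|}\,\widetilde{\mathrm{Gini}}_t(D),
\end{aligned}
$$
so, for a fixed dataset $D$, minimizing $\mathrm{Gini}_t(D)$ over tests $t$ is the same as maximizing $\widetilde{\mathrm{Gini}}_t(D)$.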
2.2 Secure multi-party computation
We model secure multi-party computation (MPC) with an ideal functionality called an arithmetic black box (ABB). This ideal functionality allows a set of parties to store values, operate on the stored values, and retrieve the stored values. We build our protocol on top of an ABB. This allows our protocol to run on any MPC implementation that realizes the ABB, since the concrete ABB implementation is separated from the protocol construction.
2.2.1 Arithmetic black box
Fig. 1: The commands of the arithmetic black box.
• A command $\mathsf{Input}(x)$: Receive $x$ from a party and store it as $[\![x]\!]$.
• A command $\mathsf{Output}([\![x]\!])$: Send $x$ to every party, who stores it in the local variable $x$.
• A command $\mathsf{Add}([\![x]\!], [\![y]\!])$: Compute $z = x + y$ and store it as $[\![z]\!]$.
• A command $\mathsf{Mul}([\![x]\!], [\![y]\!])$: Compute $z = x y$ and store it as $[\![z]\!]$.
• A command $\mathsf{LT}([\![x]\!], [\![y]\!])$: If $x < y$ then set $z = 1$, otherwise set $z = 0$. Store it as $[\![z]\!]$.
• A command $\mathsf{EQ}([\![x]\!], [\![y]\!])$: If $x = y$ then set $z = 1$, otherwise set $z = 0$. Store it as $[\![z]\!]$.
We assume that the commands $\mathsf{Add}$, $\mathsf{Mul}$, $\mathsf{LT}$, and $\mathsf{EQ}$ are also defined in the same way when one of the inputs is a public value.
We assume a simple ABB over a ring $\mathbb{Z}_M$ for some integer $M$, as shown in Fig. 1. We denote a value referred to by a name $x$ stored in the ABB by $[\![x]\!]$. In the typical case where the ABB is realized by a secret-sharing based MPC, $[\![x]\!]$ means that $x$ is secret shared. We say a value is private if it is stored in the ABB.
We identify residue classes in $\mathbb{Z}_M$ with their representatives in $\{0, 1, \ldots, M-1\}$. We assume $M$ is sufficiently large that vector indices can be stored in $\mathbb{Z}_M$. We also assume that the number of parties is constant.
For notational simplicity, $\mathsf{Add}([\![x]\!], [\![y]\!])$, $\mathsf{Mul}([\![x]\!], [\![y]\!])$, $\mathsf{LT}([\![x]\!], [\![y]\!])$, and $\mathsf{EQ}([\![x]\!], [\![y]\!])$ are also written as $[\![x]\!] + [\![y]\!]$, $[\![x]\!] \times [\![y]\!]$, $[\![x]\!] < [\![y]\!]$, and $[\![x]\!] = [\![y]\!]$, respectively. Furthermore, we denote a vector of stored values $([\![x_1]\!], \ldots, [\![x_n]\!])$ by $[\![\boldsymbol{x}]\!]$.
2.2.2 Cost of MPC protocols
We define the cost of an MPC protocol as the number of invocations of ABB operations other than linear combinations of private values. That is, we assume that the parties can compute $[\![x]\!] + [\![y]\!]$, $[\![x]\!] - [\![y]\!]$, $c \cdot [\![x]\!]$, $[\![x]\!] + c$, and $[\![x]\!] - c$ for free, where $[\![x]\!]$ and $[\![y]\!]$ are private values and $c$ is a public value. This cost models the communication complexity of a typical MPC based on a linear secret sharing scheme, in which the parties can locally compute linear combinations of secret-shared values. We refer to ABB operations other than linear combinations of private values as non-free operations.
2.2.3 Known protocols
We show known protocols that we will use as building blocks for our protocols. For completeness, the protocols shown here are limited to those that can be built on top of the ABB. Some MPC implementations may provide the same functionality more efficiently, in which case we can use them instead of the protocols listed here to run our protocol more efficiently.
We start by defining some simple protocols.
• $[\![z]\!] \leftarrow \mathsf{OR}([\![x]\!], [\![y]\!])$ computes the logical disjunction of bits $x$ and $y$ as $[\![x]\!] + [\![y]\!] - [\![x]\!] \times [\![y]\!]$, using one non-free operation in one round.
• $[\![z]\!] \leftarrow \mathsf{Not}([\![x]\!])$ computes the negation of a bit $x$ as $1 - [\![x]\!]$, using no non-free operations.
• $[\![z]\!] \leftarrow \mathsf{IfElse}([\![c]\!], [\![x]\!], [\![y]\!])$ receives a bit $[\![c]\!]$ and two values $[\![x]\!]$ and $[\![y]\!]$, and computes $x$ if $c = 1$ and $y$ otherwise, as $[\![y]\!] + [\![c]\!] \times ([\![x]\!] - [\![y]\!])$, using one non-free operation in one round.
• $[\![z]\!] \leftarrow \mathsf{Max}([\![x]\!], [\![y]\!])$ computes the maximum of $x$ and $y$ as $\mathsf{IfElse}([\![x]\!] < [\![y]\!], [\![y]\!], [\![x]\!])$, using two non-free operations in two rounds.
We require an extended max protocol, which we call $\mathsf{VectMax}$. We let $[\![z]\!] \leftarrow \mathsf{VectMax}([\![\boldsymbol{x}]\!], [\![\boldsymbol{y}]\!])$ denote the operation that computes a private value $[\![z]\!]$ such that $[\![\boldsymbol{x}]\!]$ and $[\![\boldsymbol{y}]\!]$ are private vectors of the same length $n$, $i$ is an index at which $\boldsymbol{x}[i]$ is maximal, and $z = \boldsymbol{y}[i]$. We use the construction by Abspoel et al. [AEV21], which uses $O(n)$ non-free operations in $O(\log n)$ rounds.
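For reference, the cleartext semantics of this functionality is simply the following (a Python sketch with 0-based indices; the MPC construction of [AEV21] is of course different):

```python
def vect_max(x, y):
    """Return the element of y at a position where x is maximal,
    i.e. the cleartext semantics of VectMax(x, y)."""
    i = max(range(len(x)), key=lambda j: x[j])
    return y[i]
```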
We require three permutation-related protocols: one that generates a random private permutation, and the protocols $\mathsf{Apply}$ and $\mathsf{Unapply}$. Let $S_n$ be the symmetric group on $\{1, \ldots, n\}$; that is, $S_n$ is the set of all bijective functions from $\{1, \ldots, n\}$ to itself. A permutation is an element of $S_n$. Applying a permutation $\pi$ to a vector $\boldsymbol{x}$ of length $n$ is the operation of rearranging $\boldsymbol{x}$ into a vector $\boldsymbol{y}$ satisfying $\boldsymbol{y}[\pi(i)] = \boldsymbol{x}[i]$ for all $i$; we denote this operation by $\pi(\boldsymbol{x})$. Now we define the three protocols. The first, given an integer $n$, computes a private permutation $[\![\pi]\!]$ such that $\pi$ is a uniformly randomly chosen element of $S_n$. The protocol $\mathsf{Apply}$, given a private permutation $[\![\pi]\!]$ and a private vector $[\![\boldsymbol{x}]\!]$ of length $n$, computes $[\![\pi(\boldsymbol{x})]\!]$. The protocol $\mathsf{Unapply}$, given $[\![\pi]\!]$ and $[\![\boldsymbol{x}]\!]$, computes $[\![\pi^{-1}(\boldsymbol{x})]\!]$. We use the construction by Falk and Ostrovsky [FO21], in which a private permutation is represented as a set of private control bits for the Waksman permutation network [Wak68]. All these protocols use $O(n \log n)$ non-free operations in $O(\log n)$ rounds. Note that we do not use the random-permutation protocol directly in our protocols, but it is required as a component of the construction of the $\mathsf{SortPerm}$ protocol shown below.
We also require a protocol to compute a permutation that stably sorts given keys. We let $\mathsf{SortPerm}$ denote the operation that, given one or more private key vectors of length $n$, computes a private permutation $[\![\pi]\!]$ such that applying $\pi$ to the key vectors sorts them lexicographically and stably. We use the construction by Laud and Willemson [LW14]. The protocol uses non-free operations in rounds. Note that we can construct the composition of private and public permutations that is needed for this construction, since our private permutations are sets of control bits for the Waksman permutation network.
To simplify the description, we introduce a small subprotocol for sorting private vectors, which we denote by $\mathsf{Sort}$. Given private key vectors and private data vectors, it denotes the following procedure:
1. compute a private permutation that stably sorts the key vectors lexicographically, using the $\mathsf{SortPerm}$ protocol;
2. apply this permutation to each of the data vectors using the $\mathsf{Apply}$ protocol.
We sometimes use similar notation when the same operation is applied to multiple inputs, meaning that the operation is executed in parallel for each of them. If vectors are given to a protocol defined for scalar values, the protocol is applied on an element-by-element basis, that is, executed in parallel for each position of the vectors. If some of the inputs are scalar, the same scalar value is used in all of the parallel executions.
3 Our secure group-wise aggregation protocols
In this section, we propose group-wise aggregation protocols that compute aggregate values (sum, prefix sum, and maximum) for each group without revealing the grouping information of the input grouped values. These are executed on grouped values stored in our data structure. These protocols and the data structure play a central role in the construction of our decision tree training protocol proposed in Section 4.
3.1 Our data structure for privately grouped values
We propose a data structure that stores grouped values without revealing any information about the grouping. We store $n$ values, divided into several groups, in a private vector of length $n$, called the internally grouped vector. Here, elements of the same group are stored as consecutive elements in the vector; that is, for any indices $i \le j \le k$, if the $i$-th and $k$-th elements are in the same group, then the $j$-th and $k$-th elements are also in the same group. Along with such a vector, we maintain a private bit vector of length $n$, called the group flag vector, which indicates the boundaries between groups. Namely, the $i$-th flag is set to $1$ if the $i$-th element of the internally grouped vector is the first element of a group, and to $0$ otherwise. By definition, the first flag is always true.
We show an example. Suppose that six values are stored in an internally grouped vector as $(3, 1, 2, 2, 3, 2)$ and the corresponding group flag vector is $(1, 0, 1, 1, 0, 0)$. Then, this means that the six values are divided into three groups, $(3, 1)$, $(2)$, and $(2, 3, 2)$.
For the sake of simplicity, we introduce some notation. With respect to the grouping represented by a group flag vector, let $\mathrm{first}_i$ ($\mathrm{last}_i$) be the index of the first (last, respectively) element of the group that contains the $i$-th element. For example, for the group flag vector $(1, 0, 1, 1, 0, 0)$ above, $\mathrm{first}_2 = 1$, $\mathrm{last}_2 = 2$, $\mathrm{first}_5 = 4$, and $\mathrm{last}_5 = 6$.
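The following cleartext snippet illustrates the data structure on the example above (plain Python with 0-based indices, whereas the text uses 1-based vectors; the helper name is illustrative):

```python
x = [3, 1, 2, 2, 3, 2]   # internally grouped vector
g = [1, 0, 1, 1, 0, 0]   # group flag vector: groups (3, 1), (2), (2, 3, 2)

def group_bounds(g, i):
    """Return the indices of the first and last element of the group containing i."""
    first = max(j for j in range(i + 1) if g[j] == 1)
    nxt = [j for j in range(i + 1, len(g)) if g[j] == 1]
    last = nxt[0] - 1 if nxt else len(g) - 1
    return first, last

assert group_bounds(g, 4) == (3, 5)   # position 4 lies in the last group (2, 3, 2)
```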
3.2 Our protocol for group-wise sum
| Input: group flag | Input: value | Output: sum | Output: prefix sum | Output: max |
|---|---|---|---|---|
| 1 | 3 | 4 | 3 | 3 |
| 0 | 1 | 4 | 4 | 3 |
| 1 | 2 | 2 | 2 | 2 |
| 1 | 2 | 7 | 2 | 3 |
| 0 | 3 | 7 | 5 | 3 |
| 0 | 2 | 7 | 7 | 3 |
The group-wise sum protocol privately computes the sum of each group in our data structure. It receives a private group flag vector of length $n$ and a private internally grouped vector of length $n$, and outputs a private vector of length $n$ whose $i$-th element is the sum of the elements of the group containing the $i$-th element. Note that the same value is computed for all elements of the same group. An example is shown in Table 2: columns 1 and 2 are the inputs, and column 3 is the output.
Before presenting our protocol, let us define some operations related to the computation of prefix sums. Given a vector $\boldsymbol{x}$ of length $n$, $\mathsf{PrefixSum}(\boldsymbol{x})$ computes a vector $\boldsymbol{y}$ of length $n$ such that $\boldsymbol{y}[i] = \sum_{j \le i} \boldsymbol{x}[j]$ for all $i$. We also define an inverse operation $\mathsf{InvPrefixSum}$, which computes the vector $\boldsymbol{x}$ such that $\mathsf{PrefixSum}(\boldsymbol{x}) = \boldsymbol{y}$; this is easily computed as $\boldsymbol{x}[1] = \boldsymbol{y}[1]$ and $\boldsymbol{x}[i] = \boldsymbol{y}[i] - \boldsymbol{y}[i-1]$ for all $i > 1$. We further define reverse-ordered versions of these operations: $\mathsf{SuffixSum}(\boldsymbol{x})$ computes $\boldsymbol{y}$ with $\boldsymbol{y}[i] = \sum_{j \ge i} \boldsymbol{x}[j]$, and $\mathsf{InvSuffixSum}(\boldsymbol{y})$ computes the vector $\boldsymbol{x}$ with $\mathsf{SuffixSum}(\boldsymbol{x}) = \boldsymbol{y}$, which is computed as $\boldsymbol{x}[n] = \boldsymbol{y}[n]$ and $\boldsymbol{x}[i] = \boldsymbol{y}[i] - \boldsymbol{y}[i+1]$ for all $i < n$. Note that all of these operations are linear and therefore free in our cost model.
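In cleartext these four operations are simply the following (Python, 0-based indices; these helpers are reused in the sketches throughout this section):

```python
def prefix_sum(x):
    """y[i] = x[0] + ... + x[i]."""
    out, s = [], 0
    for v in x:
        s += v
        out.append(s)
    return out

def inv_prefix_sum(y):
    """Inverse of prefix_sum: x[0] = y[0], x[i] = y[i] - y[i-1]."""
    return [y[i] - (y[i - 1] if i > 0 else 0) for i in range(len(y))]

def suffix_sum(x):
    """Reverse-ordered prefix sum: y[i] = x[i] + ... + x[n-1]."""
    return prefix_sum(x[::-1])[::-1]

def inv_suffix_sum(y):
    """Inverse of suffix_sum: x[n-1] = y[n-1], x[i] = y[i] - y[i+1]."""
    return [y[i] - (y[i + 1] if i + 1 < len(y) else 0) for i in range(len(y))]
```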
The protocol for group-wise sum is shown in Algorithm 2. In the first part of the protocol, we compute the sum of each group. This follows from the fact that the values of the first elements of the groups in $\mathsf{SuffixSum}$ of the input are exactly the reverse-ordered prefix sums of the group sums, so the group sums can be recovered from them by $\mathsf{InvSuffixSum}$. Next, we copy the sum of each group to every element of that group. To do this, we apply the technique used by Laud in his parallel reading protocol [Lau15]: when a prefix sum is computed, an element with value zero simply takes over the running value of the preceding element. Specifically, the group sums are placed back in their original order so that the first element of each group holds the sum of that group and all other elements are zero, and then the prefix sum of the entire vector is computed. However, this would copy the prefix sums of the group sums instead of the group sums themselves; therefore, $\mathsf{InvPrefixSum}$ is applied to the group sums beforehand.
Note that this protocol is also useful for copying a particular element of each group to all elements of the group: we clear all but the source elements to zero and then apply this protocol. This technique will be used in the following two protocols (Algorithms 3 and 4).
The protocol uses non-free operations in rounds.
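The following cleartext sketch mirrors the idea of Algorithm 2, using the helpers defined above; the oblivious steps of the protocol (gathering and redistributing the first elements of the groups without revealing their positions) are replaced here by direct index access:

```python
def group_sum(g, x):
    """Cleartext sketch of the group-wise sum: every output position receives
    the sum of the group that contains it."""
    n = len(x)
    s = suffix_sum(x)                              # reverse-ordered prefix sum of x
    firsts = [i for i in range(n) if g[i] == 1]    # first index of each group
    # The values s[first] are the reverse-ordered prefix sums of the group sums,
    # so the group sums are recovered by taking differences (InvSuffixSum).
    sums = [s[f] - (s[firsts[j + 1]] if j + 1 < len(firsts) else 0)
            for j, f in enumerate(firsts)]
    # Place the inverse prefix sum of the group sums at the group heads, zero
    # elsewhere, so that one global prefix sum broadcasts each group's sum to
    # every element of that group (Laud's copying trick).
    u = [0] * n
    for j, f in enumerate(firsts):
        u[f] = sums[j] - (sums[j - 1] if j > 0 else 0)
    return prefix_sum(u)

# group_sum([1, 0, 1, 1, 0, 0], [3, 1, 2, 2, 3, 2]) == [4, 4, 2, 7, 7, 7]   (Table 2)
```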
3.3 Our protocol for group-wise prefix sum
The group-wise prefix sum protocol privately computes the prefix sums within each group in our data structure. It receives a private group flag vector of length $n$ and a private internally grouped vector of values to be summed, and outputs a private vector of prefix sums for each group; that is, the $i$-th output element is the sum of the elements from the first element of the group containing the $i$-th element up to the $i$-th element. An example of input/output is shown in Table 2: columns 1 and 2 are the inputs, and column 4 is the output.
The protocol is shown in Algorithm 3. We first compute the prefix sum of the whole input vector. This looks almost done, but each value exceeds the desired one by the partial sum from the first element of the input up to the last element of the preceding group. Therefore, we subtract these partial sums to obtain the desired output. The element preceding the first element of a group in the global prefix sum is exactly the partial sum from the first element of the input up to the last element of the preceding group. Using this property, we construct a vector that contains these values at the first element of each group and zeros elsewhere. We then copy the first element of each group to the other elements of the group by applying the group-wise sum protocol, as described in Section 3.2. Finally, we subtract this vector from the global prefix sum to obtain the prefix sum for each group.
The protocol uses non-free operations in rounds.
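In cleartext, the idea of Algorithm 3 looks as follows (the broadcast of the per-group offsets, written here as a simple scan, is realized in the protocol with the group-wise copy technique of Section 3.2):

```python
def group_prefix_sum(g, x):
    """Cleartext sketch of the group-wise prefix sum: the global prefix sum
    minus, for each element, the total of everything before its group."""
    n = len(x)
    p = prefix_sum(x)
    offset = [0] * n
    for i in range(n):
        if g[i] == 1:
            offset[i] = p[i] - x[i]     # total of all elements before this group
        else:
            offset[i] = offset[i - 1]   # copied to the rest of the group
    return [p[i] - offset[i] for i in range(n)]

# group_prefix_sum([1, 0, 1, 1, 0, 0], [3, 1, 2, 2, 3, 2]) == [3, 4, 2, 2, 5, 7]
```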
3.4 Our protocol for group-wise max
The group-wise max protocol privately computes the maximum value of each group in our data structure. It receives a private group flag vector of length $n$ and a private internally grouped vector of length $n$, and outputs a private vector of the maximum values for each group; that is, the $i$-th output element is the maximum of the elements of the group containing the $i$-th element. An example is shown in Table 2: columns 1 and 2 are the inputs, and column 5 is the output.
The protocol is shown in Algorithm 4. First, we compute a vector whose $i$-th element is the maximum of the group elements up to the $i$-th element, i.e., a group-wise running maximum. The underlying idea is a doubling technique: suppose that, for each $i$, we know the maximum over a window of some width ending at $i$; then, by taking the maximum of this value and the value stored one window-width to the left, we double the width of the window. Starting from windows of width one, any group is covered after a logarithmic number of rounds. Since we want the maximum within the group rather than within a fixed window, we additionally keep, for each element, a flag indicating whether the first element of its group is already contained in the window; if the flag is set, the value is not updated further. Then, as in Algorithm 3, we copy the last element of each group, which now holds the group maximum, to the other elements of the group; here, an element is the last of its group if and only if it is the last element of the vector or the next group flag is $1$.
The protocol uses non-free operations in rounds, since each iteration in Algorithm 4 requires non-free operations in rounds.
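A cleartext sketch of the doubling idea of Algorithm 4 follows; the final broadcast of each group's last element is written here as a backward scan, whereas the protocol uses the copy technique of Section 3.2:

```python
def group_max(g, x):
    """Cleartext sketch of the group-wise max."""
    n = len(x)
    m = list(x)        # m[i]: max over the current window of the group ending at i
    f = list(g)        # f[i] = 1 once the window reaches the group's first element
    step = 1
    while step < n:    # O(log n) doubling rounds
        new_m, new_f = list(m), list(f)
        for i in range(n):
            if f[i] == 0:                        # group head not reached yet
                new_m[i] = max(m[i], m[i - step])
                new_f[i] = f[i - step]
            # if f[i] == 1, m[i] is already the running max of the whole group
        m, f = new_m, new_f
        step *= 2
    out = [0] * n
    for i in reversed(range(n)):                 # broadcast each group's last element
        is_last = (i == n - 1) or (g[i + 1] == 1)
        out[i] = m[i] if is_last else out[i + 1]
    return out

# group_max([1, 0, 1, 1, 0, 0], [3, 1, 2, 2, 3, 2]) == [3, 3, 2, 3, 3, 3]
```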
4 Our efficient decision tree training protocol
In this section, we present our decision tree training protocol. Given a private training dataset, it outputs the trained decision tree in a private form. Since the output decision tree is normalized for efficiency, we first describe the normalization in Section 4.1. Then, in Section 4.2, we explain how our protocol trains the tree in a layer-by-layer manner. This part contains the main idea that reduces the $2^h$ factor in the computational cost to $h$. The details of the batch test selection are deferred to Section 4.3. Note that the group-wise operations described in Section 3 are used throughout the protocols presented in this section.
4.1 Decision tree normalization for efficiency
Our training protocol outputs an equivalent normalized decision tree instead of the one that should have been obtained when training in the clear. Equivalent in this case means that the output for any given input is the same. Roughly speaking, our normalization aligns the heights of all leaf nodes to the upper bound on the tree height by inserting internal nodes that forward any sample to the false child node. Although this increases the number of nodes in the tree, in MPC, it reduces the data size and computational cost (though by just a constant factor), and simplifies the protocol.
Before describing the details of our normalization, let us recall the decision tree we originally wanted to compute, which is the output of Algorithm 1. It is a binary tree of height at most $h$, and all tests of its internal nodes are of the form $x^{(j)} < t$, where $j$ is an attribute number and $t$ is a threshold.
Our normalization consists of two modifications. The first modification is to change each test into an equivalent test that can be computed without division. In the original algorithm, computing a threshold involves a division, which is a costly operation in MPC; we avoid this by using an equivalent division-free test instead.
Our second modification is to align the heights of all leaf nodes without changing the tree's output. The modification is simple. Wherever a leaf node of height less than $h$ would appear, we insert an internal node that does not actually split: its test compares an attribute value with a sufficiently small public constant, so it always returns false, and its false edge leads to the original leaf, pushed one level down. Any input tuple that reaches this node passes through the false branch and reaches the original leaf, so the predicted label does not change. In the normalized tree, all nodes of height less than $h$ are internal nodes, and all nodes of height $h$ are leaf nodes. This makes our protocol simple and efficient.
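The second modification can be pictured as the following cleartext post-processing of a tree produced by Algorithm 1, using the tuple encoding from the sketch in Section 2.1.1 (our protocol produces the normalized tree directly; the choice of attribute 0 for the dummy test is illustrative):

```python
def normalize(tree, height, max_height, c_small):
    """Pad every leaf to depth max_height with dummy internal nodes whose test
    'x[0] < c_small' always fails (c_small is below any attribute value), so
    every sample follows the false branch down to the original leaf."""
    if tree[0] == "leaf":
        if height == max_height:
            return tree
        padded_leaf = normalize(tree, height + 1, max_height, c_small)
        return ("node", 0, c_small, None, padded_leaf)   # true branch never taken
    _, j, t, true_child, false_child = tree
    return ("node", j, t,
            normalize(true_child, height + 1, max_height, c_small),
            normalize(false_child, height + 1, max_height, c_small))
```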
4.2 Our layer-by-layer training protocol
This section describes the main part of our decision tree training protocol. We construct a decision tree by training all nodes of the same height in a batch, layer by layer, while keeping the input and output secret. Training samples assigned to different nodes of the same layer are processed as internally separate groups using the protocols proposed in Section 3. This improves the $2^h$ factor of the communication complexity in [AEV21] to $h$.
4.2.1 Encoding of inputs and outputs
Our decision tree training protocol receives a private training dataset and a public upper bound $h$ on the height of the tree, and outputs a private decision tree of height $h$. The training dataset consists of $n$ samples, each of which consists of an input tuple and a binary value called a class label. Each input tuple consists of $m$ numerical input attribute values. Our protocol receives the dataset as $m$ private vectors of length $n$, one per input attribute, and one private vector of length $n$ containing the class labels. That is, the $i$-th input tuple of the training dataset and its associated class label correspond to the $i$-th elements of these vectors.
The output tree is a normalized binary tree as described in Section 4.1. It is stored in private vectors. Since all nodes of height less than $h$ are internal nodes, the node number, attribute number, and threshold of each such node are stored in three vectors, one entry per node. Since all nodes of height $h$ are leaf nodes, the node number and leaf label of each leaf are stored in two vectors. The length of each vector for the $k$-th layer is $\min(2^k, n)$, the maximum possible number of nodes in that layer. If the actual number of nodes is smaller than the length of the vector, the remaining entries are filled with a dummy value. The vectors of each layer are collectively called the layer information of that layer.
In order to represent the edges between nodes, each node is assigned an integer node number. The only node of height $0$ is the root, and its node number is $0$. For each child node (of height $k+1$) of a node of height $k$ with node number $v$, we assign node number $2v$ to the false child (if any) and node number $2v + 1$ to the true child (if any). With this numbering scheme, the node numbers in the $k$-th layer are distinct values from $\{0, 1, \ldots, 2^k - 1\}$.
4.2.2 The main protocol of our decision tree training
The main protocol of our decision tree training is shown in Algorithm 5. It trains the decision tree layer by layer, from the $0$-th layer down to the $h$-th layer. Samples and associated values are stored in our private grouping data structure as internally grouped vectors. Throughout the training, a group flag vector represents the grouping of the samples into the nodes of the current layer, and internally grouped vectors grouped by it store the node numbers, the input attribute values, and the class labels. At the $0$-th layer, all samples are initialized to be assigned to the root node, whose node number is $0$. Then, each layer is trained iteratively.
At each iteration, we first train the nodes of the current layer and compute the test result of each sample. This is done by the internal-node training protocol, which we will describe in Section 4.2.3. The test result of each sample is a bit, where $0$ and $1$ denote false and true, respectively.
The node numbers and group flags for the next layer are computed next. The vectors of the next layer are then obtained by stably sorting the node numbers, the group flags, the input attribute values, and the class labels, using the test results as the sort key. Thanks to the stability of the sort, both the correspondence between the values of each sample and the contiguity of elements of the same group are maintained.
Let us verify the correctness of the node numbers and group flags for the next layer. Let $u$ be a node at the $k$-th layer with node number $v$. The node number of a child of $u$ is $2v$ for a false child and $2v + 1$ for a true child. Thus, twice the node number plus the test result, computed for each sample, is that sample's node number in the next layer. As for the group flags, since the splitting of the groups is stable, the first element with test result $0$ and the first element with test result $1$ in each group become the first elements of the corresponding groups after the split. These positions can be detected by a subprotocol that finds the first $1$ of each group, applied to the test results and to their negations. We can find the first $1$ of each group by detecting the elements whose value is $1$ and whose group-wise prefix sum is also $1$, as shown in Algorithm 6. The protocol uses non-free operations in rounds.
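The layer transition can be sketched in cleartext as follows (Python, 0-based indices; we assume the root-0 numbering above and that the new group flags are computed before the sort and carried through it; group_prefix_sum is from the sketch in Section 3.3):

```python
def group_first_one(g, b):
    """Flag the first element with b == 1 in each group: b[i] == 1 and the
    group-wise prefix sum of b at position i equals 1 (cf. Algorithm 6)."""
    ps = group_prefix_sum(g, b)
    return [1 if b[i] == 1 and ps[i] == 1 else 0 for i in range(len(b))]

def next_layer(g, u, b, payload):
    """One layer transition of Algorithm 5: g group flags, u node numbers,
    b test results, payload further per-sample vectors (attributes, labels)."""
    n = len(g)
    u_next = [2 * u[i] + b[i] for i in range(n)]            # child node numbers
    first1 = group_first_one(g, b)                          # first true sample per group
    first0 = group_first_one(g, [1 - v for v in b])         # first false sample per group
    g_next = [first0[i] | first1[i] for i in range(n)]      # new group boundaries
    order = sorted(range(n), key=lambda i: b[i])            # stable one-bit-key sort
    pick = lambda v: [v[i] for i in order]
    return pick(g_next), pick(u_next), [pick(v) for v in payload]
```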
In the last step of Algorithm 5, we train the leaf nodes in a batch by invoking the leaf-node training protocol, which we will describe in Section 4.2.4, and obtain the output vectors for height $h$.
The decision tree training protocol uses non-free operations in rounds, since the protocol uses non-free operations in , and the protocol uses non-free operations in rounds, as we will show in the following sections.
4.2.3 Batch training for internal nodes
This section describes the protocol for training the internal nodes of a layer in a batch, which we have been putting off. It receives the privately grouped dataset of the $k$-th layer, computes the best test for each node, and outputs the test results and the layer information of the $k$-th layer.
In Algorithm 7, we first compute the best test for each group using the global test selection protocol, which will be shown in Section 4.3.1. We then determine whether all class labels are the same within each group, which is part of the stopping criterion described in Section 2.1.1. This is computed by the protocol shown in Algorithm 8, which counts, for each group, the number of elements, the number of $0$ labels, and the number of $1$ labels, and decides from these counts whether all labels in the group are equal. It uses non-free operations in rounds. Next, for each group in which all labels are equal, we replace the selected test with the always-false dummy test of Section 4.1; specifically, the attribute number and the threshold of such a group are overwritten with the corresponding dummy values. The test results of all samples are then computed from the selected tests by the protocol of Algorithm 9, and finally the layer information is formatted by the protocol of Algorithm 10.
The protocol is shown in Algorithm 9. It computes the private results of applying the selected tests, given by the attribute numbers and thresholds, to the input tuples. For each sample and each input attribute, it computes a flag indicating whether that attribute is the selected one by an equality test, and uses the flags to select the value of the chosen attribute. It then compares the selected value with the threshold to obtain the test result of each sample. It uses non-free operations in rounds.
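In cleartext, the test application of Algorithm 9 amounts to the following (X is the list of the $m$ attribute vectors; attr and thr hold, for every sample, the attribute number and threshold selected for its node; attribute numbers are 0-based here):

```python
def apply_tests(attr, thr, X):
    """Select each sample's tested attribute value via equality flags,
    then compare it with the threshold."""
    n = len(thr)
    results = []
    for i in range(n):
        value = sum(int(attr[i] == j) * X[j][i] for j in range(len(X)))
        results.append(1 if value < thr[i] else 0)
    return results
```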
The protocol is shown in Algorithm 10. It removes redundant values from the given vectors. Since the node numbers, attribute numbers, thresholds, and leaf labels are all identical within a group, it is sufficient to keep only one element per group. Therefore, we clear all but the first element of each group with the dummy value and move the first elements to the front of the vector. Since the number of nodes in the $k$-th layer is at most $\min(2^k, n)$, we delete the trailing elements so that each vector has this length. It uses non-free operations in rounds.
The protocol uses non-free operations in rounds, since the protocol uses non-free operations in rounds, as we will show in Section 4.3.1.
4.2.4 Batch training for leaf nodes
This section describes the protocol for training the leaf nodes in a batch, which we have been putting off. It receives the privately grouped dataset of the $h$-th layer, computes the leaf label for each node, and outputs the layer information of the $h$-th layer. The protocol is shown in Algorithm 11. We compute the most frequent value of the output attribute in each group as the leaf label, which is the typical way to define leaf labels. The layer information (node numbers and leaf labels) is then formatted using the same formatting protocol as in the previous section (Algorithm 10).
The protocol uses non-free operations in rounds.
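With binary class labels, the most frequent label of each group can be obtained from group-wise sums alone, for example as in the following cleartext sketch (one natural realization; ties are broken towards label 0 here, and the protocol's exact tie-breaking is not specified above). It reuses group_sum from Section 3.2:

```python
def leaf_labels(g, y):
    """Leaf label of each group: 1 iff more than half of the group's labels are 1."""
    ones = group_sum(g, y)
    size = group_sum(g, [1] * len(y))
    return [1 if 2 * ones[i] > size[i] else 0 for i in range(len(y))]
```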
4.3 Our batch test selection protocol
In this section, we explain how to perform batch node-wise test selection on a dataset that is grouped by nodes and stored in our private grouping data structure. Like the standard test selection method in the clear, our batch test selection consists of three levels of components: one that selects the best test across all attributes, called the global test selection protocol; one that selects the best test for splitting by a specific attribute, called the attribute-wise test selection protocol; and one that computes the measure, called the modified Gini index protocol. We introduce them in order. Thanks to our group-wise operations, all of them are almost straightforward to construct.
4.3.1 Global test selection
The global test selection protocol computes the best test across all attributes for each node in a batch. The algorithm is straightforward: it calls the attribute-wise test selection protocol to compute the best test for each attribute and then selects the best test among them. Since the attribute-wise test selection protocol assumes that the data is already sorted within each group for the given attribute, this protocol is responsible for sorting within the groups before calling attribute-wise test selection.
The protocol is shown in Algorithm 12. It receives the training data (input attribute values and class labels) privately grouped by nodes, and outputs the information (attribute number and threshold) of the best test for each group. For each input attribute, it sorts the attribute values and class labels within each group and selects the best test of each group for splitting on that attribute. It then selects, for each element, the best test among all attributes. Since the output of the attribute-wise test selection protocol is identical for all elements of a group, it is sufficient to do this for each element independently.
The protocol is almost identical to the algorithm in the clear and to the protocol in [AEV21]. The difference is that we need to sort within each group. Recalling that the group flag vector is a bit vector in which only the first element of each group is $1$, we can see that its prefix sum assigns a distinct, ascending value to each group. Thus, we can sort within each group by using this prefix sum as the primary key and the attribute values as the secondary key in lexicographic order.
The protocol uses non-free operations in rounds, since protocol uses non-free operations in rounds, as we will show in the next section.
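In cleartext, this within-group sort is simply a stable lexicographic sort on the pair (group identifier, attribute value), where the group identifier is the prefix sum of the group flags (prefix_sum as in Section 3.2):

```python
def sort_within_groups(g, x, y):
    """Sort x (and carry y along) inside each group without mixing groups."""
    gid = prefix_sum(g)                                   # distinct, ascending per group
    order = sorted(range(len(x)), key=lambda i: (gid[i], x[i]))
    return [x[i] for i in order], [y[i] for i in order]
```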
4.3.2 Attribute-wise test selection
The attribute-wise test selection protocol computes the best test in each group for a given numerical input attribute. It assumes that the input attribute values and class labels are sorted with respect to the attribute values within each group. It implements, on our data structure and using the group-wise operations proposed in Section 3, the technique by Abspoel et al. [AEV21] that reduces the candidate thresholds to the splits between adjacent sorted elements. We use the modified Gini index protocol, which will be described in Section 4.3.3, to compute the measure.
The protocol is shown in Algorithm 13. It receives a private group flag vector, a private numerical input attribute vector, and a private class label vector; the attribute values and class labels are sorted with respect to the attribute values within each group. The outputs are the thresholds and scores of the best tests in each group.
We show that the protocol computes the best test in each group for the given numerical input attribute. Since the attribute values and class labels are sorted within each group with respect to the attribute values, it is sufficient to consider only splits between two adjacent elements in each group [AEV21]. First, the threshold and the score of the split between the $i$-th and $(i+1)$-th elements are computed for every $i$. If the $i$-th element is the last element of its group, or if it has the same attribute value as the next element, we cannot split between the $i$-th and $(i+1)$-th elements; in this case, the score and the threshold are overwritten with dummy values that can never be selected. Finally, the score and the threshold of the element whose score is maximal within a group are copied to the other elements of the group.
The protocol uses non-free operations in rounds, since protocol uses non-free operations in rounds, as we will show in the next section.
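A cleartext sketch of the attribute-wise selection follows. For readability the threshold is taken to be the midpoint of two adjacent values (the protocol instead uses an equivalent division-free test, cf. Section 4.1), the group-wise vector-max of the protocol is emulated with group_max from Section 3.4, and group_modified_gini is sketched after Section 4.3.3:

```python
def attrwise_test_selection(g, x, y):
    """Best split of each group for one attribute; x, y sorted by x within groups.
    Returns, for every element, the best threshold and score of its group."""
    n = len(x)
    score = group_modified_gini(g, y)          # score of the split between i and i+1
    thr = [(x[i] + x[i + 1]) / 2 if i + 1 < n else 0 for i in range(n)]
    for i in range(n):
        last_in_group = (i == n - 1) or (g[i + 1] == 1)
        if last_in_group or x[i] == x[i + 1]:
            score[i], thr[i] = -1, 0           # splitting between i and i+1 not allowed
    best_score = group_max(g, score)
    marked = [thr[i] if score[i] == best_score[i] else float("-inf") for i in range(n)]
    best_thr = group_max(g, marked)            # threshold of a maximal-score element
    return best_thr, best_score
```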
4.3.3 Modified Gini index
We present a protocol to compute the modified Gini index for a privately grouped dataset. Thanks to the group-wise operations proposed in Section 3, the formula by Abspoel et al. [AEV21] in Equation 1 can be used almost directly.
The protocol is shown in Algorithm 14. The input is a private group flag vector and a private class label vector, which is sorted by an input attribute within each group. The output is a private vector in which each element represents the modified Gini index for the split between that element and the next one within its group.
Since the class label vector is a bit vector, its group-wise prefix sum at position $i$ is the number of $1$'s up to the $i$-th element in the group, and the group-wise sum minus this prefix sum is the number of $1$'s after the $i$-th element in the group. In the same way we obtain the number of $0$'s and the number of elements up to and after the $i$-th element in the group. These six quantities are exactly the ones appearing in Equation 1, so we can evaluate the modified Gini index for the split between the $i$-th and $(i+1)$-th elements.
The protocol uses non-free operations in rounds.
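Finally, a cleartext sketch of the modified Gini index computation; it evaluates the fraction form of Equation 1 directly, whereas the MPC protocol avoids the division (cf. Section 1 and [AEV21]). It reuses group_sum and group_prefix_sum from Sections 3.2 and 3.3:

```python
def group_modified_gini(g, y):
    """Score of the split between positions i and i+1 within each group,
    for a binary label vector y sorted by the attribute within each group."""
    n = len(y)
    ones_up = group_prefix_sum(g, y)             # number of 1's up to i in the group
    cnt_up = group_prefix_sum(g, [1] * n)        # number of elements up to i in the group
    ones_all = group_sum(g, y)                   # number of 1's in the whole group
    cnt_all = group_sum(g, [1] * n)              # size of the whole group
    score = []
    for i in range(n):
        zeros_up = cnt_up[i] - ones_up[i]
        ones_after = ones_all[i] - ones_up[i]
        zeros_after = (cnt_all[i] - cnt_up[i]) - ones_after
        n_left, n_right = cnt_up[i], cnt_all[i] - cnt_up[i]
        if n_left == 0 or n_right == 0:
            score.append(0.0)                    # no split possible at this position
            continue
        score.append((ones_up[i] ** 2 + zeros_up[i] ** 2) / n_left
                     + (ones_after ** 2 + zeros_after ** 2) / n_right)
    return score
```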
5 Demonstration of our protocol’s practicality
In order to show the practicality of our decision tree training protocol, we implemented it and measured the running time.
5.1 Implementation methods
| OS | CentOS Linux release 7.3.1611 |
|---|---|
| CPU | Intel Xeon Gold 6144 (3.50 GHz, 8 cores / 16 threads) × 2 |
| Memory | 768 GB |
We implemented our protocols on a Shamir's secret-sharing based three-party computation over a prime field. This 3PC scheme is secure against a single static corruption. For the ABB implementation, we used the comparison protocols by Kikuchi et al. [KIM+18] and the multiplication protocol by Chida et al. [CHI+19]. We also replaced some of the protocols built on the ABB with more efficient ones: the inner product, apply, unapply, and sortperm protocols are based on the ones by Chida et al. [CHI+19]. Our implementation includes several optimizations.
The protocols were implemented in C++. We measured on three servers with the same configuration, connected in a ring by Intel X710/X557-AT 10G network interfaces. The configuration of the servers is shown in Table 3.
5.2 Benchmarking results
| $h$ | Time [s] |
|---|---|
| 1 | 1.342 |
| 2 | 2.203 |
| 5 | 4.432 |
| 10 | 8.810 |
| 20 | 16.633 |
| 50 | 40.891 |
| $n$ | Time [s] |
|---|---|
| | 1.811 |
| | 2.092 |
| | 2.569 |
| | 4.432 |
| | 32.035 |
| $m$ | Time [s] |
|---|---|
| 1 | 2.469 |
| 2 | 2.641 |
| 5 | 3.069 |
| 10 | 4.432 |
| 20 | 7.790 |
| 50 | 19.111 |
| 100 | 39.401 |
To show the scalability of our protocol, we measured the running time for different values of the parameters $n$, $m$, and $h$. Starting from a base setting with $m = 10$, $h = 5$, and a fixed base value of $n$, we measured the execution time of training while varying $n$, $m$, and $h$. The results are shown in Tables 5, 6, and 4, respectively. Each runtime is the average of three measurements. The results show that the running time is approximately linear with respect to $n$, $m$, and $h$.
References
- [ACC+21] Samuel Adams, Chaitali Choudhary, Martine De Cock, Rafael Dowsley, David Melanson, Anderson C. A. Nascimento, Davis Railsback, and Jianwei Shen. Privacy-preserving training of tree ensembles over continuous data. CoRR, Vol. abs/2106.02769, 2021.
- [AEV21] Mark Abspoel, Daniel Escudero, and Nikolaj Volgushev. Secure training of decision trees with continuous attributes. Proc. Priv. Enhancing Technol., Vol. 2021, No. 1, pp. 167–187, 2021.
- [BFOS84] Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, 1984.
- [Bre01] Leo Breiman. Random forests. Mach. Learn., Vol. 45, No. 1, pp. 5–32, 2001.
- [CHI+19] Koji Chida, Koki Hamada, Dai Ikarashi, Ryo Kikuchi, Naoto Kiribuchi, and Benny Pinkas. An efficient secure three-party sorting protocol with an honest majority. IACR Cryptol. ePrint Arch., p. 695, 2019.
- [dHSCodA14] Sebastiaan de Hoogh, Berry Schoenmakers, Ping Chen, and Harm op den Akker. Practical secure decision tree learning in a teletreatment application. In Nicolas Christin and Reihaneh Safavi-Naini, editors, FC 2014, March 3-7, 2014, Christ Church, Barbados, Vol. 8437 of Lecture Notes in Computer Science, pp. 179–194. Springer, 2014.
- [FO21] Brett Hemenway Falk and Rafail Ostrovsky. Secure merge with O(n log log n) secure operations. In Stefano Tessaro, editor, ITC 2021, July 23-26, 2021, Virtual Conference, Vol. 199 of LIPIcs, pp. 7:1–7:29. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
- [Fri01] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232, 2001.
- [HKI+12] Koki Hamada, Ryo Kikuchi, Dai Ikarashi, Koji Chida, and Katsumi Takahashi. Practically efficient multi-party sorting protocols from comparison sort algorithms. In Taekyoung Kwon, Mun-Kyu Lee, and Daesung Kwon, editors, ICISC, Vol. 7839 of LNCS, pp. 202–216. Springer, 2012.
- [HKP11] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann, 2011.
- [HR76] Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Inf. Process. Lett., Vol. 5, No. 1, pp. 15–17, 1976.
- [KIM+18] Ryo Kikuchi, Dai Ikarashi, Takahiro Matsuda, Koki Hamada, and Koji Chida. Efficient bit-decomposition and modulus-conversion protocols with an honest majority. In Willy Susilo and Guomin Yang, editors, ACISP 2018, July 11-13, 2018, Wollongong, NSW, Australia, Vol. 10946 of Lecture Notes in Computer Science, pp. 64–82. Springer, 2018.
- [Lau15] Peeter Laud. Parallel oblivious array access for secure multiparty computation and privacy-preserving minimum spanning trees. Proc. Priv. Enhancing Technol., Vol. 2015, No. 2, pp. 188–205, 2015.
- [LP00] Yehuda Lindell and Benny Pinkas. Privacy preserving data mining. In Mihir Bellare, editor, CRYPTO, Vol. 1880 of LNCS, pp. 36–54. Springer, 2000.
- [LW14] Peeter Laud and Jan Willemson. Composable oblivious extended permutations. In Frédéric Cuppens, Joaquín García-Alfaro, A. Nur Zincir-Heywood, and Philip W. L. Fong, editors, FPS 2014, November 3-5, 2014, Montreal, QC, Canada, Vol. 8930 of Lecture Notes in Computer Science, pp. 294–310. Springer, 2014.
- [MR18] Payman Mohassel and Peter Rindal. ABY3: A mixed protocol framework for machine learning. In David Lie, Mohammad Mannan, Michael Backes, and XiaoFeng Wang, editors, CCS 2018, October 15-19, 2018, Toronto, ON, Canada, pp. 35–52. ACM, 2018.
- [Qui86] J. Ross Quinlan. Induction of decision trees. Mach. Learn., Vol. 1, No. 1, pp. 81–106, 1986.
- [Qui14] J Ross Quinlan. C4.5: programs for machine learning. Elsevier, 2014.
- [RWT+18] M. Sadegh Riazi, Christian Weinert, Oleksandr Tkachenko, Ebrahim M. Songhori, Thomas Schneider, and Farinaz Koushanfar. Chameleon: A hybrid secure computation framework for machine learning applications. In Jong Kim, Gail-Joon Ahn, Seungjoo Kim, Yongdae Kim, Javier López, and Taesoo Kim, editors, AsiaCCS 2018, June 04-08, 2018, Incheon, Republic of Korea, pp. 707–721. ACM, 2018.
- [Wak68] Abraham Waksman. A permutation network. J. ACM, Vol. 15, No. 1, pp. 159–163, 1968.
- [WGC19] Sameer Wagh, Divya Gupta, and Nishanth Chandran. Securenn: 3-party secure computation for neural network training. Proc. Priv. Enhancing Technol., Vol. 2019, No. 3, pp. 26–49, 2019.
- [Yao86] Andrew Chi-Chih Yao. How to generate and exchange secrets (extended abstract). In 27th Annual Symposium on Foundations of Computer Science, 27–29 October 1986, Toronto, Canada, pp. 162–167. IEEE Computer Society, 1986.