
Chromatic Learning for Sparse Datasets

Vladimir Feinberg
Sisu Data
[email protected]
Peter Bailis
Sisu Data
[email protected]
Abstract

Learning over sparse, high-dimensional data frequently necessitates the use of specialized methods such as the hashing trick. In this work, we design a highly scalable alternative approach that leverages the low degree of feature co-occurrences present in many practical settings. This approach, which we call Chromatic Learning (CL), obtains a low-dimensional dense feature representation by performing graph coloring over the co-occurrence graph of features—an approach previously used as a runtime performance optimization for GBDT training [1]. This color-based dense representation can be combined with additional dense categorical encoding approaches, e.g., submodular feature compression, to further reduce dimensionality [2]. CL exhibits linear parallelizability and consumes memory linear in the size of the co-occurrence graph. By leveraging the structural properties of the co-occurrence graph, CL can compress sparse datasets, such as KDD Cup 2012, that contain over 50M features down to 1024, using an order of magnitude fewer features than frequency-based truncation and the hashing trick while maintaining the same test error for linear models. This compression further enables the use of deep networks in this wide, sparse setting, where CL similarly has favorable performance compared to existing baselines for budgeted input dimension.

1 Introduction

Extremely sparse, high-dimensional datasets pose significant challenges to resource-efficient learning. In practice, these sparse datasets often arise by combining several disparate data sources, resulting in set-valued features [3, 4, 5]. Representing small subsets from a large set with a binary characteristic vector results in many zero-valued entries. For example, kdd12, a popular user behavior prediction dataset, describes topics a user has liked and other users they follow. The dataset has over 50M features in total but each row has at most 10 non-zero features. Many methods for learning over dense inputs, such as neural networks, remain largely intractable in this non-sequential sparse setting: even when restricting to the most frequent 1M features of kdd12, the Wide and Deep architecture [6] requires over 16GB of GPU memory for a 256-element minibatch, meaning these architectures do not scale computationally to the highly sparse regime. Recent work has developed techniques to learn directly from sparse datasets by encouraging model sparsity, such as Neural Factorization Machines [7], but such techniques still require inputs of fewer than 100K features due to memory constraints on input representation given modern hardware.

In this work, we leverage the structure of many real-world sparse datasets to demonstrate a novel dimensionality reduction technique. We observe that pairs of features rarely co-occur; in kdd12 over 99.9997% of all possible feature pairs never appear in any example simultaneously. LightGBM [1] uses this observation as a means of improving the runtime performance of training gradient-boosted decision trees (GBDTs). We generalize this approach to obtain accuracy improvements in the low-memory regime for a range of models beyond GBDTs, including linear models and deep networks, which depend on dense inputs. In contrast with the popular hashing trick [8, 9] that uses random hashing to reduce dimensions, this data-dependent approach exploits dataset structure to unlock substantial improvements in accuracy on real-world sparse datasets.

Our method, Chromatic Learning (CL), performs graph coloring to obtain a low-dimensional, dense, categorical feature representation, then applies additional dense reduction methods. First, CL creates a feature co-occurrence graph, where any features that co-occur have an edge connecting them. Since co-occurrence is rare, the resulting graph is sparse and has a low chromatic number. CL assigns each input feature a categorical variable based on its color in the graph coloring, thus representing sparse inputs using fewer categorical dimensions. This reduces memory usage because identically colored categories share the same embedding dimensions. Subsequently, CL applies one of several techniques for compressing dense categorical features, such as frequency-based truncation and submodular feature compression [2]. By enabling the application of these categorical feature compression techniques in sparse settings, CL exhibits substantial reductions in column count with equivalent accuracy compared to baselines with larger column budgets. Furthermore, representations from CL generalize because unseen examples exhibit the same sparse structure and rarely contain features identified with the same color.
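For concreteness, the following minimal sketch (our own illustration, not the authors' released code; all function and variable names are ours) builds the feature co-occurrence graph from a toy list of sparse examples, greedily colors it, and emits the color-indexed categorical representation described above.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(examples):
    """Build adjacency sets over features that ever co-occur in an example."""
    adj = defaultdict(set)
    for active in examples:          # `active` is the set of nonzero feature ids
        for u, v in combinations(sorted(active), 2):
            adj[u].add(v)
            adj[v].add(u)
        for f in active:             # ensure isolated features still appear as vertices
            adj.setdefault(f, set())
    return adj

def greedy_coloring(adj):
    """Assign each feature the smallest color unused by its already-colored neighbors."""
    color = {}
    for v in sorted(adj, key=lambda v: -len(adj[v])):  # high-degree vertices first
        taken = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

def chromatic_encode(active, color, n_colors):
    """One categorical value per color; 0 means no feature of that color is present."""
    row = [0] * n_colors
    for f in active:                 # within a training example, colors never collide
        row[color[f]] = f + 1        # shift by one so 0 can mean "absent"
    return row

examples = [{0, 3}, {1, 3}, {2, 4}, {0, 4}]
adj = cooccurrence_graph(examples)
color = greedy_coloring(adj)
n_colors = max(color.values()) + 1
print([chromatic_encode(t, color, n_colors) for t in examples])
```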

We demonstrate the efficacy of CL by learning linear models directly on the compressed space of colors and achieve the same level of test error using $10\times$ fewer features than frequency-based truncation and hashing trick approaches across four benchmark datasets. Additionally, we show that in low-dimensional settings CL improves classification performance for a variety of neural network architectures including Factorization Machines [10], Wide and Deep learning [6], Neural Factorization Machines [7], and DeepFM [11] compared to baseline dimensionality reduction methods including the hashing trick.

2 Related Work

Our contributions relate to several lines of research literature: hashing-based kernels, gradient-boosted trees, and submodular optimization. In this section, we review each.

2.1 Hashing Trick

The hashing trick (HT) initially appeared in [8] as a method for dimensionality reduction. HT is a linear transformation $\phi:\mathbb{R}^{D}\rightarrow\mathbb{R}^{d}$ which reduces sparse vectors in a high $D$-dimensional space to a small $d$-dimensional one using two hashes $\eta:[D]\rightarrow[d]$ and $\xi:[D]\rightarrow\{\pm 1\}$ [12], $\phi_{i}(\mathbf{x})=\sum_{j:\eta(j)=i}\xi(j)x_{j}$, which approximately preserves linear inner products (i.e., $\operatorname{\mathbb{E}}\left[\langle\phi(\mathbf{x}),\phi(\mathbf{y})\rangle\right]=\langle\mathbf{x},\mathbf{y}\rangle$ with low variance) and thus reconstructs a linear kernel on the original space $\mathbb{R}^{D}$.
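As a point of reference, here is a minimal sketch of the transformation $\phi$ (our own illustration; the salted hashes below are stand-ins built from Python's hashlib, not the hash functions of any particular library):

```python
import hashlib
import numpy as np

def _hash(j, salt):
    """Deterministic 64-bit hash of feature index j under a salt."""
    h = hashlib.blake2b(f"{salt}:{j}".encode(), digest_size=8)
    return int.from_bytes(h.digest(), "little")

def hashing_trick(x_indices, x_values, d):
    """phi_i(x) = sum over j with eta(j) = i of xi(j) * x_j, for a sparse input."""
    phi = np.zeros(d)
    for j, xj in zip(x_indices, x_values):
        eta = _hash(j, "bucket") % d                  # eta: [D] -> [d]
        xi = 1.0 if _hash(j, "sign") % 2 else -1.0    # xi: [D] -> {+1, -1}
        phi[eta] += xi * xj
    return phi

# inner products are approximately preserved in expectation over the hash choice
x = hashing_trick([3, 17, 42], [1.0, 1.0, 1.0], d=16)
y = hashing_trick([3, 99], [1.0, 1.0], d=16)
print(float(x @ y))   # close to <x, y> = 1 on average
```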

Recently, the structural requirements of HT were characterized by an upper bound on the ratio $\left\lVert\mathbf{x}\right\rVert_{\infty}/\left\lVert\mathbf{x}\right\rVert_{2}\leq\nu$ for all inputs $\mathbf{x}$ [13]: with probability $1-\delta$, if $d=\Omega\left(\epsilon^{-2}\log\frac{1}{\delta}\right)$ and $\nu=\tilde{O}\left(\sqrt{\epsilon}\right)$, then the relative error between $\left\lVert\phi(\mathbf{x})\right\rVert_{2}$ and $\left\lVert\mathbf{x}\right\rVert_{2}$ is at most $\epsilon$ (a condition related to preserving inner products through the parallelogram law). The $\nu=\tilde{O}\left(\sqrt{\epsilon}\right)$ condition points to the generality of HT. For $k$-sparse binary vectors in $\mathbb{R}^{n}$, $\nu\leq k^{-1/2}$; however, this relies on at least $k$ non-zero values for every sparse input. For a bag-of-words representation, a $1/k$ relative norm error would require sentences of at least $k$ words. An intuitive question that we seek to answer in this work is whether there are dimensionality reduction mechanisms that can take advantage of stronger structure than $\nu$, and even benefit from having few non-zero entries.

2.2 Exclusive Feature Bundling

Gradient-boosted decision trees (GBDTs) are a supervised learning algorithm for learning over a base decision tree (DT) [14]. For a weak hypothesis class of DTs $\mathcal{F}$, GBDTs provide a mechanism to learn in the larger class of the additive closure $\mathcal{G}$ of $\mathcal{F}$, where, given a running weighted sum $F_{t}=\sum_{i=1}^{t}w_{i}f_{i}$ for $f_{i}\in\mathcal{F}$, a new $f_{t+1}$ is fit to the gradient of the loss at each data point, $\partial_{F_{t}(\mathbf{x})}\ell(F_{t}(\mathbf{x}),y)$, and then added as $F_{t+1}=F_{t}+w_{t+1}f_{t+1}$ with an appropriately chosen weight.

While this approach allows search over the larger class $\mathcal{G}$, DT training at each iteration is expensive unless alleviated through specialized techniques like exclusive feature bundling (EFB) in [1]. In that work, the authors notice that DT training can be accelerated by reducing dataset width. EFB builds a co-occurrence graph on the set of categorical features $V$ with an edge between two vertices if they ever co-occur (i.e., attain nonzero values in at least one training example). Coloring this graph $G=(V,E)$ provides a map from vertices to colors, where vertices sharing a color must be mutually exclusive among all observed training examples. [1] uses an incremental but serial binary adjacency matrix construction with an $O(V^{2})$ greedy coloring run time.

Unifying each of these color sets into a single categorical variable reduces DT training cost, improving GBDT training speed overall. Similar dimensionality reduction techniques have been observed in other domains, such as register allocation [15]. However, many estimators, such as linear models, Gaussian Process classifiers, Factorization Machines, and neural networks, require numerical inputs. Broader application of EFB is thus limited because one-hot encoding, typically used for pre-processing categorical features, inverts bundling: the one-hot encoding of a bundled feature is equivalent to the concatenation of one-hot encodings for its constituents. In this work, we present a method that adapts the general idea of compression via coloring co-occurrences but is suitable for use in models that require numerical inputs.

2.3 Submodular Optimization for Categorical Feature Compression

Figure 1: Training and test curves with and without sample splitting using categorical feature compression. The Criteo dataset was preprocessed by taking logarithms of its count features, which is standard for this dataset.

Recent work [2] shows that the problem of quantizing a fine categorical variable with $D$ values to one with $d$ values while preserving as much mutual information with the binary label as possible is monotone submodular.

A set-valued function $f:2^{V}\rightarrow\mathbb{R}$ is monotone if $f(A)\leq f(B)$ for $A\subset B\subset V$ and submodular if the gain $\Delta(x|Y)=f\left(\{x\}\cup Y\right)-f(Y)$ satisfies $\Delta(x|A)\geq\Delta(x|B)$ for $A\subset B\subset V$ and $x\in V\setminus B$. Such functions admit deterministic $\left(1-e^{-1}\right)$-approximate maximization procedures in pseudo-polynomial time [16] and randomized polynomial-time algorithms [17] for finding $\operatorname*{argmax}_{T}f(T)$.

In a classification setting with a single $[D]$-valued categorical feature $X$ and a binary label $Y$, [18] show that the mapping $Z:[D]\rightarrow[d]$ which maximizes $I(Z(X);Y)$ is defined by $d+1$ splitters $S=\{s_{0},\cdots,s_{d}\}$ with $Z_{S}(x)=i$ when $s_{i-1}<x\leq s_{i}$. [2] prove that selecting the set $S\subset[D]$ to maximize $I(Z_{S}(X);Y)$ is a monotone submodular maximization problem.

We found direct application of the method unsuccessful. Using label information for featurization results in target leakage [19]: [2] groups together features with similar conditional positive label probability, which artificially reduces label variance within each quantized feature cluster and results in overfitting. We verify this on the original dataset used in [2], the Criteo Ad-Click Prediction dataset. To reduce variance from the fitting procedure for evaluation, we train a logistic regression, but note that for such a cheap learner, dimensionality reduction is not necessary for the end classification goal. Figure 1 shows the training and testing log loss of categorical feature compression applied to estimates of conditional probability made from the training set itself, compared to a simple fix, data splitting. With an even split, we estimate conditional probabilities with half of the data and train on the other half. (On a practical note, this can be done deterministically without requiring another copy of the data by hashing each example and using the first bit of the resulting hash to split the training data.) We find that splitting is crucial for low test loss and, as expected, training loss is conspicuously low when double-dipping.
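A minimal sketch of the deterministic split just described (illustrative only; the hash choice and example identifier are our assumptions, not the paper's implementation):

```python
import hashlib

def split_half(example_id):
    """Route an example to the 'estimate' or 'train' half by the first bit of its hash."""
    h = hashlib.blake2b(str(example_id).encode(), digest_size=8).digest()
    return "estimate" if h[0] & 1 else "train"

counts = {"estimate": 0, "train": 0}
for i in range(100_000):
    counts[split_half(i)] += 1
print(counts)  # roughly even, reproducible, and requires no second copy of the data
```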

A few challenges with the application of [2] remain:

  1. Many datasets (Sec. 4) have millions of binary features, rather than a few categorical ones with many values.

  2. [2] extend their method to multiple categorical variables with a heuristic, but it does not perform well when significantly reducing dimension (Sec. 3.3).

3 Chromatic Learning

We present CL for reducing the dimensionality of sparse datasets for use in models that expect dense, numerical inputs. For simplicity of presentation, we assume dense, continuous features are set aside and focus on reducing the dimensionality of large, sparse vectors with binary features.

CL requires performing a parallelizable reduction over the data to collect the set of features $V$ and feature co-occurrences $E$ that comprise the graph $G$ from Sec. 2.2. After this collection, the graph $G$ is colored. On $P$ processors, this requires $O(V+E+P^{2})$ memory and $O\left(\nicefrac{\Delta V+E+n}{P}\right)$ serial run time for $n$ examples, where $\Delta$ is the maximum degree in $G$.

At this point, each feature in $V$ is identified with a color through a mapping $c$, representing a dense categorical dataset over $\left\lvert cV\right\rvert$ variables and $V$ distinct categorical values across all variables. This enables the application of several categorical encodings. We describe an extension to submodular feature compression and refer to [20] for a description of target encoding.

3.1 Chromatic Representation

Suppose our training set consists of iid examples $\mathbf{x}_{i}\in 2^{D}$. Associate with each $\mathbf{x}_{i}$ its set of active indices $T_{i}\subset[D]$, and consider the co-occurrence graph $G$ defined by vertices $V=\bigcup_{i}T_{i}$ and edges $E=\bigcup_{i}K(T_{i})$, with $K(\cdot)$ generating the edges of the complete graph on its argument, a set of vertices. Given a proper coloring of $G$, $c:V\rightarrow\mathbb{N}$, we show that the representation of the data defined by categorical vectors $\mathbf{v}$ with $\left\lvert cV\right\rvert$ categorical variables (where $v_{i}$ has cardinality $\left\lvert c^{-1}\{i\}\right\rvert$) permits learning with low generalization error.

In particular, any Lipschitz-smooth decision function on the original space $2^{D}$ can be approximated by one that operates on the chromatic representation $\mathbf{v}$. By construction this holds for all training examples, but for a test example $T\subset[D]$, two features may have identical colors. In this case, the example must be approximated by an input with one of the colliding features missing. (Given two features mapping to the same color outside the training set, the more popular feature is dropped for our evaluation.)

For a given $T\subset[D]$, let $\mathrm{CC}(G,T)$ be the count of color collisions, i.e., $\left\lvert T\right\rvert-\left\lvert cT\right\rvert$. By smoothness, bounding $\operatorname{\mathbb{E}}\mathrm{CC}(G,T)$ implies low average discrepancy between the decision function on the two representations (Appendix A). We find that a greedy coloring has an average color collision count less than one on all our sparse benchmark datasets (Tab. 1).
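A minimal sketch (ours) of measuring the average collision count $\mathrm{CC}(G,T)=|T|-|cT|$ on held-out examples, given any coloring; features unseen during training are skipped here for simplicity:

```python
def collision_count(active, color):
    """CC(G, T) = |T| - |cT| for one example, ignoring features unseen in training."""
    seen = [f for f in active if f in color]
    return len(seen) - len({color[f] for f in seen})

def avg_collisions(test_examples, color):
    return sum(collision_count(t, color) for t in test_examples) / len(test_examples)

# toy coloring: features 0 and 1 happen to share color 0
color = {0: 0, 1: 0, 2: 1, 3: 2, 4: 1}
print(avg_collisions([{0, 1, 2}, {2, 3}, {0, 4}], color))  # -> 0.333...
```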

Table 1: Graphical properties derived from the training split of benchmark datasets (Tab. 2). $V$ is the set of vertices of the co-occurrence graph generated from the unique sparse binary features found in the data. CC refers to the average collision count per example in the test set, using a greedy coloring of the graph generated by the training set.
dataset  avg. edges per ex.  $\left\lvert V\right\rvert$  avg. degree  colors  CC  avg. nnz
url 481 2.71M 74 395 0.40 29
kdda 682 19.31M 129 103 0.30 36
kddb 439 28.88M 130 79 0.21 29
kdd12 27 50.33M 32 22 0.09 7

3.2 Scalable Coloring

The previous section motivates constructing the graph $G$ explicitly to obtain a proper coloring. However, construction of $G$ can potentially be expensive. We describe how to construct $G$ efficiently.

First, we require the union $E$ of the edge sets $K(T_{i})$. We process the dataset in $P$ parallel chunks on mapper threads. We maintain $W=\alpha P$ global sets of edges, where $\alpha$ is the average ratio of the time it takes to generate edges from each point to the time it takes to add such edges to a hash set. We split the hash space of edges into $W$ parts. Each mapper allocates $W$ local buffers and, upon processing $T_{i}$, adds edges from $K(T_{i})$ to the local buffer corresponding to each edge's shard. Once a buffer is filled, the mapper locks the corresponding global set of edges and merges its local edges into it.

The expected contention time on each global set's mutex is constant by the choice of $\alpha$. (In practice, we can choose $\alpha=1$, as hashing the edge, which is done locally on each mapper, is the most expensive operation.) As a result, every edge only requires a constant amount of time to process. Each mapper only requires $O(P)$ local working memory, independent of the data, in contrast to tree-merge paradigms such as map-reduce. This hash-join approach can be applied to vertices as well and can be extended by storing aggregate statistics for each vertex, as is necessary for Sec. 3.3.
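A simplified single-machine sketch of this sharded hash-join (our illustration under stated assumptions: the shard count, buffer size, and hash function are placeholders, and a production implementation would size them as described above):

```python
import threading
from itertools import combinations
from hashlib import blake2b

W = 8                      # number of global edge shards (alpha * P in the text)
BUF = 1024                 # local buffer size before flushing to a shard
shards = [set() for _ in range(W)]
locks = [threading.Lock() for _ in range(W)]

def shard_of(edge):
    return int.from_bytes(blake2b(repr(edge).encode(), digest_size=4).digest(), "little") % W

def mapper(chunk):
    """One mapper thread: buffer edges locally, flush a buffer to its shard when it fills."""
    buffers = [[] for _ in range(W)]
    def flush(s):
        with locks[s]:
            shards[s].update(buffers[s])
        buffers[s].clear()
    for active in chunk:
        for edge in combinations(sorted(active), 2):
            s = shard_of(edge)
            buffers[s].append(edge)
            if len(buffers[s]) >= BUF:
                flush(s)
    for s in range(W):     # flush whatever remains at the end of the chunk
        flush(s)

chunks = [[{0, 3}, {1, 3}], [{2, 4}, {0, 4}]]
threads = [threading.Thread(target=mapper, args=(c,)) for c in chunks]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(set().union(*shards)))   # the collected edge set E
```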

Once $E$ is collected and converted in parallel to an adjacency list representation, $G$ is ready for approximate graph coloring, which requires $O\left(\nicefrac{V\Delta}{P}\right)$ time, where $\Delta$ is the maximum degree [21]. Converting to an adjacency list in parallel can be done by mapping over the edge set: first computing the degree of each vertex across threads with atomic increments, and next using a cumulative sum across the degree array to define atomic integral offsets in an adjacency array, which can then be filled in another parallel sweep across $E$.
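A serial sketch (ours) of the degree-count-then-prefix-sum conversion; the parallel version described above replaces these loops with atomic increments and a parallel scan:

```python
import numpy as np

def edges_to_adjacency(edges, n_vertices):
    """Convert an undirected edge list into CSR-style (offsets, neighbors) arrays."""
    degree = np.zeros(n_vertices, dtype=np.int64)
    for u, v in edges:                     # pass 1: count degrees
        degree[u] += 1
        degree[v] += 1
    offsets = np.zeros(n_vertices + 1, dtype=np.int64)
    np.cumsum(degree, out=offsets[1:])     # prefix sum defines each vertex's slot range
    cursor = offsets[:-1].copy()           # next free position per vertex
    neighbors = np.zeros(offsets[-1], dtype=np.int64)
    for u, v in edges:                     # pass 2: fill the adjacency array
        neighbors[cursor[u]] = v; cursor[u] += 1
        neighbors[cursor[v]] = u; cursor[v] += 1
    return offsets, neighbors

offsets, neighbors = edges_to_adjacency([(0, 3), (1, 3), (2, 4), (0, 4)], n_vertices=5)
print(offsets, neighbors)  # neighbors[offsets[v]:offsets[v+1]] lists v's neighbors
```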

3.3 Categorical Encodings

With each feature assigned a color, we view the dataset as a categorical input, with colors as categorical variables and the original input features as categorical values. This permits the use of several categorical encodings.

With target encoding (TE), the average label value replaces categories, yielding just $m$ numeric features, one for each color. With CL combined with frequency truncation (CL+FT) and a budget $b$, we embed the $b$ most frequent categories in $d$ dimensions, yielding an embedding layer that creates $(dm)$-sized embeddings with $bd$ parameters. For neural networks with a first hidden layer of size $h$, using frequency truncation (FT) alone requires the first layer to use $hbd$ parameters, which is prohibitive for all but small $b$.
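A minimal sketch (ours) of target encoding over the chromatic representation, replacing each (color, categorical value) cell with the mean training label; smoothing and the train/estimate split of Sec. 2.3 are omitted for brevity:

```python
from collections import defaultdict

def fit_target_encoding(rows, labels):
    """rows are chromatic rows (one categorical value per color); returns per-cell label means."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row, y in zip(rows, labels):
        for color, value in enumerate(row):
            sums[(color, value)] += y
            counts[(color, value)] += 1
    prior = sum(labels) / len(labels)        # fallback for unseen (color, value) cells
    means = {k: sums[k] / counts[k] for k in sums}
    return means, prior

def transform(row, means, prior):
    return [means.get((color, value), prior) for color, value in enumerate(row)]

rows = [[1, 0, 2], [1, 3, 0], [4, 3, 2]]     # m = 3 colors; values are feature ids, 0 = absent
labels = [1, 0, 1]
means, prior = fit_target_encoding(rows, labels)
print([transform(r, means, prior) for r in rows])
```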

In addition, we present an extension to submodular-based feature compression (SM). We depart from [2], which recommends compressing to a fixed budget of $b$ features by running categorical feature compression on each categorical variable $X_{i}$ to maximize $I(Z_{i}(X_{i});Y)$, sorting results by marginal gain across all features, and using the top $b$ selections for the final encoding. This results in suboptimal compression relative to our alternative, as the final sorting stage is based on marginal gains made with respect to each individual solution, so the top-$b$ features do not necessarily maintain any optimality properties.

Instead, we maximize the global submodular objective $\sum_{i}I(Z_{i}(X_{i});Y)$ across all colors $i$ simultaneously; adding submodular functions over disjoint inputs retains submodularity. This outperforms the sorting heuristic on the original Criteo task proposed by [2] (Fig. 2). Up to logarithmic factors, solving one submodular problem of size $V$ is faster than solving several of mixed sizes summing to $V$, requiring $O(\log^{2}V+V/P)$ time [22].

With this final encoding, an example $\mathbf{x}\in\mathbb{R}^{D}$ is transformed into one in $\mathbb{R}^{b}$ by taking every nonzero feature $j$ in $\mathbf{x}$ and setting the corresponding feature $Z_{c(j)}(j)$, which is then one-hot encoded.
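A minimal sketch (ours) of this final transformation, assuming each color's compression map $Z_i$ is given as a dictionary from original feature id to compressed slot, and that the per-color slots are laid out contiguously in the $b$-dimensional output; both layout choices are our illustrative assumptions:

```python
import numpy as np

def encode_example(active, color, Z, offsets, b):
    """One-hot encode a sparse example into R^b via the per-color compressions Z_c."""
    out = np.zeros(b)
    for j in active:                       # nonzero features of x
        c = color[j]
        slot = Z[c].get(j, 0)              # 0 acts as each color's "other/rare" bucket
        out[offsets[c] + slot] = 1.0
    return out

# toy setup: 2 colors with 3 slots each, so b = 6
color = {0: 0, 1: 0, 2: 1, 3: 1}
Z = {0: {0: 1, 1: 2}, 1: {2: 1, 3: 2}}
offsets = {0: 0, 1: 3}
print(encode_example({0, 3}, color, Z, offsets, b=6))  # -> [0. 1. 0. 0. 0. 1.]
```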

Figure 2: Performance on Criteo ad-click prediction, a dense dataset with 13 numeric columns and 26 high-cardinality categorical columns, evaluated on a linear model at different budgets for encoded vocabulary, comparing our sum-of-mutual information objective to the sort-based one of [2].

4 Evaluation

In this section, we evaluate CL relative to alternative approaches to learning over sparse data by assessing the accuracy recovered by the same learning procedure applied across several reduction procedures at different budgets for the output dimension.

While our approach allows end-to-end parallelism, we found that on real datasets coloring, initialization, and vertex (feature) processing were not the bottlenecks and did not require parallelism. We considered a feature dense if it appeared in over $10\%$ of training rows. Sparse features with multiple associated numerical values were treated identically to binary ones. (Initial results showed this did not affect performance, so we elided these values; an alternative would track distinct feature values with a separate dictionary, binning for sparse and continuous data.)

We evaluate the standard sparse benchmark datasets listed in Table 2. All four evaluation sets contain millions of sparse binary features, with relatively few active per example. Only the url dataset [23] had numeric features. No additional preprocessing was performed on the retrieved data. (Datasets may be retrieved from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.)

Table 2: Dataset dimensions. All datasets were generated from sequential observations, so they were split chronologically. Average nnz denotes the average number of nonzero entries per row.
dataset train ex. test ex. dense feat. sparse feat. avg. nnz max nnz
url 1.68M 0.72M 134 2.71M 29 327
kdda 8.41M 0.51M 0 19.31M 36 85
kddb 19.26M 0.75M 2 28.88M 29 75
kdd12 119.71M 29.93M 7 50.33M 7 10

4.1 Dimensionality Reduction

A comparison of chromatic learning with submodular feature compression (CL+SM) to frequency-based truncation (FT) and the hashing trick (HT) shows favorable performance for linear estimators across a variety of budgets. We performed a single pass over the training data using Vowpal Wabbit with default parameters, which performs online adaptive gradient descent [9]. Our results illustrate that, to achieve equivalent log loss on the test set, FT requires an order of magnitude more features and HT requires two orders of magnitude more (Fig. 3).

Figure 3: Chromatic learning outperforms FT and HT for all input feature budgets under $2^{18}$ for datasets with significantly more training examples (kddb and kdd12). For datasets with single-digit millions of training points (url and kdda), CL performs better than the baselines on small budgets of input features but slightly worse for large ones. Since CL uses half of the training points for supervised learning, whereas HT and FT use all of them, it hits a generalization error floor earlier, but this floor can be lowered by tuning the split ratio mentioned in Sec. 2.3.

For very large budgets that approach the original dimensionality on the two smaller datasets, FT outperforms CL+SM. In this regime, where nearly the full input dimensionality is preserved, we suspect that the quantization may interfere with model confidence.

4.2 Non-Linear Sparse Learning

With a small budget, CL enables learning more sophisticated classifiers that are otherwise not trainable in limited-memory settings. We assess the performance of several deep learning models given different dimensionality reduction approaches in the highly compressed setting of 1024 features.

We used default parameters specified by the deep learning library [24], but reduced batch size to 256 (from 2048) because of an out-of-memory error on the Nvidia V100 GPUs we were using. The library was originally configured for dense datasets, such as Criteo. We reduced the epoch count to 5 (from 15) to account for the increased number of gradient steps and did not change any other settings.

We find that CL compares favorably in terms of test accuracy across several categorical encoding strategies compared to the FT baseline (Tab. 3). Across all 5 evaluated architectures and all 4 datasets, CL+SM outperforms both HT and FT, except for DeepFM on url, where it achieves a test log loss of 0.058, compared to FT's 0.053. Furthermore, CL+FT presents a simpler alternative to CL+SM which enables using the full dataset and many more features directly; on 2 of the 4 datasets, this improves upon the CL+SM representation with 1024 one-hot features.

Table 3: Final classifier test log loss across different models. Each model compares frequency-based truncation (FT) and hashing trick (HT) baselines against chromatic learning (CL) approaches of submodular feature compression (SM) and target encoding (TE), with the input dimension budget set to 1024. We notice that CL with frequency-based truncation at large budgets of $2^{18}$ and $2^{20}$ (CL+FT@18, CL+FT@20) performs better than the more sophisticated CL+SM approach on certain datasets. For such large budgets, the Wide and Deep model with FT alone runs out of memory or does not finish training within a day on an Nvidia Tesla V100 GPU.
learner encoder url kdda kddb kdd12
Wide and Deep [6] FT 0.046 0.326 0.314 0.172
HT 0.058 0.337 0.328 0.173
CL+SM 0.037 0.296 0.276 0.159
CL+TE 0.325 0.588 0.517 0.280
CL+FT@18 0.032 0.292 0.296 0.169
CL+FT@20 0.030 0.292 0.291 0.169
Logistic Regression FT 0.077 0.336 0.315 0.172
HT 0.087 0.359 0.330 0.175
CL+SM 0.048 0.308 0.283 0.159
CL+TE 0.198 0.571 0.423 0.204
Factorization Machines [10] FT 0.076 0.321 0.311 0.172
HT 0.086 0.339 0.323 0.170
CL+SM 0.063 0.303 0.282 0.159
CL+TE 0.253 0.691 0.665 0.206
Neural Factorization Machines [7] FT 0.045 0.320 0.309 0.172
HT 0.056 0.317 0.310 0.170
CL+SM 0.040 0.291 0.271 0.158
CL+TE 0.285 0.804 0.839 0.294
DeepFM [11] FT 0.053 0.321 0.313 0.172
HT 0.075 0.333 0.327 0.170
CL+SM 0.058 0.302 0.287 0.159
CL+TE 0.363 0.792 0.681 0.282

5 Discussion

In this work, we explore using graph coloring to generate virtual categorical variables that create a dense view of a sparse dataset. The strong empirical performance (Tab. 3) requires explanation, since the variables were created artificially.

The fact that the colors generated by a greedy coloring are effective at representing the dataset on unseen examples is surprising, because new co-occurrences (edges) appear in the test set frequently (Tab. 1). When new edges appear between features of different colors, the chromatic representation is lossless. However, in an adversarial setting, this data-dependent property may not hold. An alternative approach (which our early experiments deemed unnecessary) would choose a random, uniform coloring over $G$ instead. If $G$ has a low maximum degree $\Delta$, then such a coloring can be sampled by simulating Glauber dynamics with $m>2\Delta$ colors [25]. Furthermore, such sampling is internally parallelizable because each update is local to a vertex's neighborhood. Given any new co-occurrence in the test set between two features $x,y$, a uniform coloring must have assigned $x,y$ one of $m-\deg_{G}x-\deg_{G}y\geq m-2\Delta$ colors, limiting a color collision's probability by choice of $m$ regardless of the properties of the data's distribution. However, greedy coloring required fewer colors than a random one and did not encounter many color collisions in practice.

Second, how can we reconcile the relationship between the discrete graphical structure $G$ and a variable training set size $n$, which affects $G$? Given infinite data, would $G$ be complete? $G$ is a random graph obtained from sampling cliques from a true latent graph $L$ of co-occurrences. Since $G\subset L$, it suffices for the latent graph $L$ to be sparse. However, even if $L$ is complete, edges only contribute to collisions to the extent that they appear in test examples. Thus, for the co-occurrence graph $G=(V,E)$, the expected count of unseen edges, $\operatorname{\mathbb{E}}\left\lvert K(T)\setminus E\right\rvert$, provides a better proxy for the suitability of the chromatic representation than $L\setminus G$. Given a set of $n$ training examples, $\operatorname{\mathbb{E}}\left\lvert K(T)\setminus E\right\rvert$ admits a Good-Turing-type [26] upper bound, $\frac{1}{n}\sum_{i}\left\lvert K(T_{i})\setminus E_{i}\right\rvert$, where $E_{i}=\bigcup_{i^{\prime}\neq i}K(T_{i^{\prime}})$. This approach lets us analyze even sparser co-occurrence graphs, such as $G^{(2)}$, the co-occurrence graph with edges that appear at least twice in the training set. $G^{(2)}$ and its natural successors $G^{(3)},G^{(4)},\cdots$ may be efficiently constructed by using a rolling set of bloom filters to prune out edges that are only generated by a few training examples [27]. Appendix A elaborates on the trade-off between chromatic representation fidelity and colors used by $G^{(1)}=G$ and $G^{(k)}$, and how Good-Turing-type estimators can be used to forecast new edge incidence counts between the two co-occurrence construction approaches.
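A minimal sketch (ours) of the Good-Turing-type bound: counting edge multiplicities once gives $\frac{1}{n}\sum_{i}\left\lvert K(T_{i})\setminus E_{i}\right\rvert = f(1)/n$, the normalized count of edges seen exactly once, and more generally $N^{(k)}=\sum_{j\leq k}j\,f(j)$ as in Appendix A.

```python
from collections import Counter
from itertools import combinations

def edge_multiplicities(examples):
    """#(e): how many training examples each co-occurrence edge appears in."""
    counts = Counter()
    for active in examples:
        counts.update(combinations(sorted(active), 2))
    return counts

def good_turing_new_edges(examples, k=1):
    """N^(k)/n = (1/n) * sum_{j<=k} j * f(j), with f(j) = # edges seen exactly j times."""
    counts = edge_multiplicities(examples)
    freq_of_freq = Counter(counts.values())
    n = len(examples)
    return sum(j * freq_of_freq[j] for j in range(1, k + 1)) / n

examples = [{0, 3}, {1, 3}, {2, 4}, {0, 4}, {0, 3}]
print(good_turing_new_edges(examples, k=1))  # estimated new-edge mass per unseen example
```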

Besides making learning tractable for estimators that require dense inputs, one interesting implication of CL+SM is that it yields $n\times d$ design matrices where, typically, $d\ll n$. For small $d$, this opens up sketching approaches to learning whose superlinearity in input dimension otherwise makes them inaccessible to wide, sparse datasets, such as coreset construction for nearest-neighbor queries [28]. Further, in the case of linear models, an $n\times d$ design can be represented faithfully as a $(d+1)\times d$ one by Carathéodory's Theorem [29], which for small $d$ can greatly simplify linear learning (e.g., tuning regularization parameters no longer requires multiple passes over the data).

Beyond the supervised setting, co-occurrence graphs may be appealing from an unsupervised learning perspective: a weighted co-occurrence graph may be used to accelerate graphical model structure learning [30] by pruning the search space of log-linear models during forward selection.

6 Conclusion

We have introduced Chromatic Learning, a method that provides a viable representation of sparse data that enables otherwise-inaccessible learning methods, such as neural networks, to be applied in memory-constrained settings, as shown in Sec. 4.2. This presents several avenues for future work. Optimizing the tradeoff in the data split for submodular feature compression between estimating conditional probabilities and training may result in lower test error. In addition, a balanced coloring scheme may further reduce color collisions, improving accuracy. Finally, the approaches presented in this work illustrate that the co-occurrence graph is recoverable in explicit form for many high-dimensional, sparse datasets. This phenomenon may merit its own study.

Broader Impact

This work provides learning methods that scale effectively across many processors while limiting memory. Such methods encourage more organizations to adopt machine learning techniques because of the relative cost of horizontal versus vertical scaling and the overall cost of memory.

Open access to distributed and large-scale methods is important for leveling the playing field between organizations in general that wish to apply learning techniques on such large, realistic sparse data.

This may be positive or negative, depending on one's alignment with the values and goals of each individual organization that may apply chromatic sparse learning. Behind the veil, we believe increased accessibility of large-scale learning through cheaper processing of the same data is net positive.

Acknowledgments and Disclosure of Funding

The authors would like to thank Jon Gjengset, Greg Valiant, Barak Oshri, Arnaud Autef, and Kai Sheng Tai for helpful comments.

References

  • Ke et al. [2017] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in neural information processing systems, pages 3146–3154, 2017.
  • Bateni et al. [2019] Mohammad Hossein Bateni, Lin Chen, Hossein Esfandiari, Thomas Fu, Vahab S Mirrokni, and Afshin Rostamizadeh. Categorical feature compression via submodular optimization. arXiv preprint arXiv:1904.13389, 2019.
  • Haldar et al. [2019] Malay Haldar, Mustafa Abdool, Prashant Ramanathan, Tao Xu, Shulin Yang, Huizhong Duan, Qing Zhang, Nick Barrow-Williams, Bradley C Turnbull, Brendan M Collins, et al. Applying deep learning to airbnb search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1927–1935, 2019.
  • Guo and Berkhahn [2016] Cheng Guo and Felix Berkhahn. Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737, 2016.
  • Covington et al. [2016] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems, pages 191–198, 2016.
  • Cheng et al. [2016] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems, pages 7–10, 2016.
  • He and Chua [2017] Xiangnan He and Tat-Seng Chua. Neural factorization machines for sparse predictive analytics. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 355–364, 2017.
  • Moody [1989] John Moody. Fast learning in multi-resolution hierarchies. pages 29–39, 1989.
  • Agarwal et al. [2014] Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A reliable effective terascale linear learning system. The Journal of Machine Learning Research, 15(1):1111–1133, 2014.
  • Rendle [2010] Steffen Rendle. Factorization machines. In 2010 IEEE International Conference on Data Mining, pages 995–1000. IEEE, 2010.
  • Guo et al. [2017] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247, 2017.
  • Weinberger et al. [2009] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. pages 1113–1120, 2009.
  • Freksen et al. [2018] Casper B Freksen, Lior Kamma, and Kasper Green Larsen. Fully understanding the hashing trick. pages 5389–5399, 2018.
  • Friedman [2001] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
  • Chaitin [1982] Gregory J Chaitin. Register allocation & spilling via graph coloring. ACM Sigplan Notices, 17(6):98–101, 1982.
  • Minoux [1978] Michel Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization techniques, pages 234–243. Springer, 1978.
  • Mirzasoleiman et al. [2015] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. Lazier than lazy greedy. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • Kurkoski and Yagi [2014] Brian M Kurkoski and Hideki Yagi. Quantization of binary-input discrete memoryless channels. IEEE Transactions on Information Theory, 60(8):4544–4552, 2014.
  • Prokhorenkova et al. [2018] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. Catboost: unbiased boosting with categorical features. In Advances in neural information processing systems, pages 6638–6648, 2018.
  • Micci-Barreca [2001] Daniele Micci-Barreca. A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems. ACM SIGKDD Explorations Newsletter, 3(1):27–32, 2001.
  • Gebremedhin and Manne [2000] Assefaw Hadish Gebremedhin and Fredrik Manne. Scalable parallel graph coloring algorithms. Concurrency: Practice and Experience, 12(12):1131–1146, 2000.
  • Balkanski et al. [2019] Eric Balkanski, Aviad Rubinstein, and Yaron Singer. An exponential speedup in parallel running time for submodular maximization without loss in approximation. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 283–302. SIAM, 2019.
  • Ma et al. [2009] Justin Ma, Lawrence K Saul, Stefan Savage, and Geoffrey M Voelker. Identifying suspicious urls: an application of large-scale online learning. In Proceedings of the 26th annual international conference on machine learning, pages 681–688, 2009.
  • rixwew [2020] rixwew. Factorization machine models in pytorch. https://github.com/rixwew/pytorch-fm, 2020.
  • Jerrum [1995] Mark Jerrum. A very simple algorithm for estimating the number of k-colorings of a low-degree graph. Random Structures & Algorithms, 7(2):157–165, 1995.
  • Good [1953] Irving J Good. The population frequencies of species and the estimation of population parameters. Biometrika, 40(3-4):237–264, 1953.
  • McMahan et al. [2013] H Brendan McMahan, Gary Holt, David Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, Daniel Golovin, et al. Ad click prediction: a view from the trenches. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1222–1230, 2013.
  • Inaba et al. [1994] Mary Inaba, Naoki Katoh, and Hiroshi Imai. Applications of weighted voronoi diagrams and randomization to variance-based k-clustering. In Proceedings of the tenth annual symposium on Computational geometry, pages 332–339, 1994.
  • Maalouf et al. [2019] Alaa Maalouf, Ibrahim Jubran, and Dan Feldman. Fast and accurate least-mean-squares solvers. In Advances in Neural Information Processing Systems, pages 8305–8316, 2019.
  • Petitjean et al. [2013] Francois Petitjean, Geoffrey I. Webb, and Ann E. Nicholson. Scaling log-linear analysis to high-dimensional data. In IEEE International Conference on Data Mining, pages 597–606, 2013.
  • [31] David A McAllester and Robert E Schapire. On the convergence rate of good-turing estimators.
  • Matula and Beck [1983] David W Matula and Leland L Beck. Smallest-last ordering and clustering and graph coloring algorithms. Journal of the ACM (JACM), 30(3):417–427, 1983.
  • Dyer et al. [2006] Martin Dyer, Abraham D Flaxman, Alan M Frieze, and Eric Vigoda. Randomly coloring sparse random graphs with fewer colors than the maximum degree. Random Structures & Algorithms, 29(4):450–465, 2006.

Appendix A Reduced Co-occurrence Graphs

In this section, we elaborate on guarantees for using reduced co-occurrence graphs. Recall that we observe $n$ iid samples of vectors $\mathbf{x}_{i}\in 2^{D}$, equivalently represented by sets $T_{i}\subset[D]$ of active indices, with $K(T_{i})$ the edge set of the complete graph on vertices $T_{i}$. We assume a sparse setting, i.e., $\left\lvert T_{i}\right\rvert\leq\eta$.

Consider the set of vertices $V=[D]$ and the different co-occurrence graphs constructed over $V$ by using edges that appear at least $k$ times, $G^{(k)}$. These graphs satisfy the nesting property $G^{(k+1)}\subset G^{(k)}$. Denote the edge count $\#(e)=\left\lvert\left\{i\in[n]\,\middle|\,e\in K(T_{i})\right\}\right\rvert$ as well as the leave-one-out edge count $\#_{i}(e)=\left\lvert\left\{i^{\prime}\in[n]\setminus\{i\}\,\middle|\,e\in K(T_{i^{\prime}})\right\}\right\rvert$. This lets us build reduced co-occurrence graphs $G^{(k)}=(V,E^{(k)})$ containing edges $E^{(k)}=\left\{e\in\bigcup_{i}K(T_{i})\,\middle|\,\#(e)\geq k\right\}$ and their leave-one-out analogues $G^{(k)}_{i}=(V,E^{(k)}_{i})$ with $E^{(k)}_{i}=\left\{e\in\bigcup_{i^{\prime}\neq i}K(T_{i^{\prime}})\,\middle|\,\#_{i}(e)\geq k\right\}$.

Then we may construct Good-Turing-type estimators [31] for the expected count of new edges. With $T$ an independent sample from the same distribution as each $T_{i}$, our estimator $N^{(k)}=\sum_{i}\left\lvert K(T_{i})\setminus E_{i}^{(k)}\right\rvert$ upper bounds the average new edge count in expectation, $\operatorname{\mathbb{E}}N^{(k)}\geq n\operatorname{\mathbb{E}}\left\lvert K(T)\setminus E^{(k)}\right\rvert$, because $E_{i}^{(k)}\subset E^{(k)}$. When we color a graph $G^{(k)}$ with $m$ colors by a function $c:V\rightarrow[m]$, we induce a chromatic representation in the space $C=c^{-1}\{1\}\times c^{-1}\{2\}\times\cdots\times c^{-1}\{m\}$ (to represent absences, suppose $\bot\in V$ without loss of generality, where $c(\bot)=0$). For simplicity, we consider a fixed lossy transformation between $2^{D}$ and $C$, given by $\mathbf{x}\mapsto\mathbf{y}$, where $y_{j}=0\lor\inf\left\{i\,\middle|\,x_{i}=1,c(i)=j\right\}$. For certain machine learning algorithms, such as neural networks, learning on $C$ is much cheaper than on $2^{D}$ directly. Furthermore, categorical dimensionality reduction techniques may be applied to $C$, whereas they would be unavailable in $2^{D}$.

To analyze the fidelity of the space $C$ in representing vectors from the sampling distribution, we consider how lossy their one-hot encoding is in the original $2^{D}$ space. This transformation yields a color collision resolution function $\chi:2^{D}\rightarrow 2^{D}$ that describes the values achievable by the chromatic representation. If $\mathbf{x}$ exhibits no color collisions under $c$, then $\chi(\mathbf{x})=\mathbf{x}$. If $T$ is the set of active indices in $\mathbf{x}$, then $\chi$ relates to the color collision count through $\left\lVert\chi(\mathbf{x})\right\rVert_{1}=\left\lvert cT\right\rvert$. Given this setup, we can construct a range of chromatic representations, parameterized by the threshold $k$, with the following representation fidelity property for all $k$.

Theorem 1.

Consider a measure $\mu$ over $2^{D}$, a sample of $n$ independent observations $\mathbf{x}_{i}\sim\mu$ whose corresponding sets of active indices $T_{i}\subset[D]$ satisfy $\left\lvert T_{i}\right\rvert\leq\eta$, and fix any $k,m\in\mathbb{N}$. As above, construct the random graph $G^{(k)}$ over vertices $V=[D]$ with the union of edges from complete graphs $K(T_{i})$ which appear at least $k$ times among all $i\in[n]$. Independently sample a uniform proper $m$-coloring $c$ over $G^{(k)}$. This random coloring induces the collision resolution function $\chi:2^{D}\rightarrow 2^{D}$ that, for any input $\mathbf{x}$, returns the same vector but with each higher index $x_{i}$ set to zero if $c(i)=c(j)$ for some smaller index $j<i$ with $x_{j}=1$. Define $\Delta^{(k)}$ to be the maximum degree of $G^{(k)}$. Then, if $m>2\Delta^{(k)}$, with probability $1-\delta$ over the choice of the sample, for any $L$-Hamming-Lipschitz $f:2^{D}\rightarrow\mathbb{R}$,

$$\operatorname{\mathbb{E}}\left\lVert f-f\circ\chi\right\rVert_{L^{1}(\mu)}\leq\frac{L}{m-2\Delta^{(k)}}\left(\frac{N^{(k)}}{n}+\frac{k\eta^{2}\log\delta^{-1}}{\sqrt{n}}\right),$$

where $N^{(k)}$ is defined recursively by $N^{(k)}=kf(k)+N^{(k-1)}$ with $N^{(0)}=0$, and $f(k)$ is the count of edges which appear exactly $k$ times in the multiset $\left\{K(T_{i})\right\}_{i}$.

Proof.

We first reduce our error to the collision count. Consider a new, independent sample $\mathbf{x}\sim\mu$ and its associated set $T\subset[D]$ of active indices, yielding

$$\begin{aligned}
\operatorname{\mathbb{E}}\left\lVert f-f\circ\chi\right\rVert_{L^{1}(\mu)} &=\operatorname{\mathbb{E}}\left\lvert f(\mathbf{x})-f\circ\chi(\mathbf{x})\right\rvert & &\mathbf{x}\sim\mu\\
&\leq L\operatorname{\mathbb{E}}\left\lVert\mathbf{x}-\chi(\mathbf{x})\right\rVert_{1} & &\text{$f$ Hamming-Lipschitz}\\
&\leq L\operatorname{\mathbb{E}}\mathrm{CC}\left(G^{(k)},T\right) & &\text{$\chi$ definition},
\end{aligned}$$

where $\mathrm{CC}(G^{(k)},T)$ is the color collision count with respect to the coloring $c$ of $G^{(k)}$, equal to $\left\lvert T\right\rvert-\left\lvert cT\right\rvert$. Next, every edge in $K(T)\setminus E^{(k)}$ contributes at most 1 color collision, so

$$\operatorname{\mathbb{E}}\mathrm{CC}\left(G^{(k)},T\right)\leq\operatorname{\mathbb{E}}\left[\sum_{xy\in K(T)\setminus E^{(k)}}\operatorname{\mathbb{E}}\left[1\left\{c(x)=c(y)\right\}\,\middle|\,T_{1:n},T\right]\right],$$

where we also apply the tower property. Edges in $K(T)\cap E^{(k)}$ cannot result in a color collision because $c$ is proper. Further, since $c$ is sampled independently of $T_{1:n},T$ among all uniform $m$-colorings, for any two vertices $x,y$, which each have at most $\Delta^{(k)}$ neighbors in $G^{(k)}$, there are at least $m-2\Delta^{(k)}$ colors which $x,y$ may be assigned, so $\operatorname{\mathbb{E}}\left[1\left\{c(x)=c(y)\right\}\,\middle|\,T_{1:n},T\right]\leq\left(m-2\Delta^{(k)}\right)^{-1}$. Applying this bound above yields

$$\operatorname{\mathbb{E}}\mathrm{CC}\left(G^{(k)},T\right)\leq\left(m-2\Delta^{(k)}\right)^{-1}\operatorname{\mathbb{E}}\left\lvert K(T)\setminus E^{(k)}\right\rvert.$$

We turn our attention to the last term, $\operatorname{\mathbb{E}}\left\lvert K(T)\setminus E^{(k)}\right\rvert$, which is the expected count of new edges. We have $\left\lvert K(T)\setminus E^{(k)}\right\rvert\leq\left\lvert K(T)\setminus E^{(k)}_{i}\right\rvert$ since $E^{(k)}_{i}\subset E^{(k)}$, and $\left\lvert K(T)\setminus E^{(k)}_{i}\right\rvert\overset{d}{=}\left\lvert K(T_{i})\setminus E^{(k)}_{i}\right\rvert$ since $T\overset{d}{=}T_{i}$ and both are independent of $E^{(k)}_{i}$. Summing $n$ such terms over $i\in[n]$, we have

$$\operatorname{\mathbb{E}}\left\lvert K(T)\setminus E^{(k)}\right\rvert\leq\frac{1}{n}\operatorname{\mathbb{E}}N^{(k)}=\frac{1}{n}\operatorname{\mathbb{E}}\sum_{i}\left\lvert K(T_{i})\setminus E_{i}^{(k)}\right\rvert.$$

At this point, combining all of our inequalities, we have shown

$$\operatorname{\mathbb{E}}\left\lVert f-f\circ\chi\right\rVert_{L^{1}(\mu)}\leq\frac{L}{m-2\Delta^{(k)}}\cdot\frac{\operatorname{\mathbb{E}}N^{(k)}}{n}.$$

To finish, we show that $\frac{1}{n}N^{(k)}$ concentrates about its mean by applying McDiarmid's inequality, recognizing that $N^{(k)}$ is a function of $n$ arguments $T_{i}\in 2^{D}$ satisfying $\left\lvert T_{i}\right\rvert\leq\eta$. We focus on showing bounded differences by replacing $T_{i}$ with some $T_{i}^{\prime}\subset[D]$, also with $\left\lvert T_{i}^{\prime}\right\rvert\leq\eta$, yielding new $\left(E^{(k)}_{j}\right)^{\prime}=\left(E^{(k)}_{j}\setminus K(T_{i})\right)\cup K(T_{i}^{\prime})$ for $j\neq i$. The $i$-th term of the sum $N^{(k)}$ changes by $\left\lvert K(T_{i}^{\prime})\setminus\left(E^{(k)}_{i}\right)^{\prime}\right\rvert-\left\lvert K(T_{i})\setminus E^{(k)}_{i}\right\rvert$, which is bounded by $\eta^{2}$, as $K(T_{i}^{\prime}),K(T_{i})$ are bounded by that size. Next, the remaining $j$-th terms of $N^{(k)}$ for $j\neq i$, $\left\lvert K(T_{j})\setminus\left(E^{(k)}_{j}\right)^{\prime}\right\rvert$, can only increase by at most one relative to $\left\lvert K(T_{j})\setminus E^{(k)}_{j}\right\rvert$ for every edge $e\in K(T_{i})$. If $\#(e)>k$, then $e\in\left(E^{(k)}_{j}\right)^{\prime}$ regardless. Otherwise, an edge $e$ in $K(T_{i})$ is only absent from $E^{(k)}_{j}$ if it appears in at most $k-1$ other sets $K(T_{j})$. Thus, replacing $T_{i}$ with $T_{i}^{\prime}$ increases $N^{(k)}$ by at most $k\eta^{2}$. The alternative formulation for $N^{(k)}$ follows by recognizing that it can be computed explicitly as the sum of the counts of edges which appear exactly $j\in[k]$ times. ∎

Figure 4: As the sparsity of the reduced graphs $G^{(k)}$ increases with increasing $k$ across the datasets, we notice an increase in the number of new edges introduced by unseen points. Luckily, the estimated new edge count $\frac{1}{n}N^{(k)}$ provides a tractable estimator for the upper bound of the average number of new edges an example introduces, $\operatorname{\mathbb{E}}\left\lvert K(T)\setminus E^{(k)}\right\rvert$. The empirical estimate of $\operatorname{\mathbb{E}}\left\lvert K(T)\setminus E^{(k)}\right\rvert$ is based on held-out test data.
Figure 5: Across all datasets and different thresholds $k$, a large maximum degree $\Delta^{(k)}$ on the reduced co-occurrence graphs pushes the number of requisite colors up as well. To have error $L$ for $L$-Lipschitz functions with $99\%$ probability, one requires $m=2\Delta^{(k)}+\frac{N^{(k)}}{n}+\frac{k\eta^{2}\log 100}{\sqrt{n}}$ colors, which does not reduce dimensionality.

Given Thm. 1, a smooth kernel machine $k(\cdot,\alpha)$ over $2^{D}$ can be approximated by one over the chromatic representation $C$. The flexibility afforded by dealing with sparser graphs $G^{(k)}$ allows one to progressively increase $k$ as $n$ increases: the Good-Turing term $N^{(k)}$ provides a data-dependent estimate of unseen edge mass per new example (Fig. 4).

Based on Thm. 1, the number of requisite colors is at least $2\Delta^{(k)}$. Unfortunately, in practice $\Delta^{(k)}$ tends to be large due to a few vertices with high degree (Fig. 5).

By filtering out a small set of vertices $W\subset V$ and applying CL to the induced graph $G_{f}$, which is $G$ with the vertices $W$ removed, we can reduce the maximum degree of $G_{f}$ significantly, in turn dropping the color lower bound. Vertices in the hold-out set may be colored with their own unique colors. As a result, the lower bound on the number of colors necessary for fidelity $L$ with probability at least $99\%$ based on Thm. 1 is $m_{f}=\left\lvert W\right\rvert+2\Delta_{f}^{(k)}+\frac{N_{f}^{(k)}}{n}+\frac{k\eta^{2}\log 100}{\sqrt{n}}$, where $G_{f}^{(k)}$ is the induced graph resulting from removing $W$ from $G^{(k)}$, $\Delta_{f}^{(k)}$ is its maximum degree, and $\frac{1}{n}N_{f}^{(k)}$ is the Good-Turing estimator for the average new edge count for unseen test examples on the filtered graph $G_{f}^{(k)}$. The Good-Turing estimates remain valid even with filtering (Fig. 6).

To choose the set $W$ which maximally reduces the degree, we require a largest-first order $\sigma$ of the vertices, where $\sigma(i)$ is a vertex of maximum degree in the graph induced by the vertices $\sigma(i)\cdots\sigma\left(\left\lvert V\right\rvert\right)$. Such an ordering may be computed in time linear in the graph size [32].

We remove vertices in largest-first order until the number of removed vertices $\left\lvert W\right\rvert$ is at least twice the maximum degree of the remaining graph, $2\Delta_{f}^{(k)}$. This results in a significantly smaller number of requisite colors $m_{f}$ across all thresholds $k$ compared to $m$, even when including the $\left\lvert W\right\rvert$ distinct colors used for representing high-degree vertices (Fig. 7).
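A minimal sketch (ours) of this filtering rule, peeling off maximum-degree vertices until the removed prefix $W$ is at least twice the remaining maximum degree; a bucket-based implementation achieves the linear time cited above, which this simple quadratic version does not attempt:

```python
def filter_high_degree(adj):
    """Return (W, remaining max degree) after peeling max-degree vertices until |W| >= 2*Delta_f."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    removed = set()
    while True:
        live = [v for v in degree if v not in removed]
        max_deg = max((degree[v] for v in live), default=0)
        if len(removed) >= 2 * max_deg:
            return removed, max_deg
        w = max(live, key=lambda v: degree[v])     # next vertex in largest-first order
        removed.add(w)
        for u in adj[w]:                           # removing w lowers its neighbors' degrees
            if u not in removed:
                degree[u] -= 1

adj = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0, 4}, 4: {3}}
W, delta_f = filter_high_degree(adj)
print(W, delta_f)
```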

Figure 6: The filtered graphs $G^{(k)}_{f}$ exclude high-degree vertices from $G^{(k)}$ for each threshold $k$ across the datasets and, as a result, have much lower average counts of new edges appearing in held-out data. The Good-Turing estimator $\frac{1}{n}N_{f}^{(k)}$ applied to the subgraphs remains a valid high-probability upper bound.
Figure 7: For each thresholded graph $G^{(k)}$, across different thresholds $k$, we use distinct colors for the held-out vertices $W$ and use the chromatic representation for the filtered graph $G^{(k)}_{f}$, which is the induced subgraph of $G^{(k)}$ that excludes $W$. Because the graphs for the above datasets have few high-degree vertices, the filtering strategy significantly reduces the number of requisite colors $m_{f}$ (defined in the text) for achieving $L$ function approximation error with probability $99\%$ compared to using no filtering. Note that url quickly starts increasing with $k$ because of its relatively high max nnz term, $\eta^{2}$.

End-to-end, we can achieve low color collisions, and thus a high-fidelity representation of the original dataset, independent of the distribution of the data, through the chromatic representation by:

  1. Choosing an appropriate filtering threshold $k$, given the training set size $n$.

  2. Collecting the co-occurrence graph $G$ (Sec. 3.2). Using a rolling set of bloom filters allows one to collect the smaller $G^{(k)}$ directly [27]: do a preliminary pass with $k$ bloom filters tracking whether each edge appeared at least $k$ times, rolling an edge to the next filter if it is present in the prior one, and then do another pass over the data, filtering by the last bloom filter.

  3. Computing the largest-first ordering of $G^{(k)}$ [32].

  4. Identifying the smallest prefix $W$ of the largest-first ordering which has size at least twice the maximum degree $\Delta_{f}^{(k)}$ of the subgraph $G^{(k)}_{f}$ induced by the rest of the ordering.

  5. Uniformly sampling a coloring of $G^{(k)}_{f}$ from at least $m\geq 2\Delta_{f}+\frac{N_{f}^{(k)}}{n}$ colors by simulating Glauber dynamics for $\tilde{O}\left(m\left(\left\lvert V\right\rvert-\left\lvert W\right\rvert\right)\right)$ steps [25] and coloring each vertex in $W$ with new colors.

  6. Using the color map to perform categorical encoding, through embeddings or submodular feature compression.

The transition dynamics for the Glauber Markov chain are determined by each vertex's neighborhood. The Markov blanket for each vertex is exactly its neighbors, so MCMC simulation can sample multiple vertices at once. The filtered graph has $v=\left\lvert V\right\rvert-\left\lvert W\right\rvert$ vertices with average degree $d=2\left\lvert E_{f}^{(k)}\right\rvert/v$. By a birthday paradox calculation, with $P$ such simultaneous MCMC updates, we can expect contention on less than a single vertex at any given time, so long as $P=O\left(\nicefrac{\sqrt{v}}{d}\right)$. Although the end-to-end uniform coloring process above uses more colors and computation than the greedy approach in the main text, it has far fewer color collisions (Fig. 8).
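A serial sketch (ours) of Glauber dynamics for sampling a near-uniform proper $m$-coloring; the parallel variant discussed above would update non-adjacent vertices simultaneously, and the step count here is a placeholder rather than a mixing-time guarantee:

```python
import random

def glauber_coloring(adj, m, steps, seed=0):
    """Repeatedly resample one vertex's color uniformly from the colors its neighbors avoid."""
    rng = random.Random(seed)
    vertices = list(adj)
    # start from any proper coloring, e.g., a greedy one (valid whenever m > max degree)
    color = {}
    for v in vertices:
        used = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in range(m) if c not in used)
    for _ in range(steps):
        v = rng.choice(vertices)
        forbidden = {color[u] for u in adj[v]}
        allowed = [c for c in range(m) if c not in forbidden]
        color[v] = rng.choice(allowed)   # nonempty since m exceeds the max degree;
                                         # m > 2*Delta is what the mixing argument needs
    return color

adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
print(glauber_coloring(adj, m=7, steps=1000))
```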

Figure 8: Estimated average color collision count against the number of colors used when sampling a uniform coloring. As expected per Thm. 1, an increase in color count results in a hyperbolic decrease in average color collision count, as estimated by the test set for each dataset. The size of the prefix $W$ of the largest-first ordering of each reduced co-occurrence graph, as chosen by the rule $\left\lvert W\right\rvert\geq 2\Delta_{f}^{(k)}$ described above, is specified for each dataset. The number of colors used by the greedy algorithm is given by $\chi_{\mathrm{greedy}}$. The thresholds $k$ for reducing the co-occurrence graphs were chosen based on Fig. 7.

More efficient color sampling on random graphs is an active area of research, suggesting that slightly modified Glauber dynamics can mix quickly with fewer colors, even in the presence of high-degree vertices [33]. Such work presents opportunities to obviate the separate treatment of a prefix of largest-first vertices used to lower the maximum degree of the filtered graph $G^{(k)}_{f}$.

Appendix B Additional Practical Notes

Note that care must be taken not to resize under a lock, in order to maintain the expected contention time guarantee. This can be achieved with incremental resizing: when a hash table is full, create a new one of double the size and keep the old one, moving an edge from old to new on every insert rather than all at once during the resize. However, we found that this optimization was not critical for efficient performance in practice.

We also note that, while HT did not perform well relative to the heuristic FT in Sec. 4.1, it nonetheless provides a valuable technique for reducing memory usage in settings where a single index over features is unavailable. Frequently, sparse data is encoded as string/value pairs, without a global index. HT points us to a low-memory way of computing such an index. We construct a mapping using an ordinary hash table whose keys are 64-bit hashes of the feature strings and whose values are dense 32-bit indices. This can be done in a preliminary pass, in a parallel fashion, similar to Sec. 3.2, and yields a compact mapping. With such a large hash, collisions are exceedingly rare, with the probability of a collision among a billion features being less than 3%, by a birthday paradox calculation.
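A minimal sketch (ours) of this low-memory indexing scheme; the hash choice is an assumption, and in a real pipeline the first pass would be sharded across threads as in Sec. 3.2:

```python
import hashlib

def feature_hash64(name):
    """Stable 64-bit fingerprint of a feature string."""
    return int.from_bytes(hashlib.blake2b(name.encode(), digest_size=8).digest(), "little")

def build_index(feature_names):
    """Map 64-bit fingerprints to dense 32-bit indices in a single pass."""
    index = {}
    for name in feature_names:
        h = feature_hash64(name)
        if h not in index:
            index[h] = len(index)    # next dense id
    return index

index = build_index(["user:42|topic:sports", "user:42|topic:music", "user:7|topic:sports"])
print(index)  # keys are 8-byte fingerprints instead of full feature strings
```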

Experiment code, configuration, and scripts to download the datasets are available in the GitHub repository sisudata/chromatic-learning.