Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning
Abstract
Learning effective representations from raw data is crucial for the success of deep learning methods. However, in the tabular domain, practitioners often prefer augmenting raw column features over using learned representations, as conventional tree-based algorithms frequently outperform competing approaches. As a result, feature engineering methods that automatically generate candidate features have been widely used. While these approaches are often effective, there remains ambiguity in defining the space over which to search for candidate features. Moreover, they often rely solely on validation scores to select good features, neglecting valuable feedback from past experiments that could inform the planning of future experiments. To address these shortcomings, we propose a new tabular learning framework based on large language models (LLMs), coined Optimizing Column feature generator with decision Tree reasoning (OCTree). Our key idea is to leverage the reasoning capabilities of LLMs to find good feature generation rules without manually specifying the search space, and to provide language-based reasoning information on past experiments as feedback for iterative rule improvement. Here, we choose decision trees as the reasoning medium, as they can be interpreted in natural language, effectively conveying knowledge of past experiments (i.e., the prediction models trained with the generated features) to the LLM. Our empirical results demonstrate that this simple framework consistently enhances the performance of various prediction models across diverse tabular benchmarks, outperforming competing automatic feature engineering methods.
1 Introduction
Learning useful representations from raw data is key to the success of deep learning algorithms, and their effectiveness has been demonstrated across multiple domains, e.g., vision [1, 2, 3, 4] and language [5, 6]. However, in the tabular domain, deep learning approaches are often perceived as less effective [7, 8, 9, 10]. For instance, tree-based approaches utilizing raw column features of tabular data [11, 12] often outperform deep learning models in tabular prediction tasks such as classification and regression [13, 14, 15]. As a result, practitioners commonly resort to using tree-based methods coupled with manual feature engineering, such as computing the product of two column features [16].
The generation of column features, even with domain knowledge, can be challenging and costly. For instance, manual validation to identify useful features is infeasible due to the exponentially many possible combinations to explore [17]. To address this issue, existing feature engineering methods [18, 17, 19] use additional filtering schemes [20, 21, 22] to evaluate and select the useful features automatically. While these approaches reduce manual effort and improve feature quality, they still present several challenges. First, practitioners often resort to using a manually defined search space over which to generate candidate features due to the inherent ambiguity in what constitutes informative features [17, 23]. However, this still requires substantial computation for validating candidate features, particularly as the number of features and the complexity of the search space grow. Furthermore, they overlook more effective experimental designs, solely relying on validation scores to select good features, even though information from past experiments is useful for better selection.
Motivated by this, we propose to approach this problem from a novel perspective: optimization to discover effective generation rules, leveraging the language understanding and reasoning capabilities of large language models (LLMs). Recent research has demonstrated that LLMs can optimize various non-differentiable problems using prompts that describe the optimization task in natural language [24, 25, 26]. This suggests the potential for LLMs to automatically generate and iteratively refine feature generators without the need to manually specify the rule space. For example, the reasoning capabilities of LLMs allow feedback on their previous outputs to be incorporated into the process for iterative refinement. Moreover, linguistic contexts, such as column names (e.g., 'Age' and 'Gender') and categorical values (e.g., 'Female' and 'Male'), can be naturally integrated into the optimization [27, 28, 29, 30], which is difficult, if not impossible, with conventional methods.
Contributions. In this work, we leverage LLMs to generate novel column features for tabular prediction tasks, proposing Optimizing Column feature generator with decision Tree reasoning (OCTree), a generic framework for automatic feature generation using LLMs. Figure 1 illustrates an overview of our framework. Our approach begins by prompting an LLM to propose a name for a novel column feature based on the task description, such as a trading-volume feature for stock price prediction. This initial suggestion guides the LLM in exploring and refining the corresponding feature values. Then, we further leverage the reasoning capability of LLMs to produce a good rule $r$ that generates values for the newly introduced column feature based on the existing ones. Specifically, starting from an initial rule $r^0$, we let the LLM iteratively improve the current rule $r^t$ by using extracted reasonings $z^t$ and validation scores attained by the prediction model as feedback. Here, we let $z^t$ be the language description of a decision tree fitted on the entire dataset together with the new feature generated by $r^t$. Namely, we propose using such decision tree reasoning to provide the LLM with effective knowledge of the past experiments, i.e., the prediction model trained with the generated features, providing learned knowledge of the entire dataset. This procedure is iterated for a fixed number of times, after which we select the rule with the highest validation score.
We assess the effectiveness of OCTree through extensive evaluations on real-world datasets (e.g., stock price and patient mortality prediction) from various sources, including recent Kaggle competitions. Our experimental results demonstrate that OCTree consistently enhances the performance of various prediction models, including gradient-boosted decision trees [11] and deep neural networks [31, 32], for both classification and regression tasks. We also assess OCTree on datasets where language descriptions are unavailable, i.e., all feature values and column names are anonymized during preprocessing. Even on these datasets, OCTree reduces relative prediction errors by an average of 5.0% compared to the best baseline model, i.e., XGBoost, on the 19 classification tasks benchmarked by Grinsztajn et al. [13]. Here, we use a Llama 2 7B model fine-tuned on high-quality dialogue data, which helps the model understand and generate contextually relevant and coherent rules. We also show that OCTree outperforms recent automatic feature engineering methods [18, 17]. Lastly, we verify that features generated for one type of model (e.g., a simpler one) can often be used to enhance the performance of other types of models (e.g., more complex ones).
2 Related work
Tabular learning with LLMs. Recent developments in LLMs have encouraged investigation into their potential applications to tabular prediction tasks. Dinh et al. [27] and Hegselmann et al. [28] fine-tuned GPT-3 [33] and T0 [34], respectively, by serializing tabular data into natural language. Nam et al. [30] utilize unlabeled data expressed in natural language with LLMs via prompting for few-shot semi-supervised tabular classification tasks. More recently, Yan et al. [35] introduced tabular-specific tokenization to pre-train a single language model on multiple tabular datasets. Instead of using LLMs as prediction models, we explore whether they can effectively generate informative column features useful for tabular prediction tasks. Specifically, we propose to improve various prediction models by generating novel column features using an LLM as an optimizer [25].
LLMs as optimizers. Prompting techniques have enabled LLMs to demonstrate their potential for optimization. This is achieved by describing the optimization problem in natural language and instructing the LLM to iteratively generate new solutions based on previously found solutions and their evaluated scores. In particular, Yang et al. [25] use an LLM to tackle linear regression, the traveling salesman problem, and prompt optimization (i.e., finding instructions that guide an LLM to generate the best results). Motivated by these observations, we leverage an LLM to optimize an effective column feature generator. However, unlike previous work, we additionally use decision tree reasoning as feedback, which provides learned knowledge of the dataset in natural language.
Automatic feature engineering. Automatic feature engineering refers to generating features from raw tabular data without human labor, which can improve the performance of prediction tasks [19]. A variety of automatic feature engineering methods have been proposed [36, 37, 38]; recently, Horn et al. [23] used iterative feature subsampling with beam search to select informative features, and Zhang et al. [17] proposed feature boosting and pruning algorithms to efficiently and accurately filter the generated features. With the emergence of LLMs, Hollmann et al. [19] proposed a context-aware automatic feature engineering method that leverages an LLM to generate semantically meaningful features based on the description of a given task. Unlike previous works, we use an LLM's optimization and reasoning capabilities to find good feature generation rules without having to manually define the search space. Furthermore, it is worth noting that our method works in both context-aware and context-agnostic settings (Hollmann et al. [19] requires language-based context information), and it can also be used to suggest meaningful features to human annotators.
3 Optimizing column feature generator with decision tree reasoning
In this section, we introduce a framework for automatically generating column features by leveraging the language understanding and reasoning capabilities of LLMs. In a nutshell, our framework utilizes LLMs as optimizers to propose and refine the rules for generating column features. Specifically, we iteratively refine the rules through LLMs to achieve better validation scores, using (1) the validation scores of previously proposed rules and (2) decision tree-based reasoning (extracted from the training data) as inputs to the LLM to facilitate the optimization task. We first frame the tabular feature generation problem as an optimization of the rule generator (Section 3.1), and then present the core component, coined Optimizing Column feature generator with decision Tree reasoning (OCTree) (Section 3.2).
Problem setup. Formally, the goal of tabular prediction tasks is to train a prediction model $f: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ is the input space and $x = (x_1, \ldots, x_d) \in \mathcal{X}$ is a $d$-dimensional column feature with corresponding column names $c = (c_1, \ldots, c_d)$. For example, in a medical dataset, 'Female' and '45' might be feature values for the columns 'Gender' and 'Age', respectively. In classification tasks, $\mathcal{Y} = \{1, \ldots, K\}$ represents the label space with $K$ classes, while in regression tasks, $\mathcal{Y} = \mathbb{R}$. We use $c_{\mathrm{target}}$ to denote the name of the column containing the values of the label $y \in \mathcal{Y}$.
3.1 Tabular feature generation as rule generator optimization
We start by framing column feature generation as an optimization problem for the rule generator, where a rule defines a mapping from the original set of features to a new feature. In this regard, our objective is to generate an informative one-dimensional column feature $x_{\mathrm{new}} = r(x)$ by optimizing the rule $r$, i.e., a novel column feature that enhances the performance of the prediction model $f$ when trained with the new feature. Formally, based on the rule generated with respect to the training dataset $\mathcal{D}^{\mathrm{train}}$, our optimization problem is as follows:

$$\max_{r \,\sim\, G(\mathcal{D}^{\mathrm{train}})} \; S\big(f^{*}_{r};\, \mathcal{D}^{\mathrm{val}}\big), \quad \text{where } f^{*}_{r} \text{ is trained on } \{(x_i \oplus r(x_i),\, y_i)\}_{i=1}^{N}, \qquad (1)$$

where $x_{\mathrm{new}} = r(x)$ is the generated feature, $\oplus$ denotes concatenation of the column features, $N$ is the number of samples in the dataset $\mathcal{D}^{\mathrm{train}}$, and $G$ is the rule generator (i.e., an LLM in our case) applied to the training dataset $\mathcal{D}^{\mathrm{train}}$. $S$ is an objective function evaluated using the prediction model $f$, such as the mean absolute error measured with a decision tree regressor for a regression task. In short, we optimize a rule $r$ to achieve the best validation score measured with $f$ on $\mathcal{D}^{\mathrm{val}}$.
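To make the objective concrete, the following is a minimal sketch of how a single candidate rule can be scored, assuming a classification task evaluated by accuracy with XGBoost as the prediction model $f$; the function and variable names are illustrative rather than taken from our implementation.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

def score_rule(rule, X_train, y_train, X_val, y_val):
    """Train f on features augmented by the rule and return its validation score."""
    # x_i oplus r(x_i): concatenate the generated column with the original features.
    X_train_aug = pd.concat([X_train, rule(X_train).rename("x_new")], axis=1)
    X_val_aug = pd.concat([X_val, rule(X_val).rename("x_new")], axis=1)
    model = XGBClassifier(n_estimators=100)
    model.fit(X_train_aug, y_train)
    return accuracy_score(y_val, model.predict(X_val_aug))
```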
However, such a bi-level optimization is often non-differentiable (or extremely compute-heavy), as it requires computing gradients through the optimization process of $f$, making it challenging to learn the rule. Moreover, the generated rule may also require non-differentiable operations, such as a logical conjunction over categorical features (e.g., the equality conditions illustrated in Figure 1). One way to tackle this issue is to use black-box optimization, such as evolutionary strategies [39] or reinforcement learning [40]. However, these approaches still face several limitations. For instance, they require a manually pre-defined search space for the operations to be used (and become computationally heavy as the search space grows) and may overlook the effective experimental design of feedback for improvement.
To tackle this issue, we propose using LLMs as optimizers to iteratively improve the rule by prompting them with a trajectory of feedback, i.e., the history of previously generated rules along with their corresponding validation scores and reasoning information. This optimization-by-prompting approach [25] has emerged as an efficient and effective tool for black-box optimization. Furthermore, LLMs offer several advantages, e.g., leveraging the semantics of column names and feature values for better optimization and the flexibility to operate without a restricted, pre-defined search space for rules.
3.2 Generating column features with OCTree
We now describe the core algorithm. First, we prompt the LLM to propose the name of a new column to generate based on the task description, as illustrated in Figure 1. We then compute the validation score of the prediction model and extract reasoning information from the decision tree fitted on the training dataset to initialize the optimization trajectory, which is used as feedback for the LLM. Here, this novel decision tree reasoning effectively conveys knowledge of past experiments, which further facilitates the optimization process. Finally, we continue updating the trajectory as the optimization progresses, allowing the LLM to iteratively improve the rule.
Column name generation. OCTree first generates the column name $c_{\mathrm{new}}$ of a novel feature. OCTree achieves this by prompting the LLM to recommend a new column name that would be useful for predicting the target (see Appendix A.1 for the prompt). With their language understanding capabilities, LLMs generate reasonable column names; for example, the LLM might propose using trading volume as a new feature for predicting stock prices.
Initialize optimization and extract reasoning. OCTree then generates the initial rule $r^0$ for deriving the novel column feature from the original columns. OCTree achieves this by prompting the LLM to propose a rule that predicts values for $c_{\mathrm{new}}$ from the existing column features (see Appendix A.2 for the prompt). Then, we obtain the initial score $s^0$ evaluated with the prediction model $f$. We also extract the initial decision tree reasoning $z^0$ with CART [41] fitted on the training dataset.
CART is a binary tree that recursively splits data based on criteria such as Gini impurity to predict outcomes. We use CART because (i) tree-based algorithms, which typically use ensembles of simple decision trees like CART, still outperform deep learning on tabular prediction tasks, and (ii) it is more interpretable than tree ensembles and can be easily expressed in natural language. For example, as illustrated in Figure 1, CART can be expressed in natural language using if-else syntax. Intuitively, the decision tree reasoning extracted by CART provides valuable insights learned from the entire training dataset: the reasoning explicitly shows which columns are considered more significant (as nodes in the tree) and the corresponding values (as thresholds of the nodes) used for prediction.
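As an illustration, the decision tree reasoning can be extracted with scikit-learn's CART implementation and serialized into if-else-style text for the prompt; this is a minimal sketch under the depth-4 setting used in Section 4, not our exact implementation.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

def extract_reasoning(X_train_aug, y_train, feature_names):
    """Fit CART on the (augmented) training data and return its textual form."""
    tree = DecisionTreeClassifier(max_depth=4)  # CART splitting on Gini impurity by default
    tree.fit(X_train_aug, y_train)
    # export_text yields an indented if-else-like description of the tree
    # (e.g., "|--- Age <= 52.5 ... |--- class: 0"), which we place into the prompt.
    return export_text(tree, feature_names=feature_names)
```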
Optimization with decision tree reasoning. To optimize the rule, we describe the task in natural language and provide the trajectory $T$, which includes a history of previously proposed rules with the corresponding scores and reasonings. Specifically, we generate a new rule $r^{t+1}$ with the LLM using a prompt that asks for a rule not already in $T$ that would achieve a better score than those in $T$ (see Appendix A.3). Optimization trajectories are presented in ascending order of score, since the LLM is more likely to generate entries similar to those that appear later in the list [42, 25]. Finally, we append the score $s^{t+1}$, decision tree reasoning $z^{t+1}$, and rule $r^{t+1}$ to the optimization trajectory $T$. We optimize the rule for a fixed number of iterations and select the rule with the highest validation score.
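Putting the pieces together, a minimal sketch of the optimization loop is given below. It reuses `score_rule` and `extract_reasoning` from the earlier sketches and treats the LLM interaction as an injected callable `propose_rule` (wrapping the prompt of Appendix A.3 and parsing the response into an executable rule); these names are illustrative, and the sketch assumes a score to be maximized.

```python
import pandas as pd

def octree_optimize(propose_rule, initial_rule, data, n_iters=20):
    """Iteratively refine a feature generation rule with decision tree reasoning."""
    X_train, y_train, X_val, y_val = data
    rule, trajectory = initial_rule, []
    for _ in range(n_iters):
        score = score_rule(rule, X_train, y_train, X_val, y_val)
        X_aug = pd.concat([X_train, rule(X_train).rename("x_new")], axis=1)
        reasoning = extract_reasoning(X_aug, y_train, [str(c) for c in X_aug.columns])
        trajectory.append((score, reasoning, rule))
        # Present the trajectory in ascending score order: the LLM tends to imitate
        # entries that appear later in the prompt (Section 3.2).
        trajectory.sort(key=lambda entry: entry[0])
        rule = propose_rule(trajectory)  # LLM generates the next rule r^{t+1}
    best_score, _, best_rule = max(trajectory, key=lambda entry: entry[0])
    return best_rule, best_score
```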
Generating multiple features. The optimization steps can be repeated to generate multiple useful features. For example, after generating an additional column 'Smoking Status', a column 'Physical Activity Level' can be generated based on the original features and 'Smoking Status'. Formally, we first generate a new column feature $x_{d+1} = r^{*}(x)$, where $r^{*}$ is the optimized rule. Then, one can form a new input space $\mathcal{X}' = \mathcal{X} \oplus x_{d+1}$, and OCTree performs the optimization process on $\mathcal{X}'$ to generate the next column feature. We sequentially introduce new column features until the validation score stops improving.
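A minimal sketch of this outer loop, reusing `octree_optimize` from the previous sketch; the column naming and stopping criterion are simplified for illustration.

```python
import pandas as pd

def generate_features(propose_rule, initial_rule, X_tr, y_tr, X_va, y_va, max_new=5):
    """Sequentially add column features until the validation score stops improving."""
    best_score = float("-inf")
    for k in range(max_new):
        rule, score = octree_optimize(propose_rule, initial_rule, (X_tr, y_tr, X_va, y_va))
        if score <= best_score:  # validation score stopped improving
            break
        best_score = score
        # X' = X oplus x_new: append the optimized feature to both splits.
        X_tr = pd.concat([X_tr, rule(X_tr).rename(f"x_new_{k}")], axis=1)
        X_va = pd.concat([X_va, rule(X_va).rename(f"x_new_{k}")], axis=1)
    return X_tr, X_va
```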
4 Experiments
In this section, we evaluate the effectiveness of OCTree across a range of tabular classification and regression tasks using diverse datasets. Our findings demonstrate that OCTree consistently improves the performance of various prediction models (Section 4.1). Furthermore, ablation studies validate the effectiveness of the proposed decision tree reasoning and demonstrate the utility of the generated features across different types of prediction models (Section 4.2).
Method | LLM | Tesla† | Enefit† | Disease∗ | Clinical∗ | Academic∗ |
XGBoost [11]
Baseline | - | 6.61 | 8.00 | 28.09±7.9 | 46.27±5.0 | 14.15±0.6
OCTree | Llama 2 | 5.56 (15.9%) | 8.00 (0.0%) | 26.19±7.2 (6.8%) | 45.07±4.1 (2.6%) | 14.11±0.5 (0.3%)
OCTree | GPT-4o | 5.48 (17.1%) | 7.82 (2.3%) | 25.72±6.6 (8.4%) | 43.75±4.4 (5.4%) | 13.74±0.1 (2.9%)
MLP [31]
Baseline | - | 7.41 | 33.53 | 38.10±3.6 | 41.77±1.7 | 14.41±0.8
OCTree | Llama 2 | 5.23 (29.4%) | 29.99 (10.6%) | 32.86±5.7 (13.7%) | 39.80±2.3 (4.7%) | 14.26±0.7 (1.0%)
OCTree | GPT-4o | 5.01 (32.4%) | 21.68 (35.3%) | 30.95±5.8 (18.8%) | 39.25±0.5 (6.0%) | 14.22±0.5 (1.3%)
HyperFast [32]
Baseline | - | N/A | N/A | 28.57±10.0 | 43.64±1.1 | 14.67±0.7
OCTree | Llama 2 | N/A | N/A | 28.10±9.2 (1.6%) | 41.45±1.7 (5.0%) | 14.49±0.5 (1.2%)
OCTree | GPT-4o | N/A | N/A | 27.14±3.8 (5.0%) | 42.00±1.5 (3.8%) | 14.49±0.5 (1.2%)
Datasets. First, we select real-world datasets with language descriptions from diverse sources: the Disease, Academic, Enefit, and Tesla Stock datasets were recently released on Kaggle, and the Clinical Trial dataset is from the US National Library of Medicine. These prediction tasks are practical and compelling in domains such as healthcare (e.g., diagnostics), academia (e.g., student dropout), and finance (e.g., stock price prediction). In addition, to verify that OCTree is also applicable without language descriptions, we select 19 classification datasets benchmarked by Grinsztajn et al. [13]. When selecting datasets, we aim to reflect the heterogeneous nature of tabular data [43]: whether a dataset contains both categorical and numerical features or only one type, and whether the task is classification or regression. We conduct experiments on datasets covering both dimensions to ensure OCTree's general applicability to various types of tabular data. Details are provided in Appendix B.
Baselines. To validate our method, we use three baseline prediction models. We first consider XGBoost [11], one of the most competitive tree-based methods, as such methods are still known to be effective in tabular domains. Second, we enhance a multilayer perceptron (MLP) [31], the base architecture of deep learning models for tabular data. Finally, we show that OCTree improves a recently proposed tabular model, HyperFast [32], which is practical for rapid model deployment because classification can be performed instantly. Implementation details, including the hyperparameter search space, are provided in Appendix C.
Common setup. For all datasets, 60% of the data is used for training, 20% for validation, and 20% for testing. Following Gorishniy et al. [31], we use learned embeddings for categorical features when training MLPs (note that HyperFast handles categorical features automatically). For all experiments, we use CART [41] with a maximum depth of 4 to extract the decision tree reasoning provided to the rule-generating LLM in the prompt. Unless noted otherwise, we use the Llama 2 model [44] at the 7B scale, fine-tuned on dialogue data to enhance its chat capability. Specifically, we fine-tune the pre-trained model on UltraChat [45], a dialogue dataset that has been used to produce strong chat models such as UltraLM [45]. Our empirical findings suggest that open models, even at moderate scales, can be effective, particularly when equipped with strong chat capabilities. A performance comparison across different types of LLMs is presented in Section 4.1.
4.1 Main results
Results on datasets with language descriptions. We first describe the experiments where clear language descriptions are provided (e.g., column names and categorical values). In these cases, the LLM generates a logical rule in natural language. Since the logical rule is easily converted to Python code, we prompt an LLM to perform the conversion; here, we use GPT-4o with a temperature of 0.0, since the Llama 2 model often fails at this step (see Appendix A.4 for the prompt).
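For illustration, a translated rule typically takes the form of a small Python function applied to the data frame; the rule below is a made-up example in the style of the Disease dataset, not an actual rule produced by the model.

```python
import pandas as pd

def generated_rule(df: pd.DataFrame) -> pd.Series:
    # Hypothetical logical rule: patients with fever and either high blood pressure
    # or high cholesterol are flagged as "Yes" for the new column.
    at_risk = (df["Fever"] == "Yes") & (
        (df["Blood Pressure"] == "High") | (df["Cholesterol Level"] == "High")
    )
    return at_risk.map({True: "Yes", False: "No"}).rename("High Risk Profile")

# df_aug = pd.concat([df, generated_rule(df)], axis=1)  # augmented table for training
```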
As shown in Table 1, our framework consistently improves various types of baseline models. For instance, when we generate a new column feature for the Tesla Stock dataset using Llama 2 while training XGBoost, the relative error decreases by 15.9%. Additionally, our framework is compatible with any kind of LLM: while Llama 2, an open-source model, already improves the baseline models by a large margin, more advanced models can yield further gains. For example, using GPT-4o, a recent LLM from OpenAI, our method reduces the relative error by 17.1% on the Tesla Stock dataset with XGBoost. Note that we deliberately select recently released datasets (e.g., Enefit is from a recent Kaggle competition) to curate datasets that LLMs are unlikely to have seen during pre-training.
Results on datasets without language descriptions. In practice, language descriptions that clearly describe the given prediction task are not always available. For example, feature names and values are often replaced with arbitrary, meaningless symbols in many financial and medical datasets to protect confidentiality [46]. Even when language descriptions do not exist, our framework can be easily extended to this setting by using arithmetic rules as feature generators. This is because the strength of our framework comes from the optimization capability of LLMs [24, 25], using decision tree reasoning as explicit feedback, rather than from utilizing language descriptions.
Specifically, to ensure that the datasets contain no language descriptions, we use ordinal encoding for categorical features and then normalize all features with a min-max scaler so that the original numeric values are mapped to different values. In addition, we use generic indicators for the column names: if the number of columns is $d$, we simply index the columns with generic names (e.g., $C_1, \ldots, C_d$). For the initial rule, we use the product of the two most important features as measured by XGBoost. The generated rule is, therefore, in the form of an arithmetic formula.
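A minimal sketch of this context-agnostic setup is shown below, assuming pandas, scikit-learn, and XGBoost; the generic column indicators and helper names are illustrative rather than taken from our implementation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder
from xgboost import XGBClassifier

def anonymize(df: pd.DataFrame) -> pd.DataFrame:
    """Remove language context: encode categoricals, rescale values, rename columns."""
    out = df.copy()
    cat_cols = out.select_dtypes(include=["object", "category"]).columns
    if len(cat_cols) > 0:
        out[cat_cols] = OrdinalEncoder().fit_transform(out[cat_cols])
    out[:] = MinMaxScaler().fit_transform(out)                 # map values to [0, 1]
    out.columns = [f"C{i + 1}" for i in range(out.shape[1])]   # generic column names
    return out

def make_initial_rule(X: pd.DataFrame, y):
    """Initial rule: product of the two features ranked most important by XGBoost."""
    importance = XGBClassifier(n_estimators=100).fit(X, y).feature_importances_
    c1, c2 = X.columns[np.argsort(importance)[-2:]]
    return lambda df: (df[c1] * df[c2]).rename("x_new")
```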
XGBoost [11] | MLP [31] | HyperFast [32] | ||||
Dataset | Baseline | OCTree (Ours) | Baseline | OCTree (Ours) | Baseline | OCTree (Ours) |
electricity | 8.32±0.0 | 6.65±0.1 (20.1%) | 15.64±0.3 | 14.82±0.4 (5.2%) | 15.25±0.5 | 14.70±0.5 (3.6%)
rl | 23.61±0.8 | 19.32±0.4 (18.2%) | 32.03±4.2 | 28.30±1.7 (11.6%) | 33.77±1.3 | 33.50±1.2 (0.8%)
compass | 22.91±0.5 | 18.89±0.4 (17.6%) | 27.41±1.0 | 26.78±0.1 (2.3%) | 25.74±0.6 | 24.91±1.1 (3.2%)
covertype | 9.10±0.2 | 7.96±0.0 (12.5%) | 8.73±0.4 | 8.25±0.3 (5.5%) | 9.86±1.6 | 9.21±1.3 (6.6%)
phoneme | 10.89±0.5 | 10.15±0.7 (6.8%) | 12.06±0.8 | 10.98±0.6 (9.8%) | 10.55±0.7 | 10.57±0.9 (N/I)
kddCup09 | 19.86±1.1 | 19.07±1.4 (4.0%) | 24.30±0.3 | 24.30±1.6 (0.0%) | 25.75±0.7 | 24.46±1.1 (5.0%)
pol | 1.69±0.2 | 1.62±0.2 (4.0%) | 1.37±0.3 | 1.27±0.3 (7.3%) | 1.70±0.4 | 1.55±0.2 (8.8%)
Magic | 14.25±0.3 | 13.75±0.4 (3.5%) | 14.60±0.2 | 14.50±0.0 (0.7%) | 14.95±0.2 | 14.34±0.5 (4.1%)
california | 9.45±0.6 | 9.13±1.0 (3.4%) | 11.91±0.3 | 11.37±0.1 (4.5%) | 11.75±0.7 | 11.02±0.6 (6.2%)
house_16H | 11.66±0.5 | 11.32±0.2 (3.0%) | 13.07±0.2 | 12.54±0.6 (4.1%) | 12.77±0.3 | 12.29±0.4 (3.8%)
eye_movements | 35.06±0.7 | 34.17±2.0 (2.6%) | 40.03±1.2 | 39.86±1.9 (0.4%) | 41.33±1.5 | 40.29±1.7 (2.5%)
road-safety | 21.14±0.0 | 20.65±0.1 (2.3%) | 22.17±0.4 | 21.87±0.1 (1.4%) | 24.54±0.3 | 24.07±0.4 (1.9%)
kdd_ipums_la | 10.89±1.0 | 10.69±1.0 (1.8%) | 13.13±1.3 | 11.72±1.5 (10.7%) | 16.15±0.3 | 13.55±1.4 (16.1%)
MiniBooNE | 5.48±0.2 | 5.42±0.1 (1.2%) | 9.69±0.3 | 7.35±0.2 (24.1%) | 6.61±0.4 | 6.54±0.2 (1.1%)
credit | 22.02±0.3 | 21.78±0.3 (1.1%) | 24.43±0.6 | 23.23±0.7 (4.9%) | 25.06±1.1 | 24.30±1.8 (3.0%)
Higgs | 27.95±0.7 | 27.91±0.2 (0.1%) | 29.43±0.4 | 28.80±0.2 (2.1%) | 30.04±0.2 | 29.73±0.5 (1.0%)
jannis | 20.61±0.1 | 20.64±0.1 (N/I) | 22.28±0.1 | 22.51±0.1 (N/I) | 24.29±0.4 | 23.65±0.3 (2.6%)
wine | 19.11±3.3 | 19.18±3.9 (N/I) | 21.53±3.1 | 21.59±1.4 (N/I) | 19.18±2.7 | 19.31±2.2 (N/I)
bank-marketing | 20.09±0.3 | 20.31±0.6 (N/I) | 21.11±0.4 | 21.09±0.4 (0.1%) | 21.25±1.0 | 21.66±0.8 (N/I)
As shown in Table 2, our framework improves the baseline models even without language descriptions. Specifically, it reduces errors by an average of 5.0% over the XGBoost classifier. We hypothesize that our method successfully finds good feature generation rules because the LLM generates informative new column features in a black-box manner from the large search space of arithmetic rules, without our having to manually specify that search space.
Method | LLM | Avg. Err. |
XGBoost | - | 16.53±0.1
OCTree | Llama 2 Chat 7B | 16.32±0.1
OCTree | Code Llama 7B | 15.83±0.2
OCTree | Ours 7B | 15.71±0.4
We experiment with several open LLMs to evaluate their performance when employed as the rule generator in our framework. As shown in Table 3, while all of these models yield improvements over the baseline on the datasets without language descriptions, we find our own model (i.e., Llama 2 7B fine-tuned on UltraChat) to be particularly effective. This demonstrates that our framework can be implemented effectively even with open models at a moderate scale, provided they have strong chat capabilities. We suspect that this is due to the enhanced ability of such models to understand and generate contextually relevant and coherent rules, which leads to better optimization outcomes. Code Llama also demonstrates competitive performance; it is trained on code datasets, which include a variety of arithmetic expressions and are therefore beneficial for suggesting rules on datasets lacking language descriptions. This suggests that incorporating code datasets alongside dialogue datasets may further enhance the performance of the rule generator LLM.
Incorporating previous automatic feature engineering methods. As OCTree is orthogonal to existing automatic feature engineering methods, it naturally integrates with them: the simplest approach is to first generate features with OCTree and then apply these existing methods to further enhance the feature set. In Table 4, we first compare the performance of using OCTree alone, evaluating both XGBoost and MLP to show that our framework outperforms the others for both tree-based and deep learning models. As shown in the table, our method outperforms state-of-the-art automatic feature engineering methods (i.e., AutoFeat [23] and OpenFE [17]). We exclude CAAFE [19] from this comparison since it relies on language-based context about the dataset, making it inapplicable to the datasets in Table 2; in contrast, the other methods, including ours, are context-agnostic and can be applied without any description (though our method benefits from clear language descriptions, as shown in Table 1, when they are available). Furthermore, we highlight that combining our method with OpenFE (i.e., OCTree† in Table 4) further improves performance, resulting in a 7.9% reduction in relative error for XGBoost.
Gen. Feat. | DT Reasoning | Disease∗ | Clinical∗ | electricity† | kddCup09† |
- | - | 28.09±7.9 | 46.27±5.0 | 8.32±0.0 | 19.86±1.1
✓ | ✗ | 27.62±8.4 (1.7%) | 45.61±4.1 (1.4%) | 6.89±0.6 (17.2%) | 19.47±1.6 (2.0%)
✓ | ✓ | 26.19±7.2 (6.8%) | 45.07±4.1 (2.6%) | 6.65±0.1 (20.1%) | 19.07±1.4 (4.0%)
4.2 Ablations and analysis
Ablation study of the proposed components. Our framework consists of two main components: (i) generating new additional column features (Gen. Feat.), and (ii) providing explicit decision tree reasoning as feedback (DT Reasoning) during the optimization process. As shown in Table 5, both components are crucial. First, the rules for introducing new column features are successfully optimized even without explicit decision tree feedback (i.e., providing only the score as feedback to the LLM), which already yields performance improvements. Second, performance improves further when the decision tree is provided as feedback. We hypothesize that providing a decision tree effectively conveys information learned from the entire set of training samples; in particular, the tree's nodes encode which columns are important and the threshold values used for prediction. In addition, while deep learning models often lack interpretability (making them non-trivial to represent in prompts), decision trees are easily converted to natural language using if-else syntax, which is why we use them in prompts for the LLM.
MLP | HyperFast | |||
Dataset | Baseline | OCTree (Ours) | Baseline | OCTree (Ours) |
Disease∗ | 38.10±3.6 | 35.24±4.4 (7.5%) | 28.57±10.0 | 27.62±5.8 (5.8%)
Clinical∗ | 41.77±1.2 | 42.32±2.3 (N/I) | 43.64±1.1 | 42.76±1.8 (2.0%)
electricity† | 15.64±0.3 | 15.03±0.3 (3.9%) | 15.37±0.4 | 14.88±0.2 (3.2%)
kddCup09† | 24.30±0.3 | 23.47±0.5 (3.4%) | 25.62±0.7 | 25.22±0.9 (1.6%)
Transferring generated rules to other prediction models. While we optimize feature generation rules to improve the performance of a particular model, the generated features can also be used to improve other models. For example, it is highly practical to generate features by evaluating with XGBoost, which has a relatively short evaluation time compared to large deep neural networks, and then use the generated features to improve larger models. To verify this transferability, we first optimize the column generation rules by evaluating with XGBoost, and then train MLP and HyperFast with the generated features. As shown in Table 6, the generated features are useful for improving the performance of MLP and HyperFast, which makes our framework practical when features need to be generated within a limited time budget.
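A minimal sketch of this transfer setup, reusing the helpers from the sketches in Section 3.2 and using scikit-learn's MLPClassifier as a stand-in for the heavier target models in Table 6; the actual MLP in our experiments follows Gorishniy et al. [31].

```python
import pandas as pd
from sklearn.neural_network import MLPClassifier

def transfer_rule(propose_rule, initial_rule, X_tr, y_tr, X_va, y_va):
    """Optimize the rule against a fast XGBoost scorer, then train the target model."""
    rule, _ = octree_optimize(propose_rule, initial_rule, (X_tr, y_tr, X_va, y_va))
    X_tr_aug = pd.concat([X_tr, rule(X_tr).rename("x_new")], axis=1)
    X_va_aug = pd.concat([X_va, rule(X_va).rename("x_new")], axis=1)
    target = MLPClassifier(max_iter=300).fit(X_tr_aug, y_tr)  # target model, not the scorer
    return target, target.score(X_va_aug, y_va)
```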
Examination of the LLM for column generation. During initialization, the LLM recommends a new column feature that is not present in the original dataset. Here, we conduct an experiment to verify whether the LLM actually introduces beneficial features that can improve prediction performance. We analyze this from two perspectives: (i) whether the LLM can distinguish column features that are more relevant to the target task when given candidate column names, and (ii) whether it is actually beneficial to use the real values of the proposed column features when they are obtainable.
Column feature | Model | |
Cough | Cholesterol | XGBoost [11] |
✗ | ✗ | 34.76±0.8
✗ | ✓ | 33.34±0.8
✓ | ✗ | 30.00±4.3
✓ | ✓ | 28.09±7.9
Our motivation is that the LLM can understand the relationship between the target task and the column features, and can therefore introduce new columns needed for the task. To verify this, we first check that the LLM can distinguish which column features are more important for the task. For the experiment, we corrupt the dataset by removing two existing features and then prompt the LLM to rank these two features (provided as candidates) in order of informativeness for the target task. Finally, we compare the performance of adding each feature back to the corrupted dataset. As shown in Table 7, the LLM indeed distinguishes the more informative column feature. Specifically, for the Disease dataset, we remove 'Cough' and 'Cholesterol' from the original dataset and ask the LLM which attribute is more important for predicting whether a patient has the disease. Of the two, the LLM answers 'Cough', which indeed yields a larger performance improvement than using 'Cholesterol' as the additional feature. Therefore, the column features generated by the LLM are beneficial for prediction.
Leveraging this ability, we propose using the LLM to suggest new columns that practitioners actually need for their target task, even without candidates. For example, on the Clinical Trial dataset, we verify that the LLM indeed generates useful column features for the target task (i.e., predicting patient mortality). During the column name generation step, the LLM introduces the patient's age as an additional column. We first obtain the actual ages of patients from the US National Library of Medicine and then measure the performance gain from using these real-world values. As shown in Figure 3, imputing the generated column with real values yields an improvement, validating that OCTree recommends columns that are genuinely useful for the target task. We therefore suggest that practitioners utilize OCTree (i) to identify additional column features that should be collected or, if this is not possible, (ii) to optimize a column feature generation rule.
5 Conclusion
In this paper, we propose OCTree, a generic framework that leverages the capabilities of LLMs (e.g., their reasoning ability) to automatically generate column features for tabular prediction tasks. We evaluate the effectiveness of OCTree across various prediction tasks and find that our method consistently improves the performance of diverse prediction models. As future work, applying feedback-based alignment methods, such as reinforcement learning from human feedback, to further enhance LLMs as rule generators would be an exciting direction to explore.
Limitation. One potential limitation of our work is that evaluating the generated features involves computing the validation scores of the prediction model, which may be time-consuming if the model requires extensive training. However, as demonstrated by the results in Table 6, this issue can be mitigated by initially generating features for a simpler model and then transferring them to the target model, thereby reducing runtime.
References
- Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
- Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
- He et al. [2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
- Tian et al. [2020] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 776–794. Springer, 2020.
- Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Logeswaran and Lee [2018] Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893, 2018.
- Yoon et al. [2020] Jinsung Yoon, Yao Zhang, James Jordon, and Mihaela van der Schaar. Vime: Extending the success of self-and semi-supervised learning to tabular domain. Advances in Neural Information Processing Systems, 33:11033–11043, 2020.
- Huang et al. [2020] Xin Huang, Ashish Khetan, Milan Cvitkovic, and Zohar Karnin. Tabtransformer: Tabular data modeling using contextual embeddings. arXiv preprint arXiv:2012.06678, 2020.
- Ucar et al. [2021] Talip Ucar, Ehsan Hajiramezanali, and Lindsay Edwards. Subtab: Subsetting features of tabular data for self-supervised representation learning. Advances in Neural Information Processing Systems, 34:18853–18865, 2021.
- Zhu et al. [2023] Bingzhao Zhu, Xingjian Shi, Nick Erickson, Mu Li, George Karypis, and Mahsa Shoaran. Xtab: Cross-table pretraining for tabular transformers. arXiv preprint arXiv:2305.06090, 2023.
- Chen and Guestrin [2016] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016.
- Prokhorenkova et al. [2018] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. Catboost: unbiased boosting with categorical features. Advances in neural information processing systems, 31, 2018.
- Grinsztajn et al. [2022] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems, 35:507–520, 2022.
- Chen et al. [2023] Kuan-Yu Chen, Ping-Han Chiang, Hsin-Rung Chou, Ting-Wei Chen, and Tien-Hao Chang. Trompt: Towards a better deep neural network for tabular data. arXiv preprint arXiv:2305.18446, 2023.
- McElfresh et al. [2023] Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Ganesh Ramakrishnan, Micah Goldblum, Colin White, et al. When do neural nets outperform boosted trees on tabular data? Advances in Neural Information Processing Systems, 2023.
- Cherepanova et al. [2023] Valeriia Cherepanova, Roman Levin, Gowthami Somepalli, Jonas Geiping, C Bayan Bruss, Andrew Gordon Wilson, Tom Goldstein, and Micah Goldblum. A performance-driven benchmark for feature selection in tabular deep learning. Advances in Neural Information Processing Systems, 2023.
- Zhang et al. [2023a] Tianping Zhang, Zheyu Aqa Zhang, Zhiyuan Fan, Haoyan Luo, Fengyuan Liu, Qian Liu, Wei Cao, and Li Jian. Openfe: Automated feature generation with expert-level performance. In International Conference on Machine Learning, pages 41880–41901. PMLR, 2023a.
- Amballa et al. [2024] Avinash Amballa, Anmol Mekala, Gayathri Akkinapalli, Manas Madine, Naga Pavana Priya Yarrabolu, and Przemyslaw A Grabowicz. Automated model selection for tabular data. arXiv preprint arXiv:2401.00961, 2024.
- Hollmann et al. [2024] Noah Hollmann, Samuel Müller, and Frank Hutter. Large language models for automated data science: Introducing caafe for context-aware automated feature engineering. Advances in Neural Information Processing Systems, 36, 2024.
- Breiman [2001] Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
- Friedman [2001] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
- Liu and Yu [2005] Huan Liu and Lei Yu. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on knowledge and data engineering, 17(4):491–502, 2005.
- Horn et al. [2020] Franziska Horn, Robert Pack, and Michael Rieger. The autofeat python library for automated feature engineering and selection. In Machine Learning and Knowledge Discovery in Databases: International Workshops of ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part I, pages 111–120. Springer, 2020.
- Fernando et al. [2023] Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktäschel. Promptbreeder: Self-referential self-improvement via prompt evolution. arXiv preprint arXiv:2309.16797, 2023.
- Yang et al. [2023] Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
- Zhang et al. [2023b] Michael R Zhang, Nishkrit Desai, Juhan Bae, Jonathan Lorraine, and Jimmy Ba. Using large language models for hyperparameter optimization. arXiv e-prints, pages arXiv–2312, 2023b.
- Dinh et al. [2022] Tuan Dinh, Yuchen Zeng, Ruisu Zhang, Ziqian Lin, Michael Gira, Shashank Rajput, Jy-yong Sohn, Dimitris Papailiopoulos, and Kangwook Lee. Lift: Language-interfaced fine-tuning for non-language machine learning tasks. Advances in Neural Information Processing Systems, 35:11763–11784, 2022.
- Hegselmann et al. [2023] Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. Tabllm: Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence and Statistics, pages 5549–5581. PMLR, 2023.
- Manikandan et al. [2023] Hariharan Manikandan, Yiding Jiang, and J Zico Kolter. Language models are weak learners. Advances in Neural Information Processing Systems, 2023.
- Nam et al. [2023a] Jaehyun Nam, Woomin Song, Seong Hyeon Park, Jihoon Tack, Sukmin Yun, Jaehyung Kim, and Jinwoo Shin. Semi-supervised tabular classification via in-context learning of large language models. In Workshop on Efficient Systems for Foundation Models@ ICML2023, 2023a.
- Gorishniy et al. [2021] Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. Revisiting deep learning models for tabular data. Advances in Neural Information Processing Systems, 34:18932–18943, 2021.
- Bonet et al. [2024] David Bonet, Daniel Mas Montserrat, Xavier Giró-i Nieto, and Alexander G Ioannidis. Hyperfast: Instant classification for tabular data. AAAI Conference on Artificial Intelligence, 2024.
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Sanh et al. [2022] Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022.
- Yan et al. [2024] Jiahuan Yan, Bo Zheng, Hongxia Xu, Yiheng Zhu, Danny Chen, Jimeng Sun, Jian Wu, and Jintai Chen. Making pre-trained language models great on tabular prediction. In International Conference on Learning Representations, 2024.
- Kanter and Veeramachaneni [2015] James Max Kanter and Kalyan Veeramachaneni. Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE international conference on data science and advanced analytics (DSAA), pages 1–10. IEEE, 2015.
- Fan et al. [2010] Wei Fan, Erheng Zhong, Jing Peng, Olivier Verscheure, Kun Zhang, Jiangtao Ren, Rong Yan, and Qiang Yang. Generalized and heuristic-free feature construction for improved accuracy. In Proceedings of the 2010 SIAM International Conference on Data Mining, pages 629–640. SIAM, 2010.
- Li et al. [2022] Liyao Li, Haobo Wang, Liangyu Zha, Qingyi Huang, Sai Wu, Gang Chen, and Junbo Zhao. Learning a data-driven policy network for pre-training automated feature engineering. In The Eleventh International Conference on Learning Representations, 2022.
- Doerr and Neumann [2021] Benjamin Doerr and Frank Neumann. A survey on recent progress in the theory of evolutionary algorithms for discrete optimization. ACM Transactions on Evolutionary Learning and Optimization, 1(4):1–43, 2021.
- Mazyavkina et al. [2021] Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. Reinforcement learning for combinatorial optimization: A survey. Computers & Operations Research, 134:105400, 2021.
- Loh [2011] Wei-Yin Loh. Classification and regression trees. Wiley interdisciplinary reviews: data mining and knowledge discovery, 1(1):14–23, 2011.
- Liu et al. [2023] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023.
- Nam et al. [2023b] Jaehyun Nam, Jihoon Tack, Kyungmin Lee, Hankook Lee, and Jinwoo Shin. Stunt: Few-shot tabular learning with self-generated tasks from unlabeled tables. In International Conference on Learning Representations, 2023b.
- Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Ding et al. [2023] Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.
- Asuncion and Newman [2007] Arthur Asuncion and David Newman. Uci machine learning repository, 2007.
- Akiba et al. [2019] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2623–2631, 2019.
Appendix: Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning
Appendix A Prompt examples
A.1 Generate a new column
In Listing 4, we present the prompt which instructs an LLM to generate a new column name. The prompt includes a detailed explanation of each column's features, specifying their types and values. We restrict the new column to be either binary or categorical for convenience.
A.2 Initialize a rule
In Listing 5, we present the prompt which instructs an LLM to create an initial rule for generating the new column. The LLM generates this rule considering the column features in the dataset and the meaning of the new column name.
A.3 Generate a rule
In Listing 6, we present the prompt which instructs an LLM to generate a rule better than those it has previously generated. The list of (rule, tree-based reasoning, score) tuples is provided in the prompt. Since the score in this prompt is the accuracy of the XGBoost classifier, the trajectory is sorted in ascending score order; for a regression task that uses mean absolute error, the trajectory is sorted in descending score order.
A.4 Translate a rule into Python code
In Listing 7, we present the prompt which instructs an LLM to translate the rule into Python code. To eliminate the potential for errors, we constrain the conversion in multiple ways. First, we specify the types of variables wherever possible. Next, we instruct the LLM to consider the types of feature values before operating on them (e.g., avoiding adding a categorical value to a numerical value).
Appendix B Dataset details
In this section, we provide further details on the datasets.
B.1 Using language descriptions.
1. Disease (https://www.kaggle.com/datasets/uom190346a/disease-symptoms-and-patient-profile-dataset)
A classification task predicting the patient’s diagnosis of the disease given the following attributes:
• Binary (Yes or No): Fever, Fatigue
• Numerical (Range): Age (25–90)
• Categorical: Gender (Female or Male), Blood Pressure (High, Normal, or Low), Cholesterol Level (High, Normal, or Low)
2. Clinical Trial (https://data.projectdatasphere.org/projectdatasphere/html/content/119)
A classification task predicting patient mortality in clinical trials given the following attributes:
• Binary (Yes or No) - Historical Disease: Deep Vein Thrombosis, Pulmonary Embolism, Antiandrogen Therapy, Cardiac Failure, Respiratory Failure, Venous Insufficiency, Coronary Artery Disease, Myocardial Infarction, Hypertension, Peripheral Arterial Occlusive Disease
• Binary (Yes or No) - Medication: Dexamethasone, Ondansetron, Heparin, Fluorouracil, Ranitidine, Cisplatin, Metoclopramide, Carboplatin, Furosemide
3. Academic (https://www.kaggle.com/datasets/missionjee/students-dropout-and-academic-success-dataset)
A classification task predicting whether the student dropout or not given the following attributes:
• Numerical: Marital Status, Daytime/Evening Attendance, Previous Qualification, Nationality, Father's Qualification, Father's Occupation, Displaced, Debtor, Tuition Fees up to Date, Gender, Scholarship Holder, Age at Enrollment, International, Curricular Units 1st Sem (Approved), Curricular Units 1st Sem (Grade), Curricular Units 2nd Sem (Approved), Curricular Units 2nd Sem (Grade).
4. Enefit
A regression task predicting the energy consumption of the day given the following attributes:
• Numerical: Prediction Unit Id, Day, Hour, Lowest Price Per MWh, Highest Price Per MWh, Installed Capacity, Euros Per MWh, Local Forecast Temperature, Local Forecast Dewpoint, Local Forecast Cloudcover Total, Local Forecast 10 Metre U Wind Component, Local Forecast 10 Metre V Wind Component, Local Forecast Direct Solar Radiation, Local Forecast Surface Solar Radiation Downwards, Local Forecast Total Precipitation.
5. Tesla Stock (https://www.kaggle.com/datasets/guillemservera/tsla-stock-data)
A regression task predicting the highest stock price of the target day given the following attributes:
• Numerical: Open Price of 2 Days Before, Highest Price of 2 Days Before, Lowest Price of 2 Days Before, Close Price of 2 Days Before, Open Price of 1 Day Before, Highest Price of 1 Day Before, Lowest Price of 1 Day Before, Close Price of 1 Day Before, Open Price of the Target Day, Time Index.
B.2 Without using language descriptions.
For our main results without language descriptions (see Table 2), we use the 19 classification datasets from the tabular benchmark proposed by Grinsztajn et al. [13]. We provide brief statistics of these datasets in this section. Due to limited compute resources, we uniformly subsample 50,000 samples when the original number of samples is larger than 50,000, following the dataset curation scheme of Grinsztajn et al. [13].
Dataset | OpenML ID | # Samples | # Features |
rl | 44160 | 4970 | 12 |
electricity | 44156 | 38474 | 8 |
compass | 44162 | 16644 | 17 |
wine | 44091 | 2554 | 11 |
house_16H | 44123 | 13488 | 16 |
MagicTelescope (Magic) | 44125 | 13376 | 10 |
Higgs | 44129 | 940160 | 24 |
jannis | 44131 | 57580 | 54 |
credit | 44089 | 16714 | 10 |
eye_movements | 44157 | 7608 | 23 |
kddCup09_upselling (kddCup09) | 44158 | 5032 | 45 |
road-safety | 44161 | 111762 | 32 |
bank-marketing | 44126 | 10578 | 7 |
phoneme | 44127 | 3172 | 5 |
covertype | 44159 | 423680 | 54 |
california | 44090 | 20634 | 8 |
kdd_ipums_la_97-small (kdd_ipums_la) | 44124 | 5188 | 20 |
MiniBooNE | 44128 | 72998 | 50 |
pol | 44122 | 10082 | 26 |
Appendix C Baseline details
In this section, we outline the hyperparameter search space for the baseline models. For each random split of every dataset, we find the optimal set of hyperparameters using a random sampler run for 400 trials. We utilize the Optuna library [47] for the hyperparameter tuning; a minimal sketch of this setup is provided after the XGBoost search space table below.
C.1 XGBoost
Parameter | Distribution |
Max depth | UniformInt [1, 11] |
Num estimators | UniformInt [100, 6100, 200] |
Min child weight | LogUniformInt [1, 1e2] |
Subsample | Uniform [0.5, 1] |
Learning rate | LogUniform [1e-5, 0.7] |
Col sample by level | Uniform [0.5, 1] |
Col sample by tree | Uniform [0.5, 1] |
Gamma | LogUniform [1e-8, 7] |
Lambda | LogUniform [1, 4] |
Alpha | LogUniform [1e-8, 1e2] |
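A minimal sketch of this tuning loop, assuming Optuna's random sampler and the XGBoost search space listed in the table above; the validation-error objective and function name are illustrative.

```python
import optuna
from optuna.samplers import RandomSampler
from xgboost import XGBClassifier

def tune_xgboost(X_tr, y_tr, X_va, y_va, n_trials=400):
    def objective(trial):
        params = {
            "max_depth": trial.suggest_int("max_depth", 1, 11),
            "n_estimators": trial.suggest_int("n_estimators", 100, 6100, step=200),
            "min_child_weight": trial.suggest_int("min_child_weight", 1, 100, log=True),
            "subsample": trial.suggest_float("subsample", 0.5, 1.0),
            "learning_rate": trial.suggest_float("learning_rate", 1e-5, 0.7, log=True),
            "colsample_bylevel": trial.suggest_float("colsample_bylevel", 0.5, 1.0),
            "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
            "gamma": trial.suggest_float("gamma", 1e-8, 7.0, log=True),
            "reg_lambda": trial.suggest_float("reg_lambda", 1.0, 4.0, log=True),
            "reg_alpha": trial.suggest_float("reg_alpha", 1e-8, 100.0, log=True),
        }
        model = XGBClassifier(**params).fit(X_tr, y_tr)
        return 1.0 - model.score(X_va, y_va)  # validation error to minimize

    study = optuna.create_study(direction="minimize", sampler=RandomSampler())
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```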
C.2 Multilayer perception (MLP)
For MLP, we follow the hyperparameter search space used in Grinsztajn et al. [13] and the architecture used in Gorishniy et al. [31]. Specifically, the MLP architecture for tabular data additionally learns feature embeddings for categorical features. MLPs are trained for up to 300 epochs with early stopping, and the model that performs best on the validation set is selected; training is stopped early if the validation score does not improve for 40 epochs. For the learning rate scheduler, we use the ReduceLROnPlateau implementation in PyTorch.
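A minimal sketch of this training schedule, assuming a PyTorch model and a pre-built training DataLoader; the architecture itself (an MLP with categorical embeddings, following Gorishniy et al. [31]) and the validation routine are left abstract.

```python
import copy
import torch

def train_mlp(model, train_loader, evaluate, lr=1e-3, max_epochs=300, patience=40):
    """Train up to 300 epochs, keep the best validation model, stop after 40 stale epochs."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")
    loss_fn = torch.nn.CrossEntropyLoss()
    best_score, best_state, stale = float("-inf"), None, 0
    for _ in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        score = evaluate(model)   # validation accuracy
        scheduler.step(score)     # ReduceLROnPlateau on the validation metric
        if score > best_score:
            best_score, best_state, stale = score, copy.deepcopy(model.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:  # early stopping
                break
    model.load_state_dict(best_state)
    return model
```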
Parameter | Distribution |
Num layers | UniformInt [1, 8] |
Layer size | UniformInt [16, 1024] |
Dropout | Uniform [0, 0.5] |
Learning rate | LogUniform [1e-5, 1e-2] |
Category embedding size | UniformInt [64, 512] |
Learning rate scheduler | [True, False] |
Batch size | [256, 512, 1024] |
C.3 HyperFast
For HyperFast [32], we consider the hyperparameter space used in the original paper.
Parameter | Distribution |
N ensemble | [1, 4, 8, 16, 32] |
Batch size | [1024, 2048] |
NN bias | [True, False] |
Stratify sampling | [True, False] |
Optimization | [None, ‘optimize’, ‘ensemble_optimize’] |
Optimize steps | [1, 4, 8, 16, 32, 64, 128] |
Seed | UniformInt [0, 9] |
Appendix D Compute resources
We ran our experiments on a variety of machines. We used:
• CPU: Intel(R) Xeon(R) Gold 6226R, GPU: RTX 3090
• CPU: Intel(R) Xeon(R) Gold 6426Y, GPU: RTX A4090
• CPU: Intel(R) Xeon(R) Gold 6426Y, GPU: RTX A6000
Appendix E Broader impacts
Since our method is particularly effective when collecting real data is expensive or restricted, it may be of particular interest to practitioners in domains such as finance or medicine, where real data is hard to obtain due to issues like privacy. However, as features introduced by OCTree are artificially generated, careful inspection of the generated features is needed.