
1 Portland State University, email: {xinhaoz, jinghanz, kunpeng}@pdx.edu
2 University of Montreal, email: [email protected]
3 Visa Research, email: [email protected]

Retrieval-Augmented Feature Generation for Domain-Specific Classification

Xinhao Zhang 1, Jinghan Zhang 1, Fengran Mo 2, Yuzhong Chen 3, Kunpeng Liu 1 (corresponding author)
Abstract

Feature generation can significantly enhance learning outcomes, particularly for tasks with limited data. An effective way to improve feature generation is by expanding the current feature space using existing features and enriching the informational content. However, generating new, interpretable features in application fields often requires domain-specific knowledge about the existing features. This paper introduces a new method RAFG for generating reasonable and explainable features specific to domain classification tasks. To generate new features with interpretability in domain knowledge, we perform information retrieval on existing features to identify potential feature associations, and utilize these associations to generate meaningful features. Furthermore, we develop a Large Language Model (LLM)-based framework for feature generation with reasoning to verify and filter features during the generation process. Experiments across several datasets in medical, economic, and geographic domains show that our RAFG method produces high-quality, meaningful features and significantly improves classification performance compared with baseline methods.

Keywords:
Feature generation Information retrieval Large language models.

1 Introduction

In domain-specific applications, e.g., disease classification and insurance claim prediction, data are usually scarce and carry only a limited set of features, so a machine learning model trained on them often cannot achieve satisfactory performance. A common practice is to generate new features to support the model's decision-making. However, manual feature generation relies on domain knowledge, while automatic feature generation lacks interpretability. Besides, existing studies usually focus only on the data structure [5, 20, 19]. Although efficient, these approaches lack transparency and are hard to explain. In addition, domain-specific knowledge resources are expensive to obtain.

Figure 1: Framework of RAFG. We adopt an LLM to generate new features according to the retrieved textual information containing expertise knowledge (e.g., the BMI as the shown case).

An intuitive solution is to utilize the information and associations of existing features, which requires extracting and understanding the relationships between relevant features. This is non-trivial since the model must use domain knowledge as a reference and generate new features on top of it, which is challenging for a model without specific fine-tuning. Although fine-tuning a language model with domain-specific data can inject domain knowledge and enhance performance, fine-tuning is costly and limits the model's generalizability across domains. Thus, an alternative is to retrieve useful information from an external knowledge base as a supporting reference, and then leverage a generator model with a reasoning mechanism, e.g., an LLM, to integrate the original features with the corresponding supporting knowledge to generate the new feature. This follows the principle of retrieval-augmented generation (RAG) [7], which we apply to domain-specific feature generation. On the one hand, the retrieved knowledge can serve as an explicit reference for interpretability. On the other hand, the new feature is generated from the existing features only, which reduces the risk of injecting noise and makes efficient use of the limited available information.

As shown in Figure 1, the goal is to detect heart disease based on general features. A more concrete indicator is the BMI feature, which can be generated from the original features Height and Weight. This generation can be achieved with the support of external knowledge about BMI, which is our motivation for implementing retrieval-augmented feature generation. Specifically, we first deploy an LLM to capture the relevance between each pair of features based on the description of the domain dataset. Then, the LLM forms a query to retrieve relevant knowledge based on the selected features and generates the new feature together with the retrieved results. We expect the LLM to extract the vital information needed to generate a new feature from each pair of original features and thereby enrich the data attributes, which can enhance performance on a domain-specific downstream task.

Our Targets. Comprehensively, we aim to address three main challenges in generating features for domain-specific scenarios: 1) how to extract and utilize the useful information by understanding the relations among the features, 2) how to utilize retrieval-augmented generation based on LLMs to generate useful and explainable features with reasoning procedure, and 3) how to form and decide an optimal feature space based on the original and generated feature.

Our Approach. To address these challenges, we propose a novel LLM-based Retrieval-Augmented Feature Generation (RAFG), which generates features with retrieved knowledge of existing features and the reasoning ability of LLMs. Specifically, 1) we instruct an LLM to analyze and capture the implicit pattern within the structured data and the textual information, e.g., dataset descriptions and feature labels. With advanced reasoning capabilities, the LLM can provide a logical and cognitive process to derive the relation among the features and generate a query for later domain-specific retrieval; 2) we deploy the RAG technology based on the previously generated query to identify the relevant reference from a reliable knowledge base. The search results are integrated with the original selected feature to feed to the LLM. Eventually, a new feature is generated by the reasoning and generation capacity of LLM under the training-free scope; 3) we design an automated feature generation mechanism that continually refines the system’s feature generation capabilities. The new feature generated by the LLM can be automatically integrated into the data attribution. We then adopt the validated features to update the feature space until reaching the pre-defined iterations or achieve a determined optimal feature space. The iteratively generated features are expected to significantly enhance the domain-specific interpretability by aligning with the needs of the associated fields.

In summary, our contribution includes:

  1. We introduce a novel LLM-based retrieval-augmented feature generation method RAFG for domain-specific scenarios, which mines and utilizes the potential relationships of existing features to generate new features.

  2. The RAFG framework is training-free and can be adapted for any specific domain. The generated feature is explainable according to the retrieved reference and validated for enriching the feature space by the automatic evaluation based on LLMs.

  3. Experimental results on different domain datasets demonstrate the effectiveness and robustness of our RAFG across various downstream tasks. Our analysis confirms the effects of retrieval-augmented generated features.

2 Related Work

2.1 Automated Feature Generation

Feature Engineering. Feature engineering is an essential part of machine learning, including selecting, modifying or creating new features from raw data to improve the task performance [18]. The target of this process is to optimize the feature representation space. In this process, we tailor the best and most interpretable feature set for specific machine learning tasks for better model accuracy and generalization. Among feature engineering, feature generation is the approach that generates new features from existing ones in a dataset. This is usually through various mathematical or logical transformation operations [5, 20]. The goal of feature generation is to create complex latent feature spaces [28, 17]. However, existing feature generation methods often lack transparency and require a large amount of manual operations.

Automated Feature Generation. With the advancement of deep learning technologies and LLMs, automated feature generation has seen significant development and application [14, 23]. Researchers have developed various automated feature engineering methods, such as ExploreKit [10] and Cognito [12]. These methods increase efficiency in processing large datasets and reduce the need for manual intervention. However, they often focus on the structural and numerical information of the data table and overlook the semantic and textual information. Also, the “black box” operations of feature generation with deep learning methods make the generated features challenging to explain and validate.

2.2 Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) [13] is a technique that integrates the information retrieval capabilities with the generative language models to enhance the performance of the final output. RAG effectively assists LLMs with tasks requiring extensive and specific domain knowledge [26, 9]. Given a domain-specific task, the LLMs can access a large external library, which might be a set of documents or knowledge related to the task [7]. Then, the LLMs generate a query according to the task information and use it for searching based on the similarity between the query and the candidate documents. This approach helps the LLMs to produce responses that are more accurate, domain-relevant, and reduce hallucinations [25]. However, existing RAG for feature generation lacks a reasoning process [22, 26, 13]. To generate interpretable features, the LLM needs to analyze the relevance and rationality of the retrieved information and the generated features. The reasoning process ensures the effectiveness of the new features by validating their effectiveness and explainability in the domain knowledge.

3 Problem Statement

We formulate the task as searching for potential features to enrich and reconstruct an optimal and explainable feature representation space that advances certain downstream tasks, such as classification and regression. Concretely, we denote the original tabular dataset as $\mathcal{D}_{0}=\{\mathcal{F}_{0};y\}$, which includes an original feature set $\mathcal{F}_{0}=\{f_{i}\}_{i=1}^{I_{0}}$ with features $\{f_{i}\}$ and its target label $y$, together with textual information $\mathcal{C}_{0}$ including the labels and the data description. Our optimization objective is to automatically generate new features $\{g_{t}\},t=1,2,\ldots$ based on the retrieved knowledge with a reasoning procedure, so as to reconstruct an optimal feature set $\mathcal{F}^{*}$:

$$\mathcal{F}^{*}=\underset{\hat{\mathcal{F}}}{\text{argmax}}\;\mathcal{P}_{\mathcal{A}}(\hat{\mathcal{F}},y), \tag{1}$$

where $\mathcal{A}$ is a domain-specific downstream task (e.g., predicting life expectancy), $\mathcal{P}$ is the performance indicator of $\mathcal{A}$, and $\hat{\mathcal{F}}=\{\mathcal{F}_{0},\{g_{t}\}\}$ is an optimized feature set reconstructed on $\mathcal{F}_{0}$.

4 Methodology

Figure 2: Framework of RAFG. Given an input data table including a description, feature vectors, and a target label vector, the LLM first integrates the text information of the description, label information, and data types to embed and form a query. Then, with this query we adopt RAG technology to search through an external library for one or several relevant documents that can guide the LLM in creating the new feature with the most potential. After that, we test the temporary data table with the new feature for metric improvement, and the LLM decides whether to keep this new feature. This searching and generation process iterates until reaching the maximum number of iterations, or until the best feature space is found.

In this section, we introduce our novel feature generation method named Retrieval Augmented Feature Generation (RAFG), which generates features dynamically and automatically by mining the text information including feature labels and data descriptions of a dataset with a large language model (LLM). Within this framework, we utilize the LLM as a text-miner and a feature generator, which embeds and conducts retrieval based on the text information of the dataset and existing features. Then, the LLM generates new features according to the retrieved external knowledge and the output of RAG. The model evaluates the qualities and reliability of newly generated features to optimize and update the feature set. The overview of the RAFG framework is shown in Figure 2, which includes three stages: (1) Query Generation and Domain Knowledge Retrieval; (2) Retrieval-Augmented Feature Generation with Reasoning; and (3) Feature Update Determination by LLM.

4.1 Query Generation and Domain Knowledge Retrieval

In this stage, our objective is to identify and extract potential new features and their generation methods that could enhance the performance of downstream tasks. First, we collect the textual information of the current feature set $\mathcal{F}=\mathcal{F}_{t-1}=\{f_{i}\}_{i=1}^{I}$, including the goals of the downstream tasks $\mathcal{A}'$, and the textual information $\mathcal{C}=\mathcal{C}_{t-1}$ to construct a query $\mathcal{Q}_{t}$ through a pre-trained LLM $p_{\phi}$:

$$\mathcal{Q}_{t}=p_{\phi}(\mathcal{A}',\mathcal{C}_{t-1}), \tag{2}$$

where $p_{\phi}$ denotes the LLM with parameters $\phi$, and $t=1,2,\ldots$ is the number of iterations. After integration, the LLM embeds the query:

$$q_{t}=p_{\phi}(\text{embed}(\mathcal{Q}_{t})), \tag{3}$$

where $\text{embed}(\cdot)$ is the embedding process that transforms the input into a vector representation. We then retrieve through the whole library for the top-$k$ most relevant documents to query $\mathcal{Q}_{t}$. We evaluate the relevance of each document $\mathcal{R}_{j}\in\mathcal{R}_{k},j=1,2,\ldots,k$ to the query $\mathcal{Q}_{t}$ by the cosine similarity of the embeddings and select the top-$k$ documents with highest similarity:

$$\text{sim}(q_{t},r_{j})=\frac{q_{t}\cdot r_{j}}{\|q_{t}\|\|r_{j}\|}, \tag{4}$$

where $r_{j}=p_{\phi}(\text{embed}(\mathcal{R}_{j}))$ and $\|\cdot\|\in\mathbb{R}$ is the norm function.
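The ranking step in Eq. (4) can be illustrated with a minimal sketch. It assumes the query and document embeddings are already available as plain lists of floats; the function names `cosine_similarity` and `top_k_documents`, the zero-vector convention, and the toy vectors are hypothetical and not taken from the RAFG implementation.

```python
import math


def cosine_similarity(q, r):
    """Cosine similarity between two equal-length vectors, as in Eq. (4)."""
    dot = sum(a * b for a, b in zip(q, r))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_r = math.sqrt(sum(b * b for b in r))
    if norm_q == 0.0 or norm_r == 0.0:
        return 0.0  # convention chosen here for zero vectors (an assumption)
    return dot / (norm_q * norm_r)


def top_k_documents(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings (hypothetical values, not produced by a real embedding model).
query = [1.0, 0.0, 1.0]
docs = {
    "bmi": [0.9, 0.1, 0.8],
    "gdp": [0.0, 1.0, 0.0],
    "density": [0.5, 0.5, 0.5],
}
top = top_k_documents(query, docs, k=2)
print(top)
```

In practice the embeddings would come from the embedding model and the candidate set from the search API, so this sketch only covers the similarity ranking itself.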

The set of documents $\mathcal{R}=\{\mathcal{R}_{k}\},k=1,2,\ldots$ consists of documents $\mathcal{R}_{k}$, each of which contains the information of a potential feature in $\{g_{k}\},k=1,2,\ldots$ that can be obtained through specific calculations, combinations, or judgments on existing features in $\mathcal{F}_{t-1}$:

$$g_{k}=\{p_{\phi}(o(\mathcal{F}_{t-1})\mid\mathcal{R}_{k})\}, \tag{5}$$

where $o$ denotes the calculation, combination, or judgment actions on features in $\mathcal{F}_{t-1}$. Specifically, we define the operations as follows:

  • Calculation involves arithmetic transformations, statistical measures, or algorithmic functions applied to derive $g_{t}$. In our case, we define a calculator for features as a function $o_{c}:\mathbb{R}^{n}\rightarrow\mathbb{R}$ applied to a vector of existing features, resulting in a scalar or vector output as the new feature.

  • Combination involves integrating two or more features into a single new feature through methods such as concatenation, averaging, or more complex fusion techniques. In our case, we define a calculator for combination as a function $o_{f}:\mathbb{R}^{n}\times\mathbb{R}^{m}\rightarrow\mathbb{R}^{p}$, where $n$, $m$, and $p$ are dimensions of the input features and the resulting feature, respectively.

  • Judgment involves logical or rule-based decision-making processes that integrate domain knowledge or empirical rules to form a new feature. In our case, we define a calculator for judgment as a decision function $o_{d}:\mathcal{X}\rightarrow\{0,1\}$ where $\mathcal{X}$ is the input space composed of one or more feature vectors, and the output is a new categorical feature based on predefined criteria.
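The three operation types can be sketched as small column-wise helpers. This is an illustrative sketch only: the helper names, the choice of averaging as the combination method, the sample values, and the BMI threshold of 25 are assumptions made for the example, not part of the RAFG code.

```python
def calculation(func, *columns):
    """o_c: apply a row-wise arithmetic function to existing feature columns."""
    return [func(*row) for row in zip(*columns)]


def combination(*columns):
    """o_f: fuse several features; here by row-wise averaging (one option)."""
    return [sum(row) / len(row) for row in zip(*columns)]


def judgment(predicate, *columns):
    """o_d: map each row to 0 or 1 according to a rule."""
    return [1 if predicate(*row) else 0 for row in zip(*columns)]


# Hypothetical sample columns, following the Height/Weight example of Figure 1.
height_m = [1.60, 1.75, 1.80]
weight_kg = [66.0, 70.0, 97.2]

# Calculation: BMI = weight / height^2.
bmi = calculation(lambda h, w: w / (h * h), height_m, weight_kg)

# Judgment: flag rows whose BMI is at least 25 (illustrative threshold).
overweight = judgment(lambda b: b >= 25.0, bmi)

# Combination: average two hypothetical score columns.
mean_score = combination([1.0, 2.0], [3.0, 4.0])

print(bmi, overweight, mean_score)
```

Each helper returns a column of the same length as its inputs, so the result can be appended to the data table directly.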

4.2 Retrieval-Augmented Feature Generation with Reasoning

In this stage, our objective is to generate a new feature and align it with the existing features in the data table. Now we have introduced external knowledge about features through RAG. This knowledge includes the deeper structural details and relationships between features, which are not directly reflected in the original dataset. Here we adopt the LLM to analyze and extract feature structures and relationships, and further generate new features to expand the feature space.

From the $k$ documents with potential features $\{g_{k}\}$, the LLM first assesses these potential features and then ranks them based on their potential to enhance the performance of downstream tasks according to LLM’s evaluation. Here, to help the model’s evaluation, we employ a “Chain of Thought” [24] strategy to encourage the LLM to think and formulate logical hypotheses and reasoning for the possible outcomes of integrating each potential feature into the data table step-by-step.

The LLM then selects the most promising document $\mathcal{R}_{t}\in\{\mathcal{R}_{k}\}$ from these candidates, and generates a new feature $g_{t}$ according to $\mathcal{R}_{t}$. We integrate $g_{t}$ into the data table:

$$g_{t}=\{p_{\phi}(o(\mathcal{F}_{t-1})\mid\mathcal{R}_{t})\},\quad\mathcal{F}_{t}=\{\mathcal{F}_{t-1},g_{t}\}, \tag{6}$$

$$\mathcal{D}_{t}=\{\mathcal{F}_{t};y\}=\{\{\mathcal{F}_{t-1},g_{t}\};y\}. \tag{7}$$

Here, $g_{t}$ is generated by the corresponding calculation, combination, or judgment method $o_{t}$ mentioned in document $\mathcal{R}_{t}$ with existing features and is of the same length as $f_{i}$.

Finally, we feed $\mathcal{D}_{t}$ into the downstream task and obtain the performance metric $\mathcal{P}_{t}$, which is then fed back to the LLM for performance evaluation.

4.3 Feature Update Determination by LLM

In this stage, our objective is to test the effectiveness of the newly generated feature and decide whether to keep it. We decide whether to keep the feature by the improvement $\mathcal{P}_{t}-\mathcal{P}_{t-1}$, where a higher value of $\mathcal{P}$ indicates better performance:

$$\mathcal{F}_{t}=\begin{cases}\mathcal{F}_{t}&\text{if }\mathcal{P}_{t}>\mathcal{P}_{t-1},\\ \mathcal{F}_{t-1}&\text{otherwise}.\end{cases} \tag{8}$$

When adding the new feature $g_{t}$ improves the downstream task performance of the data table, we formally adopt $g_{t}$ as a new feature within the table. The LLM updates the dataset text information $\mathcal{C}=\mathcal{C}_{t}=p_{\phi}(\mathcal{C}_{t-1}\mid\mathcal{R}_{t})$ to incorporate the information related to $g_{t}$. The feature generation process expands the dimensionality of the feature space. As the feature space expands, the model can understand and interpret the data more comprehensively, further improving the performance of downstream tasks and enhancing generalization capabilities.

After that, we move forward to generate new queries to search for more potential features with the updated data table $\mathcal{D}_{t}$ and the data description $\mathcal{C}_{t}$. This iteration process continues until a predefined maximum number of iterations $T$ is reached, or until the optimal feature set is found when peak performance in the downstream task is achieved.

Algorithm 1 Retrieval-Augmented Feature Generation
Input: Tabular dataset $\mathcal{D}_{0}=\{\mathcal{F}_{0};y\}$, textual information $\mathcal{C}_{0}$, goals of downstream tasks $\mathcal{A}'$, iteration time $T$, LLM $p_{\phi}$
$\mathcal{F}\leftarrow\mathcal{F}_{0}$, $\mathcal{C}\leftarrow\mathcal{C}_{0}$, $\mathcal{D}\leftarrow\mathcal{D}_{0}$
$\mathcal{P}\leftarrow\text{DownStreamTask}(\mathcal{D})$
for $t=1$ to $T$ do
     $\mathcal{Q}_{t}\leftarrow p_{\phi}(\mathcal{A}',\mathcal{C})$
     $q_{t}\leftarrow p_{\phi}(\text{embed}(\mathcal{Q}_{t}))$
     $\{\mathcal{R}_{k}\}\leftarrow\text{RetrieveFromLibrary}(q_{t})$  ▷ Retrieve $\{\mathcal{R}_{k}\}$ with top-$k$ cosine similarity.
     $\mathcal{R}_{t}\leftarrow p_{\phi}(\{\mathcal{R}_{k}\})$  ▷ The LLM selects the most promising document from these candidates.
     $g_{t}\leftarrow\{p_{\phi}(o(\mathcal{F})\mid\mathcal{R}_{t})\}$
     $\mathcal{C}_{t}\leftarrow p_{\phi}(\mathcal{C}_{t-1}\mid\mathcal{R}_{t})$
     $\mathcal{D}_{t}\leftarrow\{\{\mathcal{F},g_{t}\};y\}$
     $\mathcal{P}_{t}\leftarrow\text{DownStreamTask}(\mathcal{D}_{t})$
     if $\mathcal{P}_{t}>\mathcal{P}$ then
         $\mathcal{P}\leftarrow\mathcal{P}_{t}$, $\mathcal{F}\leftarrow\{\mathcal{F},g_{t}\}$, $\mathcal{C}\leftarrow\mathcal{C}_{t}$
     end if
end for
return $\mathcal{F}$
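The control flow of Algorithm 1 can be sketched in Python under strong simplifications. In this sketch, query generation, retrieval, and LLM-based feature generation are collapsed into a single hypothetical callable `propose_feature`, and the downstream task is replaced by a toy scoring function. Neither reflects the actual RAFG components; only the accept-or-reject logic of Eq. (8) is illustrated.

```python
def rafg_loop(features, evaluate, propose_feature, max_iters):
    """Greedy accept/reject loop mirroring the structure of Algorithm 1.

    features: dict mapping feature name -> list of values.
    evaluate: callable returning a score (higher is better) for a feature dict.
    propose_feature: callable (features, t) -> (name, values) or None; a
        stand-in for query generation, retrieval, and LLM generation.
    """
    current = dict(features)
    best_score = evaluate(current)
    history = []
    for t in range(1, max_iters + 1):
        proposal = propose_feature(current, t)
        if proposal is None:
            break
        name, values = proposal
        candidate = dict(current)
        candidate[name] = values
        score = evaluate(candidate)
        accepted = score > best_score  # Eq. (8): keep only on strict improvement
        if accepted:
            current, best_score = candidate, score
        history.append((name, score, accepted))
    return current, best_score, history


# Toy data and stubs (all hypothetical).
y = [0, 0, 1, 1]
table = {"height": [1.6, 1.8, 1.6, 1.8], "weight": [50.0, 70.0, 65.0, 90.0]}


def toy_evaluate(feats):
    """Best accuracy of a 'value >= column mean' rule over all columns."""
    best = 0.0
    for values in feats.values():
        mean = sum(values) / len(values)
        preds = [1 if v >= mean else 0 for v in values]
        acc = sum(p == t for p, t in zip(preds, y)) / len(y)
        best = max(best, acc)
    return best


def toy_proposer(feats, t):
    """Propose BMI first, then an uninformative sum, then stop."""
    pairs = list(zip(feats["height"], feats["weight"]))
    if t == 1:
        return "bmi", [w / (h * h) for h, w in pairs]
    if t == 2:
        return "height_plus_weight", [h + w for h, w in pairs]
    return None


final, best, history = rafg_loop(table, toy_evaluate, toy_proposer, max_iters=5)
print(list(final), best, history)
```

In this toy run the BMI column is accepted because it raises the score, while the second proposal is rejected because it brings no strict improvement, which is the behavior the acceptance condition in Algorithm 1 describes.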

5 Experiments

In this section, we present four experiments to evaluate the effectiveness and impacts of the RAFG. First, we compare the performance of the RAFG against several baseline methods on four downstream tasks. Second, we present the information gain during feature generation. Then, we showcase the relationship between new features and existing features. Finally, we discuss the reason for the improvement of performance.

5.1 Experiments Settings

Datasets. We evaluate the RAFG method on four real-world datasets, including Parkinson’s Disease Classification (PDC) [16], Animal Information Dataset (AID) [3], Global Country Information Dataset 2023 (GCI) [1] and Diabetes Health Indicators Dataset (DIA) [21]. The detailed information is shown in Table 1.

Table 1: Dataset descriptions. The datasets come from four different domains.

| Datasets | Samples | Features | Class | Target |
|---|---|---|---|---|
| PDC | 756 | 754 | 2 | Parkinson Binary |
| AID | 205 | 15 | 10 | Social Structure |
| GCI | 195 | 34 | 4 | Life Expectancy |
| DIA | 253680 | 21 | 2 | Diabetes Binary |

Metrics. We evaluate the model performance by the following metrics: Overall Accuracy (Acc) measures the proportion of true results in the total dataset. Precision (Prec) reflects the ratio of true positive predictions to all positive predictions for each class. Recall (Rec), also known as sensitivity, reflects the ratio of true positive predictions to all actual positives for each class. F-Measure (F1) is the harmonic mean of precision and recall, providing a single score that balances both metrics.
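The four metrics can be computed from predicted and true labels as in the sketch below. The text defines precision and recall per class but does not state how the per-class values are aggregated; macro averaging (an unweighted mean over classes) is assumed here, and the function name and toy labels are illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return {
        "acc": accuracy,
        "prec": sum(precisions) / n,
        "rec": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy three-class example (hypothetical labels).
metrics = classification_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(metrics)
```

With macro averaging, rare classes weigh as much as frequent ones, which may help explain why precision and recall can be far below accuracy on a dataset with many classes.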

Baseline Models and Methods. We apply RAFG on top of a range of classification models, including Random Forests (RF) [15], Decision Tree (DT) [4], TabTransformer (TT) [8] and TabNet (TN) [2], and compare the performance of each model with and without RAFG. We compare RAFG with several baselines, including (1) raw data without feature generation (Raw), (2) the Least Absolute Shrinkage and Selection Operator (Lasso) [27], (3) the Feature Engineering for Predictive Modeling using Reinforcement Learning [11] (RL), and (4) the Context-Aware Automated Feature Engineering (CAAFE) [6].

Implementation Details. In our implementation for feature generation, we use GPT-4o (platform.openai.com) as the query generator paired with Google’s Custom Search JSON API (developers.google.com/custom-search/) to perform the retrieval tasks. We employ Wikipedia (en.wikipedia.org/) as the external knowledge library for feature generation. This setup involves conducting multiple searches within a 5-second window for each query and narrowing down the search results to the top three most relevant documents. GPT-4o then selects the most suitable document for feature generation from these. For our baseline models, Lasso regression is configured with a maximum of 1000 iterations, a tolerance of $1\times 10^{-4}$, 100 alpha values linearly spaced for regularization strength optimization, an epsilon of $1\times 10^{-3}$ to fine-tune the range, and 5-fold cross-validation for robustness. The RL setup includes a batch size of 32, a learning rate of 0.01, an epsilon of 0.9 for exploration, a gamma of 0.9 indicating the discount factor, a target network update every 100 iterations, a memory capacity of 20 for the replay buffer, and 30 exploration steps to balance exploration and exploitation effectively. CAAFE is configured with default parameters, using GPT-4o as the LLM, and the number of iterations is set to 3.

5.2 Experimental Results

Overall Performance. Table 3 shows the overall results of RAFG across four domain-specific datasets.

(1) Compared with baseline methods, RAFG consistently achieves superior accuracy. For instance, with Random Forest, RAFG improves accuracy on all four datasets. With Decision Tree on the AID dataset, RAFG raises accuracy by 19.2 percentage points over raw data (from 0.423 to 0.615) and by 3.8 percentage points over CAAFE (0.577), the second-highest accuracy in that setting. The RAFG method shows particular strength in boosting model accuracy. These results suggest that enriching the feature space with domain-specific knowledge enhances the model’s understanding and boosts overall performance.

(2) Across different models and metrics, RAFG outperforms the other methods in most settings. Notably, on the GCI dataset under DT, RAFG achieves a precision of 66.9%, surpassing the highest baseline precision of 60.7%. Moreover, RAFG raises the F1 score on GCI under DT from 55.6% (raw data) to 68.1%. These results demonstrate RAFG’s capability of reducing misclassification effectively.

(3) The generation of new features through RAFG significantly enhances model performance. As in the results, the addition of features, e.g., from 756 to 770 in the PDC dataset, leads to substantial increases in performance metrics, including accuracy and F1. This result suggests that the new features not only numerically expand the feature set, but also enhance the information richness of the dataset. In this way the models can capture more complex patterns and relationships within the data.

Table 2: Variation in the number of features.

| Dataset | Original | RAFG |
|---|---|---|
| PDC | 756 | 770 (+14) |
| AID | 15 | 16 (+1) |
| GCI | 34 | 39 (+5) |
| DIA | 21 | 23 (+2) |

Searching relevant documents from external knowledge library effectively assists RAFG in acquiring useful information for feature extraction. In Figure 3, we compare the performance of RAFG with and without the use of an external database ("LLM Only" in the figure) across different datasets under the DT. The results demonstrate that incorporating external databases consistently enhances the accuracy of RAFG across all datasets.

Figure 3: The impact of external knowledge library on RAFG accuracy under DT.
Table 3: Overall performance on downstream tasks. The best results are highlighted in bold, and the runner-up results are highlighted in underline. (Higher values indicate better performance.)

| Metric | Method | RF PDC | RF AID | RF GCI | RF DIA | DT PDC | DT AID | DT GCI | DT DIA | TT PDC | TT AID | TT GCI | TT DIA | TN PDC | TN AID | TN GCI | TN DIA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Acc | Raw | 0.825 | 0.635 | 0.674 | 0.859 | 0.773 | 0.423 | 0.571 | 0.794 | 0.841 | 0.577 | 0.306 | 0.861 | 0.677 | 0.365 | 0.510 | 0.859 |
| Acc | Lasso | 0.847 | 0.654 | 0.735 | 0.860 | 0.788 | 0.442 | 0.592 | 0.796 | 0.857 | 0.615 | 0.327 | 0.862 | 0.709 | 0.423 | 0.531 | 0.860 |
| Acc | RL | 0.857 | 0.687 | 0.755 | 0.859 | 0.799 | 0.519 | 0.592 | 0.796 | 0.878 | 0.596 | 0.347 | 0.862 | 0.714 | 0.558 | 0.551 | 0.863 |
| Acc | CAAFE | 0.862 | 0.673 | 0.776 | 0.863 | 0.820 | 0.577 | 0.612 | 0.797 | 0.884 | 0.654 | 0.408 | 0.863 | 0.741 | 0.577 | 0.592 | 0.865 |
| Acc | RAFG | 0.884 | 0.750 | 0.816 | 0.867 | 0.841 | 0.615 | 0.674 | 0.815 | 0.900 | 0.673 | 0.429 | 0.865 | 0.773 | 0.596 | 0.714 | 0.867 |
| Prec | Raw | 0.721 | 0.175 | 0.675 | 0.681 | 0.688 | 0.118 | 0.551 | 0.588 | 0.797 | 0.218 | 0.327 | 0.575 | 0.490 | 0.106 | 0.474 | 0.429 |
| Prec | Lasso | 0.753 | 0.188 | 0.690 | 0.684 | 0.755 | 0.150 | 0.607 | 0.590 | 0.797 | 0.237 | 0.307 | 0.599 | 0.519 | 0.129 | 0.525 | 0.684 |
| Prec | RL | 0.760 | 0.245 | 0.763 | 0.683 | 0.722 | 0.319 | 0.554 | 0.590 | 0.832 | 0.252 | 0.360 | 0.585 | 0.489 | 0.151 | 0.636 | 0.806 |
| Prec | CAAFE | 0.753 | 0.237 | 0.768 | 0.724 | 0.783 | 0.173 | 0.607 | 0.592 | 0.832 | 0.321 | 0.409 | 0.572 | 0.536 | 0.196 | 0.601 | 0.715 |
| Prec | RAFG | 0.793 | 0.343 | 0.791 | 0.735 | 0.777 | 0.261 | 0.669 | 0.605 | 0.857 | 0.349 | 0.450 | 0.570 | 0.559 | 0.242 | 0.683 | 0.723 |
| Rec | Raw | 0.855 | 0.156 | 0.640 | 0.570 | 0.737 | 0.126 | 0.564 | 0.597 | 0.790 | 0.201 | 0.311 | 0.702 | 0.430 | 0.113 | 0.474 | 0.500 |
| Rec | Lasso | 0.818 | 0.194 | 0.720 | 0.575 | 0.727 | 0.141 | 0.594 | 0.599 | 0.820 | 0.224 | 0.333 | 0.696 | 0.583 | 0.101 | 0.533 | 0.538 |
| Rec | RL | 0.842 | 0.206 | 0.742 | 0.568 | 0.742 | 0.244 | 0.549 | 0.598 | 0.846 | 0.236 | 0.373 | 0.691 | 0.440 | 0.101 | 0.533 | 0.501 |
| Rec | CAAFE | 0.890 | 0.337 | 0.766 | 0.559 | 0.763 | 0.262 | 0.603 | 0.603 | 0.832 | 0.433 | 0.310 | 0.705 | 0.653 | 0.116 | 0.592 | 0.575 |
| Rec | RAFG | 0.905 | 0.357 | 0.791 | 0.563 | 0.786 | 0.320 | 0.703 | 0.596 | 0.882 | 0.353 | 0.404 | 0.710 | 0.664 | 0.213 | 0.669 | 0.579 |
| F1 | Raw | 0.750 | 0.146 | 0.636 | 0.587 | 0.702 | 0.122 | 0.556 | 0.592 | 0.793 | 0.209 | 0.289 | 0.593 | 0.419 | 0.110 | 0.470 | 0.462 |
| F1 | Lasso | 0.776 | 0.180 | 0.679 | 0.594 | 0.738 | 0.137 | 0.599 | 0.594 | 0.808 | 0.228 | 0.303 | 0.622 | 0.476 | 0.105 | 0.526 | 0.539 |
| F1 | RL | 0.788 | 0.209 | 0.741 | 0.584 | 0.731 | 0.268 | 0.549 | 0.594 | 0.838 | 0.225 | 0.363 | 0.606 | 0.434 | 0.121 | 0.551 | 0.466 |
| F1 | CAAFE | 0.790 | 0.243 | 0.758 | 0.572 | 0.772 | 0.188 | 0.601 | 0.597 | 0.832 | 0.344 | 0.350 | 0.589 | 0.508 | 0.137 | 0.579 | 0.595 |
| F1 | RAFG | 0.829 | 0.324 | 0.788 | 0.578 | 0.781 | 0.271 | 0.681 | 0.600 | 0.868 | 0.341 | 0.399 | 0.587 | 0.556 | 0.224 | 0.664 | 0.600 |

Information Gain. We then present the information gain during the feature generation process. Here we adopt the information entropy to quantify the information in the dataset. Figure 4 shows the information entropy gain for datasets before and after RAFG. We can see that the new features contribute differently to the information entropy. The fundamental reason RAFG enhances task performance is that it introduces new information that enriches the dataset’s overall information content.

Figure 4: The information gain in different datasets with RAFG. The new features contribute differently to the information of their datasets. The information gain proves that the new features enrich the data.
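One possible way to quantify such an entropy gain is sketched below. The text does not specify how entropy is computed over a table, so this sketch makes several assumptions: each numeric column is discretized into equal-width bins, the Shannon entropy of each column is measured in bits, and the dataset entropy is taken as the sum over columns. All function names and values are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def equal_width_bins(values, n_bins):
    """Map numeric values to bin indices 0..n_bins-1 (assumed discretization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def dataset_entropy(columns, n_bins=4):
    """Sum of per-column entropies (one possible aggregation, an assumption)."""
    return sum(shannon_entropy(equal_width_bins(col, n_bins))
               for col in columns.values())


# Toy table before and after adding a generated feature (hypothetical values).
before = {"a": [1.0, 1.0, 1.0, 1.0]}
after = dict(before, b=[0.0, 1.0, 2.0, 3.0])
gain = dataset_entropy(after) - dataset_entropy(before)
print(gain)
```

Under this aggregation, a constant column contributes nothing and a column spread evenly over the bins contributes the most. Note that a per-column sum ignores redundancy between columns, so it can overstate the information added by a feature that is a simple function of existing ones.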

Case Study. In this case study, we present how our approach adds new features to the Global Country Information Dataset. Here, RAFG generates five new features, together with their feature labels and data descriptions, and adds them to the dataset. RAFG also generates the corresponding calculation methods and explanations for the new features. We demonstrate the changes in model performance and information gain resulting from these additions below.

In Table 4, we demonstrate the feature information and data description before and after RAFG, and in Table 5 we showcase the detailed information of newly generated features in the order of generation along with the functions and meanings. We first compare the changes in accuracy for downstream tasks and information gain when generating new features. Figure 5 shows that each new feature improves the metrics for downstream tasks and adds extra information to the dataset. The RAFG increases the dimensionality of the dataset’s feature space by incorporating external information. Thus, the model in a downstream task can better capture the complex relationships between features to improve task performance.

Table 4: Comparison of the Global Country Information Dataset [1] before and after applying RAFG. The blue highlighted text is the new descriptive information generated by the LLM.

| Dataset | Number of Features | Data Description |
|---|---|---|
| Original | 35 | This comprehensive dataset provides a wealth of information about all countries worldwide, …, enabling in-depth analyses and cross-country comparisons. |
| New | 40 (+5) | This comprehensive dataset provides a wealth of information about all countries worldwide, …, enabling in-depth analyses and cross-country comparisons. Newly added to this dataset are five key variables designed to deepen insights into economic pressures, population density, resource utilization, educational investments, and environmental stress. |
Table 5: Features generated by RAFG for the Global Country Information dataset.
Label | Population Load Ratio
Calculation | $\frac{\text{Population}}{\text{Land Area (Km}^2\text{)}}$
Reasoning | Represents population density per square kilometer, indicating the population load of a country.
Label | Resource Utilization Rate
Calculation | $\frac{\text{Agricultural Land (\%)} + \text{Forested Area (\%)}}{100}$
Reasoning | Measures the proportion of land used for agriculture and forestry; higher values indicate greater resource use.
Label | Education Investment Effectiveness
Calculation | $\frac{\text{Gross Primary Enrollment (\%)} + \text{Gross Tertiary Enrollment (\%)}}{2}$
Reasoning | Reflects investment in primary and higher education, typically linked to better economic development.
Label | Environmental Stress Index
Calculation | $\frac{\text{CO}_2\ \text{Emissions}}{\frac{\text{Forested Area (\%)}}{100} \times \text{Land Area (Km}^2\text{)}}$
Reasoning | Indicates environmental stress by measuring CO2 emissions per unit of forest area. Higher values show greater pressure.
Label | GDP per Capita
Calculation | $\frac{\text{GDP}}{\text{Population}}$
Reasoning | Represents a country’s economic level and standard of living.
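To make the calculations in Table 5 concrete, the sketch below computes the five features for a single country record. The dictionary keys follow the column names as they appear in Table 5 and may not match the raw dataset headers exactly; the helper names `safe_div` and `rafg_country_features`, the handling of zero denominators, and the example values are hypothetical choices for illustration, not part of RAFG itself.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def rafg_country_features(row):
    """Compute the five Table 5 features for one country record (a dict).

    Key names are assumed from Table 5 and may differ from the raw dataset.
    """
    # Forest area in square kilometers, derived from its share of land area.
    forest_km2 = row["Forested Area (%)"] / 100 * row["Land Area (Km2)"]
    return {
        "Population Load Ratio": safe_div(row["Population"], row["Land Area (Km2)"]),
        "Resource Utilization Rate": (
            row["Agricultural Land (%)"] + row["Forested Area (%)"]
        ) / 100,
        "Education Investment Effectiveness": (
            row["Gross Primary Enrollment (%)"] + row["Gross Tertiary Enrollment (%)"]
        ) / 2,
        "Environmental Stress Index": safe_div(row["CO2 Emissions"], forest_km2),
        "GDP per Capita": safe_div(row["GDP"], row["Population"]),
    }


# Hypothetical record; the numbers are illustrative, not taken from the dataset.
example = {
    "Population": 1_000_000,
    "Land Area (Km2)": 50_000,
    "Agricultural Land (%)": 40.0,
    "Forested Area (%)": 20.0,
    "Gross Primary Enrollment (%)": 100.0,
    "Gross Tertiary Enrollment (%)": 50.0,
    "CO2 Emissions": 5_000.0,
    "GDP": 2.0e10,
}
features = rafg_country_features(example)
for label, value in features.items():
    print(f"{label}: {value}")
```

Returning None for a zero denominator keeps undefined ratios (for example, a country with no forested area) distinguishable from genuine zeros; a downstream pipeline would then need to impute or drop these values.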
Figure 5: Accuracy variation during the RAFG feature generation process and the information gain from new features for the Global Country Information Dataset. The downstream task is Decision Tree classification.
Figure 6: Heatmap of the correlations between the newly generated features (right) and the original features (left).

Correlation. We further explore the correlation between the new features generated by RAFG and the existing features in the Global Country Information Dataset. Figure 6 shows a detailed correlation analysis between the newly generated features and several key original features. The analysis reveals deeper connections between the new features and critical socio-economic indicators, which can enhance the model’s ability to capture the underlying structure of the data. For example, the high correlation of the population load ratio with land area and urbanization rate highlights issues of population density distribution, and the connection between the resource utilization rate and the environmental stress index reflects the close relationship between resource management and environmental protection. These correlation results support the soundness of the new features and the effectiveness of our RAFG method.
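A correlation analysis of this kind can be sketched as follows. The correlation measure used for Figure 6 is not specified here, so the Pearson coefficient is an assumption, as are the function names `pearson` and `cross_correlation` and the toy columns.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return float("nan")  # undefined for a constant column
    return cov / (norm_x * norm_y)


def cross_correlation(new_columns, original_columns):
    """Correlation of every generated feature with every original feature.

    Both arguments map feature name -> list of values over the same rows.
    """
    return {
        (new_name, orig_name): pearson(new_vals, orig_vals)
        for new_name, new_vals in new_columns.items()
        for orig_name, orig_vals in original_columns.items()
    }


# Hypothetical toy columns over four countries.
original_columns = {
    "Population": [1.0, 2.0, 3.0, 4.0],
    "Land Area (Km2)": [10.0, 10.0, 20.0, 20.0],
}
new_columns = {
    "Population Load Ratio": [
        p / a
        for p, a in zip(
            original_columns["Population"], original_columns["Land Area (Km2)"]
        )
    ],
}

for pair, r in cross_correlation(new_columns, original_columns).items():
    print(pair, round(r, 3))
```

The resulting dictionary can be arranged into a matrix with generated features on one axis and original features on the other, which is the layout a heatmap such as Figure 6 would display.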

6 Conclusion

In this paper, we introduce RAFG, a novel method that uses an LLM for text-informed feature generation. Our goal is to enrich data by effectively using textual information to generate domain-specific features. RAFG leverages an LLM to extract and integrate textual information and to retrieve relevant, reliable information for generating new features. To improve the correctness of generation, we adopt RAG with external knowledge so that the generated features are reliable and precise. These features significantly enhance machine learning models without high resource costs or domain-specific fine-tuning. The experimental results demonstrate that RAFG outperforms existing methods in generating meaningful features and enhancing performance across various domains. In addition, RAFG’s automated and adaptable framework can evolve with new data and shows great potential for broader use in various fields. In future work, we plan to refine the RAFG paradigm and extend it to more feature engineering methods.

References

  • [1] Countries of the world 2023. Kaggle dataset (2023)
  • [2] Arik, S.Ö., Pfister, T.: Tabnet: Attentive interpretable tabular learning. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 6679–6687 (2021)
  • [3] Banerjee, S.: Animal information dataset. Kaggle dataset (2023)
  • [4] De Ville, B.: Decision trees. Wiley Interdisciplinary Reviews: Computational Statistics 5(6), 448–455 (2013)
  • [5] Guo, H., Tang, R., Ye, Y., Li, Z., He, X.: Deepfm: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247 (2017)
  • [6] Hollmann, N., Müller, S., Hutter, F.: Large language models for automated data science: Introducing caafe for context-aware automated feature engineering. Advances in Neural Information Processing Systems 36 (2024)
  • [7] Hu, Y., Lu, Y.: Rag and rau: A survey on retrieval-augmented language model in natural language processing. arXiv preprint arXiv:2404.19543 (2024)
  • [8] Huang, X., Khetan, A., Cvitkovic, M., Karnin, Z.: Tabtransformer: Tabular data modeling using contextual embeddings. arXiv preprint arXiv:2012.06678 (2020)
  • [9] Huang, Y., Huang, J.: A survey on retrieval-augmented text generation for large language models. arXiv preprint arXiv:2404.10981 (2024)
  • [10] Katz, G., Shin, E.C.R., Song, D.: Explorekit: Automatic feature generation and selection. In: 2016 IEEE 16th International Conference on Data Mining. pp. 979–984 (2016). https://doi.org/10.1109/ICDM.2016.0123
  • [11] Khurana, U., Samulowitz, H., Turaga, D.: Feature engineering for predictive modeling using reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)
  • [12] Khurana, U., Turaga, D., Samulowitz, H., Parthasarathy, S.: Cognito: Automated feature engineering for supervised learning. In: 2016 IEEE 16th international conference on data mining workshops. pp. 1304–1307. IEEE (2016)
  • [13] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., et al.: Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33, 9459–9474 (2020)
  • [14] Pan, T., Chen, J., Xie, J., Zhou, Z., He, S.: Deep feature generating network: A new method for intelligent fault detection of mechanical systems under class imbalance. IEEE Transactions on Industrial Informatics 17(9), 6282–6293 (2020)
  • [15] Rigatti, S.J.: Random forest. Journal of Insurance Medicine 47(1), 31–39 (2017)
  • [16] Sakar, C., Serbes, G., Gunduz, A., Nizam, H., Sakar, B.: Parkinson’s disease classification. UCI Machine Learning Repository (2018)
  • [17] Schölkopf, B., Locatello, F., Bauer, S., Ke, N.R., Kalchbrenner, N., Goyal, A., Bengio, Y.: Toward causal representation learning. Proceedings of the IEEE 109(5), 612–634 (2021)
  • [18] Severyn, A., Moschitti, A.: Automatic feature engineering for answer selection and extraction. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pp. 458–467 (2013)
  • [19] Shi, H., Li, H., Zhang, D., Cheng, C., Cao, X.: An efficient feature generation approach based on deep learning and feature selection techniques for traffic classification. Computer Networks 132, 81–98 (2018)
  • [20] Song, W., Shi, C., Xiao, Z., Duan, Z., Xu, Y., Zhang, M., Tang, J.: Autoint: Automatic feature interaction learning via self-attentive neural networks. In: Proceedings of the 28th ACM international conference on information and knowledge management. pp. 1161–1170 (2019)
  • [21] Teboul, A.: Diabetes health indicators dataset. Kaggle dataset (2023)
  • [22] Tekkesinoglu, S., Kunze, L.: From feature importance to natural language explanations using llms with rag. arXiv preprint arXiv:2407.20990 (2024)
  • [23] Wang, D., Fu, Y., Liu, K., Li, X., Solihin, Y.: Group-wise reinforcement feature generation for optimal and explainable representation space reconstruction. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 1826–1834 (2022)
  • [24] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022)
  • [25] Yang, K., Swope, A., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R.J., Anandkumar, A.: Leandojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems 36 (2024)
  • [26] Zhang, T., Patil, S.G., Jain, N., Shen, S., Zaharia, M., Stoica, I., Gonzalez, J.E.: Raft: Adapting language model to domain specific rag. arXiv preprint arXiv:2403.10131 (2024)
  • [27] Zhang, X., Wang, Z., Jiang, L., Gao, W., Wang, P., Liu, K.: Tfwt: Tabular feature weighting with transformer (2024)
  • [28] Zhong, G., Wang, L.N., Ling, X., Dong, J.: An overview on data representation learning: From traditional feature learning to recent deep learning. The Journal of Finance and Data Science 2(4), 265–278 (2016)