1 email: {xinhaoz, jinghanz, kunpeng}@pdx.edu
2 University of Montreal, email: [email protected]
3 Visa Research, email: [email protected]
Retrieval-Augmented Feature Generation for Domain-Specific Classification
Abstract
Feature generation can significantly enhance learning outcomes, particularly for tasks with limited data. An effective way to improve feature generation is by expanding the current feature space using existing features and enriching the informational content. However, generating new, interpretable features in application fields often requires domain-specific knowledge about the existing features. This paper introduces a new method RAFG for generating reasonable and explainable features specific to domain classification tasks. To generate new features with interpretability in domain knowledge, we perform information retrieval on existing features to identify potential feature associations, and utilize these associations to generate meaningful features. Furthermore, we develop a Large Language Model (LLM)-based framework for feature generation with reasoning to verify and filter features during the generation process. Experiments across several datasets in medical, economic, and geographic domains show that our RAFG method produces high-quality, meaningful features and significantly improves classification performance compared with baseline methods.
Keywords: Feature generation · Information retrieval · Large language models
1 Introduction
In domain-specific applications, e.g., disease classification and insurance claim prediction, data are usually scarce and have few features, so a machine learning model trained on them often cannot reach satisfactory performance. A common practice is to generate new features to support the model's decision-making. However, manual feature generation relies on domain knowledge, while automatic feature generation lacks interpretability. Moreover, existing studies usually focus only on the data structure [5, 20, 19]. Although efficient, these approaches lack transparency and are hard to explain. In addition, domain-specific knowledge resources are expensive to obtain.

An intuitive solution is to utilize the information and associations of existing features, which requires extracting and understanding the relationships between relevant features. This is non-trivial, since it requires domain knowledge as a reference and the ability to generate new features on top of it, which is challenging for a model without specific fine-tuning. Although fine-tuning a language model with domain-specific data can inject domain knowledge and enhance performance, fine-tuning is costly and limits the model’s generalizability across domains. An alternative is to retrieve useful information from an external knowledge base as a supporting reference, and then use a generator model with a reasoning mechanism, e.g., an LLM, to integrate the original features with the corresponding supporting knowledge and generate the new feature. This follows the principle of retrieval-augmented generation (RAG) [7], which we apply to domain-specific feature generation. On the one hand, the retrieved knowledge serves as an explicit reference for interpretability. On the other hand, the new feature is generated only from existing features, which reduces the risk of injecting noise and makes efficient use of the limited available information.
As shown in Figure 1, the goal is to detect heart disease from general features. A more concrete indicator is BMI, which can be generated from the original features Height and Weight. Generating this feature is possible with the support of external knowledge about BMI, which motivates retrieval-augmented feature generation. Specifically, we first deploy an LLM to capture the relevance between each pair of features based on the description of the domain dataset. Then, the LLM forms a query to retrieve relevant knowledge for the selected features and generates the new feature from the selected features together with the retrieved results. We expect the LLM to extract the vital information needed to generate a new feature from each pair of original features, enriching the data attributes and thereby improving performance on a domain-specific downstream task.
Our Targets. We aim to address three main challenges in generating features for domain-specific scenarios: 1) how to extract and utilize useful information by understanding the relations among features, 2) how to use retrieval-augmented generation based on LLMs to generate useful and explainable features with a reasoning procedure, and 3) how to form and select an optimal feature space from the original and generated features.
Our Approach. To address these challenges, we propose a novel LLM-based Retrieval-Augmented Feature Generation (RAFG) method, which generates features from retrieved knowledge about existing features and the reasoning ability of LLMs. Specifically, 1) we instruct an LLM to analyze and capture the implicit patterns within the structured data and the textual information, e.g., dataset descriptions and feature labels. With its reasoning capabilities, the LLM provides a logical process to derive the relations among the features and generates a query for subsequent domain-specific retrieval; 2) we apply RAG with the generated query to identify relevant references from a reliable knowledge base. The search results are combined with the selected original features and fed to the LLM, which then generates a new feature through its reasoning and generation capacity in a training-free manner; 3) we design an automated feature generation mechanism that iteratively refines the feature space. Each new feature generated by the LLM is automatically integrated into the data attributes. We then adopt the validated features to update the feature space until a predefined number of iterations is reached or a determined optimal feature space is achieved. The iteratively generated features are expected to enhance domain-specific interpretability by aligning with the needs of the associated fields.
In summary, our contribution includes:
1. We introduce a novel LLM-based retrieval-augmented feature generation method RAFG for domain-specific scenarios, which mines and utilizes the potential relationships of existing features to generate new features.
2. The RAFG framework is training-free and can be adapted for any specific domain. The generated feature is explainable according to the retrieved reference and validated for enriching the feature space by the automatic evaluation based on LLMs.
3. Experimental results on different domain datasets demonstrate the effectiveness and robustness of our RAFG across various downstream tasks. Our analysis confirms the effects of retrieval-augmented generated features.
2 Related Work
2.1 Automated Feature Generation
Feature Engineering. Feature engineering is an essential part of machine learning. It includes selecting, modifying, or creating new features from raw data to improve task performance [18]. The goal of this process is to optimize the feature representation space by tailoring the best and most interpretable feature set to a specific machine learning task, which improves model accuracy and generalization. Within feature engineering, feature generation creates new features from the existing ones in a dataset, usually through mathematical or logical transformation operations [5, 20], with the aim of constructing complex latent feature spaces [28, 17]. However, existing feature generation methods often lack transparency and require a large amount of manual work.
Automated Feature Generation. With the advancement of deep learning technologies and LLMs, automated feature generation has seen significant development and application [14, 23]. Researchers have developed various automated feature engineering methods, such as ExploreKit [10] and Cognito [12]. These methods increase efficiency in processing large datasets and reduce the need for manual intervention. However, they often focus on the structural and numerical information of the data table and overlook the semantic and textual information. In addition, the “black box” operations of deep-learning-based feature generation make the generated features hard to explain and validate.
2.2 Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) [13] is a technique that combines information retrieval with generative language models to improve the final output. RAG effectively assists LLMs with tasks that require extensive and specific domain knowledge [26, 9]. Given a domain-specific task, the LLM can access a large external library, which might be a set of documents or knowledge related to the task [7]. The LLM then generates a query from the task information and uses it to search the library based on the similarity between the query and the candidate documents. This approach helps the LLM produce responses that are more accurate and domain-relevant, with fewer hallucinations [25]. However, existing RAG approaches for feature generation lack a reasoning process [22, 26, 13]. To generate interpretable features, the LLM needs to analyze the relevance and rationality of the retrieved information and the generated features. The reasoning process validates the effectiveness and explainability of the new features with respect to the domain knowledge.
3 Problem Statement
We formulate the task as searching for potential features to enrich and reconstruct an optimal and explainable feature representation space that advances a downstream task, such as classification or regression. Concretely, we denote the original tabular dataset as $\mathcal{D} = \{F, y, C\}$, which includes an original feature set $F = \{f_1, \dots, f_n\}$ of $n$ features, its target label $y$, and textual information $C$ comprising the feature labels and the data description. Our optimization objective is to automatically generate new features, based on the retrieved knowledge and a reasoning procedure, that reconstruct an optimal feature set $F^{*}$:
$$F^{*} = \arg\max_{\hat{F}} \; \theta_{T}(\hat{F}, y) \tag{1}$$
where $T$ is a domain-specific downstream task (e.g., predicting life expectancy), $\theta_{T}$ is the performance indicator of $T$, and $\hat{F}$ is an optimized feature set reconstructed on $F$.
4 Methodology

In this section, we introduce our feature generation method, Retrieval Augmented Feature Generation (RAFG), which generates features dynamically and automatically by using a large language model (LLM) to mine the text information of a dataset, including feature labels and data descriptions. Within this framework, the LLM acts as both a text miner and a feature generator: it embeds the text information of the dataset and existing features and conducts retrieval based on it. Then, the LLM generates new features according to the retrieved external knowledge. The model evaluates the quality and reliability of the newly generated features to optimize and update the feature set. The overview of the RAFG framework is shown in Figure 2, which includes three stages: (1) Query Generation and Domain Knowledge Retrieval; (2) Retrieval-Augmented Feature Generation with Reasoning; and (3) Feature Update Determination by LLM.
4.1 Query Generation and Domain Knowledge Retrieval
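In this stage, our objective is to identify and extract potential new features, together with the methods for generating them, that could enhance the performance of downstream tasks. First, at iteration $t$, we collect the textual information of the current feature set $F_t$, including the goal of the downstream task $T$ and the textual information $C_t$, to construct a query $q_t$ through a pre-trained LLM $\mathcal{M}_{\phi}$:
$$q_t = \mathcal{M}_{\phi}(F_t, T, C_t) \tag{2}$$
where $\mathcal{M}_{\phi}$ denotes the LLM with parameters $\phi$, and $t$ is the number of iterations. After integration, the LLM embeds the query:
$$\mathbf{e}_{q_t} = E(q_t) \tag{3}$$
where $E(\cdot)$ is the embedding process that transforms the input into a vector representation. We then search the whole library $\mathcal{L}$ for the top-$k$ documents most relevant to the query $q_t$. We evaluate the relevance of each document $d \in \mathcal{L}$ to the query by the cosine similarity of their embeddings and select the $k$ documents with the highest similarity:
$$\mathcal{R}_t = \operatorname{Top}_{k,\, d \in \mathcal{L}} \; \frac{\mathbf{e}_{q_t} \cdot \mathbf{e}_{d}}{\lVert \mathbf{e}_{q_t} \rVert \, \lVert \mathbf{e}_{d} \rVert} \tag{4}$$
where $\mathbf{e}_{d} = E(d)$ and $\lVert \cdot \rVert$ is the norm function.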
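The similarity-based selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the document ids, and the three-dimensional toy embeddings are hypothetical, and a real system would obtain the vectors from an embedding model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def top_k_documents(query_emb, doc_embs, k):
    """Return the ids of the k documents most similar to the query.

    doc_embs maps a document id to its embedding vector.
    """
    scored = [(cosine_similarity(query_emb, emb), doc_id)
              for doc_id, emb in doc_embs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy embeddings (hypothetical values, not produced by a real model).
library = {
    "body_mass_index": [0.9, 0.1, 0.0],
    "blood_pressure": [0.5, 0.5, 0.0],
    "land_area": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
retrieved = top_k_documents(query, library, k=2)
print(retrieved)
```

With these toy vectors, the query is closest to the first two documents, so they form the retrieved set for the next stage.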
The retrieved set $\mathcal{R}_t$ consists of $k$ documents $\{d_1, \dots, d_k\}$, which contain information about potential features $f'$ that can be obtained through specific calculations, combinations, or judgments on the existing features in the current feature set $F_t$:
$$f' = \mathrm{op}(F_t) \tag{5}$$
where $\mathrm{op}$ denotes a calculation, combination, or judgment action on the features in $F_t$. Specifically, we define the operations as follows:
• Calculation involves arithmetic transformations, statistical measures, or algorithmic functions applied to derive $f'$. In our case, we define a calculator for features as a function $g_{\mathrm{calc}}$ applied to a vector of existing features, resulting in a scalar or vector output as the new feature.
• Combination involves integrating two or more features into a single new feature through methods such as concatenation, averaging, or more complex fusion techniques. In our case, we define a calculator for combination as a function $g_{\mathrm{comb}}: \mathbb{R}^{a} \times \mathbb{R}^{b} \to \mathbb{R}^{c}$, where $a$, $b$, and $c$ are the dimensions of the input features and the resulting feature, respectively.
• Judgment involves logical or rule-based decision-making processes that integrate domain knowledge or empirical rules to form a new feature. In our case, we define a calculator for judgment as a decision function $g_{\mathrm{judge}}: \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ is the input space composed of one or more feature vectors, and the output is a new categorical feature based on predefined criteria.
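The three operation types above can be illustrated with a short Python sketch built around the BMI example from the introduction. The function names, the sample values, and the choice of blood-pressure columns for the combination are hypothetical and serve only as an illustration; the BMI cut-offs follow the commonly used WHO adult categories.

```python
def calculate_bmi(weight_kg, height_m):
    """Calculation: an arithmetic transformation of two existing features."""
    return [w / (h ** 2) for w, h in zip(weight_kg, height_m)]

def combine_mean(feature_a, feature_b):
    """Combination: fuse two features into one by element-wise averaging."""
    return [(a + b) / 2 for a, b in zip(feature_a, feature_b)]

def judge_bmi_category(bmi_values):
    """Judgment: map a numeric feature to a categorical one with fixed rules."""
    def category(bmi):
        if bmi < 18.5:
            return "underweight"
        if bmi < 25.0:
            return "normal"
        if bmi < 30.0:
            return "overweight"
        return "obese"
    return [category(b) for b in bmi_values]

# Hypothetical sample records.
weight_kg = [50.0, 70.0, 95.0]
height_m = [1.80, 1.75, 1.70]
systolic = [110.0, 120.0, 140.0]
diastolic = [70.0, 80.0, 90.0]

bmi = calculate_bmi(weight_kg, height_m)           # new numeric feature
avg_pressure = combine_mean(systolic, diastolic)   # simple average, illustrative only
bmi_category = judge_bmi_category(bmi)             # new categorical feature
print(bmi_category)
```

Each function returns a column of the same length as its inputs, so the result can be appended to the data table as a new feature.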
4.2 Retrieval-Augmented Feature Generation with Reasoning
In this stage, our objective is to generate a new feature and align it with the existing features in the data table. Now we have introduced external knowledge about features through RAG. This knowledge includes the deeper structural details and relationships between features, which are not directly reflected in the original dataset. Here we adopt the LLM to analyze and extract feature structures and relationships, and further generate new features to expand the feature space.
From the retrieved documents $\mathcal{R}_t$ with potential features $f'$, the LLM first assesses these potential features and then ranks them by their potential to enhance the performance of downstream tasks. To support this evaluation, we employ a “Chain of Thought” [24] strategy that encourages the LLM to formulate, step by step, logical hypotheses and reasoning about the possible outcomes of integrating each potential feature into the data table.
The LLM then selects the most promising document $d^{*}$ from these candidates and generates a new feature $f_{\mathrm{new}}$ according to $d^{*}$. We integrate $f_{\mathrm{new}}$ into the data table:
$$f_{\mathrm{new}} = \mathrm{op}_{d^{*}}(F_t) \tag{6}$$
$$\tilde{F}_t = F_t \cup \{f_{\mathrm{new}}\} \tag{7}$$
Here, $f_{\mathrm{new}}$ is generated by the calculation, combination, or judgment method $\mathrm{op}_{d^{*}}$ described in document $d^{*}$, applied to the existing features $F_t$, and has the same length as the existing features.
Finally, we feed the candidate feature set $\tilde{F}_t$ into the downstream task $T$ and obtain the performance metric $\theta_{T}(\tilde{F}_t)$, which is then fed back to the LLM for performance evaluation.
4.3 Feature Update Determination by LLM
In this stage, our objective is to test the effectiveness of the newly generated feature and decide whether to keep it. We decide whether to keep the feature by the improvement of the performance metric $\theta_{T}$ of the downstream task $T$, where a higher value of $\theta_{T}$ indicates better performance. With $F_t$ the current feature set, $f_{\mathrm{new}}$ the new feature, and $\tilde{F}_t = F_t \cup \{f_{\mathrm{new}}\}$ the candidate feature set:
$$F_{t+1} = \begin{cases} \tilde{F}_t, & \text{if } \theta_{T}(\tilde{F}_t) > \theta_{T}(F_t) \\ F_t, & \text{otherwise} \end{cases} \tag{8}$$
When adding the new feature $f_{\mathrm{new}}$ improves the downstream task performance of the data table, we formally adopt $f_{\mathrm{new}}$ as a new feature within the table. The LLM updates the dataset text information $C_t$ to incorporate the information related to $f_{\mathrm{new}}$. The feature generation process expands the dimensionality of the feature space. As the feature space expands, the model can understand and interpret the data more comprehensively, further improving the performance of downstream tasks and enhancing generalization capabilities.
After that, we generate new queries to search for more potential features with the updated data table $F_{t+1}$ and the updated data description $C_{t+1}$. This iterative process continues until a predefined maximum number of iterations $t_{\max}$ is reached, or the optimal feature set $F^{*}$ is found when the downstream task reaches peak performance.
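The keep-or-discard loop described in this stage can be sketched as a greedy procedure. The code below is a simplified illustration under explicit assumptions: `propose` stands in for the LLM and retrieval steps, `evaluate` stands in for training and scoring the downstream model, and all names, data values, and the toy threshold scorer are hypothetical.

```python
def greedy_feature_update(features, propose, evaluate, max_iters):
    """Iteratively add proposed features, keeping each only if the score improves.

    features: dict mapping feature name -> list of values (current feature set)
    propose:  callable(features, step) -> (name, values), or None when exhausted
    evaluate: callable(features) -> float, higher is better
    """
    current = dict(features)
    best_score = evaluate(current)
    for step in range(max_iters):
        proposal = propose(current, step)
        if proposal is None:
            break
        name, values = proposal
        candidate = {**current, name: values}
        score = evaluate(candidate)
        if score > best_score:  # keep the feature only on strict improvement
            current, best_score = candidate, score
    return current, best_score

labels = [0, 0, 1, 1]
table = {"weight": [50.0, 90.0, 70.0, 100.0], "height": [1.6, 2.0, 1.5, 1.8]}

def evaluate(features):
    """Toy scorer: best accuracy of a 'value above column mean' rule."""
    best = 0.0
    for values in features.values():
        mean = sum(values) / len(values)
        preds = [1 if v > mean else 0 for v in values]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        best = max(best, acc)
    return best

# Fixed candidate list standing in for LLM-generated features.
CANDIDATES = [
    ("bmi", lambda f: [w / h ** 2 for w, h in zip(f["weight"], f["height"])]),
    ("weight_plus_height", lambda f: [w + h for w, h in zip(f["weight"], f["height"])]),
]

def propose(features, step):
    if step >= len(CANDIDATES):
        return None
    name, build = CANDIDATES[step]
    return name, build(features)

final_features, final_score = greedy_feature_update(table, propose, evaluate, max_iters=5)
print(sorted(final_features), final_score)
```

In this toy run the first candidate raises the score and is kept, while the second brings no improvement and is discarded, which mirrors the acceptance rule of this stage.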
5 Experiments
In this section, we present four experiments to evaluate the effectiveness and impacts of the RAFG. First, we compare the performance of the RAFG against several baseline methods on four downstream tasks. Second, we present the information gain during feature generation. Then, we showcase the relationship between new features and existing features. Finally, we discuss the reason for the improvement of performance.
5.1 Experiments Settings
Datasets. We evaluate the RAFG method on four real-world datasets, including Parkinson’s Disease Classification (PDC) [16], Animal Information Dataset (AID) [3], Global Country Information Dataset 2023 (GCI) [1] and Diabetes Health Indicators Dataset (DIA) [21]. The detailed information is shown in Table 1.
Datasets | Samples | Features | Class | Target |
---|---|---|---|---|
PDC | 756 | 754 | 2 | Parkinson Binary |
AID | 205 | 15 | 10 | Social Structure |
GCI | 195 | 34 | 4 | Life Expectancy |
DIA | 253680 | 21 | 2 | Diabetes Binary |
Metrics. We evaluate model performance with the following metrics. Overall Accuracy (Acc) measures the proportion of correct predictions over the whole dataset. Precision (Prec) is the ratio of true positive predictions to all positive predictions for each class. Recall (Rec), also known as sensitivity, is the ratio of true positive predictions to all actual positives for each class. F-Measure (F1) is the harmonic mean of precision and recall, providing a single score that balances both metrics.
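For concreteness, the four metrics can be computed from label lists as in the following sketch. The function name and the toy labels are hypothetical, and how the per-class scores are averaged into a single reported number is not specified here, so the sketch returns them per class.

```python
def per_class_metrics(y_true, y_pred):
    """Return overall accuracy and per-class precision, recall, and F1."""
    classes = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    report = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[c] = {"precision": precision, "recall": recall, "f1": f1}
    return accuracy, report

# Hypothetical three-class example.
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2, 2]
accuracy, report = per_class_metrics(y_true, y_pred)
print(accuracy, report[0])
```

Undefined ratios (a class that is never predicted or never occurs) are set to 0.0 in this sketch, which is one common convention.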
Baseline Models and Methods. We apply RAFG to a range of classification models, including Random Forests (RF) [15], Decision Tree (DT) [4], TabTransformer (TT) [8], and TabNet (TN) [2], and compare the performance of each model with and without RAFG. We compare RAFG with several baselines, including (1) raw data without feature generation (Raw), (2) the Least Absolute Shrinkage and Selection Operator (Lasso) [27], (3) Feature Engineering for Predictive Modeling using Reinforcement Learning (RL) [11], and (4) Context-Aware Automated Feature Engineering (CAAFE) [6].
Implementation Details. In our implementation for feature generation, we use GPT-4o (platform.openai.com) as the query generator, paired with Google’s Custom Search JSON API (developers.google.com/custom-search/) to perform the retrieval tasks. We employ Wikipedia (en.wikipedia.org/) as the external knowledge library for feature generation. This setup involves conducting multiple searches within a -second window for each query and narrowing down the search results to the top three most relevant documents. GPT-4o then selects the most suitable document for feature generation from these. For our baseline models, Lasso regression is configured with a maximum of iterations, a tolerance of , alpha values linearly spaced for regularization strength optimization, an epsilon of to fine-tune the range, and -fold cross-validation for robustness. The RL setup includes a batch size of , a learning rate of , an epsilon of for exploration, a gamma of indicating the discount factor, a target network update every iterations, a memory capacity of for the replay buffer, and exploration steps to balance exploration and exploitation effectively. CAAFE is configured with default parameters, using GPT-4o as the LLM, and the number of iterations is set to .
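As a rough illustration of the retrieval call, a request to the Custom Search JSON API can be assembled as below. This is a sketch under assumptions rather than the configuration used in the experiments: the helper name and placeholder credentials are hypothetical, and it assumes the search engine identified by `cx` has been configured to search Wikipedia. The snippet only builds the request URL and does not send it.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

CUSTOM_SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_search_url(query, api_key, engine_id, num_results=3):
    """Build a Custom Search JSON API request URL for the given query.

    key: API key, cx: search engine id, q: query text, num: results to return.
    """
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num_results}
    return f"{CUSTOM_SEARCH_ENDPOINT}?{urlencode(params)}"

# Placeholder credentials; replace with real values before sending a request.
url = build_search_url(
    "body mass index formula",
    api_key="YOUR_API_KEY",
    engine_id="YOUR_ENGINE_ID",
)
print(url)
```

Requesting three results matches the top-three document selection described above; the returned JSON would then be passed to the LLM for document selection.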
5.2 Experimental Results
Overall Performance. In Table 3, we show the overall results of RAFG across four domain-specific datasets.
(1) Compared with baseline methods, RAFG achieves the highest accuracy in every setting. For instance, with Random Forest, RAFG improves accuracy on all four datasets; on the AID dataset, accuracy rises from 0.635 with raw data to 0.750 with RAFG, compared with 0.673 for CAAFE, which is generally the strongest baseline. This shows that enriching the feature space with domain-specific knowledge enhances the model’s understanding of the data and boosts overall performance.
(2) Across different models, RAFG attains the best precision, recall, and F1 in most settings. Notably, on the GCI dataset under DT, RAFG achieves a precision of 0.669, surpassing the highest baseline precision of 0.607. Moreover, RAFG raises the F1 score on GCI under DT from 0.556 with raw data to 0.681. These results demonstrate RAFG’s capability to reduce misclassification effectively.
(3) The generation of new features through RAFG enhances model performance. The addition of features, e.g., from 756 to 770 in the PDC dataset, leads to increases in performance metrics, including accuracy and F1. This result suggests that the new features not only expand the feature set numerically but also enhance the information richness of the dataset, so that the models can capture more complex patterns and relationships within the data.
Dataset | Original | RAFG | Dataset | Original | RAFG |
---|---|---|---|---|---|
PDC | 756 | 770 (+14) | GCI | 34 | 39 (+5) |
AID | 15 | 16 (+1) | DIA | 21 | 23 (+2) |
Searching for relevant documents in the external knowledge library effectively helps RAFG acquire useful information for feature generation. In Figure 3, we compare the performance of RAFG with and without the external knowledge library ("LLM Only" in the figure) across different datasets under DT. The results demonstrate that incorporating the external library consistently enhances the accuracy of RAFG across all datasets.

Metric | Method | RF PDC | RF AID | RF GCI | RF DIA | DT PDC | DT AID | DT GCI | DT DIA | TT PDC | TT AID | TT GCI | TT DIA | TN PDC | TN AID | TN GCI | TN DIA |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Acc | Raw | 0.825 | 0.635 | 0.674 | 0.859 | 0.773 | 0.423 | 0.571 | 0.794 | 0.841 | 0.577 | 0.306 | 0.861 | 0.677 | 0.365 | 0.510 | 0.859 |
Acc | Lasso | 0.847 | 0.654 | 0.735 | 0.860 | 0.788 | 0.442 | 0.592 | 0.796 | 0.857 | 0.615 | 0.327 | 0.862 | 0.709 | 0.423 | 0.531 | 0.860 |
Acc | RL | 0.857 | 0.687 | 0.755 | 0.859 | 0.799 | 0.519 | 0.592 | 0.796 | 0.878 | 0.596 | 0.347 | 0.862 | 0.714 | 0.558 | 0.551 | 0.863 |
Acc | CAAFE | 0.862 | 0.673 | 0.776 | 0.863 | 0.820 | 0.577 | 0.612 | 0.797 | 0.884 | 0.654 | 0.408 | 0.863 | 0.741 | 0.577 | 0.592 | 0.865 |
Acc | RAFG | 0.884 | 0.750 | 0.816 | 0.867 | 0.841 | 0.615 | 0.674 | 0.815 | 0.900 | 0.673 | 0.429 | 0.865 | 0.773 | 0.596 | 0.714 | 0.867 |
Prec | Raw | 0.721 | 0.175 | 0.675 | 0.681 | 0.688 | 0.118 | 0.551 | 0.588 | 0.797 | 0.218 | 0.327 | 0.575 | 0.490 | 0.106 | 0.474 | 0.429 |
Prec | Lasso | 0.753 | 0.188 | 0.690 | 0.684 | 0.755 | 0.150 | 0.607 | 0.590 | 0.797 | 0.237 | 0.307 | 0.599 | 0.519 | 0.129 | 0.525 | 0.684 |
Prec | RL | 0.760 | 0.245 | 0.763 | 0.683 | 0.722 | 0.319 | 0.554 | 0.590 | 0.832 | 0.252 | 0.360 | 0.585 | 0.489 | 0.151 | 0.636 | 0.806 |
Prec | CAAFE | 0.753 | 0.237 | 0.768 | 0.724 | 0.783 | 0.173 | 0.607 | 0.592 | 0.832 | 0.321 | 0.409 | 0.572 | 0.536 | 0.196 | 0.601 | 0.715 |
Prec | RAFG | 0.793 | 0.343 | 0.791 | 0.735 | 0.777 | 0.261 | 0.669 | 0.605 | 0.857 | 0.349 | 0.450 | 0.570 | 0.559 | 0.242 | 0.683 | 0.723 |
Rec | Raw | 0.855 | 0.156 | 0.640 | 0.570 | 0.737 | 0.126 | 0.564 | 0.597 | 0.790 | 0.201 | 0.311 | 0.702 | 0.430 | 0.113 | 0.474 | 0.500 |
Rec | Lasso | 0.818 | 0.194 | 0.720 | 0.575 | 0.727 | 0.141 | 0.594 | 0.599 | 0.820 | 0.224 | 0.333 | 0.696 | 0.583 | 0.101 | 0.533 | 0.538 |
Rec | RL | 0.842 | 0.206 | 0.742 | 0.568 | 0.742 | 0.244 | 0.549 | 0.598 | 0.846 | 0.236 | 0.373 | 0.691 | 0.440 | 0.101 | 0.533 | 0.501 |
Rec | CAAFE | 0.890 | 0.337 | 0.766 | 0.559 | 0.763 | 0.262 | 0.603 | 0.603 | 0.832 | 0.433 | 0.310 | 0.705 | 0.653 | 0.116 | 0.592 | 0.575 |
Rec | RAFG | 0.905 | 0.357 | 0.791 | 0.563 | 0.786 | 0.320 | 0.703 | 0.596 | 0.882 | 0.353 | 0.404 | 0.710 | 0.664 | 0.213 | 0.669 | 0.579 |
F1 | Raw | 0.750 | 0.146 | 0.636 | 0.587 | 0.702 | 0.122 | 0.556 | 0.592 | 0.793 | 0.209 | 0.289 | 0.593 | 0.419 | 0.110 | 0.470 | 0.462 |
F1 | Lasso | 0.776 | 0.180 | 0.679 | 0.594 | 0.738 | 0.137 | 0.599 | 0.594 | 0.808 | 0.228 | 0.303 | 0.622 | 0.476 | 0.105 | 0.526 | 0.539 |
F1 | RL | 0.788 | 0.209 | 0.741 | 0.584 | 0.731 | 0.268 | 0.549 | 0.594 | 0.838 | 0.225 | 0.363 | 0.606 | 0.434 | 0.121 | 0.551 | 0.466 |
F1 | CAAFE | 0.790 | 0.243 | 0.758 | 0.572 | 0.772 | 0.188 | 0.601 | 0.597 | 0.832 | 0.344 | 0.350 | 0.589 | 0.508 | 0.137 | 0.579 | 0.595 |
F1 | RAFG | 0.829 | 0.324 | 0.788 | 0.578 | 0.781 | 0.271 | 0.681 | 0.600 | 0.868 | 0.341 | 0.399 | 0.587 | 0.556 | 0.224 | 0.664 | 0.600 |
Information Gain. We then present the information gain during the feature generation process. We use information entropy to quantify the information in the dataset. Figure 4 shows the information entropy gain for the datasets before and after RAFG. The new features contribute differently to the information entropy. RAFG enhances task performance fundamentally by introducing new information that enriches the dataset’s overall information content.
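One simple way to quantify such a gain is sketched below. It assumes that each column is treated as a discrete variable and that the dataset score is the sum of per-column Shannon entropies; this aggregation, the function names, and the toy columns are assumptions made for illustration and may differ from the exact procedure behind Figure 4.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def dataset_entropy(columns):
    """Sum of per-column entropies (an assumed aggregation for this sketch)."""
    return sum(shannon_entropy(col) for col in columns.values())

# Hypothetical table before and after adding one generated feature.
original = {
    "region": ["A", "A", "B", "B"],
    "income_level": ["low", "low", "low", "low"],
}
augmented = dict(original, density_level=["low", "high", "low", "high"])

gain = dataset_entropy(augmented) - dataset_entropy(original)
print(gain)
```

Continuous features would need to be binned before applying this estimator, and the bin width affects the resulting entropy.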

Case Study. In this case study, we present how our approach adds new features to the Global Country Information Dataset. Here, RAFG generates five new features and adds them to the dataset together with their feature labels and data descriptions. RAFG also generates the corresponding calculation methods and explanations for the new features. We demonstrate the resulting changes in model performance and information gain below.
In Table 4, we demonstrate the feature information and data description before and after RAFG, and in Table 5 we showcase the detailed information of newly generated features in the order of generation along with the functions and meanings. We first compare the changes in accuracy for downstream tasks and information gain when generating new features. Figure 5 shows that each new feature improves the metrics for downstream tasks and adds extra information to the dataset. The RAFG increases the dimensionality of the dataset’s feature space by incorporating external information. Thus, the model in a downstream task can better capture the complex relationships between features to improve task performance.
Dataset | Number of Features | Data Description |
---|---|---|
Original | 35 | This comprehensive dataset provides a wealth of information about all countries worldwide, …, enabling in-depth analyses and cross-country comparisons. |
New | 40 (+5) | This comprehensive dataset provides a wealth of information about all countries worldwide, …, enabling in-depth analyses and cross-country comparisons. Newly added to this dataset are five key variables designed to deepen insights into economic pressures, population density, resource utilization, educational investments, and environmental stress. |
Label | Calculation | Reasoning |
---|---|---|
Population Load Ratio | Population / Land Area (Km2) | Represents population density per square kilometer, indicating the population load of a country. |
Resource Utilization Rate | (Agricultural Land (%) + Forested Area (%)) / 100 | Measures the proportion of land used for agriculture and forestry; higher values indicate greater resource use. |
Education Investment Effectiveness | (Gross Primary Enrollment (%) + Gross Tertiary Enrollment (%)) / 2 | Reflects investment in primary and higher education, typically linked to better economic development. |
Environmental Stress Index | CO2 Emissions / ((Forested Area (%) / 100) × Land Area (Km2)) | Indicates environmental stress by measuring CO2 emissions per unit of forest area. Higher values show greater pressure. |
GDP per Capita | GDP / Population | Represents a country’s economic level and standard of living. |
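The five calculations in Table 5 translate directly into code. The sketch below applies them to a single record; the dictionary keys follow the labels used in Table 5 rather than the exact column names of the dataset file, and the example values are made up for illustration.

```python
def derive_country_features(row):
    """Compute the five derived features of Table 5 for one country record."""
    # Forest area in square kilometers, from the percentage and the land area.
    forest_km2 = row["Forested Area (%)"] / 100 * row["Land Area (Km2)"]
    return {
        "Population Load Ratio": row["Population"] / row["Land Area (Km2)"],
        "Resource Utilization Rate": (
            row["Agricultural Land (%)"] + row["Forested Area (%)"]
        ) / 100,
        "Education Investment Effectiveness": (
            row["Gross Primary Enrollment (%)"] + row["Gross Tertiary Enrollment (%)"]
        ) / 2,
        "Environmental Stress Index": row["CO2 Emissions"] / forest_km2,
        "GDP per Capita": row["GDP"] / row["Population"],
    }

# Hypothetical record, not taken from the dataset.
example = {
    "Population": 10_000_000,
    "Land Area (Km2)": 200_000,
    "Agricultural Land (%)": 40.0,
    "Forested Area (%)": 25.0,
    "Gross Primary Enrollment (%)": 100.0,
    "Gross Tertiary Enrollment (%)": 50.0,
    "CO2 Emissions": 100_000,
    "GDP": 50_000_000_000,
}
derived = derive_country_features(example)
print(derived)
```

Records with zero forested area, land area, or population would need special handling, since the corresponding ratios are undefined.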


Correlation. We further explore the correlation between the new features generated by RAFG and the existing features in the Global Country Information Dataset. Figure 6 shows a detailed correlation analysis between the newly generated features and several key features in the dataset. The analysis reveals deeper connections between the new features and critical socioeconomic indicators, which can enhance the model’s ability to capture the deeper structure of the data. For example, the high correlation of population load ratio with land area and urbanization rate highlights issues of population density distribution. The connection between resource utilization rate and environmental stress index reflects the close relationship between resource management and environmental protection. These correlation results support the soundness of the new features.
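A correlation analysis of this kind can be reproduced with a few lines of code. The sketch below assumes the Pearson coefficient, which is a common default but is not confirmed as the measure behind Figure 6; the function name and the sample values are hypothetical.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for four countries.
land_area = [100.0, 200.0, 400.0, 800.0]
population = [5000.0, 6000.0, 8000.0, 8000.0]
load_ratio = [p / a for p, a in zip(population, land_area)]

r = pearson_correlation(load_ratio, land_area)
print(round(r, 3))
```

The coefficient is undefined when either column is constant, so such columns should be excluded before building a correlation matrix.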
6 Conclusion
In this paper, we introduce RAFG, a novel method that utilizes an LLM for text-informed feature generation. Our goal is to enrich data by effectively using textual information to generate domain-specific features. RAFG leverages an LLM to extract and integrate the textual information and to retrieve relevant and reliable information for generating new features. To ensure the correctness of generation, we adopt RAG with external knowledge to produce reliable and precise features. These features enhance machine learning models without high resource costs or domain-specific fine-tuning. The experimental results demonstrate that RAFG outperforms existing methods in generating meaningful features and enhancing performance across various domains. In addition, RAFG’s automated and adaptable framework continually evolves with new data and shows potential for broader use in various fields. In future work, we plan to refine the RAFG paradigm and extend it to more feature engineering methods.
References
- [1] Countries of the world 2023. Kaggle dataset (2023)
- [2] Arik, S.Ö., Pfister, T.: Tabnet: Attentive interpretable tabular learning. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 6679–6687 (2021)
- [3] Banerjee, S.: Animal information dataset. Kaggle dataset (2023)
- [4] De Ville, B.: Decision trees. Wiley Interdisciplinary Reviews: Computational Statistics 5(6), 448–455 (2013)
- [5] Guo, H., Tang, R., Ye, Y., Li, Z., He, X.: Deepfm: a factorization-machine based neural network for ctr prediction. arXiv preprint arXiv:1703.04247 (2017)
- [6] Hollmann, N., Müller, S., Hutter, F.: Large language models for automated data science: Introducing caafe for context-aware automated feature engineering. Advances in Neural Information Processing Systems 36 (2024)
- [7] Hu, Y., Lu, Y.: Rag and rau: A survey on retrieval-augmented language model in natural language processing. arXiv preprint arXiv:2404.19543 (2024)
- [8] Huang, X., Khetan, A., Cvitkovic, M., Karnin, Z.: Tabtransformer: Tabular data modeling using contextual embeddings. arXiv preprint arXiv:2012.06678 (2020)
- [9] Huang, Y., Huang, J.: A survey on retrieval-augmented text generation for large language models. arXiv preprint arXiv:2404.10981 (2024)
- [10] Katz, G., Shin, E.C.R., Song, D.: Explorekit: Automatic feature generation and selection. In: 2016 IEEE 16th International Conference on Data Mining. pp. 979–984 (2016). https://doi.org/10.1109/ICDM.2016.0123
- [11] Khurana, U., Samulowitz, H., Turaga, D.: Feature engineering for predictive modeling using reinforcement learning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 32 (2018)
- [12] Khurana, U., Turaga, D., Samulowitz, H., Parthasrathy, S.: Cognito: Automated feature engineering for supervised learning. In: 2016 IEEE 16th international conference on data mining workshops. pp. 1304–1307. IEEE (2016)
- [13] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.t., Rocktäschel, T., et al.: Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems 33, 9459–9474 (2020)
- [14] Pan, T., Chen, J., Xie, J., Zhou, Z., He, S.: Deep feature generating network: A new method for intelligent fault detection of mechanical systems under class imbalance. IEEE Transactions on Industrial Informatics 17(9), 6282–6293 (2020)
- [15] Rigatti, S.J.: Random forest. Journal of Insurance Medicine 47(1), 31–39 (2017)
- [16] Sakar, C., Serbes, G., Gunduz, A., Nizam, H., Sakar, B.: Parkinson’s disease classification. UCI Machine Learning Repository (2018)
- [17] Schölkopf, B., Locatello, F., Bauer, S., Ke, N.R., Kalchbrenner, N., Goyal, A., Bengio, Y.: Toward causal representation learning. Proceedings of the IEEE 109(5), 612–634 (2021)
- [18] Severyn, A., Moschitti, A.: Automatic feature engineering for answer selection and extraction. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. pp. 458–467 (2013)
- [19] Shi, H., Li, H., Zhang, D., Cheng, C., Cao, X.: An efficient feature generation approach based on deep learning and feature selection techniques for traffic classification. Computer Networks 132, 81–98 (2018)
- [20] Song, W., Shi, C., Xiao, Z., Duan, Z., Xu, Y., Zhang, M., Tang, J.: Autoint: Automatic feature interaction learning via self-attentive neural networks. In: Proceedings of the 28th ACM international conference on information and knowledge management. pp. 1161–1170 (2019)
- [21] Teboul, A.: Diabetes health indicators dataset. Kaggle dataset (2023)
- [22] Tekkesinoglu, S., Kunze, L.: From feature importance to natural language explanations using llms with rag. arXiv preprint arXiv:2407.20990 (2024)
- [23] Wang, D., Fu, Y., Liu, K., Li, X., Solihin, Y.: Group-wise reinforcement feature generation for optimal and explainable representation space reconstruction. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. pp. 1826–1834 (2022)
- [24] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022)
- [25] Yang, K., Swope, A., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R.J., Anandkumar, A.: Leandojo: Theorem proving with retrieval-augmented language models. Advances in Neural Information Processing Systems 36 (2024)
- [26] Zhang, T., Patil, S.G., Jain, N., Shen, S., Zaharia, M., Stoica, I., Gonzalez, J.E.: Raft: Adapting language model to domain specific rag. arXiv preprint arXiv:2403.10131 (2024)
- [27] Zhang, X., Wang, Z., Jiang, L., Gao, W., Wang, P., Liu, K.: Tfwt: Tabular feature weighting with transformer (2024)
- [28] Zhong, G., Wang, L.N., Ling, X., Dong, J.: An overview on data representation learning: From traditional feature learning to recent deep learning. The Journal of Finance and Data Science 2(4), 265–278 (2016)