GPT in Data Science:
A Practical Exploration of Model Selection
Abstract
There is an increasing interest in leveraging Large Language Models (LLMs) for managing structured data and enhancing data science processes. Despite the potential benefits, this integration raises significant questions about their reliability and decision-making methodologies. Our study highlights the importance of various factors in the model selection process, including the nature of the data, problem type, performance metrics, computational resources, the interpretability-accuracy trade-off, assumptions about the data, and ethical considerations. Our objective is to elucidate and explicitly represent the factors and assumptions guiding GPT-4’s model selection recommendations. We employ a variability model to depict these factors and use toy datasets to evaluate both the model and the implementation of the identified heuristics. By contrasting these outcomes with heuristics from other platforms, we aim to determine the effectiveness and distinctiveness of GPT-4’s methodology. This research advances our comprehension of AI decision-making processes, especially in the realm of model selection within data science, and is directed towards creating AI systems that are more transparent and comprehensible, contributing to a more responsible and efficient practice in data science.
Index Terms:
Generative Pre-trained Transformer (GPT), Machine Learning Model Selection, Heuristic Analysis, Data Science, Variability Model
I Introduction
At the 49th International Conference on Very Large Data Bases (VLDB), a panel led by Halevy et al. [1] posed an important question about the future of Data Science in the context of Large Language Models (LLMs). The growing role of LLMs in tasks such as database querying, query generation, and making inferences, as evidenced in recent studies [2, 3, 4, 5], highlights this evolving landscape. The LangChain library [6] exemplifies this integration, enabling LLMs to work with various computational resources including data connectors, and has attracted significant attention, being utilized by over 30,000 developers in creating context-aware and reasoning applications.
Despite the increasing use of GPT models in various tasks [7], concerns about their reliability and decision-making processes remain, as illustrated in Figure 1. This figure demonstrates an inconsistency in GPT-4’s query results when integrated with the LangChain framework, raising questions about the sources of such discrepancies (e.g. issues with LangChain, GPT-4, or the prompt structure).

Further, examining GPT-4’s analytical capabilities, as shown in Figure 2, reveals challenges in understanding the underlying heuristics of its decisions. For instance, when querying about employees most likely to leave a company, GPT-4 focused on employment duration. This raises questions about the factors considered in its analysis and the potential need for more comprehensive data for improved analysis.

For GPT to be effectively utilized in decision-making within data science, it is essential to assess its explanatory capabilities. This paper concentrates on a critical aspect of data science decision-making: model selection. As indicated by Tavares et al. [8], the selection of machine learning models is influenced by various factors, including data attributes, prediction algorithm types, and requirements such as performance and bias. However, these influencing factors are often not explicitly represented, especially concerning bias detection. This paper delves into the nuances of model selection in Data Science, specifically focusing on the heuristics used by GPT-4. We aim to decipher these heuristics using the framework by Tavares et al. [8], which emphasizes a clear understanding and representation of the factors affecting model selection. Our approach seeks to enhance the transparency and efficiency of decision-making, contributing to a more explainable and effective data science practice.
We compare the heuristics used by GPT-4 with those proposed by Scikit-Learn engineers, as outlined by Tavares et al. [8]. Using toy datasets, we examine GPT-4’s decision-making process in model selection and its alignment with established heuristics. The paper addresses two key research questions: (RQ1) How can we capture and represent GPT’s algorithm selection factors? (RQ2) How do GPT’s heuristic outcomes compare to those of Scikit-Learn engineers?
The paper is organized to offer a thorough understanding of GPT’s role in Data Science. Section 2 provides a background on LLMs and their applications. Section 3 details our methodology for capturing and comparing heuristics. Section 4 presents our experimental findings, and Section 5 concludes the paper with a summary and future research directions.
II Background
II-A LLM and GPT
Large Language Models (LLMs) and Generative Pre-trained Transformers (GPT) are integral parts of AI’s Natural Language Processing (NLP) realm. While LLM is a broad category encompassing models that predict word sequences and can be used for various tasks such as text generation and translation, GPT, developed by OpenAI [9], is a specific LLM type. GPT, renowned for generating text akin to human writing, undergoes extensive pre-training before fine-tuning for specialized tasks. In essence, GPT is a subclass of LLMs, but not all LLMs are GPT models. Other prominent LLM examples include BERT, RoBERTa, and XLNet.
II-B Variability-aware model
Variability modeling, widely explored for developing automated systems, refers to the process of identifying and representing the variabilities—or the ability of a product, item, or feature to change, evolve, or be customized—in order to harness automation opportunities. This concept is crucial in fields like software engineering, databases, and data warehouses, where understanding the common and unique aspects of a system is essential. Feature modeling, a key aspect of variability modeling, involves capturing the common and variable attributes of a system using abstract entities known as features.
In this context, feature models stand out as intuitive and effective tools for representing the features of a variant-rich software system. They not only facilitate an overall understanding of the system but also support various stages of development, such as scoping, planning, development, variant derivation, configuration, and maintenance, thus contributing to the system’s long-term success [10]. Feature models formally represent features, their relationships, and constraints. This approach enhances understandability, traceability, explainability, and maintenance of the system.
In our proposed reverse engineering approach, we use an automated method based on massive neural networks for decision-making. By extracting variability models from these automated systems, we aim to leverage the benefits of variability models—such as traceability, understandability, explainability, and maintenance—in a novel context.
Furthermore, feature diagrams provide a structured, visual method for addressing the complexities in machine learning applications. These diagrams not only highlight the intricate relationship between modeling assumptions and algorithm choices but also underscore their impact on performance and other critical evaluation criteria, including fairness.
II-C LLM applied to Data Science
Large Language Models (LLMs) are increasingly integral to data science, showcasing their potential in diverse areas such as data preprocessing, analytics, and even drug discovery [3]. The study by Chopra et al. [2] highlights the use of LLMs in data science, particularly for tasks like data preprocessing and analytics. However, challenges arise in the interaction between data scientists and LLM-powered chatbots, especially in areas like contextual data retrieval, prompt formulation for complex tasks, and adapting generated code. These insights lead to the proposal of design improvements, including data brushing and inquisitive feedback loops, to enhance AI-assisted data science tools.
Through tools like DataChat AI [4], Anaconda Assistant, and Databricks Assistant, LLMs enable data scientists to engage in conversational interactions with their data. This interaction includes asking follow-up questions and receiving context-specific responses, greatly enhancing the user experience and efficiency in data management and analysis. Finally, Troy et al. [5] focus on enabling generative AI to produce SQL statements. They propose a tool that generates syntactically valid language sentences and integrate AI algorithms for semantic guidance, demonstrating the capability of LLMs in generating structured queries for specific purposes, like detecting cyber-attacks.
These studies collectively underscore the expanding role of LLMs in data science, from enhancing analytics and data management to contributing significantly to fields like drug discovery. However, the challenges, including effective communication with AI assistants and adapting AI-generated solutions to specific contexts, remain crucial areas for further development.
III Approach: Capturing and Representing Heuristics Behind GPT’s Decision-Making Process
In this section, we explore the process of reverse engineering the decision-making heuristics of GPT (Generative Pre-trained Transformer) using both bottom-up and top-down approaches. Our goal is to deduce the underlying principles and factors that GPT considers when making decisions about data science model selection, given a specific dataset. This exploration is critical for understanding the capabilities and limitations of GPT in the context of data science and machine learning. The following subsections detail the methodologies and findings of our investigation.
III-A Top-down Approach
In the top-down approach, we directly posed a question to GPT to elicit its criteria for selecting a data science model based on given dataset characteristics. The question was formulated as: “Given a dataset, which modeling assumptions/factors do you consider to selecting a data science model?” This query aimed to uncover the explicit decision-making criteria used by GPT. However, it’s important to note that the response presented below may not necessarily reflect the actual decision-making process used by GPT internally; it could represent an idealized or generalized answer suitable for the given question.
III-B Bottom-up Approach
To complement the top-down analysis, we employed a bottom-up approach, which involved practical applications of GPT’s decision-making process. We selected three different toy datasets, each with a clearly defined target variable. For each dataset, we asked GPT to recommend the most suitable machine learning model(s) to address the specific problem presented by the dataset. The prompt was structured as follows: “Given the dataset {x}, with the target variable in column {column y}, and with the objective of {problem description z}, identify the most suitable machine learning model(s) to solve this issue. Explain your choice(s) and the underlying modeling assumptions and factors that guided your decision. Outline your decision-making process in detail. If multiple models are viable, rank them in order of preference, and describe the criteria for transitioning from one model to another in the evaluation process.”
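As a minimal illustration, the sketch below shows how this prompt template could be instantiated programmatically. The filled-in values mirror the first case study below, and the placeholder names are adjusted to valid Python identifiers; this helper is our own convenience, not part of the study’s tooling.

```python
# Sketch: instantiating the bottom-up prompt template.
# Placeholder names (x, column_y, problem_description_z) are adapted for Python;
# the filled-in values come from the heart failure case study below.
PROMPT_TEMPLATE = (
    "Given the dataset {x}, with the target variable in column {column_y}, "
    "and with the objective of {problem_description_z}, identify the most "
    "suitable machine learning model(s) to solve this issue. Explain your "
    "choice(s) and the underlying modeling assumptions and factors that "
    "guided your decision. Outline your decision-making process in detail. "
    "If multiple models are viable, rank them in order of preference, and "
    "describe the criteria for transitioning from one model to another in "
    "the evaluation process."
)

prompt = PROMPT_TEMPLATE.format(
    x="heart_failure_clinical_records_dataset.csv",
    column_y="DEATH_EVENT",
    problem_description_z="predicting survival of patients with heart failure",
)
print(prompt)
```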
III-B1 Dataset: Heart Failure Prediction
This application aims to predict patient survival in heart failure scenarios using clinical data [11]. The dataset comprises 299 patient records, each featuring 12 clinical attributes, such as age, sex, diabetes presence, and serum creatinine levels. Additionally, it includes a binary target variable, ‘death_event’, indicating patient mortality within an average follow-up period of 130 days [11].
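As a minimal sketch, the dataset and the balance of its binary target can be inspected as follows, assuming the CSV from [11] is available locally; the uppercase column name follows the published dataset.

```python
import pandas as pd

# Sketch: load the heart failure dataset and inspect the target distribution.
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
print(df.shape)  # expected: (299, 13) -> 12 clinical features plus the target
# The class imbalance of the target motivates the stratified split used later.
print(df["DEATH_EVENT"].value_counts(normalize=True))
```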
Prompt: “Given the attached dataset {heart_failure_clinical_records_dataset.csv}, with the target variable in column {‘death_event’}, and the objective of predicting survival of patients with heart failure based on clinical information, identify the most suitable machine learning model(s) to solve this issue. Explain your choice(s) and the underlying modeling assumptions and factors that guided your decision. Outline your decision-making process in detail. If multiple models are viable, rank them in order of preference, and describe the criteria for transitioning from one model to another in the evaluation process.”
In response, GPT provided the following output:
Based on this response, we extracted a list of factors considered when proposing a ranking of the best-fitting models. Following GPT’s guidance, we developed the feature model shown in Figure 3.

Based on the GPT output and on the feature model we designed for this application, the key constraints influencing feature modeling include:
1. Dataset Constraints:
• DatasetSize = 299 ∧ NumberOfColumns = 13
• Target Variable Type = Binary (DEATH_EVENT)
• FeatureTypes = (Numerical ∧ BinaryCategorical)
2. Algorithm Type Constraints:
• Problem Type = Classification
• Classification Type = Binary
• Algorithm Options = {Logistic Regression, Random Forest, Gradient Boosting Machines, Support Vector Machine}
3. Evaluation Criteria Constraints:
• EvaluationCriteria = (Interpretability ∧ PerformanceMetrics ∧ ComputationalEfficiency ∧ RobustnessToOverfitting)
• PerformanceEvaluation = (Accuracy ∧ Precision ∧ Recall ∧ AUC-ROC)
4. Model-Specific Constraints:
• Logistic Regression → (BinaryClassification ∧ Interpretability ∧ Simplicity)
• Random Forest → (RobustnessToOverfitting ∧ HandlingMixedDataTypes ∧ FeatureImportanceScores ∧ Interpretability)
• Gradient Boosting Machines → (HighPredictiveAccuracy ∧ HandlingMixedDataTypes ∧ ProneToOverfitting)
• Support Vector Machine → (KernelFunctions ∧ HighDimensionalHandling ∧ NonLinearRelationshipModeling ∧ Interpretability ∧ HyperparameterTuning)
5. Computational Efficiency Consideration:
• LimitedComputationalResources → Prefer {Logistic Regression, other simplified models}
6. Ranking and Transition Process Constraints:
• InitialModel = LogisticRegression
• If Logistic Regression underperforms → Consider {Random Forest, Gradient Boosting Machines}
• If robustness to overfitting is insufficient ∧ computational resources are unlimited → Consider Gradient Boosting Machines
• If non-linear relationships are suspected ∧ computational resources are unlimited → Consider Support Vector Machine
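To make the ranking and transition constraints above concrete, the sketch below encodes them as a simple rule-based function. The boolean flags and their ordering reflect our reading of GPT’s output; they are an illustrative assumption, not an artifact produced by GPT itself.

```python
# Sketch: rule-based encoding of the ranking/transition constraints above.
def rank_models(non_linear_suspected, limited_resources, overfitting_risk_high):
    ranking = ["LogisticRegression"]  # initial, most interpretable model
    if limited_resources:
        return ranking  # prefer simple models under resource limits
    ranking.append("RandomForest")  # robust to overfitting, handles mixed types
    if not overfitting_risk_high:
        ranking.append("GradientBoostingMachines")  # higher accuracy, prone to overfitting
    if non_linear_suspected:
        ranking.append("SupportVectorMachine")  # kernels for non-linear relationships
    return ranking

print(rank_models(non_linear_suspected=True,
                  limited_resources=False,
                  overfitting_risk_high=False))
```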
III-B2 Dataset: Diabetes Prediction
The Diabetes Prediction dataset [12], utilized for modeling the likelihood of diabetes in patients, integrates medical and demographic data alongside diabetes status. It encompasses variables such as age, gender, BMI, hypertension, heart disease, smoking history, HbA1c level, and blood glucose level. The dataset’s primary aim is to facilitate the development of machine learning models for diabetes prediction, aiding healthcare professionals in identifying at-risk individuals and aiding research on diabetes-related factors.
Unique to this dataset are considerations for the complexity of relationships and the inclusion of Decision Trees and Neural Networks in the model selection process, factors not emphasized in the Heart Failure Prediction dataset. The dataset’s categorical and numerical nature necessitates models adept at handling both data types, with the binary ‘diabetes’ target variable framing it as a classification problem.
Key considerations include:
1. Data Complexity: Acknowledging potential non-linear relationships in the data, models capable of addressing complex data interactions are prioritized.
2. Model Selection: Decision Trees and Neural Networks are considered, in addition to Logistic Regression, Random Forests, and GBM. Decision Trees offer interpretability and manage non-linear data well, while Neural Networks excel in modeling intricate relationships, especially in larger datasets.
3. Performance and Transition Criteria: Model selection is guided by performance metrics (accuracy, AUC-ROC), with a focus on balancing interpretability and complexity. The transition between models is dictated by these metrics and signs of overfitting, ensuring model robustness and efficiency.
The feature diagram (Figure 4) represents these heuristics, underscoring the unique aspects of the Diabetes Prediction dataset in model selection, unlike the Heart Failure dataset.
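As an illustration of the transition criterion described above, the sketch below escalates to a more complex model only when the current one underperforms on validation AUC-ROC or shows an overfitting gap between training and validation scores. The 0.80 AUC floor and 0.05 gap are illustrative assumptions, not values prescribed by GPT.

```python
from sklearn.metrics import roc_auc_score

# Sketch: decide whether to transition from the current (simpler) model
# to a more complex one, based on validation performance and overfitting signs.
def should_transition(model, X_train, y_train, X_val, y_val,
                      min_auc=0.80, max_gap=0.05):
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    auc_val = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    underperforms = auc_val < min_auc          # insufficient validation performance
    overfits = (auc_train - auc_val) > max_gap # large train/validation gap
    return underperforms or overfits
```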

Ranking and Transition Criteria:
III-B3 Dataset: Car’s Price Prediction
Based on the car price prediction dataset [13] and the nature of the regression problem, the selection of suitable machine learning models incorporates additional heuristics and factors, as depicted in Figure 5:
1. Dataset Complexity (Simple vs. Complex Regressor): The complexity of relationships in the data is considered. Simple regressors are suitable for less complex relationships, while complex models are favored for more intricate data patterns.
2. Quality and Size of Data (Generalizability and Interpretability): The dataset’s size and quality, including missing data and outliers, are crucial in model selection. Larger datasets support more complex models, while smaller or poorer-quality datasets might benefit from simpler models.
3. Type of Features (Numerical and Categorical): The quantity and type of features (numerical and categorical) influence the model choice. For example, more categorical features might necessitate models adept at handling feature conversion (see the pipeline sketch after this list).
4. Nature of the Target Variable (Continuous): The target variable ‘selling_price’ is continuous, directing the choice towards regression models.
5. Model Type (Linear vs. Non-linear): The decision between linear and non-linear models depends on the complexity of relationships within the data.
6. Evaluation Criteria (Performance Metrics): Models are assessed based on RMSE, MAE, and R² score, with cross-validation for generalizability.
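The pipeline sketch below illustrates factor 3, combining numerical scaling and categorical encoding before a regressor. The column names are assumptions about the public car dataset [13], not values reported in this paper.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Sketch: handle numerical and categorical features in a single pipeline.
# Column names below are assumed, for illustration only.
numeric_cols = ["year", "km_driven"]
categorical_cols = ["fuel", "transmission", "seller_type"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(random_state=42)),
])
# model.fit(X_train, y_train) would then train on the mixed-type features.
```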

Given these considerations, the preferred models are the Random Forest Regressor, Gradient Boosting Regressors (such as XGBoost and LightGBM), Linear Regression, Ridge/Lasso Regression, and the Support Vector Regressor (SVR), as detailed in Section IV-C.
III-C Overall Representation: GPT Modeling Techniques
Figure 6 presents the feature diagram showcasing the factors and assumptions used in GPT model selection, integrating insights from both top-down and bottom-up approaches. It categorizes these into three main groups under Modeling Assumptions: dataset assumptions, functional requirements, and non-functional requirements, following the framework proposed by Tavares et al. [8]. At the root of this diagram is the Modeling Technique Selection feature, which encompasses a wide range of machine learning techniques. Previous figures depicted only the ML techniques recommended for specific applications, whereas this comprehensive diagram includes all models for reference, as detailed in Tavares et al. [8].

Although ethical concerns were raised in the top-down query and we represent them in our Modeling Assumptions feature diagram, GPT did not consider these when proposing model selections for any of the three datasets.
Below is a concise set of constraints derived from the figure and the GPT outputs, encapsulating dataset attributes, problem types, evaluation metrics, and computational capacity. The guidelines also capture the iterative process of model improvement, advocating an adaptive approach that progresses from basic to advanced models in response to performance results and overfitting indicators:
1. Dataset Constraints:
• Size ≥ MinimumSizeRequirement ∧ Features ≤ MaximumFeaturesAllowed
• DatasetType = {Numerical, Categorical, Binary, Text, Image, Time-series}
• DatasetQualityConstraints = (MissingData ∨ Outliers ∨ Noise ∨ Unbalanced)
2. Algorithm Type Constraints:
• ProblemType = Classification ∨ Regression ∨ Clustering ∨ DimensionalityReduction
• If Classification → AlgorithmOptions = {DecisionTree, NaiveBayes, NeuralNetwork, SVM, K-NearestNeighbors, LogisticRegression, EnsembleMethods, DeepLearningModels}
• If Regression → AlgorithmOptions = {LinearRegression, PolynomialRegression, RidgeRegression, LassoRegression, ElasticNet, EnsembleMethods, DeepLearningModels}
• If Clustering → AlgorithmOptions = {K-Means, HierarchicalClustering, DBSCAN, GaussianMixtureModels, MeanShift}
• If DimensionalityReduction → AlgorithmOptions = {PCA, t-SNE, LDA, QDA, Autoencoders}
3. Evaluation Criteria Constraints:
• If ProblemType = Classification → PerformanceMetrics = {Accuracy, Precision, Recall, F1Score, AUC-ROC}
• If ProblemType = Regression → PerformanceMetrics = {RMSE, MAE, R2}
• If ProblemType = Clustering → PerformanceMetrics = {SilhouetteScore, DaviesBouldinIndex, CalinskiHarabaszIndex}
• If ProblemType = DimensionalityReduction → PerformanceMetrics = {ReconstructionError, ExplainedVarianceRatio}
4. Non-Functional Requirements Constraints:
• EvaluationCriteria = (Interpretability ∧ Robustness ∧ Fairness ∧ Privacy ∧ Transparency ∧ Generalizability ∧ EthicalConsiderations)
• ModelComplexity = (Simple ∨ Complex)
5. Computational Resource Constraints:
• If LimitedResources → Prefer {ModelsWithLessComplexity, ReducedFeatureSets}
• If SufficientResources → Capability to implement {EnsembleMethods, DeepLearningModels}
6. Transition and Iterative Improvement Constraints:
• InitialModelSelection = SimplerModels
• If PoorPerformance → TransitionToMoreComplexModels
• If OverfittingDetected → Apply (Regularization ∨ ParameterTuning ∨ ModelSelectionReevaluation)
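As a minimal sketch, the algorithm-type, evaluation-criteria, and computational-resource constraints above can be encoded as lookup tables and a small selection function. The groupings mirror the constraint list; the dictionary structure and function are our own illustrative encoding.

```python
# Sketch: lookup-table encoding of the algorithm and metric constraints above.
ALGORITHM_OPTIONS = {
    "classification": ["DecisionTree", "NaiveBayes", "NeuralNetwork", "SVM",
                       "KNearestNeighbors", "LogisticRegression",
                       "EnsembleMethods", "DeepLearningModels"],
    "regression": ["LinearRegression", "PolynomialRegression", "RidgeRegression",
                   "LassoRegression", "ElasticNet", "EnsembleMethods",
                   "DeepLearningModels"],
    "clustering": ["KMeans", "HierarchicalClustering", "DBSCAN",
                   "GaussianMixtureModels", "MeanShift"],
    "dimensionality_reduction": ["PCA", "t-SNE", "LDA", "QDA", "Autoencoders"],
}

PERFORMANCE_METRICS = {
    "classification": ["Accuracy", "Precision", "Recall", "F1Score", "AUC-ROC"],
    "regression": ["RMSE", "MAE", "R2"],
    "clustering": ["SilhouetteScore", "DaviesBouldinIndex", "CalinskiHarabaszIndex"],
    "dimensionality_reduction": ["ReconstructionError", "ExplainedVarianceRatio"],
}

def candidate_models(problem_type, limited_resources):
    options = ALGORITHM_OPTIONS[problem_type]
    if limited_resources:
        # Computational resource constraint: drop the most complex model families.
        options = [m for m in options
                   if m not in ("EnsembleMethods", "DeepLearningModels")]
    return options

print(candidate_models("classification", limited_resources=True))
print(PERFORMANCE_METRICS["classification"])
```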
In summary, this section presents the factors and variabilities that GPT takes into account when selecting a model for a dataset. The diagram and constraints can be adapted to examine model selection for additional datasets, to interpret decisions made by an automated GPT-based system using GPT’s heuristics, or to conduct tests and verification processes investigating the relationships between these factors.
IV Comparative Results
Tavares et al. [8] present a system that ranks models for datasets based on Scikit-learn heuristics. We tested this system on the three datasets described above to evaluate whether the heuristics consistently select similar models. When discrepancies occurred, we compared the top model from each heuristic.
We applied the main models suggested by the GPT and Scikit-learn diagrams using the scikit-learn library. GridSearchCV was employed to automate the evaluation of all model parameter combinations via cross-validation.
IV-A Heart Failure Prediction
Following [14], we divided our dataset into an 80% training set (239 patients) and a 20% test set (60 patients), using a stratified split to accommodate the dataset’s imbalance.
The Sklearn heuristic recommended LinearSVC, KNeighborsClassifier, SVC, and EnsembleClassifiers. In contrast, the GPT heuristic suggested Logistic Regression, Random Forest, Gradient Boosting Machines, and Support Vector Machine. Model selection was based on performance metrics like accuracy, precision, recall, and AUC-ROC, as suggested by GPT.
We executed the top two models from GPT (Logistic Regression and Random Forest), the top two from the Scikit-learn heuristic (LinearSVC and KNeighborsClassifier), and the baseline Random Forest model used in [14].
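A minimal sketch of this setup is shown below, assuming a stratified 80/20 split and small illustrative parameter grids; it is not the exact configuration used to produce the numbers reported next.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Sketch: stratified split plus GridSearchCV over two GPT-suggested models.
# Parameter grids below are illustrative assumptions.
df = pd.read_csv("heart_failure_clinical_records_dataset.csv")
X, y = df.drop(columns=["DEATH_EVENT"]), df["DEATH_EVENT"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

candidates = {
    "LogisticRegression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "RandomForest": (RandomForestClassifier(random_state=42),
                     {"n_estimators": [100, 300], "max_depth": [None, 5]}),
}

for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=5)
    search.fit(X_train, y_train)
    pred = search.predict(X_test)
    proba = search.predict_proba(X_test)[:, 1]
    print(name,
          round(accuracy_score(y_test, pred), 3),
          round(precision_score(y_test, pred), 3),
          round(recall_score(y_test, pred), 3),
          round(roc_auc_score(y_test, proba), 3))
```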
Logistic Regression, LinearSVC, and RandomForestClassifier showed identical accuracy and precision. However, Logistic Regression (suggested by GPT) demonstrated superior ROC-AUC performance. ROC-AUC is a critical metric for binary classifiers, especially in imbalanced datasets, as it quantifies the model’s ability to distinguish between classes across all thresholds.
IV-B Diabetes Prediction
For the Diabetes Prediction dataset, we compared the effectiveness of models suggested by the Scikit-learn heuristics and by GPT. The Scikit-learn-based heuristic recommended the SGDClassifier and kernel approximation, with alternatives such as LinearSVC, KNeighborsClassifier, SVC, and ensemble classifiers for datasets under 100K samples. The GPT heuristic, on the other hand, suggested Logistic Regression, Decision Trees, Random Forests, GBM, and Neural Networks, emphasizing the transition between models based on performance metrics such as accuracy and AUC-ROC.
Our execution and evaluation of these models on the dataset revealed notable results. The Logistic Regression model, recommended by GPT, achieved an accuracy of 0.9604, precision of 0.8592, recall of 0.6388, and a roc_auc of 0.9625. In contrast, the SGDClassifier, as suggested by Scikit, showed similar accuracy (0.9599) and precision (0.8598), but slightly lower recall (0.6312) and identical roc_auc (0.9625).
Notably, the RandomForestClassifier and GradientBoostingClassifier, also recommended by GPT, outperformed both in terms of accuracy, with scores of 0.9721 and 0.9723 respectively, and showed superior recall and precision. This comparison indicates that while the Scikit-based heuristic offers competitive models, the GPT-suggested models, particularly RandomForestClassifier and GradientBoostingClassifier, display a slight edge in overall performance for this specific dataset.
IV-C Car’s Price Prediction
In the Car’s Price Prediction subsection, we evaluated models suggested by both the Sklearn-based heuristic and GPT outputs. The Sklearn heuristic recommended RidgeRegression, SVR(kernel=linear), SVR(kernel=rbf), and EnsembleRegressors, whereas GPT suggested Random Forest Regressor, Gradient Boosting Regressors (like XGBoost, LightGBM), Linear Regression, Ridge/Lasso Regression, and Support Vector Regressor (SVR). Model performance was assessed based on RMSE, MAE, and R² score.
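A brief sketch of how these regression metrics can be computed with scikit-learn is shown below; y_test and y_pred are assumed to come from any of the fitted regressors discussed here.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Sketch: report the three regression metrics used in this comparison.
def report(y_test, y_pred):
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # root mean squared error
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return {"RMSE": rmse, "MAE": mae, "R2": r2}
```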
Upon execution, the Ridge Regression model, recommended by Sklearn, achieved an RMSE of 339373.6684, an MAE of 122614.1828, and an R² score of 0.6226. The SVR with RBF kernel, another Sklearn suggestion, however, had a significantly higher RMSE (568437.5465) and a lower R² score (-0.0588).
Comparatively, the Random Forest Regressor, a GPT suggestion, showed an RMSE of 360838.8792 and an R² score of 0.5733, indicating slightly lower performance than Ridge Regression but notably better than the SVR with RBF kernel. The GradientBoostingRegressor, also recommended by GPT, had an RMSE of 354280.1772 and an R² score of 0.5887, performing comparably to the Random Forest Regressor.
Linear Regression and Lasso, both GPT suggestions, showed competitive performance, with RMSEs of 340865.1113 and 358057.2296, respectively, and similar R² scores.
This comparison highlights that while Sklearn-based heuristics offer robust models like Ridge Regression, GPT’s recommendations, especially the GradientBoostingRegressor and Random Forest Regressor, are equally competitive in terms of RMSE and R² score. The overall performance of the models suggested by GPT demonstrates their efficacy in handling this specific dataset for car price prediction.
IV-D Discussion
This section presents an overview of the results from evaluating the efficacy of model selection heuristics provided by GPT and Scikit-learn engineers across different datasets. Our findings reveal interesting implications for the application of Large Language Models (LLMs) like GPT-4 in data science. In the classification datasets, the models suggested by GPT outperformed those recommended by Scikit-learn heuristics. Conversely, in the regression dataset, the model suggested by Scikit-learn engineers showed superior performance compared to GPT’s suggestions.
However, it’s important to note that several factors can influence these outcomes. The way datasets are split, parameter selection for algorithms, and the specific characteristics of each dataset play significant roles in the performance of the suggested models. These factors need to be carefully considered when interpreting the results and their applicability to real-world scenarios.
Overall, the results are positive and encouraging. We were successful in mapping the decision-making process of GPT for the selection of each algorithm. The suggestions provided by GPT achieved satisfactory results in all datasets, demonstrating its potential as a valuable tool in model selection.
V Conclusion
This paper addressed the need for understanding the heuristics behind decision-making processes in AI, particularly in the context of model selection in data science. The increasing reliance on Large Language Models like GPT-4 in various data science applications underscores the importance of this endeavor. Our approach involved capturing and articulating the heuristics underlying GPT-4’s model selection recommendations, a task crucial for enhancing transparency and trust in AI systems. Our analysis highlights the importance of various factors in the model selection process, including the nature of the data, problem type, performance metrics, computational resources, the interpretability-accuracy trade-off, assumptions about the data, and ethical considerations.
To achieve this, we employed feature models, which proved instrumental in representing the complex, multi-faceted nature of these heuristics. By utilizing toy datasets, we were able to effectively test and evaluate both the model and the application of the captured heuristics. This methodology enabled us to compare the results against heuristics proposed by other platforms, providing a comprehensive assessment of the efficacy and uniqueness of GPT-4’s approach.
VI Future Work
Our research opens several avenues for future exploration. One key area involves further refining the feature models to capture even more nuanced aspects of AI heuristics. Additionally, expanding the scope of datasets and scenarios will provide a more robust evaluation of AI decision-making across different contexts. There’s also a compelling opportunity to explore the integration of these heuristics into real-world applications, assessing how they can enhance the performance and reliability of AI systems in practical settings.
Furthermore, the implications of this study for AI ethics and governance are vast and warrant deeper investigation. Understanding AI decision-making is a step towards more ethical and responsible AI, and future research should delve into how these insights can be translated into policy and practice. By continuing this line of inquiry, we aim to contribute to the development of AI systems that are not only powerful and efficient but also transparent, understandable, and aligned with human values and societal needs.
VI-A Exploration of Other Large Language Models (LLMs)
A comprehensive investigation of various LLMs will provide insights into their unique decision-making processes. Establishing clear evaluation standards will enable us to compare and contrast these models. The integration of different generative models with data connectors, as exemplified in LangChain, offers a fertile ground for exploring how different LLMs interact and perform in a variety of contexts.
VI-B Prompt Engineering and Its Impact
Prompt engineering and its influence on model heuristics is a significant aspect to explore. By adopting the approach proposed by [15], where LLMs are used to simulate multiple humans and replicate human subject studies, we can investigate how simulating data scientists with varied professional experiences impacts GPT’s model selection. This approach aligns with the concept of Turing Experiments for language models, which aim to evaluate the extent to which language models can simulate different aspects of human behavior.
VI-C Ethical Considerations
Our investigation revealed that while GPT’s strategic framework (top-down approach) acknowledges ethical considerations, these elements are not as emphasized in the practical application of the models. This gap suggests an area for future work, where we can explore potential biases and ethical implications of AI-assisted model selection, and how prompt engineering could be leveraged to explicitly include these ethical considerations. By incorporating these factors, we can ensure that the AI’s decision-making process not only focuses on performance metrics but also aligns with ethical guidelines and best practices.
VI-D Addressing Hallucination in Responses
One of the critical challenges with LLMs is addressing ‘hallucinations’ [16] – instances where the models generate contextually irrelevant or factually incorrect responses. A deeper analysis of the internal decision-making processes of these models is necessary to understand and mitigate these risks, which are crucial for the trust and safety in AI applications.
VI-E Application to More Complex Scenarios
Applying our findings to larger datasets and more complex applications will significantly enhance our understanding of LLM decision-making. This will involve not only scaling up the size of the datasets but also tackling more challenging scenarios that reflect real-world complexity. Such applications will provide valuable insights into the robustness and adaptability of LLMs in diverse and demanding environments.
Acknowledgment
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), and the Centre for Community Mapping (COMAP).
References
- [1] A. Halevy, Y. Choi, A. Floratou, M. J. Franklin, N. Noy, and H. Wang, “Will LLMs reshape, supercharge, or kill data science? (VLDB 2023 panel),” Proceedings of the VLDB Endowment, vol. 16, no. 12, pp. 4114–4115, 2023.
- [2] B. Chopra, A. Singha, A. Fariha, S. Gulwani, C. Parnin, A. Tiwari, and A. Z. Henley, “Conversational challenges in AI-powered data science: Obstacles, needs, and design opportunities,” arXiv preprint arXiv:2310.16164, 2023.
- [3] J.-P. Vert, “How will generative AI disrupt data science in drug discovery?” Nature Biotechnology, pp. 1–2, 2023.
- [4] R. J. L. John, D. Bacon, J. Chen, U. Ramesh, J. Li, D. Das, R. Claus, A. Kendall, and J. M. Patel, “DataChat: An intuitive and collaborative data analytics platform,” in Companion of the 2023 International Conference on Management of Data, 2023, pp. 203–215.
- [5] C. Troy, S. Sturley, J. M. Alcaraz-Calero, and Q. Wang, “Enabling generative AI to produce SQL statements: A framework for the auto-generation of knowledge based on EBNF context-free grammars,” IEEE Access, vol. 11, pp. 123543–123564, 2023.
- [6] LangChain, “LangChain framework,” 2023, accessed: 2023-11-15. [Online]. Available: https://python.langchain.com/docs/use_cases/qa_structured/sql
- [7] N. Nascimento, P. Alencar, and D. Cowan, “Artificial intelligence vs. software engineers: An empirical study on performance and efficiency using ChatGPT,” in Proceedings of the 33rd Annual International Conference on Computer Science and Software Engineering, 2023, pp. 24–33.
- [8] C. Tavares, N. Nascimento, P. Alencar, and D. Cowan, “Adaptive method for machine learning model selection in data science projects,” in 2022 IEEE International Conference on Big Data (Big Data). IEEE, 2022, pp. 2682–2688.
- [9] OpenAI, “GPT models API,” 2023, accessed: 2023-08-07. [Online]. Available: https://platform.openai.com/docs/guides/gpt
- [10] D. Nešić, J. Krüger, Ş. Stănciulescu, and T. Berger, “Principles of feature modeling,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 62–73.
- [11] D. Chicco and G. Jurman, “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone,” BMC Medical Informatics and Decision Making, vol. 20, no. 1, pp. 1–16, 2020.
- [12] N. Birla, “Kaggle: Diabetes prediction dataset,” 2023, accessed: 2023-11-15. [Online]. Available: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
- [13] Kaggle, “Kaggle: Vehicle dataset,” 2023, accessed: 2023-11-15. [Online]. Available: https://www.kaggle.com/datasets/nehalbirla/vehicle-dataset-from-cardekho
- [14] R. Leenings, N. R. Winter, L. Plagwitz, V. Holstein, J. Ernsting, K. Sarink, L. Fisch, J. Steenweg, L. Kleine-Vennekate, J. Gebker et al., “PHOTONAI: A Python API for rapid machine learning model development,” PLoS ONE, vol. 16, no. 7, p. e0254062, 2021.
- [15] G. V. Aher, R. I. Arriaga, and A. T. Kalai, “Using large language models to simulate multiple humans and replicate human subject studies,” in International Conference on Machine Learning. PMLR, 2023, pp. 337–371.
- [16] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, no. 12, pp. 1–38, 2023.