QCFE: An Efficient Feature Engineering for Query Cost Estimation
Abstract
Query cost estimation is a classical task for database management. Recently, researchers have applied AI-driven models to query cost estimation to achieve high accuracy. However, two defects in feature design lead to poor time-accuracy efficiency. On the one hand, existing works only encode the query plan and data statistics while ignoring other important variables, like storage structure, hardware, database knobs, etc. These variables also have a significant impact on the query cost. On the other hand, due to the straightforward encoding design, existing works suffer a heavy representation learning burden on ineffective dimensions of the input. To address these two problems, we propose an efficient feature engineering for query cost estimation, called QCFE. Specifically, we design a novel feature, called the feature snapshot, to efficiently integrate the influences of the ignored variables. Further, we propose a difference-propagation feature reduction method for query cost estimation to filter the useless features. The experimental results demonstrate that QCFE largely improves the time-accuracy efficiency on extensive benchmarks.
I Introduction
Cost estimation plays a pivotal role in database management, forming the bedrock for database optimization strategies encompassing query optimization [1], index optimization [2], and storage efficiency, among other aspects. The precision of cost estimation methodologies stands as a linchpin for achieving optimal performance in database operations. Regrettably, conventional techniques reliant on cost equations may produce huge estimation errors under complex workloads due to their simplistic frameworks and underlying assumptions [3]. This shortcoming can engender sub-optimal optimization outcomes, consequently undermining database performance.
Hence, the database community has delved into the application of neural network models to capture the intricate correlation between queries and their associated costs, harnessing the formidable learning prowess inherent in these deep networks. Extensive experimental results [4, 5, 6, 3] have demonstrated that the learning approaches achieve high accuracy across various complex benchmarks.
However, the utilization of neural networks for databases poses an efficiency-accuracy dilemma. On the one hand, the query cost is related to multiple heterogeneous features (such as relation tables, queries, etc.) with different structures, compared to other fields such as natural language processing [7] and image recognition [8, 9]. To accurately represent and fit the query plan-performance relationship, it is necessary to design complex network models. For instance, the transformer query cost model [10] has shown superior performance compared to simpler models, primarily due to its deeper network layers and attention mechanism that assigns weights to features. Notably, such network structures require more training and inference time than ordinary deep neural network (DNN) models.
On the other hand, database cost estimation is a frequently invoked component, which may be invoked multiple times within a single query optimization [11]. For example, even though PostgreSQL utilizes genetic algorithms to reduce the number of estimation requests, the cost estimation component is still called more than $n$ times, where $n$ is the number of nodes in the query plan. Additionally, large database management systems receive thousands of query requests every minute. Consequently, it is not feasible to allocate excessive time to the fundamental cost estimation component.

Given the limitations of designing complex models, a natural approach to the efficiency-accuracy dilemma is to optimize the features so as to reduce the burden of representation learning and fitting in the model. Existing AI-driven cost estimation methods utilize relatively straightforward approaches to processing input query features. Typically, the one-hot encoding for tables [12], the one-hot encoding for indexes [13], and the vectors of numerical values are directly fed into the evaluation model along a bottom-up traversal of the query plan. Overall, we have identified two shortcomings regarding feature design in existing AI-driven query cost estimators:
(1) Missing Important Features: Current methods [3] primarily focus on encoding the query plan and the table statistics, often overlooking the impact of other database variables on query cost. However, variables such as the storage format of data (e.g., B+ tree or LSM tree) and the hardware of the database also play a significant role in determining the query cost. Our investigation, as depicted in Figure 1, demonstrates substantial differences (2 times in TPCH and 3 times in Sysbench) in the average execution time of the same queries under different database environments (five database knob configurations). Therefore, neglecting the database environment can result in significant losses when predicting query cost.
(2) Heavy Representation Learning: Existing methods directly utilize the table feature, index feature, operator feature, etc. as the input of the AI cost model. This places a large burden on representation learning [14], which is used to learn an effective representation of the input features. Specifically, when the goal is to keep the learned model simple (to accelerate inference), capturing the relationships between such a large number of features and the query cost becomes difficult, because the intricate logical relationships among multiple features require multiple nonlinear transformations to capture effectively.
These two problems appear to be contradictory. The absence of crucial features primarily results from an incomplete modeling of the query cost estimation problem, necessitating the incorporation of additional features. The heavy representation learning stems from the ineffective elements of the encoding, necessitating the removal of some features. Nevertheless, when viewed collectively, both issues can be categorized as feature engineering challenges, implying that feature engineering for query cost estimation has not yet been handled optimally.
To solve the above problems, we design an effective feature engineering for query cost estimation, called QCFE. The core insights are as follows: (1) To avoid missing important variables, we define a novel concept, called the feature snapshot (FS), to integrate the characteristics of the ignored variables (defined as the variable set of database knobs, storage structure, hardware, and operating system). To the best of our knowledge, no one has attempted to encode the ignored variables for the query cost model. One possible reason is that the resource required to build an exact feature representation is tantamount to building the database environment itself. Hence, we propose an estimation method to obtain the feature snapshot, ensuring high efficiency.
(2) For the heavy representation learning, we design a difference-propagation feature reduction (FR) method to relieve the learning burden by pruning the ineffective features. Depending on the relational tables and load type, certain features may not be effective. For instance, the plan-based method encodes the index as a vector whose length equals the number of columns. However, in pure write scenarios, the database management system may not create an index at all, leaving this column-length index vector ineffective in the query feature. These ineffective features not only increase the training and inference cost of the AI evaluation model but also reduce its accuracy [15].
In summary, our specific contributions are as follows:
• In order to improve the time-accuracy efficiency, we propose a feature engineering for query cost estimation, called QCFE.
• We first propose the feature snapshot concept (Section III) for query cost estimation, integrating the influence of the ignored variables. Our core goal is to make reasonable assumptions so that the feature snapshot can be calculated with high time efficiency.
• We design the difference-propagation feature reduction method (Section IV) to efficiently reduce the useless features, further improving model training and inference efficiency.
• To clarify the effectiveness of our QCFE, we demonstrate various comparisons (Section V) under extensive popular benchmarks (TPC-H, job-light, and Sysbench), including the evaluation of time-accuracy efficiency, the ablation of QCFE, the robustness of QCFE, etc.
II Overview
In this section, we overview the architecture and workflow of our QCFE.


Firstly, we show the general feature engineering widely used in existing works [16, 12, 17] to clarify the effectiveness of our QCFE. As shown in Figure 2, the general FE directly encodes the query plan and table statistics as the training set for AI-driven models, which ignores other influential factors (like hardware) and may waste computing resources on ineffective input codes.
In contrast, our QCFE considers all the factors of the database environment, containing the table statistics, queries, hardware, database knobs, etc. In particular, we design an efficient feature snapshot to capture the influence of the ignored variables in Section III. Then, we propose the difference-propagation feature reduction algorithm (a variant of the back-propagation algorithm [18]) to filter the useless dimensions and further improve time-accuracy efficiency in Section IV.
III The Feature Snapshot
Although the ignored variables also have significant impacts on the query cost, making exact representations of these variables as a feature snapshot involves large resource overhead. In this section, we introduce the feature snapshot design in Section III-A. Further, to improve the time efficiency of calculating the feature snapshot, we design standard simplified SQL templates to replace the original templates in Section III-B.
III-A Estimated Feature Snapshot
Firstly, we clarify why we design an estimated FS to represent the ignored variables instead of an exact feature, for the following two reasons.
(1) Partial influence: Essentially, these ignored variables affect the query cost by affecting the I/O cost and CPU cost of the operators. Only some components of these variables have an impact on the query cost. For example, the audio peripherals in the hardware will not affect query execution efficiency. Hence, we only need to consider the I/O- and CPU-related components.
(2) High resource overhead: The ignored variables have complex structures, which are costly to encode directly as an exact feature. For example, the hardware consists of multiple components, such as memory, disk, CPU, etc. Exactly representing the various components of the ignored variables would produce enormous feature snapshots, leading to a large space cost.
Due to the partial influence and high resource overhead, we construct an estimated feature snapshot representation for the ignored variables. The specific considerations are as follows:
(1) We first identify how the ignored variables affect the query cost. Specifically, we analyze a basic physical cost formula of PostgreSQL: $Cost = c_s \cdot n_s + c_r \cdot n_r + c_t \cdot n_t + c_i \cdot n_i + c_o \cdot n_o$, where $c_s$ is the I/O cost to sequentially access a page, $c_r$ is the I/O cost to randomly access a page, $c_t$ is the CPU cost to process a tuple, $c_i$ is the CPU cost to process a tuple via indexes, $c_o$ is the CPU cost to process an operator (like sort), and $n_s, n_r, n_t, n_i, n_o$ are the corresponding operation counts. We can conclude two kinds of metrics from the above formula: the cost coefficients ($c$) and the cost numbers ($n$). All the ignored variables influence the query cost by influencing $c$ and $n$. In general, the query plan and data statistics have the main impact on $n$, while the ignored variables mainly influence $c$, the cost of a single I/O or CPU request. For example, the disk type only influences the I/O speed for a given relation and query, and the enable_indexscan knob mainly influences the cost coefficients of a single request. Our first estimating assumption is that the ignored variables only influence $c$.
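To make the decomposition concrete, the minimal sketch below evaluates this cost formula with PostgreSQL's documented default coefficient values; the per-operation counts are hypothetical and would normally be derived from the query plan and data statistics.

```python
# A minimal sketch of the PostgreSQL-style cost decomposition described above.
# The coefficients are PostgreSQL's documented defaults; the counts are hypothetical.

COST_COEFFICIENTS = {               # cost coefficients c, shaped by the ignored variables
    "seq_page_cost": 1.0,           # I/O cost to sequentially access a page
    "random_page_cost": 4.0,        # I/O cost to randomly access a page
    "cpu_tuple_cost": 0.01,         # CPU cost to process a tuple
    "cpu_index_tuple_cost": 0.005,  # CPU cost to process a tuple via an index
    "cpu_operator_cost": 0.0025,    # CPU cost to process an operator (e.g., sort)
}

def physical_cost(counts: dict, coeffs: dict = COST_COEFFICIENTS) -> float:
    """Cost = sum of coefficient * count over all cost components."""
    return sum(coeffs[name] * counts.get(name, 0) for name in coeffs)

# Hypothetical cost numbers n produced by the query plan and data statistics.
counts = {"seq_page_cost": 1200, "cpu_tuple_cost": 150_000}
print(physical_cost(counts))        # 1200*1.0 + 150000*0.01 = 2700.0
```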
| Cost Formula | Operators |
| $a \cdot N + b$ | Seq Scan, Materialize, Aggregation |
| $a \cdot N_1 + b \cdot N_2 + c$ | Index Scan, Merge Join, Hash Join |
| $a \cdot N \log N + b$ | Sort |
| $a \cdot N_1 \cdot N_2 + b$ | Nested Loop |
(2) We present how to estimate the cost coefficients. For simplicity and generality, we leverage the notion of logical cost functions [19] to estimate them instead of the cost formula of a certain DBMS. Table I shows the assumed logical cost formulas for different operators. These formulas are determined by the logical execution of the operators. For example, $a \cdot N + b$ could be the logical cost formula for seq scan [20], where $N$ is the cardinality of the operator. Based on these assumed logical formulas, we can estimate the FS for each operator instead of building an infeasible exact representation, thereby representing the influences of the ignored variables. For example, the fitted coefficients $(a, b)$ could be the feature snapshot of the seq scan operator. These assumed logical formulas could also be optimized for a specific DBMS (such as the revised cost formula for Postgres [21]), improving FS estimation accuracy.
Specifically, according to the logical formulas, we utilize a regression model based on the least squares method [22] to calculate the FS for each operator. The regression model is trained on a labeled operator set, which is collected by executing multiple queries. When the query structure is complex, collecting the labeled set may be costly in time.
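As an illustration of this regression step, the sketch below fits the assumed linear seq scan formula $a \cdot N + b$ to a small labeled set of (cardinality, observed cost) pairs with ordinary least squares; the numbers are invented.

```python
import numpy as np

# Hypothetical labeled set for the seq scan operator under one database
# environment: (cardinality N, observed execution cost in ms).
cards = np.array([1_000, 5_000, 20_000, 80_000, 150_000], dtype=float)
costs = np.array([2.1, 9.8, 41.0, 160.3, 305.2])

# Assumed logical formula cost ~= a * N + b  =>  design matrix [N, 1].
X = np.stack([cards, np.ones_like(cards)], axis=1)
(a, b), *_ = np.linalg.lstsq(X, costs, rcond=None)

feature_snapshot = (a, b)   # the estimated FS of the seq scan operator
print(feature_snapshot)
```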
III-B Simplified Template
We observe that calculating the feature snapshot requires executing multiple queries, which is costly for complex queries such as the OLAP queries in TPCH. In order to improve time efficiency, we design simplified templates, which capture the characteristics of the original queries while executing efficiently.
As shown in Algorithm 1, the input consists of the data abstract, the original query templates, and the scale of the simplified templates, and the output is a query set that serves as an effective substitute for calculating the FS. Our algorithm consists of three important phases: (i) parse the original query templates, (ii) generate the simplified templates, and (iii) fill in the simplified templates. Next, we introduce these phases in detail.

In the first phase (Lines 2-5), we parse the original query templates and obtain the operator-table-column set, where each entry maps a physical operator to its related table-column tuples. For each query template, we gather its operator set by matching the keyword-operator relationships in Table II. We observe that one keyword may relate to two physical operators due to uncertain query optimization; for example, a single-table selection condition corresponds to both the seq scan operator and the index scan operator. After parsing the original query templates, we obtain the operator-table-column set, which is the basis for generating the simplified templates. Figure 3 shows an example going from an original query template to the simplified queries. The original query template consists of some keywords, like the selection condition, order by, group by, etc. Based on these keywords, we obtain the corresponding operator-table-column set, represented by arrows such as partsupp-ps_partkey → hash/index scan.
| Keyword | Operator | Parent Template |
| [condition] | Seq/Index Scan | SELECT * FROM [table] WHERE [condition] |
| Order By | Sort | SELECT * FROM [table] WHERE [condition] ORDER BY [table.attr] |
| Group By | Aggregate | SELECT COUNT(*) FROM [table] WHERE [condition] GROUP BY [attribute] |
| table1.attr1 = table2.attr2 | Merge/Hash Join, Nested Loop | SELECT * FROM [table1] JOIN [table2] ON [table1.attr = table2.attr] WHERE [condition] |
| | | SELECT * FROM [table1] JOIN [table2] ON [table1.attr = table2.attr] WHERE [condition] ORDER BY [table1.attr] |
| … | … | … |
In the second phase (Lines 6-9), we generate the simplified templates according to the parent-template correspondence in Table II and the operator-table-column set. For each kind of operator, we design parent templates to reproduce the operator. For the seq/index scan, we design one parent template, "SELECT * FROM [table] WHERE [condition]". We do not enforce the operator type in parent templates with database commands like enable_indexscan, because such enforcement may miss the characteristics of the original queries. For example, a column with a corresponding index tends to be accessed mostly by index scans instead of seq scans. Hence, we generate simplified query templates without any enforcing commands. After identifying the parent templates, we generate specific templates by filling in the table-column information. For example, in Figure 3, for the seq scan operator we fill the table-column information into the parent template, yielding "SELECT * FROM partsupp WHERE [ps_partkey]".
In the third phase (Lines 11-15), we fill the simplified templates with values drawn from the data abstract to iteratively generate the simplified queries, and the scale parameter controls the number of queries generated per template. Finally, we obtain the simplified queries, which can be used to calculate the feature snapshot.
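A condensed sketch of the three phases is shown below; the keyword-to-operator and operator-to-parent-template dictionaries are hypothetical stand-ins for Table II, and the data abstract is assumed to map (table, column) pairs to sampled values.

```python
import random

# Hypothetical stand-ins for Table II.
KEYWORD_OPERATORS = {
    "where": ["Seq/Index Scan"],
    "order by": ["Sort"],
    "group by": ["Aggregate"],
    "join": ["Merge/Hash Join", "Nested Loop"],
}
PARENT_TEMPLATES = {
    "Seq/Index Scan": "SELECT * FROM {table} WHERE {column} < {value}",
    "Sort": "SELECT * FROM {table} WHERE {column} < {value} ORDER BY {column}",
    "Aggregate": "SELECT COUNT(*) FROM {table} WHERE {column} < {value} GROUP BY {column}",
}

def generate_simplified_queries(templates, data_abstract, scale):
    """templates: original SQL template strings; data_abstract: {(table, column): [values]};
    scale: number of generated queries per simplified template."""
    # Phase 1: parse the original templates into an operator-table-column set.
    op_table_col = set()
    for sql in templates:
        sql_lower = sql.lower()
        for keyword, operators in KEYWORD_OPERATORS.items():
            if keyword in sql_lower:
                for (table, column) in data_abstract:
                    if table in sql_lower:
                        for op in operators:
                            op_table_col.add((op, table, column))
    # Phases 2 and 3: instantiate parent templates and fill them with sampled values.
    queries = []
    for (op, table, column) in sorted(op_table_col):
        template = PARENT_TEMPLATES.get(op)
        if template is None:
            continue
        for _ in range(scale):
            value = random.choice(data_abstract[(table, column)])
            queries.append(template.format(table=table, column=column, value=value))
    return queries
```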
In general, the time complexity of this algorithm is related to the number of original query templates, the number of generated operators, the number of table-column pairs per operator, and the scale parameter, all of which are constants. Hence, our simplified query generation algorithm runs in constant time. Next, we analyze some limitations of our simplified query templates.
Discussions: In this paper, the feature snapshot is established at the operator level, as an estimated representation of the ignored variables. Naturally, it could be extended to more fine-grained levels, such as the operator-table level or even the operator-table-column level, i.e., we could establish a separate feature snapshot for the scan operator of a certain table or a certain column. Fine-grained feature snapshots bring higher accuracy, but also increase the collection cost.
IV Feature Reduction
Existing AI-driven works [6, 13] on query cost estimation employ a direct feature encoding of queries and data tables. The high dimensionality of these features brings computational overhead. In this section, we introduce our feature reduction algorithm, a novel method to filter the useless features for AI-driven query cost estimators and further improve the time-accuracy efficiency. We first analyze the encoding characteristics of existing popular AI-driven database techniques in Section IV-A. Then, we introduce our difference-propagation feature reduction method in Section IV-B to filter the useless features.
IV-A The analysis of existing encoding methods
As the foundation for designing the feature reduction algorithm, we analyze the encoding characteristics of existing popular AI-driven database techniques. As shown in Table III, we summarize the encoding objects, the encoding methods, and the model type for seven popular works spanning multiple database tasks. The specific analysis is as follows.
| Method | encoding object (plan or operator) | encoding methods | model | task |
| QPPNet [13] | physical operator | one-hot, numerical value | DNN | cost estimation |
| MSCN [17] | query plan | one-hot, numerical value | DNN | cardinality estimation |
| AVGDL [16] | physical operator | one-hot, numerical value | RNN | view selection |
| end2end [12] | physical operator | one-hot, numerical value | RNN | cost estimation & cardinality estimation |
| zero-shot [3] | physical operator | numerical value | MLP | cost estimation |
| AIMeets [23] | physical operator | numerical value, one-hot | DNN | index selection |
| Bao [24] | physical operator | one-hot, numerical value | tree CNN | query optimization |
For the encoding objects, we find that most techniques encode the query at the fine-grained operator level, like AIMeets [23], AVGDL [16], etc. Only MSCN [17] directly encodes queries from three aspects: joins, predicates, and data samples. Thus, our feature reduction algorithm should be designed for the fine-grained operator level to adapt to most models. At the same time, the fine-grained design is also compatible with higher-level encoding objects.
For the encoding methods, we find that all the techniques utilize the same encoding primitives: one-hot encoding and numerical values. For example, QPPNet [13] utilizes a plan-structured deep neural network to learn the relationship between the physical query plan and its time cost. For every node of the plan, QPPNet employs one-hot encoding for the node type (like scan, sort, etc.), index, and table, and numerical values for the columns, node width, cardinality, and cost. Consequently, our feature reduction algorithm should be compatible with one-hot encodings and numerical values.
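For concreteness, a minimal sketch of such an operator-level encoding is given below; the node-type and table vocabularies and the numeric fields are hypothetical rather than the exact QPPNet layout.

```python
import numpy as np

# Hypothetical vocabularies; real systems derive these from the schema and plan format.
NODE_TYPES = ["Seq Scan", "Index Scan", "Sort", "Hash Join", "Aggregate"]
TABLES = ["lineitem", "orders", "partsupp"]

def one_hot(value, vocabulary):
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value)] = 1.0
    return vec

def encode_operator(node_type, table, width, cardinality, est_cost):
    """Concatenate one-hot codes for categorical fields with raw numerical values."""
    return np.concatenate([
        one_hot(node_type, NODE_TYPES),
        one_hot(table, TABLES),
        np.array([width, cardinality, est_cost], dtype=float),
    ])

x = encode_operator("Seq Scan", "partsupp", width=24, cardinality=80_000, est_cost=1432.5)
print(x.shape)   # (11,) = 5 + 3 + 3 dimensions
```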
For the model type, the learned models of existing methods consist of three typical networks, deep neural network (DNN), recurrent neural network (RNN), and convolutional neural network (CNN). Then, our feature reduction algorithm should be compatible with different types of neural networks.
In summary, feature reduction needs to consider three core aspects: the fine-grained operator level, the one-hot encodings and numerical values, and the different typical neural networks.
IV-B Feature Reduction
In this section, we introduce our fine-grained feature reduction algorithm designed for query operators. Feature reduction involves two key factors: the input labeled data and the learned model. The input labeled data contains the operator-level feature snapshot, the query feature, and the statistics feature. The learned model refers to the AI-driven cost model for query estimation. Due to their different learning abilities, different models may have different effective features on the same labeled data. The feature reduction problem is computationally hard because of the exponential number of candidate feature subsets ($2^d$ for $d$ input dimensions). The specific problem is defined as follows:
Given a labeled operator set $\{(x_i, y_i)\}$, where $x_i$ is the encoding of a certain query operator (like scan, sort, etc.) and $y_i$ is the corresponding cost, together with the learned cost estimation model $M$ trained on this set, the goal of feature reduction is to filter the useful dimensions of the input, which consists of one-hot codes and numerical values. Figure 4 shows a simple example of an operator encoding and a learned cost model (a DNN with ReLU, as also used in existing query cost estimators [13, 12]). The example input is a labeled scan operator, and the feature reduction task aims to obtain an effective feature subset of the four-dimensional input under the three-layer neural network model.
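A minimal PyTorch stand-in for the kind of cost model sketched in Figure 4, a small fully connected ReLU network over a low-dimensional operator encoding, is shown below; the layer widths and the example input are hypothetical.

```python
import torch
import torch.nn as nn

class OperatorCostModel(nn.Module):
    """Three-layer fully connected network with ReLU over a 4-dim operator encoding."""
    def __init__(self, input_dim: int = 4, hidden_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = OperatorCostModel()
x = torch.tensor([[1.0, 0.0, 80_000.0, 24.0]])   # hypothetical scan encoding
print(model(x).item())                           # untrained cost prediction
```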


Feature reduction is a classical task in machine learning [15], which is significant in improving the time-accuracy efficiency of ML models. A typical idea is to greedily select a subset of features from the original feature set so that the new feature set can be used to train the learned model with better performance. Considering that the combinatorial explosion of feature subsets brings exponential time complexity ($2^d$ subsets for $d$ features), in order to solve this problem within polynomial time, we ignore the correlations between features and consider an approximate q-error-based greedy algorithm to find an effective feature subset, as shown in Algorithm 2. In Lines 2-10, we reduce the useless features by iteratively dropping the feature whose removal yields the minimum q-error. The q-error-based approximate greedy algorithm reduces the useless features within polynomial time, requiring $O(d^2)$ model evaluations.
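A sketch of the q-error-based greedy elimination in Algorithm 2, assuming a caller-supplied train_and_eval function (hypothetical) that retrains the model on a given feature subset and returns its mean q-error on a validation set:

```python
def greedy_feature_reduction(features, train_and_eval):
    """features: list of feature indices; train_and_eval(subset) -> mean q-error.
    Iteratively drops the feature whose removal yields the smallest q-error and
    stops when no single removal keeps the error from increasing."""
    current = list(features)
    best_err = train_and_eval(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        candidate_errs = {
            f: train_and_eval([g for g in current if g != f]) for f in current
        }
        f_min = min(candidate_errs, key=candidate_errs.get)
        if candidate_errs[f_min] <= best_err:   # dropping f_min does not hurt accuracy
            current.remove(f_min)
            best_err = candidate_errs[f_min]
            improved = True
    return current
```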
Even though the above approximate greedy algorithm can complete feature reduction within polynomial time, ignoring the correlations between features may retain some potentially useless features. That is, due to the mutual influence between two features, removing either one alone may increase the error, while removing both at the same time reduces it.
To keep the time complexity polynomial while capturing the correlations between features, we consider a natural solution for filtering useless features: the gradient. The partial derivative with respect to each input dimension not only reflects the influence of that dimension but also incorporates the correlations among features [25]. Specifically, for all types of AI-driven models (like deep networks, convolutional networks, etc.), we could utilize the back-propagation algorithm to quickly gather the expected gradient of the output with respect to each dimension as its influence score. Taking the learned model of Figure 4(b) as an example, we could calculate the expected partial gradient of the output with respect to each input dimension by back-propagating through the layers via the chain rule.
Although the gradient method could represent the effectiveness of certain input dimensions, two characteristics of existing AI-driven database task models [16, 24, 17] render gradient-based reduction ineffective. On the one hand, there exists a large number of one-hot codes, which are discrete variables whose gradients are not meaningful. On the other hand, the activation function suffers from gradient vanishing, such as the ReLU in QPPNet [13]. For example, in Figure 4, when a ReLU unit is inactive, the gradients of the corresponding input dimensions are zero, which is useless for measuring feature importance.
$$\mathbb{E}_{r \in R}\!\left[\frac{\Delta f}{\Delta x_j}\right] = \mathbb{E}_{r \in R}\!\left[\sum_{k}\frac{\Delta f}{\Delta h_k}\cdot\frac{\Delta h_k}{\Delta x_j}\right], \quad \Delta f = f(x)-f(r),\ \ \Delta h_k = h_k(x)-h_k(r),\ \ \Delta x_j = x_j - r_j \qquad (1)$$
To handle the discrete variables and gradient vanishing, we consider the difference-propagation method to replace gradient propagation, which is also used in other fields [26]. Specifically, we sample some points as a reference basis (defined as the reference set $R$) and utilize the expected differences between the model outputs on the input and on the references as the importance score. For a single layer, the importance score of dimension $j$ is defined as $\mathbb{E}_{r \in R}\left[\frac{f(x)-f(r)}{x_j-r_j}\right]$. For the multiple-layer model shown in Figure 4, we propagate the differences layer by layer, yielding the difference-propagation equation shown in Equation 1.
Also, we observe that the difference-propagation method does not suffer from the discrete-variable and gradient-vanishing problems. Taking Figure 4 as an example, given a sampled reference set, the difference-importance of a dimension whose gradient vanishes under ReLU can still be non-zero, so it remains effective for feature reduction.
Hence, we utilize difference-propagation as the core of our feature reduction algorithm, shown in Algorithm 3. Specifically, Line 1 initializes the reference set by sampling points from the labeled operator set. For each dimension, Line 3 calculates its expected difference score by Equation 1. If the score is larger than zero, Line 4 adds this feature to the filtered feature set. Line 7 returns all the filtered useful features.
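The sketch below illustrates the idea in a simplified end-to-end form: a dimension's score is the expected change of the model output when that dimension is replaced by reference values, and dimensions with zero expected change are dropped. The full method propagates the differences layer by layer as in Equation 1 (e.g., via the SHAP library used in our implementation); the function names here are hypothetical.

```python
import torch

def difference_importance(model, X, references):
    """X: (n, d) inputs; references: (m, d) sampled reference points.
    Returns one score per dimension: the expected absolute change of the model
    output when that single dimension is replaced by its reference value.
    This is a simplified end-to-end form of the difference score; the full
    method propagates the differences layer by layer (Equation 1)."""
    model.eval()
    with torch.no_grad():
        base = model(X)                       # (n, 1)
        scores = []
        for j in range(X.shape[1]):
            diffs = []
            for r in references:
                X_j = X.clone()
                X_j[:, j] = r[j]              # replace dimension j by the reference value
                diffs.append((base - model(X_j)).abs().mean())
            scores.append(torch.stack(diffs).mean())
    return torch.stack(scores)                # (d,)

def reduce_features(model, X, references):
    scores = difference_importance(model, X, references)
    return [j for j, s in enumerate(scores) if s > 0]   # keep dimensions with non-zero influence
```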
Time Complexity: The difference-propagation algorithm reduces the useless features within polynomial time complexity $O(d \cdot T)$, where $d$ refers to the number of features and $T$ is the time to calculate the expectation differences for one dimension, which can also be accelerated by matrix multiplication [27].
Discussions: Our work can flexibly extend to dynamic workloads by designing a recall algorithm based on the inherent value of input features. In fact, besides their time-accuracy importance, the input features of the query cost model have some inherent value. That is, although some features are worthless for the current query evaluation task, they may still have the potential to affect the query cost. For example, to improve the time-accuracy efficiency of the query cost model under a write-only workload, our algorithm reduces the useless index feature, which is represented as a one-hot vector. However, when the workload changes (50% read, 50% write), some of the index features become effective for estimating the cost of read queries and thus have high potential value for dynamic query cost estimation. Therefore, in dynamic scenarios, researchers and users need to consider not only the time-accuracy efficiency of the cost model but also the inherent value of features to prevent missing important features.
V The Evaluation Of QCFE
In this section, we conduct extensive experiments on our QCFE. Firstly, in Section V-A, we introduce our experimental setup in detail. Then, we present comparisons with popular existing works on time-accuracy efficiency over extensive benchmarks in Section V-B. We introduce the ablation study to demonstrate the effects of QCFE under different design choices in Section V-C. We demonstrate the robustness of QCFE's parameters in Section V-D. Finally, we evaluate the transferability of our feature snapshot in Section V-E.
V-A Experimental setup
In this section, we introduce the foundation of our experiments, including the hardware settings, dataset settings, workload configurations, and evaluation metrics.
Hardware settings: To avoid competition for resources, we utilize different servers to collect the labeled query sets and to train the learned cost models. The data collection runs on PostgreSQL 14.4 with an Intel Core R7 7735HS, 16GB memory, and a 512GB hard disk. The model training is executed on an Intel Core i7-12700H with 42GB memory, a 2.5TB hard disk, and an NVIDIA GeForce RTX 3060 Laptop GPU.
Experimental Datasets: We utilize three open-source benchmarks to evaluate our methods: TPC-H (https://www.tpc.org/tpch/), Sysbench (https://github.com/akopytov/Sysbench/archive/1.0.zip), and IMDB (http://homepages.cwi.nl/~boncz/job/imdb.tgz), which are widely used in existing works [12, 17]. TPC-H is a practical commercial dataset consisting of eight tables and 22 query templates. In this paper, we employ a scale factor of 1 for our comparisons. IMDB is a complex and large movie review dataset consisting of 21 tables with size = 7GB. Sysbench is a synthetic dataset with a simple structure, and we set the table size to 5,000,000.
Workload Configurations: To verify the efficiency of our feature snapshot, we randomly generate 20 database configurations of Postgres 14.4. Then, for each benchmark, we construct the labeled queries (scale = 10,000) under all configurations. We set up different workload settings to unify the amount of labeled queries as follows. For TPCH, we randomly generate 880 queries in each configuration, so there are 17,600 queries in total. For Sysbench, we run oltp_read_only.lua with the default time setting of 10s to generate labeled queries, obtaining the labeled queries of the 20 knob configurations within 200s. For IMDB, we run the queries of job-light (https://github.com/andreaskipf/learnedcardinalities) in each configuration and collect 14,000 labeled queries. Finally, we randomly split the labeled data of every benchmark into five data sets of 2000, 4000, 6000, 8000, and 10000 queries to conduct comparative experiments. In all scales, we divide the labeled set into an 80% training set and a 20% test set.
Metrics: In this paper, we utilize extensive metrics to evaluate our methods, including the Q-error shown in Equation 2, the Pearson coefficient shown in Equation 3, the training time, and the inference time. Specifically, the Q-error clarifies the accuracy of query cost estimation, the Pearson coefficient quantifies the correlation between the true cost labels and the predicted labels, and the training and inference times reflect the time efficiency of the learned estimators.
$$\text{Q-error} = \max\left(\frac{cost_{pred}}{cost_{true}}, \frac{cost_{true}}{cost_{pred}}\right) \qquad (2)$$

$$r = \frac{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)\left(\hat{y}_i-\bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i-\bar{\hat{y}}\right)^2}} \qquad (3)$$

where $y_i$ is the true cost, $\hat{y}_i$ is the predicted cost, and $\bar{y}$, $\bar{\hat{y}}$ are their means.
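For reference, a small self-contained sketch computing both metrics with numpy (the numbers are invented):

```python
import numpy as np

def q_error(y_true, y_pred):
    """Per-query q-error: max(pred/true, true/pred) >= 1."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.maximum(y_pred / y_true, y_true / y_pred)

def pearson(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted costs."""
    return np.corrcoef(y_true, y_pred)[0, 1]

y_true = np.array([10.0, 50.0, 200.0])
y_pred = np.array([12.0, 45.0, 260.0])
print(q_error(y_true, y_pred).mean(), pearson(y_true, y_pred))
```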
Implementation: We evaluate our feature engineering method on two popular AI-driven estimators with different model designs: QPPNet [13], based on the open-source implementation [28], and MSCN [17]. QPPNet utilizes a plan-structured DNN for query cost estimation, while MSCN designs a single flattened DNN for query cardinality estimation. To extend MSCN to cost estimation (like End-to-End [12, 29]), we change the output of MSCN from cardinality to query cost and add the same fine-grained features (containing the cardinality) as QPPNet to MSCN. Further, we utilize the open-source SHAP library [30] (which can be used to calculate the difference propagation) to implement the feature reduction code. The source code of our experiments is available on GitHub (https://github.com/AvatarTwi/query_cost_feature_engineering).
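As an illustration of how the SHAP library [30] can supply the difference-propagation scores, the sketch below attributes a toy PyTorch model's predictions to its input dimensions with shap.DeepExplainer; the model, data, and thresholding are placeholders rather than our exact implementation.

```python
import numpy as np
import shap
import torch
import torch.nn as nn

# Toy stand-ins for a trained cost model and its operator encodings.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
X = torch.rand(2000, 8)

background = X[torch.randperm(len(X))[:300]]        # sampled reference set
explainer = shap.DeepExplainer(model, background)   # difference-propagation style attributions
shap_values = explainer.shap_values(X[:500])

vals = np.asarray(shap_values).reshape(-1, X.shape[1])   # robust to (n,d) or (n,d,1) outputs
importance = np.abs(vals).mean(axis=0)
kept_dims = np.where(importance > 0)[0]             # feature dimensions retained after reduction
print(kept_dims)
```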
V-B The Effectiveness of our feature engineering
In this section, we evaluate the time-accuracy efficiency of our feature engineering for five methods: PostgreSQL, QPPNet, MSCN, QCFE(qpp), and QCFE(mscn). We integrate our QCFE into the operators of QPPNet (called QCFE(qpp)) and MSCN (called QCFE(mscn)). Due to the different complexity of the benchmarks, we set the training iterations to 800 for job-light, 400 for TPCH, and 100 for Sysbench. Specifically, Table IV shows the time-accuracy efficiency on extensive benchmarks with scale = 2000, 4000, 6000, 8000, 10000, including the Pearson coefficient, the mean q-error, and the training time of all baselines.
| Dataset | Model | 2000 | | | 4000 | | | 6000 | | | 8000 | | | 10000 | | |
| | | pearson | mean | time | pearson | mean | time | pearson | mean | time | pearson | mean | time | pearson | mean | time |
| TPCH | PGSQL | 0.704 | 819.241 | 1.194 | 0.741 | 882.319 | 2.394 | 0.746 | 790.83 | 3.506 | 0.663 | 1393.472 | 4.734 | 0.632 | 1179.219 | 5.896 |
| | QCFE(mscn) | 0.986 | 1.094 | 8.547 | 0.986 | 1.109 | 25.635 | 0.998 | 1.106 | 27.984 | 0.997 | 1.086 | 31.368 | 0.997 | 1.11 | 41.399 |
| | QCFE(qpp) | 0.985 | 1.072 | 424.322 | 0.985 | 1.089 | 399.598 | 0.985 | 1.101 | 391.901 | 0.979 | 1.108 | 433.513 | 0.969 | 1.096 | 417.469 |
| | MSCN | 0.983 | 1.105 | 9.608 | 0.979 | 1.13 | 27.995 | 0.988 | 1.126 | 40.615 | 0.988 | 1.105 | 33.314 | 0.987 | 1.134 | 43.437 |
| | QPPNet | 0.985 | 1.107 | 369.467 | 0.986 | 1.136 | 345.541 | 0.984 | 1.111 | 361.47 | 0.963 | 1.129 | 354.904 | 0.966 | 1.128 | 365.44 |
| Sysbench | PGSQL | 0.224 | 169509.592 | 0.006 | 0.246 | 175054.294 | 0.011 | 0.287 | 185715.657 | 0.016 | 0.269 | 978066.774 | 0.022 | 0.283 | 938706.491 | 0.027 |
| | QCFE(mscn) | 0.792 | 1.528 | 8.116 | 0.748 | 1.521 | 15.531 | 0.818 | 1.484 | 32.226 | 0.776 | 1.542 | 29.579 | 0.721 | 1.57 | 41.045 |
| | QCFE(qpp) | 0.824 | 1.72 | 5.52 | 0.682 | 1.68 | 4.965 | 0.715 | 2.464 | 5.329 | 0.857 | 1.868 | 4.409 | 0.787 | 2.01 | 4.808 |
| | MSCN | 0.698 | 1.734 | 7.183 | 0.709 | 1.738 | 14.713 | 0.796 | 1.659 | 21.366 | 0.606 | 1.804 | 28.616 | 0.648 | 1.785 | 35.351 |
| | QPPNet | 0.524 | 10.402 | 9.321 | 0.465 | 8.492 | 9.997 | 0.488 | 8.891 | 9.463 | 0.616 | 33.596 | 9.459 | 0.633 | 32.644 | 10.282 |
| job-light | PGSQL | 0.447 | 150.103 | 0.009 | 0.394 | 171.211 | 0.017 | 0.396 | 137.072 | 0.033 | 0.367 | 153.256 | 0.042 | 0.376 | 148.1 | 0.048 |
| | QCFE(mscn) | 0.985 | 1.083 | 7.029 | 0.996 | 1.066 | 18.709 | 0.996 | 1.056 | 25.3 | 0.998 | 1.046 | 31.915 | 0.998 | 1.046 | 45.077 |
| | QCFE(qpp) | 0.994 | 1.18 | 229.067 | 0.996 | 1.127 | 243.729 | 0.997 | 1.162 | 230.444 | 0.995 | 1.148 | 241.921 | 0.996 | 1.243 | 255.174 |
| | MSCN | 0.99 | 1.086 | 7.229 | 0.994 | 1.074 | 33.012 | 0.993 | 1.074 | 33.012 | 0.995 | 1.069 | 37.713 | 0.994 | 1.07 | 68.241 |
| | QPPNet | 0.993 | 1.2013 | 353.156 | 0.989 | 1.423 | 361.81 | 0.993 | 1.248 | 333.015 | 0.985 | 1.445 | 337.923 | 0.992 | 1.261 | 527.086 |
The time-accuracy efficiency on TPCH: In Table IV, compared to MSCN, we observe that our QCFE(mscn) improves the mean q-error from 1.094 to 1.086 and the Pearson coefficient from 0.985 to 0.993 on average. Compared to PostgreSQL, QCFE(mscn) improves the mean q-error from 1,012.6 to 1.086 and the Pearson coefficient from 0.697 to 0.993 on average. This clarifies that our QCFE effectively constructs useful features for AI-driven query cost models. At the same time, our QCFE(mscn) reduces model training time by 12.4% on average, which demonstrates that QCFE effectively removes useless features and improves training speed. In particular, at scale = 10000, QCFE(mscn) improves the q-error from 1.134 to 1.11 and the Pearson coefficient from 0.987 to 0.997. Similarly, QCFE(qpp) also reaches higher time-accuracy efficiency than QPPNet on all scales.
The time-accuracy efficiency on job-light: Compared to TPCH, job-light has more complex table relationships, and we observe that the average mean q-error is larger than on TPCH. Specifically, compared to QPPNet, our QCFE(qpp) improves the mean q-error from 1.316 to 1.172 and the Pearson coefficient from 0.990 to 0.996 on average. Compared to PostgreSQL, QCFE(qpp) improves the mean q-error from 151.8 to 1.172 and the Pearson coefficient from 0.396 to 0.996 on average. This clarifies that our QCFE effectively constructs useful features for AI-driven query cost models. At the same time, our QCFE(qpp) reduces model training time by 37.3% on average, demonstrating that QCFE effectively removes useless features and improves training speed. In particular, at scale = 8000, QCFE(qpp) improves the q-error from 1.445 to 1.148 and the Pearson coefficient from 0.985 to 0.995. Similarly, QCFE(mscn) also reaches higher time-accuracy efficiency than MSCN on all scales.
The time-accuracy efficiency on Sysbench: Compared to the other benchmarks, Sysbench only has a single table and its query templates have simple structures, which makes the relationship between query features and query cost harder to learn; for example, a query with a real time cost of 0.0001 and a predicted cost of 0.001 already yields a q-error of 10. We observe that PostgreSQL, MSCN, and QPPNet all have noticeably worse average mean q-error and average Pearson coefficient on this benchmark. Although Sysbench is difficult to predict, our QCFE(mscn) and QCFE(qpp) still reach a higher Pearson coefficient and lower q-error than their base models on average. Also, compared to QPPNet, our QCFE(qpp) reduces model training time by 16.7% on average. In particular, at scale = 8000 for QCFE(qpp), our method improves the q-error from 1.804 to 1.542 and the Pearson coefficient from 0.616 to 0.857.
The Variance of Q-error: Moreover, we present the q-error variance on the different benchmarks at scale = 2000, 4000, 6000, 8000, 10000 in Figure 5. Overall, our method effectively reduces the variance of the q-error on extensive benchmarks, making the predictions more stable and accurate. Compared to QPPNet, we observe that our QCFE(qpp) reduces the 50th-percentile q-error to 57.143% (from 1.084 to 1.048), 0.032% (from 9.16 to 1.308), and 50.299% (from 1.167 to 1.084) under TPCH, Sysbench, and job-light, respectively. For example, on TPCH with scale = 2000, our QCFE(qpp) reduces the 90th-percentile q-error from 1.3 to 1.2 compared to QPPNet, and the other benchmarks show similar results. Furthermore, we observe that as the amount of data increases, both QPPNet and MSCN show deteriorating prediction accuracy and increasing error variance, while our method maintains higher accuracy and lower variance under various data amounts. Hence, our method effectively constructs useful features for learned estimators and reduces the variance of the q-error.

V-C Ablation Study
In this experiment, we present the results of our ablation study to show the effects of QCFE under different design choices, including FSO (the feature snapshot calculated from the original queries), FST (the feature snapshot calculated from the template queries), FSO + FR, FSO + GD (gradient), and FSO + Greedy. We focus on the effectiveness of the different feature snapshots and the various feature reduction methods.

Figure 6 shows the results of our ablation study of the QPPNet model on extensive benchmarks with scale = 4000. We observe that FST reaches mean q-errors [1.109, 1.781, 1.222] similar to FSO [1.098, 1.715, 1.180], which demonstrates that our standard templates reflect the characteristics of the original queries. Moreover, we observe that our difference-importance method outperforms Greedy and GD on both TPCH and job-light. This is because FR effectively reduces the useless features and improves the accuracy of the learned estimators, while GD may suffer from the discrete one-hot codes and gradient vanishing, and Greedy only removes one dimension per iteration and therefore may miss the correlations between features.

Moreover, we present the feature reduction results for TPCH in Figure 7. We observe that Greedy reduces above 1.20% of the features, while GD and FR both reduce above 41.22% on average. Greedy retains the majority of features because it cannot capture the correlations between features, such as pairs of jointly useless features. Specifically, our FR reduces up to 57 features in the index scan operator while Greedy only reduces 2 features, which indicates that the index scan features contain many correlated useless dimensions. Moreover, we find that GD can also reduce a large number of features; in particular, GD reduces 101 features of the sort operator, because it also captures the correlations between features. However, due to the one-hot codes and gradient vanishing, its reduction may produce wrong importance scores: the 50th-percentile q-error of GD in Figure 6 only reaches 1.44, while FR reaches 1.24.
V-D The robustness of parameters
In this section, we demonstrate the robustness of two important parameters of QCFE: the scale of the templates and the number of references.
| TYPE | FSO | | 1 (2) | | 2 (4) | | 3 (6) | | 4 (8) | |
| | mean | time | mean | time | mean | time | mean | time | mean | time |
| TPCH | 1.098 | 7.7h | 1.106 | 0.8h | 1.110 | 1.9h | 1.108 | 2.3h | 1.096 | 3.8h |
| job-light | 1.18 | 31.8h | 1.262 | 0.9h | 1.222 | 1.7h | 1.210 | 2.6h | 1.187 | 3.5h |
As shown in Table V, we observe the effects of different template scales (the scale values in parentheses are for job-light). Overall, we conclude that our simplified SQL templates reflect the characteristics of the original queries, and the q-error is relatively robust to changes of the scale. For TPCH, our FS generates 123 SQL templates, and we set scale = 1, 2, 3, 4. Calculating the feature snapshot costs 7.7h for labeling queries with the original templates of TPCH, while our simplified templates only cost 3.8h and reach a q-error (1.096) competitive with FSO. For job-light, FS generates only 19 SQL templates due to its light join structure, so we set scale = 2, 4, 6, 8 to generate more queries. Because the multi-way joins in job-light are time-consuming, our simplified templates reduce the query collection time to 3.5h/31.8h ≈ 11% and also reach a competitive q-error.
| reference number | mean q-error | 95th q-error | 90th q-error | runtime/s | reduction ratio |
200 | 1.107 | 1.39 | 1.224 | 267.788 | 40.036% |
250 | 1.09 | 1.305 | 1.212 | 267.788 | 39.857% |
300 | 1.095 | 1.262 | 1.212 | 349.73 | 40.214% |
400 | 1.09 | 1.27 | 1.196 | 591.671 | 40.125% |
500 | 1.076 | 1.215 | 1.156 | 911.671 | 39.767% |
Table VI shows the robustness with respect to the number of references on TPCH with QCFE(qpp), including the mean q-error, the 95th q-error, the 90th q-error, the runtime of FR, and the reduction ratio. We observe that the q-error is only slightly improved as the reference number increases, which clarifies that a limited number of references already reflects the importance of the input features. The runtime increases roughly linearly with the parameter, from 267s to 911s, demonstrating that the cost of our feature reduction grows linearly with the number of references. Further, the reduction ratio (about 40%) is also robust to the change of the reference number.
V-E The transferability of feature snapshot
In this section, we introduce the transferability of our feature snapshot. Specifically, we utilize the trained cost model (scale = 10000 in TPCH and job-light) of the above hardware settings as the transferable basis. Then, in the same TPCH and job-light settings, we collect 2000 labeled queries as the training set and 500 labeled queries as the test set in a new hardware setting (Intel Core i7-12700H with 42GB memory, 2.5TB hard disk), called h2. To clarify the transferability of our feature snapshot, we calculate the feature snapshot with the 2000 labeled original queries (FSO) and the template queries (FST scale = 4 in TPCH and 4 in job-light).
Table VII shows the Pearson coefficient, the mean q-error, and the training time of the cost model directly trained with the 2000 labeled queries (basis), the basis model + FSO with 200 retraining iterations, and the basis model + FST with 200 retraining iterations. We find that, only by replacing the feature snapshot and performing a little retraining, our transferred model reaches accuracy similar to the model trained from scratch on the 2000 labeled queries. In particular, we observe that FST outperforms FSO in mean q-error under TPCH. This is because TPCH has more diverse operators (123 templates), so it is difficult to calculate a representative feature snapshot from only 2000 labeled queries, while our FST covers more operator characteristics than FSO with a limited labeled set.
| Model | Metric | TPCH | job-light |
| basis | pearson | 0.983 | 0.995 |
| | mean | 1.088 | 1.195 |
| | time | 381.157 | 232.519 |
| trans-FSO | pearson | 0.981 | 0.997 |
| | mean | 1.112 | 1.246 |
| | time | 114.455 | 65.539 |
| trans-FST | pearson | 0.982 | 0.99 |
| | mean | 1.083 | 1.278 |
| | time | 121.093 | 73.246 |
Moreover, we demonstrate the convergence speed of the directly trained model and the transferred model in Figure 8. We observe that the transferred model reaches accuracy similar to the directly trained model with only 25% of the training time. This demonstrates that our feature snapshot achieves hardware transferability and effectively reduces the training time in a new database environment.

VI Related Works
In this section, we introduce the related works from two aspects: the cost estimation methods and the feature processing methods.
Cost Estimation: In the early days, in order to improve the efficiency of query optimization, researchers proposed statistical methods [31, 20, 32] to estimate query performance. For example, Li et al. [31] proposed operator-based statistical techniques to estimate query execution time; this model simulates operator costs by modeling resource usage as operator-level behavior and designing a statistical learning model. Modeling workload performance as a random variable, Wu et al. [20] proposed a model that provides uncertainty information for query execution time prediction; it expresses the cost of a query as a function of the selectivity of the operators in the query plan and of factors that describe the CPU and I/O costs of the system. Recently, the database community has attempted to utilize deep learning models [33, 34] to improve the accuracy of query performance prediction. For example, Sun et al. [6] proposed an end-to-end framework for learned cost estimation, TPool, which uses a tree-structured model to approximate query costs, together with effective feature extraction and encoding techniques to improve the learning ability of the model. Marcus et al. [13] proposed QPPNet, a deep neural network based on the query plan tree structure, to estimate the query cost; the model automatically discovers the characteristics of operators and query plans, avoiding manual feature engineering. Akdere et al. [35] proposed predictive modeling techniques to learn query execution behavior at different granularities, from coarse-grained plan-level models to fine-grained operator-level models, which balances accuracy and transferability well. Moreover, zero-shot [3] focused on query encoding issues and proposed a feature transferable across databases as the basis of zero-shot estimation.
Feature Engineering for Cost Estimation: Feature engineering is a typical task for improving the efficiency of deep learning, and many feature engineering methods [36, 37] exist in other fields, like word2vec [38] in natural language processing. For query cost estimation, there exist only two related works: the transferable query encoding [3] and query representation learning [10]. (1) Zero-shot [3] focused on the poor transferability of query encodings and converted the original one-hot encoding of attributes into an encoding of attribute types; this encoding supports a feature transferable across databases as the basis of zero-shot estimation. (2) QueryFormer [10] proposed a tree-transformer model to learn the tree structure of queries, with the goal of translating query execution plans into vectorized representations. Different from the above two works, our QCFE aims to find the effective features for query cost estimation, which improves the time-accuracy efficiency of query cost estimation. Moreover, our method can be easily combined with the above two methods and enhance their efficiency.
VII Conclusion
In this paper, we propose QCFE to obtain effective features for query cost estimation. On the one hand, we design an estimated feature snapshot to integrate the otherwise ignored variables, containing database knobs, hardware, etc. The experimental comparisons clarify that our approach efficiently captures the influence of the ignored variables. On the other hand, we design a difference-propagation feature reduction method to filter the useless features and improve the time-accuracy efficiency of query cost estimation. In the future, we consider optimizing QCFE from two aspects: adapting our feature reduction algorithm to dynamic workloads by designing a recall mechanism, and enhancing transfer learning for query cost estimation by identifying which features to transfer.
References
- [1] Q. Zhou, J. Arulraj, S. Navathe, W. Harris, and J. Wu, “Sia: Optimizing queries using learned predicates,” in Proceedings of the 2021 International Conference on Management of Data, pp. 2169–2181, 2021.
- [2] H. Lan, Z. Bao, and Y. Peng, “An index advisor using deep reinforcement learning,” (New York, NY, USA), Association for Computing Machinery, 2020.
- [3] B. Hilprecht and C. Binnig, “Zero-shot cost models for out-of-the-box learned cost prediction,” arXiv preprint arXiv:2201.00561, 2022.
- [4] Z. Yang, E. Liang, A. Kamsetty, C. Wu, Y. Duan, X. Chen, P. Abbeel, J. M. Hellerstein, S. Krishnan, and I. Stoica, “Deep unsupervised cardinality estimation,” vol. 13, no. 3, 2019.
- [5] Z. Yang, A. Kamsetty, S. Luan, E. Liang, Y. Duan, X. Chen, and I. Stoica, “Neurocard: One cardinality estimator for all tables,” vol. 14, no. 1, 2020.
- [6] J. Sun and G. Li, “An end-to-end learning-based cost estimator,” 2021.
- [7] K. Chowdhary and K. Chowdhary, “Natural language processing,” Fundamentals of artificial intelligence, pp. 603–649, 2020.
- [8] S. Wu, M. Zhang, G. Chen, and K. Chen, “A new approach to compute cnns for extremely large images,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 39–48, 2017.
- [9] T. Vu, C. Van Nguyen, T. X. Pham, T. M. Luu, and C. D. Yoo, “Fast and efficient image quality enhancement via desubpixel convolutional neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, September 2018.
- [10] Y. Zhao, G. Cong, J. Shi, and C. Miao, “Queryformer: a tree transformer model for query plan representation,” Proceedings of the VLDB Endowment, vol. 15, no. 8, pp. 1658–1670, 2022.
- [11] Y. E. Ioannidis, “Query optimization,” ACM Computing Surveys (CSUR), vol. 28, no. 1, pp. 121–123, 1996.
- [12] J. Sun and G. Li, “An end-to-end learning-based cost estimator,” arXiv preprint arXiv:1906.02560, 2019.
- [13] R. Marcus and O. Papaemmanouil, “Plan-structured deep neural network models for query performance prediction,” arXiv preprint arXiv:1902.00132, 2019.
- [14] D. Zhang, J. Yin, X. Zhu, and C. Zhang, “Network representation learning: A survey,” IEEE transactions on Big Data, vol. 6, no. 1, pp. 3–28, 2018.
- [15] G. Dong and H. Liu, Feature engineering for machine learning and data analytics. CRC press, 2018.
- [16] H. Yuan, G. Li, L. Feng, J. Sun, and Y. Han, “Automatic view generation with deep learning and reinforcement learning,” in 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1501–1512, 2020.
- [17] A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper, “Learned cardinalities: Estimating correlated joins with deep learning. https://github.com/andreaskipf/learnedcardinalities,” in CIDR, 2019.
- [18] R. Rojas and R. Rojas, “The backpropagation algorithm,” Neural networks: a systematic introduction, pp. 149–182, 1996.
- [19] W. Du, R. Krishnamurthy, and M.-C. Shan, “Query optimization in a heterogeneous dbms,” in VLDB, vol. 92, pp. 277–291, 1992.
- [20] W. Wu, W. Xi, H. Hacigümüş, and J. F. Naughton, “Uncertainty aware query execution time prediction,” Proceedings of the VLDB Endowment, vol. 7, no. 14, pp. 1857–1868, 2014.
- [21] Improving PostgreSQL Cost Model: https://dsl.cds.iisc.ac.in/publications/thesis/pankhuri.pdf.
- [22] Å. Björck, “Least squares methods,” Handbook of numerical analysis, vol. 1, pp. 465–652, 1990.
- [23] B. Ding, S. Das, R. Marcus, W. Wu, S. Chaudhuri, and V. R. Narasayya, “Ai meets ai: Leveraging query executions to improve index recommendations,” SIGMOD ’19, Association for Computing Machinery, 2019.
- [24] R. Marcus, P. Negi, H. Mao, N. Tatbul, M. Alizadeh, and T. Kraska, “Bao: Learning to steer query optimizers,” arXiv preprint arXiv:2004.03814, 2020.
- [25] C. J. Ter Braak and I. C. Prentice, “A theory of gradient analysis,” Advances in ecological research, vol. 34, pp. 235–282, 2004.
- [26] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” in International conference on machine learning, pp. 3145–3153, PMLR, 2017.
- [27] D. Coppersmith and S. Winograd, “Matrix multiplication via arithmetic progressions,” in Proceedings of the nineteenth annual ACM symposium on Theory of computing, pp. 1–6, 1987.
- [28] Github repository: QPPNet in PyTorch. https://github.com/rabbit721/QPPNet.
- [29] S. Liu, X. Chen, Y. Zhao, J. Chen, R. Zhou, and K. Zheng, “Efficient learning with pseudo labels for query cost estimation,” CIKM ’22, (New York, NY, USA), p. 1309–1318, Association for Computing Machinery, 2022.
- [30] Github repository: SHAP. https://github.com/shap/shap.
- [31] J. Li, A. König, V. Narasayya, and S. Chaudhuri, “Robust estimation of resource consumption for sql queries using statistical techniques,” Proceedings of the VLDB Endowment, vol. 5, no. 11, pp. 1555–1566, 2012.
- [32] W. Wu, Y. Chi, S. Zhu, J. Tatemura, H. Hacigümüs, and J. F. Naughton, “Predicting query execution time: Are optimizer cost models really unusable?,” in 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 1081–1092, IEEE, 2013.
- [33] N. Sheoran, S. Mitra, V. Porwal, S. Ghetia, J. Varshney, T. Mai, A. Rao, and V. Maddukuri, “Conditional generative model based predicate-aware query approximation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 8259–8266, 2022.
- [34] X. Yu, C. Chai, G. Li, and J. Liu, “Cost-based or learning-based? a hybrid query optimizer for query plan selection,” Proceedings of the VLDB Endowment, vol. 15, no. 13, pp. 3924–3936, 2022.
- [35] M. Akdere, U. Çetintemel, M. Riondato, E. Upfal, and S. B. Zdonik, “Learning-based query performance modeling and prediction,” in 2012 IEEE 28th International Conference on Data Engineering, pp. 390–401, IEEE, 2012.
- [36] F. Nargesian, H. Samulowitz, U. Khurana, E. B. Khalil, and D. S. Turaga, “Learning feature engineering for classification.,” in Ijcai, vol. 17, pp. 2529–2535, 2017.
- [37] S. Scott and S. Matwin, “Feature engineering for text classification,” in ICML, vol. 99, pp. 379–388, 1999.
- [38] K. W. Church, “Word2vec,” Natural Language Engineering, vol. 23, no. 1, pp. 155–162, 2017.