QCFE: An Efficient Feature Engineering for Query Cost Estimation
Abstract
Query cost estimation is a classical task for database management. Recently, researchers have applied AI-driven models to query cost estimation to achieve high accuracy. However, two defects in feature design lead to poor time-accuracy efficiency. On the one hand, existing works only encode the query plan and data statistics while ignoring other important variables, like storage structure, hardware, database knobs, etc. These variables also have a significant impact on the query cost. On the other hand, due to the straightforward encoding design, existing works suffer a heavy representation learning burden on ineffective dimensions of the input. To address these two problems, we propose an efficient feature engineering for query cost estimation, called QCFE. Specifically, we design a novel feature, called the feature snapshot, to efficiently integrate the influences of the ignored variables. Further, we propose a difference-propagation feature reduction method for query cost estimation to filter the useless features. The experimental results demonstrate that QCFE largely improves the time-accuracy efficiency on extensive benchmarks.
I Introduction
Cost estimation plays a pivotal role in database management, forming the bedrock for database optimization strategies encompassing query optimization [1], index optimization [2], and storage efficiency, among other aspects. The precision of cost estimation methodologies stands as a linchpin for achieving optimal performance in database operations. Regrettably, conventional techniques reliant on cost equations may produce huge estimation errors under complex workloads due to their simplistic frameworks and underlying assumptions [3]. This shortcoming can engender sub-optimal optimization outcomes, consequently undermining database performance.
Hence, the database community has delved into the application of neural network models to capture the intricate correlation between queries and their associated costs, harnessing the formidable learning prowess inherent in these deep networks. Extensive experimental results [4, 5, 6, 3] have demonstrated that the learning approaches achieve high accuracy across various complex benchmarks.
However, the utilization of neural networks for databases poses an efficiency-accuracy dilemma. On the one hand, the query cost is related to multiple heterogeneous features (such as relation tables, queries, etc.) with different structures, compared to other fields such as natural language processing [7] and image recognition [8, 9]. To accurately represent and fit the query plan-performance relationship, it is necessary to design complex network models. For instance, the transformer query cost model [10] has shown superior performance compared to simpler models, primarily due to its deeper network layers and attention mechanism that assigns weights to features. Notably, such network structures require more training and inference time than ordinary deep neural network (DNN) models.
On the other hand, database cost estimation is a frequently invoked component, which may be invoked multiple times within a single query optimization [11]. For example, even though PostgreSQL utilizes genetic algorithms to reduce the number of estimation requests, the cost estimation component is still called more than $n$ times, where $n$ is the number of nodes in the query plan. Additionally, large database management systems receive thousands of query requests every minute. Consequently, it is not feasible to allocate excessive time to the fundamental cost estimation component.

Given the limitations of designing complex models, a natural approach to the efficiency-accuracy dilemma is to optimize the features so as to reduce the burden of representation learning and fitting in the model. Existing AI-driven cost estimation methods utilize relatively straightforward approaches to processing input query features. Typically, the one-hot encoding for tables [12], the one-hot encoding for indexes [13], and the vectors of numerical values are directly fed into the evaluation model along a bottom-up traversal of the query plan. Overall, we have identified two shortcomings regarding feature design in existing AI-driven query cost estimators:
(1) Missing Important Features: Current methods [3] primarily focus on encoding the query plan and the table statistics, often overlooking the impact of other database variables on query cost. However, variables such as the storage format of data (e.g., B+ tree or LSM tree) and the hardware of the database also play a significant role in determining the query cost. Our investigation, as depicted in Figure 1, demonstrates substantial differences (2 times in TPCH and 3 times in Sysbench) in the average execution time of the same queries under different database environments (five database knob configurations). Therefore, neglecting the database environment can result in significant losses when predicting query cost.
(2) Heavy Representation Learning: Existing methods directly utilize the table feature, index feature, operator feature, etc. as the input of the AI cost model. This places a large burden on representation learning [14], which is used to learn an effective representation of the input features. Specifically, when the goal is to keep the learned model simple (to accelerate inference), capturing the relationships between such a large number of features and the query cost becomes difficult, because the intricate logical relationships among multiple features require multiple nonlinear transformations to capture effectively.
These two problems appear to be contradictory. The absence of crucial features primarily results from an incomplete modeling of the query cost estimation problem, necessitating the incorporation of additional features. The heavy representation learning stems from the ineffective elements of the encoding, necessitating the removal of some features. Nevertheless, when viewed collectively, both issues can be categorized as feature engineering challenges, implying that feature engineering for query cost estimation has not yet been handled optimally.
To solve the above problems, we design an effective feature engineering for query cost estimation, called QCFE. The core insights are as follows: (1) To avoid missing important variables, we define a novel concept, called the feature snapshot (FS), to integrate the characteristics of the ignored variables (defined as the variable set of database knobs, storage structure, hardware, and operating system). To the best of our knowledge, no one has attempted to encode the ignored variables for the query cost model. One possible reason is that the resource required to build an exact feature representation is tantamount to building the database environment itself. Hence, we propose an estimation method to obtain the feature snapshot, ensuring high efficiency.
(2) For the heavy representation learning, we design a difference-propagation feature reduction (FR) method to relieve the learning burden by pruning the ineffective features. Depending on the relational tables and load type, certain features may not be effective. For instance, the plan-based method encodes the index as a vector whose length equals the number of columns. However, in pure write scenarios, the database management system may not create an index at all, leaving this column-length index vector ineffective in the query feature. These ineffective features not only increase the training and inference cost of the AI evaluation model but also reduce its accuracy [15].
In summary, our specific contributions are as follows:
• In order to improve the time-accuracy efficiency, we propose a feature engineering for query cost estimation, called QCFE.
• We first propose the feature snapshot concept (Section III) for query cost estimation, integrating the influence of the ignored variables. Our core goal is to make reasonable assumptions so that the feature snapshot can be calculated with high time efficiency.
• We design the difference-propagation feature reduction method (Section IV) to efficiently reduce the useless features, further improving model training and inference efficiency.
• To clarify the effectiveness of our QCFE, we demonstrate various comparisons (Section V) under extensive popular benchmarks (TPC-H, job-light, and Sysbench), including the evaluation of time-accuracy efficiency, the ablation of QCFE, the robustness of QCFE, etc.
II Overview
In this section, we overview the architecture and workflow of our QCFE.


Firstly, we show the general feature engineering widely used in existing works [16, 12, 17] to clarify the effectiveness of our QCFE. As shown in Figure 2, the general FE directly encodes the query plan and table statistics as the training set for AI-driven models, which ignores other influential factors (like hardware) and may waste computing resources on ineffective input codes.
In contrast, our QCFE considers all the factors of the database environment, containing the table statistics, queries, hardware, database knobs, etc. In particular, we design an efficient feature snapshot to capture the influence of the ignored variables in Section III. Then, we propose the difference-propagation feature reduction algorithm (a variant of the back-propagation algorithm [18]) to filter the useless dimensions and further improve time-accuracy efficiency in Section IV.
III The Feature Snapshot
Although the ignored variables also have significant impacts on the query cost, making exact representations of these variables as a feature snapshot involves large resource overhead. In this section, we introduce the feature snapshot design in Section III-A. Further, to improve the time efficiency of calculating the feature snapshot, we design standard simplified SQL templates to replace the original templates in Section III-B.
III-A Estimated Feature Snapshot
Firstly, we clarify why we design an estimated FS to represent the ignored variables instead of an exact feature, for the following two reasons.
(1) Partial influence: Essentially, these ignored variables affect the query cost by affecting the I/O cost and CPU cost of the operators. Only some components of these variables have an impact on the query cost. For example, the audio peripherals in the hardware will not affect query execution efficiency. Hence, we only need to consider the I/O- and CPU-related components.
(2) High resource overhead: The ignored variables have complex structures, which are costly to encode directly as an exact feature. For example, the hardware consists of multiple components, such as memory, disk, CPU, etc. Exactly representing the various components of the ignored variables would produce enormous feature snapshots, leading to a large space cost.
Due to the partial influence and high resource overhead, we construct an estimated feature snapshot representation for the ignored variables. The specific considerations are as follows:
(1) We first identify how the ignored variables affect the query cost. Specifically, we analyze a basic physical cost formula of PostgreSQL: $Cost = c_s \cdot n_s + c_r \cdot n_r + c_t \cdot n_t + c_i \cdot n_i + c_o \cdot n_o$, where $c_s$ is the I/O cost to sequentially access a page, $c_r$ is the I/O cost to randomly access a page, $c_t$ is the CPU cost to process a tuple, $c_i$ is the CPU cost to process a tuple via indexes, $c_o$ is the CPU cost to process an operator (like sort), and $n_s, n_r, n_t, n_i, n_o$ are the corresponding operation counts. We can conclude two kinds of metrics from the above formula: the cost coefficients ($c$) and the cost numbers ($n$). All the ignored variables influence the query cost by influencing $c$ and $n$. In general, the query plan and data statistics have the main impact on $n$, while the ignored variables mainly influence $c$, the cost of a single I/O or CPU request. For example, the disk type only influences the I/O speed for a given relation and query, and the enable_indexscan knob mainly influences the cost coefficients of a single request. Our first estimating assumption is that the ignored variables only influence $c$.
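To make the decomposition concrete, the minimal sketch below evaluates this cost formula with PostgreSQL's documented default coefficient values; the per-operation counts are hypothetical and would normally be derived from the query plan and data statistics.

```python
# A minimal sketch of the PostgreSQL-style cost decomposition described above.
# The coefficients are PostgreSQL's documented defaults; the counts are hypothetical.

COST_COEFFICIENTS = {               # cost coefficients c, shaped by the ignored variables
    "seq_page_cost": 1.0,           # I/O cost to sequentially access a page
    "random_page_cost": 4.0,        # I/O cost to randomly access a page
    "cpu_tuple_cost": 0.01,         # CPU cost to process a tuple
    "cpu_index_tuple_cost": 0.005,  # CPU cost to process a tuple via an index
    "cpu_operator_cost": 0.0025,    # CPU cost to process an operator (e.g., sort)
}

def physical_cost(counts: dict, coeffs: dict = COST_COEFFICIENTS) -> float:
    """Cost = sum of coefficient * count over all cost components."""
    return sum(coeffs[name] * counts.get(name, 0) for name in coeffs)

# Hypothetical cost numbers n produced by the query plan and data statistics.
counts = {"seq_page_cost": 1200, "cpu_tuple_cost": 150_000}
print(physical_cost(counts))        # 1200*1.0 + 150000*0.01 = 2700.0
```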
| Cost Formula | Operators |
| $a \cdot N + b$ | Seq Scan, Materialize, Aggregation |
| $a \cdot N_1 + b \cdot N_2 + c$ | Index Scan, Merge Join, Hash Join |
| $a \cdot N \log N + b$ | Sort |
| $a \cdot N_1 \cdot N_2 + b$ | Nested Loop |
(2) We present how to estimate the cost coefficients. For simplicity and generality, we leverage the notion of logical cost functions [19] to estimate them instead of the cost formula of a certain DBMS. Table I shows the assumed logical cost formulas for different operators. These formulas are determined by the logical execution of the operators. For example, $a \cdot N + b$ could be the logical cost formula for seq scan [20], where $N$ is the cardinality of the operator. Based on these assumed logical formulas, we can estimate the FS for each operator instead of building an infeasible exact representation, thereby representing the influences of the ignored variables. For example, the fitted coefficients $(a, b)$ could be the feature snapshot of the seq scan operator. These assumed logical formulas could also be optimized for a specific DBMS (such as the revised cost formula for Postgres [21]), improving FS estimation accuracy.
Specifically, according to the logical formulas, we utilize a regression model based on the least squares method [22] to calculate the FS for each operator. The regression model is trained on a labeled operator set, which is collected by executing multiple queries. When the query structure is complex, collecting the labeled set may be costly in time.
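As an illustration of this regression step, the sketch below fits the assumed linear seq scan formula $a \cdot N + b$ to a small labeled set of (cardinality, observed cost) pairs with ordinary least squares; the numbers are invented.

```python
import numpy as np

# Hypothetical labeled set for the seq scan operator under one database
# environment: (cardinality N, observed execution cost in ms).
cards = np.array([1_000, 5_000, 20_000, 80_000, 150_000], dtype=float)
costs = np.array([2.1, 9.8, 41.0, 160.3, 305.2])

# Assumed logical formula cost ~= a * N + b  =>  design matrix [N, 1].
X = np.stack([cards, np.ones_like(cards)], axis=1)
(a, b), *_ = np.linalg.lstsq(X, costs, rcond=None)

feature_snapshot = (a, b)   # the estimated FS of the seq scan operator
print(feature_snapshot)
```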
III-B Simplified Template
We observe that calculating the feature snapshot requires executing multiple queries, which is costly for complex queries such as the OLAP queries in TPCH. In order to improve time efficiency, we design simplified templates, which capture the characteristics of the original queries while executing efficiently.
As shown in Algorithm 1, the input consists of the data abstract, the original query templates, and the scale of the simplified templates, and the output is a query set that serves as an effective substitute for calculating the FS. Our algorithm consists of three important phases: (i) parse the original query templates, (ii) generate the simplified templates, and (iii) fill in the simplified templates. Next, we introduce these phases in detail.

In the first phase (Lines 2-5), we parse the original query templates and obtain the operator-table-column set, where each entry maps a physical operator to its related table-column tuples. For each query template, we gather its operator set by matching the keyword-operator relationships in Table II. We observe that one keyword may relate to two physical operators due to uncertain query optimization; for example, a single-table selection condition corresponds to both the seq scan operator and the index scan operator. After parsing the original query templates, we obtain the operator-table-column set, which is the basis for generating the simplified templates. Figure 3 shows an example going from an original query template to the simplified queries. The original query template consists of some keywords, like the selection condition, order by, group by, etc. Based on these keywords, we obtain the corresponding operator-table-column set, represented by arrows such as partsupp-ps_partkey → hash/index scan.
| Keyword | Operator | Parent Template |
| [condition] | Seq/Index Scan | SELECT * FROM [table] WHERE [condition] |
| Order By | Sort | SELECT * FROM [table] WHERE [condition] ORDER BY [table.attr] |
| Group By | Aggregate | SELECT COUNT(*) FROM [table] WHERE [condition] GROUP BY [attribute] |
| table1.attr1 = table2.attr2 | Merge/Hash Join, Nested Loop | SELECT * FROM [table1] JOIN [table2] ON [table1.attr = table2.attr] WHERE [condition] |
| | | SELECT * FROM [table1] JOIN [table2] ON [table1.attr = table2.attr] WHERE [condition] ORDER BY [table1.attr] |
| … | … | … |
In the second phase (Lines 6-9), we generate the simplified templates according to the parent-template correspondence in Table II and the operator-table-column set. For each kind of operator, we design parent templates to reproduce the operator. For the seq/index scan, we design one parent template, "SELECT * FROM [table] WHERE [condition]". We do not enforce the operator type in parent templates with database commands like enable_indexscan, because such enforcement may miss the characteristics of the original queries. For example, a column with a corresponding index tends to be accessed mostly by index scans instead of seq scans. Hence, we generate simplified query templates without any enforcing commands. After identifying the parent templates, we generate specific templates by filling in the table-column information. For example, in Figure 3, for the seq scan operator we fill the table-column information into the parent template, yielding "SELECT * FROM partsupp WHERE [ps_partkey]".
In the third phase (Lines 11-15), we fill the simplified templates with values drawn from the data abstract to iteratively generate the simplified queries, and the scale parameter controls the number of queries generated per template. Finally, we obtain the simplified queries, which can be used to calculate the feature snapshot.
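A condensed sketch of the three phases is shown below; the keyword-to-operator and operator-to-parent-template dictionaries are hypothetical stand-ins for Table II, and the data abstract is assumed to map (table, column) pairs to sampled values.

```python
import random

# Hypothetical stand-ins for Table II.
KEYWORD_OPERATORS = {
    "where": ["Seq/Index Scan"],
    "order by": ["Sort"],
    "group by": ["Aggregate"],
    "join": ["Merge/Hash Join", "Nested Loop"],
}
PARENT_TEMPLATES = {
    "Seq/Index Scan": "SELECT * FROM {table} WHERE {column} < {value}",
    "Sort": "SELECT * FROM {table} WHERE {column} < {value} ORDER BY {column}",
    "Aggregate": "SELECT COUNT(*) FROM {table} WHERE {column} < {value} GROUP BY {column}",
}

def generate_simplified_queries(templates, data_abstract, scale):
    """templates: original SQL template strings; data_abstract: {(table, column): [values]};
    scale: number of generated queries per simplified template."""
    # Phase 1: parse the original templates into an operator-table-column set.
    op_table_col = set()
    for sql in templates:
        sql_lower = sql.lower()
        for keyword, operators in KEYWORD_OPERATORS.items():
            if keyword in sql_lower:
                for (table, column) in data_abstract:
                    if table in sql_lower:
                        for op in operators:
                            op_table_col.add((op, table, column))
    # Phases 2 and 3: instantiate parent templates and fill them with sampled values.
    queries = []
    for (op, table, column) in sorted(op_table_col):
        template = PARENT_TEMPLATES.get(op)
        if template is None:
            continue
        for _ in range(scale):
            value = random.choice(data_abstract[(table, column)])
            queries.append(template.format(table=table, column=column, value=value))
    return queries
```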
In general, the time complexity of this algorithm is related to the number of original query templates, the number of generated operators, the number of table-column pairs per operator, and the scale parameter, all of which are constants. Hence, our simplified query generation algorithm runs in constant time. Next, we analyze some limitations of our simplified query templates.
Discussions: In this paper, the feature snapshot is established at the operator level, as an estimated representation of the ignored variables. Naturally, it could be extended to more fine-grained levels, such as the operator-table level or even the operator-table-column level, i.e., we could establish a separate feature snapshot for the scan operator of a certain table or a certain column. Fine-grained feature snapshots bring higher accuracy, but also increase the collection cost.
IV Feature Reduction
Existing AI-driven works [6, 13] on query cost estimation employ a direct feature encoding of queries and data tables. The high dimensionality of these features brings computational overhead. In this section, we introduce our feature reduction algorithm, a novel method to filter the useless features for AI-driven query cost estimators and further improve the time-accuracy efficiency. We first analyze the encoding characteristics of existing popular AI-driven database techniques in Section IV-A. Then, we introduce our difference-propagation feature reduction method in Section IV-B to filter the useless features.
IV-A The analysis of existing encoding methods
As the foundation for designing the feature reduction algorithm, we analyze the encoding characteristics of existing popular AI-driven database techniques. As shown in Table III, we summarize the encoding objects, the encoding methods, and the model type for seven popular works spanning multiple database tasks. The specific analysis is as follows.
| Method | encoding object (plan or operator) | encoding methods | model | task |
| QPPNet [13] | physical operator | one-hot, numerical value | DNN | cost estimation |
| MSCN [17] | query plan | one-hot, numerical value | DNN | cardinality estimation |
| AVGDL [16] | physical operator | one-hot, numerical value | RNN | view selection |
| end2end [12] | physical operator | one-hot, numerical value | RNN | cost estimation & cardinality estimation |
| zero-shot [3] | physical operator | numerical value | MLP | cost estimation |
| AIMeets [23] | physical operator | numerical value, one-hot | DNN | index selection |
| Bao [24] | physical operator | one-hot, numerical value | tree CNN | query optimization |
For the encoding objects, we find that most techniques encode the query at the fine-grained operator level, like AIMeets [23], AVGDL [16], etc. Only MSCN [17] directly encodes queries from three aspects: joins, predicates, and data samples. Thus, our feature reduction algorithm should be designed for the fine-grained operator level to adapt to most models. At the same time, the fine-grained design is also compatible with higher-level encoding objects.
For the encoding methods, we find that all the techniques utilize the same encoding primitives: one-hot encoding and numerical values. For example, QPPNet [13] utilizes a plan-structured deep neural network to learn the relationship between the physical query plan and its time cost. For every node of the plan, QPPNet employs one-hot encoding for the node type (like scan, sort, etc.), index, and table, and numerical values for the columns, node width, cardinality, and cost. Consequently, our feature reduction algorithm should be compatible with one-hot encodings and numerical values.
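For concreteness, a minimal sketch of such an operator-level encoding is given below; the node-type and table vocabularies and the numeric fields are hypothetical rather than the exact QPPNet layout.

```python
import numpy as np

# Hypothetical vocabularies; real systems derive these from the schema and plan format.
NODE_TYPES = ["Seq Scan", "Index Scan", "Sort", "Hash Join", "Aggregate"]
TABLES = ["lineitem", "orders", "partsupp"]

def one_hot(value, vocabulary):
    vec = np.zeros(len(vocabulary))
    vec[vocabulary.index(value)] = 1.0
    return vec

def encode_operator(node_type, table, width, cardinality, est_cost):
    """Concatenate one-hot codes for categorical fields with raw numerical values."""
    return np.concatenate([
        one_hot(node_type, NODE_TYPES),
        one_hot(table, TABLES),
        np.array([width, cardinality, est_cost], dtype=float),
    ])

x = encode_operator("Seq Scan", "partsupp", width=24, cardinality=80_000, est_cost=1432.5)
print(x.shape)   # (11,) = 5 + 3 + 3 dimensions
```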
For the model type, the learned models of existing methods consist of three typical networks, deep neural network (DNN), recurrent neural network (RNN), and convolutional neural network (CNN). Then, our feature reduction algorithm should be compatible with different types of neural networks.
In summary, feature reduction needs to consider three core aspects: the fine-grained operator level, the one-hot encodings and numerical values, and the different typical neural networks.
IV-B Feature Reduction
In this section, we introduce our fine-grained feature reduction algorithm designed for query operators. Feature reduction involves two key factors: the input labeled data and the learned model. The input labeled data contains the operator-level feature snapshot, the query feature, and the statistics feature. The learned model refers to the AI-driven cost model for query estimation. Due to their different learning abilities, different models may have different effective features on the same labeled data. The feature reduction problem is computationally hard because of the exponential number of candidate feature subsets ($2^d$ for $d$ input dimensions). The specific problem is defined as follows:
Given a labeled operator set $\{(x_i, y_i)\}$, where $x_i$ is the encoding of a certain query operator (like scan, sort, etc.) and $y_i$ is the corresponding cost, together with the learned cost estimation model $M$ trained on this set, the goal of feature reduction is to filter the useful dimensions of the input, which consists of one-hot codes and numerical values. Figure 4 shows a simple example of an operator encoding and a learned cost model (a DNN with ReLU, as also used in existing query cost estimators [13, 12]). The example input is a labeled scan operator, and the feature reduction task aims to obtain an effective feature subset of the four-dimensional input under the three-layer neural network model.
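A minimal PyTorch stand-in for the kind of cost model sketched in Figure 4, a small fully connected ReLU network over a low-dimensional operator encoding, is shown below; the layer widths and the example input are hypothetical.

```python
import torch
import torch.nn as nn

class OperatorCostModel(nn.Module):
    """Three-layer fully connected network with ReLU over a 4-dim operator encoding."""
    def __init__(self, input_dim: int = 4, hidden_dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = OperatorCostModel()
x = torch.tensor([[1.0, 0.0, 80_000.0, 24.0]])   # hypothetical scan encoding
print(model(x).item())                           # untrained cost prediction
```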


Feature reduction is a classical task in machine learning [15], which is significant in improving the time-accuracy efficiency of ML models. A typical idea is to greedily select a subset of features from the original feature set so that the new feature set can be used to train the learned model with better performance. Considering that the combinatorial explosion of feature subsets brings exponential time complexity ($2^d$ subsets for $d$ features), in order to solve this problem within polynomial time, we ignore the correlations between features and consider an approximate q-error-based greedy algorithm to find an effective feature subset, as shown in Algorithm 2. In Lines 2-10, we reduce the useless features by iteratively dropping the feature whose removal yields the minimum q-error. The q-error-based approximate greedy algorithm reduces the useless features within polynomial time, requiring $O(d^2)$ model evaluations.
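A sketch of the q-error-based greedy elimination in Algorithm 2, assuming a caller-supplied train_and_eval function (hypothetical) that retrains the model on a given feature subset and returns its mean q-error on a validation set:

```python
def greedy_feature_reduction(features, train_and_eval):
    """features: list of feature indices; train_and_eval(subset) -> mean q-error.
    Iteratively drops the feature whose removal yields the smallest q-error and
    stops when no single removal keeps the error from increasing."""
    current = list(features)
    best_err = train_and_eval(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        candidate_errs = {
            f: train_and_eval([g for g in current if g != f]) for f in current
        }
        f_min = min(candidate_errs, key=candidate_errs.get)
        if candidate_errs[f_min] <= best_err:   # dropping f_min does not hurt accuracy
            current.remove(f_min)
            best_err = candidate_errs[f_min]
            improved = True
    return current
```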
Even though the above approximate greedy algorithm can complete feature reduction within polynomial time, ignoring the correlations between features may retain some potentially useless features. That is, due to the mutual influence between two features, removing either one alone may increase the error, while removing both at the same time reduces it.
To keep the time complexity polynomial while capturing the correlations between features, we consider a natural solution for filtering useless features: the gradient. The partial derivative with respect to each input dimension not only reflects the influence of that dimension but also incorporates the correlations among features [25]. Specifically, for all types of AI-driven models (like deep networks, convolutional networks, etc.), we could utilize the back-propagation algorithm to quickly gather the expected gradient of the output with respect to each dimension as its influence score. Taking the learned model of Figure 4(b) as an example, we could calculate the expected partial gradient of the output with respect to each input dimension by back-propagating through the layers via the chain rule.
Although the gradient method could represent the effectiveness of certain input dimensions, two characteristics of existing AI-driven database task models [16, 24, 17] render gradient-based reduction ineffective. On the one hand, there exists a large number of one-hot codes, which are discrete variables whose gradients are not meaningful. On the other hand, the activation function suffers from gradient vanishing, such as the ReLU in QPPNet [13]. For example, in Figure 4, when a ReLU unit is inactive, the gradients of the corresponding input dimensions are zero, which is useless for measuring feature importance.
$$\mathbb{E}_{r \in R}\!\left[\frac{\Delta f}{\Delta x_j}\right] = \mathbb{E}_{r \in R}\!\left[\sum_{k}\frac{\Delta f}{\Delta h_k}\cdot\frac{\Delta h_k}{\Delta x_j}\right], \quad \Delta f = f(x)-f(r),\ \ \Delta h_k = h_k(x)-h_k(r),\ \ \Delta x_j = x_j - r_j \qquad (1)$$
To handle the discrete variables and gradient vanishing, we consider the difference-propagation method to replace gradient propagation, which is also used in other fields [26]. Specifically, we sample some points as a reference basis (defined as the reference set $R$) and utilize the expected differences between the model outputs on the input and on the references as the importance score. For a single layer, the importance score of dimension $j$ is defined as $\mathbb{E}_{r \in R}\left[\frac{f(x)-f(r)}{x_j-r_j}\right]$. For the multiple-layer model shown in Figure 4, we propagate the differences layer by layer, yielding the difference-propagation equation shown in Equation 1.
Also, we observe that the difference-propagation method does not suffer from the discrete-variable and gradient-vanishing problems. Taking Figure 4 as an example, given a sampled reference set, the difference-importance of a dimension whose gradient vanishes under ReLU can still be non-zero, so it remains effective for feature reduction.
Hence, we utilize difference-propagation as the core of our feature reduction algorithm, shown in Algorithm 3. Specifically, Line 1 initializes the reference set by sampling points from the labeled operator set. For each dimension, Line 3 calculates its expected difference score by Equation 1. If the score is larger than zero, Line 4 adds this feature to the filtered feature set. Line 7 returns all the filtered useful features.
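The sketch below illustrates the idea in a simplified end-to-end form: a dimension's score is the expected change of the model output when that dimension is replaced by reference values, and dimensions with zero expected change are dropped. The full method propagates the differences layer by layer as in Equation 1 (e.g., via the SHAP library used in our implementation); the function names here are hypothetical.

```python
import torch

def difference_importance(model, X, references):
    """X: (n, d) inputs; references: (m, d) sampled reference points.
    Returns one score per dimension: the expected absolute change of the model
    output when that single dimension is replaced by its reference value.
    This is a simplified end-to-end form of the difference score; the full
    method propagates the differences layer by layer (Equation 1)."""
    model.eval()
    with torch.no_grad():
        base = model(X)                       # (n, 1)
        scores = []
        for j in range(X.shape[1]):
            diffs = []
            for r in references:
                X_j = X.clone()
                X_j[:, j] = r[j]              # replace dimension j by the reference value
                diffs.append((base - model(X_j)).abs().mean())
            scores.append(torch.stack(diffs).mean())
    return torch.stack(scores)                # (d,)

def reduce_features(model, X, references):
    scores = difference_importance(model, X, references)
    return [j for j, s in enumerate(scores) if s > 0]   # keep dimensions with non-zero influence
```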
Time Complexity: The difference-propagation algorithm reduces the useless features within polynomial time complexity $O(d \cdot T)$, where $d$ refers to the number of features and $T$ is the time to calculate the expectation differences for one dimension, which can also be accelerated by matrix multiplication [27].
Discussions: Our work can flexibly extend to dynamic workloads by designing a recall algorithm based on the inherent value of input features. In fact, besides their time-accuracy importance, the input features of the query cost model have some inherent value. That is, although some features are worthless for the current query evaluation task, they may still have the potential to affect the query cost. For example, to improve the time-accuracy efficiency of the query cost model under a write-only workload, our algorithm reduces the useless index feature, which is represented as a one-hot vector. However, when the workload changes (50% read, 50% write), some of the index features become effective for estimating the cost of read queries and thus have high potential value for dynamic query cost estimation. Therefore, in dynamic scenarios, researchers and users need to consider not only the time-accuracy efficiency of the cost model but also the inherent value of features to prevent missing important features.
V The Evaluation Of QCFE
In this section, we conduct extensive experiments on our QCFE. Firstly, in Section V-A, we introduce our experimental setup in detail. Then, we present comparisons with popular existing works on time-accuracy efficiency over extensive benchmarks in Section V-B. We introduce the ablation study to demonstrate the effects of QCFE under different design choices in Section V-C. We demonstrate the robustness of QCFE's parameters in Section V-D. Finally, we evaluate the transferability of our feature snapshot in Section V-E.
V-A Experimental setup
In this section, we introduce the foundation of our experiments, including the hardware settings, dataset settings, workload configurations, and evaluation metrics.
Hardware settings: To avoid competition for resources, we utilize different servers to collect the labeled query sets and to train the learned cost models. The data collection runs on PostgreSQL 14.4 with an Intel Core R7 7735HS, 16GB memory, and a 512GB hard disk. The model training is executed on an Intel Core i7-12700H with 42GB memory, a 2.5TB hard disk, and an NVIDIA GeForce RTX 3060 Laptop GPU.
Experimental Datasets: We utilize three open-source benchmarks to evaluate our methods: TPC-H (https://www.tpc.org/tpch/), Sysbench (https://github.com/akopytov/Sysbench/archive/1.0.zip), and IMDB (http://homepages.cwi.nl/~boncz/job/imdb.tgz), which are widely used in existing works [12, 17]. TPC-H is a practical commercial dataset consisting of eight tables and 22 query templates. In this paper, we employ a scale factor of 1 for our comparisons. IMDB is a complex and large movie review dataset consisting of 21 tables with size = 7GB. Sysbench is a synthetic dataset with a simple structure, and we set the table size to 5,000,000.
Workload Configurations: To verify the efficiency of our feature snapshot, we randomly generate 20 database configurations of Postgres 14.4. Then, for each benchmark, we construct the labeled queries (scale = 10,000) under all configurations. We set up different workload settings to unify the amount of labeled queries as follows. For TPCH, we randomly generate 880 queries in each configuration, so there are 17,600 queries in total. For Sysbench, we run oltp_read_only.lua with the default time setting of 10s to generate labeled queries, obtaining the labeled queries of the 20 knob configurations within 200s. For IMDB, we run the queries of job-light (https://github.com/andreaskipf/learnedcardinalities) in each configuration and collect 14,000 labeled queries. Finally, we randomly split the labeled data of every benchmark into five data sets of 2000, 4000, 6000, 8000, and 10000 queries to conduct comparative experiments. In all scales, we divide the labeled set into an 80% training set and a 20% test set.
Metrics: In this paper, we utilize extensive metrics to evaluate our methods, including the Q-error shown in Equation 2, the Pearson coefficient shown in Equation 3, the training time, and the inference time. Specifically, the Q-error clarifies the accuracy of query cost estimation, the Pearson coefficient quantifies the correlation between the true cost labels and the predicted labels, and the training and inference times reflect the time efficiency of the learned estimators.
$$\text{Q-error} = \max\left(\frac{cost_{pred}}{cost_{true}}, \frac{cost_{true}}{cost_{pred}}\right) \qquad (2)$$

$$r = \frac{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)\left(\hat{y}_i-\bar{\hat{y}}\right)}{\sqrt{\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2}\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i-\bar{\hat{y}}\right)^2}} \qquad (3)$$

where $y_i$ is the true cost, $\hat{y}_i$ is the predicted cost, and $\bar{y}$, $\bar{\hat{y}}$ are their means.
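For reference, a small self-contained sketch computing both metrics with numpy (the numbers are invented):

```python
import numpy as np

def q_error(y_true, y_pred):
    """Per-query q-error: max(pred/true, true/pred) >= 1."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.maximum(y_pred / y_true, y_true / y_pred)

def pearson(y_true, y_pred):
    """Pearson correlation coefficient between true and predicted costs."""
    return np.corrcoef(y_true, y_pred)[0, 1]

y_true = np.array([10.0, 50.0, 200.0])
y_pred = np.array([12.0, 45.0, 260.0])
print(q_error(y_true, y_pred).mean(), pearson(y_true, y_pred))
```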
Implementation: We evaluate our feature engineering method on two popular AI-driven estimators with different model designs: QPPNet [13], based on the open-source implementation [28], and MSCN [17]. QPPNet utilizes a plan-structured DNN for query cost estimation, while MSCN designs a single flattened DNN for query cardinality estimation. To extend MSCN to cost estimation (like End-to-End [12, 29]), we change the output of MSCN from cardinality to query cost and add the same fine-grained features (containing the cardinality) as QPPNet to MSCN. Further, we utilize the open-source SHAP library [30] (which can be used to calculate the difference propagation) to implement the feature reduction code. The source code of our experiments is available on GitHub (https://github.com/AvatarTwi/query_cost_feature_engineering).
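As an illustration of how the SHAP library [30] can supply the difference-propagation scores, the sketch below attributes a toy PyTorch model's predictions to its input dimensions with shap.DeepExplainer; the model, data, and thresholding are placeholders rather than our exact implementation.

```python
import numpy as np
import shap
import torch
import torch.nn as nn

# Toy stand-ins for a trained cost model and its operator encodings.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
X = torch.rand(2000, 8)

background = X[torch.randperm(len(X))[:300]]        # sampled reference set
explainer = shap.DeepExplainer(model, background)   # difference-propagation style attributions
shap_values = explainer.shap_values(X[:500])

vals = np.asarray(shap_values).reshape(-1, X.shape[1])   # robust to (n,d) or (n,d,1) outputs
importance = np.abs(vals).mean(axis=0)
kept_dims = np.where(importance > 0)[0]             # feature dimensions retained after reduction
print(kept_dims)
```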
V-B The Effectiveness of our feature engineering
In this section, we evaluate the time-accuracy efficiency of our feature engineering for five methods: PostgreSQL, QPPNet, MSCN, QCFE(qpp), and QCFE(mscn). We integrate our QCFE into the operators of QPPNet (called QCFE(qpp)) and MSCN (called QCFE(mscn)). Due to the different complexity of the benchmarks, we set the training iterations to 800 for job-light, 400 for TPCH, and 100 for Sysbench. Specifically, Table IV shows the time-accuracy efficiency on extensive benchmarks with scale = 2000, 4000, 6000, 8000, 10000, including the Pearson coefficient, the mean q-error, and the training time of all baselines.
| Dataset | Model | 2000 | | | 4000 | | | 6000 | | | 8000 | | | 10000 | | |
| | | pearson | mean | time | pearson | mean | time | pearson | mean | time | pearson | mean | time | pearson | mean | time |
| TPCH | PGSQL | 0.704 | 819.241 | 1.194 | 0.741 | 882.319 | 2.394 | 0.746 | 790.83 | 3.506 | 0.663 | 1393.472 | 4.734 | 0.632 | 1179.219 | 5.896 |
| | QCFE(mscn) | 0.986 | 1.094 | 8.547 | 0.986 | 1.109 | 25.635 | 0.998 | 1.106 | 27.984 | 0.997 | 1.086 | 31.368 | 0.997 | 1.11 | 41.399 |
| | QCFE(qpp) | 0.985 | 1.072 | 424.322 | 0.985 | 1.089 | 399.598 | 0.985 | 1.101 | 391.901 | 0.979 | 1.108 | 433.513 | 0.969 | 1.096 | 417.469 |
| | MSCN | 0.983 | 1.105 | 9.608 | 0.979 | 1.13 | 27.995 | 0.988 | 1.126 | 40.615 | 0.988 | 1.105 | 33.314 | 0.987 | 1.134 | 43.437 |
| | QPPNet | 0.985 | 1.107 | 369.467 | 0.986 | 1.136 | 345.541 | 0.984 | 1.111 | 361.47 | 0.963 | 1.129 | 354.904 | 0.966 | 1.128 | 365.44 |
| Sysbench | PGSQL | 0.224 | 169509.592 | 0.006 | 0.246 | 175054.294 | 0.011 | 0.287 | 185715.657 | 0.016 | 0.269 | 978066.774 | 0.022 | 0.283 | 938706.491 | 0.027 |
| | QCFE(mscn) | 0.792 | 1.528 | 8.116 | 0.748 | 1.521 | 15.531 | 0.818 | 1.484 | 32.226 | 0.776 | 1.542 | 29.579 | 0.721 | 1.57 | 41.045 |
| | QCFE(qpp) | 0.824 | 1.72 | 5.52 | 0.682 | 1.68 | 4.965 | 0.715 | 2.464 | 5.329 | 0.857 | 1.868 | 4.409 | 0.787 | 2.01 | 4.808 |
| | MSCN | 0.698 | 1.734 | 7.183 | 0.709 | 1.738 | 14.713 | 0.796 | 1.659 | 21.366 | 0.606 | 1.804 | 28.616 | 0.648 | 1.785 | 35.351 |
| | QPPNet | 0.524 | 10.402 | 9.321 | 0.465 | 8.492 | 9.997 | 0.488 | 8.891 | 9.463 | 0.616 | 33.596 | 9.459 | 0.633 | 32.644 | 10.282 |
| job-light | PGSQL | 0.447 | 150.103 | 0.009 | 0.394 | 171.211 | 0.017 | 0.396 | 137.072 | 0.033 | 0.367 | 153.256 | 0.042 | 0.376 | 148.1 | 0.048 |
| | QCFE(mscn) | 0.985 | 1.083 | 7.029 | 0.996 | 1.066 | 18.709 | 0.996 | 1.056 | 25.3 | 0.998 | 1.046 | 31.915 | 0.998 | 1.046 | 45.077 |
| | QCFE(qpp) | 0.994 | 1.18 | 229.067 | 0.996 | 1.127 | 243.729 | 0.997 | 1.162 | 230.444 | 0.995 | 1.148 | 241.921 | 0.996 | 1.243 | 255.174 |
| | MSCN | 0.99 | 1.086 | 7.229 | 0.994 | 1.074 | 33.012 | 0.993 | 1.074 | 33.012 | 0.995 | 1.069 | 37.713 | 0.994 | 1.07 | 68.241 |
| | QPPNet | 0.993 | 1.2013 | 353.156 | 0.989 | 1.423 | 361.81 | 0.993 | 1.248 | 333.015 | 0.985 | 1.445 | 337.923 | 0.992 | 1.261 | 527.086 |
The time-accuracy efficiency on TPCH: In Table IV, compared to MSCN, we observe that our QCFE(mscn) improves the mean q-error from 1.094 to 1.086 and the Pearson coefficient from 0.985 to 0.993 on average. Compared to PostgreSQL, QCFE(mscn) improves the mean q-error from 1,012.6 to 1.086 and the Pearson coefficient from 0.697 to 0.993 on average. This clarifies that our QCFE effectively constructs useful features for AI-driven query cost models. At the same time, our QCFE(mscn) reduces model training time by 12.4% on average, which demonstrates that QCFE effectively removes useless features and improves training speed. In particular, at scale = 10000, QCFE(mscn) improves the q-error from 1.134 to 1.11 and the Pearson coefficient from 0.987 to 0.997. Similarly, QCFE(qpp) also reaches higher time-accuracy efficiency than QPPNet on all scales.
The time-accuracy efficiency on job-light: Compared to TPCH, job-light has more complex table relationships, and we observe that the average mean q-error is larger than on TPCH. Specifically, compared to QPPNet, our QCFE(qpp) improves the mean q-error from 1.316 to 1.172 and the Pearson coefficient from 0.990 to 0.996 on average. Compared to PostgreSQL, QCFE(qpp) improves the mean q-error from 151.8 to 1.172 and the Pearson coefficient from 0.396 to 0.996 on average. This clarifies that our QCFE effectively constructs useful features for AI-driven query cost models. At the same time, our QCFE(qpp) reduces model training time by 37.3% on average, demonstrating that QCFE effectively removes useless features and improves training speed. In particular, at scale = 8000, QCFE(qpp) improves the q-error from 1.445 to 1.148 and the Pearson coefficient from 0.985 to 0.995. Similarly, QCFE(mscn) also reaches higher time-accuracy efficiency than MSCN on all scales.
The time-accuracy efficiency on Sysbench: Compared to the other benchmarks, Sysbench only has a single table and its query templates have simple structures, which makes the relationship between query features and query cost harder to learn; for example, a query with a real time cost of 0.0001 and a predicted cost of 0.001 already yields a q-error of 10. We observe that PostgreSQL, MSCN, and QPPNet all have noticeably worse average mean q-error and average Pearson coefficient on this benchmark. Although Sysbench is difficult to predict, our QCFE(mscn) and QCFE(qpp) still reach a higher Pearson coefficient and lower q-error than their base models on average. Also, compared to QPPNet, our QCFE(qpp) reduces model training time by 16.7% on average. In particular, at scale = 8000 for QCFE(qpp), our method improves the q-error from 1.804 to 1.542 and the Pearson coefficient from 0.616 to 0.857.
The Variance of Q-error: Moreover, we present the q-error variance on the different benchmarks at scale = 2000, 4000, 6000, 8000, 10000 in Figure 5. Overall, our method effectively reduces the variance of the q-error on extensive benchmarks, making the predictions more stable and accurate. Compared to QPPNet, we observe that our QCFE(qpp) reduces the 50th-percentile q-error to 57.143% (from 1.084 to 1.048), 0.032% (from 9.16 to 1.308), and 50.299% (from 1.167 to 1.084) under TPCH, Sysbench, and job-light, respectively. For example, on TPCH with scale = 2000, our QCFE(qpp) reduces the 90th-percentile q-error from 1.3 to 1.2 compared to QPPNet, and the other benchmarks show similar results. Furthermore, we observe that as the amount of data increases, both QPPNet and MSCN show deteriorating prediction accuracy and increasing error variance, while our method maintains higher accuracy and lower variance under various data amounts. Hence, our method effectively constructs useful features for learned estimators and reduces the variance of the q-error.

V-C Ablation Study
In this experiment, we present the results of our ablation study to show the effects of QCFE under different design choices, including FSO (the feature snapshot calculated from the original queries), FST (the feature snapshot calculated from the template queries), FSO + FR, FSO + GD (gradient), and FSO + Greedy. We focus on the effectiveness of the different feature snapshots and the various feature reduction methods.

Figure 6 shows the results of our ablation study of the QPPNet model on extensive benchmarks with scale = 4000. We observe that FST reaches mean q-errors [1.109, 1.781, 1.222] similar to FSO [1.098, 1.715, 1.180], which demonstrates that our standard templates reflect the characteristics of the original queries. Moreover, we observe that our difference-importance method outperforms Greedy and GD on both TPCH and job-light. This is because FR effectively reduces the useless features and improves the accuracy of the learned estimators, while GD may suffer from the discrete one-hot codes and gradient vanishing, and Greedy only removes one dimension per iteration and therefore may miss the correlations between features.

Moreover, we present the feature reduction results for TPCH in Figure 7. We observe that Greedy reduces above 1.20% of the features, while GD and FR both reduce above 41.22% on average. Greedy retains the majority of features because it cannot capture the correlations between features, such as pairs of jointly useless features. Specifically, our FR reduces up to 57 features in the index scan operator while Greedy only reduces 2 features, which indicates that the index scan features contain many correlated useless dimensions. Moreover, we find that GD can also reduce a large number of features; in particular, GD reduces 101 features of the sort operator, because it also captures the correlations between features. However, due to the one-hot codes and gradient vanishing, its reduction may produce wrong importance scores: the 50th-percentile q-error of GD in Figure 6 only reaches 1.44, while FR reaches 1.24.
V-D The robustness of parameters
In this section, we demonstrate the robustness of two important parameters of QCFE: the scale of the templates and the number of references.
| TYPE | FSO | | 1 (2) | | 2 (4) | | 3 (6) | | 4 (8) | |
| | mean | time | mean | time | mean | time | mean | time | mean | time |
| TPCH | 1.098 | 7.7h | 1.106 | 0.8h | 1.110 | 1.9h | 1.108 | 2.3h | 1.096 | 3.8h |
| job-light | 1.18 | 31.8h | 1.262 | 0.9h | 1.222 | 1.7h | 1.210 | 2.6h | 1.187 | 3.5h |
As shown in Table V, we observe the effects of different template scales (the scale values in parentheses are for job-light). Overall, we conclude that our simplified SQL templates reflect the characteristics of the original queries, and the q-error is relatively robust to changes of the scale. For TPCH, our FS generates 123 SQL templates, and we set scale = 1, 2, 3, 4. Calculating the feature snapshot costs 7.7h for labeling queries with the original templates of TPCH, while our simplified templates only cost 3.8h and reach a q-error (1.096) competitive with FSO. For job-light, FS generates only 19 SQL templates due to its light join structure, so we set scale = 2, 4, 6, 8 to generate more queries. Because the multi-way joins in job-light are time-consuming, our simplified templates reduce the query collection time to 3.5h/31.8h ≈ 11% and also reach a competitive q-error.
| reference number | mean q-error | 95th q-error | 90th q-error | runtime/s | reduction ratio |
200 | 1.107 | 1.39 | 1.224 | 267.788 | 40.036% |
250 | 1.09 | 1.305 | 1.212 | 267.788 | 39.857% |
300 | 1.095 | 1.262 | 1.212 | 349.73 | 40.214% |
400 | 1.09 | 1.27 | 1.196 | 591.671 | 40.125% |
500 | 1.076 | 1.215 | 1.156 | 911.671 | 39.767% |
Table VI shows the robustness with respect to the number of references on TPCH with QCFE(qpp), including the mean q-error, the 95th q-error, the 90th q-error, the runtime of FR, and the reduction ratio. We observe that the q-error is only slightly improved as the reference number increases, which clarifies that a limited number of references already reflects the importance of the input features. The runtime increases roughly linearly with the parameter, from 267s to 911s, demonstrating that the cost of our feature reduction grows linearly with the number of references. Further, the reduction ratio (about 40%) is also robust to the change of the reference number.
V-E The transferability of feature snapshot
In this section, we introduce the transferability of our feature snapshot. Specifically, we utilize the trained cost model (scale = 10000 in TPCH and job-light) of the above hardware settings as the transferable basis. Then, in the same TPCH and job-light settings, we collect 2000 labeled queries as the training set and 500 labeled queries as the test set in a new hardware setting (Intel Core i7-12700H with 42GB memory, 2.5TB hard disk), called h2. To clarify the transferability of our feature snapshot, we calculate the feature snapshot with the 2000 labeled original queries (FSO) and the template queries (FST scale = 4 in TPCH and 4 in job-light).
Table VII shows the Pearson coefficient, the mean q-error, and the training time of the cost model directly trained with the 2000 labeled queries (basis), the basis model + FSO with 200 retraining iterations, and the basis model + FST with 200 retraining iterations. We find that, only by replacing the feature snapshot and performing a little retraining, our transferred model reaches accuracy similar to the model trained from scratch on the 2000 labeled queries. In particular, we observe that FST outperforms FSO in mean q-error under TPCH. This is because TPCH has more diverse operators (123 templates), so it is difficult to calculate a representative feature snapshot from only 2000 labeled queries, while our FST covers more operator characteristics than FSO with a limited labeled set.
| Model | Metric | TPCH | job-light |
| basis | pearson | 0.983 | 0.995 |
| | mean | 1.088 | 1.195 |
| | time | 381.157 | 232.519 |
| trans-FSO | pearson | 0.981 | 0.997 |
| | mean | 1.112 | 1.246 |
| | time | 114.455 | 65.539 |
| trans-FST | pearson | 0.982 | 0.99 |
| | mean | 1.083 | 1.278 |
| | time | 121.093 | 73.246 |
Moreover, we demonstrate the convergence speed of the directly trained model and the transferred model in Figure 8. We observe that the transferred model reaches accuracy similar to the directly trained model with only 25% of the training time. This demonstrates that our feature snapshot achieves hardware transferability and effectively reduces the training time in a new database environment.

VI Related Works
In this section, we introduce the related works from two aspects: the cost estimation methods and the feature processing methods.
Cost Estimation: In the early days, in order to improve the efficiency of query optimization, researchers proposed statistical methods [31, 20, 32] to estimate query performance. For example, Li et al. [31] proposed operator-based statistical techniques to estimate query execution time; this model simulates operator costs by modeling resource usage as operator-level behavior and designing a statistical learning model. Modeling workload performance as a random variable, Wu et al. [20] proposed a model that provides uncertainty information for query execution time prediction; it expresses the cost of a query as a function of the selectivity of the operators in the query plan and of factors that describe the CPU and I/O costs of the system. Recently, the database community has attempted to utilize deep learning models [33, 34] to improve the accuracy of query performance prediction. For example, Sun et al. [6] proposed an end-to-end framework for learned cost estimation, TPool, which uses a tree-structured model to approximate query costs, together with effective feature extraction and encoding techniques to improve the learning ability of the model. Marcus et al. [13] proposed QPPNet, a deep neural network based on the query plan tree structure, to estimate the query cost; the model automatically discovers the characteristics of operators and query plans, avoiding manual feature engineering. Akdere et al. [35] proposed predictive modeling techniques to learn query execution behavior at different granularities, from coarse-grained plan-level models to fine-grained operator-level models, which balances accuracy and transferability well. Moreover, zero-shot [3] focused on query encoding issues and proposed a feature transferable across databases as the basis of zero-shot estimation.
Feature Engineering for Cost Estimation: Feature engineering is a typical task for improving the efficiency of deep learning, and many feature engineering methods [36, 37] exist in other fields, like word2vec [38] in natural language processing. For query cost estimation, there exist only two related works: the transferable query encoding [3] and query representation learning [10]. (1) Zero-shot [3] focused on the poor transferability of query encodings and converted the original one-hot encoding of attributes into an encoding of attribute types; this encoding supports a feature transferable across databases as the basis of zero-shot estimation. (2) QueryFormer [10] proposed a tree-transformer model to learn the tree structure of queries, with the goal of translating query execution plans into vectorized representations. Different from the above two works, our QCFE aims to find the effective features for query cost estimation, which improves the time-accuracy efficiency of query cost estimation. Moreover, our method can be easily combined with the above two methods and enhance their efficiency.
VII Conclusion
In this paper, we propose QCFE to obtain effective features for query cost estimation. On the one hand, we design an estimated feature snapshot to integrate the otherwise ignored variables, containing database knobs, hardware, etc. The experimental comparisons clarify that our approach efficiently captures the influence of the ignored variables. On the other hand, we design a difference-propagation feature reduction method to filter the useless features and improve the time-accuracy efficiency of query cost estimation. In the future, we consider optimizing QCFE from two aspects: adapting our feature reduction algorithm to dynamic workloads by designing a recall mechanism, and enhancing transfer learning for query cost estimation by identifying which features to transfer.
References
- [1] Q. Zhou, J. Arulraj, S. Navathe, W. Harris, and J. Wu, “Sia: Optimizing queries using learned predicates,” in Proceedings of the 2021 International Conference on Management of Data, pp. 2169–2181, 2021.
- [2] H. Lan, Z. Bao, and Y. Peng, “An index advisor using deep reinforcement learning,” (New York, NY, USA), Association for Computing Machinery, 2020.
- [3] B. Hilprecht and C. Binnig, “Zero-shot cost models for out-of-the-box learned cost prediction,” arXiv preprint arXiv:2201.00561, 2022.
- [4] Z. Yang, E. Liang, A. Kamsetty, C. Wu, Y. Duan, X. Chen, P. Abbeel, J. M. Hellerstein, S. Krishnan, and I. Stoica, “Deep unsupervised cardinality estimation,” vol. 13, no. 3, 2019.
- [5] Z. Yang, A. Kamsetty, S. Luan, E. Liang, Y. Duan, X. Chen, and I. Stoica, “Neurocard: One cardinality estimator for all tables,” vol. 14, no. 1, 2020.
- [6] J. Sun and G. Li, “An end-to-end learning-based cost estimator,” 2021.
- [7] K. Chowdhary and K. Chowdhary, “Natural language processing,” Fundamentals of artificial intelligence, pp. 603–649, 2020.
- [8] S. Wu, M. Zhang, G. Chen, and K. Chen, “A new approach to compute cnns for extremely large images,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 39–48, 2017.
- [9] T. Vu, C. Van Nguyen, T. X. Pham, T. M. Luu, and C. D. Yoo, “Fast and efficient image quality enhancement via desubpixel convolutional neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV) Workshops, September 2018.
- [10] Y. Zhao, G. Cong, J. Shi, and C. Miao, “Queryformer: a tree transformer model for query plan representation,” Proceedings of the VLDB Endowment, vol. 15, no. 8, pp. 1658–1670, 2022.
- [11] Y. E. Ioannidis, “Query optimization,” ACM Computing Surveys (CSUR), vol. 28, no. 1, pp. 121–123, 1996.
- [12] J. Sun and G. Li, “An end-to-end learning-based cost estimator,” arXiv preprint arXiv:1906.02560, 2019.
- [13] R. Marcus and O. Papaemmanouil, “Plan-structured deep neural network models for query performance prediction,” arXiv preprint arXiv:1902.00132, 2019.
- [14] D. Zhang, J. Yin, X. Zhu, and C. Zhang, “Network representation learning: A survey,” IEEE transactions on Big Data, vol. 6, no. 1, pp. 3–28, 2018.
- [15] G. Dong and H. Liu, Feature engineering for machine learning and data analytics. CRC press, 2018.
- [16] H. Yuan, G. Li, L. Feng, J. Sun, and Y. Han, “Automatic view generation with deep learning and reinforcement learning,” in 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1501–1512, 2020.
- [17] A. Kipf, T. Kipf, B. Radke, V. Leis, P. Boncz, and A. Kemper, “Learned cardinalities: Estimating correlated joins with deep learning. https://github.com/andreaskipf/learnedcardinalities,” in CIDR, 2019.
- [18] R. Rojas and R. Rojas, “The backpropagation algorithm,” Neural networks: a systematic introduction, pp. 149–182, 1996.
- [19] W. Du, R. Krishnamurthy, and M.-C. Shan, “Query optimization in a heterogeneous dbms,” in VLDB, vol. 92, pp. 277–291, 1992.
- [20] W. Wu, W. Xi, H. Hacigümüş, and J. F. Naughton, “Uncertainty aware query execution time prediction,” Proceedings of the VLDB Endowment, vol. 7, no. 14, pp. 1857–1868, 2014.
- [21] Improving PostgreSQL Cost Model: https://dsl.cds.iisc.ac.in/publications/thesis/pankhuri.pdf.
- [22] Å. Björck, “Least squares methods,” Handbook of numerical analysis, vol. 1, pp. 465–652, 1990.
- [23] B. Ding, S. Das, R. Marcus, W. Wu, S. Chaudhuri, and V. R. Narasayya, “Ai meets ai: Leveraging query executions to improve index recommendations,” SIGMOD ’19, Association for Computing Machinery, 2019.
- [24] R. Marcus, P. Negi, H. Mao, N. Tatbul, M. Alizadeh, and T. Kraska, “Bao: Learning to steer query optimizers,” arXiv preprint arXiv:2004.03814, 2020.
- [25] C. J. Ter Braak and I. C. Prentice, “A theory of gradient analysis,” Advances in ecological research, vol. 34, pp. 235–282, 2004.
- [26] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” in International conference on machine learning, pp. 3145–3153, PMLR, 2017.
- [27] D. Coppersmith and S. Winograd, “Matrix multiplication via arithmetic progressions,” in Proceedings of the nineteenth annual ACM symposium on Theory of computing, pp. 1–6, 1987.
- [28] Github repository: QPPNet in PyTorch. https://github.com/rabbit721/QPPNet.
- [29] S. Liu, X. Chen, Y. Zhao, J. Chen, R. Zhou, and K. Zheng, “Efficient learning with pseudo labels for query cost estimation,” CIKM ’22, (New York, NY, USA), p. 1309–1318, Association for Computing Machinery, 2022.
- [30] Github repository: SHAP. https://github.com/shap/shap.
- [31] J. Li, A. König, V. Narasayya, and S. Chaudhuri, “Robust estimation of resource consumption for sql queries using statistical techniques,” Proceedings of the VLDB Endowment, vol. 5, no. 11, pp. 1555–1566, 2012.
- [32] W. Wu, Y. Chi, S. Zhu, J. Tatemura, H. Hacigümüs, and J. F. Naughton, “Predicting query execution time: Are optimizer cost models really unusable?,” in 2013 IEEE 29th International Conference on Data Engineering (ICDE), pp. 1081–1092, IEEE, 2013.
- [33] N. Sheoran, S. Mitra, V. Porwal, S. Ghetia, J. Varshney, T. Mai, A. Rao, and V. Maddukuri, “Conditional generative model based predicate-aware query approximation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, pp. 8259–8266, 2022.
- [34] X. Yu, C. Chai, G. Li, and J. Liu, “Cost-based or learning-based? a hybrid query optimizer for query plan selection,” Proceedings of the VLDB Endowment, vol. 15, no. 13, pp. 3924–3936, 2022.
- [35] M. Akdere, U. Çetintemel, M. Riondato, E. Upfal, and S. B. Zdonik, “Learning-based query performance modeling and prediction,” in 2012 IEEE 28th International Conference on Data Engineering, pp. 390–401, IEEE, 2012.
- [36] F. Nargesian, H. Samulowitz, U. Khurana, E. B. Khalil, and D. S. Turaga, “Learning feature engineering for classification.,” in Ijcai, vol. 17, pp. 2529–2535, 2017.
- [37] S. Scott and S. Matwin, “Feature engineering for text classification,” in ICML, vol. 99, pp. 379–388, 1999.
- [38] K. W. Church, “Word2vec,” Natural Language Engineering, vol. 23, no. 1, pp. 155–162, 2017.