GP: Context-free Grammar Pre-training for Text-to-SQL Parsers
Abstract
Text-to-SQL technology enables database querying through natural language, with no need to learn SQL grammar. The challenge, however, is that modeling the alignment between database information and a given question is non-trivial. We propose a novel Grammar Pre-training (GP) method for Text-to-SQL parsing to decode the deep relations between the database and the question. To adequately learn the internal relationships of SQL grammar, the decoder is pre-trained independently of the encoder, which improves the robustness of the model and accelerates convergence. A flooding level is adopted to keep the training loss non-zero and avoid local-extremum problems. Ultimately, we achieve better performance on Spider, a cross-DB Text-to-SQL dataset (72.8% dev, 69.8% test), by encoding the sentence with GRAPPA and the RAT-SQL model. During training, the average loss is reduced by 78.9% and the variance is only 0.8% of that of the previous model. Moreover, experiments show that this technique converges much faster and has excellent robustness.
1 Introduction
Recently, with the development of artificial intelligence technology, directly generating SQL statements from natural language has attracted a great deal of research interest. These statements interact with database systems through the analysis of natural language. Current research work adopts a Natural Language Interface to Database (NLIDB) to realize the interaction between users' questions and the database system in order to obtain and analyze data (?).
The core problem of NLIDB is converting input text into SQL statements (Text-to-SQL). Two main approaches currently exist for this problem. The first is based on rule templates: natural language is classified according to common SQL grammar, and corresponding SQL templates are associated with the various categories (?, ?). Such a method requires manual summarization of experience and a great deal of time (?). Moreover, when the application scenario changes, the existing templates often fail to satisfy the requirements, so migration is poor. The second approach is based on deep learning, where a neural network is utilized for end-to-end implementation (?, ?, ?, ?, ?). This approach can be continuously self-optimized by adding samples. It offers both higher accuracy and strong stability, and it has received growing attention from the academic community. Combined with a BERT encoder, accuracy above 90% can be obtained on the WikiSQL dataset.

However, these deep-learning methods do not achieve satisfactory performance on cross-domain Text-to-SQL scenarios such as Spider (?). As shown in Fig. 1, a SQL query may include nested clauses such as GROUP BY and joins over multiple tables. Users rarely care about such grammar details, so they are hardly mentioned in questions. Bailin Wang et al. proposed a relation-aware framework called RAT-SQL and achieved state-of-the-art accuracy on the Spider dataset. Moreover, pre-training language models have been developed on structured table data together with users' natural language. At early stages, BERT (?) and RoBERTa (?) were used to encode contextual sentences in the cross-domain Text-to-SQL scenario; however, the relations between the tables and fields of the database were not considered. A grammar-augmented pre-training model (GRAPPA) was then presented to describe joint representations of textual and tabular data (?). By integrating this pre-training model with downstream methods such as RAT-SQL, the accuracy of cross-domain tasks can be improved greatly.
In the present work, a context-free grammar pre-training (GP) method is proposed for Text-to-SQL. Since the SQL grammar framework is independent of the specific natural language, we first pre-train the decoder without encoder information. During training, GP effectively improves the training efficiency of the model and shows clear advantages in robustness and convergence. In the preprocessing module, we use string matching to discover values appearing in the question and add them behind the corresponding columns in the original input sequence. For the loss function, we adopt the flooding level as a new method to avoid local minima. Based on the GRAPPA/RAT-SQL framework, experiments indicate that our approach obtains much higher accuracy on the Spider dataset and better robustness. It also has potential applications for other context-free grammar representation tasks.
2 Related Works
Pre-training models for NLP parsing. The Text-to-SQL task comprises both structured schema information and an unstructured user question. Early research used general pre-training models such as ELMo (?), RoBERTa (?), and BERT (?) to represent the unstructured language questions. Learning better joint representations of input text and table information has brought great improvements to joint textual-tabular tasks such as question answering (?) and table semantic parsing (?); however, these works mostly consider single tables. Recent pre-training work focuses on achieving high-quality cross-modal representations. TaBERT (?) is pre-trained on millions of web tables; it can represent the complete structure of various tables and supports matrix computations in table semantic parsing. Nevertheless, its performance is weakened by noisy context information on the Text-to-SQL task. In this work, we adopt GRAPPA, a grammar-augmented pre-training technique utilizing a novel text-schema linking objective and masked language modeling (MLM). Integrating GRAPPA as feature representation layers with downstream models yields high accuracy on the Spider dataset.
Neural networks for Text-to-SQL. Earlier networks were intended to solve problems on single-table datasets such as WikiSQL. The Seq2SQL model based on reinforcement learning (?) was applied to Text-to-SQL tasks and achieved a SQL execution accuracy of 59.45% on the WikiSQL dataset. Then, TypeSQL (?) was presented to further extract keywords in the question by integrating external knowledge and database field enumeration values. These methods obtained clear gains on single-table queries, but they are not sufficient for the complex setting of multi-table queries. EditSQL (?) utilizes an editing mechanism to introduce historical information of user queries, and its matching accuracy on the Spider dataset reaches 32.9%. IRNet (?) uses an intermediate representation called SemQL to translate complex SQL queries into a syntax tree; using a pointer network (?) for downstream tasks, it obtains an accuracy of 54.7% on the Spider test set. Moreover, graph neural networks have been used to represent the relations of schema information: a global gated graph neural network (?) was designed to model the structure of database schemas and apply it in the encoding and decoding stages. Recently, RAT-SQL (?) used a relation-aware self-attention mechanism for schema encoding, schema linking, and feature representation, obtaining a state-of-the-art accuracy of 65.6% on the Spider test set.
Training loss optimization. Optimizing the training loss is a common problem in the training procedure (?). Compared with former regularization methods such as dropout (?), label smoothing (?), batch normalization (?), and mixup (?), flooding (?) directly prevents the training loss from decreasing to zero by making it float around a small constant value. The level around which the loss is fixed can be determined based on the model itself. Thus, flooding skips some local extreme points and searches for optimal parameters from a global perspective.
3 Methodology
3.1 Context-free Grammar Pre-training
RAT-SQL uses the Syntactic Neural Model (SNM) presented by (?) to generate SQL according to its grammar. Yin et al. argued that existing methods treat code generation as a sequence generation task without considering the grammar of the target programming language. Unlike natural languages, programming languages, especially SQL, have strict grammar rules. Based on these rules, SNM is essentially a method that improves model accuracy by limiting the search space of the decoder.
Moreover, the basic framework of SQL grammar is context-free with respect to the specific natural language description. For instance, regardless of the natural language description, the first clause of a SQL query is always SELECT, and the next clause is always FROM. Based on our experiments, the extremely large loss values in the initial training stage of RAT-SQL come mainly from SQL grammar errors produced by the decoder.
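To make the context-free property concrete, the toy fragment below sketches what such grammar productions look like. These simplified rules are ours for illustration only and are not the actual ASDL grammar used by RAT-SQL.

```python
# A heavily simplified, illustrative fragment of context-free SQL grammar
# productions (not the grammar actually used by RAT-SQL). Whatever the
# question is, every derivation starts with SELECT followed by FROM.
SQL_GRAMMAR = {
    "sql":             [["select_clause", "from_clause", "where_clause?",
                         "group_by_clause?", "order_by_clause?"]],
    "select_clause":   [["SELECT", "column_list"]],
    "from_clause":     [["FROM", "table_list"]],
    "where_clause":    [["WHERE", "condition"]],
    "group_by_clause": [["GROUP", "BY", "column_list", "having_clause?"]],
}
```

Because these productions hold for every question, a decoder can learn them without looking at the encoder at all, which is exactly what GP exploits.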
In view of the above, we propose a context-free Grammar Pre-training (GP) technique to pre-train the parameters on the decoder side, with the encoder's semantic information replaced by zero vectors. The probability equation of RAT-SQL, which uses an LSTM to output a sequence of actions, is:
\Pr(y \mid x) = \prod_{t} \Pr(a_t \mid x, a_{<t}) \quad (1)
where $x$ is always an empty (zero) encoding in the stage of GP and $a_{<t}$ are all previous actions. Correspondingly, the LSTM's state updating strategy is modified as:
\mathbf{m}_t, \mathbf{h}_t = f_{\mathrm{LSTM}}\big([\mathbf{a}_{t-1} \,\|\, \mathbf{0} \,\|\, \mathbf{h}_{p_t} \,\|\, \mathbf{a}_{p_t} \,\|\, \mathbf{n}_{f_t}],\; \mathbf{m}_{t-1}, \mathbf{h}_{t-1}\big) \quad (2)
where $\mathbf{m}_t$ and $\mathbf{h}_t$ are the LSTM cell state and output at step $t$, $\mathbf{a}_{t-1}$ represents the embedding of the previous action, $p_t$ denotes the step corresponding to expanding the parent AST node of the current node, and $\mathbf{n}_{f_t}$ is the current node type embedding. We use the zero vector $\mathbf{0}$ to replace the former context representation $\mathbf{z}_t$, which was obtained through multi-head attention of $\mathbf{h}_{t-1}$ over the encoder output.
Since GP no longer depends on semantic information, it cannot predict column or table names. In order not to change the framework of RAT-SQL, we assume that each sample has only one column $c_0$ and one table $t_0$, thus
\Pr(a_t = \text{SelectColumn}[c_0] \mid x, a_{<t}) = 1 \quad (3)
\Pr(a_t = \text{SelectTable}[t_0] \mid x, a_{<t}) = 1 \quad (4)
To prevent overfitting, the number of decoder Grammar Pre-training steps was limited to 300.
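The PyTorch-style pseudocode below is a minimal sketch of the GP stage under our own naming assumptions; the decoder interface (score_actions, enc_dim) is hypothetical and not taken from the RAT-SQL codebase. It only illustrates the two ingredients described above: replacing the encoder memory with zero vectors and capping the number of pre-training steps.

```python
import torch

def grammar_pretrain(decoder, gold_action_batches, optimizer, max_steps=300):
    """Pre-train the tree decoder on SQL action sequences alone.

    The encoder memory is replaced by zero vectors, so the decoder can only
    learn the internal (context-free) structure of SQL grammar. Column and
    table choices are assumed to be collapsed to a single placeholder, so
    their selection probability is trivially 1 (Eqs. 3-4).
    """
    decoder.train()
    for step, actions in enumerate(gold_action_batches):
        if step >= max_steps:          # cap at ~300 steps to avoid overfitting
            break
        batch_size = len(actions)
        # Zeroed-out "encoder output" with shape (batch, memory_len, enc_dim).
        zero_memory = torch.zeros(batch_size, 1, decoder.enc_dim)

        # Hypothetical API: teacher-forced negative log-likelihood of the
        # gold grammar actions given the (empty) memory.
        loss = decoder.score_actions(actions, memory=zero_memory)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```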
3.2 Question-Schema Serialization and Encoding
Generally, the serialization technique of RAT-SQL is adopted. Since the pre-trained semantic model we use is GRAPPA, the question tokens are preceded by <s> and ended with </s>. Then, tables and columns are spliced in sequence based on the order of the schema provided by the Spider dataset, using </s> as the separator.
As stated in (?), modeling with only table/field names and their relations is not always adequate for capturing the semantics of the schema and its dependencies on the question. Notably, we append a value to a mentioned column only when it matches the question exactly. For example, in Figure 2, a keyword in the question appears in the cells of two columns: it partially matches one and exactly matches the other. Thus, the token has a Column-Part-Match (CPM) relationship with the first column and a Column-Exact-Match (CEM) relationship with the second. Intuitively, the exactly matched column is more likely to be the correct one. To strengthen this relationship, we place the value after the exactly matched column during serialization, but not after the partially matched one. The sequence can be converted as
X = \texttt{<s>},\, q_1, \ldots, q_{|Q|},\, \texttt{</s>},\, t_1,\, c_{11},\, v_{11},\, c_{12}, \ldots,\, \texttt{</s>},\, t_2, \ldots \quad (5)
where $q_i$ are question tokens, $t_i$ table names, $c_{ij}$ column names, and $v_{ij}$ a value appended only after an exactly matched column.
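A minimal sketch of this serialization is given below, assuming GRAPPA/RoBERTa-style special tokens <s> and </s>; the schema data structure and helper names are ours, not the original implementation.

```python
def find_exact_value(question_tokens, column_cells):
    """Column-Exact-Match: return a cell value that appears verbatim
    (as a whole phrase) in the question, or None."""
    question = " " + " ".join(question_tokens).lower() + " "
    for cell in column_cells:
        cell_str = str(cell).lower()
        if f" {cell_str} " in question:
            return cell
    return None


def serialize(question_tokens, schema):
    """Serialize question + schema into one token sequence.

    `schema` is assumed to be a list of (table_name, columns) pairs, where
    each column is a (column_name, cell_values) pair. A matched value is
    inserted right after its exactly matching column to strengthen the
    CEM relationship.
    """
    seq = ["<s>"] + list(question_tokens) + ["</s>"]
    for table_name, columns in schema:
        seq.append(table_name)
        for column_name, cells in columns:
            seq.append(column_name)
            value = find_exact_value(question_tokens, cells)
            if value is not None:
                seq.append(str(value))
        seq.append("</s>")
    return seq


# Example:
# serialize(["how", "many", "cities", "in", "france"],
#           [("country", [("name", ["France", "Japan"]), ("population", [])])])
# -> ['<s>', 'how', ..., 'france', '</s>', 'country', 'name', 'France',
#     'population', '</s>']
```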

In RAT-SQL, the vector representation of a table or column is the average of its first and last tokens. Research indicates that this encoding method may lose important information (?); hence, we instead compute the average over the vectors of all tokens of the column or table. When a column is followed by a value, the column's representation is determined by all column tokens and value tokens (Fig. 3).
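A minimal sketch of this pooling, assuming the encoder output is a (seq_len, hidden) tensor and the span indices of each column or table (including any appended value tokens) are known:

```python
import torch

def span_representation(token_embeddings: torch.Tensor,
                        start: int, end: int) -> torch.Tensor:
    """Average all token vectors in [start, end) belonging to a column or
    table (including any value tokens appended after the column), instead
    of averaging only the first and last token."""
    return token_embeddings[start:end].mean(dim=0)

# Example: a column spanning tokens 12..15 plus an appended value at 15..17
# col_vec = span_representation(encoder_output, 12, 17)
```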

3.3 Flooding Level
In deep learning, the training loss often keeps decreasing while the validation loss suddenly starts to rise (?). (?) proposed a simple trick on the loss function to keep the validation loss decreasing:
\tilde{J}(\theta) = |J(\theta) - b| + b \quad (6)
where $b$ represents the user-specified flooding level and $\theta$ denotes the model parameters. The parameter $b$ is assumed to prevent the model from falling into a local optimum, to a certain extent, during optimization. Since the Spider dataset contains various types of SQL grammar and databases of inconsistent sizes, training tends to overfit and converge near a local extremum; this method was therefore adopted to improve the final results. Nevertheless, an unsuitable $b$ usually results in gradient explosion.
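The flooding objective of Eq. 6 amounts to a one-line change to the training loss; a minimal PyTorch sketch (the value of b below is illustrative, not the one used in our experiments):

```python
import torch

def flooded_loss(raw_loss: torch.Tensor, b: float) -> torch.Tensor:
    """Flooding (Eq. 6): keep the training loss floating around level b.

    While raw_loss > b the gradient is unchanged; once raw_loss < b the
    sign flips, so the optimizer performs gradient ascent and the loss
    cannot collapse to zero.
    """
    return (raw_loss - b).abs() + b

# Usage inside a training step (b is a small user-specified constant):
# loss = flooded_loss(criterion(model(x), y), b=0.5)   # 0.5 is illustrative
# loss.backward()
```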
4 Experimental Results
4.1 Experimental Setup
The Adam optimizer (?) with default hyperparameters is adopted, and a fixed learning rate is used in the GP stage. Owing to GPU memory limitations, we set the per-step batch size and the number of gradient accumulation steps of RAT-SQL so that the effective batch size is 12. Considering GP and the smaller batch size, we adjust the initial learning rate of GRAPPA and that of the other model parameters relative to the original RAT-SQL settings. The rest of the setup is the same as RAT-SQL.
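The sketch below shows how two learning rates and gradient accumulation can be combined in PyTorch. The concrete rates are placeholders of ours, not the values used in the paper, and model.encoder is an assumed attribute.

```python
import torch

def build_optimizer(model, encoder_lr=1e-5, other_lr=5e-4):
    """Adam with two parameter groups: the pre-trained encoder (e.g. GRAPPA)
    gets a smaller learning rate than the remaining parameters.
    Both rates here are placeholders, not the paper's values."""
    encoder_params = list(model.encoder.parameters())   # assumed attribute
    encoder_ids = {id(p) for p in encoder_params}
    other_params = [p for p in model.parameters() if id(p) not in encoder_ids]
    return torch.optim.Adam([
        {"params": encoder_params, "lr": encoder_lr},
        {"params": other_params, "lr": other_lr},
    ])

def train_epoch(model, loader, optimizer, accum_steps=4):
    """Gradient accumulation: with a per-step batch of 3 and accum_steps=4,
    the accumulated gradients correspond to an effective batch size of 12."""
    optimizer.zero_grad()
    for step, batch in enumerate(loader):
        loss = model(batch) / accum_steps     # average over accumulated steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```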
4.2 Dataset and Metrics
Table 1: Statistics of the Spider dataset.
dataset | samples | databases
train set | 8659 | 146
dev set | 1034 | 20
test set | 2147 | 40
Spider (?) is a large-scale, complex, cross-domain Text-to-SQL dataset. It includes schema information and a corresponding SQL statement for each natural language question. According to Table 1, it contains 10,181 questions and 5,693 unique complex SQL queries over 206 databases with multiple tables, covering 138 different domains. Based on hardness, Spider is split into four levels: Easy, Medium, Hard, and Extra Hard. It is the only public Text-to-SQL dataset containing both complex SQL statements and multi-table queries; here, complex SQL refers to statements containing nested queries and clauses such as GROUP BY.
The metric adopted to assess model performance is the exact set match accuracy suggested by (?). It processes the predicted SQL and the gold statement with standardized definitions and calculates the matching degree between them without considering the order of column names.
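To illustrate the order-insensitive comparison in spirit (the official Spider evaluation script is far more elaborate), a toy check on SELECT columns might look like this:

```python
def select_columns_match(pred_cols, gold_cols):
    """Toy exact-set-match check for one clause: the order of column names
    does not matter, only the set of (normalized) columns."""
    normalize = lambda cols: {c.strip().lower() for c in cols}
    return normalize(pred_cols) == normalize(gold_cols)

# "SELECT name, age" vs "SELECT age, name" -> counted as a match
assert select_columns_match(["name", "age"], ["age", "name"])
```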
4.3 Results
Although RAT-SQL and GRAPPA are open-sourced, our offline results are worse than those announced on the leaderboard (Table 2). This may be explained by random seeds or device differences. In this section, we mainly compare model performance based on offline results.
Table 2: Dev. accuracy (%) reported on the leaderboard vs. our offline reproduction.
model | leaderboard | offline
RAT-SQL+BERT | 69.7 | 66.7
RAT-SQL+GRAPPA | 73.4 | 71.7
GP. Figure 5 indicates that in the first 50 steps of GP, the training loss drops significantly and then remains at about 53. To prevent overfitting, the number of grammar pre-training steps is limited, even though the loss is still dropping slowly. We then use the pre-trained decoder to train our model, and the training loss stays at a lower level than that of the previous method without GP (Fig. 5). We computed the average and variance of the loss before and after 1500 steps as stable values. From Table 3, the average loss with GP is 15.02, a reduction of 78.9% compared to the former one. Furthermore, its variance is only 0.8% of that of the model without GP, indicating a smooth optimization during training. The final loss of less than 1.37 also shows that GP helps to find auxiliary information between SQL grammar and question words.


Table 3: Average and variance of the training loss (left: before 1500 steps, right: after 1500 steps).
Model | Avg | Var | Avg | Var
Without GP | | | |
With GP | | | |
Flooding. Equation 6 introduces an extra parameter $b$ into the loss function, and model performance is extremely sensitive to $b$ and the learning rate. Moreover, a slightly larger $b$ may lead the model to gradient explosion during training. Table 4 lists several examples of different parameter combinations; combinations that result in gradient explosion are marked accordingly. It is worth mentioning that although flooding can enhance model performance, the results are not stable: the best result may be as high as 72.1%, while the lowest may be only 70.7%, even with the same parameters.
Table 4: Dev. accuracy under different combinations of the flooding level b and the learning rate.
Serialization with value. By adding the matched value after the column, the distinction between columns is enhanced, and we observe a slight reduction in column-selection errors. Table 5 shows the improvements from Flooding (Fld.), Serialization with value (val.), and GP, respectively. The best offline result is 73.1% on Dev.
Table 5: Ablation of Flooding (Fld.), serialization with value (val.), and GP on the offline Dev. set.
model | Dev.
RAT-SQL+GRAPPA |
RAT-SQL+GRAPPA with Fld. |
RAT-SQL+GRAPPA with Fld. + val. |
RAT-SQL+GRAPPA with Fld. + val. + GP |
Table 6: Final accuracy (%) on the Spider Dev. and Test sets.
model | Dev. | Test
RAT-SQL+GRAPPA (?) | 73.4 | 69.6
RAT-SQL+GRAPPA+GP (Ours) | 72.8 | 69.8
The ultimate result on Spider is 72.8% on Dev. and 69.8% on Test. Compared to RAT-SQL+GRAPPA, the Dev. and Test results of RAT-SQL+GRAPPA+GP are much closer to each other, indicating that our model is more robust, as shown in Table 6.
5 Conclusion
Since most research in Text-to-SQL treats SQL generation like natural language generation, SNM is utilized here to analyze the syntax of the target programming language. To reduce SQL grammar errors in the decoding process, we propose a new framework called GP to pre-train the parameters on the decoder side. Values are appended to columns when they exactly match words in the question. Schema information is enriched as the input of the encoder by averaging the embeddings of all tokens of a column or table instead of only the first and last token. Finally, we adopt the flooding level to avoid local minima of the loss during training. The results show that this method achieves better performance on the Spider dataset, and it may also benefit other context-free grammar representation tasks. Furthermore, since parameter tuning is complex, a tiny difference in parameters, especially the learning rate, can lead to completely different results. The model thus still has considerable room for further improvement, and tuning methods will be assessed in future work.
Acknowledgments
The authors thank Manfang Wu and Jian Cai for their assistance and advice. We also acknowledge Xuefeng Li and our anonymous reviewers for their comments. The Big Data Lab is operated by OneConnect Financial Technology.
References
- Baik, C., Jagadish, H. V., & Li, Y. (2019). Bridging the semantic gap with SQL query logs in natural language interfaces to databases. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 374–385. IEEE.
- Bogin, B., Gardner, M., & Berant, J. (2019). Global reasoning over database structures for text-to-SQL parsing. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3650–3655.
- Chen, W., Zha, H., Chen, Z., Xiong, W., Wang, H., & Wang, W. Y. (2020). HybridQA: A dataset of multi-hop question answering over tabular and textual data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 1026–1036.
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Guo, J., Zhan, Z., Gao, Y., Xiao, Y., Lou, J.-G., Liu, T., & Zhang, D. (2019). Towards complex text-to-SQL in cross-domain database with intermediate representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4524–4535.
- Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. PMLR.
- Ishida, T., Yamane, I., Sakai, T., Niu, G., & Sugiyama, M. (2020). Do we need zero training loss after achieving zero training error? In International Conference on Machine Learning, pp. 4604–4614. PMLR.
- Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR.
- Li, F., & Jagadish, H. (2014). Constructing an interactive natural language interface for relational databases. Proceedings of the VLDB Endowment, 8(1), 73–84.
- Lin, X. V., Socher, R., & Xiong, C. (2020). Bridging textual and tabular data for cross-domain text-to-SQL semantic parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 4870–4888.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237.
- Popescu, A.-M., Armanasu, A., Etzioni, O., Ko, D., & Yates, A. (2004). Modern natural language interfaces to databases: Composing statistical parsing with semantic tractability. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pp. 141–147, Geneva, Switzerland. COLING.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.
- Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826.
- Unger, C., Bühmann, L., Lehmann, J., Ngonga Ngomo, A.-C., Gerber, D., & Cimiano, P. (2012). Template-based question answering over RDF data. In Proceedings of the 21st International Conference on World Wide Web, pp. 639–648.
- Vinyals, O., Fortunato, M., & Jaitly, N. (2015). Pointer networks. Advances in Neural Information Processing Systems, 28, 2692–2700.
- Wang, B., Shin, R., Liu, X., Polozov, O., & Richardson, M. (2020). RAT-SQL: Relation-aware schema encoding and linking for text-to-SQL parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7567–7578.
- Yin, P., & Neubig, G. (2017). A syntactic neural model for general-purpose code generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 440–450.
- Yin, P., Neubig, G., Yih, W.-t., & Riedel, S. (2020). TaBERT: Pretraining for joint understanding of textual and tabular data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8413–8426.
- Yu, T., Li, Z., Zhang, Z., Zhang, R., & Radev, D. (2018). TypeSQL: Knowledge-based type-aware neural text-to-SQL generation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2018), pp. 588–594. Association for Computational Linguistics.
- Yu, T., Wu, C.-S., Lin, X. V., Wang, B., Chern Tan, Y., Yang, X., Radev, D., Socher, R., & Xiong, C. (2020). GraPPa: Grammar-augmented pre-training for table semantic parsing. arXiv e-prints, 2009.
- Yu, T., Yasunaga, M., Yang, K., Zhang, R., Wang, D., Li, Z., & Radev, D. (2018a). SyntaxSQLNet: Syntax tree networks for complex and cross-domain text-to-SQL task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1653–1663.
- Yu, T., Zhang, R., Yang, K., Yasunaga, M., Wang, D., Li, Z., Ma, J., Li, I., Yao, Q., Roman, S., et al. (2018b). Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. In EMNLP.
- Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
- Zhang, R., Yu, T., Er, H. Y., Shim, S., Xue, E., Lin, X. V., Shi, T., Xiong, C., Socher, R., & Radev, D. (2020). Editing-based SQL query generation for cross-domain context-dependent questions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5338–5349. Association for Computational Linguistics.
- Zhong, V., Xiong, C., & Socher, R. (2017). Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103.