Appendix
A Further Discussion on Set Operators
Based on the rules discussed on Section 3.4.1, NatSQL can simplify the SQL with INTERSECT (example is shown in Figure 1) and EXCEPT. As to the case that the set operator itself represents part of a condition, NatSQL allows them to follow the WHERE keyword. As illustrated in Table 1, this type of SQL is mainly related to the EXCEPT operator.
The prediction NatSQL prediction work in Table 1 is easier than others. NatSQL here only needs to predict the ‘cartoon’ table, instead of predicting the ‘cartoon.channel’ column. Predicting a table is easier than predicting a column because the premise of finding the correct column is to find the correct table. Besides, many models output ‘cartoon.id’ instead of ‘cartoon.channel’ because the annotation of ‘cartoon.id’ is the same as‘tv_channel.id’ column.
Ques | Find the id of tv channels that do not play any cartoon |
---|---|
SQL | SELECT id FROM tv_channel EXCEPT |
SELECT channel FROM cartoon | |
SemQL | SELECT tv_channel.id EXCEPT |
SELECT cartoon.channel | |
NatSQL | SELECT tv_channel.id WHERE except cartoon.* |
In addition to the conditions mentioned in Section 3.4.1 that cannot be concatenated, Table 2 present one more example. They can not concatenate because one WHERE condition can not concatenate a HAVING condition by a OR operator.
Ques | Which film is rented at a fee of 0.99 or has less |
---|---|
than 3 in the inventory? | |
SemQL | SELECT film.title WHERE film.rental_rate = 0.99 |
UNION | |
SELECT film.title WHERE count(inventory.*) 3 | |
NatSQL | SELECT film.title WHERE film.rental_rate = 0.99 |
OR count(inventory.*) 3 |
B Further Discussion on Executable SQL Generation
In Section 3.6, we discuss the different questions in Figure 5 generate different NatSQL, where training data is the key. Firstly, in the dataset, for SQL with multiple WHERE conditions, the order of the conditions is mostly consistent with the question. Secondly, the NatSQL further expands this type of training data. For example, the NatSQL queries in Figure 1,3,4 contain multiple WHERE conditions, which appear in the order they are mentioned. These training data make it possible for models to generate different NatSQL according to the different questions in Figure 5.
C Gold NatSQL Error Analysis
Table 3 presents the F1 score of NatSQL for different SQL components. We observe that the main errors come from GROUP BY and IUE matching. Although NatSQL cannot be converted to all gold GROUP BY clauses, most of these errors don’t affect the execution results. The IUE errors occur because NatSQL only supports one IUE operator per query.
Component | F1 | Component | F1 |
---|---|---|---|
select | 0.997 | where | 0.969 |
group | 0.879 | order | 0.996 |
and/or | 0.998 | IUE | 0.900 |
keywords | 0.989 |
Some other errors are due to the limitation of the exact match evaluation method when evaluating the JOIN ON clause of subqueries and sub-subqueries. Specifically, when the FROM and JOIN in a generated subquery is not identical to the gold SQL, the Spider evaluation scheme considers it to be wrong. For example, the following two SQL statements have the same semantic meaning, but they are recognized as different by the Spider exact match evaluation method, thus results in an exact match error.
… col in ( SELECT col FROM T1 JOIN T2 … )
… col in ( SELECT col FROM T2 JOIN T1 … )
D SQL, SemQL and NatSQL Examples
We present more examples in Table 4.
Ques: | What are the name of the countries where there is not a single car maker? |
---|---|
SQL: | SELECT CountryName FROM countries EXCEPT SELECT T1.CountryName FROM countries AS T1 JOIN car_makers AS T2 ON T1.countryId = T2.Country; |
SemQL: | Not Support |
NatSQL: | SELECT countries.countryname WHERE except @ is car_makers.* |
Ques: | Find the last name of the students who currently live in the state of North Carolina but have not registered in any degree program. |
SQL: | SELECT T1.staff_name FROM staff AS T1 JOIN Staff_DA AS T2 ON T1.staff_id = T2.staff_id WHERE T2.job_title_code = ”Sales Person” EXCEPT |
SELECT T1.staff_name FROM staff AS T1 JOIN Staff_DA AS T2 ON T1.staff_id = T2.staff_id WHERE T2.job_title_code = ”Clerical Staff” | |
SemQL: | SELECT staff.staff_name WHERE Staff_DA.job_title_code = ”Sales Person” EXCEPT |
SELECT staff.staff_name WHERE Staff_DA.job_title_code = ”Clerical Staff” | |
NatSQL: | SELECT staff.staff_name WHERE Staff_DA.job_title_code = ”Sales Person” |
AND Staff_DA.job_title_code != ”Clerical Staff” | |
Ques: | Find id of the tv channels that from the countries where have more than two tv channels. |
SQL: | SELECT id FROM tv_channel GROUP BY country HAVING count(*) 2 |
SemQL: | SELECT tv_channel.id WHERE count ( tv_channel.* ) 2 |
NatSQL: | SELECT tv_channel.id WHERE count ( tv_channel.* ) 2 |
Ques: | List all song names by singers above the average age. |
SQL: | SELECT song_name FROM singer WHERE age ( SELECT avg(age) FROM singer ) |
SemQL: | SELECT singer.song_name WHERE singer.age ( SELECT avg(singer.age) ) |
NatSQL: | SELECT singer.song_name singer WHERE @ avg ( age ) |
Ques: | Which district has both stores with less than 3000 products and stores with more than 10000 products? |
SQL: | SELECT district FROM shop WHERE Number_products 3000 INTERSECT SELECT district FROM shop WHERE Number_products 10000 |
SemQL: | SELECT shop.district WHERE shop.Number_products 3000 INTERSECT SELECT shop.district WHERE shop.Number_products 10000 |
NatSQL: | SELECT shop.district WHERE shop.number_products 3000 and shop.number_products 10000 |