
Mustard: Mastering Uniform Synthesis of Theorem and Proof Data

Abstract

Recent large language models (LLMs) have achieved significant advances in various tasks, including mathematical reasoning and theorem proving. Because these two tasks require strict and formal multi-step inference, they are appealing domains for exploring the reasoning ability of LLMs, yet they still face important challenges. Previous studies such as Chain-of-Thought (CoT) have revealed the effectiveness of intermediate-step guidance. However, such step-wise annotation requires heavy labor, leaving current benchmarks short of sufficient training steps. To fill this gap, this work introduces Mustard, a data generation framework that masters uniform synthesis of theorem and proof data of high quality and diversity. Mustard synthesizes data in three stages: (1) it samples a few mathematical concept seeds as the problem category; (2) it then prompts a generative language model with the sampled concepts to obtain both the problems and their step-wise formal solutions; (3) lastly, the framework utilizes a proof assistant (e.g., the Lean Prover) to filter the valid proofs. With the proposed Mustard, we present a theorem-and-proof benchmark, MustardSauce, with 7,335 valid data points. Each data point contains an informal statement, an informal proof, and a translated formal proof that passes the prover's validation. We perform extensive analysis and demonstrate that Mustard generates validated, high-quality step-by-step data. We further apply MustardSauce to fine-tune smaller language models. The fine-tuned Llama 2-7B achieves an average relative performance gain of 18.15% in automated theorem proving and 6.78% in math word problems.

1 Introduction

Large language models (LLMs) \citepDBLP:journals/corr/abs-2303-08774,gpt35 have shown promising reasoning capabilities in various domains, including math word problems and theorem proving \citepDBLP:journals/corr/abs-2110-14168,DBLP:conf/nips/HendrycksBKABTS21,DBLP:conf/iclr/ZhengHP22,DBLP:conf/iclr/WuJBG21. These two tasks, which require strict and sequential multi-step inference, have become appealing domains for evaluating and developing LLMs’ ability in complex reasoning. Recent works improve LLMs’ performance on math problems mainly through two techniques. The first is chain-of-thought (CoT) prompting \citepDBLP:conf/nips/Wei0SBIXCLZ22,DBLP:conf/nips/KojimaGRMI22,DBLP:conf/iclr/0002WSLCNCZ23, which provides step-by-step solutions to the LLMs. The second is to leverage the LLMs’ ability in code generation to produce formalized languages and utilize external solvers to obtain strict inference results \citepDBLP:conf/nips/WuJLRSJS22,DBLP:conf/iclr/JiangWZL0LJLW23,DBLP:journals/corr/abs-2009-03393,DBLP:conf/iclr/HanRWAP22,DBLP:conf/iclr/PoluHZBBS23. Both techniques rely on step-wise annotation to improve LLMs’ performance and interpretability on math problems.

Correct intermediate steps are crucial for LLMs to perform complex reasoning. However, high-quality step-wise annotations are hard to obtain; Figure 1 compares a few representative works. Previous works such as miniF2F \citepDBLP:conf/iclr/ZhengHP22 resort to manual annotation and validation to obtain high-quality step-wise labels. However, manual annotation requires heavy labor from knowledgeable experts, resulting in extremely small-scale datasets. Manual checking also does not guarantee the correctness of the data, as labelers can make mistakes. On the other hand, generating data with rule-based checking, such as ROSCOE \citepDBLP:conf/iclr/GolovnevaCPCZFC23, can produce large-scale reasoning data. Although the generated data are more friendly and readable for human beings, the correctness of the reasoning is not guaranteed by such rules. Moreover, another line of work such as INT \citepDBLP:conf/iclr/WuJBG21 performs rule-based synthesis to generate validated proofs, which are both correct and large-scale. However, the data are mechanically synthesized, so many generated proofs lack actual mathematical meaning. Therefore, we need a more efficient way to generate mathematical data that are large-scale, have accurate intermediate steps, and carry meaningful mathematical knowledge for human beings.

To fill this gap, we propose Mustard, a data generation framework that uniformly synthesizes large-scale, high-quality mathematical data by combining the advantages of LLMs in verbalization with those of formal theorem provers in rigorous data validation. Specifically, Mustard first samples a few mathematical concepts from a predefined list and prompts an LLM to generate a related question described in natural language. Then, it applies the LLM to generate the corresponding solution in both natural and formal language. Given the generated solution, Mustard further validates it using a theorem prover. A solution that passes the prover is considered correct and collected as a high-quality data point. An invalid solution, on the other hand, is considered a challenging sample: it is combined with the prover's error messages to prompt the LLM for a revision, and is added as a challenging data point.

By applying the proposed Mustard, one can obtain large amounts of problems and theorems covering the desired mathematical concepts and domains. Eventually, we build a mathematical dataset with validated informal and formal solutions, named MustardSauce (Mustard resource).

We conduct extensive data analysis and experiments on the generated MustardSauce. Through deep inspection of the data, we find that Mustard generates interesting and reasonable math problems by creatively combining two mathematical concepts, and that MustardSauce is diverse and has a high proportion of difficult data. We also observe that the prover is consistent with human evaluation: humans usually consider a validated solution to have higher quality than one without formal validation. Lastly, we fine-tune smaller-scale language models on MustardSauce. The fine-tuned Llama 2-7B achieves a 20.9% improvement on zero-shot inference on GSM8K and a pass@1 of 8.7 on mathlib. These results demonstrate the effectiveness of MustardSauce in improving the mathematical reasoning capabilities of language models.

Figure 1: A comparison of methods of synthesizing and validating intermediate reasoning steps.

The contributions of this paper are summarized as follows:

1. We propose Mustard, a novel framework that generates high-quality mathematical data (both informal and formal) through an interplay between a generative language model and a theorem-prover assistant.

2. We release MustardSauce, which contains both math word problems and theorem-proving problems spanning four educational levels. Each sample has corresponding informal and formal solutions.

3. We conduct extensive analysis and experiments on the generated data, demonstrating their quality, diversity, and effectiveness in improving language models’ mathematical reasoning performance.

2 Related Works

Large Language Models for Mathematical Reasoning

Growing generative language models \citepDBLP:conf/nips/BrownMRSKDNSSAA20,gpt35,DBLP:journals/corr/abs-2303-08774 show compelling potential for solving mathematical problems, both with natural language proofs \citepDBLP:journals/corr/abs-2303-08774,gpt35 and in formal languages with theorem provers \citepDBLP:journals/corr/abs-2009-03393,DBLP:conf/iclr/HanRWAP22,DBLP:conf/iclr/PoluHZBBS23. On the other hand, some works explore using language models to automatically translate natural language proofs into formal ones given few-shot demonstrations \citepDBLP:conf/nips/WuJLRSJS22,DBLP:conf/iclr/JiangWZL0LJLW23,DBLP:journals/corr/abs-2309-04295. Chain-of-thought reasoning \citepDBLP:conf/nips/Wei0SBIXCLZ22,DBLP:conf/nips/KojimaGRMI22,DBLP:conf/iclr/0002WSLCNCZ23 has been demonstrated to help LLMs derive correct answers. However, some recent works \citepDBLP:conf/iclr/Saparov023,DBLP:conf/iclr/GolovnevaCPCZFC23 observe that the intermediate reasoning steps can be inconsistent. This paper proposes a data generation framework that taps the comprehensive mathematical reasoning capabilities of large language models: it generates mathematical reasoning problems with informal and formal solutions that are step-wise validated by a formal theorem prover. With the framework, we obtain high-quality mathematical data.

Synthesizing Mathematical Data

Obtaining large-scale, high-quality mathematical data is a long-standing challenge. Previous datasets rely on well-trained annotators to hand-craft and review formal proofs \citepDBLP:conf/iclr/ZhengHP22, which is time-consuming and labor-intensive and results in small data scale. [DBLP:conf/nips/WangD20] construct a neural generator for data synthesis, but it still requires the intervention of human-written data. Besides, [DBLP:conf/iclr/WuJBG21] explore using a theorem generator to automatically generate formal proofs with rules. However, the rule-based generation depends on given axioms in specified orders, so the generated data are restricted to a few domains. On the other hand, recent works demonstrate the effectiveness of distilling knowledge from large language models \citepDBLP:conf/naacl/WestBHHJBLWC22,DBLP:conf/acl/YuanCFGSJXY23,DBLP:journals/corr/abs-2308-06259, and some of them \citepDBLP:conf/acl/WangKMLSKH23,DBLP:journals/corr/abs-2304-12244 explore data evolution by properly prompting the language models. The proposed framework elicits mathematical knowledge from large language models to achieve diverse and large-scale mathematical data, with an interplay between the language model and a formal proof assistant controlling the quality and difficulty of the data. Using the proposed framework, we collect a large-scale mathematical dataset containing diverse math questions of multiple difficulty levels with high-quality solutions.

3 Mustard

In this work, we aim to obtain large-scale mathematical data with multi-step annotations, and we propose Mustard to generate diverse, high-quality math and theorem-proving problems with multi-step informal and formal solutions. As shown in Figure 2, Mustard consists of three stages. In the first, concept-seeding stage, Mustard samples a set of math concepts as the problem domain. In the second, solution-generation stage, it generates the concept-related problem and solution by prompting an LLM. In the third stage, a theorem prover is used to validate the generated solution; if the solution cannot pass the prover, the error message is returned to the second stage for another round of solution generation. Through this interplay between the LLM and a formal proof assistant, Mustard generates diverse and high-quality data containing both informal and formal solutions. We describe the details of each stage in this section.
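To make the three stages concrete, the following is a minimal Python sketch of the pipeline loop under stated assumptions: generate, prove, and revise are caller-supplied callables standing in for the LLM and the Lean Prover, and the dictionary fields are illustrative placeholders rather than the actual implementation.

import random

def sample_concepts(pool, level):
    # Stage 1: uniformly sample 1 or 2 seed concepts for an educational level;
    # when k = 2, the concepts come from two distinct domains.
    k = random.choice([1, 2])
    domains = random.sample(list(pool[level]), k)
    return [(d, random.choice(pool[level][d])) for d in domains]

def mustard(pool, level, generate, prove, revise, max_corrections=2):
    seeds = sample_concepts(pool, level)
    sample = generate(seeds)                    # Stage 2: problem + proofs
    for n_corrections in range(max_corrections + 1):
        errors = prove(sample["formal_proof"])  # Stage 3: Lean validation
        if not errors:
            # Passed: keep as a valid data point; the number of correction
            # rounds serves as a difficulty proxy (Section 3.3).
            return dict(sample, seeds=seeds, difficulty=n_corrections)
        sample = revise(sample, errors)         # feed error messages back
    return None                                 # still invalid after budget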

3.1 Concept Seeding

We first define and build a mathematical concept pool that covers mathematical sub-subjects and educational levels as comprehensively as possible. Specifically, we collect all math courses on the Khan Academy website (https://www.khanacademy.org/math), a large-scale online educational platform. The resulting pool includes concepts at four educational levels: elementary school, middle school, high school, and higher education. Each educational level has 5 to 9 math domains, covering different types of math problems such as algebra and geometry. Each domain contains subdivided mathematical concepts that inspect different mathematical abilities, such as polynomial arithmetic or factorization. Concept statistics and the detailed concepts in each domain are demonstrated in Appendix B.

Given the concept pool, for each educational level, Mustard uniformly samples 1 or 2 concepts across all domains as seeds, and then generates mathematical problems that cover one or both concepts. In particular, within an educational level, taking 2 concepts from different domains challenges the model to generate problems that join diverse domains while keeping the problems reasonable.

Figure 2: Overview of Mustard. The mathematical theorem (indicated by “# Problem:”) is created according to the randomly selected seed concepts. Following the theorem, a corresponding informal proof is generated, which is then translated into a formal proof in Lean 3. The formal proof is passed into Lean Prover for quality feedback, and according to the error messages the formal proof is revised and improved. The proofs that pass the Lean Prover are regarded as high-quality data points and collected.

3.2 Proof Generation

Given the sampled mathematical concepts, Mustard generates math problems and their corresponding solutions. Specifically, Mustard leverages the capability of LLMs to generate natural language and code, prompting an LLM to generate the problem statement, its natural language solution, and a formal solution written in Lean. As a result, the LLM needs to complete the following three tasks: (T1) generating a math problem that relates to the given concepts; (T2) solving the math problem with a natural language proof; and (T3) performing auto-formalization to translate the written natural language proof into a formalized proof. In this work, we use GPT-4 [DBLP:journals/corr/abs-2303-08774] as the LLM for proof generation.

We intend to generate a problem based on the educational level, math domains, and concepts. Considering that mathematical problems include both proof and calculation, we also introduce the question type into the prompt to generate theorem-proving and word problems, respectively. Moreover, we do not include any exemplars or other manual interventions beyond the sampled concepts; this avoids potential biases introduced by the concepts inside exemplars and achieves more diverse generation. The prompt template is as follows:

You are a math expert. Now please come up with a math problem according to the following requirements. The math problem should contain a question part (indicated by ‘‘Problem: ’’), a corresponding solution in natural language (indicated by ‘‘Informal proof:’’), and a translated formal solution in Lean 3 (indicated by ‘‘Formal proof in Lean 3:’’). Please note that the informal proof and the formal proof need to be identical.
Please create a [QUESTION TYPE] in the level of [EDUCATIONAL LEVEL] based on the following knowledge point(s): [CONCEPT] in [DOMAIN]; [CONCEPT] in [DOMAIN].
You must respond in the following format:
# Problem: ...
# Informal proof: ...
# Formal proof in Lean 3: ...

The “[ ]” brackets indicate placeholders for the corresponding question type, educational level, concepts, and domains. Multiple concepts are separated by “;”. We retrieve the text after “Problem:”, “Informal proof:”, and “Formal proof in Lean 3:” as the generated sample.
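As an illustration of this retrieval step, the short Python sketch below splits a raw model response into the three sections using the required markers; the function name and regular expression are our own assumptions, not the paper's implementation.

import re

def parse_generation(response: str):
    # Split the response at the three required section markers.
    pattern = (r"#\s*Problem:\s*(?P<problem>.*?)"
               r"#\s*Informal proof:\s*(?P<informal>.*?)"
               r"#\s*Formal proof in Lean 3:\s*(?P<formal>.*)")
    match = re.search(pattern, response, flags=re.DOTALL)
    if match is None:
        return None  # malformed response: discard or regenerate
    return {name: text.strip() for name, text in match.groupdict().items()}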

3.3 Proof Filtering

In the proof-filtering stage, Mustard interacts with the Lean Prover \citepDBLP:conf/cade/MouraKADR15 to obtain validation messages for the proof steps, which guide data revision and filtering. Specifically, each formal solution is passed to the Lean Prover; if the prover returns no error message, the corresponding data point is collected into the valid dataset. Otherwise, Mustard collects the error messages from the prover and prompts the language model to revise the invalid solution. To help the language model locate the incorrect lines described in the error messages, we also add a line number at the beginning of each line in the formal solution. Verification and self-refinement are performed over multiple rounds until the LLM generates a valid solution. We use the number of rounds to measure the difficulty of the generated sample, assuming that a difficult problem is hard for an LLM to solve and requires more rounds of correction. The prompt template for a single round of correction is demonstrated as follows, and the complete prompt template is shown in Table 13 in Appendix C.1:

# Formal proof (c) in Lean 3:
‘‘‘lean
line 1 <code>
line 2 <code>
line 3 <code>
...
‘‘‘
# Error messages for Formal proof (c) from Lean Prover:
<error messages>
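The numbering and prompt assembly can be sketched as follows; these are hypothetical helpers that follow the template above and in Table 13, and the exact string handling is an assumption.

def number_lines(formal_proof: str) -> str:
    # Prefix each line with "line k" so the LLM can locate the lines
    # referenced by the prover's error messages.
    return "\n".join(f"line {i} {code}"
                     for i, code in enumerate(formal_proof.splitlines(), 1))

def correction_prompt(problem: str, informal_proof: str, attempts) -> str:
    # attempts: list of (formal_proof, error_messages) pairs from all
    # previous rounds, oldest first.
    parts = [f"# Problem: {problem}", f"# Informal proof: {informal_proof}"]
    for k, (proof, errors) in enumerate(attempts, 1):
        parts.append(f"# Formal proof ({k}) in Lean 3:\n"
                     f"```lean\n{number_lines(proof)}\n```")
        parts.append(f"# Error messages for Formal proof ({k}) "
                     f"from Lean Prover:\n{errors}")
    return "\n".join(parts)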

4 Experiments

4.1 Case Study

We first inspect the data points generated by Mustard. Table 1 shows a generated math problem in which Mustard creatively combines two mathematical concepts and constructs a reasonable question that involves knowledge from both, suggesting that Mustard can join concepts into coherent questions. Furthermore, Table 2 demonstrates a case in which Mustard provides solid and comprehensive solutions in both natural language and Lean. Although the formal proof is long, it is consistent and passes the prover's validation, demonstrating that Mustard can generate long, valid solutions.

Table 1: An informal statement generated by Mustard. The complete data point is demonstrated in Table 18 in Appendix E.
Question Type: Theorem Proving, Educational Level: Middle School, k=2.
Concept(s): Geometry in 8th grade; Algebraic expressions in Algebra basics.
Informal Statement
Given a rectangle ABCD where AB is $x+5$ and AD is $2x-3$. Prove that the area of the rectangle ABCD is $(2x^2+7x-15)$ square units.
Table 2: A data point generated by Mustard.
Question Type: Word Problem, Educational Level: Higher Education, k=1.
Concept(s): Series in Integral Calculus.
Informal Statement
Prove that the integral of the series $\sum_{n=1}^{\infty} \frac{-1^{(n+1)}}{n}$ equals $\ln(2)$.
Informal Proof
We start by recognizing the series $\sum_{n=1}^{\infty} \frac{-1^{(n+1)}}{n}$ as the alternating harmonic series, which is known to converge to $\ln(2)$.
Formal Statement and Proof
import data.real.basic
import analysis.special_functions.pow
open real
lemma integral_of_series :
∫ x in 0..1, (ln (1 + x)) = ∑' (n : ℕ), ((-1)^(n+1) / (n + 1)) :=
begin
have h_der : ∀ x ∈ set.Icc (0 : ℝ) 1, deriv (λ x, (ln (1 + x))) x = 1 / (1 + x),
{ intros x hx,
have h_pos : 0 < 1 + x := by nlinarith,
exact deriv_ln_one_add h_pos },
have h_int : interval_integral (λ x, 1 / (1 + x)) 0 1 volume = ∑' (n : ℕ), ((-1)^(n+1) / (n + 1)),
{ have h_frac : ∀ (n : ℕ), ∫ x in 0..1, x^n = 1 / (n + 1),
{ intro n,
calc ∫ x in 0..1, x^n = [x^(n+1) / (n+1)] | 0..1 : integral_pow n
... = 1 / (n + 1) : by { rw integral_interval, simp } },
rw [interval_integral.integral_of_le, h_frac],
simp,
linarith },
have h_eq : ∫ x in 0..1, (ln (1 + x)) = ∫ x in 0..1, 1 / (1 + x),
{ congr, ext, exact h_der x (set.mem_Icc.mpr ⟨by nlinarith, by nlinarith⟩) },
rw [h_eq, h_int],
end

4.2 Human Evaluation

To further explore the quality of the data generated by Mustard, we recruit professionals who have expertise in mathematics and the Lean language to perform a sanity check on the data points. We randomly select 200 data points from the generated data, 100 of which pass the Lean Prover (Group Valid) and 100 of which do not (Group Invalid). The sanity check covers the four sections in each data point (i.e., informal statement, informal proof, formal statement, and formal proof), and includes factuality check and consistency check. Specifically, a high-quality data point should have a factually correct informal statement (D1) and a correct solution (D4). The formal statement and proof should be aligned with the informal descriptions (D5, D6). Moreover, the desired data point should meet the specified seed concepts (D2) and question type (D3). The six inspection dimensions and their requirements are demonstrated in Table 3. A data point is scored 1 in a dimension if it meets the requirement, otherwise, it gets 0.
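The paper does not state which statistical test yields the p-values reported in Table 3; as one plausible reconstruction, a chi-square test on the 2x2 pass/fail contingency table per dimension could be computed as sketched below (the integer pass counts and group sizes are assumptions).

from scipy.stats import chi2_contingency

def dimension_pvalue(pass_valid: int, pass_invalid: int,
                     n_valid: int = 100, n_invalid: int = 100) -> float:
    # 2x2 table: rows = Group Valid / Group Invalid, columns = pass / fail.
    table = [[pass_valid, n_valid - pass_valid],
             [pass_invalid, n_invalid - pass_invalid]]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value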

Table 3: Inspection dimensions and requirements in human evaluation. IS: Informal Statement. IP: Informal Proof. FS: Formal Statement. FP: Formal Proof. RT: Reasoning Type. Significant results ($p<0.005$) are marked in bold.
Inspection Dimension Requirement Valid Invalid p-value
(D1) IS Correctness Whether the informal statement is factually correct. 93.50 83.50 0.00167
(D2) IS Relevance Whether the informal statement is relevant to each seed concept. 87.50 92.50 0.09604
(D3) RT Classification Whether the informal statement is of the required question type. 67.00 68.50 0.74903
(D4) IP Correctness Whether the informal proof correctly solves the informal statement. 88.50 73.50 0.00012
(D5) IS-FS Alignment Whether the informal statement and the formal statement describe the same problem and are aligned with each other. 74.00 66.50 0.10138
(D6) IP-FP Alignment Whether the informal proof and the formal proof describe the same solution and have aligned proof steps. 72.00 54.00 0.00018
Table 4: Maj1@1 results on GSM8K (G) and MATH (M). Zero: Zero-shot. Few: Few-shot. > denotes a fine-tuning step. in: MustardSauce-invalid. ra: MustardSauce-random. va: MustardSauce-valid. tt: MustardSauce-tt. gt: GSM8K training split.
MODEL Zero (G) Few (G) Zero (M) Few (M) MODEL Zero (G) Few (G) Zero (M) Few (M)
Baselines
GPT2-large 3.4 5.1 0.6 1.0 GPT2-large > gt 14.6 17.4 4.6 6.8
Llama 2-7B 7.2 12.8 2.0 2.6 Llama 2-7B > gt 24.5 28.2 10.4 12.6
Fine-tuning
GPT2-large > tt 3.9 6.3 1.2 2.0 GPT2-large > tt > gt 15.9 18.5 5.0 7.8
GPT2-large > in 3.6 5.8 1.0 1.8 GPT2-large > in > gt 14.6 17.9 4.8 7.4
GPT2-large > ra 3.8 6.3 1.0 2.0 GPT2-large > ra > gt 15.8 18.4 4.8 7.6
GPT2-large > va 4.1 (+7.89%) 6.1 (-3.17%) 1.4 (+40.00%) 2.2 (+10.00%) GPT2-large > va > gt 16.0 (+1.27%) 18.7 (+1.63%) 5.2 (+8.33%) 7.8 (+2.63%)
Llama 2-7B > tt 9.0 15.5 3.0 3.4 Llama 2-7B > tt > gt 26.1 30.2 11.8 13.8
Llama 2-7B > in 8.3 14.4 2.4 3.2 Llama 2-7B > in > gt 25.4 28.2 10.8 12.8
Llama 2-7B > ra 8.9 14.9 2.8 3.4 Llama 2-7B > ra > gt 26.1 29.9 11.6 13.6
Llama 2-7B > va 9.1 (+2.25%) 15.7 (+5.37%) 3.0 (+7.14%) 3.8 (+11.76%) Llama 2-7B > va > gt 26.3 (+0.77%) 30.8 (+3.01%) 12.2 (+5.17%) 14.2 (+4.41%)

The accuracies of Group Valid and Group Invalid in each dimension are demonstrated in Table 3, together with the corresponding p-values. (D4) and (D6) show significant differences in accuracy between the two groups, indicating that the validated data points have significantly better informal proofs and auto-formalization results. As a result, given the Lean Prover's validation of the formal proofs, the data points are guaranteed to have high-quality informal proofs. Moreover, (D1) also shows a significant difference between the two groups. The differences in statement alignment (D5) and informal statement relevance (D2) between the two groups are less significant. Furthermore, no significant difference is observed in question-type classification (D3), which indicates that the Lean Prover validation in Mustard does not significantly influence the classification. Overall, the human evaluation results suggest that formally validated data have significantly higher quality.

4.3 Data Quality by Downstream Application

To evaluate the impact of MustardSauce on enhancing mathematical reasoning abilities, we use the data to fine-tune smaller-scale language models and evaluate them on math word problems (MWP) and automated theorem proving (ATP). Specifically, among all the generated data, 7,335 data points pass the Lean Prover; we denote this subset MustardSauce-valid. We then extract the same number of invalid data points as the MustardSauce-invalid subset, and an equally sized random subset, MustardSauce-random. Each MustardSauce subset thus contains 7,335 data points. Moreover, we randomly split MustardSauce-valid into 6,335 training, 500 validation, and 500 test data points for benchmarking model performance on the dataset, and denote the test set MustardSauce-test. Furthermore, we also use the entire generated set of 28,320 data points, which we denote MustardSauce-tt.
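A minimal sketch of carving out these subsets is given below; it assumes each generated data point records whether it passed the prover (the passes_prover field and the fixed seed are illustrative assumptions).

import random

def build_subsets(all_points, seed=0):
    rng = random.Random(seed)
    valid = [d for d in all_points if d["passes_prover"]]        # 7,335 points
    invalid_pool = [d for d in all_points if not d["passes_prover"]]
    invalid = rng.sample(invalid_pool, len(valid))               # same size
    random_subset = rng.sample(all_points, len(valid))           # same size
    # Split MustardSauce-valid into train/validation/test.
    shuffled = rng.sample(valid, len(valid))
    train, dev, test = shuffled[:6335], shuffled[6335:6835], shuffled[6835:]
    return valid, invalid, random_subset, (train, dev, test)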

We employ LoRA \citephu2021lora to fine-tune the open-source GPT2-large \citepradford2019language, Llama 2-7B, and Llama 2-70B \citepDBLP:journals/corr/abs-2307-09288 on each MustardSauce subset. The detailed model configuration and training procedure are described in Appendix F. For the math word problem task, we use GSM8K \citepDBLP:journals/corr/abs-2110-14168 and the MATH dataset [DBLP:conf/nips/HendrycksBKABTS21] for evaluation (for MATH, we use the released 500-problem test split from PRM800K \citeplightman2023let, which is demonstrated to be representative of the MATH test set as a whole). For evaluating automated theorem proving, we use mathlib (https://github.com/leanprover-community/mathlib) and the miniF2F \citepDBLP:conf/iclr/ZhengHP22 benchmark. We also evaluate models on MustardSauce-test after fine-tuning them on the MustardSauce-valid training split. Tables 4 and 5 demonstrate the model performances. We also follow [DBLP:conf/iclr/HanRWAP22] to ablate the fine-tuning steps and demonstrate the results in Table 6.
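As a sketch of this fine-tuning setup, the snippet below applies LoRA to Llama 2-7B with the Hugging Face peft library; the rank, alpha, and target modules shown are illustrative assumptions, and the authors' actual configuration is described in Appendix F.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,      # assumed hyperparameters
    target_modules=["q_proj", "v_proj"],         # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...then train with the standard causal-LM objective on a MustardSauce subset.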

In general, fine-tuning the models on MustardSauce improves their mathematical reasoning. On average, we observe an 18.15% relative performance gain after fine-tuning with MustardSauce-valid compared with fine-tuning with MustardSauce-random in ATP (Table 5), and 6.78% in MWP (Table 4). Specifically, in ATP, Llama 2-7B achieves significant relative gains of 16.00% on both mathlib and miniF2F, and 17.31% on MustardSauce-test. In MWP, the performance improvements are consistent across the two datasets and across both zero-shot and few-shot inference. We further compare the results of fine-tuning with MustardSauce-tt and MustardSauce-valid, and find that models fine-tuned on the entire generated data are inferior to models fine-tuned on MustardSauce-valid. Although the larger amount of fine-tuning data makes the model perform better than fine-tuning on MustardSauce-invalid and MustardSauce-random, its performance still lags behind fine-tuning on the smaller but higher-quality data. Therefore, our proposed framework that introduces the theorem prover is effective and beneficial. Furthermore, complementary experimental results for the larger Llama 2-70B are demonstrated in Table 28 in Appendix G. The results suggest that our method remains effective when fine-tuning a larger language model.

Table 5: Pass@1 results on automated theorem proving tasks. > denotes a fine-tuning step. test: MustardSauce-test. in: MustardSauce-invalid. ra: MustardSauce-random. va: MustardSauce-valid. tt: MustardSauce-tt. mt: mathlib training split. Note that the reported results on MustardSauce-test are obtained by only fine-tuning on the MustardSauce-valid training split.
MODEL mathlib miniF2F test MODEL mathlib miniF2F test
Baselines
GPT2-large 0.0 0.0 0.0 GPT2-large > mt 5.6 2.9 8.6
Llama 2-7B 0.0 0.0 0.0 Llama 2-7B > mt 14.3 7.0 10.8
Fine-tuning
GPT2-large > in 2.0 0.0 6.0 GPT2-large > in > mt 5.9 2.0 8.2
GPT2-large > ra 3.0 1.2 7.0 GPT2-large > ra > mt 6.6 2.9 9.6
GPT2-large > va 3.7 (+23.33%) 1.6 (+33.33%) 8.3 (+18.57%) GPT2-large > va > mt 7.4 (+12.12%) 3.7 (+27.59%) 10.6 (+10.42%)
Llama 2-7B > tt 8.3 2.6 11.7 Llama 2-7B > tt > mt 15.1 7.0 13.6
Llama 2-7B > in 5.8 1.2 8.6 Llama 2-7B > in > mt 11.6 5.7 12.6
Llama 2-7B > ra 7.5 2.5 10.4 Llama 2-7B > ra > mt 14.7 6.6 13.2
Llama 2-7B > va 8.7 (+16.00%) 2.9 (+16.00%) 12.2 (+17.31%) Llama 2-7B > va > mt 15.7 (+6.80%) 7.8 (+18.18%) 14.4 (+18.18%)
Table 6: Ablation study of different fine-tuning settings. test: MustardSauce-test. va: MustardSauce-valid. mt: mathlib training split.
MODEL test
GPT2-large   >   va 8.3
GPT2-large   >   va   >   mt 10.6
GPT2-large   >   mt   >   va 9.8
Llama 2-7B   >   va 12.2
Llama 2-7B   >   va   >   mt 14.4
Llama 2-7B   >   mt   >   va 13.8
Figure 3: Model performances fine-tuned with different data scales.

4.4 Impact of Data Scalability

To further study the impact of data scale on the fine-tuning results, we randomly sample 75%, 50%, 25%, and 0% of the data from MustardSauce-valid and fine-tune Llama 2-7B. The results are shown in Figure 3. In general, the results on all datasets increase as the fine-tuning data scales up. Specifically, performances on MustardSauce-test and mathlib grow most significantly, without a decrease in growth rate. Therefore, we expect further performance improvements as more high-quality data are included.

4.5 Pass Rate

We study the mathematical generation ability of Mustard by investigating its pass rates on generating valid data points. The pass@1 results of the formal proofs generated by GPT-4 [DBLP:journals/corr/abs-2303-08774] and GPT-3.5 [gpt35] are shown in Table 7. We have the following observations. First of all, the overall pass@1 results are high, showing that the LLMs, especially GPT-4, are capable of performing zero-shot mathematical reasoning. Second, the pass rates of word problems are generally higher than those of theorem proving, indicating that word problems are relatively easier and more familiar to the LLMs, while theorem proving is more challenging. Third, all-at-once generation and step-by-step generation have similar pass rates at lower educational levels. For more challenging questions, such as those at the high-school and higher-education levels, step-by-step generation shows slight advantages over all-at-once generation. This indicates that dividing and conquering tasks (T1), (T2), and (T3) helps the model generate higher-quality formal proofs, but the improvement is limited. Last but not least, the improvements in pass rates after 1-step and 2-step corrections are significant. For example, theorem proving at the elementary-school level with 1 seed concept improves by 22.0% after 1-step correction and by 33.9% after 2-step correction. Word problems at the elementary-school level with 1 seed concept improve by 45.3% after 2-step correction. The most difficult setting, generating theorem-proving data at the higher-education level with 2 seed concepts, achieves a 2.8% improvement after 2-step correction, and the word-problem counterpart achieves a 6.1% improvement. This indicates that the LLMs have great potential for self-correction given error-message feedback and limited instructions.
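For clarity, the pass@1 values after 0, 1, and 2 correction rounds can be computed from per-sample records as sketched below; the record fields are hypothetical stand-ins for however the generation logs are stored.

from collections import defaultdict

def pass_at_1_by_group(records):
    # records: dicts with 'qtype', 'level', 'k', and 'rounds_to_valid'
    # (None if the sample never passed the prover).
    groups = defaultdict(list)
    for r in records:
        groups[(r["qtype"], r["level"], r["k"])].append(r["rounds_to_valid"])
    table = {}
    for key, rounds in groups.items():
        n = len(rounds)
        # Pass rate after allowing at most c correction rounds.
        table[key] = [100.0 * sum(1 for x in rounds
                                  if x is not None and x <= c) / n
                      for c in (0, 1, 2)]
    return table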

Table 7: Pass@1 of formal proofs generated by Mustard. k: Number of seed concepts.
Theorem Proving Word Problem
All (GPT-4) Step (GPT-4) All (GPT-3.5) All (GPT-4) Step (GPT-4) All (GPT-3.5)
#correction 0 1 (Δ) 2 (Δ) 0 0 0 1 (Δ) 2 (Δ) 0 0
k=1 elem 26.0 48.0 (+22.0) 55.9 (+33.9) 25.5 15.1 38.0 59.7 (+21.7) 67.0 (+45.3) 40.1 22.2
midd 16.4 31.8 (+15.4) 39.6 (+24.2) 17.0 4.3 22.9 39.7 (+16.8) 47.4 (+30.6) 28.4 7.0
high 6.8 14.4 (+7.6) 17.2 (+9.6) 6.6 1.9 6.7 16.8 (+10.1) 21.9 (+11.8) 8.1 3.4
higher 2.1 5.6 (+3.5) 9.8 (+6.3) 3.0 0.7 3.6 10.9 (+7.3) 16.0 (+8.7) 4.3 2.7
k=2 elem 24.1 42.3 (+18.2) 52.3 (+34.1) 23.2 12.8 32.2 49.5 (+17.3) 58.2 (+40.9) 31.9 22.1
midd 14.0 25.2 (+11.2) 34.1 (+22.9) 15.0 5.3 16.9 27.3 (+10.4) 34.5 (+24.1) 17.3 5.7
high 3.8 8.3 (+4.5) 12.1 (+7.6) 5.4 2.2 5.7 11.0 (+5.3) 16.2 (+10.9) 4.5 2.5
higher 1.1 3.3 (+2.2) 5.0 (+2.8) 2.6 1.3 2.6 7.0 (+4.4) 10.5 (+6.1) 2.9 2.1

4.6 Diversity and Difficulty

We compute ROUGE-L \citeplin-2004-rouge to check the diversity of the generated informal statements and proofs. The resulting ROUGE-L scores are below 0.25, indicating high data diversity. We demonstrate the detailed computation and results in Appendix D.2. We then investigate the proof lengths in MustardSauce, whose distributions are demonstrated in Figure 4. We count both the reasoning steps of formal statement-proof pairs and the steps of the formal proof only, shown on the left- and right-hand side of Figure 4, respectively. Proof length increases with educational level: solving elementary problems takes about 5 to 10 steps, while solving higher-education problems takes a median of 10 to 15 steps. The most challenging problems require around 30 reasoning steps or about 20 formal proof steps. Therefore, MustardSauce provides diverse mathematical problems with multiple topics and difficulty levels.
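A simple proxy for these step counts is to count non-empty lines of the formal text, as sketched below; the paper's exact step-counting rule is not specified here, and the field names are assumptions.

def proof_lengths(data_points):
    def steps(text: str) -> int:
        # Count non-empty lines as reasoning steps.
        return sum(1 for line in text.splitlines() if line.strip())
    # Return (statement + proof steps, proof-only steps) per data point,
    # matching the left- and right-hand sides of Figure 4.
    return [(steps(d["formal_statement"] + "\n" + d["formal_proof"]),
             steps(d["formal_proof"]))
            for d in data_points]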

Figure 4: Distributions of formal proof lengths.

5 Conclusion

In this paper, we introduce Mustard, which automatically generates mathematical datasets with high-quality solutions covering a variety of mathematical skills. Leveraging an LLM, Mustard generates the problem statement, informal solution, and formal solution, and uses the Lean Prover to automatically verify the formal solution and provide feedback for revision. Finally, we apply the proposed Mustard and obtain 7,335 problems with step-by-step solutions that cover different educational levels and mathematical abilities. The obtained dataset demonstrates high quality, diversity, and effectiveness in improving language models’ mathematical reasoning performance, showing the great potential of our proposed Mustard and dataset for further research on language models.

Appendix A Future Works

Current formal validation still suffers from mild inconsistency between the #reduce statements and various kinds of theorem proofs. In the future, we will explore more rigorous and careful data filtering. We will also explore data generation and mathematical reasoning via the same language model, which is an interesting setup for studying large language models’ proficiency in mathematical reasoning. Moreover, the ablation study on data scalability shows consistent performance increases when more data from Mustard are introduced, suggesting great potential for scaling up. Fortunately, Mustard reduces the cost of acquiring such high-quality, step-by-step complex reasoning data and obtains correct, scalable, and reusable data. Therefore, in future work, we would like to build a community in which all members can join the data synthesis process, and acquire and share more high-quality data with the whole community.

Appendix B Mathematical Concepts

Table 8 shows the concept statistics. Concepts are grouped into multiple domains, mainly according to subject, except that concepts at the elementary level are grouped by grade due to the lack of subject division. Tables 9, 10, 11 and 12 demonstrate the detailed concepts in each domain.

Table 8: Statistics of the mathematical concept pool. The number in parentheses after each domain is its number of concepts.
Elementary School (6 domains, 63 concepts): 1st grade (3), 2nd grade (8), 3rd grade (14), 4th grade (14), 5th grade (16), 6th grade (8)
Middle School (5 domains, 51 concepts): 7th grade (7), 8th grade (7), Algebra basics (8), Pre-algebra (15), Basic geometry and measurement (14)
High School (9 domains, 88 concepts): Algebra 1 (16), Algebra 2 (12), High school geometry (9), Trigonometry (4), Statistics and probability (16), High school statistics (7), Precalculus (10), Calculus 1 (8), Calculus 2 (6)
Higher Education (9 domains, 72 concepts): AP College Statistics (14), College Algebra (14), Differential Calculus (6), Integral Calculus (5), AP College Calculus AB (10), AP College Calculus BC (12), Multivariable calculus (5), Differential equations (3), Linear algebra (3)
Table 9: Mathematical concepts at the level of elementary school.
Domain Concept
1st grade “Place value”, “Addition and subtraction”, “Measurement, data, and geometry”,
2nd grade “Add and subtract within 20”, “Place value”, “Add and subtract within 100”, “Add and subtract within 1,000”, “Money and time”, “Measurement”, “Data”, “Geometry”,
3rd grade “Intro to multiplication”, “1-digit multiplication”, “Addition, subtraction, and estimation”, “Intro to division”, “Understand fractions”, “Equivalent fractions and comparing fractions”, “More with multiplication and division”, “Arithmetic patterns and problem solving”, “Quadrilaterals”, “Area”, “Perimeter”, “Time”, “Measurement”, “Represent and interpret data”,
4th grade “Place value”, “Addition, subtraction, and estimation”, “Multiply by 1-digit numbers”, “Multiply by 2-digit numbers”, “Division”, “Factors, multiples and patterns”, “Equivalent fractions and comparing fractions”, “Add and subtract fractions”, “Multiply fractions”, “Understand decimals”, “Plane figures”, “Measuring angles”, “Area and perimeter”, “Units of measurement”,
5th grade “Decimal place value”, “Add decimals”, “Subtract decimals”, “Add and subtract fractions”, “Multi-digit multiplication and division”, “Multiply fractions”, “Divide fractions”, “Multiply decimals”, “Divide decimals”, “Powers of ten”, “Volume”, “Coordinate plane”, “Algebraic thinking”, “Converting units of measure”, “Line plots”, “Properties of shapes”,
6th grade “Ratios”, “Arithmetic with rational numbers”, “Rates and percentages”, “Exponents and order of operations”, “Negative numbers”, “Variables & expressions”, “Equations & inequalities”, “Plane figures”,
Table 10: Mathematical concepts at the level of middle school.
Domain Concept
7th grade “Negative numbers: addition and subtraction”, “Negative numbers: multiplication and division”, “Fractions, decimals, & percentages”, “Rates & proportional relationships”, “Expressions, equations, & inequalities”, “Geometry”, “Statistics and probability”,
8th grade “Numbers and operations”, “Solving equations with one unknown”, “Linear equations and functions”, “Systems of equations”, “Geometry”, “Geometric transformations”, “Data and modeling”,
Algebra basics “Foundations”, “Algebraic expressions”, “Linear equations and inequalities”, “Graphing lines and slope”, “Systems of equations”, “Expressions with exponents”, “Quadratics and polynomials”, “Equations and geometry”,
Pre-algebra “Factors and multiples”, “Patterns”, “Ratios and rates”, “Percentages”, “Exponents intro and order of operations”, “Variables & expressions”, “Equations & inequalities introduction”, “Percent & rational number word problems”, “Proportional relationships”, “One-step and two-step equations & inequalities”, “Roots, exponents, & scientific notation”, “Multi-step equations”, “Two-variable equations”, “Functions and linear models”, “Systems of equations”,
Basic geometry and measurement “Intro to area and perimeter”, “Intro to mass and volume”, “Measuring angles”, “Plane figures”, “Units of measurement”, “Volume”, “Coordinate plane”, “Decomposing to find area”, “3D figures”, “Circles, cylinders, cones, and spheres”, “Angle relationships”, “Scale”, “Triangle side lengths”, “Geometric transformations”,
Table 11: Mathematical concepts at the level of high school.
Domain Concept
Algebra 1 “Algebra foundations”, “Solving equations & inequalities”, “Working with units”, “Linear equations & graphs”, “Forms of linear equations”, “Systems of equations”, “Inequalities (systems & graphs)”, “Functions”, “Sequences”, “Absolute value & piecewise functions”, “Exponents & radicals”, “Exponential growth & decay”, “Quadratics: Multiplying & factoring”, “Quadratic functions & equations”, “Irrational numbers”, “Creativity in algebra”,
Algebra 2 “Polynomial arithmetic”, “Complex numbers”, “Polynomial factorization”, “Polynomial division”, “Polynomial graphs”, “Rational exponents and radicals”, “Exponential models”, “Logarithms”, “Transformations of functions”, “Equations”, “Trigonometry”, “Modeling”,
High school geometry “Performing transformations”, “Transformation properties and proofs”, “Congruence”, “Similarity”, “Right triangles & trigonometry”, “Analytic geometry”, “Conic sections”, “Circles”, “Solid geometry”,
Trigonometry “Right triangles & trigonometry”, “Trigonometric functions”, “Non-right triangles & trigonometry”, “Trigonometric equations and identities”,
Statistics and probability “Analyzing categorical data”, “Displaying and comparing quantitative data”, “Summarizing quantitative data”, “Modeling data distributions”, “Exploring bivariate numerical data”, “Study design”, “Probability”, “Counting, permutations, and combinations”, “Random variables”, “Sampling distributions”, “Confidence intervals”, “Significance tests (hypothesis testing)”, “Two-sample inference for the difference between groups”, “Inference for categorical data (chi-square tests)”, “Advanced regression (inference and transforming)”, “Analysis of variance (ANOVA)”,
High school statistics “Displaying a single quantitative variable”, “Analyzing a single quantitative variable”, “Two-way tables”, “Scatterplots”, “Study design”, “Probability”, “Probability distributions & expected value”,
Precalculus “Composite and inverse functions”, “Trigonometry”, “Complex numbers”, “Rational functions”, “Conic sections”, “Vectors”, “Matrices”, “Probability and combinatorics”, “Series”, “Limits and continuity”,
Calculus 1 “Limits and continuity”, “Derivatives: definition and basic rules”, “Derivatives: chain rule and other advanced topics”, “Applications of derivatives”, “Analyzing functions”, “Integrals”, “Differential equations”, “Applications of integrals”,
Calculus 2 “Integrals review”, “Integration techniques”, “Differential equations”, “Applications of integrals”, “Parametric equations, polar coordinates, and vector-valued functions”, “Series”,
Table 12: Mathematical concepts at the level of higher education.
Domain Concept
AP College Statistics “Exploring categorical data”, “Exploring one-variable quantitative data: Displaying and describing”, “Exploring one-variable quantitative data: Summary statistics”, “Exploring one-variable quantitative data: Percentiles, z-scores, and the normal distribution”, “Exploring two-variable quantitative data”, “Collecting data”, “Probability”, “Random variables and probability distributions”, “Sampling distributions”, “Inference for categorical data: Proportions”, “Inference for quantitative data: Means”, “Inference for categorical data: Chi-square”, “Inference for quantitative data: slopes”, “Prepare for the 2022 AP Statistics Exam”,
College Algebra “Linear equations and inequalities”, “Graphs and forms of linear equations”, “Functions”, “Quadratics: Multiplying and factoring”, “Quadratic functions and equations”, “Complex numbers”, “Exponents and radicals”, “Rational expressions and equations”, “Relating algebra and geometry”, “Polynomial arithmetic”, “Advanced function types”, “Transformations of functions”, “Rational exponents and radicals”, “Logarithms”,
Differential Calculus “Limits and continuity”, “Derivatives: definition and basic rules”, “Derivatives: chain rule and other advanced topics”, “Applications of derivatives”, “Analyzing functions”, “Parametric equations, polar coordinates, and vector-valued functions”,
Integral Calculus “Integrals”, “Differential equations”, “Applications of integrals”, “Parametric equations, polar coordinates, and vector-valued functions”, “Series”,
AP College Calculus AB “Limits and continuity”, “Differentiation: definition and basic derivative rules”, “Differentiation: composite, implicit, and inverse functions”, “Contextual applications of differentiation”, “Applying derivatives to analyze functions”, “Integration and accumulation of change”, “Differential equations”, “Applications of integration”, “AP Calculus AB solved free response questions from past exams”, “AP Calculus AB Standards mappings”,
AP College Calculus BC “Limits and continuity”, “Differentiation: definition and basic derivative rules”, “Differentiation: composite, implicit, and inverse functions”, “Contextual applications of differentiation”, “Applying derivatives to analyze functions”, “Integration and accumulation of change”, “Differential equations”, “Applications of integration”, “Parametric equations, polar coordinates, and vector-valued functions”, “Infinite sequences and series”, “AP Calculus BC solved exams”, “AP Calculus BC Standards mappings”,
Multivariable calculus “Thinking about multivariable functions”, “Derivatives of multivariable functions”, “Applications of multivariable derivatives”, “Integrating multivariable functions”, “Green’s, Stokes’, and the divergence theorems”,
Differential equations “First order differential equations”, “Second order linear equations”, “Laplace transform”,
Linear algebra “Vectors and spaces”, “Matrix transformations”, “Alternate coordinate systems (bases)”,

Appendix C Prompt Templates in Mustard

C.1 Prompt Template for Proof Filtering

Table 13 demonstrates the prompt template used in the proof-filtering stage in Mustard.

Table 13: Prompt template for proof filtering.
Prompt Template for Proof Filtering
In the following, you are given a ‘‘Problem’’, a pair of corresponding ‘‘Informal proof’’ and ‘‘Formal proof in Lean 3’’, along with error messages from a Lean Prover corresponding to the ‘‘Formal proof in Lean 3’’. Now please carefully modify the ‘‘Formal proof in Lean 3’’ section so that it passes the Lean Prover without error. You should write the modified complete proof in your response.
# Problem: <generated problem>
# Informal proof: <generated informal proof>
# Formal proof (1) in Lean 3:
‘‘‘lean
line 1 <code>
line 2 <code>
line 3 <code>
...
‘‘‘
# Error messages for Formal proof (1) from Lean Prover:
<error messages>
...
# Formal proof (k) in Lean 3:
‘‘‘lean
line 1 <code>
line 2 <code>
line 3 <code>
...
‘‘‘
# Error messages for Formal proof (k) from Lean Prover:
<error messages>
Table 14: Prompt templates for step-by-step generation in the proof-generation stage.
Prompt Templates for Step-by-Step Generation
(T1) You are a math expert. Now please come up with a math problem according to the following requirements. The math problem should contain a question part (indicated by ‘‘Problem: ’’), a corresponding solution in natural language (indicated by ‘‘Informal proof:’’), and a translated formal solution in Lean 3 (indicated by ‘‘Formal proof in Lean 3:’’). Please note that the informal proof and the formal proof need to be identical. Please create a [QUESTION TYPE] in the level of [EDUCATIONAL LEVEL] based on the following knowledge point(s): [CONCEPT] in [DOMAIN]; [CONCEPT] in [DOMAIN]. Please first write the question part regardless of the other parts. You must write the following format, filling in the ‘‘# Problem: ’’ section, and leaving the other two sections empty. # Problem: ... # Informal proof: ... # Formal proof in Lean 3: ...
(T2) You are a math expert. Now please come up with a math problem according to the following requirements. The math problem should contain a question part (indicated by ‘‘Problem: ’’), a corresponding solution in natural language (indicated by ‘‘Informal proof:’’), and a translated formal solution in Lean 3 (indicated by ‘‘Formal proof in Lean 3:’’). Please note that the informal proof and the formal proof need to be identical. Please create a [QUESTION TYPE] in the level of [EDUCATIONAL LEVEL] based on the following knowledge point(s): [CONCEPT] in [DOMAIN]; [CONCEPT] in [DOMAIN]. Please then write the corresponding solution in natural language (indicated by ‘‘Informal proof:’’) given the ‘‘# Problem: ’’, filling in the ‘‘# Informal proof: ’’ section, and leaving the other section empty. # Problem: <generated problem> # Informal proof: ... # Formal proof in Lean 3: ...
(T3) You are a master in Lean. Now please come up with a math problem according to the following requirements. The math problem should contain a question part (indicated by ‘‘Problem: ’’), a corresponding solution in natural language (indicated by ‘‘Informal proof:’’), and a translated formal solution in Lean 3 (indicated by ‘‘Formal proof in Lean 3:’’). Please note that the informal proof and the formal proof need to be identical. Please create a [QUESTION TYPE] in the level of [EDUCATIONAL LEVEL] based on the following knowledge point(s): [CONCEPT] in [DOMAIN]; [CONCEPT] in [DOMAIN]. Please translate the ‘‘# Informal proof:’’ section into Lean 3 and fill in the ‘‘# Formal proof in Lean 3: ’’ section. # Problem: <generated problem> # Informal proof: <generated informal proof> # Formal proof in Lean 3: ...

C.2 Prompt Templates for Step-by-Step Generation

Table 14 demonstrates the variation of the prompt templates used in the proof-generation stage of Mustard. In this variation, an LLM is prompted to perform (T1), (T2), and (T3) separately to generate the informal statement, informal solution, and formal solution. Note that to prompt the LLM to fulfill (T3), we assign the role of “a master in Lean” rather than the previous “a math expert” to obtain higher-quality Lean proofs.

Appendix D More Statistic Results of MustardSauce

D.1 Difficulty of MustardSauce by Number of Correction

Figure 5 demonstrates the proportions of data points that obtain a valid proof after different numbers of corrections. Generally speaking, data points that need no correction are relatively less difficult for the LLMs, while those that require multiple corrections are challenging. Overall, the generated theorem-proving problems are more challenging for LLMs to solve than the generated word problems. For theorem-proving data points with 2 seed concepts, more than 90% of the data cannot pass the prover validation at the first generation, and almost 30% of them require 2 correction steps to obtain a valid proof. Similar observations hold for word problems with 2 seed concepts, suggesting that the data subset with 2 seed concepts is challenging for the LLMs in general. In contrast, data with 1 seed concept are easier for the LLMs, but more than half of the data points still need proof improvement based on error messages from the theorem prover. Therefore, overall, Mustard generates valid data points at different difficulty levels, and the majority of the problems are challenging for the LLMs.

Figure 5: Data proportions according to correction steps.

D.2 Data Diversity

We compute ROUGE-L \citeplin-2004-rouge to check the diversity of the generated informal statements and proofs. Specifically, given a data set, we perform 10 rounds of bootstrapping. In each round, we randomly sample 10 data points from the data set, each of which is paired with the remaining data points, and compute pair-wise ROUGE-L scores. The ROUGE-L score per round is obtained by averaging the pair-wise scores, and the final ROUGE-L score is the average over all bootstrap rounds. We compare the scores among the generation settings, and the results are shown in Figure 6.
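The procedure can be sketched as follows, using the rouge-score package; the authors' exact implementation and tokenization may differ.

import random
from rouge_score import rouge_scorer

def diversity_rouge_l(texts, rounds=10, sample_size=10, seed=0):
    rng = random.Random(seed)
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    round_means = []
    for _ in range(rounds):
        sampled = rng.sample(range(len(texts)), sample_size)
        rest = [i for i in range(len(texts)) if i not in set(sampled)]
        # Pair each sampled text with every remaining text.
        scores = [scorer.score(texts[i], texts[j])["rougeL"].fmeasure
                  for i in sampled for j in rest]
        round_means.append(sum(scores) / len(scores))
    return sum(round_means) / len(round_means)  # average over bootstrap rounds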

The results show that all settings have a ROUGE-L score beneath 0.25, which indicates high diversity of the generated informal statements and informal proofs. All-at-once and step-by-step generation share similar data diversity. The ROUGE-L scores slightly increase with educational level. Generating higher-education theorem-proving data with 2 concepts step-by-step exceeds 0.2 in ROUGE-L, so this setting is relatively challenging.

We further investigate the diversity of formal statements and proofs. We collect all occurrences of the used tactics and lemmas, and Figure 7 shows the distributions. The distributions demonstrate that the LLM uses diverse lemmas to solve the problems, indicating that it memorizes and understands many lemmas. The most commonly used lemmas in both theorem proving and word problems include of_as_true, mul_comm, and nat.mul_comm. of_as_true is often used in automated proofs of complex propositions to first establish their decidability. mul_comm and nat.mul_comm state the commutative law of multiplication and are often used for expression simplification. Similarly, the frequently used neg_pos, pow_two, nat.div_eq_of_eq_mul_right, and so forth are basic lemmas for proving advanced properties. Similar observations hold for tactics. Therefore, the LLMs are good at breaking problems into basic steps and using basic lemmas to complete complex proofs.

Figure 6: Diversity of informal statements and proofs by ROUGE. All: All-at-once generation. Step: Step-by-step generation. k: Number of seed concepts.
Figure 7: Distribution of tactics (left-hand side) and lemmas (right-hand side) used by the formal statements and proofs in theorem proving (upper row) and word problem (lower row).

Appendix E Case Study

Tables 15 to 26 demonstrate successful and failed cases generated by Mustard.

Table 15: A successful case generated by Mustard.
Question Type: Word Problem, Educational Level: Elementary School, k=1.   Successful case.
Concept(s): Division in 4th grade.
Comments: Mustard accurately counts the number of people and correctly performs division.
Informal Statement
John and his 3 friends are on a trip. They bought a huge box of chocolates containing 240 pieces. They decided to split the chocolates equally among themselves. How many chocolates will each one of them get?
Informal Proof
Since there are 4 people in total and they have 240 chocolates, we can find out how many chocolates each person will get by dividing the total number of chocolates by the total number of people. So, 240 divided by 4 equals 60. Therefore, each person will get 60 chocolates.
Formal Statement and Proof
def divide (x y : ℕ) : ℕ := x / y
#reduce divide 240 4 -- The output should be 60, which means each person will get 60 chocolates.
Table 16: A failed case generated by Mustard.
Question Type: Theorem Proving, Educational Level: Elementary School, k=1.   Failed case.
Concept(s): Quadrilaterals in 3rd grade.
Comments: Mustard believes that a quadrilateral with four equal sides is a square. It treats a necessary condition as a sufficient condition.
Informal Statement
Given a quadrilateral ABCD where AB = BC = CD = DA = 5 units, prove that this quadrilateral is a square.
Informal Proof
The given quadrilateral ABCD has all its sides equal. This is a property of a square. Hence, ABCD is a square.
Formal Statement and Proof
import data.real.basic
def is_square_quad (AB BC CD DA : ℝ) : Prop := AB = BC ∧ BC = CD ∧ CD = DA
def ABCD_is_square : ∀ (AB BC CD DA : ℝ), AB = 5 ∧ BC = 5 ∧ CD = 5 ∧ DA = 5 → is_square_quad AB BC CD DA :=
begin
intros AB BC CD DA h,
cases h with h1 h2,
cases h2 with h3 h4,
cases h4 with h5 h6,
split,
{ rw h1, exact h3 },
split,
{ rw h3, exact h5 },
{ rw h5, exact h6 },
end
Table 17: A successful case generated by Mustard.
Question Type: Word Problem, Educational Level: Elementary School, k=2.   Successful case.
Concept(s): Addition and subtraction in 1st grade; Measuring angles in 4th grade.
Comments: Mustard creatively combines the two concepts and constructs a reasonable question.
Informal Statement
John has a favorite toy car that he likes to move around a circular track. He starts moving the car from a certain point and stops after the car has moved 90 degrees. Then he moves the car again but this time in the opposite direction, stopping when the car has moved 45 degrees. What is the net degree movement of John’s toy car?
Informal Proof
Initially, John moves his car 90 degrees. Then he moves the car in the opposite direction 45 degrees. Since the car is moving in the opposite direction, we consider this as -45 degrees. To find the net movement, we add the two movements together. So, 90 + (-45) equals 45. Therefore, the net degree movement of John’s toy car is 45 degrees.
Formal Statement and Proof
import data.int.basic
open int
-- define the initial movement
def initial_movement : ℤ := 90
-- define the backward movement
def backward_movement : ℤ := -45
-- define the function to calculate the net movement
def net_movement (initial backward : ℤ) : ℤ := initial + backward
-- proof that the net movement is 45 degrees
example : net_movement initial_movement backward_movement = 45 :=
begin
unfold net_movement,
unfold initial_movement,
unfold backward_movement,
exact dec_trivial,
end
Table 18: A successful case generated by Mustard.
Question Type: Theorem Proving, Educational Level: Middle School, k=2.   Successful case.
Concept(s): Geometry in 8th grade; Algebraic expressions in Algebra basics.
Comments: Mustard understands polynomials and writes a valid proof in Lean.
Informal Statement
Given a rectangle ABCD where AB is $x+5$ and AD is $2x-3$. Prove that the area of the rectangle ABCD is $(2x^2+7x-15)$ square units.
Informal Proof
In a rectangle, the area is calculated by multiplying the length and the width. Here, the length is represented by the expression $x+5$ and the width is represented by the expression $2x-3$. By multiplying these two expressions together, we get:
$(x+5)(2x-3) = 2x^2 - 3x + 10x - 15 = 2x^2 + 7x - 15$
Therefore, the area of the rectangle ABCD is $(2x^2+7x-15)$ square units.
Formal Statement and Proof
import algebra.ring.basic
variables (x : ℤ)
def rectangle_area : (x + 5) * (2 * x - 3) = 2 * (x * x) + 7 * x - 15 :=
begin
have h1 : (x + 5) * (2 * x - 3) = 2 * (x * x) - 3 * x + 10 * x - 15, by ring,
have h2 : 2 * (x * x) - 3 * x + 10 * x - 15 = 2 * (x * x) + 7 * x - 15, by simp,
rw h1,
exact h2,
end
Table 19: A failed case generated by Mustard.
Question Type: Theorem Proving, Educational Level: High School, k=1.   Failed case.
Concept(s): Displaying and comparing quantitative data in Statistics and probability.
Comments: Mustard gives its own definition of the median in the formal proof.
Informal Statement
Given two sets of data A = {5, 7, 10, 12, 15} and B = {6, 8, 11, 14, 16}, we are to find the median of the two sets and prove that the median of set A is less than the median of set B.
Informal Proof
To find the median of a given data set, we need to sort the data in ascending order and then find the middle value. In the case of set A and set B, both are already sorted in ascending order. For set A, the median is 10 and for set B, the median is 11. Therefore, it is clear that the median of set A is less than the median of set B.
Formal Statement and Proof
import data.set.finite
import data.set.lattice
def A : finset ℕ := {5, 7, 10, 12, 15}.to_finset
def B : finset ℕ := {6, 8, 11, 14, 16}.to_finset
def median (A : finset ℕ) : ℕ :=
if A.card % 2 = 1 then (A.sort (≤)).get ((A.card - 1) / 2) else
(A.sort (≤)).get (A.card / 2 - 1) / 2 + (A.sort (≤)).get (A.card / 2) / 2
lemma median_A_less_than_B :
(median A) < (median B) :=
begin
simp [median, A, B, finset.sort, finset.get],
norm_num,
end
Table 20: A failed case generated by Mustard.
Question Type: Word Problem, Educational Level: High School, k=2.   Failed case.
Concept(s): Integrals in Calculus 1; Summarizing quantitative data in Statistics and probability.
Comments: The informal statement does not account for leap years, and the solution does not give the final answer (a worked completion follows the table).
Informal Statement
A company produced a new model of a car and started selling it on January 1. The number of cars sold $t$ days after January 1 is given by the function $f(t)=2t^{2}+3t+1$. Use integrals to find the total number of cars sold from January 1 to March 31. Then calculate the average number of cars sold per day during this period.
Informal Proof
The total number of cars sold from January 1 to March 31 can be found by integrating the function $f(t)$ from $t=0$ (January 1) to $t=90$ (March 31), as each day corresponds to one unit of $t$.
Performing the integration, we get $\int(2t^{2}+3t+1)\,\mathrm{d}t$ from 0 to 90, which equals $[\frac{2}{3}\times t^{3}+\frac{3}{2}\times t^{2}+t]$ from 0 to 90. Substituting $t=90$ into the integrated function and subtracting the value at $t=0$, we get the total number of cars sold.
The average number of cars sold per day is then the total number of cars sold divided by 90 (the number of days from January 1 to March 31).
Formal Statement and Proof
import data.real.basic
open real
def f := λ (t : ℝ), 2*t^2 + 3*t + 1
lemma f_value_at_90 : f 90 = 2*(90)^2 + 3*(90) + 1 :=
begin
simp [f],
end
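For reference (this completion is ours, not part of the generated sample): evaluating the antiderivative at the bounds gives $\frac{2}{3}(90)^{3}+\frac{3}{2}(90)^{2}+90=486000+12150+90=498240$ cars in total, so the average is $498240/90=5536$ cars per day.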
Table 21: A successful case generated by Mustard.
Question Type: Theorem Proving, Educational Level: Higher Education, k=1.   Successful case.
Concept(s): Alternate coordinate systems (bases) in Linear algebra.
Comments: Mustard correctly applies the lemma; all the required linear algebra concepts are found.
Informal Statement
Let $V$ be a vector space, and let $B_{1}=\{v_{1},v_{2},\dots,v_{n}\}$ and $B_{2}=\{w_{1},w_{2},\dots,w_{n}\}$ be two bases for $V$. Prove that there exists an invertible linear map $T:V\to V$ such that $T(v_{i})=w_{i}$ for each $i$ in $\{1,2,\dots,n\}$.
Informal Proof
The map $T:V\to V$ defined by $T(v_{i})=w_{i}$ for each $i$ is linear: indeed, for any vectors $v$, $v^{\prime}$ in $B_{1}$ and any scalars $c$, $c^{\prime}$, we have $T(cv+c^{\prime}v^{\prime})=cw+c^{\prime}w^{\prime}$ where $w=T(v)$ and $w^{\prime}=T(v^{\prime})$. This follows from the linearity properties of vector addition and scalar multiplication in the vector space $V$.
Formal Statement and Proof
import linear_algebra.basis
variables {V : Type*} [add_comm_group V] [module ℝ V]
variables {v w : V}
variables {B1 B2 : set V}
variables (hB1 : is_basis ℝ B1) (hB2 : is_basis ℝ B2)
lemma exists_invertible_linear_map :
∃ (T : V →ₗ[ℝ] V) (hT : is_linear_map ℝ T) (hinv : function.bijective T),
∀ (v ∈ B1), T v ∈ B2 :=
begin
use [hB2.constr hB1, hB2.is_linear hB1, hB2.bijective hB1],
intros v hv,
apply hB2.mem_span,
rw hB1.repr_self hv,
exact λ _, hB2.smul _ _,
end
Table 22: A successful case generated by Mustard.
Question Type: Word Problem, Educational Level: Higher Education, k=1.   Successful case.
Concept(s): Series in Integral Calculus.
Comments: Mustard is proficient with number series and is able to perform a detailed proof by integration.
Informal Statement
Prove that the integral of the series $\sum$ (from $n=1$ to infinity) of $\frac{-1^{(n+1)}}{n}$ equals $\ln(2)$.
Informal Proof
We start by recognizing the series $\sum$ (from $n=1$ to infinity) of $\frac{-1^{(n+1)}}{n}$ as the alternating harmonic series, which is known to converge to $\ln(2)$.
Formal Statement and Proof
import data.real.basic
import analysis.special_functions.pow
open real
lemma integral_of_series :
∫ x in 0..1, (ln (1 + x)) = ∑' (n : ℕ), ((-1)^(n+1) / (n + 1)) :=
begin
have h_der : ∀ x ∈ set.Icc (0 : ℝ) 1, deriv (λ x, (ln (1 + x))) x = 1 / (1 + x),
{ intros x hx,
have h_pos : 0 < 1 + x := by nlinarith,
exact deriv_ln_one_add h_pos },
have h_int : interval_integral (λ x, 1 / (1 + x)) 0 1 volume = ∑' (n : ℕ), ((-1)^(n+1) / (n + 1)),
{ have h_frac : ∀ (n : ℕ), ∫ x in 0..1, x^n = 1 / (n + 1),
{ intro n,
calc ∫ x in 0..1, x^n = [x^(n+1) / (n+1)] | 0..1 : integral_pow n
... = 1 / (n + 1) : by { rw integral_interval, simp } },
rw [interval_integral.integral_of_le, h_frac],
simp,
linarith },
have h_eq : ∫ x in 0..1, (ln (1 + x)) = ∫ x in 0..1, 1 / (1 + x),
{ congr, ext, exact h_der x (set.mem_Icc.mpr ⟨by nlinarith, by nlinarith⟩) },
rw [h_eq, h_int],
end
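For reference, the claimed convergence is the standard Maclaurin expansion $\ln(1+x)=\sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}x^{n}$ for $-1<x\le 1$; setting $x=1$ gives $\ln 2=\sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}$, the alternating harmonic series that the informal proof invokes.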
Table 23: A failed case generated by Mustard.
Question Type: Theorem Proving, Educational Level: Higher Education, k=2.   Failed case.
Concept(s): Second order linear equations in Differential equations; Limits and continuity in AP College Calculus BC.
Comments: Mustard does not understand second-order linear equations and continuity.
Informal Statement
Given a second order linear differential equation of the form $y^{\prime\prime}+p(x)y^{\prime}+q(x)y=0$, where $p(x)$ and $q(x)$ are continuous on the interval $[a,b]$, prove that if there exist two solutions $y1$ and $y2$ that are linearly independent, then any solution $y$ of the differential equation can be expressed as $y=c1*y1+c2*y2$, where $c1$ and $c2$ are constants.
Informal Proof
The general solution to a homogeneous second order linear differential equation is given by the superposition of two linearly independent solutions. This is a consequence of the linearity of the differential operator.
Suppose we have a second order linear differential equation, and $y1$ and $y2$ are two solutions that are linearly independent. Then, any solution $y$ of the differential equation can be expressed as a linear combination of $y1$ and $y2$. This means there exist constants $c1$ and $c2$ such that $y=c1*y1+c2*y2$.
The reason is that any linear combination of solutions to a linear differential equation is also a solution to the differential equation. This follows from the properties of the derivative, which allow us to distribute the derivative over addition and to factor out constants.
Formal Statement and Proof
import data.real.basic
variables {α : Type*} [field α]
/-- Placeholder for the differential equation -/
def diff_eq (p q y : α → α) (x : α) : Prop :=
y x = 0 -- This is just a placeholder. A real differential equation would have a more complex form.
/-- Theorem: Given a second order linear differential equation of the form y'' + p(x)y' + q(x)y = 0,
where p(x) and q(x) are continuous on the interval [a, b], if there exist two solutions y1 and y2
that are linearly independent, then any solution y of the differential equation can be expressed
as y = c1*y1 + c2*y2, where c1 and c2 are constants. -/
lemma second_order_linear_differential_eq (p q y1 y2 y : α → α) (c1 c2 : α) :
(∃ x, diff_eq p q y1 x ∧ diff_eq p q y2 x) → (∀ x, diff_eq p q y x) →
(∀ x, y x = c1 * y1 x + c2 * y2 x) :=
begin
intros h_sol h_y x,
-- This is a placeholder. A real proof would show that y x can be written as a linear combination
-- of y1 x and y2 x using the properties of the differential equation, the assumption that y1 and y2
-- are solutions, and the linearity of the differential operator.
sorry
end
Table 24: A successful case generated by Mustard.
Question Type: Word Problem, Educational Level: Higher Education, k=2.   Successful case.
Concept(s): Matrix transformations in Linear algebra; Collecting data in AP College Statistics.
Comments: Mustard performs the matrix transformation correctly.
Informal Statement
Consider two matrices $A=[1,2;3,4]$ and $B=[5,6;7,8]$ which represent two different data collection methods applied to a survey of a population. We want to find the collective effect of these two data collection methods when applied simultaneously. In other words, find the matrix $C$ which is the result of the matrix transformation of $A$ by $B$.
Informal Proof
Matrix multiplication is a form of transformation of one matrix by another. In this case, we want to find the matrix $C$ that is a result of the transformation of $A$ by $B$.
Matrix multiplication is done by taking the dot product of the rows of the first matrix with the columns of the second matrix.
Therefore,
$C[0,0]=(A[0,0]*B[0,0])+(A[0,1]*B[1,0])=(1*5)+(2*7)=19$,
$C[0,1]=(A[0,0]*B[0,1])+(A[0,1]*B[1,1])=(1*6)+(2*8)=22$,
$C[1,0]=(A[1,0]*B[0,0])+(A[1,1]*B[1,0])=(3*5)+(4*7)=43$, and
$C[1,1]=(A[1,0]*B[0,1])+(A[1,1]*B[1,1])=(3*6)+(4*8)=50$.
So, the matrix $C=[19,22;43,50]$.
Formal Statement and Proof
import data.matrix.basic
def A : matrix (fin 2) (fin 2) ℕ := ![![1, 2], ![3, 4]]
def B : matrix (fin 2) (fin 2) ℕ := ![![5, 6], ![7, 8]]
def C : matrix (fin 2) (fin 2) ℕ := A.mul B
#eval C
-- Output: ![![19, 22], ![43, 50]]
Table 25: A failed case generated by Mustard.
Question Type: Theorem Proving, Educational Level: Higher Education, k=2.   Failed case.
Concept(s): Green’s, Stokes’, and the divergence theorems in Multivariable calculus; Differential equations in Integral Calculus.
Comments: Mustard does not finish the formal proof, leaving a sorry.
Informal Statement
Let $F$ be a continuously differentiable vector field on $\mathbb{R}^{3}$. Let $S$ be the boundary of the unit ball $B$ in $\mathbb{R}^{e}$, oriented by the outward normal. Let $n$ denote the outward unit normal on $S$. Prove that $\int_{S}\mathrm{div}\,F\,\mathrm{d}S=\oint_{S}F\bullet n\,\mathrm{d}S$, where "div" denotes the divergence of $F$, "$\bullet$" denotes the dot product, and "$\int$", "$\oint$" denote the surface integral and line integral respectively.
Informal Proof
This is an application of the divergence theorem, also known as Gauss’s theorem.
The divergence theorem states that the surface integral of a vector field over a surface $S$ is equal to the volume integral of the divergence of the field over the volume $V$ enclosed by $S$. In mathematical terms, $\int_{S}F\bullet n\,\mathrm{d}S=\oint_{V}\mathrm{div}\,F\,\mathrm{d}V$.
In our case, the volume $V$ is the unit ball $B$ in $\mathbb{R}^{3}$, and the surface $S$ is the boundary of $B$. The divergence theorem thus gives us $\int_{S}F\bullet n\,\mathrm{d}S=\oint_{B}\mathrm{div}\,F\,\mathrm{d}V$.
However, since $B$ is the unit ball, its volume is $1$, so the volume integral over $B$ is just the average value of div $F$ over $B$. Therefore, $\oint_{B}\mathrm{div}\,F\,\mathrm{d}V=\int_{S}\mathrm{div}\,F\,\mathrm{d}S$.
Putting everything together, we obtain $\int_{S}F\bullet n\,\mathrm{d}S=\int_{S}\mathrm{div}\,F\,\mathrm{d}S$, as required.
Formal Statement and Proof
import analysis.calculus.fderiv
import measure_theory.integration
import analysis.normed_space.inner_product
import data.real.basic
import topology.instances.real
import analysis.special_functions.trigonometric
open_locale classical
noncomputable theory
/- Let's define ℝ³ as Euclidean space -/
def ℝ³ := euclidean_space ℝ (fin 3)
/- Define the ball and boundary -/
def B := metric.ball (0 : ℝ³) 1
def S := metric.sphere (0 : ℝ³) 1
/- Assume F is a continuously differentiable vector field on ℝ³ -/
variables (F : ℝ³ → ℝ³) (hF : ∀ x, differentiable_at ℝ F x)
/- Define the divergence of F -/
def divF := λ x, (fderiv ℝ F x).trace
/- Define the outward unit normal on S -/
def n := λ x : ℝ³, x
/- Now we can state the theorem -/
def div_eq_surface_integral :
∫ x in B, divF F x = ∮ x in S, inner (F x) (n x) :=
begin
-- This is where the main proof would go, using the divergence theorem and other facts about integration.
sorry
end
Table 26: A failed case generated by Mustard.
Question Type: Theorem Proving, Educational Level: Higher Education, k=2.   Failed case.
Concept(s): Matrix transformations in Linear algebra.
Comments: The generated informal statement has factual errors.
Informal Statement
Let's consider a function $f:\mathbb{R}\to\mathbb{R}$ that is differentiable at a point $x_{0}$. Let's also consider a $2\times 2$ matrix $A$. We define a new function $g(x)=A*f(x)$. Prove that $g$ is also differentiable at $x_{0}$, and find the derivative of $g$ at $x_{0}$.
Informal Proof
Since $f$ is differentiable at $x_{0}$, the limit of the difference quotient exists at this point. Denote this limit as $f^{\prime}(x_{0})$.
We know that matrix multiplication is a linear operation. So we can move the limit inside the matrix multiplication to find the derivative of $g$ at $x_{0}$.
So, $g^{\prime}(x_{0})=A*f^{\prime}(x_{0})$.
Formal Statement and Proof
import data.matrix.basic
import analysis.calculus.deriv
variables (f : ℝ → ℝ) (A : matrix (fin 2) (fin 2) ℝ) (x0 : ℝ)
variable (h : differentiable_at ℝ f x0)
include h
lemma matrix_times_differentiable_function : differentiable_at ℝ (λ x, A · (λ i, f x) i) x0 :=
begin
apply matrix.has_fderiv_at_diff_at,
exact h.has_deriv_at,
end
lemma derivative_of_matrix_times_function : deriv (λ x, A · (λ i, f x) i) x0 = A · (λ i, deriv f x0) i :=
begin
apply has_deriv_at.deriv,
exact matrix_times_differentiable_function h,
end

Appendix F Implementation Details of Downstream Tasks

F.1 Datasets

GSM8K \citepDBLP:journals/corr/abs-2110-14168

GSM8K consists of 8.5K elementary mathematics word problems that require 2 to 8 arithmetic operations to reach the final answer. The dataset comprises 7.5K training questions and 1K test questions. Inspired by [DBLP:conf/nips/KojimaGRMI22], during inference we prompt the model for zero-shot and few-shot reasoning with appropriate prompts and examples. The prompts used are shown in Table 27; a minimal sketch of prompt assembly and answer extraction follows the table.

Table 27: Prompt template for math word problem inference.
Zero shot prompt template for math word problem inference
You are an expert in math. Answer the following math word problem.
Question: <question>
Answer: Let’s think step by step.
Few shot prompt template for math word problem inference
You are an expert in math. Answer the following math word problem.
Question: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?
Answer: There are 15 trees originally. Then there were 21 trees after some more were planted. So there must have been 21 - 15 = 6. The answer is 6.
Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
Answer: There are originally 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.
Question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
Answer: Originally, Leah had 32 chocolates. Her sister had 42. So in total they had 32 + 42 = 74. After eating 35, they had 74 - 35 = 39. The answer is 39.
Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny?
Answer: Jason started with 20 lollipops. Then he had 12 after giving some to Denny. So he gave Denny 20 - 12 = 8. The answer is 8.
Question: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now?
Answer: Shawn started with 5 toys. If he got 2 toys each from his mom and dad, then that is 4 more toys. 5 + 4 = 9. The answer is 9.
Question: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room?
Answer: There were originally 9 computers. For each of 4 days, 5 more computers were added. So 5 * 4 = 20 computers were added. 9 + 20 is 29. The answer is 29.
Question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday?
Answer: Michael started with 58 golf balls. After losing 23 on tuesday, he had 58 - 23 = 35. After losing 2 more, he had 35 - 2 = 33 golf balls. The answer is 33.
Question: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
Answer: Olivia had 23 dollars. 5 bagels for 3 dollars each will be 5 x 3 = 15 dollars. So she has 23 - 15 dollars left. 23 - 15 is 8. The answer is 8.
Question: <question>
Answer:
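For concreteness, the following is a minimal sketch of how such a prompt might be assembled and the final answer parsed; the llm call and the extraction regex are illustrative assumptions, not part of a released pipeline.

import re

ZERO_SHOT_TEMPLATE = (
    "You are an expert in math. Answer the following math word problem.\n"
    "Question: {question}\n"
    "Answer: Let's think step by step."
)

def extract_answer(completion: str):
    # CoT completions in this style end with "The answer is <number>".
    match = re.search(r"The answer is\s*\$?(-?\d[\d,]*(?:\.\d+)?)", completion)
    return match.group(1).replace(",", "") if match else None

prompt = ZERO_SHOT_TEMPLATE.format(
    question="Olivia has $23. She bought five bagels for $3 each. "
             "How much money does she have left?"
)
# completion = llm(prompt)  # hypothetical call to the fine-tuned model
print(extract_answer("23 - 15 is 8. The answer is 8."))  # prints 8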

Mathlib

Mathlib (https://github.com/leanprover-community/mathlib) is a community-maintained library for the Lean theorem prover. It encompasses both programming tools and mathematical content, along with tactics that leverage these tools to facilitate mathematical development. The version of Mathlib we use is consistent with [DBLP:conf/acl/WangYLSYXXSLL0L23]. The training, test, and validation splits contain 36,960, 1,621, and 1,580 examples, respectively.
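To illustrate the style of this data, below is a minimal mathlib-flavored lemma with a tactic proof; the lemma is our own toy example, not an entry from Mathlib.

import data.nat.basic

-- Toy mathlib-style sample: a statement paired with the tactics that close it.
lemma add_zero_comm (a b : ℕ) : a + b + 0 = b + a :=
begin
  rw add_zero,            -- reduces the goal to a + b = b + a
  exact nat.add_comm a b, -- closes it with commutativity of addition
end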

miniF2F \citepDBLP:conf/iclr/ZhengHP22

MiniF2F is a formal mathematics benchmark that has been translated to work with multiple formal systems. It encompasses exercise statements from olympiads such as AMC, AIME, and IMO, as well as content from high-school and undergraduate mathematics courses. The miniF2F test split contains 244 formal Olympiad-level mathematics problem statements. We construct the training corpus auto-regressively, following [DBLP:journals/corr/abs-2009-03393, DBLP:conf/iclr/HanRWAP22, DBLP:conf/acl/WangYLSYXXSLL0L23]. During inference, we evaluate with best-first search using $d=8$ expansions per proof search; a schematic sketch follows.
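A schematic of this best-first search in Python; model_suggest and prover_apply are hypothetical stand-ins for the language model and the Lean interaction layer, and only the expansion budget d=8 is taken from our setup.

import heapq
import itertools

def best_first_proof_search(init_state, model_suggest, prover_apply,
                            d=8, max_steps=100):
    # Frontier entries: (negated cumulative log-prob, tie-breaker, proof state).
    counter = itertools.count()
    frontier = [(0.0, next(counter), init_state)]
    for _ in range(max_steps):
        if not frontier:
            return None
        neg_lp, _, state = heapq.heappop(frontier)
        if state.is_solved():  # all goals closed: a proof is found
            return state
        # Ask the model for d candidate tactics; keep those the prover accepts.
        for tactic, log_prob in model_suggest(state, k=d):
            next_state = prover_apply(state, tactic)
            if next_state is not None:
                heapq.heappush(frontier,
                               (neg_lp - log_prob, next(counter), next_state))
    return None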

F.2 Models

GPT2-large \citepradford2019language

GPT2-large is a transformer language model following the decoder-only architecture introduced by [vaswani2017attention]. The model has 774 million parameters, 36 layers, 20 attention heads, and a hidden dimension of 1,280, and it employs a tokenizer with a vocabulary size of 50,400. The model is pre-trained on GitHub Python code and the arXiv library.

Llama 2-7B \citepDBLP:journals/corr/abs-2307-09288

Llama 2 is an auto-regressive transformer language model pre-trained on an open-source corpus. It is aligned with human preferences through supervised fine-tuning and reinforcement learning from human feedback. The 7B model has 32 layers, 32 attention heads, and a hidden dimension of 4,096.

F.3 Implementation Details

We employ LoRA \citephu2021lora to fine-tune the pre-trained models on MustardSauce; the trainable parameters constitute 19% of GPT2-large and 6% of Llama 2-7B. Training runs for at most 10 epochs with a batch size of 16 and 1,000 warm-up steps, with a maximum learning rate of 1e-4 and a minimum learning rate of 5e-6. The best checkpoint is selected by the minimum perplexity on the validation split. A minimal sketch of this setup follows.
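The sketch below uses the Hugging Face peft library; the LoRA rank, alpha, and target modules are illustrative assumptions, since the text reports only the trainable-parameter fractions and the optimization schedule.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Rank, alpha, and target modules are assumptions; the paper states only
# that about 6% of Llama 2-7B's parameters are trainable under LoRA.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()

# Reported schedule: at most 10 epochs, batch size 16, 1,000 warm-up steps,
# learning rate decaying from 1e-4 to 5e-6; keep the checkpoint with the
# lowest validation perplexity.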

Appendix G More Experimental Results

Table 28 compares results on GSM8K and MATH between Llama 2-7B and Llama 2-70B. Llama 2-70B fine-tuned with MUSTARDSAUCE-valid consistently outperforms the model fine-tuned with MUSTARDSAUCE-random, by 8.33% in the zero-shot setting and 5.30% in the few-shot setting. It also surpasses the models fine-tuned with the invalid subset and with the entire generated dataset. These results suggest that the framework remains effective with a larger fine-tuned LM.
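For concreteness, the parenthesized gains in Table 28 appear to be computed relative to the corresponding MUSTARDSAUCE-random (ra) row; for example, for Llama 2-70B on zero-shot MATH, $(10.4-9.6)/9.6\approx 8.33\%$.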

Table 28: Compared performances on GSM8K (G) and MATH (M) between Llama 2-7B and Llama 2-70B, fine-tuned on the entire generated set (tt), the invalid subset (in), a random subset (ra), or the valid subset (va) of MustardSauce.
MODEL Zero (G) Few (G) Zero (M) Few (M)
Baselines
Llama 2-7B 7.2 12.8 2.0 2.6
Llama 2-70B 31.7 54.1 8.8 13.4
Fine-tuning
Llama 2-7B > tt 9.0 15.5 3.0 3.4
Llama 2-7B > in 8.3 14.4 2.4 3.2
Llama 2-7B > ra 8.9 14.9 2.8 3.4
Llama 2-7B > va 9.1 (+2.25%) 15.7 (+5.37%) 3.0 (+7.14%) 3.8 (+11.76%)
Llama 2-70B > tt 36.6 55.8 10.0 14.4
Llama 2-70B > in 33.4 53.7 9.2 13.6
Llama 2-70B > ra 36.1 55.4 9.6 14.2
Llama 2-70B > va 39.5 (+9.42%) 59.1 (+6.68%) 10.4 (+8.33%) 15.0 (+5.30%)

Appendix H Data Contamination Check

We check cross-contamination between MustardSauce and the evaluation datasets, inspired by [DBLP:journals/corr/abs-2303-08774]. Instead of substring matching, which may produce false negatives and false positives, we compute cosine similarities based on text-embedding-ada-002 (https://openai.com/blog/new-and-improved-embedding-model) and then inspect the nearest data points in each dataset pair. The automated theorem proving (ATP) dataset miniF2F contains only formal statements and proofs, while the math word problem (MWP) dataset GSM8K contains only informal statements and solutions, so for a more detailed inspection we compare each against the corresponding portion of MustardSauce. Tables 29, 30, 31, and 32 show the inspected cases. The nearest data points are dissimilar; we therefore rule out data contamination of the generated MustardSauce. A minimal sketch of the similarity check follows.
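The sketch assumes the current OpenAI Python client; the helper names are ours.

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts, model="text-embedding-ada-002"):
    # Batch-embed texts with the OpenAI embeddings endpoint.
    response = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in response.data])

def nearest_pairs(mustardsauce_texts, eval_texts, top_k=5):
    # Return the top-k most similar cross-dataset pairs by cosine similarity.
    a, b = embed(mustardsauce_texts), embed(eval_texts)
    a /= np.linalg.norm(a, axis=1, keepdims=True)
    b /= np.linalg.norm(b, axis=1, keepdims=True)
    sims = a @ b.T  # cosine similarity matrix
    flat = np.argsort(sims, axis=None)[::-1][:top_k]
    return [(i // sims.shape[1], i % sims.shape[1], float(sims.flat[i]))
            for i in flat]

The returned index pairs are then inspected manually, as in Tables 29 to 32.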

Table 29: Nearest data points between MustardSauce and miniF2F.
MustardSauce v.s. miniF2F (cosine similarity = 0.6439)
MustardSauce Case
Informal Statement:
Alex has 5 ten-dollar bills and 3 one-dollar bills. How much money does Alex have in total?
Informal Proof:
To find out how much money Alex has in total, we need to multiply the number of each type of bill by its value. So, Alex has 5 ten-dollar bills, which equals 5 * 10 = 50 dollars. He also has 3 one-dollar bills, which equals 3 * 1 = 3 dollars. Adding these two amounts together gives 50 + 3 = 53 dollars. Therefore, Alex has 53 dollars in total.
Formal Statement and Proof:
def calculate_money (tens : ℕ) (ones : ℕ) : ℕ := tens * 10 + ones * 1
example : calculate_money 5 3 = 53 :=
begin
rw calculate_money,
refl,
end
miniF2F Case
theorem algebra_sqineq_unitcircatbpamblt1
(a b : ℝ)
(h0 : a^2 + b^2 = 1) :
a * b + (a - b) ≤ 1 :=
begin
nlinarith [sq_nonneg (a - b)],
end
Table 30: Nearest data points between MustardSauce and mathlib.
MustardSauce v.s. mathlib (cosine similarity = -0.0361)
MustardSauce Case
Informal Statement:
A cube has a side length of 5 cm. What is the volume of the cube?
Informal Proof:
The volume of a cube is calculated by raising the side length to the power of 3. So in this case, the volume is 5 cm * 5 cm * 5 cm = 125 cubic centimeters.
Formal Statement and Proof:
def cube_volume (side_length : ℕ) : ℕ := side_length * side_length * side_length
#eval cube_volume 5 -- returns 125
mathlib Case
GOAL α : Type u, β : Type v, γ : Type w, f : α → β → γ, l₁ : list α ⊢ ∀ (l₂ : list β), (list.map₂ f l₁ l₂).length = linear_order.min l₁.length l₂.length
PROOFSTEP induction l₁; intro l₂; cases l₂; simp [*, add_one, min_succ_succ, nat.zero_min, nat.min_zero]
Table 31: Nearest data points between MustardSauce and GSM8K.
MustardSauce v.s. GSM8K (cosine similarity = 0.5975)
MustardSauce Case
Informal Statement:
Given two vectors u = (u1,u2) and v = (v1,v2) in R2, prove that the operation of vector addition, defined as u + v = (u1+v1, u2+v2), is commutative. That is, prove that for all u and v in R2, u + v = v + u.
Informal Proof:
We can prove this by direct computation. Given the vectors u = (u1, u2) and v = (v1, v2), we have:
u + v = (u1 + v1, u2 + v2)
and
v + u = (v1 + u1, v2 + u2).
Since addition is commutative in R (the set of all real numbers), we have u1 + v1 = v1 + u1 and u2 + v2 = v2 + u2. Thus, we conclude that (u1 + v1, u2 + v2) = (v1 + u1, v2 + u2), which means that u + v = v + u.
Formal Statement and Proof:
import data.real.basic
variables (u1 u2 v1 v2 : ℝ)
-- defining vectors as pair of real numbers
def vector := ℝ × ℝ
-- defining vector addition
def vadd (u v : vector) : vector := (u.1 + v.1, u.2 + v.2)
-- defining vectors u and v
def u : vector := (u1, u2)
def v : vector := (v1, v2)
-- commutativity of vector addition
theorem vadd_comm : vadd (u u1 u2) (v v1 v2) = vadd (v v1 v2) (u u1 u2) :=
begin
unfold vadd,
unfold u,
unfold v,
rw add_comm u1 v1,
rw add_comm u2 v2,
end
GSM8K Case
Question: The local firefighters are doing a "fill the boot" fundraiser. Their goal is to raise $6300. After the first 3 hours, they have raised $2100. For how many hours do they have to fundraise in total to reach their goal, assuming an equal amount raised in every hour?
Answer:
The fireman raise 2100 / 3 = $<<2100/3=700>>700 per hour.
They have to fundraise for 6300 / 700 = <<6300/700=9>>9 hours.
#### 9
Table 32: Nearest data points between MustardSauce and MATH.
MustardSauce v.s. MATH (cosine similarity = 0.6064)
MustardSauce Case
Informal Statement:
Given two vectors u = (u1,u2) and v = (v1,v2) in R2, prove that the operation of vector addition, defined as u + v = (u1+v1, u2+v2), is commutative. That is, prove that for all u and v in R2, u + v = v + u.
Informal Proof:
We can prove this by direct computation. Given the vectors u = (u1, u2) and v = (v1, v2), we have:
u + v = (u1 + v1, u2 + v2)
and
v + u = (v1 + u1, v2 + u2).
Since addition is commutative in R (the set of all real numbers), we have u1 + v1 = v1 + u1 and u2 + v2 = v2 + u2. Thus, we conclude that (u1 + v1, u2 + v2) = (v1 + u1, v2 + u2), which means that u + v = v + u.
Formal Statement and Proof:
import data.real.basic
variables (u1 u2 v1 v2 : ℝ)
-- defining vectors as pair of real numbers
def vector := ℝ × ℝ
-- defining vector addition
def vadd (u v : vector) : vector := (u.1 + v.1, u.2 + v.2)
-- defining vectors u and v
def u : vector := (u1, u2)
def v : vector := (v1, v2)
-- commutativity of vector addition
theorem vadd_comm : vadd (u u1 u2) (v v1 v2) = vadd (v v1 v2) (u u1 u2) :=
begin
unfold vadd,
unfold u,
unfold v,
rw add_comm u1 v1,
rw add_comm u2 v2,
end
MATH Case
Problem: If a snack-size tin of peaches has $40$ calories and is $2\%$ of a person's daily caloric requirement, how many calories fulfill a person's daily caloric requirement?
Solution: If 40 calories is equal to $2\%=\frac{2}{100}=\frac{1}{50}$ of a person's daily requirement, then a person's daily caloric requirement is: $4\cdot 50=\boxed{2000}$
Answer: 2000