Datasets for Studying Generalization from Easy to Hard Examples
1 Introduction
In domains like computer vision, single- and multi-agent games, and mathematical reasoning, classically trained models perform well on inputs from the same distribution used for training, but often fail to extrapolate their knowledge to more difficult tasks sampled from a different (but related) distribution. The goal of approaches like deep thinking and algorithm learning is to construct systems that achieve this extrapolation. With this in mind, we detail several datasets intended to motivate and facilitate novel research into systems that generalize from easy training data to harder test examples.
We present three datasets: Prefix Sums, Mazes, and Chess Puzzles. We also provide an easy-to-install Python package and accompanying documentation to make training and testing on this data accessible (the source code is available at http://github.com/aks2203/easy-to-hard-data).
2 Prefix Sums
The Prefix Sums dataset is meant to provide a simple toy baseline for testing new approaches, as models can be trained and tested rapidly on this dataset. Inputs are lists of binary digits (zeros and ones). Each sample comes with a label/target: a list of the same length containing the cumulative sums modulo two of the input string. For 52 different input lengths, we provide sets of 10,000 examples each. The shortest strings available are 16 bits, and we include every length through 64 bits as well as 72, 128, 256, and 512.
We refer to longer input strings as “harder” examples, since computing prefix sums for longer inputs requires more work in the classical algorithmic sense. By offering so many different lengths, we provide many levels of difficulty.
2.1 Data Generation
For each choice of sample length, we produce 10,000 unique random strings whose entries are independent fair coin flips. We then compute their cumulative sums modulo two and save the input-output pairs.
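The generation procedure is simple enough to sketch in a few lines. The snippet below is a minimal illustration (not the repository’s exact code) that samples random bit strings and computes their cumulative sums modulo two; for brevity it does not enforce uniqueness of the sampled strings.

```python
import torch

def make_prefix_sum_split(length, num_samples=10_000, seed=0):
    """Sample random bit strings and their prefix sums modulo two."""
    generator = torch.Generator().manual_seed(seed)
    inputs = torch.randint(0, 2, (num_samples, length), generator=generator)
    targets = torch.cumsum(inputs, dim=1) % 2
    return inputs, targets

inputs, targets = make_prefix_sum_split(length=16)
print(inputs[0].tolist())   # a random 16-bit input string
print(targets[0].tolist())  # the running parity of the bits above
```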
2.2 Examples
Representative examples with 16-bit and 28-bit inputs are shown below.

Input: 0 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0
Target: 0 1 0 0 1 1 1 0 1 0 0 1 1 1 0 0

Input: 1 0 1 1 0 0 1 0 1 1 1 0 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 0
Target: 1 1 0 1 1 1 0 0 1 0 1 1 1 0 0 1 0 0 0 0 1 1 0 1 1 0 0 0
3 Mazes
The Mazes dataset contains images of mazes and their solutions. The solutions are represented as segmentation maps of the input pixels, with one class for pixels that are on the optimal (i.e., shortest) path from start to finish, and another class for those that are not. The inputs are three-channel (RGB) images. The “start” of the maze is marked by a red square, the “end” is marked by a green square, and the walls of the maze are black. Every square where the player is allowed to move is white. Note that the “start” and “end” positions of an example can be swapped without changing the label; we do not care about the direction in which the path is walked.
3.1 Data Generation
We generate the mazes as abstract graphs, and then render them as images. We initialize a square grid graph, then use depth-first search to find a subset of the edges that forms a spanning tree. We randomly assign two nodes in the graph to be the end points of the maze. To render an image from this representation, we draw each node and each edge in the spanning tree as a white cell and each edge from the grid graph that is not in the spanning tree as a black cell. See Figure 2 for a depiction of each stage in the maze generation process. The code used was adapted from another maze generation repository [1]. Finally, the solutions are generated with breadth-first search, using the image as input. By construction, the generated mazes admit a unique solution; in other words, there is exactly one path connecting the “start” and “end” positions.
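The graph-level steps are easy to sketch. The following is a minimal illustration (not the code from the repository) of carving a spanning tree of an n × n grid with randomized depth-first search and recovering the unique start-to-end path with breadth-first search; rendering to pixels is omitted.

```python
import random
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def carve_spanning_tree(n, rng):
    """Randomized depth-first search over an n x n grid; keep only tree edges."""
    adjacency = {(r, c): [] for r in range(n) for c in range(n)}
    visited = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        r, c = stack[-1]
        options = [(r + dr, c + dc) for dr, dc in NEIGHBORS
                   if (r + dr, c + dc) in adjacency and (r + dr, c + dc) not in visited]
        if options:
            nxt = rng.choice(options)
            adjacency[(r, c)].append(nxt)
            adjacency[nxt].append((r, c))
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()
    return adjacency

def shortest_path(adjacency, start, end):
    """Breadth-first search; the spanning tree guarantees this path is unique."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            break
        for nxt in adjacency[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

rng = random.Random(0)
maze = carve_spanning_tree(9, rng)
start, end = rng.sample(sorted(maze), 2)
print(shortest_path(maze, start, end))
```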

A note on the size of mazes: After creating an n × n maze, we represent the graph using two pixels per cell, and add a 3-pixel-wide black border on each side. For this reason, a dataset of n × n mazes is represented using images of dimension (2n + 6) × (2n + 6). For example, the 9 × 9 Mazes dataset contains images of size 24 × 24 pixels.
3.2 Examples
Figure 3 shows examples of mazes and their solutions from datasets of three different difficulties.






4 Chess Puzzles
Training a full-scale chess game system is a complex task that requires large code bases and significant compute resources. The chess puzzles compiled and provided by Lichess [2] make for a much more accessible task. In these puzzles, the inputs are mid-game boards and the outputs are the unique best next move as determined by a state-of-the-art chess engine. Here, we provide a Chess Puzzles dataset that encodes the complex decision problems required to play chess, but is formatted as a simple pixel-wise classification problem that is easily used for training and testing. Our dataset consists of images representing game states (i.e., board configurations) and labels for each game state, which are 2D maps in which the source and destination positions for the optimal move are marked with a 1, and all other positions are marked with a 0.
4.1 Data Generation
By combing through 200,000,000 user games with the Stockfish 12/13 chess engines, Lichess creates puzzles consisting of sequences of unique best next moves. Once a puzzle is available for play on Lichess’s website, its initial Elo rating is refined by having users with known ratings attempt to defeat the puzzle. (Elo is a system for rating players; one increases their Elo by besting other players with high Elo, and decreases their Elo by failing against lower-rated players. Chess puzzles are rated similarly by presenting them to players with known skill ratings.) Interactions with hundreds or thousands of users eventually leave the puzzle with an equilibrated final rating, which quantifies difficulty. Each puzzle is tagged with the Forsyth-Edwards Notation (FEN) representation of the board and an automatically generated short descriptor using code made available on the Lichess website. In other words, we can download a string that describes the current board, the best next move, and the difficulty rating for each of these puzzles.
We use the FEN information to create a PyTorch tensor that encodes the board state as a multi-channel “image.” This representation consists of twelve channels, one for each class of white piece and one for each class of black piece. In particular, each channel has ones at the locations of the relevant pieces (e.g., pawns for one channel, rooks for another, etc.) and zeros elsewhere. The first six channels correspond to the player who is to move next. For this reason, a machine learning system that accepts this representation should not need to be explicitly told which player acts next; however, a boolean flag for this purpose can be requested by the user when the dataset is constructed. Note that our formulation of this dataset leaves out information about castling rights and en passant – a simplification of the game we make for convenience.
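For concreteness, the sketch below builds such a twelve-channel tensor from a FEN string using the python-chess library. It is an illustration only; the exact channel ordering and board orientation used in our dataset may differ.

```python
import chess
import torch

PIECE_TYPES = (chess.PAWN, chess.KNIGHT, chess.BISHOP,
               chess.ROOK, chess.QUEEN, chess.KING)

def fen_to_planes(fen):
    """Encode a board as a 12 x 8 x 8 tensor of piece-location indicators."""
    board = chess.Board(fen)
    planes = torch.zeros(12, 8, 8)
    for square, piece in board.piece_map().items():
        row, col = divmod(square, 8)                  # rank and file, 0-indexed
        channel = PIECE_TYPES.index(piece.piece_type)
        if piece.color != board.turn:                 # last six channels: the opponent
            channel += 6
        planes[channel, row, col] = 1.0
    return planes

planes = fen_to_planes(chess.STARTING_FEN)
print(planes.sum())  # 32 pieces on the opening board
```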
Each board state is labeled with a single-channel tensor representing the solution to the puzzle. These tensors contain zeros everywhere except on the origin and destination squares of the piece that should be moved. Again, note that the target itself does not explicitly indicate what type of piece is moved or which color moves; that information can be determined from the input together with the knowledge of whose turn it is.
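A label of this form can be derived from the puzzle’s best move. The sketch below is again an illustration (using python-chess and a move given in UCI notation, which is an assumption about the move format) that marks the origin and destination squares in an 8 × 8 map.

```python
import chess
import torch

def move_to_target(uci_move):
    """Return an 8 x 8 map with ones on the move's origin and destination squares."""
    move = chess.Move.from_uci(uci_move)
    target = torch.zeros(8, 8)
    for square in (move.from_square, move.to_square):
        row, col = divmod(square, 8)
        target[row, col] = 1.0
    return target

print(move_to_target("e2e4").nonzero())  # indices of the two marked squares
```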
4.2 Examples






5 Python Package
The Python package for these datasets is available through the Package Installer for Python (pip). We provide Python classes for each problem type, and the class constructors accept an argument specifying the desired difficulty for the dataset. The documentation and code are available at https://pypi.org/project/easy-to-hard-data/.
When a dataset object is created, the user can point the constructor to the location of the downloaded raw data, or else a flag can be set to perform this download automatically.
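A typical workflow looks roughly like the following. This is a hedged sketch: the constructor arguments shown (root, num_bits, download) are illustrative and may not match the package exactly; consult the documentation for the precise interface.

```python
import torch
from torch.utils.data import DataLoader
from easy_to_hard_data import PrefixSumDataset

# Argument names here are illustrative; download=True is meant to fetch the
# raw data into `root` if it is not already present.
dataset = PrefixSumDataset(root="./data", num_bits=32, download=True)
loader = DataLoader(dataset, batch_size=128, shuffle=True)

inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)
```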
In addition to initializing dataset objects, the repository also houses code for generating the data and code for making visualizations of mazes and chess boards. The dataset objects rely on standard PyTorch features and should be easy to modify to create variations on the standard benchmark problems.
References
- [1] C. Hill. Making a maze, Apr 2017.
- [2] Lichess. Lichess open puzzles database. https://database.lichess.org/#puzzles, 2021. Accessed: 2021-04-01.