Datasets for Studying Generalization from Easy to Hard Examples
1 Introduction
In domains like computer vision, single- and multi-agent games, and mathematical reasoning, classically trained models perform well on inputs from the same distribution used for training, but often fail to extrapolate their knowledge to more difficult tasks sampled from a different (but related) distribution. The goal of approaches like deep thinking and algorithm learning is to construct systems that achieve this extrapolation. With this in mind, we detail several datasets intended to motivate and facilitate novel research into systems that generalize from easy training data to harder test examples.
We present three datasets: Prefix Sums, Mazes, and Chess Puzzles. We also provide an easy-to-install Python package and accompanying documentation to make training and testing on this data accessible (the source code is available at http://github.com/aks2203/easy-to-hard-data).
2 Prefix Sums
The Prefix Sums dataset is meant to provide a simple toy baseline for testing new approaches, as models can be trained and tested rapidly on this dataset. Inputs are lists of binary digits (zeros and ones). Each sample comes with a label/target: a list of the same length containing the cumulative sums modulo two of the input string. For 52 different input lengths, we provide sets of 10,000 examples each. The shortest strings available are 16 bits, and we include every length through 64 bits as well as 72, 128, 256, and 512.
We refer to longer input strings as “harder” examples, since computing prefix sums for longer inputs requires more work in the classical algorithmic sense. By offering so many different lengths, we provide many levels of difficulty.
2.1 Data Generation
For each choice of sample length, we produce 10,000 unique random strings whose entries are independent fair coin flips. We then compute their cumulative sums modulo two and save the input-output pairs.
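The generation procedure is simple enough to sketch in a few lines. The snippet below is a minimal illustration (not the repository’s exact code) that samples random bit strings and computes their cumulative sums modulo two; for brevity it does not enforce uniqueness of the sampled strings.

```python
import torch

def make_prefix_sum_split(length, num_samples=10_000, seed=0):
    """Sample random bit strings and their prefix sums modulo two."""
    generator = torch.Generator().manual_seed(seed)
    inputs = torch.randint(0, 2, (num_samples, length), generator=generator)
    targets = torch.cumsum(inputs, dim=1) % 2
    return inputs, targets

inputs, targets = make_prefix_sum_split(length=16)
print(inputs[0].tolist())   # a random 16-bit input string
print(targets[0].tolist())  # the running parity of the bits above
```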
2.2 Examples
Representative examples with 16-bit and 28-bit inputs are shown below.

Input: 0 1 1 0 1 0 0 1 1 1 0 1 0 0 1 0
Target: 0 1 0 0 1 1 1 0 1 0 0 1 1 1 0 0

Input: 1 0 1 1 0 0 1 0 1 1 1 0 0 1 0 1 1 0 0 0 1 0 1 1 0 1 0 0
Target: 1 1 0 1 1 1 0 0 1 0 1 1 1 0 0 1 0 0 0 0 1 1 0 1 1 0 0 0
3 Mazes
The Mazes dataset contains images of mazes and their solutions. The solutions are represented as segmentation maps of the input pixels, with one class for pixels that are on the optimal (i.e., shortest) path from start to finish, and another class for those that are not. The inputs are three-channel (RGB) images. The “start” of the maze is marked by a red square, the “end” is marked by a green square, and the walls of the maze are black. Every square where the player is allowed to move is white. Note that the “start” and “end” positions of an example can be swapped without changing the label; we do not care about the direction in which the path is walked.
3.1 Data Generation
We generate the mazes as abstract graphs, and then render them as images. We initialize a square grid graph, then use depth-first search to find a subset of the edges that forms a spanning tree. We randomly assign two nodes in the graph to be the end points of the maze. To render an image from this representation, we draw each node and each edge in the spanning tree as a white cell and each edge from the grid graph that is not in the spanning tree as a black cell. See Figure 2 for a depiction of each stage in the maze generation process. The code used was adapted from another maze generation repository [1]. Finally, the solutions are generated with breadth-first search, using the image as input. By construction, the generated mazes admit a unique solution; in other words, there is exactly one path connecting the “start” and “end” positions.
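The graph-level steps are easy to sketch. The following is a minimal illustration (not the code from the repository) of carving a spanning tree of an n × n grid with randomized depth-first search and recovering the unique start-to-end path with breadth-first search; rendering to pixels is omitted.

```python
import random
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def carve_spanning_tree(n, rng):
    """Randomized depth-first search over an n x n grid; keep only tree edges."""
    adjacency = {(r, c): [] for r in range(n) for c in range(n)}
    visited = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        r, c = stack[-1]
        options = [(r + dr, c + dc) for dr, dc in NEIGHBORS
                   if (r + dr, c + dc) in adjacency and (r + dr, c + dc) not in visited]
        if options:
            nxt = rng.choice(options)
            adjacency[(r, c)].append(nxt)
            adjacency[nxt].append((r, c))
            visited.add(nxt)
            stack.append(nxt)
        else:
            stack.pop()
    return adjacency

def shortest_path(adjacency, start, end):
    """Breadth-first search; the spanning tree guarantees this path is unique."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            break
        for nxt in adjacency[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

rng = random.Random(0)
maze = carve_spanning_tree(9, rng)
start, end = rng.sample(sorted(maze), 2)
print(shortest_path(maze, start, end))
```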

A note on the size of mazes: After creating an n × n maze, we represent the graph using two pixels per cell, and add a 3-pixel-wide black border on each side. For this reason, a dataset of n × n mazes is represented using images of dimension (2n + 6) × (2n + 6). For example, the 9 × 9 Mazes dataset contains images of size 24 × 24 pixels.
3.2 Examples
Figure 3 shows examples of mazes and their solutions from datasets of three different difficulties.






4 Chess Puzzles
Training a full-scale chess game system is a complex task that requires large code bases and significant compute resources. The chess puzzles compiled and provided by Lichess [2] make for a much more accessible task. In these puzzles, the inputs are mid-game boards and the outputs are the unique best next move as determined by a state-of-the-art chess engine. Here, we provide a Chess Puzzles dataset that encodes the complex decision problems required to play chess, but is formatted as a simple pixel-wise classification problem that is easily used for training and testing. Our dataset consists of images representing game states (i.e., board configurations) and labels for each game state, which are 2D maps in which the source and destination positions for the optimal move are marked with a 1, and all other positions are marked with a 0.
4.1 Data Generation
By combing through 200,000,000 user games with the Stockfish 12/13 chess engines, Lichess creates puzzles consisting of sequences of unique best next moves. Once a puzzle is available for play on Lichess’s website, its initial Elo rating is refined by having users with known ratings attempt to defeat the puzzle. (Elo is a system for rating players; one increases their Elo by besting other players with high Elo, and decreases their Elo by failing against lower-rated players. Chess puzzles are rated similarly by presenting them to players with known skill ratings.) Interactions with hundreds or thousands of users eventually leave the puzzle with an equilibrated final rating, which quantifies difficulty. Each puzzle is tagged with the Forsyth-Edwards Notation (FEN) representation of the board and an automatically generated short descriptor using code made available on the Lichess website. In other words, we can download a string that describes the current board, the best next move, and the difficulty rating for each of these puzzles.
We use the FEN information to create a PyTorch tensor that encodes the board state as a multi-channel “image.” This representation consists of twelve channels, one for each class of white piece and one for each class of black piece. In particular, each channel has ones at the locations of the relevant pieces (e.g., pawns for one channel, rooks for another, etc.) and zeros elsewhere. The first six channels correspond to the player who is to move next. For this reason, a machine learning system that accepts this representation should not need to be explicitly told which player acts next; however, a boolean flag for this purpose can be requested by the user when the dataset is constructed. Note that our formulation of this dataset leaves out information about castling rights and en passant – a simplification of the game we make for convenience.
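For concreteness, the sketch below builds such a twelve-channel tensor from a FEN string using the python-chess library. It is an illustration only; the exact channel ordering and board orientation used in our dataset may differ.

```python
import chess
import torch

PIECE_TYPES = (chess.PAWN, chess.KNIGHT, chess.BISHOP,
               chess.ROOK, chess.QUEEN, chess.KING)

def fen_to_planes(fen):
    """Encode a board as a 12 x 8 x 8 tensor of piece-location indicators."""
    board = chess.Board(fen)
    planes = torch.zeros(12, 8, 8)
    for square, piece in board.piece_map().items():
        row, col = divmod(square, 8)                  # rank and file, 0-indexed
        channel = PIECE_TYPES.index(piece.piece_type)
        if piece.color != board.turn:                 # last six channels: the opponent
            channel += 6
        planes[channel, row, col] = 1.0
    return planes

planes = fen_to_planes(chess.STARTING_FEN)
print(planes.sum())  # 32 pieces on the opening board
```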
Each board state is labeled with a single-channel tensor representing the solution to the puzzle. These tensors contain zeros everywhere except on the origin and destination squares of the piece that should be moved. Again, note that the target itself does not explicitly indicate what type of piece is moved or which color moves; that information can be determined from the input together with the knowledge of whose turn it is.
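A label of this form can be derived from the puzzle’s best move. The sketch below is again an illustration (using python-chess and a move given in UCI notation, which is an assumption about the move format) that marks the origin and destination squares in an 8 × 8 map.

```python
import chess
import torch

def move_to_target(uci_move):
    """Return an 8 x 8 map with ones on the move's origin and destination squares."""
    move = chess.Move.from_uci(uci_move)
    target = torch.zeros(8, 8)
    for square in (move.from_square, move.to_square):
        row, col = divmod(square, 8)
        target[row, col] = 1.0
    return target

print(move_to_target("e2e4").nonzero())  # indices of the two marked squares
```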
4.2 Examples






5 Python Package
The Python package for these datasets is available through the Package Installer for Python (pip). We provide Python classes for each problem type, and the class constructors accept an argument specifying the desired difficulty for the dataset. The documentation and code are available at https://pypi.org/project/easy-to-hard-data/.
When a dataset object is created, the user can point the constructor to the location of the downloaded raw data, or else a flag can be set to perform this download automatically.
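A typical workflow looks roughly like the following. This is a hedged sketch: the constructor arguments shown (root, num_bits, download) are illustrative and may not match the package exactly; consult the documentation for the precise interface.

```python
import torch
from torch.utils.data import DataLoader
from easy_to_hard_data import PrefixSumDataset

# Argument names here are illustrative; download=True is meant to fetch the
# raw data into `root` if it is not already present.
dataset = PrefixSumDataset(root="./data", num_bits=32, download=True)
loader = DataLoader(dataset, batch_size=128, shuffle=True)

inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)
```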
In addition to initializing dataset objects, the repository also houses code for generating the data and code for making visualizations of mazes and chess boards. The dataset objects rely on standard PyTorch features and should be easy to modify to create variations on the standard benchmark problems.
References
- [1] C. Hill. Making a maze, Apr 2017.
- [2] Lichess. Lichess open puzzles database. https://database.lichess.org/#puzzles, 2021. Accessed: 2021-04-01.