
The ProfessionAl Go annotation datasEt (PAGE)

Yifan Gao, Danni Zhang, Haoyue Li Yifan Gao is the corresponding author (e-mail: [email protected]).Yifan Gao is with School of Biomedical Engineering, University of Science and Technology of China, Hefei, China. Danni Zhang is with School of Economics, Anhui University, Hefei, China. Haoyue Li is with College of Medicine and Biological Information Engineering, Northeastern University, Shenyang, China.
Abstract

The game of Go has been highly under-researched due to the lack of game records and analysis tools. In recent years, the growing number of professional competitions and the advent of AlphaZero-based algorithms have provided an excellent opportunity for analyzing human Go games on a large scale. In this paper, we present the ProfessionAl Go annotation datasEt (PAGE), containing 98,525 games played by 2,007 professional players and spanning more than 70 years. The dataset includes rich AI analysis results for each move. Moreover, PAGE provides detailed metadata for every player and game after manual cleaning and labeling. Beyond a preliminary analysis of the dataset, we provide sample tasks that benefit from our dataset to demonstrate the potential of PAGE in multiple research directions. To the best of our knowledge, PAGE is the first dataset with extensive annotation in the game of Go. This work is an extended version of [1] with a more detailed description, analysis, and application.

Index Terms:
Go, game analytics, data mining, psychology, board game

I Introduction

Go (weiqi, baduk) is one of Asia’s most popular board games, especially in China, Japan, and Korea [2]. In the last two decades, professional Go tournaments have grown dramatically, with millions of viewers watching games on TV or online servers.

Data-driven analytics in traditional sports, e-sports, and board games has been a popular research area [3, 4, 5, 6, 7]. These evolving analytics techniques have significantly contributed to the sport and game community. In recent years, with the development of game record databases, large-scale analysis of Go has become a reality. In particular, the advent of AlphaZero [8] and KataGo [9] has made it possible to understand human decision-making in games at a deep level.

Against this background, the analysis of Go is potentially very promising. On the one hand, as a popular competitive game, developing data-driven analytics for Go can enrich the fan experience, help players improve their abilities, and benefit the community in other ways. On the other hand, as a game with simple rules and complex content, it is a valuable vehicle for psychological research [10, 11]. However, conducting such research on Go remains nontrivial and challenging, and relevant work is extremely scarce. First, there are no structured professional datasets, so it is difficult for researchers to extract metadata and other game information. Second, Go has no match statistics that directly reflect the state of both sides (analogous to, e.g., shots in soccer). Finally, few people know Go well compared to popular sports and games, making it hard to organize the data and construct practical features.

To overcome these problems, we present the ProfessionAl Go annotation datasEt (PAGE), containing 98,525 games played by 2,007 professional players from 1950 to 2021. The raw records and metadata of the games are derived from a publicly available Go database, and we annotate player- and tournament-related metadata by combining multiple trustworthy sources. Comprehensive in-game statistics are computed with KataGo, which analyzed every game. As a result, the dataset has good quality and broad coverage.

In this paper, we provide three challenging downstream tasks to demonstrate the application of PAGE in several research directions. First, we used PAGE to evaluate the relationship between gender differences in ratings and participation rates of professional Go players. Second, several Convolutional Neural Networks (CNNs) and Transformer architectures were applied to predict blunders of professional players. Finally, we used several popular machine learning methods to predict game outcomes from historical data. The experimental results show that PAGE can potentially be applied in different studies.

In summary, the main contributions of our work are:

  • We present the first professional Go dataset with extensive annotation. The dataset contains a large amount of metadata and in-game statistics, facilitating a wide range of research. The dataset will be made publicly available at https://github.com/YifanGao00/The-Professional-Go-Dataset.

  • We describe several applications that benefit from our dataset and suggest possible research directions.

This work is an extended version of [1], with additional explanations, comparisons, and applications. Compared to the previous work, we add extra in-game statistics, including uncertainty and ownership. In addition, this work explains the process of building PAGE in more detail and introduces the dataset structure. In particular, we also add two applications: an analysis of the role of participation rates in gender differences and blunder prediction for professional players.

The rest of the paper is organized as follows: Section II presents the background related to the proposed dataset. In section III, we describe the annotation procedure of PAGE in detail and provide the preliminary analysis of the dataset. Section IV presents three applications of the analysis performed on our dataset: participation analysis, blunder prediction, and game outcome prediction. Section V discusses potential research directions and limitations of PAGE, and the paper concludes in Section VI.

Figure 1: In-game statistics and metadata in the game between Ke Jie and Lee Sedol in the final of the 2nd Mlily Cup. These are the core properties of PAGE.

II Background

II-A Professional Go

Although Go originated in China, the professional Go system first appeared in Japan. Go became popular in Japan during the Edo period because the Shogunate funded its development. In the early 20th century, Japan became the first country to establish a modern professional system, maintaining absolute leadership until the 1980s. Nie Weiping won the Japan-China Super Go Tournament in 1985 after an incredible streak of victories against top Japanese players. Four years later, a highly anticipated Ing Cup final saw Cho Hunhyun defeat Nie Weiping 3:2 and win the 400,000 dollar prize. As a result of these two landmark events, China, Japan, and Korea formed a triple balance of power. In 1996, Korean Go entered a golden age that lasted for ten years; Korean players, represented by Lee Changho, won most of the tournament titles during this decade. Chinese players progressed rapidly afterward and became fierce competitors of the Korean players.

Two of the earliest world Go tournaments, the Ing Cup and the Fujitsu Cup, were founded in 1988. World championship titles are the highest honor for professional players, similar to Grand Slam titles in tennis. Beyond these, there are many regional-level competitions, mainly single-elimination tournaments and leagues.

II-B Computer Analysis of Human Games

With AlphaZero outperforming the best professional players by a large margin, researchers are focusing on more challenging tasks. However, AlphaZero and its open-source implementations, like ELF OpenGo [12] and Leela Zero [13], cannot analyze human games precisely. Their evaluations consider only the win rate, which fluctuates dramatically when the score gap is small, so robustness is poor. Furthermore, these AIs cannot adjust for different rules and komis (points added to compensate for the black player’s first-move advantage), resulting in incorrect game evaluations.

KataGo is the state-of-the-art AlphaZero-based framework. In addition to providing rich in-game statistics, it uses improved technologies for more precise analysis. It supports various rules and komis. Consequently, KataGo is appropriate for analyzing human games and obtaining in-game statistics. KataGo supports many different statistics, but PAGE focuses primarily on five: win rate, score difference, uncertainty, ownership, and recommended moves.

Fig. 1a to Fig. 1e show schematic diagrams of the in-game statistics. Uncertainty and ownership are fine-grained features unique to KataGo. In particular, uncertainty is the variance of the score in the current board state. Due to the characteristics of Monte Carlo Tree Search (MCTS), this value is noticeably inflated, but it still serves as a good indicator of uncertainty. Ownership is the expected ownership of each position on the board, where 1 denotes ownership by the current player and -1 ownership by the opponent. As can be seen in Fig. 1d, points closer to black (such as the multiplication symbols in the figure) lie in areas occupied by black; conversely, points closer to white (such as the circled area) are more likely to be white’s territory. The triangular areas represent regions that both players still need to fight over.
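To make the ownership feature concrete, the following sketch (our own illustration, not code from PAGE or KataGo) partitions board points using a KataGo-style ownership array; the 0.5 cutoff for "contested" points is an assumed threshold:

```python
def classify_ownership(ownership, threshold=0.5):
    """Partition board points into own / opponent / contested
    from a flat list of per-point ownership values in [-1, 1]."""
    own = sum(1 for v in ownership if v >= threshold)
    opp = sum(1 for v in ownership if v <= -threshold)
    contested = len(ownership) - own - opp
    return {"own": own, "opponent": opp, "contested": contested}

# Toy 3x3 board: two points clearly held by each side, five undecided.
demo = [0.9, 0.8, 0.1, -0.2, 0.0, 0.3, -0.7, -0.95, 0.4]
print(classify_ownership(demo))
```

On a real 19x19 board the input would be the 361-value ownership array from the analysis output, oriented from the current player's perspective.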

II-C Player Performance Analysis

Professional Go players have long improved themselves and prepared for tournaments using intuition-driven performance analysis. By replaying and reviewing games, players study their opponents’ matches, recognize their weaknesses, and hope to gain an advantage in upcoming games. Recently, AlphaZero and its derived algorithms have become far superior to humans at Go; their incredible insight into the game and accurate evaluation capabilities are helping many professional players improve. Despite this, players rarely go further with purposefully collected data, so these analysis methods remain intuition-driven.

There are well-established and promising data-driven analytics methods for many sports and games. Merhej [3] uses deep learning to assess the value of defensive actions in football. Baboota [4] predicts Premier League outcomes through machine learning with good results. Beal [5] presents a novel natural language processing approach that combines sports reporters’ statistics with background articles to improve predictive performance for football games. Castellar [6] examines reaction time and exercise time as factors affecting the performance of table tennis players. Yang [7] proposes an explainable two-stage network for predicting real-time win rates in multiplayer online battle arena games. These analysis techniques are vital for improving the fan experience and assessing player performance.

The AlphaZero algorithm has been applied to many board games, including Gomoku [14], Othello [15], and NoGo [16]. However, these games do not have many accessible game records, making it challenging to develop analysis techniques for them. In contrast, detailed chess datasets, a large player base, and advanced chess engines have significantly advanced chess analytics. Skill evaluation [17, 18], rating [19, 20, 21], and style modeling [22, 23, 24, 25] are the most common applications. These technologies demonstrate the great potential of player performance analysis in board games. There is little previous research on the game of Go; over the last two decades, however, a growing number of tournaments, players, and recorded games has made a dataset and benchmarks for Go feasible. We therefore propose PAGE, a large-scale professional Go annotation dataset, hoping to bridge the gap between this unique game and the game analytics community.

II-D Board Game in Psychological Research

Board games are an essential part of psychological research, especially in cognitive science. In particular, chess is a valuable tool for psychological investigations [26]. Unlike other board games, chess has a long history of professionalization and a large number of well-organized, structured game records spanning decades. Researchers interested in various psychological topics can use these databases for their research. In the following, we present applications of chess databases to two topics: the roles played by gender and age in cognitive performance.

Howard [27] highlighted the methodological value of chess data for the first time and presented a database of millions of records providing performance data on over 60,000 players since 1970. That article also examined other intellectual games, such as bridge, Go, and backgammon; however, data for these games tend to be more limited than for chess, making them difficult to use as tools for psychological research. Analyzing the chess database, Howard [28] proposed that gender differences in elite mind sports arise from differences in innate ability. Bilalic [29] countered this idea, arguing that different participation rates, or differences in the amount of practice, motivation, and interest of male and female players, might better explain the gender differences in chess performance. Subsequent work focused on the relationship between participation rates and gender differences [30, 31, 32]. Blanch [26] used the chess database to comprehensively analyze gender differences in chess from multiple perspectives, finding that biosocial factors, such as age and practice, rather than differences in participation rates, influence gender differences in Elo ratings. Beyond intelligence and participation rates, another proposed explanation for the gender difference is gender stereotype threat. Stafford [33] analyzed a database of over 5.5 million chess matches and found no evidence of a stereotype threat effect, while Smerdon [34] demonstrated that a typical gender stereotype threat exists in chess by introducing the variable of the opponent’s age. Furthermore, some studies have used large-scale chess databases to verify psychological effects such as the gender equality paradox [35].

Chess is also often used as a tool to study the effects of age on cognitive performance. Fair [36] estimated decline rates for chess and various sports events and confirmed that the decline is much smaller for chess than for sports. Roring [37] applied a multilevel modeling approach to a large database of chess players to study longitudinal changes in chess skill. Vaci [38] used linear mixed-effects models to explore the hypothesis that ”age is more friendly to the more able.” In the latest study, Strittmatter [39] presented evidence for a life-cycle model of cognitive performance based on a comprehensive analysis of professional chess tournament data.

Large-scale chess databases have played a crucial role in these studies. However, due to the limitations of the dataset, most of the information used in these studies contained only ELO ratings [40] and game results. With the ongoing development of AlphaZero-based algorithms, AI has been able to provide more insight into human decision-making within the game. Specifically, comprehensive and detailed fine-grained statistics, such as win rates, score differences, uncertainty, and ownership contained in PAGE, have the potential to provide more convincing evidence for psychological studies.

Figure 2: Illustrations of dataset statistics. (a) distribution of the age of players in different generations; (b) game counts in years; (c) distribution of game lengths; (d) Mean move similarity in years; (e) mean loss win rate with different ratings in years; (f) mean loss score with different ratings in years.

III The Professional Go Annotation Dataset

III-A Data Acquisition

III-A1 Metadata

We begin with the raw game data from Go4Go (https://www.go4go.net), the largest Go game database. Game records in the database can be used for academic research with the owner’s permission. Some games were excluded from the raw records, including AI competitions, amateur games, handicap games, and games that ended abnormally. In the end, we obtained 98,525 games played by 2,007 professional players. We extracted metadata about each player’s birth date, gender, and affiliated association from three reliable online references: Go Ratings (https://www.goratings.org/en/), Sensei’s Library (https://senseis.xmp.net), and List of Go Players (https://db.u-go.net).
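Part of the exclusion step above can be automated from the SGF records themselves. The sketch below is illustrative only: it relies on the standard SGF game-info properties HA (handicap) and RE (result), and omits the player-based filters (AI and amateur games), which require external metadata; the helper name is our own.

```python
import re

def keep_game(sgf_text):
    """Illustrative sketch of PAGE's SGF-level exclusion rules:
    drop handicap games and games that ended abnormally."""
    # SGF stores the handicap as HA[n]; any real handicap is excluded.
    ha = re.search(r"HA\[(\d+)\]", sgf_text)
    if ha and int(ha.group(1)) >= 2:
        return False
    # RE[] holds the result; abnormal endings are marked Void/? or left empty.
    result = re.search(r"RE\[([^\]]*)\]", sgf_text)
    if result is None or result.group(1) in ("", "?", "Void"):
        return False
    return True

print(keep_game("(;PB[Lee Sedol]PW[Ke Jie]RE[W+R])"))  # normal game
print(keep_game("(;PB[A]PW[B]HA[2]RE[B+5.5])"))        # handicap game
```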

Metadata related to matches and rounds were extracted from SGF files. We obtained 503 tournament categories (e.g., Chunlan Cup, LG Cup), 3,131 tournaments (e.g., 1st Chunlan Cup, 15th LG Cup), and 342 rounds (e.g., round 2, final). Every tournament was manually labeled by a knowledgeable amateur player with its category, level, and region. The tournament categories are championship, league, team, and friendly; the levels are top-tier and regular; the regions are international and domestic.

We used a Python script (https://github.com/pfmonville/whole_history_rating) to calculate the players’ WHR ratings [41]. To obtain a fine-grained rating score, we took the following approach. First, the initial number of iterations was set to 50. Second, games were added day by day with one iteration performed each day, so the rating was calculated at every time point. Notably, following Go Ratings, the initial score was set to 3000. Finally, we obtained the WHR rating and its uncertainty; the uncertainty is higher if a player has just entered the leaderboard or has not played for a long time.

III-A2 In-game Statistics

The game records were analyzed using KataGo v1.9.1 with the TensorRT backend. To balance accuracy and speed, we applied the following strategy: we performed 100 simulations for each move to obtain initial in-game statistics, and for moves with high fluctuation (a change of more than 10% win rate or 5 points), KataGo re-evaluated the position with 1,000 simulations. Finally, we obtained in-game statistics for all games, including win rate, score difference, uncertainty, ownership distribution, and recommended moves. The analysis ran on an NVIDIA RTX 2080Ti graphics card and took about 40 days.
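The two-pass strategy amounts to a simple per-move decision rule. A minimal sketch, with function and variable names of our own (the thresholds are the paper's 10% win rate and 5 points):

```python
def needs_reanalysis(prev, curr, wr_thresh=0.10, score_thresh=5.0):
    """Decide whether a move's quick 100-visit evaluation should be
    redone at 1,000 visits: flag moves whose evaluation fluctuated by
    more than 10% win rate or 5 points of score.
    `prev` and `curr` are (win_rate, score_lead) pairs."""
    d_wr = abs(curr[0] - prev[0])
    d_score = abs(curr[1] - prev[1])
    return d_wr > wr_thresh or d_score > score_thresh

print(needs_reanalysis((0.55, 1.2), (0.41, -0.5)))  # 14% win-rate swing
print(needs_reanalysis((0.50, 0.0), (0.53, 1.0)))   # small fluctuation
```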

TABLE I: Most frequent players.
Players Games Players Games
Cho Chikun 2079 O Rissei 1217
Lee Changho 1962 Yamashita Keigo 1217
Kobayashi Koichi 1605 Gu Li 1215
Rin Kaiho 1536 Otake Hideo 1214
Cho Hunhyun 1533 Cho U 1145
Lee Sedol 1375 Choi Cheolhan 1133
Yoda Norimoto 1316 Chang Hao 1100
Kato Masao 1249 Park Junghwan 1058
Takemiya Masaki 1248 Kobayashi Satoru 1044
TABLE II: Most frequent matchups.
Matchups Games Win Loss
Cho Hunhyun vs. Lee Changho 287 110 177
Cho Hunhyun vs. Seo Bongsoo 207 139 68
Kobayashi Koichi vs. Cho Chikun 123 60 63
Yoo Changhyuk vs. Lee Changho 122 39 83
Cho Chikun vs. Kato Masao 107 67 40
TABLE III: Most frequent tournaments.
Tournaments Games Tournament Games
Chinese League A 11092 Japanese Oza 1996
Korean League A 4773 Japanese NHK Cup 1939
Japanese Honinbo 3655 Samsung Cup 1806
Japanese Ryusei 2932 Chinese Mingren 1588
Japanese Meijin 2869 Chinese Women’s League 1569
Japanese Judan 2766 LG Cup 1514
Japanese Kisei 2723 Korean League B 1411
Japanese Tengen 2383 Chinese Tianyuan 1282
Japanese Gosei 2230 Korean Women’s League 1279

III-B Dataset Analysis

In this section, we introduce the preliminary analysis of PAGE. There are two aspects to the statistics: players and games.

III-B1 Players

Fig. 2a presents the age distribution of players across different generations of games. Before 1990, the average age of players was between 30 and 50. Over the following decades, the average age dropped rapidly to under 30, and in the 21st century the majority of games were played by players in their 20s. This trend shows that professional Go is becoming more competitive. Table I and Table II show the most frequent players and matchups. Most games were played by these prominent legends, illustrating the long-tailed distribution of games across players.

III-B2 Games

Fig. 2b shows the number of games per year in the dataset, which grows roughly exponentially over time. On the one hand, this indicates that the number of professional matches has increased in recent years; on the other hand, PAGE does not record many games from the early years. Table III shows the tournaments with the greatest number of games played, mostly long-running Japanese tournaments, along with the Chinese League A and the Korean League A. As shown in Fig. 2c, resigned games are shorter than non-resigned games. Figs. 2d to 2f show trends of the in-game statistics over time, from which we observe some interesting results. With AlphaGo’s advent, professional players began to imitate the AI’s preferred moves, which is particularly evident in the opening phase (first 50 moves): the coincidence rate in Fig. 2d grew significantly after 2016. The non-opening phase is harder to imitate, so its coincidence rate grows much more slowly. According to Fig. 2e and Fig. 2f, the average loss of win rate and the average loss of score also decreased after AlphaGo emerged. Furthermore, players with higher WHR ratings lose less win rate and score per move than lower-rated players. These characteristics offer the potential to develop player performance analysis techniques.

IV Downstream Tasks

IV-A Participation Analysis

In this section, we use PAGE to evaluate the relationship between gender differences and participation rates in professional Go. In chess, recent psychological research has argued that gender differences cannot be fully explained by differences in participation rates. First, we describe the methodology and the details of the data. Second, we report the experimental results.

Figure 3: Observed and expected rankings of the WHR ratings for the top 100 female players. The red line is the observed rank, and the blue line is the expected rank. The dotted lines represent the quantiles r_{low} and r_{high}.
Figure 4: Observed and expected sex differences in WHR ratings. The red line is the actual score differences, and the blue line represents score differences attributed to different participation rates.
Figure 5: The month-by-month trend of the explanation rate of the actual rating differences. The dotted lines represent the upper and lower bounds of the observed values.

IV-A1 Materials and Methods

Inspired by the work of Knapp [32] and Blanch [26], we conducted three experiments to explore the relationship between differences in participation rates and gender differences. First, we tested WHR scores using the negative hypergeometric distribution. Specifically, among N female and M male players, let R_k be the rank of the k-th best female player among all participants. We calculated the 0.5% and 99.5% quantiles r_{low} and r_{high} of the expected rank distribution. If there were no gender effect in Go performance, the rank R_k of the k-th ranked female player should lie between r_{low} and r_{high}. Second, we made a binary observation for each sampling unit to determine how much of the difference in rating scores can be attributed to the difference in participation rates between male and female players, i.e., the explanation rate. Finally, we computed the explanation rate month by month to investigate changes in the gender difference over time.
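The null model of the first experiment can also be approximated by Monte Carlo simulation rather than the closed-form negative hypergeometric distribution: under no gender effect, the female players' ranks are a uniform random subset of all ranks. The sketch below is our own illustration; the function name, trial count, and the symmetric 0.5%/99.5% quantiles are assumptions.

```python
import random

def expected_rank_quantiles(n_female, n_male, k, q=(0.005, 0.995),
                            trials=5000, seed=0):
    """Monte Carlo approximation of the negative-hypergeometric null:
    sample female ranks uniformly among all ranks and return the
    requested quantiles of the rank of the k-th best female player."""
    rng = random.Random(seed)
    total = n_female + n_male
    samples = []
    for _ in range(trials):
        ranks = sorted(rng.sample(range(1, total + 1), n_female))
        samples.append(ranks[k - 1])  # rank of the k-th best female
    samples.sort()
    return tuple(samples[int(p * trials)] for p in q)

# The paper's setting: 250 female and 1,116 male players, k = 100.
low, high = expected_rank_quantiles(250, 1116, 100)
print(low, high)
```

The observed rank R_k falling outside [low, high] is then evidence against the no-gender-effect null.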

In the first and second experiments, we selected the WHR ratings for December 2021 and excluded players with fewer than ten games, leaving 1,366 players: 1,116 male and 250 female. The experiments examined the rankings of the top 100 female players, i.e., k was set to 100. In the third experiment, we calculated the explanation rate month by month for months in which at least 50 female players held ratings. To reduce the effect of the rating distribution on the results, we selected only the top 20% of female players each month and computed the average; for example, k = 10 in April 1991 and k = 50 in December 2021.

IV-A2 Results

Fig. 3 shows the observed and expected rankings of the WHR scores for the top 100 female players. The observed ranks (red line) lie outside the 99.5% quantile of the expected ranking (blue line). The actual score differences (red line) and the score differences attributed to different participation rates (blue line) are shown in Fig. 4. The unexplained gap between the actual and expected differences ranges from 221 to 346 points, with a mean of 272. In other words, only 23.2% to 56.1% (mean 44%) of the actual scoring difference is explained by the different participation rates of male and female players. Fig. 5 shows the month-by-month trend of the explanation rate, which decreases overall. Notably, the explanation rate was very high for several months; this is because Rui Naiwei reached second place in the world during this period, the highest ranking ever achieved by a female player.

From the three experiments mentioned above, we can find that gender differences do exist in professional Go tournaments, and participation rates barely explain the differences in WHR scores. Compared to chess, Go tournaments are a unique tool for studying East Asian cultures. However, there is little psychological research on the game of Go. We present the first analysis of the relationship between participation rates and gender differences through PAGE and expect to motivate more psychological work in the future.

TABLE IV: Experiment results in blunder prediction.
Model Accuracy (%) Precision (normal) (%) Precision (blunder) (%)
ConvNet 69.64 73.67 65.58
ZeroNet 71.22 79.34 65.06
KataNet 71.08 76.18 66.36
Perceiver 71.10 74.66 67.38
TABLE V: Feature descriptions in game outcome prediction.
Features Description
Metadata Features
Basic Information Includes game time, player age, gender, and association.
Ranks Ranking of Go players.
WHR Score WHR rating, which measures the level of the player.
WHR Uncertainty A measure of uncertainty in the WHR rating system.
Tournament Feature Features of the tournament.
Contextual Features
Match Results The player’s recent performance in various competitions.
Match Results by Region Performance against the opponent’s region.
Matchup Results Past competitions against the opponent.
Tournament Results Past performances at this tournament.
Opponents Ranks Rank of opponents.
Opponents Ages Age of opponents.
Cross-region Counts The number of cross-regional competitions.
In-game Features
Mean Win Rate Average win rate in recent games.
Mean Score Average score difference.
Mean Loss Win Rate Average win rate lost.
Mean Loss Score Average score lost.
Advantage Rounds Number of rounds with at least a 5% win-rate or 3-point advantage.
Strong Advantage Rounds Number of rounds with at least a 10% win-rate or 5-point advantage.
Coincidence Rate The ratio of moves that match KataGo’s recommendation to total moves.
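Several of the in-game features above can be derived directly from per-move statistics. The sketch below is our own illustration: the record format and field names (`win_rate`, `loss_wr`, `matches_ai`) are assumptions, and we read "advantage rounds" as positions with a win rate of at least 55% or a score lead of at least 3 points.

```python
def ingame_features(moves, wr_adv=0.05, score_adv=3.0):
    """Compute a few of the in-game features from per-move records.
    Each move is a dict with 'win_rate', 'score', 'loss_wr', and
    'matches_ai' (whether it equals KataGo's recommended move)."""
    n = len(moves)
    return {
        "mean_win_rate": sum(m["win_rate"] for m in moves) / n,
        "mean_loss_win_rate": sum(m["loss_wr"] for m in moves) / n,
        "advantage_rounds": sum(
            1 for m in moves
            if m["win_rate"] >= 0.5 + wr_adv or m["score"] >= score_adv),
        "coincidence_rate": sum(m["matches_ai"] for m in moves) / n,
    }
```

A per-player feature vector would then aggregate these values over that player's recent games.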

IV-B Blunder Prediction

Predicting the upcoming mistakes of human players from the board state is a promising task. In professional Go tournaments, top players often steer their opponents toward positions where they are more likely to make mistakes, gaining an advantage through their opponents’ blunders. However, due to Go’s abstract and complex nature, directly analyzing human decision-making is highly challenging. In recent years, deep learning has made it possible to model the fine-grained behaviors and decisions of human players. In this section, we leverage various deep learning methods, including CNNs [42] and Transformer architectures [43, 44], to predict blunders in professional Go tournaments.

IV-B1 Materials and Methods

This study explored the performance of several state-of-the-art deep learning models: ConvNet, ZeroNet, KataNet, and the Perceiver Transformer. Shao [45] designed CNNs to predict moves in board games, achieving satisfactory performance on the RenjuNet database; since this work uses regular CNNs, we name it ConvNet. We refer to the network used in AlphaGo Zero as ZeroNet, and the architecture used in KataGo as KataNet; our implementation of KataNet includes no domain-specific additions such as the ownership subhead or score distribution head. Finally, the Perceiver [46] is a model built on the Transformer architecture. It utilizes an asymmetric attention mechanism that iteratively distills the input into a latent bottleneck, allowing it to scale to multimodal inputs. To ensure fairness, we set the CNNs to roughly the same depth and width, namely 6 successive blocks and 64 channels. For the Perceiver, we adjusted the hyperparameters to keep the parameter count at the same order of magnitude as the CNNs.
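All four models consume the raw board position. As a simplified stand-in for the richer input encodings used by ZeroNet and KataNet, a minimal plane encoding might look like the following (the three-plane scheme here is our own simplification, not the paper's exact input):

```python
def encode_board(board, to_play):
    """Encode a 19x19 board into three binary feature planes:
    to-play stones, opponent stones, and empty points.
    `board[y][x]` is 'B', 'W', or '.'."""
    other = "W" if to_play == "B" else "B"
    planes = [[[0] * 19 for _ in range(19)] for _ in range(3)]
    for y in range(19):
        for x in range(19):
            s = board[y][x]
            idx = 0 if s == to_play else 1 if s == other else 2
            planes[idx][y][x] = 1
    return planes
```

In practice such planes would be stacked with move-history and liberty planes before being fed to the network's first convolution.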

To cast blunder detection as a classification problem, we define a blunder move as one losing at least 10% win rate or 5 points of score, while normal moves are those consistent with the AI’s recommended moves. We extracted 982,323 blunder moves from PAGE and randomly selected 1,138,135 normal moves to form the blunder prediction dataset, divided into training, validation, and test sets at a ratio of 7:1:2. All deep learning models were trained for ten epochs with a batch size of 128, the Stochastic Gradient Descent (SGD) optimizer, and a learning rate of 0.01. The experiments were conducted on an NVIDIA RTX 3060 GPU with 12 GB of video memory. For each method, we report the accuracy and the per-class precision on the test set.
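The labeling rule and the 7:1:2 split can be sketched as follows (helper names are our own, and treating the thresholds as inclusive is our reading):

```python
import random

def is_blunder(wr_drop, score_loss):
    """PAGE's blunder definition: a move losing at least 10% win rate
    or at least 5 points of score."""
    return wr_drop >= 0.10 or score_loss >= 5.0

def split_dataset(samples, ratios=(0.7, 0.1, 0.2), seed=42):
    """Shuffle and split samples into train/val/test at 7:1:2."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])
```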

IV-B2 Results

Table IV reports the performance of the four methods in predicting blunders. The three metrics are overall accuracy, the precision of normal-move classification, and the precision of blunder-move classification. ConvNet performs worse than the other three methods, probably because its shallower architecture does not capture the board-state information well. ZeroNet performs best in accuracy, while the Perceiver Transformer has the best precision on blunder moves. In this experiment we used only the base features, i.e., the current state of the board; multimodal analysis combining more in-game statistics has the potential to significantly improve blunder prediction.

IV-C Game Outcome Prediction

This section takes a closer look at how historical data can be used to predict the outcomes of future games. Following feature extraction and pre-processing, we apply popular machine learning methods, and finally, we report the prediction system’s performance.

IV-C1 Materials and Methods

First, we performed feature extraction and categorized all features into three groups: metadata features, contextual features, and in-game features. The detailed meanings of these features are given in Table V.

XGBoost [47], Random Forest (RF) [48], LightGBM [49], and CatBoost [50] were selected as training methods for the outcome prediction models. Three rating-based models were used for comparison: ELO, WHR, and TrueSkill [51]. We used the default hyperparameters of the corresponding Python packages without tuning. Although tuning could improve performance further, our approach already clearly outperforms the rating-based models.
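For reference, the accuracy and MSE metrics used in the evaluation can be computed as below; interpreting MSE as the squared error (Brier score) of the predicted win probability is our assumption, as are the function and variable names:

```python
def evaluate(probs, outcomes):
    """Accuracy and MSE for outcome prediction.
    `probs` are predicted probabilities that the first player wins;
    `outcomes` are the actual 1/0 results. MSE is computed as the
    Brier score of the predicted probabilities."""
    n = len(probs)
    acc = sum((p >= 0.5) == bool(y) for p, y in zip(probs, outcomes)) / n
    mse = sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / n
    return acc, mse

acc, mse = evaluate([0.8, 0.3, 0.6, 0.45], [1, 0, 0, 1])
```

Unlike accuracy, the Brier-style MSE also rewards well-calibrated probabilities, which is why both numbers are worth reporting.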

TABLE VI: Experimental results in game outcome prediction. CatBoost achieves the best performance in every column; the last row reports its improvement over WHR, the best rating-based method.

Method      |  Mean           |  CR             |  CHN            |  KOR            |  JPN            |  Others
            |  ACC↑    MSE↓   |  ACC↑    MSE↓   |  ACC↑    MSE↓   |  ACC↑    MSE↓   |  ACC↑    MSE↓   |  ACC↑    MSE↓
ELO         |  0.6515  0.2144 |  0.6466  0.2174 |  0.6176  0.2273 |  0.6339  0.2191 |  0.6637  0.2107 |  0.7156  0.1907
WHR         |  0.6567  0.2125 |  0.6684  0.2090 |  0.6212  0.2295 |  0.6397  0.2193 |  0.6577  0.2108 |  0.7254  0.1813
TrueSkill   |  0.6439  0.2605 |  0.6380  0.2781 |  0.6095  0.1875 |  0.6265  0.2654 |  0.6552  0.2546 |  0.7106  0.2062
XGBoost     |  0.7351  0.1700 |  0.7457  0.1663 |  0.7033  0.1844 |  0.7197  0.1779 |  0.7563  0.1607 |  0.7692  0.1520
RF          |  0.6932  0.2022 |  0.6954  0.1998 |  0.6623  0.2146 |  0.6800  0.2086 |  0.7031  0.1956 |  0.7446  0.1847
LightGBM    |  0.7509  0.1637 |  0.7611  0.1574 |  0.7241  0.1765 |  0.7374  0.1715 |  0.7599  0.1600 |  0.7912  0.1432
CatBoost    |  0.7530  0.1623 |  0.7632  0.1572 |  0.7258  0.1752 |  0.7379  0.1699 |  0.7633  0.1577 |  0.7946  0.1411
vs. WHR     |  +9.6%   -0.050 |  +9.5%   -0.052 |  +10.5%  -0.052 |  +9.8%   -0.049 |  +10.0%  -0.053 |  +6.9%   -0.040
TABLE VII: Ablation results in game outcome prediction. Each row marks the feature groups used (✓).

Metadata  Contextual  In-game  |  ACC↑    MSE↓
   ✓                           |  0.6719  0.1975
              ✓                |  0.7099  0.1827
                         ✓     |  0.6883  0.1891
   ✓          ✓                |  0.7342  0.1706
   ✓          ✓          ✓     |  0.7530  0.1623

IV-C2 Results

Table VI shows the experimental results. Among the rating-based outcome prediction models, WHR performs best, achieving an accuracy of 65.67% and an MSE of 0.2125. WHR slightly outperforms ELO because it exploits the long-term dependence between game results. TrueSkill performs worse, as it is designed primarily for rating players in multiplayer games.

Among the machine learning methods using the extracted features, CatBoost performs best, reaching an accuracy of 75.30% and an MSE of 0.1623, far surpassing WHR. CatBoost improves accuracy by about 10% and MSE by about 0.05 in almost every category. The improvement in the Others category is smaller because that category is relatively easy to predict: even WHR achieves an accuracy of over 72% on it.

We also conducted ablation experiments with the CatBoost method to verify the contribution of each feature group; the results are shown in Table VII. The model using only metadata features achieves an accuracy of 67.19%, slightly higher than WHR, suggesting that metadata attributes beyond the WHR rating itself play an important role in prediction.

Using only contextual features yields an accuracy of 70.99%, and using only in-game features yields 68.83%; both exceed the best rating-based prediction approach. Combining the feature groups further improves the model’s predictive ability. The ablation experiments thus demonstrate that each feature group contributes to the performance of the prediction system.
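For clarity, the two metrics reported in Tables VI and VII can be computed from predicted win probabilities as follows (a sketch; we take MSE to be the squared error between the predicted probability and the 0/1 outcome):

```python
def accuracy(probs, outcomes):
    """Fraction of games where the 0.5-thresholded prediction is correct."""
    correct = sum((p >= 0.5) == (y == 1) for p, y in zip(probs, outcomes))
    return correct / len(outcomes)

def mse(probs, outcomes):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)
```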

V Discussion

Modeling, analyzing, and understanding human behavior and decision-making is nontrivial and challenging. In this paper, we present PAGE, containing fine-grained annotations of elite Go players’ games spanning over seventy years. In the previous section, we applied the dataset to three downstream tasks and achieved satisfactory results. This section discusses further possible research directions and the limitations of our dataset.

V-A Possible research directions

V-A1 Advanced in-game statistics

In traditional sports, many advanced in-game statistics exist, such as expected goals (xG) in soccer and total points added (TPA) in basketball. These statistics enhance the fan experience and enable better analysis of player performance. Although our extracted in-game features already support such analysis, there is room for improvement: prediction ability could be enhanced by more advanced in-game features, for example, time-series analysis of win rate, score, and uncertainty using well-developed techniques.
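As a toy example of such time-series processing, a trailing moving average can smooth the noisy per-move win-rate curve (the window size is an arbitrary choice):

```python
def moving_average(winrates, window=5):
    """Smooth a per-move win-rate series with a trailing moving average;
    early moves use as many preceding values as are available."""
    out = []
    for i in range(len(winrates)):
        chunk = winrates[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```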

V-A2 Behavior and style modeling

In Go matches, fans pay the most attention to the styles and behaviors of different players. Lee Changho, for example, defeated opponents at the last minute through extreme steadiness and control, whereas Gu Li sought to fight his opponents at every opportunity.

Modeling player styles and understanding human decision-making are challenging but fascinating tasks with several potential applications, including targeted training, match preparation, and AI-assisted cheating detection [52, 53]. This problem has been discussed in only a small body of board game literature. Omori [54] proposes that shogi moves can be classified by game style and that AI can be trained to match particular styles. McIlroy-Young [24] presents a personalized model for predicting an individual’s moves and demonstrates how it captures human behavior at the individual level. The lack of suitable analysis techniques and large datasets has so far limited progress in style modeling and recognition; PAGE can help advance such research.

V-A3 Rating system

An outcome prediction model can only estimate the winning probability between two players. A rating system, by contrast, evaluates many players by their relative strengths and is therefore more useful for assessing a player’s ability. Current rating systems for board games rely on only two kinds of information: win-loss results and time. Incorporating richer game features could evaluate a player’s strength more effectively; therefore, combining machine learning approaches with traditional rating systems has great potential.
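To illustrate the win-loss-based systems mentioned above, the classic ELO update consumes only the game result; a minimal sketch (the K-factor of 20 is an arbitrary assumption):

```python
def elo_update(r_a, r_b, score_a, k=20.0):
    """Update both players' ratings from a single result (score_a: 1 win, 0 loss).
    Ratings shift in proportion to the surprise (result minus expectation),
    and the update is zero-sum."""
    e_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - e_a)
    return r_a + delta, r_b - delta
```

A machine-learning hybrid could replace the expected score e_a with a model that also consumes contextual and in-game features, along the lines suggested above.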

V-A4 Live commentary enhancement

Go matches are watched live on television or online platforms by millions of fans every year. However, a classic world Go tournament game can last over five hours, often with only a host analyzing the game, which causes viewers to get bored and stop watching. Since the advent of AlphaZero, live Go broadcasts usually include the AI’s evaluation of the position, which improves the viewer experience. While showing the win rate draws viewers’ attention, showing only the win percentage often leads to attacks on professional players, because even top professionals’ games are rated very negatively by the AI. PAGE can change this situation: in live-streamed Go matches, it enables technologies such as automatic commentary text generation, real-time statistics, and outcome prediction, supporting a more detailed and nuanced analysis and significantly improving the viewer experience.

V-B Limitations

Although PAGE contains a wealth of statistics and metadata, some weaknesses remain. First, due to limited computational resources, we provide in-game statistics based on only 100 simulations per move, which can be unreliable in a few cases, especially in complex positions. Second, analysis based on KataGo alone may not always be reliable. In future work, we will explore deeper searches and cross-validation between different AI engines to improve the robustness of the in-game statistics.

VI Conclusion

We present PAGE, the first professional Go dataset with extensive annotations. Our dataset provides a large number of valuable statistics and metadata that can serve as useful tools for various research topics, especially game analytics and psychological investigations. Our motivation for building this dataset is to facilitate and promote research on a broad range of human-centered computing problems through a large, diverse, and accessible dataset. This work is an extended version of [1], in which we provide a more detailed dataset description and add two applications that benefit from PAGE.

References

  • [1] Y. Gao, “Pgd: A large-scale professional go dataset for data-driven analytics,” in 2022 IEEE Conference on Games (CoG), 2022, pp. 284–291.
  • [2] J. Fairbairn, Invitation to go.   Courier Corporation, 2010.
  • [3] C. Merhej, R. Beal, S. Ramchurn, and T. Matthews, “What happened next? using deep learning to value defensive actions in football event-data,” arXiv preprint arXiv:2106.01786, 2021.
  • [4] R. Baboota and H. Kaur, “Predictive analysis and modelling football results using machine learning approach for english premier league,” International Journal of Forecasting, vol. 35, no. 2, pp. 741–755, 2019.
  • [5] R. Beal, S. E. Middleton, T. J. Norman, and S. D. Ramchurn, “Combining machine learning and human experts to predict match outcomes in football: A baseline model,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 17, 2021, pp. 15 447–15 451.
  • [6] C. Castellar, F. Pradas, L. Carrasco, A. D. La Torre, and J. A. González-Jurado, “Analysis of reaction time and lateral displacements in national level table tennis players: are they predictive of sport performance?” International Journal of Performance Analysis in Sport, vol. 19, no. 4, pp. 467–477, 2019.
  • [7] Z. Yang, Z. Pan, Y. Wang, D. Cai, S. Shi, S.-L. Huang, W. Bi, and X. Liu, “Interpretable real-time win prediction for honor of kings–a popular mobile moba esport,” IEEE Transactions on Games, 2022.
  • [8] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of go without human knowledge,” nature, vol. 550, no. 7676, pp. 354–359, 2017.
  • [9] D. J. Wu, “Accelerating self-play learning in go,” arXiv preprint arXiv:1902.10565, 2019.
  • [10] Y.-H. Lin, “Life cycle patterns of professional performance from go world champions,” Psychiatry and Clinical Neurosciences, vol. 76, no. 4, pp. 128–130, 2022.
  • [11] M. Shin, J. Kim, and M. Kim, “Human learning from artificial intelligence: evidence from human go players’ decisions after alphago,” in Proceedings of the Annual Meeting of the Cognitive Science Society, vol. 43, no. 43, 2021.
  • [12] Y. Tian, J. Ma, Q. Gong, S. Sengupta, Z. Chen, J. Pinkerton, and L. Zitnick, “Elf opengo: An analysis and open reimplementation of alphazero,” in International Conference on Machine Learning.   PMLR, 2019, pp. 6244–6253.
  • [13] G.-C. Pascutto, “Leela zero,” https://github.com/leela-zero/leela-zero, 2017.
  • [14] Y. Gao, L. Wu, and H. Li, “Gomokunet: A novel unet-style network for gomoku zero learning via exploiting positional information and multiscale features,” in 2021 IEEE Conference on Games (CoG).   IEEE, 2021, pp. 1–4.
  • [15] A. Norelli and A. Panconesi, “Olivaw: Mastering othello without human knowledge, nor a penny,” IEEE Transactions on Games, 2022.
  • [16] Y. Gao and L. Wu, “Efficiently mastering the game of nogo with deep reinforcement learning supported by domain knowledge,” Electronics, vol. 10, no. 13, p. 1533, 2021.
  • [17] G. Haworth, K. Regan, and G. D. Fatta, “Performance and prediction: Bayesian modelling of fallible choice in chess,” in Advances in computer games.   Springer, 2009, pp. 99–110.
  • [18] M. Guid and I. Bratko, “Computer analysis of world chess champions,” ICGA journal, vol. 29, no. 2, pp. 65–73, 2006.
  • [19] K. W. Regan and G. M. Haworth, “Intrinsic chess ratings,” in Twenty-fifth aaai conference on artificial intelligence, 2011.
  • [20] D. R. Ferreira, “Determining the strength of chess players based on actual play,” ICGA journal, vol. 35, no. 1, pp. 3–19, 2012.
  • [21] N. Veček, M. Mernik, and M. Črepinšek, “A chess rating system for evolutionary algorithms: A new method for the comparison and ranking of evolutionary algorithms,” Information Sciences, vol. 277, pp. 656–679, 2014.
  • [22] K. W. Regan, T. Biswas, and J. Zhou, “Human and computer preferences at chess,” in Workshops at the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
  • [23] R. McIlroy-Young, S. Sen, J. Kleinberg, and A. Anderson, “Aligning superhuman ai with human behavior: Chess as a model system,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1677–1687.
  • [24] R. McIlroy-Young, R. Wang, S. Sen, J. Kleinberg, and A. Anderson, “Learning personalized models of human behavior in chess,” arXiv preprint arXiv:2008.10086, 2020.
  • [25] R. McIlroy-Young, Y. Wang, S. Sen, J. Kleinberg, and A. Anderson, “Detecting individual decision-making style: Exploring behavioral stylometry in chess,” Advances in Neural Information Processing Systems, vol. 34, pp. 24 482–24 497, 2021.
  • [26] A. Blanch, A. Aluja, and M.-P. Cornadó, “Sex differences in chess performance: Analyzing participation rates, age, and practice in chess tournaments,” Personality and Individual Differences, vol. 86, pp. 117–121, 2015.
  • [27] R. W. Howard, “A complete database of international chess players and chess performance ratings for varied longitudinal studies,” Behavior research methods, vol. 38, no. 4, pp. 698–703, 2006.
  • [28] ——, “Are gender differences in high achievement disappearing? a test in one intellectual domain,” Journal of Biosocial Science, vol. 37, no. 3, pp. 371–380, 2005.
  • [29] M. Bilalić and P. McLeod, “How intellectual is chess?–a reply to howard,” Journal of Biosocial Science, vol. 38, no. 3, pp. 419–421, 2006.
  • [30] ——, “Participation rates and the difference in performance of women and men in chess,” Journal of biosocial science, vol. 39, no. 5, pp. 789–793, 2007.
  • [31] M. Bilalić, K. Smallbone, P. McLeod, and F. Gobet, “Why are (the best) women so good at chess? participation rates and gender differences in intellectual domains,” Proceedings of the Royal Society B: Biological Sciences, vol. 276, no. 1659, pp. 1161–1165, 2009.
  • [32] M. Knapp, “Are participation rates sufficient to explain gender differences in chess performance?” Proceedings of the Royal Society B: Biological Sciences, vol. 277, no. 1692, p. 2269, 2010.
  • [33] T. Stafford, “Female chess players outperform expectations when playing men,” Psychological science, vol. 29, no. 3, pp. 429–436, 2018.
  • [34] D. Smerdon, H. Hu, A. McLennan, W. von Hippel, and S. Albrecht, “Female chess players show typical stereotype-threat effects: Commentary on stafford (2018),” Psychological Science, vol. 31, no. 6, pp. 756–759, 2020.
  • [35] A. Vishkin, “Queen’s gambit declined: The gender-equality paradox in chess participation across 160 countries,” Psychological Science, vol. 33, no. 2, pp. 276–284, 2022.
  • [36] R. C. Fair, “Estimated age effects in athletic events and chess,” Experimental aging research, vol. 33, no. 1, pp. 37–57, 2007.
  • [37] R. W. Roring and N. Charness, “A multilevel model analysis of expertise in chess across the life span.” Psychology and aging, vol. 22, no. 2, p. 291, 2007.
  • [38] N. Vaci, B. Gula, and M. Bilalić, “Is age really cruel to experts? compensatory effects of activity.” Psychology and Aging, vol. 30, no. 4, p. 740, 2015.
  • [39] A. Strittmatter, U. Sunde, and D. Zegners, “Life cycle patterns of cognitive performance over the long run,” Proceedings of the National Academy of Sciences, vol. 117, no. 44, pp. 27 255–27 261, 2020.
  • [40] A. E. Elo, The rating of chessplayers, past and present.   Arco Pub., 1978.
  • [41] R. Coulom, “Whole-history rating: A bayesian rating system for players of time-varying strength,” in International Conference on Computers and Games.   Springer, 2008, pp. 113–124.
  • [42] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [43] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=YicbFdNTTy
  • [44] Y. Dai, Y. Gao, and F. Liu, “Transmed: Transformers advance multi-modal medical image classification,” Diagnostics, vol. 11, no. 8, p. 1384, 2021.
  • [45] K. Shao, D. Zhao, Z. Tang, and Y. Zhu, “Move prediction in gomoku using deep learning,” in 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC).   IEEE, 2016, pp. 292–297.
  • [46] A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira, “Perceiver: General perception with iterative attention,” in International conference on machine learning.   PMLR, 2021, pp. 4651–4664.
  • [47] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 785–794.
  • [48] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
  • [49] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu, “Lightgbm: A highly efficient gradient boosting decision tree,” Advances in neural information processing systems, vol. 30, pp. 3146–3154, 2017.
  • [50] L. O. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin, “Catboost: unbiased boosting with categorical features,” in Advances in neural information processing systems, 2018.
  • [51] R. Herbrich, T. Minka, and T. Graepel, “Trueskill™: A bayesian skill rating system,” in Proceedings of the 19th international conference on neural information processing systems, 2006, pp. 569–576.
  • [52] D. M. D. Iliescu, “the impact of artificial intelligence on the chess world,” JMIR serious games, vol. 8, no. 4, p. e24049, 2020.
  • [53] D. J. Barnes and J. Hernandez-Castro, “On the limits of engine analysis for cheating detection in chess,” Computers & Security, vol. 48, pp. 58–73, 2015.
  • [54] S. Omori and T. Kaneko, “Learning of evaluation functions to realize playing styles in shogi,” in Pacific Rim International Conference on Artificial Intelligence.   Springer, 2016, pp. 367–379.