Analyze Professional Games

Analyze Professional Games

Introduction

There are at least tens of thousands of professional games in digital format, either free or available for minimal cost. In addition, it may be possible to obtain larger numbers of games between strong amateurs. There is a great deal of information contained in these game records, and it may be possible to analyze these game records to produce data which some of the Moyoman modules can use in their play.

Tasks

The following tasks are logical ones for the use of professional game data:

Fuseki

If the first 30 to 40 moves of the professional games were used to compare with the actual opening being played by Moyoman, if the exact game position had occurred before, then all of the follow up moves which had been played could easily be identified, and each of those moves recommended.

Joseki

Joseki sequences can be identified by looking at the play in each corner, assuming that the first move in a 13x13 corner area occurs no further out than the 5-4 point. A joseki dictionary could be built from this, and used to play in situations where the opponent has played one of the josekis.

Shape

The professional games can be analyzed to try to determine good shape by the moves that are made in a local area. This would be the most challenging problem in terms of determining which situations are candidates for divining shape, and which are not.

Expert

One might try to use a neural network or genetic algorithms to generate moves. Training data would be needed for this for entire games. The professional games could be used for that purpose.

Design

A major problem is how to efficiently store and index the information from these games. The SGF (Smart Game Format) is a concise and understandable file format, and approximately 700,000 games could be stored in 1 gigabyte without compression, so that would be an obvious choice. The problem is how to locate a given board position from a large number of files.

Zobrist hashing is used throughout the Moyoman program. A Zobrist hash is a 64 bit number which represents a given board position. The Zobrist hash could be used as an index. This scheme can most easily be explained using an example. Assume that 20,000 professional games are used, with an average of 250 moves per game. That results in a maximum of 5 million board positions. Since the games will be unique after the first 30 or 40 moves, the actual number of board positions will be somewhere between 4 and 5 million. Each of these board positions will have a 64 bit number associated with it. For a given board position during a game by the Moyoman program, 8 different Zobrist hashes can be computed for the board position, to take into account rotation and reflection. Only one of the 8 hashes would be stored, for example, the largest one. If the hash matches any of the 5 million stored Zobrist hashes, then the game so far has been played at least once in one of the professional games. For each of the stored Zobrist hashes, a list of followup moves and Zobrist hashes are stored as well. Thus, the program can quickly determine for any board position if it occurred before in the 20,000 professional games, and what the next move or moves were.

How much space would be required to store this information? Assuming that we want to use an ASCII file to represent this data, a 64 bit number can be represented in 12 bytes using BASE64 encoding. On average, there might be 2 follow up moves, each of which would take up 15 bytes for the hash, a separator character, and the follow up move position. Assume a blank line to separate board position entries. Assuming each part of this entry is on its own line, and there is an average of 45 bytes per entry. Multiply this by 5 million board positions, and we now have a 200Mb file. Clearly, the program would not cache this information, but would do its work off of disk. Repeatedly reading a 200Mb file would be cumbersome, so the files could be stored as 4096 files of about 50kb each. This would be natural because the file names could just be based on the first two characters of the BASE64 encoding. The two characters could not be used directly, since some operating systems do not have case sensitive file names and the slash character might be problematic. The file name could be a number between 0 and 4095. The files would all be sorted to allow for binary search to be used to efficiently find entries in these files.

It would not be unreasonable to use 200Mb of storage for this game information. If the game records were stored as SGF files, then 20,000 games would take up about 30Mb without compression. Downloading this would be the problematic part, especially for people without a high speed Internet connection. The Moyoman program would provide a conversion program to convert SGF files into the data format of 4096 files discussed above. Perhaps the default distribution of Moyoman would come with one hundred professional games, and the user could then get additional games and run the conversion program to produce the 4096 files at their convenience.

The end user would not need to use these professional games, only the module developer who is using information from professional games as part of the move generation strategy for the module. The idea is that the module developer obtains the sgf games on his own, then runs the conversion program which is part of the Moyoman software. Tools for lookup based on Zobrist hash would also be available. The module developer then processes this data, and produces a much smaller amount of data which his module would use and would be part of the standard Moyoman distribution under the data directory.