AlphaGo Zero

Something amazing has happened. A couple of years ago, we closely followed the progress of AlphaGo, the distributed DeepMind algorithm which defeated Lee Sedol in four out of five games of the ancient board game Go. AlphaGo has since been surpassed by a variant, AlphaGo Zero, which is considerably stronger: after 3 days of training, it won 100 out of 100 games against the version of AlphaGo which played against Lee. A larger version of the network, trained for 40 days, won 89 out of 100 games against a stronger instantiation of AlphaGo, namely the one (known as ‘Master’) which defeated the world number one Ke Jie.

However, its superiority over the previous algorithm isn’t the most interesting aspect. More striking is that, unlike AlphaGo, which both trained on human games and made use of hardcoded features (such as ‘liberties’), AlphaGo Zero is remarkably simple:

  • The algorithm has no external inputs, learning only from games against itself;
  • The input to the neural network is just 17 binary 19 × 19 feature planes, namely one constant plane indicating whose turn it is, the indicator functions of the current player’s stones for the last 8 positions, and the indicator functions of the opponent’s stones for the last 8 positions. (Storing the immediate history is a necessity due to ko rules.)
  • Instead of separate policy and value networks, the algorithm uses only one neural network;
  • Monte Carlo rollouts are ditched in favour of a feedback loop where the tree search evolves together with the network.
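The input encoding in the second bullet can be sketched in a few lines of plain Python. This is only an illustration, not DeepMind's code: the function names, the 'B'/'W'/'.' board representation and the ordering of the planes (eight planes for the player to move, then eight for the opponent, then the turn plane) are assumptions made for the example, and positions missing from a short history are assumed to be filled with zeros.

```python
BOARD_SIZE = 19   # standard Go board
HISTORY = 8       # number of past positions fed to the network


def stone_plane(position, colour):
    """Indicator plane: 1 where `position` holds a stone of `colour`, else 0."""
    return [[1 if cell == colour else 0 for cell in row] for row in position]


def constant_plane(value):
    """A plane filled with a single value."""
    return [[value] * BOARD_SIZE for _ in range(BOARD_SIZE)]


def encode(history, to_play):
    """Build the 17 input planes (assumed layout, see the text above).

    history: list of positions, oldest first; each position is a 19x19
             grid of 'B', 'W' or '.'.
    to_play: 'B' or 'W', the colour about to move.
    """
    opponent = 'W' if to_play == 'B' else 'B'
    recent = history[-HISTORY:][::-1]          # most recent position first
    planes = []
    for colour in (to_play, opponent):
        for t in range(HISTORY):
            if t < len(recent):
                planes.append(stone_plane(recent[t], colour))
            else:
                planes.append(constant_plane(0))   # before the game started
    # Final plane: constant 1 if black is to play, 0 if white is to play.
    planes.append(constant_plane(1 if to_play == 'B' else 0))
    return planes
```

Calling encode(history, 'W') returns a list of 17 grids, each 19 × 19, which is the whole of what the network gets to see about the game.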

Read the Nature paper for more details. AlphaGo Zero plays on a single machine with just four tensor processing units (TPUs), which are fast hardware implementations of fixed-point limited-precision linear algebra. A TPU is much more efficient (but less numerically precise) than a GPU, which is in turn much more efficient (but less flexible) than a CPU.
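To give a feel for what ‘fixed-point limited-precision linear algebra’ trades away, here is a toy dot product computed with 8-bit integers. It is a sketch under assumptions of my own (symmetric scaling, a shared scale per vector, hypothetical function names) and says nothing about how a TPU is actually built.

```python
def quantise(values, bits=8):
    """Map floats to signed integers sharing one scale factor."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak else 1.0       # avoid dividing by zero
    return [round(v / scale) for v in values], scale


def fixed_point_dot(a, b, bits=8):
    """Dot product done in integer arithmetic, rescaled to a float at the end."""
    qa, scale_a = quantise(a, bits)
    qb, scale_b = quantise(b, bits)
    acc = sum(x * y for x, y in zip(qa, qb))   # exact integer accumulation
    return acc * scale_a * scale_b


a = [0.5, -1.25, 2.0]
b = [1.5, 0.75, -0.5]
exact = sum(x * y for x, y in zip(a, b))
approx = fixed_point_dot(a, b)
print(exact, approx)   # close, but not identical
```

The integer multiply-accumulate in the middle is the cheap part that dedicated hardware can do in bulk; the small discrepancy between the two printed values is the loss of numerical precision.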


2 Responses to AlphaGo Zero

  1. googleusercontent says:

    AlphaGo Zero did not defeat the “current AlphaGo” 100-0 after 40 days of training. After 3 days of training, the 20-block version of Zero defeated the version which played Lee 100-0. After 40 days of training, the 40-block version of Zero defeated the version which played Ke 89-11.

    Below I quote a passage from https://deepmind.com/blog/alphago-zero-learning-scratch/ ; details can be found in the Nature paper.
    “After just three days of self-play training, AlphaGo Zero emphatically defeated the previously published version of AlphaGo – which had itself defeated 18-time world champion Lee Sedol – by 100 games to 0. After 40 days of self training, AlphaGo Zero became even stronger, outperforming the version of AlphaGo known as “Master”, which has defeated the world’s best players and world number one Ke Jie.”
