Learning-Deep-Learning

AlphaGo: Mastering the game of Go with deep neural networks and tree search

June 2024

tl;dr: MCTS guided by neural nets beats humans in the game of Go.

Overall impression

Value iteration and policy iteration are systematic, iterative methods that solve MDP problems. Yet even with the more efficient policy iteration, one still has to perform a time-consuming sweep that updates the value of EVERY state. A standard 19x19 Go board has roughly 2e170 possible states, a state space far too vast for vanilla value iteration or policy iteration to be tractable.
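To see why, here is a minimal value-iteration sketch on a tiny, made-up tabular MDP (the 4-state/2-action dynamics below are purely illustrative, not from the paper). The Bellman backup inside the loop touches every state on every sweep, which is exactly the step that blows up at ~2e170 Go positions.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-6):
    """P[s, a, s'] = transition probability, R[s, a] = expected reward."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # One Bellman backup: sweeps over EVERY state and action.
        Q = R + gamma * np.einsum("sat,t->sa", P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # optimal values and greedy policy
        V = V_new

# Toy usage with random dynamics (illustrative only).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(4), size=(4, 2))  # 4 states, 2 actions
R = rng.normal(size=(4, 2))
V, pi = value_iteration(P, R)
```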

AlphaGo and its successors use a Monte Carlo tree search (MCTS) algorithm to find their moves, guided by a value network and a policy network trained on human and computer play. Both networks take in the current state of the board: the value network produces a scalar state value V(s) of the board position, while the policy network produces a probability distribution over all legal moves given the current position, which MCTS uses to maintain action values Q(s, a) on its tree edges. The neural networks reduce the effective depth and breadth of the search tree: positions are evaluated with the value network (truncating search depth), and actions are sampled with the policy network (narrowing search breadth).
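A minimal sketch of how the policy prior and running action value steer selection inside the tree, following the paper's PUCT-style rule a = argmax_a [Q(s, a) + u(s, a)] with u(s, a) ∝ P(s, a) / (1 + N(s, a)); the Edge class and c_puct default are simplified stand-ins, not the paper's exact implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    prior: float              # P(s, a) from the policy network
    visit_count: int = 0      # N(s, a)
    total_value: float = 0.0  # W(s, a), accumulated leaf evaluations

    @property
    def q(self) -> float:
        # Q(s, a) = W(s, a) / N(s, a)
        return self.total_value / self.visit_count if self.visit_count else 0.0

def select_move(edges: dict, c_puct: float = 5.0):
    """Pick the move maximizing Q(s, a) + u(s, a)."""
    total_visits = sum(e.visit_count for e in edges.values())
    def score(edge: Edge) -> float:
        # Exploration bonus decays as an edge gets visited more often.
        u = c_puct * edge.prior * math.sqrt(total_visits) / (1 + edge.visit_count)
        return edge.q + u
    return max(edges, key=lambda a: score(edges[a]))
```

Early in the search the policy prior dominates (few visits), and as visit counts grow the empirical Q values take over, which is how the networks shrink the effective breadth of the tree.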

Every leaf node (an unexplored board position) in the MCTS is evaluated in two very different ways: first, by the value network; and second, by the outcome of a random rollout played out to the end of the game using the fast rollout policy. Note that a single evaluation of the value network approaches the accuracy of Monte Carlo rollouts using the RL policy network, but with 15,000 times less computation. This is very similar to a fast-slow system design, intuition vs. reasoning, System 1 vs. System 2 as popularized by Nobel laureate Daniel Kahneman (we can see a similar design in more recent work such as DriveVLM).
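A minimal sketch of this mixed leaf evaluation, assuming hypothetical value_net(state) and fast_rollout(state) callables; the paper combines the two with a mixing parameter lambda, and reports lambda = 0.5 working best.

```python
def evaluate_leaf(state, value_net, fast_rollout, lam: float = 0.5) -> float:
    """V(s_L) = (1 - lambda) * v_theta(s_L) + lambda * z_L."""
    v = value_net(state)     # single forward pass of the value network
    z = fast_rollout(state)  # outcome (+1/-1) of a fast rollout to game end
    return (1.0 - lam) * v + lam * z
```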

Key ideas

Technical details

Notes