Bill Spight wrote:
I may be wrong, but my impression is that neural networks generalize from what they are trained on, and so they can produce some new things from time to time.
Here's my understanding of how AlphaGo works, described in layman's terms - at least the Fan Hui version, on which the Nature paper was based:
Step 1.) Train a policy network to construct a function that predicts the moves a strong player would make. Initially, this is done with supervised learning - give it a bunch of high dan player games, and train the neural network to adjust its weights so that you end up with a (non-linear) function that predicts the next move when shown a new high dan player position.
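Step 1 can be sketched as a tiny supervised training loop. This is only an illustration with made-up dimensions: the real policy network was a deep convolutional net over 19x19 board feature planes, while here a linear softmax model over a handful of candidate moves stands in for it, trained on synthetic (position, expert move) pairs.

```python
import math
import random

random.seed(0)

# Toy stand-in for the policy network: a linear softmax model that maps
# a small feature vector to a probability distribution over N_MOVES moves.
N_FEATURES, N_MOVES = 4, 3
weights = [[0.0] * N_FEATURES for _ in range(N_MOVES)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(position):
    # Probability distribution over candidate moves for this position
    logits = [sum(w * x for w, x in zip(row, position)) for row in weights]
    return softmax(logits)

def train_step(position, expert_move, lr=0.1):
    # Cross-entropy gradient step: shift probability mass toward the
    # move the strong player actually chose
    probs = predict(position)
    for m in range(N_MOVES):
        grad = probs[m] - (1.0 if m == expert_move else 0.0)
        for f in range(N_FEATURES):
            weights[m][f] -= lr * grad * position[f]

# Synthetic "pro games": feature m lighting up means the pro played move m
dataset = [([1.0 if f == m else 0.0 for f in range(N_FEATURES)], m)
           for m in range(N_MOVES) for _ in range(20)]
for _ in range(50):
    random.shuffle(dataset)
    for pos, move in dataset:
        train_step(pos, move)

probs = predict([1.0, 0.0, 0.0, 0.0])  # should now favor move 0
```

The point is only the shape of the loop: predict, compare against the expert's move, nudge the weights, repeat over many games.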
Step 2.) Improve the policy network through reinforcement learning. To do this, have the latest version of the policy network (call it A) play against an older version (B) and see who wins. Then update A's weights, reinforcing the moves it played in games it won and discouraging the moves it played in games it lost.
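The update in Step 2 is a policy-gradient (REINFORCE-style) rule: scale the log-probability gradient of each move played by the game outcome z (+1 for a win, -1 for a loss). Here is a heavily simplified sketch with an invented one-state "game" where move 0 always wins and move 1 always loses; the real version applies the same rule across every move of full self-play games.

```python
import math
import random

random.seed(1)

# Two-move "policy": just a pair of logits, no board features
logits = [0.0, 0.0]

def softmax(ls):
    m = max(ls)
    exps = [math.exp(x - m) for x in ls]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs):
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i
    return len(probs) - 1

for _ in range(500):
    probs = softmax(logits)
    move = sample(probs)
    z = 1.0 if move == 0 else -1.0  # outcome of this toy "game"
    # REINFORCE update: d(log pi(move))/d(logit m) = 1[m == move] - probs[m],
    # scaled by the outcome z, so winning moves become more likely
    for m in range(2):
        logits[m] += 0.1 * z * ((1.0 if m == move else 0.0) - probs[m])

final_probs = softmax(logits)  # the winning move should now dominate
```

Positive outcomes pull probability toward the moves that were played; negative outcomes push it away, which is the "positive value for a win, negative value for a loss" idea in update form.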
Step 3.) Train a value network, not with sample data as in Step 1, but from games played as in Step 2: at a random board position, predict who will win the game. Use the policy network from Step 2 to play out the rest of the game and see who actually won. Then adjust the weights to pull the prediction toward the actual outcome - effectively rewarding correct predictions and penalizing wrong ones.
Step 4.) Combine the policy network, value network, and Monte Carlo Tree Search: a tree is constructed, starting at the root board state. At each node in the tree, the policy network gives a prior probability that each candidate move will be good (e.g. 62% chance I should play move X). Then you can traverse the tree to search for the best outcome. The outcome at a leaf is a linear combination of the value network's evaluation PLUS the result of a Monte Carlo rollout from that point in the tree. (The Nature paper weighted the two equally, with a mixing parameter of 0.5.)
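The two formulas doing the work in Step 4 can be written down directly, following the Nature paper: leaf evaluation mixes the value network with the rollout result, and tree traversal picks the move maximizing a mean value Q plus an exploration bonus driven by the policy prior P and visit counts N. The numbers below are made up for illustration.

```python
import math

def leaf_value(v_network, z_rollout, lam=0.5):
    # V(leaf) = (1 - lam) * value_net + lam * rollout_result;
    # the paper used lam = 0.5, i.e. an equal mix
    return (1.0 - lam) * v_network + lam * z_rollout

def puct_score(Q, P, N_parent, N_child, c_puct=1.0):
    # Selection during traversal: argmax over moves of Q + bonus, where the
    # bonus is proportional to the policy prior P and shrinks as the move's
    # visit count N_child grows, pushing the search toward fresh moves
    return Q + c_puct * P * math.sqrt(N_parent) / (1.0 + N_child)

# Example: the value net likes the position (+0.6) but the rollout lost (-1.0)
mixed = leaf_value(0.6, -1.0)  # equal mix of +0.6 and -1.0 gives -0.2

# Example: among two unvisited moves with equal Q, the higher-prior move
# (P = 0.6 vs P = 0.1) gets the higher selection score
hi_prior = puct_score(0.0, 0.6, N_parent=10, N_child=0)
lo_prior = puct_score(0.0, 0.1, N_parent=10, N_child=0)
```

This is how the policy network narrows the search (through P) while the value network and rollouts jointly evaluate where it leads.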
Step 5.) Profit (beat Lee Sedol, Ke Jie, earn millions, and start the robot revolution).
So anyway, this allows for generalization to occur, as Bill suggests. Fundamentally, the program still does a search. But the breadth of actions from a given state that are likely to lead to a good result is greatly reduced by the policy network (which has been trained first on human game data, and then refined by playing against itself). And leaf evaluation combines the trained value network with Monte Carlo rollouts from that position. The neural networks themselves are basically non-linear functions with weights that have been adjusted through training. Given a totally new situation and board position, it can be fed to such a function to produce a result.
This is basically my understanding of how things work. Please feel free to correct any misunderstandings that I have, because I'm interested in learning more about it, too.