possible to improve AlphaGo in endgame

For discussing go computing, software announcements, etc.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: possible to improve AlphaGo in endgame

Post by Bill Spight »

One other thing. Since my preferred approach is to minimize the maximum error of my evaluation, that seems easier to me if the evaluation is in terms of points, not probability of winning. It is not so obvious to me how to set bounds on the probability, aside from 0 and 1. ;)
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Mike Novack
Lives in sente
Posts: 1045
Joined: Mon Aug 09, 2010 9:36 am
GD Posts: 0
Been thanked: 182 times

Re: possible to improve AlphaGo in endgame

Post by Mike Novack »

That's the point. It doesn't matter if the correct way to play is "always make the move that maximizes the probability of winning the game" because we humans have no way to calculate that.

Look, MOST of the time, the move with the highest probability of winning will turn out to be the same as the move which maintains (or increases) the point advantage. THAT we humans can calculate, so it's a good approximation for us, the best we can do.
Bill Spight
Honinbo

Re: possible to improve AlphaGo in endgame

Post by Bill Spight »

Mike Novack wrote:That's the point. It doesn't matter if the correct way to play is "always make the move that maximizes the probability of winning the game" because we humans have no way to calculate that.

Look, MOST of the time, the move with the highest probability of winning will turn out to be the same as the move which maintains (or increases) the point advantage. THAT we humans can calculate, so it's a good approximation for us, the best we can do.


Well, we can do a bit better, and pros sometimes do. Amateurs sometimes do, as well. :)
Bill Spight
Honinbo

Re: possible to improve AlphaGo in endgame

Post by Bill Spight »

Over 20 years ago, Professor Berlekamp came up with a strategy called sentestrat. Basically what it says is that if the opponent makes a play that raises the global temperature, answer it.

On its face, sentestrat goes against my inclinations as a go player. Why, not always allowing the opponent to dictate my moves is part of my fighting spirit. I even took to calling the strategy gotestrat. ;) However, it has the advantage, appealing to Berlekamp as a mathematician, that it limits the opponent's possible gain to the current global temperature, unless there is a ko or mistake. It is a risk averse strategy, worth considering if you are far enough ahead. :)
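Sentestrat's rule lends itself to a tiny sketch. Assuming, purely for illustration, that each area of the board can be summarized by a single local temperature and that the global temperature is the hottest of these (the region names and numbers below are made up, not from Berlekamp's analysis):

```python
# A minimal sketch of Berlekamp's "sentestrat" policy under simplified
# assumptions: each region is summarized by one temperature (the points
# at stake in the next local move), and the global temperature is the
# maximum over regions. All names here are illustrative.

def global_temperature(regions):
    """Global temperature = hottest remaining local temperature."""
    return max(regions.values(), default=0.0)

def sentestrat_choice(regions, temp_before, opponent_region):
    """Pick a region to play in after the opponent played in opponent_region.

    If the opponent's move raised its region's temperature above the
    previous global temperature, answer it locally; otherwise take the
    hottest region on the board.
    """
    if regions.get(opponent_region, 0.0) > temp_before:
        return opponent_region  # answer the temperature-raising play
    return max(regions, key=regions.get)  # else play the hottest region

# Example: ambient temperature was 3; the opponent plays a threat that
# leaves a 5-point follow-up in region "upper-left".
regions = {"upper-left": 5.0, "lower-right": 3.0, "center": 1.0}
reply = sentestrat_choice(regions, temp_before=3.0, opponent_region="upper-left")
```

This toy ignores thermography proper; it only captures the "answer any play that raises the temperature" rule, which is what bounds the opponent's gain by the current global temperature.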
mitsun
Lives in gote
Posts: 553
Joined: Fri Apr 23, 2010 10:10 pm
Rank: AGA 5 dan
GD Posts: 0
Has thanked: 61 times
Been thanked: 250 times

Re: possible to improve AlphaGo in endgame

Post by mitsun »

Bill Spight wrote: Could an AlphaGo type program be trained for general endgame play? I think so, but nobody has done so yet. Such a program could be a good endgame tutor for humans. It might even be a good add-on to AlphaGo for the endgame stage. :)
By "general endgame play" you mean finding local moves which make sense to a human (large points), rather than moves which are optimal in the global sense (winning the game)?

It might be difficult to train AlphaGo to do that, since the goal is no longer clearcut. When evaluating two sequences, how do you give greater weight to the sequence which makes sense to a human, over the sequence which actually wins the game?

I suppose a step in this direction would be to weight the result by the score, so that a win by 5 points gets greater weight than a win by 1 point. Training that way might skew play in the direction you want. As long as all positive scores get much greater weight than all negative scores, performance (win rate) should not suffer too much. (I think some people have been arguing that the win rate might actually improve, but I remain skeptical.)
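mitsun's weighting scheme can be made concrete. A hedged sketch, with an invented `training_target` function: squash the score margin so that every win still outweighs every loss, while a win by 5 points gets a slightly larger target than a win by 1.

```python
import math

# Hypothetical value-net training target (not from any published system):
# the sign of the result carries most of the weight, and a small bounded
# tanh bonus for the margin keeps every win strictly above every loss.
def training_target(score_margin, margin_weight=0.1):
    """Map a final score margin to a target in roughly [-1.1, 1.1]."""
    base = 1.0 if score_margin > 0 else -1.0
    return base + margin_weight * math.tanh(score_margin / 10.0)
```

Because the margin bonus is bounded by `margin_weight`, all positive scores still get much greater weight than all negative scores, which is the property mitsun says is needed to keep the win rate from suffering too much.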
uPWarrior
Lives with ko
Posts: 199
Joined: Mon Jan 17, 2011 1:59 pm
Rank: KGS 3 kyu
GD Posts: 0
Has thanked: 6 times
Been thanked: 55 times

Re: possible to improve AlphaGo in endgame

Post by uPWarrior »

mitsun wrote:
Bill Spight wrote: Could an AlphaGo type program be trained for general endgame play? I think so, but nobody has done so yet. Such a program could be a good endgame tutor for humans. It might even be a good add-on to AlphaGo for the endgame stage. :)
By "general endgame play" you mean finding local moves which make sense to a human (large points), rather than moves which are optimal in the global sense (winning the game)?

It might be difficult to train AlphaGo to do that, since the goal is no longer clearcut. When evaluating two sequences, how do you give greater weight to the sequence which makes sense to a human, over the sequence which actually wins the game?

I suppose a step in this direction would be to weight the result by the score, so that a win by 5 points gets greater weight than a win by 1 point. Training that way might skew play in the direction you want. As long as all positive scores get much greater weight than all negative scores, performance (win rate) should not suffer too much. (I think some people have been arguing that the win rate might actually improve, but I remain skeptical.)


I think it might be possible to do it if you somehow merge both neural networks (% of winning and points difference).

It is known that the first produces stronger AIs, since it maximizes the correct objective after all. However, it should be possible to create more human-like endgame play: whenever the two evaluators disagree on the best move, maximize the winning margin if the difference in winning probability is smaller than a given epsilon (e.g., if the points-maximizing move is only 0.01% more likely to lose according to the original network). You might even be able to derive a theoretically sound epsilon from the uncertainty of the process, and you could argue that the extra points margin is a buffer against possible blind spots, worth more than the sliver of winning probability you gave up.
Bill Spight
Honinbo

Re: possible to improve AlphaGo in endgame

Post by Bill Spight »

mitsun wrote:
Bill Spight wrote: Could an AlphaGo type program be trained for general endgame play? I think so, but nobody has done so yet. Such a program could be a good endgame tutor for humans. It might even be a good add-on to AlphaGo for the endgame stage. :)
By "general endgame play" you mean finding local moves which make sense to a human (large points), rather than moves which are optimal in the global sense (winning the game)?


No. I mean finding heuristics that apply to endgames in general.

People seem to be assuming that the apparently silly or senseless endgame plays of MCTS programs and AlphaGo are actually optimal. If they maximize the proportion of games won in random playouts, they may well not be optimal in fact. If they maximize some probability measure that no one knows the meaning of, because it depends upon the neural network, they may still not be optimal.

Consider W264 by AlphaGo in game 5 vs. Lee Sedol. I called it a misstep instead of a mistake, because AlphaGo played a 1 pt. reverse sente instead of a 1.25 pt. reverse sente, but it didn't lose the game, and may not even have lost a point. (Not that W264 was one of those "silly" plays, just a misstep.) But that was that game. In another game where the rest of the board was different, it could have been the game-losing move. It may well be that the 1.25 pt. reverse sente dominates AlphaGo's move, never being worse, regardless of the rest of the board.

Since move 264 is so late in the game, I expect that any strong program would choose the larger play if it made a difference between winning and losing, not from any probability heuristic, but from reading the game out. But suppose that we did not rely upon reading the game out, but compared an evaluation program based upon probability vs. one based upon territory/area. Who knows what the result would be?
djhbrown
Lives in gote
Posts: 392
Joined: Tue Sep 15, 2015 5:00 pm
Rank: NR
GD Posts: 0
Has thanked: 23 times
Been thanked: 43 times

Re: possible to improve AlphaGo in endgame

Post by djhbrown »

The closer the game gets to the end, the stronger her moves become! She judges win probabilities by reading, not instead of reading.
Last edited by djhbrown on Tue May 02, 2017 12:14 am, edited 1 time in total.
RobertJasiek
Judan
Posts: 6273
Joined: Tue Apr 27, 2010 8:54 pm
GD Posts: 0
Been thanked: 797 times

Re: possible to improve AlphaGo in endgame

Post by RobertJasiek »

AlphaGo does not read but simulates reading. Since its simulation has holes, it is better to play perfect endgame whenever possible, so as to have a better position when (infrequently) making a mistake due to hitting a hole, such as move 79 in game 4. If AlphaGo is to be a model of very good play for us, correct endgame is even more important.
Bill Spight
Honinbo

Re: possible to improve AlphaGo in endgame

Post by Bill Spight »

Mike Novack wrote:That's the point. It doesn't matter if the correct way to play is "always make the move that maximizes the probability of winning the game" because we humans have no way to calculate that.
Sorry if I am beating a dead horse, but computer programs cannot do that either. They only pretend to do so. ;)

By which I mean, they calculate a frequency of winning, using quasi-random methods. That produces a probability, but one that is not objective. It is contingent upon a number of assumptions, which are not revealed.
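Bill's distinction between a probability and a frequency is easy to see in miniature: a playout-based "win probability" is just the observed win frequency under whichever playout policy happens to be used, and two different policies yield two different numbers for the same position. The policies below are stand-ins, not real Go play.

```python
import random

# Toy illustration: the "probability of winning" reported for a position
# is the win frequency over n playouts under some (weak, biased) policy.
def playout_win_frequency(position, playout_policy, n=10000, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    wins = sum(playout_policy(position, rng) for _ in range(n))
    return wins / n

# Two stand-in playout policies, each winning at a different hidden rate,
# give two different "probabilities" for the very same position.
policy_a = lambda pos, rng: rng.random() < 0.55
policy_b = lambda pos, rng: rng.random() < 0.70
p1 = playout_win_frequency(None, policy_a)
p2 = playout_win_frequency(None, policy_b)
```

The number that comes out is contingent on the playout policy (and, in a real engine, on the network's training), which is Bill's point: it is a frequency under hidden assumptions, not an objective probability.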
lightvector
Lives in sente
Posts: 759
Joined: Sat Jun 19, 2010 10:11 pm
Rank: maybe 2d
GD Posts: 0
Has thanked: 114 times
Been thanked: 916 times

Re: possible to improve AlphaGo in endgame

Post by lightvector »

I'm confused - is everyone just talking past one another? I think most of what people have said on both sides is right, if not literally then at least when interpreted charitably or in good faith.

* Bill is obviously right that strong Go programs are not computing or maximizing the "probability of winning the game". For MCTS playouts, the closest human thing to compare might be something like "probability of winning if both players play weird drunk mid-kyu-level blitz", but better at good shape and worse at tactics. For the value net, it might be "probability of winning if both players play weird drunk mid-dan-level blitz", but better at both good shape and counting and worse at tactics. Neither comparison is perfect, but each is close enough to be useful. We can also neglect sampling error - with millions of playouts, "drunk blitzness" bias by far dominates any sampling error.

* Bill is again right that computer programs are almost certainly making mistakes even from a practical-probability-of-winning perspective, because they are at least some of the time giving up points for no gain where they have not read out the rest of the game to prove they won't need those points. That's how Zen and other bots lost some of their games in the past when they were around 5d to 7d - a blind spot causing overoptimism about some part of the board, which wouldn't matter since the bot was winning anyways. But then they would needlessly give up enough points on the rest of the board until the blind spot actually swung the game against them.

And from the other side...

* Multiple people are obviously right that playing the expected-point-difference-maximizing move is not the thing that maximizes real chances of winning the game. And experience on the part of computer Go programmers has taught that, given the choice between maximizing playout win/loss probabilities vs pretty much any so-far-conceived notion of "expected point difference", it's better by far to maximize the win/loss probabilities.

* Moreover, in practice the giving-up of points doesn't lose the game on its own, because it's always coupled with a misevaluation by the bot that it doesn't need those points, else it would have tried to keep them. In each case fixing the misevaluation will prevent those lost games just as surely, and in practice doing so is far easier. As long as that remains true, programmers trying to improve their bot's strength in theory need *never* work on the silly endgame moves, since pushing the misevaluations closer to zero in frequency and severity, and improving the bot's overall judgment about whether it might need the points, will also push game-losses-due-to-silly-endgame arbitrarily close to zero. Since right now fixing silly endgame is very much not the best way to improve the bot's strength per unit of effort spent, programmers have quite sensibly not spent much effort on it.
Uberdude
Judan
Posts: 6727
Joined: Thu Nov 24, 2011 11:35 am
Rank: UK 4 dan
GD Posts: 0
KGS: Uberdude 4d
OGS: Uberdude 7d
Location: Cambridge, UK
Has thanked: 436 times
Been thanked: 3718 times

Re: possible to improve AlphaGo in endgame

Post by Uberdude »

I mostly agree with lightvector, and it's nice to see someone trying to find common ground instead of saying the same things over and over, but a few quibbles:

1st point: I think you mixed up the policy and value networks. The policy network is the one which chooses a move given a board position (similar to a human's shape intuition and pattern recognition); the value network is the one which says who is winning in a board position (similar to human whole-board positional judgement/counting). The value network is AlphaGo's innovation (since reproduced, though not as well, by DeepZen, FineArt and others); others made policy networks before, at around mid-dan amateur level (and DeepMind published a paper about theirs before AlphaGo, plus hired some of the authors of previous ones). In the v13 AlphaGo paper there were charts showing how strong AlphaGo was with the various combinations of the 3 modules: policy network, value network, and MCTS, and they will be quite a lot stronger since then. So MCTS plus policy-network playouts could well be quite a lot better than mid-dan blitz now.

Also, in response to Robert's point about AlphaGo not reading (which we've had before): whilst I agree MCTS is not much like human reading, the requirement that reading be perfect in order to count as reading is a strange use of the word; most words do not have an implied "perfect" adjective in front of them (or else I don't read). But with some tree search and a value network and no Monte Carlo rollouts, you could actually have a program that reads quite like a human: exploring a tree and judging who is winning in those positions, without doing loads of semi-random playouts to the end of the game (which could use a policy network or not).
lightvector wrote:That's how Zen and other bots lost some of their games in the past when they were around 5d to 7d - a blind spot causing overoptimism about some part of the board, which wouldn't matter since the bot was winning anyways. But then they would needlessly give up enough points on the rest of the board until the blind spot actually swung the game against them.
Not just when Zen was 5-7d, but now, when it is top pro level. It was winning against Park Junghwan in the World Go Championship (by about komi, according to Kim Jiseok) but gave up some of its lead (probably the Monte Carlo problem of losing points while still winning), and then near the end even more: a combination of misevaluating some dead stones at the top as being in seki (as in lightvector's last point), and problems with the komi and Chinese vs Japanese rules.
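The "tree search plus value network, no rollouts" idea Uberdude describes is essentially depth-limited minimax (negamax) with a learned leaf evaluator in place of playouts. A toy sketch, where the game and the stand-in value function are invented for illustration:

```python
# Negamax with a static "value network" at the leaves, no rollouts.
# The evaluator returns the value from the side-to-move's perspective.
def negamax(position, depth, value_net, legal_moves, play):
    """Explore a tree; judge leaf positions with value_net, like a
    human reading out lines and then counting the resulting positions."""
    moves = legal_moves(position)
    if depth == 0 or not moves:
        return value_net(position)  # the "value network" judges the leaf
    return max(-negamax(play(position, m), depth - 1,
                        value_net, legal_moves, play)
               for m in moves)

# Toy "game": a position is an integer, a move adds 1 or 2, play stops
# at 4 or more; the stand-in value net just likes big numbers.
legal = lambda p: [1, 2] if p < 4 else []
play = lambda p, m: p + m
value = lambda p: float(p)
best = negamax(0, 2, value, legal, play)
```

Swap the toy pieces for a real move generator and a trained value network and this is the human-like reading loop Uberdude sketches: a tree of candidate lines, each terminal position judged positionally rather than rolled out to the end of the game.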
Bill Spight
Honinbo

Re: possible to improve AlphaGo in endgame

Post by Bill Spight »

Thanks, guys. Good points. :)

I talked about beating a dead horse because I think people now pretty much agree that top computer programs can make mistakes by believing that one play is superior to another, assessing the difference in terms of the probability of winning the game, and that humans can catch some of those mistakes by assessing the difference in terms of points. :) But there are still those who claim that humans just don't understand how the top programs think, and that while humans may think they recognize some computer mistakes, the programs are better than they are, so they just don't really know.

I'd like to make two points. First, people do think in terms of the probability of winning. They just don't do it very well. Second, I'd like to make the case that assessing the chances of winning by point evaluation works better and better as the end of the game nears (at least at the strong amateur level and above), and that it is likely, at the moment, that human evaluation is better than computer evaluation at some point during the endgame.
uPWarrior wrote:It is known that {% of winning} produces stronger AIs, as it is maximizing the correct objective after all.
When uPWarrior says that the correct objective is the percentage of winning he is thinking probabilistically. We humans do that all the time. We consider an even game between two players of the same level to be a 50-50 proposition, while an even game between, say, a 1 kyu and a shodan means that the shodan will win about 2/3 of the time. If a move is a small gote, but there are much larger moves on the board and we cannot read the game out, we consider that Black will play the gote half the time and White will play it half the time. Or if there is a small sente for Black, and we cannot read the game out, we consider that Black will get to play it almost 100% of the time. But 30 moves into an even game, if you ask us what the probability is that Black will win, we are hard pressed to make an estimate.

I suspect that strong human players can be trained to make good probability estimates of winning the game. The reason is that gamblers made fair bets even before the invention of probability theory. The training would consist of having players make modest bets on the outcomes of top level games while the games were in progress. Over time, I expect that the players would learn to make fair bets. :)

BTW, it would be an interesting research project to see how well top programs assess the probability of winning the game, using pro game records. Have the programs assess the position after move 100, for instance, and compare the percentage of wins vs. the assessed probabilities. My impression is that the programs are more accurate after 100 moves than after 200 moves. That is, at the endgame stage they underestimate the chances of the winners. I think that by betting on the projected winner at the odds assessed by the program, you would clean up. :cool:
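The calibration study proposed above amounts to bucketing the program's assessed probabilities and comparing each bucket's average prediction with the observed win frequency. A sketch with invented data (real input would be one assessed probability and one outcome per pro game):

```python
# Sketch of a calibration check: bucket assessed win probabilities and
# compare each bucket's mean prediction with the actual win frequency.
def calibration_table(predictions, outcomes, n_buckets=5):
    """predictions: assessed P(Black wins); outcomes: 1 if Black won."""
    buckets = [[] for _ in range(n_buckets)]
    for p, o in zip(predictions, outcomes):
        i = min(int(p * n_buckets), n_buckets - 1)
        buckets[i].append((p, o))
    table = []
    for b in buckets:
        if b:  # skip empty buckets
            avg_p = sum(p for p, _ in b) / len(b)
            freq = sum(o for _, o in b) / len(b)
            table.append((round(avg_p, 3), round(freq, 3), len(b)))
    return table
```

Run over real game records at move 100 and move 200, a well-calibrated program's predicted and observed columns would nearly match; the conjecture above is that they diverge in the endgame, with the eventual winners doing better than the assessed odds.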

More later. Gotta run.