All times are UTC - 8 hours [ DST ]




Post new topic Reply to topic  [ 43 posts ]  Go to page Previous  1, 2, 3  Next
Author Message
 Post subject: Re: possible to improve AlphaGo in endgame
Post #21 Posted: Wed Mar 16, 2016 5:04 pm 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
zorq wrote:
Bill Spight wrote:
In general one maximizes the probability of winning by maximizing the territory difference.

This is clearly false. If one is greedy, one may be punished.


If one is greedy and is punished, one has not maximized the territory difference. :)

_________________
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.

 Post subject: Re: possible to improve AlphaGo in endgame
Post #22 Posted: Wed Mar 16, 2016 5:12 pm 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
Kirby wrote:
Bill Spight wrote:
In general one maximizes the probability of winning by maximizing the territory difference.



In short, the computer miscalculated the situation due to the complexity that was added by a very unusual move.

If I were the computer, and I wanted to increase my chances of winning, I would want to avoid this type of complexity that would result in my misreading of the situation.


Is that what it means in the program to maximize the probability of winning?

Quote:
So maximizing my chances of winning isn't necessarily about always maximizing the difference in score.


(Emphasis mine.)

That's not what I said, is it?


 Post subject: Re: possible to improve AlphaGo in endgame
Post #23 Posted: Thu Mar 17, 2016 10:22 am 
Lives in gote

Posts: 553
Liked others: 61
Was liked: 250
Rank: AGA 5 dan
Bill Spight wrote:
zorq wrote:
Bill Spight wrote:
In general one maximizes the probability of winning by maximizing the territory difference.
This is clearly false. If one is greedy, one may be punished.
If one is greedy and is punished, one has not maximized the territory difference. :)
I presume AlphaGo can calculate, for an endgame position, for every possible move, two quantities: probability that this move will lead to a win, expected margin of win for this move. Are you really stating that the move which maximizes the second of these quantities will necessarily maximize the first? That seems clearly false to me. If no single move maximizes both of these quantities, which move do you think the computer should play?

 Post subject: Re: possible to improve AlphaGo in endgame
Post #24 Posted: Thu Mar 17, 2016 11:10 am 
Honinbo

Posts: 9545
Liked others: 1600
Was liked: 1711
KGS: Kirby
Tygem: 커비라고해
Bill Spight wrote:
Kirby wrote:
Bill Spight wrote:
In general one maximizes the probability of winning by maximizing the territory difference.



In short, the computer miscalculated the situation due to the complexity that was added by a very unusual move.

If I were the computer, and I wanted to increase my chances of winning, I would want to avoid this type of complexity that would result in my misreading of the situation.


Is that what it means in the program to maximize the probability of winning?

Quote:
So maximizing my chances of winning isn't necessarily about always maximizing the difference in score.


(Emphasis mine.)

That's not what I said, is it?



You did not specifically say that maximizing the territorial difference is always the best way to increase the probability of winning, but you suggested that it generally is. There may be cases where maximizing the territorial difference leads to a greater chance of winning. Like you said, it allows leeway for making mistakes.

My argument, rather, is that a computer need not base its strategy on accounting for its own mistakes. Instead, the computer can adopt a strategy that reduces its own uncertainty about how the game will progress. The more certain the computer is of how the rest of the game will proceed, the more easily it can make decisions about what to do later in the game.

This is my view of what the computer is doing. It makes plays that appear to be point-losing at times, but that result in less uncertainty about how the rest of the game will play out. The more certainty the computer has about the rest of the game, the better positioned it is to make decisions that are likely to lead to a win.

Admittedly, this uncertainty-reducing strategy comes at a cost: if the result of making the game less uncertain leads to a losing board position, the computer has failed.

Nonetheless, it appears that AlphaGo prefers this type of strategy. It prefers a state of greater certainty of winning the game, even if it means making point-losing plays.

_________________
be immersed

 Post subject: Re: possible to improve AlphaGo in endgame
Post #25 Posted: Thu Mar 17, 2016 1:02 pm 
Gosei

Posts: 1626
Liked others: 543
Was liked: 450
Rank: senior player
GD Posts: 1000
Does AlphaGo have a strategy when it plays other than a greedy algorithm of choosing the best move each time? For example can it decide to play a moyo game before the game starts?

 Post subject: Re: possible to improve AlphaGo in endgame
Post #26 Posted: Thu Mar 17, 2016 1:29 pm 
Lives in gote

Posts: 448
Liked others: 5
Was liked: 187
Rank: BGA 3 dan
Kirby wrote:
Bill Spight wrote:
In general one maximizes the probability of winning by maximizing the territory difference.


<snip>

Nonetheless, it appears that AlphaGo prefers this type of strategy. It prefers a state of greater certainty of winning the game, even if it means making point-losing plays.


I think seeing the wood for the trees might be a help in this thread.

We know that go in general cannot be solved by "brute force". On the other hand for certain endgame positions it can be, by filtering out candidate plays first, and then looking at all possible orders of play (to first approximation). The trouble being that looking at all orders of play hits a fast-growing function, the factorial. Anyone with a feel for these things knows that 20! is much more serious than 10!, for example.
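Charles's point about the factorial can be made concrete with a throwaway calculation (plain arithmetic, not engine code):

```python
import math

# Orders of play over n candidate endgame points grow as n!.
ten = math.factorial(10)     # 3,628,800 orders
twenty = math.factorial(20)  # 2,432,902,008,176,640,000 orders
# Doubling the candidate count from 10 to 20 multiplies the work
# by roughly 6.7 * 10^11:
print(twenty // ten)         # 670442572800
```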

So, AlphaGo in general seems to have succeeded in dominating the brute force requirement, well enough, by some very sharp filtering and sampling of orders of play. The program can cope, in a classy fashion, with different kinds of middlegame challenges, which is the primary determinant of strength (not being the butter to your opponent's hot knife in fighting).

Come the endgame, as far as we know, it does not change regime. Indeed it would be dangerous to assume that life-and-death issues or ko are off the menu just because plays are supposedly smallish and generally local. Human players who switch off the shields at this point will lose some games memorably.

When it sees the shore, the program is going to swim to it as directly as it can. We could say this is "instinctive", because effectively its brain has been hardwired to do that.

Near the end of the game its sampling of lines will start getting somewhat closer to a complete view of ways to play. It seems quite possible that a constructed position could defeat that sampling: something a chess-player might call "problem-like", with a rather different resonance. In CGT jargon, "hidden secrets" are probably implicit throughout the game. The concept can be illustrated effectively in endgame positions; it doesn't mean that is their natural habitat.

I don't think we know yet whether further training of the type already done will have much impact on the finer endgame points. It may not be so easy to "improve AlphaGo in the endgame" within the DeepMind paradigm.

 Post subject: Re: possible to improve AlphaGo in endgame
Post #27 Posted: Thu Mar 17, 2016 1:50 pm 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
Bill Spight wrote:
zorq wrote:
Bill Spight wrote:
In general one maximizes the probability of winning by maximizing the territory difference.
This is clearly false. If one is greedy, one may be punished.
If one is greedy and is punished, one has not maximized the territory difference. :)


mitsun wrote:
I presume AlphaGo can calculate, for an endgame position, for every possible move, two quantities: probability that this move will lead to a win, expected margin of win for this move.


Do you mean point margin? I think not.

Quote:
Are you really stating that the move which maximizes the second of these quantities will necessarily maximize the first? That seems clearly false to me.


Me, too. :)

Quote:
If no single move maximizes both of these quantities, which move do you think the computer should play?


For random rollouts, I know what they mean by probability of winning. For the evaluation network, I do not. My approach, when playing safe, would be to minimize the maximum error of my estimation of my chances of winning. I purposely avoid using the term, probability, because it is not a probability estimate, as commonly understood, for example, in a Bayesian or frequentist approach.

Generally speaking, which is what I am doing, increasing the territory difference provides a safety buffer against both misreading, now or later, and misestimating the probability of winning.


 Post subject: Re: possible to improve AlphaGo in endgame
Post #28 Posted: Thu Mar 17, 2016 2:05 pm 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
Charles Matthews wrote:
I don't think we know yet whether further training of the type already done will have much impact on the finer endgame points. It may not be so easy to "improve AlphaGo in the endgame" within the DeepMind paradigm.


As someone has already pointed out, AlphaGo focuses on winning the game at hand. To do so it has its own heuristics. These are obviously different from the human developed heuristics of the endgame, such as evaluating the size of plays. We do know that in the vast majority of cases, playing the largest play is best, and we also know how to recognize some situations when that is not the case. One advantage of the human heuristics is that they apply in general, not just to the game at hand. So it is still worthwhile for humans to study them.

Could an AlphaGo type program be trained for general endgame play? I think so, but nobody has done so yet. Such a program could be a good endgame tutor for humans. It might even be a good add-on to AlphaGo for the endgame stage. :)


 Post subject: Re: possible to improve AlphaGo in endgame
Post #29 Posted: Thu Mar 17, 2016 2:27 pm 
Lives in gote

Posts: 448
Liked others: 5
Was liked: 187
Rank: BGA 3 dan
Bill Spight wrote:
Could an AlphaGo type program be trained for general endgame play? I think so, but nobody has done so yet.


The standard Demis Hassabis lecture/stump speech is that all things become possible, as engineering matters, once "general artificial intelligence" comes onstream. Some crumbs would fall from the corporate table: this application would require a large body of training sequences.

I would actually make a program of this type to play "Archipelago". I'm not sure I have mentioned this go variant, ever.

As a training game for go players, it is conceived of as a multi-board version of go (disjunctive games) where you deal a dozen small graphs off the top of a pack. Then you just play with Tromp-Taylor style rules, with something done about komi per board.

I think the advantage over 19x19 monoboard go is that there would probably be more chance of bootstrapping the training up from simple examples. Clearly CGT principles are there to be learned, via disjunctive games with finite graphs.

Assuming that all makes sense (abandoning the homogeneity of the big board, and its fighting complexity, introducing disjunction consciously, allowing superko to rule some kinds of small-board incident) I think breeding up superhuman understanding of endgame theory becomes a feasible project.

Bill Spight wrote:
Such a program could be a good endgame tutor for humans. It might even be a good add-on to AlphaGo for the endgame stage. :)


Yes, the whole deal with a genuine strong AI go player is that illustrative material can become a commodity.


This post by Charles Matthews was liked by: Bill Spight
 Post subject: Re: possible to improve AlphaGo in endgame
Post #30 Posted: Thu Mar 17, 2016 2:33 pm 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
Let me try to give an example of a possible approach. Not that I know this is a good heuristic without testing it, OC. ;)

Suppose that AlphaGo thinks that it is ahead and wishes to play safe. Then, instead of looking for a play that maximizes its estimated probability of winning, it vicariously switches sides and looks for a play by the opponent that maximizes its opponent's estimated probability of winning. Then it makes that play itself in the search tree and estimates its probability of winning in the resultant position. If it estimates that it is still ahead, that play becomes a good candidate move. This is a heuristic for playing prophylactically, to minimize the chances of the opponent making trouble, not for maximizing its own estimated probability of winning.
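That heuristic can be sketched in code. Everything here is a hypothetical stub: `estimate_win_prob`, `legal_moves`, and `play` stand in for an engine's evaluator, move generator, and move executor, none of which AlphaGo actually exposes, so treat this as a sketch of the idea rather than an implementation.

```python
def prophylactic_move(position, me, opponent, estimate_win_prob, legal_moves, play):
    """Play-safe heuristic: vicariously switch sides, find the opponent's
    best move, and consider taking that point ourselves.

    Returns the candidate move if we still estimate we are ahead after
    taking it, else None (fall back to the normal search).
    """
    # Step 1: find the play the opponent would most like to make.
    best_for_opp = max(
        legal_moves(position),
        key=lambda m: estimate_win_prob(play(position, m, opponent), opponent),
    )
    # Step 2: make that play ourselves in the search tree and re-evaluate.
    after = play(position, best_for_opp, me)
    # Step 3: if we still estimate we are ahead, it is a good candidate.
    if estimate_win_prob(after, me) > 0.5:
        return best_for_opp
    return None
```

Note this is prophylaxis in exactly the sense described: the move is chosen to deny the opponent's best resource, not to maximize our own estimated winning probability.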


 Post subject: Re: possible to improve AlphaGo in endgame
Post #31 Posted: Thu Mar 17, 2016 4:36 pm 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
One other thing. Since my preferred approach is to minimize the maximum error of my evaluation, that seems easier to me if the evaluation is in terms of points, not probability of winning. It is not so obvious to me how to set bounds on the probability, aside from 0 and 1. ;)


 Post subject: Re: possible to improve AlphaGo in endgame
Post #32 Posted: Thu Mar 17, 2016 6:47 pm 
Lives in sente

Posts: 1037
Liked others: 0
Was liked: 180
That's the point. It doesn't matter if the correct way to play is "always make the move that maximizes the probability of winning the game" because we humans have no way to calculate that.

Look, MOST of the time, the move with the highest probability of winning will turn out to be the same as the move which maintains (or increases) the point advantage. THAT we humans can calculate, so it's a good approximation for us, the best we can do.


This post by Mike Novack was liked by: ez4u
 Post subject: Re: possible to improve AlphaGo in endgame
Post #33 Posted: Thu Mar 17, 2016 7:31 pm 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
Mike Novack wrote:
That's the point. It doesn't matter if the correct way to play is "always make the move that maximizes the probability of winning the game" because we humans have no way to calculate that.

Look, MOST of the time, the move with the highest probability of winning will turn out to be the same as the move which maintains (or increases) the point advantage. THAT we humans can calculate, so it's a good approximation for us, the best we can do.


Well, we can do a bit better, and pros sometimes do. Amateurs sometimes do, as well. :)


 Post subject: Re: possible to improve AlphaGo in endgame
Post #34 Posted: Thu Mar 17, 2016 7:38 pm 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
Over 20 years ago, Professor Berlekamp came up with a strategy called sentestrat. Basically what it says is that if the opponent makes a play that raises the global temperature, answer it.

On its face, sentestrat goes against my inclinations as a go player. Why, not always allowing the opponent to dictate my moves is part of my fighting spirit. I even took to calling the strategy gotestrat. ;) However, it has the advantage, appealing to Berlekamp as a mathematician, that it limits the opponent's possible gain to the current global temperature, unless there is a ko or mistake. It is a risk averse strategy, worth considering if you are far enough ahead. :)
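As pseudocode-with-stubs (real CGT temperature is hard to compute, so the temperature estimates here are assumed inputs, not something this snippet knows how to produce):

```python
def sentestrat_reply(temp_before, temp_after, local_answer, own_best_move):
    """Berlekamp's sentestrat, as described above: if the opponent's last
    move raised the global temperature, answer it locally; otherwise play
    our own preferred move. Barring ko or mistakes, this bounds what the
    opponent can gain by the current global temperature.
    """
    if temp_after > temp_before:
        return local_answer      # the opponent turned up the heat: respond
    return own_best_move         # temperature did not rise: play our own game
```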


 Post subject: Re: possible to improve AlphaGo in endgame
Post #35 Posted: Fri Mar 18, 2016 10:52 am 
Lives in gote

Posts: 553
Liked others: 61
Was liked: 250
Rank: AGA 5 dan
Bill Spight wrote:
Could an AlphaGo type program be trained for general endgame play? I think so, but nobody has done so yet. Such a program could be a good endgame tutor for humans. It might even be a good add-on to AlphaGo for the endgame stage. :)
By "general endgame play" you mean finding local moves which make sense to a human (large points), rather than moves which are optimal in the global sense (winning the game)?

It might be difficult to train AlphaGo to do that, since the goal is no longer clearcut. When evaluating two sequences, how do you give greater weight to the sequence which makes sense to a human, over the sequence which actually wins the game?

I suppose a step in this direction would be to weight the result by the score, so that a win by 5 points gets greater weight than a win by 1 point. Training that way might skew play in the direction you want. As long as all positive scores get much greater weight than all negative scores, performance (win rate) should not suffer too much. (I think some people have been arguing that the win rate might actually improve, but I remain skeptical.)
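A hedged sketch of such a weighting (the shape and constants are my own illustrative assumptions, nothing DeepMind published): map the signed margin to a bounded training target so that every win still outranks every loss, but a win by 5 points is worth slightly more than a win by 1.

```python
import math

def weighted_result(margin, bonus_scale=0.1, max_bonus=0.4):
    """Map a signed score margin to a training target in roughly (-1.4, 1.4).

    The sign carries win/loss; a small bounded tanh bonus carries the
    margin, so all positive margins outrank all negative ones.
    A margin of 0 (jigo) counts as a loss here, an arbitrary choice;
    ties are rare with fractional komi anyway.
    """
    base = 1.0 if margin > 0 else -1.0
    bonus = max_bonus * math.tanh(bonus_scale * abs(margin))
    return base + bonus if margin > 0 else base - bonus
```

Because the bonus is capped well below the win/loss gap, the win rate objective still dominates, which is the property needed for performance not to suffer too much.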

 Post subject: Re: possible to improve AlphaGo in endgame
Post #36 Posted: Fri Mar 18, 2016 11:26 am 
Lives with ko

Posts: 199
Liked others: 6
Was liked: 55
Rank: KGS 3 kyu
mitsun wrote:
Bill Spight wrote:
Could an AlphaGo type program be trained for general endgame play? I think so, but nobody has done so yet. Such a program could be a good endgame tutor for humans. It might even be a good add-on to AlphaGo for the endgame stage. :)
By "general endgame play" you mean finding local moves which make sense to a human (large points), rather than moves which are optimal in the global sense (winning the game)?

It might be difficult to train AlphaGo to do that, since the goal is no longer clearcut. When evaluating two sequences, how do you give greater weight to the sequence which makes sense to a human, over the sequence which actually wins the game?

I suppose a step in this direction would be to weight the result by the score, so that a win by 5 points gets greater weight than a win by 1 point. Training that way might skew play in the direction you want. As long as all positive scores get much greater weight than all negative scores, performance (win rate) should not suffer too much. (I think some people have been arguing that the win rate might actually improve, but I remain skeptical.)


I think it might be possible to do it if you somehow merge both neural networks (% of winning and points difference).

It is known that the first produces stronger AIs, as it is maximizing the correct objective after all. However, it should be possible to create more human-like endgame play: whenever the two evaluators differ on the best move, maximize the winning margin if the difference in winning probability is smaller than a given epsilon (e.g., if the points-maximizing move is only 0.01% more likely to lose according to the original network). You might even be able to derive the theoretically correct epsilon from the uncertainty of the process, and you could argue that the extra points margin is a buffer for possible blind spots, worth more than the winning probability you gave up.
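Sketched concretely (the `(move, win_prob, expected_margin)` triples are assumed to come from the two merged evaluators; the names and the epsilon default are illustrative, not from any real engine):

```python
def pick_move(candidates, eps=0.0001):
    """Prefer the point-maximizing move unless it costs more than `eps`
    in estimated winning probability.

    `candidates` is a list of (move, win_prob, expected_margin) triples.
    """
    best_wp = max(wp for _, wp, _ in candidates)
    # Moves essentially tied with the best winning probability...
    near_best = [c for c in candidates if best_wp - c[1] <= eps]
    # ...with the tie broken by expected point margin.
    return max(near_best, key=lambda c: c[2])[0]
```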

 Post subject: Re: possible to improve AlphaGo in endgame
Post #37 Posted: Fri Mar 18, 2016 11:52 am 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
mitsun wrote:
Bill Spight wrote:
Could an AlphaGo type program be trained for general endgame play? I think so, but nobody has done so yet. Such a program could be a good endgame tutor for humans. It might even be a good add-on to AlphaGo for the endgame stage. :)
By "general endgame play" you mean finding local moves which make sense to a human (large points), rather than moves which are optimal in the global sense (winning the game)?


No. I mean finding heuristics that apply to endgames in general.

People seem to be assuming that the apparently silly or senseless endgame plays of MCTS programs and AlphaGo are actually optimal. If they maximize the proportion of games won in random playouts, that means that they may well not be optimal in fact. If they maximize some probability measure that no one knows the meaning of, because it depends upon the neural network, they may still not be optimal.

Consider W264 by AlphaGo in game 5 vs. Lee Sedol. I called it a misstep instead of a mistake, because AlphaGo played a 1 pt. reverse sente instead of a 1.25 pt. reverse sente, but it didn't lose the game, and may not even have lost a point. (Not that W264 was one of those "silly" plays, just a misstep.) But that was that game. In another game where the rest of the board was different, it could have been the game losing move. It may well be that the 1.25 pt. reverse sente dominates AlphaGo's move, never being worse, regardless of the rest of the board.

Since move 264 is so late in the game, I expect that any strong program would choose the larger play if it made a difference between winning and losing, not from any probability heuristic, but from reading the game out. But suppose that we did not rely upon reading the game out, but compared an evaluation program based upon probability vs. one based upon territory/area. Who knows what the result would be?


 Post subject: Re: possible to improve AlphaGo in endgame
Post #38 Posted: Fri Mar 18, 2016 2:30 pm 
Lives in gote
User avatar

Posts: 392
Liked others: 23
Was liked: 43
Rank: NR
The closer the game gets to the end, the stronger her moves become! She judges win probabilities by reading, not instead of reading.


Last edited by djhbrown on Tue May 02, 2017 12:14 am, edited 1 time in total.
 Post subject: Re: possible to improve AlphaGo in endgame
Post #39 Posted: Fri Mar 18, 2016 11:03 pm 
Judan

Posts: 6129
Liked others: 0
Was liked: 786
AlphaGo does not read but simulates reading. Since its simulation has holes, it is better to play perfect endgame whenever possible to have a better position when (infrequently) making a mistake due to hitting a hole, such as move 79 in game 4. If AlphaGo shall be a model for very good play to us, correct endgame is even more important.

 Post subject: Re: possible to improve AlphaGo in endgame
Post #40 Posted: Mon Apr 24, 2017 1:38 pm 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
Mike Novack wrote:
That's the point. It doesn't matter if the correct way to play is "always make the move that maximizes the probability of winning the game" because we humans have no way to calculate that.


Sorry if I am beating a dead horse, but computer programs cannot do that either. They only pretend to do so. ;)

By which I mean, they calculate a frequency of winning, using quasi-random methods. That produces a probability, but one that is not objective. It is contingent upon a number of assumptions, which are not revealed.
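A toy way to see the contingency (the coin-flip "playout policies" are invented for illustration; a real policy would be a fast go-playing network): the same estimator, fed two different but equally arbitrary policies, reports two different "probabilities" for the same position.

```python
import random

def rollout_win_freq(playout_policy, n=10000, seed=0):
    """Estimate a 'win probability' as a win frequency over n
    quasi-random playouts. The number returned is contingent on the
    playout policy, not an objective property of the position."""
    rng = random.Random(seed)
    wins = sum(playout_policy(rng) for _ in range(n))
    return wins / n
```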

