2019 China Securities Cup World AI Open

Uberdude · #1

This AI competition took place this week. The final is today and tomorrow; FineArt has a 2-0 lead over Golaxy. I think LeelaZero lost to Golaxy in the semi-final. There is commentary on the AGA Twitch channel hosted by xhu, at the moment with Ohashi Hirofumi 6p commentating. Schedule at https://www.reddit.com/r/baduk/comments ... 019_china/. Here's the 2nd game with LZ following along; FineArt is Black and won by resignation.

Bill Spight · #2

LZ reckons that White's (Golaxy's) counterhane loses 4% (712 playouts) versus LZ's winrate evaluation at :b71:, while the hanging connection in the center gains 1% (982 playouts) versus the same estimate. Does this 5% difference reflect Golaxy's error as a player, or LZ's error as an analyst at fewer than 1k playouts? (Or both? The two are not mutually exclusive, OC.)

Bill Spight · #3

Maybe the fault, dear Brutus, lies in our bots, not in ourselves. Or rather, in our use of our bots for analysis. See below.

White 308: LZ estimates White winning chances at 75% (405 playouts).

Black 309: makes the obvious (to humans) response. LZ now estimates White's winning chances to be 10½% worse, only 64½%.

Did LZ not see that reply? Surely it did, but misevaluated its significance. Or is misevaluating the current position. Or both.

White 310: Golaxy plays a throw-in atari, which caters to a mistake by Black. According to LZ it loses 21½% (64 playouts), reducing White's chances of winning to 43%. :shock:

Really? (No, not really, with only 64 playouts.)

Black 311: FineArt makes the obvious capture of the throw-in stone, thereby losing 16% (313 playouts), to give it only a 40% chance of winning the game. According to LZ.

White 312: This is the theoretically largest play, gaining (on average) 1¾ pt. by area scoring. The alternative is to fill the ko in the bottom right, gaining on average 1⅔ pt.

Black 313: Nutso, by human standards. There is nothing to lose, and everything to gain, by taking the ko with sente instead of forcing White to fill the ko. Still, only an inaccuracy. According to LZ it gains 8½% (330 playouts) to make Black the slight favorite.

Edit: Perhaps I should not say nutso. It is true that Black loses nothing, either in theory or practice, by taking the ko. However, if Black takes the ko and White connects the dame, White is komaster of the remaining ko, and Black should fill it instead of playing the gote on the left side. Whether Black should answer White's ko threat at D-08 is not exactly obvious. If Black does answer and runs out of ko threats, then Black should play as FineArt did and play the "nutso" move before taking the gote on the left side to prevent White from getting the last dame and winning by ½ pt.

White 314: The obviously (to humans) correct reply, losing 7% (647 playouts), according to LZ.

Black 315: The last play before the dame stage, gaining 1½ pts. of area. According to LZ it gains 8% (3.6k playouts), giving Black a 67% chance of winning. Really? With only dame left (and, as it turns out, 12 moves from the end), Black has only 2:1 odds of winning? This is not an error of fewer than 1k playouts, it is an error with almost 4k playouts. In its favor, FineArt was confident enough of a win to give up the advantage of taking the sente ko two moves before. OC, we do not know its evaluation. Edit: Actually, Black 313 is correct when White is komaster of the ko in the bottom right, to prevent White from getting the last dame at area scoring.

Elsewhere I have pointed out the lack of guidance to humans in using bots trained as players as analysts. I rest my case.

Uberdude · #4

Bill, I wouldn't put much weight on the LZ percentages at such low playouts: I just have it pondering as I input the game. I only glanced at your thread about winrate errors, but as lightvector said: versus what? Versus perfect play, the error is just the winrate itself or 100% minus itself. Versus the bot's best understanding of the position, where we say that is what it says at near-infinite playouts, is a more tractable comparison and can be studied. So we can ask what the error in the winrate is at 100 or 1000 thousand playouts when we use that to estimate what the bot would think at a billion playouts. Maybe a billion takes too long, so we could take 10 million as "a lot", as that's what DeepMind used for the AG teaching tool.

PS: Move 72 involves a ladder, so it needs extra playouts.

Bill Spight · #5

Uberdude wrote:

Bill, I wouldn't put much weight on the LZ percentages at such low playouts: I just have it pondering as I input the game.

I don't put much, if any, weight on them, which was really the point of my first post.

Bill Spight wrote:

Does this 5% difference reflect Golaxy's error as a player, or LZ's error as an analyst at fewer than 1k playouts? (Or both? The two are not mutually exclusive, OC.)

Quote:

I only glanced at your thread about winrate errors, but as lightvector said: versus what? Versus perfect play, the error is just the winrate itself or 100% minus itself. Versus the bot's best understanding of the position, where we say that is what it says at near-infinite playouts, is a more tractable comparison and can be studied. So we can ask what the error in the winrate is at 100 or 1000 thousand playouts when we use that to estimate what the bot would think at a billion playouts. Maybe a billion takes too long, so we could take 10 million as "a lot", as that's what DeepMind used for the AG teaching tool.

Unless we are talking about a limited region of play, or about the late endgame, we don't know what perfect play is, and neither do the bots. That's part of what makes go interesting.

But players don't need accurate evaluations (winrate estimates) to play well, they only need good enough evaluations. Reviewers and analysts, however, need to consider the roads not taken. Those moves need good evaluations. And, as humans, we need to understand the evaluations that we rely upon. I submit that nobody understands them now.

I was not intending to post the second note, but with a close game, a ko with a potential komonster at area scoring, and low playouts, LZ was challenged in its endgame evaluations. Even so, the very strange swings in winrate estimates less than 20 moves from the end of the game underscore my doubts about how well the bots play the endgame. Maybe at that point the good enough evaluations don't have to be very good, I dunno.

One more point. When the players whose game you are reviewing come up with plays that gain more than 1% according to the bot you are using for review, you need more playouts.
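As a rough sketch of that rule of thumb, the check below flags game moves whose gain (according to the reviewing bot) exceeds 1%. The function name and the threshold parameter are made up for illustration; the gains and playout counts are the LZ figures quoted earlier in this thread.

```python
def needs_more_playouts(gain_pct, threshold_pct=1.0):
    """True when the game move gains more than the threshold, by the bot's own estimate.

    A positive gain means the move actually played looks better to the bot
    than the move it expected, which suggests its earlier search was too shallow.
    """
    return gain_pct > threshold_pct


# (move, gain in percentage points according to LZ, playouts), as quoted above
reviewed_moves = [
    ("Black 313", 8.5, 330),
    ("White 314", -7.0, 647),
    ("Black 315", 8.0, 3600),
]

flagged = [move for move, gain, _ in reviewed_moves if needs_more_playouts(gain)]
for move, gain, playouts in reviewed_moves:
    if needs_more_playouts(gain):
        print(f"{move}: gains {gain}% at {playouts} playouts, so rerun with more playouts")
```

The threshold is a judgment call, not something the bot provides.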

Uberdude · #6

FineArt beat Golaxy 4-1 in the final.

Uberdude · #7

Re 72: LZ's view doesn't change much with about 100k playouts. The hanging connection gives White 56.4% (104k); the hane lets Black get 49.1% (91k) with the cut, i.e. White 50.9%, so a 5% difference. Golaxy is generally stronger than LZ (though I don't know how many playouts each got in this competition; the time limits were 60 min + 10x40s byo), though not by so much that LZ couldn't be better than Golaxy in some situations. No idea if this is one of them. It would be interesting to know what FineArt thought.

Attachment: golaxy fineart lz connect.PNG

Attachment: golaxy fineart lz hane.PNG

Bill Spight · #8

Uberdude wrote:

Re 72: LZ's view doesn't change much with about 100k playouts. The hanging connection gives White 56.4% (104k); the hane lets Black get 49.1% (91k) with the cut, i.e. White 50.9%, so a 5% difference.

Con rispetto, signore, the difference I am interested in is the one between the hanging connection with at least 100k playouts and the actual :w72: with at least 100k playouts. In the diagram White has a winrate estimate of 56½% with 104k playouts for the hanging connection, while for :w72: the winrate estimate is 49½% with only 580 playouts. That is not enough playouts for a fair comparison.

Uberdude · #9

The way to see what LZ thinks of the actual 72 hane with 100k playouts is to play it and wait for 100k playouts to happen, as shown in the 2nd picture. To wait for that 580 to turn into 100k on the first position would likely mean the 1st choice move has also gone up by a factor of 200, which is 20 million which would take ages.
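A quick back-of-the-envelope check of that estimate, assuming (as above) that every candidate keeps its share of the playouts as the search grows. The function name is made up for illustration; 580 and 104k are the playout counts from the first picture.

```python
def scaled_playouts(current, target, top_choice):
    """Scale factor needed to lift `current` playouts to `target`,
    and what the top choice would reach if it scaled by the same factor."""
    factor = target / current
    return factor, top_choice * factor


factor, top = scaled_playouts(current=580, target=100_000, top_choice=104_000)
print(f"factor ~ {factor:.0f}x, top choice ~ {top / 1e6:.1f} million playouts")
```

That comes out a little under the rounded "factor of 200, 20 million", but it is the same order of magnitude.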

Bill Spight · #10

Uberdude wrote:

The way to see what LZ thinks of the actual 72 hane with 100k playouts is to play it and wait for 100k playouts to happen, as shown in the 2nd picture. To wait for that 580 to turn into 100k on the first position would likely mean the 1st choice move has also gone up by a factor of 200, which is 20 million which would take ages.

Well, you know LZ, but that has not been my experience playing around with Deep Leela. Plays do not retain their relative playout ratios when you alter the game tree. IIUC, the main differences lie in the networks, not the search strategies.

Edit: In fact, if the ratios remained the same, you would still have a potentially unfair comparison. But if you had, say, 100k playouts for :w72: and 300k playouts for the hanging connection, that's not such an imbalance.

Bill Spight · #11

Uberdude wrote:

The way to see what LZ thinks of the actual 72 hane with 100k playouts is to play it and wait for 100k playouts to happen, as shown in the 2nd picture. To wait for that 580 to turn into 100k on the first position would likely mean the 1st choice move has also gone up by a factor of 200, which is 20 million which would take ages.

Oh, I haven't been taking the second picture and comparing it with the first. What I have been doing with Deep Leela to get a direct comparison is this: after playing :w72: as in the game and generating the second picture, I back up and play :b71: again. That way DL compares the options for :w72: directly, utilizing the altered search tree, which focuses more on the actual move in the game than the original tree does. Generating the second picture alters the winrate estimates and number of playouts for the first picture. At least, that happens with DL. My guess is that it works that way with LZ as well.

yoyoma · #12

Bill Spight wrote:

Uberdude wrote:

The way to see what LZ thinks of the actual 72 hane with 100k playouts is to play it and wait for 100k playouts to happen, as shown in the 2nd picture. To wait for that 580 to turn into 100k on the first position would likely mean the 1st choice move has also gone up by a factor of 200, which is 20 million which would take ages.

Oh, I haven't been taking the second picture and making a direct comparison. What I have been doing with Deep Leela is this. After playing :w72: as in the game and generating the second picture, then backing up and playing :b71: again. That way DL compares the options for :w72: directly, utilizing the altered search tree which focuses more on the actual move in the game than the original tree. Generating the second picture alters the winrate estimate and number of playouts for the first picture. At least, that happens with DL. My guess is that it works that way with LZ as well.

This will sometimes work, but in a chaotic way. The internals go like this: After going to :w72: it will build a tree and analyze positions, and as a side effect, cache those positions. Then when you go back to :b71:, it will reset the tree search part, and start searching again. The search part will go normally, it does not know the result of the deep :w72: search. But all searches starting with :w72: will be in the cache. These will take a shortcut, bypassing the GPU. So even if :w72: normally starts off bad and only becomes good later, this shortcut might allow it to get enough visits to it to see that it's actually a good move.

So you can't really rely on this method, it's more reliable to compare moves by clicking into each one and noting the winrates.
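To make the cache-versus-tree distinction concrete, here is a toy sketch. It is not LZ's code: the class, its methods, and the placeholder evaluation are all invented for illustration. The only point is that the tree is rebuilt on every search while the evaluation cache persists, so lines already explored skip the expensive network call.

```python
class ToyEngine:
    """Toy model: the search tree is rebuilt per search, the position cache is not."""

    def __init__(self):
        self.cache = {}         # position -> evaluation, survives tree resets
        self.network_calls = 0  # stands in for GPU work

    def evaluate(self, position):
        if position not in self.cache:
            self.network_calls += 1
            self.cache[position] = 0.5  # placeholder for a real network evaluation
        return self.cache[position]

    def search(self, root, lines):
        """Start a fresh tree at `root` and walk each continuation once."""
        tree = {root: self.evaluate(root)}  # new tree every time
        for line in lines:
            position = root
            for move in line:
                position = position + (move,)
                tree[position] = self.evaluate(position)
        return tree


engine = ToyEngine()
after_b71 = ("b71",)
after_w72 = ("b71", "w72")

# Deep look at the position after :w72: (every position is new, so all hit the "GPU").
engine.search(after_w72, [("b73", "w74"), ("b75", "w76")])
calls_deep = engine.network_calls

# Back up to :b71: and search again: the :w72: line is served from the cache,
# while the alternative ("connect") still needs fresh network calls.
engine.search(after_b71, [("w72", "b73", "w74"), ("connect", "b73")])
calls_back = engine.network_calls - calls_deep
print(calls_deep, calls_back)
```

In this toy the second search pays only for the root and the "connect" line. How many extra visits the cached line actually receives in a real engine depends on the search, which is why the effect is unpredictable.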

Bill Spight · #13

yoyoma wrote:

Bill Spight wrote:

Uberdude wrote:

The way to see what LZ thinks of the actual 72 hane with 100k playouts is to play it and wait for 100k playouts to happen, as shown in the 2nd picture. To wait for that 580 to turn into 100k on the first position would likely mean the 1st choice move has also gone up by a factor of 200, which is 20 million which would take ages.

Oh, I haven't been taking the second picture and making a direct comparison. What I have been doing with Deep Leela is this. After playing :w72: as in the game and generating the second picture, then backing up and playing :b71: again. That way DL compares the options for :w72: directly, utilizing the altered search tree which focuses more on the actual move in the game than the original tree. Generating the second picture alters the winrate estimate and number of playouts for the first picture. At least, that happens with DL. My guess is that it works that way with LZ as well.

This will sometimes work, but in a chaotic way. The internals go like this: After going to :w72: it will build a tree and analyze positions, and as a side effect, cache those positions. Then when you go back to :b71:, it will reset the tree search part, and start searching again. The search part will go normally, it does not know the result of the deep :w72: search. But all searches starting with :w72: will be in the cache. These will take a shortcut, bypassing the GPU. So even if :w72: normally starts off bad and only becomes good later, this shortcut might allow it to get enough visits to it to see that it's actually a good move.

So you can't really rely on this method, it's more reliable to compare moves by clicking into each one and noting the winrates.

Let me see if I understand you. The best way to compare play A and play B in terms of winrates is to make each play and observe the winrate estimate of the bot's choice for the opponent's reply to each.

I tried that approach with Deep Leela (faute de mieux, au moment) and iterated attempts until the winrate estimates converged to the same 0.1%. That yielded these pictures:

Black's reply to the hanging connection has a winrate estimate of 47.9%.

Black's reply to Golaxy's :w72: has a winrate estimate of 43.7%.

So, FWIW, DL prefers Golaxy's actual play over the hanging connection by more than 4%, even though the actual play was not on its radar after :b71:, and still is not. Right?
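The arithmetic behind that comparison, as a small sketch: each figure is Black's winrate after Black's best reply, so White's winrate for the move just played is the complement. The function name is made up for illustration; the two percentages are the DL estimates above.

```python
def mover_winrate_after(opponent_reply_winrate):
    """Winrate (in %) for the side that just moved, given the opponent's
    winrate estimate after the opponent's best reply."""
    return 100.0 - opponent_reply_winrate


hanging_connection = mover_winrate_after(47.9)  # White's winrate after the hanging connection
actual_w72 = mover_winrate_after(43.7)          # White's winrate after the game move
difference = actual_w72 - hanging_connection

print(f"hanging connection: {hanging_connection:.1f}%")
print(f"actual move 72:     {actual_w72:.1f}%")
print(f"difference:         {difference:.1f} points in favor of the game move")
```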

yoyoma · #14

Bill Spight wrote:

Black's reply to the hanging connection has a winrate estimate of 47.9%.

Black's reply to Golaxy's :w72: has a winrate estimate of 43.7%.

So, FWIW, DL prefers Golaxy's actual play over the hanging connection by more than 4%, even though the actual play was not on its radar after :b71:, and still is not. Right?

Yes, I think we're on the same page now. Before actually playing move 72, LZ doesn't spend enough time considering Golaxy's play to realize it's good. Also, the LZ code isn't smart enough to reuse results from a forced deeper search of 72 in the way you tried, going forward and then back.

I just realized you're talking about the original Leela with deep learning, not Leela Zero? Are those screenshots from original Leela's built-in GUI? I was talking about Leela Zero, but I think the same logic applies to both.

Bill Spight · #15

yoyoma wrote:

Bill Spight wrote:

Black's reply to the hanging connection has a winrate estimate of 47.9%.

Black's reply to Golaxy's :w72: has a winrate estimate of 43.7%.

So, FWIW, DL prefers Golaxy's actual play over the hanging connection by more than 4%, even though the actual play was not on its radar after :b71:, and still is not. Right?

Yes, I think we're on the same page now. Before actually playing move 72, LZ doesn't spend enough time considering Golaxy's play to realize it's good. Also, the LZ code isn't smart enough to reuse results from a forced deeper search of 72 in the way you tried, going forward and then back.

I just realized you're talking about the original Leela with deep learning, not Leela Zero? Are those screenshots from original Leela's built-in GUI? I was talking about Leela Zero, but I think the same logic applies to both.

Yeah, I am planning to buy a new desktop later this year. Meanwhile I am making occasional use of Deep Leela ( https://www.deepleela.com ). The screenshots are from there. The web site claims to use Leela Zero, but I am pretty sure they still are using Leela 11 right now. It still likes the slide to the 4-2, underneath the 4-4 stone, instead of the jump attachment on the 4-3, for instance.

Thanks for the suggestion. Relying upon the estimated winrates of the opponent's replies works much quicker than backing up.

Given the propensity of LZ to make more visits to plays it thinks are better (probably unavoidable with any kind of best-first search), backing up is probably inferior for direct comparisons, anyway.
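A minimal sketch of why a best-first search skews its visits, assuming a simplified PUCT-style selection rule of the kind AlphaZero-like engines use. This is not LZ's actual code: the function, the constant, and the priors are invented for illustration, the values are loosely based on winrates quoted in this thread, and each move's value is held fixed rather than updated by search.

```python
import math


def allocate_visits(priors, values, total_visits, c_puct=1.5):
    """Hand out visits one at a time to the move with the best
    value-plus-exploration score (a simplified PUCT-style rule)."""
    visits = {move: 0 for move in priors}
    for _ in range(total_visits):
        n_total = sum(visits.values())

        def score(move):
            explore = c_puct * priors[move] * math.sqrt(n_total + 1) / (1 + visits[move])
            return values[move] + explore

        best = max(priors, key=score)
        visits[best] += 1
    return visits


# Hypothetical numbers: the policy strongly prefers the hanging connection.
priors = {"hanging connection": 0.60, "hane": 0.05}
values = {"hanging connection": 0.56, "hane": 0.51}

visits = allocate_visits(priors, values, total_visits=1000)
print(visits)
```

Under these assumptions the move the bot already likes absorbs most of the visits, so the two candidates end up evaluated at very different depths, which is the imbalance discussed above.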

Uberdude · #16

Results of the tournament including links to games are here: https://www.reddit.com/r/baduk/comments ... 9_summary/
