 Post subject: Re: This 'n' that
Post #641 Posted: Mon Aug 19, 2019 8:42 am 
Honinbo

Posts: 8820
Liked others: 2602
Was liked: 3009
On winrate estimates, territory estimates, margins of error, and the last play

OC, as humans we are used to territory estimates, but we are past the hype about how top bots think differently from humans, and better, in some mysterious way, about the probability of winning the game. Unless we are talking about certain situations such as the 5x5 board, where we know that the probability is 100% that Black wins with perfect play, and even reasonably good play, or the late endgame where we can figure out perfect play, or a pro vs. pro game where one player leads by, say, 50 pts. and the largest play gains 10 pts., there is no a priori knowable probability of winning the game. A posteriori, we could have a position played to the end many times by certain players, or by players of comparable levels, and get winrate estimates that way, but we do not know how well those winrates would generalize, or to whom. In general, as the skill of the players decreases towards random play, the winrates get closer to 50%. And the bots do not estimate winrates in that manner, anyway. The mystery of winrates is baked into the cake. We really do not know enough about the factors involved. Perhaps there will be a Ph.D. dissertation about winrates in the near future. :) (BTW, I have found another example where Elf is way wrong about the value of a play by a top player — Dosaku in this case. More later. :))

My purpose here is not to cast doubt on winrate estimates. They are useful. It was the hype that got me started, but that has pretty well blown over. One problem that still remains is that of their margins of error. If a top bot estimates, given sufficient playouts — and we don't know how many that is, either — that one play has a winrate 10% worse than that of the bot's top choice, we can be pretty sure that it is a mistake. OTOH, if the winrate estimate is only 2% worse, we have little assurance that it is an error. I have recently downsized my margin of error for Elf to 4%, but that is still an educated guess. Nobody has worked out the margins of error for winrate estimates, and I doubt that anybody is going to do so anytime soon. The margin of error may be important for a human attempting to interpret winrate estimates, but any bot that picks a play with a smaller winrate estimate, given sufficient playouts, is likely to play worse. And today's bots are written to win games, not analyze positions.

Now, when we can actually work out territory estimates, we can determine the margins of error. For example, if a gote gains 5 pts., its margin of error is 5 pts. as well, since we do not know who will make the play. Assuming correct play, that is. If the players make mistakes, the margin of error could be greater. But the gain is not itself a territory estimate; it is something that we find out in the course of making the estimate. Now, some bots make territory estimates as well as winrate estimates. This is good, but, AFAIK, they do not yet estimate the margin of error of those territory estimates. In terms of the whole board, the gain from making the largest gote or reverse sente is the temperature. If we are going to use territory estimates, we need temperature estimates, as well.
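
To make the counts and gains concrete, here is a toy sketch in Python. The numbers are invented, not taken from any position; it is only meant to show the arithmetic behind "the margin of error of a gote is its gain" and "the temperature is the largest gain available".

Code:
# Toy model of a simple gote. Suppose Black's follow-up leaves the local
# count at +8 (for Black) and White's follow-up leaves it at -2.
black_result, white_result = 8, -2

count = (black_result + white_result) / 2   # local estimate: 3 pts. for Black
gain = (black_result - white_result) / 2    # whoever plays here gains 5 pts.

# We do not know who will get to play here, so the estimate "count = 3"
# can be off by the gain in either direction: its margin of error is 5.
margin_of_error = gain

# On the whole board, the temperature is the gain of the largest gote
# (or reverse sente) still available; the other gains here are invented.
temperature = max(gain, 3.0, 1.5)
print(count, margin_of_error, temperature)   # 3.0 5.0 5.0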

That brings me to the topic of the last play. If I am 1 pt. behind and make a play that gains 3 pts., then I am 2 pts. ahead. The opponent might still win. But if my play was the last play of the game, then I win. Such a situation would be unusual, because the temperature would drop from 3 to 0, and such a large temperature drop is unusual in go. The average drop in temperature between moves is less than 1 pt. It is probably less than 0.1 pt. But larger temperature drops do occur. For instance, suppose that after my play the temperature dropped by 2 pts., i.e., to 1 pt. Then I would still (very likely) win, since I would be 2 pts. ahead and the best my opponent could do would be to gain 1 pt., not enough to catch up. (A very unusual ko situation could still give her the win, since the margin of error for ko positions is greater than their temperature.) The play just before a significant temperature drop is also called a last play.
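
The last-play arithmetic in the scenario above, spelled out (a sketch only, using the numbers from the paragraph):

Code:
# I am 1 pt. behind and my play gains 3 pts., so I am now 2 pts. ahead.
deficit_before = 1
my_gain = 3
lead = my_gain - deficit_before            # 2 pts. ahead

# Suppose my play drops the temperature from 3 to 1.
temperature_after = 1
opponent_best_gain = temperature_after

# Ignoring unusual ko positions, the opponent cannot catch up: 2 > 1.
i_still_win = lead > opponent_best_gain
print(lead, i_still_win)                   # 2 True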

In fact, one of the traditional dogmas of go is that of getting the last big play of the opening. Now, what that play is is not well defined, but good players can usually sense it, and sense the related temperature drop, as well. Unless the bots prove that it is hokum, which I don't think they will. ;) In fact, I have found an example where I think the bots back up the idea of a significant temperature drop in the opening. :D It has to do with the 5-3 approach to the 3-4 point.

Now, humans have known, or at least strongly suspected, that the 5-3 approach to the 3-4 point is not as big, as a rule, as the original 3-4 play itself. Certainly by the 19th century the idea was that, usually at move 4, White should play the 5-3 approach to a 3-4 stone before occupying an empty corner, even though occupying the empty corner was better objectively, because White needed to complicate the game to overcome Black's advantage. Today, with komi, the empty corner beckons, although approaching a 3-4 stone, even at move 2, is not unknown. Writing in the mid-20th century, even Takagawa could not unequivocally say that the approach at move 2 was a mistake. Obviously, the 3-4 makes more territory, on average, but the 5-3 has more influence towards the center and the side. Which is better? Probably the 3-4, but quien sabe?

In the 17th century the 5-3 approach to the 3-4 stone was common at move 2. Did the players think that the 5-3 stone was objectively not quite as good? Maybe so, but Dosaku played a number of games as White where he played the 5-3 in all four corners, playing it as the first play in empty corners. Did he think that the 5-3 was objectively as good as, or better than, the 3-4? Obviously, he was extremely skilled at utilizing the influence of the 5-3, but would he have played it to occupy an empty corner if he were playing against himself?

Well, Elf has an opinion, expressed in terms of winrates. What does Elf say?

In a game against Yasui Chitetsu (GoGoD 1671-08-25a) Dosaku played the 5-3 approach as :w2: against Chitetsu's 3-4 :b1:, a very common opening at the time. Elf estimates that the approach loses 5½% versus a 4-4 play in an empty corner. (I don't regard winrate estimates as precise enough to warrant reporting tenths of a point difference near 50%. Half point precision is good enough, IMHO. :)) Next, Chitetsu played :b3: as a two space pincer against :w2:, which was also common back then. Elf regards :b3: as a 4% winrate error. (Within decades human players had dropped the :b3: pincer, which indicates that they also had come to regard it as an error. When both bots and humans think a play is a mistake, it probably is. ;)) Dosaku played :w4: on the 5-3 in the adjacent corner closest to :b3:. Elf considers it a 7% error. OK, Elf considers the 5-3 to be a mistake, whether as an approach to the 3-4 or as the first play in an empty corner. What does this have to do with the last play, if anything?

OK. Today's bots consider the corners to be worth more, by comparison with the sides, than humans do. In the late 20th century we were starting to see humans devalue the sides by a little bit. For instance, the sanrensei was devalued, but the nirensei was still considered good. Even today, the bots like the nirensei. ;) But plays on the side that top humans made without a second thought are now regarded as losing 10% by today's top bots. Shoulder hits, side attachments, or other plays against enclosures are usually considered to be bigger than extensions on the side. This represents a big difference in opening theory. IOW, the temperature of the corners remains hotter than the temperature of the sides for longer than we humans have thought. A temperature drop is coming up. ;)

GoGoD 1665-00-00a, Aoki Guseki (W) vs. Dosaku. :w4: plays the 5-3 approach instead of occupying the last empty corner. Elf estimates a winrate loss of 6½%.

GoGoD 1667-12-05b, Castle Game, Honinbo Doetsu (W) vs. Yasui Chitetsu. :w2: is a 5-3 approach, estimated loss of 5½%. :b3: plays on the 3-4 in an open corner. :w4: approaches on the 5-3. Estimated loss: only 2%. :o (But there are two empty corners.)

GoGoD 1669-07-16, Dosaku (W) vs. Doetsu. :w2: is a 5-3 approach. Estimated winrate loss: 6½%. :b5: is a 5-3 approach. Estimated winrate loss: 2%. (Two empty corners.) :w8: is a 5-3 approach instead of occupying the last empty corner. Estimated winrate loss: 7½%.

If I were writing an article or thesis, I would, OC, examine many instances, either of actual games or of computer-generated positions. And I have looked at more games than I report here. The number of empty corners seems to matter to the winrate loss estimate of the 5-3 approach. Here is my hypothesis as to why.

Winrate loss estimates depend not only upon the play made, but also upon the alternative, presumably best, play. The value of the 5-3 approach in each corner is approximately the same in each case, I assume. Then the difference in winrates reflects the difference in the value of occupying an empty corner, assuming that that is the best play. When there is only one empty corner, that difference is around 6½% in terms of winrates. But when there are two empty corners, they are miai, if not exactly so. And then the difference is pretty much the loss in the corner of the 5-3 approach versus the play after the two corners are occupied, which comes to around 2%. The difference of around 4½% reflects a temperature drop after the last empty corner is occupied. Occupying the last empty corner is significant.

When there are three empty corners, there is some uncertainty about who will get to occupy the last empty corner, at least as bots calculate winrates. That uncertainty reduces the winrate estimate of the loss of the 5-3 approach by around 1½%.
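
As bare arithmetic, the hypothesis amounts to this (a sketch, using the rounded figures from the paragraphs above):

Code:
# Elf's estimated winrate losses for the 5-3 approach, by number of empty corners.
loss_one_empty_corner = 6.5    # the approach forgoes the last empty corner
loss_two_empty_corners = 2.0   # the two remaining corners are roughly miai

# Hypothesis: the difference reflects the temperature drop that comes
# with occupying the last empty corner.
last_corner_premium = loss_one_empty_corner - loss_two_empty_corners   # ~4.5

# With three empty corners, uncertainty about who will get the last one
# appears to shave roughly 1.5% off the one-corner figure.
loss_three_empty_corners = loss_one_empty_corner - 1.5
print(last_corner_premium, loss_three_empty_corners)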

OC, if I did the research I could get better estimates, and there may be other factors to consider. :) But I think these results are suggestive. There does seem to be a last play effect in the opening, namely occupying the last empty corner. It comes earlier than humans have thought, but there may be another significant temperature drop a bit later on at the threshold of the middle game, and yet another at the cusp of the endgame. :)

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

I think it's a great idea to talk during sex, as long as it's about snooker.

— Steve Davis

 Post subject: Re: This 'n' that
Post #642 Posted: Mon Aug 19, 2019 9:31 pm 
Lives with ko

Posts: 289
Liked others: 56
Was liked: 244
Rank: maybe 2d
Bill, regarding winrates specifically, when you say you want a margin of error, presumably you are talking about the error in the bot's estimate relative to something. What precisely is that something?

  • Obviously it's not "theoretical perfect play", because under perfect play the position must either be entirely won or entirely lost, so the true winrate will either be 100% or it will be 0%. In that case, the error in a winrate estimate like 60% would of course be precisely either 60% or 40%, and it would generally be impossible to determine which.
  • Is is "the probability that the bot would win from here against itself, using the actual self-play settings and parameters used in training?". Well that could be either 100% or 0% too! Because it is not atypical for bots to only randomize self-play early in the game, for the rest of the game they might actually just deterministicly always choose the move that got the best search results. Or maybe they may randomize just a little too, in which case it might not be exactly 100% or 0%, but could still vary wildly depending sensitively on the details. And these details don't actually matter much! The neural net during training sees pretty much the same thing either way: it sees a game with mostly good moves ending in a win or loss. You're not going to go back and replay exactly that same game again, so it doesn't matter if the later moves were deterministic or not. And it would be weird if what we wanted was an error estimate relative to something that might vary so sensitively with respect to details of training that actually don't matter much.
  • Is it "the average probability that randomly chosen professional human players would win from here against other randomly chosen pro opponents"? Well in that case the error is going to be often vastly greater than small numbers like 4%, as human pro players routinely lose highly-winning games or win highly-losing games or make other huge swings from strong bots' perspectives. And of course you need to consider possible issues like move A is definitely better than B for bots and the bots are "right" to evaluate it so, maybe it's even better in some "objective" sense, but move B actually leads to better practical chances for a human because relative to human strengths/weaknesses, move A makes it both harder for you and easier for your opponent to handle the resulting fight.
  • Is it "the winrate that the bot itself will report in the future after more moves are played", with the hopes that with more moves the bot can better judge whether it was 'right' or 'wrong'?". In that case, you need to specify some sort of time horizon. With a way-too-long horizon, of course we're back to 100% or 0%, because that's what it will be at the end of the game. With a very short horizon though, you're measuring short-term fluctuation noise. So you want some intermediate horizon, but what horizon is tricky, as it may take highly variable numbers of moves for the bot to realize, depending on the potential judgment/misjudgment involves a short-term fight or a long-term shape that will only come into play much later in the game. Either way, you still actually need to say what the time horizon you care about is (possibly different for different situations?). And of course, it's not guaranteed that the numbers you get will apply to humans, who have different strengths and biases.
  • Or maybe you actually do mean the move-to-move fluctuation noise, i.e. you want something like the error with respect to "the winrate that the bot itself will report on the very next move"? That's pretty easy to quantify (see the sketch just after this list), but it doesn't seem like an ideal metric. If the bot rates move A 5% higher than move B, and you play both A and B on the board, the winrate will then fluctuate a bit for each, but the magnitude of that fluctuation isn't necessarily tied to whether A is really a "better move" than B. Similar to earlier, it depends on things like whether it's a short-term tactic the bot can realize imminently, or a longer-term judgment difference that won't get resolved soon. And of course, here too it's not guaranteed that the numbers you get will apply to humans.
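
A crude way to put a number on that last notion, as a sketch (the winrate sequence is invented; a real measurement would pool many games):

Code:
import statistics

# Successive winrates (for the same side) reported by a bot along one game.
# Invented numbers, purely to illustrate the metric.
winrates = [0.52, 0.55, 0.53, 0.58, 0.56, 0.61, 0.60]

# Move-to-move changes; their spread is the "fluctuation noise".
diffs = [b - a for a, b in zip(winrates, winrates[1:])]
fluctuation_noise = statistics.pstdev(diffs)
print(round(fluctuation_noise, 3))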

Or do you mean something else entirely? Apologies if you've explained this already somewhere and I missed it. :)

Basically, it's kind of hard to think about how one would add an error estimate (or how one would research how to add it) when not sure what precisely that error is supposed to measure in the first place.


This post by lightvector was liked by: dfan
 Post subject:
Post #643 Posted: Mon Aug 19, 2019 10:32 pm 
Honinbo
User avatar

Posts: 8667
Location: Santa Barbara, CA
Liked others: 323
Was liked: 2007
GD Posts: 312
Quote:
under perfect play the position must either be entirely won or entirely lost, so the true winrate will either be 100% or it will be 0%.
Probably matters little, if at all, to this point: but how do we know perfect play doesn't always lead to no-result (e.g. triple ko, etc.) ?

 Post subject: Re: This 'n' that
Post #644 Posted: Mon Aug 19, 2019 11:54 pm 
Honinbo

Posts: 8820
Liked others: 2602
Was liked: 3009
lightvector wrote:
Bill, regarding winrates specifically, when you say you want a margin of error, presumably you are talking about the error in the bot's estimate relative to something. What precisely is that something?


"That's not my department, says Wernher Von Braun." — Tom Lehrer

;)

Color me old-fashioned, but when I come up with an approximate measure, I am interested in its error function. Now, the inventors of the winrate estimate have good reasons for not providing an error function. For one thing, it's not the only thing they use to choose plays. For another, the number of playouts or visits indicates the degree of confidence in the winrate estimate. And for another, for choosing the best play, the order is more important than the absolute value.
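
For what it's worth, the naive statistics-textbook way to turn a playout count into a confidence figure looks like this. It treats playouts as independent coin flips, which they are not (tree search reuses and correlates them), so it is a sketch of the idea, not a real error bar:

Code:
import math

def naive_stderr(winrate, playouts):
    """Standard error of a proportion, as if playouts were independent."""
    return math.sqrt(winrate * (1.0 - winrate) / playouts)

# E.g. a 50% estimate backed by 700 playouts vs. 12,000 playouts:
print(round(naive_stderr(0.5, 700), 3))     # ~0.019, i.e. about 2%
print(round(naive_stderr(0.5, 12000), 3))   # ~0.005, i.e. about 0.5%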

{This paragraph may be skipped.} My first foray into game-related evaluation was coming up with a point count for Quick Tricks in contract bridge. I took a Chebyshev approach and minimized the maximum error, given knowledge of the total of the point counts of the two partners' hands and certain assumptions about the play. Starting with the errors is what enabled me to come up with the evaluation function. :) (I had not assumed a point count, it just worked out that way. ;))

However, human reviewers are obviously interested in winrate estimates and, from my point of view, are hampered by the lack of error estimates. If LZ, Elf, or KataGo says that my play has a winrate estimate 2% lower than its first choice, does that mean that my play was a mistake? It is apparent from reading reviews that some people even think that a difference of ½% is significant (in the playing sense, not the statistical sense), something that strikes me as absurd. There are other questions that I have, as an analyst, but this is the basic question that human reviewers have, and they have no guidance in the matter.

Now, it would be possible to use the data from bots to come up with margins of error, however defined, but 1) it would take a good bit of time and effort, 2) you would have to make assumptions that people could challenge, and 3) the landscape keeps changing as bots improve and new methods may be devised. Look at the exciting progress of chess engines, several years after they got better than humans. Those who devise bots have not provided margins of error, and I doubt that they will any time soon. Perhaps some academic will do the research.

Quote:
Is is "the probability that the bot would win from here against itself, using the actual self-play settings and parameters used in training?".


If I understand dfan correctly, that's pretty much the idea. But that's not how the estimates are derived. ;)

Quote:
Or maybe you actually do mean the move-to-move fluctuation noise, i.e. you want something like the error with respect to "the winrate that the bot itself will report on the very next move"?


That's pretty much the reinforcement learning approach, isn't it? A winrate estimate estimates the winrate estimate after the next move is played. But, IIUC, that is not tested directly, either. Rather the test is how well the bot plays the whole game, not how well it evaluates each position or play. It is a player, not an analyst.

Quote:
That's pretty easy to quantify, but it doesn't seem like an ideal metric.


True enough. But when you see a winrate estimate with 700 playouts, and after the next play (which is the bot's first choice) the new winrate estimate with 12,000 playouts differs by 2%, you have to suspect that the margin of error with 700 playouts is at least 2%. ;)

A few years ago, with a little cleverness, I compared winrate estimates for Leela 11 with 100k playouts per position (not per option) versus 200k playouts, in a setting where I could argue that the difference between the two was not random but the result of evaluation errors with 100k playouts, and came up with a minimum margin of error of around 3%. Nowadays, OC, who cares about Leela 11's margin of error?
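
Something like the following is how such a comparison might be set up. This is a sketch only: the function that queries the engine is hypothetical (it stands in for whatever GTP or analysis wrapper you actually use), and the argument only works if you accept that the disagreement between the two playout levels reflects evaluation error at 100k rather than mere randomness:

Code:
import math

def winrate(position, playouts):
    """Hypothetical stand-in for asking an engine for its winrate estimate
    of a position after a given number of playouts.  Not a real API."""
    raise NotImplementedError

def min_margin_of_error(positions):
    # Paired estimates of the same positions at two playout levels.
    diffs = [winrate(p, 100_000) - winrate(p, 200_000) for p in positions]
    # If the disagreement is attributed to error in the 100k estimates,
    # its RMS gives a floor for their margin of error.
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))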

Quote:
And of course, here too it's not guaranteed that the numbers you get will apply to humans.


True enough. But if a bot's winrate margin of error is 3% with superhuman play, surely it is larger when applied to human play.

Quote:
Basically, it's kind of hard to think about how one would add an error estimate (or how one would research how to add it) when not sure what precisely that error is supposed to measure in the first place.


Sure. The researcher has to specify what he means. Hard to do when the developers talk as though there were such a thing as the probability of winning the game. You have to make assumptions, and the developers may not even know what the assumptions are. Or maybe they don't want to say. ;)

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

I think it's a great idea to talk during sex, as long as it's about snooker.

— Steve Davis

 Post subject: Re: This 'n' that
Post #645 Posted: Tue Aug 20, 2019 2:44 am 
Oza
User avatar

Posts: 2152
Location: Tokyo, Japan
Liked others: 1996
Was liked: 1216
Rank: Jp 6 dan
KGS: ez4u
Could someone explain the relationship between the lower confidence bounds (LCB) and upper confidence bounds (UCB) and the winrate? I have naively thought that the change in the use of LCB in LZ 0.17 was in a sense a conservative adjustment for the degree of uncertainty in the winrate. Is this completely off base?

_________________
Dave Sigaty
"Short-lived are both the praiser and the praised, and rememberer and the remembered..."
- Marcus Aurelius; Meditations, VIII 21

 Post subject: Re: This 'n' that
Post #646 Posted: Tue Aug 20, 2019 4:29 am 
Lives in gote

Posts: 412
Liked others: 1
Was liked: 114
Rank: KGS 2k
GD Posts: 100
KGS: Tryss
Bill Spight wrote:
Sure. The researcher has to specify what he means. Hard to do when the developers talk as though there were such a thing as the probability of winning the game. You have to make assumptions, and the developers may not even know what the assumptions are. Or maybe they don't want to say. ;)


For bots like LZ, the winrate given by the network is an interpolation (for this position) based on the results of positions encountered in self play by previous networks.

Basically, you feed the algorithm positions and results, and it fits a function (the network) to these data points. Then you apply this function to all the positions you encounter.

Playouts just apply this function to positions further in the tree, and the "final winrate" is the winrate of the last position in the "best line" (if I'm not mistaken).
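
A toy sketch of those two ingredients, with every detail a stand-in (real bots fit a deep network that generalizes to unseen positions and run a guided tree search):

Code:
# Toy version of "fit a function to self-play results, then let playouts
# apply it deeper in the tree".  Positions are plain hashable keys here.

def fit_value_function(selfplay_results):
    """selfplay_results: {position: [results]}, result 1 = win, 0 = loss.
    A real bot fits a neural net; this toy just averages observed outcomes."""
    return {pos: sum(rs) / len(rs) for pos, rs in selfplay_results.items()}

def winrate_from_playouts(visited_positions, value, default=0.5):
    """Playouts evaluate positions further down the tree.  Whether the
    reported number is the value at the end of the principal variation
    (as described above) or an average over visited positions differs by
    implementation; this toy averages."""
    vals = [value.get(p, default) for p in visited_positions]
    return sum(vals) / len(vals)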


This post by Tryss was liked by: Bill Spight
 Post subject: Re: This 'n' that
Post #647 Posted: Tue Aug 20, 2019 7:46 am 
Lives with ko

Posts: 289
Liked others: 56
Was liked: 244
Rank: maybe 2d
EdLee wrote:
Quote:
under perfect play the position must either be entirely won or entirely lost, so the true winrate will either be 100% or it will be 0%.
Probably matters little, if at all, to this point: but how do we know perfect play doesn't always lead to no-result (e.g. triple ko, etc.) ?

With area rules, superko, and half-integer komi (which are the only kind of rules that most current bots use), the game must always terminate with a win or a loss. And yes, this is a bit of a distraction from the actual issue.

Bill Spight wrote:
Quote:
That's pretty easy to quantify, but it doesn't seem like an ideal metric.


True enough. But when you see a winrate estimate with 700 playouts, and after the next play (which is the bot's first choice) the new winrate estimate with 12,000 playouts differs by 2%, you have to suspect that the margin of error with 700 playouts is at least 2%. ;)


Note that this is still tricky. Consider the case where two moves differ by less than 2%, and therefore you don't trust that difference, but actually the estimates of the two are highly correlated, because they lead to almost the same variations, differing only in one forcing move that changes the territory slightly but doesn't tactically matter. In that case, while the "error" (whatever that means) in each of the two moves is at least 2%, the "error" (whatever that means) in their difference could be far less than 2%, since whatever part of it is correlated will cancel out in the difference.
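
That is just the usual variance algebra for a difference of correlated quantities; as a sketch with made-up numbers:

Code:
import math

# Var(A - B) = Var(A) + Var(B) - 2*Cov(A, B): if two estimates share most
# of their error, the error of their *difference* can be much smaller.
var_a = 0.02 ** 2          # each estimate uncertain by about 2%
var_b = 0.02 ** 2
correlation = 0.9          # estimates built from nearly the same variations
cov_ab = correlation * math.sqrt(var_a * var_b)

sd_diff = math.sqrt(var_a + var_b - 2 * cov_ab)
print(round(sd_diff, 4))   # ~0.009, i.e. well under 2%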

Bill Spight wrote:
lightvector wrote:
Bill, regarding winrates specifically, when you say you want a margin of error, presumably you are talking about the error in the bot's estimate relative to something. What precisely is that something?


"That's not my department, says Wernher Von Braun." — Tom Lehrer

;)

Color me oldfashioned, but when I come up with an approximate measure, I am interested in its error function.


The winrate is a prediction of the binary outcome of win/loss as seen statistically in the self-play game data. The problem is that to talk about the error for a binary outcome prediction (win/loss) as a separate and independent quantity from the prediction itself and as something intrinsic to the prediction itself is dangerously close to being mathematically incoherent. So you have to tread carefully, because unlike some other areas, where human intuition usually points at something genuinely meaningful even if it may be fantastically hard to make precise and quantify, sometimes in this specific area it may be human intuition that is the problem.

The straightforward and perhaps-unhelpful answer to your question is that so long as the probability prediction is well-calibrated* with respect to a player population, then whenever a bot predicts 80%, the "error function" is that 80% of the time it will be predicting too low by 20%, because the game actually was won, and 20% of the time it will be predicting too high by 80%, because the game actually was lost. And then the straightforward answer would say that's it, that's all there is to know regarding the error of that prediction. The percentage itself IS the expression of uncertainty about the game outcome!

(* "well-calibrated" means that among all times of the time the bot says, e.g. 80% in positions randomly drawn from games by those players, indeed about 80% of time the game is then won and 20% of the time the game is then lost. Bot winrates are obviously not well-calibrated with respect to human player populations, but if you have enough games from the desired player population, it is very possible to make it well-calibrated. You just plot the bot winrates among all the positions within those games against the empirical outcome of the games, fit a curve, and then have the bot report what the curve says instead of what it would have said originally).

--------------------------------

To give another analogy - imagine a well-calibrated weather station predicts in city A a chance of rain today of 70%, and suppose the city is too small or the potential rainclouds too big for there to be any appreciable chance of rain only hitting part of the city without hitting the whole. What is the "error function" on this prediction?

Now, in reality, it actually either does rain in A or not. So the 70% isn't a fact about the world, it's a fact about the weather station's own uncertainty about the world. The weather station is not making a prediction of a platonic probability "70%" out there in the world where that prediction itself has some additional uncertainty; rather, it's making a prediction of rain or no rain, and "70%" is the expression of uncertainty about "rain or no rain". So the error in the prediction will be 70% (the 30% of the time that it doesn't rain) and 30% (the 70% of the time that it does).

In what cases would 80% have been a "better" prediction? If it did in fact rain in A, then it would be better. If it didn't actually rain in A, then it would be worse. A better model might indeed make a prediction of 80% because it recognizes features that more strongly suggest rain that the original model didn't see. Or it might make a prediction of 10% or 0% because it sees features, which the original model missed, that make rain extremely unlikely, this most likely being among the 30% of the time that the original model would have been wrong. In either case, again the percentage itself already is the expression of uncertainty and error of that particular model (so long as the model is well-calibrated).

-----------------------------------------

So, when a bot rates a move at 60% (note: major caveats regarding differences between match play and self-play; let's suppose the bot has been well-calibrated to the population of its own *match* play games rather than self-play games), the bot is saying "I'm 60% certain that I will win if I play here, but 40% uncertain about winning if I do so". There's no further "error" to talk about regarding that 60% number. The 60% itself is ALREADY saying that the bot expects to be "wrong" (i.e. in error) about winning 40% of the time.

So that's sort of what I'm getting at. When you are trying to predict a binary, inherent to the prediction itself there is no further error to speak of since the percentage itself is already the expression of what the probable error is. This is not always intuitive for humans, and is easy to get tripped up by even for experienced statisticians.

Now, while there is no further inherent notion of error, you DO get other notions of "error" when you start talking about comparing the prediction to OTHER statistical averages (like proportions of games won/lost by humans, etc.), or about the way that successive predictions may change over time. Then there IS plenty more to speak of. And of course, much of the above does not apply to score, which is not binary. But for winrate, what notion of error you get is *entirely* a function of what other thing you choose to compare against. And what other thing to compare against, to give humans what they want for review, is not completely a scientific question, but in part also a psychology question, and a user-education question, and a question of "what do you actually want, it's your free choice what statistics you would personally find most useful", which is why it's difficult to approach.

I hope this helps clear up some of the mathematical trickiness regarding what winrates "mean". Or maybe it makes people more confused. :)


This post by lightvector was liked by 2 people: Bill Spight, dfan
 Post subject: Re: This 'n' that
Post #648 Posted: Tue Aug 20, 2019 9:14 am 
Honinbo

Posts: 8820
Liked others: 2602
Was liked: 3009
Just a brief note for now. Busy, busy today.

As I said, my purpose is not to question winrate estimates. They are not hyped anymore. :) But I believe that there is a problem with how human players use them. Humans use them to evaluate plays and positions, without any clear understanding of what they mean. IIUC, bots use them along with other factors such as playouts to choose plays. Winrate estimates only have to be good enough as evaluations to be useful for that purpose. They are tested by how well bots play games, not by how well they evaluate specific positions and plays. Bots typically also supply humans with winrate estimates and playouts for plays not made. But they do so without also providing humans with guidance about how to interpret and utilize that information. IMHO, that's a problem.

Not that bot developers have to solve it, but they could address it. I have addressed that problem from time to time. For instance, I now propose a margin of error for Elf 1 of 4% for estimates in the vicinity of 50%, with at least 4k playouts. That's a guess, but it's an educated guess. I have neither the time, nor the energy, nor the inclination to do a lot of research on that question right now. Who does? (Besides, it's a thankless task. ;))

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

I think it's a great idea to talk during sex, as long as it's about snooker.

— Steve Davis

 Post subject: Re: This 'n' that
Post #649 Posted: Wed Aug 21, 2019 12:43 am 
Honinbo

Posts: 8820
Liked others: 2602
Was liked: 3009
Continuing in the same vein, here are opening moves in a game between Dosaku (W) and Doetsu 350 years ago (GoGoD 1669-07-16), with comments by Elf and my reflections :)

[go]$$Bc Dosaku (W) vs. Doetsu
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . 6 . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . 5 . . |
$$ | . . 1 , . . . . . , . . . . . , . . . |
$$ | . . . . 2 . . 3 . . . . . . . 4 . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]

:w2: loses 6½% to par, according to Elf, with 26k playouts. (Note: :w2: was hardly on Elf's radar, getting only 1 playout originally. That is not enough to establish any winrate estimate. Elf established it with its preferred reply, which had 26k playouts. I have followed this procedure for the number of playouts for many of the moves below.)
:b3: loses 5%, with 4k playouts.
:b5: loses only 2% to par (16k playouts). I have already hypothesized why.
:w6: loses 7.5% (16k playouts). I don't know why this loses more than :b3:. The two corners are mirrored, with a 90° rotation. With a 180° rotation I would expect White to have gained a slight advantage because of a small temperature drop, but Elf thinks that Black has gained around 7%. :o

[go]$$Bcm7 Variation 1
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . O . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . X . . |
$$ | . . X , . . . . . , . . 5 3 1 , . . . |
$$ | . . . . O . . B . . . 6 . 4 2 O . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]

The answer surely lies in the assessment of this position, which Elf gives as the main variation starting with :b7:. We know that bots like :b7:, but Black would have a corresponding play in the top right corner if play had continued there instead of in the bottom right. Elf has a small preference for :w10: over the one space jump. Perhaps the point is that the :bc: stone hinders White on the bottom side, I don't know. :scratch:

[go]$$Bcm7 Game record 2
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . 4 . . 5 . . 3 . . . 2 . . . . |
$$ | . . . , . . . . . , . . . . . , 1 . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . 6 . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . O . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . X . . |
$$ | . . X , . . . . . , . . . . a , . . . |
$$ | . . . . O . . X . . 7 . . . . O . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]

:b7: loses 2½% by comparison with a, with 14k playouts.
:w8: loses 7½% by not occupying the last empty corner (29k playouts).
:b9: loses 12% to par, not just because it fails to occupy the last empty corner, but also because it is a slack 3 space pincer (21k playouts).
:w10: occupies the last empty corner, but still loses 7% to par because it is on E-17 (17k playouts) (Elf does not show the other 3-5 point, so I don't know how it would evaluate it.)
:b11: is curious to our eyes. It is plainly too slow. But apparently it was in vogue at the time. Players plainly valued making a base on the side. Bots like it less than we do. According to Elf it loses 12% to par (30k playouts). Elf prefers to invade the top left corner on the 3-3.
:w12: returns the favor, making a base on the right side instead of enclosing the top left corner. It loses 15% to par. Perhaps the extra 3% can be explained by the fact that :b11: makes the corner invasion a pincer, as well.
:b13: loses 17% to par (30k playouts). What explains the extra 2% loss? I think it's a temperature drop. (More below.)

[go]$$Wcm14 Game record 3
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . O . . X . . X . . . O . . . . |
$$ | . . 4 , . . . . . , . . . . . , X . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . 3 . . . . . . . . . . . . . O . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . 1 . . . . . . . . . . . . . O . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . 2 . . . . . . . . . . . . X . . |
$$ | . . X , . . . . . , . . . . . , . . . |
$$ | . . . . W . . X . . X . . . . O . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]

:w14: plays a counter pincer, avoiding the top left corner once more. It loses 20½% to par (29k playouts), a serious loss. (If Ohashi had chosen this or a similar game to highlight Dosaku's play, Dosaku would not have looked so good, eh? { https://lifein19x19.com/viewtopic.php?f=13&t=16844 } But some selection bias is expected and excusable, don't you think? ;))
:b15: breaks a sector line and threatens the :wc: stone. It loses only 15% to par (30k playouts).
:w16: makes a base on the left side, no surprise by now. ;) It loses 20½% to par (29k playouts). The extra percentage lost to par by comparison with :b13: is also, I think, because of a temperature drop.
:b17: is Elf's first choice. :D It does not show any variation, but I think it picks the 3-4 instead of the 3-3 now because if it played the 3-3, the outside attachment on this 3-4 would work well with the White base on the left side.

The alternative to each of these base making plays is to invade or enclose the top left corner, clearly par play, according to Elf. And there is not a lot of difference locally, so why the additional losses to par? As I said, I think it has to do with temperature drops. Look at the whole board after :b17:. There is a base on every side. True, each corner has a weak stone in it, but single weak stones can be handled. ( :wc: is the weakest White stone, and Dosaku bolstered it on the next play.) One thing the bases on the side do is to reduce the temperature around them. You can't pincer them, and they restrict the opponent's development. Each time a base is made on the side the temperature drops there. :w16: produces a significant temperature drop after :b17:. :b17: is the last play of the opening in spades. An unusual situation. ;)

[go]$$Wcm16 Variation 2
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . O . . X . . X . . . O . . . . |
$$ | . . 1 , . . . . . , . . . . . , X . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . 3 . . . . . . . . . . . . O . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . O . . . . . . . . . . . . . O . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . X . . . . . . . . . . . . X . . |
$$ | . . X , . 2 . . . , . . . . . , . . . |
$$ | . . . . W . . X . . X . . . . O . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]

Elf recommends enclosing the corner, OC. Then :b17: covers the :wc: stone and :w18: gets the last big play of the opening. :)

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

I think it's a great idea to talk during sex, as long as it's about snooker.

— Steve Davis

 Post subject: Re: This 'n' that
Post #650 Posted: Sat Aug 24, 2019 10:27 am 
Honinbo

Posts: 8820
Liked others: 2602
Was liked: 3009
More in the same vein :)

[go]$$Bc Honinbo Retsugen (W) - Yasui Senchi Senkaku, Swastika 5-3 opening
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . 3 . . . . . . . . . . 2 . . . . |
$$ | . . . , . . . . . , . . . . . , 1 . . |
$$ | . . 4 . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . 8 . . |
$$ | . . 5 , . . . . . , . . . . . , . . . |
$$ | . . . . 6 . . . . . . . . . . 7 . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]

(GoGoD 1799-12-13a, Castle Game)

The Swastika 5-3 goes back, I believe, to Honinbo Dosaku Meijin. It suggests that White thought that the 5-3 was at least as good as the 3-4, and not just a slightly inferior play that prevented Black from having an easy game. I was somewhat surprised to find it played as late as 1799. ;) As Senchi was known for his central influence, perhaps Retsugen played four 5-3s to throw him off his game.

OC, Elf regards this board as advantageous for Black, who has gained 10½%, assuming 7½ pt. komi. That means that the 3-4 is, on average, about 2½% better than the 5-3. How many pts. ahead does Golaxy or KataGo estimate Black to be? OC, simply playing the averages is not very reliable, as winrate estimates indicate. :) Both :b1: and :w8: are par moves, for instance.

To get a feel for this position, let's look at Elf's main continuations from here. Elf regards only two plays as worth considering: the kosumi and the kick. Senchi played the kosumi; Elf prefers the kick by 1%.

[go]$$Wcm10 The kosumi
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . X 2 . 4 . . . . . . . O . . . . |
$$ | . . . , 1 3 . . . , . . . . . , X . . |
$$ | . . O . . . . . . . . . . . . B . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . a . . |
$$ | . . 8 , . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . 5 . . . . . . . . . . . . . . . . |
$$ | . . . . 9 . . . . . . . . . . . . . . |
$$ | . . . 6 . . . . . . . . . . . . O . . |
$$ | . . X , . . 7 . . , . . . . . , . . . |
$$ | . . . 0 O . . . . . . . . . . X . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]


After the kosumi, :bc:, Elf continues with :w10:, pressing against the top left corner, and then plays :w14:, a favorite pincer of AlphaGo. :b15: breaks the sector line, and White plays :w16:, the keima extension towards the bottom side. Then :b17: is a counter-pincer. :w18: encircles the bottom left corner, and :b19: secures it. (BTW, Retsugen played :w10: at a, a move that would not be considered an error for more than 200 years, i.e., until the AI era.)

[go]$$Wcm10 The kick
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . X 4 . 6 . . . . . . . O B . . . |
$$ | . . . , 3 5 . . . , . . . . 1 , X . . |
$$ | . . O . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . 2 . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . 7 . 9 . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . 8 . 0 . . . . . . . . . . O . . |
$$ | . . X , . . . . . , . . . . . , . . . |
$$ | . . . . O . . . . . . . . . . X . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]


After the kick, :bc:, :w10: stands, and then :b11: extends towards the right side. As after the kosumi, White presses against the top left corner and plays the pincer against the bottom left. This time, with the White influence in the top right, White plays :w18:, the one space jump towards the center. Then :b19: pushes through. The development on the left side is similar, but :w18: gives this diagram a different feel. :) (Note: Elf's preference for the jump is less than 1%, however.)

Gotta run. More later. :)

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

I think it's a great idea to talk during sex, as long as it's about snooker.

— Steve Davis


Last edited by Bill Spight on Sun Aug 25, 2019 5:06 am, edited 2 times in total.
 Post subject: Re: This 'n' that
Post #651 Posted: Sat Aug 24, 2019 11:12 am 
Lives with ko

Posts: 171
Liked others: 24
Was liked: 52
Rank: 2d
Bill Spight wrote:
OC, Elf regards this board as advantageous for Black, who has gained 10½%, assuming 7½ pt. komi. That means that the 3-4 is, on average, about 2½% better than the 5-3. How many pts. ahead does Golaxy or KataGo estimate Black to be?

KataGo (using 6.5 points komi) believes :w2: and :w6: are problematic, dropping about 1.6 points each. The position after :w8: is evaluated as around B+3.5, 60% winning percentage. It thinks :b3: should have been the diagonally opposite star-point and :w4: should have been a press in the top right, but neither move changes the score estimate very much.


This post by bernds was liked by: Bill Spight
 Post subject: Re: This 'n' that
Post #652 Posted: Sun Aug 25, 2019 8:04 am 
Honinbo

Posts: 8820
Liked others: 2602
Was liked: 3009
A bit more about the margin of error of winrate estimates. I will be repeating myself to some extent, hopefully not too much. ;)

lightvector wrote:
Basically, it's kind of hard to think about how one would add an error estimate (or how one would research how to add it) when not sure what precisely that error is supposed to measure in the first place.


Right. :) In a go playing program, winrate estimates are used, among other factors, to choose moves. These estimates are not tested directly, so we do not know how good an estimate is for any given play or position (along with who is to play). The program does not need that information.

However, the main use, I believe, for such programs is not to play games, but to assist humans in reviewing games. One question humans ask is what is the best play in a given position? The programs provide an answer by indicating their top choice of plays. So far, so good. Another question humans ask is how two moves compare, and by how much? The how much question brings us to the margin of error.

Now, review programs do not typically, AFAIK, compare plays directly. But they do compare subsequent positions in a game. This information is an indication of whether a play is a mistake, and by how much. For instance, if Black made a play in the game that reduced Black's winrate estimate by 10%, humans typically interpret that to mean that the move was a mistake that reduced the probability of winning by 10%. But what if it reduced the winrate estimate by only 2%? Are we even sure it's a mistake? To answer that question we need to know the margin of error of these estimates. And we do not know that. Nobody is precisely sure what "that error is supposed to measure in the first place."

Our uncertainty is not trivial. For instance, Uberdude has posted a game record of a recent game between FineArt and Golaxy ( https://lifein19x19.com/viewtopic.php?f ... 45#p248045 ) along with LZ's winrate estimates and playouts. It is plain from his subsequent comments that he did not intend LZ's opinion to be definitive, or even to have enough playouts. But let's look at a few estimates and differences.

The last play of the game was Black 327, after which White resigned or was adjudicated to have lost. LZ's winrate estimate for Black was 89% with 1.3k playouts. One interpretation of that estimate is that if LZ played itself from that point, White playing first, Black would win 89% of the time, with an error of 11%, and would lose 11% of the time with an error of 89%. Well, I, for one, would be willing to bet 8 USD to 1 that if you let two randomly chosen AGA 10 kyus play out the game from there, Black would win. Surely LZ vs. LZ would be a certain win for Black. But with only 1.3k playouts LZ is not so sure. I'll happily give that winrate estimate a margin of error of 10%. :)

The game was effectively over after Black 315, at which point only dame and protective plays were left. LZ's winrate estimate for Black with White to play is only 67% with 3.6k playouts. (Many of LZ's estimates were made with fewer than 1k playouts, but 3.6k is fairly respectable. :)) Now, there is research indicating that two human 5 kyus can correctly play out a game at the dame stage around 98% of the time. Betting 20 USD to 10 that Black would win in that case would be like taking candy from a baby. My impression, going back to the early days of Monte Carlo programs, is that they tend to underestimate the winrates of the eventual winner. ;) But a 33% margin of error with 3.6k playouts? :shock:

Let's back up to Black 313, an atari on 7 stones in the bottom right, including a ko stone. In a comment I argued that taking the ko was technically correct, but in fact Black 313 wins a won game. A strong amateur who was unfamiliar with area scoring might make a mistake and play on the left side where Black played move 315. Good for FineArt for clinching the win. But LZ's winrate estimate for Black is only 52½% with 330 playouts. OC, only 330 playouts do not inspire confidence, but what can go wrong? If White allows the capture of the stones in atari, Black surely wins, and if White saves the stones, the last play is obvious to most human SDKs, requiring only a reading depth of 3 ply to see. Yet LZ thinks it's a tossup? Yes, there are only 330 playouts, but the immediately previous play suggests that the region of Black 315 has been explored to at least that depth. I don't know enough about the workings of LZ to speculate why it has such a low winrate estimate. Human dan players surely do better at this point.

One last observation. White 314 did make the obviously correct play of saving the 7 stones in atari. After that play LZ estimated White's winrate at 41% with 647 playouts. OK, so there is still a large error in the winrate estimate itself. That's not so surprising, given the givens. But what does that say about White 314? It indicates that it is a 7% mistake! The overwhelmingly obviously correct play lost 7% in White's estimated winrate.

OK. Not enough playouts. If you have any experience with these review programs, you are aware of the problem. But how many playouts do you need? Especially since a play that the program "thinks" is an error is likely to have few playouts devoted to it. That is the result of the search strategy of the go playing program. Perhaps a review program should have a different search strategy; I dunno. I think so, but I can't really say. Anyway, humans who use go playing programs for reviews are feeling their way in the dark, or at best in dim light.

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

I think it's a great idea to talk during sex, as long as it's about snooker.

— Steve Davis

 Post subject: Re: This 'n' that
Post #653 Posted: Mon Aug 26, 2019 2:42 pm 
Honinbo

Posts: 8820
Liked others: 2602
Was liked: 3009
A personal note:

Today's mail contained a book, Sweet Taste of Liberty: A true story of slavery and restitution in America, by historian Caleb McDaniel, published by Oxford University Press. McDaniel wrote an article in the September issue of the Smithsonian magazine, p. 12. It is about Henrietta Wood, a slave who was freed in 1848 in Ohio but was kidnapped, along with her young son, in 1853 in Cincinnati, taken into Kentucky and sold back into slavery. After the Civil War she had to remain in the South for years before she could earn enough money to return to Ohio, where she sued one of her kidnappers and won her suit in 1878. She was my wife's great-great-grandmother. She received $2500, the largest reparations ever paid to a person for slavery. McDaniel met us a couple of years ago and Winona was able to give him copies of some family pictures. The book's dedication reads:

Quote:
In Memory of Winona Adkins (1944-2018)
Great-great-granddaughter of Henrietta Wood


You can PM me for a discount code for the book. :)

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

I think it's a great idea to talk during sex, as long as it's about snooker.

— Steve Davis

 Post subject: Re: This 'n' that
Post #654 Posted: Wed Aug 28, 2019 12:16 am 
Honinbo

Posts: 8820
Liked others: 2602
Was liked: 3009
Some guidance on reviewing games with bots

Thanks to yoyoma's comments ( https://lifein19x19.com/viewtopic.php?p=248070#p248070 ) and following, I have a better understanding of the workings of today's go playing bots, and can offer some suggestions. :)

As I have already mentioned, while winrate estimates are evaluations of positions (along with whose turn it is), go playing bots do not simply rely upon them to choose plays. Nor are bots trained to make accurate evaluations; they are trained to win games. For that they only require good enough evaluations. That is a problem for humans wishing to evaluate plays and positions for analysis or review: top level programs today, and for the foreseeable future, are not trained to make accurate evaluations. So, for now, we have to make use of top bots, whose evaluations are generally better than ours. ;) How to do that?

When considering a play, today's bots produce and reveal information about plays considered but not chosen, in the form of winrate estimates and playouts or visits. Generally, but not always, the bot chooses the play with the best winrate estimate. Playouts or visits are also a factor. Each play considered leads to a node in the search tree. The number of playouts or visits for a node in the search tree is correlated with its winrate, because during search the most promising nodes are most likely to be visited and expanded. This search strategy works well for winning games.

However, for analysis or review we want to compare plays and positions. A problem arises when we compare plays with different numbers of visits or playouts. A winrate estimate with few visits is likely to be more inaccurate than one with many visits. Suppose that we are comparing a play with 50k visits with one with only 500 visits by the program when building the search tree. The play with only 500 visits probably has a worse winrate estimate than the one with 50k visits, but its winrate estimate is not as accurate. If it were expanded so that its number of visits was also 50k, then its winrate accuracy would be more comparable to that of the other play, and we would have a fairer comparison. Now, for the player to do that when choosing plays would be a waste of its time, as the chance of finding a better play would be small. But as reviewers or analysts we do not face that problem. Our concern is not winning a game, but understanding specific plays and positions.

IIUC, the typical way that reviewers use today's bots to analyze games is to chart the changes in winrate estimates for each move. OC, the problem with simply doing that is the variability in the number of playouts or visits. This procedure treats winrate estimates with different visits the same, even though those with fewer visits are less reliable.

Is there a better way? Yes, and it takes only a little more work. It is possible, typically with great effort, to get the bot to make more visits to a node when searching the tree, which then allows better comparisons, but there is a simpler and better way. :) The basic idea is to evaluate a play, not by its winrate estimate, but by the winrate estimate of the bot's chosen reply to it. On average, the two winrate estimates should be the same, but, since the estimate of the reply is one ply deeper in the search tree, it should be more accurate, as a rule. And it is easy to get the go playing program to choose the reply to a play. Simply make the play. :)

Suppose that we wish to compare a play made in a game (play A) with another play made in the same position (play B). First, we make each play and note the winrate estimate of the program's reply. Then we may regard the play whose reply has the better winrate estimate for the player as the preferred play. We may also take the difference in winrate estimates into account when deciding whether one play is actually better than the other. The problem of the margin of error is not resolved, but the problem of inaccurate comparisons made with too few playouts is addressed by this procedure. :)
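
For anyone who wants to script the procedure, here is a bare-bones sketch. The engine wrapper is hypothetical (play() and best_reply() are names I made up, not any real bot's API); the point is only the shape of the comparison.

Code:
def compare_by_reply(engine, position, play_a, play_b):
    """Compare two candidate plays by the winrate of the bot's reply to each.

    'engine' is a hypothetical wrapper around a go playing program:
      engine.play(position, move)  -> the position after the move
      engine.best_reply(position)  -> (reply_move, black_winrate, visits)
    One ply deeper, with a full visit budget for each side of the comparison.
    """
    results = {}
    for label, play in (('A', play_a), ('B', play_b)):
        after = engine.play(position, play)
        reply, black_wr, visits = engine.best_reply(after)
        results[label] = {'play': play, 'reply': reply,
                          'black winrate': black_wr, 'visits': visits}
    return results

Whichever play's reply leaves the better winrate for the player to move is the preferred play, and the difference between the two replies' winrates estimates how much the other play loses.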

An example mañana. :)

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

I think it's a great idea to talk during sex, as long as it's about snooker.

— Steve Davis

Top
 Profile  
 
Offline
 Post subject: Re: This 'n' that
Post #655 Posted: Wed Aug 28, 2019 3:03 am 
Lives with ko

Posts: 247
Liked others: 0
Was liked: 31
Rank: 2d
IMO the problem during reviews is that there is no theoretical guarantee of the accuracy of the winrate of any move other than the most explored one (or not even that). It is no coincidence the original search algorithm chooses the most visited move, ignoring all winrates.

Bill Spight wrote:
A problem arises when we compare plays with different numbers of visits or playouts. A winrate estimate with few visits is likely to be more inaccurate than one with many visits.
The practical reliability of winrates also depends on other things besides total visits. In quiet positions, where most lines evaluate to similar values, the average is more reliable than in dynamic/tactical positions. The latter need more visits just to search deep enough, reach positions that are easier to evaluate, and understand the tactical shape of the tree (eliminate some of the blind spots, and shift poor top moves out of the average). So comparing estimates with identical visits can be just as dangerous as comparing ones with different visits if the dynamics of the positions differ.

Quote:
The problem of the margin of error is not resolved
Winrate estimates are theoretically incorrect: for example, an estimate of 50% is surely off by nearly 50%, since the true value of a position is either 0% or 100%. Similarly, their correlation to actual bot-vs-bot playout games can be low, as that depends on the level of randomness that is set. If run without any randomness, all playout games can be identical (0% or 100%). And bots often use a different level of randomness for actual play than for selfplay (where those winrates originate).

But for practical purposes, if one accepts winrates as fuzzy "goodness" measures, it may be possible to look at the variance of the evaluations included in the average, to get some kind of confidence metric. There are some actual experiments in this direction. (This variation in reliability, i.e. the dependence on dynamics, is also a problem for bot training and selfplay, where uniform quality is preferred.)
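
Just to illustrate, such a confidence metric might be as simple as the visit-weighted spread of the child evaluations. A sketch only, not something current bots report:

Code:
import math

def node_confidence(values, visits):
    """Visit-weighted mean and standard deviation of the evaluations
    that make up a node's average winrate. A wide spread suggests a
    dynamic/tactical position whose average is less trustworthy."""
    total = sum(visits)
    mean = sum(v * n for v, n in zip(values, visits)) / total
    var = sum(n * (v - mean) ** 2 for v, n in zip(values, visits)) / total
    return mean, math.sqrt(var)

# Same visit counts, very different spread:
print(node_confidence([0.52, 0.50, 0.49], [800, 150, 50]))  # quiet position
print(node_confidence([0.80, 0.35, 0.20], [800, 150, 50]))  # tactical position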


This post by moha was liked by: Bill Spight
Top
 Profile  
 
Offline
 Post subject: Re: This 'n' that
Post #656 Posted: Wed Aug 28, 2019 9:03 am 
Honinbo

Posts: 8820
Liked others: 2602
Was liked: 3009
Example 1

[go]$$Wcm36 Suzuki (W) - Nozawa, Feb., 1930, game 9 of a 10 game match
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . X . . O . . . . . . . O O O . . |
$$ | . . . , X . . . . , . . O O X O X . . |
$$ | . . O . . X . . . . . . O X X X X . . |
$$ | . . . O . . . . . . . . . . X O . . . |
$$ | . . X O . 7 . . . . . . . . . . . . . |
$$ | . . . X O . . . . . . . . . . . . . . |
$$ | . . . X O . . . . . . . . . . . X . . |
$$ | . . . X 1 . . . . , . . . . . , . . . |
$$ | . a 4 3 2 6 . . . . . . . . . . . . . |
$$ | . . 8 5 . . . . . . . . . . . . . . . |
$$ | . . . 9 b . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . X . . . . . . . . . . . . . . . |
$$ | . . X , . . . . . , . . . . . , . . . |
$$ | . . . . O . . . . . . . . . . O . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]

Both players were 7 dans, back when there was only one 9 dan.

From the Elf commentaries here are the Black winrate estimates, rounded to the nearest ½%, along with the number of visits for each play.

:w36: 54% (36.9k)
:b37: 44½% (6)
:w38: 44½% (28.1k)
:b39: 43½% (58k)
:w40: 44% (87.9k)
:b41: 42% (54)
:w42: 43½% (1.3k)
:b43: 42% (1.3k)
:w44: 44% (16.9k)
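
If you want to chart those numbers yourself, the usual per-move loss calculation is trivial. A throwaway Python sketch, assuming each estimate is the Black winrate after the move shown and that Black played the odd-numbered moves:

Code:
# Elf's Black winrate estimates (%) after moves 36-44 of this game.
evals = [(36, 54.0), (37, 44.5), (38, 44.5), (39, 43.5), (40, 44.0),
         (41, 42.0), (42, 43.5), (43, 42.0), (44, 44.0)]

for (_, prev_wr), (move, wr) in zip(evals, evals[1:]):
    # Black loses when the Black winrate falls, White loses when it rises.
    loss = (prev_wr - wr) if move % 2 == 1 else (wr - prev_wr)
    print(move, 'B' if move % 2 == 1 else 'W', f'{loss:+.1f}%')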

In any histogram of Black winrates, :b37: stands out, with a winrate drop of 9½%. It certainly appears to be a serious error. Elf recommends a simple extension to 38. OC, Elf was trained on 7.5 pts. komi, but the solid extension looks good in this no komi game, as well.

According to my recommended procedure, we look at the winrate estimates for Elf's replies to :b37: and to Elf's recommended extension. Elf's reply to :b37: is also Suzuki's :w38:, with a winrate estimate of 44½%. Elf's reply to its extension is 42, with a winrate estimate of 54½% (53.2k). The playout numbers are not the same, but they are in the same general ballpark. The winrate difference is 10%, ½% greater than the 9½% difference in the histogram. It is normal for this procedure to produce a result only about ½% different from the histogram result. But that is not always the case. :)

Elf also recommends a different play for :b41:, despite a drop of only 2% in the histogram. It recommends the crawl at 43; White replies with the extension to 44, with a Black winrate of 44% (103.4k). For :w42: in the game (i.e., the reply to :b41: ) Elf recommends the turn at 43, with a Black winrate of 42% (15k). The winrate difference is the same. The playouts are not exactly in the same ballpark, but 15k is much better than 54. ;)

As mentioned, Elf recommends the turn at 43 for :w42:. Black replies with the descent to a, for a Black winrate of 42½% (14.9k). For :b43: (the reply to :w42:), Elf recommends the shoulder hit at b, with a Black winrate of 49% (30k). The winrate difference is 6½% instead of the histogram difference of only 1½%. We may conclude that :w42: is a mistake. Not a bad mistake, but still a mistake. If we looked only at the histogram we would miss it. The low playouts (1.3k) are a clue, but we still might miss it.

Edit:
We might also guess that :b43: is a mistake. Indeed it is, but only to the tune of 5%. :w44: in the game is also Elf's reply, with a winrate estimate of 44%. Elf's reply to Black b, its suggested play, is the turn at 43, with a winrate estimate of 49% (10.4k). Again, the histogram gives no clue.
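
For the record, the reply-based comparisons above boil down to the following arithmetic, using nothing but the numbers already quoted:

Code:
# (mover, Black winrate after the reply to the game move,
#         Black winrate after the reply to Elf's suggestion)
cases = {37: ('B', 44.5, 54.5),
         41: ('B', 42.0, 44.0),
         42: ('W', 49.0, 42.5),
         43: ('B', 44.0, 49.0)}

for move, (mover, after_game, after_elf) in cases.items():
    # A Black move is worse the lower the Black winrate it leads to;
    # for a White move the sign flips.
    loss = (after_elf - after_game) if mover == 'B' else (after_game - after_elf)
    print(f'move {move}: the game move loses about {loss:.1f}%')

# Output: 10.0% for 37, 2.0% for 41, 6.5% for 42, 5.0% for 43.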

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

I think it's a great idea to talk during sex, as long as it's about snooker.

— Steve Davis

Top
 Profile  
 
Offline
 Post subject: Re: This 'n' that
Post #657 Posted: Wed Aug 28, 2019 10:12 am 
Oza

Posts: 2352
Liked others: 15
Was liked: 3417
Bill

Stimulating thoughts. Thank you.

There are quite a few commentaries on this game, including Suzuki's own. The only English commentary, in Go Monthly, is rather anodyne. The rest (Japanese) stress more the contrary nature of Nozawa and indicate how he was playing not just the man but the system he so often railed against. The simplest summary of those views in respect of 37 (described as the "safe" option) can be summed up in one word: kiai. Are win rates relevant in those circumstances?

However, there is one commentary of the period that tells us Black was also concerned by the danger of an impending White moyo at the top and the hane 37 was a way of expressing that. In other words, Black was accepting an immediate local loss in return for what he saw as a very long-term strategic gain. So long term that it was perhaps beyond the measuring capabilities of win rates?

As to 42, it is not mentioned in every commentary, but those that do talk about it all give Black's reply to a putative White 43 as B10, not B9. One commentary specifically also states that White wants to avoid this as he loses the suji at B11. Elsewhere there is a comment that shows how White can also make a sacrificial cut at C12 which gives him various options in the lower left.

One thing I have noticed about AI play is that it cares little for options, either as physical points or in terms of timing (forcing moves are played at, to us, ridiculously early times). That seems to make sense for a machine, but is it sensible for us humans? Are options a sensible way for us to keep a grasp on complexity, and is keeping deferred options not a mark of great skill?

At any rate, even after accepting that AI bots would trounce both Suzuki and Nozawa, all of those points seem both to give us a greater understanding of the present game and to help us find a way to improve, i.e. to improve the thinking tools that are natural to us.

I'm not trying to say analysis by win rates is irrelevant, but I do think there are risks that its value may be overstated. I see it more like video reviews in sporting events - a useful but peripheral tool which gives us (not always) right answers but doesn't improve our tennis, soccer or umpiring skills one little bit.

Top
 Profile  
 
Offline
 Post subject: Re: This 'n' that
Post #658 Posted: Wed Aug 28, 2019 10:36 am 
Honinbo

Posts: 8820
Liked others: 2602
Was liked: 3009
Many thanks for your thoughts, John, and for the comments of yore. :)

As for winrates, they are currently the main evaluations that bots give us, and we don't quite know what to do with them. ;) I am feeling my way. :)

The winrate errors of 5% and 6½% are minor, and the 10% loss is of a size that White players would gladly take in no komi games to create chances. It is interesting that Nozawa took the chance of :b37: as Black with no komi instead of the cool-headed nobi. Kiai, indeed! :)

As for the bots playing their sente early, that is something I have puzzled over. Doing so may help them deal with the complexity of reading the whole board. Humans can compartmentalize. :) Or maybe it really is just better play. Consider Sakata.

Let me prepare a note with some of Elf's variations on this game. :)

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

I think it's a great idea to talk during sex, as long as it's about snooker.

— Steve Davis

Top
 Profile  
 
Offline
 Post subject: Re: This 'n' that
Post #659 Posted: Wed Aug 28, 2019 11:24 am 
Honinbo

Posts: 8820
Liked others: 2602
Was liked: 3009
Some cool variations by Elf. :)


_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

I think it's a great idea to talk during sex, as long as it's about snooker.

— Steve Davis

Top
 Profile  
 
Offline
 Post subject: Re: This 'n' that
Post #660 Posted: Wed Aug 28, 2019 2:17 pm 
Honinbo

Posts: 8820
Liked others: 2602
Was liked: 3009
Mistake? What mistake? Whose mistake?

I happened to look a little further into this game, and found a surprise inside an Elf variation. :o I'm feeling my way along, but I thought it might be interesting to take a closer look. :)

[go]$$Bcm41 What about Black 45?
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . X . . O . . . . . . . O O O . . |
$$ | . . . , X . . . a , . . O O X O X . . |
$$ | . . O . . X . . . . . . O X X X X . . |
$$ | . . . O . . . . . . . . . . X O . . . |
$$ | . . X O . O . . . . . . . . . . . . . |
$$ | . . . X O . . . . . . . . . . . . . . |
$$ | . . . X O . . . . . . . . . . . X . . |
$$ | . . . X O . . . . , . . . . . , . . . |
$$ | . . X O X X . . . . . . . . . . . . . |
$$ | . . X O . . . . . . . . . . . . . . . |
$$ | . . 5 O . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . X . . . . . . . . . . . . . . . |
$$ | . . X , . . . . . , . . . . . , . . . |
$$ | . . . . O . . . . . . . . . . O . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]


With a winrate estimate of 37% (1.3k), :b45: apparently loses 6½% by the histogram, making it a minor error. Elf recommends a White reply at a, with a Black winrate of 38½% (19k).

[go]$$Bcm41 Variation
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . c . X . . O . . . . . . . O O O . . |
$$ | . . . , X . . . a , . . O O X O X . . |
$$ | . . O . . X . b . . . . O X X X X . . |
$$ | . . . O . . . . . . . . . . X O . . . |
$$ | . . X O . O . . . . . . . . . . . . . |
$$ | . . . X O . . . . . . . . . . . . . . |
$$ | . . . X O . . . . . . . . . . . X . . |
$$ | . . . X O . . . . , . . . . . , . . . |
$$ | . . X O X X . . . . . . . . . . . . . |
$$ | . . X O . . . . . . . . . . . . . . . |
$$ | . . 6 O . . . . . . . . . . . . . . . |
$$ | . . . . 5 . . . . . . . . . . . . . . |
$$ | . . . X . . . . . . . . . . . . . . . |
$$ | . . X , . . . . . , . . . . . , . . . |
$$ | . . . . O . . . . . . . . . . O . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]


Elf recommends :b45: in this variation, 41% (33.2k). But then it replies with :w46:, 31½% (9.9k) :shock: That's a drop of 9½% in one ply, with both plays chosen by Elf.

Anyway, following the suggested procedure, :b45: in the game scores 38½% and Elf's choice scores 31½%, which makes Elf's choice a minor error by 7%.

So, what's the story? Is :w46: in this variation so good, or :b45: so bad? I dunno. But note that each of White a, b, and c is bad for White, according to Elf. Which raises the usual question when analyzing with bots: why are all these plays that look so good actually so bad? :lol:

Anyway, here is the rest of the main line of this variation.

[go]$$Wcm46 Main line
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . X . . O . . . . . . . O O O . . |
$$ | . . . , X . . . . , . . O O X O X . . |
$$ | . . O . . X . . . . . . O X X X X . . |
$$ | . . . O . . . . . . . . . . X O . . . |
$$ | . . X O . O . . . . . . . . . . . . . |
$$ | . 2 . X O . . . . . . . . . . . . . . |
$$ | . . . X O . . . . . . . . . . . X . . |
$$ | . . . X O . 0 . . , . . . . . , . . . |
$$ | . 4 X O X X . . . . . . . . . . . . . |
$$ | . 3 X O 6 5 9 . . . . . . . . . . . . |
$$ | . . 1 O 7 8 . . . . . . . . . . . . . |
$$ | . . . . B . . . . . . . . . . . . . . |
$$ | . . . X . . . . . . . . . . . . . . . |
$$ | . . X , . . . . . , . . . . . , . . . |
$$ | . . . . O . . . . . . . . . . O . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]


Well, by :b47: we are down to 3.5k visits, so we can't count on these moves to be accurate. And, as usual, Elf leaves us in medias res. But the Black winrate estimates are consistently around 30%. So maybe we are seeing a horizon effect, where White discovered a superior play in :w46:.

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

I think it's a great idea to talk during sex, as long as it's about snooker.

— Steve Davis

Top
 Profile  
 