 Post subject: Re: statistical analysis of player performance
Post #21 Posted: Fri Jul 13, 2018 2:14 am 
Dies in gote

Posts: 55
Liked others: 27
Was liked: 49
Bill Spight wrote:
I am not talking about proof of cheating, but about the comparison between Metta's play and that of his opponents (win-loss aside). I haven't seen any test of that comparison aside from the one I did on your early data, which showed no significant difference. Statistically, Carlo's play online was better than his play offline in the games analyzed. But I did not find that Carlo's online play was better than the online play of his opponents. If you can show that, please do. :)


Interesting; it had not occurred to me to compare Metta's histograms with those generated for his "combined opponents". However, I can see at a glance that the percentage of good moves played by his opponents in internet games does vary and differs from Metta's "more consistent" performance. Thus, I would guess that the chi2 test on the patterns would show a difference, maybe not as large as the one between Metta's regular and internet performances, but it should still be there. I will definitely look at it, though it may take me some time before I do the chi2 test on the new data. Right now I am more concerned about Leela 0.11's ability to provide reliable winrates in some types of positions (large-scale life-and-death problems, complicated fights, and I do not know what else). I assume it only generates some "noise" at the tail of the histograms, but it is a serious hindrance. I have also gotten rather busy with other things in real life ...
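For illustration, a minimal sketch of the chi2 comparison I have in mind, assuming the moves have already been binned by how far their winrate falls short of the bot's top suggestion (the bin counts below are made up, just to show the mechanics):

Code:
# Chi-square comparison of two binned move-quality histograms.
# The bin counts are hypothetical placeholders, not real data.
import numpy as np
from scipy.stats import chi2_contingency

# Moves binned by winrate drop vs. the bot's top suggestion:
# [<1%, 1-2%, 2-4%, 4-8%, >8%]
metta_online = np.array([120, 30, 18, 8, 4])       # hypothetical counts
opponents_online = np.array([90, 40, 28, 15, 10])  # hypothetical counts

table = np.vstack([metta_online, opponents_online])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")

With the real histograms in place of the made-up counts, the p-value would tell us whether the two sets of moves plausibly come from the same distribution.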

 Post subject: Re: statistical analysis of player performance
Post #22 Posted: Fri Jul 13, 2018 3:59 am 
Dies in gote

Posts: 28
Liked others: 3
Was liked: 6
Rank: 5 kyu
AlesCieply wrote:

It does not tell us whether he used Leela, got someone else playing instead of him or whatever. It just says that it is very unlikely that Metta's regular games and internet games "were played by the same person".



This is where you are wrong. It does not say that "it is very unlikely that Metta's regular games and internet games were played by the same person"; it says that Metta played differently in "those" regular games compared to "those" online games. The reasons for these differences can be many; among hundreds of possible reasons you choose the cheating argument, and this choice makes you a biased judge.

 Post subject: Re: statistical analysis of player performance
Post #23 Posted: Fri Jul 13, 2018 4:50 am 
Dies in gote

Posts: 55
Liked others: 27
Was liked: 49
bugsti wrote:
AlesCieply wrote:

It does not tell us whether he used Leela, got someone else playing instead of him or whatever. It just says that it is very unlikely that Metta's regular games and internet games "were played by the same person".


... but it says that Metta played differently in "those" regular games compared to "those" online games.

Actually, that is exactly what I meant, and it is why I put "were played by the same person" in quotation marks. OK, thanks for making my statement more precise. :)

bugsti wrote:
The reasons for these differences can be many; among hundreds of possible reasons you choose the cheating argument, and this choice makes you a biased judge.

No problem with that; I do not consider myself unbiased on the matter any more. I would only add that at the time I started the analysis I did not know what would come out of it, and I clearly see its flaws. And as I have said several times already, for me the turning point was finding out about the altered kifu. I really do not believe the explanation provided by the Italians. The analysis is just another piece of evidence, nothing more; it will never be 100%.

 Post subject: Re: statistical analysis of player performance
Post #24 Posted: Fri Jul 13, 2018 6:05 am 
Judan

Posts: 7269
Liked others: 1860
Was liked: 2634
AlesCieply wrote:
Jan.van.Rongen wrote:
IMO the whole analysis is deeply flawed because its very basis, the "Leela top 3 suggestions", is undefined, ...

Jan, have you noticed that my analysis does not use the "top 3 suggestions"? Maybe you are speaking of someone else's analysis. :) What my analysis does use are the winrates, the bot's estimates of the probability of winning the game if the top suggested move is played. There, I see that all three of your runs provide values in reasonable agreement with what I have. The agreement might have been even better if you had run the bot with a higher number of playouts.


Please note that these three runs provide the best estimates for the winrate of :w50: (for Black). Running Leela 11 for 60 sec. provides an average of 167K playouts for :w50:, 123K for :b51:. Running Leela for 72 sec. would get an average of around 200K playouts. The results of the 3 runs range between 53.75% and 55.23%. That's a larger error range than we might wish for analysis, while still good enough for a reasonably strong bot.

One thing that van Rongen's runs illustrate is the unreliability of the third choice. Even with a much longer run than the average run a cheater would have used, the third choice gets fewer than 10K rollouts, which is, IMO, the minimum desired for precision. Metta played more quickly than necessary, averaging 9.7 sec./move. At 15 sec./move even the second choice would exceed 10K rollouts only occasionally. So at realistic run times for a cheater, only the top choice has any reliability, IMO.
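For concreteness, the rule of thumb I am using, a precision of roughly ± 1/sqrt(playouts), in a few lines of Python (only a rule of thumb: the rollouts are not independent, so these figures are optimistic):

Code:
# Rule-of-thumb precision as +/- 1/sqrt(playouts), in percentage points.
# Caveat: rollouts are not independent, so these figures are optimistic.
import math

def precision_pct(playouts):
    return 100.0 / math.sqrt(playouts)

for n in (167_000, 123_000, 10_000, 1_000):
    print(f"{n:>7,} playouts -> +/- {precision_pct(n):.1f}%")
# 167,000 -> 0.2%; 123,000 -> 0.3%; 10,000 -> 1.0%; 1,000 -> 3.2%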

AlesCieply wrote:
Sure, I would very much like to check it with alternative bots (AQ, Leela Zero). The problem is that I do not have access to a GPU-powered computer, so the numbers of playouts I get with them are small (below 10k at 3 min per move). Anyone willing to help? All I need are the rsgf files generated by Go Review Partner. Just let it analyse the games with a setting to spend (let's say) 3 min on a move (or to reach at least 100k playouts); moves 30-181 suffice for me.


Using Go Review Partner, I don't think you want just the single run. The single run provides its best estimates for the winrates of the human's plays. Then you want (i.e., I want) to go back and add variations wherever Metta (the person of interest) made a different play from the bot's top choice: play the bot's top choice in the new branch and run the bot for the same length of time used for the plays in the actual game (or the same number of rollouts). The difference between the two winrate estimates will give you the bot's best error estimate. :) You could also add a branch for every play; where the human played the bot's top choice, you get your precision estimate. :)
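The branch-adding step might be automated along the lines of the sketch below, using the sgfmill library. It is only a sketch: the top_choices input is hypothetical (it would come from parsing the first run's output, and GRP's rsgf format would need its own parser):

Code:
# Sketch: insert the bot's top choice as a sibling variation wherever
# the human's move differs from it, so a second analysis run can
# evaluate both moves to comparable depth.
from sgfmill import sgf

def add_bot_branches(sgf_bytes, top_choices):
    """top_choices maps move number (0-based from the root) to a
    (colour, (row, col)) pair -- a hypothetical input assumed here."""
    game = sgf.Sgf_game.from_bytes(sgf_bytes)
    for i, node in enumerate(game.get_main_sequence()):
        colour, move = node.get_move()
        suggestion = top_choices.get(i)
        if move is None or suggestion is None or suggestion[1] == move:
            continue  # a pass, no analysis data, or human matched the bot
        branch = node.parent.new_child()   # sibling of the played move
        branch.set_move(*suggestion)
    return game.serialise()                # feed this back to the bot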

And why not get estimates for the whole game? You can always ignore moves before 30 and after 181 if the comparisons are not interesting.

Edit: Corrected Leela to the bot.

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

 Post subject: Re: statistical analysis of player performance
Post #25 Posted: Sat Jul 14, 2018 1:55 am 
Dies in gote

Posts: 55
Liked others: 27
Was liked: 49
Bill Spight wrote:
Please note that these three runs provide the best estimates for the winrate of :w50: (for Black).

Sure, looking for the best move for :b51: provides the winrate for :w50:. That's how it is done (see column "winrate") in my spreadsheets.

Bill Spight wrote:
Using Go Review Partner, I don't think you want just the single run. The single run provides its best estimates for the winrates of the human's plays. Then you want (i.e., I want) to go back and add variations wherever Metta (the person of interest) made a different play from the bot's top choice: play the bot's top choice in the new branch and run the bot for the same length of time used for the plays in the actual game (or the same number of rollouts). The difference between the two winrate estimates will give you the bot's best error estimate. :)

In theory, I completely agree with you. In practice, the matter is more complex. The problem is that I could not find a way to automate this. I have discussed it with the author of GRP, and we were not even able to find a way to fix the number of playouts in the automated GRP runs. There are two choices: either set GRP to reach at least a minimal number of playouts at each position/move (I do this, with the minimum set to 200k), or set a fixed time for the bot to spend on each move. For the latter, there is no guarantee a move will get the required number of playouts if the computer gets busy with some other unpredictable task. When setting the minimal number of playouts, the bot goes to the next move as soon as the minimum is reached, but these 200k playouts are added on top of what the bot has already spent on the variation while analyzing the previous moves. In the end, when a long (forced) sequence of top move suggestions is played, the top move suggestion gets analyzed with an incredibly high number of playouts (300k-400k is normal; I have even seen 1000k). Thus, very often the winrate estimate for an unplayed top move suggestion gets a higher number of playouts than the move actually played, contrary to what you would expect. :) My assumption (checked several times) is that the winrate has about settled by 200k playouts, so it does not really matter much that some moves are analyzed with a much higher number of playouts.

To sum up, in an ideal world we would have the winrates for each played move and for the top move suggested by the bot (when they differ) estimated with the same number of playouts. In reality, I have not found a way to automate this with the available tools.
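To illustrate the accumulation, a toy model under my own simplifying assumptions: each position gets the 200k minimum of fresh playouts, the top suggestion receives a fixed share of them, and its subtree carries over whenever it is actually played.

Code:
# Toy model (my simplification) of visit-count inflation under tree
# reuse along a forced sequence of top suggestions.
def forced_sequence_visits(n_moves, minimum=200_000, share=0.75):
    inherited = 0
    for move in range(1, n_moves + 1):
        # fresh playouts are added on top of the inherited subtree
        top_visits = inherited + int(share * minimum)
        print(f"move {move}: top suggestion has ~{top_visits:,} playouts")
        inherited = top_visits  # reused as part of the next search tree

forced_sequence_visits(5)

The printed counts grow from 150k to 750k over five forced moves, in line with the 300k-1000k counts I have observed in practice.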

Bill Spight wrote:
And why not get estimates for the whole game? You can always ignore moves before 30 and after 181 if the comparisons are not interesting.

OK, next week I will run the analysis of the complete Metta-BenDavid game "for you". :) I can do more later on, but I leave for holidays in the middle of next week.

 Post subject: Re: statistical analysis of player performance
Post #26 Posted: Sat Jul 14, 2018 9:00 am 
Judan

Posts: 7269
Liked others: 1860
Was liked: 2634
AlesCieply wrote:
Bill Spight wrote:
Please note that these three runs provide the best estimates for the winrate of :w50: (for Black).

Sure, looking for the best move for :b51: provides the winrate for :w50:. That's how it is done (see column "winrate") in my spreadsheets.

Bill Spight wrote:
Using Go Review Partner, I don't think you want just the single run. The single run provides its best estimates for the winrates of the human's plays. Then you want (i.e., I want) to go back and add variations wherever Metta (the person of interest) made a different play from the bot's top choice: play the bot's top choice in the new branch and run the bot for the same length of time used for the plays in the actual game (or the same number of rollouts). The difference between the two winrate estimates will give you the bot's best error estimate. :)

In theory, I completely agree with you. In practice, the matter is more complex. The problem is that I could not find a way to automate this.


Thanks. :sad:

Quote:
Bill Spight wrote:
And why not get estimates for the whole game? You can always ignore moves before 30 and after 181 if the comparisons are not interesting.

OK, next week I will run the analysis of the complete Metta-BenDavid game "for you". :) I can do more later on, but I leave for holidays in the middle of next week.


Thanks. :) Metta took 47 sec. on :b17: and 73 sec. on :b27:, so we can consider both of those subjectively difficult for him. IMO, :b11: and :b29: are questionable. It will be interesting to see the results. :)

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

 Post subject: Re: statistical analysis of player performance
Post #27 Posted: Sat Jul 14, 2018 1:26 pm 
Beginner

Posts: 15
Liked others: 1
Was liked: 4
Rank: NL 2 dan
KGS: MrOoijer
Bill Spight wrote:
Please note that these three runs provide the best estimates for the winrate of :w50: (for Black). Running Leela 11 for 60 sec. provides an average of 167K playouts for :w50:, 123K for :b51:.


I am not sure I understand what you wrote. IMO it runs 160K simulations on the board position after :w50:, and the 122K is the number of simulations for the top candidate.

The N-percentage is the likelihood that Leela 0.11 thinks a certain move has of being played. These percentages are fixed for the position. For L17, Leela has that as the most likely move (24.1%), even though it is not the best move. L17 starts out as the more likely candidate, and as you can see it can remain the favourite by chance, owing to the nature of the random processes in the tree search and the value network.

That also means that the "precision" you defined elsewhere as being proportional to 1/sqrt(N), where N is the number of simulations for that move, is incorrect. These simulations are not at all independent.

This also shows that on a smaller machine without a GPU, and with more limited run time, you are more likely to find the most plausible (human) move rather than the best move.

Assessing this situation after move 50 with Leela Zero network #155 (180 seconds) gives a remarkable result.

Quote:
M17 -> 36697 (V: 38.45%) (N: 22.47%)
L17 -> 592 (V: 34.83%) (N: 13.83%)
K16 -> 580 (V: 36.40%) (N: 7.79%)
B13 -> 472 (V: 34.26%) (N: 12.73%)
J14 -> 448 (V: 34.04%) (N: 12.64%)


M17 -> 36773 (V: 39.23%) (N: 7.07%)
J14 -> 1382 (V: 34.37%) (N: 42.50%)
K16 -> 348 (V: 36.12%) (N: 6.87%)
B13 -> 223 (V: 33.85%) (N: 7.59%)
E13 -> 152 (V: 33.49%) (N: 5.53%)

M17 -> 39535 (V: 38.66%) (N: 8.24%)
J14 -> 1635 (V: 34.52%) (N: 41.35%)
B13 -> 371 (V: 34.40%) (N: 9.66%)
E13 -> 258 (V: 33.41%) (N: 8.28%)
H9 -> 202 (V: 33.60%) (N: 6.25%)


The most remarkable point is the almost 16% lower evaluation for Black in this position from LZ #155. But it also shows that the evaluation cannot be used as an absolute measure of the value of a move. It is always relative to the capabilities of the bot.
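For anyone who wants to tabulate such runs, a small sketch that parses lines in the format above (visits, V: winrate, N: prior):

Code:
# Parse Leela Zero analysis lines like "M17 -> 36697 (V: 38.45%) (N: 22.47%)".
import re

LINE = re.compile(r"(?P<move>[A-HJ-T]\d{1,2})\s*->\s*(?P<visits>\d+)\s*"
                  r"\(V:\s*(?P<winrate>[\d.]+)%\)\s*\(N:\s*(?P<prior>[\d.]+)%\)")

def parse_run(text):
    return [(m["move"], int(m["visits"]), float(m["winrate"]), float(m["prior"]))
            for m in LINE.finditer(text)]

run = """M17 -> 36697 (V: 38.45%) (N: 22.47%)
L17 -> 592 (V: 34.83%) (N: 13.83%)"""
for move, visits, winrate, prior in parse_run(run):
    print(move, visits, winrate, prior)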

 Post subject: Re: statistical analysis of player performance
Post #28 Posted: Sat Jul 14, 2018 3:37 pm 
Judan

Posts: 7269
Liked others: 1860
Was liked: 2634
Jan.van.Rongen wrote:
Bill Spight wrote:
Please note that these three runs provide the best estimates for the winrate of :w50: (for Black). Running Leela 11 for 60 sec. provides an average of 167K playouts for :w50:, 123K for :b51:.


I am not sure I understand what you wrote. IMO it runs 160K simulations on the board position after :w50:, and the 122K is the number of simulations for the top candidate.


I averaged the three runs.

Quote:
The N-percentage is the likelihood that Leela 0.11 thinks a certain move has of being played.


It pretends to be, anyway. The winrates are dependent, and we are not sure upon what.

Quote:
These percentages are fixed for the position. For L17, Leela has that as the most likely move (24.1%), even though it is not the best move. L17 starts out as the more likely candidate, and as you can see it can remain the favourite by chance, owing to the nature of the random processes in the tree search and the value network.

That also means that the "precision" you defined elsewhere as being proportional to 1/sqrt(N), where N is the number of simulations for that move, is incorrect. These simulations are not at all independent.


Sorry for not being clear. In general, we can regard errors as random, even if the processes that produce them are not. As you point out below, Leela 11 could be estimating an incorrect winrate. If so, the more rollouts it uses, the closer it should get to that estimate. (That is not always the case, as I have pointed out, for instance when it is hill climbing close to the top of the hill.) As it gets closer to its estimate, its precision generally increases. But the estimate itself may be completely off. (moha, I believe, thinks that Leela's limitations may prevent it from ever finding the right move, even with infinite time. I don't pretend to know.) Anyway, that is why I took care to distinguish between precision and accuracy. Greater precision does not mean greater accuracy. And in any case I did not define precision as being proportional to 1/sqrt(N); I offered that as a rule of thumb. I also indicated that the accuracy of an estimate is no better than its precision. So, for instance, even if we have a precision of ± 1%, we should not regard the accuracy as being that tight.
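van Rongen's point about dependence is easy to illustrate with a toy simulation (of generic correlated draws, not of MCTS itself): the true standard error of the mean of dependent samples can be several times the naive 1/sqrt(N) figure.

Code:
# Toy demonstration that correlated samples inflate the true standard
# error of a mean well beyond the naive 1/sqrt(N) estimate.
import numpy as np

rng = np.random.default_rng(0)
N, trials, rho = 1_000, 500, 0.9
means = []
for _ in range(trials):
    x = np.empty(N)
    x[0] = rng.normal()
    for i in range(1, N):   # AR(1): each draw leans on the previous one
        x[i] = rho * x[i - 1] + np.sqrt(1 - rho**2) * rng.normal()
    means.append(x.mean())

print(f"naive 1/sqrt(N) standard error: {1 / np.sqrt(N):.4f}")
print(f"observed standard error:        {np.std(means):.4f}")
# With rho = 0.9 the observed figure is roughly four times the naive one.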

Quote:
Assessing this situation after move 50 with Leela Zero network #155 (180 seconds) gives a remarkable result.

Quote:
M17 -> 36697 (V: 38.45%) (N: 22.47%)
L17 -> 592 (V: 34.83%) (N: 13.83%)
K16 -> 580 (V: 36.40%) (N: 7.79%)
B13 -> 472 (V: 34.26%) (N: 12.73%)
J14 -> 448 (V: 34.04%) (N: 12.64%)


M17 -> 36773 (V: 39.23%) (N: 7.07%)
J14 -> 1382 (V: 34.37%) (N: 42.50%)
K16 -> 348 (V: 36.12%) (N: 6.87%)
B13 -> 223 (V: 33.85%) (N: 7.59%)
E13 -> 152 (V: 33.49%) (N: 5.53%)

M17 -> 39535 (V: 38.66%) (N: 8.24%)
J14 -> 1635 (V: 34.52%) (N: 41.35%)
B13 -> 371 (V: 34.40%) (N: 9.66%)
E13 -> 258 (V: 33.41%) (N: 8.28%)
H9 -> 202 (V: 33.60%) (N: 6.25%)


The most remarkable point is the almost 16% lower evaluation for Black in this position from LZ #155. But it also shows that the evaluation cannot be used as an absolute measure of the value of a move. It is always relative to the capabilities of the bot.


Leela Zero, being a much better player than Leela 11, should yield more accurate choices of plays, as well as more accurate estimates of winrates. As Uberdude has pointed out, the meaning of Leela Zero's winrates is almost surely different from the meaning of Leela 11's winrates. So it is not that they are estimating the same thing and Leela Zero is estimating it better. It looks to me as though Leela Zero's evaluation of the position after :w50: is less precise than Leela 11's, but more accurate. :)

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

 Post subject: Re: statistical analysis of player performance
Post #29 Posted: Sun Jul 15, 2018 5:29 am 
Judan

Posts: 7269
Liked others: 1860
Was liked: 2634
Chess GM Jon Ludvig Hammer begins to use Leela Chess Zero here: https://www.youtube.com/watch?v=TxiNUPK ... gs=pl%2Cwn

I find this video painful to watch because Hammer is struggling with the software. People are helping him, but he is plainly frustrated.

One thing he complains about is that he is unable (at least at the moment) to teach Leela Chess Zero about the game he is analyzing by entering lines of play, something he does with chess engines. In particular, there is a move that Leela does not find, but when he plays it, Leela realizes that it is a good play. Yet when he backs the game up, Leela Chess Zero does not change its evaluation of the previous position. I suppose that this is a feature of Leela Chess Zero, and I am not going to complain about it myself.

However, in the midst of his explorations of the software, he makes an observation that resonates with me. He does not care whether the software plays better than other software (I do, though); he wants to use it for analysis and review. If you are trying to understand a particular game or variation, software that does not learn along with you is of limited value for that purpose.

As far as go bots are concerned, I think they still have a lot of room for improvement, and getting them to play as well as they can is an important goal. At some point we are likely to reach diminishing returns, but we do not seem to be near that point yet. Let us forge ahead. :)

But people are starting to use bots for review and analysis, tasks for which they were not designed. One feature Hammer wants is for the evaluation of plays or variations that the human enters to propagate up the game tree. We know that life and death is a relative weakness of current go bots. If a human, even an SDK, shows the program that at a certain node of the game tree there is a play the program missed that kills or saves a group, then that fact ought to affect the program's earlier decisions. If the new evaluation does not propagate up the tree, it will not.
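In code, the propagation Hammer asks for might look like the following sketch. The tree representation is my own toy, not any engine's actual data structure:

Code:
# Sketch of upward propagation of a corrected evaluation.
class Node:
    def __init__(self, winrate, parent=None):
        self.winrate = winrate          # winrate for the side to move
        self.parent = parent
        self.children = []

def backup(node):
    """Re-derive each ancestor's value after a child's value changes."""
    while node.parent is not None:
        node = node.parent
        # The side to move picks its best child; the perspective flips
        # between plies, negamax style.
        node.winrate = max(100.0 - c.winrate for c in node.children)

# A human demonstrates that a "dead" group actually lives:
root = Node(55.0)
fight = Node(48.0, parent=root);   root.children.append(fight)
tesuji = Node(20.0, parent=fight); fight.children.append(tesuji)
tesuji.winrate = 85.0    # the corrected evaluation of the human's line
backup(tesuji)
print(root.winrate)      # 85.0 -- the correction reaches the root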

Currently bots are used in reviews to compare different plays, to show people where they made a mistake. The bots use winrate percentages to evaluate positions and plays. How much worse, in percentage terms, does a human's play have to be, by comparison with the bot's top pick, for it to count as a mistake? (OC, we cannot be sure that the bot's top pick is best, but that's another story, for now.) For some people, it seems that a difference of less than 1% is enough; for others it takes a difference of 1%, for others 2%, for some 4%. But we are all guessing. :(

What we would like to know is the error rates and ranges of the evaluations. Bots are trained on millions of self-play games. Those games should provide enough data to generate error terms for the winrates. But the error terms are not generated, because accurate evaluation is not the goal of the programs; winning games is. And simply picking the play with the best evaluation is not how modern bots work. They are more complicated than that. Changes that you might think would help a bot play better may actually make it play worse. But like Hammer, when I am analyzing a game or position, I am not concerned with how well the software plays in general; I am concerned with evaluating a specific game, position, or play.

Recently I saw a position at the end of a game where Zen7 evaluated a pass by Black as giving White a 61% chance of winning. (Edit: See viewtopic.php?p=233790#p233790 ) It recommended a play that would give White a 1% chance. It seemed obvious to me that the pass was correct, indeed the only winning choice, since it was a 0.5 pt. win. It was easy to show that Black could defend against White's threat, at least for an amateur dan player, and probably for many SDKs as well. At the 10 kyu level, Zen's evaluation might be right. But in analyzing a game I do not need 10 kyu help, thank you very much. :roll: In a position where play was essentially over, a top bot's evaluation was off by 61%. :shock: Obviously, Zen7 does not go around giving 10 kyu advice. But in this specific case it was horribly wrong. And in doing a review or analysis, it is specific cases we are interested in. Evaluations made by a program whose goal is evaluation may still be wrong, but such a program can tell us how good it expects its evaluations to be.

Jan.van.Rongen wrote:
That also means that the "precision" you defined elsewhere as being proportional to 1/sqrt(N), where N is the number of simulations for that move, is incorrect. These simulations are not at all independent.


Lacking error estimates, I can at least compare the precision of evaluations in terms of 1/sqrt(playouts), for want of anything better. If anyone would like to provide good error estimates for winrates, that would be great! Meanwhile, I'll make do.

Jan.van.Rongen wrote:
Assessing this situation after move 50 with Leela Zero network #155 (180 seconds) gives a remarkable result.

Quote:
M17 -> 36697 (V: 38.45%) (N: 22.47%)
L17 -> 592 (V: 34.83%) (N: 13.83%)
K16 -> 580 (V: 36.40%) (N: 7.79%)
B13 -> 472 (V: 34.26%) (N: 12.73%)
J14 -> 448 (V: 34.04%) (N: 12.64%)


M17 -> 36773 (V: 39.23%) (N: 7.07%)
J14 -> 1382 (V: 34.37%) (N: 42.50%)
K16 -> 348 (V: 36.12%) (N: 6.87%)
B13 -> 223 (V: 33.85%) (N: 7.59%)
E13 -> 152 (V: 33.49%) (N: 5.53%)

M17 -> 39535 (V: 38.66%) (N: 8.24%)
J14 -> 1635 (V: 34.52%) (N: 41.35%)
B13 -> 371 (V: 34.40%) (N: 9.66%)
E13 -> 258 (V: 33.41%) (N: 8.28%)
H9 -> 202 (V: 33.60%) (N: 6.25%)


OK: Leela Zero, after thinking for 3 minutes, a long time for it, figures the best play to be M-17. It's a better player than I am, so I'll say OK. With 36697 playouts for M-17 in the first run, I'll guess the precision of its evaluation as ± 0.5%, yielding 38.4% ± 0.5%. A likely win for White.

But what about L-17? With only 592 playouts, I'll guess its precision as ± 4.1%. And its evaluation is only 3.6% worse than that of M-17, so there is a chance that L-17 is a better play than M-17. (L-17 does not even show up in the top 5 choices of the other runs, however, so it is out of the running.)

Now, Leela Zero's evaluations are quite good enough for it to play well in general. M-17 is a good move. But whether L-17 or K-16 is better than M-17 in this specific position (with this komi) is a different question. At the very least we would like to evaluate the other options to the same degree of precision as the top choice, and to have an error term indicating the degree of confidence in the comparison. And, given that people are using software for review and analysis and are asking whether particular plays are mistakes, some software should be designed for that purpose.
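To make that comparison explicit (the ± figures are the same 1/sqrt(playouts) rule of thumb, hence optimistic at best):

Code:
# The M-17 vs. L-17 comparison above, with rule-of-thumb error bars.
import math

def band(winrate, playouts):
    """Winrate with a +/- 100/sqrt(playouts) band, in percentage points."""
    eps = 100.0 / math.sqrt(playouts)
    return winrate - eps, winrate + eps

m17 = band(38.45, 36697)    # roughly 37.9% .. 39.0%
l17 = band(34.83, 592)      # roughly 30.7% .. 38.9%
print(f"M-17: {m17[0]:.1f}%..{m17[1]:.1f}%")
print(f"L-17: {l17[0]:.1f}%..{l17[1]:.1f}%")
print("bands overlap:", l17[1] >= m17[0])   # True: L-17 not ruled out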

Edit: I do not mean to disparage Lizzie or Go Review Partner or any other analysis or review program, but it would be better if they were able to use software developed for evaluation, not play. :)

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins
