statistical analysis of player performance

AlesCieply · #1

This topic should serve as an easy-to-link reference to my analysis of player performance as measured by the available bots, in the current version mostly Leela 0.11. I started working on it in relation to the PGETC case in which an Italian player, Carlo Metta, was accused of using Leela in his internet games. After an original analysis based on matching the played moves to Leela's top 3 suggestions proved inconclusive, I decided to try a more detailed analysis, the idea being to compare the accused player's performance (mistake histograms) in his internet games with his performance at regular (real life) tournaments. The analysis is inspired by Ken Regan's work on measuring in-game player ratings and catching those who cheat with AI in chess, see e.g. a review article. The idea is to look at the frequencies/probabilities with which a player makes mistakes of a given size, i.e. plays moves that lower the probability of winning the game by 1-2%, 3-4% etc. These should form a histogram (or pattern) reflecting the player's performance. If a player makes significantly fewer mistakes in his internet games than in his regular tournament games, it could be an indication that the player used outside help.

The analysis is presented in the form of spreadsheet files, with each sheet containing the analysis of one game. For each move the bot is used to estimate the probability of winning the game (winrate) before and after the move is played. The difference delta (set to 0 by definition if the bot's top choice is played) gives the size of the mistake the player makes. For each game separately, the results of the analysis can be seen in the histogram tables at the top right of the sheet assigned to that game. The tables show (separately for the black and white player) how many moves were played with delta falling into a specific interval. The percentages of good moves (the played move had a winrate within 1% of the top move suggested by the AI, or even bettered the top move the AI found) and bad moves (causing a drop of the winrate by at least 10%) are also shown there.
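The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet's actual formulas: the bin edges, the function names and the example deltas are assumptions, and "within 1%" is read here as a delta strictly below 1 percentage point. Both winrates are assumed to be given from the point of view of the player who made the move.

```python
from collections import Counter

# Upper edges of the delta bins, in percentage points of winrate lost.
# The bin layout is an assumption; the spreadsheets may use other intervals.
BIN_EDGES = [1, 2, 4, 6, 10]


def move_delta(winrate_before, winrate_after, played_top_choice):
    """Winrate drop caused by a move (0 by definition for the bot's top choice)."""
    if played_top_choice:
        return 0.0
    return winrate_before - winrate_after


def bin_label(delta):
    """Name of the interval a delta falls into, e.g. '<1', '2-4' or '>=10'."""
    lower = None
    for edge in BIN_EDGES:
        if delta < edge:
            return f"<{edge}" if lower is None else f"{lower}-{edge}"
        lower = edge
    return f">={BIN_EDGES[-1]}"


def delta_histogram(deltas):
    """Count how many moves fall into each delta interval."""
    return dict(Counter(bin_label(d) for d in deltas))


def summarize(deltas):
    """Percentages of good moves (delta < 1) and bad moves (delta >= 10)."""
    n = len(deltas)
    good = sum(1 for d in deltas if d < 1)
    bad = sum(1 for d in deltas if d >= 10)
    return {"good_pct": 100.0 * good / n, "bad_pct": 100.0 * bad / n}


# Made-up deltas for one player, for illustration only.
example = [0.0, 0.4, -0.8, 1.5, 3.2, 12.0, 0.0, 5.5]
print(delta_histogram(example))
print(summarize(example))
```

A negative delta (the played move scoring better than the bot's top choice) is counted as a good move here, which matches the definition of good moves given above.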

The original analysis included 4 internet games by Carlo Metta and 4 of his games played at regular tournaments.

The current analysis of his internet games is far more extensive. It includes four PGETC games and two games from the Italian Championship Online, all played by Carlo Metta before he was accused of cheating. For comparison, three more PGETC games that Carlo played after the accusation are included. Finally, the analysis of the Bryant-Metta game played in the PGETC qualification match is added as well. The analysis of Carlo's regular games was also updated to include four games played at the WAGC and two games played in the Italian Championship Final.

Some notes on the internet games played by Carlo before the accusation:
- The data shown in the current analysis are from new runs, so the results differ slightly from those in the earlier analysis (e.g. Carlo had 68% of good moves in the old analysis of the Kulkov-Metta game; it has dropped to 64% now). The new runs are more consistent as they come from "automated runs" of Go Review Partner, while a good fraction of the original analysis included hand-transcribed winrates. The differences between the original and new delta histograms are relatively small and demonstrate variations due to independent runs of Leela.
- Carlo makes almost no big mistakes (marked in red) in his internet games, in contrast to his regular games. A player can make only 1-2 (or even 0) big mistakes in a game, but not so consistently, as the results of the other analyzed games show (including the analysis of games played by someone else).
- The percentage of good moves in Carlo's internet games is rather consistent, unlike in his regular games. The percentage of good moves drops sharply to 50% in a game against Csaba Mero played after the accusation.
- The Bryant-Metta game from the UK-IT qualification match is also interesting. It is the only game analyzed so far in which Leela 0.11 has trouble "understanding the game" and providing stable winrates. It was suggested that Carlo used Leela Zero here, though there is no real proof of it.

Two of these games were also analyzed with the AQ bot. Unfortunately, the winrates estimated by the bot are not as stable as those provided by Leela.

I intend to edit this and the following message from time to time to provide more information and updates on the analysis. Expect more later.

AlesCieply · #2

Links to the analysis files:
Carlo Metta (4d) internet games - google sheets, rsgf files
Carlo Metta (4d) regular games - google sheets, rsgf files
Ondrej Kruml (5d) regular games - google sheets, rsgf files

Results of the Pearson's chi2 test of independence:
C.Metta on internet (6 games played before the accusation) vs C.Metta in regular tournaments (6 games) - p=3*10^(-8)
C.Metta in regular games (6 games) vs O.Kruml (10 games) - p=0.44
C.Metta on internet (6 games played before the accusation) vs O.Kruml (10 games) - p=5*10^(-7)

The p-value is the probability of seeing a difference between the two mistake distributions at least as large as the observed one if both sets of games were played by "the same person". Here "the same person" means a player with the same underlying distribution of mistakes. A very small p-value therefore means that the two sets are unlikely to come from such a player, while two players of about equal strength and with a similar style of play are not expected to produce a small p-value. Details are provided in the attached file, which also shows the comparison of the mistake distributions (delta histograms).
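For readers who want to reproduce this kind of comparison, here is a minimal pure-Python sketch of Pearson's chi-square test of independence applied to two delta histograms. It is not the computation from the attached file: the function names are hypothetical, the counts are made up, and the usual caveat applies that bins with small expected counts make the chi-square approximation unreliable.

```python
import math


def chi2_sf(x, dof):
    """Upper tail probability of the chi-square distribution (closed form)."""
    y = x / 2.0
    if dof % 2 == 0:
        # Q(m, y) = exp(-y) * sum_{j<m} y^j / j!
        term, total = 1.0, 1.0
        for j in range(1, dof // 2):
            term *= y / j
            total += term
        return math.exp(-y) * total
    # Odd dof: start from erfc and add the half-integer terms.
    total = math.erfc(math.sqrt(y))
    term = math.sqrt(y) / math.gamma(1.5)
    for j in range((dof - 1) // 2):
        total += math.exp(-y) * term
        term *= y / (j + 1.5)
    return total


def chi2_independence(counts_a, counts_b):
    """Compare two histograms with the same bins; returns (statistic, dof, p)."""
    cols = [(a, b) for a, b in zip(counts_a, counts_b) if a + b > 0]
    if len(cols) < 2:
        raise ValueError("need at least two non-empty bins")
    total_a = sum(a for a, _ in cols)
    total_b = sum(b for _, b in cols)
    grand = total_a + total_b
    stat = 0.0
    for a, b in cols:
        col = a + b
        exp_a = total_a * col / grand
        exp_b = total_b * col / grand
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    dof = len(cols) - 1
    return stat, dof, chi2_sf(stat, dof)


# Made-up move counts per delta bin for two sets of games, illustration only.
set_a = [300, 60, 40, 20, 10]
set_b = [250, 70, 55, 35, 30]
print(chi2_independence(set_a, set_b))
```

Two histograms with identical proportions give a statistic of 0 and a p-value of 1; the more the proportions differ, the smaller the p-value becomes.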

Interpretation - C.Metta's play at regular tournaments is relatively close to that of O.Kruml but VERY different from his own play on the internet.

bugsti · #3

AlesCieply wrote:

Some notes on the internet games played by Carlo before the accusation:
The data shown in the current analysis are from new runs, so the results are slightly different from those in the earlier analysis (e.g. Carlo had 68% of good moves in the old analysis of the Kulkov-Metta game, it dropped to 64% now). The new runs are more consistent as they come from "automated runs" of Go Review Partner while a good fraction of the original analysis included hand-transcribed winrates. The differences between the original and new delta histograms are relatively small and demonstrate variations due to independent runs of Leela.

I also noticed big oscillations between different, independent runs of Leela.

AlesCieply wrote:

Carlo makes almost no big mistakes (marked by red color) in his internet games which is in contrast when compared with his regular games. One can make only 1-2 (or even 0) big mistake but not so consistently as my preliminary results for another player show (I am still finalizing the analysis, hope to make it public soon).

You forgot to mention the most important fact among all. Carlo's big mistakes rate is ALWAYS consistent with his opponent's mistakes rate. And also the rate of his good moves.

AlesCieply wrote:

The percentage of good moves in Carlo internet games is rather consistent, unlike in his regular games. The percentage of good moves drops sharply to 50% in a game against Csaba Mero played after the accusation.

But also his opponent's (Mero) good moves rate drops down, consistently with Carlo's drop.

I think you found a method not to detect cheating but to detect if a particular game was an "easy" game or a tough one for the players.

The fact that Carlo's moves are always consistent with his opponent's moves can even be proof that cheating did not occur.

AlesCieply · #4

bugsti wrote:

You forgot to mention the most important fact among all. Carlo's big mistakes rate is ALWAYS consistent with his opponent's mistakes rate. And also the rate of his good moves.

Actually, you can note quite some difference in the 3 PGETC games Carlo played after he was accused, so your statement is not true. I have not looked at other games of Carlo's opponents so it is hard to say whether their low number of mistakes is due to their higher strength or it is game related.

bugsti wrote:

AlesCieply wrote:

The percentage of good moves in Carlo internet games is rather consistent, unlike in his regular games. The percentage of good moves drops sharply to 50% in a game against Csaba Mero played after the accusation.

But also his opponent's (Mero) good moves rate drops down, consistently with Carlo's drop.

I have no idea how Mero performs in other games. I guess his performance oscillates, but one cannot say for sure before doing the analysis. Have you noted that Bajenaru also scored only about 50% of good moves while Carlo had 67% (EDIT: corrected)?

bugsti wrote:

I think you found a method not to detect cheating but to detect if a particular game was an "easy" game or a tough one for the players.

I agree there is some correlation between the difficulty of the game for the players (and most likely also for the bot used to make the winrate estimates) and the statistical performance (percentage of mistakes and good moves). Still, Carlo normally outperforms his opponents in the internet games played before the accusation, especially in those played after the latest release of Leela. Just have a look at the Bajenaru-Metta and Metta-benDavid games (plus the two from the Italian Online Championship).

Bill Spight · #5

One thing I have been wondering about is your precision label. Thanks to your link I took a look at the Go Review Partner documentation. I did not find the word, precision, in that documentation. However, I did find this explanation of delta.

GoReviewPartner wrote:

By comparing the win rate (or Value Network win rate, Monte Carlo win rate) at one move (when the bot best move would be played) with the win rate of the following move (the case when the actual game move was played), one can draw a delta graph for each color.

This is a graph that indicates by how much the bot believes it could have played better than the human player, or eventually by how much the human player move was better than its own move. The difference between both win rate percentage value is called delta.

Personification aside (the bot doesn't believe anything), the delta is comparing, if not apples and oranges, at least different varieties of apples. For the delta to mean what it claims to mean, the bot must play perfectly. Your precision label indicates the difference in win rates before and after a single play by the bot. If the bot played perfectly that difference would always be zero. But the difference can be substantial, and that casts doubt upon the delta measure, which has the same sources of error.

Two of these sources of error are the different number of playouts for each move in the comparison, and the different game trees that are built for each move. A good example is White 36 in Bojanic's analysis of the Metta-Ben David game. Starting from the position after Black 35, Leela's choice of White 36 has a win rate (for Black) of 51.6% with 44,162 playouts for an error term of ± 0.5%. White indeed made Leela's play, and starting from that position, Leela's choice of Black 37 has a win rate of 55.4% with 11,483 playouts for an error term of ± 1%. But the win rate difference is 4%, much larger than the error estimates. Those error estimates are based upon playouts, but what happened is that, starting from the position after White 36, Leela found an apparently better play for Black's next move than it had found starting from the position after Black 35.
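As a rough cross-check of the error terms quoted in this example, one can treat each playout as an independent win/loss trial and compute two binomial standard errors. That independence assumption is a simplification introduced here (tree-search playouts are not independent), and the function name is hypothetical; the sketch only shows that the quoted ± 0.5% and ± 1% are of the size such a model would give.

```python
import math


def two_sigma_error(winrate, playouts):
    """About two binomial standard errors, treating playouts as independent trials."""
    return 2.0 * math.sqrt(winrate * (1.0 - winrate) / playouts)


# Winrates and playout counts taken from the example in the post.
for winrate, playouts in [(0.516, 44162), (0.554, 11483)]:
    err = two_sigma_error(winrate, playouts)
    print(f"{winrate:.1%} with {playouts} playouts: +/- {err:.2%}")
```

Under this model the two error terms come out just under 0.5% and 1% respectively, so the observed jump of about 4% between consecutive positions cannot be explained by playout noise alone, which is the point made in the post.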

So to find a good delta we don't want to do what Go Review Partner does. It's OK for casual review, but not for scientific purposes. We want to start from the same place, and we want to have an equal number of playouts for each play we are comparing. With Go Review Partner I think we can do that by making each play we are comparing and then running the bot for a certain number of rollouts, or for a certain length of time. That way we are comparing apples with apples.
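A minimal sketch of the "apples with apples" comparison proposed above, assuming some way of asking the bot for a winrate after a given move with a fixed playout budget. The `evaluate` callable is purely hypothetical (Go Review Partner is not known to expose such a function); it stands for "play this move from this position, run the bot for N playouts, read off the winrate for the player who moved".

```python
def same_root_delta(evaluate, position, bot_move, game_move, playouts):
    """Delta between the bot's move and the game move, both from the same position.

    `evaluate(position, move, playouts)` is a hypothetical callable returning the
    winrate (in percent) for the player to move after `move` is played.
    """
    if game_move == bot_move:
        return 0.0  # same convention as in the spreadsheets
    winrate_bot = evaluate(position, bot_move, playouts)
    winrate_game = evaluate(position, game_move, playouts)
    return winrate_bot - winrate_game


# Stand-in evaluator with made-up winrates, for illustration only.
FAKE_WINRATES = {("pos", "Q16"): 54.0, ("pos", "R14"): 51.5}


def fake_evaluate(position, move, playouts):
    return FAKE_WINRATES[(position, move)]


print(same_root_delta(fake_evaluate, "pos", "Q16", "R14", 50000))
```

Both evaluations start from the same position and use the same playout budget, so a better variation found one ply later can no longer leak into only one side of the comparison.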

A glance at the Bryant-Metta game's precision data indeed suggests that Leela does not understand that game very well. It apparently keeps finding good variations that it missed one ply earlier. At the very least we should compare apples with apples.

Bill Spight · #6

AlesCieply wrote:

Carlo normally outperforms his opponents in the internet games played before the accusation, especially in those played after the latest release of Leela.

Well, he won, didn't he?

And, as I mentioned before, a Chi Square Test comparing Carlo's play with the play of his opponents in those games failed to find a significant difference. It's not even close to the 5% level. The significant difference is the one you found, between Carlo's play under different conditions.

AlesCieply · #7

Bill, I intend to address the precision of the bot estimates in some detail, and will most likely put it into the second placeholder message. Before I do so, just very briefly:
- I agree with you that to have it "scientifically perfect" one should have about the same numbers of playouts for the estimation of the position after the top move suggested by the bot was played and compare it with the winrate after the game move was played.
- The last column in my data sheets is for precision. I put it there just to see how this can vary when the number of playouts increases. As long as it stays below a 1% difference I consider it fine. When it exceeds the limit I color the cell blue, so I can easily spot where the bot has some trouble estimating the winrate. For the first two games in the new analysis (Kulkov, Kruml) I made additional runs at higher playouts (300k+) to check the extremes (green, red deltas and blue precisions), so those two games are "doctored" to achieve better winrate estimates. If there was a change to the original winrate, top move suggestion or order of the move played, I marked the affected cells in blue. These changes had little (if any) impact on the histograms, but the winrates there are simply better estimated than for the rest of the sheet. As it is quite a lot of additional work, I have not done it for other games. I also intend to provide the original rsgf files, so everybody can check that this additional "doctoring" is really not done with the purpose of making the data look better or worse for Metta.

Bill Spight · #8

AlesCieply wrote:

Bill, I intend to address the precision of the bot estimates in some detail, and will most likely put it into the second placeholder message. Before I do so, just very briefly:
- I agree with you that to have it "scientifically perfect" one should have about the same numbers of playouts for the estimation of the position after the top move suggested by the bot was played and compare it with the winrate after the game move was played.

Well, if I understand Go Review Partner well enough, the deltas are based upon a similar number of playouts for each comparison, because they depend upon the number of playouts for the bot's top choices. (OC, they are different choices, one ply apart.)

I think then, that the main source of error is usually finding a better game tree one ply later. That's why it is important to make comparisons at the same level. Apples vs. apples.

Quote:

- The last column in my data sheets is for precision. I put it there just to see how this can vary when the number of playouts increases.

That's important.

You can use it in a slightly different way by running the bot more than once from the same position.

Hmmm. That might be a way to assess the difficulty of a position. An easy position with an obvious play may well have less variable results than a more difficult position. We might use a precision measure to decide which plays to compare.

bugsti · #9

AlesCieply wrote:

I have no idea how Mero performs in other games. I guess his performance oscillates, but one cannot say for sure before doing the analysis. Have you noted Bajenaru also scored only about 50% of good moves while Carlo had 77%.

... while Carlo had 66.7% according to your spreadsheet.

AlesCieply · #10

bugsti wrote:

AlesCieply wrote:

... while Carlo had 66.7% according to your spreadsheet.

Ooops, sorry, I mistyped. Thanks for the correction.

AlesCieply · #11

Bill Spight wrote:

AlesCieply wrote:

Carlo normally outperforms his opponents in the internet games played before the accusation, especially in those played after the latest release of Leela.

Well, he won, didn't he?

I wish it was that simple. It is possible to have a higher percentage of good moves and still lose the game; the Kulkov-Metta PGETC game is such a case in the new analysis. The Bryant-Metta game is another one, but there I really do not know what to make of the analysis. One can also lead for most of the game (having a higher percentage of good moves up to move 180) and then self-atari in the endgame etc. My point in the opening message is that Metta is surprisingly consistent in his internet performances prior to the accusation, and that this consistency is broken afterwards and in his regular games. Just to be clear, I do not consider it proof of cheating; he could simply have been in a bad state of mind after the accusation, so it can be explained both ways. Still, it is worth noting.

Bill Spight · #12

AlesCieply wrote:

Bill Spight wrote:

AlesCieply wrote:

Carlo normally outperforms his opponents in the internet games played before the accusation, especially in those played after the latest release of Leela.

Well, he won, didn't he?

I wish it was that simple. It is possible to have a higher percentage of good moves and still lose the game; the Kulkov-Metta PGETC game is such a case in the new analysis. The Bryant-Metta game is another one, but there I really do not know what to make of the analysis. One can also lead for most of the game (having a higher percentage of good moves up to move 180) and then self-atari in the endgame etc. My point in the opening message is that Metta is surprisingly consistent in his internet performances prior to the accusation, and that this consistency is broken afterwards and in his regular games. Just to be clear, I do not consider it proof of cheating; he could simply have been in a bad state of mind after the accusation, so it can be explained both ways. Still, it is worth noting.

I am not talking about proof of cheating, but of the comparison between Metta's play and that of his opponents (win-loss aside). I haven't seen any test of that comparison aside from the one I did on your early data, which showed no significant difference. Statistically, Carlo's play online was better than his play offline in the games analyzed. But I did not find that Carlo's online play was better than the online play of his opponents. If you can show that, please do.

Bill Spight · #13

Since we are talking about using bots to assess errors, the game in this post, viewtopic.php?p=233790#p233790 , may be of interest. Zen7 loses one point and the game by making an unnecessary protective play. Without that play it estimates its winrate at 39%. The position is easy for an amateur dan player to read out, IMO.

Uberdude · #14

Bill Spight wrote:

Since we are talking about using bots to assess errors, the game in this post, viewtopic.php?p=233790#p233790 , may be of interest. Zen7 loses one point and the game by making an unnecessary protective play. Without that play it estimates its winrate at 39%. The position is easy for an amateur dan player to read out, IMO.

I didn't follow all that thread, but as the creator didn't seem to understand about bots being trained on a fixed komi I would want to check the komi configuration was correct before concluding anything (e.g. Zen may have assumed a komi of 6.5 so thought it would win by 0.5 instead of 1.5 after defending, when in fact the game was played under 7.5 komi). Artificial intelligences are also good at artificial stupidity ;-)

Bill Spight · #15

Uberdude wrote:

Bill Spight wrote:

Since we are talking about using bots to assess errors, the game in this post, viewtopic.php?p=233790#p233790 , may be of interest. Zen7 loses one point and the game by making an unnecessary protective play. Without that play it estimates its winrate at 39%. The position is easy for an amateur dan player to read out, IMO.

I didn't follow all that thread, but as the creator didn't seem to understand about bots being trained on a fixed komi I would want to check the komi configuration was correct before concluding anything (e.g. Zen may have assumed a komi of 6.5 so thought it would win by 0.5 instead of 1.5 after defending, when in fact the game was played under 7.5 komi). Artificial intelligences are also good at artificial stupidity ;-)

I just checked. The komi was 6.5 pts. and Black 309 made Black lose by 0.5. :oops:

Uberdude · #16

So the game record says 6.5 komi, but doesn't specify ruleset. If the game was played with Chinese counting and 6.5 komi (admittedly a strange combination) then 309 would not have changed black winning. With Japanese it makes black lose. Whether you can communicate this information to Zen and if it pays any attention to it and gives correct results based on it I don't know. I recall some komi/ruleset troubles with Zen in the past.

Bill Spight · #17

Uberdude wrote:

So the game record says 6.5 komi, but doesn't specify ruleset. If the game was played with Chinese counting and 6.5 komi (admittedly a strange combination) then 309 would not have changed black winning. With Japanese it makes black lose. Whether you can communicate this information to Zen and if it pays any attention to it and gives correct results based on it I don't know.

Maybe Zen doesn't count territory. :shock:

Quote:

I recall some komi/ruleset troubles with Zen in the past.

Yes, a few years ago I read that its creator complained about using it with 6.5 komi, having trained it on 7.5 komi. But that's not the same problem as not counting territory at all.

I remember thinking at the time, well, you know, you could have contacted Jasiek or yours truly.

Anyway, it was the 39% that struck me. That's off by at least 39% at that point in the game. :shock:

AlesCieply · #18

I have uploaded the new version (automated runs with Go Review Partner) of the analysis for Carlo's regular games. In the second post I put the links to the sheets and to the pertinent rsgf files generated by the GRP. Using those one can also review the games and see all the suggestions made by the bot for the analyzed moves.

I was also pondering whether or not to include the Kim-Metta game played at the WAGC, the first one in the sheets. Not many moves were played and Leela really fails to realize that white's large group is dead. The last game moves, where Leela insists on playing s7 while the players do not follow the bot's suggestion, result in several "large mistakes" for both players, distorting the analysis. In fact, this "blind spot" in the AI engine is likely responsible for some "bad moves" appearing in the analysis of other games too. I guess this might also explain the similar percentages of bad moves attributed to both players in some games, and the partial coincidence of bad moves noted by bugsti. I should still check this, but right now I do not know what to do about it. I have already checked that Leela Zero does better in this particular case, at least for the Kim-Metta game, but I am unable to run it with sufficient reliability as I do not possess a GPU-accelerated computer.

Jan.van.Rongen · #19

IMO the whole analysis is deeply flawed because the very basis of the analysis: the "Leela top 3 suggestions" is undefined, even if we fix the running parameters such as the number of simulations and the base machine configuration. Both the Neural Net and the Monte Carlo Tree Search have some random behaviour - and even on a fast machine with longer thinking time the top three is not always the same.

To illustrate this I ran the following experiment: make move 51 in the IT-IL game with Leela 0.11. Steps repeated 3 times in Sabaki:
(a) load game
(b) go to move 50
(c) attach Leela 0.11 engine as Black, which makes move 51 then.

settings: 60 seconds. Results copied from the Sabaki console:

Quote:

# -1---------------------------------------------------
L17 -> 121553 (W: 53.75%) (U: 47.78%) (V: 57.43%: 7683) (N: 24.1%)
E13 -> 13470 (W: 53.62%) (U: 47.93%) (V: 57.13%: 917) (N: 2.5%)
J14 -> 6854 (W: 53.04%) (U: 48.24%) (V: 56.01%: 423) (N: 9.9%)
B13 -> 5759 (W: 53.60%) (U: 48.83%) (V: 56.54%: 363) (N: 1.7%)
D13 -> 1891 (W: 53.69%) (U: 50.39%) (V: 55.73%: 109) (N: 0.7%)

121553 visits, score 53.75% (from 53.53%) PV: L17 J14 E13 D13 J11 H13 G13 H12 H11 M16 L16 L15 M15 N16 N15 O16 K15 L14 K14
160383 visits, 216446 nodes, 160383 playouts, 2642 p/s
# -2---------------------------------------------------
E13 -> 119982 (W: 55.09%) (U: 48.96%) (V: 58.87%: 7448) (N: 2.5%)
L17 -> 27085 (W: 53.89%) (U: 47.44%) (V: 57.87%: 1624) (N: 24.1%)
B12 -> 8690 (W: 54.06%) (U: 48.73%) (V: 57.35%: 495) (N: 3.4%)
J10 -> 2999 (W: 53.60%) (U: 50.72%) (V: 55.39%: 153) (N: 5.2%)
B13 -> 2744 (W: 53.95%) (U: 48.99%) (V: 57.01%: 154) (N: 1.7%)

119982 visits, score 55.09% (from 54.61%) PV: E13 D13 B12 B13 M17 J14 K16 H15 J11 H12 H11 L14 K13 L16 L17 M16 N16 K17 N15
171690 visits, 213389 nodes, 171690 playouts, 2828 p/s
# -3---------------------------------------------------
E13 -> 128229 (W: 55.23%) (U: 48.68%) (V: 59.27%: 7969) (N: 2.5%)
L17 -> 24849 (W: 54.01%) (U: 47.82%) (V: 57.83%: 1548) (N: 24.1%)
M17 -> 3674 (W: 54.02%) (U: 47.42%) (V: 58.09%: 219) (N: 3.0%)
B13 -> 3551 (W: 54.15%) (U: 49.25%) (V: 57.18%: 205) (N: 1.7%)
J14 -> 1495 (W: 51.96%) (U: 47.05%) (V: 54.99%: 87) (N: 9.9%)

128229 visits, score 55.23% (from 54.79%) PV: E13 D13 B12 B13 M17 J14 K16 H15 J11 L14 O16 L11 K13 L16 L17 L13 H9 H8 J9 G10 H10
169370 visits, 211931 nodes, 169370 playouts, 2790 p/s
# -----------------------------------------------------

So there is no "Top 3". Your run has a top 3, but mine has another one, and for comparison purposes a single result is of limited or no value. You have to know the "spread", and from the above example you see that it can be larger than you would think initially.
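The spread across the three runs can be quantified with a short script. This is a sketch only: the candidate lines below are copied from the console output above with the U, V and N columns omitted, and the parsing pattern is an assumption about that output format rather than anything documented by Leela or Sabaki.

```python
import re

# Candidate lines from the three runs above (move, visits, W column only).
RUNS = [
    """L17 -> 121553 (W: 53.75%)
E13 -> 13470 (W: 53.62%)
J14 -> 6854 (W: 53.04%)
B13 -> 5759 (W: 53.60%)
D13 -> 1891 (W: 53.69%)""",
    """E13 -> 119982 (W: 55.09%)
L17 -> 27085 (W: 53.89%)
B12 -> 8690 (W: 54.06%)
J10 -> 2999 (W: 53.60%)
B13 -> 2744 (W: 53.95%)""",
    """E13 -> 128229 (W: 55.23%)
L17 -> 24849 (W: 54.01%)
M17 -> 3674 (W: 54.02%)
B13 -> 3551 (W: 54.15%)
J14 -> 1495 (W: 51.96%)""",
]

LINE = re.compile(r"^([A-T]\d{1,2})\s+->\s+(\d+)\s+\(W:\s+([\d.]+)%\)", re.M)


def parse_run(text):
    """Return (move, visits, winrate) tuples in the order the engine listed them."""
    return [(m.group(1), int(m.group(2)), float(m.group(3)))
            for m in LINE.finditer(text)]


runs = [parse_run(text) for text in RUNS]

# Top three candidates of each run and the moves common to all of them.
top3 = [[move for move, _, _ in run[:3]] for run in runs]
common_top3 = set(top3[0]).intersection(*top3[1:])

# Winrate spread (max - min) for moves that were listed in every run.
winrates = {}
for run in runs:
    for move, _, winrate in run:
        winrates.setdefault(move, []).append(winrate)
spread = {move: max(v) - min(v)
          for move, v in winrates.items() if len(v) == len(runs)}

print(top3)
print(sorted(common_top3))
print({move: round(s, 2) for move, s in spread.items()})
```

On these three runs only E13 and L17 appear in every top three, while the winrate of a single candidate varies by up to about 1.6 percentage points (E13) between runs.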

The second objection is to the use of the "winning percentages" from the bot. The statement that Metta did not commit a blunder in the above game is based on the evaluation of Leela 0.11, and leads to the circular reasoning that he must have used this bot because he did not commit a blunder according to the same bot's evaluation.

But running this game through the evaluations of Leela Zero #153 or AQ we get completely different evaluations - Black made some serious errors and was not "on top" of this game most of the time.

AlesCieply · #20

Jan.van.Rongen wrote:

IMO the whole analysis is deeply flawed because the very basis of the analysis: the "Leela top 3 suggestions" is undefined, ...

Jan, have you noted that my analysis does not use the "top 3 suggestions"? Maybe you are speaking here about someone else's analysis.

What my analysis does use are the winrates, the bot estimates of the probability to win the game if the top suggested move is played. There, I see all your three runs provide values in a reasonable agreement with what I have there. The agreement might have been even better if you ran the bot at a higher number of playouts.

Jan.van.Rongen wrote:

The second objection is to the use of the "winning percentages" from the bot. The statement that Metta did not commit a blunder in the above game is based on the evaluation of Leela 0.11, and leads to the circular reasoning that he must have used this bot because he did not commit a blunder according to the same bot's evaluation.

But running this game through the evaluations of Leela Zero #153 or AQ we get completely different evaluations - Black made some serious errors and was not "on top" of this game most of the time.

There is no circular reasoning in the methodology. The result of the analysis is that Metta's playing pattern (mistakes histogram) obtained from his internet games differs significantly (as measured by the chi2 test) from the one generated from his games played at regular tournaments. It does not tell us whether he used Leela, got someone else to play instead of him, or whatever. It just says that it is very unlikely that Metta's regular games and internet games "were played by the same person".

Sure, I would very much like to check it with alternative bots (AQ, Leela Zero). The problem is that I do not have access to a GPU-powered computer, so the numbers of playouts I am getting with them are small (below 10k with 3 min per move). Anyone willing to help? All I need are the rsgf files generated by Go Review Partner. Just let it analyse the games with a setting to spend (let's say) 3 min on a move (or to get at least 100k playouts); moves 30-181 suffice for me.
