(Aside: there are other applications of this beyond cheating detection, such as simply measuring the level of play in games (e.g. a game Mr 1d thought he played well, where the mistake profile was indeed more like a typical 3d's). Similar approaches in chess have been used to measure the strength of past great players, though this comes with caveats: humans might not play the move they think is objectively best, but the one they think is most likely to win against that particular opponent (early 20th century world chess champion Emanuel Lasker was known for this). In Go, if we wanted to compare Shusaku to modern pros we also have the komi problem, even if/when we think LeelaZero/Elf etc. is sufficiently strong to be a judge.)
Ales analysed some of Carlo's games with Go Review Partner, transcribed the win rates into a spreadsheet, calculated the winrate deltas compared to Leela's #1 choice, grouped them into buckets and counted them. I've made a graph of this data: blue are online games, orange are offline. So the question is: is there a statistically significant difference between these distributions, a gap in performance so large that it can only be explained by cheating? We can't answer that without more data, so this thread has several purposes:
1) discuss and improve methodology
2) encourage others to collect and contribute data: it's a rather tiresome process at the moment, though if you also look at the games and think about where you would play, it can double as Go study time rather than just being a data-input monkey.
3) automate the process with tools/scripts? [Edit: pnprog is already helping]
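For concreteness, the delta-and-bucket step Ales did by hand can be sketched in a few lines of Python. The numbers, field layout and 5% bucket width here are my own illustration, not taken from the actual spreadsheet:

```python
from collections import Counter

def winrate_deltas(played, best):
    """Winrate drop of each played move versus the engine's #1 choice.

    played: winrate (%) after the move actually played
    best:   winrate (%) after the engine's preferred move
    """
    return [b - p for p, b in zip(played, best)]

def bucket(deltas, width=5):
    """Count deltas into buckets of `width` percentage points (0-5, 5-10, ...),
    keyed by the lower edge of each bucket. Negative deltas (the engine
    liked the human move better) are lumped into the 0 bucket."""
    counts = Counter()
    for d in deltas:
        counts[int(max(d, 0) // width) * width] += 1
    return dict(counts)

# Hypothetical winrates, not from Carlo's games:
deltas = winrate_deltas(played=[48, 44, 52, 30], best=[50, 45, 52, 42])
print(bucket(deltas))  # -> {0: 3, 10: 1}
```

With one such bucket dict per game (tagged online/offline), the graph above is just a grouped bar chart of the counts.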
For starters, we need lots more data so we can build up reference mistake profiles for different strengths of players, and see how much variation is typical between games of a single player. Also, will there be a difference between online and offline games? Old seasons of the PGETC (e.g. 2010 here; there are links at the bottom right of the sidebar on the homepage) from before strong bots existed could be a useful game source. Also this year's WAGC (maybe other European players also made more mistakes at the WAGC than they did in the PGETC). Eurogotv has a big archive of sgfs from live European tournaments.
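Once we have mistake profiles from two pools of games (say online vs offline), one simple, standard-library-only way to ask "could a gap this big be chance?" is a permutation test on the mean winrate delta. This is just a sketch of the idea, and the per-move deltas below are invented:

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def permutation_test(a, b, n=10_000, seed=0):
    """Estimate the p-value of the observed difference in mean delta
    between pools a and b by repeatedly relabelling the pooled moves
    at random and seeing how often a gap at least as large appears."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    hits = 0
    for _ in range(n):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            hits += 1
    return hits / n

# Invented per-move winrate deltas (%), for illustration only:
online  = [1, 0, 2, 1, 0, 3, 1, 0]
offline = [4, 8, 2, 6, 5, 9, 3, 7]
print(permutation_test(online, offline))  # small p = unlikely to be chance
```

A proper analysis would compare whole distributions (e.g. the bucket counts) rather than just means, and would account for game-to-game variation, but this shows the shape of the statistical question.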
I'm also thinking that we should analyse the games with GnuGo as well, and discard from the analysis any move where GnuGo agrees with both the human and the strong bot, treating it as an obvious move carrying little information. This should help mitigate the "this was a simple game with many obvious forced moves, so it will look more similar to the bot" problem. There's a question of whether we should use the absolute change in winrate percentage or some other function; see the posts below, moved from another thread on this topic. Also, should we use Leela 0.11 or LeelaZero? LeelaZero is quite a lot stronger, so it will give more correct judgements, but its winrate judgements are also harsher (what 0.11 thinks is maybe a 5% mistake, LZ might call 10%, and LeelaElf is harsher still), and if one player gets to 90% fairly early, the subsequent winrate deltas are likely less useful.
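The GnuGo filter could look something like this. The field names and data layout are my assumptions for illustration, not the output format of any existing tool:

```python
def informative_moves(moves):
    """Keep only moves where GnuGo, the strong bot and the human did NOT
    all pick the same point: three-way agreement marks an 'obvious' move
    that tells us little about the human's strength."""
    return [m for m in moves
            if not (m["human"] == m["strong_bot"] == m["gnugo"])]

# Hypothetical move records:
moves = [
    {"human": "Q16", "strong_bot": "Q16", "gnugo": "Q16"},  # obvious, dropped
    {"human": "C3",  "strong_bot": "D4",  "gnugo": "C3"},   # kept for analysis
]
print(informative_moves(moves))  # -> only the C3 move remains
```

The winrate deltas would then be computed over the filtered list only, so games full of forced sequences don't artificially inflate a player's agreement with the bot.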