Measuring player mistakes versus bots

Uberdude
Judan
Posts: 6727
Joined: Thu Nov 24, 2011 11:35 am
Rank: UK 4 dan
GD Posts: 0
KGS: Uberdude 4d
OGS: Uberdude 7d
Location: Cambridge, UK
Has thanked: 436 times
Been thanked: 3718 times

Measuring player mistakes versus bots

Post by Uberdude »

From the Pandanet Leela cheating case, it has become apparent that we need better statistical methods of measuring player performance than simply counting matches to bot moves, if we are to use them for detecting/convicting players who use bots for assistance and achieve better-than-expected results. Inspired by Ken Regan's work on chess, the basic idea is to look at how big and how numerous the player's mistakes are, where a mistake is a drop in a "winrate" metric provided by the bot. Ales Cieply started doing this here.

(Aside: there are other applications of this beyond cheating detection, such as simply measuring the level of play in games (e.g. in this game Mr 1d thought he played well, and indeed his mistake profile was more like a typical 3d's). Similar approaches in chess have been used to measure the strength of past great players, though this comes with caveats: humans might not play what they think is the best move but the one they think is most likely to win against that particular opponent (early 20th century world chess champion Emanuel Lasker was known for this). In Go, if we wanted to compare Shusaku to modern pros we also have the komi problem, even if/when we think LeelaZero/Elf etc. is sufficiently strong to be a judge.)

Ales analysed some of Carlo's games with GoReviewPartner, transcribed the winrates into a spreadsheet, calculated the winrate deltas compared to Leela's #1 choice, grouped them into buckets and counted them. I've made a graph of this data: blue are online games, orange are offline.
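The bucketing step Ales did by hand is easy to script. A minimal sketch in Python (the bucket edges here are illustrative, not necessarily the ones Ales used; the winrate pairs are assumed to be percentages read off a bot analysis such as GoReviewPartner's):

```python
from collections import Counter

# Illustrative bucket edges, in winrate percentage points (assumed, not Ales's actual ones)
BUCKETS = (0, 1, 2, 5, 10, 20, 100)

def mistake_profile(moves):
    """Count per-move winrate losses versus the bot's #1 choice into buckets.

    moves: iterable of (best_winrate, played_winrate) pairs, in percent.
    Returns a Counter mapping each bucket's lower edge to a move count.
    """
    profile = Counter()
    for best, played in moves:
        delta = max(0.0, best - played)  # a mistake is a drop versus the #1 move
        for lo, hi in zip(BUCKETS, BUCKETS[1:]):
            if lo <= delta < hi:
                profile[lo] += 1
                break
    return profile

# Three moves: an exact match, a 3% slip, and a 12% blunder
profile = mistake_profile([(55.0, 55.0), (50.0, 47.0), (60.0, 48.0)])
print(dict(profile))  # {0: 1, 2: 1, 10: 1}
```

The resulting counts per bucket are exactly the bars of a mistake-profile histogram like the one below.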
[Attachment: Mistake profiles.png]
So the question is: is there a statistically significant difference between these distributions, a difference in performance so big that it can only be explained by cheating? We can't answer that without more data, so this thread has several purposes:

1) discuss and improve methodology
2) encourage others to collect and contribute data: it's a rather tiresome process at the moment, though if you also look at the games and think about where you would play, it can double up as Go study time rather than just data-input monkey work.
3) automate the process with tools/scripts? [Edit: pnprog already helping :) viewtopic.php?f=9&t=14050&p=232408#p232408]

For starters, we need lots more data so we can build up reference mistake profiles for different strengths of players and see how much variation is typical between games of a single player. Also, will there be a difference between online and offline games? Old seasons of the PGETC (e.g. 2010 here; there are links at the bottom right of the sidebar on the homepage), from before strong bots existed, could be a useful game source. Also this year's WAGC (maybe other European players also made more mistakes at the WAGC than they did in the PGETC). Eurogotv has a big archive of SGFs from live European tournaments.

I'm also thinking that we should also analyse the games with GnuGo, and any move on which GnuGo, the human, and the strong bot all agree should be discarded from the analysis as an obvious move carrying little information. This should help mitigate the "this was a simple game with many obvious forced moves, so it will be more similar to the bot" problem. There's also the question of whether we should use the absolute change in the winrate percentage or some other function; see the posts below, moved from another thread on this topic. Also, should we use Leela 0.11 or LeelaZero? LeelaZero is quite a lot stronger, so it will give more correct judgements, but its winrate judgements are also harsher (what 0.11 thinks is a 5% mistake LZ might call 10%, and LeelaElf is harsher still), and if one player gets to 90% fairly early, the subsequent winrate deltas are likely less useful.
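That agreement filter could look something like the sketch below. The per-move record format, with `human`/`gnugo`/`strong_bot` keys holding board coordinates, is hypothetical, not the output of any actual tool:

```python
def informative_moves(moves):
    """Keep only moves that carry information about the player's strength.

    A move is discarded as "obvious" when the human's choice matches both
    GnuGo's and the strong bot's choice. Each record is a dict with the
    (hypothetical) keys "human", "gnugo" and "strong_bot".
    """
    return [m for m in moves
            if not (m["human"] == m["gnugo"] == m["strong_bot"])]

moves = [
    {"human": "Q16", "gnugo": "Q16", "strong_bot": "Q16"},  # obvious, dropped
    {"human": "C3",  "gnugo": "D4",  "strong_bot": "D4"},   # kept for analysis
]
print(len(informative_moves(moves)))  # 1
```

Only the kept moves would then be fed into the mistake-profile statistics.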
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: Announcing GoReviewPartner - v0.11.2 (with Live Analysis

Post by Bill Spight »

Uberdude wrote: (I have just realised that using the absolute loss of percentage winrate is not such a good idea for measuring the size of a mistake: if you are at 50% and drop 1% to 49%, that's not as bad a mistake as being at 15% and dropping to 14%; better to say the former is 1/50 = 2% relative loss of winrate and the latter is 1/15 = 7% relative loss of winrate.)
Instead of the winrate you might consider using the log of the odds ratio. Using the base 10 for the logarithm we get the following.

log(50/50) = 0
log(49/51) = -0.0174
Difference = 0.0174

log(15/85) = -0.7533
log(14/86) = -0.7884
Difference = 0.0351
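Bill's numbers are easy to reproduce; a minimal Python sketch of the log-odds measure (base-10 logs, as above):

```python
import math

def log_odds(p):
    """Base-10 log of the odds ratio for a winrate p, with 0 < p < 1."""
    return math.log10(p / (1 - p))

def mistake_size(before, after):
    """Size of a mistake on the log-odds scale (positive for a drop)."""
    return log_odds(before) - log_odds(after)

print(round(mistake_size(0.50, 0.49), 4))  # 0.0174
print(round(mistake_size(0.15, 0.14), 4))  # 0.035
```

The 1% drop at 15% scores about twice as large as the same drop at 50%, matching the table above (the last digit differs from 0.0351 only because each log was rounded before subtracting there).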
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
ez4u
Oza
Posts: 2414
Joined: Wed Feb 23, 2011 10:15 pm
Rank: Jp 6 dan
GD Posts: 0
KGS: ez4u
Location: Tokyo, Japan
Has thanked: 2351 times
Been thanked: 1332 times

Re: Announcing GoReviewPartner - v0.11.2 (with Live Analysis

Post by ez4u »

Uberdude wrote:...

(I have just realised that using the absolute loss of percentage winrate is not such a good idea for measuring the size of a mistake: if you are at 50% and drop 1% to 49%, that's not as bad a mistake as being at 15% and dropping to 14%; better to say the former is 1/50 = 2% relative loss of winrate and the latter is 1/15 = 7% relative loss of winrate.)
A mistake that drops the winrate from 50% to 49% is a potential game changer. A "mistake" that drops the winrate from 15% to 14% is practically meaningless. If you haven't resigned at 15%, you are pushing hard to force your opponent to make an error in order to turn the game around. Until that big -35% pops up (or accumulates) in your opponent's play, anything goes. No?
Dave Sigaty
"Short-lived are both the praiser and the praised, and rememberer and the remembered..."
- Marcus Aurelius; Meditations, VIII 21

Re: Announcing GoReviewPartner - v0.11.2 (with Live Analysis

Post by Bill Spight »

ez4u wrote:
Uberdude wrote:...

(I have just realised that using the absolute loss of percentage winrate is not such a good idea for measuring the size of a mistake: if you are at 50% and drop 1% to 49%, that's not as bad a mistake as being at 15% and dropping to 14%; better to say the former is 1/50 = 2% relative loss of winrate and the latter is 1/15 = 7% relative loss of winrate.)
A mistake that drops the winrate from 50% to 49% is a potential game changer. A "mistake" that drops the winrate from 15% to 14% is practically meaningless. If you haven't resigned at 15%, you are pushing hard to force your opponent to make an error in order to turn the game around. Until that big -35% pops up (or accumulates) in your opponent's play, anything goes. No?
One problem is that we really don't know what these so-called win rates mean. They do not give the probability that the actual player, facing the actual opponent, will win the game. But even if they do mean something like that, it may well be that a mistake that makes a 1% difference when the odds are 50:50 will appear much smaller to us humans than a mistake that makes a 1% difference when the odds are 85:15. :) Edit: Which would mean that a mistake that makes a 1% difference when the odds are 85:15 is less likely for a human to make than one that makes a 1% difference when the odds are 50:50.
AlesCieply
Dies in gote
Posts: 65
Joined: Mon Sep 10, 2012 5:07 am
GD Posts: 0
Has thanked: 31 times
Been thanked: 55 times

Re: Announcing GoReviewPartner - v0.11.2 (with Live Analysis

Post by AlesCieply »

Bill Spight wrote:
Uberdude wrote: (I have just realised that using the absolute loss of percentage winrate is not such a good idea for measuring the size of a mistake: if you are at 50% and drop 1% to 49%, that's not as bad a mistake as being at 15% and dropping to 14%; better to say the former is 1/50 = 2% relative loss of winrate and the latter is 1/15 = 7% relative loss of winrate.)
Instead of the winrate you might consider using the log of the odds ratio. Using the base 10 for the logarithm we get the following.

log(50/50) = 0
log(49/51) = -0.0174
Difference = 0.0174

log(15/85) = -0.7533
log(14/86) = -0.7884
Difference = 0.0351
I am not sure that using the logarithm is the way to go. When I started working on my analysis of Carlo Metta's games I considered this option, as it is in fact similar to what Ken Regan does in his analysis of chess games. However, the situation there is different. Mistakes in chess are expressed in centipawns (hundredths of a pawn value), not in percentages. It is easy to see (and Regan demonstrated this statistically) that chess players tend to be less careful when they are leading by a sufficient margin (measured in pawns). The equivalent in Go would be a difference measured in terms of territory estimates. It does not matter whether the Go player wins the game by 20 points or by 5. Thus it is fine when a player who knows he is winning avoids the best move in a given position (which might lead to a complicated fight) and prefers a move that simplifies matters and in fact increases his chance (probability) of winning the game. My conclusion is that the estimated winrates already reflect the point that one goes for perhaps less territorial profit but a higher chance of winning the game.
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: Measuring player mistakes versus bots

Post by moha »

The theoretical error distributions (this was discussed in another thread recently) are in points dropped. Even this is subject to some distorting factors (like trading margin for safety when ahead, or for variance when behind), but when you convert to winrates I'd expect even more distortion (you cannot distinguish a 1-point endgame mistake in a close game from a 30-point middlegame mistake, for example).

There is some movement in the bot world towards score prediction (besides winrate prediction), so I guess in a few years there will be better options for this. For example, it seems possible to measure the distribution using more dimensions (extra axes for current winrate and/or game phase / move number).
pnprog
Lives with ko
Posts: 286
Joined: Thu Oct 20, 2016 7:21 am
Rank: OGS 7 kyu
GD Posts: 0
Has thanked: 94 times
Been thanked: 153 times

Re: Measuring player mistakes versus bots

Post by pnprog »

Hi!

I am certainly not that good at Go, and really not good at statistics, but I can certainly help build tools for automatic analysis and data collection of SGF files in batches, as mentioned here.

One comment: recently, computer pair Go has become popular with pros in the East. Maybe this is something we could use to benchmark our methodology. If we can get our hands on a few computer pair Go games played by a pro (with, hopefully, a bot stronger than the pro), it should be possible to collect plenty of regular "over the board" games for that pro as well. Then one could use them to measure how well our tools can differentiate between the pro's "normal play" and "augmented play". Using pro games for this has the advantage that pros have a very stable level (more stable than amateur players in the West).

If a methodology is not even capable of differentiating between the two types of play, then that's probably an indicator that it cannot be used to judge amateurs' games.
I am the author of GoReviewPartner, a small software aimed at assisting reviewing a game of Go. Give it a try!

Re: Measuring player mistakes versus bots

Post by AlesCieply »

Uberdude wrote: For starters, we need lots more data so we can build up reference mistake profiles or different strengths of players and see how much variation is typical between games of a single player.
I would also like to emphasize the point that we need a large collection of games, preferably in a uniform format. There are large collections of pro games available, but it is a hard task in itself to hunt down game records from amateur tournaments, or even games played by the EGF pros. Many of them are broadcast online, but no one seems to collect them in one place. In principle the EGD could also serve as a place to collect the games (and there are already some there), but it might be better to establish a more dedicated server.
Javaness2
Gosei
Posts: 1545
Joined: Tue Jul 19, 2011 10:48 am
GD Posts: 0
Has thanked: 111 times
Been thanked: 322 times

Re: Measuring player mistakes versus bots

Post by Javaness2 »

For amateur games, some repositories exist. For example, the EGD itself has some games, and Desprego.ro has over 500 games with at least one Romanian player. If you know the account names, you can also rip a lot from KGS broadcasts.
Charlie
Lives in gote
Posts: 310
Joined: Mon Feb 06, 2012 2:19 am
Rank: EGF 4 kyu
GD Posts: 0
Location: Deutschland
Has thanked: 272 times
Been thanked: 126 times

Re: Measuring player mistakes versus bots

Post by Charlie »

I have my reservations about the approach of comparing histograms. In chess, I suppose the approach works because all of the good bots are extraordinarily strong relative to humans, and close enough to perfect play that histograms from different bots would be somewhat similar: a human blunder or mistake would be judged similarly by all.

In Go, however, is this the case?

For a practical example, suppose I was to cheat in 99 games with "EsotericBot" and, in one game, not cheat at all. Suppose that "EsotericBot" is approximately 5 dan but does not resemble Leela Zero's play at all. In fact, it deviates so much from Leela Zero that my normal SDK play matches hers more closely. If you looked at the histogram over my games, with winrate deltas from Leela Zero, you would find overwhelming evidence that I used Leela Zero to cheat in the only game that I played honestly!

This raises the question: does the game of Go at the amateur high-dan level afford enough variety that two bots could be dramatically different and yet attain the same rank? Personally, I believe it does.

What's to be done? You could compare histograms computed against all (or many) popular, leading bots. There's still a chance that "EsotericBot" is so esoteric that you'll miss it. There's a much larger chance that doing so will lead to such high variance in the histograms that the test becomes meaningless.

I just don't think that our bots are strong enough yet. Even AlphaGo probably isn't strong enough, and my observation of the ELF network's play makes me think that the AlphaGo (and AlphaZero) approach becomes more and more opinionated as its strength increases: an opinionated network won't be very useful for detecting cheating against bots in general, no matter how strong it truly is.
jlt
Gosei
Posts: 1786
Joined: Wed Dec 14, 2016 3:59 am
GD Posts: 0
Has thanked: 185 times
Been thanked: 495 times

Re: Measuring player mistakes versus bots

Post by jlt »

Uberdude wrote: (I have just realised that using the absolute loss of percentage winrate is not such a good idea for measuring the size of a mistake: if you are at 50% and drop 1% to 49%, that's not as bad a mistake as being at 15% and dropping to 14%; better to say the former is 1/50 = 2% relative loss of winrate and the latter is 1/15 = 7% relative loss of winrate.)
I am not so sure. Consider the following fictitious game: Carla plays against Lili and is leading by p+0.5 points. It is Carla's turn. The bot thinks the next move, A, is obvious, and that if it is not played Carla will lose 4 points. After that, n=100 moves will be played, and each of these n moves gives 1 point to either Carla or Lili, with probability 50% each.

It turns out Carla blundered and didn't play A. If f(p) denotes the initial winrate, then the winrate drop is f(p)-f(p-4), whereas the relative winrate drop is (f(p)-f(p-4))/f(p).

The exact formula for f(p) is the sum of 2^(-n) C(n,k), where C(n,k) is the binomial coefficient "n choose k" and k lies in the range [(n-p)/2, 100].
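This toy model can be computed exactly. A small Python sketch, with winrates as probabilities and n = 100 as above:

```python
from math import comb, ceil

def f(p, n=100):
    """Toy-model winrate for a lead of p + 0.5 points before n coin-flip
    moves, each giving 1 point to one of the players: the leader wins
    iff she takes k of the n points with p + 0.5 + 2k - n > 0."""
    k_min = max(0, ceil((n - p) / 2))
    return sum(comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

def absolute_drop(p):
    """Winrate drop from the 4-point blunder, starting from a lead of p + 0.5."""
    return f(p) - f(p - 4)

def relative_drop(p):
    """The same drop, relative to the initial winrate."""
    return (f(p) - f(p - 4)) / f(p)

print(round(f(0), 4))  # 0.5398
# The relative drop is much larger for the player who is behind:
print(round(relative_drop(20), 3), round(relative_drop(-20), 3))
```

Plotting these two functions over a range of p gives the curves shown below.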

Here is a plot of the absolute winrate drop and the relative winrate drop:
[Attachment: winrate.png]
This suggests that:
  • the relative winrate drop exaggerates the importance of mistakes by the player who is behind;
  • the absolute winrate drop is a better measurement of the size of the blunder when the initial winrate is in the range 30%-80%;
  • therefore, moves played when the winrate is outside the range 30%-80% should be excluded from the statistics.
Of course, the above should be considered a "toy model"; the graphs will not look exactly the same in a real game situation.
dfan
Gosei
Posts: 1599
Joined: Wed Apr 21, 2010 8:49 am
Rank: AGA 2k Fox 3d
GD Posts: 61
KGS: dfan
Has thanked: 891 times
Been thanked: 534 times

Re: Measuring player mistakes versus bots

Post by dfan »

Note that if you assume that the win rate is accurate (a big assumption, as Bill Spight has noted), then a reduction from .50 to .49 and a reduction from .15 to .14 have exactly the same effect on your expected number of wins over a large number of games. So as a player you should (rationally) be indifferent to a choice between them.
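As a quick arithmetic check of this point: over a (hypothetical) N games played from each position, a 1% winrate drop costs the same number of expected wins wherever on the scale it occurs.

```python
N = 1000  # hypothetical number of games played from each position

# Expected wins lost by each mistake, taking the winrates at face value
# (rounded to suppress floating-point noise)
loss_mid = round(N * (0.50 - 0.49), 6)  # drop from 50% to 49%
loss_low = round(N * (0.15 - 0.14), 6)  # drop from 15% to 14%

print(loss_mid, loss_low)  # 10.0 10.0
```

Both mistakes cost 10 expected wins per 1000 games, which is why a rational player would be indifferent between them.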

Re: Measuring player mistakes versus bots

Post by Bill Spight »

dfan wrote:Note that if you assume that the win rate is accurate (a big assumption, as Bill Spight has noted), then a reduction from .50 to .49 and a reduction from .15 to .14 have exactly the same effect on your expected number of wins over a large number of games. So as a player you should (rationally) be indifferent to a choice between them.
However, the latter should be easier for a human to detect, I think. To put it another way, it is less likely to be noise.

This is related to the lack of reported margins of error for computer calculated winrates.

Re: Measuring player mistakes versus bots

Post by moha »

dfan wrote:Note that if you assume that the win rate is accurate
A winrate of 40% or 60% is obviously inaccurate IMO, since the "true" winrate can only be 100% or 0%.

But even an inaccurate winrate can be useful if it comes from superhuman bots. Even more so with expected scores, if those become available in the future (Golaxy is rumoured to use them internally, for example). What seems a more open question is how similar error distributions are between humans of similar strength but different character. Maybe if player A is better at the opening and B at middlegame fighting, their error distributions are also completely different? Is there really a standard distribution at a given strength to compare against?

Re: Measuring player mistakes versus bots

Post by dfan »

moha wrote:
dfan wrote:Note that if you assume that the win rate is accurate
A winrate of 40% or 60% is obviously inaccurate IMO, since the "true" winrate can only be 100% or 0%.
To be clearer, what I meant by that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability, starting from the position in question".