(Aside: there are other applications of this beyond cheating detection, such as simply measuring the level of play in games (e.g. a game Mr 1d thought he played well, where the mistake profile was indeed more like a typical 3d's). Similar approaches in chess have been used to measure the strength of past great players, though this comes with caveats: humans might not play the move they think is objectively best, but the one they think is most likely to win against that particular opponent (early 20th century world chess champion Emanuel Lasker was known for this). In Go, if we wanted to compare Shusaku to modern pros we also have the komi problem, even if/when we think LeelaZero/Elf etc. is sufficiently strong to be a judge.)
Ales analysed some of Carlo's games with Go Review Partner, transcribed the win rates into a spreadsheet, calculated the winrate deltas compared to Leela's #1 choice, grouped them into buckets and counted them. I've made a graph of this data: blue are online games, orange are offline. So the question is: is there a statistically significant difference between these distributions, a gap in performance so large that it can only be explained by cheating? We can't answer that without more data, so this thread has several purposes:
1) discuss and improve methodology
2) encourage others to collect and contribute data: it's a rather tiresome process at the moment, though if you also look at the games and think about where you would play, it can double as Go study time rather than just being a data-input monkey.
3) automate the process with tools/scripts? [Edit: pnprog is already helping]
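For concreteness, the delta-and-bucket step Ales did by hand can be sketched in a few lines of Python. The numbers, field layout and 5% bucket width here are my own illustration, not taken from the actual spreadsheet:

```python
from collections import Counter

def winrate_deltas(played, best):
    """Winrate drop of each played move versus the engine's #1 choice.

    played: winrate (%) after the move actually played
    best:   winrate (%) after the engine's preferred move
    """
    return [b - p for p, b in zip(played, best)]

def bucket(deltas, width=5):
    """Count deltas into buckets of `width` percentage points (0-5, 5-10, ...),
    keyed by the lower edge of each bucket. Negative deltas (the engine
    liked the human move better) are lumped into the 0 bucket."""
    counts = Counter()
    for d in deltas:
        counts[int(max(d, 0) // width) * width] += 1
    return dict(counts)

# Hypothetical winrates, not from Carlo's games:
deltas = winrate_deltas(played=[48, 44, 52, 30], best=[50, 45, 52, 42])
print(bucket(deltas))  # -> {0: 3, 10: 1}
```

With one such bucket dict per game (tagged online/offline), the graph above is just a grouped bar chart of the counts.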
For starters, we need lots more data so we can build up reference mistake profiles for different strengths of players, and see how much variation is typical between games of a single player. Also, will there be a difference between online and offline games? Old seasons of the PGETC (e.g. 2010 here; there are links at the bottom right of the sidebar on the homepage) from before strong bots existed could be a useful game source. Also this year's WAGC (maybe other European players also made more mistakes at the WAGC than they did in the PGETC). Eurogotv has a big archive of sgfs from live European tournaments.
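Once we have mistake profiles from two pools of games (say online vs offline), one simple, standard-library-only way to ask "could a gap this big be chance?" is a permutation test on the mean winrate delta. This is just a sketch of the idea, and the per-move deltas below are invented:

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def permutation_test(a, b, n=10_000, seed=0):
    """Estimate the p-value of the observed difference in mean delta
    between pools a and b by repeatedly relabelling the pooled moves
    at random and seeing how often a gap at least as large appears."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    hits = 0
    for _ in range(n):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            hits += 1
    return hits / n

# Invented per-move winrate deltas (%), for illustration only:
online  = [1, 0, 2, 1, 0, 3, 1, 0]
offline = [4, 8, 2, 6, 5, 9, 3, 7]
print(permutation_test(online, offline))  # small p = unlikely to be chance
```

A proper analysis would compare whole distributions (e.g. the bucket counts) rather than just means, and would account for game-to-game variation, but this shows the shape of the statistical question.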
I'm also thinking that we should analyse the games with GnuGo as well, and discard from the analysis any move where GnuGo agrees with both the human and the strong bot, treating it as an obvious move carrying little information. This should help mitigate the "this was a simple game with many obvious forced moves, so it will look more similar to the bot" problem. There's a question of whether we should use the absolute change in winrate percentage or some other function; see the posts below, moved from another thread on this topic. Also, should we use Leela 0.11 or LeelaZero? LeelaZero is quite a lot stronger, so it will give more correct judgements, but its winrate judgements are also harsher (what 0.11 thinks is maybe a 5% mistake, LZ might call 10%, and LeelaElf is harsher still), and if one player gets to 90% fairly early, the subsequent winrate deltas are likely less useful.
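The GnuGo filter could look something like this. The field names and data layout are my assumptions for illustration, not the output format of any existing tool:

```python
def informative_moves(moves):
    """Keep only moves where GnuGo, the strong bot and the human did NOT
    all pick the same point: three-way agreement marks an 'obvious' move
    that tells us little about the human's strength."""
    return [m for m in moves
            if not (m["human"] == m["strong_bot"] == m["gnugo"])]

# Hypothetical move records:
moves = [
    {"human": "Q16", "strong_bot": "Q16", "gnugo": "Q16"},  # obvious, dropped
    {"human": "C3",  "strong_bot": "D4",  "gnugo": "C3"},   # kept for analysis
]
print(informative_moves(moves))  # -> only the C3 move remains
```

The winrate deltas would then be computed over the filtered list only, so games full of forced sequences don't artificially inflate a player's agreement with the bot.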