Measuring player mistakes versus bots

Bojanic
Lives with ko
Posts: 142
Joined: Fri May 06, 2011 1:35 pm
Rank: 5 dan
GD Posts: 0
Has thanked: 27 times
Been thanked: 89 times

Re: Measuring player mistakes versus bots

Post by Bojanic »

Pnprog,

first, I would like to thank you for GRP; it is excellent software, great work!

----

On the topic: you cannot simply count the player's moves that correspond to GnuGo's choices.
You can have ataris, peeps, joseki - and all those moves would probably be answered with the best choice by any player.

It is necessary to focus on important moves, move sequences, etc.
Simple statistics are not good enough.
pnprog
Lives with ko
Posts: 286
Joined: Thu Oct 20, 2016 7:21 am
Rank: OGS 7 kyu
GD Posts: 0
Has thanked: 94 times
Been thanked: 153 times

Re: Measuring player mistakes versus bots

Post by pnprog »

Bojanic wrote: On the topic: you cannot simply count the player's moves that correspond to GnuGo's choices.
You can have ataris, peeps, joseki - and all those moves would probably be answered with the best choice by any player.

It is necessary to focus on important moves, move sequences, etc.
Simple statistics are not good enough.
Haha, I am just some guy who makes tools that could be useful for testing your hypotheses or performing analysis :)

So I am trying to stay "neutral" on the existing PGETC case, and I won't embark on developing a method to solve future cases.

But if you have some ideas that you want to apply to a large set of data, and it's too much work (and error-prone) to do by hand, then I would be happy to help :salute:

The above was just a proof of concept of the sort of data that could be extracted from GnuGo, as mentioned by Uberdude. If some of you believe it could be a useful tool in itself, then I will release it in a form that is easy for you to use.
Bojanic wrote: You can have ataris, peeps, joseki - and all those moves would probably be answered with the best choice by any player
On this specific question, one way to differentiate between an important move and an urgent (forced) move would be, with Leela:
  • Check whether Leela proposes only one move: this is a strong indicator of a do-or-die move.
  • Check the drop in win rate between the first and second top moves. If the first top move has a 51% win rate and the second only 15%, this also indicates a forced move.
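The two checks above can be sketched in a few lines of Python. The candidate-list format and the 25% gap threshold are my own assumptions for illustration, not anything Leela actually outputs:

```python
# Hypothetical sketch: flagging "forced" positions from a bot's candidate list.
# Each candidate is a (move, winrate) pair, already sorted best-first.

def is_forced(candidates, winrate_gap=0.25):
    """A position is treated as forced if the bot offers a single move,
    or if the winrate drops sharply from the 1st to the 2nd choice."""
    if len(candidates) == 1:
        return True
    best, second = candidates[0][1], candidates[1][1]
    return (best - second) >= winrate_gap

# Example: first move 51% winrate, second only 15% -> forced
print(is_forced([("Q16", 0.51), ("C3", 0.15)]))   # True
print(is_forced([("Q16", 0.51), ("R16", 0.48)]))  # False
```

Moves flagged this way could then be excluded before computing any match-rate statistics, so that forced answers to ataris and peeps don't inflate the correlation with a bot.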
I am the author of GoReviewPartner, a small piece of software aimed at helping review Go games. Give it a try!
Bojanic
Lives with ko
Posts: 142
Joined: Fri May 06, 2011 1:35 pm
Rank: 5 dan
GD Posts: 0
Has thanked: 27 times
Been thanked: 89 times

Re: Measuring player mistakes versus bots

Post by Bojanic »

pnprog wrote: On this specific question, one way to differentiate between an important move and an urgent (forced) move would be, with Leela:
  • Check whether Leela proposes only one move: this is a strong indicator of a do-or-die move.
  • Check the drop in win rate between the first and second top moves. If the first top move has a 51% win rate and the second only 15%, this also indicates a forced move.
It could be helpful, but some analysis would be needed.
E.g., in one game I have seen a forced move with two answers, both good.
In other cases, someone might choose not to answer a peep, or to play another move nearby.
pnprog
Lives with ko
Posts: 286
Joined: Thu Oct 20, 2016 7:21 am
Rank: OGS 7 kyu
GD Posts: 0
Has thanked: 94 times
Been thanked: 153 times

Re: Measuring player mistakes versus bots

Post by pnprog »

Hi!

In some other thread, you mentioned that the PGETC games also have time records. This is also something that could be extracted together with the other information, in its own column.
I am the author of GoReviewPartner, a small piece of software aimed at helping review Go games. Give it a try!
pnprog
Lives with ko
Posts: 286
Joined: Thu Oct 20, 2016 7:21 am
Rank: OGS 7 kyu
GD Posts: 0
Has thanked: 94 times
Been thanked: 153 times

Re: Measuring player mistakes versus bots

Post by pnprog »

That's me again!

I was thinking about something that might work, but would be a lot of work to implement:

Basically, it would consist of training a set of policy networks, each one corresponding to a specific level of play (3k, 2k, 1k, 1d, 2d, 3d...).

<Edit> To be clear, I am not proposing to train a bot, only a policy network. Not something that can play Go: no playouts, no tree search, no Monte Carlo rollouts, no value network... </Edit>

A policy network, as I understand it, was developed by DeepMind for the first version of AlphaGo by showing it games of strong amateur players downloaded from the internet. This policy network was used to indicate, for a given position, which moves a strong amateur would play. It reduced the number of moves AlphaGo had to evaluate (evaluation being done with the value network and Monte Carlo rollouts). Later, they used AlphaGo vs AlphaGo games to further improve the policy network.

So we could try to train one policy network on ~2k players' games, then another on ~1k players' games, then another on ~1d players' games, and so on.

Note that we don't really care what level a policy network is labelled with (1k, or 3d); we only need them to be in increasing order of strength, and ideally at regular intervals. We could classify them using Elo, or simply A, B, C...

With such a set of policy networks, we could evaluate how well the moves of a player in a game correlate with each of our policy networks, and draw a chart. One could expect this chart to peak at the policy network closest to that player's level.

Then, by comparing those charts across different games, we could tell that in a particular game a player did not play at his usual level.
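The correlation step might look like this in code. The `nets` interface (a function returning the probability a network assigns to the move actually played) is entirely hypothetical, as are the toy "networks" in the demo:

```python
import math

def estimate_level(game, nets):
    """Score each policy network by the average log-likelihood it assigns
    to the moves actually played; the peak suggests the player's level.
    `game` is a list of (position, move_played); `nets` maps a level label
    to a function prob(position, move) -> probability."""
    scores = {}
    for label, prob in nets.items():
        total = 0.0
        for position, move in game:
            total += math.log(max(prob(position, move), 1e-12))
        scores[label] = total / len(game)
    best = max(scores, key=scores.get)
    return best, scores  # chart `scores` to see where it peaks

# Toy demo with two fake "networks" that weight move "A" differently.
fake_game = [("pos1", "A"), ("pos2", "A")]
nets = {
    "1k": lambda pos, mv: 0.6 if mv == "A" else 0.1,
    "1d": lambda pos, mv: 0.3 if mv == "A" else 0.2,
}
print(estimate_level(fake_game, nets)[0])  # "1k"
```

Averaging log-likelihood rather than raw match counts means near-misses (moves the network rates highly but not first) still contribute, which should make the per-level chart smoother.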

The difficult part would be to gather enough games for training: games from players with a stable level, and classified by level...

One way to do that could be to work with Go servers, more specifically with the players they use as rating anchors.
Now, they probably won't want to disclose publicly which players are used as anchors, but maybe this could be done under a non-disclosure agreement. Or maybe they could disclose the information once an anchor is removed; we could then download his games from the period when he was an anchor.
Or maybe we could collaborate with Go servers to get statistics on which players have a very high rating confidence.

Once we got enough games to train our policy networks, it would also open up all sorts of possibilities regarding the rating of players or their games (for example, one could finally learn the equivalence of ranks among Go servers).
Last edited by pnprog on Mon Jun 18, 2018 4:55 am, edited 2 times in total.
I am the author of GoReviewPartner, a small piece of software aimed at helping review Go games. Give it a try!
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: Measuring player mistakes versus bots

Post by moha »

I see two problems with training a 1k network, for example. First, to get 1k-level play you have to disable search (otherwise you get a much higher level: someone did this with 1d games and the results were comparable to full bot strength - the policy was only used for pruning the search, and a good search with 1d pruning is VERY strong). On the other hand, a no-search policy net will have specific NN-related oversights, atypical of and different from a human 1k.

Second, even if you get an artificial 1k player, comparing to it doesn't seem much better than comparing to other humans of similar strength. And even two 1k's can have quite different playstyles and error distributions.

The stronger approach seems to be to compare against a "perfect" player, collect detailed error statistics (the exact size of the errors, in points dropped, in various phases of the game), and then compare those DISTRIBUTIONS to known reference distributions. But even with this approach one should start by studying typical human error distributions, and see how similar or different two humans can be. Those errors may depend heavily on playing style, for example.
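As a toy illustration of comparing distributions, here is a hand-rolled two-sample Kolmogorov-Smirnov statistic on invented per-move point losses; the reference profile and the suspect game values are made up:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Maximum gap between the two empirical CDFs (two-sample KS statistic)."""
    a, b = sorted(sample_a), sorted(sample_b)

    def cdf(sorted_s, x):
        # Fraction of values <= x
        return bisect.bisect_right(sorted_s, x) / len(sorted_s)

    points = sorted(set(a + b))
    return max(abs(cdf(a, x) - cdf(b, x)) for x in points)

# Invented data: points dropped per move against a "perfect" reference.
reference_1k = [0, 0, 1, 2, 2, 3, 5, 8]    # a typical 1k error profile (made up)
suspect_game = [0, 0, 0, 0, 1, 1, 0, 12]   # near-perfect play plus one big blunder
print(ks_statistic(reference_1k, suspect_game))  # 0.5
```

A large statistic flags a game whose error distribution is unlike the reference, which captures exactly the "mostly perfect moves plus a deliberate blunder" pattern a blunder-checking cheater would leave; real work would use a proper test (e.g. scipy's `ks_2samp`) with far more moves.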

But if you only want NN aid in detecting cheaters, you could train a net specifically for that. By showing it a lot of bot games and a lot of human games of different strengths (maybe even human+bot games), you have a direct training target: whether the player was human (maybe subdivided by strength level). But since a cheater may not use the bot for all moves (only for blunder checking), such direct approaches don't seem viable.

A detailed study of error statistics seems to be the only promising way - whatever a player does will leave SOME mark on his distribution.
pnprog
Lives with ko
Posts: 286
Joined: Thu Oct 20, 2016 7:21 am
Rank: OGS 7 kyu
GD Posts: 0
Has thanked: 94 times
Been thanked: 153 times

Re: Measuring player mistakes versus bots

Post by pnprog »

moha wrote:... Second, even if you get an artificial 1k player ...
No no no, you got me wrong!

I am not proposing to train a bot, I am just proposing to train a policy network :)
I am the author of GoReviewPartner, a small piece of software aimed at helping review Go games. Give it a try!
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: Measuring player mistakes versus bots

Post by moha »

pnprog wrote:I am not proposing to train a bot, I am just proposing to train a policy network :)
Ok, but then:
moha wrote:a no-search policy net will have specific NN related oversights, atypical and different to a human 1k.
There are some things a raw net often gets wrong, because of the lack of tactical understanding that is inevitable without search (and because of the fuzzy, approximative nature of NNs). These can be quite different from human mistakes.

EDIT: Back to the original suggestion, even assuming these policies form worthwhile comparison points: suppose you find a game where the player played better than usual (the correlation peak above shifted). This would correspond to his error distribution being shifted/scaled a bit. How do you judge whether he was lucky, had a good day, or cheated, without a closer look at the details of his distribution?
jlt
Gosei
Posts: 1786
Joined: Wed Dec 14, 2016 3:59 am
GD Posts: 0
Has thanked: 185 times
Been thanked: 495 times

Re: Measuring player mistakes versus bots

Post by jlt »

I think it's very hard to detect the difficulty of a move using a neural net. The levels of the problems on https://neuralnetgoproblems.com/ are far from accurate: some 1d problems are quite easy (common joseki moves, for instance), while some 10k problems look much harder than 10k. In addition, the strength of a player depends on
  • knowledge
  • reading.
Knowledge corresponds roughly to the neural network, and reading to simulations. Some players don't have a lot of knowledge but are good at reading, and conversely. Also, you can be (relatively) strong because you make many good moves but regular blunders, or because you make mostly small mistakes.

Maybe the following approach could work:
  • Choose a database of at least a few hundred games.
  • Choose a strong bot, like a recent version of LeelaZero.
  • Say that a position is "relevant" when the game is between moves 30 and 150, LeelaZero evaluates the winrate as between 30% and 80%, and the move it suggests differs from the move suggested by GnuGo.
  • Define the "winrate loss" of a human move as the difference between the winrate before the move and the winrate after the move. It can be negative when the human finds a better move than LeelaZero.
  • Using the database, determine the parameters a and b such that exactly 10% of moves made by 1d players at relevant positions have a winrate loss less than a, and 10% have a winrate loss more than b.
  • Define a "good move" as a move, made at a relevant position, with winrate loss less than a.
  • Define a "bad move" as a move, made at a relevant position, with winrate loss more than b.
  • By definition, a "good move" is then a move that fewer than 10% of 1d players would find, and a "bad move" is a mistake that fewer than 10% of 1d players would make.
  • Using the database, for each grade g (g = ..., 2k, 1k, 1d, 2d, ...), define a_g as the percentage of good moves and b_g as the percentage of bad moves made by players of grade g. The point M_g = (a_g, b_g) in the plane represents the average play of grade-g players.
  • We say a person played at level g during a game if the proportions of good and bad moves he made during that game are closest to the point M_g.
  • One can then check, using the database, how often a 6d player plays at level 4d, or conversely.
Of course, I have no idea whether the above approach works at all. The reference to 1d is arbitrary, as are the 10% proportions. The approach could also be refined by classifying moves as "very good", "good", "average", "bad", "blunder". The notion of "relevant position" is also arbitrary and could be refined, but as given above it is easy to check on a computer.
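The counting and nearest-point steps above can be sketched as follows. The thresholds a and b, the grade points M_g, and the sample winrate losses are all invented numbers; in practice they would be calibrated from the game database as described:

```python
# Sketch of the good/bad-move classification. All numeric values below are
# hypothetical placeholders for database-calibrated parameters.

def classify_game(winrate_losses, a, b, grade_points):
    """Compute the game's (good%, bad%) point from per-move winrate losses
    at relevant positions, then return the grade g whose point M_g is
    closest (squared Euclidean distance)."""
    n = len(winrate_losses)
    good = sum(1 for w in winrate_losses if w < a) / n
    bad = sum(1 for w in winrate_losses if w > b) / n

    def dist(mg):
        return (good - mg[0]) ** 2 + (bad - mg[1]) ** 2

    return min(grade_points, key=lambda g: dist(grade_points[g]))

# Made-up calibration: 10% tails for 1d players, three grade points.
a, b = -0.01, 0.08
grade_points = {"1k": (0.08, 0.14), "1d": (0.10, 0.10), "2d": (0.13, 0.07)}
losses = [0.00, -0.02, 0.03, 0.01, -0.03, 0.02, 0.09, 0.00, 0.01, 0.02]
print(classify_game(losses, a, b, grade_points))  # "2d"
```

A real run would only feed in losses from "relevant" positions (moves 30-150, winrate 30-80%, LeelaZero and GnuGo disagreeing), and a few hundred games per grade would be needed before the M_g points mean anything.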