Re: Measuring player mistakes versus bots

Posted: Mon Jun 11, 2018 12:05 pm
by Bill Spight
Bill Spight wrote:
moha wrote:
dfan wrote:To be clearer, what I meant with that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability starting from the position in question".
This assumption also seems false. The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained,
Are you sure about that? In that case it would be easy to produce margin of error statistics, which, IIUC, are not given.
Tryss wrote:No, it's not easy, because the given winrate is mostly based on winrate given by the evaluation of the network. And there is no easy way to get the margin of error of these numbers.
That's my point. :)

Re: Measuring player mistakes versus bots

Posted: Mon Jun 11, 2018 12:08 pm
by Bill Spight
moha wrote:
Bill Spight wrote:
moha wrote:The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained,
Are you sure about that? In that case it would be easy to produce margin of error statistics, which, IIUC, are not given.
Consider the training method: from zillions of positions taken from zillions of selfplay games the value head is trained with a loss function that is the difference of its current output and the actual outcome (1/-1).
Isn't that a form of reinforcement learning? You don't need accurate winrates for that to work.

Re: Measuring player mistakes versus bots

Posted: Mon Jun 11, 2018 12:26 pm
by dfan
moha wrote:
dfan wrote:To be clearer, what I meant with that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability starting from the position in question".
This assumption also seems false. The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained, but this can be significantly different from human winrate due to different playstyles. In fact, a drop of 2% (bot) winrate may even be a 1% (human) winrate gain.
OK. This is all incidental to the actual point I was trying to make anyway, which has now gotten lost in the noise, so I'm just going to drop it.

Re: Measuring player mistakes versus bots

Posted: Mon Jun 11, 2018 12:39 pm
by moha
Bill Spight wrote:
moha wrote:Consider the training method: from zillions of positions taken from zillions of selfplay games the value head is trained with a loss function that is the difference of its current output and the actual outcome (1/-1).
Isn't that a form of reinforcement learning? You don't need accurate winrates for that to work.
It's closer to supervised than to "real" reinforcement learning (the selfplay cycle makes it a bit different, net->selfplay->newnet). And the winrates will be pretty "accurate" in a sense, since the network is trained until the loss diminishes; at that point it will output reasonable values, at least in the positions it was trained on. Hence the need for a different test set if you are interested in its real accuracy.
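
To illustrate why training against game outcomes pushes the output towards a winrate, here is a minimal sketch. It assumes a squared-error loss (the post only says "difference"), and it replaces the whole network with a single number v for one position; the list of outcomes is made up.

```python
# Hypothetical selfplay outcomes from one position: +1 = win, -1 = loss.
outcomes = [1, 1, 1, 1, 1, 1, 1, -1, -1, -1]  # 7 wins, 3 losses

def train_value(outcomes, steps=200, learning_rate=0.1):
    """Gradient descent on the mean squared error between v and the outcomes."""
    v = 0.0
    for _ in range(steps):
        # Derivative of mean((v - z)^2) with respect to v.
        gradient = sum(2 * (v - z) for z in outcomes) / len(outcomes)
        v -= learning_rate * gradient
    return v

v = train_value(outcomes)
winrate = (v + 1) / 2  # map the [-1, 1] value to a [0, 1] winrate
print(f"value = {v:.3f}, implied winrate = {winrate:.3f}")
```

Under this assumption the minimum of the loss is the average outcome, so the trained value corresponds to the fraction of games won from that position in the training data. It says nothing about positions outside the training data, which is the reason for the separate test set.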

Or one could actually run hundreds of selfplays from hundreds of chosen test positions. To go back to dfan's original assumption: you could also do the same with human games starting from chosen test positions and collect the accuracy statistics.

Edit: I somehow missed your comment about move selection / number of visits. What I wrote applies to the value net only; when strengthened with search, it will most often use an average of the value evaluations at the leaves of the subtree starting with the move candidate. And selecting on number of visits will converge to selecting on average value, since the higher-valued candidates will get more future visits (either reducing the average if a refutation is found, or increasing the visit count).

It's true this would work even with inaccurate values/winrates, provided at least their ordering is reasonably good. But the above sampling tests still seem possible. And btw, if the nets were much faster, then policy-net-based rollouts (almost real winrates) would be used for the evaluation.

Re: Measuring player mistakes versus bots

Posted: Mon Jun 11, 2018 1:25 pm
by Bill Spight
Anyway, we can test the winrates by bot vs. bot self play ourselves. :)

Re: Measuring player mistakes versus bots

Posted: Tue Jun 12, 2018 1:18 am
by Bojanic
Go Review Partner can analyze an entire game using a selection of bots.
After analysis, it can produce a histogram which shows deviations from the bot's play.
It is not direct proof of similarity. Of course joseki, the opening and even close fighting would be similar.
But if a player has a long game similar to Leela's, that is cause for further examination.

Here is a histogram of one game between European pros.
Red bars are deviations from Leela's move (it considers them bad), and green are better moves.

Re: Measuring player mistakes versus bots

Posted: Tue Jun 12, 2018 3:11 am
by Uberdude
It would be interesting to compare the same game with a LeelaZero analysis. When I was reviewing one of Ilya Shikshin's games with Leela 0.11, it often didn't like or expect his moves. As a 4d I thought it was sometimes right that they were bad, but sometimes I think his moves were actually better (and indeed sometimes Leela would then like them when shown them, a point pnprog recently explained). As LZ is more strongly opinionated I would expect more red overall, but maybe some of those bars would be relatively smaller. Of course sometimes even the Euro pros do just play pretty badly ;-) .

Re: Measuring player mistakes versus bots

Posted: Tue Jun 12, 2018 3:15 am
by moha
Bill Spight wrote:Anyway, we can test the winrates by bot vs. bot self play ourselves. :)
This is what I was suggesting. And for their accuracy in human games you may not even need the mentioned hundreds of special games from chosen positions: just take a large human database, get bot prediction (both raw net and search result) in a chosen sample of positions, then calculate the overall correlation to outcomes. You may even do this separately for opening-middlegame-endgame positions (or for various winrate ranges).
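
A rough sketch of that database test, in Python. The sample data is invented, and the choice of Pearson correlation plus equal-width winrate bins is only one possible reading of "correlation to outcomes"; each sample is a bot winrate for Black (0 to 1) paired with whether Black actually won.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def calibration_table(samples, n_bins=5):
    """Group (predicted winrate, outcome) pairs into equal-width winrate bins.

    Returns rows of (bin_low, bin_high, count, mean_prediction, observed_wins).
    """
    bins = [[] for _ in range(n_bins)]
    for pred, won in samples:
        index = min(int(pred * n_bins), n_bins - 1)  # pred == 1.0 goes in the last bin
        bins[index].append((pred, won))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue  # skip winrate ranges with no sampled positions
        preds = [p for p, _ in members]
        wins = [w for _, w in members]
        rows.append((i / n_bins, (i + 1) / n_bins, len(members), mean(preds), mean(wins)))
    return rows

# Hypothetical samples: (bot winrate for Black, 1 if Black won the game else 0).
samples = [(0.10, 0), (0.15, 0), (0.30, 0), (0.35, 1), (0.50, 1),
           (0.55, 0), (0.70, 1), (0.75, 1), (0.90, 1), (0.95, 1)]

r = pearson([p for p, _ in samples], [w for _, w in samples])
print(f"correlation between winrate and outcome: {r:.2f}")
for low, high, count, avg_pred, observed in calibration_table(samples):
    print(f"{low:.1f}-{high:.1f}: n={count}, predicted={avg_pred:.2f}, observed={observed:.2f}")
```

If the winrates were accurate for human games, the predicted and observed columns would be close in every bin. Running the same table separately on opening, middlegame and endgame positions only requires filtering the samples first.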
Uberdude wrote:It would be interesting to compare the same game with a LeelaZero analysis
My first thought was taking a game between two different bots (like an LZ vs. Golaxy game from earlier) and analyzing it with a third bot (Leela?). :)

Re: Measuring player mistakes versus bots

Posted: Wed Jun 13, 2018 2:01 am
by pnprog
Hi!
Uberdude wrote:This info is basically the raw data behind the win rate delta graph, so if you could somehow dump out the data for the whole game as text/file somewhere that'd be super useful, e.g. a CSV (I added a few bonus columns) like
Move number,Colour,Bot move,Bot winrate,Game move,Game winrate,Bot choice,Policy prob
20,W,h17,54.23,j17,53.5,2,5.12
21,B,h18,46.5,h18,46.5,1,45.32
So I prepared an "analysis kit" pretty similar to what I prepared for Ales already: I call it a "kit" because it includes a copy of Leela 0.11, and can be used to perform batch analysis of SGF files, and conversion of the RSGF files to CSV files.

So inside, there are:
  • A Python file rsgf2csv.py that is used to extract the data from Leela's RSGF files into a CSV file. If you run it directly, it will have you select an RSGF file on your computer, and then create the CSV. For example: mygame.rsgf => mygame.rsgf.csv
  • A minimalist version of GRP that can only be used to perform analysis with Leela. It has been configured to use Leela with these parameters: Leela0110GTP.exe --gtp --noponder --playouts 150000 --nobook and a thinking time of 1000 seconds per move. In fact, Leela does not follow --playouts very strictly, and tends to use many more playouts when she is not sure. But 150000 playouts seems to be her minimum in that case.
  • An empty folder games_to_be_analysed where you can place the SGF files you want to analyse.
  • Two batch files (bash scripts for Linux) that can be run to perform the batch analysis of all SGF files in the games_to_be_analysed folder: one for Leela CPU (batch_analysis_CPU) and one for Leela GPU (batch_analysis_GPU). On Windows, the batch file first has to detect where Python is located on the computer to run the analysis. It works on my Windows computer, but I am not confident it will work on other Windows computers, so let me know.
You can modify the Leela command line by modifying the config.ini file:

Code:

[Leela]
slowcommand = Leela0110GTP.exe
slowparameters = --gtp --noponder --playouts 150000 --nobook
slowtimepermove = 1000
fastcommand = Leela0110GTP_OpenCL.exe
fastparameters = --gtp --noponder --playouts 150000 --nobook
fasttimepermove = 1000
slowparameters is for the CPU analysis, and fastparameters for the GPU analysis.
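
For illustration only, this is how such a file could be turned into a command line in Python 3. It is not GRP's own code (GRP runs on Python 2.7, and how it actually reads the file is not shown here); the function name leela_command and the idea of selecting keys by a "slow"/"fast" prefix are assumptions.

```python
import configparser
import shlex

# Same content as the config.ini shown above.
CONFIG_TEXT = """
[Leela]
slowcommand = Leela0110GTP.exe
slowparameters = --gtp --noponder --playouts 150000 --nobook
slowtimepermove = 1000
fastcommand = Leela0110GTP_OpenCL.exe
fastparameters = --gtp --noponder --playouts 150000 --nobook
fasttimepermove = 1000
"""

def leela_command(config_text, profile):
    """Return (argv list, seconds per move) for the 'slow' or 'fast' profile."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser["Leela"]
    # Split the parameter string the way a shell would, keeping each flag separate.
    argv = [section[profile + "command"]] + shlex.split(section[profile + "parameters"])
    return argv, section.getint(profile + "timepermove")

argv, seconds = leela_command(CONFIG_TEXT, "slow")
print(argv, seconds)
```

Changing --playouts or the time per move for every future batch run then only means editing one line of config.ini.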

If you want to perform the analysis only on a subset of moves, you can edit batch_analysis_CPU/GPU and add the --range parameter to the GRP command line. For example:

Code:

for /f "delims=" %%i in ('Assoc .py') do set filetype=%%i
set filetype=%filetype:~4% 
echo filetype for .py files: %filetype%

for /f "delims=" %%i in ('Ftype %filetype%') do set pythonexe=%%i
set pythonexe=%pythonexe:~12,-7%

echo path to python interpreter: %pythonexe%

for %%f in (games_to_be_analysed/*.sgf) do (
	%pythonexe% leela_analysis.py --profil=slow --range="30-1000" "games_to_be_analysed/%%~nf.sgf"
)

for %%f in (games_to_be_analysed/*.rsgf) do (
	%pythonexe% rsgf2csv.py "games_to_be_analysed/%%~nf.rsgf"
)

echo ==================
echo Analysis completed

pause
In the above example, %pythonexe% leela_analysis.py --profil=slow --range="30-1000" "games_to_be_analysed/%%~nf.sgf" will make Leela skip the analysis of moves before 30 and after 1000, so the opening won't be analysed.

At the moment, the main drawback is that it requires Python 2.7 to be installed on the computer. For Mac users, I think the Linux version can be used, but the Leela executables need to be replaced by macOS executables, and the names of the executables have to be updated in config.ini.

Please give it a try and let me know if it works, or how it can be improved.

Edit: in that "kit", I also set GRP to save up to 361 variations. This way, one can be sure no information is discarded. The --nobook parameter prevents Leela from using her joseki dictionary to play the opening, so she is forced to think about all moves, including during the opening. I deliver all this together in a zip to help make this analysis repeatable: if more people want to help analyse a big volume of data by sharing their computing power, it's easy to just distribute this zip file so that everybody is analysing in conditions as similar as possible to everybody else.

Re: Measuring player mistakes versus bots

Posted: Wed Jun 13, 2018 5:08 am
by pnprog
Uberdude wrote:I'm also thinking that we should also analyse the games with GnuGo, and any move on which GnuGo agrees with the human and the strong bot should be discarded from the analysis as an obvious move with little information. This should help mitigate the "this was a simple game with many obvious forced moves so will be more similar to the bot" problem.
This can also be done with GRP, because GnuGo has a command to produce its 10 preferred moves (maybe one could modify the source code of GnuGo to get more moves). That is what GRP does when using GnuGo to perform an analysis.

I made a quick proof of concept using the controversial game from PGETC. I enclose the CSV file.
Attachment: WWIWTFDSGS.rsgf.csv.zip (1.03 KiB)
The column Bot choice indicates the rank of the game move among GnuGo's preferred moves. So a rank of 1 means that GnuGo would have played the same move. When the rank is ">10", the move is not among GnuGo's top 10 moves.

I calculated the average rank for both players (using rank=11 when rank>10), and both are between 6 and 7 on average.
23/83 moves by Black correspond to GnuGo's first choice.
14/82 moves by White correspond to GnuGo's first choice.
Both players played exactly 48 moves inside GnuGo's top 10 moves.
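
For anyone who wants to redo this kind of count on their own CSV, a minimal Python sketch follows. The column names Colour and Bot choice follow the CSV layout discussed earlier in the thread, but whether the attached file uses exactly these headers is an assumption, and the six sample rows are invented (they are not taken from the PGETC game).

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; a real file would be opened with open(path, newline="").
SAMPLE_CSV = """Move number,Colour,Game move,Bot choice
1,B,q16,1
2,W,d4,3
3,B,q3,>10
4,W,d16,1
5,B,c6,5
6,W,f3,>10
"""

def rank_summary(csv_text, missing_rank=11):
    """Per colour: move count, average rank, first-choice count, top-10 count."""
    totals = defaultdict(lambda: {"moves": 0, "rank_sum": 0, "first_choice": 0, "top10": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw = row["Bot choice"].strip()
        in_top10 = raw != ">10"
        rank = int(raw) if in_top10 else missing_rank  # count ">10" as rank 11
        entry = totals[row["Colour"]]
        entry["moves"] += 1
        entry["rank_sum"] += rank
        if rank == 1:
            entry["first_choice"] += 1
        if in_top10:
            entry["top10"] += 1
    return {
        colour: {
            "moves": e["moves"],
            "average_rank": e["rank_sum"] / e["moves"],
            "first_choice": e["first_choice"],
            "top10": e["top10"],
        }
        for colour, e in totals.items()
    }

summary = rank_summary(SAMPLE_CSV)
for colour, stats in sorted(summary.items()):
    print(colour, stats)
```

Counting ">10" as 11 underestimates the true rank of those moves, so the average is only a lower bound on how far the player strayed from GnuGo's preferences.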

Re: Measuring player mistakes versus bots

Posted: Wed Jun 13, 2018 5:17 am
by Bojanic
Pnprog,

First, I would like to thank you for GRP. It is excellent software, great work!

----

On the topic, you cannot simply count a player's moves that correspond to GnuGo's.
You can have atari, peep, joseki - and all those moves would probably be answered with the best choice by any player.

It is necessary to focus on important moves, move sequences, etc.
Simple statistics are not good enough.

Re: Measuring player mistakes versus bots

Posted: Wed Jun 13, 2018 7:50 am
by pnprog
Bojanic wrote:On the topic, you cannot simply count a player's moves that correspond to GnuGo's.
You can have atari, peep, joseki - and all those moves would probably be answered with the best choice by any player.

It is necessary to focus on important moves, move sequences, etc.
Simple statistics are not good enough.
Haha, I am just some guy who can make tools that could be useful for you to test your hypotheses or perform analysis :)

So I am trying to stay "neutral" on the existing PGETC case, and I won't embark on trying to develop a method to solve future cases.

But if you have some ideas that you want to apply to a large set of data, and it's too much work (and too error-prone) to do by hand, then I would be happy to help :salute:

The above was just a "proof of concept" of the sort of data that could be extracted from GnuGo, as was mentioned by Uberdude. If some of you believe it could be a useful tool in itself, then I will release it in a form that is easy for you to use.
Bojanic wrote:You can have atari, peep, joseki - and all those moves would probably be answered with the best choice by any player
On this specific question, one way to differentiate between an important move and an urgent move would be, with Leela:
  • Check if Leela only proposes one move: this is a strong indicator that this is a do-or-die move.
  • Check the drop in win rate between the top move and the second-best move. If the top move has a 51% win rate, and the second-best move only has a 15% win rate, this also indicates a forced move.
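
The two checks above can be sketched in a few lines of Python. The function name looks_forced and the 20-point threshold are assumptions chosen for illustration; the right threshold would have to be found by looking at real games.

```python
def looks_forced(candidate_winrates, min_gap=20.0):
    """Guess whether a position has a forced move.

    candidate_winrates: win rates (in percent) of the moves the bot proposes,
    in any order. min_gap is a hypothetical threshold in percentage points.
    """
    if not candidate_winrates:
        raise ValueError("need at least one candidate move")
    ordered = sorted(candidate_winrates, reverse=True)
    if len(ordered) == 1:
        return True  # the bot sees only one move worth considering
    # Large drop between the best and second-best move.
    return ordered[0] - ordered[1] >= min_gap

print(looks_forced([51.0]))              # single candidate
print(looks_forced([51.0, 15.0]))        # 36-point drop
print(looks_forced([51.0, 49.5, 30.0]))  # two nearly equal answers
```

Moves flagged this way could then be left out before comparing a player's moves with the bot's, since agreeing on them carries little information.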

Re: Measuring player mistakes versus bots

Posted: Wed Jun 13, 2018 8:59 am
by Bojanic
pnprog wrote: On this specific question, one way to differentiate between an important move and an urgent move would be, with Leela:
  • Check if Leela only proposes one move: this is a strong indicator that this is a do-or-die move.
  • Check the drop in win rate between the top move and the second-best move. If the top move has a 51% win rate, and the second-best move only has a 15% win rate, this also indicates a forced move.
It could be helpful, but some analysis would be needed.
For example, in one game I have seen a forced move with two answers, both good.
In other cases, someone might choose not to answer a peep, or to play another move nearby.

Re: Measuring player mistakes versus bots

Posted: Sun Jun 17, 2018 6:30 am
by pnprog
Hi!

In some other thread, you mentioned that the PGETC games also have time records. This is also something that could be extracted together with the other information, in its own column.

Re: Measuring player mistakes versus bots

Posted: Mon Jun 18, 2018 2:08 am
by pnprog
That's me again!

I was thinking about something that maybe would work, but would be a lot of work to implement:

Basically, it would consist of training a set of policy networks, each one corresponding to a specific level of play (3k, 2k, 1k, 1d, 2d, 3d...).

<Edit> to be clear, I am not proposing to train a bot, only a policy network. Not something that can play Go: no playouts, no tree search, no MC rollouts, no value network...</Edit>

A policy network, as I understand it, was developed by DeepMind for their first version of AlphaGo by showing it games of strong amateur players downloaded from the internet. This policy network was used to indicate, for a specific game position, what moves a strong amateur would play. This reduced the number of moves AlphaGo had to evaluate (evaluation being done with the value network and Monte Carlo rollouts). Later they used AlphaGo vs. AlphaGo games to further improve their policy network.

So, we could try to train one policy network using ~2k players' games, then another one using ~1k players' games, then another one using ~1d players' games, and so on.

Note that we don't really care what level a policy network is labelled with (1k, or 3d); we only need them to be in increasing order of strength, and ideally at regular intervals. We could classify them using Elo or simply A, B, C...

With such a set of policy networks, we could evaluate how the moves of one player in his game correlate with each of our policy networks, and draw a chart. One could expect this chart to peak at the policy network closest to this player level.
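
As a sketch of what that chart could be built from, assume each network can report the probability it gave to the move the player actually played. The numbers below are invented, and using the mean log-probability as the "correlation" score is my own choice here, not something settled in this thread.

```python
import math

def mean_log_likelihood(move_probs, floor=1e-6):
    """Average log-probability a network assigned to the moves actually played."""
    # The floor avoids log(0) when a network gives a move zero probability.
    return sum(math.log(max(p, floor)) for p in move_probs) / len(move_probs)

def level_profile(probs_by_level):
    """Map each level label to its score for one game; higher means a better fit."""
    return {level: mean_log_likelihood(probs) for level, probs in probs_by_level.items()}

# Hypothetical probabilities for four moves of one player, one list per network.
probs_by_level = {
    "2k": [0.05, 0.10, 0.02, 0.20],
    "1k": [0.10, 0.15, 0.05, 0.25],
    "1d": [0.30, 0.25, 0.20, 0.40],
    "2d": [0.15, 0.20, 0.10, 0.30],
}

profile = level_profile(probs_by_level)
best_level = max(profile, key=profile.get)
for level, score in profile.items():
    print(f"{level}: {score:.3f}")
print("best fitting level:", best_level)
```

Plotting the scores in level order gives the chart described above, and the peak is the level whose network finds the player's moves least surprising.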

Then, by comparing those charts for different games, we could tell that in a particular game a player did not play at his usual level.

The difficult part would be to gather enough games for training, games from players with stable level, and have those games classified by level...

One way to do that could be to work with Go servers, more specifically with the players they use as anchors.
Now, they probably won't want to disclose publicly which players are used as anchors, but maybe this could be done under a non-disclosure agreement. Or maybe they could disclose this information when the anchor is removed. Then we could download his games from the period when he was an anchor.
Or maybe we could collaborate with Go servers to get statistics on which players have a very high rating confidence.

Once we get enough games to train our policy networks, it also opens up all sorts of possibilities regarding the rating of players or their games (like, one could finally get to know the equivalence of ranks among Go servers).