Measuring player mistakes versus bots

General conversations about Go belong here.
Kirby
Honinbo
Posts: 9553
Joined: Wed Feb 24, 2010 6:04 pm
GD Posts: 0
KGS: Kirby
Tygem: 커비라고해
Has thanked: 1583 times
Been thanked: 1707 times

Re: Measuring player mistakes versus bots

Post by Kirby »

Uberdude wrote:I'm also thinking that we should also analyse the games with GnuGo, and any move which GnuGo agrees with the human and the strong bot be discarded from the analysis as an obvious move with little information. This should help mitigate the "this was a simple game with many obvious forced moves so will be more similar to the bot" problem.
I'm skeptical that similarity (or lack of similarity) with GnuGo will provide useful information in regard to the analysis.
be immersed
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: Measuring player mistakes versus bots

Post by moha »

dfan wrote:To be clearer, what I meant with that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability starting from the position in question".
This assumption also seems false. The winrate approximates the probability of the given bot winning against itself starting from the position. That is how it was trained, but it can be significantly different from the human winrate due to different playstyles. In fact, a 2% drop in (bot) winrate may even be a 1% gain in (human) winrate.

This is another reason to go for expected scores instead of winrates, although it is also possible to train a net specifically to predict the human winrate (maybe with a strength parameter).
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: Measuring player mistakes versus bots

Post by Bill Spight »

moha wrote:
dfan wrote:To be clearer, what I meant with that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability starting from the position in question".
This assumption also seems false. The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained,
Are you sure about that? In that case it would be easy to produce margin-of-error statistics, which, IIUC, are not given. Another reason I suspect that the winrates were not calculated that way is that doing so would take a lot of time, and would not be necessary to improve the ability of the bot. A further reason is that moves are chosen based upon number of visits, not winrates, or not only upon winrates.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Tryss
Lives in gote
Posts: 502
Joined: Tue May 24, 2011 1:07 pm
Rank: KGS 2k
GD Posts: 100
KGS: Tryss
Has thanked: 1 time
Been thanked: 153 times

Re: Measuring player mistakes versus bots

Post by Tryss »

Bill Spight wrote:
moha wrote:
dfan wrote:To be clearer, what I meant with that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability starting from the position in question".
This assumption also seems false. The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained,
Are you sure about that? In that case it would be easy to produce margin of error statistics, which, IIUC, are not given.
No, it's not easy, because the reported winrate is largely derived from the network's value evaluation, and there is no easy way to obtain a margin of error for those numbers.
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: Measuring player mistakes versus bots

Post by moha »

Bill Spight wrote:
moha wrote:The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained,
Are you sure about that? In that case it would be easy to produce margin of error statistics, which, IIUC, are not given.
Consider the training method: from zillions of positions taken from zillions of selfplay games, the value head is trained with a loss function based on the difference between its current output and the actual outcome (+1/-1). I'm not sure about error statistics; I agree those could be produced. Maybe nobody was interested enough to collect them?

It would not be that easy, though, since it would need a separate test game set (the loss IS decreasing/disappearing on the training set, of course, but that doesn't necessarily mean better predictions on a different set, as the danger of overfitting is higher for the value head than for the policy).
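The training setup moha describes can be sketched in a few lines. This is an illustrative toy, not AlphaZero's or Leela's actual code: the "value head" here is just a tanh of a linear model over made-up features, trained by gradient descent on the squared difference between its output and a +1/-1 outcome, with a separate test set to show where real accuracy would be measured.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the value head: tanh of a linear model over position
# features. Real bots use a deep net over board planes; this only
# illustrates the loss moha describes (output vs. +1/-1 outcome).
n_features = 32
true_w = rng.normal(size=n_features)

def make_games(n):
    X = rng.normal(size=(n, n_features))
    # Outcome is +1/-1, noisily related to the features.
    y = np.sign(X @ true_w + rng.normal(scale=2.0, size=n))
    return X, y

X_train, y_train = make_games(2000)
X_test, y_test = make_games(2000)   # a separate set, as moha notes is needed

w = np.zeros(n_features)
lr = 0.01
for _ in range(200):
    pred = np.tanh(X_train @ w)      # squashed to (-1, 1) like a value head
    # Gradient of mean squared error through the tanh.
    grad = X_train.T @ ((pred - y_train) * (1 - pred**2)) / len(y_train)
    w -= lr * grad

train_loss = np.mean((np.tanh(X_train @ w) - y_train) ** 2)
test_loss = np.mean((np.tanh(X_test @ w) - y_test) ** 2)
print(train_loss, test_loss)
```

The training loss drops well below its starting value of 1.0, but only the held-out loss says anything about accuracy on new positions, which is the overfitting danger mentioned above.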
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: Measuring player mistakes versus bots

Post by Bill Spight »

Bill Spight wrote:
moha wrote:
dfan wrote:To be clearer, what I meant with that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability starting from the position in question".
This assumption also seems false. The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained,
Are you sure about that? In that case it would be easy to produce margin of error statistics, which, IIUC, are not given.
Tryss wrote:No, it's not easy, because the reported winrate is largely derived from the network's value evaluation. And there is no easy way to get the margin of error of these numbers.
That's my point. :)
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: Measuring player mistakes versus bots

Post by Bill Spight »

moha wrote:
Bill Spight wrote:
moha wrote:The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained,
Are you sure about that? In that case it would be easy to produce margin of error statistics, which, IIUC, are not given.
Consider the training method: from zillions of positions taken from zillions of selfplay games the value head is trained with a loss function that is the difference of its current output and the actual outcome (1/-1).
Isn't that a form of reinforcement learning? You don't need accurate winrates for that to work.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
dfan
Gosei
Posts: 1598
Joined: Wed Apr 21, 2010 8:49 am
Rank: AGA 2k Fox 3d
GD Posts: 61
KGS: dfan
Has thanked: 891 times
Been thanked: 534 times
Contact:

Re: Measuring player mistakes versus bots

Post by dfan »

moha wrote:
dfan wrote:To be clearer, what I meant with that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability starting from the position in question".
This assumption also seems false. The winrate approximates the probability of the given bot winning against itself starting from the position. That is how it was trained, but it can be significantly different from the human winrate due to different playstyles. In fact, a 2% drop in (bot) winrate may even be a 1% gain in (human) winrate.
OK. This is all incidental to the actual point I was trying to make anyway, which has now gotten lost in the noise, so I'm just going to drop it.
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: Measuring player mistakes versus bots

Post by moha »

Bill Spight wrote:
moha wrote:Consider the training method: from zillions of positions taken from zillions of selfplay games the value head is trained with a loss function that is the difference of its current output and the actual outcome (1/-1).
Isn't that a form of reinforcement learning? You don't need accurate winrates for that to work.
It's closer to supervised than to "real" reinforcement learning (the selfplay cycle makes it a bit different: net -> selfplay -> new net). And the winrates will be pretty "accurate" in a sense, since the network is trained until the loss diminishes, at which point it will output reasonable values - in the positions it was trained on. Hence the need for a different test set if you are interested in its real accuracy.

Or one could actually run hundreds of selfplays from hundreds of chosen test positions. To go back to dfan's original assumption: you could also do the same with human games starting from chosen test positions and collect the accuracy statistics.

Edit: I somehow missed your comment about move selection / number of visits. What I wrote applies to the value net alone; when strengthened with search, it will most often use an average of the value evaluations at the leaves starting with each move candidate. And selecting on number of visits converges to selecting on average value, since higher-valued candidates get more future visits (either reducing the average if a refutation is found, or increasing the visit counts).

It's true this would work even with inaccurate values/winrates, provided their ordering is at least reasonably good. But the above sampling tests still seem possible. And by the way, if the nets were much faster, policy-net-based rollouts (almost real winrates) would be used for the evaluation.
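The convergence moha describes (most-visited converges to best-valued) can be seen even in a stripped-down toy: UCB-style selection over a handful of moves with made-up true winrates, no tree and no policy priors, which real MCTS of course has.

```python
import math, random

random.seed(1)

# Toy illustration: UCB-style selection sends more visits to higher-valued
# children, so "pick the most-visited move" converges to "pick the
# best-valued move". The true child winrates are invented for the demo.
true_winrates = [0.42, 0.55, 0.48]
visits = [0, 0, 0]
value_sum = [0.0, 0.0, 0.0]

def ucb(i, total, c=1.0):
    if visits[i] == 0:
        return float("inf")   # try every child at least once
    mean = value_sum[i] / visits[i]
    return mean + c * math.sqrt(math.log(total + 1) / visits[i])

for t in range(5000):
    i = max(range(3), key=lambda k: ucb(k, t))
    visits[i] += 1
    # Simulated playout result: win with the child's true probability.
    value_sum[i] += 1.0 if random.random() < true_winrates[i] else 0.0

most_visited = max(range(3), key=lambda k: visits[k])
print(visits, most_visited)
```

As the visit counts grow, the exploration bonus shrinks and the highest-valued child accumulates the bulk of the visits, even though the individual playout results are noisy.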
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: Measuring player mistakes versus bots

Post by Bill Spight »

Anyway, we can test the winrates by bot vs. bot self play ourselves. :)
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Bojanic
Lives with ko
Posts: 142
Joined: Fri May 06, 2011 1:35 pm
Rank: 5 dan
GD Posts: 0
Has thanked: 27 times
Been thanked: 89 times

Re: Measuring player mistakes versus bots

Post by Bojanic »

Go Review Partner can analyze an entire game using a selection of bots.
After the analysis, it can produce a histogram which shows deviations from the bot's play.
It is not direct proof of similarity: of course joseki would be similar, as would the opening and even close fighting.
But if a player has a long game similar to Leela, that is cause for further examination.

Here is a histogram of one game between European pros.
Red bars are deviations from Leela's move (it considers them bad), and green bars are better moves.
Attachments
QIQJWEPNSE.png
QIQJWEPNSE.png (22.23 KiB) Viewed 9594 times
Uberdude
Judan
Posts: 6727
Joined: Thu Nov 24, 2011 11:35 am
Rank: UK 4 dan
GD Posts: 0
KGS: Uberdude 4d
OGS: Uberdude 7d
Location: Cambridge, UK
Has thanked: 436 times
Been thanked: 3718 times

Re: Measuring player mistakes versus bots

Post by Uberdude »

It would be interesting to compare the same game with a LeelaZero analysis: when I was reviewing one of Ilya Shikshin's games with Leela 0.11, it often didn't like or expect his moves. As a 4d I thought it was sometimes right that they were bad, but sometimes I think his moves were actually better (and indeed sometimes Leela would then like them when shown them, a point pnprog recently explained). As LZ is more strongly opinionated I would expect more red overall, but maybe some of those bars would be relatively smaller. Of course sometimes even the Euro pros do just play pretty badly ;-) .
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: Measuring player mistakes versus bots

Post by moha »

Bill Spight wrote:Anyway, we can test the winrates by bot vs. bot self play ourselves. :)
This is what I was suggesting. And for their accuracy in human games you may not even need the mentioned hundreds of special games from chosen positions: just take a large human database, get the bot's prediction (both raw net and search result) on a chosen sample of positions, then calculate the overall correlation with outcomes. You could even do this separately for opening/middlegame/endgame positions (or for various winrate ranges).
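A calibration check along these lines can be sketched with synthetic data. Everything here is invented for illustration: the "bot" predictions are random numbers, and the assumed human winrate is a simple compression of the bot winrate toward 50% (standing in for the playstyle mismatch discussed above). In a real test the predictions would come from the net/search on positions sampled from a game database.

```python
import random

random.seed(0)

# Sketch of moha's correlation/calibration test with synthetic data:
# compare predicted winrates against actual game outcomes, bucketed by
# prediction range.
n = 20000
records = []
for _ in range(n):
    p_bot = random.random()              # bot's predicted winrate
    # Assumption for the demo: humans convert advantages less sharply,
    # so the true human winrate is compressed toward 0.5.
    p_human = 0.5 + 0.8 * (p_bot - 0.5)
    outcome = 1 if random.random() < p_human else 0
    records.append((p_bot, outcome))

# Bucket predictions into 10% ranges and compare to observed win frequency.
buckets = [[] for _ in range(10)]
for p, o in records:
    buckets[min(int(p * 10), 9)].append(o)

for i, b in enumerate(buckets):
    if b:
        print(f"predicted {i*10}-{i*10+10}%: observed {sum(b)/len(b):.2f}")
```

With the compression assumption built in, the observed frequencies in the extreme buckets fall noticeably short of the predictions, which is exactly the kind of systematic miscalibration such a test would reveal on human games.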
Uberdude wrote:It would be interesting to compare the same game with a LeelaZero analysis
My first thought was taking a game between two different bots (like an LZ vs. Golaxy game from earlier) and analyzing it with a third bot (Leela?). :)
pnprog
Lives with ko
Posts: 286
Joined: Thu Oct 20, 2016 7:21 am
Rank: OGS 7 kyu
GD Posts: 0
Has thanked: 94 times
Been thanked: 153 times

Re: Measuring player mistakes versus bots

Post by pnprog »

Hi!
Uberdude wrote:This info is basically the raw data behind the win rate delta graph, so if you could somehow dump out the data for the whole game as text/file somewhere that'd be super useful, e.g. a CSV (I added a few bonus columns) like
Move number,Colour,Bot move,Bot winrate,Game move,Game winrate,Bot choice,Policy prob
20,W,h17,54.23,j17,53.5,2,5.12
21,B,h18,46.5,h18,46.5,1,45.32
So I prepared an "analysis kit" pretty similar to the one I prepared for Ales already. I call it a "kit" because it includes a copy of Leela 0.11, and it can be used to perform batch analysis of SGF files and conversion of the RSGF files to CSV files.

So inside, there are:
  • A Python file rsgf2csv.py that is used to extract the data from Leela's RSGF files into a CSV file. If you run it directly, it will have you select an RSGF file on your computer, and then create the CSV. For example: mygame.rsgf => mygame.rsgf.csv
  • A minimalist version of GRP, which can only be used to perform analysis with Leela. It has been configured to use Leela with these parameters: Leela0110GTP.exe --gtp --noponder --playouts 150000 --nobook and a thinking time of 1000 seconds per move. In fact, Leela does not follow --playouts very strictly, and tends to use many more playouts when she is not sure. But 150000 playouts seems to be her minimum in that case.
  • An empty folder games_to_be_analysed where you can place the SGF files you want to analyse.
  • Two batch files (bash scripts for Linux) that can be run to perform the batch analysis of all SGF files in the games_to_be_analysed folder: one for Leela CPU (batch_analysis_CPU), and one for Leela GPU (batch_analysis_GPU). On Windows, the batch file first has to detect where Python is located on the computer in order to run the analysis. It's working on my Windows computer, but I am not so confident it would work on other Windows computers; let me know.
You can modify the Leela command line by modifying the config.ini file:

Code: Select all

[Leela]
slowcommand = Leela0110GTP.exe
slowparameters = --gtp --noponder --playouts 150000 --nobook
slowtimepermove = 1000
fastcommand = Leela0110GTP_OpenCL.exe
fastparameters = --gtp --noponder --playouts 150000 --nobook
fasttimepermove = 1000
slowparameters is for the CPU analysis, and fastparameters for the GPU analysis.

If you want to perform the analysis only on a subset of moves, you can modify the batch_analysis_CPU/GPU to modify the GRP command line by adding the --range parameter. For example:

Code: Select all

for /f "delims=" %%i in ('Assoc .py') do set filetype=%%i
set filetype=%filetype:~4% 
echo filetype for .py files: %filetype%

for /f "delims=" %%i in ('Ftype %filetype%') do set pythonexe=%%i
set pythonexe=%pythonexe:~12,-7%

echo path to python interpreter: %pythonexe%

for %%f in (games_to_be_analysed/*.sgf) do (
	%pythonexe% leela_analysis.py --profil=slow --range="30-1000" "games_to_be_analysed/%%~nf.sgf"
)

for %%f in (games_to_be_analysed/*.rsgf) do (
	%pythonexe% rsgf2csv.py "games_to_be_analysed/%%~nf.rsgf"
)

echo ==================
echo Analysis completed

pause
In the above example, %pythonexe% leela_analysis.py --profil=slow --range="30-1000" "games_to_be_analysed/%%~nf.sgf" will make Leela skip the analysis of moves before 30 and after 1000, so the opening won't be analysed.

At the moment, the main drawback is that it requires Python 2.7 to be installed on the computer. For Mac users, I think the Linux version can be used, but the Leela executables need to be replaced by macOS executables, and the names of the executables have to be updated in config.ini.

Please have a try and let me know if it works, or can be improved.

Edit: in that "kit", I also set GRP to save up to 361 variations. This way, one can be sure no information is discarded. The --nobook parameter prevents Leela from using her joseki dictionary to play the opening, so she is forced to think about all moves, including during the opening. I deliver all this together in a zip to help make this analysis repeatable: if more people want to help analyse large volumes of data by sharing their computing power, it's easy to just distribute this zip file so that everybody's analysis conditions are as similar as possible to everybody else's.
I am the author of GoReviewPartner, a small software aimed at assisting reviewing a game of Go. Give it a try!
pnprog
Lives with ko
Posts: 286
Joined: Thu Oct 20, 2016 7:21 am
Rank: OGS 7 kyu
GD Posts: 0
Has thanked: 94 times
Been thanked: 153 times

Re: Measuring player mistakes versus bots

Post by pnprog »

Uberdude wrote:I'm also thinking that we should also analyse the games with GnuGo, and any move which GnuGo agrees with the human and the strong bot be discarded from the analysis as an obvious move with little information. This should help mitigate the "this was a simple game with many obvious forced moves so will be more similar to the bot" problem.
This can also be performed with GRP, because GnuGo has a command to produce its 10 preferred moves (maybe one could modify the source code of GnuGo to get more moves). And that is what GRP does when using GnuGo to perform an analysis.

I made a quick proof of concept using the controversial game from PGETC. I enclose the CSV file.
WWIWTFDSGS.rsgf.csv.zip
(1.03 KiB) Downloaded 403 times
The column Bot choice indicates the rank of the game move among GnuGo's preferred moves. So a rank of 1 means that GnuGo would have played the same move. When the rank indicates ">10", it means the move is not among GnuGo's best 10 moves.

I calculated the average rank for both players (using rank=11 when rank>10) and both are between 6 and 7 on average.
23/83 moves by Black correspond to GnuGo's first choice.
14/82 moves by White correspond to GnuGo's first choice.
Both players played exactly 48 moves inside GnuGo's top 10 moves.
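Computing these statistics from the CSV is straightforward. Here is a sketch using a hypothetical three-row excerpt in the column format Uberdude proposed (the rows are invented, not from the actual PGETC game), with ">10" mapped to rank 11 for averaging as described above.

```python
import csv
import io

# Hypothetical excerpt in the proposed CSV format; the "Bot choice" column
# is the bot's rank for the move actually played (">10" = outside top 10).
csv_text = """Move number,Colour,Bot move,Bot winrate,Game move,Game winrate,Bot choice,Policy prob
20,W,h17,54.23,j17,53.5,2,5.12
21,B,h18,46.5,h18,46.5,1,45.32
22,W,c3,50.1,d5,49.0,>10,1.02
"""

def rank_value(s):
    # Convention used above: treat ">10" as rank 11 when averaging.
    return 11 if s.strip() == ">10" else int(s)

rows = list(csv.DictReader(io.StringIO(csv_text)))
ranks = {"B": [], "W": []}
for row in rows:
    ranks[row["Colour"]].append(rank_value(row["Bot choice"]))

for colour, rs in ranks.items():
    avg = sum(rs) / len(rs)
    first = sum(1 for r in rs if r == 1)
    print(f"{colour}: average rank {avg}, {first}/{len(rs)} first-choice matches")
```

Running the same script over the full per-game CSVs would reproduce the average-rank and first-choice-match figures quoted above for each player.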
I am the author of GoReviewPartner, a small software aimed at assisting reviewing a game of Go. Give it a try!