Measuring player mistakes versus bots

General conversations about Go belong here.
Kirby
Honinbo
Posts: 9553
Joined: Wed Feb 24, 2010 6:04 pm
GD Posts: 0
KGS: Kirby
Tygem: 커비라고해
Has thanked: 1583 times
Been thanked: 1707 times

Re: Measuring player mistakes versus bots

Post by Kirby »

Uberdude wrote:I'm also thinking that we should also analyse the games with GnuGo, and any move which GnuGo agrees with the human and the strong bot be discarded from the analysis as an obvious move with little information. This should help mitigate the "this was a simple game with many obvious forced moves so will be more similar to the bot" problem.
I'm skeptical that similarity (or lack of similarity) with GnuGo will provide useful information in regard to the analysis.
be immersed
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: Measuring player mistakes versus bots

Post by moha »

dfan wrote:To be clearer, what I meant with that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability starting from the position in question".
This assumption also seems false. The winrate approximates the probability of the given bot winning against itself starting from the position. That is how it was trained, but it can be significantly different from the human winrate due to different playstyles. In fact, a 2% drop in (bot) winrate may even be a 1% gain in (human) winrate.

This is another reason to go for expected scores instead of winrates, although it is also possible to train a net specifically to predict the human winrate (maybe with a strength parameter).
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: Measuring player mistakes versus bots

Post by Bill Spight »

moha wrote:
dfan wrote:To be clearer, what I meant with that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability starting from the position in question".
This assumption also seems false. The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained,
Are you sure about that? In that case it would be easy to produce margin-of-error statistics, which, IIUC, are not given. Another reason I suspect that the winrates were not calculated that way is that doing so would take a lot of time, and would not be necessary to improve the ability of the bot. A further reason is that moves are chosen based upon number of visits, not winrates, or not only upon winrates.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Tryss
Lives in gote
Posts: 502
Joined: Tue May 24, 2011 1:07 pm
Rank: KGS 2k
GD Posts: 100
KGS: Tryss
Has thanked: 1 time
Been thanked: 153 times

Re: Measuring player mistakes versus bots

Post by Tryss »

Bill Spight wrote:
moha wrote:
dfan wrote:To be clearer, what I meant with that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability starting from the position in question".
This assumption also seems false. The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained,
Are you sure about that? In that case it would be easy to produce margin of error statistics, which, IIUC, are not given.
No, it's not easy, because the reported winrate is largely derived from the network's value evaluation, and there is no easy way to obtain a margin of error for those numbers.
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: Measuring player mistakes versus bots

Post by moha »

Bill Spight wrote:
moha wrote:The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained,
Are you sure about that? In that case it would be easy to produce margin of error statistics, which, IIUC, are not given.
Consider the training method: from zillions of positions taken from zillions of selfplay games, the value head is trained with a loss function based on the difference between its current output and the actual outcome (+1/-1). I'm not sure about error statistics; I agree those could be produced. Maybe nobody was interested enough to collect them?

It would not be that easy, though, since it would need a separate test game set (the loss IS decreasing/disappearing on the training set, of course, but that doesn't necessarily mean better predictions on a different set, as the danger of overfitting is higher for the value head than for the policy).
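The training setup moha describes can be sketched in a few lines. This is an illustrative toy, not AlphaZero's or Leela's actual code: the "value head" here is just a tanh of a linear model over made-up features, trained by gradient descent on the squared difference between its output and a +1/-1 outcome, with a separate test set to show where real accuracy would be measured.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the value head: tanh of a linear model over position
# features. Real bots use a deep net over board planes; this only
# illustrates the loss moha describes (output vs. +1/-1 outcome).
n_features = 32
true_w = rng.normal(size=n_features)

def make_games(n):
    X = rng.normal(size=(n, n_features))
    # Outcome is +1/-1, noisily related to the features.
    y = np.sign(X @ true_w + rng.normal(scale=2.0, size=n))
    return X, y

X_train, y_train = make_games(2000)
X_test, y_test = make_games(2000)   # a separate set, as moha notes is needed

w = np.zeros(n_features)
lr = 0.01
for _ in range(200):
    pred = np.tanh(X_train @ w)      # squashed to (-1, 1) like a value head
    # Gradient of mean squared error through the tanh.
    grad = X_train.T @ ((pred - y_train) * (1 - pred**2)) / len(y_train)
    w -= lr * grad

train_loss = np.mean((np.tanh(X_train @ w) - y_train) ** 2)
test_loss = np.mean((np.tanh(X_test @ w) - y_test) ** 2)
print(train_loss, test_loss)
```

The training loss drops well below its starting value of 1.0, but only the held-out loss says anything about accuracy on new positions, which is the overfitting danger mentioned above.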
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: Measuring player mistakes versus bots

Post by Bill Spight »

Bill Spight wrote:
moha wrote:
dfan wrote:To be clearer, what I meant with that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability starting from the position in question".
This assumption also seems false. The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained,
Are you sure about that? In that case it would be easy to produce margin of error statistics, which, IIUC, are not given.
Tryss wrote:No, it's not easy, because the reported winrate is largely derived from the network's value evaluation. And there is no easy way to get the margin of error of these numbers.
That's my point. :)
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: Measuring player mistakes versus bots

Post by Bill Spight »

moha wrote:
Bill Spight wrote:
moha wrote:The winrate approximates the probability of the given bot winning against itself starting from the position. This is how it was trained,
Are you sure about that? In that case it would be easy to produce margin of error statistics, which, IIUC, are not given.
Consider the training method: from zillions of positions taken from zillions of selfplay games the value head is trained with a loss function that is the difference of its current output and the actual outcome (1/-1).
Isn't that a form of reinforcement learning? You don't need accurate winrates for that to work.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
dfan
Gosei
Posts: 1598
Joined: Wed Apr 21, 2010 8:49 am
Rank: AGA 2k Fox 3d
GD Posts: 61
KGS: dfan
Has thanked: 891 times
Been thanked: 534 times
Contact:

Re: Measuring player mistakes versus bots

Post by dfan »

moha wrote:
dfan wrote:To be clearer, what I meant with that phrase was "if you assume that the win rate accurately represents the probability that a human would win a game against another human of equal ability starting from the position in question".
This assumption also seems false. The winrate approximates the probability of the given bot winning against itself starting from the position. That is how it was trained, but it can be significantly different from the human winrate due to different playstyles. In fact, a 2% drop in (bot) winrate may even be a 1% gain in (human) winrate.
OK. This is all incidental to the actual point I was trying to make anyway, which has now gotten lost in the noise, so I'm just going to drop it.
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: Measuring player mistakes versus bots

Post by moha »

Bill Spight wrote:
moha wrote:Consider the training method: from zillions of positions taken from zillions of selfplay games the value head is trained with a loss function that is the difference of its current output and the actual outcome (1/-1).
Isn't that a form of reinforcement learning? You don't need accurate winrates for that to work.
It's closer to supervised than to "real" reinforcement learning (the selfplay cycle makes it a bit different: net -> selfplay -> new net). And the winrates will be pretty "accurate" in a sense, since the network is trained until the loss diminishes, at which point it will output reasonable values - in the positions it was trained on. Hence the need for a different test set if you are interested in its real accuracy.

Or one could actually run hundreds of selfplays from hundreds of chosen test positions. To go back to dfan's original assumption: you could also do the same with human games starting from chosen test positions and collect the accuracy statistics.

Edit: I somehow missed your comment about move selection / number of visits. What I wrote applies to the value net alone; when strengthened with search, it will most often use an average of the value evaluations at the leaves starting with each move candidate. And selecting on number of visits converges to selecting on average value, since higher-valued candidates get more future visits (either reducing the average if a refutation is found, or increasing the visit counts).

It's true this would work even with inaccurate values/winrates, provided their ordering is at least reasonably good. But the above sampling tests still seem possible. And by the way, if the nets were much faster, policy-net-based rollouts (almost real winrates) would be used for the evaluation.
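The convergence moha describes (most-visited converges to best-valued) can be seen even in a stripped-down toy: UCB-style selection over a handful of moves with made-up true winrates, no tree and no policy priors, which real MCTS of course has.

```python
import math, random

random.seed(1)

# Toy illustration: UCB-style selection sends more visits to higher-valued
# children, so "pick the most-visited move" converges to "pick the
# best-valued move". The true child winrates are invented for the demo.
true_winrates = [0.42, 0.55, 0.48]
visits = [0, 0, 0]
value_sum = [0.0, 0.0, 0.0]

def ucb(i, total, c=1.0):
    if visits[i] == 0:
        return float("inf")   # try every child at least once
    mean = value_sum[i] / visits[i]
    return mean + c * math.sqrt(math.log(total + 1) / visits[i])

for t in range(5000):
    i = max(range(3), key=lambda k: ucb(k, t))
    visits[i] += 1
    # Simulated playout result: win with the child's true probability.
    value_sum[i] += 1.0 if random.random() < true_winrates[i] else 0.0

most_visited = max(range(3), key=lambda k: visits[k])
print(visits, most_visited)
```

As the visit counts grow, the exploration bonus shrinks and the highest-valued child accumulates the bulk of the visits, even though the individual playout results are noisy.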
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: Measuring player mistakes versus bots

Post by Bill Spight »

Anyway, we can test the winrates by bot vs. bot self play ourselves. :)
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Bojanic
Lives with ko
Posts: 142
Joined: Fri May 06, 2011 1:35 pm
Rank: 5 dan
GD Posts: 0
Has thanked: 27 times
Been thanked: 89 times

Re: Measuring player mistakes versus bots

Post by Bojanic »

Go Review Partner can analyze an entire game using a selection of bots.
After the analysis, it can produce a histogram which shows deviations from the bot's play.
It is not direct proof of similarity: of course joseki would be similar, as would the opening and even close fighting.
But if a player has a long game similar to Leela, that is cause for further examination.

Here is a histogram of one game between European pros.
Red bars are deviations from Leela's move (it considers them bad), and green bars are better moves.
Attachments
QIQJWEPNSE.png
QIQJWEPNSE.png (22.23 KiB) Viewed 9594 times
Uberdude
Judan
Posts: 6727
Joined: Thu Nov 24, 2011 11:35 am
Rank: UK 4 dan
GD Posts: 0
KGS: Uberdude 4d
OGS: Uberdude 7d
Location: Cambridge, UK
Has thanked: 436 times
Been thanked: 3718 times

Re: Measuring player mistakes versus bots

Post by Uberdude »

It would be interesting to compare the same game with a LeelaZero analysis: when I was reviewing one of Ilya Shikshin's games with Leela 0.11, it often didn't like or expect his moves. As a 4d I thought it was sometimes right that they were bad, but sometimes I think his moves were actually better (and indeed sometimes Leela would then like them when shown them, a point pnprog recently explained). As LZ is more strongly opinionated I would expect more red overall, but maybe some of those bars would be relatively smaller. Of course sometimes even the Euro pros do just play pretty badly ;-) .
moha
Lives in gote
Posts: 311
Joined: Wed May 31, 2017 6:49 am
Rank: 2d
GD Posts: 0
Been thanked: 45 times

Re: Measuring player mistakes versus bots

Post by moha »

Bill Spight wrote:Anyway, we can test the winrates by bot vs. bot self play ourselves. :)
This is what I was suggesting. And for their accuracy in human games you may not even need the mentioned hundreds of special games from chosen positions: just take a large human database, get the bot's prediction (both raw net and search result) on a chosen sample of positions, then calculate the overall correlation with outcomes. You could even do this separately for opening/middlegame/endgame positions (or for various winrate ranges).
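A calibration check along these lines can be sketched with synthetic data. Everything here is invented for illustration: the "bot" predictions are random numbers, and the assumed human winrate is a simple compression of the bot winrate toward 50% (standing in for the playstyle mismatch discussed above). In a real test the predictions would come from the net/search on positions sampled from a game database.

```python
import random

random.seed(0)

# Sketch of moha's correlation/calibration test with synthetic data:
# compare predicted winrates against actual game outcomes, bucketed by
# prediction range.
n = 20000
records = []
for _ in range(n):
    p_bot = random.random()              # bot's predicted winrate
    # Assumption for the demo: humans convert advantages less sharply,
    # so the true human winrate is compressed toward 0.5.
    p_human = 0.5 + 0.8 * (p_bot - 0.5)
    outcome = 1 if random.random() < p_human else 0
    records.append((p_bot, outcome))

# Bucket predictions into 10% ranges and compare to observed win frequency.
buckets = [[] for _ in range(10)]
for p, o in records:
    buckets[min(int(p * 10), 9)].append(o)

for i, b in enumerate(buckets):
    if b:
        print(f"predicted {i*10}-{i*10+10}%: observed {sum(b)/len(b):.2f}")
```

With the compression assumption built in, the observed frequencies in the extreme buckets fall noticeably short of the predictions, which is exactly the kind of systematic miscalibration such a test would reveal on human games.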
Uberdude wrote:It would be interesting to compare the same game with a LeelaZero analysis
My first thought was taking a game between two different bots (like an LZ vs. Golaxy game from earlier) and analyzing it with a third bot (Leela?). :)
pnprog
Lives with ko
Posts: 286
Joined: Thu Oct 20, 2016 7:21 am
Rank: OGS 7 kyu
GD Posts: 0
Has thanked: 94 times
Been thanked: 153 times

Re: Measuring player mistakes versus bots

Post by pnprog »

Hi!
Uberdude wrote:This info is basically the raw data behind the win rate delta graph, so if you could somehow dump out the data for the whole game as text/file somewhere that'd be super useful, e.g. a CSV (I added a few bonus columns) like
Move number,Colour,Bot move,Bot winrate,Game move,Game winrate,Bot choice,Policy prob
20,W,h17,54.23,j17,53.5,2,5.12
21,B,h18,46.5,h18,46.5,1,45.32
So I prepared an "analysis kit" pretty similar to the one I prepared for Ales already. I call it a "kit" because it includes a copy of Leela 0.11, and it can be used to perform batch analysis of SGF files and conversion of the RSGF files to CSV files.

So inside, there are:
  • A Python file rsgf2csv.py that is used to extract the data from Leela's RSGF files into a CSV file. If you run it directly, it will have you select an RSGF file on your computer, and then create the CSV. For example: mygame.rsgf => mygame.rsgf.csv
  • A minimalist version of GRP, which can only be used to perform analysis with Leela. It has been configured to use Leela with these parameters: Leela0110GTP.exe --gtp --noponder --playouts 150000 --nobook and a thinking time of 1000 seconds per move. In fact, Leela does not follow --playouts very strictly, and tends to use many more playouts when she is not sure. But 150000 playouts seems to be her minimum in that case.
  • An empty folder games_to_be_analysed where you can place the SGF files you want to analyse.
  • Two batch files (bash scripts for Linux) that can be run to perform the batch analysis of all SGF files in the games_to_be_analysed folder: one for Leela CPU (batch_analysis_CPU), and one for Leela GPU (batch_analysis_GPU). On Windows, the batch file first has to detect where Python is located on the computer in order to run the analysis. It's working on my Windows computer, but I am not so confident it would work on other Windows computers; let me know.
You can modify the Leela command line by modifying the config.ini file:

Code: Select all

[Leela]
slowcommand = Leela0110GTP.exe
slowparameters = --gtp --noponder --playouts 150000 --nobook
slowtimepermove = 1000
fastcommand = Leela0110GTP_OpenCL.exe
fastparameters = --gtp --noponder --playouts 150000 --nobook
fasttimepermove = 1000
slowparameters is for the CPU analysis, and fastparameters for the GPU analysis.

If you want to perform the analysis only on a subset of moves, you can modify the batch_analysis_CPU/GPU to modify the GRP command line by adding the --range parameter. For example:

Code: Select all

for /f "delims=" %%i in ('Assoc .py') do set filetype=%%i
set filetype=%filetype:~4% 
echo filetype for .py files: %filetype%

for /f "delims=" %%i in ('Ftype %filetype%') do set pythonexe=%%i
set pythonexe=%pythonexe:~12,-7%

echo path to python interpreter: %pythonexe%

for %%f in (games_to_be_analysed/*.sgf) do (
	%pythonexe% leela_analysis.py --profil=slow --range="30-1000" "games_to_be_analysed/%%~nf.sgf"
)

for %%f in (games_to_be_analysed/*.rsgf) do (
	%pythonexe% rsgf2csv.py "games_to_be_analysed/%%~nf.rsgf"
)

echo ==================
echo Analysis completed

pause
In the above example, %pythonexe% leela_analysis.py --profil=slow --range="30-1000" "games_to_be_analysed/%%~nf.sgf" will make Leela skip the analysis of moves before 30 and after 1000, so the opening won't be analysed.

At the moment, the main drawback is that it requires Python 2.7 to be installed on the computer. For Mac users, I think the Linux version can be used, but the Leela executables need to be replaced by macOS executables, and the names of the executables have to be updated in config.ini.

Please have a try and let me know if it works, or can be improved.

Edit: in that "kit", I also set GRP to save up to 361 variations. This way, one can be sure no information is discarded. The --nobook parameter prevents Leela from using her joseki dictionary to play the opening, so she is forced to think about all moves, including during the opening. I deliver all this together in a zip to help make this analysis repeatable: if more people want to help analyse large volumes of data by sharing their computing power, it's easy to just distribute this zip file so that everybody's analysis conditions are as similar as possible to everybody else's.
I am the author of GoReviewPartner, a small software aimed at assisting reviewing a game of Go. Give it a try!
pnprog
Lives with ko
Posts: 286
Joined: Thu Oct 20, 2016 7:21 am
Rank: OGS 7 kyu
GD Posts: 0
Has thanked: 94 times
Been thanked: 153 times

Re: Measuring player mistakes versus bots

Post by pnprog »

Uberdude wrote:I'm also thinking that we should also analyse the games with GnuGo, and any move which GnuGo agrees with the human and the strong bot be discarded from the analysis as an obvious move with little information. This should help mitigate the "this was a simple game with many obvious forced moves so will be more similar to the bot" problem.
This can also be performed with GRP, because GnuGo has a command to produce its 10 preferred moves (maybe one could modify the source code of GnuGo to get more moves). And that is what GRP does when using GnuGo to perform an analysis.

I made a quick proof of concept using the controversial game from PGETC. I enclose the CSV file.
WWIWTFDSGS.rsgf.csv.zip
(1.03 KiB) Downloaded 403 times
The column Bot choice indicates the rank of the game move among GnuGo's preferred moves. So a rank of 1 means that GnuGo would have played the same move. When the rank indicates ">10", it means the move is not among GnuGo's best 10 moves.

I calculated the average rank for both players (using rank=11 when rank>10) and both are between 6 and 7 on average.
23/83 moves by Black correspond to GnuGo's first choice.
14/82 moves by White correspond to GnuGo's first choice.
Both players played exactly 48 moves inside GnuGo's top 10 moves.
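Computing these statistics from the CSV is straightforward. Here is a sketch using a hypothetical three-row excerpt in the column format Uberdude proposed (the rows are invented, not from the actual PGETC game), with ">10" mapped to rank 11 for averaging as described above.

```python
import csv
import io

# Hypothetical excerpt in the proposed CSV format; the "Bot choice" column
# is the bot's rank for the move actually played (">10" = outside top 10).
csv_text = """Move number,Colour,Bot move,Bot winrate,Game move,Game winrate,Bot choice,Policy prob
20,W,h17,54.23,j17,53.5,2,5.12
21,B,h18,46.5,h18,46.5,1,45.32
22,W,c3,50.1,d5,49.0,>10,1.02
"""

def rank_value(s):
    # Convention used above: treat ">10" as rank 11 when averaging.
    return 11 if s.strip() == ">10" else int(s)

rows = list(csv.DictReader(io.StringIO(csv_text)))
ranks = {"B": [], "W": []}
for row in rows:
    ranks[row["Colour"]].append(rank_value(row["Bot choice"]))

for colour, rs in ranks.items():
    avg = sum(rs) / len(rs)
    first = sum(1 for r in rs if r == 1)
    print(f"{colour}: average rank {avg}, {first}/{len(rs)} first-choice matches")
```

Running the same script over the full per-game CSVs would reproduce the average-rank and first-choice-match figures quoted above for each player.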
I am the author of GoReviewPartner, a small software aimed at assisting reviewing a game of Go. Give it a try!