Re: statistical analysis of player performance

Posted: Tue Jul 17, 2018 3:53 am
by AlesCieply
Javaness2 wrote:
AlesCieply wrote:Expect more later.
Can we expect more non Carlo files?
Sure, in some time. :) I have the analysis of Ondrej Kruml's (Czech 5d) games almost complete. As a preliminary result, I can say that his percentage of good moves ranges from about 40% to almost 70% (in one game out of 8).

What troubles me right now is the accuracy and precision of the winrates provided by Leela in some specific positions. These positions are reasonably rare but when they occur, Leela struggles and evaluates several moves by the players as big mistakes. This affects the tails of the mistake histograms with small counts, so I am considering removing these last low-count bins from the chi2 comparisons.

It also looks like I have found someone to help me out with running Leela Zero at a sufficiently high playouts setting, so I am wondering what to do first. :scratch:

Re: statistical analysis of player performance

Posted: Tue Jul 17, 2018 4:30 am
by AlesCieply
Something for Bill:
MettaBenDavid.rsgf.zip
Metta-BenDavid analysis at 300k+ nodes.
This is an rsgf file generated by GRP for the Metta-BenDavid PGETC game, all moves. The analysis was done with 300k+ nodes, so it should be slightly more precise than what I normally use. It is quite fresh; I have not checked it myself yet, but I intend to compare it with my "standard" 200k+ file.

Re: statistical analysis of player performance

Posted: Tue Jul 17, 2018 6:42 am
by Bill Spight
AlesCieply wrote:What troubles me right now is the accuracy and precision of the winrates provided by Leela in some specific positions. These positions are reasonably rare but when they occur, Leela struggles and evaluates several moves by the players as big mistakes. This affects the tails of the mistake histograms with small counts, so I am considering removing these last low-count bins from the chi2 comparisons.
Instead of throwing data out, the general rule of thumb is to combine the low count bins into one bin with a count of at least 5. And that also probably means combining the bin where the human play had a better evaluation than Leela's top play.
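That rule of thumb could look roughly like this in Python. This is only a minimal sketch, assuming the two mistake histograms are plain lists of counts over identical bins, ordered so that the low-count tail comes last, and that neither histogram is empty. The function names are made up for illustration and are not part of the analysis in this thread.

```python
def merge_tail_bins(hist_a, hist_b, min_count=5):
    """Merge trailing bins of two histograms (same binning) until the
    last bin holds at least min_count entries in both of them."""
    a, b = list(hist_a), list(hist_b)
    while len(a) > 1 and (a[-1] < min_count or b[-1] < min_count):
        # Fold the last bin into its neighbour instead of discarding it.
        last_a, last_b = a.pop(), b.pop()
        a[-1] += last_a
        b[-1] += last_b
    return a, b


def chi2_two_sample(hist_a, hist_b):
    """Pearson chi-square statistic and degrees of freedom for the
    2 x k contingency table formed by two histograms."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    total = total_a + total_b
    chi2 = 0.0
    used_bins = 0
    for obs_a, obs_b in zip(hist_a, hist_b):
        column = obs_a + obs_b
        if column == 0:
            continue  # a bin that is empty in both histograms is skipped
        used_bins += 1
        exp_a = column * total_a / total
        exp_b = column * total_b / total
        chi2 += (obs_a - exp_a) ** 2 / exp_a + (obs_b - exp_b) ** 2 / exp_b
    return chi2, used_bins - 1
```

Merging keeps every move in the comparison, so the totals of both histograms are unchanged; only the resolution in the sparse tail is lost.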
It also looks like I have found someone to help me out with running Leela Zero at a sufficiently high playouts setting, so I am wondering what to do first. :scratch:
Leela Zero is more accurate than Leela 11. :)

Re: statistical analysis of player performance

Posted: Tue Jul 17, 2018 6:50 am
by Bill Spight
AlesCieply wrote:Something for Bill:
MettaBenDavid.rsgf.zip
This is an rsgf file generated by GRP for the Metta-BenDavid PGETC game, all moves. The analysis was done with 300k+ nodes, so it should be slightly more precise than what I normally use. It is quite fresh; I have not checked it myself yet, but I intend to compare it with my "standard" 200k+ file.
Wow! Much grass! :D

Re: statistical analysis of player performance

Posted: Tue Jul 24, 2018 7:19 am
by AlesCieply
I have uploaded an analysis of 10 games played by Ondrej Kruml, a Czech 5d player. The links are in the second message of this thread, and the OP was also altered slightly to reflect the new data. They should serve for comparison with the results obtained for the two sets of games played by Carlo Metta. The selection of these 10 games was not quite random, as I wanted to include the same number of wins and losses, have games from different tournaments, and have no more than two games against the same opponent. I do not think there was any other bias when I was selecting the games to analyze.

Re: statistical analysis of player performance

Posted: Tue Jul 24, 2018 8:16 am
by pnprog
Hi Bill!
Bill Spight wrote:So to find a good delta we don't want to do what Go Review Partner does. It's OK for casual review, but not for scientific purposes. We want to start from the same place, and we want to have an equal number of playouts for each play we are comparing. With Go Review Partner I think we can do that by making each play we are comparing and then running the bot for a certain number of rollouts, or for a certain length of time. That way we are comparing apples with apples.
There is no direct way to ask Leela (or another bot) to evaluate one specific move. So do you mean something like:
For one given position:
  • Check out the move "A1" played in actual game (let's imagine D16)
  • Check out what move, "B1", would have been played by the bot (let's imagine D17)
  • Ask the bot for its best counter move "A2", to the move "A1" (let's imagine C14)
  • Ask the bot for its best counter move "B2", to the move "B1" (let's imagine C15)
Then, if W(X) is the win rate of move at X, then: delta = W(B2)-W(A2)
And then, the thinking parameters (time and play-outs) should be the same when asking the bot to come out with "B1","A2" and "B2".
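The steps listed here could be sketched roughly as follows in Python. The `best_reply` callable is hypothetical: it stands in for whatever wrapper runs the bot on the position reached after a given move list, with a fixed playout budget, and returns the bot's top move together with its win rate. It is not an existing Leela or Go Review Partner function.

```python
from typing import Callable, List, Tuple

# Hypothetical signature: (moves so far, playout budget) -> (best move, win rate in %)
BestReply = Callable[[List[str], int], Tuple[str, float]]


def delta_for_move(position: List[str], played: str,
                   best_reply: BestReply, playouts: int) -> float:
    """delta = W(B2) - W(A2), with every bot query using the same budget.

    Both win rates are assumed to be reported from the same player's
    point of view.
    """
    # B1: the move the bot would have played instead of the game move A1.
    bot_move, _ = best_reply(position, playouts)
    # A2: the bot's best counter to the move actually played (A1).
    _, w_a2 = best_reply(position + [played], playouts)
    # B2: the bot's best counter to its own choice (B1).
    _, w_b2 = best_reply(position + [bot_move], playouts)
    return w_b2 - w_a2
```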

Is that what you mean?

By the way, we could ask the Leela Zero team if they can come up with a specific GTP command to evaluate one precise move. Maybe it's not that hard to implement.

Re: statistical analysis of player performance

Posted: Tue Jul 24, 2018 11:46 am
by AlesCieply
pnprog wrote: Then, if W(X) is the win rate of move at X, then: delta = W(B2)-W(A2)
Maybe I can answer this instead of Bill. :) You got it absolutely correct! Just note that my deltas are defined with the opposite sign, but that's not important. I just wanted to have a positive value when the player finds a better move than the top bot suggestion. With the improving quality of bots, it might be better to define delta as the size of the mistake the player made by playing his or her move.
pnprog wrote: And then, the thinking parameters (time and play-outs) should be the same when asking the bot to come out with "B1","A2" and "B2".
It would be best if the numbers of playouts/nodes were the same (and could have been preset) for the winrate estimates made for A2 and B2.
pnprog wrote:By the way, we could ask the Leela Zero team if they can come up with a specific GTP command to evaluate one precise move. Maybe it's not that hard to implement.
This would be great!

Re: statistical analysis of player performance

Posted: Tue Jul 24, 2018 10:54 pm
by Bill Spight
pnprog wrote:Hi Bill!
Bill Spight wrote:So to find a good delta we don't want to do what Go Review Partner does. It's OK for casual review, but not for scientific purposes. We want to start from the same place, and we want to have an equal number of playouts for each play we are comparing. With Go Review Partner I think we can do that by making each play we are comparing and then running the bot for a certain number of rollouts, or for a certain length of time. That way we are comparing apples with apples.
There is no direct way to ask Leela (or another bot) to evaluate one specific move. So do you mean something like:
For one given position:
  • Check out the move "A1" played in actual game (let's imagine D16)
  • Check out what move, "B1", would have been played by the bot (let's imagine D17)
  • Ask the bot for its best counter move "A2", to the move "A1" (let's imagine C14)
  • Ask the bot for its best counter move "B2", to the move "B1" (let's imagine C15)
Then, if W(X) is the win rate of move at X, then: delta = W(B2)-W(A2)
And then, the thinking parameters (time and play-outs) should be the same when asking the bot to come out with "B1","A2" and "B2".

Is that what you mean?

By the way, we could ask the Leela Zero team if they can come up with a specific GTP command to evaluate one precise move. Maybe it's not that hard to implement.
Here's what I am talking about. Let's look at moves :w14: and :b15: in the Metta-Ben David game.
[go]$$Wcm14
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . X . . |
$$ | . . . O . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . O . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . X . . . . . . . . . . . . . . . . |
$$ | . . . X X X O . . , . . . . . , . . . |
$$ | . . X O O . O . . . . . . 1 . . X . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]
Leela evaluates 9 different replies to :w14:.

Its top choice is the keima.
[go]$$Wcm14 Keima
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . X . . |
$$ | . . . O . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . O . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . X . . . . . . . . . . . . 2 . . . |
$$ | . . . X X X O . . , . . . . . , . . . |
$$ | . . X O O . O . . . . . . 1 . . X . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]
It evaluates this as 55.90% for Black with 222084 playouts.
[go]$$Wcm14 De
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . X . . |
$$ | . . . O . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . O . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . X . . . . . . . . . . . . . . . . |
$$ | . . . X X X O . . , . . . . . , . . . |
$$ | . . X O O 2 O . . . . . . 1 . . X . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]
Its second choice is the de, which it evaluates as 54.72% for Black with 136882 playouts. That is fewer playouts, but they are in the same ballpark, and good enough, I think, for a winrate difference of 1.2%.
[go]$$Wcm14 Two space extension
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . X . . |
$$ | . . . O . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . O . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . 2 . . |
$$ | . . X . . . . . . . . . . . . . . . . |
$$ | . . . X X X O . . , . . . . . , . . . |
$$ | . . X O O . O . . . . . . 1 . . X . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]
Leela's third choice is the two space extension, which it evaluates as 54.37% with 41693 playouts. The two winrates are not all that comparable, but good enough for the winrate difference of 1.5%.
[go]$$Wcm14 Two space high pincer
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . X . . |
$$ | . . . O . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . O . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . X . . . . . . . . . . . . . . . . |
$$ | . . . X X X O . . , 2 . . . . , . . . |
$$ | . . X O O . O . . . . . . 1 . . X . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]
Leela's sixth choice is the two space high pincer, with only 782 playouts. With so few playouts, it is not worth figuring a winrate difference.

Now let's look at :b15: in the game.
[go]$$Wcm14 Two space extension
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . X . . |
$$ | . . . O . . . . . , . . . . . , . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , . . . |
$$ | . . O . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . 2 . . |
$$ | . . X . . . . . . . . . . . . . . . . |
$$ | . . . X X X O . . , . . . . . , . . . |
$$ | . . X O O . O . . . . . . 1 . . X . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]
Leela evaluates it as 55.01% for Black with around 341,000 playouts. (I did not add them up exactly.) With many more playouts the winrate is more precise, and presumably more accurate. Not a big difference in this case, but bigger differences have been observed. The delta is 0.9% instead of a winrate difference of 1.5%.
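The arithmetic so far can be reproduced with a few lines of Python. The win rates and playout counts are the ones quoted in this post; the `MIN_PLAYOUTS` cutoff is an arbitrary assumption for illustration, not a value used by Leela or Go Review Partner.

```python
# Leela's candidate replies to White 14: name -> (win rate for Black in %, playouts).
candidates = {
    "keima": (55.90, 222084),
    "de": (54.72, 136882),
    "two space extension": (54.37, 41693),
    "two space high pincer": (None, 782),  # win rate not quoted in the post
}

MIN_PLAYOUTS = 10_000  # assumed cutoff below which a win rate is ignored

top_winrate, _ = candidates["keima"]


def winrate_differences(cands, top, min_playouts):
    """Difference between the top choice and each sufficiently searched candidate."""
    diffs = {}
    for name, (winrate, playouts) in cands.items():
        if winrate is None or playouts < min_playouts:
            continue  # too few playouts to be worth comparing
        diffs[name] = round(top - winrate, 2)
    return diffs


# Same depth of the tree, but unequal playouts (first comparison).
diffs = winrate_differences(candidates, top_winrate, MIN_PLAYOUTS)

# The game move (the two space extension) evaluated one move deeper,
# with around 341,000 playouts, gave 55.01% for Black.
delta = round(top_winrate - 55.01, 2)
```

With these inputs the two space extension loses about 1.5% by the first comparison but only about 0.9% by the delta, which is the gap discussed here.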

The first comparison, between Leela's first and third choices, is at the same depth of the tree, but with quite different numbers of playouts. IIUC, it is not easy to equalize the number of playouts, because, as a kind of Monte Carlo bot, Leela uses the number of playouts as one of its criteria to decide which play to choose. Its purpose is to pick plays, not just evaluate positions and plays.

The second comparison, for the delta, has a comparable number of playouts for the two choices in this case, but they start at different levels of the game tree. Now, Leela is run at :b15: in the game tree, according to the conditions set, time or number of playouts, or whatever. Is it not possible to make a separate variation with Leela's first choice, the keima, and run Leela for it, under the same conditions as Leela is run for the actual play in the game? That would give comparisons made under the same conditions at the same level in the game tree. :D

One way to do that might be, after Leela has evaluated :b15: in the actual game and before it evaluates :w16: in the actual game, to have it evaluate Leela's first choice for :b15:. (You wouldn't even have to check to see if it is different from the actual play. Double comparisons of the same play would give you an idea of the error rate of the winrate estimates. Something that we do not currently have.) Another possibility would be to go over the game a second time, this time only evaluating the variations with Leela's first choices from the initial run.
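If the same play does get evaluated twice under identical settings, the spread of the paired results gives a rough error estimate. A minimal sketch in Python, assuming the repeated winrates have been collected into two equally long lists and that the two runs are independent with equal variance; the function name is invented for illustration.

```python
import math
import statistics


def winrate_noise(first_run, second_run):
    """Estimate the standard deviation of a single winrate estimate
    from two independent evaluations of the same plays."""
    if len(first_run) != len(second_run) or len(first_run) < 2:
        raise ValueError("need two equally long runs with at least two plays")
    diffs = [a - b for a, b in zip(first_run, second_run)]
    # The difference of two independent estimates has twice the variance
    # of one estimate, hence the division by sqrt(2).
    return statistics.stdev(diffs) / math.sqrt(2)
```

A winrate difference between two plays that is not clearly larger than this noise level would then be hard to call a mistake.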

Re: statistical analysis of player performance

Posted: Tue Jul 24, 2018 11:15 pm
by Bill Spight
pnprog wrote:There is no direct way to ask Leela (or another bot) to evaluate one specific move. So do you mean something like:
For one given position:
  • Check out the move "A1" played in actual game (let's imagine D16)
  • Check out what move, "B1", would have been played by the bot (let's imagine D17)
  • Ask the bot for its best counter move "A2", to the move "A1" (let's imagine C14)
  • Ask the bot for its best counter move "B2", to the move "B1" (let's imagine C15)
Then, if W(X) is the win rate of move at X, then: delta = W(B2)-W(A2)
And then, the thinking parameters (time and play-outs) should be the same when asking the bot to come out with "B1","A2" and "B2".

Is that what you mean?
Let me try again. IIUC, Leela has already come up with its replies to A1, the move already made in the game. That's what Go Review Partner asks it to do, right? Ask it to do the same for B1. Then we compare the associated winrate estimates. :)
By the way, we could ask the Leela Zero team if they can come up with a specific GTP command to evaluate one precise move. Maybe it's not that hard to implement.
Isn't that what Leela does when it actually plays a game? It evaluates the position after the opponent's move?

Re: statistical analysis of player performance

Posted: Wed Jul 25, 2018 4:13 am
by AlesCieply
Chi2 test results added to the second post.