Circa April 2020 best setup for reviewing games?

leftoftengen · #1

First - lightvector, gcp, the ~20-50 folks who are working on these bots & tools, and folks like Fiarbaine / Spight who post thoughtful new content - you rock, and I very much appreciate what you do, especially for such a niche community.

It's been ~4 months since I seriously tried to review a game with a tool, and I'd like to start back up. (FWIW, I'm reasonably familiar with computer go; at various times I've tried writing / modding AlphaGo-like bots, training on my 1080 Ti, etc.)

I'd like to set up a little workflow where I drop interesting KGS games into a folder and generate reviews that I can pull up at a later point to focus on the top ~10 biggest mistakes and the first ~80 moves, and occasionally dig into in detail.

The tool:
I've used Lizzie a lot, and Sabaki a little, and neither is great for this. Is goreviewpartner the way to go? Do folks have or recommend custom scripts to do some of this (e.g., drop in ~10 move sequences for the bigger blunders with some annotation on the scale of the error, alternatives to consider, etc.)? I'd really like to get an annotated sgf for future reading vs. reviewing live (too tempting to go into tangents).

The bot:
I'd love to use a StRoNg bot (I like learning novel impractical joseki ... 3k remember!) and one that can reasonably review handicap games.

I believe KataGo can do this ... but it doesn't work with goreviewpartner out of the box. Is that right?

Any suggestions?

Drew · #2

I don't have anything to add other than an indirect comment that I think this forum would benefit from a Quarterly or Semi-Annual "State of the Bots" post that sums up where the industry is at in terms of software, hardware, functionality, and how-to's.

For people who can't keep up with the daily or weekly content, it's hard to know where to begin and what is relevant today (whenever today happens to be).

Bill Spight · #3

My shop teacher taught us to use the right tool for the job. Unfortunately, I don't think that there is a right tool for the job. Today's bots are the best thing we have, but they are built for winning games, not analyzing plays and positions, which is what I want from a reviewer. One problem that goes back for years is that the difference between winrate estimates for different candidate plays can be misleading. This happens when the estimates are based upon rather different numbers of rollouts, which is typically the case. The bot's top choice may get 10,000 rollouts, while the move that the human actually made in the game gets 100 rollouts. IMO, a review program should devote about the same number of rollouts to each play that it compares with the other. I am unaware of any program that does that, but a human can make the bot do that.

lightvector recently wrote a very good post about using a bot for review: https://lifein19x19.com/viewtopic.php?p=255703#p255703

inbae · #4

Bill Spight wrote:

IMO, a review program should devote about the same number of rollouts to each play that it compares with the other. I am unaware of any program that does that, but a human can make the bot do that.

The list of sensible moves will vary person by person, and by concentrating on the best moves the bot can discover better moves sometimes.

Bill Spight wrote:

lightvector recently wrote a very good post about using a bot for review: https://lifein19x19.com/viewtopic.php?p=255703#p255703

This should be the default mindset I guess... Even if you have a perfect analysis tool, if you don't think and ponder on which moves should be played, it would be like trying to learn while not asking a single question at all to your teacher.

Bill Spight · #5

inbae wrote:

Bill Spight wrote:

IMO, a review program should devote about the same number of rollouts to each play that it compares with the other. I am unaware of any program that does that, but a human can make the bot do that.

The list of sensible moves will vary person by person, and by concentrating on the best moves the bot can discover better moves sometimes.

IMO, one play that the bot should compare with its top choice is the one actually played by the human. It can be completely off the bot's radar and thus get 0 rollouts, but if played, may often get a winrate estimate within 1 or 2% of the top choice, and sometimes may get a higher winrate. And the same goes for a play that the bot gives relatively few rollouts to. More rollouts can alter a play's winrate estimate considerably.

To put things differently, in search there is a tension between exploration and exploitation. I think that a game review program should emphasize exploration more than a game playing program.
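
A minimal Python sketch of that equal-budget comparison, assuming a hypothetical `evaluate(position, move, visits)` callback that plays `move` from `position`, searches the result with the given number of visits, and returns Black's winrate. No real engine API is implied; the toy evaluator only stands in for one.

```python
from typing import Callable, Dict, Hashable

# Hypothetical signature: (position, move, visits) -> Black winrate in [0, 1].
Evaluator = Callable[[Hashable, str, int], float]


def compare_with_equal_budget(
    evaluate: Evaluator,
    position: Hashable,
    played_move: str,
    bot_choice: str,
    visits: int = 10_000,
) -> Dict[str, float]:
    """Evaluate the played move and the bot's choice with the same visit budget."""
    played = evaluate(position, played_move, visits)
    best = evaluate(position, bot_choice, visits)
    return {
        "played": played,
        "bot_choice": best,
        # Positive means the bot's choice scores higher from Black's point of view.
        "difference": best - played,
    }


# Toy stand-in for an engine, for illustration only (made-up winrates).
def toy_evaluate(position, move, visits):
    table = {"a": 0.55, "b": 0.52}
    return table[move]


result = compare_with_equal_budget(toy_evaluate, "some position", "b", "a")
print(result)
```

The point of the design is only that both moves are forced onto the board and searched with the same `visits`, so neither estimate rests on a handful of rollouts.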

iopq · #6

Bill Spight wrote:

inbae wrote:

Bill Spight wrote:

IMO, a review program should devote about the same number of rollouts to each play that it compares with the other. I am unaware of any program that does that, but a human can make the bot do that.

The list of sensible moves will vary person by person, and by concentrating on the best moves the bot can discover better moves sometimes.

IMO, one play that the bot should compare with its top choice is the one actually played by the human. It can be completely off the bot's radar and thus get 0 rollouts, but if played, may often get a winrate estimate within 1 or 2% of the top choice, and sometimes may get a higher winrate. And the same goes for a play that the bot gives relatively few rollouts to. More rollouts can alter a play's winrate estimate considerably.

To put things differently, in search there is a tension between exploration and exploitation. I think that a game review program should emphasize exploration more than a game playing program.

You can change the constant for exploration vs. exploitation in the settings.

In KataGo that would be cpuctExploration in the config file for analysis.
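
For anyone unsure what that constant does, here is a rough Python sketch of an AlphaZero-style PUCT selection rule. It is a simplified illustration, not KataGo's actual code (its real formula has additional terms), and the priors and values are made up. `c_puct` plays the role of the exploration constant.

```python
import math


def puct_visits(priors, values, c_puct, total_visits):
    """Distribute visits among moves with a simplified PUCT rule.

    priors: policy prior per move. values: value estimate per move, held
    fixed here (a real search updates them as it goes).
    """
    visits = [0] * len(priors)
    for _ in range(total_visits):
        parent = sum(visits)
        # Exploitation term (value) plus exploration bonus scaled by c_puct.
        # sqrt(parent + 1) avoids a zero bonus on the very first visit.
        scores = [
            values[i] + c_puct * priors[i] * math.sqrt(parent + 1) / (1 + visits[i])
            for i in range(len(priors))
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        visits[best] += 1
    return visits


priors = [0.70, 0.25, 0.05]   # made-up policy priors
values = [0.55, 0.53, 0.54]   # made-up winrate estimates

low = puct_visits(priors, values, c_puct=0.5, total_visits=1000)
high = puct_visits(priors, values, c_puct=3.0, total_visits=1000)
print("c_puct=0.5:", low)
print("c_puct=3.0:", high)
```

With the larger constant, fewer visits pile onto the top move and more go to the alternatives, which is the direction a reviewer would want.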

And you also get a proper analysis of the human move when you advance one move. I don't think anyone just opens the first move of the game and searches; you advance to the next move to see how the scores and percentages change. Otherwise you'd be staring at an empty board while it runs to tell you how good 4-4 is.

If it's a small loss you know it's a good move; if it's a large loss, maybe not that good.

Bill Spight · #7

iopq wrote:

And you also get a proper analysis of the human move when you advance one move. I don't think anyone just opens the first move of the game and searches; you advance to the next move to see how the scores and percentages change. Otherwise you'd be staring at an empty board while it runs to tell you how good 4-4 is.

If it's a small loss you know it's a good move; if it's a large loss, maybe not that good.

I believe that this is common practice, but I disagree. IMO the proper comparison is between plays from the same position, not the change in winrate estimates between successive positions. The reason is that the winrate estimates for all plays can change between successive positions. If all plays show a loss, that does not mean that they are all less than optimal.

bernds · #8

leftoftengen wrote:

I'd like to set up a little workflow where I drop interesting KGS games into a folder and generate reviews that I can pull up at a later point to focus on the top ~10 biggest mistakes and the first ~80 moves, and occasionally dig into in detail.

The tool:
I've used Lizzie a lot, and Sabaki a little, and neither is great for this. Is goreviewpartner the way to go? Do folks have or recommend custom scripts to do some of this (e.g., drop in ~10 move sequences for the bigger blunders with some annotation on the scale of the error, alternatives to consider, etc.)? I'd really like to get an annotated sgf for future reading vs. reviewing live (too tempting to go into tangents).

Guess I'll plug my own tool for this, q5go. You can queue up files for analysis and save them when they're done, and the win rate/score information will be saved with them. The values are presented as a graph, so you can easily see the biggest jumps.
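
If you want the same "biggest jumps" view outside a GUI, a small script over the per-move winrates is enough. The Python sketch below assumes you already have a list of Black winrates, one per position, extracted by whatever tool you use; the function name and the numbers are hypothetical.

```python
def biggest_swings(winrates, top_n=10, max_move=None):
    """Return (move_number, swing) pairs sorted by absolute winrate swing.

    winrates[i] is Black's winrate after move i (index 0 = empty board).
    max_move optionally restricts the search to the first max_move moves.
    """
    swings = [
        (move, winrates[move] - winrates[move - 1])
        for move in range(1, len(winrates))
        if max_move is None or move <= max_move
    ]
    swings.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return swings[:top_n]


# Made-up winrates (percent) for a six-move fragment.
example = [50.0, 51.0, 44.0, 45.5, 60.0, 58.0, 59.0]
top = biggest_swings(example, top_n=2)
print(top)
```

Calling it with `top_n=10, max_move=80` would match the "top ~10 mistakes in the first ~80 moves" request.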

Bill Spight · #9

Here is an example of what I am talking about. It is taken from the Elf GoGoD commentaries. Other programs may and almost surely do differ from Elf, but I think my point remains the same.

[go]$$Wcm24 Kim Kiheon, 7 dan (W) vs. Jimmy Cha, 5 dan, 2018-07-18m
$$ ---------------------------------------
$$ | . . . . . . . . . . . O a . X . X X . |
$$ | . . . . . . X . . X X X O b O X O X . |
$$ | . . O X . X O X X X O O . . O X . . . |
$$ | . . X X . X O O O O O O . . O X X . . |
$$ | . . . . . . X O O . X X O 1 O X . X . |
$$ | . . . . . . X X . O . . O X O O X . X |
$$ | . O . . . . . X O O X X 5 X X O O X . |
$$ | X X X X X X X O . O X . . . . X O 3 4 |
$$ | O X X O X O X O O X X . O X X X X X 2 |
$$ | O O O O O O O . . O X X X X O , X O . |
$$ | . X O . . . . O . O X O O X O . O . O |
$$ | . . X O O O . O O X O O O . O O . O . |
$$ | . X . . X O . X X X X X O O O . O . . |
$$ | . O X X . O X X . . . X X O X X O . . |
$$ | O . O . X . . O X . . X O O X O . O O |
$$ | . O O O X . O O O X X X O O X O O O X |
$$ | . . X . O O . O . O X . X X . X O X X |
$$ | . . . . . . . O . O X X . . X . X . X |
$$ | . . . . . . . . O . O . . . . . . X . |
$$ ---------------------------------------[/go]

This is the game record of moves 224 - 228.

After :w24: Elf estimates Black's winrate as 92.6% with 12,886 rollouts. This is apparently Elf's second choice, with the also-rans not reported, as they got fewer than 1500 rollouts. Elf's top choice is at a, yielding a Black winrate estimate of 87.2% with 13,450 rollouts. That play is 5.4% better for White, both estimates based on around 13k rollouts.

After :b25: Black's winrate estimate is 81.4% with 39 rollouts. That's a drop of 11.2%, but how accurate is an estimate based on only 39 rollouts? :shock:

I don't know what other programs do, but Elf inherits that estimate from its top choice for :b26:, which is :w28:. That estimate is 81.4% with 17,451 rollouts, a respectable number.

OK, we have a drop of 11.2% between :w24: and :b25: (actually, between :w24: and Elf's top choice for :w26:). As I say, I don't know what other programs do. If they use an estimate based on only 39 rollouts, all I can do is to roll my eyes. :roll:

But what about the comparison between :b25: and Elf's top choice for :b25:, which is b? After Black b, Black's winrate estimate is 93.5% with 41,491 rollouts, a substantial number. The difference between that and the estimate for :b25: is 12.1%, around 1% more than the drop between positions. A minor difference.

Moving on. After :w26: we get a Black winrate estimate of 87.2% with 575 rollouts. In this case, since the estimate for the previous position was inherited from Elf's top choice for :w26:, there is no difference between the drop of 5.8% between successive positions and the winrate difference between plays. However, the fact that one estimate is based on only 575 rollouts while the other estimate is based on 17,451 rollouts is a question. Elf does not inherit estimates if there are 500 or more rollouts. But if it did, it would inherit from Elf's top choice for :b27: (also at b) an estimate of 93.5% with 28,501 rollouts. That's a substantial difference, both in the winrate estimate and the number of rollouts. Wouldn't it be better for the analyst program to make the human play de novo, and neither rely upon a comparatively low number of rollouts nor inherit the estimate from a different play?

After :b27: Black's winrate estimate is 78.7% with 301 rollouts. Since this is fewer than 500 rollouts, the estimate is inherited from the estimate after :w28:, which is also Elf's top choice, with 25,793 rollouts.

The drop between positions is 8.5%. However, the difference between the estimate for :b27: and Black b is 14.8%. Is :b27: a minor error, costing 8 or 9%, or a substantial error, costing 15%?

When you have winrate estimates based upon vastly different numbers of rollouts you can easily get these discrepancies. Also, you can get discrepancies between successive estimates based upon the horizon effect. Better, IMO, to have the analyst program make the human moves de novo for comparison with its top choice (unless both moves are the same, OC).
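
To make the two measures concrete, here is the arithmetic from the example above as a short Python sketch. The winrates are the Elf figures quoted in this post; the variable names are only illustrative.

```python
# Black winrate estimates (percent) quoted above.
after_w24 = 92.6
after_b25 = 81.4      # inherited from Elf's top choice for the next move
after_w26 = 87.2
after_b27 = 78.7      # inherited as well
black_b = 93.5        # estimate after Black b, Elf's top choice for Black 25 and Black 27

# Measure 1: drop between successive positions.
drop_b25 = round(after_w24 - after_b25, 1)
drop_b27 = round(after_w26 - after_b27, 1)

# Measure 2: played move versus Elf's top choice from the same position.
gap_b25 = round(black_b - after_b25, 1)
gap_b27 = round(black_b - after_b27, 1)

print(f"Black 25: drop {drop_b25}%, gap to b {gap_b25}%")
print(f"Black 27: drop {drop_b27}%, gap to b {gap_b27}%")
```

For Black 25 the two measures nearly agree; for Black 27 they differ by more than six points, which is the discrepancy in question.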

----

Edit: More dramatic examples are possible, for instance, where the human play is better than Elf's play. But I wanted to show an ordinary position where these questions arise.
