All times are UTC - 8 hours [ DST ]




Post new topic Reply to topic  [ 15 posts ] 
 Post subject: Widening Leela Zero's search (nb: contains many images)
Post #1 Posted: Sun Oct 13, 2019 3:26 am 
Lives in gote

Posts: 586
Location: Adelaide, South Australia
Liked others: 208
Was liked: 265
Rank: Australian 2 dan
GD Posts: 200
Starting a new thread so as not to completely hijack the other conversation...

The challenge is to broaden Leela Zero's search to evaluate not just the best move, but also the near misses and plausible alternatives. A broader search will slightly weaken the playing strength but might make it easier to use as an analysis tool. Of course you can get those evaluations by clicking around in the interface (that was the point of the other conversation), but it could be easier.

My suggestion was to give ten visits (or 20, or 100) to every legal move before starting the usual Monte Carlo tree search. (An alternative approach has been tried, focussing on the top four moves: see Uberdude's post here.) It's not too hard to change the LZ code to do that: it takes about 30 extra lines. My implementation is here: experimental software, use at own risk!
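For the curious, here's the idea in Python-flavoured pseudocode (my own sketch, with made-up names and a random stand-in for the net evaluation; the real change is about 30 lines of C++ in LZ's search code):

```python
import random

def min_visits_search(root_moves, total_playouts, min_visits=10):
    """Sketch of the 'minimum visits' idea: before the normal MCTS,
    force every legal move to receive min_visits evaluations, then
    hand control back to the usual 'most promising first' rule."""
    stats = {m: {"visits": 0, "value_sum": 0.0} for m in root_moves}

    def evaluate(move):
        # Stand-in for a net eval / playout result in [0, 1].
        return random.random()

    spent = 0
    # Phase 1: widen the root by visiting every legal move min_visits times.
    for move in root_moves:
        for _ in range(min_visits):
            stats[move]["visits"] += 1
            stats[move]["value_sum"] += evaluate(move)
            spent += 1

    # Phase 2: remaining playouts follow the normal selection rule
    # (crudely approximated here by picking the best mean value so far).
    while spent < total_playouts:
        move = max(root_moves,
                   key=lambda m: stats[m]["value_sum"] / stats[m]["visits"])
        stats[move]["visits"] += 1
        stats[move]["value_sum"] += evaluate(move)
        spent += 1
    return stats
```

In real LZ, phase 2 is of course the full PUCT selection over the tree, not a greedy argmax, but the two-phase shape is the point.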

So first of all, it's quite pretty to watch in action! In Lizzie, I've changed the "min playout ratio for stats" parameter from 0.1 to 0.01. Here's a 20-second animation of the first 6,000 or so playouts.
Attachment: animation_small.gif

(Apologies for the small image size. I'm waiting for advice on how to post a better picture here.)

Trying it out on some real positions, I find the results aren't all that dramatic. Often it will explore two or three other moves that would otherwise have been ignored -- but when you let it run a bit longer, the "extra" moves disappear, and you end up with the same answers you would have got anyway. I guess this shows that LZ's method of filtering out suboptimal moves is actually doing a pretty good job! So far I've got the most interesting results by watching Lizzie in real time and pausing just as the dust clears, so to speak (i.e. when the animation above changes from hundreds of candidates to just a handful). And slightly older networks seem to be more "open minded" than the newer, stronger ones.

Some examples on a position I've been spending a bit of time with lately:

[go]$$c 501 opening problems, problem 1: Black to play
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . e . O . . X . . . . . . . . . |
$$ | . . . O . . . . . , . . . . X , X . . |
$$ | . . O . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . O . . . . . , . . . . . b a . . |
$$ | . . . . . . . . . . . . . . . d c . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . X . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , O . . |
$$ | . . X . . . . . X . . f . O . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]


Last time I looked at this, most engines would pick two or three of 'a' through 'f' for analysis and would ignore everything else. With my new "LZ-minvisits", do we get any more variety?

  • GX-47 unmodified explores a, b, c and e
  • GX-47 + 10 visits for all moves explores all of a-f and one other option
  • LZ-242 explores 7 different options
  • LZ-242 + 10 explores 9 different options after 5,000 playouts. Using +30 visits instead of +10 actually narrows the results slightly (one option disappears)
  • LZ-157 unmodified is already exploring 13 different moves!
  • LZ-157 + 10 looks at 22 moves (and again drops a few options given more playouts)

More screen shots from Lizzie:
GX-47 with 33k playouts
Attachment: 501OP-1-GX47-33691po_cropped.jpg

GX-47 +10 with 33k playouts
Attachment: 501OP-1-GX47+10-32841po_cropped.jpg



Hmm, limited to three images per post by the looks of it. More to come soon.


This post by xela was liked by: Bill Spight
 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #2 Posted: Sun Oct 13, 2019 3:29 am 
LZ-242 in action with the extra visits: note that 30 visits gives "worse" results than 10.
Unmodified, 5k playouts
Attachment: 501OP-1-LZ242-5224po_cropped.jpg

+10 with 5k playouts
Attachment: 501OP-1-LZ242+10-5089po_cropped.jpg

+30 with 14k playouts
Attachment: 501OP-1-LZ242+30-14428po_cropped.jpg

 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #3 Posted: Sun Oct 13, 2019 3:32 am 
And LZ-157 with the extra visits. Here I've stayed with 10 visits, and shown how some of the options disappear between 5k and 30k playouts.
LZ-157 with 20k playouts
Attachment: 501OP-1-LZ157-20494po_cropped.jpg

LZ-157 +10 with 5k playouts
Attachment: 501OP-1-LZ242+10-5089po_cropped.jpg

LZ-157 +10 with 30k playouts
Attachment: 501OP-1-LZ157+10-30653po_cropped.jpg


This post by xela was liked by: Bill Spight
 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #4 Posted: Sun Oct 13, 2019 12:28 pm 
Lives in gote

Posts: 445
Liked others: 0
Was liked: 37
xela wrote:
My suggestion was to give ten visits (or 20, or 100) to every legal move before starting the usual Monte Carlo tree search.
...
Often it will explore two or three other moves that would otherwise have been ignored -- but when you let it run a bit longer, the "extra" moves disappear, and you end up with the same answers you would have got anyway. I guess this shows that LZ's method of filtering out suboptimal moves is actually doing a pretty good job!

This may also be a sign of a potential problem with your approach. LZ relies strongly on its policy to direct further search, even when (from a human viewpoint) it already has sufficient and much more reliable data about the value of a certain move from that move's first visits (more reliable because the original policy is just a single net lookup).

Weakening the policy's influence as visits accumulate is something that is currently under experimentation, but AFAIK plain old LZ will not necessarily make enough use of the values established by those first 10 visits (for low-policy moves), since it tends to visit high-policy moves rather than high-value moves until late in the search.
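To make that selection bias concrete, here is a toy PUCT-style score in Python. This is simplified and the numbers are made up; LZ's actual formula has further details (FPU reduction etc.), so treat it purely as an illustration:

```python
import math

def puct_score(q, p, parent_visits, child_visits, c_puct=1.5):
    """AlphaGo Zero-style selection score: value estimate plus a
    policy-weighted exploration bonus."""
    return q + c_puct * p * math.sqrt(parent_visits) / (1 + child_visits)

# After the forced 10 visits each (invented numbers):
# move A: high policy prior, mediocre value; move B: tiny prior, better value.
a = puct_score(q=0.48, p=0.40, parent_visits=20, child_visits=10)
b = puct_score(q=0.55, p=0.01, parent_visits=20, child_visits=10)
# a > b: the high-policy move still wins the next visit despite its worse value,
# because the exploration term is scaled by the policy prior.
```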

 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #5 Posted: Mon Oct 14, 2019 5:24 am 
Oza
User avatar

Posts: 2411
Location: Ghent, Belgium
Liked others: 359
Was liked: 1019
Rank: KGS 2d OGS 1d Fox 4d
KGS: Artevelde
OGS: Knotwilg
Online playing schedule: UTC 18:00 - 22:00
jann wrote:
xela wrote:
My suggestion was to give ten visits (or 20, or 100) to every legal move before starting the usual Monte Carlo tree search.
...
Often it will explore two or three other moves that would otherwise have been ignored -- but when you let it run a bit longer, the "extra" moves disappear, and you end up with the same answers you would have got anyway. I guess this shows that LZ's method of filtering out suboptimal moves is actually doing a pretty good job!

This may also be a sign of a potential problem with your approach. LZ relies strongly on its policy to direct further search, even when (from a human viewpoint) it already has sufficient and much more reliable data about the value of a certain move from that move's first visits (more reliable because the original policy is just a single net lookup).

Weakening the policy's influence as visits accumulate is something that is currently under experimentation, but AFAIK plain old LZ will not necessarily make enough use of the values established by those first 10 visits (for low-policy moves), since it tends to visit high-policy moves rather than high-value moves until late in the search.


OK, I'm a slow learner on the whole AI revolution, so let me get this straight by rewording the short history as I know it.

  • 1st gen AI (AlphaGo) was trained on human expert games, and therefore included a bias in its move candidates based on human expert judgment
  • 2nd gen AI (AlphaZero) was trained on the rules only and got rid of that bias. It relied exclusively on move value (win percentage) and reinforcement, exploring high-value moves more (number of visits)
  • 3rd gen AI (LZ et al.) no longer relies solely on the semi-brute-force technique known as Monte Carlo tree search, having integrated lessons learnt from stage 2. I guess that's what you refer to as a policy. It's the return of bias, but no longer human expert bias.

That undermines a thought I had about AI: that it only speaks in sequences and there is no other way to articulate its language than replicating the sequences, adorned with human language to link it to human concepts.

If there's such a "policy", or a "bias", it must have a form, one that can be articulated by something else than sequences. Like "if there is a single stone on 4-4 in a quadrant with few other stones, explore 3-3 invasion".

Where is that policy hiding? Is it not possible to articulate it?
Is my rendition of AI history correct?

 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #6 Posted: Mon Oct 14, 2019 5:37 am 
No, the policy network and the value network have been around since the first AlphaGo; the changes have mostly been in the method of training those two networks. But yes, "it speaks only in sequences" isn't quite true: there are ways to peer inside the brain.

For neural networks trained on images, there's been some interesting work on how to visualise the different layers of the network, so we can say that these neurons are recognising straight lines, those are recognising curves or shadows, these other ones are recognising arms and legs, and so on.

In go, I don't think we've figured out yet which parts of the network are recognising "corner enclosure" or "invasion" or "influence", but I think it's possible to go further in that direction. Lizzie can already display the policy network numbers for those moves that have been searched, so you can see the difference between the "at a glance" evaluation and the result of looking more deeply into a move.

 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #7 Posted: Mon Oct 14, 2019 6:51 am 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
xela wrote:
For neural networks trained on images, there's been some interesting work on how to visualise the different layers of the network, so we can say that these neurons are recognising straight lines, those are recognising curves or shadows, these other ones are recognising arms and legs, and so on.


Going back a decade or so, go researchers were talking about common destiny graphs (if I remember the terminology correctly) that distinguished rookwise connected stones of the same color and other groups that stood or fell together. If certain neurons trained on images can recognize straight lines and so on, surely certain neurons trained on go games can recognize rookwise connected stones, etc. :)


 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #8 Posted: Mon Oct 14, 2019 10:31 am 
Knotwilg wrote:
Where is that policy hiding? Is it not possible to articulate it?
Is my rendition of AI history correct?

Most of today's bots ask their NNs two questions when facing a position: which moves are most likely best here (policy), and which side is ahead (value). On top of these NN answers a search method is built, which does a non-exhaustive search, mostly looking everywhere in the tree at moves that currently appear promising (and later choosing the most-visited top move).

What's promising is, however, not easy to decide. During the search, when a position is encountered for the first time, this depends entirely on the NN policy. Once a move has received some visits, the average value from those earlier visits is ALSO taken into account.

As for bot history, there are no big theoretical differences between generations, particularly not in the search method (e.g. policy and value were originally two different networks, which later became two separate outputs of a single network - a great practical gain but theoretically still the same). The significance of the "bias" from starting on human games is, at best, debatable - even the original AG did a lot of selfplay training, not to mention the Master version. AGZ's significance was to prove that it is possible to learn from selfplay ONLY, without starting from human knowledge. And there is not much theoretical difference between LZ and AGZ - LZ basically aims to be a minimalistic rewrite of AGZ.


This post by jann was liked by: jptavan
 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #9 Posted: Thu Jan 23, 2020 6:11 pm 
Now that I've got a better idea how all this works, I want to try out a different "wide search" method. I'm calling it "winrate fuzzing". The idea is to temporarily change the winrates at the top level of the search tree, to make LZ think that underexplored moves are more promising. Source code here if you're interested.

(I also experimented with "policy flattening", changing the policy values instead of the winrates. This doesn't work so well: LZ will look at lots of different moves for the first few playouts, but is still quick to identify the best move, and in most positions will give lots of extra playouts to the best move.)

There are two parameters, which you can change on the fly via the GTP console. The "bonus" is how much the winrate gets boosted by. The "ratio" defines which moves to alter: a move gets the winrate bonus added if it's had less than 1/ratio of total playouts.

For example, if you set bonus=0.2 and ratio=10, then LZ will, roughly speaking, give equal playouts to the top ten moves, provided that none of them are more than 20% below the best winrate, but it won't waste time on moves that are "obviously bad".
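A sketch of the rule in Python (function and variable names are mine, not the actual source):

```python
def fuzzed_winrate(move_visits, move_winrate, total_visits, bonus=0.2, ratio=10):
    """Root-level 'winrate fuzzing': temporarily boost a move's reported
    winrate while it has had fewer than 1/ratio of the total playouts,
    so the selection step keeps coming back to it."""
    if move_visits < total_visits / ratio:
        return move_winrate + bonus
    return move_winrate

# With bonus=0.2 and ratio=10: an underexplored move at 40% winrate
# competes as if it were at 60%, until it reaches 100 of 1000 playouts.
boosted = fuzzed_winrate(move_visits=50, move_winrate=0.40, total_visits=1000)
plain = fuzzed_winrate(move_visits=200, move_winrate=0.40, total_visits=1000)
```

Once a move crosses the 1/ratio threshold, the bonus disappears and it has to compete on its true winrate again, which is why genuinely bad moves still fall away.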

Test position number 1: something we've looked at before.

[go]$$c 501 opening problems, problem 1: Black to play
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . O . . X . . . . . . . . . |
$$ | . . . O . . . . . , . . . . X , X . . |
$$ | . . O . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . O . . . . . , . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . X . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . , . . . . . , . . . . . , O . . |
$$ | . . X . . . . . X . . . . O . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]


Analysis with LZ-242 unmodified: see post number 2 above.

LZ-242 with bonus=0.2 and ratio=10, 10,000 playouts.
We asked for the top 10 moves. What we got is a reasonably clear top 8, plus a few other moves that moved in and out of the top 10 at different times during the search.
Attachment: 501OP_bonus0.2_ratio10_10kpo_LZ242.jpg


LZ-242 with bonus 0.03 and ratio=10, 10,000 playouts.
This shows how it won't waste time on moves that are below the threshold. Only five moves made the cut (winrate no worse than 3% below the top move). So the bottom 4 get a tenth of the playouts each, and the top move gets nearly all the remaining playouts. Compare this to the unmodified LZ, where all but the best move only get a handful of playouts. This time we've managed to get a better assessment of the second and third choice without having to click on each one individually.
Attachment: 501OP_bonus0.03_ratio10_10kpo_LZ242.jpg


LZ-242 with bonus 1 and ratio=20, 10,000 playouts.
This time we've asked for the top 20. "bonus=1" means that anything goes: it will look at all 20 moves no matter how bad they are. In fact, it couldn't find anything truly awful in this position. And again we've got slightly more than 20 moves shown.
Attachment: 501OP_bonus1_ratio20_10kpo_LZ242.jpg


This post by xela was liked by: marvin
 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #10 Posted: Thu Jan 23, 2020 6:35 pm 
Test position number 2: Hayashi Yubi misses a good move.

[go]$$Bc Black to play
$$ +---------------------------------------+
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . X . . . . . . . . . O . . . . |
$$ | . . . . . . . . . . . . . . . . X . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . O O O X . . . . . . . . . . . . |
$$ | . O . O X X X X . . . . . . . . . . . |
$$ | . a X X O O . . . . . . . . . . X . . |
$$ | . . X O . O . O X . . . . . O . . . . |
$$ | . . X O O . b . X . . . . . . O . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ +---------------------------------------+[/go]


Recall that a is the game move and that ELF found b to be better, but LZ left to its own devices will ignore b for the first few thousand playouts. It will find this move if we broaden the search, though! Admittedly, I had to cherry-pick the parameters to make this happen - perhaps it's a case of something you wouldn't find unless you already knew to look for it. But it's still a nice illustration of what's possible.

LZ-258 unmodified, 10,000 playouts:
Attachment: hayashi_10kpo_LZ258_unmod.jpg


LZ-258 with bonus 0.2 and ratio=5, 10,000 playouts.
Again we don't actually get the top five, we see a top 4 plus a tie for 5th place.
Attachment: hayashi_bonus0.2_ratio5_10kpo_LZ258.jpg

 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #11 Posted: Thu Jan 23, 2020 6:45 pm 
Test position number 3 (last one for today): Takemiya's experiment.

[go]$$Bc Black to play
$$ ---------------------------------------
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . X . . |
$$ | . . . O . . . . . X X X . . O X X X . |
$$ | . . . , O . . . . , . . X X X O X O . |
$$ | . . X . . . . . . . . . O O O O O . . |
$$ | . . . . . . . . . . . . . . . X . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . O . . . . . . . . . . . . . O . . |
$$ | . . . , . . . . . , . . . d . b . . . |
$$ | . . . . . . . . . . . a . . . c . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . e . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . O . . . . . O . . |
$$ | . . . X . . . . . , . . O O O , . . . |
$$ | . . . . . . . . . . X X . X X X . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ | . . . . . . . . . . . . . . . . . . . |
$$ ---------------------------------------[/go]


Back in 1985, Takemiya played a in this position, but today's bots prefer b. Uberdude has analysed this position using a different implementation of wide search in LZ. This looked at b through e.

My results with LZ-258 unmodified, 10k playouts:
Attachment: takemiya_experiment_10kpo_LZ258_unmod.jpg

and with LZ-258 bonus=0.2 ratio=10, 10k playouts:
Sadly, Takemiya's old-style choice doesn't even make the top 10. But we can see that there are many ways to reduce a moyo!
Attachment: takemiya_experiment_bonus0.2_ratio10_10kpo_LZ258.jpg

In case you're wondering why the two versions give different values for b: it's just the difference between 1k and 10k playouts for that move. If you watch it evolve in real time, you'll see b's value going down with more playouts.

 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #12 Posted: Thu Jan 23, 2020 8:34 pm 
Lives in sente

Posts: 757
Liked others: 114
Was liked: 916
Rank: maybe 2d
Very cool.

If you and others are interested and you think the method has reached a final form - rather than wanting to keep iterating and refining the exact method - I could add it in KataGo and just make it a configurable option. Spreading a lot of playouts like this would of course weaken playing strength if used generally, but it would be easy to restrict to lz-analyze/kata-analyze, so as to only affect analysis mode when enabled.

If your goal is not merely to show more options for humans to effectively compare with consistent numbers of visits, but rather to *also* help the bot catch blind spots, one issue is that also for analysis you possibly want a wider search at deeper levels than just the root, which makes it tricky. It's not uncommon that the blind spot happens a few levels down, which prevents a root-level move from appearing to be good even when it does get searched. But maybe that's just a generally hard problem to solve, and merely the first effect (showing more options, with consistent visit numbers) is already useful enough.


This post by lightvector was liked by: marvin
 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #13 Posted: Fri Jan 24, 2020 4:13 pm 
Yes, I'd love to see this as an option in KataGo! I've been doing this work with Leela Zero simply because it's been around longer and I kind of know my way around the LZ source code. But I prefer KataGo as an analysis engine, and I imagine that this method would work pretty much the same way for KataGo.

I'll note that if the parameters can be changed from the GTP command line as well as from the settings file, then it's possible to change settings on the fly within Lizzie, without needing to reload the engine. (I also wouldn't mind being able to play with the "playout doubling advantage" setting this way as well.)

lightvector wrote:
Spreading a lot of playouts like this would of course weaken playing strength if used generally, but it would be easy to restrict to lz-analyze/kata-analyze, so as to only affect analysis mode when enabled.

Agreed that this option is for analysis only, not for "competitive" play. For me the restriction to lz-analyze/kata-analyze is unnecessary: I can maintain different KataGo configurations for different purposes, and I can think of scenarios where I might want to do batch analysis outside Lizzie (e.g. driven by Python scripts), and genmove output could be easier to parse than kata-analyze output. But I'm in a small minority there, and understand why you'd want this feature off by default for genmove. Happy for you to make the call.

lightvector wrote:
If your goal is not merely to show more options for humans to effectively compare with consistent numbers of visits, but rather to *also* help the bot catch blind spots, one issue is that also for analysis you possibly want a wider search at deeper levels than just the root, which makes it tricky. It's not uncommon that the blind spot happens a few levels down, which prevents a root-level move from appearing to be good even when it does get searched. But maybe that's just a generally hard problem to solve, and merely the first effect (showing more options, with consistent visit numbers) is already useful enough.

Again, I totally agree. My goal here is to have an "at a glance" view of the range of possibilities in a position, and of how much gap there is (or isn't) between the AI's preferred move and other plausible options. This is information you can already get by clicking back and forth between variations, but I think the immediacy of having it all on one screen at the same time is worth a lot.

Test position number 2 above (Hayashi misses a good move) is indeed a case where I found a specific position, a few moves deep, which looks like a blind spot for LZ-258 (but not for ELF). But if you broaden the search all the way down, you're changing the relative numbers of playouts given to different branches, which will mess up the evaluations, as well as making the tree less deep of course. You'd fix some blind spots at the cost of introducing some different mistakes.

I'm sure there is scope to tweak the search algorithm and improve strength, but that's an entirely different (and much more difficult) conversation. For now, probably the best way to fix blind spots is just to add a few million more self-play games :-)

----

By the way, this method (changing the top level of the search but nothing else) means that the winrate of each candidate move will still be accurate: once it's picked a move to search, it keeps doing what it always does, so you'll get the same number at the end of the day, assuming the same number of playouts for that move (and modulo any little effects from different random number seeds etc). But the evaluation of the starting position is (kind of) an average of all the playouts. Giving more playouts to second-best moves means the overall evaluation (the number that Lizzie uses for the winrate graph) will be slightly lower. So this method, I think, will have the effect of flattening out the winrate graphs a little bit. Personally I don't see this as an issue.
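A toy illustration of that averaging effect (the numbers are made up, and real LZ averages over playouts rather than computing exactly this, but the visit-weighted picture is the same):

```python
def root_eval(children):
    # children: list of (visits, winrate) pairs for the root's candidate moves.
    # The root evaluation is the visit-weighted average of the children.
    total = sum(n for n, _ in children)
    return sum(n * q for n, q in children) / total

narrow = [(9000, 0.55), (500, 0.52), (500, 0.50)]   # normal, focused search
wide   = [(4000, 0.55), (3000, 0.52), (3000, 0.50)] # widened search
# root_eval(narrow) ≈ 0.546 vs root_eval(wide) ≈ 0.526: same best move,
# but the reported overall winrate is a little lower, flattening the graph.
```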

 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #14 Posted: Fri Jan 24, 2020 9:38 pm 
Do you think the part where the top N moves all get *exactly* matching visit counts is important? (e.g. maybe it's nice to get a truly apples-to-apples comparison?).

Or would it be reasonable to still have some gradation between them, searching some of them a little more and some less so as to focus and improve eval quality on the moves that are believed better? (but still much more evenly spread than the bot would normally)

 Post subject: Re: Widening Leela Zero's search (nb: contains many images)
Post #15 Posted: Sat Jan 25, 2020 3:50 am 
I don't think it makes a big difference either way. If you have an idea for implementing this differently from what I've done so far, I'm interested to hear your thoughts.

Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group