KataGo V1.3

Limeztone · Post by **Limeztone** » Sun Mar 01, 2020 4:27 pm

lightvector wrote:Saying a fixed number of playouts you used per move is NOT enough to give a constant hardware-independent strength. You also have to specify how many threads you used to generate that many playouts.

I don't get this...
Are you actually saying that the same net with the same maxPlayouts could be different in strength depending on the number of threads (or executed on different hardware)?

jann · Post by **jann** » Sun Mar 01, 2020 4:53 pm

Limeztone wrote:As I understand visits vs playouts, if you clear the tree for every move made, visits and playouts become the same.

Thus clearing the tree is a big change for a playout-based test (which, without it, was affected by a random search bonus).

Limeztone wrote:the same net with the same maxPlayouts could be different in strength depending on the number of threads (or executed on different hardware)

More search threads mean a weaker search (less freedom in choosing which nodes to visit/expand).
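For reference, a minimal sketch of the accounting behind the visits/playouts distinction in this exchange. It assumes the usual convention that playouts are the new search iterations run for the current move, while visits also count whatever the reused subtree already holds. `search_budget` and its parameters are hypothetical names for illustration, not KataGo's API, and the sketch ignores threading entirely.

```python
from typing import Tuple


def search_budget(carried_over_visits: int, cap: int, cap_type: str) -> Tuple[int, int]:
    """Return (new_playouts, total_root_visits) for one move.

    carried_over_visits: visits already in the reused subtree (0 if the tree is cleared).
    cap: the per-move limit.
    cap_type: "visits" (limit counts reused visits) or "playouts" (limit counts new work only).
    """
    if cap_type == "visits":
        # Reused visits count against the cap, so less new work is done.
        new_playouts = max(0, cap - carried_over_visits)
    elif cap_type == "playouts":
        # Reused visits come on top of the cap as a free bonus.
        new_playouts = cap
    else:
        raise ValueError("cap_type must be 'visits' or 'playouts'")
    return new_playouts, carried_over_visits + new_playouts


# Tree cleared every move: both caps behave the same.
print(search_budget(0, 1000, "visits"))      # -> (1000, 1000)
print(search_budget(0, 1000, "playouts"))    # -> (1000, 1000)

# With 300 visits carried over (an illustrative number), they diverge.
print(search_budget(300, 1000, "visits"))    # -> (700, 1000)
print(search_budget(300, 1000, "playouts"))  # -> (1000, 1300)
```

Under a playout cap, the side whose tree is reused more ends up with more total visits for the same new work, which is the "search bonus" being discussed.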

inbae · Post by **inbae** » Sun Mar 01, 2020 5:29 pm

jann wrote:For example, if you clear the tree each move, fixed playout tests are heavily affected (the same amount of playouts / work will do less effective search) while fixed visit tests are less so (single threaded at least).

Yes, clearing the search tree will certainly affect the results, but I don't think that is realistic in match conditions.

jann wrote:Another example is when you find an otherwise weaker side ahead because of a higher extent of tree reuse (thus effectively more but weaker search). Then you repeat the test in a different visit/playout range and find that these two factors now compensate each other less, and now the other side comes out ahead.

In this very example, I would say that the stronger engine is weakened by not reusing the tree effectively. At the end of the day, this boils down to the question of which represents strength better: fixed playouts or fixed visits. And for the aforementioned reasons, I think a fixed-playouts test reflects real-world strength more accurately, since policy sharpness is a direct result of NN inference.

jann · Post by **jann** » Sun Mar 01, 2020 6:03 pm

A wider test that allows more factors to affect the result will certainly be closer to being called a "real world" test (which is usually a situation with many affecting factors - hence with very hard to interpret results!).

But I think you missed the point of the last example, where your playout results may even trick you. It is known that more search affects different nets/engines differently (stronger ones tend to benefit more). A weaker net with sharper policy that allows more tree reuse trades search quality for (effective) quantity. At 1000 playouts it may win your test, and you may think it is "stronger in real world", but at 10000 playouts (where search quality starts to matter more) it may lose.

Thus your results will be less robust or consistent/representative across various real world scenarios (much as if you allowed hardware factors to affect your test). With a visit based test, whichever side wins at 1000 visits will likely also win at 10000 visits.

inbae · Post by **inbae** » Sun Mar 01, 2020 6:53 pm

jann wrote:A weaker net with sharper policy that allows more tree reuse trades search quality for (effective) quantity. At 1000 playouts it may win your test, and you may think it is "stronger in real world", but at 10000 playouts (where search quality starts to matter more) it may lose.

Thus your results will be less robust or consistent/representative across various real world scenarios (much as if you allowed hardware factors to affect your test). With a visit based test, whichever side wins at 1000 visits will likely also win at 10000 visits.

A lower-visits test does not necessarily correlate with higher-visits tests: networks scale differently anyway, and different visit/playout counts require additional tests. This is clearer when we consider that the value head has more influence in deeper searches. For example, a relative scaling test by Friday9i shows networks scaling differently with fixed visits.

jann · Post by **jann** » Sun Mar 01, 2020 7:03 pm

The margin of victory will be different, but not the winner (assuming identical visits - unlike with identical playouts).

Those linked tests used non-identical visits, thus opening the test up to a new factor (scalability). But this was on purpose there, since the test was not about raw strength but about scalability itself.

inbae · Post by **inbae** » Sun Mar 01, 2020 7:16 pm

jann wrote:The margin of victory will be different, but not the winner (assuming identical visits - unlike with identical playouts).

Those linked tests used non-identical visits, thus opening the test up to a new factor (scalability). But this was on purpose there, since the test was not about raw strength but about scalability itself.

The point is the difference in scalability. If two networks scale differently, you cannot guarantee that the winner at lower visits will also win at higher visits. I have no idea why you are confident that the winner will not change here. Moreover, one sometimes wants to measure the margin of victory (or Elo rating difference) as well.

jann · Post by **jann** » Sun Mar 01, 2020 7:31 pm

The difference in scalability means that net A needs 1.5x more visits around 1000 visits (to compensate for being weaker) but 2.5x more around 10000 visits. In both cases it is weaker than B, and would (obviously) lose at 1.0x visits.

Network strengths will not (or will rarely) swap; what usually happens is that another factor (like more search) may compensate for the raw strength difference.
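The compensation factor described above (how many times more visits the weaker net needs to draw level) can be sketched numerically. This is only an illustration under a naive assumption that Elo grows linearly with log10(visits); the function names and the `NET_A` / `NET_B` parameters are made up for the example and are not measurements of any real network.

```python
import math


def elo(intercept: float, slope: float, visits: float) -> float:
    """Naive model: Elo = intercept + slope * log10(visits)."""
    return intercept + slope * math.log10(visits)


def compensating_ratio(weak, strong, visits: float) -> float:
    """Visit multiplier r such that the weak net at r * visits matches
    the strong net at `visits`. Each net is an (intercept, slope) pair."""
    gap = elo(strong[0], strong[1], visits) - elo(weak[0], weak[1], visits)
    # Multiplying visits by r adds slope * log10(r) Elo to the weak net.
    return 10 ** (gap / weak[1])


# Hypothetical parameters: B is stronger at equal visits and scales better.
NET_A = (0.0, 300.0)
NET_B = (60.0, 320.0)

for n in (1000, 10000):
    print(n, round(compensating_ratio(NET_A, NET_B, n), 2))
```

With these made-up parameters the required multiplier grows with the visit count while B stays ahead at 1.0x in both ranges, which is the situation the post describes. Other parameter choices behave differently.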

inbae · Post by **inbae** » Sun Mar 01, 2020 8:18 pm

jann wrote:The difference in scalability means that net A needs 1.5x more visits around 1000 visits (to compensate for being weaker) but 2.5x more around 10000 visits. In both cases it is weaker than B, and would (obviously) lose at 1.0x visits.

Network strengths will not (or will rarely) swap; what usually happens is that another factor (like more search) may compensate for the raw strength difference.

The true implication of the scaling test is that the ratings of networks increase with different slopes with respect to the logarithm of playouts (if we assume the naive approximation that Elo rating increases linearly with log(playouts)). It suggests that a better-scaling network can eventually overcome a worse-scaling network given enough playouts. This will especially be the case for a network with a better value head: given more playouts, the search will be influenced more by the value head. And such networks can be the result of different weights in the loss function during training.
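As a rough illustration of that naive approximation: if each network's rating is modelled as Elo = a + b * log10(playouts), two lines with different slopes meet at exactly one playout count. The sketch below solves for it; `crossover_playouts` and the example numbers are hypothetical, not fitted to any real network.

```python
from typing import Optional


def crossover_playouts(a1: float, b1: float, a2: float, b2: float) -> Optional[float]:
    """Playout count where a1 + b1*log10(n) equals a2 + b2*log10(n).

    Returns None when the slopes are equal (parallel lines never cross).
    """
    if b1 == b2:
        return None
    # Solve a1 + b1*x = a2 + b2*x for x = log10(n).
    x = (a1 - a2) / (b2 - b1)
    return 10 ** x


# Hypothetical example: net 1 starts 100 Elo ahead but gains 200 Elo per
# tenfold increase in playouts; net 2 gains 250 Elo per tenfold increase.
print(crossover_playouts(100.0, 200.0, 0.0, 250.0))  # -> 100.0
```

Whether such a crossover matters in practice depends on where it lands: it may fall below one playout or far beyond any realistic budget, and the linear-in-log model itself is only an approximation.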

jann · Post by **jann** » Sun Mar 01, 2020 8:28 pm

inbae wrote:It suggests that a better scaling network can eventually overcome another network of worse scaling given enough playouts.

Sure, that's why more search usually helps the stronger but slower side. But note that even in your linked graph, curves usually don't cross the line of "1" (which would happen if the identical-visit winner could easily swap). Being stronger somewhere at identical visits normally determines the rest of the curve; the only question is the slope (i.e. when the extra search required to compensate becomes more than what's available from e.g. the speed difference).

inbae · Post by **inbae** » Sun Mar 01, 2020 8:58 pm

jann wrote:Being stronger somewhere at identical visits normally determines the rest of the curve

Not necessarily, I suppose. For example, there are U-shaped curves, where a weaker network benefits from more visits at sweet spots, but falls off at higher visits due to scaling. However, another group consists of purple curves, where ELFv1 is scaling better than LZ18x, though the ratio=1 line was not crossed here. We are clearly seeing different scalings of networks. Judging the strength of a network from a single visit count is as dangerous as testing with 1 playout only, and the strength should be tested in terms of [network, playouts (or visits), number of threads] for example - there is nothing like an absolute measure of the "strength of a network".

And still, this discussion does not justify arguments such as

jann wrote:A weaker net with sharper policy that allows more tree reuse trades search quality for (effective) quantity. At 1000 playouts it may win your test, and you may think it is "stronger in real world", but at 10000 playouts (where search quality starts to matter more) it may lose.

Thus your results will be less robust or consistent/representative across various real world scenarios (much as if you allowed hardware factors to affect your test). With a visit based test, whichever side wins at 1000 visits will likely also win at 10000 visits.

jann · Post by **jann** » Sun Mar 01, 2020 9:26 pm

Everything is possible, but not everything is (equally) probable. In any case, a visit based test is more likely to be consistent across visit ranges.

inbae wrote:And still, this discussion does not justify arguments such as

With a playout based test, the side that tends to support more tree reuse has, say, a constant 1.5x effective search advantage. This is similar to a smaller, weaker but faster net, which (with proportional visits, or time based test) can win low search matches but lose high search matches.

inbae · Post by **inbae** » Sun Mar 01, 2020 9:38 pm

jann wrote:With a playout based test, the side that tends to support more tree reuse has, say, a constant 1.5x effective search advantage. This is similar to a smaller, weaker but faster net, which (with proportional visits, or time based test) can win low search matches but lose high search matches.

Such a search advantage can be converted into visits. Say it is 1.5x: then it will be something like 1500 visits for network A vs 1000 visits for network B. Do you imply that 1000 vs 1000 visits will be consistent with 10000 vs 10000 visits, but 1500 vs 1000 won't be so with 15000 vs 10000, for example?

And if a weaker-at-lower-playouts net can manage to win at higher playouts, let it be. It is ultimately what really matters, and the strength of a network cannot be thought of without considering playouts (or visits).

jann · Post by **jann** » Sun Mar 01, 2020 9:48 pm

inbae wrote:Do you imply that 1000 vs 1000 visits will be consistent with 10000 vs 10000 visits, but 1500 vs 1000 won't be so with 15000 vs 10000, for example?

Yes, this is also the exact meaning of the scalability graph you linked. (With some effort you may even find examples for these particular numbers there - crossing the 1.5 line but not the 1.0 line.)

inbae · Post by **inbae** » Sun Mar 01, 2020 10:04 pm

jann wrote:
inbae wrote:Do you imply that 1000 vs 1000 visits will be consistent with 10000 vs 10000 visits, but 1500 vs 1000 won't be so with 15000 vs 10000, for example?
Yes, this is also the exact meaning of the scalability graph you linked.

You are right about the latter, but the former is less justified. As both policy and value heads are involved in the search, networks should behave differently with different playouts/visits. You can think of very extreme cases such as an abysmal value head, one-hot policy or 1 visit case, etc.

And what I keep insisting is that there is nothing wrong with benefiting from tree reuse, or with the results varying with playouts. You are tacitly suggesting that the result should be consistent regardless of computational cost, but I have no idea why.

Life In 19x19
