KataGo V1.3

For discussing go computing, software announcements, etc.
Limeztone
Dies in gote
Posts: 63
Joined: Sun Jan 12, 2020 9:28 pm
GD Posts: 0
Has thanked: 8 times
Been thanked: 4 times

Re: KataGo V1.3

Post by Limeztone »

lightvector wrote:Stating the fixed number of playouts you used per move is NOT enough to give a constant hardware-independent strength. You also have to specify how many threads you used to generate those playouts.
I don't get this...
Are you actually saying that the same net with the same maxPlayouts could be different in strength depending on the number of threads (or when executed on different hardware)?
jann
Lives in gote
Posts: 445
Joined: Tue May 14, 2019 8:00 pm
GD Posts: 0
Been thanked: 37 times

Re: KataGo V1.3

Post by jann »

Limeztone wrote:As I understand it, if you clear the tree for every move made, visits and playouts become the same.
So that alone is a big change for a playout-based test (which, without clearing, was affected by a random search bonus).
the same net with the same maxPlayouts could be different in strength depending on the number of threads (or when executed on different hardware)
More search threads mean a weaker search: the search has less freedom in choosing which nodes to visit/expand.
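To make the visits/playouts distinction in this thread concrete, here is a minimal sketch of the accounting under tree reuse (the function name and bookkeeping are hypothetical, not KataGo's actual code):

```python
def visits_this_turn(playouts_this_turn: int, reused_visits: int) -> int:
    """Root visit count = new playouts spent this turn plus the visits
    carried over in the reused subtree from the previous search."""
    return playouts_this_turn + reused_visits

# If the tree is cleared every move, nothing is carried over,
# so visits and playouts coincide:
assert visits_this_turn(1000, 0) == 1000

# With tree reuse, the same playout budget yields more total visits,
# and the reused amount varies from move to move:
assert visits_this_turn(1000, 500) == 1500
```

This is why a fixed-playout budget buys a variable amount of effective search, while a fixed-visit budget pins down the total directly.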
inbae
Dies in gote
Posts: 25
Joined: Tue Feb 04, 2020 11:07 am
GD Posts: 0
KGS: inbae
Been thanked: 7 times

Re: KataGo V1.3

Post by inbae »

jann wrote:For example, if you clear the tree each move, fixed-playout tests are heavily affected (the same amount of playouts/work will do a less effective search), while fixed-visit tests are less so (single-threaded, at least).
Yes, clearing the search tree will certainly affect the results, but I don't think that is realistic under match conditions.
jann wrote:Another example is when you find the otherwise weaker side ahead because of a higher degree of tree reuse (thus effectively more, but weaker, search). Then you repeat the test in a different visit/playout range and find that the two factors no longer compensate each other as well, and now the other side comes out ahead.
In this very example, I would say that a stronger engine is weakened by not reusing its tree effectively. At the end of the day, this boils down to the question of which represents strength better: fixed playouts or fixed visits. For the aforementioned reasons, I think a fixed-playout test reflects real-world strength more accurately, since policy sharpness is a direct result of NN inference.
jann
Lives in gote
Posts: 445
Joined: Tue May 14, 2019 8:00 pm
GD Posts: 0
Been thanked: 37 times

Re: KataGo V1.3

Post by jann »

A wider test that allows more factors to affect the result will certainly be closer to a "real world" test (the real world usually being a situation with many interacting factors - and hence with results that are very hard to interpret!).

But I think you missed the point of the last example, where your playout results may even trick you. It is known that more search affects different nets/engines differently (stronger ones tend to benefit more). A weaker net with a sharper policy that allows more tree reuse trades search quality for (effective) quantity. At 1000 playouts it may win your test, and you may think it is "stronger in the real world", but at 10000 playouts (where search quality starts to matter more) it may lose.

Thus your results will be less robust and less consistent/representative across various real-world scenarios (similar to what happens if you allow hardware factors to affect your test). With a visit-based test, whichever side wins at 1000 visits will likely also win at 10000 visits.
inbae
Dies in gote
Posts: 25
Joined: Tue Feb 04, 2020 11:07 am
GD Posts: 0
KGS: inbae
Been thanked: 7 times

Re: KataGo V1.3

Post by inbae »

jann wrote:A weaker net with a sharper policy that allows more tree reuse trades search quality for (effective) quantity. At 1000 playouts it may win your test, and you may think it is "stronger in the real world", but at 10000 playouts (where search quality starts to matter more) it may lose.

Thus your results will be less robust and less consistent/representative across various real-world scenarios (similar to what happens if you allow hardware factors to affect your test). With a visit-based test, whichever side wins at 1000 visits will likely also win at 10000 visits.
A low-visit test does not necessarily correlate with high-visit tests: networks scale differently anyway, so additional tests are required at different visit/playout counts. This becomes clearer when we consider that the value head has more influence in deeper searches. The relative scaling test by Friday9i is an example of networks scaling differently at fixed visits.
jann
Lives in gote
Posts: 445
Joined: Tue May 14, 2019 8:00 pm
GD Posts: 0
Been thanked: 37 times

Re: KataGo V1.3

Post by jann »

The margin of victory will be different, but not the winner (assuming identical visits - unlike with identical playouts).

Those linked tests used non-identical visits, thus opening the test up to a new factor (scalability). But that was intentional there, since the test was about scalability itself, not raw strength.
inbae
Dies in gote
Posts: 25
Joined: Tue Feb 04, 2020 11:07 am
GD Posts: 0
KGS: inbae
Been thanked: 7 times

Re: KataGo V1.3

Post by inbae »

jann wrote:The margin of victory will be different, but not the winner (assuming identical visits - unlike with identical playouts).

Those linked tests used non-identical visits, thus opening the test up to a new factor (scalability). But that was intentional there, since the test was about scalability itself, not raw strength.
The point is the difference in scalability. If two networks scale differently, you cannot guarantee that the winner at lower visits would also win at higher visits. I have no idea why you are confident that the winner will not change here. Moreover, one sometimes wants to measure the margin of victory (or the Elo rating difference) as well.
jann
Lives in gote
Posts: 445
Joined: Tue May 14, 2019 8:00 pm
GD Posts: 0
Been thanked: 37 times

Re: KataGo V1.3

Post by jann »

The difference in scalability means that net A needs 1.5x more visits around 1000 visits (to compensate for being weaker) but 2.5x more around 10000 visits. In both cases it is weaker than B, and would (obviously) lose at 1.0x visits.

Network strengths will not (or only rarely) swap; what usually happens is that another factor (like more search) compensates for the raw strength difference.
inbae
Dies in gote
Posts: 25
Joined: Tue Feb 04, 2020 11:07 am
GD Posts: 0
KGS: inbae
Been thanked: 7 times

Re: KataGo V1.3

Post by inbae »

jann wrote:The difference in scalability means that net A needs 1.5x more visits around 1000 visits (to compensate for being weaker) but 2.5x more around 10000 visits. In both cases it is weaker than B, and would (obviously) lose at 1.0x visits.

Network strengths will not (or only rarely) swap; what usually happens is that another factor (like more search) compensates for the raw strength difference.
The true implication of the scaling test is that the ratings of networks increase with different slopes with respect to the logarithm of playouts (if we use the naive approximation that Elo rating increases linearly with log(playouts)). It suggests that a better-scaling network can eventually overtake a worse-scaling one given enough playouts. This will especially be the case for a network with a better value head: given more playouts, the search is influenced more by the value head. And such networks can result from different loss-function weights during training.
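Under that naive linear-in-log(playouts) approximation, two different slopes imply a crossover point. A toy sketch with made-up numbers (the bases and slopes are assumptions for illustration, not measurements of any real net):

```python
import math

def elo(base: float, slope: float, playouts: int) -> float:
    # Naive model: Elo rating grows linearly with log2(playouts).
    return base + slope * math.log2(playouts)

# Hypothetical nets: A is weaker at low playouts but scales better.
elo_a = lambda p: elo(0.0, 120.0, p)    # steeper slope (better scaling)
elo_b = lambda p: elo(600.0, 80.0, p)   # stronger at low playouts

# B wins at 1000 playouts, but A overtakes it eventually
# (in this model the crossover is at 2**15 = 32768 playouts):
assert elo_a(1000) < elo_b(1000)
assert elo_a(1_000_000) > elo_b(1_000_000)
```

With different slopes the rating gap changes with the search budget, which is why a single-budget test cannot settle which net is "stronger" in general.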
jann
Lives in gote
Posts: 445
Joined: Tue May 14, 2019 8:00 pm
GD Posts: 0
Been thanked: 37 times

Re: KataGo V1.3

Post by jann »

inbae wrote:It suggests that a better-scaling network can eventually overtake a worse-scaling one given enough playouts.
Sure, that's why more search usually helps the stronger but slower side. But note that even in your linked graph the curves usually don't cross the "1" line (which they would if the identical-visit winner could easily swap). Being stronger somewhere at identical visits normally determines the rest of the curve; the only question is the slope (i.e. at what point the extra search required to compensate exceeds what is available from, e.g., the speed difference).
inbae
Dies in gote
Posts: 25
Joined: Tue Feb 04, 2020 11:07 am
GD Posts: 0
KGS: inbae
Been thanked: 7 times

Re: KataGo V1.3

Post by inbae »

jann wrote:Being stronger somewhere at identical visits normally determines the rest of the curve
Not necessarily, I suppose. For example, there are U-shaped curves, where a weaker network benefits from more visits at sweet spots but falls off at higher visits due to scaling. Another group consists of the purple curves, where ELFv1 scales better than LZ18x, though the ratio=1 line is not crossed there. We are clearly seeing different scaling across networks. Judging the strength of a network from a single visit count is as dangerous as testing with only 1 playout; strength should be tested in terms of [network, playouts (or visits), number of threads], for example - there is no such thing as an absolute measure of the "strength of a network".

And still, this discussion does not justify arguments such as
jann wrote:A weaker net with a sharper policy that allows more tree reuse trades search quality for (effective) quantity. At 1000 playouts it may win your test, and you may think it is "stronger in the real world", but at 10000 playouts (where search quality starts to matter more) it may lose.

Thus your results will be less robust and less consistent/representative across various real-world scenarios (similar to what happens if you allow hardware factors to affect your test). With a visit-based test, whichever side wins at 1000 visits will likely also win at 10000 visits.
jann
Lives in gote
Posts: 445
Joined: Tue May 14, 2019 8:00 pm
GD Posts: 0
Been thanked: 37 times

Re: KataGo V1.3

Post by jann »

Everything is possible, but not everything is (equally) probable. In any case, a visit-based test is more likely to be consistent across visit ranges.
inbae wrote:And still, this discussion does not justify arguments such as
With a playout-based test, the side that tends to support more tree reuse has, say, a constant 1.5x effective search advantage. This is similar to a smaller, weaker but faster net, which (with proportional visits, or in a time-based test) can win low-search matches but lose high-search matches.
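This win-low/lose-high pattern can be sketched with a toy diminishing-returns model (the curve shape and all numbers here are assumptions for illustration, not measurements of any real net):

```python
import math

def toy_elo(visits: float) -> float:
    # Toy diminishing-returns curve: each extra doubling of search
    # is worth less Elo than the previous one (purely illustrative).
    return 100.0 * math.sqrt(math.log2(visits))

REUSE_ADVANTAGE = 1.5  # constant effective-search multiplier from tree reuse
WEAKNESS = 10.0        # net W is 10 Elo weaker at identical effective search

def w_beats_s(playouts: int) -> bool:
    # In a fixed-playout test, W effectively searches 1.5x as much as S.
    return toy_elo(REUSE_ADVANTAGE * playouts) - WEAKNESS > toy_elo(playouts)

assert w_beats_s(64)           # low search: the 1.5x bonus dominates
assert not w_beats_s(100_000)  # high search: diminishing returns, W loses
```

Because the Elo value of a constant search multiplier shrinks as the budget grows, the winner of a fixed-playout test can depend on which budget you happened to choose.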
inbae
Dies in gote
Posts: 25
Joined: Tue Feb 04, 2020 11:07 am
GD Posts: 0
KGS: inbae
Been thanked: 7 times

Re: KataGo V1.3

Post by inbae »

jann wrote:With a playout-based test, the side that tends to support more tree reuse has, say, a constant 1.5x effective search advantage. This is similar to a smaller, weaker but faster net, which (with proportional visits, or in a time-based test) can win low-search matches but lose high-search matches.
Such a search advantage can be converted into visits. Say it is 1.5x: then it becomes something like 1500 visits for network A vs 1000 visits for network B. Are you implying that 1000 vs 1000 visits will be consistent with 10000 vs 10000 visits, but 1500 vs 1000 won't be consistent with 15000 vs 10000, for example?

And if a net that is weaker at lower playouts manages to win at higher playouts, so be it. That is ultimately what really matters, and the strength of a network cannot be considered without also considering playouts (or visits).
jann
Lives in gote
Posts: 445
Joined: Tue May 14, 2019 8:00 pm
GD Posts: 0
Been thanked: 37 times

Re: KataGo V1.3

Post by jann »

inbae wrote:Are you implying that 1000 vs 1000 visits will be consistent with 10000 vs 10000 visits, but 1500 vs 1000 won't be consistent with 15000 vs 10000, for example?
Yes, this is also the exact meaning of the scalability graph you linked. (With some effort you may even find examples for these particular numbers there - crossing the 1.5 line but not the 1.0 line.)
inbae
Dies in gote
Posts: 25
Joined: Tue Feb 04, 2020 11:07 am
GD Posts: 0
KGS: inbae
Been thanked: 7 times

Re: KataGo V1.3

Post by inbae »

jann wrote:
inbae wrote:Are you implying that 1000 vs 1000 visits will be consistent with 10000 vs 10000 visits, but 1500 vs 1000 won't be consistent with 15000 vs 10000, for example?
Yes, this is also the exact meaning of the scalability graph you linked.
You are right about the latter, but the former is less justified. Since both the policy and value heads are involved in the search, networks should behave differently at different playout/visit counts. Consider extreme cases such as an abysmal value head, a one-hot policy, or the 1-visit case.

And what I keep insisting is that there is nothing wrong with benefiting from tree reuse, or with the results varying with playouts. You are tacitly suggesting that the result should be consistent regardless of computational cost, but I don't see why.