jann wrote: Being stronger somewhere at identical visits normally determines the rest of the curve
Not necessarily, I suppose. For example, there are U-shaped curves, where a weaker network benefits from more visits at sweet spots but falls off at higher visits due to scaling. Another group consists of the purple curves, where ELFv1 scales better than LZ18x, though the ratio=1 line was not crossed here. We are clearly seeing networks that scale differently. Judging the strength of a network from a single visit count is as dangerous as testing with 1 playout only; strength should be tested in terms of, for example, [network, playouts (or visits), number of threads]. There is no such thing as an absolute measure of the "strength of a network".
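To make the point about visit-dependent strength concrete, here is a minimal Python sketch of how one might tabulate match results per visit count and look for a crossover. It is not taken from any engine's test tooling: the function names `elo_diff` and `crossover_visits` and the input layout (a dict from visit count to win/loss counts) are my own assumptions, and the only formula used is the standard logistic Elo conversion.

```python
import math


def elo_diff(wins, losses, draws=0):
    """Elo difference implied by a match score (standard logistic model)."""
    games = wins + losses + draws
    if games == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / games
    if score <= 0.0 or score >= 1.0:
        # A 0% or 100% score has no finite Elo estimate.
        return math.copysign(math.inf, score - 0.5)
    return 400.0 * math.log10(score / (1.0 - score))


def crossover_visits(results):
    """Return the visit counts at which the sign of the Elo difference flips.

    `results` is a hypothetical layout: {visits: (wins, losses)} for
    network A against network B, both searching with the same visit count.
    """
    flips = []
    previous = None
    for visits in sorted(results):
        current = elo_diff(*results[visits])
        if previous is not None and current * previous < 0:
            flips.append(visits)
        previous = current
    return flips
```

Feeding it one (wins, losses) pair per tested visit count shows whether the ordering of two networks holds across the whole range or flips somewhere. An empty list only means no flip was observed at the tested counts, which is the ratio=1 situation described above. Threads could be added as a second key in the same way.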
And still, this discussion does not justify arguments such as:
jann wrote: A weaker net with sharper policy that allows more tree reuse trades search quality for (effective) quantity. At 1000 playouts it may win your test, and you may think it is "stronger in real world", but at 10000 playouts (where search quality starts to matter more) it may lose.
Thus your results will be less robust or consistent/representative across various real world scenarios (similarly like if you allowed hw factors to affect your test). With a visit based test, whichever side wins at 1000 visits will likely also win at 10000 visits.