No, we don't need a new definition.
Let's clarify what the purpose of this selection metric is - to pick a reliably good network with high confidence:
Even if the gap is smaller now, "kata1-b40c256-s6485784576-d1573360039" is still rated as among the strongest of all the nets before and near it, right?
And the error bar on that net is small, so even if it turns out not to be the strongest, with very high likelihood it is not one of the nets that performs unusually poorly, right?
So in both ways, the selection criterion is doing its job well. With decent reliability, it picks out a recent net that is the strongest or nearly the strongest among its neighbors, and with high reliability it avoids really bad nets, despite major uncertainty in the ratings relative to the magnitude of the differences it is attempting to discriminate between.
Additionally, keep in mind about Elo values in general: across almost all Elo systems, pay more attention to Elo differences than to absolute Elo numbers. This holds except perhaps for systems that take extreme pains to maintain stability across time. You can see how Go server and association ratings are all over the place relative to one another, as well as sometimes inflating or deflating over time. In the chess world things are more stable, but there is still sometimes a little noise or drift, and mild inconsistency between systems. And every published research paper uses an Elo scale whose absolute offset is incomparable to that of any other paper. In all cases, the differences are more meaningful than the absolute numbers.
In KataGo's case, the anchor point of the graph right now is arbitrarily chosen as 0 = random, and new rating games are played all the time, even between very old nets. If, back when KataGo was moving through "DDK" level, new games indicate that over a span of some nets only 2000 Elo was gained instead of 2050, the entire rating graph above it will shift by 50 Elo, even though nothing practical has changed about our belief of the strength of the current nets. So the absolute number really, really doesn't matter here.
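To see why a uniform shift is harmless, here is a quick sketch using the standard logistic Elo formula (the function name and the specific rating numbers are just for illustration, not from KataGo's code):

```python
def expected_score(elo_a, elo_b):
    """Standard Elo expected score for A vs B; note it depends only on the difference."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))

# Suppose a recalculation of old games shifts the whole upper graph down by 50 Elo.
p_before = expected_score(13500, 13450)
p_after = expected_score(13500 - 50, 13450 - 50)

# Every predicted head-to-head result is unchanged, because only differences matter.
assert abs(p_before - p_after) < 1e-12
```

Any constant offset cancels out of the formula, so the rating graph can drift vertically without changing any prediction about how the nets compare.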
And, a note about Elo locality: even more than ignoring absolutes and paying attention to just differences, in any Elo system you ever find in practice, you usually should only consider the
local differences reliable - the rating difference between a player and the other players near them. Larger differences are the transitive sum of smaller differences, rather than directly measured. So when P1 is rated 1150 Elo better than P2 in *any* practical system (not just KataGo), that should be understood to mean something like:
"P1 is measured to approximately win 3:1 against players who win 3:1 against players who win 3:1... against P2", in total iterated 6 times.
It does NOT mean:
"P1 is measured to approximately win 750:1 against P2".
Because in practice, no Elo system will have the games to measure that accurately. Plus, we know that Elo itself is only an approximation of reality. In truth, "skill level" is more complex and multidimensional, and one of the places that approximation starts becoming unreliable is precisely in very large differences.

So the interpretation of the vertical confidence bands in KataGo's rating graph is a bit subtle. The confidence bands around the nets should be understood as confidence with respect to the Elos of the population of nets around them, say, within the nearby +/- 300 Elo or so. If the local population as a whole moves up or down by more than that, it doesn't matter.