abcd_z wrote: Every tournament I have ever heard of uses the results of current matches. I have never heard of a tournament that reused results from previous tournaments.
To describe what you are doing as an engine tournament is very misleading, because it implies you're using new results for each new round of testing.
Although normal standards of politeness don't apply on internet forums, I would give as0770 some credit (even quite a significant amount) for running this "tournament" before going into the critique part. I have found this thread very interesting to follow, and it has been a joy to check out the newest results as they become available, even if that doesn't happen every day. Running this kind of project on voluntary effort is a huge undertaking, though, so that is completely understandable. So thanks to as0770 from here!
Now it is true that I also had the impression that "tournament" was somehow plural (like "engine tournaments", with each post representing one tournament), i.e. that a new one is run every once in a while. So there is ambiguity in the term. Maybe the updates could say "here are updated results of the continuing tournament", or something to that effect.
However, despite abcd_z's vehement reaction to this, I would argue that if the engines are the exact same versions (like SomeBot 1.4.4 or OtherEngine 10), arranging a 10-game match between them every time just reduces variance after a while and is not likely to uncover any deeper truths, so I would not put much effort into having GNU Go play against every other bot all the time. For Leela Zero and other actively progressing engines, however, there might be a need to re-run the newest version against the opposition every once in a while.
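To put a rough number on the variance point (a minimal sketch; the helper name is mine, not from any tournament tool): the standard error of an estimated win rate shrinks only with the square root of the number of games, so repeat matches between two frozen versions hit diminishing returns quickly.

```python
import math

def win_rate_stderr(wins: int, games: int) -> float:
    """Standard error of the estimated win rate (normal approximation)."""
    p = wins / games
    return math.sqrt(p * (1 - p) / games)

# Doubling the number of games only shrinks the error by ~sqrt(2):
print(round(win_rate_stderr(8, 10), 3))   # 0.126
print(round(win_rate_stderr(16, 20), 3))  # 0.089
```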
Of course, re-running matches raises the question: if BotX won 2/10 against LZ #120, and LZ #142 then beats it 10-0, does the record for Leela vs. BotX become 10-0 or 18-2? But if it's only one match per opponent, then you can just overwrite the previous results.
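Just to make the two bookkeeping options concrete (a toy sketch; the function names are mine):

```python
# Two ways to record a rematch; scores are (wins_for_LZ, wins_for_BotX).
def accumulate(old, new):
    """Add the rematch to the running total: (8, 2) + (10, 0) -> (18, 2)."""
    return (old[0] + new[0], old[1] + new[1])

def overwrite(old, new):
    """Keep only the latest match: (8, 2), (10, 0) -> (10, 0)."""
    return new

first_match = (8, 2)   # LZ #120 beat BotX 8-2
rematch = (10, 0)      # LZ #142 swept the rematch
print(accumulate(first_match, rematch))  # (18, 2)
print(overwrite(first_match, rematch))   # (10, 0)
```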
Naturally a tournament is not an Elo comparison: in cases where only a single game is played, the quantity of interest is a winning probability, but the observed result is binary. In a league, though, you could argue that the results are likely to even each other out.
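For reference, this is the standard Elo relation between a rating difference and an expected score; each individual game is just one binary sample of that probability:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score for player A under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 100-point favorite is expected to score ~0.64 per game,
# yet every single game still comes out as a bare 0 or 1.
print(round(elo_expected_score(2100, 2000), 2))  # 0.64
```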
One possible idea: maybe build a matrix of engines and slowly accumulate games into it, with more games allotted to pairings where the results are not as clear (see the sketch below)? With one row/column per engine version, this would slowly build up into quite a good comparison. And for GNU Go vs. LZ ELF you don't need a 10-game match; one or two games would be enough to see that yes, LZ ELF mostly wins.
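Purely as a sketch of how that matrix might work (the selection rule here, allotting the next game to the pairing whose lead is the fewest standard errors away from 50%, is my own guess at what "not as clear" would mean in code):

```python
import math

# results[(a, b)] = (wins_for_a, wins_for_b): one cell per pair of engine versions
results = {
    ("LZ ELF", "GNU Go"):       (2, 0),   # lopsided pairings need few games
    ("LZ #142", "BotX"):        (10, 0),
    ("BotX", "OtherEngine 10"): (3, 3),   # still murky: deserves more games
}

def clarity(wins_a: int, wins_b: int) -> float:
    """How clearly one side leads, in standard errors away from 50%.
    Add-one smoothing keeps empty and shutout cells well-defined."""
    n = wins_a + wins_b + 2
    p = (wins_a + 1) / n
    stderr = math.sqrt(p * (1 - p) / n)
    return abs(p - 0.5) / stderr

def next_pairing(results):
    """Allot the next game to the pair whose result is least clear."""
    return min(results, key=lambda pair: clarity(*results[pair]))

print(next_pairing(results))  # ('BotX', 'OtherEngine 10')
```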