I'm glad you're enthusiastic, but I still don't understand why you insist on using such a tiny number of games (only 4 per network!!) and justifying it on the basis of wanting to serve "end users".
If that is all the computing power you can afford, sure. I absolutely respect and appreciate doing the best one can with limited resources. No problem!
But if instead it's a deliberate choice to use fewer games to better match what end users would experience, then it's silly. Rather than deliberately using an error-prone measurement because you think most users won't notice the error, it does no harm to use an accurate measurement (more games) and report the accurately measured difference. Then each user can decide for themselves whether the accurately reported difference is big enough to care about.
Four games per test is especially few. Consider a bot A that beats B 60% of the time. I would guess most people would consider that not a huge difference, but still a respectable one. However, with only 4 games, the chance that B beats A 3-1 or 4-0 is about 18%! So there is an 18% chance you'd come up with the entirely backwards conclusion.
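If it helps, here's a quick sketch of where that 18% comes from. It's just the binomial formula in plain Python, assuming B's true per-game win rate against A is 40%:

```python
from math import comb

p = 0.40   # assumed true per-game win rate of B against A
n = 4      # games in the test

# Probability that B wins at least 3 of the 4 games, i.e. the test
# points to the entirely backwards conclusion.
wrong = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(3, n + 1))
print(f"P(B wins 3-1 or 4-0) = {wrong:.3f}")   # about 0.179
```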
You've argued many times in the past that "end users" will only use the bot for a few games themselves, and therefore the way to make the best recommendation is to test using only a few games because it better matches their usage, rather than to run tests with a large number of games. The following example shows why that logic doesn't hold:
- Suppose we did run a 4-game test and we did get a 3-1 result in favor of B (getting a result that was only 18% likely is very possible!).
- Suppose we also ran a 1000-game test, and this time the result was that A won 613 games and B won 387.
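To make it concrete, here's a rough confidence-interval sketch for A's win rate under each test (illustrative only; the normal approximation is crude at 4 games, which is rather the point):

```python
from math import sqrt

def approx_95ci(wins, games):
    """Rough 95% confidence interval for a win rate (normal approximation)."""
    p = wins / games
    half = 1.96 * sqrt(p * (1 - p) / games)
    return p - half, p + half

print(approx_95ci(1, 4))       # 4-game test, A won 1 of 4 -> roughly (-0.17, 0.67)
print(approx_95ci(613, 1000))  # 1000-game test           -> roughly (0.58, 0.64)
```

The 4-game result is statistically consistent with almost anything, while the 1000-game result pins A's win rate solidly above 50%.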
Consider a user who plans to use either bot A or bot B in a tournament where it will play 4 games, and they want the bot with the best chance of doing well. Based on the above two tests, which bot should we recommend to them? Should we trust the 4-game test and recommend B because the tournament will also be 4 games, so a 4-game test is the most reliable? Or should we trust the 1000-game test and recommend A because the 1000-game test is overall a more accurate measurement?
Obviously we should recommend bot A to them!
We can see here a clear demonstration that the principle "if end users will only notice larger differences and will only be using the bot for a very few games, then the best way to make a good recommendation is to also run tests using only a very few games" is a bad principle. The way to make a good recommendation to an end user who will run few games is to test with many times more games than they will use.
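If it's still not convincing, here's a tiny Monte Carlo sketch (illustrative only, assuming A really is the stronger bot with a 60% per-game win rate). It estimates how often a head-to-head test of a given size ends up recommending the genuinely stronger bot:

```python
import random

def recommend_a(true_p_a, games, rng):
    """One simulated head-to-head test: recommend A if it wins the majority.
    An exact tie carries no information, so break it with a coin flip."""
    a_wins = sum(rng.random() < true_p_a for _ in range(games))
    if a_wins * 2 == games:
        return rng.random() < 0.5
    return a_wins * 2 > games

rng = random.Random(0)
trials = 100_000
true_p_a = 0.60   # assume A truly wins 60% of games against B

for games in (4, 100, 1000):
    correct = sum(recommend_a(true_p_a, games, rng) for _ in range(trials))
    print(f"{games:>4}-game test picks the stronger bot {correct / trials:.1%} of the time")
```

A 4-game test recommends the right bot only about 65% of the time here, while the 1000-game test essentially never gets it wrong, regardless of how many games the end user will eventually play.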