Engine Tournament

abcd_z · Post by **abcd_z** » Thu May 17, 2018 6:54 pm

2. Leela Zero 0.13 128x10 6/16

Why are you using an old model and an old version of Leela Zero? The weights have been 192x15 now for over a month, and the latest version of Leela is 0.15.

as0770 · Post by **as0770** » Sat May 19, 2018 1:33 pm

abcd_z wrote:
2. Leela Zero 0.13 128x10 6/16
Why are you using an old model and an old version of Leela Zero? The weights have been 192x15 now for over a month, and the latest version of Leela is 0.15.

I planned to use the best network of each size. Furthermore, the Leela Zero team is now using ELF games to train the network, so I am debating whether it makes sense to test "best" networks that are weaker than the ELF network.

abcd_z · Post by **abcd_z** » Sun May 20, 2018 9:37 pm

as0770 wrote:I planned to use the best network of each size.

Okay, but... why? Doing it that way misrepresents Leela Zero's actual, current strength.

as0770 wrote:Furthermore, the Leela Zero team is now using ELF games to train the network, so I am debating whether it makes sense to test "best" networks that are weaker than the ELF network.

Again, doing otherwise misrepresents Leela Zero's strength. Doing it the way you're currently doing makes it seem like LZ's current strength is weaker than it actually is.

ELF has a certain level of strength, and Leela Zero has a different level of strength, and that gap will close over time until LZ reaches ELF's level of strength (and presumably exceeds it). Just because another network is stronger than LZ doesn't mean you should ignore the updates that close the gap. After all, isn't the whole point of this tournament to compare the strength of different go programs?

as0770 wrote: Maybe I'll repeat the tournament with a komi of 7.5 but honestly I think a bot should be able to play also with a komi of 6.5.

There are two problems with this. First, deep neural networks need to be trained from scratch in order to play at different komis. There's a paper that talks about an approach that would let go programs play at different komis, and I think Golaxy uses that approach, but AFAIK none of the programs you have use that. They would all need their architecture rewritten to accommodate that.

Second, why are you using 6.5 komi? The easiest type of scoring for a go AI is area scoring, and under area scoring the margin before komi is almost always odd (a 19x19 board has 361 points), so a komi of 6.5 gives the same winners as 5.5. That is why rulesets that use area scoring generally have a komi of 5.5 or 7.5.
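A quick parity check, sketched in Python. It assumes a 19x19 board with no seki, so every point is counted for one side or the other; the function names are made up for this example and are not taken from any engine.

```python
BOARD_POINTS = 19 * 19  # 361 points, an odd number


def area_margin(black_area: int) -> int:
    """Black's margin before komi, assuming every point belongs to one side."""
    white_area = BOARD_POINTS - black_area
    return black_area - white_area


def winner(black_area: int, komi: float) -> str:
    """Return "B" if Black's margin beats the komi, otherwise "W"."""
    return "B" if area_margin(black_area) > komi else "W"


splits = range(BOARD_POINTS + 1)

# Under the no-seki assumption every possible split gives an odd margin.
all_odd = all(area_margin(b) % 2 == 1 for b in splits)

# Komi 5.5 and 6.5 never disagree about the winner.
same_55_65 = all(winner(b, 5.5) == winner(b, 6.5) for b in splits)

# Komi 6.5 and 7.5 disagree only when Black is ahead by exactly 7.
differs_65_75 = [area_margin(b) for b in splits if winner(b, 6.5) != winner(b, 7.5)]

print(all_odd, same_55_65, differs_65_75)
```

Under that assumption, moving the komi from 5.5 to 6.5 changes no results at all, while moving it to 7.5 flips the games Black would have won by 7.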

as0770 · Post by **as0770** » Sun May 20, 2018 10:16 pm

abcd_z wrote:
as0770 wrote:I planned to use the best network of each size.
Okay, but... why? Doing it that way misrepresents Leela Zero's actual, current strength.
as0770 wrote:Furthermore, the Leela Zero team is now using ELF games to train the network, so I am debating whether it makes sense to test "best" networks that are weaker than the ELF network.
Again, doing otherwise misrepresents Leela Zero's strength. ELF has a certain level of strength, and Leela Zero has a different level of strength, and that gap will close over time until LZ reaches ELF's level of strength (and presumably exceeds it). Doing it the way you're currently doing makes it seem like LZ's current strength is weaker than it actually is.

Just because another network is stronger than LZ doesn't mean you should ignore the updates that close the gap. After all, isn't the whole point of this tournament to compare the strength of different go programs?

I won't test every network, so I have to make a cut somewhere. When ELF came up Leela won 7% against it. Ten networks later it wins 10%. I think this says much more about the progress than a 16-game tournament.

abcd_z · Post by **abcd_z** » Sun May 20, 2018 11:14 pm

as0770 wrote:I won't test every network, so I have to make a cut somewhere.

That's reasonable, but if you're going to do a tournament that tests the strength of Leela Zero it should include, at minimum, the most recent LZ network. Cut out other LZ networks if you have to.

as0770 wrote:When ELF came up Leela won 7% against it. Ten networks later it wins 10%. I think this says much more about the progress than a 16-game tournament.

I have no idea what you mean by this.

as0770 · Post by **as0770** » Mon May 21, 2018 7:49 am

abcd_z wrote:
as0770 wrote:I won't test every network, so I have to make a cut somewhere.
That's reasonable, but if you're going to do a tournament that tests the strength of Leela Zero it should include, at minimum, the most recent LZ network. Cut out other LZ networks if you have to.

The other networks have already been played, so I don't save time or computing resources by cutting them out.

But you are right, it has been a long time since the last test, so I am currently running a new one.

Because of the training with ELF games the current 192x15 network will likely make progress for a while.

abcd_z wrote:
as0770 wrote:When ELF came up Leela won 7% against it. Ten networks later it wins 10%. I think this says much more about the progress than a 16-game tournament.
I have no idea what you mean by this.

When ELF was published, Leela's best network won 7% against the ELF network. Yesterday, 11 days later, Leela's best network won 10% against the same ELF network.
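To put those two percentages on a common scale, here is a rough Python sketch that converts a win rate into a rating gap. It assumes the standard logistic Elo model, which is chosen here purely for illustration, and the function name is made up.

```python
import math


def elo_gap(win_rate: float) -> float:
    """Rating difference implied by a win rate under the logistic Elo model."""
    if not 0.0 < win_rate < 1.0:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / win_rate - 1.0)


gap_then = elo_gap(0.07)  # 7% when ELF was published
gap_now = elo_gap(0.10)   # 10% eleven days later

print(f"then: {gap_then:.0f} Elo, now: {gap_now:.0f} Elo")
```

Under that model the gap to ELF shrinks from roughly 450 Elo to roughly 380 Elo. The sketch says nothing about how many games stand behind each percentage, so the size of the improvement is only approximate.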

abcd_z · Post by **abcd_z** » Mon May 21, 2018 1:08 pm

as0770 wrote:The other networks have already been played, so I don't save time or computing resources by cutting them out.

Wait, are you reusing the results between tournaments?!

as0770 wrote:But you are right, it has been a long time since the last test, so I am currently running a new one.

as0770 wrote:Because of the training with ELF games the current 192x15 network will likely make progress for a while.

Oh my god, you are reusing the results. That's why you don't want to use the results from a constantly-changing network.

Look, if you're running a tournament between game engines, you can't just reuse previous results. You have to use all-new matches, or the results aren't valid.

as0770 · Post by **as0770** » Mon May 21, 2018 8:58 pm

abcd_z wrote:You have to use all-new matches, or the results aren't valid.

It is not a rating list but a tournament. The results are exactly as valid as intended. If it doesn't match your expectations, you will have to look for other sources or run your own tournament. But be aware: one run will take 20 days, 24/7, if you play all-new matches.

abcd_z · Post by **abcd_z** » Tue May 22, 2018 3:20 am

as0770 wrote:It is not a rating list but a tournament.

Every tournament I have ever heard of uses the results of current matches. I have never heard of a tournament that reused results from previous tournaments.

To describe what you are doing as an engine tournament is very misleading, because it implies you're using new results for each new round of testing.

as0770 · Post by **as0770** » Tue May 22, 2018 6:04 am

abcd_z wrote:
as0770 wrote:It is not a rating list but a tournament.
Every tournament I have ever heard of uses the results of current matches. I have never heard of a tournament that reused results from previous tournaments.

Now you've heard of one.

abcd_z wrote:To describe what you are doing as an engine tournament is very misleading, because it implies you're using new results for each new round of testing.

I am sorry that you feel misled. What should I call it in the future to avoid misleading anyone?

jokkebk · Post by **jokkebk** » Tue May 22, 2018 9:58 am

abcd_z wrote:Every tournament I have ever heard of uses the results of current matches. I have never heard of a tournament that reused results from previous tournaments.
To describe what you are doing as an engine tournament is very misleading, because it implies you're using new results for each new round of testing.

Although normal standards of politeness don't apply on internet forums, I would give as0770 some credit (quite a significant amount, even) for running this "tournament" before going into the critique. I have found this thread very interesting to follow, and it has been a joy to check out the newest results once they become available, even though that doesn't happen every day. Running this kind of project as a voluntary effort is a huge undertaking, so that is completely understandable. So thanks to as0770 from here!

Now it is true that I also had the impression that "tournament" was somehow plural (like "engine tournaments", with each post representing one tournament), i.e. a new one is done every once in a while. So there is ambiguity in the term. Maybe the updates could say "here are updated results of the continuing tournament", or something to that effect.

However, despite abcd_z's vehement reaction to this, I would argue that if the engines are the exact same version (like SomeBot 1.4.4 or OtherEngine 10), arranging a 10-game match between them every time just reduces variance after a while and is not likely to uncover any deeper truths, so I would not put much effort into having GNU Go play against every other bot all the time. For Leela Zero and other actively progressing engines, however, there might be a need to re-run the newest version against the opposition every once in a while.

Of course, that raises a question: if BotX won 2/10 against LZ #120, and then LZ #142 beats it 10-0, is the result 10-0 or 18-2 for Leela vs. BotX? But if there is only one match per opponent, then you can just overwrite the previous results.

Naturally, a tournament is not an Elo comparison: where only a single game is played, the underlying quantity is a winning probability but the result is binary. In a league, though, you could argue that the results are likely to even each other out.

One possible idea: maybe build a matrix of engines and accumulate games slowly into it, with more games allotted to the pairings where the results are not as clear? With one row/column per engine version, this would slowly build up into quite a good comparison. And for GNU Go vs. LZ ELF you don't need a 10-game match; one or two games would be enough to see that yes, indeed, LZ ELF mostly wins.
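A rough Python sketch of that matrix idea. Everything in it is made up for illustration: the class name, the "least settled pairing" heuristic, and the results recorded in the usage lines, which are invented and not taken from the tournament.

```python
from collections import defaultdict
from itertools import combinations
import math


class CrossTable:
    """Accumulates head-to-head results between fixed engine versions."""

    def __init__(self, engines):
        self.engines = list(engines)
        # wins[(a, b)] = number of games a has won against b
        self.wins = defaultdict(int)

    def record(self, winner, loser):
        self.wins[(winner, loser)] += 1

    def games(self, a, b):
        return self.wins[(a, b)] + self.wins[(b, a)]

    def uncertainty(self, a, b):
        """Standard error of a's win rate against b, with one phantom win per side.

        The phantom games are an assumption of this sketch: they keep
        unplayed pairings at maximum uncertainty and avoid dividing by zero.
        """
        n = self.games(a, b) + 2
        p = (self.wins[(a, b)] + 1) / n
        return math.sqrt(p * (1 - p) / n)

    def next_pairing(self):
        """Return the pairing whose result is least settled so far."""
        return max(combinations(self.engines, 2),
                   key=lambda pair: self.uncertainty(*pair))


# Invented results, for illustration only.
table = CrossTable(["GNU Go", "LZ ELF", "LZ 192x15"])
for _ in range(2):
    table.record("LZ ELF", "GNU Go")
    table.record("LZ 192x15", "GNU Go")
table.record("LZ ELF", "LZ 192x15")
table.record("LZ 192x15", "LZ ELF")

print(table.next_pairing())
```

With these invented results the two one-sided pairings against GNU Go are already fairly settled, so the next game goes to the close LZ ELF vs. LZ 192x15 pairing. A real schedule would probably also want a cap on games per pairing.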

as0770 · Post by **as0770** » Tue May 22, 2018 12:13 pm

jokkebk wrote:
abcd_z wrote:Every tournament I have ever heard of uses the results of current matches. I have never heard of a tournament that reused results from previous tournaments.
To describe what you are doing as an engine tournament is very misleading, because it implies you're using new results for each new round of testing.
Although normal standards of politeness don't apply on internet forums, I would give as0770 some credit (quite a significant amount, even) for running this "tournament" before going into the critique. I have found this thread very interesting to follow, and it has been a joy to check out the newest results once they become available, even though that doesn't happen every day. Running this kind of project as a voluntary effort is a huge undertaking, so that is completely understandable. So thanks to as0770 from here!

Thanks for your kind words. I ran this kind of tournament for many years with chess engines. When AlphaGo came up I became very interested in Go and Go engines. I know, and I have stated several times, that a tournament of ~20 games for each AI won't give a statistically significant result. Besides that, the strength of a Go engine depends a lot on the hardware. Some use a GPU, some don't. The same tournament on a better graphics card will lead to a totally different result. So my intention is not to find the truth about Go engines, simply because there is no real truth. My intention is to give an overview of the strength of GTP engines of all levels. And I want to do that without running tournaments 24/7.
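To give a feel for how little ~20 games pin down, here is a small Python sketch of a 95% Wilson score interval for a win rate. The interval method is chosen here for illustration, and the 12-of-20 score is a made-up example, not a result from this tournament.

```python
import math


def wilson_interval(wins: int, games: int, z: float = 1.96):
    """Wilson score interval for a win probability (z = 1.96 is about 95%)."""
    if games <= 0 or not 0 <= wins <= games:
        raise ValueError("need games > 0 and 0 <= wins <= games")
    p = wins / games
    denom = 1 + z * z / games
    centre = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return centre - half, centre + half


low, high = wilson_interval(12, 20)  # hypothetical score of 12 wins in 20 games
print(f"12/20: win rate plausibly between {low:.2f} and {high:.2f}")
```

For that hypothetical score the interval runs from roughly 0.39 to roughly 0.78, so the same 20 games are compatible with either engine being the stronger one.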

At the beginning of each post I mentioned the "updates" of the tournament. I wasn't aware this could be misunderstood; I'll try to state that more clearly in the future.

jokkebk wrote:I also had the impression that "tournament" was somehow plural (like "engine tournaments", with each post representing one tournament)

The thread title is "Engine tournament" (singular)

jokkebk wrote:For Leela Zero and other actively progressing engines, however, there might be a need to re-run the newest version against the opposition every once in a while.

New versions of an engine or new networks do, of course, play new games. I updated the Leela version quite a lot in the past. Now it is somewhere between ZEN and ELF, so it doesn't make much sense to test every new network.

abcd_z · Post by **abcd_z** » Tue May 22, 2018 9:00 pm

as0770 wrote:I am sorry that you feel misled. What should I call it in the future to avoid misleading anyone?

I would edit your very first post and include something like the following:

Because the strength of go programs doesn't really change over time, and because it takes so long to run the matches, I reuse the results from previous tournaments. Each program runs once against all the current competitors when it's introduced, and those results are used for each subsequent tournament.

This also means that I won't be testing the current weights of Leela Zero, which change constantly. Instead, I'm going to use the last LZ model of each size.

Feel free to copy and paste those paragraphs if you want.

Jokkebk makes a good point: go program strengths don't usually change much over time. LZ is the exception to this rule. With that in mind, I don't have any problems with you doing the tournament the way you've been doing it, as long as it's obvious to the readers that it's not exactly a standard tournament.

as0770 · Post by **as0770** » Tue May 22, 2018 9:19 pm

abcd_z wrote:
as0770 wrote:I am sorry that you feel misled. What should I call it in the future to avoid misleading anyone?
I would edit your very first post and include something like the following:
Because the strength of go programs doesn't really change over time, and because it takes so long to run the matches, I reuse the results from previous tournaments. Each program runs once against all the current competitors when it's introduced, and those results are used for each subsequent tournament.

This also means that I won't be testing the current weights of Leela Zero, which change constantly. Instead, I'm going to use the last LZ model of each size.
Feel free to copy and paste those paragraphs if you want.

Jokkebk makes a good point: go program strengths don't usually change much over time. LZ is the exception to this rule. With that in mind, I don't have any problems with you doing the tournament the way you've been doing it, as long as it's obvious to the readers that it's not exactly a standard tournament.

I didn't mix different weights or different versions of one engine. In the future I'll explain updated tournaments better. But I don't see a reason to explain why I don't test all versions and networks.

abcd_z · Post by **abcd_z** » Tue May 22, 2018 10:38 pm

as0770 wrote:I didn't mix different weights or different versions of one engine

I... what? I never said you did.

as0770 wrote:But I don't see a reason to explain why I don't test all versions and networks.

I didn't say that, either. I feel like you're reading something different from what I'm writing.

All I'm saying is that when you use the word "tournament", the natural assumption is that you're re-generating the matches each time. All you need to do to prevent this misunderstanding is to make a note at the beginning of your first post saying that you're reusing the results of each tournament. That's all.

EDIT: Wait, you mean that thing about Leela Zero? The reason I brought LZ up specifically was because, unlike all the other engines, LZ changes frequently, and it's easy for a reader to assume that the tests you're running on it are indicative of LZ's current strength.

Life In 19x19
