Why are you using an old model and an old version of Leela Zero? The weights have been 192x15 now for over a month, and the latest version of Leela is 0.15.2. Leela Zero 0.13 128x10 6/16
Engine Tournament
-
abcd_z
- Beginner
- Posts: 12
- Joined: Thu Apr 26, 2018 11:32 am
- Rank: 15k
- GD Posts: 0
- Has thanked: 5 times
Re: Engine Tournament
-
as0770
- Lives with ko
- Posts: 180
- Joined: Sun Jun 26, 2016 8:07 am
- Rank: Beginner
- GD Posts: 0
- Has thanked: 15 times
- Been thanked: 23 times
Re: Engine Tournament
abcd_z wrote: Why are you using an old model and an old version of Leela Zero? The weights have been 192x15 now for over a month, and the latest version of Leela is 0.15.2. Leela Zero 0.13 128x10 6/16
I planned to use the best network of each size. Furthermore, the Leela Zero team is now using ELF games to train the network, so I am wondering whether it makes sense to test "best" networks that are weaker than the ELF network.
-
abcd_z
Re: Engine Tournament
as0770 wrote: I planned to use the best network of each size.
Okay, but... why? Doing it that way misrepresents Leela Zero's actual, current strength.
as0770 wrote: Furthermore, the Leela Zero team is now using ELF games to train the network, so I am wondering whether it makes sense to test "best" networks that are weaker than the ELF network.
Again, doing otherwise misrepresents Leela Zero's strength. Doing it the way you're currently doing it makes it seem like LZ's current strength is weaker than it actually is.
ELF has a certain level of strength, and Leela Zero has a different level of strength, and that gap will close over time until LZ reaches ELF's level of strength (and presumably exceeds it). Just because another network is stronger than LZ doesn't mean you should ignore the updates that close the gap. After all, isn't the whole point of this tournament to compare the strength of different go programs?
as0770 wrote: Maybe I'll repeat the tournament with a komi of 7.5, but honestly I think a bot should also be able to play with a komi of 6.5.
There are two problems with this. First, deep neural networks need to be trained from scratch in order to play at different komis. There's a paper that describes an approach that would let go programs play at different komis, and I think Golaxy uses it, but AFAIK none of the programs you have do. They would all need their architecture rewritten to accommodate it.
Second, why are you using 6.5 komi? The easiest type of scoring for a go AI is area scoring, and under area scoring the margin before komi is almost always odd, so rulesets that use area scoring generally have a komi of 5.5 or 7.5.
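A quick parity check of the area-scoring argument, as a sketch rather than a statement about any particular ruleset. It assumes a 19x19 board on which every intersection ends up counted for exactly one side (no seki with shared neutral points, no handicap adjustments); `margin`, `black_wins`, and `differs_from` are illustrative names. Under those assumptions the margin before komi is 2 * black_area - 361, which is always odd, so a komi of 6.5 cannot pick a different winner than 5.5, while 7.5 can.

```python
BOARD_POINTS = 19 * 19  # 361 intersections on a standard board (assumed size)


def margin(black_area: int) -> int:
    """Black's area minus White's area when every point belongs to one side."""
    white_area = BOARD_POINTS - black_area
    return black_area - white_area


def black_wins(black_area: int, komi: float) -> bool:
    """True if Black's margin exceeds the komi."""
    return margin(black_area) > komi


def differs_from(komi_a: float, komi_b: float) -> bool:
    """True if some final position has a different winner under the two komis."""
    return any(
        black_wins(b, komi_a) != black_wins(b, komi_b)
        for b in range(BOARD_POINTS + 1)
    )


# Every reachable margin is odd in this simplified model.
all_margins_odd = all(margin(b) % 2 == 1 for b in range(BOARD_POINTS + 1))

print(all_margins_odd)
print(differs_from(5.5, 6.5))
print(differs_from(6.5, 7.5))
```

Positions with seki can break the parity assumption, which is why the claim holds "almost always" rather than always.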
-
as0770
Re: Engine Tournament
as0770 wrote: I planned to use the best network of each size.
abcd_z wrote: Okay, but... why? Doing it that way misrepresents Leela Zero's actual, current strength.
as0770 wrote: Furthermore, the Leela Zero team is now using ELF games to train the network, so I am wondering whether it makes sense to test "best" networks that are weaker than the ELF network.
abcd_z wrote: Again, doing otherwise misrepresents Leela Zero's strength. ELF has a certain level of strength, and Leela Zero has a different level of strength, and that gap will close over time until LZ reaches ELF's level of strength (and presumably exceeds it). Doing it the way you're currently doing it makes it seem like LZ's current strength is weaker than it actually is. Just because another network is stronger than LZ doesn't mean you should ignore the updates that close the gap. After all, isn't the whole point of this tournament to compare the strength of different go programs?
I won't test every network, so I have to make a cut somewhere. When ELF came out, Leela won 7% against it. Ten networks later it wins 10%. I think this says much more about the progress than a 16-game tournament.
-
abcd_z
Re: Engine Tournament
as0770 wrote: I won't test every network, so I have to make a cut somewhere.
That's reasonable, but if you're going to run a tournament that tests the strength of Leela Zero, it should include, at minimum, the most recent LZ network. Cut out other LZ networks if you have to.
as0770 wrote: When ELF came out, Leela won 7% against it. Ten networks later it wins 10%. I think this says much more about the progress than a 16-game tournament.
I have no idea what you mean by this.
-
as0770
Re: Engine Tournament
as0770 wrote: I won't test every network, so I have to make a cut somewhere.
abcd_z wrote: That's reasonable, but if you're going to run a tournament that tests the strength of Leela Zero, it should include, at minimum, the most recent LZ network. Cut out other LZ networks if you have to.
The other networks have already played their games, so I don't save time or computing resources by cutting them out.
But you are right, it has been a long time since the last test, so I am currently running a new one.
Because of the training with ELF games, the current 192x15 network will likely keep making progress for a while.
as0770 wrote: When ELF came out, Leela won 7% against it. Ten networks later it wins 10%. I think this says much more about the progress than a 16-game tournament.
abcd_z wrote: I have no idea what you mean by this.
When ELF was published, Leela's best network won 7% of its games against the ELF network. Yesterday, 11 days later, Leela's best network won 10% against the same ELF network.
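To put the 7% and 10% figures on a common scale, one option is to convert each win rate into an Elo gap. This is only a sketch: it assumes the usual logistic Elo model, `elo_gap` is an illustrative helper, and since the thread does not say how many games lie behind each percentage, no error bars can be attached.

```python
import math


def elo_gap(win_rate: float) -> float:
    """Elo difference implied by a win rate under the logistic Elo model.

    Negative values mean the engine is rated below its opponent.
    """
    return -400.0 * math.log10(1.0 / win_rate - 1.0)


before = elo_gap(0.07)  # Leela's win rate against ELF when ELF was published
after = elo_gap(0.10)   # Leela's win rate against the same ELF network 11 days later

print(round(before), round(after), round(after - before))
```

Under that model, 7% corresponds to roughly 450 Elo behind ELF and 10% to roughly 380 behind, a gain on the order of 70 Elo.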
-
abcd_z
Re: Engine Tournament
as0770 wrote: The other networks have already played their games, so I don't save time or computing resources by cutting them out.
Wait, are you reusing the results between tournaments?!
as0770 wrote: But you are right, it has been a long time since the last test, so I am currently running a new one. Because of the training with ELF games, the current 192x15 network will likely keep making progress for a while.
Oh my god, you are reusing the results. That's why you don't want to use the results from a constantly changing network.
Look, if you're running a tournament between game engines, you can't just reuse previous results. You have to use all-new matches, or the results aren't valid.
-
as0770
Re: Engine Tournament
abcd_z wrote: You have to use all-new matches, or the results aren't valid.
It is not a rating list but a tournament. The results are exactly as valid as intended. If it doesn't meet your expectations, you will have to look to other sources or run your own tournament. But be aware: one run will take 20 days, 24/7, if you play all-new matches.
-
abcd_z
Re: Engine Tournament
as0770 wrote: It is not a rating list but a tournament.
Every tournament I have ever heard of uses the results of current matches. I have never heard of a tournament that reused results from previous tournaments.
To describe what you are doing as an engine tournament is very misleading, because it implies you're using new results for each new round of testing.
-
as0770
Re: Engine Tournament
as0770 wrote: It is not a rating list but a tournament.
abcd_z wrote: Every tournament I have ever heard of uses the results of current matches. I have never heard of a tournament that reused results from previous tournaments.
Now you've heard of one.
abcd_z wrote: To describe what you are doing as an engine tournament is very misleading, because it implies you're using new results for each new round of testing.
I am sorry that you feel misled. What should I call it in the future to avoid misleading anyone?
-
jokkebk
- Dies in gote
- Posts: 44
- Joined: Tue Feb 01, 2011 4:47 am
- Rank: EGF 1 kyu
- GD Posts: 0
- KGS: finity
- Has thanked: 2 times
- Been thanked: 14 times
Re: Engine Tournament
abcd_z wrote: Every tournament I have ever heard of uses the results of current matches. I have never heard of a tournament that reused results from previous tournaments. To describe what you are doing as an engine tournament is very misleading, because it implies you're using new results for each new round of testing.
Although normal standards of politeness don't apply on internet forums, I would give as0770 some credit (quite a significant amount, even) for running this "tournament" before getting into the critique. I have found this thread very interesting to follow, and it has been a joy to check out the newest results once they become available, even though that doesn't happen every day. Running this kind of project as a volunteer effort is a huge undertaking, so that is completely understandable. So thanks to as0770 from here!
Now it is true that I also had the impression that "tournament" was somehow plural (like "engine tournaments", with each post representing one tournament), i.e. that a new one is run every once in a while. So there is ambiguity in the term. Maybe the updates could say "here are updated results of the continuing tournament", or something to that effect.
However, despite abcd_z's vehement reaction to this, I would argue that if the engines are the exact same versions (like SomeBot 1.4.4 or OtherEngine 10), replaying a 10-game match between them every time only reduces variance and is not likely to uncover any deeper truths, so I would not put much effort into having GNU Go play against every other bot all the time. For Leela Zero and other actively progressing engines, however, there might be a need to re-run the newest version against the opposition every once in a while.
Of course that raises a question: if BotX won 2/10 against LZ #120, and then LZ #142 beats it 10-0, is the score 10-0 or 18-2 for Leela vs. BotX? But if there is only one match per opponent, then you can just overwrite the previous results.
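The two bookkeeping options in the BotX example can be written out explicitly. A minimal sketch, assuming results are stored per engine version from Leela's point of view; `MatchResult`, `pooled`, and `latest_only` are hypothetical names, and the 8-2 score is just BotX's 2/10 seen from the other side.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MatchResult:
    engine_version: str  # e.g. "LZ #120"
    wins: int            # wins for Leela against BotX
    losses: int          # losses for Leela against BotX


def pooled(results: list[MatchResult]) -> tuple[int, int]:
    """Sum every match, treating all versions as one engine."""
    return sum(r.wins for r in results), sum(r.losses for r in results)


def latest_only(results: list[MatchResult]) -> tuple[int, int]:
    """Keep only the most recent match, overwriting older versions."""
    last = results[-1]
    return last.wins, last.losses


history = [MatchResult("LZ #120", 8, 2), MatchResult("LZ #142", 10, 0)]
print(pooled(history))       # (18, 2)
print(latest_only(history))  # (10, 0)
```

Pooling mixes two engines of different strength into one score, while overwriting describes only the newest version; which one is appropriate depends on what the table is meant to show.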
Naturally a tournament is not an Elo comparison: where only a single game is played, the underlying quantity is a winning probability but the result is binary. In a league, though, you could argue that the results are likely to even each other out.
One possible idea: maybe build a matrix of engines and accumulate games slowly into it, with more games allotted to the pairings where the results are not as clear? With one row/column per engine version, this would slowly build up into quite a good comparison. And for GNU Go vs. LZ ELF you don't need a 10-game match; one or two games would be enough to see that yes, indeed, LZ ELF mostly wins.
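One way to sketch the "more games where the results are not as clear" idea is to score each pairing by how uncertain it still is which engine is stronger, and schedule the next game for the most uncertain one. This is an assumption-laden illustration, not a description of how the tournament is run: it assumes a uniform prior on each pairing's win rate, the function names are hypothetical, and the counts in the example matrix are made up.

```python
from __future__ import annotations

from math import comb


def prob_row_is_weaker(wins: int, losses: int) -> float:
    """P(true win rate < 0.5) for a Beta(wins + 1, losses + 1) posterior.

    For integer parameters this equals a binomial tail sum at x = 0.5.
    """
    n = wins + losses + 1
    return sum(comb(n, j) for j in range(wins + 1, n + 1)) / 2 ** n


def ambiguity(wins: int, losses: int) -> float:
    """0.0 when the stronger side is obvious, 0.5 when it is a coin flip."""
    p = prob_row_is_weaker(wins, losses)
    return min(p, 1.0 - p)


def next_pairing(matrix: dict[tuple[str, str], tuple[int, int]]) -> tuple[str, str]:
    """Pick the least clear pairing; break ties toward fewer games played."""
    return max(
        matrix,
        key=lambda pair: (ambiguity(*matrix[pair]), -sum(matrix[pair])),
    )


# (row engine, column engine) -> (wins, losses) for the row engine; made-up counts.
matrix = {
    ("GNU Go", "LZ ELF"): (0, 2),
    ("SomeBot", "OtherEngine"): (5, 5),
}
print(next_pairing(matrix))
```

With these numbers the GNU Go vs. LZ ELF pairing is already fairly clear after two games, so the next game goes to the evenly split pairing.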
-
as0770
Re: Engine Tournament
abcd_z wrote: Every tournament I have ever heard of uses the results of current matches. I have never heard of a tournament that reused results from previous tournaments. To describe what you are doing as an engine tournament is very misleading, because it implies you're using new results for each new round of testing.
jokkebk wrote: Although normal standards of politeness don't apply on internet forums, I would give as0770 some credit (quite a significant amount, even) for running this "tournament" before getting into the critique. I have found this thread very interesting to follow, and it has been a joy to check out the newest results once they become available, even though that doesn't happen every day. Running this kind of project as a volunteer effort is a huge undertaking, so that is completely understandable. So thanks to as0770 from here!
Thanks for your kind words. I ran this kind of tournament for many years with chess engines. When AlphaGo came out I became very interested in Go and Go engines. I know, and I have stated several times, that a tournament of ~20 games for each AI won't give a statistically significant result. Besides that, the strength of a Go engine depends a lot on the hardware. Some use the GPU, some don't. The same tournament on a better graphics card will lead to a totally different result. So my intention is not to find the truth about Go engines, simply because there is no real truth. My intention is to give an overview of the strength of GTP engines of all levels. And I want to do that without running tournaments 24/7.
At the beginning of each post I mentioned the "updates" of the tournament. I wasn't aware this could be misunderstood; I'll try to state it more clearly in the future.
jokkebk wrote: I also had the impression that "tournament" was somehow plural (like "engine tournaments", with each post representing one tournament)
The thread title is "Engine tournament" (singular).
jokkebk wrote: For Leela Zero and other actively progressing engines, however, there might be a need to re-run the newest version against the opposition every once in a while.
New versions of an engine or network of course play new games. I updated the Leela version quite often in the past. Now it is somewhere between ZEN and ELF, so it doesn't make much sense to test every new network.
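To illustrate why ~20 games per engine cannot settle much, here is a sketch of a 95% Wilson score interval for a win rate. The 12-8 score is a made-up example, `wilson_interval` is an illustrative helper, and games are assumed to be independent with a fixed win probability.

```python
import math


def wilson_interval(wins: int, games: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a win rate."""
    p = wins / games
    denom = 1.0 + z * z / games
    centre = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return centre - half, centre + half


low, high = wilson_interval(12, 20)  # hypothetical 12-8 result over 20 games
print(f"{low:.2f} to {high:.2f}")
```

For 12 wins in 20 games the interval runs from about 0.39 to about 0.78, so the result is still compatible with the apparent winner actually being the weaker engine.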
-
abcd_z
Re: Engine Tournament
as0770 wrote: I am sorry that you feel misled. What should I call it in the future to avoid misleading anyone?
I would edit your very first post and include something like the following:
"Because the strength of go programs doesn't really change over time, and because it takes so long to run the matches, I reuse the results from previous tournaments. Each program plays once against all the current competitors when it's introduced, and those results are used for each subsequent tournament.
This also means that I won't be testing the current weights of Leela Zero, which change constantly. Instead, I'm going to use the last LZ model of each size."
Feel free to copy and paste those paragraphs if you want.
Jokkebk makes a good point: go program strengths don't usually change much over time. LZ is the exception to this rule. With that in mind, I don't have any problems with you doing the tournament the way you've been doing it, as long as it's obvious to the readers that it's not exactly a standard tournament.
-
as0770
Re: Engine Tournament
as0770 wrote: I am sorry that you feel misled. What should I call it in the future to avoid misleading anyone?
abcd_z wrote: I would edit your very first post and include something like the following: "Because the strength of go programs doesn't really change over time, and because it takes so long to run the matches, I reuse the results from previous tournaments. Each program plays once against all the current competitors when it's introduced, and those results are used for each subsequent tournament. This also means that I won't be testing the current weights of Leela Zero, which change constantly. Instead, I'm going to use the last LZ model of each size." Feel free to copy and paste those paragraphs if you want. Jokkebk makes a good point: go program strengths don't usually change much over time. LZ is the exception to this rule. With that in mind, I don't have any problems with you doing the tournament the way you've been doing it, as long as it's obvious to the readers that it's not exactly a standard tournament.
I didn't mix different weights or different versions of one engine. In the future I'll explain updated tournaments better. But I don't see a reason to explain why I don't test all versions and networks.
-
abcd_z
Re: Engine Tournament
as0770 wrote: I didn't mix different weights or different versions of one engine
I... what? I never said you did.
as0770 wrote: But I don't see a reason to explain why I don't test all versions and networks.
I didn't say that, either. I feel like you're reading something different from what I'm writing.
All I'm saying is that when you use the word "tournament", the natural assumption is that you're re-generating the matches each time. All you need to do to prevent this misunderstanding is to make a note at the beginning of your first post saying that you're reusing the results of each tournament. That's all.
EDIT: Wait, you mean that thing about Leela Zero? The reason I brought LZ up specifically was because, unlike all the other engines, LZ changes frequently, and it's easy for a reader to assume that the tests you're running on it are indicative of LZ's current strength.