Life In 19x19: http://lifein19x19.com/

Home-made Elo ratings for some engines: http://lifein19x19.com/viewtopic.php?f=18&t=16086
Page 1 of 2
Author: | xela [ Tue Sep 18, 2018 6:32 pm ] |
Post subject: | Home-made Elo ratings for some engines |
Just how far ahead of us puny humans is Leela Zero by now? On a home PC, is it at human pro strength, or is it already superhuman? How much difference does it make whether or not you use a GPU? Inspired by the excellent Engine Tournament, I'm trying to calculate some Elo ratings for a few engines. (I know that CGOS has already done this, but it's very hard to get information about exactly what hardware, software and configuration was used for those engines.)

The good news about using Elo, compared with a league tournament format: it doesn't just tell you "this engine is stronger than that one", it also measures how big the difference is. Using BayesElo, you can even get error bounds, so you can see roughly how accurate the ratings are. The bad news: you need a much larger number of games to get accurate ratings. I won't be able to run 50 engines at an hour per player per game and play the 1000 or so games you'd need for high-quality data.

What I've done so far: play a bunch of games at 1 minute absolute time, for a quick check that I've configured everything correctly (actually I caught a few mistakes this way), and to get a ballpark estimate of the ratings. Then more games at 5 minutes, for something that I hope is slightly more accurate. Soon I plan to start a series at 20 minutes per player per game, so we have some data at roughly human-like time controls. I'll have to limit this series to about 15 or 20 engines, otherwise it will take years to generate enough data. But first there are a few more engines and configurations I want to try out.

My system:

Engines tested so far:

Results so far at 1 minute time limit:

Results so far at 5 minute time limit:

Edited 24th September: crosstables attached in CSV format, with a count of how many games each engine has played against each opponent.

The "Elo" column is the rating. Elo- and Elo+ are error bounds (to be pedantic, they're Bayesian credible intervals, not to be confused with frequentist confidence intervals). So for example, in the 1-minute ratings with LZ_ELF at Elo=3574, Elo- = 175, Elo+ = 176, this means that BayesElo thinks there's a 95% chance of the true rating being between 3399 and 3750.

In the 5-minute ratings, you'll see some negative numbers near the top and bottom. I think this is a symptom of a skewed probability distribution: BayesElo can tell that LZ_ELF is "a lot stronger" than the other engines, but there isn't enough data to measure exactly how much stronger. This time last week I was seeing a lot more minus signs, and they're gradually going away as I add more data.

I've offset the ratings to put gnugo at 1500 each time, on the principle that gnugo is theoretically around 5K. This should mean that the BayesElo ratings are more or less in line with EGF ratings (plus or minus a couple of hundred rating points), and also not too far away from RĂ©mi Coulom's ratings for pros. Looking at fuego and pachi, this seems to be in the right ballpark.

So we have some weak evidence that the strongest CPU-only engines on a home PC can play at around top amateur or low pro level, and the good GPU-accelerated engines are already superhuman, at least in 5-minute games. It will take a couple of months for me to get similar data for 20-minute games. I'll update here some time (don't hold your breath, I'm very good at procrastination!)
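[Editor's note: for anyone new to Elo, here is a minimal Python sketch of the standard logistic Elo formula for turning a rating gap into an expected score. The numbers are made-up examples, and it's only a rough guide here, since BayesElo fits a closely related but not identical model.]

Code:
def expected_score(rating_a, rating_b):
    """Standard logistic Elo model: expected score (win probability) for A against B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

# Made-up example: a 200-point rating gap gives the stronger engine
# roughly a 76% expected score.
print(round(expected_score(1700, 1500), 2))  # ~0.76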
Author: | EdLee [ Wed Sep 19, 2018 1:59 am ] |
Post subject: | |
xela, Thanks. |
Author: | Uberdude [ Wed Sep 19, 2018 2:20 am ] |
Post subject: | Re: Home-made Elo ratings for some engines |
Very nice, thanks xela. Could you please add in LZ #157, the best 15-block network? I use that a lot, as I think it gives better performance at shortish time limits than the deeper networks (the superior judgement of a 40-block network doesn't help with reading a ladder if it only has 200* playouts, but 800 playouts of a less-skilled network lets the ladder be read out). * exact numbers not guaranteed, and with more training a deeper network may be able to read ladders with few playouts, but we aren't there yet afaik. |
Author: | xela [ Thu Sep 20, 2018 2:52 pm ] |
Post subject: | Re: Home-made Elo ratings for some engines |
Uberdude wrote: Could you please add in LZ #157, the best 15 block network? Good suggestion. I chose the 15-block network LZ 141 because it's the same one used for the engine tournament. It'll be interesting to see how much stronger 157 is. I'll include it in the next update. |
Author: | EdLee [ Thu Sep 20, 2018 8:37 pm ] |
Post subject: | |
Hi xela, there's an engine (Taiwan-based?) on IGS with the username leelazero (one word, all lowercase). Its info includes "GTX970. zero.sjeng.org". Any possibility to extrapolate or guesstimate its Elo range? (It has the (small avalanche) ladder problem that people exploit.) Thanks. This page has an Elo graph, showing roughly 12,700? |
Author: | mb76 [ Fri Sep 21, 2018 2:50 pm ] |
Post subject: | Re: Home-made Elo ratings for some engines |
Could you please add in LZ zediir, the network described as "Supervised. From the TYGEM dataset"? https://github.com/gcp/leela-zero/issues/884 |
Author: | xela [ Mon Sep 24, 2018 3:21 am ] |
Post subject: | |
EdLee wrote: Hi xela, there's an engine (Taiwan-based?) on IGS with the username leelazero (one word, all lowercase). Its info includes "GTX970. zero.sjeng.org". Any possibility to extrapolate or guesstimate its Elo range? Sorry, that's not enough information to work with. At https://zero.sjeng.org/ there is a list of 178 different networks that leelazero can use. If you can find out which network this engine uses, then we can make some guesses. |
Author: | xela [ Mon Sep 24, 2018 3:21 am ] |
Post subject: | Re: Home-made Elo ratings for some engines |
mb76 wrote: Could you please add in LZ zediir, the network described as "Supervised. From the TYGEM dataset"? https://github.com/gcp/leela-zero/issues/884 Will do. I'll have some results to show in a couple of days. Thanks for the suggestion. |
Author: | xela [ Tue Sep 25, 2018 8:00 pm ] |
Post subject: | Re: Home-made Elo ratings for some engines |
New this week:
Results so far at 1 minute time limit, based on 986 games with 59 engines:

Results so far at 5 minute time limit, based on 1310 games with 50 engines:

Next I want to try LZ with the Phoenix weights. After that, I might start the 20-minute series.
Author: | xela [ Wed Oct 03, 2018 6:06 am ] |
Post subject: | Re: Home-made Elo ratings for some engines |
New this week:
Results so far at 1 minute time limit, based on 1326 games with 61 engines: (edited 9th October: subtract 372 from all ratings to put gnugo at 1500, consistent with my other rating lists)

Results so far at 5 minute time limit, based on 1396 games with 52 engines:

This week I'm going to start playing some matches with 20 minutes per player per game. This will be with a smaller collection of engines, so that we'll get some results this year. |
Author: | Uberdude [ Wed Oct 03, 2018 7:11 am ] |
Post subject: | Re: Home-made Elo ratings for some engines |
157 hero! |
Author: | pnprog [ Sat Oct 06, 2018 9:20 pm ] |
Post subject: | Re: Home-made Elo ratings for some engines |
Hi! Very interested in this thread. xela wrote:
Like, imagine I run the tournament on a simple computer: 1000 MHz CPU, one thread, no GPU; then it's like comparing:
For the dream_ponder entry, what you would like to have is:
Author: | xela [ Sun Oct 07, 2018 4:16 am ] |
Post subject: | Re: Home-made Elo ratings for some engines |
pnprog wrote: But when DreamGo is playing with pondering on, my understanding is that:
Correct. In fact, the difference in Elo ratings between dream and dream_ponder is actually smaller than I expected. pnprog wrote: For the dream_ponder entry, what you would like to have is:
Yes, that would be a better way to test it. The fact is that I intended to run all engines without pondering, to avoid this type of complication. The inclusion of dream_ponder was an accident! I decided to leave it in the ratings list, rather than deleting it, because it's an interesting accident :-) |
Author: | pnprog [ Mon Oct 08, 2018 5:46 am ] |
Post subject: | Re: Home-made Elo ratings for some engines |
So now, reading the EGF rating system page on Sensei's Library, they indicate/define one stone of strength as equivalent to 100 Elo, and that would make LeelaZero around 27 stones stronger than Gnugo. Something like a 21-dan amateur player! More seriously, if we fix Gnugo at a certain level (like 1500 Elo / 5k), what other data do we need to make our Elo scale comparable to the EGF rating? Quote: The inclusion of dream_ponder was an accident! I decided to leave it in the ratings list, rather than deleting it, because it's an interesting accident I am not really knowledgeable about those Elo ratings, but if we introduce a bot with skewed performance/rating, won't it affect the rating of all the bots on the scale? Like decrease the rating of the bots weaker than dream_ponder, and increase the rating of the bots stronger than dream_ponder?
Author: | xela [ Tue Oct 09, 2018 4:39 am ] |
Post subject: | Re: Home-made Elo ratings for some engines |
pnprog wrote: So now, reading the EGF rating system page on Sensei's Library, they indicate/define one stone of strength as equivalent to 100 Elo, and that would make LeelaZero around 27 stones stronger than Gnugo. Something like a 21-dan amateur player! :bow: More seriously, if we fix Gnugo at a certain level (like 1500 Elo / 5k), what other data do we need to make our Elo scale comparable to the EGF rating?

I think BayesElo is similar to EGF ratings, but not exactly the same. For a good comparison, we'd need to run the EGF rating algorithm on my engine-vs-engine games, or else collect some EGF tournament results and run BayesElo on them to compare against the official EGF ratings. That's a whole other research project that I'm not going to start this year :-) I think "LeelaZero around 27 stones stronger than Gnugo" is about right, but really it could be anywhere between 20 and 35 stones.

pnprog wrote: I am not really knowledgeable about those Elo ratings, but if we introduce a bot with skewed performance/rating, won't it affect the rating of all the bots on the scale? Like decrease the rating of the bots weaker than dream_ponder, and increase the rating of the bots stronger than dream_ponder?

No, I don't think it matters. What does a "skewed performance/rating" mean anyway? That the bot was stronger than I thought it would be? But the BayesElo software doesn't read my mind, it only looks at the game results. Dream_ponder beats weaker bots and loses against stronger ones, the same behaviour as any other bot. I don't think it makes a difference to the ratings whether it gets those results by playing good moves or by sabotaging the opponents (stealing memory or CPU cycles). In any case, I've been anchoring the ratings to put GnuGo at 1500 every time, so this should help to keep things stable. |
Author: | xela [ Tue Oct 09, 2018 5:24 am ] |
Post subject: | Re: Home-made Elo ratings for some engines |
Just for fun, let's do some dodgy mathematical analysis of the 1-minute and 5-minute results, to see if we can extrapolate what will happen in 20-minute games. (I'll post some actual 20-minute results tomorrow. I did the analysis last week, just didn't get around to posting until today.)

We already know that small networks beat bigger networks in fast games, but we might expect the bigger networks to catch up in slower games. Let's pretend that each engine/network combination has a "baseline" strength (how well it can play on minimal thinking time) plus an ability to get stronger with more time. There will be diminishing returns: you'd expect a big difference between 1 minute and 10 minutes, but not much difference between 60 minutes and 69 minutes. But the strength is theoretically unbounded (Monte Carlo search converges to the best move given unlimited time and unlimited memory). So a halfway-reasonable model might be:

Elo rating = b + alpha * log(t)

where b is the baseline strength, t is the thinking time in minutes per player per game (absolute time, because I don't want to get into complications around byo-yomi), and alpha represents how well the engine/network can make use of extra thinking time.

At a 1-minute time limit, t=1, log(t)=0, so b is just the 1-minute Elo rating. Then we can calculate alpha as (5-minute Elo minus 1-minute Elo)/log(4). (For me, log means natural log, because I did too much calculus as a teenager, so log(4) is about 1.386.) And then the expected 20-minute rating from this model would be b + alpha * log(19), or b + 2.944*alpha.

If we have gnugo at 1500 on both rating scales, then it gets alpha=0, meaning that gnugo gets no stronger when it thinks for a long time. Worse, a few of the weaker engines get negative numbers for alpha. I don't believe that, so I'm going to subtract 200 from all the 1-minute ratings, just to get some more reasonable alpha values. Finally, this projects pachi_nn to be about 3300 in 20-minute games, which isn't realistic (it's nowhere near pro strength), so I'm going to subtract a few rating points from the results to put pachi_nn at 2400. Then a few lines of R programming gives these results:

So we can see, for example, that LZ_phoenix comes 17th in 1-minute games but 9th in 5-minute games, giving it a big alpha value (it's making great use of the extra thinking time), and we'd expect it to shoot up to 4th place in 20-minute games. On the other hand, LM_E8 (with a 128x10 network) did better at 1 minute than at 5 minutes, so its alpha is lower, and we'd expect it to rank even lower at 20 minutes. Then again, the alpha values for LZ 141 and 174 don't look quite right.

This is a pretty simplistic model, so I don't expect the results to be at all accurate (we can tell it's not right by the way gnugo has dropped 500 points in the output), but it's interesting food for thought. |
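[Editor's note: a minimal Python sketch of the projection described in the post above, for anyone who wants to plug in their own numbers. xela's actual calculation was done in R; the example ratings here are made-up placeholders, and the log(4) / log(19) factors follow the post's arithmetic rather than a strict reading of the formula, a point moha picks up in the next reply. The extra adjustments in the post (subtracting 200 from the 1-minute ratings, re-anchoring pachi_nn at 2400) are left out.]

Code:
import math

def project_20min(elo_1min, elo_5min):
    """Project a 20-minute rating from the 1- and 5-minute ratings, as in the post:
    b = 1-minute rating,
    alpha = (5-minute Elo - 1-minute Elo) / log(4),
    projected 20-minute Elo = b + alpha * log(19), i.e. b + 2.944 * alpha."""
    b = elo_1min
    alpha = (elo_5min - elo_1min) / math.log(4)
    return b + alpha * math.log(19)

# Made-up example: an engine rated 2800 at 1 minute and 3100 at 5 minutes
# projects to roughly 3437 at 20 minutes.
print(round(project_20min(2800, 3100)))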
Author: | moha [ Wed Oct 10, 2018 4:23 am ] |
Post subject: | Re: Home-made Elo ratings for some engines |
xela wrote: This is a pretty simplistic model, so I don't expect the results to be at all accurate (we can tell it's not right by the way gnugo has dropped 500 points in the output), but it's interesting food for thought.

The basic idea usually is that each doubling of thinking time gives a roughly similar strength increase (of course this is not necessarily a reasonable idea for all engines). Your formula could capture this if you didn't subtract 1 from 5 and 20 before taking the log. But in these rating pools one engine's result depends on the others' performances as well, which is quite a problem for this approach. Maybe you could anchor at gnugo=1500 for 1 minute, and anchor the other time controls at a guessed gnugo improvement factor / rating.

If you expect your numbers to go up with more time, then you are basically comparing performance to 1-minute gnugo (how strong would I have to be to play this well in 1-minute games?), so going up into an otherwise "pro" number range is not surprising and does not necessarily mean pro strength. |
Author: | xela [ Wed Oct 10, 2018 4:54 am ] |
Post subject: | Re: Home-made Elo ratings for some engines |
OK, time for some actual results at 20 minutes per player per game.

To start with, I decided to do this as a "win and continue" series of 8-game matches, starting with pachi_nn and introducing opponents about 100-200 Elo points above the previous winner (going by my dodgy projected ratings, to see just how bad they are). I'd expect each new engine to win 5-3 or 6-2. I also decided that if an engine wins its match 8-0, I should backtrack and look for something slightly weaker, in the interest of making the ratings a bit more accurate. Once I get to the top of the list, I'll go back and add some more games to try and reduce the error margins, and maybe add a couple more engines if anything looks especially interesting. Without gnugo in the list, I've decided to anchor pachi_nn's rating at 2400, so the ratings are still in the same ballpark as my other lists.

Round 1: pachi_nn vs oakfoam_nn. This was the first surprise: pachi_nn won the match 5-3. It seems that oakfoam has some problems with time management. It plays at about the right pace in 1-minute or 5-minute games, but in 20-minute games it only uses 6 or 7 minutes in total, so it's giving pachi a bit of an advantage. It seemed to be ahead in the opening and early middlegame of each game, and then managed to misread something and lose.

Round 2: LZ_91_c2t d pachi_nn 6-2.

Round 3: leela_c2t d LZ_91_c2t 5-3.

Round 4: LM_W11_c d leela_c2t 5-3.

Round 5: LM_W11_c d dream 6-2. Another surprise: I was expecting dream to do a bit better against a CPU-only engine. Here there were two games with "disputed" scores (both engines agreed that LM_W11_c had won, but gave different winning margins): each game involved a seki.

Round 6: LZ_zed d LM_W11_c 5-3.

Round 7: leela vs LZ_zed was a 4-4 tie. I decided to add a tiebreaker match: leela d LM_W11_c 6-2. This put leela on top of the list, because it did a better job of beating up LM.

Round 8: ray_ELF d leela 8-0. Backtrack: ray_173 d leela 6-2. Again there was one game with a disputed score, another seki. Then ray_173 d ray_ELF 5-3, not what I expected! At this point, BayesElo had both ray_173 and ray_ELF on exactly the same rating (3082): 173 had won the head-to-head, but ELF did better against leela, and these factors cancelled out. There was also another game with a disputed result (both agreed that ELF won, but disagreed on the amount), not a seki this time; instead, the scoring was messed up because ray_173 passed before the game was actually finished. (No harm done, it was losing anyway.)

Round 9: Instead of running another tiebreaker match, I decided to just give the next engine 6 games each against the tied leaders:

Round 10: ray_173_6t d ray_ELF 8-0. Remember that ray defaults to one thread; in 1-minute or 5-minute games it gets a little stronger given extra threads, but not by a huge amount. It looks like it gets a lot more benefit from those extra threads in slower games! (LZ uses two threads by default, and seems to actually get weaker given extra threads, at least in short games. But maybe it's worth retesting this theory in longer games?) Backtrack: ray_173_2t d ray_ELF 5-3; ray_173_6t d ray_173_2t 7-1. In the 6t vs 2t match, there were two games with disputed results: both players passed early and disagreed on who was ahead. I decided to step in as referee and, looking at the positions, awarded one game to each player.

Round 11: LM_Z2 d ray_173_6t 7-1. Two more games where ray passed early from a losing position.

And throwing all this into BayesElo, the rating list so far is:

Code:
Name        Elo   Elo+  Elo-  games  score  avg_opp
LM_Z2       3726  312   203   8      88%    3505
ray_173_6t  3505  165   154   24     67%    3350
ray_173_2t  3212  169   177   16     38%    3309
ray_ELF     3113  112   111   38     47%    3140
ray_173     3087  143   133   22     64%    2991
LM_W11      3053  166   179   12     42%    3100
leela       2823  119   125   32     38%    2922
LZ_zed      2796  156   149   16     56%    2758
LM_W11_c    2692  109   109   32     50%    2697
leela_c2t   2618  150   150   16     50%    2620
dream       2553  197   256   8      25%    2692
LZ_91_c2t   2547  156   150   16     56%    2509
pachi_nn    2400  146   151   16     44%    2442
oakfoam_nn  2336  193   217   8      38%    2400

To be continued... |
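[Editor's note: for what it's worth, here is a rough Python sketch of the "win and continue" scheduling described at the top of this post. The engine pool, projected ratings, and play_match helper are hypothetical placeholders, not xela's actual setup; the 8-0 backtrack rule and the 4-4 tiebreaker matches were handled by hand and are left out of the sketch.]

Code:
def pick_challenger(pool, champion_elo, lo=100, hi=200):
    """Pick an untested engine projected roughly lo..hi Elo above the current champion."""
    window = [e for e in pool if lo <= e["projected_elo"] - champion_elo <= hi]
    # Prefer the weakest engine inside the window, so the steps stay small.
    return min(window, key=lambda e: e["projected_elo"]) if window else None

def run_series(pool, champion, play_match):
    """Win-and-continue driver: the winner of each 8-game match carries on.
    play_match(challenger, champion, games) should return the challenger's wins."""
    while True:
        challenger = pick_challenger(pool, champion["projected_elo"])
        if challenger is None:
            return champion          # nobody left in range: series finished
        pool.remove(challenger)
        challenger_wins = play_match(challenger, champion, games=8)
        if challenger_wins > 4:      # ties keep the old champion in this sketch
            champion = challenger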
Author: | Kris Storm [ Sun Oct 14, 2018 12:18 pm ] |
Post subject: | Re: Home-made Elo ratings for some engines |
Hi xela. It's a good idea to do this kind of comparison. How are you using BayesElo with SGF files? I've only found it usable with chess PGN files. |
Author: | xela [ Sun Oct 14, 2018 3:06 pm ] |
Post subject: | Re: Home-made Elo ratings for some engines |
Kris Storm wrote: How are you using BayesElo with SGF files?

I've written a few lines of Python code to read the *.dat files created by GoGui and output the results as PGN. (You could do this manually too for a small number of games, and you could do it just as well from the SGF instead of the DAT.) The PGN file can be pretty minimal. BayesElo doesn't need the moves of the game; it's happy running on something that looks like this:

Code:
[White "leela_c"][Black "LM_E8_c"][Result "1-0"] 1-0
[White "LM_E8_c"][Black "leela_c"][Result "1-0"] 1-0
[White "LM_E8_c"][Black "leela_c"][Result "1-0"] 1-0
[White "LM_E8_c"][Black "leela_c"][Result "1-0"] 1-0
[White "leela_c"][Black "LM_W11_c"][Result "1-0"] 1-0
[White "leela_c"][Black "LM_W11_c"][Result "1-0"] 1-0
[White "LM_W11_c"][Black "leela_c"][Result "1-0"] 1-0
[White "LM_W11_c"][Black "leela_c"][Result "1-0"] 1-0

Then I feed these commands to BayesElo:

Code:
readpgn filename.pgn
elo
offset 2000
advantage 0
drawelo 0.01
mm
exactdist
ratings

The "advantage 0" part means that it doesn't care who played black or white, so I can put the winner's name first in my PGN file and mark all the results as 1-0, which makes it simpler to create the PGN. There was a forum post somewhere by RĂ©mi Coulom recommending the "advantage 0" and "drawelo 0.01" settings for go games. The "offset 2000" part means that the average rating of the outputs will be 2000; I have another Python script which changes 2000 to a different number, which is how I anchor the ratings (run it twice, figure out which offset will put gnugo at 1500). |
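[Editor's note: here is a minimal Python sketch of the two steps xela describes above, for anyone who wants to reproduce the setup. The results list, output file name, and gnugo first-pass rating are made-up placeholders; the parsing of GoGui's *.dat files is left out because their exact format isn't shown in this thread.]

Code:
# Made-up (winner, loser) results; xela's real script extracts these from the
# *.dat files written by GoGui / gogui-twogtp.
results = [
    ("leela_c", "LM_E8_c"),
    ("LM_E8_c", "leela_c"),
    ("LM_W11_c", "leela_c"),
]

# Write the minimal PGN that BayesElo needs. Because "advantage 0" makes colour
# irrelevant, the winner can always be listed as White with result 1-0.
with open("results.pgn", "w") as f:
    for winner, loser in results:
        f.write(f'[White "{winner}"][Black "{loser}"][Result "1-0"] 1-0\n')

# Anchoring: run BayesElo once with "offset 2000", note gnugo's rating in the
# output, then re-run with an adjusted offset so that gnugo lands on 1500.
gnugo_first_pass = 1800                        # made-up example value
new_offset = 2000 + (1500 - gnugo_first_pass)  # here: 1700
print(f"use 'offset {new_offset}' on the second pass")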