Life In 19x19
http://lifein19x19.com/

Home-made Elo ratings for some engines
http://lifein19x19.com/viewtopic.php?f=18&t=16086
Page 1 of 2

Author:  xela [ Tue Sep 18, 2018 6:32 pm ]
Post subject:  Home-made Elo ratings for some engines

Just how far ahead of us puny humans is Leela Zero by now? On a home PC, is it at human pro strength, or is it already superhuman? How much difference does it make whether or not you use a GPU?

Inspired by the excellent Engine Tournament, I'm trying to calculate some Elo ratings for a few engines. (I know that CGOS has already done this, but it's very hard to get information about exactly what hardware, software and configuration was used for those engines.)

The good news about using Elo, compared with a league tournament format: it doesn't just tell you "this engine is stronger than that one", it also measures how big a difference it is. Using BayesElo, you can even get error bounds, so you can see roughly how accurate the ratings are.

The bad news: you need a much larger number of games to get accurate ratings. I won't be able to run 50 engines at an hour per player per game and play the 1000 or so games you'd need for high quality data.
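To see why so many games are needed: under the standard Elo model, a rating gap d maps to an expected score through a logistic curve, and inverting that curve on a win rate estimated from only a handful of games gives a very noisy rating. A quick Python sketch (function names are mine, not from BayesElo):

```python
import math

def expected_score(d):
    """Expected score for a player rated d Elo points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def elo_diff(score):
    """Invert the logistic curve: Elo gap implied by an observed score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_uncertainty(score, n_games):
    """Rough 1-sigma error on the Elo gap estimated from n_games,
    propagating the binomial noise in the observed score."""
    sigma_score = math.sqrt(score * (1.0 - score) / n_games)
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))  # d(elo_diff)/d(score)
    return slope * sigma_score

print(round(expected_score(200), 2))    # 0.76: a 200-point gap wins ~3 games in 4
print(round(elo_diff(0.64)))            # 100: a 64% score suggests a ~100-point gap
print(round(elo_uncertainty(0.5, 20)))  # 78: with only 20 games, the error is ~78 Elo
```

That last number is roughly why the Elo+/Elo- columns below sit near 100-200 for engines with only a couple of dozen games.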

What I've done so far: play a bunch of games at 1 minute absolute time, for a quick check that I've configured everything correctly (actually I caught a few mistakes this way), and to get a ballpark estimate of the ratings. Then more games at 5 minutes, for something that I hope is slightly more accurate.

Soon I plan to start a series at 20 minutes per player per game, so we have some data at roughly human-like time controls. I'll have to limit this series to about 15 or 20 engines, otherwise it will take years to generate enough data. But first there are a few more engines and configurations I want to try out.
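For a sense of scale, the back-of-envelope arithmetic (assuming games run back to back and both players use their full clock, which is the worst case for absolute time):

```python
def days_needed(n_games, minutes_per_player):
    """Wall-clock days to play n_games sequentially, worst case:
    both players use their entire absolute-time allotment."""
    total_minutes = n_games * 2 * minutes_per_player
    return total_minutes / (60 * 24)

print(round(days_needed(1000, 1), 1))  # 1.4 days for a 1-minute series
print(round(days_needed(1000, 20)))    # 28 days of nonstop play at 20 min/player
```

And that is for one pool of engines; every extra engine added to a round-robin multiplies the number of pairings.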

My system:
GTX 1070 GPU, 1920 cores, 8GB memory
Ryzen 5 2600 CPU, 6 cores (12 threads)
16 GB RAM
Linux operating system: Ubuntu 18.04


Engines tested so far:
I've used short names for the engines, so that when I view the full crosstable I can fit more columns on my screen :-) I hope this isn't too cryptic...

Leela versions:
LZ is Leela Zero version 0.15
Numbers are the network number from https://zero.sjeng.org/
for example LZ_174 is LZ with the 256x40 network c9d70c41
LZ_ELF is Leela Zero with the ELF weights from http://physik.de/CNNlast.tar.gz
LM is Leela Zero version 0.15 using one of the Leela Master networks from https://github.com/pangafu/LeelaMasterWeight
Just plain "leela" is the one from https://sjeng.org/leela.html, version 0.11
_c on the end means CPU-only mode
_1t means running with one thread only, similar for other numbers
By default, LZ in GPU mode uses 2 threads, LZ in CPU mode uses 12 threads, leela uses 12 threads in either mode

ray is the Ray lz branch from https://github.com/zakki/Ray.git checked out on 12th September, using Leela Zero weights

oakfoam_nn is Oakfoam 0.2.1-dev with the included nicego-cnn-06.gtp configuration file, meaning that it uses a neural network
Plain oakfoam is Oakfoam 0.1.3 with no configuration
Other oakfoams are failed attempts at getting a better configuration, before I figured out how to make oakfoam_nn work

pachi_nn is pachi 12.10 using the network http://physik.de/CNNlast.tar.gz
pachi is pachi 12.10 with the --nodcnn option
pachi_monte and pachi_pat are alternative engines for pachi (with the "-e" option), which turned out to be not very good

fuego is version 1.1.SVN

gnugo is version 3.8
gnugo_M is gnugo with more memory (cache size increased from default of 80M to 7G)
gnugo_l1 is gnugo on level 1; similarly for gnugo_l4 and gnugo_l7

You'll also notice two 1-minute games for AQ. It won one and crashed in the other. I decided that AQ was too unstable on my machine for further testing, so it doesn't appear in the 5-minute series.


Results so far at 1 minute time limit:
Code:
Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          3574  176   175   12     67%    3481
LM_GX47         3568  174   156   28     86%    3179
LZ_174          3441  165   156   18     61%    3359
LZ_ELF_6t       3433  174   199   12     33%    3528
ray_ELF_12t     3325  143   131   24     63%    3244
LZ_173          3320  117   112   42     62%    3218
LZ_141          3249  140   133   30     67%    3096
LM_E8           3214  150   142   32     72%    2974
LZ_116          3162  97    95    58     55%    3123
LM_W11          3135  159   162   24     58%    3039
LZ_174_6t       3129  148   160   24     42%    3177
LM_Z2           3087  116   107   42     69%    2931
ray_173_6t      3069  161   169   16     44%    3107
ray_W11_12t     3022  215   275   8      25%    3168
ray_173_12t     3021  182   196   12     42%    3072
LM_B5           2966  110   109   44     57%    2903
ray_173_2t      2963  172   164   16     56%    2923
LZ_91           2826  104   114   60     27%    3049
leela           2813  141   137   38     53%    2780
ray_ELF         2785  219   280   10     20%    3020
ray_W11         2646  175   176   16     44%    2707
ray_173         2605  197   240   12     25%    2785
oakfoam_nn      2577  122   122   74     66%    2308
LM_B5_c         2574  194   171   12     67%    2484
LZ_116_c2t      2574  120   124   44     34%    2756
LM_GX47_c       2535  184   173   12     58%    2491
LM_E8_c         2501  194   194   10     50%    2502
LZ_57           2491  197   197   26     65%    2239
LZ_116_c6t      2468  185   189   12     50%    2460
LM_W11_c        2417  171   194   12     33%    2510
LM_Z2_c         2410  186   220   10     30%    2520
LZ_91_c2t       2169  178   166   18     56%    2132
AQ              2116  383   383   2      50%    2116
leela_c         2116  110   106   78     63%    1948
leela_c1t       2100  210   187   12     67%    1984
leela_c2t       2083  174   180   16     38%    2204
pachi_nn        2022  109   105   74     66%    1826
pachi           1816  125   122   66     56%    1772
leela_nonet     1782  104   101   88     58%    1717
gnugo           1500  88    83    84     64%    1402
gnugo_l7        1498  119   121   52     38%    1631
LZ_57_c2t       1492  246   218   8      63%    1420
gnugo_M         1470  139   133   34     53%    1471
gnugo_l1        1451  90    88    84     48%    1508
gnugo_l4        1435  140   139   32     47%    1489
leela_nonet_1t  1390  240   349   10     10%    1811
oakfoam1        1363  126   122   32     56%    1320
pachi_pat       1339  383   368   2      50%    1339
fuego           1339  90    90    78     37%    1571
oakfoam_book    1256  112   119   40     38%    1359
pachi_1t        1214  197   267   14     14%    1521
oakfoam         1195  92    101   72     25%    1433
oakfoam2        1152  129   153   30     23%    1353
pachi_plain     1151  345   325   2      0%     1339
pachi_monte     1151  345   325   2      0%     1339
michi           1135  301   311   4      0%     1420
matilda         1065  138   181   44     9%     1504


Results so far at 5 minute time limit:
Code:
Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          4561  -71   184   28     82%    4314
LM_GX47         4436  51    128   34     68%    4287
LZ_ELF_6t       4410  74    119   36     64%    4302
LZ_173          4270  111   112   48     54%    4197
LZ_141          4252  98    96    68     62%    4126
LZ_174          4239  110   107   52     62%    4110
ray_ELF_12t     4133  159   172   18     44%    4159
LZ_174_6t       4054  92    89    78     55%    4026
ray_173_12t     3957  164   165   17     47%    3981
LM_Z2           3887  98    99    65     51%    3843
ray_173_6t      3855  157   175   18     33%    3975
LZ_116          3816  92    93    84     51%    3793
LM_B5           3808  148   158   26     46%    3820
LM_E8           3784  134   129   36     58%    3718
LM_W11          3762  117   110   50     64%    3646
ray_W11_12t     3593  157   157   24     54%    3553
ray_173_2t      3585  186   212   16     31%    3750
ray_ELF         3484  149   157   36     28%    3750
ray_173         3475  175   189   20     30%    3676
leela           3380  98    104   82     40%    3438
LZ_91           3274  128   136   42     31%    3466
ray_W11         3175  162   161   24     46%    3234
LM_E8_c         3037  142   120   34     76%    2841
LZ_116_c2t      3000  115   111   56     63%    2883
LM_W11_c        2870  117   115   34     53%    2849
LM_GX47_c       2853  104   104   44     48%    2872
oakfoam_nn      2851  94    90    66     56%    2814
leela_c         2784  95    97    66     52%    2751
leela_c2t       2747  82    83    78     49%    2752
LM_B5_c         2631  173   192   14     36%    2731
LZ_91_c2t       2625  115   112   48     56%    2566
LM_Z2_c         2597  139   141   24     46%    2634
LZ_57           2589  129   130   30     43%    2651
leela_c1t       2519  104   111   60     40%    2626
pachi_nn        2419  107   114   60     38%    2530
pachi           2130  116   112   76     61%    2003
leela_nonet     2093  142   152   40     38%    2211
LZ_57_c2t       2042  157   145   28     71%    1830
fuego           1878  119   118   50     62%    1751
pachi_1t        1790  126   124   42     57%    1720
leela_nonet_1t  1788  131   126   40     63%    1677
gnugo           1500  122   7     66     26%    1755
oakfoam         1437  334   -58   10     20%    1772
michi           1435  210   -57   24     25%    1675
oakfoam1        1272  406   -221  12     0%     1836
matilda         1260  420   -232  10     0%     1808
oakfoam_book    1241  356   -251  16     6%     1628


Edited 24th September: crosstables attached in CSV format, with a count of how many games each engine has played against each opponent.

The "Elo" column is the rating. Elo+ and Elo- are error bounds (to be pedantic, they're Bayesian credible intervals, not to be confused with frequentist confidence intervals). So for example, in the 1-minute ratings with LZ_ELF at Elo=3574, Elo+ = 176, Elo- = 175, this means that BayesElo thinks there's a 95% chance of the true rating being between 3399 and 3750.

In the 5-minute ratings, you'll see some negative numbers near the top and bottom. I think this is a symptom of a skewed probability distribution: BayesElo can tell that LZ_ELF is "a lot stronger" than the other engines, but there isn't enough data to measure exactly how much stronger. This time last week I was seeing a lot more minus signs, and they're gradually going away as I add more data.

I've offset the ratings to put gnugo at 1500 each time, on the principle that gnugo is theoretically around 5K. This should mean that the BayesElo ratings are more or less in line with EGF ratings (plus or minus a couple of hundred rating points), and also not too far away from Rémi Coulom's ratings for pros. Looking at fuego and pachi, this seems to be in the right ballpark. So we have some weak evidence that the strongest CPU-only engines on a home PC can play at around top amateur or low pro level, and the good GPU-accelerated engines are already superhuman, at least in 5-minute games.
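The offset itself is just a constant shift applied to BayesElo's output: Elo is only defined up to an additive constant, so differences between engines are unaffected. A sketch (the raw numbers here are made up for illustration):

```python
def anchor(ratings, anchor_name="gnugo", anchor_elo=1500):
    """Shift all ratings by one constant so anchor_name sits at anchor_elo.
    Differences between engines are preserved exactly."""
    offset = anchor_elo - ratings[anchor_name]
    return {name: elo + offset for name, elo in ratings.items()}

raw = {"LZ_ELF": 3946, "gnugo": 1872, "fuego": 1711}  # made-up raw BayesElo output
print(anchor(raw))  # {'LZ_ELF': 3574, 'gnugo': 1500, 'fuego': 1339}
```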

It will take a couple of months for me to get similar data for 20-minute games. I'll update here some time (don't hold your breath, I'm very good at procrastination!)

Attachments:
5_min_crosstable-2018-09-19.csv [7.05 KiB]
1_min_crosstable-2018-09-19.csv [9.62 KiB]

Author:  EdLee [ Wed Sep 19, 2018 1:59 am ]
Post subject: 

xela, Thanks. :tmbup:

Author:  Uberdude [ Wed Sep 19, 2018 2:20 am ]
Post subject:  Re: Home-made Elo ratings for some engines

Very nice, thanks xela. Could you please add in LZ #157, the best 15-block network? I use that a lot, as I think it gives better performance at shortish time limits than the deeper networks (to read a ladder, the superior judgement of a 40-block network doesn't help if it only has 200* playouts, but 800 playouts of a less-skilled network enables the ladder to be read).

* exact numbers not guaranteed, and with more training a deeper network may be able to read ladders with few playouts, but we aren't there yet afaik.

Author:  xela [ Thu Sep 20, 2018 2:52 pm ]
Post subject:  Re: Home-made Elo ratings for some engines

Uberdude wrote:
Could you please add in LZ #157, the best 15 block network?


Good suggestion. I chose the 15-block network LZ 141 because it's the same one used for the engine tournament. It'll be interesting to see how much stronger 157 is. I'll include it in the next update.

Author:  EdLee [ Thu Sep 20, 2018 8:37 pm ]
Post subject: 

Hi xela,

An engine (Taiwan-based?) on IGS, the username is
leelazero ( one word, all lowercase ).
Its info includes "GTX970. zero.sjeng.org "

Any possibility to extrapolate or guesstimate its Elo range?
( It has the (small avalanche) ladder problem that people exploit. ) Thanks.

This page has an Elo graph, roughly at 12,700 ?

Author:  mb76 [ Fri Sep 21, 2018 2:50 pm ]
Post subject:  Re: Home-made Elo ratings for some engines

Could you please add in the LZ zediir network, described as "Supervised. From the TYGEM dataset"? https://github.com/gcp/leela-zero/issues/884

Author:  xela [ Mon Sep 24, 2018 3:21 am ]
Post subject: 

EdLee wrote:
Hi xela,

An engine (Taiwan-based?) on IGS, the username is
leelazero ( one word, all lowercase ).
Its info includes "GTX970. zero.sjeng.org "

Any possibility to extrapolate or guesstimate its Elo range?


Sorry, that's not enough information to work with. At https://zero.sjeng.org/ there is a list of 178 different networks that leelazero can use. If you can find out which network this engine uses, then we can make some guesses.

Author:  xela [ Mon Sep 24, 2018 3:21 am ]
Post subject:  Re: Home-made Elo ratings for some engines

mb76 wrote:
Could you please add in the LZ zediir network, described as "Supervised. From the TYGEM dataset"? https://github.com/gcp/leela-zero/issues/884


Will do. I'll have some results to show in a couple of days. Thanks for the suggestion.

Author:  xela [ Tue Sep 25, 2018 8:00 pm ]
Post subject:  Re: Home-made Elo ratings for some engines

New this week:
  • Added DreamGo. I was hoping for a low dan level bot that would fill the large rating gap between pachi and pachi_nn. Unfortunately (!) the latest DreamGo is actually quite a lot stronger than that!
  • Added LZ 157, the strongest 192x15 network. In fast games, it turns out to be a bit stronger than some of the bigger networks. I'd expect this to change in slower games.
  • Added LZ zediir weights (LZ_zed). This turned out to be weaker than I expected, but again we might see a different story with slower games.
  • Played a few more games with the other engines, to try and reduce some of the error margins on the ratings.

Results so far at 1 minute time limit, based on 986 games with 59 engines:
Code:
Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          3594  165   154   16     63%    3521
LM_GX47         3559  156   140   32     78%    3239
LZ_157          3550  134   124   28     64%    3451
LZ_174          3490  149   143   22     59%    3420
LZ_ELF_6t       3483  146   158   18     39%    3548
LZ_173          3350  111   108   48     60%    3245
ray_ELF_12t     3347  136   129   26     58%    3297
LZ_141          3294  141   135   32     66%    3132
LM_E8           3239  143   137   36     69%    3017
LZ_116          3186  97    95    58     55%    3141
LZ_174_6t       3162  148   157   26     46%    3176
LM_W11          3152  162   167   24     58%    3048
LM_Z2           3096  120   111   42     69%    2931
ray_173_6t      3096  149   149   20     50%    3096
ray_173_12t     3059  164   164   16     50%    3061
LM_B5           2999  106   103   50     60%    2911
ray_W11_12t     2961  123   129   30     40%    3032
ray_173_2t      2952  139   135   24     54%    2926
leela           2862  118   113   50     58%    2785
LZ_zed          2831  136   152   26     31%    2981
LZ_91           2804  105   113   64     27%    3040
ray_ELF         2730  133   138   28     39%    2834
ray_173         2678  186   202   14     36%    2785
ray_W11         2659  148   148   22     45%    2703
dream           2627  151   160   24     54%    2525
oakfoam_nn      2571  120   122   76     64%    2325
LM_B5_c         2554  194   171   12     67%    2464
LZ_116_c2t      2551  119   123   46     33%    2754
LM_GX47_c       2514  184   173   12     58%    2470
LZ_57           2483  199   197   26     65%    2239
LM_E8_c         2480  194   194   10     50%    2481
LZ_116_c6t      2451  184   189   12     50%    2444
LM_W11_c        2397  171   195   12     33%    2490
LM_Z2_c         2389  186   220   10     30%    2499
LZ_91_c2t       2166  177   165   18     56%    2129
leela_c         2114  116   111   78     62%    1962
leela_c1t       2099  210   187   12     67%    1983
leela_c2t       2082  174   180   16     38%    2202
pachi_nn        2022  109   105   76     64%    1846
pachi           1817  125   122   68     54%    1796
leela_nonet     1781  104   101   88     58%    1717
gnugo           1500  88    83    84     64%    1401
gnugo_l7        1498  119   122   52     38%    1630
LZ_57_c2t       1492  246   218   8      63%    1419
gnugo_M         1469  139   133   34     53%    1471
gnugo_l1        1451  90    88    84     48%    1508
gnugo_l4        1435  140   139   32     47%    1489
leela_nonet_1t  1390  240   346   10     10%    1809
oakfoam1        1363  126   122   32     56%    1319
pachi_pat       1339  384   360   2      50%    1339
fuego           1339  90    90    78     37%    1571
oakfoam_book    1256  112   119   40     38%    1358
pachi_1t        1213  198   262   14     14%    1522
oakfoam         1195  92    101   72     25%    1433
oakfoam2        1151  129   154   30     23%    1353
pachi_monte     1151  348   288   2      0%     1339
pachi_plain     1151  348   288   2      0%     1339
michi           1134  304   274   4      0%     1419
matilda         1064  139   172   44     9%     1504



Results so far at 5 minute time limit, based on 1310 games with 50 engines:
Code:
Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          4556  -23   123   46     67%    4427
LM_GX47         4515  16    117   44     66%    4378
LZ_ELF_6t       4506  25    109   48     65%    4390
LZ_157          4463  64    130   28     54%    4437
LZ_173          4328  97    97    62     55%    4254
LZ_141          4309  94    94    74     59%    4190
LZ_174          4279  106   103   60     63%    4136
ray_ELF_12t     4148  109   113   42     45%    4178
LZ_174_6t       4114  89    86    88     57%    4057
ray_173_12t     3944  109   114   42     38%    4040
LM_Z2           3932  99    100   68     53%    3860
LM_B5           3918  113   111   44     57%    3850
ray_173_6t      3860  99    101   48     44%    3911
LZ_116          3854  89    91    90     51%    3826
LM_E8           3778  115   112   46     54%    3748
LM_W11          3770  111   107   56     61%    3668
ray_173_2t      3691  119   123   44     50%    3680
ray_W11_12t     3633  125   114   44     68%    3487
ray_ELF         3472  113   114   54     35%    3667
leela           3400  97    99    88     44%    3436
ray_173         3377  107   115   50     30%    3571
LZ_zed          3359  107   106   52     48%    3399
LZ_91           3272  99    104   66     32%    3456
dream           3218  126   128   34     56%    3107
ray_W11         3156  111   114   44     43%    3225
LM_E8_c         3048  111   106   50     60%    2971
LZ_116_c2t      3002  115   110   56     63%    2888
LM_W11_c        2868  118   116   34     53%    2849
oakfoam_nn      2849  93    90    68     54%    2829
LM_GX47_c       2848  105   105   44     48%    2869
leela_c         2771  91    91    72     51%    2746
leela_c2t       2752  82    83    78     49%    2757
LM_B5_c         2662  129   135   26     42%    2718
LZ_91_c2t       2638  116   113   48     56%    2576
LM_Z2_c         2618  123   124   30     47%    2646
LZ_57           2600  128   131   30     43%    2660
leela_c1t       2526  104   111   60     40%    2631
pachi_nn        2430  106   112   64     39%    2543
pachi           2137  111   108   80     58%    2033
LZ_57_c2t       2093  132   121   40     70%    1900
leela_nonet     2086  137   149   42     36%    2185
fuego           1865  107   105   72     65%    1691
pachi_1t        1858  119   115   54     65%    1690
leela_nonet_1t  1856  124   117   52     69%    1652
gnugo           1500  120   -33   106    20%    1791
michi           1466  206   -68   40     55%    1432
oakfoam1        1287  341   -246  28     43%    1430
oakfoam         1068  521   -465  26     27%    1386
oakfoam_book    998   573   -534  32     13%    1435
matilda         975   603   -557  26     15%    1407

Attachment:
5_min_crosstable-2018-09-26.csv [7.95 KiB]


Next I want to try LZ with the Phoenix weights. After that, I might start the 20-minute series.

Attachments:
1_min_crosstable-2018-09-26.csv [10.27 KiB]

Author:  xela [ Wed Oct 03, 2018 6:06 am ]
Post subject:  Re: Home-made Elo ratings for some engines

New this week:
  1. Added LZ with Phoenix weights. This doesn't do so well in fast games, I'd expect it to overtake LZ 157 in slower games. I'll get to that in a few weeks...
  2. I realised that DreamGo has pondering turned on by default. So I've renamed dream from last week's update to dream_ponder, and added a new dream with pondering off. It looks like pondering is worth about 100 rating points in 5-minute games, and less in fast games.
  3. Played some ray vs ray games, now that I know how to make ray play against itself with two different weight files. This helps with making the ratings more accurate (no more negative errors at the top of the table now).

Results so far at 1 minute time limit, based on 1326 games with 61 engines:
(edited 9th October: subtract 372 from all ratings to put gnugo at 1500, consistent with my other rating lists)
Code:
Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_157          3780  98    91    60     72%    3614
LM_GX47         3758  102   95    62     73%    3517
LZ_ELF          3717  101   98    48     58%    3650
LZ_ELF_6t       3636  96    101   48     42%    3693
LZ_174          3590  87    88    64     47%    3611
ray_ELF_12t     3542  93    95    58     45%    3578
LZ_173          3519  115   112   48     60%    3411
LZ_141          3471  115   114   44     59%    3359
LM_E8           3450  116   115   50     64%    3249
LZ_116          3369  99    97    58     55%    3318
LZ_174_6t       3325  112   113   42     50%    3315
ray_173_6t      3311  112   110   36     53%    3293
LM_Z2           3308  98    94    60     67%    3137
ray_173_12t     3289  115   113   34     53%    3269
LM_W11          3259  111   115   44     50%    3230
LM_B5           3178  102   101   56     59%    3076
LZ_phoenix      3156  119   123   36     39%    3265
ray_W11_12t     3150  114   118   36     42%    3212
ray_173_2t      3103  129   129   28     50%    3104
LZ_zed          3010  122   126   34     41%    3088
leela           2990  116   116   54     54%    2925
LZ_91           2935  100   106   74     30%    3149
ray_ELF         2899  129   127   32     47%    2949
ray_173         2887  133   128   32     56%    2843
ray_W11         2778  110   108   44     52%    2767
dream_ponder    2759  117   121   40     53%    2679
dream           2637  121   122   34     47%    2658
oakfoam_nn      2624  121   123   82     62%    2406
LM_GX47_c       2561  130   124   30     57%    2520
LZ_116_c2t      2497  111   113   58     31%    2768
LM_E8_c         2472  140   140   22     50%    2472
LM_B5_c         2471  134   134   24     50%    2471
LZ_116_c6t      2450  137   138   24     50%    2446
LZ_57           2373  116   117   50     52%    2336
LM_Z2_c         2370  121   114   36     61%    2291
LM_W11_c        2306  125   133   28     39%    2377
leela_c1t       2202  108   108   42     52%    2177
leela_c2t       2135  128   136   30     37%    2248
LZ_91_c2t       2133  137   140   26     46%    2160
leela_c         2126  103   101   88     59%    2004
pachi_nn        2028  110   107   76     64%    1855
pachi           1818  126   123   68     54%    1807
leela_nonet     1783  105   102   88     58%    1722
gnugo           1500  88    83    84     64%    1402
gnugo_l7        1498  120   122   52     38%    1633
LZ_57_c2t       1492  246   218   8      63%    1419
gnugo_M         1470  139   133   34     53%    1472
gnugo_l1        1451  91    89    84     48%    1509
gnugo_l4        1435  140   139   32     47%    1490
leela_nonet_1t  1386  241   335   10     10%    1788
oakfoam1        1363  126   122   32     56%    1319
pachi_pat       1338  385   332   2      50%    1338
fuego           1338  90    90    78     37%    1573
oakfoam_book    1256  113   119   40     38%    1359
pachi_1t        1213  198   235   14     14%    1522
oakfoam         1195  92    101   72     25%    1434
oakfoam2        1152  130   151   30     23%    1353
pachi_monte     1151  356   211   2      0%     1338
pachi_plain     1151  356   211   2      0%     1338
michi           1134  311   197   4      0%     1419
matilda         1064  140   127   44     9%     1505
Attachment:
1_min_crosstable-2018-10-03.csv [11.05 KiB]

Results so far at 5 minute time limit, based on 1396 games with 52 engines:
Code:
Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          4544  11    112   46     67%    4415
LM_GX47         4504  49    111   44     66%    4370
LZ_ELF_6t       4496  56    104   48     65%    4381
LZ_157          4457  89    123   32     59%    4385
LZ_173          4307  94    93    66     55%    4242
LZ_174          4291  102   97    64     66%    4135
LZ_141          4280  90    90    78     58%    4188
ray_ELF_12t     4145  102   105   46     46%    4173
LZ_phoenix      4133  121   122   36     47%    4154
LZ_174_6t       4120  84    82    92     57%    4066
ray_173_12t     3969  103   108   46     39%    4054
LM_Z2           3933  95    97    72     50%    3885
LM_B5           3932  112   110   44     57%    3862
LZ_116          3881  86    88    94     51%    3848
ray_173_6t      3874  99    101   48     44%    3924
LM_E8           3794  114   112   46     54%    3761
LM_W11          3786  110   107   56     61%    3682
ray_173_2t      3706  119   123   44     50%    3693
ray_W11_12t     3649  125   114   44     68%    3503
ray_ELF         3487  113   114   54     35%    3679
leela           3414  98    100   88     44%    3446
ray_173         3392  107   115   50     30%    3585
LZ_zed          3372  108   106   52     48%    3410
LZ_91           3289  95    98    71     35%    3446
dream_ponder    3232  117   116   40     58%    3118
ray_W11         3172  105   105   50     46%    3218
dream           3130  129   129   29     52%    3114
LM_E8_c         3035  109   104   52     58%    2977
LZ_116_c2t      2988  107   105   62     58%    2912
LM_W11_c        2874  116   114   36     53%    2856
LM_GX47_c       2842  105   105   44     48%    2864
oakfoam_nn      2839  92    89    70     53%    2833
leela_c         2759  91    92    72     51%    2738
leela_c2t       2740  82    83    78     49%    2746
LZ_91_c2t       2648  105   101   56     59%    2568
LM_Z2_c         2601  109   110   38     47%    2624
LM_B5_c         2599  114   121   34     38%    2683
LZ_57           2597  113   114   38     45%    2645
leela_c1t       2535  95    100   68     41%    2624
pachi_nn        2429  106   113   64     39%    2542
pachi           2137  112   108   80     58%    2034
LZ_57_c2t       2093  132   121   40     70%    1900
leela_nonet     2086  137   150   42     36%    2186
fuego           1865  107   105   72     65%    1691
pachi_1t        1858  119   115   54     65%    1690
leela_nonet_1t  1856  124   117   52     69%    1652
gnugo           1500  131   -57   106    20%    1791
michi           1466  219   -92   40     55%    1432
oakfoam1        1287  360   -270  28     43%    1430
oakfoam         1068  543   -489  26     27%    1386
oakfoam_book    998   595   -558  32     13%    1435
matilda         975   625   -581  26     15%    1407
Attachment:
5_min_crosstable-2018-10-03.csv [8.51 KiB]

This week I'm going to start playing some matches with 20 minutes per player per game. This will be with a smaller collection of engines, so that we'll get some results this year.

Author:  Uberdude [ Wed Oct 03, 2018 7:11 am ]
Post subject:  Re: Home-made Elo ratings for some engines

157 hero! :clap:

Author:  pnprog [ Sat Oct 06, 2018 9:20 pm ]
Post subject:  Re: Home-made Elo ratings for some engines

Hi!
Very interested in the thread :)

xela wrote:
  • I realised that DreamGo has pondering turned on by default. So I've renamed dream from last week's update to dream_ponder, and added a new dream with pondering off. It looks like pondering is worth about 100 rating points in 5-minute games, and less in fast games.
But when DreamGo is playing with pondering on, my understanding is that:
  • Not only will it increase its own level
  • But it will also decrease its opponent's level, by taking away some of the computing power the opponent needs, no? This in turn makes the opponent appear weaker, and explains the big difference in Elo?

Like, imagine I run the tournament on a simple computer: 1000MHz CPU, one thread, no GPU; then it's like comparing:
  • DreamGo (1000MHz) VS Pachi (1000MHz)
  • DreamGo (1000MHz + pondering at 500MHz) VS Pachi (500MHz)
We can expect Pachi to be significantly weaker, while facing a DreamGo boosted a little by pondering?

For the dream_ponder entry, what you would like to have is:
  • DreamGo (1000MHz + pondering at 1000MHz) VS Pachi (1000MHz)
Or have I misunderstood something?

Author:  xela [ Sun Oct 07, 2018 4:16 am ]
Post subject:  Re: Home-made Elo ratings for some engines

pnprog wrote:
But when DreamGo is playing with pondering on, my understanding is that:
  • Not only will it increase its own level
  • But it will also decrease its opponent's level, by taking away some of the computing power the opponent needs, no? This in turn makes the opponent appear weaker, and explains the big difference in Elo?

Correct. In fact, the difference in Elo ratings between dream and dream_ponder is actually smaller than I expected.

pnprog wrote:
For the dream_ponder entry, what you would like to have is:
  • DreamGo (1000MHz + pondering at 1000MHz) VS Pachi (1000MHz)
Or have I misunderstood something?

Yes, that would be a better way to test it. The fact is that I intended to run all engines without pondering, to avoid this type of complication. The inclusion of dream_ponder was an accident! I decided to leave it in the ratings list, rather than deleting it, because it's an interesting accident :-)

Author:  pnprog [ Mon Oct 08, 2018 5:46 am ]
Post subject:  Re: Home-made Elo ratings for some engines

So now, reading the EGF rating system page on Sensei's Library: they define one stone in strength as equivalent to 100 Elo, and that would make LeelaZero around 27 stones stronger than GnuGo. Something like a 21-dan amateur player :bow:

More seriously, if we fix GnuGo at a certain level (like 1500 Elo / 5k), what other data do we need to make our Elo scale comparable to the EGF rating?

Quote:
The inclusion of dream_ponder was an accident! I decided to leave it in the ratings list, rather than deleting it, because it's an interesting accident :-)
I am not really knowledgeable about those Elo ratings, but if we introduce a bot with skewed performance/rating, won't it affect the rating of all the bots on the scale? Like decrease the rating of the bots weaker than dream_ponder, and increase the rating of the bots stronger than dream_ponder?

Author:  xela [ Tue Oct 09, 2018 4:39 am ]
Post subject:  Re: Home-made Elo ratings for some engines

pnprog wrote:
So now, reading the EGF rating system page on Sensei, they indicate that one stone in strength is equivalent to 100 Elo, which would make LeelaZero around 27 stones stronger than Gnugo. Something like a 21 dan amateur player :bow:

More seriously, if we fix Gnugo at a certain level (say 1500 Elo / 5k), what other data would we need to make our Elo scale comparable to the EGF rating?

I think BayesElo is similar to EGF ratings, but not exactly the same. For a good comparison, we'd need to run the EGF rating algorithm on my engine vs engine games, or else collect some EGF tournament results and run BayesElo on those results to compare with EGF ratings. That's a whole other research project that I'm not going to start this year :-)

I think "LeelaZero around 27 stones stronger than Gnugo" is about right, but it could be anywhere between 20 and 35 stones really.

pnprog wrote:
I am not really knowledgeable about those Elo ratings, but if we introduce a bot with skewed performance/rating, won't it affect the rating of all the bots on the scale? Like decrease the rating of the bots weaker than dream_ponder, and increase the rating of the bots stronger than dream_ponder?

No, I don't think it matters. What does a "skewed performance rating" mean anyway? The bot was stronger than I thought it would be? But the BayesElo software doesn't read my mind, it only looks at the game results. Dream_ponder beats weaker bots, and loses against stronger ones, same behaviour as any other bot. I don't think it makes a difference to the ratings whether it gets those results by playing good moves, or by sabotaging the opponents (stealing memory or CPU cycles).

In any case, I've been anchoring the ratings to put GnuGo at 1500 every time, so this should help to keep things stable.

Author:  xela [ Tue Oct 09, 2018 5:24 am ]
Post subject:  Re: Home-made Elo ratings for some engines

Just for fun, let's do some dodgy mathematical analysis of the 1-minute and 5-minute results, to see if we can extrapolate what will happen in 20-minute games. (I'll post some actual 20-minute results tomorrow. I did the analysis last week, just didn't get around to posting until today.)

We already know that small networks beat bigger networks in fast games, but we might expect the bigger networks to catch up in slower games. Let's pretend that each engine/network combination has a "baseline" strength (how well it can play on minimal thinking time) plus an ability to get stronger with more time. There will be diminishing returns: you'd expect a big difference between 1 minute and 10 minutes, but not much difference between 60 minutes and 69 minutes. But the strength is theoretically unbounded (Monte Carlo search converges to the best move given unlimited time and unlimited memory).

So a half way reasonable model might be:

                              Elo rating = b + alpha times log(t)

where b is the baseline strength, t is the thinking time in minutes per player per game (absolute time, because I don't want to get into complications around byo-yomi), and alpha represents how well the engine/network can make use of extra thinking time.

At 1 minute time limits, t=1, log(t)=0, so b is just the 1-minute Elo rating. Then we can calculate alpha as (5-minute Elo minus 1-minute Elo)/log(4). (For me, log means natural log, because I did too much calculus as a teenager, so log(4) is about 1.386.) And then the expected 20-minute rating from this model would be b + alpha times log(19), or b+2.944 alpha.
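The fit described above is simple enough to sketch in a few lines of Python (the thread's own analysis used R; this is just an illustrative reimplementation of the same arithmetic, before the re-anchoring shift mentioned below):

```python
import math

def project(elo1, elo5):
    """Fit the model Elo(t) = b + alpha*log(t) to the 1-minute and
    5-minute ratings, using the post's time factors log(4) and log(19).
    Returns (alpha, projected 20-minute Elo before any re-anchoring)."""
    b = elo1                             # log(1) = 0, so b is the 1-minute rating
    alpha = (elo5 - elo1) / math.log(4)  # log(4) is about 1.386
    elo20 = b + alpha * math.log(19)     # log(19) is about 2.944
    return alpha, elo20

# e.g. LZ_ELF with adjusted ratings 3517 (1 min) and 4544 (5 min)
alpha, elo20 = project(3517, 4544)
print(round(alpha))  # 741, matching the alpha column in the table
```

An engine whose rating doesn't move between 1 and 5 minutes gets alpha=0, i.e. the model says it gains nothing from extra time.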

If we have gnugo at 1500 on both rating scales, then it gets alpha=0, meaning that gnugo gets no stronger when it thinks for a long time. Worse, a few of the weaker engines get negative numbers for alpha. I don't believe that, so I'm going to subtract 200 from all the 1-minute ratings, just to get some more reasonable alpha values.

Finally, this projects pachi_nn to be about 3300 in 20 minute games, which isn't realistic (it's nowhere near pro strength), so I'm going to subtract a few rating points from the results to put pachi_nn at 2400.

Then a few lines of R programming gives these results:

Code:
             Name Elo1_adjusted Elo5 rank1 rank5 alpha Elo20 rank20
1          LZ_ELF          3517 4544     3     1   741           4993      1
2       LZ_ELF_6t          3436 4496     4     3   765           4982      2
3         LM_GX47          3558 4504     2     2   682           4862      3
4      LZ_phoenix          2956 4133    17     9   849           4751      4
5          LZ_157          3580 4457     1     4   633           4738      5
6          LZ_173          3319 4307     7     5   713           4712      6
7          LZ_141          3271 4280     8     7   728           4709      7
8          LZ_174          3390 4291     5     6   650           4599      8
9       LZ_174_6t          3125 4120    11    10   718           4533      9
10    ray_ELF_12t          3342 4145     6     8   579           4343     10
11          LM_B5          2978 3932    16    13   688           4299     11
12    ray_173_12t          3089 3969    14    11   635           4253     12
13          LM_Z2          3108 3933    13    12   595           4155     13
14     ray_173_6t          3111 3874    12    15   550           4027     14
15         LZ_116          3169 3881    10    14   514           3976     15
16     ray_173_2t          2903 3706    19    18   579           3904     16
17         LM_W11          3059 3786    15    17   524           3898     17
18    ray_W11_12t          2950 3649    18    19   504           3730     18
19          LM_E8          3250 3794     9    16   392           3700     19
20        ray_ELF          2699 3487    23    20   568           3668     20
21        ray_173          2687 3392    24    22   509           3479     21
22          leela          2790 3414    21    21   450           3410     22
23         LZ_zed          2810 3372    20    23   405           3299     23
24   dream_ponder          2559 3232    26    25   485           3283     24
25          LZ_91          2735 3289    22    24   400           3207     25
26          dream          2437 3130    27    27   500           3204     26
27        LM_E8_c          2272 3035    31    28   550           3188     27
28        ray_W11          2578 3172    25    26   428           3135     28
29     LZ_116_c2t          2297 2988    30    29   498           3060     29
30       LM_W11_c          2106 2874    35    30   554           3032     30
31        leela_c          1926 2759    39    33   601           2990     31
32      leela_c2t          1935 2740    37    34   581           2940     32
33      LZ_91_c2t          1933 2648    38    35   516           2747     33
34      LM_GX47_c          2361 2842    29    31   347           2678     34
35     oakfoam_nn          2424 2839    28    32   299           2600     35
36      leela_c1t          2002 2535    36    39   384           2429     36
37       pachi_nn          1828 2429    40    40   434           2400     37
38        LM_Z2_c          2170 2601    34    36   311           2380     38
39          LZ_57          2173 2597    33    38   306           2369     39
40      LZ_57_c2t          1292 2093    44    42   578           2288     40
41        LM_B5_c          2271 2599    32    37   237           2263     41
42       pachi_1t          1013 1858    49    45   610           2103     42
43          pachi          1618 2137    41    41   374           2015     43
44          fuego          1138 1865    47    44   524           1977     44
45    leela_nonet          1583 2086    42    43   363           1946     45
46 leela_nonet_1t          1186 1856    45    46   483           1904     46
47          michi           934 1466    51    48   384           1359     47
48          gnugo          1300 1500    43    47   144           1020     48
49       oakfoam1          1163 1287    46    49    89            721     49
50        oakfoam           995 1068    50    50    53            445     50
51        matilda           864  975    52    52    80            395     51
52   oakfoam_book          1056  998    48    51   -42            228     52

So we can see for example that LZ_phoenix comes 17th in 1-minute games, but 9th in 5-minute games, giving it a big alpha value (it's making great use of the extra thinking time), and we'd expect it to shoot up to 4th place in 20-minute games. On the other hand, LM_E8 (with a 128x10 network) did better at 1 minute than at 5 minutes, so its alpha is lower, and we'd expect it to rank even lower at 20 minutes. Then again, the alpha values for LZ 141 and 174 don't look quite right.

This is a pretty simplistic model, so I don't expect the results to be at all accurate (we can tell it's not right by the way gnugo has dropped 500 points in the output), but it's interesting food for thought.

Author:  moha [ Wed Oct 10, 2018 4:23 am ]
Post subject:  Re: Home-made Elo ratings for some engines

xela wrote:
This is a pretty simplistic model, so I don't expect the results to be at all accurate (we can tell it's not right by the way gnugo has dropped 500 points in the output), but it's interesting food for thought.
The basic idea usually is that each doubling of thinking time gives a roughly similar strength increase (of course this is not necessarily a reasonable idea for all engines). Your formula could capture this if you didn't subtract 1 from 5 and 20 before taking the log.

But in these rating pools one's result depends on others' performances as well, quite a problem for this approach. Maybe you could anchor at gnugo=1500 for 1 min, and anchor other times at a guessed gnugo improvement factor / rating. If you expect your numbers to go up with more time, then you basically compare performance to 1-min gnugo (how strong I should be to play this well in 1-min games), so going up into otherwise "pro" number range is not surprising and does not necessarily mean pro strength.

Author:  xela [ Wed Oct 10, 2018 4:54 am ]
Post subject:  Re: Home-made Elo ratings for some engines

OK, time for some actual results at 20 minutes per game. To start with, I decided to do this as a "win and continue" series of 8-game matches, starting with pachi_nn and introducing opponents about 100-200 Elo points above the previous winner (going by my dodgy projected ratings, to see just how bad they are). I'd expect each new engine to win 5-3 or 6-2. I also decided that if an engine wins its match 8-0, I should backtrack and look for something slightly weaker, in the interest of making the ratings a bit more accurate. Once I get to the top of the list, then I'll go back and add some more games to try and reduce the error margins, and maybe add a couple more engines if anything looks especially interesting. Without gnugo in the list, I've decided to anchor pachi_nn's rating at 2400, so the ratings are still in the same ballpark as my other lists.

Round 1: pachi_nn vs oakfoam_nn. This was the first surprise: pachi_nn won the match 5-3. It seems that oakfoam has some problems with time management. It plays at about the right pace in 1-minute or 5-minute games, but in 20-minute games it only uses 6 or 7 minutes total, so it's giving pachi a bit of an advantage. It seemed to be ahead in the opening and early middlegame of each game, and then managed to misread something and lose.

Round 2: LZ_91_c2t d pachi_nn 6-2.

Round 3: leela_c2t d LZ_91_c2t 5-3

Round 4: LM_W11_c d leela_c2t 5-3

Round 5: LM_W11_c d dream 6-2. Another surprise: I was expecting dream to do a bit better against a CPU-only engine. Here there were two games with "disputed" scores (both engines agreed that LM_W11_c had won, but gave different winning margins): each game involved a seki.

Round 6: LZ_zed d LM_W11_c 5-3

Round 7: leela vs LZ_zed was a 4-4 tie. I decided to add a tiebreaker match: leela d LM_W11_c 6-2. This put leela on top of the list, because it did a better job of beating up LM.

Round 8: ray_ELF d leela 8-0. Backtrack: ray_173 d leela 6-2. Again there was one game with disputed score, another seki. ray_173 d ray_ELF 5-3, not what I expected! At this point, BayesElo had both ray_173 and ray_ELF on exactly the same rating (3082): 173 had won the head to head, but ELF did better against leela, and these factors cancelled out. There was also another game with disputed result (agreed that ELF won, but disagreed on the amount), but not a seki this time; instead, the scoring was messed up because ray_173 passed before the game was actually finished. (No harm done, it was losing anyway.)

Round 9: Instead of running another tiebreaker match, I decided to just give the next engine 6 games each against the tied leaders:
  • ray_173 vs LM_W11 3-3
  • ray_ELF d LM_W11 4-2
Another surprise: I'd expected LM_W11 to be stronger.

Round 10: ray_173_6t d ray_ELF 8-0

Remember that ray defaults to one thread; in 1-minute or 5-minute games it gets a little stronger with extra threads, but not by a huge amount. It looks like it gets a lot more benefit from those extra threads in slower games! (LZ uses two threads by default, and seems to actually get weaker with extra threads, at least in short games. But maybe it's worth retesting this theory in longer games?)

Backtrack: ray_173_2t d ray_ELF 5-3; ray_173_6t d ray_173_2t 7-1. In the 6t vs 2t match, there were two games with disputed result: both players passed early and disagreed on who was ahead. I decided to step in as referee and, looking at the positions, awarded one game to each player.

Round 11: LM_Z2 d ray_173_6t 7-1. Two more games where ray passed early from a losing position.

And throwing all this into BayesElo, the rating list so far is:
Code:
Name        Elo   Elo+  Elo-  games  score  avg_opp
LM_Z2       3726  312   203   8      88%    3505
ray_173_6t  3505  165   154   24     67%    3350
ray_173_2t  3212  169   177   16     38%    3309
ray_ELF     3113  112   111   38     47%    3140
ray_173     3087  143   133   22     64%    2991
LM_W11      3053  166   179   12     42%    3100
leela       2823  119   125   32     38%    2922
LZ_zed      2796  156   149   16     56%    2758
LM_W11_c    2692  109   109   32     50%    2697
leela_c2t   2618  150   150   16     50%    2620
dream       2553  197   256   8      25%    2692
LZ_91_c2t   2547  156   150   16     56%    2509
pachi_nn    2400  146   151   16     44%    2442
oakfoam_nn  2336  193   217   8      38%    2400


To be continued...

Author:  Kris Storm [ Sun Oct 14, 2018 12:18 pm ]
Post subject:  Re: Home-made Elo ratings for some engines

Hi xela. It's a good idea to do this kind of comparison.

How are you using BayesELO with SGF files? I found it useful only for chess PGN files.

Author:  xela [ Sun Oct 14, 2018 3:06 pm ]
Post subject:  Re: Home-made Elo ratings for some engines

Kris Storm wrote:
How are you using BayesELO with SGF files?

I've written a few lines of Python code to read the *.dat files created by GoGui and output the results as PGN. (You could do this manually too for a small number of games, and you could do it just as well from the SGF instead of the DAT.) The PGN file can be pretty minimal. BayesElo doesn't need the moves of the game; it's happy running on something that looks like this:
Code:
[White "leela_c"][Black "LM_E8_c"][Result "1-0"] 1-0
[White "LM_E8_c"][Black "leela_c"][Result "1-0"] 1-0
[White "LM_E8_c"][Black "leela_c"][Result "1-0"] 1-0
[White "LM_E8_c"][Black "leela_c"][Result "1-0"] 1-0
[White "leela_c"][Black "LM_W11_c"][Result "1-0"] 1-0
[White "leela_c"][Black "LM_W11_c"][Result "1-0"] 1-0
[White "LM_W11_c"][Black "leela_c"][Result "1-0"] 1-0
[White "LM_W11_c"][Black "leela_c"][Result "1-0"] 1-0

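The conversion step can be sketched in Python. This is not xela's actual script (the GoGui *.dat parsing is omitted); it just assumes the results have already been collected as (winner, loser) pairs and emits the minimal PGN shown above:

```python
def results_to_pgn(results):
    """Convert a list of (winner, loser) engine-name pairs into the
    minimal PGN that BayesElo accepts: winner listed as White, result
    always 1-0 (fine here because "advantage 0" ignores colours)."""
    lines = []
    for winner, loser in results:
        lines.append('[White "%s"][Black "%s"][Result "1-0"] 1-0'
                     % (winner, loser))
    return "\n".join(lines)

# Example: two results, written in the same shape as the sample above
print(results_to_pgn([("leela_c", "LM_E8_c"), ("LM_E8_c", "leela_c")]))
```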

Then I feed these commands to BayesElo:
Code:
readpgn filename.pgn
elo
offset 2000
advantage 0
drawelo 0.01
mm
exactdist
ratings


The "advantage 0" part means that it doesn't care who played black or white, so I can put the winner's name first in my PGN file and all the results as 1-0, which makes it simpler to create the PGN. There was a forum post somewhere by Rémi Coulom recommending the "advantage 0" and "drawelo 0.01" settings for go games. The "offset 2000" part means that the average rating of the outputs will be 2000; I have another Python script which changes 2000 to a different number, which is how I anchor the ratings (run it twice, figure out which offset will put gnugo at 1500).
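The re-anchoring arithmetic is straightforward, since changing the offset shifts every rating by the same amount. A minimal sketch (the function name and the example gnugo rating are hypothetical, not from xela's actual script):

```python
def anchor_offset(current_offset, observed_rating, target_rating=1500):
    """Given the rating an anchor engine received in a BayesElo run with
    some offset, return the offset that would move it to target_rating.
    Works because BayesElo's offset shifts all ratings uniformly."""
    return current_offset + (target_rating - observed_rating)

# Hypothetical example: a run with "offset 2000" gave gnugo a rating of
# 1720, so rerunning with "offset 1780" would put gnugo at 1500.
print(anchor_offset(2000, 1720))  # 1780
```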
