Life In 19x19
http://lifein19x19.com/

Home-made Elo ratings for some engines
http://lifein19x19.com/viewtopic.php?f=18&t=16086
Page 1 of 2

Author:  xela [ Tue Sep 18, 2018 6:32 pm ]
Post subject:  Home-made Elo ratings for some engines

Just how far ahead of us puny humans is Leela Zero by now? On a home PC, is it at human pro strength, or is it already superhuman? How much difference does it make whether or not you use a GPU?

Inspired by the excellent Engine Tournament, I'm trying to calculate some Elo ratings for a few engines. (I know that CGOS has already done this, but it's very hard to get information about exactly what hardware, software and configuration was used for those engines.)

The good news about using Elo, compared with a league tournament format: it doesn't just tell you "this engine is stronger than that one", it also measures how big a difference it is. Using BayesElo, you can even get error bounds, so you can see roughly how accurate the ratings are.

The bad news: you need a much larger number of games to get accurate ratings. I won't be able to run 50 engines at an hour per player per game and play the 1000 or so games you'd need for high quality data.
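To see why so many games are needed: under the standard Elo model, a rating gap d maps to an expected score through a logistic curve, and inverting that curve on a win rate estimated from only a handful of games gives a very noisy rating. A quick Python sketch (function names are mine, not from BayesElo):

```python
import math

def expected_score(d):
    """Expected score for a player rated d Elo points above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def elo_diff(score):
    """Invert the logistic curve: Elo gap implied by an observed score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_uncertainty(score, n_games):
    """Rough 1-sigma error on the Elo gap estimated from n_games,
    propagating the binomial noise in the observed score."""
    sigma_score = math.sqrt(score * (1.0 - score) / n_games)
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))  # d(elo_diff)/d(score)
    return slope * sigma_score

print(round(expected_score(200), 2))    # 0.76: a 200-point gap wins ~3 games in 4
print(round(elo_diff(0.64)))            # 100: a 64% score suggests a ~100-point gap
print(round(elo_uncertainty(0.5, 20)))  # 78: with only 20 games, the error is ~78 Elo
```

That last number is roughly why the Elo+/Elo- columns below sit near 100-200 for engines with only a couple of dozen games.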

What I've done so far: play a bunch of games at 1 minute absolute time, for a quick check that I've configured everything correctly (actually I caught a few mistakes this way), and to get a ballpark estimate of the ratings. Then more games at 5 minutes, for something that I hope is slightly more accurate.

Soon I plan to start a series at 20 minutes per player per game, so we have some data at roughly human-like time controls. I'll have to limit this series to about 15 or 20 engines, otherwise it will take years to generate enough data. But first there are a few more engines and configurations I want to try out.
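For a sense of scale, the back-of-envelope arithmetic (assuming games run back to back and both players use their full clock, which is the worst case for absolute time):

```python
def days_needed(n_games, minutes_per_player):
    """Wall-clock days to play n_games sequentially, worst case:
    both players use their entire absolute-time allotment."""
    total_minutes = n_games * 2 * minutes_per_player
    return total_minutes / (60 * 24)

print(round(days_needed(1000, 1), 1))  # 1.4 days for a 1-minute series
print(round(days_needed(1000, 20)))    # 28 days of nonstop play at 20 min/player
```

And that is for one pool of engines; every extra engine added to a round-robin multiplies the number of pairings.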

My system:
GTX 1070 GPU, 1920 cores, 8GB memory
Ryzen 5 2600 CPU, 6 cores (12 threads)
16 GB RAM
Linux operating system: Ubuntu 18.04


Engines tested so far:
I've used short names for the engines, so that when I view the full crosstable I can fit more columns on my screen :-) I hope this isn't too cryptic...

Leela versions:
LZ is Leela Zero version 0.15
Numbers are the network number from https://zero.sjeng.org/
for example LZ_174 is LZ with the 256x40 network c9d70c41
LZ_ELF is Leela Zero with the ELF weights from http://physik.de/CNNlast.tar.gz
LM is Leela Zero version 0.15 using one of the Leela Master networks from https://github.com/pangafu/LeelaMasterWeight
Just plain "leela" is the one from https://sjeng.org/leela.html, version 0.11
_c on the end means CPU-only mode
_1t means running with one thread only, similar for other numbers
By default, LZ in GPU mode uses 2 threads, LZ in CPU mode uses 12 threads, leela uses 12 threads in either mode

ray is the Ray lz branch from https://github.com/zakki/Ray.git checked out on 12th September, using Leela Zero weights

oakfoam_nn is Oakfoam 0.2.1-dev with the included nicego-cnn-06.gtp configuration file, meaning that it uses a neural network
Plain oakfoam is Oakfoam 0.1.3 with no configuration
Other oakfoams are failed attempts at getting a better configuration, before I figured out how to make oakfoam_nn work

pachi_nn is pachi 12.10 using the network http://physik.de/CNNlast.tar.gz
pachi is pachi 12.10 with the --nodcnn option
pachi_monte and pachi_pat are alternative engines for pachi (with the "-e" option), which turned out to be not very good

fuego is version 1.1.SVN

gnugo is version 3.8
gnugo_M is gnugo with more memory (cache size increased from default of 80M to 7G)
gnugo_l1 is gnugo on level 1; similarly for gnugo_l4 and gnugo_l7

You'll also notice two 1-minute games for AQ. It won one and crashed in the other. I decided that AQ was too unstable on my machine for further testing, so it doesn't appear in the 5-minute series.


Results so far at 1 minute time limit:
Code:
Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          3574  176   175   12     67%    3481
LM_GX47         3568  174   156   28     86%    3179
LZ_174          3441  165   156   18     61%    3359
LZ_ELF_6t       3433  174   199   12     33%    3528
ray_ELF_12t     3325  143   131   24     63%    3244
LZ_173          3320  117   112   42     62%    3218
LZ_141          3249  140   133   30     67%    3096
LM_E8           3214  150   142   32     72%    2974
LZ_116          3162  97    95    58     55%    3123
LM_W11          3135  159   162   24     58%    3039
LZ_174_6t       3129  148   160   24     42%    3177
LM_Z2           3087  116   107   42     69%    2931
ray_173_6t      3069  161   169   16     44%    3107
ray_W11_12t     3022  215   275   8      25%    3168
ray_173_12t     3021  182   196   12     42%    3072
LM_B5           2966  110   109   44     57%    2903
ray_173_2t      2963  172   164   16     56%    2923
LZ_91           2826  104   114   60     27%    3049
leela           2813  141   137   38     53%    2780
ray_ELF         2785  219   280   10     20%    3020
ray_W11         2646  175   176   16     44%    2707
ray_173         2605  197   240   12     25%    2785
oakfoam_nn      2577  122   122   74     66%    2308
LM_B5_c         2574  194   171   12     67%    2484
LZ_116_c2t      2574  120   124   44     34%    2756
LM_GX47_c       2535  184   173   12     58%    2491
LM_E8_c         2501  194   194   10     50%    2502
LZ_57           2491  197   197   26     65%    2239
LZ_116_c6t      2468  185   189   12     50%    2460
LM_W11_c        2417  171   194   12     33%    2510
LM_Z2_c         2410  186   220   10     30%    2520
LZ_91_c2t       2169  178   166   18     56%    2132
AQ              2116  383   383   2      50%    2116
leela_c         2116  110   106   78     63%    1948
leela_c1t       2100  210   187   12     67%    1984
leela_c2t       2083  174   180   16     38%    2204
pachi_nn        2022  109   105   74     66%    1826
pachi           1816  125   122   66     56%    1772
leela_nonet     1782  104   101   88     58%    1717
gnugo           1500  88    83    84     64%    1402
gnugo_l7        1498  119   121   52     38%    1631
LZ_57_c2t       1492  246   218   8      63%    1420
gnugo_M         1470  139   133   34     53%    1471
gnugo_l1        1451  90    88    84     48%    1508
gnugo_l4        1435  140   139   32     47%    1489
leela_nonet_1t  1390  240   349   10     10%    1811
oakfoam1        1363  126   122   32     56%    1320
pachi_pat       1339  383   368   2      50%    1339
fuego           1339  90    90    78     37%    1571
oakfoam_book    1256  112   119   40     38%    1359
pachi_1t        1214  197   267   14     14%    1521
oakfoam         1195  92    101   72     25%    1433
oakfoam2        1152  129   153   30     23%    1353
pachi_plain     1151  345   325   2      0%     1339
pachi_monte     1151  345   325   2      0%     1339
michi           1135  301   311   4      0%     1420
matilda         1065  138   181   44     9%     1504


Results so far at 5 minute time limit:
Code:
Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          4561  -71   184   28     82%    4314
LM_GX47         4436  51    128   34     68%    4287
LZ_ELF_6t       4410  74    119   36     64%    4302
LZ_173          4270  111   112   48     54%    4197
LZ_141          4252  98    96    68     62%    4126
LZ_174          4239  110   107   52     62%    4110
ray_ELF_12t     4133  159   172   18     44%    4159
LZ_174_6t       4054  92    89    78     55%    4026
ray_173_12t     3957  164   165   17     47%    3981
LM_Z2           3887  98    99    65     51%    3843
ray_173_6t      3855  157   175   18     33%    3975
LZ_116          3816  92    93    84     51%    3793
LM_B5           3808  148   158   26     46%    3820
LM_E8           3784  134   129   36     58%    3718
LM_W11          3762  117   110   50     64%    3646
ray_W11_12t     3593  157   157   24     54%    3553
ray_173_2t      3585  186   212   16     31%    3750
ray_ELF         3484  149   157   36     28%    3750
ray_173         3475  175   189   20     30%    3676
leela           3380  98    104   82     40%    3438
LZ_91           3274  128   136   42     31%    3466
ray_W11         3175  162   161   24     46%    3234
LM_E8_c         3037  142   120   34     76%    2841
LZ_116_c2t      3000  115   111   56     63%    2883
LM_W11_c        2870  117   115   34     53%    2849
LM_GX47_c       2853  104   104   44     48%    2872
oakfoam_nn      2851  94    90    66     56%    2814
leela_c         2784  95    97    66     52%    2751
leela_c2t       2747  82    83    78     49%    2752
LM_B5_c         2631  173   192   14     36%    2731
LZ_91_c2t       2625  115   112   48     56%    2566
LM_Z2_c         2597  139   141   24     46%    2634
LZ_57           2589  129   130   30     43%    2651
leela_c1t       2519  104   111   60     40%    2626
pachi_nn        2419  107   114   60     38%    2530
pachi           2130  116   112   76     61%    2003
leela_nonet     2093  142   152   40     38%    2211
LZ_57_c2t       2042  157   145   28     71%    1830
fuego           1878  119   118   50     62%    1751
pachi_1t        1790  126   124   42     57%    1720
leela_nonet_1t  1788  131   126   40     63%    1677
gnugo           1500  122   7     66     26%    1755
oakfoam         1437  334   -58   10     20%    1772
michi           1435  210   -57   24     25%    1675
oakfoam1        1272  406   -221  12     0%     1836
matilda         1260  420   -232  10     0%     1808
oakfoam_book    1241  356   -251  16     6%     1628


Edited 24th September: crosstables attached in CSV format, with a count of how many games each engine has played against each opponent.

The "Elo" column is the rating. Elo+ and Elo- are error bounds (to be pedantic, they're Bayesian credible intervals, not to be confused with frequentist confidence intervals). So for example, in the 1-minute ratings with LZ_ELF at Elo=3574, Elo+ = 176, Elo- = 175, this means that BayesElo thinks there's a 95% chance of the true rating being between 3399 and 3750.

In the 5-minute ratings, you'll see some negative numbers near the top and bottom. I think this is a symptom of a skewed probability distribution: BayesElo can tell that LZ_ELF is "a lot stronger" than the other engines, but there isn't enough data to measure exactly how much stronger. This time last week I was seeing a lot more minus signs, and they're gradually going away as I add more data.

I've offset the ratings to put gnugo at 1500 each time, on the principle that gnugo is theoretically around 5K. This should mean that the BayesElo ratings are more or less in line with EGF ratings (plus or minus a couple of hundred rating points), and also not too far away from Rémi Coulom's ratings for pros. Looking at fuego and pachi, this seems to be in the right ballpark. So we have some weak evidence that the strongest CPU-only engines on a home PC can play at around top amateur or low pro level, and the good GPU-accelerated engines are already superhuman, at least in 5-minute games.
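The offset itself is just a constant shift applied to BayesElo's output: Elo is only defined up to an additive constant, so differences between engines are unaffected. A sketch (the raw numbers here are made up for illustration):

```python
def anchor(ratings, anchor_name="gnugo", anchor_elo=1500):
    """Shift all ratings by one constant so anchor_name sits at anchor_elo.
    Differences between engines are preserved exactly."""
    offset = anchor_elo - ratings[anchor_name]
    return {name: elo + offset for name, elo in ratings.items()}

raw = {"LZ_ELF": 3946, "gnugo": 1872, "fuego": 1711}  # made-up raw BayesElo output
print(anchor(raw))  # {'LZ_ELF': 3574, 'gnugo': 1500, 'fuego': 1339}
```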

It will take a couple of months for me to get similar data for 20-minute games. I'll update here some time (don't hold your breath, I'm very good at procrastination!)

Attachments:
5_min_crosstable-2018-09-19.csv [7.05 KiB]
1_min_crosstable-2018-09-19.csv [9.62 KiB]

Author:  EdLee [ Wed Sep 19, 2018 1:59 am ]
Post subject: 

xela, Thanks. :tmbup:

Author:  Uberdude [ Wed Sep 19, 2018 2:20 am ]
Post subject:  Re: Home-made Elo ratings for some engines

Very nice, thanks xela. Could you please add in LZ #157, the best 15-block network? I use that a lot, as I think it gives better performance at shortish time limits than the deeper networks (to read a ladder, the superior judgement of a 40-block network doesn't help if it only has 200* playouts, but 800 playouts of a less-skilled network enables the ladder to be read).

* exact numbers not guaranteed, and with more training a deeper network may be able to read ladders with few playouts, but we aren't there yet afaik.

Author:  xela [ Thu Sep 20, 2018 2:52 pm ]
Post subject:  Re: Home-made Elo ratings for some engines

Uberdude wrote:
Could you please add in LZ #157, the best 15 block network?


Good suggestion. I chose the 15-block network LZ 141 because it's the same one used for the engine tournament. It'll be interesting to see how much stronger 157 is. I'll include it in the next update.

Author:  EdLee [ Thu Sep 20, 2018 8:37 pm ]
Post subject: 

Hi xela,

An engine (Taiwan-based?) on IGS, the username is
leelazero ( one word, all lowercase ).
Its info includes "GTX970. zero.sjeng.org "

Any possibility to extrapolate or guesstimate its Elo range?
( It has the (small avalanche) ladder problem that people exploit. ) Thanks.

This page has an Elo graph, roughly at 12,700 ?

Author:  mb76 [ Fri Sep 21, 2018 2:50 pm ]
Post subject:  Re: Home-made Elo ratings for some engines

Could you please add in the LZ zediir network, described as "Supervised. From the TYGEM dataset"? https://github.com/gcp/leela-zero/issues/884

Author:  xela [ Mon Sep 24, 2018 3:21 am ]
Post subject: 

EdLee wrote:
Hi xela,

An engine (Taiwan-based?) on IGS, the username is
leelazero ( one word, all lowercase ).
Its info includes "GTX970. zero.sjeng.org "

Any possibility to extrapolate or guesstimate its Elo range?


Sorry, that's not enough information to work with. At https://zero.sjeng.org/ there is a list of 178 different networks that leelazero can use. If you can find out which network this engine uses, then we can make some guesses.

Author:  xela [ Mon Sep 24, 2018 3:21 am ]
Post subject:  Re: Home-made Elo ratings for some engines

mb76 wrote:
Could you please add in the LZ zediir network, described as "Supervised. From the TYGEM dataset"? https://github.com/gcp/leela-zero/issues/884


Will do. I'll have some results to show in a couple of days. Thanks for the suggestion.

Author:  xela [ Tue Sep 25, 2018 8:00 pm ]
Post subject:  Re: Home-made Elo ratings for some engines

New this week:
  • Added DreamGo. I was hoping for a low dan level bot that would fill the large rating gap between pachi and pachi_nn. Unfortunately (!) the latest DreamGo is actually quite a lot stronger than that!
  • Added LZ 157, the strongest 192x15 network. In fast games, it turns out to be a bit stronger than some of the bigger networks. I'd expect this to change in slower games.
  • Added LZ zediir weights (LZ_zed). This turned out to be weaker than I expected, but again we might see a different story with slower games.
  • Played a few more games with the other engines, to try and reduce some of the error margins on the ratings.

Results so far at 1 minute time limit, based on 986 games with 59 engines:
Code:
Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          3594  165   154   16     63%    3521
LM_GX47         3559  156   140   32     78%    3239
LZ_157          3550  134   124   28     64%    3451
LZ_174          3490  149   143   22     59%    3420
LZ_ELF_6t       3483  146   158   18     39%    3548
LZ_173          3350  111   108   48     60%    3245
ray_ELF_12t     3347  136   129   26     58%    3297
LZ_141          3294  141   135   32     66%    3132
LM_E8           3239  143   137   36     69%    3017
LZ_116          3186  97    95    58     55%    3141
LZ_174_6t       3162  148   157   26     46%    3176
LM_W11          3152  162   167   24     58%    3048
LM_Z2           3096  120   111   42     69%    2931
ray_173_6t      3096  149   149   20     50%    3096
ray_173_12t     3059  164   164   16     50%    3061
LM_B5           2999  106   103   50     60%    2911
ray_W11_12t     2961  123   129   30     40%    3032
ray_173_2t      2952  139   135   24     54%    2926
leela           2862  118   113   50     58%    2785
LZ_zed          2831  136   152   26     31%    2981
LZ_91           2804  105   113   64     27%    3040
ray_ELF         2730  133   138   28     39%    2834
ray_173         2678  186   202   14     36%    2785
ray_W11         2659  148   148   22     45%    2703
dream           2627  151   160   24     54%    2525
oakfoam_nn      2571  120   122   76     64%    2325
LM_B5_c         2554  194   171   12     67%    2464
LZ_116_c2t      2551  119   123   46     33%    2754
LM_GX47_c       2514  184   173   12     58%    2470
LZ_57           2483  199   197   26     65%    2239
LM_E8_c         2480  194   194   10     50%    2481
LZ_116_c6t      2451  184   189   12     50%    2444
LM_W11_c        2397  171   195   12     33%    2490
LM_Z2_c         2389  186   220   10     30%    2499
LZ_91_c2t       2166  177   165   18     56%    2129
leela_c         2114  116   111   78     62%    1962
leela_c1t       2099  210   187   12     67%    1983
leela_c2t       2082  174   180   16     38%    2202
pachi_nn        2022  109   105   76     64%    1846
pachi           1817  125   122   68     54%    1796
leela_nonet     1781  104   101   88     58%    1717
gnugo           1500  88    83    84     64%    1401
gnugo_l7        1498  119   122   52     38%    1630
LZ_57_c2t       1492  246   218   8      63%    1419
gnugo_M         1469  139   133   34     53%    1471
gnugo_l1        1451  90    88    84     48%    1508
gnugo_l4        1435  140   139   32     47%    1489
leela_nonet_1t  1390  240   346   10     10%    1809
oakfoam1        1363  126   122   32     56%    1319
pachi_pat       1339  384   360   2      50%    1339
fuego           1339  90    90    78     37%    1571
oakfoam_book    1256  112   119   40     38%    1358
pachi_1t        1213  198   262   14     14%    1522
oakfoam         1195  92    101   72     25%    1433
oakfoam2        1151  129   154   30     23%    1353
pachi_monte     1151  348   288   2      0%     1339
pachi_plain     1151  348   288   2      0%     1339
michi           1134  304   274   4      0%     1419
matilda         1064  139   172   44     9%     1504



Results so far at 5 minute time limit, based on 1310 games with 50 engines:
Code:
Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          4556  -23   123   46     67%    4427
LM_GX47         4515  16    117   44     66%    4378
LZ_ELF_6t       4506  25    109   48     65%    4390
LZ_157          4463  64    130   28     54%    4437
LZ_173          4328  97    97    62     55%    4254
LZ_141          4309  94    94    74     59%    4190
LZ_174          4279  106   103   60     63%    4136
ray_ELF_12t     4148  109   113   42     45%    4178
LZ_174_6t       4114  89    86    88     57%    4057
ray_173_12t     3944  109   114   42     38%    4040
LM_Z2           3932  99    100   68     53%    3860
LM_B5           3918  113   111   44     57%    3850
ray_173_6t      3860  99    101   48     44%    3911
LZ_116          3854  89    91    90     51%    3826
LM_E8           3778  115   112   46     54%    3748
LM_W11          3770  111   107   56     61%    3668
ray_173_2t      3691  119   123   44     50%    3680
ray_W11_12t     3633  125   114   44     68%    3487
ray_ELF         3472  113   114   54     35%    3667
leela           3400  97    99    88     44%    3436
ray_173         3377  107   115   50     30%    3571
LZ_zed          3359  107   106   52     48%    3399
LZ_91           3272  99    104   66     32%    3456
dream           3218  126   128   34     56%    3107
ray_W11         3156  111   114   44     43%    3225
LM_E8_c         3048  111   106   50     60%    2971
LZ_116_c2t      3002  115   110   56     63%    2888
LM_W11_c        2868  118   116   34     53%    2849
oakfoam_nn      2849  93    90    68     54%    2829
LM_GX47_c       2848  105   105   44     48%    2869
leela_c         2771  91    91    72     51%    2746
leela_c2t       2752  82    83    78     49%    2757
LM_B5_c         2662  129   135   26     42%    2718
LZ_91_c2t       2638  116   113   48     56%    2576
LM_Z2_c         2618  123   124   30     47%    2646
LZ_57           2600  128   131   30     43%    2660
leela_c1t       2526  104   111   60     40%    2631
pachi_nn        2430  106   112   64     39%    2543
pachi           2137  111   108   80     58%    2033
LZ_57_c2t       2093  132   121   40     70%    1900
leela_nonet     2086  137   149   42     36%    2185
fuego           1865  107   105   72     65%    1691
pachi_1t        1858  119   115   54     65%    1690
leela_nonet_1t  1856  124   117   52     69%    1652
gnugo           1500  120   -33   106    20%    1791
michi           1466  206   -68   40     55%    1432
oakfoam1        1287  341   -246  28     43%    1430
oakfoam         1068  521   -465  26     27%    1386
oakfoam_book    998   573   -534  32     13%    1435
matilda         975   603   -557  26     15%    1407

Attachment:
5_min_crosstable-2018-09-26.csv [7.95 KiB]


Next I want to try LZ with the Phoenix weights. After that, I might start the 20-minute series.

Attachments:
1_min_crosstable-2018-09-26.csv [10.27 KiB]

Author:  xela [ Wed Oct 03, 2018 6:06 am ]
Post subject:  Re: Home-made Elo ratings for some engines

New this week:
  1. Added LZ with Phoenix weights. This doesn't do so well in fast games, I'd expect it to overtake LZ 157 in slower games. I'll get to that in a few weeks...
  2. I realised that DreamGo has pondering turned on by default. So I've renamed dream from last week's update to dream_ponder, and added a new dream with pondering off. It looks like pondering is worth about 100 rating points in 5-minute games, and less in fast games.
  3. Played some ray vs ray games, now that I know how to make ray play against itself with two different weight files. This helps with making the ratings more accurate (no more negative errors at the top of the table now).

Results so far at 1 minute time limit, based on 1326 games with 61 engines:
(edited 9th October: subtract 372 from all ratings to put gnugo at 1500, consistent with my other rating lists)
Code:
Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_157          3780  98    91    60     72%    3614
LM_GX47         3758  102   95    62     73%    3517
LZ_ELF          3717  101   98    48     58%    3650
LZ_ELF_6t       3636  96    101   48     42%    3693
LZ_174          3590  87    88    64     47%    3611
ray_ELF_12t     3542  93    95    58     45%    3578
LZ_173          3519  115   112   48     60%    3411
LZ_141          3471  115   114   44     59%    3359
LM_E8           3450  116   115   50     64%    3249
LZ_116          3369  99    97    58     55%    3318
LZ_174_6t       3325  112   113   42     50%    3315
ray_173_6t      3311  112   110   36     53%    3293
LM_Z2           3308  98    94    60     67%    3137
ray_173_12t     3289  115   113   34     53%    3269
LM_W11          3259  111   115   44     50%    3230
LM_B5           3178  102   101   56     59%    3076
LZ_phoenix      3156  119   123   36     39%    3265
ray_W11_12t     3150  114   118   36     42%    3212
ray_173_2t      3103  129   129   28     50%    3104
LZ_zed          3010  122   126   34     41%    3088
leela           2990  116   116   54     54%    2925
LZ_91           2935  100   106   74     30%    3149
ray_ELF         2899  129   127   32     47%    2949
ray_173         2887  133   128   32     56%    2843
ray_W11         2778  110   108   44     52%    2767
dream_ponder    2759  117   121   40     53%    2679
dream           2637  121   122   34     47%    2658
oakfoam_nn      2624  121   123   82     62%    2406
LM_GX47_c       2561  130   124   30     57%    2520
LZ_116_c2t      2497  111   113   58     31%    2768
LM_E8_c         2472  140   140   22     50%    2472
LM_B5_c         2471  134   134   24     50%    2471
LZ_116_c6t      2450  137   138   24     50%    2446
LZ_57           2373  116   117   50     52%    2336
LM_Z2_c         2370  121   114   36     61%    2291
LM_W11_c        2306  125   133   28     39%    2377
leela_c1t       2202  108   108   42     52%    2177
leela_c2t       2135  128   136   30     37%    2248
LZ_91_c2t       2133  137   140   26     46%    2160
leela_c         2126  103   101   88     59%    2004
pachi_nn        2028  110   107   76     64%    1855
pachi           1818  126   123   68     54%    1807
leela_nonet     1783  105   102   88     58%    1722
gnugo           1500  88    83    84     64%    1402
gnugo_l7        1498  120   122   52     38%    1633
LZ_57_c2t       1492  246   218   8      63%    1419
gnugo_M         1470  139   133   34     53%    1472
gnugo_l1        1451  91    89    84     48%    1509
gnugo_l4        1435  140   139   32     47%    1490
leela_nonet_1t  1386  241   335   10     10%    1788
oakfoam1        1363  126   122   32     56%    1319
pachi_pat       1338  385   332   2      50%    1338
fuego           1338  90    90    78     37%    1573
oakfoam_book    1256  113   119   40     38%    1359
pachi_1t        1213  198   235   14     14%    1522
oakfoam         1195  92    101   72     25%    1434
oakfoam2        1152  130   151   30     23%    1353
pachi_monte     1151  356   211   2      0%     1338
pachi_plain     1151  356   211   2      0%     1338
michi           1134  311   197   4      0%     1419
matilda         1064  140   127   44     9%     1505
Attachment:
1_min_crosstable-2018-10-03.csv [11.05 KiB]

Results so far at 5 minute time limit, based on 1396 games with 52 engines:
Code:
Name            Elo   Elo+  Elo-  games  score  avg_opp
LZ_ELF          4544  11    112   46     67%    4415
LM_GX47         4504  49    111   44     66%    4370
LZ_ELF_6t       4496  56    104   48     65%    4381
LZ_157          4457  89    123   32     59%    4385
LZ_173          4307  94    93    66     55%    4242
LZ_174          4291  102   97    64     66%    4135
LZ_141          4280  90    90    78     58%    4188
ray_ELF_12t     4145  102   105   46     46%    4173
LZ_phoenix      4133  121   122   36     47%    4154
LZ_174_6t       4120  84    82    92     57%    4066
ray_173_12t     3969  103   108   46     39%    4054
LM_Z2           3933  95    97    72     50%    3885
LM_B5           3932  112   110   44     57%    3862
LZ_116          3881  86    88    94     51%    3848
ray_173_6t      3874  99    101   48     44%    3924
LM_E8           3794  114   112   46     54%    3761
LM_W11          3786  110   107   56     61%    3682
ray_173_2t      3706  119   123   44     50%    3693
ray_W11_12t     3649  125   114   44     68%    3503
ray_ELF         3487  113   114   54     35%    3679
leela           3414  98    100   88     44%    3446
ray_173         3392  107   115   50     30%    3585
LZ_zed          3372  108   106   52     48%    3410
LZ_91           3289  95    98    71     35%    3446
dream_ponder    3232  117   116   40     58%    3118
ray_W11         3172  105   105   50     46%    3218
dream           3130  129   129   29     52%    3114
LM_E8_c         3035  109   104   52     58%    2977
LZ_116_c2t      2988  107   105   62     58%    2912
LM_W11_c        2874  116   114   36     53%    2856
LM_GX47_c       2842  105   105   44     48%    2864
oakfoam_nn      2839  92    89    70     53%    2833
leela_c         2759  91    92    72     51%    2738
leela_c2t       2740  82    83    78     49%    2746
LZ_91_c2t       2648  105   101   56     59%    2568
LM_Z2_c         2601  109   110   38     47%    2624
LM_B5_c         2599  114   121   34     38%    2683
LZ_57           2597  113   114   38     45%    2645
leela_c1t       2535  95    100   68     41%    2624
pachi_nn        2429  106   113   64     39%    2542
pachi           2137  112   108   80     58%    2034
LZ_57_c2t       2093  132   121   40     70%    1900
leela_nonet     2086  137   150   42     36%    2186
fuego           1865  107   105   72     65%    1691
pachi_1t        1858  119   115   54     65%    1690
leela_nonet_1t  1856  124   117   52     69%    1652
gnugo           1500  131   -57   106    20%    1791
michi           1466  219   -92   40     55%    1432
oakfoam1        1287  360   -270  28     43%    1430
oakfoam         1068  543   -489  26     27%    1386
oakfoam_book    998   595   -558  32     13%    1435
matilda         975   625   -581  26     15%    1407
Attachment:
5_min_crosstable-2018-10-03.csv [8.51 KiB]

This week I'm going to start playing some matches with 20 minutes per player per game. This will be with a smaller collection of engines, so that we'll get some results this year.

Author:  Uberdude [ Wed Oct 03, 2018 7:11 am ]
Post subject:  Re: Home-made Elo ratings for some engines

157 hero! :clap:

Author:  pnprog [ Sat Oct 06, 2018 9:20 pm ]
Post subject:  Re: Home-made Elo ratings for some engines

Hi!
Very interested in the thread :)

xela wrote:
  • I realised that DreamGo has pondering turned on by default. So I've renamed dream from last week's update to dream_ponder, and added a new dream with pondering off. It looks like pondering is worth about 100 rating points in 5-minute games, and less in fast games.
But when DreamGo is playing with pondering on, my understanding is that:
  • Not only will it increase its own level
  • But it will also decrease its opponent's level, by taking away some of the computing power the opponent needs, no? This in turn makes the opponent appear weaker, and explains the big difference in Elo?

Like, imagine I run the tournament on a simple computer: 1000MHz CPU, one thread, no GPU; then it's like comparing:
  • DreamGo (1000MHz) VS Pachi (1000MHz)
  • DreamGo (1000MHz + pondering at 500MHz) VS Pachi (500MHz)
We can expect Pachi to be significantly weaker, while facing a DreamGo boosted a little by pondering?

For the dream_ponder entry, what you would like to have is:
  • DreamGo (1000MHz + pondering at 1000MHz) VS Pachi (1000MHz)
Or have I misunderstood something?

Author:  xela [ Sun Oct 07, 2018 4:16 am ]
Post subject:  Re: Home-made Elo ratings for some engines

pnprog wrote:
But when DreamGo is playing with pondering on, my understanding is that:
  • Not only will it increase its own level
  • But it will also decrease its opponent's level, by taking away some of the computing power the opponent needs, no? This in turn makes the opponent appear weaker, and explains the big difference in Elo?

Correct. In fact, the difference in Elo ratings between dream and dream_ponder is actually smaller than I expected.

pnprog wrote:
For the dream_ponder entry, what you would like to have is:
  • DreamGo (1000MHz + pondering at 1000MHz) VS Pachi (1000MHz)
Or have I misunderstood something?

Yes, that would be a better way to test it. The fact is that I intended to run all engines without pondering, to avoid this type of complication. The inclusion of dream_ponder was an accident! I decided to leave it in the ratings list, rather than deleting it, because it's an interesting accident :-)

Author:  pnprog [ Mon Oct 08, 2018 5:46 am ]
Post subject:  Re: Home-made Elo ratings for some engines

So now, reading the EGF rating system page on Sensei's Library: they define one stone in strength as equivalent to 100 Elo, and that would make LeelaZero around 27 stones stronger than GnuGo. Something like a 21-dan amateur player :bow:

More seriously, if we fix GnuGo at a certain level (like 1500 Elo / 5k), what other data do we need to make our Elo scale comparable to the EGF rating?

Quote:
The inclusion of dream_ponder was an accident! I decided to leave it in the ratings list, rather than deleting it, because it's an interesting accident :-)
I am not really knowledgeable about those Elo ratings, but if we introduce a bot with skewed performance/rating, won't it affect the rating of all the bots on the scale? Like decrease the rating of the bots weaker than dream_ponder, and increase the rating of the bots stronger than dream_ponder?

Author:  xela [ Tue Oct 09, 2018 4:39 am ]
Post subject:  Re: Home-made Elo ratings for some engines

pnprog wrote:
So now, reading the EGF rating system page on Sensei, they indicate that one stone in strength is equivalent to 100 Elo, which would make LeelaZero around 27 stones stronger than Gnugo. Something like a 21 dan amateur player :bow:

More seriously, if we fix Gnugo at a certain level (say 1500 Elo / 5k), what other data would we need to make our Elo scale comparable to the EGF rating?

I think BayesElo is similar to EGF ratings, but not exactly the same. For a good comparison, we'd need to run the EGF rating algorithm on my engine vs engine games, or else collect some EGF tournament results and run BayesElo on those results to compare with EGF ratings. That's a whole other research project that I'm not going to start this year :-)

I think "LeelaZero around 27 stones stronger than Gnugo" is about right, but it could be anywhere between 20 and 35 stones really.

pnprog wrote:
I am not really knowledgeable about those Elo ratings, but if we introduce a bot with skewed performance/rating, won't it affect the rating of all the bots on the scale? Like decrease the rating of the bots weaker than dream_ponder, and increase the rating of the bots stronger than dream_ponder?

No, I don't think it matters. What does a "skewed performance rating" mean anyway? The bot was stronger than I thought it would be? But the BayesElo software doesn't read my mind, it only looks at the game results. Dream_ponder beats weaker bots, and loses against stronger ones, same behaviour as any other bot. I don't think it makes a difference to the ratings whether it gets those results by playing good moves, or by sabotaging the opponents (stealing memory or CPU cycles).

In any case, I've been anchoring the ratings to put GnuGo at 1500 every time, so this should help to keep things stable.

Author:  xela [ Tue Oct 09, 2018 5:24 am ]
Post subject:  Re: Home-made Elo ratings for some engines

Just for fun, let's do some dodgy mathematical analysis of the 1-minute and 5-minute results, to see if we can extrapolate what will happen in 20-minute games. (I'll post some actual 20-minute results tomorrow. I did the analysis last week, just didn't get around to posting until today.)

We already know that small networks beat bigger networks in fast games, but we might expect the bigger networks to catch up in slower games. Let's pretend that each engine/network combination has a "baseline" strength (how well it can play on minimal thinking time) plus an ability to get stronger with more time. There will be diminishing returns: you'd expect a big difference between 1 minute and 10 minutes, but not much difference between 60 minutes and 69 minutes. But the strength is theoretically unbounded (Monte Carlo search converges to the best move given unlimited time and unlimited memory).

So a half way reasonable model might be:

                              Elo rating = b + alpha times log(t)

where b is the baseline strength, t is the thinking time in minutes per player per game (absolute time, because I don't want to get into complications around byo-yomi), and alpha represents how well the engine/network can make use of extra thinking time.

At 1 minute time limits, t=1, log(t)=0, so b is just the 1-minute Elo rating. Then we can calculate alpha as (5-minute Elo minus 1-minute Elo)/log(4). (For me, log means natural log, because I did too much calculus as a teenager, so log(4) is about 1.386.) And then the expected 20-minute rating from this model would be b + alpha times log(19), or b+2.944 alpha.
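The fit described above is simple enough to sketch in a few lines of Python (the thread's own analysis used R; this is just an illustrative reimplementation of the same arithmetic, before the re-anchoring shift mentioned below):

```python
import math

def project(elo1, elo5):
    """Fit the model Elo(t) = b + alpha*log(t) to the 1-minute and
    5-minute ratings, using the post's time factors log(4) and log(19).
    Returns (alpha, projected 20-minute Elo before any re-anchoring)."""
    b = elo1                             # log(1) = 0, so b is the 1-minute rating
    alpha = (elo5 - elo1) / math.log(4)  # log(4) is about 1.386
    elo20 = b + alpha * math.log(19)     # log(19) is about 2.944
    return alpha, elo20

# e.g. LZ_ELF with adjusted ratings 3517 (1 min) and 4544 (5 min)
alpha, elo20 = project(3517, 4544)
print(round(alpha))  # 741, matching the alpha column in the table
```

An engine whose rating doesn't move between 1 and 5 minutes gets alpha=0, i.e. the model says it gains nothing from extra time.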

If we have gnugo at 1500 on both rating scales, then it gets alpha=0, meaning that gnugo gets no stronger when it thinks for a long time. Worse, a few of the weaker engines get negative numbers for alpha. I don't believe that, so I'm going to subtract 200 from all the 1-minute ratings, just to get some more reasonable alpha values.

Finally, this projects pachi_nn to be about 3300 in 20 minute games, which isn't realistic (it's nowhere near pro strength), so I'm going to subtract a few rating points from the results to put pachi_nn at 2400.

Then a few lines of R programming gives these results:

Code:
             Name Elo1_adjusted Elo5 rank1 rank5 alpha Elo20 rank20
1          LZ_ELF          3517 4544     3     1   741           4993      1
2       LZ_ELF_6t          3436 4496     4     3   765           4982      2
3         LM_GX47          3558 4504     2     2   682           4862      3
4      LZ_phoenix          2956 4133    17     9   849           4751      4
5          LZ_157          3580 4457     1     4   633           4738      5
6          LZ_173          3319 4307     7     5   713           4712      6
7          LZ_141          3271 4280     8     7   728           4709      7
8          LZ_174          3390 4291     5     6   650           4599      8
9       LZ_174_6t          3125 4120    11    10   718           4533      9
10    ray_ELF_12t          3342 4145     6     8   579           4343     10
11          LM_B5          2978 3932    16    13   688           4299     11
12    ray_173_12t          3089 3969    14    11   635           4253     12
13          LM_Z2          3108 3933    13    12   595           4155     13
14     ray_173_6t          3111 3874    12    15   550           4027     14
15         LZ_116          3169 3881    10    14   514           3976     15
16     ray_173_2t          2903 3706    19    18   579           3904     16
17         LM_W11          3059 3786    15    17   524           3898     17
18    ray_W11_12t          2950 3649    18    19   504           3730     18
19          LM_E8          3250 3794     9    16   392           3700     19
20        ray_ELF          2699 3487    23    20   568           3668     20
21        ray_173          2687 3392    24    22   509           3479     21
22          leela          2790 3414    21    21   450           3410     22
23         LZ_zed          2810 3372    20    23   405           3299     23
24   dream_ponder          2559 3232    26    25   485           3283     24
25          LZ_91          2735 3289    22    24   400           3207     25
26          dream          2437 3130    27    27   500           3204     26
27        LM_E8_c          2272 3035    31    28   550           3188     27
28        ray_W11          2578 3172    25    26   428           3135     28
29     LZ_116_c2t          2297 2988    30    29   498           3060     29
30       LM_W11_c          2106 2874    35    30   554           3032     30
31        leela_c          1926 2759    39    33   601           2990     31
32      leela_c2t          1935 2740    37    34   581           2940     32
33      LZ_91_c2t          1933 2648    38    35   516           2747     33
34      LM_GX47_c          2361 2842    29    31   347           2678     34
35     oakfoam_nn          2424 2839    28    32   299           2600     35
36      leela_c1t          2002 2535    36    39   384           2429     36
37       pachi_nn          1828 2429    40    40   434           2400     37
38        LM_Z2_c          2170 2601    34    36   311           2380     38
39          LZ_57          2173 2597    33    38   306           2369     39
40      LZ_57_c2t          1292 2093    44    42   578           2288     40
41        LM_B5_c          2271 2599    32    37   237           2263     41
42       pachi_1t          1013 1858    49    45   610           2103     42
43          pachi          1618 2137    41    41   374           2015     43
44          fuego          1138 1865    47    44   524           1977     44
45    leela_nonet          1583 2086    42    43   363           1946     45
46 leela_nonet_1t          1186 1856    45    46   483           1904     46
47          michi           934 1466    51    48   384           1359     47
48          gnugo          1300 1500    43    47   144           1020     48
49       oakfoam1          1163 1287    46    49    89            721     49
50        oakfoam           995 1068    50    50    53            445     50
51        matilda           864  975    52    52    80            395     51
52   oakfoam_book          1056  998    48    51   -42            228     52

So we can see for example that LZ_phoenix comes 17th in 1-minute games, but 9th in 5-minute games, giving it a big alpha value (it's making great use of the extra thinking time), and we'd expect it to shoot up to 4th place in 20-minute games. On the other hand, LM_E8 (with a 128x10 network) did better at 1 minute than at 5 minutes, so its alpha is lower, and we'd expect it to rank even lower at 20 minutes. Then again, the alpha values for LZ 141 and 174 don't look quite right.

This is a pretty simplistic model, so I don't expect the results to be at all accurate (we can tell it's not right by the way gnugo has dropped 500 points in the output), but it's interesting food for thought.

Author:  moha [ Wed Oct 10, 2018 4:23 am ]
Post subject:  Re: Home-made Elo ratings for some engines

xela wrote:
This is a pretty simplistic model, so I don't expect the results to be at all accurate (we can tell it's not right by the way gnugo has dropped 500 points in the output), but it's interesting food for thought.
The basic idea usually is that each doubling of thinking time gives a roughly similar strength increase (of course this is not necessarily a reasonable idea for all engines). Your formula could capture this if you didn't subtract 1 from 5 and 20 before taking the log.

But in these rating pools one's result depends on others' performances as well, quite a problem for this approach. Maybe you could anchor at gnugo=1500 for 1 min, and anchor other times at a guessed gnugo improvement factor / rating. If you expect your numbers to go up with more time, then you basically compare performance to 1-min gnugo (how strong I should be to play this well in 1-min games), so going up into otherwise "pro" number range is not surprising and does not necessarily mean pro strength.

Author:  xela [ Wed Oct 10, 2018 4:54 am ]
Post subject:  Re: Home-made Elo ratings for some engines

OK, time for some actual results at 20 minutes per game. To start with, I decided to do this as a "win and continue" series of 8-game matches, starting with pachi_nn and introducing opponents about 100-200 Elo points above the previous winner (going by my dodgy projected ratings, to see just how bad they are). I'd expect each new engine to win 5-3 or 6-2. I also decided that if an engine wins its match 8-0, I should backtrack and look for something slightly weaker, in the interest of making the ratings a bit more accurate. Once I get to the top of the list, then I'll go back and add some more games to try and reduce the error margins, and maybe add a couple more engines if anything looks especially interesting. Without gnugo in the list, I've decided to anchor pachi_nn's rating at 2400, so the ratings are still in the same ballpark as my other lists.

Round 1: pachi_nn vs oakfoam_nn. This was the first surprise: pachi_nn won the match 5-3. It seems that oakfoam has some problems with time management. It plays at about the right pace in 1-minute or 5-minute games, but in 20-minute games it only uses 6 or 7 minutes total, so it's giving pachi a bit of an advantage. It seemed to be ahead in the opening and early middlegame of each game, and then managed to misread something and lose.

Round 2: LZ_91_c2t d pachi_nn 6-2.

Round 3: leela_c2t d LZ_91_c2t 5-3

Round 4: LM_W11_c d leela_c2t 5-3

Round 5: LM_W11_c d dream 6-2. Another surprise: I was expecting dream to do a bit better against a CPU-only engine. Here there were two games with "disputed" scores (both engines agreed that LM_W11_c had won, but gave different winning margins): each game involved a seki.

Round 6: LZ_zed d LM_W11_c 5-3

Round 7: leela vs LZ_zed was a 4-4 tie. I decided to add a tiebreaker match: leela d LM_W11_c 6-2. This put leela on top of the list, because it did a better job of beating up LM.

Round 8: ray_ELF d leela 8-0. Backtrack: ray_173 d leela 6-2. Again there was one game with disputed score, another seki. ray_173 d ray_ELF 5-3, not what I expected! At this point, BayesElo had both ray_173 and ray_ELF on exactly the same rating (3082): 173 had won the head to head, but ELF did better against leela, and these factors cancelled out. There was also another game with disputed result (agreed that ELF won, but disagreed on the amount), but not a seki this time; instead, the scoring was messed up because ray_173 passed before the game was actually finished. (No harm done, it was losing anyway.)

Round 9: Instead of running another tiebreaker match, I decided to just give the next engine 6 games each against the tied leaders:
  • ray_173 vs LM_W11 3-3
  • ray_ELF d LM_W11 4-2
Another surprise: I'd expected LM_W11 to be stronger.

Round 10: ray_173_6t d ray_ELF 8-0

Remember that ray defaults to one thread; in 1-minute or 5-minute games it gets a little stronger with extra threads, but not by a huge amount. It looks like it gets a lot more benefit from those extra threads in slower games! (LZ uses two threads by default, and seems to actually get weaker with extra threads, at least in short games. But maybe it's worth retesting this theory in longer games?)

Backtrack: ray_173_2t d ray_ELF 5-3; ray_173_6t d ray_173_2t 7-1. In the 6t vs 2t match, there were two games with disputed result: both players passed early and disagreed on who was ahead. I decided to step in as referee and, looking at the positions, awarded one game to each player.

Round 11: LM_Z2 d ray_173_6t 7-1. Two more games where ray passed early from a losing position.

And throwing all this into BayesElo, the rating list so far is:
Code:
Name        Elo   Elo+  Elo-  games  score  avg_opp
LM_Z2       3726  312   203   8      88%    3505
ray_173_6t  3505  165   154   24     67%    3350
ray_173_2t  3212  169   177   16     38%    3309
ray_ELF     3113  112   111   38     47%    3140
ray_173     3087  143   133   22     64%    2991
LM_W11      3053  166   179   12     42%    3100
leela       2823  119   125   32     38%    2922
LZ_zed      2796  156   149   16     56%    2758
LM_W11_c    2692  109   109   32     50%    2697
leela_c2t   2618  150   150   16     50%    2620
dream       2553  197   256   8      25%    2692
LZ_91_c2t   2547  156   150   16     56%    2509
pachi_nn    2400  146   151   16     44%    2442
oakfoam_nn  2336  193   217   8      38%    2400


To be continued...

Author:  Kris Storm [ Sun Oct 14, 2018 12:18 pm ]
Post subject:  Re: Home-made Elo ratings for some engines

Hi xela. It's a good idea to do this kind of comparison.

How are you using BayesELO with SGF files? I found it useful only for chess PGN files.

Author:  xela [ Sun Oct 14, 2018 3:06 pm ]
Post subject:  Re: Home-made Elo ratings for some engines

Kris Storm wrote:
How are you using BayesELO with SGF files?

I've written a few lines of Python code to read the *.dat files created by GoGui and output the results as PGN. (You could do this manually too for a small number of games, and you could do it just as well from the SGF instead of the DAT.) The PGN file can be pretty minimal. BayesElo doesn't need the moves of the game; it's happy running on something that looks like this:
Code:
[White "leela_c"][Black "LM_E8_c"][Result "1-0"] 1-0
[White "LM_E8_c"][Black "leela_c"][Result "1-0"] 1-0
[White "LM_E8_c"][Black "leela_c"][Result "1-0"] 1-0
[White "LM_E8_c"][Black "leela_c"][Result "1-0"] 1-0
[White "leela_c"][Black "LM_W11_c"][Result "1-0"] 1-0
[White "leela_c"][Black "LM_W11_c"][Result "1-0"] 1-0
[White "LM_W11_c"][Black "leela_c"][Result "1-0"] 1-0
[White "LM_W11_c"][Black "leela_c"][Result "1-0"] 1-0

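The conversion step can be sketched in Python. This is not xela's actual script (the GoGui *.dat parsing is omitted); it just assumes the results have already been collected as (winner, loser) pairs and emits the minimal PGN shown above:

```python
def results_to_pgn(results):
    """Convert a list of (winner, loser) engine-name pairs into the
    minimal PGN that BayesElo accepts: winner listed as White, result
    always 1-0 (fine here because "advantage 0" ignores colours)."""
    lines = []
    for winner, loser in results:
        lines.append('[White "%s"][Black "%s"][Result "1-0"] 1-0'
                     % (winner, loser))
    return "\n".join(lines)

# Example: two results, written in the same shape as the sample above
print(results_to_pgn([("leela_c", "LM_E8_c"), ("LM_E8_c", "leela_c")]))
```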

Then I feed these commands to BayesElo:
Code:
readpgn filename.pgn
elo
offset 2000
advantage 0
drawelo 0.01
mm
exactdist
ratings


The "advantage 0" part means that it doesn't care who played black or white, so I can put the winner's name first in my PGN file and all the results as 1-0, which makes it simpler to create the PGN. There was a forum post somewhere by Rémi Coulom recommending the "advantage 0" and "drawelo 0.01" settings for go games. The "offset 2000" part means that the average rating of the outputs will be 2000; I have another Python script which changes 2000 to a different number, which is how I anchor the ratings (run it twice, figure out which offset will put gnugo at 1500).
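The re-anchoring arithmetic is straightforward, since changing the offset shifts every rating by the same amount. A minimal sketch (the function name and the example gnugo rating are hypothetical, not from xela's actual script):

```python
def anchor_offset(current_offset, observed_rating, target_rating=1500):
    """Given the rating an anchor engine received in a BayesElo run with
    some offset, return the offset that would move it to target_rating.
    Works because BayesElo's offset shifts all ratings uniformly."""
    return current_offset + (target_rating - observed_rating)

# Hypothetical example: a run with "offset 2000" gave gnugo a rating of
# 1720, so rerunning with "offset 1780" would put gnugo at 1500.
print(anchor_offset(2000, 1720))  # 1780
```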
