It is currently Fri Apr 26, 2024 9:37 am

All times are UTC - 8 hours [ DST ]




Post new topic Reply to topic  [ 28 posts ]  Go to page Previous  1, 2
Author Message
 Post subject: Re: Strength as error distribution
Post #21 Posted: Sat Apr 18, 2020 10:26 am 
Lives in gote

Posts: 486
Location: Netherlands
Liked others: 270
Was liked: 147
Rank: EGF 3d
Universal go server handle: gennan
lightvector wrote:
gennan wrote:
On an empty board with 7.5 komi, katago believes black has a 44% winrate. When black passes, his winrate drops to 6.2%. Translating these to Elo, black's pass decreases black's odds by 430 Elo. This is a rough indication that 1 rank is about 430 Elo wide at KataGo's level. This is close to the Elo per rank at 3050 EGF on the new fit.


If it matters, specifically it would be the Elo in self-play conditions, which might average out to being something like the equivalent of 300 or 400 visits. (Selfplay randomizes between 200 and 1k visits in a certain way). Probably black's chances would go down a bit more than that with more typical match settings.


Yes, I would expect that to happen. Black's odds after a pass would decrease with more visits, which means that the Elo loss of the pass would increase, implying that the Elo per rank increases, which is consistent with KataGo's level increasing with more visits (higher Elo per rank implies higher level).

 Post subject: Re: Strength as error distribution
Post #22 Posted: Sat Apr 18, 2020 12:45 pm 
Lives in gote

Posts: 486
Location: Netherlands
Liked others: 270
Was liked: 147
Rank: EGF 3d
Universal go server handle: gennan
BTW, with these fitted functions for Elo per rank, it's possible to make a conversion between Elo player ratings and EGF player ratings by integrating and adding a suitable constant:

Using my earlier fit, I get
Code:
Elo = 6 * ln(1/(3200 - EGF)) * 400/ln(10) + 9100
Using my new fit, I get
Code:
Elo = (200/(3400 - EGF)^0.3) * 400/ln(10) - 2250
Also see them on Fooplot.


The inverted functions allow converting from Elo player ratings to EGF player ratings:

Using my earlier fit, I get
Code:
EGF = 3200 - exp(((9100 - Elo) / 6) * ln(10)/400)
Using my new fit, I get
Code:
EGF = 3400 - ((200/(2250 + Elo)) * 400/ln(10))^(1/0.3)
Also see them on Fooplot.

With these functions you may get a decent conversion between EGF ratings and the Elo ratings stated in DeepMind's papers and on the Goratings website.
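As a sketch, the new-fit formulas above can be written as a pair of Python functions (the function names are mine; the constants come straight from the fitted formulas above):

```python
import math

# gennan's "new fit" conversions between EGF and Elo ratings.
# Function names are mine; constants are from the formulas above.
def egf_to_elo(egf):
    return (200.0 / (3400.0 - egf) ** 0.3) * 400.0 / math.log(10) - 2250.0

def elo_to_egf(elo):
    return 3400.0 - ((200.0 / (2250.0 + elo)) * 400.0 / math.log(10)) ** (1.0 / 0.3)
```

By construction the two functions are inverses of each other, e.g. elo_to_egf(egf_to_elo(2100)) gives back 2100 up to rounding, and egf_to_elo(2100) lands near the 1800 Elo shown for EGF 2100 in the table below.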

I mapped some examples using these conversion functions (the new ones, though the earlier ones are quite similar in the human range):
Code:
  Elo rating              EGF rating (100 pts/full handicap stone)
  ------------------      --------------------
  infinity?               3400 (14d/24p, perfect player?)
  5180 (AG Zero)          3230 (19p)
  4860 (AG Master)        3200 (12d/18p)
  3750 (AG Lee, #1 human) 3050 (13p, KataGo?)
  3500 (Shibano)          3000 (10d/11p)
  3250 (Yamashita)        2940 (9p)
  3000 (Redmond)          2880 (7p)
  2800                    2800 (8d/3p)
  2600                    2700 (7d/1p)
  2400                    2600 (6d)
  2200                    2500 (5d)
  2000                    2300 (3d)
  1800                    2100 (1d)
  1500                    1800 (3k)
  1200                    1300 (8k)
   800                     200 (19k)
   400                   -1600 (37k)
     0 (random play?)   -5800 (79k)
I feel this looks pretty reasonable overall (although the behaviour in the deep DDK range is anyone's guess, I suppose).

These go Elo ratings are even somewhat similar to chess Elo ratings, except at the high end (from about 2300 Elo upwards, or FIDE Master), where the wide draw margin of chess may be a limiting factor on chess Elo ratings (it's hard to win a chess game between two highly skilled players, even when there is a skill gap).

Note: The Elo gap between AG Zero and AG Lee is about 1400 Elo, which means that AG Lee would be expected to win only about 1 game in 4,000 against AG Zero in even games. But the rank gap is only 2 ranks, so if it gets a handicap of 3 stones with komi, AG Lee may be able to win a decent number of games. It may be the same for the top human players (Shin Jinseo, Ke Jie and Park Junghwan, with Elo ratings hovering around 3700 in recent years).
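For reference, the win expectancy under the standard Elo model can be computed directly (a quick sketch; the gap plugged in is AG Zero minus AG Lee from the table above):

```python
import math

# Expected score of the weaker player under the standard Elo logistic model.
def elo_expected_score(gap):
    return 1.0 / (1.0 + 10.0 ** (gap / 400.0))

p = elo_expected_score(5180 - 3750)  # AG Zero vs AG Lee gap from the table
# p comes out to roughly 1 in 3,800 even games
```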

 Post subject: Re: Strength as error distribution
Post #23 Posted: Sun Apr 19, 2020 8:42 am 
Lives in gote

Posts: 311
Liked others: 0
Was liked: 45
Rank: 2d
Elo conversion functions can be verified a bit with the derived function for deviation. Since winrate classes have meaning both in Elo and in deviations, the number of classes between neighbouring dan ranks also determines an sd for the distribution there. This can also be used in the "wr loss for 2 pts early mistake" question - iirc there were some doubtful points in my earlier table.

 Post subject: Re: Strength as error distribution
Post #24 Posted: Wed Apr 22, 2020 10:03 am 
Lives in gote

Posts: 311
Liked others: 0
Was liked: 45
Rank: 2d
Ah, it seems possible to even estimate the stone rank of perfect play directly!

Making a bunch of simplifying assumptions (which are all surely somewhat incorrect, but may not be bad enough to completely ruin the estimate), it seems sd may decrease roughly linearly per dan rank. This is because if the per-move errors are something like exponential (as seen above), then their mean and sd are the same. If game length can be approximated by its average (and is roughly the same for most players), then (assuming a quickly normalizing distribution like Erlang) the per-game total (the sum of movecount units of per-move errors) should also have an sd roughly proportional to its mean (thanks lightvector!). The means add up (get multiplied by movecount), the variances as well, so sd gets multiplied by sqrt(movecount). Thus:
Code:
sd ~= mean_distance_from_perfect_play / sqrt(movecount)

Just using the idea that the last factor might be seen as constant makes the earlier sequencing easier: if we know how fast sd decreases in practice (have at least two data points to get sd_delta for that distance), it may directly and linearly point to the rank where it reaches zero.

Even simpler, if we just know sd in points for a certain single player, and estimate average game length as ~300 moves (without resignation, 150 per player), then simply multiplying that sd by ~12 might give his rough distance to perfect play.

If I start KataGo now it shows an initial B winrate of 48.5% (Japanese rules with 6.5 komi). Trying komi 4.5 and 8.5, the avg change is ~12.5% for 2 pts. This corresponds to normals 0.45 sd apart (Wolfram, from above), i.e. 0.225 sd per point, which gives sd ~4.44 pts. Multiplied by ~12 this gives a bit more than 50 points -> ~3.8 stones to perfect play (OC this is for the selfplay visit range; in high-visit matches it would be less). Again, nothing to be taken too seriously (reality is surely much more complex, esp. near perfect play, and these simple models can easily go significantly wrong) - but at a glance a reasonable result.
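The sd ~= mean / sqrt(movecount) claim can be checked with a quick Monte Carlo (a sketch only; the per-move error scale of 0.3 pts and n = 150 moves per player are illustrative values, not measurements):

```python
import math
import random

# Sum n iid exponential per-move errors per game, then check that the
# sd of the per-game total is close to its mean divided by sqrt(n).
random.seed(0)
n, scale, trials = 150, 0.3, 5000
totals = [sum(random.expovariate(1.0 / scale) for _ in range(n))
          for _ in range(trials)]
mean = sum(totals) / trials
sd = math.sqrt(sum((t - mean) ** 2 for t in totals) / trials)
# sd / (mean / sqrt(n)) should come out close to 1
```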

 Post subject: Re: Strength as error distribution
Post #25 Posted: Wed Apr 22, 2020 4:16 pm 
Lives in sente

Posts: 757
Liked others: 114
Was liked: 916
Rank: maybe 2d
When talking about bots and actual handicap stones, also watch out for the complication that some "zero-trained" bots don't know how to play with or against handicap stones very well.

The "PDA" training in KataGo resulted in a massive gain of something like 250 Elo against test opponents like Leela Classic in handicap games, versus not doing so. And merely a few weeks ago KataGo also received a massive strength boost for when it *receives* handicap stones, by enabling negative values of PDA - in certain handicap test matchups this swung results from winning 25% against opponents to winning 75% (more than 300 Elo!). And the PDA training itself being present in the selfplay has barely any effect on the overall strength of the run in even games; I'm not even sure if it's positive or negative - it's dwarfed by the noise of the test runs I tried.
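As a sanity check (a sketch under the standard Elo model, with a helper name of my own), a swing from a 25% to a 75% winrate converts back to an Elo difference like this:

```python
import math

# Elo gap implied by a winrate p, under the standard Elo logistic model.
def winrate_to_elo_gap(p):
    return -400.0 * math.log10(1.0 / p - 1.0)

swing = winrate_to_elo_gap(0.75) - winrate_to_elo_gap(0.25)
# swing is about 382 Elo, i.e. "more than 300 Elo"
```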

I also got another 100 handicap-game Elo more or less 'passively', simply from some minor adjustments to how the score maximization utility works.

So overall, holding the even-game strength of bots constant, with bot A stronger than bot B, one could obtain perhaps a 400-700 Elo point swing in matchups like "A vs B with 3 stone handicap" between "A didn't do any PDA or score training but B did" and "B didn't do any PDA or score training but A did".

Which is bound to cause a bit of trouble when you try to talk strength differences measured in chunks as large as stones. ;-)

 Post subject: Re: Strength as error distribution
Post #26 Posted: Thu Apr 23, 2020 11:45 am 
Lives in gote

Posts: 486
Location: Netherlands
Liked others: 270
Was liked: 147
Rank: EGF 3d
Universal go server handle: gennan
From my crude conversion above between EGF rating and Elo rating, it seems that around the level of AlphaGo Master/AlphaGo Zero, a rank is roughly 1000 Elo wide (about 10 Elo per EGF rating point). So a 400-700 Elo improvement for an AI around that level may correspond roughly to a handicap (rank) improvement of about half a stone.
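This can be checked against the new-fit formula posted earlier by differentiating it (a sketch; the derivative below is mine, taken from that formula):

```python
import math

# d(Elo)/d(EGF) of the new fit Elo = (200/(3400 - EGF)^0.3) * 400/ln(10) - 2250,
# i.e. how many Elo one EGF rating point is worth at a given level.
def elo_per_egf_point(egf):
    return 200.0 * 0.3 * (3400.0 - egf) ** (-1.3) * 400.0 / math.log(10)

# near EGF 3200 (AlphaGo Master level) this is ~10.6 Elo per point,
# i.e. ~1060 Elo per 100-point rank; lower down the slope is much flatter
```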

 Post subject: Re: Strength as error distribution
Post #27 Posted: Thu Apr 23, 2020 5:43 pm 
Lives in sente

Posts: 757
Liked others: 114
Was liked: 916
Rank: maybe 2d
@gennan - regarding your chart, I would strongly suspect that both KataGo and Leela Zero are in the neighborhood of AlphaGo Master at least. I'd guess this for a few reasons:

* Leela Zero has gone for about 19 million training games, compared to AG Zero which went for 29 million training games (the one that you have at 5180 in your chart), using the same neural network architecture and pretty much the same algorithm, albeit with some differences of hyperparameters. Strength improvement is logarithmic in training time, so Leela Zero should be not far from AG Zero.

* More quantitatively: KataGo and Leela Zero both observe not-too-dissimilar amounts of Elo gain per doubling of selfplay compute power spent. KataGo's enhancements seem to make the entire training N times faster for some respectable N, so it has a much better multiplicative constant, but the gain per doubling is similar, and when I last measured for KataGo it seemed to be very roughly on the order of 300 Elo per doubling. There's one complication regarding the fact that only about half of Leela Zero's games have been with 40 blocks, the rest have been with cheaper networks, and also the first *third* of the games had bugs, slowing the improvement and "counting for less", so Leela Zero's effective compute spent isn't quite linear in games. And also one can never be sure about the effect of the differences in hyperparameters between Leela Zero and AlphaGo. But altogether, Leela Zero is probably somewhere from one to two doublings away from AlphaGo Zero right now.

* KataGo, depending on how you measure nowadays and what hardware, etc, is anywhere from perhaps 50 to 200 Elo stronger than Leela Zero, so KataGo should probably be in the same ballpark of AlphaGo Master too if Leela Zero is, and if not yet caught up to AlphaGo Zero, then at least a little closer still.

* Based just anecdotally on what you see when you use them to review pro games, and some of the bot vs pro handicap games you can find on servers, KataGo and Leela Zero now seem capable of doing about as well as AlphaGo Master's 60-0 wins, except for the ever-present chance of just throwing a game as a rare fluke due to blind spots or ladders ("heavy tails" which may violate the assumptions of the Elo model itself). And many of these server games where they win against pros with handicap are even played with hardware perhaps an order of magnitude weaker than what AlphaGo Master used (4 TPUs). Although with handicap games you have to be careful about the heavy nonlinearities I pointed out in my earlier post. For AlphaGo Lee, although there is limited evidence, it would be a little doubtful whether it could do this, based on the gap with AlphaGo Fan and also the loss against Lee Sedol (even despite using 12x? the hardware for that game compared to what AlphaGo Master used in the 60 games).

* Measured in compute power invested, it's also been at least 2 doublings (600 Elo?) for both Leela Zero and KataGo since the older networks where they were already probably able to win quite reliably against pros in even games. About 2 doublings ago puts you roughly around ELF strength, and it's probably safe to say that ELF itself is already well-superhuman. So it's very very hard to put KataGo near AlphaGo Lee in strength and make the chart come out consistently.
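The rough bookkeeping in these bullets can be sketched as follows (the ~300 Elo per doubling is the rough KataGo measurement mentioned above; the helper name is mine):

```python
import math

# Estimate an Elo gap from a ratio of effective selfplay compute,
# assuming a roughly constant Elo gain per doubling of compute.
def elo_gap_from_compute(compute_ratio, elo_per_doubling=300.0):
    return elo_per_doubling * math.log2(compute_ratio)

# e.g. two doublings (a 4x compute ratio) gives a ~600 Elo gap,
# matching the "at least 2 doublings (600 Elo?)" figure above
```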

Given that it seems you tried basing the scale off of KataGo in part (the measured "distance from perfect in points/stones") I'm not sure how this might affect the way you're trying to set the scaling.

Edit: Revised some of the part above about Leela Zero compute - the story is actually a bit more complex than what I first wrote here, and added one more bullet.

 Post subject: Re: Strength as error distribution
Post #28 Posted: Fri Apr 24, 2020 11:37 am 
Lives in gote

Posts: 486
Location: Netherlands
Liked others: 270
Was liked: 147
Rank: EGF 3d
Universal go server handle: gennan
@lightvector:

The Elo ratings for the different AlphaGo versions are from DeepMind's papers.

For my estimate of KataGo's rank, I only used the handicap games between KataGo and Yeonwoo as rough data points.

How does KataGo compare to AlphaGo Master?
It may well be that KataGo is at the level of AlphaGo Master when given similar computing power in a game. But my estimate is that AlphaGo Master was allowed about 1 million visits per move in many of its performances (and even 10 times more to build the published opening database). I don't think KataGo gets that many resources from its typical users.

I don't know what kind of hardware Yeonwoo used for her little handicap match, but I think it's nowhere near what AlphaGo Master had available.
In her videos it looks like KataGo was allowed about 5 seconds per move on common hardware. So I would estimate that KataGo was only allowed about 1000 visits per move.

So my estimate of KataGo's rank is for KataGo at about 1000 visits, while I read DeepMind's Elo rating for AlphaGo Master as applicable to 1000000 visits. That's 10 doublings difference. I cannot make a fair comparison between them, because I have no data about KataGo's level with 1000000 visits per move.
It may well be that KataGo can give Yeonwoo between 4 and 5 stones handicap with that much computing power (corresponding to a rating of about 3200 EGF/12d, like my estimate for AlphaGo Master).
That would require KataGo to improve its game score by about 1 point for each doubling of visits beyond 1000. This may be the case, but I don't have the kind of hardware needed to test this hypothesis.
Note: If this scaling behaviour really exists, and if it holds even beyond 3200 EGF/12d, and if perfect play is indeed about 14d, KataGo may approach perfect play at around 10^15 visits per move (about 50,000 years per move on common 2020 hardware?).
