Strength as error distribution

moha · #1

This came up a few times recently - some random thoughts:

The basic idea is that a player's strength can be described by the errors he makes. For simplicity I'd define an error as a move that loses points compared to the minimax solution (a bit doubtful ^*1). Such errors should be somewhat normal-ish (many small errors, fewer large errors ^*2), and after playing 100-200 moves the sum of these errors may be even more so (central limit).

Overall I think assigning a mean and a deviation to a player's per-game error total could offer a decent model. This is actually not much different from the foundations of Elo (performance = -errors, and the deviation may even be guessable from the mean). Except that in go there is a more tangible meaning behind these numbers: when two players play, the winning margin is the actual sum of the opponent's errors minus the actual sum of the player's errors (assuming correct komi).

So for each game we have two distributions similar to this plot. The player wins if his "random sample" turns out to be higher than the opponent's (= he gives up fewer points in the game than the opponent).

Player A has a distribution described by [Aev,Asd] and the opponent has [Bev,Bsd]. For simple cases the distribution of the difference can be constructed, but there is a more general way of getting A's winning probability: for each point on A's distribution, we take its density multiplied by B's cumulative distribution from -infinity to that point (the cases where B made more errors than A's error point in question), and integrate these products over all points.

Since only the relative width and position matter, B's distribution can be normalized, so that afterwards only A's shifted and scaled one is needed: A becomes [Aev',Asd'] and B is [0,1]. This means A's numbers are expressed using B's original deviation as the unit: we are only interested in where our distribution lies relative to the opponent's, and how its shape aligns with his (how much wider/narrower it is).

So Aev'=(Aev-Bev)/Bsd and Asd'=Asd/Bsd. With these, the winning probability can be approximated (^*2):

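$$P(A\ \text{wins}) \approx \frac{1}{\sqrt{2\pi\,A_{sd}'^{2}}} \int_{-\infty}^{\infty} e^{-\frac{(x-A_{ev}')^{2}}{2A_{sd}'^{2}}} \cdot \frac{1}{2}\left(1+\operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right) dx$$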

Here is a wolfram example to calculate such win probabilities (variable substitution would make it too complex for the free version, so Aev' and Asd' occurrences need to be replaced manually inside square brackets).
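For readers without the Wolfram page at hand, the same integral can be evaluated in a few lines. This is only a minimal sketch: the function names are made up for illustration, and the midpoint-rule integration (with its step count and integration width) is an arbitrary choice, not what the calculator uses. For the all-normal case the difference of the two samples is itself normal, which gives a closed form to check the numeric result against.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution (B after normalization)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def win_prob_numeric(a_ev, a_sd, steps=20000, width=10.0):
    """Integrate A's density times B's CDF with the midpoint rule.

    a_ev and a_sd are the normalized Aev' and Asd' from the text.
    The range is cut off at +/- `width` deviations around A's mean.
    """
    lo = a_ev - width * a_sd
    hi = a_ev + width * a_sd
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        density = math.exp(-(x - a_ev) ** 2 / (2 * a_sd ** 2)) / (a_sd * math.sqrt(2 * math.pi))
        total += density * normal_cdf(x)
    return total * h

def win_prob_closed(a_ev, a_sd):
    """Closed form for two normals: A - B is normal with mean Aev', variance Asd'^2 + 1."""
    return normal_cdf(a_ev / math.sqrt(1.0 + a_sd ** 2))

# Example: A is 0.85 of B's deviations weaker and has the same deviation as B.
print(win_prob_numeric(-0.85, 1.0), win_prob_closed(-0.85, 1.0))
```

The numeric route is the one that generalizes: swapping the density or the CDF for another distribution only changes the two expressions inside the loop.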

Although the absolute position of a distribution doesn't really matter, a very rough guess is that strong pro level is somewhere around -50 (komi = 7, 1 stone = 2*komi, so 3-5 stones to perfect play). Two players are 1 stone apart if their ev difference is roughly 14 (supposedly 50% winrate with 1 extra stone or with reverse komi).

More interesting is the question of deviation. There is a known problem in translating Elo-like ratings to stones: the EGF win% table predicts that the winrate against 1 stone stronger opponents is ~33% at 9k, ~25% at 1d, and only ~20% at 7d levels. Using the above function in reverse hints that at 1d the deviation may be a bit less than 1 stone (<14 points). For stronger levels the deviation decreases - making fewer and smaller errors not only means a higher ev, but less absolute variance as well.

These rank-dependent winrate differences are handled by EGF using an extra (deviation-like) variable term. This approach offers a natural explanation, which follows from how A's distribution is shifted and scaled against B's normalised one. For stronger players, the distribution of a one stone (14 points) stronger opponent lies significantly farther away in relative/scaled terms (since the deviations are smaller). I think this is the real reason behind those differences observed in practice.

^*1 This ignores that a deliberate safety move that trades points for consolidation of a winning position is not the same kind of error as points lost on misplaying a local fight for example.

^*2 In go the actual error values and sums are integer, so something like a binomial distribution would probably be best. But approximating with other distributions like normal or logistic should also be ok, except maybe at near-perfect play (no positive values / side).

Joaz Banbeck · #2

moha wrote:

...errors should be somewhat normal-ish (many small errors, fewer large errors ...

I'm suspicious of this assumption. The availability of errors of different sizes varies throughout the game. ( The largest error that can be made on the first move should be no more than komi*2, and in the last few moves it is usually a point or two. But in the middle game a bad move can sometimes throw away 100+ points )

I suspect that it will not be a normal distribution: that small errors will be over-represented.

Bill Spight · #3

To me the fact that errors are non-negative integers suggests a Poisson distribution.

Bill Spight · #4

Joaz Banbeck wrote:

moha wrote:

...errors should be somewhat normal-ish (many small errors, fewer large errors ...

I'm suspicious of this assumption. The availability of errors of different sizes varies throughout the game. ( The largest error that can be made on the first move should be no more than komi*2, and in the last few moves it is usually a point or two. But in the middle game a bad move can sometimes throw away 100+ points )

I suspect that it will not be a normal distribution: that small errors will be over-represented.

For many amateurs the error distribution may be bimodal, with better amateurs making fewer large errors.

moha · #5

Joaz Banbeck wrote:

The availability of errors of different sizes varies throughout the game. ( The largest error that can be made on the first move should be no more than komi*2, and in the last few moves it is usually a point or two. But in the middle game a bad move can sometimes throw away 100+ points )

Right, the scales of individual errors likely correlate with temperature changes throughout the game. And a large per-game error total may have more to do with a middlegame blunder than with dozens of smaller errors, for example.

This in itself doesn't exclude normality for the total though (e.g. the sum of a few normals is still normal, even if one of them is on a scale orders of magnitude larger). But the normality of individual errors is even more questionable OC.

Another possible consequence, verifiable from actual data on results: if the largest errors come from middlegame, the deviation of the total can significantly depend on the character of the player as well (so not guessable from the mean, like EGF tries). Someone who has a strong middlegame likely makes fewer errors there, so likely has a smaller deviation for his total than others with the same rank (mean). This still leaves him with 50% against them, but should have noticeable and consistent effects on his chances against 1 stone stronger opponents (similarly to 9k-1d-7d anomalies above).

Quote:

I suspect that it will not be a normal distribution: that small errors will be over-represented.

This is why the longer route with the double integral seems preferable: it works for a wider range of distributions.

Bill Spight wrote:

To me the fact that errors are non-negative integers suggests a Poisson distribution.

In a few years the newer bots (with multi-komi NNs or the SAI fork) may be able to provide actual data on this.

moha · #6

Some further thoughts in comparison to 1-dimensional (mean only) systems:

When two players play, the side with the higher mean always has the upper hand. How large his advantage is, however, depends on the deviations nearly as much as on the means.

Balancing a matchup to 50% winrate with handicap or komi needs means only. This basically shifts means to be identical, then deviations don't matter anymore.

Partially non-transitive situations are possible, rather practical even (no special correlations, players showing the same performance against all opponents). For example, A is [-100,10], B is [-110,15], and C is [-115,30]. Then A>B (71%), B>C (56%), but A wins less against C (68%) than against B.
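The three percentages can be reproduced with a short script. This is a sketch under the assumption that each per-game total is an independent normal given as [mean, sd], so that the difference of two totals is normal as well; the helper name `win_prob` is made up for illustration.

```python
import math

def win_prob(a, b):
    """P(A's sample > B's sample) for independent normals given as (mean, sd)."""
    (a_ev, a_sd), (b_ev, b_sd) = a, b
    # The difference A - B is normal with the means subtracted and the variances added.
    z = (a_ev - b_ev) / math.sqrt(a_sd ** 2 + b_sd ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

A, B, C = (-100, 10), (-110, 15), (-115, 30)
for name, p in (("A vs B", win_prob(A, B)),
                ("B vs C", win_prob(B, C)),
                ("A vs C", win_prob(A, C))):
    print(f"{name}: {p:.0%}")
```

C's wide deviation is what pulls A's winrate against C below A's winrate against B, even though C has the lowest mean.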

So it may be better to exclude non-handicapped games between players of different ranks from 1-dimensional systems. Otherwise deviations may get measured/smeared into the ratings (which should approximate means only - rating C higher than B would be incorrect).

moha · #7

Out of curiosity I tried to use this approach on the relation between points (early mistakes / advantages) and winning probabilities.

This is well defined if we know the shapes of players' distributions (or a good approximation), and we have at least a single data point to establish the scale (the distance between the two distributions in deviations). So if we know the percentage value of X points, we can calculate Y points and so on (by shifting the distributions).

And we do have one data point: a whole stone. This is if one player passes his first move - or if there is 1 stone strength difference between the players. And we can guess the point value of this is twice komi - roughly 14 points.

For human ranks I took the winrates against 1 stone stronger opponent from the above EGF table (adjusting down half-rank, and some guessing for pro levels / 9d since it only goes up to 7d). I experimented with bots as well, but their winrate gains fluctuate wildly and sometimes inconsistently (even for smaller mistakes), so I could only roughly conclude that one move for LZ is about 35-40% gain. Which is not much different from my 9d approximation so I made no column for this. Instead I include the idea of 2pt=10% - this can also be used as an anchor.

So, using the above wolfram calculator in both directions, I get the following values:

Code:

                           1 dan   7 dan   9 dan   2=10?
--------------------------------------------------------
winrate vs 1 extra stone   27.4%   20.2%    ~16%    3.7%?
equiv. distance in sd-s     0.85    1.18    1.41
1 point distance in sd-s   .0607   .0843   .1007   .1800
sd in points               16.47   11.86    9.93    5.56
------------------------
winrate gain for 1 pt       1.71    2.38    2.84    5.06
winrate gain for 2 pts      3.42    4.74    5.66    10.0
winrate gain for 3 pts      5.12    7.10    8.46    14.9
winrate gain for 5 pts      8.50    11.7    13.9    23.8
winrate gain for 7 pts      11.8    16.2    19.1    31.4
winrate gain for 14 pts     22.6    29.8    34.0    46.3

This is for early game only OC - and similar results can be obtained by an oddswise approach as well.
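The rows of the table can be approximated without the Wolfram page. The sketch below assumes that both players are normal with the same deviation (so the difference of their totals has sd*sqrt(2)) and that one stone is worth 14 points; these assumptions appear consistent with the 1 dan and 7 dan columns, but the function names and the code itself are illustrative and not the procedure actually used for the table.

```python
from math import sqrt
from statistics import NormalDist

STONE_POINTS = 14.0          # assumed value of one stone (2 * komi), as in the post
STD_NORMAL = NormalDist()    # mean 0, sd 1

def sd_in_points(winrate_vs_stronger):
    """Per-game sd (in points) implied by the winrate against a 1 stone stronger player.

    Assumes both players share the same sd, so the difference of their
    totals has sd * sqrt(2).
    """
    distance_in_sds = -sqrt(2.0) * STD_NORMAL.inv_cdf(winrate_vs_stronger)
    return STONE_POINTS / distance_in_sds

def winrate_gain(points, sd):
    """Winrate gain over 50% (as a fraction) from an early advantage of `points`."""
    return STD_NORMAL.cdf(points / (sd * sqrt(2.0))) - 0.5

for label, winrate in (("1 dan", 0.274), ("7 dan", 0.202)):
    sd = sd_in_points(winrate)
    gains = ", ".join(f"{p} pt: {100 * winrate_gain(p, sd):.1f}%" for p in (1, 2, 7, 14))
    print(f"{label}: sd = {sd:.2f} points; {gains}")
```

Small differences from the table in the last digit are expected, since the table's inputs were rounded before use.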

moha · #8

With KataGo's progress it now seems possible to look at some actual data. As an experiment I took 8 of my recent games (2 each of B wins, B losses, W wins and W losses), and ran them through KataGo move by move (5000 visits). The game move may not have been searched properly, but I took the "errors" as the (color corrected) differences between B's lead/score at the next move (the result of the game move) and at the current move.

Since this is fractional (the bot is far from perfect) I rounded them down to half points (instead of whole points as real errors). The per-move distribution looked like (my moves only - KGS 2d):

Attachment: pic_sajt_05.png

Negative values come mostly from the bot's own variance of opinion between moves (0.1-0.2 pts typically), but in some rare cases the game move was actually found to be better (the extreme was a 10 pts improvement).
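The error extraction described above can be sketched as follows. Everything here is illustrative: the function name, the input layout (a list of Black's estimated lead before each move) and the sample numbers are hypothetical, and obtaining the lead estimates from KataGo is outside the sketch. Positive values mean points lost by the player who moved.

```python
import math

def per_move_errors(black_leads, first_player="B"):
    """Points lost by each move, rounded down to half points.

    black_leads[i] is Black's estimated lead before move i is played, so the
    list needs one more entry than there are moves.
    """
    errors = []
    for i in range(len(black_leads) - 1):
        change = black_leads[i + 1] - black_leads[i]
        black_to_move = (i % 2 == 0) == (first_player == "B")
        loss = -change if black_to_move else change   # colour correction
        errors.append(math.floor(loss * 2) / 2)       # round down to 0.5 steps
    return errors

# Hypothetical lead estimates (not taken from any real game).
leads = [0.5, -1.0, 2.3, 2.2]
print(per_move_errors(leads))
```

Rounding down with `floor` also pushes small negative differences (the bot's own noise) to -0.5, which is one reason negative bins show up in such a histogram.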

In one game there was a sequence of 5-10 moves with errors of around 60 pts by both players: there was an aji involving the L&D of a large group which neither player noticed. Although these are truly errors, their effect on the distributions may be a bit confusing (especially if this goes on for dozens of moves - which did not happen here). But maybe this is rare enough not to have a significant impact.

Just 8 games are far too few to see the distribution of per-game totals, but these were (summed before rounding): 272, 390, 312, 222, 270, 532*, 273, 275 (biased by varying game length, resignations etc). For opponents (1d-3d): 369, 372, 343, 214, 246, 581*, 199, 276. If anything these may seem a bit higher than expected (but also affected by the bot's own errors, which may total around 30? or more at these low visits). If one stone really amounts to 14 points, pro levels may be closer to -100 than to -50.

I also tried 100k "games", random sums where each move was pulled from this distribution (positive range only). Unsurprisingly the result was quite normal-ish (the central limit effect is strong after more than 100 additions). But OC this is nothing to be taken seriously.
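A comparable experiment is easy to repeat. The measured per-move histogram is not reproduced in the thread, so this sketch substitutes an exponential distribution with a mean of 2 points per move; that choice, the game length of 120 moves per player and all names are assumptions made purely for illustration.

```python
import random
import statistics

def simulate_totals(n_games=5000, moves_per_game=120, mean_error=2.0, seed=1):
    """Per-game error totals with independent, exponentially distributed move errors."""
    rng = random.Random(seed)
    rate = 1.0 / mean_error
    return [sum(rng.expovariate(rate) for _ in range(moves_per_game))
            for _ in range(n_games)]

def skewness(values):
    """Sample skewness: 0 for a symmetric distribution, 2 for an exponential."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

totals = simulate_totals()
print(f"mean {statistics.fmean(totals):.1f}, sd {statistics.pstdev(totals):.1f}, "
      f"skewness {skewness(totals):.2f}")
```

The single-move distribution is strongly skewed, while the per-game totals come out close to symmetric. Note that the simulation draws moves independently, which real games do not guarantee.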

moha · #9

Filtered to moves below 40:

Attachment: pic_sajt_040.png

Moves 40-130:

Attachment: pic_sajt_40130.png

Above 200:

Attachment: pic_sajt_200.png

moha · #10

An idea how to guess the distance to perfect play with this approach.

As mentioned above, the deviation of a distribution/player can be accurately measured by shifting it by a known extent (playing with an extra pass/move/stone or changing komi by a few points), then observing the winrate change (which tells the distance in deviations). The point- or stone-wise distance between two distributions can also be measured (which change gives 50%). However, direct measurement of the absolute location of a distribution seems impossible.

But we can use the fact that deviations get smaller as strength increases (i.e. as means move from minus hundreds towards 0). This is not necessarily true in all cases (an artificial player could be made rather strong with high deviation), but there are reasons why it usually is in practice. And since flawless play has both location and deviation 0, by looking at the sequence/function of deviations for 1d-9d, then 10d-11d-12d etc, a rough guess seems possible by extrapolation.

This would need the deviations from the strongest bots, since the function is most interesting (and may change slope) near these levels. And would be nothing accurate OC, but neither is "hard to imagine a 9p losing to anything on 6 stones".

gennan · #11

I also tried to model this a while ago.

My guess is that the simplest model is assuming some gamma distribution of point loss per move (perhaps the exponential distribution). The score difference between 2 players over a whole game may then roughly follow a beta distribution. From this one may predict the win probability distribution between 2 players separated by a number of ranks (full handicap stones, assumed to be worth about 14 points each).

gennan · #12

My gut feeling is that "god" (perfect play) cannot give top AI much of a handicap. Perhaps something between 2 stones and black without komi? My guess is that top AI may only lose about 10 points total per game on average.

gennan · #13

A little while ago I also tried to determine the total point loss per game with KataGo on a couple of games.

I also got roughly 300 points at about 2d EGF.

I got roughly 1000 points for a 25k game and about 100 points for an AlphaGo Zero self play game.

Assuming 14 points loss per rank, I would expect the "real" point loss to be more like 500 for 25k, 150 for 2d EGF and 10-ish for AlphaGo Zero.

total point loss per game:

Code:

rank  real?  KataGo estimate
AG0    10    100
2d    150    300
25k   500   1000

average point loss per move:

Code:

rank  real?  KataGo estimate
AG0   0.1    0.8
2d    1.3    2.5
25k   4.2    7.0

So there is a factor 2 discrepancy between our expectations and our results.

Are KataGo's error estimations too high, or is there a flaw in our reasoning?
The estimate of AG0 making 100 point loss in a game compared to KataGo is already suspicious.
Perhaps KataGo's estimates have some outliers that skew the data (it may be a bit too harsh in its judgement of "bad" moves).

So I made another table where I avoid outliers by using the median point loss per move:

median point loss per move:

Code:

rank  real?  KataGo estimate
AG0   0.1    0.3
2d    1.3    1.2
25k   4.2    5.6

As you can see, the numbers are getting closer. I don't know if this hints at some underlying principle.

Friday9i · #14

moha wrote:

An idea how to guess the distance to perfect play with this approach.

That's more or less what I tried in the past few weeks, but with a different approach, using KataGo! :-)

I measured the points advantage of KataGo with 4x more visits vs KataGo, depending on the visits: eg KataGo with 80 visits is on par against KG 20 visits when giving it ~20 points advantage (ie fair komi is about -12.5 instead of the default 7.5 komi).
Note: technically, I used PDA=2 for white and PDA=-2 for Black (in line with parameters used for KataGo selfplay)
With 320 visits vs 80 visits, the advantage is down to ~15 points (ie -7.5 komi). You'll find below the table up to 160K visits vs 40K visits (but few games as it's dead slow, so quite uncertain), where the advantage is still around 9 points (+/- 1.5 points I would say).
Extrapolating to infinity, the sum of advantages of perfect play is probably between 40 and 80 points from KG (net 20b-s2G97) with 10K visits, ie around 3 to 5 full stones better! That's of course extremely speculative, and there are many practical and theoretical uncertainties (eg the assumption on the curve equation, is there a systemic weakness hindering the approach, ...), but to my knowledge, it's a first "rough assessment" of the distance to perfect play :-)

Note: latest available 30b net (30b-s2G84) is probably ~0.5 stones better than 20b-s2G97 at visit-parity, so this rough approach seems to show that perfect play is probably between 2.5 and 4.5 full stones better than latest available KataGo 30b net at 10K visits.

Here are the detailed results:
W visits   B visits   Adv 4x (points)
      40         10   20.7
      80         20   19.5
     160         40   18.6
     320         80   14.7
     640        160   12.1
    1280        320   12.0
    2500        640   11.3
    5000       1280   12.0
   10000       2500   11.2
   20000       5000   NA
   40000      10000    9.8
   80000      20000   NA
  160000      40000    9.0 (uncertain)
Points up to 1280 vs 320 are done with ~2K games each (with various komi above and below the optimal komi, to assess it efficiently), then ~500 games for the following points, except the last one (160K visits vs 40K) with only 100 games (sooo slow). Matches are done using the katago match tool.

Comments welcome!

Bill Spight · #15

Just a reminder. Handicap stones are not linear in points. Games with 40 stones or more are possible, but it is not possible to give 600 points reverse komi.

That said, IMX go ranks based upon handicap stones are nonetheless roughly linear, even with rank differences greater than 20. I don't think that we know much about the relationship between ratings, handicaps, and komi when the differences are large. So in the future we might see bots that can give top humans 6 stones, but cannot give 75 pts. reverse komi.

moha · #16

gennan wrote:

The score difference between 2 players over a whole game may then roughly follow a beta distribution.

Whether you use gamma or beta (both imperfect as continuous), the sum after 150 additions should be pretty close to a normal as well. I think the biggest weakness of these approaches is not the distribution used for approximation, but the implied assumption that the errors are independent. In reality this is not quite true.

For example, players may be weaker in certain types of games (moyo or large scale tactical fight), against certain shapes or a certain opponent, which makes many of their errors in that game larger than usual. The errors in the above sequence of around 60 pts are also directly related to each other.

On the other hand, Elo works pretty well for many games, despite making an even stronger simplifying assumption (uniform sd). In go that would not be sustainable, but to me that seems quite wild in reality even for chess. The reason for a smaller error total is almost always smaller individual errors - thus a smaller deviation for the total.

Quote:

My guess is that top AI may only lose about 10 points total per game on average.

This seems a bit too low to me, and there is a problem with these estimates: both humans and bots get stronger with more time, but perfect play (and any distance to it) is a time-independent, fixed strength.

Quote:

total point loss per game:

Code:

rank  real?  KataGo estimate
AG0    10    100

I agree this line is a bit suspicious. Comparing to a different but similar strength player is already asking for trouble. Also variance - from the 8 games above my results fluctuated quite widely, so a lot of samples would be needed for a real average.

moha · #17

Friday9i wrote:

That's more or less what I tried in the past few weeks, but with a different approach, using KataGo! :-)

Nice! I'm not sure this is completely the same though. I also wonder if 4x visits surely gave the same advantage at both ends of your table (winrate wise, with normal komi) - your approach seems to depend on this assumption.

Friday9i · #18

moha wrote:

Friday9i wrote:

That's more or less what I tried in the past few weeks, but with a different approach, using KataGo! :-)

Nice! I'm not sure this is completely the same though. I also wonder if 4x visits surely gave the same advantage at both ends of your table (winrate wise, with normal komi) - your approach seems to depend on this assumption.

I did not test that but I tested the "additivity" and it works: if going from x visits to 2x gives an advantage of p points, and going from 2x to 4x gives q points, then going from 1x to 4x gives almost exactly p+q points (eg 13+11=23, or something like that, so it works within statistical noise :-) )

gennan · #19

I've made several attempts at fitting the Elo width per rank from several sources. Recently I was converging to

Code:

6/(3200-r)*40000/ln(10)

where 40000/ln(10) is a conversion factor to convert from log-odds per EGF point to Elo width per rank. That formula contains my previous assumption that current top AI are fairly close to perfect play and perfect play may correspond to 12d (3200 EGF).

But Friday9i gave an estimate for perfect play being roughly 3.5 ranks above KataGo. The gap between KataGo and Korean 1p YeonWoo (a youtuber) seems to be roughly 3 ranks and on the EGF rating scale, she is probably 7d-8d (2750 EGF). So that would put KataGo at about 10.5d (3050 EGF) and the perfect player at about 14d (3400 EGF).

So I modified my fit to

Code:

60/(3400-x)^1.3*40000/ln(10)

You can compare those fits on Fooplot.
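For a quick comparison without Fooplot, both fits can be evaluated directly. This is a small sketch: the function names and the sample ratings are chosen here for illustration only, and the two expressions are transcribed as written above, with the rating variable called `egf_rating` in both.

```python
import math

ELO_PER_RANK_FACTOR = 40000 / math.log(10)   # conversion factor quoted in the post

def elo_width_old_fit(egf_rating):
    """Earlier fit: perfect play assumed at 3200 EGF (12d)."""
    return 6 / (3200 - egf_rating) * ELO_PER_RANK_FACTOR

def elo_width_new_fit(egf_rating):
    """Modified fit: perfect play assumed at 3400 EGF (14d)."""
    return 60 / (3400 - egf_rating) ** 1.3 * ELO_PER_RANK_FACTOR

# Columns: EGF rating, Elo per rank (old fit), Elo per rank (new fit).
for rating in (1100, 2100, 2700, 3050):
    print(rating, round(elo_width_old_fit(rating)), round(elo_width_new_fit(rating)))
```

Both functions diverge as the rating approaches the assumed perfect-play rating, so they are only meaningful below 3200 and 3400 EGF respectively.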

On an empty board with 7.5 komi, KataGo believes black has a 44% winrate. When black passes, his winrate drops to 6.2%. Translating these to Elo, black's pass decreases black's odds by 430 Elo. This is a rough indication that 1 rank is about 430 Elo wide at KataGo's level. This is close to the Elo per rank at 3050 EGF on the new fit.
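The 430 Elo figure follows from the standard logistic Elo relation, where a winrate p corresponds to 400*log10(p/(1-p)) Elo. A minimal check (the helper name is illustrative):

```python
import math

def elo_from_winrate(p):
    """Elo difference corresponding to winrate p under the logistic Elo model."""
    return 400 * math.log10(p / (1 - p))

# Black's winrate before and after passing on an empty board, as quoted above.
drop = elo_from_winrate(0.44) - elo_from_winrate(0.062)
print(round(drop))
```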

Note: In these fits, I'm still assuming that the Elo width per rank approaches infinity for a perfect player (instead of 5600 Elo as was hypothesised in the related topic), but that doesn't make much of a difference for these fits.

lightvector · #20

gennan wrote:

On an empty board with 7.5 komi, KataGo believes black has a 44% winrate. When black passes, his winrate drops to 6.2%. Translating these to Elo, black's pass decreases black's odds by 430 Elo. This is a rough indication that 1 rank is about 430 Elo wide at KataGo's level. This is close to the Elo per rank at 3050 EGF on the new fit.

If it matters, specifically it would be the Elo in self-play conditions, which might average out to being something like the equivalent of 300 or 400 visits. (Selfplay randomizes between 200 and 1k visits in a certain way). Probably black's chances would go down a bit more than that with more typical match settings.
