Revised European go ratings

gennan · Post by **gennan** » Wed Sep 27, 2017 11:32 am

Krama wrote:Is it possible to do a comparison of goratings.org and goratings.eu?

Perhaps to somehow place goratings.org into the eu version and see how euro players compare to Asian pros?

goratings.org is made by Rémi Coulom and goratings.eu is made by me (Dave de Vos). There is no connection other than a similarity in name. The similarity in name comes from the fact that both are about go ratings, but I think the similarity stops there.

goratings.eu focuses on finding a rating system that maps well to European amateur go ranks (like the EGD), in the past, present and future. In such a system, 100 rating points difference corresponds to one go rank difference for all go ranks (at least this is the intention). The winning odds between go ranks vary considerably, depending on the rank of the players (this is different from normal Elo ratings). Overall rating drift is undesirable, because it would mean that the mapping to European amateur go ranks would be lost over time. I think this kind of system mostly serves European amateur tournament players.

I think goratings.org is intended as a world ranking list to order the world's top players by their current competitive success rate. There are European top players in that list (in the lower regions). In my understanding these ratings are like standard Elo ratings: it means 100 points rating difference corresponds to about 2:1 winning odds for all ratings. That means that ratings in that list have no direct relation to go ranks (amateur or professional). I suppose the overall rating values may also drift over time, because that does not matter for the order in the list. Only rating differences have a meaning for the order of the list.
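As a rough illustration of the Elo-style scale described above, here is a minimal sketch using the standard logistic Elo expectation, in which 400 points correspond to 10:1 odds, so 100 points give odds of roughly 1.8:1 (the "about 2:1" mentioned above). That goratings.org follows exactly this curve is an assumption, and the function name is made up for the example.

```python
def elo_win_probability(rating_difference):
    """Standard Elo expectation for the higher-rated player (assumed curve)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_difference / 400.0))


p = elo_win_probability(100)
odds = p / (1.0 - p)
print(f"win probability {p:.3f}, odds {odds:.2f}:1")
```

On such a scale only the rating difference matters, which is why the same 100 points mean the same odds everywhere in the list.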

So comparing ratings from goratings.org to ratings from goratings.eu is like comparing apples and oranges IMO.

I did make a crude estimate for conversion at the bottom of this page, by matching extrapolated winning odds from goratings.eu and guesstimating an offset by comparing some players that occur in both lists, but I don't think you should attach much meaning to it. They work differently, because they serve different purposes. They use different scales and the offset between them may drift significantly over time.

gennan · Post by **gennan** » Wed Sep 27, 2017 1:27 pm

Pio2001 wrote:Hi,
I don't know how the european system works, but in France, there are several adjustments made to the rating system.
For new players, an iterative procedure is applied to correct the rank. If the variation after the first tournament is below -50 or above +50, the player's registration grade is replaced with its final grade, and the calculations are performed again. The final grade is fed in place of the initial grade as long as the variation is more than 50.
This lets a new player's ranking start from his/her performance during the tournament, rather than from a guess.
The system works well except if the first ranked game is a single game in the club. If the result is not meaningful, then the same adjustment can't occur during the second tournament of the player.

However, a second type of iterative adjustment is possible: if, during any tournament, the grade increases by more than 60 points, the iterative procedure is applied. This only works for positive variations. There is no iteration for negative variations, except during the very first tournament of new players.

These corrections are also useful for their opponents, as the correction occurs before all the calculations.
For example, if a young player has gone from 5 kyu to 1 dan during the holidays, and they get a +200 correction, their opponents are considered to have lost against a 3 kyu player instead of a 5 kyu.
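The first-tournament iteration described in this quote can be sketched as follows. This is only an illustration: `variation_for` is a hypothetical callback standing in for the federation's actual tournament calculation (which is not given in the thread), `max_rounds` is a safety guard added for the example, and the toy numbers are invented.

```python
def iterate_entry_grade(registration_grade, variation_for, threshold=50, max_rounds=100):
    """Re-run the tournament calculation until the variation is within the threshold.

    variation_for(grade) is a hypothetical callback returning the rating
    change the tournament would produce for a player entered at `grade`.
    """
    grade = registration_grade
    variation = variation_for(grade)
    for _ in range(max_rounds):
        if abs(variation) <= threshold:
            break
        grade += variation  # the final grade becomes the new entry grade
        variation = variation_for(grade)
    return grade, variation


# Toy stand-in for the real calculation: the player performs at 400 and
# each pass moves the grade halfway towards that performance.
def toy_variation(grade):
    return 0.5 * (400 - grade)


entry_grade, last_variation = iterate_entry_grade(0, toy_variation)
print(entry_grade, last_variation)
```

With the toy function the entry grade moves from 0 to 200 to 300, at which point the remaining variation is 50 and the iteration stops.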

Why did the player in your example not declare himself 1k after the holidays? Was he unaware that he became so much stronger during the holidays? I find that a bit hard to believe.

Perhaps his national go association did not allow him to declare himself 1k? I know that some countries won't allow it, but I think such a policy is counterproductive, because it inevitably causes problems like these. And then you need a complicated system to fix it and arrive at the same result as if that player had simply been allowed to declare himself 1k.

I suppose such policies are intended to prevent rank inflation, but is that really a big problem? I can understand a desire to have some regulation for (higher) dan ranks, but for (lower) kyu ranks, the cure seems worse than the disease. I think it leads to kyu rank deflation. I read the topic that Herman linked to a few posts back and some posters in that thread claim that kyu rank deflation was clearly an issue in France in 2011.

Pio2001 wrote: Also, the French federation uses a weighting parameter for handicap games that is biased towards the higher grades: the variations of both players in a handicap game are multiplied by (1 - H/10), and if White loses, the variation of White's grade is multiplied by (1 - H/10) a second time.
It means that handicap games are weighted at 90% of their value for a 1-stone handicap and 10% for a 9-stone game, and that strong players have special protection when they play handicap games against weaker players. For a 9-stone game, their variation is only 10% of the calculated value if they win, but 1% if they lose (their opponent still gets 10% in this case).
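A minimal sketch of the weighting as described in this quote, where H is the number of handicap stones. The function name and return convention are made up for illustration; only the (1 - H/10) factor and its second application to White after a loss come from the description.

```python
def handicap_weights(handicap, white_lost):
    """Return (black_weight, white_weight) applied to each player's variation."""
    base = 1.0 - handicap / 10.0
    black_weight = base
    # White's variation is multiplied by the factor a second time after a loss.
    white_weight = base * base if white_lost else base
    return black_weight, white_weight


for stones, lost in [(1, False), (9, False), (9, True)]:
    black, white = handicap_weights(stones, lost)
    print(f"H={stones}, white lost={lost}: black {black:.2f}, white {white:.2f}")
```

For a 9-stone game this gives a weight of 0.10 for both players when White wins, and 0.10 for Black but 0.01 for White when White loses.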

I think that handicap differences define go ranks, so handicap games are an important benchmark for the spacing between ranks. I would therefore rather not reduce the weight of large handicap games as long as they are fairly additive (so that in practice, an 8.5-rank difference matches a 9-stone handicap as well as a 2.5-rank difference matches a 3-stone handicap). I feel that the smaller K factor of the stronger player is already sufficient "protection" for the stronger player's rating (assigning more inertia to his rating). (I don't know whether the French system uses a variable K factor.)

I haven't come to it yet, but I intend to study handicap game statistics from the EGD game history. There are almost 100,000 handicap games in it (more than 10% of the total), so there is plenty of data. For example, I hope to find out to which degree handicap games can serve as long range benchmarks for rank spacings, because even though even game winning odds are a reasonable local measure of the rating curve, they are a bit too volatile with large differences to accurately determine the global shape of the rating curve.

gennan · Post by **gennan** » Wed Sep 27, 2017 11:57 pm

I played with the rating reset policy.

I find that tweaking the rating reset policy works well to regulate overall rating drift (inflation / deflation).
I chose a rating reset policy that is a bit more liberal in the kyu range (basically giving more credence to self-promotions) and more conservative in the high dan range (compared to the EGD rating reset policy), optimized to the point that the apparent rating drift over the years looks about 0.
In fact it seems that tweaking the reset policy works so well to regulate drift, that I was able to throw away the epsilon parameter that was intended to regulate rating drift.

The effect of this can be seen in the rating distribution charts. For example, the EGD rating distribution shows skewed distributions and deflation in the mid-dan range. But the revised rating distribution looks much more regular. For me, this result proves that it is a good idea to believe declared self-promotions, particularly in the kyu range.

There are underranked and overranked players in the EGD and in the revised system. But I think it's only to be expected that the distributions are more or less regular gaussian distributions over the entire rating and rank ranges. That is just how regular statistics behave.

Pio2001 · Post by **Pio2001** » Thu Sep 28, 2017 4:17 am

The problem is that many players see the rank as a title or a reward.

The FFG allows players who have improved a lot without playing official games to ask for a reevaluation. But the process is regulated:
The reevaluation must be backed by an administrator of the club and be sent with SGF records of some games, which are then analysed by a committee that decides whether the player deserves a reevaluation or not.

But in my club, the top players are proud of their rank, and do not like to see people getting a high rank without winning their games in actual tournaments. I remember when they purposely gave a player a lower evaluation on the reevaluation form because they said that otherwise, the player would be ranked above another one and that would be "unfair". For them, a rank is a trophy and should not be awarded freely.

This behaviour has its roots in Asian tradition, where dan ranks were honorary distinctions that were awarded only to the best champions.

In my opinion, this attitude is consistent as long as we are dealing with the players that are usually above the McMahon Bar in tournaments. For players under the McMahon bar, it is necessary to have the most accurate ranking possible for the McMahon system to work properly and pair together players of equal strength. In this case, reevaluations are useful.

gennan · Post by **gennan** » Thu Sep 28, 2017 11:32 am

I assume the majority of the French go community prefers to keep these regulations (otherwise, why are they still used by the FFG?), so it is of little consequence how I feel about these policies. I may be able to show if these policies affect French rank distributions, though:

When I inspect the overall rating distribution, it seems that the distribution of ranks is about gaussian.

In the revised rank distributions (of 2009 for example), I see there is not much skew (asymmetry) and each rank distribution happens to have a consistent width (FWHM) of about 160 rating points over the whole rating range and they are quite evenly spaced. Because of the consistency and regularity of these distributions, I feel this may be close to the natural, unbiased variation of go ranks.

The EGD rank distributions are similar, but (again looking at 2009) around 1k the rank distributions tend to be wider (about 250 rating points), more skewed and about 50 points deflated. To me this hints at systemic biases, which I'm trying to eliminate in my revised system (while keeping it as simple as possible).
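The widths quoted above are full widths at half maximum (FWHM). For a Gaussian, FWHM = 2 * sqrt(2 * ln 2) * sigma, or about 2.355 * sigma, so the two widths can be converted to standard deviations as in the sketch below. It assumes the rank distributions are close enough to Gaussian for the conversion to be meaningful; the function name is illustrative.

```python
import math


def fwhm_to_sigma(fwhm):
    """Convert a Gaussian full width at half maximum to a standard deviation."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))


for label, fwhm in [("revised, all ranks", 160), ("EGD around 1k", 250)]:
    print(f"{label}: FWHM {fwhm} -> sigma {fwhm_to_sigma(fwhm):.1f} rating points")
```

So a width of 160 points corresponds to a standard deviation of roughly 68 points (about two thirds of a rank), and 250 points to roughly 106 points (about one rank).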

The revised rank distributions look quite unbiased to me, so from the statistics of all European players I cannot tell if the FFG policies lead to kyu rank deflation. At least it's not visible in the total European data.

I think it would be interesting to present rank distributions per country to see if there is any correlation between rank bias and national rank regulation policies.

gennan · Post by **gennan** » Fri Sep 29, 2017 3:11 pm

I added rank distributions per country, and I don't really notice a big difference between countries with large go populations. For example, comparing France 2011 with Germany 2011 or Russia 2011, I don't notice a clear difference between strict rank regulation and no rank regulation. If anything, kyu rank regulation seems to be less harmful than I expected. Perhaps it is even beneficial to evenly distributed ranks.

I notice that countries with a smaller go population have a greater variance in rank distribution. But perhaps this is not surprising, because smaller numbers usually mean greater statistical deviations.
But comparing medium sized go population countries like Czechia 2011 with Netherlands 2011, the Czech kyu ranks look more evenly distributed.

Pio2001 · Post by **Pio2001** » Sat Sep 30, 2017 2:46 am

Very interesting,
Thank you for your work !

gennan wrote:I assume the majority of the French go community prefers to keep these regulations (otherwise, why are they still used by the FFG?)

The truth is that nearly nobody understands how the rating system works, except the people in charge.

Players are aware of the existence of special adjustments, because they can see it in their rating, but the details are mostly unknown to them.

gennan · Post by **gennan** » Sat Sep 30, 2017 9:11 am

Pio2001 wrote:Very interesting,
Thank you for your work !

Thank you

I tweaked the rank distributions: I removed the normalization and I added extra buttons to jump to the first and last year.

I also added markers to show the amount of rank shift in the data.
For example, comparing the 2012 EGD rating distribution with the 2012 revised rating distribution, the rank shift of the EGD data in the lower dan area becomes much clearer.

gennan · Post by **gennan** » Sat Sep 30, 2017 11:13 am

I find some curious differences in EGD rank shift per country. For example when I compare 2012 UK EGD with 2012 UK revised, the EGD distribution has a strong bias in the mid-dan region. But when I compare 2012 CZ EGD with 2012 CZ revised, the EGD distribution does not have much bias.

What could explain this difference? The Czech distribution shows a relatively low number of games in the mid-dan region, while the UK distribution shows a relatively high number of games in that region. So perhaps the 2012 UK mid-dan players were more affected by the EGD biases, simply because they were more active in 2012 than their Czech counterparts.

So I looked up a year where the Czech mid-dan players were active: 1998. Comparing 1998 CZ EGD with 1998 CZ revised, the EGD distribution does show a bigger bias in the mid-dan region.

I cannot say that this conjecture holds in all cases, but there is theoretical support for it, because of the parameter values that the EGD uses, in particular a (see 1/a). Because of this choice of a, the expected odds of the EGD are especially unfavourable for players around 1d, causing them to lose rating points when they are in fact not underperforming. So if these players want to conserve their EGD rating, the best strategy would be not to play in tournaments. I tried to make my revised system neutral in this aspect.

gennan · Post by **gennan** » Sat Sep 30, 2017 2:28 pm

I added rating lists, for example Spain: ES revised rating list

gennan · Post by **gennan** » Tue Oct 03, 2017 11:40 pm

I added statistics of handicap games. For example, see 4 handicap probabilities.

The handicap statistics are more noisy than the even game statistics, but I suppose the smaller number of games is a factor here. Particularly, severely overhandicapped or underhandicapped games are rather sparse.

But overall, the rule that correct handicap should match the rank difference seems to hold for the EGD system (and the revised system). It means that the observed winrate is close to 50% when the handicap matches the rating difference (handicap = ratingDifference / 100 - 0.5 * sign(ratingDifference)). This seems to hold for all ratings and all handicaps. I cannot claim this with great confidence, because the data is somewhat sparse and noisy, but overall, the data seems to confirm it.

There is not much difference between the EGD system and the revised system in this respect, but this can be expected, because the EGD bias usually doesn't exceed half a stone and that is a bit too subtle to detect in the handicap statistics.

So on average, the EGD ratings seem quite suitable for predicting handicap differences, which means that the overall shape of the rating distribution is ok.
So fitting a rating function to the observed winning odds should result in a rating scale that matches rating differences with handicaps / rank differences.

The predicted winning odds (predicted by the EGD system) don't match the observed winning odds well, so what else can explain that handicaps match rating differences rather well in the EGD?

I think the reset policy is the reason: newcomers and quickly improving players keep feeding the EGD system with declared ranks that are basically about right with respect to handicap.
This keeps the system calibrated with respect to handicaps to the 1st order. The mismatched expected winning odds of the EGD cause only a second order bias on top of this 1st order calibration. It only affects players that play many games without resets: active old-timers.

So I would call the EGD reset policy crude but effective. Overall, it makes the system work in spite of the mismatched winning odds.
Basically, believing newcomers' declared ranks and believing self-promotions helps a lot to keep the system calibrated.

So having a reset policy is quite important and I included a reset policy in the revised system (I'm only trying to make it a bit more sophisticated).

gennan · Post by **gennan** » Thu Oct 05, 2017 1:03 pm

I added an addendum at the bottom of the About page. It is a bit long to post here and it repeats things I posted here earlier, but here is a copy:

Addendum 2017-10-05
Example:
If the predicted winrates matched observed winrates exactly, on average players would not gain nor lose points when everybody's skill stays the same.
But the expected winrates don't match observed winrates in the EGD. In the example below I show what happens because of it.

We have a player with rating 2100. He plays games against a player with rating 2000. The EGD expects him to win 71% of these games. In reality he wins about 60% (as observed in the statistics of the EGD).
So his winrate minus the expected winrate is -0.11. His K factor is 24, so on average he will lose 2.6 points per game played against this opponent.

The same player also plays against another player with rating 2200. The EGD expects him to win 26% of these games. In reality he wins about 35% (as observed in the statistics of the EGD).
So his winrate minus the expected winrate is +0.11. Using his K-factor of 24, we find that on average he will win about 2.6 points per game played against this opponent.
So if he plays both opponents with the same frequency, his rating will not change on average.

But the demographics of the EGD data show that since 2003, players rated around 2000 appear more frequently in tournament games than players rated around 2200 (the ratio is about 5:4). Correcting for this we find that his rating will change by about (5 * -2.6 + 4 * 2.6) / 9 = -0.29 points per game in this demographic distribution.
This is not much, but if this player plays 25 games a year, which is typical for tournament players in this rating region, he will lose about 7 points in a year, and over 10 years every player around 2100 rating would lose about 70 points.
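The per-game drift in this example can be reproduced with a few lines. The sketch below takes the winrate differences of -0.11 and +0.11, the K factor of 24 and the 5:4 opponent ratio directly from the text; the function and variable names are made up, and the plain Elo-style update K * (observed - expected) is a simplification that ignores the epsilon term.

```python
K_FACTOR = 24


def points_per_game(observed_minus_expected, k_factor=K_FACTOR):
    """Average rating change per game from a systematic winrate error."""
    return k_factor * observed_minus_expected


loss_vs_2000 = points_per_game(-0.11)  # about -2.6 points per game
gain_vs_2200 = points_per_game(+0.11)  # about +2.6 points per game

# Opponents rated around 2000 appear about 5 times for every 4 rated around 2200.
drift_per_game = (5 * loss_vs_2000 + 4 * gain_vs_2200) / 9
print(f"average drift: {drift_per_game:.2f} points per game")
```

The gains and losses cancel only when both kinds of opponent are equally common; the 5:4 imbalance is what leaves a net loss of about 0.29 points per game.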

But the EGD also uses an epsilon parameter. This will give this player 24 * 0.016 = 0.384, or about 0.38, free points for every game he plays. This is more than enough to compensate for the expected winrate errors.
One could argue that this epsilon correction would not be necessary if the expected winrates were closer to reality (my finding is that this is indeed the case, and I see no reason to keep these winrate errors).
Nevertheless, it would seem that the expected winrate errors are more than compensated by the epsilon parameter.

Then?
Still, there is a gradually increasing difference between declared ranks and ratings in the EGD rating distributions, with a maximum of about 50 points around the lower dan region in 2012. This trend is reversing a bit in recent years, but my theory is that in recent years, dan players chose to comply with the rating system instead of looking at the ranks they have according to handicap.
What other causes could there be?
Is it that players around the lower dan region were overranking themselves more and more between 1996 and 2012?
I cannot rule this out, but neither can I rule out that this deflation is caused by a defect of the rating system.

Another possible cause for deflation is improving players. There are two mechanisms in the EGD that are supposed to compensate for this.
1: The rating reset policy. This is to prevent quickly improving players from removing many points from the system. But the EGD only resets players who get 2 stones stronger between tournaments. That is rather conservative, because getting stronger is usually a more gradual process. Most players don't get stronger that quickly.
2: The epsilon parameter. This should compensate for slowly improving players (which is a much bigger group I assume). But as we have seen above, 3/4 of the epsilon parameter is used up to compensate for the expected winrate errors. So of the original 0.016, only 0.004 is left to counter deflation from slowly improving players!
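The "3/4 used up" figure in point 2 follows from the numbers given earlier in the addendum. A small sketch of that bookkeeping, with a K factor of 24, an epsilon of 0.016 and a winrate-error drift of about 0.29 points per game taken from the text, and all names invented for the example:

```python
K_FACTOR = 24
EPSILON = 0.016
WINRATE_ERROR_DRIFT = 0.29  # points lost per game, from the example above

epsilon_points = K_FACTOR * EPSILON                  # free points per game
fraction_used = WINRATE_ERROR_DRIFT / epsilon_points
remaining_epsilon = EPSILON * (1.0 - fraction_used)

print(f"epsilon gives {epsilon_points:.3f} points per game")
print(f"fraction spent on winrate errors: {fraction_used:.2f}")
print(f"epsilon left for improving players: {remaining_epsilon:.4f}")
```

About three quarters of the epsilon bonus goes to cancelling the winrate errors, leaving roughly 0.004 of the original 0.016.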

So in the end, there isn't much to counter the deflation caused by slowly improving players and they will inevitably take away points from the system, leading to deflation.

So what to do?
1: Fix the expected winrate errors.
2: Use a less conservative reset policy.

The revised system does both and I find it has no need for an epsilon parameter.

ez4u · Post by **ez4u** » Thu Oct 05, 2017 5:36 pm

The example seems incomplete. Could you describe what happens to the ratings of all three players (2000 player, 2100 player, and 2200 player) if they each play 24 games evenly split between the two opponents, or somehow weighted based on the sizes of the underlying pools? It seems insufficient to view the problem through the results of a single player when two play in each game and the problem is described in terms of three.

Javaness2 · Post by **Javaness2** » Fri Oct 06, 2017 12:06 am

To me it's quite an interesting result that you don't need any epsilon parameter if you improve the rating reset policy. I have to read further.

Schachus · Post by **Schachus** » Fri Oct 06, 2017 1:32 am

My take is that this is neither surprising nor an improvement. The "problem" you are trying to fix is that ratings don't match the ranks (nor does the average rating per rank). Of course, if you give kyu players a reset each time they raise their declared rank, then ratings will "fit" the rank better, because the rating is often forced to fit it. But it is questionable whether this is an improvement.
Why is it actually a problem that ratings don't fit ranks? It is a fact that the difference between 1k and 1d is smaller than between other ranks, because people like to call themselves dans, so they often change their rank to 1d prematurely. This effect (and a similar one in neighbouring ranks, because people match that behaviour: if the 1d is not much stronger, they go to 1k) is seen in this "problem", and to me it's not a problem that needs to be fixed at all.
Of course ratings have problems, but self-estimations have them as well. In chess there is no such thing as self-estimation, and there are studies showing that 80% of players believe they are underrated. Now is that likely, or are most of them just overestimating themselves? You could say this is a problem and we need to reset their rating to their estimation, but you would screw up more than you are fixing, because the rating becomes less objective.

In fact, there might be players (me, for example) who explicitly don't want a reset as long as they are improving slowly and steadily, because the rating is the testimony of said improvement (you don't just imagine it, your results really get better). I got one reset from 8k to 5k because it was apparent I had improved and the old rating was not appropriate for me anymore. Since then I have only increased my rank by 1 (more or less following the rating), so I can see my rating evolve from the reset at 1600 to its current value (1802), documenting my improvement. With your system, the rating got reset to 4k and to 3k (and would have been reset to 2k, if not for the fact that the tournament where I registered experimentally as a 2k for some reason didn't make its way into the EGD). So the only thing I see in your version of the rating is that it improved from the 1800 reset to 1840 over the last 2 (or 3?) tournaments; the tournaments before that are rendered irrelevant to the rating because of the reset.

On a positive note: one thing that is much better in your system is the handling of weak players (20k level). In the EGD their ratings are largely screwed up, because the EGD has a bottom cutoff of 100 rating (like they couldn't handle negative numbers!?). Now there are handicap tournaments for children, where a 20k beats a 30k with 9 stones, which is just as expected. But a 30k counts like a 20k for the EGD, so the EGD thinks the 20k did the equivalent of beating an 11k, and boosts his rating a lot. This leads to large inflation of ratings in the ranks below 15k, and I think your system is in fact much better there.
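The effect of the cutoff can be illustrated with a small sketch. It assumes the rank-to-rating mapping implied elsewhere in the thread (100 points per rank, with 5k at 1600, so 20k at 100) and counts each handicap stone as 100 rating points, which is a simplification; the function names are made up for the example.

```python
EGD_FLOOR = 100  # bottom cutoff mentioned in the post


def kyu_to_rating(kyu):
    """Assumed linear mapping: 1k = 2000, 5k = 1600, 20k = 100."""
    return 2100 - 100 * kyu


def effective_opponent_kyu(kyu, handicap_stones, floored):
    """Kyu rank the opponent is treated as after adding the handicap stones."""
    rating = kyu_to_rating(kyu)
    if floored:
        rating = max(EGD_FLOOR, rating)
    rating += 100 * handicap_stones  # simplification: one stone = 100 points
    return (2100 - rating) // 100


print(effective_opponent_kyu(30, 9, floored=True))   # treated as an 11k
print(effective_opponent_kyu(30, 9, floored=False))  # treated as a 21k
```

With the floor, a 30k receiving 9 stones is treated like an 11k; without it, like a 21k, which is close to an even match for a 20k.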
