Revised European go ratings

HermanHiddema · Post by **HermanHiddema** » Fri Oct 06, 2017 2:03 am

Schachus wrote: Why is it actually a problem that ratings don't fit ranks?

Ranks are based on handicaps, and are used to determine handicaps. They have been used for that for centuries, and have worked very well in that respect.

Ratings are based on win percentages, and win percentages alone cannot determine handicaps (e.g. if player A defeats player B in 83% of even games, what is the proper handicap?).

If you want to use ratings to determine handicaps, you have two options:

1. Publish rating-handicap tables (e.g. at rating 1400, handicaps 1, 2, 3, 4, 5, 6 are at 1467, 1530, 1590, 1649, 1705, 1758 or something like that)

2. Fiddle with the rating parameters (scale the ratings) so the handicaps line up with multiples of 100 reasonably closely.

The EGF has chosen option 2, which IMO is a much saner option, because it is much more user-friendly. This has always been a feature of the EGF system.

Dave's work here is a suggestion to improve the parameters slightly so that the ratings line up more closely with historical rank data.

Unless you can show that the current rating parameters provide more accurate handicap determination, i.e. that the historical data is systematically biased, I don't see any reason not to try to match the historical data as closely as possible.

gennan · Post by **gennan** » Fri Oct 06, 2017 2:53 am

Schachus wrote: My take is that this is neither surprising nor an improvement. The "problem" you are trying to fix is that ratings don't match the ranks (nor does the average rating per rank). Of course, if you give kyu players a reset each time they raise their declared rank, then ratings will "fit" the rank better, because the rating is often forced to fit it. But it is questionable whether this is an improvement.
Why is it actually a problem that ratings don't fit ranks? It is a fact that the difference between 1k and 1d is smaller than between other ranks, because people like to call themselves dan players, so they often change their rank to 1d prematurely. This effect (and a similar one in neighboring ranks, because people match that behaviour: if the 1d is not much stronger, they go to 1k) shows up in this "problem", and to me it's not a problem that needs to be fixed at all.
Of course ratings have problems, but self-estimations have them as well. In chess there is no such thing as self-estimation, and there are studies showing that 80% of players believe they are underrated. Now is that likely, or are most of them just overestimating themselves? You could say this is a problem and we need to reset their ratings to their estimates, but you would screw up more than you fix, because the rating becomes less objective.

If you can't trust declared ranks, then the whole rank system of go means little. Then why not use a pure Elo system that has no relation to ranks? That would be a different kind of system (like goratings.org).
But the EGD claims a relation between their ratings and ranks. It's a great feature, but it adds the burden to make sure it's about right, no? Is there a better way to calibrate it than basically believing declared ranks?

Schachus wrote: In fact, there might be players (me for example) who explicitly don't want a reset as long as they are just slowly and steadily improving, because the rating is the testimony of said improvement (you don't just imagine it, your results really get better). I got one reset from 8k to 5k because it was apparent I had improved and the old rating was not appropriate for me anymore. Since then I have only increased my rank by 1 (more or less following the rating), so I can see my rating evolve from the reset at 1600 to its current value (1802), documenting my improvement. With your system, the rating got reset to 4k and to 3k (and would have been reset to 2k, if not for the fact that the tournament where I registered experimentally as a 2k for some reason didn't make its way into the EGD). So the only thing I see in your version of the rating is that it improved from the 1800 reset to 1840 over the last 2 (or 3?) tournaments; the tournaments before that are rendered irrelevant to the rating because of the reset.

So you'd rather have an epsilon parameter than a liberal reset policy. It's possible, but it is quite difficult to determine the correct value of such an epsilon parameter.
I did make my reset policy a bit conservative for higher ranks: If an 1100 rated player promotes himself to 10k (1000 rating), the reset will grant him a full reset to 1000. But if a 2400 player promotes himself to 5d (2500 rating), the reset will only grant him a reset to 2450 (the lower bound of 5d). This behaviour flips gradually around rating 2100.
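A rank-dependent reset of this kind can be sketched as a blend between the nominal rating of the declared rank and the lower bound of that rank (50 points below). This is only an illustration: the post does not say how the "gradual flip" around 2100 is shaped, so the logistic weight, the `width` parameter and the function name are assumptions.

```python
import math

def reset_target(rank_rating, pivot=2100.0, width=100.0):
    """Rating granted by a reset to a declared rank with nominal rating rank_rating.

    ASSUMPTION: the transition around `pivot` is modelled with a logistic
    weight; the actual shape used in the revised system is not given.
    """
    # w is close to 0 well below the pivot (full reset) and close to 1
    # well above it (reset only to the lower bound of the rank).
    w = 1.0 / (1.0 + math.exp(-(rank_rating - pivot) / width))
    lower_bound = rank_rating - 50.0
    return rank_rating - w * (rank_rating - lower_bound)

print(reset_target(1000.0))  # very close to 1000 (full reset)
print(reset_target(2500.0))  # close to 2450 (lower bound of 5d)
```

With these assumed parameters, a promotion to a rank at exactly 2100 would land halfway, at 2075.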
I intend to experiment with more sophisticated resets: When a player promotes himself or when a newcomer enters the system, his K factor will be double the normal value (temporarily increasing the volatility of his rating). His opponents' K factors halve when they play him. Then over the course of a dozen or so games, the K factors gravitate to their normal values. In that way, the system collects evidence for the new rating with reduced disturbance of his opponents' ratings.
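The K-factor idea can be sketched as a pair of multipliers that start at 2 (for the reset player) and 0.5 (for his opponents) and return to 1 over about a dozen games. The linear schedule and the name `k_multipliers` are assumptions made for illustration; the post only says the values "gravitate" back to normal.

```python
def k_multipliers(games_since_reset, settle_games=12):
    """Return (player_multiplier, opponent_multiplier) for the K factor.

    ASSUMPTION: the multipliers relax linearly over `settle_games` games.
    """
    # Fraction of the settling period still remaining, clamped to [0, 1].
    remaining = max(0.0, 1.0 - games_since_reset / settle_games)
    player = 1.0 + remaining          # 2.0 right after the reset, 1.0 once settled
    opponent = 1.0 - 0.5 * remaining  # 0.5 right after the reset, 1.0 once settled
    return player, opponent

for games in (0, 6, 12):
    print(games, k_multipliers(games))
```

Halfway through the settling period this gives multipliers of 1.5 and 0.75, so the opponent's rating is still partly shielded while the reset player's rating moves faster than normal.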

Schachus wrote: On a positive note: one thing that is much better in your system is the handling of weak players (20k level). In the EGD their ratings are largely screwed up, because the EGD has a bottom cutoff of 100 rating (like they couldn't handle negative numbers!?). Now there are handicap tournaments for children where a 20k beats a 30k giving 9 stones, which is just as expected. But a 30k counts as a 20k for the EGD, so the EGD thinks the 20k did the equivalent of beating an 11k, and boosts his rating a lot. This leads to large rating inflation in the ranks below 15k, and I think your system is in fact much better there.

Yes, I think 30k is better. Perhaps 35k or 40k would be better still. In my experience, a 7-year-old beginner is about 40k (based on handicaps in the kids' go club that I run). But the EGD changes declared ranks below 20k to 20k, so ranks lower than 20k are absent in the data that I got. It's a pity.
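The floor effect in the quoted example can be made concrete with a little arithmetic. A minimal sketch, assuming a scale where 20k = 100 and each rank is 100 points, and a handicap compensation of 100 points per stone with 50 for the first (the rule mentioned elsewhere in this thread); the function names are made up for illustration.

```python
def rank_to_rating(kyu, floor=None):
    """Nominal rating of a kyu rank, assuming 20k = 100 and 100 points per rank."""
    rating = 100 + 100 * (20 - kyu)
    return rating if floor is None else max(rating, floor)

def handicap_points(stones):
    """Rating compensation for a handicap: 100 per stone, 50 for the first."""
    return 0 if stones == 0 else 100 * stones - 50

def effective_opponent(kyu, stones, floor=None):
    """Rating the weaker player effectively plays at once the handicap is added."""
    return rank_to_rating(kyu, floor) + handicap_points(stones)

# A 20k (rating 100) gives 9 stones to a 30k.
print(effective_opponent(30, 9, floor=100))  # 950: looks like beating roughly an 11k
print(effective_opponent(30, 9))             # -50: close to the 20k's own rating
```

With the floor at 100, the win looks like a large upset and is rewarded accordingly; without the floor, it is close to an even result.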

gennan · Post by **gennan** » Fri Oct 06, 2017 2:55 am

ez4u wrote:The example seems incomplete. Could you describe what happens to the ratings of all three players (2000 player, 2100 player, and 2200 player) if they each play 24 games evenly split between the two opponents, or somehow weighted based on the sizes of the underlying pools? It seems insufficient to view the problem through the results of a single player when two play in each game and the problem is described in terms of three.

Yes, it is not enough. Perhaps I can make something to run simulations over many games and many players to see the overall long term behaviour of particular algorithms on a player population.
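A toy version of such a simulation, for the three-player case from the question (each player plays 24 games, evenly split between the two opponents), might look like this. It is a sketch under stated assumptions: a fixed K factor, a fixed scale parameter `a`, a zero-sum update without epsilon or resets, and all three players starting at 2100. None of these values come from the EGD or the revised system.

```python
import math
import random

def expected(r_a, r_b, a=110.0):
    """Win probability of player A, with odds e:1 at a rating gap of `a` (assumed constant)."""
    return 1.0 / (1.0 + math.exp((r_b - r_a) / a))

def simulate(true_skill, start, rounds, k=20.0, seed=0):
    """Play `rounds` round-robins and return the final ratings."""
    rng = random.Random(seed)
    ratings = list(start)
    n = len(ratings)
    for _ in range(rounds):
        for i in range(n):
            for j in range(i + 1, n):
                # The result is drawn from the true skills, not from the ratings.
                p_i = expected(true_skill[i], true_skill[j])
                score_i = 1.0 if rng.random() < p_i else 0.0
                # Zero-sum update: what one player gains, the other loses.
                delta = k * (score_i - expected(ratings[i], ratings[j]))
                ratings[i] += delta
                ratings[j] -= delta
    return ratings

skills = [2000.0, 2100.0, 2200.0]
# 12 round-robins = 24 games per player, 12 against each opponent.
final = simulate(skills, start=[2100.0, 2100.0, 2100.0], rounds=12)
print([round(r) for r in final])
```

Because the update is zero-sum, the total number of rating points stays at 6300; only the distribution between the three players changes. Running it with different seeds gives a feel for how noisy 24 games are.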

gennan · Post by **gennan** » Fri Oct 06, 2017 4:35 am

gennan wrote:
Yes, it is not enough. Perhaps I can make something to run simulations over many games and many players to see the overall long term behaviour of particular algorithms on a player population.

Actually, the EGD and the revised system already do that. Only they use the actual tournament data as input instead of a hypothetical player and pairing distribution. The results of those "simulations" are reflected in their respective rating distributions.

gennan · Post by **gennan** » Fri Oct 06, 2017 5:30 am

Schachus wrote: In fact, there might be players (me for example) who explicitly don't want a reset as long as they are just slowly and steadily improving, because the rating is the testimony of said improvement (you don't just imagine it, your results really get better). I got one reset from 8k to 5k because it was apparent I had improved and the old rating was not appropriate for me anymore. Since then I have only increased my rank by 1 (more or less following the rating), so I can see my rating evolve from the reset at 1600 to its current value (1802), documenting my improvement.

So you happen to choose a conservative / pessimistic self-promotion policy. If everybody did that (avoiding the reset policy), it would lead to overall deflation, and the system would need a positive epsilon parameter that increases everybody's rating by a small amount for every tournament game.

Other players happen to choose a more liberal / optimistic self-promotion policy. If everybody did that (exploiting the reset policy), it would lead to overall inflation, and the system would need a negative epsilon parameter that decreases everybody's rating by a small amount for every tournament game.

In practice, some players are conservative and some players are liberal with self-promotions. I suppose this is normal and it has always been like this, even long before the EGD existed. And it's fine, as long as they balance each other out. As far as I can tell, this is mostly the case: The EGD works fairly well with a small positive epsilon value (so perhaps resets should be applied a bit more often).

Instead of using an epsilon parameter to balance long term inflation / deflation, I find that tweaking the reset policy works just as well or even better. And I find that a more liberal reset policy works better than a conservative one (like the EGD reset policy), which means that the average European tournament player is not overly conservative or overly liberal when it comes to self-promotions.
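To make the role of epsilon concrete: it can be thought of as a small constant added to every per-game update, so that a game is no longer zero-sum. A minimal sketch; the exact place where the EGD applies epsilon is not described in this thread, and the K factor and epsilon values below are purely illustrative.

```python
def update(rating, expected_score, actual_score, k=20.0, epsilon=0.0):
    """One-game rating update with a per-game drift term epsilon (assumed form)."""
    return rating + k * (actual_score - expected_score + epsilon)

# Two equally rated players, one game, with a small positive epsilon.
winner = update(2000.0, 0.5, 1.0, epsilon=0.01)
loser = update(2000.0, 0.5, 0.0, epsilon=0.01)
print(winner + loser - 4000.0)  # about 0.4 points injected into the pool by this game
```

A positive epsilon injects points with every game (countering deflation), a negative one removes them, and epsilon = 0 leaves the total unchanged, so any net drift then has to come from resets and from players entering or leaving.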

Schachus wrote: With your system, the rating got reset to 4k and to 3k (and would have been reset to 2k, if not for the fact that the tournament where I registered experimentally as a 2k for some reason didn't make its way into the EGD). So the only thing I see in your version of the rating is that it improved from the 1800 reset to 1840 over the last 2 (or 3?) tournaments; the tournaments before that are rendered irrelevant to the rating because of the reset.

How relevant should historical data be? The fact is that Artem Kachanovskyi is currently 1p (I suppose we can agree on that). Does it matter how he got there? It's the system's job to estimate everybody's current level as best as it can. Ideally all the ratings should behave like a random walk around the "real" skill level of each player at each moment. I don't see rating points as something that one earns (like money or XP points in a video game). You try to improve and if you do, the system should reflect what's happened in the real world as quickly and as accurately as possible. To do that, it needs all the help it can get.

The rating system is basically a measurement device, calibrated to a certain scale. For go, that scale is the go rank scale, which is based on handicap. The EGD has insufficient data on handicap games (other than declared ranks, which are implicitly referring to handicap games), so it uses declared ranks from newcomers, expected winrates, resets and epsilon as a fallback.

With your 2k experiment: I think the system should listen to the experimental self-promotion, but if your results don't support it, the system should quickly gravitate back to a rating that matches your results (preferably with minimal effect of your experiment on your opponents' ratings). Note that if you later promote to 2k again, the system won't reset your rating, because the declared rank no longer exceeds your highest declared rank (both the EGD and the revised policies work that way). So a "failed" experiment would mean that later, you'd have to fight your way to a 2k rating the hard way.

Schachus · Post by **Schachus** » Fri Oct 06, 2017 7:53 am

Why are you so sure that conservative resetting leads to deflation? There are absolutely no resets in chess and still there is no deflation... (in fact, chess players are whining about inflation, but I don't really believe in that either). But if you are worried about deflation, I'm happy to have a slight epsilon in there somewhere. Actually, whether there is a slight drift in ratings over a long period is irrelevant to me, since what counts is how ratings compare to one another (if in 20 years the strongest European rank (by rating) is no longer 8d but 9d or 7d, although the player holding it is exactly as strong as In-Seong is now, that doesn't concern me too much; I can compare his rating to others). That's also why I don't believe it needs to be tied to ranks too strongly. In-Seong doesn't call himself 8d because he somehow intrinsically knows he is 8d, but because that is what fits his rating (and of course the ranks of players of similar strength). On Tygem he would be 9d, in the AGA system probably too, in some countries maybe only 6d, because amateur ranks only go up to there. I agree that the idea that ratings should reflect handicaps is nice, and a rating system should try to have 100 pts = 1 stone, but that doesn't need to be tied to ranks; first of all it only concerns how ratings compare to one another.

While we are at it: Herman said ranks are built to fit handicaps, but how? There is a "1k" player at a club where I often play who is clearly weaker than me. There is also a "5k" player who is slightly weaker than me, but probably stronger than the "1k". They seldom play one another, so they don't realize this. What should my rank be, so that a handicap taken from this rank works with both players? Of course, maybe the 1k should really be 5k and the 5k really be 4k and then I would be 3k, or maybe everyone a rank stronger, but am I supposed to go to someone who has played as 1k for years and say "sorry, but I believe you are 4-5k"? I don't think ranks reflect handicap better than ratings, except maybe in the DDK range where ratings are crap, due to the problems already discussed.

The rating system actually reached this conclusion (about the 1k): he has a rating around 1650, which is right for him, though it doesn't fit his rank. This is all no problem until, hypothetically, some new player comes along (maybe he wasn't in Europe before or he only played online) who calibrates his rank by playing against the 1k, finds out that this is his level, and enters a tournament as 1k. His rating would then be initialized at 2000, although 1650 would be right.

Historical data is important, because a few tournaments mean nothing at all. The rating only gives a solid and reliable answer if it has enough rated tournaments to draw on, because over a single tournament there is so much noise in the data (form, luck, opponents' play, and so on) that you can only say "he is a 3k plus or minus 2 ranks". I could have told you that without a rating system. The strength of a rating system is, IMO, that it considers a lot of data (newer data being more important, obviously) to give a more exact answer about our current "average" playing strength.

HermanHiddema · Post by **HermanHiddema** » Fri Oct 06, 2017 8:03 am

Schachus wrote: While we are at it: Herman said ranks are built to fit handicaps, but how? There is a "1k" player at a club where I often play who is clearly weaker than me. There is also a "5k" player who is slightly weaker than me, but probably stronger than the "1k". They seldom play one another, so they don't realize this. What should my rank be, so that a handicap taken from this rank works with both players? Of course, maybe the 1k should really be 5k and the 5k really be 4k and then I would be 3k, or maybe everyone a rank stronger, but am I supposed to go to someone who has played as 1k for years and say "sorry, but I believe you are 4-5k"? I don't think ranks reflect handicap better than ratings, except maybe in the DDK range where ratings are crap, due to the problems already discussed.

The statement "I don't think ranks reflect handicap better than ratings" doesn't really make sense to me.

Let's say I have two hypothetical players in some hypothetical pure Elo rating system. No fiddling like the EGF has done, just basic Elo as implemented in chess. One of them has rating 4400, the other has rating 4650. According to the Elo rating formula, at a 250 pt difference, that means the player with the higher rating should win about 81% of even games between them. What would you consider a proper handicap between these players?
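The 81% figure follows from the standard Elo expectation formula (base 10, scale 400). A quick check:

```python
def elo_expected(diff):
    """Expected score of the higher-rated player in standard Elo (base 10, scale 400)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

print(round(elo_expected(250), 3))  # 0.808
```

Note that only the difference enters the formula: 4400 vs 4650 gives the same number as 1400 vs 1650, and nothing in it says how many stones that difference is worth.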

gennan · Post by **gennan** » Fri Oct 06, 2017 8:43 am

I'm looking at the overall statistics of the EGD history. Any statistical distribution has variation and outliers, of which you seem to have encountered a case with this 5k and 1k. I'm sure all of us know some examples like this, but overall, ranks do give a good indication of someone's skill (with a mean error of one or two ranks).

Go is different from chess, in that historically it has a rank system based on handicap (you may call them titles, but they are less fixed than titles IMO, especially kyu ranks). These ranks are based on handicap needed against other ranked players to get a 50% winrate and they are not based on even game winrates at all (except when the ranks are equal, in which case 50% winrate is expected). In that sense, go ranks are not compatible with a normal Elo rating system, which is based on even game winrates only.

You could use a pure Elo rating system for go. I would be absolutely fine with that: only rating differences matter and overall rating drift means nothing, as long as you don't compare year 2010 ratings with year 1970 ratings (there is a little thing though, that chess also has titles linked to ratings, like 2500 = Grandmaster and these titles happen to suffer from long term inflation).

But if you use a pure Elo rating system for go, you should not claim a fixed relation to go ranks (handicaps). It would just be a separate system. You might publish annual correlation tables as an indication of how to convert, say, year 2016 ratings to go ranks, but these correlations would be free to drift from one year to the next. I would be perfectly OK with such a rating system.

The "problem" is that the EGD does claim a fixed mapping to go ranks (handicaps). I think it is a good feature, but if you make this claim, you should do your best to maintain an accurate mapping, which means fine-tuning the system to detect and counter long-term drift and local / global contraction or dilation of the rating range (because that leads to mismatched handicaps, which would invalidate the mapping).

Schachus · Post by **Schachus** » Fri Oct 06, 2017 8:49 am

HermanHiddema wrote:
The statement "I don't think ranks reflect handicap better than ratings" doesn't really make sense to me.

Let's say I have two hypothetical players in some hypothetical pure Elo rating system. No fiddling like the EGF has done, just basic Elo as implemented in chess. One of them has rating 4400, the other has rating 4650. According to the Elo rating formula, at a 250 pt difference, that means the player with the higher rating should win about 81% of even games between them. What would you consider a proper handicap between these players?

That is of course absolutely true: in your example you can say nothing at all about handicaps. But the EGD does not have a basic Elo system; it diverges in 2 important points (in order to fix this). Nr. 1: Handicap games are rated. For these purposes, handicap is compensated for at 100 pt a stone (50 pt for the first one, because it is only half as good). If there are enough rated handicap games, that should help calibrate things.

Nr. 2: Unlike in Elo, the rating difference does not directly determine the winning expectation. It also depends on the stronger player's rating (the number a defines the rating difference at which the odds are e:1, and it depends on the rating). I actually don't know how the correspondence between a and the rating was obtained, but a good way would be: take players who are known to have a certain level and to be 1 stone apart (however you determine that; I would say as the rank system did. It means that chances in a 1-stone game (that is, reverse komi for Black) should be 50/50, so ratings should be 100 points apart) and let them play even games. Check how many the stronger player wins (70%? 80%?) and set the a corresponding to that strength such that the expectation for a game at 100 points difference matches that. I don't know if this was done when the dependence of a on GoR was defined, but it makes sense that a gets lower for stronger players, as we know the chances of a 5d beating a 6d in an even game are much lower than those of an 8k beating a 7k, so at least a was sort of plausibly chosen.

Of course, setting up this process does have something to do with the rank system. But that way you have a rating system that has a chance of reflecting handicaps, without the need for rating resets.
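The calibration described above amounts to inverting the win-expectation formula. A minimal sketch, assuming only what the post states (odds of e:1 at a rating difference of a, hence odds of e^(D/a) at difference D); the function names are made up.

```python
import math

def win_probability(gap, a):
    """Win probability of the stronger player at rating gap `gap`, with odds e^(gap/a)."""
    return 1.0 / (1.0 + math.exp(-gap / a))

def a_from_winrate(p, gap=100.0):
    """Value of `a` for which a rating gap of `gap` gives win probability p (p > 0.5)."""
    return gap / math.log(p / (1.0 - p))

for p in (0.7, 0.8):
    print(p, round(a_from_winrate(p), 1))
```

If players one stone (100 points) apart win 70% of their even games, a comes out near 118; at 80% it is near 72. A smaller a means a steeper curve, which matches the observation that one rank matters more among strong players.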

This is also my suggestion: if you want handicaps to work better with ratings, why not take the data from the handicap games you observe and use it to optimize this correspondence between GoR and a (and maybe also the K factor, which is for some reason called con in the EGD), in such a way that the revised ratings fit handicap games better?

I would not want to get rid of resets entirely, since there are cases where players improve 20 ranks between tournaments and that skews things heavily (a 1d losing to someone with a 20k rating is strange), but I suggest having as few of them as possible.

gennan · Post by **gennan** » Fri Oct 06, 2017 8:54 am

HermanHiddema wrote:
The statement "I don't think ranks reflect handicap better than ratings" doesn't really make sense to me.

Let's say I have two hypothetical players in some hypothetical pure Elo rating system. No fiddling like the EGF has done, just basic Elo as implemented in chess. One of them has rating 4400, the other has rating 4650. According to the Elo rating formula, at a 250 pt difference, that means the player with the higher rating should win about 81% of even games between them. What would you consider a proper handicap between these players?

Yes, the EGD basically claims to predict handicaps (ranks), but its predictions are mostly based on even game results. I think the reason that it works fairly well in the absence of handicap data is that it is constantly fed with declared ranks (which are based on handicaps).

Schachus · Post by **Schachus** » Fri Oct 06, 2017 8:56 am

Actually, I'm interested: did you do anything with the "a" in your revised ratings?

gennan · Post by **gennan** » Fri Oct 06, 2017 9:00 am

Schachus wrote: Nr. 2: Unlike in Elo, the rating difference does not directly determine the winning expectation. It also depends on the stronger player's rating (the number a defines the rating difference at which the odds are e:1, and it depends on the rating). I actually don't know how the correspondence between a and the rating was obtained, but a good way would be: take players who are known to have a certain level and to be 1 stone apart (however you determine that; I would say as the rank system did. It means that chances in a 1-stone game (that is, reverse komi for Black) should be 50/50, so ratings should be 100 points apart) and let them play even games. Check how many the stronger player wins (70%? 80%?) and set the a corresponding to that strength such that the expectation for a game at 100 points difference matches that. I don't know if this was done when the dependence of a on GoR was defined, but it makes sense that a gets lower for stronger players, as we know the chances of a 5d beating a 6d in an even game are much lower than those of an 8k beating a 7k, so at least a was sort of plausibly chosen.

Of course, setting up this process does have something to do with the rank system. But that way you have a rating system that has a chance of reflecting handicaps, without the need for rating resets.

This is also my suggestion: if you want handicaps to work better with ratings, why not take the data from the handicap games you observe and use it to optimize this correspondence between GoR and a (and maybe also the K factor, which is for some reason called con in the EGD), in such a way that the revised ratings fit handicap games better?

This is basically what I did: match the predicted winrates with the observed winrates from the EGD. I posted my findings on the site. For example http://goratings.eu/Probabilities/P_PredictedEGD vs http://goratings.eu/Probabilities/P_ObservedEGD. I also looked at handicap game winrates specifically, but the amount of handicap game data is a bit lacking in the EGD.

Schachus · Post by **Schachus** » Fri Oct 06, 2017 9:02 am

Oh, good that we talked about it! I didn't realize that was what your graphics show.
Still, I think that if your goal is to reflect handicaps well, you need to use handicap game data to check the quality of your ratings, even if there is not a lot of data. I still think checking against the rank defeats the purpose, since by that measure the system "assign the rating that corresponds to the declared rank" would be optimal, while it clearly isn't.

gennan · Post by **gennan** » Fri Oct 06, 2017 9:11 am

Yes. Take http://goratings.eu/Probabilities/P_PredictedEGD. The purple curve intersecting the 2100 grid line at 50% is the 2100 rating curve. The curve intersects the 2000 grid line at 70%. This means the EGD predicts a 70% win probability when a 2100 player plays a 2000 player.

Then take http://goratings.eu/Probabilities/P_ObservedEGD. Here it shows that the 2100 curve intersects the 2000 grid line at 60%. This is the observed winrate of a 2100 player against a 2000 player over the history of the EGD.

So the EGD predicted winrates don't match the observed winrates very well.

Better predictions are one of the improvements I built into the revised system.

gennan · Post by **gennan** » Fri Oct 06, 2017 9:16 am

These predictions matter, because if your winrate is lower than the expected winrate, you lose points.
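As a rough illustration with the numbers from the graphs (70% predicted, 60% observed for 2100 vs 2000), the average rating change per game is the K factor times the gap between observed and predicted winrate. The K value below is an arbitrary placeholder and epsilon is ignored.

```python
def expected_change_per_game(observed_winrate, predicted_winrate, k=20.0):
    """Average rating change per game when results deviate from the prediction."""
    return k * (observed_winrate - predicted_winrate)

# 2100 player against 2000 players: wins 60%, system expects 70%.
stronger = expected_change_per_game(0.60, 0.70)
# The 2000 player in the same pairing: wins 40%, system expects 30%.
weaker = expected_change_per_game(0.40, 0.30)
print(round(stronger, 2), round(weaker, 2))  # -2.0 2.0
```

So in this pairing the stronger player loses about 2 points per game on average and the weaker player gains about 2, even though both are performing exactly at their level. Over many games this pulls their ratings closer together.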

Life In 19x19
