KGS Ranking adjustment?

jann · Post by **jann** » Mon Jan 27, 2020 2:48 pm

If it's still unclear what I pointed at I doubt I could make it clearer. The above example of what you could tell after 10 komi / 10 no komi games looks like my best attempt at summarizing it.

I see no problem with actual handi (H2+) games because in practice those are almost always played without komi. I also don't see a problem if there are a huge number of games with frequently changing handicap / updated rating.

But without extra stones - and with the everyday server practice of offering a free choice (ranked differently OC, but that only helps in the long run) between the auto / nominal handicap or an even game with normal komi - there is a peculiar trait of those "h1" games: about 1 out of 3 games have no clear winner or loser (the recorded winner depends on the players' mentioned choice, which is arbitrary).

Bill Spight · Post by **Bill Spight** » Mon Jan 27, 2020 4:53 pm

Bill Spight wrote: Given a method of evaluation that has a probabilistic semantics, such as the percentage of correct answers on a test, or percentage of wins in a contest

jann wrote: The percentage of correct answers is exact, factual data (just like the percentage of various board scores). The percentage of wins (given those board scores) depends on an arbitrary parameter, "komi".

Here is what I meant. Any percentage other than 0% or 100% indicates variability in the correctness of answers to questions or of wins and losses.

Bill Spight · Post by **Bill Spight** » Mon Jan 27, 2020 5:04 pm

jann wrote:If it's still unclear what I pointed at I doubt I could make it clearer. The above example of what you could tell after 10 komi / 10 no komi games looks like my best attempt at summarizing it.

I see no problem with actual handi (H2+) games because in practice those are almost always played without komi. I also don't see a problem if there are a huge number of games with frequently changing handicap / updated rating.

Well, it seems that you somehow think that komi is arbitrary.

Aside from that, you seem to be arguing something similar to the following:

There are two chess players who are close in strength. In ten games, one player always took Black and the other player always took White. All ten games were draws. Who can tell which player is stronger? There is something wrong with having one player take Black and the other take White.

(The first player has the advantage in both chess and go. Go without komi is what you call H1. There is no komi in chess.)

I do not mean to parody you, but chess without odds seems to me to be analogous to what you are calling H1 without komi.

jann · Post by **jann** » Mon Jan 27, 2020 6:01 pm

Bill Spight wrote:Well, it seems that you somehow think that komi is arbitrary.

For half stone rank diffs it is the players' arbitrary choice; both 0.5 and whole komi are very common. Lack of this choice is also why the chess analogy seems to fail.

Bill Spight wrote: There are two chess players who are close in strength. In ten games, one player always took Black and the other player always took White. All ten games were draws. Who can tell which player is stronger?

For 10 draws there is only one interpretation: B performed slightly above W. OC this is nowhere near decisive (just 10 games after all) but more than nothing. B being stronger is a bit more likely than the opposite.

This is not the same as 10 no-komi games all B+3. In that case there are two interpretations: W performed better (winning all with komi, instead of the expected 6 or 5) or worse (losing all without komi, instead of the expected 5 or 4). Harder to say who performed better (W OC, but only if you are allowed to look at the scores, not just win%).
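To make the dependence on komi concrete, here is a minimal sketch. The function name and the two komi values (0.5 and 6.5) are assumptions chosen for illustration, not any server's actual code. It records a winner for the same ten board results under both settings.

```python
def recorded_winner(black_board_margin, komi):
    """Return 'B' or 'W' given Black's margin on the board before komi."""
    return "B" if black_board_margin - komi > 0 else "W"

# Ten games that all end with Black 3 points ahead on the board.
board_margins = [3] * 10

# Assumed komi values: 0.5 ("no komi") and 6.5 ("normal komi").
for komi in (0.5, 6.5):
    black_wins = sum(recorded_winner(m, komi) == "B" for m in board_margins)
    print(f"komi {komi}: Black wins {black_wins}, White wins {10 - black_wins}")
```

The board results are identical in both runs; only the komi setting decides which player gets all ten recorded wins.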

xela · Post by **xela** » Mon Jan 27, 2020 8:51 pm

jann wrote:This is not the same as 10 no-komi games all B+3. In that case there are two interpretations: W performed better (winning all with komi, instead of the expected 6 or 5) or worse (losing all without komi, instead of the expected 5 or 4). Harder to say who performed better (W OC, but only if you are allowed to look at the scores, not just win%).

In terms of calculating ratings, there is no problem here. The only problem is using the ambiguous word "better". For this scenario, black performed above expectations, therefore black's rating needs to be adjusted upwards. Whether or not white is a stronger player than black is impossible to say from this data. It's possible that black is several stones stronger than white but chose to play slack moves (or safe moves) when ahead.

----

But there's a more fundamental issue here. We're getting hung up over a very implausible scenario. These two hypothetical players who can play a ten-game series and get exactly the same score (plus or minus one point) every time:

First, Larry and Moe here are also playing games against other people, and those games also feed into the rating calculation, right? (Or if they're not playing anyone else, then does their rating calculation actually matter at all?)
But even more important, this scenario is a one-in-a-thousand freak event, not business as usual. Humans just aren't that consistent. Even strong AI isn't that consistent yet. Look at the LZ promotion matches or the CGOS game results. Look at the many "golden era" challenge matches played without komi. Below pro level, if you can play me without komi ten times in a row and win every single time, then you're strong enough that you should be giving me at least three stones. (Or else this was a freakish run of games, and our next ten-game series will look very different). And if you're winning by exactly two or three points every time, you're choosing to play in a non-greedy style and you're very good at counting.

Of all the ways in which a rating system could go wrong, this is one of the last things we need to worry about.

gennan · Post by **gennan** » Tue Jan 28, 2020 12:44 am

Yes, exploring atypical scenarios is not very useful for evaluating a rating system's performance. For statistical models, scenarios are not very useful when they are anecdotal (or highly unlikely).

You need a lot of real world data to create or evaluate rating systems.

You may analyse the data to extract its typical characteristics.
This can be used to create a statistical model of the data.
Preferably, you want to find the simplest data model that still captures the most important characteristics and statistical behaviour of the data.

You're looking for probability distributions that model typical human players with only a few parameters, the most important being the rating.
Other parameters may include the standard deviation of the player's rating or even a full history of previous game results, nationality, etcetera.
You can also incorporate some other stuff and some theoretical consideration in your models. In the case of go, there are declared ranks, komi, handicap, time settings, hypothetical perfect play, improving players, passage of time, etcetera.

A good data model helps to create robust algorithms for the experimental rating system (in particular, how it updates the ratings when it processes game results).
You may then feed the data into your experimental rating system as a simulation of the real world and evaluate how the rating system performs over time (accuracy of predictions, stability, rating drift).

You can iterate this process to improve your experimental system until you're satisfied.
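The "feed the data in and evaluate" step could look roughly like the sketch below. It is only an illustration under stated assumptions: a plain Elo-style update (not the KGS, AGA or EGF algorithm), a hypothetical game format of `(player_a, player_b, a_won)` tuples in chronological order, and mean log loss as the accuracy measure.

```python
import math

def expected(r_a, r_b, scale=400.0):
    """Elo-style predicted probability that player a beats player b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))

def replay(games, k=24.0, initial=1500.0):
    """Replay games in order; return final ratings and mean log loss.

    Each prediction is made BEFORE the game result is used for the update,
    so the loss measures genuine out-of-sample predictions.
    """
    ratings = {}
    total_loss = 0.0
    for a, b, a_won in games:
        r_a = ratings.get(a, initial)
        r_b = ratings.get(b, initial)
        p = expected(r_a, r_b)
        outcome = 1.0 if a_won else 0.0
        total_loss += -math.log(p if a_won else 1.0 - p)
        ratings[a] = r_a + k * (outcome - p)
        ratings[b] = r_b - k * (outcome - p)
    return ratings, total_loss / len(games)

# Toy input, not real data: x beats y three times in a row.
ratings, mean_loss = replay([("x", "y", True)] * 3)
print(ratings, round(mean_loss, 3))
```

Comparing the mean log loss of two candidate systems on the same game history is one simple way to iterate; stability and rating drift would need additional measurements.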

gennan · Post by **gennan** » Tue Jan 28, 2020 1:26 am

I did go into some scenarios in this thread, like the discussion about the 6k player playing against opponents which were some Elo distance away from him.

I said that the predicted winrates of the AGA system are much too high and that a match between a 6k and a 7k would bring the AGA ratings of these players closer together (the 6k player's rating would go down).

That is true, but it's not really a good scenario. Considering only 2 players is not enough to analyse the system. It ignores too many things that happen when there are more players in the system.
For example: If you add a 5k player and have these 3 players play many games with too high expectations of winrates, the 6k player's rating will not deflate. He will gain rating points from his games with the 5k player and he will lose rating points from his games with the 7k player. But overall, his rating will stay the same if he plays the same number of games against both opponents. What does happen is that the 5k's rating goes up and the 7k's rating goes down. So the rating system's high winrate expectations contract the ratings toward the middle.

If you extrapolate this to a system with many players, it will contract the ratings towards the median rating. The median rating depends on the demographics of the player population in the system. In the EGF system, 5k is the median rank in tournament games.
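The contraction toward the middle can be reproduced with a small deterministic sketch. Everything in it is an assumption for illustration: the three strength values, a logistic winrate curve, and a rating system whose curve is twice as steep as the true one (`system_scale=200` versus `true_scale=400`). Instead of random games it applies the expected rating change per round, with every pair playing once per round.

```python
def win_probability(diff, scale):
    """Logistic win probability for a rating or strength difference."""
    return 1.0 / (1.0 + 10 ** (-diff / scale))

def run_expected_updates(true_strength, system_scale, true_scale,
                         k=16.0, rounds=500):
    """Apply the average (expected) rating update for all pairings each round."""
    ratings = dict(true_strength)  # ratings start equal to true strength
    names = list(ratings)
    for _ in range(rounds):
        delta = {name: 0.0 for name in names}
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                # How often a really beats b.
                actual = win_probability(true_strength[a] - true_strength[b], true_scale)
                # How often the (overconfident) system expects a to beat b.
                predicted = win_probability(ratings[a] - ratings[b], system_scale)
                delta[a] += k * (actual - predicted)
                delta[b] -= k * (actual - predicted)
        for name in names:
            ratings[name] += delta[name]
    return ratings

# Hypothetical strengths, 100 points per rank.
true_strength = {"5k": 2100.0, "6k": 2000.0, "7k": 1900.0}
final = run_expected_updates(true_strength, system_scale=200.0, true_scale=400.0)
for name, rating in final.items():
    print(name, round(rating, 1))
```

Under these assumptions the middle rating stays put while the top rating drifts down and the bottom rating drifts up, until the gaps have shrunk to the point where the system's predictions match the real winrates (here, about half the original gap).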

So it's not really true that all players' ratings get deflated by high winrate expectations. A more accurate statement would be that only ratings above 5k will deflate over time in the EGF system. But the system is pretty much anchored at the top, so over time, the deflation above 5k will push everybody below downwards as well.

I knew this when I discussed the 6k scenario, but I didn't mention it then, because I tried to keep it simple. But perhaps it's better to not oversimplify things.

Bill Spight · Post by **Bill Spight** » Tue Jan 28, 2020 1:32 am

Many thanks, jann. I think I understand your ideas better.

Many thanks, xela. You have shed light on the questions involved.

And thanks to others, as well. I don't want to leave anybody out who has contributed to this discussion, but I am only responding with jann and xela's latest notes in mind.

OK, nothing really specific. I may say more mañana, but I actually have a life. I think.

----

A rating system is, in a way, a fool's errand. Why is that? Because it pretends that we can represent a player's strength with a single number. We can't.

Everybody is familiar with the situation, even if we don't personally know of one, in which Player A can usually beat Player B, who can usually beat Player C, who can usually beat Player A. If we could represent the strength of each of these players by a single number, the player's rating, then Player A's rating would be greater than Player B's rating, which would be greater than Player C's rating, which would be greater than Player A's rating. Tilt! No puedo, señor.
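A brute-force check makes the same point. This is only an illustrative sketch with made-up helper names: it tries every strict ordering of the players and keeps those in which each "usually beats" relation points from the stronger to the weaker player.

```python
from itertools import permutations

def consistent_orderings(players, beats):
    """Return all strict orderings (strongest first) that agree with `beats`.

    `beats` is a list of (winner, loser) pairs meaning "winner usually beats loser".
    """
    result = []
    for order in permutations(players):
        position = {p: i for i, p in enumerate(order)}  # lower index = stronger
        if all(position[w] < position[l] for w, l in beats):
            result.append(order)
    return result

cycle = [("A", "B"), ("B", "C"), ("C", "A")]
chain = [("A", "B"), ("B", "C")]

print(consistent_orderings("ABC", cycle))  # no ordering fits the cycle
print(consistent_orderings("ABC", chain))  # exactly one ordering fits the chain
```

Any single-number rating induces such an ordering, so when the list comes back empty no assignment of distinct ratings can reproduce the observed results.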

Why are such situations possible? One reason is that there are several skills and other factors that combine to form skill at go. For instance, there is skill at reading, but actually, there are at least three skills which produce that one. There is skill at invading, skill at sabaki, skill at utilizing thickness, etc. And each of these skills is probably composed of other skills, as well. In addition there are factors such as memory, emotional control, discipline, physical fitness, alertness, etc. If we can represent each of these skills and factors with a single number, we still cannot reduce all of those numbers to a single number. That is why I said that skill at go is a vector, with a number for each factor that makes up a person's go skill.

However, as xela alludes to, other players also matter. For instance, if Player A never played Player C in the situation above, we would happily think that Player A's general go skill was better than Player C's. So really, we should probably think of a player's go strength as a matrix.

Now, the situation with Players A, B, and C, is not so unusual if they are all amateur shodans. But it would be quite unusual, perhaps almost impossible, if Player A were a shodan, Player B were a one kyu, and Player C were a two kyu. There may be a certain two kyu who can regularly beat a certain shodan, playing even, but I have never heard of such a case. In any event, I am willing to say that there is some difference in go skills such that the weaker player will never beat the stronger player on a regular basis. What this means is that we can order some players on the basis of general go strength, but not all players. We say that go strength is partially ordered.

Ratings, OC, are numbers, and are thus completely ordered. Since they are taken to represent go strength, which is only partially ordered, nobody has a precise rating. It is not that there is some uncertainty about a player's rating, which could be reduced with more games played. There is an irreducible uncertainty, such that nobody has a precise rating. Any attempt to assign precise ratings is doomed to failure. As xela points out, "better" is ambiguous.

----

What about ranks? Players' ranks are still not completely ordered. It is not unusual for two players of different ranks to play even with each other. It is unusual for an amateur of a lower rank to regularly beat a player of a higher rank, however. So ranks are pretty good indicators of general go skill. They are not nearly as precise as ratings, but that is a good thing, IMO. Ranks reflect reality better than ratings.

The earliest numerical ranking system I am aware of goes back centuries. A one rank difference meant that the higher ranked player could normally take White to make the game even. A two rank difference meant that the higher ranked player alternated between taking White and giving two stones. A three rank difference meant that the higher ranked player gave two stones. Etc. These were pro ranks, OC. Over time pro ranks got closer together. Today the idea that a pro 8 dan would give 4 stones to a pro shodan is absurd. It was absurd 100 years ago, as well. 3 stones was not. Today there is no guarantee that a pro 8 dan can give a pro shodan 2 stones. Times change.

But going back to ancient times, two pros might agree to a lengthy match of several games, often 10 or more. It was normal in these matches for the handicap to change after four wins in a row, regardless of the official handicap. For instance, a player might go from playing Black all the time to playing Black in two games out of three and White in the other.

When I was learning go it was usual for amateurs to change handicaps after a three game winning streak. For instance, the players might be playing even by alternating between taking White and Black, and then, after one player won three games straight, they would switch to that player always playing White. And then if White won three games straight, she would give two stones. That change was a bigger leap, OC, but we were not worried about niceties.
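The three-in-a-row convention is easy to express as a small state tracker. The sketch below is an illustration with assumed names and an assumed level encoding (0 = alternating colors, +1 = player A always takes White, +2 = A gives two stones, negative levels mirror this for B); it is not taken from any club's or server's rules.

```python
def handicap_levels(results, streak_to_change=3):
    """Track a signed handicap level over a series of games between A and B.

    `results` is a sequence of winners, each "A" or "B".
    Returns the level in force after each game.
    """
    level = 0
    streak_player, streak = None, 0
    history = []
    for winner in results:
        if winner == streak_player:
            streak += 1
        else:
            streak_player, streak = winner, 1
        if streak == streak_to_change:
            # Shift the handicap one step against the player on the streak.
            level += 1 if winner == "A" else -1
            streak = 0  # a new streak is needed for the next change
        history.append(level)
    return history

print(handicap_levels("AAAAAA"))
print(handicap_levels("AABAAA"))
```

In the first series A moves from alternating colors to always taking White after game 3, and to giving two stones after game 6. In the second, B's win in game 3 breaks the streak, so the handicap only changes after game 6.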

Note that we usually played even by alternating between taking White and Black, not by komi. Some players used komi, but most of us did not bother with it.

----

Against that background, the idea of one player playing Black and the other playing White for 10 games without changing the handicap after Black won 3 or 4 games in a row is bizarre. Komi or no komi. It is a made up scenario to produce a problem that makes no practical sense. I won't say that the problem does not exist, because jann sees one, but I don't.

Now, with some people playing many games a day online with a rating system that, for some reason or other, does not recalculate ratings immediately, something like that scenario could occur. In fact, with thousands of games played on different servers every day, such a scenario may not be all that unusual. But first, it is ephemeral, as modern rating systems are self-correcting. And second, since precise ratings are a pipe dream, anyway, who cares?

jlt · Post by **jlt** » Tue Jan 28, 2020 2:33 am

gennan wrote: What does happen is that the 5k's rating goes up and the 7k's rating goes down.

Did you mean the opposite? Given that your next sentence was "So the rating system's high winrate expectations contracts the ratings toward the middle."

gennan wrote: But the system is pretty much anchored at the top, so over time, the deflation above 5k will push everybody below downwards as well.

I don't understand this sentence. What do you mean by "anchored at the top"? Why does everyone get deflated?

jann · Post by **jann** » Tue Jan 28, 2020 4:51 am

As I wrote several times, if there are a huge number of samples, with ratings adjusted after every game, a rating system is fine in practice even ignoring the issue with "H1" games. But there are scenarios where this is not the case (eg. new player with few games), and I also found the fact that many of these games have dual winners interesting (esp. compared to chess). Apparently I'm the only one.

Bill Spight wrote: A rating system is, in a way, a fool's errand. Why is that? Because it pretends that we can represent a player's strength with a single number. We can't.

gennan wrote: Preferably, you want to find the simplest data model that still captures the most important characteristics and statistical behaviour of the data.

For a random thought experiment I could even imagine a system that does not throw away most of this data, ie. a server that ignores komi, records each game as board results B+3, B+8, W+1 etc, and manages player ratings using the whole result from each game (instead of truncating to 1-1 bits, which is dubious for H1). It may even be more successful in managing player matchups and handicaps (esp. if it doesn't even allow komi to be set, though I think weak amateurs would not play much differently because of 6 pts, until late endgame at least).
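As a rough sketch of what such a margin-based update might look like (purely hypothetical: the function, the linear model, and all three parameters are assumptions, and handicap stones are ignored), ratings could be kept in rank units and moved in proportion to the difference between the observed and expected board margin.

```python
def update_by_margin(r_black, r_white, black_board_margin,
                     points_per_rank=13.0, first_move_points=6.5, k=0.1):
    """Update two ratings (in rank units) from a no-komi board result.

    Assumed model: Black's expected board margin is the value of moving
    first plus a fixed number of points per rank of rating difference.
    """
    expected_margin = points_per_rank * (r_black - r_white) + first_move_points
    error_in_ranks = (black_board_margin - expected_margin) / points_per_rank
    return r_black + k * error_in_ranks, r_white - k * error_in_ranks

# Two equally rated players, board result B+3: Black did worse than the model expects.
print(update_by_margin(0.0, 0.0, 3))
# Board result B+6.5 would match the assumed first-move value: no change.
print(update_by_margin(0.0, 0.0, 6.5))
```

With this kind of update a B+3 result between equally rated players moves the ratings the same way whether or not komi was set for the game, which is the property the thought experiment is after.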

xela · Post by **xela** » Tue Jan 28, 2020 4:59 am

Back to what started this conversation --

shimari65 wrote:Yes, KGS rankings seem to be pretty far off, and have not been closely tied to appropriate anchors in a very long time. I am trying to gradually tweak the system, without making major shockwaves. Our goal would be for KGS ranks to align more closely with AGA ranks. There are so many servers, and such wide rank variations among them, that picking a standard is very hard. The merit of AGA ranks is that we have extensive records of in person play for hundreds of individuals.

If you find your rank, or someone else's has changed in a way that is totally irrational, let me know here. I can't promise to fix anything, but my actions may cause unintended fluctuations, and knowing about them can help me to make better decisions in the future.

_________________
Paul Barchilon,
AGF Vice President
KGS Manager

If the AGA can put resources behind this, it strikes me that this would be a great topic for a Kaggle competition. Can you publish a big file with the (anonymised/de-identified?) results of all games played in 2019 plus some metadata? Metadata might include handicap, komi, time settings, AGA ratings of the players where known, exact date and time when the game was played, which country/region people were logged in from if known.

I suspect the correlation between AGA and KGS rank would be weak, as some people play better online than over the board, some people the other way round, and different people take online games more or less seriously.

I also wonder whether you'd find cliques in the KGS players. For instance, group A plays mostly fast games, group B plays mostly slow games, they hardly ever play against each other, and 2k in group A is a different strength from 2k in group B. Or it could split up by time of day, country, or something else.

The goals of the Kaggle competition could be to design a better ranking system from scratch, or to suggest how to improve the current system without radical change, or to explain the various factors making ranks (appear to be) unstable or inconsistent.

ez4u · Post by **ez4u** » Tue Jan 28, 2020 5:23 am

I confess that I do not understand much of what has been written under this topic. In hopes of getting a little closer to the topic of KGS' ranks, and adjustments thereto, I offer the following data.

The three Ayabot00X bots (ayabot001, ayabot002, and ayabot003) have been playing steadily since 2014 on KGS. I did the following:

1. Downloaded the csv files for these three from KGS analytics.
2. Deleted all free games and games against unrated players, leaving some 624K games in total.
3. Replaced all positive handicap with an equal negative number where the ayabot played White (= giving handicap)
4. For all handicap = 0 games where the ayabot was one level higher and playing W, substituted handicap = -1 (i.e. assumed that ayabot received komi = 0.5; the csv file does not list komi, just handicap)
5. For all handicap = 0 games where the ayabot was one level lower and playing B, substituted handicap = 1 (i.e. assumed that ayabot gave a komi = 0.5)
6. Totaled the games and wins for each handicap and for Black versus White.
7. Calculated the winning rates and compared the winning percentage at each handicap for Black versus White.
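Steps 3 to 7 can be sketched in a few lines. Note the assumptions: the real KGS analytics csv column names are not given here, so the keys `handicap`, `bot_color`, `rank_diff` (bot's level minus opponent's level) and `bot_won` are hypothetical, and the rows at the bottom are toy examples rather than real results.

```python
from collections import defaultdict

def signed_handicap(handicap, bot_color, rank_diff):
    """Steps 3-5: positive when the bot receives handicap, negative when it gives."""
    if handicap > 0:
        return -handicap if bot_color == "W" else handicap
    if bot_color == "W" and rank_diff == 1:
        return -1  # bot one level higher, playing White: assumed komi 0.5
    if bot_color == "B" and rank_diff == -1:
        return 1   # bot one level lower, playing Black: assumed komi 0.5
    return 0

def win_rates(games):
    """Steps 6-7: total games and wins per signed handicap, return win rates."""
    tally = defaultdict(lambda: [0, 0])  # signed handicap -> [games, wins]
    for g in games:
        h = signed_handicap(g["handicap"], g["bot_color"], g["rank_diff"])
        tally[h][0] += 1
        tally[h][1] += int(g["bot_won"])
    return {h: wins / n for h, (n, wins) in tally.items()}

# Toy rows for illustration only.
toy_games = [
    {"handicap": 2, "bot_color": "W", "rank_diff": 2, "bot_won": True},
    {"handicap": 2, "bot_color": "W", "rank_diff": 2, "bot_won": False},
    {"handicap": 0, "bot_color": "B", "rank_diff": -1, "bot_won": True},
    {"handicap": 0, "bot_color": "W", "rank_diff": 0, "bot_won": True},
]
print(win_rates(toy_games))
```

Comparing the rate at `-h` (bot as White) with the rate at `+h` (bot as Black) then gives the White versus Black comparison for each handicap.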

We can see in the results table that the bots won significantly more games as White than as Black at all handicap levels. This was surprising to me. I expected that the bias that favors White in assigning traditional handicaps would favor White at all levels except even games (in the list handicap = 0). Therefore I expected that White would win approximately 50% of the handi = 0 games and a higher percentage of the remainder. This did not turn out to be the case. The ayabots won 56% of their games as White with handi = zero and 58% as White with handi = 5 (stones). Note that we have to be careful with the handi = 6 figures; the ayabots played with a maximum handicap of 6 stones so these results may be "noisier" than the rest.

I honestly do not know what to make of this data. On the other hand it only took me about an hour and a half last night, start to finish, to download it and produce the calculations. I would think that this is a vast, easily accessible treasure trove of information. Recall that when Remi published his paper on Whole History Ratings, he used 10.8 million KGS games in his work! I think that for discussion of rating/ranking systems to be used in real life, such data is a better basis than hypothetical Andrew-Bart-Chuck round robins!

These bots played at slightly different levels but always as sdk's. This table shows the breakdown of games by kyu level.

Javaness2 · Post by **Javaness2** » Tue Jan 28, 2020 6:07 am

The robot players cannot handle handicap games 'correctly'. It's a bad set of data to pick.
It would be nice if KGS could simply block handicap games with bots.

Bill Spight · Post by **Bill Spight** » Tue Jan 28, 2020 6:30 am

Bill Spight wrote: Well, it seems that you somehow think that komi is arbitrary.

jann wrote: For half stone rank diffs it is the players' arbitrary choice, both 0.5 and whole komi is very common.

Well, it should not be. For a half stone rank difference the proper handicap uses 0.5 komi or 0 komi. Because the traditional handicap for a one amateur rank difference was for the lower ranked player to take Black without komi, 50 years ago most amateurs did not know that that was the wrong handicap. Now they should know that White should give komi for a proper handicap. If that is not general knowledge I blame those who run go servers and tournament organizers.

ez4u · Post by **ez4u** » Tue Jan 28, 2020 6:55 am

Javaness2 wrote:The robot players cannot handle handicap games 'correctly'. It's a bad set of data to pick.
It would be nice if KGS could simply block handicap games with bots.

I am curious. Which part of the results do you think indicates this lack of ability?
