KGS ranking system

skydyr · Post by **skydyr** » Thu Dec 13, 2012 8:36 am

One other thing to consider is that the system picks anchors for the ratings from active and relatively stable players, so if one of them improves suddenly and rapidly, they may warp everyone else's ratings around the anchor instead of changing ranks themselves. As far as I know, by design there is no way to tell who is an anchor without access to the underlying database.

hibbs · Post by **hibbs** » Thu Dec 13, 2012 10:45 am

Mike Novack wrote:
hibbs wrote: Since all the probabilities are known, the probability for each outcome can be calculated, e.g. the probability for outcome 1 is 0.15 (chance to improve in the workshop) * 0.75^5 (probability to win five games in a row at a 75% average win rate) = 3.5%. Other probabilities can be calculated in a similar way: the probability that the person did not improve in the workshop and still got a 5 game win streak is 2.7%.

What is important now: We have observed the outcome “five wins in a row”, which means under the given assumptions it is actually more likely that the person has really improved than not. And even though winning 5 games in a row is a rather common event that happens by chance in 3% of all cases, and even though it is unlikely to improve by attending the workshop, the person may still correctly feel that he should get a promotion. (Everyone: Please do not start a discussion if this should be reflected in a ranking system… Read above disclaimer first)
I think that is perhaps the crux of the disagreement. Possibly related to the usual and customary certainties expected before "publication" in the different sciences. Yes, .56 (3.5/6.2) is greater than .44 (2.7/6.2), but not a whole lot greater. If the system gave promotions to 100 mythical players based upon attending this class and then having a five game winning streak, it would have been correct to do so 56 times and incorrect to do so 44 times. That's a pretty bad "error rate". The calculation might be redone to determine what lengths of streaks would have been necessary to get the error rate down to below 10%, below 5%, etc.
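The recalculation suggested here can be sketched in a few lines of Python. This is only an illustration under the example's own numbers (a 15% prior chance of improving, a 75% win rate if improved) plus one assumption not spelled out in the quote: a 50% win rate for a player who did not improve, which is what reproduces the 2.7% figure. The function names are made up for this sketch.

```python
def posterior_improved(streak, prior=0.15, p_improved=0.75, p_base=0.50):
    """Probability of a real improvement given `streak` wins in a row."""
    improved = prior * p_improved ** streak
    not_improved = (1 - prior) * p_base ** streak  # p_base = 0.50 is an assumption
    return improved / (improved + not_improved)


def min_streak(max_error, **kwargs):
    """Shortest streak for which a promotion would be wrong less than `max_error` of the time."""
    streak = 1
    while 1 - posterior_improved(streak, **kwargs) >= max_error:
        streak += 1
    return streak


print(round(posterior_improved(5), 2))      # five-game streak
print(min_streak(0.10), min_streak(0.05))   # streaks needed for <10% and <5% error
```

With unrounded inputs the five-game posterior comes out near .57 rather than the .56 obtained from the rounded 3.5 and 2.7. Under these assumptions the streak would have to reach 10 games for an error rate below 10% and 12 games for one below 5%.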

I think you got this one wrong, but first I want to re-iterate two earlier statements:
The crux of the disagreement (or at least what I consider a flaw in the KGS rating system) is that the number of games needed for a promotion depends on how frequently games are played. It is not a priori clear why this should be the case, and it can mean that someone has a winning streak beyond reasonable statistical doubt and still does not get the promotion (at least not immediately). WMS himself has stated that the KGS ranking system does not account well for sudden or fast improvements in strength. What I consider a flaw here is apparently the price to pay for an otherwise probabilistically correct system.
I have also stated that a ranking system cannot account for things like someone attending a class, and therefore it must not. The rest of the discussion is entirely hypothetical.

Now, if you think an error rate of about 50%, as in the example, is too high: that is the way it is. The Bayesian inference does not help us with calculating the wins in a row needed to get a smaller error rate. The question it answers is this: after the observation "attended a class and played a streak of 5 won games in a row", what is the probability that the observation is caused by a real improvement? This depends on the prior probability of improving by attending the class. So what would be the probability that there was a real improvement after attending a cooking class? Zero. But the better the class actually is, the more likely it is that an observation of 5 won games in a row points towards a real improvement of the player in question.
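To make the dependence on the prior concrete, here is a small sketch that holds the five-game streak fixed and varies only the prior probability of improving. As before, the 75% win rate after improving comes from the example, while the 50% win rate without improvement is an assumption, and the function name is hypothetical.

```python
def posterior_given_prior(prior, streak=5, p_improved=0.75, p_base=0.50):
    """Posterior probability of real improvement after `streak` straight wins."""
    improved = prior * p_improved ** streak
    not_improved = (1 - prior) * p_base ** streak  # p_base = 0.50 is an assumption
    return improved / (improved + not_improved)


# A cooking class (prior 0) up to a very effective go class (prior 0.5).
for prior in (0.0, 0.05, 0.15, 0.50):
    print(f"prior {prior:.2f} -> posterior {posterior_given_prior(prior):.2f}")
```

A prior of zero always gives a posterior of zero, no matter how long the streak, and the posterior rises steadily as the class gets better.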

jts · Post by **jts** » Thu Dec 13, 2012 11:10 am

It's very reasonable. The more data you have attesting a statistic, the more confident you can be that the observed statistic is close to the actual statistic. The more confident you are, the more data you need to readjust your conclusions. Easy peasy.

hibbs · Post by **hibbs** » Thu Dec 13, 2012 11:40 am

skydyr wrote: One other thing to consider is that the system picks anchors for the ratings from active and relatively stable players, so if one of them improves suddenly and rapidly, they may warp everyone else's ratings around the anchor instead of changing ranks themselves. As far as I know, by design there is no way to tell who is an anchor without access to the underlying database.

Whether someone is an anchor or not should make no difference. The system first calculates all ranks independently of whether someone is an anchor or not. After that, it shifts all ranks so that the average difference between the calculated ranks of the anchors and their "anchored ranks" becomes minimal. That means all rankings are affected by the anchor system to the same extent.
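As a rough illustration of that second step (this is not KGS code, and the data layout is invented for the example), one can read "the average difference gets minimal" as choosing a single offset that drives the mean signed difference between the anchors' calculated and anchored ranks to zero, and then applying that offset to everybody:

```python
def shift_to_anchors(calculated, anchored):
    """Shift every calculated rating by one common offset.

    calculated: dict of player -> rating from the first pass (all players)
    anchored:   dict of anchor player -> the rating the anchor is pinned to
    Both structures are hypothetical; the real system's internals are not public.
    """
    offset = sum(anchored[p] - calculated[p] for p in anchored) / len(anchored)
    return {player: rating + offset for player, rating in calculated.items()}


calculated = {"a": 3.2, "b": 4.1, "c": 5.0}   # "c" is not an anchor
anchored = {"a": 3.0, "b": 4.0}
print(shift_to_anchors(calculated, anchored))
```

In this toy example every rating, anchor or not, moves down by the same 0.15, which is the point made above.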

hibbs · Post by **hibbs** » Thu Dec 13, 2012 11:43 am

Boidhre wrote:
hibbs wrote:First of all, the statistical independence is a necessary assumption for the various calculations to be meaningful (As I wrote, otherwise these calculations would not be valid).
There's plenty of maths out there for dealing with non-independent events statistically. I've forgotten most/all of it since college since I no longer work with it, but assuming non-independent events to be independent just so you can use a linear regression or whatever just gives you misleading results.

That is right. In this case the assumption of statistical independence should nevertheless be the first one to consider. It should only be changed if the observed behavior is not consistent with it. Since mef has embarked on figuring this out, we should just wait...

Mike Novack · Post by **Mike Novack** » Thu Dec 13, 2012 3:44 pm

I think I see the problem. Confusion about the observer.

"The question that is answered is that after the observation "Attended a class and played a streak of 5 won games in a row": What is the probability that the observation is caused by a real improvement? This depends on the prior probability of improving by attending the class. So what would be the probability that there was a real improvement after attending a cooking class? Zero. But the better the class actually is, the more likely an observation of 5 won games in a row points towards a real improvement of the player in question."

No, the question was when the "observer" (the rating system) has reason to conclude that an improvement has taken place based solely upon observation of a streak of wins of size M. The "observer" in this case has no knowledge of any class the player may or may not have taken immediately before this streak, let alone whether a class, if one was taken, was a class on go or a class on cooking. Another observer (the player who took the class) has additional information and therefore reaches a different conclusion.

That is what I meant by "subjective". The player who thinks the streak plenty long enough may have confused his or her judgement of the probability with that of the rating system. There is nothing wrong with using Bayesian inference here as long as you are looking from the point of view of the correct observer.

ez4u · Post by **ez4u** » Thu Dec 20, 2012 2:34 am

Mef wrote:
Mef wrote: At the end of the day though, I'm with you, I'd prefer to see someone dig into some data and see if there's anything worthwhile there.

All right...since KGS analytics just spits out a CSV with all the game results....and I ended up having a bit of free time...I made a quick and dirty excel macro that analyzed streaks in game histories. I looked at 3 players who I like to use for KGS statistical data because A: Their ratings are fairly consistent, B: They play a ton of games, and C: They are fairly recognizable KGS personalities, here are my results:

Streak = 3 games
Ok, this assumes a streak starts at 3 games. WSW % = Win Streak Winning %, that is, the percentage of time a game was won given that the three prior games were also won. LSL % = Loss Streak Losing %, the percentage of games lost given that the prior three games were also lost.

             Win %   Loss %  WSW %   LSL %   Games   Streaks
Twoeye       0.625   0.375   0.658   0.433   14691   1858
sum          0.526   0.474   0.559   0.506   13278   1664
TheCaptain   0.511   0.489   0.531   0.524   22859   2899

Streak = 4 games
Same as above, this assumes a streak starts after 4 games

             Win %   Loss %  WSW %   LSL %   Games   Streaks
Twoeye       0.625   0.375   0.667   0.443   14691   1073
sum          0.526   0.474   0.569   0.520   13278   869
TheCaptain   0.511   0.489   0.534   0.536   22859   1505

Streak = 5 games
Same as above, this assumes a streak starts after 5 games

             Win %   Loss %  WSW %   LSL %   Games   Streaks
Twoeye       0.625   0.375   0.680   0.467   14691   637
sum          0.526   0.474   0.592   0.515   13278   463
TheCaptain   0.511   0.489   0.522   0.522   22859   827

I need to go to sleep now, but later today I'll try to double check my script and make sure there's no glaring errors. Also I may rework it to try and test my "Good days / bad days" theory.
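For anyone who wants to reproduce this kind of table without Excel, the WSW % and LSL % columns can be approximated as below. This is a sketch, not mef's macro: it assumes the game history has already been reduced to a chronological string of 'W' and 'L', and it counts every game whose previous k results were all wins (or all losses), which may differ in detail from how the "Streaks" column was counted.

```python
def streak_conditional_rates(results, k):
    """Return (WSW, LSL) for streak length k.

    WSW: share of games won given the previous k games were all wins.
    LSL: share of games lost given the previous k games were all losses.
    `results` is a chronological string of 'W'/'L' (an assumed input format).
    """
    win_after = win_total = loss_after = loss_total = 0
    for i in range(k, len(results)):
        previous = results[i - k:i]
        if previous == "W" * k:
            win_total += 1
            win_after += results[i] == "W"
        elif previous == "L" * k:
            loss_total += 1
            loss_after += results[i] == "L"
    wsw = win_after / win_total if win_total else None
    lsl = loss_after / loss_total if loss_total else None
    return wsw, lsl


print(streak_conditional_rates("WWWWLLLLW", 3))
```

If results were independent, both numbers should sit close to the overall win and loss percentages.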

Bored with waiting for mef to wake up from his nap, I started to fool with this stuff too. IANAS (I am not a statistician) so my method was to try to parrot the statistical work of Gilovich, Vallone, and Tversky in The Hot Hand in Basketball: On the Misperception of Random Sequences. I used the downloaded records of twoeye, sum, thecaptain (as described by mef earlier), and our own speedchase. Since I downloaded the csv files on a different date than mef, my numbers are slightly different than his.

In the paper they begin by looking at the question, "Do players hit a higher percentage of their shots after having just made their last shot (or last several shots), than after having just missed their last shot (or last several shots)?" Of course for Go this translates to the question - Do players win a higher percentage of their games after having just won their last game (or last several games), than after having just lost their last game (or last several games)?

For the most basic answer to this question I constructed table 1 below. Here we see the
* total games,
* total wins,
* total losses, and
* average winning/losing percentages in the upper section.

Under that we have the summary figures from a more detailed analysis of streaks to be described later. This gives us the number of:

Streak extending:
* wins that occurred following a win,
* losses that occurred following a loss,

Streak ending:
* wins that occurred following a loss, and
* losses that occurred following a win
together with their applicable winning/losing percentages.

We can clearly see that for all four players the winning percentage following a win and the losing percentage following a loss were higher than the average winning and losing percentages. Meanwhile the winning percentages following a loss and the losing percentages following a win were lower than the average winning and losing percentages. (Note that percentages below the average are highlighted in red for easier reading.)

So the simple answer to the question above is "YES". Unlike professional basketball players, our sample KGS'ers do seem to have hot hands!

Next in the paper they used the Wald-Wolfowitz runs test to check whether the number of runs observed was consistent with a random distribution of hits or misses (wins or losses for us). Here a "run" means a series of one or more wins or losses. What we call winning "streaks" are unexpectedly long runs of wins. The more "streaky" our data, the fewer runs we will observe as the players continue to win or lose longer than expected before losing or winning and thereby starting a new run.

The WW runs test calculates an expected number of runs and a standard deviation from the number of wins and losses that we actually observe. It then calculates the difference between the actual runs observed in our data and the expected number. The difference is expressed as a Z statistic (i.e. a measure expressed as a number of standard deviations). With a Z table (downloaded from the internet in my case) we can find the probability that the observed runs were produced by a random process. The result for our four players is shown in table 2 below.
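For reference, the runs test described above fits in a short function. This sketch uses the standard Wald-Wolfowitz formulas for the expected number of runs and its variance, and replaces the Z table with the normal-distribution tail computed from `math.erfc`. The input format (a chronological string of 'W' and 'L' containing at least one of each) is an assumption.

```python
import math


def runs_test(results):
    """Wald-Wolfowitz runs test on a string of 'W'/'L' results.

    Returns (observed_runs, expected_runs, z, two_sided_p).
    """
    n1 = results.count("W")
    n2 = results.count("L")
    n = n1 + n2
    # A new run starts whenever the result differs from the previous one.
    runs = 1 + sum(1 for a, b in zip(results, results[1:]) if a != b)
    expected = 2 * n1 * n2 / n + 1
    variance = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    z = (runs - expected) / math.sqrt(variance)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return runs, expected, z, p


print(runs_test("WWWWWLLLLL"))
```

A negative Z means fewer runs than expected, i.e. streakier results than a random sequence with the same number of wins and losses.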

Here we see that the observed numbers of runs for twoeye, sum, and thecaptain are all quite far away (in standard deviations) from the expected figure. We can reject the idea that they are randomly produced at a quite high confidence level (99.99%). In the case of speedchase we cannot reject the idea that the number of runs we see is simply a random fluctuation in the data with much confidence (72%).

Finally in the paper they create a test for non-stationarity or the idea that players temporarily become hot or cold with an elevated or depressed winning percentage over short periods of time. For basketball players they cut their shooting records into four-shot intervals, totaled the number of hits in each "set" of shots and looked for unusually high numbers of "high performance" and "low performance" sets. I did the same for collections of four-game sets for each of our players. As in the basketball paper, I repeated the set-building process three more times, stepping one game forward in the overall player history each time. This gave four related but different sets of data. The resulting numbers were tested against the expected numbers from a random process with the same overall winning rate using the chi squared test. The results are shown in table 3 below.
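The four-game-set procedure can be sketched as follows. It is a simplified reading of the description above, not the actual spreadsheet: the history is cut into consecutive blocks of four starting at a chosen offset (0 to 3 for the four passes), the wins in each block are counted, and the counts are compared with binomial expectations at the overall win rate of the games used. Only the chi-squared statistic is returned; turning it into a probability needs a chi-squared table or a statistics library. The input format and function name are assumptions.

```python
from math import comb


def block_chi_squared(results, block=4, offset=0):
    """Chi-squared statistic for wins per block versus a binomial model.

    results: chronological string of 'W'/'L' (assumed format).
    Returns (chi2, observed_counts, expected_counts), indexed by wins per block.
    """
    wins = [1 if r == "W" else 0 for r in results[offset:]]
    n_blocks = len(wins) // block
    used = wins[:n_blocks * block]          # drop the incomplete last block
    p = sum(used) / len(used)               # overall win rate of the games used
    observed = [0] * (block + 1)
    for b in range(n_blocks):
        observed[sum(used[b * block:(b + 1) * block])] += 1
    expected = [n_blocks * comb(block, k) * p ** k * (1 - p) ** (block - k)
                for k in range(block + 1)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return chi2, observed, expected


# Extreme toy example: five all-win sets followed by five all-loss sets.
print(block_chi_squared("WWWW" * 5 + "LLLL" * 5))
```

A streaky player shows too many 0-win and 4-win sets and too few 2-win sets, which inflates the statistic.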

Here we can see that twoeye again is an outlier. Unlike the basketball players, his data seems to strongly indicate that his wins are not produced by a single random process. In other words he plays in streaks. Our other two big guns, sum and thecaptain, are less clear in this regard. Some of their data sets produce low probability measures, like twoeye, but others are higher. Finally, our man speedchase puts up results that fit a random process fairly well.

Overall, this may all be nonsense caused by errors stemming from my ignorance. Hopefully our more erudite posters will point them out if they see any. Otherwise I would say there are pretty strong indications here that streaks happen more often than expected on KGS for at least some of the players.

daal · Post by **daal** » Thu Dec 20, 2012 3:43 am

ez4u wrote:IANAS (I am not a statistician)

Missed your calling?

pwaldron · Post by **pwaldron** » Thu Dec 20, 2012 5:21 am

ez4u wrote: Here we can see that twoeye again is an outlier. Unlike the basketball players, his data seems to strongly indicate that his wins are not produced by a single random process. In other words he plays in streaks.

Nice work!

The only thing that comes to mind is that the analysis (or at least the null hypothesis) assumes that game results are independent and identically distributed. Of course that isn't the case--my chance of winning vs. a 3-dan is much higher than of winning against a 5-dan. A player who plays a streak of 3-dans would be expected to score a streak of wins also.

hibbs · Post by **hibbs** » Thu Dec 20, 2012 5:33 am

pwaldron wrote: Nice work!

I totally agree.

pwaldron wrote: The only thing that comes to mind is that the analysis (or at least the null hypothesis) assumes that game results are independent and identically distributed. Of course that isn't the case--my chance of winning vs. a 3-dan is much higher than of winning against a 5-dan. A player who plays a streak of 3-dans would be expected to score a streak of wins also.

If the games are played with a proper handicap, then this should not matter. One of the assumptions of the KGS rating system is that a proper handicap brings the win ratio on average to around 50%. (At least within the accuracy of the system; if a strong 3D plays a weak 4D things would be different, of course.)

Probably it would be a good idea to open a new thread for this line of discussion?

speedchase · Post by **speedchase** » Thu Dec 20, 2012 12:08 pm

I just had an idea: what if the handicap (default, and for automatch) were calculated using the difference in rating instead of the difference in rank?

wms · Post by **wms** » Thu Dec 20, 2012 2:04 pm

speedchase wrote: I just had an idea: what if the handicap (default, and for automatch) were calculated using the difference in rating instead of the difference in rank?

This is more a matter of preference than accuracy. I thought it would be annoying if a 5k played a 6k, and might get anything from an even game up through h-2. As it stands, if you see your rank and somebody else's, you know exactly what the default handicap/komi will be.

yoyoma · Post by **yoyoma** » Thu Dec 20, 2012 3:29 pm

hibbs wrote:
pwaldron wrote: Nice work!
I totally agree.

pwaldron wrote: The only thing that comes to mind is that the analysis (or at least the null hypothesis) assumes that game results are independent and identically distributed. Of course that isn't the case--my chance of winning vs. a 3-dan is much higher than of winning against a 5-dan. A player who plays a streak of 3-dans would be expected to score a streak of wins also.
If the games are played with a proper handicap, then this should not matter. One of the assumptions of the KGS rating system is that a proper handicap brings the win ratio on average to around 50%. (At least within the accuracy of the system; if a strong 3D plays a weak 4D things would be different, of course.)

Probably it would be a good idea to open a new thread for this line of discussion?

According to KGS rating math:
A middle 4-dan is expected to beat a very weak 3-dan 79% of the time (taking white, no komi)
A middle 4-dan is expected to beat a very strong 5-dan 21% of the time (taking black, no komi)

In both cases their ratings are 1.5 stones apart, and the handicap only corrects for 0.5 stones, leaving a 1.0 stone difference. That 1.0 stone difference translates into a 79:21 win ratio.
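The arithmetic can be checked with a short sketch. It assumes that the expected win rate is a logistic function of the residual rating difference in stones. The constant `K` is not taken from KGS documentation; it is back-solved from the 79:21 figure quoted above, so treat both the functional form and the constant as assumptions.

```python
import math

# Back-solved so that a 1.0 stone residual gives 79:21 (an assumption, see above).
K = math.log(0.79 / 0.21)


def win_probability(residual_stones, k=K):
    """Expected win rate for the player who is `residual_stones` stronger
    after the handicap has been taken into account (logistic model)."""
    return 1.0 / (1.0 + math.exp(-k * residual_stones))


rating_gap = 1.5          # middle 4-dan vs very weak 3-dan
handicap_correction = 0.5  # what "white, no komi" corrects for, per the post
print(round(win_probability(rating_gap - handicap_correction), 2))
print(round(win_probability(-(rating_gap - handicap_correction)), 2))
```

The same function gives 50% when the handicap exactly cancels the rating gap, which is the situation the "proper handicap" argument presupposes.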

So winning streaks could be the 4-dan rematching the weak 3-dan. Losing streaks could be the 4-dan rematching the strong 5-dan.

ETA: Also ez4u, did you just take win/loss data from sum's history? Going how far back? His rank changed from 5d to 4d three months ago. Throwing games played as 5d and games played as 4d together will confuse things a lot, I think.

speedchase · Post by **speedchase** » Fri Dec 21, 2012 10:41 am

wms wrote:This is more a matter of preference than accuracy. I thought it would be annoying if a 5k played a 6k, and might get anything from an even game up through h-2. As it stands, if you see your rank and somebody else's, you know exactly what the default handicap/komi will be.

Perhaps, but under the current system, if you are at a given rank and about to rank up, almost all of your matches (>90%) will favor you to win, so it will be difficult for you to "prove yourself" and ultimately increase your rank. Using my idea, you always have a 50% chance of being favored to win, which would remove the stickiness associated with passing through ranks.
Ultimately, I suppose everything is a matter of preference, but I would "prefer" a less sticky ranking system over being able to guess something that the system will tell me anyway.

ez4u · Post by **ez4u** » Fri Dec 21, 2012 7:07 pm

yoyoma wrote:...

ETA: Also ez4u, did you just take win/loss data from sum's history? Going how far back? His rank changed from 5d to 4d three months ago. Throwing games played as 5d and games played as 4d together will confuse things a lot, I think.

I think this is a good point, but I don't know how good.

The table below shows the winning record of the three bigs broken down by their rank at the time the game was played (our speedster lies on a completely different scale and so does not make it into this graph). Each of the three has two different ranks making up significant portions of their records, with clearly different winning percentages. This alone should force the issue of non-stationarity into the statistics if I understand that concept correctly.

BTW, I downloaded and used the complete KGS history of each player. They first appeared on KGS (under their current usernames anyway) on:
* thecaptain 2002-09-26
* sum 2004-03-25
* twoeye 2004-05-12

Life In 19x19
