Whole History Rating open source implementation.

Rémi · Post by **Rémi** » Wed May 30, 2012 11:58 am

yoyoma wrote:Remi do you have any numbers like this for observed KGS games to get numbers for Elo/Rank from them?

I did most of my experiments without handicap. If I find time in the days to come, I'll try to take a closer look. But I have been saying this to myself since the WHR paper in 2008, so I am not sure I'll do it soon.

Rémi

pete · Post by **pete** » Wed May 30, 2012 12:30 pm

Yoyoma,

I'm wondering if we have different models in our heads at this point. When you present a probability statement like P = 1 / ( 1 + 10^((RankB-RankA)/400) ) and then go on to say that RankB and RankA are actual kyu/dan ranks, I don't follow.

The model that WHR uses (and Remi, correct me if I misspeak) is P(A wins) = NaturalA/(NaturalA+NaturalB). To convert from Natural ratings to ELO, use the formula (NaturalX * 400.0)/ln(10). WHR primarily works on Natural scaled ratings internally. In my library, I convert the user's input into Natural ratings, and convert output back into ELO.

This produces a "linear" strength scale. Linear in the sense that the probability of a 1000 ELO player beating a 900 ELO player is the same as that of a 200 ELO player beating a 100 ELO player. (see the test_winrates_are_equal_for_same_elo_delta test in the library).
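The same-delta property can be checked directly with the classic Elo formula. This is a minimal sketch, not the library's own test, and the helper name `elo_win_probability` is made up for illustration:

```ruby
# Probability that a player rated elo_a beats a player rated elo_b,
# using the classic Elo formula with base 10 and the 400 constant.
def elo_win_probability(elo_a, elo_b)
  1.0 / (1.0 + 10.0**((elo_b - elo_a) / 400.0))
end

# Only the rating difference enters the formula, so both pairings
# below produce the same win rate for the stronger player.
puts format("1000 vs 900: %.4f", elo_win_probability(1000, 900))
puts format(" 200 vs 100: %.4f", elo_win_probability(200, 100))
```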

Historically, Go ranks are tied to handicap stones, and stronger players can use stones more effectively, thus ranks are not an equal distance apart in terms of strength. So it is in the conversion from ELO to ranks (which happens outside of the library, and in GoShrine code), that the strength scale takes on a curve.

Since a handicap stone is worth a varying amount of ELO, depending on the players' strengths, the library supports the use of a callback, which allows the calling code to implement a curve for handicap values as well.
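To illustrate the kind of curve such a callback could implement, here is a sketch in plain Ruby. Nothing in it is the library's actual interface: the lambda's arguments, the rating bounds, and the linear interpolation between 30 and 60 Elo per stone are all assumptions made for the example.

```ruby
# Hypothetical bounds for the interpolation (not taken from GoShrine).
WEAK_ELO   = 0.0
STRONG_ELO = 2000.0

# Assumed Elo value of one handicap stone for a player of the given strength:
# 30 Elo at the weak end, 60 Elo at the strong end, linear in between.
def elo_per_stone(player_elo)
  t = ((player_elo - WEAK_ELO) / (STRONG_ELO - WEAK_ELO)).clamp(0.0, 1.0)
  30.0 + 30.0 * t
end

# Hypothetical callback: total Elo advantage black receives from the stones.
HANDICAP_ADVANTAGE = lambda do |stones, black_elo|
  stones * elo_per_stone(black_elo)
end

puts HANDICAP_ADVANTAGE.call(3, 0)     # three stones for a weak player
puts HANDICAP_ADVANTAGE.call(3, 2000)  # three stones for a strong player
```

The point of the sketch is only that the same number of stones can map to different Elo offsets depending on who is playing, while the rating computation itself stays on the flat scale.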

Does this clear matters up? Essentially WHR knows nothing about the curved scale of go ranks and go handicaps, but just does what it's good at, computing estimates of relative strengths on a flat scale.

-Pete

yoyoma · Post by **yoyoma** » Wed May 30, 2012 1:58 pm

Yes we need to be clear what scales we're talking about. What you call Natural I thought was called Gamma.
Natural = ln(Gamma).
Elo = Natural*400/ln(10)
I think these are the same as the definitions given in 2.1 of http://remi.coulom.free.fr/WHR/WHR.pdf (Greek letter gamma = Gamma, lowercase r = Natural, uppercase R = Elo).

|Elo    |Natural|Gamma  | win%|
|0.00   |0      |1.00   |0.50 |
|30.00  |0.173  |1.19   |0.46 |
|60.00  |0.345  |1.41   |0.41 |
|400.00 |2.303  |10.00  |0.09 |
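As a sketch of the three scales defined above (the method names are illustrative, not from the paper or the library), the conversions and the resulting win probability for the weaker player can be written as:

```ruby
LN10 = Math.log(10)

# Elo (uppercase R) -> natural rating (lowercase r)
def elo_to_natural(elo)
  elo * LN10 / 400.0
end

# Natural rating -> Gamma, since Natural = ln(Gamma)
def natural_to_gamma(natural)
  Math.exp(natural)
end

# Bradley-Terry win probability for a player with strength gamma_a
def win_probability(gamma_a, gamma_b)
  gamma_a / (gamma_a + gamma_b)
end

[0, 30, 60, 400].each do |elo|
  natural = elo_to_natural(elo)
  gamma   = natural_to_gamma(natural)
  # Chance that a Gamma = 1 (0 Elo) player beats this opponent.
  underdog = win_probability(1.0, gamma)
  puts format("%6.2f | %.3f | %5.2f | %.2f", elo, natural, gamma, underdog)
end
```

With these definitions, 400 Elo corresponds to Gamma = 10 and to a natural rating of ln(10), roughly 2.30.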

Am I right that the handicap argument for Game::initialize is on the classic Elo scale? I see these bits of code that make me think so:

opponent_elo = bpd.elo + black_advantage # Addition used here, as I expected
rval = 10**(opponent_elo/400.0) # Here is the conversion from Elo to Gamma (what pete calls Natural)

When I wrote: "So for kyu players and 1 rank difference: RankB-RankA=1 and k=0.85.", that was for the KGS formula, which uses a Natural scale: P = 1 / ( 1 + e^(k*(RankB-RankA)) ). So for that formula ranks are fixed to always be 1 rank = 1.0 on the Natural scale. And the "k" parameter is used to change expected win rates for dans vs kyus.

So to compare apples to apples I converted from that formula to the classic Elo formula which uses log10 and has the 400 constant in there. I did a similar conversion from EGF GoR's parameter they call "a" (http://www.europeangodatabase.eu/EGD/EG ... system.php).
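For the KGS formula, the conversion comes down to a constant factor: setting e^(k*d) equal to 10^(E/400) gives E = k*d*400/ln(10). A small sketch of that step, with helper names chosen here for illustration:

```ruby
# KGS-style win probability for player A; rank_diff is RankB - RankA.
def kgs_win_probability(k, rank_diff)
  1.0 / (1.0 + Math.exp(k * rank_diff))
end

# Elo difference that yields the same probability under the classic
# formula P = 1 / (1 + 10^(elo_diff / 400)).
def kgs_to_elo_diff(k, rank_diff)
  k * rank_diff * 400.0 / Math.log(10)
end

# Classic Elo formula for the weaker player, used as a cross-check.
def elo_underdog_probability(elo_diff)
  1.0 / (1.0 + 10.0**(elo_diff / 400.0))
end

k = 0.85 # the kyu-level value quoted above
elo = kgs_to_elo_diff(k, 1)
puts format("KGS win %%: %.1f", 100 * kgs_win_probability(k, 1))
puts format("Elo per rank: %.0f", elo)
puts format("Elo win %%: %.1f", 100 * elo_underdog_probability(elo))
```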

When you wrote that your system used 30-60 Elo per rank, I assumed you meant the classic Elo scale using log10 and the 400 constant. Is that right? I added a table for those values:

|           | KGS   | EGF   | EGF   | KGS   | EGF   | EGF   |
|           | exp.  | exp.  | obs.  | exp.  | exp.  | obs.  |
| even game | win % | win % | win % | elo   | elo   | elo   |
|-----------|-------|-------|-------|-------|-------|-------|
| 10k vs 9k | 30.0  | 33.9  | 44.8  | 148   | 116   | 36    |
| 5d vs 6d  | 21.4  | 20.1  | 27.8  | 226   | 232   | 166   |

30 Elo difference | 45.7% |  (go shrine lower end 1 rank difference)
60 Elo difference | 41.5% |  (go shrine upper end 1 rank difference)
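The elo figures can be recovered from win percentages by inverting the classic formula, and the reverse direction gives the win rates for the 30 and 60 Elo gaps. A sketch, with illustrative function names:

```ruby
# Elo gap implied by the weaker player's win probability p.
def elo_diff_from_win_probability(p)
  400.0 * Math.log10((1.0 - p) / p)
end

# Weaker player's win probability for a given Elo gap.
def win_probability_from_elo_diff(diff)
  1.0 / (1.0 + 10.0**(diff / 400.0))
end

# EGF observed win rates from the table, converted to Elo gaps.
puts format("10k vs 9k observed: %.0f Elo", elo_diff_from_win_probability(0.448))
puts format("5d vs 6d observed:  %.0f Elo", elo_diff_from_win_probability(0.278))

# Win rates for the 30 and 60 Elo gaps.
[30, 60].each do |gap|
  puts format("%d Elo gap: %.2f%%", gap, 100 * win_probability_from_elo_diff(gap))
end
```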

pete · Post by **pete** » Wed May 30, 2012 3:58 pm

yoyoma wrote:

|           | KGS   | EGF   | EGF   | KGS   | EGF   | EGF   |
|           | exp.  | exp.  | obs.  | exp.  | exp.  | obs.  |
| even game | win % | win % | win % | elo   | elo   | elo   |
|-----------|-------|-------|-------|-------|-------|-------|
| 10k vs 9k | 30.0  | 33.9  | 44.8  | 148   | 116   | 36    |
| 5d vs 6d  | 21.4  | 20.1  | 27.8  | 226   | 232   | 166   |

30 Elo difference | 45.7% |  (go shrine lower end 1 rank difference)
60 Elo difference | 41.5% |  (go shrine upper end 1 rank difference)

Ok, I understand the table now, thanks for being patient. Your assumptions are correct about handicap being in ELO, and that the ELO in my WHR implementation is the same ELO you are talking about. The 30 & 60 elo deltas do indeed give the winrates that you list in the table above.

I'm wondering if you would indulge my curiosity and expand upon your explanation for why the observed values in the table above are at such odds with the expected winrates. "errors in the rating estimation" should create errors in both directions, overestimating and underestimating, no? And why do McMahon tournaments match underrated 10kyus with overrated 9kyus? Wouldn't they also match overrated 10kyus with underrated 9kyus?

I'm willing to accept that my ELO values might be low, but perhaps existing rating systems are also erring on the high side, as the above tables might suggest.

-Pete

yoyoma · Post by **yoyoma** » Wed May 30, 2012 4:37 pm

pete wrote:
Ok, I understand the table now, thanks for being patient. Your assumptions are correct about handicap being in ELO, and that the ELO in my WHR implementation is the same ELO you are talking about. The 30 & 60 elo deltas do indeed give the winrates that you list in the table above.

I'm wondering if you would indulge my curiosity and expand upon your explanation for why the observed values in the table above are at such odds with the expected winrates. "errors in the rating estimation" should create errors in both directions, overestimating and underestimating, no? And why do McMahon tournaments match underrated 10kyus with overrated 9kyus? Wouldn't they also match overrated 10kyus with underrated 9kyus?

I'm willing to accept that my ELO values might be low, but perhaps existing rating systems are also erring on the high side, as the above tables might suggest.

-Pete

I probably shouldn't have thrown in the errors in rating estimation part, because I don't know much about it. I read that somewhere but I can't find it. Basically what I understood is that when you have two players who are estimated to be 1500 and 1600, with some normal distribution of what their ratings *really* are... Blah blah lots of math I can't do on my own (hehe), turns out just using the 1500 and 1600 numbers by themselves gives a lower probability of upsets than using the full distributions? Honestly I don't know how that works so maybe someone can explain better, or maybe I'll find where I read it.
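One way to see the effect numerically is to average the Elo win probability over a distribution for the true rating gap instead of plugging in the point estimates. The sketch below does that for the 1500 vs 1600 example. The normal shape and the 150 Elo standard deviation are assumptions picked for illustration, and this is not necessarily the argument from the original source.

```ruby
# Upset probability: chance that the lower-rated player wins, given the
# true Elo gap (stronger minus weaker).
def upset_probability(gap)
  1.0 / (1.0 + 10.0**(gap / 400.0))
end

# Average the upset probability over a normal distribution of the true gap,
# using a simple Riemann sum from -6 to +6 standard deviations.
def expected_upset_probability(mean_gap, sd, step = 0.01)
  total = 0.0
  weight_sum = 0.0
  z = -6.0
  while z <= 6.0
    weight = Math.exp(-0.5 * z * z)
    total += weight * upset_probability(mean_gap + sd * z)
    weight_sum += weight
    z += step
  end
  total / weight_sum
end

point    = upset_probability(100.0)                  # ratings taken at face value
averaged = expected_upset_probability(100.0, 150.0)  # assumed 150 Elo uncertainty
puts format("point estimate:   %.3f", point)
puts format("with uncertainty: %.3f", averaged)
```

Averaging over the uncertainty pulls the result toward 50%, so upsets come out more likely than the point estimates alone suggest.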

The McMahon one is easier to understand. Take a tournament with two 9ks and two 10ks, and many 30k-11k and 8k+. In round one, the 9ks play each other and the 10ks play each other. In round 2, the 9k loser plays the 10k winner. Typically this will be whichever 9k was most overrated and whichever 10k was most underrated. So in general in McMahon tournaments, underrated players go up and overrated players go down, meeting each other and creating more upsets than expected. How big this effect is I don't know.
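A toy version of this tournament shows who ends up in the cross-rank game once McMahon scores (entry rank plus wins) are compared. The true strengths are made up, and the sketch assumes for simplicity that the truly stronger player always wins round one:

```ruby
Player = Struct.new(:name, :rank_score, :true_elo, :wins) do
  # McMahon score: starting score from the entry rank plus one per win.
  def mcmahon_score
    rank_score + wins
  end
end

# Hypothetical field: nominal ranks are 100 Elo apart (9k = 900, 10k = 800),
# but each player's true strength is off by 50 Elo in one direction.
PLAYERS = [
  Player.new("9k underrated", 1, 950, 0),
  Player.new("9k overrated", 1, 850, 0),
  Player.new("10k underrated", 0, 850, 0),
  Player.new("10k overrated", 0, 750, 0)
]

# Round 1: same-rank pairings; the truly stronger player wins.
PLAYERS.each_slice(2) do |a, b|
  (a.true_elo >= b.true_elo ? a : b).wins += 1
end

# Round 2: find the 9k and the 10k that now share a McMahon score.
def cross_rank_pairing(players)
  nines, tens = players.partition { |p| p.rank_score == 1 }
  nines.product(tens).find { |n, t| n.mcmahon_score == t.mcmahon_score }
end

nine, ten = cross_rank_pairing(PLAYERS)
puts "#{nine.name} vs #{ten.name}"
puts "true gap: #{nine.true_elo - ten.true_elo} Elo (nominal gap: 100)"
```

In this setup the 9k and 10k who meet are much closer in true strength than their ranks say, which is the mechanism behind the extra upsets.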

daniel_the_smith · Post by **daniel_the_smith** » Thu May 31, 2012 10:06 pm

I don't have anything to contribute but I'm very much enjoying the thread!

Kaya.gs · Post by **Kaya.gs** » Fri Jun 01, 2012 8:40 am

It's a nice discussion.

I think that it could be a valuable effort to set up a testing environment for comparing different rating systems. I had planned on doing this on OpenKaya, but I never compiled a set of games to make estimates with.

It can be very fruitful to agree on some systematic testing, so that every time we try out new rating systems, or more specifically tweak those systems, we can easily compare them.

Just imagine running tests against different compilations (with handicap, without handicap, with bots, etc.) and getting figures directly like:

Accuracy
Glicko -> 40%
WHR(GoShrine's) -> 47%
WHR(yoyoma's) -> 49%
Tygem's -> ?

Performance
Glicko -> X operations
WHR(GoShrine's) -> Y operations
WHR(yoyoma's) -> Z operations
Tygem's -> ?

and so on.
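A sketch of what the accuracy half of such a harness might look like. Everything here is hypothetical: the `Game` record, the predictor interface (a callable returning black's estimated win probability), and the tiny made-up sample standing in for a shared game set.

```ruby
# A game record: ratings known before the game and who actually won.
Game = Struct.new(:black_elo, :white_elo, :black_won)

# Fraction of games in which the predictor's favourite actually won.
# `predictor` returns the estimated probability that black wins.
def accuracy(games, predictor)
  hits = games.count { |g| (predictor.call(g) >= 0.5) == g.black_won }
  hits.to_f / games.size
end

# Two toy predictors standing in for real rating systems.
ELO_PREDICTOR   = ->(g) { 1.0 / (1.0 + 10.0**((g.white_elo - g.black_elo) / 400.0)) }
BLACK_PREDICTOR = ->(_g) { 1.0 } # baseline that always picks black

# Made-up sample; a real comparison would load an agreed-upon game set.
SAMPLE_GAMES = [
  Game.new(1500, 1400, true),
  Game.new(1500, 1600, false),
  Game.new(1450, 1500, false),
  Game.new(1700, 1500, true),
  Game.new(1600, 1550, false)
]

{ "Elo formula" => ELO_PREDICTOR, "Always black" => BLACK_PREDICTOR }.each do |name, predictor|
  puts format("%-12s -> %.0f%%", name, 100 * accuracy(SAMPLE_GAMES, predictor))
end
```

Any rating system that can produce a pre-game win probability could be plugged in as another predictor and scored on the same games.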

I'd like to get this rodeo going at some point, although it's not top priority for us now.

Making it an open standard could end up being useful elsewhere, like chess, or for a novel use such as comparing EGF ratings against what different systems produce from the same game results.

hyperpape · Post by **hyperpape** » Fri Jun 01, 2012 10:52 am

I wonder: while having real life games is nice, there is the problem that game pairings are influenced by the rating system. Perhaps that's not an issue for reasonable systems, but since some systems (Tygem) are very slow to fix large errors, that could introduce a real distortion.

bakekoq · Post by **bakekoq** » Sat Jun 09, 2012 4:19 pm

hello.
may I know how to install it?
it can be good for me and my clubs in the future.

pete · Post by **pete** » Sat Jun 09, 2012 4:31 pm

bakekoq wrote:hello.
may I know how to install it?
it can be good for me and my clubs in the future.

There are instructions on the linked page. It's a Ruby gem, so you must be familiar with Ruby and RubyGems first.

Life In 19x19
