LZ's progression

Vargo · **#21**

For Bill Spight.
Sorry, for the 9/4 I thought you were talking about

Quote:

A wins 50% against B, who wins 50% against C (--> A wins 50% against C)
or A wins 1 game out of 3 against B, who wins 2 games out of 3 against C (A wins 50% against C)

moha · **#22**

Vargo wrote:

Sampling errors on a too small sample (100 games) are most likely, but they should go both ways, and I would have hoped that they more or less cancel each other. Your bugged program is a good example, but the progression would be so slow (probably logarithmic) as to be almost non existant.

Quite the contrary I think: It will be linear (since promotion probability does not decrease by time, as would normally be the case), just less steep. In fact, such linearity of the elo graph can be an indication of problems in the learning process. The deviation on 400 samples is 10, so 55% is not completely out of reach of luck (esp. with early sprt stops).

Even for nets that are truly stronger, luck will usually be there as that is still an easy way towards promotion. Most networks with >55% winrates will in fact be around 52% or so.

Bill Spight wrote:

I would go further, and, since we are talking go, not just require the candidate winner to beat the previous version, but to beat each previous version by a greater margin.

But don't forget we don't just randomly pick nets here. Each one is the same as the last promoted net, after showing it a few more training samples. Each net is expected to be slightly stronger, even if this does not manifests in the test match. The test is there to prevent catastrophic regression, but as AlphaZero shown it is not absolutely necessary.

BTW, this A>B>C problem reminds me of an earlier idea:

Bill Spight wrote:

69.23% ≅ 9/13 , so the win/loss odds are 9/4.
60% = ⅗, so the win/loss odds are 3/2.

Andrew beats Bob 60% of the time, with win/loss odds of 3/2; Bob bets Charlie 60% of the time, with win/loss odds of 3/2. Assuming transitivity and no error, Andrew beats Charlie with win/loss odds of (3/2) (3/2) = 9/4, or 9/13 of the time.

In terms of the log of the odds, log(3/2) + log(3/2) = log(9/4).

I would try to model a player's level as a distribution of total errors made in a game. Assuming the simplest case with no distorting factors and a roughly bell shaped distribution (binomial? questionable OC), a player can be described as dropping an average of P points with an sd of D. The outcome of the game is the difference of dropped points, which should follow a shifted bell curve. If the players are close in strength (no adjustments for sd difference), the original question can be seen as shifting it twice as much as necessary for a 60% result.

It's a pity today's programs don't reliably analyse expected score, as that would enable a more accurate error distribution analysis.

Bill Spight · **#23**

moha wrote:

Bill Spight wrote:

I would go further, and, since we are talking go, not just require the candidate winner to beat the previous version, but to beat each previous version by a greater margin.

But don't forget we don't just randomly pick nets here. Each one is the same as the last promoted net, after showing it a few more training samples. Each net is expected to be slightly stronger, even if this does not manifests in the test match. The test is there to prevent catastrophic regression, but as AlphaZero shown it is not absolutely necessary.

Drift is still possible, as is the accumulation of deleterious changes in the successive winners. Skill can be lost which an earlier winner might have, which would not be enough for that winner to beat the current winner, unless the current winner needs to show sufficient superiority (as in giving a handicap).

Quote:

BTW, this A>B>C problem reminds me of an earlier idea:

Bill Spight wrote:

69.23% ≅ 9/13 , so the win/loss odds are 9/4.
60% = ⅗, so the win/loss odds are 3/2.

Andrew beats Bob 60% of the time, with win/loss odds of 3/2; Bob bets Charlie 60% of the time, with win/loss odds of 3/2. Assuming transitivity and no error, Andrew beats Charlie with win/loss odds of (3/2) (3/2) = 9/4, or 9/13 of the time.

In terms of the log of the odds, log(3/2) + log(3/2) = log(9/4).

I would try to model a player's level as a distribution of total errors made in a game. Assuming the simplest case with no distorting factors and a roughly bell shaped distribution (binomial? questionable OC), a player can be described as dropping an average of P points with an sd of D. The outcome of the game is the difference of dropped points, which should follow a shifted bell curve. If the players are close in strength (no adjustments for sd difference), the original question can be seen as shifting it twice as much as necessary for a 60% result.

While a single player's variation in overall skill may be roughly normal (bell shaped), that is not the shape of the presumed "fitness landscape" for advancing players. Both Elo (and I, when I set up a ratings system years ago for New Mexico) assumed a kind of power law shape, which is decidedly not bell shaped.

Quote:

It's a pity today's programs don't reliably analyse expected score, as that would enable a more accurate error distribution analysis.

I agree. But the "win rate" estimate worked better in producing stronger Monte Carlo programs.

moha · **#24**

Bill Spight wrote:

Drift is still possible, as is the accumulation of deleterious changes in the successive winners. Skill can be lost which an earlier winner might have, which would not be enough for that winner to beat the current winner, unless the current winner needs to show sufficient superiority (as in giving a handicap).

The only difference I see to the simple common case when a net is continuously trained on existing data is a potential negative feedback loop through the selfplay games (generated by the current partially trained net). But apparently such didn't happen with AlphaZero (without promotion matches), at least not to an extent to make real problems.

Quote:

While a single player's variation in overall skill may be roughly normal (bell shaped), that is not the shape of the presumed "fitness landscape" for advancing players. Both Elo (and I, when I set up a ratings system years ago for New Mexico) assumed a kind of power law shape, which is decidedly not bell shaped.

Could you elaborate on "fitness landscape for advancing players" and it's role in the A>B>C case?

I think go is a bit different to chess (elo) in that the accumulation of those tiny errors (which makes some normality of a players performance) is actually visible (in points) and verifiable here (with a strong enough program, and enough match samples). Which distribution would we see on expected points dropped (sum of single errors), and on the expected match score between two players (difference of the sums of single errors)? One distorting factor I see is winning players (like programs) trade margin for safety and simplicity, intentionally dropping some points.

Bill Spight · **#25**

moha wrote:

Bill Spight wrote:

Drift is still possible, as is the accumulation of deleterious changes in the successive winners. Skill can be lost which an earlier winner might have, which would not be enough for that winner to beat the current winner, unless the current winner needs to show sufficient superiority (as in giving a handicap).

The only difference I see to the simple common case when a net is continuously trained on existing data is a potential negative feedback loop through the selfplay games (generated by the current partially trained net).

I don't want to strain a metaphor too far, but Uberdude's post exemplifies the potential problem which might mean that LeelaZero is making less progress than it appears to be making. Different players have different weaknesses, and it is possible for successive winners to cycle between different strengths and weaknesses without making overall progress. I don't mean that the cycle is only three winners long, but the accumulation of small errors in exchange for small advantages elsewhere can produce the effect. Both randomness and multiple skills make this phenomenon possible.

Quote:

But apparently such didn't happen with AlphaZero (without promotion matches), at least not to an extent to make real problems.

In hill-climbing this kind of phenomenon tends to happen near the top of a hill. Perhaps we have not seen it with AlphaGo Zero because it is not near the hilltop for go.

However, I suspect that it did happen with AlphaZero (chess), which is why they played against a hobbled version of Stockfish. Considering the rapid initial progress of AlphaZero, reaching top level play in only a few hours, why did they not run it for a few days more and take on the best, including an opening book and endgame table bases? My guess is that AlphaZero stalled out. That does not minimize their accomplishment, nor does it alter the fact that the way AlphaZero plays chess is more human like than the play of other chess engines. But stalling out is not so good from a PR standpoint. {shrug}

Quote:

While a single player's variation in overall skill may be roughly normal (bell shaped), that is not the shape of the presumed "fitness landscape" for advancing players. Both Elo (and I, when I set up a ratings system years ago for New Mexico) assumed a kind of power law shape, which is decidedly not bell shaped.

Could you elaborate on "fitness landscape for advancing players" and it's role in the A>B>C case?

Consider the case of pool (pocket billiards). One test of skill in straight pool, where you can shoot the ball you pick and call the shot, is the average length of a run, how many balls, on average, that you can sink in a row. If the probability of sinking each ball is constant (not true, but perhaps approximately so), then in a sense the gain in skill for a poor player with an average run of 1 to increase it to 2 is approximately the same as for a much better player with an average run of 50 to increase it to 51. OC, the much better player has a harder time improving by 1 ball, in general, because he is much nearer the limit of the skills needed to play pool than the poor player (nearer the top of the hill).

Let us say that if Player B is one "level" better than Player A that he can beat Player A with a win/loss ratio of 1.5, and Player C is one level better than Player B. Then, based upon the structure of the levels (the "fitness landscape"), and not upon the shape of the variation in each player's play, we may, with simplifying assumptions, expect that Player C can beat Player A with a win/loss ratio of 1.5^2 = 2.25. The less the variation in each player's play, the more accurate that estimate will be. (Edit: But, as both of us have pointed out, it is more likely to be an overestimate than an underestimate.)

Quote:

I think go is a bit different to chess (elo) in that the accumulation of those tiny errors (which makes some normality of a players performance) is actually visible (in points) and verifiable here (with a strong enough program, and enough match samples). Which distribution would we see on expected points dropped (sum of single errors), and on the expected match score between two players (difference of the sums of single errors)?

I took advantage of that in my rating system by basing ratings on the ability to give handicaps, not simply upon win/loss ratios of even games.

Quote:

One distorting factor I see is winning players (like programs) trade margin for safety and simplicity, intentionally dropping some points.

Right. That is one reason to use handicap stones or variable komi to measure progress, so that the winner cannot afford to slack off against a weaker player.

Edit: Since I based ratings on the ability to give handicaps and komi, I did not follow in Elo's footsteps and had no reason to study that system. I infer Elo's "fitness landscape" from Vargo's remarks. I may well be mistaken about that.

moha · **#26**

Bill Spight wrote:

Consider the case of pool (pocket billiards). One test of skill in straight pool, where you can shoot the ball you pick and call the shot, is the average length of a run, how many balls, on average, that you can sink in a row. If the probability of sinking each ball is constant (not true, but perhaps approximately so), then in a sense the gain in skill for a poor player with an average run of 1 to increase it to 2 is approximately the same as for a much better player with an average run of 50 to increase it to 51. OC, the much better player has a harder time improving by 1 ball, in general, because he is much nearer the limit of the skills needed to play pool than the poor player (nearer the top of the hill).

Let us say that if Player B is one "level" better than Player A that he can beat Player A with a win/loss ratio of 1.5, and Player C is one level better than Player B. Then, based upon the structure of the levels (the "fitness landscape"), and not upon the shape of the variation in each player's play, we may, with simplifying assumptions, expect that Player C can beat Player A with a win/loss ratio of 1.5^2 = 2.25. The less the variation in each player's play, the more accurate that estimate will be.

I don't see how can you ignore the shape of the actual distribution of single game performances. I think the correctness of this oddswise estimate depends heavily on that. For normal distributions it can be reasonably correct (69.4% instead of 69.23% for the A>B>C 60%+60% case), but for other distributions cannot it also be completely wrong? (even without distorting factors like correlation or rock-paper-scissors)

dfan · **#27**

moha wrote:

I don't see how can you ignore the shape of the actual distribution of single game performances. I think the correctness of this oddswise estimate depends heavily on that. For normal distributions it can be reasonably correct (69.4% instead of 69.23% for the A>B>C 60%+60% case), but for other distributions cannot it also be completely wrong? (even without distorting factors like correlation or rock-paper-scissors)

Indeed. See my earlier comment #13, which I think got lost in the shuffle a little, for a trivial example.

Bill Spight · **#28**

moha wrote:

Bill Spight wrote:

Consider the case of pool (pocket billiards). One test of skill in straight pool, where you can shoot the ball you pick and call the shot, is the average length of a run, how many balls, on average, that you can sink in a row. If the probability of sinking each ball is constant (not true, but perhaps approximately so), then in a sense the gain in skill for a poor player with an average run of 1 to increase it to 2 is approximately the same as for a much better player with an average run of 50 to increase it to 51. OC, the much better player has a harder time improving by 1 ball, in general, because he is much nearer the limit of the skills needed to play pool than the poor player (nearer the top of the hill).

Let us say that if Player B is one "level" better than Player A that he can beat Player A with a win/loss ratio of 1.5, and Player C is one level better than Player B. Then, based upon the structure of the levels (the "fitness landscape"), and not upon the shape of the variation in each player's play, we may, with simplifying assumptions, expect that Player C can beat Player A with a win/loss ratio of 1.5^2 = 2.25. The less the variation in each player's play, the more accurate that estimate will be.

I don't see how can you ignore the shape of the actual distribution of single game performances.

I am not ignoring it. We are talking about two different things, that is all. We have to consider the distribution of game results to address the question of whether the winner in a set of games is better than his opponent or opponents. But in the case of A vs. B vs. C, we are assuming that C is better than B and B is better than A. The question then is how often the winner should win, given that difference. Different questions.

Quote:

I think the correctness of this oddswise estimate depends heavily on that. For normal distributions it can be reasonably correct (69.4% instead of 69.23% for the A>B>C 60%+60% case), but for other distributions cannot it also be completely wrong? (even without distorting factors like correlation or rock-paper-scissors)

For the purpose of model building I think that we can assume the normality of the results of a sufficiently large number of games. I suspect that the log odds is the best measure of results, but with enough games it should not make much difference if we use percentages.

The oddswise estimate is only correct if there is no variability in the win/loss odds. Otherwise it should be an overestimate. (Edit: For games like go and chess, I mean.) You can see that phemonenon with go ranks. With even matches (alternating Black and White) a 10 kyu will have a lower winning percentage against a 12 kyu than a 3 dan will have against a shodan, because the dan players' results are less variable. (I am assuming that if the stronger player of each pair alternates between giving two or three stones, the results are even. I do not assume that ranking based upon even games will behave that way.)

Bill Spight · **#29**

dfan wrote:

moha wrote:

I don't see how can you ignore the shape of the actual distribution of single game performances. I think the correctness of this oddswise estimate depends heavily on that. For normal distributions it can be reasonably correct (69.4% instead of 69.23% for the A>B>C 60%+60% case), but for other distributions cannot it also be completely wrong? (even without distorting factors like correlation or rock-paper-scissors)

Indeed. See my earlier comment #13, which I think got lost in the shuffle a little, for a trivial example.

How is it an example? Doesn't it depend upon the structure of the game and a presumed definition of expertise at it, rather than the distribution of the game results per se?

Your point there is no necessary relationship between the win rates of A vs. B, B vs. C, and A vs. C is well taken. But I don't think that is what moha is saying.

dfan · **#30**

Bill Spight wrote:

dfan wrote:

Indeed. See my earlier comment #13, which I think got lost in the shuffle a little, for a trivial example.

How is it an example? Doesn't it depend upon the structure of the game and a presumed definition of expertise at it, rather than the distribution of the game results per se?

Your point there is no necessary relationship between the win rates of A vs. B, B vs. C, and A vs. C is well taken. But I don't think that is what moha is saying.

I thought it was an example of a "shape of the actual distribution of single game performances" (uniform rather than Gaussian or somesuch) but maybe I was misinterpreting moha's phrase. For one thing, I was interpreting "game performance" as being a function of a single player, and whoever has the better performance wins; perhaps something else was meant.

Bill Spight · **#31**

dfan wrote:

There is no particular reason that winning percentages have to be related in this exact mathematical way.

For example, Alice, Bob and Carol all play the classic game "Whose random number is bigger?". Alice is a beginner and picks integers from 1 to 100 uniformly at random. Bob is more experienced and picks integers from 51 to 150 uniformly at random. Carol is an expert and picks integers from 101 to 200 uniformly at random (she's very good at this game, though you can probably imagine even better strategies).

How often does Bob beat Alice? How often does Carol beat Bob? How often does Carol beat Alice?

Bob beats Alice ⅞ of the time, with at win/loss ratio of 7/1. Carol beats Bob ⅞ of the time, with at win/loss ratio of 7/1. Carol always beats Alice. If we estimate Carol's win/loss ratio as (7/1) (7/1) = 49/1, OC, the win/loss ratio is off by infinity. However, the winning percentage is off by only 2%. :lol:

drmwc · **#32**

Suppose we play a hold'em game. We have 3 possible starting hands. You choose a hand first, I choose second. We then play out the flop, turn and river with no betting.

We bet an amount on the outcome. Surely you are bound to win, as you choose first?

The 3 possible hands are:
The red 2s
6, 7 of spades
Ace of spades, king of clubs.

moha · **#33**

dfan wrote:

Bill Spight wrote:

How is it an example? Doesn't it depend upon the structure of the game and a presumed definition of expertise at it, rather than the distribution of the game results per se?
Your point there is no necessary relationship between the win rates of A vs. B, B vs. C, and A vs. C is well taken. But I don't think that is what moha is saying.

I thought it was an example of a "shape of the actual distribution of single game performances" (uniform rather than Gaussian or somesuch) but maybe I was misinterpreting moha's phrase. For one thing, I was interpreting "game performance" as being a function of a single player, and whoever has the better performance wins; perhaps something else was meant

I think we talk about the same thing, and your example seems good (edit: or maybe not?

Bill's billiard example may be more interesting - exponential? but the difference is still more normal). I assumed that there is an individual performance distribution for both players (pointwise for simplicity - verifiable in go), and that game result distribution is a function of those two and can be different (though with normal individual distributions the difference will be normal as well). In the A>B>C 60%+60% situation it may even be possible to design games with individual shapes for either extreme (A beats C in 61% or 99% - though this may need distorting factors, since the difference will usually be a bit more normal shape, like in dfan's example).

About "there is no necessary relationship between the win rates of A vs. B, B vs. C, and A vs. C": it seems to me that - assuming the simplest case like no correlations, players performances are independent, etc. - there is a relationship, which depends on the individual performance distributions (thus varies by game type).

moha · **#34**

drmwc: Your example seems to be a second player advantage game, not quite the same as the A>B>C 60%+60% question.
For the the latter, a slightly similar and interesting example with pure uncorrelated distributions:

Player A picks number 30 (20% of cases) or 3.
Player B picks number 20 (50% of cases) or 2.
Player C picks number 10 (80% of cases) or 1.

But this still seems an r-p-s like situation, which can be considered a distorting factor (individual distributions differ in shape).

Bill Spight · **#35**

moha wrote:

drmwc: Your example seems to be a second player advantage game, not quite the same as the A>B>C 60%+60% question.

Apologies to drmwc for butting in, but I take it as an example of non-transitivity.

Quote:

For the the latter, a slightly similar and interesting example with pure uncorrelated distributions:

Player A picks number 30 (20% of cases) or 3.
Player B picks number 20 (50% of cases) or 2.
Player C picks number 10 (80% of cases) or 1.

But this still seems an r-p-s like situation, which can be considered a distorting factor (individual distributions differ in shape).

I'm not sure, but your idea of individual distributions seems to be related to what I am calling different strengths and weaknesses of different players (multi-dimensionality). Which can also cause non-transitivity.

moha · **#36**

Bill Spight wrote:

I'm not sure, but your idea of individual distributions seems to be related to what I am calling different strengths and weaknesses of different players (multi-dimensionality). Which can also cause non-transitivity.

Yes, but this is similar to rock-paper-scissors. I tried to exclude such distortions, and assumed uncorrelated performances, and that the players only differ in strength (position and variance but not shape of distribution). Then that shape still seems to matter for the accuracy of the oddswise approach.

chut · **#37**

Just wondering, there is a rather sharp upturn in strength graph at elo 10800, does that correspond to introducing ELF OpenGo to train LZ?

dfan · **#38**

chut wrote:

Just wondering, there is a rather sharp upturn in strength graph at elo 10800, does that correspond to introducing ELF OpenGo to train LZ?

Yes. (You can see the ELF network in the graph in multiple places; it's the gray cross with hash code starting with 62b.)

chut · **#39**

In this series of matches with Haylee, there is no 3,3 point invasions even with 2 stones handicap, so the network weight used is the 'human' one?

https://www.youtube.com/watch?v=hExYHwtsra8

I find the zero human network quite bad with handicap games. Leela 11 would trash me with 4 stones handicap, but Zero would start by invading all the 3,3 points making the game much easier and much less interesting.

I am wondering whether there is a way to tweek the MCTS for handicap game, for example to favor branches that may not be the best, but with the highest number of sub-branches that are near optimal.

dfan · **#40**

People on the LZ team have been thinking about how to make it play handicap games better: https://github.com/gcp/leela-zero/issues/1313

LZ's progression

Who is online