"Indefinite improvement" for AlphaZero-like engines
-
lightvector
- Lives in sente
- Posts: 759
- Joined: Sat Jun 19, 2010 10:11 pm
- Rank: maybe 2d
- GD Posts: 0
- Has thanked: 114 times
- Been thanked: 916 times
Re: "Indefinite improvement" for AlphaZero-like engines
And I guess I'm saying - if linear additivity does break (even in ways that aren't exactly like the above model), then you're potentially back to how Chess behaves where win/loss/draw doesn't obey arithmetic in the same way and you can't really bound the amount of Elo present by such arguments.
-
moha
- Lives in gote
- Posts: 311
- Joined: Wed May 31, 2017 6:49 am
- Rank: 2d
- GD Posts: 0
- Been thanked: 45 times
Re: "Indefinite improvement" for AlphaZero-like engines
I think even against perfect play, you cannot assume that the game is balanced on the razor's edge in a way that one 0.1 pt mistake would always turn the draw into a 1 pt loss. If this is the case from B's view it follows that there are room for 0.9 pts total mistakes when playing with W (0.1-0.9 balance), which is 50% of time. So if you make 0.1 pt mistake average, not only would you not lose 90% but not even 50% (drawing the rest), less than a whole class.
It's also probably more healthy to start with a 0.5-0.5 subpoint balance assumption against perfect play. In that case you need to make 0.5 pt worth of subpoint mistakes to lose a game with any color - so to get 25% you need to be over 0.5 in half of cases and under 0.5 in other half (0.5 avg, same as the integer view - coincidence?).
But I'm not ruling out the possibility of similar or different tricky phenomenons happening near perfect play. This does not need any subpoint logic though. It seems enough to say that as perfect play does not (may not) include randomness, it probably not explores the whole distribution from class-1. I think this is similar to what your example aimed at so let's borrow part of your idea: Suppose it is easier to go astray and enter a losing line with B than with W (with perfect komi). We have little randomness because we are close to perfect play. Following your logic, it could seem that class-1 .. class-2 .. class-3 enters into a pattern where they start to lose most of their games (against slightly stronger opponent who does not cancel out small mistakes?) as B (many losing lines, one drawing line) while still drawing with W. This could give them near 25% against each other despite being only fractionally weaker pointwise.
This also shows why this argument is invalid though. To be at class-1 it is not enough to get 25% against perfect play when played directly. For this strength (and Elo) you need to perform accordingly against a wide panel of (all possible) opponents and situations (even when one does not actually test all in practice). It is common that a player performs unexpectedly good or bad against a certain single opponent or situation, but this does not change his (theoretical) strength or class. The 1 point example was just an example, in reality "rounding" your mistakes to integers cannot affect (or affect differently) your class. If you only make average 0.1 pt worth of mistakes per game, you will be rated much closer than a class from perfect play once tested widely and seriously - exactly because 0.1 pt mistakes has, on average, much less effect on the game than one point mistakes.
Edit: I'm not sure if this can not be turned against the original hypothesis somehow, though.
It's also probably more healthy to start with a 0.5-0.5 subpoint balance assumption against perfect play. In that case you need to make 0.5 pt worth of subpoint mistakes to lose a game with any color - so to get 25% you need to be over 0.5 in half of cases and under 0.5 in other half (0.5 avg, same as the integer view - coincidence?).
But I'm not ruling out the possibility of similar or different tricky phenomenons happening near perfect play. This does not need any subpoint logic though. It seems enough to say that as perfect play does not (may not) include randomness, it probably not explores the whole distribution from class-1. I think this is similar to what your example aimed at so let's borrow part of your idea: Suppose it is easier to go astray and enter a losing line with B than with W (with perfect komi). We have little randomness because we are close to perfect play. Following your logic, it could seem that class-1 .. class-2 .. class-3 enters into a pattern where they start to lose most of their games (against slightly stronger opponent who does not cancel out small mistakes?) as B (many losing lines, one drawing line) while still drawing with W. This could give them near 25% against each other despite being only fractionally weaker pointwise.
This also shows why this argument is invalid though. To be at class-1 it is not enough to get 25% against perfect play when played directly. For this strength (and Elo) you need to perform accordingly against a wide panel of (all possible) opponents and situations (even when one does not actually test all in practice). It is common that a player performs unexpectedly good or bad against a certain single opponent or situation, but this does not change his (theoretical) strength or class. The 1 point example was just an example, in reality "rounding" your mistakes to integers cannot affect (or affect differently) your class. If you only make average 0.1 pt worth of mistakes per game, you will be rated much closer than a class from perfect play once tested widely and seriously - exactly because 0.1 pt mistakes has, on average, much less effect on the game than one point mistakes.
Edit: I'm not sure if this can not be turned against the original hypothesis somehow, though.
-
hyperpape
- Tengen
- Posts: 4382
- Joined: Thu May 06, 2010 3:24 pm
- Rank: AGA 3k
- GD Posts: 65
- OGS: Hyperpape 4k
- Location: Caldas da Rainha, Portugal
- Has thanked: 499 times
- Been thanked: 727 times
Re: "Indefinite improvement" for AlphaZero-like engines
In the extreme case, consider deterministic strategies. There is (at least one) perfect strategy. Take one game featuring that strategy against itself. For each move in that game, create a new strategy that differs only in virtue of playing a losing move in that particular position. Then you've created a series of strategies that are each (at least) 400 points stronger than the previous strategy, and (at least) 400 points weaker than the next strategy, until you reach the perfect one.
That seems to show a lower bound of (something like) 200 * 400 ELO points. But we can do much more: for each of those imperfect players, we can do the same cloning for all points after the first deviation. I can't quite think through how far this point goes right now, but I think we're dealing with the "kind of" astronomical numbers you see when calculating the number of possible states of the go board, or number of games.
Our human brains are not deterministic, so human ELO scales are quite compressed. Bots are more predictable, which is one contributor to the 16000 ELO that you hear about with Leela, but they still aren't deterministic. So I expect the bot range is more like the human range than the astronomical range. But I don't know what it would be.
Related to this point: ELO scales aren't really a measure of strength. They're a measure of strength relative to a population and a pairing strategy for games. The ELO difference between players depends on the other players they're surrounded by. If that's true, then the original question isn't really about how long bots can improve, it's a question about how many different strategies those improving bots can implement.
That seems to show a lower bound of (something like) 200 * 400 ELO points. But we can do much more: for each of those imperfect players, we can do the same cloning for all points after the first deviation. I can't quite think through how far this point goes right now, but I think we're dealing with the "kind of" astronomical numbers you see when calculating the number of possible states of the go board, or number of games.
Our human brains are not deterministic, so human ELO scales are quite compressed. Bots are more predictable, which is one contributor to the 16000 ELO that you hear about with Leela, but they still aren't deterministic. So I expect the bot range is more like the human range than the astronomical range. But I don't know what it would be.
Related to this point: ELO scales aren't really a measure of strength. They're a measure of strength relative to a population and a pairing strategy for games. The ELO difference between players depends on the other players they're surrounded by. If that's true, then the original question isn't really about how long bots can improve, it's a question about how many different strategies those improving bots can implement.
-
moha
- Lives in gote
- Posts: 311
- Joined: Wed May 31, 2017 6:49 am
- Rank: 2d
- GD Posts: 0
- Been thanked: 45 times
Re: "Indefinite improvement" for AlphaZero-like engines
These would be almost identical in strength as strength or Elo is not measured against a single opponent but as average performance against all possible opponents.hyperpape wrote:Then you've created a series of strategies that are each (at least) 400 points stronger than the previous strategy,
-
lightvector
- Lives in sente
- Posts: 759
- Joined: Sat Jun 19, 2010 10:11 pm
- Rank: maybe 2d
- GD Posts: 0
- Has thanked: 114 times
- Been thanked: 916 times
Re: "Indefinite improvement" for AlphaZero-like engines
You might have missed that my example was with a non-integer komi, so no draws. And/or maybe you haven't quite understood the model?moha wrote:I think even against perfect play, you cannot assume that the game is balanced on the razor's edge in a way that one 0.1 pt mistake would always turn the draw into a 1 pt loss. If this is the case from B's view it follows that there are room for 0.9 pts total mistakes when playing with W (0.1-0.9 balance), which is 50% of time. So if you make 0.1 pt mistake average, not only would you not lose 90% but not even 50% (drawing the rest), less than a whole class.
To illustrate, suppose the model is that at the end of the game, whatever hidden fractional advantage is deterministically rounded to the nearest of -1.5, -0.5, 0.5, 1.5, etc for the final observable outcome in points. Suppose it luckily happens to be the case that naturally the game "starts at" 0.01 fractional advantage - extremely close to 0.
Say player A over the course of any game loses 0.05 fractional advantage with a standard deviation of 0.01, independently and roughly normally distributed.
Say player B over the course of any game loses 0.1 fractional advantage with a standard deviation of 0.02, independently and roughly normally distributed.
Say player C over the course of any game loses 0.2 fractional advantage with a standard deviation of 0.04, independently and roughly normally distributed.
Say player D over the course of any game loses 0.4 fractional advantage with a standard deviation of 0.08, independently and roughly normally distributed.
Then A beats B, B beats C, C beats D, each time with high probability, averaging more than 90%. Each one will reliably achieve a small net fractional advantage over the next player much more than how much the color affects it (only 0.01), and with a small enough variation that it will almost always round to a win by 0.5 points (by being closer to 0.5 than to -0.5).
And if A plays D, then A will achieve average net fractional advantage of +0.34 or +0.36 depending on color, and again with a small enough standard deviation that virtually always it will still round to A winning by 0.5, rather than either -0.5 or 1.5.
We can do the same with draws. Suppose we use an integer komi, so that now the model is that at the end of the game, whatever hidden fractional advantage is rounded to the nearest of -2, -1, 0, 1, 2, etc for the final observable outcome in points. Suppose it luckily happens to be the case that naturally the game "starts at" 0.51 fractional advantage - extremely close to 0.5, from white's perspective.
Let's use A,B,C,D as above exactly the same. Now what happens is each player will win win as white 90+% of the time, and draw as black 90+% of the time against the previous, for a net outcome of close to 75% wins if you count draws as half. So we've squeezed almost 3 "classes" into the span of less than half of a point, and it's clear we could have squeezed in more.
So this:
might simply not be true. The above model demonstrates one way in which it might not be true - namely if the game happens to be balanced near an edge and the other player is sufficiently closer to perfect than you so that even if they also make many mistakes, statistically the sum of your mistakes almost always exceed theirs and put you on the wrong side of the edge. (and we *know* already that miai counting theory shows us that endgames may sometimes behave this way, with this deterministic-like rounding of tiny theoretical differences to an integer or half-integer)0.1 pt mistakes has, on average, much less effect on the game than one point mistakes.
Maybe it requires some "luck" for this game balance, and also the above model makes lots of assumptions, so I won't say that it's even a likely or good model, but it seems very hard to be sure that something roughly like it can't happen.
And notice that I didn't need to suppose that any of the players is particularly specialized at exploiting any of the other players. Each one independently simply makes mistakes according to a certain distribution regardless of who it's playing.
-
moha
- Lives in gote
- Posts: 311
- Joined: Wed May 31, 2017 6:49 am
- Rank: 2d
- GD Posts: 0
- Been thanked: 45 times
Re: "Indefinite improvement" for AlphaZero-like engines
I think how I first interpreted your example (0.1-0.9) may be closer to go reality (and without players who basically play with a constant per-game error). But even with these extreme conditions, I don't think the example works: these players are not classes apart. Direct results against each other is secondary (such upsets are not uncommon), but if you test and rate them against a wider set of opponents who make all kinds of errors (so that their whole distribution, real strength, all possible rounding ranges/cases gets explored), their performance will not be significantly different, so will get almost the same Elo rating. Or am I still misunderstanding you?
BTW both Elo and my error distribution model is flawed in the immediate vicinity of perfect play. Both assume a symmetrical performance distribution, that if average is X then rare cases of performing X-5 is equally likely than X+5. This is never completely correct but may became an intolerable flaw in the immediate vicinity of 0 error (no negative errors), so care must be taken here. (Cf your example: if your avg is 0.1 you pretty much have to play a perfect game quite often so sd 0.02 sounds weird.)
What seems possible though is taking a player (or various players) 1 stone or a few points below perfect play (who achieves 50% against it with the extra stone or komi), and testing how it performs against perfect play on perfect komi (how much rare draws it can catch). This could show their Elo distance (the equivalent of that point distance). And would also hint a bit about the role of whole point scoring, and the "hardness" difference between draw and the smallest loss.
BTW both Elo and my error distribution model is flawed in the immediate vicinity of perfect play. Both assume a symmetrical performance distribution, that if average is X then rare cases of performing X-5 is equally likely than X+5. This is never completely correct but may became an intolerable flaw in the immediate vicinity of 0 error (no negative errors), so care must be taken here. (Cf your example: if your avg is 0.1 you pretty much have to play a perfect game quite often so sd 0.02 sounds weird.)
What seems possible though is taking a player (or various players) 1 stone or a few points below perfect play (who achieves 50% against it with the extra stone or komi), and testing how it performs against perfect play on perfect komi (how much rare draws it can catch). This could show their Elo distance (the equivalent of that point distance). And would also hint a bit about the role of whole point scoring, and the "hardness" difference between draw and the smallest loss.
-
lightvector
- Lives in sente
- Posts: 759
- Joined: Sat Jun 19, 2010 10:11 pm
- Rank: maybe 2d
- GD Posts: 0
- Has thanked: 114 times
- Been thanked: 916 times
Re: "Indefinite improvement" for AlphaZero-like engines
Sort of. I feel like the above model isn't sticking in your head because of some deep assumption you're making, or that I'm not getting across well. I don't think you have any guarantee that testing against a wider range of opponents (that are also similarly "close" to optimal so as to not just lose all the time) will get you other roundings and ranges and cases.moha wrote:I don't think the example works: these players are not classes apart. Direct results against each other is secondary (such upsets are not uncommon), but if you test and rate them against a wider set of opponents who make all kinds of errors (so that their whole distribution, real strength, all possible rounding ranges/cases gets explored), their performance will not be significantly different, so will get almost the same Elo rating. Or am I still misunderstanding you?
Please take the model I stated above literally at face value for a moment - consider if perhaps maybe between ten and thirty times per game a player faces a choice between very similarly good moves in which occasionally they could make a mistake that loses 0.01 "points", or 0.003 "points", or other tiny amounts "worse" by some metric. Suppose that the way you might lose 0.2 "points" on average turns out often not to be by a single random blunder of 1 point with 20% probability over the course of a game, but rather by statistically on average 10 "mistakes" of 0.02 "points" over the course of the game that give you a normalish distribution. And suppose that in the endgame it will deterministically turn out regardless of what variety of opponent you're playing against, whoever made the smaller sum of "mistakes" by this hidden emergent internal metric of "points" will turn out to get tedomari for the last real point and win by a whole 0.5 real points in terms of the actual game result. Again: no matter what variety of opponent, the game always roughly rounds in this way, because it is a function of how the game itself turns out to work at such levels of play, not of any particularly special class of opponents.
Adjust my numbers to reflect this if you like - you can widen all the standard deviations to be sqrt(10)x smaller than the mean instead of 5x smaller. You still get 3 or 4 "classes" within half a point, rather than just one or two.
A stupid model yes, hard to confidently rule out this behavior. For example, in Chess the emergent metric of "sum of material values of pieces plus various standard positional factors" matters for all practical players, even though nowhere in the rules is it written that "1 * pawn + 3.5 * bishops + 5 * rooks + ..." should have anything to do with whether your king will eventually be checkmated. It doesn't matter whether the opponent thinks in terms of material or some entirely different way - so long as you and they are of similar strengths, in practice, "4 pawn equivalents" of advantage will almost always round to "the will be checkmated many moves later". So there is a hidden emergent "metric" (indeed, many of them, all highly correlated) such that almost all practical matchups of players experience the game "rounding" to a discrete outcome the same way based on it, even players that don't explicitly "think" in terms of things like material count.
Why can't this happen in Go, for the fight over the last point? Perhaps once you get within a few points of optimal there ends up being some advanced and sophisticated emergent concepts in Go that apply to almost all non-pathological players for "how to globally coordinate positions to get the last point to round in your favor". Perhaps it turns out that fighting for it is less like "each turn independently, your imperfections give you a 10% chance to throw away the final point, and more like "gradually you accumulate a succession of advantages and optionality that give you increasing control over the parity for the final point" so that whichever player is slightly better at it can reliably convert that to winning by 0.5, regardless of the quirks of the opponent. Then there could be room for hundreds and hundreds of Elo in being incrementally and reliably better than the opponent at this. And again, if this is emergent to the game itself (like material is in chess), it could be shared across all natural players, and not go away simply when you average across a population of opponents.
One last way very different to visualize it maybe - imagine again we're at the fairest possible integer + 0.5 komi, and we had some oracle that magically would categorize all moves as "will win or lose by 0.5" or "losing by more than that" or "winning by more than that", and both we and the opponent could always use the oracle on every turn, and thereby neither would ever lose by more than 0.5. Consider pruning the game tree down to just the positions resulting from that. How much strategy would be left for fighting over which of "win by 0.5" and "lose by 0.5"? Suppose we're also talking about the bots XXX years in the future that are not far away from this level anyways.
* It could be highly chaotic with few learnable generalizations about which branches lead to what. Then Elo measures would be dominated by non-transitive pathological trees of players better than other players in particular ways that don't transfer.
* It could be that the +0.5 and the -0.5 positions were clustered in a learnable way but in a way that any +0.5 position wasn't much "farther" from -0.5 than any other. So any occasional mistake could always flip the result, unless the opponent made an occasional mistake back later. So it would all come down to the probability each turn that you flip the result, which averaged across many opponents and positions would get you your 1 or 2 classes fitting within the last point, but no more, exactly as you said.
* It could turn out they were clustered in a very-graduated way such that such that some +0.5 positions could be "distant" from any -0.5 positions, some would be "nearer", and so on, such that making a mistake would only incrementally cost you by putting you nearer, and a usually only a mistake near the boundary would actually flip you into the other side, a mistake while already at -0.5 would put you deeper, etc. Then the resulting strategy could be tug-of-war-like, such that the player making a lower rate of small incremental errors would reliably win, but a player making an even lower rate would reliably beat them and so on. Then you have the potential for many more than 2 "classes" of Elo to fit within the fight over this 1 point.
The last case could be less likely than the others, but it seems a-priori possible. And again it doesn't matter whether you average across many opponents or not.
-
Bill Spight
- Honinbo
- Posts: 10905
- Joined: Wed Apr 21, 2010 1:24 pm
- Has thanked: 3651 times
- Been thanked: 3373 times
Re: "Indefinite improvement" for AlphaZero-like engines
Despite knowing next to nothing about the Elo system, I am strangely drawn to this discussion, with the hope that I might have something meaningful to say.
I may have something to add about fractional errors.
There are at least two possible meanings to an error of some number of points in go. One sense is that the player makes a play that is dominated by a play that gains more points, on average. Another is that the player makes a play that, after correct play, gets a worse score. I don't know if the players are talking about either one of these, or if they have something else in mind.
Here is an example of domination. Suppose that there are two plays left on the board, one that gains 0.6 pts. and one that gains 0.4 pts. by territory scoring. Then the play that gains 0.6 dominates the play that gains 0.4. IOW, the play that gains 0.6 pts. will never score worse than the play that gains 0.4 pts., but the play that gains 0.4 pts. may score worse than the play that gains 0,6 pts., and we consider the play that gains only 0.4 pts. to be a mistake. Using | notation, let the plays be these: {1|0} and {2|1||0}. The local scores are pts. for Black. If you don't know this notation, I think it will be clear after I go through this example.
Suppose that Black plays first. Then if she makes the 0.4 pt. play she will move to the left in {1|0} to get a local score of 1 pt. and {2|1||0} will remain on the board. White will next take the remaining, 0.6 pt. play, moving in {2|1||0} to a local score of 0, for a combined local score of 1 + 0 = 1. Now suppose instead that Black makes the 0.6 pt. play first. She will move in {2|1||0} to {2|1}. That will leave two 0.4 pt. plays, {1|0} and {2|1}. They are miai, so the combined local score will be 2 + 0 when White moves to 0 in one and Black moves to 2 in the other, or 1 + 1 when White moves to 1 in one and Black moves to 1 in the other. The local scores add to 2 in either case. Black does 1 pt. better by making the 0.6 pt. play.
Now suppose that White plays first. If he makes the 0.4 pt. play there will be a local score of 0 and {2|1||0} will remain on the board. Black will next take the remaining, 0.6 pt. play, leaving the {2|1} play on the board, which White will take, for a combined local score of 0 + 1 = 1. Now suppose instead that White makes the 0.6 pt. play first, leaving a local score of 0 and the {1|0} play. Black will take that play, for a combined local score of 0 + 1 = 1. In this case it does not matter which play White makes first, the score is the same. Still, we may consider the 0.4 pt. play to be a mistake in the first sense.
OC, bots, as currently written, do not calculate the value of plays this way, so they don't make mistakes of this sort, except as viewed by humans.
As I have been writing this, I have come up with a third possible meaning of a fractional mistake. It involves the concepts of an environment and its temperature. What the environment is for any board depends upon the player. The environment consists of background plays, and there at least one play in the foreground. The player knows or assumes the following. There is a play in the environment that gains a certain number of points, T, on average, along with other plays that gain T or less, such that we may estimate that playing in the environment will add T/2 pts. to the current value of the board. T is the temperature of the environment, or ambient temperature. T/2 plays a similar role in the concept of the environment as komi does at the start of the game.
Now suppose that we have one gote in the foreground that gains T + 0.1, octal, on average, for each player. We may then estimate the result of making that play and then letting our opponent play in the environment as T + 0.1 - T/2 = T/2 + 0.1. But if instead we play in the environment and let our opponent take that gote, we make estimate the result as T - T - 0.1 + T/2 = T/2 - 0.1. That is 0.2 pts. worse, on average, than if we take the gote. So we may think that not taking the gote is a 0.2 pt. mistake. OC, we are talking about estimates and average scores, so it may not be a mistake at all. Also, note that the error associated with the estimation of the final score is associated with the environment, not with the small difference. That error is on the order of T/2. Yes, if we have a lot of small errors, our estimation of their sum should be approximately normal and dependent on the size and number of the errors, but our uncertainty about the environment trumps that of the size of any particular small error. Besides, bots do not think this way, either. But humans might.
Another sense of the size of an error is the eventual difference in the score, given the following assumptions. Suppose that if you make play A. then Black can play to guarantee that the final score will be at least S for Black and White can play to guarantee that the final score will be at most S for Black, we may consider S to be the theoretical value of the game after you make play A. Following Berlekamp, I refer to such play as canonical. Similarly, if you make play B, the theoretical value is T. If S > T and you are Black, then play B is a mistake that costs at least S - T, and if you are White, then play A is mistake that costs at least S - T. OC, this is not what our discussants have in mind, because by both territory and area scoring S - T is an integer, not a fraction. But there is a form of go that has fractional scores, Chilled Go. (Environment, temperature, and chilled go all have explanations on SL, if not here.
)
In chilled go each move costs 1 pt. by comparison with territory scoring. In territory scoring each move costs 1 pt. by comparison with area scoring. So we may think of territory scoring as chilled go for area scoring.
If there are no complications by ko, correct play in chilled go is also correct play in territory go, which is also correct play in area go. It is quite possible, then, that the theoretical value of a go game is 7 pts. for Black in all three forms of go. For simplicity and convenience I will make that assumption. A perfect player playing Black can guarantee a final board score of at least 7 pts. for Black and a perfect White can guarantee a final board score of at most 7 pts. for Black.
Let me show how fractional scores can arise in chilled go. Here are the two positions we looked at before, {1|0} and {2|1||0}, using territory scores. In chilled go we subtract a point for each move, so {1|0} becomes {0|1} and {2|1||0} becomes {0|1||1}. {0|1} = 0.4, octal, and {0|1||1} = 0.6. Together they add up to 1.2. That's the same as their average value by territory scoring and by area scoring, as well. It's just that neither player wishes to lose a fraction of a point in chilled go by playing in either position. (This is also discussed on SL and here.)
Now let's suppose that 7 pts. for Black is the game theoretical value of chilled go and that we have a perfect player for it. Suppose that this player is Black, and White is almost as good. Then if White makes a 0.2 pt. mistake and otherwise each player plays canonically, the final chilled go score on the board will be 7.2. We do not have to rely upon estimates, we can take the final score to evaluate fractional errors. In this case, what would the final territory score be? That depends on who got the last play in chilled go. If it was Black, then White can round the 7.2 score down to 7, and if it was White, then Black can round the 7.2 score up to 8, which will usually round up to 9 by area scoring. That is true for any fractional value, with no kos. 7.2 is just as good or bad under territory scoring and area scoring as 7.8.
Now, I am not suggesting that bots switch to playing chilled go. If ko makes headaches for territory scoring, think of the headaches it will make for chilled go.
But if we are talking about near perfect play and fractional errors, chilled go offers a way of thinking about them.
moha wrote:The 1 point example was just an example, in reality "rounding" your mistakes to integers cannot affect (or affect differently) your class. If you only make average 0.1 pt worth of mistakes per game, you will be rated much closer than a class from perfect play once tested widely and seriously - exactly because 0.1 pt mistakes has, on average, much less effect on the game than one point mistakes.
Note: Below I am going to use octal notation. I know that the discussants are familiar with it, but not all of the readers may be. In octal notation, 0.7 = ⅞, 0.6 = ¾, 0.4 = ½, etc. Outside of ko positions, fractional go values will be in terms of powers of ½. A mistake of 1/10 is very unlikely to occur.lightvector wrote:{That} might simply not be true.
There are at least two possible meanings to an error of some number of points in go. One sense is that the player makes a play that is dominated by a play that gains more points, on average. Another is that the player makes a play that, after correct play, gets a worse score. I don't know if the players are talking about either one of these, or if they have something else in mind.
Here is an example of domination. Suppose that there are two plays left on the board, one that gains 0.6 pts. and one that gains 0.4 pts. by territory scoring. Then the play that gains 0.6 dominates the play that gains 0.4. IOW, the play that gains 0.6 pts. will never score worse than the play that gains 0.4 pts., but the play that gains 0.4 pts. may score worse than the play that gains 0,6 pts., and we consider the play that gains only 0.4 pts. to be a mistake. Using | notation, let the plays be these: {1|0} and {2|1||0}. The local scores are pts. for Black. If you don't know this notation, I think it will be clear after I go through this example.
Suppose that Black plays first. Then if she makes the 0.4 pt. play she will move to the left in {1|0} to get a local score of 1 pt. and {2|1||0} will remain on the board. White will next take the remaining, 0.6 pt. play, moving in {2|1||0} to a local score of 0, for a combined local score of 1 + 0 = 1. Now suppose instead that Black makes the 0.6 pt. play first. She will move in {2|1||0} to {2|1}. That will leave two 0.4 pt. plays, {1|0} and {2|1}. They are miai, so the combined local score will be 2 + 0 when White moves to 0 in one and Black moves to 2 in the other, or 1 + 1 when White moves to 1 in one and Black moves to 1 in the other. The local scores add to 2 in either case. Black does 1 pt. better by making the 0.6 pt. play.
Now suppose that White plays first. If he makes the 0.4 pt. play there will be a local score of 0 and {2|1||0} will remain on the board. Black will next take the remaining, 0.6 pt. play, leaving the {2|1} play on the board, which White will take, for a combined local score of 0 + 1 = 1. Now suppose instead that White makes the 0.6 pt. play first, leaving a local score of 0 and the {1|0} play. Black will take that play, for a combined local score of 0 + 1 = 1. In this case it does not matter which play White makes first, the score is the same. Still, we may consider the 0.4 pt. play to be a mistake in the first sense.
OC, bots, as currently written, do not calculate the value of plays this way, so they don't make mistakes of this sort, except as viewed by humans.
As I have been writing this, I have come up with a third possible meaning of a fractional mistake. It involves the concepts of an environment and its temperature. What the environment is for any board depends upon the player. The environment consists of background plays, and there at least one play in the foreground. The player knows or assumes the following. There is a play in the environment that gains a certain number of points, T, on average, along with other plays that gain T or less, such that we may estimate that playing in the environment will add T/2 pts. to the current value of the board. T is the temperature of the environment, or ambient temperature. T/2 plays a similar role in the concept of the environment as komi does at the start of the game.
Now suppose that we have one gote in the foreground that gains T + 0.1, octal, on average, for each player. We may then estimate the result of making that play and then letting our opponent play in the environment as T + 0.1 - T/2 = T/2 + 0.1. But if instead we play in the environment and let our opponent take that gote, we make estimate the result as T - T - 0.1 + T/2 = T/2 - 0.1. That is 0.2 pts. worse, on average, than if we take the gote. So we may think that not taking the gote is a 0.2 pt. mistake. OC, we are talking about estimates and average scores, so it may not be a mistake at all. Also, note that the error associated with the estimation of the final score is associated with the environment, not with the small difference. That error is on the order of T/2. Yes, if we have a lot of small errors, our estimation of their sum should be approximately normal and dependent on the size and number of the errors, but our uncertainty about the environment trumps that of the size of any particular small error. Besides, bots do not think this way, either. But humans might.
Another sense of the size of an error is the eventual difference in the score, given the following assumptions. Suppose that if you make play A. then Black can play to guarantee that the final score will be at least S for Black and White can play to guarantee that the final score will be at most S for Black, we may consider S to be the theoretical value of the game after you make play A. Following Berlekamp, I refer to such play as canonical. Similarly, if you make play B, the theoretical value is T. If S > T and you are Black, then play B is a mistake that costs at least S - T, and if you are White, then play A is mistake that costs at least S - T. OC, this is not what our discussants have in mind, because by both territory and area scoring S - T is an integer, not a fraction. But there is a form of go that has fractional scores, Chilled Go. (Environment, temperature, and chilled go all have explanations on SL, if not here.
In chilled go each move costs 1 pt. by comparison with territory scoring. In territory scoring each move costs 1 pt. by comparison with area scoring. So we may think of territory scoring as chilled go for area scoring.
Let me show how fractional scores can arise in chilled go. Here are the two positions we looked at before, {1|0} and {2|1||0}, using territory scores. In chilled go we subtract a point for each move, so {1|0} becomes {0|1} and {2|1||0} becomes {0|1||1}. {0|1} = 0.4, octal, and {0|1||1} = 0.6. Together they add up to 1.2. That's the same as their average value by territory scoring and by area scoring, as well. It's just that neither player wishes to lose a fraction of a point in chilled go by playing in either position. (This is also discussed on SL and here.)
Now let's suppose that 7 pts. for Black is the game theoretical value of chilled go and that we have a perfect player for it. Suppose that this player is Black, and White is almost as good. Then if White makes a 0.2 pt. mistake and otherwise each player plays canonically, the final chilled go score on the board will be 7.2. We do not have to rely upon estimates, we can take the final score to evaluate fractional errors. In this case, what would the final territory score be? That depends on who got the last play in chilled go. If it was Black, then White can round the 7.2 score down to 7, and if it was White, then Black can round the 7.2 score up to 8, which will usually round up to 9 by area scoring. That is true for any fractional value, with no kos. 7.2 is just as good or bad under territory scoring and area scoring as 7.8.
Now, I am not suggesting that bots switch to playing chilled go. If ko makes headaches for territory scoring, think of the headaches it will make for chilled go.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins
Visualize whirled peas.
Everything with love. Stay safe.
At some point, doesn't thinking have to go on?
— Winona Adkins
Visualize whirled peas.
Everything with love. Stay safe.
-
lightvector
- Lives in sente
- Posts: 759
- Joined: Sat Jun 19, 2010 10:11 pm
- Rank: maybe 2d
- GD Posts: 0
- Has thanked: 114 times
- Been thanked: 916 times
Re: "Indefinite improvement" for AlphaZero-like engines
Thanks, Bill.
I was a little fuzzy on exactly how I was defining "fractional points" because for the Elo discussion it doesn't matter as much. If there exists any naturally learnable strategic concept or pattern or feature that lets players discriminate between "better" and "worse" positions within the same game-theoretic category - e.g. "winning 0.5 points easily, even allowing for tiny technical inaccuracies and variety in move choice" vs "winning by 0.5 points, but any inaccuracy will switch the result", and also such these "inaccuracies" by you or your opponent are commonplace and often switch you between these shades of "better or worse", and net out against each other...
Then you should have at least the potential to have more Elo "classes" than predicted by the simple model of players losing points according to a certain error distribution, with fractional points rounding sort of "randomly" based on the particular position and opponent. Because rather than mistakes "randomly" causing you to lose or not discretely, you now have room for more tiers of players with different "inaccuracy rates" relative to this natural strategic feature of the game, where a player with a lower rate may reliably beat a player with a modestly higher one.
I was a little fuzzy on exactly how I was defining "fractional points" because for the Elo discussion it doesn't matter as much. If there exists any naturally learnable strategic concept or pattern or feature that lets players discriminate between "better" and "worse" positions within the same game-theoretic category - e.g. "winning 0.5 points easily, even allowing for tiny technical inaccuracies and variety in move choice" vs "winning by 0.5 points, but any inaccuracy will switch the result", and also such these "inaccuracies" by you or your opponent are commonplace and often switch you between these shades of "better or worse", and net out against each other...
Then you should have at least the potential to have more Elo "classes" than predicted by the simple model of players losing points according to a certain error distribution, with fractional points rounding sort of "randomly" based on the particular position and opponent. Because rather than mistakes "randomly" causing you to lose or not discretely, you now have room for more tiers of players with different "inaccuracy rates" relative to this natural strategic feature of the game, where a player with a lower rate may reliably beat a player with a modestly higher one.
-
Bill Spight
- Honinbo
- Posts: 10905
- Joined: Wed Apr 21, 2010 1:24 pm
- Has thanked: 3651 times
- Been thanked: 3373 times
Re: "Indefinite improvement" for AlphaZero-like engines
Thanks, lightvector. 
IIUC, if it is a 50-50 proposition in all of these cases, then, outside of chilled go the fractional scores come down to chance. So as we approach perfection, these small differences between the almost perfect players wash out. But if we are far enough from perfection, then small errors can add up to 1 pt. of territory. For instance, two 0.4 pt. errors (octal) by the same player in the opening add up to 1 pt., so there is no rounding to do as far as they are concerned. That should be enough to give the opponent a chance to win.
It is natural to assume that the territory scoring error of a multitude of fractional mistakes has a normal distribution. Actually, for go it has a tighter distribution than normal, because when some number of the fraction add to exactly 1, which is not unusual because, kos aside, the fractions are in powers of 2, they have an error of 0. For instance, suppose that the distribution of small errors extends to ±3. With a normal distribution, we would expect that when we estimate a win by one player to be 1.3, sometimes that player will lose. However, once the small errors add up to 1 or more, that point acts as a barrier to "rounding down". Since there are still estimation errors, it is possible that the player will lose, but much less likely than with a normal distribution. This may be behind my sense that Monte Carlo bots tend to underestimate the winning chances of the player who is ahead. {shrug}
So there may be a natural classification of go skill based upon multiples of 1 pt. score differences. Do we then have 360 classes?
Taking random in the sense of unknown, even if determined, it is plausible, at first glance, that a chilled go score of 7.2 should be more likely to round down to a territory score of 7 than a score of 7.8. As against that, since in either case it boils down to whose turn it is, which it is plausible is a 50-50 proposition, it is plausible that the closeness to one integer or the other does not matter. Furthermore, when it boils down to a chilled go position worth 7 pts. plus an infinitesimal, it still comes down to whose turn it is; my guess in all of these cases whether it rounds up or down is a 50-50 proposition. It's an empirical question, OC.lightvector wrote:Thanks, Bill.
I was a little fuzzy on exactly how I was defining "fractional points" because for the Elo discussion it doesn't matter as much. If there exists any naturally learnable strategic concept or pattern or feature that lets players discriminate between "better" and "worse" positions within the same game-theoretic category - e.g. "winning 0.5 points easily, even allowing for tiny technical inaccuracies and variety in move choice" vs "winning by 0.5 points, but any inaccuracy will switch the result", and also such these "inaccuracies" by you or your opponent are commonplace and often switch you between these shades of "better or worse", and net out against each other...
Then you should have at least the potential to have more Elo "classes" than predicted by the simple model of players losing points according to a certain error distribution, with fractional points rounding sort of "randomly" based on the particular position and opponent. Because rather than mistakes "randomly" causing you to lose or not discretely, you now have room for more tiers of players with different "inaccuracy rates" relative to this natural strategic feature of the game, where a player with a lower rate may reliably beat a player with a modestly higher one.
IIUC, if it is a 50-50 proposition in all of these cases, then, outside of chilled go the fractional scores come down to chance. So as we approach perfection, these small differences between the almost perfect players wash out. But if we are far enough from perfection, then small errors can add up to 1 pt. of territory. For instance, two 0.4 pt. errors (octal) by the same player in the opening add up to 1 pt., so there is no rounding to do as far as they are concerned. That should be enough to give the opponent a chance to win.
It is natural to assume that the territory scoring error of a multitude of fractional mistakes has a normal distribution. Actually, for go it has a tighter distribution than normal, because when some number of the fraction add to exactly 1, which is not unusual because, kos aside, the fractions are in powers of 2, they have an error of 0. For instance, suppose that the distribution of small errors extends to ±3. With a normal distribution, we would expect that when we estimate a win by one player to be 1.3, sometimes that player will lose. However, once the small errors add up to 1 or more, that point acts as a barrier to "rounding down". Since there are still estimation errors, it is possible that the player will lose, but much less likely than with a normal distribution. This may be behind my sense that Monte Carlo bots tend to underestimate the winning chances of the player who is ahead. {shrug}
So there may be a natural classification of go skill based upon multiples of 1 pt. score differences. Do we then have 360 classes?
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins
Visualize whirled peas.
Everything with love. Stay safe.
At some point, doesn't thinking have to go on?
— Winona Adkins
Visualize whirled peas.
Everything with love. Stay safe.
-
moha
- Lives in gote
- Posts: 311
- Joined: Wed May 31, 2017 6:49 am
- Rank: 2d
- GD Posts: 0
- Been thanked: 45 times
Re: "Indefinite improvement" for AlphaZero-like engines
I'm not sure if I'm still missing something about lightvector's idea but rereading it doesn't seem I was too far off. The important part seems to be:
Maybe he excludes performance in those situations, such as what would only matter with different komi/handicap, not in even games (where near perfect players win anyway against 1-2 pt weaker opponents). But I'm not sure this is the case. The small winrate differences that a class-1 player MUST show above a class-2 one against weak opponents means the very rare cases where he fails to win should be significantly rarer even. OC even in a 0.5-0.5 balance situation making 0.1 or 0.2 avg mistakes matter a little, since later another swing may bring the fractional close to the rounding boundary again. But I feel that the final rounding step will still eliminate much of the required differences (compared to what would happen WITHOUT the final rounding, which would be the REAL class performance advantage).
But maybe there is a factor of roughly 2 here, coming from the fact that cases closer to 0.5-0.5 balance are similarly frequent than cases close to 0.1-0.9 (or conversely, that rounding to nearest only introduces half point inaccuracy)? Now, after Bill's post (which I will still need to read again) it seems rounding to nearest may not be a good model to start with? Getting the last move is connected to dame parity and territory score parity, thus for perfect play and it's vicinity komi parity, isn't it?
As I wrote I'm not ruling out anything atm, just openly curious. I'm not sure if this strengthen or weaken the hypothesis, but instead of CGT-like subpoints we can think something like path cardinality (which reminds me). Indeed a weakness of the whole point error model is that it assumes the same distribution in all cases, whereas in some cases reaching the minimax optimal score can be harder (compared to losing 1 pt) depending on how many paths lead there (as lightvector also posted meanwhile). But I'm not sure if this is bad enough since, as a distribution, it includes all cases averaged (the best and the worst as well). Also as I wrote in the other thread I think the weakness of the assumption of independence of per-move errors may be much more significant in practice (and likely affects practical class counts more). The worse the model approximates reality the worse predictions drawn from it will be.
Which seems to match his opinion:moha wrote:I think even against perfect play, you cannot assume that the game is balanced on the razor's edge in a way that one 0.1 pt mistake would always turn the draw into a 1 pt loss. If this is the case from B's view it follows that there are room for 0.9 pts total mistakes when playing with W (0.1-0.9 balance), which is 50% of time. So if you make 0.1 pt mistake average, not only would you not lose 90% but not even 50% (drawing the rest), less than a whole class.
The reason I felt this argument is unconvincing was that even if you assume the 0.1-0.9 balance from the empty board (which this example seems to do, ie. it's easy to lose with one color but accordingly hard to lose with the other color), this balance changes as long as you play against weaker players who make different and larger errors. So if we consider performance in all situations and against all opponents, cases that we will be tested from an 0.5-0.5 balance situation (equally hard to lose a point with either color) or 0.4-0.6 etc are overall more frequent than the 0.1-0.9 extreme. And given his [-0.1,0.02] and similar distributions in those cases [-0.1] will not be able to significantly outperform [-0.2]. So their overall performance differences and Elo differences will still be bound to the width of the smallest scoring unit (the amount that small errors need to accumulate to to become meaningful).lightvector wrote:We can do the same with draws. Suppose we use an integer komi, so that now the model is that at the end of the game, whatever hidden fractional advantage is rounded to the nearest of -2, -1, 0, 1, 2, etc for the final observable outcome in points. Suppose it luckily happens to be the case that naturally the game "starts at" 0.51 fractional advantage - extremely close to 0.5, from white's perspective.
Let's use A,B,C,D as above exactly the same. Now what happens is each player will win win as white 90+% of the time, and draw as black 90+% of the time against the previous, for a net outcome of close to 75% wins if you count draws as half.
Maybe he excludes performance in those situations, such as what would only matter with different komi/handicap, not in even games (where near perfect players win anyway against 1-2 pt weaker opponents). But I'm not sure this is the case. The small winrate differences that a class-1 player MUST show above a class-2 one against weak opponents means the very rare cases where he fails to win should be significantly rarer even. OC even in a 0.5-0.5 balance situation making 0.1 or 0.2 avg mistakes matter a little, since later another swing may bring the fractional close to the rounding boundary again. But I feel that the final rounding step will still eliminate much of the required differences (compared to what would happen WITHOUT the final rounding, which would be the REAL class performance advantage).
But maybe there is a factor of roughly 2 here, coming from the fact that cases closer to 0.5-0.5 balance are similarly frequent than cases close to 0.1-0.9 (or conversely, that rounding to nearest only introduces half point inaccuracy)? Now, after Bill's post (which I will still need to read again) it seems rounding to nearest may not be a good model to start with? Getting the last move is connected to dame parity and territory score parity, thus for perfect play and it's vicinity komi parity, isn't it?
As I wrote I'm not ruling out anything atm, just openly curious. I'm not sure if this strengthen or weaken the hypothesis, but instead of CGT-like subpoints we can think something like path cardinality (which reminds me). Indeed a weakness of the whole point error model is that it assumes the same distribution in all cases, whereas in some cases reaching the minimax optimal score can be harder (compared to losing 1 pt) depending on how many paths lead there (as lightvector also posted meanwhile). But I'm not sure if this is bad enough since, as a distribution, it includes all cases averaged (the best and the worst as well). Also as I wrote in the other thread I think the weakness of the assumption of independence of per-move errors may be much more significant in practice (and likely affects practical class counts more). The worse the model approximates reality the worse predictions drawn from it will be.
But can we say that he will also reliable outperform it against weaker players (the vast majority of test cases) by at least the amount that a class difference would demand from him?lightvector wrote:have room for more tiers of players with different "inaccuracy rates" relative to this natural strategic feature of the game, where a player with a lower rate may reliably beat a player with a modestly higher one.
-
Bill Spight
- Honinbo
- Posts: 10905
- Joined: Wed Apr 21, 2010 1:24 pm
- Has thanked: 3651 times
- Been thanked: 3373 times
Re: "Indefinite improvement" for AlphaZero-like engines
Parity does not affect the chilled go result. which is why I think it will support a potentially countable infinity of skill classes. But for territory and area scoring rounding the chilled go score matters.moha wrote:Now, after Bill's post (which I will still need to read again) it seems rounding to nearest may not be a good model to start with? Getting the last move is connected to dame parity and territory score parity, thus for perfect play and it's vicinity komi parity, isn't it?
I have posted a speculative post in reply to lightvector, which may, OC, be way off base.
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins
Visualize whirled peas.
Everything with love. Stay safe.
At some point, doesn't thinking have to go on?
— Winona Adkins
Visualize whirled peas.
Everything with love. Stay safe.
-
lightvector
- Lives in sente
- Posts: 759
- Joined: Sat Jun 19, 2010 10:11 pm
- Rank: maybe 2d
- GD Posts: 0
- Has thanked: 114 times
- Been thanked: 916 times
Re: "Indefinite improvement" for AlphaZero-like engines
No, 0.1 for the empty board in my example (the one without draws, that would round at the end to either -0.5 or 0.5) corresponds to it being easy to lose or win with *either* color, depending sensitively on the error rates of the players. 0.1 is NOT saying something like "white should win 90% of the time and black should win 10% of the time" - that would be giving one player a huge advantage and be the exact opposite of being "razor's edge". I don't know if you want to spend any more time on this, but if you do, can you try re-reading without this mis-assumption?moha wrote:The reason I felt this argument is unconvincing was that even if you assume the 0.1-0.9 balance from the empty board (which this example seems to do, ie. it's easy to lose with one color but accordingly hard to lose with the other color), this balance changes as long as you play against weaker players who make different and larger errors.
Yep, exactly. If it is sort of "random or unpredictable" and always 50-50, then might not be much room any Elo classes fitting smaller than 1 point so long as that's the case. But alternatively suppose that players near this level would discover some way to judge far better than chance which way the final point goes and to play moves that affect it. Then there could be room for *many* Elo classes fitting smaller than 1 point. Indeed, true optimal players would how to judge it better than 50-50 - they know the exact result - so the big open question is what do natural player populations "in-between" look like (e.g. the players you'd get from AlphaZero iterated to almost to infinity with a wide variety of different random seeds and hyperparameter settings), as you transition from players for whom it's 50-50 to optimal players for whom it's certain? They could behave like moha's model suggests, or like my model suggests, or there could just be too many pathologies/non-transitivities.Bill Spight wrote: Taking random in the sense of unknown, even if determined, it is plausible, at first glance, that a chilled go score of 7.2 should be more likely to round down to a territory score of 7 than a score of 7.8. As against that, since in either case it boils down to whose turn it is, which it is plausible is a 50-50 proposition, it is plausible that the closeness to one integer or the other does not matter. Furthermore, when it boils down to a chilled go position worth 7 pts. plus an infinitesimal, it still comes down to whose turn it is; my guess in all of these cases whether it rounds up or down is a 50-50 proposition. It's an empirical question, OC.
IIUC, if it is a 50-50 proposition in all of these cases, then, outside of chilled go the fractional scores come down to chance.
Or something else entirely may happen. I have little confidence in any of the above models saying anything sane about how you transition from the player for whom the last point is seemingly random to the optimal player, or how one can be confident enough in any particular model or from trends only observed from current players what will happen at such an extreme, so any behavior seems hard to rule out - there could be only one "class" here, or many.
-
lightvector
- Lives in sente
- Posts: 759
- Joined: Sat Jun 19, 2010 10:11 pm
- Rank: maybe 2d
- GD Posts: 0
- Has thanked: 114 times
- Been thanked: 916 times
Re: "Indefinite improvement" for AlphaZero-like engines
Maybe one more attempt, in case my earlier attempts are too hard to make sense of.
Let's make up a new 2-player game. The game lasts 200 turns, 100 by each side. On each turn, a computer randomly generates 100 cards and shows them to the current player face down. That player must select a card and turn it over. Each card has a number on its face that is either zero, or some possibly-fractional positive number, (e.g. 7/4, 1/16, 3/256, etc.). At least one of the cards is always generated to be a zero.
The game has a single global score, which starts at exactly 0, and which player 1 wants to be positive, and player 2 wants to be negative. Every time player 1 turns over a card, the number is *subtracted* from the global score. Every time player 2 turns over a card, the number is *added* to the global score - so both players prefer to turn over zeros as much as they can.
The backs of the cards are all decorated with patterns in ways such that any pattern uniquely determines the number on the face, but there are exponentially many possible patterns and the relationship is extremely complex, although to get started that there are at least some basic recognizable elements of the relationship that can let one imperfectly predict the number. Skill in this game consists of learning over time the relationships and becoming increasingly able to identify cards with zeros or low numbers.
At the end of 200 turns, the score determines the winner, by determining the selection of a gridpoint from the grid ..., -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, .... If the score falls exactly on a gridpoint, that gridpoint is chosen. Otherwise, if it falls between two gridpoints:
Variant A: Regardless of how close the score is to one gridpoint or the other, a fair coin is flipped and determines which of the two gridpoints is chosen.
Variant B: One of the two gridpoints is chosen with probability inversely proportional to the distance. For example, a score of 1.3 would cause 1.5 to be chosen 80% of the time, and 0.5 to be chosen 20% of the time.
Variant C: The closer of the two gridpoints is always chosen. If the score is exactly equidistant between two gridpoints, then a fair coin is flipped.
And Player 1 is declared the winner if a positive gridpoint is chosen, player 2 if a negative gridpoint is chosen.
Since this game is non-interactive (actually, it is very slightly interactive if players can see the score so far), any player can be mostly characterized by the final distribution of the sum of numbers they choose averaged over all the possible deals the computer could give them. Suppose that although some players are higher "variance" than others, among players with similar mean sum, the variance of their sum tends to only differ a little in practice.
Exercise: qualitatively determine how variants A,B,C may differ in many Elo there can be "per gridpoint" as players approach levels where they are consistently able to select a set of cards summing up to no more than a small number each game.
Let's make up a new 2-player game. The game lasts 200 turns, 100 by each side. On each turn, a computer randomly generates 100 cards and shows them to the current player face down. That player must select a card and turn it over. Each card has a number on its face that is either zero, or some possibly-fractional positive number, (e.g. 7/4, 1/16, 3/256, etc.). At least one of the cards is always generated to be a zero.
The game has a single global score, which starts at exactly 0, and which player 1 wants to be positive, and player 2 wants to be negative. Every time player 1 turns over a card, the number is *subtracted* from the global score. Every time player 2 turns over a card, the number is *added* to the global score - so both players prefer to turn over zeros as much as they can.
The backs of the cards are all decorated with patterns in ways such that any pattern uniquely determines the number on the face, but there are exponentially many possible patterns and the relationship is extremely complex, although to get started that there are at least some basic recognizable elements of the relationship that can let one imperfectly predict the number. Skill in this game consists of learning over time the relationships and becoming increasingly able to identify cards with zeros or low numbers.
At the end of 200 turns, the score determines the winner, by determining the selection of a gridpoint from the grid ..., -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, .... If the score falls exactly on a gridpoint, that gridpoint is chosen. Otherwise, if it falls between two gridpoints:
Variant A: Regardless of how close the score is to one gridpoint or the other, a fair coin is flipped and determines which of the two gridpoints is chosen.
Variant B: One of the two gridpoints is chosen with probability inversely proportional to the distance. For example, a score of 1.3 would cause 1.5 to be chosen 80% of the time, and 0.5 to be chosen 20% of the time.
Variant C: The closer of the two gridpoints is always chosen. If the score is exactly equidistant between two gridpoints, then a fair coin is flipped.
And Player 1 is declared the winner if a positive gridpoint is chosen, player 2 if a negative gridpoint is chosen.
Since this game is non-interactive (actually, it is very slightly interactive if players can see the score so far), any player can be mostly characterized by the final distribution of the sum of numbers they choose averaged over all the possible deals the computer could give them. Suppose that although some players are higher "variance" than others, among players with similar mean sum, the variance of their sum tends to only differ a little in practice.
Exercise: qualitatively determine how variants A,B,C may differ in many Elo there can be "per gridpoint" as players approach levels where they are consistently able to select a set of cards summing up to no more than a small number each game.
-
hyperpape
- Tengen
- Posts: 4382
- Joined: Thu May 06, 2010 3:24 pm
- Rank: AGA 3k
- GD Posts: 65
- OGS: Hyperpape 4k
- Location: Caldas da Rainha, Portugal
- Has thanked: 499 times
- Been thanked: 727 times
Re: "Indefinite improvement" for AlphaZero-like engines
I think that's ok. These players are nearly perfect. They'll win against almost all other players. But since they reliably beat each other, each of them should end up >= 400 points better than the previous one. Otherwise, playing against weaker opponents who you always beat would lower your ELO rating.moha wrote:These would be almost identical in strength as strength or Elo is not measured against a single opponent but as average performance against all possible opponents.hyperpape wrote:Then you've created a series of strategies that are each (at least) 400 points stronger than the previous strategy,