“Decision: case of using computer assistance in League A”

Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: “Decision: case of using computer assistance in League A

Post by Bill Spight »

As a Bayesian, I suppose that I should be pleased that so many people believe in confirmatory evidence. Bayesians do; frequentists do not.

This evening I took a look at a review of some chess games from a chess scandal: https://www.youtube.com/watch?v=cx0nurp-mpM
There were, to me, some awesome tactics in those games. Having read about Regan's work, I was a bit dismayed that the reviewer was obsessed with the similarity of the accused player's play to a particular version of Houdini, which he was running as he reviewed the games. Sometimes he had to wait a while until Houdini's search elevated the play in question to its top one or two choices. But it just goes to show that most people's belief in confirmatory evidence is too strong. They don't know how weak it is. {sigh}
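To make the weakness of this kind of confirmatory evidence concrete, here is a toy Bayes calculation. All the numbers are invented assumptions for illustration, not figures from any case:

```python
from fractions import Fraction

# Toy numbers (assumptions, not case data): a prior probability of
# cheating, and how likely a high engine-match rate is under each
# hypothesis. Honest players also match engines fairly often.
prior_cheat = Fraction(1, 100)          # 1% prior
p_match_given_cheat = Fraction(9, 10)   # cheaters usually match the bot
p_match_given_honest = Fraction(2, 10)  # honest players sometimes do too

# Posterior probability of cheating given a high match rate (Bayes' theorem)
num = p_match_given_cheat * prior_cheat
den = num + p_match_given_honest * (1 - prior_cheat)
posterior = num / den
print(float(posterior))  # about 0.043
```

Even with a likelihood ratio of 4.5 to 1 in favour of cheating, a 1% prior only rises to about 4%; under these assumptions, matching the bot is far from proof.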
bernds wrote:
Uberdude wrote:Doesn't look strong evidence of Leela cheating to me.
I'd still say the 98% case is a very odd outlier, although the metric "matches top three moves" is not a very good one. In chess they have things like "average centipawn loss" which would correspond to an average drop in win rate, and number of blunders (moves which are worse than a computer move by a certain amount). The problem with both of these is that they become meaningless in a winning position, which is achievable in Go if you have the computer play your opening, or if you've studied enough with the computer that you know how to wreck a human in the fuseki.

I've now run the game through both Leela and her sister Zero, and they disagree fairly significantly about what's going on (which kind of invalidates the theory that there were only ordinary moves that everyone would play).
I don't know if that's the theory, but the disagreement between the Leela sisters shows that, at present, we do not have enough agreement between AI players about the value of specific plays to rate plays the way they do in chess. Once we can rate plays, we can say: oh, this player chose a play above his normal skill level. Or: in this game he made many fewer blunders than usual. We are not matching his play to the AI's choices; we are comparing it to his usual play. That is disconfirmatory evidence, which is what we want. :)
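The chess metrics bernds describes translate naturally into win-rate terms. A sketch, with a made-up data format (pairs of win% for the engine's best move and the move actually played); the threshold is an illustrative assumption:

```python
# Sketch of the chess-style metrics mentioned above, adapted to Go:
# average win-rate loss per move, plus a blunder count. The input
# format is an assumption for illustration.
def win_rate_metrics(moves, blunder_threshold=10.0):
    """moves: list of (best_win_pct, played_win_pct) per move."""
    losses = [best - played for best, played in moves]
    avg_loss = sum(losses) / len(losses)
    blunders = sum(1 for loss in losses if loss >= blunder_threshold)
    return avg_loss, blunders

# Example: three small losses and one 15-point blunder
game = [(55.0, 54.0), (60.0, 58.0), (62.0, 47.0), (50.0, 50.0)]
print(win_rate_metrics(game))  # (4.5, 1)
```

As bernds notes, both metrics degenerate in a won position, where nearly any move keeps the win rate flat.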
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.
Pio2001
Lives in gote
Posts: 418
Joined: Mon Feb 16, 2015 12:13 pm
Rank: kgs 5 kyu
GD Posts: 0
KGS: Pio2001
Has thanked: 9 times
Been thanked: 83 times

Re: “Decision: case of using computer assistance in League A

Post by Pio2001 »

I believe that some players review their games with Leela. I sometimes do, just to look for blunders.

If this kind of review is done by one or two players for every one of their games, then eventually they will find some examples of sequences (moves 50 to 150, for example) that are not far from the software's choices (among the top three, for example).

This would obviously be "cherry-picking".
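The cherry-picking effect can be illustrated with a small Monte Carlo sketch. All parameters here are invented for illustration (per-move independence in particular is a simplification):

```python
import random

random.seed(0)

# Toy assumptions: an honest player matches the bot's top 3 with 80%
# probability per move, independently. Reviewers scan each game for
# any 50-move stretch that matches at 92% or better.
def max_window_match(n_moves=100, p=0.8, window=50):
    hits = [random.random() < p for _ in range(n_moves)]
    return max(sum(hits[s:s + window])
               for s in range(n_moves - window + 1)) / window

games = 200  # say, a club's worth of reviewed games
flagged = sum(max_window_match() >= 0.92 for _ in range(games))
print(flagged)  # a number of honest games get flagged by chance alone
```

Scanning many games for the most bot-like stretch guarantees that some innocent stretches will look suspicious.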
Uberdude
Judan
Posts: 6727
Joined: Thu Nov 24, 2011 11:35 am
Rank: UK 4 dan
GD Posts: 0
KGS: Uberdude 4d
OGS: Uberdude 7d
Location: Cambridge, UK
Has thanked: 436 times
Been thanked: 3718 times

Re: “Decision: case of using computer assistance in League A

Post by Uberdude »

Another Leela similarity analysis, this time with a player who has studied with Leela: Daniel from the UK team. I chose his game against the Dutch 6/5d Geert Groenen from last year, during his Leela period (see his journal entry on the game at forum/viewtopic.php?p=215737#p215737). I thought he was trying to play a solid Leela style and so would match a lot, but he didn't. That was probably because the endgame started really early (about move 100) and there were a lot of non-matches: sometimes because even choice #10 has a similar win% to #1, but often because Leela identified mistakes I agree with, e.g. both players choosing slack lines. Something I found notable in this analysis was how often the humans, particularly Geert, played moves Leela considered mistakes but which had the highest policy network probability; i.e. Leela is really good at predicting the moves humans will play, often locally good shape (the old Dutch high dans are known for this style), while missing that something else, often a tenuki, is better (at least in Leela's view; she is not a super-strong oracle to believe unquestioningly, but I tended to agree with her).

On the Leela "top 3 and within 5%" metric for moves 50-149, Daniel (white) scored 33/50 = 66% and Geert scored 37/50 = 74%. Using the stricter top 1 matching, Daniel scored 23/50 = 46% and Geert 20/50 = 40%. Daniel won by 4.5. Added to the table, with the winner in brackets:

Code:

+-----------------+------+----------------+------+---------+---------+---------+---------+
|      Black      | Rank |     White      | Rank | B top 3 | W top 3 | B top 1 | W top 1 |
+-----------------+------+----------------+------+---------+---------+---------+---------+
| [Carlo Metta]   |  4d  | Reem Ben David |  4d  |    * 98 |      80 |    * 72 |      54 |
| Andrey Kulkov   |  6d  | [Carlo Metta]  |  4d  |      80 |    * 86 |      68 |    * 62 |
| Dragos Bajenaru |  6d  | [Carlo Metta]  |  4d  |      74 |    * 78 |      50 |    * 60 |
| [Andrew Simons] |  4d  | Jostein Flood  |  3d  |      80 |      88 |      54 |      62 |
| Geert Groenen   |  5d  | [Daniel Hu]    |  4d  |      74 |      66 |      40 |      46 |
| [Ilya Shikshin] |  1p  | Artem Kachan.  |  1p  |      56 |      76 |      38 |      60 |
| [Andrew Simons] |  4d  | Victor Chow    |  7d  |      84 |      76 |      44 |      44 |
| Cornel Burzo    |  6d  | [A. Dinerstein]|  3p  |      74 |      66 |      40 |      48 |
| Jonas Welticke  |  6d  | [Daniel Hu]    |  4d  |      54 |      64 |      34 |      42 |
| [Park Junghwan] |  9p  | Lee Sedol      |  9p  |      74 |      64 |      64 |      38 |
| Lothar Spiegel  |  5d  | [Daniel Hu]    |  4d  |      66 |      58 |      48 |      42 |
| Gilles v.Eeden  |  6d  | [Viktor Lin]   |  6d  |      82 |      70 |      56 |      46 |
+-----------------+------+----------------+------+---------+---------+---------+---------+
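For reference, the top 3 / top 1 percentages in the table could be computed along these lines. This is a sketch: the per-move record format is invented, and the "within 5%" refinement of the top-3 metric is omitted:

```python
# Assumed data format: for each analysed move, the move actually
# played and the bot's candidate moves in best-first order.
def match_rates(records):
    """records: list of (played_move, [bot choices best-first]).
    Returns (top-3 match %, top-1 match %)."""
    n = len(records)
    top1 = sum(played == cands[0] for played, cands in records)
    top3 = sum(played in cands[:3] for played, cands in records)
    return 100 * top3 / n, 100 * top1 / n

# Tiny illustration: 4 moves, 3 in the top three, 2 matching #1
recs = [("q16", ["q16", "r16", "d4"]),
        ("c3",  ["d3", "c3", "c4"]),
        ("k10", ["k10", "k11", "j10"]),
        ("a1",  ["t19", "s19", "t18"])]
print(match_rates(recs))  # (75.0, 50.0)
```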
P.S. Whilst it's interesting to analyse with Leela Zero (and comparing differences between bots is valuable), at the point when the game in question against Reem was played (November 2017) she was still a kyu player (my my, they grow up so fast!), so not much good for cheating.

Edit: I also analysed the El Clasico of European go, Ilya Shikshin 1p vs Artem Kachanovskyi 1p. These players are quite possibly stronger than Leela 0.11 on 50k nodes, so not matching could mean they are playing better rather than worse moves than Leela. As expected, the more territorial and orthodox Artem was more similar than the creative fighter Ilya. This was also, I think, the first game I analysed to feature a ko (which produces a lot of obvious matches for taking the ko, though the choice of ko threats can differ). Top 3 match was 38/50 = 76% for Artem and 28/50 = 56% for Ilya; top 1 was 30/50 = 60% for Artem and 19/50 = 38% for Ilya.

Edit 2: Also did my game vs Victor Chow 7d from a few years ago, as another example of a weaker player scoring an upset against a stronger one with a solid style. I played well in the opening and middlegame and got a good lead (but only won by half a point: after move 150 he turned on his super endgame while I was under time pressure). For over 50 moves of the game Leela really wanted me to invade the left side at c7, which I was aware of, but as I was leading against a 7d I knew was a strong fighter, I didn't invade there, to avoid complications I might well mess up. This was responsible for a lot of my failed matches with Leela's top 1 (often still top 3, but a few times not), plus of course some straight-out mistakes from both of us.

Edit 3: And Cornel Burzo 6d vs Alexander Dinerstein 3p. Cornel has an elegant honte style, whilst Dinerstein is territorial and led the whole time, with a territory lead and ways into Cornel's flaky centre. As with the Kulkov and Groenen games, the player with the highest top 3 match wasn't the one with the highest top 1 match.

Edit 4: And Daniel vs Jonas Welticke. Jonas is known for crazy openings and a weird style, which he showed here by opening on the sides, reaching only a 25% win rate after 50 moves. As expected, his wacky moves didn't match much. Daniel played solidly and matched a lot, except that Leela got confused by a simple semeai and wanted to play stupidly. Also, despite having already won the semeai, in calm positions Leela wanted to keep playing it out rather than take some profitable move elsewhere (but Daniel was winning by so much that maybe he could essentially pass and still win).

Edit 5: First pro game. My expectation was that pros might match Leela less than us mid-to-high amateur dans, as they are much stronger and could be playing unexpectedly better moves. I chose Park Junghwan and Lee Sedol's last game, at some festival. Park is a fairly conventional player, whilst Lee is more creative, so I expected Park to match more. Park did match more, but they were both similar to us amateurs. Maybe Leela is stronger than I realised. Leela did not expect the moves which made me think "Wow, cool pro move" (often tenuki), but she did better than I did (with brief thinking) at predicting the contact fighting.

Edit 6: Another of Daniel's games from last year, vs Lothar Spiegel 5d from Austria, who is a fairly sensible player. Lots of matching during long but joseki-ish middle-game invasions, but also misses from mistakes, and both players overlooking an important sente exchange for a while (f11/g10).

Edit 7: Gilles van Eeden 6d (classic good shape Dutch 6d) vs Viktor Lin 6d. Most mismatches were due to a ko fight, and a few disagreements in early yose. Going into yose Leela gave Gilles 77% win, but this looks like a misunderstanding of his dead group at top left: if I played out a few more moves to make it clearly dead then the win% collapsed to 57%. In the end he lost by 2.5.
sybob
Lives in gote
Posts: 422
Joined: Thu Oct 02, 2014 1:56 pm
GD Posts: 0
KGS: captslow
Online playing schedule: irregular and by appointment
Has thanked: 269 times
Been thanked: 129 times

Re: “Decision: case of using computer assistance in League A

Post by sybob »

Bill Spight wrote: Still, drawing conclusions from one game is absurd.
Huh?
The issue is: did he cheat IN THIS GAME.
That's what he is accused of. It does not matter how/what other games are.
jeromie
Lives in sente
Posts: 902
Joined: Fri Jan 31, 2014 7:12 pm
Rank: AGA 3k
GD Posts: 0
Universal go server handle: jeromie
Location: Fort Collins, CO
Has thanked: 319 times
Been thanked: 287 times

Re: “Decision: case of using computer assistance in League A

Post by jeromie »

sybob wrote:
Bill Spight wrote: Still, drawing conclusions from one game is absurd.
Huh?
The issue is: did he cheat IN THIS GAME.
That's what he is accused of. It does not matter how/what other games are.
That’s only true if you consider the likelihood of cheating in one game to be independent of cheating in other games AND you think there is nothing to learn from a player’s performance in other games. But that’s probably not true.

At the very least, a person’s general level of play adds some important data. If I were to suddenly start beating dan level players on KGS after a long period of stable play as a 3 kyu, you’d have good grounds to be suspicious of my improvement.
Bill Spight

Re: “Decision: case of using computer assistance in League A

Post by Bill Spight »

sybob wrote:
Bill Spight wrote: Still, drawing conclusions from one game is absurd.
Huh?
The issue is: did he cheat IN THIS GAME.
That's what he is accused of. It does not matter how/what other games are.
Why do you think that they are irrelevant? If he played well enough without cheating, as evidenced by other games in which he beat stronger players than his opponent in that game, why would he cheat in that game? Yes, there is evidence that he played like Leela in that game, but that is not the same as cheating.

Edit: And, indeed, they did conclude that Carlo probably cheated in the other games, based upon playing like Leela in that one game. They threw those other results out, as well. If his play in that game is relevant to his play in the other games, isn't his play in those games relevant to his play in that game?

Also, if all we are going by is similarity to Leela's play, we need a lot more evidence than we can get in one game. If we have behavioral or physical evidence of cheating, that's a different matter. But we do not.
Javaness2
Gosei
Posts: 1545
Joined: Tue Jul 19, 2011 10:48 am
GD Posts: 0
Has thanked: 111 times
Been thanked: 322 times
Contact:

Re: “Decision: case of using computer assistance in League A

Post by Javaness2 »

Regarding the EGF matter, the report will not be released until the appeals process is finished.
Regarding the CIT ... ?
Uberdude

Re: “Decision: case of using computer assistance in League A

Post by Uberdude »

sybob wrote:
Bill Spight wrote: Still, drawing conclusions from one game is absurd.
Huh?
The issue is: did he cheat IN THIS GAME.
That's what he is accused of. It does not matter how/what other games are.
I wouldn't go quite as far as calling it "absurd", but rather "requiring far stronger evidence that anything significant happened than when looking at multiple games, so it is unlikely you can reject the null hypothesis and justifiably convict".

If you really want to only look at this one game from Carlo in isolation then you could, but you need to be careful with the stats (a human analysis of the plausibility of the plays like Stanislaw did is much better). Setting aside the much discussed problems of looking at matching rates against a bot, once you've got the 98% top 3 match figure you still need to know what is a typical value for it, which you need to get from looking at other games (to account for playing style it would be best if Carlo's, but others' would do too). "98% is a big number" is not good enough. To be flippant, Carlo played 100% of his moves on the intersections of the board, just like Leela did too.

I only have 10 data points, but fitting them to a normal distribution (dubious: too small a sample, the shape could be different, plus 100 is a hard maximum) I get a mean of 80 and a standard deviation of 8. So then you might say 98 is 2.25 sds above the mean; what's the chance of that? Look up your normal distribution probability tables and you get 1.2%. That's small, an inept statistician would say, less than the oft-used 0.05 significance level, so he must be guilty! But that's the chance that a randomly selected game has that value (based on the false assumption that the metric is normally distributed with those parameters). This game was not randomly selected; it was chosen for examination precisely because it has a high similarity, so such a probability is invalid. As Feynman eloquently said:
You know, the most amazing thing happened to me tonight. I was coming here, on the way to the lecture, and I came in through the parking lot. And you won’t believe what happened. I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance that I would see that particular one tonight? Amazing!
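For reference, Uberdude's tail calculation reproduces in a few lines, with the same dubious normality assumption he flags:

```python
from statistics import NormalDist

# Fit quoted above: mean 80, standard deviation 8 (normality is a
# shaky assumption here: only 10 games, and 100 is a hard maximum)
mean, sd = 80, 8
z = (98 - mean) / sd                      # 2.25 sds above the mean
tail = 1 - NormalDist(mean, sd).cdf(98)   # one-sided tail probability
print(round(z, 2), round(100 * tail, 1))  # 2.25 1.2
```

The 1.2% is the chance that a randomly chosen game exceeds 98; as argued above, it does not apply to a game selected because it scored 98.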
Javaness2

Re: “Decision: case of using computer assistance in League A

Post by Javaness2 »

I think you're supposed to expect up to 3 sd from the norm in any distribution model?
John Fairbairn
Oza
Posts: 3724
Joined: Wed Apr 21, 2010 3:09 am
Has thanked: 20 times
Been thanked: 4672 times

Re: “Decision: case of using computer assistance in League A

Post by John Fairbairn »

You know, the most amazing thing happened to me tonight. I was coming here, on the way to the lecture, and I came in through the parking lot. And you won’t believe what happened. I saw a car with the license plate ARW 357.
As a Londoner I can point to something even more amazing - yesterday I actually saw an empty parking space!!!!!

But more seriously, I remember the registration number of my father's first car from 60 years ago, and I can't even remember which day it is now. UK registration plates are area-based, but last year I saw that same plate on a new car here, 300 miles away. That sort of coincidence reminds me of the work on coincidences of an Austrian mathematician whose name I've forgotten but I think begins with Ka- (and for some reason my brain also associates frogs with him). I'd like to be reminded of his name, but the point is he showed that coincidences are normal, and even fourth-order coincidences are not extraordinary. I read that as a student and I've never believed in conspiracy theories since.
Bill Spight

Re: “Decision: case of using computer assistance in League A

Post by Bill Spight »

One other requirement for null hypothesis testing is that the data be independent, but the moves in a single go game are far from independent.
Uberdude

Re: “Decision: case of using computer assistance in League A

Post by Uberdude »

Bill Spight wrote:One other requirement for null hypothesis testing is that the data be independent, but the moves in a single go game are far from independent.
Each datum in the situation mentioned, though, is the matching percentage for a whole game (or rather for one player's moves in the 50-149 chunk), so the lack of independence of individual moves doesn't matter; it is subsumed into that single value. The relevant question is then whether each game's matching % is independent of the others'. I should think so, though there will certainly be correlations with properties like player strength, player style, time limits, or seriousness of the event, so you need to make sure your sample is drawn from a relevant population.

The lack of independence of moves though would, I suspect, cause these data to be less tightly clustered around the mean than otherwise. So less like a normal distribution with a nice tight peak and more of a pancake. So you can't just blindly slap a normal distribution on it and do your P(X>mean+f*sd) test.
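The "pancake" intuition above can be checked with a small simulation. Everything here is an invented toy model (matches arriving in all-or-nothing runs of correlated moves), not real game data:

```python
import random
from statistics import stdev

random.seed(1)

# Toy model: if matches come in correlated runs (forced sequences)
# rather than independently per move, the game-level match % spreads
# out. All parameters are illustrative assumptions.
def game_match_pct(n=100, p=0.8, run_len=1):
    """Each 'decision' covers run_len moves that all match or all miss."""
    decisions = n // run_len
    matched = sum(run_len for _ in range(decisions) if random.random() < p)
    return 100 * matched / n

independent = [game_match_pct(run_len=1) for _ in range(2000)]
correlated = [game_match_pct(run_len=10) for _ in range(2000)]

# The mean is ~80 in both cases, but the spread widens dramatically
print(round(stdev(independent), 1), round(stdev(correlated), 1))
```

With per-move independence the standard deviation of the game-level percentage is about 4 points; with 10-move correlated runs it is closer to 13, so a fixed cutoff expressed in sds of the wrong model means very little.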

P.S. Analysed an Ilya vs Artem game, updated table above.
Bill Spight

Re: “Decision: case of using computer assistance in League A

Post by Bill Spight »

Uberdude wrote:
Bill Spight wrote:One other requirement for null hypothesis testing is that the data be independent, but the moves in a single go game are far from independent.
Each datum in the situation mentioned, though, is the matching percentage for a whole game (or rather for one player's moves in the 50-149 chunk), so the lack of independence of individual moves doesn't matter; it is subsumed into that single value. The relevant question is then whether each game's matching % is independent of the others'. I should think so, though there will certainly be correlations with properties like player strength, player style, time limits, or seriousness of the event, so you need to make sure your sample is drawn from a relevant population.

The lack of independence of moves though would, I suspect, cause these data to be less tightly clustered around the mean than otherwise. So less like a normal distribution with a nice tight peak and more of a pancake. So you can't just blindly slap a normal distribution on it and do your P(X>mean+f*sd) test.

P.S. Analysed an Ilya vs Artem game, updated table above.
Point well taken about the independence of data across games. That can reasonably be assumed. I was concerned about the application of the standard deviation, but you raise that issue, as well. More on that problem below.

The lack of independence between moves in a single game raises the question of what you count. Go players regard the hane-and-connect as a unit; why count it as two matches instead of one? Semeai may not be one-lane roads, because the order of play can vary, but they produce sequences of play where the match rate to the top three options for individual plays is higher than normal. Now, over a large sample of games the average number of obvious responses to forcing moves, joseki sequences, one-lane roads, etc., evens out, so counting single-move matches is an OK proxy for a better matching metric. But that does not apply when you are looking at only one game. For instance, if the 100-move sequence in a game included a 20-move one-lane road, that would push up the single-move match percentage. A long ko fight would generate a large number of obvious responses to forcing moves, which would increase the single-move match percentage as well. You can't just rely on testing a single game using single-move match criteria.
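The one-lane-road point is simple arithmetic; a sketch with invented numbers:

```python
# Illustrative numbers only: suppose a player's baseline top-3 match
# rate on "free" moves is 60%, and the game contains a 20-move
# one-lane road that any player would follow (match rate ~100%).
free_moves, free_match = 80, 0.60
forced_moves, forced_match = 20, 1.0

overall = (free_moves * free_match + forced_moves * forced_match) / 100
print(overall)  # 0.68: the forced run alone adds 8 points of "similarity"
```

So two games by the same player can differ by several points of match rate purely because one contained more forced sequences.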
BlindGroup
Lives in gote
Posts: 388
Joined: Mon Nov 14, 2016 5:27 pm
GD Posts: 0
IGS: 4k
Universal go server handle: BlindGroup
Has thanked: 295 times
Been thanked: 64 times

Re: “Decision: case of using computer assistance in League A

Post by BlindGroup »

Uberdude wrote:I only have 10 data points, but fitting them to a normal distribution (dubious: too small sample, could be different shape, plus 100 is a hard max) I get a mean of 80 and standard deviation of 8. So then you might say 98 is 2.2 sds from the mean, what's the chance of that? Look up your normal distribution probability tables and you get 1.2%. That's small, an inept statistician would say, less than the oft used 0.05 significance level, he must be guilty! But that's the chance a randomly selected game has that value (based on the false assumption the metric is normally distributed with those parameters). But this game was not randomly selected, it was chosen to be examined precisely because it has a high similarity. So such a probability is invalid. As Feynman eloquently said:
You know, the most amazing thing happened to me tonight. I was coming here, on the way to the lecture, and I came in through the parking lot. And you won’t believe what happened. I saw a car with the license plate ARW 357. Can you imagine? Of all the millions of license plates in the state, what was the chance that I would see that particular one tonight? Amazing!
Uberdude, your taking the time to go through even these 10 games seems to be more than we've seen anyone else doing to systematically assess these decisions. A few thoughts to contribute:

1. As you note, a sample size of 10 data points is VERY small. I think even "inept statisticians" would be uncomfortable moving forward with only these data. That said, this is not meant to criticize your efforts, but rather to argue that you are on the right track and that your efforts should be extended by some organization with access to significantly greater computational resources.

2. I think you have the logic of the hypothesis-testing framework slightly twisted, and it affects the interpretation of the 1.2 percent error rate (the "Type I" rate). You are right, we chose the game with the 98 percent top-3 match rate deliberately -- it was the game under question. The 1.2 percent that you have estimated gives the probability of getting a match rate of 98 percent or more given "normal" go play. Said differently, if you set up a decision framework that classifies any match rate of 98 percent or more as cheating, you will falsely classify 1.2 percent of all normal (non-cheating) games as cheating. From a research perspective, this is well within the accepted probabilities of error for most disciplines. But is it small enough for the purposes of identifying cheating in go? Probably not. This rate would mean that, on average, about one person would be falsely convicted of cheating for every hundred games examined at a tournament. That seems like an uncomfortably high level of false convictions to me.

Relative to the Feynman quote, the 1.2 percent tells us how likely it would be to observe the ARW 357 license plate through random variation. The question is whether to assign significance to this occurrence (e.g. declare it unusual and worthy of further investigation) or to let it go. If we set up a decision process that assigns significance whenever the probability of the observation is 1.2 percent or less, then even when there is no true significance we will be wasting our time investigating 1.2 percent of the time. The point of the quote is that we experience rare events more often than most people realize, and so we need to be careful about using the rareness of an event alone to justify further investigation. In this case, it is as if someone has run into the classroom and told Feynman that they saw a car with the license plate ARW 357 in the parking lot just before a local bank was robbed. We have to decide whether that observation warrants following up on the owner of the car or letting the lead go.
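The false-conviction arithmetic in point 2 can be written out explicitly:

```python
# Testing many honest games at a 1.2% per-game false positive rate
# (the tail probability from Uberdude's normal fit).
alpha = 0.012
n_games = 100

expected_false = alpha * n_games             # expected false flags
p_at_least_one = 1 - (1 - alpha) ** n_games  # chance of at least one
print(round(expected_false, 1), round(p_at_least_one, 2))  # 1.2 0.7
```

A roughly 70% chance of at least one false flag per hundred honest games examined is the multiple-comparisons problem in miniature.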

3. There are statistical techniques for handling data with unknown distributions, but they are very "data hungry" in that they require very large data sets. The same goes for dealing with data that are not "independent and identically distributed". Your and Bill's comments are on point, but given a reasonable amount of data these issues are easily addressed.
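As a concrete instance of point 3, here is a minimal percentile-bootstrap sketch. It resamples the Black top-3 column from Uberdude's table to get a confidence interval for the mean match rate without assuming normality; with only a dozen games this is illustration, not serious inference:

```python
import random
from statistics import mean

random.seed(2)

# Black top-3 match percentages from the table earlier in the thread
data = [98, 80, 74, 80, 74, 56, 84, 74, 54, 74, 66, 82]

# Percentile bootstrap: resample with replacement, collect the means,
# and read off the central 95% of the resampled distribution.
boot_means = sorted(
    mean(random.choices(data, k=len(data))) for _ in range(10000)
)
lo, hi = boot_means[249], boot_means[9749]
print(lo, hi)
```

With these values the interval comes out somewhere around 68 to 81, wide enough to show how little ten-odd games pin down the distribution.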
Bill Spight

Re: “Decision: case of using computer assistance in League A

Post by Bill Spight »

BlindGroup wrote:Uberdude, your taking the time to go through even these 10 games seems to be more than we've seen anyone else doing to systematically assess these decisions. A few thoughts to contribute:

1. As you note, a sample size of 10 data points is VERY small. I think even "inept statisticians" would be uncomfortable moving forward with only these data. That said, this is not meant to criticize your efforts, but rather to argue that you are on the right track and that your efforts should be extended by some organization with access to significantly greater computational resources.
Let me second that. :) And also add the necessity to apply the Adkins Principle (named, not by me, after my late wife): At some point, doesn't thinking have to go on?
3. There are statistical techniques for handling data with unknown distributions, but they are very "data hungry" in that they require very large data sets. The same goes for dealing with data that are not "independent and identically distributed". Your and Bill's comments are on point, but given a reasonable amount of data these issues are easily addressed.
The main point is, we do not yet have a reasonable amount of data regarding either single move matches or cheating or their possible relationship.