“Decision: case of using computer assistance in League A”

Bartleby · Post by **Bartleby** » Fri Mar 30, 2018 2:03 am

Uberdude wrote:
Bartleby wrote: although in chess a 98 percent agreement between an engine and a player would be considered very strong evidence (even the best chess players in the world who train with computers all the time don't score nearly as high).
When you talk of agreement with a chess engine do you mean playing the top choice of the engine, playing one of the top 3 choices of the engine (as in this case), or some more complex comparison? I am concerned that by choosing the broader top 3 metric a headline figure of 98% can be quoted (without telling people the typical distribution of non-cheaters) which suggests more guilt to the casual reader than warranted. When comparing to Leela's top choice I got 72% agreement*.

* It may be even lower if you allow Leela to analyse more deeply. Leela starts off analysing moves suggested by its policy network, which has been trained on strong human games, so has a very human-like style (unlike -Zero bots). As it analyses more it may come to prefer moves which the policy network didn't like; AlphaGo's move 37 5th line shoulder hit being a famous example. To give an example from this game: in my first analysis for move 51 l17 was Leela's #1 choice and this is what Carlo played. It is also quite likely what I would play. But if I let Leela analyse for longer l17 becomes the #2 choice and e11 becomes #1. This may well happen with other moves too, so a deeper Leela analysis could see this similarity metric of 72% drop even further.
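
To make the difference between the top-choice and top-3 metrics concrete, here is a minimal Python sketch. The function name `top_k_match_rate` and the move lists are made up for illustration; they are not Leela output or data from the game under discussion.

```python
def top_k_match_rate(played, candidates, k):
    """Fraction of played moves found among the engine's first k candidates.

    played     -- moves actually played, e.g. ["l17", "e11"]
    candidates -- one list of engine candidates per move, best first
    """
    if len(played) != len(candidates):
        raise ValueError("need one candidate list per played move")
    if not played:
        raise ValueError("no moves to compare")
    hits = sum(1 for move, ranked in zip(played, candidates) if move in ranked[:k])
    return hits / len(played)


# Hypothetical data, for illustration only.
played = ["l17", "q4", "c3", "e11"]
candidates = [
    ["l17", "e11", "k3"],  # top choice played
    ["r4", "q4", "d16"],   # second choice played
    ["d3", "c4", "c3"],    # third choice played
    ["f12", "g10", "h9"],  # no match
]

top1 = top_k_match_rate(played, candidates, 1)
top3 = top_k_match_rate(played, candidates, 3)
print(f"top-1: {top1:.0%}, top-3: {top3:.0%}")
```

The same game scores very differently under the two definitions, which is why a headline percentage needs its metric stated alongside it.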

P.S. Another data point: a 1 kyu on reddit got 64% similarity to Leela's top 3 over moves 1-150 of his correspondence game; it would be lower over the same 50-150 interval, as there is more similarity in the opening. Chart: https://i.imgur.com/jMM4EIM.png. Unsurprising that a 1k isn't as good at / similar to Leela in the middle game as mid dans.

I think close comparisons with chess are unlikely to be that useful here: there are too many differences (e.g., in chess the average game has far fewer moves, and deep opening theory often shortens the effective length even more; there are generally fewer reasonable candidate moves; there are generally more forced moves, etc.). A 98 percent match rate in chess might only allow one nonmatch out of an entire game; it might happen occasionally in games between top chess pros but I suspect it would be rare.

I don't have any firm opinion whether there was any cheating in this particular go game. But surely it is evident that a 64 percent match rate has 18 times as many nonmatches as a 98 percent match rate. That's a rather substantial difference.
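
The arithmetic behind that "18 times" figure, as a quick Python sketch (the helper name `nonmatch_ratio` is just for illustration):

```python
def nonmatch_ratio(rate_a, rate_b):
    """How many times more nonmatches rate_a implies than rate_b (rates in percent)."""
    return (100 - rate_a) / (100 - rate_b)


# 64 percent leaves 36 nonmatches per 100 moves; 98 percent leaves 2.
ratio = nonmatch_ratio(64, 98)
print(ratio)
```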

Without looking at the game or knowing anything about the players (other than your comment that they are mid-dans), my gut reaction would be that a match rate of 98 percent top three moves is extremely high. The Go board is too big, mid-dans are quite good amateurs but will play many suboptimal moves every game, and there is no obvious reason why their suboptimal moves should match or even be members of the same small set 98 percent of the time. There are probably also many moves per game on which there are multiple optimal moves (or so nearly optimal as to make no difference to a human player); there is no obvious reason these moves should have a near perfect match rate either.

I am a similar level at chess as a mid-dan or even a bit higher, and if you analyzed all of the games I have played in my life with a top engine, I would be surprised if more than a few of them matched the engine's top three moves 98 percent of the time, and would not be surprised if none of them had such a high match rate. And rough logic suggests to me that the match rate should (a) be lower in Go than in chess; and (b) be lower with a weaker engine than a stronger one.

The above is not a serious analysis, just my gut reaction. But yes, a 98 percent top three match rate seems very, very high to me.

RobertJasiek · Post by **RobertJasiek** » Fri Mar 30, 2018 2:27 am

Bartleby wrote:the match rate should (a) be lower in Go than in chess

Higher because a go game can have many (for high dan) obvious moves.

Uberdude · Post by **Uberdude** » Fri Mar 30, 2018 2:54 am

Bartleby, a few points:
- I agree that as chess and go are different games with different number of reasonable moves etc the exact metric will be different. But what would be useful is how much of an outlier from typical values is strong enough to declare cheating. So if chess players of your level typically match bots 50-70% then 98% is very unusual.
- Leela 0.11 is not superhuman like stockfish or AlphaGo, I think it's about 5-6d amateur so close to the humans in question.
- 64% is a poor baseline to compare with, the 80 and 88% from my game is better as similar level. Even better would be players who have studied with Leela. A 4-5d on the UK team is such a player, I plan to check one of his games when I have time.
- The denominator of these percentages is 50, so we are talking 49 out of 50 top 3 matches meaning cheating versus 44 out of 50 being innocent.

Kirby · Post by **Kirby** » Fri Mar 30, 2018 4:48 am

RobertJasiek wrote:
Bartleby wrote:the match rate should (a) be lower in Go than in chess
Higher because a go game can have many (for high dan) obvious moves.

Are there fewer obvious moves for top chess players? I suppose in go, you often choose one sequence over the other. Is it more like choosing specific moves in chess?

Bill Spight · Post by **Bill Spight** » Fri Mar 30, 2018 5:10 am

As I mentioned before, matching is confirmatory evidence, and confirmatory evidence is weak. The principal statistical question is whether Carlo's play differed from his normal, non-cheating play. Note that differing from his normal play does not mean that he was cheating. One way of cheating is to choose one of Leela's top three plays for several moves, so matching addresses the question of how he cheated, if he did, which is a different question from whether he cheated. As Uberdude pointed out, he might have cheated, if he did, by copying Zen, with similar results.

As hyperpape pointed out in post #11 ( viewtopic.php?p=228848#p228848 ) Regan's work on cheating in chess is relevant. Regan's statistical approach asks how much better a player played than usual, not about similarity to the play of any specific engine, and typically requires non-statistical evidence of cheating, which may be supported by the statistical evidence. Only in very rare cases does Regan consider the statistical evidence good enough by itself to indicate cheating. Note that the discussions here about specific plays and how good or bad they may be are about non-statistical evidence, and are appropriate to the question of whether cheating occurred.

Bartleby wrote:I was actively playing chess online during the period when commercially-available engines got really strong, and this experience has made me very cynical. Many people are honest and would never consider using an engine to cheat. But a surprising number of people will cheat if they think they can get away with it. Some will cheat stupidly and get caught very easily; others will cheat intelligently and perhaps never be caught. Some amateurs will cheat; so will some professionals.

From a Bayesian point of view, Bartleby's observations are pertinent to the question of cheating in this case, or any other case of cheating at go, as they inform prior beliefs about the possibility and probability of cheating.

Bill Spight · Post by **Bill Spight** » Fri Mar 30, 2018 6:00 am

Kirby wrote:
RobertJasiek wrote:
Bartleby wrote:the match rate should (a) be lower in Go than in chess
Higher because a go game can have many (for high dan) obvious moves.
Are there fewer obvious moves for top chess players? I suppose in go, you often choose one sequence over the other. Is it more like choosing specific moves in chess?

Well, the match rate does not mean much, anyway. Confirmatory evidence is weak, weak, weak.

Openings in chess and joseki in go allow players to make plays at a high level simply by knowing them. And in the late endgame, even if good players may not choose exactly the same play at any point, plays which match the actual win rate are extremely common. By contrast, in chess the endgame can be extremely difficult, with even top players missing wins. In go it is the middle game where the weaknesses of strong amateurs are likely to show up.

My impression is that in go there are more likely to be long sequences of play where there is little choice, because the plays are not independent. There is more chunking in go. It seems to me that combinations of the same length are likely to be conceptually richer in chess. For instance, in go the one lane road is fairly common. I don't think that is the case in chess. Simply counting matches of individual plays in go does not take into account the dependence of plays upon each other, which is quite common in go. Once a certain play is chosen, several followups may become obvious. That's another reason that you can't just count plays that match, or even plays that are good or bad. I remember seeing a 41 move variation (21 chess moves long) in a game commentary, but the author (Kitani, IIRC) said that even amateurs could read it out, because it was a one lane road. Would that 21 move match be evidence of cheating?

Javaness2 · Post by **Javaness2** » Fri Mar 30, 2018 6:03 am

Uberdude wrote: A 4-5d on the UK team is such a player, I plan to check one of his games when I have time.

Well since he did play Carlo in the final game of last year, it might be amusing.

Bartleby · Post by **Bartleby** » Fri Mar 30, 2018 9:17 am

Uberdude wrote:Bartleby, a few points:
- I agree that as chess and go are different games with different number of reasonable moves etc the exact metric will be different. But what would be useful is how much of an outlier from typical values is strong enough to declare cheating. So if chess players of your level typically match bots 50-70% then 98% is very unusual.
- Leela 0.11 is not superhuman like stockfish or AlphaGo, I think it's about 5-6d amateur so close to the humans in question.
- 64% is a poor baseline to compare with, the 80 and 88% from my game is better as similar level. Even better would be players who have studied with Leela. A 4-5d on the UK team is such a player, I plan to check one of his games when I have time.
- The denominator of these percentages is 50, so we are talking 49 out of 50 top 3 matches meaning cheating versus 44 out of 50 being innocent.

I'm afraid I don't have any hard statistics for you regarding top three chess moves, just impressions.

Are you saying that the fact that Leela is weaker makes matching more likely? I would think the opposite is true. There are generally more suboptimal moves in any given position than optimal moves, so I suspect that match rate should be lower with weaker engines, not higher.

It would be interesting to know the match rate of the other player who has trained with Leela, and whether that match rate is higher than that of an equally strong player who hasn't trained with Leela.

I still think 98 percent is really high. Although confirmatory evidence may be weak in general, at some point that becomes no longer true. If a player had a 100 percent match rate over an entire game would this not be highly suspect? 98 per cent is quite close to 100 per cent.

But my main point is not about this game in particular, or even match rates in general. It's rather that in a competitive game like chess or Go, and especially when playing over the Internet when a cheater can always rely on plausible deniability, cheating is likely to become a real problem as stronger and stronger engines become available. Maybe the problem will be less in Go than in chess: my general impression is that there is a significantly higher percentage of dysfunctional personalities among chess players than Go players.

Knotwilg · Post by **Knotwilg** » Fri Mar 30, 2018 9:51 am

Bartleby wrote:My general impression is that there is a significantly higher percentage of dysfunctional personalities among chess players than Go players.

For sure in the Winter Games.

Bill Spight · Post by **Bill Spight** » Fri Mar 30, 2018 10:17 am

Bartleby wrote: I still think 98 percent is really high. Although confirmatory evidence may be weak in general, at some point that becomes no longer true. If a player had a 100 percent match rate over an entire game would this not be highly suspect? 98 per cent is quite close to 100 per cent.

You find an agreement of 49/50 to be strong evidence of cheating. How about 11/12?

Of those 50 moves, there are blindingly obvious ones, such as replying to a threat to kill several stones. Also, as I have indicated, there are sequences of play which form a unit, because once you have made the first play you are going to make the others. The relationship is tighter than that of most chess combinations, because as a rule the sequence forms a chunk, or single idea. The fifty moves in question are Black 51 - Black 149.

Black 51 is not obvious, but it starts a 4 move chunk -- or maybe two 2 move chunks, depending upon how you make the call. In any event, the four moves are not independent. Soon thereafter there is a 7 move chunk, or perhaps a 5 move chunk followed by two obvious moves, it comes to the same thing. Then there is a move that a lot of players of Carlo's strength would choose, but it's not obvious. Then there is an obvious play, followed by a 5 move chunk. Not long after there is another 5 move sequence, which I would call two chunks, but if you play the first one you are going to play the second one, so together they form a unit. There are a few more long sequences that form units. Ignoring obvious plays that are not part of a unit, I come up with 12 units of play (including one move units) in those 50 moves. I assume that one of the units does not match Leela's top three choices. OC, other people may see things somewhat differently, but Black made rather fewer independent non-obvious decisions than 50.
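
A rough way to see why the number of independent decisions matters is a binomial tail calculation. The Python sketch below is only illustrative: it assumes every decision matches independently with the same probability, and it borrows 0.88 (one of the honest-game figures quoted earlier in the thread) as a hypothetical baseline. Neither assumption is established for this game.

```python
from math import comb


def tail_probability(n, k, p):
    """P(at least k matches in n independent decisions), each matching with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


BASELINE = 0.88  # assumed honest top-3 match rate; illustrative only

p_moves = tail_probability(50, 49, BASELINE)  # treating all 50 moves as independent
p_units = tail_probability(12, 11, BASELINE)  # treating the 12 units as the decisions
print(f"49+ of 50: {p_moves:.3f}")
print(f"11+ of 12: {p_units:.3f}")
```

Under these assumptions the second probability is far larger than the first, so the same game looks much less remarkable once dependent moves are grouped into units.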

John Fairbairn · Post by **John Fairbairn** » Fri Mar 30, 2018 11:02 am

Some years ago Mark Hall and I ran a sideshow at the London Open where players (nearly all dan and high kyu) took up an invitation to predict all the moves of a pro game (i.e. both sides) they had never seen before using GoScorer. They had to guess the move played, not just one of the top three. It was accepted that the first dozen moves or so would be more or less impossible to guess, so straightaway no-one could score close to 100% (or even 98%). There was, however, a function that gave you a broad hint (e.g. which quarter of the board), though this meant you didn't get the full score for that move.

Nevertheless, scores were consistently high, which surprised us. I can't remember the percentages now, but I think 60-70% was common and a high dan scored (I think) over 80% by thinking a long time. But the surprise at the generally high scores is still a fresh memory.

Thinking about the explanations later, it became apparent that quite a lot more moves than we expected are routine (e.g. hanetsugi) or trivial (connecting after atari).

But some years later, I noticed another delimiting effect. A very high percentage of moves are adjacent to or within one space of the previous move (i.e. the opponent's move). Just as a rough pointer, I have just looked at a recent game between a pro and an AI, limiting myself to moves 11-110, and counted how many moves fell within that scope. It was 64 (i.e. 64%).
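
A count like this is easy to automate. The Python sketch below is one possible reading of "adjacent to or within one space of the previous move": it assumes that means at most two lines away in each direction, diagonals included. The function names and the move list are hypothetical, not the game mentioned above.

```python
def near(a, b, reach=2):
    """True if points a and b, given as (column, row), are at most `reach`
    lines apart in both directions (Chebyshev distance)."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1])) <= reach


def local_reply_rate(moves, reach=2):
    """Fraction of moves, from the second onward, played near the previous move."""
    if len(moves) < 2:
        raise ValueError("need at least two moves")
    replies = sum(1 for prev, cur in zip(moves, moves[1:]) if near(prev, cur, reach))
    return replies / (len(moves) - 1)


# Made-up sequence of (column, row) points, for illustration only.
moves = [(16, 4), (17, 6), (4, 4), (3, 6), (4, 7), (10, 10)]
rate = local_reply_rate(moves)
print(f"{rate:.0%} of replies were local")
```

A stricter or looser definition of "within one space" only changes the `reach` test, so the figures are easy to compare across definitions.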

You can extend this. If you instead count all the moves adjacent to or within one space of the last move (his) and all those adjacent to or within one space of the move before (yours), you get noticeably high figures. I didn't actually do a count here but I could see at a glance there were many such moves.

Now of course "adjacent to or within one space of the last move" can cover a fair number of points (not all empty, though) so there is some guesswork, and this is a big part of the reason why you can't use a computer to generate good moves like this. But in most cases even an amateur dan human (and, apparently, high kyus) can make a decent stab at which is the right point. And a strong human can also often tell when to tenuki.

If you add in style training, I imagine you can up the percentages even more.

Does style training work? It must, surely, otherwise no-one would have a style. Is it possible to copy a style well enough to bring a high score obtained by other factors, such as the above? It may be unusual but it seems possible. I recall Jan van der Steen's remarkable ability to comment on a game in exactly the manner used for the pro commentaries in Go World. It wasn't a parody. If you posed a sort of Turing test and presented someone with his commentary and a GW commentary, I don't think they could tell the difference in origin.

But at that time, Go World was just about the only thing in existence that gave long commentaries in English and, like many others, Jan studied them intensely faute de mieux (he was 3-dan at the time, I think). So, studying Leela intensely could in like manner possibly produce a Leela clone, especially given that unlike humans Leela is probably very consistent. The clone may not understand what he's doing, but his subconscious has learnt enough to be a good mimic?

Bartleby · Post by **Bartleby** » Fri Mar 30, 2018 12:24 pm

So I ran some old chess games through a strong chess engine (Houdini Pro 4), and the results were a bit surprising to me, similar to those noted by Mr. Fairbairn.

I looked at just five games. They were played in 1992 at a small local FIDE invitational. At the time my FIDE rating was in the low 2300s; I would guess that this is roughly the equivalent of the lower end of the mid-dans range in Go. All of my opponents were rated within 120 points of me.

The results were as follows, looking at only the number of matches with Houdini's top three moves:

Game 1: 28/43 or 65 percent
Game 2: 14/16 or 87.5 percent
Game 3: 29/38 or 76 percent
Game 4: 55/77 or 71 percent
Game 5: 26/30 or 87 percent
Overall: 152/204 or 74.5 percent
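
For anyone who wants to check or extend these figures, a short Python sketch that recomputes the percentages from the match counts listed above (the dictionary layout is just one convenient choice):

```python
# (top-three matches, moves analysed) for each game, as listed above
games = {
    "Game 1": (28, 43),
    "Game 2": (14, 16),
    "Game 3": (29, 38),
    "Game 4": (55, 77),
    "Game 5": (26, 30),
}

for name, (hits, total) in games.items():
    print(f"{name}: {hits}/{total} = {100 * hits / total:.1f} percent")

total_hits = sum(hits for hits, _ in games.values())
total_moves = sum(total for _, total in games.values())
overall = 100 * total_hits / total_moves
print(f"Overall: {total_hits}/{total_moves} = {overall:.1f} percent")
```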

Frankly, these match rates are significantly higher than I anticipated, so I am having to rethink my position that top three moves matching can be a strong indicator.

Interestingly, of my two games with the highest match rates, in Game 2 my opponent played rather weakly and lost quickly; in Game 5 I lost rather miserably to a stronger player.

Uberdude · Post by **Uberdude** » Fri Mar 30, 2018 1:44 pm

Bartleby wrote: Are you saying that the fact that Leela is weaker makes matching more likely? I would think the opposite is true. There are generally more suboptimal moves in any given position than optimal moves, so I suspect that match rate should be lower with weaker engines, not higher.

I suspect that normal play of mid- to high-dan amateur has a closer match to Leela (whose policy network has been trained on human games) than to a stronger bot like AlphaGo Zero without human training (or recent Leela Zero, or AlphaGo Master which started with human training but has had a lot of self play training to develop its own style). I also suspect a top pro would have a lower match against Leela. Of course I'd like some real data and would update my views accordingly.

Bartleby wrote: I still think 98 percent is really high. Although confirmatory evidence may be weak in general, at some point that becomes no longer true. If a player had a 100 percent match rate over an entire game would this not be highly suspect? 98 per cent is quite close to 100 per cent.

I agree 98% is suspicious, but not particularly so if 80-90% is normal.
But suspicious is not enough to convict and punish. After suspicions I think a human analysis like Stanislaw did which I posted, or we did here of moves e2 or l17 or t13 is warranted.

Bartleby wrote: But my main point is not about this game in particular, or even match rates in general. It's rather that in a competitive game like chess or Go, and especially when playing over the Internet when a cheater can always rely on plausible deniability, cheating is likely to become a real problem as stronger and stronger engines become available.

Yup. It's difficult. If we were already in a position where it's accepted 10% or something of people are cheating online then I'd be happier with much weaker evidence to convict someone, "on the balance of probabilities" level (as for civil cases in English law). But if we are still in the cheating is rare world (maybe I'm being naive) then stronger "beyond reasonable doubt" (criminal law) evidence is needed.
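
To illustrate how much the assumed base rate matters, here is a small Python sketch of Bayes' rule in odds form. The likelihood ratio of 20 is an arbitrary placeholder for "how much more likely this evidence is under cheating than under honest play"; nobody in this thread has measured it.

```python
def posterior(prior, likelihood_ratio):
    """Posterior probability of cheating, via Bayes' rule in odds form.

    likelihood_ratio = P(evidence | cheating) / P(evidence | honest)
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


LIKELIHOOD_RATIO = 20  # assumed strength of the evidence; purely illustrative

for prior in (0.10, 0.01):
    print(f"prior {prior:.0%} -> posterior {posterior(prior, LIKELIHOOD_RATIO):.0%}")
```

With these made-up numbers the same evidence makes cheating more likely than not when 10% of players cheat, but leaves it unlikely when only 1% do.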

Bartleby wrote: Maybe the problem will be less in Go than in chess: my general impression is that there is a significantly higher percentage of dysfunctional personalities among chess players than Go players.

I hope so!

P.S. That 1k on reddit with the 64% made an interesting point that in his game Leela had a delusion about the status of a dead group (it's sometimes really stupid at nakades) so wanted to keep playing dumb moves there which sensible humans don't, thus lowering the matching metric. Neither Carlo's game nor mine had a dead group to confuse Leela. Games which do should perhaps be excluded from the dataset for finding the usual distribution of this similarity metric.

dfan · Post by **dfan** » Fri Mar 30, 2018 3:24 pm

I don't have enough information to have a real opinion in this matter (and I suspect neither do the people in charge of the decision), but I think that the numbers to compare when hand-wavily throwing around stats are the disagreements of 2% vs 10-20%, not the agreements of 98% vs 80-90%. It's easy to look at numbers above 80 as just all being pretty large, but there's a really big difference between 2% disagreement and 10% disagreement which is easier to see when you look at it that way.

Bill Spight · Post by **Bill Spight** » Fri Mar 30, 2018 10:29 pm

Uberdude wrote:
Bartleby wrote: Are you saying that the fact that Leela is weaker makes matching more likely? I would think the opposite is true. There are generally more suboptimal moves in any given position than optimal moves, so I suspect that match rate should be lower with weaker engines, not higher.
I suspect that normal play of mid- to high-dan amateur has a closer match to Leela (whose policy network has been trained on human games) than to a stronger bot like AlphaGo Zero without human training (or recent Leela Zero, or AlphaGo Master which started with human training but has had a lot of self play training to develop its own style). I also suspect a top pro would have a lower match against Leela. Of course I'd like some real data and would update my views accordingly.

My suspicions are similar, given the kind of matching done. The range of moves that (strong amateur) Leela considers good is very likely to include the move a strong but slightly weaker amateur human might play. This is ignoring painfully obvious moves and one lane roads, OC, where the actual moves chosen should match almost 100%. But pros will often dismiss a move that a strong amateur thinks is good, because they see at a glance that it doesn't stand up. So there are probably in general fewer pro moves for the strong amateur human to match.

Uberdude wrote:
Bartleby wrote: I still think 98 percent is really high. Although confirmatory evidence may be weak in general, at some point that becomes no longer true. If a player had a 100 percent match rate over an entire game would this not be highly suspect? 98 per cent is quite close to 100 per cent.
I agree 98% is suspicious, but not particularly so if 80-90% is normal.
But suspicious is not enough to convict and punish. After suspicions I think a human analysis like Stanislaw did which I posted, or we did here of moves e2 or l17 or t13 is warranted.

I agree with Regan. Matches with Leela or another very strong bot may provide supporting evidence, given other evidence of cheating, but are very rarely good enough to stand alone. I also agree with Uberdude that the high number of matches with Leela justifies looking for more evidence, such as analysis of specific plays.

Uberdude wrote:
Bartleby wrote: But my main point is not about this game in particular, or even match rates in general. It's rather that in a competitive game like chess or Go, and especially when playing over the Internet when a cheater can always rely on plausible deniability, cheating is likely to become a real problem as stronger and stronger engines become available.
Yup. It's difficult. If we were already in a position where it's accepted 10% or something of people are cheating online then I'd be happier with much weaker evidence to convict someone, "on the balance of probabilities" level (as for civil cases in English law). But if we are still in the cheating is rare world (maybe I'm being naive) then stronger "beyond reasonable doubt" (criminal law) evidence is needed.

That's a very Bayesian outlook, Uberdude.

And I think that the similarity to civil and criminal law is important. I became a duplicate bridge director after taking a class from the world's best.

The class did not cover cheating, and it was emphasized that the attitude of civil law was the right one. If there is an irregularity the idea is to restore equity, not to punish wrongdoing. The burden of proof is also less in civil law. I defer to the officials who found that this game was irregular, in which case throwing out the game result would restore equity. The close similarity to Leela's play is enough to say that there is something funny about this game. But it is, as Regan says, only supporting evidence of cheating. Also, cheating is not just an irregularity, it is wrongdoing that should be punished. For a finding of cheating, the burden of proof needs to be higher, just as it is in criminal law.

BTW, if online cheating at chess is like other anti-social behavior, it may well be that the number of players who have cheated could be as high as 25%; but the number who cheat frequently or who cheat in tournaments is probably less than 5%. A lot of people succumb to temptation once or twice, but then find that it is not all that rewarding, and give it up.

Life In 19x19
