It is currently Wed Nov 20, 2019 12:14 am

All times are UTC - 8 hours [ DST ]




Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #261 Posted: Tue Jun 05, 2018 5:00 am 
Dies in gote

Posts: 65
Liked others: 31
Was liked: 55
Quote:
The basic idea, that he did very well in this year's PGETC, is of course relevant as an initial starting point. However, the winning percentages from Go Rating are probably not very reliable. Thus the figure you quote (1/3000) is probably best left out. Andrew Simon has already mentioned two players with similar 'super' performances this year.


Could you provide a reference? I do not recall any 4d (and not fast-improving!) player performing like that. Of course, there are fast-improving 1d players who regularly perform as 3d at tournaments. I agree, the figure of 3000 tournaments is approximate, though even if it were 1000 ...

Quote:
Just to be clear: the allegedly fabricated game record is *not* the kifu from the game that raised accusations of cheating but another game, between Carlo Metta and Kim Shakhov. You really should be specific in this instance.


I am quite specific about it in the report; I did not feel like copy/pasting from the report when people can read it themselves.

Quote:
A possible source of bias is that you're comparing games he won with games he mostly lost.


I am definitely aware of it. The problem is that there are not many regular games Carlo won recently for which records are available.

Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #262 Posted: Tue Jun 05, 2018 5:16 am 
Lives in gote
User avatar

Posts: 306
Location: Deutschland
Liked others: 264
Was liked: 125
Rank: EGF 4 kyu
AlesCieply wrote:
Quote:
Just to be clear: the allegedly fabricated game record is *not* the kifu from the game that raised accusations of cheating but another game, between Carlo Metta and Kim Shakhov. You really should be specific in this instance.


I am quite specific about it in the report; I did not feel like copy/pasting from the report when people can read it themselves.


Do not be arrogant. Many people who read this thread will not go and download your PDF and read it in great detail.

You are not only accusing someone of cheating but also accusing them of fraudulently fabricating evidence! The very least you could do is exercise some care and diligence in doing so!

Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #263 Posted: Tue Jun 05, 2018 5:21 am 
Lives in sente

Posts: 1257
Liked others: 102
Was liked: 265
AlesCieply wrote:
Could you provide a reference? I do not recall any 4d (and not fast-improving!) player performing like that. Of course, there are fast-improving 1d players who regularly perform as 3d at tournaments. I agree, the figure of 3000 tournaments is approximate, though even if it were 1000


I am actually surprised you don't already have this data, because this tournament is so obvious to check.
Just sort this list of performances http://www.europeangodatabase.eu/EGD/To ... y=T160920A


Hidden is the raw gain (at 50%) but of course TPR should be bigger
Code:
GoR after tournament: 2051.345    52.544    Chris Bryant (UK)
GoR after tournament: 2171.653    47.167          a 1d
GoR after tournament: 2309.134    46.753   Daniel Hu (UK)
GoR after tournament: 2259.935    35.214    a 2d
GoR after tournament: 2305.714    30.255    Carlo Metta
GoR after tournament: 2092.218    30.121
GoR after tournament: 2397.215    24.673
GoR after tournament: 2043.823    21.05
GoR after tournament: 2433.031    18.133
GoR after tournament: 2121.004    17.008
GoR after tournament: 2188.365    16.505
GoR after tournament: 2311.131    14.747
GoR after tournament: 2310.588    13.407
GoR after tournament: 2126.191    13.002
GoR after tournament: 1658.588    12.903
GoR after tournament: 2083.033    11.992
GoR after tournament: 2327.441    11.614
GoR after tournament: 2270.889    11.504
GoR after tournament: 2252.379    11.152
GoR after tournament: 2338.201    10.159
GoR after tournament: 2148.369    10.044
GoR after tournament: 2032.917    10.041
GoR after tournament: 2345.868    9.099
GoR after tournament: 2062.001    9.088
GoR after tournament: 2554.068    9.065
GoR after tournament: 2102.409    7.755
GoR after tournament: 2188.786    6.338
GoR after tournament: 2082.72    6.08
GoR after tournament: 2231.591    4.753
GoR after tournament: 1784.558    4.605
GoR after tournament: 2435.44    4.306
GoR after tournament: 2121.256    4.192
GoR after tournament: 2226.391    4.011
GoR after tournament: 2262.345    3.901
GoR after tournament: 2415.592    3.895
GoR after tournament: 2708.218    3.89
GoR after tournament: 2145.097    3.855
GoR after tournament: 2503.11    3.139
GoR after tournament: 2739.956    2.731
GoR after tournament: 2347.463    2.379
GoR after tournament: 2118.761    2.275
GoR after tournament: 2160.829    2.194
GoR after tournament: 2174.745    2.171
GoR after tournament: 2297.122    1.151
GoR after tournament: 1959.172    1.076
GoR after tournament: 2333.719    0.693
GoR after tournament: 2065.674    0.663
GoR after tournament: 2327.001    0.086
GoR after tournament: 1833.388    -0.314
GoR after tournament: 1702.06    -0.344
GoR after tournament: 1799.57    -0.369
GoR after tournament: 1989.18    -0.4
GoR after tournament: 1687.072    -0.405
GoR after tournament: 2191.142    -1.31
GoR after tournament: 2148.85    -1.474
GoR after tournament: 1949.142    -1.593
GoR after tournament: 1857.958    -1.651
GoR after tournament: 2350.28    -1.728
GoR after tournament: 2502.07    -1.741
GoR after tournament: 2351.544    -2.286
GoR after tournament: 2168.554    -2.343
GoR after tournament: 2329.714    -2.393
GoR after tournament: 2150.505    -2.815
GoR after tournament: 2191.358    -2.824
GoR after tournament: 2329.81    -3.071
GoR after tournament: 2396.55    -3.084
GoR after tournament: 2228.872    -3.798
GoR after tournament: 1985.133    -3.924
GoR after tournament: 2045.834    -4.042
GoR after tournament: 1977.848    -4.236
GoR after tournament: 2382.181    -4.275
GoR after tournament: 2283.186    -4.295
GoR after tournament: 2269.634    -4.33
GoR after tournament: 2158.263    -4.514
GoR after tournament: 2485.878    -4.538
GoR after tournament: 2441.816    -5.301
GoR after tournament: 2337.461    -5.596
GoR after tournament: 2242.681    -5.609
GoR after tournament: 2607.648    -5.901
GoR after tournament: 2332.817    -7.041
GoR after tournament: 2369.122    -7.446
GoR after tournament: 2206.372    -7.641
GoR after tournament: 2169.903    -8.211
GoR after tournament: 2246.102    -8.296
GoR after tournament: 2131.445    -8.419
GoR after tournament: 2212.346    -8.452
GoR after tournament: 2295.239    -8.911
GoR after tournament: 2618.884    -9.866
GoR after tournament: 2153.442    -10.663
GoR after tournament: 2441.498    -10.776
GoR after tournament: 2387.235    -11.255
GoR after tournament: 2447.258    -11.847
GoR after tournament: 1972.993    -11.971
GoR after tournament: 2362.567    -12.1
GoR after tournament: 2306.047    -12.167
GoR after tournament: 2480.258    -13.333
GoR after tournament: 2485.259    -13.444
GoR after tournament: 2502.231    -13.591
GoR after tournament: 2331.652    -13.996
GoR after tournament: 2370.293    -14.179
GoR after tournament: 2256.515    -15.038
GoR after tournament: 2429.096    -16.269
GoR after tournament: 2077.376    -22.624
GoR after tournament: 2160.054    -24.437
GoR after tournament: 1922.164    -31.362
GoR after tournament: 2249.452    -32.76
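For reference, a listing like the one above can be generated by sorting rows by the gain column. A minimal sketch, assuming lines have already been scraped into the `GoR after tournament: <rating> <gain> [name]` shape shown:

```python
def sort_by_gain(lines):
    """Sort rows of the form
    'GoR after tournament: <rating>    <gain>    [name...]'
    by the gain column, largest first."""
    def gain(line):
        # The gain is the 5th whitespace-separated field.
        return float(line.split()[4])
    return sorted(lines, key=gain, reverse=True)
```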

_________________
North Lecale


This post by Javaness2 was liked by: Uberdude
Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #264 Posted: Tue Jun 05, 2018 5:22 am 
Lives in sente

Posts: 1257
Liked others: 102
Was liked: 265
Quote:
Just to be clear: the allegedly fabricated game record is *not* the kifu from the game that raised accusations of cheating but another game, between Carlo Metta and Kim Shakhov. You really should be specific in this instance.


Let us say that the word 'fabricated' was not a good choice here. I would have gone for 'modified'. Especially in a paper like this.

_________________
North Lecale

Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #265 Posted: Tue Jun 05, 2018 5:36 am 
Lives with ko

Posts: 183
Liked others: 25
Was liked: 60
Rank: 2d
Javaness2 wrote:
Quote:
Just to be clear: the allegedly fabricated game record is *not* the kifu from the game that raised accusations of cheating but another game, between Carlo Metta and Kim Shakhov. You really should be specific in this instance.


Let us say that the word 'fabricated' was not a good choice here. I would have gone for 'modified'. Especially in a paper like this.

The report, as I understand it, says Carlo submitted it as an example of an over-the-board tournament game, and it turned out to be a KGS record instead. The word "fabrication" is entirely appropriate if that is indeed correct, and IMO if this is indeed what happened, it justifies any penalty. If you lie to the court, you deserve whatever you get.

Online
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #266 Posted: Tue Jun 05, 2018 5:41 am 
Judan

Posts: 6190
Location: Cambridge, UK
Liked others: 354
Was liked: 3340
Rank: UK 4 dan
KGS: Uberdude 4d
OGS: Uberdude 7d
AlesCieply wrote:
Quote:
The basic idea, that he did very well in this year's PGETC, is of course relevant as an initial starting point. However, the winning percentages from Go Rating are probably not very reliable. Thus the figure you quote (1/3000) is probably best left out. Andrew Simon has already mentioned two players with similar 'super' performances this year.


Could you provide a reference? I do not recall any 4d (and not fast-improving!) player performing like that. Of course, there are fast-improving 1d players who regularly perform as 3d at tournaments. I agree, the figure of 3000 tournaments is approximate, though even if it were 1000 ...

From earlier in thread:

Just to go back to Carlo, I thought I'd work out his performance rating for this season's PGETC. He had great results for a 4d:
- beat Andrey Kulkov 6d (Russia) by 1.5
- beat Ondrej Kruml 5d (Czechia) by 2.5
- beat Dragos Bajenaru 6d (Romania) by resign
- beat Reem Ben David 4d (Israel) by resign *** the famous 98% game
- lost to Mero Csaba 6d (Hungary) by 2.5
- beat Mijodrag Stankovic "5d" 3d by resign
- lost to Andrij Kravets 7d/1p by 7.5

At the start of the season in (1st) September Carlo's rating was 2381 [very similar to me], this was after picking up 50 points at the EGC. Of course his true strength could have been more than that and grown since then too but his rating lagged. His performance rating (using EGD GoR calculator), using current ratings of opponents is 2629, or +248.

How does that compare to other good performances?

Forum regulars may remember I beat Victor Chow 7d from South Africa a few years ago. UK were in league C for the 2014/15 season and my initial rating was 2361. My results were:
- beat Petrauskas 3d (Lithuania) by resign
- beat Chow 6/7d (South Africa) by 0.5
- beat Ganeyev 3k (Kazakhstan) by resign.
As I had no losses my performance rating with the "adjust until input = output" method is infinite, anchoring with a loss to 2700 gives 2666, anchoring with loss to 2800 gives 2719. So +300 ish with big uncertainty as no losses and few games, the only useful information is I beat a 2616 in one game, how flukey was that?

Last season Daniel on the UK team had no losses, this season he had just 1:
- beat Rasmusson 4d (Denmark)
- beat Karadaban 5d (Turkey)
- beat Welticke 6d (Germany)
- lost to Lin 6d (Austria)
Initial rating was 2402. Performance rating 2616 (+214).
If you include the wins (included some 5ds) from the previous season (for which his initial rating was 2262 but he probably wasn't much weaker than he is now) as well then you get performance rating of 2677 (+415).

Update: Chris this season:
- beat Isaksen 2d (Denmark)
- beat Schlattner 2d (Switzerland)
- beat Kuntay 2d (Turkey)
- beat Palant "5d" 4d (Germany) [quotes is his stated grade, no quotes is GoR where 4d is 2351->2450]
- beat Laatikainen "5d" 4d (Finland)
- beat Unger "3d" 4d (Austria)
- beat Hanevik 3d (Norway)
- beat Groenen "6d" 5d (Netherlands)
- beat Ouchterlony "4d" 3d (Sweden)
- lost to Metta 4d (Italy)
Initial rating 2284. Performance rating 2568 (+284). And if like Lukan you believe Carlo was using LeelaZero (I estimate EGF GoR ~2900) in the last game he gets 2781 (+497) :)
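The "adjust until input = output" performance-rating idea used above can be sketched numerically. This is a minimal sketch, not the EGD calculator: it assumes a simple Elo-style logistic win-probability curve in place of the real GoR formula, and bisects for the rating whose expected score equals the actual score:

```python
def win_prob(r_player, r_opp, scale=400.0):
    # Elo-style logistic win probability -- an assumed stand-in for the
    # actual EGD GoR expectation formula, which differs in detail.
    return 1.0 / (1.0 + 10.0 ** ((r_opp - r_player) / scale))

def performance_rating(opp_ratings, results, lo=1000.0, hi=3500.0):
    """Bisect for the rating whose expected score against the given
    opponents equals the actual score ("adjust until input = output").
    `results` holds 1 for a win, 0 for a loss."""
    score = sum(results)
    if score == 0 or score == len(results):
        raise ValueError("all wins or all losses: no finite performance rating")
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if sum(win_prob(mid, r) for r in opp_ratings) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

With no losses the expected score never reaches the actual score at any finite rating, which is exactly why anchoring with a hypothetical loss, as described above, is needed.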


This post by Uberdude was liked by 2 people: Bill Spight, Javaness2
Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #267 Posted: Tue Jun 05, 2018 6:06 am 
Lives in sente

Posts: 1257
Liked others: 102
Was liked: 265
bernds wrote:
The report, as I understand it, says Carlo submitted it as an example of an over-the-board tournament game, and it turned out to be a KGS record instead. The word "fabrication" is entirely appropriate if that is indeed correct, and IMO if this is indeed what happened, it justifies any penalty. If you lie to the court, you deserve whatever you get.


If you don't want your evidence to seem neutral, go ahead and choose 'fabrication'. There are some other f-words (foolish) you can throw in there while you are at it.

_________________
North Lecale

Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #268 Posted: Tue Jun 05, 2018 6:13 am 
Dies in gote

Posts: 65
Liked others: 31
Was liked: 55
Uberdude wrote:
Last season Daniel on the UK team had no losses, this season he had just 1:
- beat Rasmusson 4d (Denmark)
- beat Karadaban 5d (Turkey)
- beat Welticke 6d (Germany)
- lost to Lin 6d (Austria)
Initial rating was 2402. Performance rating 2616 (+214).
If you include the wins (included some 5ds) from the previous season (for which his initial rating was 2262 but he probably wasn't much weaker than he is now) as well then you get performance rating of 2677 (+415).


This one really stands out, I admit. Thanks for providing the reference. Such performances are still quite rare, and I would not consider one proof of cheating on its own. I hope that is also clear from what I say in the report. Do also note that Daniel's strength/rating is still improving and does not look as settled as Carlo Metta's.


This post by AlesCieply was liked by: Javaness2
Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #269 Posted: Tue Jun 05, 2018 7:52 am 
Honinbo

Posts: 9040
Liked others: 2754
Was liked: 3073
AlesCieply wrote:
Finally, I would very much appreciate it if Carlo Metta came out and explained why he presented an apparently fabricated game record to the league manager. I do believe he is in principle an honest man who has done a lot for the go community and can continue to do so. I just think he made a mistake in using AI in his internet games and is now afraid of admitting it.
EDIT: Here I refer to a game record from the Shakhov-Metta game that Carlo himself supplied (among several other records), claiming it was played at a regular tournament and also contained many moves "similar to Leela". In fact, the game was played on KGS and the record was edited to look as if played "live"; see the report for more details.


To me, this behavioral evidence of doctoring and submitting a game record is the strongest evidence of cheating so far. (Assuming that it holds up, OC. :)) As is so often the case, it is the coverup that gets you.

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

Everything with love.

Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #270 Posted: Tue Jun 05, 2018 8:18 am 
Lives with ko

Posts: 141
Liked others: 26
Was liked: 89
Rank: 5 dan
The rating analysis Ales made is, imho, just a signal for a warning lamp to go on.
The same as when a weaker player wins, or when someone plays stronger online.

Another alarm signal: when I look at the deviations diagram in GRP and notice that it goes up for one side and continues to rise, that is rather suspicious.
But there should be a next step in the analysis, going move by move, because some things can be deceiving.

Today I analyzed a game from the PGETC (none of the ones mentioned here) where basically every move by one player is Leela's suggestion.
Basically, 90% of the moves were the A and B suggestions, and only one move was not suggested by Leela (although it looks nice).
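A match-rate figure like the 90% quoted above reduces to a top-k comparison against the engine's candidate lists. A minimal sketch, with a hypothetical data format (one ranked list of engine suggestions per position):

```python
def match_rate(moves, suggestions, k=2):
    """Fraction of the player's moves found among the engine's top-k
    suggestions for the same position.  `suggestions` is assumed to be
    one ranked list of candidate moves per position (hypothetical format)."""
    hits = sum(move in ranked[:k] for move, ranked in zip(moves, suggestions))
    return hits / len(moves)
```

With k=2 this counts the "A and B suggestions" mentioned above; raising k loosens the criterion.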

Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #271 Posted: Tue Jun 05, 2018 8:28 am 
Honinbo

Posts: 9040
Liked others: 2754
Was liked: 3073
AlesCieply wrote:
Quote:
Just to be clear: the allegedly fabricated game record is *not* the kifu from the game that raised accusations of cheating but another game, between Carlo Metta and Kim Shakhov. You really should be specific in this instance.


I am quite specific about it in the report; I did not feel like copy/pasting from the report when people can read it themselves.


I understood that it was a different game, even without looking at either the game record or the report.

AlesCieply wrote:
Quote:
A possible source of bias is that you're comparing games he won with games he mostly lost.


I am definitely aware of it. The problem is that there are not many regular games Carlo won recently for which records are available.

Emphasis mine.

Given the lack of comparative data (game records) and the uncertainties of evaluation (something I will address in another note), I doubt if a strong statistical case can currently be made against Metta. As Regan points out, a purely statistical case can rarely be made. You need physical or behavioral evidence. Game records can include behavioral evidence. IMO, the record of the game vs. Reem points away from cheating. And the behavioral evidence of doctoring a game record points the other way.

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

Everything with love.

Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #272 Posted: Tue Jun 05, 2018 8:39 am 
Honinbo

Posts: 9040
Liked others: 2754
Was liked: 3073
Bojanic wrote:
It would be surprising if someone used a program for the entire game, which would be idiotic to say the least.


Using a program for the entire game seems to be a way of cheating at chess on the internet, at least in non-tournament games, where there is less scrutiny. The main indicator seems to be that the player makes no mistakes or blunders, only "inaccuracies". The program (chess engine) the cheater is using is unknown, but his moves match the top three choices of any given strong engine.

This may be the source of the one-of-the-top-three indicator used in Metta's case, but no theory of cheating has been offered for it. Your theory of cheating is a good one, but would not produce the 98% matches in the Metta-Reem game.

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

Everything with love.

Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #273 Posted: Tue Jun 05, 2018 9:49 am 
Lives in gote
User avatar

Posts: 602
Liked others: 50
Was liked: 211
AlesCieply wrote:

On page 3 you talk about three points:

(1) Carlo Metta performed unusually well at PGETC;
(2) During PGETC he made more good moves and fewer bad moves than usual;
(3) He modified an internet game and presented it as a regular tournament game record.

IMO these are only two points, not three. If you make more good moves and fewer bad moves than usual (point 2), you often beat stronger opponents than usual (point 1).

(But (2) is more precise than (1), as it gives clues about the manner a game was won.)

I also don't understand how you get your statement that such a feat would occur in 1 out of 3000 tournaments. He had 4 victories against 6d, 5d, 6d and 4d, then 1 loss, a victory against 3d and 1 loss. If you only take into account the first four matches, then according to http://www.europeangodatabase.eu/EGD/winning_stats.php such a winning streak occurs with a probability 0.1762x0.3x0.5, which is about 0.5%, i.e. one person accomplishes such a feat once every 200 tournaments; or if you prefer, during a tournament with 200 participants, you can expect one such performance.

Of course the calculation is very rough, but my point is that (1) is not so unusual, as Uberdude also pointed out by giving concrete examples.

Concerning point (2): to determine whether the percentage of good or bad moves was unusually high or low for a 4d player, it would be necessary to analyse a large number of games (say 100) played by 4d players and determine the average percentage of good and bad moves, as well as the standard deviation. In your document, the number of analysed games was much too low to allow any significant statistical analysis.

On the other hand, if (3) is true and if Carlo Metta cannot provide a convincing explanation, the conjunction of (1) and (3) does cast some serious doubts.

Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #274 Posted: Tue Jun 05, 2018 10:19 am 
Honinbo

Posts: 9040
Liked others: 2754
Was liked: 3073
AlesCieply wrote:
Hello, I guess you know I was involved in dealing with the case as a member of the PGETC appeals committee. Since about that time I have also been looking into the matter on my own, trying to devise a better and statistically sounder method to check whether or not someone used AI in internet games. My analysis is based on comparing the player's performance in internet and live games.


IMO this is the right tack, to look for differences instead of similarities. :) Matching data (confirmatory evidence) is weak.

I have looked at your report but not studied it. A few comments, quoting you from it.

AlesCieply wrote:
I decided to try another approach inspired by Ken Regan's works (see e.g. [3]) on convicting chess players of using AI help in their games. The method is based on a statistical analysis determining how often players make mistakes of a given magnitude. The stronger the players are, the fewer (or smaller) mistakes they make, and the distribution of the mistakes made by a particular player forms a pattern characteristic of that player and his strength.

Emphasis mine.

Two questions that arise are what is a mistake, and how large is it?

First, nobody believes that Leela plays perfectly, or even as well as other current AI bots. So deviations from Leela's play cannot be considered mistakes, even if they probably are. Second, it is improper to confound making a mistake with failing to match Leela's top choice of plays when matching Leela per se is taken as evidence of cheating. AlphaGo Zero is not available for testing plays -- although DeepMind might be persuaded to make it available to go organizations for the purpose of detecting cheating --, but the Facebook network is, I understand, as it has been incorporated into Leela Zero. Not only is it better than Leela, it is different. Use it instead of Leela.

Third, unlike in chess, I do not think that we have a proven method of evaluating plays. That sounds silly, since all the top go bots evaluate plays. However, that is in context of playing a whole game well, not of evaluating specific plays. A general evaluation method that is good enough to play well is not the same thing. Just as top humans can make mistaken evaluations of single plays, even while playing well, so can top bots. Top chess engines seem to be able to distinguish three categories of single errors: inaccuracies, mistakes, and blunders. Go bots have not been shown to achieve that level of evaluation of plays.

One problem is the evaluation of plays in terms of win rates. In the Monte Carlo Tree Search (MCTS) era, win rates were found to be better than score estimates in producing good overall play. However, win rates are not as well defined as score estimates. (You need another parameter similar to komi to use score estimates, anyway. ;)) This lack of definition is indicated by the lack of error estimates for the win rates. Win rates other than 0 or 1 depend upon mistakes. But what level of mistakes, what frequency, and what kind? Who knows?

I do not have enough experience with Leela to say, but MCTS bots were known to make strange plays and win rate estimates in the endgame, unless the game was close. This suggests that the win rate estimates were of a different nature than the win rate estimates earlier in the game. A bot which was behind might make a play that a human dan player might immediately dismiss as a mistake. (Programmers dismissed these human evaluations by saying that humans don't understand win rates. :roll: :roll: :roll: ) One possibility is that the bot's play left open the possibility of a horrendous blunder by the opponent, one that the human player would dismiss out of hand, judging it to be impossible for a player as strong as the opponent. Another possibility is that the randomness of Monte Carlo playouts in such situations simply made the win rate estimates unreliable. In either of these cases, the win rate estimates would be qualitatively different from those earlier in the game. The choice of plays would be less likely to be good, and the size of deviations from the top choice would be less indicative of the size of a mistake (if any).

Now, a good evaluation function for individual plays can surely be developed. For instance, if a particular bot played out the game with White playing first from a certain position 10,000 times you might get a win rate estimate and error estimate for Black of x% ± y%. Suppose that a play from a second position yielded estimates of v% ± w%, and v + w < x and v < x - y. Then we might regard a Black play to the second position instead of the first to be a mistake. :)
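The "10,000 playouts giving x% ± y%" idea above can be sketched with a binomial standard error, under the optimistic assumption that playouts are independent trials (real MCTS playouts are correlated, so the true uncertainty is larger):

```python
import math

def winrate_with_error(wins, playouts, z=1.96):
    """Win-rate estimate with a ~95% normal-approximation half-width,
    treating playouts as independent Bernoulli trials (a simplifying
    assumption for illustration)."""
    p = wins / playouts
    half_width = z * math.sqrt(p * (1.0 - p) / playouts)
    return p, half_width
```

For example, 5500 wins in 10,000 playouts gives roughly 55% ± 1%, which is the kind of "v% ± w%" comparison the paragraph above proposes for judging whether one play is a genuine mistake relative to another.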

AlesCieply wrote:
the work on it is rather slow and tedious.


I suppose by "it" you mean the game analyses and report. I am afraid that a lot of tedious work needs to be done to reach a point where a program can reliably evaluate individual plays to the degree that current chess engines can. Statistical evidence should be based upon making fewer and smaller mistakes, given the opportunity to cheat than otherwise.

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

Everything with love.

Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #275 Posted: Tue Jun 05, 2018 10:39 am 
Honinbo

Posts: 9040
Liked others: 2754
Was liked: 3073
jlt wrote:
AlesCieply wrote:

On page 3 you talk about three points:

(1) Carlo Metta performed unusually well at PGETC;
(2) During PGETC he made more good moves and fewer bad moves than usual;
(3) He modified an internet game and presented it as a regular tournament game record.

IMO these are only two points, not three. If you make more good moves and fewer bad moves than usual (point 2), you often beat stronger opponents than usual (point 1).

(But (2) is more precise than (1), as it gives clues about the manner a game was won.)

I also don't understand how you get your statement that such a feat would occur in 1 out of 3000 tournaments. He had 4 victories against 6d, 5d, 6d and 4d, then 1 loss, a victory against 3d and 1 loss. If you only take into account the first four matches, then according to http://www.europeangodatabase.eu/EGD/winning_stats.php such a winning streak occurs with a probability 0.1762x0.3x0.5, which is about 0.5%, i.e. one person accomplishes such a feat once every 200 tournaments; or if you prefer, during a tournament with 200 participants, you can expect one such performance.

Of course the calculation is very rough, but my point is that (1) is not so unusual, as Uberdude also pointed out by giving concrete examples.

Concerning point (2): to determine whether the percentage of good or bad moves was unusually high or low for a 4d player, it would be necessary to analyse a large number of games (say 100) played by 4d players and determine the average percentage of good and bad moves, as well as the standard deviation. In your document, the number of analysed games was much too low to allow any significant statistical analysis.


There are a few problems with (2), as implemented. First, Leela's top choices are confounded with good moves. (It's a better comparison than matching one of three, but of the same kind.) Another way of determining good moves and mistakes should be used. Second, a lack of fit could be the result, not only of Metta playing differently, but of his having different opponents. (Which he obviously did.) Third, the way in which Metta plays poorly may be quite different from how he plays well. As a counterexample, in a recent game against one of the top bots Haylee gradually lost ground; that is, she played pretty much the same as when she wins. Another player, such as an amateur, might blunder. Another pro might sense losing ground and embark on risky maneuvers. Such specific differences between winning play and losing play could produce poor fits which have nothing to do with cheating. (Again, the lack of a theory of cheating reveals itself.)

Edit: There is another problem with the Chi Square test, using sparse groupings. There should be only four categories for the test, combining the less frequent categories.
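Pooling the less frequent categories before running a Chi Square test, as suggested above, can be sketched like this (pure Python, statistic only, no p-value; the "expected count of at least 5 per cell" rule of thumb is the standard one for the chi-square approximation):

```python
def pool_sparse(observed, expected, min_expected=5.0):
    """Pool categories whose expected count falls below min_expected
    into a single combined category, as the chi-square approximation
    requires reasonably large expected counts in every cell."""
    obs_out, exp_out = [], []
    obs_pool = exp_pool = 0.0
    for o, e in zip(observed, expected):
        if e < min_expected:
            obs_pool += o
            exp_pool += e
        else:
            obs_out.append(float(o))
            exp_out.append(float(e))
    if exp_pool > 0:
        obs_out.append(obs_pool)
        exp_out.append(exp_pool)
    return obs_out, exp_out

def chi_square_stat(observed, expected):
    # Pearson's chi-square statistic over the (pooled) categories.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```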

Edit: And another problem. Suppose that Leela's top choice is an obvious one, like replying to a sente. A cheater does not need to copy Leela to play an obvious move, so it is irrelevant to the question of cheating and should not be used in the test. The same goes for sufficiently easy plays, such as those a 2 kyu would play. They have to be hard enough that the suspected cheater might miss them without cheating.

_________________
The Adkins Principle:

At some point, doesn't thinking have to go on?

— Winona Adkins

Everything with love.


Last edited by Bill Spight on Tue Jun 05, 2018 12:37 pm, edited 3 times in total.
Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #276 Posted: Tue Jun 05, 2018 11:05 am 
Lives in gote
User avatar

Posts: 602
Liked others: 50
Was liked: 211
Bill Spight wrote:
There are a few problems with (2), as implemented.


But here, we are dealing with a player suspected of cheating with Leela, and not with Elf, Zen, or a 9p giving hints.

If you don't like the terms "good" and "bad", let's use "nice" and "ugly" instead. By definition, a player using Leela to cheat will produce more nice and fewer ugly moves than during normal play, so if one can prove that C.M. played a very unusually high number of nice moves and a very unusually low number of ugly ones during PGETC, whatever the definition of "nice" and "ugly" is, then this will raise suspicions of cheating using Leela.


Last edited by jlt on Tue Jun 05, 2018 1:25 pm, edited 1 time in total.
Offline
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #277 Posted: Tue Jun 05, 2018 12:01 pm 
Honinbo

Posts: 9040
Liked others: 2754
Was liked: 3073
jlt wrote:
Bill Spight wrote:
There are a few problems with (2), as implemented.


But here, we are dealing with a player suspected of cheating with Leela, and not with Elf, Zen, or a 9p giving hints.

If you don't like the terms "good" and "bad", let's use "nice" and "ugly" instead. By definition, a player using Leela to cheat will produce more nice and fewer ugly moves than during normal play, so if one can show that C.M. played an unusually high number of nice moves and an unusually low number of ugly ones during the PGETC, whatever the definition of "nice" and "ugly", then this will raise suspicions of cheating with Leela.


There is, IMO, enough evidence to indicate that if Metta cheated, he did so by copying Leela. That being the case, trying to prove cheating by comparing his plays to Leela's confounds the question of how he might have cheated with the question of whether he cheated. They are separate questions, and if we can address the question of whether Metta cheated without using Leela, we should do so.

Since we now have at least one other way of measuring the quality of individual plays, with the combination of LeelaZero and the Facebook neural network, we can use that, or perhaps something else. The theory of cheating, as Regan points out, is that the cheater played significantly better, given the opportunity to cheat, than without that opportunity. Use some other bot to evaluate the difficulty of individual plays and the margin of error. Besides, LeelaZero, with or without the Facebook neural net, is better able to rate plays than Leela is, so using it we are better able to compare the quality of Metta's plays.
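Regan's approach can be sketched roughly as follows: estimate, for each non-trivial position, the probability that a player of the suspect's strength matches the reference engine, then compare the observed match count to its expectation via a z-score. The per-move probabilities below are invented for illustration:

```python
from math import sqrt

def match_zscore(match_probs, matches):
    """Aggregate z-score for an observed engine-match count, treating each
    position as an independent Bernoulli trial with its own match probability."""
    mean = sum(match_probs)                       # expected number of matches
    var = sum(p * (1 - p) for p in match_probs)   # variance of the match count
    return (matches - mean) / sqrt(var)

# Hypothetical per-position match probabilities for 8 non-trivial positions,
# and an observed 7 engine matches.
probs = [0.6, 0.4, 0.7, 0.3, 0.5, 0.55, 0.45, 0.6]
z = match_zscore(probs, 7)
print(round(z, 2))
```

A large positive z-score means the player matched the engine far more often than a player of his strength should; in practice the per-position probabilities must themselves be calibrated on thousands of games, which is exactly the hard part of Regan's method.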


 Post subject: Re: “Decision: case of using computer assistance in League A
Post #278 Posted: Tue Jun 05, 2018 1:11 pm 
Lives with ko

Posts: 183
Liked others: 25
Was liked: 60
Rank: 2d
Lukan wrote:
At the end of his post, I would also like to reveal something strange. On 31st May, a post apparently written by some Italian player appeared in this discussion, but it disappeared about 10 minutes later for an unknown reason... (see the attached screenshot)
That's quite a serious accusation as well, so I went digging through games from the "metta" account (I don't know for certain that it belongs to the player accused of cheating, but it seems plausible). I found one where he marked opponent stones in seki as dead, but it was around 20k so I'm inclined to disregard it. More interesting is a game from Nov 2007, poporo [3k] vs metta [4k], which ends with Black getting rekt, and he leaves the game with the words:
Quote:
metta [4k]: i'm sorry but i'm used to play go not this disgusting game, please when you'll learn to play this game inform me so i can take you off of my censor list
So that seems to be partial confirmation for the claims in the previous message.

edit: same month, after losing some stones:
Quote:
metta [4k]: thanks for your unfairness
metta [4k]: insert in my censor list
metta [4k]: i inform admins immediately
metta [4k]: please don't play with me again
metta [4k]: at your level it's needed that you grow up a little
metta [4k]: bye little boy
konstantyn [4k]: read my terms...no undo!

edit2:
Quote:
duyen [3k]: it's not a misclick
metta [4k]: congratulations!!! you are the first name in my censor list
metta [4k]: thanks for your unfairness, please don't play with again, in 2 minutes all the english game room will know about your unfairness
duyen [3k]: it's not a misclick, so you can't undo


Last edited by bernds on Tue Jun 05, 2018 1:22 pm, edited 1 time in total.

This post by bernds was liked by: Hidoshito
 Post subject: Re: “Decision: case of using computer assistance in League A
Post #279 Posted: Tue Jun 05, 2018 1:43 pm 
Lives in gote

Posts: 402
Liked others: 80
Was liked: 123
Rank: igs 4d
bernds wrote:
(...) I went digging through games from the "metta" account (I don't know for certain that it belongs to the player accused of cheating, but it seems plausible). I found one where he marked opponent stones in seki as dead, but it was around 20k so I'm inclined to disregard it. More interesting is a game from Nov 2007, poporo [3k] vs metta [4k], which ends with Black getting rekt, and he leaves the game with the words: (...)

It is also the account of a very frequent escaper. Not that this says much about the cheating case, but if this is/was indeed Carlo Metta's account, the idea that such a player could be a referee during the EGC is disturbing/laughable...

 Post subject: Re: “Decision: case of using computer assistance in League A
Post #280 Posted: Tue Jun 05, 2018 2:47 pm 
Beginner

Posts: 6
Liked others: 1
Was liked: 6
Rank: 1 kyu
AlesCieply wrote:
Finally, I would very much appreciate it if Carlo Metta came out and explained why he presented an apparently fabricated game record to the league manager. I do believe he is in principle an honest man who has done a lot for the go community and can continue to do so. I just think he made a mistake in using AI in his internet games and is now afraid of admitting it.
EDIT: Here I refer to a game record of the Shakhov-Metta game that Carlo himself supplied (among several other records), claiming it was played at a regular tournament and also contained many moves "similar to Leela". In fact, the game was played on KGS and the record was edited to look as if it had been played "live"; see the report for more details.


Dear cieply,

I'm Maurizio Parton, one of the authors of the appeal document. Mirco Fanti asked me to answer your messages here on the forum, because he has already lost a lot of time answering your emails and he has an important tournament to organize. I have an EGC to organize myself, so I will try to be brief and clear.

Carlo agreed with the referee to share some SGFs in order to clarify his style. Carlo looked among his files and indeed made a mistake: he attributed one of his SGFs to a live game, while it was in fact an online game.

But why on earth would Carlo have done this on purpose? What would have been the malignant objective of this manipulation? This game has a low 'similarity' with Leela: why would Carlo have lied to his own disadvantage?!

The other question: why is the record slightly different from the actual game, as if it was not downloaded from KGS but handwritten? Well, because it *is* handwritten. Every week we meet at our Go club in Pisa, and quite often we ask Carlo to show us a game: he then writes the game down while he comments on it. After that, the game is on Carlo's laptop.

As for the new 'analysis' that you, a member of the appeal commission, made and used instrumentally to raise new accusations against Carlo, I am not going to address it, for several reasons.

The first reason is in the same appeal document that the appeal commission accepted as a proof that the accusations moved against Carlo were flawed:

"The methodology was chosen by people who were not blind to the moves (...) this carries the risk of involuntarily picking a methodology exactly because it confirmed the accusations"

This flawed activity is called 'cherry picking', and *voilà*, you could have bet with 98% probability: this is exactly what is happening! 'Cherries' everywhere! I warmly invite you to read the appeal document.

The second reason why I am not going to address the new round of analysis is in the introduction on Regan's work that you cite yourself:

"His [KEN REGAN] work began on September 29, 2006, during the Topalov-Kramnik World Championship match. Vladimir Kramnik had just forfeited game five in protest to the Topalov team's accusation that Kramnik was consulting a chess engine during trips to his private bathroom. (...) Topalov's team published a controversial press release trying to prove their previous allegations. Topalov's manager, Silvio Danailov, wrote in the release, '... we would like to present to your attention coincidence statistics of the moves of GM Kramnik with recommendations of chess program Fritz 9.' (...) An online battle commenced between pundits who took Danailov's 'proof' seriously versus others, like Regan, who insisted that valid statistical methods to detect computer assistance did not yet exist. (...) In just a few weeks, the greatest existential threat to chess had gone from a combination of bad politics and a lack of financial support to something potentially more sinister: scientific ignorance. In Regan's mind, this threat seemed too imminent to ignore. 'I care about chess,' he says. 'I felt called to do the work at a time when it really did seem like the chess world was going to break apart.'"

This is exactly what is happening now: the Go world is breaking apart. And I'm sorry to say that in this analogy you cast yourself as Regan, but in fact you act like Danailov.

The third reason is that, as is apparent from Regan's work, creating a solid methodology requires analyzing thousands of games. This is not something that can be done in a few days or weeks, nor by somebody who repeatedly claims not to be an expert in statistics.

To be constructive: I think we should focus on creating a solid method, as Regan did, based on science and data, to be applied in future tournaments, because, as explained above, trying to create methods to confirm one's opinion is flawed from the start. Let's start this process together: I warmly invite you to send your proposal to the AGM, and/or make proposals on this forum.

Finally: apologies to everybody if I sounded rude. Let's close this sad chapter in the history of Go, and let's start working together, not against each other.

Best regards, Maurizio


This post by figgitaly was liked by 5 people: Bill Spight, Charlie, frmor, theoldway, Uberdude