John Fairbairn wrote: But I think it is also important to try to see it all from the point of view of the organisers and referees.
{snip}
people see cheating in chess/go, just as people see drugs in athletics, very much in black-and-white terms. Halfway houses and discreet averting of the eyes are not tolerated by the vast majority. Cheating must be stamped out - even if some people suffer wrongly, it seems.
There was a cheating scandal in an IGS tournament in the 1990s in which Sprint, a strong Chinese amateur, was discovered to have gotten help from a Chinese pro. That may have made Pandanet sensitive to accusations of cheating in their tournaments. AFAICT, nothing written here criticizing the treatment of the evidence in this case, or the CIT case, condones cheating.
Now, given that this can only be done on the basis of some sort of probabilistic assumptions,
That may be so in these cases, but not in general, as Regan has pointed out. (Unless you are a Bayesian.) Even in the case of casual online chess cheating, the cheaters typically put down their opponents, a form of behavioral evidence. (In itself weak, OC, but not just statistical.)
is it possible to lend support to organisers and referees (and through them to the overwhelming majority of players) by using statistics in the same way that seems to be accepted elsewhere.
That was not done in this case. I have argued in Bayesian terms, first, because I am a Bayesian, and second, because Bayesians, like most of the public, and like the organizers and referees, believe in confirmatory evidence. But, unlike most of the public, we know that it is very, very weak. The use of confirmatory evidence is not generally accepted statistical practice.
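How weak confirmatory evidence is can be illustrated with a toy Bayesian calculation. All the numbers below are invented for illustration: even if a cheater almost always matches the engine, a high match rate barely moves the posterior when honest strong players also match it often.

```python
# Toy Bayesian update: how much does "he plays like Leela" move the needle?
# All probabilities below are made-up illustrative values, not measurements.

prior_cheat = 0.01            # prior probability that a given player cheats
p_match_if_cheat = 0.95       # P(high engine-match rate | cheating)
p_match_if_honest = 0.60      # P(high engine-match rate | honest strong player)

# Bayes' rule: P(cheat | match) = P(match | cheat) P(cheat) / P(match)
p_match = p_match_if_cheat * prior_cheat + p_match_if_honest * (1 - prior_cheat)
posterior = p_match_if_cheat * prior_cheat / p_match

print(f"posterior probability of cheating: {posterior:.3f}")
```

With these (hypothetical) numbers the evidence raises the probability of cheating only from 1% to about 1.6%, because honest strong players produce the same observation so often.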
What I have in mind is something like the "significance" factor which is often mentioned in connection with 95% probability. How can such a metric be devised and accepted?
Regan addresses that in chess, not with the question of whether a player plays like Houdini or another top engine (confirmatory evidence), but whether the player plays better than he does without cheating (disconfirmatory evidence). Regan can make use of individual moves, because he is able to rate them. Thus, an obvious play, even though every engine would play it, does not count against the player, because it is what he would play without cheating. In go we are not able to do that yet; give us a few years. What we have to do instead is to rely upon the judgement of strong players. For instance, in the Reem vs. Metta game, consider the sequence [moves]-[moves], where Black secures the bottom right corner. Black has options for [move], but given that play and White's responses, the four plays [moves]-[moves] would be played not only by Carlo Metta, but also by weaker dan players who were not cheating. Even in cases of suspected online cheating at chess, accusers look at the plays of suspected cheaters and point out plays that are unlike human plays, or unlike human plays at the level of the suspect. That is, the accusers look for disconfirmatory evidence, not confirmatory evidence, or at least not just confirmatory evidence. The four Black plays [moves]-[moves] are confirmatory evidence of the proposition, "He plays like Leela", but are not evidence of cheating. The question is not just a reliance upon statistical evidence alone, but a reliance upon the wrong statistical evidence.
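The idea of not counting obvious plays can be sketched as a filtered match count. The positions, moves, and obviousness judgements below are entirely hypothetical; in chess Regan derives such judgements from move ratings, while in go, for now, they would have to come from strong players.

```python
# Sketch: count engine matches, but exclude "obvious" moves that any strong
# player would find, in the spirit of Regan's move-rating approach.
# Each entry: (player's choice, engine's choice, judged obvious by strong players)
moves = [
    ("D4", "D4", True),     # joseki move, everyone plays this
    ("Q10", "Q10", False),  # non-obvious match with the engine
    ("C7", "F3", False),    # player deviated from the engine
    ("R6", "R6", True),     # forced answer in the corner
    ("K4", "K4", False),    # non-obvious match with the engine
]

# Only non-obvious moves carry information about possible engine use.
informative = [(mine, eng) for mine, eng, obvious in moves if not obvious]
matches = sum(mine == eng for mine, eng in informative)

print(f"raw match rate: {sum(m == e for m, e, _ in moves)}/{len(moves)}")
print(f"informative match rate: {matches}/{len(informative)}")
```

Here the raw match rate is 4/5, but only 2 of the 3 informative moves match: the obvious plays inflate the headline number without telling us anything about cheating.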
Now, if one is using a bot to cheat, then one's play will resemble that of the bot, to some extent. Therefore, as Blindgroup points out, given enough games, the number of plays that are matches to Leela's choices but not because of cheating should even out, on average. But that is not the case for a single game. You need to look at a number of games in which Carlo is suspected of cheating, such as all of his games in this tournament, and compare them with other games in which he is not suspected of cheating. That is, we must look for disconfirmatory evidence: Carlo plays differently in one set of games from how he plays in the other set of games. If you suspect him of cheating in all games, then you compare his play against the play of other players of similar ability. OC, in that case the similarity of his play to Leela's may simply be evidence, not of cheating, but of intensive training with Leela for a couple of years.
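Such a two-set comparison might look like the following sketch, assuming we can count engine matches per move; all of the counts are invented for illustration.

```python
import math

# Invented counts of Leela-matching moves in two sets of games:
# "suspected" = the games in which cheating is alleged; "baseline" = the
# player's other games, or games by players of similar strength.
suspected_matches, suspected_moves = 110, 140   # ~79% match rate
baseline_matches, baseline_moves = 300, 500     # 60% match rate

p1 = suspected_matches / suspected_moves
p2 = baseline_matches / baseline_moves

# Pooled two-proportion z-test: does he play *differently* when suspected?
pooled = (suspected_matches + baseline_matches) / (suspected_moves + baseline_moves)
se = math.sqrt(pooled * (1 - pooled) * (1 / suspected_moves + 1 / baseline_moves))
z = (p1 - p2) / se
p_value = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided normal tail

print(f"z = {z:.2f}, one-sided p = {p_value:.2e}")
```

This is disconfirmatory in the relevant sense: it asks whether the suspected games differ from the player's own baseline, not whether they resemble Leela in the abstract.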
So you could, if you gloss over the question of randomization, set up a significance test using some metric of similarity to Leela's play. But doing so would involve the use of a large number of games, and any statistically significant result would not be a 98% match in a single game.
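The sample-size point can be made concrete with a toy binomial tail test; the baseline rate and move counts below are invented. The same match rate that proves nothing in a single game can become highly significant over many games.

```python
import math

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
    engine matches in n moves if each move independently matches with
    probability p."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical baseline: honest players of this strength match Leela's
# choice on 62% of moves (the 0.62 and the move counts are invented).
baseline_rate = 0.62

# A 70% match rate in a single 50-move stretch:
p_one_game = binomial_tail(35, 50, baseline_rate)

# The same 70% match rate sustained over ten such games:
p_ten_games = binomial_tail(350, 500, baseline_rate)

print(f"one game:  p = {p_one_game:.3f}")
print(f"ten games: p = {p_ten_games:.2e}")
```

With these numbers the single game is nowhere near significance, while the ten-game aggregate is well past any conventional threshold, which is why a verdict should rest on a body of games rather than one striking percentage.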
Acceptance doesn't seem to be a problem to me because humans are used to running their entire lives on the basis of probability. But it could perhaps be made easier to accept if the first "punishment" was not so swingeing. E.g. a player could be put on notice that he is suspected of cheating. The arbiters could also indicate what measures need to be taken to satisfy them in future (e.g. a player could video himself while playing an important on-line game and use that to show that he is not consulting a machine).
As I have said, the evidence in that one game is enough to raise suspicion. And that would justify the organizers to treat Carlo like Caesar's wife, requiring him to be above suspicion, and require that his future games be monitored. It would also justify looking at the plays in the questioned game to see whether the result might be voided. It might even be possible to find further evidence of cheating by analyzing that game.