“Decision: case of using computer assistance in League A”

Uberdude · Post by **Uberdude** » Fri Jun 15, 2018 12:02 am

Here is the winrate for Tryss using Leela 0.11 on 200k playouts.

As I mentioned before, LZ is stronger than 0.11 and also more opinionated, so if one player gets a big lead according to LZ (which isn't so big according to the weaker 0.11 or humans), subsequent winrate changes are less useful for identifying player mistakes. And indeed here we can see Leela 0.11 not going to the high 90s so fast, so more mistakes (according to Leela) are visible.

For comparison, here is the Leela 0.11 winrate histogram at 200k for Carlo vs Reem. It is interesting to compare this to the one in Bojanic's pdf (which I think was about 30-50k) to examine reproducibility and the effect of more playouts. Overall we see smaller red bars (and no green) in generally the same places but some differences (e.g. mine shows Carlo making several moderate mistakes in a row around move 120 which Bojanic's does not).

Bojanic found games from other players in the league this year with not much red on their GRP Leela histograms, and his suspicion/conclusion was that they were cheating with Leela too. My interpretation of this is that it more likely shows non-cheating players can be similar to Leela. An obvious way to resolve this is to examine games in which we know for sure people didn't cheat with Leela; the early seasons of PGETC, before Leela existed, are perfect for this: http://pandanet-igs.com/communities/eur ... rounds/1#1. I'm still setting up pnprog's analysis kit for a more rigorous analysis with lots more games, but for tasters here are some histograms of games from league A round 1 back in 2010 (at 50k nodes, I'll do 200k later). I also think we need to remember that just because Leela 0.11 says a move is bad (red) it doesn't mean it really is: 7/8ds having more red doesn't mean they played worse than Carlo did with less red; it could be they played better moves that Leela just doesn't evaluate correctly (in the pro game I analysed, Lee Sedol had a Leela top 3 matching at the low end of our mid-high amateur dans range, and Park Junghwan in the middle). Using a bot that is clearly much stronger (LZ) could help us distinguish these cases. My guess is LZ will agree that most Leela 0.11 red moves were bad too, but with a sizeable minority actually ok/better. Another of my todos…

Edit: now have 2 histograms both at 50k and 200k to compare.
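A minimal sketch of the "Leela top 3 matching" figure mentioned above, assuming each played move and the engine's ranked suggestions for that position are already available as coordinate strings. The function name and the sample data are hypothetical; this is not the code behind the histograms in this post.

```python
from typing import Sequence


def top_n_match_rate(
    played: Sequence[str],
    suggestions: Sequence[Sequence[str]],
    n: int = 3,
) -> float:
    """Fraction of played moves found among the engine's first n suggestions."""
    if len(played) != len(suggestions):
        raise ValueError("need one ranked suggestion list per played move")
    if not played:
        return 0.0
    hits = sum(
        1 for move, ranked in zip(played, suggestions) if move in ranked[:n]
    )
    return hits / len(played)


# Hypothetical data: four moves by one player and the engine's ranked
# candidates for each position (best candidate first).
played = ["Q16", "D4", "R6", "C10"]
suggestions = [
    ["Q16", "R16", "D4"],
    ["D16", "D4", "C4"],
    ["Q6", "R5", "P3"],
    ["C10"],
]
rate = top_n_match_rate(played, suggestions)
print(f"top-3 match rate: {rate:.2f}")
```

The same function run at two playout budgets (for example 50k and 200k) on the same game would give a rough measure of how stable the figure is.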

Javaness2 · Post by **Javaness2** » Fri Jun 15, 2018 12:28 am

Uberdude wrote: Bojanic found games from other players in the league this year with not much red on their GRP Leela histograms, and his suspicion/conclusion was that they were cheating with Leela too. My interpretation of this is that it more likely shows non-cheating players can be similar to Leela. An obvious way to resolve this is to examine games in which we know for sure people didn't cheat with Leela; the early rounds of PGETC, before Leela existed, are perfect for this: http://pandanet-igs.com/communities/eur ... rounds/1#1. I'm still setting up pnprog's analysis kit for a more rigorous analysis with lots more games, but for tasters here are some histograms of games from league A round 1 back in 2010 (at 50k nodes, I'll do 200k later). I also think we need to remember that just because Leela 0.11 says a move is bad (red) it doesn't mean it really is: 7/8ds having more red doesn't mean they played worse than Carlo did with less red; it could be they played better moves that Leela just doesn't evaluate correctly (in the pro game I analysed, Lee Sedol had a Leela top 3 matching at the low end of our mid-high amateur dans range, and Park Junghwan in the middle). Using a bot that is clearly much stronger (LZ) could help us distinguish these cases. Another of my todos…

It seems like an enormous task to produce a tool that everyone can trust.
One metric can be similarity to Leela(N), but that alone is probably not enough.
You need to be able to show that the player is finding moves above his level, and I think that, given our different strengths within go, that is very difficult. One player may invade well; another uses influence well; another is a God of shape; etc. Is this task easier in Chess? No idea.
We do not even have an agreed definition of tournament performance rating yet. Plus GoR's winning percentages are suspect.
Ultimately I consider it frighteningly hard to say definitively that a 2400 player + Leela(N) played at 2550 for X moves in N.

Bojanic · Post by **Bojanic** » Fri Jun 15, 2018 1:04 am

Uberdude,
it is a bit more complex than just looking at histograms with the same number of variations.
First, for analysis it is better to look at how move suggestions evolve in Leela over time and with the number of variations.
Important moves I examined in Leela directly, and watched how they evolve. I noticed that after some 50k variations changes are rare and rather slow.
Sometimes a suggestion would come up after some time and stay there. In my paper I listed when I first noticed variations appearing; if it is after 2k, it is basically immediate.
Sometimes a suggestion appears early, stays on top for several tens of thousands of variations, and then changes. If the game is already decided, you can choose an early move suggestion and skip waiting.

In the game Metta-Ben David, black's move 139 is very interesting. It is a very strong attack on white, and cuts off part of his group.
It is interesting that it appears only after quite a lot of variations. Now, this was an important and difficult move, and in any case it is to be expected that a player in this situation would want to do more calculation on it.

-----

Histograms of entire game that I inserted are actually less than 50k variations.
Analysis with 50k or 200k I had to do in several sessions, so I don't have them in one piece.
When doing them, I noticed, as you have now, that faster histograms differ slightly from more detailed ones, and that is why I kept them.

Now, regarding the deviations histogram, it shows how much better or worse Leela thinks a move is than hers. It is not the difference from her moves: moves can be totally outside her suggestions, but with a similar chance of winning.
After one side reaches more than an 80% chance of winning in the deviations histogram, you can play every possible move, and it would still be pretty much the same as Leela's chance. For the same reason, in the game Master vs Alpha Zero, one program played a completely stupid move inside the other's territory, and the other program equally stupidly replied there. It is not bad by their calculation.
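A rough sketch of the deviation idea described above, assuming that for every move we already have two winrates from the mover's point of view: the one after the engine's preferred move and the one after the move actually played. The 80% cut-off is the figure mentioned above; the function name and the sample numbers are hypothetical, and this is not how Leela's own histogram is computed internally.

```python
from typing import List, Sequence, Tuple


def move_deviations(
    best_winrates: Sequence[float],
    played_winrates: Sequence[float],
    saturation: float = 0.80,
) -> List[Tuple[int, float]]:
    """Return (move_number, deviation) pairs for undecided positions only.

    deviation = winrate after the played move - winrate after the engine's
    preferred move, both in [0, 1] from the mover's point of view.
    Positions where the engine already sees one side above `saturation`
    are skipped, because almost any move then scores about the same.
    """
    result = []
    pairs = zip(best_winrates, played_winrates)
    for move_no, (best, played) in enumerate(pairs, start=1):
        if best >= saturation or best <= 1.0 - saturation:
            continue  # game looks decided to the engine; little information
        result.append((move_no, played - best))
    return result


# Hypothetical winrates for four consecutive moves by one player.
best = [0.52, 0.60, 0.85, 0.95]
played = [0.50, 0.45, 0.84, 0.95]
for move_no, dev in move_deviations(best, played):
    print(f"move {move_no}: deviation {dev:+.2f}")
```

With these sample numbers only the first two moves are reported; the last two fall in the region where the histogram says little about the player.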

-----

Here is histogram from a game of one European pro in PGETC R4:

It is clearly very similar to Leela, but as I wrote earlier, it has to be examined in more details.
That is why in paper I compared move after move, and wrote differences in xls file to be more visible.
In this game fighting started early, and there was lot of forced moves on both sides, not surprisingly resulting in lot of Leela's top choices appearing.
Tenukis also matched Leela's, but they were obvious even for me.
Also something interesting: white had some 5 moves that were not among Leela's suggestions at all, and they are not listed as bad moves. Please note that in Metta's two online games there was no move in the middle game that was not among Leela's suggestions, actually among the top suggestions.
Overall speaking, I don't think that this game is similar to Leela's as one might think at first. If it lasted longer and if it was not forced so much, more differences would be visible.

I have found more similar short games; those are the games that are mentioned as having a similar percentage to Metta's, but they are much different.
Overall, short fighting games are not so good for comparison.

Bojanic · Post by **Bojanic** » Fri Jun 15, 2018 1:17 am

Uberdude wrote: Bojanic found games from other players in the league this year with not much red on their GRP Leela histograms, and his suspicion/conclusion was that they were cheating with Leela too. My interpretation of this is that it more likely shows non-cheating players can be similar to Leela.

As explained in previous post, it is not my conclusion.
The game we mentioned in PM I analyzed in more detail. It is not just histogram similarity: most of the fighting moves were Leela's top choice.
The only moves that were not its top choices were, quite interestingly, two tenukis (both were not so bad moves IMO).
I analyzed two older similar games from the same player, and in fighting they also contain most of Leela's choices (not unusual since it was forced), but there are fewer top choices and more mistakes.

Overall speaking in this case, I am very suspicious of using program assistance in fighting sequences, but it would be much more difficult to make analysis than Carlo's, since more games should be analyzed.
Therefore I decided to wait to see what will happen to analysis of Metta's games.

Bojanic · Post by **Bojanic** » Fri Jun 15, 2018 1:34 am

Tryss,
after move 44 Leela thinks you have 100% chances of winning, and you can play almost anything, she will not consider it a mistake.
But if you open GRP file, and go move by move, you will see how much your moves are different than Leela's.

Bojanic · Post by **Bojanic** » Fri Jun 15, 2018 1:37 am

Bill Spight wrote:As I discovered back in the 1980s, the internet is a hot medium, in McLuhan's terms. Back then, one of the most frequent online sentences was "You didn't read what I wrote."

That is actually a quite long pattern emerging.

maf · Post by **maf** » Fri Jun 15, 2018 2:08 am

Bojanic, could you update your PDF such that it reflects the discussion since it was first published? I found it very hard to understand what work exactly you did and what not, and which data exactly you used (also for comparison with other players). That is very important. It would also be good to base it on a null hypothesis (i.e. no cheating) and work from that to your hypothesis that cheating in fact did occur.

You have probably read the rebuttal from the Italian professor, which was made public a while ago and which highlighted a lot of valid and very important ideas on how an analysis can and can not be done. If you could work that into your PDF, I believe that would make it a lot more solid.
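To make the null-hypothesis suggestion above concrete, here is a minimal sketch, assuming a match rate (for example top-3 agreement with the engine) has been computed for the suspect game and for a set of control games known to be clean. The function name and all numbers are hypothetical; they are not data from any of the analyses in this thread.

```python
from typing import Sequence


def empirical_p_value(suspect_rate: float, control_rates: Sequence[float]) -> float:
    """One-sided empirical p-value under the null hypothesis of no cheating.

    Counts how many control games match the engine at least as often as the
    suspect game, with the usual +1 correction so the result is never zero.
    """
    if not control_rates:
        raise ValueError("need at least one control game")
    at_least = sum(1 for rate in control_rates if rate >= suspect_rate)
    return (at_least + 1) / (len(control_rates) + 1)


# Hypothetical match rates of nine control games (e.g. pre-Leela seasons)
# and of one suspect game.
control = [0.55, 0.62, 0.70, 0.58, 0.81, 0.66, 0.73, 0.60, 0.69]
suspect = 0.80
p = empirical_p_value(suspect, control)
print(f"empirical p-value: {p:.2f}")
```

With so few control games the p-value cannot get small, which is one way of seeing why a large control group matters before drawing conclusions.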

Bojanic · Post by **Bojanic** » Fri Jun 15, 2018 2:21 am

maf wrote:Bojanic, could you update your PDF such that it reflects the discussion since it was first published? I found it very hard to understand what work exactly you did and what not, and which data exactly you used (also for comparison with other players). That is very important. It would also be good to base it on a null hypothesis (i.e. no cheating) and work from that to your hypothesis that cheating in fact did occur.

I will try to do it today.
Although there was lot of time wasting, some of the members gave good contribution.

maf wrote:You have probably read the rebuttal from the Italian professor, which was made public a while ago and which highlighted a lot of valid and very important ideas on how an analysis can and can not be done. If you could work that into your PDF, I believe that would make it a lot more solid.

I don't consider a simple statistical method accurate enough, either his or Cieply's. In statistics you have same value of forced move, to important middle game, it makes no sense. It can be only basis for additional research.
It is better to examine move by move.

Jan.van.Rongen · Post by **Jan.van.Rongen** » Fri Jun 15, 2018 2:41 am

@Bojanic - in that first game you give from Carlo, Leela 0.11 was not even available.

20K simulations is only 4 seconds on a modest GPU. Those are not the moves Leela would suggest when the thinking time is normal and you run Leela 0.11 in analysis mode. On my laptop with a 1050 GPU it will usually be over 300K.

What you call "tenuki" in game 2 is very subjective.

The evaluation diagrams for other AIs are very different. After 139 in game 2 I let AQ take white against Leela 0.11 and it wins 3-0.

Tryss · Post by **Tryss** » Fri Jun 15, 2018 3:57 am

Bojanic wrote:Tryss,
after move 44 Leela thinks you have 100% chances of winning, and you can play almost anything, she will not consider it a mistake.
But if you open GRP file, and go move by move, you will see how much your moves are different than Leela's.

And? If I cheated using LZ, why wouldn't I play moves different from LZ if it doesn't make me lose?

It's strange that I stop playing like LZ when I get close to 100% winrate!

RobertJasiek · Post by **RobertJasiek** » Fri Jun 15, 2018 4:21 am

Bojanic wrote:In statistics you have same value of forced move, to important middle game, it makes no sense. It can be only basis for additional research.

It sounds relevant, but I understand nothing of what you want to say in this quotation. Please explain what you mean.

Why in statistics? Why only in statistics? What value? What, for your intention, is a forced move? What is the middle game and why does it not run in parallel from the game start to the game end? What research do you want to see?

Uberdude · Post by **Uberdude** » Fri Jun 15, 2018 4:25 am

Bojanic wrote:
Uberdude wrote: Bojanic found games from other players in the league this year with not much red on their GRP Leela histograms, and his suspicion/conclusion was that they were cheating with Leela too. My interpretation of this is that it more likely shows non-cheating players can be similar to Leela.
As explained in previous post, it is not my conclusion.
The game we mentioned in PM I analyzed in more detail. It is not just histogram similarity: most of the fighting moves were Leela's top choice.
The only moves that were not its top choices were, quite interestingly, two tenukis (both were not so bad moves IMO).

Fair enough, that's why I wrote suspicion/conclusion as I wasn't sure how far along the scale just the histograms put you.

Your not-much-red histogram of the European pro is interesting, and I think you should include it in your report: at the moment your case suffers from not having control groups (of other players, having live games of Carlo is a good start) so it's easy to criticise with "maybe other innocent people exhibit the same characteristics that you take as evidence of Leela cheating". If your case was simply "not-much-red => cheated with Leela" then this Euro pro game would be a counterexample so damage it (to support it you'd need to analyse lots of random games of others (what my recent post starts to do and I believe you've done some too) and show they all had lots of red). However, as you are only using "not-much-red" as the first stage, and then doing a more detailed tenuki/sequence analysis, showing another game which looks guilty on the not-much-red aspect but not with the tenuki analysis would strengthen the case. Of course we'd then need to check how valid is this tenuki analysis, with only a few data points it's going to be hard to prove much, but maybe it could work like that chess example. And avoid cherry picking.

Jan.van.Rongen wrote:@Bojanic - in that first game you give from Carlo, Leela 0.11 was not even available.

He did note this in the pdf (post link for convenience) and used 0.10 for that game.

Jan.van.Rongen wrote:20K simulations is only 4 seconds on a modest GPU. Those are not the moves Leela would suggest when the thinking time is normal and you run Leela 0.11 in analysis mode. On my laptop with a 1050 GPU it will usually be over 300K.

When I did the Leela top 3 analysis a while ago I used 50k nodes per move (which takes about 45 seconds on my 5 year-old-and-no-GPU laptop) because that's what the initial investigation into Carlo's game used. I wonder where it came from, plucked out of the air/arse? It would have been a good idea to find out the specs of Carlo's computer (let's assume he didn't use a remote one) and with reference to move times in the game so the node count was plausible for cheating in that game. Or failing that survey some people and find a typical value. But the concern about too low playouts is real, particularly as Leela 0.11 starts off liking moves favoured by the policy network which is an excellent model of "What is the human good shape intuition move in this position?" and tends to switch to "Leela but not human" style moves only after more analysis.

Jan.van.Rongen wrote:The evaluation diagram for other AI's are very different. After 139 in game 2 I let AQ take white against Leela 0.11 and it wins 3-0,

Isn't that easy to dismiss because AQ is stronger than Leela 0.11? Though it does at least show that the position is not hopeless for white after 139 and a stronger player can still win vs a weaker. What would be more interesting is if AQ as white beat AQ as black.

Bojanic · Post by **Bojanic** » Fri Jun 15, 2018 5:33 am

Jan.van.Rongen wrote:@Bojanic - in that first game you give from Carlo, Leela 0.11 was not even available.

Uberdude already answered (thx for reading carefully)

Jan.van.Rongen wrote:20K simulations is only 4 seconds on a modest GPU. Those are not the moves Leela would suggest when the thinking time is normal and you run Leela 0.11 in analysis mode. On my laptop with a 1050 GPU it will usually be over 300K.

I analyzed on older machine, so it took some time.

Jan.van.Rongen wrote:What you call "tenuki" in game 2 is very subjective.

We spoke on definition of tenuki here, and I agree it is very subjective.
For this research, I considered it a move that is away from current stream of moves, or particular group, and most importantly - which is not forced.
In a move which is not forced we can more clearly see strength of player.

Jan.van.Rongen wrote:The evaluation diagrams for other AIs are very different. After 139 in game 2 I let AQ take white against Leela 0.11 and it wins 3-0.

Yes, it would be good to analyze some games in other AIs, but I am with very limited time and resources.

Bojanic · Post by **Bojanic** » Fri Jun 15, 2018 5:45 am

Uberdude wrote: Your not-much-red histogram of the European pro is interesting, and I think you should include it in your report: at the moment your case suffers from not having control groups (of other players, having live games of Carlo is a good start) so it's easy to criticise with "maybe other innocent people exhibit the same characteristics that you take as evidence of Leela cheating".

Since I made histograms of all games from this year A league, I did consider it, but dropped it.
The main focus is the question: what is the difference between the suspect's games on the internet and his live games?
Because that is the difference we are researching. If someone else plays too similarly to a program, examine him.
I wrote that only a handful of games had diagrams with very little red. And none of those games is as similar to Leela as Metta's two games.

Uberdude wrote:If your case was simply "not-much-red => cheated with Leela" then this Euro pro game would be a counterexample so damage it (to support it you'd need to analyse lots of random games of others (what my recent post starts to do and I believe you've done some too) and show they all had lots of red). However, as you are only using "not-much-red" as the first stage, and then doing a more detailed tenuki/sequence analysis, showing another game which looks guilty on the not-much-red aspect but not with the tenuki analysis would strengthen the case.

"Not much red" approach is similar to statistics used (it used same data, after all). It is just case for suspicion.
Most important part is comparison of every move. I will try to explain it more in updated paper.

Uberdude wrote:Of course we'd then need to check how valid is this tenuki analysis, with only a few data points it's going to be hard to prove much, but maybe it could work like that chess example. And avoid cherry picking.

Please note that I examined every move, and that I consider tenuki moves as more important. They are in no way forced.
On the other side, forced moves are considered of less importance.
This is attempt to try to find moves which are more important - something that first analysis lacked.
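One way to express "forced moves count less" numerically is a weighted match rate. The sketch below assumes every move has already been labelled by hand as forced or not (which, as discussed above, is subjective); the weight of 0.25 for forced moves, the function name, and the sample labels are arbitrary assumptions for illustration, not values taken from the paper.

```python
from typing import Sequence


def weighted_match_rate(
    matches: Sequence[bool],
    forced: Sequence[bool],
    forced_weight: float = 0.25,
) -> float:
    """Engine-match rate in which forced moves carry less weight.

    matches[i] is True when move i agrees with the engine's choice;
    forced[i] is True when move i was judged to be forced.
    """
    if len(matches) != len(forced):
        raise ValueError("matches and forced must have the same length")
    total = 0.0
    hit = 0.0
    for is_match, is_forced in zip(matches, forced):
        weight = forced_weight if is_forced else 1.0
        total += weight
        if is_match:
            hit += weight
    return hit / total if total else 0.0


# Hypothetical labels: two forced moves that match the engine, then two
# free moves (tenuki-like), of which only one matches.
matches = [True, True, False, True]
forced = [True, True, False, False]
weighted = weighted_match_rate(matches, forced)
print(f"plain rate: {sum(matches) / len(matches):.2f}, weighted: {weighted:.2f}")
```

With these labels the plain rate of 0.75 drops to 0.60 once the forced matches are discounted.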

theoldway · Post by **theoldway** » Fri Jun 15, 2018 6:24 am

Bojanic wrote:
Jan.van.Rongen wrote:20K simulations is only 4 seconds on a modest GPU. Those are not the moves Leela would suggest when the thinking time is normal and you run Leela 0.11 in analysis mode. On my laptop with a 1050 GPU it will usually be over 300K.
I analyzed on older machine, so it took some time.

Here is another flaw. What if we are analyzing a move that Leela suggests as good, but only after, let's say, 50k nodes (= 40-60 secs without a dedicated GPU = 15 secs with a standard GPU), while the accused player took only 5-10 seconds to play it?
In all the analyses we've discussed so far, has anyone considered every move for the same time that the accused player actually used?

Otherwise all these analyses are worth almost nothing.
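The timing check asked for above could be sketched as follows, assuming three things are known per move: the playout count at which the engine first shows the played move as its choice, the seconds the player actually spent, and an estimate of the suspect machine's playouts per second. All names and numbers here are hypothetical.

```python
from typing import List, Sequence


def implausible_moves(
    first_seen_playouts: Sequence[int],
    seconds_used: Sequence[float],
    playouts_per_second: float,
) -> List[int]:
    """Move numbers where the engine could not have shown the move in time.

    A move is flagged when the playouts needed before the engine first
    prefers it exceed what the machine could search in the time spent.
    """
    flagged = []
    pairs = zip(first_seen_playouts, seconds_used)
    for move_no, (needed, seconds) in enumerate(pairs, start=1):
        budget = seconds * playouts_per_second
        if needed > budget:
            flagged.append(move_no)
    return flagged


# Hypothetical numbers: about 1000 playouts per second, three moves.
first_seen = [2_000, 50_000, 8_000]
seconds = [10.0, 8.0, 30.0]
print(implausible_moves(first_seen, seconds, playouts_per_second=1000.0))
```

Here only the second move is flagged: it needs 50k playouts but was played in 8 seconds, so on this assumed hardware a match with the engine on that move says little about assistance.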

Life In 19x19
