As I mentioned before, LZ is stronger than Leela 0.11 and also more opinionated, so if one player gets what LZ considers a big lead (one that isn't so big according to the weaker 0.11 or to humans), subsequent winrate changes are less useful for identifying player mistakes. And indeed, here we can see that Leela 0.11 doesn't reach the high 90s so quickly, so more mistakes (according to Leela) are visible.
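To make the "winrate change" idea concrete, here is a rough Python sketch of how a per-move loss could be derived from a list of winrates. It is only an illustration, not how Leela or GRP actually build their histograms: the function names, the 5-point threshold, and the input format (Black's winrate in percent, one value per position, Black moving first, no passes or handicap) are all my assumptions.

```python
from typing import List, Tuple


def mover_losses(black_winrates: List[float]) -> List[Tuple[int, str, float]]:
    """Return (move number, colour, winrate lost by the mover) for each move.

    Assumption: black_winrates[0] is Black's winrate (percent) before move 1
    and black_winrates[i] is Black's winrate after move i. Black plays the
    odd-numbered moves.
    """
    losses = []
    for move in range(1, len(black_winrates)):
        before = black_winrates[move - 1]
        after = black_winrates[move]
        if move % 2 == 1:
            # Black moved: a drop in Black's winrate is Black's loss.
            losses.append((move, "B", before - after))
        else:
            # White moved: a rise in Black's winrate is White's loss.
            losses.append((move, "W", after - before))
    return losses


def flag_mistakes(black_winrates: List[float],
                  threshold: float = 5.0) -> List[Tuple[int, str, float]]:
    """Keep only moves whose loss is at least `threshold` points (hypothetical cutoff)."""
    return [entry for entry in mover_losses(black_winrates) if entry[2] >= threshold]


print(flag_mistakes([50.0, 48.0, 60.0, 58.0, 59.0]))  # [(2, 'W', 12.0)]
```

The saturation problem falls straight out of this: once the bot already has one side in the high 90s, the leader's winrate has almost no room left to fall without a large swing, so few later moves clear any fixed threshold.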
For comparison, here is the Leela 0.11 winrate histogram at 200k for Carlo vs Reem. It is interesting to compare this with the one in Bojanic's pdf (which I think was at about 30-50k) to examine reproducibility and the effect of more playouts. Overall we see smaller red bars (and no green) in generally the same places, but there are some differences (e.g. mine shows Carlo making several moderate mistakes in a row around move 120, which Bojanic's does not).
Bojanic found games from other players in the league this year with not much red on their GRP Leela histograms, and his suspicion/conclusion was that they were cheating with Leela too. My interpretation is that this more likely shows that non-cheating players can play similarly to Leela. An obvious way to resolve this is to examine games in which we know for sure that people didn't cheat with Leela. The early seasons of PGETC, before Leela existed, are perfect for this: http://pandanet-igs.com/communities/eur ... rounds/1#1. I'm still setting up pnprog's analysis kit for a more rigorous analysis with lots more games, but as a taster here are some histograms of games from league A round 1 back in 2010 (at 50k nodes, I'll do 200k later).

I also think we need to remember that just because Leela 0.11 says a move is bad (red), it doesn't mean it really is. So 7d/8d players having more red doesn't mean they played worse than Carlo did with less red; it could be that they played better moves that Leela just doesn't evaluate correctly (in the pro game I analysed, Lee Sedol's Leela top 3 matching was at the low end of the range for our mid-high amateur dans, and Park Junghwan's was in the middle). Using a bot that is clearly much stronger (LZ) could help us distinguish these cases. My guess is that LZ will agree that most of the moves Leela 0.11 marks red were bad too, but that a sizeable minority were actually OK or better. Another of my todos…
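For anyone who wants to play with the "top 3 matching" number themselves, here is a minimal Python sketch of one way it could be computed. The input format is my assumption (it is not the output format of Leela, GRP, or pnprog's kit): one played move per position, plus the bot's candidate moves for that position ordered best first. The function name is hypothetical too.

```python
from typing import List, Sequence


def top_n_match_rate(played: Sequence[str],
                     candidates: Sequence[List[str]],
                     n: int = 3) -> float:
    """Fraction of played moves that appear among the bot's top n candidates.

    Assumption: candidates[i] lists the bot's suggestions for the position in
    which played[i] was chosen, best first.
    """
    if len(played) != len(candidates):
        raise ValueError("need one candidate list per played move")
    if not played:
        return 0.0
    hits = sum(1 for move, cands in zip(played, candidates) if move in cands[:n])
    return hits / len(played)


# Made-up moves purely for illustration, not taken from any real game.
played = ["Q16", "D4", "C3", "R4"]
candidates = [
    ["Q16", "D4", "Q4"],
    ["D16", "D4", "Q4"],
    ["R4", "Q3", "C4"],
    ["R4", "C3", "D3", "Q3"],
]
print(top_n_match_rate(played, candidates))       # 0.75
print(top_n_match_rate(played, candidates, n=1))  # 0.5
```

A high match rate only says the player and the bot often agree. As with the red bars, it cannot by itself tell a strong player who thinks like the bot from one who copies it.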
Edit: I now have 2 histograms at both 50k and 200k to compare.