Life In 19x19

Posted: **Mon Nov 09, 2020 9:01 am**

This thread discusses the paper Derived Metrics for the Game of Go - Intrinsic Network Strength Assessment and Cheat-Detection by Attila Egri-Nagy and Antti Törmänen.
https://arxiv.org/pdf/2009.01606.pdf

So far I have read to section 3.1.

The visit count N(s,a) is defined as the number of times a variation starting with move a at position s is examined. Presumably, this has been clarified elsewhere but I wonder: is this the number of leaves below the (s,a) node or the number of the search algorithm's walks through the (s,a) node?

Apparently, the scoremean is defined as a mean over all visited scores (at the leaves, I suppose) during a Monte-Carlo search. When correct subsequent play approaches a leaf, the scoremean can converge to strong human score prediction. So far so good. However, in the general case, which includes many positions far earlier than a leaf, there may be some stability in the values and game-tree-local convergence for strong AI play but we do not know by how much in every specific position scoremean and strong human score prediction differ. The scoremean does not equal strong human score prediction.

The paper says: "Every move played in a game reduces the number of its future possibilites." Unless a superko or similar rule applies, this is just a conjecture and disproven by this counter-example: White's two-eye-formation fills the board, White fills an eye, Black passes, White fills an eye committing suicide (assuming it is legal according to the rules). The resulting position has a greater number of future possibilities than the initial position. To get a theorem instead of a conjecture, some presuppositions need to be stated and a proof is required.

The effect of a move is defined as the difference of scoremeans after and before it. The paper says that statistical information on the effects describe the playing skill of a player. No. It only describes a model of the playing skill of a player because the scoremean is only a model of correct positional judgement.

Posted: **Mon Nov 09, 2020 9:07 am**

RobertJasiek wrote: The paper says: "Every move played in a game reduces the number of its future possibilites." Unless a superko or similar rule applies, this is just a conjecture and disproven by this counter-example: White's two-eye-formation fills the board, White fills an eye, Black passes, White fills an eye committing suicide (assuming it is legal according to the rules). The resulting position has a greater number of future possibilities than the initial position. To get a theorem instead of a conjecture, some presuppositions need to be stated and a proof is required.

How can the resulting position have a greater number of future possibilities than the initial position, if the resulting position can be reached from the initial position? Wouldn't every future possibility of the resulting position be, by definition, a future possibility of the initial position?

Posted: **Mon Nov 09, 2020 9:12 am**

It depends on a careful definition of "number of possibilities", of course. I have only sketched of how a counter-example looks like. Hint: refer to "force" a la Japanese 2003 Rules, that is, imply an aim of applying perfect play.

Posted: **Mon Nov 09, 2020 9:49 am**

HermanHiddema wrote:How can the resulting position have a greater number of future possibilities than the initial position, if the resulting position can be reached from the initial position? Wouldn't every future possibility of the resulting position be, by definition, a future possibility of the initial position?

No, because games end. And they typically end when there are many open points on the board which could be played, which are not. When the game ends, they cease to be possibilities. OC, with today's rules, ending the game does not alter the position of the board, so we typically have the same position which has many possibilities before the game ends and zero possibilities afterwards. However, human players typically recognize when the game is over, even if we sometimes make mistakes. So in a sense these end of the game possibilities are not really possibilities at all, as neither player will make those plays near the end of the game.

This matters earlier in the game. For instance, light play generally leaves more possibilities than heavy play. OTOH, failure to play the coup de grace typically opens up possibilities for the opponent.

Yes, if the players are random robots then any legal play is possible. But we are not talking about random robots.

Posted: **Tue Nov 10, 2020 2:04 am**

Now I discuss section 3.2.

P(s,a) is defined as the prior probability of a move a in position s, as provided by the raw network output. This probability distribution p is called the policy. I wonder: distribution over what? For a given s, the distribution over all a? The tree search yields an updated policy pi, measured by visit counts.

A table for four sample positions of different games compares hitrate percentages. To me, it is unclear what is the relation to the probability distributions? Are the hitrate percentages meant to be probability distributions pi?

The Kullback-Leibler divergence, or KL-divergence, is introduced for the two discrete probability distributions p and pi (apparently over all moves; restricted to legal moves) as

D_KL (p || pi) = SUM p ln (p/pi)

and said to measure the [relative] disparity of the two distributions and how much information is gained [or, my comment, lost] if the distribution pi instead of the distribution p is used.

Now, this deserves explanation. Information is mentioned but which information is meant? According to Das ist Information, Horst Völz, p. 49, it must be S-information, or Shannon-information, which is defined as the entropy

H = -SUM p ln p.

ln p is zero-or-negative so the formula needs the minus sign. The idea is to count incidents (here: moves) occuring with a probability p(a) and along a tree variation with an (idealised) tree depth ln p. So indeed the Kullback-Leibler divergence is derived from a measurement of amount of S-information.

Unlike the Shannon entropy, the Kullback-Leibler divergence compares p(a)/pi(a), that is, relative change of probability of each move. When p > pi, then ln p/pi is positive. When p < pi, then ln p/pi is negative.

Hence D_KL indeed measures a change of the amount H of information (or you can say: entropy).

I think that if p(a) < pi(a) for a move a, the impact of the tree search is an increment of probability, that is, a likely indication of a good tree search on the move a. Have I understood this correctly? So ln p(a)/pi(a) is negative. Hence, the smaller the Kullback-Leibler divergence D_KL the better the quality of the tree search. The paper shows such an example in which a b40 Go-AI network shows smaller Kullback-Leibler divergence than a (supposedly weaker) b20 Go-AI network.

As a minor aspect, the paper speaks of normalising probabilities of a selection of legal moves. I wonder how that normalisation is defined. I can guess but you better tell me.

Posted: **Tue Nov 10, 2020 6:41 am**

Well, I finally took a look at the paper this morning. One thing that struck me was the lack of empirical validation of cheating detection. That would be a serious flaw in a paper presenting a method for detecting cheating. But the paper does not make that claim. It is more modest. It claims to only to have found "a fine-grained, move-by-move performance evaluation of a player." Now, the words, "performance evaluation" cry out for an empirical study, but the paper has developed a worthy hypothesis for testing.

Edit: Current empirical validation of performance evaluation is far from fine grained, at the move by move level. It is on the order of hundreds or thousands of games.

Posted: **Tue Nov 10, 2020 9:47 am**

Now I discuss section 4 about intrinsic strength of networks.

The paper studies examples of a simple measure called hitrate (how often search hits the same top move of the raw policy) expressed as a percentage and implies that a higher hitrate was better.

Then it studies example games observing that a smaller KL-divergence relates to stronger play. Here, the paper first ought to have stated and explained some theory of exactly that before demonstrating its application to example games. Namely the theory I have already described: discrete S-entropy, KL-divergence as a measurement of its change and the logarithm implying smaller values being better. Not samples imply the theory but an understanding of the theory implies how good samples should (and apparently do) behave.

I understand why KGS 4d+ games have been used as a sample. However, this is also a bit funny because KGS games tend to be fast with lots of large blunders that strong amateurs should not make by far as often in real world games. Luckily, even the somewhat dubious sample illustrates the hidden theory.

EDIT: this section explains the paper's table I have wondered about earlier.

Posted: **Tue Nov 10, 2020 10:22 am**

The beginning of section 5 writes: "Players with a rating history are easier to catch from cheating by noticing a sudden increase in their won games." Sudden increments can have various causes, such as a player starting a period with great concentration and diligence. Such cannot be easily distinguished from cheating.

Posted: **Tue Nov 10, 2020 10:52 am**

Section 5.1 is about human ways of recognizing an AI-using cheater. Points are made in favour of detecting cheating, however, every point I can also interpret as not cheating as follows.

A player can delay his moves not only to await AI input but also to maintain and reassure psychological self-control.

A player can play many unexpected moves because he copies them from AI or because the player himself is very creative and unconventional in his playing style.

A player can play safely when ahead because he copies AI or because he knows how to win a won game and controls his own psychology calmly.

The paper claims that AI does not follow strategic plans, which can be expressed in human terms. Wrong. I already described such during the early AlphaGo days, when AI performed exactly according to my previously described generally applicable strategic planning of a certain kind (best reducing large sphere of inflence).

Posted: **Tue Nov 10, 2020 2:04 pm**

RobertJasiek wrote:A player can play safely when ahead because he copies AI or because he knows how to win a won game and controls his own psychology calmly.

I actually think that this is an area where humans and bots choose recognizably different plays. When the bots are far ahead they can, and often do, approach random play. Human's don't. Human play is more consistent.

The paper claims that AI does not follow strategic plans, which can be expressed in human terms. Wrong. I already described such during the early AlphaGo days, when AI performed exactly according to my previously described generally applicable strategic planning of a certain kind (best reducing large sphere of influence).

Those in the know can correct me if I am wrong, but I thought that deep recursive networks allowed planning by the network. Also, so does deep reading.

Posted: **Tue Nov 10, 2020 10:28 pm**

Now I discuss the rest of the paper.

For cheat detection, the paper considers a winrate graph over a game's moves according to the AI's stated probabilites. One player cheating is described as the graph steadily going upwards to 99%, both players cheating as a rather constant graph.

Regardless of indirect calculation of winrates, I agree that such graphs identify suspective players because such graphs express them to make essentially no significant mistakes. They are an indication of possibly occurring cheating but not a proof of it.

The paper claims that a player's average 'effect' and consistent development of his moves' effects indicated his skill. Since effect is calculated from scoremeans, again, this is wrong. Effects only indicate a model of skill. See my earlier remarks.

The paper repeats its earlier mistakes, which I have mentioned for earlier sections.

It compares a player's moves to KataGo's first move suggestions; close agreement is said to let a player be suspicious. Suspicious, why not; every player can be suspected to be cheating. However, there is another explanation than cheating: a player can have trained a lot with KataGo or have a similar playing style. Besides, there is the problem that a player might be cheating using a different AI program or different KataGo network; then KataGo's moves might not be particularly suitable for comparison to the player's moves.

The paper studies four example games discussing indications of cheating. Again, it frequently repeats its earlier mistakes. Per player and game, a few indicators are considered to judge about cheating. Most alleged indicators are interpreted as indicating cheating, although they can also be interpreted as the opposite. The paper's systematic repetition of its earlier mistakes combined with indicators interpreted with prejudice of indicating cheating produces alleged detection of cheaters on the implied assumption that several such aspects combined would be sufficiently convincing evidence. A warning of caution that false allegations can occur only serves as an alibi. Such an approach of cheating detection is bound to detect cheaters regardless of what percentage of players is judged upon wrongly.

Another part of the problem is that the paper has introduced values (called metrics) and applies them while the paper does not provide theory for distinguishing when values, or combinations of values, indicate versus do not indicate cheating. This is like statistics without confidence thresholds for individual values, let alone for combined consideration of several same kinds of values and then for different kinds of values.

Value graphs are supposed to be interpreted subconsciously by human arbiters. Instead, there ought to be theory interpreting data represented in graphs, such as analysing differences between two value curves in a graph.

Such theory requires agreement to very large samples of games and their values and graphs. This is so also because different board positions and different games can have different behaviours of values. E.g., imagine a semeai with two local maxima: one correct and one wrong; when calculting an average for a roughly balanced tree search, also the average will be a wrong indication. Currently, such is interpreted as being as good as all other values for ordinary positions.

The paper's value analysis applied to players creates an unfair prejudice: some players with specific playing styles, studying with specific AI programs or having studied much with AI are in much greater danger of being wrongly indicated as cheaters.

In conclusion, although the paper suggests some values potentially useful for some studies or models, the theory is very far from rather safe application as distinction between cheating and no cheating, except for the mentioned success cases of tool usage followed by a player admitting cheating. Currently, the theory is very incomplete, is over-interpreted and frequently advertised by the paper's authors within the paper as being more (an alleged description of reality, such as "a player's skill") than it is (only a model such as "a model of a player's skill"; furthermore a model lacking quality evaluation, which - for the promoted application of cheating detection - is essential).

Posted: **Wed Nov 11, 2020 3:00 am**

RobertJasiek wrote:Apparently, the scoremean is defined as a mean over all visited scores (at the leaves, I suppose) during a Monte-Carlo search. When correct subsequent play approaches a leaf, the scoremean can converge to strong human score prediction. So far so good. However, in the general case, which includes many positions far earlier than a leaf, there may be some stability in the values and game-tree-local convergence for strong AI play but we do not know by how much in every specific position scoremean and strong human score prediction differ. The scoremean does not equal strong human score prediction.

Sure enough, human counting and AI counting are different.

An interesting feature of the scoremean is that KataGo is reliably able to produce ties against itself (with an integer komi), showing that its calculations are at least consistent to a degree. (For example Leela Zero is unable to do the same.) KataGo also reliably beats Leela Zero, indicating that its understanding of the game should be the better one. While the scoremean values are 'impure' and 'imprecise', unlike human counting, I still think we should give them value.

RobertJasiek wrote:The paper says: "Every move played in a game reduces the number of its future possibilites." Unless a superko or similar rule applies, this is just a conjecture and disproven by this counter-example: White's two-eye-formation fills the board, White fills an eye, Black passes, White fills an eye committing suicide (assuming it is legal according to the rules). The resulting position has a greater number of future possibilities than the initial position. To get a theorem instead of a conjecture, some presuppositions need to be stated and a proof is required.

I'm not sure I get this. For every move, you could have played the move or passed. After playing a move (or passing), you clearly have a smaller number of possible futures remaining.

RobertJasiek wrote:The effect of a move is defined as the difference of scoremeans after and before it. The paper says that statistical information on the effects describe the playing skill of a player. No. It only describes a model of the playing skill of a player because the scoremean is only a model of correct positional judgement.

For sure, a player's average effect in a single game does not make it possible to accurately estimate their playing skill – we never claimed such a thing. There is a fairly strong correlation, however.

RobertJasiek wrote:The paper claims that AI does not follow strategic plans, which can be expressed in human terms. Wrong. I already described such during the early AlphaGo days, when AI performed exactly according to my previously described generally applicable strategic planning of a certain kind (best reducing large sphere of inflence).

Of course you can fit a humanly describable strategic plan to a particular move by an AI.

The point is that the AI's move-choosing procedure itself is not, e.g., 'right now there is nothing urgent going on, so the next most valuable thing is to reduce that sphere of influence, and for that particular task I should use the technique I read about in a theory book.' The AI chooses its moves purely from the information available on the game board, through algorithms not accessible to humans, without trying to infer the opponent's verbalisable strategy.

RobertJasiek wrote:Regardless of indirect calculation of winrates, I agree that such graphs identify suspective players because such graphs express them to make essentially no significant mistakes. They are an indication of possibly occurring cheating but not a proof of it.

I think we should note that the only 'proof' of cheating is the cheater's confession or getting caught in the act, for example by a video recording or a trusted proctor. All other anti-cheating solutions are finally based on probabilities, which I think should be called 'evidence' rather than 'proof'.

RobertJasiek wrote:It compares a player's moves to KataGo's first move suggestions; close agreement is said to let a player be suspicious. Suspicious, why not; every player can be suspected to be cheating. However, there is another explanation than cheating: a player can have trained a lot with KataGo or have a similar playing style. Besides, there is the problem that a player might be cheating using a different AI program or different KataGo network; then KataGo's moves might not be particularly suitable for comparison to the player's moves.

I have tested the model on a wide variety of AIs, even AlphaGo Master, which is trained on human games and plays considerably differently from modern AIs such as KataGo. Superhuman performance is still visible in the graphs, even if KataGo's own value function is slightly different.

Furthermore, being able to for example play KataGo's favoured sequences out from time to time will not mark a player as suspicious, but consistently playing in the roughly right part of the board will. I have seen no evidence of a player's 'familiarity with AI moves' making them stand out in my analysis.

RobertJasiek wrote:Another part of the problem is that the paper has introduced values (called metrics) and applies them while the paper does not provide theory for distinguishing when values, or combinations of values, indicate versus do not indicate cheating. This is like statistics without confidence thresholds for individual values, let alone for combined consideration of several same kinds of values and then for different kinds of values.

I think you may have misunderstood the purpose of the paper. As we are basically just starting the research, we can hardly have a finished product for cheat detection work – if we did, you could already download the software somewhere. The paper describes how different metrics derivable from AI analysis can be used in cheat detection. We cannot possibly do statistics with confidence thresholds if we don't even know what we should measure. This is what the paper starts to tackle.

RobertJasiek wrote:E.g., imagine a semeai with two local maxima: one correct and one wrong; when calculting an average for a roughly balanced tree search, also the average will be a wrong indication. Currently, such is interpreted as being as good as all other values for ordinary positions.

This will however not happen, because the AI will not use the same amount of playouts for both possibilities. You can easily test this with KataGo and see that it gives the (roughly) right scoremean once it figures out the status of the position. Depending on the complexity of a position, this may of course require a larger number of playouts.

RobertJasiek wrote:The paper's value analysis applied to players creates an unfair prejudice: some players with specific playing styles, studying with specific AI programs or having studied much with AI are in much greater danger of being wrongly indicated as cheaters.

As I said above, this claim is unsubstantiated.

RobertJasiek wrote:In conclusion, although the paper suggests some values potentially useful for some studies or models, the theory is very far from rather safe application as distinction between cheating and no cheating, except for the mentioned success cases of tool usage followed by a player admitting cheating. Currently, the theory is very incomplete, is over-interpreted and frequently advertised by the paper's authors within the paper as being more (an alleged description of reality, such as "a player's skill") than it is (only a model such as "a model of a player's skill"; furthermore a model lacking quality evaluation, which - for the promoted application of cheating detection - is essential).

As I also wrote above, I think you have misunderstood the point of the paper. It certainly was not presenting a robust system that can be applied in cheat detection – if that was the case, then the research would be nearing completion, rather than having just started, and we would already have a product to offer. The four cases presented in the paper are examples of how interpretation of the generated graphs can possible quicken the otherwise slow, pure human analysis of alleged cheating cases. None of the presented metrics or procedures is given as a 'general solution', but rather as a check that can be done to get a better idea of what happened. When applying a series of 'cheat filters' such as these and all (or most) of them coming off as 'suspicious', then we have probabilistic data that I believe is more trustworthy than a 'mere' human interpretation made from reviewing a game. When there is no way to get actual 'proof' of cheating, this, I think, can very well be the next best thing.

Posted: **Wed Nov 11, 2020 4:03 am**

What I see here is an argument between mathematics + mathematics and mathematics + common sense. You can easily guess which side I'm on.

It seems from the reports here on the aptly named Corona Cup that online go already has its own CV problem - the Cheating Virus. A vaccine is needed urgently. The world and (more significantly perhaps) its stock markets have already greeted with jubilation the news that there may be a vaccine soon for the real CV that gives "only" 90% protection. An approach that gives 90% protection NOW or SOON against cheating in go is surely to be similarly welcomed.

Posted: **Wed Nov 11, 2020 4:24 am**

Some people worry more about side-effects of the vaccine on healthy people than the protective effect on vulnerable people. The discussion can go on forever. I am confident that the anti-CV committee will make reasonable use of the vaccine.

Posted: **Wed Nov 11, 2020 5:26 am**

NordicGoDojo wrote:An interesting feature of the scoremean is that KataGo is reliably able to produce ties against itself (with an integer komi), showing that its calculations are at least consistent to a degree. [...] KataGo also reliably beats Leela Zero, indicating that its understanding of the game should be the better one. While the scoremean values are 'impure' and 'imprecise', unlike human counting, I still think we should give them value.

Sure.

Where I object is over-interpretation of KataGo's skill, e.g., when the paper refers to "a player's skill".

RobertJasiek wrote:number of its future possibilites
I'm not sure I get this.

I lack to time to work out this. The paper might as well omit the related statement without significant impact, so who cares?:)

a player's average effect in a single game does not make it possible to accurately estimate their playing skill

At the same time, the paper describes the intention of analysing a player's skill (but so far should only speak of a "model of it") from just one game. His demonstrated, what the paper calls, skill shall then be used as a basis for possibly detecting his cheating in this game.

If you do hold your statement, you must at the same time hold that cheating detection by the paper's means from only one game of the player is impossible.

There is a fairly strong correlation, however.

I do not have a problem with seeing a fairly strong correlation, as long as it roughly described as "for an average game of a particular, arbitrary player, the paper's tools can indicate a 'cheating' suspicion under the assumption that the model of the player's performance is his performance".

Of course you can fit a humanly describable strategic plan to a particular move by an AI.

Not just to one move but To particular kinds of move sequences.

The point is that the AI's move-choosing procedure itself is [...]

...not described as a human-readable strategic plan indeed, right. It is well hidden in the network values, pure tree searches and code.

I think we should note that the only 'proof' of cheating is the cheater's confession or getting caught in the act, for example by a video recording or a trusted proctor. All other anti-cheating solutions are finally based on probabilities, which I think should be called 'evidence' rather than 'proof'.

Right.

Therefore, if "statistical" probabilities shall serve as evidence, they require theory for thresholds, levels of confidence and agreement to large samples.

I have tested the model on a wide variety of AIs, even AlphaGo Master, which is trained on human games and plays considerably differently from modern AIs such as KataGo.

Good.

Furthermore, being able to for example play KataGo's favoured sequences out from time to time will not mark a player as suspicious, but consistently playing in the roughly right part of the board will.

(My reply refers to phases before the endgame phase.)

I disagree. A player can have the skill to always play in roughly the, as indicated by AI analysis, right part of the board in some of his games. Such a player need not have superhuman level.

A player is suspicious if he also consistently plays locally close to optimal. If we know he is a strong (or very strong) player, we must be extra cautious and tolerant towards interpreting his skill.

I have seen no evidence of a player's 'familiarity with AI moves' making them stand out in my analysis.

I expect what you describe. Nevertheless and regardless, could you describe your observations so far in more detail, please? We might learn from them.

I think you may have misunderstood the purpose of the paper.

I get it that the paper is an early step in metrics analysis - for that purpose, I do not think to have misunderstood its purpose.

At the same time, at various places, the paper makes detailed statements that go far beyond the aforementioned purpose. I criticise the paper for such over-interpreting statements.

Furthermore, the paper goes far beyond the aforementioned purpose when suggesting and describing application to cheating detection. I also criticise that the paper rushes ahead too fast while it even serves as part of justification of already applying such tools in tournaments.

IOW, there is not just one purpose of the paper - not just the modest purpose of an early step in metrics analysis. This paper does not give an impression like a pure maths paper, such as about KL-divergence gives. Quite contrarily, implicitly the paper is referred to as strong justification when a tournament announcement refers to "state-of-the-art" anti-cheating tools.

Depending on the complexity of a position, this may of course require a larger number of playouts.

Yes, but cheating detection is supposed to be applicable even when, in quite a few positions, there are not enough playouts.

RobertJasiek wrote:The paper's value analysis applied to players creates an unfair prejudice: some players with specific playing styles, studying with specific AI programs or having studied much with AI are in much greater danger of being wrongly indicated as cheaters.
As I said above, this claim is unsubstantiated.

I am not convinced because I do not buy it that the paper's only purpose would be early research. (You might rewrite the paper to convince me by removing all details hinting at advanced application / interpretation, but please do not waste your time on doing so:) As a suggestion for future papers, clearly distinguish current level of understanding and possible future research, and maybe applications beyond a paper described outside a paper itself.)

It certainly was not presenting a robust system that can be applied in cheat detection – if that was the case, then the research would be nearing completion, rather than having just started, and we would already have a product to offer.

Thank you for the clarification!

When applying a series of 'cheat filters'

Good in theory, but only good in practice if each filter itself is convincing - and is not a roughly 50% interpretation chance "cheated or not cheated".

I believe is more trustworthy than a 'mere' human interpretation made from reviewing a game.

I think what might some time become a useful filter is objective analysis of data currently presented as graphs and indicating similar progress (such as "winning chances") during a game of different AIs' moves versus a player's moves. Such characteristics are very hard to fake if they occur absolutely consistently before the stage of already having won a game strategically. Of course, that presumes extensive studies that coincidences do not occur just because of a specific nature of a game's development.

Life In 19x19

Derived Metrics for the Game of Go

Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go

Re: Derived Metrics for the Game of Go