 Post subject: AI verdict on Jowa
Post #1 Posted: Tue Aug 06, 2019 6:11 am 
Oza

Posts: 3655
Liked others: 20
Was liked: 4630
Three months ago Ohashi Hirofumi started a mini-series in Go World as part of its 800th issue celebrations. Using data from the Chinese AI program Golaxy, he looked at three famous games, by Jowa, Shusaku and Dosaku respectively. At the time they appeared I just skimmed over them. I have barely looked at AI games and wanted to read a primer before I looked at these historical games closely.

I was disappointed in not getting my desired primer on last month's trip to Tokyo, and was disappointed overall with the book haul (but amply compensated for by joining the Tokyo branch of the RSCDS in their monthly Scottish country dancing class and a visit with my grandson to the Tamiya Factory for him to pick up some Russian tanks).

I did buy several AI books but most were pot boilers, and the only one I really rated as worth reading on the plane was a book by Ohashi himself. He seems to be the most knowledgeable pro interested in AI and has all the best contacts in China and Korea. (He also writes well.)

But when I came back I decided finally to delve into the historical series without trying to stuff myself full of background. That is not to say that my mind was tabula rasa. I have been looking at some old games and books by comparing their comments with Lizzie. It's a curate's egg mess with books: the only one I found that seemed to score consistently well was Kimu Sujun's recent book on the Four Basic Rules for Surrounding Territory Efficiently (novel stuff: perhaps influenced by AI study?). Most books score more like 50:50 or 60:40. The celebrated Katsugo Shinpyo is more like 0%, although I expect that must have a lot to do with Lizzie evaluating whole-board positions whereas KS is about local positions.

Commentaries and actual games, however, seem to score well on the whole. Even where a pro makes a mistake, a human commentator has usually spotted it before the bot, and more often than not the pro move is close to the top few selected by the bot. The game seems to unravel mainly because of just one or two big mistakes. The other thing I have noticed with old games is that the nominally stronger player generally scores better on the AI scale than the weaker one.

The only game in Ohashi's series I have looked at so far is the Three Brilliancies game between Jowa and Akaboshi Intetsu (this is also the subject of my Slate & Shell book Brilliance).

The most significant point to cover first is the komi. AI bots are generally trained on a 7.5-point komi, and this badly affects the reliability of their assessments of no-komi games. I don't personally understand why, but the Elf team, when they were adding AI commentary to all the GoGoD games, told me this was a major point, although results in the early fuseki are probably not too badly affected.

However, the reason Golaxy was used in Ohashi's series is that it can cope with different komis. Other plus points are that it is probably stronger even than AlphaGo and that it can give its evaluations not just as winning percentages but as point scores. It has concluded that on an empty board with no komi Black wins on average by 6.1 points.

For the evaluations done on the historical games, the machine was run for an entire day, looking at 5 million nodes per move. Reading out the results of that data-crunching was no easy task either.

The overall picture was that Jowa did not make any serious booboos. Intetsu made a couple, but most moves that were not rated best by the computer were close to the best or could be adjudged either simply slack moves or deliberately risky moves - in both cases (as Ohashi takes pains to demonstrate) based on positional evaluations and explainable by psychology. That does not mean the human evaluations were correct, but they were at least rational.

The three brilliancies were not quite the best moves according to the bot, but they were not at all bad, and Ohashi argued (and demonstrated convincingly, I thought) that they should still be regarded as myosu. He said that a myosu is a brilliant move that is hard to see and that is the point. If it was hard for Jowa to see them it was also hard for Intetsu. The best move according to AI for Jowa's first brilliancy was an easy-to-see invasion, but it left him behind overall, as Jowa must have realised. Black's winning ratio at that point was 64%, or 3 points. After waving his Harry Potter wand, Jowa went back to the invasion but now he was level pegging! Intetsu had failed to find the best replies.

(Incidentally there was a case a little later where Black's winning ratio was 61.1% but territorially White was ahead by 0.2 points. Ohashi admits this is hard to understand.)

It has already been pointed out on this forum that there are several cases where a bot does not even list a particular move in its top N moves, but when that move is actually played the win ratio barely changes. The same thing seems to happen with Golaxy. In fact, it often didn't "see" a move preferred by Lizzie (and vice versa, of course). Ohashi does not describe every move, so it's hard to compare Golaxy and Lizzie, but my impression was that Lizzie preferred the same sort of moves as Golaxy but on a couple of crucial occasions totally missed a killer move inside a variation spotted by Golaxy.

Going back to the slack/risky moves point, Ohashi several times made the point that Intetsu (with the advantage of Black, of course) made moves that he must have realised were slack but were safe, as he clearly judged he was ahead (and he was - but he was whittling down his own lead). Ohashi didn't make the point, but I noticed that the far fewer slack moves Jowa made came at points when Intetsu had just made a slack move, as if he were relieved to have the chance to do a bit of "free" patching up. In the case of risky moves, all by Jowa, his timing and psychology seemed to be spot on. Ohashi claims that is also a defining skill of elite players such as Ke Jie and Iyama Yuta. But either way, as already said, these dubious moves all reflected the players' possibly ropey evaluation of the overall position. But that's by far the toughest aspect of the game for humans.

As an example of human evaluation vs computer evaluation consider this position:



The triangled move was 35 in the game. Black had just before made the famous hane in the corner. It was famous because the Inoue school had studied it intensely and thought it was a secret weapon. Indeed, there was long an opinion that Black succeeded with this ploy, but maybe Jowa was not so impressed, and Golaxy certainly wasn't. It rated the Black hane 33 as a bad move that reduced Black's territorial lead from 5.7 to 2.1 points.

However, Jowa and Golaxy differed in their choice of reply. Jowa chose to force at A and then lived in the corner. Golaxy preferred to fight the ko with B and came up with the following line of play that gave the position where it thought it had gained 3.6 points. For a human even to just feel that White had made a gain here is surely problematical - just too much left up in the air. Jowa's (slightly) inferior move may well be regarded as correct for a human.



In all the comments I have read on the new AI style of play, many people have pointed out the new kinds of moves (e.g. high shoulder hits), there have been insightful characterisations of the style (e.g. an emphasis, very early on, on causing the opponent to be overconcentrated), and there have been new words (e.g. the tiger enclosure). But nowhere have I seen anything that suggests that humans have even begun to get a grip on how to evaluate positions such as the second one above. Everything seems to indicate humans are still satisfied (because they have to be) with Jowa's kind of response.

That should not surprise us if we look back at Shin Fuseki. There was great excitement at the time, and many books and articles purporting to elucidate the theory. But it didn't really take too long before even the excitable players more or less resumed normal service, and Shin Fuseki left barely a trace. Of course, in all the excitement new josekis emerged, just as some people are still getting very excited about AI josekis. But there at least we can perhaps say that the AI bots have really done little more than the best human joseki masters such as Go Seigen have already done.

In fact, I have continued to be encouraged by how well humans appear to have done overall in the AI comparisons. Jowa's reputation seems to have been relatively unscathed, at least in human terms, and I gather there may even be surprises in store for how much better Dosaku shows up. I may report on that, and the Shusaku game, in due course, although I have to say that contributing to L19 feels a bit like equine necrophilia these days.


This post by John Fairbairn was liked by 15 people: Bill Spight, dfan, Gomoto, gowan, Hades12, HKA, joachim, jptavan, Leon, lightvector, mycophobia, SoDesuNe, sorin, Uberdude, Umsturz
 Post subject: Re: AI verdict on Jowa
Post #2 Posted: Tue Aug 06, 2019 8:58 am 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
John Fairbairn wrote:
The most significant point to cover first is the komi. AI bots are generally trained on a 7.5-point komi, and this badly affects the reliability of their assessments of no-komi games. I don't personally understand why, but the Elf team, when they were adding AI commentary to all the GoGoD games, told me this was a major point, although results in the early fuseki are probably not too badly affected.


Before looking at the Elf commentaries on old no komi games I had the same impression as the Elf team about the early fuseki. I figured that at that stage of the game the lack of komi would affect the winrate estimates, but probably not the ranking of plays. But when I took a look, I was struck by how often White took winrate losses right off the bat. OC, this was in line with what I had been taught, that White would often make objectively inferior plays in order to complicate the play. The prime example was approaching a Black corner stone at move 4 instead of playing in the open corner, to prevent Black from making a good enclosure. What surprised me was how low a winrate White would accept. Sometimes White would get a winrate of 25% or less in the opening. Since the winrate was predicated on a 7.5 komi, that meant that White was, in effect, adding significantly to the expected number of points Black was ahead. :o

Quote:
The overall picture was that Jowa did not make any serious booboos. Intetsu made a couple, but most moves that were not rated best by the computer were close to the best or could be adjudged either simply slack moves or deliberately risky moves - in both cases (as Ohashi takes pains to demonstrate) based on positional evaluations and explainable by psychology. That does not mean the human evaluations were correct, but they were at least rational.


I think one factor is shared assumptions of the human players. We have seen that a bot trained on self-play can have blind spots because, during its evolution, its opponent, a slight variant of itself, made the same misjudgements it did. Over infinite time these blind spots will disappear, but meanwhile they exist. The same thing happens with a community of human players. Even though the players are not clones of each other, they share assumptions about plays and the evaluation of positions. This is apparent in the Elf commentaries, where one top player will make a play that loses, say, 10% in Elf's winrate estimate, and then the opponent turns around and returns the favor on the next move, losing, say, 11%. ;)

Another factor, I think, is the difference in skill between players. If a 19th century pro 8 dan took White against a pro 5 dan, their chances were roughly equal, or perhaps White had a slight advantage. Estimating Black to be 6 pts. ahead would be way off. Furthermore, the 8 dan would probably have a good idea of the kinds of mistakes the 5 dan would make. As yet, bots have not been trained with these factors in mind. (But as I have heard, a neural net chess engine has been trained specifically against Stockfish, and does well against it. The future of AI will be interesting. ;))

Quote:
(Incidentally there was a case a little later where Black's winning ratio was 61.1% but territorially White was ahead by 0.2 points. Ohashi admits this is hard to understand.)


IMO, not enough research has been done into the error functions of bots. Meanwhile, the ordering of winrate estimates and territory estimates is probably more reliable than the actual figures. It is plain from the above example that Golaxy does not derive one estimate from the other. One or the other is wrong, probably both. By how much, we don't know. (Just my guess, but with 5 million playouts per move I would trust the territory estimate more. Why? Because, as the program gets better, its winrate swings become greater. In the limit they approach 100%, since a slight error in territorial terms could make the difference between winning and losing. For instance, in one of my problems a mistake of 1/64 pt. loses the game. :) The opposite is true for the territory estimates, which should converge as the program improves its evaluations.)
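To make that concrete, here is a toy sketch (my own, and no real bot computes its winrate this way): treat the winrate as the chance that the true score exceeds zero, given a score estimate whose uncertainty shrinks as the program improves.

Code:
from statistics import NormalDist

# Toy model only: winrate = P(final score > 0) when the bot thinks
# Black is ahead by 1.5 points with shrinking uncertainty sigma.
score_estimate = 1.5
for sigma in (10.0, 5.0, 2.0, 0.5):
    winrate = 1 - NormalDist(mu=score_estimate, sigma=sigma).cdf(0)
    print(f"score +1.5, sigma {sigma:4.1f}: winrate {winrate:.1%}")

As sigma shrinks, the same 1.5-point lead is reported as roughly 56%, 62%, 77% and finally 99.9%, which is why a stronger program shows wilder winrate swings for the same territorial differences.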

Edit: I take back my confidence in the territory estimate, because [lightvector] has pointed out, below, that the data could be skewed by large wins by White. IMHO the proper territory estimate should not be the average of the data, but the median, i.e., the value which divides the scores closest to 50-50, like komi. The 0.2 pt. territory value for White may well be the average, not the median, which could be a few points for Black.

Quote:
It has already been pointed out on this forum that there are several cases where a bot does not even list a particular move in its top N moves, but when that move is actually played the win ratio barely changes. The same thing seems to happen with Golaxy.


With weaker bots or fewer playouts we see the same thing. We even see cases where the new play is an improvement on the bot's top choice. That is why I would like an analyst bot that makes a broader search than a player bot does. Since we do not have any error function for winrate estimates, the number of playouts for each option is a good proxy. It is plain that with today's bots a winrate estimate based on only 1000 playouts for that option (many more playouts go to the move actually chosen) is unreliable.
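As a rough back-of-the-envelope sketch of why (my own, and it treats playouts as independent win/loss samples, which tree-search playouts are not, so the real uncertainty is larger still): the statistical uncertainty in a winrate only shrinks with the square root of the playout count.

Code:
import math

def winrate_standard_error(winrate: float, playouts: int) -> float:
    # Binomial standard error; ignores search correlations, so it is optimistic.
    return math.sqrt(winrate * (1.0 - winrate) / playouts)

for n in (1_000, 100_000, 5_000_000):
    se = winrate_standard_error(0.55, n)
    print(f"{n:>9,} playouts: 55% +/- {100 * se:.2f}% (one standard error)")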

Quote:
As an example of human evaluation vs computer evaluation consider this position:



{snip}

However, Jowa and Golaxy differed in their choice of reply. Jowa chose to force at A and then lived in the corner. Golaxy preferred to fight the ko with B and came up with the following line of play that gave the position where it thought it had gained 3.6 points. For a human even to just feel that White had made a gain here is surely problematical - just too much left up in the air. Jowa's (slightly) inferior move may well be regarded as correct for a human.



In all the comments I have read on the new AI style of play, many people have pointed out the new kinds of moves (e.g. high shoulder hits), there have been insightful characterisations of the style (e.g. an emphasis, very early on, on causing the opponent to be overconcentrated), and there have been new words (e.g. the tiger enclosure). But nowhere have I seen anything that suggests that humans have even begun to get a grip on how to evaluate positions such as the second one above. Everything seems to indicate humans are still satisfied (because they have to be) with Jowa's kind of response.


Well, White has gained around 13 pts. in the bottom right, plus something on the left side and some outside strength in the top right. In addition, White did not have a secure position in the top right corner to start with. An influence oriented human might well be satisfied with the outcome, or even prefer it.

As for human evaluations, it goes back to shared assumptions. Tomorrow's top pros, who will have grown up with the bots, will have more shared assumptions with the bots than with even today's top humans. I think that the next 20 years will be exciting times for human go. :D

_________________
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.


Last edited by Bill Spight on Sat Aug 10, 2019 4:57 am, edited 3 times in total.

This post by Bill Spight was liked by 3 people: gowan, lightvector, mycophobia
 Post subject: Re: AI verdict on Jowa
Post #3 Posted: Tue Aug 06, 2019 10:34 am 
Lives with ko

Posts: 157
Liked others: 9
Was liked: 11
Rank: 2D
Tygem: shiva
Can I just say: thank you for this write-up? Since you are an author, it should not surprise me, but I thoroughly enjoyed the flow of this post. It felt very academic, and the ideas were simply presented and expanded upon. As an English grad student, I loved reading this, and hope you continue this analysis with Shusaku and Dosaku. Please continue to beat the dead horse!!

 Post subject: Re: AI verdict on Jowa
Post #4 Posted: Wed Aug 07, 2019 12:50 pm 
Dies in gote

Posts: 41
Liked others: 1
Was liked: 2
Rank: OGS 2k
KGS: Mikebass14
Tygem: Mikebass14
OGS: Mikebass14
Hear hear! Thanks for the fascinating stuff.

 Post subject: Re: AI verdict on Jowa
Post #5 Posted: Thu Aug 08, 2019 5:49 am 
Lives in sente

Posts: 757
Liked others: 114
Was liked: 916
Rank: maybe 2d
John Fairbairn wrote:
(Incidentally there was a case a little later where Black's winning ratio was 61.1% but territorially White was ahead by 0.2 points. Ohashi admits this is hard to understand.)


It's not hard to understand. If Golaxy is trained anything like KataGo, the score estimate and winrate are both expected values in the mathematical sense of the word, namely averages over the outcomes that the bot currently believes are plausible good play by both sides.

Such a result can arise if there is significant skew in the distribution of believed-plausible outcomes. If a bot is anticipating some imminent potential fights and in the majority of outcomes it sees black ahead by a few points, and in the minority but still a major fraction of the variations it sees white ahead by a ton of points, then the winrate will favor black but the expected score will be larger for white.

The simplest case would be if there is some black dragon where if it lives black will be clearly winning by a little, but if it dies black will be clearly losing by a lot, and where the bot is genuinely unsure if it will live or die but thinks that living is more likely.

So based on the evaluation alone (even though of course the evaluation is uncertain and may be fallible, as always) if you wanted to answer the question "if this position were played out in bot-v-bot selfplay hundreds of times with different randomization, who would more often win this position" you would be better off guessing black because black has the higher winrate. And if you wanted to answer the question of "who would get the higher average score when doing so" you would be better off guessing white because white has the higher score estimate. These are two distinctly different questions, so it is not surprising if sometimes they have different answers.
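To put made-up numbers on that (purely illustrative; these are not Golaxy's or KataGo's figures): suppose the plausible continuations fall into two clusters, frequent narrow Black wins and less frequent but large White wins.

Code:
# Hypothetical distribution of plausible final scores, from Black's side:
# in 70% of lines Black wins by about 3, in 30% White wins by about 20.
outcomes = [(+3.0, 0.7), (-20.0, 0.3)]   # (score for Black, probability)

black_winrate = sum(p for score, p in outcomes if score > 0)
expected_score = sum(score * p for score, p in outcomes)

print(f"Black winrate:  {black_winrate:.0%}")              # 70%  -> favors Black
print(f"Expected score: {expected_score:+.1f} for Black")   # -3.9 -> favors White

The winrate favors Black while the expected score favors White, exactly the pattern Ohashi found puzzling.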

Bill Spight wrote:
IMO, not enough research has been done into the error functions of bots. Meanwhile, the ordering of winrate estimates and territory estimates is probably more reliable than the actual figures. It is plain from the above example that Golaxy does not derive one estimate from the other.


Yes, the ordering would be probably more reliable than the actual numbers. But I would disagree with the connotation of your statement about deriving one from another. Probably it is literally true - in KataGo at least they are separate outputs from the neural net, so literally speaking neither is derived from the other. But they are trained and predicted jointly. So although the bot might well be misjudging things or making reading errors, the score estimate and winrate should usually be closely *mutually-consistent* given those possible errors. And the two values each favoring different sides is not particularly strong evidence of inconsistency.


This post by lightvector was liked by 5 people: ez4u, Leon, Uberdude, Waylon, wolfking
 Post subject: Re: AI verdict on Jowa
Post #6 Posted: Thu Aug 08, 2019 8:24 am 
Honinbo

Posts: 10905
Liked others: 3651
Was liked: 3374
lightvector wrote:
John Fairbairn wrote:
(Incidentally there was a case a little later where Black's winning ratio was 61.1% but territorially White was ahead by 0.2 points. Ohashi admits this is hard to understand.)


It's not hard to understand. If Golaxy is trained anything like KataGo, the score estimate and winrate are both expected values in the mathematical sense of the word, namely averages over the outcomes that the bot currently believes are plausible good play by both sides.

Such a result can arise if there is significant skew in the distribution of believed-plausible outcomes. If a bot is anticipating some imminent potential fights and in the majority of outcomes it sees black ahead by a few points, and in the minority but still a major fraction of the variations it sees white ahead by a ton of points, then the winrate will favor black but the expected score will be larger for white.


Thanks for the clear explanation. :)

lightvector wrote:
Bill Spight wrote:
IMO, not enough research has been done into the error functions of bots. Meanwhile, the ordering of winrate estimates and territory estimates is probably more reliable than the actual figures. It is plain from the above example that Golaxy does not derive one estimate from the other.


Yes, the ordering would be probably more reliable than the actual numbers. But I would disagree with the connotation of your statement about deriving one from another. Probably it is literally true - in KataGo at least they are separate outputs from the neural net, so literally speaking neither is derived from the other. But they are trained and predicted jointly. So although the bot might well be misjudging things or making reading errors, the score estimate and winrate should usually be closely *mutually-consistent* given those possible errors.


Well, I carefully avoided saying that the two estimates were independent, for the reason you give. :)

Quote:
And the two values each favoring different sides is not particularly strong evidence of inconsistency.

I wasn't saying that there was any underlying inconsistency. But the apparent inconsistency lies within their error functions, which are unknown. :)

But I do take back my statement about relying more upon the territory estimate, given the possibility you point out that the territory data may be skewed. Edit: And that Golaxy may well be calculating the mean score instead of the median.

Quote:
And if you wanted to answer the question of "who would get the higher average score when doing so"


I think I would be more interested in the median score rather than the mean. For instance, the best estimate of komi is the median result, not the mean. In the case Ohashi reports, it may well be, given skew, that the median score favors Black instead of White. :)
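A minimal sketch with made-up numbers of the same skewed kind as above (nothing here comes from Golaxy's data):

Code:
import random
from statistics import mean, median

# Hypothetical final scores for Black over 1000 imagined continuations:
# 70% narrow Black wins around +3, 30% large White wins around -20.
random.seed(0)
scores = ([random.gauss(+3.0, 2.0) for _ in range(700)] +
          [random.gauss(-20.0, 5.0) for _ in range(300)])

print(f"mean score:   {mean(scores):+.1f} for Black")    # about -4: White ahead
print(f"median score: {median(scores):+.1f} for Black")  # about +2: Black ahead

The mean is dragged down by the big White wins, but the median - the value that splits the imagined games 50-50, the way komi is meant to - still favors Black.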

Edit: And, for measuring errors in both score estimates and winrate estimates, I prefer the interquartile range to the root-mean-square error. :)
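Again as a toy sketch with invented estimate errors: a couple of big misreads dominate the root-mean-square error but barely move the interquartile range.

Code:
import math
from statistics import quantiles

# Hypothetical estimate errors (estimate minus true score), in points:
# mostly small, with two large misreads.
errors = [0.5, -1.0, 1.5, -0.5, 0.0, 2.0, -1.5, 1.0, 25.0, -30.0]

rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
q1, _, q3 = quantiles(errors, n=4)   # quartiles
print(f"RMSE: {rmse:.1f} pts   IQR: {q3 - q1:.1f} pts")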

Edit2: Also, as I have said before, when choosing between plays, score estimates are not sufficient; you also need temperature estimates.

_________________
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.


Last edited by Bill Spight on Fri Aug 09, 2019 6:44 am, edited 4 times in total.