MuZero beats AlphaZero

For discussing go computing, software announcements, etc.
pookpooi
Lives in sente
Posts: 727
Joined: Sat Aug 21, 2010 12:26 pm
GD Posts: 10
Has thanked: 44 times
Been thanked: 218 times

Re: MuZero beats AlphaZero

Post by pookpooi »

Since DeepMind is not going to provide exact Elo values anyway, I'll do this for fun. I'll try to read the Elo ratings off the graphs, assuming the graphs have an accurate scale.

We'll start with the exact numbers the papers mention (from the AlphaGo Zero paper):
AlphaGo Fan: 3,144
AlphaGo Lee: 3,739
AlphaGo Master: 4,858
AlphaGo Zero (40 blocks / 40 days): 5,185

Now the estimated numbers:
AlphaGo Zero (20 blocks / 3 days): 4,884 (from the AlphaZero paper)
AlphaZero (20 blocks / 13 days): 4,987 (from the MuZero paper), 4,980 (from the AlphaZero paper); very similar numbers across the two papers, so I think the graphs have an accurate scale
MuZero (16 blocks / 12 hours?): 5,161 (from the MuZero paper)

Though there is one very BIG caution: the match conditions differ. In the MuZero paper the condition is 800 simulations per move, and another graph shows that MuZero outperforms AlphaZero from 0.1 to 20 seconds per move; at 20 to 50 seconds per move AlphaZero outperforms MuZero, and we don't know what will happen at even longer thinking times.
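As a sanity check on what those rating gaps would mean head-to-head, the standard Elo model converts a rating difference into an expected score. This is just the textbook formula, not anything taken from the papers:

```python
def expected_score(r_a, r_b):
    """Expected score of player A against player B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

# The estimated 5,161 vs 4,987 gap would give MuZero roughly a 73% score:
print(expected_score(5161, 4987))
```

Of course, that only holds at the 800-simulations-per-move condition the estimates come from.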
lightvector
Lives in sente
Posts: 759
Joined: Sat Jun 19, 2010 10:11 pm
Rank: maybe 2d
GD Posts: 0
Has thanked: 114 times
Been thanked: 916 times

Re: MuZero beats AlphaZero

Post by lightvector »

Actually we know what will happen at longer thinking times - it's almost guaranteed that AlphaZero continues to pull further ahead of MuZero.

The reason AlphaZero pulls ahead at longer thinking times is that the accuracy of MuZero's representation of the board degrades the more times it passes through the dynamics function, so as it thinks more and more moves ahead, its "mental picture" of the future board state becomes worse and worse until it degrades into garbage. (This is a general phenomenon that afflicts all known RNN-style architectures that attempt to model any kind of state dynamics.)
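The compounding can be shown with a toy sketch. This is nothing like MuZero's actual dynamics function, just a scalar stand-in where the learned one-step model carries a small systematic error, applied recurrently the way a learned dynamics function is during search:

```python
def model_drift(steps, per_step_error=0.01):
    """Drift between an exact trajectory and one reconstructed by an
    imperfect one-step model applied recurrently. Each application adds
    a small error, so over a deep rollout the errors compound instead
    of averaging out."""
    true_s, model_s = 0.0, 0.0
    for _ in range(steps):
        true_s += 1.0                      # exact dynamics: s -> s + 1
        model_s += 1.0 + per_step_error    # learned approximation of the same step
    return abs(true_s - model_s)
```

One step ahead the drift is negligible (0.01); fifty steps ahead it is fifty times larger, which is the same qualitative story as the mental picture degrading with search depth.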

The paper itself remarks that, quite amazingly, the degradation only becomes really noticeable at least a whole order of magnitude beyond the search depth used in self-play training. But for deep searches, as things currently stand, MuZero can't compete with AlphaZero, which has an actual software implementation of a Go board to make the moves on and therefore perfect perception of future board states.

(As others have mentioned, it's very clear from design features like this one that Go wasn't really the target problem being solved here; the focus is on more general tasks where you can't simply implement the rules of the game in your model.)
Bill Spight
Honinbo
Posts: 10905
Joined: Wed Apr 21, 2010 1:24 pm
Has thanked: 3651 times
Been thanked: 3373 times

Re: MuZero beats AlphaZero

Post by Bill Spight »

Thanks, lightvector. :)

One thing that keeps coming to my mind is Richard Feynman's caution about extrapolation. OC, everybody knows that you can't trust extrapolation, but Feynman pointed out that you can't trust extreme data points, either. They are not validated by further exploration. See the horizon effect.

That's why, when I see long variations produced by analysis programs, I cringe. The Elf commentaries sometimes produce long variations as well, but they cut them off when the number of visits or playouts drops below 1,500. You can't trust moves that have not been explored at least that much.
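That cutoff rule is simple to state in code. A hypothetical sketch, with made-up move names and visit counts, using 1,500 as the threshold the Elf commentaries use:

```python
def truncate_variation(moves_with_visits, min_visits=1500):
    """Cut a variation at the first move whose visit count drops below
    the threshold; everything after it is too unexplored to trust."""
    trusted = []
    for move, visits in moves_with_visits:
        if visits < min_visits:
            break
        trusted.append(move)
    return trusted

# Only the well-explored head of the variation survives:
print(truncate_variation([("D4", 9000), ("Q16", 4200), ("C3", 800), ("R4", 120)]))
```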
The Adkins Principle:
At some point, doesn't thinking have to go on?
— Winona Adkins

Visualize whirled peas.

Everything with love. Stay safe.