My working hypothesis is that Leela Elf evaluates positions better at the 200K setting than at the 100K setting. (N.B. These playout numbers are not actually observed in the files.) If so, the observed ∆s are not random noise, but indicate likely errors at the 100K setting. The pattern of sign changes in the ∆s over the game record supports that hypothesis.
The median ∆ is -0.03. If we subtract that amount from each of the 137 ∆s, exactly one of them becomes 0. Ignoring that ∆, we have a sequence of 136 signed ∆s, half with a + sign and half with a - sign. If the signs were in random order, the expected number of sign changes in the sequence (ignoring the 0) would be 136/2 = 68. We get only 50 sign changes, too few for a random sequence.
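For anyone who wants to repeat the count on their own list of ∆s, here is a minimal Python sketch of the procedure just described. The function name `sign_change_summary` and the idea of passing the ∆s in as a plain list are my own assumptions for illustration; nothing here reads the actual game files. The expected value uses the standard formula 2·n₊·n₋/(n₊ + n₋) for a random ordering of a fixed number of + and - signs, which gives 68 when both counts are 68.

```python
from statistics import median


def sign_change_summary(deltas):
    """Center deltas on their median, drop zeros, and count sign changes.

    Returns (n_plus, n_minus, observed_changes, expected_changes).
    """
    m = median(deltas)
    # Sign of each median-centered delta; values equal to the median are skipped.
    signs = [1 if d > m else -1 for d in deltas if d != m]
    n_plus = signs.count(1)
    n_minus = signs.count(-1)
    n = n_plus + n_minus
    # A sign change is any adjacent pair with different signs.
    observed = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    # Expected changes if the same signs were shuffled into random order.
    expected = 2 * n_plus * n_minus / n if n else 0.0
    return n_plus, n_minus, observed, expected


# Toy data (hypothetical, not from the game): median is 0, signs are + + - - + -
toy = [1, 2, -3, -1, 0, 4, -2]
summary = sign_change_summary(toy)
```

Comparing each ∆ to the median directly, rather than subtracting first, avoids floating-point leftovers that could hide the single zero.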
This lack of randomness is more obvious when we look at sequences of signs of the same kind, called runs. In a random sequence the expected run length is about 2. The average run length for the game is about 2.7 (50 sign changes means 51 runs over 136 signs). What mainly skews the result is two runs of length 12.
(One of these contains the median, so is 13 moves long.) The first long run (13 moves) begins at the position after
, based upon Leela Elf's choice for
. (So it shows up starting at move 68 in the chart.) The second long run (12 moves) begins at the position after Black 147 (move 148 in the chart). One explanation for these long runs is that there are persistent features of the board in each that Leela Elf misevaluates at a setting of 100K and evaluates better at a setting of 200K. During the first run it underestimates Black's chances, and during the second run it overestimates them.
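To put a rough number on "too few for a random sequence" and to locate long runs, here is a hedged Python sketch. It swaps in the standard Wald-Wolfowitz runs test with a normal approximation, which is not something I computed above, and the helper names `run_lengths` and `runs_test_z` are made up for illustration. Plugging in the counts quoted earlier (68 plus signs, 68 minus signs, and 51 runs) gives a z-score of about -3.1.

```python
from itertools import groupby
from math import sqrt


def run_lengths(signs):
    """Return (sign, start_index, length) for each maximal run of equal signs.

    start_index is a position in the sign sequence, not a move number.
    """
    runs, start = [], 0
    for sign, group in groupby(signs):
        length = len(list(group))
        runs.append((sign, start, length))
        start += length
    return runs


def runs_test_z(n_plus, n_minus, n_runs):
    """z-score of the runs count under the Wald-Wolfowitz normal approximation."""
    n = n_plus + n_minus
    mean = 2 * n_plus * n_minus / n + 1
    var = 2 * n_plus * n_minus * (2 * n_plus * n_minus - n) / (n * n * (n - 1))
    return (n_runs - mean) / sqrt(var)


# Counts quoted in the text: 68 plus, 68 minus, 50 sign changes, hence 51 runs.
z = runs_test_z(68, 68, 51)
mean_run = 136 / 51

# Toy sign sequence (hypothetical) to show the run finder.
toy_runs = run_lengths([1, 1, -1, 1, 1, 1])
longest = max(toy_runs, key=lambda r: r[2])
```

A negative z-score means fewer runs than a random ordering would give, which is the direction expected if the 100K setting makes persistent errors over stretches of the game. Sorting the output of `run_lengths` by length is a quick way to pick out runs like the two long ones discussed here.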