Tuning with 50000 visits:
Z:\>LG0\Lizzie\katago\katago.exe genconfig -model \LG0\Lizzie\katago\g170-b30c320x2-s1287828224-d525929064.bin.gz -output gtp_custom.cfg
=========================================================================
RULES
What rules should KataGo use by default for play and analysis?
(chinese, japanese, korean, tromp-taylor, aga, chinese-ogs, new-zealand, bga, stone-scoring, aga-button):
japanese
=========================================================================
SEARCH LIMITS
When playing games, KataGo will always obey the time controls given by the GUI/tournament/match/online server.
But you can specify an additional limit to make KataGo move much faster. This does NOT affect analysis/review,
only affects playing games. Add a limit? (y/n) (default n):
n
NOTE: No limits configured for KataGo. KataGo will obey time controls provided by the GUI or server or match script
but if they don't specify any, when playing games KataGo may think forever without moving. (press enter to continue)
When playing games, KataGo can optionally ponder during the opponent's turn. This gives faster/stronger play
in real games but should NOT be enabled if you are running tests with fixed limits (pondering may exceed those
limits), or to avoid stealing the opponent's compute time when testing two bots on the same machine.
Enable pondering? (y/n, default n):y
Specify max num seconds KataGo should ponder during the opponent's turn. Leave blank for no limit:
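For reference, answering "y" here and leaving the time prompt blank should translate into pondering settings in the generated config. A sketch using option names from KataGo's example gtp config (the exact lines genconfig writes may differ, so check gtp_custom.cfg itself):

ponderingEnabled = true
# maxTimePondering = 60   # left blank above, so no ponder time cap is set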
=========================================================================
GPUS AND RAM
Finding available GPU-like devices...
Found CUDA device 0: GeForce RTX 2080 Ti
Found CUDA device 1: GeForce RTX 2080 Ti
Specify devices/GPUs to use (for example "0,1,2" to use devices 0, 1, and 2). Leave blank for good default:
"0,1"
could not parse int: "0
Specify devices/GPUs to use (for example "0,1,2" to use devices 0, 1, and 2). Leave blank for good default:
0,1
By default, KataGo will cache up to about 3GB of positions in memory (RAM), in addition to
whatever the current search is using. Specify a max in GB or leave blank for default:
60
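Note the parse error above: the device list must be entered without quotes, i.e. 0,1 rather than "0,1". Selecting both GPUs should produce per-device lines in the generated config. A sketch using option names from KataGo's example gtp config for the CUDA backend (assumed, not copied from the actual gtp_custom.cfg):

numNNServerThreadsPerModel = 2   # one neural-net server thread per GPU
cudaDeviceToUseThread0 = 0       # first RTX 2080 Ti
cudaDeviceToUseThread1 = 1       # second RTX 2080 Ti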
=========================================================================
PERFORMANCE TUNING
Specify number of visits to use test/tune performance with, leave blank for default based on GPU speed.
Use large number for more accurate results, small if your GPU is old and this is taking forever:
50000
Specify number of seconds/move to optimize performance for (default 5), leave blank for default:
2020-03-12 22:55:26+0100: Loading model and initializing benchmark...
=========================================================================
TUNING NOW
Tuning using 50000 visits.
Automatically trying different numbers of threads to home in on the best:
2020-03-12 22:55:26+0100: nnRandSeed0 = 2369906978592220054
2020-03-12 22:55:26+0100: After dedups: nnModelFile0 = \LG0\Lizzie\katago\g170-b30c320x2-s1287828224-d525929064.bin.gz useFP16 auto useNHWC auto
2020-03-12 22:55:28+0100: Cuda backend: Found GPU GeForce RTX 2080 Ti memory 11811160064 compute capability major 7 minor 5
2020-03-12 22:55:28+0100: Cuda backend: Found GPU GeForce RTX 2080 Ti memory 11811160064 compute capability major 7 minor 5
2020-03-12 22:55:28+0100: Cuda backend: Model version 8 useFP16 = true useNHWC = true
2020-03-12 22:55:28+0100: Cuda backend: Model name: g170-b30c320x2-s1287828224-d525929064
2020-03-12 22:55:28+0100: Cuda backend: Model version 8 useFP16 = true useNHWC = true
2020-03-12 22:55:28+0100: Cuda backend: Model name: g170-b30c320x2-s1287828224-d525929064
Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32,
numSearchThreads = 5: 10 / 10 positions, visits/s = 533.10 nnEvals/s = 350.16 nnBatches/s = 213.88 avgBatchSize = 1.64 (938.0 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 1131.75 nnEvals/s = 769.38 nnBatches/s = 198.99 avgBatchSize = 3.87 (441.9 secs)
numSearchThreads = 10: 10 / 10 positions, visits/s = 964.41 nnEvals/s = 649.12 nnBatches/s = 204.31 avgBatchSize = 3.18 (518.5 secs)
numSearchThreads = 20: 10 / 10 positions, visits/s = 1520.41 nnEvals/s = 1003.61 nnBatches/s = 152.46 avgBatchSize = 6.58 (329.0 secs)
numSearchThreads = 16: 10 / 10 positions, visits/s = 1387.92 nnEvals/s = 932.16 nnBatches/s = 178.77 avgBatchSize = 5.21 (360.4 secs)
numSearchThreads = 24: 10 / 10 positions, visits/s = 1624.20 nnEvals/s = 1089.80 nnBatches/s = 136.46 avgBatchSize = 7.99 (308.0 secs)
numSearchThreads = 32: 10 / 10 positions, visits/s = 1796.26 nnEvals/s = 1201.35 nnBatches/s = 113.86 avgBatchSize = 10.55 (278.5 secs)
Optimal number of threads is fairly high, tripling the search limit and trying again.
2020-03-12 23:49:10+0100: nnRandSeed0 = 6506758374797114957
2020-03-12 23:49:10+0100: After dedups: nnModelFile0 = \LG0\Lizzie\katago\g170-b30c320x2-s1287828224-d525929064.bin.gz useFP16 auto useNHWC auto
2020-03-12 23:49:13+0100: Cuda backend: Found GPU GeForce RTX 2080 Ti memory 11811160064 compute capability major 7 minor 5
2020-03-12 23:49:13+0100: Cuda backend: Found GPU GeForce RTX 2080 Ti memory 11811160064 compute capability major 7 minor 5
2020-03-12 23:49:13+0100: Cuda backend: Model version 8 useFP16 = true useNHWC = true
2020-03-12 23:49:13+0100: Cuda backend: Model name: g170-b30c320x2-s1287828224-d525929064
2020-03-12 23:49:13+0100: Cuda backend: Model version 8 useFP16 = true useNHWC = true
2020-03-12 23:49:13+0100: Cuda backend: Model name: g170-b30c320x2-s1287828224-d525929064
Possible numbers of threads to test: 1, 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 40, 48, 64, 80, 96,
numSearchThreads = 6: 10 / 10 positions, visits/s = 626.73 nnEvals/s = 407.14 nnBatches/s = 209.06 avgBatchSize = 1.95 (797.9 secs)
numSearchThreads = 48: 10 / 10 positions, visits/s = 2214.93 nnEvals/s = 1421.03 nnBatches/s = 93.34 avgBatchSize = 15.22 (226.0 secs)
numSearchThreads = 64: 10 / 10 positions, visits/s = 2301.42 nnEvals/s = 1500.58 nnBatches/s = 77.43 avgBatchSize = 19.38 (217.5 secs)
numSearchThreads = 80: 10 / 10 positions, visits/s = 2322.34 nnEvals/s = 1543.88 nnBatches/s = 65.55 avgBatchSize = 23.55 (215.6 secs)
numSearchThreads = 40: 10 / 10 positions, visits/s = 1983.09 nnEvals/s = 1353.57 nnBatches/s = 104.84 avgBatchSize = 12.91 (252.3 secs)
Ordered summary of results:
numSearchThreads = 5: 10 / 10 positions, visits/s = 533.10 nnEvals/s = 350.16 nnBatches/s = 213.88 avgBatchSize = 1.64 (938.0 secs) (EloDiff baseline)
numSearchThreads = 6: 10 / 10 positions, visits/s = 626.73 nnEvals/s = 407.14 nnBatches/s = 209.06 avgBatchSize = 1.95 (797.9 secs) (EloDiff +57)
numSearchThreads = 10: 10 / 10 positions, visits/s = 964.41 nnEvals/s = 649.12 nnBatches/s = 204.31 avgBatchSize = 3.18 (518.5 secs) (EloDiff +208)
numSearchThreads = 12: 10 / 10 positions, visits/s = 1131.75 nnEvals/s = 769.38 nnBatches/s = 198.99 avgBatchSize = 3.87 (441.9 secs) (EloDiff +264)
numSearchThreads = 16: 10 / 10 positions, visits/s = 1387.92 nnEvals/s = 932.16 nnBatches/s = 178.77 avgBatchSize = 5.21 (360.4 secs) (EloDiff +334)
numSearchThreads = 20: 10 / 10 positions, visits/s = 1520.41 nnEvals/s = 1003.61 nnBatches/s = 152.46 avgBatchSize = 6.58 (329.0 secs) (EloDiff +362)
numSearchThreads = 24: 10 / 10 positions, visits/s = 1624.20 nnEvals/s = 1089.80 nnBatches/s = 136.46 avgBatchSize = 7.99 (308.0 secs) (EloDiff +381)
numSearchThreads = 32: 10 / 10 positions, visits/s = 1796.26 nnEvals/s = 1201.35 nnBatches/s = 113.86 avgBatchSize = 10.55 (278.5 secs) (EloDiff +408)
numSearchThreads = 40: 10 / 10 positions, visits/s = 1983.09 nnEvals/s = 1353.57 nnBatches/s = 104.84 avgBatchSize = 12.91 (252.3 secs) (EloDiff +436)
numSearchThreads = 48: 10 / 10 positions, visits/s = 2214.93 nnEvals/s = 1421.03 nnBatches/s = 93.34 avgBatchSize = 15.22 (226.0 secs) (EloDiff +471)
numSearchThreads = 64: 10 / 10 positions, visits/s = 2301.42 nnEvals/s = 1500.58 nnBatches/s = 77.43 avgBatchSize = 19.38 (217.5 secs) (EloDiff +467)
numSearchThreads = 80: 10 / 10 positions, visits/s = 2322.34 nnEvals/s = 1543.88 nnBatches/s = 65.55 avgBatchSize = 23.55 (215.6 secs) (EloDiff +451)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 5: (baseline)
numSearchThreads = 6: +57 Elo
numSearchThreads = 10: +208 Elo
numSearchThreads = 12: +264 Elo
numSearchThreads = 16: +334 Elo
numSearchThreads = 20: +362 Elo
numSearchThreads = 24: +381 Elo
numSearchThreads = 32: +408 Elo
numSearchThreads = 40: +436 Elo
numSearchThreads = 48: +471 Elo (recommended)
numSearchThreads = 64: +467 Elo
numSearchThreads = 80: +451 Elo
Using 48 numSearchThreads!
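The EloDiff column trades raw speed against search quality: more threads raise visits/s (worth roughly 250 Elo per doubling, per the note above), but each extra thread slightly degrades MCTS. A minimal back-of-envelope model in Python using the constants quoted in the log; KataGo's real estimator is more detailed (it interpolates the per-thread cost by visit count), so these outputs only roughly track the EloDiff values above:

import math

# Constants quoted in the genconfig log (approximate).
ELO_PER_DOUBLING = 250.0   # Elo gained per doubling of search speed
ELO_PER_THREAD = 2.0       # per-thread MCTS cost at ~5000 visits

BASELINE_THREADS = 5
BASELINE_VPS = 533.10      # visits/s of the baseline row above

def approx_elo_diff(vps: float, threads: int) -> float:
    """Rough EloDiff vs. the 5-thread baseline."""
    speed_gain = ELO_PER_DOUBLING * math.log2(vps / BASELINE_VPS)
    mcts_cost = ELO_PER_THREAD * (threads - BASELINE_THREADS)
    return speed_gain - mcts_cost

# visits/s figures from the ordered summary above.
for threads, vps in [(6, 626.73), (12, 1131.75), (24, 1624.20),
                     (48, 2214.93), (64, 2301.42), (80, 2322.34)]:
    print(f"numSearchThreads = {threads:2d}: {approx_elo_diff(vps, threads):+5.0f} Elo")

Even with these crude constants, the curve peaks around 48 threads and declines at 64 and 80, matching the tool's recommendation: past some point, extra threads stop buying enough speed to pay for the damage they do to the search.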
=========================================================================
DONE
Writing new config file to gtp_custom.cfg
You should now be able to run KataGo with this config via something like:
LG0\Lizzie\katago\katago.exe gtp -model '\LG0\Lizzie\katago\g170-b30c320x2-s1287828224-d525929064.bin.gz' -config 'gtp_custom.cfg'
Feel free to look at and edit the above config file further by hand in a txt editor.
For more detailed notes about performance and what options in the config do, see:
https://github.com/lightvector/KataGo/b ... xample.cfg
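If you do edit gtp_custom.cfg by hand, the answers given in this session map to a handful of options. A sketch of the relevant lines, using option names from KataGo's example gtp config with values per the answers above (verify against the actual generated file):

rules = japanese
ponderingEnabled = true
numSearchThreads = 48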