lightvector wrote:
NVIDIA's licensing terms make it hard to legally distribute any of those DLLs. So legally speaking, if you are attempting to use the CUDA version, it is up to you to install both CUDA and CUDNN (the latter of which requires signing up for a free "developer account" on NVIDIA's website) and to make sure that the appropriate DLL files are within your library search path on Windows, or otherwise copied to where they need to be so that they can be found. The latest Windows release (unless you are compiling a custom version from source) needs CUDA 11.2 and CUDNN 8.
Anyway, this is why the official documentation (
https://github.com/lightvector/KataGo#o ... a-vs-eigen) recommends the OpenCL version to everyone. CUDA is only worthwhile if you are willing to do the technical work yourself and care a lot about every last bit of performance - including accepting the significant chance that, in the end, the work you do isn't worth anything, because on your system the OpenCL version may be faster anyway.
Thx. I have fixed my two problems.
And have also looked here:
https://lifein19x19.com/viewtopic.php?t=17317

Now tuning looks like this:
Z:\>LG0\Lizzie\katago\katago.exe genconfig -model \LG0\Lizzie\katago\katanetwork.gz -output gtp_custom.cfg
=========================================================================
RULES
What rules should KataGo use by default for play and analysis?
(chinese, japanese, korean, tromp-taylor, aga, chinese-ogs, new-zealand, bga, stone-scoring, aga-button):
japanese
=========================================================================
SEARCH LIMITS
When playing games, KataGo will always obey the time controls given by the GUI/tournament/match/online server.
But you can specify an additional limit to make KataGo move much faster. This does NOT affect analysis/review,
only affects playing games. Add a limit? (y/n) (default n):
n
NOTE: No limits configured for KataGo. KataGo will obey time controls provided by the GUI or server or match script
but if they don't specify any, when playing games KataGo may think forever without moving. (press enter to continue)
When playing games, KataGo can optionally ponder during the opponent's turn. This gives faster/stronger play
in real games but should NOT be enabled if you are running tests with fixed limits (pondering may exceed those
limits), or to avoid stealing the opponent's compute time when testing two bots on the same machine.
Enable pondering? (y/n, default n):y
Specify max num seconds KataGo should ponder during the opponent's turn. Leave blank for no limit:
=========================================================================
GPUS AND RAM
Finding available GPU-like devices...
Found CUDA device 0: GeForce RTX 2080 Ti
Found CUDA device 1: GeForce RTX 2080 Ti
Specify devices/GPUs to use (for example "0,1,2" to use devices 0, 1, and 2). Leave blank for a default SINGLE-GPU config:
0,1
By default, KataGo will cache up to about 3GB of positions in memory (RAM), in addition to
whatever the current search is using. Specify a different max in GB or leave blank for default:
64
=========================================================================
PERFORMANCE TUNING
Specify number of visits to use test/tune performance with, leave blank for default based on GPU speed.
Use large number for more accurate results, small if your GPU is old and this is taking forever:
100000
Specify number of seconds/move to optimize performance for (default 5), leave blank for default:
1
2021-01-16 16:07:21+0100: Loading model and initializing benchmark...
2021-01-16 16:07:21+0100: nnRandSeed0 = 13173919156662199898
2021-01-16 16:07:21+0100: After dedups: nnModelFile0 = \LG0\Lizzie\katago\katanetwork.gz useFP16 auto useNHWC auto
2021-01-16 16:07:23+0100: Cuda backend thread 0: Found GPU GeForce RTX 2080 Ti memory 11811160064 compute capability major 7 minor 5
2021-01-16 16:07:23+0100: Cuda backend thread 1: Found GPU GeForce RTX 2080 Ti memory 11811160064 compute capability major 7 minor 5
2021-01-16 16:07:23+0100: Cuda backend thread 0: Model version 10 useFP16 = true useNHWC = true
2021-01-16 16:07:23+0100: Cuda backend thread 1: Model version 10 useFP16 = true useNHWC = true
2021-01-16 16:07:23+0100: Cuda backend thread 0: Model name: kata1-b40c256-s5675792640-d1366587029
2021-01-16 16:07:23+0100: Cuda backend thread 1: Model name: kata1-b40c256-s5675792640-d1366587029
=========================================================================
TUNING NOW
Tuning using 100000 visits.
Automatically trying different numbers of threads to home in on the best:
2021-01-16 16:07:27+0100: nnRandSeed0 = 2285250991643616650
2021-01-16 16:07:27+0100: After dedups: nnModelFile0 = \LG0\Lizzie\katago\katanetwork.gz useFP16 auto useNHWC auto
2021-01-16 16:07:29+0100: Cuda backend thread 0: Found GPU GeForce RTX 2080 Ti memory 11811160064 compute capability major 7 minor 5
2021-01-16 16:07:29+0100: Cuda backend thread 1: Found GPU GeForce RTX 2080 Ti memory 11811160064 compute capability major 7 minor 5
2021-01-16 16:07:29+0100: Cuda backend thread 0: Model version 10 useFP16 = true useNHWC = true
2021-01-16 16:07:29+0100: Cuda backend thread 1: Model version 10 useFP16 = true useNHWC = true
2021-01-16 16:07:29+0100: Cuda backend thread 0: Model name: kata1-b40c256-s5675792640-d1366587029
2021-01-16 16:07:29+0100: Cuda backend thread 1: Model name: kata1-b40c256-s5675792640-d1366587029
Possible numbers of threads to test: 2, 3, 4, 5, 6, 8, 10, 12, 16, 20, 24, 32, 40, 48,
numSearchThreads = 6: 10 / 10 positions, visits/s = 1172.56 nnEvals/s = 666.52 nnBatches/s = 354.20 avgBatchSize = 1.88 (852.9 secs)
numSearchThreads = 20: 10 / 10 positions, visits/s = 2167.26 nnEvals/s = 1195.58 nnBatches/s = 234.58 avgBatchSize = 5.10 (461.5 secs)
numSearchThreads = 12: 10 / 10 positions, visits/s = 1797.95 nnEvals/s = 987.94 nnBatches/s = 291.59 avgBatchSize = 3.39 (556.3 secs)
numSearchThreads = 32: 10 / 10 positions, visits/s = 2466.86 nnEvals/s = 1419.08 nnBatches/s = 174.40 avgBatchSize = 8.14 (405.5 secs)
numSearchThreads = 40: 10 / 10 positions, visits/s = 2832.08 nnEvals/s = 1532.72 nnBatches/s = 147.34 avgBatchSize = 10.40 (353.2 secs)
numSearchThreads = 48: 10 / 10 positions, visits/s = 2865.65 nnEvals/s = 1619.12 nnBatches/s = 124.66 avgBatchSize = 12.99 (349.1 secs)
Optimal number of threads is fairly high, increasing the search limit and trying again.
2021-01-16 16:57:39+0100: nnRandSeed0 = 15290355468334568374
2021-01-16 16:57:39+0100: After dedups: nnModelFile0 = \LG0\Lizzie\katago\katanetwork.gz useFP16 auto useNHWC auto
2021-01-16 16:57:41+0100: Cuda backend thread 0: Found GPU GeForce RTX 2080 Ti memory 11811160064 compute capability major 7 minor 5
2021-01-16 16:57:41+0100: Cuda backend thread 1: Found GPU GeForce RTX 2080 Ti memory 11811160064 compute capability major 7 minor 5
2021-01-16 16:57:41+0100: Cuda backend thread 0: Model version 10 useFP16 = true useNHWC = true
2021-01-16 16:57:41+0100: Cuda backend thread 1: Model version 10 useFP16 = true useNHWC = true
2021-01-16 16:57:41+0100: Cuda backend thread 0: Model name: kata1-b40c256-s5675792640-d1366587029
2021-01-16 16:57:41+0100: Cuda backend thread 1: Model name: kata1-b40c256-s5675792640-d1366587029
Possible numbers of threads to test: 24, 32, 40, 48, 64, 80, 96, 128,
numSearchThreads = 80: 10 / 10 positions, visits/s = 3019.80 nnEvals/s = 1771.93 nnBatches/s = 79.95 avgBatchSize = 22.16 (331.4 secs)
numSearchThreads = 64: 10 / 10 positions, visits/s = 3021.92 nnEvals/s = 1715.16 nnBatches/s = 96.56 avgBatchSize = 17.76 (331.1 secs)
Ordered summary of results:
numSearchThreads = 6: 10 / 10 positions, visits/s = 1172.56 nnEvals/s = 666.52 nnBatches/s = 354.20 avgBatchSize = 1.88 (852.9 secs) (EloDiff baseline)
numSearchThreads = 12: 10 / 10 positions, visits/s = 1797.95 nnEvals/s = 987.94 nnBatches/s = 291.59 avgBatchSize = 3.39 (556.3 secs) (EloDiff +134)
numSearchThreads = 20: 10 / 10 positions, visits/s = 2167.26 nnEvals/s = 1195.58 nnBatches/s = 234.58 avgBatchSize = 5.10 (461.5 secs) (EloDiff +174)
numSearchThreads = 32: 10 / 10 positions, visits/s = 2466.86 nnEvals/s = 1419.08 nnBatches/s = 174.40 avgBatchSize = 8.14 (405.5 secs) (EloDiff +181)
numSearchThreads = 40: 10 / 10 positions, visits/s = 2832.08 nnEvals/s = 1532.72 nnBatches/s = 147.34 avgBatchSize = 10.40 (353.2 secs) (EloDiff +214)
numSearchThreads = 48: 10 / 10 positions, visits/s = 2865.65 nnEvals/s = 1619.12 nnBatches/s = 124.66 avgBatchSize = 12.99 (349.1 secs) (EloDiff +191)
numSearchThreads = 64: 10 / 10 positions, visits/s = 3021.92 nnEvals/s = 1715.16 nnBatches/s = 96.56 avgBatchSize = 17.76 (331.1 secs) (EloDiff +163)
numSearchThreads = 80: 10 / 10 positions, visits/s = 3019.80 nnEvals/s = 1771.93 nnBatches/s = 79.95 avgBatchSize = 22.16 (331.4 secs) (EloDiff +109)
Based on some test data, each speed doubling gains perhaps ~250 Elo by searching deeper.
Based on some test data, each thread costs perhaps 7 Elo if using 800 visits, and 2 Elo if using 5000 visits (by making MCTS worse).
So APPROXIMATELY based on this benchmark, if you intend to do a 1 second search:
numSearchThreads = 6: (baseline)
numSearchThreads = 12: +134 Elo
numSearchThreads = 20: +174 Elo
numSearchThreads = 32: +181 Elo
numSearchThreads = 40: +214 Elo (recommended)
numSearchThreads = 48: +191 Elo
numSearchThreads = 64: +163 Elo
numSearchThreads = 80: +109 Elo
Using 40 numSearchThreads!
=========================================================================
DONE
Writing new config file to gtp_custom.cfg
You should be now able to run KataGo with this config via something like:
LG0\Lizzie\katago\katago.exe gtp -model '\LG0\Lizzie\katago\katanetwork.gz' -config 'gtp_custom.cfg'
Feel free to look at and edit the above config file further by hand in a txt editor.
For more detailed notes about performance and what options in the config do, see:
https://github.com/lightvector/KataGo/b ... xample.cfg

Can someone explain the performance tuning part in more detail?
1. Is it better to use 10,000 or 100,000 or 1,000,000 visits for the tuning?
Or is it better to use the default, which depends on GPU speed - but how does that work compared to the other options?
2. How does tuning "seconds per move" work in detail? Is it better to use 1 second, or 10, or 60 seconds?
If I tune for 1 second, would the result be better at 1 second per move compared to a tuning done at 60 seconds per move?
And after tuning for 1 second, how good is the Elo increase when I use it for long analysis? I mean, would a 60-seconds-per-move tuning scale better when going from 1 to 2 to 3 to 4 to...60 seconds compared to the 1-second-per-move tuning?
And how does it look when spending 5 minutes on some special positions?
3. Please also compare the new tuning (Elo) with the old one (Elo) from the link above:
50000 visits
So APPROXIMATELY based on this benchmark, if you intend to do a 5 second search:
numSearchThreads = 5: (baseline)
numSearchThreads = 6: +57 Elo
numSearchThreads = 10: +208 Elo
numSearchThreads = 12: +264 Elo
numSearchThreads = 16: +334 Elo
numSearchThreads = 20: +362 Elo
numSearchThreads = 24: +381 Elo
numSearchThreads = 32: +408 Elo
numSearchThreads = 40: +436 Elo
numSearchThreads = 48: +471 Elo (recommended)
numSearchThreads = 64: +467 Elo
numSearchThreads = 80: +451 Elo
Using 48 numSearchThreads!
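On the mechanics behind those summary tables: genconfig's own output above states the two heuristics it combines - roughly +250 Elo per doubling of search speed, minus a per-thread penalty (quoted as ~7 Elo/thread at 800 visits, ~2 Elo/thread at 5000 visits). As a rough back-of-the-envelope illustration only - this is NOT KataGo's actual formula, and using the flat 2 Elo/thread figure is my simplification - plugging in the visits/s from the first benchmark reproduces the shape of the table, including the peak at 40 threads:

```python
import math

# Heuristics quoted in the genconfig output above.
ELO_PER_DOUBLING = 250.0
ELO_COST_PER_THREAD = 2.0  # the 5000-visit figure, used flat here as a simplification

# (threads, visits/s) pairs copied from the first benchmark's summary.
RESULTS = [(6, 1172.56), (12, 1797.95), (20, 2167.26), (32, 2466.86),
           (40, 2832.08), (48, 2865.65), (64, 3021.92), (80, 3019.80)]

def net_elo(threads, vps, base_threads=6, base_vps=1172.56):
    """Elo gained from extra speed minus the per-thread MCTS penalty, vs the baseline."""
    speed_gain = ELO_PER_DOUBLING * math.log2(vps / base_vps)
    thread_cost = ELO_COST_PER_THREAD * (threads - base_threads)
    return speed_gain - thread_cost

best = max(RESULTS, key=lambda r: net_elo(*r))
print(best[0])  # prints 40: more threads keep raising visits/s, but the penalty wins
```

This also hints at why the seconds/move setting matters: at longer time settings each move gets more visits, the per-thread penalty shrinks, and the optimum shifts toward higher thread counts - consistent with the 5-second tuning above recommending 48 threads where the 1-second tuning recommends 40.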