The user wrote:
In the default_gtp.cfg file, if numSearchThreads = 6 or 8, it works (tuning does not occur, the message "gtp ready" appears, and genmove b is ok). If numSearchThreads = 12, 16, or 32, the test starts, tuning does not occur, the message "gtp ready" appears, but on genmove b KataGo stops working. That is, it is not the same error as before.
I deleted the dummy file tune6_gpuGeForceGT610_x19_y19_c320_mv8.txt and set numSearchThreads = 1; I get the same error as at the beginning.
With numSearchThreads = 32 and the g170e 20 block s3.35G network, tuning works fine.
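For reference, a minimal sketch of the setting under discussion as it would appear in default_gtp.cfg (the comments summarize the observations above; the exact layout of the stock config may differ):

```
# Number of threads the search uses. Larger values drive larger GPU batch
# sizes and therefore more GPU memory use during neural net evaluation.
numSearchThreads = 8   # 6 or 8 works on this GPU; 12, 16, or 32 fails on genmove
```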
And later wrote:
The 30 block network has been working without problems for more than 12 hours (engine matches). Auto-tuning still does not work. Can I edit the opencltuning file obtained for a 20 or 40 block network, and what would need to be replaced? Could this lead to errors or performance degradation?
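For context on what such an edit would involve: KataGo looks the tuning file up by a name that encodes the tuning setup, so a file produced for another net is not even picked up until the name matches. My reading of the filename fields below is inferred from the example earlier in this thread, not from documentation, and the KataGoData/opencltuning location is an assumption (check where your build actually wrote the file):

```
cd KataGoData/opencltuning   # assumed location of the tuning files
# tune6 = tuner data version, gpu... = GPU name, x19/y19 = board size,
# c256/c320 = trunk channels (the g170 20 and 40 block nets use 256,
# the 30 block net uses 320), mv8 = model version.
# Copying the 40 block net's file under the c320 name would make the
# 30 block net load it as-is, parameters unchanged:
cp tune6_gpuGeForceGT610_x19_y19_c256_mv8.txt \
   tune6_gpuGeForceGT610_x19_y19_c320_mv8.txt
```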
Given that you found 6 or 8 threads work, but 12 or more threads don't, when using the 30 block network (even if you skip the tuning), I'm going to guess that running the 30 block net is simply right at the borderline of what your GPU can handle.
I'm surprised that the tuning, of all things, would cause it to fail as well, since the tuning allocates an amount of memory a little smaller than actual usage: in theory it tunes using operations equivalent to a batch size of 2. But maybe there's something about the way the tuning is implemented that makes it more resource intensive, perhaps the fact that it also tries a lot of sub-optimal computational configurations in the process of trying to find the best one?
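If it helps to retry the tuning in isolation rather than as part of GTP startup, the OpenCL builds ship a standalone tuner subcommand; a minimal sketch (the model filename is a placeholder, and flag details can vary by version, so check the command's own help output first):

```
./katago tuner -model <30-block-model>.bin.gz
```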
Anyways, you just said that the 30 block network works fine for you at smaller numbers of threads if you just use the 20 or 40 block net's tuning file. It's probably not optimal, but I don't see the point in fiddling with it further if you can't run the tuning and the net is borderline almost more than your GPU can handle anyway. You can try the 40 block network instead if you like. The 40 block net should be *less* resource intensive than the 30 block net as far as resource limits go, despite having more blocks, since the convolutions it does are smaller (256 channels instead of 320).
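A quick way to check whether a given net actually fits before committing it to matches is the benchmark subcommand; a sketch, with the model filename as a placeholder:

```
./katago benchmark -model <40-block-model>.bin.gz -config default_gtp.cfg
```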
Other than that, I think there's nothing for you to do here. If you want to run the 30 block net with large numbers of threads (and therefore a large batch size), I guess you might simply need a better GPU.
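One related knob, offered as an untested suggestion rather than something verified on this GPU: the GTP config's nnMaxBatchSize caps the batch size used for neural net evaluation independently of numSearchThreads, so in principle it can bound GPU memory use even with many search threads, at the cost of threads waiting on the GPU:

```
# Sketch of a config combination, untested here:
numSearchThreads = 16   # search parallelism
nnMaxBatchSize = 8      # cap the GPU batch (and its memory) below the thread count
```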