Hi @lightvector,
Hope this finds you well!
Not sure whether you remember me. Two years ago I spent a few months trying to set up KataGo on my laptop to train a model to play Go, and also worked on adapting KataGo to play Daoqi, one of Go's variants. However, I wasn't able to get very far because I didn't have a decent GPU and it was too expensive to get one.
Now, two years later, GPUs are more affordable, so I built a brand new machine with an AMD Ryzen 9 5900X, an Nvidia GeForce RTX 3080 Ti (12 GB), and 64 GB of RAM. I installed Ubuntu 20.04 with CUDA 11.7.1, cuDNN 8.4.0, Python 3.7, TensorFlow 1.15, etc. I was able to compile KataGo with the CUDA backend and run synchronous_loop.sh. The selfplay, shuffle, and train steps worked fine. However, the gatekeeper is throwing the error below. I understand the gatekeeper is optional, but I guess this error might also occur when I run the model. I wonder what I should do to fix it. Any help would be highly appreciated.
Code:
...
2022-05-24 10:57:03-0400: Game loop thread 127 starting game testing candidate: mbp-s656768-d204361
terminate called after throwing an instance of 'StringError'
what(): CUBLAS Error, for ginputw file /home/gcao/KataGo2/cpp/neuralnet/cudabackend.cpp, func cublasHgemm( cudaHandles->cublas, CUBLAS_OP_N, CUBLAS_OP_N, outChannels, batchSize, inChannels, alpha, (const half*)matBuf,outChannels, (const half*)inputBuf,inChannels, beta, (half*)outputBuf,outChannels ), line 663, error CUBLAS_STATUS_NOT_SUPPORTED
Aborted (core dumped)