
All times are UTC - 8 hours [ DST ]




 Post subject: KataGo gatekeeper throws error CUBLAS_STATUS_NOT_SUPPORTED
Post #1 Posted: Tue May 24, 2022 8:24 am 
Dies in gote

Posts: 23
Liked others: 0
Was liked: 1
Rank: AGA 6D
Hi @lightvector,

Hope this finds you well!

Not sure whether you remember me. Two years ago I spent a few months trying to set up KataGo on my laptop to train a model to play Go, and I also worked on adapting KataGo to play one of Go's variants, Daoqi. However, I wasn't able to get very far because I didn't have a decent GPU and it was too expensive to get one.

Now, two years later, GPUs are more affordable, so I built a brand new machine with an AMD Ryzen 9 5900X + Nvidia GeForce RTX 3080 Ti (12GB) + 64GB RAM. I installed Ubuntu 20.04 with CUDA 11.7.1, cuDNN 8.4.0, Python 3.7, TensorFlow 1.15, etc. I was able to compile KataGo with the CUDA backend and run synchronous_loop.sh. The selfplay, shuffle, and train steps worked fine. However, the gatekeeper is throwing the error below. I understand that gatekeeper is optional, but I guess this error might also occur when I run the model. I wonder what I should do to fix it. Any help would be highly appreciated.

Code:
...
2022-05-24 10:57:03-0400: Game loop thread 127 starting game testing candidate: mbp-s656768-d204361
terminate called after throwing an instance of 'StringError'
  what():  CUBLAS Error, for ginputw file /home/gcao/KataGo2/cpp/neuralnet/cudabackend.cpp, func cublasHgemm( cudaHandles->cublas, CUBLAS_OP_N, CUBLAS_OP_N, outChannels, batchSize, inChannels, alpha, (const half*)matBuf,outChannels, (const half*)inputBuf,inChannels, beta, (half*)outputBuf,outChannels ), line 663, error CUBLAS_STATUS_NOT_SUPPORTED
Aborted (core dumped)

 Post subject: Re: KataGo gatekeeper throws error CUBLAS_STATUS_NOT_SUPPORT
Post #2 Posted: Wed May 25, 2022 4:26 pm 
Lives in sente

Posts: 720
Liked others: 113
Was liked: 872
Rank: maybe 2d
That's a little surprising. I don't know. Some thoughts:

* I have never tested KataGo with CUDA 11.7.1. You may notice the release is back at 11.1 or 11.2 (https://github.com/lightvector/KataGo/r ... ag/v1.11.0), but I've also successfully used CUDA 11.4 (along with cuDNN 8.2.4). Does installing a side-by-side downgraded CUDA 11.4 and cuDNN 8.2.4, and using that instead, work for you?

(As a side note, if you're on Linux, https://www.iridescent.io/tech-blogs-in ... right-way/ is a good guide, although slightly out of date, to installing CUDA in a way that won't bork future attempts to upgrade/downgrade, easily allows having multiple side-by-side versions installed at once, etc. In general the secret is to use the runfile version. I've used the deb version in the past and it always leaves apt packages in a messy state when I try to change versions. Indeed, the runfile version is also the one you can install without sudo: https://stackoverflow.com/questions/674 ... thout-sudo, i.e. you can do it in an entirely local and self-contained way.)

* Does KataGo's OpenCL version work for you and use your GPU successfully? (this might distinguish a GPU/GPU-driver issue from a CUDA-library-level issue).

* Instead of running gatekeeper right away, how about just running plain old KataGo benchmark, or hooking up to any popular game analysis GUI and just doing plain game analysis?

* Does it work if you disable FP16 in the config? (e.g. cudaUseFP16 = false in the config)

* There is some chance some other user in the discord https://discord.gg/45EWcZu7 will have seen a similar error and can help you troubleshoot.

 Post subject: Re: KataGo gatekeeper throws error CUBLAS_STATUS_NOT_SUPPORT
Post #3 Posted: Thu May 26, 2022 6:07 am 
Dies in gote

Posts: 23
Liked others: 0
Was liked: 1
Rank: AGA 6D
Thanks a lot. I did try running benchmark and got the same error. I'll try the downgrade and the other suggestions.

 Post subject: Re: KataGo gatekeeper throws error CUBLAS_STATUS_NOT_SUPPORT
Post #4 Posted: Thu May 26, 2022 10:14 am 
Dies in gote

Posts: 23
Liked others: 0
Was liked: 1
Rank: AGA 6D
I set cudaUseFP16 to false in the config. After that, both gatekeeper and benchmark worked fine.
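
For anyone hitting the same error, here is a rough sketch of that config change as a small Python helper. It is only an illustration: the function name disable_cuda_fp16 is made up for this post, and it assumes the config is a plain "key = value" text file in which "#" starts a comment. It works on the config text itself, so which config file to apply it to is left to you.

```python
import re

# Matches an active or commented-out "cudaUseFP16 = ..." line.
# Assumption: the config uses "key = value" lines and "#" for comments.
_FP16_LINE = re.compile(r"^\s*#?\s*cudaUseFP16\s*=")


def disable_cuda_fp16(config_text: str) -> str:
    """Return config_text with cudaUseFP16 set to false exactly once."""
    # Drop every existing cudaUseFP16 line so the key is not duplicated.
    kept = [line for line in config_text.splitlines()
            if not _FP16_LINE.match(line)]
    # Append the setting suggested earlier in this thread.
    kept.append("cudaUseFP16 = false")
    return "\n".join(kept) + "\n"


if __name__ == "__main__":
    # Hypothetical two-line config, just to show the effect.
    sample = "numSearchThreads = 8\n# cudaUseFP16 = true\n"
    print(disable_cuda_fp16(sample), end="")
```

Removing the old lines and appending a single new one avoids ending up with the same key set twice. Other settings are left untouched.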
