Why Does the Titan V GPU Always Report Insufficient Resources When a Kernel Uses More than 64 Registers

On a Titan V with driver 440.44 (every other GPU I've tried works fine), my application, built with CUDA v9, fails every kernel launch once the kernel uses more than 64 registers. This happens even with a block size of 1 thread and 32K blocks.

According to the specs, the Titan V, a compute capability 7.0 GPU, supports a maximum of 255 32-bit registers per thread. I only have one Titan V to test with, so I can't check whether this works on another card.

This looks to me like some kind of hardware failure, since the specs say 255 registers and my launch fails once I reach 65 registers.

What is the best way to proceed? Is this likely a hardware failure in my specific GPU, or a limitation of all Titan V GPUs?

Likely an issue with your code. Can you post minimal but complete repro code that demonstrates the issue? Please also show the exact nvcc command line that was used to build the code.

The default GPU architecture target for the CUDA 9.x toolchain is sm_30, which allows a maximum of 63 registers per thread. If you built your code without specifying the Titan V's architecture (sm_70) as the GPU target, your code will be restricted to the sm_30 resource limits.
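
One way to check this is to ask the runtime what the kernel was actually built with and to test the launch for errors. The following is only a rough sketch, not the poster's code: the kernel `myKernel`, the file name `repro.cu`, and the buffer `d_out` are hypothetical stand-ins, and the 32K-block, 1-thread launch mirrors the configuration described in the question.

```cuda
// Possible build commands (repro.cu is a hypothetical file name):
//   nvcc -arch=sm_70 -Xptxas -v repro.cu -o repro   // target the Titan V, print register usage
//   nvcc -Xptxas -v repro.cu -o repro               // toolchain default target, for comparison
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel, which was not posted.
__global__ void myKernel(float *out)
{
    out[blockIdx.x] = 1.0f;
}

int main()
{
    const int numBlocks = 32 * 1024;  // 32K blocks, as in the question

    // What the device itself reports.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device: %s, compute capability %d.%d\n", prop.name, prop.major, prop.minor);

    // What the loaded kernel code reports.
    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("kernel: %d registers per thread, binaryVersion %d, ptxVersion %d\n",
           attr.numRegs, attr.binaryVersion, attr.ptxVersion);

    float *d_out = nullptr;
    err = cudaMalloc(&d_out, numBlocks * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Launch with 1 thread per block and check for a launch failure.
    myKernel<<<numBlocks, 1>>>(d_out);
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
    }

    // Catch errors that only surface while the kernel runs.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel execution failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_out);
    return 0;
}
```

Comparing the `-Xptxas -v` output and the reported `numRegs` between the two builds should show whether the register count depends on the target architecture passed to nvcc.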