On a Titan V only (every other GPU I’ve tried works fine), with driver 440.44 and my application built with CUDA v9, any kernel that uses more than 64 registers per thread fails to launch. This happens every time, even with a block size of 1 thread and 32K blocks.
When I read the specs here I see that the Titan V, a compute capability 7.0 GPU, offers a maximum of 255 32-bit registers per thread. I only have one Titan V GPU to test with, so I’m unable to see whether this works on another.
It seems to me this is some kind of hardware failure, given that the specs say 255 registers per thread and my launch fails as soon as the kernel reaches 65 registers.
What is the best way to proceed here? Is this likely a hardware failure with my specific GPU, or a limitation of all Titan V GPUs?
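For reference, below is a minimal sketch of a check that might help narrow this down. It prints the register count the runtime reports for a kernel and separates a launch error from an execution error. It is not my application code: `probeKernel`, `check`, and `diagnoseLaunch` are hypothetical names used only for illustration, and the trivial kernel body would have to be swapped for the real one to reproduce the more-than-64-registers case. The launch shape (32K blocks of 1 thread) matches the one described above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace the body with the real kernel
// to see the register count and launch status that matter.
__global__ void probeKernel(float *out)
{
    out[blockIdx.x] = 1.0f;
}

// Print the error string on failure and report whether the call succeeded.
static bool check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Assumes the GPU under test is device 0.
static int diagnoseLaunch()
{
    cudaDeviceProp prop;
    if (!check(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties"))
        return 1;
    std::printf("%s: compute capability %d.%d, %d registers per block\n",
                prop.name, prop.major, prop.minor, prop.regsPerBlock);

    // Registers per thread and the code versions the runtime selected
    // for this kernel on this device.
    cudaFuncAttributes attr;
    if (!check(cudaFuncGetAttributes(&attr, probeKernel), "cudaFuncGetAttributes"))
        return 1;
    std::printf("numRegs=%d binaryVersion=%d ptxVersion=%d\n",
                attr.numRegs, attr.binaryVersion, attr.ptxVersion);

    const int blocks = 32 * 1024;  // 32K blocks of 1 thread each
    float *out = nullptr;
    if (!check(cudaMalloc(&out, blocks * sizeof(float)), "cudaMalloc"))
        return 1;

    probeKernel<<<blocks, 1>>>(out);
    // cudaGetLastError reports launch-time failures; cudaDeviceSynchronize
    // reports failures that occur while the kernel runs.
    bool ok = check(cudaGetLastError(), "kernel launch");
    ok = check(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(out);
    return ok ? 0 : 1;
}

int main()
{
    return diagnoseLaunch();
}
```

The exact error string printed for the failing launch, together with `numRegs` and `binaryVersion`, would probably be the most useful details to include in a reply.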