Hello,
I have an issue where the reported maximum number of threads I can use to launch a function doesn’t match the projected number. In my case the function uses 64 registers, cudaDeviceProp.maxThreadsPerBlock is 1024, and cudaDeviceProp.regsPerBlock is 65536. Thus the maximum thread block size should be 1024 threads, but cudaFuncAttributes.maxThreadsPerBlock reports only 512. Trying to launch with a 1024-thread block I get the error “cudaErrorInvalidValue”. The CUDA occupancy calculator suggests I should be able to launch 1 block of 1024 threads.
Are you using the latest CUDA version? That should be CUDA Toolkit 11.2 Update 2, best I can tell.
I should have expressed myself more clearly:
Please cut & paste the nvcc command line here. Please cut & paste the complete output from compiling with -Xptxas -v here.
Other than registers, I wonder whether per-thread stack usage could be the limiting factor here. I don’t recall what the default per-thread stack size is. Have you tried increasing the stack size via cudaDeviceSetLimit?
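Querying and raising the stack limit is a one-liner with the runtime API; a minimal sketch (error checking omitted, the 4 KB figure is just an example value):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    size_t stackSize = 0;
    cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);   // default per-thread stack
    printf("per-thread stack: %zu bytes\n", stackSize);

    // Example: raise the per-thread stack to 4 KB before launching the kernel.
    cudaDeviceSetLimit(cudaLimitStackSize, 4096);
    return 0;
}
```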
I don’t have hands-on experience with Ampere-class GPUs. Generally speaking, the number of registers allocated by the hardware allocator is >= (#threads * #regs reported by ptxas) due to granularity constraints. Note >=. One would assume that granularity boundaries are usually power-of-2 sized, so if ptxas says 64 registers per thread are needed, one would expect the hardware allocator to use exactly that.
The problem with cudaErrorInvalidValue is that there is no sub-code indicating which value is considered invalid. You might want to file a feature request with NVIDIA to get that changed.
Hi, sorry, I was too quick to respond with too little info. That was a debug build; the release build doesn’t use stack memory.
Here’s the command line (minus the Visual Studio macro/include train):
It looks like the number of threads is half (or less) of what the occupancy calculator says (and of what you get from the device properties). For instance, another function that uses 80 registers can only be launched with 384 threads max, which doesn’t make sense.
Note that ‘cudaErrorInvalidValue’ goes away as soon as I size my thread block to what cudaFuncAttributes.maxThreadsPerBlock reports, so I conclude that the number of threads in the thread block is the invalid value.
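For anyone following along, the check described above can be sketched like this (with a placeholder kernel `myKernel` standing in for the real one, and error checking omitted):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void myKernel() { /* ... */ }   // stand-in for the kernel in question

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("maxThreadsPerBlock: %d, numRegs: %d, localSizeBytes: %zu\n",
           attr.maxThreadsPerBlock, attr.numRegs, attr.localSizeBytes);

    // Clamp the launch configuration to what the attribute reports;
    // this is what makes cudaErrorInvalidValue disappear.
    int threads = attr.maxThreadsPerBlock;
    myKernel<<<1, threads>>>();
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}
```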
Thanks for the additional information. Unfortunately I am unable to draw any conclusions from it, and I cannot explain the thread count limit of 512 right now. A minimal self-contained reproducer could be helpful, though I realize it may not be trivial to construct. There may be a resource limit I am not aware of, possibly specific to the Ampere architecture. Is the code using textures by any chance, or barriers of some kind?
If you have double-checked that all launch parameters (thread and block dimensions, shared memory size, etc) are within specified limits, and you are encountering this with the latest CUDA version available, you may want to file a bug with NVIDIA. Maybe there is a bug in the runtime, maybe there is information missing in the documentation.
Although no longer relevant: The default stack size seems to be 1KB per thread, and your debug build was well within that limit.
I noticed I’m using CUDA 11.2.1. What I’ll do first is update CUDA to the latest update and update the drivers to the latest available. By the way, should I use the drivers that come with the toolkit, or can I get the latest drivers from NVIDIA, assuming all the features CUDA needs are present (given the version number is the same as or newer than what CUDA installs)?
I’ll also try a Linux environment with different hardware to check whether it’s an RTX-only problem.
I could try to create a simple repro case if nothing else helps.
When I install a new CUDA version, I install the full package including the driver. This should result in a software combo that is guaranteed to work, since tested by NVIDIA. Once I have verified that the new CUDA version is working properly by building and running some of my applications, I install the latest driver package identified by the auto-configurator on NVIDIA’s download page, then run another quick test with my apps.
Compiling with -maxrregcount=63, or -maxrregcount=60, or no -maxrregcount switch at all might be interesting experiments.
The basic idea is to try to get a handle on whether the code is running into some sort of internal accounting issue or an actual resource limitation. That’s not clear to me. Maybe one register is always reserved for internal system purposes, so user code can use at most 63?
I would guess it gets reported as a used register? And besides, one extra register should not halve the max number of threads, just reduce it slightly, I presume.
I am just brainstorming here. Or grasping at straws, if you prefer :-)
I really have no idea what could be going on, and it doesn’t help that I haven’t used an Ampere-class GPU yet. There may be a bug somewhere, or maybe I am simply overlooking something.
I got the same behavior on different hardware. And of course with my dummy repro-case kernel everything works as advertised :). At least I now know that it’s something specific to my particular kernel. I’ll keep digging…
OK, I found the problem, but that raises more questions :)
I am using __launch_bounds__ to control the number of registers used from the source. I originally thought __launch_bounds__ was just a complicated way of specifying max registers per function, rather than per compilation unit with a compiler option. But now it turns out it also acts as a hard launch limit, even though nothing prevents requesting a launch configuration larger than what __launch_bounds__ specifies.
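For context, the qualifier in question looks like this (the 512/3 values are just illustrative):

```cuda
// __launch_bounds__ takes (maxThreadsPerBlock[, minBlocksPerMultiprocessor]).
// The first argument is both a promise to the compiler AND a hard launch
// limit: launching with more threads than maxThreadsPerBlock fails with
// cudaErrorInvalidValue, which matches the behavior observed above.
__global__ void
__launch_bounds__(512 /* maxThreadsPerBlock */, 3 /* minBlocksPerMultiprocessor */)
myKernel(float *data)
{
    // ...
}
```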
How am I supposed to control register usage per function, then?
Because I want to maximize the number of resident blocks. Let’s say that without any limitations the compiler is happy with 36 registers. With a 512-thread block that’s 3 resident blocks on an RTX 3090, which is nice. But on a V100 or A100 that would still be only 3 resident blocks (edit: out of a maximum possible 4). By applying a per-kernel max-register tweak I can press the compiler to fit into 32 registers, which works nicely on all the hardware I care about.