cudaFuncAttributes.maxThreadsPerBlock / cudaDeviceProp.maxThreadsPerBlock mismatch problem

Hello,
I have an issue where the reported maximum number of threads I can use to launch a function doesn’t match the projected number. In my case the function uses 64 registers, cudaDeviceProp.maxThreadsPerBlock is 1024, and cudaDeviceProp.regsPerBlock is 65536. Thus the maximum thread block size should be 65536 / 64 = 1024 threads, but cudaFuncAttributes.maxThreadsPerBlock reports only 512. Trying to launch with a 1024-thread block I get the error “cudaErrorInvalidValue”. The CUDA occupancy calculator suggests I should be able to launch 1 block with 1024 threads.
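
For reference, here is roughly the check I am doing (a minimal sketch; my_kernel stands in for the actual kernel):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                  // device-wide limits
cudaFuncAttributes attrs;
cudaFuncGetAttributes(&attrs, my_kernel);           // per-function limits
int regLimited = prop.regsPerBlock / attrs.numRegs; // 65536 / 64 = 1024
printf("device maxThreadsPerBlock   = %d\n", prop.maxThreadsPerBlock);  // 1024
printf("register-limited block size = %d\n", regLimited);               // 1024
printf("function maxThreadsPerBlock = %d\n", attrs.maxThreadsPerBlock); // 512 (!)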

What am I missing?

Thanks.

CUDA version?
GPU architecture?
Full set of nvcc command line switches used, and output when adding -Xptxas -v?

RTX3090, compiling for sm_86:
1> 256 bytes stack frame, 378 bytes spill stores, 840 bytes spill loads
1>ptxas info : Used 64 registers, 288 bytes cumulative stack size, 32896 bytes smem, 480 bytes cmem[0]

Are you using the latest CUDA version? That should be CUDA Toolkit 11.2 Update 2, best I can tell.

I should have expressed myself more clearly:

Please cut & paste the nvcc command line here. Please cut & paste the complete output from compiling with -Xptxas -v here.

Other than registers, I wonder whether per-thread stack usage could be the limiting factor here. I don’t recall what the default per-thread stack size is. Have you tried increasing the stack size via cudaDeviceSetLimit?
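
Something along these lines (a quick sketch; the 4 KB value is just an arbitrary example):

size_t stackSize = 0;
cudaDeviceGetLimit(&stackSize, cudaLimitStackSize);   // query the current default
printf("current stack size: %zu bytes/thread\n", stackSize);
cudaDeviceSetLimit(cudaLimitStackSize, 4096);         // e.g. raise to 4 KB/thread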

I don’t have hands-on experience with Ampere-class GPUs. Generally speaking, the number of registers allocated by the hardware allocator is >= (#threads * #regs reported by ptxas) due to granularity constraints. Note the >=. One would assume that granularity boundaries are power-of-2 sized, though, so if ptxas says 64 registers per thread are needed, one would expect the hardware allocator to use exactly that.

The problem with cudaErrorInvalidValue is that there is no sub-code indicating which value is considered invalid. You might want to file a feature request with NVIDIA to get that changed.

Hi, sorry, I was too quick to respond with too little info. That was a debug build; the release build doesn’t use stack memory.
Here’s the command line (minus the Visual Studio macro/include train):

 -gencode=arch=compute_86,code=\"sm_86,compute_86\"  -x cu    -lineinfo  -use_fast_math -maxrregcount=0 --ptxas-options=-v --machine 64 --compile -cudart shared -O4 -use_fast_math  -Xptxas -dscm -Xptxas cs -Xptxas --opt-level -Xptxas 4 -Xptxas --allow-expensive-optimizations -Xptxas true -res-usage -Xptxas -warn-lmem-usage -Xptxas -warn-spills -Xptxas -warn-double-usage -Xptxas --verbose  -Xcudafe "--diag_suppress=field_without_dll_interface --diag_suppress=base_class_has_different_dll_interface"  

ptxas output:

1>ptxas info    : Compiling entry function '_Z9my_kerneliiiiiiPKfS0_S0_S0_S0_S0_S0_S0_S0_S0_S0_S0_Pf' for 'sm_86'
1>ptxas info    : Function properties for _Z9my_kerneliiiiiiPKfS0_S0_S0_S0_S0_S0_S0_S0_S0_S0_S0_Pf
1>    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1>ptxas info    : Used 64 registers, 32896 bytes smem, 480 bytes cmem[0]

I also print out how many threads CUDA thinks I can use:

cudaFuncAttributes attrs;
cudaFuncGetAttributes(&attrs, &my_kernel);
printf("function's maxThreadsPerBlock = %i\n", attrs.maxThreadsPerBlock);

and it prints:

function's maxThreadsPerBlock = 512

It looks like the number of threads is half (or less) of what the occupancy calculator says (and what you get based on device properties). For instance, another function using 80 registers can only be launched with 384 threads max, which doesn’t make sense either.

Note that ‘cudaErrorInvalidValue’ goes away as soon as I size my thread block to what cudaFuncAttributes.maxThreadsPerBlock reports, so I conclude that the number of threads in the thread block is the invalid value.
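
In code, the pattern looks roughly like this (sketch; grid size and arguments elided):

my_kernel<<<grid, 1024>>>(/* args */);                     // fails immediately
cudaError_t err = cudaGetLastError();                      // == cudaErrorInvalidValue

my_kernel<<<grid, attrs.maxThreadsPerBlock>>>(/* args */); // 512: launches fine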

Sorry, missed this one ;). Yes, 11.2 for sure, though I’m not sure if it’s ‘update 2’ or not. How do I check?

Thanks for the additional information. Unfortunately I am unable to draw any conclusions from this, and I cannot explain the thread count limit of 512 right now. A minimal self-contained reproducer code could be helpful, but I realize it may not be trivial to construct. There may be a resource limit I am not aware of, possibly specific to the Ampere architecture. Is the code using textures by any chance, or barriers of some kind?

If you have double-checked that all launch parameters (thread and block dimensions, shared memory size, etc.) are within the specified limits, and you are encountering this with the latest available CUDA version, you may want to file a bug with NVIDIA. Maybe there is a bug in the runtime, maybe there is information missing in the documentation.

Although no longer relevant: The default stack size seems to be 1KB per thread, and your debug build was well within that limit.

Don’t know, as I don’t have that installed either. 11.2 (without update 2) should have most teething issues fixed.

Thanks a lot for helping, Njuffa,

I figured out I’m using CUDA 11.2.1. What I’ll do first is update CUDA to the latest release and update the drivers to the latest available. By the way, should I use the drivers coming with the toolkit, or can I get the latest drivers from NVIDIA, assuming all the features CUDA needs are present (given the driver version is the same as or newer than what CUDA installs)?

I’ll also try a Linux environment with different hardware to check whether it’s an RTX-only problem.

I could try to create a simple repro case if nothing helps.

And no, no fancy stuff used, just the plain old-school way :)

11.2.1 suggests to me CUDA 11.2 update 1.

When I install a new CUDA version, I install the full package including the driver. This should result in a software combo that is guaranteed to work, since it has been tested by NVIDIA. Once I have verified that the new CUDA version is working properly by building and running some of my applications, I install the latest driver package identified by the auto-configurator on NVIDIA’s download page, then run another quick test with my apps.

Compiling with -maxrregcount=63, or -maxrregcount=60, or no -maxrregcount switch at all might be interesting experiments.

Will try this too. Keeping you posted.

BR

The basic idea is to try to get a handle on whether the code is running into some sort of internal accounting issue or an actual resource limitation. That’s not clear to me. Maybe one register is always reserved for internal system purposes, so user code can use at most 63?

I guess it would get reported as a used register? And in any case, one reserved register should not halve the max number of threads, just reduce it slightly: 65536 / 65 ≈ 1008 threads, not 512.

I am just brain storming here. Or grasping for straws, if you prefer :-)

I really have no idea what could be going on, and it doesn’t help that I haven’t used an Ampere-class GPU yet. There may be a bug somewhere, or maybe I am simply overlooking something.

I got the same behavior on different hardware. And of course, with my dummy repro-case kernel everything works as advertised :). At least I now know that it’s something specific to my particular kernel. I’ll keep digging…

OK, I found the problem, but that raises more questions :)
I am using __launch_bounds__ to control the number of registers used from the source. I originally thought __launch_bounds__ was just a complicated way of specifying max registers per function, rather than per compile unit via a compiler option. But it turns out it also acts as a hard launch limit, even though nothing at compile time prevents launching configurations larger than what __launch_bounds__ specifies.
How am I supposed to control register usage per function then?
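
For illustration, what I had looks roughly like this (the 512 is illustrative):

// Caps the kernel at 512 threads per block; this is exactly what
// cudaFuncAttributes.maxThreadsPerBlock then reports.
__global__ void __launch_bounds__(512) my_kernel(/* params */)
{
    // ... kernel body ...
}

// Any launch with more than 512 threads per block now fails with
// cudaErrorInvalidValue, regardless of actual register pressure.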

why do you want to do that?

Because I want to maximize the number of resident blocks. Let’s say that without any limitation the compiler is happy with 36 registers. With a 512-thread block that’s 3 resident blocks on an RTX 3090, which is nice (3 blocks is the 1536-threads-per-SM limit there anyway). But on a V100 or A100 that would still be only 3 resident blocks out of a maximum possible 4 (they allow 2048 threads per SM), since 65536 / (36 * 512) rounds down to 3. By applying a per-kernel max-register tweak I can press the compiler to fit into 32 registers (65536 / (32 * 512) = 4), which works nicely on all the hardware I care about.
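
A sketch of the per-kernel approach, using the numbers from the example above: requesting at least 4 resident blocks of 512 threads makes the compiler aim for 65536 / (4 * 512) = 32 registers per thread.

// Second __launch_bounds__ argument = min resident blocks per SM;
// the compiler limits register usage so 4 blocks of 512 threads fit.
__global__ void __launch_bounds__(512, 4) my_kernel(/* params */)
{
    // ... kernel body ...
}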