I’ve been finding that the CUDA compiler is pretty good at deciding what to do if I reduce the thread count enough that it can expand the register usage as it wishes. This aligns with advice @njuffa gave a couple of years ago. However, I still have a number of cases where I need multiple blocks per SM: in particular, when I want enough __shared__ memory per thread that the total __shared__ memory per SM would exceed 48 kB (sometimes even 96 kB) while still wanting at least four blocks per SM. In other situations, I may want 8 blocks per SM because the block performs various __syncthreads() operations and I want more granularity to address a larger number of small tasks, each block handling one task at a time.
Is there a way to tell the compiler that I want a certain number of distinct blocks per SM, but let it decide how large to make each of those blocks? I doubt I would encounter a situation where the compiler optimizes down to fewer threads than I actually need; e.g., if I need at least 128 threads (four warps) per block, I would not be in danger of getting blocks smaller than that when specifying that two blocks launch per SM.
The number of threads per block is a launch parameter you have to specify. It will not be determined for you. You can ask CUDA about the capabilities of the GPU in use and apply some heuristics to find a good value.
You can compile and assemble your kernel with maxrregcount or with __launch_bounds__, which are two different routes to the same end: the file-wide flag caps registers per thread directly, while the per-kernel attribute lets ptxas derive a register budget from the occupancy you request.
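As a sketch of the two routes (the kernel name and the specific counts here are illustrative, not from the original posts):

```cuda
// Route 1: cap registers for every kernel in this translation unit:
//   nvcc -maxrregcount=64 -c kernels.cu
//
// Route 2: per-kernel hint; ptxas derives a register budget that lets
// 4 blocks of up to 256 threads be resident on one SM.
__global__ void __launch_bounds__(256, 4)
scaleKernel(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;
}
```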
Depending on your program logic, __syncthreads() can sometimes be replaced with synchronization of just some of the warps instead of the whole block.
See the higher-level cooperative groups (CUDA C++ Programming Guide) or the lower-level named barriers (names 0…15; PTX ISA 8.5).
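A minimal sketch of the cooperative-groups route (the tile size and kernel body are illustrative):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void partialSync(float *data)
{
    cg::thread_block block = cg::this_thread_block();
    // Partition the block into tiles of 32 threads (one warp each).
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

    // ... per-warp work on an independent small task ...

    warp.sync();   // synchronizes only this warp, not the whole block
    // block.sync() would be the equivalent of __syncthreads(); use it
    // only at the points where the whole block genuinely must meet.
}
```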
As I understand it, you would combine some of the blocks into one block. Why not choose, with __launch_bounds__, the minimum number of threads per block, give each block enough shared memory, and simply use more blocks than necessary? Or would you then exceed the number of (resident) blocks per SM?
Thanks @Curefab. My understanding of __launch_bounds__ was that it takes the maximum number of threads per block and the minimum number of blocks per SM, which, as you see, is the reverse of what I really want to tell it. It is also overloaded, with forms that accept maxThreadsPerBlock alone or that plus minBlocksPerMultiProcessor, but the absolute requirement is the maximum number of threads per block.
That may seem like what I need, i.e. I specify 512 maximum threads per block and perhaps the compiler designs the blocks so that they take 384 threads and work most efficiently at higher register pressure. In fact, though, it forces the compiler to design the blocks so that they can be launched with up to 512 threads per block. If I then choose to launch twice as many blocks with 256 threads each, that is a possibility, albeit one that allocates the same amount of __shared__ memory per block regardless of the thread count, which is why I’ve had to make different variants of some kernels when I want to subdivide the launch grid.
At a fundamental level, I don’t see a way to tell the compiler “design blocks such that they will work with at least 128 threads, more as is feasible within the register limits, and prepare to launch at least two / four / eight blocks per SM.”
True, I can specify that at least 2 / 4 / 8 blocks be launched per SM. But I can only do that if I also specify a particular number of threads per block, some subdivision of, say, 512 in the above example. But what if I want to let the compiler decide, for each architecture it is building for, how many threads it would prefer to launch the blocks with? If I give it a number, it effectively means “launch with that number,” not “choose a number up to this limit and work out the registers as would be best for this hardware line.”
What I’ve been doing thus far is outfitting each kernel with __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiProcessor) and then, by hand, tuning maxThreadsPerBlock to be as high as possible without creating register spills, and (in fact) as high as possible without much changing the register usage that the compiler would prefer if I specified a very low number for maxThreadsPerBlock.
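The hand-tuned caps can at least be selected per architecture at compile time; the CUDA C++ Programming Guide shows __CUDA_ARCH__ used this way. A sketch, with entirely made-up thread caps:

```cuda
// Illustrative per-architecture tuning; the caps below are placeholders
// that would come from hand inspection of the ptxas output for each arch.
// Note __CUDA_ARCH__ is only defined in the device compilation pass, so
// host code should query the compiled limit at runtime rather than
// assume this macro's value.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
#  define KERNEL_MAX_THREADS 384
#elif defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 700)
#  define KERNEL_MAX_THREADS 512
#else
#  define KERNEL_MAX_THREADS 256
#endif

__global__ void __launch_bounds__(KERNEL_MAX_THREADS, 2)
tunedKernel(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] *= 2.0f;
}
```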
My hope in the OP was that there might be a way to automate that process, rather than continually reviewing the ptxas output for the whole code each time a new architecture comes out and adjusting each kernel’s maxThreadsPerBlock by hand. That’s the critical distinction.
I think it is possible to launch with more threads per block than were specified during compilation as maxThreadsPerBlock, as long as the resulting registers fit on the SM (even though the documentation is strict and says this gives a runtime error: PTX ISA 8.5).
The resulting number of registers is approximately what you want to achieve (“without much changing the register usage”).
> I think it is possible to launch with a higher number of threads per block than specified during compilation as maxThreadsPerBlock as long as the resulting registers fit on the SM.
In my experience this is a classic way to have a kernel launch fail and leave you wondering why the code doesn’t work in some pathway. I have whole C++ classes devoted to running cudaFuncGetAttributes on every variant of my kernels and collecting the maxThreadsPerBlock, so that this information can be dispensed at runtime to inform the correct launch parameters. That way I only need one macro, where the kernel variant is compiled, to determine launch bounds, and the program itself carries the results forward to whatever launch I am doing. These classes also have a mechanism for subdividing the blocks and reporting how many threads to use in each subdivision, if I choose to allow that. But if I could let the compiler decide how many threads it wants to use to get the maximum performance, that would let me keep my C++ XxxxKernelManager classes in place while eliminating a layer of macros.
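For readers following along, the runtime query behind that scheme looks roughly like this (the kernel and its bounds are illustrative, not the poster’s actual code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void __launch_bounds__(256, 4)
exampleKernel(float *x) { /* ... */ }

// Query the limits baked in at compile time, once at startup, and size
// all subsequent launches from the result instead of a hard-coded count.
int queryMaxThreads()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, exampleKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "query failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    printf("maxThreadsPerBlock = %d, numRegs = %d\n",
           attr.maxThreadsPerBlock, attr.numRegs);
    return attr.maxThreadsPerBlock;  // never launch with more than this
}
```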
You are probably already doing a lot, compared to what is possible.
One could modify the automatically generated PTX code during compilation, replacing .maxntid with .reqntid, and potentially also use ptxas parameters that influence the optimization of the register counts.
Or, instead of using __launch_bounds__, you could set maxrregcount to a value that guarantees the minimum number of blocks per SM (together with choosing a shared memory usage low enough to fit that many blocks)?
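The back-of-envelope arithmetic for that cap can itself be automated from device attributes. A sketch (hardware rounds register allocations to a granularity, so treat the result as an upper bound to hand to the next build):

```cuda
#include <cuda_runtime.h>

// Compute a register cap that should allow `blocksPerSM` resident blocks
// of `threadsPerBlock` threads each on the given device.
int registerCap(int device, int blocksPerSM, int threadsPerBlock)
{
    int regsPerSM = 0;
    cudaDeviceGetAttribute(&regsPerSM,
                           cudaDevAttrMaxRegistersPerMultiprocessor, device);
    return regsPerSM / (blocksPerSM * threadsPerBlock);
    // e.g. 65536 regs/SM with 4 blocks x 256 threads gives a cap of 64;
    // that value would be passed to nvcc as -maxrregcount=64.
    // Shared memory must also fit: smemPerBlock <= smemPerSM / blocksPerSM.
}
```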
Otherwise I do not know of any other solution, apart from NVIDIA improving the tools.
Thanks @Curefab and @Robert_Crovella. I will look into maxrregcount to see if there is a more elegant solution, but it sounds like I am doing a reasonable job of making use of what CUDA does provide.