When I was learning about register usage, one of the first things that stuck out in what I was being told is that the number returned by
--ptxas-options="-v" isn’t really what you want. ‘Used 47 registers’ really means 48, 53 is perhaps 56, 62 might as well be 64. The break points get spaced further apart as the overall number of registers gets larger, I’m pretty sure. One of your very helpful engineers once sent me the link where all of this was tabulated.
What I’m doing is designing a kernel management object to handle my kernel launches. When the kernel is set to utilize the maximum threads per block, the solution is obvious, but even then, I want to have a runtime means of checking that my launch parameters fit with the launch bounds, otherwise I’m chasing pre-processor directives and macro definitions everywhere. The runtime-executed code will provide a nice umbrella for collecting all of the decisions regarding launch parameters. Furthermore, there are cases where one size will not fit all. I may compile a kernel with 1024 max threads per block, but instead want to launch it as four x (streaming multiprocessor count) blocks at 256 threads each, or even 8 x 128, depending on the workload.
In the construction of that object, I’d like to hard-code the break points for CUDA register usage to make the best decisions about the numbers of smaller blocks to allocate. Please help if you know where I can find that table!
If I understand the question correctly, you are looking for the CUDA occupancy calculator, which is an Excel spreadsheet that incorporates all the architecture-specific register allocation decisions of the hardware in order to compute occupancy.
FWIW, at this time, my standing advice with regard to any contemplated use of
__launch_bounds__ is: don’t use these, let the compiler work its magic (Pascal architecture and up, which is pretty much all that is relevant now).
Thanks as always, @njuffa! I gave your advice some thought last night, and while being a little ways into the development of this kernel launch manager tends to push the psychology towards continuing on that route, I am trying to balance that with the principle of getting the most from the compiler’s sophistication and the accrued expertise of all those engineers.
My problems are structured around work units of well defined sizes: multiples of eight warps’ worth of work for particle pair interactions, pre-cached particle locations and properties for local interactions, planned reduction operations over each system. The workload is a big unknown until the program is launched–I could have one or a handful of systems with 25,000 atoms apiece, a single system of 250,000 atoms, or two thousand systems of 25-50 atoms each. The various work units are designed around sizes that will reach a critical mass of imported information, but remain as granular as possible. I’ve tried to anticipate the balance between smaller work units that would lean too hard on memory bandwidth and work units that are larger than necessary.
My sense is that I can press on the compiler a little bit to make it understand that the launch grid may need to bring a lot of threads to bear, but I don’t want to press too hard lest the compiler start doing inefficient things just to meet my excessive demands, or even start cleaving 64-bit information in the manner we spoke about the other day. My sense is that, if I allow the compiler a reasonable degree of wiggle room, it will be able to optimize and I will be able to launch blocks of the sizes that fit my work units. I will do a few experiments where I scale back the requirements (push a minimum 5 blocks of 256 per multiprocessor back to 4, for example).
I may also take advantage of the organizational utility afforded by my kernel management system to also compile versions of the code where I just let the compiler have whatever it wants–scale back some of these kernels to half the thread count I typically ask for, or ask for a minimum of just 1 block of 256 threads per multiprocessor, and let the compiler have free reign to utilize as many registers as it wants to get the calculation done as fast as possible. This fits with a guiding philosophy in my new code base of making things that are as elastic as possible: the highest possible speed on a small workload, one that will push a single simulation to long timescales, and the best possible throughput on a large workload, gearing down and getting the most aggregate statistics on many related simulations. (I doubt that the compiler always knows best, because it doesn’t necessarily know if I want speed on a small problem or throughput on a big one, but I will keep an eye on the results to see whether the laissez-faire approach is the best in both cases.)
That likely means imposing a register limit of no more than 2-3 registers below the “natural” usage the compiler picks based on its heuristics. I am not saying that use of
__launch_bounds will never help. There are noise effects in code generation that may result in small performance improvements just by jostling the compiler in some way.
However, I would be surprised if significant performance advantages were achieved with these mechanisms. The CUDA compiler of 2022 is not the CUDA compiler of 2012. Also, the GPU ISAs seem to finally have converged to a streamlined RISC-like platform that make it easier for compilers to target (the fact that register pairs must be used for 64-bit operands is a fly in the ointment, but moving to a full 64-bit architecture would likely be wasteful).
That likely means imposing a register limit of no more than 2-3 registers below the “natural” usage the compiler picks based on its heuristics.
Indeed, that’s what I tend to do. It was something I just learned to do, like the way one can push the trash down in the bin and go an extra half day without taking it out. It’s not much but I often find myself at 42 or 66 registers of natural usage and I want to get another block of 256 threads onto the SMP for better granularity.