Will register spills be compartmentalized?

So, I have some, let’s just call them ‘very fused kernels.’ One kernel, for example, imports a list of particles’ coordinates and then performs a list of different operations on it (the list of operations, and the particles that each element of the list operates on, are determined at run time). There are about nine different operations to choose from, ‘do 32x of harmonic stretching,’ ‘do 32x rotational profiles,’ etc. Some of those operations, particularly the single-axis rotational profiles and biaxial rotations, are much more register-heavy. We’ve had lots of discussions internally about breaking those off and placing them in alternative kernels, but part of the point is to fuse the kernels for code simplicity, to avoid extra kernel launches, and to make the most use of the global memory transactions before and after the kernel performs its list of operations.
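For readers unfamiliar with the pattern, the structure in question looks roughly like the sketch below. All names (`Op`, `fusedKernel`, the operation labels) are invented for illustration; the post does not show the real kernel's interface.

```cuda
// Hypothetical sketch of a "very fused" kernel: one set of coalesced
// global loads up front, a runtime-determined list of operations, one
// set of global stores at the end. Roughly nine ops exist in practice.
enum Op { HARMONIC_STRETCH_32, ROTATIONAL_PROFILE_32,
          BIAXIAL_ROTATION_32 /* ... */ };

__global__ void fusedKernel(const float3 *coords, const Op *opList,
                            const int *opTargets, int nOps)
{
    // One coalesced import of particle coordinates, shared by all ops.
    for (int i = 0; i < nOps; ++i) {
        switch (opList[i]) {
        case HARMONIC_STRETCH_32:   /* register-light path */ break;
        case ROTATIONAL_PROFILE_32: /* register-heavy path  */ break;
        case BIAXIAL_ROTATION_32:   /* register-heavy path  */ break;
        }
    }
    // ... one set of global stores before the kernel exits.
}
```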

The register usage of that kernel (with rdc=false) is between 92 and 102 per thread. (There are different versions of the kernel for specialized run modes in the program as a whole, another reason that breaking it into more pieces would get messy.) If I turn on rdc=true (Relocatable Device Code), the register usage skyrockets to 255 per thread. No rdc=true! In the former case, I’d still like to get more than 640 threads / SMP if at all possible, because there are even more things that the kernel could be doing which are, in fact, particularly register-light. However, it’s obviously not a good thing to try and tack a register-light activity onto a kernel with such constricted throughput.

My thought is this: the kernel currently runs in blocks of 128. While it’s not trivial to make the thing run in larger blocks, I could do it. What if I upped the thread count per block to 1024, to make any one SMP take 1024 threads rather than topping out at 5 x 128 = 640 due to register pressure? Would the register spills be confined to code paths with high register usage, or would I be unable to control where the compiler decides to spill registers? Some of the highest register usage is happening in branches (operations) that are rarely invoked, so if the spills were only to happen there most users wouldn’t be getting the performance drag at all.
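One concrete way to pose that experiment to the compiler is `__launch_bounds__`, which is a real CUDA attribute; the kernel name below is illustrative. On architectures with a 64K-register file per SM, guaranteeing 1024-thread blocks caps the kernel at 64 registers per thread, so everything beyond that must be spilled to local memory; where exactly ptxas places those spills is its choice, which is the crux of the question.

```cuda
// Sketch: force ptxas to make the kernel launchable at 1024 threads/block.
// With 65536 registers per SM, that implies at most 64 registers/thread;
// the remaining ~30-40 live values per thread get spilled.
__global__ void __launch_bounds__(1024, 1)
fusedKernel(const float3 *coords, int n)
{
    // ... same body as before. Compile with -Xptxas -v to see the
    // resulting "bytes spill stores / bytes spill loads" per kernel.
}
```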

Does the compiler know that these branches are rarely invoked? Probably not.

First, you can increase the block size without even making the code correct, just to see what happens; maybe also analyze the SASS code.

Second, the compiler will probably fail to determine what to keep and what to spill, so you should help it. One possibility on CPU compilers is the expect builtin (__builtin_expect). Another, GPU-specific, one is to assign these variables to shared memory. You can also fake the index calculation in some way (e.g. passing indexes as kernel parameters) to force these variables out of registers.
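The shared-memory idea above can be sketched as follows; all names are illustrative, and the 128-thread block size matches the one mentioned in the question.

```cuda
// Sketch: park state for a rarely taken, register-heavy branch in shared
// memory, so ptxas never needs registers for it on the common path.
__global__ void heavyBranchKernel(const float3 *coords, const int *opFlag)
{
    // Four scratch slots per thread, 128 threads/block assumed.
    __shared__ float scratch[128 * 4];
    float *my = &scratch[threadIdx.x * 4];

    if (opFlag[blockIdx.x] == 7) {  // hypothetical "rare op" code
        // Values that would otherwise occupy registers across the branch:
        my[0] = coords[threadIdx.x].x;
        my[1] = coords[threadIdx.x].y;
        // ... heavy math reads and writes my[0..3] instead of registers.
    }
}
```

The trade is explicit: shared-memory traffic on the rare path in exchange for lower register pressure everywhere.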

Pretty much. The compiler is reasonably clever though, when it comes to spilling. It accounts for nested loops by trying to spill in the outer loops, for example. Also, it tries to vectorize spills/fills where possible.

Note that some minimal spilling (e.g. four registers per thread) is typically harmless, as it is absorbed by the L1 cache. Other than that, try to reformulate the register-heavy paths so that they use fewer registers. In some instances I have spent a couple of days on such a task and managed to reduce the register usage of infrequently used paths by a single-digit percentage. Explicitly moving data to places other than registers may help, as pointed out by BulatZingashin, as well as simplifying the math. It’s amazing how many registers a single call to pow() eats up, for example :-)
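To make the pow() remark concrete, here is a hypothetical device-function pair: for small integer exponents, explicit multiplication avoids the long inlined sequence that the general power routine expands into.

```cuda
// Register-hungry: powf() inlines a general exp/log-based sequence
// that ties up a noticeable number of registers and instructions.
__device__ float energyGeneral(float r) { return powf(r, 6.0f); }

// Cheap equivalent for a known integer exponent: a few registers,
// three multiplies.
__device__ float energyCheap(float r) { float r2 = r * r;
                                        return r2 * r2 * r2; }
```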

As far as I know, the CUDA compiler does not support annotations (presumably: attributes) that allow programmers to indicate which code paths are taken frequently, nor does it support the importation of profiler data for that purpose.

As for finding the optimal thread-block size, I would suggest simply trying a bunch of different configurations, if possible with the help of an auto-tuning framework. In my experience, the GPU execution model is too complex to predict with any accuracy what combination of occupancy and spilling will result in optimal performance for a particular GPU (family).
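A minimal starting point for such a sweep: `cudaOccupancyMaxPotentialBlockSize` is a real CUDA runtime API that suggests a block size from the kernel's actual register and shared-memory footprint; the benchmarking loop around it (and the empty kernel) are placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fusedKernel(const float3 *coords, int n) { /* ... */ }

int main()
{
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for an occupancy-maximizing block size given the
    // kernel's compiled resource usage (0 dynamic smem, no size limit).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       fusedKernel, 0, 0);
    printf("suggested block size: %d (min grid size: %d)\n",
           blockSize, minGridSize);
    // Candidates still worth timing by hand: 128, 256, 512, 1024, ...
    return 0;
}
```

The suggestion is only a seed for the search; as noted above, measured timings are the only reliable arbiter.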