So, I have some, let’s just call them ‘very fused kernels.’ One kernel, for example, imports a list of particles’ coordinates and then performs a list of different operations on it (the list of operations, and the particles that each element of the list operates on, are determined at run time). There are about nine different operations to choose from: ‘do 32x harmonic stretching,’ ‘do 32x rotational profiles,’ and so on. Some of those operations, particularly the single-axis rotational profiles and biaxial rotations, are much more register-heavy. We’ve had lots of internal discussions about breaking those off into alternative kernels, but part of the point is to fuse the kernels for code simplicity, to avoid extra kernel launches, and to make the best use of the global memory transactions before and after the kernel performs its list of operations.
The register usage of that kernel (with rdc=false) is between 92 and 102 per thread (there are different versions of the kernel for specialized run modes in the program as a whole, which is another reason that breaking it into more pieces would get messy). If I turn on rdc=true (relocatable device code), the register usage skyrockets to 255 per thread, so no rdc=true! In the former case, I’d still like to get more than 640 threads per SMP if at all possible, because there are even more things the kernel could be doing that are, in fact, particularly register-light. However, it’s obviously not a good idea to try to tack a register-light activity onto a kernel with such constricted occupancy.
My thought is this: the kernel currently runs in blocks of 128. While it’s not trivial to make it run in larger blocks, I could do it. What if I upped the thread count per block to 1024, so that any one SMP takes 1024 threads rather than topping out at 5 x 128 = 640 due to register pressure? Would the register spills be confined to code paths with high register usage, or would I have no control over where the compiler decides to spill registers? Some of the highest register usage occurs in branches (operations) that are rarely invoked, so if the spills happened only there, most users wouldn’t see the performance drag at all.