Understanding different register counts for the same kernel


I have a kernel implemented in two ways (same block/grid size, inputs, compiler flags, etc., called at the same place in the program):

__global__ void k1(const St* __restrict__ p0, ...other params) {
    const int* __restrict__ p1 = p0->p1;
    const int* __restrict__ p2 = p0->p2;
    const int* __restrict__ p3 = p0->p3;
    // and 8 more lines like this, from p4 to p11

    // same calculation code
}

__global__ void k2(const int* __restrict__ p1, const int* __restrict__ p2,
                   ..., const int* __restrict__ p11, ...other params) {
    // same calculation code
}

The compiler used 64 registers for k1 and 40 registers for k2. To my surprise, k1 is consistently 30% faster than k2.

To explain the difference in register counts: does this mean that, at least in this case (CUDA 10.1 / sm_60), kernel arguments are less likely to be (or simply aren't) copied to registers?

And does it make a difference when it comes to accessing global device memory in these two different ways? If not, what else might be the reason for the 30% difference in the performance?


Kernel arguments are stored in a special buffer, which on current architectures is constant memory.


It is impossible to diagnose performance issues without seeing the entire code base.

Thanks for the clarification.

The code is actually quite complex, and I don't think posting the whole kernel would be helpful. I understand that without the code it's difficult to say what the real issue is, so I intended to ask a few general questions here.

One general question: is there a difference between reading a global data array via a pointer passed in the kernel arguments vs. via a pointer stored in a register? I am not an expert, but I would assume they should be quite close. Based on this assumption, and the fact that the rest of the code in the two kernels is identical, I can only conclude that, in this case, although the larger register count decreased the maximum occupancy, the kernel was optimized further with more registers available. I have also learned that higher occupancy never guarantees better performance.

If using more registers in the kernel can potentially increase performance, I wonder if there are any flags or attributes to set a minimum register count for the kernel. I know that "maxrregcount" and launch bounds are available to limit the register count, by the way.
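For reference, the two register-limiting mechanisms mentioned above look like this (a sketch only; the kernel name and the bound values are hypothetical):

```cuda
// Cap register usage from the source side: at most 256 threads per block,
// and at least 2 blocks resident per SM, which bounds registers per thread.
__global__ void __launch_bounds__(256, 2)
my_kernel(const int* __restrict__ in, int* __restrict__ out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * 2;
}

// Alternatively, cap registers for the whole compilation unit at build time:
//   nvcc -arch=sm_60 -maxrregcount=40 kernel.cu
```

Note that both of these set an upper limit; as far as I know there is no corresponding flag to force a minimum.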


is there a difference in reading global data array via a pointer in the arguments vs. via a pointer stored in the register?

I can't think of a reason why that would be the case, but the best way to find out is to look at the SASS code:

cuobjdump -sass <executable>

It's possible that one kernel uses too many registers and is spilling into local memory. Local memory resides in global memory and will impact performance. You can enable the ptxas flags -warn-spills and -warn-lmem-usage at compile time (forwarded through nvcc's -Xptxas) to check.
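Assuming a single source file named kernel.cu (the file and output names here are placeholders), the spill warnings can be enabled through nvcc like this:

```shell
# -Xptxas forwards comma-separated flags to ptxas; -v additionally prints
# per-kernel register and local-memory usage at compile time.
nvcc -arch=sm_60 -Xptxas -v,-warn-spills,-warn-lmem-usage -o app kernel.cu
```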

I think the best thing for you to do is profile your code using Nsight Compute, which can also compare kernels side by side. If you compile with -lineinfo, you can select a particular source line and see the compiled SASS for that line. This might help answer your first question.