Understanding different register counts for the same kernel

Hi,

I have a kernel implemented in two ways (with the same block/grid size, input, and compiler flags, and called at the same place in the program):

__global__ void k1(const St* __restrict__ p0, ...other params) {
    const int* __restrict__ p1 = p0->p1;
    const int* __restrict__ p2 = p0->p2;
    const int* __restrict__ p3 = p0->p3;
    // and 8 more lines like this from p4 to p11

    // same calculation code
}

__global__ void k2(const int* __restrict__ p1, const int* __restrict__ p2,
                   ..., const int* __restrict__ p11, ...other params) {
    // same calculation code
}
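
For context, here is a minimal, self-contained sketch of what I mean (the struct layout, the kernel names, and the trivial body are placeholders, not the real code):

struct St {
    const int* p1;
    const int* p2;
    const int* p3;
    // ... p4 through p11 in the real struct
};

// k1-style: the pointers themselves sit in global memory behind p0 and are
// read into local variables before any element data can be accessed.
__global__ void k1_sketch(const St* __restrict__ p0, int* out, int n) {
    const int* __restrict__ p1 = p0->p1;
    const int* __restrict__ p2 = p0->p2;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p1[i] + p2[i];
}

// k2-style: the pointers arrive directly as kernel parameters.
__global__ void k2_sketch(const int* __restrict__ p1, const int* __restrict__ p2,
                          int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p1[i] + p2[i];
}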

The compiler used 64 registers for k1 and 40 registers for k2. To my surprise, k1 is consistently 30% faster than k2.

To explain the difference in register counts: does this mean that, at least in this case (CUDA 10.1 / sm_60), kernel arguments are less likely to be copied to registers (or simply aren’t)?

And does accessing global device memory in these two different ways make a difference? If not, what else might explain the 30% difference in performance?

Thanks.

Kernel arguments are currently stored in a special buffer, namely constant memory.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/#function-parameters
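
If you want to see this for yourself, a quick experiment (the kernel below is made up for illustration): compile it for sm_60 and dump the SASS. The pointer parameters show up as constant-bank operands (of the form c[0x0][...]) in the address computation, rather than as values the kernel had to fetch into registers from somewhere else first.

// Minimal kernel, only for inspecting how parameters are accessed in SASS.
__global__ void copy_one(const int* __restrict__ src, int* __restrict__ dst)
{
    // src and dst are kernel parameters; they live in the constant bank and
    // the compiler reads them from there when forming the load/store addresses.
    dst[threadIdx.x] = src[threadIdx.x];
}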

It is impossible to determine issues with performance without seeing the entire code base.

Thanks for the clarification.

The code is actually quite complex, and I don’t think posting the whole kernel would be very helpful. I understand that without the code it’s difficult to say what the real issue is, so I’d like to ask a few general questions here instead.

One question, in general: is there a difference between reading a global data array via a pointer passed as a kernel argument vs. via a pointer stored in a register? I am not an expert, but I would assume the two should be quite close. Based on that assumption, and the fact that the rest of the code in the two kernels is identical, I can only conclude that, in this case, although the larger register count decreased the maximum occupancy, the kernel was optimized further with the extra registers available. I have also learned that higher occupancy never guarantees better performance.

If using more registers in a kernel can potentially increase performance, I wonder whether there are any flags or attributes to set a minimum register count for the kernel. I know “-maxrregcount” and __launch_bounds__ are available to limit the register count, btw.
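
For reference, this is what I mean by launch bounds (a sketch; the kernel name and the numbers are placeholders for my actual configuration):

__global__ void
__launch_bounds__(256 /* maxThreadsPerBlock */, 2 /* minBlocksPerMultiprocessor */)
k2_bounded(const int* __restrict__ p1, const int* __restrict__ p2,
           int* out, int n)
{
    // The compiler caps register usage so that 2 blocks of 256 threads
    // can be resident on an SM at the same time.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p1[i] + p2[i];
}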

Thanks.

is there a difference in reading global data array via a pointer in the arguments vs. via a pointer stored in the register?

I can’t think of a reason why there would be a difference. But the best way to find out is to look at the SASS code.

cuobjdump <executable> -sass

It’s possible that one kernel uses too many registers and is spilling into local memory. Local memory resides in global memory and will impact performance. You can pass the -warn-spills and -warn-lmem-usage ptxas flags during compilation to check.
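
Something like this, for example (file name and arch adjusted to your setup; -Xptxas forwards the options to ptxas, and adding -v also prints the per-kernel register counts):

nvcc -O3 -arch=sm_60 -Xptxas -v,-warn-spills,-warn-lmem-usage app.cu -o app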

I think the best thing for you to do is profile your code using Nsight Compute. You can compare kernels side by side as well. If you compile with -lineinfo, you can also select a particular line number and see the compiled SASS for that line. This might help answer your first question.
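
For example (file and report names are placeholders; in older Nsight Compute versions the CLI is called nv-nsight-cu-cli instead of ncu):

nvcc -O3 -arch=sm_60 -lineinfo app.cu -o app
ncu -o report ./app

Then open the report in the Nsight Compute UI, add one kernel as a baseline to compare the two side by side, and use the Source page to see the SASS for a given source line.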

https://devblogs.nvidia.com/using-nsight-compute-to-inspect-your-kernels/