Understanding different register counts for the same kernel

Hi,

I have a kernel implemented in two ways (with the same block/grid size, input, and compiler flags, and called at the same place in the program):

__global__ void k1(const St* __restrict__ p0, ...other params) {
    const int* __restrict__ p1 = p0->p1;
    const int* __restrict__ p2 = p0->p2;
    const int* __restrict__ p3 = p0->p3;
    // and 8 more lines like this from p4 to p11

    // same calculation code
}

__global__ void k2(const int* __restrict__ p1, const int* __restrict__ p2,
                   ..., const int* __restrict__ p11, ...other params) {
    // same calculation code
}
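
For context, here is a minimal, self-contained sketch of what I mean (the struct layout, the kernel names, and the trivial body are placeholders, not the real code):

struct St {
    const int* p1;
    const int* p2;
    const int* p3;
    // ... p4 through p11 in the real struct
};

// k1-style: the pointers themselves sit in global memory behind p0 and are
// read into local variables before any element data can be accessed.
__global__ void k1_sketch(const St* __restrict__ p0, int* out, int n) {
    const int* __restrict__ p1 = p0->p1;
    const int* __restrict__ p2 = p0->p2;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p1[i] + p2[i];
}

// k2-style: the pointers arrive directly as kernel parameters.
__global__ void k2_sketch(const int* __restrict__ p1, const int* __restrict__ p2,
                          int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p1[i] + p2[i];
}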

The compiler used 64 registers for k1 and 40 registers for k2. To my surprise, k1 is consistently 30% faster than k2.

To explain the difference in register counts: does this mean that, at least in this case (CUDA 10.1 / sm_60), kernel arguments are less likely to be copied to registers (or simply aren’t)?

And does accessing global device memory in these two different ways make a difference? If not, what else might explain the 30% difference in performance?

Thanks.

Kernel arguments are currently stored in a special buffer, namely constant memory.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/#function-parameters
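
If you want to see this for yourself, a quick experiment (the kernel below is made up for illustration): compile it for sm_60 and dump the SASS. The pointer parameters show up as constant-bank operands (of the form c[0x0][...]) in the address computation, rather than as values the kernel had to fetch into registers from somewhere else first.

// Minimal kernel, only for inspecting how parameters are accessed in SASS.
__global__ void copy_one(const int* __restrict__ src, int* __restrict__ dst)
{
    // src and dst are kernel parameters; they live in the constant bank and
    // the compiler reads them from there when forming the load/store addresses.
    dst[threadIdx.x] = src[threadIdx.x];
}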

It is impossible to determine issues with performance without seeing the entire code base.

Thanks for the clarification.

The code is actually quite complex, and I don’t think posting the whole kernel would be very helpful. I understand that without the code it’s difficult to say what the real issue is, so I’d like to ask a few general questions here instead.

One question, in general: is there a difference between reading a global data array via a pointer passed as a kernel argument vs. via a pointer stored in a register? I am not an expert, but I would assume the two should be quite close. Based on that assumption, and the fact that the rest of the code in the two kernels is identical, I can only conclude that, in this case, although the larger register count decreased the maximum occupancy, the kernel was optimized further with the extra registers available. I have also learned that higher occupancy never guarantees better performance.

If using more registers in a kernel can potentially increase performance, I wonder whether there are any flags or attributes to set a minimum register count for the kernel. I know “-maxrregcount” and __launch_bounds__ are available to limit the register count, btw.
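
For reference, this is what I mean by launch bounds (a sketch; the kernel name and the numbers are placeholders for my actual configuration):

__global__ void
__launch_bounds__(256 /* maxThreadsPerBlock */, 2 /* minBlocksPerMultiprocessor */)
k2_bounded(const int* __restrict__ p1, const int* __restrict__ p2,
           int* out, int n)
{
    // The compiler caps register usage so that 2 blocks of 256 threads
    // can be resident on an SM at the same time.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = p1[i] + p2[i];
}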

Thanks.

is there a difference in reading global data array via a pointer in the arguments vs. via a pointer stored in the register?

I can’t think of a reason why there would be a difference. But the best way to find out is to look at the SASS code.

cuobjdump <executable> -sass

It’s possible that one kernel uses too many registers and is spilling into local memory. Local memory resides in global memory and will impact performance. You can pass the -warn-spills and -warn-lmem-usage ptxas flags during compilation to check.
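
Something like this, for example (file name and arch adjusted to your setup; -Xptxas forwards the options to ptxas, and adding -v also prints the per-kernel register counts):

nvcc -O3 -arch=sm_60 -Xptxas -v,-warn-spills,-warn-lmem-usage app.cu -o app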

I think the best thing for you to do is profile your code using Nsight Compute. You can compare kernels side by side as well. If you compile with -lineinfo, you can also select a particular line number and see the compiled SASS for that line. This might help answer your first question.
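
For example (file and report names are placeholders; in older Nsight Compute versions the CLI is called nv-nsight-cu-cli instead of ncu):

nvcc -O3 -arch=sm_60 -lineinfo app.cu -o app
ncu -o report ./app

Then open the report in the Nsight Compute UI, add one kernel as a baseline to compare the two side by side, and use the Source page to see the SASS for a given source line.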

https://devblogs.nvidia.com/using-nsight-compute-to-inspect-your-kernels/