I have a kernel implemented in two ways (with the same block/grid size, input, compiler flags, etc., and called at the same place in the program):
__global__ void k1(const St* __restrict__ p0, ...other params) {
    const int* __restrict__ p1 = p0->p1;
    const int* __restrict__ p2 = p0->p2;
    const int* __restrict__ p3 = p0->p3;
    // ...and 8 more lines like this, for p4 through p11
    // same calculation code
}

__global__ void k2(const int* __restrict__ p1, const int* __restrict__ p2,
                   ..., const int* __restrict__ p11, ...other params) {
    // same calculation code
}
The compiler used 64 registers for k1 and 40 registers for k2. To my surprise, k1 is consistently 30% faster than k2.
Does the difference in register counts mean that, at least in this case (CUDA 10.1 / sm_60), kernel arguments are less likely to be (or simply aren't) copied into registers?
And does accessing global device memory through these two different routes make any difference? If not, what else might explain the 30% performance difference?
The code is actually quite complex, and I don't think posting the whole kernel would be helpful. I understand that without the code it's difficult to say what the real issue is, so I'd like to ask a few general questions instead.
One question: in general, is there a difference between reading a global data array via a pointer passed in the kernel arguments vs. via a pointer stored in a register? I am not an expert, but I would assume the two should be quite close. Based on that assumption, and the fact that the rest of the code in the two kernels is identical, I can only conclude that, in this case, although the larger register count lowered the maximum occupancy, the compiler was able to optimize the kernel further with more registers available. I have also learned that higher occupancy never guarantees better performance.
If using more registers can potentially improve performance, I wonder whether there are any flags or attributes with which I can set a minimum register count for a kernel. I know that __launch_bounds__ and -maxrregcount are available to limit the register count, by the way.
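For reference, the two limiting mechanisms I mean look roughly like this (the kernel, the bounds, and the register cap below are just placeholders, not my real code):

// Per kernel: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// gives the compiler an upper bound on registers per thread for this kernel.
__global__ void __launch_bounds__(256, 2) scale(float* __restrict__ out,
                                                const float* __restrict__ in,
                                                int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}
// Per compilation unit: nvcc -maxrregcount=40 kernel.cu

As far as I can tell, neither of these lets me request a minimum register count, which is what I'm after.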
is there a difference in reading global data array via a pointer in the arguments vs. via a pointer stored in the register?
I can't think of a reason why this would be true, but the best way to find out is to look at the SASS code:
cuobjdump <executable> -sass
It's possible that one kernel uses too many registers and is spilling into local memory. Local memory resides in global memory and will impact performance. You can set the -warn-spills and -warn-lmem-usage flags during compilation to check.
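Those are ptxas options, so with plain nvcc you'd pass them through -Xptxas; something along these lines (the file names are placeholders, and -v additionally prints per-kernel register usage):

nvcc -O3 -arch=sm_60 -Xptxas -v,-warn-spills,-warn-lmem-usage -o app kernel.cu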
I think the best thing for you to do is profile your code using Nsight Compute. You can compare kernels side-by-side as well. If you turn on -lineinfo, you can also select a particular line number and see the compiled SASS for that line. This might help answer your first question.
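A rough command-line workflow, assuming the file and report names below (in older toolkits the Nsight Compute CLI is named nv-nsight-cu-cli rather than ncu):

nvcc -O3 -arch=sm_60 -lineinfo -o app kernel.cu
ncu --set full -o k1_vs_k2 ./app

The resulting report can be opened in the Nsight Compute UI, where one kernel can be added as a baseline to compare against the other.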