I’m using CUDA 9.2.148, and am getting slightly different results than you, but I see the trend.
I haven’t done a lot of analysis yet, but I have a hunch.
One of the requirements for the conversion of an “immediate” or “stack-based” array variable (i.e. a variable in the logical local address space) to registers is that any indexing into it must be fully computable/discoverable by the compiler at compile time.
You can make this easy for the compiler or you can make it hard. In my opinion you’ve made it hard.(*)
This alternate code, which I believe is functionally similar to yours, makes the indexing much easier to compute/discover (I think). At least on CUDA 9.2.148, it uses only an 8-byte stack frame per thread, which means the entire “stack-based” array rc is no longer on the stack:
$ cat t1398.cu
#include <cstdio>
#define TASKS_PER_THREAD 5
#define PSIZE 512

__global__ void uvmer(const double* a){
  double rc[TASKS_PER_THREAD];  // the "stack-based" array we want promoted to registers
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  for (int id = 0; id < TASKS_PER_THREAD; id++){
    if (idx < (blockDim.x * gridDim.x * TASKS_PER_THREAD))
      rc[id] = a[idx * PSIZE];
    idx += blockDim.x*gridDim.x;
  }
  // this loop does nothing but ensure compiler does not remove rc
  for (int i = 0; i < TASKS_PER_THREAD; i++)
    if (rc[i] == 0) printf("ok %lf\n", rc[i]);
}
$ nvcc -c -maxrregcount 63 -arch=sm_70 -Xptxas=-v t1398.cu
ptxas info : 8 bytes gmem
ptxas info : Compiling entry function '_Z5uvmerPKd' for 'sm_70'
ptxas info : Function properties for _Z5uvmerPKd
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 30 registers, 360 bytes cmem
I’ve made a variety of changes that you may or may not agree with, but from a stack frame usage perspective, the drop from ~48 bytes to 8 bytes occurred when I refactored the first for-loop. [b]It’s possible I made a mistake, of course.
Probably additional SASS analysis is also in order here.[/b]
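For that SASS analysis, something like the following is what I would reach for (illustrative; the object file name just follows the compile line above):

```shell
# disassemble the generated SASS for inspection
cuobjdump -sass t1398.o

# check whether any local (stack) loads/stores remain
cuobjdump -sass t1398.o | grep -E "LDL|STL"
```

If rc really has been promoted to registers, you would expect no LDL/STL instructions associated with it in the disassembly.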
(*) additional commentary:
- The execution/launch configuration of a kernel has no impact on how the compiler compiles it.
- Let’s look at this indexing construct from your gist:
rc[id / (blockDim.x * gridDim.x)] = ...
You as the programmer can look at that and know that on the first loop iteration it will always compute to 0 across the grid. Likewise on the second iteration it will always compute to 1, and so on. I’m not sure the compiler is able to deduce that. It’s not obvious to me that unrolling the loop and doing whatever level of substitution is possible, leads to a discoverable constant (which is essentially what we need). If you think otherwise, and can demonstrate it, then it may be suitable for a compiler RFE. Keep in mind that the numerator (and denominator) of that index represents a linear combination of variables, for which the compiler is not able to assign a constant to any one of them. The refactoring that I did not only makes the index readily computable, but also converts your loop which has a non-constant trip count, to a loop with a constant trip count. This is also necessary (I believe) for the compiler deduction needed to convert to known indexing, and therefore registers instead of stack.