Register Caches with Double Precision?

Following this article:
https://devblogs.nvidia.com/register-cache-warp-cuda/

It seems that when I try to use the register-cache technique with double precision, the array gets placed on the stack (local memory) instead of in registers.

ptxas output for a toy kernel:
ptxas info : 8 bytes gmem
ptxas info : Compiling entry function 'uvmer' for 'sm_70'
ptxas info : Function properties for uvmer
48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 24 registers, 368 bytes cmem[0]

Toy kernel: (linked as a gist; the relevant indexing construct is quoted in the reply below)

Is this intended? I understand that each double-precision value should take 2 registers, but there are still about 8 registers remaining before spillage. Other variations that use fewer total registers have the same issue. Perhaps this should be a feature request?

What value are you using for TASKS_PER_THREAD and PSIZE ?

And which CUDA 9.x version?

In that specific example, TASKS_PER_THREAD=6. However, here is the output for TASKS_PER_THREAD=1:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'uvmer' for 'sm_70'
ptxas info : Function properties for uvmer
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 23 registers, 368 bytes cmem[0]

Note that the stack frame tracks TASKS_PER_THREAD at 8 bytes per double: 48 bytes for 6, 8 bytes for 1.

EDIT:
Full variable disclosure:
#define PSIZE 512
#define PNUM 2097152
(PNUM is the number of PSIZE-sized chunks in the array, so the array is PNUM × PSIZE doubles, ~8 GB)
#define BLOCKS 1
#define THREADS 32

uvmer<<<BLOCKS, THREADS>>>(array, array2);

And which CUDA 9.x version?

I updated my previous post with PSIZE. It is CUDA 9.2.88.1.

I’m using CUDA 9.2.148, and am getting slightly different results than you, but I see the trend.

I haven’t done a lot of analysis yet, but I have a hunch.

One of the requirements for the conversion of an “immediate” or “stack-based” array variable like:

double rc[TASKS_PER_THREAD];

(i.e. a variable in the logical local address space)

to registers is that the indexing (if any) must be fully computable/discoverable by the compiler at compile time.

You can make this easy for the compiler or you can make it hard. In my opinion you’ve made it hard.(*)
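As a minimal illustration of the difference (my own sketch with hypothetical kernel names, not code from your gist): in the first kernel every index into rc is a compile-time constant after unrolling, so the array can live in registers; in the second, the index k is only known at runtime, so the array is typically demoted to local memory (the stack).

#define N 4
// likely kept in registers: constant trip count, constant indices after unrolling
__global__ void reg_ok(const double* a, double* out)
{
    double rc[N];
    #pragma unroll
    for (int i = 0; i < N; i++)
        rc[i] = a[threadIdx.x + i * blockDim.x];
    double s = 0;
    #pragma unroll
    for (int i = 0; i < N; i++) s += rc[i];
    out[threadIdx.x] = s;
}

// likely demoted to local memory: rc[k] cannot be resolved to a specific register
__global__ void reg_bad(const double* a, double* out, int k)  // assume 0 <= k < N
{
    double rc[N];
    #pragma unroll
    for (int i = 0; i < N; i++)
        rc[i] = a[threadIdx.x + i * blockDim.x];
    out[threadIdx.x] = rc[k];   // runtime index forces rc onto the stack
}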

This alternate code, which I believe is functionally similar to yours, makes the indexing much easier to compute/discover (I think). At least on CUDA 9.2.148, it uses only 8 bytes of stack frame per thread, which means the entire “stack-based” array rc is no longer on the stack:

$ cat t1398.cu
#include <stdio.h>
#define TASKS_PER_THREAD 5
#define PSIZE 512
__global__ void uvmer(const double* a)
{
    // per-thread "register cache" of TASKS_PER_THREAD doubles
    double rc[TASKS_PER_THREAD];
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    #pragma unroll
    for (int id = 0; id < TASKS_PER_THREAD; id ++)
    {
        if (idx < (blockDim.x * gridDim.x * TASKS_PER_THREAD))
          rc[id] = a[idx * PSIZE];
        idx += blockDim.x*gridDim.x;
    }
  // this loop does nothing but ensure compiler does not remove rc
    #pragma unroll
    for (int i = 0; i < TASKS_PER_THREAD; i++)
        if (rc[i] == 0) printf("ok %lf\n", rc[i]);
}
$ nvcc -c -maxrregcount 63 -arch=sm_70 -Xptxas=-v t1398.cu
ptxas info    : 8 bytes gmem
ptxas info    : Compiling entry function '_Z5uvmerPKd' for 'sm_70'
ptxas info    : Function properties for _Z5uvmerPKd
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 30 registers, 360 bytes cmem[0]
$

I’ve made a variety of changes that you may or may not agree with, but from a stack-frame-usage perspective, the drop from ~48 bytes to 8 bytes occurred when I refactored the first for-loop. It’s possible I made a mistake, of course.

Probably additional SASS analysis is also in order here.
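For example, dumping the SASS and looking for local loads/stores (LDL/STL), which indicate stack traffic, is one quick check:

$ cuobjdump -sass t1398.o | grep -E "LDL|STL"

If the kernel body contains no LDL/STL instructions, the array made it into registers.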

(*) additional commentary:

  1. The execution/launch configuration of a kernel has no impact on how the compiler compiles it.
  2. Let’s look at this indexing construct from your gist:
rc[id / (blockDim.x * gridDim.x)] = ...

You as the programmer can look at that and know that on the first loop iteration it will always compute to 0 across the grid, on the second iteration it will always compute to 1, and so on. I’m not sure the compiler is able to deduce that. It’s not obvious to me that unrolling the loop and doing whatever substitution is possible leads to a discoverable constant (which is essentially what we need). If you think otherwise, and can demonstrate it, then it may be suitable for a compiler RFE. Keep in mind that the numerator (and denominator) of that index is a linear combination of variables, none of which the compiler can pin to a constant.

The refactoring I did not only makes the index readily computable, it also converts your loop, which had a non-constant trip count, into a loop with a constant trip count. This is also necessary (I believe) for the compiler deduction needed to convert the indexing to known constants, and therefore the array to registers instead of stack.
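To make that concrete, the original loop was presumably shaped something like this (a hypothetical reconstruction based on the construct above, not the exact gist code):

    int stride = blockDim.x * gridDim.x;
    for (int id = threadIdx.x + blockDim.x * blockIdx.x;
         id < stride * TASKS_PER_THREAD;
         id += stride)
    {
        rc[id / stride] = a[id * PSIZE];   // index and trip count depend on runtime values
    }

Both the trip count and the index into rc depend on stride, which is only known at runtime, whereas the refactored loop iterates a compile-time-constant TASKS_PER_THREAD times and indexes rc with the loop counter directly.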

No, this is great; this is exactly what was needed. I implemented your idea on top of mine (instead of copying) and it seems to work.

I see your point about the indexing and agree that the original way was more difficult than it needed to be. It had not occurred to me that the register-cache optimization was so stringent about the usage of fixed-size arrays, but it makes sense given the absence of bounds checks.

Thanks! This is extremely helpful.