Register Caches with Double Precision?

Following this article:
https://devblogs.nvidia.com/register-cache-warp-cuda/

It seems that when I try to use the register-cache technique with double precision, the array gets placed on the stack (local memory) instead of in registers.

ptxas output for a toy kernel:
ptxas info : 8 bytes gmem
ptxas info : Compiling entry function 'uvmer' for 'sm_70'
ptxas info : Function properties for uvmer
48 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 24 registers, 368 bytes cmem[0]

Toy kernel: (linked as a gist; the relevant indexing construct is quoted in the reply below)

Is this intended? I understand that each double-precision value should take 2 registers, but there are still about 8 registers remaining before spillage. Other variations that use fewer total registers have the same issue. Perhaps this should be a feature request?

What value are you using for TASKS_PER_THREAD and PSIZE ?

And which CUDA 9.x version?

In that specific example, TASKS_PER_THREAD=6. However, here is the output for TASKS_PER_THREAD=1:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function 'uvmer' for 'sm_70'
ptxas info : Function properties for uvmer
8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 23 registers, 368 bytes cmem[0]

Note that the stack frame tracks TASKS_PER_THREAD at 8 bytes per double: 48 bytes for 6, 8 bytes for 1.

EDIT:
Full variable disclosure:
#define PSIZE 512
#define PNUM 2097152
(PNUM is the number of PSIZE-sized chunks in the array, so the array is PNUM × PSIZE doubles, ~8 GB)
#define BLOCKS 1
#define THREADS 32

uvmer<<<BLOCKS, THREADS>>>(array, array2);

And which CUDA 9.x version?

I updated my previous post with PSIZE. It is CUDA 9.2.88.1.

I’m using CUDA 9.2.148, and am getting slightly different results than you, but I see the trend.

I haven’t done a lot of analysis yet, but I have a hunch.

One of the requirements for the conversion of an “immediate” or “stack-based” array variable like:

double rc[TASKS_PER_THREAD];

(i.e. a variable in the logical local address space)

to registers is that the indexing (if any) must be fully computable/discoverable by the compiler at compile time.

You can make this easy for the compiler or you can make it hard. In my opinion you’ve made it hard.(*)
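As a minimal illustration of the difference (my own sketch with hypothetical kernel names, not code from your gist): in the first kernel every index into rc is a compile-time constant after unrolling, so the array can live in registers; in the second, the index k is only known at runtime, so the array is typically demoted to local memory (the stack).

#define N 4
// likely kept in registers: constant trip count, constant indices after unrolling
__global__ void reg_ok(const double* a, double* out)
{
    double rc[N];
    #pragma unroll
    for (int i = 0; i < N; i++)
        rc[i] = a[threadIdx.x + i * blockDim.x];
    double s = 0;
    #pragma unroll
    for (int i = 0; i < N; i++) s += rc[i];
    out[threadIdx.x] = s;
}

// likely demoted to local memory: rc[k] cannot be resolved to a specific register
__global__ void reg_bad(const double* a, double* out, int k)  // assume 0 <= k < N
{
    double rc[N];
    #pragma unroll
    for (int i = 0; i < N; i++)
        rc[i] = a[threadIdx.x + i * blockDim.x];
    out[threadIdx.x] = rc[k];   // runtime index forces rc onto the stack
}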

This alternate code, which I believe is functionally similar to yours, makes the indexing much easier to compute/discover (I think). At least on CUDA 9.2.148, it uses only 8 bytes of stack frame per thread, which means the entire “stack-based” array rc is no longer on the stack:

$ cat t1398.cu
#include <stdio.h>
#define TASKS_PER_THREAD 5
#define PSIZE 512
__global__ void uvmer(const double* a)
{
    // per-thread "register cache" of TASKS_PER_THREAD doubles
    double rc[TASKS_PER_THREAD];
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    #pragma unroll
    for (int id = 0; id < TASKS_PER_THREAD; id ++)
    {
        if (idx < (blockDim.x * gridDim.x * TASKS_PER_THREAD))
          rc[id] = a[idx * PSIZE];
        idx += blockDim.x*gridDim.x;
    }
  // this loop does nothing but ensure compiler does not remove rc
    #pragma unroll
    for (int i = 0; i < TASKS_PER_THREAD; i++)
        if (rc[i] == 0) printf("ok %lf\n", rc[i]);
}
$ nvcc -c -maxrregcount 63 -arch=sm_70 -Xptxas=-v t1398.cu
ptxas info    : 8 bytes gmem
ptxas info    : Compiling entry function '_Z5uvmerPKd' for 'sm_70'
ptxas info    : Function properties for _Z5uvmerPKd
    8 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 30 registers, 360 bytes cmem[0]
$

I’ve made a variety of changes that you may or may not agree with, but from a stack-frame-usage perspective, the drop from ~48 bytes to 8 bytes occurred when I refactored the first for-loop. It’s possible I made a mistake, of course.

Probably additional SASS analysis is also in order here.
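For example, dumping the SASS and looking for local loads/stores (LDL/STL), which indicate stack traffic, is one quick check:

$ cuobjdump -sass t1398.o | grep -E "LDL|STL"

If the kernel body contains no LDL/STL instructions, the array made it into registers.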

(*) additional commentary:

  1. The execution/launch configuration of a kernel has no impact on how the compiler compiles it.
  2. Let’s look at this indexing construct from your gist:
rc[id / (blockDim.x * gridDim.x)] = ...

You as the programmer can look at that and know that on the first loop iteration it will always compute to 0 across the grid, on the second iteration it will always compute to 1, and so on. I’m not sure the compiler is able to deduce that. It’s not obvious to me that unrolling the loop and doing whatever substitution is possible leads to a discoverable constant (which is essentially what we need). If you think otherwise, and can demonstrate it, then it may be suitable for a compiler RFE. Keep in mind that the numerator (and denominator) of that index is a linear combination of variables, none of which the compiler can pin to a constant.

The refactoring I did not only makes the index readily computable, it also converts your loop, which had a non-constant trip count, into a loop with a constant trip count. This is also necessary (I believe) for the compiler deduction needed to convert the indexing to known constants, and therefore the array to registers instead of stack.
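To make that concrete, the original loop was presumably shaped something like this (a hypothetical reconstruction based on the construct above, not the exact gist code):

    int stride = blockDim.x * gridDim.x;
    for (int id = threadIdx.x + blockDim.x * blockIdx.x;
         id < stride * TASKS_PER_THREAD;
         id += stride)
    {
        rc[id / stride] = a[id * PSIZE];   // index and trip count depend on runtime values
    }

Both the trip count and the index into rc depend on stride, which is only known at runtime, whereas the refactored loop iterates a compile-time-constant TASKS_PER_THREAD times and indexes rc with the loop counter directly.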

No, this is great; this is exactly what was needed. I implemented your idea on top of mine (instead of copying) and it seems to work.

I see your point about the indexing and agree that the original way was more difficult than it needed to be. It had not occurred to me that the register-cache optimization was so stringent about the usage of fixed-size arrays, but it makes sense given the absence of bounds checks.

Thanks! This is extremely helpful.