How to store a vector in registers

I need to statically allocate a small vector (float x[10]) inside every thread. Can this be done?

I cannot use "float x0, x1, …, x9" because I have a loop that I want the compiler to unroll automatically.

I tried "register float x[10]", but the performance does not improve and the profiler reports the same number of registers as without the "register" keyword, so I am fairly sure the vector is not stored in registers. On the other hand, the profiler shows no accesses to local memory, so where is the x vector stored? Why is it not spilled to local memory?

If I use only one scalar variable (float x) and replace all accesses to vector x with it, the kernel is an order of magnitude faster.

Only if all accesses to that vector use array indices that are known at compile time. Registers are not indexable.
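For example, here is a minimal sketch (function and parameter names are illustrative) of indexing the compiler can resolve at compile time once the loop is unrolled, versus a runtime index that would force the array into local memory:

__device__ void constant_indexing(float *out, float v)
{
    float sum[4];

    #pragma unroll
    for (int ty = 0; ty < 4; ty++)    // trip count known at compile time
        sum[ty] = v * ty;             // sum[0..3] can become four registers

    // A runtime index such as sum[threadIdx.x % 4] would instead force the
    // whole array into addressable local memory.

    #pragma unroll
    for (int ty = 0; ty < 4; ty++)
        out[ty] = sum[ty];
}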

You can look at the generated PTX to see where the data is actually stored.
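For example (file name illustrative), either of these dumps the intermediate PTX so you can search for .local declarations and ld.local/st.local instructions:

nvcc -arch=sm_20 -ptx kernel.cu       # writes kernel.ptx
nvcc -arch=sm_20 --keep -c kernel.cu  # keeps all intermediate files, including the .ptx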

The register keyword has no effect in this context. If you statically declare an array and any of its indexing is not resolvable at compile time, the compiler will put it in local memory, not registers. This is why there is such a big performance gap between the two cases. If you want to unroll a loop and keep the calculations in registers, I think you have no choice but to do it by hand.
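A minimal hand-unrolled sketch (only four of the ten elements shown; "in" and "scale" are placeholder names): each element becomes its own named scalar, which the compiler is free to keep in a register.

__device__ float hand_unrolled(const float *in, float scale)
{
    float x0 = in[0] * scale;
    float x1 = in[1] * scale;
    float x2 = in[2] * scale;
    float x3 = in[3] * scale;
    // ... continue through x9 for a 10-element vector ...
    return x0 + x1 + x2 + x3;
}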

As I understand it, array contents can only be placed in registers if the register references can be resolved at compile time. Relative indexing into the register file is not supported by this hardware architecture (or by many others). In principle, a fully unrolled loop would have effectively constant indexing and could be stored in registers, although I don't know whether the compiler actually figures that out. Is the index limit of your for loop also a constant?

This is a subtle enough problem that you probably want to write a simple kernel accessing the array with the #pragma unroll directive, and inspect the PTX output. This will also make it obvious where the array is really being stored.

Thanks for your quick answer. The code looks like this:

...

register float sum[10];

...

#pragma unroll
for (int ty = 0; ty < Ky; ty++) sum[ty] = 0.0f;

for (int i = 0; i < My; i++) {
    opt = (Sy+1)*i*stride_prev_y + tidx;
    for (int j = 0; j < Mx; j++) {
        dval = d[i*stride_d + j];
        #pragma unroll
        for (int ty = 0; ty < Ky; ty++) sum[ty] += prev_y[opt + ty*stride_prev_y] * dval;
        opt += Sxp1;
    }
}

#pragma unroll
for (int ty = 0; ty < Ky; ty++) *W++ -= eta*sum[ty];

Ky is a template parameter, so it is constant. This means that all indices inside the loops to be unrolled are known at compile time. The vector sum is not used anywhere else in the kernel.

Looking at the obj file (I think it is the same as the PTX) I found this:

.local .align 4 .b8 __cuda___cuda_sum_6412992[40];

The question remains: why is sum allocated in local memory and not in registers? My entire kernel uses 22 registers and the limit I set for the compiler is 60. Also, if sum is in local memory, why does the profiler show no accesses to local memory (cached or not)?

Why don't you just unroll the loops manually and use x0, …, x9? It does not look too difficult, even though the resulting code may look a bit ugly. Anyway, from the previous comments on your question I conclude that even if you managed to somehow force the compiler to accept your register-array-based code, it would have to violate the C/C++ standard: according to the standard, the name of an array can be used as a pointer, which has profound consequences, and since registers cannot be referenced, a standard-compliant compiler must either refuse to compile the code or allocate the array in local memory.
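A minimal illustration of that array-to-pointer decay (the helper "consume" is hypothetical): as soon as the array's name is used as a pointer, the array needs an address, which a register cannot provide.

__device__ float consume(const float *p)
{
    return p[0];                   // needs a real address to dereference
}

__device__ float decay_example()
{
    float sum[10];                 // candidate for registers...
    for (int i = 0; i < 10; i++)
        sum[i] = (float)i;
    return consume(sum);           // ...but here "sum" decays to a pointer,
                                   // so it must live in addressable (local) memory
}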

An alternative solution: allocate the array in shared memory. For 128-thread blocks this would require only 1280 floats, or about 5 KB per block.
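A minimal sketch of that shared-memory variant, assuming 128 threads per block and Ky = 10 (both illustrative, as is the kernel itself): element ty of a thread's vector is stored at sum[threadIdx.x + ty*blockDim.x], so neighboring threads touch different banks.

#define THREADS_PER_BLOCK 128
#define KY 10

__global__ void shared_sum_kernel(float *out)
{
    __shared__ float sum[THREADS_PER_BLOCK * KY];   // 1280 floats = 5120 bytes per block

    #pragma unroll
    for (int ty = 0; ty < KY; ty++)
        sum[threadIdx.x + ty * THREADS_PER_BLOCK] = 0.0f;

    // ... accumulate into sum[threadIdx.x + ty*THREADS_PER_BLOCK] here ...

    #pragma unroll
    for (int ty = 0; ty < KY; ty++)
        out[blockIdx.x * THREADS_PER_BLOCK * KY + threadIdx.x + ty * THREADS_PER_BLOCK] =
            sum[threadIdx.x + ty * THREADS_PER_BLOCK];
}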

Unless I misunderstand what the desired functionality is, this seems to work as desired with the CUDA 3.2 toolchain. Below is my little test app which wraps the code that was posted.

#include <stdio.h>
#include <stdlib.h>

template <int Ky>
__device__ void func (float eta, int Sxp1, int Sy, int My, int Mx, const float *d, float *W, const float *prev_y, int stride_d, int stride_prev_y)
{
    float sum[Ky];
    float dval;
    int opt;
    int tidx = blockDim.x * blockIdx.x + threadIdx.x;

#pragma unroll
    for (int ty = 0; ty < Ky; ty++) {
        sum[ty] = 0.0f;
    }
    for (int i = 0; i < My; i++) {
        opt = (Sy+1)*i*stride_prev_y + tidx;
        for (int j = 0; j < Mx; j++) {
            dval = d[i*stride_d+j];
#pragma unroll
            for (int ty = 0; ty < Ky; ty++) {
                sum[ty] += prev_y[opt + ty*stride_prev_y] * dval;
            }
            opt += Sxp1;
        }
    }
#pragma unroll
    for (int ty = 0; ty < Ky; ty++) {
        *W++ -= eta*sum[ty];
    }
}

__global__ void kernel (float eta, int Sxp1, int Sy, int My, int Mx, const float *d, float *W, const float *prev_y, int stride_d, int stride_prev_y)
{
    func<10>(eta, Sxp1, Sy, My, Mx, d, W, prev_y, stride_d, stride_prev_y);
}

int main (void)
{
    return EXIT_SUCCESS;
}

I compiled the code above as follows

nvcc -o kernel --keep -Xptxas -v -arch={sm_20|sm_10} kernel.cu

I don’t see any local memory references in either the sm_10 or sm_20 version of the code (I checked the PTX as well as the machine code; registered developers can download cuobjdump to disassemble sm_1x machine code in .cubin files). PTXAS reports:

ptxas info : Compiling entry function '_Z6kernelfiiiiPKfPfS0_ii' for 'sm_10'
ptxas info : Used 37 registers, 56+16 bytes smem
ptxas info : Compiling entry function '_Z6kernelfiiiiPKfPfS0_ii' for 'sm_20'
ptxas info : Used 42 registers, 88 bytes cmem[0]

That's interesting. At first glance I would have assumed that nvcc would not unroll that loop, given that the integer "stride_prev_y" and "dval" are not known at compile time.

Thank you everyone, I will try your code tomorrow. Today our GTX580s came and I have to update the HW/SW.

The Programming Guide lists the cases in which the compiler places automatic variables in local memory; I guess that, in your code, float sum[Ky]; falls under the first case (an array whose indexing the compiler cannot prove to be constant). Try putting it in shared memory if possible:

extern __shared__ float sum[]; // dynamically sized shared array; its size is given as the third parameter of the launch configuration

__global__ void kernel(...)
{
  // replace sum[a] with
  sum[threadIdx.x + blockDim.x*a] = 0.0f;
  // this addressing pattern is needed to avoid bank conflicts
}
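For completeness, a launch sketch for the dynamic variant above (block counts and sizes are illustrative): the third launch-configuration parameter passes the size of the extern __shared__ array in bytes.

int    numBlocks       = 64;                                    // illustrative
int    threadsPerBlock = 128;                                   // illustrative
int    Ky              = 10;
size_t smemBytes       = threadsPerBlock * Ky * sizeof(float);  // size of extern __shared__ sum[]
kernel<<<numBlocks, threadsPerBlock, smemBytes>>>(/* ... */);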