How is the memory type chosen for stack variables?

I am noticing that the performance of my CUDA program is extremely sensitive to which type of memory is used for each stack variable. In particular, whether data is placed in local memory or in registers seems to be the main factor. My question is: is there any way to determine which variables have been put into local memory, and is there a way to instruct the compiler to put certain variables into local memory and others into registers?

My guess is that there is currently no way, so my feature requests would be:

  • have a way to determine the type of memory assigned to each variable in device functions, e.g. in an annotated output from nvcc
  • have a way to specify whether a stack variable should go into local memory, a register, or shared memory (this could be a hint, like inline)
  • make it possible to assign local stack variables to shared memory without explicitly allocating them for all threads in a block, i.e. I’d like “int a” rather than “int a[threadBlockSize]” (see the sketch after this list). Maybe this could automatically arrange things in memory for coalescing too.
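For reference, here is a minimal sketch of the workaround that third request is meant to replace (BLOCK_SIZE and a_shared are made-up names); each thread only ever touches its own slot:

#define BLOCK_SIZE 256

__global__ void perThreadShared()
{
    // per-thread data in shared memory currently has to be allocated for the
    // whole block and indexed by thread ID by hand
    __shared__ int a_shared[BLOCK_SIZE];
    a_shared[threadIdx.x] = (int)threadIdx.x;   // the "int a" of the wished-for syntax
}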

I’d also like to confirm a theory about how local memory is assigned. My tests seem to show that almost anything I put into a structure ends up in local memory rather than on the stack. It would seem useful for structured data to be put into registers too, if the structure is small enough.
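To make that concrete, this is the kind of pattern I mean (Pair and structTest are made-up names); every access is a direct member access, so registers would seem feasible:

struct Pair { int first; int second; };

__global__ void structTest(int *out)
{
    Pair p;                                 // small struct, fixed layout
    p.first  = 1;                           // direct member access, no indexing
    p.second = 2;
    out[threadIdx.x] = p.first + p.second;
}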

Thank you!

Andrew

I’m not quite sure I understand what you are trying to do, so maybe a small piece of code could clarify your problem. But from what I can tell, NVCC is putting some of your variables into local memory instead of registers. As far as I know this can happen for two reasons:

1- The amount of memory allocated exceeds what the available registers can hold.

2- You are trying to allocate an array of some kind. Registers are NOT INDEXABLE, which means you cannot store an array in registers, because you cannot access them with a computed index.

__device__ void myfunction()
{
    // allocate an array
    int n[3];    // -> this will be allocated in local memory, because access is indirect (indexed)

    // example
    int x = 1;
    // ... modify x dynamically ...

    // copy element x of n to y
    int y = n[x];    // this operation cannot be executed on registers!

    // allocate three individual variables
    int a, b, c;    // -> these will be allocated in registers, because access is always direct
    // ... do something with them ...
}

Have a look at your code and see if this applies somewhere.

–Jake

Well, we have those already… __local__ puts the variable into local memory, __shared__ puts it into shared memory. Specifying nothing puts it into a register, if there is still space.
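Roughly like this (an illustrative sketch; __local__ is commented out because, as noted further down the thread, newer toolkits reject it):

__global__ void qualifiers()
{
    int r;                     // no qualifier: a register, if there is still space
    __shared__ float s[64];    // __shared__: shared memory, one copy per block
    // __local__ int l;        // __local__: local memory (older toolkits only)
}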

I just tried __local__ and my program failed to compile. Is this a new feature in the 1.1 SDK?

You’re right, I was using arrays in some places. CUDA also seems to put unions into local memory, e.g.:

union {
    uint uval;
    float fval;
};
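If the union is just the usual bit-reinterpretation idiom, one workaround that seems to stay in registers is the __float_as_int / __int_as_float device intrinsics, e.g.:

__device__ float flipSignBit(float f)
{
    int bits = __float_as_int(f);    // reinterpret the float's bits as an int
    bits ^= 0x80000000;              // toggle the sign bit
    return __int_as_float(bits);     // reinterpret back to float
}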

Andrew

I am also puzzled by this question, and someone told me to use __local__, but it doesn’t work.

So after a long time of experimenting, I find maybe this is the way we should do it:

You can define variables with __device__ in the kernel file (say, kernel.cu), but only outside the kernel function, because __device__ cannot be used on variables inside kernel functions.

For example:

kernel.cu

__device__ float num1[100];    // array num1, declared with __device__, lives in device (global) memory

__global__ void kernel()
{
    float num2[100];              // array num2, with no qualifier, goes into local memory
                                  // (it is indexed, so it cannot stay in registers)

    __shared__ float num3[100];   // array num3 is put in shared memory

    for (int i = 0; i < 100; i++) num1[i] = 0;   // you can operate on num1 inside the kernel function
}

So you can see it is not difficult, and you can also operate on the __device__ variables in the wrapper .cu file. But one thing you should remember: the variables you define in a kernel function with __shared__ must not run out of memory. The limit is 0x4000 bytes (16 KB) of shared memory per block, i.e. 4096 floats/ints or 8192 shorts. And when the registers overflow, the compiler will automatically spill the excess into local (device) memory, which is hard for you to control.
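A tiny sketch of that overflow case (the size is illustrative):

__global__ void spills()
{
    float big[2048];                   // far too large for registers: the compiler
    for (int i = 0; i < 2048; i++)     // places the array in local (device) memory
        big[i] = (float)i;             // automatically
}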

As for the compiler, I can’t find exactly which parameters tell nvcc where it should put variables. The programming guide refers to this, but it doesn’t make it clear.
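One compiler option that does help here (assuming it is available in your toolkit version): passing -v through to ptxas makes the compiler print the per-kernel register, local memory (lmem), and shared memory (smem) totals, which at least tells you when something has spilled:

nvcc --ptxas-options=-v -c kernel.cu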

The __local__ qualifier was removed in CUDA 1.0.