Shared Memory initialization

Hi,

I am working with CUDA on a Windows platform. I have a question about initializing an array in shared memory. If the array is declared as follows: __shared__ float Array[16 * 16], it works fine. But when I try to pass the size of the array as an argument to the kernel function (for example, __shared__ float Array[m], where m is passed in as an argument), it results in a build error: expression must have a constant value. Is it not possible to declare the size of the array on the fly? I tried using the extern __shared__ float declaration, but I am using quite a few variables and am facing memory overlap problems. I would greatly appreciate any help.

Thanks,

Shyam

Hi Shyam,

This is explained in the programming guide. Because it’s not possible to allocate memory on the fly inside __global__ or __device__ functions, you have to either allocate shared arrays statically or pass the size through the execution configuration (the <<< >>> parameters). For the same reason, you can allocate at most one dynamic shared array per kernel invocation. Here’s some example code:

__global__ void myKernel(float *deviceData)
{
    extern __shared__ float myData[];

    // kernel code...
}

int main()
{
    // initialization, data download, etc.

    int sharedMemSize = calcSharedMemSize(); // depends on the problem

    myKernel<<<blocks, threads, sharedMemSize>>>(data);

    // shutdown

    return 0;
}

The key here is that you declare the shared array extern to tell the compiler it will be allocated elsewhere. Then at runtime the CUDA runtime allocates the shared memory based on the third parameter in the execution configuration. Because of this, only one dynamically-sized shared memory array per kernel is supported.

Mark

Thanks Mark. I did try that too. In fact, I tried computing the offsets myself, as mentioned in the guide for dynamically allocated arrays, but with the number of variables I am using it wasn’t feasible. So I would have to statically allocate the sizes of the arrays.

Thanks again

Shyam

OK, without any actual code to look at it’s hard to say what the problem is.

Mark

OK, this might give you a better idea…

extern "C"
void runTest(float *data, float *estimates, int n)
{
    // copy host memory to device
    …Memcpy…

    // allocate device memory to store results
    …cudaMalloc…

    // setup execution parameters
    dim3 grid(1, 1, 1);
    dim3 threads(3, n, 1);

    // execute the kernel
    kernel<<<grid, threads>>>(d_C, d_A, d_B, d_RHS, n);

n is the height of the array, which is computed in my main .cpp file using the sizeof() function and passed into the kernel:

__global__ void
kernel(float *C, float *A, float *B, float *D, unsigned int n)
{
    // access thread id
    const unsigned int tid = threadIdx.x;  // 0-2
    // warp id
    const unsigned int wid = threadIdx.y;  // 0-(n-1)
    const unsigned int ltid = (wid * blockDim.x) + tid;
    // access number of threads in this block
    const unsigned int num_threads = blockDim.x * blockDim.y;

    __shared__ float As[183];
    __shared__ float Bs[3];
    __shared__ float Cs[183];
    __shared__ float Ds[61];
    __shared__ float CTs[183];
    __shared__ float Ps[3 * 3];
    __shared__ float Rs[3];
    __shared__ float Ls[3 * 3];
    __shared__ float Us[3 * 3];
    __shared__ float Ys[3];
    __shared__ float Xs[3];

I cannot use the extern like you mentioned, but if I try to allocate these variables as

    __shared__ float As[n * 3]... and so on

I get the errors I mentioned earlier…

Thanks

Shyam

(Note, please wrap code in code blocks using the buttons on the edit pane so that the forum will not reformat your code.)

I see. As I said, there is no dynamic memory allocation inside CUDA kernel code. What you could do is declare one array and define offsets and macros to access “virtual” arrays.

// define some macros to access your "virtual" arrays
#define As(i) data[AsIndex + (i)]
#define Bs(i) data[BsIndex + (i)]
#define Cs(i) data[CsIndex + (i)]
#define Ds(i) data[DsIndex + (i)]
// etc...

__global__ void myFunc(float *deviceDataA, float *deviceDataB, int n)
{
    extern __shared__ float data[];

    unsigned AsIndex = 0;        // As has n elements
    unsigned BsIndex = n;        // Bs has 3 elements
    unsigned CsIndex = n + 3;    // Cs has n elements
    unsigned DsIndex = 2*n + 3;  // Ds has 2*n elements, so the next array starts at 4*n + 3
    // etc...

    // now, when you need to access the "virtual" arrays, just use the macros:
    As(threadIdx.x) = deviceDataA[threadIdx.x];
    Bs(threadIdx.x) = deviceDataB[threadIdx.x];
    // etc...
}

Now you just allocate “data” at runtime by using the execution configuration.
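For example (a sketch; assuming the layout above, with Ds taking the last 2*n floats, the total comes to 4*n + 3 floats, and grid/threads here are placeholders for whatever your problem needs):

    int n = 61;  // your problem size
    size_t sharedMemSize = (4 * n + 3) * sizeof(float);  // As(n) + Bs(3) + Cs(n) + Ds(2*n)
    myFunc<<<grid, threads, sharedMemSize>>>(deviceDataA, deviceDataB, n);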

Mark

Hi,

Thanks a ton. It worked perfectly. But there is a limit on the amount of shared memory I can allocate from the execution configuration, right? Is there a way I could get around this, as I am trying to develop a generic program that could do the computation on any number of floating-point values? In my other post I mentioned block sizes, which I think would be the best solution for this. But I am confused about how I could use blocks with the number of variables I am using. Also, there is a sequence of operations I am trying to perform on the data.

Thanks Again,

Shyam

There is a physical limitation of 16KB of shared memory per thread block on G80, and you have to work within that limit. You can always re-use memory space by using temporary storage for intermediates and loading more data later in the kernel to replace data you are finished with.

To extend the earlier example:

// define some macros to access your "virtual" arrays
#define As(i) data[AsIndex + (i)]
#define Bs(i) data[BsIndex + (i)]
#define Cs(i) data[CsIndex + (i)]
#define Ds(i) data[DsIndex + (i)]
#define Temp(i) data[(i)]
// etc...

__global__ void myFunc(float *deviceDataA, float *deviceDataB,
                       float *deviceDataC, float *deviceDataD,
                       float *deviceDataOut, int n)
{
    extern __shared__ float data[];

    // Temp occupies data[0 .. n-1]
    unsigned AsIndex = n;
    unsigned BsIndex = 2*n;
    // note that we restart Cs at n.  This means we can't use Cs and Ds at the
    // same time as As and Bs
    unsigned CsIndex = n;
    unsigned DsIndex = 2*n;
    // etc...

    int tid = threadIdx.x;

    // now, when you need to access the "virtual" arrays, just use the macros:
    As(tid) = deviceDataA[tid];
    Bs(tid) = deviceDataB[tid];

    __syncthreads();

    // compute using As and Bs and store in Temp
    Temp(tid) = As(tid) + Bs(tid);

    // now load in C and D (overwriting As and Bs):
    Cs(tid) = deviceDataC[tid];
    Ds(tid) = deviceDataD[tid];

    __syncthreads();

    // compute using Temp, Cs, and Ds and write to output
    deviceDataOut[tid] = Cs(tid) * (Ds(tid) + Temp(tid));
}

Also, for constant data, consider using constant memory or texture memory.
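For example, constant memory works like this (a minimal sketch; the names are made up, and cudaMemcpyToSymbol is the runtime call that uploads the values):

    __constant__ float coeffs[3];  // read-only on the device, cached

    __global__ void useCoeffs(float *out)
    {
        // every thread can read constant memory directly; no shared memory needed
        out[threadIdx.x] = coeffs[threadIdx.x % 3];
    }

    // host side: upload the values once, before launching the kernel
    float h_coeffs[3] = { 1.0f, 2.0f, 3.0f };
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));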

Mark

I use a similar method that you guys might like:

extern __shared__ float data[];

float *As = data;
float *Bs = data + 183; // 183 is the size of As
// etc...

Then you can use “As”, etc. as you’d expect.
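For example, inside a kernel (a quick sketch of the same idea; gA and gB are made-up global arrays, and it assumes blockDim.x <= 183):

    __global__ void kernel(float *gA, float *gB)
    {
        extern __shared__ float data[];

        float *As = data;        // first 183 floats
        float *Bs = data + 183;  // next 3 floats

        As[threadIdx.x] = gA[threadIdx.x];  // index them like ordinary arrays
        if (threadIdx.x < 3)
            Bs[threadIdx.x] = gB[threadIdx.x];
        __syncthreads();
    }

launched with kernel<<<grid, threads, (183 + 3) * sizeof(float)>>>(dA, dB);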


Generally this should work, but I’ve had problems in the past with shared memory pointers. Perhaps those issues are fixed in the current beta release.

Mark

Yes, this works in the current Beta release.

Either this:

extern __shared__ float data[];

__device__ void func()     // some device or __global__ function
{
    float *As = data;
    float *Bs = data + 183; // 183 is the size of As
}

or this:

extern __shared__ float data[];

__shared__ float *As = data;
__shared__ float *Bs = data + 183; // 183 is the size of As

The programming guide is outdated in this regard.

Section 4.2.2.4 should say:

“Pointers in code that is executed on the device are supported as long as the compiler is able to resolve whether they point to either the shared memory space or the global memory space; otherwise they are restricted to only point to memory allocated or declared in the global memory space.

Dereferencing a pointer either to global or shared memory in code that is executed on the host, or to host memory in code that is executed on the device, results in undefined behavior, most often a segmentation fault and application termination.”

One additional note, Shyam. The inability to allocate arrays on the fly as you tried (float Array[m]) is not unique to CUDA. You cannot do this in standard C++ (or pre-C99 C) since, as the compiler complained, the array dimension in the declaration must be a constant.
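For example, the usual workaround (and what static allocation amounts to) is to make the dimension a compile-time constant, a sketch:

    #define N 61                     // known at compile time

    __global__ void kernel(float *C, float *A, float *B, float *D)
    {
        __shared__ float As[3 * N];  // fine: 3 * N is a constant expression
        // ...
    }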

Hi Paulius,

Thanks for the note. I had never tried that in C/C++ but assumed it would work. I am not much of a programmer. :)

Shyam

The g++ compiler will let you declare arrays with sizes that are variables, even though it is not part of the C++ standard. This has confused a number of people I work with when they moved to a Solaris compiler which did not support that extension.
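For example, this compiles under g++ as an extension but is rejected by strict ISO C++ compilers:

    void f(int m)
    {
        float a[m];  // variable-length array: standard in C99, a GNU extension in C++
    }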

Should this work for different data types?

For instance, I’m having trouble with this technique; something like:

extern __shared__ float shmem[];

__shared__ unsigned int *int_data = (unsigned int *)shmem + 0;
__shared__ float *float_data = shmem + num_threads;

This works in the emulator, but on the card (GeForce 8800 GTS), the float data doesn’t reference properly.
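For reference, the same partitioning done with plain local pointers inside the kernel (like the earlier examples that were confirmed to work) would be something like this sketch, where num_threads is assumed to be blockDim.x and the parameters are made up:

    extern __shared__ float shmem[];

    __global__ void kernel(unsigned int *gInts, float *gFloats)
    {
        unsigned int num_threads = blockDim.x;
        unsigned int *int_data = (unsigned int *)shmem;  // first num_threads ints
        float *float_data = shmem + num_threads;         // floats start right after
        // (unsigned int and float are both 4 bytes, so no alignment padding is needed)
    }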

When assigning a constant value to shared memory, is it better to let one thread do the work, or is there an optimization that occurs? For example:

__shared__ float x;

x = -1.0f;      // is this better?

if (tid == 0) x = -1.0f;    // or this?

The ptx code for the straight assignment is just two instructions, while the if (tid == 0) approach is 4 instructions.

Will the straight assignment get serialized, with each thread repeating the assignment in turn? Or is there some kind of special treatment for a constant assignment to shared memory?

The manual says that the result of concurrent writes to the same memory location is undefined. As long as all threads write the same value you can probably get away with it, but you had better stick with the second version: the threads might not be in sync, so a second thread might write the -1 while the first has already moved on and modified x further. If x isn’t modified further down, you should make it a constant.

Peter

My hunch is that the second is better, but you need to follow it with a __syncthreads() to avoid the issue Peter mentions. Also, you may be able to do this:

if (tid < 32) x = -1.0f;    

__syncthreads();

This would mean the entire first warp writes the value, so the first warp is non-divergent, unlike the (tid == 0) case. However, divergence in only one warp is probably not much slower overall than divergence in zero warps, unless you have very few warps per block (which is usually bad for efficiency anyway).

Mark


Thanks Mark and Peter.

My question was specifically targeting whether there was some subtle optimization for initializing shared memory to a constant. Apparently there’s not.

I suspect, then, that the method where ALL threads write the (same) constant to one shared memory location would cause bank conflicts? I didn’t detect any, but it’s a pretty small part of my code. If it did, that would settle the issue. :P

I deliberately left out the __syncthreads() because I was initializing other shared variables as well. But point well taken!