Initializing shared memory

I am writing some CUDA code where each thread within a block needs to access ALL values in shared memory in order to do the computation for that thread. In my case, the shared memory is made up of a bunch of arrays.

So, the shared memory size AND the shared memory indexing have nothing to do with the thread index; every thread accesses all of the data in shared memory. My code is shown below.

The question that I have: IS the shared memory initialized every time a thread runs, or only once per block?

What I want is to initialize the shared memory only once per block and have all threads read from it. Does the code below do this, or does it initialize the shared memory once per thread?


  • Alex


extern __shared__ int array[];

__global__ void kernel_func (…) {

// Initialize the shared memory
float* shared_d_thresholds = (float*) array; // 6 floats = 24 bytes
int3* shared_d_KernelDimsCenter = (int3*)&shared_d_thresholds[24];  // 1 int3 = 12 bytes 
int3* shared_d_KernelDims = (int3*) &shared_d_KernelDimsCenter[12];  // 1 int3 = 12 bytes
int4* shared_d_signalMapSizes = (int4*) &shared_d_KernelDims[12]; // 1 int4 = 16 bytes
int2* shared_d_AADims = (int2*) &shared_d_signalMapSizes[16]; // 1 int2 = 8 bytes
int*  shared_d_numResultingPoints = (int*) &shared_d_AADims[8]; // 1 int = 4 bytes
float*** shared_d_KernelWeights = (float***) &shared_d_numResultingPoints[4]; // rest of shared memory
// Fill the shared memory with data from global device memory...
shared_d_KernelDimsCenter->x = d_KernelDimsCenter->x;
shared_d_KernelDimsCenter->y = d_KernelDimsCenter->y;
shared_d_KernelDimsCenter->z = d_KernelDimsCenter->z;

// Use indexing within the large array to assign the appropriate data to crunch for the current thread
…
}

void entry_point_to_CUDA() {

init …

// Set up the kernel execution configuration
dim3 dimGrid ...
dim3 dimBlock ...
size_t Ns ...
printf("Allocating %d bytes of shared memory per thread block\n", int(Ns));
…
}




Whenever a thread writes to a shared memory location, the change becomes visible to all threads within the thread block. Make sure to call __syncthreads() between the write and the read to avoid read-after-write (RAW) hazards.
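A minimal sketch of that write, barrier, read pattern. The kernel name sum_params, the table d_params, the size NUM_PARAMS and the host helper run_sum_params are all made up for illustration and are not from the original post. Here only thread 0 of each block fills shared memory, so the initialization happens once per block, and the barrier keeps the other threads from reading too early. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

#define NUM_PARAMS 6  // hypothetical table size, chosen for illustration

// Hypothetical kernel: every thread reads ALL shared values after
// one thread per block has filled them.
__global__ void sum_params(const float* d_params, float* d_out)
{
    __shared__ float s_params[NUM_PARAMS];

    // Only thread 0 writes, so the copy runs once per block.
    if (threadIdx.x == 0) {
        for (int i = 0; i < NUM_PARAMS; ++i)
            s_params[i] = d_params[i];
    }

    // Barrier: no thread reads s_params until the writes above are done.
    __syncthreads();

    float acc = 0.0f;
    for (int i = 0; i < NUM_PARAMS; ++i)
        acc += s_params[i];

    d_out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Hypothetical host helper: launches the kernel and checks that every
// thread saw the fully initialized shared array.
bool run_sum_params()
{
    const int blocks = 2, threads = 64, n = blocks * threads;
    float h_params[NUM_PARAMS] = {1, 2, 3, 4, 5, 6};
    float h_out[n];

    float *d_params = nullptr, *d_out = nullptr;
    cudaMalloc(&d_params, sizeof(h_params));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_params, h_params, sizeof(h_params), cudaMemcpyHostToDevice);

    sum_params<<<blocks, threads>>>(d_params, d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_params);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i)
        if (h_out[i] != 21.0f) return false;  // 1 + 2 + ... + 6
    return true;
}
```

Having one thread do the whole copy is the simplest way to get once-per-block behavior, but it serializes the load; it is fine for a handful of values and wasteful for a large table.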


Well, the place in your code labeled “Initialize the shared memory” doesn’t actually initialize anything; it just sets up a bunch of pointers. Incorrectly, by the way:

(int3*)&shared_d_thresholds[24]

is a pointer to memory 24 floats (96 bytes) past the start of shared_d_thresholds, not 24 bytes.
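Indexing a typed pointer advances in elements, not bytes, so the offsets have to be element counts. A sketch of the same carve-up with element counts, reusing the pointer names from the question; the kernel report_layout and the host helper run_report_layout are hypothetical, and the weights are treated as a flat float array here as an assumption (the question uses float***). The kernel only reports the byte offset of each sub-array so the layout can be checked. The int4 lands at byte 48 and the int2 at byte 64, which satisfies their alignment provided the start of dynamic shared memory is itself 16-byte aligned.

```cuda
#include <cuda_runtime.h>

extern __shared__ int array[];

// Hypothetical kernel: computes the sub-array pointers with ELEMENT counts
// and writes each one's byte offset from the start of shared memory.
__global__ void report_layout(int* d_offsets)
{
    float* shared_d_thresholds         = (float*)array;                          // 6 floats = 24 bytes
    int3*  shared_d_KernelDimsCenter   = (int3*)&shared_d_thresholds[6];         // 1 int3 = 12 bytes
    int3*  shared_d_KernelDims         = (int3*)&shared_d_KernelDimsCenter[1];   // 1 int3 = 12 bytes
    int4*  shared_d_signalMapSizes     = (int4*)&shared_d_KernelDims[1];         // 1 int4 = 16 bytes
    int2*  shared_d_AADims             = (int2*)&shared_d_signalMapSizes[1];     // 1 int2 = 8 bytes
    int*   shared_d_numResultingPoints = (int*)&shared_d_AADims[1];              // 1 int = 4 bytes
    float* shared_d_KernelWeights      = (float*)&shared_d_numResultingPoints[1]; // rest (assumed flat)

    if (threadIdx.x == 0) {
        char* base = (char*)array;
        d_offsets[0] = (int)((char*)shared_d_KernelDimsCenter   - base);
        d_offsets[1] = (int)((char*)shared_d_KernelDims         - base);
        d_offsets[2] = (int)((char*)shared_d_signalMapSizes     - base);
        d_offsets[3] = (int)((char*)shared_d_AADims             - base);
        d_offsets[4] = (int)((char*)shared_d_numResultingPoints - base);
        d_offsets[5] = (int)((char*)shared_d_KernelWeights      - base);
    }
}

// Hypothetical host helper: runs one block with 128 bytes of dynamic
// shared memory and copies the six offsets back to the host.
bool run_report_layout(int h_offsets[6])
{
    int* d_offsets = nullptr;
    cudaMalloc(&d_offsets, 6 * sizeof(int));
    report_layout<<<1, 32, 128>>>(d_offsets);
    cudaError_t err = cudaMemcpy(h_offsets, d_offsets, 6 * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_offsets);
    return err == cudaSuccess;
}
```

With these counts the offsets come out as 24, 36, 48, 64, 72 and 76 bytes, matching the sizes in the question's comments.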

The next chunk, “Fill the shared memory with data from global device memory…”, does actually initialize the shared memory by copying values in from global memory.

And every thread runs that same copy, so each value is transferred from global to shared memory once per thread rather than once per block, which is bad.

A better approach is to just think of the entire chunk of data you need to copy into shared memory as a homogeneous array of bytes – S of them, let’s say. You want to read it into shared memory (where the pointers you set up will refer to it), you have N threads, so you give each thread S/N bytes to copy.

Once you’ve got that approach working, treat the chunk as an array of ints, so you’re copying 4 bytes at a time per thread, and make sure consecutive threads are copying consecutive ints, so your global memory reads coalesce, and you reduce shared-memory bank conflicts for the writing part.
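A sketch of that cooperative load, under some assumptions: the whole parameter block has been packed contiguously in global memory and is viewed as num_words ints, blocks are one-dimensional, and the names coop_load, d_params, s_words and run_coop_load are invented for illustration. Thread t copies words t, t + blockDim.x, t + 2*blockDim.x and so on, so consecutive threads touch consecutive ints, and a barrier follows before anyone reads. Error checking is omitted.

```cuda
#include <cuda_runtime.h>

extern __shared__ int s_words[];  // dynamic shared memory, sized at launch

// Hypothetical kernel: the block loads num_words ints cooperatively,
// then every thread reads all of them.
__global__ void coop_load(const int* d_params, int num_words, int* d_out)
{
    // Strided loop: consecutive threads copy consecutive ints, so the
    // global reads coalesce and the shared writes hit distinct banks.
    for (int i = threadIdx.x; i < num_words; i += blockDim.x)
        s_words[i] = d_params[i];

    // Wait until the whole block has finished writing before reading.
    __syncthreads();

    int acc = 0;
    for (int i = 0; i < num_words; ++i)
        acc += s_words[i];

    d_out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Hypothetical host helper: fills the table with 0..99 and checks that
// every thread in every block summed the complete table.
bool run_coop_load()
{
    const int blocks = 2, threads = 32, num_words = 100, n = blocks * threads;
    int h_params[num_words];
    for (int i = 0; i < num_words; ++i)
        h_params[i] = i;
    int h_out[n];

    int *d_params = nullptr, *d_out = nullptr;
    cudaMalloc(&d_params, sizeof(h_params));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_params, h_params, sizeof(h_params), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    coop_load<<<blocks, threads, num_words * sizeof(int)>>>(d_params, num_words, d_out);
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_params);
    cudaFree(d_out);

    for (int i = 0; i < n; ++i)
        if (h_out[i] != 4950) return false;  // 0 + 1 + ... + 99
    return true;
}
```

The loop handles any num_words, including sizes that are not a multiple of the block size, so the typed pointers into the shared block can be set up independently of how the copy is divided among threads.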