Copying data from global memory to shared memory by each thread

Hi,
Sorry for this newbie question but I couldn’t locate any place where this has been answered clearly. I need to have some arrays in the shared memory which are accessed by each thread executing in a block. As the host can’t directly load the arrays into the shared memory, should I have something like this at the very beginning of the kernel?

// gArray points to an array in global memory passed as a pointer to the kernel
// sArray is the shared array
__global__ void testkernel(float *gArray)
{
    __shared__ float sArray[256];

    for (int i = 0; i < 256; i++)
    {
        sArray[i] = gArray[i];
    }
}

If I have this in the kernel, won’t it be executed by every single thread in the block, copying the same data over to shared memory? I can’t simply have each thread work on a different portion of the array, as I need the entire array to proceed with the computation. Since the array is read-only I can perhaps use texture memory, and I can also use __syncthreads() to wait for all threads to finish copying their portions, but is there any other way to load the data into shared memory once for all the threads?

Sorry if I’m missing something here.

Thanks
Shibdas

You need to load it manually. You can share that load among the threads: successive threads load successive elements, and you can put this in a for loop. (This will also aid memory coalescing.)
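For instance, a block-stride loop along these lines (a minimal sketch based on the kernel from the question above; untested):

__global__ void testkernel(float *gArray)
{
    __shared__ float sArray[256];

    // each thread loads every blockDim.x-th element, so consecutive
    // threads touch consecutive addresses (coalesced)
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
    {
        sArray[i] = gArray[i];
    }

    __syncthreads(); // wait until the whole array is in shared memory

    // ... computation using sArray ...
}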

You might want to read parts 4 & 5 of the series of CUDA programming articles from Dr. Dobb's Journal:
http://www.ddj.com/architect/208401741

I found it useful when I was getting started.

Make each thread copy one element of your array:

__global__ void testkernel(float *gArray)
{
    __shared__ float sArray[256];

    if (threadIdx.x < 256)
    {
        sArray[threadIdx.x] = gArray[threadIdx.x];
    }

    __syncthreads(); // wait for each thread to copy its element

    ...
}

If you have fewer than 256 threads, each of them should copy several elements of your array (here i is threadIdx.x and THREADS_NUM is the block size):

...
sArray[i] = gArray[i];
sArray[i + THREADS_NUM] = gArray[i + THREADS_NUM];
...
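Filled out, that pattern could look like this (a sketch assuming THREADS_NUM is 128, so each thread copies exactly two elements; the macro name is illustrative):

#define THREADS_NUM 128 // illustrative block size

__global__ void testkernel(float *gArray)
{
    __shared__ float sArray[256];
    int i = threadIdx.x;

    // each thread copies two elements, THREADS_NUM apart,
    // so both loads are coalesced across the block
    sArray[i] = gArray[i];
    sArray[i + THREADS_NUM] = gArray[i + THREADS_NUM];

    __syncthreads(); // the array is complete only after this barrier

    // ... computation using sArray ...
}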

Thanks for all the suggestions. They are really helpful.

13 years later … :)

I use dim3 threads(64) and pass a uint32_t array pointer (64 elements) to the kernel, using managed memory. I believe this array ends up in the device’s global memory, right?

To make things faster, I copy the array to shared memory.

__global__ void kernel(const uint32_t *a_gm,
                             uint32_t *b_gm) // return value
{
    int i = threadIdx.x;
    __shared__ uint32_t a_sm[64];
    a_sm[i] = a_gm[i];
    __syncthreads(); // 64 threads span two warps, so a barrier is needed before other threads read a_sm
    ...
}

With that change, I see a speed increase of about 3%.
I use a_sm[] a lot inside the kernel.
I was expecting a lot more.
What did I miss?

It’s really not possible to do performance analysis on one line of code.

If you’re not witnessing much benefit from shared-memory “caching”, there are some possibilities:

  • your kernel’s performance limiter is not (access to) the data in question
  • the data in question was already getting “cached” sufficiently in L1 or L2, so shared memory made little difference
  • you may be getting bank-conflicted access in shared memory (see the padding sketch below)
  • you have now introduced shared-memory access pressure as a limiter to performance for your kernel
  • your own analysis is not correct, or you’ve made an error of some sort
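On the bank-conflict point: a common mitigation is to pad the shared array so that threads in a warp don’t hit the same bank. A minimal sketch using the classic tiled-transpose example (transpose_tile, in, out, and width are illustrative, not from your code; assumes a 32x32 thread block and a square matrix whose dimension is a multiple of 32):

// Shared memory has 32 banks; a 32x32 tile of 4-byte words puts every
// element of a column in the same bank. Padding each row by one word
// shifts successive rows to different banks, avoiding the conflict.
__global__ void transpose_tile(const float *in, float *out, int width)
{
    __shared__ float tile[32][33]; // 33, not 32: the extra column is the padding

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x]; // row-wise write, conflict-free

    __syncthreads();

    // column-wise read: conflict-free only because of the padding
    int tx = blockIdx.y * 32 + threadIdx.x;
    int ty = blockIdx.x * 32 + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}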

For general usage of shared memory as a cache: on Volta and later GPUs, NVIDIA made a point of saying that the improvements in L1 may significantly reduce the “historically expected” benefit from this approach.