Copying data from global memory to shared memory by each thread

shibdas · May 15, 2009, 6:19am

Hi,
Sorry for this newbie question but I couldn’t locate any place where this has been answered clearly. I need to have some arrays in the shared memory which are accessed by each thread executing in a block. As the host can’t directly load the arrays into the shared memory, should I have something like this at the very beginning of the kernel?

// gArray points to an array in global memory, passed to the kernel as a pointer
// sArray is the shared array
__global__ void testkernel(float *gArray)
{
    __shared__ float sArray[256];

    for (int i = 0; i < 256; i++)
    {
        sArray[i] = gArray[i];
    }
    …
}

If I have this in the kernel, won't it be executed by every single thread in the block, each copying the same data to shared memory? I can't have each thread copy only a portion of the array, as I need the entire array to proceed with the computation. As the array is read-only I could perhaps use texture memory, and I could also use __syncthreads() to wait for all threads to finish copying different portions of the array, but is there any other way to load the data into shared memory once for all the threads?

Sorry if I’m missing something here.

Thanks
Shibdas

Sarnath · May 15, 2009, 6:33am

You need to load it manually, but you can share the load among the threads. Successive threads can load successive elements, and you can put this in a for loop. (This will also aid memory coalescing.)
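A minimal sketch of that cooperative load, assuming a 256-element array and a one-dimensional block. The names stageAndSum and runStageAndSum, and the summation that stands in for the real computation, are hypothetical and only there to make the example checkable:

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 256;  // size of the shared array, as in the question

// Hypothetical kernel: the block stages gArray into shared memory
// cooperatively, then every thread sums the whole array as a stand-in
// for a computation that needs all of it.
__global__ void stageAndSum(const float *gArray, float *out)
{
    __shared__ float sArray[N];

    // Thread t copies elements t, t + blockDim.x, t + 2*blockDim.x, ...
    // Neighbouring threads touch neighbouring addresses, so the global
    // loads can coalesce, and any block size covers all N elements.
    for (int i = threadIdx.x; i < N; i += blockDim.x) {
        sArray[i] = gArray[i];
    }
    __syncthreads();  // sArray is complete only after this barrier

    float sum = 0.0f;
    for (int i = 0; i < N; ++i) {
        sum += sArray[i];
    }
    out[threadIdx.x] = sum;
}

// Hypothetical host-side check: launches one block and returns true if
// every thread saw the complete array.
bool runStageAndSum(int threadsPerBlock)
{
    std::vector<float> h(N, 1.0f);  // all ones, so each sum should equal N
    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, N * sizeof(float)) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, threadsPerBlock * sizeof(float)) != cudaSuccess) {
        cudaFree(dIn);
        return false;
    }
    cudaMemcpy(dIn, h.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    stageAndSum<<<1, threadsPerBlock>>>(dIn, dOut);

    std::vector<float> result(threadsPerBlock);
    cudaError_t err = cudaMemcpy(result.data(), dOut,
                                 threadsPerBlock * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return false;

    for (float v : result) {
        if (v != static_cast<float>(N)) return false;
    }
    return true;
}
```

Calling runStageAndSum(64) or runStageAndSum(256) exercises the same kernel with fewer threads than elements and with one thread per element. Each element is copied exactly once per block, and the barrier is what lets every thread then read the entire array.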

avidday · May 15, 2009, 6:34am

You might want to read parts 4 and 5 of the series of CUDA programming articles from Dr. Dobb's Journal:
http://www.ddj.com/architect/208401741

I found it useful when I was getting started.

_DK · May 15, 2009, 10:57am

Make each thread copy a chunk of your array.

__global__ void testkernel(float *gArray)
{
    __shared__ float sArray[256];

    if (threadIdx.x < 256)
    {
        sArray[threadIdx.x] = gArray[threadIdx.x];
    }
    __syncthreads(); // wait for every thread to copy its element
    ...
}

If you have fewer than 256 threads, each of them should copy several elements of the array (here i is the thread index and THREADS_NUM is the number of threads in the block):

...
sArray[i] = gArray[i];
sArray[i + THREADS_NUM] = gArray[i + THREADS_NUM];
...

shibdas · May 15, 2009, 5:34pm

Thanks for all the suggestions. They are really helpful.

geohei · January 7, 2022, 4:18pm

13 years later … :)

I use dim3 threads(64) and pass a pointer to a uint32_t array (64 elements) to the kernel, using managed memory. I believe this array ends up in the device’s global memory, right?

To make things faster, I copy the array to shared memory.

__global__ void kernel(const uint32_t *a_gm,
                             uint32_t *b_gm) // return value
{
    int i = threadIdx.x;
    __shared__ uint32_t a_sm[64];
    a_sm[i] = a_gm[i];
    ...
}

After that, I have a speed increase of about 3%.
I use a_sm[] a lot inside the kernel.
I was expecting a lot more.
What did I miss?

Robert_Crovella · January 7, 2022, 4:27pm

It’s really not possible to do performance analysis on one line of code.

If you’re not witnessing much benefit from shared memory “caching” there are some possibilities:

your kernel’s limiter to performance is not (access to) the data in question
the data in question was already getting “cached” sufficiently in L1 or L2; shared did not make much difference
you may be getting bank-conflicted access in shared memory
you have now introduced shared access pressure as a limiter to performance for your kernel
your own analysis is not correct, or you’ve made an error of some sort

As for the general use of shared memory as a cache: for Volta and later GPUs, NVIDIA made a point of saying that the improvements in L1 may significantly reduce the “historically expected” benefit of this approach.
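One way to check several of these possibilities is to time a global-memory variant and a shared-memory variant of the same kernel side by side. The following is only a sketch under stated assumptions: the kernels useGlobal and useShared, the helpers timeKernelMs and compareVariants, the REPS reuse count, and the 1024-block grid are all hypothetical stand-ins, not the kernel from the post. The access pattern is a simple rotating read of a 64-element managed array.

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

constexpr int N = 64;       // array length and block size, as in the post
constexpr int REPS = 1000;  // assumed amount of reuse per thread

// Variant A: every read goes to global memory (served by L1/L2 when cached).
__global__ void useGlobal(const uint32_t *a_gm, uint32_t *b_gm)
{
    int i = threadIdx.x;
    uint32_t acc = 0;
    for (int r = 0; r < REPS; ++r)
        acc += a_gm[(i + r) % N];
    b_gm[blockIdx.x * N + i] = acc;
}

// Variant B: stage the array in shared memory once, then read it from there.
__global__ void useShared(const uint32_t *a_gm, uint32_t *b_gm)
{
    __shared__ uint32_t a_sm[N];
    int i = threadIdx.x;
    a_sm[i] = a_gm[i];
    __syncthreads();  // threads read elements written by other threads
    uint32_t acc = 0;
    for (int r = 0; r < REPS; ++r)
        acc += a_sm[(i + r) % N];
    b_gm[blockIdx.x * N + i] = acc;
}

using Kernel = void (*)(const uint32_t *, uint32_t *);

// Times one launch with CUDA events, after an untimed warm-up launch.
float timeKernelMs(Kernel k, const uint32_t *a, uint32_t *b, int blocks)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    k<<<blocks, N>>>(a, b);  // warm-up: managed-page migration, caches
    cudaEventRecord(start);
    k<<<blocks, N>>>(a, b);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Returns true if both variants produce identical output; the two
// timings (in milliseconds) are written through the pointers.
bool compareVariants(float *msGlobal, float *msShared)
{
    const int blocks = 1024;
    uint32_t *a = nullptr, *b = nullptr;
    if (cudaMallocManaged(&a, N * sizeof(uint32_t)) != cudaSuccess) return false;
    if (cudaMallocManaged(&b, blocks * N * sizeof(uint32_t)) != cudaSuccess) {
        cudaFree(a);
        return false;
    }
    for (int i = 0; i < N; ++i) a[i] = i;

    *msGlobal = timeKernelMs(useGlobal, a, b, blocks);
    cudaDeviceSynchronize();
    std::vector<uint32_t> ref(b, b + blocks * N);  // keep variant A's output

    *msShared = timeKernelMs(useShared, a, b, blocks);
    cudaDeviceSynchronize();
    bool same = true;
    for (int j = 0; j < blocks * N; ++j)
        if (b[j] != ref[j]) same = false;

    cudaFree(a);
    cudaFree(b);
    return same;
}
```

The two timings depend heavily on the GPU and on the access pattern, so no particular ratio should be expected. If they come out close, that is consistent with the array already being served from L1 or L2. A profiler such as Nsight Compute can then show whether memory access is the limiter at all.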
