Question regarding transfer from global to shared memory

anshu · November 26, 2010, 8:44am

Hi all,
I have a very basic question. I have an array in global memory, each element of which is 32 bytes. Which of the following is the better way to load this data into shared memory and, if possible, why?

1. Each thread loads one element of the array, i.e. 32 bytes, from global to shared memory. In this case I will be using fewer threads to load the data, so most of the threads remain idle. (My array has, say, a total of 100 elements.)
2. Each thread loads 4 bytes of data from global to shared memory. In this case fewer threads will be idle.

So which one of these two will be better?

Regards

tera · November 26, 2010, 12:08pm

You’ll achieve optimal throughput if every thread transfers 16 consecutive bytes, which results in two 128-byte transactions per half-warp (assuming the access is properly aligned). 8 or 4 bytes per thread will not be much slower.

How much slower other access patterns are depends on the compute capability. 1.0 and 1.1 devices serialize accesses that cannot be coalesced, resulting in 1/16th of the total bandwidth. 1.2 and 1.3 devices coalesce as far as possible, but have to read some addresses twice. 2.x devices cache the access and will give almost identical results, no matter how the access is done.
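
As a rough illustration of the 16-bytes-per-thread case, here is a minimal sketch that stages the asker's 100 x 32-byte array through int4 loads. The kernel name loadVec16, the NUM_WORDS/NUM_VEC macros, the block size of 64 and the summing "consumer" are all assumptions made for the example, not anything from the original code, and I have not benchmarked it:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_WORDS 800              // 100 elements x 32 bytes / 4 bytes per int
#define NUM_VEC   (NUM_WORDS / 4)  // 200 chunks of 16 bytes (one int4 each)

// Hypothetical kernel: each thread copies whole 16-byte chunks, and
// consecutive threads read consecutive chunks.
__global__ void loadVec16(const int4 *in, int *sum_out)
{
    __shared__ int4 tile[NUM_VEC];

    // Block-stride loop, so any block size covers all NUM_VEC chunks.
    for (int i = threadIdx.x; i < NUM_VEC; i += blockDim.x)
        tile[i] = in[i];
    __syncthreads();  // every chunk is staged before anyone reads it

    // Trivial consumer so the copy can be checked from the host.
    if (threadIdx.x == 0) {
        int s = 0;
        for (int i = 0; i < NUM_VEC; ++i)
            s += tile[i].x + tile[i].y + tile[i].z + tile[i].w;
        *sum_out = s;
    }
}

int main()
{
    int h_in[NUM_WORDS];
    int expected = 0;
    for (int i = 0; i < NUM_WORDS; ++i) { h_in[i] = i; expected += i; }

    int *d_in = 0, *d_sum = 0;
    // cudaMalloc allocations are aligned well beyond the 16 bytes int4 needs.
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_sum, sizeof(int));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    loadVec16<<<1, 64>>>(reinterpret_cast<const int4 *>(d_in), d_sum);

    int h_sum = 0;
    cudaError_t err = cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%s\n", h_sum == expected ? "ok" : "mismatch");
    cudaFree(d_in);
    cudaFree(d_sum);
    return h_sum == expected ? 0 : 1;
}
```

The cast to int4 is only safe because the buffer starts on a 16-byte boundary and its size is a multiple of 16 bytes; with an offset pointer the alignment assumption above would no longer hold.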

umod.47 · November 27, 2010, 12:09am

Keep in mind that shared memory is subject to bank conflicts. It’s a good idea for every thread to read 4 consecutive bytes (1 int) from the data in global memory and write them to 4 consecutive bytes in shared memory, so that thread i handles word i. This gives you coalesced access to global memory and keeps you free of bank conflicts in shared memory.

If you keep some threads idle during the memory copy, don’t forget to call __syncthreads(); afterwards. And if the number of threads copying data is a multiple of 32, you don’t lose anything to idle threads. It should look like this:

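#define SIZE 256  // number of ints to copy; assumed value, must not exceed blockDim.x

__global__ void myKernel (void *Data)
{
  // One 4-byte word per copying thread.
  __shared__ int private_data[SIZE];

  // Thread i copies word i: coalesced global read, conflict-free shared write.
  if(threadIdx.x<SIZE)
  {
    private_data[threadIdx.x]=*((int*)Data+threadIdx.x);
  }

  // Every thread in the block must reach the barrier, including the idle ones.
  __syncthreads();
}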

(I can’t be sure of the syntax without a compiler at hand, but it should be OK.)
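
For the array from the question (100 elements of 32 bytes, i.e. 800 ints), there are more words than threads in a typical block, so a block-stride loop is one way to extend the idea. This is only a sketch under my own assumptions: the kernel name stageAndCopy, the NUM_WORDS macro, the block size of 128 and the reversed copy-out (there just to make the result checkable) are all made up for the example:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_WORDS 800  // 100 elements x 32 bytes / 4 bytes per int

// Hypothetical kernel: stages NUM_WORDS ints into shared memory, one
// 4-byte word per thread per iteration, then writes them out reversed.
__global__ void stageAndCopy(const int *in, int *out)
{
    __shared__ int tile[NUM_WORDS];

    // Consecutive threads read consecutive words (coalesced) and write
    // consecutive shared-memory words (distinct banks).
    for (int i = threadIdx.x; i < NUM_WORDS; i += blockDim.x)
        tile[i] = in[i];

    // All threads reach the barrier, whether or not they copied anything.
    __syncthreads();

    // Reversed copy-out: each thread reads words staged by other threads,
    // which is only correct because of the barrier above.
    for (int i = threadIdx.x; i < NUM_WORDS; i += blockDim.x)
        out[i] = tile[NUM_WORDS - 1 - i];
}

int main()
{
    int h_in[NUM_WORDS], h_out[NUM_WORDS];
    for (int i = 0; i < NUM_WORDS; ++i) h_in[i] = i;

    int *d_in = 0, *d_out = 0;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    stageAndCopy<<<1, 128>>>(d_in, d_out);  // 128 threads: a multiple of 32

    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int bad = 0;
    for (int i = 0; i < NUM_WORDS; ++i)
        if (h_out[i] != h_in[NUM_WORDS - 1 - i]) ++bad;

    printf("%s\n", bad == 0 ? "ok" : "mismatch");
    cudaFree(d_in);
    cudaFree(d_out);
    return bad == 0 ? 0 : 1;
}
```

With the loop there is no need for the threadIdx.x<SIZE guard: threads whose index is past the end simply skip the loop body and go straight to the barrier.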
