Question regarding transfer from global to shared memory

Hi all,
I have a very basic question. I have an array in global memory, each element of which is 32 bytes. Which of the following is the better way to load this data into shared memory, and, if possible, why?

  1. Each thread loads 1 element of the array, i.e. 32 bytes, from global to shared memory. In this case I will be using fewer threads to load the data, so most of the threads remain idle. (My array has, say, a total of 100 elements.)
  2. Each thread loads 4 bytes of data from global to shared memory. In this case fewer threads will be idle.

So which one of these two will be better?

Regards

You’ll achieve optimal throughput if every thread transfers 16 consecutive bytes, which results in two 128-byte transactions per half-warp (assuming the access is properly aligned). 8 or 4 bytes per thread will not be much slower.

How much slower other access patterns are depends on the compute capability. 1.0 and 1.1 devices serialize accesses that cannot be coalesced, which leaves about 1/16th of the total bandwidth. 1.2 and 1.3 devices coalesce as far as possible, but have to read some addresses twice. 2.x devices cache the accesses and give almost identical results, no matter how the access is done.
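
The 16-bytes-per-thread pattern described above could be sketched roughly as follows. This is only an illustration, not a tuned implementation: the names stageVec16, runStageVec16, N_ELEMS and N_VEC are made up for the example, the 100-element size is taken from the question, and the input is assumed to be 16-byte aligned (which holds for memory returned by cudaMalloc). The kernel writes the data back in reverse order only so that the host can check the copy.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define N_ELEMS 100            // element count taken from the question
#define N_VEC   (N_ELEMS * 2)  // each 32-byte element is two 16-byte int4 chunks

// Stage the whole array in shared memory with one 16-byte load per thread per
// iteration, then write it back reversed, so every output chunk depends on a
// load done by some other thread (which is why the barrier is needed).
__global__ void stageVec16(const int4 *in, int4 *out)
{
    __shared__ int4 tile[N_VEC];

    for (int i = threadIdx.x; i < N_VEC; i += blockDim.x)
        tile[i] = in[i];               // consecutive threads read consecutive 16 bytes
    __syncthreads();                   // reached by every thread in the block

    for (int i = threadIdx.x; i < N_VEC; i += blockDim.x)
        out[i] = tile[N_VEC - 1 - i];
}

// Hypothetical host-side check: returns true if the round trip matches.
bool runStageVec16()
{
    std::vector<int> h_in(N_VEC * 4), h_out(N_VEC * 4, -1);
    for (int i = 0; i < N_VEC * 4; ++i) h_in[i] = i;

    const size_t bytes = h_in.size() * sizeof(int);
    int4 *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    stageVec16<<<1, 128>>>(d_in, d_out);   // 128 threads: a multiple of 32
    bool ok = cudaMemcpy(h_out.data(), d_out, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);

    // Chunk i of the output should equal chunk (N_VEC - 1 - i) of the input.
    for (int i = 0; ok && i < N_VEC; ++i)
        for (int k = 0; k < 4; ++k)
            ok = ok && (h_out[4 * i + k] == h_in[4 * (N_VEC - 1 - i) + k]);
    return ok;
}
```

Calling runStageVec16() from main() and checking its return value is enough to try it. The loop stride of blockDim.x means the same kernel works for any block size, although a multiple of 32 keeps whole warps busy.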

Keep in mind that shared memory is subject to bank conflicts. It’s a good idea for each thread to read 4 consecutive bytes (one int) from the data in global memory and write them to the corresponding 4 bytes in shared memory, so that consecutive threads touch consecutive words. This gives you coalesced access to global memory and keeps you clear of bank conflicts in shared memory.

If some threads stay idle during the memory copy, don’t forget to call __syncthreads() afterwards, so that no thread reads the shared data before it has been completely written. And if the number of threads copying data is a multiple of 32 (the warp size), you don’t lose anything to idle threads. It should look like this:

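#define SIZE 128  // number of ints to copy; value assumed here, ideally a multiple of 32

__global__ void myKernel (void *Data)
{
  // One int per copying thread; must be an array, not a single int
  __shared__ int private_data[SIZE];

  // Thread i copies the i-th int, so a warp reads 32 consecutive words
  if(threadIdx.x<SIZE)
  {
    private_data[threadIdx.x]=*((int*)Data+threadIdx.x);
  }

  // Outside the if: every thread in the block must reach the barrier
  __syncthreads();
}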

(I can’t be sure of the syntax without a compiler at hand, but it should be OK.)
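
For the array from the question (100 elements of 32 bytes, i.e. 800 ints), there are more words than threads in a typical block, so the one-int-per-thread copy above would need a loop. A possible sketch, under the assumption that the whole array fits in one block's shared memory, is shown below. The names stageWords, runStageWords and N_WORDS are invented for this example, and the neighbour sum is there only to give the host something to verify.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define N_WORDS 800   // 100 elements * 32 bytes / 4 bytes per int (sizes from the question)

__global__ void stageWords(const int *in, int *out)
{
    __shared__ int tile[N_WORDS];

    // Thread t copies words t, t + blockDim.x, t + 2*blockDim.x, ...
    // In each iteration consecutive threads touch consecutive 4-byte words,
    // both in global memory and in shared memory.
    for (int i = threadIdx.x; i < N_WORDS; i += blockDim.x)
        tile[i] = in[i];
    __syncthreads();   // all loads must finish before any thread reads tile

    // Each output word uses a word loaded by a neighbouring thread.
    for (int i = threadIdx.x; i < N_WORDS; i += blockDim.x)
        out[i] = tile[i] + tile[(i + 1) % N_WORDS];
}

// Hypothetical host-side check: returns true if the results are as expected.
bool runStageWords()
{
    std::vector<int> h_in(N_WORDS), h_out(N_WORDS, -1);
    for (int i = 0; i < N_WORDS; ++i) h_in[i] = i;

    const size_t bytes = N_WORDS * sizeof(int);
    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }

    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);
    stageWords<<<1, 256>>>(d_in, d_out);   // 256 threads: a multiple of 32
    bool ok = cudaMemcpy(h_out.data(), d_out, bytes,
                         cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; ok && i < N_WORDS; ++i)
        ok = (h_out[i] == h_in[i] + h_in[(i + 1) % N_WORDS]);
    return ok;
}
```

Because the barrier sits outside the loops, every thread reaches it even when some threads have nothing left to copy in the last iteration.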
