What is the fastest way to copy 512 bytes from global to shared memory?

alikim · December 23, 2014, 2:18pm

I have a global char array passed into a kernel as a ‘char*’ parameter, and for every block I need to copy 512 consecutive bytes (at a different offset for each block, always a multiple of 512 bytes) from that array into a shared char array or any other container for further byte manipulation.

Compute capability 2.1.

What kind of container in shared memory should I use, and how many bytes should each thread transfer to avoid bank conflicts?

Also, do bank conflicts exist between all threads in a block or only between threads in a warp?

Thank you!

Robert_Crovella · December 23, 2014, 4:39pm

Bank conflicts only exist between threads in a warp.

One possibility: use a uchar4 data type and load one uchar4 per thread.
As long as you transfer things consecutively, your global loads will be coalesced and your shared stores will be un-conflicted.

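#define SSIZE 512
__global__ void my_kernel(char *my_global,...){
  // the uchar4 accesses below require 4-byte alignment of the shared array
  __shared__ __align__(4) char my_shared[SSIZE];

  int lidx = threadIdx.x;
  uchar4 *u4shared = reinterpret_cast<uchar4 *>(my_shared);
  // my_global must also be 4-byte aligned (cudaMalloc allocations are)
  uchar4 *u4global = reinterpret_cast<uchar4 *>(my_global);
  // block-stride loop: covers all SSIZE/4 elements for any blockDim.x
  while (lidx < SSIZE/4){
    u4shared[lidx] = u4global[lidx+(blockIdx.x*SSIZE/4)];
    lidx += blockDim.x;
  }

  // make the copied bytes visible to every thread in the block
  __syncthreads();

  ...
}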

(assumes 1D threadblock structure)
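
To see why consecutive uchar4 stores do not conflict, it may help to write out the bank arithmetic. The following is only a host-side sketch: it assumes the compute capability 2.x layout (32 banks, each 4 bytes wide, successive 32-bit words in successive banks) and accesses of 4 bytes or fewer. The helper name conflict_degree is hypothetical, not part of the CUDA API.

```cuda
// Assumption: 32 banks, each 4 bytes wide, with successive 32-bit words
// assigned to successive banks (the compute capability 2.x layout).
constexpr int kNumBanks = 32;
constexpr int kBankWidthBytes = 4;
constexpr int kWarpSize = 32;

// Worst-case number of distinct 32-bit words that a single bank must serve
// when lane i of a warp accesses byte offset i * stride_bytes.
// 1 means conflict-free; n means an n-way bank conflict.
// Lanes that touch the same 32-bit word are counted once, since on
// compute capability 2.x that case does not cause a conflict.
inline int conflict_degree(int stride_bytes) {
  int words[kNumBanks][kWarpSize];  // distinct words seen per bank
  int count[kNumBanks] = {0};       // how many distinct words per bank
  int worst = 0;
  for (int lane = 0; lane < kWarpSize; ++lane) {
    int word = (lane * stride_bytes) / kBankWidthBytes;
    int bank = word % kNumBanks;
    bool seen = false;
    for (int k = 0; k < count[bank]; ++k)
      if (words[bank][k] == word) seen = true;
    if (!seen) words[bank][count[bank]++] = word;
    if (count[bank] > worst) worst = count[bank];
  }
  return worst;
}
```

Under these assumptions, conflict_degree(4) models the one-uchar4-per-thread pattern and returns 1, while a stride of 8 bytes returns 2 (a 2-way conflict).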

allanmac · December 23, 2014, 7:21pm

You might also want to create a union of { uchar[512]; uint4[32]; } and see how a ld.global.v4.u32 followed by a st.shared.v4.u32 performs. The cost of the “replays” might be OK.
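
A minimal sketch of that union idea, for anyone who wants to time it, might look like the following. The names Tile, copy_tile_uint4 and run_copy_demo are hypothetical, the checksum exists only so the copy can be checked, and the input pointer is assumed to be 16-byte aligned (true for cudaMalloc allocations). Whether the compiler actually emits ld.global.v4.u32 and st.shared.v4.u32 should be confirmed in the generated PTX.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define SSIZE 512

// Hypothetical union: byte view and 16-byte vector view of the same buffer.
union Tile {
  unsigned char bytes[SSIZE];
  uint4 vec[SSIZE / 16];  // 32 elements of 16 bytes each
};

// Each block copies its own 512-byte slice with 16-byte loads and stores,
// then thread 0 sums the bytes so the host can verify the copy.
__global__ void copy_tile_uint4(const unsigned char *my_global,
                                unsigned int *sums) {
  __shared__ Tile tile;
  const uint4 *g4 = reinterpret_cast<const uint4 *>(my_global) +
                    blockIdx.x * (SSIZE / 16);
  for (int i = threadIdx.x; i < SSIZE / 16; i += blockDim.x)
    tile.vec[i] = g4[i];
  __syncthreads();
  if (threadIdx.x == 0) {
    unsigned int s = 0;
    for (int i = 0; i < SSIZE; ++i) s += tile.bytes[i];
    sums[blockIdx.x] = s;
  }
}

// Host driver: returns the number of blocks whose checksum is wrong,
// or -1 if a CUDA call fails.
int run_copy_demo(int num_blocks) {
  std::vector<unsigned char> h(static_cast<size_t>(num_blocks) * SSIZE);
  for (size_t i = 0; i < h.size(); ++i)
    h[i] = static_cast<unsigned char>(i * 7 + 3);

  unsigned char *d_in = nullptr;
  unsigned int *d_sums = nullptr;
  if (cudaMalloc(&d_in, h.size()) != cudaSuccess) return -1;
  if (cudaMalloc(&d_sums, num_blocks * sizeof(unsigned int)) != cudaSuccess) {
    cudaFree(d_in);
    return -1;
  }
  cudaMemcpy(d_in, h.data(), h.size(), cudaMemcpyHostToDevice);

  // One warp per block: each thread moves exactly one uint4.
  copy_tile_uint4<<<num_blocks, 32>>>(d_in, d_sums);

  std::vector<unsigned int> sums(num_blocks);
  cudaError_t err = cudaMemcpy(sums.data(), d_sums,
                               num_blocks * sizeof(unsigned int),
                               cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_sums);
  if (err != cudaSuccess) return -1;

  int bad = 0;
  for (int b = 0; b < num_blocks; ++b) {
    unsigned int expected = 0;
    for (int i = 0; i < SSIZE; ++i) expected += h[b * SSIZE + i];
    if (sums[b] != expected) ++bad;
  }
  return bad;
}
```

With 32 threads per block, each thread issues a single 16-byte load and store, so this variant needs only one warp per block for the copy.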

alikim · December 24, 2014, 3:04am

Thank you! One question - you said each thread loads 4 bytes but because of

lidx += blockDim.x;

if I have 64 threads per block, each thread will load 8 bytes in two iterations. Is that what you intended, or did you assume at least 128 threads per block?

Robert_Crovella · December 24, 2014, 3:31am

I believe it should work OK and give good performance. You may find a faster approach.
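
The number of loop iterations per thread can be checked by replaying the loop condition on the host. This is only an illustrative sketch; the helper name elements_per_thread is hypothetical, and it assumes the same 1D block and SSIZE of 512 as the kernel above.

```cuda
#define SSIZE 512

// Number of uchar4 elements (4 bytes each) that thread `tid` copies when
// the block has `block_dim` threads, found by replaying the copy loop:
// start at tid, advance by block_dim, stop at SSIZE/4.
inline int elements_per_thread(int tid, int block_dim) {
  int count = 0;
  for (int lidx = tid; lidx < SSIZE / 4; lidx += block_dim)
    ++count;
  return count;
}
```

With 64 threads, every thread makes two passes (8 bytes in total); with 128 threads, one pass each; with more than 128 threads, the extra threads skip the loop entirely. In all cases the threads of a warp still touch consecutive uchar4 elements on each pass.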

alikim · December 24, 2014, 7:49am

ok, thank you
