What is the fastest way to copy 512 bytes from global to shared memory?

I have a global char array passed into a kernel as a parameter ‘char*’ and for every block I need to copy 512 consecutive bytes (with a different offset for each block, multiple of 512 bytes) from that array into a shared char* array or any other container for further byte manipulation.

Compute capability 2.1.

What kind of container should I use in shared memory, how many bytes should each thread transfer, and how do I avoid bank conflicts?

Also, do bank conflicts exist between threads in a block or only between threads in a warp?

Thank you!

Bank conflicts only exist between threads in a warp.

One possibility: use the uchar4 data type and load one uchar4 per thread.
As long as you transfer things consecutively, your global loads will be coalesced and your shared stores will be un-conflicted.

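#define SSIZE 512
__global__ void my_kernel(char *my_global,...){
  // 4-byte alignment so the buffer can safely be accessed as uchar4
  __shared__ __align__(4) char my_shared[SSIZE];

  int lidx = threadIdx.x;
  uchar4 *u4shared = reinterpret_cast<uchar4 *>(my_shared);
  uchar4 *u4global = reinterpret_cast<uchar4 *>(my_global);
  // block-stride loop: consecutive threads copy consecutive uchar4 elements
  while (lidx < SSIZE/4){
    u4shared[lidx] = u4global[lidx+(blockIdx.x*SSIZE/4)];
    lidx += blockDim.x;
  }

  // make the whole tile visible to every thread in the block
  __syncthreads();

  ...
}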

(assumes 1D threadblock structure)

You might also want to create a union of { uchar[512]; uint4[32]; } and see how a ld.global.v4.u32 followed by a st.shared.v4.u32 performs. The cost of the “replays” might be OK.
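A rough sketch of that union idea, in case it helps to try it. The names `Tile`, `copy_tile_v4` and `count_v4_mismatches` are made up for illustration, and I am assuming the global buffer is 16-byte aligned (cudaMalloc allocations are) and that the block has at least 32 threads. Only the first warp does the load, 16 bytes per thread. Reading the bytes back through the other union member is common practice in CUDA code, though strict C++ does not bless it. Whether the compiler actually emits ld.global.v4.u32 / st.shared.v4.u32 is something to check in the PTX rather than assume.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Same 512 bytes viewed either as bytes or as 32 x 16-byte vectors.
union Tile {
  unsigned char bytes[512];
  uint4 vec[32];
};

// Assumes blockDim.x >= 32 (1D block) and a 16-byte aligned input pointer.
__global__ void copy_tile_v4(const unsigned char *in, unsigned char *out) {
  __shared__ Tile tile;
  const uint4 *v4in = reinterpret_cast<const uint4 *>(in);

  // One warp moves the whole tile: 32 threads x 16 bytes = 512 bytes.
  if (threadIdx.x < 32)
    tile.vec[threadIdx.x] = v4in[blockIdx.x * 32 + threadIdx.x];
  __syncthreads();

  // Byte-wise use of the tile; here it is just written back for checking.
  for (int i = threadIdx.x; i < 512; i += blockDim.x)
    out[blockIdx.x * 512 + i] = tile.bytes[i];
}

// Hypothetical host helper: number of mismatching bytes, or -1 on a CUDA error.
int count_v4_mismatches(int threads_per_block, int blocks) {
  size_t n = (size_t)blocks * 512;
  std::vector<unsigned char> h_in(n), h_out(n, 0);
  for (size_t i = 0; i < n; ++i) h_in[i] = (unsigned char)(i * 7 + 3);

  unsigned char *d_in = nullptr, *d_out = nullptr;
  if (cudaMalloc(&d_in, n) != cudaSuccess) return -1;
  if (cudaMalloc(&d_out, n) != cudaSuccess) { cudaFree(d_in); return -1; }

  cudaMemcpy(d_in, h_in.data(), n, cudaMemcpyHostToDevice);
  copy_tile_v4<<<blocks, threads_per_block>>>(d_in, d_out);
  cudaError_t err = cudaMemcpy(h_out.data(), d_out, n, cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  if (err != cudaSuccess) return -1;

  int bad = 0;
  for (size_t i = 0; i < n; ++i) bad += (h_in[i] != h_out[i]);
  return bad;
}
```

Calling count_v4_mismatches(64, 4) from main and checking for 0 is enough to confirm the copy; timing it against the uchar4 version is the only way to know which is faster on a given card.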

Thank you! One question - you said each thread loads 4 bytes but because of

lidx += blockDim.x;

if I have 64 threads per block, each thread will seem to load 8 bytes in two calls, is that how you meant it or did you assume that I should have at least 128 threads per block?

I believe it should work OK and give good performance. You may find a faster approach.
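To make the 64-thread case concrete: the tile is 128 uchar4 elements, so with blockDim.x = 64 each thread makes two trips through the loop (elements lidx and lidx + 64), which is 8 bytes per thread. With 128 or more threads each thread makes at most one trip. Either way, on every trip the threads of a warp touch consecutive elements, so the access pattern stays coalesced and conflict-free. Below is a minimal sketch that can be run with any 1D block size to check this; `copy_tile` and `count_copy_mismatches` are hypothetical names, not part of the kernel above.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define SSIZE 512

// Each block stages its own 512-byte tile in shared memory, then writes it
// back out so the host can verify the copy. Works for any 1D block size.
__global__ void copy_tile(const char *in, char *out) {
  __shared__ __align__(4) char tile[SSIZE];
  uchar4 *u4tile = reinterpret_cast<uchar4 *>(tile);
  const uchar4 *u4in = reinterpret_cast<const uchar4 *>(in);
  uchar4 *u4out = reinterpret_cast<uchar4 *>(out);
  int base = blockIdx.x * (SSIZE / 4);

  // Block-stride loop: with 64 threads, thread t copies elements t and t + 64.
  for (int i = threadIdx.x; i < SSIZE / 4; i += blockDim.x)
    u4tile[i] = u4in[base + i];
  __syncthreads();

  for (int i = threadIdx.x; i < SSIZE / 4; i += blockDim.x)
    u4out[base + i] = u4tile[i];
}

// Hypothetical host helper: number of mismatching bytes, or -1 on a CUDA error.
int count_copy_mismatches(int threads_per_block, int blocks) {
  size_t n = (size_t)blocks * SSIZE;
  std::vector<char> h_in(n), h_out(n, 0);
  for (size_t i = 0; i < n; ++i) h_in[i] = (char)(i * 7 + 3);

  char *d_in = nullptr, *d_out = nullptr;
  if (cudaMalloc(&d_in, n) != cudaSuccess) return -1;
  if (cudaMalloc(&d_out, n) != cudaSuccess) { cudaFree(d_in); return -1; }

  cudaMemcpy(d_in, h_in.data(), n, cudaMemcpyHostToDevice);
  copy_tile<<<blocks, threads_per_block>>>(d_in, d_out);
  cudaError_t err = cudaMemcpy(h_out.data(), d_out, n, cudaMemcpyDeviceToHost);
  cudaFree(d_in);
  cudaFree(d_out);
  if (err != cudaSuccess) return -1;

  int bad = 0;
  for (size_t i = 0; i < n; ++i) bad += (h_in[i] != h_out[i]);
  return bad;
}
```

Calling count_copy_mismatches with 32, 64, 128 or 256 threads per block should give 0 in every case; only the number of loop trips per thread changes.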

ok, thank you