memcpy equivalent for global memory to shared memo

Hello,

I have a large (almost 16kB) block of data that needs to be accessed by all threads of my thread block.

So, I desire to copy that first from global memory into the shared memory, and make sure this happens before any computations occur by any of the threads in the block. All of the threads use all of the data within this shared memory.

I could just do a for-loop over the array and copy the values one-by-one, but I thougt that surely there is a faster way to do it. This is what I first tried:

__global__ void calc_it( int num_time_slots, float2 *gdata, float2 *rdata )

{

    bool coeffs_loaded = false;

    __shared__ float2 r_coeffs[ SIZE1 ][ SIZE2 ];

    if( ( coeffs_loaded == false ) && ( threadIdx.x == 0 ) ) {

        cudaMemcpy( &r_coeffs[0][0], rdata, SIZE1*SIZE2*sizeof(float2), cudaMemcpyDeviceToDevice );

        coeffs_loaded = true;

    }

    __syncthreads();

   .... then execute code ...

}

but of course this gives an error as cudaMemcpy cannot be executed with a kernel running on a real card.

Isn’t there some memcpy equivalent for copying a block of data between global memory and shared memory?

Thank you –

I hope that I’ve understood your question.

The really short answer is to use your threads to fetch from the global memory and store to the shared memory. Do it as part of the setup for your kernel, don’t do it as a separate routine.

The max shared memory per multiprocessor is 16 KB, so if you stay enough under that you can fit it into the shared memory. HOWEVER, you will only be able to launch one block per multiprocessor, so you will be executing rather slowly.

It’s best to do this application another way. Consider using a texture, so you can have a lot of data that won’t limit your block issue, and should cache reasonably.

No, there is not (it’s really not feasible in this context). If you check the programming guide, cudaMemcpy() function is a host-only function, meaning it can only be called by the host (CPU).

You don’t have to loop over all the elements you wish to copy. Have all the threads in the block participate in the read (make sure it’s coalesced for perf reasons). Then just call __syncthreads() and proceed with the computation part. For example, check the transpose or matrixMultiply samples (the latter is also described in the Programming Guide).

Paulius

Not quite. Fetching the same value from the texture multiple times is likely to be slower than reading it into smem once and getting it from smem multiple times.

Paulius

And you must do the fetching into shared memory “by hand”. In my opnion, this is a feature, not a bug. Only you know what your block configuration is. Only you know which one of an infinite number of memory access patterns you need to use. Read the section on global memory coalescing: it is very important to attaining significant performance speedups. It is also important to make sure that every thread in the block participates in the memory read.

Depending on your access pattern or course: reading a block of global memory into shared memory is as easy as:

for (int i = 0; i < blocks_to_read; i++)  // unroll this loop for maximum performance

    shmem[threadIdx.x + blockDim.x*i] = global_mem[base_index_for_block + threadIdx.x + blockDim.x*i];

__syncthreads();

// rest of kernel using shmem

Thank you for all of your input. Instead of replying one-by-one to these suggestions, I would like to summarize all of them and draw some conclusions.

  • Shared memory is preferred over texture memory for my application for two reasons:
  1. the texture cache is only 8kB (per multiprocessor) whereas shared memory is 16kB per multiprocessor.
  2. cached texture memory accesses may be slower than multiple shared memory accesses, assuming that you’ve organized them to avoid bank conflicts. With every thread in the block accessing every element of the data I’ve loaded, it’s best to just use shared memory.
  • There is no memcpy function that can run on the device to copy blocks of data between shared memory and global memory. Furthermore, even though there is no natural mapping between the number of threads in my block and the dimensions of the 2-D array that I need to load, I should force all threads of the block to participate in loading shared memory in a coalescing fashion.

Best Regards,
Glen Mabey