memcpy equivalent for global memory to shared memo

glenvidia · November 9, 2007, 11:32pm

Hello,

I have a large (almost 16kB) block of data that needs to be accessed by all threads of my thread block.

So, I desire to copy that first from global memory into the shared memory, and make sure this happens before any computations occur by any of the threads in the block. All of the threads use all of the data within this shared memory.

I could just do a for-loop over the array and copy the values one-by-one, but I thougt that surely there is a faster way to do it. This is what I first tried:

__global__ void calc_it( int num_time_slots, float2 *gdata, float2 *rdata )

{

    bool coeffs_loaded = false;

    __shared__ float2 r_coeffs[ SIZE1 ][ SIZE2 ];

    if( ( coeffs_loaded == false ) && ( threadIdx.x == 0 ) ) {

        cudaMemcpy( &r_coeffs[0][0], rdata, SIZE1*SIZE2*sizeof(float2), cudaMemcpyDeviceToDevice );

        coeffs_loaded = true;

    }

    __syncthreads();

   .... then execute code ...

}

but of course this gives an error as cudaMemcpy cannot be executed with a kernel running on a real card.

Isn’t there some memcpy equivalent for copying a block of data between global memory and shared memory?

Thank you –

RussAtkinson · November 10, 2007, 12:40am

I hope that I’ve understood your question.

The really short answer is to use your threads to fetch from the global memory and store to the shared memory. Do it as part of the setup for your kernel, don’t do it as a separate routine.

The max shared memory per multiprocessor is 16 KB, so if you stay enough under that you can fit it into the shared memory. HOWEVER, you will only be able to launch one block per multiprocessor, so you will be executing rather slowly.

It’s best to do this application another way. Consider using a texture, so you can have a lot of data that won’t limit your block issue, and should cache reasonably.

paulius · November 10, 2007, 12:43am

No, there is not (it’s really not feasible in this context). If you check the programming guide, cudaMemcpy() function is a host-only function, meaning it can only be called by the host (CPU).

You don’t have to loop over all the elements you wish to copy. Have all the threads in the block participate in the read (make sure it’s coalesced for perf reasons). Then just call __syncthreads() and proceed with the computation part. For example, check the transpose or matrixMultiply samples (the latter is also described in the Programming Guide).

Paulius

paulius · November 10, 2007, 12:46am

Not quite. Fetching the same value from the texture multiple times is likely to be slower than reading it into smem once and getting it from smem multiple times.

Paulius

MisterAnderson42 · November 10, 2007, 1:50pm

And you must do the fetching into shared memory “by hand”. In my opnion, this is a feature, not a bug. Only you know what your block configuration is. Only you know which one of an infinite number of memory access patterns you need to use. Read the section on global memory coalescing: it is very important to attaining significant performance speedups. It is also important to make sure that every thread in the block participates in the memory read.

Depending on your access pattern or course: reading a block of global memory into shared memory is as easy as:

for (int i = 0; i < blocks_to_read; i++)  // unroll this loop for maximum performance

    shmem[threadIdx.x + blockDim.x*i] = global_mem[base_index_for_block + threadIdx.x + blockDim.x*i];

__syncthreads();

// rest of kernel using shmem

glenvidia · November 12, 2007, 2:57pm

Thank you for all of your input. Instead of replying one-by-one to these suggestions, I would like to summarize all of them and draw some conclusions.

Shared memory is preferred over texture memory for my application for two reasons:

the texture cache is only 8kB (per multiprocessor) whereas shared memory is 16kB per multiprocessor.
cached texture memory accesses may be slower than multiple shared memory accesses, assuming that you’ve organized them to avoid bank conflicts. With every thread in the block accessing every element of the data I’ve loaded, it’s best to just use shared memory.

There is no memcpy function that can run on the device to copy blocks of data between shared memory and global memory. Furthermore, even though there is no natural mapping between the number of threads in my block and the dimensions of the 2-D array that I need to load, I should force all threads of the block to participate in loading shared memory in a coalescing fashion.

Best Regards,
Glen Mabey

Topic		Replies	Views
cudaMemcpy() behavior question CUDA Programming and Performance	4	6655	August 8, 2007
Shared Memory question CUDA Programming and Performance	5	2885	November 25, 2016
copying to shared block mem CUDA Programming and Performance	11	4155	April 6, 2008
global-global memory transfers CUDA Programming and Performance	9	1805	December 1, 2014
Memory management issues Global and Shared memory management CUDA Programming and Performance	12	3868	March 2, 2009
Optimizing App Memory Bandwidth Requirements Optimizing App Memory Bandwidth Requirem CUDA Programming and Performance	7	7599	May 7, 2008
Copy from texture memory to shared memory Confused about best transfer strategy CUDA Programming and Performance	4	1547	February 11, 2010
Shared Mem caching strategy Comparison of benchmark results CUDA Programming and Performance	9	4187	May 11, 2008
Memory types and CUDA access CUDA Programming and Performance	5	59219	February 3, 2009
Some CUDA/GPU implementation related questions CUDA Programming and Performance	6	2259	May 30, 2009

memcpy equivalent for global memory to shared memo

Related topics