cudaMemcpy() behavior question

Simple question: what happens when you use cudaMemcpy() to transfer between device main memory and shared memory? For example, if I want to transfer 16 bytes from device main memory to shared memory using cudaMemcpy, does that get translated into a single 16-byte load from main memory followed by 4 4-byte stores to shared memory (since the maximum granularity of device memory operations seems to be larger than that of the shared memory banks)?

(for the sake of the question, assume the data is 16-byte aligned)


You cannot use cudaMemcpy to move data between global and shared memory (your kernels must do that explicitly).


Bother. Just figured that out.

New question: is there not a memcpy() equivalent that can be used in the device context? (obviously, if I try to use memcpy, I get a “you can’t use that in device context” error…)


Nope, there is no memcpy equivalent that can be used on the device. You need to roll your own. This is a Good Thing ™. 1) Every application needs to load memory in different ways, so explicitly controlling all memory copies gives you the control you need to tune memory accesses for your application. 2) If you don’t think you need that kind of control, I suggest you read the performance guild lines in the programming guide. There are a lot of rules to follow to get good memory access performance.

Here’s a simple option:



  • Copy the first N values from src to dest

  • Uses all threads in a block. Works for shared memory too.



device void memcpy_device(T * dest, const T * src, unsigned int num_values){

unsigned int threads_per_block = blockDim.x*blockDim.y*blockDim.z;	

unsigned int num_iters = num_values / threads_per_block;

//copy the bulk of the data

//each thread transfers one T at a time

unsigned int index = threadIdx.x + blockDim.x*threadIdx.y + blockDim.x*blockDim.y*threadIdx.z;	 //threadId	

for(unsigned int i = 0; i < num_iters; i++){

	dest[index] = src[index];		

	index += threads_per_block;


//copy the tail of the data, some threads idle

if(index < num_values){		

	dest[index] = src[index];


// Synchronize to make sure data is copied