Simple question: what happens when you use cudaMemcpy() to transfer between device main memory and shared memory? For example, if I want to transfer 16 bytes from device main memory to shared memory using cudaMemcpy, does that get translated into a single 16-byte load from main memory followed by 4 4-byte stores to shared memory (since the maximum granularity of device memory operations seems to be larger than that of the shared memory banks)?
(for the sake of the question, assume the data is 16-byte aligned)
New question: is there not a memcpy() equivalent that can be used in the device context? (obviously, if I try to use memcpy, I get a “you can’t use that in device context” error…)
Nope, there is no memcpy equivalent that can be used on the device. You need to roll your own. This is a Good Thing ™. 1) Every application needs to load memory in different ways, so explicitly controlling all memory copies gives you the control you need to tune memory accesses for your application. 2) If you don’t think you need that kind of control, I suggest you read the performance guidelines in the programming guide. There are a lot of rules to follow to get good memory access performance.
/* Uses all threads in a block. Works for shared memory too. */
template <typename T>
__device__ void memcpy_device(T *dest, const T *src, unsigned int num_values)
{
    unsigned int threads_per_block = blockDim.x * blockDim.y * blockDim.z;
    unsigned int num_iters = num_values / threads_per_block;
    // Linear thread index within the block
    unsigned int index = threadIdx.x + blockDim.x * threadIdx.y
                       + blockDim.x * blockDim.y * threadIdx.z;
    // Copy the bulk of the data: each thread transfers one T per iteration
    for (unsigned int i = 0; i < num_iters; i++) {
        dest[index] = src[index];
        index += threads_per_block;
    }
    // Copy the tail of the data; some threads idle
    if (index < num_values) {
        dest[index] = src[index];
    }
    // Synchronize to make sure all threads see the copied data
    __syncthreads();
}
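To show how such a helper might be used in practice, here is a sketch of a kernel that stages a tile of global memory into shared memory before operating on it. The kernel name, tile size, and scaling operation are illustrative assumptions, not from the original post; it assumes memcpy_device is defined as above and the block is launched with blockDim.x == 256.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical example): copy a 256-float tile from
// global memory into shared memory with the cooperative helper, then scale
// it. With blockDim.x == 256 every thread copies exactly one value, so
// there is no tail iteration.
__global__ void scale_tiles(float *data, unsigned int n, float factor)
{
    __shared__ float tile[256];
    unsigned int base = blockIdx.x * 256;
    if (base + 256 <= n) {
        // All threads in the block cooperate on the copy; the
        // __syncthreads() inside memcpy_device makes the whole tile
        // visible to every thread before it is read below.
        memcpy_device(tile, data + base, 256u);
        data[base + threadIdx.x] = tile[threadIdx.x] * factor;
    }
}
```

Launching with, say, scale_tiles<<<n / 256, 256>>>(d_data, n, 2.0f) processes one tile per block; the synchronization inside the helper is what makes it safe to read the shared tile immediately afterward.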