Simple question: what happens when you use cudaMemcpy() to transfer between device main memory and shared memory? For example, if I want to transfer 16 bytes from device main memory to shared memory using cudaMemcpy, does that get translated into a single 16-byte load from main memory followed by 4 4-byte stores to shared memory (since the maximum granularity of device memory operations seems to be larger than that of the shared memory banks)?
(for the sake of the question, assume the data is 16-byte aligned)
New question: is there not a memcpy() equivalent that can be used in the device context? (obviously, if I try to use memcpy, I get a “you can’t use that in device context” error…)
Nope, there is no memcpy equivalent that can be used on the device. You need to roll your own. This is a Good Thing ™. 1) Every application needs to load memory in different ways, so explicitly controlling all memory copies gives you the control you need to tune memory accesses for your application. 2) If you don’t think you need that kind of control, I suggest you read the performance guidelines in the programming guide. There are a lot of rules to follow to get good memory access performance.
/* Uses all threads in a block. Works for shared memory too. */
template <typename T>
__device__ void memcpy_device(T *dest, const T *src, unsigned int num_values)
{
    unsigned int threads_per_block = blockDim.x * blockDim.y * blockDim.z;
    unsigned int num_iters = num_values / threads_per_block;
    // Linear thread index within the block
    unsigned int index = threadIdx.x + blockDim.x * threadIdx.y
                       + blockDim.x * blockDim.y * threadIdx.z;
    // Copy the bulk of the data: each thread transfers one T per iteration
    for (unsigned int i = 0; i < num_iters; i++) {
        dest[index] = src[index];
        index += threads_per_block;
    }
    // Copy the tail of the data; some threads idle
    if (index < num_values) {
        dest[index] = src[index];
    }
    // Synchronize to make sure all threads see the copied data
    __syncthreads();
}
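To show how such a helper might be used in practice, here is a sketch of a kernel that stages a tile of global memory into shared memory before operating on it. The kernel name, tile size, and scaling operation are illustrative assumptions, not from the original post; it assumes memcpy_device is defined as above and the block is launched with blockDim.x == 256.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel (hypothetical example): copy a 256-float tile from
// global memory into shared memory with the cooperative helper, then scale
// it. With blockDim.x == 256 every thread copies exactly one value, so
// there is no tail iteration.
__global__ void scale_tiles(float *data, unsigned int n, float factor)
{
    __shared__ float tile[256];
    unsigned int base = blockIdx.x * 256;
    if (base + 256 <= n) {
        // All threads in the block cooperate on the copy; the
        // __syncthreads() inside memcpy_device makes the whole tile
        // visible to every thread before it is read below.
        memcpy_device(tile, data + base, 256u);
        data[base + threadIdx.x] = tile[threadIdx.x] * factor;
    }
}
```

Launching with, say, scale_tiles<<<n / 256, 256>>>(d_data, n, 2.0f) processes one tile per block; the synchronization inside the helper is what makes it safe to read the shared tile immediately afterward.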