Slow memory transfers


I'd like to know whether the problem I'm seeing is caused by an incorrect hardware configuration or by a conceptual error on my part.
The following code allocates a single int on the graphics card and then memcpys that value into a local variable one million times.

On my computer this takes more than a minute!

Is that normal? I have to call a kernel more than a million times, and just copying the kernel's result back takes longer than executing the whole algorithm on the CPU.

The graphics card is a Quadro 4000 and the CPU an Intel Q6600 (Dell XPS 420), running Win7 64-bit with driver 270.61. Same results with CUDA 3.2 and CUDA 4.0.

#include <cuda.h>
#include <stdio.h>
#include <Windows.h>

int main(void)
{
    int* gpu;
    int cpu;
    SYSTEMTIME st, st2;

    cudaMalloc((void**)&gpu, sizeof(int));

    GetLocalTime(&st);
    printf("The system time is: %02d:%02d:%02d\n", st.wHour, st.wMinute, st.wSecond);

    for (int i = 0; i < 1000000; i++)
        cudaMemcpy(&cpu, gpu, sizeof(int), cudaMemcpyDeviceToHost);

    GetLocalTime(&st2);
    printf("The system time is: %02d:%02d:%02d\n", st2.wHour, st2.wMinute, st2.wSecond);

    return 0;
}

The driver has its own overhead for each memcpy, especially on Win7/Vista. Is the second kernel launch dependent on the first? If not, it would be better for you to launch the kernel only a few times.

The code has two forms of overhead: 1) the driver overhead, which on Windows Vista/Windows 7 involves doing a kernel thunk (thousands of clock cycles) to submit work to the hardware, and 2) the CPU/GPU synchronization, since cudaMemcpy() does not return until the memcpy has been completed.

If the data must be sourced from pageable system memory, there’s no getting around the overhead. If that’s the case, the main way to avoid overhead would be to copy more data per memcpy call.
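To illustrate the "copy more data per memcpy call" advice, here is a minimal sketch (variable names are my own, not from the original post): have the kernel(s) write their results into a device array, then bring everything back in one transfer instead of a million tiny ones.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void)
{
    const int N = 1000000;
    int *d_results;
    int *h_results = (int*)malloc(N * sizeof(int));

    cudaMalloc((void**)&d_results, N * sizeof(int));
    // ... kernel launches fill d_results here ...

    // One ~4 MB transfer amortizes the per-call driver overhead that a
    // million 4-byte cudaMemcpy calls would each pay separately.
    cudaMemcpy(h_results, d_results, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_results);
    free(h_results);
    return 0;
}
```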

If you can use pinned memory, the way to submit a series of small memcpy’s in quick succession would be to use cudaMemcpyAsync(). That way, the work does not get submitted on every call - it will be queued in user mode by the driver and periodically submitted when the user mode buffer is full. There is also a benefit on non-WDDM drivers, even though they do not have to do a kernel thunk to submit work to the GPU: they don’t have to do CPU/GPU synchronization every time. cudaMemcpy()'s synchronous semantics are guaranteed to introduce bubbles into the pipeline of work being processed by the GPU.
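A sketch of that pattern, assuming pinned host memory and a single stream: the cudaMemcpyAsync() calls return immediately and get queued by the driver, and the CPU/GPU synchronization cost is paid once at the end rather than on every copy.

```cuda
#include <cuda_runtime.h>

int main(void)
{
    int *h_pinned;   // page-locked host memory, required for async copies
    int *d_value;
    cudaStream_t stream;

    cudaMallocHost((void**)&h_pinned, sizeof(int));
    cudaMalloc((void**)&d_value, sizeof(int));
    cudaStreamCreate(&stream);

    for (int i = 0; i < 1000000; i++)
        // Returns immediately; the driver queues the work in user mode.
        cudaMemcpyAsync(h_pinned, d_value, sizeof(int),
                        cudaMemcpyDeviceToHost, stream);

    // Synchronize once, instead of once per copy as cudaMemcpy() does.
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_value);
    cudaFreeHost(h_pinned);
    return 0;
}
```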

If you are not trying to submit a series of memcpy’s, but wanting to stage data into device memory for kernel processing, then mapped pinned memory is a good bet - the kernel can read the data directly.
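A minimal sketch of mapped (zero-copy) pinned memory, with a made-up example kernel: the host allocation is mapped into the device address space, so the kernel reads and writes it directly over PCIe with no explicit staging copy.

```cuda
#include <cuda_runtime.h>

__global__ void scale(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2;   // accesses mapped host memory directly
}

int main(void)
{
    const int N = 1024;
    int *h_data, *d_alias;

    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede context creation
    cudaHostAlloc((void**)&h_data, N * sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void**)&d_alias, h_data, 0);

    scale<<<(N + 255) / 256, 256>>>(d_alias, N);
    cudaDeviceSynchronize();                 // h_data now holds the results

    cudaFreeHost(h_data);
    return 0;
}
```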

Actually, that memcpy in his code is async. Somewhere the programming guide says that memcpys smaller than a certain size (I seem to remember 16 KB) are async by default.

That only applies to host->device copies. Device->host copies are never promoted to async, because that would violate the API. And AFAIR the limit is 64 KB.

Thanks for the correction.
But I see a bidirectional arrow in Programming Guide 4.0, page 30, regarding memory copies between host and device. Are you certain that it only works for host-to-device copies?

Yes. I really thought Nvidia had corrected it by now.

Thank you for your replies.

What confused me was that all the books/introductory samples have a "fill vector" or "add vector" kernel, so I assumed it was performance-valid to create a small kernel for small operations and call it as many times as you want.

I already tried pinning before this post, and I was not happy with the performance.

After reading your comments, and after reading this post, I realized that it's better to create one "work-intensive" kernel and call it only once. So I modified my kernel accordingly, and now I'm really happy: without any optimization, it's about 60 times faster than the single-threaded CPU version of the algorithm.