I'd like to know whether the problem I'm having right now is caused by an incorrect hardware configuration or by a conceptual error.
The following code allocates a single byte of memory on the graphics card and then memcpys that byte to a local variable one million times.
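(For reference, a minimal sketch of what that benchmark looks like, reconstructed from the description above; the variable names are mine:)

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        unsigned char *d_byte;
        cudaMalloc((void **)&d_byte, 1);            // one byte of device memory

        unsigned char h_byte;
        for (int i = 0; i < 1000000; ++i)
            cudaMemcpy(&h_byte, d_byte, 1,          // one million tiny
                       cudaMemcpyDeviceToHost);     // device-to-host copies

        printf("last byte: %u\n", h_byte);
        cudaFree(d_byte);
        return 0;
    }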
On my computer it takes more than a minute!
Is that normal? I have to call a kernel more than one million times, and just copying the kernel's result back takes longer than running the whole algorithm on the CPU.
The graphics card is a Quadro 4000 and the computer an Intel Q6600 (Dell XPS 420), Windows 7 64-bit, driver 270.61. I get the same results with CUDA 3.2 and CUDA 4.0.
The driver has its own overhead for memcpy, especially on Windows 7/Vista. Is the second kernel launch dependent on the first? If not, it would be better to launch the kernel only a few times, doing more work per launch.
The code incurs two forms of overhead: 1) driver overhead, which on Windows Vista/Windows 7 involves a kernel thunk (thousands of clock cycles) to submit work to the hardware, and 2) CPU/GPU synchronization, since cudaMemcpy() does not return until the memcpy has completed.
If the data must be sourced from pageable system memory, there's no getting around the overhead. In that case, the main way to reduce it is to copy more data per memcpy call.
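For example (a hypothetical sketch, not the poster's actual code), accumulating the results in a device buffer and fetching them with one large copy amortizes the per-call cost:

    #include <cuda_runtime.h>
    #include <cstdlib>

    int main()
    {
        const int N = 1000000;
        unsigned char *d_buf;
        cudaMalloc((void **)&d_buf, N);             // results staged on the device

        unsigned char *h_buf = (unsigned char *)malloc(N);
        cudaMemcpy(h_buf, d_buf, N,                 // one large copy instead of
                   cudaMemcpyDeviceToHost);         // a million one-byte copies

        free(h_buf);
        cudaFree(d_buf);
        return 0;
    }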
If you can use pinned memory, the way to submit a series of small memcpys in quick succession is cudaMemcpyAsync(). That way the work does not get submitted on every call; it is queued in user mode by the driver and periodically submitted when the user-mode buffer is full. There is also a benefit on non-WDDM drivers, even though they do not have to do a kernel thunk to submit work to the GPU: they don't have to do a CPU/GPU synchronization on every call. cudaMemcpy()'s synchronous semantics are guaranteed to introduce bubbles into the pipeline of work being processed by the GPU.
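A minimal sketch of that pattern, assuming the data already lives in a device buffer (names are placeholders):

    #include <cuda_runtime.h>

    int main()
    {
        const int N = 1000000;
        unsigned char *d_buf, *h_buf;
        cudaMalloc((void **)&d_buf, N);
        cudaMallocHost((void **)&h_buf, N);         // pinned host memory

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // queue many small async copies; the driver batches the submissions
        for (int i = 0; i < N; ++i)
            cudaMemcpyAsync(h_buf + i, d_buf + i, 1,
                            cudaMemcpyDeviceToHost, stream);

        cudaStreamSynchronize(stream);              // wait once, at the end

        cudaStreamDestroy(stream);
        cudaFreeHost(h_buf);
        cudaFree(d_buf);
        return 0;
    }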
If you are not trying to submit a series of memcpys, but rather to stage data into device memory for kernel processing, then mapped pinned memory is a good bet: the kernel can read the data directly.
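A sketch of the mapped-pinned-memory approach (the kernel and sizes here are hypothetical):

    #include <cuda_runtime.h>

    __global__ void process(const unsigned char *in, unsigned char *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] + 1;    // kernel reads host memory directly over PCIe
    }

    int main()
    {
        const int N = 1000000;
        cudaSetDeviceFlags(cudaDeviceMapHost);      // must precede context creation

        unsigned char *h_in, *d_in, *d_out;
        cudaHostAlloc((void **)&h_in, N, cudaHostAllocMapped);
        cudaHostGetDevicePointer((void **)&d_in, h_in, 0);
        cudaMalloc((void **)&d_out, N);

        process<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
        cudaDeviceSynchronize();

        cudaFree(d_out);
        cudaFreeHost(h_in);
        return 0;
    }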
@nwilt
Actually, the memcpy in his code is async. Somewhere in the programming guide it says that memcpys smaller than a certain size (I seem to remember 16 KB) are asynchronous by default.
That only applies to host-to-device copies. Device-to-host copies are never promoted to async because that would violate the API's semantics. And AFAIR the limit is 64 KB.
@tera
Thanks for the correction.
But I see a bidirectional arrow in the Programming Guide 4.0, page 30, regarding memory copies between host and device. Are you certain the implicit async behavior only applies to host-to-device copies?
What confused me was that all the books and introductory samples have a "fill vector" kernel or an "add vectors" kernel, so I assumed it was performance-valid to write a small kernel for a small operation and call it as many times as you want.
I had already tried pinning before this post, and I was not happy with the performance.
After reading your comments and this post, I realized that it's better to create a "work-intensive" kernel and call it only once. I modified my kernel accordingly and now I'm really happy: without any optimization, it's about 60 times faster than the single-threaded CPU version of the algorithm.
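For anyone finding this later, the restructuring amounts to something like this (a hypothetical vector add, not my actual kernel): one launch that covers all elements, instead of one launch per element.

    __global__ void addVectors(const float *a, const float *b, float *c, int n)
    {
        // grid-stride loop: each thread processes several elements,
        // so a single launch covers the whole problem
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += blockDim.x * gridDim.x)
            c[i] = a[i] + b[i];
    }

    // launched once for all n elements:
    //     addVectors<<<256, 256>>>(d_a, d_b, d_c, n);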