cudaMemcpy takes too long


I have a CUDA program and it seems to take too long to run.

My program:

I have rfData_h as an input

cudaMalloc((void **) &rfData_d, size_nmpts_nmlne);

cudaMemcpy(rfData_d, rfData_h, size_nmpts_nmlne, cudaMemcpyHostToDevice);

My kernel invocation

cudaMemcpy(rfData_h, rfDataMODE_d, size_nmpts_nmlne, cudaMemcpyDeviceToHost);
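Putting the fragments together, the full host-side sequence looks roughly like this (the kernel name `myKernel` and the `grid`/`block` launch configuration are assumptions; they are not given in the post):

```cpp
// Sketch of the posted sequence; myKernel, grid, and block are placeholders.
float *rfData_d, *rfDataMODE_d;
cudaMalloc((void **) &rfData_d, size_nmpts_nmlne);
cudaMalloc((void **) &rfDataMODE_d, size_nmpts_nmlne);

cudaMemcpy(rfData_d, rfData_h, size_nmpts_nmlne, cudaMemcpyHostToDevice);
myKernel<<<grid, block>>>(rfData_d, rfDataMODE_d);   // launch is asynchronous
cudaMemcpy(rfData_h, rfDataMODE_d, size_nmpts_nmlne, cudaMemcpyDeviceToHost);
```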

My kernel is quick, so no problem there (<300 µs), but the memory transfers take too long (>100 ms).
Can I use shared memory instead of local memory to make this faster?

I need to have the copy -> kernel -> retrieve <1ms.

Shared memory can't be used outside your kernel function.

How do you know your kernel takes less than 300 µs?

What you probably see is that the kernel launch is asynchronous, and therefore you only measure the launch overhead of the kernel, not its actual run time.

The memcpy then implicitly calls cudaThreadSynchronize, and therefore you think your copy takes so much time.

Make sure you time the code correctly by putting a sync after the kernel and then measuring the time; also check for errors returned from the kernel.
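A minimal timing sketch of that advice, using CUDA events (the kernel name and launch configuration are placeholders; on newer toolkits cudaDeviceSynchronize replaces cudaThreadSynchronize):

```cpp
// Time the kernel with events so the async launch doesn't skew the result.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(rfData_d, rfDataMODE_d);   // placeholder launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);          // wait until the kernel has really finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

// Check for launch or runtime errors from the kernel.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));
```

With the sync in place, the event interval covers the kernel's full execution, so the subsequent cudaMemcpy no longer absorbs the kernel's run time.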

You can probably find more info about this in the programming guide or the best practices manual recently released by NVIDIA.


OK, thanks, now I see what's going on!