cudaMemcpy takes too long

Hi,

I have a CUDA program and it seems to take too long to run.

My program:

I have rfData_h as an input

cudaMalloc((void **) &rfData_d, size_nmpts_nmlne);

//Copy
cudaMemcpy(rfData_d, rfData_h, size_nmpts_nmlne, cudaMemcpyHostToDevice);

My kernel invocation

//Retrieve
cudaMemcpy(rfData_h, rfDataMODE_d, size_nmpts_nmlne, cudaMemcpyDeviceToHost);

My kernel is quick, so no problem there (<300 µs), but the memory transfers take too long (>100 ms).
Can I use shared memory instead of local memory so that it will be faster?

I need the copy → kernel → retrieve sequence to take <1 ms.
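
To be concrete, the whole flow looks roughly like this (the kernel processRF, the element count, and the second device buffer rfDataMODE_d are placeholders standing in for the real code):

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void processRF(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];   // placeholder for the real processing
}

int main(void)
{
    const int n = 1 << 20;                               // assumed element count
    const size_t size_nmpts_nmlne = n * sizeof(float);

    float *rfData_h = (float *)calloc(n, sizeof(float)); // host input buffer
    float *rfData_d = NULL, *rfDataMODE_d = NULL;
    cudaMalloc((void **)&rfData_d, size_nmpts_nmlne);
    cudaMalloc((void **)&rfDataMODE_d, size_nmpts_nmlne);

    // Copy host data to device
    cudaMemcpy(rfData_d, rfData_h, size_nmpts_nmlne, cudaMemcpyHostToDevice);

    // Kernel invocation
    processRF<<<(n + 255) / 256, 256>>>(rfData_d, rfDataMODE_d, n);

    // Retrieve results from device
    cudaMemcpy(rfData_h, rfDataMODE_d, size_nmpts_nmlne, cudaMemcpyDeviceToHost);

    cudaFree(rfData_d);
    cudaFree(rfDataMODE_d);
    free(rfData_h);
    return 0;
}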

Shared memory can't be used outside your kernel function.
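
It is only visible inside a kernel, per thread block, so it cannot be used to stage host-to-device transfers. Roughly (illustrative sketch only):

__global__ void useShared(const float *in, float *out, int n)
{
    __shared__ float tile[256];        // visible only within this thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];     // stage a value in shared memory
        out[i] = tile[threadIdx.x];    // use it, then write back to global memory
    }
}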

How do you know your kernel takes less than 300 µs?

What you probably see is that the kernel launch is asynchronous, so you are only measuring the launch overhead, not the kernel's execution time.

The memcpy then implicitly synchronizes (like a cudaThreadSynchronize), so it waits for the kernel to finish first, and that is why you think the copy takes so much time.

Make sure you time the code correctly by putting a sync after the kernel and then measuring the time; also check for errors returned from the kernel.
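
For example, something like this (a sketch; myKernel, grid, block, and the buffers stand in for your own launch):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(rfData_d, rfDataMODE_d, n);   // your kernel here
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                             // wait for the kernel to finish

cudaError_t err = cudaGetLastError();                   // errors from the launch
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);                 // true kernel time in ms
printf("kernel: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);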

You can probably find more info about this in the Programming Guide or the Best Practices Guide that NVIDIA released recently.

eyal

OK, thanks, now I see what's going on!