MemCpyAsync with DevToDev Flag

Dear All!
I want to know whether it is reasonable to use “cudaMemcpyAsync” with the “cudaMemcpyDeviceToDevice” flag while a kernel is running. Does it improve performance?

Device-to-device copies are always asynchronous, so it doesn’t make a difference, as far as I know.

You mean there is no difference between these two alternatives?

(IQ_Kernel depends on the results of the copy.)

cudaMemcpy( &d_I[nextIndex][H1] , d_I[curIndex] , sizeof(DATA_TYPE) * H1 * (W1-1) , cudaMemcpyDeviceToDevice);


cudaMemcpyAsync(&d_I[nextIndex][H1], d_I[curIndex], sizeof(DATA_TYPE) * H1 * (W1-1), cudaMemcpyDeviceToDevice, streamDevDev);


I also cannot understand “copies are always asynchronous”. What if I need the copy results in IQ_Kernel?

If you need the results of the copy in IQ_Kernel, then the kernel should be launched in the same stream as the memcpy, shouldn’t it?
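A minimal sketch of that ordering, assuming a recent CUDA toolkit rather than v1.1: the device-to-device copy and the dependent kernel are issued into the same stream, so the kernel does not start until the copy has finished. The kernel addOne, the helper copyThenKernel, and the flat float buffers are hypothetical stand-ins for IQ_Kernel and the d_I arrays.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for IQ_Kernel: it reads the data produced by the copy.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Copies d_src to d_dst and runs the kernel on d_dst, both in one stream.
// Returns true if every element ends up as 1.0f (copied) + 1.0f (kernel).
bool copyThenKernel(int n)
{
    std::vector<float> host(n, 1.0f);
    size_t bytes = n * sizeof(float);
    float *d_src = nullptr, *d_dst = nullptr;
    cudaStream_t stream = nullptr;

    bool ok = cudaMalloc(&d_src, bytes) == cudaSuccess
           && cudaMalloc(&d_dst, bytes) == cudaSuccess
           && cudaStreamCreate(&stream) == cudaSuccess
           && cudaMemcpy(d_src, host.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess
           // Work in one stream runs in issue order: copy first, kernel second.
           && cudaMemcpyAsync(d_dst, d_src, bytes,
                              cudaMemcpyDeviceToDevice, stream) == cudaSuccess;

    if (ok) {
        addOne<<<(n + 255) / 256, 256, 0, stream>>>(d_dst, n);
        ok = cudaGetLastError() == cudaSuccess
          // Wait for the stream before reading the result back on the host.
          && cudaStreamSynchronize(stream) == cudaSuccess
          && cudaMemcpy(host.data(), d_dst, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    for (int i = 0; ok && i < n; ++i)
        ok = (host[i] == 2.0f);

    if (stream) cudaStreamDestroy(stream);
    cudaFree(d_src);
    cudaFree(d_dst);
    return ok;
}
```

Calling copyThenKernel(1024) from main should return true on a working device. No explicit synchronization is needed between the copy and the kernel, only before the host reads the result.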

Thanks. Your response raises another question for me:

Is it possible to run a kernel synchronously in CUDA v1.1?

Also, I did not get an answer to my first question:

Does it improve performance to execute cudaMemcpyAsync with the cudaMemcpyDeviceToDevice flag while a kernel is running?

You can make a kernel run synchronously in effect by calling cudaThreadSynchronize (or the stream variant, cudaStreamSynchronize) right after launching the kernel. You need to do this when checking for error codes, but it is not useful in release mode: operations issued to the same stream execute in order, so the GPU automatically waits for earlier work to finish before starting the next operation.
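The error-checking pattern described above might look like the following sketch. It uses cudaDeviceSynchronize, which replaced the now-deprecated cudaThreadSynchronize in later CUDA releases; the kernel scale and the helper launchAndWait are hypothetical names used only for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel used only to have something to launch.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Launches the kernel and blocks the host until it has finished.
// Returns the first error seen.
cudaError_t launchAndWait(float *d_data, int n, float factor)
{
    scale<<<(n + 255) / 256, 256>>>(d_data, n, factor);

    // Errors detected at launch time (e.g. an invalid configuration).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Errors that occur while the kernel executes are reported here.
    return cudaDeviceSynchronize();
}
```

In a debug build the return value would be checked after every launch; in a release build the synchronize call is usually dropped so the host can keep queuing work.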

I believe wumpus answered your question in his first post: it shouldn’t matter. Device-to-device copies are always asynchronous, and as far as I know the GPU won’t run a kernel and a device-to-device copy at the same time.