Asynchronous cuMemcpyDtoD?

Is there an asynchronous cuMemcpyDtoD?
I am trying to cascade several kernels, and some of them have feedback paths, so I need to do a cuMemcpyDtoD between them.
My kernels are launched asynchronously, so I need an asynchronous cuMemcpyDtoD, right?

But I noticed there is no asynchronous device-to-device memory copy in the CUDA driver API.
That is so weird…

So I need to write a kernel that does the device memory copy, and execute it asynchronously.

Or is there a better way?
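
For reference, the "copy kernel" idea could look roughly like the sketch below. It is only an illustration, not an official replacement for cuMemcpyDtoD: it uses the runtime API rather than the driver API from the question, and `copy_kernel`, `copy_dtod_async` and `copy_async_demo` are hypothetical names made up for this example. The launch is queued on a stream and returns to the host immediately, which is the asynchronous behaviour being asked for.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride copy: each thread copies elements i, i + stride, i + 2*stride, ...
__global__ void copy_kernel(const float* src, float* dst, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

// Hypothetical helper: queues a device-to-device copy on `stream`.
// The call returns as soon as the kernel launch has been queued.
void copy_dtod_async(const float* src, float* dst, size_t n, cudaStream_t stream)
{
    if (n == 0) return;
    const int threads = 256;
    size_t needed = (n + threads - 1) / threads;
    int blocks = (int)(needed < 1024 ? needed : 1024);  // cap the grid size (arbitrary choice)
    copy_kernel<<<blocks, threads, 0, stream>>>(src, dst, n);
}

// Small self-check: upload data, copy it on the device, read it back, compare.
bool copy_async_demo()
{
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host_in(n), host_out(n, -1.0f);
    for (size_t i = 0; i < n; ++i) host_in[i] = (float)i;

    float* d_src = nullptr;
    float* d_dst = nullptr;
    cudaStream_t stream = nullptr;
    bool ok = cudaMalloc(&d_src, bytes) == cudaSuccess
           && cudaMalloc(&d_dst, bytes) == cudaSuccess
           && cudaStreamCreate(&stream) == cudaSuccess;

    if (ok) {
        ok = cudaMemcpy(d_src, host_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
        copy_dtod_async(d_src, d_dst, n, stream);          // returns immediately
        ok = ok && cudaStreamSynchronize(stream) == cudaSuccess;  // wait only for the check
        ok = ok && cudaMemcpy(host_out.data(), d_dst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        ok = ok && host_in == host_out;
    }

    if (stream) cudaStreamDestroy(stream);
    cudaFree(d_src);  // cudaFree(nullptr) is a harmless no-op
    cudaFree(d_dst);
    return ok;
}
```

In a real pipeline you would drop the `cudaStreamSynchronize` and simply launch the next kernel on the same stream, since work queued on one stream runs in order.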

According to the Guide, it looks like it doesn’t exist.

That said, the “synchronous” memcpy shouldn’t cause a true synchronization (i.e., interfere with CPU-GPU overlap). It should just force the streams to catch up with one another.

As long as you stay on one device, you can just pass the second kernel, as its input, the same pointer that the first kernel used as its output. So I am probably missing what you mean ;)
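
A minimal sketch of that idea, assuming everything lives on one device and both kernels go into the same stream: `stage_one`, `stage_two` and `chain_demo` are made-up names for illustration, and the arithmetic is just a placeholder for the real work. No copy is needed between the two launches because the second kernel reads the buffer the first one wrote, and launches on the same stream execute in order.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder first stage: out = 2 * in
__global__ void stage_one(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Placeholder second stage: out = in + 1
__global__ void stage_two(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// Runs both stages back to back and checks that result[i] == 2 * i + 1.
bool chain_demo()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host_in(n), host_out(n, 0.0f);
    for (int i = 0; i < n; ++i) host_in[i] = (float)i;

    float* d_in = nullptr;
    float* d_mid = nullptr;   // output of stage_one, input of stage_two
    float* d_out = nullptr;
    cudaStream_t stream = nullptr;
    bool ok = cudaMalloc(&d_in, bytes) == cudaSuccess
           && cudaMalloc(&d_mid, bytes) == cudaSuccess
           && cudaMalloc(&d_out, bytes) == cudaSuccess
           && cudaStreamCreate(&stream) == cudaSuccess;

    if (ok) {
        ok = cudaMemcpy(d_in, host_in.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;

        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        // Both launches return immediately; the stream keeps them in order.
        stage_one<<<blocks, threads, 0, stream>>>(d_in, d_mid, n);
        stage_two<<<blocks, threads, 0, stream>>>(d_mid, d_out, n);

        ok = ok && cudaStreamSynchronize(stream) == cudaSuccess;
        ok = ok && cudaMemcpy(host_out.data(), d_out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
        for (int i = 0; ok && i < n; ++i)
            ok = (host_out[i] == 2.0f * (float)i + 1.0f);
    }

    if (stream) cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_mid);
    cudaFree(d_out);
    return ok;
}
```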

I guess he would not bother with asynchronous memory transfers if he were on a single device … it sounds like a pipeline with roughly one GPU per step.


You can use streams with more than 1 GPU?

Perhaps I’m just plain wrong, but I thought there was no support for actual device-to-device communication, and that we have to go through an intermediate buffer in system RAM (GPU -> RAM, then RAM -> GPU). Is there a better solution?


As far as I know, you can just copy from device to device. It is just that, at this time, the copy goes via system RAM, and people have requested a faster version that goes directly from device to device.
But cudaMemcpy(…,…,cudaMemcpyDeviceToDevice) works, as far as I know.

Uh, how are you getting addresses from multiple contexts (and a runtime API that is aware that they come from different contexts)?


It would be extremely nice if the stream API were extended to allow more than one stream, each on its own GPU. And then some kind of nice device-to-device copy function between the devices of two streams ;)

That is what I want to do: force the kernels in a stream to run one by one.

Actually, the asynchronous cuMemcpyDtoD is not a problem. I can do the copy in a kernel.

But I have another big problem: there is no asynchronous cuMemcpyDtoA.

Damn it, the built-in texture filtering is very robust, and I really need it in my implementation.

Why can’t I use it with asynchronous execution?

Does anyone have a good idea?