Asynchronous cuMemcpyDtoD?

Is there an asynchronous cuMemcpyDtoD?
I am trying to cascade several kernels, and some of them have feedback paths, so I need to do a cuMemcpyDtoD between them.
My kernel launches are asynchronous, so I need an asynchronous cuMemcpyDtoD as well, right?

But I noticed there is no asynchronous device-to-device memory copy in the CUDA driver API.
That is so weird…

So it seems I need to write a kernel that does the device-to-device memory copy, and execute it asynchronously.

Or is there any better way?
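Something like the following is what I have in mind. It is only a rough sketch with invented names (copyKernel, d_dst, d_src, stream), assuming contiguous float buffers:

```
// Rough sketch of the copy-kernel workaround (invented names, contiguous
// float buffers assumed). Launched into a stream, it queues up like any
// other kernel, so it stays asynchronous with respect to the host.
__global__ void copyKernel(float *dst, const float *src, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Host side: enqueue the copy between two kernels in the same stream.
copyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_dst, d_src, n);
```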

According to the Guide, it looks like it doesn’t exist.

That said, the “synchronous” memcpy shouldn’t cause a true synchronization (i.e., interfere with CPU-GPU overlap). It should just force the streams to catch up to one another.
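In code, roughly this is what I mean (kernelA, kernelB, d_a, d_b, s1, s2 are just placeholders, and this assumes the usual default-stream behaviour):

```
kernelA<<<grid, block, 0, s1>>>(d_a);                   // asynchronous in s1
cudaMemcpy(d_b, d_a, bytes, cudaMemcpyDeviceToDevice);  // issued in the NULL stream:
                                                        // it waits for s1's queued work,
                                                        // but a device-to-device copy
                                                        // typically returns to the host
                                                        // before it completes
kernelB<<<grid, block, 0, s2>>>(d_b);                   // ordered after the copy
```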

As long as you stay on one device, you can just give the second kernel, as its input, the same pointer that was the output of the first kernel. So I am probably missing what you mean ;)
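As a tiny sketch (kernel and buffer names invented):

```
// On a single device no copy is needed at all: the second kernel just reads
// the buffer the first one wrote. Same stream, so they run in order.
firstKernel<<<grid, block, 0, stream>>>(d_in, d_tmp);    // writes d_tmp
secondKernel<<<grid, block, 0, stream>>>(d_tmp, d_out);  // reads d_tmp
```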

I guess he would not bother with asynchronous memory transfers if he were staying on the same device … it sounds like a pipeline where you have roughly one GPU per step.

Cédric

You can use streams with more than 1 GPU?

Perhaps I’m just plain wrong, but I thought there was simply no support for actual device-to-device communication, and that we have to use some intermediate RAM buffer … (GPU → RAM, then RAM → GPU). Is there any better solution?
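To be concrete, the staging I have in mind looks like this (just a sketch with invented names, assuming each GPU is driven from its own host thread/context):

```
// Staging a transfer between two GPUs through host RAM.
cudaMemcpy(h_staging, d_src_gpu0, bytes, cudaMemcpyDeviceToHost);  // GPU 0 -> RAM
/* ... hand h_staging over to the thread/context that owns GPU 1 ... */
cudaMemcpy(d_dst_gpu1, h_staging, bytes, cudaMemcpyHostToDevice);  // RAM -> GPU 1
```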

Cédric

As far as I know you can just copy from device to device. It is just that, at this time, it goes via system RAM, and people have requested a faster version that goes directly from device to device.
But cudaMemcpy(…, …, cudaMemcpyDeviceToDevice) works as far as I know.

Uh, how are you getting addresses from multiple contexts (and a runtime API that is aware that these are from different contexts)?

Darn…

It would be extremely nice if the stream API were extended so you could have more than one stream, each on its own GPU, plus some kind of nice device-to-device copy function between the devices of two streams ;)

That is what I want to do: force the kernels in a stream to run one by one.

Actually, the asynchronous cuMemcpyDtoD is not a problem. I can do it in a kernel.

But I have another big problem: there is no asynchronous cuMemcpyDtoA.

Damn it, the built-in texture filtering is very robust, and I really need it in my implementation.

Why can’t I use it with asynchronous execution?

Does anyone have a good idea?