Is there an asynchronous cuMemcpyDtoD?
I am trying to cascade several kernels, and some of them have feedback paths, so I need to do a cuMemcpyDtoD.
My kernels are launched asynchronously, so I need an asynchronous cuMemcpyDtoD, right?
But I noticed there is no asynchronous device-to-device memory copy in the CUDA driver API.
That is so weird…
So it looks like I need to write a kernel that does the device-to-device copy and launch it asynchronously.
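Something along these lines is what I have in mind (just an untested sketch; the kernel name, sizes, and stream are placeholders, and I'm using the runtime <<< >>> launch syntax for brevity):

// Trivial element-wise copy kernel, launched into the same stream as the
// other kernels so it is ordered after the producer and before the consumer.
__global__ void copyKernel(float *dst, const float *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// Host side, with d_dst, d_src, n, and stream being whatever you already use:
// copyKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_dst, d_src, n);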
According to the Guide, it looks like it doesn’t exist.
That said, the “synchronous” memcpy shouldn’t cause a true synchronization (i.e., it shouldn’t interfere with CPU-GPU overlap); it should just force the streams to catch up to one another.
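To illustrate what I mean (kernel and variable names are made up; the plain cudaMemcpy goes through the default stream, so it is ordered after the previously issued work, and the later launch is ordered after the copy):

producerKernel<<<grid, block, 0, streamA>>>(d_src, n);  // runs asynchronously in streamA
cudaMemcpy(d_dst, d_src, n * sizeof(float),
           cudaMemcpyDeviceToDevice);                   // default stream: waits for streamA's work
consumerKernel<<<grid, block, 0, streamB>>>(d_dst, n);  // ordered after the copy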
As long as you stay on one device, you can just give the second kernel the same pointer as input that was the output of the first kernel. So I am probably missing what you mean ;)
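For example (hypothetical stage kernels; launches into the same stream execute in order, so the second one can read the buffer the first one wrote, no copy needed):

#include <cuda_runtime.h>

// Hypothetical pipeline stages: stage2 consumes what stage1 produced.
__global__ void stage1(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (float)i;
}

__global__ void stage2(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Both launches go into the same stream, so stage2 sees stage1's output.
    stage1<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    stage2<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);

    cudaStreamSynchronize(stream);
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return 0;
}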
I guess he would not bother with asynchronous memory transfers if he were on a single device… it sounds like a pipeline with roughly one GPU per step.
Perhaps I’m just plain wrong, but I thought there is no support for actual device-to-device communication between GPUs, and that we have to go through an intermediate buffer in system RAM (GPU → RAM, then RAM → GPU). Is there any better solution?
As far as I know, you can just copy from device to device. It is just that, at this time, it goes via system RAM, and people have requested a faster version that goes directly from one device to the other.
But cudaMemcpy(…, …, cudaMemcpyDeviceToDevice) works, as far as I know.
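Something like this (d_dst and d_src being device pointers on the same GPU and n the element count, just to spell it out):

// Runtime-API device-to-device copy within one GPU; it is ordered after
// previously issued work on the device.
cudaMemcpy(d_dst, d_src, n * sizeof(float), cudaMemcpyDeviceToDevice);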
It would be extremely nice if the stream API were extended so you could have more than one stream, each on its own GPU, and then some kind of nice device-to-device copy function between the devices of two streams ;)