Asynchronous cuMemcpyDtoD?

dulcet · December 5, 2008, 9:50am

Is there an asynchronous cuMemcpyDtoD?
I am trying to cascade several kernels, and some of them have feedback paths, so I need to do a cuMemcpyDtoD.
My kernels are asynchronous, so I need an asynchronous cuMemcpyDtoD, right?

But I noticed there is no asynchronous device-to-device memory copy in the CUDA driver API.
That is so weird…

So I need to write a kernel that does the device memory copy, and execute it asynchronously.

Or is there a better way?
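
A minimal sketch of the copy-kernel workaround described above, written against the runtime API rather than the driver API for brevity. The names `copyKernel` and `copyDtoDInStream`, the `float` element type, and the launch configuration are illustrative assumptions, not anything from the thread. The launch is queued on the given stream and returns to the host immediately, so the copy is ordered with the other kernels in that stream:

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

// Grid-stride copy: each thread copies elements i, i + stride, i + 2*stride, ...
// __restrict__ assumes src and dst do not overlap.
__global__ void copyKernel(const float* __restrict__ src,
                           float* __restrict__ dst, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

// Hypothetical helper: enqueue a device-to-device copy of n floats on `stream`.
// It does not wait for the copy to finish; it only reports launch errors.
cudaError_t copyDtoDInStream(const float* src, float* dst, size_t n,
                             cudaStream_t stream)
{
    if (n == 0) return cudaSuccess;
    assert(src != nullptr && dst != nullptr);

    const int threads = 256;                       // assumed block size
    size_t needed = (n + threads - 1) / threads;   // blocks for one element per thread
    int blocks = needed < 1024 ? (int)needed : 1024; // cap; the stride loop covers the rest

    copyKernel<<<blocks, threads, 0, stream>>>(src, dst, n);
    return cudaGetLastError();
}
```

Because work issued to a single stream runs in issue order, a kernel launched afterwards in the same stream sees the finished copy without any host-side synchronization.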

alex_dubinsky · December 6, 2008, 3:55am

According to the Guide, it looks like it doesn’t exist.

That said, the “synchronous” memcpy shouldn’t cause a true synchronization (i.e., it shouldn’t interfere with CPU-GPU overlap). It should just force the streams to catch up to one another.

E.D_Riedijk · December 6, 2008, 8:19am

As long as you stay on one device, you can just pass the second kernel, as its input, the same pointer that was the output of the first kernel. So I am probably missing what you mean ;)
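
A minimal sketch of that suggestion, assuming two made-up stages (`stageOne`, `stageTwo`) and a hypothetical helper `runPipeline`; the arithmetic in each stage is only a placeholder. The intermediate buffer `dTmp` is written by the first kernel and read by the second, and launching both in the same stream is what orders them:

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Placeholder first stage: out[i] = 2 * in[i].
__global__ void stageOne(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// Placeholder second stage: out[i] = in[i] + 1.
__global__ void stageTwo(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

// Hypothetical helper: queue both stages on one stream, no copy in between.
// stageTwo reads the same device pointer (dTmp) that stageOne wrote.
cudaError_t runPipeline(const float* dIn, float* dTmp, float* dOut, int n,
                        cudaStream_t stream)
{
    assert(n > 0);
    const int threads = 256;                        // assumed block size
    const int blocks = (n + threads - 1) / threads;

    stageOne<<<blocks, threads, 0, stream>>>(dIn, dTmp, n);
    stageTwo<<<blocks, threads, 0, stream>>>(dTmp, dOut, n);
    return cudaGetLastError();
}
```

Both launches return to the host immediately; the second kernel starts only after the first has finished because they share a stream.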

gonnet · December 6, 2008, 10:45am

I guess he would not bother with asynchronous memory transfers if he were on the same device … it sounds like a pipeline with roughly one GPU per step.

Cédric

alex_dubinsky · December 6, 2008, 10:07pm

You can use streams with more than 1 GPU?

gonnet · December 6, 2008, 10:10pm

Perhaps I’m just plain wrong, but I thought there was no support for actual device-to-device communication, and that we have to use an intermediate RAM buffer (GPU → RAM, then RAM → GPU). Is there a better solution?

Cédric

E.D_Riedijk · December 7, 2008, 6:14am

As far as I know, you can just copy from device to device. It is just that, at this time, the copy goes via system RAM, and people have requested a faster version that goes directly from device to device.
But cudaMemcpy(…,…,cudaMemcpyDeviceToDevice) works as far as I know.

tmurray · December 7, 2008, 6:17am

Uh, how are you getting addresses from multiple contexts (and have a runtime API that is aware that these are from different contexts)?

E.D_Riedijk · December 7, 2008, 6:57am

Darn…

It would be extremely nice if the stream API were extended to allow more than one stream, each on its own GPU, along with some kind of nice device-to-device copy function between the devices of two streams ;)

dulcet · December 9, 2008, 2:56pm

That is what I want to do: force the kernels in a stream to run one by one.

Actually, an asynchronous cuMemcpyDtoD is not a problem. I can do it in kernels.

But I have another big problem: there is no asynchronous cuMemcpyDtoA.

Damn it, the built-in texture filtering is very robust, and I really need it in my implementation.

Why can’t I use it in asynchronous execution?

Does anyone have a good idea?
