Much slower async memcpy in a separate stream than in stream 0

Hi all,

I measured the time of an async copy in stream 0 and in a separate stream.

Here is my code (host_y_pg_lk is a page-locked array, SIZEOFY = 5).
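For reference, the buffers are set up along these lines (just a sketch, not the exact allocation code from my program):

[codebox]#define SIZEOFY 5

double *host_y_pg_lk = NULL;   // page-locked host buffer
double *device_y     = NULL;   // device buffer

// cudaMallocHost gives page-locked (pinned) memory, which is needed
// for cudaMemcpyAsync to be truly asynchronous
cudaMallocHost((void**)&host_y_pg_lk, sizeof(double) * SIZEOFY);
cudaMalloc((void**)&device_y, sizeof(double) * SIZEOFY);[/codebox]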

Code 1, use stream 0:

[codebox]CUT_SAFE_CALL( cutStartTimer( timer ) );

for (i = 0; i < 7500; i++) {
    cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                    cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0); // or cudaThreadSynchronize(); the time cost is the same for both
}

CUT_SAFE_CALL( cutStopTimer( timer ) );[/codebox]

Code 2, use a separate stream:

[codebox]cudaStream_t stream1;
cutilSafeCall( cudaStreamCreate(&stream1) );

CUT_SAFE_CALL( cutStartTimer( timer ) );

for (i = 0; i < 7500; i++) {
    cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                    cudaMemcpyHostToDevice, stream1);
    cudaStreamSynchronize(stream1); // or cudaThreadSynchronize(); the time cost is the same for both
}

CUT_SAFE_CALL( cutStopTimer( timer ) );[/codebox]

Code 1 costs around 350 ms, while code 2 costs around 600 ms.

Can anyone tell me why an async memcpy in a separate stream costs so much more time than in stream 0?

I want to use another stream, say stream2, for computation. But if stream1 costs much more time than stream 0 for async memcpy, then even though stream1 (async memcpy) and stream2 (computation) can overlap, there seems to be no benefit over just using stream 0 for both the async memcpy and the computation.
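Roughly, what I have in mind is the pattern below (just a sketch: kernel_func, its launch configuration, and device_y_prev are placeholders, not my actual code):

[codebox]// placeholder for my computation kernel
__global__ void kernel_func(double *y) { /* ... */ }

// ... host code ...
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// copy in stream1 while the kernel in stream2 works on data from an
// earlier transfer; the two can overlap because they are issued to
// different non-default streams
cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                cudaMemcpyHostToDevice, stream1);
kernel_func<<<1, 128, 0, stream2>>>(device_y_prev);

// wait for both streams before the host reuses the buffers
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);[/codebox]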

Thank you very much.

Zheng

Hello smillee,

Interesting measurement! If using the default stream versus a user-provided stream really does make a difference in performance, that's something I'd be interested to know more about :)

To stay on the safe side, you should perhaps check the return value of the stream synchronization: you are using CUT_SAFE_CALL everywhere, but you don't actually test the calls that might be faulty.
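For instance, something like this (just a sketch of what I mean; fprintf needs <stdio.h>):

[codebox]cudaError_t err = cudaStreamSynchronize(stream1);
if (err != cudaSuccess) {
    // a failure here would also invalidate the timing comparison
    fprintf(stderr, "cudaStreamSynchronize failed: %s\n",
            cudaGetErrorString(err));
}[/codebox]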

So basically you are trying to send 20 bytes 7500 times, and this takes about 300 ms. Correct me if I'm wrong, but that would mean your average transfer time goes from roughly 45 µs (which is unfortunately not that surprising) to 80 µs. Such a huge difference is really surprising, though. Perhaps you should enqueue two consecutive transfers just to see which operation really costs the time, cudaMemcpyAsync or cudaThreadSynchronize.
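Something along these lines, reusing your variable names (a sketch):

[codebox]for (i = 0; i < 7500; i++) {
    // two back-to-back copies, then a single synchronization:
    // if the extra cost is in cudaMemcpyAsync, the per-iteration time
    // should grow with the number of copies; if it is in the
    // synchronization, it should not
    cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                    cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                    cudaMemcpyHostToDevice, stream1);
    cudaStreamSynchronize(stream1);
}[/codebox]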

Looking forward to the explanation!
Cédric

Was too curious, did some extra experiments :)

I indeed managed to reproduce the same “bug” on my machine (CUDA 3.0 under Linux). The figures I get are quite a bit lower, 11 µs vs. 35 µs when using a non-zero stream… but still, what a difference!

If I do the same transfer twice before putting in a stream synchronization, this goes from 15 µs to 54 µs! (With three transfers, I get 19 µs vs. 75 µs.) So it seems there is something eating around 25 µs in cudaMemcpyAsync here…

Best,
Cédric
CUDA_Stream.tar.gz (848 Bytes)

Hi Cédric,

Everything you said is correct, except that I am sending 40 bytes (double-precision floating point).

I tried two different measuring methods, i.e., the cutil timer and CUDA events. They showed the same results.
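The event-based measurement follows the usual pattern, roughly like this (a sketch, not my exact code):

[codebox]cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (i = 0; i < 7500; i++) {
    cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                    cudaMemcpyHostToDevice, stream1);
    cudaStreamSynchronize(stream1);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_ms, start, stop); // milliseconds[/codebox]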

Right now, I am using only stream 0 for both the async memcpy and the computation. A user-provided stream for async memcpy performs badly, as shown by both of our results, but a user-provided stream for kernel launches has the same performance as stream 0. So I tried using stream 0 for the async memcpy and a user stream for the kernel computation, and overlapped them. But the performance is not as good as just using stream 0 for both.

I think the reason is probably the cudaThreadSynchronize (or cudaStreamSynchronize or cudaEventSynchronize) call, which must be issued if two different streams are used. With only one stream (e.g., stream 0) there is no need: kernel launches are guaranteed to start only after the async memcpy in the same stream is done.
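In other words, with only stream 0 something like the following needs no explicit synchronization between the copy and the kernel (again just a sketch, with kernel_func as a placeholder):

[codebox]// operations issued to the same stream execute in issue order, so the
// kernel will not start until the async copy has completed
cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                cudaMemcpyHostToDevice, 0);
kernel_func<<<1, 128>>>(device_y); // also in stream 0[/codebox]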

Zheng

I ran the code you attached with CUDA 7.0 on a K40. The times for the default stream and a created stream are 11.16 µs and 10.98 µs, respectively. Also, since CUDA 7.0 provides a way to change the behavior of the default stream with --default-stream per-thread, I tried that as well and saw no big change.

So I suggest you try a newer CUDA version; I believe the issue has been fixed there.