Much slower async memcpy in a separate stream than in stream 0

Hi all,

I measured the time of an async copy in stream 0 and in a separate stream.

Here is my code (host_y_pg_lk is a page-locked array, SIZEOFY = 5).
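For reference, the buffers are set up along these lines (just a sketch, not the exact allocation code from my program):

[codebox]#define SIZEOFY 5

double *host_y_pg_lk = NULL;   // page-locked host buffer
double *device_y     = NULL;   // device buffer

// cudaMallocHost gives page-locked (pinned) memory, which is needed
// for cudaMemcpyAsync to be truly asynchronous
cudaMallocHost((void**)&host_y_pg_lk, sizeof(double) * SIZEOFY);
cudaMalloc((void**)&device_y, sizeof(double) * SIZEOFY);[/codebox]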

Code 1, use stream 0:

[codebox]CUT_SAFE_CALL( cutStartTimer( timer ) );

for (i = 0; i < 7500; i++) {
    cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                    cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0); // or cudaThreadSynchronize(); the time cost is the same for both
}

CUT_SAFE_CALL( cutStopTimer( timer ) );[/codebox]

Code 2, use a separate stream:

[codebox]cudaStream_t stream1;
cutilSafeCall( cudaStreamCreate(&stream1) );

CUT_SAFE_CALL( cutStartTimer( timer ) );

for (i = 0; i < 7500; i++) {
    cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                    cudaMemcpyHostToDevice, stream1);
    cudaStreamSynchronize(stream1); // or cudaThreadSynchronize(); the time cost is the same for both
}

CUT_SAFE_CALL( cutStopTimer( timer ) );[/codebox]

Code 1 costs around 350 ms, while code 2 costs around 600 ms.

Can anyone tell me why an async memcpy in a separate stream costs so much more time than in stream 0?

I want to use another stream, say stream2, for computation. But if stream1 costs much more time than stream 0 for async memcpy, then even though stream1 (async memcpy) and stream2 (computation) can overlap, there seems to be no benefit over just using stream 0 for both the async memcpy and the computation.
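Roughly, what I have in mind is the pattern below (just a sketch: kernel_func, its launch configuration, and device_y_prev are placeholders, not my actual code):

[codebox]// placeholder for my computation kernel
__global__ void kernel_func(double *y) { /* ... */ }

// ... host code ...
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// copy in stream1 while the kernel in stream2 works on data from an
// earlier transfer; the two can overlap because they are issued to
// different non-default streams
cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                cudaMemcpyHostToDevice, stream1);
kernel_func<<<1, 128, 0, stream2>>>(device_y_prev);

// wait for both streams before the host reuses the buffers
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);[/codebox]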

Thank you very much.

Zheng

Hello smillee,

Interesting measurement! If using the default stream versus a user-provided stream really does make a difference in performance, that's something I'd be interested to know more about :)

To stay on the safe side, you should perhaps check the return value of the stream synchronization: you are using CUT_SAFE_CALL everywhere, but you don't actually test the calls that might be faulty.
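For instance, something like this (just a sketch of what I mean; fprintf needs <stdio.h>):

[codebox]cudaError_t err = cudaStreamSynchronize(stream1);
if (err != cudaSuccess) {
    // a failure here would also invalidate the timing comparison
    fprintf(stderr, "cudaStreamSynchronize failed: %s\n",
            cudaGetErrorString(err));
}[/codebox]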

So basically you are trying to send 20 bytes 7500 times, and this takes about 300 ms. Correct me if I'm wrong, but that would mean your average transfer time goes from roughly 45 µs (which is unfortunately not that surprising) to 80 µs. Such a huge difference is really surprising, though. Perhaps you should enqueue two consecutive transfers just to see which operation really costs the time, cudaMemcpyAsync or cudaThreadSynchronize.
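Something along these lines, reusing your variable names (a sketch):

[codebox]for (i = 0; i < 7500; i++) {
    // two back-to-back copies, then a single synchronization:
    // if the extra cost is in cudaMemcpyAsync, the per-iteration time
    // should grow with the number of copies; if it is in the
    // synchronization, it should not
    cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                    cudaMemcpyHostToDevice, stream1);
    cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                    cudaMemcpyHostToDevice, stream1);
    cudaStreamSynchronize(stream1);
}[/codebox]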

Looking forward to the explanation!
Cédric

Was too curious, did some extra experiments :)

I indeed managed to reproduce the same “bug” on my machine (CUDA 3.0 under Linux). The figures I get are quite a bit lower, 11 µs vs. 35 µs when using a non-zero stream… but still, what a difference!

If I do the same transfer twice before putting in a stream synchronization, this goes from 15 µs to 54 µs! (With three transfers, I get 19 µs vs. 75 µs.) So it seems there is something eating around 25 µs in cudaMemcpyAsync here…

Best,
Cédric
CUDA_Stream.tar.gz (848 Bytes)

Hi Cédric,

Everything you said is correct, except that I am sending 40 bytes (double-precision floating point).

I tried two different measuring methods, i.e., the cutil timer and CUDA events. They showed the same results.
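The event-based measurement follows the usual pattern, roughly like this (a sketch, not my exact code):

[codebox]cudaEvent_t start, stop;
float elapsed_ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
for (i = 0; i < 7500; i++) {
    cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                    cudaMemcpyHostToDevice, stream1);
    cudaStreamSynchronize(stream1);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_ms, start, stop); // milliseconds[/codebox]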

Right now, I am using only stream 0 for both the async memcpy and the computation. A user-provided stream for async memcpy performs badly, as shown by both of our results, but a user-provided stream for kernel launches has the same performance as stream 0. So I tried using stream 0 for the async memcpy and a user stream for the kernel computation, and overlapped them. But the performance is not as good as just using stream 0 for both.

I think the reason is probably the cudaThreadSynchronize (or cudaStreamSynchronize or cudaEventSynchronize) call, which must be issued if two different streams are used. With only one stream (e.g., stream 0) there is no need: kernel launches are guaranteed to start only after the async memcpy in the same stream is done.
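In other words, with only stream 0 something like the following needs no explicit synchronization between the copy and the kernel (again just a sketch, with kernel_func as a placeholder):

[codebox]// operations issued to the same stream execute in issue order, so the
// kernel will not start until the async copy has completed
cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY,
                cudaMemcpyHostToDevice, 0);
kernel_func<<<1, 128>>>(device_y); // also in stream 0[/codebox]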

Zheng

I ran the code you attached with CUDA 7.0 on a K40. The times for the default stream and a created stream are 11.16 µs and 10.98 µs, respectively. Also, since CUDA 7.0 provides a way to change the behavior of the default stream with --default-stream per-thread, I tried that as well and saw no big change.

So I suggest you try a newer CUDA version; I believe the issue has been fixed there.