Hi all,
I timed asynchronous host-to-device copies issued in stream 0 and in a separately created stream.
Here is my code (host_y_pg_lk is a page-locked array, SIZEOFY = 5).
Code 1, using stream 0:
[codebox]CUT_SAFE_CALL( cutStartTimer( timer));
for (i = 0; i < 7500; i++) {
    // Copy SIZEOFY doubles from page-locked host memory to the device in stream 0
    cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY, cudaMemcpyHostToDevice, 0);
    // Wait for the copy to finish; cudaThreadSynchronize() here gives the same timing
    cudaStreamSynchronize(0);
}
CUT_SAFE_CALL( cutStopTimer( timer));[/codebox]
Code 2, using a separate stream:
[codebox]cudaStream_t stream1;
cutilSafeCall( cudaStreamCreate(&stream1) );
CUT_SAFE_CALL( cutStartTimer( timer));
for (i = 0; i < 7500; i++) {
    // Same copy as in Code 1, but issued in stream1 instead of stream 0
    cudaMemcpyAsync(device_y, host_y_pg_lk, sizeof(double) * SIZEOFY, cudaMemcpyHostToDevice, stream1);
    // Wait for the copy to finish; cudaThreadSynchronize() here gives the same timing
    cudaStreamSynchronize(stream1);
}
CUT_SAFE_CALL( cutStopTimer( timer));[/codebox]
Code 1 takes around 350 ms in total for the 7500 iterations, while Code 2 takes around 600 ms.
Can anyone tell me why an async memcpy in a separate stream takes so much longer than in stream 0?
I want to use another stream, say stream2, for computation. But if the async memcpy in stream1 is this much slower than in stream 0, then even though stream1 (async memcpy) and stream2 (computation) can overlap, there seems to be no benefit over simply using stream 0 for both the async memcpy and the computation.
Thank you very much.
Zheng