About the streaming-style sample code in the Programming Guide ... why such a style?

Hi,

I’m interested in streams as a way to reduce the runtime of my applications by overlapping transfers with kernel execution.

Then I read a sample in the Programming Guide and found that it uses three separate ‘for’ loops, one per operation, like below:

for (int i = 0; i < nstreams; ++i)
	cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < nstreams; ++i)
	myKernel<<<grid, block, 0, stream[i]>>>(...);

for (int i = 0; i < nstreams; ++i)
	cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[i]);

I thought that such a style was redundant and inefficient, so I wrote my code like this:

for (int i = 0; i < nstreams; ++i) {
	cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[i]);
	myKernel<<<grid, block, 0, stream[i]>>>(...);
	cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[i]);
}

but, contrary to my expectation, nothing overlapped at all.

If I write my code in the first style, transfers and kernels overlap perfectly.

I’ve checked my code many times and ensured the following points (a minimal sketch of this setup follows the list):

  1. use cudaMallocHost to allocate host memory and *Async to copy memory between host and device

  2. there’s no memory area conflict with other streams and no dependency among all streams

  3. compile the code in ‘Release’ mode

  4. set CUDA_PROFILE=0
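
For reference, here is roughly what the setup and the working (first-style) loops look like in my test. This is a minimal sketch, not my actual code: NSTREAMS, N, and myKernel are placeholders, and myKernel just stands in for the real work:

#include <cuda_runtime.h>

#define NSTREAMS 4
#define N        (1 << 20)                       /* elements per stream */

__global__ void myKernel(float *data, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		data[i] += 1.0f;                 /* stand-in for the real work */
}

int main(void)
{
	cudaStream_t stream[NSTREAMS];
	float *h_data, *d_data;
	size_t chunkBytes = N * sizeof(float);

	/* point 1: pinned host memory, required for the *Async copies */
	cudaMallocHost((void**)&h_data, NSTREAMS * chunkBytes);
	cudaMalloc((void**)&d_data, NSTREAMS * chunkBytes);

	for (int i = 0; i < NSTREAMS; ++i)
		cudaStreamCreate(&stream[i]);

	/* point 2: each stream works on its own chunk, no dependencies */
	for (int i = 0; i < NSTREAMS; ++i)
		cudaMemcpyAsync(d_data + i * N, h_data + i * N, chunkBytes,
		                cudaMemcpyHostToDevice, stream[i]);
	for (int i = 0; i < NSTREAMS; ++i)
		myKernel<<<N / 256, 256, 0, stream[i]>>>(d_data + i * N, N);
	for (int i = 0; i < NSTREAMS; ++i)
		cudaMemcpyAsync(h_data + i * N, d_data + i * N, chunkBytes,
		                cudaMemcpyDeviceToHost, stream[i]);

	cudaThreadSynchronize();                 /* CUDA 2.x API */

	for (int i = 0; i < NSTREAMS; ++i)
		cudaStreamDestroy(stream[i]);
	cudaFreeHost(h_data);
	cudaFree(d_data);
	return 0;
}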

My 2 machines show the same result:

(8800 GTS 512 / GTX 280), CUDA 2.0, WinXP (x86)

Does anyone know why there’s no overlap with the second style?

It’s impossible to overlap with the second style, because within a single stream you always have to wait for the memory copy to complete before the corresponding kernel can start. Otherwise, the kernel would not have complete data.

With the first style, in theory, the second memory copy (which runs asynchronously) can overlap with the execution of the first kernel.
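
To make the ordering explicit, here is your second style again with the implicit dependencies marked as comments (the stream argument is what your real code would be passing):

for (int i = 0; i < nstreams; ++i) {
	cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[i]); /* (1) */
	myKernel<<<grid, block, 0, stream[i]>>>(...);            /* (2) waits for (1): same stream */
	cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[i]); /* (3) waits for (2): same stream */
}
/* Only operations from different streams are free to overlap. */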

Thanks pcchen. I expected both programs to run the same way because of this description in the Programming Guide:
“A stream is a sequence of operations that execute in order. Different streams, on the other hand, may execute their operations out of order with respect to one another or concurrently.”

There’s no transfer running while the kernel in stream 1 executes, so the memcpy(H2D) in stream 2 can overlap with that kernel (T = transfer, K = kernel, _ = idle):
Stream 1: TKT_
Stream 2: _TKT

But with the second style it is actually:
Stream 1: TKT___
Stream 2: ___TKT
The memcpy(H2D) in stream 2 is blocked behind the memcpy(D2H) in stream 1.

Is such behavior truly “out of order”?
From that sentence, I would expect the driver to automatically reorder the memcpy(D2H) in stream 1 and the memcpy(H2D) in stream 2 to use the idle time, even when the second style is used;
instead, the transfer operations in this case seem to be executed one after another across all streams.

Is this due to a hardware limitation, or a software-level one that may be improved in a future release?

In theory, the driver could do this reordering on its side (e.g. buffer all memory-copy and kernel-launch requests and reorder them before submission), but I don’t know whether NVIDIA is doing this or planning to.

Maybe it’s the D2H copies messing up the interleaving. Have you tried:

for (int i = 0; i < nstreams; ++i) {
	cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[i]);
	myKernel<<<grid, block, 0, stream[i]>>>(...);
}

for (int i = 0; i < nstreams; ++i)
	cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[i]);

Otherwise, you seem to have all the bases covered, and I don’t see any reason why the 2nd form you posted above doesn’t work. (The idea behind the variant above: if all copies go through a single queue, moving the D2H copies out of the loop means every H2D is issued before any D2H, so no H2D can get stuck behind one.) I wonder if it is a limitation of the WinXP driver.

Is modifying the simpleStreams example sufficient to demonstrate this issue? I could make the modifications here and try them on my Linux box to see if I find the same issue.

I wrote my code based on simpleStreams, but changed its kernel to compute sums.

The code above and this one:

for (int i = 0; i < nstreams; ++i)
	cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < nstreams; ++i) {
	myKernel<<<grid, block, 0, stream[i]>>>(...);
	cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[i]);
}

both worked well, though my original code does not.

It seems that the only difference from my coding style is whether a memcpy(D2H) issued in one stream is immediately followed by a memcpy(H2D) in another stream.
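
To isolate just that, I suppose one could time the suspect pair by itself. A sketch (h_a/d_a, h_b/d_b, and bytes are placeholders for pinned host and device buffers used by two independent streams):

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, stream[0]); /* D2H in stream 0 */
cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream[1]); /* H2D in stream 1 */
cudaEventRecord(stop, 0);   /* stream 0 (the default stream) waits for all prior work */
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
/* If ms is about the sum of the two copies timed alone, the H2D is
   queuing behind the D2H instead of overlapping with it. */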