About the streaming-style sample code in the Programming Guide ... why such a style?

Hi,

I’m interested in streams as a way to reduce the runtime of my applications by overlapping transfers with kernel execution.

Then I read a sample in the Programming Guide and found that it uses three separate ‘for’ loops, one per operation, like below:

for (int i = 0; i < nstreams; ++i)
	cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < nstreams; ++i)
	myKernel<<<grid, block, 0, stream[i]>>>(...);

for (int i = 0; i < nstreams; ++i)
	cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[i]);

I thought that such a style was redundant and inefficient, so I wrote my code like this:

for (int i = 0; i < nstreams; ++i) {
	cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[i]);
	myKernel<<<grid, block, 0, stream[i]>>>(...);
	cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[i]);
}

but, contrary to my expectation, nothing overlapped at all.

If I write my code in the first style, transfers and kernels overlap perfectly.

I’ve checked my code many times and ensured the following points (a minimal sketch of this setup follows the list):

  1. use cudaMallocHost to allocate host memory and *Async to copy memory between host and device

  2. there’s no memory area conflict with other streams and no dependency among all streams

  3. compile the code in ‘Release’ mode

  4. set CUDA_PROFILE=0
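
For reference, here is roughly what the setup and the working (first-style) loops look like in my test. This is a minimal sketch, not my actual code: NSTREAMS, N, and myKernel are placeholders, and myKernel just stands in for the real work:

#include <cuda_runtime.h>

#define NSTREAMS 4
#define N        (1 << 20)                       /* elements per stream */

__global__ void myKernel(float *data, int n)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)
		data[i] += 1.0f;                 /* stand-in for the real work */
}

int main(void)
{
	cudaStream_t stream[NSTREAMS];
	float *h_data, *d_data;
	size_t chunkBytes = N * sizeof(float);

	/* point 1: pinned host memory, required for the *Async copies */
	cudaMallocHost((void**)&h_data, NSTREAMS * chunkBytes);
	cudaMalloc((void**)&d_data, NSTREAMS * chunkBytes);

	for (int i = 0; i < NSTREAMS; ++i)
		cudaStreamCreate(&stream[i]);

	/* point 2: each stream works on its own chunk, no dependencies */
	for (int i = 0; i < NSTREAMS; ++i)
		cudaMemcpyAsync(d_data + i * N, h_data + i * N, chunkBytes,
		                cudaMemcpyHostToDevice, stream[i]);
	for (int i = 0; i < NSTREAMS; ++i)
		myKernel<<<N / 256, 256, 0, stream[i]>>>(d_data + i * N, N);
	for (int i = 0; i < NSTREAMS; ++i)
		cudaMemcpyAsync(h_data + i * N, d_data + i * N, chunkBytes,
		                cudaMemcpyDeviceToHost, stream[i]);

	cudaThreadSynchronize();                 /* CUDA 2.x API */

	for (int i = 0; i < NSTREAMS; ++i)
		cudaStreamDestroy(stream[i]);
	cudaFreeHost(h_data);
	cudaFree(d_data);
	return 0;
}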

My 2 machines show the same result:

(8800 GTS 512 / GTX 280), CUDA 2.0, WinXP (x86)

Does anyone know why there’s no overlap with the second style?

It’s impossible to overlap with the second style, because within a single stream you always have to wait for the memory copy to complete before the corresponding kernel can start. Otherwise, the kernel would not have complete data.

With the first style, in theory, the second memory copy (which runs asynchronously) can overlap with the execution of the first kernel.
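
To make the ordering explicit, here is your second style again with the implicit dependencies marked as comments (the stream argument is what your real code would be passing):

for (int i = 0; i < nstreams; ++i) {
	cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[i]); /* (1) */
	myKernel<<<grid, block, 0, stream[i]>>>(...);            /* (2) waits for (1): same stream */
	cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[i]); /* (3) waits for (2): same stream */
}
/* Only operations from different streams are free to overlap. */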

Thanks pcchen. I expected both programs to run the same way because of this description in the Programming Guide:
“A stream is a sequence of operations that execute in order. Different streams, on the other hand, may execute their operations out of order with respect to one another or concurrently.”

There’s no transfer running while the kernel in stream 1 executes, so the memcpy(H2D) in stream 2 can overlap with that kernel (T = transfer, K = kernel, _ = idle):
Stream 1: TKT_
Stream 2: _TKT

But with the second style it is actually:
Stream 1: TKT___
Stream 2: ___TKT
The memcpy(H2D) in stream 2 is blocked behind the memcpy(D2H) in stream 1.

Is such behavior truly “out of order”?
From that sentence, I would expect the driver to automatically reorder the memcpy(D2H) in stream 1 and the memcpy(H2D) in stream 2 to use the idle time, even when the second style is used;
instead, the transfer operations in this case seem to be executed one after another across all streams.

Is this due to a hardware limitation, or a software-level one that may be improved in a future release?

In theory, the driver could do this reordering on its side (e.g. buffer all memory-copy and kernel-launch requests and reorder them before submission), but I don’t know whether NVIDIA is doing this or planning to.

Maybe it’s the D2H copies messing up the interleaving. Have you tried:

for (int i = 0; i < nstreams; ++i) {
	cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[i]);
	myKernel<<<grid, block, 0, stream[i]>>>(...);
}

for (int i = 0; i < nstreams; ++i)
	cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[i]);

Otherwise, you seem to have all the bases covered, and I don’t see any reason why the 2nd form you posted above doesn’t work. (The idea behind the variant above: if all copies go through a single queue, moving the D2H copies out of the loop means every H2D is issued before any D2H, so no H2D can get stuck behind one.) I wonder if it is a limitation of the WinXP driver.

Is modifying the simpleStreams example sufficient to demonstrate this issue? I could make the modifications here and try them on my Linux box to see if I find the same issue.

I wrote my code based on simpleStreams, but changed its kernel to compute sums.

The code above and this one:

for (int i = 0; i < nstreams; ++i)
	cudaMemcpyAsync(..., cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < nstreams; ++i) {
	myKernel<<<grid, block, 0, stream[i]>>>(...);
	cudaMemcpyAsync(..., cudaMemcpyDeviceToHost, stream[i]);
}

both worked well, though my original code does not.

It seems that the only difference from my coding style is whether a memcpy(D2H) issued in one stream is immediately followed by a memcpy(H2D) in another stream.
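
To isolate just that, I suppose one could time the suspect pair by itself. A sketch (h_a/d_a, h_b/d_b, and bytes are placeholders for pinned host and device buffers used by two independent streams):

cudaEvent_t start, stop;
float ms;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpyAsync(h_a, d_a, bytes, cudaMemcpyDeviceToHost, stream[0]); /* D2H in stream 0 */
cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, stream[1]); /* H2D in stream 1 */
cudaEventRecord(stop, 0);   /* stream 0 (the default stream) waits for all prior work */
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
/* If ms is about the sum of the two copies timed alone, the H2D is
   queuing behind the D2H instead of overlapping with it. */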