It’s impossible to overlap with the second style, because you always have to wait for memory copy to complete before starting to run the corresponding kernel. Otherwise, the kernel will not have complete data.
With the first style, in theory, the second memory copy (which runs async) can overlap with the execution of the first kernel.
Thanks pcchen, I expected both 2 programs to run in the same way because of these descriptions in Programming Guide:
“A stream is a sequence of operations that execute in order. Different streams, on the other hand, may execute their operations out of order with respect to one another or concurrently.”
There’s no transfer while kernel is running in stream 1, therefore memcpy(H2D) in stream 2 can be overlapped with the kernel:
Stream 1: TKT_
Stream 2: _TKT
But it is not true in the second style, actually:
Stream 1: TKT___
Stream 2: ___TKT
memcpy(H2D) in stream 2 is blocked with memcpy(D2H) in stream1.
Is such a behavior truly “out of order?”
From this sentence, I think that driver should automatically reorder memcpy(D2H) in stream 1 and memcpy(H2D) in stream2 to utilize idle time even if the second is used,
while transfer operations, in this case, seem to be executed one after another through all streams.
Is this problem due to any hardware limitation, or software level one that may be improved in future release?
In theory, it’s possible to do out-of-order in the driver side (e.g. the driver buffers all memory copy and kernel execution requests and reorder them), but I don’t know if NVIDIA is doing or planning to do this.
Maybe it’s the D2H copies messing up the interleaving. Have you tried:
for (...) {
cudaMemcpyAsync(H2D);
myKernel<<<...>>>(...);
}
for (...) {
cudaMemcpyAsync(D2H);
}
Otherwise, you seem to have all the bases covered and I don’t see any reason why the 2nd form you posted above doesn’t work. I wonder if it is a limitation of the winxp driver.
Is modifying the simpleStreams example sufficient to demonstrate this issue? I could make the modifications here and try my linux box to see if I find the same issue.