Help needed with streaming on a Tesla GPU: understanding simpleMultiCopy SDK measurements

Hi, I’d like to exploit the streaming capabilities of a Tesla C2050 to overlap kernel execution and data transfers.
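For context, this is roughly the pattern I’m trying to get working (a minimal sketch with a placeholder kernel and made-up sizes, not the SDK sample itself): pinned host buffers, cudaMemcpyAsync, and one kernel launch per stream, so that H2D copies, compute, and D2H copies can overlap.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void incKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int nStreams = 4;
    const int n        = 4 * 1024 * 1024;      // elements per stream (made-up size)
    const size_t bytes = n * sizeof(int);

    int *h_in[nStreams], *h_out[nStreams], *d_buf[nStreams];
    cudaStream_t stream[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaMallocHost((void **)&h_in[s],  bytes);   // pinned host memory is required
        cudaMallocHost((void **)&h_out[s], bytes);   // for the copies to be asynchronous
        cudaMalloc((void **)&d_buf[s], bytes);
        cudaStreamCreate(&stream[s]);
        // (input initialization and error checking omitted in this sketch)
    }

    // Each stream gets its own H2D copy, kernel, and D2H copy; work queued in
    // different streams is allowed to overlap on devices with copy/compute overlap.
    for (int s = 0; s < nStreams; ++s) {
        cudaMemcpyAsync(d_buf[s], h_in[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        incKernel<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], n);
        cudaMemcpyAsync(h_out[s], d_buf[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaFreeHost(h_in[s]);
        cudaFreeHost(h_out[s]);
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
    printf("done\n");
    return 0;
}

This is the kind of overlap I’m hoping to confirm with actual measurements.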

Is there any tool that shows what was actually overlapped during a run and how much time it took?

I tried looking for measurements related to the simpleMultiCopy SDK sample (which is available online), but unfortunately I couldn’t find any…

Could anybody with a compute capability 2.0 GPU post their numbers for comparison? Do you see any overlap of D2H and H2D transfers on your cards?

SimpleMultiCopy results on Tesla C2050, Ubuntu 10.10, CUDA dev driver 270.40, NVCC release 4.0, V0.2.1221:

[simpleMulti] starting...

[Tesla C2050] has 14 MP(s) x 32 (Cores/MP) = 448 (Cores)
> Device name: Tesla C2050
> CUDA Capability 2.0 hardware with 14 multi-processors
> scale_factor = 1.00
> array_size   = 4194304

Relevant properties of this CUDA device
(X) Can overlap one CPU<>GPU data transfer with GPU kernel execution (device property "deviceOverlap")
(X) Can overlap two CPU<>GPU data transfers with GPU kernel execution
    (compute capability >= 2.0 AND (Tesla product OR Quadro 4000/5000)

Measured timings (throughput):
 Memcpy host to device	: 3.032320 ms (5.532799 GB/s)
 Memcpy device to host	: 2.569984 ms (6.528140 GB/s)
 Kernel			: 0.580416 ms (289.055011 GB/s)

Theoretical limits for speedup gained from overlapped data transfers:
No overlap at all (transfer-kernel-transfer): 6.182720 ms
Compute can overlap with one transfer: 5.602304 ms
Compute can overlap with both data transfers: 3.032320 ms

Average measured timings over 10 repetitions:
 Avg. time when execution fully serialized	: 6.261718 ms
 Avg. time when overlapped using 4 streams	: 5.001850 ms
 Avg. speedup gained (serialized - overlapped)	: 1.259869 ms

Measured throughput:
 Fully serialized execution		: 5.358662 GB/s
 Overlapped using 4 streams		: 6.708405 GB/s

[simpleMulti] test results...
PASSED

I’m curious about the results of my simpleMultiCopy run: the average time for a cycle with 4 streams is around 5 ms, while the estimate for compute overlapping with both data transfers is around 3 ms. The GPU should be capable of overlapping D2H and H2D transfers, so I expected the total time to be approximately 3 ms times the number of iterations.
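To spell out the arithmetic behind that expectation (using the numbers from the run above): with no overlap at all, a cycle should cost roughly 3.03 + 0.58 + 2.57 ≈ 6.18 ms; with compute hidden behind one transfer, roughly 3.03 + 2.57 ≈ 5.60 ms; and with compute overlapping both transfers, a cycle should be bounded by the longest single operation, max(3.03, 2.57, 0.58) ≈ 3.03 ms. My measured 5.0 ms per cycle falls between the one-transfer-overlap and the full-overlap limits.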

Does anybody know the reason for this discrepancy between the expected and the actual timings?

P.S. What is the purpose of the event synchronization in this sample? Shouldn’t the completion of the H2D data transfer kick off the kernel execution, since both are issued into the same stream?
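For reference, this is the in-stream ordering I have in mind (a fragment with placeholder names, reusing incKernel from the sketch above; it is not the sample’s actual code): operations issued into one stream execute in issue order, so I would expect the kernel to start as soon as the preceding H2D copy finishes, without any event in between.

void pipelineOneChunk(int *d_buf, int *h_in, int *h_out, size_t bytes, int n, cudaStream_t s)
{
    cudaMemcpyAsync(d_buf, h_in, bytes, cudaMemcpyHostToDevice, s);  // H2D
    incKernel<<<(n + 255) / 256, 256, 0, s>>>(d_buf, n);             // should start right after the H2D in this stream
    cudaMemcpyAsync(h_out, d_buf, bytes, cudaMemcpyDeviceToHost, s); // D2H, after the kernel
}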