Visual Profiler: tracking of concurrent data transfers and kernel executions

Hello,

I have written a CUDA application that overlaps host-to-device data transfers with kernel execution by using CUDA streams and asynchronous memcpy. I now want to check the concurrency actually achieved with the CUDA Visual Profiler, but when I start a new recording, only sequential behaviour is shown. Nevertheless, when I use CPU or GPU timers, the corresponding overlap is measurable.

Now my question is whether the CUDA Visual Profiler supports the recording of concurrent data transfers and kernel executions.

Best regards,
ElBifi

As far as I can tell, the profiler blocks kernel launches.

You can test it with the simpleStreams sample of the SDK:

The standard behavior shows:

./simpleStreams 

[ simpleStreams ]

> > Using CUDA device [0]: Tesla T10 Processor

> CUDA Capable SM 1.3 hardware with 30 multi-processors

> scale_factor = 1.0000

> array_size   = 16777216

memcopy:        13.63

kernel:         25.15

non-streamed:   37.05 (38.78 expected)

4 streams:      26.65 (28.56 expected with compute capability 1.1 or later)

-------------------------------

PASSED

And the output when run under computeprof shows:

Start program './simpleStreams' run #5 ...

[ simpleStreams ]

> > Using CUDA device [0]: Tesla T10 Processor

> CUDA Capable SM 1.3 hardware with 30 multi-processors

> scale_factor = 1.0000

> array_size   = 16777216

memcopy:	13.45

kernel:		25.18

non-streamed:	36.92 (38.63 expected)

4 streams:	38.99 (28.54 expected with compute capability 1.1 or later)

-------------------------------

PASSED
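The "expected" figures in both runs can be reproduced from the measured memcopy and kernel times. The sketch below is inferred from the numbers in the output above, not taken from the sample's source, and the function names are hypothetical: the non-streamed estimate is the plain sum, and the streamed estimate assumes that only one stream's share of the copy fails to overlap with kernel work.

```cpp
#include <cassert>

// Hypothetical helpers; times are in the same unit the sample prints.

// Serialized model: the whole copy and the whole kernel run back to back.
double expected_non_streamed(double memcopy, double kernel) {
    return memcopy + kernel;
}

// Overlap model (an assumption inferred from the printed figures):
// the work is split evenly across n_streams, and only one chunk's
// copy (memcopy / n_streams) is not hidden behind kernel execution.
double expected_streamed(double memcopy, double kernel, int n_streams) {
    assert(n_streams > 0);
    return kernel + memcopy / n_streams;
}
```

With the first run's values (memcopy 13.63, kernel 25.15) this gives 38.78 and about 28.56, matching the printed estimates. Under computeprof the measured 4-stream time of 38.99 sits at the serialized estimate of 38.63 rather than the overlapped estimate of 28.54, which is what fully serialized copies and kernels would look like.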

The profiler “decorates” execution with a lot of additional events to enable data logging and instrumentation of a program on the device. This has the effect of serializing actions that would otherwise be asynchronous. There was another thread here about the perils of judging latency and concurrency using only the profiler, if you are interested.