I have programmed a CUDA application that overlaps host-to-device data transfers with kernel execution by using CUDA streams and asynchronous memory copies. I now want to check the concurrency actually achieved with the CUDA Visual Profiler, but when I start a new recording, only sequential behaviour is shown. Nevertheless, when I use CPU or GPU timers, the corresponding parallelization is measurable.
Now my question is whether the CUDA Visual Profiler supports the recording of concurrent data transfers and kernel executions.
The profiler “decorates” execution with a lot of additional events to enable data logging and instrumentation of a program on the device. This has the effect of serializing actions that would otherwise be asynchronous. There was another thread about the perils of judging latency and concurrency just using the profiler here, if you are interested.
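If it helps, here is a minimal sketch of how the overlap could be checked with CUDA events instead of the profiler, along the lines of the GPU timers mentioned in the question. The kernel `busyKernel`, the helper `timePipeline`, and the chunk sizes are made up for illustration and are not taken from the original application. It assumes a device with at least one copy engine and uses pinned host memory, since asynchronous copies from pageable memory generally do not overlap with kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: some arithmetic per element so the kernel runs long
// enough for a copy in another stream to overlap with it.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 200; ++k)
            v = v * 1.0001f + 0.5f;
        data[i] = v;
    }
}

// Issues nChunks (copy, kernel) pairs and returns the elapsed GPU time in ms.
// useStreams == false: everything goes to the default stream (serialized).
// useStreams == true:  each chunk gets its own stream (overlap is possible).
float timePipeline(bool useStreams)
{
    const int nChunks = 4;                 // illustrative values, not tuned
    const int chunkElems = 1 << 20;
    const size_t chunkBytes = chunkElems * sizeof(float);

    float *hostBuf = nullptr;
    float *devBuf = nullptr;
    cudaMallocHost(&hostBuf, nChunks * chunkBytes);  // pinned host memory
    cudaMalloc(&devBuf, nChunks * chunkBytes);
    for (int i = 0; i < nChunks * chunkElems; ++i)
        hostBuf[i] = 1.0f;

    cudaStream_t streams[nChunks];
    for (int s = 0; s < nChunks; ++s)
        cudaStreamCreate(&streams[s]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int c = 0; c < nChunks; ++c) {
        cudaStream_t st = useStreams ? streams[c] : 0;
        size_t off = (size_t)c * chunkElems;
        cudaMemcpyAsync(devBuf + off, hostBuf + off, chunkBytes,
                        cudaMemcpyHostToDevice, st);
        busyKernel<<<(chunkElems + 255) / 256, 256, 0, st>>>(devBuf + off,
                                                             chunkElems);
    }
    // Wait for all streams first, so the stop event is recorded only after
    // every copy and kernel has finished, whatever the default-stream mode.
    cudaDeviceSynchronize();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    for (int s = 0; s < nChunks; ++s)
        cudaStreamDestroy(streams[s]);
    cudaFree(devBuf);
    cudaFreeHost(hostBuf);
    return ms;
}

int main()
{
    timePipeline(true);  // warm-up run to exclude one-time setup cost
    printf("default stream only: %.3f ms\n", timePipeline(false));
    printf("one stream per chunk: %.3f ms\n", timePipeline(true));
    return 0;
}
```

Run it outside the profiler and compare the two numbers. Whether the streamed version actually comes out faster depends on the hardware and on how the copy time compares to the kernel time, so treat the result as an indication rather than a precise measurement.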