Timing Concurrent Kernels

Parallel Nsight currently does not support tracing that shows concurrent kernels
on Fermi GPUs.

Any suggestions on how to show/measure/calculate how kernels are scheduled
concurrently in a simple CUDA script in the meantime, until Nsight is updated?

I can measure the event times of kernels in each stream fine using
cudaEventRecord/cudaEventSynchronize/cudaEventElapsedTime, but I’m not quite
sure how to show that they are running concurrently.

Any help appreciated.

Well, maybe not the desired way, but if you only want to show that the kernels
are being executed concurrently, why not put a CPU timer around the calls and show
that the summed time of the cudaEvents is bigger than the CPU time?
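
A rough sketch of that idea, not a definitive test: launch one small kernel per stream, time each with its own pair of events, and compare the sum of those times with a CPU wall-clock time taken around all the launches. The kernel name spin, the stream count, and the cycle count are made up for illustration. Since kernel launches are asynchronous, the CPU timer has to include a cudaDeviceSynchronize() (assuming CUDA 4.0 or later) or it only measures launch overhead.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Minimal error check so a failed call does not go unnoticed.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical busy-wait kernel: spins for about `cycles` clock ticks.
// One thread in one block leaves room for other kernels to run beside it.
__global__ void spin(long long cycles)
{
    long long begin = clock64();
    while (clock64() - begin < cycles) { }
}

int main()
{
    const int kStreams = 4;                 // assumed stream count
    const long long kCycles = 100000000LL;  // assumed spin length

    cudaStream_t stream[kStreams];
    cudaEvent_t start[kStreams], stop[kStreams];
    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaStreamCreate(&stream[i]));
        CHECK(cudaEventCreate(&start[i]));
        CHECK(cudaEventCreate(&stop[i]));
    }

    // Warm-up launch so context creation is not included in the timing.
    spin<<<1, 1>>>(1);
    CHECK(cudaDeviceSynchronize());

    // CPU timer around all launches plus the final synchronize.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaEventRecord(start[i], stream[i]));
        spin<<<1, 1, 0, stream[i]>>>(kCycles);
        CHECK(cudaEventRecord(stop[i], stream[i]));
    }
    CHECK(cudaDeviceSynchronize());
    auto t1 = std::chrono::steady_clock::now();
    double wallMs = std::chrono::duration<double, std::milli>(t1 - t0).count();

    // Sum of the per-kernel times reported by the events.
    float sumMs = 0.0f;
    for (int i = 0; i < kStreams; ++i) {
        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start[i], stop[i]));
        printf("stream %d: %.3f ms\n", i, ms);
        sumMs += ms;
    }

    printf("sum of event times: %.3f ms\n", sumMs);
    printf("CPU wall time:      %.3f ms\n", wallMs);
    printf("ratio sum/wall:     %.2f\n", sumMs / wallMs);

    for (int i = 0; i < kStreams; ++i) {
        CHECK(cudaEventDestroy(start[i]));
        CHECK(cudaEventDestroy(stop[i]));
        CHECK(cudaStreamDestroy(stream[i]));
    }
    return 0;
}
```

If the ratio is clearly above 1, the kernels must have overlapped; a ratio near 1 suggests they ran one after another. One caveat: the event records placed between the launches may themselves affect how the kernels get scheduled on some hardware, so it is worth also timing the run with only the CPU timer and comparing that against a single kernel's duration multiplied by the number of streams.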

Just a suggestion…

Tobi