I have a kernel that I run in two different streams so that one kernel is executing while the host sets up the next launch (copying memory to and from the GPU). How can I measure the time a single kernel needs to execute in a system with more than one stream? If I use CUDA events to take the time, the results do not seem to be correct. I think they measure the time from launch until the stream's execution finishes. But that is not necessarily the time the kernel needs to execute, because it might first have to wait for the other stream to finish before it can start executing.
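For reference, this is roughly the event-based timing pattern I mean; the kernel name, launch configuration, buffers and the stream array are just placeholders:

```cpp
// Minimal sketch of the event timing I use (kernel/sizes are placeholders).
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Both events are recorded into the same stream as the kernel launch.
cudaEventRecord(start, stream[i]);
myKernel<<<grid, block, 0, stream[i]>>>(d_in[i], d_out[i], n);
cudaEventRecord(stop, stream[i]);

cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
// With a second stream active, ms appears to include time the kernel spends
// waiting for the other stream's work, not only its own execution time.
```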
If I use Nsight Compute/trace, I get perfect timings for each stream. Are these performance figures also available to my application?
What I need to do is this - maybe you have a much better idea of how to achieve it.
I have a pool of different kernels, and one of them will run in 2 streams on a GPU. In parallel (alternating), I need to run another kernel in a single stream that uses about 10% of the compute time the first/main kernel needs. For example:
Kernel A from the pool runs in 2 streams, and I need to run a special kernel Z that uses 10% of the performance of kernel A.
Since the kernels in the pool are all different, I cannot simply run a pool kernel 10 times and kernel Z once. I somehow need to run kernel Z just enough that the performance of kernel A drops by 10% compared to its performance without kernel Z running. The kernels do different things, so I cannot say "if kernel A gives me 1000 results, run kernel Z until it gives me 100 results". And since kernel A is running in 2 streams, I cannot just measure the time it takes, because the results are misleading. A sketch of the intended launch pattern follows below.
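To make the setup concrete, here is a simplified sketch of the launch pattern I have in mind; all kernel names, buffers, sizes and the shouldLaunchZ helper are placeholders, not my actual code:

```cpp
// Simplified sketch of the intended launch pattern (names/sizes are placeholders).
cudaStream_t streamA[2], streamZ;
cudaStreamCreate(&streamA[0]);
cudaStreamCreate(&streamA[1]);
cudaStreamCreate(&streamZ);

for (int i = 0; i < numBatches; ++i) {
    int s = i % 2;  // kernel A alternates between the two streams

    // Host sets up this batch (copy input) while the other stream's
    // kernel A is ideally still executing.
    cudaMemcpyAsync(d_inA[s], h_inA[i], bytesA, cudaMemcpyHostToDevice, streamA[s]);
    kernelA<<<gridA, blockA, 0, streamA[s]>>>(d_inA[s], d_outA[s]);
    cudaMemcpyAsync(h_outA[i], d_outA[s], bytesA, cudaMemcpyDeviceToHost, streamA[s]);

    // Kernel Z runs concurrently in its own stream from time to time.
    // The open question is how often to launch it so that kernel A only
    // loses about 10% of its throughput.
    if (shouldLaunchZ(i)) {
        kernelZ<<<gridZ, blockZ, 0, streamZ>>>(d_inZ, d_outZ);
    }
}
```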
I hope this sheds some light on why I am trying to measure the kernel times on a per-stream basis.