I have a GPU application that launches approximately 2000 kernels during execution. What is the best way to measure the kernel launch latencies and the kernel receive latencies (the time between a kernel completing and the CPU starting to process its data)?
This can be done with the NVIDIA Nsight Systems visual profiler, but I would have to go through the kernels one by one to gather the data, so I am looking for a much more efficient way. Thanks!