How to profile kernel launch overhead?

Does the cudaLaunchKernel range in Nsight Systems reflect the kernel launch overhead?

I’m calling cusparseSpMV inside a for loop, and the loop count can be as large as 10^6.
So kernels will be launched that many times. I want to profile the kernel launch overhead.
How can I do this?

Please take a look at https://developer.nvidia.com/blog/understanding-the-visualization-of-overhead-and-latency-in-nsight-systems/, where I tried to go into depth on latency.
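
If you also want a rough number outside the profiler, one common approach is to time a loop of empty kernel launches on the host. The following is only a minimal sketch under a few assumptions: `empty_kernel`, `avg_us`, and the iteration count are made up for illustration, and an empty kernel stands in for cusparseSpMV, which may launch more than one kernel per call and does extra host-side work. It reports two numbers: the average host time spent issuing a launch (roughly what the cudaLaunchKernel range on the CUDA API row covers), and the average end-to-end time per launch once the GPU has drained the queue.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",          \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical empty kernel: it does no work, so the measured time is
// dominated by launch overhead rather than by computation.
__global__ void empty_kernel() {}

// Hypothetical helper: average microseconds per iteration.
static double avg_us(double total_us, int iters) {
    return iters > 0 ? total_us / iters : 0.0;
}

int main() {
    // Assumption: 10^5 launches keeps the run short; raise it to match your loop.
    const int iters = 100000;
    using clock = std::chrono::steady_clock;
    using us = std::chrono::duration<double, std::micro>;

    // Warm-up launch so context creation and module loading are not timed.
    empty_kernel<<<1, 1>>>();
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Launches are asynchronous: the loop measures host-side issue time only.
    auto t0 = clock::now();
    for (int i = 0; i < iters; ++i) {
        empty_kernel<<<1, 1>>>();
    }
    auto t1 = clock::now();

    // Wait for the GPU to finish all queued launches.
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());
    auto t2 = clock::now();

    std::printf("host issue time per launch: %.3f us\n",
                avg_us(us(t1 - t0).count(), iters));
    std::printf("end-to-end time per launch: %.3f us\n",
                avg_us(us(t2 - t0).count(), iters));
    return 0;
}
```

Build it with something like `nvcc -O2 launch_overhead.cu -o launch_overhead` (the file name is arbitrary). Keep in mind that if the host issues launches faster than the GPU can retire them, the launch queue can fill up and the launch call itself may block, so the host-side number can include back-pressure as well as pure API overhead. Running the same binary under Nsight Systems lets you compare these averages against the cudaLaunchKernel ranges in the timeline.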