Does cudaLaunchKernel in Nsight Systems reflect the kernel launch overhead?
I’m calling cusparseSpMV in a for loop, and the iteration count can be as large as 10^6.
So the kernel will be launched that many times, and I want to profile the kernel launch overhead.
How can I do this?