cudaLaunchKernel very slow? (Edit: The problem is with Nsight Systems.)

I’m trying to improve the performance of some code. After significantly reducing kernel times (according to Nsight Compute) at the cost of a few more kernel calls, the total run time actually went up.

I tried Nsight Systems (for the first time), and something looks suspicious in the CUDA summary of the Stats System View. Here are the top three lines:

Time    Total Time   Instances  Avg         Med         Min         Max         StdDev      Category     Operation
81.2%   42.985 s     8717       4.931 ms    2.740 ms    5.600 μs    29.521 ms   4.677 ms    CUDA_API     cudaLaunchKernel
14.5%   7.704 s      3433       2.244 ms    1.026 ms    855.836 μs  10.104 ms   2.818 ms    CUDA_KERNEL  void Kernels::Update...
1.2%    626.529 ms   4156       150.753 μs  69.666 μs   64.226 μs   877.150 μs  205.191 μs  CUDA_KERNEL  void Kernels::Copy...

Can someone explain to me what’s going on?

Edit: This is being run on a laptop with a Ryzen 7 5800H, GeForce 3080 16GB, 64GB system RAM, Windows 10. I’m using CUDA 11.4.

It looks like the problem is Nsight Systems itself. Testing with a much simpler example that prints to stderr, execution appears to get slower the longer it runs, presumably due to the stats collection. I assume, then, that the cudaLaunchKernel times represent the overhead of Nsight Systems itself.
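
For anyone who wants to repeat this kind of check, here is a minimal sketch of a test that times the launch call on the host and prints per-batch averages to stderr. It is not the real code: `dummyKernel`, the buffer size, and the launch and batch counts are all made-up placeholders. The idea is to run the same binary once on its own and once under Nsight Systems and compare the printed numbers. If the averages only climb when the profiler is attached, that points to profiling overhead rather than the launches themselves.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel, NOT the Kernels::Update... kernel from the post.
__global__ void dummyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 1.0001f + 1.0f;
}

int main() {
    // All sizes and counts below are arbitrary assumptions for illustration.
    const int n = 1 << 20;
    const int launches = 10000;
    const int reportEvery = 1000;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0, n * sizeof(float));

    using Clock = std::chrono::steady_clock;
    double batchUs = 0.0;

    for (int i = 1; i <= launches; ++i) {
        // Time only the host-side launch call.
        auto t0 = Clock::now();
        dummyKernel<<<blocks, threads>>>(d, n);
        auto t1 = Clock::now();
        batchUs += std::chrono::duration<double, std::micro>(t1 - t0).count();

        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess) {
            std::fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
            cudaFree(d);
            return 1;
        }

        // Wait outside the timed region so each launch goes into an empty
        // queue and the measurement does not include waiting on earlier kernels.
        cudaDeviceSynchronize();

        if (i % reportEvery == 0) {
            std::fprintf(stderr, "launches %d-%d: avg host launch time %.1f us\n",
                         i - reportEvery + 1, i, batchUs / reportEvery);
            batchUs = 0.0;
        }
    }

    cudaFree(d);
    return 0;
}
```

Synchronizing after every launch is a deliberate choice here. Kernel launches are asynchronous, so without the sync a launch call may also include time spent waiting for room in the driver's queue, which would blur the comparison. The absolute numbers will vary by driver and OS, so only the trend across batches, with and without the profiler, is meaningful.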