Does cudaOccupancyMaxActiveBlocks return the block count by taking into account other co-running kernels?

My suggestion would be to develop the shortest possible complete test case and then file a bug. Based on what I read here, I would file the bug against the occupancy call disrupting the work issuance. If you also want to go after the increase in launch latency, by all means file a bug for that too.

I’m not sure if you are trying to add information here. nvprof is “the profiler”; nvvp uses nvprof under the hood to do its work. And I repeat, those are not the recommended profilers for Turing GPUs. So I would definitely reconfirm the observations with the recommended profiler first (with respect to the increase in launch latency, not the occupancy question).