will show you any and all kernels launched, including the number of blocks and threads associated with each launch
To me, worrying about something like this is misplaced, or suggests that you may want to rethink your app design.
If you’re doing cudaFree at the end of your app, who cares? If a difference of 750 milliseconds in run time is important to your overall performance scenario, your problem is too small to be interesting on GPUs.
If you’re doing cudaFree in a perf-sensitive loop (where this kind of thing could add up), then that is as good a reason as any I can think of not to do that. If you’re doing cudaFree in a loop, you are almost certainly doing some kind of allocation (e.g. cudaMalloc) in that loop as well, and that will cost you too. So don’t allocate/free/allocate/free/allocate/free.
Allocate once, at the beginning of your app, then reuse your allocations.
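A minimal sketch of that pattern, with a hypothetical kernel and buffer size standing in for your real work:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the real per-iteration work.
__global__ void process(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;          // example buffer size
    float *d_buf = nullptr;

    // Allocate once, up front -- not inside the loop.
    cudaMalloc(&d_buf, N * sizeof(float));

    for (int iter = 0; iter < 1000; ++iter) {
        // Reuse the same allocation every iteration;
        // no cudaMalloc/cudaFree in the hot loop.
        process<<<(N + 255) / 256, 256>>>(d_buf, N);
    }
    cudaDeviceSynchronize();

    // Free once, at the end of the app.
    cudaFree(d_buf);
    return 0;
}
```

The allocation and the free each happen exactly once, so their cost is amortized over the whole run instead of being paid on every iteration.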