The achieved occupancy result does not make sense

Hi,
I recently used Nsight to profile the achieved occupancy of our deep learning program with different batch sizes, running on an RTX 2080 Ti.

I first ran the training job without Nsight, and the execution time halved when I doubled the batch size from 128 to 256, which would be reasonable if the occupancy also doubled.

However, when I used Nsight to check the occupancy for these two batch sizes, the average occupancy* only increased slightly, from 45% to 48%. I know that there is some overhead when profiling, but the result is too far from what we expected.

So, I have two questions:
Question 1: Does anyone know the reason for this result? Or is it simply that the overhead caused by Nsight is too large when profiling?
Question 2: Is there any other way to measure the occupancy?

Thanks

* We take as the average occupancy the weighted average of each kernel's achieved occupancy, weighted by that kernel's number of cycles.
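
For clarity, here is a minimal sketch of how we compute that weighted average. The kernel names, cycle counts, and occupancy values below are made up for illustration; the real values come from the per-kernel profiler output.

```python
# Cycle-weighted average of per-kernel achieved occupancy.
# All names and numbers below are placeholders for illustration only.
kernels = [
    # (kernel name, elapsed cycles, achieved occupancy in %)
    ("gemm_kernel",     1_200_000, 52.0),
    ("conv_kernel",       800_000, 41.5),
    ("elementwise_add",   150_000, 30.0),
]

total_cycles = sum(cycles for _, cycles, _ in kernels)
avg_occupancy = sum(cycles * occ for _, cycles, occ in kernels) / total_cycles

print(f"Cycle-weighted average occupancy: {avg_occupancy:.1f}%")
```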

You can find detailed information on the tool overhead at https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#overhead

In general, tool overhead does not affect the computation of our occupancy metrics, neither theoretical nor achieved. Whether it affects your own “average occupancy” metric depends on exactly how that metric is computed. Nsight Compute shows detailed occupancy information on the Details page when collecting the Occupancy section, which is part of the default set of sections.

Note that, as described in https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#reproducibility, Nsight Compute collects very detailed per-kernel information. To do so, it applies certain changes that increase the reproducibility of the results, e.g. adjusting clocks and flushing caches, and it serializes kernel execution. Depending on what exactly you want to measure, you can change the first two with command-line flags, but be aware that this might affect the results if your selected metrics require multiple replay passes.
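
As a rough sketch of what such an invocation could look like (wrapped in Python here; the training command and report name are placeholders, and flag behavior should be double-checked against the documentation for your Nsight Compute version):

```python
# Minimal sketch: launch ncu with the clock/cache adjustments disabled and
# collect the Occupancy section. Application command and report name are placeholders.
import subprocess

cmd = [
    "ncu",
    "--section", "Occupancy",    # collect per-kernel theoretical/achieved occupancy
    "--clock-control", "none",   # do not lock GPU clocks to their base values
    "--cache-control", "none",   # do not flush GPU caches between replay passes
    "-o", "occupancy_report",    # write the report to occupancy_report.ncu-rep
    "python", "train.py",        # placeholder for your training command
]

subprocess.run(cmd, check=True)
```

Kernel serialization itself is not covered by these flags.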