I recently used Nsight to profile the achieved occupancy of our deep learning program at different batch sizes, running on an RTX 2080 Ti.
I first ran the training job without Nsight, and the execution time halved as I doubled the batch size from 128 to 256, which would be reasonable if the occupancy also doubled.
However, when I used Nsight to check the occupancy at these two batch sizes, the average occupancy* increased only slightly, from 45% to 48%. I know there is some overhead when profiling, but the result is far from what we expected.
So, I have two questions:
Question 1: Does anyone know the reason for this result? Or is it simply that the overhead caused by Nsight is too large when profiling?
Question 2: Is there any other way to measure the occupancy?
* We take the average achieved occupancy of each kernel, weighted by its number of cycles, as the overall average occupancy.
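For reference, this is the cycle-weighted average I mean, as a minimal Python sketch (the per-kernel occupancy and cycle values are made-up examples, not our measurements):

```python
def weighted_avg_occupancy(kernels):
    """Cycle-weighted average of per-kernel achieved occupancy.

    kernels: list of (achieved_occupancy, cycles) pairs, one per kernel.
    """
    total_cycles = sum(cycles for _, cycles in kernels)
    return sum(occ * cycles for occ, cycles in kernels) / total_cycles

# Hypothetical kernels: one short kernel at 50% occupancy,
# one long kernel at 25% occupancy.
kernels = [(0.50, 1_000_000), (0.25, 3_000_000)]
print(weighted_avg_occupancy(kernels))  # 0.3125
```

So a long-running low-occupancy kernel dominates the average, which is why we weight by cycles rather than averaging kernels equally.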