Methodology for the choice of metrics for Nsight Compute Sections?

I’ve noticed that profiling is much faster when the metrics I collect all come from a single section. Is this due to hardware constraints, or was Nsight Compute programmed in a way that optimizes for the specific metrics grouped within a section? If I were able to profile a given set of metrics from the same section in Nsight Compute, and then collected the same metrics with CUPTI (assuming they are available there), would the number of kernel replays needed remain the same?

Yes, metrics come from different providers, and even within the same provider, not all metrics can be collected in the same pass due to hardware limitations. You can find more detail about this here and about overhead here.
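As an illustration, you can request an explicit metric list instead of a whole section and watch how it schedules (a sketch; `./my_app` is a placeholder and the metric names are just examples, so the reported pass count will depend on your concrete chip):

```
# Collect two explicitly chosen metrics instead of a full section.
# ncu prints the number of passes it replays for each profiled kernel,
# which is the simplest way to see how a given metric set schedules
# on your GPU. ./my_app is a placeholder application.
ncu --metrics sm__cycles_elapsed.avg,dram__bytes.sum ./my_app
```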

If I were able to profile a given set of metrics from the same section in Nsight Compute, and then collected the same metrics with CUPTI (assuming they are available there), would the number of kernel replays needed remain the same?

Yes, if you choose the same specific metrics, you should end up with the same number of passes. The pass count is not determined by the number of metrics in general, but by which ones are chosen. Unfortunately, there is no good rule of thumb for determining this offline for an arbitrary set of metrics; you only know the pass count once they are actually scheduled on the concrete chip (there are some groups of metrics which are known to be collectable in a single pass, though).
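You can check this empirically by comparing the pass counts ncu reports for different selections; requesting the same list through CUPTI’s profiling APIs should then schedule into the same number of passes (a sketch; the metric names are examples, and actual pass counts vary per architecture):

```
# A single hardware counter metric will often schedule in one pass,
# while a larger selection forces kernel replay; ncu reports the pass
# count per profiled kernel either way. ./my_app is a placeholder.
ncu --metrics sm__warps_active.avg.pct_of_peak_sustained_active ./my_app
ncu --set full ./my_app   # a large predefined set: expect many more passes
```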

For certain types of data collection, like PM sampling and warp state sampling, ncu dynamically determines the number of passes beyond the minimum required, in order to find optimal sampling parameters. This can differ from collecting the same data through CUPTI. If all sampling parameters are set explicitly by the user, the tool should replay only the minimal number of passes.
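For example, you can pin the warp state sampling parameters yourself rather than letting ncu search for them (a sketch; the section name and option values here are arbitrary examples, and the right values depend on your workload):

```
# Request the warp state sampling section with the sampling interval,
# buffer size, and maximum pass count all set by the user; with nothing
# left for the tool to tune, ncu should replay only the minimal number
# of passes. ./my_app is a placeholder.
ncu --section WarpStateStats \
    --sampling-interval 10 \
    --sampling-buffer-size 33554432 \
    --sampling-max-passes 5 \
    ./my_app
```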