Timeline View of Using PM Sampling to Get Tensor Core Utilization

Hi, I have two questions regarding PM sampling timelines in Nsight Compute.

  1. I noticed that only pmsampling:sm__pipe_tensor_cycles_active_realtime_v2.avg.pct_of_peak_sustained_elapsed produces a usable PM sampling timeline (with per-sample timestamps) in the CLI raw report. Replacing avg with min or max (e.g., .min.pct_of_peak_sustained_elapsed) does not expose a timeline, even though those variants are valid metrics. Is the timeline intentionally limited to the avg aggregation for this metric family?

  2. For the same metric, the PM sampling timestamps shown in the CLI raw report appear to use a different time origin than the GUI timeline. For example, the first non-zero sample appears at 260,000 ns in the raw text, while the GUI timeline shows the same transition at 4,000 ns. Am I correct that the GUI re-normalizes PM sampling timestamps relative to the NVTX range or kernel start, and that this alignment information is not exposed in the CLI raw output?

matmul_tensor_pm_timeline.txt (387.6 KB)

Replacing avg with min or max (e.g., .min.pct_of_peak_sustained_elapsed) does not expose a timeline, even though those variants are valid metrics.

They are not valid metrics for PM sampling. If you run ncu --query-metrics-collection pmsampling --metrics sm__pipe_tensor_cycles_active_realtime_v2 to list all metric suffixes for this base metric name (in the context of PM sampling), you will only see sub-metrics with avg/max.pct_of_peak_sustained_elapsed. This is different from the default profiling collection mode.

Am I correct that the GUI re-normalizes PM sampling timestamps relative to the NVTX range or kernel start, and that this alignment information is not exposed in the CLI raw output?

Yes, the UI normalizes the timestamps in the table to be relative to the first sample’s timestamp to make it easier to associate the two places. In addition, if the collection is context-switched (as in your case, see the ContextSwitched Yes entry in the table), the metrics in the UI are filtered to only show samples for the CUDA context of interest. The CLI raw output isn’t filtered like this (at this point). You can technically do that yourself, either manually or with the Python Report Interface, by using the metrics that track the context switch trace, but it’s not trivial. You can disable the context switch filtering in the UI using the timeline’s right-click context menu.
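To illustrate the two steps described above (filtering to the context of interest, then making timestamps relative to the first sample), here is a minimal Python sketch. It does not use the Nsight Compute Python Report Interface; the Sample type, the in_target_context flag, and the sample values are all hypothetical stand-ins for whatever you extract from the raw output and the context switch trace. Filtering before rebasing is also an assumption, since the exact order the UI uses is not stated here.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    timestamp_ns: int        # raw timestamp, as it would appear in the CLI output
    value: float             # metric value for this sample
    in_target_context: bool  # hypothetical flag derived from the context switch trace


def filter_target_context(samples: List[Sample]) -> List[Sample]:
    """Keep only samples taken while the context of interest was active."""
    return [s for s in samples if s.in_target_context]


def rebase_to_first_sample(samples: List[Sample]) -> List[Sample]:
    """Shift timestamps so that the first remaining sample is at 0 ns."""
    if not samples:
        return []
    origin = samples[0].timestamp_ns
    return [s._replace(timestamp_ns=s.timestamp_ns - origin) for s in samples]


# Made-up data, for illustration only (not taken from any real report).
raw = [
    Sample(1_000, 0.0, False),
    Sample(3_000, 0.0, False),
    Sample(5_000, 0.0, True),
    Sample(7_000, 42.5, True),
    Sample(9_000, 40.0, True),
]

# Assumption: filter first, then rebase to the first retained sample.
ui_like = rebase_to_first_sample(filter_target_context(raw))
for s in ui_like:
    print(s.timestamp_ns, s.value)
```

With real data, the hard part is deriving the in_target_context flag from the context switch trace metrics, which this sketch does not attempt.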

I wonder if the time needed to do the context switch is deterministic or not. On my side, there is a 256 us gap between the raw output and the GUI view. Is it a coincidence?

The time to context switch and the time other contexts run when the target context is not active on the GPU are non-deterministic.
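For reference, the 256 us figure is simply the difference between the two timestamps quoted in the original question. A quick check in Python, using only those two numbers (the variable names are just for illustration):

```python
# Timestamps of the first non-zero sample, as quoted in the question.
raw_first_nonzero_ns = 260_000  # CLI raw report
gui_first_nonzero_ns = 4_000    # GUI timeline

# Offset between the two time origins for this particular run.
offset_ns = raw_first_nonzero_ns - gui_first_nonzero_ns
offset_us = offset_ns / 1_000   # 1 us = 1,000 ns

print(offset_ns, "ns =", offset_us, "us")
```

Since the context switch timing is non-deterministic, this offset should be expected to differ from run to run.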