SM utilization exceeds 100% when profiling an app using multiple streams

I have an OpenACC application that uses multiple streams (queues in OpenACC terminology) to execute multiple kernels concurrently. I am trying to measure how the SM utilization improves with the number of streams using range replay mode.

The strange thing is that the “Compute (SM) Throughput” shown in the GPU Speed Of Light Analysis increases roughly linearly with the number of queues and eventually exceeds 100%. I would like to confirm whether these profiling results are valid.

| Queue count | SM utilization [%] |
|-------------|--------------------|
| 1           | 46.26              |
| 2           | 52.28              |
| 3           | 59.07              |
| 4           | 66.62              |
| 5           | 73.22              |
| 6           | 79.85              |
| 7           | 86.30              |
| 8           | 93.59              |
| 9           | 98.73              |
| 10          | 105.63             |

The command I use to invoke Nsight Compute is:

/opt/nvidia/hpc_sdk/Linux_x86_64/24.5/compilers/bin/ncu --replay-mode range --nvtx --nvtx-include NLMNT2/ ./a.out

Any help would be appreciated.

SM Throughput (sm__throughput.avg.pct_of_peak_sustained_elapsed) is a multi-pass metric, meaning the workload needs to be replayed multiple times to collect and compute it. The less deterministic the workload is across these replays, and the shorter the range, the more likely you are to see small inaccuracies, because sub-counters collected in different passes no longer fit together exactly, compared to the case where everything can be collected in a single pass.

Given that in your experiment the range likely becomes shorter and less deterministic as more queues are used, I think your results are consistent with that.

If your app itself is deterministic, I would recommend using --replay-mode app-range, as it is less intrusive to your application’s behavior. It will, however, restart the application itself N times to collect N passes.
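Concretely, that would be the same invocation as above with the replay mode swapped (a sketch assuming the same binary and NVTX range; the installation path and range name are copied from the original command):

```shell
# app-range replay: ncu restarts the application once per pass
# instead of replaying the range in-process, which perturbs the
# app's behavior less.
/opt/nvidia/hpc_sdk/Linux_x86_64/24.5/compilers/bin/ncu \
    --replay-mode app-range --nvtx --nvtx-include NLMNT2/ ./a.out
```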

