Nsight Compute reporting "nan" for most values for Perlmutter profile

Hi all,

I’m trying to understand the performance of my kernel on Perlmutter (A100 40GB). I can capture a profile, but when I open it on my local machine it shows “nan” for virtually every value, including most metrics for memory throughput, bandwidth, hit rates, and compute throughput, which makes it virtually useless. I’ve uploaded the profile to: https://easyupload.io/p87u5w

I generated it using
srun -n 1 -c 8 --cpu-bind=cores --gpus-per-task=1 ncu --set full --profile-from-start 0 --force-overwrite --export ncu_output --print-summary per-kernel --replay-mode application

with driver version 450.162 and ncu version 2022.1.0.0 (build 30763755), and opened it with Nsight Compute 2022.1.1.0.

Note that I’m disabling profiling from start and using cudaProfilerStart and cudaProfilerStop to capture the code region of interest.

I’d appreciate any help!

Edit: I should add that most (but not all) of the “nan” metrics in the GUI have a little yellow exclamation mark next to them, but I see no errors either when collecting the profile or when loading it under Nsight Compute.

Is your application deterministic, as required by the selected application replay mode? By this, I mean that kernels with the same name and grid size execute in the same order and with the same input parameters every time the application is run. Also, is it guaranteed in your environment that only a single ncu process is trying to write to this report file at the same time (it appears so from your srun command, but it’s worth checking)?
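
As a rough way to test the determinism condition, one could record the kernel launches from two separate runs of the application and compare them in order. The sketch below is only an illustration: it assumes you have already obtained, by whatever means, a list of (kernel name, grid size) pairs per run. The function name first_divergence and the sample data are hypothetical. It also checks only names and grid sizes, not input parameters, so a clean result does not prove the application is deterministic.

```python
from itertools import zip_longest


def first_divergence(run_a, run_b):
    """Return (index, launch_a, launch_b) for the first launch that differs
    between two runs, or None if both sequences are identical.

    Each run is a list of (kernel_name, grid_size) tuples in launch order.
    A missing launch in the shorter run is reported as None.
    """
    for index, (a, b) in enumerate(zip_longest(run_a, run_b)):
        if a != b:
            return index, a, b
    return None


# Hypothetical launch logs from two runs of the same application.
run1 = [("init", (128, 1, 1)), ("solve", (256, 1, 1)), ("reduce", (1, 1, 1))]
run2 = [("init", (128, 1, 1)), ("solve", (512, 1, 1)), ("reduce", (1, 1, 1))]

print(first_divergence(run1, run1))  # identical sequences
print(first_divergence(run1, run2))  # grid size of the second launch differs
```

If this reports a divergence, application replay would be matching metrics from different passes to different kernel instances, which could be one explanation for missing or invalid values.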