Hi all,
I’m trying to understand the performance of my kernel on Perlmutter (A100 40GB). I can capture a profile, but when I open it on my local machine it shows “nan” for virtually every value, including most of the memory throughput, bandwidth, hit rate, and compute throughput metrics, which makes it virtually useless. I’ve uploaded the profile to: https://easyupload.io/p87u5w
I generated it using:
srun -n 1 -c 8 --cpu-bind=cores --gpus-per-task=1 ncu --set full --profile-from-start 0 --force-overwrite --export ncu_output --print-summary per-kernel --replay-mode application
The driver version is 450.162, and ncu --version reports 2022.1.0.0 (build 30763755). I opened the report with Nsight Compute 2022.1.1.0.
Note that I’m disabling profiling from start (--profile-from-start 0) and using cudaProfilerStart and cudaProfilerStop to capture the code region of interest.
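For context, here is a minimal sketch of how the region is bracketed. It is not my actual code: the kernel `scale`, the helper `run_profiled_region`, and the problem size are placeholders I made up for illustration. The only real API calls are the CUDA runtime ones, including `cudaProfilerStart` and `cudaProfilerStop` from `cuda_profiler_api.h`.

```cuda
#include <cuda_runtime.h>
#include <cuda_profiler_api.h>

// Placeholder kernel (hypothetical): stands in for the real kernel of interest.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Hypothetical helper: runs one unprofiled warm-up launch, then one launch
// inside the cudaProfilerStart/cudaProfilerStop range.
cudaError_t run_profiled_region(int n) {
    float *d_x = nullptr;
    cudaError_t err = cudaMalloc(&d_x, n * sizeof(float));
    if (err != cudaSuccess) return err;
    cudaMemset(d_x, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch: outside the profiled range, so it is skipped
    // when ncu is started with --profile-from-start 0.
    scale<<<blocks, threads>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();

    // Only launches between these two calls are meant to be captured.
    cudaProfilerStart();
    scale<<<blocks, threads>>>(d_x, 2.0f, n);
    cudaDeviceSynchronize();
    cudaProfilerStop();

    err = cudaGetLastError();
    cudaFree(d_x);
    return err;
}

int main() {
    return run_profiled_region(1 << 20) == cudaSuccess ? 0 : 1;
}
```

With this layout I would expect only the second launch to show up in the report.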
I’d appreciate any help!
Edit: I should add that most (but not all) of the “nan” metrics in the GUI have a little yellow exclamation mark next to them, but I see no errors either when collecting the profile or when loading it under Nsight Compute.