Error Collecting Nsys Profile Metrics

I am running the command below from a Slurm batch file to profile a distributed DNN training script.

srun nsys profile --sample=none --cpuctxsw=none --nic-metrics=true --gpu-metrics-device=0  --trace=mpi,cuda,nvtx,cudnn --output=logs/nsys_logs%h torchrun $script
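
For context, the surrounding batch file looks roughly like this (job name, resource requests, and paths are placeholders, and $script points at my training script):

#!/bin/bash
#SBATCH --job-name=nsys_profile        # placeholder job name
#SBATCH --nodes=1                      # adjust to the actual allocation
#SBATCH --ntasks-per-node=1            # one launcher task; torchrun spawns the workers
#SBATCH --gres=gpu:1                   # one Tesla T4 per node
#SBATCH --output=slurm_%j.out          # Slurm log, %j expands to the job ID

script=train.py                        # placeholder path to the training script
mkdir -p logs

srun nsys profile --sample=none --cpuctxsw=none --nic-metrics=true \
    --gpu-metrics-device=0 --trace=mpi,cuda,nvtx,cudnn \
    --output=logs/nsys_logs%h torchrun $script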

Then I visualize the report with Nsight Systems.

I have multiple issues:

First:
GPU metrics are shown only for the first 15 seconds; after that no data appears, and I get this error:

Event requestor failed: Source ID=
Type=ErrorInformation (18)
 Properties:
  ErrorText (100)=GPU Metrics [0]: NVPA_STATUS_ERROR
- API function: Nvpw.GPU_PeriodicSampler_DecodeCounters_V2(&params)
- Error code: 1
- Source function: virtual QuadDDaemon::EventSource::PwMetrics::PeriodicSampler::DecodeResult QuadDDaemon::EventSource::{anonymous}::GpuPeriodicSampler::DecodeCounters(uint8_t*, size_t) const
- Source location: /dvs/p4/build/sw/devtools/Agora/Rel/CUDA12.3/QuadD/Target/quadd_d/quadd_d/jni/EventSource/GpuMetrics.cpp:248

Second:
NIC metrics are not showing any data exchange, while reading the counters manually from

/sys/class/infiniband/<interface id>/ports/<port number>/counters/

shows data rates of 250 MB/s and more.
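
For reference, this is roughly how I sample those counters by hand (device mlx5_0 and port 1 are placeholders; the port_*_data counters are in units of 4 bytes, hence the multiplication by 4):

# rough receive-bandwidth check from the IB port counters; Ctrl-C to stop
CNT=/sys/class/infiniband/mlx5_0/ports/1/counters
while true; do
  a=$(cat $CNT/port_rcv_data)
  sleep 1
  b=$(cat $CNT/port_rcv_data)
  echo "rx: $(( (b - a) * 4 / 1000000 )) MB/s"
done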

Third:
MPI traces are not showing; maybe they are not collected at all.
I tried both openmpi and mpich with --mpi-impl, but neither worked.

Fourth:
How can I collect InfiniBand switch statistics in a Slurm sbatch environment?
I.e., how do I use --ib-switch-metric?
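
So far the only things I can do are list the switch GUIDs and check how my nsys build spells the option; both commands below are just what I would try (ibswitches comes from the infiniband-diags package and may need fabric-query permissions):

# list InfiniBand switches and their GUIDs
ibswitches

# check the exact name and argument format of the IB-switch option in this nsys build
nsys profile --help | grep -i "ib-switch"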

Fifth:
I have this error too:
Could not parse 97 CUPTI activity records. Please try updating the CUDA driver or use more recent profiler version.

System Details

I visualize on this device:

Nsight Systems:
Chip: Apple M2
OS: 14.2 (23C64)
Version: 2024.1.1.59-241133802077v0 OSX.
Qt version: 6.3.2.
Google Protocol Buffers version: 3.21.1.
Boost version: 1.78.0.

I run experiments on these devices:

$ nsys -v
NVIDIA Nsight Systems version 2023.3.3.42-233333266658v0
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:81:00.0 Off |                    0 |
| N/A   44C    P0              28W /  70W |      2MiB / 15360MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
$ nsys status -e

Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 1
Linux Distribution = CentOS
Linux Kernel Version = 3.10.0-1127.19.1.el7.x86_64: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail

I do not have root access, since it is a supercomputer environment that I do not control.

Okay, first of all, I am going to recommend that you get a newer version of Nsys. You are about a year out of date, and the GPU metrics in particular have undergone a fair number of improvements in that time.
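
If your cluster uses environment modules, you can usually pick up a newer CLI without root access; something along these lines (the module names are only guesses, check what your site actually provides):

# look for a newer Nsight Systems / NVIDIA HPC SDK module on the cluster
module avail 2>&1 | grep -i -E "nsight|nvhpc|cuda"
module load nsight-systems/2024.1   # hypothetical module name
nsys -v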

@pkovalenko do you have thoughts on the GPU metric issues?

@rdietrich do you have thoughts on the MPI issues?

I guess that you are running on a single node only, right? If that's the case, MPI is probably not used; communication between GPUs is likely done with NCCL. Do you see NCCL calls in your Nsight Systems report?
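
One quick way to check, for example, is to summarize the GPU kernels in the report and grep for NCCL (the report name below is from recent nsys versions and may differ in older ones; replace <hostname> with whatever %h expanded to):

# kernel summary from the collected report, filtered for NCCL kernels
nsys stats --report cuda_gpu_kern_sum logs/nsys_logs<hostname>.nsys-rep | grep -i nccl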

It seems that PyTorch uses NCCL by default.
I thought it was MPI. Sorry for the confusion.