I am running the following command inside a Slurm batch file to profile a distributed DNN training script:
srun nsys profile --sample=none --cpuctxsw=none --nic-metrics=true --gpu-metrics-device=0 --trace=mpi,cuda,nvtx,cudnn --output=logs/nsys_logs%h torchrun $script
Then I visualize the results with Nsight Systems.
I have multiple issues:
First:
GPU metrics are shown only for the first 15 seconds; after that, no data appears. I get this error:
Event requestor failed: Source ID=
Type=ErrorInformation (18)
Properties:
ErrorText (100)=GPU Metrics [0]: NVPA_STATUS_ERROR
- API function: Nvpw.GPU_PeriodicSampler_DecodeCounters_V2(&params)
- Error code: 1
- Source function: virtual QuadDDaemon::EventSource::PwMetrics::PeriodicSampler::DecodeResult QuadDDaemon::EventSource::{anonymous}::GpuPeriodicSampler::DecodeCounters(uint8_t*, size_t) const
- Source location: /dvs/p4/build/sw/devtools/Agora/Rel/CUDA12.3/QuadD/Target/quadd_d/quadd_d/jni/EventSource/GpuMetrics.cpp:248
Second:
NIC metrics do not show any data exchange, whereas when I collect the data manually from
/sys/class/infiniband/<interface id>/ports/<port number>/counters/
it shows data exchange at 250 MB/s and more.
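A manual check of this kind can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not necessarily the exact script used here: it assumes the standard port_xmit_data and port_rcv_data counter files, which count 4-byte words rather than bytes, and it ignores counter wraparound. The helper names and the device/port in the usage comment are hypothetical.

```python
import time
from pathlib import Path

# Assumption: port_xmit_data / port_rcv_data count 4-byte words, not bytes.
WORD_BYTES = 4


def read_counter(counter_dir, name):
    """Read one integer counter file, e.g. port_xmit_data."""
    return int((Path(counter_dir) / name).read_text().strip())


def throughput_mb_s(before, after, seconds):
    """Convert a counter delta (in 4-byte words) to MB/s (10^6 bytes per second)."""
    return (after - before) * WORD_BYTES / seconds / 1e6


def sample(counter_dir, interval=1.0):
    """Measure transmit and receive rates over `interval` seconds."""
    names = ("port_xmit_data", "port_rcv_data")
    first = {n: read_counter(counter_dir, n) for n in names}
    time.sleep(interval)
    second = {n: read_counter(counter_dir, n) for n in names}
    return {n: throughput_mb_s(first[n], second[n], interval) for n in names}


# Usage (hypothetical device and port; substitute the real
# <interface id> and <port number>):
#   sample("/sys/class/infiniband/mlx5_0/ports/1/counters")
```

Running this on the compute node while the training job is active gives a rate to compare against what the NIC metrics row shows in the timeline.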
Third:
MPI traces are not showing. Maybe they are not collected at all.
I tried both openmpi and mpich with --mpi-impl, but neither worked.
Fourth:
How do I collect IB stats in a SLURM sbatch environment?
I.e., how do I use --ib-switch-metric?
Fifth:
I have this error too:
Could not parse 97 CUPTI activity records. Please try updating the CUDA driver or use more recent profiler version.
System Details
I visualize on this device:
Nsight Systems:
Chip: Apple M2
OS: 14.2 (23C64)
Version: 2024.1.1.59-241133802077v0 OSX.
Qt version: 6.3.2.
Google Protocol Buffers version: 3.21.1.
Boost version: 1.78.0.
I run experiments on these devices:
$ nsys -v
NVIDIA Nsight Systems version 2023.3.3.42-233333266658v0
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:81:00.0 Off |                    0 |
| N/A   44C    P0              28W /  70W |      2MiB / 15360MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
$ nsys status -e
Timestamp counter supported: Yes
CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 1
Linux Distribution = CentOS
Linux Kernel Version = 3.10.0-1127.19.1.el7.x86_64: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail
I do not have root access, since it’s a supercomputer environment that I don’t control.