Error Collecting Nsys Profile Metrics

osamaabuhamdan · April 15, 2024, 9:43pm

I am running the following command in a Slurm batch file to profile a distributed DNN training script.

srun nsys profile --sample=none --cpuctxsw=none --nic-metrics=true --gpu-metrics-device=0  --trace=mpi,cuda,nvtx,cudnn --output=logs/nsys_logs%h torchrun $script

Then I visualize the report with Nsight Systems.

I have multiple issues:

First:
GPU metrics are shown for only the first 15 seconds; after that, no data appears. I get this error:

Event requestor failed: Source ID=
Type=ErrorInformation (18)
 Properties:
  ErrorText (100)=GPU Metrics [0]: NVPA_STATUS_ERROR
- API function: Nvpw.GPU_PeriodicSampler_DecodeCounters_V2(&params)
- Error code: 1
- Source function: virtual QuadDDaemon::EventSource::PwMetrics::PeriodicSampler::DecodeResult QuadDDaemon::EventSource::{anonymous}::GpuPeriodicSampler::DecodeCounters(uint8_t*, size_t) const
- Source location: /dvs/p4/build/sw/devtools/Agora/Rel/CUDA12.3/QuadD/Target/quadd_d/quadd_d/jni/EventSource/GpuMetrics.cpp:248

Second:
NIC metrics do not show any data exchange. However, when I collect the counters manually from

/sys/class/infiniband/<interface id>/ports/<port number>/counters/

they show data exchange at 250 MB/s and more.
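
For reference, a manual cross-check of this kind could be sketched in Python as below. This is only an illustration, not the script used in the post. The helper names (`read_counter`, `throughput_mb_per_s`, `sample`) are hypothetical. It assumes the standard sysfs counters `port_xmit_data` and `port_rcv_data`, which count in units of 4 bytes, and that the counters directory is passed in by the caller.

```python
import time
from pathlib import Path

# Assumption: port_xmit_data / port_rcv_data count data octets divided by 4,
# so each counter tick corresponds to 4 bytes on the wire.
WORD_BYTES = 4


def read_counter(counters_dir, name):
    """Read one integer counter file (e.g. port_xmit_data) from the directory."""
    return int((Path(counters_dir) / name).read_text().strip())


def throughput_mb_per_s(before, after, seconds):
    """Convert a counter delta (in 4-byte units) into MB/s (1 MB = 1e6 bytes)."""
    return (after - before) * WORD_BYTES / seconds / 1e6


def sample(counters_dir, interval=1.0):
    """Return (transmit, receive) throughput in MB/s measured over one interval."""
    tx0 = read_counter(counters_dir, "port_xmit_data")
    rx0 = read_counter(counters_dir, "port_rcv_data")
    start = time.monotonic()
    time.sleep(interval)
    tx1 = read_counter(counters_dir, "port_xmit_data")
    rx1 = read_counter(counters_dir, "port_rcv_data")
    elapsed = time.monotonic() - start
    return (
        throughput_mb_per_s(tx0, tx1, elapsed),
        throughput_mb_per_s(rx0, rx1, elapsed),
    )

# Usage on a compute node (fill in the placeholders for your system):
# sample("/sys/class/infiniband/<interface id>/ports/<port number>/counters")
```

Running `sample` on each node while the training job is active gives a number to compare against the NIC metrics rows in the Nsight Systems timeline.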

Third:
MPI traces are not showing. Maybe they are not collected at all.
I tried both openmpi and mpich with --mpi-impl but neither worked.

Fourth:
How do I collect IB stats in a SLURM sbatch environment?
I.e., how do I use --ib-switch-metric?

Fifth:
I have this error too:
Could not parse 97 CUPTI activity records. Please try updating the CUDA driver or use more recent profiler version.

System Details

I visualize on this device:

Nsight Systems:
Chip: Apple M2
OS: 14.2 (23C64)
Version: 2024.1.1.59-241133802077v0 OSX.
Qt version: 6.3.2.
Google Protocol Buffers version: 3.21.1.
Boost version: 1.78.0.

I run experiments on these devices:

$ nsys -v
NVIDIA Nsight Systems version 2023.3.3.42-233333266658v0

$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:81:00.0 Off |                    0 |
| N/A   44C    P0              28W /  70W |      2MiB / 15360MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

$ nsys status -e

Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 1
Linux Distribution = CentOS
Linux Kernel Version = 3.10.0-1127.19.1.el7.x86_64: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail

I do not have root access, since this is a supercomputer environment that I don’t control.

hwilper · April 16, 2024, 1:50pm

Okay, first of all, I recommend getting a newer version of Nsys. You are about a year out of date, and the GPU metrics in particular have undergone a fair number of improvements in that time.

@pkovalenko do you have thoughts on the GPU metric issues?

@rdietrich do you have thoughts on the MPI issues?

rdietrich · April 17, 2024, 8:41am

I guess that you are running on a single node only, right? If that’s the case, MPI is probably not used. Communication between GPUs is likely done with NCCL. Do you see NCCL calls in your Nsight Systems report?

osamaabuhamdan · April 18, 2024, 6:43pm

It seems that PyTorch uses NCCL by default.
I thought it was MPI. Sorry for the confusion.
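
For anyone hitting the same confusion, one way to confirm which communication backend a training script uses is to ask `torch.distributed` directly. The following is a minimal sketch, assuming PyTorch is installed and that the function is called from inside the training script after the process group has been initialized. The helper name `distributed_backend` is hypothetical.

```python
def distributed_backend():
    """Return the name of the active torch.distributed backend, or a short
    explanation of why it cannot be determined."""
    try:
        import torch.distributed as dist
    except ImportError:
        return "torch not installed"
    if not dist.is_available():
        return "torch.distributed not available"
    if not dist.is_initialized():
        # init_process_group() has not been called yet in this process.
        return "process group not initialized"
    # Typically "nccl" for GPU training, "gloo" or "mpi" otherwise.
    return str(dist.get_backend())


if __name__ == "__main__":
    print(distributed_backend())
```

If this reports `nccl`, an empty MPI row in the report is expected, because the job makes no MPI calls that could be traced.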
