Hardware
Perlmutter
Software
Nsight Systems 2023.3.1.92-233133147223v0
NVHPC 23.9
Environment
CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 2
Linux Distribution = sles
Linux Kernel Version = 5.14.21-150400.24.81_12.0.87-cray_shasta_c: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail
Possible --gpu-metrics-device values are:
0: NVIDIA A100-SXM4-80GB PCI[0000:c1:00.0]
1: NVIDIA A100-SXM4-80GB PCI[0000:82:00.0]
2: NVIDIA A100-SXM4-80GB PCI[0000:41:00.0]
3: NVIDIA A100-SXM4-80GB PCI[0000:03:00.0]
all: Select all supported GPUs
none: Disable GPU Metrics [Default]
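For reference, the environment details above can be reproduced with something like the sketch below, run on a GPU compute node. The wrapper function name show_nsys_env is hypothetical, and the guard is only there so the snippet is harmless where nsys is not installed.

```shell
#!/usr/bin/env bash
# Sketch: print the Nsight Systems environment information listed above.
# show_nsys_env is a hypothetical helper name, not part of nsys.
show_nsys_env() {
    if command -v nsys >/dev/null 2>&1; then
        # CPU profiling environment check (paranoid level, perf_event_open, ...)
        nsys status --environment || echo "nsys status failed"
        # List the GPUs that support GPU Metrics sampling
        nsys profile --gpu-metrics-device=help || echo "nsys GPU metrics query failed"
    else
        echo "nsys not found in PATH"
    fi
}

show_nsys_env
```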
Sample Profiling Command
srun /bin/bash -c 'nsys profile -s none --delay 200 --duration 40 --cpuctxsw none -t cuda,nvtx,cudnn,cublas,cusparse --gpu-metrics-device=all --cuda-graph-trace=node -o report_${SLURM_PROCID} <script here>'
When I try to collect GPU Metrics, I get the error below. How do I go about fixing this? Note that my environment was neither a VM nor a container.
GPU Metrics [0]: GPU Metrics sampling hardware unit is already in use by another instance of Nsight Systems or other tool. The conflict can occur within the OS as well as containers, VMs and hypervisor.
- API function: Nvpw.GPU_PeriodicSampler_GetCounterAvailability(&params)
- Error code: 20
- Source function: virtual std::vector<unsigned char> QuadDDaemon::EventSource::{anonymous}::GpuPeriodicSampler::GetCounterAvailabilityImage() const
- Source location: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/quadd_d/quadd_d/jni/EventSource/GpuMetrics.cpp:135
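One check I can think of is whether something else on the node is holding the sampler. The sketch below is only a diagnostic under assumptions: the process names it looks for (nsys, ncu, and nv-hostengine for DCGM) are my guesses at likely candidates, not something the error message states, and check_sampler_users is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Diagnostic sketch: list processes that might be using the GPU metrics sampler.
# Assumption: the conflicting tool is another profiler instance or a monitoring
# daemon such as DCGM (nv-hostengine); this is a guess, not a confirmed cause.
check_sampler_users() {
    # Profiler or monitoring processes, matched by exact process name.
    if command -v pgrep >/dev/null 2>&1; then
        pgrep -a '^(nsys|ncu|nv-hostengine)$' \
            || echo "no matching profiler/monitoring processes found"
    else
        echo "pgrep not available"
    fi

    # Compute processes currently attached to the GPUs.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi --query-compute-apps=pid,process_name --format=csv \
            || echo "nvidia-smi query failed"
    else
        echo "nvidia-smi not available"
    fi
}

check_sampler_users
```

Running it inside the same allocation (for example through srun on one task per node) would show what is active on the node where the profile fails; processes owned by other users or by the system may not be visible.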
I saw a previous, inconclusive post that suggested upgrading Nsight Systems, which leads me to my second question: can the newer versions of Nsight Systems available here be uploaded to conda?
On Perlmutter, I can only install packages through package managers like conda or pip, or use a container.
I tried using an NGC container with Nsight Systems 2024.1, but running nsys profile --gpu-metrics-device=help shows no supported devices, even though there are A100s underneath the container. My only remaining option is therefore conda, since it takes a while before newer Nsight Systems releases are integrated into the CUDA Toolkit or NVHPC.
Third, NVIDIA supports Slingshot-11, so can we also get support for nic-metrics and ib-switch metrics in Nsight Systems?
Would really appreciate a response, thanks!