GPU Metrics Unit Already in Use Error and Slingshot-11 NIC Metrics

Hardware
Perlmutter

Software
Nsight Systems 2023.3.1.92-233133147223v0
NVHPC 23.9

Environment

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 2
Linux Distribution = sles
Linux Kernel Version = 5.14.21-150400.24.81_12.0.87-cray_shasta_c: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail
Possible --gpu-metrics-device values are:
	0: NVIDIA A100-SXM4-80GB PCI[0000:c1:00.0]
	1: NVIDIA A100-SXM4-80GB PCI[0000:82:00.0]
	2: NVIDIA A100-SXM4-80GB PCI[0000:41:00.0]
	3: NVIDIA A100-SXM4-80GB PCI[0000:03:00.0]
	all: Select all supported GPUs
	none: Disable GPU Metrics [Default]
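For reference, the environment report above can be reproduced with the Nsight Systems status check; to my understanding this is the invocation that produces it when run on a compute node:

nsys status -e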

Sample Profiling Command

srun /bin/bash -c 'nsys profile -s none --delay 200 --duration 40 --cpuctxsw none -t cuda,nvtx,cudnn,cublas,cusparse --gpu-metrics-device=all --cuda-graph-trace=node -o report_${SLURM_PROCID}  <script here>'

When I try collecting GPU Metrics, I get the error below. How do I go about fixing this? Note that my environment is neither a VM nor a container.

GPU Metrics [0]: GPU Metrics sampling hardware unit is already in use by another instance of Nsight Systems or other tool. The conflict can occur within the OS as well as containers, VMs and hypervisor.
- API function: Nvpw.GPU_PeriodicSampler_GetCounterAvailability(&params)
- Error code: 20
- Source function: virtual std::vector<unsigned char> QuadDDaemon::EventSource::{anonymous}::GpuPeriodicSampler::GetCounterAvailabilityImage() const
- Source location: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/quadd_d/quadd_d/jni/EventSource/GpuMetrics.cpp:135

I saw a previous, inconclusive post that suggested upgrading Nsight Systems, which leads me to my second question: can the newer versions of Nsight Systems available here be published to conda?

On Perlmutter, I can only install packages through package managers such as conda or pip, or use a container.
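For what it's worth, this is how I have been checking whether any Nsight Systems package is published on a conda channel; the package name pattern is a guess on my part, not a confirmed package:

conda search -c nvidia 'nsight*'
conda search -c conda-forge 'nsight*'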

I tried using an NGC container with Nsight 2024.1, but running nsys profile --gpu-metrics-device=help reports no supported devices even though the container is running on A100s (the checks I ran are listed below). That leaves conda as my only resort, since it takes a while before newer Nsight releases are integrated into the CUDA toolkit or NVHPC.
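For context, these are the checks I ran inside the container (output omitted):

nsys status -e
nsys profile --gpu-metrics-device=help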

Finally, since NVIDIA supports Slingshot-11, can we also get support for nic-metrics and ib-switch metrics in Nsight Systems?
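For reference, the existing options I have in mind are the NIC and InfiniBand switch sampling switches, which as far as I can tell currently target NVIDIA/Mellanox InfiniBand hardware rather than Slingshot; the flag spellings below are my assumption based on recent nsys builds:

nsys profile --nic-metrics=true ...
nsys profile --ib-switch-metrics=<comma-separated switch GUIDs> ...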

Would really appreciate a response, thanks!

I’m going to loop in @pkovalenko for the GPU metrics and @ytebeka for the networking components.


GPU Metrics sampling hardware unit is already in use by another instance of Nsight Systems or other tool. The conflict can occur within the OS as well as containers, VMs and hypervisor.

This could be caused not only by NSys, but also by DCGM or by CUPTI sampling from the user app.

@pkovalenko Do you have any suggestions on how to troubleshoot any of these cases?

@pkovalenko @hwilper pinging again to revive this

This could be DCGM. Try disabling DCGM before running the nsys profiling session:

sudo systemctl stop nvidia-dcgm

Once you’re done with nsys, restart DCGM this way:

sudo systemctl restart nvidia-dcgm

I am on a shared system and do not have root access. Is there any other alternative?

The hardware responsible for metrics sampling cannot be used concurrently. The error you're getting from Nsys means it's already in use. If DCGM is running, the only way around this is to disable it.
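If you want to confirm that DCGM is what is holding the sampling unit, there are a couple of read-only checks that do not need root (the host engine process name below is the standard DCGM one, to the best of my knowledge):

systemctl status nvidia-dcgm
pgrep -af nv-hostengine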