GPU Metrics Unit Already in Use Error and Slingshot-11 NIC Metrics

Hardware
Perlmutter

Software
Nsight Systems 2023.3.1.92-233133147223v0
NVHPC 23.9

Environment

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 2
Linux Distribution = sles
Linux Kernel Version = 5.14.21-150400.24.81_12.0.87-cray_shasta_c: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): Fail
Possible --gpu-metrics-device values are:
	0: NVIDIA A100-SXM4-80GB PCI[0000:c1:00.0]
	1: NVIDIA A100-SXM4-80GB PCI[0000:82:00.0]
	2: NVIDIA A100-SXM4-80GB PCI[0000:41:00.0]
	3: NVIDIA A100-SXM4-80GB PCI[0000:03:00.0]
	all: Select all supported GPUs
	none: Disable GPU Metrics [Default]
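For reference, the environment report above can be reproduced with the Nsight Systems status check; to my understanding this is the invocation that produces it when run on a compute node:

nsys status -e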

Sample Profiling Command

srun /bin/bash -c 'nsys profile -s none --delay 200 --duration 40 --cpuctxsw none -t cuda,nvtx,cudnn,cublas,cusparse --gpu-metrics-device=all --cuda-graph-trace=node -o report_${SLURM_PROCID}  <script here>'

When I try collecting GPU Metrics, I get the error below. How do I go about fixing this? Note that my environment is neither a VM nor a container.

GPU Metrics [0]: GPU Metrics sampling hardware unit is already in use by another instance of Nsight Systems or other tool. The conflict can occur within the OS as well as containers, VMs and hypervisor.
- API function: Nvpw.GPU_PeriodicSampler_GetCounterAvailability(&params)
- Error code: 20
- Source function: virtual std::vector<unsigned char> QuadDDaemon::EventSource::{anonymous}::GpuPeriodicSampler::GetCounterAvailabilityImage() const
- Source location: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Target/quadd_d/quadd_d/jni/EventSource/GpuMetrics.cpp:135

I saw a previous, inconclusive post that suggested upgrading Nsight Systems, which leads me to my second question: can the newer versions of Nsight Systems available here be published to conda?

On Perlmutter, I can only install packages through package managers such as conda or pip, or use a container.
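For what it's worth, this is how I have been checking whether any Nsight Systems package is published on a conda channel; the package name pattern is a guess on my part, not a confirmed package:

conda search -c nvidia 'nsight*'
conda search -c conda-forge 'nsight*'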

I tried using an NGC container with Nsight 2024.1, but running nsys profile --gpu-metrics-device=help reports no supported devices even though the container is running on A100s (the checks I ran are listed below). That leaves conda as my only resort, since it takes a while before newer Nsight releases are integrated into the CUDA toolkit or NVHPC.
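For context, these are the checks I ran inside the container (output omitted):

nsys status -e
nsys profile --gpu-metrics-device=help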

Finally, since NVIDIA supports Slingshot-11, can we also get support for nic-metrics and ib-switch metrics in Nsight Systems?
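For reference, the existing options I have in mind are the NIC and InfiniBand switch sampling switches, which as far as I can tell currently target NVIDIA/Mellanox InfiniBand hardware rather than Slingshot; the flag spellings below are my assumption based on recent nsys builds:

nsys profile --nic-metrics=true ...
nsys profile --ib-switch-metrics=<comma-separated switch GUIDs> ...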

Would really appreciate a response, thanks!

I’m going to loop in @pkovalenko for the GPU metrics and @ytebeka for the networking components.


GPU Metrics sampling hardware unit is already in use by another instance of Nsight Systems or other tool. The conflict can occur within the OS as well as containers, VMs and hypervisor.

This could be caused not only by NSys, but also by DCGM or by CUPTI sampling from the user app.

@pkovalenko Do you have any suggestions on how to troubleshoot any of these cases?

@pkovalenko @hwilper pinging again to revive this

This could be DCGM. Try disabling DCGM before running the nsys profiling session:

sudo systemctl stop nvidia-dcgm

Once you’re done with nsys, restart DCGM this way:

sudo systemctl restart nvidia-dcgm

I am on a shared system and do not have root access. Is there any other alternative?

The hardware responsible for metrics sampling cannot be used concurrently. The error you're getting from Nsys means it's already in use. If DCGM is running, the only way around this is to disable it.
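If you want to confirm that DCGM is what is holding the sampling unit, there are a couple of read-only checks that do not need root (the host engine process name below is the standard DCGM one, to the best of my knowledge):

systemctl status nvidia-dcgm
pgrep -af nv-hostengine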