I’m trying to use nsys profile --gpu-metrics-devices but I’m hitting an error with a strange message:
$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=help
GPU Metrics: None of the installed GPUs are supported:
Ampere GA100 | NVIDIA A100-SXM4-40GB PCI[0000:07:00.0] - Already under profiling
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metrics
I don’t understand what “Already under profiling” means, and the profiling results do not contain any of the GPU metrics (PCIe throughput, memory bandwidth, etc.) that --gpu-metrics-devices is supposed to add.
Thanks in advance for any help!
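For completeness, this is roughly how I intend to invoke the actual profiling run once the error is resolved (the output name and application are placeholders, and I’m assuming the single A100 shows up as device index 0):
$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=0 -o my_report ./my_app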
About my setup:
I did not run nsys with sudo, but the driver option that allows non-admin profiling is in place:
$ nvidia-smi
Wed May 21 19:12:12 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 31C P0 57W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
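If it helps, this is the check I would use to confirm the driver option, based on the ERR_NVGPUCTRPERM instructions (NVreg_RestrictProfilingToAdminUsers=0 should show up as RmProfilingAdminOnly: 0 in the driver parameters):
# Verify that non-admin access to GPU performance counters is allowed
# (expected output: "RmProfilingAdminOnly: 0")
$ grep RmProfilingAdminOnly /proc/driver/nvidia/params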
The nsys version:
$ ~/nsight-systems-2025.1.1/bin/nsys --version
NVIDIA Nsight Systems version 2025.1.1.103-251135427971v0
$ ./nsys status -e
Timestamp counter supported: Yes
CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 0
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-1042-nvidia: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK
See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.
I get the same error with a newer nsys version; its slightly longer message is not much more helpful:
$ ./nsys --version
NVIDIA Nsight Systems version 2025.3.1.90-253135822126v0
$ ./nsys profile --gpu-metrics-devices=help
GPU Metrics: None of the installed GPUs are supported:
Ampere GA100 | NVIDIA A100-SXM4-40GB PCI[0000:07:00.0] - Already under profiling or insufficient privilege, see https://developer.nvidia.com/ERR_NVGPUCTRPERM
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metrics
$ ./nsys status -e
Timestamp counter supported: Yes
CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 0
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-1042-nvidia: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK
See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.
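For reference, my understanding of the ERR_NVGPUCTRPERM fix (which should already be applied on this node) is roughly the following; the .conf file name is arbitrary:
# Allow non-admin users to access the GPU performance counters
$ echo 'options nvidia NVreg_RestrictProfilingToAdminUsers=0' | sudo tee /etc/modprobe.d/nvidia-profiling.conf
# Rebuild the initramfs if the nvidia module is loaded from it, then reboot
$ sudo update-initramfs -u
$ sudo reboot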
@hwilper @pkovalenko My apologies for the follow-up, but I suspect the issue stems from something missing from the documented setup steps; I could not fix it by following the existing documentation.
I’d be happy to provide additional details about my current environment configuration to help diagnose this.
Thanks for your reply. I do have DCGM running; here is the status checked without sudo and then with sudo:
(base) slurm-jluo@dgx01:~$ service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2025-05-20 15:29:39 UTC; 1 week 0 days ago
Main PID: 9748 (nv-hostengine)
Tasks: 7 (limit: 629145)
Memory: 46.8M
CPU: 2d 13h 35min 18.558s
CGroup: /system.slice/nvidia-dcgm.service
└─9748 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
Warning: some journal files were not opened due to insufficient permissions.
$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2025-05-20 15:29:39 UTC; 1 week 2 days ago
Main PID: 9748 (nv-hostengine)
Tasks: 7 (limit: 629145)
Memory: 48.4M
CPU: 3d 6h 24min 32.626s
CGroup: /system.slice/nvidia-dcgm.service
└─9748 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
May 20 15:29:39 dgx01 systemd[1]: Started NVIDIA DCGM service.
May 20 15:29:56 dgx01 nv-hostengine[9748]: DCGM initialized
May 20 15:29:56 dgx01 nv-hostengine[9748]: Started host engine version 3.3.3 using port number: 5555
DCGM is likely the reason: it uses the same profiling infrastructure as nsys to collect GPU performance metrics, and the two cannot collect concurrently. Disabling DCGM should resolve the problem.
I have an A100 DGX cluster managed by Slurm and want to confirm: are there any drawbacks to disabling DCGM? I’m concerned about potentially impacting or breaking something on the DGX server.
Temporarily disabling DCGM to allow profiling is typically not a problem. It is primarily responsible for monitoring/diagnostics, and there are also third-party integrations that rely on it. You may want to read about the features it provides to better understand the impact: NVIDIA DCGM | NVIDIA Developer
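As a sketch, assuming DCGM runs as the systemd service shown above, a one-off profiling session could look like this:
# Pause DCGM monitoring for the duration of the profiling run
$ sudo systemctl stop nvidia-dcgm.service
# ... run nsys profile --gpu-metrics-devices=... here ...
# Resume monitoring afterwards
$ sudo systemctl start nvidia-dcgm.service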
@pkovalenko Thanks for helping. It worked, but only after also stopping docker.dcgm-exporter.service.
I first stopped DCGM via service nvidia-dcgm stop without a reboot, but the nsys error did not change:
$ service nvidia-dcgm status
○ nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2025-06-02 12:09:30 UTC; 3min 32s ago
Process: 9748 ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm (code=exited, status=0/SUCCESS)
Main PID: 9748 (code=exited, status=0/SUCCESS)
CPU: 4d 8h 13min 55.149s
Warning: some journal files were not opened due to insufficient permissions.
$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=help
GPU Metrics: None of the installed GPUs are supported:
Ampere GA100 | NVIDIA A100-SXM4-40GB PCI[0000:07:00.0] - Already under profiling
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metrics
Since stopping nvidia-dcgm alone was not enough, I also stopped docker.dcgm-exporter.service, and after that the GPU metrics collection worked.
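For anyone hitting the same thing, this is roughly the sequence that worked for me (service names as on my DGX node; your permissions and service names may differ):
# Stop both DCGM and its exporter so nsys can use the GPU metrics infrastructure
$ sudo systemctl stop nvidia-dcgm.service
$ sudo systemctl stop docker.dcgm-exporter.service
# GPU metrics are now available (no more "Already under profiling")
$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=help
# Restart monitoring when done
$ sudo systemctl start docker.dcgm-exporter.service
$ sudo systemctl start nvidia-dcgm.service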