Hi,
I’m trying to use nsys profile --gpu-metrics-devices, but I’m hitting an error with a strange message:
$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=help
GPU Metrics: None of the installed GPUs are supported:
Ampere GA100 | NVIDIA A100-SXM4-40GB PCI[0000:07:00.0] - Already under profiling
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metrics
I do not understand what “Already under profiling” means. The profiling results also contain no GPU metrics (PCIe, memory bandwidth, etc.).
Thanks in advance for your help!
About my setup:
I did not use sudo, but the driver option is in place:
$ cat /proc/driver/nvidia/params | grep RmProfilingAdminOnly
RmProfilingAdminOnly: 0
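For anyone who wants to script this check, here is a minimal sketch. It assumes the params file uses the `Name: value` layout shown above; the function name `profiling_admin_only` and its optional path argument are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: print the RmProfilingAdminOnly value from the NVIDIA
# driver params file. 0 means non-root users may use the GPU performance
# counters; 1 means access is restricted to admins.
profiling_admin_only() {
    # Optional first argument overrides the file path (an assumption, handy
    # for testing); the default is the real driver path.
    params_file="${1:-/proc/driver/nvidia/params}"
    # Split on ": " and print the value; exit non-zero if the key is absent.
    awk -F': *' '$1 == "RmProfilingAdminOnly" { print $2; found = 1 }
                 END { exit !found }' "$params_file"
}
```

Calling `profiling_admin_only` with no argument reads the live driver setting. A result of 0 only rules out the privilege cause of the error, not a competing profiler.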
The GPU I get from Slurm on our DGX A100:
$ nvidia-smi
Wed May 21 19:12:12 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0              57W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
The nsys version:
$ ~/nsight-systems-2025.1.1/bin/nsys --version
NVIDIA Nsight Systems version 2025.1.1.103-251135427971v0
$ ./nsys status -e
Timestamp counter supported: Yes
CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 0
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-1042-nvidia: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK
See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.
I also tried the newest “Nsight Systems 2025.3.1 Full Version” from Nsight Systems - Get Started | NVIDIA Developer.
I get the same error with a slightly longer message, which is still not helpful:
$ ./nsys --version
NVIDIA Nsight Systems version 2025.3.1.90-253135822126v0
$ ./nsys profile --gpu-metrics-devices=help
GPU Metrics: None of the installed GPUs are supported:
Ampere GA100 | NVIDIA A100-SXM4-40GB PCI[0000:07:00.0] - Already under profiling or insufficient privilege, see https://developer.nvidia.com/ERR_NVGPUCTRPERM
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metrics
$ ./nsys status -e
Timestamp counter supported: Yes
CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 0
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-1042-nvidia: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK
See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.
@pkovalenko can you help?
@hwilper @pkovalenko My apologies for the follow-up, but I suspect the issue stems from a step missing in the setup documentation. I could not fix it by reading the existing documentation.
I’d be happy to provide more details about my environment configuration to help diagnose this.
Hi @pkovalenko,
Thanks for your reply. DCGM is running on the node (status checked without sudo):
(base) slurm-jluo@dgx01:~$ service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2025-05-20 15:29:39 UTC; 1 week 0 days ago
Main PID: 9748 (nv-hostengine)
Tasks: 7 (limit: 629145)
Memory: 46.8M
CPU: 2d 13h 35min 18.558s
CGroup: /system.slice/nvidia-dcgm.service
└─9748 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
Warning: some journal files were not opened due to insufficient permissions.
(I will add the sudo output once I get access.)
I also do not have kubectl installed at all (see: Issue with GPU Metrics Collection for NVIDIA A100 on Nsight Systems - #10 by IntStr).
I also ran pgrep nsys and it found nothing, which is expected.
@hwilper @pkovalenko
I have the sudo output here:
$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
Active: active (running) since Tue 2025-05-20 15:29:39 UTC; 1 week 2 days ago
Main PID: 9748 (nv-hostengine)
Tasks: 7 (limit: 629145)
Memory: 48.4M
CPU: 3d 6h 24min 32.626s
CGroup: /system.slice/nvidia-dcgm.service
└─9748 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
Mai 20 15:29:39 dgx01 systemd[1]: Started NVIDIA DCGM service.
Mai 20 15:29:56 dgx01 nv-hostengine[9748]: DCGM initialized
Mai 20 15:29:56 dgx01 nv-hostengine[9748]: Started host engine version 3.3.3 using port number: 5555
DCGM is likely the reason - it uses the same infrastructure as nsys to collect performance metrics, and the two cannot collect concurrently. Disabling DCGM should resolve the problem.
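Before stopping anything, it can help to list which DCGM-related processes are running, since more than one component may be collecting metrics. A minimal sketch, assuming the process names `nv-hostengine` and `dcgm-exporter` (both taken from output in this thread) are the relevant ones on your system; the function name `find_dcgm_procs` is hypothetical.

```shell
# Hypothetical helper: read `ps` output on stdin and print the lines that
# look like DCGM components (the host engine or a dcgm-exporter).
# Returns non-zero when nothing matches.
find_dcgm_procs() {
    # The second grep drops the line for the grep command itself,
    # which `ps aux | grep ...` would otherwise report.
    grep -E 'nv-hostengine|dcgm-exporter' | grep -v 'grep'
}

# Usage (not run here):
#   ps aux | find_dcgm_procs
```

Any line printed points at a process that may be holding the GPU performance counters. Other profilers could have the same effect, so an empty result is not proof that the GPU is free.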
Hi @pkovalenko,
Thanks for the reply. I noticed DCGM is still running in the context of the previous question: Issue with GPU Metrics Collection for NVIDIA A100 on Nsight Systems - #10 by IntStr
I have an A100 DGX cluster using Slurm and want to confirm: are there any drawbacks or issues if I disable DCGM? I’m concerned about side effects or breaking something on the DGX server.
Temporarily disabling DCGM to allow profiling is typically not a problem. It’s primarily responsible for monitoring/diagnostics, and there are also 3rd party integrations. You may want to read about the features it provides to better understand the impact: NVIDIA DCGM | NVIDIA Developer
@pkovalenko thanks for helping. It worked once I also stopped docker.dcgm-exporter.service.
I first stopped DCGM via service nvidia-dcgm stop, without a reboot. That alone changed nothing:
$ service nvidia-dcgm status
○ nvidia-dcgm.service - NVIDIA DCGM service
Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
Active: inactive (dead) since Mon 2025-06-02 12:09:30 UTC; 3min 32s ago
Process: 9748 ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm (code=exited, status=0/SUCCESS)
Main PID: 9748 (code=exited, status=0/SUCCESS)
CPU: 4d 8h 13min 55.149s
Warning: some journal files were not opened due to insufficient permissions.
$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=help
GPU Metrics: None of the installed GPUs are supported:
Ampere GA100 | NVIDIA A100-SXM4-40GB PCI[0000:07:00.0] - Already under profiling
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metrics
Since that did not work, I also stopped docker.dcgm-exporter.service, and then it worked.
Here is how I checked for DCGM-related services and processes:
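The sequence that worked here can be sketched as a small script. This is only an illustration under a few assumptions: the two unit names are the ones from this machine, `systemctl stop` is used as the equivalent of `service ... stop`, `nsys` is on the PATH, and the helper names `_run`, `free_gpu_counters`, and `restore_dcgm` are hypothetical. Set DRY_RUN=1 to print the commands instead of running them.

```shell
# Run a command, or just print it when DRY_RUN=1.
_run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop both DCGM consumers seen in this thread, then re-check nsys.
# The order of the two stops is an assumption; either may work.
free_gpu_counters() {
    _run sudo systemctl stop docker.dcgm-exporter.service
    _run sudo systemctl stop nvidia-dcgm.service
    _run nsys profile --gpu-metrics-devices=help
}

# Bring monitoring back once profiling is finished.
restore_dcgm() {
    _run sudo systemctl start nvidia-dcgm.service
    _run sudo systemctl start docker.dcgm-exporter.service
}
```

Running `DRY_RUN=1 free_gpu_counters` first shows exactly what would be executed. On a shared cluster, check with the admins before stopping monitoring services, and call `restore_dcgm` afterwards.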
$ systemctl list-dependencies --reverse nvidia-dcgm.service
nvidia-dcgm.service
● └─multi-user.target
● └─graphical.target
$ grep -r "nvidia-dcgm" /etc/systemd/ /etc/init.d/ /usr/lib/systemd/
/etc/systemd/system/docker.dcgm-exporter.service:ExecStart=/usr/bin/docker run --rm --gpus all --cap-add=SYS_ADMIN --cpus="0.5" --name %n -p 9400:9400 -v "/opt/deepops/nvidia-dcgm-exporter/dcgm-custom-metrics.csv:/etc/dcgm-exporter/default-counters.csv" nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04
/usr/lib/systemd/system/nvidia-dcgm.service:Environment="DCGM_HOME_DIR=/var/log/nvidia-dcgm"
/usr/lib/systemd/system/nvidia-dcgm.service:ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm
/usr/lib/systemd/system/dcgm.service:Description=DEPRECATED. Please use nvidia-dcgm.service
/usr/lib/systemd/system/dcgm.service:Conflicts=nvidia-dcgm.service
/usr/lib/systemd/system/dcgm.service:Environment="DCGM_HOME_DIR=/var/log/nvidia-dcgm"
/usr/lib/systemd/system/dcgm.service:ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm
$ ps aux | grep -E "dcgm"
root 9748 33.7 0.0 475952 53404 ? Ssl Mai20 6250:34 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
root 11432 0.0 0.0 10130088 29720 ? Ssl Mai20 16:50 /usr/bin/docker run --rm --gpus all --cap-add=SYS_ADMIN --cpus=0.5 --name docker.dcgm-exporter.service -p 9400:9400 -v /opt/deepops/nvidia-dcgm-exporter/dcgm-custom-metrics.csv:/etc/dcgm-exporter/default-counters.csv nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04
root 11854 43.1 0.0 4509588 279292 ? Ssl Mai20 7992:08 /usr/bin/dcgm-exporter