[nsys profile] gpu-metrics-devices fails with "Already under profiling"

Hi,

I’m trying to use nsys profile --gpu-metrics-devices, but I’m hitting an error with a strange message:

$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=help
GPU Metrics: None of the installed GPUs are supported:
	Ampere GA100 | NVIDIA A100-SXM4-40GB PCI[0000:07:00.0] - Already under profiling
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metrics

I do not understand what “Already under profiling” means, and the profiling results contain no GPU metrics (PCIe throughput, memory bandwidth, etc.).
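
For reference, the full invocation I eventually want to run looks roughly like this (./my_app is just a placeholder for my actual workload, and I am assuming the all/comma-separated-ID forms of --gpu-metrics-devices described in the user guide):

$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=all -o my_report ./my_app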

Thanks in advance for your help!


About my setup:

I did not run nsys with sudo, but the driver option that allows non-admin profiling is in place:

$ cat /proc/driver/nvidia/params | grep RmProfilingAdminOnly
RmProfilingAdminOnly: 0
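
I did not configure this myself; as far as I understand, it is usually set via an NVIDIA kernel module option (for example in a modprobe.d file, the file name here is only illustrative) and picked up after a reboot:

# e.g. /etc/modprobe.d/nvidia-profiling.conf (illustrative file name)
options nvidia NVreg_RestrictProfilingToAdminUsers=0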

The GPU I get via Slurm on our DGX A100:

$ nvidia-smi
Wed May 21 19:12:12 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0              57W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

The nsys version:

$ ~/nsight-systems-2025.1.1/bin/nsys --version
NVIDIA Nsight Systems version 2025.1.1.103-251135427971v0
$ ./nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 0
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-1042-nvidia: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK

See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.

I also tried the newest “Nsight Systems 2025.3.1 Full Version” from Nsight Systems - Get Started | NVIDIA Developer.

I get the same error with a slightly longer message, which is not much more helpful:

$ ./nsys --version
NVIDIA Nsight Systems version 2025.3.1.90-253135822126v0


$ ./nsys profile --gpu-metrics-devices=help
GPU Metrics: None of the installed GPUs are supported:
	Ampere GA100 | NVIDIA A100-SXM4-40GB PCI[0000:07:00.0] - Already under profiling or insufficient privilege, see https://developer.nvidia.com/ERR_NVGPUCTRPERM
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metrics


$ ./nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 0
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-1042-nvidia: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK

See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.

@pkovalenko, can you help?


@hwilper @pkovalenko My apologies for the follow-up, but I suspect the issue might stem from a missing step in the documented setup process; I could not fix it by reading the existing documentation.

I’d be happy to provide additional details about my current environment configuration to help diagnose this.

Do you have DCGM running? Related: Issue with GPU Metrics Collection for NVIDIA A100 on Nsight Systems - #10 by IntStr

Hi @pkovalenko,

thanks for your reply. DCGM is running (checked without sudo):

(base) slurm-jluo@dgx01:~$ service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2025-05-20 15:29:39 UTC; 1 week 0 days ago
   Main PID: 9748 (nv-hostengine)
      Tasks: 7 (limit: 629145)
     Memory: 46.8M
        CPU: 2d 13h 35min 18.558s
     CGroup: /system.slice/nvidia-dcgm.service
             └─9748 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Warning: some journal files were not opened due to insufficient permissions.

(I will retry with sudo once I can get it.)

And I do not have kubectl installed at all (related: Issue with GPU Metrics Collection for NVIDIA A100 on Nsight Systems - #10 by IntStr).

I also checked with pgrep nsys and there is no other nsys process running, which is what I expected.

@hwilper @pkovalenko
Here is the output with sudo:

$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2025-05-20 15:29:39 UTC; 1 week 2 days ago
   Main PID: 9748 (nv-hostengine)
      Tasks: 7 (limit: 629145)
     Memory: 48.4M
        CPU: 3d 6h 24min 32.626s
     CGroup: /system.slice/nvidia-dcgm.service
             └─9748 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Mai 20 15:29:39 dgx01 systemd[1]: Started NVIDIA DCGM service.
Mai 20 15:29:56 dgx01 nv-hostengine[9748]: DCGM initialized
Mai 20 15:29:56 dgx01 nv-hostengine[9748]: Started host engine version 3.3.3 using port number: 5555

DCGM is likely the reason: it uses the same hardware infrastructure as nsys to collect performance metrics, and this cannot be done concurrently. Disabling DCGM should resolve the problem.
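
A minimal sketch, assuming DCGM runs as the standard nvidia-dcgm systemd unit:

$ sudo systemctl stop nvidia-dcgm
$ pgrep -f nv-hostengine    # should print nothing once DCGM is fully stopped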

Hi @pkovalenko,

Thanks for the reply. I noticed that DCGM also came up in the context of the previous question: Issue with GPU Metrics Collection for NVIDIA A100 on Nsight Systems - #10 by IntStr

We have an A100 DGX cluster managed with Slurm, and I want to confirm: are there any drawbacks or issues if I disable DCGM? I’m concerned about potentially impacting or breaking something on the DGX server.

Temporarily disabling DCGM to allow profiling is typically not a problem. It is primarily responsible for monitoring/diagnostics, and there are also third-party integrations. You may want to read about the features it provides to better understand the impact: NVIDIA DCGM | NVIDIA Developer
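
If you only need the metrics for a single run, something like the following should work: stop DCGM, profile, then start it again (the application name is just a placeholder):

$ sudo systemctl stop nvidia-dcgm
$ nsys profile --gpu-metrics-devices=all -o report ./my_app
$ sudo systemctl start nvidia-dcgm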

@pkovalenko thanks for the help. It worked once I also stopped docker.dcgm-exporter.service.

I first stopped DCGM via service nvidia-dcgm stop, without a reboot, but the nsys error did not change:

$ service nvidia-dcgm status
○ nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Mon 2025-06-02 12:09:30 UTC; 3min 32s ago
    Process: 9748 ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm (code=exited, status=0/SUCCESS)
   Main PID: 9748 (code=exited, status=0/SUCCESS)
        CPU: 4d 8h 13min 55.149s

Warning: some journal files were not opened due to insufficient permissions.

$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=help
GPU Metrics: None of the installed GPUs are supported:
	Ampere GA100 | NVIDIA A100-SXM4-40GB PCI[0000:07:00.0] - Already under profiling
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metrics

Since it still did not work, I then stopped docker.dcgm-exporter.service as well, and after that GPU metrics collection worked.
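
For future readers, the combination that worked on our node was roughly the following (service names are as they appear on our DeepOps/DGX setup and may differ elsewhere):

$ sudo systemctl stop nvidia-dcgm.service docker.dcgm-exporter.service
$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=help    # the “Already under profiling” error is gone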


For reference, here is my check for DCGM-related services:

$ systemctl list-dependencies --reverse nvidia-dcgm.service
nvidia-dcgm.service
● └─multi-user.target
●   └─graphical.target

$ grep -r "nvidia-dcgm" /etc/systemd/ /etc/init.d/ /usr/lib/systemd/
/etc/systemd/system/docker.dcgm-exporter.service:ExecStart=/usr/bin/docker run --rm --gpus all --cap-add=SYS_ADMIN --cpus="0.5" --name %n -p 9400:9400 -v "/opt/deepops/nvidia-dcgm-exporter/dcgm-custom-metrics.csv:/etc/dcgm-exporter/default-counters.csv" nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04
/usr/lib/systemd/system/nvidia-dcgm.service:Environment="DCGM_HOME_DIR=/var/log/nvidia-dcgm"
/usr/lib/systemd/system/nvidia-dcgm.service:ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm
/usr/lib/systemd/system/dcgm.service:Description=DEPRECATED. Please use nvidia-dcgm.service
/usr/lib/systemd/system/dcgm.service:Conflicts=nvidia-dcgm.service
/usr/lib/systemd/system/dcgm.service:Environment="DCGM_HOME_DIR=/var/log/nvidia-dcgm"
/usr/lib/systemd/system/dcgm.service:ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm

$ ps aux | grep -E "dcgm"
root        9748 33.7  0.0 475952 53404 ?        Ssl  Mai20 6250:34 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
root       11432  0.0  0.0 10130088 29720 ?      Ssl  Mai20  16:50 /usr/bin/docker run --rm --gpus all --cap-add=SYS_ADMIN --cpus=0.5 --name docker.dcgm-exporter.service -p 9400:9400 -v /opt/deepops/nvidia-dcgm-exporter/dcgm-custom-metrics.csv:/etc/dcgm-exporter/default-counters.csv nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04
root       11854 43.1  0.0 4509588 279292 ?      Ssl  Mai20 7992:08 /usr/bin/dcgm-exporter
