[nsys profile] gpu-metrics-devices fails with "Already under profiling"

Hi,

I’m trying to use nsys profile --gpu-metrics-devices, but I’m hitting an error with a strange message:

$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=help
GPU Metrics: None of the installed GPUs are supported:
	Ampere GA100 | NVIDIA A100-SXM4-40GB PCI[0000:07:00.0] - Already under profiling
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metrics

I do not understand what “Already under profiling” means, and the profiling results contain no GPU metrics (PCIe throughput, memory bandwidth, etc.).
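
For reference, the full invocation I eventually want to run looks roughly like this (./my_app is just a placeholder for my actual workload, and I am assuming the all/comma-separated-ID forms of --gpu-metrics-devices described in the user guide):

$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=all -o my_report ./my_app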

Thanks in advance for your help!


About my setup:

I did not run nsys with sudo, but the driver option that allows non-admin profiling is in place:

$ cat /proc/driver/nvidia/params | grep RmProfilingAdminOnly
RmProfilingAdminOnly: 0
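
I did not configure this myself; as far as I understand, it is usually set via an NVIDIA kernel module option (for example in a modprobe.d file, the file name here is only illustrative) and picked up after a reboot:

# e.g. /etc/modprobe.d/nvidia-profiling.conf (illustrative file name)
options nvidia NVreg_RestrictProfilingToAdminUsers=0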

The GPU I get via Slurm on our DGX A100:

$ nvidia-smi
Wed May 21 19:12:12 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-40GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   31C    P0              57W / 400W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

The nsys version:

$ ~/nsight-systems-2025.1.1/bin/nsys --version
NVIDIA Nsight Systems version 2025.1.1.103-251135427971v0
$ ./nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 0
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-1042-nvidia: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK

See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.

I also tried the newest “Nsight Systems 2025.3.1 Full Version” from Nsight Systems - Get Started | NVIDIA Developer.

I get the same error with a slightly longer message, which is not much more helpful:

$ ./nsys --version
NVIDIA Nsight Systems version 2025.3.1.90-253135822126v0


$ ./nsys profile --gpu-metrics-devices=help
GPU Metrics: None of the installed GPUs are supported:
	Ampere GA100 | NVIDIA A100-SXM4-40GB PCI[0000:07:00.0] - Already under profiling or insufficient privilege, see https://developer.nvidia.com/ERR_NVGPUCTRPERM
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metrics


$ ./nsys status -e
Timestamp counter supported: Yes

CPU Profiling Environment Check
Root privilege: disabled
Linux Kernel Paranoid Level = 0
Linux Distribution = Ubuntu
Linux Kernel Version = 5.15.0-1042-nvidia: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Not Available
CPU Profiling Environment (process-tree): OK
CPU Profiling Environment (system-wide): OK

See the product documentation at https://docs.nvidia.com/nsight-systems for more information,
including information on how to set the Linux Kernel Paranoid Level.

@pkovalenko, can you help?


@hwilper @pkovalenko My apologies for the follow-up, but I suspect the issue might stem from a missing step in the documented setup process; I could not fix it by reading the existing documentation.

I’d be happy to provide additional details about my current environment configuration to help diagnose this.

Do you have DCGM running? Related: Issue with GPU Metrics Collection for NVIDIA A100 on Nsight Systems - #10 by IntStr

Hi @pkovalenko,

thanks for your reply. DCGM is running (checked without sudo):

(base) slurm-jluo@dgx01:~$ service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2025-05-20 15:29:39 UTC; 1 week 0 days ago
   Main PID: 9748 (nv-hostengine)
      Tasks: 7 (limit: 629145)
     Memory: 46.8M
        CPU: 2d 13h 35min 18.558s
     CGroup: /system.slice/nvidia-dcgm.service
             └─9748 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Warning: some journal files were not opened due to insufficient permissions.

(I will retry with sudo once I can get it.)

And I do not have kubectl installed at all (related: Issue with GPU Metrics Collection for NVIDIA A100 on Nsight Systems - #10 by IntStr).

I also checked with pgrep nsys and there is no other nsys process running, which is what I expected.

@hwilper @pkovalenko
Here is the output with sudo:

$ sudo service nvidia-dcgm status
● nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: active (running) since Tue 2025-05-20 15:29:39 UTC; 1 week 2 days ago
   Main PID: 9748 (nv-hostengine)
      Tasks: 7 (limit: 629145)
     Memory: 48.4M
        CPU: 3d 6h 24min 32.626s
     CGroup: /system.slice/nvidia-dcgm.service
             └─9748 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm

Mai 20 15:29:39 dgx01 systemd[1]: Started NVIDIA DCGM service.
Mai 20 15:29:56 dgx01 nv-hostengine[9748]: DCGM initialized
Mai 20 15:29:56 dgx01 nv-hostengine[9748]: Started host engine version 3.3.3 using port number: 5555

DCGM is likely the reason: it uses the same hardware infrastructure as nsys to collect performance metrics, and this cannot be done concurrently. Disabling DCGM should resolve the problem.
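
A minimal sketch, assuming DCGM runs as the standard nvidia-dcgm systemd unit:

$ sudo systemctl stop nvidia-dcgm
$ pgrep -f nv-hostengine    # should print nothing once DCGM is fully stopped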

Hi @pkovalenko,

Thanks for the reply. I noticed that DCGM also came up in the context of the previous question: Issue with GPU Metrics Collection for NVIDIA A100 on Nsight Systems - #10 by IntStr

We have an A100 DGX cluster managed with Slurm, and I want to confirm: are there any drawbacks or issues if I disable DCGM? I’m concerned about potentially impacting or breaking something on the DGX server.

Temporarily disabling DCGM to allow profiling is typically not a problem. It is primarily responsible for monitoring/diagnostics, and there are also third-party integrations. You may want to read about the features it provides to better understand the impact: NVIDIA DCGM | NVIDIA Developer
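
If you only need the metrics for a single run, something like the following should work: stop DCGM, profile, then start it again (the application name is just a placeholder):

$ sudo systemctl stop nvidia-dcgm
$ nsys profile --gpu-metrics-devices=all -o report ./my_app
$ sudo systemctl start nvidia-dcgm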

@pkovalenko thanks for the help. It worked once I also stopped docker.dcgm-exporter.service.

I first stopped DCGM via service nvidia-dcgm stop, without a reboot, but the nsys error did not change:

$ service nvidia-dcgm status
○ nvidia-dcgm.service - NVIDIA DCGM service
     Loaded: loaded (/lib/systemd/system/nvidia-dcgm.service; enabled; vendor preset: enabled)
     Active: inactive (dead) since Mon 2025-06-02 12:09:30 UTC; 3min 32s ago
    Process: 9748 ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm (code=exited, status=0/SUCCESS)
   Main PID: 9748 (code=exited, status=0/SUCCESS)
        CPU: 4d 8h 13min 55.149s

Warning: some journal files were not opened due to insufficient permissions.

$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=help
GPU Metrics: None of the installed GPUs are supported:
	Ampere GA100 | NVIDIA A100-SXM4-40GB PCI[0000:07:00.0] - Already under profiling
See the user guide: https://docs.nvidia.com/nsight-systems/UserGuide/index.html#gpu-metrics

Since it still did not work, I then stopped docker.dcgm-exporter.service as well, and after that GPU metrics collection worked.
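
For future readers, the combination that worked on our node was roughly the following (service names are as they appear on our DeepOps/DGX setup and may differ elsewhere):

$ sudo systemctl stop nvidia-dcgm.service docker.dcgm-exporter.service
$ ~/nsight-systems-2025.1.1/bin/nsys profile --gpu-metrics-devices=help    # the “Already under profiling” error is gone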


For reference, here is my check for DCGM-related services:

$ systemctl list-dependencies --reverse nvidia-dcgm.service
nvidia-dcgm.service
● └─multi-user.target
●   └─graphical.target

$ grep -r "nvidia-dcgm" /etc/systemd/ /etc/init.d/ /usr/lib/systemd/
/etc/systemd/system/docker.dcgm-exporter.service:ExecStart=/usr/bin/docker run --rm --gpus all --cap-add=SYS_ADMIN --cpus="0.5" --name %n -p 9400:9400 -v "/opt/deepops/nvidia-dcgm-exporter/dcgm-custom-metrics.csv:/etc/dcgm-exporter/default-counters.csv" nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04
/usr/lib/systemd/system/nvidia-dcgm.service:Environment="DCGM_HOME_DIR=/var/log/nvidia-dcgm"
/usr/lib/systemd/system/nvidia-dcgm.service:ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm
/usr/lib/systemd/system/dcgm.service:Description=DEPRECATED. Please use nvidia-dcgm.service
/usr/lib/systemd/system/dcgm.service:Conflicts=nvidia-dcgm.service
/usr/lib/systemd/system/dcgm.service:Environment="DCGM_HOME_DIR=/var/log/nvidia-dcgm"
/usr/lib/systemd/system/dcgm.service:ExecStart=/usr/bin/nv-hostengine -n --service-account nvidia-dcgm

$ ps aux | grep -E "dcgm"
root        9748 33.7  0.0 475952 53404 ?        Ssl  Mai20 6250:34 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
root       11432  0.0  0.0 10130088 29720 ?      Ssl  Mai20  16:50 /usr/bin/docker run --rm --gpus all --cap-add=SYS_ADMIN --cpus=0.5 --name docker.dcgm-exporter.service -p 9400:9400 -v /opt/deepops/nvidia-dcgm-exporter/dcgm-custom-metrics.csv:/etc/dcgm-exporter/default-counters.csv nvcr.io/nvidia/k8s/dcgm-exporter:2.1.8-2.4.0-rc.2-ubuntu20.04
root       11854 43.1  0.0 4509588 279292 ?      Ssl  Mai20 7992:08 /usr/bin/dcgm-exporter
