NCU cannot profile on L20

I can't use ncu on my L20 GPU. The CLI output is below.

My ncu version:
$ ncu --version
NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2025 NVIDIA Corporation
Version 2025.3.1.0 (build 36398880) (public-release)

$ sudo ncu --query-metrics --devices 1
Device NVIDIA L20 (AD102)
==ERROR== An error was reported by the counter measurement library:
==ERROR== Unknown Error on device 1.

$ sudo dmesg -wH | grep -iE 'nvidia|nvrm'

[ +4.309603] NVRM: nvAssertFailedNoLog: Assertion failed: pStaticInfo->pSmIssueThrottleCtrl != NULL @ kernel_graphics.c:3411
[ +0.010003] NVRM: nvAssertFailedNoLog: Assertion failed: pStaticInfo->pSmIssueThrottleCtrl != NULL @ kernel_graphics.c:3411
[ +0.160973] NVRM: nvCheckOkFailedNoLog: Check failed: State in use [NV_ERR_STATE_IN_USE] (0x00000063) returned from pRmApi->Control(pRmApi, pClient->hClient, hObject, NVB0CC_CTRL_CMD_INTERNAL_ALLOC_PMA_STREAM, &internalParams, sizeof(internalParams)) @ kern_profiler_v2_ctrl.c:315

Hi, @2205151451

Sorry for the issue you ran into.

Please provide more details. Thanks!

  1. Can this issue be reproduced after a reboot?
  2. Have you enabled profiling permission?
  3. Please check whether there is an Nsight Compute lock file under /tmp/nvidia/nsight_compute. If there is, delete all of them and try again.

Thanks for your help.

I checked the requested items and here is the current status:

  1. Reboot:
    I have not rebooted the machine because it is a shared server and currently running other users’ production workloads. So I cannot do a reboot-only verification at this time.

  2. Profile permission:
    /proc/driver/nvidia/params shows:
    RmProfilingAdminOnly: 1
    So profiling is restricted to admin users, and I am invoking ncu with sudo.

  3. Nsight Compute lock file:
    I found a lock file under:
    /tmp/nvidia/nsight_compute/lock
    I removed it with:
    rm -f /tmp/nvidia/nsight_compute/lock
    However, the issue still reproduces after removing the lock file.

  4. Reproduction after removing the lock file:
    sudo ncu --query-metrics --devices 0
    still returns:
    Device NVIDIA L20 (AD102)
    ==ERROR== An error was reported by the counter measurement library:
    ==ERROR== Unknown Error on device 0.

  5. Current GPU status:
    nvidia-smi shows that GPU0 is still in use by the existing workload:

    • /home/admin/sgl/bin/python
    • sglang::scheduler

    GPU1 also has an nvidia-cuda-mps-server process.

Based on this, the lock file was present but removing it did not fix the issue. The failure seems to happen deeper in the counter measurement library / PMA stream allocation path, and the kernel log also reports:

  • pStaticInfo->pSmIssueThrottleCtrl != NULL @ kernel_graphics.c:3411
  • NV_ERR_STATE_IN_USE returned from NVB0CC_CTRL_CMD_INTERNAL_ALLOC_PMA_STREAM

If you need any other non-disruptive checks, I can provide them.

The issue is likely the MPS server running on device 1. ncu supports MPS only as an opt-in. See the Profiling Guide in the Nsight Compute 13.2 documentation for how to use it. Either limit your application and the tool to device 0 using CUDA_VISIBLE_DEVICES, or use device 1 in ncu's MPS mode. I don't think you can combine both devices in the same profile session.
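To see which GPUs an MPS server is attached to before choosing a device, something like the following could help. It is only a sketch: the helper name `gpus_with_mps` is made up for illustration, and it assumes the MPS server shows up in the compute process list under the name `nvidia-cuda-mps-server`, as it did in the `nvidia-smi` output described above.

```bash
#!/usr/bin/env bash
# Print the UUIDs of GPUs that list an MPS server among their compute processes.
# Input (stdin): CSV lines of the form "gpu_uuid, pid, process_name".
# gpus_with_mps is a hypothetical helper name, not part of any NVIDIA tool.
gpus_with_mps() {
    # Split on ", " and keep the UUID of rows whose process name is the MPS server.
    awk -F', *' '$3 ~ /nvidia-cuda-mps-server/ { print $1 }' | sort -u
}
```

It could be fed with `nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name --format=csv,noheader | gpus_with_mps`. A GPU that does not appear in the output is a candidate for a non-MPS profiling session restricted with `CUDA_VISIBLE_DEVICES`.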

Thank you for your help.

I further investigated the issue on the host machine. The problem is still reproducible, and the evidence suggests a profiling resource contention or stale profiler state in this shared production environment, rather than a broken CUDA application.

Environment

  • OS: Alibaba Cloud Linux 3 (Soaring Falcon)
  • GPU: NVIDIA L20
  • Driver: 580.82.07
  • Nsight Compute CLI: 2025.3.1.0 (`/usr/local/NVIDIA-Nsight-Compute/ncu`)
  • Host machine, not in a container
  • I cannot reboot this host because it is shared and currently running production workloads

What I verified

1) CUDA execution is healthy

Real CUDA samples run successfully on GPU0:

```bash
CUDA_VISIBLE_DEVICES=0 /usr/local/harp/test/deviceQuery
CUDA_VISIBLE_DEVICES=0 /usr/local/harp/test/gpu/matrixMul_2m13s
```

Both complete successfully, and `matrixMul_2m13s` reports `Result = PASS`.

2) Nsight Compute can attach, but profiling fails during resource allocation

When profiling the real CUDA sample:

```bash
CUDA_VISIBLE_DEVICES=0 /usr/local/NVIDIA-Nsight-Compute/ncu --set full --devices 0 /usr/local/harp/test/gpu/matrixMul_2m13s
```

Nsight Compute connects to the target process:

```text
==PROF== Connected to process …
```

but then fails with:

```text
==ERROR== Profiling failed because a driver resource was unavailable.
==ERROR== Ensure that no other tool (like DCGM) is concurrently collecting profiling data.
==ERROR== Failed to profile "MatrixMulCUDA" in process …
```

3) `ncu --query-metrics` also fails on isolated GPU0

Even with `CUDA_VISIBLE_DEVICES=0`:

```bash
CUDA_VISIBLE_DEVICES=0 ncu --query-metrics --devices 0
```

I still get:

```text
==ERROR== An error was reported by the counter measurement library:
==ERROR== Unknown Error on device 0.
```

4) Profiling is restricted to admin users

Driver parameters show:

```bash
cat /proc/driver/nvidia/params | grep -i -E 'profil|restrict|mps|debug|pma'
```

Relevant output:

```text
RmProfilingAdminOnly: 1
```

I am running the profiling commands as root.
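The permission check above can be scripted roughly as follows. This is a minimal sketch: the function names are hypothetical, and it assumes the params file uses the `Name: value` layout shown in the output above.

```bash
#!/usr/bin/env bash
# Print the value of RmProfilingAdminOnly from the driver params file.
# $1: path to the params file (defaults to the real procfs path).
profiling_admin_only() {
    local params_file="${1:-/proc/driver/nvidia/params}"
    # Lines look like "RmProfilingAdminOnly: 1"; print only the value.
    awk -F': *' '$1 == "RmProfilingAdminOnly" { print $2 }' "$params_file"
}

# Report whether the given user can be expected to profile.
# $1: params file path (optional), $2: numeric user id (defaults to the caller's).
check_profiling_access() {
    local uid="${2:-$(id -u)}"
    if [ "$(profiling_admin_only "$1")" = "1" ] && [ "$uid" -ne 0 ]; then
        echo "restricted: run ncu as root or with sudo"
    else
        echo "ok"
    fi
}
```

Calling `check_profiling_access` with no arguments checks the current user against the live driver setting. A result of `ok` only rules out the permission problem; it says nothing about whether another tool already holds the profiling resources.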

5) Nsight Compute lock files are present, but they are not the root cause

I checked the Nsight Compute temp directory:

```bash
ls /tmp/nvidia/nsight_compute/
```

It contained:

```text
lock
lock.1c324aa1-44a0-a26e-a3ba-3df5670a486f
```

I then removed the entire directory:

```bash
rm -rf /tmp/nvidia/nsight_compute/
```

After that, I re-ran profiling:

```bash
CUDA_VISIBLE_DEVICES=0 /usr/local/NVIDIA-Nsight-Compute/ncu --set full --devices 0 /usr/local/harp/test/gpu/matrixMul_2m13s
```

The same profiling failure occurred:

```text
==ERROR== Profiling failed because a driver resource was unavailable.
```

And after the failed attempt, Nsight Compute recreated the lock files automatically:

```text
/tmp/nvidia/nsight_compute/lock
/tmp/nvidia/nsight_compute/lock.
```

So the lock files appear to be a consequence of the failed profiling attempt, not the root cause.

6) Kernel logs show repeated PMA/profiler allocation failures

When I attempt profiling, `dmesg` / `journalctl -k` repeatedly show:

```text
NVRM: nvAssertFailedNoLog: Assertion failed: pStaticInfo->pSmIssueThrottleCtrl != NULL @ kernel_graphics.c:3369
NVRM: nvAssertFailedNoLog: Assertion failed: lastSequence == (firstSequence + recordCount) @ rpc.c:2127
NVRM: nvCheckOkFailedNoLog: Check failed: State in use [NV_ERR_STATE_IN_USE] returned from … NVB0CC_CTRL_CMD_INTERNAL_ALLOC_PMA_STREAM … @ kern_profiler_v2_ctrl.c:315
```

These messages appear consistently in `dmesg` / `journalctl -k` during profiling attempts.
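To confirm that a given profiling attempt triggers the PMA stream failure, the kernel log can be counted before and after the attempt. The sketch below assumes the log lines keep the wording shown above; `count_pma_in_use` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Count kernel log lines where the PMA stream allocation failed with NV_ERR_STATE_IN_USE.
# Reads kernel log text on stdin and prints the number of matching lines.
count_pma_in_use() {
    # grep -c exits non-zero when there are no matches; "|| true" keeps the
    # function usable under "set -e" while still printing 0.
    grep -c 'NV_ERR_STATE_IN_USE.*NVB0CC_CTRL_CMD_INTERNAL_ALLOC_PMA_STREAM' || true
}
```

For example, run `sudo dmesg | count_pma_in_use`, run the failing `ncu` command, and then run the count again. If the number goes up, the failure comes from that attempt and not from older log entries.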

7) Host-side GPU services/agents that may be relevant

This machine has many always-on GPU-related components, including:

  • `nvidia-persistenced`
  • `amperf-daemon`
  • `amperf-collector`
  • `amp-host-agent`
  • `device-plugins`
  • `nvidia-docker`
  • `walle`
  • `node-problem-detector`

`fuser`/`lsof` show these processes hold handles on `/dev/nvidia0`, `/dev/nvidia1`, `/dev/nvidiactl`, and `/dev/nvidia-uvm`. GPU1 is also occupied by a production workload.
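A compact way to summarize the handle holders is to reduce the `lsof` output to unique command/PID pairs. This is a sketch that assumes the default `lsof` column layout (a header line, then COMMAND and PID as the first two columns); `gpu_handle_holders` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Summarize which processes hold GPU device nodes open.
# Reads lsof output on stdin (header line first) and prints each
# "COMMAND PID" pair once, sorted.
gpu_handle_holders() {
    # Skip the header, keep the first two columns, de-duplicate.
    awk 'NR > 1 { print $1, $2 }' | sort -u
}
```

Usage would look like `sudo lsof /dev/nvidia* | gpu_handle_holders`. Holding a device handle does not by itself mean a process uses the profiling counters, so the list only narrows down which services to examine.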

Current interpretation

At this point, the CUDA workload itself is healthy, and Nsight Compute can attach to the target process, but profiling fails when the driver attempts to allocate PMA/profiling resources. This looks like a driver/profiler resource conflict or stale profiler state in the current shared host environment.

The lock files under `/tmp/nvidia/nsight_compute` are not the root cause, because removing them does not fix the failure and they are recreated automatically after the failed profiling attempt.

I found the root cause and confirmed it by stopping the system GPU monitoring services whose unit names match amp.*service (amp-host-agent, amperf-daemon, amperf-collector).

Before stopping them, Nsight Compute failed with:
Profiling failed because a driver resource was unavailable
and the kernel log showed:
NV_ERR_STATE_IN_USE
NVB0CC_CTRL_CMD_INTERNAL_ALLOC_PMA_STREAM

After stopping the amp.*service units, the same Nsight Compute command succeeded on the same CUDA sample:
/usr/local/NVIDIA-Nsight-Compute/ncu --set full --devices 0 /usr/local/harp/test/gpu/matrixMul_2m13s

This confirms the profiling counter/backend was being held by the host-side GPU monitoring services, not by the CUDA workload itself.
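For anyone in the same situation, the stop/profile/restart sequence could be wrapped as below. This is a sketch under assumptions: the exact unit names in `UNITS` (with the `.service` suffix) are guesses based on the service names above and must be replaced with the real ones on your host, and `run` and `profile_without_monitors` are hypothetical helpers. By default it only prints the commands it would run.

```bash
#!/usr/bin/env bash
# Temporarily stop GPU monitoring services around a profiling run, then restart them.
# UNITS is an assumption: replace it with the actual unit names on your host.
UNITS="${UNITS:-amp-host-agent.service amperf-daemon.service amperf-collector.service}"
# With DRY_RUN=1 (the default here) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: profile_without_monitors <profiling command and its arguments>
profile_without_monitors() {
    # $UNITS is left unquoted on purpose so each unit becomes its own argument.
    run systemctl stop $UNITS
    run "$@"
    local rc=$?
    # Restart the services even if the profiling command failed.
    run systemctl start $UNITS
    return "$rc"
}
```

After reviewing the printed commands, running it as root with `DRY_RUN=0`, for example `profile_without_monitors /usr/local/NVIDIA-Nsight-Compute/ncu --set full --devices 0 /usr/local/harp/test/gpu/matrixMul_2m13s`, keeps the monitoring gap as short as the profiling run itself.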