Thank you for your help.
I investigated the issue further on the host machine. The problem is still reproducible, and the evidence points to profiling resource contention or stale profiler state in this shared production environment rather than a broken CUDA application.
Environment
- OS: Alibaba Cloud Linux 3 (Soaring Falcon)
- GPU: NVIDIA L20
- Driver: 580.82.07
- Nsight Compute CLI: 2025.3.1.0 (`/usr/local/NVIDIA-Nsight-Compute/ncu`)
- Host machine, not in a container
- I cannot reboot this host because it is shared and currently running production workloads
What I verified
1) CUDA execution is healthy
Real CUDA samples run successfully on GPU0:
```bash
CUDA_VISIBLE_DEVICES=0 /usr/local/harp/test/deviceQuery
CUDA_VISIBLE_DEVICES=0 /usr/local/harp/test/gpu/matrixMul_2m13s
```
Both complete successfully, and `matrixMul_2m13s` reports `Result = PASS`.
2) Nsight Compute can attach, but profiling fails during resource allocation
When profiling the real CUDA sample:
```bash
CUDA_VISIBLE_DEVICES=0 /usr/local/NVIDIA-Nsight-Compute/ncu --set full --devices 0 /usr/local/harp/test/gpu/matrixMul_2m13s
```
Nsight Compute connects to the target process:
```text
==PROF== Connected to process …
```
but then fails with:
```text
==ERROR== Profiling failed because a driver resource was unavailable.
==ERROR== Ensure that no other tool (like DCGM) is concurrently collecting profiling data.
==ERROR== Failed to profile "MatrixMulCUDA" in process …
```
3) `ncu --query-metrics` also fails on isolated GPU0
Even with `CUDA_VISIBLE_DEVICES=0`:
```bash
CUDA_VISIBLE_DEVICES=0 ncu --query-metrics --devices 0
```
I still get:
```text
==ERROR== An error was reported by the counter measurement library:
==ERROR== Unknown Error on device 0.
```
4) Profiling is restricted to admin users
Driver parameters show:
```bash
cat /proc/driver/nvidia/params | grep -i -E 'profil|restrict|mps|debug|pma'
```
Relevant output:
```text
RmProfilingAdminOnly: 1
```
I am running the profiling commands as root, so this restriction should not be what blocks profiling.
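For anyone reproducing this on another host, the permission check can be scripted. This is only a minimal sketch: the function names `profiling_admin_only` and `can_profile` are hypothetical, and treating uid 0 as "admin" is a simplifying assumption (the driver's actual check may be capability-based).

```bash
#!/usr/bin/env bash
# Print the RmProfilingAdminOnly value from an NVIDIA driver params file.
# Defaults to the path used above; a different file can be passed for testing.
profiling_admin_only() {
    local params_file="${1:-/proc/driver/nvidia/params}"
    # Lines look like "RmProfilingAdminOnly: 1"; print the value after the colon.
    # Exit non-zero if the key is not present at all.
    awk -F': *' '
        $1 == "RmProfilingAdminOnly" { print $2; found = 1 }
        END { exit !found }
    ' "$params_file"
}

# Report whether the current user is expected to be allowed to profile.
# Assumption: uid 0 counts as an admin user.
can_profile() {
    local restricted
    restricted="$(profiling_admin_only "$@")" || { echo "unknown"; return 2; }
    if [ "$restricted" = "1" ] && [ "$(id -u)" -ne 0 ]; then
        echo "blocked: profiling is admin-only and this user is not root"
        return 1
    fi
    echo "allowed"
}
```

Calling `can_profile` with no arguments reads `/proc/driver/nvidia/params`. On this host I would expect it to report `allowed` when run as root, which is consistent with the permission setting not being the cause.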
5) Nsight Compute lock files are present, but they are not the root cause
I checked the Nsight Compute temp directory:
```bash
ls /tmp/nvidia/nsight_compute/
```
It contained:
```text
lock
lock.1c324aa1-44a0-a26e-a3ba-3df5670a486f
```
I then removed the entire directory:
```bash
rm -rf /tmp/nvidia/nsight_compute/
```
After that, I re-ran profiling:
```bash
CUDA_VISIBLE_DEVICES=0 /usr/local/NVIDIA-Nsight-Compute/ncu --set full --devices 0 /usr/local/harp/test/gpu/matrixMul_2m13s
```
The same profiling failure occurred:
```text
==ERROR== Profiling failed because a driver resource was unavailable.
```
And after the failed attempt, Nsight Compute recreated the lock files automatically:
```text
/tmp/nvidia/nsight_compute/lock
/tmp/nvidia/nsight_compute/lock.
```
So the lock files appear to be a consequence of the failed profiling attempt, not the root cause.
6) Kernel logs show repeated PMA/profiler allocation failures
When I attempt profiling, `dmesg` / `journalctl -k` repeatedly show:
```text
NVRM: nvAssertFailedNoLog: Assertion failed: pStaticInfo->pSmIssueThrottleCtrl != NULL @ kernel_graphics.c:3369
NVRM: nvAssertFailedNoLog: Assertion failed: lastSequence == (firstSequence + recordCount) @ rpc.c:2127
NVRM: nvCheckOkFailedNoLog: Check failed: State in use [NV_ERR_STATE_IN_USE] returned from … NVB0CC_CTRL_CMD_INTERNAL_ALLOC_PMA_STREAM … @ kern_profiler_v2_ctrl.c:315
```
These messages appear consistently across profiling attempts.
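To make the correlation with profiling attempts easier to see, the kernel log can be filtered down to the profiler-related lines. A minimal sketch, assuming the log text is piped in on stdin; the helper names `filter_profiler_errors` and `count_state_in_use` are hypothetical, and the patterns are taken only from the messages quoted above.

```bash
#!/usr/bin/env bash
# Keep only NVRM lines that mention the profiler / PMA allocation path.
# Reads kernel log text on stdin, so it works with both dmesg and journalctl -k.
filter_profiler_errors() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -E 'NVRM:.*(NV_ERR_STATE_IN_USE|ALLOC_PMA_STREAM|kern_profiler)' || true
}

# Count how many lines report NV_ERR_STATE_IN_USE.
count_state_in_use() {
    # grep -c prints 0 but exits 1 when there is no match; mask that status.
    grep -c 'NV_ERR_STATE_IN_USE' || true
}

# Example usage (run as root, before and after one profiling attempt):
#   dmesg | filter_profiler_errors
#   journalctl -k | count_state_in_use
```

If the count goes up by a fixed amount per `ncu` run and stays flat otherwise, that would support the view that the `NV_ERR_STATE_IN_USE` failure is triggered by the profiling attempt itself.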
7) Host-side GPU services/agents that may be relevant
This machine has many always-on GPU-related components, including:
- `nvidia-persistenced`
- `amperf-daemon`
- `amperf-collector`
- `amp-host-agent`
- `device-plugins`
- `nvidia-docker`
- `walle`
- `node-problem-detector`
`fuser`/`lsof` show these processes hold handles on `/dev/nvidia0`, `/dev/nvidia1`, `/dev/nvidiactl`, and `/dev/nvidia-uvm`. GPU1 is also occupied by a production workload.
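The same handle check can be approximated by scanning `/proc` directly, which may help on hosts where `fuser` or `lsof` is not installed. This is a minimal sketch, assuming a Linux `/proc` layout and enough privileges (root) to read other processes' `fd` directories; `holders_of` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# List "PID COMMAND" for every process holding an open descriptor on a path.
# Processes whose fd directory is unreadable are silently skipped.
holders_of() {
    local target="$1" fd pid
    for fd in /proc/[0-9]*/fd/*; do
        # Each fd entry is a symlink to the file it refers to.
        if [ "$(readlink "$fd" 2>/dev/null)" = "$target" ]; then
            pid="${fd#/proc/}"   # strip the leading "/proc/"
            pid="${pid%%/*}"     # keep only the PID component
            # /proc/PID/comm holds the (possibly truncated) command name.
            printf '%s %s\n' "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)"
        fi
    done | sort -u
}

# Example usage:
#   for dev in /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm; do
#       echo "== $dev"; holders_of "$dev"
#   done
```

Note that holding a handle on a device node only shows that a process has the GPU open. It does not by itself show which process, if any, holds the profiling resource.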
Current interpretation
At this point, the CUDA workload itself is healthy, and Nsight Compute can attach to the target process, but profiling fails when the driver attempts to allocate PMA/profiling resources. This looks like a driver/profiler resource conflict or stale profiler state in the current shared host environment.
The lock files under `/tmp/nvidia/nsight_compute` are not the root cause, because removing them does not fix the failure and they are recreated automatically after the failed profiling attempt.