Nsight Compute fails to profile on L20 GPU

When I use ncu to profile a CUDA kernel (GEMM) on an L20 GPU, it fails and reports this error:

==PROF== Connected to process 3522403 (/data/workspace/gemm_test)
==ERROR== Failed to prepare kernel for profiling

==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile "Kernel" in process 3522403
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.

My command:

ncu --set full -o gemm_test ./gemm_test 256 128 64 fp16

Nsight Compute version:

NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2023 NVIDIA Corporation
Version 2023.2.2.0 (build 33188574) (public-release)

GPU info:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L20                     On  | 00000000:23:00.0 Off |                    0 |
| N/A   28C    P8              36W / 350W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA L20                     On  | 00000000:33:00.0 Off |                    0 |
| N/A   29C    P8              35W / 350W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA L20                     On  | 00000000:34:00.0 Off |                    0 |
| N/A   37C    P0              88W / 350W |  14894MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA L20                     On  | 00000000:43:00.0 Off |                    0 |
| N/A   29C    P8              37W / 350W |      0MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

The same binary runs correctly and can be profiled on A10 and A100 GPUs.
Does Nsight Compute support the L20 GPU?

Can you try again with the latest version of Nsight Compute? It should currently be version 2024.1.1.

It still fails after updating ncu to version 2024.1.1.
Command: /usr/local/NVIDIA-Nsight-Compute-2024.1/ncu --set full -f -o gemm_profile ./gemm_test 64 256 1024 fp16

==PROF== Connected to process 5061 (/data/shuren/gemm_test)
==ERROR== Failed to prepare kernel for profiling

==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile "ampere_bf16_s16816gemm_bf16_6…" in process 5061
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.

Version:

/usr/local/NVIDIA-Nsight-Compute-2024.1/ncu --version

NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.1.1.0 (build 33998838) (public-release)

OK, we are getting some additional information now:

Can you try updating to CUDA 12.4 Update 1? Make sure you install the driver that comes with it. Also verify your CUDA install by running some of the suggested sample programs, such as vectorAdd.
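
Before re-running the profiler, it can help to record which driver, toolkit, and profiler versions are actually picked up from PATH. The following is a minimal sketch assuming a bash shell; report_tool is a hypothetical helper (not part of any NVIDIA tool) that only prints a message when a tool is missing.

```shell
#!/usr/bin/env bash
# report_tool is a hypothetical helper: it runs a tool with the given
# arguments if the tool is on PATH, otherwise it reports that it is missing.
report_tool() {
    local tool="$1"; shift
    if command -v "$tool" >/dev/null 2>&1; then
        "$tool" "$@"
    else
        echo "$tool: not found in PATH"
    fi
}

# Driver version and GPU list as seen by the driver.
report_tool nvidia-smi --query-gpu=index,name,driver_version --format=csv
# CUDA toolkit (compiler) version.
report_tool nvcc --version
# Nsight Compute CLI version.
report_tool ncu --version
```

If several toolkit or Nsight Compute installs coexist on the machine, the versions printed here may differ from the ones that were just installed, which would be worth ruling out first.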

I have updated the CUDA driver and CUDA toolkit. My version is now:

Unfortunately, Nsight Compute still fails.
The vectorAdd sample runs correctly:

./Samples/0_Introduction/vectorAdd/vectorAdd

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

But it can’t be profiled by ncu:

ncu -o vecAdd_profile ./Samples/0_Introduction/vectorAdd/vectorAdd
[Vector addition of 50000 elements]
==PROF== Connected to process 42839 (/data/shuren/code/cuda-samples/Samples/0_Introduction/vectorAdd/vectorAdd)
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
==ERROR== Failed to prepare kernel for profiling

==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile "vectorAdd" in process 42839
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.

By the way, I run the kernel on device 1. GPU 0 in the picture is being used by other people.
Is there a known working way to profile with ncu on an L20 GPU? If so, I can try to reproduce it. Thanks.

Suggestions:

  1. Try running with CUDA_VISIBLE_DEVICES="1", like this:

    CUDA_VISIBLE_DEVICES="1" ncu -o vecAdd_profile ./Samples/0_Introduction/vectorAdd/vectorAdd

  2. Make sure the instructions here have been properly applied to your machine
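
As a sketch of the first suggestion, the wrapper below exposes a single GPU to whatever command it runs. It assumes a bash shell; run_on_gpu is a hypothetical helper name, not an ncu or CUDA feature. It also sets CUDA_DEVICE_ORDER=PCI_BUS_ID so that the index matches the ordering nvidia-smi prints, since CUDA's default device ordering is not guaranteed to match it.

```shell
#!/usr/bin/env bash
# run_on_gpu is a hypothetical helper: it runs a command with only one
# physical GPU visible to CUDA, using PCI bus ordering for the index.
run_on_gpu() {
    local gpu_index="$1"; shift
    CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES="$gpu_index" "$@"
}

# Harmless demonstration that the variable reaches the child process.
run_on_gpu 1 sh -c 'echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'

# Intended use (needs ncu and the sample binary, so it is left commented out):
# run_on_gpu 1 ncu -o vecAdd_profile ./Samples/0_Introduction/vectorAdd/vectorAdd
```

Note that once only one GPU is visible, CUDA renumbers it as device 0 inside the process, so a message that mentions "device 0" would then refer to physical GPU 1.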

If neither of those suggestions helps, I suggest asking for help on the Nsight Compute forum.

Thanks, I have asked for help on the nsight compute forum.