NVIDIA Nsight Compute: The profiler returned an error code: 1

GPU: NVIDIA GeForce RTX 4060 Laptop GPU

CUDA Version: 11.8
Nsight Compute version: 2023.3.1.0 (build 33474944) (public-release)

==PROF== Connected to process 29004 (E:\Workspace\learning\cuda\CudaRuntime\x64\Debug\CudaRuntime.exe)
==ERROR== Failed to prepare kernel for profiling

==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile “addKernel” in process 29004
==PROF== Trying to shutdown target application
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 546.33                 Driver Version: 546.33       CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4060 …    WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   38C    P8               2W / 115W |    266MiB /  8188MiB |      3%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

Hi, @mbxust0901

Sorry you ran into this issue.
Does it reproduce with any CUDA sample, or only with this specific one?

Hi Veraj,

I am facing the same issue: profiling any of the CUDA samples fails in the same way. It occurs with the “full” metric set, but other sets, such as “detailed” and “basic”, work fine.
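To check the per-set behavior on another machine, a minimal Python sketch along these lines could run the same sample once per metric set and collect any ==ERROR== lines. It assumes ncu is on PATH and that the three set names mentioned above exist in your install (ncu --list-sets shows what is available). The helper names build_command, error_lines and try_sets are made up for this illustration.

```python
import subprocess

# Set names taken from this thread; verify them with `ncu --list-sets`.
SETS = ["basic", "detailed", "full"]


def build_command(set_name, app):
    """Return the ncu command line that collects one metric set for `app`."""
    return ["ncu", "--set", set_name, app]


def error_lines(output):
    """Return the lines of captured ncu output that start with ==ERROR==."""
    return [line for line in output.splitlines() if line.startswith("==ERROR==")]


def try_sets(app):
    """Run `app` under ncu once per set; map set name -> list of error lines."""
    results = {}
    for name in SETS:
        proc = subprocess.run(build_command(name, app),
                              capture_output=True, text=True)
        # ncu messages may land on either stream, so scan both.
        results[name] = error_lines(proc.stdout + proc.stderr)
    return results
```

Calling try_sets with the path to your own sample executable returns an empty list for every set that profiled cleanly, which makes it easy to see whether only “full” fails.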

It reproduces with any CUDA sample.

Hi, @mbxust0901

We can’t reproduce your issue internally with version 2023.3.1.0 and driver 546.33. Does the sample run successfully without NCU?

The sample runs successfully without NCU. Is it related to the CUDA Tools version or my GPU “NVIDIA GeForce RTX 4060 Laptop GPU”?

There seems to be no issue with your GPU or tools version. Have you enabled performance counter access in the NVIDIA Control Panel?

Can you try running a CUDA SDK sample such as vectorAdd or matrixMul, and then run ncu $sample directly?

I’ve enabled performance access.

Without NCU:

[Matrix Multiply Using CUDA] - Starting…
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined. Default to use Hopper
GPU Device 0: “Hopper” with compute capability 8.9

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
Performance= 61.14 GFlop/s, Time= 2.144 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

With NCU:

[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 40340 (E:\Workspace\github\cuda-samples\bin\win64\Debug\matrixMul.exe)
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined. Default to use Hopper
GPU Device 0: “Hopper” with compute capability 8.9

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
==ERROR== Failed to prepare kernel for profiling

==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile “MatrixMulCUDA” in process 40340
==PROF== Trying to shutdown target application
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.

Could you get the CUDA 12.3 samples and check with those? The output below does not look correct, since you are actually running on Ada.

MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined. Default to use Hopper
GPU Device 0: “Hopper” with compute capability 8.9
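The “Hopper” label here most likely comes from a fallback in the sample helper code rather than from the hardware: when the lookup table has no entry for the device's SM version, the helper prints the “undefined” message and uses its last entry. A Python sketch of that behavior follows. The table contents and the function name arch_name are illustrative assumptions, not copied from the samples.

```python
# Illustrative lookup tables: (major, minor) -> architecture name.
# OLD_TABLE stands in for samples that predate SM 8.9; NEW_TABLE includes it.
OLD_TABLE = [((7, 0), "Volta"), ((7, 5), "Turing"), ((8, 0), "Ampere"),
             ((8, 6), "Ampere"), ((9, 0), "Hopper")]
NEW_TABLE = [((7, 0), "Volta"), ((7, 5), "Turing"), ((8, 0), "Ampere"),
             ((8, 6), "Ampere"), ((8, 9), "Ada"), ((9, 0), "Hopper")]


def arch_name(major, minor, table):
    """Look up the architecture name, falling back to the table's last entry."""
    for version, name in table:
        if version == (major, minor):
            return name
    default = table[-1][1]
    print(f"MapSMtoArchName for SM {major}.{minor} is undefined. "
          f"Default to use {default}")
    return default
```

With the older table, arch_name(8, 9, OLD_TABLE) falls through to “Hopper”, while the newer table resolves the same device to “Ada”. The mislabel is therefore cosmetic and points at outdated samples, not at a misdetected GPU.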

With the CUDA 12.3 samples, I got the output below:

[Matrix Multiply Using CUDA] - Starting…
GPU Device 0: “Ada” with compute capability 8.9

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
Performance= 1044.62 GFlop/s, Time= 0.125 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

With NCU:
[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 7400 (E:\Workspace\github\cuda-samples-12.3\Samples\0_Introduction\matrixMul\matrixMul.exe)
GPU Device 0: “Ada” with compute capability 8.9

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
==ERROR== Failed to prepare kernel for profiling

==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile “MatrixMulCUDA” in process 7400
==PROF== Trying to shutdown target application
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.

Thanks for the update.

So now you are using:
12.3 samples + 2023.3.1.0 (build 33474944) + 546.33 driver + NVIDIA GeForce RTX 4060 Laptop GPU, and any sample causes “==ERROR== Unknown Error on device 0”.

Can you help us isolate the problem?
For example, run ncu --section ${section_name} or ncu --metrics ${metrics_name} to check whether any individual section or metric works.
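One way to automate this isolation is sketched below in Python: build one ncu command per section (or per metric list), run each, and split the commands into working and failing groups. The section identifiers are examples that should be checked against ncu --list-sections on your install. The helper names and the injectable run parameter are assumptions made for this illustration.

```python
import subprocess

# Example section identifiers; confirm with `ncu --list-sections`.
SECTIONS = ["LaunchStats", "Occupancy", "SpeedOfLight", "MemoryWorkloadAnalysis"]
FAILURE_MARKER = "==ERROR=="


def section_command(section, app):
    """Command line that collects a single section for `app`."""
    return ["ncu", "--section", section, app]


def metrics_command(metrics, app):
    """Command line that collects a comma-separated list of metrics."""
    return ["ncu", "--metrics", ",".join(metrics), app]


def isolate(commands, run=subprocess.run):
    """Run each command and return (working, failing) lists of commands."""
    working, failing = [], []
    for cmd in commands:
        proc = run(cmd, capture_output=True, text=True)
        output = (proc.stdout or "") + (proc.stderr or "")
        if FAILURE_MARKER in output:
            failing.append(cmd)
        else:
            working.append(cmd)
    return working, failing
```

For example, isolate([section_command(s, app) for s in SECTIONS]) with app set to your sample path shows whether every section fails or only some of them, which is the distinction asked about above.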

You can also check in the NCU UI: use “Interactive Profile => Run to Next Kernel => Profile Kernel” to see whether a different error is printed.

I will check further with our engineering team to see whether there is anything else we can do. Thanks!

Yes.

Using ncu --metrics does not work either.

~ ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second,l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum.per_second E:\Workspace\github\cuda-samples-12.3\Samples\0_Introduction\matrixMul\matrixMul.exe
[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 33848 (E:\Workspace\github\cuda-samples-12.3\Samples\0_Introduction\matrixMul\matrixMul.exe)
GPU Device 0: “Ada” with compute capability 8.9

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
==ERROR== Failed to prepare kernel for profiling

==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile “MatrixMulCUDA” in process 33848
==PROF== Trying to shutdown target application
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.

And with NCU-UI:

When using ncu Version 2022.3.0.0 (build 31729285):

[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 43752 (E:\Workspace\github\cuda-samples-12.3\Samples\0_Introduction\matrixMul\matrixMul.exe)
GPU Device 0: “Ada” with compute capability 8.9

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
==ERROR== Profiling is not supported on device 0. To find out supported GPUs refer --list-chips option.
done
Performance= 1028.99 GFlop/s, Time= 0.127 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==PROF== Disconnected from process 43752
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
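The error from the older version refers to the --list-chips option, so one quick check is whether a given ncu build lists the chip at all. The Python sketch below makes two assumptions that are not confirmed in this thread: that the RTX 4060 Laptop GPU is an AD107 chip, and that --list-chips prints chip names separated by commas or whitespace. The function names are hypothetical.

```python
import re
import subprocess


def parse_chips(list_chips_output):
    """Split `ncu --list-chips` output into a set of lowercase chip names."""
    tokens = re.split(r"[\s,]+", list_chips_output)
    return {tok.lower() for tok in tokens if tok}


def supports_chip(chip, list_chips_output):
    """True if `chip` appears in the parsed chip list (case-insensitive)."""
    return chip.lower() in parse_chips(list_chips_output)


def query_ncu_chips(ncu="ncu"):
    """Run `<ncu> --list-chips` and return its standard output as text."""
    proc = subprocess.run([ncu, "--list-chips"], capture_output=True, text=True)
    return proc.stdout
```

Running supports_chip("ad107", query_ncu_chips()) against each installed ncu version would show whether the 2022.3 build simply predates support for this GPU, which would explain why its error differs from the one reported by 2023.3.1.0.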

Thank you. Your assistance is greatly appreciated!

Hi, @mbxust0901

Our developer also set up exactly the same test configuration as yours, but he could not reproduce the issue either.