CUDA Version: 11.8
Nsight Compute version: 2023.3.1.0 (build 33474944) (public-release)
==PROF== Connected to process 29004 (E:\Workspace\learning\cuda\CudaRuntime\x64\Debug\CudaRuntime.exe)
==ERROR== Failed to prepare kernel for profiling
==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile “addKernel” in process 29004
==PROF== Trying to shutdown target application
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
I am facing the same issue: profiling any of the CUDA samples fails the same way. It occurs with the “full” metric set, while other sets, such as “detailed” and “basic”, work fine.
[Matrix Multiply Using CUDA] - Starting…
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined. Default to use Hopper
GPU Device 0: “Hopper” with compute capability 8.9
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
Performance= 61.14 GFlop/s, Time= 2.144 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
With NCU:
[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 40340 (E:\Workspace\github\cuda-samples\bin\win64\Debug\matrixMul.exe)
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined. Default to use Hopper
GPU Device 0: “Hopper” with compute capability 8.9
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
==ERROR== Failed to prepare kernel for profiling
==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile “MatrixMulCUDA” in process 40340
==PROF== Trying to shutdown target application
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
Could you please try the CUDA 12.3 samples? The output below does not look right, since you are actually running on an Ada GPU:
MapSMtoCores for SM 8.9 is undefined. Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined. Default to use Hopper
GPU Device 0: “Hopper” with compute capability 8.9
[Matrix Multiply Using CUDA] - Starting…
GPU Device 0: “Ada” with compute capability 8.9
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
Performance= 1044.62 GFlop/s, Time= 0.125 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
With NCU:
[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 7400 (E:\Workspace\github\cuda-samples-12.3\Samples\0_Introduction\matrixMul\matrixMul.exe)
GPU Device 0: “Ada” with compute capability 8.9
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
==ERROR== Failed to prepare kernel for profiling
==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile “MatrixMulCUDA” in process 7400
==PROF== Trying to shutdown target application
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
So you are now using the 12.3 samples + Nsight Compute 2023.3.1.0 (build 33474944) + driver 546.33 + NVIDIA GeForce RTX 4060 Laptop GPU, and any sample produces “==ERROR== Unknown Error on device 0”.
Could you help us isolate the problem?
For example, run ncu --section ${section_name} or ncu --metrics ${metrics_name} to check whether any single section or metric works.
You can also check in the Nsight Compute UI: use “Interactive Profile => Run to Next Kernel => Profile Kernel” and see whether a different error is printed.
I will also check with our engineering team to see whether there is anything else we can do. Thanks!
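For anyone who wants to automate the per-section isolation suggested above, here is a minimal Python sketch. It assumes ncu is on PATH and that the section identifiers listed match what ncu --list-sections reports on your install; the list is only an illustrative subset, and the helper names (build_command, profiling_failed, isolate) are hypothetical. A run is treated as failed if any output line starts with ==ERROR==, as in the logs in this thread.

```python
import subprocess

# Assumed subset of section identifiers; confirm with `ncu --list-sections`,
# since the available sections can differ between Nsight Compute versions.
SECTIONS = [
    "SpeedOfLight",
    "LaunchStats",
    "Occupancy",
    "MemoryWorkloadAnalysis",
    "ComputeWorkloadAnalysis",
    "SchedulerStats",
    "WarpStateStats",
    "SourceCounters",
]


def build_command(section, target, ncu="ncu"):
    """Return the argument list for profiling `target` with one section."""
    return [ncu, "--section", section, target]


def profiling_failed(output):
    """True if any line of the console output carries the ==ERROR== tag."""
    return any(line.strip().startswith("==ERROR==") for line in output.splitlines())


def isolate(target, sections=SECTIONS, run=subprocess.run):
    """Profile `target` once per section; map each section to True if it worked."""
    results = {}
    for section in sections:
        proc = run(build_command(section, target), capture_output=True, text=True)
        # Check both streams, since it is not assumed which one ncu writes to.
        results[section] = not profiling_failed(proc.stdout + proc.stderr)
    return results
```

Calling isolate(r"path\to\matrixMul.exe") would then give a dictionary showing which sections, if any, can be collected. The same loop could be adapted to --metrics by changing build_command.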
~ ncu --metrics l1tex__t_bytes_pipe_lsu_mem_global_op_ld.sum.per_second,l1tex__t_bytes_pipe_lsu_mem_global_op_st.sum.per_second E:\Workspace\github\cuda-samples-12.3\Samples\0_Introduction\matrixMul\matrixMul.exe
[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 33848 (E:\Workspace\github\cuda-samples-12.3\Samples\0_Introduction\matrixMul\matrixMul.exe)
GPU Device 0: “Ada” with compute capability 8.9
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
==ERROR== Failed to prepare kernel for profiling
==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile “MatrixMulCUDA” in process 33848
==PROF== Trying to shutdown target application
==ERROR== An error occurred while trying to profile.
==WARNING== No kernels were profiled.
With ncu version 2022.3.0.0 (build 31729285):
[Matrix Multiply Using CUDA] - Starting…
==PROF== Connected to process 43752 (E:\Workspace\github\cuda-samples-12.3\Samples\0_Introduction\matrixMul\matrixMul.exe)
GPU Device 0: “Ada” with compute capability 8.9
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
==ERROR== Profiling is not supported on device 0. To find out supported GPUs refer --list-chips option.
done
Performance= 1028.99 GFlop/s, Time= 0.127 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==PROF== Disconnected from process 43752
==WARNING== No kernels were profiled.
==WARNING== Profiling kernels launched by child processes requires the --target-processes all option.
Thank you. Your assistance is greatly appreciated!