Unknown Error on device 0 when Running NCU on wsl

baihdong · November 18, 2024, 2:09pm

Hi there,
I’m trying to profile Fortran code compiled with nvfortran so that I can compare my old OpenMP code with an OpenACC version.
Here is a very simple OpenACC code, which I compile with
nvfortran -acc -o vector_add simple.f90
Running ./vector_add gives the expected answer.
However, when I run ncu ./vector_add
I get the error below:
ncu ./vector_add

==PROF== Connected to process 9421 (/mnt/c/Users/baihaodong/Documents/2024Tasks/3.Parallel_new/Poisson_IFX_mp/poisson_acc/vector_add)
==ERROR== Unknown Error on device 0.
C(1) = 3.000000000000000
C(N) = 3.000000000000000
==PROF== Disconnected from process 9421

My CUDA version is 12.6 and the nvfortran compiler is from hpc_sdk/Linux_x86_64/24.11.

Could somebody help?

baihdong · November 19, 2024, 3:02am

I’m trying to profile OpenACC Fortran code compiled with nvfortran.
Is there a recommended profiler?
pgprof and nvprof can’t be used on compute capability higher than 8.0, and Nsight Compute gives the error above. Could someone offer some suggestions?

veraj · November 19, 2024, 3:13am

Hi, @baihdong

Can you share the source code of the sample so that we can try to reproduce the issue?
Thanks !

baihdong · November 19, 2024, 3:47am

program vector_add
  implicit none

  integer, parameter :: N = 1000000
  real(8), dimension(N) :: A, B, C
  integer :: i

  ! Initialize the vectors
  A = 1.0d0
  B = 2.0d0

  ! Perform vector addition using OpenACC
  !$acc parallel loop
  do i = 1, N
    C(i) = A(i) + B(i)
  end do
  !$acc end parallel loop

  ! Print some results to verify correctness
  print *, "C(1) = ", C(1)
  print *, "C(N) = ", C(N)

end program vector_add

Compiled with
nvfortran -acc -Minfo simple.f90 -o vector_add_2

Thanks for your reply. Additionally, I have a question: when I select this application in Nsight Compute (GUI) on the Windows platform, it says it is not an executable. Could you explain how to generate a Windows executable with nvfortran?

veraj · November 19, 2024, 7:39am

Hi, @baihdong

We can’t reproduce the issue internally.

CUDA: 12.6.77_560.94
HPC: 24.11
WSL

$ ./vector_add
C(1) = 3.000000000000000
C(N) = 3.000000000000000

$ ncu ./vector_add
==PROF== Connected to process 4084 (/mnt/c/Users/swqa/daniel/vector_add)
==PROF== Profiling "vector_add_13" - 0: 0%....50%....100% - 8 passes
C(1) = 3.000000000000000
C(N) = 3.000000000000000
==PROF== Disconnected from process 4084
[4084] vector_add@127.0.0.1
  vector_add_13 (7813, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.6
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         6.79
    SM Frequency                    Ghz         1.49
    Elapsed Cycles                cycle        89144
    Memory Throughput                 %        91.25
    DRAM Throughput                   %        91.25
    Duration                         us        59.71
    L1/TEX Cache Throughput           %        16.37
    L2 Cache Throughput               %        42.07
    SM Active Cycles              cycle     80130.04
    Compute (SM) Throughput           %        13.01
    ----------------------- ----------- ------------
    INF   The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
          further improve performance, work will likely need to be shifted from the most utilized to another unit.
          Start by analyzing DRAM in the Memory Workload Analysis section.

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                   128
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                   7813
    Registers Per Thread             register/thread              16
    Shared Memory Configuration Size           Kbyte           16.38
    Driver Shared Memory Per Block       Kbyte/block            1.02
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               0
    # SMs                                         SM              46
    Threads                                   thread         1000064
    Uses Green Context                                             0
    Waves Per SM                                               14.15
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           32
    Block Limit Shared Mem                block           16
    Block Limit Warps                     block           12
    Theoretical Active Warps per SM        warp           48
    Theoretical Occupancy                     %          100
    Achieved Occupancy                        %        86.72
    Achieved Active Warps Per SM           warp        41.63
    ------------------------------- ----------- ------------
    OPT   Est. Local Speedup: 13.28%
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (86.7%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (CUDA C++ Best Practices Guide) for more details on
          optimizing occupancy.

    Section: GPU and Memory Workload Distribution
    -------------------------- ----------- ------------
    Metric Name                Metric Unit Metric Value
    -------------------------- ----------- ------------
    Average DRAM Active Cycles       cycle       370008
    Total DRAM Elapsed Cycles        cycle      3244032
    Average L1 Active Cycles         cycle     80130.04
    Total L1 Elapsed Cycles          cycle      3844916
    Average L2 Active Cycles         cycle     75035.81
    Total L2 Elapsed Cycles          cycle      2689056
    Average SM Active Cycles         cycle     80130.04
    Total SM Elapsed Cycles          cycle      3844916
    Average SMSP Active Cycles       cycle     79370.96
    Total SMSP Elapsed Cycles        cycle     15379664
    -------------------------- ----------- ------------

baihdong · November 19, 2024, 7:57am

Thank you for the reproduction details.
Could you please provide the compile command? Also, are there any configuration steps that should be done before profiling?

veraj · November 19, 2024, 8:11am

Can you try another simple CUDA sample on both Windows and WSL? I think this may be due to your setup.

baihdong · November 19, 2024, 8:38am

Thanks for your reply.
Actually, before compiling the source code, this is what I did:

1. Downloaded the CUDA tools for the Windows operating system from
CUDA Toolkit 12.6 Update 2 Downloads | NVIDIA Developer
2. Downloaded the NVIDIA HPC SDK for Linux x86_64 Ubuntu (apt) from
NVIDIA HPC SDK Current Release Downloads | NVIDIA Developer
3. Added the compilers to PATH.

This is an almost new workstation with no other setup.
Could the problem be that I should have downloaded the CUDA tools from within WSL?

veraj · November 19, 2024, 8:42am

Are you using the Windows tool binaries on WSL?

baihdong · November 19, 2024, 8:56am

I’m sorry, I don’t understand. What I want to do is profile the OpenACC code (for example, to show the data copy in/out time). Both nvfortran and ncu were downloaded from within WSL.
Does this mean that to use the ncu command I also need to download the CUDA tools from within WSL?
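
For the data copy in/out time mentioned here, a kernel-level profiler such as ncu is not the usual tool; transfer time is normally read from a runtime summary or a timeline trace. The following is a minimal sketch under two assumptions that were not tried in this thread: that the NVHPC OpenACC runtime honors the NV_ACC_TIME environment variable, and that Nsight Systems (nsys) is installed inside the same WSL distribution. The helper name timing_cmds and the APP and RUN variables are made up for illustration. By default the script only prints the commands:

```shell
#!/bin/sh
# Sketch: two candidate ways to see OpenACC data-transfer time for an executable.
# Assumptions (not confirmed in this thread): NV_ACC_TIME is honored by the
# NVHPC OpenACC runtime, and nsys is installed inside the WSL distribution.

# Print one candidate command per line for the given executable.
timing_cmds() {
    # Runtime summary of kernel and data-movement time, printed at program exit.
    echo "env NV_ACC_TIME=1 $1"
    # Timeline trace of CUDA and OpenACC activity, with summary tables.
    echo "nsys profile --trace=cuda,openacc --stats=true $1"
}

# Dry run by default; set RUN=1 to actually execute each command.
timing_cmds "${APP:-./vector_add}" | while IFS= read -r cmd; do
    echo "+ $cmd"
    if [ "${RUN:-0}" = "1" ]; then
        sh -c "$cmd" </dev/null
    fi
done
```

Running it as RUN=1 APP=./vector_add sh script.sh would execute both commands in turn; whether nsys can collect a trace under WSL on a given driver is something to verify separately.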

veraj · November 19, 2024, 9:33am

I mean, which CUDA toolkit package did you install: the Windows version or the Linux version?
In WSL, you should install and use the Linux version of CUDA.
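
One quick way to see which binaries a WSL shell actually picks up is to look at where each tool resolves on PATH. This is only a rough sketch: the helper name classify_path is made up here, and the rule of thumb it encodes (a .exe suffix or a /mnt/<drive>/ prefix suggests a Windows-side install) is a heuristic, not something ncu itself checks.

```shell
#!/bin/sh
# Classify a resolved tool path. Heuristic only: a .exe suffix marks a Windows
# binary, and anything under /mnt/<drive>/ lives on the Windows filesystem.
classify_path() {
    case "$1" in
        "")            echo "not-found" ;;
        *.exe)         echo "windows-binary" ;;
        /mnt/[a-z]/*)  echo "windows-filesystem" ;;
        *)             echo "linux" ;;
    esac
}

# Report where each tool resolves in the current shell.
for tool in ncu nvfortran nvcc nvidia-smi; do
    path=$(command -v "$tool" 2>/dev/null)
    printf '%-12s %-20s %s\n' "$tool" "$(classify_path "$path")" "$path"
done
```

Any result other than linux for ncu or the compilers is a hint that the Windows toolkit is leaking into the WSL PATH. It says nothing about the driver, which is installed on the Windows side.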

baihdong · November 19, 2024, 9:49am

You are correct. I found that it may be related to my NVIDIA driver version. I have now downloaded the driver for my specific GPU, then installed CUDA tools 12.6 using the WSL-Ubuntu runfile (local) installer.
The previous problem went away, but now I get the error below:
sudo ncu ./add_vector
==PROF== Connected to process 1988 (/home/haku/2024tasks/parallel/simmple/add_vector)

==ERROR== An error was reported by the driver:
==ERROR== Profiling failed because a driver resource was unavailable or the user does not have permission to access NVIDIA GPU Performance Counters. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. For instructions on enabling permissions, see NVIDIA Development Tools Solutions - | NVIDIA Developer. See 2. Kernel Profiling Guide — NsightCompute 12.6 documentation for more details.
==ERROR== Failed to profile "simple_vector_11" in process 1988
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).

Then I followed the instructions for Windows at
NVIDIA Development Tools Solutions - ERR_NVGPUCTRPERM: Permission issue with Performance Counters

Finally, I get the error below:
sudo ncu ./add_vector
==PROF== Connected to process 2107 (/home/haku/2024tasks/parallel/simmple/add_vector)
==ERROR== Failed to prepare kernel for profiling

==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile "simple_vector_11" in process 2107
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).

veraj · November 19, 2024, 9:52am

Have you done this operation?

veraj · November 19, 2024, 9:54am

Which driver did you install?
You should install the Windows driver on Windows, not a driver inside WSL.

baihdong · November 19, 2024, 10:07am

Yes, I did that operation in the NVIDIA Control Panel as administrator,
and I installed driver version 553.24 using the Windows executable.
Finally, after rebooting the machine, it works.
Thank you so much!

veraj · November 19, 2024, 10:15am

Great. Good to know!

system · December 3, 2024, 10:16am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
