Unknown Error on device 0 when running ncu on WSL

Hi there,
I’m trying to profile Fortran code compiled with nvfortran so that I can compare my old OpenMP code with an OpenACC version.
Here is a very simple code using OpenACC, which I compile with
nvfortran -acc -o vector_add simple.f90
Running ./vector_add gives the expected answer.
However, when I run ncu ./vector_add
I get the error below:
ncu ./vector_add

==PROF== Connected to process 9421 (/mnt/c/Users/baihaodong/Documents/2024Tasks/3.Parallel_new/Poisson_IFX_mp/poisson_acc/vector_add)
==ERROR== Unknown Error on device 0.
C(1) = 3.000000000000000
C(N) = 3.000000000000000
==PROF== Disconnected from process 9421

My CUDA version is 12.6 and the nvfortran compiler is from hpc_sdk/Linux_x86_64/24.11.

Could somebody offer some help?

I’m trying to profile OpenACC Fortran code compiled with nvfortran.
Is there a recommended profiler?
pgprof and nvprof can’t be used on GPUs with compute capability 8.0 or higher, and Nsight Compute gives this error, so could someone offer some suggestions?
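
For the data-movement timing in particular, one option that does not rely on GPU performance counters is the NVIDIA HPC runtime's own summary, enabled with the NVCOMPILER_ACC_TIME environment variable. The following is a minimal sketch, assuming the binary is the ./vector_add built above; the run_with_acc_time helper is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run an OpenACC binary with the NVHPC runtime's
# timing summary enabled. With NVCOMPILER_ACC_TIME=1 the runtime prints
# kernel and data-transfer times for each accelerator region at exit.
run_with_acc_time() {
    exe="$1"
    if [ ! -x "$exe" ]; then
        echo "not an executable: $exe" >&2
        return 1
    fi
    shift
    # The assignment applies only to this one command's environment.
    NVCOMPILER_ACC_TIME=1 "$exe" "$@"
}

# Example (assumes ./vector_add exists in the current directory):
# run_with_acc_time ./vector_add
```

For a timeline view of the same copies and kernels, Nsight Systems may also be worth trying, for example nsys profile --trace=openacc,cuda ./vector_add, provided nsys is installed alongside the Linux toolkit.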

Hi, @baihdong

Can you share the source code of the sample so that we can try to reproduce it?
Thanks !

program vector_add
  implicit none

  integer, parameter :: N = 1000000
  real(8), dimension(N) :: A, B, C
  integer :: i

  ! Initialize the vectors
  A = 1.0d0
  B = 2.0d0

  ! Perform vector addition using OpenACC
  !$acc parallel loop
  do i = 1, N
    C(i) = A(i) + B(i)
  end do
  !$acc end parallel loop

  ! Print some results to verify correctness
  print *, "C(1) = ", C(1)
  print *, "C(N) = ", C(N)

end program vector_add

Compiling with
nvfortran -acc -Minfo simple.f90 -o vector_add_2

Thanks for your reply. Additionally, I have a question: when I select this application in Nsight Compute (GUI) on Windows, it says it is not an executable. Could you help me with how to generate a Windows executable from nvfortran?

Hi, @baihdong

We can’t reproduce the issue internally.

CUDA: 12.6.77_560.94
HPC: 24.11
WSL

$ ./vector_add
C(1) = 3.000000000000000
C(N) = 3.000000000000000

$ ncu ./vector_add
==PROF== Connected to process 4084 (/mnt/c/Users/swqa/daniel/vector_add)
==PROF== Profiling "vector_add_13" - 0: 0%....50%....100% - 8 passes
C(1) = 3.000000000000000
C(N) = 3.000000000000000
==PROF== Disconnected from process 4084
[4084] vector_add@127.0.0.1
  vector_add_13 (7813, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.6
    Section: GPU Speed Of Light Throughput
    ----------------------- ----------- ------------
    Metric Name             Metric Unit Metric Value
    ----------------------- ----------- ------------
    DRAM Frequency                  Ghz         6.79
    SM Frequency                    Ghz         1.49
    Elapsed Cycles                cycle        89144
    Memory Throughput                 %        91.25
    DRAM Throughput                   %        91.25
    Duration                         us        59.71
    L1/TEX Cache Throughput           %        16.37
    L2 Cache Throughput               %        42.07
    SM Active Cycles              cycle     80130.04
    Compute (SM) Throughput           %        13.01
    ----------------------- ----------- ------------
    INF   The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
          further improve performance, work will likely need to be shifted from the most utilized to another unit.
          Start by analyzing DRAM in the Memory Workload Analysis section.

    Section: Launch Statistics
    -------------------------------- --------------- ---------------
    Metric Name                          Metric Unit    Metric Value
    -------------------------------- --------------- ---------------
    Block Size                                                   128
    Function Cache Configuration                     CachePreferNone
    Grid Size                                                   7813
    Registers Per Thread             register/thread              16
    Shared Memory Configuration Size           Kbyte           16.38
    Driver Shared Memory Per Block       Kbyte/block            1.02
    Dynamic Shared Memory Per Block       byte/block               0
    Static Shared Memory Per Block        byte/block               0
    # SMs                                         SM              46
    Threads                                   thread         1000064
    Uses Green Context                                             0
    Waves Per SM                                               14.15
    -------------------------------- --------------- ---------------

    Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           16
    Block Limit Registers                 block           32
    Block Limit Shared Mem                block           16
    Block Limit Warps                     block           12
    Theoretical Active Warps per SM        warp           48
    Theoretical Occupancy                     %          100
    Achieved Occupancy                        %        86.72
    Achieved Active Warps Per SM           warp        41.63
    ------------------------------- ----------- ------------
    OPT   Est. Local Speedup: 13.28%
          The difference between calculated theoretical (100.0%) and measured achieved occupancy (86.7%) can be the
          result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
          occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
          Guide (CUDA C++ Best Practices Guide) for more details on
          optimizing occupancy.

    Section: GPU and Memory Workload Distribution
    -------------------------- ----------- ------------
    Metric Name                Metric Unit Metric Value
    -------------------------- ----------- ------------
    Average DRAM Active Cycles       cycle       370008
    Total DRAM Elapsed Cycles        cycle      3244032
    Average L1 Active Cycles         cycle     80130.04
    Total L1 Elapsed Cycles          cycle      3844916
    Average L2 Active Cycles         cycle     75035.81
    Total L2 Elapsed Cycles          cycle      2689056
    Average SM Active Cycles         cycle     80130.04
    Total SM Elapsed Cycles          cycle      3844916
    Average SMSP Active Cycles       cycle     79370.96
    Total SMSP Elapsed Cycles        cycle     15379664
    -------------------------- ----------- ------------
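
As a sanity check on a report like this one, the launch and occupancy figures can be reproduced by hand from the loop size. The sketch below uses only numbers copied from the report above; the formulas are the usual back-of-the-envelope ones and are not claimed to be how ncu computes them internally.

```shell
#!/bin/sh
# Cross-check of the Launch Statistics and Occupancy sections.
n=1000000          # loop trip count (N in the Fortran source)
block=128          # Block Size
sms=46             # # SMs
blocks_per_sm=12   # Block Limit Warps, the tightest block limit reported

# One thread per loop iteration, rounded up to whole blocks.
grid=$(( (n + block - 1) / block ))
threads=$(( grid * block ))

# A "wave" is one full set of resident blocks across all SMs.
waves=$(awk -v g="$grid" -v s="$sms" -v b="$blocks_per_sm" \
        'BEGIN { printf "%.2f", g / (s * b) }')

# Achieved occupancy = achieved active warps / theoretical active warps.
occupancy=$(awk 'BEGIN { printf "%.1f", 100 * 41.63 / 48 }')

echo "grid=$grid threads=$threads waves_per_sm=$waves achieved_occupancy=$occupancy%"
```

The results (7813 blocks, 1000064 threads, 14.15 waves per SM, about 86.7% achieved occupancy) agree with the report, which suggests the OpenACC loop was mapped to one thread per iteration with 128-thread blocks.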

Thank you for the reproduction information.
Could you please provide the compile command? Are there any configuration steps that should be done before profiling?

Can you try another simple CUDA sample on both Windows and WSL? I think this may be due to your setup.

Thanks for your reply.
Actually, before compiling the source code, what I did was:

  1. Download the CUDA tools from
    CUDA Toolkit 12.6 Update 2 Downloads | NVIDIA Developer
    for the Windows operating system.

  2. Download the NVIDIA HPC SDK from
    NVIDIA HPC SDK Current Release Downloads | NVIDIA Developer
    for Linux x86_64 Ubuntu (apt),
    and then add the compilers to PATH.

This is an almost new workstation with no other setup.
Could the problem be that I should have installed the CUDA tools from within WSL?

Are you using the Windows tool binaries on WSL?

I’m sorry, I don’t understand. What I want to do is profile the OpenACC code (for example, to show the data copy-in/copy-out time). Both nvfortran and ncu were installed from within WSL.
Does this mean that, to use the ncu command, I also need to install the CUDA tools from within WSL?

I mean, which CUDA Toolkit package did you install? The Windows version or the Linux version?
In WSL, you should install and use the Linux version of CUDA.
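
A quick way to see which binaries a WSL shell actually picks up is to look at where each tool resolves on PATH. This is only a heuristic sketch: it assumes Windows drives are mounted under /mnt/<letter>/ (the WSL default), and classify_tool_path is a hypothetical helper name, not part of any NVIDIA tool.

```shell
#!/bin/sh
# Hypothetical helper: guess whether a resolved tool path is a Windows
# binary reached through WSL interop or a native Linux install.
classify_tool_path() {
    case "$1" in
        /mnt/[a-z]/*|*.exe) echo "windows" ;;
        *)                  echo "linux" ;;
    esac
}

# Report where each tool resolves; missing tools are reported, not fatal.
for tool in ncu nvcc nvfortran; do
    path=$(command -v "$tool" 2>/dev/null || true)
    if [ -z "$path" ]; then
        echo "$tool: not found on PATH"
    else
        echo "$tool: $path ($(classify_tool_path "$path"))"
    fi
done
```

If ncu or nvcc is reported as "windows", the Windows toolkit is probably leaking into the WSL PATH, and the Linux toolkit's bin directory should come first.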

You are correct. I found that it may be related to my NVIDIA driver version. I have now downloaded the driver for my specific GPU and then installed CUDA tools 12.6 using the wsl-ubuntu runfile (local).
The previous problem went away, but now I get the error below:
sudo ncu ./add_vector
==PROF== Connected to process 1988 (/home/haku/2024tasks/parallel/simmple/add_vector)

==ERROR== An error was reported by the driver:
==ERROR== Profiling failed because a driver resource was unavailable or the user does not have permission to access NVIDIA GPU Performance Counters. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. For instructions on enabling permissions, see NVIDIA Development Tools Solutions - | NVIDIA Developer. See 2. Kernel Profiling Guide — NsightCompute 12.6 documentation for more details.
==ERROR== Failed to profile "simple_vector_11" in process 1988
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).

Then I followed the instructions at
NVIDIA Development Tools Solutions - ERR_NVGPUCTRPERM: Permission issue with Performance Counters
for Windows.

Finally, I get the error below:
sudo ncu ./add_vector
==PROF== Connected to process 2107 (/home/haku/2024tasks/parallel/simmple/add_vector)
==ERROR== Failed to prepare kernel for profiling

==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile "simple_vector_11" in process 2107
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).

Have you done this operation ?

Which driver did you install?
You should install the Windows driver on Windows, not under WSL.
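
One way to confirm from inside WSL that the GPU driver is coming from the Windows host is sketched below. It assumes the usual WSL 2 layout, where the host driver's user-mode libraries appear under /usr/lib/wsl/lib; the WSL_LIB_DIR variable and the report_wsl_driver helper are hypothetical names for illustration.

```shell
#!/bin/sh
# In WSL 2 the GPU driver is installed on Windows; its user-mode pieces
# are normally mapped into the distro under /usr/lib/wsl/lib.
WSL_LIB_DIR=${WSL_LIB_DIR:-/usr/lib/wsl/lib}

# Hypothetical helper: report whether libcuda is present in a directory.
report_wsl_driver() {
    dir="$1"
    if [ -e "$dir/libcuda.so.1" ] || [ -e "$dir/libcuda.so" ]; then
        echo "host driver libraries found in $dir"
    else
        echo "no libcuda in $dir"
    fi
}

report_wsl_driver "$WSL_LIB_DIR"

# nvidia-smi, when available, reports the Windows driver version.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
        || echo "nvidia-smi is present but could not query the GPU"
fi
```

If libcuda shows up somewhere else first (for example from a Linux display driver installed inside the distro), that is a hint that a driver was installed under WSL by mistake.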

Yes, I did the operation in the NVIDIA Control Panel as administrator,
and I installed driver version 553.24 using the Windows executable.
After I rebooted the machine, it finally works.
Thank you so much!

Great. Good to know !
