Hi there,
I’m trying to profile Fortran code compiled with nvfortran so that I can compare my old OpenMP code with an OpenACC version.
Here is a very simple code using OpenACC. I compile it with
nvfortran -acc -o vector_add simple.f90
Running ./vector_add gives the expected answer.
However, when I run ncu ./vector_add
I get the error below:
ncu ./vector_add
==PROF== Connected to process 9421 (/mnt/c/Users/baihaodong/Documents/2024Tasks/3.Parallel_new/Poisson_IFX_mp/poisson_acc/vector_add)
==ERROR== Unknown Error on device 0.
C(1) = 3.000000000000000
C(N) = 3.000000000000000
==PROF== Disconnected from process 9421
My CUDA version is 12.6 and the nvfortran compiler is from hpc_sdk/Linux_x86_64/24.11.
I’m trying to profile OpenACC Fortran code compiled with nvfortran.
Is there a recommended profiler?
pgprof and nvprof can’t be used on devices with compute capability 8.0 and higher, and Nsight Compute fails with the error above. Could someone offer some suggestions?
program vector_add
implicit none
integer, parameter :: N = 1000000
real(8), dimension(N) :: A, B, C
integer :: i
! Initialize the vectors
A = 1.0d0
B = 2.0d0
! Perform vector addition using OpenACC
!$acc parallel loop
do i = 1, N
C(i) = A(i) + B(i)
end do
!$acc end parallel loop
! Print some results to verify correctness
print *, "C(1) = ", C(1)
print *, "C(N) = ", C(N)
end program vector_add
Compiling with
nvfortran -acc -Minfo simple.f90 -o vector_add_2
Thanks for your reply. I have an additional question: when I select this application in Nsight Compute (GUI) on Windows, it says it is not an executable. Could you help me with how to generate a Windows executable from nvfortran?
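A note on the GUI question: the binary that nvfortran produces inside WSL is a Linux executable, so the Windows GUI cannot launch it as a Windows program. One possible workaround, sketched below, is to collect the report with the command-line profiler inside WSL and then open the saved report file in the Windows GUI. The report name `vector_add_report` is an arbitrary choice for this example, and the guard around the call is only there so the script does nothing harmful when `ncu` or the binary is missing.

```shell
#!/usr/bin/env bash
# Sketch: profile inside WSL, then open the saved report in the Windows GUI.
# "vector_add_report" is an arbitrary output name chosen for this example.
app=./vector_add
report=vector_add_report

if command -v ncu >/dev/null 2>&1 && [ -x "$app" ]; then
  # -o writes ${report}.ncu-rep instead of printing results to the terminal.
  # -f overwrites an existing report with the same name.
  # --set full collects a larger group of sections (more replay passes, slower).
  ncu --set full -f -o "$report" "$app"
  echo "Report written to ${report}.ncu-rep; open it with File > Open in Nsight Compute."
else
  echo "ncu or $app not found; nothing was profiled."
fi
```

Since the thread's working directory is under /mnt/c, the report file should also be visible from the Windows side without copying.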
$ncu ./vector_add
==PROF== Connected to process 4084 (/mnt/c/Users/swqa/daniel/vector_add)
==PROF== Profiling "vector_add_13" - 0: 0%....50%....100% - 8 passes
C(1) = 3.000000000000000
C(N) = 3.000000000000000
==PROF== Disconnected from process 4084
[4084] vector_add@127.0.0.1
vector_add_13 (7813, 1, 1)x(128, 1, 1), Context 1, Stream 13, Device 0, CC 8.6
Section: GPU Speed Of Light Throughput
----------------------- ----------- ------------
Metric Name             Metric Unit Metric Value
----------------------- ----------- ------------
DRAM Frequency                  Ghz         6.79
SM Frequency                    Ghz         1.49
Elapsed Cycles                cycle        89144
Memory Throughput                 %        91.25
DRAM Throughput                   %        91.25
Duration                         us        59.71
L1/TEX Cache Throughput           %        16.37
L2 Cache Throughput               %        42.07
SM Active Cycles              cycle     80130.04
Compute (SM) Throughput           %        13.01
----------------------- ----------- ------------
INF The kernel is utilizing greater than 80.0% of the available compute or memory performance of the device. To
further improve performance, work will likely need to be shifted from the most utilized to another unit.
Start by analyzing DRAM in the Memory Workload Analysis section.
Section: Launch Statistics
-------------------------------- --------------- ---------------
Metric Name                          Metric Unit    Metric Value
-------------------------------- --------------- ---------------
Block Size                                                   128
Function Cache Configuration                     CachePreferNone
Grid Size                                                   7813
Registers Per Thread             register/thread              16
Shared Memory Configuration Size           Kbyte           16.38
Driver Shared Memory Per Block       Kbyte/block            1.02
Dynamic Shared Memory Per Block       byte/block               0
Static Shared Memory Per Block        byte/block               0
# SMs                                         SM              46
Threads                                   thread         1000064
Uses Green Context                                             0
Waves Per SM                                               14.15
-------------------------------- --------------- ---------------
Section: Occupancy
------------------------------- ----------- ------------
Metric Name                     Metric Unit Metric Value
------------------------------- ----------- ------------
Block Limit SM                        block           16
Block Limit Registers                 block           32
Block Limit Shared Mem                block           16
Block Limit Warps                     block           12
Theoretical Active Warps per SM        warp           48
Theoretical Occupancy                     %          100
Achieved Occupancy                        %        86.72
Achieved Active Warps Per SM           warp        41.63
------------------------------- ----------- ------------
OPT Est. Local Speedup: 13.28%
The difference between calculated theoretical (100.0%) and measured achieved occupancy (86.7%) can be the
result of warp scheduling overheads or workload imbalances during the kernel execution. Load imbalances can
occur between warps within a block as well as across blocks of the same kernel. See the CUDA Best Practices
Guide (CUDA C++ Best Practices Guide) for more details on
optimizing occupancy.
Section: GPU and Memory Workload Distribution
-------------------------- ----------- ------------
Metric Name                Metric Unit Metric Value
-------------------------- ----------- ------------
Average DRAM Active Cycles       cycle       370008
Total DRAM Elapsed Cycles        cycle      3244032
Average L1 Active Cycles         cycle     80130.04
Total L1 Elapsed Cycles          cycle      3844916
Average L2 Active Cycles         cycle     75035.81
Total L2 Elapsed Cycles          cycle      2689056
Average SM Active Cycles         cycle     80130.04
Total SM Elapsed Cycles          cycle      3844916
Average SMSP Active Cycles       cycle     79370.96
Total SMSP Elapsed Cycles        cycle     15379664
-------------------------- ----------- ------------
Thank you for reproducing this.
Could you please provide the compile command? Are there any configuration steps that should be done before profiling?
I’m sorry, I don’t understand. What I want to do is profile the OpenACC code (for example, showing the data copy-in/copy-out time). nvfortran and ncu were both installed inside WSL.
Does that mean that, to use the ncu command, I also need to install the CUDA tools inside WSL?
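For the copy-in/copy-out timing specifically, Nsight Compute is a kernel-level profiler, so a timeline or runtime summary may be a better fit. Below is a minimal sketch of two options, assuming the binary is `./vector_add` as built earlier in the thread; the report name `vector_add_timeline` is made up for this example, and whether Nsight Systems (`nsys`) tracing works fully under a given WSL setup is an assumption to verify.

```shell
#!/usr/bin/env bash
# Sketch: two ways to look at OpenACC data-transfer time for ./vector_add.
# "vector_add_timeline" is an arbitrary report name chosen for this example.
app=./vector_add
report=vector_add_timeline

if [ -x "$app" ]; then
  # Option 1: the NVIDIA OpenACC runtime's lightweight built-in profiler.
  # It prints a summary of data movement and kernel time when the program exits.
  # Run it on its own, not under another profiler.
  NVCOMPILER_ACC_TIME=1 "$app"

  # Option 2: an Nsight Systems timeline with CUDA and OpenACC tracing.
  # --stats=true prints summary tables (including memory copies) after the run.
  if command -v nsys >/dev/null 2>&1; then
    nsys profile --trace=cuda,openacc --stats=true \
         --force-overwrite=true -o "$report" "$app"
  else
    echo "nsys not found; skipping the timeline run."
  fi
else
  echo "$app not found; build it first with nvfortran -acc."
fi
```

Option 1 needs no extra tool beyond the HPC SDK runtime, which makes it a quick first check before setting up a full profiler.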
You are correct. I found it may be related to my NVIDIA driver version. I have now downloaded the driver for my specific GPU and then installed CUDA tools 12.6 using the WSL-Ubuntu runfile (local) installer.
The previous problem went away, but now I get the error below:
sudo ncu ./add_vector
==PROF== Connected to process 1988 (/home/haku/2024tasks/parallel/simmple/add_vector)
==ERROR== An error was reported by the driver:
==ERROR== Profiling failed because a driver resource was unavailable or the user does not have permission to access NVIDIA GPU Performance Counters. Ensure that no other tool (like DCGM) is concurrently collecting profiling data. For instructions on enabling permissions, see NVIDIA Development Tools Solutions - | NVIDIA Developer. See 2. Kernel Profiling Guide — NsightCompute 12.6 documentation for more details.
==ERROR== Failed to profile "simple_vector_11" in process 1988
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
Then I followed the instructions at
NVIDIA Development Tools Solutions - ERR_NVGPUCTRPERM: Permission issue with Performance Counters
for Windows.
After that, I get the error below:
sudo ncu ./add_vector
==PROF== Connected to process 2107 (/home/haku/2024tasks/parallel/simmple/add_vector)
==ERROR== Failed to prepare kernel for profiling
==ERROR== Unknown Error on device 0.
==ERROR== Failed to profile "simple_vector_11" in process 2107
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
Yes, I made the change in the NVIDIA Control Panel as administrator,
and I installed driver version 553.24 using the Windows executable installer.
Finally, after I rebooted the machine, it works.
Thank you so much!
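For anyone landing here with the same errors, a quick check after the driver update and reboot might look like the sketch below. It assumes the binary is `./add_vector` as in the later posts; the guards only keep the script from failing when a tool is not on the PATH.

```shell
#!/usr/bin/env bash
# Sketch: sanity check inside WSL after updating the Windows driver and rebooting.
app=./add_vector   # binary name taken from the later posts in this thread

if command -v nvidia-smi >/dev/null 2>&1; then
  # Shows the GPU name and the driver version visible from WSL
  # (553.24 was the version that worked in this thread).
  nvidia-smi --query-gpu=name,driver_version --format=csv
else
  echo "nvidia-smi not found; the Windows driver may not be visible from WSL."
fi

if command -v ncu >/dev/null 2>&1 && [ -x "$app" ]; then
  # Re-run the profiler; a "Profiling ... 100%" line means counter access works.
  ncu "$app"
else
  echo "ncu or $app not found; skipping the profile run."
fi
```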