OpenACC profiling with NVProf

ptheywood · January 22, 2016, 4:37pm

Hi,

As far as I am aware, I should be able to profile an OpenACC application using nvprof, but whenever I try, nvprof reports that no kernels were profiled.

For example, using the vecaddmod example from the PGI OpenACC Getting Started Guide (corrected so that it compiles):

! Fortran OpenACC example from the PGI OpenACC Getting Started Guide
! Chapter 2.10.1 - Vector Addition on the GPU
! http://www.pgroup.com/doc/openacc_gs.pdf

module vecaddmod
    implicit none
    contains

    subroutine vecaddgpu( r, a, b, n )
        real, dimension(:) :: r, a, b
        integer :: n
        integer :: i
        !$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
        do i = 1, n
            r(i) = a(i) + b(i)
        enddo
    end subroutine
end module

program main
    use vecaddmod
    implicit none
    integer :: n, i, errs, argcount
    real, dimension(:), allocatable :: a, b, r, e
    character*10 :: arg1
    argcount = command_argument_count()
    n = 1000000 ! default value
    ! @note - Corrected operator = to ==
    if( argcount == 1 )then
        call get_command_argument( 1, arg1 )
        read( arg1, * ) n
        if( n <= 0 ) n = 100000
    endif
    allocate( a(n), b(n), r(n), e(n) )
    do i = 1, n
        a(i) = i
        b(i) = 1000*i
    enddo
    ! compute on the GPU
    call vecaddgpu( r, a, b, n )
    ! compute on the host to compare
    do i = 1, n
        e(i) = a(i) + b(i)
    enddo
    ! compare results
    errs = 0
    do i = 1, n
        if( r(i) /= e(i) )then
            errs = errs + 1
        endif
    enddo
    print *, errs, ' errors found'
    if( errs /= 0 ) call exit(errs)
end program

This is saved as f1.f90 and compiled using:

pgfortran -acc -fast -Minfo=accel -g f1.f90

Attempting to profile it with nvprof produces the following output:

 nvprof f1.exe
            0  errors found
==3648== NVPROF is profiling process 3648, command: f1.exe
==3648== Profiling application: f1.exe
==3648== Profiling result:
No kernels were profiled.

==3648== API calls:
No API activities were profiled.
==3648== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.

Running the executable with the following environment variables set

PGI_ACC_TIME=1
PGI_ACC_NOTIFY=1

gives the following output

            0  errors found
PGI: "acc_shutdown" not detected, performance results might be incomplete.
 Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.
launch CUDA kernel  file=C:\Users\ptheywood\SATGPU\vecaddmod\f1.f90 function=vecaddgpu line=14 device=0 threadid=1 num_gangs=7813 num_workers=1 vector_length=128 grid=7813 block=128

Accelerator Kernel Timing data
C:\Users\ptheywood\SATGPU\vecaddmod\f1.f90
  vecaddgpu  NVIDIA  devicenum=0
    time(us): 3,708
    13: compute region reached 1 time
        14: kernel launched 1 time
            grid: [7813]  block: [128]
             device time(us): total=0 max=0 min=0 avg=0
    13: data region reached 1 time
        13: data copyin transfers: 5
             device time(us): total=2,449 max=1,213 min=5 avg=489
    17: data region reached 1 time
        17: data copyout transfers: 1
             device time(us): total=1,259 max=1,259 min=1,259 avg=1,259

Adding

call acc_shutdown(acc_device_nvidia)

prior to the final if statement does not resolve the PGI message either.

Version numbers as follows:

pgfortran --version

pgfortran 15.10-0 64-bit target on x86-64 Windows -tp haswell
The Portland Group - PGI Compilers and Tools
Copyright (c) 2015, NVIDIA CORPORATION.  All rights reserved.

nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2015 NVIDIA Corporation
Release version 7.5.18 (21)

Is there anything that I am missing?

Thanks,
Peter.

MatColgrove · January 22, 2016, 10:30pm

Hi Peter,

What’s happening is that since this code is so small and doesn’t run long, the profiler’s buffer isn’t getting dumped. Hence nvprof doesn’t get any information back. Calling “acc_shutdown” should force the buffer to dump, but it’s not for some reason in this case. I’ve asked our profiler folks to take a look.

Do you have a longer running program? If not, try putting a loop around the call to vecaddgpu so that it gets called 1000 times. When I did this, I was able to generate an nvprof profile.
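Something along these lines is what I mean. This is only a sketch of the repeat loop, trimmed down from the program in the first post; the names nrepeat and iter are made up here for illustration, and the host-side check is left out to keep it short:

```fortran
module vecaddmod
    implicit none
contains
    subroutine vecaddgpu( r, a, b, n )
        real, dimension(:) :: r, a, b
        integer :: n
        integer :: i
        !$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
        do i = 1, n
            r(i) = a(i) + b(i)
        enddo
    end subroutine
end module

program repeat_vecadd
    use vecaddmod
    implicit none
    integer, parameter :: n = 1000000
    ! nrepeat is a hypothetical name; 1000 is the repeat count suggested above
    integer, parameter :: nrepeat = 1000
    real, dimension(:), allocatable :: a, b, r
    integer :: i, iter
    allocate( a(n), b(n), r(n) )
    do i = 1, n
        a(i) = i
        b(i) = 1000*i
    enddo
    ! call the same GPU routine repeatedly so the process runs for longer
    do iter = 1, nrepeat
        call vecaddgpu( r, a, b, n )
    enddo
    print *, 'r(1) =', r(1)
end program
```

It builds with the same command line as before (pgfortran -acc -fast -Minfo=accel -g). The data is copied in and out on every call, so most of the extra run time will be transfers rather than the kernel itself.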

Note that this issue does not occur on Linux.

Thanks,
Mat

ptheywood · January 25, 2016, 9:31am

Hi Mat,

Thanks for the quick response.

Unfortunately, simply adding a loop of 1000 or even 10000 calls to vecaddgpu does not generate any profile information on my machine.

$ nvprof f1.exe
            0  errors found
==4936== NVPROF is profiling process 4936, command: f1.exe
==4936== Profiling application: f1.exe
==4936== Profiling result:
No kernels were profiled.

==4936== API calls:
No API activities were profiled.
==4936== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.

However, adding explicit device selection and initialisation as well as the 1000 iterations usually does produce results.

    call acc_set_device(acc_device_nvidia)
    call acc_set_device_num(0, acc_device_nvidia)
    call acc_init(acc_device_nvidia)
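For context, here is a rough sketch of where those calls sit relative to the compute loop. It is not the exact file I ran: the kernel is inlined to keep it short, nrepeat and iter are illustrative names, and I have assumed acc_set_device_type (the name used in the OpenACC specification) in place of acc_set_device. The routines and the acc_device_nvidia constant are taken from the compiler's openacc module.

```fortran
program init_then_compute
    use openacc
    implicit none
    integer, parameter :: n = 1000000
    ! nrepeat is a hypothetical name for the iteration count
    integer, parameter :: nrepeat = 1000
    real, dimension(:), allocatable :: a, b, r
    integer :: i, iter

    ! select and initialise the device before any compute region is reached
    ! (acc_set_device_type is an assumption standing in for acc_set_device)
    call acc_set_device_type(acc_device_nvidia)
    call acc_set_device_num(0, acc_device_nvidia)
    call acc_init(acc_device_nvidia)

    allocate( a(n), b(n), r(n) )
    do i = 1, n
        a(i) = i
        b(i) = 1000*i
    enddo

    ! repeat the vector addition so the run is long enough to profile
    do iter = 1, nrepeat
        !$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
        do i = 1, n
            r(i) = a(i) + b(i)
        enddo
    enddo

    ! shut the device down explicitly at the end of the run
    call acc_shutdown(acc_device_nvidia)
    print *, 'r(1) =', r(1)
end program
```

Whether the acc_shutdown call at the end actually flushes the profile data on Windows is the open question in this thread, so treat that line as something to test rather than a fix.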

There still seems to be an issue with acc_shutdown, however.

PGI: "acc_shutdown" not detected, performance results might be incomplete.
 Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

Accelerator Kernel Timing data
C:\Users\ptheywood\SATGPU\vecaddmod\f1.f90
  vecaddgpu  NVIDIA  devicenum=0
    time(us): 3,706,081
    13: compute region reached 1000 times
        14: kernel launched 1000 times
            grid: [7813]  block: [128]
            elapsed time(us): total=507,000 max=16,000 min=0 avg=507
    13: data region reached 1000 times
        13: data copyin transfers: 5000
             device time(us): total=2,484,843 max=1,577 min=3 avg=496
    17: data region reached 1000 times
        17: data copyout transfers: 1000
             device time(us): total=1,221,238 max=1,485 min=1,195 avg=1,221

Thanks,
Peter
