OpenACC profiling with NVProf

Hi,

As far as I am aware, I should be able to profile an OpenACC application using nvprof, but whenever I try, nvprof reports that no kernels were profiled.

For example, using the vecaddmod example from the OpenACC Getting Started Guide (corrected so that it compiles):

! Fortran OpenACC example from the PGI OpenACC Getting Started Guide
! Chapter 2.10.1 - Vector Addition on the GPU
! http://www.pgroup.com/doc/openacc_gs.pdf

module vecaddmod
    implicit none
    contains

    subroutine vecaddgpu( r, a, b, n )
        real, dimension(:) :: r, a, b
        integer :: n
        integer :: i
        !$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
        do i = 1, n
            r(i) = a(i) + b(i)
        enddo
    end subroutine
end module

program main
    use vecaddmod
    implicit none
    integer :: n, i, errs, argcount
    real, dimension(:), allocatable :: a, b, r, e
    character*10 :: arg1
    argcount = command_argument_count()
    n = 1000000 ! default value
    ! @note - Corrected operator = to ==
    if( argcount == 1 )then
        call get_command_argument( 1, arg1 )
        read( arg1, * ) n
        if( n <= 0 ) n = 100000
    endif
    allocate( a(n), b(n), r(n), e(n) )
    do i = 1, n
        a(i) = i
        b(i) = 1000*i
    enddo
    ! compute on the GPU
    call vecaddgpu( r, a, b, n )
    ! compute on the host to compare
    do i = 1, n
        e(i) = a(i) + b(i)
    enddo
    ! compare results
    errs = 0
    do i = 1, n
        if( r(i) /= e(i) )then
            errs = errs + 1
        endif
    enddo
    print *, errs, ' errors found'
    if( errs /= 0 ) call exit(errs)
end program

I saved this as f1.f90 and compiled it using

pgfortran -acc -fast -Minfo=accel -g f1.f90

Attempting to capture profile data with nvprof results in the following output:

 nvprof f1.exe
            0  errors found
==3648== NVPROF is profiling process 3648, command: f1.exe
==3648== Profiling application: f1.exe
==3648== Profiling result:
No kernels were profiled.

==3648== API calls:
No API activities were profiled.
==3648== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.

Running the executable with the environment variables

PGI_ACC_TIME=1
PGI_ACC_NOTIFY=1

set gives the following output:

            0  errors found
PGI: "acc_shutdown" not detected, performance results might be incomplete.
 Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.
launch CUDA kernel  file=C:\Users\ptheywood\SATGPU\vecaddmod\f1.f90 function=vecaddgpu line=14 device=0 threadid=1 num_gangs=7813 num_workers=1 vector_length=128 grid=7813 block=128

Accelerator Kernel Timing data
C:\Users\ptheywood\SATGPU\vecaddmod\f1.f90
  vecaddgpu  NVIDIA  devicenum=0
    time(us): 3,708
    13: compute region reached 1 time
        14: kernel launched 1 time
            grid: [7813]  block: [128]
             device time(us): total=0 max=0 min=0 avg=0
    13: data region reached 1 time
        13: data copyin transfers: 5
             device time(us): total=2,449 max=1,213 min=5 avg=489
    17: data region reached 1 time
        17: data copyout transfers: 1
             device time(us): total=1,259 max=1,259 min=1,259 avg=1,259



Adding

call acc_shutdown(acc_device_nvidia)

prior to the final if statement does not resolve the PGI message either.
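
For clarity, the placement looks roughly like this. It is a cut-down sketch rather than the full f1.f90: the program name shutdown_sketch, the small array size and the use of stop instead of call exit are placeholders, and it assumes the openacc runtime module is available (i.e. the code is built with OpenACC enabled, as with -acc above).

```fortran
program shutdown_sketch
    use openacc            ! provides acc_shutdown and acc_device_nvidia
    implicit none
    integer, parameter :: n = 1000   ! placeholder size, not the 1000000 used in f1.f90
    integer :: i, errs
    real :: a(n), r(n)
    a = 1.0
    !$acc kernels loop copyin(a) copyout(r)
    do i = 1, n
        r(i) = 2.0 * a(i)
    enddo
    ! count mismatches against the expected host value
    errs = count( r /= 2.0 )
    print *, errs, ' errors found'
    ! shut the device down before the final if statement can end the program
    call acc_shutdown( acc_device_nvidia )
    if( errs /= 0 ) stop 1
end program
```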

Version numbers as follows:

pgfortran --version

pgfortran 15.10-0 64-bit target on x86-64 Windows -tp haswell
The Portland Group - PGI Compilers and Tools
Copyright (c) 2015, NVIDIA CORPORATION.  All rights reserved.

nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2015 NVIDIA Corporation
Release version 7.5.18 (21)

Is there anything that I am missing?

Thanks,
Peter.

Hi Peter,

What’s happening is that since this code is so small and doesn’t run for long, the profiler’s buffer isn’t getting dumped. Hence nvprof doesn’t get any information back. Calling “acc_shutdown” should force the buffer to dump, but for some reason it doesn’t in this case. I’ve asked our profiler folks to take a look.

Do you have a longer running program? If not, try putting a loop around the call to vecaddgpu so that it gets called 1000 times. When I did this, I was able to generate an nvprof profile.
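
A minimal sketch of that change, assuming the vecaddmod module from the original post. The loop variable iter and the trimmed-down main program (no argument parsing or host check) are illustrative only:

```fortran
module vecaddmod
    implicit none
contains
    subroutine vecaddgpu( r, a, b, n )
        real, dimension(:) :: r, a, b
        integer :: n
        integer :: i
        !$acc kernels loop copyin(a(1:n),b(1:n)) copyout(r(1:n))
        do i = 1, n
            r(i) = a(i) + b(i)
        enddo
    end subroutine
end module

program main
    use vecaddmod
    implicit none
    integer, parameter :: n = 1000000
    integer :: i, iter
    real, dimension(:), allocatable :: a, b, r
    allocate( a(n), b(n), r(n) )
    do i = 1, n
        a(i) = i
        b(i) = 1000*i
    enddo
    ! call the GPU routine 1000 times to lengthen the run
    do iter = 1, 1000
        call vecaddgpu( r, a, b, n )
    enddo
    print *, 'r(1) =', r(1)
end program
```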

Note that this issue does not occur on Linux.

Thanks,
Mat

Hi Mat,

Thanks for the quick response.

Unfortunately, simply adding a loop of 1000 or even 10000 calls to vecaddgpu does not generate any profile information on my machine.

$ nvprof f1.exe
            0  errors found
==4936== NVPROF is profiling process 4936, command: f1.exe
==4936== Profiling application: f1.exe
==4936== Profiling result:
No kernels were profiled.

==4936== API calls:
No API activities were profiled.
==4936== Warning: Some profiling data are not recorded. Make sure cudaProfilerStop() or cuProfilerStop() is called before application exit to flush profile data.

However, adding device selection/initialisation as well, in combination with 1000 iterations, does usually produce results.

    call acc_set_device(acc_device_nvidia)
    call acc_set_device_num(0, acc_device_nvidia)
    call acc_init(acc_device_nvidia)
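
A minimal sketch of one way to place such calls, at the start of the main program before the first compute region (the placement is an assumption, as are the program name init_sketch and the small inline kernel). It uses acc_set_device_type, the routine name from the OpenACC specification, in place of the acc_set_device call as posted above, and needs the openacc runtime module.

```fortran
program init_sketch
    use openacc            ! provides the acc_* runtime routines used below
    implicit none
    integer, parameter :: n = 1000   ! placeholder size
    integer :: i, iter
    real :: a(n), r(n)
    ! select and initialise the NVIDIA device before any compute region
    call acc_set_device_type( acc_device_nvidia )
    call acc_set_device_num( 0, acc_device_nvidia )
    call acc_init( acc_device_nvidia )
    a = 1.0
    ! 1000 iterations, as in the looped version of the test
    do iter = 1, 1000
        !$acc kernels loop copyin(a) copyout(r)
        do i = 1, n
            r(i) = 2.0 * a(i)
        enddo
    enddo
    print *, 'r(1) =', r(1)
end program
```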

There still seems to be an issue with acc_shutdown, however:

PGI: "acc_shutdown" not detected, performance results might be incomplete.
 Please add the call "acc_shutdown(acc_device_nvidia)" to the end of your application to ensure that the performance results are complete.

Accelerator Kernel Timing data
C:\Users\ptheywood\SATGPU\vecaddmod\f1.f90
  vecaddgpu  NVIDIA  devicenum=0
    time(us): 3,706,081
    13: compute region reached 1000 times
        14: kernel launched 1000 times
            grid: [7813]  block: [128]
            elapsed time(us): total=507,000 max=16,000 min=0 avg=507
    13: data region reached 1000 times
        13: data copyin transfers: 5000
             device time(us): total=2,484,843 max=1,577 min=3 avg=496
    17: data region reached 1000 times
        17: data copyout transfers: 1000
             device time(us): total=1,221,238 max=1,485 min=1,195 avg=1,221

Thanks,
Peter