pgprof/pgcollect : problem with CPU+openacc on same routine

Hello .
New feature problem when profiling with pgprof/pgcollect tested on pgi/13.10 ( because pgi14/X doesn’t work at all )

When profiling a routine containing CPU + OpenAcc computation , with ppgprof this is reported as 2 separate routines ?

→ 1 having only CPU time
→ the other one only with GPU ( kernel/region ) time

A simple example showing this problem :

MODULE mode_sub
CONTAINS 
  SUBROUTINE sub(X,Y,N)
    IMPLICIT NONE
    INTEGER :: N
    REAL    :: X(N) , Y(N)

    !$acc kernels present(X)
    X = X * 0.9995
    !$acc end kernels

    ! CPU part
    Y = Y * 1.0001

  END SUBROUTINE sub  
END MODULE mode_sub

PROGRAM test_pgcollect

  USE mode_sub
  IMPLICIT NONE
  INTEGER, PARAMETER :: N=134217728, NIT=100
    
  REAL, ALLOCATABLE  :: A(:) , B(:)
  !$acc declare create(A)

  INTEGER            :: I
    
  ALLOCATE(A(N),B(N)) 
  !$acc kernels
  A = 1.2345
  !$acc end kernels
  B = 2.3456         

  DO I=1,NIT
     CALL SUB(A,B,N)
  END DO

  !$acc update host (A(N:N))
  print*,"A(N)=",A(N)," B(N)=",B(N)

END PROGRAM test_pgcollect



pgf90 --version
pgf90 13.10-0 64-bit target on x86-64 Linux -tp nehalem

pgf90 -ta=nvidia:cuda5.5 openacc_pgcollect_mix.f90 -o openacc_pgcollect_mix_1310

pgcollect ./openacc_pgcollect_mix_1310
Profiling single-threaded target program
A(N)= 1.174275 B(N)= 2.369176
target process has terminated, writing profile data

pgprof # → write the main window to file

cat prof_mix

Profiled: ./openacc_pgcollect_mix_1310 on Mon Oct 13 11:51:53 CEST 2014

| Function       | Seconds         | Region Time     | Kernel Device Time |

| sub            |  17,495 =  96%  |       0 =   0%  |       0 =   0%     |
| [System_Time]  |     577 =   3%  |       0 =   0%  |       0 =   0%     |
| __c_mset4      |     124 =   1%  |       0 =   0%  |       0 =   0%     |
| test_pgcollect |       0 =   0%  |      28 =   1%  |      28 =   1%     |
| sub            |       0 =   0%  |    4853 =  99%  |    4853 =  99%     |

The detail window show the same separate view , CPU only in one view and only GPU on the other one .

This problem is very painful for the analysis of porting a very big code ( Million of lines … )

:-) Thanks if it could be fixed for the next 14/15.X release …

Bye Juan

I ran your example with 13.10, 14.10, and the upcoming 15.0 (you will
see 15.1 in January).

I think 15.0 is working better than 14.10. We filed TPR 20904 for this problem and it should be in the 15.1 release.

13.10
Profiled: ./test_pgcollect on Tue Nov 18 16:33:44 PST 2014

| Function | Seconds | Region Time | Kernel Device Time |

| ppoll | 19.488 = 50% | 0.0000 = 0% | 0.0000 = 0% |
| sub | 18.050 = 46% | 0.0000 = 0% | 0.0000 = 0% |
| [System_Time] | 0.825 = 2% | 0.0000 = 0% | 0.0000 = 0% |
| __GI_sched_yield | 0.200 = 1% | 0.0000 = 0% | 0.0000 = 0% |
| sstk | 0.175 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| __c_mset4 | 0.150 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| __write | 0.050 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| _dl_debug_initialize | 0.050 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| __open_nocancel | 0.025 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| __lll_lock_wait_private | 0.012 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| sub | 0.000 = 0% | 0.8033 = 100% | 0.8033 = 100% |
| test_pgcollect | 0.000 = 0% | 0.0038 = 0% | 0.0038 = 0% |

\

14.10

=============================================
Profiled: ./test_pgcollect on Tue Nov 18 16:31:07 PST 2014

| Function | Seconds | Region Time | Kernel Device Time |

| __poll | 17.966 = 50% | 0.0000 = 0% | 0.0000 = 0% |
| sub | 16.690 = 46% | 0.0000 = 0% | 0.0000 = 0% |
| [System_Time] | 0.690 = 2% | 0.0000 = 0% | 0.0000 = 0% |
| sched_getscheduler | 0.253 = 1% | 0.0000 = 0% | 0.0000 = 0% |
| sstk | 0.115 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| __c_mset4 | 0.115 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| _dl_debug_initialize | 0.046 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| __open_nocancel | 0.023 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| sem_timedwait | 0.023 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| __open64 | 0.023 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| fchmodat | 0.023 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| __lll_lock_wait_private | 0.011 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| sub | 0.000 = 0% | 0.0002 = 56% | 0.0002 = 56% |
| test_pgcollect | 0.000 = 0% | 0.0002 = 44% | 0.0002 = 44% |

\

15.0

==============================================

Profiled: ./test_pgcollect on Tue Nov 18 13:51:21 PST 2014

| Function | Seconds | Region Time | Kernel Device Time |

| __poll | 19.101 = 33% | 0.0000 = 0% | 0.0000 = 0% |
| sem_timedwait | 18.868 = 33% | 0.0000 = 0% | 0.0000 = 0% |
| sub | 17.535 = 31% | 0.0000 = 0% | 0.0000 = 0% |
| [System_Time] | 0.651 = 1% | 0.0000 = 0% | 0.0000 = 0% |
| sched_getscheduler | 0.488 = 1% | 0.0000 = 0% | 0.0000 = 0% |
| __c_mset4 | 0.116 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| sstk | 0.101 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| _dl_debug_initialize | 0.047 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| _int_malloc | 0.047 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| read | 0.031 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| __open64 | 0.031 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| __open_nocancel | 0.023 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| __lll_lock_wait_private | 0.023 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| memset | 0.016 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| sighandler_setxid | 0.008 = 0% | 0.0000 = 0% | 0.0000 = 0% |
| sub | 0.000 = 0% | 0.8011 = 100% | 0.8011 = 100% |
| test_pgcollect | 0.000 = 0% | 0.0036 = 0% | 0.0036 = 0% |

:-) Thank you Dave

Bye