pgcollect + OpenACC not working with PGI 14.x

Hello .

The pgcollect tool was working well with pgi/13.10 and OpenACC code,
but with the 14.x versions it no longer works …

I’m using openSUSE 12.3 + CUDA 6.5 for this test (CUDA 5.5 gives the same problem).

Here is a simple code calling two routines to show it:

MODULE mode_sub                                       
CONTAINS                                              
                                                       
  SUBROUTINE init(X)                                   
    IMPLICIT NONE                                      
    REAL  :: X(:)                                      
                                                       
    !$acc kernels                                      
    X = 1.2345                                         
    !$acc end kernels                                  
                                                       
  END SUBROUTINE init                                  
  !-------------------------------------               
  SUBROUTINE sub(X)                                    
    IMPLICIT NONE                                      
    REAL  :: X(:)                                      
                                                        
    !$acc kernels present(X)                            
    X = X * 0.9995                                      
    !$acc end kernels                                   
                                                        
  END SUBROUTINE sub                                    
                                                        
END MODULE mode_sub                                     

PROGRAM test_pgcollect

  USE mode_sub

  IMPLICIT NONE

  INTEGER, PARAMETER :: N=100000000, NIT=1000
    
  REAL, ALLOCATABLE  :: A(:)
  !$acc declare create(A)

  INTEGER            :: I
    
  ALLOCATE(A(N))
         
  CALL INIT(A)

  DO I=1,NIT
     CALL SUB(A)
  END DO

  !$acc update host (A(N:N))
  print*,"A(N)=",A(N)

END PROGRAM test_pgcollect

=> With pgf90 13.10

pgf90 --version
pgf90 13.10-0 64-bit target on x86-64 Linux -tp nehalem

pgf90 -ta=nvidia,cuda5.5 openacc_pgcollect.f90 -o openacc_pgcollect_1310

pgcollect openacc_pgcollect_1310
Profiling single-threaded target program
A(N)= 0.7486508
target process has terminated, writing profile data

pgprof

As expected, the main pgprof window shows a large region time and kernel (device) time in the sub subroutine.

=> But with pgi/14.9, the profiling run also looks OK, yet no region or kernel (device) time is reported in the main pgprof window.

pgf90 --version
pgf90 14.9-0 64-bit target on x86-64 Linux -tp nehalem

pgf90 -ta=nvidia,cuda5.5 openacc_pgcollect.f90 -o openacc_pgcollect_149

pgcollect openacc_pgcollect_149
Profiling single-threaded target program
A(N)= 0.7486508
target process has terminated, writing profile data
pgprof

The pgprof window only shows a [System_Time] function with only a ‘Seconds’ (host) column of data.

The pgprof.out files look very similar with the two compilers, but the pgi/14.9 one seems to be missing
some information.

For example , I could check that the pgi/14.9 one doesn’t generate this tag

9

which appears systematically with the pgi/13.10 one after this one

4

Thanks in advance for the help

Juan

Hi Juan,

I sent this off to our tools team and they are investigating. We’re tracking the issue as TPR#20904.

Thanks,
Mat

Hello Mat .

:-\ Another thing I’ve just found is that the timings reported with PGI_ACC_TIME
are also completely wrong with pgi/14.9 (and 14.7).

Take the same example as before, where all the time is spent calling the routine sub 1000 times.

On my computer, with a Titan card, the execution takes about 3.8 seconds, as reported by the Linux time command.

→ This is reported correctly by the pgi/13.10 version <=> time(us): 3,611,564 = 3.6 s

time ( PGI_ACC_TIME=1 ./openacc_pgcollect_1310 )
A(N)= 0.7486508

Accelerator Kernel Timing data

/home/escj/STREAM/openacc_pgcollect.f90
sub NVIDIA devicenum=0
time(us): 3,611,564
18: compute region reached 1000 times
19: kernel launched 1000 times
grid: [65535] block: [128]
device time(us): total=3,611,564 max=3,702 min=3,574 avg=3,611
elapsed time(us): total=3,624,984 max=3,762 min=3,584 avg=3,624

real 0m3.827s
user 0m2.065s
sys 0m1.736s

→ But it is completely wrong with the pgi/14.9 version <=> time(us): 4,685
(the min/max/avg are also wrong)

time ( PGI_ACC_TIME=1 ./openacc_pgcollect_149 )
A(N)=0.7486508 Accelerator Kernel Timing data …
sub NVIDIA devicenum=0
time(us): 4,685
18: data region reached 1000 times
18: compute region reached 1000 times
19: kernel launched 1000 times
grid: [65535] block: [128]
device time(us): total=4,685 max=61 min=2 avg=4
elapsed time(us): total=24,572 max=81 min=17 avg=24

real 0m3.786s
user 0m2.086s
sys 0m1.681s

Remark: by contrast, the detailed times reported with CUDA_PROFILE=1 are OK.

time ( CUDA_PROFILE=1 ./openacc_pgcollect_149 )
A(N)= 0.7486508

real 0m3.796s
user 0m2.038s
sys 0m1.736s

less cuda_profile_0.log

CUDA_PROFILE_LOG_VERSION 2.0

CUDA_DEVICE 0 GeForce GTX TITAN

CUDA_CONTEXT 1

TIMESTAMPFACTOR 13995b5258779d02

method,gputime,cputime,occupancy
method=[ memcpyHtoDasync ] gputime=[ 3.680 ] cputime=[ 17.972 ]
method=[ init_9_gpu ] gputime=[ 1561.184 ] cputime=[ 112.554 ] occupancy=[ 1.000 ]
method=[ sub_19_gpu ] gputime=[ 3561.824 ] cputime=[ 8.233 ] occupancy=[ 1.000 ]
method=[ sub_19_gpu ] gputime=[ 3562.432 ] cputime=[ 7.217 ] occupancy=[ 1.000 ]
method=[ sub_19_gpu ] gputime=[ 3562.368 ] cputime=[ 6.581 ] occupancy=[ 1.000 ]
method=[ sub_19_gpu ] gputime=[ 3562.848 ] cputime=[ 7.282 ] occupancy=[ 1.000 ]
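
A minimal sketch for cross-checking these numbers against PGI_ACC_TIME: the function below sums the gputime values for one kernel in a cuda_profile_0.log like the one above. It assumes the space-separated line layout shown above (method=[ NAME ] gputime=[ VALUE ] ...) and that gputime is in microseconds, which is consistent with roughly 3.56 s for 1000 launches of sub_19_gpu here. The function name sum_gputime is only illustrative.

```shell
# Sum the gputime column for one kernel in a CUDA_PROFILE log.
# Usage: sum_gputime KERNEL_NAME LOGFILE
# Assumes lines of the form:
#   method=[ NAME ] gputime=[ VALUE ] cputime=[ VALUE ] ...
sum_gputime() {
  awk -v kernel="$1" '
    # Field 2 is the kernel name, field 5 is the gputime value.
    $1 == "method=[" && $2 == kernel && $4 == "gputime=[" {
      total += $5
      count++
    }
    END { printf "%d launches, %.3f us total\n", count, total }
  ' "$2"
}

# Example call (the log file name is the one used in this post):
# sum_gputime sub_19_gpu cuda_profile_0.log
```

If the total is close to the wall-clock time while PGI_ACC_TIME reports only a few milliseconds, the kernels themselves ran normally and only the PGI_ACC_TIME measurement is off.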

This is just an example, but all the routines I’ve ported to OpenACC in my code show the same problem.

Bye

Juan

Hi Juan,

This is a known issue on Windows which can be resolved by also setting CUDA_PROFILE in addition to PGI_ACC_TIME, but I’ve not seen it on Linux before. I just double checked and see the correct time on my C2050 box as well as my K40 system. I wouldn’t think that this would be a problem specific to a GTX Titan, but possible.

On Windows, the problem is that we use cudaevents to record GPU timings and the driver returns bad times unless CUDA_PROFILE is set. Can you try setting CUDA_PROFILE=1 in addition to PGI_ACC_TIME to see if it’s the same issue?

What Linux version and CUDA Driver are you using?

Thanks,
Mat


bash-4.1$ time ( PGI_ACC_TIME=1 ./test_pgcollect_149.out )
 A(N)=   0.7486508

Accelerator Kernel Timing data
(unknown)
  (unknown)  NVIDIA  devicenum=0
    time(us): 4
    39: upload reached 1 time
        39: data copyin transfers: 1
             device time(us): total=4 max=4 min=4 avg=4
test_pgcollect.f90
  init  NVIDIA  devicenum=0
    time(us): 3,826
    8: data region reached 1 time
    8: compute region reached 1 time
        9: kernel launched 1 time
            grid: [65535]  block: [128]
             device time(us): total=3,826 max=3,826 min=3,826 avg=3,826
            elapsed time(us): total=3,843 max=3,843 min=3,843 avg=3,843
test_pgcollect.f90
  sub  NVIDIA  devicenum=0
    time(us): 7,556,593
    18: data region reached 1000 times
    18: compute region reached 1000 times
        19: kernel launched 1000 times
            grid: [65535]  block: [128]
             device time(us): total=7,556,593 max=7,817 min=7,548 avg=7,556
            elapsed time(us): total=7,624,575 max=46,718 min=7,557 avg=7,624
test_pgcollect.f90
  test_pgcollect  NVIDIA  devicenum=0
    time(us): 2
    47: update directive reached 1 time
        47: data copyout transfers: 1
             device time(us): total=2 max=2 min=2 avg=2

real    0m10.314s
user    0m2.844s
sys     0m5.016s

Hello Mat .

For this test, I’m using CUDA 6.5 on openSUSE 12.3,
and I have two cards in this PC:

nvidia-smi -L
GPU 0: GeForce GTX TITAN (UUID: GPU-758a8332-5422-912f-9a53-eda8c4b7ba37)
GPU 1: GeForce GTX 470 (UUID: GPU-79d6d038-74cd-3e17-f1a9-01fb16457d1c)

And you are right.

  1. Setting CUDA_PROFILE=1 + PGI_ACC_TIME=1 solves the problem for the Titan card.

;-) Thanks for the workaround

> time ( ACC_DEVICE_NUM=0 CUDA_PROFILE=1 PGI_ACC_TIME=1 ./openacc_pgcollect_149 ) 
 A(N)=   0.7486508    

Accelerator Kernel Timing data
...
/home/escj/STREAM/openacc_pgcollect.f90
  sub  NVIDIA  devicenum=0
    time(us): 3,569,308
    18: data region reached 1000 times
    18: compute region reached 1000 times
        19: kernel launched 1000 times
            grid: [65535]  block: [128]
             device time(us): total=3,569,308 max=3,647 min=2 avg=3,569
            elapsed time(us): total=3,594,372 max=3,714 min=24 avg=3,594
...
real    0m3.811s
user    0m1.971s
sys     0m1.818s
  2. The problem only occurs with the Titan card. With the GTX 470 there is no problem.
> time ( ACC_DEVICE_NUM=1 PGI_ACC_TIME=1 ./openacc_pgcollect_149 ) 
 A(N)=   0.7486508    
...
/home/escj/STREAM/openacc_pgcollect.f90
  sub  NVIDIA  devicenum=1
    time(us): 7,457,064
    18: data region reached 1000 times
    18: compute region reached 1000 times
        19: kernel launched 1000 times
            grid: [65535]  block: [128]
             device time(us): total=7,457,064 max=7,574 min=7,436 avg=7,457
            elapsed time(us): total=7,469,811 max=7,597 min=7,447 avg=7,469
...
real    0m7.617s
user    0m3.181s
sys     0m4.396s

Bye

Juan

Hello .

:-) With pgi/15.1 there is some progress on this problem.

  1. First, a remark:
    LD_LIBRARY_PATH must include the path to the libcupti.so library, or pgcollect and PGI_ACC_TIME will give this warning (and the times reported will be wrong):

pgcollect openacc_pgcollect_1501
Profiling single-threaded target program
libcupti.so not found

The directory containing this library is not set by the pgi module file provided by the installation process; for CUDA 6.5 it can be found here:
${PGI_HOME}/linux86-64/2015/cuda/6.5/lib64
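
As a sketch of how this path could be added safely, the helper below prepends a directory to LD_LIBRARY_PATH only if libcupti.so is actually present in it. The function name add_cupti_dir is hypothetical, and PGI_HOME is assumed to point at the PGI installation root, as in the path above.

```shell
# Prepend DIR to LD_LIBRARY_PATH only if it contains libcupti.so.
# Usage: add_cupti_dir DIR
add_cupti_dir() {
  if [ -e "$1/libcupti.so" ]; then
    # Avoid a trailing ':' when LD_LIBRARY_PATH is empty or unset.
    export LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
  else
    echo "libcupti.so not found in $1" >&2
    return 1
  fi
}

# Example call (path from this post; PGI_HOME is site-specific):
# add_cupti_dir "${PGI_HOME}/linux86-64/2015/cuda/6.5/lib64"
```

Checking for the file first makes a wrong PGI_HOME or CUDA version show up immediately, instead of as a later pgcollect warning.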

Good News

  1. No more problem with the times reported for the Titan card: PGI_ACC_TIME and pgcollect now report them correctly without setting CUDA_PROFILE.

Bad News
2) In the pgprof.out file generated by pgcollect there are still two main bugs preventing pgprof from correctly displaying the acc time (which is now measured correctly).

2-1) With pgi/15.1, the threadid reported in the pgprof.out file is wrong.

It is set to 1 :

1

This is why the acc timing doesn’t appear in the main pgprof view!
Setting this threadid to 0 (as in pgi/13.10) solves this first main problem:

0

2-2) As reported in my previous post, the pgprof.out of pgi/15.1 is always missing the tag X, which gives the number of lines (=X) of the routine, after the tag Y.

For the openacc_pgcollect.f90 source given as an example, for the routine sub this gives:

sub
14
9

Without the linecount tag (after fixing the threadid problem), clicking on a subroutine such as sub in the main pgprof view to see its details gives this error in a popup window:

pgprof: Internal Error.Couldn’t find line information for query pgprof.out@/…/openacc_pgcollect.f90@sub@18

;-) I hope these 2 bugs can be fixed in the next release …

Bye

Juan .

Thanks Juan. I added TPR#21367 for issue 2-1 and #21368 for issue 2-2.

You shouldn’t need to set LD_LIBRARY_PATH with the current pgcollect. I tried to recreate the warning here by unsetting LD_LIBRARY_PATH, but it still worked for me. I’m not sure why you have to set this.

- Mat

Just a small aside: I had the same warning on a SUSE Linux system (TSUBAME) with 15.1 when I set my CUDA environment manually to the 6.0 version that comes installed outside PGI. Setting LD_LIBRARY_PATH, as well as NVDIR, CUDADIR and CUDALIB, to their respective subfolders in the CUDA 6.5 provided by PGI got rid of the warning.

Hello .

Well, exactly the same libcupti.so is also part of the official NVIDIA CUDA 6.5 package, but in an extras folder.
So loading the NVIDIA CUDA 6.5 environment can also remove the warning.

To use it (on my PC) I have to set, for example:

LD_LIBRARY_PATH=/home/escj/dir_NVIDIA/CUDA6.5.14/cuda/extras/CUPTI/lib64:…

Bye

Juan

Hello .

I’ve downloaded the latest pgi 15.4, and the pgcollect/profiling problems for OpenACC are still present:

→ threadid & linecount incorrect

→ TPR#21367 for issue 2-1 and #21368 for issue 2-2.

Some hope

Bye Juan