Initializers seem to be slow

OpenACC/Fortran/PGI. Analyzing my program with the output of -ta=nvidia,time,

I get strange analysis results.

The code was like this:

      SCHUBG(1:INOD) = 0
      DO 50 IJ = 1,INOD
       IF ( HO(IJ) .GT. HMIN ) THEN
         SCHUBG(IJ) = CMUVISC * HO(IJ) * RMANN(IJ) *
     &       SQRT ((UO(IJ)**2 + VO(IJ)**2 ) * GRAV / CBRT(HO(IJ)))
       END IF
   50 CONTINUE

The initializer

       SCHUBG(1:INOD) = 0

took a ridiculous amount of time. The loop with “real” work was much faster.

So I merged the loops:

      DO 50 IJ = 1,INOD
       S = 0
       IF ( HO(IJ) .GT. HMIN ) THEN
         S = CMUVISC * HO(IJ) * RMANN(IJ) *
     &       SQRT ((UO(IJ)**2 + VO(IJ)**2 ) * GRAV / CBRT(HO(IJ)))
       END IF
       SCHUBG(IJ) =S
   50 CONTINUE

Now this remaining loop is slow. Other loops (called the same number of times, with the same loop bounds)
are much longer, write far more data, and are much faster.


Any hints?
Benedikt

Hi Benedikt,

Most likely you’re measuring the device warm-up time. On Linux, the devices are powered down when not in use and it takes approximately 1 second per device to warm them back up.

You can avoid this by running the “pgcudainit” utility in the background to keep the device open.

Also, you can make a call to “acc_init” early in your program. You’ll still incur the warm-up cost, but it will happen during initialization and not at the first OpenACC directive you encounter.
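
In case it helps, here is a minimal sketch of the acc_init approach (assuming the openacc module that PGI's Fortran compilers provide; acc_device_nvidia is the device-type constant for NVIDIA targets):

      PROGRAM WARMUP
      USE OPENACC
      IMPLICIT NONE
      INTEGER I
      REAL A(1000)
!     Pay the ~1 s/device warm-up cost here, during startup, instead
!     of at the first compute region encountered later in the program.
      CALL ACC_INIT(ACC_DEVICE_NVIDIA)
!$acc kernels
      DO I = 1, 1000
        A(I) = 0.0
      END DO
!$acc end kernels
      WRITE(*,*) A(1)
      END

With the call in place, the first kernels region is timed without the warm-up cost folded in.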

  • Mat

Thank you for taking care, Mat.

May I ask back:

I already ran tests showing that the very first OpenACC statement in the whole program takes a long time. Is this what you mean by “warm-up time”?

Or do you mean that the device can somehow “lose its warm-up” when it is not used for a while during the program?

The latter would be new to me.

Anyway, the acc kernels region that contains the initializer is the very first one in the whole program. Sounds like you are right…

Benedikt

The OS/driver will power down the device once all contexts have ended. pgcudainit holds open a context.

Hence, you pay the warm-up penalty only once per program run (or again if you call acc_shutdown and then reinitialize the device).

Thanks for your advice. I am still struggling with the profiler output.

The profiler output does not match what I measure with a clock.

In my real program these effects are difficult to break down,
so I found a simple example: the attached program runs 6–8 seconds, while the profiling output reports approx. 200 µs.

Any suggestions?
Benedikt

      PROGRAM TEST
      IMPLICIT NONE 
      REAL*8 SUMME_GLOBAL
      REAL ZAHLEN(2000000)
      INTEGER LOOPS,SZ
      COMMON /Z/ ZAHLEN,SUMME_GLOBAL,LOOPS,SZ
      REAL*8 SUMME_LOCAL1,SUMME_LOCAL2,SUMME_LOCAL3      
      INTEGER I,J

      
      LOOPS= 300
      SZ = size(ZAHLEN)
!$acc data copy ( SUMME_GLOBAL,ZAHLEN)        
!$acc kernels
      DO J=1,size(ZAHLEN)
        ZAHLEN(j) = DBLE(j)
      END DO
!$acc end kernels
!$acc end data  

      SUMME_GLOBAL=0
!$acc data copy ( SUMME_GLOBAL,ZAHLEN,LOOPS,SZ)        
!$acc kernels
!$acc loop seq  reduction(+:SUMME_GLOBAL)
      DO I=1,LOOPS
      
        SUMME_LOCAL1=0
!$acc loop reduction(+:SUMME_LOCAL1)
        DO J=1,SZ
          SUMME_LOCAL1 = SUMME_LOCAL1 + sin(ZAHLEN(j))**2+
     &                                cos(ZAHLEN(j))**2 
        END DO
       
        SUMME_GLOBAL = SUMME_GLOBAL + SUMME_LOCAL1       
      END DO       
!$acc end kernels
!$acc end data        
      WRITE(*,*) SUMME_GLOBAL
      END

Compilercall:

"c:\Program Files (x86)\PGI\win32\14.7\bin\pgf90.exe" -tp=p7 -Mipa=fast,inline -Mvect -O2 -Mnoframe -Munroll=c:8 -Mlre -Mcache_align -Mflushz -acc -ta=nvidia,time -Mbackslash -Minfo=acc -Mpreprocess  testtiming.for -o testtiming.exe

It’s a known issue where we’re not getting the correct profile information from the CUDA driver. The workaround is to set the “CUDA_PROFILE” environment variable in addition to “PGI_ACC_TIME”. We’re moving to a different profiling interface to the CUDA driver, which hopefully will correct this issue.

Did you intend to add “seq” to the second kernel? Using it here causes a severe performance issue since only a single gang is used. I removed it when generating the following profile.
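
For reference, with “seq” removed the outer loop directive reads something like this (my reconstruction of the change; the reduction clause stays so SUMME_GLOBAL is still accumulated correctly across gangs):

!$acc kernels
!$acc loop reduction(+:SUMME_GLOBAL)
      DO I=1,LOOPS

The grid: [300] in the profile presumably reflects the 300 outer iterations now being spread across gangs, with the inner loop vectorized in blocks of 128.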

  • Mat
PGI$ export CUDA_PROFILE=1
PGI$ time ./test1.exe
    599999999.4506104

Accelerator Kernel Timing data
H:\tmp\test.f90
  test  NVIDIA  devicenum=0
    time(us): 725,779
    14: data region reached 1 time
        14: data copyin transfers: 2
             device time(us): total=2 max= min= avg=
        20: data copyout transfers: 2
             device time(us): total=2 max= min= avg=
    15: compute region reached 1 time
        16: kernel launched 1 time
            grid: [15625]  block: [128]
             device time(us): total=541 max=541 min=541 avg=541
    23: data region reached 1 time
        23: data copyin transfers: 4
             device time(us): total=4 max= min= avg=
        39: data copyout transfers: 4
             device time(us): total=4 max= min= avg=
    24: compute region reached 1 time
        27: kernel launched 1 time
            grid: [300]  block: [128]
             device time(us): total=725,215 max=725,215 min=725,215 avg=725,215
            elapsed time(us): total=730,000 max=730,000 min=730,000 avg=730,000
        27: reduction kernel launched 1 time
            grid: [1]  block: [256]
             device time(us): total=11 max=11 min=11 avg=11

real    0m1.260s
user    0m0.000s
sys     0m0.078s

OK, I didn’t know about this issue :-), but the CUDA_PROFILE hint turned out to be extremely valuable.

I applied it not only to the test program but also to the profiling of my real program, and now I get reasonable information for the first time.

(I do know that the test program is slow; it is one member of a series of experiments with kernel placement: inside/outside a loop, inside/outside a function, one big acc kernels region vs. many small ones. So far I haven’t understood why it is this slow…)

Greetings Benedikt