OpenACC/Fortran/PGI. Analysing my program with the output of -ta=nvidia,time
I get strange results.
Code was like this:
      SCHUBG(1:INOD) = 0
      DO 50 IJ = 1,INOD
        IF ( HO(IJ) .GT. HMIN ) THEN
          SCHUBG(IJ) = CMUVISC * HO(IJ) * RMANN(IJ) *
     &      SQRT ((UO(IJ)**2 + VO(IJ)**2 ) * GRAV / CBRT(HO(IJ)))
        END IF
   50 CONTINUE
The initializer
SCHUBG(1:INOD) = 0
took a ridiculous amount of time. The loop with “real” work was much faster.
So I merged the loops:
      DO 50 IJ = 1,INOD
        S = 0
        IF ( HO(IJ) .GT. HMIN ) THEN
          S = CMUVISC * HO(IJ) * RMANN(IJ) *
     &      SQRT ((UO(IJ)**2 + VO(IJ)**2 ) * GRAV / CBRT(HO(IJ)))
        END IF
        SCHUBG(IJ) = S
   50 CONTINUE
Now this remaining loop is slow. Other loops (called the same number of times, with the same loop bounds)
are much longer, write much more data, and yet run much faster.
Any hints?
Benedikt
Hi Benedikt,
Most likely you’re measuring the device warm-up time. On Linux, the devices are powered down when not in use and it takes approximately 1 second per device to warm them back up.
You can avoid this by running the “pgcudainit” utility in the background to keep the device open.
Also, you can make a call to “acc_init” early in your program. You’ll still incur the warm-up cost, but it will happen during initialization and not at the first OpenACC directive you encounter.
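A minimal sketch of the acc_init approach in fixed-form Fortran (assuming the OpenACC runtime module is available, as it is with the PGI compilers):

```
      PROGRAM WARMUP
      USE OPENACC
      IMPLICIT NONE
C     Pay the ~1 s device warm-up cost here, during program
C     initialization, instead of at the first OpenACC construct.
      CALL ACC_INIT( ACC_DEVICE_NVIDIA )
C     ... rest of the program; the first !$acc kernels region
C     no longer includes the device start-up time ...
      END
```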
Thank you for looking into this, Mat.
May I ask back:
I already ran tests showing that the very first OpenACC statement in the whole program takes a long time. Is that what you mean by “warm-up time”?
Or do you mean that the device can somehow “lose its warm-up” when it is not used for a while during a program run?
The latter would be new to me.
Anyway, the acc kernels region that contains the initializer is the very first one in the whole program. Sounds like you are right…
Benedikt
The OS/driver will power down the device once all contexts have ended. pgcudainit holds open a context.
Hence you pay the warm-up penalty only once, and the context stays open until your program ends (or until you call acc_shutdown and then reinitialize the device).
Thanks for your advice. I am still struggling with the profiler output.
The profiler output does not match what I measure with a wall clock.
In my real program these effects are difficult to break down,
so I made a simple example: the attached program runs 6-8 seconds, but the profiling output reports approx. 200 us.
Any suggestions?
Benedikt
      PROGRAM TEST
      IMPLICIT NONE
      REAL*8 SUMME_GLOBAL
      REAL ZAHLEN(2000000)
      INTEGER LOOPS,SZ
      COMMON /Z/ ZAHLEN,SUMME_GLOBAL,LOOPS,SZ
      REAL*8 SUMME_LOCAL1,SUMME_LOCAL2,SUMME_LOCAL3
      INTEGER I,J
      LOOPS = 300
      SZ = size(ZAHLEN)
!$acc data copy ( SUMME_GLOBAL,ZAHLEN)
!$acc kernels
      DO J=1,size(ZAHLEN)
        ZAHLEN(J) = DBLE(J)
      END DO
!$acc end kernels
!$acc end data
      SUMME_GLOBAL = 0
!$acc data copy ( SUMME_GLOBAL,ZAHLEN,LOOPS,SZ)
!$acc kernels
!$acc loop seq reduction(+:SUMME_GLOBAL)
      DO I=1,LOOPS
        SUMME_LOCAL1 = 0
!$acc loop reduction(+:SUMME_LOCAL1)
        DO J=1,SZ
          SUMME_LOCAL1 = SUMME_LOCAL1 + sin(ZAHLEN(J))**2 +
     &                   cos(ZAHLEN(J))**2
        END DO
        SUMME_GLOBAL = SUMME_GLOBAL + SUMME_LOCAL1
      END DO
!$acc end kernels
!$acc end data
      WRITE(*,*) SUMME_GLOBAL
      END
Compilercall:
"c:\Program Files (x86)\PGI\win32\14.7\bin\pgf90.exe" -tp=p7 -Mipa=fast,inline -Mvect -O2 -Mnoframe -Munroll=c:8 -Mlre -Mcache_align -Mflushz -acc -ta=nvidia,time -Mbackslash -Minfo=acc -Mpreprocess testtiming.for -o testtiming.exe
It’s a known issue where we’re not getting the correct profile information from the CUDA driver. The workaround is to set the “CUDA_PROFILE” environment variable in addition to “PGI_ACC_TIME”. We’re moving to a different profiling interface to the CUDA driver which hopefully will correct this issue.
Did you intend to add “seq” to the second kernel? Using it here causes a severe performance issue since only a single gang is used. I removed it when generating the following profile.
PGI$ export CUDA_PROFILE=1
PGI$ time ./test1.exe
599999999.4506104
Accelerator Kernel Timing data
H:\tmp\test.f90
test NVIDIA devicenum=0
time(us): 725,779
14: data region reached 1 time
14: data copyin transfers: 2
device time(us): total=2 max= min= avg=
20: data copyout transfers: 2
device time(us): total=2 max= min= avg=
15: compute region reached 1 time
16: kernel launched 1 time
grid: [15625] block: [128]
device time(us): total=541 max=541 min=541 avg=541
23: data region reached 1 time
23: data copyin transfers: 4
device time(us): total=4 max= min= avg=
39: data copyout transfers: 4
device time(us): total=4 max= min= avg=
24: compute region reached 1 time
27: kernel launched 1 time
grid: [300] block: [128]
device time(us): total=725,215 max=725,215 min=725,215 avg=725,215
elapsed time(us): total=730,000 max=730,000 min=730,000 avg=730,000
27: reduction kernel launched 1 time
grid: [1] block: [256]
device time(us): total=11 max=11 min=11 avg=11
real 0m1.260s
user 0m0.000s
sys 0m0.078s
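Put together, the workaround amounts to something like the following shell sketch (the executable name is taken from the thread and may differ on your system):

```shell
# Enable both the PGI accelerator timing output and the CUDA
# driver profiler; per the thread, PGI_ACC_TIME alone can report
# wrong device times until PGI moves to the new profiling interface.
export PGI_ACC_TIME=1
export CUDA_PROFILE=1
# time ./testtiming.exe   # then run the program under 'time'
```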
OK, I didn’t know about this issue :-), but the CUDA_PROFILE hint turned out to be extremely valuable.
I applied it not only to the test program but also to the profiling of my real program, and now I get reasonable information for the first time.
(I do know that the test program is slow - it is one member of a series in which I experimented with placing kernels: inside/outside a loop, inside/outside a function, one big acc kernels region vs. many small ones. So far I haven’t understood why it is so slow …)
Greetings Benedikt