OK - I’ve been following Mat’s advice. Let me report on progress. First, I should present my original code again, since my first post didn’t include all the outer loops and so on. Here it is, with the values of the relevant variables set explicitly to reproduce my test case (note this is just a toy code fragment which reproduces the essentials).
MODULE example
INTEGER,PARAMETER :: dp=kind(1.d0)
REAL(dp),ALLOCATABLE :: aaa(:,:,:)
REAL(dp),ALLOCATABLE :: basis_ri(:),basis_rj(:),basis_rij(:)
CONTAINS
SUBROUTINE blah
! Multiple subroutines call compute_f from within this module in the real code.
! Also, blah and the other module subroutines are themselves called multiple
! times from outside the module.
INTEGER i,ion,j,set,netot,nitot
REAL(dp) value
! This array is never modified - fixed coefficients.
allocate(aaa(27,3,2)) ; aaa=1.123123123d0
! These arrays are modified inside every call to compute_f.
allocate(basis_ri(0:2),basis_rj(0:2),basis_rij(0:2))
netot=201 ; nitot=51
do i=1,netot
do ion=1,nitot
set=1 ! some number between 1 and 2
do j=1,netot
call compute_f(value,set)
enddo
enddo
enddo
write(6,*)'Value = ',value
return
END SUBROUTINE blah
SUBROUTINE compute_f(value,set)
INTEGER,INTENT(in) :: set
REAL(dp),INTENT(out) :: value
INTEGER Nee,NeN,lmn,s,n,m,l
REAL(dp) valb,valc
s=1 ! some number between 1 and 3
Nee=2 ; NeN=2 ! always 2 here (could be something else in principle)
basis_ri=0.234234234d0
basis_rj=0.345345345d0
basis_rij=0.456456456d0
!$acc data region copyin(aaa(:,s,set)) copyin(basis_ri(:)) &
!$acc copyin(basis_rj(:)) copyin(basis_rij(:))
value=0.d0
lmn=0
!$acc region
do n=0,Nee
valb=0.d0
do m=0,NeN
valc=0.d0
do l=0,NeN
lmn=lmn+1
valc=valc+aaa(lmn,s,set)*basis_ri(l)
enddo ! l
valb=valb+valc*basis_rj(m)
enddo ! m
value=value+valb*basis_rij(n)
enddo ! n
!$acc end region
!$acc end data region
END SUBROUTINE compute_f
END MODULE example
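(In other words, each call to compute_f just evaluates
value = sum over n,m,l of aaa(lmn,s,set)*basis_ri(l)*basis_rj(m)*basis_rij(n)
with lmn running from 1 to 27, i.e. only a few dozen multiply-adds of real work per call when Nee=NeN=2.)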
I have now rewritten this along the lines Mat suggested. Thus:
MODULE example
INTEGER,PARAMETER :: dp=kind(1.d0)
REAL(dp),ALLOCATABLE :: aaa(:,:,:)
REAL(dp),ALLOCATABLE :: basis_ri(:),basis_rj(:),basis_rij(:)
REAL(dp),ALLOCATABLE :: valb(:,:),valc(:,:,:)
CONTAINS
SUBROUTINE blah ! multiple subroutines call compute_f in this module
INTEGER i,ion,j,set,netot,nitot
REAL(dp) value
! This array is never modified - fixed coefficients.
allocate(aaa(27,3,2)) ; aaa=1.123123123d0
! These arrays are modified inside every call to compute_f.
allocate(basis_ri(0:2),basis_rj(0:2),basis_rij(0:2))
allocate(valb(0:2,0:2),valc(0:2,0:2,0:2))
netot=201 ; nitot=51
do i=1,netot ! THIS LOOP ACTUALLY IN ANOTHER MODULE IN THE REAL CODE-
! CONTAINS TOO MUCH STUFF TO BE INSIDE DATA REGION
!$acc data region copyin(aaa)
do ion=1,nitot
set=1 ! some number between 1 and 2
do j=1,netot
call compute_f(value,set)
enddo
enddo
!$acc end data region
enddo ! THIS LOOP IN ANOTHER MODULE
write(6,*)'Value = ',value
return
END SUBROUTINE blah
SUBROUTINE compute_f(value,set)
INTEGER,INTENT(in) :: set
REAL(dp),INTENT(out) :: value
INTEGER Nee,NeN,lmn,s,n,m,l
! (valb and valc are now module-level allocatable arrays - see above)
s=1 ! some number between 1 and 3
Nee=2 ; NeN=2 ! always 2 here (could be something else in principle)
basis_ri=0.234234234d0
basis_rj=0.345345345d0
basis_rij=0.456456456d0
!$acc data region copyin(basis_ri,basis_rj,basis_rij), local(valb,valc)
value=0.d0
!$acc region
do n=0,Nee
do m=0,NeN
do l=0,NeN
lmn=(n*(NeN+1)*(NeN+1))+(m*(NeN+1))+l+1 ! closed form of the old running counter
valc(n,m,l)=aaa(lmn,s,set)*basis_ri(l)
enddo ! l
enddo ! m
enddo ! n
do n=0,Nee
do m=0,NeN
valb(n,m)=0.d0
do l=0,NeN
valb(n,m)=valb(n,m)+valc(n,m,l)*basis_rj(m)
enddo ! l
enddo ! m
enddo ! n
do n=0,Nee
do m=0,NeN
value=value+valb(n,m)*basis_rij(n)
enddo
enddo ! n
!$acc end region
!$acc end data region
END SUBROUTINE compute_f
END MODULE example
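As a sanity check on the closed-form index (note the stride for n is (NeN+1)*(NeN+1); writing (Nee+1)*(NeN+1) only gives the same numbers because Nee=NeN=2 here), a throwaway program like this confirms it reproduces the original running counter:
PROGRAM check_lmn
IMPLICIT NONE
INTEGER,PARAMETER :: Nee=2,NeN=2
INTEGER n,m,l,lmn_counter,lmn_closed
lmn_counter=0
do n=0,Nee
  do m=0,NeN
    do l=0,NeN
      lmn_counter=lmn_counter+1 ! the original running index
      lmn_closed=(n*(NeN+1)*(NeN+1))+(m*(NeN+1))+l+1
      if(lmn_closed/=lmn_counter)write(6,*)'Mismatch at',n,m,l
    enddo
  enddo
enddo
write(6,*)'Index check done.'
END PROGRAM check_lmn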
Here’s the accelerator report (for the real code):
8104, Generating copyin(aaa(:,:,:))
compute_f:
9502, Generating local(valc(:,:,:))
Generating local(valb(:,:))
Generating copyin(basis_rij(:))
Generating copyin(basis_rj(:))
Generating copyin(basis_ri(:))
9915, Generating copyin(aaa(:,s,set))
Generating compute capability 2.0 binary
9916, Loop is parallelizable
9917, Loop is parallelizable
9918, Loop is parallelizable
Accelerator kernel generated
9916, !$acc do parallel, vector(4) ! blockidx%y threadidx%y
9917, !$acc do parallel, vector(4) ! blockidx%x threadidx%z
9918, !$acc do vector(16) ! threadidx%x
Cached references to size [16] block of 'basis_ri'
CC 2.0 : 20 registers; 144 shared, 160 constant, 0 local memory bytes; 100% occupancy
9924, Loop is parallelizable
9925, Loop is parallelizable
Accelerator kernel generated
9924, !$acc do parallel, vector(32) ! blockidx%y threadidx%x
9925, !$acc do parallel, vector(8) ! blockidx%x threadidx%y
Cached references to size [8] block of 'basis_rj'
CC 2.0 : 22 registers; 72 shared, 156 constant, 0 local memory bytes; 83% occupancy
9927, Complex loop carried dependence of 'valb' prevents parallelization
Loop carried reuse of 'valb' prevents parallelization
Inner sequential loop scheduled on accelerator
9932, Loop is parallelizable
9933, Loop is parallelizable
Accelerator kernel generated
9932, !$acc do parallel, vector(16) ! blockidx%x threadidx%x
Cached references to size [16] block of 'basis_rij'
9933, !$acc do parallel, vector(16) ! blockidx%y threadidx%y
CC 2.0 : 16 registers; 2056 shared, 112 constant, 0 local memory bytes; 100% occupancy
9934, Sum reduction generated for value_f
Results are as follows:
MY ORIGINAL CODE (1 core of CPU, no GPU)
Total energy -289.380005052746
CPU time : 17.11 sec
MAT’S SUGGESTED REWRITING OF THE FORTRAN (1 core of CPU, no GPU)
Total energy -289.380005052746
CPU time : 22.42 sec
The big slowdown here is undesirable. Our code runs on everything from big supercomputers with tens of thousands of CPU cores to poor-quality laptops with or without GPUs, and it would be nice to use the same Fortran in all cases, supplemented by accelerator directives where necessary. I appreciate this is a difficult thing to achieve. Clearly, if Mat’s version proves to be better for the GPU case, the slowdown is big enough that the original Fortran will have to be used in the non-GPU case, even though having two versions of the code is messy and hard to maintain (and how do you switch between them at run time in a portable way anyway?).
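One possibility for the run-time switch - a minimal sketch only, assuming the compiler defines a macro such as _ACCEL when building for the accelerator (I believe pgfortran does this under -ta=nvidia, but check the docs), and where use_gpu is a hypothetical module-level logical set once at startup from an input file or environment variable:
SUBROUTINE compute_f(value,set)
! Thin dispatcher: compute_f_cpu holds the original loops and
! compute_f_acc the directive-annotated rewrite (both hypothetical
! names). Requires preprocessing (.F90 file or -Mpreprocess).
INTEGER,INTENT(in) :: set
REAL(dp),INTENT(out) :: value
#ifdef _ACCEL
if(use_gpu)then
  call compute_f_acc(value,set) ! rewritten version with directives
else
  call compute_f_cpu(value,set) ! original scalar version
endif
#else
call compute_f_cpu(value,set)   ! GPU path not compiled in at all
#endif
END SUBROUTINE compute_f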
ADD FULL SET OF ACCELERATOR DIRECTIVES (1 core of CPU with GPU)
HERE aaa IS INCLUDED IN THE SAME data region AS basis_ri, basis_rj etc. i.e. NOT AS IN THE ABOVE CODE FRAGMENT.
Total energy -289.380005052746 (correct answer, this time…)
CPU time : 710.01 sec (41 times slower!)
Accelerator Kernel Timing data
compute_f
9915: region entered 2139814 times
time(us): total=151192460 init=194964 region=150997496
kernels=58107115 data=0
w/o init: total=150997496 max=470 min=68 avg=70
9918: kernel launched 2139814 times
grid: [1] block: [16x4x4]
time(us): total=19205748 max=77 min=6 avg=8
9925: kernel launched 2139814 times
grid: [1] block: [32x8]
time(us): total=14996218 max=78 min=7 avg=7
9933: kernel launched 2139814 times
grid: [1] block: [16x16]
time(us): total=14770591 max=78 min=6 avg=6
9934: kernel launched 2139814 times
grid: [1] block: [256]
time(us): total=9134558 max=76 min=4 avg=4
compute_f
9502: region entered 2340253 times
time(us): total=696186744 init=1040667 region=695146077
data=51373056
w/o init: total=695146077 max=757 min=227 avg=297
ADD FULL SET OF ACCELERATOR DIRECTIVES (1 core of CPU with GPU)
HERE aaa HAS ITS OWN data region OUTSIDE THE nitot LOOP, AS IN THE ABOVE CODE FRAGMENT.
Total energy -289.380005052746 (correct)
CPU time : 510.06 sec (30 times slower)
Accelerator Kernel Timing data
compute_f
9915: region entered 2139814 times
time(us): total=179968284 init=155411 region=179812873
kernels=56122144 data=12882965
w/o init: total=179812873 max=489 min=82 avg=84
9918: kernel launched 2139814 times
grid: [1] block: [16x4x4]
time(us): total=18756495 max=77 min=6 avg=8
9925: kernel launched 2139814 times
grid: [1] block: [32x8]
time(us): total=15001354 max=402 min=7 avg=7
9933: kernel launched 2139814 times
grid: [1] block: [16x16]
time(us): total=13319290 max=368 min=6 avg=6
9934: kernel launched 2139814 times
grid: [1] block: [256]
time(us): total=9045005 max=394 min=4 avg=4
7415: region entered 4557 times
time(us): total=3884 init=513 region=3371
data=32203
w/o init: total=3371 max=257 min=0 avg=0
compute_f
9502: region entered 2340253 times
time(us): total=495579249 init=186635 region=495392614
data=37623734
w/o init: total=495392614 max=713 min=132 avg=211
So - we are at least getting the right answer with the GPU now, which is obviously a big improvement. However, the timing data are clearly unacceptable. For this application we have an obvious rate-limiting step taking up 80% of the CPU time, with a nice set of loops that are apparently parallelizable, yet we’re just making the whole thing 30 times slower than on a single CPU core. Hmmmm…
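A rough back-of-envelope from the timing data above suggests it is essentially all overhead: region 9502 is entered 2,340,253 times at an average of ~211 µs per entry, which is roughly 494 s - more or less the entire runtime - yet each entry performs only a few dozen multiply-adds of real work across the three little loop nests (Nee=NeN=2). The arithmetic itself is effectively free; what we are paying for, over two million times, is entering the data region, copying the basis arrays, and launching four tiny kernels.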
I guess the inner loops are considerably shorter than Mat was assuming, which probably hurts the efficiency. Also, I note that there is a:
9915, Generating copyin(aaa(:,s,set))
on encountering the !$acc region statement. Not entirely sure why…
Does anyone have any further ideas on how we can improve the speed here?
Thanks again - Mat - for your help with this. Much appreciated.
Django