I’m a beginner with PVF. To accelerate my large program on the GPU, I wrote a shorter but analogous code as a test. It is just a three-level loop nest (six loops, since each level is a pair of loops); each level computes a reduction that is used by the next outer level. I rewrote it like this:
      ffkk=0.0
!$acc region
      do 30 ik=1,nm
      do 30 iky=1,nm
        ffqq=0.0
        do 201 ip=1,nm
        do 201 ipy=1,nm
          ffqq1=0.0
          do 10 iq=1,nm
          do 10 iqy=1,nm
            ffq=1.0/nm/nm
            ffqq1=ffqq1+ffq
            ffqq(ip,ipy)=ffqq1/2.0
   10     continue
  201   continue
        ffpp=0.0
        do 20 ip=1,nm
        do 20 ipy=1,nm
          ffp=ffqq(ip,ipy)/nm/nm
          ffpp=ffpp+ffp
   20   continue
        ffk=ffpp/nm/nm
        ffkk=ffkk+ffk
   30 continue
!$acc end region
The result appears to be correct, but I don’t think the code is parallelized well, judging from the many ‘Loop is parallelizable’ messages in the compiler output below:
prog:
     31, Generating copyout(ffqq(1:20,1:20))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     32, Parallelization would require privatization of array 'ffqq(1:20,i3+1)'
     33, Parallelization would require privatization of array 'ffqq(1:20,i3+1)'
         Accelerator kernel generated
         32, !$acc do seq
         33, !$acc do seq
             CC 1.0 : 11 registers; 152 shared, 20 constant, 0 local memory bytes; 33% occupancy
             CC 2.0 : 20 registers; 128 shared, 48 constant, 0 local memory bytes; 16% occupancy
             56, Sum reduction generated for ffkk
     35, Loop is parallelizable
     36, Loop is parallelizable
     37, Loop is parallelizable
     40, Loop carried scalar dependence for 'ffqq1' at line 43
         Loop carried reuse of 'ffqq' prevents parallelization
     41, Loop carried scalar dependence for 'ffqq1' at line 43
         Loop carried reuse of 'ffqq' prevents parallelization
         Inner sequential loop scheduled on accelerator
     49, Loop is parallelizable
     50, Loop is parallelizable
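
To make "the result should be correct" concrete, the expected value can be worked out by hand: the innermost pair of loops adds 1/nm/nm a total of nm*nm times, so ffqq1 ends at about 1.0 and every ffqq(ip,ipy) ends at about 0.5; ffpp is then about 0.5, ffk about 0.5/nm/nm, and ffkk about 0.5. Below is a minimal CPU-only sketch of the same loop nest that could serve as a reference for the GPU result. It is written in free form with do/end do rather than labelled loops, the program name check_ffkk and the tolerance 1.0e-3 are my own choices for illustration, and nm = 20 is an assumption taken from the ffqq(1:20,1:20) bounds in the compiler output.

```fortran
! CPU-only reference for the test loop nest (no accelerator directives).
! Assumption: nm = 20, inferred from ffqq(1:20,1:20) in the compiler output.
program check_ffkk
  implicit none
  integer, parameter :: nm = 20
  integer :: ik, iky, ip, ipy, iq, iqy
  real :: ffqq(nm,nm), ffqq1, ffq, ffpp, ffp, ffk, ffkk

  ffkk = 0.0
  do ik = 1, nm
    do iky = 1, nm
      ffqq = 0.0
      do ip = 1, nm
        do ipy = 1, nm
          ! Innermost reduction: adds 1/nm/nm a total of nm*nm times (about 1.0)
          ffqq1 = 0.0
          do iq = 1, nm
            do iqy = 1, nm
              ffq = 1.0/nm/nm
              ffqq1 = ffqq1 + ffq
              ! Stored on every pass, as in the test code; the last store wins
              ffqq(ip,ipy) = ffqq1/2.0
            end do
          end do
        end do
      end do
      ! Middle reduction over ffqq (about 0.5)
      ffpp = 0.0
      do ip = 1, nm
        do ipy = 1, nm
          ffp = ffqq(ip,ipy)/nm/nm
          ffpp = ffpp + ffp
        end do
      end do
      ! Outer reduction (about 0.5 after all nm*nm passes)
      ffk = ffpp/nm/nm
      ffkk = ffkk + ffk
    end do
  end do

  print *, 'ffkk =', ffkk
  ! Hypothetical tolerance, chosen loosely to absorb single-precision rounding
  if (abs(ffkk - 0.5) < 1.0e-3) then
    print *, 'matches the hand-computed value 0.5'
  else
    print *, 'differs from the hand-computed value 0.5'
  end if
end program check_ffkk
```

Comparing the ffkk printed here with the one from the accelerated version should show whether the region computes the same thing; small differences in the last digits would be expected, since the GPU sum reduction may add the terms in a different order.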
I look forward to your helpful advice.