The differences between the ‘kernels’ and the ‘parallel’ directives are still not clear to me, so I’m trying both.
I had the impression that ‘parallel’ would give me more control, but if I try to specify the number of gangs/workers/vectors I get the error:
PGF90-S-0533-Clause ‘Worker(value)’ not allowed in ‘Parallel Loop’ directive .
Changing it to ‘kernels’, as in the following, causes no problem. Is there any reason why I cannot do the same within a ‘parallel’ region?
36 !$acc kernels present(zc)
37 !$acc loop gang(9) collapse(2)
38 do k=kmin,kmax
39   do kp=kmin,kmax
40     k2=2*k
41     km = MIN(k,kp)
42     kp2=2*kp
43     z0=3.d0*dble(ju2+1)*dsqrt(dble(k2+1))*dsqrt(dble(kp2+1))
44     do q=-km,km
45       q2=2*q
47       ! Calculate quantity C and its sum over magnetic quantum numbers
48       !$acc loop worker(16) collapse(2)
49       do mu2=-ju2,ju2,2
50         do ml2=-jl2,jl2,2
51           p2=mu2-ml2
52           if(abs(p2).gt.2) cycle
53           z1=w3js(ju2,jl2,2,mu2,-ml2,-p2)
54           !$acc loop vector(32)
Another question: with the above code, the compiler gives the following information:
36, Generating present(zc(:,:,:,:,:,:,:))
38, Loop is parallelizable
39, Loop is parallelizable
    Accelerator kernel generated
    Generating Tesla code
    38, !$acc loop gang(9) collapse(2) ! blockidx%x
    39,   ! blockidx%x collapsed
    44, !$acc loop seq
    49, !$acc loop worker(16) collapse(2) ! threadidx%y
    50,   ! threadidx%y collapsed
    55, !$acc loop vector(32) ! threadidx%x
    60, !$acc loop seq
44, Loop carried dependence of zc prevents parallelization
    Loop carried backward dependence of zc prevents vectorization
49, Loop is parallelizable
50, Loop is parallelizable
55, Loop is parallelizable
60, Loop is parallelizable
And again, I’m not sure how to interpret the output. The compiler says “Loop is parallelizable” for every DO loop except the one in line 44. I understand that this is only the analysis stage, so it says that a loop could be parallelized, not that parallel code has actually been generated for it.
The loops in lines 38-39 have been parallelized correctly, as it says “Accelerator kernel generated” and “Generating Tesla code”. Within the output for line 39, it also tells me that the loops in lines 44 and 60 will run sequentially, and that the compiler was able to generate worker parallelism for loops 49-50 and vector parallelism for loop 55.
But how do I interpret the rest of the lines?
44, Loop carried dependence of zc prevents parallelization
For some reason it thinks that there is a dependence on zc and cannot parallelize this loop. That’s OK as long as the other loops where I put an !$acc loop get parallelized, since I know that there are no dependencies. Is this message warning that no parallelization will be done at this and lower levels, or only for this particular loop?
49,50,55,60 Loop is parallelizable. So no parallel code was generated for these loops? Within the info for loop 39 it looks like it was generated, but these messages confuse me.
Sorry to nitpick, but I want to understand all the information that the compiler provides as well as possible, in order to aim for the best possible performance.
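P.S. For reference, my current reading of the OpenACC specification is that inside a ‘parallel’ region the sizes belong on the directive itself (num_gangs, num_workers, vector_length), while the gang/worker/vector clauses on the loops carry no value. Below is a minimal sketch of that reading, which I would be glad to have confirmed or corrected. The array a, the size n and the loop body are placeholders of my own, not my real code:

```fortran
program parallel_sizes
  implicit none
  integer, parameter :: n = 32        ! placeholder problem size (assumption)
  integer :: i, j, k
  double precision :: a(n,n,n)        ! placeholder array standing in for zc

  ! Assumed form: sizes are given once, on the 'parallel' directive ...
  !$acc parallel num_gangs(9) num_workers(16) vector_length(32) copyout(a)
  ! ... and the loop clauses only name the level, without a value.
  !$acc loop gang
  do k = 1, n
     !$acc loop worker
     do j = 1, n
        !$acc loop vector
        do i = 1, n
           a(i,j,k) = dble(i + j + k)
        end do
     end do
  end do
  !$acc end parallel

  print *, 'sum =', sum(a)
end program parallel_sizes
```

If that reading is right, it would explain the PGF90-S-0533 error, but it would also mean that the sizes apply to the whole ‘parallel’ region rather than to individual loops.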