Hi, I am trying to optimize a few routines in a Fortran code, but I am having trouble getting a couple of them to parallelize.
In one routine, grid, I have put accelerator compute directives around the main nested do loop. Here is the compiler feedback:
grid:
    550, Loop interchange produces reordered loop nest: 551,550
         Loop unrolled 33 times (completely unrolled)
    551, Loop not vectorized: may not be beneficial
         Unrolled inner loop 8 times
         Residual loop unrolled 1 times (completely unrolled)
    561, Generating copyin(rwx(:lr))
         Generating copyin(x3(:))
         Generating copyin(rwy(:lr))
         Generating copyin(y3(:))
         Generating copy(den(:,:))
         Generating copyin(w3(:))
         Generating copyin(mu(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
    562, Loop carried dependence due to exposed use of 'den(:,:)' prevents parallelization
         Accelerator kernel generated
         562, !$acc do seq
              Non-stride-1 accesses for array 'y3'
              Non-stride-1 accesses for array 'x3'
              Non-stride-1 accesses for array 'mu'
              Non-stride-1 accesses for array 'w3'
              CC 1.0 : 34 registers; 112 shared, 48 constant, 0 local memory bytes; 12% occupancy
              CC 2.0 : 26 registers; 0 shared, 148 constant, 0 local memory bytes; 16% occupancy
    569, Complex loop carried dependence of 'den' prevents parallelization
         Loop carried dependence due to exposed use of 'den(:,:)' prevents parallelization
    604, Loop unrolled 33 times (completely unrolled)
    609, Loop unrolled 33 times (completely unrolled)
    614, Loop interchange produces reordered loop nest: 615,614
         Loop unrolled 33 times (completely unrolled)
    615, Loop not vectorized: may not be beneficial
I am particularly interested in what "exposed use of 'den(:,:)'" and "Complex loop carried dependence of 'den'" mean, since I expect these are the messages that matter most. Is it possible to parallelize this loop, and if so, do you have any advice on how?
The second routine, cpush, yields the following compiler feedback:
cpush:
    385, Generating copy(w2(:))
         Generating copy(y2(:))
         Generating copy(x2(:))
         Generating copy(nos(n))
         Generating copy(ke(n))
         Generating copy(efl(n))
         Generating copy(pfl(n))
         Generating copy(w1(:))
         Generating copy(w3(:))
         Generating copy(y1(:))
         Generating copy(x1(:))
         Generating copy(u2(:))
         Generating copyout(u3(:))
         Generating copy(u1(:))
         Generating copyin(rwx(:lr))
         Generating copy(x3(:))
         Generating copyin(rwy(:lr))
         Generating copy(y3(:))
         Generating copyin(ex(:,:))
         Generating copyin(ey(:,:))
         Generating copyin(mu(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
    386, Complex loop carried dependence of 'pfl' prevents parallelization
         Loop carried dependence due to exposed use of 'pfl(n)' prevents parallelization
         Complex loop carried dependence of 'efl' prevents parallelization
         Loop carried dependence due to exposed use of 'efl(n)' prevents parallelization
         Complex loop carried dependence of 'ke' prevents parallelization
         Loop carried dependence due to exposed use of 'ke(n)' prevents parallelization
         Complex loop carried dependence of 'nos' prevents parallelization
         Loop carried dependence due to exposed use of 'nos(n)' prevents parallelization
         Accelerator kernel generated
         386, !$acc do seq
              Non-stride-1 accesses for array 'w2'
              Non-stride-1 accesses for array 'y2'
              Non-stride-1 accesses for array 'x2'
              Non-stride-1 accesses for array 'w1'
              Non-stride-1 accesses for array 'w3'
              Non-stride-1 accesses for array 'y3'
              Non-stride-1 accesses for array 'y1'
              Non-stride-1 accesses for array 'x3'
              Non-stride-1 accesses for array 'x1'
              Non-stride-1 accesses for array 'u2'
              Non-stride-1 accesses for array 'u3'
              Non-stride-1 accesses for array 'u1'
              Non-stride-1 accesses for array 'mu'
              CC 1.0 : 43 registers; 256 shared, 44 constant, 0 local memory bytes; 8% occupancy
              CC 2.0 : 37 registers; 0 shared, 292 constant, 0 local memory bytes; 16% occupancy
    393, Loop is parallelizable
I can also post the subroutines if that would be helpful.
Thanks for any advice or pointers to documentation I should read,
Ben