I have a typical loop, for example
DO k = 1, JFLagcnt
   j = LSTjflag(k)
   w(j) = (w(j)+B(j)*p(j))/B(j)
ENDDO
Initially the compiler reports "Loop not vectorized: data dependency. Loop unrolled 4 times", so I try a compiler directive:
!pgi$l nodepchk
DO k = 1, JFLagcnt
   j = LSTjflag(k)
   w(j) = (w(j)+B(j)*p(j))/B(j)
ENDDO
and get the following compiler output: "Loop not parallelized: innermost. Loop not vectorized: may not be beneficial. Loop unrolled 4 times".
The loop runs over >12000 elements. What am I doing wrong, and how do I get it to vectorize? There are a number of loops like this, so it will affect run time.
Hi deeppow,
I think you’re fine, but the compiler isn’t tuned to vectorize loops where the indexes come from a look-up table. I added a feature request (TPR#19181) and we will have our engineers see what we can do.
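For what it's worth, the conservatism is easy to see: if the lookup table ever held a duplicate index, the iterations would overlap. Here's a tiny self-contained sketch; the table values are made up, not from your code:

```fortran
PROGRAM dep_demo
  IMPLICIT NONE
  INTEGER :: k, j
  ! Hypothetical table in which index 3 appears twice
  INTEGER, PARAMETER :: LSTjflag(3) = (/ 3, 7, 3 /)
  REAL :: w(10), B(10), p(10)
  w = 1.0
  B = 2.0
  p = 3.0
  DO k = 1, 3
     j = LSTjflag(k)
     ! Iterations 1 and 3 both read and then write w(3), so the
     ! iterations are not independent. Without NODEPCHK the
     ! compiler must assume this can happen and won't vectorize.
     w(j) = (w(j) + B(j)*p(j)) / B(j)
  ENDDO
  PRINT *, w(3)
END PROGRAM dep_demo
```

If your tables are guaranteed duplicate-free, the loop really is safe; the compiler just can't prove it from the code alone.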
Thanks,
Mat
Mat,
I use indirect indexing to avoid repeated if-testing: do the test once, store the indices that pass, and reuse them. It's an old method; is there a better way these days?
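Concretely, the pattern is something like this (everything other than the names from my loops above is illustrative):

```fortran
PROGRAM gather_demo
  IMPLICIT NONE
  INTEGER, PARAMETER :: ncells = 8
  INTEGER :: LSTjflag(ncells), JFLagcnt, j, k
  LOGICAL :: jflag(ncells)
  REAL :: w(ncells), B(ncells), p(ncells)
  w = 1.0
  B = 2.0
  p = 3.0
  jflag = (/ .TRUE., .FALSE., .TRUE., .TRUE., &
             .FALSE., .FALSE., .TRUE., .FALSE. /)
  ! Do the if-test once and store the indices that pass...
  JFLagcnt = 0
  DO j = 1, ncells
     IF (jflag(j)) THEN
        JFLagcnt = JFLagcnt + 1
        LSTjflag(JFLagcnt) = j
     ENDIF
  ENDDO
  ! ...then every later loop reuses the list instead of re-testing
  DO k = 1, JFLagcnt
     j = LSTjflag(k)
     w(j) = (w(j) + B(j)*p(j)) / B(j)
  ENDDO
END PROGRAM gather_demo
```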
-ralph
An additional weird problem is associated with
DO k = 1, JFLagcnt
   j = LSTjflag(k)
   u(j) = r(j) + betah(j)
   p(j) = u(j) + beta*(beta*p(j)+h(j))
ENDDO
which produces the compiler output "Loop not vectorized: data dependency. Loop unrolled 2 times". Most data dependency failures, such as the one noted above, produce unrolling of 4 times. Even though the default is 4, I tried to force it using the compiler option "-Munroll=c:4", which, as one might expect, doesn't change the behavior.
-ralph
Hi Ralph,
Try using "-Munroll=n:4" or "-Munroll=m:4" instead. The "c" option controls the maximum loop count for completely unrolling a loop. "n" controls the unroll factor for single-block loops, while "m" controls the factor for multi-block loops.
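To make the factor concrete, here is roughly what an unroll factor of 4 does to your first loop. This is only a sketch of the transformation; the compiler works at the instruction level and generates the remainder handling itself (nmain here is a stand-in for a compiler temporary):

```fortran
! Conceptual effect of -Munroll=n:4 on the original loop
nmain = JFLagcnt - MOD(JFLagcnt, 4)
DO k = 1, nmain, 4
   j = LSTjflag(k)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
   j = LSTjflag(k+1)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
   j = LSTjflag(k+2)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
   j = LSTjflag(k+3)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
ENDDO
DO k = nmain + 1, JFLagcnt   ! remainder iterations
   j = LSTjflag(k)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
ENDDO
```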
Hope this helps,
Mat
Mat,
"-Munroll=n:4" gave me an ~15% speed-up on my test problem, from ~10 min to ~8.5 min. There was more than just the one case I noted.
This situation arises due to my use of indirect addressing (what you call table lookup) for array indexes for a large number of cases (mesh/grid cells).
-ralph
Any benefit if you increase the unroll factor even further? -Munroll=n:8 or even -Munroll=n:64?
No improvement. I tried 8. I did look at the Task Manager CPU loading with 4 and it was at or near 100% most of the time, so I figured it wasn't going to help much.
I understand what the unrolling does to the do-loop, but I'm not quite sure how it uses the CPU architecture. I'd assume the unrolled loops are shipped off to different cores. How is this different from parallelization on a CPU?
> No improvement. I tried 8.
Too bad, but worth a try.
> I'd assume the unrolled loops are shipped off to different cores. How is this different from parallelization on a CPU?
No, unrolling does not auto-parallelize; the unrolled code still executes sequentially on a single core.
Mat,
I must show my ignorance here. I would interpret sequential to mean on only one core, i.e. just like an old single-core scalar processor. If that is true, what advantage does unrolling give? It would seem to even add a little loop overhead. Since that makes no sense, I conclude I have something wrong in my thinking.
-ralph
> If that is true, what advantage does unrolling give?
Besides the reduction in branching, which can be quite costly, the compiler can better perform instruction scheduling and memory prefetching (which benefits caching), and it may be able to eliminate repeated instructions.
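A generic example, not your loop, of where the time goes (a, b, s, t, n, and nmain are all illustrative names): in the unrolled form there is one compare-and-branch per four elements instead of one per element, and the four independent statements give the scheduler work to overlap with the memory loads, all on one core:

```fortran
! Rolled: one loop-end test and branch per element
DO k = 1, n
   a(k) = b(k)*s + t
ENDDO

! Unrolled by 4: one test and branch per FOUR elements, and the
! four independent loads and multiplies can be issued back to
! back, so memory latency overlaps with arithmetic on one core
nmain = n - MOD(n, 4)
DO k = 1, nmain, 4
   a(k)   = b(k)*s   + t
   a(k+1) = b(k+1)*s + t
   a(k+2) = b(k+2)*s + t
   a(k+3) = b(k+3)*s + t
ENDDO
DO k = nmain + 1, n   ! remainder iterations
   a(k) = b(k)*s + t
ENDDO
```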