Unrolling of data-dependent loops

I have a typical loop, for example:

DO k = 1, JFLagcnt
   j = LSTjflag(k)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
ENDDO

Initially the compiler reports "Loop not vectorized: data dependency. Loop unrolled 4 times", so I try a compiler directive:

!pgi$l nodepchk
DO k = 1, JFLagcnt
   j = LSTjflag(k)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
ENDDO

and get the following compiler output: "Loop not parallelized: innermost. Loop not vectorized: may not be beneficial. Loop unrolled 4 times".

The loop runs over more than 12000 elements. What am I doing wrong, and how do I get it to vectorize? There are a number of loops like this, so it will affect run time.

Hi deeppow,

I think you’re fine, but the compiler isn’t tuned to vectorize loops where the indexes come from a look-up table. I added a feature request (TPR#19181) and we will have our engineers see what we can do.

Thanks,
Mat

Mat,

I use indirect indexing to avoid repeated if-testing: do the test once and store the indices for reuse. It's an old method; is there a better way these days?
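
For context, the pattern is the classic gather: sweep the mesh once, store the indices of the cells that pass the test, then drive many later loops off that list with no IF inside. A sketch (flag and ncells are made-up names here, standing in for whatever the real test and mesh size are):

! Test once: gather the indices of flagged cells into a list
JFLagcnt = 0
DO j = 1, ncells
   IF (flag(j)) THEN
      JFLagcnt = JFLagcnt + 1
      LSTjflag(JFLagcnt) = j
   ENDIF
ENDDO

! Reuse the list in many later loops, with no IF in the body
DO k = 1, JFLagcnt
   j = LSTjflag(k)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
ENDDO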


-ralph

An additional weird problem is associated with

DO k = 1, JFLagcnt
   j = LSTjflag(k)
   u(j) = r(j) + betah(j)
   p(j) = u(j) + beta*(beta*p(j) + h(j))
ENDDO

which produces the compiler output "Loop not vectorized: data dependency. Loop unrolled 2 times". Most data-dependency failures such as the one noted above produce unrolling of 4 times. Even though the default is 4, I tried to force it with the compiler option "-Munroll=c:4", which, as one might expect, doesn't change the behavior.


-ralph

Hi Ralph,

Try using "-Munroll=n:4" or "-Munroll=m:4" instead. The "c" option controls the maximum loop count for completely unrolling a loop; "n" controls the unroll factor for single-block loops, while "m" controls the factor for multi-block loops.
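
For example (the file name is just a placeholder; adding -Minfo=loop,vect makes the compiler print the loop messages you've been quoting):

pgfortran -fast -Munroll=n:4 -Minfo=loop,vect mycode.f90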

Hope this helps,
Mat

Mat,

"-Munroll=n:4" gave me an ~15% speed-up on my test problem, from ~10 min to ~8.5 min. There was more than just the one case I noted.

This situation arises from my use of indirect addressing (what you call table lookup) for array indexes in a large number of cases (mesh/grid cells).


-ralph

Any benefit if you increase the unroll factor even further? -Munroll=n:8 or even -Munroll=n:64?

Mat

No improvement. I tried 8. I did look at the task manager CPU loading with 4, and it was at or near 100% most of the time, so I figured it wasn't going to help much.

I understand what the unrolling does to the do-loop, but I'm not quite sure how it uses the CPU architecture. I would assume the unrolled loops are shipped off to different cores. How is this different from parallelization on a CPU?

> No improvement. I tried 8.

Too bad, but worth a try.

> I would assume the unrolled loops are shipped off to different cores. How is this different from parallelization on a CPU?

No, unrolling does not auto-parallelize; the code is still executed sequentially.

Mat

Mat,
I must show my ignorance here. I would interpret "sequential" to mean on only one core, i.e. just like an old single-core scalar processor. If that is true, what advantage is unrolling? It would seem to even add a little loop overhead. Since that makes no sense, I conclude I have something wrong in my thinking.

-ralph

> If that is true, what advantage is unrolling?

Besides the reduction in branching, which can be quite costly, the compiler can better perform instruction scheduling and memory prefetching (which benefits caching), and may be able to eliminate repeated instructions.
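
As a sketch, an unroll factor of 4 turns your first loop into roughly the following. It is still one sequential instruction stream on one core; the win is fewer loop-end branches and four independent statements for the scheduler to interleave (n4 and j0..j3 are just illustrative temporaries):

! Main body: one loop-end branch per four iterations
n4 = JFLagcnt - MOD(JFLagcnt, 4)
DO k = 1, n4, 4
   j0 = LSTjflag(k)
   j1 = LSTjflag(k+1)
   j2 = LSTjflag(k+2)
   j3 = LSTjflag(k+3)
   w(j0) = (w(j0) + B(j0)*p(j0)) / B(j0)
   w(j1) = (w(j1) + B(j1)*p(j1)) / B(j1)
   w(j2) = (w(j2) + B(j2)*p(j2)) / B(j2)
   w(j3) = (w(j3) + B(j3)*p(j3)) / B(j3)
ENDDO
! Remainder loop picks up the leftover iterations
DO k = n4 + 1, JFLagcnt
   j = LSTjflag(k)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
ENDDO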

Mat