I have a typical loop, for example
DO k = 1, JFLagcnt
   j = LSTjflag(k)
   w(j) = (w(j)+B(j)*p(j))/B(j)
ENDDO
Initially the compiler reports "Loop not vectorized: data dependency. Loop unrolled 4 times", so I try a compiler directive:
!pgi$l nodepchk
DO k = 1, JFLagcnt
   j = LSTjflag(k)
   w(j) = (w(j)+B(j)*p(j))/B(j)
ENDDO
and get the following compiler output: "Loop not parallelized: innermost. Loop not vectorized: may not be beneficial. Loop unrolled 4 times".
The loop runs over >12000 elements. What am I doing wrong, and how do I get it to vectorize? There are a number of loops like this, so it will affect run time.
Hi deeppow,
I think you’re fine, but the compiler isn’t tuned to vectorize loops where the indexes come from a look-up table. I added a feature request (TPR#19181) and we will have our engineers see what we can do.
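For what it's worth, the conservatism is easy to see: if the lookup table ever held a duplicate index, the iterations would overlap. Here's a tiny self-contained sketch; the table values are made up, not from your code:

```fortran
PROGRAM dep_demo
  IMPLICIT NONE
  INTEGER :: k, j
  ! Hypothetical table in which index 3 appears twice
  INTEGER, PARAMETER :: LSTjflag(3) = (/ 3, 7, 3 /)
  REAL :: w(10), B(10), p(10)
  w = 1.0
  B = 2.0
  p = 3.0
  DO k = 1, 3
     j = LSTjflag(k)
     ! Iterations 1 and 3 both read and then write w(3), so the
     ! iterations are not independent. Without NODEPCHK the
     ! compiler must assume this can happen and won't vectorize.
     w(j) = (w(j) + B(j)*p(j)) / B(j)
  ENDDO
  PRINT *, w(3)
END PROGRAM dep_demo
```

If your tables are guaranteed duplicate-free, the loop really is safe; the compiler just can't prove it from the code alone.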
Thanks,
Mat
Mat,
I use indirect indexing to avoid repeated if-testing: do the test once, store the indices that pass, and reuse them. It's an old method; is there a better way these days?
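Concretely, the pattern is something like this (everything other than the names from my loops above is illustrative):

```fortran
PROGRAM gather_demo
  IMPLICIT NONE
  INTEGER, PARAMETER :: ncells = 8
  INTEGER :: LSTjflag(ncells), JFLagcnt, j, k
  LOGICAL :: jflag(ncells)
  REAL :: w(ncells), B(ncells), p(ncells)
  w = 1.0
  B = 2.0
  p = 3.0
  jflag = (/ .TRUE., .FALSE., .TRUE., .TRUE., &
             .FALSE., .FALSE., .TRUE., .FALSE. /)
  ! Do the if-test once and store the indices that pass...
  JFLagcnt = 0
  DO j = 1, ncells
     IF (jflag(j)) THEN
        JFLagcnt = JFLagcnt + 1
        LSTjflag(JFLagcnt) = j
     ENDIF
  ENDDO
  ! ...then every later loop reuses the list instead of re-testing
  DO k = 1, JFLagcnt
     j = LSTjflag(k)
     w(j) = (w(j) + B(j)*p(j)) / B(j)
  ENDDO
END PROGRAM gather_demo
```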
-ralph
An additional weird problem is associated with
DO k = 1, JFLagcnt
   j = LSTjflag(k)
   u(j) = r(j) + betah(j)
   p(j) = u(j) + beta*(beta*p(j)+h(j))
ENDDO
which produces the compiler output "Loop not vectorized: data dependency. Loop unrolled 2 times". Most data dependency failures, such as the one noted above, produce unrolling of 4 times. Even though the default is 4, I tried to force it using the compiler option "-Munroll=c:4", which, as one might expect, doesn't change the behavior.
-ralph
Hi Ralph,
Try using "-Munroll=n:4" or "-Munroll=m:4" instead. The "c" option controls the maximum loop count for completely unrolling a loop. "n" controls the unroll factor for single-block loops, while "m" controls the factor for multi-block loops.
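To make the factor concrete, here is roughly what an unroll factor of 4 does to your first loop. This is only a sketch of the transformation; the compiler works at the instruction level and generates the remainder handling itself (nmain here is a stand-in for a compiler temporary):

```fortran
! Conceptual effect of -Munroll=n:4 on the original loop
nmain = JFLagcnt - MOD(JFLagcnt, 4)
DO k = 1, nmain, 4
   j = LSTjflag(k)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
   j = LSTjflag(k+1)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
   j = LSTjflag(k+2)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
   j = LSTjflag(k+3)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
ENDDO
DO k = nmain + 1, JFLagcnt   ! remainder iterations
   j = LSTjflag(k)
   w(j) = (w(j) + B(j)*p(j)) / B(j)
ENDDO
```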
Hope this helps,
Mat
Mat,
"-Munroll=n:4" gave me an ~15% speed-up on my test problem, from ~10 min to ~8.5 min. There was more than just the one case I noted.
This situation arises due to my use of indirect addressing (what you call table lookup) for array indexes for a large number of cases (mesh/grid cells).
-ralph
Any benefit if you increase the unroll factor even further? -Munroll=n:8 or even -Munroll=n:64?
No improvement. I tried 8. I did look at the Task Manager CPU loading with 4 and it was at or near 100% most of the time, so I figured it wasn't going to help much.
I understand what the unrolling does to the do-loop, but I'm not quite sure how it uses the CPU architecture. I'd assume the unrolled loops are shipped off to different cores. How is this different from parallelization on a CPU?
> No improvement. I tried 8.
Too bad, but worth a try.
> I'd assume the unrolled loops are shipped off to different cores. How is this different from parallelization on a CPU?
No, unrolling does not auto-parallelize; the unrolled code still executes sequentially on a single core.
Mat,
I must show my ignorance here. I would interpret sequential to mean on only one core, i.e. just like an old single-core scalar processor. If that is true, what advantage does unrolling give? It would seem to even add a little loop overhead. Since that makes no sense, I conclude I have something wrong in my thinking.
-ralph
> If that is true, what advantage does unrolling give?
Besides the reduction in branching, which can be quite costly, the compiler can better perform instruction scheduling and memory prefetching (which benefits caching), and it may be able to eliminate repeated instructions.
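A generic example, not your loop, of where the time goes (a, b, s, t, n, and nmain are all illustrative names): in the unrolled form there is one compare-and-branch per four elements instead of one per element, and the four independent statements give the scheduler work to overlap with the memory loads, all on one core:

```fortran
! Rolled: one loop-end test and branch per element
DO k = 1, n
   a(k) = b(k)*s + t
ENDDO

! Unrolled by 4: one test and branch per FOUR elements, and the
! four independent loads and multiplies can be issued back to
! back, so memory latency overlaps with arithmetic on one core
nmain = n - MOD(n, 4)
DO k = 1, nmain, 4
   a(k)   = b(k)*s   + t
   a(k+1) = b(k+1)*s + t
   a(k+2) = b(k+2)*s + t
   a(k+3) = b(k+3)*s + t
ENDDO
DO k = nmain + 1, n   ! remainder iterations
   a(k) = b(k)*s + t
ENDDO
```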