Loop unrolling

Hello,

I have a loop with a trip count of just 2 that sits at the heart of a triply-nested loop. I think the best thing to do is to unroll this loop, but the compiler (v.11.3) rejects my !$ACC DO SEQ, UNROLL(2). If I use !$ACC DO UNROLL(2) instead, it doesn’t complain, but its output tells me that it has used DO VECTOR(2). Is that because VECTOR(2) is the better option and it has silently ignored my advice? I was thinking there may be a loop set-up overhead that I could avoid by unrolling…

Thanks for any insight,

Andy.

Just to update my own post: I’ve gone through and manually unrolled those small loops, and I see roughly a factor-of-two speed-up on quite a few of them.
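For readers following along, the kind of manual unrolling described above can be sketched in plain C. This is only an illustration, not Andy's actual code (which is Fortran and is not shown in the thread): the function names, the array layout, and the summation body are all hypothetical stand-ins for an inner loop with a trip count of 2.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: the innermost loop runs exactly twice for every i.
   (Hypothetical example body: sum each adjacent pair of elements.) */
void sum_pairs_rolled(const double *a, double *out, size_t n)
{
    assert(n == 0 || (a != NULL && out != NULL));
    for (size_t i = 0; i < n; i++) {
        double acc = 0.0;
        for (int k = 0; k < 2; k++) {   /* trip count of 2 */
            acc += a[2 * i + k];
        }
        out[i] = acc;
    }
}

/* Hand-unrolled form: the two iterations are written out explicitly,
   so there is no inner loop counter, compare, or branch left. */
void sum_pairs_unrolled(const double *a, double *out, size_t n)
{
    assert(n == 0 || (a != NULL && out != NULL));
    for (size_t i = 0; i < n; i++) {
        out[i] = a[2 * i] + a[2 * i + 1];
    }
}
```

Both functions compute the same result; whether the unrolled form is actually faster inside an accelerator region depends on the compiler and the loop schedule it picks, so it is worth timing both.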

Hi Andy,

The semantics of the unroll clause have changed a bit. The unroll clause is now applied to the preceding loop schedule clause. Hence, “!$ACC DO UNROLL(2)” has no meaning, because there is no schedule clause for the unroll to apply to. “!$ACC DO SEQ, UNROLL(2)” means apply the unroll to the sequential loop, which I don’t think would apply here since you don’t have a sequential loop.

Instead, try adding “!$ACC DO VECTOR(16) UNROLL(2)” before line 246 (i.e. the inner loop) and see if that gets the desired unrolling.

- Mat

Hi Mat,

You wrote: “Instead, try adding ‘!$ACC DO VECTOR(16) UNROLL(2)’ before line 246 (i.e. the inner loop) and see if that gets the desired unrolling.”

I don’t understand this idea. Could you explain it? I’ve tried it by leaving the directives off the outer three loops and putting only the directive you suggested on the inner loop (what was line 246 but is now line 268). The compiler responds with:

    268, Loop is parallelizable
         Accelerator kernel generated
        259, !$acc do parallel ! blockidx%y
        263, !$acc do parallel, vector(2) ! blockidx%x threadidx%z
        265, !$acc do vector(8) ! threadidx%x
             Cached references to size [8x2] block of 'e3u'
             Cached references to size [8x2] block of 'e2u'
             Cached references to size [8x2] block of 'ahtu'
             Cached references to size [8x2] block of 'e1u'
             Cached references to size [8x2] block of 'uslp'
             Cached references to size [8x2] block of 'umask'
             Cached references to size [9x3x2] block of 'tmask'
             Cached references to size [8x2] block of 'e3v'
             Cached references to size [8x2] block of 'e1v'
             Cached references to size [8x2] block of 'ahtv'
             Cached references to size [8x2] block of 'e2v'
             Cached references to size [8x2] block of 'vslp'
             Cached references to size [8x2] block of 'vmask'
        268, !$acc do vector(32) unroll(2) ! threadidx%y
             Cached references to size [9x3x32] block of 'zdk1t'
             Cached references to size [9x3x32] block of 'zdkt'
             CC 1.3 : 64 registers; 15816 shared, 1192 constant, 328 local memory bytes; 25% occupancy
             CC 2.0 : 63 registers; 15800 shared, 1200 constant, 0 local memory bytes; 33% occupancy

and the resulting code is very slow.
I’ll email you the code (same one as in my other post).

Thanks for your help,

Andy.

Hi,
Are there any results on this problem? I think I have a similar issue. I have an outer loop which is distributed across the threads and thread blocks, and an inner loop which should be executed serially by each thread:

#pragma acc region
{
  #pragma acc for parallel vector(32)
  for (i = 0; i < n; i++) {
    // do something
    for (j = 0; j < m; j++) { /* do something */ }
  }
}

Now I am trying to unroll the inner loop, but my unroll directives are always ignored. I tried:

  1. #pragma acc seq unroll(4)
  2. #pragma unroll(4)

What is the problem? If I manually unroll the inner loop, I get a nice speedup…
Bye, Sandra
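As an illustration of the manual unrolling mentioned above, here is a minimal C sketch of unrolling a serial inner loop by 4 with a clean-up loop for the leftover iterations. The function name and the summation body are hypothetical (the thread does not show the real loop body); only the unrolling pattern itself is the point.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the m elements of one row, with the loop unrolled by a factor of 4.
   (Hypothetical body: the real "do something" is not shown in the thread.) */
double row_sum_unroll4(const double *row, size_t m)
{
    assert(m == 0 || row != NULL);

    double sum = 0.0;
    size_t j = 0;

    /* Main loop: four original iterations per pass, in the original order,
       so the result is identical to the rolled loop. */
    for (; j + 4 <= m; j += 4) {
        sum += row[j];
        sum += row[j + 1];
        sum += row[j + 2];
        sum += row[j + 3];
    }

    /* Clean-up loop: handles the 0 to 3 iterations left when m is not
       a multiple of 4. */
    for (; j < m; j++) {
        sum += row[j];
    }

    return sum;
}
```

In the accelerator case this function body would play the role of the inner j loop executed serially by each thread. A single accumulator is used on purpose so that the floating-point summation order does not change; splitting it into several accumulators may expose more instruction-level parallelism but can alter rounding.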

Any news on this issue? It is really tedious to do the unrolling manually :-(

Hi Sandra,

Let me ping Michael again. Both he and I have been travelling a lot over the last few weeks, so we are behind on our responses.

Best Regards,
Mat

Hi all,
I am experiencing a similar problem. I would like to force the unrolling of an inner loop, but all of my attempts seem to be ignored.

What is the right way to do it?


Thanks in advance,

Enrico

Update:

Looking at the GPU/PTX code using the “keep” option, I noticed that the loops I would like to unroll are in fact unrolled implicitly, regardless of whether I use the unroll keyword, which seems to be ignored.

The loops are not completely unrolled, only by a factor of 4. Is this a predefined behaviour? Is there any way to control it?


Thanks in advance,

Enrico