Loop carried reuse prevents parallelization

As I’m trying to learn to rewire my brain for parallel thinking, I’ve been trying various things to reduce the number of “loop carried dependence”, “loop carried reuse” and other issues reported by -Minfo=accel. One particular loop has been stymieing me, so I’m coming here to try and figure it out.

To wit, the loop:

217       do i=1,m
218        do k=0,np
219         fsdir(i)=tda(i,k,2)
220        enddo
221       enddo

where those are line numbers, not statement labels.

By the time the code gets to here, tda has been constructed, and fsdir has not appeared anywhere else (and never does again). Also, tda(:,:,:) is local to the whole !$acc region and fsdir(:) is copyout.

When the compiler gets here it says:

    217, Loop is parallelizable
    218, Loop carried reuse of fsdir prevents parallelization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
        217, !$acc do parallel, vector(256)
             Using register for 'fsdir'
        218, !$acc do seq

I guess I’m confused as to why this is not scheduled as parallel, vector(16) on both loops, as I’m used to seeing in cases like this. Is it because fsdir(:) is a copyout array and as such has internal restrictions regarding memory layout or the like? (And, of course, it may be that this schedule is faster than the 16x16 method; I’m just wondering about that ‘loop carried reuse’ issue.)

Hi Matt,

The outer “i” loop is being parallelized. However, the inner loop is not, because every iteration of the k loop assigns to the same element of fsdir (i.e. loop carried reuse). So if the k loop were parallelized, all “k” threads would be trying to write their values to the same location, leading to non-deterministic results. To parallelize the k loop, you’ll need to make fsdir a two-dimensional array.
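To make the reuse concrete, here is a minimal sketch in plain Fortran with no accelerator directives. The sizes m=4 and np=3, the fill values, and the names fsdir_last and fsdir2d are assumptions for illustration only and are not from the original code. It shows that, run sequentially, the posted loop simply leaves tda(i,np,2) in fsdir(i), and that giving fsdir a second dimension lets each (i,k) iteration write to its own element.

```fortran
program reuse_demo
  implicit none
  integer, parameter :: m = 4, np = 3   ! small illustrative sizes (assumed)
  real :: tda(m, 0:np, 2)
  real :: fsdir(m)
  real :: fsdir_last(m)                 ! hypothetical: result of the single-assignment form
  real :: fsdir2d(m, 0:np)              ! hypothetical: two-dimensional variant
  integer :: i, k

  ! Fill tda with recognizable values so each (i,k) pair is distinct
  do k = 0, np
    do i = 1, m
      tda(i, k, 1) = 0.0
      tda(i, k, 2) = real(10*i + k)
    enddo
  enddo

  ! Loop as posted: every k iteration overwrites the same fsdir(i)
  do i = 1, m
    do k = 0, np
      fsdir(i) = tda(i, k, 2)
    enddo
  enddo

  ! Run sequentially, only the last k iteration survives
  do i = 1, m
    fsdir_last(i) = tda(i, np, 2)
  enddo

  ! Two-dimensional variant: each (i,k) iteration owns its own element,
  ! so no two iterations write to the same location
  do i = 1, m
    do k = 0, np
      fsdir2d(i, k) = tda(i, k, 2)
    enddo
  enddo

  ! Both checks should print T
  print *, all(fsdir == fsdir_last)
  print *, all(fsdir2d(:, np) == fsdir)
end program reuse_demo
```

If the last value really is all that is wanted, the single-assignment form removes the inner loop entirely, which sidesteps the question of how to schedule it.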

Note that we are working on adding support for reductions within accelerator regions. My guess is that your code is more like “fsdir(i) = fsdir(i) + tda(i,k,2)”, in which case we should be able to parallelize the inner loop once this support has been added.
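If the real code is the summing form guessed at above, one way to write it in the meantime is to accumulate into a scalar inside the sequential k loop and store to fsdir(i) once per i. This is only a sketch in plain Fortran with no accelerator directives: the sizes, the fill values, and the name total are assumptions, and whether the compiler schedules it any differently is not something this example demonstrates.

```fortran
program sum_demo
  implicit none
  integer, parameter :: m = 4, np = 3   ! small illustrative sizes (assumed)
  real :: tda(m, 0:np, 2)
  real :: fsdir(m)
  real :: total                         ! hypothetical per-i scalar accumulator
  integer :: i, k

  ! Small integer-valued data so the sums are exact in single precision
  do k = 0, np
    do i = 1, m
      tda(i, k, 1) = 0.0
      tda(i, k, 2) = real(i + k)
    enddo
  enddo

  ! Outer loop iterations are independent; the inner loop is a sum over k
  do i = 1, m
    total = 0.0
    do k = 0, np
      total = total + tda(i, k, 2)
    enddo
    fsdir(i) = total                    ! one store per i
  enddo

  ! Compare against the SUM intrinsic over the k dimension; should print T
  print *, all(fsdir == sum(tda(:, :, 2), dim=2))
end program sum_demo
```

The i iterations stay independent of one another, so the outer loop remains a candidate for parallelization while the sum over k runs sequentially within each i.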

  • Mat

You know, you are right and that is actually what the code is doing (in some ways). Guess I’ve found a place to redo a bit of coding!