# the parallelization of two simple loops

I’m playing around with two different simple loops:

``````c\$acc region
do i=0,32
do j=0,32
den(i,j)=den(i,j)+1
enddo
enddo
c\$acc end region
``````

and

``````c\$acc region
do i=0,32
do j=0,32
do k=0,1024
den(i,j)=den(i,j)+1
enddo
enddo
enddo
c\$acc end region
``````

The first parallelizes, the second, which has an additional outer loop, doesn’t:
18, Complex loop carried dependence of ‘den’ prevents parallelization
Loop carried dependence of ‘den’ prevents parallelization
Loop carried backward dependence of ‘den’ prevents vectorization

No matter the order the compiler complains of the loop over k. I just want to make sure I understand this correctly:

Each thread in the first case writes to a unique element of den(i,j), so we are ok. In the second situation, each thread over the k loop (if we forced it to parallelize) will be writing to the same den(i,j), thus introducing the problem of race conditions.

So how do you fix the problem? Either A) add a k component to den, and then perform a reduction back to 2d or B) privatize den and then somehow sum all the private arrays at the end.

I’m not sure if B is possible or how to do it but I know A works because Mat showed me in another thread. But what if I don’t want to deal with a large sparse array necesary in A? Is there a way to avoid this? B seems like it would be more efficient if possible.

Thanks,
Ben

Hi Ben,

Each thread in the first case writes to a unique element of den(i,j), so we are ok. In the second situation, each thread over the k loop (if we forced it to parallelize) will be writing to the same den(i,j), thus introducing the problem of race conditions.

Correct. The “i” and “j” loops should still accelerate but if you accelerated the “k” loop you’d end up with a race condition.

So how do you fix the problem? Either A) add a k component to den, and then perform a reduction back to 2d or B) privatize den and then somehow sum all the private arrays at the end.

These are essentially the same thing. In A you’ve manually privatized it while in B, you’d let compiler do it for you. In either case, you’d still be using up a lot of memory.

One thing you can try is to use OpenACC with a loop vector reduction directive on the inner loop and collapse the outer two loops into a 2-D gang. The caveat is that there is overhead in performing a parallel reduction so is only beneficial if the inner loop is a lot larger than the outer loops (which it is here) and there’s enough work in the inner loop to offset the overhead (which there isn’t here). Also note that reductions are only allowed on scalars, so you’d need to initialize a temp variable to den(i,j), perform the scalar reduction, then copy the new value back to den(i,j).

Something like:

``````c\$acc parallel loop collapse(2) gang
do i=0,32
do j=0,32
tmp = den(i,j)
c\$acc loop vector reduction(+:tmp)
do k=0,1024
tmp=tmp+1
enddo
den(i,j) = tmp
enddo
enddo
``````

Hope this helps,
Mat