As I learn to rewire my brain for parallel thinking, I’ve been trying various things to reduce the number of “loop carried dependence”, “loop carried reuse” and other issues reported by -Minfo=accel. One particular loop has been stymieing me, so I’m coming here to try to figure it out.
To wit, the loop:
217 do i=1,m
218    do k=0,np
219       fsdir(i)=tda(i,k,2)
220    enddo
221 enddo
where those are line numbers, not statement labels.
By the time the code gets here, tda has been constructed, and fsdir has not appeared anywhere earlier (nor does it appear again afterwards). Also, tda(:,:,:) is local to the whole !$acc region and fsdir(:) is copyout.
When the compiler gets here it says:
217, Loop is parallelizable
218, Loop carried reuse of fsdir prevents parallelization
     Inner sequential loop scheduled on accelerator
     Accelerator kernel generated
    217, !$acc do parallel, vector(256)
         Using register for 'fsdir'
    218, !$acc do seq
I guess I’m confused as to why both loops are not scheduled as parallel, vector(16), which is what I’m used to seeing in cases like this. Is it because fsdir(:) is a copyout array and as such has internal restrictions regarding memory layout or the like? (And, of course, it may be that this schedule is faster than the 16x16 method; I’m just wondering about that ‘loop carried reuse’ issue.)
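To make the access pattern concrete, here is a minimal standalone sketch of the loop, with no accelerator directives. The sizes m = 4 and np = 3 and the fill values for tda are made up for illustration and are not from the real code. It only shows that every k iteration writes the same element fsdir(i), so just the k = np value survives; that repeated write to one element may be what the "loop carried reuse" message refers to, though that is a guess.

```fortran
program reuse_sketch
  implicit none
  ! Hypothetical sizes; the real m and np are not given in the post.
  integer, parameter :: m = 4, np = 3
  real :: tda(m, 0:np, 2), fsdir(m)
  integer :: i, k

  ! Arbitrary fill so that each (i,k) entry is distinguishable.
  do k = 0, np
     do i = 1, m
        tda(i, k, 1) = 0.0
        tda(i, k, 2) = real(10*i + k)
     end do
  end do

  ! The loop from the post: each k iteration overwrites fsdir(i).
  do i = 1, m
     do k = 0, np
        fsdir(i) = tda(i, k, 2)
     end do
  end do

  ! Only the k = np value survives, so this prints T.
  print *, all(fsdir == tda(:, np, 2))
end program reuse_sketch
```

If that really is the intent, the inner loop does the same work as the single assignment fsdir(i) = tda(i,np,2), which leaves nothing in k to parallelize.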