Do concurrent with gpu or multicore

I am trying to add do concurrent to my code, in order to see if I can gain some speed using the GPU. However, when I compile the code using -acc=gpu -stdpar=gpu (or multicore), I get different numbers than without those options.

The part of my code that I am trying to change, with the original loops left in as comments, is:

     do concurrent(is=1:3,js=1:3,jz=1:4,kz=1:4)

! do is=1,3
! do js=1,3
! do jz=1,4
! do kz=1,4
f2b(kz,jz,this%pairs(i,j))=f2b(kz,jz,this%pairs(i,j))+this%f2st(is,js,this%pairs(i,j))* &
     (spx(kz,3*is+4,i)*spx(jz,3*js+4,j) &
     +spx(kz,3*is+5,i)*spx(jz,3*js+5,j) &
     +spx(kz,3*is+6,i)*spx(jz,3*js+6,j)) &
     +this%f2s(is,js,this%pairs(i,j))*spx(kz,is,i)*spx(jz,js,j)
! enddo
! enddo
! enddo
enddo

Everything outside those loops is irrelevant for now.

Why is compiling with gpu or multicore giving me different numbers?

Thanks,
Stefano

Hi Stefano,

This code appears to have a race condition which will prevent it from being parallelized.

Here multiple threads can be reading and writing the same elements of f2b. Since the values of "i" and "j" are invariant within the loop, the only independent indices for f2b are "kz" and "jz". Different "is" and "js" iterations may therefore access the same f2b element at the same time.

The order in which the accesses occur is non-deterministic, and at the extreme you can have a case where all the threads read from the same index of f2b, do the computation, and then write it back. In that case they would all be reading the same stale value. Your algorithm requires one thread to complete its update of f2b before another thread reads the value.
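To make the hazard concrete, here is a minimal, self-contained sketch (my own toy example, not your code) where every iteration updates the same memory location, just as every "is"/"js" pair updates the same f2b(kz,jz,this%pairs(i,j)) element above:

   program race_sketch
     implicit none
     integer :: is, js
     real :: acc
     acc = 0.0
     ! Nonconforming: every iteration reads and writes the same variable,
     ! so when the loop runs in parallel (-stdpar=gpu or multicore) the
     ! read-modify-write sequences can interleave and increments get lost.
     do concurrent (is=1:3, js=1:3)
        acc = acc + 1.0
     end do
     print *, 'expected 9.0, got ', acc
   end program race_sketch

Run sequentially this prints 9.0; run in parallel the result can be anything from 1.0 to 9.0, which is the same kind of nondeterminism you are seeing.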

One question: is "this%pairs(i,j)" a typo? Should the indices be "is"/"js" instead? If so, and the indices returned by "this%pairs(is,js)" are unique, then this code would be parallelizable.
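If that is the case, the loop might look like the following. This is purely hypothetical: it assumes every occurrence of "this%pairs(i,j)" inside the loop was meant to be "this%pairs(is,js)" and that each (is,js) pair maps to a distinct index, so no two iterations touch the same f2b element:

   ! Hypothetical rewrite: valid only if pairs(is,js) is unique per (is,js)
   do concurrent(is=1:3,js=1:3,jz=1:4,kz=1:4)
      f2b(kz,jz,this%pairs(is,js)) = f2b(kz,jz,this%pairs(is,js)) &
           + this%f2st(is,js,this%pairs(is,js)) * &
             (spx(kz,3*is+4,i)*spx(jz,3*js+4,j) &
             +spx(kz,3*is+5,i)*spx(jz,3*js+5,j) &
             +spx(kz,3*is+6,i)*spx(jz,3*js+6,j)) &
           + this%f2s(is,js,this%pairs(is,js))*spx(kz,is,i)*spx(jz,js,j)
   enddo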

Alternatively, you can parallelize just the "jz" and "kz" loops, so switching the loop order and applying do concurrent only to them should be fine as well. Something like:

   do concurrent(jz=1:4,kz=1:4)
! do jz=1,4
! do kz=1,4
      do is=1,3
         do js=1,3
            f2b(kz,jz,this%pairs(i,j))=f2b(kz,jz,this%pairs(i,j))+this%f2st(is,js,this%pairs(i,j))* &
                 (spx(kz,3*is+4,i)*spx(jz,3*js+4,j) &
                 +spx(kz,3*is+5,i)*spx(jz,3*js+5,j) &
                 +spx(kz,3*is+6,i)*spx(jz,3*js+6,j)) &
                 +this%f2s(is,js,this%pairs(i,j))*spx(kz,is,i)*spx(jz,js,j)
         enddo
      enddo
! enddo
   enddo

You can then extend this to parallelize the "is" and "js" loops using a reduction. Something like:

   do concurrent(jz=1:4,kz=1:4)
      f2bsum = f2b(kz,jz,this%pairs(i,j))
      do concurrent(is=1:3,js=1:3) reduce(+:f2bsum)
         f2bsum = f2bsum + this%f2st(is,js,this%pairs(i,j))* &
              (spx(kz,3*is+4,i)*spx(jz,3*js+4,j) &
              +spx(kz,3*is+5,i)*spx(jz,3*js+5,j) &
              +spx(kz,3*is+6,i)*spx(jz,3*js+6,j)) &
              +this%f2s(is,js,this%pairs(i,j))*spx(kz,is,i)*spx(jz,js,j)
      enddo
      f2b(kz,jz,this%pairs(i,j)) = f2bsum
   enddo

Note that due to the small loop bounds it's unlikely you'll see much performance improvement, since the time to copy the data to the device will likely dominate. It should be better with larger loops, though.

Thank you very much, it works!

About the small loops, I am afraid you are right. However, those are inner loops, with two other loops outside that are of order N*N, with N=10 to 20. So hopefully I'll see some improvement.
If not, I’ll have to find some other way to rewrite my code, but that will take much more effort.

I have a few other questions about do concurrent.
Does it make any difference to have multiple do concurrent statements or just one? For example:
do concurrent(ks=1:4)
do concurrent(js=1:4)

or

 do concurrent(ks=1:4,js=1:4)

And also, does the order matter, i.e. ks=1:4,js=1:4 vs js=1:4,ks=1:4?

It changes how the parallelism is distributed.

"do concurrent(ks=1:4,js=1:4)" collapses the loops into a single iteration space and gives the compiler more freedom in how to distribute the iterations across blocks and threads.

do concurrent(ks=1:4)
do concurrent(js=1:4)

Here you're explicitly splitting the parallelism between blocks and threads.

Typically collapsing is more performant, but splitting can be useful in some cases, especially if there is additional code between the two loops.
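For example, a sketch (my own illustration, with hypothetical pure routines setup_per_ks and inner_body) of the split form where extra work sits between the two levels:

   do concurrent (ks = 1:4)          ! outer level of parallelism
      call setup_per_ks(ks)          ! hypothetical per-ks work between the loops
      do concurrent (js = 1:4)       ! inner level of parallelism
         call inner_body(ks, js)     ! hypothetical innermost work
      end do
   end do

Keep in mind that any procedure called inside a do concurrent must be pure.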

And also, does the order matter, i.e. ks=1:4,js=1:4 vs js=1:4,ks=1:4?

It may depend.

Ideally a program maps the stride-1 dimension of an array, i.e. the contiguous memory, to the thread-level parallelism. Since Fortran is column-major, this is the first dimension.

The compiler will attempt to do this, and will even interchange loops if the wrong order is given. However, I don't know if it can do this in all cases, so I'd probably put the first index, "kz", last just to be safe.
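As a sketch of that advice, assuming a simple 2-D array a(nk,nj) whose first (stride-1) dimension corresponds to kz:

   ! kz indexes the contiguous dimension of a, so it is listed last in the
   ! do concurrent header so adjacent parallel iterations touch adjacent memory.
   do concurrent (jz = 1:nj, kz = 1:nk)
      a(kz, jz) = 2.0 * a(kz, jz)
   end do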