GPU-enabling a loop

Hello everyone,

I have a piece of code that I have to accelerate using the GPU. There is a loop similar to the one shown below. I’ve put in !$acc markers to execute it on the GPU.

!$acc region do local(n, nbl, ijk)
	  do nbl=1,nblcks
	    do n=k,l
	      ijk=ijkpr(n)
	      vvect(ijk,1)=vvect(ijk,1)*rrnorm
	    enddo
	  enddo
!$acc end region

Unfortunately, during compilation I get the following error:

56, No Parallel kernels found, acceleration region ignored.

It seems to be a problem with the ijk variable, which is used to access the vvect array elements in consecutive iterations. The index is not sequential but comes from the ijkpr lookup array.

How can I change this loop to make it GPU-enabled?

Thanks for any ideas!
Jacek

Hi Jacek,

In order to parallelize your code, you need to make sure that all iterations of the loops are independent. In your case, ‘vvect’ has two dependencies.

The first is that you are using a computed index, ijk. If any of the values of ijk are the same for multiple threads, then the value of vvect(ijk,1) will depend on the order in which the threads are run. Since threads can run in any order, your results will be non-deterministic. If you know that all values of ijk are unique, then you can use the “independent” clause to override the compiler’s dependency analysis and have the loop parallelized anyway.
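If you are unsure whether your index array really contains only unique values, you can check it once on the host before committing to “independent”. A minimal sketch in C (the array contents below are made up for illustration):

```c
#include <stdlib.h>

/* Return 1 if every value in idx[0..n-1] (each in [0, range)) is unique,
   0 if any value repeats. A repeated value means two threads would write
   the same output element, i.e. "independent" would be unsafe. */
static int indices_unique(const int *idx, int n, int range)
{
    char *seen = calloc((size_t)range, 1);
    int unique = 1;
    for (int i = 0; i < n; i++) {
        if (seen[idx[i]]) { unique = 0; break; }
        seen[idx[i]] = 1;
    }
    free(seen);
    return unique;
}
```

For example, `{3, 1, 4, 0, 2}` passes the check while `{3, 1, 4, 1, 2}` fails it, because index 1 appears twice.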

However, even if all the values of ijkpr are unique, every iteration of the outer nbl loop walks the same range of ijkpr, so you will definitely have multiple threads accessing the same elements of vvect. Again, the value of vvect will change from run to run.

Now you may think that, due to the associativity of your calculation, the order in which vvect*rrnorm is performed doesn’t really matter. The problem here is that this is not an atomic operation. Hence, you could see the following scenario:

Thread 1 reads vvect(1,1) from memory
Thread 2 reads vvect(1,1) from memory
Thread 2 calculates vvect(1,1)*rrnorm
Thread 2 writes the result back to memory
Thread 1 calculates vvect(1,1)*rrnorm
Thread 1 writes the result back to memory

Thread 1’s write overwrites Thread 2’s result, so vvect(1,1) ends up multiplied by rrnorm only once instead of twice.
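To see the lost update concretely, here is the same interleaving replayed sequentially in C (a simulation of the schedule above, not actual threaded code):

```c
/* Replay the six steps above in order: both "threads" read the old
   value before either writes, so thread 1's write clobbers thread 2's. */
static double replay_lost_update(double vvect, double rrnorm)
{
    double t1 = vvect;      /* Thread 1 reads vvect(1,1)          */
    double t2 = vvect;      /* Thread 2 reads vvect(1,1)          */
    t2 = t2 * rrnorm;       /* Thread 2 calculates                */
    vvect = t2;             /* Thread 2 writes back               */
    t1 = t1 * rrnorm;       /* Thread 1 calculates                */
    vvect = t1;             /* Thread 1 writes back (update lost) */
    return vvect;
}
```

Starting from 10.0 with rrnorm = 0.5, the correct result after two multiplies is 2.5, but this schedule leaves 5.0: one multiply has vanished.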

All hope is not lost, though. Assuming you have enough computation, what you can do is manually privatize vvect (i.e. create another array with an additional dimension indexed by the parallel loop). You then need to perform a reduction to populate the original vvect array. Something like:

!$acc region copy(vvectTMP)
!$acc do kernel
     do nbl=1,nblcks
       do n=k,l
         ijk=ijkpr(n)
         vvectTMP(nbl,ijk,1)=vvectTMP(nbl,ijk,1)*rrnorm
       enddo
     enddo
!$acc end region

     do nbl=1,nblcks
       do n=k,l
         ijk=ijkpr(n)
         vvect(ijk,1) = vvect(ijk,1)+vvectTMP(nbl,ijk,1)
       enddo
     enddo
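The same privatize-then-reduce pattern can be sketched in C. The sizes and the use of a sum here are illustrative only; the point is that each block writes its own private row, and a sequential pass combines the rows afterwards:

```c
#include <string.h>

#define NBLKS 4   /* stand-in for nblcks           */
#define NIDX  6   /* stand-in for the n loop range */
#define LEN   5   /* stand-in for vvect's extent   */

/* Each nbl accumulates into its own private row tmp[nbl][*], so the
   nbl loop is free of races and could run in parallel. The second
   loop nest then reduces the rows into the shared array sequentially. */
static void private_then_reduce(const int idx[NIDX], double out[LEN])
{
    double tmp[NBLKS][LEN];
    memset(tmp, 0, sizeof tmp);

    for (int nbl = 0; nbl < NBLKS; nbl++)      /* parallelizable */
        for (int n = 0; n < NIDX; n++)
            tmp[nbl][idx[n]] += 1.0;           /* private row: no race */

    for (int nbl = 0; nbl < NBLKS; nbl++)      /* sequential reduction */
        for (int ijk = 0; ijk < LEN; ijk++)
            out[ijk] += tmp[nbl][ijk];
}
```

The extra dimension costs memory proportional to the number of parallel blocks, which is the trade-off Mat mentions.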

On a side note, you do not need to use “local(n, nbl, ijk)”. Scalars are privatized (i.e. each thread gets its own copy) by default. “local” is only needed for arrays that are local to the device.

Hope this helps,
Mat

Mat, thanks for your clear message.

There are some problems, though.
Firstly, I cannot afford to double the loop. In your solution, you’ve split the calculations between the GPU and the CPU. The CPU part is similar in size to the original loop, so the acceleration will be too small (if any).
Additionally, creating a temporary array will consume a lot of memory and probably also some CPU time. I’m fighting for every millisecond here. ;)

I also tried the “independent” option. Assuming that ijkpr() has unique values, for this loop it increased the calculation time almost 20 times (strange!).

!$acc region local(n, ijk, nbl)
!$acc do independent
	  do nbl=1,nblcks
	    do n=k,l
	      ijk=ijkpr(n)
	      vvect(ijk,1)=vvect(ijk,1)*rrnorm	      
	    enddo
	  enddo
!$acc end region

Meanwhile, this code’s execution time decreased with ‘independent’:

!$acc region local(n, ijk, nbl)
!$acc do independent
	  do nbl=1,nblcks
	    do n=k,l
	      ijk=ijkpr(n)
	      vvect(n,1)=vvect(n,1)*rrnorm
	    enddo
	  enddo
!$acc end region

I was wondering whether it is possible to somehow split the ijkpr() array and in this way make the threads independent?

Any other ideas?

Firstly, I cannot afford to double the loop. In your solution, you’ve split the calculations between the GPU and the CPU. The CPU part is similar in size to the original loop, so the acceleration will be too small (if any).
Additionally, creating a temporary array will consume a lot of memory and probably also some CPU time. I’m fighting for every millisecond here. ;)

I understand, but getting wrong answers very fast doesn’t help you either.

I was wondering whether it is possible to somehow split the ijkpr() array and in this way make the threads independent?

The problem with using ijkpr, besides not being able to guarantee independence, is that presumably its elements are not in consecutive order. This causes non-contiguous memory access (memory divergence) and slows down your code. If you can use ‘n’ for vvect’s index and remove ijkpr, then you will improve memory access.

!$acc region local(n, ijk, nbl)
!$acc do independent
     do nbl=1,nblcks
       do n=k,l
         ijk=ijkpr(n)
         vvect(n,1)=vvect(n,1)*rrnorm
       enddo
     enddo
!$acc end region
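The difference between the two access patterns can be sketched in C. In the first routine, neighboring iterations touch neighboring addresses (coalesced on a GPU); the second gathers through an index array, so neighboring iterations may hit scattered addresses (illustrative only, no timing shown):

```c
/* Contiguous: iteration i touches a[i], so neighboring threads read
   neighboring addresses and loads coalesce. */
static void scale_contiguous(double *a, double s, int n)
{
    for (int i = 0; i < n; i++)
        a[i] *= s;
}

/* Gathered: iteration i touches a[idx[i]]; if idx is not consecutive,
   neighboring threads hit scattered addresses (memory divergence). */
static void scale_gathered(double *a, const int *idx, double s, int n)
{
    for (int i = 0; i < n; i++)
        a[idx[i]] *= s;
}
```

When idx is a permutation of 0..n-1, both routines produce the same values; only the access pattern, and therefore the speed, differs.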

Assuming your algorithm allows it, using ‘n’ instead of ijk will help. However, your code is still not parallel, since multiple iterations of the ‘nbl’ loop access the same elements of vvect. What you can do is interchange the n and nbl loops and then run nbl sequentially.

!$acc region 
!$acc do kernel
     do n=k,l
       do nbl=1,nblcks
         vvect(n,1)=vvect(n,1)*rrnorm
       enddo
     enddo
!$acc end region
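A quick sequential check in C (toy sizes and made-up data) that the interchange preserves the result: since each vvect(n,1) is simply scaled nblcks times, the order of the two loops does not change the final values:

```c
#define NBLKS 3
#define NVAL  4

/* Original order: nbl outer, n inner. Each element ends up
   multiplied by r exactly NBLKS times. */
static void scale_nbl_outer(double v[NVAL], double r)
{
    for (int nbl = 0; nbl < NBLKS; nbl++)
        for (int n = 0; n < NVAL; n++)
            v[n] *= r;
}

/* Interchanged order: n outer (parallelizable, each thread owns one
   element), nbl inner (run sequentially by that thread). */
static void scale_n_outer(double v[NVAL], double r)
{
    for (int n = 0; n < NVAL; n++)
        for (int nbl = 0; nbl < NBLKS; nbl++)
            v[n] *= r;
}
```

After the interchange, each parallel iteration of n owns exactly one element of v, which is why no two threads ever write the same location.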
Mat