# Reduction of an array inside the parallel region

Hello everyone,

I need to reduce some arrays inside a parallel region, but OpenACC does not support reductions on arrays. Below is an example of what I am trying to achieve. I thought about reducing to a scalar, but then I can't store the results back to the arrays without a race condition. I also tried the atomic clause, but I can't write to an allocatable variable.

``````
!$acc parallel loop collapse(2) independent private(Y)
Do L=1,Nt                  ! External loop for angles
  Do K=1,Nc                ! External loop for velocities

    ! Boundary conditions
    Do i=1,Nx
      Y(i,1)=..
    End do
    Do i=2,Ny
      Y(1,i)=..
    End do
    ! Calculation of Y
    Do I=2,Nx
      Do J=2,Ny
        Y(I,J)=(-Y(I-1,J-1)*Td00-Y(I,J-1)*Td10-Y(I-1,J)*Td01+CoFy)/Td11
      End do
    End do
    ! Sum reduction
    Do i=1,Nx
      Do j=1,Ny
        Ux(i,j)=Ux(i,j)+Y(i,j)
        Uy(i,j)=Uy(i,j)+Y(i,j)
      End do
    End do

  End do
End do
``````

Hi Alexander,

Since “Y” is private, I’m assuming you’re asking about the “Ux” and “Uy” arrays. What’s the issue you’re having with atomics? Using an atomic update would be the solution.

``````
Do i=1,Nx
  Do j=1,Ny
!$acc atomic update
    Ux(i,j)=Ux(i,j)+Y(i,j)
!$acc atomic update
    Uy(i,j)=Uy(i,j)+Y(i,j)
  End do
End do
``````

You are correct that atomics can only be used on scalars, but “Ux(i,j)” and “Uy(i,j)” are considered scalars in this context since they are elements of an allocatable array, not the array itself.

-Mat

Hi Mat, and thank you for your rapid response.

It is for the Ux and Uy arrays.

If I use the atomic as you posted I get the following error:

``````
PGF90-S-0000-Internal compiler error. Error: Detected unexpected atomic store opcode. 365 (2D_cmplx_1.f90: 180)
PGF90-S-0155-Invalid atomic expression (2D_cmplx_1.f90: 180)
PGF90-S-0155-Invalid atomic region. (2D_cmplx_1.f90: 180)
``````

Do you have any idea why this is happening?

Thank you

Are the Ux and Uy arrays double-precision complex? If so, that’s the problem. Due to a lack of hardware support, we don’t support double-precision complex with atomics. Single-precision complex should be fine with PGI 17.4 and later.
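
If you need to stay in double precision, one possible workaround (just a sketch; the “Ux_re”/“Ux_im” arrays are hypothetical, not from your code) is to carry the real and imaginary parts in separate double-precision real arrays, since real double-precision atomics are supported:

``````
! Sketch, assuming Ux_re/Ux_im are real(8) arrays holding the parts of Ux.
! Double-precision REAL atomics are supported, so update each part
! separately instead of using a single complex(8) atomic.
!$acc atomic update
Ux_re(i,j) = Ux_re(i,j) + real(Y(i,j), kind=8)
!$acc end atomic
!$acc atomic update
Ux_im(i,j) = Ux_im(i,j) + aimag(Y(i,j))
!$acc end atomic
``````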

-Mat

Yes, they are. However, even if I change all my variables to single precision, I still get these errors:

``````
PGF90-S-0155-Invalid atomic expression (2D_cmplx_1.f90: 186)
PGF90-S-0155-Invalid atomic region. (2D_cmplx_1.f90: 186)
``````

I use PGI 17.4-Community edition.

Alex

Well, that should work. I just tried something similar and it worked for me. Can you write up a small reproducer?

``````
% cat test.f90
program test_complex_atomic_update
  implicit none

  integer, parameter :: nlen=10
  complex(kind=kind(0.0)), dimension(nlen,nlen) :: arr

  integer i,j,k
  complex(kind=kind(0.0)) :: var

  var=(1.,1.)
  arr=(0.,0.)
!$acc parallel loop gang vector copy(arr)
  do k=1,100
    do i=1,nlen
      do j=1,nlen
!$acc atomic update
        arr(i,j)=arr(i,j)+var
!$acc end atomic
      enddo
    enddo
  enddo
  print *, arr
end program test_complex_atomic_update

% pgfortran -ta=tesla:cc60 -Minfo=accel test.f90 -V17.4
test_complex_atomic_update:
     12, Generating copy(arr(:,:))
         Accelerator kernel generated
         Generating Tesla code
         13, !$acc loop gang, vector(100) ! blockidx%x threadidx%x
         14, !$acc loop seq
         15, !$acc loop seq
     14, Loop is parallelizable
     15, Loop is parallelizable
``````

Mat

Your test case gave the same result for me, so the problem was in my own code: another atomic construct used a different statement form. Instead of the supported form

x = x operator expr

I had written

x = x operator expr operator expr

which is invalid.
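
For reference, a minimal sketch of the difference (the variable names here are illustrative, not from my actual code). An OpenACC atomic update must have the form `x = x operator expr`, where `expr` does not reference `x`; folding the extra term into a temporary first makes the statement conform:

``````
! Invalid: two top-level operators, so the statement is not of the
! form "x = x op expr" and the compiler reports
! "Invalid atomic expression":
!   !$acc atomic update
!   s = s + a(i) + b(i)

! Valid: combine the extra terms into one expression first
t = a(i) + b(i)
!$acc atomic update
s = s + t
!$acc end atomic
``````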

1. Can’t I have private arrays that are created only on the device?
2. I was thinking about creating a data region with all the arrays around the do-while iteration and then doing all the calculations in parallel. Do you have a suggestion or a recommendation about the code below?

Thanks again for the assistance

``````
!$acc data copyin(Ux(1:Nx,1:Ny),Uy(1:Nx,1:Ny),Uxpr(1:Nx,1:Ny),Uypr(1:Nx,1:Ny),reler_r(1:Nx,1:Ny))

6   Continue
    Iter=Iter+1

    Do i=1,Nx
      Do j=1,Ny
        Uxpr(i,j)=Ux(i,j); Uypr(i,j)=Uy(i,j)
      End do
    End do

!$acc parallel loop collapse(2) independent private(Y) present(phi,W1,c,Ux,Uy,Dns,Tmp,qx,qy)
    Do L=1,Nt            ! External loop for angles
      Do K=1,Nc          ! External loop for velocities

        ! Boundary conditions
!$acc loop
        Do i=1,Nx
          Y(i,1)=..
        End do
!$acc loop
        Do i=2,Ny
          Y(1,i)=..
        End do
        ! Calculation of Y (must be private)
        Do I=2,Nx
          Do J=2,Ny
            Y(I,J)=(-Y(I-1,J-1)*Td00-Y(I,J-1)*Td10-Y(I-1,J)*Td01+CoFy)/Td11
          End do
        End do
!$acc loop
        Do i=1,Nx
          Do j=1,Ny
!$acc atomic update
            Ux(i,j)=Ux(i,j)+Y(i,j)
!$acc end atomic
            Uy(i,j)=Uy(i,j)+Y(i,j)
          End do
        End do

      End do
    End do

!$acc parallel loop collapse(2) independent
    Do I=1,Nx
      Do J=1,Ny
        Reler(I,J)=abs(Ux(I,J)-Uxpr(I,J))+abs(Uy(I,J)-Uypr(I,J))
      End do
    End do
!$acc end parallel loop
    Relmaxr=MAXVAL(Reler_r)

    If(Relmaxr.gt.Error) Goto 6

!$acc end data
``````

Hi Alex,

1. Can’t I have private arrays that are created only for the device?

Yes, via the “private” clause as you have done here with the “Y” array.

2. I was thinking about creating a data region with all the arrays inside the do while iteration and then do all the calculations in parallel. Do you have a suggestion or a recommendation about the code below?

This should work. Though, I’d put “Uxpr” and “Uypr” in a “create” clause and then put their initialization loop in a compute construct. That way you don’t need to move them from the host to the device.

You’ll need to update “Reler” to the host in order to use it in MAXVAL. Though, why not just do the max reduction on the device and get rid of “Reler” altogether?

Also, I think you meant to put “Uy” in an atomic update as well.

Something like:

``````
!$acc data copyin(Ux(1:Nx,1:Ny),Uy(1:Nx,1:Ny)) create(Uxpr(1:Nx,1:Ny),Uypr(1:Nx,1:Ny))

6   Continue
    Iter=Iter+1

!$acc kernels loop collapse(2)
    Do j=1,Ny
      Do i=1,Nx
        Uxpr(i,j)=Ux(i,j); Uypr(i,j)=Uy(i,j)
      End do
    End do

!$acc parallel loop collapse(2) independent private(Y) present(phi,W1,c,Ux,Uy,Dns,Tmp,qx,qy)
    Do L=1,Nt            ! External loop for angles
      Do K=1,Nc          ! External loop for velocities

        ! Boundary conditions
!$acc loop
        Do i=1,Nx
          Y(i,1)=..
        End do
!$acc loop
        Do i=2,Ny
          Y(1,i)=..
        End do
        ! Calculation of Y (must be private)
        Do I=2,Nx
          Do J=2,Ny
            Y(I,J)=(-Y(I-1,J-1)*Td00-Y(I,J-1)*Td10-Y(I-1,J)*Td01+CoFy)/Td11
          End do
        End do
!$acc loop
        Do i=1,Nx
          Do j=1,Ny
!$acc atomic update
            Ux(i,j)=Ux(i,j)+Y(i,j)
!$acc end atomic
!$acc atomic update
            Uy(i,j)=Uy(i,j)+Y(i,j)
!$acc end atomic
          End do
        End do

      End do
    End do

    Relmaxr = -100000.000  ! or some other small initial value
!$acc parallel loop collapse(2) independent reduction(max:Relmaxr)
    Do I=1,Nx
      Do J=1,Ny
        Relmaxr=max(Relmaxr,abs(Ux(I,J)-Uxpr(I,J))+abs(Uy(I,J)-Uypr(I,J)))
      End do
    End do
!$acc end parallel loop

    If(Relmaxr.gt.Error) Goto 6

!$acc end data
``````

Hi again, Mat.

I applied all the changes you mentioned in your previous post, but I can’t get the OpenACC code to run faster than the serial one. I think the memory transfers are already minimal. Is it the atomic construct that slows down the parallel version so much?

Also, if I run the code with PGPROF there is a Driver API entry named ‘cuDevicePrimaryCtxRelease’ that takes 8 seconds. Does that mean the GPU isn’t shut down quickly?

Thanks again,
Alex


Hi Alex,

The profiler should be able to tell you where the time is being spent. “atomics” can slow things down, but it could be other things as well, such as data movement between the host and device, too small of a problem, not enough exposed parallelism, data layout on the device, occupancy, …

“cuDevicePrimaryCtxRelease” is the driver API call to release the primary context. I’d think releasing the context would only take <200 ms, not 8 seconds, so something is wrong. Though, not having seen this before, I’m not sure what. Can you post, or send to PGI Customer Service (trs@pgroup.com), the full profile and possibly a reproducing example?

-Mat

Mat

I sent you an email because the code needs some input data to run.

Alex

Thanks Alex. I was able to figure out the problem.

To refresh the question:

Also, if I run the code with PGPROF there is a Driver API entry named ‘cuDevicePrimaryCtxRelease’ that takes 8 seconds. Does that mean the GPU isn’t shut down quickly?

This only occurs on Windows, and the time being spent isn’t really because of “cuDevicePrimaryCtxRelease”. The problem is that it takes the profiler a few seconds (in this case 8) to write the CPU profile to a file. It just happens that “cuDevicePrimaryCtxRelease” was being held open while the file was being written. The extra time does not occur when not using PGPROF, or when disabling CPU profiling (pgprof --cpu-profiling off <a.exe>).

-Mat