# Reduction of an array inside the parallel region

Hello everyone,

I need to reduce some arrays inside a parallel region, but OpenACC does not support reductions on arrays. Below is an example of what I am trying to achieve. I thought about reducing to a scalar, but then I can't store the results back to the arrays without a race condition. I also tried the atomic clause, but I can't write to an allocatable variable.

``````
!$acc parallel loop collapse(2) independent private(Y)
Do L=1,Nt                  ! External loop for angles
  Do K=1,Nc                ! External loop for velocities

    ! Boundary conditions
    Do i=1,Nx
      Y(i,1)=..
    End do
    Do i=2,Ny
      Y(1,i)=..
    End do
    ! Calculation of Y
    Do I=2,Nx
      Do J=2,Ny
        Y(I,J)=(-Y(I-1,J-1)*Td00-Y(I,J-1)*Td10-Y(I-1,J)*Td01+CoFy)/Td11
      End do
    End do
    ! Sum reduction
    Do i=1,Nx
      Do j=1,Ny
        Ux(i,j)=Ux(i,j)+Y(i,j)
        Uy(i,j)=Uy(i,j)+Y(i,j)
      End do
    End do

  End do
End do
``````

Hi Alexander,

Since “Y” is private, I’m assuming you’re asking about the “Ux” and “Uy” arrays. What’s the issue you’re having with atomics? Using an atomic update would be the solution.

``````
Do i=1,Nx
  Do j=1,Ny
!$acc atomic update
    Ux(i,j)=Ux(i,j)+Y(i,j)
!$acc atomic update
    Uy(i,j)=Uy(i,j)+Y(i,j)
  End do
End do
``````

You are correct that atomics can only be used on scalars, but “Ux(i,j)” and “Uy(i,j)” are considered scalars in this context since they are elements of an allocatable array, not the array itself.

-Mat

Hi Mat, and thank you for your rapid response.

It is for the Ux and Uy arrays.

If I use the atomic as you posted I get the following error:

``````
PGF90-S-0000-Internal compiler error. Error: Detected unexpected atomic store opcode. 365 (2D_cmplx_1.f90: 180)
PGF90-S-0155-Invalid atomic expression (2D_cmplx_1.f90: 180)
PGF90-S-0155-Invalid atomic region. (2D_cmplx_1.f90: 180)
``````

Do you have any idea why this is happening?

Thank you

Are the Ux and Uy arrays double-precision complex? If so, that’s the problem. Due to a lack of hardware support, we don’t support double-precision complex with atomics. Single-precision complex should be fine with PGI 17.4 and later.
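
If you need to stay in double precision, one possible workaround (just a sketch; the “Ux_re”/“Ux_im” arrays are hypothetical, not from your code) is to carry the real and imaginary parts in separate double-precision real arrays, since real double-precision atomics are supported:

``````
! Sketch, assuming Ux_re/Ux_im are real(8) arrays holding the parts of Ux.
! Double-precision REAL atomics are supported, so update each part
! separately instead of using a single complex(8) atomic.
!$acc atomic update
Ux_re(i,j) = Ux_re(i,j) + real(Y(i,j), kind=8)
!$acc end atomic
!$acc atomic update
Ux_im(i,j) = Ux_im(i,j) + aimag(Y(i,j))
!$acc end atomic
``````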

-Mat

Yes, they are. However, even if I change all my variables to single precision, I still get these errors:

``````
PGF90-S-0155-Invalid atomic expression (2D_cmplx_1.f90: 186)
PGF90-S-0155-Invalid atomic region. (2D_cmplx_1.f90: 186)
``````

I use PGI 17.4-Community edition.

Alex

Well, that should work. I just tried something similar and it worked for me. Can you write up a small reproducer?

``````
% cat test.f90
program test_complex_atomic_update
  implicit none

  integer, parameter :: nlen=10
  complex(kind=kind(0.0)), dimension(nlen,nlen) :: arr

  integer i,j,k
  complex(kind=kind(0.0)) :: var

  var=(1.,1.)
  arr=(0.,0.)
!$acc parallel loop gang vector copy(arr)
  do k=1,100
    do i=1,nlen
      do j=1,nlen
!$acc atomic update
        arr(i,j)=arr(i,j)+var
!$acc end atomic
      enddo
    enddo
  enddo
  print *, arr
end program test_complex_atomic_update

% pgfortran -ta=tesla:cc60 -Minfo=accel test.f90 -V17.4
test_complex_atomic_update:
     12, Generating copy(arr(:,:))
         Accelerator kernel generated
         Generating Tesla code
         13, !$acc loop gang, vector(100) ! blockidx%x threadidx%x
         14, !$acc loop seq
         15, !$acc loop seq
     14, Loop is parallelizable
     15, Loop is parallelizable
``````

Mat

Your test case gave the same result for me, so the problem was in my own code: another atomic construct used a different statement form. Instead of the supported form

x = x operator expr

I had written

x = x operator expr operator expr

which is invalid.
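
For reference, a minimal sketch of the difference (the variable names here are illustrative, not from my actual code). An OpenACC atomic update must have the form `x = x operator expr`, where `expr` does not reference `x`; folding the extra term into a temporary first makes the statement conform:

``````
! Invalid: two top-level operators, so the statement is not of the
! form "x = x op expr" and the compiler reports
! "Invalid atomic expression":
!   !$acc atomic update
!   s = s + a(i) + b(i)

! Valid: combine the extra terms into one expression first
t = a(i) + b(i)
!$acc atomic update
s = s + t
!$acc end atomic
``````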

1. Can’t I have private arrays that are created only on the device?
2. I was thinking about creating a data region with all the arrays around the do-while iteration and then doing all the calculations in parallel. Do you have a suggestion or a recommendation about the code below?

Thanks again for the assistance

``````
!$acc data copyin(Ux(1:Nx,1:Ny),Uy(1:Nx,1:Ny),Uxpr(1:Nx,1:Ny),Uypr(1:Nx,1:Ny),reler_r(1:Nx,1:Ny))

6   Continue
    Iter=Iter+1

    Do i=1,Nx
      Do j=1,Ny
        Uxpr(i,j)=Ux(i,j); Uypr(i,j)=Uy(i,j)
      End do
    End do

!$acc parallel loop collapse(2) independent private(Y) present(phi,W1,c,Ux,Uy,Dns,Tmp,qx,qy)
    Do L=1,Nt            ! External loop for angles
      Do K=1,Nc          ! External loop for velocities

        ! Boundary conditions
!$acc loop
        Do i=1,Nx
          Y(i,1)=..
        End do
!$acc loop
        Do i=2,Ny
          Y(1,i)=..
        End do
        ! Calculation of Y (must be private)
        Do I=2,Nx
          Do J=2,Ny
            Y(I,J)=(-Y(I-1,J-1)*Td00-Y(I,J-1)*Td10-Y(I-1,J)*Td01+CoFy)/Td11
          End do
        End do
!$acc loop
        Do i=1,Nx
          Do j=1,Ny
!$acc atomic update
            Ux(i,j)=Ux(i,j)+Y(i,j)
!$acc end atomic
            Uy(i,j)=Uy(i,j)+Y(i,j)
          End do
        End do

      End do
    End do

!$acc parallel loop collapse(2) independent
    Do I=1,Nx
      Do J=1,Ny
        Reler(I,J)=abs(Ux(I,J)-Uxpr(I,J))+abs(Uy(I,J)-Uypr(I,J))
      End do
    End do
!$acc end parallel loop
    Relmaxr=MAXVAL(Reler_r)

    If(Relmaxr.gt.Error) Goto 6

!$acc end data
``````

Hi Alex,

1. Can’t I have private arrays that are created only for the device?

Yes, via the “private” clause as you have done here with the “Y” array.

2. I was thinking about creating a data region with all the arrays inside the do while iteration and then do all the calculations in parallel. Do you have a suggestion or a recommendation about the code below?

This should work. Though, I’d put “Uxpr” and “Uypr” in a “create” clause and then put their initialization loop in a compute construct. That way you don’t need to move them from the host to the device.

You’ll need to update “Reler” to the host in order to use it in MAXVAL. Though, why not just do the max reduction on the device and get rid of “Reler” altogether?

Also, I think you meant to put “Uy” in an atomic update as well.

Something like:

``````
!$acc data copyin(Ux(1:Nx,1:Ny),Uy(1:Nx,1:Ny)) create(Uxpr(1:Nx,1:Ny),Uypr(1:Nx,1:Ny))

6   Continue
    Iter=Iter+1

!$acc kernels loop collapse(2)
    Do j=1,Ny
      Do i=1,Nx
        Uxpr(i,j)=Ux(i,j); Uypr(i,j)=Uy(i,j)
      End do
    End do

!$acc parallel loop collapse(2) independent private(Y) present(phi,W1,c,Ux,Uy,Dns,Tmp,qx,qy)
    Do L=1,Nt            ! External loop for angles
      Do K=1,Nc          ! External loop for velocities

        ! Boundary conditions
!$acc loop
        Do i=1,Nx
          Y(i,1)=..
        End do
!$acc loop
        Do i=2,Ny
          Y(1,i)=..
        End do
        ! Calculation of Y (must be private)
        Do I=2,Nx
          Do J=2,Ny
            Y(I,J)=(-Y(I-1,J-1)*Td00-Y(I,J-1)*Td10-Y(I-1,J)*Td01+CoFy)/Td11
          End do
        End do
!$acc loop
        Do i=1,Nx
          Do j=1,Ny
!$acc atomic update
            Ux(i,j)=Ux(i,j)+Y(i,j)
!$acc end atomic
!$acc atomic update
            Uy(i,j)=Uy(i,j)+Y(i,j)
!$acc end atomic
          End do
        End do

      End do
    End do

    Relmaxr = -100000.000  ! or some other small initial value
!$acc parallel loop collapse(2) independent reduction(max:Relmaxr)
    Do I=1,Nx
      Do J=1,Ny
        Relmaxr=max(Relmaxr,abs(Ux(I,J)-Uxpr(I,J))+abs(Uy(I,J)-Uypr(I,J)))
      End do
    End do
!$acc end parallel loop

    If(Relmaxr.gt.Error) Goto 6

!$acc end data
``````

Hi again, Mat.

I applied all the changes you mentioned in your previous post, but I can’t get the OpenACC code to run faster than the serial one. I think the memory transfers are already minimal. Is it the atomic construct that slows down the parallel version so much?

Also, if I run the code with PGPROF there is a Driver API entry named ‘cuDevicePrimaryCtxRelease’ that takes 8 seconds. Does that mean the GPU isn’t shut down quickly?

Thanks again,
Alex


Hi Alex,

The profiler should be able to tell you where the time is being spent. “atomics” can slow things down, but it could be other things as well, such as data movement between the host and device, too small of a problem, not enough exposed parallelism, data layout on the device, occupancy, …

“cuDevicePrimaryCtxRelease” is the driver API call to release the primary context. I’d think releasing the context would only take <200 ms, not 8 seconds, so something is wrong. Though, not having seen this before, I’m not sure what. Can you post, or send to PGI Customer Service (trs@pgroup.com), the full profile and possibly a reproducing example?

-Mat

Mat

I sent you an email because the code needs some input data to run.

Alex

Thanks Alex. I was able to figure out the problem.

To refresh the question:

Also, if I run the code with PGPROF there is a Driver API entry named ‘cuDevicePrimaryCtxRelease’ that takes 8 seconds. Does that mean the GPU isn’t shut down quickly?

This only occurs on Windows, and the time being spent isn’t really because of “cuDevicePrimaryCtxRelease”. The problem is that it takes the profiler a few seconds (in this case 8) to write the CPU profile to a file. It just happens that “cuDevicePrimaryCtxRelease” was being held open while the file was being written. The extra time does not occur when not using PGPROF, or when disabling CPU profiling (pgprof --cpu-profiling off <a.exe>).

-Mat