[Help] Using reduction with an array

Hi all,

Currently, I’m trying to use the reduction clause with an array, and I don’t know how to apply it to my case.

I have an array {1, 2, 3, 4, 5, 3, 1, 3, 2, 5, 6, 2, 6, 2, 9, 4, 2, 3, 5, 6}.
I want to count how many times each value (0-9) occurs in this array and store the counts in another array, but I don’t know how to use the reduction clause here (#pragma acc parallel loop reduction(+:res)???).

This is my source code:

#include <stdio.h> 
#include <string.h> 
#include <iostream>

using namespace std;

int main() 
{ 
    int n = 20; 
    int arr[20] = {1,2,3,4,5,3,1,3,2,5,6,2,6,2,9,4,2,3,5,6}; 
    int res[10] = {0,0,0,0,0,0,0,0,0,0};

    #pragma acc enter data copyin(arr[0:20])
    #pragma acc parallel loop reduction(+:res[])
    for(int i = 0; i < sizeof(arr)/sizeof(arr[0]); i++)
    {
        res[arr[i]] += 1;
    }
    #pragma acc update self(arr[0:20])
    #pragma acc exit data delete(arr)

    return 0; 
}

Please help me apply the reduction here!

Thank you so much!

Hi all,

I think I found sample code for my issue:

float _Complex c;
float s[4];
..
#pragma acc parallel loop reduction(+:c,s) default(present)
for (int i = 0; i < n; ++i) {
   c += CMPLXF( a[i], b[i] );
   s[0] += a[i];
   s[1] += b[i];
   s[2] += c[i];
   if (d[i] > 0) s[3] += d[i];
}

But it gives this error: PGCC-S-1002-Reduction type not supported for this variable datatype - res

Currently, I’m using PGI compiler 18.10, which includes OpenACC 2.6. Does this version support reductions on arrays?
When will the community version of PGI compiler 19.1 be released?

Thank you so much!

Hi DanhNam

Currently, I’m using PGI compiler 18.10, which includes OpenACC 2.6. Does this version support reductions on arrays?

No, neither the OpenACC 2.6 standard nor the most recent 2.7 standard supports array reductions.

However, it is one of the more frequently requested features, so the OpenACC committee is considering it for a future standard.

Note that the concern from our engineers evaluating this feature is that array reductions might cause performance issues, but if the feature is adopted into the OpenACC standard, PGI will add support for it.
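In the meantime, a common way to compute a small histogram like the one in your first post without an array reduction is to use atomic updates instead. Below is a minimal, untested sketch of that idea (the data clauses are my assumption of what you would want, not something taken from your original code):

#include <stdio.h>

int main(void)
{
    int arr[20] = {1,2,3,4,5,3,1,3,2,5,6,2,6,2,9,4,2,3,5,6};
    int res[10] = {0};

    /* Each iteration atomically increments one bin, so no reduction
       clause on the array is needed. */
    #pragma acc parallel loop copyin(arr[0:20]) copy(res[0:10])
    for (int i = 0; i < 20; ++i) {
        #pragma acc atomic update
        res[arr[i]] += 1;
    }

    for (int i = 0; i < 10; ++i)
        printf("%d: %d\n", i, res[i]);

    return 0;
}

Atomics serialize updates to the same bin, so performance will depend on the data; treat this only as a starting point.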

When will the community version of PGI compiler 19.1 be released?

The next PGI Community Edition will be 19.4 which isn’t too far away.

-Mat

Sorry for Frankensteining this thread, but I do believe the 2.7 standard does include array reductions. From the 2.7 specification, p. 28:

“If the reduction var is an array or subarray, the array reduction operation is logically equivalent to applying that reduction operation to each element of the array or subarray individually.”
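To illustrate that equivalence with something that compiles today, here is a small sketch of my own (not from the spec): a hypothetical reduction(+:s) over float s[2] would behave as if each element had its own scalar reduction, like this:

#include <stdio.h>

int main(void)
{
    int n = 1000;
    float s0 = 0.0f, s1 = 0.0f;   /* stand-ins for s[0] and s[1] */

    /* An array reduction over float s[2] would be logically the same
       as these two element-wise scalar reductions. */
    #pragma acc parallel loop reduction(+:s0,s1)
    for (int i = 0; i < n; ++i) {
        s0 += 1.0f;
        s1 += 2.0f;
    }

    printf("%f %f\n", s0, s1);
    return 0;
}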

From what I can tell, this is as yet unsupported in PGI 19.7. Is adding it on the horizon, or has it been added in a more recent version? Thanks!

Correct, I was in error in stating that 2.7 did not include array reductions. Apologies.

With all the other projects we’ve been working on and the rebranding effort, we unfortunately haven’t had time to add new OpenACC features. We should be getting back to these in the near future, but I don’t have a firm timeline as to when.

-Mat

Hello,

I was wondering whether nvfortran 23.9 supports OpenACC array reductions.

If so, does it support array reductions in nested loops? For example, the code works well with multiple scalar reductions, reduction(+: tmp_x, tmp_y, tmp_z), but I get a runtime error with the array reduction, reduction(+: tmp(:)).

The error message is:
Accelerator Fatal Error: call to cuStreamSynchronize returned error 719: Launch failed (often invalid pointer dereference)

The nested code structure is like this:

!$acc parallel loop independent gang worker private(tmp(1:3))
do j = 1, n
   tmp(1:3) = 0.0d0
   !$acc loop independent vector private(A) reduction(+:tmp(1:3))
   do i = 1, m
      ... computation for A ...
      tmp(1:3) = tmp(1:3) + A
   end do
   B = B + tmp(1:3)
end do

Thanks,
Yongsuk

It might be a heap overflow, given that the inner private copies of “tmp” need to be allocated on the device. You can try increasing the device heap via the environment variable “NV_ACC_CUDA_HEAPSIZE”.

If you can, I’d recommend avoiding array reductions given the required overhead.

But it works with the multiple scalar reductions, reduction(+: tmp_x, tmp_y, tmp_z), and only fails at runtime with the array reduction, reduction(+: tmp(:)).

If that were the case, I should get the heap overflow in both cases, right? Thank you.

Scalars don’t need to be allocated on the device since they are fixed size and can be made local to each thread.

tmp’s size is unknown until runtime, so it needs to be allocated by each thread.

I might have confused you, but tmp’s size is fixed in my case: it holds 3 real(8) values. I will edit the code. I’m pretty sure there would be no heap overflow.

Ok, you edited the post. Originally you had “tmp(:)”.

If it still doesn’t work after the change, please post a reproducing example so I can investigate.

Here are reproducing examples. One works; the other does not.

TOY_NO_ERROR.f90

program toy_no_error
use openacc
implicit none

integer,parameter :: &
    NPmax=100000, &
    NPNmax = 300, &
    NSD = 3      
integer,allocatable :: npnl(:), pair(:,:)
real(8),allocatable :: mass(:), at(:,:), p(:), rho(:), dwdx(:,:,:)

integer :: ii,i,j,k,err
real(8) :: sr, vr(NSD), dat_x, dat_y, dat_z

allocate(npnl(NPmax), pair(NPNmax,NPmax), mass(NPmax), at(NSD,NPmax), &
    p(NPmax), rho(NPmax), dwdx(NSD,NPNmax,NPmax), stat=err)
if(err/=0) then 
    print '(A)', '  DYNAMIC ALLOCATION ERROR  '
end if

do ii = 1, 10

   !$acc parallel loop independent gang private(dat_x, dat_y, dat_z)
   do i = 1, NPmax
      dat_x = 0.0d0
      dat_y = 0.0d0
      dat_z = 0.0d0
      !$acc loop independent private(j,sr,vr) reduction(+:dat_x,dat_y,dat_z)
      do k = 1, npnl(i)

         j = pair(k,i)

         sr = (p(i)+p(j))/(rho(i)*rho(j))
         vr = sr*dwdx(:,k,i)

         dat_x = dat_x - mass(j)*vr(1)
         dat_y = dat_y - mass(j)*vr(2)
         dat_z = dat_z - mass(j)*vr(3)
      end do
      at(1,i) = at(1,i) + dat_x
      at(2,i) = at(2,i) + dat_y
      at(3,i) = at(3,i) + dat_z
   end do

end do

end program

TOY_ERROR.f90

program toy_error
use openacc
implicit none

integer,parameter :: &
    NPmax=100000, &
    NPNmax = 300, &
    NSD = 3      
integer,allocatable :: npnl(:), pair(:,:)
real(8),allocatable :: mass(:), at(:,:), p(:), rho(:), dwdx(:,:,:)

integer :: ii,i,j,k,err
real(8) :: sr, vr(NSD), dat(NSD)

allocate(npnl(NPmax), pair(NPNmax,NPmax), mass(NPmax), at(NSD,NPmax), &
    p(NPmax), rho(NPmax), dwdx(NSD,NPNmax,NPmax), stat=err)
if(err/=0) then 
    print '(A)', '  DYNAMIC ALLOCATION ERROR  '
end if

do ii = 1, 10

   !$acc parallel loop independent gang private(dat)
   do i = 1, NPmax
      dat = 0.0d0
      !$acc loop independent private(j,sr,vr) reduction(+:dat)
      do k = 1, npnl(i)

         j = pair(k,i)

         sr = (p(i)+p(j))/(rho(i)*rho(j))
         vr = sr*dwdx(:,k,i)

         dat = dat - mass(j)*vr
      end do
      at(:,i) = at(:,i) + dat
   end do

end do

end program

yongcho@yongcho-XPS-8960:~$ nvfortran --version

nvfortran 23.9-0 64-bit target on x86-64 Linux -tp alderlake
NVIDIA Compilers and Tools
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

yongcho@yongcho-XPS-8960:~$ nvfortran -acc TOY_ERROR.f90

yongcho@yongcho-XPS-8960:~$ ./a.out
Failing in Thread:1
Accelerator Fatal Error: call to cuLaunchKernel returned error 1: Invalid value
File: /home/yongcho/reproducing_example_error.f90
Function: toy:1
Line: 23

Thanks,
Yongsuk

Looking at it, it seems more likely a stack overflow. You’ll need to reduce the number of gangs and the vector length to get it to fit.

!$acc parallel loop independent gang private(dat) vector_length(64) num_gangs(1024)
do i=1,NPmax
dat=0.0d0
!$acc loop independent private(j,sr,vr) reduction(+:dat)
do k=1,npnl(i)

Again, I recommend avoiding array reductions if possible given the overhead and the performance will likely be slower.

I reduced the parameters like this:

integer,parameter :: &
NPmax=10, &
NPNmax = 3, &
NSD = 3

and reduced the number of gangs and the vector length as you did:

!$acc parallel loop independent gang private(dat) vector_length(64) num_gangs(1024)

This time, however, I got a profiling error.

yongcho@yongcho-XPS-8960:~$ ncu --target-processes all -o profile ./a.out 
==PROF== Connected to process 698634 (/home/yongcho/a.out)
==PROF== Target process 698636 terminated before first instrumented API call.
==PROF== Target process 698637 terminated before first instrumented API call.
==PROF== Profiling "toy_21" - 0: 0%....50%....100% - 1 pass

==ERROR== LaunchFailed
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==PROF== Report: /home/yongcho/profile.ncu-rep

Can I ask why this is happening? A normal run (./a.out) works though.

Again, I recommend avoiding array reductions if possible given the overhead and the performance will likely be slower.
→ Then, are multiple scalar reductions safe to use? It looks like the reduction mechanism for scalar and array variables is different?

Thank you again for your help,
Yongsuk

It works for me here, so I’m assuming it’s something local to your system. I looked over on the Nsight Compute forum and saw a few posts with the same error. In particular, one looked to be due to an old CUDA driver and another was a remote-system connection error. Why it’s happening on your system, I’m not sure.
