[Help] Using reduction with an array

Hi all,

Currently, I’m trying to use the reduction clause with an array, and I don’t know how to apply it to my case.

I have an array {1, 2, 3, 4, 5, 3, 1, 3, 2, 5, 6, 2, 6, 2, 9, 4, 2, 3, 5, 6}.
I want to count how many times each value (0-9) occurs in this array and store the counts in another array, but I don’t know how to use the reduction clause here (#pragma acc parallel loop reduction(+:res)???).

This is my source code:

#include <stdio.h> 
#include <string.h> 
#include <iostream>

using namespace std;

int main() 
{ 
    int n = 20; 
    int arr[20] = {1,2,3,4,5,3,1,3,2,5,6,2,6,2,9,4,2,3,5,6}; 
    int res[10] = {0,0,0,0,0,0,0,0,0,0};

    #pragma acc enter data copyin(arr[0:20])
    #pragma acc parallel loop reduction(+:res[])
    for(int i = 0; i < sizeof(arr)/sizeof(arr[0]); i++)
    {
        res[arr[i]] += 1;
    }
    #pragma acc update self(arr[0:20])
    #pragma acc exit data delete(arr)

    return 0; 
}

Please help me apply the reduction here!

Thank you so much!

Hi all,

I think I found sample code for my issue:

float _Complex c;
float s[4];
..
#pragma acc parallel loop reduction(+:c,s) default(present)
for (int i = 0; i < n; ++i) {
   c += CMPLXF( a[i], b[i] );
   s[0] += a[i];
   s[1] += b[i];
   s[2] += c[i];
   if (d[i] > 0) s[3] += d[i];
}

But it gives this error: PGCC-S-1002-Reduction type not supported for this variable datatype - res

Currently, I’m using PGI compiler 18.10, which includes OpenACC 2.6. Does this version support reductions on arrays?
When will the community version of PGI compiler 19.1 be released?

Thank you so much!

Hi DanhNam

Currently, I’m using PGI compiler 18.10, which includes OpenACC 2.6. Does this version support reductions on arrays?

No, neither the OpenACC 2.6 standard nor the most recent 2.7 standard supports array reductions.

However, it is one of the more frequently requested features, so the OpenACC committee is considering it for a future standard.

Note that the concern from our engineers evaluating this feature is that array reductions might cause performance issues, but if the feature is adopted into the OpenACC standard, PGI will add support for it.
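In the meantime, a common way to compute a small histogram like the one in your first post without an array reduction is to use atomic updates instead. Below is a minimal, untested sketch of that idea (the data clauses are my assumption of what you would want, not something taken from your original code):

#include <stdio.h>

int main(void)
{
    int arr[20] = {1,2,3,4,5,3,1,3,2,5,6,2,6,2,9,4,2,3,5,6};
    int res[10] = {0};

    /* Each iteration atomically increments one bin, so no reduction
       clause on the array is needed. */
    #pragma acc parallel loop copyin(arr[0:20]) copy(res[0:10])
    for (int i = 0; i < 20; ++i) {
        #pragma acc atomic update
        res[arr[i]] += 1;
    }

    for (int i = 0; i < 10; ++i)
        printf("%d: %d\n", i, res[i]);

    return 0;
}

Atomics serialize updates to the same bin, so performance will depend on the data; treat this only as a starting point.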

When will the community version of PGI compiler 19.1 be released?

The next PGI Community Edition will be 19.4 which isn’t too far away.

-Mat

Sorry for Frankensteining this thread, but I do believe the 2.7 standard does include array reductions. From the 2.7 specification, p. 28:

“If the reduction var is an array or subarray, the array reduction operation is logically equivalent to applying that reduction operation to each element of the array or subarray individually.”
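To illustrate that equivalence with something that compiles today, here is a small sketch of my own (not from the spec): a hypothetical reduction(+:s) over float s[2] would behave as if each element had its own scalar reduction, like this:

#include <stdio.h>

int main(void)
{
    int n = 1000;
    float s0 = 0.0f, s1 = 0.0f;   /* stand-ins for s[0] and s[1] */

    /* An array reduction over float s[2] would be logically the same
       as these two element-wise scalar reductions. */
    #pragma acc parallel loop reduction(+:s0,s1)
    for (int i = 0; i < n; ++i) {
        s0 += 1.0f;
        s1 += 2.0f;
    }

    printf("%f %f\n", s0, s1);
    return 0;
}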

From what I can tell, this is as yet unsupported in PGI 19.7. Is adding it on the horizon, or has it been added in a more recent version? Thanks!

Correct, I was in error in stating that 2.7 did not include array reductions. Apologies.

With all the other projects we’ve been working on and the rebranding effort, we unfortunately haven’t had time to add new OpenACC features. We should be getting back to these in the near future, but I don’t have a firm timeline as to when.

-Mat

Hello,

I was wondering whether nvfortran 23.9 supports OpenACC array reductions.

If so, does it support array reductions in nested loops? For example, the code works well with multiple scalar reductions, reduction(+: tmp_x, tmp_y, tmp_z), but I get a runtime error with the array reduction, reduction(+: tmp(:)).

The error message is:
Accelerator Fatal Error: call to cuStreamSynchronize returned error 719: Launch failed (often invalid pointer dereference)

The nested code structure is like this:

!$acc parallel loop independent gang worker private(tmp(1:3))
do j = 1, n
   tmp(1:3) = 0.0d0
   !$acc loop independent vector private(A) reduction(+:tmp(1:3))
   do i = 1, m
      ... computation for A ...
      tmp(1:3) = tmp(1:3) + A
   end do
   B = B + tmp(1:3)
end do

Thanks,
Yongsuk

It might be a heap overflow, given that the inner private copies of “tmp” need to be allocated on the device. You can try increasing the device heap via the environment variable “NV_ACC_CUDA_HEAPSIZE”.

If you can, I’d recommend avoiding array reductions given the required overhead.

But it works with the multiple scalar reductions, reduction(+: tmp_x, tmp_y, tmp_z), and only fails at runtime with the array reduction, reduction(+: tmp(:)).

If that were the case, I should get the heap overflow in both cases, right? Thank you.

Scalars don’t need to be allocated on the device since they are fixed size and can be made local to each thread.

tmp’s size is unknown until runtime, so it needs to be allocated by each thread.

I might have confused you, but tmp’s size is fixed in my case: it holds 3 real(8) values. I will edit the code. I’m pretty sure there would be no heap overflow.

Ok, you edited the post. Originally you had “tmp(:)”.

If it still doesn’t work after the change, please post a reproducing example so I can investigate.

Here are reproducing examples. One works; the other does not.

TOY_NO_ERROR.f90

program toy_no_error
use openacc
implicit none

integer,parameter :: &
    NPmax=100000, &
    NPNmax = 300, &
    NSD = 3      
integer,allocatable :: npnl(:), pair(:,:)
real(8),allocatable :: mass(:), at(:,:), p(:), rho(:), dwdx(:,:,:)

integer :: ii,i,j,k,err
real(8) :: sr, vr(NSD), dat_x, dat_y, dat_z

allocate(npnl(NPmax), pair(NPNmax,NPmax), mass(NPmax), at(NSD,NPmax), &
    p(NPmax), rho(NPmax), dwdx(NSD,NPNmax,NPmax), stat=err)
if(err/=0) then 
    print '(A)', '  DYNAMIC ALLOCATION ERROR  '
end if

do ii = 1, 10

   !$acc parallel loop independent gang private(dat_x, dat_y, dat_z)
   do i = 1, NPmax
      dat_x = 0.0d0
      dat_y = 0.0d0
      dat_z = 0.0d0
      !$acc loop independent private(j,sr,vr) reduction(+:dat_x,dat_y,dat_z)
      do k = 1, npnl(i)

         j = pair(k,i)

         sr = (p(i)+p(j))/(rho(i)*rho(j))
         vr = sr*dwdx(:,k,i)

         dat_x = dat_x - mass(j)*vr(1)
         dat_y = dat_y - mass(j)*vr(2)
         dat_z = dat_z - mass(j)*vr(3)
      end do
      at(1,i) = at(1,i) + dat_x
      at(2,i) = at(2,i) + dat_y
      at(3,i) = at(3,i) + dat_z
   end do

end do

end program

TOY_ERROR.f90

program toy_error
use openacc
implicit none

integer,parameter :: &
    NPmax=100000, &
    NPNmax = 300, &
    NSD = 3      
integer,allocatable :: npnl(:), pair(:,:)
real(8),allocatable :: mass(:), at(:,:), p(:), rho(:), dwdx(:,:,:)

integer :: ii,i,j,k,err
real(8) :: sr, vr(NSD), dat(NSD)

allocate(npnl(NPmax), pair(NPNmax,NPmax), mass(NPmax), at(NSD,NPmax), &
    p(NPmax), rho(NPmax), dwdx(NSD,NPNmax,NPmax), stat=err)
if(err/=0) then 
    print '(A)', '  DYNAMIC ALLOCATION ERROR  '
end if

do ii = 1, 10

   !$acc parallel loop independent gang private(dat)
   do i = 1, NPmax
      dat = 0.0d0
      !$acc loop independent private(j,sr,vr) reduction(+:dat)
      do k = 1, npnl(i)

         j = pair(k,i)

         sr = (p(i)+p(j))/(rho(i)*rho(j))
         vr = sr*dwdx(:,k,i)

         dat = dat - mass(j)*vr
      end do
      at(:,i) = at(:,i) + dat
   end do

end do

end program

yongcho@yongcho-XPS-8960:~$ nvfortran --version

nvfortran 23.9-0 64-bit target on x86-64 Linux -tp alderlake
NVIDIA Compilers and Tools
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

yongcho@yongcho-XPS-8960:~$ nvfortran -acc TOY_ERROR.f90

yongcho@yongcho-XPS-8960:~$ ./a.out
Failing in Thread:1
Accelerator Fatal Error: call to cuLaunchKernel returned error 1: Invalid value
File: /home/yongcho/reproducing_example_error.f90
Function: toy:1
Line: 23

Thanks,
Yongsuk

Looking at it, it seems more likely a stack overflow. You’ll need to reduce the number of gangs and the vector length to get it to fit.

!$acc parallel loop independent gang private(dat) vector_length(64) num_gangs(1024)
do i=1,NPmax
dat=0.0d0
!$acc loop independent private(j,sr,vr) reduction(+:dat)
do k=1,npnl(i)

Again, I recommend avoiding array reductions if possible given the overhead and the performance will likely be slower.

I reduced the parameters like this:

integer,parameter :: &
NPmax=10, &
NPNmax = 3, &
NSD = 3

and reduced the number of gangs and the vector length as you did:

!$acc parallel loop independent gang private(dat) vector_length(64) num_gangs(1024)

This time, however, I got a profiling error.

yongcho@yongcho-XPS-8960:~$ ncu --target-processes all -o profile ./a.out 
==PROF== Connected to process 698634 (/home/yongcho/a.out)
==PROF== Target process 698636 terminated before first instrumented API call.
==PROF== Target process 698637 terminated before first instrumented API call.
==PROF== Profiling "toy_21" - 0: 0%....50%....100% - 1 pass

==ERROR== LaunchFailed
==PROF== Trying to shutdown target application
==ERROR== The application returned an error code (9).
==ERROR== An error occurred while trying to profile.
==PROF== Report: /home/yongcho/profile.ncu-rep

Can I ask why this is happening? A normal run (./a.out) works though.

Again, I recommend avoiding array reductions if possible given the overhead and the performance will likely be slower.
→ Then, are multiple scalar reductions safe to use? It looks like the reduction mechanism for scalar and array variables is different?

Thank you again for your help,
Yongsuk

It works for me here, so I’m assuming it’s something local to your system. I looked over on the Nsight Compute forum and saw a few posts with the same error. In particular, one looked to be due to an old CUDA driver and another was a remote-system connection error. Why it’s happening on your system, I’m not sure.
