Pgc++ acc can't reduce long int

I’m summing i from 0 to n-1, testing with i and n as either int or long int, and with the loop either inside an acc parallel loop or not. Three of the four combinations work. However, long int with parallel crashes.

#include <iostream>
using std::cout;
using std::endl;

int main() {
  cout << "[Starting]\n";

  {
    const int n = 2'000'000'000;
    cout << "n: " << n << endl;
    double correct = n*(n-1.0)*0.5;  // sum of 0..n-1
    double computed=0;
    for (int i=0; i<n; i++) computed +=i;
    cout  << "int n, sequential. correct: " << correct << ", computed: " << computed << endl;
  }

  {
    const int n = 2'000'000'000;
    cout << "n: " << n << endl;
    double correct = n*(n-1.0)*0.5;  // sum of 0..n-1
    double computed=0;
#pragma acc parallel loop reduction(+:computed)
    for (int i=0; i<n; i++) computed +=i;
    cout  << "int n, parallel. correct: " << correct << ", computed: " << computed << endl;
  }

  {
    const long int n = 20'000'000'000;
    cout << "n: " << n << endl;
    double correct = n*(n-1.0)*0.5;  // sum of 0..n-1
    double computed=0;
    for (long int i=0; i<n; i++) computed +=i;
    cout  << "long int n, sequential. correct: " << correct << ", computed: " << computed << endl;
  }

  {
    const long int n = 20'000'000'000;
    cout << "n: " << n << endl;
    double correct = n*(n-1.0)*0.5;  // sum of 0..n-1
    double computed=0;
#pragma acc parallel loop reduction(+:computed)
    for (long int i=0; i<n; i++) computed +=i;
    cout  << "long int n, parallel. correct: " << correct << ", computed: " << computed << endl;
  }
}

I compiled it thus:

pgc++ -fast -Minfo=accel -mp -acc bad-acc.cc -lfmt -o bad-acc

This is the output:

n: 2000000000
int n, sequential. correct: 2e+18, computed: 2e+18
n: 2000000000
upload CUDA data  file=/p73/wrf/git/parallel-research/openacc-1st/bad-acc.cc function=main line=25 device=0 threadid=1 variable=computed bytes=8
download CUDA data  file=/p73/wrf/git/parallel-research/openacc-1st/bad-acc.cc function=main line=27 device=0 threadid=1 variable=computed bytes=8
int n, parallel. correct: 2e+18, computed: 2e+18
n: 20000000000
long int n, sequential. correct: 2e+20, computed: 2e+20
n: 20000000000
upload CUDA data  file=/p73/wrf/git/parallel-research/openacc-1st/bad-acc.cc function=main line=44 device=0 threadid=1 variable=computed bytes=8
[2]    415351 segmentation fault (core dumped)  ./bad-acc
11.75s real  11.44s user  0.12s system  98% 0,0 socket  165 mem ./bad-acc

I’m running Ubuntu 20.04. The GPU is a Quadro RTX 5000.

If this is a known limitation, are other limitations documented so I’ll know to work around them?

Thanks.

Thanks for the report. The problem doesn’t appear to be with long int itself, but rather with a buffer in our runtime when using very large loop trip counts. I’ve reported this issue as TPR #28814 and sent it to our compiler engineers for further evaluation.

Unfortunately, I don’t have a good workaround for you other than lowering the value of n to 19327352831 or below.
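
If lowering n is not an option, one possible approach is to split the range into chunks so that no single parallel loop exceeds the limit. This is only a sketch: it assumes the limit applies to the trip count of each individual parallel loop, which has not been confirmed here, and it has not been run against the affected runtime. The names `chunked_sum` and `max_trip` are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not from the original program): sums i for
// i in [0, n) using several OpenACC loops, each with at most max_trip
// iterations. Assumes a 64-bit long, as on the Linux system in this thread.
double chunked_sum(long n, long max_trip) {
  assert(max_trip > 0);
  double total = 0;
  for (long start = 0; start < n; start += max_trip) {
    const long end = std::min(n, start + max_trip);
    double partial = 0;
    // Each offloaded loop sees a trip count of at most max_trip.
    // Without -acc the pragma is ignored and the loop runs sequentially.
#pragma acc parallel loop reduction(+:partial)
    for (long i = start; i < end; i++) partial += i;
    total += partial;  // combine the per-chunk results on the host
  }
  return total;
}
```

For the failing case this would be called as `chunked_sum(20000000000, 1000000000)`, where the chunk size of one billion is an arbitrary choice below the value given above.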

Mat,

Thanks for looking into it. Knowing where the problem is, I can easily work around it. For some reason software often breaks when I use it, and I wasn’t even trying. A few years ago I put nvcc into an infinite loop when trying to compile a particular size of local array. (I reported it, and the problem was fixed in the next minor release.)

/Randolph

Hi Randolph,

Apologies that it took this long, but engineering let me know that TPR #28814 was fixed in our 23.11 release.

% nvc++ red2.cpp -acc=gpu -Minfo -V23.11 ; a.out
main:
     11, Generating NVIDIA GPU code
         13, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
             Generating reduction(+:computed)
     11, Generating implicit copy(computed) [if not already present]
[Starting]
n: 19330000000
long int n, parallel. correct: 1.86824e+20, computed: 1.86824e+20
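
For code that still has to build with older releases, a compile-time version check could decide whether a workaround is needed. The sketch below assumes the `__NVCOMPILER`, `__NVCOMPILER_MAJOR__` and `__NVCOMPILER_MINOR__` predefined macros of nvc++; the function names are hypothetical, and older pgc++ releases that lack these macros would need a separate check.

```cpp
// Hypothetical helper: true when the given release is 23.11 or newer,
// the release in which TPR #28814 is reported fixed.
constexpr bool has_tpr28814_fix(int major, int minor) {
  return major > 23 || (major == 23 && minor >= 11);
}

// Hypothetical helper: applies the check to the compiler in use.
constexpr bool compiler_has_fix() {
#if defined(__NVCOMPILER)
  return has_tpr28814_fix(__NVCOMPILER_MAJOR__, __NVCOMPILER_MINOR__);
#else
  return false;  // unknown compiler: conservatively assume the fix is absent
#endif
}
```

A program could then use the large trip count directly when `compiler_has_fix()` is true and fall back to a smaller n otherwise.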

-Mat

Hi Mat,

Thanks for the info!

/Randolph