Calculation fails when using long long with OpenACC kernels

Hi, All:

I am testing a very simple Pi calculation with OpenACC acceleration. It gives erroneous results when I use a very large iteration count stored in a long long:

The code pi_acc.orig.c is listed below:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char** argv) {
    long long int i, n=10000000000; //10^10
    if (argc>1) n=atoi(argv[1]);
    double start_time, end_time;
    double x, pi;
    double sum = 0.0;
    double step = 1.0/(double) n;
    printf("step = %17.15f\n",step);

#pragma acc kernels
    for (i = 0; i < n; i++) {
        x = (i+0.5)*step;
        sum +=  4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("pi = %17.15f\n",pi);
    return 0;
}
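
A side note on the argument handling in the listing: atoi returns an int, so a count such as 10000000000 passed on the command line cannot be represented where int is 32 bits, and the C standard leaves the result undefined in that case. The default n is unaffected, so this does not explain the wrong pi value. A minimal sketch of a safer parse, assuming a hypothetical helper named parse_count (not part of the original program) that falls back to a default on bad input:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper (an assumption, not from the original listing):
   parse a positive iteration count as long long, or return `fallback`
   if the text is empty, malformed, out of range, or not positive. */
long long parse_count(const char *text, long long fallback)
{
    char *end = NULL;
    errno = 0;
    long long value = strtoll(text, &end, 10);   /* 64-bit parse, unlike atoi */
    if (errno == ERANGE || end == text || *end != '\0' || value <= 0)
        return fallback;
    return value;
}
```

In main this would replace the atoi call, for example: if (argc>1) n = parse_count(argv[1], n);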

With n=10000000000 (10^10) the accelerated code no longer gives the correct value, but with a smaller value (e.g. 10^9) the pi value is OK.

Without the #pragma acc kernels (here, compiling without -acc), the serial version gives the correct result:

[fchen14@shelob006 c]$ pgcc -acc pi_acc.orig.c
[fchen14@shelob006 c]$ ./a.out
step = 0.000000000100000
pi = 0.560331986334500
[fchen14@shelob006 c]$ ./a.out 1000000000
step = 0.000000001000000
pi = 3.141592653589794
[fchen14@shelob006 c]$ pgcc pi_acc.orig.c
[fchen14@shelob006 c]$ ./a.out
step = 0.000000000100000
pi = 3.141592653589451

Hopefully I have made my problem clear. Could anyone tell me how to use a long long value with OpenACC?

Thanks a lot!

Feng

Hi Feng,

Looks like the compiler isn’t handling reductions with very large loop trip counts. I’ve added a problem report (TPR#20503) and sent it on to engineering.

The workaround would be to split this into two reductions:

% cat testA.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <stdint.h>
#include <math.h>

int main(int argc, char** argv) {
     long i, j, n;
     if (argc>1) n=atoi(argv[1]); // note: overwritten by the hard-coded n below
     double start_time, end_time;
     double x, pi;
     double sum = 0.0;
     double sumA = 0.0;
     double step;
     n=10000000000; //10^10
   //n=1000000000; //10^9
     step = 1.0/(double) n;
     printf("step = %17.15f %ld\n",step,n);
     n = (long) sqrtf((float)n); // inner and outer loops each run sqrt(n) times
#pragma acc kernels loop  gang reduction(+:sum)
     for (i = 0; i < n; i++) {
         sumA = 0.0;
#pragma acc loop vector reduction(+:sumA)
         for (j = 0; j < n; j++) {
             x = ((n*i)+j+0.5)*step;
             sumA +=  4.0/(1.0+x*x);
         }
         sum+=sumA;
     }
     pi = step * sum;
     printf("pi = %17.15f \n",pi);
     return 0;
}
% pgcc testA.c -fast -acc -Minfo=accel  ; a.out
main:
     20, Generating Tesla code
     21, Loop is parallelizable
         Accelerator kernel generated
         21, #pragma acc loop gang /* blockIdx.x */
             Sum reduction generated for sum
         24, #pragma acc loop vector(128) /* threadIdx.x */
             Sum reduction generated for sumA
         Loop is parallelizable
step = 0.000000000100000 10000000000
pi = 3.141592653589793

Thanks!
Mat
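
One caveat with the square-root split: it covers exactly n iterations only when n is a perfect square. For n = 10^10 the root is exactly 100000, so nothing is lost. For n = 10^9, (long) sqrtf((float)n) truncates to 31622, so the nested loops run 31622 * 31622 = 999,950,884 iterations and the integral stops just short of x = 1, which is consistent with the pi = 3.14149... printed by the 10^9 timing runs later in the thread. A minimal sketch of a split that keeps every iteration for any n, assuming a hypothetical function pi_blocked and a caller-chosen block size (neither is from the thread, and this has not been checked against the PGI compiler):

```c
#include <stdio.h>

/* Midpoint-rule estimate of pi over n intervals. The n iterations are
   split into blocks of at most `block` iterations (assumed > 0); the
   last block may be shorter, so no iteration is dropped.
   pi_blocked and `block` are illustrative names, not from the thread. */
double pi_blocked(long long n, long long block)
{
    double step = 1.0 / (double)n;
    long long nblocks = (n + block - 1) / block;   /* round up */
    double sum = 0.0;

#pragma acc kernels loop gang reduction(+:sum)
    for (long long b = 0; b < nblocks; b++) {
        long long lo = b * block;
        long long hi = (lo + block < n) ? lo + block : n;  /* clamp last block */
        double part = 0.0;
#pragma acc loop vector reduction(+:part)
        for (long long i = lo; i < hi; i++) {
            double x = ((double)i + 0.5) * step;
            part += 4.0 / (1.0 + x * x);
        }
        sum += part;
    }
    return step * sum;
}
```

Both the outer and inner trip counts stay far below 10^10 as long as block is chosen near the square root of n, which is the property the workaround relies on.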

Hi Mat, thanks for providing the workaround. The reason I use long long is that I am trying to demonstrate the speedup of OpenACC with a simple directive. For this example, however, the speedup does not seem apparent unless n is as large as 10^10.

Feng

Hi Feng,

While not as good as with 10^10, I do see a nice speed-up with 10^9 as well. If you’re on Linux, you might want to run the “pgcudainit” utility in the background. The OS powers down the device when it is not in use, and it costs about 1-2 seconds per device to power back up. “pgcudainit” holds the device open so you don’t incur the start-up cost. For longer-running applications the start-up penalty doesn’t matter, but it can have an impact in these small examples.

- Mat


% pgcudainit &
[1] 21130
 pgcudainit called cuInit, now waiting for input
% time gpu.out
step = 0.000000000100000 10000000000
pi = 3.141592653589793
0.247u 0.347s 0:00.62 93.5%     0+0k 0+0io 0pf+0w
% time cpu.out
step = 0.000000000100000 10000000000
pi = 3.141592653589754
28.907u 0.004s 0:28.99 99.6%    0+0k 0+0io 0pf+0w

% time gpu9.out
step = 0.000000001000000 1000000000
pi = 3.141494419177372
0.023u 0.219s 0:00.26 88.4%     0+0k 0+0io 0pf+0w
% time cpu9.out
step = 0.000000001000000 1000000000
pi = 3.141494419177361
2.900u 0.001s 0:02.91 99.6%     0+0k 0+0io 0pf+0w

Without pgcudainit:

% time gpu9.out
step = 0.000000001000000 1000000000
pi = 3.141494419177372
0.010u 0.575s 0:02.78 20.8%     0+0k 0+0io 0pf+0w
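
The time command above measures the whole process, device start-up included. A rough way to look at the loop alone is to time it inside the program. The sketch below is only an illustration: wall_seconds and timed_pi are hypothetical names, the timer uses POSIX clock_gettime, and the acc_init call (an OpenACC runtime routine, compiled in only when the _OPENACC macro is defined) is there on the assumption that initializing the device before the timer starts keeps most of the start-up cost out of the measurement.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Seconds on a monotonic wall clock (POSIX clock_gettime). */
static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Hypothetical helper: computes pi with the simple loop and stores the
   elapsed wall time of the loop itself in *elapsed. */
double timed_pi(long long n, double *elapsed)
{
    double step = 1.0 / (double)n;
    double sum = 0.0;

#ifdef _OPENACC
    acc_init(acc_device_default);   /* initialize the device before timing */
#endif

    double t0 = wall_seconds();
#pragma acc kernels
    for (long long i = 0; i < n; i++) {
        double x = ((double)i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    *elapsed = wall_seconds() - t0;
    return step * sum;
}
```

Printing *elapsed next to the pi value gives a loop-only figure to compare against the wall-clock totals reported by time.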

TPR 20503 (OpenACC: reduction gives bad answers when summing loop with very large trip count) is fixed in the current 14.7 release.