Differences between Host and Accel Code

Hi,

This may be a newbie question, but the output from the code below isn’t quite what I expected it to be. I am trying to do quick Mandelbrot set area calculation. I’d have expected the results to be identical in each case (host/accel) but they differ.
I’m using PGI v 11.3.

Any help appreciated.

-Nick.

Code:

#include <stdio>
#include <stdlib>
#include <math>
#include <time>
#include <sys>
#include <omp>
#ifdef ACCELERATOR_PGI
#include <accel>
#endif

/* Algorithm parameters */
//# define NPOINTS 100352
#define NPOINTS 10000
#define MAXITER 10000

/* Cut off at which an iteration is ruled outside of set */
#define CEILING 4.0 

/* Bounding box in which Mandelbrot set is know to lie */
#define RMIN -2.0
#define RMAX  0.5
#define IMIN  0.0
#define IMAX  1.125

/* Prototypes */
int min(int a,int b);
float getTime(void);

int main(int argc, char *argv[])
{

  double csetre[NPOINTS];
  double csetim[NPOINTS];
  double numarray[NPOINTS];

  int numInside;
  int i;
  int j;
  double area;
  double error;
  float start_time;
  float end_time;
  double cre;
  double cim;
  double zre;
  double zim;
  double ztemp;

  numInside = 0;
  ztemp = 0.0;

  /* Generate a set of nPoints random points in complex plane. */
  srand(12345);

  for(i=0; i<NPOINTS; i++){
    csetre[i] = RMIN + 
      (RMAX - RMIN)*((double) rand() / (double) RAND_MAX);

    csetim[i] = IMIN + 
      (IMAX - IMIN)*((double) rand() / (double) RAND_MAX);

    numarray[i] = 0.0;
  }


  start_time = getTime();

  /* Kernel that iterates over set of points, counting up those that are
     inside Mandelbrot set */


  
  for (j=0; j<NPOINTS; j++){
    zre = csetre[j];
    cre = csetre[j];
    zim = csetim[j];
    cim = csetim[j];


    for(i=0; i<MAXITER; i++){
      /* z_{i+1} = z_i^2 + c */
      ztemp = (zre*zre - zim*zim) + cre;
      zim  = 2.0*zre*zim + cim;
      zre  = ztemp;
    }
    /* Having completed iteration, check if point is inside set */
    if((zre*zre + zim*zim) < CEILING){
      numInside++;
    }
  }
  /*  Calculate area and error and output the Results */
  area = 2.0*(RMAX-RMIN)*(IMAX-IMIN) * (double) numInside / (double) NPOINTS;
  error = area / sqrt((double) NPOINTS);

  end_time = getTime();
  printf("NPOINTS    : %d\n", NPOINTS);
  printf("MAXITER    : %d\n", MAXITER);
  printf("NumInside  : %d\n", numInside);
  printf("Area       : %16.12f +/- %16.12f (host)\n", area, error);
  printf("Time       : %16.12f secs (host)\n", end_time-start_time);

  
#ifdef ACCELERATOR_PGI
  area = 0.0;
  error = 0.0;
  zre = cre = zim = cim = 0.0;

  start_time = getTime();
  // Only do this if we are accelerated using PGI compiler
#pragma acc region
  {
    for (j=0; j<NPOINTS; j++){
      zre = csetre[j];
      cre = csetre[j];
      zim = csetim[j];
      cim = csetim[j];
      
      for(i=0; i<MAXITER; i++){
	/* z_{i+1} = z_i^2 + c */
	ztemp = (zre*zre - zim*zim) + cre;
	zim  = 2.0*zre*zim + cim;
	zre  = ztemp;
      }
      /* Having completed iteration, store results (whatever it is) in a big array */
      numarray[j] = (zre*zre + zim*zim);
    }
  }

  /* Back on the host, do the analysis */
  numInside = 0;
  for (j=0; j<NPOINTS; j++){
    if (numarray[j] < CEILING){
      numInside++;
    }
  }


  area = 2.0*(RMAX-RMIN)*(IMAX-IMIN) * (double) numInside / (double) NPOINTS;
  error = area / sqrt((double) NPOINTS);

  end_time = getTime();
  printf("NPOINTS    : %d\n", NPOINTS);
  printf("MAXITER    : %d\n", MAXITER);
  printf("NumInside  : %d\n", numInside);
  printf("Area       : %16.12f +/- %16.12f (accel)\n", area, error);
  printf("Time       : %16.12f secs (accel)\n", end_time-start_time);
#endif

  return(0);

}


/* Simple function to compute minimum of two integers */
int min(int a, int b)
{
  if (a < b)
    return a;
  else
    return b;
}

float getTime(void){
  return (float) clock() / (float) CLOCKS_PER_SEC;  
}

I compile the job using:

 pgcc -o vtspg3 -ta=nvidia -Minfo=all,accel -DACCELERATOR_PGI area-comparison.c

And get the output as:

NPOINTS    : 10000
MAXITER    : 10000
NumInside  : 2671
Area       :   1.502437500000 +/-   0.015024375000 (host)
Time       :   0.649999976158 secs (host)
NPOINTS    : 10000
MAXITER    : 10000
NumInside  : 3115
Area       :   1.752187500000 +/-   0.017521875000 (accel)
Time       :   0.580000042915 secs (accel)

Hi Nickaj,

I’d have expected the results to be identical in each case (host/accel) but they differ.

Not necessarily. The GPU can use different mathematical operations, in this case Fuse Multiply Add (FMA) instructions, which can lead to different results. Compiling your code with “-ta=nvidia,nofma”, will yield the same answers as the CPU.

Note, I modified your code to replace "ACCELERATOR_PGI " with “_ACCEL”. “_ACCEL” gets defined by the compiler when “-ta” is used.

Hope this helps,
Mat

% pgcc -fastsse mandle.c -ta=nvidia,nofma -Minfo=accel
main:
    110, Generating copyin(csetim[0:9999])
         Generating copyin(csetre[0:9999])
         Generating copyout(numarray[0:9999])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
    112, Loop is parallelizable
         Accelerator kernel generated
        112, #pragma acc for parallel, vector(256) /* blockIdx.x threadIdx.x */
             Using register for 'csetim'
             Using register for 'csetre'
             CC 1.3 : 14 registers; 48 shared, 8 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 15 registers; 8 shared, 56 constant, 0 local memory bytes; 100% occupancy
    118, Loop carried scalar dependence for 'zim' at line 120
         Loop carried scalar dependence for 'zim' at line 121
         Scalar last value needed after loop for 'zim' at line 125
         Loop carried scalar dependence for 'zre' at line 120
         Loop carried scalar dependence for 'zre' at line 121
         Scalar last value needed after loop for 'zre' at line 125
         Accelerator restriction: scalar variable live-out from loop: zre
         Accelerator restriction: scalar variable live-out from loop: zim
         Inner sequential loop scheduled on accelerator
% a.out
NPOINTS    : 10000
MAXITER    : 10000
NumInside  : 2671
Area       :   1.502437500000 +/-   0.015024375000 (host)
Time       :   0.439999997616 secs (host)
NPOINTS    : 10000
MAXITER    : 10000
NumInside  : 2671
Area       :   1.502437500000 +/-   0.015024375000 (accel)
Time       :   0.110000014305 secs (accel)