OpenACC diff between GPU + CPU codes

This may have been covered already, but I couldn’t find an exact match.

I’m using 12.3 to run a trivial kernel on both the host and the accelerator, then taking the difference between the results to check for numerical equivalence.

I have the kernel written in OpenACC, which I think is valid:

#pragma acc data copyin(A, x, nx, ny), create(j, i, tmp, tmpscalar), copyout(ya)
  {

#pragma acc kernels loop
    for (i= 0; i < ny; i++)
      ya[i] = 0;


#pragma acc kernels loop private(tmpscalar) independent, gang
    for (i = 0; i < nx; i++){

      tmpscalar = 0;

#pragma acc loop vector
      for (j = 0; j < ny; j++){
        tmpscalar += A[i][j] * x[j];
      }
      tmp[i] = tmpscalar;
    }

#pragma acc kernels loop private(tmpscalar) independent, gang
    for (j = 0; j < ny; j++){
      tmpscalar = 0;

#pragma acc loop vector
      for (i = 0; i < nx; i++){
        tmpscalar += A[i][j] * tmp[i];
      }
      ya[j] = tmpscalar;

    }

  }

Given that all my arrays and temporary scalars are declared as float and the index counters are ints, I’d expect this to give a reasonably similar result on both devices. However, I get an error of ~1e-3 for a small problem size (10x10), which rises to ~1e+2 as the problem size increases to 100x100.

If I swap float for double I get a similar result but at a lower magnitude (i.e. on the order of 1e-8 for a small problem).

I’ve tried with and without -nofma but the result is the same.

The host code is the code above without the directives and the compilation output is:

main:
     79, Generating local(tmpscalar)
         Generating local(tmp[:])
         Generating local(i)
         Generating local(j)
         Generating copyout(ya[:])
         Generating copyin(ny)
         Generating copyin(nx)
         Generating copyin(x[:])
         Generating copyin(A[:][:])
     82, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     83, Loop is parallelizable
         Accelerator kernel generated
         83, #pragma acc loop gang, vector(10) /* blockIdx.x threadIdx.x */
             CC 1.0 : 3 registers; 36 shared, 8 constant, 0 local memory bytes; 33% occupancy
             CC 2.0 : 6 registers; 4 shared, 48 constant, 0 local memory bytes; 16% occupancy
     87, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     88, Loop is parallelizable
         Accelerator kernel generated
         88, #pragma acc loop gang, vector(10) /* blockIdx.x threadIdx.x */
             CC 1.0 : 13 registers; 52 shared, 12 constant, 0 local memory bytes; 33% occupancy
             CC 2.0 : 20 registers; 4 shared, 64 constant, 0 local memory bytes; 16% occupancy
     93, Loop is parallelizable
     99, Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
    100, Loop is parallelizable
         Accelerator kernel generated
        100, #pragma acc loop gang, vector(10) /* blockIdx.x threadIdx.x */
             CC 1.0 : 14 registers; 52 shared, 8 constant, 0 local memory bytes; 33% occupancy
             CC 2.0 : 21 registers; 4 shared, 64 constant, 0 local memory bytes; 16% occupancy
    104, Loop is parallelizable

This looks reasonably sensible. I’d like to expose more parallelism in the inner loops, but for now this isn’t a major problem.

If anyone could shed some light on the nature of this I’d be grateful, as I sometimes see it in other kernels and would like to know whether it’s a feature of the system or a problem with my coding.

Cheers,
-Nick.

Hi Nick,

Keep in mind that the current PGI implementation of the “kernels” model only supports tightly nested loops. The “parallel” model (partially supported in 12.5, with full support expected in 12.6) will be more flexible. Hence, in your code the inner loop schedules are being ignored and only the outer loops are being parallelized. Also, the vector width of 10 is odd, but I’d need to see the loops in context to know why the compiler chose it.
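
To illustrate what I mean by “tightly nested” (just a sketch over made-up arrays, not your code): the first nest below has nothing between the two for statements, so the “kernels” model can schedule both levels; the second has work between the loop headers, so only the outer loop schedule is honored.

#include <stdio.h>

#define N 16

int main(void)
{
  float A[N][N], B[N][N], C[N][N], x[N], tmp[N];
  float tmpscalar;
  int i, j;

  for (i = 0; i < N; i++) {
    x[i] = (float)i;
    for (j = 0; j < N; j++) {
      A[i][j] = (float)(i + j);
      B[i][j] = (float)(i - j);
    }
  }

  /* Tightly nested: nothing between the two for statements, so both
     loop levels can be scheduled. */
#pragma acc kernels loop
  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      C[i][j] = A[i][j] + B[i][j];

  /* Not tightly nested: the tmpscalar assignment sits between the loop
     headers, so only the outer loop schedule is honored here. */
#pragma acc kernels loop
  for (i = 0; i < N; i++) {
    tmpscalar = 0;
    for (j = 0; j < N; j++)
      tmpscalar += A[i][j] * x[j];
    tmp[i] = tmpscalar;
  }

  printf("%f %f\n", C[N-1][N-1], tmp[N-1]);
  return 0;
}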

Without the full source, I can’t really tell why you’re getting wrong answers. It could be because of the vector width, so I’d try removing the inner loop schedules, adding an explicit vector clause on the outer loop, and varying the width.

Another possible issue is that you’ve used the “create” clause on your loop index variables. By default, scalar variables are privatized so that each thread gets its own copy. Having never put an index variable in a data create clause, I’m not really sure what happens. I believe the compiler will just ignore it (like it does in the PGI Accelerator Model) and still use the private copy, but if not, then you’re using a single shared global variable across all threads.

Try the following code:

#pragma acc data copyin(A, x), create(tmp), copyout(ya) 
  { 

#pragma acc kernels loop 
    for (i= 0; i < ny; i++) 
      ya[i] = 0; 

// Is independent needed?
#pragma acc kernels loop independent, gang, vector(32)   
    for (i = 0; i < nx; i++){ 
      tmpscalar = 0; 
      for (j = 0; j < ny; j++){ 
        tmpscalar += A[i][j] * x[j]; 
      } 
      tmp[i] = tmpscalar; 
    } 

#pragma acc kernels loop independent, gang, vector(32)
    for (j = 0; j < ny; j++){ 
      tmpscalar = 0; 

      for (i = 0; i < nx; i++){ 
        tmpscalar += A[i][j] * tmp[i]; 
      } 
      ya[j] = tmpscalar; 
    } 
  }

Now once “parallel” and the “reduction” clause are added (currently expected in 12.6), you’ll want to change your code to something like:

#pragma acc data copyin(A, x), create(tmp), copyout(ya) 
  { 

#pragma acc parallel
{
#pragma acc loop 
    for (i= 0; i < ny; i++) 
      ya[i] = 0; 

#pragma acc loop gang
    for (i = 0; i < nx; i++){ 
      tmpscalar = 0; 
#pragma acc loop vector(32), reduction(+:tmpscalar)
      for (j = 0; j < ny; j++){ 
        tmpscalar += A[i][j] * x[j]; 
      } 
      tmp[i] = tmpscalar; 
    } 

#pragma acc loop gang
    for (j = 0; j < ny; j++){ 
      tmpscalar = 0; 
#pragma acc loop vector(32), reduction(+:tmpscalar)
      for (i = 0; i < nx; i++){ 
        tmpscalar += A[i][j] * tmp[i]; 
      } 
      ya[j] = tmpscalar; 
    } 
  }
}

Hope this helps,
Mat

Hi Mat,

Thanks for a very helpful response. I’d been wondering when it was appropriate to use a kernels construct over a parallel construct, so this has cleared it up. Hopefully when 12.6 is released I’ll be able to experiment further.

I’d also been oblivious to the fact that scalars are privatised, which clears up some confusion I’d had in the past. Is there an equivalent option to the ‘default(none)’ I’d use in OpenMP which would force me to manage the data moves and variable creation? For example, in my original code, if I remove nx and ny from the copyin clause on the data region, they still get copied but are not shown in the compiler output.
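
For reference, this is the OpenMP behaviour I have in mind (just an illustrative sketch, not my benchmark code): with default(none), the compiler refuses to build unless the sharing of every referenced variable is stated explicitly, so a forgotten clause becomes a compile-time error rather than a silent implicit transfer.

#include <stdio.h>

#define NX 10
#define NY 10

int main(void)
{
  float A[NX][NY], x[NY], tmp[NX];
  float tmpscalar;
  int i, j;

  for (j = 0; j < NY; j++)
    x[j] = (float)j;
  for (i = 0; i < NX; i++)
    for (j = 0; j < NY; j++)
      A[i][j] = ((float)(i * j)) / 10;

  /* Every variable used inside the region must appear in a clause;
     the parallel loop index i is predetermined private. */
#pragma omp parallel for default(none) shared(A, x, tmp) private(j, tmpscalar)
  for (i = 0; i < NX; i++) {
    tmpscalar = 0;
    for (j = 0; j < NY; j++)
      tmpscalar += A[i][j] * x[j];
    tmp[i] = tmpscalar;
  }

  printf("%f\n", tmp[NX - 1]);
  return 0;
}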

Finally, the vector(10) outputs were from using a problem size where the arrays were of length 10, so that appears sensible.

Here’s a complete code listing:

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <math.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

int main(int argc, char** argv)
{
  int i, j;
  int nx = 10;
  int ny = 10;

  float A[10][10];
  float x[10];
  float y[10];
  float ya[10];
  float tmp[10];
  float tmpscalar;

  for (i = 0; i < 10; i++){
    x[i] = i * M_PI;
    for (j = 0; j < 10; j++){
      A[i][j] = ((float) i*j) / 10;
    }
  }


  acc_init(acc_device_nvidia);

  for (i= 0; i < ny; i++)
    y[i] = 0;

  for (i = 0; i < nx; i++){
    tmpscalar = 0;

    for (j = 0; j < ny; j++){
      tmpscalar += A[i][j] * x[j];
    }
    tmp[i] = tmpscalar;
  }


  for (j = 0; j < ny; j++){
    tmpscalar = 0;

    for (i = 0; i < nx; i++){
      tmpscalar += A[i][j] * tmp[i];
    }
    y[j] = tmpscalar;
  }

#pragma acc data copyin(A, x), create(tmp), copyout(ya)
  {

#pragma acc kernels loop
    for (i= 0; i < ny; i++)
      ya[i] = 0;


#pragma acc kernels loop
    for (i = 0; i < nx; i++){

      tmpscalar = 0;

      for (j = 0; j < ny; j++){
        tmpscalar += A[i][j] * x[j];
      }
      tmp[i] = tmpscalar;
    }

#pragma acc kernels loop
    for (j = 0; j < ny; j++){
      tmpscalar = 0;

      for (i = 0; i < nx; i++){
        tmpscalar += A[i][j] * tmp[i];
      }
      ya[j] = tmpscalar;

    }

  } // end data

  float maxdiff = 0;
  float diff = 0;
  for (j=0;j<ny;j++){
    diff = fabs(y[j] - ya[j]);
    if (diff > maxdiff) maxdiff = diff;
  }

  printf("%20.20f\n", maxdiff);

  return 0;
}

I get maxdiff reported as: 0.001953125

-Nick.

Hi Nick,

Is there an equivalent option to the ‘default(none)’

No, but I’ll send a note to our rep on OpenACC and ask him to bring it up during their conference calls.

they still get copied but are not shown in the compiler output.

nx and ny are being passed in as arguments to your kernels.

I get maxdiff reports as: 0.001953125

I’m not convinced that there is a problem, but I can’t fully explain all the differences either. Below are my results for the “ya” and “diff” variables with and without FMA (fused multiply-add). FMA combines an “a = a + A*B” expression into a single operation, which rounds differently than a separate multiply and add. As you can see, disabling FMA makes the answers more closely match the CPU results. I can’t yet account for the difference in element 9, but I can dig deeper if you need me to.

- Mat

Tesla C2070 with FMA enabled (versus no FMA on the CPU)

j=0: 0.000000 0.000000
j=1: 2551.758789 0.000000
j=2: 5103.517578 0.000000
j=3: 7655.276367 0.000000
j=4: 10207.035156 0.000000
j=5: 12758.793945 0.000977
j=6: 15310.552734 0.000000
j=7: 17862.312500 0.001953
j=8: 20414.070312 0.000000
j=9: 22965.830078 0.001953
0.00195312500000000000

Tesla C2070 without FMA enabled (versus no FMA on the CPU)

j=0: 0.000000 0.000000
j=1: 2551.758789 0.000000
j=2: 5103.517578 0.000000
j=3: 7655.276367 0.000000
j=4: 10207.035156 0.000000
j=5: 12758.792969 0.000000
j=6: 15310.552734 0.000000
j=7: 17862.310547 0.000000
j=8: 20414.070312 0.000000
j=9: 22965.830078 0.001953
0.00195312500000000000
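
If it helps, the single-rounding behaviour of FMA can be reproduced on the host with fmaf() from math.h. This is just a minimal sketch with contrived values (nothing to do with your benchmark), but it shows how a multiply-then-add that rounds twice can differ from a fused operation that rounds once:

#include <stdio.h>
#include <math.h>

int main(void)
{
  float a = 1.0f / 3.0f;
  float b = 3.0f;
  float c = -1.0f;

  /* The volatile forces the product to be rounded to float before the
     add, i.e. two roundings, as on the non-FMA path. */
  volatile float prod = a * b;
  float separate = prod + c;       /* prints 0.0                       */
  float fused    = fmaf(a, b, c);  /* exact product, one rounding:
                                      prints 2^-25, about 3.0e-8       */

  printf("separate : %.10e\n", separate);
  printf("fused    : %.10e\n", fused);
  return 0;
}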

Thanks Mat,

When I run the same code on my system compiled with:

pgcc -o a a.c -acc -Minfo=all -ta=nvidia:nofma

I get a difference of 0 for all elements. This is using a C2050, and the zero difference remains as I scale up the problem size (replacing 10 with 100 in the code), which had previously caused problems with FMA. My current guess is that the rounding difference introduced by FMA, combined with large values being stored in a float (i.e. values with more than 7 significant digits), means I should always expect a difference.
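
As a quick sanity check on that guess (a sketch only, separate from the benchmark code): the maxdiff of 0.001953125 is exactly 2^-9, which is one unit in the last place for a float near the largest result value (~22965), so a single rounding difference is enough to produce the whole discrepancy.

#include <stdio.h>
#include <math.h>

int main(void)
{
  /* Spacing between adjacent floats near the largest element of ya.
     For values in [16384, 32768) the float ulp is 2^-9 = 0.001953125,
     which matches the maxdiff reported above. */
  float v   = 22965.830078f;
  float ulp = nextafterf(v, INFINITY) - v;

  printf("ulp near %.6f = %.9f\n", v, ulp);
  return 0;
}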

Again, many thanks for your help with this; it’s cleared up a few misunderstandings I had about OpenACC.

-Nick.

Nick: I’ve suggested ‘default(none)’ to the OpenACC group and it was received well. I think we’ll add this in the next revision of the spec, and PGI will implement it even earlier than that.

-mw