Variables and constants on accelerator

nickaj · October 13, 2011, 3:58pm

I was recently trying to do a quick benchmark of a machine, comparing the execution speed between host and device. After some tinkering with the code, I came across the following.

If I have an accelerated loop such as:

#pragma acc region for parallel
  for(j=0; j<nPoints; j++){
    
    zre = cre = csetre[j];
    zim = cim = csetim[j];
    
    for(i=0; i<MAXITER; i++){
        ztemp = (zre*zre - zim*zim) + cre;
        zim  = 2.0*zre*zim + cim;
        zre  = ztemp;
        if ((zre*zre + zim*zim) > CEILING) break;
     }
    
    if((zre*zre + zim*zim) < CEILING)
      numInside++;
  }

I get one answer. If I replace MAXITER (which is #defined in the source) with maxIter, an integer variable holding the same value, I get a different answer. I’d come across this previously, so I compiled using

pgcc -o pgi_bug -O3 -ta=nvidia,cc20,nofma -Minfo=accel pgi_bug_producer.c -fastsse

…the nofma option sorts out the maths so that both device and host now give the same answer. The problem I now have is that the runtime with MAXITER is ~3 times that with maxIter.
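For anyone wondering why a flag like nofma can change the answer at all: a fused multiply-add rounds once, after the add, whereas a separate multiply and add round twice. The sketch below illustrates that in single precision. It is only an illustration, not the code the compiler generates: the function names are made up, and the "fused" version is emulated by carrying the product in double (where the product of two floats is exact) rather than by using a real device instruction.

```c
/* Multiply, round to float, then add: two roundings, as with separate
   multiply and add instructions. The volatile store forces the product
   to be rounded to float before the add. */
float mul_then_add(float a, float b, float c)
{
    volatile float product = a * b;
    return product + c;
}

/* Emulated fused multiply-add (an assumption for illustration only):
   the product of two floats is exact in double, so rounding to float
   is deferred until after the add. */
float fused_mul_add(float a, float b, float c)
{
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}
```

With a = b = 1 + 2^-13 and c = -(1 + 2^-12), mul_then_add returns exactly 0 while fused_mul_add returns 2^-26, because the low-order bit of the product is lost in the first version and kept in the second. In an iteration that feeds its own output back in, as the loop above does, a difference of that size can be enough to push a borderline point across the CEILING test.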

Whilst it’s not a problem at the moment, it’d be interesting to know why this happens.
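When comparing the constant and variable versions, a plain host-side reference with the iteration limit passed as a parameter can be handy for checking either device result. The following is a minimal sketch under some assumptions: the post does not give the variable types or the value of CEILING, so double and 4.0 are guesses, and the function name count_inside is made up for this example.

```c
#define CEILING 4.0   /* assumed value; the original post does not state it */

/* Count the points whose orbit is still below CEILING after at most
   max_iter steps. Same arithmetic as the accelerated loop, but run
   sequentially on the host, with the limit as a run-time argument. */
int count_inside(const double *csetre, const double *csetim,
                 int n_points, int max_iter)
{
    int num_inside = 0;

    for (int j = 0; j < n_points; j++) {
        double cre = csetre[j], cim = csetim[j];
        double zre = cre, zim = cim;

        for (int i = 0; i < max_iter; i++) {
            double ztemp = (zre * zre - zim * zim) + cre;
            zim = 2.0 * zre * zim + cim;
            zre = ztemp;
            if ((zre * zre + zim * zim) > CEILING) break;
        }

        if ((zre * zre + zim * zim) < CEILING)
            num_inside++;
    }
    return num_inside;
}
```

Calling it once with the literal MAXITER and once with the maxIter variable should give the same count on the host, which helps separate a numerical difference on the device from a bug in the loop itself.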

MatColgrove · October 14, 2011, 7:37pm

Hi nickaj,

What does the -Minfo=accel output say about how the two versions are scheduled?

Mat

nickaj · October 18, 2011, 1:31pm

Here are the compiler outputs for both versions. My code runs both the MAXITER (constant) version and the maxIter (integer variable) version in one program, hence the dual output. The MAXITER (constant) version is first.

Normal (i.e., pgcc -ta=nvidia,cc20 …):

    162, Generating copyin(csetre[0:nPoints-1])
         Generating copyin(csetim[0:nPoints-1])
         Generating compute capability 2.0 binary
    163, Loop carried scalar dependence for 'numInside' at line 183
         Accelerator kernel generated
        163, #pragma acc for parallel, vector(256) /* blockIdx.x threadIdx.x */
             CC 2.0 : 15 registers; 1032 shared, 76 constant, 0 local memory bytes; 100% occupancy
        183, Sum reduction generated for numInside
    171, Loop carried scalar dependence for 'zim' at line 173
         Loop carried scalar dependence for 'zim' at line 174
         Scalar last value needed after loop for 'zim' at line 182
         Loop carried scalar dependence for 'zre' at line 173
         Loop carried scalar dependence for 'zre' at line 174
         Scalar last value needed after loop for 'zre' at line 182
         Accelerator restriction: scalar variable live-out from loop: zre
         Accelerator restriction: scalar variable live-out from loop: zim
         Inner sequential loop scheduled on accelerator
    196, Generating copyin(csetre[0:nPoints-1])
         Generating copyin(csetim[0:nPoints-1])
         Generating compute capability 2.0 binary
    197, Loop carried scalar dependence for 'numInside' at line 217
         Accelerator kernel generated
        197, #pragma acc for parallel, vector(256) /* blockIdx.x threadIdx.x */
             CC 2.0 : 15 registers; 1032 shared, 76 constant, 0 local memory bytes; 100% occupancy
        217, Sum reduction generated for numInside
    205, Loop carried scalar dependence for 'zim' at line 207
         Loop carried scalar dependence for 'zim' at line 208
         Scalar last value needed after loop for 'zim' at line 216
         Loop carried scalar dependence for 'zre' at line 207
         Loop carried scalar dependence for 'zre' at line 208
         Scalar last value needed after loop for 'zre' at line 216
         Accelerator restriction: scalar variable live-out from loop: zre
         Accelerator restriction: scalar variable live-out from loop: zim
         Inner sequential loop scheduled on accelerator

And using nofma (pgcc -ta=nvidia,cc20,nofma):

    162, Generating copyin(csetre[0:nPoints-1])
         Generating copyin(csetim[0:nPoints-1])
         Generating compute capability 2.0 binary
    163, Loop carried scalar dependence for 'numInside' at line 183
         Accelerator kernel generated
        163, #pragma acc for parallel, vector(256) /* blockIdx.x threadIdx.x */
             CC 2.0 : 19 registers; 1032 shared, 76 constant, 0 local memory bytes; 100% occupancy
        183, Sum reduction generated for numInside
    171, Loop carried scalar dependence for 'zim' at line 173
         Loop carried scalar dependence for 'zim' at line 174
         Scalar last value needed after loop for 'zim' at line 182
         Loop carried scalar dependence for 'zre' at line 173
         Loop carried scalar dependence for 'zre' at line 174
         Scalar last value needed after loop for 'zre' at line 182
         Accelerator restriction: scalar variable live-out from loop: zre
         Accelerator restriction: scalar variable live-out from loop: zim
         Inner sequential loop scheduled on accelerator
    196, Generating copyin(csetre[0:nPoints-1])
         Generating copyin(csetim[0:nPoints-1])
         Generating compute capability 2.0 binary
    197, Loop carried scalar dependence for 'numInside' at line 217
         Accelerator kernel generated
        197, #pragma acc for parallel, vector(256) /* blockIdx.x threadIdx.x */
             CC 2.0 : 19 registers; 1032 shared, 76 constant, 0 local memory bytes; 100% occupancy
        217, Sum reduction generated for numInside
    205, Loop carried scalar dependence for 'zim' at line 207
         Loop carried scalar dependence for 'zim' at line 208
         Scalar last value needed after loop for 'zim' at line 216
         Loop carried scalar dependence for 'zre' at line 207
         Loop carried scalar dependence for 'zre' at line 208
         Scalar last value needed after loop for 'zre' at line 216
         Accelerator restriction: scalar variable live-out from loop: zre
         Accelerator restriction: scalar variable live-out from loop: zim
         Inner sequential loop scheduled on accelerator

-Nick.