dependence in loop prevents parallelization

Hi,
I’m using pgi 10.1 on linux 64bit.

I have problems with dependencies in loops, since I get the following messages when compiling with “pgcc -g -ta=nvidia,cc11 -Minfo -fastsse -c ./main.c -o main.o”:

calc:
     11, No parallel kernels found, accelerator region ignored
     15, Complex loop carried dependence of 'fArr2' prevents parallelization
     17, Complex loop carried dependence of 'fArr2' prevents parallelization
         Generated 4 alternate loops for the loop
         Generated vector sse code for the loop
main:
     38, Loop unrolled 4 times (completely unrolled)
     39, Loop unrolled 4 times (completely unrolled)

Everywhere it is said that I should either use the restrict keyword or the option -Msafeptr, but neither of these works in my case.

I reduced my program to a small example that copies data from one array to another (see below). I know it doesn't really make sense what I'm doing there, but first I want to get rid of these dependencies.
Does anyone have an idea?

#include <stdio.h>
#include <stdlib.h>

void calc(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
  int i,j;
  int n = iCols * iRows;
  float fVal;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
  {
#pragma acc for private(fVal,i,j)
    /* compute stencil, residual and update */
    for (j = 0; j < iRows; j++)
    {
      for (i = 0; i < iCols; i++)
      {
        fVal = 5.0f * fArr1[j*iCols+i];
        fArr2[j*iCols+i] = 2.0f * fVal;
      }
    }
  }
}

int main (int argc, char** argv)
{
    int retVal = 0;    /* return value */
    int i,j;
    int iCols = 4;
    int iRows = 4;

    /*Init arrays*/
    float *fArr1 = (float*) malloc(iCols * iRows * sizeof(float));
    float *fArr2 = (float*) malloc(iCols * iRows * sizeof(float));

    if (fArr1 && fArr2)
    {
      /* init only after checking that the allocations succeeded */
      for (j = 0; j < iRows; ++j) {
        for (i = 0; i < iCols; ++i) {
          fArr1[j*iCols+i] = i;
          fArr2[j*iCols+i] = 0.0f;
        }
      }
        /* running calculations */
      calc(fArr1,fArr2,iCols,iRows);

      /* print one example result */
      printf("Result[%d]: %f\n", iRows*iCols-1,fArr2[iRows*iCols-1]);
    }
    else
    {
        printf(" Memory allocation failed ...\n");
        retVal = -1;
    }

    /* cleanup */
    free(fArr1);
    free(fArr2);
    
    return retVal;
}

Hi Xray,

For data parallelism, the operations on each element of your array must be independent. While it's obvious to us in this case that each computed index resolves to a unique element, this isn't easily determined by the compiler. We are working on support for handling these simpler cases; in the meantime you'll need to use one of three options.

1) Force parallelization

You can force parallelization by using the "#pragma acc for parallel" directive before the "j" loop. Note that I would not recommend doing this unless you are sure all computed indices are unique.

void calc(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
  int i,j;
  int n = iCols * iRows;
  float fVal;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
  {
#pragma acc for parallel
    /* compute stencil, residual and update */
    for (j = 0; j < iRows; j++)
    {
      for (i = 0; i < iCols; i++)
      {
        fVal = 5.0f * fArr1[j*iCols+i];
        fArr2[j*iCols+i] = 2.0f * fVal;
      }
    }
  }
}

Side note: using the private clause is not necessary, since scalars are privatized by default.


2) Remove the “i” loop

While I don’t know your full source, in this example you can remove the “i” loop altogether.

void calc(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
  int j;
  int n = iCols * iRows;
  float fVal;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
  {
    /* compute stencil, residual and update */
    for (j = 0; j < n; j++)
    {
      fVal = 5.0f * fArr1[j];
      fArr2[j] = 2.0f * fVal;
    }
  }
}
3) Use multi-dimensional indexing

Instead of computing the index, you could use multi-dimensional indexing, though this would require you to change the entire program.

#include <stdio.h>
#include <stdlib.h>

void calc(float **restrict fArr1, float **restrict fArr2, int iCols, int iRows)
{
  int i,j;
  float fVal;
#pragma acc region copyin(fArr1[0:iRows-1][0:iCols-1]), copyout(fArr2[0:iRows-1][0:iCols-1])
  {
#pragma acc for parallel
    /* compute stencil, residual and update */
    for (j = 0; j < iRows; j++)
    {
      for (i = 0; i < iCols; i++)
      {
        fVal = 5.0f * fArr1[j][i];
        fArr2[j][i] = 2.0f * fVal;
      }
    }
  }
}

int main (int argc, char** argv)
{
    int retVal = 0;    /* return value */
    int i,j;
    int iCols = 4;
    int iRows = 4;

    /*Init arrays*/
    float **fArr1 = (float**) malloc(iRows * sizeof(float*));
    float **fArr2 = (float**) malloc(iRows * sizeof(float*));
    if (fArr1 && fArr2)
    {

    for (j = 0; j < iRows; ++j) {
      fArr1[j] = (float*) malloc(iCols * sizeof(float));
      fArr2[j] = (float*) malloc(iCols * sizeof(float));
      for (i = 0; i < iCols; ++i) {
        fArr1[j][i] = i;
        fArr2[j][i] = 0.0f;
      }
    }
      /* running calculations */
      calc(fArr1, fArr2, iCols, iRows);

      /* print one example result */
      printf("Result[%d][%d]: %f\n", iRows-1, iCols-1, fArr2[iRows-1][iCols-1]);
    }
    else
    {
        printf(" Memory allocation failed ...\n");
        retVal = -1;
    }

    /* cleanup (free the rows before the row-pointer arrays) */
    if (fArr1 && fArr2) {
      for (j = 0; j < iRows; ++j) {
        free(fArr1[j]);
        free(fArr2[j]);
      }
    }
    free(fArr1);
    free(fArr2);

    return retVal;
}

Side note: I changed fArr1 to use the "copyin" clause and fArr2 to use the "copyout" clause, since fArr1 is only read and fArr2 is only written on the device.

Hope this helps,
Mat

Thanks for your answers.
I tried all the suggested solutions with my sample code and they all worked. But what I've noticed is that if I force parallelisation I still get the compiler message that loop dependencies prevent parallelisation, and just after that I get: "Accelerator kernel generated". Do you know why?

And one more question:
In my bigger code I have a reduction, and with that it's not working. I also added the reduction to my sample code (within the i-loop) and get an internal compiler error: "pgnvd job exited with nonzero status code 0".
Is it possible that the reduction cannot be detected when forcing parallelism? It must have something to do with the nested loop, since usually a reduction works, doesn't it?

Hi Xray,

I still get the compiler message that loop dependencies prevent parallelisation, and just after that I get: "Accelerator kernel generated". Do you know why?

The compiler is still parallelizing the code; the message is just letting you know that it thinks you shouldn't be doing it, since it can't prove the iterations are independent.

In my bigger code I have a reduction, and with that it's not working. I also added the reduction to my sample code (within the i-loop) and get an internal compiler error: "pgnvd job exited with nonzero status code 0".
Is it possible that the reduction cannot be detected when forcing parallelism? It must have something to do with the nested loop, since usually a reduction works, doesn't it?

Support for reductions is brand new, so it may still have problems. Please get the example code to the point where you think it should work, then send a report to PGI Customer Support (trs@pgroup.com). We'll have one of our engineers (probably me) take a look and let you know what's going on.

Thanks,
Mat