dependence in loop prevents parallelization

Hi,
I’m using pgi 10.1 on linux 64bit.

I have problems with dependencies in loops, since I get the following messages when compiling with “pgcc -g -ta=nvidia,cc11 -Minfo -fastsse -c ./main.c -o main.o”:

calc:
     11, No parallel kernels found, accelerator region ignored
     15, Complex loop carried dependence of 'fArr2' prevents parallelization
     17, Complex loop carried dependence of 'fArr2' prevents parallelization
         Generated 4 alternate loops for the loop
         Generated vector sse code for the loop
main:
     38, Loop unrolled 4 times (completely unrolled)
     39, Loop unrolled 4 times (completely unrolled)

Everywhere it is said that I should either use the restrict keyword or the option -Msafeptr, but neither of these works in my case.

I reduced my program to a small example that copies data from one array to another (see below). I know it doesn't really make sense what I'm doing there, but first I want to get rid of these dependencies.
Does anyone have an idea?

#include <stdio.h>
#include <stdlib.h>

void calc(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
  int i,j;
  int n = iCols * iRows;
  float fVal;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
  {
#pragma acc for private(fVal,i,j)
    /* compute stencil, residual and update */
    for (j = 0; j < iRows; j++)
    {
      for (i = 0; i < iCols; i++)
      {
        fVal = 5.0f * fArr1[j*iCols+i];
        fArr2[j*iCols+i] = 2.0f * fVal;
      }
    }
  }
}

int main (int argc, char** argv)
{
    int retVal = 0;    /* return value */
    int i,j;
    int iCols = 4;
    int iRows = 4;

    /*Init arrays*/
    float *fArr1 = (float*) malloc(iCols * iRows * sizeof(float));
    float *fArr2 = (float*) malloc(iCols * iRows * sizeof(float));

    if (fArr1 && fArr2)
    {
      /* init only after checking that the allocations succeeded */
      for (j = 0; j < iRows; ++j) {
        for (i = 0; i < iCols; ++i) {
          fArr1[j*iCols+i] = i;
          fArr2[j*iCols+i] = 0.0f;
        }
      }
        /* running calculations */
      calc(fArr1,fArr2,iCols,iRows);

      /* print one example result */
      printf("Result[%d]: %f\n", iRows*iCols-1,fArr2[iRows*iCols-1]);
    }
    else
    {
        printf(" Memory allocation failed ...\n");
        retVal = -1;
    }

    /* cleanup */
    free(fArr1);
    free(fArr2);
    
    return retVal;
}

Hi Xray,

For data parallelism, the operations on each element of your array must be independent. While it's obvious to us in this case that each computed index resolves to a unique element, this isn't easily determined by the compiler. We are working on support for handling these simpler cases; in the meantime you'll need to use one of three options.

1) Force parallelization

You can force parallelization by using the "#pragma acc for parallel" directive before the "j" loop. Note that I would not recommend doing this unless you are sure all computed indices are unique.

void calc(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
  int i,j;
  int n = iCols * iRows;
  float fVal;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
  {
#pragma acc for parallel
    /* compute stencil, residual and update */
    for (j = 0; j < iRows; j++)
    {
      for (i = 0; i < iCols; i++)
      {
        fVal = 5.0f * fArr1[j*iCols+i];
        fArr2[j*iCols+i] = 2.0f * fVal;
      }
    }
  }
}

Side note: using the private clause is not necessary, since scalars are privatized by default.


2) Remove the “i” loop

While I don’t know your full source, in this example you can remove the “i” loop altogether.

void calc(float *restrict fArr1, float *restrict fArr2, int iCols, int iRows)
{
  int j;
  int n = iCols * iRows;
  float fVal;
#pragma acc region copy(fArr1[0:n-1], fArr2[0:n-1])
  {
    /* compute stencil, residual and update */
    for (j = 0; j < n; j++)
    {
      fVal = 5.0f * fArr1[j];
      fArr2[j] = 2.0f * fVal;
    }
  }
}
3) Use multi-dimensional indexing

Instead of computing the index, you could use multi-dimensional indexing, though this would require you to change the entire program.

#include <stdio.h>
#include <stdlib.h>

void calc(float **restrict fArr1, float **restrict fArr2, int iCols, int iRows)
{
  int i,j;
  float fVal;
#pragma acc region copyin(fArr1[0:iRows-1][0:iCols-1]), copyout(fArr2[0:iRows-1][0:iCols-1])
  {
#pragma acc for parallel
    /* compute stencil, residual and update */
    for (j = 0; j < iRows; j++)
    {
      for (i = 0; i < iCols; i++)
      {
        fVal = 5.0f * fArr1[j][i];
        fArr2[j][i] = 2.0f * fVal;
      }
    }
  }
}

int main (int argc, char** argv)
{
    int retVal = 0;    /* return value */
    int i,j;
    int iCols = 4;
    int iRows = 4;

    /*Init arrays*/
    float **fArr1 = (float**) malloc(iRows * sizeof(float*));
    float **fArr2 = (float**) malloc(iRows * sizeof(float*));
    if (fArr1 && fArr2)
    {

    for (j = 0; j < iRows; ++j) {
      fArr1[j] = (float*) malloc(iCols * sizeof(float));
      fArr2[j] = (float*) malloc(iCols * sizeof(float));
      for (i = 0; i < iCols; ++i) {
        fArr1[j][i] = i;
        fArr2[j][i] = 0.0f;
      }
    }
      /* running calculations */
      calc(fArr1, fArr2, iCols, iRows);

      /* print one example result */
      printf("Result[%d][%d]: %f\n", iRows-1, iCols-1, fArr2[iRows-1][iCols-1]);
    }
    else
    {
        printf(" Memory allocation failed ...\n");
        retVal = -1;
    }

    /* cleanup (free the rows before the row-pointer arrays) */
    if (fArr1 && fArr2) {
      for (j = 0; j < iRows; ++j) {
        free(fArr1[j]);
        free(fArr2[j]);
      }
    }
    free(fArr1);
    free(fArr2);

    return retVal;
}

Side note: I changed fArr1 to use the "copyin" clause and fArr2 to use the "copyout" clause, since fArr1 is only read and fArr2 is only written on the device.

Hope this helps,
Mat

Thanks for your answers.
I tried all the suggested solutions with my sample code and they all worked. But what I've noticed is that if I force parallelisation I still get the compiler message that loop dependencies prevent parallelisation, and just after that I get: "Accelerator kernel generated". Do you know why?

And one more question:
In my bigger code I have a reduction, and with that it's not working. I also added the reduction to my sample code (within the i-loop) and get an internal compiler error: "pgnvd job exited with nonzero status code 0".
Is it possible that the reduction cannot be detected when forcing parallelism? It must have something to do with the nested loop, since usually a reduction works, doesn't it?

Hi Xray,

I still get the compiler message that loop dependencies prevent parallelisation, and just after that I get: "Accelerator kernel generated". Do you know why?

The compiler is still parallelizing the code; the message is just letting you know that it thinks you shouldn't be doing it, since it can't prove the iterations are independent.

In my bigger code I have a reduction, and with that it's not working. I also added the reduction to my sample code (within the i-loop) and get an internal compiler error: "pgnvd job exited with nonzero status code 0".
Is it possible that the reduction cannot be detected when forcing parallelism? It must have something to do with the nested loop, since usually a reduction works, doesn't it?

Support for reductions is brand new, so it may still have problems. Please get the example code to the point where you think it should work, then send a report to PGI Customer Support (trs@pgroup.com). We'll have one of our engineers (probably me) take a look and let you know what's going on.

Thanks,
Mat