Calculating loop bounds

As part of porting a larger code that uses CSR (compressed sparse row) data, I found I was having problems with what I thought should be a fairly simple kernel.

Here’s an example (with the variable declarations snipped for clarity):

  double vals[19] = {10, -2, ...
  int cols[19] = {1, 5, ...
  int row_ptr[7] = {1, 3, ...
  double vec[6] = {1, 2, 3, 4, 5, 6};


#pragma acc data region copyin(row_ptr,vals,cols,vec,dim) copyout(out) local(i,j,temp,cl,myrow,st,nd)
  {
#pragma acc region
    {
#pragma acc for parallel
      for (myrow=0; myrow<dim; myrow++){
        st = row_ptr[myrow]-1;
        nd = row_ptr[myrow+1]-1;
        temp = 0.0;
#pragma acc for vector
        for (j=st; j<nd; j++){
          cl = cols[j]-1;
          temp += vals[j] * vec[cl];
        }
        out[myrow] = temp;
      }
    }
  }

I would expect out[0] to be 0 (1*10 + 5*(-2)), but I get 10. Compiling the same code for the host, I do get the correct answer.
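
For reference, the host version is just the same loop nest with the directives ignored; roughly this sketch (the wrapper function and its name are only for illustration), which gives out[0] = 0:

  /* Host-side sketch of the same CSR matrix-vector product.
     For row 0, row_ptr selects entries 0 and 1, so
     out[0] = 10*vec[0] + (-2)*vec[4] = 10*1 + (-2)*5 = 0. */
  void csr_matvec_host(int dim, const int *row_ptr, const int *cols,
                       const double *vals, const double *vec, double *out)
  {
    for (int myrow = 0; myrow < dim; myrow++) {
      int st = row_ptr[myrow] - 1;        /* row_ptr and cols are 1-based */
      int nd = row_ptr[myrow+1] - 1;
      double temp = 0.0;
      for (int j = st; j < nd; j++)
        temp += vals[j] * vec[cols[j] - 1];
      out[myrow] = temp;
    }
  }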

Am I hitting trouble because I’m trying to calculate the bounds for the j loop in the i loop?

Hi nickaj,

Am I hitting trouble because I’m trying to calculate the bounds for the j loop in the i loop?

Yes, since the kernel’s schedule dimensions need to be known before the kernel is launched.

Keep in mind that the PGI Accelerator model was designed to work on nested parallel loops. So what the compiler wants to do in this case is combine the “myrow” and “j” loops to create the CUDA schedule (i.e., the block and thread-block sizes), which of course doesn’t work here since the range of the “j” loop isn’t known before the kernel is launched.
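
For contrast, here’s a rough sketch of the kind of rectangular loop nest the model maps cleanly, since both trip counts are fixed before the kernel is launched (the names N, M, a, x, y are made up purely for illustration):

/* Illustrative only: a dense matrix-vector product where both trip counts
   are known up front, so the compiler can build the CUDA schedule
   (number of blocks and threads per block) directly from the two loops. */
#define N 1024
#define M 512
double a[N][M], x[M], y[N];

void dense_mv(void)
{
  int i, j;
#pragma acc data region copyin(a, x) copyout(y)
  {
#pragma acc region
    {
#pragma acc for parallel
      for (i = 0; i < N; i++) {
        double sum = 0.0;
#pragma acc for vector
        for (j = 0; j < M; j++)
          sum += a[i][j] * x[j];
        y[i] = sum;
      }
    }
  }
}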

Very recently (within the last month) we have had a couple of users with a similar style of code, where they want to separate the parallel portion from the vector portion. So in this case, I’m assuming you’d want something like the following:

#pragma acc region
    {
#pragma acc for parallel
      // Have the compiler create "dim" number of blocks
      for (myrow=0; myrow<dim; myrow++){
        // Start the kernel here.
        // Have one thread compute the bounds and initialize temp
        st = row_ptr[myrow]-1;
        nd = row_ptr[myrow+1]-1;
        temp = 0.0;
        // call syncthreads

        // Now have the threads in the block perform a partial sum on a
        // portion of the computation.  The compiler needs to account for
        // cases where the range is larger than the number of threads in
        // the block.
#pragma acc for vector
        for (j=st; j<nd; j++){
          cl = cols[j]-1;
          temp += vals[j] * vec[cl];
        }
        // Have one thread perform the final reduction of temp
        // and store the value back to memory
        out[myrow] = temp;
        // call syncthreads
      }
    }
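
To make those comments concrete, a hand-written CUDA kernel following that schedule would look roughly like this. The block size, the names, and the simple one-thread final reduction are only for illustration; this is not what the compiler would actually generate:

// Rough CUDA sketch of the schedule described above: one thread block per
// row, the block's threads striding over that row's nonzeros, then a
// single thread finishing the reduction.
#define BLOCK 128

__global__ void csr_spmv(const double *vals, const int *cols,
                         const int *row_ptr, const double *vec,
                         double *out, int dim)
{
    __shared__ double partial[BLOCK];
    int myrow = blockIdx.x;              // "dim" blocks, one per row
    if (myrow >= dim) return;

    int st = row_ptr[myrow] - 1;         // bounds computed inside the kernel
    int nd = row_ptr[myrow + 1] - 1;

    double temp = 0.0;                   // each thread's partial sum
    for (int j = st + threadIdx.x; j < nd; j += blockDim.x)
        temp += vals[j] * vec[cols[j] - 1];

    partial[threadIdx.x] = temp;
    __syncthreads();

    if (threadIdx.x == 0) {              // one thread does the final reduction
        double sum = 0.0;
        for (int t = 0; t < BLOCK; t++)
            sum += partial[t];
        out[myrow] = sum;
    }
}

// launched as: csr_spmv<<<dim, BLOCK>>>(vals_d, cols_d, row_ptr_d, vec_d, out_d, dim);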

Again, this is something we have just started looking at, so I don’t have a solution for you right now short of just parallelizing the “myrow” loop and using the “independent” clause. We are trying to determine how the model can be extended so that this can be expressed.
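
That interim version would look something like the following sketch (declarations snipped as before). The inner “j” loop gets no directive, so each thread walks its own row sequentially, and the scalars are declared inside the loop body so they stay private to each row:

#pragma acc data region copyin(row_ptr,vals,cols,vec,dim) copyout(out)
  {
#pragma acc region
    {
#pragma acc for parallel independent
      for (myrow=0; myrow<dim; myrow++){
        int st = row_ptr[myrow]-1;
        int nd = row_ptr[myrow+1]-1;
        double temp = 0.0;
        /* no directive here: each thread runs this loop sequentially */
        for (int j=st; j<nd; j++){
          temp += vals[j] * vec[cols[j]-1];
        }
        out[myrow] = temp;
      }
    }
  }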

Thanks,
Mat

Hi Mat,

Yes, that’s what I was hoping to do. Thanks for the confirmation that it’s not just my coding!

-Nick.