Am I hitting trouble because I’m trying to calculate the bounds for the j loop in the i loop?
Yes, since the kernel schedule dimensions need to be known before it’s launched.
Keep in mind that the PGI Accelerator model was designed to work on nested parallel loops. So what the compiler wants to do in this case is put the “myrow” and “j” loops together to create the CUDA schedule (i.e block and thread block sizes). Which of course doesn’t work since the range of the “j” loop isn’t known.
Very recently (within the last month) we have had a couple users have a similar style where you want to separate the parallel from vector portions of the code. So in this case, I’m assuming you’d want something like the following:
#pragma acc region
{
#pragma acc for parallel
// Have the compiler create "dim" number of blocks
for (myrow=0; myrow<dim; myrow++){
// Start the kernel here
// Have one thread compute the bounds and init temp
st = row_ptr[myrow]-1;
nd = row_ptr[myrow+1]-1;
temp = 0.0;
// call syncthreads
// Now have the threads in the block perform a partial sum on a portion of the
// computation. The compiler needs to account for cases where the
// range is larger than the number of threads in the block
#pragma acc for vector
for (j=st; j<nd; j++){
cl = cols[j]-1;
temp += vals[j] * vec[cl];
}
// Have one thread perform the final
// reduction of temp and store the value back to memory
out[myrow] = temp;
// call syncthreads
}
}
}
Again, this is something we have just started looking at so I don’t have a solution for you now short of just parallelizing the “myrow” loop and using the “independent” clause. We are looking at and trying to determine how the model can be extended so this can be expressed.