Internal compiler error for simple OpenACC parallel loop?

David_Henty · May 16, 2012, 4:18pm

I am trying to use OpenACC to parallelise a very simple image processing code that I use as an example on a number of parallel programming courses.

Using “pgcc -acc -Minfo=accel -c dosharpen.c” I get a whole bunch of errors like:

/tmp/pgaccj63cB3vdLOXa.gpu(104): error: expression must have arithmetic or enum type
…
ending in:

19 errors detected in the compilation of “/tmp/pgnvd173cFGXA-4Gj.nv0”.

The loop isn’t parallelised although the compiler analysis is exactly what I would expect:

97, Generating copyin(scale)
Generating copyin(norm)
Generating copyin(sigmad4)
Generating copyin(filter0)
Generating copyin(d4)
Generating copyin(d)
Generating copyin(sharp[0:][0:])
Generating copy(fuzzy[0:][0:])
105, Loop is parallelizable
107, Loop is parallelizable
109, Loop carried dependence of ‘sharp’ prevents parallelization
Loop carried backward dependence of ‘sharp’ prevents vectorization
111, Complex loop carried dependence of ‘sharp’ prevents parallelization
Loop carried dependence of ‘sharp’ prevents parallelization
Loop carried backward dependence of ‘sharp’ prevents vectorization
Inner sequential loop scheduled on accelerator
Accelerator kernel generated
105, #pragma acc loop gang, vector(16) /* blockIdx.y threadIdx.y */
107, #pragma acc loop gang, vector(16) /* blockIdx.x threadIdx.x */
109, #pragma acc loop seq
111, #pragma acc loop seq

Any ideas what’s going on? I get the same errors if I use “parallel” rather than “kernels”.

The loop is appended. Note that the code is quite verbose and inefficient but this is because it is meant to be a training example for beginners.

Thanks,

David

#pragma acc data copy(fuzzy) copyin(sharp, d, d4, filter0, sigmad4, norm, scale)
{

#pragma acc kernels loop private(i, j, k, l, rd4sq, rsq, sigmad4sq, sigmasq, \
                                 rsq, delta, filter, filter0)
{
  for (i=0; i < nx; i++)
    {
      for (j=0; j < ny; j++)
        {
          for (k=-d; k <= d; k++)
            {
              for (l= -d; l <= d; l++)
                {
                  rd4sq = d4*d4;
                  rsq = d*d;

                  sigmad4sq = sigmad4*sigmad4;
                  sigmasq = sigmad4sq * (rsq/rd4sq);

                  rsq = 1.0*i*i + 1.0*j*j;
                  delta = rsq/(2.0*sigmasq);

                  filter = filter0 * (1.0-delta) * exp(-delta);

                  sharp[i][j] = sharp[i][j]
                              + filter*fuzzy[i+d+k][j+d+l];
                }
            }
        }
    }
}
}

MatColgrove · May 17, 2012, 11:41pm

Hi David.

The “expression must have arithmetic or enum type” error is a known issue in the OpenACC beta (TPR#18694). It was just reported last week so missed the deadline for 12.5 but hopefully we can have a fix in place for 12.6.

As for the “Loop carried dependence” messages, the compiler is correct: the two inner loops are not parallelizable because every iteration updates the same element of “sharp”.

Since sharp is being used as a sum reduction, to parallelize the inner loops, you need to use the “reduction” clause. However since “reduction” only works with scalars, you need to create a temp scalar value to sum the intermediate value and then store the result back into sharp. Something along the lines of:

#pragma acc data copy(fuzzy) copyin(sharp, d, d4, filter0, sigmad4, norm, scale)
{

#pragma acc kernels loop private(i, j, k, l, rd4sq, rsq, sigmad4sq, sigmasq, \
                                 rsq, delta, filter, filter0)
{
  for (i=0; i < nx; i++)
    {
      for (j=0; j < ny; j++)
        {
          sum = 0.0;   /* scalar temporary; reset for every (i,j) */
#pragma acc loop reduction(+:sum)
          for (k=-d; k <= d; k++)
            {
              for (l= -d; l <= d; l++)
                {
                  rd4sq = d4*d4;
                  rsq = d*d;

                  sigmad4sq = sigmad4*sigmad4;
                  sigmasq = sigmad4sq * (rsq/rd4sq);

                  rsq = 1.0*i*i + 1.0*j*j;
                  delta = rsq/(2.0*sigmasq);

                  filter = filter0 * (1.0-delta) * exp(-delta);

                  sum = sum + filter*fuzzy[i+d+k][j+d+l];
                }
            }
          sharp[i][j] += sum;   /* store the reduced value back */
        }
    }
}
}

The caveat is that inner loop reductions are not quite working as well as we’d like. Currently, you’ll get some dependency messages on the reduction variable and you’ll need to explicitly set the loop schedule on the outer loops.
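To make the last point concrete, here is a minimal, self-contained sketch of the scalar-reduction pattern with an explicit schedule on the outer loops. Several things in it are assumptions made for illustration and are not from the code in this thread: the function name sharpen_reduction, the flat (1-D) array layout, the single scalar weight standing in for the computed filter value, and the gang/worker/vector split, which is just one possible schedule and not a tuned one. The data clauses copy sharp back because it is the array being updated in this sketch.

```c
#include <assert.h>

/* Illustrative sketch only. Flat arrays, the scalar `weight`, and the
   function name are assumptions, not part of the original dosharpen.c. */
void sharpen_reduction(int nx, int ny, int d,
                       double *sharp, const double *fuzzy, double weight)
{
    int i, j, k, l;
    int fy = ny + 2 * d;    /* row length of the padded fuzzy array */

    assert(nx > 0 && ny > 0 && d >= 0);

    #pragma acc kernels copy(sharp[0:nx*ny]) copyin(fuzzy[0:(nx+2*d)*fy])
    {
        /* Outer loops: schedule set explicitly (one possible choice). */
        #pragma acc loop independent gang
        for (i = 0; i < nx; i++) {
            #pragma acc loop independent worker
            for (j = 0; j < ny; j++) {
                double sum = 0.0;   /* scalar accumulator, reset per pixel */

                /* Inner loops: sum reduction on the scalar temporary. */
                #pragma acc loop vector reduction(+:sum)
                for (k = -d; k <= d; k++) {
                    for (l = -d; l <= d; l++) {
                        sum += weight * fuzzy[(i + d + k) * fy + (j + d + l)];
                    }
                }
                sharp[i * ny + j] += sum;   /* single store per pixel */
            }
        }
    }
}
```

Built without OpenACC support the pragmas are ignored and the function runs serially, which is a convenient way to check the arithmetic before compiling with “pgcc -acc -Minfo=accel” as above.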

Mat

David_Henty · May 22, 2012, 3:01pm

Mat,

Thanks for the reply - I’ll wait until 12.6 before taking this example forward!

I should have made it clearer that I already understood the issue with the inner loops, which I’m actually quite happy to execute sequentially. However, thanks for the tip about introducing a scalar reduction variable if parallelisation is required.

David

MatColgrove · May 22, 2012, 4:21pm

Hi David,

Note that in the meantime, you can use the PGI Accelerator Model instead of OpenACC, since the OpenACC “kernels” model is based on it.
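As a rough idea of what that could look like, here is a small sketch that uses the PGI Accelerator Model directives (“acc region” and “acc for”) and keeps the two inner loops sequential. It is a simplified stand-in, not a port of dosharpen.c: the fixed sizes NX, NY and D, the file-scope arrays, the scalar weight in place of the computed filter, and the function name sharpen_region are all assumptions for illustration. Data movement is left to the compiler here, which should be able to see the extents of the fixed-size arrays.

```c
#include <assert.h>

/* Hypothetical fixed sizes so the array extents are visible to the compiler. */
#define NX 4
#define NY 4
#define D  1

static double fuzzy[NX + 2 * D][NY + 2 * D];   /* padded input image */
static double sharp[NX][NY];                   /* output image       */

/* Sharpens the top-left nx-by-ny pixels; `weight` stands in for the
   filter value computed in the original loop (an assumption). */
void sharpen_region(int nx, int ny, double weight)
{
    int i, j, k, l;

    assert(nx >= 0 && nx <= NX && ny >= 0 && ny <= NY);

    #pragma acc region
    {
        #pragma acc for parallel, vector(16)
        for (i = 0; i < nx; i++) {
            #pragma acc for parallel, vector(16)
            for (j = 0; j < ny; j++) {
                double sum = 0.0;   /* per-pixel accumulator */

                /* Inner loops run sequentially within each thread. */
                #pragma acc for seq
                for (k = -D; k <= D; k++) {
                    #pragma acc for seq
                    for (l = -D; l <= D; l++) {
                        sum += weight * fuzzy[i + D + k][j + D + l];
                    }
                }
                sharp[i][j] += sum;
            }
        }
    }
}
```

With the PGI compilers these directives are enabled with a target flag such as “-ta=nvidia” rather than “-acc”; with any other compiler the pragmas are ignored and the loop nest runs serially.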

Mat