This may have been covered already, but I couldn’t find an exact match.
I’m using 12.3 to compute a trivial kernel on both the host and the accelerator, and then taking the difference between the two results to check for numerical equivalence.
I have the kernel written in OpenACC, which I think is valid:
#pragma acc data copyin(A, x, nx, ny), create(j, i, tmp, tmpscalar), copyout(ya)
{
    /* zero the output vector */
    #pragma acc kernels loop
    for (i = 0; i < ny; i++)
        ya[i] = 0;

    /* tmp = A * x */
    #pragma acc kernels loop private(tmpscalar) independent, gang
    for (i = 0; i < nx; i++) {
        tmpscalar = 0;
        #pragma acc loop vector
        for (j = 0; j < ny; j++) {
            tmpscalar += A[i][j] * x[j];
        }
        tmp[i] = tmpscalar;
    }

    /* ya = A^T * tmp */
    #pragma acc kernels loop private(tmpscalar) independent, gang
    for (j = 0; j < ny; j++) {
        tmpscalar = 0;
        #pragma acc loop vector
        for (i = 0; i < nx; i++) {
            tmpscalar += A[i][j] * tmp[i];
        }
        ya[j] = tmpscalar;
    }
}
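For reference, the check I run after computing ya on the host and on the accelerator is essentially the following (a rough sketch only; the max-absolute-difference metric and the names ya_host / ya_acc are illustrative, not my exact code):

#include <math.h>

/* Illustrative check: compare the host and accelerator results element
   by element and return the largest absolute difference. */
float max_abs_diff(const float *ya_host, const float *ya_acc, int ny)
{
    float max_err = 0.0f;
    for (int i = 0; i < ny; i++) {
        float err = fabsf(ya_host[i] - ya_acc[i]);
        if (err > max_err)
            max_err = err;
    }
    return max_err;
}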
Given that all my arrays and temporary scalars are declared as float and the index counters are ints, I’d expect this to give a reasonably similar result on both devices. However, I get an error of ~1e-3 for a small problem size (10x10), which rises to ~1e+2 as the problem size increases to 100x100.
If I swap float for double I see the same behaviour but at a smaller magnitude (i.e. of the order of 1e-8 for the small problem).
I’ve tried with and without -nofma but the result is the same.
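I do realise the accelerator may accumulate the partial sums in a different order than the host, and in single precision that alone changes the result; here’s a toy illustration of the kind of ordering effect I mean (purely illustrative, nothing to do with my actual code):

#include <stdio.h>

int main(void)
{
    /* Toy illustration: summing the same float values in two different
       orders generally gives slightly different answers, and the gap
       tends to grow with the number of terms. */
    int n = 10000;
    float forward = 0.0f, backward = 0.0f;
    int i;

    for (i = 1; i <= n; i++)
        forward += 1.0f / (float)i;   /* largest terms first */
    for (i = n; i >= 1; i--)
        backward += 1.0f / (float)i;  /* smallest terms first */

    printf("forward = %.8f  backward = %.8f  diff = %g\n",
           forward, backward, forward - backward);
    return 0;
}

What I can’t tell is whether ordering effects alone should produce differences as large as the ones I’m seeing.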
The host code is the code above without the directives, and the compilation output is:
main:
79, Generating local(tmpscalar)
Generating local(tmp[:])
Generating local(i)
Generating local(j)
Generating copyout(ya[:])
Generating copyin(ny)
Generating copyin(nx)
Generating copyin(x[:])
Generating copyin(A[:][:])
82, Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
83, Loop is parallelizable
Accelerator kernel generated
83, #pragma acc loop gang, vector(10) /* blockIdx.x threadIdx.x */
CC 1.0 : 3 registers; 36 shared, 8 constant, 0 local memory bytes; 33% occupancy
CC 2.0 : 6 registers; 4 shared, 48 constant, 0 local memory bytes; 16% occupancy
87, Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
88, Loop is parallelizable
Accelerator kernel generated
88, #pragma acc loop gang, vector(10) /* blockIdx.x threadIdx.x */
CC 1.0 : 13 registers; 52 shared, 12 constant, 0 local memory bytes; 33% occupancy
CC 2.0 : 20 registers; 4 shared, 64 constant, 0 local memory bytes; 16% occupancy
93, Loop is parallelizable
99, Generating compute capability 1.0 binary
Generating compute capability 2.0 binary
100, Loop is parallelizable
Accelerator kernel generated
100, #pragma acc loop gang, vector(10) /* blockIdx.x threadIdx.x */
CC 1.0 : 14 registers; 52 shared, 8 constant, 0 local memory bytes; 33% occupancy
CC 2.0 : 21 registers; 4 shared, 64 constant, 0 local memory bytes; 16% occupancy
104, Loop is parallelizable
This looks reasonably sensible. I’d like to expose more parallelism in the inner loops, but for now this isn’t a major problem.
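(In case it’s relevant, the sort of change I have in mind for later is roughly the following untested sketch, putting a reduction clause on the inner vector loop; I haven’t tried this yet:)

#pragma acc kernels loop gang
for (i = 0; i < nx; i++) {
    float tmpscalar = 0.0f;
    /* untested: request a vector-level sum reduction across the inner loop */
    #pragma acc loop vector reduction(+:tmpscalar)
    for (j = 0; j < ny; j++) {
        tmpscalar += A[i][j] * x[j];
    }
    tmp[i] = tmpscalar;
}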
If anyone could shed some light on the nature of this discrepancy I’d be grateful. I sometimes see the same thing in other kernels and would like to know whether it’s an inherent feature of the system or a problem with my coding.
Cheers,
-Nick.