Hi,
I’m trying to accelerate my program for a Nvidia GPU with compute compability 60.
In my CPU version I used OpenMP to accelerate the code
The output of this code is the square array z_.
Every thread will use an tmp array and all tmp arrays are later reduced into the final z_.
#pragma omp parallel private(tmp)
{
#pragma omp for
for(int jz = 0; jz < z; jz++)
{
for(int jy = 0; jy < y; jy++)
{
for(int js = 0; s < s; js++)
{
for(int jk = 0; jk < s; jk++)
{
tmp[js+jk*s] = 0.0;
for(int jx = 0; jx < x; jx++)
{
int j = (jx+stmx) + mx*(jy+stm) + mxy*(jz+stm);
tmp[js+jk*s]+=x[j+m*jk]*y[j+m*js];
}
}
}
}
}
for(int jk = 0; jk < s; jk++)
{
for(int js = 0; js < s; js++)
{
#pragma omp atomic
z_[js+jk*s] += tmp[js+jk*s];
}
}
}
My first try with OpenACC resulted in the following code, I simply replaced the OpenMP directives with OpenACC directives.
I checked the compiler output which seems parallelize exactly the loop I specified, however the output result of the program is not correct.
I think there is a problem with the tmp array and the final reduction but I’m not able to track it down.
What is the problem with this code?
#pragma acc data copyin(x[0:s*m],y[0:s*m]) copy(z_[s:s])
#pragma acc kernels
{
#pragma acc loop independent private(tmp)
for(int jz = 0; jz < z; jz++)
{
for(int jy = 0; jy < y; jy++)
{
for(int js = 0; js < s; js++)
{
for(int jk = 0; jk < s; jk++)
{
tmp[js+jk*s] = 0.0;
for(int jx = 0; jx < x; jx++)
{
int j = (jx+stmx) + mx*(jy+stm) + mxy*(jz+stm);
tmp[js+jk*s]+=x[j+m*jk]*y[j+m*js];
}
}
}
}
}
for(int jk = 0; jk < s; jk++)
{
for(int js = 0; js < s; js++)
{
#pragma acc atomic update
z_[js+jk*s] += tmp[js+jk*s];
}
}
}
56, Generating copyin(x[:s*m],y[:s*m])
Generating copy(z_[s:s])
57, Generating implicit copyin(tmp[:])
61, Loop is parallelizable
63, Loop carried reuse of tmp prevents parallelization
65, Loop carried dependence of tmp prevents parallelization
Loop carried dependence of tmp prevents vectorization
Loop carried backward dependence of tmp prevents vectorization
67, Complex loop carried dependence of tmp prevents parallelization
Loop carried dependence of tmp prevents parallelization
Loop carried backward dependence of tmp prevents vectorization
Accelerator kernel generated
Generating Tesla code
61, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
63, #pragma acc loop seq
65, #pragma acc loop seq
67, #pragma acc loop seq
70, #pragma acc loop seq
70, Complex loop carried dependence of tmp prevents parallelization
Loop carried reuse of tmp prevents parallelization
Inner sequential loop scheduled on accelerator
81, Loop carried dependence of z_ prevents parallelization
Loop carried backward dependence of z_ prevents vectorization
83, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
81, #pragma acc loop seq
83, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Thank you for your help!