#pragma acc data copyin(C, sum, X) copy(x)
{
    #pragma acc kernels loop independent
    for (i = 1; i < SIZE; i++) {
        sum[i] = 0;
        for (k = 0; k < 3; k++) {
            sum[i] += C[i-1][0][k] * X[k];
        }
        x[i] = sum[i];
    }
}
104, Loop is parallelizable
     Accelerator kernel generated
    104, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
         CC 1.0 : 10 registers; 48 shared, 12 constant, 0 local memory bytes
         CC 2.0 : 17 registers; 0 shared, 64 constant, 0 local memory bytes
106, Complex loop carried dependence of 'sum' prevents parallelization
     Loop carried reuse of 'sum' prevents parallelization
     Inner sequential loop scheduled on accelerator
If I only want to parallelize the outer loop over i and remove the dependence on sum, how should I modify my code?
Thank you very much.
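
One possible approach, offered as a sketch rather than a definitive fix: accumulate into a scalar declared inside the outer loop instead of into sum[i]. Each outer iteration then has its own accumulator, so the inner loop no longer carries a dependence through the sum array, and the inner loop can be marked explicitly as sequential. The array shapes, the value of SIZE, the float element type, and the function name compute_x below are assumptions made for illustration; they are not taken from the original code. The sum array is dropped from the data clause on the assumption that it is not needed elsewhere.

```c
#define SIZE 8                      /* assumed size, for illustration only */

/* Assumed shapes and element type; the original declarations are not shown. */
static float C[SIZE][1][3];
static float X[3];
static float x[SIZE];

/* Hypothetical wrapper around the loop nest from the question. */
void compute_x(void)
{
    #pragma acc data copyin(C, X) copy(x)
    {
        /* Only the outer loop over i is parallelized. */
        #pragma acc kernels loop independent
        for (int i = 1; i < SIZE; i++) {
            float s = 0.0f;         /* scalar accumulator, local to iteration i */

            /* Keep the short inner loop sequential within each iteration. */
            #pragma acc loop seq
            for (int k = 0; k < 3; k++) {
                s += C[i-1][0][k] * X[k];
            }
            x[i] = s;
        }
    }
}
```

Without an OpenACC compiler the pragmas are ignored and the function runs serially with the same result, which makes it easy to check the numbers on the host first. If sum[i] is needed after the loop, it could be assigned once from s next to x[i], with sum moved to a copyout or copy clause.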