I am trying to get the compiler to parallelize both of two nested loops. This works as expected in Fortran, but in C the compiler (pgcc v10.6) reports that the inner loop is parallelizable and then parallelizes only the outer loop. I'd be grateful for any advice on how to get both loops parallelized. The simple example below illustrates the problem.
Code:
20 #pragma accel region
21 {
22     #pragma acc for parallel, vector(16)
23     for (i = 0; i < N; i++)
24     {
25         #pragma acc for parallel, vector(16)
26         for (j = 0; j < N; j++)
27         {
28             b[i][j] = 2.*a[i][j];
29         }
30     }
31 } // end accel region
Compilation:
[agray3@fermi0 nested]$ pgcc -ta=nvidia:cc20 -Minfo:accel nested.c
main:
20, Generating copyout(b[0:255][0:255])
Generating copyin(a[0:255][0:255])
Generating compute capability 2.0 binary
23, Loop is parallelizable
Accelerator kernel generated
23, #pragma acc for parallel, vector(16)
CC 2.0 : 8 registers; 4 shared, 48 constant, 0 local memory bytes; 16 occupancy
26, Loop is parallelizable
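For anyone who wants to try this, a self-contained version of the test case might look like the sketch below. The declarations are assumptions, since they are not shown in the snippet: N = 256 is inferred from the [0:255] bounds in the -Minfo output, float is only a guess at the element type, and the function name scale_by_two is hypothetical. The directives are copied unchanged, so a compiler without accelerator support will typically ignore them (possibly with a warning) and run both loops serially on the host.

```c
#include <stdio.h>

/* Assumed: N = 256 matches the [0:255] bounds reported by -Minfo. */
#define N 256

/* Assumed declarations (not shown in the original snippet); float is a guess. */
static float a[N][N], b[N][N];

/* Hypothetical wrapper around the same doubly nested loop as in the post.
   The directives are the poster's own; without accelerator support they are
   ignored and the loops run serially on the host. */
void scale_by_two(void)
{
    int i, j;
#pragma accel region
    {
#pragma acc for parallel, vector(16)
        for (i = 0; i < N; i++)
        {
#pragma acc for parallel, vector(16)
            for (j = 0; j < N; j++)
            {
                b[i][j] = 2.*a[i][j];
            }
        }
    } /* end accel region */
}
```

Calling scale_by_two() from main() after filling a gives b[i][j] == 2*a[i][j] on the host, which is a convenient reference for checking the result of the accelerated version.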