Nested loop and reductions

I am new to OpenACC. I am also trying out OpenMP.

I am trying to understand how nested loops work.

I have the following code:
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
int M = 200;
int N = 200;
int O = 200;

int main(void) {
	double S = 0;
	double *K = (double*) calloc(M * N, sizeof(double));
#pragma acc enter data copyin(K)
#pragma acc parallel loop worker collapse(2) present(K)
	for (int m = 0; m < M; m++) {
		for (int n = 0; n < N; n++) {

#pragma acc loop independent vector reduction(+:S)
			for (int o = 0; o < O; o++) {
				S +=  0.25;
			}
			K[m * N + n] = S;
		}
	}


#pragma acc exit data copyout(K)

}

The build output reads

12:38:01 **** Incremental Build of configuration Default for project testOMP ****
make all
nvc++ -fast -Minfo -acc=multicore -o test test.cc
main:
10, Generating Single core code
13, Loop is parallelizable
Loop not fused: function call before adjacent loop
14, Loop is parallelizable
17, Loop is parallelizable
Generated vector simd code for the loop containing reductions
nvc++ -fast -Minfo -acc=gpu -gpu=cc60,managed -Mlarge_arrays -o testgpu test.cc
main:
10, Generating enter data copyin(K[:1])
Generating present(K[:1])
Generating Tesla code
13, #pragma acc loop worker(4) collapse(2) /* threadIdx.y */
14,   /* threadIdx.y collapsed */
17, #pragma acc loop vector(32) /* threadIdx.x */
    Generating reduction(+:S)
    Vector barrier inserted for vector loop reduction
13, Loop is parallelizable
14, Loop is parallelizable
17, Loop is parallelizable
27, Generating exit data copyout(K[:1])

12:38:02 Build Finished. 0 errors, 0 warnings. (took 1s.633ms)

Questions

  1. Is the above code correct?
  2. What are possible areas for improvement?
  3. I have tried some other things where copyin and copyout don’t seem to work right. I went on to use managed data. I may have to revisit that. In what scenarios is copyin preferred?
  4. Lastly, when is the independent clause recommended?

Mostly. There’s only one true error, but several things you can do to help performance.

The one error is that you’re missing the array shape when copying “K” to the device. The array shape is needed when the size of the array to be copied is not known at compile time and the language has no runtime bounds information. If this were Fortran and “K” an allocatable array, “copyin(K)” would be fine, but this is C, so K is just a pointer to an unbounded block of memory. Hence you need “copyin(K[0:M*N])” so the runtime knows how much memory to create and copy to/from the device.
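For reference, the shaped data clauses for this code would read as follows (the comments are mine):

#pragma acc enter data copyin(K[0:M*N])   /* create a device copy of M*N doubles and copy in */
        /* compute regions using K go here */
#pragma acc exit data copyout(K[0:M*N])   /* copy M*N doubles back, then delete the device copy */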

What are possible areas for improvement?

First, I’d use a “gang” schedule for the outer loops. “Gang” maps to an OpenMP thread when targeting a multi-core CPU and to a CUDA block when targeting a GPU. In general, most codes only need gang and vector parallelism; worker is only used in the few cases where a third level of parallelism is required.

Second, the vector inner loop may be better run sequentially. When possible, the vector loop (or in some cases the worker loop) should correspond to the stride-1 dimension of the array, in this case the “n” loop. Also, vector loop reductions incur overhead due to the setup of a partial reduction array, the final reduction code, and added barriers. That’s not to say you should never use vector loop reductions, but here the extra overhead, combined with the non-coalesced memory access on K, makes it not the most performant method.

OpenACC does make it very easy to try different schedules, so especially while you’re learning their impact, I suggest trying various schedules to see what happens.
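For instance, one more variant worth trying (a sketch, not one of the timed versions below) puts gang on the “m” loop and vector on the stride-1 “n” loop, with “o” run sequentially. Note it uses a local accumulator reset per (m, n) pair, which, per the discussion at the end of this thread, is the intended semantics:

#pragma acc parallel loop gang present(K)
        for (int m = 0; m < M; m++) {
/* vector on the stride-1 "n" loop gives coalesced stores to K */
#pragma acc loop vector
                for (int n = 0; n < N; n++) {
                        double s = 0;                   /* private per (m,n) iteration */
                        for (int o = 0; o < O; o++) {   /* "o" runs sequentially */
                                s += 0.25;
                        }
                        K[m * N + n] = s;
                }
        }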

Here, let’s set the environment variable “NVCOMPILER_ACC_TIME=1” to have the runtime give us a quick profile. Second, let’s execute the kernel 1000 times to average out any one-time overhead.
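For example, in a csh-style shell (matching the “%” prompts below; in bash it would be “export NVCOMPILER_ACC_TIME=1”):

% setenv NVCOMPILER_ACC_TIME 1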

Your original version with an outer worker loop:

 % cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
int M = 200;
int N = 200;
int O = 200;

int main(void) {
        double S = 0;
        double *K = (double*) calloc(M * N, sizeof(double));
#pragma acc enter data copyin(K[0:M*N])
        for (int it=0; it < 1000; ++it) {
#pragma acc parallel loop worker collapse(2) present(K)
        for (int m = 0; m < M; m++) {
                for (int n = 0; n < N; n++) {

#pragma acc loop independent vector reduction(+:S)
                        for (int o = 0; o < O; o++) {
                                S +=  0.25;
                        }
                        K[m * N + n] = S;
                }
        }
        }

#pragma acc exit data copyout(K[0:M*N])

}
% nvc -acc -Minfo=accel -fast test.c ; a.out
main:
     10, Generating enter data copyin(K[:N*M])
     12, Generating present(K[:1])
         Generating Tesla code
         14, #pragma acc loop worker(4) collapse(2) /* threadIdx.y */
         15,   /* threadIdx.y collapsed */
         18, #pragma acc loop vector(32) /* threadIdx.x */
             Generating reduction(+:S)
             Vector barrier inserted for vector loop reduction
     14, Loop is parallelizable
     15, Loop is parallelizable
     18, Loop is parallelizable
     28, Generating exit data copyout(K[:N*M])

Accelerator Kernel Timing data
test.c
  main  NVIDIA  devicenum=0
    time(us): 7,418,751
    10: data region reached 1 time
        10: data copyin transfers: 1
             device time(us): total=44 max=44 min=44 avg=44
    12: compute region reached 1000 times
        12: kernel launched 1000 times
            grid: [1]  block: [32x4]
             device time(us): total=7,418,659 max=7,478 min=7,384 avg=7,418
            elapsed time(us): total=7,436,871 max=7,923 min=7,401 avg=7,436
    12: data region reached 2000 times
    28: data region reached 1 time
        28: data copyout transfers: 1
             device time(us): total=48 max=48 min=48 avg=48

Next, let’s change worker to gang on the outer loops:

% cat test1.c
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
int M = 200;
int N = 200;
int O = 200;

int main(void) {
        double S = 0;
        double *K = (double*) calloc(M * N, sizeof(double));
#pragma acc enter data copyin(K[0:M*N])
        for (int it=0; it < 1000; ++it) {
#pragma acc parallel loop gang collapse(2) present(K)
        for (int m = 0; m < M; m++) {
                for (int n = 0; n < N; n++) {

#pragma acc loop independent vector reduction(+:S)
                        for (int o = 0; o < O; o++) {
                                S +=  0.25;
                        }
                        K[m * N + n] = S;
                }
        }
        }

#pragma acc exit data copyout(K[0:M*N])

}
% nvc -acc -Minfo=accel -fast test1.c ; a.out
main:
     10, Generating enter data copyin(K[:N*M])
     12, Generating present(K[:1])
         Generating Tesla code
         14, #pragma acc loop gang collapse(2) /* blockIdx.x */
         15,   /* blockIdx.x collapsed */
         18, #pragma acc loop vector(128) /* threadIdx.x */
             Generating reduction(+:S)
     18, Loop is parallelizable
     28, Generating exit data copyout(K[:N*M])

Accelerator Kernel Timing data
test1.c
  main  NVIDIA  devicenum=0
    time(us): 82,095
    10: data region reached 1 time
        10: data copyin transfers: 1
             device time(us): total=46 max=46 min=46 avg=46
    12: compute region reached 1000 times
        12: kernel launched 1000 times
            grid: [40000]  block: [128]
             device time(us): total=82,008 max=87 min=82 avg=82
            elapsed time(us): total=100,863 max=838 min=99 avg=100
    12: data region reached 2000 times
    28: data region reached 1 time
        28: data copyout transfers: 1
             device time(us): total=41 max=41 min=41 avg=41

Next, let’s schedule the outer loops using gang and worker so that the accesses to K get some coalescing:

% cat test2.c
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
int M = 200;
int N = 200;
int O = 200;

int main(void) {
        double S = 0;
        double *K = (double*) calloc(M * N, sizeof(double));
#pragma acc enter data copyin(K[0:M*N])
        for (int it=0; it < 1000; ++it) {
#pragma acc parallel loop gang worker collapse(2) present(K)
        for (int m = 0; m < M; m++) {
                for (int n = 0; n < N; n++) {

#pragma acc loop independent vector reduction(+:S)
                        for (int o = 0; o < O; o++) {
                                S +=  0.25;
                        }
                        K[m * N + n] = S;
                }
        }
        }

#pragma acc exit data copyout(K[0:M*N])

}
% nvc -acc -Minfo=accel -fast test2.c ; a.out
main:
     10, Generating enter data copyin(K[:N*M])
     12, Generating present(K[:1])
         Generating Tesla code
         14, #pragma acc loop gang, worker(4) collapse(2) /* blockIdx.x threadIdx.y */
         15,   /* blockIdx.x threadIdx.y collapsed */
         18, #pragma acc loop vector(32) /* threadIdx.x */
             Generating reduction(+:S)
             Vector barrier inserted for vector loop reduction
     18, Loop is parallelizable
     28, Generating exit data copyout(K[:N*M])

Accelerator Kernel Timing data
test2.c
  main  NVIDIA  devicenum=0
    time(us): 28,021
    10: data region reached 1 time
        10: data copyin transfers: 1
             device time(us): total=46 max=46 min=46 avg=46
    12: compute region reached 1000 times
        12: kernel launched 1000 times
            grid: [10000]  block: [32x4]
             device time(us): total=27,932 max=29 min=27 avg=27
            elapsed time(us): total=49,854 max=813 min=47 avg=49
    12: data region reached 2000 times
    28: data region reached 1 time
        28: data copyout transfers: 1
             device time(us): total=43 max=43 min=43 avg=43

Finally, let’s drop the inner vector loop and make the outer loops gang vector:

% cat test3.c
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
int M = 200;
int N = 200;
int O = 200;

int main(void) {
        double S = 0;
        double *K = (double*) calloc(M * N, sizeof(double));
#pragma acc enter data copyin(K[0:M*N])
        for (int it=0; it < 1000; ++it) {
#pragma acc parallel loop gang vector collapse(2) present(K)
        for (int m = 0; m < M; m++) {
                for (int n = 0; n < N; n++) {
                        for (int o = 0; o < O; o++) {
                                S +=  0.25;
                        }
                        K[m * N + n] = S;
                }
        }
        }

#pragma acc exit data copyout(K[0:M*N])

}

% nvc -acc -Minfo=accel -fast test3.c ; a.out
main:
     10, Generating enter data copyin(K[:N*M])
     12, Generating present(K[:1])
         Generating Tesla code
         14, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
         15,   /* blockIdx.x threadIdx.x collapsed */
         16, #pragma acc loop seq
     16, Loop is parallelizable
     26, Generating exit data copyout(K[:N*M])

Accelerator Kernel Timing data
test3.c
  main  NVIDIA  devicenum=0
    time(us): 23,366
    10: data region reached 1 time
        10: data copyin transfers: 1
             device time(us): total=46 max=46 min=46 avg=46
    12: compute region reached 1000 times
        12: data copyin transfers: 1000
             device time(us): total=6,217 max=11 min=6 avg=6
        12: kernel launched 1000 times
            grid: [313]  block: [128]
             device time(us): total=6,008 max=7 min=6 avg=6
            elapsed time(us): total=26,331 max=795 min=24 avg=26
        12: data copyout transfers: 1000
             device time(us): total=11,062 max=18 min=10 avg=11
    12: data region reached 2000 times
    26: data region reached 1 time
        26: data copyout transfers: 1
             device time(us): total=33 max=33 min=33 avg=33

So the final version is the fastest in this case, but again, you may still want to go through this exercise until you better understand how loop schedules affect performance.

Lastly, when is the independent clause recommended?

“independent” tells the compiler to skip its loop dependency checks and go ahead and parallelize the loop. “independent” is implied for loop directives within a “parallel” region, so it is not needed in this case.

With “parallel”, you’re telling the compiler where to parallelize. With the “kernels” directive, it’s up to the compiler to determine how best to parallelize the region of code, though in order to do that, it must first prove that a loop is safe to parallelize. By adding a “loop independent” directive within a “kernels” region, you override this safety check and assert to the compiler that it’s ok to parallelize the loop.
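A hypothetical example of where it matters: with unrestricted C pointers the compiler must assume “a” and “b” may alias, so it can’t prove the loop safe on its own, and “independent” asserts that it is:

void scale(double *a, const double *b, int n) {
/* "independent" overrides the dependency check that possible aliasing would otherwise block */
#pragma acc kernels loop independent copyout(a[0:n]) copyin(b[0:n])
        for (int i = 0; i < n; i++) {
                a[i] = 2.0 * b[i];
        }
}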


Thanks @MatColgrove.
One question I have regarding test3.c:

For each of those innermost loops (the “o” loop), will S be reinitialized to 0? For instance, in the snippet below:

#pragma acc enter data copyin(K[0:M*M], X[0:M*N])
#pragma acc parallel loop gang vector collapse(2) present(K, X)
	for (int i = 0; i < M; i++) {
		for (int j = 0; j < M; j++) {
			for (int k = 0; k < N; k++) {
				S += pow(X[i*N + k] - X[j*N + k], 2); // X is a matrix X[M][N]
			}
			K[i * M + j] = S;
		}
	}

#pragma acc exit data copyout(K[0:M*M])

No, this was an error on my part. S should be set to zero before the k loop, though it shouldn’t change the performance.
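In other words, the corrected snippet would look something like this, with a local accumulator reset for each (i, j) pair:

#pragma acc parallel loop gang vector collapse(2) present(K, X)
	for (int i = 0; i < M; i++) {
		for (int j = 0; j < M; j++) {
			double s = 0; // reset the accumulator for each (i, j) pair
			for (int k = 0; k < N; k++) {
				s += pow(X[i*N + k] - X[j*N + k], 2);
			}
			K[i * M + j] = s;
		}
	}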