Mostly. There’s only one true error, but several things you can do to help performance.
The one error is that you’re missing the array shape when copying “K” to the device. The shape is needed whenever the size of the array being copied isn’t known at compile time and the language carries no runtime bounds information. If this were Fortran and “K” an allocatable array, “copyin(K)” would be fine, but this is C, so K is just a pointer to an unbounded block of memory. Hence you need “copyin(K[0:M*N])” so the runtime knows how much memory to allocate and copy to/from the device.
What are possible areas for improvement?
First, I’d use a “gang” schedule for the outer loops. “Gang” maps to an OpenMP thread when targeting a multi-core CPU and to a CUDA thread block when targeting a GPU. In general, most codes only need gang and vector parallelism; worker is used only in the few cases where a third level of parallelism is required.
Second, the inner vector loop may be better run sequentially. When possible, the vector loop (or in some cases the worker loop) should correspond to the stride-1 dimension of the array, here the “n” loop. Also, vector-loop reductions incur overhead: a partial-reduction array must be set up, final reduction code must run, and extra barriers are added. That’s not to say you should never use vector-loop reductions, but here the extra overhead, combined with the non-coalesced memory accesses on K, makes it the less performant method.
OpenACC does make it very easy to try different schedules, so especially while you’re learning their impact, I suggest trying various schedules to see what happens.
Here, let’s set the environment variable “NVCOMPILER_ACC_TIME=1” so the compiler runtime gives us a quick profile, and let’s execute the kernel 1000 times to amortize any launch overhead.
Your original version with an outer worker loop:
% cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
int M = 200;
int N = 200;
int O = 200;
int main(void) {
    double S = 0;
    double *K = (double*) calloc(M * N, sizeof(double));
#pragma acc enter data copyin(K[0:M*N])
    for (int it=0; it < 1000; ++it) {
#pragma acc parallel loop worker collapse(2) present(K)
        for (int m = 0; m < M; m++) {
            for (int n = 0; n < N; n++) {
#pragma acc loop independent vector reduction(+:S)
                for (int o = 0; o < O; o++) {
                    S += 0.25;
                }
                K[m * N + n] = S;
            }
        }
    }
#pragma acc exit data copyout(K[0:M*N])
}
% nvc -acc -Minfo=accel -fast test.c ; a.out
main:
10, Generating enter data copyin(K[:N*M])
12, Generating present(K[:1])
Generating Tesla code
14, #pragma acc loop worker(4) collapse(2) /* threadIdx.y */
15, /* threadIdx.y collapsed */
18, #pragma acc loop vector(32) /* threadIdx.x */
Generating reduction(+:S)
Vector barrier inserted for vector loop reduction
14, Loop is parallelizable
15, Loop is parallelizable
18, Loop is parallelizable
28, Generating exit data copyout(K[:N*M])
Accelerator Kernel Timing data
test.c
main NVIDIA devicenum=0
time(us): 7,418,751
10: data region reached 1 time
10: data copyin transfers: 1
device time(us): total=44 max=44 min=44 avg=44
12: compute region reached 1000 times
12: kernel launched 1000 times
grid: [1] block: [32x4]
device time(us): total=7,418,659 max=7,478 min=7,384 avg=7,418
elapsed time(us): total=7,436,871 max=7,923 min=7,401 avg=7,436
12: data region reached 2000 times
28: data region reached 1 time
28: data copyout transfers: 1
device time(us): total=48 max=48 min=48 avg=48
Next, let’s change worker to gang on the outer loop:
% cat test1.c
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
int M = 200;
int N = 200;
int O = 200;
int main(void) {
    double S = 0;
    double *K = (double*) calloc(M * N, sizeof(double));
#pragma acc enter data copyin(K[0:M*N])
    for (int it=0; it < 1000; ++it) {
#pragma acc parallel loop gang collapse(2) present(K)
        for (int m = 0; m < M; m++) {
            for (int n = 0; n < N; n++) {
#pragma acc loop independent vector reduction(+:S)
                for (int o = 0; o < O; o++) {
                    S += 0.25;
                }
                K[m * N + n] = S;
            }
        }
    }
#pragma acc exit data copyout(K[0:M*N])
}
% nvc -acc -Minfo=accel -fast test1.c ; a.out
main:
10, Generating enter data copyin(K[:N*M])
12, Generating present(K[:1])
Generating Tesla code
14, #pragma acc loop gang collapse(2) /* blockIdx.x */
15, /* blockIdx.x collapsed */
18, #pragma acc loop vector(128) /* threadIdx.x */
Generating reduction(+:S)
18, Loop is parallelizable
28, Generating exit data copyout(K[:N*M])
Accelerator Kernel Timing data
test1.c
main NVIDIA devicenum=0
time(us): 82,095
10: data region reached 1 time
10: data copyin transfers: 1
device time(us): total=46 max=46 min=46 avg=46
12: compute region reached 1000 times
12: kernel launched 1000 times
grid: [40000] block: [128]
device time(us): total=82,008 max=87 min=82 avg=82
elapsed time(us): total=100,863 max=838 min=99 avg=100
12: data region reached 2000 times
28: data region reached 1 time
28: data copyout transfers: 1
device time(us): total=41 max=41 min=41 avg=41
Next, let’s schedule the outer loops using gang and worker so that the accesses to K are at least partially coalesced:
% cat test2.c
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
int M = 200;
int N = 200;
int O = 200;
int main(void) {
    double S = 0;
    double *K = (double*) calloc(M * N, sizeof(double));
#pragma acc enter data copyin(K[0:M*N])
    for (int it=0; it < 1000; ++it) {
#pragma acc parallel loop gang worker collapse(2) present(K)
        for (int m = 0; m < M; m++) {
            for (int n = 0; n < N; n++) {
#pragma acc loop independent vector reduction(+:S)
                for (int o = 0; o < O; o++) {
                    S += 0.25;
                }
                K[m * N + n] = S;
            }
        }
    }
#pragma acc exit data copyout(K[0:M*N])
}
% nvc -acc -Minfo=accel -fast test2.c ; a.out
main:
10, Generating enter data copyin(K[:N*M])
12, Generating present(K[:1])
Generating Tesla code
14, #pragma acc loop gang, worker(4) collapse(2) /* blockIdx.x threadIdx.y */
15, /* blockIdx.x threadIdx.y collapsed */
18, #pragma acc loop vector(32) /* threadIdx.x */
Generating reduction(+:S)
Vector barrier inserted for vector loop reduction
18, Loop is parallelizable
28, Generating exit data copyout(K[:N*M])
Accelerator Kernel Timing data
test2.c
main NVIDIA devicenum=0
time(us): 28,021
10: data region reached 1 time
10: data copyin transfers: 1
device time(us): total=46 max=46 min=46 avg=46
12: compute region reached 1000 times
12: kernel launched 1000 times
grid: [10000] block: [32x4]
device time(us): total=27,932 max=29 min=27 avg=27
elapsed time(us): total=49,854 max=813 min=47 avg=49
12: data region reached 2000 times
28: data region reached 1 time
28: data copyout transfers: 1
device time(us): total=43 max=43 min=43 avg=43
Finally, let’s drop the inner vector loop and make the outer loops gang vector:
% cat test3.c
#include <stdio.h>
#include <stdlib.h>
#include <openacc.h>
int M = 200;
int N = 200;
int O = 200;
int main(void) {
    double S = 0;
    double *K = (double*) calloc(M * N, sizeof(double));
#pragma acc enter data copyin(K[0:M*N])
    for (int it=0; it < 1000; ++it) {
#pragma acc parallel loop gang vector collapse(2) present(K)
        for (int m = 0; m < M; m++) {
            for (int n = 0; n < N; n++) {
                for (int o = 0; o < O; o++) {
                    S += 0.25;
                }
                K[m * N + n] = S;
            }
        }
    }
#pragma acc exit data copyout(K[0:M*N])
}
% nvc -acc -Minfo=accel -fast test3.c ; a.out
main:
10, Generating enter data copyin(K[:N*M])
12, Generating present(K[:1])
Generating Tesla code
14, #pragma acc loop gang, vector(128) collapse(2) /* blockIdx.x threadIdx.x */
15, /* blockIdx.x threadIdx.x collapsed */
16, #pragma acc loop seq
16, Loop is parallelizable
26, Generating exit data copyout(K[:N*M])
Accelerator Kernel Timing data
test3.c
main NVIDIA devicenum=0
time(us): 23,366
10: data region reached 1 time
10: data copyin transfers: 1
device time(us): total=46 max=46 min=46 avg=46
12: compute region reached 1000 times
12: data copyin transfers: 1000
device time(us): total=6,217 max=11 min=6 avg=6
12: kernel launched 1000 times
grid: [313] block: [128]
device time(us): total=6,008 max=7 min=6 avg=6
elapsed time(us): total=26,331 max=795 min=24 avg=26
12: data copyout transfers: 1000
device time(us): total=11,062 max=18 min=10 avg=11
12: data region reached 2000 times
26: data region reached 1 time
26: data copyout transfers: 1
device time(us): total=33 max=33 min=33 avg=33
So the final version is the fastest in this case, but again, you may still want to go through this exercise until you better understand how loop schedules affect performance.
Lastly, when is the “independent” clause recommended?
“independent” tells the compiler to skip its loop-dependency analysis and parallelize the loop anyway. “independent” is implied for loop directives within a “parallel” region, so it’s not needed in this case.
With “parallel”, you’re telling the compiler where to parallelize. With the “kernels” directive, it’s up to the compiler to determine how best to parallelize the region of code, but to do that, it must first prove that a loop is safe to parallelize. By adding a “loop independent” directive within a “kernels” region, you override this safety check and assert to the compiler that it’s OK to parallelize the loop.