I’m experimenting with OpenACC’s cache clause using PGI 14.10. I’ve got a simple example based on the one in the slides at [1]:
#include <stdlib.h>
int main(int argc, char **argv) {
    int N = 1024;
    int *restrict x = (int *)malloc(sizeof(int) * N);
    int *restrict y = (int *)malloc(sizeof(int) * N);
    #pragma acc parallel loop copy(x[0:N], y[0:N])
    for (int i = 1; i < N - 1; i++) {
        /* ask the compiler to keep x[i-1] and x[i] in fast memory */
        #pragma acc cache(x[i-1:2])
        y[i] = (x[i] + x[i + 1]) / 2.0;
    }
    return 0;
}
Compiling with: pgc++ -acc -Minfo -ta=nvidia,cc3x foo.c
When I run this under nvprof with --metrics shared_load_transactions,shared_store_transactions, it reports zero shared-memory load and store transactions. The generated CUDA also shows no shared variables in use. Can anyone explain why the cache directive appears to have no effect in this example?
Minfo output is below.
[1] http://www.pgroup.com/lit/presentations/cea-3.pdf
main:
6, Generating copy(x[:N])
Generating copy(y[:N])
Accelerator kernel generated
9, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
6, Generating Tesla code
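In case it helps anyone reproduce this, here is a minimal sketch of a self-checking variant of the same loop. It initializes x (the example above leaves it uninitialized) and compares y against a host-side reference, so the kernel's results can be confirmed regardless of whether the cache directive takes effect. The helper name run_and_check and the fill pattern x[i] = 2*i are my own assumptions for illustration, and the loop body assumes the intended statement is y[i] = (x[i] + x[i + 1]) / 2.0. It says nothing about whether shared memory ends up being used.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of the original example: runs the loop and
   returns the number of y elements that differ from a host-side reference
   (or -1 if allocation fails). */
static int run_and_check(int n) {
    int *restrict x = (int *)malloc(sizeof(int) * n);
    int *restrict y = (int *)malloc(sizeof(int) * n);
    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return -1;
    }

    /* Assumed fill pattern so the expected result is easy to predict. */
    for (int i = 0; i < n; i++) {
        x[i] = 2 * i;
        y[i] = 0;
    }

    /* Same directives as in the example above; a compiler without OpenACC
       support ignores the pragmas and runs the loop on the host. */
    #pragma acc parallel loop copy(x[0:n], y[0:n])
    for (int i = 1; i < n - 1; i++) {
        #pragma acc cache(x[i-1:2])
        y[i] = (x[i] + x[i + 1]) / 2.0;
    }

    /* Host-side reference: x holds even values, so the integer average
       matches the truncated floating-point average exactly. */
    int mismatches = 0;
    for (int i = 1; i < n - 1; i++) {
        int expected = (x[i] + x[i + 1]) / 2;
        if (y[i] != expected) {
            mismatches++;
        }
    }

    free(x);
    free(y);
    return mismatches;
}
```

Calling run_and_check(1024) from main and checking that it returns 0 gives a quick correctness test to run alongside the nvprof metrics.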