cache directive not resulting in __shared__ memory use

I’m experimenting with OpenACC’s cache clause using PGI 14.10. I’ve got a simple example based on the one in the slides at [1]:

#include <stdlib.h>

int main(int argc, char **argv) {
    int N = 1024;
    int *restrict x = (int *)malloc(sizeof(int) * N);
    int *restrict y = (int *)malloc(sizeof(int) * N);

    #pragma acc parallel loop copy(x[0:N], y[0:N])
    for (int i = 1; i < N - 1; i++) {
        #pragma acc cache(x[i-1:2])
        y[i] = (x[i] + x[i + 1]) / 2.0;
    }
    return 0;
}

Compiling with: pgc++ -acc -Minfo -ta=nvidia,cc3x foo.c

When I run this under nvprof with --metrics shared_load_transactions,shared_store_transactions it reports no loads or stores. When I look at the generated CUDA, it also shows that no shared variables are being used. Can anyone explain why the cache directive would appear to be having no effect in this example?

Minfo output is below.

[1] http://www.pgroup.com/lit/presentations/cea-3.pdf

main:
      6, Generating copy(x[:N])
         Generating copy(y[:N])
         Accelerator kernel generated
          9, #pragma acc loop gang, vector(256) /* blockIdx.x threadIdx.x */
      6, Generating Tesla code

Hi Max,

The cache directive was essentially disabled in the 14.x compilers. We were having performance issues so decided to rework it. The PGI 2015 compilers will contain the new and improved version of cache.

I tested your example and can verify that shared memory is being used when compiled with our pre-release 2015 compiler.

Best Regards,
Mat
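
For anyone reading this later and wondering what "shared memory is being used" amounts to, here is a rough host-side sketch in plain C of the staging pattern a working cache directive is meant to produce. It is not the code the PGI compiler generates. The names stencil_direct, stencil_staged and TILE are made up for illustration, TILE = 256 is only an assumption taken from the vector(256) in the Minfo output, and the stencil is assumed to be a two-point average of neighbouring elements as in the question's loop.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed block size, chosen to match vector(256) in the -Minfo output. */
#define TILE 256

/* Baseline: every iteration reads its inputs straight from x. */
void stencil_direct(const int *x, int *y, size_t n) {
    assert(x != NULL && y != NULL);
    for (size_t i = 1; i + 1 < n; i++)
        y[i] = (x[i] + x[i + 1]) / 2;
}

/* Staged variant: each tile first copies the elements it needs (plus one
 * halo element) into a small local buffer, then computes from that buffer.
 * On a GPU the buffer would be a __shared__ array filled cooperatively by
 * the threads of one block; here it is an ordinary local array. */
void stencil_staged(const int *x, int *y, size_t n) {
    assert(x != NULL && y != NULL);
    int tile[TILE + 1];

    for (size_t base = 1; base + 1 < n; base += TILE) {
        /* Number of outputs left, capped at one tile. */
        size_t count = n - 1 - base;
        if (count > TILE)
            count = TILE;

        /* Load phase: count inputs plus one halo element on the right. */
        for (size_t t = 0; t <= count; t++)
            tile[t] = x[base + t];

        /* Compute phase: all reads hit the staged copy. */
        for (size_t t = 0; t < count; t++)
            y[base + t] = (tile[t] + tile[t + 1]) / 2;
    }
}
```

Both functions should produce identical output. The only difference is where the reads come from, which is what the shared_load_transactions and shared_store_transactions counters in nvprof would reflect on a real device.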

Thanks for the clarification Mat!