Private variable causing a problem on AMD GPUs

I’m having a problem with code that looks something like this:

  double A[20];

  #pragma acc kernels present(src, dest)
  #pragma acc loop independent gang private(A)
  for (i = 0; i < LZ; i++) {

    A[ 0] = src[_OFFSET0];
    A[ 1] = src[_OFFSET1];
    A[ 2] = src[_OFFSET2];
    /* ... A[3] through A[19] are filled the same way ... */

    double k = 0.0;   /* per-iteration sum, declared inside the loop so each iteration has its own */
    #pragma acc loop independent reduction(+:k)
    for (j = 0; j < 20; j++) {
      k += A[j];
    }

    dest[0] = A[0]*k;
    dest[1] = A[1]*k;
    dest[2] = A[2]*k;
  }

When compiling for NVIDIA, I get the following:

37, Loop is parallelizable
CUDA shared memory used for A
Accelerator kernel generated

and the program runs fine.
If I compile the same code for an AMD GPU instead, I get this strange error message:

PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): pointer variable has multiple OpenCL target memory spaces

How can I fix this while still keeping a separate copy of A for each thread?
I’m currently using PGI 15.7 on a FirePro W9100 (Hawaii).

Hi Alga,

Apologies for the late reply. I sent this to one of our compiler engineers. She’s been able to replicate the issue but is still investigating the cause. We’re tracking this issue as TPR#21860 and will report back once we learn more.

Mat

… any news?

Sorry, no news.
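The thread leaves the question open. One common way to give every iteration its own copy of A without the private clause is manual privatization: allocate a global scratch array with one 20-element slice per iteration of the gang loop and index it by i. The sketch below is untested with PGI 15.7 on AMD hardware and is not the fix tracked as TPR#21860. The function name scale_by_sum, the work array, and the src[i*NA + j] / dest[i*NA + j] layout (standing in for the elided _OFFSETn macros) are all hypothetical.

```c
#include <stddef.h>

#define NA 20   /* elements per iteration (the size of A in the question) */

/* Manual privatization sketch (hypothetical names and layout): iteration i
 * of the gang loop uses work[i*NA .. i*NA + NA-1] as its own copy of A,
 * so no private clause is needed. src and dest use the same layout. */
void scale_by_sum(const double *restrict src, double *restrict dest,
                  double *restrict work, int lz)
{
    #pragma acc kernels present(src[0:lz*NA], dest[0:lz*NA], work[0:lz*NA])
    #pragma acc loop independent gang
    for (int i = 0; i < lz; i++) {
        size_t base = (size_t)i * NA;   /* start of this iteration's slice */

        /* Fill this iteration's copy of A. */
        #pragma acc loop independent vector
        for (int j = 0; j < NA; j++)
            work[base + j] = src[base + j];

        /* Sum the slice, as the reduction over A does in the question. */
        double k = 0.0;
        #pragma acc loop independent reduction(+:k)
        for (int j = 0; j < NA; j++)
            k += work[base + j];

        /* Scale each element of the slice by the sum. */
        #pragma acc loop independent vector
        for (int j = 0; j < NA; j++)
            dest[base + j] = work[base + j] * k;
    }
}
```

The caller has to place src, dest and work on the device first, for example with `#pragma acc data copyin(src[0:lz*NA]) copyout(dest[0:lz*NA]) create(work[0:lz*NA])` around the call. The slice is indexed directly through work rather than through a new local pointer, since the original error complains about a pointer variable with multiple OpenCL memory spaces. The trade-off is lz*20 doubles of device memory instead of a small per-gang array; compiled without OpenACC, the pragmas are ignored and the function runs sequentially on the host.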