Accelerator restriction: unsupported statement type: opcode=DEALLOC

Hello,

I am able to reproduce a compilation error in my C++ and OpenACC program with the following reduced example:

#include <iostream>

class foo
{
public:
        foo(int nbEqs, int nOrder);
        ~foo();
        void createDeviceData();
        void calculation();
private:
        int _nbEqs;
        int _elem;
};

foo::foo(int nbEqs, int nOrder)
{
        _nbEqs = nbEqs;
        _elem = nOrder;
}

foo::~foo()
{
}

void foo::createDeviceData()
{
        #pragma acc enter data copyin(this)
}

void foo::calculation()
{
        #pragma acc parallel loop \
        copyin(_elem)
        for (int i=0; i<_nbEqs; i++)
        {
          double temp[_elem];
          for (int i=0; i<_elem; i++)
          {
            temp[i] = 1.;
          }
        }
}
int main()
{
        foo obj(100,3);
        obj.createDeviceData();

        obj.calculation();

        std::cout << "Done" << std::endl;
}

With “pgc++ -acc -ta=nvidia,lineinfo -Minfo=accel test3.C”, the compilation error reads:

"test3.C", line 36: warning: variable "temp" was set but never used
  	  double temp[_elem];
  	         ^

foo::createDeviceData():
     28, Generating enter data copyin(this[:1])
PGCC-S-0155-Accelerator region ignored; see -Minfo messages  (test3.C: 31)
foo::calculation():
     31, Accelerator region ignored
     34, Accelerator restriction: loop contains unsupported statement type
     40, Accelerator restriction: unsupported statement type: opcode=DEALLOC
PGCC/x86-64 Linux 19.4-0: compilation completed with severe errors

Could you advise why the error would arise?

Thanks,
Shine

Hi Shine,

VLAs aren’t supported in device code.

While you can change this to use malloc/free instead (see below), dynamic allocation on the device should be avoided if possible. Device-side mallocs are serialized, which can hurt performance. Also, the default device heap size is 8MB, which often leads to heap overflows if you malloc too much memory.

Instead, I’d recommend you either make temp a fixed size, or declare it before the parallel loop and then add it to a private clause.

Hope this helps,
Mat

Malloc version (needs #include <cstdlib> for malloc and free):

void foo::calculation()
{
        #pragma acc parallel loop \
        copyin(_elem)
        for (int i=0; i<_nbEqs; i++)
        {
          double *temp = (double*) malloc(sizeof(double)*_elem);
          for (int i=0; i<_elem; i++)
          {
            temp[i] = 1.;
          }
          free(temp);
        }
}
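
To get a rough feel for the 8MB heap limit mentioned above, the following sketch estimates how many temp arrays from the malloc version can be live at once. The names kDefaultHeapBytes and maxLiveTempArrays are hypothetical, and the estimate assumes no allocator bookkeeping or alignment padding, so the real limit will be lower.

```cpp
#include <cstddef>

// Default device heap size quoted in the reply above (8MB).
constexpr std::size_t kDefaultHeapBytes = 8u * 1024u * 1024u;

// Optimistic upper bound on the number of simultaneously live
// malloc'd arrays of `elem` doubles that fit in the default heap.
// Assumption: ignores allocator overhead and alignment padding.
constexpr std::size_t maxLiveTempArrays(std::size_t elem)
{
        return kDefaultHeapBytes / (elem * sizeof(double));
}
```

With 8-byte doubles, an array of 1024 elements already caps this bound at 1024 live allocations, which is far fewer than the number of threads a GPU typically runs concurrently.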

Private version:

void foo::calculation()
{

        double temp [_elem];
        #pragma acc parallel loop \
        copyin(_elem) private(temp[0:_elem])
        for (int i=0; i<_nbEqs; i++)
        {
          for (int i=0; i<_elem; i++)
          {
            temp[i] = 1.;
          }
        }
}
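
Fixed-size version (a minimal sketch of the other recommendation above, not code from the reply): temp gets a compile-time size, so no allocation happens on the device. MAX_ELEM is a hypothetical upper bound on _elem, the members are copied to locals as one way to keep the region from depending on this, and the reduction is added only so the result is observable. It has not been checked against a specific PGI release.

```cpp
#include <cassert>

// Hypothetical compile-time upper bound on _elem (an assumption).
constexpr int MAX_ELEM = 8;

class foo
{
public:
        foo(int nbEqs, int nOrder) : _nbEqs(nbEqs), _elem(nOrder) {}
        double calculation();
private:
        int _nbEqs;
        int _elem;
};

double foo::calculation()
{
        assert(_elem <= MAX_ELEM);   // guard against overrunning temp
        const int nbEqs = _nbEqs;    // local copies of the members
        const int elem = _elem;
        double sum = 0.0;

        #pragma acc parallel loop reduction(+:sum)
        for (int i = 0; i < nbEqs; i++)
        {
          double temp[MAX_ELEM];     // fixed size: no VLA, no malloc
          #pragma acc loop seq
          for (int j = 0; j < elem; j++)
          {
            temp[j] = 1.;
          }
          #pragma acc loop seq
          for (int j = 0; j < elem; j++)
          {
            sum += temp[j];          // use temp so the work is kept
          }
        }
        return sum;
}
```

The cost of this approach is that MAX_ELEM has to be chosen large enough for every order you plan to run, and any unused tail of temp is wasted per iteration.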

Thanks, Mat, for your helpful replies as always. Yes, your solution works. Following your suggestions and our own tests, we will use fixed-size arrays to avoid exceeding the heap limit (we have iterated with Alex of the PGI group on this).

Just out of curiosity, we have seen other kinds of VLA usage, for example:

#include <iostream>

class foo
{
public:
        foo(int nbEqs, int nOrder);
        ~foo();
        void createDeviceData();
        #pragma acc routine seq
        void calculation();
private:
        int _nbEqs;
        int _elem;
};

foo::foo(int nbEqs, int nOrder)
{
        _nbEqs = nbEqs;
        _elem = nOrder;
}

foo::~foo()
{
}

void foo::createDeviceData()
{
        #pragma acc enter data copyin(this)
}

void foo::calculation()
{
          double temp[_elem];
          for (int i=0; i<_elem; i++)
          {
            temp[i] = 1.;
          }
}

int main()
{
        foo obj(100,3);

        obj.createDeviceData();

        #pragma acc parallel loop
        for(int i=0; i<100; i++)
        {
          obj.calculation();
        }

        std::cout << "Done" << std::endl;
}

This code compiles and runs fine. I believe I am missing some subtle difference here regarding VLA usage. Could you shed some light on this?

Thanks,
Shine

Ok, I was a bit too broad in saying VLAs are not supported in device code. In a device subroutine, you can use VLAs or Fortran automatics. I’ve just seen several folks have issues with heap overflows when using them, so I don’t encourage their use.

Ah, I see the difference. Yes, my latest example is a device function while the original example is a plain accelerated OpenACC region. We will avoid VLAs in our upcoming development.

Is there a schedule for the next community release? Currently there is a strong indication in our production code that private arrays are actually not privatized. However, I have not been able to reproduce it with a smaller code. I wonder whether this has already been fixed by TPR#27025 (https://www.pgroup.com/support/release-tprs-2019.htm), so that the next community release will automatically incorporate the fix.

Thanks,
Shine

Hi Shine,

We typically release the community edition at the xx.4 and xx.10 releases, so the next one should be 19.10.

TPR#27025, which was fixed in 19.5, had to do with putting a private array on a worker loop. Are you also using a worker loop? If not, your issue may be something different. If you can get me a reproducing example, I can see what’s wrong and determine if it is the same issue or something else.

-Mat

Hi Mat,

Thanks for the information. As for my bug: no, I did not use a worker loop, so it is unlikely to be related to TPR#27025. I will spend some time on a reproducer again today, and if I can get it to work I will report it to PGI customer service.

Thanks and have a good day,
Shine