How should OpenACC handle array size that is only known at run-time

Hello,

It seems that arrays declared inside an OpenACC region must have a constant size in order to compile successfully. One example is given below:

#include <iostream>

class foo
{
public:
        foo(int nbEqs);
        ~foo();
        void createDeviceData();
        #pragma acc routine seq
        void assignValue();
        void syncGPU();
        void printVal();
private:
        int _nbEqs;
        int* ptr;
};

foo::foo(int nbEqs)
{
        _nbEqs = nbEqs;
        ptr = new int[_nbEqs];
        for (int i=0; i<_nbEqs; i++){ptr[i]=0;}
}

foo::~foo()
{
        delete[] ptr;
}

void foo::createDeviceData()
{
        #pragma acc enter data copyin(this)
        #pragma acc enter data copyin(ptr[0:_nbEqs])
}
void foo::assignValue()
{
        //int localPtr[_nbEqs]; //this line does not work
        //int*localPtr = new int[_nbEqs]; //this line does not work
        int localPtr[5]; //this works
        for (int i=0; i<_nbEqs; i++)
        {
          localPtr[i]=1;
        }
        for (int i=0; i<_nbEqs; i++)
        {
          ptr[i]=localPtr[i];
        }
}

void foo::syncGPU()
{
        #pragma acc exit data copyout(ptr[0:_nbEqs])
}

void foo::printVal()
{
        for(int i=0; i<_nbEqs; i++)
        {
          std::cout << "i=" << i << ";val=" << ptr[i] << std::endl;
        }
}

int main()
{
        foo obj(5);
        obj.createDeviceData();

        #pragma acc serial
        {
          obj.assignValue();
        }

        obj.syncGPU();
        obj.printVal();
}

Notice that in assignValue(), only “int localPtr[5];” works, whereas “int localPtr[_nbEqs];” gives the error of

[shine@dummy]$ pgc++ -acc -ta:nvidia -Minfo=accel test.C 
main:
     69, Accelerator serial kernel generated
         Generating Tesla code
         Generating implicit copy(obj)
foo::createDeviceData():
     34, Generating enter data copyin(this[:1],ptr[:_nbEqs])
PGCC-S-0000-Internal compiler error. Call in OpenACC region to support routine - _mp_malloc (test.C: 37)
PGCC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages)  (test.C: 37)
foo::assignValue():
     37, Generating acc routine seq
         Generating Tesla code
         38, Accelerator restriction: unsupported call to support routine '_mp_malloc'
foo::syncGPU():
     53, Generating exit data copyout(ptr[:_nbEqs])
PGCC/x86 Linux 18.10-1: compilation completed with severe errors

and “int*localPtr = new int[_nbEqs];” gives the error of

[shine@dummy]$ pgc++ -acc -ta:nvidia -Minfo=accel test.C 
main:
     69, Accelerator serial kernel generated
         Generating Tesla code
         Generating implicit copy(obj)
foo::createDeviceData():
     34, Generating enter data copyin(this[:1],ptr[:_nbEqs])
PGCC-S-0000-Internal compiler error. Call in OpenACC region to support routine - _mp_malloc (test.C: 37)
PGCC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages)  (test.C: 37)
foo::assignValue():
     37, Generating acc routine seq
         Generating Tesla code
         38, Accelerator restriction: unsupported call to support routine '_mp_malloc'
foo::syncGPU():
     53, Generating exit data copyout(ptr[:_nbEqs])

Could you comment on how to avoid hard-coding the array size like this?

Thanks,
Shine

Hi Shine,

We didn’t support VLAs in device code until recently, so I’m assuming you’re using 18.10. With PGI 19.4, I’m able to compile the first case, “localPtr[_nbEqs]”, without issues.

We don’t currently support new/delete in device code, but you can use malloc/free instead.

However, it’s highly recommended not to dynamically allocate data from the device. The device has a very small heap (8MB by default), so it’s easy to hit this limit quickly as you scale up the number of threads. While there are ways to increase the heap to 32MB (via a call to cudaDeviceSetLimit), it’s still easy to blow the heap, causing the program to crash.

Secondly, device allocation gets serialized and can have a huge negative impact on performance. Here you’re only using a serial compute region so it doesn’t matter, but in any parallel region it will affect things.

Granted, some algorithms require dynamic allocation in device routines so you may need to do this, but just keep in mind the caveats and limit the use if possible.

-Mat

Thanks Mat, for the encouraging information on VLA support and your insightful comments on using heap memory on the device.

One question: it sounds like

int localPtr[_nbEqs];

will allocate device memory on its heap? If so, is it because _nbEqs is not known until run-time? Also would

int localPtr[5];

allocate the device memory on its stack instead, since the size is known at compilation time?

We will update to PGI 19.4 soon (yes, we were using PGI 18.10) and compare the performance. We will find out, but should we expect that

int localPtr[5];

may be faster?

Thanks,
Shine

“will allocate device memory on its heap?”

Correct. VLAs are implicitly allocated upon entry into the routine, and implicitly deallocated upon exit.

“allocate the device memory on its stack instead, since the size is known at compilation time?”

Fixed-size arrays aren’t dynamically allocated; they’re declared locally within the routine. On the host side, local variables are stored on the stack, but on the GPU they would be stored in registers. As of compute capability 2.0 devices, each thread on the GPU does have a limited stack, but it’s mostly used when calling device subroutines.

“may be faster?”

Most likely.

-Mat

Thanks for your reply, Mat! It is very clear.

The following reproducer code demonstrates the very original issue:

#include <iostream>
#include <math.h>

class bar
{
public:
        bar(int nOrder)
        {
          _nOrder = nOrder;
          barPtr = new int[_nOrder];
        }
        ~bar();
        void createDeviceData()
        {
          #pragma acc enter data copyin(this)
          #pragma acc enter data copyin(barPtr[0:_nOrder])
        }
        #pragma acc routine seq
        void calculation(const int& inVal, int& outVal)
        {
          // outVal = (-inVal)**0 + (-inVal)**1 + (-inVal)**2 + etc up to nOrder-1
          //int tempPtr[3]; // this line avoids the race condition
          int* tempPtr = barPtr; // this line gives a race condition; it was intended to reuse the CPU memory
          for (int i=0; i<_nOrder; i++ )
          {
            tempPtr[i] = pow(-inVal, i);
          }
          outVal = 0;
          for (int i=0; i<_nOrder; i++)
          {
            outVal += tempPtr[i];
          }
        }
private:
        int* barPtr;
        int _nOrder;
};

class foo
{
public:
        foo(int nbEqs, int nOrder);
        ~foo();
        void createDeviceData();
        void calculation();
        void syncGPU();
        void printVal();
private:
        int _nbEqs;
        int* ptr1;
        int* ptr2;
        bar* help;
};

foo::foo(int nbEqs, int nOrder)
{
        _nbEqs = nbEqs;
        ptr1 = new int[_nbEqs];
        ptr2 = new int[_nbEqs];
        for (int i=0; i<_nbEqs; i++)
        {
          ptr1[i]=i;
          ptr2[i]=i;
        }
        help = new bar(nOrder);
}

foo::~foo()
{
        delete[] ptr1;
        delete[] ptr2;
}

void foo::createDeviceData()
{
        #pragma acc enter data copyin(this)
        #pragma acc enter data copyin(ptr1[0:_nbEqs],ptr2[0:_nbEqs])
        help->createDeviceData();
        #pragma acc enter data attach(help)
}

void foo::calculation()
{
        #pragma acc parallel loop \
        present(ptr1[0:_nbEqs],ptr2[0:_nbEqs])
        for (int i=0; i<_nbEqs; i++)
        {
          help->calculation(ptr1[i],ptr2[i]);
        }
}

void foo::syncGPU()
{
        #pragma acc exit data copyout(ptr2[0:_nbEqs])
}

void foo::printVal()
{
        for(int i=0; i<_nbEqs; i++)
        {
          std::cout << "i=" << i << ";val=" << ptr2[i] << std::endl;
        }
}


int main()
{
        foo obj(1000,3);
        obj.createDeviceData();

        obj.calculation();

        obj.syncGPU();
        obj.printVal();
}

The code can be compiled by pgc++ -acc -ta=nvidia -Minfo=accel test.C. In the current form, there is a race condition, as tempPtr points to barPtr[0:_nOrder], which is shared by all threads, and as a result the output differs from run to run.

My current workaround (with PGI 18.10) is to use local “int tempPtr[3];” inside “void calculation(const int& inVal, int& outVal)” (since PGI 18.10 does not support VLAs). My questions are:

  1. With PGI 19.4 and above, “int tempPtr[_nOrder]” should work (is the heap memory shared by all the threads? If so, that would be an issue). One potential issue is consuming too much heap memory as _nOrder increases. Does any other concern come to your mind?
  2. I am using OpenACC to accelerate our production C++ code, which has many occurrences of the example above. Is using local arrays the best workaround? Any suggestions would be helpful.

Thanks,
Shine

How about something like this, where you manually privatize barPtr?

#include <iostream>
#include <math.h>

class bar
{
public:
        bar(int nEqs,int nOrder)
        {
          _nEqs = nEqs;
          _nOrder = nOrder;
          barPtr = new int*[_nEqs];
          for(int i=0; i < _nEqs; ++i) {
            barPtr[i] = new int[_nOrder];
          }
        }
        ~bar();
        void createDeviceData()
        {
          #pragma acc enter data copyin(this)
          #pragma acc enter data copyin(barPtr[0:_nEqs][0:_nOrder])
        }
        #pragma acc routine seq
        void calculation(const int eqs, const int& inVal, int& outVal)
        {
          // outVal = (-inVal)**0 + (-inVal)**1 + (-inVal)**2 + etc up to nOrder-1
          for (int i=0; i<_nOrder; i++ )
          {
            barPtr[eqs][i] = pow(-inVal, i);
          }
          outVal = 0;
          for (int i=0; i<_nOrder; i++)
          {
            outVal += barPtr[eqs][i];
          }
        }
private:
        int** barPtr;
        int _nOrder;
        int _nEqs;
};

class foo
{
public:
        foo(int nbEqs, int nOrder);
        ~foo();
        void createDeviceData();
        void calculation();
        void syncGPU();
        void printVal();
private:
        int _nbEqs;
        int* ptr1;
        int* ptr2;
        bar* help;
};

foo::foo(int nbEqs, int nOrder)
{
        _nbEqs = nbEqs;
        ptr1 = new int[_nbEqs];
        ptr2 = new int[_nbEqs];
        for (int i=0; i<_nbEqs; i++)
        {
          ptr1[i]=i;
          ptr2[i]=i;
        }
        help = new bar(nbEqs,nOrder);
}

foo::~foo()
{
        delete[] ptr1;
        delete[] ptr2;
}

void foo::createDeviceData()
{
        #pragma acc enter data copyin(this)
        #pragma acc enter data copyin(ptr1[0:_nbEqs],ptr2[0:_nbEqs])
        help->createDeviceData();
        #pragma acc enter data attach(help)
}

void foo::calculation()
{
        #pragma acc parallel loop \
        present(ptr1[0:_nbEqs],ptr2[0:_nbEqs])
        for (int i=0; i<_nbEqs; i++)
        {
          help->calculation(i,ptr1[i],ptr2[i]);
        }
}

void foo::syncGPU()
{
        #pragma acc exit data copyout(ptr2[0:_nbEqs])
}

void foo::printVal()
{
        for(int i=0; i<_nbEqs; i++)
        {
          std::cout << "i=" << i << ";val=" << ptr2[i] << std::endl;
        }
}


int main()
{
        foo obj(1000,3);
        obj.createDeviceData();

        obj.calculation();

        obj.syncGPU();
        obj.printVal();
}

Hi Mat,

Thanks for the suggestions. Yes, it works and I need to discuss with my team and see how easy it is to adapt the production code.

Thanks again,
Shine

Just a test, as my attempt to submit a post returned 403 Forbidden.

OK, probably my reproducer code is too long…

I have sent the code to PGI customer support (trs@pgroup.com) with the title “Reporting an error with VLAs”. The symptom is that whenever the heap memory usage is high, an error of

========= CUDA-MEMCHECK
========= Out-of-range Shared or Local Address
=========     at 0x00000120 in __cuda_syscall_mc_dyn_globallock_check
=========     by thread (64,0,0) in block (484,0,0)

would arise (with cuda-memcheck). If it is not a compiler issue, we will have to reroute our code.

Thanks,
Shine