OpenACC dynamically allocate array within loop

Hi there,

I was wondering if it is possible to allocate arrays dynamically within a loop.
Something like this:

int** parr;
#pragma acc parallel loop pcreate(parr[0:N])
for(int i = 0; i < N; ++i)
{
    int x = some_func(i);
    parr[i] = new int[x];
}

I think this is possible in CUDA, but I could not find a way to do it in OpenACC.

Kind regards,
Rob


Hi Rob,

Allocation via malloc or new is supported within device code, though it is generally not advisable given its negative impact on performance and the small default device heap.
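If you do go that route, here's a minimal sketch of what it could look like (untested; some_func is a hypothetical size function, and you may need to enlarge the device heap, for example via the PGI_ACC_CUDA_HEAPSIZE environment variable):

#include <stdlib.h>

#pragma acc routine seq
static int some_func(int i) { return (i % 100) + 1; } /* hypothetical size function */

void device_alloc_sketch(int N) {
    int **parr = (int**) malloc(N*sizeof(int*));
/* create() gives an uninitialized device copy of the pointer array */
#pragma acc parallel loop create(parr[0:N])
    for (int i = 0; i < N; ++i) {
        int x = some_func(i);
        parr[i] = (int*) malloc(x*sizeof(int)); /* device-side heap allocation */
        for (int k = 0; k < x; ++k) parr[i][k] = k;
        free(parr[i]); /* free before the kernel ends, or the pointer is lost */
    }
    free(parr);
}

Anything allocated this way lives only in device memory, so the host can never dereference these pointers.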

What is your goal with this code? To create a 2D device-only array?

-Mat

Hi Mat,

The goal is to have an array of arrays with variable lengths (i.e. a jagged array, or list) on the device. The lengths of the individual arrays range from a couple of entries to maybe 100 entries.

Kind regards,
Rob

Does it need to be device-only, or can you create the array from the host?

I wrote an example of how to create a jagged array as part of my chapter in the book Parallel Programming with OpenACC: ParallelProgrammingWithOpenACC/jagged_array.c at master · rmfarber/ParallelProgrammingWithOpenACC · GitHub

If needed, I can rewrite it to be device only.

Note that there’s been a slight semantic change in behavior since I wrote the original code, in that you need to change line 59 from:

#pragma acc enter data create(tmp[0:size1][0:1])

to

#pragma acc enter data create(tmp[0:size1][0:0])
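For context, the directive-based allocation in that example follows roughly this pattern (paraphrased, not copied verbatim from the repo):

double ** allocData(size_t size1, int * sizes) {
    double ** tmp = (double **) malloc(size1*sizeof(double*));
/* create the pointer array on the device; the [0:0] second dimension
   allocates the pointers without allocating any pointee storage */
#pragma acc enter data create(tmp[0:size1][0:0])
    for (size_t i=0; i < size1; ++i) {
        tmp[i] = (double *) malloc(sizes[i]*sizeof(double));
/* allocate each vector on the device and attach it to the pointer array */
#pragma acc enter data create(tmp[i:1][0:sizes[i]])
    }
    return tmp;
}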

Hope this helps,
Mat

Hi Mat,

Thanks. Unfortunately I cannot get it to work as I need. I need a jagged array just like in your example, but the sizes of the sub-arrays can change. I tried something like this:

for(int i = 0 ;  i < N; ++i)
{
    if( new_size[i] != old_size[i] )
    {
        #pragma acc exit data delete(A[i:1])
        #pragma acc enter data create(A[i:1][0:new_size[i]])
    }
}

but I get an error further down the line:
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Failing in Thread:1
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

Do you have any suggestions how to debug this?

Best,
Rob

Hi Rob,

We may need to move to using API calls, since these give greater control over the device allocation and attachment. In particular, since this is effectively a realloc, we can then do a device-to-device copy from the old array to the new one.

I’m mentoring a virtual hackathon this week (Monday-Wednesday), so unfortunately I may not have time until later this week to create an example, though I will as soon as I can.

-Mat

Hi Mat

Thanks a lot, an example would be perfect. Is there documentation of the API functionality?

Regards,
Rob

The best place is the OpenACC 3.0 standard; see section 3.
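The routines most relevant here, with signatures roughly as the spec gives them (double-check against section 3):

/* raw device allocation and deallocation */
void* acc_malloc(size_t bytes);
void  acc_free(void* dptr);
/* associate (and dissociate) a host address with a device address */
void  acc_map_data(void* hptr, void* dptr, size_t bytes);
void  acc_unmap_data(void* hptr);
/* fill in the device copy of a pointer so it points at device memory */
void  acc_attach(void** hptr);
/* device-to-device copy; sizes are in bytes */
void  acc_memcpy_device(void* dest, void* src, size_t bytes);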

Hi Mat,

Unfortunately, I did not manage to get it to work properly. If you have a working example, that would be highly appreciated.

Regards,
Rob

Here’s a modified version of jagged_array.c where I’m doing the data management through the API calls. It’s a bit more manual, but it does allow for the device-to-device copy when doing the realloc.

#include <stdlib.h>
#include <stdio.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

#ifndef N
#define N 32
#endif

double ** allocData(size_t size1, int * sizes);
void reallocData(double ** A, size_t size1, int * oldsizes, int * newsizes);
int deleteData(double ** A, size_t size1, int * sizes);
int initData(double **A, size_t size1, int * sizes, double val);
int printData(double **A, size_t size1, int * sizes);

int main() {

    double **A, **B;
    int * sizes, *newsizes;
    size_t size1, i, j;
    size1 = N;

/* create an array of sizes for each jagged array */
    sizes=(int*) malloc(sizeof(int)*size1);
    newsizes=(int*) malloc(sizeof(int)*size1);
    for (i=0; i < size1; ++i) {
       sizes[i]=i+10;
       newsizes[i] = sizes[i]+32;
    }
#pragma acc enter data copyin(sizes[:size1],newsizes[:size1])

/* Allocate the jagged arrays */
    A=allocData(size1,sizes);
    B=allocData(size1,sizes);

/* set the inital values of B */
    initData(B,size1,sizes,2.5);

/* Perform the computation on the device */
#pragma acc parallel loop gang present(A,B,sizes)
    for (j=0; j < size1; ++j) {
       int size2 = sizes[j];
#pragma acc loop vector
       for (i=0; i < size2; ++i) {
          A[j][i] = B[j][i] + (double) ((j*size2)+i);
       }
    }
#ifdef _OPENACC
/* Copy back the results */
    for (j=0; j < size1; ++j) {
      acc_update_self(A[j],sizes[j]*sizeof(double));
    }
#endif

    printData(A,size1,sizes);

/* reallocate A and B's jagged arrays using new sizes */
    reallocData(A,size1,sizes,newsizes);
    reallocData(B,size1,sizes,newsizes);
    initData(B,size1,newsizes,3.0);

#pragma acc parallel loop gang present(A,B,newsizes)
    for (j=0; j < size1; ++j) {
       int size2 = newsizes[j];
#pragma acc loop vector
       for (i=0; i < size2; ++i) {
          A[j][i] = B[j][i] + (double) ((j*size2)+i);
       }
    }
#ifdef _OPENACC
/* copy back the results */
    for (j=0; j < size1; ++j) {
      acc_update_self(A[j],newsizes[j]*sizeof(double));
    }
#endif
    printData(A,size1,newsizes);
    deleteData(A,size1,newsizes);
    deleteData(B,size1,newsizes);

#pragma acc exit data delete(sizes,newsizes)
    free(sizes);
    free(newsizes);
    exit(0);
}

double ** allocData(size_t size1, int * sizes) {
    double ** tmp;
    int i;
/* Create an array of pointers */
    tmp = (double **) malloc(size1*sizeof(double*));
#ifdef _OPENACC
/* Create the pointer array on the device */
    acc_create(tmp,sizeof(double*)*size1);
#endif
    for (i=0; i < size1; ++i) {
/* Create the vector */
       tmp[i] = (double *) malloc(sizes[i]*sizeof(double));
#ifdef _OPENACC
/* allocate the vector on the device */
       double * dptr = (double *) acc_malloc(sizeof(double)*sizes[i]);
/* map the host vector to the device vector */
       acc_map_data(tmp[i],dptr,sizeof(double)*sizes[i]);
/* attach the vector (i.e. fill-in the device pointer value in the device jagged array at element i) */
       acc_attach((void**)&tmp[i]);
#endif
    }
    return tmp;
}

void reallocData(double ** A, size_t size1, int * oldsizes, int * newsizes) {
    int i;
#ifdef _OPENACC
    double * olddptr, *newdptr;
#endif

    for (i=0; i < size1; ++i) {
#ifdef _OPENACC
/* store the device pointer for this vector and unmap A so it's no longer associated with this device array */
       olddptr = acc_deviceptr(A[i]);
       acc_unmap_data(A[i]);
#endif
       A[i] = (double *) realloc(A[i],newsizes[i]*sizeof(double));
#ifdef _OPENACC
/* malloc a new device vector with the new size and map A[i] to the new device vector*/
       newdptr = (double *) acc_malloc(sizeof(double)*newsizes[i]);
       acc_map_data(A[i],newdptr,sizeof(double)*newsizes[i]);
       acc_attach((void**)&A[i]);
/* device to device copy from old to new device vector */
       if (newsizes[i] > oldsizes[i]) {
          acc_memcpy_device(newdptr,olddptr,oldsizes[i]*sizeof(double));
       } else {
          acc_memcpy_device(newdptr,olddptr,newsizes[i]*sizeof(double));
       }
/* delete the old device vector */
       acc_free(olddptr);
#endif
    }
}

int deleteData(double ** A, size_t size1, int * sizes) {
    int i;
    for (i=0; i < size1; ++i) {
#ifdef _OPENACC
       double * dptr = acc_deviceptr(A[i]);
       acc_unmap_data(A[i]);
       acc_free(dptr);
#endif
       free(A[i]);
    }
#ifdef _OPENACC
    acc_delete(A,sizeof(double*)*size1);
#endif
    free(A);
    return 0;
}

int initData(double **A, size_t size1, int * sizes, double val) {
    size_t i,j;
    for (j=0; j < size1; ++j) {
       int size2=sizes[j];
       for (i=0; i < size2; ++i) {
          A[j][i] = val;
       }
/* Update the device with the initial values */
#ifdef _OPENACC
       acc_update_device(A[j],sizeof(double)*size2);
#endif
    }
    return 0;
}

int printData(double **A, size_t size1, int * sizes) {
    size_t i;
    printf("Values:\n");
    for (i=0; i < 5; ++i) {
        int last = sizes[i]-1;
        printf("A[%zu][0]=%f A[%zu][%d]=%f\n",i,A[i][0],i,last,A[i][last]);
    }
    printf("....\n");
    for (i=size1-5; i < size1; ++i) {
        int last = sizes[i]-1;
        printf("A[%zu][0]=%f A[%zu][%d]=%f\n",i,A[i][0],i,last,A[i][last]);
    }
    return 0;
}

% nvc -fast -w jagged.c -acc=gpu -Minfo=accel ; a.out
main:
     34, Generating enter data copyin(newsizes[:size1],sizes[:size1])
     42, Generating present(A[:1],sizes[:1],B[:1])
         Generating Tesla code
         42, #pragma acc loop gang /* blockIdx.x */
         45, #pragma acc loop vector(128) /* threadIdx.x */
     45, Loop is parallelizable
     64, Generating present(A[:1],newsizes[:1],B[:1])
         Generating Tesla code
         64, #pragma acc loop gang /* blockIdx.x */
         67, #pragma acc loop vector(128) /* threadIdx.x */
     67, Loop is parallelizable
     82, Generating exit data delete(sizes[:1],newsizes[:1])
Values:
A[0][0]=2.500000 A[0][9]=11.500000
A[1][0]=13.500000 A[1][10]=23.500000
A[2][0]=26.500000 A[2][11]=37.500000
A[3][0]=41.500000 A[3][12]=53.500000
A[4][0]=58.500000 A[4][13]=71.500000
....
A[27][0]=1001.500000 A[27][36]=1037.500000
A[28][0]=1066.500000 A[28][37]=1103.500000
A[29][0]=1133.500000 A[29][38]=1171.500000
A[30][0]=1202.500000 A[30][39]=1241.500000
A[31][0]=1273.500000 A[31][40]=1313.500000
Values:
A[0][0]=3.000000 A[0][41]=44.000000
A[1][0]=46.000000 A[1][42]=88.000000
A[2][0]=91.000000 A[2][43]=134.000000
A[3][0]=138.000000 A[3][44]=182.000000
A[4][0]=187.000000 A[4][45]=232.000000
....
A[27][0]=1866.000000 A[27][68]=1934.000000
A[28][0]=1963.000000 A[28][69]=2032.000000
A[29][0]=2062.000000 A[29][70]=2132.000000
A[30][0]=2163.000000 A[30][71]=2234.000000
A[31][0]=2266.000000 A[31][72]=2338.000000

Hi Mat,

Thanks a lot! I guess the data management with the pragmas does the same thing as the API calls under the hood, so the performance is similar?

Regards,
Rob

Yes, the API calls are mostly equivalent to the directives. While I haven’t measured it, since it’s rare that I use the API over the directives (which are more portable in that they don’t need to be guarded by macros), I would assume the performance is roughly the same.
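For a rough correspondence (a sketch; x and n are just placeholders):

#include <openacc.h>
#include <stdlib.h>

/* each API call has the same effect as the directive in the comment above it */
void correspondence_sketch(double *x, size_t n) {
/* #pragma acc enter data copyin(x[0:n]) */
    acc_copyin(x, n*sizeof(double));
/* #pragma acc update self(x[0:n]) */
    acc_update_self(x, n*sizeof(double));
/* #pragma acc update device(x[0:n]) */
    acc_update_device(x, n*sizeof(double));
/* #pragma acc exit data delete(x[0:n]) */
    acc_delete(x, n*sizeof(double));
}

The main practical difference is that the directives take element ranges while the API calls take byte counts.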

Hi Mat,

I have a follow up question. With the regular jagged array allocation you linked I got an error when the size of an individual array went above 34. In the example with the API calls there seems to be no such restriction. Where does that come from?
The error looked like this, but it seems to come from the allocation:

host:(nil) device:(nil) size:0 presentcount:0+1 line:126 name:tmp
host:0x625c20 device:0x2aaae1afa000 size:128 presentcount:0+1 line:26 name:sizes
host:0xdbc1a0 device:0x2aaae1afa400 size:256 presentcount:0+1 line:121 name:tmp
…

Regards,
Rob

Hmm. I just re-downloaded the file and tried various larger values of “N”, but all succeeded for me. Granted I’m using the pre-release of 20.7, so maybe it’s a compiler issue that was fixed?

What compiler version, GPU, compiler flags, and OS are you using?
How are you changing the array size? Are you setting it via “N”?
Any other changes that you made?

Hi Mat,
I am using pgc++ 19.10-0 LLVM 64-bit. Compile options are “-acc -ta:tesla”. OS is CentOS 7.3.
The error only occurs when I delete one jagged array and want to create a new one.

deleteData(A,size1);
A = allocData(size1,sizes);

I also tried using the API calls. Your example runs fine; however, when I apply it to my problem I get a segfault:

#6 0x00002aaaae1f09c3 in gomp_vfatal (fmt=<optimized out>, list=list@entry=0x7fffffffb178) at ../../../src/gcc-7.3.0/libgomp/error.c:80
#7 0x00002aaaae1f0a5c in gomp_fatal (fmt=fmt@entry=0x2aaaae20cb60 "cannot map data on shared-memory system") at ../../../src/gcc-7.3.0/libgomp/error.c:89
#8 0x00002aaaae20989a in acc_map_data (h=0x103d2500, d=0x103d2a00, s=4) at ../../../src/gcc-7.3.0/libgomp/oacc-mem.c:320

I don’t understand why libgomp is being invoked; do you have any clue?

Regards,
Rob

The error only occurs when I delete one jagged array and want to create a new one.
I also tried using the API calls. Your example runs fine; however, when I apply it to my problem I get a segfault:

I tried modifying my code and it seemed to run fine as well. Not sure what’s wrong with your code. Are you able to provide a reproducing example?

I don’t understand why libgomp is being invoked; do you have any clue?

This would imply that the code is running on the multicore CPU rather than on a device, or possibly that you have a higher-level OpenMP region. I do now see an error with my example when running on the multicore CPU with larger N values (I’ll need to adjust the code to account for a single memory space), but it’s not a segv, rather the runtime error “FATAL ERROR: variable in OpenACC API data map routine was already present on the GPU”.

Can you double check that you are compiling with “-ta=tesla” and running on a system with a supported GPU? Are you running with OpenMP (-mp) as well?

Using just “-acc”, both Tesla and multicore targets will be generated. However, if “-ta=tesla” is used, no multicore CPU code is generated, and you’d see a runtime error when trying to run the binary on a system without a GPU. So if you don’t have OpenMP in the code, this would imply that “-ta=tesla” is not being used. Of course, this is just a guess, so having a reproducing example would help.
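As a quick sanity check (a sketch using the standard OpenACC API), you can print which device type the runtime actually selected at startup:

#include <openacc.h>
#include <stdio.h>

/* report whether the OpenACC runtime selected the GPU or the host */
void check_device(void) {
    acc_device_t dev = acc_get_device_type();
    if (dev == acc_device_nvidia)
        printf("Running on an NVIDIA GPU\n");
    else if (dev == acc_device_host)
        printf("Running on the host CPU\n");
    else
        printf("Other device type: %d\n", (int)dev);
}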

-Mat

In the following example, I just added a deleteData(A,size1) and A=allocData(size1,sizes); after the original allocData. This fails on my machine with the following error:

FATAL ERROR: variable in data clause is partially present on the device: name=tmp
file:/test_jagged.c _Z9allocDatamPi line:68

tmp lives at 0xda0120 size 120 partially present
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 7.0, threadid=1

test_jagged.c (2.7 KB)

Concerning the API code: yes, it is part of some other code that uses OpenMP, and “-mp” is used for the linker; however, in this specific code I don’t use it. Is it possible to avoid this conflict?

Apologies, I didn’t realize that you were working with the original version, not the updated API version.

I’m not sure if this issue has always been there or if it’s a change in behavior since I first wrote it 7 years ago. Are you able to switch to using the API version as a workaround?

Note that it’s on my to-do list to revamp this example once some modifications have been made to the compiler. We’ve always treated “**” as a 2D array when it really should be treated as an array of pointers. Plus, we’ve had a long-standing compiler issue when copying only the second dimension of an array and not using a range in the first dimension. This is why you have to use “j:1” in the update directive:

#pragma acc update device(A[j:1][0:size2])

works, whereas

#pragma acc update device(A[j][0:size2])

triggers the error.
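In context, the working host-side update loop would look something like this sketch:

/* update the device copy of each row; A, sizes, and size1 as in the example */
for (size_t j=0; j < size1; ++j) {
    int size2 = sizes[j];
/* the "j:1" range in the first dimension works around the compiler issue */
#pragma acc update device(A[j:1][0:size2])
}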

I’ll keep pushing our team to work on these issues, but don’t have a timeline. Though once complete, I’ll rewrite the example and ask Rob Farber to update the git repo with the changes.

Concerning the API code: yes, it is part of some other code that uses OpenMP, and “-mp” is used for the linker; however, in this specific code I don’t use it. Is it possible to avoid this conflict?

Just a guess, but it may be that the OpenMP runtime exception handling is catching the error so it’s not actually failing in OpenMP code.

Hi Mat,

I found that it is actually executing the API functions from libgomp instead of the ones from the PGI library, even when I do not link with “-mp”. Do you have any clue why this is the case? I could not reproduce it with the example you provided, even when I add “-mp”. For my code, the debugger says:

acc_create (h=0xeb38660, s=32) at ../../../src/gcc-7.3.0/libgomp/oacc-mem.c:501

while for the example

acc_create (hostptr=0xf0a2c0, bytes=25608) at ../src/acc_create.c:33

Very odd. While we do implicitly include the OpenMP runtime library on the link line (so you can bind the executable to cores even without OpenMP enabled; you can use -nomp to remove this), we link against our “libomp.so”, not the GOMP library. Here it looks like the OpenACC API calls are resolving to the GOMP versions, so it’s not even an OpenMP issue.

Are you using an MPI driver? If so, could it be implicitly adding GOMP?

What is your link line? Can you add the verbose (-v) option to the link so we can see what’s actually being used on the link (ld) command?
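For example (a generic sketch; exact paths and library names will differ on your system):

% pgc++ -acc -ta=tesla -v test_jagged.o -o a.out
% ldd a.out | grep -i omp

If libgomp.so shows up in the ldd output, something on the link line is pulling in the GNU runtime instead of ours.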