Bug with privatizing 2D array inside OpenACC region in C

I’ve encountered a bug with privatizing dynamically-allocated 2D array in C.

I’ve written a shortest possible code to illustrate this problem (call it ACC_test.c):

# include <stdlib.h>
# include <stdio.h>
# define N 16

// dynamically allocate 2D array 
double** Make2DDoubleArray(int arraySizeX, int arraySizeY) {
    double** theArray;
    theArray = (double**) malloc(arraySizeX*sizeof(double*));
    for (int i = 0; i < arraySizeX; i++)
        theArray[i] = (double*) malloc(arraySizeY*sizeof(double));
    return theArray;
} 

int main() {

    int i,j;
    //double a[N][N];
    double **a = Make2DDoubleArray(N,N);
    
    // Initial value 
    for (i=0; i<N; i++){
    for (j=0; j<N; j++){
        a[i][j]=0.0;
    }   
    }   

    #pragma acc data copyin(a[:N][:N])
    {   

    #pragma acc parallel private(a[:N][:N]) 
    {   
        #pragma acc loop gang 
        for (i=0; i<N; i++){
            a[i][0]=1.0;
        }

    }
    }

    return 0;
}

I compiled it with

$pgcc -acc -ta=multicore,tesla -Minfo=accel ACC_test.c

When running on the host side (export ACC_DEVICE_TYPE=host), I got

Segmentation fault

When running on GPU (export ACC_DEVICE_TYPE=nvidia), I got

call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
call to cuMemFreeHost returned error 700: Illegal address during kernel execution

The hints I have are:

    1. This is definitely a valid serial code (works fine without -acc option or with other compilers)


    1. The bug only occurs when “a” is set to private.


    1. I can read the 2D array, but just cannot modify its value (i.e. “temp = a_[0]” is fine)_


  • 4) The bug only occurs when a is dynamically allocated. If I use the commented line “double a[N][N];” instead of the function “Make2DDoubleArray”, then everything is good.
    However, when N is large I have to use dynamic allocation to avoid stack overflow. Also, I can’t use static allocation if N is a variable.



  • 5) For the private clause, if I use “private(a)” instead of private(a[:N][:N]), I would get such warning:

Accelerator clause: upper bound for dimension 0 of array ‘a’ is unknown
Accelerator clause: upper bound for dimension 1 of array ‘a’ is unknown

Again, this only happens for dynamic allocation.

A possible explanation is that when “a” is dynamically allocated, the compiler doesn’t generate private copies of the data that “a” points to. I am wondering how can I correctly privatize a dynamically-allocated 2D array and modify its value inside the parallel region? Thanks!

The version I am using is:

$ pgcc --version
pgcc 16.10-0 64-bit target on x86-64 Linux -tp haswell
The Portland Group - PGI Compilers and Tools
Copyright (c) 2016, NVIDIA CORPORATION. All rights reserved.

Hi jzhuang,

Sorry but arrays used in a private clauses must be contiguous. The overhead cost of trying to allocate and fill in device pointer address across many thousands of private 2D arrays would be prohibitive.

The solution here would be to create a 1-D N*N array for the private array.

Note that you can’t use the same variable in both a data clause and a private clause.

-Mat

% cat test.c
# include <stdlib.h>
 # include <stdio.h>
 # define N 16

 // dynamically allocate 2D array
 double** Make2DDoubleArray(int arraySizeX, int arraySizeY) {
     double** theArray;
     theArray = (double**) malloc(arraySizeX*sizeof(double*));
     for (int i = 0; i < arraySizeX; i++)
         theArray[i] = (double*) malloc(arraySizeY*sizeof(double));
     return theArray;
 }

 int main() {

     int i,j;
     //double a[N][N];
     double **a = Make2DDoubleArray(N,N);
     double *b =  (double*) malloc(N*N*sizeof(double));

     // Initial value
     for (i=0; i<N; i++){
     for (j=0; j<N; j++){
         a[i][j]=0.0;
     }
         b[i]=0.0;
     }

     #pragma acc data copy(a[:N][:N])
     {
     #pragma acc parallel private(b[:N*N])
     {
         #pragma acc loop gang
         for (i=0; i<N; i++){
             int idx=i*N;
             b[idx]=1.0;
             a[i][0]=b[idx];
         }

     }
     }
     for (i=0; i<N; i++){
        printf("a[i][0]=%f\n",a[i][0]);
     }
     return 0;
 }

% pgcc test.c -acc -Minfo=accel
main:
     29, Generating copy(a[:16][:16])
     31, Accelerator kernel generated
         Generating Tesla code
         34, #pragma acc loop gang, vector(16) /* blockIdx.x threadIdx.x */
     34, Accelerator restriction: size of the GPU copy of b is unknown
% a.out
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000
a[i][0]=1.000000

Hi Mat,

Thanks for the reply. Well that makes sense.

Could you explain more about “you can’t use the same variable in both a data clause and a private clause.”?

Say, if I initialized an array before the OpenACC region, then I want to get a private copy for each gang using the firstprivate clause. Do you mean the following code is wrong?

 

     a[N*N] = ... // initialize a

     #pragma acc data copyin(a[:N*N) 
     { 
     #pragma acc parallel firstprivate(a) 
     { 
     #pragma acc loop gang 
         for (i=0; i<M; i++){ 
              // I want each iteration get a private copy of "a" 
              // with the initial value defined at the beginning.
             do_something( a ) 
         } 

     }
     }

If it is wrong, then how should I change the code?

Do you mean the following code is wrong?

Yes. When you put “a” in a data copy clause, you’re creating a mirrored copy of the host array on device that is globally accessible to all threads in a compute region. Putting “a” in a firstprivate clause says to create N number of a, one for each thread, and then initialize each a’s to the value of the copy. When entering the compute region, there is now confusion as to which a should be used, the global or the private copy.

I think your confusion is that you’re thinking that “a” needs to be on the device before it can be used in a firstprivate. I understand your logic but this is not the case since the firstprivate value is taken from the host not the device.

-Mat

Hi Mat,

Thanks again for the explanation. I removed the data clause, and then the compiler automatically generated an update device clause. So it seems that a firstprivate clause implicitly contains an update clause?

30, Generating update device(a[:][:])
Accelerator kernel generated
Generating Multicore code

Something else is going on since the compiler wouldn’t add an implicit update. Maybe an implicit copy, but the compiler feedback messages would indicate this.

This message indicates that you have an “update” directive in your code. Also, this is for a 2D array so probably a different variable than the one listed in the firstprivate clause.

Can you post a code example?

-Mat