Privatization of arrays

I ran a parallel loop as follows:

#define imax 257
#define jmax 129
#define kmax 129
#define nn 50

static float arr1[imax][jmax][kmax];
static float arr2[imax][jmax][kmax];

#pragma acc region
{
for(n=1; n<nn-1; ++n){

   for(i=1; i<imax-1; ++i){
      for(j=1; j<jmax-1; ++j){
         for(k=1; k<kmax-1; ++k){
            arr1[i][j][k] = arr2[i][j][k];
         }
      }
   }

   for(i=1; i<imax-1; ++i){
      for(j=1; j<jmax-1; ++j){
         for(k=1; k<kmax-1; ++k){
            arr2[i][j][k] = arr1[i][j][k];
         }
      }
   }
} /* end n loop */

}


I got the following message when compiling:

Parallelization would require privatization of array arr1[i2+1][i3+1][1:kmax-2]

I privatized the arrays as follows:


static float arr1[nn][imax][jmax][kmax];
static float arr2[nn][imax][jmax][kmax];

#pragma acc region
{
for(n=1; n<nn-1; ++n){

   for(i=1; i<imax-1; ++i){
      for(j=1; j<jmax-1; ++j){
         for(k=1; k<kmax-1; ++k){
            arr1[n][i][j][k] = arr2[n][i][j][k];
         }
      }
   }

   for(i=1; i<imax-1; ++i){
      for(j=1; j<jmax-1; ++j){
         for(k=1; k<kmax-1; ++k){
            arr2[n][i][j][k] = arr1[n][i][j][k];
         }
      }
   }
} /* end n loop */

}

I got an out-of-memory error when running the code:

call to cuMemAlloc returned error 2: Out of memory

Are there different ways of privatizing the arrays so as not to get the out-of-memory error?

Thanks in advance,
Viet

Viet,

Did you try privatizing the array using the pragma clauses? With Fortran it’s an !$acc do clause, so I imagine it’s a #pragma acc for one in C (I’m a Fortran programmer so caveat lector).

So you might try:
#pragma acc region
{
#pragma acc for private(arr1)
for(n=1;n<nn-1;++n){…

Essentially, you add that private clause on the line directly before the for-loop that it must apply to.

Dear TheMatt,

Thank you very much. I was able to use the pragma clause to privatize the arrays as follows:


#pragma acc region
{
#pragma acc for private(arr2[1:imax-2][1:jmax-2][1:kmax-2], arr1[1:imax-2][1:jmax-2][1:kmax-2])

for(n=1; n<nn-1; ++n){

   for(i=1; i<imax-1; ++i){
      for(j=1; j<jmax-1; ++j){
         for(k=1; k<kmax-1; ++k){
            arr1[i][j][k] = arr2[i][j][k];
         }
      }
   }

   for(i=1; i<imax-1; ++i){
      for(j=1; j<jmax-1; ++j){
         for(k=1; k<kmax-1; ++k){
            arr2[i][j][k] = arr1[i][j][k];
         }
      }
   }

} /* end n loop */
}

However, the out-of-memory error is still a problem. Is there a better way to privatize the arrays in this case?

Thanks in advance,
Viet

Hi Viet,

When you privatize an array, you are creating a temporary copy for each thread, which can dramatically increase your memory usage. Also, since the private arrays are temporary, their values are not stored back to the host. Full details on the private clause can be found in section 2.4.4 of the PGI Accelerator Programming Model guide (http://www.pgroup.com/resources/accel.htm).

Backing up to the original code, the reason the outer loop won't parallelize is that all values of n (i.e., all threads) need to access the same i, j, and k elements of the arrays. Depending on the order in which the threads store their results, the values stored in the arrays would change, leading to non-deterministic results.

Instead of having the “n” loop be the outer loop, could it be moved to the innermost position? This would allow you to parallelize the i, j, and k loops, run the n loop sequentially inside the kernel, reduce the data movement, and increase your compute intensity.

For example:

#pragma acc region
{
for(i=1; i<imax-1; ++i){
   for(j=1; j<jmax-1; ++j){
      for(k=1; k<kmax-1; ++k){
         for(n=1; n<nn-1; ++n){
            arr1[i][j][k] = arr2[i][j][k];
         }
      }
   }
}
...
}

Hope this helps,
Mat

Dear Mat,

Thank you very much for the suggestion of moving the “n” loop to the innermost loop so as to parallelize the i, j, and k loops.

In order to do so, I have to parallelize the computation of arr1 and arr2 as well, and handle synchronization, as shown in the following sketches:

for(i=1; i<imax-1; ++i){
   for(j=1; j<jmax-1; ++j){
      for(k=1; k<kmax-1; ++k){
         for(n=1; n<nn-1; ++n){
            arr1[i][j][k] = arr2[i][j][k];
            /* wait for computing arr2 at step (n) */
         }
      }
   }
}

for(i=1; i<imax-1; ++i){
   for(j=1; j<jmax-1; ++j){
      for(k=1; k<kmax-1; ++k){
         for(n=1; n<nn-1; ++n){
            arr2[i][j][k] = arr1[i][j][k];
            /* wait for computing arr1 at step (n) */
         }
      }
   }
}

How can I use pragma clauses to implement this?

Thanks in advance,
Viet

Hi Viet,

The inner “n” loops should be executed sequentially, but just in case, you can use “#pragma acc for kernel” or “#pragma acc for seq” to force it. Be sure to compile with “-Minfo=accel” and watch the generated output; the compiler will tell you how the loops were scheduled.

You can also use “#pragma acc for parallel” and “#pragma acc for vector” to manually adjust the parallel schedule.

On a side note, compile with “-Msafeptr” or declare your arrays with the C99 restrict keyword. Otherwise the compiler won’t be able to prove the arrays don’t alias, and it can’t parallelize the loops.

  • Mat

Can “!$acc do private” be used for nested loops, e.g. for “u” as below?

PROGRAM test

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 4
  INTEGER :: b(N,N,N),i,j,k,u(N)

  b(:,:,:) = 0

!$acc region
  DO k = 1,N
     DO j = 1,N
        DO i = 1,N
           u(i) = i+j+k
        ENDDO
        DO i = 2,N-1
           b(i,j,k) = u(i-1) + u(i)
        ENDDO
     ENDDO
  ENDDO
!$acc end region                                                                

  PRINT '(4I6)',b

END PROGRAM test

I tried putting “!$acc do private(u)” above the k and/or j loops, but nothing worked (wrong answers or a cuMemFree error).

I can, of course, add a j-index to u and a “do private” to k, but I would like to accelerate the original source version just using directives.

Thanks for your help,

Alistair.

P.S. I’m using v10.5 of the compiler, if it matters.

Hi Alistair,

It seems to work for me when I add “!$acc do private(u)” above the k loop. Try rebooting your system (or resetting your CUDA driver); the pinned memory that the NVIDIA driver uses can enter an inconsistent state, leading to this type of behavior.

  • Mat
% cat test1.f90
PROGRAM test

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 4
  INTEGER :: b(N,N,N),i,j,k,u(N)

  b(:,:,:) = 0

!$acc region
!$acc do private(u)
  DO k = 1,N
     DO j = 1,N
        DO i = 1,N
           u(i) = i+j+k
        ENDDO
        DO i = 2,N-1
           b(i,j,k) = u(i-1) + u(i)
        ENDDO
     ENDDO
  ENDDO
!$acc end region

  PRINT '(4I6)',b

END PROGRAM test

% pgf90 -ta=nvidia -Minfo=accel test1.f90 -V10.5
test:
     10, Generating copyout(b(2:3,1:4,1:4))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     12, Loop is parallelizable
     13, Loop carried reuse of 'u' prevents parallelization
     14, Loop is parallelizable
         Accelerator kernel generated
         12, !$acc do parallel, vector(4)
         13, !$acc do seq
         14, !$acc do parallel, vector(4)
             Using register for 'u'
             CC 1.0 : 7 registers; 24 shared, 40 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 7 registers; 24 shared, 40 constant, 0 local memory bytes; 25 occupancy
     17, Loop is parallelizable
         Accelerator kernel generated
         12, !$acc do parallel, vector(4)
         13, !$acc do seq
         17, !$acc do parallel, vector(2)
             Cached references to size [3] block of 'u'
             CC 1.0 : 8 registers; 40 shared, 44 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 8 registers; 40 shared, 44 constant, 0 local memory bytes; 25 occupancy
% a.out
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0

Thanks for the reply.

But the correct result (compiling without -ta flag and running on host) is

% pgf90 test1.F90
% ./a.out
     0     7     9     0
     0     9    11     0
     0    11    13     0
     0    13    15     0
     0     9    11     0
     0    11    13     0
     0    13    15     0
     0    15    17     0
     0    11    13     0
     0    13    15     0
     0    15    17     0
     0    17    19     0
     0    13    15     0
     0    15    17     0
     0    17    19     0
     0    19    21     0

The only way I could get agreement with the host code was to explicitly privatise u by making it a three-index array in the source code. I was hoping to do the acceleration without changing the Fortran.

I don’t like the sound of that - how can I diagnose this problem other than by my code not working? And rebooting a remote system you don’t own is non-trivial.

Cheers,

Alistair.

Sorry about that, Alistair; let me try again, this time adding the ‘kernel’ clause to the J loop.

% cat test1.f90
PROGRAM test

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 4
  INTEGER :: b(N,N,N),i,j,k,u(N)

  b(:,:,:) = 0

!$acc region
  DO k = 1,N
!$acc do private(u), kernel
     DO j = 1,N
        DO i = 1,N
           u(i) = i+j+k
        ENDDO
        DO i = 2,N-1
           b(i,j,k) = u(i-1) + u(i)
        ENDDO
     ENDDO
  ENDDO
!$acc end region

  PRINT '(4I6)',b

END PROGRAM test

% pgf90 -ta=nvidia,time test1.f90 -fast -V10.6 -Minfo=accel
test:
     10, Generating copyout(b(2:3,1:4,1:4))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     11, Loop is parallelizable
     13, Loop is vectorizable
         Accelerator kernel generated
         11, !$acc do parallel, vector(4)
         13, !$acc do vector(4)
             CC 1.0 : 9 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 8 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     14, Loop is parallelizable
     17, Loop is parallelizable
% a.out
     0     7    15     0
     0     9    17     0
     0    11    19     0
     0    13    21     0
     0     9    15     0
     0    11    17     0
     0    13    19     0
     0    15    21     0
     0    11    15     0
     0    13    17     0
     0    15    19     0
     0    17    21     0
     0    13    15     0
     0    15    17     0
     0    17    19     0
     0    19    21     0

Accelerator Kernel Timing data
/tmp/mec/test1.f90
  test
    10: region entered 1 time
        time(us): total=92946 init=92536 region=410
                  kernels=21 data=24
        w/o init: total=410 max=410 min=410 avg=410
        13: kernel launched 1 times
            grid: [1]  block: [4x4]
            time(us): total=21 max=21 min=21 avg=21



I don’t like the sound of that - how can I diagnose this problem other than by my code not working? And rebooting a remote system you don’t own is non-trivial.

The NVIDIA driver has an issue where the pinned host memory it uses to perform data transfers can get corrupted if a CUDA program aborts abnormally (for example, if the user kills it via Ctrl-C). NVIDIA is aware of the problem, but I don’t know its status.

Whenever I start getting wrong answers or unexpected behavior from a program that had been working, my first step is to reset the NVIDIA driver or reboot the system. I agree that it’s not ideal, but it’s the only workaround I have found for this issue.

  • Mat