Privatization of arrays

I ran a parallel loop as follows:

#define imax 257
#define jmax 129
#define kmax 129
#define nn 50

static float arr1[imax][jmax][kmax];
static float arr2[imax][jmax][kmax];

#pragma acc region
{
for(n=1; n<nn-1; ++n){

   for(i=1; i<imax-1; ++i){
      for(j=1; j<jmax-1; ++j){
         for(k=1; k<kmax-1; ++k){
            arr1[i][j][k] = arr2[i][j][k];
         }
      }
   }

   for(i=1; i<imax-1; ++i){
      for(j=1; j<jmax-1; ++j){
         for(k=1; k<kmax-1; ++k){
            arr2[i][j][k] = arr1[i][j][k];
         }
      }
   }
} /* end n loop */

}


I got the following message when compiling:

Parallelization would require privatization of array arr1[i2+1][i3+1][1:kmax-2]

I privatized the arrays as follows:


static float arr1[nn][imax][jmax][kmax];
static float arr2[nn][imax][jmax][kmax];

#pragma acc region
{
for(n=1; n<nn-1; ++n){

   for(i=1; i<imax-1; ++i){
      for(j=1; j<jmax-1; ++j){
         for(k=1; k<kmax-1; ++k){
            arr1[n][i][j][k] = arr2[n][i][j][k];
         }
      }
   }

   for(i=1; i<imax-1; ++i){
      for(j=1; j<jmax-1; ++j){
         for(k=1; k<kmax-1; ++k){
            arr2[n][i][j][k] = arr1[n][i][j][k];
         }
      }
   }
} /* end n loop */

}

I got an out-of-memory error when running the code:

call to cuMemAlloc returned error 2: Out of memory

Are there different ways of privatizing the arrays so as not to get the out-of-memory error?

Thanks in advance,
Viet

Viet,

Did you try privatizing the array using the pragma clauses? With Fortran it’s an !$acc do clause, so I imagine it’s a #pragma acc for one in C (I’m a Fortran programmer so caveat lector).

So you might try:
#pragma acc region
{
#pragma acc for private(arr1)
for(n=1;n<nn-1;++n){…

Essentially, you add that private clause on the line directly before the for-loop that it must apply to.

Dear TheMatt,

Thank you very much. I was able to use the pragma clause to privatize the arrays as follows:


#pragma acc region
{
#pragma acc for private(arr2[1:imax-2][1:jmax-2][1:kmax-2], arr1[1:imax-2][1:jmax-2][1:kmax-2])

for(n=1; n<nn-1; ++n){

   for(i=1; i<imax-1; ++i){
      for(j=1; j<jmax-1; ++j){
         for(k=1; k<kmax-1; ++k){
            arr1[i][j][k] = arr2[i][j][k];
         }
      }
   }

   for(i=1; i<imax-1; ++i){
      for(j=1; j<jmax-1; ++j){
         for(k=1; k<kmax-1; ++k){
            arr2[i][j][k] = arr1[i][j][k];
         }
      }
   }

} /* end n loop */
}

However, the out-of-memory error is still a problem. Is there a better way to privatize the arrays in this case?

Thanks in advance,
Viet

Hi Viet,

When you privatize an array, you are creating a temporary copy for each thread, which can dramatically increase your memory usage. Also, since the private arrays are temporary, their values are not stored back to the host. Full details on the private clause can be found in section 2.4.4 of the PGI Accelerator Programming Model guide (http://www.pgroup.com/resources/accel.htm).

Backing up to the original code, the reason the outer loop won't parallelize is that all values of n (i.e., all threads) need to access the same i, j, and k elements of the arrays. Depending on the order in which the threads store their results, the values stored in the arrays would change, leading to non-deterministic results.

Instead of having the “n” loop be the outer loop, could it be moved to the innermost position? This would allow you to parallelize the i, j, and k loops, run the n loop sequentially inside the kernel, reduce the data movement, and increase your compute intensity.

For example:

#pragma acc region
{
for(i=1; i<imax-1; ++i){
   for(j=1; j<jmax-1; ++j){
      for(k=1; k<kmax-1; ++k){
         for(n=1; n<nn-1; ++n){
            arr1[i][j][k] = arr2[i][j][k];
         }
      }
   }
}
...
}

Hope this helps,
Mat

Dear Mat,

Thank you very much for the suggestion of moving the “n” loop to the innermost loop so as to parallelize the i, j, and k loops.

In order to do so, I have to parallelize the computation of arr1 and arr2 as well, and handle synchronization, as shown in the following sketches:

for(i=1; i<imax-1; ++i){
   for(j=1; j<jmax-1; ++j){
      for(k=1; k<kmax-1; ++k){
         for(n=1; n<nn-1; ++n){
            arr1[i][j][k] = arr2[i][j][k];
            /* wait for computing arr2 at step (n) */
         }
      }
   }
}

for(i=1; i<imax-1; ++i){
   for(j=1; j<jmax-1; ++j){
      for(k=1; k<kmax-1; ++k){
         for(n=1; n<nn-1; ++n){
            arr2[i][j][k] = arr1[i][j][k];
            /* wait for computing arr1 at step (n) */
         }
      }
   }
}

How can I use pragma clauses to implement this?

Thanks in advance,
Viet

Hi Viet,

The inner “n” loops should be executed sequentially, but just in case, you can use “#pragma acc for kernel” or “#pragma acc for seq” to force it. Be sure to compile with “-Minfo=accel” and watch the generated output; the compiler will tell you how the loops were scheduled.

You can also use “#pragma acc for parallel” and “#pragma acc for vector” to manually adjust the parallel schedule.

On a side note, compile with “-Msafeptr” or declare your arrays with the C99 restrict keyword. Otherwise the compiler won’t be able to prove the arrays don’t alias, and it can’t parallelize the loops.

  • Mat

Can “!$acc do private” be used for nested loops, e.g. for “u” as below?

PROGRAM test

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 4
  INTEGER :: b(N,N,N),i,j,k,u(N)

  b(:,:,:) = 0

!$acc region
  DO k = 1,N
     DO j = 1,N
        DO i = 1,N
           u(i) = i+j+k
        ENDDO
        DO i = 2,N-1
           b(i,j,k) = u(i-1) + u(i)
        ENDDO
     ENDDO
  ENDDO
!$acc end region                                                                

  PRINT '(4I6)',b

END PROGRAM test

I tried putting “!$acc do private(u)” above the k and/or j loops, but nothing worked (wrong answers or a cuMemFree error).

I can, of course, add a j-index to u and a “do private” to k, but I would like to accelerate the original source version just using directives.

Thanks for your help,

Alistair.

P.S. I’m using v10.5 of the compiler, if it matters.

Hi Alistair,

It seems to work for me when I add “!$acc do private(u)” above the k loop. Try rebooting your system (or resetting your CUDA driver); the pinned memory that the NVIDIA driver uses can enter an inconsistent state, leading to this type of behavior.

  • Mat
% cat test1.f90
PROGRAM test

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 4
  INTEGER :: b(N,N,N),i,j,k,u(N)

  b(:,:,:) = 0

!$acc region
!$acc do private(u)
  DO k = 1,N
     DO j = 1,N
        DO i = 1,N
           u(i) = i+j+k
        ENDDO
        DO i = 2,N-1
           b(i,j,k) = u(i-1) + u(i)
        ENDDO
     ENDDO
  ENDDO
!$acc end region

  PRINT '(4I6)',b

END PROGRAM test

% pgf90 -ta=nvidia -Minfo=accel test1.f90 -V10.5
test:
     10, Generating copyout(b(2:3,1:4,1:4))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     12, Loop is parallelizable
     13, Loop carried reuse of 'u' prevents parallelization
     14, Loop is parallelizable
         Accelerator kernel generated
         12, !$acc do parallel, vector(4)
         13, !$acc do seq
         14, !$acc do parallel, vector(4)
             Using register for 'u'
             CC 1.0 : 7 registers; 24 shared, 40 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 7 registers; 24 shared, 40 constant, 0 local memory bytes; 25 occupancy
     17, Loop is parallelizable
         Accelerator kernel generated
         12, !$acc do parallel, vector(4)
         13, !$acc do seq
         17, !$acc do parallel, vector(2)
             Cached references to size [3] block of 'u'
             CC 1.0 : 8 registers; 40 shared, 44 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 8 registers; 40 shared, 44 constant, 0 local memory bytes; 25 occupancy
% a.out
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0
     0    13    15     0

Thanks for the reply.

But the correct result (compiling without -ta flag and running on host) is

% pgf90 test1.F90
% ./a.out
     0     7     9     0
     0     9    11     0
     0    11    13     0
     0    13    15     0
     0     9    11     0
     0    11    13     0
     0    13    15     0
     0    15    17     0
     0    11    13     0
     0    13    15     0
     0    15    17     0
     0    17    19     0
     0    13    15     0
     0    15    17     0
     0    17    19     0
     0    19    21     0

The only way I could get agreement with the host code was to explicitly privatise u by making it a three-index array in the source code. I was hoping to do the acceleration without changing the Fortran.

I don’t like the sound of that - how can I diagnose this problem other than by my code not working? And rebooting a remote system you don’t own is non-trivial.

Cheers,

Alistair.

Sorry about that, Alistair; let me try again, this time adding the ‘kernel’ clause to the J loop.

% cat test1.f90
PROGRAM test

  IMPLICIT NONE

  INTEGER, PARAMETER :: N = 4
  INTEGER :: b(N,N,N),i,j,k,u(N)

  b(:,:,:) = 0

!$acc region
  DO k = 1,N
!$acc do private(u), kernel
     DO j = 1,N
        DO i = 1,N
           u(i) = i+j+k
        ENDDO
        DO i = 2,N-1
           b(i,j,k) = u(i-1) + u(i)
        ENDDO
     ENDDO
  ENDDO
!$acc end region

  PRINT '(4I6)',b

END PROGRAM test

% pgf90 -ta=nvidia,time test1.f90 -fast -V10.6 -Minfo=accel
test:
     10, Generating copyout(b(2:3,1:4,1:4))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     11, Loop is parallelizable
     13, Loop is vectorizable
         Accelerator kernel generated
         11, !$acc do parallel, vector(4)
         13, !$acc do vector(4)
             CC 1.0 : 9 registers; 20 shared, 28 constant, 0 local memory bytes; 33 occupancy
             CC 1.3 : 8 registers; 20 shared, 28 constant, 0 local memory bytes; 25 occupancy
     14, Loop is parallelizable
     17, Loop is parallelizable
% a.out
     0     7    15     0
     0     9    17     0
     0    11    19     0
     0    13    21     0
     0     9    15     0
     0    11    17     0
     0    13    19     0
     0    15    21     0
     0    11    15     0
     0    13    17     0
     0    15    19     0
     0    17    21     0
     0    13    15     0
     0    15    17     0
     0    17    19     0
     0    19    21     0

Accelerator Kernel Timing data
/tmp/mec/test1.f90
  test
    10: region entered 1 time
        time(us): total=92946 init=92536 region=410
                  kernels=21 data=24
        w/o init: total=410 max=410 min=410 avg=410
        13: kernel launched 1 times
            grid: [1]  block: [4x4]
            time(us): total=21 max=21 min=21 avg=21



I don’t like the sound of that - how can I diagnose this problem other than by my code not working? And rebooting a remote system you don’t own is non-trivial.

The NVIDIA driver has an issue where the pinned host memory it uses to perform data transfers can get corrupted if a CUDA program aborts abnormally (for example, if the user kills it via Ctrl-C). NVIDIA is aware of the problem, but I don’t know its status.

Whenever I start getting wrong answers or unexpected behavior from a program that had been working, my first step is to reset the NVIDIA driver or reboot the system. I agree that it’s not ideal, but it’s the only workaround I have found for this issue.

  • Mat