Vector array assignments within a $acc parallel region

Hi there,

This will almost certainly expose some misunderstanding I have about OpenACC, but I don’t know why this code runs differently with and without the $acc statements:

program acc_error_test

  implicit none 

  real(SELECTED_REAL_KIND( P=12,R=60)) :: temp1(20,260)

  integer :: b,bb,n 
  integer :: g
  integer :: i,i2,j1,j2
  integer      :: MtrBind(260), MtrBpar(0:259)
  !---------------------------------------------------------------------------------------!

  MtrBpar = 0
  do i = 1, 260
     if (i .le. 4) then
        MtrBind(i) = 1
        MtrBpar(i) = MtrBpar(i - 1) + 2
     else
        MtrBind(i) = i
        MtrBpar(i) = MtrBpar(i - 1)
     end if
  end do
  
  b  = 1
  i2 = 20
  
!$acc parallel copyout(temp1),pcopyin(MtrBpar, MtrBind)
  
  temp1 = 0.0
  
  j2 = 0
  do  n  = 1, 260
     bb = MtrBind(n)
     j1 = j2 + 1
     j2 = j2 + MtrBpar(bb) - MtrBpar(bb-1)
     if (bb == b) then
        temp1(1:i2,j1:j2) = -1.0
     endif
  enddo
  
!$acc end parallel
  print *, temp1(1:i2,1)

end program acc_error_test

With the $acc statements, the output is

-1.000000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.000000000000000

Without the $acc statements, the output is

-1.000000000000000 -1.000000000000000 -1.000000000000000
-1.000000000000000 -1.000000000000000 -1.000000000000000
-1.000000000000000 -1.000000000000000 -1.000000000000000
-1.000000000000000 -1.000000000000000 -1.000000000000000
-1.000000000000000 -1.000000000000000 -1.000000000000000
-1.000000000000000 -1.000000000000000 -1.000000000000000
-1.000000000000000 -1.000000000000000

The original code has some loops within the parallel region which I’d like to accelerate; I’ve isolated the behaviour that’s giving me problems down to the example above.

I’m compiling:

pgf90 -acc -Minfo=accel -Mlarge_arrays -mcmodel=medium -fast -o acc_error_test acc_error_test.f90

Is this behaviour expected? Presumably temp1 is being distributed across a gang (or multiple gangs?) and only one version of it is being returned. How would I otherwise copy back to the host an array that is set this way in a parallel region?

Thanks,
Rob

Try examples that can be broken into parallel operations.

You cannot calculate MtrBpar(i) until AFTER you calculate MtrBpar(i-1), so that loop cannot run in parallel.

When the operations in a loop do not depend on results from a previous iteration, the work can be distributed over multiple parallel processors.
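
To illustrate the difference, here is a minimal sketch (illustrative only, not code from this thread; the names are made up):

program dependence_sketch
   ! Illustrative sketch: the first loop carries a dependence (each
   ! iteration needs the previous result), so it must run sequentially;
   ! the second loop's iterations are independent and can be distributed.
   implicit none
   integer :: prefix(260), indep(260), i

   prefix(1) = 1
   do i = 2, 260                 ! NOT parallelizable: uses prefix(i-1)
      prefix(i) = prefix(i-1) + 1
   end do

!$acc parallel loop copyout(indep)
   do i = 1, 260                 ! parallelizable: iterations are independent
      indep(i) = 2 * i
   end do

   print *, prefix(260), indep(260)
end program dependence_sketch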

I did not try to determine why your answers differ. I suspect you assumed arrays are always initially zero, which is a bad assumption in Fortran. Local data can come from the stack, which means it will initially be garbage. Initialize any data to zero if the loop depends on initial values being zero.

dave

Hi Dave,

I don’t believe you read my post very closely :(.

As I said, the original code is much more involved, with things that can be parallelized; I merely isolated the problem for the forum.

The code looks more like

!$acc parallel 

... various isolated things that can be parallelized ...

... problem area ...

... various isolated things that can be parallelized ...
  
!$acc end parallel



I’m not sure what you’re talking about. The variables are initialized, unless for some reason you can’t initialize variables on the device (j2 in my example). Could you please take another look?

Thanks,
Rob

Hi Robert,

First, you have an out-of-bounds access to MtrBpar in the first loop when i==260 (it’s declared as MtrBpar(0:259)). Though, that’s not the main issue.

“parallel” expects the user to specify the work-sharing loops via “!$acc loop” directives. Without the loop directives, a single sequential kernel should be created. The exception is when the OpenACC 2.0 “auto” keyword is used; then the compiler is free to auto-parallelize inner loops. For legacy reasons, we auto-parallelize by default, and it’s this auto-parallelization that’s causing the wrong answers. (Note you can disable this via the -acc=noautopar flag, but you would then be left with a sequential kernel.) I’ll go ahead and add a problem report.
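
For reference, here is a minimal sketch (illustrative only, not Rob’s code) of the explicit work-sharing form, where the loop directive marks the loop whose iterations get distributed:

program loop_directive_sketch
   implicit none
   integer, parameter :: n = 260
   real(8) :: a(n)
   integer :: i

!$acc parallel copyout(a)
   ! inside "parallel", work sharing must be requested explicitly with "loop";
   ! statements outside a marked loop are not work-shared
!$acc loop gang vector
   do i = 1, n
      a(i) = 2.0d0 * i
   end do
!$acc end parallel

   print *, a(1), a(n)
end program loop_directive_sketch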

What’s creating the problem is that you have an outer sequential loop (n) combined with a parallel assignment to an array section whose size varies from iteration to iteration of the n loop. It doesn’t look to me like the compiler is generating correct code for the variable-size array assignment; it’s only setting the first value. The easy workaround is to use explicit loops instead of array syntax.

  • Mat
% cat test.f90
program acc_error_test

   implicit none

   real(SELECTED_REAL_KIND( P=12,R=60)) :: temp1(20,260)

   integer :: b,bb,n
   integer :: g
   integer :: i,i2,j1,j2,j
   integer      :: MtrBind(260), MtrBpar(0:260)
   !---------------------------------------------------------------------------------------!

   MtrBpar = 0
   do i = 1, 260
      if (i .le. 4) then
         MtrBind(i) = 1
         MtrBpar(i) = MtrBpar(i - 1) + 2
      else
         MtrBind(i) = i
         MtrBpar(i) = MtrBpar(i - 1)
      end if
   end do

   b  = 1
   i2 = 20

 !$acc parallel copyout(temp1),pcopyin(MtrBpar, MtrBind)

   temp1 = 1.0

   j2 = 0
   do  n  = 1, 260
      bb = MtrBind(n)
      j1 = j2 + 1
      j2 = j2 + MtrBpar(bb) - MtrBpar(bb-1)
      if (bb == b) then
         do i=1,i2
          do j=j1,j2
              temp1(i,j)=-1.0
          enddo
         enddo
!         temp1(1:i2,j1:j2) = -1.0
      endif
   enddo

 !$acc end parallel
   print *, temp1(1:i2,1)

 end program acc_error_test
% pgf90 test.f90 -acc -Minfo=accel; a.out
acc_error_test:
     27, Generating copyout(temp1(:,:))
         Generating present_or_copyin(mtrbpar(:))
         Generating present_or_copyin(mtrbind(:))
         Accelerator kernel generated
         29, !$acc loop vector(256) ! threadidx%x
         37, !$acc loop vector(256) ! threadidx%x
     27, Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     29, Loop is parallelizable
     32, Loop carried scalar dependence for 'j2' at line 34
         Loop carried scalar dependence for 'j2' at line 35
         Parallelization would require privatization of array 'temp1(i2+1,:)'
     37, Loop is parallelizable
     38, Loop carried reuse of 'temp1' prevents parallelization
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000

Hi Mat,

Thanks. I was getting a little frustrated with debugging accelerated code, so I tried to do things very incrementally, i.e., first specify the region so that I know the data is getting copied correctly, then insert the work-sharing loop directives. Maybe I was a little too incremental :).

Rob

edit:
ps - I noticed that copying in a structure causes a launch failure. I suppose that means only standard types can be copied in?

ps - I noticed that copying in a structure causes a launch failure. I suppose that means only standard types can be copied in?

You can use structures, provided they are of fixed size, since the data is required to be contiguous. Dynamically allocated data within structures, classes, and Fortran user-defined types is problematic in that it requires a deep copy, which would rebuild the data structure on the device.

This is a long-standing limitation of OpenACC and one of the most difficult to solve. The OpenACC committee is currently investigating, for the 3.0 specification, a standard method for performing a deep copy and/or restructuring data so that it’s contiguous.
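
For illustration, a minimal sketch of the fixed-size case (made-up example, not from the original code); the derived type below has only intrinsic, statically sized components, so its storage is contiguous:

program fixed_type_sketch
   implicit none
   type point
      real(8) :: x, y, z        ! fixed-size components only; no allocatables or pointers
   end type point
   type(point) :: pts(1000)
   integer :: i

   do i = 1, 1000
      pts(i)%x = real(i, kind=8)
      pts(i)%y = 2.0d0 * i
      pts(i)%z = 0.0d0
   end do

!$acc parallel loop copy(pts)
   do i = 1, 1000
      pts(i)%z = pts(i)%x + pts(i)%y
   end do

   print *, pts(1000)%z
end program fixed_type_sketch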

  • Mat

Hi all/Mat,

I hate to be a pest, but I’m still having similar issues. When I de-vectorized the array assignment in the original code, it still fails (with and without noautopar). So I went back to the test program and changed the initialization to something a little closer to the original problem. It still produces different results.

I’m also wondering if I’m a little naive here. In the absence of a specific warning message, simply defining a parallel region shouldn’t affect the results, correct?


program acc_error_test

  implicit none 

  real(SELECTED_REAL_KIND( P=12,R=60)) :: d_temp1(20,260)

  integer :: b,bb,n
  integer :: x,x1,x2,y,y1,y2
  integer :: g
  integer :: i,i2,j,j1,j2,k,k1,k2

  integer      :: M1(13), M2(0:13)
  !---------------------------------------------------------------------------------------!

  b = 1
  g = 1
  M1 = (/1,2,3,4,5,6,7,8,9,10,11,12,13/)
  M2 = (/0,20,40,60,80,100,120,140,160,180,200,220,240,260/)

  i2 = M2(1)-M2(0)

!$acc parallel copyout(d_temp1), pcopyin(M1, M2)
  
  d_temp1 = 0.0
  
  j2 = 0
  k2 = 0
  do  n  = 1, 13
     bb = M1(n)
     j1 = j2 + 1
     j2 = j2 + M2(bb) - M2(bb-1)
     if (bb == b) then
        do i = 1, i2
           do j = j1, j2
              d_temp1(i,j) = -1.0
           end do
        end do
     else
        k1 = k2 + 1
        k2 = k2 + M2(bb) - M2(bb-1)
        do j = 1, k2-k1+1
           do i = 1,i2
              do k = 1,i2
                 d_temp1(i,j+j1-1) = 1.0
              end do
           end do
        end do
     endif
  enddo

!$acc end parallel
  print *, d_temp1


end program acc_error_test



~/codes/sandbox> pgf90 -acc=noautopar -i8 -Minfo=accel -Mlarge_arrays -mcmodel=medium -fast -o acc_error_test acc_error_test.f90
acc_error_test:
     22, Generating present_or_copyin(m2(:))
         Generating present_or_copyin(m1(:))
         Generating copyout(d_temp1(:,:))
         Accelerator kernel generated
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     24, Loop is parallelizable
     28, Loop carried scalar dependence for 'j2' at line 30
         Loop carried scalar dependence for 'j2' at line 31
         Loop carried scalar dependence for 'k2' at line 39
         Loop carried scalar dependence for 'k2' at line 40
         Parallelization would require privatization of array 'd_temp1(i2+1,:)'
     33, Loop is parallelizable
     34, Loop carried reuse of 'd_temp1' prevents parallelization
     41, Parallelization would require privatization of array 'd_temp1(i3+1,:)'
     42, Loop is parallelizable
     43, Loop carried reuse of 'd_temp1' prevents parallelization
~/codes/sandbox> acc_error_test > test.out
launch CUDA kernel  file=/home/wiersmar/codes/sandbox/acc_error_test.f90 function=acc_error_test line=22 device=0 grid=10 block=1

Accelerator Kernel Timing data
/home/wiersmar/codes/sandbox/acc_error_test.f90
  acc_error_test  NVIDIA  devicenum=0
    time(us): 3,939
    22: compute region reached 1 time
        22: data copyin reached 2 times
             device time(us): total=26 max=19 min=7 avg=13
        22: kernel launched 1 time
            grid: [10]  block: [1]
             device time(us): total=3,879 max=3,879 min=3,879 avg=3,879
            elapsed time(us): total=3,892 max=3,892 min=3,892 avg=3,892
        51: data copyout reached 1 time
             device time(us): total=34 max=34 min=34 avg=34
~/codes/sandbox> head test.out
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000         0.000000000000000
    0.000000000000000         0.000000000000000         0.000000000000000
    0.000000000000000         0.000000000000000         0.000000000000000
    0.000000000000000         0.000000000000000         0.000000000000000
~/codes/sandbox> pgf90 -i8 -Mlarge_arrays -mcmodel=medium -fast -o acc_error_test acc_error_test.f90
~/codes/sandbox> acc_error_test > test.out
~/codes/sandbox> head test.out
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000
   -1.000000000000000        -1.000000000000000        -1.000000000000000

Thanks,
Rob

Hi Robert,

I’m debating what to do here. The compiler probably isn’t generating good code, but the code is serial and not really a good fit for an accelerator. Would you consider something more like the following, where only the array assignments are put on the accelerator?

  • Mat
program acc_error_test

   implicit none

   real(SELECTED_REAL_KIND( P=12,R=60)) :: d_temp1(20,260)

   integer :: b,bb,n
   integer :: x,x1,x2,y,y1,y2
   integer :: g
   integer :: i,i2,j,j1,j2,k,k1,k2

   integer      :: M1(13), M2(0:13)
   !---------------------------------------------------------------------------------------!

   b = 1
   g = 1
   M1 = (/1,2,3,4,5,6,7,8,9,10,11,12,13/)
   M2 = (/0,20,40,60,80,100,120,140,160,180,200,220,240,260/)

   i2 = M2(1)-M2(0)
!$acc data copyout(d_temp1)
!$acc kernels
   d_temp1 = 0.0
!$acc end kernels
   j2 = 0
   k2 = 0

   do  n  = 1, 13
      bb = M1(n)
      j1 = j2 + 1
      j2 = j2 + M2(bb) - M2(bb-1)
      if (bb == b) then
!$acc kernels loop
         do i = 1, i2
            do j = j1, j2
               d_temp1(i,j) = -1.0
            end do
         end do
      else
         k1 = k2 + 1
         k2 = k2 + M2(bb) - M2(bb-1)
!$acc kernels loop
         do j = 1, k2-k1+1
            do i = 1,i2
               do k = 1,i2
                  d_temp1(i,j+j1-1) = 1.0
               end do
            end do
         end do
      endif
   enddo

 !$acc end data
   print *, d_temp1


 end program acc_error_test

Looks good Mat - it works in my test program (and I can go back to vectorizing that one assignment). I hadn’t thought of using a data region for the arrays and letting the host processor worry about the indices.

Naturally, I’ll be back if it doesn’t go so well in the main program :).

Thanks again for your patience,
Rob

Hi there,

Just one more clarification. Minfo spits out the following if I use Mat’s code:

acc_error_test:
     22, Generating copyout(d_temp1(:,:))
     24, Generating present_or_copyout(d_temp1(:,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     25, Loop is parallelizable
         Accelerator kernel generated
         25, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     35, Generating present_or_copyout(d_temp1(:,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     36, Loop is parallelizable
         Accelerator kernel generated
         36, !$acc loop gang ! blockidx%y
             !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     41, Generating present_or_copyout(d_temp1(:,:))
         Generating NVIDIA code
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
         Generating compute capability 3.0 binary
     42, Loop is parallelizable
     43, Loop is parallelizable
     44, Loop carried reuse of 'd_temp1' prevents parallelization
         Inner sequential loop scheduled on accelerator
         Accelerator kernel generated
         42, !$acc loop gang ! blockidx%y
         43, !$acc loop gang, vector(128) ! blockidx%x threadidx%x

Does that mean that two copyout statements are generated (lines 35 and 41)? I’m trying desperately to avoid too much communication, since I know it’s going to kill me. On the other hand, the timing info shows the following:

Accelerator Kernel Timing data
/home/wiersmar/codes/sandbox/acc_error_test.f90
  acc_error_test  NVIDIA  devicenum=0
    time(us): 1,176
    22: data region reached 1 time
        52: data copyout reached 1 time
             device time(us): total=34 max=34 min=34 avg=34
    24: compute region reached 1 time
        25: kernel launched 1 time
            grid: [1x65]  block: [32x4]
             device time(us): total=110 max=110 min=110 avg=110
            elapsed time(us): total=130 max=130 min=130 avg=130
    35: compute region reached 1 time
        36: kernel launched 1 time
            grid: [1x20]  block: [128]
             device time(us): total=122 max=122 min=122 avg=122
            elapsed time(us): total=136 max=136 min=136 avg=136
    41: compute region reached 12 times
        44: kernel launched 12 times
            grid: [1x20]  block: [128]
             device time(us): total=910 max=109 min=63 avg=75
            elapsed time(us): total=1,088 max=122 min=77 avg=90

This seems to indicate that only one copyout was used. Which is it then?

Thanks again,
Rob

Hi Rob,

Does that mean that two copyout statements are generated (lines 35 and 41)?

This is a common question. Those are “present_or_copyout” operations, which check whether the data is already on the device before copying. The compiler adds them to allow for things such as pointer swapping and data regions that span subroutine boundaries.

In this case, the data is already there, so the copy is only performed once, at the end of the data region.
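
In other words (a minimal, made-up sketch, not your code): the outer data region owns the transfer, and the present_or checks inside the compute regions find the array already resident, so the actual copyout happens once, at the end of the data region:

program present_or_sketch
   implicit none
   real(8) :: d(20,260)
   integer :: i, j

!$acc data copyout(d)

!$acc kernels
   d = 0.0d0        ! implicit present_or_copyout(d) finds d already on the device
!$acc end kernels

!$acc kernels loop
   do j = 1, 260
      do i = 1, 20
         d(i,j) = real(i + j, kind=8)
      end do
   end do

!$acc end data      ! the single real copyout back to the host happens here
   print *, d(1,1), d(20,260)
end program present_or_sketch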

  • Mat

Hi,
I was going through this post as the issue here seems to be somewhat related to mine, where I also have vector assignments that give different results on CPU and GPU. I wanted to try mine with -acc=noautopar, but then I get

$ pgf90    -O -Mdalign -acc=noautopar -ta=nvidia,time -Minfo=inline,accel -Munixlogical -c -I. -Mnosave -Mfreeform -Mrecursive -Mreentrant -byteswapio -Minline=name:des_crossprdct_2d,name:des_crossprdct_3d,name:des_dotproduct,name:CFRELVEL_wall ./des/calc_force_des.f 
pgf90-Error-Switch -acc with unknown keyword noautopar
-acc[=strict|verystrict]
                    Enable OpenACC directives
    strict          Issue warnings for non-OpenACC accelerator directives
    verystrict      Fail with an error for any non-OpenACC accelerator directive

My version of PGI is 13.3. Is that the problem here?

Thanks much
Anirban

My version of PGI is 13.3. Is that the problem here?

Yes. We didn’t add the -acc=noautopar sub-option until 13.6.

  • Mat

Thanks Mat. Thought so. I will request that our PGI compilers be updated here.