Accelerator compiler bug with sequential rewriting matrices.

Hi,
The code below, run on a GTX480 with CC30, produces unpredictable values when copiedInArray is copied from host memory into a local temporary array that exists only on the GPU.
The arrays are real*4 and have the same dimensions x=90, y=90, z=1500 (the z dimension is probably what matters here).

!$acc region
      do k=2,z
        do j=1,y
          do i=1,x
            localGPUArray(i,j,k) = copiedInArray(i,j,k)
          enddo
        end do
      end do
!$acc end region

It appears that the compiler divides the job in a weird manner between computation units on the GPU (90x90x1499).

A quick fix for this problem, so that the values in both arrays match at every index, was to make any one of these loops sequential. However, neither the compiler nor the profiler gave any hint that these calculations might misbehave without the !$acc do seq.

!$acc region
      do k=2,z
        do j=1,y
!$acc do seq
          do i=1,x
            localGPUArray(i,j,k) = copiedInArray(i,j,k)
          enddo
        end do
      end do
!$acc end region

If you know a better way to fill a local GPU array with host-uploaded data, please let me know. I hope you will be able to reproduce this problem and address it with a fix :)

Regards,
Nicolas Dobski

Hi Nicolas,

It appears that the compiler divides the job in a weird manner between computation units on the GPU (90x90x1499).

This makes sense given that your k loop starts at 2. The compiler only allocates the minimum amount of space on the device, which in this case is 1499 elements along the third dimension. You can override this behavior using the copy and local clauses.
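As a rough sketch of what explicit clauses could look like, here is a small program that names the arrays on the region directive. The sizes are taken from your post; resultArray is a hypothetical output array I added so the copy can be inspected on the host, and the fill value 1.0 is arbitrary. Treat it as an illustration under those assumptions rather than a drop-in replacement for your code.

```fortran
program explicit_clauses
  implicit none
  ! Sizes come from the original post; resultArray is a hypothetical
  ! output array added only so the result can be checked on the host.
  integer, parameter :: x = 90, y = 90, z = 1500
  real, allocatable, dimension(:,:,:) :: copiedInArray, localGPUArray, resultArray
  integer :: i, j, k

  allocate(copiedInArray(x,y,z), localGPUArray(x,y,z), resultArray(x,y,z))
  copiedInArray = 1.0   ! arbitrary test value
  resultArray   = 0.0

  ! copyin: the whole input array is sent to the GPU
  ! local:  the scratch array lives on the GPU only and is never transferred
  ! copy:   the whole output array goes in and comes back, so the k=1 plane
  !         that the loops never write keeps its host value
!$acc region copyin(copiedInArray) local(localGPUArray) copy(resultArray)
  do k = 2, z
    do j = 1, y
      do i = 1, x
        localGPUArray(i,j,k) = copiedInArray(i,j,k)
        resultArray(i,j,k)   = localGPUArray(i,j,k)
      enddo
    enddo
  enddo
!$acc end region

  print *, resultArray(1,1,1), resultArray(1,1,2), resultArray(x,y,z)
end program explicit_clauses
```

Naming the whole arrays asks the compiler to allocate all 1500 planes on the device instead of only the 2:z range it would infer from the loop bounds.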


Can you post a reproducing example? Here’s my attempt to recreate the issue, but my simple example works fine.

% cat copy3d.f90


program copy3d

real, allocatable, dimension(:,:,:) :: A,B
integer :: i,j,k
integer :: x,y,z

x=90
y=90
z=1500

allocate(A(x,y,z), B(x,y,z))

do i=1,x
  do j = 1,y
    do k=1, z
       A(i,j,k)=real(i*j)/real(k)
    enddo
  enddo
enddo


!$acc region
do k=2, z
  do j = 1,y
    do i=1,x
       B(i,j,k) = A(i,j,k)
    enddo
  enddo
enddo
!$acc end region

print *, A(1,1,2), A(1,1,1500)
print *, B(1,1,2), B(1,1,1500)

end program copy3d

% pgf90 copy3d.f90 -ta=nvidia -Minfo=accel -V10.9 ; a.out
copy3d:
     24, Generating copyin(a(1:90,1:90,2:1500))
         Generating copyout(b(1:90,1:90,2:1500))
         Generating compute capability 1.0 binary
         Generating compute capability 1.3 binary
     25, Loop is parallelizable
     26, Loop is parallelizable
     27, Loop is parallelizable
         Accelerator kernel generated
         25, !$acc do parallel, vector(4)
         26, !$acc do parallel, vector(4)
         27, !$acc do vector(16)
             CC 1.0 : 8 registers; 24 shared, 52 constant, 0 local memory bytes; 100% occupancy
             CC 1.3 : 8 registers; 24 shared, 52 constant, 0 local memory bytes; 100% occupancy
   0.5000000       6.6666666E-04
   0.5000000       6.6666666E-04
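
Printing two elements only spot-checks the result, so a reproducer for your case may need a fuller comparison. The sketch below is a variant of copy3d that compares every element the region should have written. The names copy3d_check and nbad, and the sentinel value -1.0, are my own additions for illustration.

```fortran
program copy3d_check
  implicit none
  real, allocatable, dimension(:,:,:) :: A, B
  integer :: i, j, k
  integer :: x, y, z
  integer :: nbad   ! hypothetical counter of mismatched elements

  x = 90
  y = 90
  z = 1500

  allocate(A(x,y,z), B(x,y,z))

  ! Same test data as copy3d, with the first index innermost
  do k = 1, z
    do j = 1, y
      do i = 1, x
        A(i,j,k) = real(i*j)/real(k)
      enddo
    enddo
  enddo

  ! Sentinel value: any element the region fails to write stays at -1.0
  B = -1.0

!$acc region
  do k = 2, z
    do j = 1, y
      do i = 1, x
        B(i,j,k) = A(i,j,k)
      enddo
    enddo
  enddo
!$acc end region

  ! Compare only the planes the region was supposed to copy (k = 2..z)
  nbad = count(A(:,:,2:z) /= B(:,:,2:z))
  print *, 'mismatched elements:', nbad
end program copy3d_check
```

If this reports a nonzero count on your system, posting the program together with its -Minfo=accel output would help narrow the problem down.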