Where am I going wrong?

I am trying to run the first Fortran program from the June 2009 PGI Insider, and I am getting some weird results. It is probably something stupid I did, but I just can’t see it. So, with your indulgence, my code is:

program dbl_it

integer :: n, i
real,dimension(:),allocatable :: a, r, e
character(10) :: arg1

if (iargc() > 0) then
  call getarg(1,arg1)
  read(arg1,'(i10)') n
else
  n=100000
endif

allocate(a(n),r(n),e(n))

do i=1,n
  a(i)=2.*i
enddo

!$acc region
do i=1,n
  r(i)=2.*a(i)
enddo
!$acc end region

do i=1,n
  e(i)=2.*a(i)
enddo

do i=1,n
  if (r(i) /= e(i)) then
    print *, i,r(i),e(i)
    stop 'error found'
  endif
enddo

print *, n,'iterations completed'

end program

Note the default for “n” is 100000. As long as I keep n less than or equal to 100000, everything is OK:

[CUDA]$ ./a.out
       100000 iterations completed
[CUDA]$ ./a.out 50000
        50000 iterations completed

But if I go beyond 100000, woe is me:

[CUDA]$ ./a.out 100001
       100001    0.000000        400004.0    
Warning: ieee_inexact is signaling
error found

It looks like somehow the GPU is picking up on that 100000 default value and only populating the array “r” up to 100000.

If I change the default to, say, 20000, then the GPU populates “r” only up to 20000. I put in a bunch of diagnostic output to look at n, and n always prints the correct value. But the GPU won’t go beyond the default value.

Many apologies if I am doing something obviously wrong. I have looked at this over and over and I can’t see it. Thanks.

Hi Cablesb,

This looks like a compiler error to me. If you look at the -Minfo output you’ll see that the compiler is using the default value of “100000” in the copy:

% pgf90 -ta=nvidia -Minfo=accel test.f90 
dbl_it:
     23, Generating present_or_copyin(a(1:100000))
         Generating present_or_copyout(r(1:100000))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     24, Loop is parallelizable
         Accelerator kernel generated
         24, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             CC 1.0 : 10 registers; 48 shared, 0 constant, 0 local memory bytes
             CC 2.0 : 14 registers; 0 shared, 64 constant, 0 local memory bytes

In the original code from the article, the compiler correctly uses “n” for the size:

% pgf90 -ta=nvidia -Minfo=accel test2.f90
main:
     21, Generating present_or_copyin(a(1:n))
         Generating present_or_copyout(r(1:n))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     22, Loop is parallelizable
         Accelerator kernel generated
         22, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             CC 1.0 : 10 registers; 52 shared, 0 constant, 0 local memory bytes
             CC 2.0 : 14 registers; 0 shared, 68 constant, 0 local memory bytes

The main difference between the two codes is the “if ( n .le. 0 ) n = 100000” in the original. Either adding this if statement or adding copy clauses that specify the arrays explicitly will work around the problem.

% cat test.f90 
program dbl_it

integer :: n,i
real,dimension(:),allocatable :: a, r, e
character(10) :: arg1

if (iargc() > 0) then
  call getarg(1,arg1)
  read(arg1,'(i10)') n
else
  n=100000
endif

! ADD THIS
if( n .le. 0 ) n = 100000

allocate(a(n),r(n),e(n))

do i=1,n
  a(i)=2.*i
enddo

! OR ADD THIS
!$acc region copyin(a), copyout(r)
do i=1,n
  r(i)=2.*a(i)
enddo
!$acc end region

do i=1,n
  e(i)=2.*a(i)
enddo

do i=1,n
  if (r(i) /= e(i)) then
    print *, i,r(i),e(i)
    stop 'error found'
  endif
enddo

print *, n,'iterations completed'

end program
% pgf90 -ta=nvidia -Minfo=accel test.f90
dbl_it:
     24, Generating present_or_copyout(r(:))
         Generating present_or_copyin(a(:))
         Generating compute capability 1.0 binary
         Generating compute capability 2.0 binary
     25, Loop is parallelizable
         Accelerator kernel generated
         25, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             CC 1.0 : 10 registers; 64 shared, 0 constant, 0 local memory bytes
             CC 2.0 : 14 registers; 0 shared, 80 constant, 0 local memory bytes
% a.out 1000000
      1000000 iterations completed
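
A variant of the copy-clause workaround, for anyone who prefers to state the transfer size outright: the sketch below writes the bounds as a(1:n) and r(1:n) instead of naming the whole arrays. This assumes the accelerator directives accept subarray bounds in copyin/copyout clauses; it has not been checked against the compiler versions discussed in this thread, so treat it as an illustration rather than a confirmed fix. The program name dbl_it_bounds and the use of the standard command_argument_count/get_command_argument intrinsics are choices made for this sketch, not part of the original code.

```fortran
program dbl_it_bounds
  implicit none
  integer :: n, i
  real, dimension(:), allocatable :: a, r
  character(10) :: arg1

  ! Same argument handling as the program above, using the
  ! standard Fortran 2003 intrinsics instead of iargc/getarg.
  if (command_argument_count() > 0) then
    call get_command_argument(1, arg1)
    read(arg1, '(i10)') n
  else
    n = 100000
  endif

  allocate(a(n), r(n))

  do i = 1, n
    a(i) = 2.*i
  enddo

  ! Explicit bounds tie the transfer size to the run-time value
  ! of n (assumed syntax; see the note above).
  !$acc region copyin(a(1:n)), copyout(r(1:n))
  do i = 1, n
    r(i) = 2.*a(i)
  enddo
  !$acc end region

  ! Check the accelerator result against the same expression
  ! computed on the host.
  do i = 1, n
    if (r(i) /= 2.*a(i)) then
      print *, i, r(i), 2.*a(i)
      stop 'error found'
    endif
  enddo

  print *, n, 'iterations completed'
end program dbl_it_bounds
```

Whichever form you use, it is worth checking the -Minfo=accel output: the copyin/copyout messages should no longer show the constant 1:100000 range.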

I’ve submitted TPR#18950 to engineering for further investigation. Note that this is a zero-day bug that occurs in every compiler version. Thanks for finding it!

  • Mat

So it wasn’t me??? Wow. That’s a first! :) Anyway, thanks for the tip, and glad to be of service.