Found an error if a loop size is too large

Hi all,

I have run into a confusing problem when using OpenACC to accelerate a multi-block CFD case. I cannot attach the research CFD code here, so I wrote a simple code that mimics the problem as closely as I can. The code (attached below) contains one main program and one module subroutine. The main program has a triply nested loop, which is accelerated with a “!$acc loop collapse(3)” directive and enclosed in a “!$acc parallel” region. The loop size can be adjusted. Inside the loop, a routine is called to test printouts; it is defined as a module subroutine that runs on the device. As you can see, the code is pretty simple.

I compiled the code using “pgfortran -acc -ta=tesla:cc60,cuda8.0 -Minfo=accel loop_size_test.f90 -o loop_size_test”, and ran it using “./loop_size_test > printout_ijk.dat” to redirect the output to a file. Then I checked the number of printouts in the file.

If length = [72, 48, 3], the total number of printouts should be 72 x 48 x 3 = 10368, but the output file has only 4096 printouts. I also imported the data file into Excel and sorted the data, and found that some printouts are missing. If length = [20, 20, 3] (the commented-out line), the number of printouts is correct. I tested many times with different loop sizes and found that the maximum number of printouts is 4096. I am wondering why!

I am using a Nvidia P100 GPU, PGI/17.5 compiler. The code is attached here:

module printout

  implicit none

contains

  subroutine printout_ijk(i, j, k)

    !$acc routine seq
    integer, intent(in)    :: i, j, k

  continue

    !Do nothing except for printing out i, j and k
    print *, i, j, k

  end subroutine printout_ijk

end module printout

program loop_size_test

  use printout, only : printout_ijk

  implicit none

  integer, dimension(3) :: length
  integer               :: i, j, k

continue

  length = [72, 48, 3]
  !length = [20, 20, 3]

  !$acc data copyin(length(1:3))

  !$acc parallel
  !$acc loop independent collapse(3)
  do k = 1, length(3)
    do j = 1, length(2)
      do i = 1, length(1)
        call printout_ijk(i, j, k)
      end do
    end do
  end do
  !$acc end parallel

  !$acc end data

end program loop_size_test

Best Regards,

Weicheng Xue

PS: If I compile the simple code with “pgfortran -acc -Minfo=accel loop_size_test.f90 -o loop_size_test”, it works fine. However, my CFD code requires the option “-ta=tesla:cc60”; without it there are runtime errors. Since getting my CFD code to run correctly is my primary purpose, I need to keep that option. I also tested the simple code on an older GPU with compute capability 2.0, and there it does not work correctly even without “-ta=tesla:cc60”.

Hi Weicheng Xue,

You’re running into the limits of the CUDA print buffer size. CUDA will buffer the print statements on the device, then return this buffer to the host for printing. If you go over the buffer size, later prints will overwrite earlier ones.

See: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#limitations

To increase the print buffer size, call the CUDA Fortran API routine “cudaDeviceSetLimit”. For example:

% cat size.F90
module printout

  implicit none

contains

  subroutine printout_ijk(i, j, k)

    !$acc routine seq
    integer, intent(in)    :: i, j, k

  continue

    !Do nothing except for printing out i, j and k
    print *, i, j, k

  end subroutine printout_ijk

end module printout

program loop_size_test

  use printout, only : printout_ijk
  use cudafor
  implicit none

  integer, dimension(3) :: length
  integer               :: i, j, k, rc
  integer(kind=cuda_count_kind) :: val

continue

  length = [72, 48, 3]
  !length = [20, 20, 3]
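  ! Size the device print buffer generously: 1024 bytes per expected print.
  ! Set the limit before launching the compute region that prints.
  val = length(3)*length(2)*length(1)*1024
  rc = cudaDeviceSetLimit(cudaLimitPrintfFifoSize, val)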

  !$acc parallel
  !$acc loop independent collapse(3)
  do k = 1, length(3)
    do j = 1, length(2)
      do i = 1, length(1)
        call printout_ijk(i, j, k)
      end do
    end do
  end do
  !$acc end parallel


end program loop_size_test


% pgf90 -ta=tesla -Mcuda -Minfo=accel size.F90; a.out > size.txt
printout_ijk:
      7, Generating acc routine seq
         Generating Tesla code
loop_size_test:
     38, Generating Tesla code
         40, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
         41,   ! blockidx%x threadidx%x collapsed
         42,   ! blockidx%x threadidx%x collapsed
% wc -l size.txt
10368 size.txt
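
If you want to confirm that the new limit actually took effect, a minimal sketch along the following lines may help. It is not part of the test case above: the program name and the variables old_size, new_size and wanted are made up for illustration, and the 1024 bytes per print is the same rough allowance as in size.F90, not a measured figure. It assumes the cudafor module provides cudaDeviceGetLimit and cudaGetErrorString alongside cudaDeviceSetLimit, and that you compile with -Mcuda as above.

```fortran
program printf_fifo_limit
  ! Illustrative sketch only: query, raise, and re-query the device
  ! print buffer size. Names here are hypothetical, not from size.F90.
  use cudafor
  implicit none

  integer :: rc
  integer(kind=cuda_count_kind) :: old_size, new_size, wanted

  ! Assumed budget: 72*48*3 = 10368 prints at 1024 bytes each.
  wanted = int(72*48*3, cuda_count_kind) * 1024

  ! Read the current print buffer size in bytes.
  rc = cudaDeviceGetLimit(old_size, cudaLimitPrintfFifoSize)
  if (rc /= cudaSuccess) print *, 'GetLimit failed: ', cudaGetErrorString(rc)

  ! Raise the limit; do this before any kernel that prints is launched.
  rc = cudaDeviceSetLimit(cudaLimitPrintfFifoSize, wanted)
  if (rc /= cudaSuccess) print *, 'SetLimit failed: ', cudaGetErrorString(rc)

  ! Read the limit back to see what the runtime accepted.
  rc = cudaDeviceGetLimit(new_size, cudaLimitPrintfFifoSize)
  if (rc /= cudaSuccess) print *, 'GetLimit failed: ', cudaGetErrorString(rc)

  print *, 'print buffer bytes before/after: ', old_size, new_size
end program printf_fifo_limit
```

Checking rc matters here because a failed cudaDeviceSetLimit leaves the old buffer size in place, and the only symptom would again be missing lines in the output.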

Hope this helps,
Mat

Hi Mat,

Thanks very much for your help! I really learned something new.

Best Regards,

Weicheng