I have run into a confusing problem while using OpenACC to accelerate a multi-block CFD case. I cannot attach the research CFD code here, so I wrote a small test code that mimics the problem as closely as I can.

The test code (attached below) contains one main program and one module subroutine. The main program has a triply nested loop, which is accelerated with a “!$acc loop collapse(3)” directive and enclosed in a “!$acc parallel” region. The loop size can be adjusted. Inside the loop, a routine is called that prints i, j and k; the routine is defined as a module subroutine and compiled for the device with “!$acc routine seq”.

I compiled the code using “pgfortran -acc -ta=tesla:cc60,cuda8.0 -Minfo=accel loop_size_test.f90 -o loop_size_test”, and ran it using “./loop_size_test > printout_ijk.dat” to redirect the output to a file. Then I counted the printouts in the file.

If length = [72, 48, 3], the total number of printouts should be 72 × 48 × 3 = 10368, but the output file contains only 4096. I also imported the data file into Excel and sorted the data, which confirmed that some printouts are missing. If length = [20, 20, 3] (commented out in the code), the number of printouts is correct (20 × 20 × 3 = 1200). I repeated the test many times with different loop sizes and never got more than 4096 printouts. Why is that?
I am using an NVIDIA P100 GPU and the PGI 17.5 compiler. The code is attached here:
module printout
  implicit none
contains
  subroutine printout_ijk(i, j, k)
    !$acc routine seq
    integer, intent(in) :: i, j, k
    continue
    !Do nothing except for printing out i, j and k
    print *, i, j, k
  end subroutine printout_ijk
end module printout

program loop_size_test
  use printout, only : printout_ijk
  implicit none
  integer, dimension(3) :: length
  integer :: i, j, k
  continue
  length = [72, 48, 3]
  !length = [20, 20, 3]
  !$acc data copyin(length(1:3))
  !$acc parallel
  !$acc loop independent collapse(3)
  do k = 1, length(3)
    do j = 1, length(2)
      do i = 1, length(1)
        call printout_ijk(i, j, k)
      end do
    end do
  end do
  !$acc end parallel
  !$acc end data
end program loop_size_test
PS: If I compile the test code with “pgfortran -acc -Minfo=accel loop_size_test.f90 -o loop_size_test”, it works fine. However, my CFD code requires the option “-ta=tesla:cc60”; without it I get runtime errors. Since getting the CFD code to run correctly is my primary goal, I need to keep that option. I also tested the code on an older GPU with compute capability 2.0, and there it does not work correctly even without “-ta=tesla:cc60”.
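To tell whether loop iterations are actually being skipped or only their printouts are lost, one option is to count the iterations on the device instead of printing from them. The following is a minimal sketch, not taken from my CFD code: the program name loop_count_test and the counter n_iter are made up for illustration, and I am assuming a sum reduction on the collapsed loop is acceptable for this check.

```fortran
program loop_count_test
  implicit none
  integer, dimension(3) :: length
  integer :: i, j, k
  integer :: n_iter   ! hypothetical counter, not part of the original test

  length = [72, 48, 3]
  n_iter = 0

  ! Same loop nest as above, but each iteration adds 1 to a reduction
  ! variable instead of calling the printing routine.
  !$acc parallel loop collapse(3) reduction(+:n_iter) copyin(length)
  do k = 1, length(3)
    do j = 1, length(2)
      do i = 1, length(1)
        n_iter = n_iter + 1
      end do
    end do
  end do

  ! Printed on the host, so this line does not go through device-side output.
  print *, 'iterations executed:', n_iter, ' expected:', product(length)
end program loop_count_test
```

It can be compiled with the same options as the test above. If the two numbers agree while the printing version still produces only 4096 lines, that would suggest the loop itself runs in full and the loss happens in the device-side printing.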