CUDA host array allocation problem

I’m currently learning PGI CUDA Fortran (PGI Workstation 10.9, Win XP 32-bit) and wrote a little test code. It allocates an allocatable array on the host. It works well when the array is small, but the allocation fails when the array gets big. The curious thing is that I never even call the subroutines.

Also, this happens only when attributes(global) is declared on the kernel subroutine. The code and the results are below.

Does anybody here know what is happening? Is there a limit on the size of an allocated array, or am I missing something important? I’d be very glad to hear any suggestions.

- Sungjin, Kim.

module linear_system_cu
  use cudafor

contains
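  attributes(global) subroutine jacobi_kernel(a, b, x, x_new, n)
    implicit none
    ! n is declared first because the array bounds below depend on it.
    integer, value :: n
    real, device :: a(n,n), b(n)
    real, device :: x_new(n), x(n)

    ! Intentionally empty: the kernel is never launched in this test.
  end subroutine jacobi_kernel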

  subroutine jacobi(a, x, b, tol)
    implicit none
    real, dimension(:,:), intent(in) :: a
    real, dimension(:), intent(inout) :: x
    real, dimension(:), intent(in) :: b
    real, intent(in) :: tol

  end subroutine jacobi

end module linear_system_cu


program alloc
  use linear_system_cu

  implicit none

  real, dimension(:,:,:), allocatable :: a
  integer :: ierr

  write(*,*) "Test 1."

  allocate(a(5, 100, 100), stat=ierr)

  if (ierr /= 0) then
     write(*,*) "Could not allocate a."
  else
     write(*,*) "Allocated a."
  end if

  if (allocated(a)) then 
     deallocate(a)
  end if

  write(*,*) "Test 2."

  allocate(a(5, 10000, 10000), stat=ierr)

  if (ierr /= 0) then
     write(*,*) "Could not allocate a."
  else
     write(*,*) "Allocated a."
  end if

end program alloc

The result is:

PGI$ pgf90 -Mcuda alloc.f90 linear_system_cu.f90
alloc.f90:
linear_system_cu.f90:
PGI$ ./alloc.exe
 Test 1.
 Allocated a.
 Test 2.
 Could not allocate a.

However, if the kernel declaration is modified to drop attributes(global), like this:

  subroutine jacobi_kernel(a, b, x, x_new, n)

the result becomes:

PGI$ pgf90 -Mcuda alloc.f90 linear_system_cu.f90
alloc.f90:
linear_system_cu.f90:
PGI$ ./alloc.exe
 Test 1.
 Allocated a.
 Test 2.
 Allocated a.

Hi Sungjin, Kim,

The problem here is that your array is simply too big. The maximum size of all user memory in a Win32 process is 2GB, less whatever the OS itself needs, so the actual maximum you can allocate is closer to 1.75GB. You are trying to allocate just over 1.8GB (5 x 10000 x 10000 x 4 bytes = 2,000,000,000 bytes). You can have the OS extend the user space to 3GB by passing the link flag “-Wl,largeaddressaware”, which would allow you to allocate more memory. However, I have never tested this flag with CUDA Fortran, so I don’t know whether you’ll encounter other issues. I would suggest limiting your memory usage or moving to 64-bit Windows.
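
To make the size estimate above concrete, here is a minimal sketch that computes the request in 64-bit integers and compares it with the 2GB user space. The program name and variable names are only illustrative, and it assumes default REAL is 4 bytes, as in the estimate above.

```fortran
program array_size_check
  implicit none

  ! 64-bit integer kind, so the byte count cannot overflow a default integer.
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: dp = kind(1.0d0)

  ! Dimensions of the failing "Test 2" allocation.
  integer(i8), parameter :: n1 = 5_i8, n2 = 10000_i8, n3 = 10000_i8
  ! Assumption: default REAL occupies 4 bytes.
  integer(i8), parameter :: bytes_per_real = 4_i8
  ! 2GB user address space of a 32-bit Windows process.
  integer(i8), parameter :: user_space = 2_i8 * 1024_i8**3

  integer(i8) :: nbytes

  nbytes = n1 * n2 * n3 * bytes_per_real

  write(*,'(a,i0)')   "Requested bytes: ", nbytes
  write(*,'(a,f6.3)') "Requested GB:    ", real(nbytes, dp) / 1024.0_dp**3
  write(*,'(a,f6.3)') "Fraction of 2GB: ", real(nbytes, dp) / real(user_space, dp)
end program array_size_check
```

The request comes to 2,000,000,000 bytes, which is most of the 2GB space before the OS, the runtime, and the program itself have taken their share. A check like this before a large allocate can tell you whether a failure is expected on a 32-bit system.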

Note that the program fails for me with or without the “attributes(global)”. That it works for you in one case is most likely just luck. You’re right at the borderline of memory usage, so slight variations in the code could change the behavior.

Hope this helps,
Mat

I have just posted in computing and compiling that with a new Tesla 2000 series card in TCC mode, the whole host memory can probably be addressed from the device without pinning. You need the latest version of the compiler, Windows 7, and at least a Tesla 2050, and the Tesla must be in TCC mode. You allocate memory according to the pinned memory model but without the pinned attribute. I have not yet tested this for very large arrays, but that is what Nvidia designed it for. PGI support said it did not work, but I think they either did not use a Tesla 2000 series card, did not have it in TCC mode, or were not using Windows 7 or CUDA 4.0. It is not a cheap solution if you have to buy a new card and Windows 7, but the Tesla C2070s have 6GB of GDDR5 on board and are good value for money.