CUDA host array allocation problem

sungjinkim · May 19, 2011, 6:03am

I’m currently learning PGI CUDA Fortran (PGI Workstation 10.9, Windows XP 32-bit) and wrote a small test program. It allocates an allocatable array on the host. It works when the array is small, but the allocation fails when the array gets large. Curiously, the failure occurs even though I never call the subroutines.

Also, this happens only when the kernel subroutine is declared with attributes(global). The code and the results are below.

Does anybody know what is happening? Is there a limit on the size of an allocatable array, or am I missing something important? I’d be very glad to hear any suggestions.

Sungjin, Kim.

module linear_system_cu
  use cudafor

contains
  attributes(global) subroutine jacobi_kernel(a, b, x, x_new, n)
     implicit none
     real, device :: a(n,n), b(n)
     real, device :: x_new(n), x(n)
     integer, value :: n

   end subroutine jacobi_kernel

  subroutine jacobi(a, x, b, tol)
    implicit none
    real, dimension(:,:), intent(in) :: a
    real, dimension(:), intent(inout) :: x
    real, dimension(:), intent(in) :: b
    real, intent(in) :: tol

  end subroutine jacobi

end module linear_system_cu


program alloc
  use linear_system_cu

  implicit none

  real, dimension(:,:,:), allocatable :: a
  integer :: ierr

  write(*,*) "Test 1."

  allocate(a(5, 100, 100), stat=ierr)

  if (ierr /= 0) then
     write(*,*) "Could not allocate a."
  else
     write(*,*) "Allocated a."
  end if

  if (allocated(a)) then 
     deallocate(a)
  end if

  write(*,*) "Test 2."

  allocate(a(5, 10000, 10000), stat=ierr)

  if (ierr /= 0) then
     write(*,*) "Could not allocate a."
  else
     write(*,*) "Allocated a."
  end if

end program alloc

The result is:

PGI$ pgf90 -Mcuda alloc.f90 linear_system_cu.f90
alloc.f90:
linear_system_cu.f90:
PGI$ ./alloc.exe
 Test 1.
 Allocated a.
 Test 2.
 Could not allocate a.

However, if I remove attributes(global) from the kernel declaration, like this:

  subroutine jacobi_kernel(a, b, x, x_new, n)

the result becomes:

PGI$ pgf90 -Mcuda alloc.f90 linear_system_cu.f90
alloc.f90:
linear_system_cu.f90:
PGI$ ./alloc.exe
 Test 1.
 Allocated a.
 Test 2.
 Allocated a.

MatColgrove · May 19, 2011, 4:15pm

Hi Sungjin, Kim,

The problem here is that your array is simply too big. The maximum size of all user memory in a Win32 process is 2GB, less whatever the OS itself needs, so the actual maximum you can allocate is closer to 1.75GB. You are trying to allocate just over 1.8GB (5 x 10000 x 10000 x 4 bytes = 2,000,000,000 bytes). You can have the OS extend the user space to 3GB by passing the link flag “-Wl,largeaddressaware”, which would allow you to allocate more memory. However, I have never tested this flag with CUDA Fortran, so I don’t know if you’ll encounter other issues. I would suggest limiting your memory usage or moving to 64-bit Windows.

Note that the program fails for me with or without “attributes(global)”. That it works for you in one case is most likely just luck. You’re right at the borderline of memory usage, so slight variations in the code could change the behavior.

Hope this helps,
Mat
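
The size arithmetic in the reply above can be checked before calling allocate. The following is a minimal sketch in standard Fortran, not code from the thread: the program name and the constant names (i8, bytes_per_real, win32_user_space) are hypothetical, and it assumes a default REAL of 4 bytes and the 2GB Win32 user address space quoted above.

```fortran
program alloc_size_check
  implicit none

  ! 64-bit integer kind, so the byte count cannot overflow.
  integer, parameter :: i8 = selected_int_kind(18)

  ! Assumption: default REAL occupies 4 bytes, as in the reply above.
  integer(i8), parameter :: bytes_per_real = 4_i8

  ! Assumption: 2GB user address space for a 32-bit Windows process.
  integer(i8), parameter :: win32_user_space = 2_i8 * 1024_i8**3

  integer(i8) :: n1, n2, n3, nbytes

  ! Extents of the array from "Test 2" in the original post.
  n1 = 5_i8
  n2 = 10000_i8
  n3 = 10000_i8

  nbytes = n1 * n2 * n3 * bytes_per_real

  write(*,*) "Requested bytes:  ", nbytes
  write(*,*) "Requested GB:     ", real(nbytes) / 1024.0**3
  write(*,*) "Win32 user space: ", win32_user_space

  if (nbytes > win32_user_space) then
     write(*,*) "Request exceeds the 32-bit user address space."
  else
     write(*,*) "Request fits nominally, but leaves little headroom."
  end if
end program alloc_size_check
```

For the 5 x 10000 x 10000 array the request comes to 2,000,000,000 bytes, roughly 1.86GB when 1GB is counted as 1024^3 bytes. That is below the nominal 2GB limit but above the roughly 1.75GB that is usable in practice, which is consistent with the allocation succeeding or failing depending on small changes in the program.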

WilliamRae59305 · July 26, 2011, 10:51pm

I have just posted in computing and compiling that, with a new Tesla 2000 series card in TCC mode, the whole host memory can probably be addressed from the device without pinning. You need the latest version of the compiler, Windows 7, and a Tesla 2050 at minimum. The Tesla must be in TCC mode. You allocate memory according to the pinned memory model, but without the pinned attribute. I have not yet tested this with very large arrays, but that is what Nvidia designed it for. PGI support said it did not work, but I think they either did not use a Tesla 2000 series card, or it was not in TCC mode, or they were not using Windows 7 or CUDA 4.0. It is not a cheap solution if you have to buy a new card and Windows 7, but the Tesla C2070s have 6GB of GDDR5 on board and they are good value for money.
