Fatal error: fortran auto allocation failed

Hi,

I’m kinda new to cuda-fortran and I’m having trouble with my code when certain variables are larger.
With small inputs, the function runs correctly, but when I use bigger inputs, the function crash with the following message:

FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED
...
FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED
0: copyout Memcpy (host=0xe80ca80, dev=0x2b59cccb1e00, size=728428) FAILED: 719(unspecified launch failure)

The error seems to occur when I transfer the variable from the device to the host (Y1=Y_d, line 142 of the file Test_Cuda_fct.cuf).
You can find the program file and the output message in the attachment.
The medium_input folder is not in it due to its large size (750mo).
I run my program with the following commands:

nvfortran -c MOD_deviceQuery.cuf
nvfortran Test_Cuda_fct.cuf MOD_deviceQuery.o -o Test_Cuda_fct.x
nvprof ./Test_Cuda_fct.x

Test_Cuda_fct.zip (112.2 KB)

Thank you in advance for your help,
Best regards,
Rémy Bretin

The error is probably occurring in the kernel itself, not this line. Since kernels are launched asynchronously, unless the code specifically checks the error status, errors from the kernel would be seen in the next GPU operation, which in this case is the copy

Also due to this, your timing will be meaningless since the CPU doesn’t block until it reaches the copy (after your timers), so I’d recommend adding a call to “cudaDeviceSyncronize” after the kernel calls

The actual error is most likely due to your automatics. Automatics implicitly allocate memory and the default device heap is quite small. You can increase this by calling cudadevicesetlimit using cudaLimitMallocHeapSize.

Though, device side allocation can be slow and adversely effect performance. So if you’re able, you should consider rewriting the algorithm to not use automatics.

-Mat

Hi Mat,

Thank you again for your answer.
Right now, you are kinda speaking Chinese to me (or any other language that I wouldn’t understand) but I will look more into this and will come back when I will understand what you are speaking about.

Thank you,
Have a great day,
R.

Hi Mat,

So I found a way to get around the issue without really understanding how to be honest. I was fine until now because I’m now receiving the same error message for another subroutine. It might have to be because of the way how I declare my variables in the subroutine I guess.

I try to look what do you mean by “automatics” and what is the “heap space” but I didn’t find anything that I could understand. Could you help me please ?

Concerning variables declarations, I saw so many ways to do it on internet that I don’t know what is the most efficient/reliable. Could you tell me in the three following examples which one would be the best way ?

MODULE EXAMPLES
  USE CUDAFOR
  USE MOD_INIT_DIMENSIONS 
  ! nA,nB,nC would be defined in this module. 
  ! The module would be loaded only for example1.
  ! nA,nB,nC are not fixed, they are user inputs.

contains

attribute(global) subroutine example1(A,B,C)
  implicit none
  real,device,intent(in),dimension(0:nA) :: A
  real,device,intent(in),dimension(0:nB) :: B
  real,device,intent(out),dimension(nC,nC) :: C
  integer,value :: i,j
  real,device,dimension(nA,nC) :: Buff(:,:)
      '... same following code ...'
end subroutine example1

attribute(global) subroutine example2(A,B,C,nA,nB,nC)
  implicit none
  integer,value,intent(in) :: nA,nB,nC
  real,device,intent(in) :: A(0:nA),B(0:nB)
  real,device,intent(out) :: C(nC,nC)
  integer,value :: i,j
  i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
  j = blockDim%y * (blockIdx%y - 1) + threadIdx%y
  real,device :: Buff(nA,nC)
      '... same following code ...'
end subroutine example2

attribute(global) subroutine example3(A,B,C)
  implicit none
  real,device,intent(in) :: A(0:),B(0:)
  real,device,intent(out) :: C(:,:)
  integer,value :: i,j,nA,nB,nC
  real,device,allocatable :: Buff(:,:)
  nA=size(A)-1; nB=size(B)-1; nC=size(C,1);
  ALLOCATABLE(Buff(nA,nC))
      '... same following code ...'
end subroutine example3

END MODULE EXAMPLES

Thank you in advance for your help,
Remy

Hi Remy,

Computer programs have what’s known as an “address space”. This is the accessible memory that it uses during the execution of the program. The address space is divided up into several sections such as the instructions (i.e. the executable code), constants (read-only memory), static data (global memory such as your module variables). These memory spaces are created when the program is first loaded by the OS. The other two are “stack” and “heap”. Stack holds things local variables and grows and shrinks as you call subroutines. Heap is dynamic, meaning space is created dynamically (i.e. via allocation and depends on the execution of the program on when and how much is used).

Of course, this is a very generalized explanation. Doing a web search for terms like “heap memory”, “address space”, “heap vs stack” will give you more details.

Though a few thing into keep in mind is that the amount of memory available to each segment is finite. The exact amount of memory available will vary widely by the OS being used, OS limits sets, compiler options, architecture, etc, though unless needed, I’ll skip the details. Just understand that a stack or heap overflow just means that these segments ran out available memory.

Also, the address spaces for the CPU and GPU are different spaces.

An “automatic” is a variable sized local array (i.e. declared in a subroutine) whose size is passed in as arguments to the subroutine. By default, the memory for automatics is dynamically allocated on the heap. So in general not great for performance since this adds overhead, worse for GPUs since this allocation is serialized (i.e. each software execution unit, “threads”, needs to take turns allocating the memory). Also the amount of memory available for the heap is relatively small by default (though you can increase this via CUDA call or an environment variable). For the CPU, we can allocate automatics on the stack via the “-Mstack_arrays” flag or when OpenMP (-mp) is enabled.

Just in case, I’ll explain the “value” attribute. By default, Fortran passes variables to subroutines by reference, i.e. it passes in the address of the variable. So for CUDA Fortran this means that you need to be passing variables who’s address is in the GPU’s address space (i.e. with the device attribute). The “value” attribute changes this so the actual value of the variable is passed in, not the address, and then stored in a local variable. This means you can pass in the value of variables which are not stored in the GPU’s address space. Very useful for scalars so you don’t need to create corresponding device variables and explicitly copy them.

For your examples as far as how you’re passing, I’d tend towards #2, but #3 using assumed shape arrays is fine as well. #1 is ok but would require the loop bound variables be module device arrays.

The problem with all three is “Buff”. #1 and #2 it’s an automatic and #3 it’s allocatable. But given automatics are implicitly allocated, there’s no real difference.

Not knowing your algorithm, you may just need to take the performance hit as well as increasing the heap size available via cudaDeviceLimit as noted above. Ideally you’d make “Buff” fixed size or manually privatize it, i.e. allocate a device array that’s size to the nA,nC with additional dimensions for each CUDA thread, and then pass in this array as an argument. If you were using OpenACC, you’d put Buff in a “private” clause and the compiler would automatically do this for you.

Speaking of which, have you considered using OpenACC instead? Not that you shouldn’t use CUDA, but often domain scientist who aren’t compute science experts find it much easier to work with.

-Mat

Hi Mat,

Thank you very much for this detailed answer. I understand better the issue now.

I still have some questions though concerning the memory allocation and heap space: does subroutine variables with “intent(in or out)” are stored in the global memory (variables A,B,C from my previous examples) and the local variable in the heap space (variable buff from my examples) ? Does the heap space is shared by all the threads, and therefore the heap space needed increases with the number of threads called ?

Currently I have a subroutine which works correctly until a certain number of threads. It works for 512 threads but for 1024 my output variable remains untouched (identical as before calling the subroutine) without any error message. And increasing the heap space doesn’t solve the problem:

  integer(kind=cuda_count_kind) :: val
  istat = cudaDeviceGetLimit(val,cudaLimitMallocHeapSize)
  istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize,10*val)

So I don’t really know where the issue might come from. Subroutine too big ?
Here is my function if you want to take a look at it :

attributes(global) subroutine Section2_km(g_num,g_coord,el_sd,ip_w,dee,  &
                       storkm, nels,nn,nod,nip,ndim,nst,ndof,nodof) 
  implicit none
  integer,value,intent(in) :: nels,nn,nod,nip,ndim,nst,ndof,nodof
  integer,device,intent(in) :: g_num(nod,nels)
  real(iwp),device,intent(in) :: g_coord(ndim,nn),ip_w(nip), &
  el_sd(ndim,nod,nip), dee(nst,nst)
  real(iwp),device,intent(out) :: storkm(ndof,ndof,nels)
  real(iwp),device :: zero=0.0_iwp,jac(3,3),jacinv(3,3), &
                      bee(nst,ndof), det, deriv(ndim,nod) 
                    ! ndof<60, nod<20, nst=6, ndim=3
  integer,value :: iel,ipt,i,j,k,n,m,l,istat

  iel = blockDim%x * (blockIdx%x - 1) + threadIdx%x
  IF (iel <= nels) THEN; storkm(:,:,iel)=zero;  
  DO ipt=1,nip
    ! coord=g_coord(:,g_num(:,iel)) 
    ! der=el_sd(:,:,ipt)
    ! jac=MATMUL(der,coord) 
    jac=zero
    DO i=1,ndim; DO j=1,ndim; DO k=1,nod
    jac(i,j)=jac(i,j)+el_sd(i,k,ipt)*g_coord(j,g_num(k,iel))
    END DO; END DO; END DO

    det=jac(1,1)*(jac(2,2)*jac(3,3)-jac(3,2)*jac(2,3)) &
    -jac(1,2)*(jac(2,1)*jac(3,3)-jac(3,1)*jac(2,3)) &
    +jac(1,3)*(jac(2,1)*jac(3,2)-jac(3,1)*jac(2,2))

    jacinv(1,1)=( jac(2,2)*jac(3,3)-jac(3,2)*jac(2,3))/det
    jacinv(2,1)=(-jac(2,1)*jac(3,3)+jac(3,1)*jac(2,3))/det
    jacinv(3,1)=( jac(2,1)*jac(3,2)-jac(3,1)*jac(2,2))/det
    jacinv(1,2)=(-jac(1,2)*jac(3,3)+jac(3,2)*jac(1,3))/det
    jacinv(2,2)=( jac(1,1)*jac(3,3)-jac(3,1)*jac(1,3))/det
    jacinv(3,2)=(-jac(1,1)*jac(3,2)+jac(3,1)*jac(1,2))/det
    jacinv(1,3)=( jac(1,2)*jac(2,3)-jac(2,2)*jac(1,3))/det
    jacinv(2,3)=(-jac(1,1)*jac(2,3)+jac(2,1)*jac(1,3))/det
    jacinv(3,3)=( jac(1,1)*jac(2,2)-jac(2,1)*jac(1,2))/det

    ! deriv=MATMUL(jacinv,der) 
    deriv=zero
    DO i=1,ndim; DO j=1,nod; DO k=1,ndim
    deriv(i,j)=deriv(i,j)+jacinv(i,k)*el_sd(k,j,ipt)
    END DO; END DO; END DO

    bee=zero
    DO m=1,nod; n=3*m; k=n-1; l=k-1
    bee(1,l)=deriv(1,m); bee(4,k)=deriv(1,m); bee(6,n)=deriv(1,m)
    bee(2,k)=deriv(2,m); bee(4,l)=deriv(2,m); bee(5,n)=deriv(2,m)
    bee(3,n)=deriv(3,m); bee(5,k)=deriv(3,m); bee(6,l)=deriv(3,m)
    END DO

    ! km=km+MATMUL(MATMUL(TRANSPOSE(bee),dee),bee)*det*ip_w(ipt)
    ! storkm(:,:,iel) = km
    DO i=1,ndof; DO j=1,ndof 
      DO k=1,nst; DO l=1,nst
      storkm(i,j,iel)=storkm(i,j,iel)+bee(l,i)*dee(l,k)*bee(k,j)*det*ip_w(ipt)
      ! istat = atomicadd( storkm(i,j,iel) , bee(l,i)*dee(l,k)*bee(k,j)*det*ip_w(ipt) )
      END DO; END DO; 
    END DO; END DO
  END DO
  END IF
end subroutine Section2_km

Concerning OpenACC vs CUDA this is not really my call. I’m working on an extension of an already existing code in cuda-fortran (Fraser Kirk might ring a bell to you ? he told me you met and helped him).

Thank you again for your great support,
Sincerely,
Remy

Of course! Kirk has one of our early adopters of CUDA Fortran long ago and a “power” user. I’ll get occasional notes from him. Though it’s probably been 7+ years(?) since I looked at SPHriction-3D, which what I assume you’re working on.

does subroutine variables with “intent(in or out)” are stored in the global memory (variables A,B,C from my previous examples) and the local variable in the heap space (variable buff from my examples) ?

“intent” is just a hint to the compiler for optimization. We tend to ignore “intent” and instead rely on compiler analysis. The A, B, and C variables would be stored in the hardware’s global memory.

While not trying to make things more complicated (I was trying to simplify and give generalizations before), GPU local variables would be in local memory and stored in registers, though local memory could be located in global memory if registers “spill” or there’s a fixed sized array. Though data allocated either explicitly or implicitly via automatics within a kernel is allocated in the heap (again stored in the hardware global memory). So the automatic array “bee” is a local variable but the allocated data it points to is in the heap.

Things like global memory, registers, and caches are terms for the physical memory on the hardware. local, heap and stack are software constructs. Though for now, lets not worry too much about data placement (i.e. where which physical memory the program, compiler, OS, or runtime stores the data). You may need to think about this during optimization, but for now let’s get the program to run correctly.

As for the problem, are you adding error checking after the kernel is launched? CUDA doesn’t give an error unless the code specially checks. My best guess is that the kernel is erroring but without the check the main code is continuing on it’s way. Borrowing some code (from How to Query Device Properties and Handle Errors in CUDA Fortran | NVIDIA Developer Blog) error checking would look something like:

call Section2_km<<dimGrd,dimBlk>>(x_d, y_d, a)
ierrSync = cudaGetLastError()
ierrAsync = cudaDeviceSynchronize()
if (ierrSync /= cudaSuccess) write(*,*) &
  ’Sync kernel error:’, cudaGetErrorString(ierrSync)
if (ierrAsync /= cudaSuccess) write(*,*) &
  ’Async kernel error:’, cudaGetErrorString(ierrAsync)

While this may not tell us specifically what’s wrong, it should give clues. It could be a problem with your launch bounds, out of bounds error, or some other issue, Doubt it’s a heap overflow given you setting the device limits and each thread using only around 3KB of memory.

The next step, if it’s a problem with the kernel, would be to compile the code with “-g” to add debugging information and then run the code through the cuda-gdb debugger. Cuda-gdb isn’t great with Fortran, but may give some insights.

Also, you can run your code using the ‘cuda-memcheck’ utility or it’s more recent replacement ‘compute-sanitizer’ especially to determine if it’s a memory access issue (like accessing an array out-of-bounds).

Hi Mat,

So I added the error check you advised me and I get the following message with dimGrd=1 and dimBlk=1024 (nels=102, but I also get the same issue for bigger inputs): Sync kernel error: too many resources requested for launch.
test_km.zip (839.0 KB)

Concerning cuda-gdb or compute-sanitizer, I haven’t installed it yet. I will, if the mentioned error message is not enough for you to guide me.

Thank you again for your help,
Remy
PS: yes I’m working on the SPH3D.

Too many registers for the number of threads in the block.

Looking at the ptxinfo, Fun_km3 uses 184 registers per thread, and Fun_km2 uses 89. From the ‘nvaccelinfo’ utility output, you can see that the maximum number of registers per block is 64K.

% nvfortran -fast MOD_km_fun.cuf TEST_km.cuf -gpu=ptxinfo
MOD_km_fun.cuf:
ptxas info    : 80 bytes gmem
ptxas info    : Compiling entry function 'mod_functions_fun_km3_' for 'sm_70'
ptxas info    : Function properties for mod_functions_fun_km3_
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 184 registers, 448 bytes cmem[0]
ptxas info    : Compiling entry function 'mod_functions_fun_km2_' for 'sm_70'
ptxas info    : Function properties for mod_functions_fun_km2_
    1104 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 89 registers, 432 bytes cmem[0]
ptxas info    : Compiling entry function 'mod_functions_fun_km_' for 'sm_70'
ptxas info    : Function properties for mod_functions_fun_km_
    1072 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 48 registers, 432 bytes cmem[0]
TEST_km.cuf:
% nvaccelinfo | grep Registers
Registers per Block:           65536

So for Fun_km3, the max number of threads per block is 356 (65536/184), and Fun_km2 is 736 (65536/89). Your kernels are running out of registers and hence the “too many resources requested for launch” error, when using 512 (km3) and 1024 (both km2 and km3) threads.

You can use the flag “-gpu=maxregcount:64” to have PTXAS uses fewer registers per block, but registers will then “spill” (i.e. be stored in global memory) so would adversely impact performance. Better to limit the thread block size to 128. Note in general a thread block size of 128 is optimal and larger thread blocks are only beneficial when using shared memory.

Best guess as to why km3 is using so many more registers is due to the added integers to hold the sizes, but the array descriptors for the allocatable arrays are probably being placed there as well. Another reason not to allocate on the device.

Note that ‘cuda-gdb’ and ‘cuda-memcheck’ can be found in the NV HPC SDK under the “cuda” bin directories, for example “<base_install_dir>/Linux_x86_64/21.9/cuda/11.4/bin”. ‘compute-sanitizer’ is new and found under ‘<base_install_dir>/Linux_x86_64/21.9/cuda/11.4/compute-sanitizer/’.

Hope this helps,
Mat

OK , thank you for your quick answer,

But then why Fun_km uses less registers than Fun_km2 when both functions have the exact same variables, the only difference is that Fun_km has a 2D kernel (which forces me to use atomicadd) whereas Fun_km2 is a one dimension Kernel ?

Hi Mat,

So I made different tests trying to figure out what influence the use of register by comparing 3 different variations in the code (making at total of 2**3 different subroutines doing the exact same calculations):

  • Making a 1D or 2D kernel (atomicadd being needed for the 2D kernel) : for identical variable declaration, the number of register is drastically different.
  • Defining the local variable in a flexible way (Au) or in a fixed way (Fx): this has an impact for the 1D case, but none for the 2D case. Also, the ‘Au’ versions give me “FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED” for larger inputs due to the heap space full (even though nst,ndof,ndim,nod,nip, the dimensions of the local variables, are identical and only nels and nn grow in size).
  • Reducing the number of local variable by using the same ones for different purpose (Lg vs Sh): it has barely no impact reducing the number of local variables.
    Register_tests_km.zip (649.9 KB)

In the end, I have no idea what influence the number of registers used, and therefore, I don’t know how I should proceed for future subroutine without wasting my time doing these kind of tests.

Your advice would be of a great help,
Thank you in advance,
Remy

Sorry, I don’t know why this would be the case. My only guess is there’s less need to hold address calculations in temp variables.

Defining the local variable in a flexible way (Au) or in a fixed way (Fx): this has an impact for the 1D case, but none for the 2D case. Also, the ‘Au’ versions give me “FATAL ERROR: FORTRAN AUTO ALLOCATION FAILED” for larger inputs due to the heap space full (even though nst,ndof,ndim,nod,nip, the dimensions of the local variables, are identical and only nels and nn grow in size).

Did the total number of threads increase? While the amount of heap space per thread remained constant, if the number of threads increased, the total heap space needed would increase as well.

Reducing the number of local variable by using the same ones for different purpose (Lg vs Sh): it has barely no impact reducing the number of local variables.

This can help sometimes but registers also hold intermediate values, array address computation, and other temps. Often registers can be reused, but this depends on the lifetime and use of the variable.

This is a bit of a black box and I don’t fully know all the details on how PTXAS does register allocation, but from my understanding, let’s look at this line:

jacinv(1,1)=( jac(2,2)*jac(3,3)-jac(3,2)*jac(2,3))/det

jac(2,2), jac(3,3), jac(3,2), and jac(2,3) need to be fetched from local memory and brought into registers (or possibly since jac is fixed size, it may just put into registers to begin with). Then the intermediate computations of “jac(2,2)*jac(3,3)” and “jac(3,2)*jac(2,3)” need to be held in registers, then the result of the subtraction, and finally the division.

In the “km” kernel, the registers can be reused for other computation since the “jac” array is no longer used. However in “km2” you now have loop and “jac” is re-used in each iteration. Hence PTXAS is most likely keeping these in registers thus increasing the usage. Same for the other small arrays.

Given the problem size is small (102x4), it’s difficult to assess performance, but because of this, it seems the “km” version the correct way to go. Though the automatics are still hurting quite a bit. I see significant improvement after making these fixed size arrays. If you can’t do that, then you may consider manually privatizing them, i.e. make them 4D arrays, like “bee(nst,ndof,nels,nip)”, and passing them into the kernel.