Fatal error: Fortran auto allocation failed

Dear Mat and all,

I am getting an error, apparently when copying a device array to a host array. I get many lines of the following output message:


and also I get

0: copyout Memcpy (host=0x25f56f0, dev=0x7fd150e00000, size=56448) FAILED: 4(unspecified launch failure)
  • I am using pgi 18.10
  • cuda-memcheck and all other debugging flags do not give any more info
  • The error only occurs when the problem size exceeds a certain amount
  • I have checked that enough memory is available on the device.

I would appreciate your help and advice on this.


P.S. The code is part of the package I provided before (https://github.com/amir-saadat/BDpack), which I hope you still have installed. The particular example that shows this is at (https://github.com/amir-saadat/BDpack/tree/master/projects/semidilute_comb). The error occurs at line 1238 of the file box_mod.f90.

Hi Amir,

I just updated my git repository and was able to build and successfully run the code (using semidilute_dumb_shear) with 19.1.

Given the error, you’re most likely running into a problem we had with CUDA Fortran code when we first enabled F2003 allocatable semantics by default in 18.10. You might try adding the flag “-Mallocatable=95” to revert to F95 semantics and see if that works around the issue.

Note that the next PGI Community Edition, 19.4, should be out soon, and you can try that as well.


Hi Mat,

Many thanks for your reply.

“-Mallocatable=95” didn’t seem to resolve the issue. I added the flag for compiling both host and device codes. Please run the example in “semidilute_comb” in particular to hopefully reproduce the error (using the executable, please run “mpirun -np 1 …/…/bin/BDpack”).


Hi Amir,

I went back and compiled the code with 18.10 and ran the semidilute_comb workload. Unfortunately, I’m not able to reproduce the error you show. The code does eventually seg fault in the dot product at line 290 of “semidilute_bs/pp_smdlt.f90”, but that doesn’t seem related and looks like a different issue.

I did have to modify your make.inc file to get things to link, using the settings below. One thing I did differently was to not link the CUDA libraries directly but rather use the PGI “-Mcuda -Mcudalib=cublas” flags. The compiler will implicitly add the CUDA libraries and, more importantly, will make sure they come from the same CUDA version that the compilers are using. Could you have a mismatch between the CUDA version PGI is using and the CUDA 9.2 libraries? In PGI 18.10, we look at which CUDA driver version you have installed to determine which CUDA version to use. If you have the CUDA 10.0 driver, hard coding the CUDA 9.2 libraries in the makefile could be an issue.

Also, I’m not sure how you got the code to link without the Intel runtime libraries on the link line. Are you using a mpif90 driver configured to use ifort?

GLBLIBS += -Mcuda -Mcudalib=cublas
GLBLIBS += -L ~/mkl_pgi18/lib/intel64/ -L/opt/intel/compilers_and_libraries_2019/linux/lib/intel64/
GLBLIBS += -lmkl_lapack95_lp64 -lmkl_blas95_lp64 ~/mkl_pgi18/include/blas.o
GLBLIBS +=  -lifcore -limf -lirc -lm -lsvml -lintlc


Hi Mat,

Many thanks again for your reply and effort.

I tried to follow all of your suggestions, made and pushed all of these changes to the git repository.

  • I modified make.inc to comment out all of the direct linking to the CUDA libraries. I usually use -Mcuda as a flag, but using it together with the other link libraries did not make a difference.

  • The default CUDA on the cluster is now cuda/10.0 (I am loading the module). I am running on XSEDE’s Bridges cluster, and here is my bashrc:

# loading modules
module load cuda/10.0
module load mpi/pgi_openmpi/18.10
export MKLROOT=/opt/intel/compilers_and_libraries_2019/linux/mkl
export MAGMAROOT=/home/asaadat/magma_mkl_pgi
export INTELROOT=/opt/intel
export MKLLIB=$MKLROOT/lib/intel64
export CUDADIR=/opt/packages/cuda/10.0
export INTELLIB=$INTELROOT/lib/intel64

# adding libraries
  • mpif90 is configured to use pgi
$mpif90 -V
pgfortran 18.10-1 64-bit target on x86-64 Linux -tp haswell 
PGI Compilers and Tools
Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
  • I am linking to intel libraries using
GLBLIBS += -L$(MKLROOT)/lib/intel64/
GLBLIBS += -lmkl_intel_lp64 -lmkl_core
GLBLIBS += -lmkl_sequential -lpthread

and I link to lapack and blas libraries that I made myself using pgf95 compiler:

GLBLIBS += -L ~/mkl_pgi18/lib/intel64/
GLBLIBS += -lmkl_lapack95_lp64 -lmkl_blas95_lp64

I assumed these are the only critical ones – if I don’t link to these ones, I get a compilation error. Did you need to add the other ones?

  • I have not seen a seg fault in the pp_smdlt file (line 290 is not a dot_product, though). The file is not critical; it is basically just for post-processing. In the repository, I modified the input file so that it won’t get called (by setting DumpConf = FALSE), just in case you try running it again.

    I hope you would still have some advice on resolving the issue. Does “Fortran auto allocation failed” have a standard meaning after all?


Hi Amir,

I looked up where the error “FORTRAN AUTO ALLOCATION FAILED” is coming from. I was originally thinking it had to do with F2003 automatic allocation, but it’s actually coming from the device code (apologies for not seeing this earlier). The error occurs when allocating automatic arrays on the device and the device-side malloc fails. Hence, the actual issue is most likely in one of the kernels called before line 1238 of box_mod.f90 where you’re using an automatic array.

The most likely spot is in calcDiff_recip_d_part1 since you do have some automatic arrays in there. Though you also catch the error code from this kernel call, so it could be someplace else (I’d expect the error code to be non-zero if this kernel was failing).
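One general point about checking kernel status: kernel launches are asynchronous, so an error raised while a kernel is running may only be reported by a later synchronizing call, such as a copy back to the host. A minimal sketch of checking right after a launch follows. The module and helper names (check_mod, check_kernel) are hypothetical and not part of BDpack; the runtime calls are the standard ones from the cudafor module.

```fortran
module check_mod
  use cudafor
  implicit none
contains
  ! Hypothetical helper (not part of BDpack): call right after a kernel launch.
  subroutine check_kernel(label)
    character(len=*), intent(in) :: label
    integer :: ierr

    ! Errors detected at launch time (for example, a bad launch configuration).
    ierr = cudaGetLastError()
    if (ierr /= cudaSuccess) then
      print *, trim(label), ' launch error: ', trim(cudaGetErrorString(ierr))
    end if

    ! Block until the kernel finishes so that execution errors are reported here
    ! instead of at some later memcpy.
    ierr = cudaDeviceSynchronize()
    if (ierr /= cudaSuccess) then
      print *, trim(label), ' execution error: ', trim(cudaGetErrorString(ierr))
    end if
  end subroutine check_kernel
end module check_mod
```

Calling something like "call check_kernel('calcDiff_recip_d_part1')" after each suspect launch should narrow down which kernel fails. The synchronization slows things down, so this is probably best kept to debugging builds.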

It looks like Bridges uses K80s or P100s while I was testing on a V100, which is why it worked for me. Sure enough, moving to a K80 or P100 I can recreate the error. Most likely the device is running out of heap space, which can be quite small. To fix this, you’ll want to call “cudaDeviceSetLimit” to increase the heap size. Here, I’m setting it to 16MB (the default is 8MB):

    integer(kind=cuda_count_kind) :: heapsize
    ! request a 16MB device malloc heap
    heapsize = 16_8*1024_8*1024_8
    istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize,heapsize)

I added this bit of code to the “init_dev” function in “common/cuda/dev_cumod.cuf” and the code ran correctly (except for the unrelated segv). This should really be called after you set the device, though, and the only place I see “cudaSetDevice” being called is in the “init_dev” routine in “dilute_bs/cuda/gpu_cumod.cuf”, which doesn’t seem to be called.
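As a rough sketch of that ordering (select the device first, then raise the heap limit, then read the limit back to confirm it took effect), something like the following could work. The subroutine name init_heap and its arguments are hypothetical; the limit needs to be set before the first kernel that allocates from the device heap is launched.

```fortran
! Hypothetical initialization routine (not part of BDpack).
subroutine init_heap(idev, nbytes)
  use cudafor
  implicit none
  integer, intent(in) :: idev                            ! device to use
  integer(kind=cuda_count_kind), intent(in) :: nbytes    ! requested heap size
  integer(kind=cuda_count_kind) :: actual
  integer :: istat

  ! Select the device before changing any of its limits.
  istat = cudaSetDevice(idev)
  if (istat /= cudaSuccess) print *, 'cudaSetDevice: ', trim(cudaGetErrorString(istat))

  ! Raise the device-side malloc heap; automatic arrays in kernels come from here.
  istat = cudaDeviceSetLimit(cudaLimitMallocHeapSize, nbytes)
  if (istat /= cudaSuccess) print *, 'cudaDeviceSetLimit: ', trim(cudaGetErrorString(istat))

  ! Read the limit back to confirm what the runtime actually granted.
  istat = cudaDeviceGetLimit(actual, cudaLimitMallocHeapSize)
  print *, 'device malloc heap size (bytes): ', actual
end subroutine init_heap
```

For the 16MB setting above, the call would be "call init_heap(0, 16_8*1024_8*1024_8)", assuming device 0 and that cuda_count_kind is an 8-byte integer kind.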

Note that I strongly discourage the use of automatics within device kernels. Besides the limited heap size, device-side allocation is slow, so it can affect performance. If you can make these arrays fixed size, you’ll be better off.
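To illustrate the fixed-size alternative, here is a small sketch of a kernel whose per-thread work array is sized by a compile-time bound instead of a dummy argument. Everything in it (the module kernels_mod, the kernel col_scaled_sums, the bound MAXN) is made up for illustration and is not taken from BDpack; it assumes a safe upper bound on the array length is known at compile time.

```fortran
module kernels_mod
  use cudafor
  implicit none
  ! Hypothetical compile-time upper bound on the per-thread work array length.
  integer, parameter :: MAXN = 32
contains
  attributes(global) subroutine col_scaled_sums(a, s, n, ncols)
    integer, value :: n, ncols
    real :: a(n, ncols), s(ncols)
    real :: tmp(MAXN)      ! fixed size: no device-side allocation needed
    ! real :: tmp(n)       ! automatic array: allocated from the device heap by each thread
    real :: acc
    integer :: i, j

    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    ! Guard both the grid overrun and the assumed bound on n.
    if (i <= ncols .and. n <= MAXN) then
      do j = 1, n
        tmp(j) = 2.0 * a(j, i)
      end do
      acc = 0.0
      do j = 1, n
        acc = acc + tmp(j)
      end do
      s(i) = acc
    end if
  end subroutine col_scaled_sums
end module kernels_mod
```

The trade-off is that the bound has to be large enough for every problem size, and the host code should check n against MAXN before launching, since the kernel above simply skips the work when the bound is exceeded.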

Hope this helps,