DEALLOCATE: an illegal memory access was encountered

philou · July 10, 2018, 2:10pm

Hello,

I am new to CUDA Fortran. For now I am only trying very basic things, and I have a problem with this small piece of code:

$ cat main_simple2.F90 

PROGRAM MAIN_SIMPLE2
IMPLICIT NONE
INTEGER(KIND=4) , DEVICE :: KLEV_D

INTERFACE
ATTRIBUTES(GLOBAL) &
SUBROUTINE GPU_SIMPLE2 (KLEV)
INTEGER(KIND=4),INTENT(IN) :: KLEV
END SUBROUTINE GPU_SIMPLE2
END INTERFACE

KLEV_D = 15

CALL GPU_SIMPLE2 <<<600, 32>>> (KLEV_D)

END PROGRAM MAIN_SIMPLE2

ATTRIBUTES(GLOBAL) &
SUBROUTINE GPU_SIMPLE2 ( KLEV )

IMPLICIT NONE

INTEGER(KIND=4),INTENT(IN)    :: KLEV 
 
INTEGER(KIND=4) :: JLEV

REAL(KIND=8) :: ZTPRTH(KLEV),ZTDIFF(KLEV), ZTDISS(KLEV)

PRINT *, THREADIDX%X, BLOCKIDX%X, BLOCKDIM%X, KLEV

DO JLEV = 1, KLEV
  ZTDIFF(JLEV)  =0.0_8
ENDDO
DO JLEV = 1, KLEV
  ZTDISS(JLEV)  =0.0_8
ENDDO
DO JLEV = 1, KLEV
  ZTPRTH(JLEV)  =0.0_8
ENDDO

RETURN

END SUBROUTINE GPU_SIMPLE2


$ pgf90 -Mcuda=ptxinfo,fastmath main_simple2.F90 -o main_simple2.x  ; ./main_simple2.x
...
           30          422           32           15
           31          422           32           15
           32          422           32           15
0: DEALLOCATE: an illegal memory access was encountered

Sometimes I get the error, sometimes not. What is wrong with this code?

My NVIDIA card is:

$ pgaccelinfo

CUDA Driver Version:           8000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  375.26  Thu Dec  8 18:36:43 PST 2016

Device Number:                 0
Device Name:                   Quadro M5000
Device Revision Number:        5.2
Global Memory Size:            8514830336
Number of Multiprocessors:     16
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1038 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             3305 MHz
Memory Bus Width:              256 bits
L2 Cache Size:                 2097152 bytes
Max Threads Per SMP:           2048
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc50

and my Fortran compiler is:

$ pgf90 --version 

pgf90 17.7-0 64-bit target on x86-64 Linux -tp haswell 
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

Thank you for your help.

MatColgrove · July 10, 2018, 4:10pm

Hi philou,

The problem here is that you're using too much heap space. Device code has only a very limited amount of allocatable heap space (~8MB). While there are ways of increasing this by calling cudaDeviceSetLimit (max heap space is about 32MB), I would highly recommend that you rewrite your code so that it does not use automatics in device code.
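For a sense of scale: the launch above creates 600 x 32 = 19,200 threads, and each thread's three automatic arrays need 3 x 15 x 8 = 360 bytes. That is at most about 6.9 MB of payload if every thread holds its arrays at the same time, before any per-allocation overhead in the device allocator (how large that overhead is has not been measured here, so treat it as an assumption). If you want to raise the limit as a stopgap, a minimal sketch might look like this. It assumes the cudaDeviceSetLimit interface and the cudaLimitMallocHeapSize constant from the cudafor module; the program name and the 32MB value are only illustrative.

```fortran
PROGRAM SET_HEAP_LIMIT
USE CUDAFOR
IMPLICIT NONE

INTEGER :: ISTAT
INTEGER(KIND=CUDA_COUNT_KIND) :: NBYTES

! Illustrative value: the 32MB figure mentioned above
NBYTES = 32 * 1024 * 1024

! Set the device heap size before launching any kernel
! that allocates from the device heap
ISTAT = cudaDeviceSetLimit(cudaLimitMallocHeapSize, NBYTES)
IF (ISTAT /= cudaSuccess) THEN
  PRINT *, 'cudaDeviceSetLimit failed: ', cudaGetErrorString(ISTAT)
ENDIF

END PROGRAM SET_HEAP_LIMIT
```

In the original program the call would go before the CALL GPU_SIMPLE2 line. It only moves the limit; the per-thread allocations and their cost remain.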

Automatics (here ZTPRTH, ZTDIFF and ZTDISS, which are dimensioned by the dummy argument KLEV) are implicitly allocated from within device code, which is a very slow operation and will hurt your performance. It's better to use fixed-size local arrays or to pass in a global array.
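As a rough sketch of both alternatives, assuming an upper bound on KLEV is known at compile time (KLEV_MAX, the module name, and the _FIXED and _WORK kernel names are hypothetical, not from the original code):

```fortran
MODULE GPU_SIMPLE2_MOD
USE CUDAFOR
IMPLICIT NONE

! Hypothetical compile-time upper bound on the number of levels
INTEGER(KIND=4), PARAMETER :: KLEV_MAX = 15

CONTAINS

! Variant 1: fixed-size local arrays, so nothing is allocated per thread
ATTRIBUTES(GLOBAL) SUBROUTINE GPU_SIMPLE2_FIXED (KLEV)
INTEGER(KIND=4), VALUE :: KLEV
INTEGER(KIND=4) :: JLEV
REAL(KIND=8) :: ZTPRTH(KLEV_MAX), ZTDIFF(KLEV_MAX), ZTDISS(KLEV_MAX)

DO JLEV = 1, MIN(KLEV, KLEV_MAX)
  ZTDIFF(JLEV) = 0.0_8
  ZTDISS(JLEV) = 0.0_8
  ZTPRTH(JLEV) = 0.0_8
ENDDO
END SUBROUTINE GPU_SIMPLE2_FIXED

! Variant 2: a global work array allocated once on the host side.
! The thread index is the first dimension so that neighbouring
! threads touch neighbouring addresses.
ATTRIBUTES(GLOBAL) SUBROUTINE GPU_SIMPLE2_WORK (KLEV, NTHR, ZWORK)
INTEGER(KIND=4), VALUE :: KLEV, NTHR
REAL(KIND=8) :: ZWORK(NTHR, KLEV, 3)
INTEGER(KIND=4) :: JLEV, ITID

ITID = (BLOCKIDX%X - 1) * BLOCKDIM%X + THREADIDX%X
IF (ITID > NTHR) RETURN

DO JLEV = 1, KLEV
  ZWORK(ITID, JLEV, 1) = 0.0_8   ! plays the role of ZTPRTH
  ZWORK(ITID, JLEV, 2) = 0.0_8   ! plays the role of ZTDIFF
  ZWORK(ITID, JLEV, 3) = 0.0_8   ! plays the role of ZTDISS
ENDDO
END SUBROUTINE GPU_SIMPLE2_WORK

END MODULE GPU_SIMPLE2_MOD

PROGRAM MAIN_SIMPLE2_NOAUTO
USE CUDAFOR
USE GPU_SIMPLE2_MOD
IMPLICIT NONE

INTEGER(KIND=4), PARAMETER :: NBLK = 600, NTPB = 32, KLEV = 15
REAL(KIND=8), DEVICE, ALLOCATABLE :: ZWORK_D(:,:,:)
INTEGER :: ISTAT

CALL GPU_SIMPLE2_FIXED <<<NBLK, NTPB>>> (KLEV)

! One allocation in global memory, sized for every thread in the grid
ALLOCATE (ZWORK_D(NBLK*NTPB, KLEV, 3))
CALL GPU_SIMPLE2_WORK <<<NBLK, NTPB>>> (KLEV, NBLK*NTPB, ZWORK_D)

! Kernel launches are asynchronous; check for errors after they finish
ISTAT = cudaDeviceSynchronize()
IF (ISTAT /= cudaSuccess) PRINT *, cudaGetErrorString(ISTAT)

DEALLOCATE (ZWORK_D)
END PROGRAM MAIN_SIMPLE2_NOAUTO
```

Putting the kernels in a module gives them an explicit interface, so the INTERFACE block in the main program is no longer needed, and passing KLEV by VALUE lets the host pass an ordinary integer. The sketch should build with the same kind of command as above (pgf90 -Mcuda).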

-Mat

philou · July 11, 2018, 6:41am

OK, thanks. But I have another question. The NVIDIA CUDA programming guide states (section 5.3.2):

Local memory accesses only occur for some automatic variables as mentioned in Variable Memory Space Specifiers. Automatic variables that the compiler is likely to place in local memory are:

Arrays for which it cannot determine that they are indexed with constant quantities,
Large structures or arrays that would consume too much register space,
Any variable if the kernel uses more registers than available (this is also known as register spilling).

So in principle, the arrays I declared with KLEV as their dimension should go to local memory, which is off-chip. Why am I hitting an 8MB limit here?

Regards,

Philippe
