Unable to call Device function in Cuda Fortran

I am trying to create a Linked list in Cuda Fortran, but when I am trying to call a device function from the kernel, I am getting a compilation error. Could someone please explain to me why …?

file :p5.f95

MODULE ListModule
  IMPLICIT NONE

  TYPE ListElem
  REAL                    :: value;
  TYPE(ListElem), POINTER :: next;
  END TYPE ListElem

  CONTAINS
  
  ATTRIBUTES(DEVICE) FUNCTION InsertList(head, elem)
    IMPLICIT NONE
  
    type( ListElem ), pointer :: head, elem 
    type( ListElem ), pointer :: InsertList
  
    elem%next => head 
    InsertList => elem
    
  END FUNCTION InsertList
  END MODULE

MODULE Test
  CONTAINS
    ATTRIBUTES(GLOBAL) SUBROUTINE KERNEL()
    USE ListModule
    IMPLICIT NONE
    type( ListElem ), pointer :: head
    type( ListElem ), pointer :: newElem, h
    integer :: i,N = 4
    INTEGER(KIND=4),ALLOCATABLE::ND(:)
  
    nullify( head )                

    allocate( newElem )
    newElem%value=1

    PRINT*,newElem%value
    head => InsertList(head, newElem)

    END SUBROUTINE
END MODULE Test


PROGRAM LinkedList
  USE TEST
  USE CUDAFOR

  integer :: N

  CALL KERNEL<<<1,1>>>()
  N=cudaDeviceSynchronize()
END PROGRAM LinkedList

compilation command:pgf95 -cuda -gpu=rdc p5.f95

Error:
nvvmCompileProgram error 9: NVVM_ERROR_COMPILATION.
Error: /tmp/pgcudaforNnQ1wgcS2LZ.gpu (27, 26): parse use of undefined value '%sym_insertlist_p_356'
ptxas /tmp/pgcudaforxnQf5cb9zEu.ptx, line 1; fatal   : Missing .version directive at start of file '/tmp/pgcudaforxnQf5cb9zEu.ptx'
ptxas fatal   : Ptx assembly aborted due to errors
NVFORTRAN-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (p5.f95: 1)
NVFORTRAN/x86-64 Linux 21.7-0: compilation aborted

Hi,
We don’t normally recommend developers use F90 pointers in device code. They can have very high overheads, though in your case, a pointer to a scalar type is pretty efficient. Linked lists are not good parallel data structures, and in your case here, the allocations in device code also have high overhead.
The bug though is we don’t properly support device functions which return pointers, which is what you have used. I can get the code to compile and run by changing the function to a subroutine:

ATTRIBUTES(DEVICE) SUBROUTINE InsertList(head, elem)
    IMPLICIT NONE

    type( ListElem ), pointer :: head, elem

    elem%next => head

  END SUBROUTINE InsertList
  END MODULE

MODULE Test
  CONTAINS
    ATTRIBUTES(GLOBAL) SUBROUTINE KERNEL()
    USE ListModule
    IMPLICIT NONE
    type( ListElem ), pointer :: head
    type( ListElem ), pointer :: newElem, h
    integer :: i,N = 4
    INTEGER(KIND=4),ALLOCATABLE::ND(:)

    allocate( newElem )
    allocate( head )
    newElem%value=1
    head%value=2

    PRINT*,newElem%value
    call InsertList(head, newElem)
    PRINT*,newElem%next%value

Unless you really need every CUDA thread to have its own linked list, I would recommend looking at other ways to build your data structures.

Thank you @bleback