Problem in accessing device routine declared in another module file

Hi,

I have a code in which a reduction operation needs to be performed inside a kernel function. A similar reduction is needed in some other files. So what I am trying to do is put the reduction subroutines in a common module, which will be used in whichever files need it and called from there.

The structure of the code is something like this.

Common module file containing the reduction subroutines:

MODULE OPS_FORTRAN_CUDA_REDUCTIONS

    USE ISO_C_BINDING
    USE CUDAFOR

    IMPLICIT NONE

    CONTAINS
    ATTRIBUTES (DEVICE) SUBROUTINE ReductionReal8(sharedReal8, reduction_res, input_value, reduction_op)
        ! Reduction operation
    END SUBROUTINE ReductionReal8
END MODULE OPS_FORTRAN_CUDA_REDUCTIONS

File containing the call to the reduction subroutine:

MODULE APPLY_STENCIL_KERNEL_MODULE
       USE ISO_C_BINDING
       USE CUDAFOR
       USE OPS_FORTRAN_CUDA_REDUCTIONS
contains

attributes (device) subroutine apply_stencil_kernel_gpu(A, Anew, error)
....................
error = max( error, abs( Anew() - A() ))
end subroutine

attributes (global) subroutine apply_stencil_kernel_wrap(......)
..........................
call apply_stencil_kernel_gpu(......)
call ReductionFloat8(.....)
...............................
end subroutine

attributes (host) subroutine apply_stencil_kernel_host(....)
.........................
call apply_stencil_kernel_wrap <<<grid,tblock,nshared>>> (....)
.........................
end subroutine

END MODULE APPLY_STENCIL_KERNEL_MODULE

The documentation says that for Fortran, “Relocatable device code is generated by default.”
However, I am getting this error:

NVFORTRAN-S-1253-Calls from device code to a host subroutine/function are not allowed - reductionfloat8

If I place the subroutine ReductionFloat8 in the same file where it is called, this works.

Please help me solve this.

I am using the following compiler:

pgfortran (aka nvfortran) 23.1-0 64-bit target on x86-64 Linux -tp skylake-avx512
PGI Compilers and Tools
Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

and the compilation flags are
-O3 -fast -gopt -Mcuda=cc70

This should work. Make sure your names match: the module defines ReductionReal8, but the kernel calls ReductionFloat8. Since no device routine named ReductionFloat8 is visible through the module, the compiler most likely treats the call as a reference to an external host routine, which would explain the error. A word of warning, though: you will get better inlining, register usage, and thus potentially better performance if the device function is in the same file as the global subroutine. So some “bad” programming practices might result in better performance. In any event, if you continue to see these errors, post a full reproducer and we can help you figure it out. You can see the device symbols in an object file using cuobjdump, BTW.
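To illustrate the cross-module call, here is a minimal sketch rather than the original OPS code. The module names reductions_sketch and kernel_sketch, the helper MaxAbsDiffReal8, and the kernel diff_kernel are all hypothetical stand-ins, and the helper is a simple per-thread operation instead of a real shared-memory reduction. It assumes relocatable device code is enabled (the default) and a build along the lines of `nvfortran -cuda sketch.cuf`. The two modules could equally live in separate files.

```fortran
! Hypothetical stand-in for OPS_FORTRAN_CUDA_REDUCTIONS
module reductions_sketch
  use cudafor
  implicit none
contains
  ! Per-thread helper: keep the larger of err and |x - y|
  attributes(device) subroutine MaxAbsDiffReal8(err, x, y)
    real(8), intent(inout) :: err
    real(8), intent(in)    :: x, y
    err = max(err, abs(x - y))
  end subroutine MaxAbsDiffReal8
end module reductions_sketch

! Hypothetical stand-in for APPLY_STENCIL_KERNEL_MODULE
module kernel_sketch
  use cudafor
  use reductions_sketch   ! makes the device routine's interface visible
  implicit none
contains
  attributes(global) subroutine diff_kernel(a, anew, err, n)
    integer, value  :: n
    real(8), device :: a(n), anew(n), err(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) then
      ! The called name must match the definition in the module.
      ! A misspelled name is taken as an external host routine.
      call MaxAbsDiffReal8(err(i), anew(i), a(i))
    end if
  end subroutine diff_kernel
end module kernel_sketch

program main
  use cudafor
  use kernel_sketch
  implicit none
  integer, parameter :: n = 256
  real(8), device :: a_d(n), anew_d(n), err_d(n)
  real(8) :: err(n)

  a_d    = 1.0d0
  anew_d = 1.5d0
  err_d  = 0.0d0

  call diff_kernel<<<(n + 127) / 128, 128>>>(a_d, anew_d, err_d, n)

  err = err_d                          ! copy per-thread results back to the host
  print *, 'max error = ', maxval(err) ! every element differs by 0.5
end program main
```

Changing the call to a name the module does not define should reproduce the 1253 error from the original post.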

Hi bleback,
Thanks for pointing out the mistake. It's working now.
Also, thanks for the suggestion. I will move those routines into the same file as the global subroutine.

–Ashutosh

Good! Well, let me know what difference in performance you see with rdc and the functions in a different file vs. in the same file. There are lots of factors, like the number of arguments passed, the complexity of the functions, etc.