Separate compilation of CUDA Fortran code involving a dynamic library

Hi everyone,

I have run into a strange issue related to separate compilation of CUDA Fortran code. It can be explained in detail with a minimal working example: I want to call a subroutine “test()” from the main program, but I compile the two parts separately, i.e. the subroutine “test” (in file type.cuf) is compiled into a dynamic library and then linked to the main program “cudatest” (in file main.f90). I use separate compilation for other reasons. For the test, the same CUF kernel code block appears in both the main program body and the subroutine. Both compilations succeed without any errors or even warnings. Strangely, the kernel in the main body runs normally and produces the right result, but the identical kernel in the subroutine (the dynamic library) fails with the following error:

cudaLaunchKernel returned status 98: invalid device function

After googling the keywords, the error seems to indicate that the CUF kernel has not been compiled into device code correctly. But I don’t know the reason and don’t know how to fix it. I would appreciate it if anyone can offer some useful information.
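One way to check whether a shared library actually embeds device code for the GPU you are running on is to inspect its fat binary with cuobjdump from the CUDA toolkit. This is only a diagnostic sketch: I am assuming here that the library file is named ./tmp/libtype.so — substitute the real name from your build.

```shell
# List the cubin (SASS) images embedded in the library; each entry
# shows the compute capability it was built for, e.g. sm_70.
cuobjdump --list-elf ./tmp/libtype.so

# List any embedded PTX, which the driver can JIT-compile for newer GPUs.
cuobjdump --list-ptx ./tmp/libtype.so

# nvaccelinfo (shipped with the NVIDIA HPC SDK) reports the compute
# capability of the installed GPU for comparison.
nvaccelinfo
```

If the listed images do not match the device’s compute capability (and no PTX is embedded), a kernel launch can fail with exactly this kind of “invalid device function” error.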

Besides, here is some information that may help diagnose the problem. I tested with HPC SDK 21.9 (CUDA 11.4) on a Tesla V100. Running the nvidia-smi command, I get

NVIDIA-SMI 470.129.06 Driver Version: 470.129.06 CUDA Version: 11.4

I also tested with HPC SDK 22.7 (CUDA 11.7) on a very clean, new Tesla A100 machine. With this CUDA version (11.7), the same part of the code fails with a different error:

cudaLaunchKernel returned status 500: named symbol not found

Similarly, running nvidia-smi, I get

NVIDIA-SMI 515.57 Driver Version: 515.57 CUDA Version: 11.7

However, the same code compiles correctly and runs normally on the same Tesla V100 machine with the PGI free community edition 2019 (19.10) and CUDA 10.0.130. I have worked this way without any problem for the past two years. Since the PGI compilers have been integrated into the NVIDIA HPC SDK package, I recently decided to switch to the HPC SDK for long-term benefit and convenience, and then I ran into this problem. I have struggled with it for several days and have no idea so far.

The minimal working code consists of two source files, main.f90 and type.cuf. The first is compiled into the executable and the second into a dynamic library. The source files and the corresponding Makefiles read as follows.
The file main.f90 reads:

program cudatest
    use cudafor
    use cudatype
    implicit none
    integer      ::  ii, istat
    real(kind=8) :: mat1(100)
    real(kind=8), device :: mat1_d(100)

    istat = cudaSetDevice(0)

    ! test cuf kernel in main body 
    !$cuf kernel do (1) <<<*,*>>>
    do ii = 1, 100
        mat1_d(ii) = 1.d0
    end do
    mat1 = mat1_d
    print *,'Test in main: sum(mat1)=',sum(mat1)

    ! test cuf kernel from linking dynamic library 
    call test()
end program

The Makefile for main.f90 reads:


MODS=$(wildcard *.mod)
UNAME_S=$(shell uname -n)
RM=rm -fv


LIBS = ./tmp/
LIBS += -L${cublas}/lib64 -lcublas -lblas

FCFLAGS=-fPIC -O3 -traceback -g -Mpreprocess -Mcuda -gpu=cc70 -Mcudalib=cublas $(INCLUDE)

.SUFFIXES: .o .f .f90 .cuf

all: ${EXE}

${EXE}: ${FILES} ${MODS}
	${FCMPI} -o $@ ${FILES} ${LIBS} ${FCFLAGS}

main.o: main.f90
	${FCMPI} ${FCFLAGS} -c main.f90

%.mod: %.f90
	@echo "Some modules are out of date. Do clean and then recompile"
	${RM} $@ ${EXE}

.PHONY: clean
clean:
	${RM} *.o
	${RM} *.mod
	${RM} ${EXE}

The file ./tmp/type.cuf reads:

module cudatype
    implicit none
    public  :: test

    contains

        subroutine test()
            integer :: ii
            integer, device :: mat(100)
            integer :: mat_h(100)
            !$cuf kernel do (1) <<<*,*>>>
            do ii = 1, 100
                mat(ii) = 1
            end do
            mat_h = mat
            print *,'Test in dynamic lib: sum(mat2)=',sum(mat_h)
        end subroutine 
end module

The related Makefile reads:


FCFLAGS=-fPIC -O3 -traceback -g -Mpreprocess -Mcuda -gpu=cc70 -Mcudalib=cublas

MODS=$(wildcard *.mod)

UNAME_S=$(shell uname -n)
RM=rm -fv

LIBS = -L${cublas}/lib64 -lcublas -lblas

.SUFFIXES: .o .f .f90 .cuf

all: ${FILES} ${MODS}
	${FCMPI} -fPIC -shared -Mcuda -Mcudalib=cublas -o ${FILES} ${LIBS}

type.o: type.cuf
	${FCMPI} ${FCFLAGS} -c type.cuf

%.mod: %.f90
	@echo "Some modules are out of date. Do clean and then recompile"
	${RM} $@ ${EXE}

.PHONY: clean
clean:
	${RM} *.o
	${RM} *.mod
	${RM} *.so
	${RM} ${EXE}

When compiling the dynamic library and the main program, I also tried the following flags: -Wl,-export-dynamic and -fortranlibs, but none of them worked.

Problem: how can I make the test code run normally with HPC SDK 21.9 or newer? For my own reasons, the subroutine module (type.cuf) must be compiled as a dynamic library and then linked to the main program (main.f90).

Hi pengshiyuj,

Looks like you just need to switch to the newer “-cuda” and “-cudalib” flags in place of the deprecated “-Mcuda”/“-Mcudalib” flags.
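As a sketch, the FCFLAGS lines in both of your Makefiles would become something like the following. I also used “-gpu=ccall” (as in the build log below) so that device code is generated for all supported compute capabilities, which covers both your cc70 V100 and your cc80 A100:

```make
# -cuda/-cudalib replace the deprecated -Mcuda/-Mcudalib spellings;
# -gpu=ccall embeds device code for every supported compute capability.
FCFLAGS = -fPIC -O3 -traceback -g -Mpreprocess -cuda -gpu=ccall -cudalib=cublas
```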

% pgf90 -V

pgf90 (aka nvfortran) 21.9-0 64-bit target on x86-64 Linux -tp zen
PGI Compilers and Tools
Copyright (c) 2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
% make -f makefile.lib clean all
rm -fv *.o
rm -fv *.mod
rm -fv *.so
removed ''
rm -fv
pgf90 -fPIC -O3 -g -Mpreprocess -cuda -gpu=ccall -cudalib=cublas -c type.cuf
pgf90 -fPIC -shared -cuda -cudalib=cublas -o type.o -L/public/home/sypeng/soft/hpc_sdk/Linux_x86_64/21.9/math_libs/11.4/lib64 -lcublas -lblas
% make
pgf90 -fPIC -O3 -traceback -g -Mpreprocess -cuda -gpu=ccall -cudalib=cublas -I. -c main.f90
pgf90 -o gputest main.o ./ -lblas -fPIC -O3 -traceback -g -Mpreprocess -cuda -gpu=ccall -cudalib=cublas -I.
% ./gputest
 Test in main: sum(mat1)=    100.0000000000000
 Test in dynamic lib: sum(mat2)=          100

Hope this helps,
-Mat

Hi Mat,

It really works for me. Thanks very much!