Compiling Python wrappers with F2PY and CUDA Fortran

Yes, this looks to be the issue in that “-Mcuda” isn’t being added to the link when creating the shared object. If you can figure out how to add “-Mcuda” to the link flags, that would be ideal.

I did find a post on StackOverflow which suggests that you can set the environment variable “LDFLAGS=-Mcuda” to set the f2py linker flags, so you may want to try it. Setting NPY_DISTUTILS_APPEND_FLAGS=1 looks necessary as well so LDFLAGS doesn’t overwrite the other linker flags.

If NPY_DISTUTILS_APPEND_FLAGS isn’t functional in your version of f2py (it appears to be NumPy-version specific), then you might need to set LDFLAGS to “-shared -fpic -Mcuda” so the flags needed to build the shared object aren’t lost.
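For example, assuming your CUDA Fortran source is in test.f90 and you’re selecting the PGI compilers via f2py’s “pg” fcompiler (the file and module names here are just placeholders, adjust to your project), something like:

  export NPY_DISTUTILS_APPEND_FLAGS=1
  export LDFLAGS="-Mcuda"
  f2py -c --fcompiler=pg --f90flags="-Mcuda" -m test test.f90

The “--f90flags” option passes “-Mcuda” to the compile stage, while LDFLAGS covers the link.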

I could try manually specifying all the required libraries that -Mcuda provides (specifying this from f2py does not work for the link stage)… Do you know which libraries -Mcuda proxies for?

You can, but it’s a little more complex than just adding the libraries. First, the “-Mcuda” flag tells the compiler to also run a device code link step when creating the shared object. If you add the libraries by hand, you’ll also need to compile the code with “-Mcuda=nordc” so the device link isn’t required. Though without RDC enabled, some CUDA Fortran features are disabled, such as the ability to call device routines not in the same module or to access device module variables outside of the module in which they’re defined.
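For example, to compile the objects without RDC (the file name here is just a placeholder):

  pgfortran -c -fpic -Mcuda=nordc test.f90

Keep in mind that the RDC restrictions above apply to anything compiled this way.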

Second, the “-Mcuda” flag can use different CUDA versions, selecting the one to use based on the NVIDIA driver version, on whether “CUDA_HOME” is set, or on whether the user has selected a particular CUDA version via “-Mcuda=cudaX.y”. The included libraries can differ depending on the CUDA version being used.
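For example, to pin the build to CUDA 10.1 rather than letting the driver version decide (sub-options to “-Mcuda” can be combined with commas):

  pgfortran -c -fpic -Mcuda=nordc,cuda10.1 test.f90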

Finally, the libraries can change from release to release, so the exact libraries used are release dependent.

The best way to determine what flags to add is to run the command “pgfortran -dryrun -Mcuda=nordc -shared test.o -o libtest.so”. “-dryrun” shows you the commands the compiler driver would execute without actually running them. “-v” (verbose) also shows the driver commands, but does run them.

Here’s the ld command with 19.10 using my local install:

/usr/bin/ld /usr/lib/x86_64-linux-gnu/crti.o /proj/pgi/linux86-64-llvm/19.10/lib/trace_init.o /home/sw/thirdparty/gcc/gcc-9.2.0/linux86-64/lib/gcc/x86_64-pc-linux-gnu/9.2.0/crtbeginS.o --eh-frame-hdr -m elf_x86_64 -dynamic-linker /lib64/ld-linux-x86-64.so.2 /proj/pgi/linux86-64-llvm/19.10/lib/pgi.ld -L/proj/pgi/linux86-64-llvm/19.10/lib -L/usr/lib64 -L/home/sw/thirdparty/gcc/gcc-9.2.0/linux86-64/lib/gcc/x86_64-pc-linux-gnu/9.2.0 test2.o -rpath /proj/pgi/linux86-64-llvm/19.10/lib -rpath /proj/pgi/linux86-64-llvm/2019/cuda/10.1/lib64 -rpath /home/sw/thirdparty/gcc/gcc-9.2.0/linux86-64/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../lib64 -o libtest.so -shared /proj/pgi/linux86-64-llvm/19.10/lib/pgiloc.ld -L/home/sw/thirdparty/gcc/gcc-9.2.0/linux86-64/lib/gcc/x86_64-pc-linux-gnu/9.2.0/../../../../lib64 -lcudafor101 -lcudafor -lcudaforblas101 /proj/pgi/linux86-64-llvm/19.10/lib/cuda_init_register_end.o -L/proj/pgi/linux86-64-llvm/2019/cuda/10.1/lib64 -lcudadevrt -lcudart -lcudafor2 -lpgf90rtl -lpgf90 -lpgf90_rpm1 -lpgf902 -lpgf90rtl -lpgftnrtl -lpgatm -lpgkomp -lomp -as-needed -lomptarget -no-as-needed -lpthread --start-group -lpgmath -lpgc --end-group -lrt -lpthread -lm -lgcc -lc -lgcc -lgcc_s /home/sw/thirdparty/gcc/gcc-9.2.0/linux86-64/lib/gcc/x86_64-pc-linux-gnu/9.2.0/crtendS.o /usr/lib/x86_64-linux-gnu/crtn.o

You can then compare this to another dryrun without “-Mcuda=nordc” to see the added library paths, libraries, and objects.
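One way to do the comparison is to capture both dryrun outputs and diff them (redirecting both stdout and stderr to be safe since driver output often goes to stderr; the log names are just placeholders):

  pgfortran -dryrun -Mcuda=nordc -shared test.o -o libtest.so > with_cuda.log 2>&1
  pgfortran -dryrun -shared test.o -o libtest.so > without_cuda.log 2>&1
  diff without_cuda.log with_cuda.log

The lines unique to the first log are the library paths, libraries, and objects that “-Mcuda” brings in.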

So in my case, where CUDA 10.1 is being used, I’d want to add: “-L/proj/pgi/linux86-64-llvm/2019/cuda/10.1/lib64 -lcudafor101 -lcudafor -lcudaforblas101 /proj/pgi/linux86-64-llvm/19.10/lib/cuda_init_register_end.o -lcudadevrt -lcudart -lcudafor2”
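Putting it together for f2py, something like the following using my paths (I haven’t tested this from f2py itself, so treat it as a starting point; your paths and versions will differ):

  export NPY_DISTUTILS_APPEND_FLAGS=1
  export LDFLAGS="-L/proj/pgi/linux86-64-llvm/2019/cuda/10.1/lib64 -lcudafor101 -lcudafor -lcudaforblas101 /proj/pgi/linux86-64-llvm/19.10/lib/cuda_init_register_end.o -lcudadevrt -lcudart -lcudafor2"
  f2py -c --fcompiler=pg --f90flags="-Mcuda=nordc,cuda10.1" -m test test.f90

Remember to re-run the dryrun comparison if you move to a different compiler release or CUDA version since the library list can change.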