Separate compilation of mixed CUDA OpenACC code

I present a simple test program consisting of a CUDA source file, an OpenACC source file, and a plain C++ main that calls functions defined in the other two compilation units. It is a simplified version of a larger program made up of many CUDA and OpenACC source files.

I need to link the executable using separate compilation, with an intermediate device-link step followed by the final host-link step, as described in the NVIDIA blog. This requirement comes from other constraints that this sample code does not reproduce.

Here are the steps I take:

OPENACC_ARCH_FLAGS="-acc=gpu -gpu=cc70 -acc=noautopar -Minfo=accel"
CUDA_ARCH_FLAGS="--generate-code=arch=compute_70,code=[compute_70,sm_70]"
CUDA_LIB_DIR=$HPC_SDK_HOME/Linux_ppc64le/2021/cuda/lib64 # customize your path

# Compile MPI C++ code
pgc++ -c main.cpp -o main.cpp.o

# Compile OPENACC code
pgc++ $OPENACC_ARCH_FLAGS -c test_openacc.cpp -o openacc.cpp.o

# Compile CUDA code
nvcc $CUDA_ARCH_FLAGS -dc test_cuda.cu -o cuda.cu.o

# Removing openacc.cpp.o from the dlink objects works without errors
DLINK_OBJS="cuda.cu.o main.cpp.o openacc.cpp.o" # <=== this causes the error
nvcc $CUDA_ARCH_FLAGS -dlink $DLINK_OBJS -o dlink.o

# Generate executable
nvc++ $OPENACC_ARCH_FLAGS -o main cuda.cu.o openacc.cpp.o main.cpp.o dlink.o -L$CUDA_LIB_DIR -lcudadevrt -lcudart_static -lr

PROBLEM: if I include the OpenACC object file in the device-link step, I get an unresolved symbol error during the final host link:

undefined reference to `__fatbinwrap_98_test_openacc_cpp'
pgacclnk: child process exit status 1: /usr/bin/ld

QUESTION:

  1. Why does including the openacc.cpp.o object in the dlink.o object produce this error at the host-link step?
  2. Why does including the main.cpp.o object in the dlink.o object not cause any problem?

nvc++ compiles with RDC (relocatable device code) enabled by default, so adding the flag “-gpu=nordc” to OPENACC_ARCH_FLAGS should fix the issue.
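As a minimal sketch of that change, the snippet below appends the flag to the variable from the question and recompiles the OpenACC unit. The file names are the ones used in the question, and the guard around the compile command is only there so the snippet can run on a machine without the compiler.

```shell
#!/bin/sh
# Flags from the question, with -gpu=nordc appended as suggested above.
OPENACC_ARCH_FLAGS="-acc=gpu -gpu=cc70 -acc=noautopar -Minfo=accel"
OPENACC_ARCH_FLAGS="$OPENACC_ARCH_FLAGS -gpu=nordc"
echo "OpenACC flags: $OPENACC_ARCH_FLAGS"

# Recompile the OpenACC unit only if the compiler and source are present
# (assumption: same file names as in the question).
if command -v nvc++ >/dev/null 2>&1 && [ -f test_openacc.cpp ]; then
    nvc++ $OPENACC_ARCH_FLAGS -c test_openacc.cpp -o openacc.cpp.o
fi
```

One caveat, as far as I understand it: with RDC disabled, OpenACC code cannot call device routines defined in other source files, so this may not suit a larger code base.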

I’m not an expert in the inner workings of nvcc, but my guess is that when it creates the fat binary wrapper symbols, it doesn’t know how to interpret the OpenACC symbol names and so doesn’t create the symbol name correctly. It should look something like “__fatbinwrap_98_tmpxft_00028c08_00000000_8_test_openacc_cpp”. Then again, since nvcc doesn’t support OpenACC, it wouldn’t be expected to.
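One way to check this guess is to list the wrapper symbols that each object defines or references and compare the names. The sketch below uses the standard nm tool. The helper name filter_fatbinwrap is made up for illustration, and the object names are the ones from the question.

```shell
#!/bin/sh
# Keep only the fat binary wrapper symbol names from `nm` output on stdin.
# (filter_fatbinwrap is a hypothetical helper, not part of any toolkit.)
filter_fatbinwrap() {
    grep -o '__fatbinwrap[A-Za-z0-9_]*' | sort -u
}

# Object names are the ones used in the question; missing files are skipped.
for obj in cuda.cu.o openacc.cpp.o dlink.o; do
    if [ -f "$obj" ]; then
        echo "== $obj"
        nm "$obj" | filter_fatbinwrap
    fi
done
```

If the name referenced in dlink.o differs from the one defined in openacc.cpp.o, that would be consistent with the undefined reference reported by the host linker.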

As I noted on your SO post, the “dlink” step shouldn’t be necessary. nvc++ already includes the device linking so no need to separate it out. Something like the following should work:

OPENACC_ARCH_FLAGS="-acc=gpu -gpu=cc70 -acc=noautopar -Minfo=accel"
CUDA_ARCH_FLAGS="--generate-code=arch=compute_70,code=[compute_70,sm_70] -rdc=true"

# Compile MPI C++ code
nvc++ -c main.cpp -o main.cpp.o

# Compile OPENACC code
nvc++ $OPENACC_ARCH_FLAGS -c test_openacc.cpp -o openacc.cpp.o

# Compile CUDA code
nvcc $CUDA_ARCH_FLAGS -c test_cuda.cu -o cuda.cu.o

# Generate executable
nvc++ $OPENACC_ARCH_FLAGS -o main cuda.cu.o openacc.cpp.o main.cpp.o -cuda -static-nvidia -lr

Thank you, Mat, for clarifying the default RDC behaviour of nvc++ and for providing a solution that links directly with nvc++.

Although the dlink step is not necessary in this simple example, I assumed it should work anyway. For example, as I posted on SO, when a CUDA project is configured with CMake and separable compilation is enabled, the build always goes through the intermediate dlink step, followed by the final host-link step. That’s why I wanted to understand how to set things up properly with the intermediate dlink step.
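A sketch of one way to keep the dlink step is to pass it only the objects built by nvcc, which matches the observation in my script that leaving openacc.cpp.o out of the dlink works. The helper cuda_objs_only and the reliance on the .cu.o suffix are assumptions for illustration, not anything a build system does by itself.

```shell
#!/bin/sh
# Print only the arguments that end in .cu.o, separated by spaces.
# (cuda_objs_only is a hypothetical helper; the .cu.o suffix is just the
# naming convention of the object files in this thread.)
cuda_objs_only() {
    out=""
    for obj in "$@"; do
        case "$obj" in
            *.cu.o) out="${out:+$out }$obj" ;;
        esac
    done
    printf '%s\n' "$out"
}

CUDA_ARCH_FLAGS="--generate-code=arch=compute_70,code=[compute_70,sm_70]"
ALL_OBJS="cuda.cu.o main.cpp.o openacc.cpp.o"
DLINK_OBJS=$(cuda_objs_only $ALL_OBJS)
echo "dlink objects: $DLINK_OBJS"

# Run the device link only when nvcc and the object file are available.
if command -v nvcc >/dev/null 2>&1 && [ -f cuda.cu.o ]; then
    nvcc $CUDA_ARCH_FLAGS -dlink $DLINK_OBJS -o dlink.o
fi
```

The final nvc++ link line stays the same as in my original script, with dlink.o listed next to the other objects.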

Do you think I should submit a bug report to your NVIDIA support colleagues about the fact that nvcc produces a dlink.o object that cannot be linked by the host linker when an OpenACC object is involved?