CUDA/C++ dynamic parallelism compile issue on aarch64/arm64

Hi!

I’m having problems compiling and linking a C++ program against a CUDA file that uses dynamic parallelism. The program itself is quite extensive, so for testing I reduced it to a very basic main.cpp file and wrapper.cu file that reproduce the error. I’m compiling on a Jetson Nano (aarch64/arm64, compute capability 5.3), which I suspect may be causing the problem.

Here are my programs:

main.cpp:

extern void wrapperfunction();

int main(){
	wrapperfunction();
}

wrapper.cu:

#include <cuda.h>
#include <cuda_runtime.h>

using namespace std;

__global__ void update_upper_image(){
	int x = 1;
}

__global__ void event_kernel(){
    update_upper_image<<<512,1>>>();
}

void wrapperfunction(){
	event_kernel<<<4,1>>>();
}

The commands I’m using to compile are:

nvcc -arch=sm_53 -rdc=true -c wrapper.cu
nvcc -arch=sm_53 -dlink -o file_link.o wrapper.o -lcudart -lcudadevrt
g++ wrapper.o file_link.o main.cpp -L/usr/local/cuda/lib64 -lcudart -lcudadevrt

The first two commands work fine, but the final g++ step fails at link time with the following error:

file_link.o: In function `__cudaRegisterLinkedBinary_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37':
link.stub:(.text+0xcc): undefined reference to `__fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37'
link.stub:(.text+0xd0): undefined reference to `__fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37'
collect2: error: ld returned 1 exit status

I’ve been working on this problem for a few days now without any luck, so any help would be greatly appreciated!

Thanks,
Sophie

Hi,

Would you mind checking it again?
We can compile and execute the sample without any issue on the Nano.

Makefile

all:
        nvcc -arch=sm_53 -rdc=true -c wrapper.cu
        nvcc -arch=sm_53 -dlink -o file_link.o wrapper.o -lcudart -lcudadevrt
        g++ wrapper.o file_link.o main.cpp -L/usr/local/cuda/lib64 -lcudart -lcudadevrt

$ make && ./a.out

nvcc -arch=sm_53 -rdc=true -c wrapper.cu
wrapper.cu(7): warning: variable "x" was declared but never referenced

nvcc -arch=sm_53 -dlink -o file_link.o wrapper.o -lcudart -lcudadevrt
g++ wrapper.o file_link.o main.cpp -L/usr/local/cuda/lib64 -lcudart -lcudadevrt

Thanks.

Thanks for your response!

Unfortunately I’m still getting the same error when I call make && ./a.out:

nvcc -arch=sm_53 -rdc=true -c wrapper.cu
wrapper.cu(12): warning: variable "x" was declared but never referenced

nvcc -arch=sm_53 -dlink -o file_link.o wrapper.o -lcudart -lcudadevrt
g++ wrapper.o file_link.o main.cpp -L/usr/local/cuda/lib64 -lcudart -lcudadevrt
file_link.o: In function `__cudaRegisterLinkedBinary_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37':
link.stub:(.text+0xcc): undefined reference to `__fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37'
link.stub:(.text+0xd0): undefined reference to `__fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37'
collect2: error: ld returned 1 exit status
Makefile:2: recipe for target 'all' failed
make: *** [all] Error 1

Since you were able to run it, this may indicate that the problem isn’t the Nano.

Do you have any other ideas as to why this might be failing?

Hi,

Based on your log, it seems that the app is trying to link against the compute_75 device runtime.
Would you mind trying the following commands first to see if they work:

$ nvcc -gencode arch=compute_53,code=sm_53 -rdc=true -c wrapper.cu
$ nvcc -gencode arch=compute_53,code=sm_53 -dlink -o file_link.o wrapper.o -lcudart -lcudadevrt
$ g++ wrapper.o file_link.o main.cpp -L/usr/local/cuda/lib64 -lcudart -lcudadevrt
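
As a side note, the compute capability that the link step is asking for can be read straight off the undefined symbol in the log. Here is a small sketch of that; the helper name devrt_arch is only for illustration, and it assumes the symbol naming seen in the error above (other toolkit versions may name it differently):

```shell
#!/bin/sh
# Print the compute capability embedded in a device-runtime fatbin symbol.
# Assumption: the symbol contains "cuda_device_runtime_compute_NN_", as in
# the linker error quoted earlier in this thread. Prints nothing otherwise.
devrt_arch() {
    printf '%s\n' "$1" |
        sed -n 's/.*cuda_device_runtime_compute_\([0-9][0-9]*\)_.*/\1/p'
}

# Symbol copied from the linker error in the log.
devrt_arch "__fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37"
```

If the number printed does not match the -arch you compiled for, that is a hint that the objects and the libraries come from different places.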

If the issue persists, please share the deviceQuery output log for your environment.

$ cd /usr/local/cuda-10.2/samples/1_Utilities/deviceQuery
$ sudo make
$ ./deviceQuery

Thanks.

Hi again and thank you - I was able to debug the problem!

I didn’t have cuda-10.2 installed (I have cuda-10.0 and cuda-11.0, with cuda-11.0 as my default). When I ran:

$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ sudo make

I got:

/usr/local/cuda-11.0/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery.o -c deviceQuery.cpp
/usr/local/cuda-11.0/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery deviceQuery.o 
mkdir -p ../../bin/sbsa/linux/release
cp deviceQuery ../../bin/sbsa/linux/release

This indicated that compute_53 wasn’t available in cuda-11.0, which is what g++ was linking against through /usr/local/cuda/lib64.

The commands that fix this problem are:

$ nvcc -gencode arch=compute_53,code=sm_53 -rdc=true -c wrapper.cu
$ nvcc -gencode arch=compute_53,code=sm_53 -dlink -o file_link.o wrapper.o -lcudart -lcudadevrt
$ g++ wrapper.o file_link.o main.cpp -L/usr/local/cuda-10.0/lib64 -lcudart -lcudadevrt
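
To avoid hard-coding the version in the -L path, the library directory can be derived from the nvcc that is actually being used. This is only a rough sketch: the helper name toolkit_root is made up for illustration, and it assumes the usual <root>/bin/nvcc layout, a lib64 directory under the toolkit root, and a readlink that supports -f (as on the Nano's Ubuntu image):

```shell
#!/bin/sh
# Given the full path to nvcc, print the root of the toolkit it belongs to.
# Assumption: nvcc lives in <root>/bin/nvcc.
toolkit_root() {
    dirname "$(dirname "$1")"
}

# Resolve whichever nvcc is first on PATH (following symlinks) and print
# the matching linker flags. Does nothing if nvcc is not on PATH.
NVCC_PATH=$(command -v nvcc 2>/dev/null) || NVCC_PATH=""
if [ -n "$NVCC_PATH" ]; then
    CUDA_ROOT=$(toolkit_root "$(readlink -f "$NVCC_PATH")")
    echo "link with: -L$CUDA_ROOT/lib64 -lcudart -lcudadevrt"
fi
```

That way the nvcc steps and the g++ step pick up libraries from the same toolkit, whichever one /usr/local/cuda happens to point to.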

cuda-10.0, by contrast, does list compute_53. Running:

$ cd /usr/local/cuda-10.0/samples/1_Utilities/deviceQuery
$ sudo make

Produces:

/usr/local/cuda-10.0/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -gencode arch=compute_30,code=sm_30 -gencode arch=compute_32,code=sm_32 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery.o -c deviceQuery.cpp
/usr/local/cuda-10.0/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_30,code=sm_30 -gencode arch=compute_32,code=sm_32 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery deviceQuery.o 
mkdir -p ../../bin/aarch64/linux/release
cp deviceQuery ../../bin/aarch64/linux/release
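
For anyone comparing two installs the same way, the check can be scripted. A minimal sketch follows, with a made-up helper name (has_sm) and shortened copies of the two -gencode lists above. Note that it only looks at the flags the samples Makefile passes, so it is a hint rather than proof of what a toolkit supports:

```shell
#!/bin/sh
# Succeed if the nvcc command line in $2 contains a code=sm_<$1> entry.
has_sm() {
    case "$2" in
        *"code=sm_$1 "*|*"code=sm_$1") return 0 ;;
        *) return 1 ;;
    esac
}

# Shortened copies of the -gencode lists from the two make outputs above.
CUDA11="-gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75"
CUDA10="-gencode arch=compute_30,code=sm_30 -gencode arch=compute_32,code=sm_32 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61"

has_sm 53 "$CUDA11" && echo "cuda-11.0: sm_53 listed" || echo "cuda-11.0: sm_53 not listed"
has_sm 53 "$CUDA10" && echo "cuda-10.0: sm_53 listed" || echo "cuda-10.0: sm_53 not listed"
```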

Thank you again!