CUDA/C++ dynamic parallelism compile issue on aarch64/arm64

SophieJ · September 3, 2020, 12:00am

Hi!

I’m having problems compiling a C++ program and linking it against a CUDA file that uses dynamic parallelism. The program itself is quite extensive, so for testing’s sake I reduced it to a very basic main.cpp file and wrapper.cu file that reproduce the error. I’m compiling on a Jetson Nano (aarch64/arm64, compute capability 5.3), which I suspect may be causing the problem.

Here are my programs:

main.cpp:

extern void wrapperfunction();

int main(){
	wrapperfunction();
}

wrapper.cu:

#include <cuda.h>
#include <cuda_runtime.h>

using namespace std;

__global__ void update_upper_image(){
	int x = 1;
}

__global__ void event_kernel(){
    update_upper_image<<<512,1>>>();
}

void wrapperfunction(){
	event_kernel<<<4,1>>>();
}

The commands I’m using to compile are:

nvcc -arch=sm_53 -rdc=true -c wrapper.cu
nvcc -arch=sm_53 -dlink -o file_link.o wrapper.o -lcudart -lcudadevrt
g++ wrapper.o file_link.o main.cpp -L/usr/local/cuda/lib64 -lcudart -lcudadevrt

The first two commands work fine, but at the g++ compilation stage I get the following error:

file_link.o: In function `__cudaRegisterLinkedBinary_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37':
link.stub:(.text+0xcc): undefined reference to `__fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37'
link.stub:(.text+0xd0): undefined reference to `__fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37'
collect2: error: ld returned 1 exit status
rufus@rufus-desktop:~/Documents/CUDA_CODE/cm

I’ve been working on this problem for a few days now without any luck, so any help would be greatly appreciated!

Thanks,
Sophie

AastaLLL · September 3, 2020, 3:57am

Hi,

Would you mind checking it again?
We can compile and execute the sample without any issue on the Nano.

Makefile

all:
        nvcc -arch=sm_53 -rdc=true -c wrapper.cu
        nvcc -arch=sm_53 -dlink -o file_link.o wrapper.o -lcudart -lcudadevrt
        g++ wrapper.o file_link.o main.cpp -L/usr/local/cuda/lib64 -lcudart -lcudadevrt

$ make && ./a.out

nvcc -arch=sm_53 -rdc=true -c wrapper.cu
wrapper.cu(7): warning: variable "x" was declared but never referenced

nvcc -arch=sm_53 -dlink -o file_link.o wrapper.o -lcudart -lcudadevrt
g++ wrapper.o file_link.o main.cpp -L/usr/local/cuda/lib64 -lcudart -lcudadevrt

Thanks.

SophieJ · September 3, 2020, 5:52am

Thanks for your response!

Unfortunately I’m still getting the same error when I call make && ./a.out:

nvcc -arch=sm_53 -rdc=true -c wrapper.cu
wrapper.cu(12): warning: variable "x" was declared but never referenced

nvcc -arch=sm_53 -dlink -o file_link.o wrapper.o -lcudart -lcudadevrt
g++ wrapper.o file_link.o main.cpp -L/usr/local/cuda/lib64 -lcudart -lcudadevrt
file_link.o: In function `__cudaRegisterLinkedBinary_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37':
link.stub:(.text+0xcc): undefined reference to `__fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37'
link.stub:(.text+0xd0): undefined reference to `__fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37'
collect2: error: ld returned 1 exit status
Makefile:2: recipe for target 'all' failed
make: *** [all] Error 1

Since you were able to run it, this may indicate that the problem isn’t the Nano.

Do you have any other ideas as to why this might be failing?

AastaLLL · September 4, 2020, 3:47am

Hi,

Based on your log, it seems that the app is trying to link the compute_75 device runtime.
Would you mind trying the following commands first to see if they work:

$ nvcc -gencode arch=compute_53,code=sm_53 -rdc=true -c wrapper.cu
$ nvcc -gencode arch=compute_53,code=sm_53 -dlink -o file_link.o wrapper.o -lcudart -lcudadevrt
$ g++ wrapper.o file_link.o main.cpp -L/usr/local/cuda/lib64 -lcudart -lcudadevrt
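
The compute capability named in a linker error like the one in the log can be pulled out mechanically, which makes an architecture mismatch easier to spot. This is a minimal shell sketch, not part of the original exchange; `arch_from_symbol` is a hypothetical helper name, and it assumes the symbol follows the `..._compute_NN_...` pattern seen in the log:

```shell
#!/bin/sh
# Hypothetical helper: print the compute_NN tag embedded in a linker symbol,
# such as the __fatbinwrap_... name reported by ld in the log.
arch_from_symbol() {
    printf '%s\n' "$1" | sed -n 's/.*\(compute_[0-9][0-9]*\).*/\1/p'
}

arch_from_symbol "__fatbinwrap_38_cuda_device_runtime_compute_75_cpp1_ii_8b1a5d37"
```

For this symbol the sketch prints `compute_75`, which does not match the `sm_53` target passed to nvcc.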

If the issue persists, please share the deviceQuery output log for your environment.

$ cd /usr/local/cuda-10.2/samples/1_Utilities/deviceQuery
$ sudo make
$ ./deviceQuery

Thanks.

SophieJ · September 4, 2020, 8:22am

Hi again and thank you - I was able to debug the problem!

I didn’t have cuda-10.2. I have cuda-10.0 and cuda-11.0 installed, with cuda-11.0 as my default. When I ran:

$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ sudo make

I got:

/usr/local/cuda-11.0/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery.o -c deviceQuery.cpp
/usr/local/cuda-11.0/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery deviceQuery.o 
mkdir -p ../../bin/sbsa/linux/release
cp deviceQuery ../../bin/sbsa/linux/release

This indicated that compute_53 wasn’t available when I linked with g++ against cuda-11.0.
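
A quick way to see which toolkits are installed, and which one the default `/usr/local/cuda` path resolves to, is sketched below. It assumes the versioned `/usr/local/cuda-X.Y` layout used in this thread; `list_toolkits` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: list versioned toolkit directories under a prefix
# (defaults to /usr/local, the layout used in this thread).
list_toolkits() {
    prefix=${1:-/usr/local}
    for d in "$prefix"/cuda-*; do
        [ -d "$d" ] && printf '%s\n' "$d"
    done
    return 0
}

list_toolkits

# Show where the unversioned default path points, if it exists.
if [ -e /usr/local/cuda ]; then
    echo "default: $(readlink -f /usr/local/cuda)"
fi
```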

The commands that fix this problem are:

$ nvcc -gencode arch=compute_53,code=sm_53 -rdc=true -c wrapper.cu
$ nvcc -gencode arch=compute_53,code=sm_53 -dlink -o file_link.o wrapper.o -lcudart -lcudadevrt
$ g++ wrapper.o file_link.o main.cpp -L/usr/local/cuda-10.0/lib64 -lcudart -lcudadevrt

This works because:

$ cd /usr/local/cuda-10.0/samples/1_Utilities/deviceQuery
$ sudo make

Produces:

/usr/local/cuda-10.0/bin/nvcc -ccbin g++ -I../../common/inc  -m64    -gencode arch=compute_30,code=sm_30 -gencode arch=compute_32,code=sm_32 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery.o -c deviceQuery.cpp
/usr/local/cuda-10.0/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_30,code=sm_30 -gencode arch=compute_32,code=sm_32 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_75,code=compute_75 -o deviceQuery deviceQuery.o 
mkdir -p ../../bin/aarch64/linux/release
cp deviceQuery ../../bin/aarch64/linux/release
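
One way to keep the g++ link step pointed at the same toolkit as nvcc is to derive the library directory from the nvcc path instead of hard-coding `/usr/local/cuda`. This is a minimal sketch, assuming the usual `<toolkit>/bin/nvcc` and `<toolkit>/lib64` layout seen in the commands in this thread; `cuda_root_from_nvcc` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: map <toolkit>/bin/nvcc to <toolkit>.
cuda_root_from_nvcc() {
    dirname "$(dirname "$1")"
}

# Resolve symlinks so /usr/local/cuda/bin/nvcc reports the real versioned toolkit.
if NVCC_PATH=$(command -v nvcc); then
    CUDA_ROOT=$(cuda_root_from_nvcc "$(readlink -f "$NVCC_PATH")")
    echo "nvcc toolkit root: $CUDA_ROOT"
    echo "link with:         -L$CUDA_ROOT/lib64 -lcudart -lcudadevrt"
else
    echo "nvcc not found on PATH"
fi
```

If the printed root differs from the directory given to g++ with `-L`, the device-link object and the runtime libraries come from different toolkits.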

Thank you again!
