Problem with nvcc and ptxas when cross compiling tensorflow

I’m trying to cross-compile TensorFlow 2.9.1 for a Xavier NX running CUDA 10.2. One of the cross-compilation steps keeps failing, although the other CUDA files compile successfully.

My build machine ran this nvcc command:

nvcc -D_FORCE_INLINES -gencode=arch=compute_72,\"code=sm_72,compute_72\"  --expt-relaxed-constexpr --ftz=true -DEIGEN_MPL2_ONLY -DEIGEN_MAX_ALIGN_BYTES=64 -DHAVE_SYS_UIO_H -DTF_USE_SNAPPY -DAUTOLOAD_DYNAMIC_KERNELS -DGOOGLE_CUDA=1 -DEIGEN_AVOID_STL_ARRAY -DGOOGLE_CUDA=1 -DTENSORFLOW_USE_NVCC=1 -DTENSORFLOW_USE_XLA=1 -DGOOGLE_TENSORRT=1 -DTENSORFLOW_MONOLITHIC_BUILD -std=c++14 --compiler-options " -isystem external/local_config_cuda/cuda -isystem bazel-out/aarch64-opt/bin/external/local_config_cuda/cuda -isystem external/local_config_cuda/cuda/cuda/include -isystem bazel-out/aarch64-opt/bin/external/local_config_cuda/cuda/cuda/include -isystem external/nsync/public -isystem bazel-out/aarch64-opt/bin/external/nsync/public -isystem external/eigen_archive -isystem bazel-out/aarch64-opt/bin/external/eigen_archive -isystem external/gif -isystem bazel-out/aarch64-opt/bin/external/gif -isystem external/com_google_protobuf/src -isystem bazel-out/aarch64-opt/bin/external/com_google_protobuf/src -isystem external/zlib -isystem bazel-out/aarch64-opt/bin/external/zlib -isystem external/farmhash_archive/src -isystem bazel-out/aarch64-opt/bin/external/farmhash_archive/src -isystem external/local_config_rocm/rocm -isystem bazel-out/aarch64-opt/bin/external/local_config_rocm/rocm -isystem external/local_config_rocm/rocm/rocm/include -isystem bazel-out/aarch64-opt/bin/external/local_config_rocm/rocm/rocm/include -isystem external/local_config_rocm/rocm/rocm/include/rocrand -isystem bazel-out/aarch64-opt/bin/external/local_config_rocm/rocm/rocm/include/rocrand -isystem external/local_config_rocm/rocm/rocm/include/roctracer -isystem bazel-out/aarch64-opt/bin/external/local_config_rocm/rocm/rocm/include/roctracer -iquote . 
-iquote bazel-out/aarch64-opt/bin -iquote external/cub_archive -iquote bazel-out/aarch64-opt/bin/external/cub_archive -iquote external/local_config_cuda -iquote bazel-out/aarch64-opt/bin/external/local_config_cuda -iquote external/com_google_absl -iquote bazel-out/aarch64-opt/bin/external/com_google_absl -iquote external/nsync -iquote bazel-out/aarch64-opt/bin/external/nsync -iquote external/eigen_archive -iquote bazel-out/aarch64-opt/bin/external/eigen_archive -iquote external/gif -iquote bazel-out/aarch64-opt/bin/external/gif -iquote external/libjpeg_turbo -iquote bazel-out/aarch64-opt/bin/external/libjpeg_turbo -iquote external/com_google_protobuf -iquote bazel-out/aarch64-opt/bin/external/com_google_protobuf -iquote external/zlib -iquote bazel-out/aarch64-opt/bin/external/zlib -iquote external/com_googlesource_code_re2 -iquote bazel-out/aarch64-opt/bin/external/com_googlesource_code_re2 -iquote external/farmhash_archive -iquote bazel-out/aarch64-opt/bin/external/farmhash_archive -iquote external/fft2d -iquote bazel-out/aarch64-opt/bin/external/fft2d -iquote external/highwayhash -iquote bazel-out/aarch64-opt/bin/external/highwayhash -iquote external/double_conversion -iquote bazel-out/aarch64-opt/bin/external/double_conversion -iquote external/snappy -iquote bazel-out/aarch64-opt/bin/external/snappy -iquote external/local_config_rocm -iquote bazel-out/aarch64-opt/bin/external/local_config_rocm -iquote external/local_config_tensorrt -iquote bazel-out/aarch64-opt/bin/external/local_config_tensorrt -iquote external/cudnn_frontend_archive -iquote bazel-out/aarch64-opt/bin/external/cudnn_frontend_archive -fPIC" --verbose --keep --compiler-bindir=/opt/toolchain/bin/aarch64-linux-gnu-gcc -I . 
-x cu  -g -G -I bazel-out/aarch64-opt/bin/external/local_config_cuda/cuda/_virtual_includes/cuda_headers_virtual -I bazel-out/aarch64-opt/bin/external/local_config_tensorrt/_virtual_includes/tensorrt_headers -I bazel-out/aarch64-opt/bin/external/local_config_cuda/cuda/_virtual_includes/cudnn_header -I bazel-out/aarch64-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend -I external/gemmlowp -c tensorflow/core/kernels/histogram_op_gpu.cu.cc -o bazel-out/aarch64-opt/bin/tensorflow/core/kernels/_objs/histogram_op_gpu/histogram_op_gpu.cu.o

nvcc eventually calls ptxas like this:

ptxas --verbose --compile-only -arch=sm_72 -m64  -g --dont-merge-basicblocks --return-at-end "histogram_op_gpu.cu.ptx"  -o "histogram_op_gpu.cu.sm_72.cubin"

The only output I get from ptxas is this:
ptxas fatal : Unresolved extern function 'cudaGetErrorString'
An strace of ptxas shows that the last failing call is ioctl(3, TCGETS, 0x7ffe13b62410) = -1 ENOTTY (Inappropriate ioctl for device).

Does anyone know what is wrong and how I can cross-compile this file?

I have since encountered another file in the TensorFlow project that I can’t compile with ptxas. Again, the only message I get is ptxas fatal : Unresolved extern function 'cudaGetErrorString'. I also ran the same ptxas command natively on my Xavier NX and got the same error.

The .cc file attached below is the TensorFlow source file that I am trying to compile with nvcc. nvcc calls cicc, which generates a PTX file, and then calls ptxas to compile that PTX file.
sparse_fill_empty_rows_op_gpu.cu.cc (24.3 KB)

I looked at the sparse_fill_empty_rows_op_gpu.cu.ptx file that cicc generated (attached below) and double-checked that every file named in its .file directives exists, using this command:
grep '.file' sparse_fill_empty_rows_op_gpu.cu.ptx | cut -d ' ' -f 2 | cut -d ',' -f 1 | xargs -I % sh -c 'if [ ! -e % ]; then echo % ; fi'
sparse_fill_empty_rows_op_gpu.cu.ptx (63.6 MB)
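
For readability, the same existence check can also be written as a small shell function. This is only a sketch: the function name missing_ptx_sources is made up for illustration, and it assumes each .file directive carries its path between the first pair of double quotes on the line. The demo at the end runs on a tiny hand-written sample, not on real cicc output.

```shell
#!/bin/sh
# Hypothetical helper: print every path named in a .file directive of the
# given PTX file that does not exist on disk.
missing_ptx_sources() {
    # Select lines that start (after optional whitespace) with .file and
    # extract the text between the first pair of double quotes.
    sed -n 's/^[[:space:]]*\.file[[:space:]][^"]*"\([^"]*\)".*/\1/p' "$1" |
    while IFS= read -r path; do
        [ -e "$path" ] || printf '%s\n' "$path"
    done
}

# Self-contained demo: one listed file exists, the other does not.
tmp=$(mktemp -d)
touch "$tmp/present.h"
cat > "$tmp/sample.ptx" <<EOF
	.file	1 "$tmp/present.h"
	.file	2 "$tmp/absent.h", 1600000000, 1234
EOF
missing_ptx_sources "$tmp/sample.ptx"   # prints only the absent.h path
rm -rf "$tmp"
```

Unlike the cut-based one-liner, this variant does not split on spaces, so it should also cope with paths that contain them.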

This is the NVCC command:
nvcc --verbose --keep -D_FORCE_INLINES -gencode=arch=compute_72,\"code=sm_72,compute_72\" --expt-relaxed-constexpr --ftz=true -DEIGEN_MPL2_ONLY -DEIGEN_MAX_ALIGN_BYTES=64 -DHAVE_SYS_UIO_H -DTF_USE_SNAPPY -DAUTOLOAD_DYNAMIC_KERNELS -DGOOGLE_CUDA=1 -DEIGEN_AVOID_STL_ARRAY -DGOOGLE_CUDA=1 -DTENSORFLOW_USE_NVCC=1 -DTENSORFLOW_USE_XLA=1 -DGOOGLE_TENSORRT=1 -DTENSORFLOW_MONOLITHIC_BUILD -std=c++14 --compiler-options " -isystem external/nsync/public -isystem bazel-out/aarch64-opt/bin/external/nsync/public -isystem external/eigen_archive -isystem bazel-out/aarch64-opt/bin/external/eigen_archive -isystem external/gif -isystem bazel-out/aarch64-opt/bin/external/gif -isystem external/com_google_protobuf/src -isystem bazel-out/aarch64-opt/bin/external/com_google_protobuf/src -isystem external/zlib -isystem bazel-out/aarch64-opt/bin/external/zlib -isystem external/farmhash_archive/src -isystem bazel-out/aarch64-opt/bin/external/farmhash_archive/src -isystem external/local_config_cuda/cuda -isystem bazel-out/aarch64-opt/bin/external/local_config_cuda/cuda -isystem external/local_config_cuda/cuda/cuda/include -isystem bazel-out/aarch64-opt/bin/external/local_config_cuda/cuda/cuda/include -iquote . 
-iquote bazel-out/aarch64-opt/bin -iquote external/com_google_absl -iquote bazel-out/aarch64-opt/bin/external/com_google_absl -iquote external/nsync -iquote bazel-out/aarch64-opt/bin/external/nsync -iquote external/eigen_archive -iquote bazel-out/aarch64-opt/bin/external/eigen_archive -iquote external/gif -iquote bazel-out/aarch64-opt/bin/external/gif -iquote external/libjpeg_turbo -iquote bazel-out/aarch64-opt/bin/external/libjpeg_turbo -iquote external/com_google_protobuf -iquote bazel-out/aarch64-opt/bin/external/com_google_protobuf -iquote external/zlib -iquote bazel-out/aarch64-opt/bin/external/zlib -iquote external/com_googlesource_code_re2 -iquote bazel-out/aarch64-opt/bin/external/com_googlesource_code_re2 -iquote external/farmhash_archive -iquote bazel-out/aarch64-opt/bin/external/farmhash_archive -iquote external/fft2d -iquote bazel-out/aarch64-opt/bin/external/fft2d -iquote external/highwayhash -iquote bazel-out/aarch64-opt/bin/external/highwayhash -iquote external/double_conversion -iquote bazel-out/aarch64-opt/bin/external/double_conversion -iquote external/snappy -iquote bazel-out/aarch64-opt/bin/external/snappy -iquote external/local_config_cuda -iquote bazel-out/aarch64-opt/bin/external/local_config_cuda -iquote external/local_config_rocm -iquote bazel-out/aarch64-opt/bin/external/local_config_rocm -iquote external/local_config_tensorrt -iquote bazel-out/aarch64-opt/bin/external/local_config_tensorrt -iquote external/cub_archive -iquote bazel-out/aarch64-opt/bin/external/cub_archive -iquote external/cudnn_frontend_archive -iquote bazel-out/aarch64-opt/bin/external/cudnn_frontend_archive -fPIC" --compiler-bindir=/home/dev/package/tensorflow/host/bin/aarch64-linux-gnu-gcc -I . 
-x cu -g -G -I bazel-out/aarch64-opt/bin/external/local_config_cuda/cuda/_virtual_includes/cuda_headers_virtual -I bazel-out/aarch64-opt/bin/external/local_config_tensorrt/_virtual_includes/tensorrt_headers -I bazel-out/aarch64-opt/bin/external/local_config_cuda/cuda/_virtual_includes/cudnn_header -I bazel-out/aarch64-opt/bin/external/cudnn_frontend_archive/_virtual_includes/cudnn_frontend -I external/gemmlowp -c tensorflow/core/kernels/sparse_fill_empty_rows_op_gpu.cu.cc -o /tmp/sparse_fill_empty_rows_op_gpu.cu.o

This is the cicc command:
cicc --c++14 --gnu_version=70500 --allow_managed --debug_mode --relaxed_constexpr -arch compute_72 -m64 -ftz=1 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "sparse_fill_empty_rows_op_gpu.cu.fatbin.c" -g -O0 -tused -nvvmir-library "/home/dev/package/tensorflow/host/local/cuda-10.2/bin/../nvvm/libdevice/libdevice.10.bc" --gen_module_id_file --module_id_file_name "sparse_fill_empty_rows_op_gpu.cu.module_id" --orig_src_file_name "tensorflow/core/kernels/sparse_fill_empty_rows_op_gpu.cu.cc" --gen_c_file_name "sparse_fill_empty_rows_op_gpu.cu.cudafe1.c" --stub_file_name "sparse_fill_empty_rows_op_gpu.cu.cudafe1.stub.c" --gen_device_file_name "sparse_fill_empty_rows_op_gpu.cu.cudafe1.gpu" "sparse_fill_empty_rows_op_gpu.cu.cpp1.ii" -o "sparse_fill_empty_rows_op_gpu.cu.ptx"

This is the ptxas command:
ptxas -arch=sm_72 -m64 -g --dont-merge-basicblocks --return-at-end "sparse_fill_empty_rows_op_gpu.cu.ptx" -o "/tmp/sparse_fill_empty_rows_op_gpu.cu.sm_72.cubin"

Does anyone know what’s wrong? How do I even debug this?
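
One way to narrow this down might be to find out which kernel or device function in the PTX actually references cudaGetErrorString. The following is a rough sketch, not a verified procedure: the function name find_ptx_refs is made up, it assumes function headers contain .entry or .func and that extern declarations start with .extern, and the sample PTX in the demo is hand-written rather than real cicc output.

```shell
#!/bin/sh
# Hypothetical helper: list lines of a PTX file that mention a symbol as a
# whole token, together with the most recent .entry/.func header before them.
find_ptx_refs() {
    sym=$1
    ptx=$2
    awk -v sym="$sym" '
        # Extern declarations are not call sites, so skip those lines.
        /^[ \t]*\.extern/ { next }
        # Remember the latest kernel or device function header.
        /\.entry|\.func/ { header = $0 }
        # Report lines where the symbol is not part of a longer identifier.
        $0 ~ ("(^|[^A-Za-z0-9_$])" sym "([^A-Za-z0-9_$]|$)") {
            printf "%d: %s\n    in: %s\n", NR, $0, header
        }
    ' "$ptx"
}

# Self-contained demo on a minimal, hand-written PTX-like sample.
tmp=$(mktemp -d)
cat > "$tmp/sample.ptx" <<'EOF'
.extern .func (.param .b64 func_retval0) cudaGetErrorString
(
	.param .b32 cudaGetErrorString_param_0
)
;
.visible .entry demo_kernel()
{
	call.uni (retval0),
	cudaGetErrorString,
	(
	param0
	);
	ret;
}
EOF
find_ptx_refs cudaGetErrorString "$tmp/sample.ptx"
rm -rf "$tmp"
```

On the real file the call would be something like find_ptx_refs cudaGetErrorString sparse_fill_empty_rows_op_gpu.cu.ptx. The reported header is a mangled name, which could then be passed through c++filt to see which TensorFlow or Eigen function it corresponds to.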


System info:

Target OS: Linux
Target NVIDIA GPU or System: Xavier NX
Target CUDA Toolkit Version: AArch64 Jetpack 4.6 (CUDA 10.2)

Build host OS: Ubuntu 20.04 (x86-64)
Build host NVIDIA GPU: None
Build host CUDA Toolkit Version: x86-64 Jetpack 4.6 (CUDA 10.2)

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:35:40_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0

ptxas --version

ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:35:11_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0

Cross compiler gcc --version for Linux

aarch64-linux-gnu-gcc 7.5.0

Host compiler gcc --version for Linux

gcc 7.5.0

Tensorflow version 2.9.1