CICC cannot handle arm_neon.h, any help?

I am trying to compile a .cu file that includes “arm_neon.h” on a Jetson Xavier AGX with CUDA 11.4, gcc 9.4.0, and nvcc V12.1.105. The file contains just this line:

#include <arm_neon.h>

Compiled with:

nvcc -arch=sm_72 -v --keep --compiler-options " -march=armv8.2-a+fp16 -I. -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -pthread -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/aarch64-linux/include -Wno-pedantic" -ptx

#$ CUDART=cudart
#$ HERE=/usr/local/cuda/bin
#$ THERE=/usr/local/cuda/bin
#$ TARGET_DIR=targets/aarch64-linux
#$ TOP=/usr/local/cuda/bin/…
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda/bin/…/nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/…/lib:/usr/local/cuda/lib
#$ PATH=/usr/local/cuda/bin/…/nvvm/bin:/usr/local/cuda/bin:/home/djinn/AI-ML/MiniConda3/bin:/home/djinn/AI-ML/MiniConda3/condabin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
#$ INCLUDES=“-I/usr/local/cuda/bin/…/targets/aarch64-linux/include”
#$ LIBRARIES= “-L/usr/local/cuda/bin/…/targets/aarch64-linux/lib/stubs” “-L/usr/local/cuda/bin/…/targets/aarch64-linux/lib”
#$ gcc -D__CUDA_ARCH__=720 -D__CUDA_ARCH_LIST__=720 -E -x c++ -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ -march=armv8.2-a+fp16 -I. -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -pthread -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/aarch64-linux/include -Wno-pedantic “-I/usr/local/cuda/bin/…/targets/aarch64-linux/include” -D__CUDACC_VER_MAJOR__=12 -D__CUDACC_VER_MINOR__=1 -D__CUDACC_VER_BUILD__=105 -D__CUDA_API_VER_MAJOR__=12 -D__CUDA_API_VER_MINOR__=1 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include “cuda_runtime.h” “” -o “simple.cpp1.ii”
#$ cicc --c++14 --gnu_version=90400 --display_error_number --orig_src_file_name “” --orig_src_path_name “/home/djinn/Builds/” --allow_managed --unsigned_chars --unsigned_wchar_t -arch compute_72 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name “simple.fatbin.c” -tused --gen_module_id_file --module_id_file_name “simple.module_id” “simple.cpp1.ii” -o “simple.ptx”
/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h(38): error: identifier “__Int8x8_t” is undefined
typedef __Int8x8_t int8x8_t;

gcc, or rather its preprocessor, does its job: in the generated file simple.cpp1.ii, the whole <arm_neon.h> file is included.

Excerpt from simple.cpp1.ii, somewhere around line 25748:

#34 “/usr/lib/gcc/aarch64-linux-gnu/9/include/arm_neon.h” 2 3 4

typedef __Int8x8_t int8x8_t;
typedef __Int16x4_t int16x4_t;

But as soon as nvcc calls cicc, it fails with the error shown above: error: identifier "__ any-neon-intrinsic __" is undefined.
It doesn’t matter which compiler (gcc 8-10 or clang 9-10) or even which CUDA version (10.2/11.4/12.1) I use; it always fails. Building any non-.cu source file (.h/.c/.cpp/.cxx) with arm_neon.h is absolutely fine.

What am I missing?

I originally intended to build the whisper.cpp and llama.cpp projects on GitHub with CUDA support. The CUDA version of the bitsandbytes library also fails with the same error.

I found a solution for the llama.cpp and whisper.cpp projects: use __CUDACC__ to conditionally define a half type for nvcc and an fp16 type for gcc/clang.

There is also a version of bitsandbytes where the NEON instructions are separated out of the .cu files into .cpp files.
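Roughly, that separation could look like the sketch below. This is not the bitsandbytes layout; the file names neon_kernels.h / neon_kernels.cpp and the function add_f32 are hypothetical, and the scalar tail loop doubles as a fallback so the snippet also builds on non-ARM hosts. The .cu file only ever sees the plain declaration, while arm_neon.h is confined to a file compiled by gcc/clang.

```cpp
// ---- neon_kernels.h (hypothetical): safe to include from .cu files ----
#include <cstddef>

// Element-wise out[i] = a[i] + b[i]; no NEON types in the signature.
void add_f32(const float* a, const float* b, float* out, std::size_t n);

// ---- neon_kernels.cpp (hypothetical): compiled by gcc/clang, not nvcc ----
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

void add_f32(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
    // Process four floats per iteration with NEON.
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    // Scalar tail (and full fallback when NEON is unavailable).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The .cu side would include only the header part and call add_f32 from host code, with the .cpp object linked in at the end.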