__cudaRegisterFatBinary is not called in OP-TEE OS

Hi:
I use “nvcc -ccbin path-to-aarch64-linux-gnu-g++ -arch=sm_75 -Xcompiler -fPIC” to compile a .cu file into an ELF on x86 Linux, and execute it on ARMv8 OP-TEE OS with a shared CUDA library. I also have the Ocelot CUDA runtime environment in OP-TEE OS.

When I use gdb to trace the call stack, __cudaRegisterFatBinary is not called in OP-TEE OS. As a result, the host code compiled by nvcc runs normally in OP-TEE, but the device code cannot be offloaded to the GPU, because __cudaRegisterFatBinary is never called to perform the usual registration. For comparison, when the same code is executed on ARMv8 Linux, __cudaRegisterFatBinary is called before the main function.

There is an explanation in this link: Troubleshooting cudaErrorCudartUnloading and suggested solutions - SegmentFault 思否, which says that cudaRegisterAll is embedded into the code automatically during nvcc compilation. I also found that cudaRegisterAll comes from nvcc itself: there is no .so or .a library in OP-TEE OS that contains cudaRegisterAll.

My questions are:
1. How can I call __cudaRegisterFatBinary “manually” when I want to run an nvcc-compiled ELF file in TEE OS?
2. Is it possible to have nvcc compile the .cu file with a statically linked libcudart.a and so on, and make it run in TEE OS?

Thanks!
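
To make the question more concrete, here is a minimal sketch of the mechanism as I understand it: the registration function that nvcc generates appears to run as a static constructor (an .init_array entry) before main, so a loader that does not process those entries would silently skip it. Everything in the sketch is a hypothetical stand-in: mock_register_fat_binary, ensure_registered, mock_register_all and the fake fat binary are illustrative names, not real CUDA runtime symbols, and the real signature of __cudaRegisterFatBinary is not reproduced here.

```c
#include <stddef.h>

/* Hypothetical stand-ins: none of these are real CUDA runtime symbols. */
static int   g_register_calls = 0;     /* how many times registration ran */
static void *g_fatbin_handle  = NULL;  /* handle returned by the mock runtime */
static char  g_fake_fatbin[]  = "fake device code";

/* Mock of what a runtime's fat-binary registration entry point might do. */
static void *mock_register_fat_binary(void *fatbin)
{
    g_register_calls++;
    return fatbin;  /* a real runtime would return an opaque handle */
}

/* Idempotent wrapper: safe to call from a constructor and again by hand. */
void ensure_registered(void)
{
    if (g_fatbin_handle == NULL)
        g_fatbin_handle = mock_register_fat_binary(g_fake_fatbin);
}

/* On a loader that processes .init_array, this runs before main().
 * If the loader skips constructors, it never runs at all. */
__attribute__((constructor))
static void mock_register_all(void)
{
    ensure_registered();
}

/* Small accessors so callers can check what happened. */
int register_call_count(void) { return g_register_calls; }
int is_registered(void)       { return g_fatbin_handle != NULL; }
```

Under this assumption, calling something like ensure_registered() at the start of main (or of the TA entry point) would perform the registration exactly once, whether or not the loader ran the constructor. Whether the real nvcc-generated function can be reached this way is exactly what I am unsure about, since it is normally emitted as a file-local symbol.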