When calling a kernel from within a kernel, I get undefined symbol: __fatbinwrap_f6e73cba_22_cuda_device_runtime_cu_945c48ec_33040

With CDP 2.0 (the modern form of CUDA Dynamic Parallelism, i.e. launching a kernel from device code), that is basically not possible, but there are workarounds. Various threads discuss this; here are a few: 1 2 3
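For reference, here is a minimal sketch of what a device-side launch looks like. The kernel names `parent` and `child` and the file name `cdp.cu` are made up for illustration and are not from the original post. As far as I know, an unresolved `__fatbinwrap_..._cuda_device_runtime_...` symbol usually points at the build rather than the code: CDP needs relocatable device code and the device runtime library, so the build line in the comment is worth checking against your own, though I can't confirm that is the cause in your case.

```cuda
// Assumed build line (file name is hypothetical):
//   nvcc -rdc=true -o cdp cdp.cu -lcudadevrt
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: launched from device code, not from the host.
__global__ void child(int parentThread)
{
    printf("child launched by parent thread %d\n", parentThread);
}

// Parent kernel: each thread performs one device-side launch (this is CDP).
__global__ void parent()
{
    child<<<1, 1>>>(threadIdx.x);
}

int main()
{
    parent<<<1, 2>>>();

    // Host-side synchronization: the parent grid is not considered complete
    // until the child grids it launched have finished.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Note that the only synchronization here is on the host; the parent kernel never waits on its child.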