The driver API generally JIT-compiles code for the currently selected device, so the process is not completely disconnected from the GPU. Yes, I acknowledge that this doesn’t seem like it should require much interaction or serialization, and I personally don’t know the extent of the serialization, but I have seen reports of it. Beyond that, I wouldn’t be able to explain in detail the dependencies, the serialization, or the rationale behind the observations.
Yes, you can create fatbins with nvcc. You can also have nvcc produce cubin output, which the driver API can consume directly without JIT compilation.
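As a rough illustration of the cubin route, here is a minimal sketch. The file name example.cu, the kernel name scale, and the sm_80 target are my own placeholders, not anything from your project; substitute the architecture of your GPU. A cubin contains machine code for a specific architecture, so it has to be built for one that is compatible with the device you load it on.

```cuda
// example.cu (hypothetical file name; sm_80 is an assumed target architecture)
// Build the device code as a cubin, then build the host loader from the same file:
//   nvcc -cubin -arch=sm_80 example.cu -o example.cubin
//   nvcc example.cu -o loader -lcuda
// (nvcc -fatbin ... produces a fatbin instead, which cuModuleLoad also accepts.)
#include <cuda.h>
#include <cstdio>
#include <cstdlib>

// extern "C" keeps the kernel name unmangled so it can be looked up by string.
extern "C" __global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Abort with a readable message if a driver API call fails.
static void check(CUresult r, const char *what)
{
    if (r != CUDA_SUCCESS) {
        const char *msg = nullptr;
        cuGetErrorString(r, &msg);
        fprintf(stderr, "%s failed: %s\n", what, msg ? msg : "unknown error");
        exit(EXIT_FAILURE);
    }
}

int main()
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;

    check(cuInit(0), "cuInit");
    check(cuDeviceGet(&dev, 0), "cuDeviceGet");
    // Use the device's primary context and make it current for this thread.
    check(cuDevicePrimaryCtxRetain(&ctx, dev), "cuDevicePrimaryCtxRetain");
    check(cuCtxSetCurrent(ctx), "cuCtxSetCurrent");

    // Load the precompiled cubin from disk and look up the kernel by name.
    check(cuModuleLoad(&mod, "example.cubin"), "cuModuleLoad");
    check(cuModuleGetFunction(&fn, mod, "scale"), "cuModuleGetFunction");
    printf("loaded kernel 'scale' from example.cubin\n");

    check(cuModuleUnload(mod), "cuModuleUnload");
    check(cuDevicePrimaryCtxRelease(dev), "cuDevicePrimaryCtxRelease");
    return 0;
}
```

This only loads the module and resolves the kernel; launching it with cuLaunchKernel is left out to keep the sketch short. If the cubin was built for an architecture that doesn't match the device, I would expect the load step to report an error, so check the return codes.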
You can request changes or enhancements to CUDA behavior by filing a bug. In this case, you might be asked for a demonstrator and your observations. A short text description like the one currently in your post might not gain traction without a demonstrator.