Can you compile both the emulated code and the real code into the same object with nvcc?
That would be a very useful feature, as people currently have to write compatibility code for the case where no CUDA-enabled HW is present.
If you could just always trust that code written for CUDA will run (regardless of whether the HW is present or not), it would save a lot of work in the higher layers.
If CUDA HW is present, the calls are fast; if not, your code automatically runs through an emulation stub that calls the kernel compiled for the host architecture, using a reasonable number of operating-system threads.
If this is not supported, I wonder why? It is similar to writing code for OpenGL, which runs fast on devices with a good graphics card and otherwise falls back to the CPU.