Probably, but I am not sure, as I haven't used the persistence daemon yet. The instructions at the page I pointed to look detailed and comprehensive, however, so I would suggest simply working through those. I don't know whether the legacy persistence mode, turned on via nvidia-smi, still works at this time.
CUDA sets up a unified 64-bit address map so that all memory in the system is accessible at unique addresses. This is completely independent of an app's use of particular allocation or copy functions (which is not known at driver/runtime startup anyhow).
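If you want to confirm that unified addressing is in effect on a given device, something along these lines should work (just an illustrative sketch, not code from your app): it queries the device's unifiedAddressing property and then asks the runtime which memory space a pointer belongs to.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("unifiedAddressing = %d\n", prop.unifiedAddressing);

    // With UVA, the runtime can tell from the pointer value alone
    // which memory space an address belongs to.
    void *d_ptr = 0;
    cudaMalloc(&d_ptr, 1024);
    cudaPointerAttributes attr;
    cudaPointerGetAttributes(&attr, d_ptr);
    // attr.type is the CUDA 10+ field name; older toolkits call it memoryType
    printf("pointer type = %d (2 = cudaMemoryTypeDevice)\n", (int)attr.type);
    cudaFree(d_ptr);
    return 0;
}
```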
 I think cudaGetDeviceProperties() is the only CUDA API function that does not trigger context creation, so it would make sense that cudaFree(0) absorbs all the overhead.
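As a quick sanity check of where the startup cost lands, one could time the two calls separately, roughly like this (illustrative sketch only, timings will vary by system):

```cpp
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

static double ms(std::chrono::steady_clock::time_point a,
                 std::chrono::steady_clock::time_point b)
{
    return std::chrono::duration<double, std::milli>(b - a).count();
}

int main(void)
{
    auto t0 = std::chrono::steady_clock::now();
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // reportedly does not create a context
    auto t1 = std::chrono::steady_clock::now();
    cudaFree(0);                         // forces lazy initialization / context creation
    auto t2 = std::chrono::steady_clock::now();

    printf("cudaGetDeviceProperties(): %8.1f ms\n", ms(t0, t1));
    printf("cudaFree(0)              : %8.1f ms\n", ms(t1, t2));
    printf("device 0: %s\n", prop.name);
    return 0;
}
```

If most of the wall-clock time shows up in the cudaFree(0) call, that would be consistent with context creation being the dominant overhead.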
Is the 12 GB for the server a typo that should really read 128 GB? 12 GB of system memory seems incredibly small for a server. Ideally, a host's system memory should be four times the size of the total GPU memory in the system, and at least twice that size.
If the server really has a much smaller system memory than the desktop, lack of persistence would seem to be the strongest hypothesis that explains your observations. Lengthy startup due to address space mapping is typically seen in server systems with 100+ GB of system memory.
How does the CPU/memory performance of the server compare to that of the desktop machine? Servers often sport CPUs with many cores but low clock frequency (~2 GHz), and thus low single-thread performance; high single-thread performance, however, is precisely what is needed to minimize host-side overhead in the CUDA stack. Servers may also use slower speed grades of memory than desktop machines (partially counterbalanced by more memory channels and larger caches in server CPUs).