Your three-item list is a good summary.
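(1) Driver initialization: Make sure you have enabled driver persistence (on Linux, persistence mode, e.g. via the nvidia-persistenced daemon), if you haven’t done that already. Without it, the driver may be unloaded when no client is using the GPU and must then be re-initialized at the next application start. Long driver initialization times are frequently seen on systems with very large memory (both CPU and GPU), as the driver needs to map all GPU and system memory into a single virtual memory map.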
(2) PTX compilation: Best practice is to create fat binaries that embed SASS for all architectures you intend to support, plus one PTX version, built for the latest architecture supported by CUDA, for forward compatibility with future GPU architectures. That way no JIT compilation is needed at startup on any GPU for which SASS is embedded. Dynamic PTX generation and compilation should be used only when absolutely necessary, e.g. some in-memory databases for GPUs compile queries into custom kernels created on the fly.
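To make the fat-binary recommendation concrete, here is a minimal sketch of a helper that assembles the nvcc -gencode options: one SASS target per supported architecture, plus PTX for the newest one only. The helper name gencode_flags, the architecture list (70, 80, 90), and the file names app.cu / app are assumptions for illustration; substitute the compute capabilities you actually need to support.

```python
from typing import List


def gencode_flags(sass_archs: List[int]) -> List[str]:
    """Build nvcc -gencode options: SASS per architecture, PTX for the newest.

    Hypothetical helper for illustration; architectures are given as
    compute capability without the dot, e.g. 80 for compute capability 8.0.
    """
    if not sass_archs:
        raise ValueError("need at least one target architecture")
    archs = sorted(set(sass_archs))
    flags: List[str] = []
    for cc in archs:
        # code=sm_XX embeds machine code (SASS) for that architecture.
        flags += ["-gencode", f"arch=compute_{cc},code=sm_{cc}"]
    newest = archs[-1]
    # code=compute_XX embeds PTX, which the driver can JIT-compile
    # on GPUs newer than any of the embedded SASS targets.
    flags += ["-gencode", f"arch=compute_{newest},code=compute_{newest}"]
    return flags


if __name__ == "__main__":
    # Example architecture list and file names are assumptions.
    print(" ".join(["nvcc", *gencode_flags([70, 80, 90]), "-o", "app", "app.cu"]))
```

With a binary built this way, JIT compilation of the embedded PTX should only kick in on a GPU newer than the newest SASS target.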
Note that most of the overhead enumerated in your list is CPU (host-side) work, and much of it is single-threaded to boot. For high-performance systems using GPU acceleration, I therefore recommend CPUs with high single-thread performance. At this time, I’d say that means base frequency >= 3.5 GHz.