Just to state the obvious, JIT compilation happens at runtime and will increase your application’s time to completion. It is best to build a fat binary which contains machine code for each compute capability you plan to run on.
It’s not clear how much of your application’s overall runtime is due to code running on the host. If it is a significant portion, the performance of your host system will factor into application runtime. Likewise, the CUDA driver executes code on the host, and the higher the single-thread performance of the host system, the less host-side driver overhead there will be.
In terms of software configuration, make sure you run identical and recent driver versions on all platforms. Note that the driver model has implications for driver overhead: the Linux, Windows XP, and Windows 7 TCC drivers have small overhead, while the Windows 7 WDDM driver has significant overhead (which the CUDA driver tries to mitigate by batching launches etc., but this can cause other performance artifacts).
In terms of data transfer to and from the device, ensure the PCIe interface is configured correctly. Your GPUs should run with an x16 PCIe gen2 interface, which, when properly configured, should give you a transfer rate of around 6 GB/sec in each direction for large blocks (say, 16 MB). If any of your machines is a multi-socket system, make sure to tightly control NUMA features such as CPU affinity and memory affinity so the GPU “talks” to the “near” CPU and the “near” system memory.