Slow CUDA application start-up times on a headless compute box

I am experimenting with multi-GPU programming, at the moment using Amazon EC2 Cluster GPU instances. I’ve noticed that my CUDA application takes significant time to start up. Start-up consists of querying device info to verify that all devices have CC >= 2.0, allocating device memory (on the order of 100 MB per GPU), and copying the input data there. This typically takes 2 to 3 seconds, whereas the same operations take negligible time on my desktop development machine, and it doesn’t matter whether I use one or both GPUs on a given node. On the other hand, my kernels run as expected, achieving an almost 2x speedup in the two-GPU configuration vs. the single-GPU configuration. However, as this is a sort of demo application, these lengthy start-up times really cripple the overall speedup numbers. In this post, it is suggested that running an X server may help, but I’m wondering whether there are any other solutions (and why exactly this happens at all)?
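
A minimal sketch of how the start-up sequence described above could be timed phase by phase, to see where the 2 to 3 seconds go. It is not the original application: the CHECK macro, the ms_since helper, and the zero-filled host buffer are placeholders made up for illustration. The cudaFree(0) call is included on the assumption that it triggers context creation on the selected device, so that cost shows up separately from the allocation and the copy.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            std::fprintf(stderr, "%s failed: %s\n", #call,       \
                         cudaGetErrorString(err_));              \
            std::exit(EXIT_FAILURE);                             \
        }                                                        \
    } while (0)

using Clock = std::chrono::steady_clock;

// Hypothetical helper: wall-clock milliseconds elapsed since t0.
static double ms_since(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

int main() {
    const size_t bytes = 100u << 20;   // 100 MiB per GPU, roughly as in the post
    std::vector<char> host(bytes, 0);  // placeholder input data

    // Phase 1: enumerate devices and check compute capability >= 2.0.
    Clock::time_point t0 = Clock::now();
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        CHECK(cudaGetDeviceProperties(&prop, d));
        if (prop.major < 2) {
            std::fprintf(stderr, "device %d has CC %d.%d (< 2.0)\n",
                         d, prop.major, prop.minor);
            return EXIT_FAILURE;
        }
    }
    std::printf("device query:     %8.1f ms\n", ms_since(t0));

    for (int d = 0; d < count; ++d) {
        CHECK(cudaSetDevice(d));

        // Phase 2: assumed to force context creation on device d.
        t0 = Clock::now();
        CHECK(cudaFree(0));
        std::printf("gpu %d context:    %8.1f ms\n", d, ms_since(t0));

        // Phase 3: device memory allocation.
        t0 = Clock::now();
        void* dev = nullptr;
        CHECK(cudaMalloc(&dev, bytes));
        std::printf("gpu %d cudaMalloc: %8.1f ms\n", d, ms_since(t0));

        // Phase 4: host-to-device copy of the input data.
        t0 = Clock::now();
        CHECK(cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice));
        std::printf("gpu %d cudaMemcpy: %8.1f ms\n", d, ms_since(t0));

        CHECK(cudaFree(dev));
    }
    return 0;
}
```

Built with nvcc and run on both the EC2 node and the desktop machine, the per-phase numbers should show whether the delay sits in the first runtime call, in context creation, or in the allocation and copy themselves.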