You may be able to reduce this initialization time significantly by telling nvcc which GPU architecture your kernel will run on. To do so, add -code sm_13 (or whatever matches your GPU) to nvcc's command line. For details, have a look at e.g. page 16 of nvcc_2.1.pdf in the doc directory next to nvcc's bin directory.
If you don't specify -code, apparently something like ptxas is invoked when the first CUDA function is executed, in order to compile and optimize the PTX embedded in your executable for the GPU that is actually present. (I just figured this out with a rather large kernel: without -code, the executable aborts after about 3.5 minutes, all of it spent in the first CUDA function. With -code, compilation takes about 4.5 minutes, but the executable initializes within a few seconds.)
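If you want to check whether this is what is happening in your case, a rough sketch like the following may help. It only times the first piece of CUDA work in a process, so you can compare a PTX-only build against a build with native code. The file name init_time.cu, the kernel touch and the helper first_launch_ms are made up for illustration, XY stands for your GPU's compute capability (13 in my case), and the code assumes a reasonably recent CUDA runtime (on a toolkit as old as 2.1 you would need cudaThreadSynchronize and a plain C timer instead).

```cuda
// Two builds to compare (replace XY with your GPU's compute capability):
//   nvcc -arch compute_XY -code compute_XY -o init_ptx    init_time.cu   // PTX only, compiled at run time
//   nvcc -arch compute_XY -code sm_XY      -o init_native init_time.cu   // native code for sm_XY
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel (hypothetical), just so the executable carries device code.
__global__ void touch(int *out) { *out = 42; }

// Wall-clock time in milliseconds of the first CUDA work in this process:
// context creation, loading of the device code (including any run-time
// compilation of embedded PTX) and one tiny kernel launch.
// Returns a negative value if a CUDA call fails.
double first_launch_ms() {
    auto start = std::chrono::steady_clock::now();

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return -1.0;

    touch<<<1, 1>>>(d_out);
    cudaError_t err = cudaGetLastError();                    // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();   // execution errors

    auto stop = std::chrono::steady_clock::now();
    cudaFree(d_out);
    if (err != cudaSuccess) return -1.0;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    double ms = first_launch_ms();
    if (ms < 0.0) {
        std::fprintf(stderr, "CUDA call failed\n");
        return 1;
    }
    std::printf("first CUDA work took %.1f ms\n", ms);
    return 0;
}
```

With a kernel this small the difference will be tiny; the point is to swap in your own large kernel and compare the two builds. Note that newer drivers cache the result of the run-time compilation, so a second run of the PTX-only build may look fast unless the cache is disabled (e.g. by setting CUDA_CACHE_DISABLE=1).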