It’s been years since I looked at the details of compile times. From what I recall, the build flow involves several separate executables invoked by the nvcc driver, with intermediate files used to pass data between the components. That fixed overhead is likely what dominates the compilation of short files.
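If you want to see that flow for yourself, nvcc can list the commands it would run without executing them; on a near-empty .cu file this makes the fixed per-invocation overhead easy to spot. (kernel.cu is just a placeholder name here, and the exact set of sub-tools varies with the CUDA version.)

```
# Show the sub-tools nvcc would invoke (cudafe++, cicc, ptxas, fatbinary,
# the host compiler, ...) and the temporary files passed between them,
# without actually running any of them:
nvcc --dryrun -c kernel.cu
```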
While JIT compilation from PTX has been part of the CUDA build flow from the beginning, JIT compilation from high-level source is a late addition, and the design goals are probably reflected in the speed of the respective compiler components. As Allan noted, compile times have generally improved since JIT compilation from HLL source was added as a feature.
The default settings of the CUDA tool chain are for full optimization, which means a ton of optimization passes run inside LLVM. There isn’t even a convenient -O[1|2|3] switch one can pass to the LLVM-based device compiler, as one could with the old Open64 compiler, to turn off a bunch of those passes; with Open64, lowering the optimization level could improve compilation times quite a bit. Aggressive function inlining is one of the main culprits in ballooning code size and thus compilation times.
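For what it’s worth, the one device-side optimization knob I am aware of is the PTX assembler’s level, which can be lowered through -Xptxas; I don’t know of an equivalent switch that reaches the NVVM/LLVM front end, so take this as a sketch of a partial workaround rather than a full fix.

```
# Reduce ptxas optimization (the default corresponds to -O3) to shorten
# the back-end phase of the device compile; device code quality will suffer.
nvcc -Xptxas -O1 -c kernel.cu
```

On the source side, marking large helper functions with the __noinline__ qualifier is another way to keep the inliner from blowing up the amount of code the optimizer has to chew through.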
If you are a user of the Intel compiler, you will be aware that builds at full optimization are pretty slow there as well, although not quite as slow as with nvcc. There is no free lunch: highly optimized code costs compile time.
In practical terms, CUDA programmers facing excessive build times should keep filing bugs/RFEs with NVIDIA, so that development effort can be directed where it is most needed.