I might have misunderstood something, so please correct me where I am wrong.
What I have is code that is compatible with sm_11/compute_11. What I want from compilation is a low startup delay, achieved by embedding device code for several architectures (sm_11, sm_12, sm_13) as well as PTX code for other architectures.
Therefore, I chose the virtual architecture with the lowest possible feature set, but at the same time, by explicitly specifying sm_11, sm_12 and sm_13, I get “exact code” and therefore a small startup delay for these architectures [1,2]. By also specifying compute_11 as an architecture, I get PTX code that the JIT compiler can compile to (hopefully) further optimized code for architectures newer than sm_13 (e.g. Fermi).
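If I read the documentation correctly, the command line for this setup would look roughly like the sketch below. The file names `kernel.cu` and `app` are placeholders I made up, and the script only prints the command instead of running it, so it can be checked without a CUDA toolkit installed.

```shell
# Placeholder file names (assumptions, not from my actual project).
SRC=kernel.cu
OUT=app

# Virtual architecture for the PTX stage: the lowest feature set the code needs.
VIRT=compute_11

# What gets embedded in the executable:
#   sm_11, sm_12, sm_13 -> binary code for these exact architectures (no JIT at startup)
#   compute_11          -> PTX that the driver can JIT-compile for other GPUs
CODE=sm_11,sm_12,sm_13,compute_11

NVCC_CMD="nvcc -arch=$VIRT -code=$CODE $SRC -o $OUT"
echo "$NVCC_CMD"

# Uncomment to actually build (requires nvcc on the PATH):
# $NVCC_CMD
```

As far as I understand, every real architecture listed under `-code` has to be compatible with the virtual architecture given to `-arch`, which is the case here for sm_11, sm_12 and sm_13 with compute_11.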
[1] “Virtual architectures”, nvcc documentation, v2.3, 7-29-2009, pp. 26-27
[2] “Just in time compilation”, nvcc documentation, v2.3, 7-29-2009, p. 29
[3] “Device code repositories”, nvcc documentation, v2.3, 7-29-2009, p. 30