I want to compile some code that is sm_11+ compatible in such a way that the binary (without recompiling) is as suitable as possible for upcoming drivers/hardware. After going through the nvcc documentation, the conclusion I came to is to use [font="Courier New"]-arch=compute_11 -code=compute_11,sm_11,sm_12,sm_13[/font]. Does this sound reasonable?
What good is it to generate code for sm_1.2 and sm_1.3 when the feature set you specified (compute v1.1) does not make use of any of the 1.2 or 1.3 features?
I might have misunderstood something so please correct me where I am wrong.
What I have is sm_11/compute_11 compatible code. What I want as a result of compilation is a low startup delay, by having device code for the current architectures (sm_11, sm_12, sm_13) but also PTX code for future architectures.
Therefore, I chose the lowest possible virtual architecture feature-set-wise [1], but at the same time, by explicitly specifying sm_11, sm_12 and sm_13, I get "exact code" and therefore a small startup delay on these architectures [1,2]. By also listing compute_11 as a code target I get PTX that can be compiled by the JIT compiler [3] into (hopefully) further optimized code for sm_X, X > 1.3 (e.g. Fermi).
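To make it concrete, here is roughly the build I have in mind (the kernel and file name are placeholders, and the invocation is just how I read the nvcc manual):

[code]
// kernel.cu -- placeholder kernel, only here to illustrate the build line.
// Assumed invocation (as I read the nvcc manual):
//   nvcc -c kernel.cu -arch=compute_11 -code=compute_11,sm_11,sm_12,sm_13
// Intended result: the fatbinary carries compute_11 PTX (for JIT on future
// devices) plus native cubins for sm_11, sm_12 and sm_13 (no JIT delay there).

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}
[/code]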
The ptxas compiler generates slightly different code when targeting SM 1.2 devices (different register allocation and use of a mysterious variant of the mov instruction). Probably doesn’t impact performance much, though…
Up to now, when a 1.2 or 1.3 device runs sm_11+compute_11 code, it directly executes the sm_11 binary code instead of recompiling the PTX.
Now I am getting even more confused. I do compile for sm_12 and sm_13 as well, so I should have code for 1.2 and 1.3…
But on the other hand, if I understand you correctly (and what you’re saying is true), then the statements in the nvcc documentation are far from what happens in reality: even if only compute_11 PTX and sm_11 code is available, the JIT compiler should (according to the nvcc doc) take the PTX and generate the appropriate code for 1.2 and 1.3 devices.
Do you actually use JIT compilation in your app? This requires using the Driver API. The Runtime API compiles kernels to binary and embeds them in the .exe, AFAIK.
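For reference, a minimal sketch of what driver-API JIT loading looks like (error checking trimmed; the PTX string and the kernel name "scale" are placeholders):

[code]
// Sketch: JIT-compiling PTX with the Driver API. The driver's built-in
// ptxas compiles the PTX for whatever device the current context lives on.
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction func;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // ptxSource would normally be read from a .ptx file produced with
    // "nvcc -ptx kernel.cu" (placeholder here).
    const char *ptxSource = /* ... PTX text ... */ "";

    // This is where the JIT happens.
    char logBuffer[4096];
    CUjit_option opts[] = { CU_JIT_INFO_LOG_BUFFER,
                            CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES };
    void *vals[] = { logBuffer, (void *)(size_t)sizeof(logBuffer) };
    CUresult res = cuModuleLoadDataEx(&mod, ptxSource, 2, opts, vals);
    if (res != CUDA_SUCCESS) {
        printf("JIT failed: %d\n%s\n", (int)res, logBuffer);
        return 1;
    }

    cuModuleGetFunction(&func, mod, "scale");
    // ... set up parameters and launch (cuLaunchGrid / cuLaunchKernel) ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
[/code]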
This sheds some light on my confusion. Where is this in the documentation? Btw, why is JIT compilation Driver API-only?
So using virtual architectures has no effect at all on code that uses the Runtime API? How about the -code option with different real architectures & Runtime API?
In every case that I have seen (over 100 apps), nvcc will embed PTX as well as sm_10, etc., in each CUDA binary. It is up to the runtime whether to JIT from PTX to something else. The Driver API lets you do it yourself.
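If you want to see which version a given device actually ended up with, cudaFuncGetAttributes will tell you (sketch only; "scale" is a made-up kernel, and the ptxVersion/binaryVersion fields depend on your toolkit version):

[code]
// Sketch: ask the runtime which code it selected for a kernel.
// binaryVersion/ptxVersion are reported as 10*major + minor
// (e.g. 11 for compute capability 1.1).
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("binaryVersion = %d, ptxVersion = %d, numRegs = %d\n",
           attr.binaryVersion, attr.ptxVersion, attr.numRegs);
    return 0;
}
[/code]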