I want to compile some code that is sm_11+ compatible in such a way that the binary (without recompiling) is as suitable as possible for upcoming drivers/hardware. After going through the nvcc documentation, the conclusion I came to is to use [font="Courier New"]-arch=compute_11 -code=compute_11,sm_11,sm_12,sm_13[/font]. Does this sound reasonable?
What good is it to generate code for sm_1.2 and sm_1.3 when the feature set you specified (compute v1.1) does not make use of any of the 1.2 or 1.3 features?
I might have misunderstood something so please correct me where I am wrong.
What I have is sm_11/compute_11 compatible code. What I want as a result of compilation is a low startup delay, by having device code for the current architectures (sm_11, sm_12, sm_13) but also PTX code for future architectures.
Therefore, I chose the lowest possible virtual architecture feature-set-wise [1], but at the same time, by explicitly specifying sm_11, sm_12 and sm_13, I get "exact code" and therefore a small startup delay on these architectures [1,2]. By also listing compute_11 as a code target I get PTX that can be compiled by the JIT compiler [3] into (hopefully) further optimized code for sm_X, X > 1.3 (e.g. Fermi).
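To make it concrete, here is roughly the build I have in mind (the kernel and file name are placeholders, and the invocation is just how I read the nvcc manual):

[code]
// kernel.cu -- placeholder kernel, only here to illustrate the build line.
// Assumed invocation (as I read the nvcc manual):
//   nvcc -c kernel.cu -arch=compute_11 -code=compute_11,sm_11,sm_12,sm_13
// Intended result: the fatbinary carries compute_11 PTX (for JIT on future
// devices) plus native cubins for sm_11, sm_12 and sm_13 (no JIT delay there).

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}
[/code]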
The ptxas compiler generates slightly different code when targeting SM 1.2 devices (different register allocation and use of a mysterious variant of the mov instruction). Probably doesn’t impact performance much, though…
Up to now, when a 1.2 or 1.3 device runs sm_11+compute_11 code, it directly executes the sm_11 binary code instead of recompiling the PTX.
Now I am getting even more confused. I do compile for sm_12 and sm_13 as well, so I should have code for 1.2 and 1.3…
But on the other hand, if I understand you correctly (and what you’re saying is true), then the statements in the nvcc documentation are far from what happens in reality: even if only compute_11 PTX and sm_11 code is available, the JIT compiler should (according to the nvcc doc) take the PTX and generate the appropriate code for 1.2 and 1.3 devices.
Do you actually use JIT compilation in your app? This requires using the Driver API. The Runtime API compiles kernels to binary and embeds them in the .exe, AFAIK.
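For reference, a minimal sketch of what driver-API JIT loading looks like (error checking trimmed; the PTX string and the kernel name "scale" are placeholders):

[code]
// Sketch: JIT-compiling PTX with the Driver API. The driver's built-in
// ptxas compiles the PTX for whatever device the current context lives on.
#include <cuda.h>
#include <stdio.h>

int main(void)
{
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction func;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // ptxSource would normally be read from a .ptx file produced with
    // "nvcc -ptx kernel.cu" (placeholder here).
    const char *ptxSource = /* ... PTX text ... */ "";

    // This is where the JIT happens.
    char logBuffer[4096];
    CUjit_option opts[] = { CU_JIT_INFO_LOG_BUFFER,
                            CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES };
    void *vals[] = { logBuffer, (void *)(size_t)sizeof(logBuffer) };
    CUresult res = cuModuleLoadDataEx(&mod, ptxSource, 2, opts, vals);
    if (res != CUDA_SUCCESS) {
        printf("JIT failed: %d\n%s\n", (int)res, logBuffer);
        return 1;
    }

    cuModuleGetFunction(&func, mod, "scale");
    // ... set up parameters and launch (cuLaunchGrid / cuLaunchKernel) ...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
[/code]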
This sheds some light on my confusion. Where is this in the documentation? Btw, why is JIT compilation Driver API-only?
So using virtual architectures has no effect at all on code that uses the Runtime API? How about the -code option with different real architectures & Runtime API?
In every case that I have seen (over 100 apps), nvcc will embed PTX as well as sm_10, etc., in each CUDA binary. It is up to the runtime whether to JIT from PTX to something else. The Driver API lets you do it yourself.
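If you want to see which version a given device actually ended up with, cudaFuncGetAttributes will tell you (sketch only; "scale" is a made-up kernel, and the ptxVersion/binaryVersion fields depend on your toolkit version):

[code]
// Sketch: ask the runtime which code it selected for a kernel.
// binaryVersion/ptxVersion are reported as 10*major + minor
// (e.g. 11 for compute capability 1.1).
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("binaryVersion = %d, ptxVersion = %d, numRegs = %d\n",
           attr.binaryVersion, attr.ptxVersion, attr.numRegs);
    return 0;
}
[/code]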