How portable are compiled binaries?

_Big_Mac · January 19, 2009, 1:49pm

Can I write portable code that will use either arch 1.1 or 1.3, depending on the device’s compute capability, using the Runtime API?

Let’s assume I write a kernel that doesn’t use double precision or vote intrinsics, i.e. one that can be compiled for architecture 1.1 or even 1.0, but I’d like it to use the extra registers and better coalescing when run on a 1.3 device. Does the Runtime API support this? Can the compiler automagically build two kernels, link both (at the expense of a slightly bigger .exe file) and have the program decide at runtime which one to launch?

jack · January 19, 2009, 2:44pm

I remember reading something in the driver API section once about ‘fat binaries’ – the idea was to be able to pre-compile your kernels for different compute levels (1.0, 1.1, 1.3, etc.) and put them all in one file (the ‘fat binary’). When the host program was run, it would automatically choose the most compatible kernel from the binary and run it.

I don’t know if NVIDIA has released any tools to use this feature as of the release of CUDA 2.1 (the programming manual mentioned it), but it seems like a good feature to have, so maybe one day they will get around to it. Even if they haven’t released anything official yet, I wouldn’t think it would take someone more than an hour or so to make a little “packaging” program to import .cubins and package them all into the fat binary, and they could just post it up as one of the ‘tools’ available from the forum.
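
The selection rule jack describes (pick the most capable pre-compiled kernel that the device can still run) can be sketched in plain host-side C++. This is only a model of the idea, not the CUDA runtime's actual fat binary loader: `Capability`, `not_newer` and `pick_image` are hypothetical names made up for illustration, and it assumes a target runs only on devices of equal or higher compute capability.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of a compute capability such as 1.3.
struct Capability {
    int cc_major;
    int cc_minor;
};

// True when capability a is older than or equal to capability b.
inline bool not_newer(Capability a, Capability b) {
    return a.cc_major < b.cc_major ||
           (a.cc_major == b.cc_major && a.cc_minor <= b.cc_minor);
}

// Returns the index of the highest compiled target the device can run,
// or -1 when every compiled target is newer than the device.
inline int pick_image(const std::vector<Capability>& compiled, Capability device) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(compiled.size()); ++i) {
        if (!not_newer(compiled[i], device)) continue;  // too new for this device
        if (best < 0 || not_newer(compiled[best], compiled[i])) best = i;
    }
    return best;
}
```

With targets built for 1.0, 1.1 and 1.3, a 1.3 device would get the 1.3 image, while a 1.1 or 1.2 device would fall back to the 1.1 image.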

AndreiB · January 19, 2009, 3:03pm

AFAIK you don’t have to compile for sm_13 to use the softer coalescing rules and/or extra registers. This is done at the hardware level, so even code compiled for sm_10 will benefit from these hardware changes.

_Big_Mac · January 19, 2009, 3:33pm

Does that mean that superfluous registers spill to lmem at runtime and not at compilation? I thought you could judge how many registers have spilled by reading the .cubin.

True about coalescing though, I think I remember reading someone’s remark that the “naive” and “optimized” kernels from transpose example run equally fast on his GT 200, meaning coalescing is automatically optimized despite the example being compiled to 1.0.

E.D_Riedijk · January 19, 2009, 4:14pm

Spilling to lmem is done at ptxas compile time, so if you are spilling to local memory it is better to recompile with the sm_13 switch.

_Big_Mac · January 19, 2009, 4:47pm

Yes, but if I compiled with sm_13 (16k registers) and then ran on an sm_11 GPU, the binary would try to use non-existent registers. They wouldn’t just spill (if spilling is decided when the PTX is compiled), so the program would simply fail to run.

If assigning registers is done at compile time, using GT 200’s extra registers is not automatic at runtime for 1.1 builds (in contrast to coalescing). You’d need two compilations if you wanted to use extra registers and yet make the program runnable under old archs.

AndreiB · January 19, 2009, 6:33pm

If you compile for sm_13 it will not run on anything earlier (or at least it is not supposed to).

Physical register allocation is done at the ptxas level. More registers per MP means you can have more blocks per MP, for better occupancy.
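
To put rough numbers on the occupancy point, here is a minimal host-side C++ sketch of the register-limited block count per MP. It is deliberately simplified: it ignores register allocation granularity, shared memory usage, and the per-MP caps on resident threads and blocks, any of which can lower the real figure. `blocks_by_registers` and the two constants are hypothetical names. The register file sizes are 8192 per MP for compute capability 1.0/1.1 and 16384 for 1.2/1.3.

```cpp
#include <cassert>

// Register file size per multiprocessor (MP), in 32-bit registers.
const int kRegsPerMpSm11 = 8192;   // compute capability 1.0 / 1.1
const int kRegsPerMpSm13 = 16384;  // compute capability 1.2 / 1.3

// Upper bound on resident blocks per MP imposed by registers alone.
// Simplified: no allocation granularity, no shared memory limit,
// no per-MP thread or block caps.
inline int blocks_by_registers(int regs_per_mp, int regs_per_thread,
                               int threads_per_block) {
    assert(regs_per_thread > 0 && threads_per_block > 0);
    return regs_per_mp / (regs_per_thread * threads_per_block);
}
```

For a kernel using 16 registers per thread with 256-thread blocks, this bound is 2 blocks per MP with 8192 registers and 4 blocks per MP with 16384, even though the compiled code is identical.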

E.D_Riedijk · January 20, 2009, 3:58am

but, but non-sm_13 hardware is sooo 2008 ;)
