How portable are compiled binaries?

Can I write portable code that will use either arch 1.1 or 1.3, depending on the device’s compute capability, using the Runtime API?

Let’s assume I write a kernel that doesn’t use double precision or vote intrinsics, i.e. one that can be compiled for architecture 1.1 or even 1.0, but I’d like it to use the extra registers and better coalescing when run on a 1.3 device. Does the Runtime API support this? Can the compiler automagically build two kernels, link both (at the expense of a slightly bigger .exe file), and have the program decide at runtime which one to launch?
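To make the question concrete, here’s roughly what I imagine the manual version would look like on the host side (a sketch only – the kernel names are invented, and it assumes I maintain two variants by hand):

    // Hypothetical manual dispatch between two hand-built kernel variants.
    #include <cuda_runtime.h>

    __global__ void myKernel_sm10(float *data) { data[threadIdx.x] += 1.0f; }
    __global__ void myKernel_sm13(float *data) { data[threadIdx.x] += 1.0f; }

    void launchBest(float *d_data, int n)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query the active device

        dim3 block(256);
        dim3 grid((n + block.x - 1) / block.x);

        // Compute capability 1.3 or better: use the variant tuned for
        // the bigger register file; otherwise fall back to the 1.0 build.
        if (prop.major > 1 || (prop.major == 1 && prop.minor >= 3))
            myKernel_sm13<<<grid, block>>>(d_data);
        else
            myKernel_sm10<<<grid, block>>>(d_data);
    }

Obviously this gets clumsy fast, which is why I’m hoping the toolchain can do the two builds and the dispatch for me.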

I remember reading something in the driver API section once about ‘fat binaries’ – the idea was to pre-compile your kernels for different compute levels (1.0, 1.1, 1.3, etc.) and put them all in one file (the ‘fat binary’). When the host program ran, it would automatically choose the most compatible kernel from the binary and run it.

I don’t know if NVIDIA has released any tools to use this feature as of CUDA 2.1 (the programming manual mentions the feature), but it seems like a good one to have, so maybe one day they will get around to it. Even if nothing official is out yet, I wouldn’t think it would take someone more than an hour or so to write a little “packaging” program that imports .cubins and bundles them all into one fat binary, and they could just post it as one of the ‘tools’ available on the forum.
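Until something official shows up, a rough sketch of the manual approach with the driver API might look like this (the cubin file names, the kernel name, and the error handling are all illustrative):

    // Illustrative only: pick a pre-built .cubin by compute capability
    // and load it with the driver API.
    #include <cuda.h>
    #include <stdio.h>

    int main(void)
    {
        CUdevice   dev;
        CUcontext  ctx;
        CUmodule   mod;
        CUfunction fn;
        int major, minor;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuDeviceComputeCapability(&major, &minor, dev);
        cuCtxCreate(&ctx, 0, dev);

        // Fall back through the 'fat' set of cubins, newest first.
        const char *cubin =
              (major > 1 || (major == 1 && minor >= 3)) ? "kernel_sm13.cubin"
            : (major == 1 && minor >= 1)                ? "kernel_sm11.cubin"
                                                        : "kernel_sm10.cubin";
        if (cuModuleLoad(&mod, cubin) != CUDA_SUCCESS) {
            fprintf(stderr, "failed to load %s\n", cubin);
            return 1;
        }
        cuModuleGetFunction(&fn, mod, "myKernel");
        // ... set up parameters and launch fn as usual ...
        return 0;
    }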

AFAIK you don’t have to compile for sm_13 to get the relaxed coalescing rules and/or the extra registers. This is handled at the hardware level, so even code compiled for sm_10 will benefit from these hardware changes.

Does that mean that superfluous registers spill to lmem at runtime and not at compile time? I thought you could judge how many registers had spilled by reading the .cubin.
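For reference, the easy way I know to check is to have ptxas print its statistics at compile time – something like the following (the output format and numbers are from memory, so treat them as illustrative; a non-zero lmem figure is the spill / local-array usage):

    nvcc --ptxas-options=-v -c kernel.cu
    ptxas info    : Used 32 registers, 16+0 bytes lmem, 2084+32 bytes smem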

True about coalescing though – I think I remember someone remarking that the “naive” and “optimized” kernels from the transpose example run equally fast on his GT200, meaning coalescing is automatically optimized even though the example is compiled for 1.0.

Spilling to lmem is done at ptxas compile time, so if you are spilling to local memory it is better to recompile with the sm_13 switch.
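If in doubt, it’s easy to compare the two targets directly – build the same kernel both ways and diff the reported lmem (commands are illustrative; whether the spilling actually goes away depends on the kernel):

    nvcc -arch=sm_10 --ptxas-options=-v -c kernel.cu
    nvcc -arch=sm_13 --ptxas-options=-v -c kernel.cu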

Yes, but if I compiled with sm_13 (16k registers) and then ran on an sm_11 GPU, the binary would try to use non-existent registers. The surplus wouldn’t just spill (since spilling happens when the PTX is compiled), so the program would simply fail to run.

If register assignment is done at compile time, then using GT200’s extra registers is not automatic at runtime for 1.1 builds (in contrast to coalescing). You’d need two compilations if you wanted to use the extra registers and still keep the program runnable on older architectures.

If you compile for sm_13 it will not run on anything earlier (or at least it’s not supposed to).

Physical register allocation is done at the ptxas level. More registers per MP means you can have more blocks resident per MP, for better occupancy.
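A quick back-of-the-envelope example (the kernel numbers are invented) of why the bigger register file helps occupancy:

    Hypothetical kernel: 256 threads per block, 25 registers per thread
      registers per block            = 256 * 25    = 6400
      sm_11 (8192 registers per MP):   8192 / 6400  -> 1 resident block per MP
      sm_13 (16384 registers per MP): 16384 / 6400  -> 2 resident blocks per MP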

but, but non-sm_13 hardware is sooo 2008 ;)