Can't make ptxas generate efficient code

That’s interesting new information for me. Am I right that the Fermi SASS documentation is not public? I’d be really interested in reading it for the Fermi and Kepler architectures.

One can learn much from studying the SASS generated for various code across multiple GPU generations :-) I have been doing that for almost eight years now, so admittedly I have had a bit of a head start on most CUDA users. Even so, I am sometimes stumped.

If you are intrigued by the “short immediates” in double-precision instructions on sm_2x, simply try the literal constants 1.0, 1.5, 1.25, 1.125, … and check at what point the SASS disassembly provided by cuobjdump switches from an immediate operand that is part of the instruction to a separate MOV32I. For a comprehensive approach, repeat that experiment for all three source operands of a DFMA.
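In case it helps, here is a minimal sketch of how that experiment could be set up. The file name, the macro, and the kernel names are all made up for illustration, and the build lines assume a toolkit old enough to still accept sm_2x targets. Each kernel puts one literal into the addend of an fma(), so every DFMA is easy to locate in the disassembly.

```cuda
// probe.cu: one tiny kernel per literal constant (hypothetical names throughout).
//
// Build and disassemble (assumes a toolkit that still supports sm_2x):
//   nvcc -arch=sm_20 -cubin -o probe.cubin probe.cu
//   cuobjdump --dump-sass probe.cubin

// extern "C" keeps the kernel names unmangled in the cuobjdump output.
#define DEFINE_PROBE(name, literal)                                        \
    extern "C" __global__ void name(const double *a, const double *b,      \
                                    double *r)                             \
    {                                                                      \
        int i = blockIdx.x * blockDim.x + threadIdx.x;                     \
        /* the literal is the addend, i.e. the third source operand */     \
        r[i] = fma(a[i], b[i], literal);                                   \
    }

// Each literal is 1 + 2^-k, so it needs one more mantissa bit than the last.
DEFINE_PROBE(probe_1_0,         1.0)
DEFINE_PROBE(probe_1_5,         1.5)
DEFINE_PROBE(probe_1_25,        1.25)
DEFINE_PROBE(probe_1_125,       1.125)
DEFINE_PROBE(probe_1_0625,      1.0625)
DEFINE_PROBE(probe_1_03125,     1.03125)
DEFINE_PROBE(probe_1_015625,    1.015625)
DEFINE_PROBE(probe_1_0078125,   1.0078125)
DEFINE_PROBE(probe_1_00390625,  1.00390625)
DEFINE_PROBE(probe_1_001953125, 1.001953125)
```

In the disassembly, look at the DFMA in each kernel and note the first literal for which the constant is loaded by MOV32I instead of appearing as an operand of the DFMA itself. To cover the other two source operands, move the literal into the first or second argument of fma(); keep in mind that the compiler is free to swap the two multiplicands, so those two variants may well produce identical code.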

I’ve already verified your claim. But who knows how many other gotchas there are. That’s why I have a strong interest in reading such a paper.
This issue cost me a significant amount of time, since until the last moment I thought I had screwed up. I believe reading about such things in advance would make things easier in the future.

I am not aware of public documentation for SASS. My own knowledge has largely been acquired by working through lots of output from cuobjdump --dump-sass, usually when chasing code generation issues. Keep in mind that the various NVIDIA GPU generations are not binary compatible (thus the use of PTX as a portable intermediate assembly language, and fat binaries). So while much of SASS code is fairly self-explanatory when one has had exposure to assembly languages for other processors, in-depth knowledge of SASS details is of limited utility since it is very architecture-specific. I am certainly not aware of all details myself.

Customers should not feel compelled to root-cause suspected code generation issues by looking at SASS. If something seems seriously out of whack, in terms of either functionality or performance (such as high register pressure resulting in massive spilling, with an associated significant performance loss, where such behavior is not reasonably expected), self-contained repro code that demonstrates the problem is an excellent starting point for a bug report.