The tools mostly don’t work with Fermi, but for compute capability 1.x devices you can use something like cudasm (which comes with decuda) to assemble cubin files directly. That gets you much closer to the hardware than PTX does.
The biggest reason why “assembly” programming of CUDA devices is not very useful (unless you are writing a new compiler) is the lack of detailed hardware-level documentation. NVIDIA does not release (and is almost certainly never going to release) hardware documentation at the level of Intel’s and AMD’s architecture optimization guides. Without that information, you are going to have a very hard time doing better than the authors of nvcc and ptxas, who do have access to such internal documents. You are left either coding to your own preconceived notion of how the hardware works, or running a very large number of microbenchmarks to try to deduce the rules of the game, and those rules will change with the next generation of cards.
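To give a feel for what one of those microbenchmarks looks like, here is a minimal sketch that times a chain of dependent multiply-adds on a single thread using the device-side `clock()` counter. The kernel name `latency_probe` and the `CHAIN_LENGTH` constant are made up for illustration, and the result is only a rough estimate: the compiler is free to fuse or reschedule the arithmetic, so you would want to check the generated code before trusting the number.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical chain length; long enough to amortize the cost of reading the clock.
#define CHAIN_LENGTH 256

// Runs on one thread. Each multiply-add depends on the previous result,
// so the elapsed cycles approximate CHAIN_LENGTH times the dependent-issue latency.
__global__ void latency_probe(float seed, float *sink, unsigned int *cycles)
{
    float x = seed;
    clock_t start = clock();
#pragma unroll
    for (int i = 0; i < CHAIN_LENGTH; ++i) {
        x = x * seed + 1.0f;  // depends on the x from the previous iteration
    }
    clock_t stop = clock();
    *sink = x;  // store the result so the chain is not optimized away
    *cycles = (unsigned int)(stop - start);
}

int main()
{
    float *d_sink = NULL;
    unsigned int *d_cycles = NULL;
    unsigned int h_cycles = 0;

    if (cudaMalloc((void **)&d_sink, sizeof(float)) != cudaSuccess ||
        cudaMalloc((void **)&d_cycles, sizeof(unsigned int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    // The seed is a kernel argument so the compiler cannot fold the chain at compile time.
    latency_probe<<<1, 1>>>(1.0001f, d_sink, d_cycles);

    // cudaMemcpy waits for the kernel to finish and reports any launch error.
    cudaError_t err = cudaMemcpy(&h_cycles, d_cycles, sizeof(h_cycles),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%u cycles for %d dependent operations (about %.1f cycles each)\n",
           h_cycles, CHAIN_LENGTH, (double)h_cycles / CHAIN_LENGTH);

    cudaFree(d_sink);
    cudaFree(d_cycles);
    return 0;
}
```

This measures exactly one thing (dependent arithmetic latency for one thread) on one card. Register bank conflicts, dual issue, memory coalescing, and so on each need their own probes, and all of them have to be redone when the architecture changes.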