Openness about 'real' cubin instructions

I have a long-running problem with the obscurity of the cubin instructions. Cubin represents the reality of what is executed on the card, and without going into detail, doesn’t look much like ptx.

The obscurity of what really happens on G80 is a problem for designing and analyzing low-level CUDA code. I recently coded the same basic algorithm for G80 and Intel’s Core 2 Duo and the contrast between my understanding of what was happening on IA32 and G80 was astounding. On the Core 2 Duo, I could sit down with Intel’s optimization manuals and build a mental model of where my performance bottlenecks were, and, most of the time, see confirmation of that model by making small changes to my code. On G80 I’m almost completely in the dark. How many instructions issued per cycle? What sort of execution units exist if there’s multiple issue? Latency or throughput for a {add, shift, …} instruction? Who knows?

Not knowing these things is not just a matter of failing to get a couple percent here and there. Careful hand-engineering of my IA32 code ultimately bought a factor of 2 - much of which came from designing algorithms that matched what I knew about the low-level execution model on the C2D (for example, the exact latencies and throughputs of different operations).

In an ideal world, NVIDIA engineers would race off and furnish us with a magnificent “NVIDIA ISA Optimization Reference Manual” with complete tables of throughputs and latencies and nice little block diagrams and so on. I’d settle for re-enabling ‘–forcetext’ (which was accidentally documented in the 0.9 release, but not activated), which would at least allow us to read the actual code being executed on the GPU and make inferences about instruction selection, multiple issue, latencies, and so on. No one has to document anything or explain anything.

Note: I’m not asking for cubin to be fully documented or exposed as a compilation target. I fully understand the reason that ptx exists, and it makes a great deal of sense.

Some of the questions you ask have answers. E.g., NVIDIA reveals instruction throughput in section 5.1.1 of the Programming Guide, helpfully titled “Instruction Throughput.” Most instructions have a throughput of 1 instruction/clock. (Section 5.1.1 is REALLY confusing when it says a mad takes 4 cycles. It’s actually saying that a 32-thread warp requires 4 cycles on an 8-issue multiprocessor.) Instruction latency is a big issue for CPUs, but it almost doesn’t exist on GPUs because of hyperthreading (latency hiding). Throughput and latency are the two biggest problems on CPUs, so it’s nice to see they’re almost non-issues for GPUs.
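A back-of-envelope sketch of how to read section 5.1.1’s numbers (the warp size and SP count are the figures from the Programming Guide; the little model itself is just my interpretation, not anything NVIDIA documents):

```python
# Sketch: why "a mad takes 4 cycles" really means "a 32-thread warp of mads
# issues over 4 cycles on an 8-SP multiprocessor", i.e. 1 instr/clock per SP.
# WARP_SIZE and SPS_PER_MP are the G80 figures from the Programming Guide.

WARP_SIZE = 32        # threads per warp
SPS_PER_MP = 8        # scalar processors per multiprocessor

def warp_cycles(per_thread_throughput=1):
    """Cycles for one multiprocessor to issue one instruction for a whole warp."""
    return (WARP_SIZE // SPS_PER_MP) * per_thread_throughput

print(warp_cycles())  # 4 cycles per warp-instruction
```

So “4 cycles” is a statement about warp scheduling, not about a single thread’s instruction being slow.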

In general, though, you’re right. Nvidia tells us almost nothing else.

In particular, certain instructions can dual issue. Nvidia relies on this fact to inflate their peak GFLOPs figures 50%. AND THEN DOESN’T EVEN VAGUELY HINT TO US HOW TO ACHIEVE IT.

Umm… deceptive advertising suit, anyone?

To be fair, gpu manufacturers have been ultra-paranoid about their precious secrets. I’m just glad nvidia has told us about threads, bank conflicts, and cache sizes. Sigh. However, there are still important details about the multiprocessors that are missing. If you study ATI’s CTM, for example, you learn that there’s a semaphore system used with texture fetches that can cause stalling. You also learn that the ALUs can perform certain tricks that are almost like dual-issue. E.g., multiplying by powers of two, negating, and a few other things can all be done in a single instruction.

Like geoff said, I don’t need to program the dual-issue (although it’d be nice), BUT I DO NEED TO KNOW ABOUT IT.

I think the entire industry would be in the dock if you could be sued for taking the most optimistic view of FLOPS, MIPS, etc. I don’t think anyone takes these numbers entirely seriously and I’m not proposing that NVIDIA be the first to unilaterally disarm and post ‘realistic’ FLOPS numbers.

I have a number of reasons (some of which I can’t discuss here) to suspect that backend cubin instruction execution is at least marginally more complicated than 5.1.1 suggests, possibly very much so. I think that section is a good guide to the more obvious pitfalls (e.g. anyone can see that integer divide by a non-power of two generates a whole bunch of cubin instructions), but I’m still not sure we really know enough.
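On the integer-divide point: division by a power of two can be strength-reduced to a single shift, while division by an arbitrary constant needs a multi-instruction sequence (typically a multiply-by-magic-number plus shifts and fix-ups), which is why it expands into a whole bunch of cubin instructions. A quick host-side illustration of the shift equivalence (this demonstrates the arithmetic identity only, and says nothing about what ptxas actually emits):

```python
# For non-negative integers, x / 2**k is exactly a right shift by k, so a
# compiler can emit one shift instruction for a power-of-two divisor.
# A divisor like 7 has no such single-instruction form; compilers typically
# expand it into a multiply-by-reciprocal sequence instead.

def div_pow2(x, k):
    """Divide a non-negative integer x by 2**k using a single shift."""
    return x >> k

# The identity holds for any non-negative x:
for x in range(0, 1000, 37):
    assert div_pow2(x, 3) == x // 8
```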


nVidia people don’t seem to want to talk about this…
But again, they don’t seem to want to talk about whether they want to talk about this, either.
Even if they’re not allowed to talk about whether they can talk about what we want to know, can’t they even tell us that?
The silent treatment feels really creepy.

There are good reasons why PTX exists and we don’t expose the native G80 instruction set in CUDA.

Unlike on CPUs, the hardware ISA of the GPU can change radically from generation to generation. For example, between G7x and G8x we switched from a 4-vector SIMD machine to an entirely different multithreaded scalar architecture, but all existing shaders continued running without changes and with radically higher performance.

The abstraction is the price you pay for this kind of performance scaling.

That said, I do understand people’s frustration with not knowing all the low-level details. Over time I expect us to reveal more details in the documentation and provide better tools to help developers understand the performance bottlenecks.

In the meantime, if you have specific performance questions please post them here.

The sad thing is that the hardware review websites, which cater to geek curiosity rather than anything practical, are told more hardware details than we are!

E.g., take some snippets from Beyond3D here:

"NVIDIA’s documentation for G80 states that each SP is able to dual-issue a scalar MADD and MUL instruction per cycle, and retire the results from each once per cycle, for the completing instruction coming out of the end. The thing is, we couldn’t find the MUL, and we know another Belgian graphics analyst that’s having the same problem. No matter the dependant instruction window in the shader, the peak – and publically quoted by NVIDIA at Editor’s Day – MUL issue rate never appears during general shading.

We can push almost every other instruction through the hardware at close to peak rates, with minor bubbles or inefficiencies here and there, but dual issuing that MUL is proving difficult. It turns out that the MUL isn’t part of the SP ALU, rather it’s serial to the interpolator/SF hardware and comes after it when executing, leaving it (currently) for attribute interpolation and perspective correction."

“The threaded nature extends to data fetch and filtering too, the chip running fetch threads asynchronously from threads running on the clusters, allowing the hardware to hide fetch and filter latency as much as possible”

"Rather than a global sampler array, each cluster gets its own, reducing overall texturing performance per thread (one SP thread can’t use all of the sampler hardware, even if the other samplers are idle) but making the chip easier to build.

The sampler hardware per cluster runs in a separate clock domain to the SPs (a slower one), and with the chip supporting D3D10 and thus constant buffers as a data pool to fetch from, each sampler section has a bus to and from L1 and to and from dedicated constant buffer storage. Measured L1 size is seemingly 8KiB. "

“Clusters can pass data to each other, but in a read-only fashion via L2, or not without a memory traversal penalty to DRAM of some kind.”

“Special function ops (sin, cos, rcp, log, pow, etc) all seem to take 4 cycles (4 1-cycle loops, we bet) to execute and retire, performed outside of what you’d reasonably call the ‘main’ shading ALUs for the first time in a programmable NVIDIA graphics processor. Special function processing consumes available attribute interpolation horsepower, given the shared logic for processing in each case, NVIDIA seemingly happy to make the tradeoff between special function processing and interpolation rates. We covered that in a bit of detail previous, back on page 6. Each cluster then feeds into a level 2 cache (probably 128KiB in size) and then data is either sent back round for further processing, stored off somewhere in an intermediary surface or sent to the ROP for final pixel processing, depending on the application”

A global register file in addition to per-cluster registers? Data fetch running as a separate thread in a different clock domain? And, presumably, a semaphore system in place? A 128 KiB L2 cache? Dual-issued MUL instructions that CUDA seems incapable of realizing, but which are counted in the GFLOPs anyway?

I can just see NVIDIA executives sitting in a room and saying, “cuda developers are like mushrooms. you feed em crap and keep em in the dark.” ATI seems to take a different view… when is their reworked CTM library for r6xx coming out again?

That’s fine… Ok, so I’m programming PTX. Now please tell me how I can use my registers to hide data fetch costs from either shared memory or the texture cache. (The .cu compiler seems to assign each fetch a different register, but that’s unreasonable.) I am doing a fetch every other instruction. How can I cause efficient dual issue so that the ALUs can work 100% of the time? If the samplers work at 1/2 frequency, is my goal then to do a fetch from shared every two ALU ops? Why does the .cu compiler emit vec4 texture fetches and then throw out everything but the first component? How can I use vector fetches from shared or the texture cache to improve performance?

In short, I have no problem with the PTX abstraction. I just want to know why my hand-coded PTX runs 30% slower than my identical .cu code.

Hmm… maybe I get it.

The reason you don’t want to tell us how it works under the hood is because you’ll have to listen to people complain if you ever change it up.

Or, to put it another way: it would hinder your architectural freedom if every idea had to be weighed by “well, we could completely change that part and make it way cooler, but then what are we going to do about cuda compatibility and performance…”

I don’t know. I think revealing architectural details and letting developers optimize for them (either indirectly through .cu or directly through cubin assembly) would be fine as long as you put up big warnings and build a nice versioning system like the one you implement in the devcode repository. Remember, you could use the marketing wins, like “G80 achieves XXX gflops” or “carefully optimized code is finally appearing for G80.” For the ISVs, too, it gives a new purpose for existing: delivering yearly performance updates in step with your hardware.

Indeed… our cool demo once ran at x fps. When we found out about local memory’s performance the hard way, we got it to 2x fps. If we can have some more information about warp divergence and such, maybe we can get it to 3x fps, which sounds much better than 2x. And after we publish our work, you’ll be able to do some marketing with the numbers.

What information, beyond what’s already in the programming guide, would you like about divergence?

As far as local memory is concerned, I think we’ve been straightforward in stating that its performance is the same as global memory’s.


I have been waiting for months (as mentioned in the Murphy’s Law post) for info on divergence and, more importantly, convergence algorithms. This is in the same class as the bincode doco per this topic, and is what makes one feel like one is being treated like a mushroom by Nvidia.

I did extend my memory benchmark into local after asadafag brought this up a while ago, and found that local must be implemented as a base register per warp, as one does not get the same performance as contiguous device memory if there is any decent amount of local - the memory cycles for each warp end up in different memory pages, slowing local down to 1/2 the rate of fully coalesced 32-bit reads. Also, more than double the required device memory is eaten up by local (in my case, allocating 36 MB of local on a GTS used 128 MB of device memory). Sounds like the driver is broken. All this is stuff Nvidia won’t tell you.


ed: forgot you may not know the colloquialism: mushroom = kept in the dark & fed bulls**t

The thing is… the global memory performance itself isn’t quite clear. There is a “some 200 cycles” figure somewhere, and a “coalescing is faster” statement somewhere else. But I can’t find exactly how much faster coalescing is, and there isn’t a comparison between global memory and shared memory (which would be important for deciding whether to use local/shared memory/registers). The doc is currently formulated in a way that gives me the impression “global memory is slow.” Well, it may indeed be slow, but it seems it’s not that much slower than registers when the latency is well hidden. If this had been stated in the doc, I’d have had a month more to improve my algorithm.
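The “latency is well hidden” intuition can be put in a crude model (all numbers hedged: the ~200-cycle global latency and 4 issue cycles per warp are the commonly quoted G80 figures, and the model ignores bandwidth limits entirely):

```python
import math

# Rough model: while one warp waits ~LATENCY cycles on a global load, the
# multiprocessor can issue instructions from other warps, one warp-instruction
# every ISSUE cycles. With enough resident warps, the latency disappears.

LATENCY = 200   # cycles, rough G80 global-memory latency (commonly quoted)
ISSUE = 4       # cycles per warp-instruction on an 8-SP multiprocessor

def warps_to_hide(independent_instrs=1):
    """Resident warps needed so the MP never stalls, assuming each warp can
    issue `independent_instrs` instructions before needing its loaded value."""
    return math.ceil(LATENCY / (ISSUE * independent_instrs))

print(warps_to_hide())    # 50 warps if every instruction depends on a load
print(warps_to_hide(6))   # 9 warps (288 threads) with some independent work
```

Under those assumptions, global memory behaves almost like registers once occupancy is high enough, which is exactly the comparison the doc never spells out.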

It’s unfair to blame this on you, and I take back such remarks.

On the divergence question, I have a case where one extra, useless goto halves my kernel’s performance (and shortens the cubin by 2 dwords). I suspect it’s about divergence handling, or ptxas, since the ptx doesn’t look suspicious. The kernel is reasonably short, but it’s important enough that I can’t disclose it on my own. If I can get my boss to agree, and find a place to upload the 10M~20M of data required to test the performance, I’d send it to you.

I strongly suspect my other bottleneck kernel may suffer from similar problems, but I don’t know for sure. Experimenting requires rewrite of another long kernel and the corresponding CPU part, and is very tiring. That’s why I continue to bother you with this.

Not necessarily bulls**t. Psilocybe cubensis can grow on several other types of s**t too.

I am laughing at this explanation. What you are saying is simply absurd.

What if Intel said:

There are good reasons why SIMD classes and intrinsics exist, and we don’t expose the native x86 instruction set in Intel C/C++ Compiler.

You don’t need to expose it in CUDA. Expose it anywhere you want and let willing people exploit the maximum of the G80 architecture.

But I bet the plan is as follows:

  1. We abstract the native instructions thus reducing performance

  2. We provide “unified” platform

  3. We sell new versions of both hardware (and perhaps even software later)

In other words, if I could (with some clever low-level optimization) make some CUDA code finish processing in one second, I wouldn’t need to buy G92 to get there, would I?

Or maybe they already turned things upside-down in G92 and forgot what G80’s instruction set is themselves.
Whatever G92 is capable of, we can’t postpone our SIGGRAPH projects till November. And we’ll still buy it even if nVidia open-sources G80. Keeping people in the dark may just end with being beaten by Intel performance-wise, and a big humiliation for the entire GPGPU community.

OK, I normally like to ignore these sorts of posts, but seriously, you took a reasonable point (“openness is good”) and destroyed it with a heavy mixture of arrogance and conspiracy theories.

Keeping the low-level instruction set secret is not absurd. It is a decision which weighs the flexibility of being able to drastically change the underlying instruction set against the benefits to the developer community of understanding the low-level hardware. Intel doesn’t get to redesign the x86 instruction set every generation because developers would be angry. NVIDIA is trying to leave themselves some wiggle room to change the ISA without having to worry about breaking people’s low-level code.

Now, before you assume I’m defending NVIDIA here, I also think as a general principle that it would be helpful if more low-level docs were available. Keeping everything secret is probably a short-sighted strategy, especially given ATI’s bent towards opening things up.

But making crazy accusations like “optimization details are being withheld to drive future hardware sales” does not help your case. Sure it is possible that is a conscious strategy, but without any proof of that, you are just poisoning the discussion.

I think what levicki said that was absurd was that nvidia wants cuda code to run slow so that one has a reason to buy faster hardware later. His statement was absurd, but don’t pull a trick and use it to boost an argument against a rather different statement. It’s a very low style of arguing.

Actually, YOUR statement makes perfect sense. If people code/optimize to the G80 instruction set, they won’t want to move to G92.

Hmm… I think this really must be what nvidia is thinking. It could be a very convincing argument. Except… gpus get so much faster with each generation that I think unoptimized ‘cu’ code on G92 would handily beat hand-tuned assembly on G80. People would still upgrade, and the software developers would follow suit with updated code. Nvidia, do remember that there are also benefits to having a community of carefully tuned libraries and a community of software developers who get paid every time you refresh.

But in the end, yeah, it all comes down to the equation (speedup of the new generation of hardware)-(slowdown caused by reverting to .cu/.ptx). If that figure comes out negative or even close to zero, it’ll be bad for everyone. Perversely, the more useful hand-tuned cubins are, the more reason that we can’t have them.

I don’t believe the G92 instruction set will be that much different from G80’s; sure, some instructions will have changed to introduce new features, but the overall idea won’t be completely different. Such a redesign would be absurd, very expensive, and would take a long time (as long as designing G80 took in the first place, which is a few years).

But I’m sure NVidia has reasons to withhold the murky details from us. We’ll never know why, unless the Nouveau folks figure out G80 shaders.

I do agree Intel will have a huge advantage if they come with a CUDA-like architecture with similar performance and do publish all the low level optimization details, like usually. Maybe NVidia will be pushed then…

I can provide some insight into the ‘real’ cubin instructions. By seeing what compiles to what, I wrote a disassembler for the NVIDIA CUDA binary (.cubin) format. It provides insight into the internal instructions generated for the G8x architecture.

If you’re interested you can download it here:

I mainly made this out of curiosity about how a modern graphics card’s shader assembly works. It turns out it’s nearly a fully fledged CPU. Anyway, I hope this helps with finding real clock times, real optimization techniques, etc :)

Cheers! Maybe finally we can solve a mysterious 2x slowdown…
I haven’t used Python before, though, so it may take some work for me to get it running.
One of my kernels had one reloc section, corresponding to a globally defined device variable, that it can’t handle:
reloc {
name = ###CENSORD###
segname = reloc
segnum = 14
offset = 0
bytes = 16