Openness about 'real' cubin instructions

I have a long-running problem with the obscurity of the cubin instructions. Cubin represents the reality of what is executed on the card, and without going into detail, doesn’t look much like ptx.

The obscurity of what really happens on G80 is a problem for designing and analyzing low-level CUDA code. I recently coded the same basic algorithm for G80 and Intel’s Core 2 Duo and the contrast between my understanding of what was happening on IA32 and G80 was astounding. On the Core 2 Duo, I could sit down with Intel’s optimization manuals and build a mental model of where my performance bottlenecks were, and, most of the time, see confirmation of that model by making small changes to my code. On G80 I’m almost completely in the dark. How many instructions issued per cycle? What sort of execution units exist if there’s multiple issue? Latency or throughput for a {add, shift, …} instruction? Who knows?

Not knowing these things is not just a matter of failing to get a couple percent here and there. Careful hand-engineering of my IA32 code ultimately bought a factor of 2 - much of which came from designing algorithms that matched what I knew about the low-level execution model on the C2D (for example, the exact latencies and throughputs of different operations).

In an ideal world, NVIDIA engineers would race off and furnish us with a magnificent “NVIDIA ISA Optimization Reference Manual” with complete tables of throughputs and latencies and nice little block diagrams and so on. I’d settle for re-enabling ‘–forcetext’ (which was accidentally documented in the 0.9 release, but not activated), which would at least allow us to read the actual code being executed on the GPU and make inferences about instruction selection, multiple issue, latencies, and so on. No one has to document anything or explain anything.

Note: I’m not asking for cubin to be fully documented or exposed as a compilation target. I fully understand the reason that ptx exists, and it makes a great deal of sense.

Some of the questions you ask have answers. E.g., NVIDIA reveals instruction throughput in section 5.1.1 of the Programming Guide, helpfully titled “Instruction Throughput.” Most instructions have a throughput of 1 instruction/clock. (Section 5.1.1 is REALLY confusing when it says a mad takes 4 cycles. It’s actually saying that a 32-thread warp requires 4 cycles on an 8-issue multiprocessor.) Instruction latency is a big issue for CPUs, but it almost doesn’t exist on GPUs because of hyperthreading (latency hiding). Throughput and latency are the two biggest problems on CPUs, so it’s nice to see they’re almost non-issues for GPUs.
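A back-of-envelope sketch of how to read section 5.1.1’s numbers (the warp size and SP count are the figures from the Programming Guide; the little model itself is just my interpretation, not anything NVIDIA documents):

```python
# Sketch: why "a mad takes 4 cycles" really means "a 32-thread warp of mads
# issues over 4 cycles on an 8-SP multiprocessor", i.e. 1 instr/clock per SP.
# WARP_SIZE and SPS_PER_MP are the G80 figures from the Programming Guide.

WARP_SIZE = 32        # threads per warp
SPS_PER_MP = 8        # scalar processors per multiprocessor

def warp_cycles(per_thread_throughput=1):
    """Cycles for one multiprocessor to issue one instruction for a whole warp."""
    return (WARP_SIZE // SPS_PER_MP) * per_thread_throughput

print(warp_cycles())  # 4 cycles per warp-instruction
```

So “4 cycles” is a statement about warp scheduling, not about a single thread’s instruction being slow.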

In general, though, you’re right. Nvidia tells us almost nothing else.

In particular, certain instructions can dual issue. Nvidia relies on this fact to inflate their peak GFLOPs figures 50%. AND THEN DOESN’T EVEN VAGUELY HINT TO US HOW TO ACHIEVE IT.

Umm… deceptive advertising suit, anyone?

To be fair, gpu manufacturers have been ultra-paranoid about their precious secrets. I’m just glad nvidia has told us about threads, bank conflicts, and cache sizes. Sigh. However, there are still important details about the multiprocessors that are missing. If you study ATI’s CTM, for example, you learn that there’s a semaphore system used with texture fetches that can cause stalling. You also learn that the ALUs can perform certain tricks that are almost like dual-issue. E.g., multiplying by powers of two, negating, and a few other things can all be done in a single instruction.

Like geoff said, I don’t need to program the dual-issue (although it’d be nice), BUT I DO NEED TO KNOW ABOUT IT.

I think the entire industry would be in the dock if you could be sued for taking the most optimistic view of FLOPS, MIPS, etc. I don’t think anyone takes these numbers entirely seriously and I’m not proposing that NVIDIA be the first to unilaterally disarm and post ‘realistic’ FLOPS numbers.

I have a number of reasons (some of which I can’t discuss here) to suspect that backend cubin instruction execution is at least marginally more complicated than 5.1.1 suggests, possibly very much so. I think that section is a good guide to the more obvious pitfalls (e.g. anyone can see that integer divide by a non-power of two generates a whole bunch of cubin instructions), but I’m still not sure we really know enough.
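On the integer-divide point: division by a power of two can be strength-reduced to a single shift, while division by an arbitrary constant needs a multi-instruction sequence (typically a multiply-by-magic-number plus shifts and fix-ups), which is why it expands into a whole bunch of cubin instructions. A quick host-side illustration of the shift equivalence (this demonstrates the arithmetic identity only, and says nothing about what ptxas actually emits):

```python
# For non-negative integers, x / 2**k is exactly a right shift by k, so a
# compiler can emit one shift instruction for a power-of-two divisor.
# A divisor like 7 has no such single-instruction form; compilers typically
# expand it into a multiply-by-reciprocal sequence instead.

def div_pow2(x, k):
    """Divide a non-negative integer x by 2**k using a single shift."""
    return x >> k

# The identity holds for any non-negative x:
for x in range(0, 1000, 37):
    assert div_pow2(x, 3) == x // 8
```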


nVidia people don’t seem to want to talk about this…
But again, they don’t seem to want to talk about whether they want to talk about this, either.
Even if they’re not allowed to talk about whether they can talk about what we want to know, can’t they even tell us that?
The silent treatment feels really creepy.

There are good reasons why PTX exists and we don’t expose the native G80 instruction set in CUDA.

Unlike on CPUs, the hardware ISA of the GPU can change radically from generation to generation. For example, between G7x and G8x we switched from a 4-vector SIMD machine to an entirely different multithreaded scalar architecture, but all existing shaders continued running without changes and with radically higher performance.

The abstraction is the price you pay for this kind of performance scaling.

That said, I do understand people’s frustration with not knowing all the low-level details. Over time I expect us to reveal more details in the documentation and provide better tools to help developers understand the performance bottlenecks.

In the meantime, if you have specific performance questions please post them here.

The sad thing is that the hardware review websites, which cater to geek curiosity rather than anything practical, are told more hardware details than we are!

E.g., take some snippets from Beyond3D here:

"NVIDIA’s documentation for G80 states that each SP is able to dual-issue a scalar MADD and MUL instruction per cycle, and retire the results from each once per cycle, for the completing instruction coming out of the end. The thing is, we couldn’t find the MUL, and we know another Belgian graphics analyst that’s having the same problem. No matter the dependant instruction window in the shader, the peak – and publically quoted by NVIDIA at Editor’s Day – MUL issue rate never appears during general shading.

We can push almost every other instruction through the hardware at close to peak rates, with minor bubbles or inefficiencies here and there, but dual issuing that MUL is proving difficult. It turns out that the MUL isn’t part of the SP ALU, rather it’s serial to the interpolator/SF hardware and comes after it when executing, leaving it (currently) for attribute interpolation and perspective correction."

“The threaded nature extends to data fetch and filtering too, the chip running fetch threads asynchronously from threads running on the clusters, allowing the hardware to hide fetch and filter latency as much as possible”

"Rather than a global sampler array, each cluster gets its own, reducing overall texturing performance per thread (one SP thread can’t use all of the sampler hardware, even if the other samplers are idle) but making the chip easier to build.

The sampler hardware per cluster runs in a separate clock domain to the SPs (a slower one), and with the chip supporting D3D10 and thus constant buffers as a data pool to fetch from, each sampler section has a bus to and from L1 and to and from dedicated constant buffer storage. Measured L1 size is seemingly 8KiB. "

“Clusters can pass data to each other, but in a read-only fashion via L2, or not without a memory traversal penalty to DRAM of some kind.”

“Special function ops (sin, cos, rcp, log, pow, etc) all seem to take 4 cycles (4 1-cycle loops, we bet) to execute and retire, performed outside of what you’d reasonably call the ‘main’ shading ALUs for the first time in a programmable NVIDIA graphics processor. Special function processing consumes available attribute interpolation horsepower, given the shared logic for processing in each case, NVIDIA seemingly happy to make the tradeoff between special function processing and interpolation rates. We covered that in a bit of detail previous, back on page 6. Each cluster then feeds into a level 2 cache (probably 128KiB in size) and then data is either sent back round for further processing, stored off somewhere in an intermediary surface or sent to the ROP for final pixel processing, depending on the application”

A global register file in addition to per-cluster registers? Data fetch running as a separate thread in a different clock domain? And, presumably, a semaphore system in place? A 128 KiB L2 cache? Dual-issued MUL instructions that CUDA seems incapable of realizing, but which are counted in the GFLOPs anyway?

I can just see NVIDIA executives sitting in a room and saying, “cuda developers are like mushrooms. you feed em crap and keep em in the dark.” ATI seems to take a different view… when is their reworked CTM library for r6xx coming out again?

That’s fine… Ok, so I’m programming PTX. Now please tell me how I can use my registers to hide data fetch costs from either shared memory or the texture cache. (The .cu compiler seems to assign each fetch a different register, but that’s unreasonable.) I am doing a fetch every other instruction. How can I cause efficient dual issue so that the ALUs can work 100% of the time? If the samplers work at 1/2 frequency, is my goal then to do a fetch from shared every two ALU ops? Why does the .cu compiler emit vec4 texture fetches and then throw out everything but the first component? How can I use vector fetches from shared or the texture cache to improve performance?

In short, I have no problem with the PTX abstraction. I just want to know why my hand-coded PTX runs 30% slower than my identical .cu code.

Hmm… maybe I get it.

The reason you don’t want to tell us how it works under the hood is because you’ll have to listen to people complain if you ever change it up.

Or, to put it another way: it would hinder your architectural freedom if every idea had to be weighed by “well, we could completely change that part and make it way cooler, but then what are we going to do about cuda compatibility and performance…”

I don’t know. I think revealing architectural details and letting developers optimize for them (either indirectly through .cu or directly through cubin assembly) would be fine as long as you put up big warnings and build a nice versioning system like the one you implement in the devcode repository. Remember, you could use the marketing wins, like “G80 achieves XXX gflops” or “carefully optimized code is finally appearing for G80.” For the ISVs, too, it gives a new purpose for existing: delivering yearly performance updates in step with your hardware.

Indeed… our cool demo once ran at x fps. When we found out about local memory’s performance the hard way, we got it to 2x fps. If we can have some more information about warp divergence and such, maybe we can get it to 3x fps, which sounds much better than 2x. And after we publish our work, you’ll be able to do some marketing with the numbers.

What information, beyond what’s already in the programming guide, would you like about divergence?

As far as local memory is concerned, I think we’ve been straightforward in stating that its performance is the same as global memory’s.


I have been waiting for months (as mentioned in the Murphy’s Law post) for info on divergence and, more importantly, convergence algorithms. This is in the same class as the bincode doco per this topic, and is what makes one feel like one is being treated like a mushroom by Nvidia.

I did extend my memory benchmark into local after asadafag brought this up a while ago, and found that local must be implemented as a base register per warp, as one does not get the same performance as contiguous device memory if there is any decent amount of local - the memory cycles for each warp end up in different memory pages, slowing local down to 1/2 the rate of fully coalesced 32-bit reads. Also, more than double the required device memory is eaten up by local (in my case, allocating 36 MB of local on a GTS used 128 MB of device memory). Sounds like the driver is broken. All this is stuff Nvidia won’t tell you.


ed: forgot you may not know the colloquialism: mushroom = kept in the dark & fed bulls**t

The thing is… the global memory performance itself isn’t quite clear. There is a “some 200 cycles” figure somewhere, and a “coalescing is faster” statement somewhere else. But I can’t find exactly how much faster coalescing is, and there isn’t a comparison between global memory and shared memory (which would be important for deciding whether to use local/shared memory/registers). The doc is currently formulated in a way that gives me the impression “global memory is slow.” Well, it may indeed be slow, but it seems it’s not that much slower than registers when the latency is well hidden. If this had been stated in the doc, I’d have had a month more to improve my algorithm.
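The “latency is well hidden” intuition can be put in a crude model (all numbers hedged: the ~200-cycle global latency and 4 issue cycles per warp are the commonly quoted G80 figures, and the model ignores bandwidth limits entirely):

```python
import math

# Rough model: while one warp waits ~LATENCY cycles on a global load, the
# multiprocessor can issue instructions from other warps, one warp-instruction
# every ISSUE cycles. With enough resident warps, the latency disappears.

LATENCY = 200   # cycles, rough G80 global-memory latency (commonly quoted)
ISSUE = 4       # cycles per warp-instruction on an 8-SP multiprocessor

def warps_to_hide(independent_instrs=1):
    """Resident warps needed so the MP never stalls, assuming each warp can
    issue `independent_instrs` instructions before needing its loaded value."""
    return math.ceil(LATENCY / (ISSUE * independent_instrs))

print(warps_to_hide())    # 50 warps if every instruction depends on a load
print(warps_to_hide(6))   # 9 warps (288 threads) with some independent work
```

Under those assumptions, global memory behaves almost like registers once occupancy is high enough, which is exactly the comparison the doc never spells out.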

It’s unfair to blame this on you, and I take back such remarks.

On the divergence question, I have a case where one extra, useless goto halves my kernel’s performance (and shortens the cubin by 2 dwords). I suspect it’s about divergence handling, or ptxas, since the ptx doesn’t look suspicious. The kernel is reasonably short, but it’s important enough that I can’t disclose it on my own. If I can get my boss to agree, and find a place to upload the 10M~20M of data required to test the performance, I’d send it to you.

I strongly suspect my other bottleneck kernel may suffer from similar problems, but I don’t know for sure. Experimenting requires rewrite of another long kernel and the corresponding CPU part, and is very tiring. That’s why I continue to bother you with this.

Not necessarily bulls**t. Psilocybe cubensis can grow on several other types of s**t too.

I am laughing at this explanation. What you are saying is simply absurd.

What if Intel said:

There are good reasons why SIMD classes and intrinsics exist, and we don’t expose the native x86 instruction set in Intel C/C++ Compiler.

You don’t need to expose it in CUDA. Expose it anywhere you want and let willing people exploit the maximum of the G80 architecture.

But I bet the plan is as follows:

  1. We abstract the native instructions thus reducing performance

  2. We provide “unified” platform

  3. We sell new versions of both hardware (and perhaps even software later)

In other words, if I could (with some clever low-level optimization) make some CUDA code finish processing in one second, I wouldn’t need to buy G92 to get there, would I?

Or maybe they already turned things upside-down in G92 and forgot what G80’s instruction set is themselves.
Whatever G92 is capable of, we can’t postpone our SIGGRAPH projects till November. And we’ll still buy it even if nVidia open-sources G80. Keeping people in the dark may just end with being beaten by Intel performance-wise, and a big humiliation for the entire GPGPU community.

OK, I normally like to ignore these sorts of posts, but seriously, you took a reasonable point (“openness is good”) and destroyed it with a heavy mixture of arrogance and conspiracy theories.

Keeping the low-level instruction set secret is not absurd. It is a decision which weighs the flexibility of being able to drastically change the underlying instruction set against the benefits to the developer community of understanding the low-level hardware. Intel doesn’t get to redesign the x86 instruction set every generation because developers would be angry. NVIDIA is trying to leave themselves some wiggle room to change the ISA without having to worry about breaking people’s low-level code.

Now, before you assume I’m defending NVIDIA here, I also think as a general principle that it would be helpful if more low-level docs were available. Keeping everything secret is probably a short-sighted strategy, especially given ATI’s bent towards opening things up.

But making crazy accusations like “optimization details are being withheld to drive future hardware sales” does not help your case. Sure it is possible that is a conscious strategy, but without any proof of that, you are just poisoning the discussion.

I think what levicki said that was absurd was that nvidia wants cuda code to run slow so that one has a reason to buy faster hardware later. His statement was absurd, but don’t pull a trick and use it to boost an argument against a rather different statement. It’s a very low style of arguing.

Actually, YOUR statement makes perfect sense. If people code/optimize to the G80 instruction set, they won’t want to move to G92.

Hmm… I think this really must be what nvidia is thinking. It could be a very convincing argument. Except… gpus get so much faster with each generation that I think unoptimized ‘cu’ code on G92 would handily beat hand-tuned assembly on G80. People would still upgrade, and the software developers would follow suit with updated code. Nvidia, do remember that there are also benefits to having a community of carefully tuned libraries and a community of software developers who get paid every time you refresh.

But in the end, yeah, it all comes down to the equation (speedup of the new generation of hardware)-(slowdown caused by reverting to .cu/.ptx). If that figure comes out negative or even close to zero, it’ll be bad for everyone. Perversely, the more useful hand-tuned cubins are, the more reason that we can’t have them.

I don’t believe the G92 instruction set will be that much different from G80’s; sure, some instructions will have changed to introduce new features, but the overall idea won’t be completely different. Such a redesign would be absurd, very expensive, and would take a long time (as long as designing G80 took in the first place, which is a few years).

But I’m sure NVidia has reasons to withhold the murky details from us. We’ll never know why, unless the Nouveau folks figure out G80 shaders.

I do agree Intel will have a huge advantage if they come with a CUDA-like architecture with similar performance and do publish all the low level optimization details, like usually. Maybe NVidia will be pushed then…

I can provide some insight into the ‘real’ cubin instructions. By seeing what compiles to what, I wrote a disassembler for the NVIDIA CUDA binary (.cubin) format. It provides insight into the internal instructions generated for the G8x architecture.

If you’re interested you can download it here:

I mainly made this out of curiosity about how a modern graphics card’s shader assembly works. It turns out it’s nearly a fully fledged CPU. Anyway, I hope this helps with finding real clock times, real optimization techniques, etc :)

Cheers! Maybe finally we can solve a mysterious 2x slowdown…
I haven’t used Python before, though, so it may take some work for me to get it running.
One of my kernels had one reloc section, corresponding to a globally defined device variable, that it can’t handle:
reloc {
name = ###CENSORD###
segname = reloc
segnum = 14
offset = 0
bytes = 16