NVIDIA Instruction Set Architecture - Documentation (Maxwell)

Hello all,

I am planning to develop a JIT compiler as a project that automatically performs GPU optimizations.
As an example, I would like to try this on a graphics card with the Maxwell instruction set architecture, without going through any intermediate step such as PTX.
Via the CUDA Driver API I can already allocate memory for and execute the self-emitted code.
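
To be concrete, the load-and-launch path I mean looks roughly like the sketch below. The kernel name "my_kernel" and the cubin bytes are just placeholders for whatever my JIT will emit; error checking is omitted.

    #include <cuda.h>

    // Rough sketch of the load-and-launch path (placeholder kernel name
    // "my_kernel"; the cubin bytes come from the JIT; error checking omitted).
    CUresult run_emitted_kernel(const void* cubin_image, void** kernel_args) {
        CUdevice dev;
        CUcontext ctx;
        CUmodule mod;
        CUfunction fn;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        cuModuleLoadData(&mod, cubin_image);         // accepts a cubin image directly from memory
        cuModuleGetFunction(&fn, mod, "my_kernel");  // placeholder kernel name

        CUresult res = cuLaunchKernel(fn,
                                      1, 1, 1,       // grid dimensions
                                      32, 1, 1,      // block dimensions
                                      0, nullptr,    // shared memory bytes, stream
                                      kernel_args, nullptr);
        cuCtxSynchronize();

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return res;
    }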

Unfortunately, your documentation does not give me any insight into how the individual opcodes of the instructions are encoded. The only insight I have is which (public) instructions exist, from CUDA Binary Utilities :: CUDA Toolkit Documentation.

Do any of you have documentation on this ISA, or have you possibly reverse-engineered it using the CUDA binary utilities?

I am basically interested in any information about the Maxwell ISA, but would be happy to see information on other ISAs as well.

Best regards
Robin Lindner

NVIDIA has no intention of documenting that. It is intentionally undocumented.

If you’d like to see something possibly similar or related, you could study the maxas project by Scott Gray. There was a considerable amount of reverse engineering involved, at least in part due to “lack” of documentation.

Then the legitimate question is: why is this intentionally undocumented?

Sure, NVIDIA wants to make money with its products, but NVIDIA is also interested in increasing the real-world performance of those products, right?

My point is that a compiler can of course optimize better for an individual ISA than for a general ISA like PTX (LLVM has the NVPTX backend, for example).
Also, the extra level of abstraction costs compile time (first to PTX, then NVIDIA's internal tools do the architecture-specific assembly).

I cannot speak to all the reasons. The reasons I know of, which explain various “gaps” in NVIDIA documentation, are that by leaving things undocumented:

  1. You make it difficult for people to build rational dependencies on undocumented behavior. This gives NVIDIA more flexibility to change undocumented behavior, should the need arise. This is a useful feature.
  2. It is generally easier to adapt or change things in the future.

NVIDIA has intentionally chosen to primarily present the GPU in CUDA via a virtual architecture. The PTX language is the language of this architecture. PTX doesn’t run on any real architecture, directly. It must be converted to a compatible SASS.

By documenting the PTX, but not the SASS, NVIDIA allows for additional flexibility (for NVIDIA, at least) in creating a compatibility model for code generation and maintenance, as well as more flexibility in developing new features and deploying them in new architectures.
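
For instance, when an application ships only PTX, the driver performs exactly that PTX-to-SASS conversion at module-load time. A minimal sketch, assuming a PTX string produced elsewhere and a current CUDA context (error checking omitted):

    #include <cuda.h>
    #include <cstddef>

    // The driver's JIT turns PTX into SASS for whatever GPU is present.
    // "ptx_source" is a placeholder for PTX text generated elsewhere;
    // a current CUDA context is assumed (see the driver API sketch above).
    CUmodule jit_ptx_to_sass(const char* ptx_source) {
        char error_log[8192];
        CUjit_option opts[] = { CU_JIT_ERROR_LOG_BUFFER,
                                CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES,
                                CU_JIT_OPTIMIZATION_LEVEL };
        void* vals[] = { error_log,
                         (void*)(size_t)sizeof(error_log),
                         (void*)(size_t)4 };   // 0..4, trades compile time for SASS quality

        CUmodule mod;
        // cuModuleLoadDataEx invokes the PTX JIT; what actually executes is SASS.
        cuModuleLoadDataEx(&mod, ptx_source, 3, opts, vals);
        return mod;
    }

That JIT step is also where the compile-time cost discussed in this thread shows up for applications that ship only PTX.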

I’m sure there are other reasons. I don’t know them all.

sure.

There’s no reason a compiler has to stop optimizing at the PTX level. In fact, the ptxas tool is really an optimizing compiler. You should divorce yourself from the notion that PTX is any sort of meaningful destination or meaningful device code. It is not. It is merely an intermediate representation. Modern compilers often use IRs as they transition between phases of compilation.

Theoretically, at least, this doesn’t have to be the case. There is no requirement to generate PTX unless your compile specification calls for embedding PTX in the fatbinary (or otherwise emitting PTX as a deliverable, e.g. for driver API consumption). There is no (technical or design) reason a compiler cannot compile directly from the desired source format to SASS. (I realize that it is difficult or impossible for an arbitrary party to create such a compiler…)

On these last 2 points, I understand that by expecting the compiler community to use PTX as an interface, that imposes a hand-off which may have implications. To the extent that that increases compile time in practice or introduces other inefficiencies, I assume that NVIDIA CUDA developers have thought through this, weighed the tradeoffs, and felt that the current architecture’s benefits outweigh the drawback(s).

You make it difficult for people to build rational dependencies on undocumented behavior. This gives NVIDIA more flexibility to change undocumented behavior, should the need arise. This is a useful feature.

That’s right. The Maxwell architecture in my example is an older architecture that I don’t expect to change.
Nevertheless, I would find documentation of the Maxwell architecture quite appropriate. Of course, there could be a disclaimer telling the developer that this can change at any time and should not be relied upon, precisely to prevent such dependencies.

There’s no reason a compiler has to stop optimizing at the PTX level. In fact, the ptxas tool is really an optimizing compiler. You should divorce yourself from the notion that PTX is any sort of meaningful destination or meaningful device code. It is not. It is merely an intermediate representation. Modern compilers often use IRs as they transition between phases of compilation.

Right. I can imagine that the PTX compiler optimizes quite a bit, but it doesn’t have all the background information that the compiler generating the PTX has.
That is the shortcoming I see.

On these last 2 points, I understand that by expecting the compiler community to use PTX as an interface, that imposes a hand-off which may have implications. To the extent that that increases compile time in practice or introduces other inefficiencies, I assume that NVIDIA CUDA developers have thought through this, weighed the tradeoffs, and felt that the current architecture’s benefits outweigh that drawback.

For an AOT-compiled application that is fine. In my example it is a JIT compiler, which also has advantages over AOT, and there the compile time must be correspondingly fast.

The numba developers managed to create a fairly quick JIT compiler within the framework I described. numba cuda has seen considerable uptake.

I don’t think the current situation is likely to change, whether I can describe it well or not.

Check out: GitHub - 0xD0GF00D/DocumentSASS: Unofficial description of the CUDA assembly (SASS) instruction sets.

I made a semi-functional compiler from this: GitHub - sebftw/OmniSASSembler: An assembler for all CUDA SASS instructions (one day hopefully). I mean, it works, but I built it mostly using breakpoints and such. And it’s quite unoptimized. It’s the first parser I made.