Cubin assembler is now available: decuda 0.4.0 released

Hi,

I’d like to announce that the most recent version of decuda, my disassembler for .cubin files for the G8x/G9x architectures, now includes an assembler. It allows writing and optimizing code specifically for the G8x and G9x series, and completes the independent toolchain for this hardware. The assembler takes a text file with assembly instructions as input and produces a .cubin file as output.

The entire instruction set is not supported yet, but it is already capable of assembling working CUDA kernels, including predication and flow control constructs.

Also, decuda can export in a format that should be reparsable by cudasm, so it is possible to make changes to the code produced by nvcc and reassemble it.
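For anyone who wants to try that round trip, a minimal test kernel is enough to start with (a toy example of my own, not something shipped with decuda). The idea, roughly: compile it to a .cubin with nvcc, disassemble with decuda’s reparsable output mode, edit the assembly, and feed the result back through cudasm; exact command-line flags may differ between versions.

__global__ void saxpy(float *y, const float *x, float a, int n)
{
    // One thread per element; small enough that the disassembly stays readable.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}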

The software can be found here: http://www.cs.rug.nl/~wladimir/decuda/

The assembler is still in a beta stage and barely documented, so let me know if you have any questions about how to use it, or if you find any nasty bugs.

Wow! That’s pretty cool. Thanks for making it available.

Thank you. Really appreciate it!

Been away for a while. Came back to find this…

holy @#$#@it

How did you manage to decode the cubins? And is there anything you learned about how it’s different from ptx that will help to write optimized code?

I have quite a bit of assembler experience with various architectures (including manually typing in Z80 code, looking up the opcodes in a book :) ), but it was still a lot of work to identify all the bits and pieces. Basically, I compiled a large number of small test kernels and looked at the differences until I was able to isolate individual instructions. Next, I needed to find out what the opcodes and operands were… and so on.

Most of the time I spend on this project currently goes into finishing the assembler; finding out how to use it optimally is the next step :)
For example, analyzing the timings of individual instructions still needs to be done. I have succeeded in decreasing the number of instructions and registers used in some cases. In other cases, I was able to move instructions outside loops for some speed gain.
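To give a (made-up) source-level picture of that last kind of rewrite, it corresponds to hoisting loop-invariant address arithmetic like this out of the loop body; done by hand in the assembler, it removes the corresponding mul/add work from the inner loop:

__global__ void scale_rows(float *out, const float *in, int width, float alpha)
{
    // Row base addresses are computed once per thread instead of on every
    // iteration; this is the kind of work that can be moved outside the loop.
    const float *src = in + blockIdx.x * width;
    float *dst = out + blockIdx.x * width;
    for (int col = threadIdx.x; col < width; col += blockDim.x)
        dst[col] = alpha * src[col];
}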

Wow, this did exactly what it said it would. I was able to see why the 1.1 SDK is giving me worse performance.

How is the rest of your toolchain coming? Do you have a .cu compiler that can take inline assembly? Or a way of using your kernels with NVIDIA’s other tools?

Also, to help expand your decoder, take note of the following:

There was one type of instruction, a mov.f32 from shared memory with a non-zero offset, that didn’t get decoded:

0002b0: 14018079 4400c780 op.12 // (unk0 04018078)// (unk1 0400c000)
0002c8: 14030079 4400c780 op.12 // (unk0 04030078)// (unk1 0400c000)
0002e0: 14048079 4400c780 op.12 // (unk0 04048078)// (unk1 0400c000)
(the immediate increases by 0x300 each iteration; the destination reg is $r30)

In regard to timings: it seems an add followed by a dependent mov carries quite a bit of latency. Placing address calculations at least 2-3 instructions ahead of the mov seems necessary.
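At the source level, the equivalent trick looks something like the sketch below (my own example; ptxas may well reschedule it anyway, so the reliable place to enforce the ordering is in the assembler): issue the address calculation early and put independent work between it and the dependent load.

__global__ void gather_scaled(float *out, const float *in, const int *idx, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = idx[i];            // address calculation issued early
    float scale = s + 1.0f;    // independent work to cover the latency
    out[i] = in[j] * scale;    // dependent load/use a few instructions later
}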

Before I tried out your disassembler, I was a bit skeptical because I had read ATI’s CTM docs. ATI explained its full instruction set, and that thing was COMPLEX. There were tons of flag bits and options to perform all sorts of operations inside one instruction, including saturation/negation, pre-multiplying by a variety of powers of two, swizzling, and setting spinlocks to synchronize the texture fetch unit with the ALUs. (This was its older DX9 tech, however.)

The NVIDIA ISA seems cleaner, but there could also be a lot of features and power that are hidden. E.g., ptxas is probably only using a subset of the features. Or maybe NVIDIA has a more RISC-like philosophy for building GPUs. But try to document all the unexplained bits.

Indeed, I found it very similar to RISC architectures like ARM, including the fact that it has half-size (32-bit) and full-size (64-bit) instructions.

I think I have found almost all of the instructions, at least those used by ptxas and by the shader compiler (and what other compilers are there?). I went over the PTX guide and covered all of its instructions, and I also memory-traced the GL driver to capture the shader code for some shaders as it was written to the card, so the non-CUDA-specific shader instructions should be covered as well. So I doubt there is much missing.

Yes, it is quite complex; a lot of instructions (add, mul and friends, logical instructions) do include bits for saturation and negation. I don’t see any obviously missing instructions. The fact that it is a scalar architecture, in contrast to ATI, helps too: there is no need for swizzling. The most complex part was the condition code / predicate register stuff.
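To make the predicate side concrete: a short branch like the toy example below typically shows up in the disassembly as a set-predicate instruction followed by a predicated mov, rather than an actual jump (my own illustration, not from the decuda docs).

__global__ void clamp_negative(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = data[i];
    if (v < 0.0f)       // short branch: usually ends up as predicated execution
        v = 0.0f;
    data[i] = v;
}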

The things that could be “hidden” are probably some conversion instructions and less-used variations of other instructions (like this load). Can you send me the ptx file that generated those 0x12 instructions?

Currently my toolchain consists of a disassembler and an assembler, which is enough for hand-optimizing kernels. I don’t know if I’m actually going to make a C compiler or anything like that; it seems NVIDIA has covered that part quite well :)

I just released version 0.4.1, which improves the assembler a lot, and fixes a host of other issues.

Wow, ok cool.

To find those op.12s (generated when compiling with ptxas -O0, which maybe uses a couple of things -O4 doesn’t):

Download the zip file from here:

http://forums.nvidia.com/index.php?showtopic=47689&st=0

Shader code (ptx and cubin) is in folder:

milestone 6.3\matrixMul_vc8\matrixMul_vc8.exe.devcode\matrixMul@05726202a99a58be

Let us know if you write documentation. I’m going to try writing some cubin kernels today. I’m afraid the housekeeping looks a bit different from ptx (in regard to registers, parameters, constants, etc.); I’ll post if there’s something I can’t figure out.

P.S. Have you also included function call instructions? I’d like to explore splitting the kernel into separate kernel + inner-loop functions using CUDA 1.1 and replacing just the inner loop.

Yes, the call and return instructions are supported.

Call should call a label, and return returns from the function. Note that I haven’t tested these myself, but I have seen them used (when doing integer division, ptxas adds some internal microcode and calls it), so they should work.
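If you just want something that exercises call and return, a kernel with an integer division should do it, since that is where ptxas inserts and calls its internal routine (a trivial test case of my own):

__global__ void div_test(int *out, const int *a, const int *b)
{
    int i = threadIdx.x;
    out[i] = a[i] / b[i];   // integer division: ptxas emits a called subroutine here
}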

I’ll have a look at your code that generates those instructions soon.

I added your 0x12 instruction to the disassembler and assembler.

I’m using decuda 0.4.1 and I’ve hit a couple of issues. Firstly, the CUDA 2.0 beta seems to generate code that really baffles decuda, but I guess that’s understandable. Secondly, using CUDA 1.1 I can run decuda okay, although there are a few suspicious-looking instructions where the result is discarded before it is ever used. I’m working on the basis that this is just a problem with the CUDA optimizer. My big problem is that cudasm won’t assemble anything with tex instructions in it.

I appreciate your work, wumpus, and I think my kernel could really benefit from being hand-optimized to reduce both instruction and register count. I’m going to try to work around the tex instructions for the time being.

One other question: what does movsh do?

Well, feel free to implement any missing instructions and send me a patch. Also, you should really be using the SVN version, if you aren’t already; I don’t think it has the movsh instruction anymore. It used to move data from/to shared memory, I think.

I’d love to but I can’t read or write Python. I’ll try the latest SVN version though. Something else I noticed in 0.4.1 is that rcp.f32 works but rcp.half.f32 complains about a missing modifier.

Well, Python is a really easy and elegant language; if you can do CUDA/C then it should be a breeze :)

Can you give the hex for the (rcp) instruction that doesn’t recompile?

Do you think? I had a look and decided I’d rather learn nVidia opcodes :)

It looks like this:

90000000 rcp.half.f32 $r0, $r0

And to put my movsh in context it looks like this:

10008001 00000003 mov.b32 $r0, 0x00000000

00000005 c0000780 movsh.b32 $ofs1, $r0, 0x00000000

Any idea what’s going on here? I know it’s something to do with getting a pointer to a constant, but what exactly do the 0x00000000s mean, and why does it bother to use the $r0 register?

:angry: :angry: :angry: :">

Hi,
Can somebody tell me how to use decuda? I am using Visual Studio. I have some CUDA code and I want to see the actual hardware code, because I’m pissed off with seeing ptx code :angry:

Don’t you realize that the actual hardware code can piss you off even more than PTX?

:lol: