Cubin assembler now available: decuda 0.4.0 released

No… I'm not annoyed by its syntax… I'm annoyed because it doesn't show me the actual picture. I want to see where my registers are being used. I want to optimize my code :D

If you know, can you tell me how to run decuda?

Sarnath

That's awesome, wumpus… indeed it will be a nice tool chain to have.
Keep up the good work, man!

Cheers!!!
Sandeep

I don't know either; I have never used it. But you can always download it and go through the documentation.

IMHO,

If your algorithm is good enough, one should stay away from optimizations at this level unless and until there is a driving need for it.

I still remember a quote from the author of a book (Zen of Graphics Programming?): “The best optimizer is between your ears.”

OK… I have found the way, with help from wumpus.

Wumpus, I would like to thank you for your impressive work. It is a big help for people like me who want to take control of the code (I'm writing a chess engine) and of all the hardware features inside the NVIDIA GPU.

Great great work!

Hello, very nice tool!

Made me a happier man to see how registers are actually allocated!

I was curious whether you have seen any signs of the interpolators that are supposed to exist in the SFU (Special Function Unit). They are apparently there to help pixel shaders interpolate vertex attributes, and as far as I can make out they are not exposed in the CUDA APIs.

Perhaps accessing them through the assembler would be possible? Unfortunately I have no more insight to offer as to how they might be invoked, but if I didn't misread, I gathered that you had traced the shader binaries being sent down to the HW?

cheers
.ola

P.S.
IEEE Micro: “NVIDIA Tesla: A Unified Graphics and Computing Architecture”

Yes, I have traced various kinds of shaders in the HW. Vertex shaders have some special instructions to write to varyings (output registers), which are interpolated for the fragment shaders. Also, there are instructions to read incoming values, used in fragment programs.

As far as I know you cannot use this in CUDA, because the values are already interpolated at the start of the shader; you have no programmatic control over them in your kernel. Also, using these instructions causes hardware exceptions (Xid stuff…).

Edit: thanks very much to the people that sent me the paper, it is truly interesting

Hi wumpus,

decuda and cudasm are extremely helpful while trying to optimize code. Thanks a lot.

I encountered a problem, though. I'm attaching a cubin for matrix multiply. If I disassemble it with decuda and then assemble it again with cudasm, the program gives incorrect output, whereas the original cubin works fine as is. Maybe this is a bug in cudasm? Could you please take a look? Thanks!

There are a lot of bugs in cudasm; it was more of a proof of concept, and I didn't get around to making it foolproof.

cudasm doesn't have much use, because mucking about with PTX kernels (especially machine-code kernels) is a fool's errand. Doing this takes far more coding time and maintenance effort than the benefit is worth, and it will break as soon as NVIDIA makes changes to its architecture. (The only thing it might make sense for is matrix multiply, since it's simple and the focus of competitions.) At the very least, it's of little use until non-inline functions are supported, so you could link normal code against asm-optimized inner loops. Even then, though, it's dangerous for you and society at large.

decuda, though, is pure gold.

Ok, finally got decuda working and it’s great!

I tried the latest stable Python from the official Python site, and also from ActiveState, and both gave me problems with the modules cStringIO and StringIO. After a lot of digging, it turns out these were standard modules but have been removed in Python 3.0. So if you're going to get Python for running decuda, get Python 2.6.1!

I’m on Windows Vista.
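For anyone hitting the same error, here is a minimal sketch (my own, not code from decuda) of what the problem looks like: Python-2-era code imports the cStringIO/StringIO modules, which were removed in Python 3.0 and replaced by io.StringIO. The decuda-style assembly line below is just a made-up placeholder.

```python
# Python-2-era code (like decuda) does: from cStringIO import StringIO
# On Python 3 those modules no longer exist; io.StringIO is the replacement.
try:
    from cStringIO import StringIO  # Python 2: fast C implementation
except ImportError:
    from io import StringIO         # Python 3 equivalent

buf = StringIO()
buf.write("mov.b32 $r0, $r1\n")  # hypothetical disassembly output line
print(buf.getvalue())
```

Of course, the shim only helps if you can edit the code; with a stock decuda, sticking to Python 2.6 is the simpler fix.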

An off-topic but general FYI: Python 3.0 is a backward-incompatible release designed to fix some long-standing design “bugs” in Python. Unless you are using code which explicitly says “Written for Python 3” or you want to experiment with the future of the language, you should stay away from Python 3. (Not that it is bad, it is just a “forward looking” release.) There are essentially zero useful programs that automatically work with both Python 2 and 3.

It would be nice if people who packaged Python made this more clear. (Tell your friends!)
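To illustrate the kind of backward-incompatible change involved (a generic example of mine, not related to decuda specifically): even plain integer division changed meaning in Python 3.

```python
# In Python 2, 7 / 2 evaluates to 3 (truncating integer division).
# In Python 3, 7 / 2 evaluates to 3.5, and floor division needs //.
print(7 // 2)  # 3 on both Python 2 and Python 3
print(7 / 2)   # 3 on Python 2, 3.5 on Python 3
```

Silent behavior changes like this are why a Python 2 program can "run" under Python 3 and still produce wrong answers, rather than failing cleanly.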

Awesome tool!