So I wrote a fairly full-featured assembler for the Maxwell architecture.
This all started earlier this year when I was studying different sgemm implementations and trying to incorporate those techniques into some deep learning code I’ve been working on. I basically came to the conclusion that it wasn’t possible to fully utilize the hardware I bought with the tools nVidia provides. nVidia, unfortunately, doesn’t believe in eating their own dog food: they hand-assemble their library routines rather than use ptxas like the rest of us have to. Ptxas manages register usage badly, does a poor job of mixing memory loads with fp computation, and mishandles predicated memory operations (even when they’re warp uniform), among other things.
Anyway, the more I looked at the sass output of my code, the more I realized it should be possible to figure out the op codes and control codes for all the instructions I was using and just assemble my own code. A month later, I have a pretty useful piece of software. I now find it easier to code in assembler and talk directly to the hardware than it is to code in cuda c.
Here are the features I put together:
Register Allocation: You do this at the top of the file with a map of variable names to register numbers. This way you can write code that’s easy to understand and not obscured by raw register numbers. But mainly it gives you absolute control over which registers are allocated. For performance code this matters because at the hardware level registers are banked, and some operand combinations give you higher throughput than others (and I’m talking hundreds of GFlops here).
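To give a feel for it, here’s a rough sketch of what such a mapping might look like. The directive name, syntax, and variable names here are just illustrative; the point is that every name is pinned to a specific register number you chose.

    <REGISTER_MAPPING>

        // accumulator tile, pinned to registers whose banks pair well with the operands
        0-63  : result<00-63>

        // load targets and address tracking
        64-71 : loadA<0-3>, loadB<0-3>
        72-73 : trackA, trackB
    </REGISTER_MAPPING>

With the mapping fixed up front, writing "result00" in the body of the kernel always means the same physical register, so operand bank conflicts become something you can reason about directly instead of hoping an allocator avoids them.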
Scheduled Blocks: For most of your code you don’t want to spend time hand-optimizing the ordering and stalling of instructions, so I wrote a basic scheduler to do it for you. This way you can focus on writing clear code that’s easy to maintain. In addition to the stall control values, it also automatically figures out the best register reuse control flags to use (a new feature with Maxwell and cuda 6.5). But for performance-critical blocks you can skip the auto-scheduling and place your instructions very carefully by hand to maximize throughput.
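As a rough sketch (again, the block syntax and register names are illustrative rather than the real thing): you write the instructions in whatever order reads best, and the scheduler reorders them, inserts the stall counts, and sets reuse flags where they help, such as on the a0 operand shared by the FFMAs below.

    <SCHEDULE_BLOCK>
        // written in source order for readability; the scheduler handles
        // the ordering, the stall counts, and the register reuse flags
        FFMA result00, a0, b0, result00;
        FFMA result01, a0, b1, result01;
        FFMA result02, a0, b2, result02;
        FFMA result03, a0, b3, result03;
    </SCHEDULE_BLOCK>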
Macro Language: I implemented the assembler in Perl and embedded the Perl interpreter itself as the macro language. This lets you keep your code nicely rolled up instead of maintaining a gazillion hand-unrolled instructions, and it makes the whole thing feel more like developing in a scripting language than in assembly.
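For example (a minimal sketch; the block delimiters and register names are assumptions, not the real syntax), a bit of embedded Perl can generate an unrolled run of FFMAs instead of you writing them out by hand:

    <CODE>
        # plain Perl; whatever string this block returns gets spliced into the assembly
        my $out = '';
        foreach my $i (0 .. 7) {
            $out .= sprintf "FFMA result%02d, a%d, b0, result%02d;\n", $i, $i, $i;
        }
        return $out;
    </CODE>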
Control Codes: For any instruction placed in a scheduled block, the required stall counts are managed automatically. But the other aspects of the control notation I deliberately don’t manage for you. These are mainly the dependency barriers that memory operations use to signal when their data is ready. Managing these automatically is a hard problem, and one I feel is better left to the developer to work out. Setting these codes actually adds a fun aspect to gpu programming that cuda c and ptx don’t expose.
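To give the flavor of it (the field layout shown here is just one way to write the notation down, and the instructions are illustrative): a load names a barrier to be flagged when its data lands, you schedule other work in the gap, and the consumer waits on that barrier before touching the registers.

    --:-:1:-:1  LDG.E.128 loadA0, [trackA];   // flag barrier 1 when the data lands
    --:-:2:-:1  LDG.E.128 loadB0, [trackB];   // flag barrier 2 for this one
                                              // ...independent work here to hide the latency...
    03:-:-:-:1  FFMA result00, loadA0, loadB0, result00;  // wait on barriers 1 and 2 first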
Disassembly: Sometimes you just want to slightly tweak a compiled program, and this tool makes that really easy. It can dump cubin code in an easy-to-edit format, and you can splice your edits right back in. In fact, the program isn’t designed to work from scratch: you at least need to start out with the shell of a kernel that defines the globals, shared memory, and params. It dumps that and you take it from there.
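The workflow I have in mind goes something like this. The tool and flag names below are placeholders rather than a finished interface; the nvcc step is just the usual way to get a cubin for the kernel shell.

    # build a stub kernel that declares the globals, shared memory and params
    nvcc -arch=sm_50 -cubin -o sgemm.cubin sgemm.cu

    # dump the device code in an editable form (placeholder invocation)
    asm.pl --extract sgemm.cubin > sgemm.sass

    # hack on sgemm.sass, then splice the edited code back into the cubin
    asm.pl --insert sgemm.sass sgemm.cubin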
There are lots of other little features to talk about, but I just wanted to put together a high-level description first. I wrote it in Perl, but I’ll probably convert it to Python at some point (this seems like the perfect project for finally learning that language). As it is, I now find it pretty easy to write code that performs within 5% of theoretical throughput, which for GM107 is about 1.6 TFlops. The best I was getting from bashing my head against ptxas was around 70%.
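(For reference, that 1.6 TFlops figure is just the usual peak-rate arithmetic: 640 cores × 2 FLOPs per FMA × roughly 1.25 GHz ≈ 1.6 TFlops, with the clock assuming the card is running a bit above stock.)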
Anyway, I wanted to see if there’s any interest in me putting this up on google code or github or something for others to play with, use, and perhaps extend. Op code coverage is around 80% at this point, and I can disassemble and reassemble all of cublas_device.lib with zero errors. But there’s still more to do: more op codes, and more micro-benchmarks to fine-tune the scheduler.
nVidia may try to claim this violates my EULA, but I call bullshit on that. I’m more than happy to fight them on that front.