Maxwell Assembler

So I wrote a fairly full featured assembler for the Maxwell architecture.

This all started earlier this year when I was studying different sgemm implementations and trying to incorporate those techniques into some deep learning code I’ve been working on. I basically came to the conclusion that it was not possible to fully utilize the hardware I bought. nVidia, unfortunately, doesn’t believe in eating their own dog food and they hand assemble their library routines, rather than use ptxas like the rest of us have to. Ptxas badly manages register usage, does a poor job of mixing memory loads with fp computation and handles predicated memory operations badly (even when warp uniform), among other things.

Anyway, the more I looked at the sass output of my code the more I began to realize that it should be possible to figure out the op and control codes to all the instructions I was using and just assemble my own code. One month later and I now have a pretty useful piece of software. I find it easier now to code in assembler and talk directly to the hardware than it is to code in cuda c.

Here are the features I put together:

Register Allocation: You do this at the top of the file with a map of variable names to register numbers. This way you can write code that’s easy to understand and not obscured by all the register numbers. But mainly this gives you absolute control over what registers are allocated. For performance code this is important because at the hardware level registers are banked and some combinations give you higher throughput than others (and I’m talking 100s of GFlops here).

Scheduled Blocks: For a lot of your code you don’t want to spend too much time optimizing the ordering and stalling of instructions. So I wrote a basic scheduler to do this for you. This way you can focus on just writing clear code that’s easy to maintain. In addition to the stall control values it also automatically figures out the best register reuse control flags to use (a new feature with Maxwell and cuda 6.5). But for performance blocks of code you don’t have to auto schedule it and can very carefully place your instructions to maximize throughput.

Macro Language: I implemented this assembler in Perl and embedded the interpreter as a macro language. This allows you to keep your code nicely rolled up without having a gazillion instructions to maintain. This makes it feel more like developing in a scripted language rather than assembly.

Control Codes: Any instruction placed in a scheduled block has any required stall counts managed automatically. But the other aspects of the control notation I deliberately don’t manage for you. These are mainly the dependency barriers that memory operations make use of to signal when data is ready to use. Managing these automatically is a hard problem and is one I feel is better left up to the developer to work out. Setting these codes actually adds a fun aspect to gpu programming that cuda c or ptx doesn’t expose.

Disassembly: Sometimes you just want to slightly tweak a compiled program. This tool makes that really easy to do. It can dump cubin code in an easy to edit format and you can just insert it back in. In fact, the program isn’t designed to work from scratch. You need to at least to start out with the shell of a kernel that defines the globals, shared memory, and params. It dumps that and you take it from there.

There are lots of other little features to talk about but just wanted to put together a high level description first. I wrote it in Perl but I’ll probably convert it to Python at some point (this seems like the perfect project to finally learn that language.) As it is, I find it pretty easy to now write code that performs within 5% of the theoretical throughput, which for GM107 is 1.6 TFlops. The best I was getting from bashing my head against ptxas was around 70%.

Anyway, I wanted to see if there was any interest in me putting this up on google code or github or something for others to play with, use, and perhaps extend. The op code coverage is around 80% at this point. I can dis and re-assemble all of cublas_device.lib with zero errors. But there’s still more to do: more op codes and more micro benchmarks to fine tune the scheduler.

nVidia may try to claim this violates my EULA, but I call bullshit on that. I’m more than happy to fight them on that front.


I would be interested in using some of that functionality, thanks for posting.

Will not make the move to Maxwell until bandwidth on those devices>336 GBs, but if such a utility would also be useful for Kepler, that would be a great tool for optimization.

Also would not hurt to learn about inner-workings of Maxwell now, before they release their ‘BigBoys’ next year.

I was never really excited about Kepler in that it seems about 1/3 of the chip is wasted on cores you can’t really use that effectively. But Kepler has 7 op instructions per 1 control instruction vs Maxwell’s 3 to 1 so the control notation should be twice as easy to work out. Particularly now that I’ve figured it out for Maxwell: For each of the 3 instructions: 4 bits for stalls, 1 bit for yield, 3 bits for setting the read-after-write dependency barrier, 3 bits for setting the write-after-read dependency barrier, 6 bits to wait on one or more of the set dependency barriers, and 2 bits for register reuse flags (for 3 read operand positions). So, 4 total unused bits out of 64 that I can tell, probably for future expansion.

Anyway, I bet most of the op codes and op flags are the same or only slightly different. The cubin format is likely very similar… so if someone were so inclined it would take far less work than I put in to get this working for Kepler as well.

Oh, and I’ll start packaging this up this weekend so it’s more suitable for general use.

Absolutely interested, as I do crypto implementations for high performance crypto currency mining. Every IOP counts.

Please put it on github…

I’d be really curious to see this too.

Do you know any function which could copy faster blocs of 32 to 64 bytes arrays than a for…i<16 dst[i] = src[i]?

About the memory copy speed.

a memcpy() with fixed (known at compile time) size should work well in CUDA code.

Alternatively a full unroll of your i<16 loop with #pragma unroll 16 works rather well.

Additionally consider performing uint2 or uint4 based copies using some pointer casting magic. Wider data transfer sizes can be faster, but then your data then has to be multiple of 8 and 16 bytes.

If you want to copy each block using multiple threads, make sure your memory accesses are coalesced (and free of bank conflicts when it involves shared memory).

I don’t think this Assembler would give a big boost to memcpy operations, as these are limited by memory latencies, so the raw instruction throughput won’t matter much.


i know, i tried but its not working all the time… i tried to align the u64 arrays with align(32) and that works… sometimes, local arrays are not aligned on each build…

__device__ __device_builtin__ void __nvvm_memcpy(uint8_t *, uint8_t *, SZCT, int);
__device__ __device_builtin__ void __nvvm_memset(uint8_t *, unsigned char, SZCT, int);

So the code is up, but I still need to spend a bunch of time on the wiki. All the code is heavily commented so you could jump right in now if you were so inclined… feel free to email or pm me with questions…

I’ve been reading through sgemm.sass. Very cool. Metaprogramming FTW!

Can you explain your comment on line 66?

Seeing some of the instructions you’ve chosen (e.g. LDS.128) makes me want to go back and tweak some code I’ve written.

If you compile the shell normally with nvcc and inspect the sass you can see what I’m talking about. Technically it is possible, but every memory load that is predicated, either directly or indirectly by having its address or texture register set with a predicted instruction, turns into that warp shuffle monstrosity. This is because memory accesses needs to be somewhat warp uniform. The shuffle is put in there to inspect the threads in the warp and enforce this. I haven’t spent the time to read that code and figure out precisely what it’s trying to enforce, but I just know that’s a totally unnecessary massive overhead when you know all your warps will be behaving uniformly.

What’s needed is to move the .uni specification from the bra instruction and tie it directly to the predicate like so:

@p0.uni bra LABEL;


@p0.uni tex.1d.v4.f32.s32 {loadX0, unused0, unused1, unused2}, [texture, track0];

Also, the point of using LSD.128 loads is to increase the ratio of FFMA instructions to non-ffma instructions. The shared loads are essentially free since they’re dual issued and their latency is pretty much totally hidden by ffma’s from the current thread or another. So grab as much data as you can at once and don’t care that it may take a few more clock cycles to retrieve.

Ah, thanks. I see what you mean by the SHFLs (and SYNCs!). It’s an instruction explosion. After the lda/ldb high bit is flipped, the SHFL looks like it’s broadcasting the lane 0 value across each warp so that every lane has the same value. Seems pointless but as you say it’s probably trying to enforce some TEX-specific requirement.

This paper appeared today which heavily references Scott’s Maxas:

link to actual paper:

Thanks for the reference to my paper!

V1 of the paper has a minor but annoyingly misleading statement about kepler peak throughput, so please update any references to point to the latest version:

Get the source code at: