ptxas register use

So I’ve been toying around with CUDA C and PTX assembly for the last few weeks and I’ve been somewhat frustrated by the resulting cuobjdump assembly that’s generated. I have finely crafted PTX code that should be using around 113 registers, yet when compiled it uses around 137 (which drops my occupancy in half, which in turn costs about 300 GFLOPS on my 750 Ti). Looking closer, it seems ptxas is being kind of dumb with the LDS.128 instruction. That instruction requires its destination registers to be an aligned, contiguous group of 4, but ptxas only decides which registers to use for this after it’s already assigned a bunch of others. So it ends up trying to reuse those already-allocated registers and adds in a bunch of messy and totally unnecessary MOV instructions shifting values around. The end result is that it uses over 20 more registers than it should need if it could just pick a sensible layout to begin with.

Interestingly, if I compile with --opt-level 0 the register use drops to about 115 and the resulting dump looks a lot like my original code. However, the code is also littered with useless extra instructions like:

IADD R2, R0, 0xd0;
MOV R2, R2;
STG.CS [R2], R85;
IADD R2, R0, 0xd4;
MOV R2, R2;
STG.CS [R2], R86;

which should just be this:

STG.CS [R0+0xd0], R85;
STG.CS [R0+0xd4], R86;

So my question is: have you guys discovered any tricks to get ptxas to behave the way you’d like it to? Ideally there’d be a “trust me, I know what I’m doing” mode that didn’t try to get too clever with optimization and produced something much closer to a 1-to-1 mapping from PTX to machine code.

Oh, and while I’m requesting new functionality, it would also be nice to be able to specify the actual registers used; that way you could minimize register bank conflicts for code that can’t reach the occupancy required to hide those latencies. For example, if you dump the code for the sm35 cuBLAS sgemm implementation you can see it’s completely devoid of register bank conflicts (as described in Junjie Lai’s paper on the topic). It seems he now works for NVIDIA, and I wouldn’t be surprised if he wrote that implementation. It would be nice if we didn’t have to work for NVIDIA to get the most out of our custom kernels.

You didn’t mention if you were using it already, so one easy thing to try is __launch_bounds__ to hint that you really want a hard cap on the number of registers per thread (e.g. 128 regs/thread).
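For reference, a minimal sketch of what that hint looks like (the kernel name, parameters, and block size here are made up, not your actual code):

```cuda
// Sketch: with 256 threads/block and a minimum of 2 resident blocks
// per SM, ptxas on sm_50 (64K regs/SM) is pushed toward
// 65536 / (256 * 2) = 128 registers per thread.
__global__ void
__launch_bounds__(256, 2)   // maxThreadsPerBlock, minBlocksPerMultiprocessor
my_sgemm_kernel(const float *A, const float *B, float *C, int n)
{
    // ... kernel body ...
}
```

The alternative is a blanket -maxrregcount=128 on the nvcc command line, but __launch_bounds__ lets you set the cap per kernel.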

As far as coaxing ptxas to “do what I say”… any advice I have borders on voodoo. It sounds like you’re pretty comfortable with PTX/SASS so I can only suggest crafting small PTX routines that generate exactly what you want and then go from there. I did this with some vector type operations and went through a number of iterations before I found the smallest/fastest instruction sequence.

I’ve found that stringing together PTX and C seems to be just as effective as hand-coding a large PTX routine. Also, don’t be afraid to use the vector pack/unpack PTX opcodes, as ptxas seems to do a good job of entirely erasing those instructions. It might also be beneficial to make sure the args to your PTX routines are properly qualified (const, volatile, etc.).
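To illustrate the PTX-mixed-with-C approach, here’s a sketch of a 128-bit shared-memory load wrapped in inline asm (the helper name is invented, and it assumes a toolkit that provides __cvta_generic_to_shared for the generic-to-shared address conversion):

```cuda
#include <cstdint>

// Hypothetical helper: force a single ld.shared.v4.f32 from CUDA C.
// The float4 pack/unpack around it is typically erased by ptxas.
__device__ __forceinline__
float4 lds128(const float *smem)
{
    float4 v;
    // Convert the generic pointer to a 32-bit shared-window address
    // so the ld.shared form can consume it directly.
    uint32_t addr = (uint32_t)__cvta_generic_to_shared(smem);
    asm volatile("ld.shared.v4.f32 {%0, %1, %2, %3}, [%4];"
                 : "=f"(v.x), "=f"(v.y), "=f"(v.z), "=f"(v.w)
                 : "r"(addr));
    return v;
}
```

The point isn’t that this exact wrapper is required, just that small inline-asm islands like this let you pin the instruction you care about while leaving the surrounding scheduling to ptxas.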

Oh, and file bugs if you find anything really bad (like that MOV R2,R2) and can definitively repro it.

Changing the launch bounds doesn’t seem to change the optimizer’s strategy at all. It only seems to force register spilling which hurts performance far more than the lost occupancy does.

The advice bordering on voodoo seems to jibe with my own experience. I was actually able to get register usage down to 128 in the C version of the code, but only after lots of random permutations of slight code changes. And the slightest change to that would trigger it to revert back to excessive register use and MOV instructions. I was pushing over 1250 GFLOPS of useful calculations on my 750 Ti in the 128-register case, which was really encouraging. The cuBLAS implementation of sgemm for sm50 only does about 945 GFLOPS (the sm35 code hasn’t been ported yet). What I’m doing has a component of sgemm to it but mixes in some other logic as well (which makes more sense to incorporate into the kernel vs adding as a second stage after a pure sgemm kernel).

The PTX version was mainly an exercise in learning that language, but I also wanted to see if I could get more stability out of the optimizer. Using ld.volatile.shared.v4.f32 did seem to help a little in that respect, but ptxas was oblivious to const vs non-const. I haven’t tried the vector unpack instructions, as that would just add complexity to the code for the sole purpose of coaxing the optimizer out of bad behavior. This would all be so much easier if ptxas just used exactly the registers you specified.

Compiling to sm35 or even using cuda 5.5 doesn’t seem to change anything.

I’m not sure NVIDIA considers mov r2, r2 a bug at optimization level zero. But I’d happily submit my basic sgemm code to them, which clearly demonstrates this issue, if you think they’d actually look at it. What’s been your experience submitting bugs to them?

@spworley noted offline that in CUDA 6.0 RC an explicit inline qualifier squashed some unexpected spilling in some of his tests (registers available but spills still occurring). I expect he’ll write up his findings soon.

So if you’re on the CUDA 6 RC and don’t need the ABI, then one experiment might be to qualify all (or some) of your functions with __forceinline__ as well as compile with -Xptxas=-v,-abi=no to grab that extra register or two.
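Concretely, the build line would look something like this (the filename and arch are placeholders for your own setup):

```
# Sketch: no-ABI build with per-kernel register/smem usage reporting
# (-v), targeting sm_50 for the 750 Ti. kernel.cu is a placeholder.
nvcc -arch=sm_50 -Xptxas=-v,-abi=no -o kernel kernel.cu
```

Dropping the ABI removes the calling-convention overhead (stack frame, reserved registers), which is where the extra register or two comes from, but it only works if nothing in the kernel actually needs the ABI.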

So far I’ve found that all submitted bugs with a repro get fixed, but the fix is staggered to the next release.

1250 GFLOPS on the GTX 750 Ti sounds great!

-abi=no just induced massive register spilling. However, in the C version, __forceinline__ on my one device function was one of the things that got me to 128. Maybe the trick is to break the code up into smaller pieces that the optimizer works on individually, rather than overwhelming it with one massive block of unrolled loops. Or maybe not: as I recall, the CUDA-generated PTX had just the one kernel function with all the device function calls inlined. Anyway, that’s something else to play with at least. Thanks for the feedback.

Oh, and the performance I’m getting comes from double-buffered 8-register blocking for each matrix, plus some creative remapping of global (coalesced) to shared memory such that I’m loading from shared entirely with LDS.128 instructions and no bank conflicts. This allows for a greater than 90% FFMA instruction ratio with zero operand dependencies (aside from register banking latency, which I can’t control, only hide).
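For anyone trying to picture the blocking scheme: here’s a cartoon of a double-buffered 8×8 register-blocked inner loop. Everything in it is schematic (names, tile sizes, and the load_smem_v4 helper are all invented, and boundary handling is omitted), not the actual kernel being discussed:

```cuda
// Schematic inner loop: 8x8 register blocking with double-buffered
// register fragments. Each k-step issues 64 FFMAs against 16 loads,
// which is where the >90% FFMA ratio comes from.
float regA[2][8], regB[2][8], acc[8][8] = {};
load_smem_v4(regA[0], regB[0], /*k=*/0);   // hypothetical LDS.128 loader
for (int k = 0; k < K; ++k) {
    int cur = k & 1, nxt = cur ^ 1;
    // Prefetch the next fragments while computing on the current ones,
    // hiding the shared-memory load latency behind the FFMAs.
    load_smem_v4(regA[nxt], regB[nxt], k + 1);
    for (int i = 0; i < 8; ++i)
        for (int j = 0; j < 8; ++j)
            acc[i][j] += regA[cur][i] * regB[cur][j];  // FFMA
}
```

The per-matrix register fragments alone account for 2 × (8 + 8) = 32 registers on top of the 64 accumulators, which is why the allocator’s choices matter so much here.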