Feedback on PTXAS

Hi,

A few feature requests for PTXAS. Caveat: Much of this feedback is based on the disassembly output from wumpus’ tool, but so far its results have correlated very well with observed performance.

BTW, despite these points the tool quality is generally very good – and thanks for the robustness improvements in v1.1!

(1) Please add a pragma (or similar) to limit the register count (maxrregcount) on a per-kernel basis. If I have multiple kernels in the same file, I don’t necessarily want them to have the same register limit. Plus, a pragma is a more visible way to specify the limit than burying it in the build flags.
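
Something along these lines is what I have in mind (purely hypothetical syntax; neither this pragma spelling nor these kernels exist, it is just to illustrate a per-kernel cap):

#pragma maxrregcount 16            // hypothetical: cap only this kernel at 16 registers
__global__ void small_frequent_kernel(float *out) { /* ... */ }

#pragma maxrregcount 32            // hypothetical: a different cap for this one
__global__ void big_infrequent_kernel(float *out) { /* ... */ }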

(2) Please improve the register allocator. I’ve noticed that artificially reducing the register count with maxrregcount often results in poor register allocation; it almost looks like the compiler assigns registers with its normal algorithm, then applies a post-processing pass to squeeze into the specified limit, and that pass can pick poor spill candidates. Sorry, I don’t have an example right now to post.

(3) It would be nice if the compiler knew the expected number of threads for a kernel so that it could allocate registers accordingly. For example, I created a kernel that I always call with 16x16 threads, yet the compiler used more than 32 registers per thread, resulting in a failed launch (well, a fast kernel that produced incorrect results…). Perhaps a #pragma or similar could give the compiler this hint.
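
For what it’s worth, the arithmetic behind that failure (a quick host-side sketch; the 8192-register budget is my assumption for the current per-multiprocessor register file, and regs_per_thread would come from the ptxas verbose output):

// Sanity check: does a block of this size fit in the register file?
bool launch_fits(int block_threads, int regs_per_thread)
{
    const int regs_per_block = 8192;   // register file available to one block (assumed)
    return block_threads * regs_per_thread <= regs_per_block;
}
// 16x16 = 256 threads: at 33 registers/thread the block needs 256 * 33 = 8448 > 8192,
// so the launch fails; at 32 registers it is exactly 8192 and just fits.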

(4) Would it be possible to optionally spill registers to a shared memory buffer? I’ve had a few cases where I was at (say) 18 registers but wanted to fit the kernel into 16 in order to increase occupancy. However, the spills to local memory were killer, so I ended up “spilling” values to shared memory by hand. There is a trade-off here, since you can’t spill many registers to shared memory without blowing the shared memory budget, but it would be a nice tool to have.
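
For reference, the manual workaround looks roughly like this (a minimal sketch rather than my actual kernel; the names and the trivial middle section are made up). One float slot per thread of a 16x16 block costs 1 KB of shared memory but frees a register across the register-hungry section:

__global__ void spill_demo(const float *in, float *out, int n)
{
    __shared__ float spill[16 * 16];              // one slot per thread of a 16x16 block
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int gid = blockIdx.x * (blockDim.x * blockDim.y) + tid;
    if (gid >= n) return;

    float keep_for_later = in[gid];               // needed again only at the very end
    spill[tid] = keep_for_later;                  // park it in shared memory

    // Register-hungry middle section: keep_for_later is dead here,
    // so its register is free to be reused.
    float acc = 0.0f;
    for (int i = 0; i < 16; ++i)
        acc += in[(gid + i) % n];

    keep_for_later = spill[tid];                  // reload when finally needed
    out[gid] = acc + keep_for_later;
}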

(5) The compiler seems poor at deciding how to schedule global loads. I have seen many examples where rather than hoisting the load as high as possible, it pushes it late into the block. For example:

loop {
    As[ty][tx] = my_var;         // use the value preloaded in the previous iteration
    __syncthreads();
    my_var = A[foo];             // try to preload the next value to hide its latency
    do_lots_of_calculations();
    __syncthreads();
}

Despite having enough registers free to execute the A[foo] load immediately after the __syncthreads(), the compiler pushes this load to the end of the loop. It doesn’t always do this, but it would be very nice if it never did.

This optimization yields a ~5% speedup in my matrix multiply, except that about half the time the compiler reshuffles my loads and eliminates the gain whenever I make other, unrelated changes!
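
For reference, here is a fuller sketch of the pattern in a tiled matrix multiply (a reconstruction rather than my actual kernel; the 16x16 tile, row-major indexing, and the assumption that n is a multiple of the tile size are all mine):

#define TILE 16

__global__ void matmul_prefetch(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;
    int col = blockIdx.x * TILE + tx;

    float a_pre = A[row * n + tx];                // preload the first A element
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[ty][tx] = a_pre;                       // use last iteration's preload
        Bs[ty][tx] = B[(t * TILE + ty) * n + col];
        __syncthreads();

        // This is the load I want issued right here, not at the end of the loop.
        if (t + 1 < n / TILE)
            a_pre = A[row * n + (t + 1) * TILE + tx];

        for (int k = 0; k < TILE; ++k)            // the "do_lots_of_calculations()" part
            acc += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }
    C[row * n + col] = acc;
}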

(6) Along the same lines, it would be nice if the compiler were a little less restrictive around __syncthreads(); currently it doesn’t move any global loads across the barriers. I know pointer aliasing makes life difficult, but in many cases the kernel does no global writes at all inside a loop – surely in such cases the compiler could have a bit more freedom to launch its loads early?

(7) The compiler doesn’t seem to share offset pointers for shared variables. If I have two shared matrices As[3][32][32] and B[32][32], and I change my code to a single ABs[4][32][32], referring to ABs[3] everywhere I previously referred to B, I get a speed-up. Looking at the disassembly, the compiler was able to do some common sub-expression recognition and pointer math that it didn’t do when As and B were declared separately, despite there being a static relative memory map in both cases.
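
To illustrate, the two layouts look roughly like this (tile sizes from the example above; the float element type, the Bs macro, and the elided bodies are just for the sketch, since the point is only the declarations and indexing):

__global__ void before_version(/* ... */)
{
    __shared__ float As[3][32][32];
    __shared__ float B[32][32];        // separate base address, offsets recomputed
    // ... body indexes As[i][y][x] and B[y][x] ...
}

__global__ void after_version(/* ... */)
{
    __shared__ float ABs[4][32][32];   // the old B now lives at ABs[3]
    #define Bs(y, x) ABs[3][y][x]      // drop-in replacement for B[y][x]
    // ... identical body indexes ABs[i][y][x] and Bs(y, x); this version is faster ...
}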

Regards,

Paul