Understanding LDSM.16.M88.4

PTX docs clearly state that ldmatrix.x4 requires 4 output registers:

// Load four 8x8 matrices
.reg .b64 addr;
.reg .b32 d<4>;
ldmatrix.sync.aligned.m8n8.x4.b16 {d0, d1, d2, d3}, [addr];

Since we are loading 8 * 8 * 4 = 256 bytes, and we only have 32 threads in a warp, of course we need 4 32-bit registers per thread.

But when I look at SASS disassembly for my kernel, I only see one output register:

        /*3940*/                   LDSM.16.M88.4 R56, [R62] ;                                                  /* 0x000000003e38783b */

How is this possible? Is cuobjdump hiding some of the output registers? The SASS instruction ends with .4, so how can it only have a single output?

I am not familiar with this particular instruction, but the general philosophy used by the disassembler is to show the least significant naturally-aligned register only, with the number of registers used indicated by a instruction name suffix, if at all. For common examples, look at double-precision instructions which use two registers per operand.

In this case I would expect that R56 in combination with .4 indicates that registers R56, R57, R58, and R59 are being used. The natural alignment is obvious from the fact that 56 is divisible by 4, while the registers indicated for normal double-precision operations are always divisible by 2.

That means BTW that PTX code has freedom in register naming and ptxas has to add special constraints, when mapping registers.

By observation, PTX generated by the nvcc uses virtual register names in SSA (static single-assignment) fashion, meaning each virtual register is written to exactly once. But registers in PTX are typed, which carries information to ptxas how to map them to aggregates of physical registers.

I am not a compiler engineer, but from talking to compiler engineers working on non-GPU platforms that also use register aggregation for wider data types, having variant granularity for register allocation is a non-trivial complication and can result in holes in the final register allocation map.

I think with the possible exception of rolled (= not unrolled) loops with conditional assignment in the loop body.

I am not an expert on SSA, but as I understand it, this is where the “static” comes in. A loop is something dynamic that happens at runtime. In other words, with SSA there is only one instruction in a code listing where a particular register appears as the destination / left-hand side.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.