I am not familiar with this particular instruction, but the general philosophy used by the disassembler is to show only the least significant, naturally aligned register, with the number of registers used indicated by an instruction name suffix, if at all. For common examples, look at double-precision instructions, which use two registers per operand.
In this case I would expect that R56 in combination with .4 indicates that registers R56, R57, R58, and R59 are being used. The natural alignment is evident from the fact that 56 is divisible by 4, just as the register numbers shown for normal double-precision operations are always divisible by 2.
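To make that reading concrete, here is a minimal sketch in Python. It assumes, as I do above, that the suffix gives the number of registers and that the base register must be a multiple of that count. The helper name `expand_register_group` is made up for illustration and is not part of any NVIDIA tool.

```python
def expand_register_group(base: int, count: int) -> list[str]:
    """List the registers in a naturally aligned group.

    Assumption (not confirmed by documentation): the disassembler prints
    only the lowest register, and `count` comes from the instruction suffix.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    if base % count != 0:
        # A base that is not a multiple of the group size would not be
        # naturally aligned under the assumption above.
        raise ValueError(f"R{base} is not aligned for a group of {count}")
    return [f"R{base + i}" for i in range(count)]


print(expand_register_group(56, 4))  # ['R56', 'R57', 'R58', 'R59']
```

The same helper with a count of 2 models the double-precision case, where the displayed register number is always even.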
By observation, PTX generated by nvcc uses virtual register names in SSA (static single-assignment) fashion, meaning each virtual register is written to exactly once. But registers in PTX are typed, and the type tells ptxas how to map them to aggregates of physical registers.
I am not a compiler engineer, but from talking to compiler engineers working on non-GPU platforms that also use register aggregation for wider data types, having to allocate registers at varying granularity is a non-trivial complication and can leave holes in the final register allocation map.
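A toy model shows how such holes can come about. This is only a sketch under my own assumptions: it places each request, in order, at the next naturally aligned register number and never goes back to fill gaps. Real allocators such as the one in ptxas are certainly more sophisticated, and the function name `allocate_in_order` is hypothetical.

```python
def allocate_in_order(widths):
    """Place each request at the next naturally aligned free register.

    `widths` is a list of group sizes (1, 2, 4, ...). Returns the list of
    (base, width) placements and the register numbers that were skipped
    purely to satisfy alignment.
    """
    next_free = 0
    placements = []
    holes = []
    for width in widths:
        # Round next_free up to the nearest multiple of width.
        base = -(-next_free // width) * width
        holes.extend(range(next_free, base))
        placements.append((base, width))
        next_free = base + width
    return placements, holes


placements, holes = allocate_in_order([1, 2, 1, 4])
print(placements)  # [(0, 1), (2, 2), (4, 1), (8, 4)]
print(holes)       # [1, 5, 6, 7]
```

In this example four of the twelve registers touched are unused. A smarter allocator would reorder or backfill, which is exactly the kind of extra work the mixed granularity forces on it.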
I am not an expert on SSA, but as I understand it, this is where the “static” comes in. A loop is something dynamic that happens at runtime, so an instruction inside a loop may write its destination register many times during execution. In other words, with SSA there is only one instruction in a code listing where a particular register appears as the destination / left-hand side.
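The distinction between static and dynamic writes can be sketched like this. The listing format is a made-up tuple of (destination, operation, sources), not real PTX, and the helper names are hypothetical. The check counts how often each register appears as a destination in the listing, which is independent of how often the instruction runs.

```python
from collections import Counter

# Hypothetical toy listing, not real PTX: (destination, operation, sources).
LISTING = [
    ("a", "const", (0,)),
    ("b", "add", ("a", 1)),  # imagine this instruction sits inside a loop body
    ("c", "mul", ("b", 2)),
]


def static_definitions(listing):
    """Count the instructions in the listing that name each register as destination."""
    return Counter(dest for dest, _op, _srcs in listing)


def is_single_assignment(listing):
    """True if every register is the destination of exactly one instruction."""
    return all(n == 1 for n in static_definitions(listing).values())


def dynamic_writes(listing, loop_dests, trip_count):
    """Count runtime writes, assuming instructions whose destination is in
    `loop_dests` execute `trip_count` times and all others execute once."""
    return {dest: (trip_count if dest in loop_dests else 1)
            for dest, _op, _srcs in listing}


print(is_single_assignment(LISTING))       # True
print(dynamic_writes(LISTING, {"b"}, 10))  # {'a': 1, 'b': 10, 'c': 1}
```

Register `b` is written ten times at runtime, yet the listing still has the single-assignment property because only one instruction names it as the destination.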