tcgen05{.ld,.st} matrix fragments

Hi NVIDIA expert:

As described in sections 9.7.16.2.3.1.2 (Matrix fragments for shape .16x64b) and 9.7.16.2.3.1.3 (Matrix fragments for shape .16x128b) of the PTX ISA 9.1 documentation, the tcgen05{.ld,.st}.16x64b instruction uses two non-contiguous threads to move each 64 bits of data, as shown in Figure 184, whereas the tcgen05{.ld,.st}.16x128b instruction uses four consecutive threads to move each 128 bits of data, as shown in Figure 185.

Why doesn’t the tcgen05{.ld,.st}.16x64b instruction use two consecutive threads to move each 64 bits of data in each lane?

In addition, in Figure 185, does r0 in each thread represent the same register?