What is `@!PT LDS RZ, [RZ]` for?


During the tuning of a GEMM-like kernel, I saw some strange fragments generated by the compiler

@!PT  LDS RZ, [RZ] 
@!PT  LDS RZ, [RZ] 
@!PT  LDS RZ, [RZ] 
@!P0  LDGSTS.E.BYPASS.LTC128B.128 [R219], [R224.64] 

what this fragment correlates to is an inline PTX instruction

asm volatile("cp.async.cg.shared.global.L2::128B  [%0], [%1], %2;\n" ::"r"(smem_int_ptr), "l"(src), "n"(cp_size));

The code compiles and the result is OK.

However, I’m wondering what @!PT LDS RZ, [RZ] is doing. I reads to me like

if (!PT) {
    RZ = smem[0];

if I understand SASS right, PT is an always true predicate register and RZ is a zero register, so the whole instruction is just like an NOP?

Yes, PT is the “always true” predicate and RZ is the designated zero register. But this does not look like an ordinary no-op such as would be used for code alignment. I cannot find anything relevant in NVIDIA’s published materials, nor the internet at large.

Based on general experience with processor design, I could speculate that this instruction sequence might serve one of two purposes. In order of decreasing likelihood:

(1) The additional LDS instructions may serve as placeholders to generate additional entries in an internal queue. For example, each queue entry could correspond to 32 bytes (the length of an L1 line), so for a 128 byte transfer four slots need to be created, the first three of which are created via these LDS instructions.

(2) This is a work-around for a hardware bug affecting the recently introduced global-to-shared memory block transfer feature, constructed to get the execution pipeline into a “safe” state prior to initiating the transfer.

I will emphasize again that the above is speculation. In as far as NVIDIA has filed patent applications for their new block-transfer mechanism, one might find additional information there. I have not searched the USPTO database to check if any such patent applications have been filed.