Yes, PT is the “always true” predicate and RZ is the designated zero register. But this does not look like an ordinary no-op such as would be used for code alignment. I cannot find anything relevant in NVIDIA’s published materials, or on the internet at large.
Based on general experience with processor design, I could speculate that this instruction sequence might serve one of two purposes. In order of decreasing likelihood:
(1) The additional LDS instructions may serve as placeholders to generate additional entries in an internal queue. For example, each queue entry could correspond to 32 bytes (the length of an L1 line), so a 128-byte transfer would need four slots, the first three of which are created via these LDS instructions.
(2) This is a work-around for a hardware bug affecting the recently introduced global-to-shared memory block transfer feature, constructed to get the execution pipeline into a “safe” state prior to initiating the transfer.
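To make the arithmetic behind hypothesis (1) concrete, here is a minimal sketch in C++. The 32 bytes per queue entry and the 128-byte transfer size are just the example figures from that hypothesis, not documented hardware parameters, and the function names are made up for illustration.

```cpp
// Hypothetical figures from hypothesis (1); not documented NVIDIA parameters.
constexpr unsigned kTransferBytes = 128;  // size of the block transfer
constexpr unsigned kBytesPerEntry = 32;   // assumed bytes covered by one queue entry

// Number of queue entries needed to cover the whole transfer (rounded up).
constexpr unsigned entries_needed(unsigned transfer_bytes, unsigned bytes_per_entry) {
    return (transfer_bytes + bytes_per_entry - 1) / bytes_per_entry;
}

// Number of placeholder instructions needed, assuming the transfer
// instruction itself creates the final entry.
constexpr unsigned placeholders_needed(unsigned transfer_bytes, unsigned bytes_per_entry) {
    const unsigned n = entries_needed(transfer_bytes, bytes_per_entry);
    return n == 0 ? 0 : n - 1;
}

// Under these assumptions: four entries in total, three of them placeholders.
static_assert(entries_needed(kTransferBytes, kBytesPerEntry) == 4, "four slots");
static_assert(placeholders_needed(kTransferBytes, kBytesPerEntry) == 3, "three placeholders");
```

If the hypothesis were right, one would expect the number of extra LDS instructions to change with the transfer size in the same way.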
I will emphasize again that the above is speculation. Insofar as NVIDIA has filed patent applications for their new block-transfer mechanism, one might find additional information there. I have not searched the USPTO database to check whether any such patent applications have been filed.