What is `@!PT LDS RZ, [RZ]` for?

user92857 · July 27, 2023, 7:10pm

Hi,

While tuning a GEMM-like kernel, I saw some strange fragments generated by the compiler:

@!PT  LDS RZ, [RZ] 
@!PT  LDS RZ, [RZ] 
@!PT  LDS RZ, [RZ] 
@!P0  LDGSTS.E.BYPASS.LTC128B.128 [R219], [R224.64]

This fragment corresponds to an inline PTX instruction:

asm volatile("cp.async.cg.shared.global.L2::128B  [%0], [%1], %2;\n" ::"r"(smem_int_ptr), "l"(src), "n"(cp_size));

The code compiles and the result is OK.

However, I’m wondering what @!PT LDS RZ, [RZ] is doing. It reads to me like

if (!PT) {
    RZ = smem[0];
}

If I understand SASS correctly, PT is an always-true predicate register and RZ is a zero register, so is the whole instruction effectively a NOP?
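To make that reading concrete, here is a toy C++ model of the architectural semantics only. It assumes that a false guard suppresses the instruction, that RZ reads as zero, and that writes to RZ are discarded. Every name in it (`Thread`, `lds`, the `RZ` index value, the register and shared-memory sizes) is hypothetical, and it says nothing about any side effect the real hardware might attach to issuing the instruction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model of SASS predication; all names and sizes are hypothetical.
constexpr bool PT = true;   // the "always true" predicate
constexpr int  RZ = 255;    // stand-in index for the zero register

struct Thread {
    std::array<std::uint32_t, 8> regs{};  // ordinary registers R0..R7
    int loads_issued = 0;                 // LDS instructions that actually ran
};

std::uint32_t read_reg(const Thread& t, int r) {
    if (r == RZ) return 0;                // RZ always reads as zero
    assert(r >= 0 && r < 8);
    return t.regs[static_cast<std::size_t>(r)];
}

void write_reg(Thread& t, int r, std::uint32_t v) {
    if (r == RZ) return;                  // writes to RZ are discarded
    assert(r >= 0 && r < 8);
    t.regs[static_cast<std::size_t>(r)] = v;
}

// Models "@guard LDS dst, [addr]". Returns true if the load was issued.
bool lds(Thread& t, bool guard, int dst, int addr,
         const std::array<std::uint32_t, 16>& smem) {
    if (!guard) return false;             // a false guard suppresses the instruction
    const std::uint32_t byte_addr = read_reg(t, addr);
    assert(byte_addr / 4 < smem.size());
    write_reg(t, dst, smem[byte_addr / 4]);
    ++t.loads_issued;
    return true;
}
```

In this model, `lds(t, !PT, RZ, RZ, smem)` returns `false` and leaves `t` untouched, which is the "NOP" reading. Even with the guard set to `PT`, the load would be issued but its result dropped, because the destination is RZ.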

njuffa · July 27, 2023, 11:32pm

Yes, PT is the “always true” predicate and RZ is the designated zero register. But this does not look like an ordinary no-op such as would be used for code alignment. I cannot find anything relevant in NVIDIA’s published materials or on the internet at large.

Based on general experience with processor design, I could speculate that this instruction sequence might serve one of two purposes. In order of decreasing likelihood:

(1) The additional LDS instructions may serve as placeholders to generate additional entries in an internal queue. For example, each queue entry could correspond to 32 bytes (the size of one sector of a 128-byte L1 line), so for a 128-byte transfer four slots need to be created, the first three of which are created via these LDS instructions.

(2) This is a work-around for a hardware bug affecting the recently introduced global-to-shared memory block transfer feature, constructed to get the execution pipeline into a “safe” state prior to initiating the transfer.
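The slot arithmetic in hypothesis (1) can be sketched in a few lines of C++. The 32-byte slot size and the 128-byte transfer size are the illustrative figures from that hypothesis, not confirmed hardware parameters, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Number of queue slots a transfer would occupy if each slot covered
// slot_bytes bytes (rounded up). Purely illustrative arithmetic.
inline std::size_t slots_needed(std::size_t transfer_bytes, std::size_t slot_bytes) {
    assert(slot_bytes > 0);
    return (transfer_bytes + slot_bytes - 1) / slot_bytes;
}

// Number of placeholder instructions that would precede the real copy,
// assuming the copy instruction itself creates the final slot.
inline std::size_t placeholders_needed(std::size_t transfer_bytes, std::size_t slot_bytes) {
    const std::size_t slots = slots_needed(transfer_bytes, slot_bytes);
    return slots == 0 ? 0 : slots - 1;
}
```

With these figures, `slots_needed(128, 32)` is 4 and `placeholders_needed(128, 32)` is 3, which matches the three LDS instructions seen before the LDGSTS. The match is consistent with the hypothesis but does not confirm it.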

I will emphasize again that the above is speculation. Insofar as NVIDIA has filed patent applications for their new block-transfer mechanism, one might find additional information there. I have not searched the USPTO database to check whether any such patent applications have been filed.
