Predicate register as last operand in load instructions

I am currently studying the load/store unit in the Ampere architecture. Dumping the SASS code of cuBLAS compiled for sm_80 with nvdisasm, I observed frequent occurrences of the pattern LDG.E.LTC128B.64.STRONG.GPU destination, source memory address, predicate register,
e.g. LDG.E.LTC128B.64.STRONG.GPU R90, [R70.64], P4 ;

To my knowledge, a predicate register as the last operand of a load instruction does not appear in previous architectures.

My guess is that this pattern has a similar effect to the @P prefix, which prevents masked-off threads from performing the load. I wrote a simple masked a+b CUDA program, compiled it, and found that it uses @P0 LDG.E R0, [R2.64] to load from memory. By patching the device binary in the compiled ELF file so that it uses LDG.E.LTC128B.64.STRONG.GPU R0, [R2.64], P0 ; instead (an instruction extracted from cuBLAS), I was able to confirm that, at least in this simple case, the two patterns behave the same.
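
For reference, here is a minimal sketch of the kind of masked a+b program I mean. It is an illustration rather than my exact test code: the names masked_add and mask, the array size, and the launch configuration are all made up for this post, and whether the compiler actually emits a predicated @P0 LDG.E for the masked load depends on the CUDA version and optimization settings.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: the load of a[i] is only performed when mask[i] is
// nonzero, which gives the compiler a chance to predicate the LDG.
__global__ void masked_add(const float* a, const float* b, const int* mask,
                           float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = 0.0f;
        if (mask[i]) {
            x = a[i];  // the masked load of interest
        }
        out[i] = x + b[i];
    }
}

int main() {
    const int n = 1024;  // arbitrary size chosen for this sketch
    float *a, *b, *out;
    int* mask;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    cudaMallocManaged(&mask, n * sizeof(int));

    for (int i = 0; i < n; ++i) {
        a[i] = 1.0f;
        b[i] = 2.0f;
        mask[i] = i % 2;  // mask off every even element
    }

    masked_add<<<(n + 255) / 256, 256>>>(a, b, mask, out, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Host-side check: masked-off elements should equal b[i] alone.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        float expected = (mask[i] ? a[i] : 0.0f) + b[i];
        if (out[i] != expected) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(a);
    cudaFree(b);
    cudaFree(out);
    cudaFree(mask);
    return mismatches == 0 ? 0 : 1;
}
```

Assuming the file is saved as masked_add.cu, it can be built with nvcc -arch=sm_80 -o masked_add masked_add.cu and the SASS inspected with cuobjdump -sass masked_add. The host-side check is what I would rerun after patching the instruction to see whether the result changes.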

Now I wonder what the point is of introducing this new pattern (a predicate as an operand of load instructions) rather than using the conventional @P prefix. Does it bring any performance benefit?