Ampere SASS Annotation

I have just started working with Ampere targets and have noticed that many instructions in a SASS dump end in “!PT”.

Does anyone know the meaning?

PT stands for “predicate true”, i.e. a boolean TRUE. ! computes the complement, so !PT is a constant FALSE. This notation has been used since Maxwell, if memory serves.

Note that predicates are not necessarily used to predicate instruction execution. For example, the IMNMX and FMNMX instructions use a predicate to select whether the instruction computes the minimum or the maximum. Note that some instructions also allow combining of predicates.
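As a rough illustration of a predicate acting as a selector rather than an execution guard, here is a minimal Python model. The function name `imnmx` and the constant `PT` are made up for this sketch, and it assumes the usual reading in which a true predicate selects the minimum and a false one the maximum; check that against your own SASS dump before relying on it.

```python
# Hypothetical model of IMNMX semantics; not an official definition.
PT = True  # stands in for the constant "predicate true" register


def imnmx(a, b, pred):
    """Return min(a, b) when pred is true, max(a, b) otherwise (assumed convention)."""
    return min(a, b) if pred else max(a, b)


# "IMNMX R0, R1, R2, PT" would then read as a minimum ...
smallest = imnmx(3, 7, PT)
# ... and "IMNMX R0, R1, R2, !PT" as a maximum.
largest = imnmx(3, 7, not PT)
print(smallest, largest)
```

Under this reading, a trailing !PT on such an instruction changes what is computed; it does not disable the instruction.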

I find Ampere SASS cumbersome to read since the code is typically composed of three-input operations, which then have part of the functionality “fused off”, e.g. IADD3 with RZ as one operand. Logical operations are now all handled by LOP3, for which one needs a decode program to figure out what it does.
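A decode helper for LOP3 can be quite small. The sketch below assumes the commonly described convention in which the 8-bit immediate is a truth table indexed by the bits of the three inputs (a as bit 2, b as bit 1, c as bit 0), so that the table for an expression is obtained by evaluating it on 0xF0, 0xCC and 0xAA. The helper names `lop3`, `lut_of` and `live_inputs` are invented for this example.

```python
MASK32 = 0xFFFFFFFF


def lop3(a, b, c, lut):
    """Apply an 8-bit truth table bitwise to three 32-bit values."""
    result = 0
    for idx in range(8):
        if not (lut >> idx) & 1:
            continue  # this input combination maps to 0
        # idx bit 2 selects a, bit 1 selects b, bit 0 selects c
        ta = a if idx & 4 else ~a
        tb = b if idx & 2 else ~b
        tc = c if idx & 1 else ~c
        result |= ta & tb & tc
    return result & MASK32


def lut_of(f):
    """Build the truth table for a bitwise function of three inputs."""
    return f(0xF0, 0xCC, 0xAA) & 0xFF


def live_inputs(lut):
    """Report which of a, b, c the table actually depends on."""
    names = ""
    for name, bit in (("a", 4), ("b", 2), ("c", 1)):
        # The input matters if flipping it changes the output somewhere.
        if any(((lut >> i) ^ (lut >> (i ^ bit))) & 1 for i in range(8)):
            names += name
    return names


print(hex(lut_of(lambda a, b, c: a & b)))      # table for a plain AND
print(hex(lut_of(lambda a, b, c: a ^ b ^ c)))  # table for a three-way XOR
print(live_inputs(0xC0))                       # which inputs 0xC0 uses
```

With this convention, a table of 0xC0 depends only on the first two inputs, which is the “fused off” case where the third operand is typically RZ.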

Thanks Norbert. Perhaps the exclusive use of LOP3 for logic ops is to free up resources - I note SHR/SHL seem to have disappeared in favour of SHF.
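To see why a funnel shift can stand in for both plain shifts, here is a small Python model. It follows the documented behaviour of the CUDA intrinsics `__funnelshift_l` and `__funnelshift_r` (shift the 64-bit value hi:lo, keep the upper or lower 32 bits); how SHF itself is encoded in SASS is not something this sketch claims to reproduce.

```python
M32 = 0xFFFFFFFF


def funnelshift_l(lo, hi, shift):
    """Upper 32 bits of (hi:lo) shifted left by shift & 31."""
    wide = ((hi & M32) << 32) | (lo & M32)
    return ((wide << (shift & 31)) >> 32) & M32


def funnelshift_r(lo, hi, shift):
    """Lower 32 bits of (hi:lo) shifted right by shift & 31."""
    wide = ((hi & M32) << 32) | (lo & M32)
    return (wide >> (shift & 31)) & M32


x = 0x12345678
# A plain left shift: feed zero into the low half.
print(hex(funnelshift_l(0, x, 4)))
# A plain logical right shift: feed zero into the high half.
print(hex(funnelshift_r(x, 0, 4)))
# Using x for both halves gives a rotate.
print(hex(funnelshift_l(x, x, 4)))
```

So one three-input shift primitive covers SHL, SHR and rotates, with a zero operand playing the role of the “fused off” input.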

The “U” prefix on some instructions is also a change from the 6.1 SASS I’m used to. Nsight Compute indicates it means “Uniform”, but not sure what that means either.

U refers to the uniform data path. It’s not well documented, but it is designed to allow higher throughput of certain instructions (e.g. floating point) in the presence of a mixture of other instructions.

Thanks Robert. Interesting to see that this path is still used to a reasonable degree in code that has zero FP content.

It certainly looks as if, with Ampere, the ISA has completed a transition that started several GPU architecture generations ago. Given that the GPU data path has always supported three input operands for FMA (and FMAD before that), it makes sense to move to three-input primitives for everything, even if it means that the resulting machine code is no longer easy for humans to read.

Theoretically, this general move to three-input operations also raises a question about energy efficiency, but presumably that has been addressed through suitable hardware mechanisms that power down unused parts of the computational cores, e.g. when a three-input IMAD is used to perform an integer multiply.
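The way three-input primitives collapse into familiar two-input operations can be modelled in a few lines of Python. The functions `iadd3` and `imad` below are illustrative models with 32-bit wraparound, and `RZ` stands in for the zero register; the operand patterns are how such instructions commonly appear in dumps, not a statement about the hardware encoding.

```python
M32 = 0xFFFFFFFF
RZ = 0  # stands in for the always-zero register


def iadd3(a, b, c):
    """Three-input integer add, wrapped to 32 bits."""
    return (a + b + c) & M32


def imad(a, b, c):
    """Integer multiply-add a * b + c, wrapped to 32 bits."""
    return (a * b + c) & M32


print(iadd3(5, 7, RZ))   # behaves as a two-input add
print(imad(6, 7, RZ))    # behaves as a plain multiply
print(imad(RZ, RZ, 9))   # behaves as a register move
```

Reading a dump then becomes a matter of spotting which operand is RZ and mentally dropping it.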