What is the purpose of mma.sp::ordered_metadata?

I’ve noticed that the PTX ISA 8.5 adds an mma.sp::ordered_metadata instruction with very brief mention of what it does (it seems to be a more restricted version of mma.sp) but I can’t find anything saying why we would want to use it (rather than mma.sp). Is there a performance benefit on current or anticipated future hardware? Or are there some situations where mma.sp::ordered_metadata is supported but mma.sp is not?

From the PTX documentation:

mma.sp instruction may have substantially reduced performance on some target architectures. Hence, it is advised to use mma.sp::ordered_metadata instruction.

Thanks! I missed that on the first reading. It would be nice to know which target architectures though. In particular, if I’m using sm86 and sm89 is it worth upgrading from CUDA 12.1 to CUDA 12.5+ just for this or is it only for future GPUs? I don’t see a mention in the Ada or Hopper tuning guides but maybe it will be in the Blackwell tuning guide when that comes out…

A couple options:

  1. benchmark
  2. file a bug to request a document update for clarification on cases where “reduced performance” might be observed.
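For the benchmark option, a minimal sketch might look like the following. It is not taken from NVIDIA's documentation: the file name, the kernel and helper names, the launch sizes, and the placeholder operand values are all assumptions made for illustration. The metadata value 0xEEEEEEEE is chosen because, as I read the PTX documentation, every 4-bit group then holds ascending indices (2, 3), which should be acceptable to both variants. The loop is a dependent chain of MMAs, so it leans toward measuring latency rather than peak throughput.

```cuda
// mma_sp_bench.cu (hypothetical file name)
// Assumed build line; the ::ordered_metadata variant needs PTX ISA 8.5 (CUDA 12.5+):
//   nvcc -O3 -arch=sm_86 -o mma_sp_bench mma_sp_bench.cu
#include <cstdio>
#include <cuda_runtime.h>

// One m16n8k32 f16 sparse MMA with f32 accumulation, done in place (D = A*B + D).
// OPCODE is "mma.sp" or "mma.sp::ordered_metadata"; the sparsity selector is fixed to 0.
#define SPARSE_MMA(OPCODE, d, a, b, meta)                                      \
    asm volatile(OPCODE ".sync.aligned.m16n8k32.row.col.f32.f16.f16.f32 "     \
                 "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9,%10,%11}, "            \
                 "{%0,%1,%2,%3}, %12, 0x0;"                                   \
                 : "+f"((d)[0]), "+f"((d)[1]), "+f"((d)[2]), "+f"((d)[3])     \
                 : "r"((a)[0]), "r"((a)[1]), "r"((a)[2]), "r"((a)[3]),        \
                   "r"((b)[0]), "r"((b)[1]), "r"((b)[2]), "r"((b)[3]),        \
                   "r"(meta))

template <bool Ordered>
__global__ void mma_sp_loop(float* out, int iters) {
    // Placeholder operands: A holds pairs of f16 1.0, B holds zeros, so D stays finite.
    unsigned a[4] = {0x3C003C00u, 0x3C003C00u, 0x3C003C00u, 0x3C003C00u};
    unsigned b[4] = {0u, 0u, 0u, 0u};
    // Every 4-bit group is 0b1110 (indices 2 and 3, ascending); assumed valid for both variants.
    unsigned meta = 0xEEEEEEEEu;
    float d[4] = {0.f, 0.f, 0.f, 0.f};

    for (int i = 0; i < iters; ++i) {
        if (Ordered) {
            SPARSE_MMA("mma.sp::ordered_metadata", d, a, b, meta);
        } else {
            SPARSE_MMA("mma.sp", d, a, b, meta);
        }
    }
    // Store a result so each thread has an observable side effect.
    out[blockIdx.x * blockDim.x + threadIdx.x] = d[0] + d[1] + d[2] + d[3];
}

// Times one launch with CUDA events, after one untimed warm-up launch.
template <bool Ordered>
float time_ms(float* out, int blocks, int threads, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    mma_sp_loop<Ordered><<<blocks, threads>>>(out, iters);  // warm-up
    cudaEventRecord(start);
    mma_sp_loop<Ordered><<<blocks, threads>>>(out, iters);  // timed run
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int blocks = 1024, threads = 128, iters = 1 << 14;  // arbitrary sizes
    float* out = nullptr;
    if (cudaMalloc(&out, sizeof(float) * blocks * threads) != cudaSuccess) return 1;

    float unordered_ms = time_ms<false>(out, blocks, threads, iters);
    float ordered_ms = time_ms<true>(out, blocks, threads, iters);
    cudaError_t err = cudaDeviceSynchronize();

    printf("mma.sp:                   %.3f ms\n", unordered_ms);
    printf("mma.sp::ordered_metadata: %.3f ms\n", ordered_ms);
    printf("status: %s\n", cudaGetErrorString(err));
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Running the same binary source against each architecture of interest (for example by changing -arch) would show whether the two variants differ on that hardware. With a toolkit older than 12.5, the ordered variant is expected to fail to assemble, so only the mma.sp path can be timed there.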

My guess is that one of two things happened. Either mma.sp never worked correctly without ordered metadata, and the fix Nvidia had to make causes a performance loss (only with the new toolkit), even on current architectures.
Or Nvidia introduced the restricted variant to optimize for future architectures, and on current architectures both variants (ordered and unordered) run at the same speed even with the new toolkit.

In either case, I would either stay on the older CUDA toolkit version or, when upgrading, switch to ordered_metadata. Both options should be equally fast.

I would not expect the new fix to make a difference for code built with older toolkit versions, as long as you compile SASS for your intended target architectures rather than shipping only PTX, even if the CUDA program is then run with new drivers.

It would have been noticeable if there had been a major performance loss over the last few years (since sparse matrix operations were introduced with Ampere).

I have no detailed or insider information, so the above are just reasoned guesses.

Since it's PTX, additional knowledge can very often be gained by coding up test cases and studying the SASS. For example, if this is true:

then my guess would be that you could probably observe that in the SASS, there would be something that looks like work-around code instead of nice clean SASS that lines up directly with the PTX.
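A test case for that kind of SASS comparison could be as small as the sketch below. The file name, kernel names, and operand layout (9 words per thread) are hypothetical, and the commands in the header comment are an assumed workflow rather than anything prescribed by the documentation. Operands are loaded from memory so that the compiler sees them as unknown values.

```cuda
// sass_probe.cu (hypothetical file name)
// Assumed workflow (CUDA 12.5+ is needed for the ::ordered_metadata variant):
//   nvcc -arch=sm_86 -cubin -o sass_probe.cubin sass_probe.cu
//   cuobjdump -sass sass_probe.cubin

// One m16n8k32 f16 sparse MMA with f32 accumulation, done in place (D = A*B + D).
// OPCODE is "mma.sp" or "mma.sp::ordered_metadata"; the sparsity selector is fixed to 0.
#define SPARSE_MMA(OPCODE, d, a, b, meta)                                      \
    asm volatile(OPCODE ".sync.aligned.m16n8k32.row.col.f32.f16.f16.f32 "     \
                 "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9,%10,%11}, "            \
                 "{%0,%1,%2,%3}, %12, 0x0;"                                   \
                 : "+f"((d)[0]), "+f"((d)[1]), "+f"((d)[2]), "+f"((d)[3])     \
                 : "r"((a)[0]), "r"((a)[1]), "r"((a)[2]), "r"((a)[3]),        \
                   "r"((b)[0]), "r"((b)[1]), "r"((b)[2]), "r"((b)[3]),        \
                   "r"(meta))

template <bool Ordered>
__device__ __forceinline__ void probe_body(const unsigned* in, float* out) {
    // Assumed layout: each thread reads 4 words of A, 4 words of B, 1 word of metadata.
    const unsigned* p = in + threadIdx.x * 9;
    unsigned a[4] = {p[0], p[1], p[2], p[3]};
    unsigned b[4] = {p[4], p[5], p[6], p[7]};
    unsigned meta = p[8];
    float d[4] = {0.f, 0.f, 0.f, 0.f};

    if (Ordered) {
        SPARSE_MMA("mma.sp::ordered_metadata", d, a, b, meta);
    } else {
        SPARSE_MMA("mma.sp", d, a, b, meta);
    }

    // Write the accumulators back so the result is observable.
    out[threadIdx.x * 4 + 0] = d[0];
    out[threadIdx.x * 4 + 1] = d[1];
    out[threadIdx.x * 4 + 2] = d[2];
    out[threadIdx.x * 4 + 3] = d[3];
}

// extern "C" keeps the kernel names unmangled, which makes them easy to find in the dump.
extern "C" __global__ void probe_unordered(const unsigned* in, float* out) {
    probe_body<false>(in, out);
}

extern "C" __global__ void probe_ordered(const unsigned* in, float* out) {
    probe_body<true>(in, out);
}
```

Diffing the two kernels in the dump, and repeating with different -arch values, should show whether the unordered form turns into a single sparse MMA instruction or into a longer sequence around it.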