In NVIDIA/CUDA land, SASS is the native assembly instruction set for NVIDIA GPUs: the stuff the hardware actually runs.
Introducing SASSquatch 🦶: a SASS "tracker" that helps you discover which opcodes are supported on a given GPU architecture. No more blurry sightings.
Inspired by sandsifter (for Intel), SASSquatch lets you confirm instruction availability across targets, for example:
sm121a vs sm121f (same runtime, different supported opcode sets)
We've already spotted a few undocumented instructions in the wild.
⚠️ Early days: it still needs love. Feedback, issues, and PRs are very welcome.
This is great! We could use this to help ATLAS identify the best available instructions for the DGX Spark, from basic MMUL all the way up to GEMM.
Also, toss out "vibe coded". That's for people who didn't know how to code or were inadequate before AI could do the translation from English to Code for us.
Reference Database Cross-Reference
------------------------------------------------------------
Documented Blackwell SASS instructions: 245
Discovered & documented: 114
Discovered, NOT documented: 7
? @!P0
? @!PT
? @P0
? @P1
? ERRBAR;
? F2FP
? NOP;
Documented, not yet discovered: 131
(Only ~274 of ~245 instructions probed via template)
MXFP4-relevant instructions in reference:
not probed BGMMA Bit MMA Across Warpgroup
not probed BMMA Bit Matrix Multiply and Accumulate
FOUND DMMA Matrix Multiply and Accumulate (FP64)
not probed HGMMA FP16 MMA Across Warpgroup
FOUND HMMA Matrix Multiply and Accumulate (FP16)
not probed IGMMA Integer MMA Across Warpgroup
FOUND IMMA Integer Matrix Multiply and Accumulate
not probed LDT Load Matrix from Tensor Memory to RF [TMEM]
not probed LDTM Load Matrix from Tensor Memory to RF [TMEM]
not probed OMMA FP4 Matrix Multiply and Accumulate
not probed QGMMA FP8 MMA Across Warpgroup
FOUND QMMA FP8 Matrix Multiply and Accumulate
not probed STT Store Matrix to Tensor Memory from RF [TMEM]
not probed STTM Store Matrix to Tensor Memory from RF [TMEM]
not probed UTCATOMSWS Atomic on SW State Register (TC) [TMEM]
not probed UTCBAR Tensor Core Barrier [TMEM]
not probed UTCCP Async copy Shared->Tensor Memory [TMEM]
not probed UTCHMMA Uniform Matrix Multiply and Accumulate (FP16) [TMEM]
not probed UTCIMMA Uniform Matrix Multiply and Accumulate (INT) [TMEM]
not probed UTCOMMA Uniform Matrix Multiply and Accumulate (FP4) [TMEM]
not probed UTCQMMA Uniform Matrix Multiply and Accumulate (FP8) [TMEM]
not probed UTCSHIFT Shift elements in Tensor Memory [TMEM]
not probed UTMACMDFLUSH TMA Command Flush
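One possible reason that predicate guards such as @!P0 and tokens such as NOP; land in the "Discovered, NOT documented" bucket is that the first whitespace-separated token of each disassembled line is being treated as the opcode. Here is a minimal sketch of a normalization and cross-reference step under that assumption. SASSquatch's actual parser is not shown in this thread, so the function names, the regular expression, and the sample data are all hypothetical.

```python
import re

# Matches a SASS predicate guard such as @P0, @!P1, @PT, or @!UP0.
PREDICATE_GUARD = re.compile(r"^@!?U?P(T|\d+)$")


def normalize_opcode(instruction):
    """Return the base mnemonic of one disassembled instruction, or None.

    Assumes text such as '@!P0 BRA 0x70 ;' or 'NOP;'. Drops a leading
    predicate guard, any semicolons, and dotted modifiers.
    """
    tokens = instruction.replace(";", " ").split()
    if tokens and PREDICATE_GUARD.match(tokens[0]):
        tokens = tokens[1:]  # the guard is not an opcode
    if not tokens:
        return None
    return tokens[0].split(".")[0].upper()


def cross_reference(documented, discovered_raw):
    """Split opcodes into the three buckets used by the summary above."""
    discovered = {normalize_opcode(t) for t in discovered_raw} - {None}
    documented = set(documented)
    return {
        "discovered_and_documented": sorted(discovered & documented),
        "discovered_not_documented": sorted(discovered - documented),
        "documented_not_discovered": sorted(documented - discovered),
    }


if __name__ == "__main__":
    # Tiny illustrative subset, not the real 245-entry reference list.
    documented = {"NOP", "HMMA", "OMMA", "EXIT"}
    # Illustrative strings, not real SASSquatch output.
    raw = ["@!P0", "@P0 EXIT ;", "NOP;", "HMMA.16816.F32 R0, R4, R8, R0 ;", "F2FP"]
    for bucket, opcodes in cross_reference(documented, raw).items():
        print(bucket, opcodes)
```

With this normalization, a bare guard like @!P0 is dropped rather than counted, and NOP; is matched against the documented NOP entry.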
I'm not quite sure what this list means. Does it imply that the tensor core of GB10 does not support FP4 Matrix Multiply and Accumulate?
Great question. I haven't spent much time evaluating the results, and as you can see there is still some strange output.
I use FP8 activations and FP4 weights for the gpt-oss-120b vLLM work.
TMEM-family instructions are not supported on the chip.
The technique in use is to compile "every possible" opcode and disassemble the result to see what is there. "not probed" tells us that we either didn't try the opcode or got the "parameters" wrong, so it wasn't disassembled. Maybe that logic could be improved.
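One way that logic could be improved is to record why an opcode was not found, so that "never tried" and "tried but not disassembled" stop sharing a single "not probed" label. Here is a minimal sketch, assuming the prober keeps a set of opcodes it attempted and a set it saw in the disassembly. ProbeStatus and classify are hypothetical names, not part of SASSquatch.

```python
from enum import Enum


class ProbeStatus(Enum):
    FOUND = "FOUND"                  # appeared in the disassembly
    REJECTED = "rejected"            # attempted, but never disassembled
    NOT_ATTEMPTED = "not attempted"  # no template tried this opcode


def classify(reference, attempted, disassembled):
    """Map each reference opcode to a ProbeStatus.

    REJECTED does not prove the hardware lacks the opcode; the
    operands in the template may simply have been wrong.
    """
    statuses = {}
    for opcode in sorted(reference):
        if opcode in disassembled:
            statuses[opcode] = ProbeStatus.FOUND
        elif opcode in attempted:
            statuses[opcode] = ProbeStatus.REJECTED
        else:
            statuses[opcode] = ProbeStatus.NOT_ATTEMPTED
    return statuses


if __name__ == "__main__":
    # Illustrative inputs only; not real probe results.
    report = classify(
        reference={"HMMA", "OMMA", "UTCBAR"},
        attempted={"HMMA", "OMMA"},
        disassembled={"HMMA"},
    )
    for opcode, status in report.items():
        print(f"{status.value:<14} {opcode}")
```

Splitting the label this way would make a report like the one above easier to read: a "rejected" OMMA would point at the probe template, while a "not attempted" one would just mean nobody has written a template yet.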