Introducing SASSquatch: Find Supported SASS Opcodes per GPU Architecture

In NVIDIA/CUDA land, SASS is the native assembly instruction set for NVIDIA GPUs: the stuff the hardware actually runs.

Introducing SASSquatch 🦢: a SASS “tracker” that helps you discover which opcodes are supported on a given GPU architecture. No more blurry sightings.

Inspired by sandsifter (for Intel), SASSquatch lets you confirm instruction availability across targets, for example:

  • sm_121a vs sm_121f (same runtime, different supported opcode sets)

We’ve already spotted a few undocumented instructions in the wild 👀.

⚠️ Early days: it still needs love. Feedback, issues, and PRs are very welcome.

Repo: https://github.com/christopherowen/sassquatch

Fun results from one full run:

  • 4,096 low-12 signatures tested on hardware

  • 141 decode-accepted (VALID)

  • 3,940 illegal-instruction traps

  • 14 timeouts + 1 wrong-output edge case

  • Phase 1 PTX audit: 409 compile / 72 fail

  • Phase 2 map: 277 PTX→SASS low-12 entries

  • Deep sweep expanded observed mnemonic variants from 199 → 1,894

  • 78 QMMA variants observed across 2 opcode families
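For anyone wondering where 4,096 comes from: it is the size of a 12-bit field (2^12), and the four hardware outcomes in the list above add up to exactly that. Here is a rough sketch of the idea, not the tool's actual code. It assumes each SASS instruction is 16 bytes, little-endian, and that the "low-12 signature" is bits [11:0] of the first 64-bit word; the helper names and the 0xB82 template value are made up for illustration.

```python
import struct

LOW12_MASK = 0xFFF  # bits [11:0]


def low12(instr: bytes) -> int:
    """Return bits [11:0] of a 16-byte instruction (assumed layout)."""
    if len(instr) != 16:
        raise ValueError("expected a 16-byte instruction")
    lo_word, _hi_word = struct.unpack("<QQ", instr)
    return lo_word & LOW12_MASK


def with_low12(instr: bytes, signature: int) -> bytes:
    """Return a copy of instr with bits [11:0] replaced by signature."""
    if not 0 <= signature <= LOW12_MASK:
        raise ValueError("signature must fit in 12 bits")
    lo_word, hi_word = struct.unpack("<QQ", instr)
    lo_word = (lo_word & ~LOW12_MASK) | signature
    return struct.pack("<QQ", lo_word, hi_word)


# Outcome counts quoted in the post.
outcomes = {
    "valid": 141,
    "illegal_instruction": 3940,
    "timeout": 14,
    "wrong_output": 1,
}

# Hypothetical template instruction; only its low 12 bits get swept.
template = struct.pack("<QQ", 0x0000000000000B82, 0)
candidates = [with_low12(template, sig) for sig in range(1 << 12)]

# Every 12-bit pattern is covered once, and the outcomes account for all of them.
assert len(candidates) == sum(outcomes.values()) == 4096
```

Sweeping only the low 12 bits keeps a full run small enough to execute every pattern on real hardware, at the cost of not exploring the rest of the encoding.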

This is great! We could use this to help ATLAS identify the best possible instruction set for the DGX Spark, all the way from basic MMUL up to GEMM.

Also, toss out “vibe coded”. That’s for people who didn’t know how to code or were inadequate before AI could do the translation from English to Code for us.

As you recommend ;)

all failed?

It looks like I was too optimistic in the cleanups after my last successful run. I’ve pushed some fixups. Try again?

python sassquatch.py --phase 1 2 3

  ███████╗ █████╗ ███████╗███████╗
  ██╔════╝██╔══██╗██╔════╝██╔════╝
  ███████╗███████║███████╗███████╗ quatch
  ╚════██║██╔══██║╚════██║╚════██║
  ███████║██║  ██║███████║███████║
  ╚══════╝╚═╝  ╚═╝╚══════╝╚══════╝

  NVIDIA GPU ISA auditing toolkit

  Date: 2026-02-24 20:20:12
  Phases: 1, 2, 3
  Targets: sm_121, sm_121a, sm_121f

========================================================================
  PHASE 1: PTX Compilation Audit
========================================================================
  Testing PTX instruction compilation across target architectures
  This discovers which instructions each GPU generation supports

  ptxas version: Build cuda_13.0.r13.0/compiler.36424714_0
  Targets: sm_121, sm_121a, sm_121f
  Total probes: 488 instructions x 3 targets = 1464

  [████████████████████████████████████████] 1464/1464 (100.0%)  225/s  b2r_pattern                              [sm_121f] FAIL   :bytes [sm_121] FAIL

  Completed in 6.5s

  Compilation Results by Target
  ------------------------------------------------------------
    sm_121       ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░    0/ 481 (0.0%)
    sm_121a      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░    0/ 481 (0.0%)
    sm_121f      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░    0/ 481 (0.0%)

  Results by Instruction Category
  ------------------------------------------------------------

    ARITHMETIC (0/136 on sm_121)

    ASYNC (0/2 on sm_121)

    ASYNC_FENCE (0/4 on sm_121)

    ATOMIC (0/5 on sm_121)

    ATOMIC_GLOBAL (0/7 on sm_121)

    BARRIER (0/10 on sm_121)

    BITWISE (0/15 on sm_121)

    COMPARISON (0/43 on sm_121)

    CONTROL (0/12 on sm_121)

    CONVERSION (0/77 on sm_121)

    CONVERSION_PACK (0/8 on sm_121)

    DATA_MOVEMENT (0/6 on sm_121)

    EXPERIMENTAL (0/2 on sm_121)

    FLOAT_SPECIAL (0/5 on sm_121)

    IMM32 (0/9 on sm_121)

    INTEGER_SIMD (0/7 on sm_121)

    MATRIX_LOAD (0/4 on sm_121)

    MATRIX_STORE (0/3 on sm_121)

    MEMORY (0/13 on sm_121)

    MEMORY_GENERIC (0/2 on sm_121)

    MEMORY_LOCAL (0/2 on sm_121)

    MEMORY_SHARED (0/9 on sm_121)

    MISC (0/8 on sm_121)

    MMA (0/8 on sm_121)

    MMA_BLOCK_SCALED (0/8 on sm_121)

    SM100_FEATURES (0/4 on sm_121)

    SPECIAL (0/6 on sm_121)

    SURFACE (0/5 on sm_121)

    TEXTURE (0/5 on sm_121)

    TMA (0/7 on sm_121)

    TMEM (0/2 on sm_121)

    UNIFORM (0/20 on sm_121)

    WARP (0/24 on sm_121)

    WARP_EXTRA (0/3 on sm_121)

  Anomaly Analysis
  ------------------------------------------------------------

    Summary:
      Universal (all targets):       0
      Universal fail:              481
      SM121-only:                    0
      SM100-only (missing SM121):    0
      Experimental success:          0

========================================================================
  PHASE 2: SASS Opcode Field Discovery
========================================================================
  Analyzing SASS instruction encoding to map opcode fields

  Target: sm_121
  Compiling template kernel...
  Template compiled (5480 bytes)
  Parsing cubin ELF...
  Found .text section: kernel=squatch_kernel, offset=0x700, size=384 bytes, 24 instructions
  Disassembling...

  Template Kernel SASS Instructions (sm_121)
  ------------------------------------------------------------
    Idx  Offset  Opcode[11:0]  Mnemonic      Operands
    ---  ------  ------------  --------      --------
      0  0x0000         0xb82  LDC           R1, c[0x0][0x37c]
      1  0x0010         0x919  S2R           R5, SR_TID.X
      2  0x0020         0xb82  LDC.64        R2, c[0x0][0x380]
      3  0x0030         0x7ac  LDCU          UR6, c[0x0][0x388]
      4  0x0040         0x431  HFMA2         R0, -RZ, RZ, 0, 0.000900745391845703125
      5  0x0050         0x7ac  LDCU.64       UR4, c[0x0][0x358]
      6  0x0060         0xc0c  ISETP.NE.U32.AND  P0, PT, RZ, UR6, PT
      7  0x0070         0x825  IMAD.WIDE.U32  R2, R5, 0x4, R2
      8  0x0080         0x807  SEL           R5, R0, 0x2a, P0
      9  0x0090         0x918  NOP
     10  0x00a0         0x986  STG.E         desc[UR4][R2.64], R5
     11  0x00b0         0x94d  EXIT
     12  0x00c0         0x947  BRA           `(.L_x_0)
     13  0x00d0         0x918  NOP
     14  0x00e0         0x918  NOP
     15  0x00f0         0x918  NOP
     16  0x0100         0x918  NOP
     17  0x0110         0x918  NOP
     18  0x0120         0x918  NOP
     19  0x0130         0x918  NOP
     20  0x0140         0x918  NOP
     21  0x0150         0x918  NOP
     22  0x0160         0x918  NOP
     23  0x0170         0x918  NOP

  Compiling diverse instructions to map opcode field...

  Using 0 compilable probes from Phase 1 (no re-compilation)
  0 compilable probes for SASS discovery
  Compiling at -O0, -O1, -O3 to maximize opcode coverage

  Discovered 0 opcodes from PTX probes in 0.0s

  Compiling CUDA C++ probes for uniform datapath instructions...
  43 CUDA C++ probe kernels

Error: No module named 'sass_probe'
Traceback (most recent call last):
  File "/home/edison/Downloads/SASSquatch/sassquatch.py", line 1089, in main
    phase2_results = run_phase2(args, phase1_results=phase1_results)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edison/Downloads/SASSquatch/sassquatch.py", line 438, in run_phase2
    cuda_opcodes = compile_and_discover_with_hex(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edison/Downloads/SASSquatch/src/cuda_probe.py", line 1847, in compile_and_discover_with_hex
    from sass_probe import CubinBuilder
ModuleNotFoundError: No module named 'sass_probe'

  Results saved to artifacts/scan_results.json

========================================================================
  Audit Complete
========================================================================
  Phase 1: 0/481 PTX instructions compile for sm_121

(.venv) edison@gb10:~/Downloads/SASSquatch$ pip install sass_probe
ERROR: Could not find a version that satisfies the requirement sass_probe (from versions: none)
ERROR: No matching distribution found for sass_probe
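Since pip finds no `sass_probe` distribution, one guess is that it is meant to be a module inside the repo that simply is not on `sys.path` when running from the host. That is an assumption; the traceback only shows that `src/cuda_probe.py` imports it. A small sketch for checking this, with hypothetical helper names that are not part of SASSquatch:

```python
import importlib
import importlib.util
import sys
from pathlib import Path
from typing import Optional


def locate_local_module(name: str, repo_root: Path) -> Optional[Path]:
    """Find name.py or a package directory called name under repo_root."""
    for candidate in sorted(repo_root.rglob(f"{name}.py")):
        return candidate
    for init_file in sorted(repo_root.rglob("__init__.py")):
        if init_file.parent.name == name:
            return init_file.parent
    return None


def ensure_importable(name: str, repo_root: Path) -> bool:
    """Return True if name can be imported, adding its folder to sys.path if needed."""
    if importlib.util.find_spec(name) is not None:
        return True
    found = locate_local_module(name, repo_root)
    if found is None:
        return False  # not in the checkout either, so it was never shipped
    sys.path.insert(0, str(found.parent))
    importlib.invalidate_caches()
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Run from the repository checkout; prints False if the module is missing.
    print(ensure_importable("sass_probe", Path.cwd()))
```

If this prints `False`, the file is not in the checkout at all and only a fix on the repo side will help.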

I could maybe recommend trying the run inside the provided docker container for a more consistent experience.

I had never tried running it in the host, but I’ve made it toolchain agnostic now so it should do the thing.

  Reference Database Cross-Reference
  ------------------------------------------------------------
    Documented Blackwell SASS instructions: 245
    Discovered & documented:   114
    Discovered, NOT documented: 7
      ? @!P0
      ? @!PT
      ? @P0
      ? @P1
      ? ERRBAR;
      ? F2FP
      ? NOP;
    Documented, not yet discovered: 131
    (Only ~274 of ~245 instructions probed via template)

    MXFP4-relevant instructions in reference:
        not probed  BGMMA            Bit MMA Across Warpgroup
        not probed  BMMA             Bit Matrix Multiply and Accumulate
            FOUND  DMMA             Matrix Multiply and Accumulate (FP64)
        not probed  HGMMA            FP16 MMA Across Warpgroup
            FOUND  HMMA             Matrix Multiply and Accumulate (FP16)
        not probed  IGMMA            Integer MMA Across Warpgroup
            FOUND  IMMA             Integer Matrix Multiply and Accumulate
        not probed  LDT              Load Matrix from Tensor Memory to RF [TMEM]
        not probed  LDTM             Load Matrix from Tensor Memory to RF [TMEM]
        not probed  OMMA             FP4 Matrix Multiply and Accumulate
        not probed  QGMMA            FP8 MMA Across Warpgroup
            FOUND  QMMA             FP8 Matrix Multiply and Accumulate
        not probed  STT              Store Matrix to Tensor Memory from RF [TMEM]
        not probed  STTM             Store Matrix to Tensor Memory from RF [TMEM]
        not probed  UTCATOMSWS       Atomic on SW State Register (TC) [TMEM]
        not probed  UTCBAR           Tensor Core Barrier [TMEM]
        not probed  UTCCP            Async copy Shared->Tensor Memory [TMEM]
        not probed  UTCHMMA          Uniform Matrix Multiply and Accumulate (FP16) [TMEM]
        not probed  UTCIMMA          Uniform Matrix Multiply and Accumulate (INT) [TMEM]
        not probed  UTCOMMA          Uniform Matrix Multiply and Accumulate (FP4) [TMEM]
        not probed  UTCQMMA          Uniform Matrix Multiply and Accumulate (FP8) [TMEM]
        not probed  UTCSHIFT         Shift elements in Tensor Memory [TMEM]
        not probed  UTMACMDFLUSH     TMA Command Flush

I’m not quite sure what this list means. Does it imply the tensor core of GB10 does not support FP4 Matrix Multiply and Accumulate?

Great question. I haven’t spent much time evaluating the results and, as you can see, there is still some strange output.
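Some of the strange entries in the "Discovered, NOT documented" list look like parsing leftovers rather than real opcodes: `@!P0`, `@!PT`, `@P0` and `@P1` read like predicate guards, and `ERRBAR;` and `NOP;` still carry the statement-ending semicolon. A minimal sketch of a normaliser, assuming the disassembly text has the form `[@guard] MNEMONIC[.MOD...] operands ;` (the function name is hypothetical and not from the repo):

```python
import re
from typing import Optional

# Predicate guards such as @P0, @!P0, @PT, @!PT, @UP0 (assumed syntax).
_GUARD = re.compile(r"^@!?U?P(?:\d+|T)$")


def base_mnemonic(sass_text: str) -> Optional[str]:
    """Return the base mnemonic of one disassembled instruction, or None."""
    tokens = sass_text.strip().split()
    # Drop any leading predicate guard so it is not mistaken for an opcode.
    while tokens and _GUARD.match(tokens[0]):
        tokens.pop(0)
    if not tokens:
        return None
    # Strip the trailing ';' and any '.MODIFIER' suffixes.
    name = tokens[0].rstrip(";").split(".")[0]
    return name or None


print(base_mnemonic("@!P0 BRA `(.L_x_0) ;"))
print(base_mnemonic("NOP;"))
print(base_mnemonic("LDC.64 R2, c[0x0][0x380] ;"))
```

With something like this in front of the reference lookup, `NOP;` and `ERRBAR;` would be looked up as `NOP` and `ERRBAR`, and the bare guards would drop out of the list.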

I use activations in fp8 and weights in fp4 for the gpt-oss-120b vllm work.

TMEM-family instructions are not supported on the chip.

The technique in use is to compile ‘every possible’ opcode and disassemble the result to see what is there. “not probed” tells us we either didn’t try the opcode or got its ‘parameters’ wrong, so it never showed up in the disassembly. Maybe that logic could be improved.
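Roughly, the compile-then-disassemble loop can be sketched like this. It is a simplified illustration, not the code in the repo: it assumes `ptxas` and `cuobjdump` from the CUDA toolkit are on `PATH`, and the function names are made up. Note that a probe that fails to compile and a probe that was never attempted both end up as "not probed", which is why that label says nothing about hardware support.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path
from typing import List, Optional


def build_commands(ptx: Path, cubin: Path, arch: str) -> List[List[str]]:
    """Command lines for compiling a PTX probe and dumping its SASS."""
    return [
        ["ptxas", f"-arch={arch}", "-o", str(cubin), str(ptx)],
        ["cuobjdump", "-sass", str(cubin)],
    ]


def disassemble(ptx: Path, arch: str = "sm_121") -> Optional[str]:
    """Return SASS text for a probe, or None if a tool is missing or a step fails."""
    if shutil.which("ptxas") is None or shutil.which("cuobjdump") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        cubin = Path(tmp) / "probe.cubin"
        compile_cmd, dump_cmd = build_commands(ptx, cubin, arch)
        if subprocess.run(compile_cmd, capture_output=True).returncode != 0:
            return None  # wrong 'parameters' or unsupported on this target
        dump = subprocess.run(dump_cmd, capture_output=True, text=True)
        return dump.stdout if dump.returncode == 0 else None


def status(mnemonic: str, sass_text: Optional[str]) -> str:
    """Report FOUND only if the mnemonic shows up in the disassembly."""
    if sass_text is None:
        return "not probed"
    for line in sass_text.splitlines():
        for token in line.split():
            if token.rstrip(";").split(".")[0] == mnemonic:
                return "FOUND"
    return "not probed"
```

One possible improvement along these lines would be to keep "compile failed" and "never attempted" as separate states, so the report can distinguish them.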