Introducing SASSquatch: Find Supported SASS Opcodes per GPU Architecture

christopher_owen · February 23, 2026, 6:50pm

In NVIDIA/CUDA land, SASS is the native assembly instruction set for NVIDIA GPUs—the stuff the hardware actually runs.

Introducing SASSquatch 🦶: a SASS “tracker” that helps you discover which opcodes are supported on a given GPU architecture—no more blurry sightings.

Inspired by sandsifter (for Intel), SASSquatch lets you confirm instruction availability across targets, for example:

sm_121a vs sm_121f (same runtime, different supported opcode sets)
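
A per-target comparison like this can be approximated by hand: assemble the same PTX file with `ptxas` once per target and compare the exit codes. The sketch below is not SASSquatch's own code. `ptxas_command` and `probe_targets` are hypothetical helper names, and it assumes `ptxas` from the CUDA toolkit is on `PATH` and that the PTX file's own `.target` directive is compatible with every target tried.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Targets taken from the run log in this thread.
TARGETS = ["sm_121", "sm_121a", "sm_121f"]


def ptxas_command(ptx_path, target, out_path):
    """Build the ptxas command line for one PTX file and one target."""
    return ["ptxas", f"-arch={target}", "-o", str(out_path), str(ptx_path)]


def probe_targets(ptx_path, targets=TARGETS):
    """Return {target: True/False} for whether ptxas accepts the PTX file."""
    if shutil.which("ptxas") is None:
        raise RuntimeError("ptxas not found on PATH")
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for target in targets:
            out_path = Path(tmp) / f"{target}.cubin"
            proc = subprocess.run(
                ptxas_command(ptx_path, target, out_path),
                capture_output=True,
                text=True,
            )
            # A zero exit code means the PTX assembled for this target.
            results[target] = proc.returncode == 0
    return results
```

A successful assemble only shows that the toolchain accepts the instruction for that target. Whether the hardware executes it is a separate question, which is what the on-hardware sweep is for.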

We’ve already spotted a few undocumented instructions in the wild 👀.

⚠️ Early days: it still needs love. Feedback, issues, and PRs are very welcome.

Repo: https://github.com/christopherowen/sassquatch

christopher_owen · February 23, 2026, 7:08pm

Fun results from one full run:

4,096 low-12 signatures tested on hardware
141 decode-accepted (VALID)
3,940 illegal-instruction traps
14 timeouts + 1 wrong-output edge case
Phase 1 PTX audit: 409 compile / 72 fail
Phase 2 map: 277 PTX→SASS low-12 entries
Deep sweep expanded observed mnemonic variants from 199 → 1,894
78 QMMA variants observed across 2 opcode families
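
For anyone wondering where 4,096 comes from: a 12-bit field has 2^12 = 4,096 possible values. Here is a minimal sketch of reading that field from raw instruction bytes. It assumes 16-byte instructions (the 0x10 offset stride in the Phase 2 table further down the thread), little-endian encoding, and that the `Opcode[11:0]` column is the lowest 12 bits of the instruction word. The function names are illustrative, not taken from the repo.

```python
INSTRUCTION_BYTES = 16  # assumed slot size, matching the 0x10 offset stride
LOW12_MASK = 0xFFF      # bits [11:0]; 2**12 == 4096 possible signatures


def low12_opcode(instruction: bytes) -> int:
    """Return bits [11:0] of one 16-byte instruction (little-endian assumed)."""
    if len(instruction) != INSTRUCTION_BYTES:
        raise ValueError(f"expected {INSTRUCTION_BYTES} bytes, got {len(instruction)}")
    word = int.from_bytes(instruction, "little")
    return word & LOW12_MASK


def opcodes_in_text(text_section: bytes) -> list[int]:
    """Split a .text section into 16-byte slots and extract each low-12 field."""
    if len(text_section) % INSTRUCTION_BYTES:
        raise ValueError("section size is not a multiple of the instruction size")
    return [
        low12_opcode(text_section[offset:offset + INSTRUCTION_BYTES])
        for offset in range(0, len(text_section), INSTRUCTION_BYTES)
    ]
```

Under those assumptions, a 384-byte `.text` section splits into 24 slots, which matches the instruction count the tool reports for the template kernel.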

tbraun96 · February 23, 2026, 10:48pm

This is great! We could use this to help ATLAS identify the best available instructions for the DGX Spark, from basic MMUL all the way up to GEMM.

tbraun96 · February 23, 2026, 10:49pm

Also, toss out “vibe coded”. That’s for people who didn’t know how to code or were inadequate before AI could do the translation from English to Code for us.

christopher_owen · February 24, 2026, 7:58am

As you recommend ;)

cho · February 24, 2026, 11:11am

all failed?

christopher_owen · February 24, 2026, 12:06pm

It looks like I was too optimistic in the cleanups after my last successful run. I’ve pushed some fixups. Try again?

cho · February 24, 2026, 12:22pm

python sassquatch.py --phase 1 2 3

  ███████╗ █████╗ ███████╗███████╗
  ██╔════╝██╔══██╗██╔════╝██╔════╝
  ███████╗███████║███████╗███████╗ quatch
  ╚════██║██╔══██║╚════██║╚════██║
  ███████║██║  ██║███████║███████║
  ╚══════╝╚═╝  ╚═╝╚══════╝╚══════╝

  NVIDIA GPU ISA auditing toolkit

  Date: 2026-02-24 20:20:12
  Phases: 1, 2, 3
  Targets: sm_121, sm_121a, sm_121f

========================================================================
  PHASE 1: PTX Compilation Audit
========================================================================
  Testing PTX instruction compilation across target architectures
  This discovers which instructions each GPU generation supports

  ptxas version: Build cuda_13.0.r13.0/compiler.36424714_0
  Targets: sm_121, sm_121a, sm_121f
  Total probes: 488 instructions x 3 targets = 1464

  [████████████████████████████████████████] 1464/1464 (100.0%)  225/s  b2r_pattern                              [sm_121f] FAIL   :bytes [sm_121] FAIL

  Completed in 6.5s

  Compilation Results by Target
  ------------------------------------------------------------
    sm_121       ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░    0/ 481 (0.0%)
    sm_121a      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░    0/ 481 (0.0%)
    sm_121f      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░    0/ 481 (0.0%)

  Results by Instruction Category
  ------------------------------------------------------------

    ARITHMETIC (0/136 on sm_121)

    ASYNC (0/2 on sm_121)

    ASYNC_FENCE (0/4 on sm_121)

    ATOMIC (0/5 on sm_121)

    ATOMIC_GLOBAL (0/7 on sm_121)

    BARRIER (0/10 on sm_121)

    BITWISE (0/15 on sm_121)

    COMPARISON (0/43 on sm_121)

    CONTROL (0/12 on sm_121)

    CONVERSION (0/77 on sm_121)

    CONVERSION_PACK (0/8 on sm_121)

    DATA_MOVEMENT (0/6 on sm_121)

    EXPERIMENTAL (0/2 on sm_121)

    FLOAT_SPECIAL (0/5 on sm_121)

    IMM32 (0/9 on sm_121)

    INTEGER_SIMD (0/7 on sm_121)

    MATRIX_LOAD (0/4 on sm_121)

    MATRIX_STORE (0/3 on sm_121)

    MEMORY (0/13 on sm_121)

    MEMORY_GENERIC (0/2 on sm_121)

    MEMORY_LOCAL (0/2 on sm_121)

    MEMORY_SHARED (0/9 on sm_121)

    MISC (0/8 on sm_121)

    MMA (0/8 on sm_121)

    MMA_BLOCK_SCALED (0/8 on sm_121)

    SM100_FEATURES (0/4 on sm_121)

    SPECIAL (0/6 on sm_121)

    SURFACE (0/5 on sm_121)

    TEXTURE (0/5 on sm_121)

    TMA (0/7 on sm_121)

    TMEM (0/2 on sm_121)

    UNIFORM (0/20 on sm_121)

    WARP (0/24 on sm_121)

    WARP_EXTRA (0/3 on sm_121)

  Anomaly Analysis
  ------------------------------------------------------------

    Summary:
      Universal (all targets):       0
      Universal fail:              481
      SM121-only:                    0
      SM100-only (missing SM121):    0
      Experimental success:          0

========================================================================
  PHASE 2: SASS Opcode Field Discovery
========================================================================
  Analyzing SASS instruction encoding to map opcode fields

  Target: sm_121
  Compiling template kernel...
  Template compiled (5480 bytes)
  Parsing cubin ELF...
  Found .text section: kernel=squatch_kernel, offset=0x700, size=384 bytes, 24 instructions
  Disassembling...

  Template Kernel SASS Instructions (sm_121)
  ------------------------------------------------------------
    Idx  Offset  Opcode[11:0]  Mnemonic      Operands
    ---  ------  ------------  --------      --------
      0  0x0000         0xb82  LDC           R1, c[0x0][0x37c]
      1  0x0010         0x919  S2R           R5, SR_TID.X
      2  0x0020         0xb82  LDC.64        R2, c[0x0][0x380]
      3  0x0030         0x7ac  LDCU          UR6, c[0x0][0x388]
      4  0x0040         0x431  HFMA2         R0, -RZ, RZ, 0, 0.000900745391845703125
      5  0x0050         0x7ac  LDCU.64       UR4, c[0x0][0x358]
      6  0x0060         0xc0c  ISETP.NE.U32.AND  P0, PT, RZ, UR6, PT
      7  0x0070         0x825  IMAD.WIDE.U32  R2, R5, 0x4, R2
      8  0x0080         0x807  SEL           R5, R0, 0x2a, P0
      9  0x0090         0x918  NOP
     10  0x00a0         0x986  STG.E         desc[UR4][R2.64], R5
     11  0x00b0         0x94d  EXIT
     12  0x00c0         0x947  BRA           `(.L_x_0)
     13  0x00d0         0x918  NOP
     14  0x00e0         0x918  NOP
     15  0x00f0         0x918  NOP
     16  0x0100         0x918  NOP
     17  0x0110         0x918  NOP
     18  0x0120         0x918  NOP
     19  0x0130         0x918  NOP
     20  0x0140         0x918  NOP
     21  0x0150         0x918  NOP
     22  0x0160         0x918  NOP
     23  0x0170         0x918  NOP

  Compiling diverse instructions to map opcode field...

  Using 0 compilable probes from Phase 1 (no re-compilation)
  0 compilable probes for SASS discovery
  Compiling at -O0, -O1, -O3 to maximize opcode coverage

  Discovered 0 opcodes from PTX probes in 0.0s

  Compiling CUDA C++ probes for uniform datapath instructions...
  43 CUDA C++ probe kernels

Error: No module named 'sass_probe'
Traceback (most recent call last):
  File "/home/edison/Downloads/SASSquatch/sassquatch.py", line 1089, in main
    phase2_results = run_phase2(args, phase1_results=phase1_results)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edison/Downloads/SASSquatch/sassquatch.py", line 438, in run_phase2
    cuda_opcodes = compile_and_discover_with_hex(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edison/Downloads/SASSquatch/src/cuda_probe.py", line 1847, in compile_and_discover_with_hex
    from sass_probe import CubinBuilder
ModuleNotFoundError: No module named 'sass_probe'

  Results saved to artifacts/scan_results.json

========================================================================
  Audit Complete
========================================================================
  Phase 1: 0/481 PTX instructions compile for sm_121

(.venv) edison@gb10:~/Downloads/SASSquatch$ pip install sass_probe
ERROR: Could not find a version that satisfies the requirement sass_probe (from versions: none)
ERROR: No matching distribution found for sass_probe

christopher_owen · February 24, 2026, 12:35pm

I'd recommend running it inside the provided Docker container for a more consistent experience.

I had never tried running it on the host, but I've made it toolchain-agnostic now, so it should work there too.

cho · February 24, 2026, 3:01pm

  Reference Database Cross-Reference
  ------------------------------------------------------------
    Documented Blackwell SASS instructions: 245
    Discovered & documented:   114
    Discovered, NOT documented: 7
      ? @!P0
      ? @!PT
      ? @P0
      ? @P1
      ? ERRBAR;
      ? F2FP
      ? NOP;
    Documented, not yet discovered: 131
    (Only ~274 of ~245 instructions probed via template)

    MXFP4-relevant instructions in reference:
        not probed  BGMMA            Bit MMA Across Warpgroup
        not probed  BMMA             Bit Matrix Multiply and Accumulate
            FOUND  DMMA             Matrix Multiply and Accumulate (FP64)
        not probed  HGMMA            FP16 MMA Across Warpgroup
            FOUND  HMMA             Matrix Multiply and Accumulate (FP16)
        not probed  IGMMA            Integer MMA Across Warpgroup
            FOUND  IMMA             Integer Matrix Multiply and Accumulate
        not probed  LDT              Load Matrix from Tensor Memory to RF [TMEM]
        not probed  LDTM             Load Matrix from Tensor Memory to RF [TMEM]
        not probed  OMMA             FP4 Matrix Multiply and Accumulate
        not probed  QGMMA            FP8 MMA Across Warpgroup
            FOUND  QMMA             FP8 Matrix Multiply and Accumulate
        not probed  STT              Store Matrix to Tensor Memory from RF [TMEM]
        not probed  STTM             Store Matrix to Tensor Memory from RF [TMEM]
        not probed  UTCATOMSWS       Atomic on SW State Register (TC) [TMEM]
        not probed  UTCBAR           Tensor Core Barrier [TMEM]
        not probed  UTCCP            Async copy Shared->Tensor Memory [TMEM]
        not probed  UTCHMMA          Uniform Matrix Multiply and Accumulate (FP16) [TMEM]
        not probed  UTCIMMA          Uniform Matrix Multiply and Accumulate (INT) [TMEM]
        not probed  UTCOMMA          Uniform Matrix Multiply and Accumulate (FP4) [TMEM]
        not probed  UTCQMMA          Uniform Matrix Multiply and Accumulate (FP8) [TMEM]
        not probed  UTCSHIFT         Shift elements in Tensor Memory [TMEM]
        not probed  UTMACMDFLUSH     TMA Command Flush

I’m not quite sure what this list means — does it imply the tensor core of GB10 does not support FP4 Matrix Multiply and Accumulate?

christopher_owen · February 24, 2026, 3:11pm

Great question. I haven't spent much time evaluating the results, and as you can see, some of the output is still strange.

I use FP8 activations and FP4 weights for the gpt-oss-120b vLLM work.

TMEM-family instructions are not supported on the chip.

The technique is to compile "every possible" opcode and disassemble the result to see what is there. "not probed" means we either didn't try that opcode or got its "parameters" wrong, so it never showed up in the disassembly; on its own it is not evidence that the instruction is unsupported. That logic could probably be improved.
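
One way the cross-reference step could be tightened is sketched below. This is an illustration of the idea, not the code in the repo. It assumes the stray `@!P0` and `NOP;` entries come from counting predicate guards and trailing semicolons as mnemonics, which is a guess based on the output above. `base_mnemonic` and `cross_reference` are hypothetical names.

```python
import re

# Optional predicate guard in front of the mnemonic, e.g. @P0, @!P0, @PT, @!UP1.
_GUARD = re.compile(r"^@!?U?P(?:\d+|T)\b\s*")


def base_mnemonic(sass_line: str) -> str:
    """Return the base mnemonic of one disassembled line, or '' if there is none."""
    text = sass_line.strip().rstrip(";").strip()
    text = _GUARD.sub("", text)
    if not text:
        return ""  # the line held only a predicate guard
    first_token = text.split()[0].rstrip(";")
    return first_token.split(".")[0]  # drop modifiers such as .E or .64


def cross_reference(discovered_lines, documented):
    """Bucket mnemonics into the three groups the report prints."""
    discovered = {m for m in map(base_mnemonic, discovered_lines) if m}
    documented = set(documented)
    return {
        "discovered_and_documented": sorted(discovered & documented),
        "discovered_not_documented": sorted(discovered - documented),
        "documented_not_discovered": sorted(documented - discovered),
    }
```

With this normalization, a bare `@P0` no longer counts as a discovered instruction and `NOP;` is matched against `NOP` in the reference list. The third bucket still only means "not seen in the disassembly", for the reasons given above.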
