SMT Optimizer: Emergent Structural Redundancy Discovery in Transformer Attention Mechanisms

Hi NVIDIA developer community,

I’m Samuel, an NVIDIA Inception member. I’d like to share two recent works on extreme compression for Edge AI and LLM optimization.

Work 1 — SMT V10: Adaptive Sparse Training [https://doi.org/10.5281/zenodo.20150258] An adaptive optimizer that applies dynamic gradient masking layer-by-layer during training.

  • Key PoC result: 93.79% sparsity with a 3.67% accuracy loss relative to Adam on Fashion-MNIST (used as a baseline proof of concept), stable across 3 random seeds.

  • Designed with neuromorphic hardware and Edge AI in mind: the benefit of this level of sparsity would be realized mainly on sparse-native silicon.
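
To make "dynamic gradient masking layer-by-layer" concrete for readers who have not opened the paper, here is a minimal NumPy sketch. It is not the SMT V10 implementation: the masking rule (keep the largest-magnitude gradient entries per layer), the plain SGD update, and the names `mask_gradient`, `masked_sgd_step`, and `keep_fractions` are all assumptions made for illustration.

```python
import numpy as np


def mask_gradient(grad, keep_fraction):
    """Keep only the largest-magnitude entries of one layer's gradient.

    Assumed rule (not taken from the SMT V10 paper): magnitude-based top-k.
    Ties at the threshold are all kept, so slightly more than k may survive.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    k = max(1, int(round(keep_fraction * grad.size)))
    magnitudes = np.abs(grad).ravel()
    # The k-th largest magnitude becomes the cut-off for this layer.
    threshold = np.partition(magnitudes, magnitudes.size - k)[magnitudes.size - k]
    mask = np.abs(grad) >= threshold
    return grad * mask, mask


def masked_sgd_step(params, grads, keep_fractions, lr=0.1):
    """One plain SGD step where each layer gets its own gradient mask."""
    updated = {}
    for name, weights in params.items():
        masked, _ = mask_gradient(grads[name], keep_fractions[name])
        updated[name] = weights - lr * masked
    return updated


def sparsity(x):
    """Fraction of exactly-zero entries."""
    return float(np.mean(x == 0))


# Toy example with two hypothetical layers and different keep fractions.
params = {"fc1": np.ones((2, 2)), "fc2": np.ones(4)}
grads = {
    "fc1": np.array([[0.1, -2.0], [0.5, 3.0]]),
    "fc2": np.array([0.05, 0.2, -1.0, 0.01]),
}
keep = {"fc1": 0.5, "fc2": 0.25}

masked_fc1, _ = mask_gradient(grads["fc1"], keep["fc1"])
new_params = masked_sgd_step(params, grads, keep)
print("fc1 gradient sparsity:", sparsity(masked_fc1))
```

In an adaptive scheme the per-layer keep fractions would change during training; here they are fixed constants so the masking step itself stays easy to follow.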

Work 2 — Crystal-SMT: Automatic Attention Bias Discovery [https://doi.org/10.5281/zenodo.20219077] A structural analysis tool that autonomously identifies redundant parameters during the training phase.

  • Key finding: The optimizer consistently eliminates ~36% of QKV attention biases while strictly preserving weight matrices (across 10 random seeds with ±0.78% variance).

  • This mirrors the bias-free design chosen manually in state-of-the-art models like LLaMA and PaLM, but here it emerges from gradient dynamics rather than manual ablation.
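
For anyone who wants to check a similar bias-versus-weight pattern on their own checkpoints, the measurement can be sketched roughly as follows. This is an assumed analysis procedure, not the Crystal-SMT tool: the parameter names (`attn.q_proj.bias` and so on), the tolerance `tol`, and both helper functions are hypothetical, and the toy arrays are made up rather than taken from the reported runs.

```python
import numpy as np


def pruned_fraction_by_kind(params, tol=1e-8):
    """Pooled fraction of near-zero entries, split into biases and weights.

    `params` maps names such as "attn.q_proj.bias" to arrays; the naming
    scheme is hypothetical and only the ".bias" / ".weight" suffix is used.
    """
    zeros = {"bias": 0, "weight": 0}
    totals = {"bias": 0, "weight": 0}
    for name, value in params.items():
        kind = name.rsplit(".", 1)[-1]
        if kind not in totals:
            continue
        zeros[kind] += int(np.sum(np.abs(value) <= tol))
        totals[kind] += value.size
    return {k: zeros[k] / totals[k] for k in totals if totals[k] > 0}


def summarize_over_seeds(fractions):
    """Mean and sample standard deviation of a per-seed metric."""
    values = np.asarray(fractions, dtype=float)
    return float(values.mean()), float(values.std(ddof=1))


# Made-up toy parameters, not results from the post.
toy = {
    "attn.q_proj.weight": np.array([[0.3, -0.1], [0.2, 0.4]]),
    "attn.q_proj.bias": np.array([0.0, 0.5]),
    "attn.k_proj.weight": np.array([[0.1, 0.2], [-0.3, 0.6]]),
    "attn.k_proj.bias": np.array([0.0, 0.0]),
    "attn.v_proj.weight": np.array([[0.7, 0.1], [0.2, -0.5]]),
    "attn.v_proj.bias": np.array([0.1, -0.2]),
}
report = pruned_fraction_by_kind(toy)
print(report)
```

Running `pruned_fraction_by_kind` once per seed and passing the bias fractions to `summarize_over_seeds` gives a mean and spread that can be compared across runs.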

I’m looking to connect with engineers working on sparse tensor operations, neuromorphic computing, or LLM compression. Happy to share the code and discuss potential validation strategies on NVIDIA hardware.

Best regards, Samuel David López Armenta