Thanks for the reply! That matches what I’m seeing (NVFP4/MXFP4 underperforming vs AWQ 4-bit on GB10).
Do you know which specific kernel path is missing/slow on sm121 (MoE group GEMM vs attention vs packing/padding)? Also, are there recommended vLLM/FlashInfer/Triton/PyTorch/CUDA versions for Spark right now, and any upstream issues/PRs to track?
I can test patches and provide Nsight traces.
I’m also willing to try contributing, but I’m lacking a little direction. I put together a patch to address sm121 gating, but that’s clearly not enough: Run VLLM in Spark - #118 by christopher_owen
There is also this contribution, new today: feat: Add SM121/GB10 (DGX Spark) Blackwell-class GPU support by seli-equinix · Pull Request #31740 · vllm-project/vllm · GitHub
As well as this contribution, which doesn’t appear to have been well received: [Bugfix] Add SM 12.1 support by ohsono · Pull Request #31607 · vllm-project/vllm · GitHub
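To make "sm121 gating" concrete, here is a minimal sketch of the kind of check I mean. Both gate functions and their capability sets are hypothetical and are not copied from vLLM or from either PR above. On a real machine the capability tuple would come from `torch.cuda.get_device_capability()`; it is hard-coded here so the snippet runs without a GPU.

```python
from typing import Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (12, 1) for sm121


def exact_gate(cap: Capability) -> bool:
    """Hypothetical strict gate: only explicitly listed capabilities pass."""
    return cap in {(10, 0), (12, 0)}


def family_gate(cap: Capability) -> bool:
    """Hypothetical relaxed gate: any minor revision of major 10 or 12 passes."""
    return cap[0] in {10, 12}


if __name__ == "__main__":
    gb10 = (12, 1)  # assumption: GB10 reports compute capability 12.1 (sm121)
    print("exact gate passes: ", exact_gate(gb10))
    print("family gate passes:", family_gate(gb10))
```

Relaxing a gate like this only lets the code path be selected. The kernels behind it still have to be built for an architecture the device can actually run, which is probably why my patch alone wasn’t enough.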
On the FlashInfer side, we see:
committed 08:12PM - 02 Sep 25 UTC
---------
Co-authored-by: Vincent Huang <vincenth@nvidia.com>
Co-authored-by: Yong Wu <yowu@nvidia.com>
Co-authored-by: Sunghyun Park <sunghyunp@nvidia.com>
Co-authored-by: Yunzhe Qiu <yunzheq@nvidia.com>
Co-authored-by: Brian Ryu <bryu@nvidia.com>
Co-authored-by: Ka-Hyun Nam <knam@nvidia.com>
Co-authored-by: yzh119 <zihaoy@nvidia.com>
Co-authored-by: Zihao Ye <expye@outlook.com>
committed 09:54AM - 13 Nov 25 UTC
## 📌 Description
Thor and Spark support when wheels are generating
## 🔍 Related Issues
Output says that is not compatible. Only with JIT is working.
## Summary by CodeRabbit
* **New Features**
  * Broadened GPU architecture support to include additional newer architectures.
* **Documentation**
  * Updated README and installation docs to show the revised CUDA architecture example list.
* **Chores**
  * Adjusted release/nightly workflows and build scripts to select architectures using an expanded CUDA-version threshold and branching logic.
* **Performance**
  * Extended architecture-specific build/runtime handling to cover an additional GPU architecture affecting memory-related behavior.
---------
Co-authored-by: Zihao Ye <expye@outlook.com>
Co-authored-by: yzh119 <zihaoy@nvidia.com>
(For example, I don’t understand why there are different build strings for CUDA ‘< 13.0’ and the rest, and therefore what the difference is between 12.0a and 12.0f.
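To pin down what I’m asking, here is the arch-list selection from the workflow excerpt below, re-expressed as a small Python sketch. The two strings are copied from the `FLASHINFER_CUDA_ARCH_LIST` line. The function names and the numeric version comparison are my own assumptions, not FlashInfer code.

```python
from typing import Tuple

# Copied verbatim from the workflow's FLASHINFER_CUDA_ARCH_LIST expression.
ARCHS_BEFORE_CUDA_13 = "7.5 8.0 8.9 9.0a 10.0a 12.0a"
ARCHS_CUDA_13_PLUS = "7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f"


def parse_version(version: str) -> Tuple[int, int]:
    """Turn '12.8' into (12, 8) so versions compare numerically."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def select_arch_list(cuda_version: str) -> str:
    """Hypothetical helper mirroring the workflow's '< 13.0' branch."""
    if parse_version(cuda_version) < (13, 0):
        return ARCHS_BEFORE_CUDA_13
    return ARCHS_CUDA_13_PLUS


if __name__ == "__main__":
    for cuda in ("12.8", "13.0"):
        print(cuda, "->", select_arch_list(cuda))
```

My tentative reading, which I’d like someone to confirm: the `a` suffix marks an architecture-specific target that runs only on that exact compute capability, while the `f` suffix (introduced with CUDA 12.9, as far as I can tell) marks a family-specific target that also runs on later GPUs in the same family. If that’s right, `12.0f` could cover sm121 where `12.0a` would not.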
- name: Checkout code
  uses: actions/checkout@v4
  with:
    ref: ${{ github.event_name == 'pull_request' && github.head_ref || inputs.tag }}
    submodules: true
- name: Build wheel in container
  env:
    DOCKER_IMAGE: ${{ matrix.arch == 'aarch64' && format('pytorch/manylinuxaarch64-builder:cuda{0}', matrix.cuda) || format('pytorch/manylinux2_28-builder:cuda{0}', matrix.cuda) }}
    FLASHINFER_CUDA_ARCH_LIST: ${{ matrix.cuda < '13.0' && '7.5 8.0 8.9 9.0a 10.0a 12.0a' || '7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f' }}
  run: |
    # Extract CUDA major and minor versions
    CUDA_MAJOR=$(echo "${{ matrix.cuda }}" | cut -d'.' -f1)
    CUDA_MINOR=$(echo "${{ matrix.cuda }}" | cut -d'.' -f2)
    export CUDA_MAJOR
    export CUDA_MINOR
    export FLASHINFER_LOCAL_VERSION="cu${CUDA_MAJOR}${CUDA_MINOR}"
    chown -R $(id -u):$(id -g) ${{ github.workspace }}
    mkdir -p ${{ github.workspace }}/ci-cache