## Summary
I have been attempting to serve **Qwen3-Next-80B-A3B-Thinking** (both NVFP4 and FP8 variants)
via `trtllm-serve` on a single DGX Spark GB10 (128GB unified memory) using the official
`nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc7` container.
After extensive debugging over multiple sessions, I’ve hit a series of cascading failures
that appear to be fundamental blockers specific to the DGX Spark ARM (aarch64) + GB10 platform.
I’m sharing the full failure sequence here in the hope that NVIDIA can confirm whether
this model/platform combination is officially supported, and provide a workaround or roadmap.
## Environment
| Item | Value |
|—|—|
| Hardware | NVIDIA DGX Spark (GB10, aarch64 ARM) |
| Unified Memory | 128 GB |
| GPU | NVIDIA GB10 Blackwell |
| Container | `nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc7` |
| TRT-LLM Version | 1.3.0rc7 |
| PyTorch Version | 2.10.0a0 (inside container) |
| Models Tested | `nvidia/Qwen3-Next-80B-A3B-Thinking-NVFP4`, `Qwen/Qwen3-Next-80B-A3B-Thinking-FP8` |
**Error:**
```
ValueError: LLM got invalid argument: pytorch_backend_config
```
**Observation:** `pytorch_backend_config` as a YAML block is rejected by `trtllm-serve` in 1.3.0rc7.
This is despite the field being documented in official TRT-LLM examples.
When fields are moved to top-level, `use_cuda_graph` and `autotuner_enabled` are also rejected
as invalid top-level arguments.
-–
### Step 2 — Mamba hybrid cache + `enable_block_reuse` conflict
After resolving config issues, encountered:
```
AssertionError: mamba hybrid cache requires block reuse to be disabled in KV cache config
```
**Root cause:** Qwen3-Next uses a Mamba hybrid architecture. `enable_block_reuse: true`
is incompatible with this architecture and must be forced to `false`.
-–
### Step 3 — CUDA Illegal Instruction (NVFP4 — Triton kernel)
After disabling block reuse and CUDA graph, the NVFP4 model crashes during generation:
```
RuntimeError: Triton Error [CUDA]: an illegal instruction was encountered
```
**Stack trace root:**
```
modeling_qwen3_next.py line 332:
fused_qkvzba_split_reshape_cat_kernel\[grid\](
```
**Observation:** This is a Triton kernel compiled for x86/SM10x Blackwell.
The DGX Spark GB10 is an **ARM aarch64** system, and this Triton kernel appears
to contain instructions incompatible with the ARM architecture.
This is a **platform-level incompatibility**, not a config issue.
-–
### Step 4 — Autotuner IndexError (FP8 — fp8_block_scaling_gemm)
Switching to the FP8 variant (`Qwen/Qwen3-Next-80B-A3B-Thinking-FP8`), the model loads
successfully (74.85 GB), but crashes during autotuner warmup:
```
IndexError: list assignment index out of range
File “autotuner.py”, line 1406, in _find_nearest_profile
base_profile\[spec.input_idx\]\[spec.dim_idx\] = -1
\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~^^^^^^^^^^^^^^
```
**Call path:**
```
fp8_block_scaling_gemm → tuner.choose_one → profiling_cache.search_cache → _find_nearest_profile
```
**Observation:** This is a known regression bug (GitHub Issue #10679) where the
autotuner crashes with MoE + FP8 configurations. The bug has been confirmed but
has no fix in 1.3.0rc7. Setting `TLLM_DISABLE_AUTOTUNER=1` has no effect.
There is no documented way to disable the autotuner via `extra_llm_api_options` in this version.
-–
## Questions for NVIDIA
1. **Is Qwen3-Next-80B (NVFP4 or FP8) officially supported on DGX Spark GB10 (aarch64) with TRT-LLM?**
The DGX Spark compatibility table on build.nvidia.com does not list Qwen3-Next models,
yet this is one of the highest-performing models that fits in 128GB unified memory.
2. **The Triton kernel `fused_qkvzba_split_reshape_cat_kernel` crashes on ARM GB10.
Is this a known platform issue? Is there a fix or alternative kernel for aarch64?**
3. **The autotuner IndexError (Issue #10679) blocks FP8 serving. Is there an environment variable
or config flag to disable the autotuner in 1.3.0rc7?**
4. **`pytorch_backend_config` as a YAML block is rejected in 1.3.0rc7. What is the correct
way to pass PyTorch backend-specific options (e.g., `autotuner_enabled`, `use_cuda_graph`)
in this version?**
5. **Is there a planned release (1.3.0 final or later) that fixes these issues
and adds official DGX Spark support for Qwen3-Next?**
-–
## Current Workaround
The only currently working solution for DGX Spark is **vLLM**, which serves the FP8 variant
at ~44 t/s via the `nvcr.io/nvidia/vllm:26.01-py3` container:
```bash
vllm serve Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 \
–host 0.0.0.0 --port 8355 \
–gpu_memory_utilization 0.85 \
–speculative-config ‘{“method”:“qwen3_next_mtp”,“num_speculative_tokens”:2}’
```
This works but lacks the potential performance gains of TRT-LLM’s optimized kernels.
DGX Spark is specifically marketed as a device for running 70–80B class models,
and having no working TRT-LLM path for the most capable model in this weight class
is a significant gap.
-–
## Request
Please update the DGX Spark model compatibility table and provide:
- A confirmed working `trtllm-serve` command + config for Qwen3-Next-80B FP8 on DGX Spark
- Or a clear timeline for when these blockers will be resolved
Thank you.