[Issue] Qwen3-Next-80B NVFP4 and FP8 Cannot Be Served via trtllm-serve on DGX Spark GB10 (TRT-LLM 1.3.0rc7)

## Summary

I have been attempting to serve **Qwen3-Next-80B-A3B-Thinking** (both NVFP4 and FP8 variants)

via `trtllm-serve` on a single DGX Spark GB10 (128GB unified memory) using the official

`nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc7` container.

After extensive debugging over multiple sessions, I’ve hit a series of cascading failures

that appear to be fundamental blockers specific to the DGX Spark ARM (aarch64) + GB10 platform.

I’m sharing the full failure sequence here in the hope that NVIDIA can confirm whether

this model/platform combination is officially supported, and provide a workaround or roadmap.

## Environment

| Item | Value |

|—|—|

| Hardware | NVIDIA DGX Spark (GB10, aarch64 ARM) |

| Unified Memory | 128 GB |

| GPU | NVIDIA GB10 Blackwell |

| Container | `nvcr.io/nvidia/tensorrt-llm/release:1.3.0rc7` |

| TRT-LLM Version | 1.3.0rc7 |

| PyTorch Version | 2.10.0a0 (inside container) |

| Models Tested | `nvidia/Qwen3-Next-80B-A3B-Thinking-NVFP4`, `Qwen/Qwen3-Next-80B-A3B-Thinking-FP8` |

**Error:**

```

ValueError: LLM got invalid argument: pytorch_backend_config

```

**Observation:** `pytorch_backend_config` as a YAML block is rejected by `trtllm-serve` in 1.3.0rc7.

This is despite the field being documented in official TRT-LLM examples.

When fields are moved to top-level, `use_cuda_graph` and `autotuner_enabled` are also rejected

as invalid top-level arguments.

-–

### Step 2 — Mamba hybrid cache + `enable_block_reuse` conflict

After resolving config issues, encountered:

```

AssertionError: mamba hybrid cache requires block reuse to be disabled in KV cache config

```

**Root cause:** Qwen3-Next uses a Mamba hybrid architecture. `enable_block_reuse: true`

is incompatible with this architecture and must be forced to `false`.

-–

### Step 3 — CUDA Illegal Instruction (NVFP4 — Triton kernel)

After disabling block reuse and CUDA graph, the NVFP4 model crashes during generation:

```

RuntimeError: Triton Error [CUDA]: an illegal instruction was encountered

```

**Stack trace root:**

```

modeling_qwen3_next.py line 332:

fused_qkvzba_split_reshape_cat_kernel\[grid\](

```

**Observation:** This is a Triton kernel compiled for x86/SM10x Blackwell.

The DGX Spark GB10 is an **ARM aarch64** system, and this Triton kernel appears

to contain instructions incompatible with the ARM architecture.

This is a **platform-level incompatibility**, not a config issue.

-–

### Step 4 — Autotuner IndexError (FP8 — fp8_block_scaling_gemm)

Switching to the FP8 variant (`Qwen/Qwen3-Next-80B-A3B-Thinking-FP8`), the model loads

successfully (74.85 GB), but crashes during autotuner warmup:

```

IndexError: list assignment index out of range

File “autotuner.py”, line 1406, in _find_nearest_profile

base_profile\[spec.input_idx\]\[spec.dim_idx\] = -1

\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~\~^^^^^^^^^^^^^^

```

**Call path:**

```

fp8_block_scaling_gemm → tuner.choose_one → profiling_cache.search_cache → _find_nearest_profile

```

**Observation:** This is a known regression bug (GitHub Issue #10679) where the

autotuner crashes with MoE + FP8 configurations. The bug has been confirmed but

has no fix in 1.3.0rc7. Setting `TLLM_DISABLE_AUTOTUNER=1` has no effect.

There is no documented way to disable the autotuner via `extra_llm_api_options` in this version.

-–

## Questions for NVIDIA

1. **Is Qwen3-Next-80B (NVFP4 or FP8) officially supported on DGX Spark GB10 (aarch64) with TRT-LLM?**

The DGX Spark compatibility table on build.nvidia.com does not list Qwen3-Next models,

yet this is one of the highest-performing models that fits in 128GB unified memory.

2. **The Triton kernel `fused_qkvzba_split_reshape_cat_kernel` crashes on ARM GB10.

Is this a known platform issue? Is there a fix or alternative kernel for aarch64?**

3. **The autotuner IndexError (Issue #10679) blocks FP8 serving. Is there an environment variable

or config flag to disable the autotuner in 1.3.0rc7?**

4. **`pytorch_backend_config` as a YAML block is rejected in 1.3.0rc7. What is the correct

way to pass PyTorch backend-specific options (e.g., `autotuner_enabled`, `use_cuda_graph`)

in this version?**

5. **Is there a planned release (1.3.0 final or later) that fixes these issues

and adds official DGX Spark support for Qwen3-Next?**

-–

## Current Workaround

The only currently working solution for DGX Spark is **vLLM**, which serves the FP8 variant

at ~44 t/s via the `nvcr.io/nvidia/vllm:26.01-py3` container:

```bash

vllm serve Qwen/Qwen3-Next-80B-A3B-Thinking-FP8 \

–host 0.0.0.0 --port 8355 \

–gpu_memory_utilization 0.85 \

–speculative-config ‘{“method”:“qwen3_next_mtp”,“num_speculative_tokens”:2}’

```

This works but lacks the potential performance gains of TRT-LLM’s optimized kernels.

DGX Spark is specifically marketed as a device for running 70–80B class models,

and having no working TRT-LLM path for the most capable model in this weight class

is a significant gap.

-–

## Request

Please update the DGX Spark model compatibility table and provide:

- A confirmed working `trtllm-serve` command + config for Qwen3-Next-80B FP8 on DGX Spark

- Or a clear timeline for when these blockers will be resolved

Thank you.

You can also run it on SGLang with https://sparkrun.dev. Benchmarks: Qwen/Qwen3-Coder-Next-FP8 - Spark Arena Benchmark

I would use @eugr 's repo , he has a recipe for this model and int4-autoround.

qwen3-coder-next-fp8.yaml
qwen3-coder-next-int4-autoround.yaml