Slow performance for RedHatAI/Qwen3-Coder-Next-NVFP4

Hi,

I am using the eugr/spark-vllm-docker image, built using the latest vllm sources 0.17.0.

I launched it manually to a bash prompt and entered the following command to serve the RedHatAI/Qwen3-Coder-Next-NVFP4 model.

vllm serve redhatai/qwen3-coder-next-nvfp4 \
    --port 8888 --host 0.0.0.0 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.7 \
    --load-format fastsafetensors \
    --enable-prefix-caching \
    --max-model-len 32000 \
    --enable-expert-parallel \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder

Token generation throughput is only around 14 tokens/s:

(APIServer pid=1125) INFO 03-11 05:05:27 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 14.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%

Q01: Is there anything that can be done to improve the performance of this specific model on the DGX Spark GB10?

As things stand, NVFP4 support on the DGX Spark is not yet mature enough to deliver competitive performance.

You may want to consider serving other quants of the same model. For example, Intel/Qwen3-Coder-Next-int4-AutoRound reaches around 69 tok/s on a single Spark. Note that INT4 can provide precision close to NVFP4.

Serve it using @eugr's community build recipe (https://github.com/eugr/spark-vllm-docker) or with the following command:

 ./launch-cluster.sh --apply-mod mods/fix-qwen3-coder-next \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  --solo exec vllm serve Intel/Qwen3-Coder-Next-int4-AutoRound \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --port 8000 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --dtype bfloat16 \
  --enable-prefix-caching \
  --attention-backend flashinfer \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

You may want to refer to the Spark Arena for the most up-to-date leaderboard.

Would you happen to know why requantizing a model into a 4-bit format leads to such wide variation in hardware performance?

Is it because software frameworks fall back to alternative implementations, without hardware acceleration, when a specific quantization format isn't supported by the underlying hardware?

Yes, it is a matter of software optimisation, or the lack thereof. GB10 supports NVFP4 natively; however, the rather complex software stack built on top of it is lagging behind in support.
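To illustrate the fallback idea (a hypothetical sketch, not vLLM's actual kernels): a weight-only INT4 path without a native fused kernel typically dequantizes the weights to a wide dtype and then runs an ordinary float GEMM, so the matmul itself gains nothing from the 4-bit storage:

```python
import numpy as np

def quantize_int4(w, group_size=8):
    """Symmetric per-group INT4 quantization: values clipped to [-8, 7]."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequant_matmul(x, q, scale, w_shape):
    """Fallback path: dequantize first, then run a plain float32 GEMM.
    A hardware-native kernel would instead fuse the dequantization into
    the matmul and consume the 4-bit values directly."""
    w = (q.astype(np.float32) * scale).reshape(w_shape)
    return x @ w
```

When the fused kernel is missing for a given format/GPU pair, throughput drops to (or below) the unquantized baseline, which matches the kind of gap observed here.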

More generally, incentives are rather misaligned these days for those of us hacking on GB10, because there is greater interest in supporting cloud-grade hardware, understandably so. While we speak of the Grace Blackwell lineage, the reality is that, at this stage, GB10 is not a first-class citizen. We have to take the development effort into our own hands! And I think something is already moving.

@eugr's docker build recipe pulls in the latest vLLM sources, currently at 0.17.0.

I tried applying the mods manually and noticed that some of them no longer apply cleanly. Perhaps the mods are no longer required because the corresponding PRs were merged into vLLM mainline, or they need refactoring for the current release?

root@aa439cbf5719:/tmp/spark-vllm-docker/mods/fix-qwen3-coder-next# bash run.sh 
Patching Qwen3-Coder-Next crashing on start
patching file vllm/v1/core/single_type_kv_cache_manager.py
Hunk #1 FAILED at 1000.
1 out of 1 hunk FAILED -- saving rejects to file vllm/v1/core/single_type_kv_cache_manager.py.rej
Patch is not applicable, skipping
Reverting PR #34279 that causes slowness
patching file vllm/model_executor/layers/fused_moe/fused_moe.py
Unreversed patch detected!  Ignore -R? [n] n
Apply anyway? [n] n
Skipping patch.
2 out of 2 hunks ignored -- saving rejects to file vllm/model_executor/layers/fused_moe/fused_moe.py.rej
Can't revert PR #34279, skipping as it was reverted in recent commits
Fixing Triton allocator bug

The ./build-and-copy.sh script takes care of everything; there is no need to apply the patches manually, as they are already applied under the hood.
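If you do want to check by hand whether a given mod still applies, patch's --dry-run flag reports stale hunks without touching the tree. A generic sketch (the demo file names here are made up; the mod scripts' actual paths differ):

```shell
# Sketch: probe a patch before applying it. patch exits non-zero when a
# hunk no longer applies, so stale mods (e.g. already merged upstream)
# can be skipped without modifying any files.
printf 'old line\n' > demo.txt
printf -- '--- demo.txt\n+++ demo.txt\n@@ -1 +1 @@\n-old line\n+new line\n' > demo.patch
if patch --dry-run demo.txt < demo.patch >/dev/null; then
  echo "patch applies"
else
  echo "patch is stale, skipping"
fi
```

This is essentially what the "Patch is not applicable, skipping" messages in the log above are reporting.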

To prepare to serve Intel/Qwen3-Coder-Next-int4-AutoRound for the first time, simply run ./build-and-copy.sh. Add the relevant parameters for more specialised scenarios, or if this is not your first time and you need to rebuild vLLM, for example:

./build-and-copy.sh --help
Usage: ./build-and-copy.sh [OPTIONS]
  -t, --tag <tag>               : Image tag (default: 'vllm-node')
  --gpu-arch <arch>             : GPU architecture (default: '12.1a')
  --rebuild-flashinfer          : Force rebuild of FlashInfer wheels (ignore cached wheels)
  --rebuild-vllm                : Force rebuild of vLLM wheels (ignore cached wheels)
  --vllm-ref <ref>              : vLLM commit SHA, branch or tag (default: 'main')
  -c, --copy-to <hosts>         : Host(s) to copy the image to. Accepts comma or space-delimited lists.
      --copy-to-host            : Alias for --copy-to (backwards compatibility).
      --copy-parallel           : Copy to all hosts in parallel instead of serially.
  -j, --build-jobs <jobs>       : Number of concurrent build jobs (default: 16)
  -u, --user <user>             : Username for ssh command (default: $USER)
  --tf5                         : Install transformers>=5 (aliases: --pre-tf, --pre-transformers)
  --exp-mxfp4, --experimental-mxfp4 : Build with experimental native MXFP4 support
  --apply-vllm-pr <pr-num>      : Apply a specific PR patch to vLLM source. Can be specified multiple times.
  --full-log                    : Enable full build logging (--progress=plain)
  --no-build                    : Skip building, only copy image (requires --copy-to)
  -h, --help                    : Show this help message

I tried again without the patch, and the intel/qwen3-coder-next-int4-autoround model continuously streams exclamation marks (!!!!) in its output.

export VLLM_MARLIN_USE_ATOMIC_ADD=1

vllm serve intel/qwen3-coder-next-int4-autoround \
    --port 8888 --host 0.0.0.0 \
    --tensor-parallel-size 1 \
    --max-model-len 262144 \
    --gpu-memory-utilization 0.9 \
    --load-format fastsafetensors \
    --dtype bfloat16 \
    --enable-prefix-caching \
    --enable-expert-parallel \
    --attention-backend flashinfer \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder

I have set up my docker environment to use usernamespace-remapping. @eugr's docker image executes all commands as root, and without usernamespace-remapping a process running as root inside a docker container can execute processes as root on the host system.
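For anyone wanting to reproduce this setup, enabling it looks roughly like the following (a sketch, not a full guide: "default" makes dockerd create a dockremap user, and the actual subordinate UID/GID ranges come from /etc/subuid and /etc/subgid on your host):

```shell
# Illustrative only: enable user-namespace remapping for dockerd.
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "userns-remap": "default"
}
EOF
sudo systemctl restart docker
```

Note that existing images and volumes become invisible to the remapped daemon, since it uses a separate storage root.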

This is why I haven't used the launch-cluster.sh --solo command so far.

If I run the launch-cluster.sh --solo command with my current configuration, it gives the following error, due to the command-line arguments passed to docker.

Starting Head Node on 127.0.0.1...
docker: Error response from daemon: privileged mode is incompatible with user namespaces.  You must run the container in the host namespace when running privileged mode

I applied the following patch to the launch-cluster.sh script to allow it to work with usernamespace-remapping enabled:

This is a quick patch just to get it running and to test locally on a system with usernamespace-remapping enabled. A proper fix would specify the exposed port only once when running in non-privileged mode.

diff --git a/launch-cluster.sh b/launch-cluster.sh
index 0d00407..69234e2 100755
--- a/launch-cluster.sh
+++ b/launch-cluster.sh
@@ -79,6 +79,7 @@ while [[ "$#" -gt 0 ]]; do
         --name) CONTAINER_NAME="$2"; shift ;;
         --eth-if) ETH_IF="$2"; shift ;;
         --ib-if) IB_IF="$2"; shift ;;
+        --localhost-port) DOCKER_ARGS="$DOCKER_ARGS -p 127.0.0.1:$2:$2"; shift ;;
         -e|--env) DOCKER_ARGS="$DOCKER_ARGS -e $2"; shift ;;
         -j) BUILD_JOBS="$2"; shift ;;
         --apply-mod) MOD_PATHS+=("$2"); shift ;;
@@ -590,7 +591,7 @@ start_cluster() {
     fi
 
     # Build docker run arguments based on mode
-    local docker_args_common="--gpus all -d --rm --network host --name $CONTAINER_NAME $DOCKER_ARGS $IMAGE_NAME"
+    local docker_args_common="--runtime=nvidia -d --rm --name $CONTAINER_NAME $DOCKER_ARGS $IMAGE_NAME"
     local docker_caps_args=""
     local docker_resource_args=""

I can now use the launch-cluster.sh script:

./launch-cluster.sh \
  --non-privileged \
  --localhost-port 8000 \
  --apply-mod mods/fix-qwen3-coder-next \
  -e VLLM_MARLIN_USE_ATOMIC_ADD=1 \
  --solo exec vllm serve intel/qwen3-coder-next-int4-autoround \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.85 \
  --port 8000 \
  --host 0.0.0.0 \
  --load-format fastsafetensors \
  --dtype bfloat16 \
  --enable-prefix-caching \
  --attention-backend flashinfer \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder

Thanks for your contribution; please consider opening a pull request, as this is of independent interest.

At this stage, the software stack is inherently trusted. It is worth noting that, even when invoked with usernamespace-remapping, docker is not a robust security mechanism; it can be escaped relatively easily, and key parts of its attack surface run as root. That being said, usernamespace-remapping is a better-than-nothing solution, and least privilege is best provided by more robust compartmentalisation tooling. Feel free to audit the resources being pulled in to build confidence in them.

I suppose @eugr's docker recipe and launch scripts could be modified to support usernamespace-remapping, along with documentation on how to set up and configure a host DGX Spark system with docker usernamespace-remapping.

You can create a docker image that runs processes as a regular user inside the container, and use usernamespace-remapping to map that user to an unprivileged host user with UID > 100000. While container mechanisms can be escaped, this still mitigates the basic risk of downloading and running a docker image that contains malware.
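A minimal sketch of what that image change could look like (hypothetical user name and UID, appended to the existing Dockerfile):

```dockerfile
# Hypothetical fragment: run the serving process as a non-root user.
# Combined with usernamespace-remapping, UID 1000 inside the container
# maps to an unprivileged host UID above 100000.
RUN useradd --create-home --uid 1000 vllmuser
USER vllmuser
WORKDIR /home/vllmuser
```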

Absolutely, this would be a nice patch, as it would give all users enough flexibility.

I’m considering supporting podman as a Docker alternative - it runs as a user by default and avoids many docker security pitfalls.
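For what it's worth, a rootless podman invocation could look roughly like this (hypothetical sketch: it assumes the image tag used above and that the NVIDIA Container Toolkit's CDI spec has been generated on the host):

```shell
# Illustrative only: rootless podman with CDI GPU passthrough.
podman run --rm \
  --device nvidia.com/gpu=all \
  -p 127.0.0.1:8000:8000 \
  vllm-node \
  vllm serve Intel/Qwen3-Coder-Next-int4-AutoRound --port 8000 --host 0.0.0.0
```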

For AutoRound quants, you need a different mod. I'll create a recipe, but for now you can run:

./launch-cluster.sh  \
--apply-mod mods/fix-qwen3-next-autoround \
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 --solo \
exec vllm serve Intel/Qwen3-Coder-Next-int4-AutoRound \
--max-model-len 262144 \
--gpu-memory-utilization 0.7 \
--port 8888 --host 0.0.0.0 \
--load-format fastsafetensors \
--enable-prefix-caching \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder