Newb alert! Qwen 3.5/3.6 Gemma 4 26B / 35B downloading and speed! Help!

Hello brilliant minds. I had been using the Qwen 3.5 FP8 model for the longest time; it was blazing fast thanks to @eugr_nv’s tireless work, and it was my main model for quite a while. I didn’t like that it has a 2024 training cutoff, but eh. Then Gemma 4 came out, and when I saw it was among his cookbook models I gave it a spin. It has been blazing fast as well, so I’ve been using Gemma 4 26B for a bit now with no real issues. I can’t decide which is “better” at Python coding.

Then Qwen 3.6 came out. I was like OHHH must have nowww!! Well, honestly I’m still trying to understand the improvements in it; it still has a 2024 training cutoff, and it runs slower for me (because there’s no cookbook for it and I have no idea what I’m doing). I didn’t notice any difference in capabilities compared to 3.5. So I’m asking all you wizards whether you can elaborate for me, and also whether anyone has an awesome vLLM cookbook for version 3.6 that I could use to speed it up. I mostly use Eugr’s vLLM from GitHub, but I sometimes use sparkrun as well.

I’m having the same issue with Gemma 4 31B. I downloaded the NVIDIA NVFP4 version and used the playbook from NVIDIA to run it. It’s slow: about 6 tps while it’s generating, it’s slow to display output on the screen, and it takes a while to finish. Does anyone know if there’s an optimized route to get it running faster? I tried creating a YAML file that was essentially a copy of the 26B one in Eugr’s recipe folder, modified for 31B. It didn’t work. vLLM just implodes, saying:

(APIServer pid=41) INFO 05-05 13:02:26 [utils.py:233] non-default args: {'model_tag': 'google/gemma-4-31B-it-NVFP4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'host': '0.0.0.0', 'model': 'google/gemma-4-31B-it-NVFP4', 'max_model_len': 262144, 'quantization': 'fp8', 'load_format': 'safetensors', 'reasoning_parser': 'gemma4', 'gpu_memory_utilization': 0.7, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'max_num_batched_tokens': 8192}
(APIServer pid=41) WARNING 05-05 13:02:26 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_BASE_DIR
(APIServer pid=41) Traceback (most recent call last):
(APIServer pid=41)   File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_http.py", line 761, in hf_raise_for_status
(APIServer pid=41)     response.raise_for_status()
(APIServer pid=41)   File "/usr/local/lib/python3.12/dist-packages/httpx/_models.py", line 829, in raise_for_status
(APIServer pid=41)     raise HTTPStatusError(message, request=request, response=self)
(APIServer pid=41) httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://huggingface.co/google/gemma-4-31B-it-NVFP4/resolve/main/config.json'

or

sparkit@bd-it-spark01:~/spark-vllm-docker$ ./hf-download.sh nvidia/Gemma-4-31B-IT-NVFP4
Downloading model 'nvidia/Gemma-4-31B-IT-NVFP4' using uvx...
Installed 23 packages in 9ms
Downloading (incomplete total...):  24%|██████████████████████████▉ | 7.94G/32.7G [01:37<08:52, 46.5MB/s]
Fetching 15 files: 100%|██████████| 15/15 [06:14<00:00, 24.98s/it]
Download complete: : 32.7GB [06:14, 131MB/s]
✓ Downloaded
  path: /home/sparkit/.cache/huggingface/hub/models--nvidia--Gemma-4-31B-IT-NVFP4/snapshots/05fa17010ac2bf68365b33bdd20f07faa10654b4
Download complete: : 32.7GB [06:14, 87.2MB/s]
Download completed in 00:06:17
Model directory: /home/sparkit/.cache/huggingface/hub/models--nvidia--Gemma-4-31B-IT-NVFP4
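
A side note on the 404 above: the vLLM log asks for google/gemma-4-31B-it-NVFP4, while the download that succeeded is nvidia/Gemma-4-31B-IT-NVFP4, so the recipe and the cache seem to point at two different repos. The small sketch below only illustrates that mismatch using the cache folder naming visible in the download log; the helper name is made up for illustration. The same log line also shows 'quantization': 'fp8' for an NVFP4 checkpoint, which may be a leftover from the 26B recipe.

```python
def hf_cache_folder(repo_id: str) -> str:
    """Folder name the Hugging Face cache uses for a model repo (as seen in the log)."""
    return "models--" + repo_id.replace("/", "--")


recipe_model = "google/gemma-4-31B-it-NVFP4"      # model id from the vLLM 404 log
downloaded_model = "nvidia/Gemma-4-31B-IT-NVFP4"  # model id from the hf-download.sh log

print(hf_cache_folder(recipe_model))
print(hf_cache_folder(downloaded_model))
print("same repo:", recipe_model == downloaded_model)
```

If the two folder names differ, vLLM will not find the local copy and will try to fetch the recipe's repo id from the hub instead.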

or another time when I had downloaded the Intel version

sparkit@bd-it-spark01:~/spark-vllm-docker$ ./run-recipe.sh gemma4-31B-it.yaml
Warning: Recipe uses schema version '2', but this run-recipe.py supports: ['1']
Some features may not work correctly. Consider updating run-recipe.py.
Recipe: google/gemma-4-31B-it

No cluster nodes configured. Running autodiscover...

Running autodiscover...

Auto-detecting interfaces...
Error: No active IB interfaces found.
Error: Autodiscover failed
Error: Missing parameter in recipe command: 'model'
Available parameters: ['port', 'host', 'tensor_parallel', 'gpu_memory_utilization', 'max_model_len', 'max_num_batched_tokens', 'tool_call_parser']
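
The last error suggests the recipe's command template uses a {model} placeholder that is not among the parameters the recipe defines. How run-recipe.py substitutes values internally is not shown here, so this is only a minimal sketch of that kind of check using Python's standard string formatting; the function name and the sample command are hypothetical.

```python
from string import Formatter


def missing_placeholders(command: str, params: dict) -> list:
    """Return placeholder names used in the command that params does not define."""
    used = {name for _, name, _, _ in Formatter().parse(command) if name}
    return sorted(used - set(params))


# Hypothetical command template and parameter set, modeled on the error above.
command = "vllm serve {model} --port {port} --host {host}"
params = {"port": 8000, "host": "0.0.0.0"}

print(missing_placeholders(command, params))  # ['model']
```

Either adding the missing key to the recipe's parameters or writing the model id directly into the command should make such a check pass.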

So I just ran it straight out of the NVIDIA playbook

sparkit@bd-it-spark01:~$ docker run -it --gpus all -p 8000:8000 \
vllm/vllm-openai:gemma4-cu130 nvidia/Gemma-4-31B-IT-NVFP4

It’s dog slow and eating 120GB of my RAM, which doesn’t surprise me because nothing is capping its memory use.
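
For context on the 120GB: as far as I understand, vLLM reserves a fraction of device memory up front (gpu_memory_utilization, 0.9 by default) and hands whatever the weights don't need to the KV cache, so high usage by itself is expected when the flag isn't set. Here is a rough sketch of that split, assuming 128 GB of unified memory on the Spark and the 32.7 GB checkpoint size from the download log; it ignores activations and runtime overhead, and the function name is made up.

```python
def vllm_memory_split_gb(total_gb, weights_gb, gpu_memory_utilization=0.9):
    """Rough split of the reserved memory budget into weights and KV-cache room."""
    budget = total_gb * gpu_memory_utilization
    kv_cache = max(budget - weights_gb, 0.0)
    return budget, kv_cache


# 0.9 is the assumed vLLM default; 0.7 and 0.57 are values that appear in this thread.
for util in (0.9, 0.7, 0.57):
    budget, kv = vllm_memory_split_gb(128, 32.7, util)
    print(f"util={util}: budget {budget:.1f} GB, about {kv:.1f} GB left for KV cache")
```

Passing a lower --gpu-memory-utilization to vllm serve should shrink the reservation, at the cost of less KV cache.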

I always check the Spark Arena leaderboard, and it seems like it’s always the same models. As a newbie I’m confused why, out of the mountain of models on Hugging Face, we don’t see more variants showing up.

Please don’t take this as complaining, I’m not! I’m just trying to learn and understand the differences between models and releases. As a total AI newbie I struggle to understand how a 31B model runs horribly slow when it’s eating 120GB of RAM on the Spark. Seems like that should fly. But I know optimization and configuration are everything, and that is where I fall flat on my face. I don’t understand how to do it. I want to learn, but I haven’t found a good structured learning path. There are so many moving parts.

Anyways thanks in advance!!

Cheers

First: Avoid dense models if you value speed.

Qwen3.6-27B and Gemma-4-31B are dense models.
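
Why dense hurts: during decode, every generated token has to stream all active weights from memory, so memory bandwidth puts a ceiling on tokens per second. Here is a back-of-envelope sketch, assuming roughly 273 GB/s of memory bandwidth on the Spark (my assumption, check the spec sheet) and ignoring KV-cache reads, compute, and batching. The 32.7 GB figure is the checkpoint size from the download log above; 3 GB is an estimate for about 3B active parameters at FP8.

```python
def decode_tps_ceiling(active_weight_gb, bandwidth_gb_s=273.0):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    return bandwidth_gb_s / active_weight_gb


# Dense 31B NVFP4 checkpoint: all 32.7 GB are touched for every token.
print(f"dense 31B:   {decode_tps_ceiling(32.7):.1f} tok/s ceiling")   # 8.3
# MoE 35B-A3B at FP8: only about 3 GB of active weights per token (estimate).
print(f"MoE 35B-A3B: {decode_tps_ceiling(3.0):.1f} tok/s ceiling")    # 91.0
```

Under these assumptions, the roughly 6 tps reported for the dense model is already close to what the hardware allows, while an MoE model with few active parameters has far more headroom.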

Second: (for the moment) avoid NVFP4. vLLM is not yet fully optimized for it, although great progress has been made on NVFP4 support in recent weeks.

Currently, 4-bit quant formats like AWQ and Intel AutoRound are still faster. Sometimes even FP8 quants are pretty fast.

Have a look at the Spark Arena LLM Leaderboard to see what you can expect.

If you have a single Spark, check whether the recipes above define dual-node use. You can pass --solo as an argument for single-node use when using eugr’s tool set for the community Docker edition of vLLM.

If you want even more performance, closely follow the threads in here that deal with the different models.

There are quite a few “extras” / “special builds” that use the latest sh*t like TurboQuant/PrismQuant/DFlash/hybrid techniques, but some of those can be time-consuming to get running.

As an example:

Go to @eugr_nv’s repo (GitHub - eugr/spark-vllm-docker: Docker configuration for running VLLM on dual DGX Sparks), clone it, and install it. You will find many recipes there that are ready to run.

Run Qwen/Qwen3.6-35B-A3B-FP8 (from Hugging Face) with the 3.5-35B recipe, and it should load properly. Try it and familiarize yourself with it; it should be fairly easy :)

Then AFTER that you can start optimizing with MTP, if you want to venture that far. This is my Qwen3.6-35B-A3B-FP8 recipe with MTP; it’s snappy and the quality is great overall :)

# Recipe: Qwen3.6-35B-A3B-FP8
# Qwen3.6-35B-A3B model in FP8 quantization + MTP

recipe_version: "1"
name: Qwen3.6-35B-A3B-FP8
description: vLLM serving Qwen3.6-35B-A3B-FP8

# HuggingFace model to download (optional, for --download-model)
model: Qwen/Qwen3.6-35B-A3B-FP8

solo_only: true

# Container image to use
container: vllm-node-tf5

build_args:
  - --tf5

mods:
#  - mods/fix-qwen3.5-enhanced-chat-template ## Requires the chat-template fix. Only enable this if you have downloaded it

# Default settings (can be overridden via CLI)
defaults:
  port: 8000
  host: 0.0.0.0
  max_model_len: 262144
  gpu_memory_utilization: 0.57
  max_num_batched_tokens: 32768
  max-num-seqs: 8
  served_model_name: qwen/qwen3.6-35B-A3B-FP8
  speculative_config: '{"method":"mtp","num_speculative_tokens":3}'
  coding_debug: '{"temperature": 0.15, "top_p": 0.85, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'
  coding_review:  '{"temperature": 0.35, "top_p": 0.85, "top_k": 35, "presence_penalty": 0.0, "repetition_penalty": 1.0}'
  writing_config: '{"temperature": 0.85, "top_p": 0.92, "top_k": 50, "presence_penalty": 0.0, "repetition_penalty": 1.08}'
  coding_config: '{"temperature": 0.60, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0, "repetition_penalty": 1.0}'
  chat_template: unsloth.jinja
  chat_template2: qwen3.5-enhanced.jinja


# Environment variables
env:
  VLLM_MARLIN_USE_ATOMIC_ADD: 1

# The vLLM serve command template
command: |
  vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --served-model-name {served_model_name} \
  --max-model-len {max_model_len} \
  --gpu-memory-utilization {gpu_memory_utilization} \
  --max-num-batched-tokens {max_num_batched_tokens} \
  --max-num-seqs {max-num-seqs} \
  --port {port} \
  --host {host} \
  --dtype bfloat16 \
  --load-format instanttensor \
  --attention-backend flash_attn \
  --speculative-config '{speculative_config}' \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --chat-template {chat_template2} \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --override-generation-config '{coding_config}'
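
On the speculative_config line: with MTP the model drafts num_speculative_tokens extra tokens per step and keeps the ones that pass verification. One rough way to reason about the gain is shown below, under the simplifying assumption that each drafted token is accepted independently with the same probability. Real acceptance rates vary with the prompt and are something you would have to measure; the function name is made up.

```python
def expected_tokens_per_step(acceptance_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per verification step for k drafted tokens.

    Uses the geometric-series estimate (1 - a**(k + 1)) / (1 - a), which assumes
    each drafted token is accepted independently with probability a.
    """
    a, k = acceptance_rate, num_speculative_tokens
    if not 0.0 <= a <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if a == 1.0:
        return float(k + 1)
    return (1 - a ** (k + 1)) / (1 - a)


# k=3 matches the recipe above; the acceptance rates are illustrative only.
for a in (0.5, 0.7, 0.9):
    print(f"acceptance {a}: about {expected_tokens_per_step(a, 3):.2f} tokens per step")
```

The estimate ignores the extra cost of drafting and verifying, so the real speedup is lower than the tokens-per-step figure.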