Hello brilliant minds. I had been using the Qwen 3.5 FP8 model for the longest time; it was blazing fast thanks to @eugr_nv’s tireless work, and it was my main model for quite a while. I didn’t like that it had a 2024 training cutoff, but eh. Then Gemma 4 came out, and when I saw it was part of his cookbook models I gave it a spin, and it has been blazing fast as well! So I’ve been using Gemma 4 26B for a bit now with no real issues. I can’t decide which is “better” at Python coding.
Then Qwen 3.6 came out. I was like OHHH must have nowww!! Well, honestly I’m still trying to understand the improvements in it; it still has a 2024 training cutoff, and it runs slower for me (because there’s no cookbook for it and I have no idea what I’m doing). I didn’t notice any difference in capabilities compared to 3.5. So I’m asking all you wizards if you can elaborate for me, and also if anyone has an awesome vLLM cookbook for version 3.6 that I could use to speed it up. I mostly use Eugr’s vLLM from GitHub, but I do use sparkrun as well sometimes.
I’m having the same issue with Gemma 4 31B. I downloaded the NVIDIA NVFP4 version and used the playbook from NVIDIA to run it. It’s slow, about 6 tps while it’s spitting out info; it’s slow to display on the screen and takes a while to finish. Anyone know if there’s an optimized route to get it running faster? I tried creating a YAML file that was essentially a copy of the 26B one in Eugr’s recipe folder, modified for 31B. It didn’t work. vLLM just implodes, saying:
(APIServer pid=41) INFO 05-05 13:02:26 [utils.py:233] non-default args: {'model_tag': 'google/gemma-4-31B-it-NVFP4', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'host': '0.0.0.0', 'model': 'google/gemma-4-31B-it-NVFP4', 'max_model_len': 262144, 'quantization': 'fp8', 'load_format': 'safetensors', 'reasoning_parser': 'gemma4', 'gpu_memory_utilization': 0.7, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'max_num_batched_tokens': 8192}
(APIServer pid=41) WARNING 05-05 13:02:26 [envs.py:1783] Unknown vLLM environment variable detected: VLLM_BASE_DIR
(APIServer pid=41) Traceback (most recent call last):
(APIServer pid=41) File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_http.py", line 761, in hf_raise_for_status
(APIServer pid=41) response.raise_for_status()
(APIServer pid=41) File "/usr/local/lib/python3.12/dist-packages/httpx/_models.py", line 829, in raise_for_status
(APIServer pid=41) raise HTTPStatusError(message, request=request, response=self)
(APIServer pid=41) httpx.HTTPStatusError: Client error '404 Not Found' for url 'https://huggingface.co/google/gemma-4-31B-it-NVFP4/resolve/main/config.json'
Or, when downloading the NVIDIA version:
sparkit@bd-it-spark01:~/spark-vllm-docker$ ./hf-download.sh nvidia/Gemma-4-31B-IT-NVFP4
Downloading model 'nvidia/Gemma-4-31B-IT-NVFP4' using uvx...
Installed 23 packages in 9ms
Downloading (incomplete total...): 24% | 7.94G/32.7G [01:37<08:52, 46.5MB/s]
Fetching 15 files: 100% | 15/15 [06:14<00:00, 24.98s/it]
Download complete: 32.7GB [06:14, 131MB/s] ✓ Downloaded
path: /home/sparkit/.cache/huggingface/hub/models--nvidia--Gemma-4-31B-IT-NVFP4/snapshots/05fa17010ac2bf68365b33bdd20f07faa10654b4
Download complete: : 32.7GB [06:14, 87.2MB/s]
Download completed in 00:06:17
Model directory: /home/sparkit/.cache/huggingface/hub/models--nvidia--Gemma-4-31B-IT-NVFP4
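Comparing the two logs above, the 404 appears to be for the repo id `google/gemma-4-31B-it-NVFP4`, while the checkpoint that downloaded successfully is `nvidia/Gemma-4-31B-IT-NVFP4`. A minimal sketch for comparing the two ids follows. It only does string handling and never contacts Hugging Face, and the helper name `repo_id_from_hf_url` is made up for illustration:

```python
from urllib.parse import urlparse


def repo_id_from_hf_url(url: str) -> str:
    """Return 'org/name' from a huggingface.co '.../resolve/...' URL."""
    parts = urlparse(url).path.strip("/").split("/")
    return "/".join(parts[:2])


# URL copied from the 404 line in the traceback.
failed_url = "https://huggingface.co/google/gemma-4-31B-it-NVFP4/resolve/main/config.json"
# Repo id copied from the hf-download.sh run.
downloaded = "nvidia/Gemma-4-31B-IT-NVFP4"

requested = repo_id_from_hf_url(failed_url)
same_repo = requested.lower() == downloaded.lower()

print(f"requested:  {requested}")
print(f"downloaded: {downloaded}")
print(f"same repo:  {same_repo}")
```

If the ids differ, as they seem to here, pointing the recipe's `model` at the id that was actually downloaded may be the first thing to try. The same log line also shows `'quantization': 'fp8'` next to an NVFP4 checkpoint, which may be worth double-checking, though whether it matters is not clear from the log alone.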
Or another time, when I had downloaded the Intel version:
sparkit@bd-it-spark01:~/spark-vllm-docker$ ./run-recipe.sh gemma4-31B-it.yaml
Warning: Recipe uses schema version '2', but this run-recipe.py supports: ['1']
Some features may not work correctly. Consider updating run-recipe.py.
Recipe: google/gemma-4-31B-it
No cluster nodes configured. Running autodiscover…
Running autodiscover…
Auto-detecting interfaces…
Error: No active IB interfaces found.
Error: Autodiscover failed
Error: Missing parameter in recipe command: 'model'
Available parameters: ['port', 'host', 'tensor_parallel', 'gpu_memory_utilization', 'max_model_len', 'max_num_batched_tokens', 'tool_call_parser']
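The last error reads as if the recipe's command refers to a `model` placeholder that the recipe never defines. How run-recipe.py really resolves placeholders is not shown in this output, so the following is only a guess at the mechanism. The command template and the default values are hypothetical (only the parameter names come from the error message), and `missing_params` is an illustrative helper:

```python
from string import Formatter


def missing_params(command_template: str, params: dict) -> list[str]:
    """List placeholders used in the command that have no value in params."""
    used = [name for _, name, _, _ in Formatter().parse(command_template) if name]
    return [name for name in used if name not in params]


# HYPOTHETICAL command template; this is not the real recipe file.
command = "vllm serve {model} --host {host} --port {port} --max-model-len {max_model_len}"

# Keys mirror the "Available parameters" list; the values are placeholders.
defaults = {
    "port": 8000,
    "host": "0.0.0.0",
    "tensor_parallel": 1,
    "gpu_memory_utilization": 0.7,
    "max_model_len": 262144,
    "max_num_batched_tokens": 8192,
    "tool_call_parser": "gemma4",
}

print(missing_params(command, defaults))
```

Under that reading, adding a `model` entry to the recipe's parameters (or removing the placeholder from the command) would make the check pass. The schema version warning suggests the recipe format itself may also be newer than the script understands.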
So I just ran it straight out of the NVIDIA playbook:
sparkit@bd-it-spark01:~$ docker run -it --gpus all -p 8000:8000 \
vllm/vllm-openai:gemma4-cu130 nvidia/Gemma-4-31B-IT-NVFP4
It’s dog slow and eating 120GB of my RAM, which doesn’t surprise me because nothing is capping its memory.
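For context on the memory use: vLLM reserves a fraction of GPU memory up front for weights plus KV cache (`gpu_memory_utilization`, default 0.9), so a large footprint mostly reflects that reservation rather than the model's size. A sketch of building the same `docker run` command with explicit limits is below. It assumes the image forwards extra arguments to the vLLM server, as it appears to do with the model name. The 0.7 is borrowed from the recipe log, the 32768 context length is an arbitrary illustration, and `vllm_docker_cmd` is a made-up helper:

```python
import shlex


def vllm_docker_cmd(image: str, model: str, gpu_mem_util: float, max_model_len: int) -> list[str]:
    """Build a docker run argv that passes explicit memory limits to vLLM."""
    if not 0.0 < gpu_mem_util <= 1.0:
        raise ValueError("gpu_mem_util must be in (0, 1]")
    return [
        "docker", "run", "-it", "--gpus", "all", "-p", "8000:8000",
        image, model,
        "--gpu-memory-utilization", str(gpu_mem_util),
        "--max-model-len", str(max_model_len),
    ]


cmd = vllm_docker_cmd(
    image="vllm/vllm-openai:gemma4-cu130",
    model="nvidia/Gemma-4-31B-IT-NVFP4",
    gpu_mem_util=0.7,     # value seen in the recipe log, not tuned
    max_model_len=32768,  # ASSUMPTION: illustrative context length
)
print(shlex.join(cmd))
```

Capping memory this way should reduce the footprint, but it is not expected to make token generation faster on its own.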
I always check the Spark Arena leaderboard, and it seems like it’s always the same models. As a newbie, I’m confused why, out of the mountain of models on Hugging Face, we don’t see more variants showing up.
Please don’t take this as complaining, I’m not! I’m just trying to learn and understand the differences between models and releases. As a total AI newbie, I struggle to understand how a 31B model runs horribly slow when it’s eating 120GB of RAM on the Spark. Seems like that should fly. But I know optimization and configuration are everything, and that is where I fall flat on my face. I don’t understand how to do it. I want to learn, but I haven’t found a good structured learning path. There are so many moving parts.
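A back-of-envelope sketch of why memory footprint and speed are separate questions: for a dense model, generating each token requires reading roughly all of the weights once, so single-stream decode speed is bounded by memory bandwidth divided by weight size. The 32.7 GB figure is the checkpoint size from the download log. The 273 GB/s bandwidth is an assumed approximate figure for the Spark, and treating the whole checkpoint as read per token is a simplification:

```python
def decode_tps_upper_bound(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Rough ceiling on single-stream tokens/s for a dense model:
    each generated token needs about one full pass over the weights."""
    return bandwidth_gb_s / weight_gb


WEIGHTS_GB = 32.7       # checkpoint size reported by the download log
BANDWIDTH_GB_S = 273.0  # ASSUMPTION: approximate Spark memory bandwidth

ceiling = decode_tps_upper_bound(WEIGHTS_GB, BANDWIDTH_GB_S)
print(f"dense decode ceiling: ~{ceiling:.1f} tokens/s")
```

Under these assumptions the ceiling lands in the high single digits, the same ballpark as the roughly 6 tps observed, which would mean the model is limited by bandwidth rather than misconfigured. Mixture-of-experts models read only the active experts per token, which may be part of why some models of similar size feel much faster. Whether that explains the 26B model's speed is an assumption here.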
Anyways thanks in advance!!
Cheers