Hi everyone,
I am currently getting excellent results using Unsloth’s Qwen3.5-122B_Q6 with llama.cpp. Here is my working configuration:
./llama.cpp/build/bin/llama-server \
-m /workspace/AIEngine/Qwen3.5-122B_Q6/Qwen3.5-122B-A10B-UD-Q6_K_XL-00001-of-00004.gguf \
--mmproj /workspace/AIEngine/Qwen3.5-122B_Q6/mmproj-BF16.gguf \
--host 0.0.0.0 \
--port 8000 \
-c 60000 \
-a qwen3.5_122B \
-ngl 999 \
--cache-ram 0 \
-np 1 \
--temp 0.6 \
--top-k 20 \
--top-p 0.95 \
--min-p 0.00
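For reference, here is the kind of minimal request I use to sanity-check the llama-server endpoint (just a sketch: it uses the standard OpenAI-compatible /v1/chat/completions path and mirrors the sampling settings and -a alias from the command above; host/port are placeholders):

```python
import json
import urllib.request

def build_chat_payload(prompt: str, model: str = "qwen3.5_122B") -> dict:
    """Build an OpenAI-compatible chat payload mirroring the server's sampling setup."""
    return {
        "model": model,  # matches the -a alias passed to llama-server
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_k": 20,
        "top_p": 0.95,
        "min_p": 0.0,
    }

def query_server(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST to llama-server's OpenAI-compatible chat endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(query_server("Describe this setup in one sentence."))
```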
However, inference is slow (about 18 t/s), so I am trying to switch to vLLM with a Q4 quantization (Qwen3.5-122B-A10B-int4-AutoRound).
I am launching it with the following command (using eugr/spark-vllm-docker, a Docker configuration for running vLLM on dual DGX Sparks):
VLLM_SPARK_EXTRA_DOCKER_ARGS="-v /home/ollam3/Desktop/AIEngine:/models" \
./launch-cluster.sh --solo -t vllm-node-tf5 --apply-mod mods/fix-qwen3.5-autoround \
-e VLLM_MARLIN_USE_ATOMIC_ADD=1 exec vllm serve /models/Qwen3.5-122B-A10B-int4-AutoRound \
--max-model-len 40000 \
--gpu-memory-utilization 0.90 \
--port 8000 \
--host 0.0.0.0 \
--load-format fastsafetensors \
--kv-cache-dtype fp8 \
--max-num-seqs 1 \
--max-num-batched-tokens 8192 \
--trust-remote-code \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm
The model loads and serves successfully, but I am not getting any reasoning output when evaluating vision inputs.
Am I missing a specific parameter in my vLLM configuration? Or could it be that vision reasoning isn't fully supported yet in this vLLM version (especially given the custom mod fix)?
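In case it helps with diagnosis, this is how I inspect the raw response to tell whether the reasoning is absent or merely not being split out. vLLM's OpenAI-compatible server returns parsed reasoning in a separate reasoning_content field when a reasoning parser is active, so printing both fields shows which case I'm in (a sketch; the image URL and host are placeholders):

```python
import json
import urllib.request

def build_vision_payload(image_url: str, question: str) -> dict:
    """OpenAI-style multimodal chat payload: one image (by URL) plus a text question."""
    return {
        "model": "/models/Qwen3.5-122B-A10B-int4-AutoRound",  # path passed to vllm serve
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": 1024,
    }

def inspect_reasoning(base_url: str, image_url: str, question: str) -> None:
    """Print reasoning_content and content separately to see where reasoning ends up."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_vision_payload(image_url, question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        msg = json.loads(resp.read())["choices"][0]["message"]
    # If reasoning_content is empty but content starts with raw <think> tags, the
    # model is generating reasoning that no parser is splitting out; if neither
    # shows reasoning, the model skips the reasoning phase for image inputs.
    print("reasoning_content:", msg.get("reasoning_content"))
    print("content:", (msg.get("content") or "")[:200])

if __name__ == "__main__":
    inspect_reasoning("http://localhost:8000",
                      "https://example.com/test.png",
                      "What is in this image?")
```

If the raw content does contain think tags, that would point at a missing or mismatched --reasoning-parser setting rather than a model problem, but I'm not sure which applies here.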
Many thanks in advance for your help!
Kind regards,
P