RedHatAI/Qwen3.5-122B-A10B-NVFP4 seems to be the best option for a single Spark

I agree, and I’m not saying the metrics aren’t useful. But without a good measure of accuracy, it’s hard to know at what cost the improved speed comes, so it’s a difficult decision to make.

I haven’t used these models and I’m not suggesting the NVFP4 one is better, I’m just saying that without good data about accuracy, it’s not obvious to me that the fastest one is “better” (at least for my priorities).

Yeah, I absolutely get this (I’ve tried many times to come up with something to help me compare things), and I hope we can come up with something. But in the meantime, I’m just not putting much weight on speed. From the testing I’ve done so far, if you asked me if I wanted a faster model or a smarter model, I would pick smarter every time. There is no model I’ve run that is smart enough, even at a pathetically slow speed 😄

lol too true

It might be worth trying Nemotron-3-super since it’s several times faster than other models. At that speed, performing manual corrections becomes a much more acceptable trade-off.

I didn’t try Nemotron-3-super because I saw “vLLM for it still in WIP”, and I am in a hurry to use a better model than I am currently using.

Yeah, I need to try this again. Last time I tried it it kept crashing, but it sounds like there might be fixes out for that now.

qwen3.5_397b is the smartest; I’ve tested it on two Spark nodes.

How do you have enough memory left for inference after bringing the model up? The best case I have encountered is only 6.9Gi left on a single DGX Spark.

If the model has already been loaded in vLLM and you’ve received the API endpoint, then the KV cache required for inference should already be allocated, so inference should be ready to run.

But the model could only handle very simple prompts and kept crashing. When I lowered max-length to 8k, I was able to run somewhat more “real” questions, but it couldn’t stay up long enough for daily use.

Not sure what’s wrong with my setup

It would be helpful to refer to the vLLM logs when a crash occurs to better understand the situation.

Thank you. Let’s examine the YAML file first:

```yaml
services:
  vllm-qwen:
    image: vllm-node-tf5
    container_name: spark-vllm-qwen
    restart: always
    runtime: nvidia
    ports:
      - "8000:8000"
    networks:
      - spark-net
    volumes:
      - ~/models/qwen-122b-nvfp4:/workspace/model
      - ~/dgx-spark/builds/spark-vllm-docker/mods:/workspace/mods
    shm_size: '16gb'
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      vllm serve /workspace/model --served-model-name "Qwen3.5-122B-NVFP4_262k-0.87"
      --host 0.0.0.0
      --port 8000
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --max-model-len 262144
      --moe-backend flashinfer_cutlass
      --default-chat-template-kwargs '{"enable_thinking": false}'
      --chat-template /workspace/mods/fix-qwen3.5-chat-template/chat_template.jinja
      --max-num-seqs 100
      --trust-remote-code
      --gpu-memory-utilization 0.87
      --enforce-eager
      --kv-cache-dtype fp8
```

I recommend removing the --default-chat-template-kwargs '{"enable_thinking": false}' argument, as it disables the reasoning capability.

Also, please remove the following flags:

  • --trust-remote-code
  • --gpu-memory-utilization 0.87
  • --enforce-eager
  • --kv-cache-dtype fp8
  • --chat-template /workspace/mods/fix-qwen3.5-chat-template/chat_template.jinja

After removing all of them, try reloading the model and running it again.
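
For reference, with those flags removed the `command` section would look roughly like the sketch below. It is based only on the compose file posted above and has not been verified on this setup; everything outside `command` is assumed to stay as posted.

```yaml
services:
  vllm-qwen:
    # image, ports, networks, volumes, and deploy settings stay as posted
    command: >
      vllm serve /workspace/model --served-model-name "Qwen3.5-122B-NVFP4_262k-0.87"
      --host 0.0.0.0
      --port 8000
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --max-model-len 262144
      --moe-backend flashinfer_cutlass
      --max-num-seqs 100
```

Without `--gpu-memory-utilization`, vLLM falls back to its default of 0.9. If the crashes continue, lowering `--max-model-len` or `--max-num-seqs` is the next thing to try.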

Hey guys,

I have two Blackwell 6000 Pro units, but I’m thinking about the same thing.

Currently we have all of those model options available:

- Qwen/Qwen3.5-122B-A10B-FP8
- Qwen/Qwen3.5-122B-A10B-GPTQ-Int4
- Sehyo/Qwen3.5-122B-A10B-NVFP4
- RedHatAI/Qwen3.5-122B-A10B-NVFP4
- Intel/Qwen3.5-122B-A10B-int4-AutoRound
- QuantTrio/Qwen3.5-122B-A10B-AWQ

So, for a production-grade deployment on Blackwell serving around 50 concurrent users, which option provides the best combination of stability and performance?

Thanks for your advice

Morgan

For reference - My results with LM Studio Unsloth GGUF Q4_K_M 262144 context KV Q8_0

Has anyone noticed issues with vision/image parsing with the RedHatAI/Qwen3.5-122B-A10B-NVFP4 model and vllm?

The model seems to perceive images in the wrong orientation, either upside down or flipped on the diagonal. I’ve had repeated confusions where the lower-right part of the image is perceived by the model as being in the top-left.

Are there any image/vision configurations that vllm needs?

It is tax time. To use these models for tax work, I benchmarked a few of them on converting financial statements from PDF to MD and then to QIF (a format that financial apps can import). Nemotron was the clear winner among all the models I benchmarked, in both speed and quality. In the MD → QIF step, Nemotron-3-super took 1/4 of the time Gemma4-31b took…

I often run in a loop with RedHatAI/Qwen3.5-122B-A10B-NVFP4, in both Thinking and Instruction modes, using the latest vLLM.