GLM-4.7-Flash-NVFP4 was just released, but for Transformers 5.0 + vLLM 0.14...?

GadflyII/GLM-4.7-Flash-NVFP4 was just released on HuggingFace.

However, it requires vLLM 0.14.0 and Transformers 5.0.0. The latest vLLM release on GitHub is currently 0.13.x, and Transformers is on 4.x.

I currently have a working GPT-OSS-120B setup through Docker Compose, but I can’t figure out how to get the above prerequisites in place and bring it up. This is my current GPT-OSS-120B Docker Compose config:

services:
  vllm-node:
    image: vllm-image
    container_name: vllm-container
    environment: 
      - VLLM_API_SERVER_COUNT=2
      - HF_TOKEN=${HF_TOKEN}
    restart: unless-stopped

    privileged: true
    network_mode: host
    ipc: host
    pid: host

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface

    command: >
      bash -c "vllm serve
      openai/gpt-oss-120b
      --port 8000 --host 0.0.0.0
      --gpu-memory-utilization 0.7
      --load-format fastsafetensors"

Just switching openai/gpt-oss-120b to GadflyII/GLM-4.7-Flash-NVFP4 would not work, as you might have already assumed. How can I get the dependencies?

Try an AWQ quant; that should perform better.

Try the nightly Docker build: vllm/vllm-openai:nightly

or eugr’s build: GitHub - eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks)

Not tested myself yet; I’ll try this evening (CET).
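An untested sketch of what that swap could look like in the Compose file from the original post: the nightly image, the GLM model, and the flags carried over from the GPT-OSS setup (the `entrypoint: []` line is there because the official image ships its own entrypoint; drop it if your image does not):

```yaml
services:
  vllm-node:
    image: vllm/vllm-openai:nightly  # nightly build instead of a custom image
    environment:
      - HF_TOKEN=${HF_TOKEN}
    restart: unless-stopped
    network_mode: host
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    entrypoint: []  # run the full command below instead of the image entrypoint
    command: >
      vllm serve GadflyII/GLM-4.7-Flash-NVFP4
      --port 8000 --host 0.0.0.0
      --gpu-memory-utilization 0.7
```

Whether this actually loads depends on the nightly having the required Transformers version, per the rest of this thread.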


I think you can try this version from Unsloth.

It was better for me.

The nightly build still needs the newer Transformers lib:

pip install git+https://github.com/huggingface/transformers.git

as mentioned in the model card. I successfully modified the nightly container and tested it on a regular server (L40).
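To bake that into the image instead of patching a running container, a minimal untested Dockerfile sketch on top of the nightly build (the exact Transformers commit you get depends on when you build):

```dockerfile
# Base: the vLLM nightly image mentioned earlier in the thread
FROM vllm/vllm-openai:nightly

# Bleeding-edge Transformers from git, as the model card requires
RUN pip install --no-cache-dir git+https://github.com/huggingface/transformers.git
```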

Oh, vLLM v0.14.0 was released just 15 minutes ago. No release notes yet.

vLLM sucks at GGUF.

Running it on llama.cpp is the fastest option for inference. Flash attention is currently broken for GLM 4.7 Flash, but there is a branch with the fixes on the llama.cpp GitHub that you can clone.


No luck building vLLM latest on Spark today:

Failed to fetch: https://download.pytorch.org/whl/cu130/numpy/

Cloudflare seems to be troubled or my upstream provider just fails me… will retry tomorrow.

Have you already checked out vLLM on GB10: gpt-oss-120b MXFP4 slower than SGLang/llama.cpp... what’s missing? and New pre-built vLLM Docker Images for NVIDIA DGX Spark?

Please check them out.


FYI: I released the images at New pre-built vLLM Docker Images for NVIDIA DGX Spark. I don’t think GLM-4.7-Flash works on the latest image, but I was planning to release a new version tomorrow to coincide with the PyTorch 2.10 release. I don’t think the vLLM 0.14.0 release will work either; a key commit landed right after the 0.14.0 release. I might include it in tomorrow’s updated 0.14.0 image, given that GLM-4.7-Flash is pretty key.


Thanks for the context, Drew.

vLLM 0.14.0 was just released. I’m trying to set it up with Docker Compose, but it keeps telling me that at least one -cc argument is expected. I can’t figure this one out after multiple iterations. Has anyone else tried?


Now that I can access the PyTorch repos again without issues, I was able to test it with eugr’s Docker build. It works; you only need to update Transformers to v5.0.0.

JFTR, eugr has already added a flag for using the bleeding-edge Transformers library (--pre-tf):

./build-and-copy.sh --pre-tf --use-wheels

Released updated Docker container images, so hopefully that can help / be another alternative. I’d still recommend eugr’s repo for people who need a new version on a daily basis (or need the newest commit at any given point). I’m trying to be somewhere in between a nightly build and NVIDIA’s more conservative releases.


There is a new version of the Unsloth GGUF model with a fix for looping and poor output:

And there is an NVFP4 version with updated KV cache fixes. You can use 200K context with just 10 GB of RAM.

When I combine the Unsloth version with the llama.cpp flash attention branch fix, GLM 4.7 Flash token generation starts out at ~61 tk/s, tapering down to 57 tk/s at 5k tokens!
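For anyone trying to reproduce this, a llama-server invocation along these lines is the general shape (sketch only: the model path is a placeholder for whichever Unsloth GGUF you downloaded, and flag spellings vary a bit between llama.cpp versions):

```shell
# Placeholder model path - point it at the downloaded Unsloth GGUF.
./llama-server \
  -m ./models/glm-4.7-flash.gguf \
  --flash-attn on \
  -c 200000 \
  --host 0.0.0.0 --port 8080
```

Until the flash attention fixes are merged, this needs to be run from a build of the fix branch mentioned earlier in the thread.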


Sorry guys, I am quite new to the Spark ecosystem. Did anyone successfully get it to run with NVFP4 on vLLM already? It’s quite confusing to figure out what I need to compile/package myself regarding Triton kernels, vLLM, Transformers 5.0, etc.

I’d be happy about any pointers. I only got the Q8 GGUF to run on llama.cpp at around 30 t/s, which was a little underwhelming, but maybe I missed the flash attention branch.

I created a PR for eugr/spark-vllm-docker which allowed me to run GLM-4.7-Flash-NVFP4 on my DGX Spark. Disclaimer: I only got plain decode (no tools) at ~35–45 tokens/s, and with tool calling / reasoning enabled ~15–25 tokens/s, but at least I got it running on vLLM this way.

I’ve also updated New pre-built vLLM Docker Images for NVIDIA DGX Spark - #7 by dbsci to include the necessary patch, so the scitrera/dgx-spark-vllm:0.14.0-t5 container image should be OK for GLM-4.7-Flash. Note that I have not specifically tested the NVFP4 quant, so any feedback on that is welcome.
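A minimal untested Compose sketch for that image, mirroring the GPT-OSS config from the top of the thread (how the image's entrypoint interacts with `command` is an assumption here; you may need to adjust if it already wraps `vllm serve`):

```yaml
services:
  vllm-node:
    image: scitrera/dgx-spark-vllm:0.14.0-t5
    environment:
      - HF_TOKEN=${HF_TOKEN}
    network_mode: host
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      vllm serve GadflyII/GLM-4.7-Flash-NVFP4
      --host 0.0.0.0 --port 8000
```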

Nice! I like that it’s a mod; that makes it easier. I’ll have a look and merge if there are no issues.


I’m currently downloading that NVFP4 model to test your mod. But I just wanted to mention that since NVFP4 support is still not great on Spark, I think AWQ models will work much better. Also, the FP8 version, if/when it comes out, will be pretty fast as well.

Since it’s a 3B-active-parameter model, I’d expect something around 80 t/s for the AWQ 4-bit version and around 50 t/s for the FP8 version.