GLM-4.7-Flash-NVFP4 was just released, but for Transformers 5.0 + vLLM 0.14...?

GadflyII/GLM-4.7-Flash-NVFP4 was just released on HuggingFace.

However, it requires vLLM 0.14.0 and Transformers 5.0.0. The latest vLLM release on GitHub is currently 0.13.x, and Transformers is on 4.x.

I currently have a working GPT-OSS-120B setup through Docker Compose, but I can’t figure out how to get the above prerequisites in place and bring it up. This is my current GPT-OSS-120B Docker Compose config:

services:
  vllm-node:
    image: vllm-image
    container_name: vllm-container
    environment: 
      - VLLM_API_SERVER_COUNT=2
      - HF_TOKEN=${HF_TOKEN}
    restart: unless-stopped

    privileged: true
    network_mode: host
    ipc: host
    pid: host

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface

    command: >
      bash -c "vllm serve
      openai/gpt-oss-120b
      --port 8000 --host 0.0.0.0
      --gpu-memory-utilization 0.7
      --load-format fastsafetensors"

Just switching openai/gpt-oss-120b to GadflyII/GLM-4.7-Flash-NVFP4 would not work, as you might have already assumed. How can I get the dependencies?

Try an AWQ quant; that should perform better.

Try the nightly Docker build: vllm/vllm-openai:nightly

or eugr’s build: GitHub - eugr/spark-vllm-docker (Docker configuration for running vLLM on dual DGX Sparks)

Not tested myself yet; I’ll try this evening (CET).
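An untested sketch of what that swap could look like in the Compose file from the original post: the nightly image, the GLM model, and the flags carried over from the GPT-OSS setup (the `entrypoint: []` line is there because the official image ships its own entrypoint; drop it if your image does not):

```yaml
services:
  vllm-node:
    image: vllm/vllm-openai:nightly  # nightly build instead of a custom image
    environment:
      - HF_TOKEN=${HF_TOKEN}
    restart: unless-stopped
    network_mode: host
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    entrypoint: []  # run the full command below instead of the image entrypoint
    command: >
      vllm serve GadflyII/GLM-4.7-Flash-NVFP4
      --port 8000 --host 0.0.0.0
      --gpu-memory-utilization 0.7
```

Whether this actually loads depends on the nightly having the required Transformers version, per the rest of this thread.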


I think you can try this version from Unsloth.

It was better for me.

The nightly build still needs the newer Transformers lib:

pip install git+https://github.com/huggingface/transformers.git

as mentioned in the model card. I successfully modified the nightly container and tested it on a regular server (L40).
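To bake that into the image instead of patching a running container, a minimal untested Dockerfile sketch on top of the nightly build (the exact Transformers commit you get depends on when you build):

```dockerfile
# Base: the vLLM nightly image mentioned earlier in the thread
FROM vllm/vllm-openai:nightly

# Bleeding-edge Transformers from git, as the model card requires
RUN pip install --no-cache-dir git+https://github.com/huggingface/transformers.git
```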

Oh, vLLM v0.14.0 was released just 15 minutes ago. No release notes yet.

vLLM sucks at GGUF.

Running it on llama.cpp is the fastest option for inference. Flash attention is currently broken for GLM 4.7 Flash, but there is a branch with the fixes on the llama.cpp GitHub that you can clone.


No luck building vLLM latest on Spark today:

Failed to fetch: https://download.pytorch.org/whl/cu130/numpy/

Cloudflare seems to be troubled or my upstream provider just fails me… will retry tomorrow.

Have you already checked out vLLM on GB10: gpt-oss-120b MXFP4 slower than SGLang/llama.cpp... what’s missing? and New pre-built vLLM Docker Images for NVIDIA DGX Spark?

Please check them out.


FYI: I released the images at New pre-built vLLM Docker Images for NVIDIA DGX Spark. I don’t think GLM-4.7-Flash works on the latest image, but I was planning to release a new version tomorrow to coincide with the PyTorch 2.10 release. I don’t think the vLLM 0.14.0 release will work either; a key commit landed right after the 0.14.0 release. I might include it in tomorrow’s updated 0.14.0 image, given that GLM-4.7-Flash is pretty key.


Thanks for the context, Drew.

vLLM 0.14.0 was just released. I’m trying to set it up with Docker Compose, but it keeps telling me that at least one -cc argument is expected. I can’t figure this one out after multiple iterations. Has anyone else tried?


Now that I can access the PyTorch repos again without issues, I was able to test it with eugr’s Docker build. It works; you only need to update Transformers to v5.0.0.

JFTR, eugr has already added a flag for using the bleeding-edge Transformers library (--pre-tf):

./build-and-copy.sh --pre-tf --use-wheels

Released updated Docker container images, so hopefully that can help / be another alternative. I’d still recommend eugr’s repo for people who need a new version on a daily basis (or need the newest commit at any given point). I’m trying to be somewhere in between a nightly build and NVIDIA’s more conservative releases.


There is a new version of the Unsloth GGUF model with a fix for looping and poor output:

And there is an NVFP4 version with updated KV cache fixes. You can use 200K context with just 10 GB of RAM.

When I combine the Unsloth version with the llama.cpp flash attention branch fix, GLM 4.7 Flash token generation starts out at ~61 tk/s, tapering down to 57 tk/s at 5k tokens!
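For anyone trying to reproduce this, a llama-server invocation along these lines is the general shape (sketch only: the model path is a placeholder for whichever Unsloth GGUF you downloaded, and flag spellings vary a bit between llama.cpp versions):

```shell
# Placeholder model path - point it at the downloaded Unsloth GGUF.
./llama-server \
  -m ./models/glm-4.7-flash.gguf \
  --flash-attn on \
  -c 200000 \
  --host 0.0.0.0 --port 8080
```

Until the flash attention fixes are merged, this needs to be run from a build of the fix branch mentioned earlier in the thread.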


Sorry guys, I am quite new to the Spark ecosystem. Did anyone successfully get it to run with NVFP4 on vLLM already? It’s quite confusing to figure out what I need to compile/package myself regarding Triton kernels, vLLM, Transformers 5.0, etc.

I’d be happy about any pointers. I only got the Q8 GGUF to run on llama.cpp at around 30 t/s, which was a little underwhelming, but maybe I missed the flash attention branch.

I created a PR for eugr/spark-vllm-docker which allowed me to run GLM-4.7-Flash-NVFP4 on my DGX Spark. Disclaimer: I only got plain decode (no tools) at ~35–45 tokens/s, and with tool calling / reasoning enabled ~15–25 tokens/s, but at least I got it running on vLLM this way.

I’ve also updated New pre-built vLLM Docker Images for NVIDIA DGX Spark - #7 by dbsci to include the necessary patch, so the scitrera/dgx-spark-vllm:0.14.0-t5 container image should be OK for GLM-4.7-Flash. Note that I have not specifically tested the NVFP4 quant, so any feedback on that is welcome.
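A minimal untested Compose sketch for that image, mirroring the GPT-OSS config from the top of the thread (how the image's entrypoint interacts with `command` is an assumption here; you may need to adjust if it already wraps `vllm serve`):

```yaml
services:
  vllm-node:
    image: scitrera/dgx-spark-vllm:0.14.0-t5
    environment:
      - HF_TOKEN=${HF_TOKEN}
    network_mode: host
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      vllm serve GadflyII/GLM-4.7-Flash-NVFP4
      --host 0.0.0.0 --port 8000
```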

Nice! I like that it’s a mod; that makes it easier. I’ll have a look and merge if there are no issues.


I’m currently downloading that NVFP4 model to test your mod. But I just wanted to mention that since NVFP4 support is still not great on Spark, I think AWQ models will work much better. Also, the FP8 version, if/when it comes out, will be pretty fast as well.

Since it’s a 3B-active-parameter model, I’d expect something around 80 t/s for the AWQ 4-bit version and around 50 t/s for the FP8 version.