GadflyII/GLM-4.7-Flash-NVFP4 was just released on HuggingFace.
However, it requires vLLM 0.14.0 and Transformers 5.0.0. The latest vLLM release on GitHub is currently 0.13.x, and Transformers is at 4.x.
I currently have a working GPT OSS 120B setup through Docker Compose, but I can't figure out how to install the above prerequisites and get it up and running. This is my current GPT OSS 120B Docker Compose config file.
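A minimal sketch of the kind of Compose file I mean (the image tag, model name, ports, and cache path are illustrative placeholders, not my exact config; the official vLLM image's entrypoint already launches the OpenAI-compatible server, so only model args go in `command`):

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest   # placeholder; a DGX Spark setup needs an ARM/Spark-specific build
    command: ["--model", "openai/gpt-oss-120b", "--host", "0.0.0.0", "--port", "8000"]
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface   # reuse downloaded weights across restarts
    ipc: host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```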
Running it on llama.cpp is the fastest for inference. Currently, flash attention is broken for GLM 4.7 Flash, but there is a branch with the fixes on the llama.cpp GitHub that you can clone.
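The general shape of building from that branch (the branch name below is a placeholder; substitute the actual fix branch from the llama.cpp GitHub):

```shell
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin <fix-branch>     # substitute the real flash-attention fix branch
git checkout <fix-branch>
cmake -B build -DGGML_CUDA=ON     # CUDA backend for the Spark's GPU
cmake --build build --config Release -j
```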
FYI, I released the images at New pre-built vLLM Docker Images for NVIDIA DGX Spark. I don't think GLM-4.7-Flash works on the latest image, but I was planning to release a new version tomorrow to coincide with the PyTorch 2.10 release. I don't think the vLLM 0.14.0 release will work; I believe a key commit landed right after the 0.14.0 release, but I might include it in tomorrow's updated 0.14.0 image, given that GLM-4.7-Flash is pretty key…
vLLM 0.14.0 was just released. I'm trying to set it up with Docker Compose, but it keeps telling me that at least one -cc argument is expected. I can't figure this one out after multiple iterations. Has anyone else tried?
Now that I can access the PyTorch repos again without issues, I was able to test it with eugr's Docker build. It works; you only need to update Transformers to v5.0.0.
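That update is a one-liner inside the running container (container name is a placeholder for whatever your Compose service is called):

```shell
docker exec -it vllm pip install --upgrade "transformers==5.0.0"
```

Or bake it into the image so it survives restarts by adding the same `pip install` line to the Dockerfile.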
Released updated Docker container images, so hopefully that helps / offers another alternative. I'd still recommend eugr's repo for people who need a new version daily (or the newest commit at any given point). I'm trying to sit somewhere between a nightly build and NVIDIA's more conservative releases.
There is a new version of the Unsloth GGUF model with a fix for looping and poor output:
And there is an NVFP4 version with updated KV-cache fixes. You can use 200K context with just 10 GB of RAM.
When I combine the Unsloth version with the llama.cpp flash-attention branch fix, GLM 4.7 Flash token generation starts at ~61 tk/s and tapers to ~57 tk/s at 5K tokens!
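For reference, a launch along these lines (the model filename is illustrative, and the `-fa` flag syntax varies between llama.cpp versions, so check `llama-server --help` on the fix branch):

```shell
./build/bin/llama-server \
  -m GLM-4.7-Flash-Q4_K_XL.gguf \
  -c 200000 \
  -fa on \
  --port 8080
# -m: the updated Unsloth GGUF
# -c: 200K context, which fits thanks to the KV-cache fixes
# -fa: flash attention, which needs the fix branch to work with this model
```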
Sorry guys, I am quite new to the Spark ecosystem. Did anyone successfully get it to run with NVFP4 on vLLM already? It's quite confusing to know what I need to compile/package myself regarding Triton kernels, vLLM, Transformers 5.0, etc.
I'd be happy about any pointers. I only got the Q8 GGUF to run on llama.cpp at around 30 t/s, which was a little underwhelming, but maybe I missed the flash-attention branch.
I created a PR for eugr/spark-vllm-docker which allowed me to run GLM-4.7-Flash-NVFP4 on my DGX Spark. Disclaimer: I only got plain decode (no tools) at ~35–45 tokens/s, and with tool calling / reasoning enabled, ~15–25 tokens/s, but at least I got it running on vLLM this way.
I’ve also updated: New pre-built vLLM Docker Images for NVIDIA DGX Spark - #7 by dbsci to include the necessary patch, so the scitrera/dgx-spark-vllm:0.14.0-t5 container image should be OK to use for GLM-4.7-Flash. Note that I have not specifically tested the NVFP4 quant, so any feedback on that is welcome.
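Pulling and running that image would look roughly like this (the `vllm serve` invocation is an assumption; the exact entrypoint and arguments depend on how the image is set up, so check its README):

```shell
docker pull scitrera/dgx-spark-vllm:0.14.0-t5
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  scitrera/dgx-spark-vllm:0.14.0-t5 \
  vllm serve GadflyII/GLM-4.7-Flash-NVFP4
```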
I'm currently downloading that NVFP4 model to test your mod. I just wanted to mention that since NVFP4 support is still not great on Spark, I think AWQ models will work much better. An FP8 version, if/when it comes out, should be pretty fast as well.
Since it's a 3B-active-parameter model, I'd expect something around 80 t/s for AWQ 4-bit and around 50 t/s for the FP8 version.
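Those guesses come from a memory-bandwidth roofline; a back-of-envelope sketch, assuming ~273 GB/s for the DGX Spark's LPDDR5X and that roughly the 3B active parameters are read once per decoded token (both assumptions, and the ceiling ignores KV-cache reads and kernel overhead):

```shell
awk 'BEGIN {
  bw = 273;             # GB/s, approximate DGX Spark memory bandwidth (assumption)
  active_gparams = 3;   # billions of active parameters read per token
  # ceiling = bandwidth / bytes moved per token
  printf "AWQ 4-bit (0.5 B/param) ceiling: %.0f t/s\n", bw / (active_gparams * 0.5);
  printf "FP8 (1 B/param) ceiling:         %.0f t/s\n", bw / (active_gparams * 1.0);
}'
```

Real decode throughput lands well below these ceilings once KV-cache traffic and kernel efficiency are factored in, which is roughly where the 80 t/s and 50 t/s estimates come from.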