Hey there, I created a doc on how to run StepFun’s latest Step 3.5 Flash model using the INT4/GGUF version.
This is by far the best model I have run to date (gpt120b is a close second).
Performance Metrics
Tokens Per Second (TPS): 22.92 t/s
Latency (Time Per Token): 44.39 ms
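As a sanity check on the two figures above: per-token latency and throughput should be roughly reciprocal (latency in ms is about 1000 / TPS). Here is a minimal sketch of that conversion; the helper names are made up for illustration. The two reported numbers do not convert to each other exactly, possibly because they come from different runs or count different overhead.

```shell
# ms_per_token TPS: per-token latency (ms) implied by a throughput figure
ms_per_token() { awk -v tps="$1" 'BEGIN { printf "%.2f\n", 1000 / tps }'; }

# tokens_per_sec MS: throughput (t/s) implied by a per-token latency
tokens_per_sec() { awk -v ms="$1" 'BEGIN { printf "%.2f\n", 1000 / ms }'; }

ms_per_token 22.92     # prints 43.63
tokens_per_sec 44.39   # prints 22.53
```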
# Step 1 - Download and Format Model
# Download Model
# If you don't have it already, install the Hugging Face CLI to download the model
curl -LsSf https://hf.co/cli/install.sh | bash
# Download model (replace path destination)
hf download stepfun-ai/Step-3.5-Flash-Int4 --local-dir /path/to/models/Step-3.5-Flash-Int4
# Once downloaded, combine files into one .gguf
cd /path/to/models/Step-3.5-Flash-Int4
cat step3p5_flash_Q4_K_S.gguf.part-* > step3.5_flash_Q4_K_S.gguf
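Before moving on, it may be worth a quick check that the combined file at least looks like a GGUF file. GGUF files start with the 4-byte magic "GGUF", so a rough sketch (the helper name is hypothetical, and this only checks the header, not the full file) could be:

```shell
# is_gguf FILE: succeed if FILE starts with the 4-byte magic "GGUF".
# This only inspects the first 4 bytes; it does not validate the whole file.
is_gguf() {
  [ "$(head -c 4 "$1" 2>/dev/null)" = "GGUF" ]
}

# Example (file name taken from the cat command above):
# is_gguf step3.5_flash_Q4_K_S.gguf && echo "header looks like GGUF"
```

If the check fails, the shards were probably concatenated in the wrong order or one of them is incomplete.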
# Step 2 - Build Runner
# Option 1 - Run Directly on System
# Download custom llama.cpp - Go into whatever directory you want first (e.g. cd /ai/launchers/)
git clone https://github.com/stepfun-ai/Step-3.5-Flash.git
cd Step-3.5-Flash/llama.cpp
# Build Cuda
cmake -S . -B build-cuda \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON \
-DGGML_CUDA_GRAPHS=ON \
-DLLAMA_CURL=OFF
# Perform the actual compilation (using all CPU cores for speed)
cmake --build build-cuda --config Release -j$(nproc)
# Launch Model - Change port as needed - Takes about 5 minutes to load
~/Step-3.5-Flash/llama.cpp/build-cuda/bin/llama-server \
-m /path/to/models/Step-3.5-Flash-Int4/step3.5_flash_Q4_K_S.gguf \
-c 16384 \
-ngl 999 \
--port 8000 \
--host 0.0.0.0
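Since loading takes several minutes, a small retry loop can tell you when the server is actually ready. This is only a sketch: wait_until is a hypothetical helper, and the example assumes the StepFun fork keeps the /health endpoint that upstream llama-server provides.

```shell
# wait_until ATTEMPTS DELAY_SECONDS CMD...: retry CMD until it succeeds.
# Returns 0 as soon as CMD exits 0, or 1 after ATTEMPTS failed tries.
wait_until() {
  local attempts="$1" delay="$2" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: poll every 10 s for up to 10 minutes (assumes /health exists in this fork)
# wait_until 60 10 curl -sf http://localhost:8000/health && echo "server ready"
```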
# Option 2 - Build & Run Via Docker (recommended)
# Feel free to use whatever directory you want
mkdir -p ~/ai/launchers/Step3.5-Flash-Int4
cd ~/ai/launchers/Step3.5-Flash-Int4
nano docker-compose.yml
# Paste the following. Make sure to change the directories to point where your model is downloaded.
services:
  step-server:
    image: nvidia/cuda:13.1.1-devel-ubuntu24.04
    container_name: llama-Step3.5-Flash-Int4
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      - ${HOME}/ai/models:/models
      - ${HOME}/ai/launchers/Step3.5-Flash-Int4/app:/app
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    working_dir: /app
    command: >
      bash -c "apt-get update && apt-get install -y git cmake build-essential libcurl4-openssl-dev &&
      if [ ! -d 'Step-3.5-Flash' ]; then git clone https://github.com/stepfun-ai/Step-3.5-Flash.git; fi &&
      cd Step-3.5-Flash/llama.cpp &&
      cmake -S . -B build-cuda -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON -DLLAMA_CURL=OFF &&
      cmake --build build-cuda --config Release -j$(nproc) &&
      ./build-cuda/bin/llama-server -m /models/Step-3.5-Flash-Int4/step3.5_flash_Q4_K_S.gguf -c 16384 -ngl 999 --port 8000 --host 0.0.0.0"
# Launch Docker
docker compose up -d
# The model will start to load - Takes a while - Check logs to see status
docker logs -f --tail=120 llama-Step3.5-Flash-Int4
# Step 3 - Test
curl http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "Confirm you are alive on Spark-1."}
]
}'
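If you want to script this test, a small helper that builds the request body avoids shell-quoting mistakes. This is a rough sketch assuming the OpenAI-style chat-completions format used in the request above; json_escape and chat_payload are hypothetical names, max_tokens is the standard field from that format, and the escaping only handles backslashes and double quotes (not newlines or control characters).

```shell
# json_escape STRING: escape backslashes and double quotes for use inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# chat_payload PROMPT [MAX_TOKENS]: print a single-message chat-completions request body.
chat_payload() {
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":%d}\n' \
    "$(json_escape "$1")" "${2:-64}"
}

# Example:
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$(chat_payload 'Confirm you are alive on Spark-1.')"
```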
# Step 4 - Add to Open WebUI or Your Preferred Inference Platform - :)
I can confirm that this model is looking pretty good.
I have been banging my head against a brick wall the past few days trying to get Nemotron 3 Nano running using the NVFP4 quant on vLLM. No matter what I did, including rebuilding on all the latest sources, I could not get a stable setup when using it with OpenCode.
The issues are crashes logged at startup in the flashinfer code; this also happens with the official Nvidia vLLM container. Once it’s up, it tends to run for a few minutes of processing and then die a horrible death. That, together with an inability to perform tool calls reliably, makes it unworkable for coding.
I then came across this new model, and it’s a world of difference. It runs slightly slower, but still at a pretty good rate, and better yet, it appears to be totally stable. My first impressions are that it seems to understand the code better than Nemotron and GLM 4.7 Flash, but time will tell after I throw some more difficult problems at it.
It’s llama.cpp based, so I can’t run it on a cluster, but for the moment it looks like it’s going to give me a stable platform to work from.
This sounds really interesting and I’m going to set the weights to come down overnight. I’d really like to see more MoEs like this with int4 or NVFP4 that use much of the Spark’s available RAM!
The flags they recommend for setting up the CUDA build include -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_COMMON=ON
Their flags for the build command only use -j8 rather than all available cores, and do not include --config Release.
Their flags for starting the model include the flags -c 16384 -b 2048 -ub 2048 -fa on --temp 1.0 (I’m sceptical about setting temp to 1.0, nowhere else on the page changes temp). Yours lacks a couple of these but includes -ngl 999 which I think should be unnecessary.
Beyond that, I am very interested in their comment which says “On NVIDIA DGX Spark, the Step 3.5 Flash achieves a generation speed of 20 tokens per second; by integrating the INT8 quantization technology for KVCache, it supports an extended context window of up to 256K tokens, thus delivering long text processing capabilities on par with cloud-based inference.”
Nowhere on the model page seems to delve further into this, unfortunately. I’m more familiar with vLLM than llama.cpp - their startup command seems to limit context to 16k. Anyone care to comment further? I’d love to push that context window out on a single Spark.
I was able to build with the flags from this thread (-DGGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON -DLLAMA_CURL=OFF). However, the flags suggested for starting the model (-c 16384 -ngl 999) always lead to “encountered an error while trying to fit params to free device memory”. I’m still working on running Step-3.5-Flash on a single Spark. I will later try -c 16384 -b 2048 -ub 2048 -fa on --temp 1.0 and -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_COMMON=ON.
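One thing that may be worth checking before launching is how much memory is actually free. The sketch below reads MemAvailable from /proc/meminfo; it assumes that on the Spark's unified memory this is a rough proxy for what llama.cpp can allocate, which is my assumption and not something the StepFun docs state. The helper names are hypothetical, and the weights are reported elsewhere in this thread as about 111 GB.

```shell
# mem_available_gib [MEMINFO_PATH]: print MemAvailable in whole GiB.
# /proc/meminfo reports the value in kB (KiB), so divide by 1024*1024.
mem_available_gib() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

# enough_memory NEEDED_GIB [MEMINFO_PATH]: succeed if at least NEEDED_GIB is available.
enough_memory() {
  [ "$(mem_available_gib "${2:-/proc/meminfo}")" -ge "$1" ]
}

# Example: warn before launching if less than ~111 GiB is available
# enough_memory 111 || echo "probably not enough free memory; close other workloads first"
```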
Since you only need llama-server to act as your backend, skipping -DLLAMA_BUILD_EXAMPLES=ON and -DLLAMA_BUILD_COMMON=ON saves compile time and reduces the final binary size.
$(nproc) is much better. It dynamically scales the build to use all available CPU cores on your Spark. Scaling back to -j8 is a generic recommendation for users on 8-core laptops to prevent system lag; on your rig, limiting to 8 cores just makes you wait longer for the compile.
Model flags: These are the settings from their documentation for llama-cli (step 5) - ./llama-cli -m step3.5_flash_Q4_K_S.gguf -c 16384 -b 2048 -ub 2048 -fa on --temp 1.0 -p "What's your name?"
Note: I was able to scale the context window to 262144 without any issues.
-ngl 999 is more of a fail safe, ensuring every layer is forced on the GPU. If you omit it, llama.cpp might default to partial CPU offloading, which would destroy your performance on the Spark.
All-in-all, amazing model, just slow. Hopefully someone can come up with ways to speed it up on the Spark.
Thanks for your notes, @Keyper-AI! I got excited about using this model and went ahead and generalized your docker-compose.yml to abstract the build from the running of the container - this makes startup far faster.
Here is my known-working config. As in the original post, the necessary setup is to download the model (I assume it is located in your user account at ~/models/Step-3.5-Flash-Int4) and combine the GGUF shards into one file by running cat step3p5_flash_Q4_K_S.gguf.part-* > step3.5_flash_Q4_K_S.gguf in that directory.
Copy these into new files in an empty directory.
Dockerfile
FROM nvidia/cuda:13.1.1-devel-ubuntu24.04
# 1. Install build dependencies
RUN apt-get update && apt-get install -y \
git \
cmake \
build-essential \
libcurl4-openssl-dev \
&& rm -rf /var/lib/apt/lists/*
# 2. Set up build directory
WORKDIR /build
# 3. Clone the specific Step-3.5-Flash repository
RUN git clone https://github.com/stepfun-ai/Step-3.5-Flash.git
# 4. Configure and Build
WORKDIR /build/Step-3.5-Flash/llama.cpp
# The linker needs libcuda.so.1 (Driver API), but the driver isn't mounted during build.
# We point it to the "stubs" provided by the nvidia-devel image.
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
# We use the specific flags provided, including the custom CUDA Arch 121
RUN cmake -S . -B build-cuda \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON \
-DGGML_CUDA_GRAPHS=ON \
-DLLAMA_CURL=OFF \
-DCMAKE_CUDA_ARCHITECTURES=121 \
-DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs" \
-DCMAKE_SHARED_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs"
# Compile llama.cpp with all available cores
# LD_LIBRARY_PATH set inline here only for this command
RUN export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH && \
cmake --build build-cuda --config Release --target llama-server -j$(nproc)
# 5. Install the binary to the system path
RUN cp build-cuda/bin/llama-server /usr/local/bin/llama-server
# 6. Set Runtime Environment Variables
ENV GGML_CUDA_GRAPH_OPT=1
# 7. Set default working directory
WORKDIR /app
In the same directory, save the following as docker-compose.yml
services:
  step-server:
    # Build from the Dockerfile in the current directory
    build:
      context: .
      dockerfile: Dockerfile
    image: step-3.5-flash:local
    container_name: llama-Step3.5-Flash-Int4
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      # We only need to mount the models now.
      # The code/binary is baked into the image.
      - ${HOME}/models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    working_dir: /app
    # The command is now strictly for runtime arguments
    command: >
      llama-server
      -m /models/Step-3.5-Flash-Int4/step3.5_flash_Q4_K_S.gguf
      -c 200000
      -ngl 999
      -fa 1
      -b 2048
      -ub 2048
      -ctk q8_0
      -ctv q8_0
      --no-mmap
      --port 8000
      --host 0.0.0.0
Finally, from the directory these files are in, run
docker compose build
docker compose up -d
It starts up quite fast and works very well. I am seeing about 24 t/s for short prompts; for prompts around 10k tokens it is still over 19 t/s. I’m impressed by the model so far.
The stepfun-ai GitHub repo has more in-depth info about running on the DGX Spark, and there are two options we could experiment with which aren’t included in my above config.
From this file:
“When testing long context (e.g. 256K), OOM may occur. The build flag -DGGML_CUDA_FORCE_MMQ=ON, environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, and runtime parameters -ctk q8_0 -ctv q8_0 can help mitigate memory issues.”
My Docker config above has the Int8 KV cache runtime flags already set, but is not using -DGGML_CUDA_FORCE_MMQ=ON in the llama.cpp build, nor the environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
From what I understand forcing MMQ is mostly something for legacy hardware, but can result in lower RAM use (at a substantial performance penalty). It would probably be worth trying this, as they clearly benched contexts up to 262k on the Spark.
I’m less convinced GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 will be helpful. It’s designed to overflow to system RAM if GPU VRAM is exhausted. The memory is already unified on the GB10, and I don’t think there is really a fallback - unless this allows it to thrash instead of crash.
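For anyone who wants to experiment with these two options, here is a rough sketch of how they could be wired in. The cmake flags are the ones used earlier in this thread plus the -DGGML_CUDA_FORCE_MMQ=ON flag quoted from the StepFun docs; the cmake_flags helper and its mmq argument are hypothetical, and I have not benchmarked either option.

```shell
# cmake_flags [mmq]: print the CUDA build flags from this thread.
# Passing "mmq" appends the optional -DGGML_CUDA_FORCE_MMQ=ON flag.
cmake_flags() {
  local flags="-DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON -DLLAMA_CURL=OFF"
  if [ "${1:-}" = "mmq" ]; then
    flags="$flags -DGGML_CUDA_FORCE_MMQ=ON"
  fi
  printf '%s\n' "$flags"
}

# Example build with MMQ forced (word splitting of the flags is intentional):
# cmake -S . -B build-cuda $(cmake_flags mmq)
#
# Example launch with the environment variable set for this one process:
# GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 llama-server -m /models/Step-3.5-Flash-Int4/step3.5_flash_Q4_K_S.gguf -ctk q8_0 -ctv q8_0
```

In the Docker setup, the flag would go into the cmake configure step of the Dockerfile and the variable into the container's environment.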
I successfully ran the model using this compose file and connected it to openclaw. Overall, it works quite well, but I did notice that it’s a bit slow and almost maxes out the available resources.
Meanwhile, I saw that Qwen just released the Qwen3-Coder-Next model. According to their introduction, it has a total of 80B parameters but only activates 3B per inference, which significantly reduces memory and compute requirements. This might be a better fit for running on DGX Spark devices. I’ll deploy and test this model on a DGX Spark soon. If you’re interested, you might want to try out Qwen3-Coder-Next as well.
This depends heavily on your use case. I find 20-24 tok/s perfectly usable, and quite impressive, given the model’s size.
I really like Step3.5-Flash for general tasks. After kicking the tires for 24h I am actually even more impressed by the model. I agree with the original poster that this is hands-down the best local model I have ever run. If you’re doing coding, especially very long contexts, then yes the new Qwen3-Coder-Next appears to be excellent - as it should be, for a purpose-tuned model. But I am definitely keeping this one around!
Does anyone have numbers for context at 100k / 150k / 200k? What’s the t/s like with long context? My internet is quite slow, so I need to think twice about whether to download Qwen3 80B or this one xd
I have not run this (too busy using it), but you could modify my docker-compose above, swapping in a different command for benchmarking based on their recommended command below (you may need to set -c lower than this):
In real-world general use up to 10k context I am seeing 19-24 t/s. The above should give a sense of how it scales up to 262k.
It is possible this test will not run with my Docker environment above, because I did not use every last possible optimization - in particular I did not force MMQ in llama.cpp (see the post above, Running Step-3.5-Flash on Single Spark - #12 by joshua.dale.warner). That should reduce memory usage at a cost in throughput. My guess is that their test results did force MMQ, because the throughput is lower than I would have expected (I see ~24 t/s in the 2048 and 8192 rows above). The theoretical tradeoff is speed vs memory usage.
No worries, it looks like they renamed the model shards within the last day. There was a comment thread about discoverability by llama.cpp
I think that makes the cat command unnecessary. Avoiding the unnecessary copy and space will be nice. Try passing the full name of just the first shard - llama.cpp should find the rest automatically now. I’ll verify this and, assuming true, make a follow up post in a bit.
Thanks for the updated runtime - I was duplicating a lot of this work today, unfortunately, because whoever updated the StepFun repo simply deleted all of the GGUF shards and reuploaded them with new names.
They could have instead done git mv to rename them and just updated the hashes. The changes would have been a few kb. But no, they removed and reuploaded everything. Be aware that this choice means that if you simply git pull origin main two things happen:
You download 111 GB of new weights
The prior ones look gone but persist in your .git repo, doubling size on disk
Because of this, to save space on your Spark, I recommend deleting the old repo completely and then cloning the HF repo again with git clone --depth 1
My connection is on the slower side, so I always pre-download weights into ~/models overnight and then map the weights into containers. I structured my prior framework around this.
For people who are fine downloading >100GB on startup once, Jake’s new script will cache this GGUF for you.