Running Step-3.5-Flash on Single Spark

Hey there, I created a doc on how to run StepFun’s latest Step 3.5 Flash Model using the INT4/GGUF version.

This is by far the best model I have run to date (gpt120b is a close second).

Performance Metrics

  • Tokens Per Second (TPS): 22.92 t/s
  • Latency (Time Per Token): 44.39 ms

# Step 1 - Download and Format Model

# Download Model 
# If you don't have it already, install the Hugging Face CLI to download the model
curl -LsSf https://hf.co/cli/install.sh | bash

# Download model (replace path destination)
hf download stepfun-ai/Step-3.5-Flash-Int4 --local-dir /path/to/models/Step-3.5-Flash-Int4

# Once downloaded, combine files into one .gguf
cd /path/to/models/Step-3.5-Flash-Int4
cat step3p5_flash_Q4_K_S.gguf.part-* > step3.5_flash_Q4_K_S.gguf
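
A minimal sketch of the combine step with a sanity check, assuming the download still ships `.part-*` files. The helper name `combine_parts` is made up for illustration. It refuses to run when no parts match, and it compares the byte count of the result against the sum of the parts:

```shell
# Hypothetical helper: concatenate <prefix>.part-* into <output> and verify the size.
combine_parts() {
  local prefix="$1" output="$2" expected actual
  # Replace the positional parameters with the matching part files (sorted by the shell).
  set -- "$prefix".part-*
  if [ ! -e "$1" ]; then
    echo "no files matching $prefix.part-*" >&2
    return 1
  fi
  cat "$@" > "$output" || return 1
  # wc -c prints a "total" line last for several files; for one file the last line is that file.
  expected=$(wc -c "$@" | tail -n 1 | awk '{print $1}')
  actual=$(wc -c < "$output" | awk '{print $1}')
  if [ "$expected" -ne "$actual" ]; then
    echo "size mismatch: parts=$expected combined=$actual" >&2
    return 1
  fi
  echo "combined $# parts into $output ($actual bytes)"
}
```

Usage, from inside the model directory: `combine_parts step3p5_flash_Q4_K_S.gguf step3.5_flash_Q4_K_S.gguf`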

# Step 2 - Build Runner

# Option 1 - Run Directly on System

# Download custom llama.cpp - Go into whatever directory you want first (i.e. cd /ai/launchers/)
git clone https://github.com/stepfun-ai/Step-3.5-Flash.git
cd Step-3.5-Flash/llama.cpp

# Build Cuda
cmake -S . -B build-cuda \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DLLAMA_CURL=OFF

# Perform the actual compilation (using all CPU cores for speed)
cmake --build build-cuda --config Release -j$(nproc)

# Launch Model - Change Port as needed - Takes about 5 minutes to load
~/Step-3.5-Flash/llama.cpp/build-cuda/bin/llama-server \
  -m /path/to/models/Step-3.5-Flash-Int4/step3.5_flash_Q4_K_S.gguf \
  -c 16384 \
  -ngl 999 \
  --port 8000 \
  --host 0.0.0.0
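
Since loading takes several minutes, a small polling loop can tell you when the server is ready. This is a minimal sketch, assuming llama-server answers on /health (the endpoint used by the Dockerfile HEALTHCHECK further down this thread) with an error status until the model is loaded. The function name and the default values are made up for illustration:

```shell
# Poll <url>/health until it succeeds or the attempts run out.
# Arguments: base URL, max attempts, seconds between attempts (defaults are arbitrary).
wait_for_health() {
  local url="${1:-http://localhost:8000}" tries="${2:-60}" delay="${3:-10}"
  local i=1
  while [ "$i" -le "$tries" ]; do
    # -f makes curl exit non-zero on HTTP error statuses, -s keeps it quiet.
    if curl -sf "$url/health" > /dev/null; then
      echo "server ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "server not ready after $tries attempts" >&2
  return 1
}
```

Usage: `wait_for_health http://localhost:8000 60 10` waits up to roughly ten minutes before giving up.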


# Option 2 - Build & Run Via Docker (recommended) 

# Feel free to use whatever directory you want
mkdir -p ~/ai/launchers/Step3.5-Flash-Int4
cd ~/ai/launchers/Step3.5-Flash-Int4
nano docker-compose.yml

# Paste the following. Make sure to change the directories to point where your model is downloaded. 
services:
  step-server:
    image: nvidia/cuda:13.1.1-devel-ubuntu24.04
    container_name: llama-Step3.5-Flash-Int4
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      - ${HOME}/ai/models:/models
      - ${HOME}/ai/launchers/Step3.5-Flash-Int4/app:/app
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    working_dir: /app
    command: >
      bash -c "apt-get update && apt-get install -y git cmake build-essential libcurl4-openssl-dev && 
      if [ ! -d 'Step-3.5-Flash' ]; then git clone https://github.com/stepfun-ai/Step-3.5-Flash.git; fi && 
      cd Step-3.5-Flash/llama.cpp && 
      cmake -S . -B build-cuda -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON -DLLAMA_CURL=OFF && 
      cmake --build build-cuda --config Release -j$(nproc) && 
      ./build-cuda/bin/llama-server -m /models/Step-3.5-Flash-Int4/step3.5_flash_Q4_K_S.gguf -c 16384 -ngl 999 --port 8000 --host 0.0.0.0"

# Launch Docker
docker compose up -d

# The model will start to load - Takes a while - Check logs to see status
docker logs -f --tail=120 llama-Step3.5-Flash-Int4

# Step 3 - Test
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Confirm you are alive on Spark-1."}
    ]
  }'

# Step 4 - Add to Open WebUI or Your Preferred Inference Platform - :)

Enjoy!

Looks good! I will move this to GB10 projects

I can confirm that this model is looking pretty good.

I have been banging my head against a brick wall the past few days trying to get Nemotron 3 Nano running using the NVFP4 quant on vLLM. No matter what I did, including rebuilding on all the latest sources, I could not get a stable setup when using it with OpenCode.

The issues are crashes logged at startup in the flashinfer code; this also happens with the official Nvidia vLLM container. Once it’s up, it tends to run for a few minutes of processing and then die a horrible death. That, together with an inability to perform tool calls reliably, makes it unworkable for coding.

I then came across this new model, and it’s a world of difference. It runs slightly slower, but still at a pretty good rate, and better yet, it appears to be totally stable. My first impressions are that it seems to understand the code better than Nemotron and GLM 4.7 Flash, but time will tell after I throw some more difficult problems at it.

It’s llama.cpp based, so I can’t run it on a cluster, but for the moment it looks like it’s going to give me a stable platform to work from.

This sounds really interesting and I’m going to set the weights to come down overnight. I’d really like to see more MoEs like this with int4 or NVFP4 that use much of the Spark’s available RAM!

@Keyper-AI couple quick questions: is there a reason a couple of your commands differ from those recommended explicitly for the Spark on stepfun-ai/Step-3.5-Flash-Int4 · Hugging Face ?

  • The flags they recommend for setting up the CUDA build include -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_COMMON=ON
  • Their flags for the build command only use -j8 rather than all available cores, and do not include --config Release.
  • Their flags for starting the model include the flags -c 16384 -b 2048 -ub 2048 -fa on --temp 1.0 (I’m sceptical about setting temp to 1.0, nowhere else on the page changes temp). Yours lacks a couple of these but includes -ngl 999 which I think should be unnecessary.

Beyond that, I am very interested in their comment which says “On NVIDIA DGX Spark, the Step 3.5 Flash achieves a generation speed of 20 tokens per second; by integrating the INT8 quantization technology for KVCache, it supports an extended context window of up to 256K tokens, thus delivering long text processing capabilities on par with cloud-based inference.”

Nowhere on the model page seems to delve further into this, unfortunately. I’m more familiar with vLLM than llama.cpp - their startup command seems to limit context to 16k. Anyone care to comment further? I’d love to push that context window out on single spark.

Also, if someone has it up and running, could you share benchmarks on concurrent requests and context length? It would be much appreciated :)

I was able to build with the flags from this thread (-DGGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON -DLLAMA_CURL=OFF). However, the flags suggested for starting the model (-c 16384 -ngl 999) always lead to “encountered an error while trying to fit params to free device memory”. Still working on running Step-3.5-Flash on a single Spark. I will try -c 16384 -b 2048 -ub 2048 -fa on --temp 1.0 and -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_COMMON=ON later.
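
One thing that may be worth checking before launching (a guess, not a confirmed cause of the error above): the Spark shares one memory pool between CPU and GPU, so anything else already resident reduces what llama.cpp can use. A minimal sketch that reads MemAvailable from /proc/meminfo and compares it with a threshold. The helper names are made up, and 111 is only the approximate size of the weights mentioned in this thread, not a measured requirement:

```shell
# Print MemAvailable in whole GiB from a /proc/meminfo-style file (values there are in kB).
mem_available_gib() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Succeed only if at least <needed> GiB are currently available.
enough_memory() {
  local needed="$1" file="${2:-/proc/meminfo}"
  [ "$(mem_available_gib "$file")" -ge "$needed" ]
}
```

Usage: `enough_memory 111 || echo "less than 111 GiB available, stop other models first"`. The KV cache and compute buffers need room on top of the weights, so treat the threshold as a lower bound.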

Hey @joshua.dale.warner,

Hopefully this answers your questions.

  • Since you only need llama-server to act as your backend, skipping -DLLAMA_BUILD_EXAMPLES=ON and -DLLAMA_BUILD_COMMON=ON saves compile time and reduces the final binary size.
  • $(nproc) is much better. It dynamically scales the build to use all available CPU cores on your Spark. Scaling back to -j8 is a generic recommendation for users on 8-core laptops to prevent system lag; on your rig, limiting to 8 cores just makes you wait longer for the compile.
  • Model flags: These are the settings from their documentation for llama-cli (step 5) - ./llama-cli -m step3.5_flash_Q4_K_S.gguf -c 16384 -b 2048 -ub 2048 -fa on --temp 1.0 -p "What's your name?"
    • Note: I was able to scale the context window to 262144 without any issues.
  • -ngl 999 is more of a fail safe, ensuring every layer is forced on the GPU. If you omit it, llama.cpp might default to partial CPU offloading, which would destroy your performance on the Spark.

All-in-all, amazing model, just slow. Hopefully someone can come up with ways to speed it up on the Spark.

@carlos.albarran.mx That’s odd. Here are the CUDA/driver versions running in my container:

Cuda: 13.1, V13.1.115 (cuda_13.1.r13.1/compiler.37061995_0)

Driver: 580.126.09

As a follow up, here is the mix that I have found to yield the highest speed with the most possible tokens

Any context length over ~200000 throws an out-of-memory error.

Build Cuda

cmake -S . -B build-cuda \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DLLAMA_CURL=OFF \
  -DCMAKE_CUDA_ARCHITECTURES=121

Perform the actual compilation (using all CPU cores for speed)

cmake --build build-cuda --config Release -j$(nproc)

export GGML_CUDA_GRAPH_OPT=1

Launch Model

~/Step-3.5-Flash/llama.cpp/build-cuda/bin/llama-server \
  -m /path/to/models/Step-3.5-Flash-Int4/step3.5_flash_Q4_K_S.gguf \
  -c 200000 \
  -ngl 999 \
  -fa 1 \
  -b 2048 \
  -ub 2048 \
  -ctk q8_0 \
  -ctv q8_0 \
  --no-mmap \
  --port 8000 \
  --host 0.0.0.0
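
If you want to experiment with the context size without editing the command each time, a small wrapper can take it from the environment. This is a sketch only: the function name and the `CTX`, `MAX_CTX`, and `PORT` variables are made up for illustration, and the 200000 cap simply reflects the out-of-memory limit reported above.

```shell
# Hypothetical wrapper: print the llama-server arguments from this post.
# CTX sets the context size (default 200000); values above MAX_CTX are clamped.
llama_args() {
  local model="$1" ctx="${CTX:-200000}" max_ctx="${MAX_CTX:-200000}"
  if [ "$ctx" -gt "$max_ctx" ]; then
    echo "CTX=$ctx exceeds $max_ctx, clamping" >&2
    ctx="$max_ctx"
  fi
  printf '%s\n' "-m $model -c $ctx -ngl 999 -fa 1 -b 2048 -ub 2048 -ctk q8_0 -ctv q8_0 --no-mmap --port ${PORT:-8000} --host 0.0.0.0"
}
```

Usage: `CTX=65536 llama-server $(llama_args /path/to/models/Step-3.5-Flash-Int4/step3.5_flash_Q4_K_S.gguf)`. The substitution is left unquoted on purpose so the shell splits it into separate arguments, which means this only works for model paths without spaces.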

!? For me, GPT 120B keeps going insane, beyond hallucinations, both in Open WebUI and Cline/Kilo.

I found GLM 4.7 Flash NVFP4 to be great. Really eager to try Step 3.5 Flash.

Thanks for your notes, @Keyper-AI! I got excited about using this model and went ahead and generalized your docker-compose.yml to abstract the build from the running of the container - this makes startup far faster.

Here is my known-working config. Necessary setup as in the original post includes

  • download the model (I assume it is located locally in your user account at ~/models/Step-3.5-Flash-Int4) and
  • combine the gguf shards into one file by running cat step3p5_flash_Q4_K_S.gguf.part-* > step3.5_flash_Q4_K_S.gguf in that directory.

Copy these into new files in an empty directory.

Dockerfile

FROM nvidia/cuda:13.1.1-devel-ubuntu24.04

# 1. Install build dependencies
RUN apt-get update && apt-get install -y \
    git \
    cmake \
    build-essential \
    libcurl4-openssl-dev \
    && rm -rf /var/lib/apt/lists/*

# 2. Set up build directory
WORKDIR /build

# 3. Clone the specific Step-3.5-Flash repository
RUN git clone https://github.com/stepfun-ai/Step-3.5-Flash.git

# 4. Configure and Build
WORKDIR /build/Step-3.5-Flash/llama.cpp

# The linker needs libcuda.so.1 (Driver API), but the driver isn't mounted during build.
# We point it to the "stubs" provided by the nvidia-devel image.
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1

# We use the specific flags provided, including the custom CUDA Arch 121
RUN cmake -S . -B build-cuda \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_GRAPHS=ON \
    -DLLAMA_CURL=OFF \
    -DCMAKE_CUDA_ARCHITECTURES=121 \
    -DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs" \
    -DCMAKE_SHARED_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs"

# Compile llama.cpp with all available cores
# LD_LIBRARY_PATH set inline here only for this command
RUN export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH && \
    cmake --build build-cuda --config Release --target llama-server -j$(nproc)

# 5. Install the binary to the system path
RUN cp build-cuda/bin/llama-server /usr/local/bin/llama-server

# 6. Set Runtime Environment Variables
ENV GGML_CUDA_GRAPH_OPT=1

# 7. Set default working directory
WORKDIR /app

In the same directory, save this as docker-compose.yml

services:
  step-server:
    # Build from the Dockerfile in the current directory
    build:
      context: .
      dockerfile: Dockerfile
    image: step-3.5-flash:local
    container_name: llama-Step3.5-Flash-Int4
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      # We only need to mount the models now.
      # The code/binary is baked into the image.
      - ${HOME}/models:/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    working_dir: /app
    # The command is now strictly for runtime arguments
    command: >
      llama-server
      -m /models/Step-3.5-Flash-Int4/step3.5_flash_Q4_K_S.gguf
      -c 200000
      -ngl 999
      -fa 1
      -b 2048
      -ub 2048
      -ctk q8_0
      -ctv q8_0
      --no-mmap
      --port 8000
      --host 0.0.0.0

Finally, from the directory these files are in, run

docker compose build

docker compose up -d

Starts up quite fast and works very well. I am seeing about 24 t/s for short prompts, for prompts around 10k it is still over 19 t/s. I’m impressed by the model so far.

Something to explore:

The stepfun-ai GitHub repo has more in-depth info about running on the DGX Spark, and there are two options we could experiment with which aren’t included in my above config.

From this file:

“When testing long context (e.g. 256K), OOM may occur. The build flag -DGGML_CUDA_FORCE_MMQ=ON, environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1, and runtime parameters -ctk q8_0 -ctv q8_0 can help mitigate memory issues.”

My Docker config above has the Int8 KV cache runtime flags already set, but is not using -DGGML_CUDA_FORCE_MMQ=ON in the llama.cpp build, nor the environment variable GGML_CUDA_ENABLE_UNIFIED_MEMORY=1

From what I understand forcing MMQ is mostly something for legacy hardware, but can result in lower RAM use (at a substantial performance penalty). It would probably be worth trying this, as they clearly benched contexts up to 262k on the Spark.

I’m less convinced GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 will be helpful. It’s designed to overflow to system RAM if GPU VRAM is exhausted. The memory is already unified on the GB10, and I don’t think there is really a fallback - unless this allows it to thrash instead of crash.
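
For anyone who wants to try the two untested options, here is a minimal sketch of how they could be toggled. The `cmake_flags` helper and the `FORCE_MMQ` switch are made up for illustration; the flags themselves are the ones already used in this thread plus the build flag quoted from the StepFun docs. Whether this actually lets 256K fit on a Spark is not verified here.

```shell
# Hypothetical helper: print the cmake configure flags used in this thread.
# FORCE_MMQ=1 appends the extra build flag quoted from the StepFun docs.
cmake_flags() {
  local flags="-DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_CUDA_GRAPHS=ON -DLLAMA_CURL=OFF -DCMAKE_CUDA_ARCHITECTURES=121"
  if [ "${FORCE_MMQ:-0}" = "1" ]; then
    flags="$flags -DGGML_CUDA_FORCE_MMQ=ON"
  fi
  printf '%s\n' "$flags"
}

# Configure step (unquoted on purpose so the flags split into separate arguments):
#   cmake -S . -B build-cuda $(FORCE_MMQ=1 cmake_flags)
# The environment variable is set at runtime, in front of the server command:
#   GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 llama-server -m ... -ctk q8_0 -ctv q8_0
```

In the Docker setup above, the runtime variable would go in an `environment:` entry of the compose file and the build flag in the Dockerfile's cmake line.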

Thanks for your great work.

I successfully ran the model using this compose file and connected it to openclaw. Overall, it works quite well, but I did notice that it’s a bit slow and almost maxes out the available resources.

Meanwhile, I saw that Qwen just released the Qwen3-Coder-Next model. According to their introduction, it has a total of 80B parameters but only activates 3B per inference, which significantly reduces memory and compute requirements. This might be a better fit for running on DGX Spark devices. I’ll deploy and test this model on a DGX Spark soon. If you’re interested, you might want to try out Qwen3-Coder-Next as well.

This depends heavily on your use case. I find 20-24 tok/s perfectly usable, and quite impressive, given the model’s size.

I really like Step3.5-Flash for general tasks. After kicking the tires for 24h I am actually even more impressed by the model. I agree with the original poster that this is hands-down the best local model I have ever run. If you’re doing coding, especially very long contexts, then yes the new Qwen3-Coder-Next appears to be excellent - as it should be, for a purpose-tuned model. But I am definitely keeping this one around!

Does anyone have numbers for context at 100k / 150k / 200k? What’s the TPS like with long context? My internet is quite slow, so I need to think twice about whether to download Qwen3 80B or this one xd

See Step-3.5-Flash/llama.cpp/docs/step3.5-flash.md at main · stepfun-ai/Step-3.5-Flash · GitHub

I have not run this (too busy using it), but you could modify my docker-compose above to run their recommended benchmark command below instead (you may need to set -c lower than this):

GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./llama-batched-bench \
  -m step3.5_flash_Q4_K_S.gguf \
  -c 262150 \
  -b 2048 \
  -ub 1024 \
  -npp 0,2048,8192,16384,32768,65536,262144 \
  -ntg 128 \
  -npl 1 \
  -ctk q8_0 \
  -ctv q8_0

Their reported results:

| PP | TG | B | N_KV | T_PP (s) | S_PP (t/s) | T_TG (s) | S_TG (t/s) | T (s) | S (t/s) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 128 | 1 | 128 | 0.000 | 0.00 | 6.468 | 19.79 | 6.468 | 19.79 |
| 2048 | 128 | 1 | 2176 | 5.222 | 392.17 | 6.315 | 20.27 | 11.538 | 188.60 |
| 8192 | 128 | 1 | 8320 | 15.341 | 533.99 | 6.321 | 20.25 | 21.662 | 384.08 |
| 16384 | 128 | 1 | 16512 | 31.008 | 528.39 | 6.652 | 19.24 | 37.659 | 438.46 |
| 32768 | 128 | 1 | 32896 | 68.606 | 477.63 | 7.210 | 17.75 | 75.816 | 433.89 |
| 65536 | 128 | 1 | 65664 | 167.820 | 390.51 | 8.303 | 15.42 | 176.122 | 372.83 |
| 262144 | 128 | 1 | 262272 | 1206.853 | 217.21 | 15.265 | 8.39 | 1222.118 | 214.60 |

In real-world general use up to 10k context I am seeing 19-24 t/s. The above should give a sense of how it scales up to 262k.
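
To turn those rates into wall-clock expectations, the arithmetic is prompt tokens divided by the prompt-processing rate, plus generated tokens divided by the generation rate. A small sketch using awk (the function name is made up; the sample numbers are the 65536 row of their reported results):

```shell
# Estimated seconds for one request:
#   <prompt_tokens> / <pp_rate> + <gen_tokens> / <tg_rate>
estimate_seconds() {
  awk -v p="$1" -v pp="$2" -v g="$3" -v tg="$4" \
    'BEGIN { printf "%.1f\n", p / pp + g / tg }'
}

# 65536 prompt tokens at 390.51 t/s, then 128 generated tokens at 15.42 t/s
estimate_seconds 65536 390.51 128 15.42   # prints 176.1
```

That matches the total time reported for that row, and almost all of it is spent processing the prompt before the first token appears.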

It is possible this test will not run with my Docker environment above, because I did not use every last possible optimization - in particular, I did not force MMQ in llama.cpp (see the above post, Running Step-3.5-Flash on Single Spark - #12 by joshua.dale.warner). Forcing MMQ should reduce memory usage at a cost in throughput. My guess is that their test results did force MMQ, because the throughput is lower than I would have expected (I see ~24 t/s at context sizes comparable to the 2048 and 8192 rows above). The theoretical tradeoff is speed vs memory usage.

Forgive my ignorance in advance. This is the first time I’ve tried doing this without LM Studio

I tried your instructions, but when I run your cat command on the files, I get this error:

userj@sparkbox:~/models/Step-3.5-Flash-Int4$ ls
config.json
README.md
step3.5_flash_Q4_K_S.gguf
step3p5_flash_Q4_K_S-00001-of-00012.gguf
step3p5_flash_Q4_K_S-00002-of-00012.gguf
step3p5_flash_Q4_K_S-00003-of-00012.gguf
step3p5_flash_Q4_K_S-00004-of-00012.gguf
step3p5_flash_Q4_K_S-00005-of-00012.gguf
step3p5_flash_Q4_K_S-00006-of-00012.gguf
step3p5_flash_Q4_K_S-00007-of-00012.gguf
step3p5_flash_Q4_K_S-00008-of-00012.gguf
step3p5_flash_Q4_K_S-00009-of-00012.gguf
step3p5_flash_Q4_K_S-00010-of-00012.gguf
step3p5_flash_Q4_K_S-00011-of-00012.gguf
step3p5_flash_Q4_K_S-00012-of-00012.gguf
step-bar-chart.png
stepfun.svg
userj@sparkbox:~/models/Step-3.5-Flash-Int4$ cat step3p5_flash_Q4_K_S.gguf.part-* > step3.5_flash_Q4_K_S.gguf
cat: ‘step3p5_flash_Q4_K_S.gguf.part-*’: No such file or directory
userj@sparkbox:~/models/Step-3.5-Flash-Int4$

No worries, it looks like they renamed the model shards within the last day. There was a comment thread about discoverability by llama.cpp

I think that makes the cat command unnecessary. Avoiding the unnecessary copy and space will be nice. Try passing the full name of just the first shard - llama.cpp should find the rest automatically now. I’ll verify this and, assuming true, make a follow up post in a bit.
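
Until that is verified, here is a sketch of what it could look like. The `first_shard` helper is made up for illustration; it just picks the `-00001-of-` file from the directory listing shown above. Whether llama.cpp then finds the remaining shards on its own is the assumption being tested, not something confirmed here.

```shell
# Hypothetical helper: print the path of the first shard (*-00001-of-*.gguf) in <dir>.
first_shard() {
  local dir="$1" f
  for f in "$dir"/*-00001-of-*.gguf; do
    # An unmatched glob stays literal, so check that the file really exists.
    if [ -e "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  echo "no -00001-of-*.gguf shard found in $dir" >&2
  return 1
}
```

Usage: `llama-server -m "$(first_shard ~/models/Step-3.5-Flash-Int4)" -c 16384 -ngl 999 --port 8000 --host 0.0.0.0`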

I struggled for most of the day to get this working. Just in case anyone else runs into this, here’s what eventually worked.

Dockerfile

# =============================================================================
# Step 3.5 Flash – llama.cpp server for DGX Spark
#
# Uses upstream llama.cpp with -hf auto-download from HuggingFace
# =============================================================================

FROM nvidia/cuda:13.1.1-devel-ubuntu24.04

# 1. Install build dependencies (libcurl + libssl for HTTPS download)
#    plus the curl binary, which the HEALTHCHECK below calls
RUN apt-get update && apt-get install -y \
    git \
    cmake \
    build-essential \
    curl \
    libcurl4-openssl-dev \
    libssl-dev \
    && rm -rf /var/lib/apt/lists/*

# 2. Set up build directory
WORKDIR /build

# 3. Clone upstream llama.cpp
RUN git clone --depth 1 https://github.com/ggml-org/llama.cpp.git

# 4. Configure and Build
WORKDIR /build/llama.cpp

RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1

RUN cmake -S . -B build-cuda \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_CUDA=ON \
    -DGGML_CUDA_GRAPHS=ON \
    -DLLAMA_CURL=ON \
    -DLLAMA_OPENSSL=ON \
    -DCMAKE_CUDA_ARCHITECTURES=121 \
    -DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs" \
    -DCMAKE_SHARED_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs"

RUN export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH && \
    cmake --build build-cuda --config Release --target llama-server -j$(nproc)

# 5. Install the binary
RUN cp build-cuda/bin/llama-server /usr/local/bin/llama-server

# 6. Runtime config
ENV GGML_CUDA_GRAPH_OPT=1
WORKDIR /app

ENV HF_MODEL=stepfun-ai/Step-3.5-Flash-Int4:Q4_K_S
ENV CTX_SIZE=200000
ENV GPU_LAYERS=999
ENV PORT=8000
ENV HOST=0.0.0.0

EXPOSE ${PORT}

HEALTHCHECK --interval=30s --timeout=10s --start-period=600s --retries=3 \
    CMD curl -sf http://localhost:${PORT}/health || exit 1

CMD llama-server -hf ${HF_MODEL} -c ${CTX_SIZE} -ngl ${GPU_LAYERS} -fa 1 -b 2048 -ub 2048 -ctk q8_0 -ctv q8_0 --no-mmap --port ${PORT} --host ${HOST}

docker-compose.yml

# =============================================================================
# Step 3.5 Flash – docker-compose for DGX Spark
#
# Usage:
#   1. cp .env.example .env   (adjust if needed)
#   2. docker compose up -d --build
#   3. docker compose logs -f
#
# First launch auto-downloads the model (~111 GB) via -hf flag.
# Cache is persisted so subsequent starts skip the download.
# =============================================================================

services:
  step-server:
    build:
      context: .
      dockerfile: Dockerfile
    image: step3.5-flash:latest
    container_name: step3.5-flash-int4
    restart: unless-stopped

    ports:
      - "${HOST_PORT:-8000}:${PORT:-8000}"

    volumes:
      # llama.cpp -hf caches to ~/.cache/llama.cpp
      - ${LLAMA_CACHE_DIR:-~/.cache/llama.cpp}:/root/.cache/llama.cpp

    environment:
      - HF_MODEL=${HF_MODEL:-stepfun-ai/Step-3.5-Flash-Int4:Q4_K_S}
      - CTX_SIZE=${CTX_SIZE:-200000}
      - GPU_LAYERS=${GPU_LAYERS:-999}
      - PORT=${PORT:-8000}
      - HOST=0.0.0.0

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    stop_grace_period: 30s
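
The usage notes mention a `.env.example` that is not included in the post. A plausible `.env`, assuming it only needs to set the variables the compose file interpolates, could look like this. Every value simply repeats the default already written in docker-compose.yml, so the file is optional unless you want to change something:

```shell
# .env (sketch): values repeat the defaults from docker-compose.yml
HOST_PORT=8000
PORT=8000
HF_MODEL=stepfun-ai/Step-3.5-Flash-Int4:Q4_K_S
CTX_SIZE=200000
GPU_LAYERS=999
LLAMA_CACHE_DIR=~/.cache/llama.cpp
```

Lowering `CTX_SIZE` is the first thing to try if the container runs out of memory on startup.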

Enjoy!

Thanks for the updated runtime - I was duplicating a lot of this work today, unfortunately, because whoever updated the StepFun repo simply deleted all of the GGUF shards and reuploaded them with new names.

They could have instead done git mv to rename them and just updated the hashes. The changes would have been a few kb. But no, they removed and reuploaded everything. Be aware that this choice means that if you simply git pull origin main two things happen:

  • You download 111 GB of new weights
  • The prior ones look gone but persist in your .git repo, doubling size on disk

Because of this, to save space on your Spark, I recommend deleting the old repo completely and then cloning the HF repo again with git clone --depth 1
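
To see whether an existing clone is affected before deleting it, you can compare the size of its .git directory with the size of the whole checkout. A minimal sketch (the function name is made up):

```shell
# Print the size in KiB of a clone's .git directory and of the whole checkout.
repo_sizes() {
  local repo="$1" git_kb total_kb
  git_kb=$(du -sk "$repo/.git" | awk '{print $1}')
  total_kb=$(du -sk "$repo" | awk '{print $1}')
  printf '.git: %s KiB, total: %s KiB\n' "$git_kb" "$total_kb"
}
```

Usage: `repo_sizes ~/models/Step-3.5-Flash-Int4`. If .git accounts for most of the total, the old shards are likely still stored there.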

My connection is on the slower side, so I always pre-download weights into ~/models overnight and then map the weights into containers. I structured my prior framework around this.

For people who are fine downloading >100GB on startup once, Jake’s new script will cache this GGUF for you.