Step-3.7-Flash on single Spark (llama.cpp only)

It is possible to run Step-3.7-Flash with vision, at full 262144 context, on single DGX Spark! llama.cpp is the only path. We build it using stepfun’s custom fork. The highest quant possible is the official IQ4_XS.

I got this working a couple days ago but had to iterate a bit to make it stable. I have now benched it multiple times at 256k without crashing, but it is limited to single concurrency and while decode is solid, the prefill is poor compared to vLLM. Still, IMHO those are acceptable tradeoffs for those with one GB10!

The architecture uses docker compose and a start-up script for convenience.

FIRST: downloaded the IQ4_XS GGUF files from stepfun-ai/Step-3.7-Flash-GGUF · Hugging Face to ~/models/Step-3.7-Flash-GGUF/IQ4_XS/ and the multimedia mmproj to ~/models/Step-3.7-Flash-GGUF/mmproj-step3.7-flash-f16.gguf

Then, make these files in ~/llm-launchers/step-3.7-flash (or directory of your choice, but if you change the directory, you will need to modify the startup script)

Dockerfile:

# ==============================================================================
# STAGE 1: Build Environment
# ==============================================================================
ARG UBUNTU_VERSION=24.04
ARG CUDA_VERSION=13.1.2

FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION} AS builder

# 1. Install build dependencies (now including Python for script execution)
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    cmake \
    build-essential \
    libcurl4-openssl-dev \
    libssl-dev \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /build

# 2. Clone step-3.7 specific llama.cpp
RUN git clone https://github.com/stepfun-ai/llama.cpp.git
WORKDIR /build/llama.cpp
RUN git checkout -b step3.7 origin/step3.7

# 3. Fix missing libcuda.so.1 for the linker during the build phase
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1

# 4. Configure CMake with Blackwell (GB10) & ARM64 optimizations
# - DGGML_CUDA_F16=ON is added to accelerate half-precision kernels on Blackwell
RUN cmake -S . -B build-cuda \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DGGML_CUDA_GRAPHS=ON \
  -DGGML_CUDA_FORCE_MMQ=ON \
  -DLLAMA_OPENSSL=OFF \
  -DLLAMA_CURL=ON \
  -DLLAMA_BUILD_COMMON=ON \
  -DLLAMA_BUILD_TOOLS=ON \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_BUILD_EXAMPLES=OFF \
  -DLLAMA_BUILD_TESTS=OFF \
  -DCMAKE_CUDA_ARCHITECTURES=121a-real \
  -DGGML_NATIVE=ON \
  -DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs" \
  -DCMAKE_SHARED_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs"

# 5. Compile ALL targets and package them to a staging directory
RUN export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH && \
    cmake --build build-cuda --config Release -j8 && \
    cmake --install build-cuda --prefix /out

# ==============================================================================
# STAGE 2: Lean Runtime Environment
# ==============================================================================
FROM nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION} AS runtime

# 1. Install required runtime libraries AND Python
RUN apt-get update && apt-get install -y --no-install-recommends \
    libgomp1 \
    libcurl4 \
    curl \
    ca-certificates \
    python3 \
    python3-pip \
    && apt-get autoremove -y \
    && apt-get clean -y \
    && rm -rf /var/lib/apt/lists/*

# 2. Set environment variables
ENV GGML_CUDA_GRAPH_OPT=1
ENV LLAMA_ARG_HOST=0.0.0.0
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility

WORKDIR /app

# 3. Copy compiled binaries AND shared libraries from the builder stage, then
# update the dynamic linker cache so the system finds the new libraries
COPY --from=builder /out/ /usr/local/
RUN ldconfig

# 4. Copy Python conversion/quantization scripts and dependencies
COPY --from=builder /build/llama.cpp/*.py /app/
COPY --from=builder /build/llama.cpp/gguf-py /app/gguf-py
COPY --from=builder /build/llama.cpp/requirements /app/requirements
COPY --from=builder /build/llama.cpp/requirements.txt /app/

# 5. Install Python dependencies globally within the container sandbox
# (Using --break-system-packages is safe here since it's an isolated container)
RUN pip install --no-cache-dir --break-system-packages -r requirements.txt

# 6. Define a healthcheck for server mode
HEALTHCHECK --interval=10s --timeout=5s --start-period=30s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1

# 7. Set default command
CMD ["llama-server"]

docker-compose.yml (--parallel 1 and --ctx-checkpoints 1 are required or it crashes on the 3rd 256k prompt)

services:
  step-server:
    # Build from the Dockerfile in the current directory
    build:
      context: .
      dockerfile: Dockerfile
    image: step-3.7-flash:local
    container_name: llama-Step3.7-Flash-IQ4_XS
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      # We only need to mount the models now.
      # The code/binary is baked into the image.
      - ${HOME}/models:/models
    environment:
      - HF_MODEL=${HF_MODEL:-/models/Step-3.7-Flash-GGUF/IQ4_XS/Step-3.7-flash-IQ4_XS-00001-of-00003.gguf}
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    working_dir: /app
    # The command is now strictly for runtime arguments
    # Ubergarm's IQ4_XS quant of Step-3.5-Flash was 108 GB, but StepFun's official
    # Step-3.7-Flash IQ4_XS is 105 GB so should be able to fit full context with
    # the vision mmproj which is just shy of 4GB
    command: >
      llama-server
      -m /models/Step-3.7-Flash-GGUF/IQ4_XS/Step-3.7-flash-IQ4_XS-00001-of-00003.gguf
      --mmproj /models/Step-3.7-Flash-GGUF/mmproj-step3.7-flash-f16.gguf
      -c 262144
      -ngl 999
      -fa on
      -b 2048
      -ub 1024
      -ctk q8_0
      -ctv q8_0
      --parallel 1
      --ctx-checkpoints 1
      --checkpoint-min-step 128
      --cache-ram 1024
      --no-mmap
      --port 8000
      --host 0.0.0.0

    stop_grace_period: 15s

Start script (e.g., ~/serve-step-3.7-flash-iq4-xs.sh) - if you put the above files somewhere other than ~/llm-launchers/step-3.7-flash, edit COMPOSE_DIR

#!/bin/bash

# Define the path to your compose directory
COMPOSE_DIR="$HOME/llm-launchers/step-3.7-flash"
ENDPOINT="http://localhost:8000/health" # Adjust port if your compose uses a different one

echo "Starting Step-3.7-Flash llama.cpp container..."

# Run the command using the --project-directory flag
docker compose --project-directory "$COMPOSE_DIR" --profile step-3.7-flash up -d

echo "⏳ Waiting for model to initialize..."

# Health check loop
until $(curl --output /dev/null --silent --head --fail "$ENDPOINT"); do
    printf '.'
    sleep 2
done

echo -e "\nModel is loaded and ready for requests!"

chmod +x the startup script, clear your memory, and run it!

Performance is okay for token generation, starting about 30, but at max context it drops to 11. Prefill is acceptable for shorter prompts or follow-ons in the same conversation, but processing a new huge prompt is painful.

| model          |             test |             t/s |     peak t/s |             ttfr (ms) |          est_ppt (ms) |         e2e_ttft (ms) |
|:---------------|-----------------:|----------------:|-------------:|----------------------:|----------------------:|----------------------:|
| Step-3.7-Flash |           pp2048 | 739.43 ± 118.79 |              |      3066.04 ± 518.99 |      2863.90 ± 518.99 |      3066.04 ± 518.99 |
| Step-3.7-Flash |            tg128 |    30.98 ± 0.48 | 35.33 ± 0.47 |                       |                       |                       |
| Step-3.7-Flash |   pp2048 @ d2048 |  668.58 ± 53.67 |              |      6382.65 ± 525.10 |      6180.51 ± 525.10 |      6382.65 ± 525.10 |
| Step-3.7-Flash |    tg128 @ d2048 |    24.79 ± 0.96 | 29.67 ± 0.94 |                       |                       |                       |
| Step-3.7-Flash |   pp2048 @ d4096 |  615.20 ± 31.18 |              |     10226.08 ± 492.48 |     10023.94 ± 492.48 |     10226.08 ± 492.48 |
| Step-3.7-Flash |    tg128 @ d4096 |    24.42 ± 0.52 | 30.00 ± 1.63 |                       |                       |                       |
| Step-3.7-Flash |   pp2048 @ d8192 |  673.78 ± 15.93 |              |     15419.39 ± 361.15 |     15217.24 ± 361.15 |     15419.39 ± 361.15 |
| Step-3.7-Flash |    tg128 @ d8192 |    28.18 ± 1.60 | 33.33 ± 2.05 |                       |                       |                       |
| Step-3.7-Flash |  pp2048 @ d16384 |  724.23 ± 13.13 |              |     25672.09 ± 457.87 |     25469.94 ± 457.87 |     25672.09 ± 457.87 |
| Step-3.7-Flash |   tg128 @ d16384 |    24.94 ± 1.31 | 31.00 ± 1.41 |                       |                       |                       |
| Step-3.7-Flash |  pp2048 @ d32768 |  705.06 ± 18.65 |              |    49629.90 ± 1322.16 |    49427.75 ± 1322.16 |    49629.90 ± 1322.16 |
| Step-3.7-Flash |   tg128 @ d32768 |    21.42 ± 0.87 | 28.00 ± 0.82 |                       |                       |                       |
| Step-3.7-Flash | pp2048 @ d131072 |  486.09 ± 33.75 |              |  275452.40 ± 19986.69 |  275250.26 ± 19986.69 |  275452.40 ± 19986.69 |
| Step-3.7-Flash |  tg128 @ d131072 |    14.87 ± 0.60 | 21.00 ± 0.82 |                       |                       |                       |
| Step-3.7-Flash | pp2048 @ d196000 |  339.66 ± 88.41 |              | 631254.47 ± 186471.69 | 631052.33 ± 186471.69 | 631254.47 ± 186471.69 |
| Step-3.7-Flash |  tg128 @ d196000 |    11.73 ± 0.32 | 16.33 ± 0.47 |                       |                       |                       |
| Step-3.7-Flash | pp2048 @ d256000 | 410.21 ± 12.97  |              | 629877.78 ± 19473.92  | 629672.80 ± 19473.92  | 629877.78 ± 19473.92  |
| Step-3.7-Flash |  tg128 @ d256000 |   11.01 ± 0.29  | 14.67 ± 0.47 |                       |                       |                       |

I have tested this with vision and it works.

Perfect! I’m running the AesSedai/Step-3.7-Flash-GGUF130K context, which uses less memory, so I might be able to fit another TTS service!

Indeed, when I started working with this no other quants were yet available.

AesSedai’s IQ4_XS is significantly smaller at 89 GB (vs StepFun’s official IQ4_XS at 105) but seems to be poorer than the equivalent recipe for Step-3.5-Flash. Let us know how it goes! There is some commentary about this on a HF thread.

For Step-3.5-Flash my favored quant was Ubergarm’s IQ4_XS, which was larger than both of the above.

I haven’t run into noticeable quality issues in my own use yet, though I suspect that’s because I haven’t really stress-tested it with programming tasks where the flaws would likely show up more clearly. The HF thread does point out that the same quantization recipe that worked well on Step-3.5-Flash (~2.6% PPL increase) is performing significantly worse on 3.7 (~15.6%), so I’m definitely keeping an eye on AesSedai’s repo for any re-converted updates.

That said, the 89 GB footprint versus the official 105 GB is a huge win on a single 128 GB DGX Spark — it turns a borderline-tight deployment into something with actual breathing room for long-context KV cache. If the real-world capability gap versus the official quant ends up being acceptable, especially for agentic and tool-calling use cases rather than heavy reasoning chains, this size class feels practically tailor-made for the GB10. I’ll report back once I’ve had a chance to run it through more rigorous coding and long-context tests.

Of interest, Unsloth has a Pareto chart that suggests the official IQ4_XS may be something special and certainly an outlier in a good way.

Looks like Unsloth and AesSedai’s are pretty similar.

Thanks for sharing that Pareto chart — it makes the trade-off brutally clear. The official IQ4_XS at ~105 GB is indeed sitting on a much better part of the frontier than I initially assumed; that KLD drop from ~0.16 (AesSedai/Unsloth territory) down to ~0.10 is a massive quality win for only 16 GB of extra disk space.

How does this compare to Deepseek V4 Flash in IQ2XXS. On Artificial Analysis V4 Flash outperforms Step 3.7 Flash on all three categories (Coding, Agentic, General Intelligence) - Has anybody tested both in real life use cases?

I haven’t run comprehensive benchmarks but at the moment I view and use them as complementary.

For document analysis tasks I find Deepseek V4 Flash is often surprisingly capable, but I reach for Step-3.7-Flash more. Step-3.7-Flash as above is the closest thing I’ve yet found to a postdoc or peer.

Most RL, a lot of benchmarks, and general internet training slop is at such a low level (reading comprehension, sentence length/complexity, general understanding) that I generally have to bend over backwards in system prompts to get models to stand up and actually talk with professional tone, language, and structure - adversarial if necessary. This is beyond sycophancy, I want it to act like an academic colleague. Unfortunately, this is difficult, because the target audience is users in the general population.

I feel StepFun has not dumbed down their models in this way (yet) and it is refreshing.

YMMV, my work and application thus far is not coding and probably quite different that most.

So in “General Intelligence” and “Creative Writing” you find it to be a setp above Deeepseek V4 Flash? Btw. I really hate it that we are still in the early stages where it’s so hard to benchmark these things. I really look forward to the future where we will have clear results that show which model excels at which tasks. Maybe I should really make my own small benchmark and test on that.

Hi guys. I just noticed MTP was added yesterday to the gguf repo ( stepfun-ai/Step-3.7-Flash-GGUF at main ). I could not find any information how to make use of it and/or if a specific PR is required to use it but is a great addition

This recipe uses StepFun’s fork of llama.cpp and I’m not sure if they brought in MTP support.

At IQ4_XS it’s very tight on GB10, but the MTP weights are very close to the vision mmproj. If you want to play with this, my recommendation is to drop multimodality when adding MTP.

I made it works with the latest llama.cpp main using the following. I tested successfully with agentic workload and tooling. I can probably raise the ctx here:

/gorgon/ia/llama.cpp/build/bin/llama-server \
-m /rosso/gufi/Step-3.7-IQ4_XS/IQ4_XS/Step-3.7-flash-IQ4_XS-00001-of-00003.gguf \
--spec-draft-model /rosso/gufi/Step-3.7-IQ4_XS/Step3.7-flash-mtp-Q8_0.gguf \
--spec-type draft-mtp \
-c 128144 \
--host 0.0.0.0 --port 8000 \
-ngl all \
--temp 0 \
--spec-draft-n-max 2 \
--spec-draft-p-min 0.6 \
--reasoning-budget 16384 \
--no-warmup \
--split-mode layer \
--parallel 1 \
--reasoning on \
--reasoning-format deepseek \
--reasoning-budget-message ". Actually, let me stop here. I have been thinking about this for long enough, will just reply now."