It is possible to run Step-3.7-Flash with vision, at full 262144 context, on single DGX Spark! llama.cpp is the only path. We build it using stepfun’s custom fork. The highest quant possible is the official IQ4_XS.
I got this working a couple days ago but had to iterate a bit to make it stable. I have now benched it multiple times at 256k without crashing, but it is limited to single concurrency and while decode is solid, the prefill is poor compared to vLLM. Still, IMHO those are acceptable tradeoffs for those with one GB10!
The architecture uses docker compose and a start-up script for convenience.
FIRST: downloaded the IQ4_XS GGUF files from stepfun-ai/Step-3.7-Flash-GGUF · Hugging Face to ~/models/Step-3.7-Flash-GGUF/IQ4_XS/ and the multimedia mmproj to ~/models/Step-3.7-Flash-GGUF/mmproj-step3.7-flash-f16.gguf
Then, make these files in ~/llm-launchers/step-3.7-flash (or directory of your choice, but if you change the directory, you will need to modify the startup script)
Dockerfile:
# ==============================================================================
# STAGE 1: Build Environment
# ==============================================================================
ARG UBUNTU_VERSION=24.04
ARG CUDA_VERSION=13.1.2
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION} AS builder
# 1. Install build dependencies (now including Python for script execution)
RUN apt-get update && apt-get install -y --no-install-recommends \
git \
cmake \
build-essential \
libcurl4-openssl-dev \
libssl-dev \
python3 \
python3-pip \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /build
# 2. Clone step-3.7 specific llama.cpp
RUN git clone https://github.com/stepfun-ai/llama.cpp.git
WORKDIR /build/llama.cpp
RUN git checkout -b step3.7 origin/step3.7
# 3. Fix missing libcuda.so.1 for the linker during the build phase
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
# 4. Configure CMake with Blackwell (GB10) & ARM64 optimizations
# - DGGML_CUDA_F16=ON is added to accelerate half-precision kernels on Blackwell
RUN cmake -S . -B build-cuda \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON \
-DGGML_CUDA_GRAPHS=ON \
-DGGML_CUDA_FORCE_MMQ=ON \
-DLLAMA_OPENSSL=OFF \
-DLLAMA_CURL=ON \
-DLLAMA_BUILD_COMMON=ON \
-DLLAMA_BUILD_TOOLS=ON \
-DLLAMA_BUILD_SERVER=ON \
-DLLAMA_BUILD_EXAMPLES=OFF \
-DLLAMA_BUILD_TESTS=OFF \
-DCMAKE_CUDA_ARCHITECTURES=121a-real \
-DGGML_NATIVE=ON \
-DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs" \
-DCMAKE_SHARED_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs"
# 5. Compile ALL targets and package them to a staging directory
RUN export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH && \
cmake --build build-cuda --config Release -j8 && \
cmake --install build-cuda --prefix /out
# ==============================================================================
# STAGE 2: Lean Runtime Environment
# ==============================================================================
FROM nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION} AS runtime
# 1. Install required runtime libraries AND Python
RUN apt-get update && apt-get install -y --no-install-recommends \
libgomp1 \
libcurl4 \
curl \
ca-certificates \
python3 \
python3-pip \
&& apt-get autoremove -y \
&& apt-get clean -y \
&& rm -rf /var/lib/apt/lists/*
# 2. Set environment variables
ENV GGML_CUDA_GRAPH_OPT=1
ENV LLAMA_ARG_HOST=0.0.0.0
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
WORKDIR /app
# 3. Copy compiled binaries AND shared libraries from the builder stage, then
# update the dynamic linker cache so the system finds the new libraries
COPY --from=builder /out/ /usr/local/
RUN ldconfig
# 4. Copy Python conversion/quantization scripts and dependencies
COPY --from=builder /build/llama.cpp/*.py /app/
COPY --from=builder /build/llama.cpp/gguf-py /app/gguf-py
COPY --from=builder /build/llama.cpp/requirements /app/requirements
COPY --from=builder /build/llama.cpp/requirements.txt /app/
# 5. Install Python dependencies globally within the container sandbox
# (Using --break-system-packages is safe here since it's an isolated container)
RUN pip install --no-cache-dir --break-system-packages -r requirements.txt
# 6. Define a healthcheck for server mode
HEALTHCHECK --interval=10s --timeout=5s --start-period=30s --retries=3 \
CMD curl -f http://localhost:8080/health || exit 1
# 7. Set default command
CMD ["llama-server"]
docker-compose.yml (--parallel 1 and --ctx-checkpoints 1 are required or it crashes on the 3rd 256k prompt)
services:
step-server:
# Build from the Dockerfile in the current directory
build:
context: .
dockerfile: Dockerfile
image: step-3.7-flash:local
container_name: llama-Step3.7-Flash-IQ4_XS
restart: unless-stopped
ports:
- "8000:8000"
volumes:
# We only need to mount the models now.
# The code/binary is baked into the image.
- ${HOME}/models:/models
environment:
- HF_MODEL=${HF_MODEL:-/models/Step-3.7-Flash-GGUF/IQ4_XS/Step-3.7-flash-IQ4_XS-00001-of-00003.gguf}
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
working_dir: /app
# The command is now strictly for runtime arguments
# Ubergarm's IQ4_XS quant of Step-3.5-Flash was 108 GB, but StepFun's official
# Step-3.7-Flash IQ4_XS is 105 GB so should be able to fit full context with
# the vision mmproj which is just shy of 4GB
command: >
llama-server
-m /models/Step-3.7-Flash-GGUF/IQ4_XS/Step-3.7-flash-IQ4_XS-00001-of-00003.gguf
--mmproj /models/Step-3.7-Flash-GGUF/mmproj-step3.7-flash-f16.gguf
-c 262144
-ngl 999
-fa on
-b 2048
-ub 1024
-ctk q8_0
-ctv q8_0
--parallel 1
--ctx-checkpoints 1
--checkpoint-min-step 128
--cache-ram 1024
--no-mmap
--port 8000
--host 0.0.0.0
stop_grace_period: 15s
Start script (e.g., ~/serve-step-3.7-flash-iq4-xs.sh) - if you put the above files somewhere other than ~/llm-launchers/step-3.7-flash, edit COMPOSE_DIR
#!/bin/bash
# Define the path to your compose directory
COMPOSE_DIR="$HOME/llm-launchers/step-3.7-flash"
ENDPOINT="http://localhost:8000/health" # Adjust port if your compose uses a different one
echo "Starting Step-3.7-Flash llama.cpp container..."
# Run the command using the --project-directory flag
docker compose --project-directory "$COMPOSE_DIR" --profile step-3.7-flash up -d
echo "⏳ Waiting for model to initialize..."
# Health check loop
until $(curl --output /dev/null --silent --head --fail "$ENDPOINT"); do
printf '.'
sleep 2
done
echo -e "\nModel is loaded and ready for requests!"
chmod +x the startup script, clear your memory, and run it!
