Interested in running a 196B-A11B MoE model with full 256k context on one Spark?
Despite all of the attention on Qwen3.5, I keep coming back to Step-3.5-Flash as the best model I have ever run on my single Spark. It keeps impressing me, and I want more people to be able to use it. The devs specifically considered the DGX Spark, but the release had a few hiccups: they shipped a custom llama.cpp with no history, the chat template was broken, and the original Int4 model was misnamed. All of that is fixed now. Below is everything you need to know.
Step-3.5-Flash is a 196B text-only MoE model with 11B active parameters per token. It also supports MTP (multi-token prediction), but that is still experimental, and the model is perfectly usable without it. Because memory constraints are so tight, I have only gotten it running with llama.cpp so far.
Highly efficient and performant quants are available, most notably the IQ4_XS from ubergarm (here: ubergarm/Step-3.5-Flash-GGUF at main), hereafter referred to as Step-3.5-Flash-IQ4_XS. This quant has better perplexity and is a few GB smaller than the official Int4 quant, which has since been renamed Q4_K_S. The IQ4_XS quant lets the model run rock solid on the DGX Spark with the full 256k context. As you can see here, the IQ4_XS is an outlier (in a good way) from the typical size/perplexity tradeoff, with both smaller size and better perplexity:
MMLU results from the IQ4_XS quant: ubergarm/Step-3.5-Flash-GGUF · Some benchmarks
Yes - a 196B-A11B MoE with the full 256k context running on one Spark! It uses about 125.5GB of the 128GB but never OOMs. The KV cache is quantized to 8-bit (q8_0) to make this possible.
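To get a feel for why 8-bit KV cache is what makes this fit: KV cache grows linearly with context length. Here is a back-of-the-envelope calculator; note that the 61 layers / 8 KV heads / head_dim 128 below are placeholder numbers I picked for illustration, not the actual Step-3.5-Flash architecture (read the GGUF metadata for the real values).

```python
def kv_cache_gib(ctx_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Approximate KV cache size in GiB: K and V tensors, one entry per layer per position."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 2**30

# q8_0 packs 32 values into 34 bytes (~1.06 bytes/elem); f16 needs 2 bytes/elem.
# NOTE: the architecture numbers here are HYPOTHETICAL placeholders.
for name, bpe in [("f16", 2.0), ("q8_0", 34 / 32)]:
    print(name, round(kv_cache_gib(262144, 61, 8, 128, bpe), 1), "GiB")
```

Whatever the real layer/head counts are, q8_0 roughly halves the f16 cache size, which is the difference between fitting 256k context next to a ~120GB model and not.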
Throughput? See the next post for llama-benchy results, but in summary: 17-18 t/s on smaller prompts, about 10 t/s at 65k context, decreasing to about 6.5 t/s at 256k.
I took the time to assemble and optimize all of this into a Docker setup for the GB10. It is also extensible: if you want to run more than one model, the services can share the same underlying llama.cpp Docker image.
Prep: download ubergarm's IQ4_XS quant shards from ubergarm/Step-3.5-Flash-GGUF at main into `~/models/Step-3.5-Flash-IQ4_XS/` (the first shard is named Step-3.5-Flash-IQ4_XS-00001-of-00004.gguf). If you use a different location, adjust the docker-compose volume binding below.
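Before building anything, it is worth sanity-checking that all four shards actually finished downloading (a truncated download is an easy way to get a confusing load failure later). A quick sketch, assuming the default path above:

```python
from pathlib import Path

def missing_shards(model_dir, prefix="Step-3.5-Flash-IQ4_XS", n_shards=4):
    """Return the names of any GGUF shards not yet present in model_dir."""
    names = [f"{prefix}-{i:05d}-of-{n_shards:05d}.gguf" for i in range(1, n_shards + 1)]
    return [s for s in names if not (Path(model_dir) / s).exists()]

if __name__ == "__main__":
    missing = missing_shards(Path.home() / "models" / "Step-3.5-Flash-IQ4_XS")
    print("all shards present" if not missing else f"still missing: {missing}")
```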
Next, create a file named fixed_step_template.jinja in the model directory and paste in the template below. This is needed because ubergarm made his quants early, before the Stepfun devs corrected the chat template; rather than redo the quants, he directs people to use the corrected extracted template. The startup command in docker-compose.yml references this file:
{% macro render_content(content) %}{% if content is none %}{{- '' }}{% elif content is string %}{{- content }}{% elif content is mapping %}{{- content['value'] if 'value' in content else content['text'] }}{% elif content is iterable %}{% for item in content %}{% if item.type == 'text' %}{{- item['value'] if 'value' in item else item['text'] }}{% elif item.type == 'image' %}<im_patch>{% endif %}{% endfor %}{% endif %}{% endmacro %}
{{bos_token}}{%- if tools %}
{{- '<|im_start|>system\n' }}
{%- if messages[0].role == 'system' %}
{{- render_content(messages[0].content) + '\n\n' }}
{%- endif %}
{{- "# Tools\n\nYou have access to the following functions in JSONSchema format:\n\n<tools>" }}
{%- for tool in tools %}
{{- "\n" }}
{{- tool | tojson(ensure_ascii=False) }}
{%- endfor %}
{{- "\n</tools>\n\nIf you choose to call a function ONLY reply in the following format with NO suffix:\n\n<tool_call>\n<function=example_function_name>\n<parameter=example_parameter_1>\nvalue_1\n</parameter>\n<parameter=example_parameter_2>\nThis is the value for the second parameter\nthat can span\nmultiple lines\n</parameter>\n</function>\n</tool_call>\n\n<IMPORTANT>\nReminder:\n- Function calls MUST follow the specified format: an inner <function=...>\n...\n</function> block must be nested within <tool_call>\n...\n</tool_call> XML tags\n- Required parameters MUST be specified\n</IMPORTANT><|im_end|>\n" }}
{%- else %}
{%- if messages[0].role == 'system' %}
{{- '<|im_start|>system\n' + render_content(messages[0].content) + '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- set ns = namespace(multi_step_tool=true, last_query_index=messages|length - 1) %}
{%- for message in messages[::-1] %}
{%- set index = (messages|length - 1) - loop.index0 %}
{%- if ns.multi_step_tool and message.role == "user" and render_content(message.content) is string and not(render_content(message.content).startswith('<tool_response>') and render_content(message.content).endswith('</tool_response>')) %}
{%- set ns.multi_step_tool = false %}
{%- set ns.last_query_index = index %}
{%- endif %}
{%- endfor %}
{%- for message in messages %}
{%- set content = render_content(message.content) %}
{%- if (message.role == "user") or (message.role == "system" and not loop.first) %}
{%- set role_name = 'observation' if (message.role == "system" and not loop.first and message.name == 'observation') else message.role %}
{{- '<|im_start|>' + role_name + '\n' + content + '<|im_end|>' + '\n' }}
{%- elif message.role == "assistant" %}
{%- set reasoning_content = '' %}
{%- if enable_thinking %}
{%- if message.reasoning_content is string %}
{%- set reasoning_content = render_content(message.reasoning_content) %}
{%- else %}
{%- if '</think>' in content %}
{%- set reasoning_content = content.split('</think>')[0].rstrip('\n').split('<think>')[-1].lstrip('\n') %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- else %}
{# If thinking is disabled, strip any inline <think>...</think> from assistant content #}
{%- if '</think>' in content %}
{%- set content = content.split('</think>')[-1].lstrip('\n') %}
{%- endif %}
{%- endif %}
{%- if loop.index0 > ns.last_query_index and enable_thinking %}
{{- '<|im_start|>' + message.role + '\n<think>\n' + reasoning_content.rstrip('\n') + '\n</think>\n' + content.lstrip('\n') }}
{%- else %}
{{- '<|im_start|>' + message.role + '\n' + content.lstrip('\n') }}
{%- endif %}
{%- if message.tool_calls %}
{%- for tool_call in message.tool_calls %}
{%- if tool_call.function is defined %}
{%- set tool_call = tool_call.function %}
{%- endif %}
{{- '<tool_call>\n<function=' + tool_call.name + '>\n' }}
{%- if tool_call.arguments is defined %}
{%- if tool_call.arguments is mapping %}
{%- set arguments = tool_call.arguments %}
{%- for args_name, args_value in arguments|items %}
{{- '<parameter=' + args_name + '>\n' }}
{%- set args_value = args_value | tojson(ensure_ascii=False) | safe if args_value is mapping or (args_value is sequence and args_value is not string) else args_value | string %}
{{- args_value }}
{{- '\n</parameter>\n' }}
{%- endfor %}
{%- elif tool_call.arguments is string %}
{# Minja does not support fromjson; preserve raw JSON string as a single parameter #}
{{- '<parameter=arguments>\n' + tool_call.arguments + '\n</parameter>\n' }}
{%- endif %}
{%- endif %}
{{- '</function>\n</tool_call>' }}
{%- endfor %}
{%- endif %}
{{- '<|im_end|>\n' }}
{%- elif message.role == "tool" %}
{%- if loop.first or (messages[loop.index0 - 1].role != "tool") %}
{{- '<|im_start|>tool_response\n' }}
{%- endif %}
{{- '<tool_response>' }}
{{- content }}
{{- '</tool_response>' }}
{%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
{{- '<|im_end|>\n' }}
{%- endif %}
{%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
{{- '<|im_start|>assistant\n' }}
{%- if enable_thinking %}
{{- '<think>\n' }}
{%- endif %}
{%- endif %}
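If you are curious what the render_content macro at the top of the template is doing, here is a rough Python paraphrase (my own, not shipped code): it normalizes the common OpenAI-style content shapes - a plain string, a single mapping, or a list of typed parts - into one string.

```python
def render_content(content):
    """Python paraphrase of the template's render_content macro."""
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    if isinstance(content, dict):  # single mapping: prefer 'value', else 'text'
        return content["value"] if "value" in content else content["text"]
    out = []  # otherwise: an iterable of typed parts
    for item in content:
        if item.get("type") == "text":
            out.append(item["value"] if "value" in item else item["text"])
        elif item.get("type") == "image":
            out.append("<im_patch>")
    return "".join(out)
```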
Dockerfile for optimized Spark llama.cpp image:
# ==============================================================================
# STAGE 1: Build Environment
# ==============================================================================
ARG UBUNTU_VERSION=24.04
ARG CUDA_VERSION=13.1.1
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION} AS builder
# 1. Install build dependencies (now including Python for script execution)
RUN apt-get update && apt-get install -y --no-install-recommends \
        git \
        cmake \
        build-essential \
        libcurl4-openssl-dev \
        libssl-dev \
        python3 \
        python3-pip \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /build
# 2. Clone mainline llama.cpp
RUN git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
WORKDIR /build/llama.cpp
# 3. Fix missing libcuda.so.1 for the linker during the build phase
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1
# 4. Configure CMake with Blackwell (GB10) & ARM64 optimizations
# - DGGML_CUDA_F16=ON is added to accelerate half-precision kernels on Blackwell
RUN cmake -S . -B build-cuda \
        -DCMAKE_BUILD_TYPE=Release \
        -DGGML_CUDA=ON \
        -DGGML_CUDA_F16=ON \
        -DGGML_CUDA_GRAPHS=ON \
        -DLLAMA_CURL=ON \
        -DCMAKE_CUDA_ARCHITECTURES=121a-real \
        -DGGML_NATIVE=ON \
        -DCMAKE_EXE_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs" \
        -DCMAKE_SHARED_LINKER_FLAGS="-L/usr/local/cuda/lib64/stubs"
# 5. Compile ALL targets and package them to a staging directory
RUN export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH && \
    cmake --build build-cuda --config Release -j$(nproc) && \
    cmake --install build-cuda --prefix /out
# ==============================================================================
# STAGE 2: Lean Runtime Environment
# ==============================================================================
FROM nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION} AS runtime
# 1. Install required runtime libraries AND Python
RUN apt-get update && apt-get install -y --no-install-recommends \
        libgomp1 \
        libcurl4 \
        curl \
        ca-certificates \
        python3 \
        python3-pip \
    && apt-get autoremove -y \
    && apt-get clean -y \
    && rm -rf /var/lib/apt/lists/*
# 2. Set environment variables
ENV GGML_CUDA_GRAPH_OPT=1
ENV LLAMA_ARG_HOST=0.0.0.0
ENV NVIDIA_VISIBLE_DEVICES=all
ENV NVIDIA_DRIVER_CAPABILITIES=compute,utility
WORKDIR /app
# 3. Copy compiled binaries AND shared libraries from the builder stage, then
# update the dynamic linker cache so the system finds the new libraries
COPY --from=builder /out/ /usr/local/
RUN ldconfig
# 4. Copy Python conversion/quantization scripts and dependencies
COPY --from=builder /build/llama.cpp/*.py /app/
COPY --from=builder /build/llama.cpp/gguf-py /app/gguf-py
COPY --from=builder /build/llama.cpp/requirements /app/requirements
COPY --from=builder /build/llama.cpp/requirements.txt /app/
# 5. Install Python dependencies globally within the container sandbox
# (Using --break-system-packages is safe here since it's an isolated container)
RUN pip install --no-cache-dir --break-system-packages -r requirements.txt
# 6. Define a healthcheck for server mode
HEALTHCHECK --interval=10s --timeout=5s --start-period=30s --retries=3 \
    CMD curl -f http://localhost:8080/health || exit 1
# 7. Set default command
CMD ["llama-server"]
docker-compose.yml
services:
  step-iq4xs-server:
    build:
      context: .
      dockerfile: Dockerfile
    image: step-3.5-flash:local
    container_name: llama-Step3.5-Flash-IQ4_XS
    restart: unless-stopped
    profiles: ["step35-iq4xs"] # Run with: docker compose --profile step35-iq4xs up -d
    ports:
      - "8000:8080" # host 8000 -> container 8080, where the image healthcheck probes
    volumes:
      - ${HOME}/models:/models:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    working_dir: /app
    command: >
      llama-server
      -m /models/Step-3.5-Flash-IQ4_XS/Step-3.5-Flash-IQ4_XS-00001-of-00004.gguf
      --jinja
      --chat-template-file /models/Step-3.5-Flash-IQ4_XS/fixed_step_template.jinja
      -c 262144
      -ngl 999
      -fa on
      -b 8192
      -ub 2048
      -ctk q8_0
      -ctv q8_0
      --no-mmap
      --port 8080
      --host 0.0.0.0
    stop_grace_period: 15s
Build and run the image with `docker compose --profile step35-iq4xs up -d`, which will start the server on port 8000.
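Once the container reports healthy, llama-server exposes an OpenAI-compatible API you can hit from anything. A minimal stdlib-only client sketch (the prompt is just an example, and llama-server serves whichever model it loaded regardless of any model field):

```python
import json
import urllib.request

def chat_payload(prompt, max_tokens=256):
    """Build an OpenAI-style chat completion request body."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(prompt, url="http://localhost:8000/v1/chat/completions"):
    req = urllib.request.Request(
        url,
        data=json.dumps(chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

if __name__ == "__main__":
    try:
        print(ask("In one sentence, what is a mixture-of-experts model?"))
    except OSError as err:
        print("server not reachable:", err)
```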
If you have more GGUF quants you want to experiment with, copy the step-iq4xs-server service in docker-compose.yml, rename the service and profile, and point it at a different .gguf under ~/models. This gives you a common llama.cpp image for running various quants.
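As a sketch, a second service for some other quant might look like the following (the model path, names, and context size are placeholders you would change; the host port must differ per service):

```yaml
  another-quant-server:
    build:
      context: .
      dockerfile: Dockerfile
    image: step-3.5-flash:local      # reuses the same llama.cpp image
    container_name: llama-another-quant
    restart: unless-stopped
    profiles: ["another-quant"]      # docker compose --profile another-quant up -d
    ports:
      - "8001:8080"                  # each service gets its own host port
    volumes:
      - ${HOME}/models:/models:ro
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    working_dir: /app
    command: >
      llama-server
      -m /models/Another-Quant/Another-Quant.gguf
      -c 32768 -ngl 999 -fa on -ctk q8_0 -ctv q8_0 --no-mmap
      --host 0.0.0.0
    stop_grace_period: 15s
```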
Final note:
There is an Int4 AutoRound quant of Step-3.5-Flash: https://huggingface.co/INC4AI/Step-3.5-Flash-int4-AutoRound . I delayed writing this up because I thought AutoRound with vLLM might be the best option, but I have yet to get it running with @eugr's vLLM build (it appeared to load with --enforce-eager and 128k context, after the CUDA graph attempt crashed with OOM, but it generated null output). In vLLM, even with fp8 KV cache, 128k context seems to be the maximum on Spark, since the AutoRound quant is not as space-efficient as the IQ4_XS model above. I plan to keep experimenting with this, as the performance might be compelling if you don't need higher context.
