I use llama-swap to launch (mostly) llama.cpp models and some vllm ones.
Although on Spark, if you use vLLM, it’s better to just start it and never stop it, because model loading is SOOOOOO slow: 8–9 minutes to load something like Qwen3-Next-80B or gpt-oss-120b, while the same gpt-oss-120b loads in about 15 seconds with llama.cpp after the 6.14 kernel update and some NVMe readahead buffer tweaking (only when using --no-mmap, though).
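For reference, here's a minimal sketch of the readahead tweak. The device name (nvme0n1) and the 4096 KiB value are assumptions for illustration, not my exact setup; check `lsblk` for your actual drive and experiment with the value.

```shell
# Guarded readahead bump; device name nvme0n1 is an assumption.
DEV=${DEV:-nvme0n1}
RA=/sys/block/$DEV/queue/read_ahead_kb
if [ -e "$RA" ]; then
    echo "current readahead: $(cat "$RA") KiB"
    # Raise to 4 MiB so large sequential model reads prefetch aggressively
    echo 4096 | sudo tee "$RA" > /dev/null || echo "need root to change readahead"
else
    echo "device $DEV not found; run lsblk to locate your NVMe drive"
fi
```

This only helps the --no-mmap path, where llama.cpp streams the whole file through the page cache in large sequential reads.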
Anyway, llama-swap loads models on demand and unloads them if you query another one. You have to assign groups manually if you want multiple models running at the same time, but otherwise it works great.
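For anyone new to llama-swap, a minimal config sketch with a group might look like this. The model names, paths, and flags are hypothetical; check the llama-swap README for the exact schema.

```yaml
models:
  "gpt-oss-120b":
    cmd: |
      llama-server --port ${PORT} -m /models/gpt-oss-120b.gguf --no-mmap
  "qwen3-coder":
    cmd: |
      llama-server --port ${PORT} -m /models/qwen3-coder.gguf

groups:
  # Members of a group can stay resident side by side instead of swapping
  always-on:
    swap: false
    members:
      - "gpt-oss-120b"
      - "qwen3-coder"
```

Models outside a group keep the default behavior: query one, and the previously loaded one gets unloaded.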
I am of the opinion that the NVIDIA DGX-Spark (GB10) was fundamentally designed with FP4 computation in mind. However, as this remains bleeding-edge technology, the majority of inference engines do not yet handle the ARM64 + GB10 combination with particular elegance.
Based upon my extensive testing, both vLLM and TRT-LLM perform admirably when deployed from the container nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev.
vLLM Launch Script for openai/gpt-oss-120b:
#!/bin/bash
set -e
MODEL_HANDLE="openai/gpt-oss-120b"
LOG_FILE="/var/log/gpt_oss_120b_server.log"
PORT=8358
echo "🌟 LAUNCHING vLLM 120B (NVIDIA Auto-Optimised Mode)"
# System memory optimisation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Streamlined launch configuration
# --dtype auto: Permits the container to activate MXFP4 if detected
# --gpu-memory-utilization 0.65: Sufficient for 120B in 4-bit (~70GB)
# --max-model-len 8192: Sole constraint to prevent context explosion
nohup python3 -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_HANDLE" \
    --host 0.0.0.0 \
    --port $PORT \
    --gpu-memory-utilization 0.65 \
    --max-model-len 8192 \
    --trust-remote-code \
    --dtype auto \
    > "$LOG_FILE" 2>&1 &
echo "🎉 120B launched. Logs: tail -f $LOG_FILE"
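Once it's up, a quick smoke test against the OpenAI-compatible endpoint (port 8358 as in the script above; the prompt is just an example):

```shell
PORT=8358
# Build the request body and validate it locally before sending
PAYLOAD='{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Say hello in one word."}],"max_tokens":16}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
# Fails harmlessly if the server is still loading the model
curl -s "http://localhost:${PORT}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD" || echo "server not reachable yet"
```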
TRT-LLM Launch Script for openai/gpt-oss-120b:
#!/bin/bash
set -e
# --- CONFIGURATION ---
MODEL_HANDLE="openai/gpt-oss-120b"
TIKTOKEN_DIR="/tmp/harmony-reqs"
CONFIG_FILE="/tmp/extra-llm-api-config-120b.yml"
LOG_FILE="/var/log/gpt_120b_server.log"
PORT=8356
BATCH_SIZE=8
echo "🌟 STARTING GPT-OSS-120B SERVICE (Port $PORT)"
# 1. TIKTOKEN PREPARATION
if [ ! -f "$TIKTOKEN_DIR/cl100k_base.tiktoken" ]; then
    echo "📥 Downloading Tiktoken..."
    mkdir -p "$TIKTOKEN_DIR"
    wget -q -P "$TIKTOKEN_DIR" https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
    wget -q -P "$TIKTOKEN_DIR" https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
fi
export TIKTOKEN_ENCODINGS_BASE="$TIKTOKEN_DIR"
# 2. CONFIGURATION PATCH & DOWNLOAD
echo "🔍 Verification..."
export SNAPSHOT_PATH=$(find /root/.cache/huggingface/hub/models--openai--gpt-oss-120b/snapshots -maxdepth 1 -mindepth 1 -type d | head -n 1)
# Eager Mode configuration patch (Llama MoE identity)
python3 -c "
import json, os
path = os.path.join('$SNAPSHOT_PATH', 'config.json')
try:
    with open(path, 'r') as f: data = json.load(f)
    data['architecture'] = 'LlamaForCausalLM'
    data['architectures'] = ['LlamaForCausalLM']
    data['model_type'] = 'llama'
    if 'dtype' not in data: data['dtype'] = 'float16'
    if 'layer_types' in data: del data['layer_types']
    with open(path, 'w') as f: json.dump(data, f, indent=4)
    print('✅ Configuration patched for MoE compatibility.')
except Exception as e: print(f'⚠️ Patch error: {e}')
"
# 3. OPTIMISED RUNTIME CONFIGURATION (Anti-OOM)
echo "⚙️ YAML Configuration (Memory Safe)..."
cat > "$CONFIG_FILE" <<YAML
print_iter_log: false
kv_cache_config:
  dtype: "auto"
  # Reduced to 0.8 to provide headroom for the system and prevent OOM
  free_gpu_memory_fraction: 0.8
cuda_graph_config:
  enable_padding: true
disable_overlap_scheduler: true
YAML
# 4. SERVER LAUNCH
echo "🔥 Launching trtllm-serve..."
# Critical variable for memory fragmentation on GB10
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
nohup trtllm-serve "$SNAPSHOT_PATH" \
    --host 0.0.0.0 \
    --port $PORT \
    --max_batch_size $BATCH_SIZE \
    --trust_remote_code \
    --extra_llm_api_options "$CONFIG_FILE" \
    > "$LOG_FILE" 2>&1 &
echo "🎉 120B service launched! Logs: tail -f $LOG_FILE"
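The config.json patch in step 2 is easy to sanity-check offline. Here is the same transformation as a standalone function run against a dummy config (the field values are illustrative, not the real gpt-oss-120b config):

```python
import json

def patch_config(data: dict) -> dict:
    """Apply the same Llama-identity patch the launch script performs."""
    data["architecture"] = "LlamaForCausalLM"
    data["architectures"] = ["LlamaForCausalLM"]
    data["model_type"] = "llama"
    data.setdefault("dtype", "float16")   # only set when the key is missing
    data.pop("layer_types", None)         # drop the key if present
    return data

# Dummy stand-in for the real config.json contents
sample = {"model_type": "gpt_oss", "layer_types": ["attention"], "dtype": "bfloat16"}
patched = patch_config(sample)
print(json.dumps(patched, indent=2))
```

Note that an existing dtype is left untouched; the script only injects float16 when the field is absent.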
I should be most grateful if you would inform me should you discover any superior alternatives. Ultimately, the optimal choice depends entirely upon one’s intended usage. Ollama and llama.cpp are indeed satisfactory for personal use; however, they continue to rely upon GGUF models and linear quantisation, which regrettably does not leverage the full potential of your DGX-Spark.
That's cool. Thank you for that hint.
Here is a “Playbook” tutorial on how to install Docker Desktop on ARM64.
The Docker Desktop .deb for ARM64 Linux is version 4.40.0 (187762), but the latest version is 4.52.0 (210994); I could not find a link to the ARM64 Linux file for that version yet.