I use llama-swap to launch (mostly) llama.cpp models and some vllm ones.
Although on Spark, if you use vLLM, it’s better to just start it and never stop it, because model loading is SOOOOOO slow: 8–9 minutes to load something like Qwen3-Next-80B or gpt-oss-120b, while the same gpt-oss-120b loads in about 15 seconds with llama.cpp after the 6.14 kernel update and some NVMe readahead buffer tweaking (only when using --no-mmap, though).
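For reference, here's a minimal sketch of the readahead tweak. The device name (nvme0n1) and the 4096 KiB value are assumptions for illustration, not my exact setup; check `lsblk` for your actual drive and experiment with the value.

```shell
# Guarded readahead bump; device name nvme0n1 is an assumption.
DEV=${DEV:-nvme0n1}
RA=/sys/block/$DEV/queue/read_ahead_kb
if [ -e "$RA" ]; then
    echo "current readahead: $(cat "$RA") KiB"
    # Raise to 4 MiB so large sequential model reads prefetch aggressively
    echo 4096 | sudo tee "$RA" > /dev/null || echo "need root to change readahead"
else
    echo "device $DEV not found; run lsblk to locate your NVMe drive"
fi
```

This only helps the --no-mmap path, where llama.cpp streams the whole file through the page cache in large sequential reads.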
Anyway, llama-swap loads models on demand and unloads them if you query another one. You have to assign groups manually if you want multiple models running at the same time, but otherwise it works great.
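For anyone new to llama-swap, a minimal config sketch with a group might look like this. The model names, paths, and flags are hypothetical; check the llama-swap README for the exact schema.

```yaml
models:
  "gpt-oss-120b":
    cmd: |
      llama-server --port ${PORT} -m /models/gpt-oss-120b.gguf --no-mmap
  "qwen3-coder":
    cmd: |
      llama-server --port ${PORT} -m /models/qwen3-coder.gguf

groups:
  # Members of a group can stay resident side by side instead of swapping
  always-on:
    swap: false
    members:
      - "gpt-oss-120b"
      - "qwen3-coder"
```

Models outside a group keep the default behavior: query one, and the previously loaded one gets unloaded.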
I am of the opinion that the NVIDIA DGX-Spark (GB10) was fundamentally designed with FP4 computation in mind. However, as this remains bleeding-edge technology, the majority of inference engines do not yet handle the ARM64 + GB10 combination with particular elegance.
Based upon my extensive testing, both vLLM and TRT-LLM perform admirably when deployed from the container nvcr.io/nvidia/tensorrt-llm/release:spark-single-gpu-dev.
vLLM Launch Script for openai/gpt-oss-120b:
#!/bin/bash
set -e
MODEL_HANDLE="openai/gpt-oss-120b"
LOG_FILE="/var/log/gpt_oss_120b_server.log"
PORT=8358
echo "🌟 LAUNCHING vLLM 120B (NVIDIA Auto-Optimised Mode)"
# System memory optimisation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Streamlined launch configuration
# --dtype auto: Permits the container to activate MXFP4 if detected
# --gpu-memory-utilization 0.65: Sufficient for 120B in 4-bit (~70GB)
# --max-model-len 8192: Sole constraint to prevent context explosion
nohup python3 -m vllm.entrypoints.openai.api_server \
    --model "$MODEL_HANDLE" \
    --host 0.0.0.0 \
    --port $PORT \
    --gpu-memory-utilization 0.65 \
    --max-model-len 8192 \
    --trust-remote-code \
    --dtype auto \
    > "$LOG_FILE" 2>&1 &
echo "🎉 120B launched. Logs: tail -f $LOG_FILE"
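Once it's up, a quick smoke test against the OpenAI-compatible endpoint (port 8358 as in the script above; the prompt is just an example):

```shell
PORT=8358
# Build the request body and validate it locally before sending
PAYLOAD='{"model":"openai/gpt-oss-120b","messages":[{"role":"user","content":"Say hello in one word."}],"max_tokens":16}'
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"
# Fails harmlessly if the server is still loading the model
curl -s "http://localhost:${PORT}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD" || echo "server not reachable yet"
```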
TRT-LLM Launch Script for openai/gpt-oss-120b:
#!/bin/bash
set -e
# --- CONFIGURATION ---
MODEL_HANDLE="openai/gpt-oss-120b"
TIKTOKEN_DIR="/tmp/harmony-reqs"
CONFIG_FILE="/tmp/extra-llm-api-config-120b.yml"
LOG_FILE="/var/log/gpt_120b_server.log"
PORT=8356
BATCH_SIZE=8
echo "🌟 STARTING GPT-OSS-120B SERVICE (Port $PORT)"
# 1. TIKTOKEN PREPARATION
if [ ! -f "$TIKTOKEN_DIR/cl100k_base.tiktoken" ]; then
    echo "📥 Downloading Tiktoken..."
    mkdir -p "$TIKTOKEN_DIR"
    wget -q -P "$TIKTOKEN_DIR" https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken
    wget -q -P "$TIKTOKEN_DIR" https://openaipublic.blob.core.windows.net/encodings/cl100k_base.tiktoken
fi
export TIKTOKEN_ENCODINGS_BASE="$TIKTOKEN_DIR"
# 2. CONFIGURATION PATCH & DOWNLOAD
echo "🔍 Verification..."
export SNAPSHOT_PATH=$(find /root/.cache/huggingface/hub/models--openai--gpt-oss-120b/snapshots -maxdepth 1 -mindepth 1 -type d | head -n 1)
# Eager Mode configuration patch (Llama MoE identity)
python3 -c "
import json, os
path = os.path.join('$SNAPSHOT_PATH', 'config.json')
try:
    with open(path, 'r') as f: data = json.load(f)
    data['architecture'] = 'LlamaForCausalLM'
    data['architectures'] = ['LlamaForCausalLM']
    data['model_type'] = 'llama'
    if 'dtype' not in data: data['dtype'] = 'float16'
    if 'layer_types' in data: del data['layer_types']
    with open(path, 'w') as f: json.dump(data, f, indent=4)
    print('✅ Configuration patched for MoE compatibility.')
except Exception as e: print(f'⚠️ Patch error: {e}')
"
# 3. OPTIMISED RUNTIME CONFIGURATION (Anti-OOM)
echo "⚙️ YAML Configuration (Memory Safe)..."
cat > "$CONFIG_FILE" <<YAML
print_iter_log: false
kv_cache_config:
  dtype: "auto"
  # Reduced to 0.8 to provide headroom for the system and prevent OOM
  free_gpu_memory_fraction: 0.8
cuda_graph_config:
  enable_padding: true
disable_overlap_scheduler: true
YAML
# 4. SERVER LAUNCH
echo "🔥 Launching trtllm-serve..."
# Critical variable for memory fragmentation on GB10
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
nohup trtllm-serve "$SNAPSHOT_PATH" \
    --host 0.0.0.0 \
    --port $PORT \
    --max_batch_size $BATCH_SIZE \
    --trust_remote_code \
    --extra_llm_api_options "$CONFIG_FILE" \
    > "$LOG_FILE" 2>&1 &
echo "🎉 120B service launched! Logs: tail -f $LOG_FILE"
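The config.json patch in step 2 is easy to sanity-check offline. Here is the same transformation as a standalone function run against a dummy config (the field values are illustrative, not the real gpt-oss-120b config):

```python
import json

def patch_config(data: dict) -> dict:
    """Apply the same Llama-identity patch the launch script performs."""
    data["architecture"] = "LlamaForCausalLM"
    data["architectures"] = ["LlamaForCausalLM"]
    data["model_type"] = "llama"
    data.setdefault("dtype", "float16")   # only set when the key is missing
    data.pop("layer_types", None)         # drop the key if present
    return data

# Dummy stand-in for the real config.json contents
sample = {"model_type": "gpt_oss", "layer_types": ["attention"], "dtype": "bfloat16"}
patched = patch_config(sample)
print(json.dumps(patched, indent=2))
```

Note that an existing dtype is left untouched; the script only injects float16 when the field is absent.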
I should be most grateful if you would inform me should you discover any superior alternatives. Ultimately, the optimal choice depends entirely upon one’s intended usage. Ollama and llama.cpp are indeed satisfactory for personal use; however, they continue to rely upon GGUF models and linear quantisation, which regrettably does not leverage the full potential of your DGX-Spark.
That's cool. Thank you for that hint.
Here is a “Playbook” tutorial on how to install Docker Desktop on ARM64.
The Docker Desktop .deb for ARM64 Linux is version 4.40.0 (187762), but the latest version is 4.52.0 (210994); I could not find a link to the ARM64 Linux file for that version yet.