Moving from Mac to NVIDIA: bought powerful hardware, but drowning in configs

Hi everyone,

I’ve been on Mac (M-series) forever, using LM Studio / Metal. Everything was transparent: download, run, done.

Recently I decided to move to the “big leagues” for serious workloads and bought an NVIDIA DGX Spark. I expected to launch into orbit — instead I’ve spent the last three days breaking my brain trying to get to a clear, fast, stable setup.

This “brave new world” of the Linux ML stack is honestly overwhelming.

Instead of a single Start button, I got a LEGO set of knobs and “best practices” that change depending on the model, engine, quantization, and seemingly the phase of the moon:

  • vLLM vs llama.cpp vs TGI — what’s actually fastest and most stable on a single powerful node?

  • KV cache tweaks, PagedAttention, etc. — what’s worth turning on vs ignoring?

  • NUMA optimization, mlock, no-mmap — meaningful gains or micro-optimizations?

  • CUDA Graphs — enable for speed or just invite weird errors?

  • Flash Attention 2 — must-have or “it depends”?

Even Perplexity can’t give a straight answer on what the current industry standard is for maximum inference performance on one (but strong) machine.

Questions for people who’ve been through this

1) Engine: is vLLM worth the pain?

On one hand, vLLM is fast. On the other, it seems extremely aggressive with memory reservation.

  • I tried Nemotron 30B and it somehow ate ~110 GB of RAM (feels like KV-cache allocation behavior), and then running anything alongside it becomes unrealistic.

  • llama.cpp looks simpler and more “I can reason about it,” but how much performance am I giving up (especially for batching / throughput) compared to vLLM’s optimized CUDA kernels and attention optimizations?
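
On the memory point: vLLM pre-allocates a large fraction of available GPU memory up front for the KV cache (the `--gpu-memory-utilization` flag, default 0.9), which on a 128 GB unified-memory box looks like the model "ate" ~110 GB. A hedged sketch of the standard flags that rein this in (the model name is just a placeholder):

```shell
# vLLM reserves ~90% of GPU memory by default for KV cache, regardless of model size.
# Lowering the utilization fraction and the max context length shrinks the reservation
# so other processes can run alongside it. (Disabling CUDA graphs with --enforce-eager
# also frees memory, but at a significant speed cost.)
vllm serve openai/gpt-oss-120b \
    --gpu-memory-utilization 0.5 \
    --max-model-len 32768
```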

2) Quantization: what should I actually use right now?

My current understanding:

  • NVFP4 is basically a unicorn right now — mostly living inside unstable Docker setups (e.g., Avarok) and waiting for broader/cleaner support on Blackwell (GB10/200).

  • So for Hopper/Ada today, what’s the practical choice: AWQ 4-bit? GPTQ? MXFP4?

  • Where’s the real balance between “won’t run reliably” and “flies but quality drops too much”?

3) My workload

I’m building a local LightRAG / Knowledge Graph pipeline. I need maximum throughput for indexing and processing a large corpus.

Using Ollama for this feels like a downgrade and kind of defeats the point of buying powerful NVIDIA hardware — at that point I could’ve stayed on my Mac where everything “just worked.”

Bottom line

Can someone share the current “gold standard” setup?

What stack (Software + Model Format + Key Parameters) are you using in production or serious development to squeeze maximum performance out of NVIDIA, without turning into a full-time DevOps engineer?

Thanks in advance 🙏

Currently llama.cpp is the fastest solution for inference. It has already gotten some love from NVIDIA in the form of optimizations.

vLLM is not bad either (my preferred backend in production). For the moment you should choose AWQ quants. It is expected that GB10/NVFP4 will see some optimization in the (hopefully near) future.

Hugging Face has announced that TGI won’t see any new features, as HF has decided to stop further development. It has been in maintenance mode for about three weeks now.

As for models on a single node: for performance choose sparse models (MoEs) instead of dense models.

You could start with: gpt-oss-120b, Qwen3 Next 80 A3B Instruct (my favorites)

For hints on usage and which compile options to choose, have a look at:

For an optimized Docker build of vLLM, have a look at:

@eugr has done a great job and is tirelessly testing new improvements. ;-)

To see what you can get out of your Spark have a look at:

2 Likes

Hello, and welcome!

Disclaimer: I am by no means an expert, but here is what my experience has been:

Unless you have two Sparks, or absolutely can’t wait a day or two for GGUFs of new models to come out, skip vLLM. If you must use vLLM, @eugr (active on these forums) has an excellent Docker build you can use. vLLM is generally slower and uses more memory.

Unless you enjoy torturing yourself for absolutely no reason, or are specifically targeting the stack, avoid TensorRT-LLM. It sounds promising but it doesn’t appear optimized for GB10 in any way and there isn’t any indication that it ever will be.

I’ve struggled to get SGLang to work reliably, and I don’t see much development going into it. There is an lmsys/sglang:spark docker build if you want to play with it. It’s faster than vLLM but still slower than Llamacpp.

Llama.cpp for me has been excellent. You can easily build from source, with recent support for Blackwell incorporated into the main release. Instructions are in this forum. Both PP and generation speed are by far faster than vLLM, SGLang, or TRT. The developer community is very active. I’d focus on using this.

For quants: mxfp4 on llama.cpp works very well, with good quality and speed; it’s the best option if the model you want supports it. Quants like Q3/Q2 let you run much bigger models on the Spark (with decreased quality, of course). Q4_K_M is usually a good middle-of-the-road compromise if you can’t get an mxfp4 quant. You can move up to Q6/Q8 if you need better quality, but there are diminishing returns. If you insist on using vLLM, look for an AWQ, but again, you need a good reason to go through the extra trouble and performance loss.
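
If the exact quant you want isn’t published, llama.cpp ships a converter alongside llama-server; a minimal sketch (file names are placeholders):

```shell
# Re-quantize an existing f16/bf16 GGUF down to Q4_K_M using llama.cpp's bundled tool.
# Arguments: input GGUF, output GGUF, quant type. File names here are placeholders.
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
```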

NVFP4 – touted as the secret sauce for Blackwell – has been an absolute disaster in my experience. You need to run vLLM or TRT to use this quant, and as far as I can tell it just makes everything slower. The promised 1.8x speedup is complete vaporware with respect to the Spark. Perhaps someday the hardware will allow us to realize this potential, but I’m not holding my breath.

About Ollama – this isn’t really a step backwards for experimenting. It works really well with Open WebUI and uses a slightly older build of llama.cpp as its backend. It’s great for downloading, swapping and experimenting with different models. The difference between Mac and Spark on Ollama? Way better PP throughput. If you are doing any RAG, this will make a big difference.

Other good news: I find the “DGX OS” build, which is really just Ubuntu 24, a pretty good distribution. Most of the instructions that work for Ubuntu work fine on the Spark. CUDA compatibility is there for the most part – the biggest problem I have run into is that there is not good CUDA Arm support for prior PyTorch builds. This should get better as tools with older dependencies are updated (example: PyTorch builds supported by Spark dropped codec support that makes working with TTS tools a PITA).

The upside: The hardware is more capable. You have CUDA support. You will learn a ton more using the Spark. You just have to be willing to invest some time figuring out how things work.

Good luck and refer to this community when you get stuck!

EDIT: as for models, gpt-oss-20b and 120b are great models and perform well in llama.cpp with mxfp4 support. I find Qwen3-VL-30B-A3B mxfp4 to be very speedy, with multimodal support.

2 Likes

vLLM is usually slower in token generation speed (not always), but generally faster (sometimes significantly) in prompt processing and seems to have better caching system and concurrency support. I’ve seen it in practical use (with coding agents) and I’m seeing it in my benchmarks too.

It does consume noticeably more memory (unless you disable CUDA graphs, but then you will lose all performance advantages) and has painfully slow startup times compared to llama.cpp.

3 Likes

I’ll take your word on PP because I know you bench it. I also don’t doubt concurrency, but it doesn’t play into my needs very often.

1 Like

I need to doublecheck the script logic to make sure I’m not missing something, and I’ll post the benchmarks after that, but using the same methodology for both llama.cpp and vllm I’m seeing about 2x increase in PP speeds on vLLM side (consistent up to 32768 context - haven’t tested past it yet) compared to llama.cpp on gpt-oss-120b, even though token generation is slower. Both on a single Spark.
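
For anyone wanting to reproduce this kind of PP-vs-TG comparison on the llama.cpp side, llama-bench measures the two phases separately; a sketch (model path is a placeholder, prompt sizes sweep up to the 32768 tested above):

```shell
# -p runs prompt-processing benchmarks at the given prompt sizes,
# -n runs token-generation benchmarks, -fa 1 enables Flash Attention.
./build/bin/llama-bench \
    -m models/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
    -p 2048,8192,32768 \
    -n 128 \
    -fa 1
```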

1 Like

Might be worth me re-evaluating. Is that on your docker build?

1 Like

I’ve been studying all this for the last week, ever since I got Spark. My head is spinning, but it’s interesting.
Thanks to everyone who shared their experiences and knowledge here.

1 Like

Thank you for the reply!

It genuinely felt like a weight off my shoulders to read your message — it’s nice to know I’m not alone in the frustration and the search for something that just works 🙂

Got it: we should primarily stick to mxfp4 models for now, and wait for NVIDIA to finally bless us with nvfp4 that actually delivers on the promise.

I’ve already run a few tests and got some interesting results.

gpt-oss-120b (mxfp4)

Tested:

  • Ollama (as far as I understand, it automatically pulls this quant)

  • llama.cpp

| Metric            | Llama.cpp (131K) | Ollama (131K) | Difference             |
|------------------|------------------:|--------------:|------------------------|
| Prompt tokens     | 86                | 86            | ✅ Same                |
| Completion tokens | 250               | 250           | ✅ Same                |
| Generation time   | 4.71s             | 5.85s         | Llama.cpp ~20% faster  |
| Generation speed  | ~53 tok/s         | 42.7 tok/s    | +24% faster 🚀         |
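
The speed figures in the table follow straight from the raw numbers (completion tokens divided by generation time); a quick shell sanity check using the values above:

```shell
# Recompute the table's generation speeds and the relative speedup.
llama_tps=$(awk 'BEGIN { printf "%.1f", 250 / 4.71 }')    # llama.cpp: tokens / seconds
ollama_tps=$(awk 'BEGIN { printf "%.1f", 250 / 5.85 }')   # Ollama: tokens / seconds
speedup=$(awk -v a="$llama_tps" -v b="$ollama_tps" 'BEGIN { printf "%.0f", (a / b - 1) * 100 }')
echo "llama.cpp: ${llama_tps} tok/s, Ollama: ${ollama_tps} tok/s (+${speedup}% for llama.cpp)"
# prints: llama.cpp: 53.1 tok/s, Ollama: 42.7 tok/s (+24% for llama.cpp)
```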

That’s a pretty interesting result. I’m going to keep testing. That really helps…

Could you please share what else I can tweak in llama.cpp to improve performance? Any hardware acceleration options, compile flags, runtime parameters, etc.?

Here’s my current launch script:

#!/bin/bash

DEFAULT_MODEL="models/gpt-oss-120b-mxfp4-00001-of-00003.gguf"
MODEL="${1:-$DEFAULT_MODEL}"

HOST="0.0.0.0"
PORT="8080"
CTX_SIZE="131072"
GPU_LAYERS="999"
THREADS="20"
PARALLEL="4"

echo "----------------------------------------"
echo "Starting Llama Server"
echo "Model: $MODEL"
echo "Host: $HOST"
echo "Port: $PORT"
echo "Access at: http://localhost:$PORT"
echo "----------------------------------------"

./build/bin/llama-server \
    -m "$MODEL" \
    --host "$HOST" \
    --port "$PORT" \
    -c "$CTX_SIZE" \
    -ngl "$GPU_LAYERS" \
    -t "$THREADS" \
    -np "$PARALLEL" \
    --cont-batching \
    -ub 512

Flash Attention isn’t working for me, though…

Thanks for the detailed breakdown — super helpful.

Quick question while we’re on this: have you tried any of the newer models that came out recently, like GLM-4.5 Air or Nemotron 30B? If so, did you notice any meaningful differences in quality (instruction following, reasoning, coding, stability) or speed/PP behavior on Spark — especially under mxfp4?

I have used both of these models. GLM-4.5 Air was really impressive at coding but a little slow for me on a single Spark. Impressive though: a zero-shot prompt like “build me a Tetris game” produced a really elaborate game that worked the first time. I’m less impressed with Nemotron compared to the similarly sized Qwen3. GLM-4.6V is also interesting. Again, anything bigger than a 100b MoE or a 32b dense model just doesn’t generate enough output tokens. Thinking hard about a second Spark.

1 Like

For gpt-oss-120b you want:

  • -fa 1 - enables Flash Attention

  • -ub 2048 - more performant with batches of 2048 tokens

  • --jinja - enables Jinja templates (tool calling, etc.); I believe it’s the default now, but wasn’t some time ago

  • --reasoning-format auto - enables the reasoning parser, which will also help with tool calling

  • --no-mmap - significantly improves model loading performance on Spark
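
Combined with the launch script posted above, the full invocation would look roughly like this (paths and ports copied from that script; a sketch, not a verified config):

```shell
# llama-server launch for gpt-oss-120b with the recommended flags applied.
./build/bin/llama-server \
    -m models/gpt-oss-120b-mxfp4-00001-of-00003.gguf \
    --host 0.0.0.0 --port 8080 \
    -c 131072 -ngl 999 -t 20 -np 4 \
    --cont-batching \
    -ub 2048 \
    -fa 1 \
    --jinja \
    --reasoning-format auto \
    --no-mmap
```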

2 Likes

Wow — it was 43 tok/s and now it’s 56.43 tok/s 😳

Thanks for the recommendation!

Are these settings generally applicable to other models too, or is this more of a “works great for this specific model/quant” kind of situation?

Out of these, only -ub is model- and even backend-specific (e.g. on AMD Vulkan it should be 1024 for faster processing). Some models may also require a different reasoning parser, but -fa 1, --jinja and --no-mmap are pretty much universal, unless you have a solid reason not to use any of them.

#!/bin/bash

# =============================================
# Llama.cpp Server Manager
# =============================================

# Color definitions (using ANSI codes for maximum compatibility)
readonly RED=$'\033[0;31m'
readonly GREEN=$'\033[0;32m'
readonly YELLOW=$'\033[0;33m'
readonly BLUE=$'\033[0;34m'
readonly MAGENTA=$'\033[0;35m'
readonly CYAN=$'\033[0;36m'
readonly WHITE=$'\033[0;37m'
readonly BOLD=$'\033[1m'
readonly NC=$'\033[0m'  # No Color

# Get the real script directory (resolves symlinks)
SOURCE="${BASH_SOURCE[0]}"
while [ -h "$SOURCE" ]; do
    DIR="$(cd -P "$(dirname "$SOURCE")" && pwd)"
    SOURCE="$(readlink "$SOURCE")"
    [[ $SOURCE != /* ]] && SOURCE="$DIR/$SOURCE"
done
SCRIPT_DIR="$(cd -P "$(dirname "$SOURCE")" && pwd)"
cd "$SCRIPT_DIR"

# Auto-unload settings
IDLE_TIMEOUT=1000  # idle timeout in seconds, ~16.7 minutes (0 = disabled)
MONITOR_INTERVAL=30  # Check every 30 seconds

# Available models
MODELS=(
    "0|none||Skip main LLM server"
    "1|gpt-oss-120b-mxfp4|models/gpt-oss-120b/gpt-oss-120b-mxfp4-00001-of-00003.gguf|GPT-OSS 120B MXFP4 (60GB, 3 parts)"
    "2|gpt-oss-20b-mxfp4|models/gpt-oss-20b/openai_gpt-oss-20b-MXFP4.gguf|GPT-OSS 20B MXFP4 (12GB)"
    "3|qwen3-vl-30b|models/qwen3-vl/Qwen3-VL-30B-A3B-Instruct-1M-MXFP4_MOE.gguf|Qwen3-VL 30B MXFP4 (17GB, Vision-Language)"
    "4|qwen3-next-80b-thinking|models/qwen3-next-80b/Qwen3-Next-80B-A3B-Thinking-1M-MXFP4_MOE-00001-of-00003.gguf|Qwen3-Next 80B Thinking MXFP4 (41GB, 3 parts)"
    "5|qwen3-next-80b-instruct|models/qwen3-next-80b-instruct/Qwen3-Next-80B-A3B-Instruct-1M-MXFP4_MOE-00001-of-00003.gguf|Qwen3-Next 80B Instruct MXFP4 (41GB, 3 parts)"
    "6|devstral-2-123b|models/devstral-2-123b-instruct/Devstral-2-123B-Instruct-2512-UD-Q3_K_XL-00001-of-00002.gguf|Devstral-2 123B Q3_XL (58GB, 2 parts)"
    "7|glm-4.5-air-mxfp4|models/glm-4.5-air-mxfp4/GLM-4.5-Air-MXFP4_MOE-00001-of-00005.gguf|GLM-4.5 Air MXFP4_MOE (59GB, 5 parts)"
    "8|glm-4.5-air-q3|models/glm-4.5-air-ud-q3/GLM-4.5-Air-UD-Q3_K_XL-00001-of-00002.gguf|GLM-4.5 Air Q3_XL (54GB, 2 parts)"
    "9|qwen3-coder-30b|models/qwen3-coder-30b-mxfp4/Qwen3-Coder-30B-A3B-Instruct-1M-MXFP4_MOE.gguf|Qwen3 Coder 30B MXFP4 (16GB, Programming)"
    "10|glm-4.6v-flash|models/glm-4.6v-flash/GLM-4.6V-Flash-UD-Q6_K_XL.gguf|GLM-4.6V Flash Q6_XL (8.3GB, Video)"
    "11|olmo-3.1-32b-instruct|models/olmo-3.1-32b-instruct/Olmo-3.1-32B-Instruct-UD-Q4_K_XL.gguf|Olmo 3.1 32B Instruct Q4_XL (19GB)"
    "12|olmo-3-32b-think|models/olmo-3.1-32b-think/Olmo-3-32B-Think-UD-Q4_K_XL.gguf|Olmo 3 32B Think Q4_XL (19GB, Reasoning)"
    "13|rnj-1-8b|models/rnj-1-8b/rnj-1-instruct.Q8_0.gguf|RNJ-1 8B Q8_0 (8.3GB)"
    "14|nemotron-3-nano-30b|models/nemotron-3-nano-30b/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf|Nemotron 3 Nano 30B Q4_XL (22GB)"
)

# OCR models
OCR_MODELS=(
    "0|none||Start without OCR server"
    "1|deepseek-ocr|models/deepseek-ocr/deepseek-ocr-q8_0.gguf|DeepSeek OCR Q8_0 (3.8GB with mmproj)"
)

# Reranker models
RERANKERS=(
    "0|none||Start without reranker"
    "1|bge-reranker-v2-m3|models/bge-reranker/bge-reranker-v2-m3-Q8_0.gguf|BGE Reranker v2-m3 Q8_0 (607MB)"
)

# Embedding models
EMBEDDINGS=(
    "0|none||Start without embedding server"
    "1|qwen3-embedding-0.6b|models/embeddings/Qwen3-Embedding-0.6B-Q8_0.gguf|Qwen3 Embedding 0.6B Q8_0 (610MB)"
    "2|qwen3-embedding-8b|models/embeddings/Qwen3-Embedding-8B-Q4_K_M.gguf|Qwen3 Embedding 8B Q4_K_M (4.4GB)"
)

# Server settings
HOST="0.0.0.0"
PORT="41447"
CTX_SIZE="131072"
GPU_LAYERS="999"
THREADS="20"
PARALLEL="4"
UBATCH="2048"
FLASH_ATTN="1"
JINJA="true"
REASONING_FORMAT="auto"
MMAP="false"

# =============================================
# Menu Functions
# =============================================

show_main_menu() {
    echo ""
    echo -e "${CYAN}========================================${NC}"
    echo -e "${BOLD}${WHITE}  Llama.cpp Server Manager${NC}"
    echo -e "${CYAN}========================================${NC}"
    echo ""
    echo -e "${YELLOW}What would you like to do?${NC}"
    echo -e "  ${GREEN}[1]${NC} Start server"
    echo -e "  ${RED}[2]${NC} Stop server"
    echo -e "  ${BOLD}[3]${NC} Exit"
    echo ""
}

check_model_status() {
    local model_path=$1
    local model_id=$2

    # If the path is empty (the "none" option)
    if [[ -z "$model_path" ]]; then
        return 0
    fi

    # Simple check: the file exists
    if [[ -f "$model_path" ]]; then
        return 0
    fi

    return 1
}

show_model_menu() {
    echo ""
    echo -e "${BOLD}${CYAN}Available models:${NC}"
    for model in "${MODELS[@]}"; do
        IFS='|' read -r num id path desc <<< "$model"

        if [[ "$num" == "0" ]]; then
            echo -e "  ${RED}[$num]${NC} $desc"
        else
            if check_model_status "$path" "$id"; then
                echo -e "  ${GREEN}[$num]${NC} $desc ${GREEN}✓${NC}"
            else
                echo -e "  ${YELLOW}[$num]${NC} $desc ${RED}(not found)${NC}"
            fi
        fi
    done
    echo ""
}

show_reranker_menu() {
    echo ""
    echo -e "${BOLD}${MAGENTA}Reranker options:${NC}"
    for reranker in "${RERANKERS[@]}"; do
        IFS='|' read -r num id path desc <<< "$reranker"
        if [[ "$num" == "0" ]]; then
            echo -e "  ${RED}[$num]${NC} $desc"
        else
            if [[ -f "$path" ]]; then
                echo -e "  ${MAGENTA}[$num]${NC} $desc ${GREEN}✓${NC}"
            else
                echo -e "  ${MAGENTA}[$num]${NC} $desc ${RED}(not found)${NC}"
            fi
        fi
    done
    echo ""
}

show_embedding_menu() {
    echo ""
    echo -e "${BOLD}${BLUE}Embedding server options:${NC}"
    for embedding in "${EMBEDDINGS[@]}"; do
        IFS='|' read -r num id path desc <<< "$embedding"
        if [[ "$num" == "0" ]]; then
            echo -e "  ${RED}[$num]${NC} $desc"
        else
            if [[ -f "$path" ]]; then
                echo -e "  ${BLUE}[$num]${NC} $desc ${GREEN}✓${NC}"
            else
                echo -e "  ${BLUE}[$num]${NC} $desc ${RED}(not found)${NC}"
            fi
        fi
    done
    echo ""
}

show_ocr_menu() {
    echo ""
    echo -e "${BOLD}${YELLOW}OCR server options:${NC}"
    for ocr in "${OCR_MODELS[@]}"; do
        IFS='|' read -r num id path desc <<< "$ocr"
        if [[ "$num" == "0" ]]; then
            echo -e "  ${RED}[$num]${NC} $desc"
        else
            if [[ -f "$path" ]]; then
                # For OCR, also check for the mmproj file
                local mmproj_path="$(dirname "$path")/mmproj-$(basename "$path" .gguf).gguf"
                if [[ -f "$mmproj_path" ]]; then
                    echo -e "  ${YELLOW}[$num]${NC} $desc ${GREEN}✓${NC}"
                else
                    echo -e "  ${YELLOW}[$num]${NC} $desc ${YELLOW}(no mmproj)${NC}"
                fi
            else
                echo -e "  ${YELLOW}[$num]${NC} $desc ${RED}(not found)${NC}"
            fi
        fi
    done
    echo ""
}

show_idle_timeout_menu() {
    echo ""
    echo -e "${BOLD}${CYAN}Auto-stop settings (idle timeout):${NC}"
    echo -e "  ${RED}[0]${NC} Disabled (server runs until stopped manually)"
    echo -e "  ${GREEN}[1]${NC} 5 minutes"
    echo -e "  ${GREEN}[2]${NC} 10 minutes"
    echo -e "  ${GREEN}[3]${NC} 15 minutes"
    echo -e "  ${YELLOW}[4]${NC} 30 minutes"
    echo -e "  ${YELLOW}[5]${NC} 1 hour"
    echo -e "  ${MAGENTA}[6]${NC} Custom (in seconds)"
    echo ""
}

get_idle_timeout() {
    local choice=$1
    case $choice in
        0) echo "0" ;;
        1) echo "300" ;;      # 5 minutes
        2) echo "600" ;;      # 10 minutes
        3) echo "900" ;;      # 15 minutes
        4) echo "1800" ;;     # 30 minutes
        5) echo "3600" ;;     # 1 hour
        *) echo "$choice" ;;  # Custom value in seconds
    esac
}

# =============================================
# Server Control Functions
# =============================================

stop_servers() {
    echo ""
    echo -e "${YELLOW}Stopping all llama.cpp servers...${NC}"
    pkill -f "llama-server" 2>/dev/null
    sleep 2

    if pgrep -f "llama-server" >/dev/null; then
        echo -e "${RED}Force killing remaining processes...${NC}"
        pkill -9 -f "llama-server" 2>/dev/null
        sleep 1
    fi

    # Stop monitor if running
    pkill -f "llama-monitor" 2>/dev/null

    echo -e "${GREEN}✓ All servers stopped${NC}"
    echo ""
}

get_model_path() {
    local choice=$1
    for model in "${MODELS[@]}"; do
        IFS='|' read -r num id path desc <<< "$model"
        if [[ "$num" == "$choice" ]]; then
            if [[ "$id" == "none" ]]; then
                echo ""
            else
                echo "$path"
            fi
            return
        fi
    done
    echo ""
}

get_reranker_path() {
    local choice=$1
    for reranker in "${RERANKERS[@]}"; do
        IFS='|' read -r num id path desc <<< "$reranker"
        if [[ "$num" == "$choice" ]]; then
            if [[ "$id" == "none" ]]; then
                echo ""
            else
                echo "$path"
            fi
            return
        fi
    done
    echo ""
}

get_embedding_path() {
    local choice=$1
    for embedding in "${EMBEDDINGS[@]}"; do
        IFS='|' read -r num id path desc <<< "$embedding"
        if [[ "$num" == "$choice" ]]; then
            if [[ "$id" == "none" ]]; then
                echo ""
            else
                echo "$path"
            fi
            return
        fi
    done
    echo ""
}

get_ocr_path() {
    local choice=$1
    for ocr in "${OCR_MODELS[@]}"; do
        IFS='|' read -r num id path desc <<< "$ocr"
        if [[ "$num" == "$choice" ]]; then
            if [[ "$id" == "none" ]]; then
                echo ""
            else
                echo "$path"
            fi
            return
        fi
    done
    echo ""
}

check_idle() {
    if [[ $IDLE_TIMEOUT -eq 0 ]]; then
        return 1
    fi

    # Check if log file exists
    if [[ ! -f /tmp/llama-server.log ]]; then
        return 1
    fi

    # Get last modification time of log file (llama.cpp writes on every request)
    local last_mod=$(stat -c %Y /tmp/llama-server.log 2>/dev/null)

    if [[ -z "$last_mod" ]]; then
        return 1
    fi

    local current_time=$(date +%s)
    local idle_time=$((current_time - last_mod))

    # Log format: "srv  log_server_r: request: METHOD PATH IP STATUS"
    # Check if there were recent requests
    if [[ $idle_time -ge $IDLE_TIMEOUT ]]; then
        return 0
    fi

    return 1
}

start_monitor() {
    if [[ $IDLE_TIMEOUT -eq 0 ]]; then
        return
    fi

    local monitor_log="/tmp/llama-monitor.log"
    echo "$(date '+%Y-%m-%d %H:%M:%S') - Monitor started (timeout: ${IDLE_TIMEOUT}s)" >> "$monitor_log"

    # Start monitor in background
    (
        while true; do
            sleep $MONITOR_INTERVAL

            # Check if server is still running
            if ! pgrep -f "llama-server" >/dev/null; then
                echo "$(date '+%Y-%m-%d %H:%M:%S') - Server not running, monitor exiting" >> "$monitor_log"
                break
            fi

            # Check for idle timeout
            if check_idle; then
                echo "$(date '+%Y-%m-%d %H:%M:%S') - IDLE TIMEOUT (${IDLE_TIMEOUT}s), stopping server" >> "$monitor_log"
                stop_servers
                break
            fi
        done
    ) >/dev/null 2>&1 &

    echo "Auto-unload monitor started (timeout: ${IDLE_TIMEOUT}s, logs: $monitor_log)"
}

start_servers() {
    local model_choice=$1
    local reranker_choice=$2
    local embedding_choice=$3
    local ocr_choice=$4
    local idle_timeout_choice=$5

    # Update IDLE_TIMEOUT based on user choice
    IDLE_TIMEOUT=$(get_idle_timeout "$idle_timeout_choice")

    # Get model path
    local model_path=$(get_model_path "$model_choice")

    # Get auxiliary server paths
    local reranker_path=$(get_reranker_path "$reranker_choice")
    local embedding_path=$(get_embedding_path "$embedding_choice")
    local ocr_path=$(get_ocr_path "$ocr_choice")

    # Check if at least one service is selected
    if [[ -z "$model_path" && -z "$reranker_path" && -z "$embedding_path" && -z "$ocr_path" ]]; then
        echo "Error: No services selected. Please select at least one service."
        return 1
    fi

    # Kill existing servers
    echo "Stopping existing servers..."
    pkill -f "llama-server" 2>/dev/null
    sleep 2

    # Start main server (if model selected)
    if [[ -n "$model_path" ]]; then
        # Verify model exists
        if [[ ! -f "$model_path" ]]; then
            echo "Error: Model file not found: $model_path"
            echo "Current directory: $(pwd)"
            return 1
        fi

        # Check if model needs mmproj (vision models)
        local mmproj_arg=""
        local model_id=$(echo "$model_path" | grep -oP '(?<=models/)[^/]*' | cut -d'/' -f1)

        if [[ "$model_id" == "qwen3-vl" ]]; then
            local mmproj_path="$(dirname "$model_path")/mmproj-F16.gguf"
            if [[ -f "$mmproj_path" ]]; then
                mmproj_arg="--mmproj $mmproj_path"
                echo "Vision model detected, using mmproj: $mmproj_path"
            else
                echo "Warning: mmproj file not found at $mmproj_path"
                echo "Vision features may not work properly"
            fi
        fi

        echo ""
        echo -e "${CYAN}========================================${NC}"
        echo -e "${BOLD}${GREEN}Starting Main LLM Server${NC}"
        echo -e "${CYAN}========================================${NC}"
        echo -e "${WHITE}Model:${NC} $model_path"
        echo -e "${WHITE}Host:${NC} $HOST"
        echo -e "${WHITE}Port:${NC} $PORT"
        echo -e "${WHITE}Context:${NC} $CTX_SIZE tokens"
        echo -e "${WHITE}GPU Layers:${NC} $GPU_LAYERS"
        echo -e "${WHITE}Batch Size:${NC} $UBATCH"
        echo -e "${WHITE}Flash Attention:${NC} $FLASH_ATTN"
        echo -e "${CYAN}========================================${NC}"
        echo ""

        LOG_FILE="/tmp/llama-server.log"
        ./build/bin/llama-server \
            -m "$model_path" \
            --host "$HOST" \
            --port "$PORT" \
            -c "$CTX_SIZE" \
            -ngl "$GPU_LAYERS" \
            -t "$THREADS" \
            -np "$PARALLEL" \
            --cont-batching \
            -ub "$UBATCH" \
            -fa "$FLASH_ATTN" \
            --jinja \
            --reasoning-format "$REASONING_FORMAT" \
            --no-mmap \
            $mmproj_arg \
            > "$LOG_FILE" 2>&1 &

        SERVER_PID=$!
        echo -e "${GREEN}✓${NC} Main server started with ${CYAN}PID: $SERVER_PID${NC}"
        echo -e "${WHITE}Logs:${NC} $LOG_FILE"
        echo -e "${WHITE}Access at:${NC} ${GREEN}http://localhost:$PORT${NC}"
        echo ""

        # Wait for server to be ready
        echo -e "${YELLOW}Waiting for server to load model...${NC}"
        for i in {1..60}; do
            if curl -s http://localhost:$PORT/health >/dev/null 2>&1; then
                echo -e "${GREEN}✓ Server is ready!${NC}"
                break
            fi
            echo -n "${CYAN}.${NC}"
            sleep 2
        done
        echo ""
    else
        echo -e "${YELLOW}⊘ Skipping main LLM server (option 0 selected)${NC}"
        echo ""
    fi

    # Start reranker if requested
    if [[ -n "$reranker_path" ]]; then
        if [[ -f "$reranker_path" ]]; then
            RERANKER_PORT=$((PORT + 1))
            echo ""
            echo -e "${MAGENTA}========================================${NC}"
            echo -e "${BOLD}${MAGENTA}Starting Reranker Server${NC}"
            echo -e "${MAGENTA}========================================${NC}"
            echo -e "${WHITE}Model:${NC} $reranker_path"
            echo -e "${WHITE}Port:${NC} $RERANKER_PORT"
            echo -e "${WHITE}Context:${NC} 512 tokens (optimized for reranking)"
            echo -e "${WHITE}Batch Size:${NC} 512"
            echo -e "${MAGENTA}========================================${NC}"
            echo ""

            RERANKER_LOG="/tmp/llama-reranker.log"
            ./build/bin/llama-server \
                -m "$reranker_path" \
                --host "$HOST" \
                --port "$RERANKER_PORT" \
                -c 512 \
                -ngl "$GPU_LAYERS" \
                -t "$THREADS" \
                -ub 512 \
                --no-mmap \
                > "$RERANKER_LOG" 2>&1 &

            RERANKER_PID=$!
            echo -e "${GREEN}✓${NC} Reranker server started with ${CYAN}PID: $RERANKER_PID${NC}"
            echo -e "${WHITE}Logs:${NC} $RERANKER_LOG"
            echo -e "${WHITE}Access at:${NC} ${MAGENTA}http://localhost:$RERANKER_PORT${NC}"
            echo ""

            # Wait for reranker
            echo -e "${YELLOW}Waiting for reranker to load...${NC}"
            for i in {1..30}; do
                if curl -s http://localhost:$RERANKER_PORT/health >/dev/null 2>&1; then
                    echo -e "${GREEN}✓ Reranker is ready!${NC}"
                    break
                fi
                echo -n "${MAGENTA}.${NC}"
                sleep 2
            done
            echo ""
        else
            echo -e "${RED}Warning: Reranker model not found: $reranker_path${NC}"
            echo -e "${YELLOW}Skipping reranker startup...${NC}"
        fi
    fi

    # Start embedding server if requested
    if [[ -n "$embedding_path" ]]; then
        if [[ -f "$embedding_path" ]]; then
            EMBEDDING_PORT=$((PORT + 2))
            echo ""
            echo -e "${BLUE}========================================${NC}"
            echo -e "${BOLD}${BLUE}Starting Embedding Server${NC}"
            echo -e "${BLUE}========================================${NC}"
            echo -e "${WHITE}Model:${NC} $embedding_path"
            echo -e "${WHITE}Port:${NC} $EMBEDDING_PORT"
            echo -e "${WHITE}Context:${NC} 2048 tokens (optimized for embeddings)"
            echo -e "${WHITE}Batch Size:${NC} 512"
            echo -e "${BLUE}========================================${NC}"
            echo ""

            EMBEDDING_LOG="/tmp/llama-embedding.log"
            ./build/bin/llama-server \
                -m "$embedding_path" \
                --host "$HOST" \
                --port "$EMBEDDING_PORT" \
                -c 2048 \
                -ngl "$GPU_LAYERS" \
                -t "$THREADS" \
                -ub 512 \
                --no-mmap \
                > "$EMBEDDING_LOG" 2>&1 &

            EMBEDDING_PID=$!
            echo -e "${GREEN}✓${NC} Embedding server started with ${CYAN}PID: $EMBEDDING_PID${NC}"
            echo -e "${WHITE}Logs:${NC} $EMBEDDING_LOG"
            echo -e "${WHITE}Access at:${NC} ${BLUE}http://localhost:$EMBEDDING_PORT${NC}"
            echo ""

            # Wait for embedding server
            echo -e "${YELLOW}Waiting for embedding server to load...${NC}"
            for i in {1..30}; do
                if curl -s http://localhost:$EMBEDDING_PORT/health >/dev/null 2>&1; then
                    echo -e "${GREEN}✓ Embedding server is ready!${NC}"
                    break
                fi
                echo -n "${BLUE}.${NC}"
                sleep 2
            done
            echo ""
        else
            echo -e "${RED}Warning: Embedding model not found: $embedding_path${NC}"
            echo -e "${YELLOW}Skipping embedding server startup...${NC}"
        fi
    fi

    # Start OCR server if requested
    if [[ -n "$ocr_path" ]]; then
        if [[ -f "$ocr_path" ]]; then
            OCR_PORT=41500
            OCR_MMPROJ="$(dirname "$ocr_path")/mmproj-deepseek-ocr-q8_0.gguf"

            echo ""
            echo -e "${YELLOW}========================================${NC}"
            echo -e "${BOLD}${YELLOW}Starting OCR Server${NC}"
            echo -e "${YELLOW}========================================${NC}"
            echo -e "${WHITE}Model:${NC} $ocr_path"
            echo -e "${WHITE}Port:${NC} $OCR_PORT"
            echo -e "${WHITE}Context:${NC} 2048 tokens (optimized for OCR)"
            echo -e "${WHITE}Batch Size:${NC} 512"

            if [[ -f "$OCR_MMPROJ" ]]; then
                echo -e "${WHITE}MMProj:${NC} $OCR_MMPROJ"
            else
                echo -e "${RED}Warning: mmproj file not found at $OCR_MMPROJ${NC}"
                echo -e "${RED}OCR features may not work properly${NC}"
            fi
            echo -e "${YELLOW}========================================${NC}"
            echo ""

            OCR_LOG="/tmp/llama-ocr.log"

            if [[ -f "$OCR_MMPROJ" ]]; then
                ./build/bin/llama-server \
                    -m "$ocr_path" \
                    --host "$HOST" \
                    --port "$OCR_PORT" \
                    --mmproj "$OCR_MMPROJ" \
                    -c 2048 \
                    -ngl "$GPU_LAYERS" \
                    -t "$THREADS" \
                    -ub 512 \
                    --no-mmap \
                    > "$OCR_LOG" 2>&1 &
            else
                ./build/bin/llama-server \
                    -m "$ocr_path" \
                    --host "$HOST" \
                    --port "$OCR_PORT" \
                    -c 2048 \
                    -ngl "$GPU_LAYERS" \
                    -t "$THREADS" \
                    -ub 512 \
                    --no-mmap \
                    > "$OCR_LOG" 2>&1 &
            fi

            OCR_PID=$!
            echo -e "${GREEN}✓${NC} OCR server started with ${CYAN}PID: $OCR_PID${NC}"
            echo -e "${WHITE}Logs:${NC} $OCR_LOG"
            echo -e "${WHITE}Access at:${NC} ${YELLOW}http://localhost:$OCR_PORT${NC}"
            echo ""

            # Wait for OCR server
            echo -e "${YELLOW}Waiting for OCR server to load...${NC}"
            for i in {1..30}; do
                if curl -s http://localhost:$OCR_PORT/health >/dev/null 2>&1; then
                    echo -e "${GREEN}✓ OCR server is ready!${NC}"
                    break
                fi
                echo -ne "${YELLOW}.${NC}"
                sleep 2
            done
            echo ""
        else
            echo -e "${RED}Warning: OCR model not found: $ocr_path${NC}"
            echo -e "${YELLOW}Skipping OCR server startup...${NC}"
        fi
    fi

    # Start auto-unload monitor
    start_monitor

    echo ""
    echo -e "${GREEN}========================================${NC}"
    echo -e "${BOLD}${GREEN}All servers started!${NC}"
    echo -e "${GREEN}========================================${NC}"
    if [[ -n "$SERVER_PID" ]]; then
        echo -e "${WHITE}Main LLM:${NC}     ${CYAN}http://localhost:$PORT${NC}"
    fi
    if [[ -n "$RERANKER_PID" ]]; then
        echo -e "${WHITE}Reranker:${NC}     ${MAGENTA}http://localhost:$RERANKER_PORT${NC}"
    fi
    if [[ -n "$EMBEDDING_PID" ]]; then
        echo -e "${WHITE}Embedding:${NC}   ${BLUE}http://localhost:$EMBEDDING_PORT${NC}"
    fi
    if [[ -n "$OCR_PID" ]]; then
        echo -e "${WHITE}OCR:${NC}         ${YELLOW}http://localhost:$OCR_PORT${NC}"
    fi
    if [[ $IDLE_TIMEOUT -gt 0 ]]; then
        echo -e "${WHITE}Auto-unload:${NC}  ${IDLE_TIMEOUT}s idle timeout"
    fi
    echo ""
    echo -e "${WHITE}To stop servers:${NC}"
    echo -e "  ${CYAN}llama stop${NC}"
    echo -e "${GREEN}========================================${NC}"
}

# =============================================
# Command Line Parsing
# =============================================

ACTION=""
MODEL_CHOICE=""
RERANKER_CHOICE=""
EMBEDDING_CHOICE=""
OCR_CHOICE=""
IDLE_CHOICE=""

while [[ $# -gt 0 ]]; do
    case $1 in
        start|stop)
            ACTION="$1"
            shift
            ;;
        -m|--model)
            MODEL_CHOICE="$2"
            shift 2
            ;;
        -r|--reranker)
            RERANKER_CHOICE="$2"
            shift 2
            ;;
        -e|--embedding)
            EMBEDDING_CHOICE="$2"
            shift 2
            ;;
        -o|--ocr)
            OCR_CHOICE="$2"
            shift 2
            ;;
        -t|--timeout)
            IDLE_CHOICE="$2"
            shift 2
            ;;
        -h|--help)
            echo "Usage: llama [ACTION] [OPTIONS]"
            echo ""
            echo "Actions:"
            echo "  start                  Start server (interactive if no options)"
            echo "  stop                   Stop all llama.cpp servers"
            echo ""
            echo "Options for 'start':"
            echo "  -m, --model <number>       Select model by number (0=none)"
            echo "  -r, --reranker <number>    Select reranker by number (0=none, 1=bge)"
            echo "  -e, --embedding <number>   Select embedding by number (0=none, 1=0.6b, 2=8b)"
            echo "  -o, --ocr <number>         Select OCR by number (0=none, 1=deepseek)"
            echo "  -t, --timeout <seconds>    Auto-stop timeout (0=disabled, 300=5min, etc.)"
            echo "  -h, --help                 Show this help"
            echo ""
            echo "If no action specified, interactive menu will be shown."
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            echo "Use -h for help"
            exit 1
            ;;
    esac
done

# =============================================
# Main Logic
# =============================================

# Stop action
if [[ "$ACTION" == "stop" ]]; then
    stop_servers
    exit 0
fi

# Interactive mode or start action
if [[ -z "$ACTION" ]]; then
    # Interactive mode
    show_main_menu
    read -p "Select action [1-3] (default: 1): " action_choice
    action_choice="${action_choice:-1}"

    case $action_choice in
        1)
            # Start server
            show_model_menu
            read -p "Select model [0-14] (default: 0): " model_choice
            model_choice="${model_choice:-0}"

            show_reranker_menu
            read -p "Select reranker [0-1] (default: 0): " reranker_choice
            reranker_choice="${reranker_choice:-0}"

            show_embedding_menu
            read -p "Select embedding [0-2] (default: 0): " embedding_choice
            embedding_choice="${embedding_choice:-0}"

            show_ocr_menu
            read -p "Select OCR [0-1] (default: 0): " ocr_choice
            ocr_choice="${ocr_choice:-0}"

            show_idle_timeout_menu
            read -p "Select auto-stop timeout [0-6] (default: 0): " idle_choice
            idle_choice="${idle_choice:-0}"

            # If custom timeout selected, ask for value
            if [[ "$idle_choice" == "6" ]]; then
                read -p "Enter timeout in seconds: " custom_timeout
                idle_choice="$custom_timeout"
            fi

            start_servers "$model_choice" "$reranker_choice" "$embedding_choice" "$ocr_choice" "$idle_choice"
            ;;
        2)
            # Stop server
            stop_servers
            ;;
        3)
            echo "Exiting..."
            exit 0
            ;;
        *)
            echo "Invalid choice"
            exit 1
            ;;
    esac
elif [[ "$ACTION" == "start" ]]; then
    # Command line start mode
    if [[ -z "$MODEL_CHOICE" ]]; then
        show_model_menu
        read -p "Select model [0-14] (default: 0): " model_choice
        model_choice="${model_choice:-0}"
    fi

    if [[ -z "$RERANKER_CHOICE" ]]; then
        show_reranker_menu
        read -p "Select reranker [0-1] (default: 0): " reranker_choice
        reranker_choice="${reranker_choice:-0}"
    fi

    if [[ -z "$EMBEDDING_CHOICE" ]]; then
        show_embedding_menu
        read -p "Select embedding [0-2] (default: 0): " embedding_choice
        embedding_choice="${embedding_choice:-0}"
    fi

    if [[ -z "$OCR_CHOICE" ]]; then
        show_ocr_menu
        read -p "Select OCR [0-1] (default: 0): " ocr_choice
        ocr_choice="${ocr_choice:-0}"
    fi

    if [[ -z "$IDLE_CHOICE" ]]; then
        show_idle_timeout_menu
        read -p "Select auto-stop timeout [0-6] (default: 0): " idle_choice
        idle_choice="${idle_choice:-0}"

        # If custom timeout selected, ask for value
        if [[ "$idle_choice" == "6" ]]; then
            read -p "Enter timeout in seconds: " custom_timeout
            idle_choice="$custom_timeout"
        fi

        IDLE_CHOICE="$idle_choice"
    fi

    start_servers "$MODEL_CHOICE" "$RERANKER_CHOICE" "$EMBEDDING_CHOICE" "$OCR_CHOICE" "$IDLE_CHOICE"
fi

Here’s my llama.cpp config for a single Spark.

At this point, I honestly think this is the most practical setup — picking models that are actually fast and stable right now. I’ve tested a lot, and overall the throughput looks solid.

If anyone has suggestions or tweaks that improved results on your side — welcome.
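One small addition to the script: a quick sanity check that every endpoint actually came up. This is a sketch that reuses the script's port variables (llama-server exposes a `/health` endpoint); empty or unset ports are skipped.

```shell
# Probe each configured server's /health endpoint.
# Port variables are assumed to be set as in the script above.
for port in "$PORT" "$RERANKER_PORT" "$EMBEDDING_PORT" "$OCR_PORT"; do
    [[ -n "$port" ]] || continue
    if curl -fs "http://localhost:${port}/health" >/dev/null; then
        echo "port ${port}: up"
    else
        echo "port ${port}: DOWN"
    fi
done
```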


That is one heck of a shell script!

For ‘boring but works’, you may also wish to check out https://ramalama.ai from Red Hat.


I didn’t see which parameters you compiled llama.cpp with, but since I posted my tutorial for building llama.cpp there have been some additional developments, so these days I build with the following:

$ sudo apt-get install libcurl4-openssl-dev
$ cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=ON -DGGML_NATIVE=ON -DCMAKE_CUDA_ARCHITECTURES=121a-real
$ cmake --build build --config Release -j 20

Using ‘121a-real’ as the architecture means it compiles for that one architecture and nothing else; in other words, it’s a build that specifically targets the DGX Spark.
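If you’re unsure which architecture value to pin, you can query the GPU’s compute capability first (assuming the NVIDIA driver is installed):

```shell
# Query the GPU's compute capability (needs the NVIDIA driver).
# The Spark's GB10 should report 12.1, which maps to
# CMAKE_CUDA_ARCHITECTURES=121a-real in the build command above.
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```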

Also, I would definitely recommend not ignoring prompt processing speed. A lot of people overlook it, but it’s the time it takes to process the prompt before the first token can even be generated, and once you get into larger context sizes this can really add up.
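To put rough numbers on that (illustrative speeds, not measurements): prefill time grows linearly with prompt length, so a prompt-processing rate that feels instant at 2K context becomes a noticeable stall at 32K.

```shell
# Illustrative time-to-first-token at an assumed prefill speed.
PP_SPEED=800  # assumed prompt-processing speed, tokens/s
for CTX in 2048 8192 32768; do
    echo "${CTX}-token prompt: ~$(( CTX / PP_SPEED )) s before the first token"
done
# Measure your actual prefill (pp) vs generation (tg) speeds with:
#   ./build/bin/llama-bench -m <model.gguf> -p 2048 -n 128
```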


Thanks, already done!

[1] ★★★★ GPT-OSS 120B (60GB, MXFP4) [128K, 🟢56 tok/s ⭐9.1(#8)] - Best for reasoning ✓
[2] ★★★ GPT-OSS 20B (12GB, MXFP4) [128K, 🟢80 tok/s ⭐9.0(#9)] - Fast MoE model ✓
[3] ★★★★ Qwen3-VL 30B (17GB, MXFP4) [256K, 🟢82 tok/s ⭐9.7(#4) 🏆] - Vision and Code ✓
[4] ★★★★ Qwen3-Next 80B (41GB, MXFP4) [256K, 🟡35 tok/s ⭐9.3(#6)] - Deep reasoning ✓
[5] ★★★★ Qwen3-Next 80B (41GB, MXFP4) [256K, 🟡36 tok/s ⭐9.5(#5)] - Fast and smart ✓
[6] ★★★★ Devstral-2 123B (58GB, Q3_K_XL) [32K, 🔴3 tok/s ⭐9.8(#2) ⚠️NEEDS Q4_K_M!] - Best code quality
[7] ★★★ GLM-4.5 Air (59GB, MXFP4) [128K, 🟠19 tok/s ⭐8.7(#12)] - Agent model
[8] ★★★ GLM-4.5 Air (54GB, Q3_K_XL) [128K, 🟡24 tok/s ⭐8.9(#11)] - Faster variant ✓
[9] ★★★★ Qwen3 Coder 30B (16GB, MXFP4) [256K, 🟢81 tok/s ⭐9.7(#5)] - Programming specialist ✓
[10] ★★ GLM-4.6V Flash (8.3GB, Q6_K_XL) [128K, 🟡24 tok/s ⭐8.7(#13)] - Video understanding ✓
[11] ★★★ Olmo 3.1 32B (19GB, Q4_K_XL) [64K, 🟠10 tok/s ⭐9.5(#7)] - Open source model
[12] ★★★ Olmo 3 Think 32B (19GB, Q4_K_XL) [64K, 🟠10 tok/s ⭐9.9(#1) 🏆] - Best for analysis
[13] ★★ RNJ-1 8B (8.3GB, Q8_0) [32K, 🟡24 tok/s ⭐9.0(#10)] - Lightweight fast
[14] ★★★★ Nemotron 30B (22GB, Q4_K_XL) [1M, 🟢73 tok/s ⭐9.7(#3) 🏆] - Universal best ✓

My test results


I agree that llama.cpp seems to be the fastest for single-user inference. The strength of vLLM is not single-user mode but rather handling many simultaneous requests. Its developers have concentrated on making it sing for that use case, and I understand it is used by many cloud sites to serve zillions of users.
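On the memory behavior from the original post: vLLM pre-allocates most of GPU memory for the KV cache up front (the default is 0.9, i.e. ~90%), which is why a 30B model can appear to eat ~110 GB on a unified-memory box like the Spark. If you need to run something alongside it, these two flags are the usual levers (a sketch with a placeholder model name; check the defaults for your vLLM version):

```shell
# --gpu-memory-utilization caps vLLM's up-front reservation (default 0.9);
# --max-model-len shrinks the KV cache by limiting the maximum context.
# <model> is a placeholder for your model path or Hugging Face ID.
vllm serve <model> \
    --gpu-memory-utilization 0.5 \
    --max-model-len 32768
```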

I have been enamored of the GLM models, particularly for vision purposes. GPT-OSS-120B is also very good for coding and general advice. The thing is, LLMs have strengths and weaknesses that you need to be aware of. For example, they are generally quite good at summarizing text, and the vision models are often amazing at describing what an image is or what it contains. Counting letters or words is a weakness, simply because they use tokens, not words, and the tokens are just numbers to the LLM. Arithmetic is also not their strong suit, which is why researchers have developed agent-assisted workarounds.

Although it’s a bit contrived, I enjoyed @alexander.ziskind’s recent video comparing the Spark, the Apple M3 Ultra, and the AMD Ryzen (Fastest 1,000,000 tokens… and who paid the most), which demonstrates (among many other things) how vLLM shines in multi-user scenarios.
