Hi everyone,
I’m planning to test and demo the DGX Spark “Building and Deploying a Multi-Agent Chatbot” playbook on a single workstation, and I’d like to confirm realistic minimum and recommended hardware specs.
I’m referring to this playbook and models:
- DGX Spark multi-agent chatbot playbook (GitHub: dgx-spark-playbooks)
- Models:
  - gpt-oss-120B GGUF (~63 GB)
  - DeepSeek Coder 6.7B (GGUF)
  - Qwen3-Embedding-4B
Use case
- Purpose: Proof-of-concept / pilot + live demo at an exhibition
- Concurrency: Effectively single-user or 1–2 users at a time (controlled, not a public web service)
- Workload:
  - Multi-agent chat orchestration (supervisor + tools)
  - RAG over a few uploaded PDFs
  - Code-related queries via DeepSeek Coder
- Latency requirement: “Human acceptable” for a demo (a few seconds initial latency is fine). I don’t need cloud-level throughput, just something that doesn’t feel broken.
Candidate workstation
I have access to an HP Z8 workstation with:
- CPU: Dual Xeon (plenty of cores/threads)
- GPU: NVIDIA RTX A5000 (Ampere, 24 GB VRAM)
- System RAM: 128 GB
- Storage: NVMe SSD (≥ 2 TB)
- OS plan: Ubuntu 22.04 LTS + latest NVIDIA driver + Docker + NVIDIA Container Toolkit
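As a rough sanity check on these specs, here is the back-of-the-envelope split I have in mind between VRAM and host RAM for the ~63 GB model. This is only a sketch: the reserved amounts (for the two smaller models, KV cache, OS and containers) are placeholder guesses of mine, not measured values, and the function name is made up for illustration.

```python
def split_model(model_gb, vram_gb, ram_gb, vram_reserved_gb, ram_reserved_gb):
    """Estimate how a partially offloaded model divides between GPU and host.

    Returns (gb_on_gpu, gb_on_host, host_headroom_gb).
    A negative headroom means the host would have to swap.
    """
    usable_vram = max(vram_gb - vram_reserved_gb, 0)
    on_gpu = min(model_gb, usable_vram)          # whatever fits in the free VRAM
    on_host = model_gb - on_gpu                  # remainder stays in system RAM
    headroom = ram_gb - ram_reserved_gb - on_host
    return on_gpu, on_host, headroom


# 63 GB model, 24 GB VRAM, 128 GB RAM come from the specs above.
# The 8 GB / 16 GB reservations are ASSUMED placeholders.
on_gpu, on_host, headroom = split_model(
    63, 24, 128, vram_reserved_gb=8, ram_reserved_gb=16
)
print(f"GPU: {on_gpu} GB, host: {on_host} GB, host headroom: {headroom} GB")
```

Under those assumed reservations, most of the 120B weights would sit in host RAM, which is why I'm asking about the VRAM/RAM split below.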
My questions
- Is RTX A5000 (24 GB VRAM) + 128 GB RAM realistically enough to run:
  - gpt-oss-120B GGUF (~63 GB),
  - DeepSeek Coder 6.7B,
  - and Qwen3-Embedding-4B

  inside the multi-agent chatbot stack for a single interactive user (or at most 1–2 users at a time) without everything crawling?
- If yes, what kind of throughput / token rate / latency should I realistically expect on this class of GPU? Rough ballpark is fine: e.g., “~X tokens/sec for 120B GGUF” or “expect ~Y seconds for a ~200-token reply”.
- Are there any recommended configuration tweaks for this hardware?
  - Quantization level for gpt-oss-120B (Q4 vs Q5 vs others)?
  - Suggested llama.cpp parameters (GPU layers, batch size, threads) on a 24 GB card?
  - Limits on context length or max tokens to avoid thrashing RAM / swap?
- If VRAM is not sufficient and a significant portion of the 120B model has to stay in host RAM (offloading/swap):
  - How much system RAM would you consider a practical minimum to keep the pilot reliable for just one person at a time in a booth?
  - Any guidelines or rules-of-thumb for VRAM vs host RAM split (e.g., “X GB VRAM + Y GB RAM is okay for 120B GGUF, below that it becomes unstable or painfully slow”)?
- For this use case (workstation pilot, no DGX), would you recommend:
  - Sticking to Ubuntu 22.04 LTS, or is 24.04 LTS also officially fine for this toolchain (drivers, CUDA, NVIDIA Container Toolkit, DGX Spark playbooks)?
  - Any known pitfalls or version combinations to avoid?
- Finally, if you consider this setup borderline, what would you list as a “comfortable minimum” for:
  - VRAM (per GPU)
  - System RAM

  for running this specific multi-agent chatbot playbook with gpt-oss-120B for 1–2 concurrent users?
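To make the llama.cpp question concrete, these are the knobs I mean, written as a small helper that assembles a `llama-server` command line. It is only a sketch: the model path is hypothetical and every numeric value is a placeholder I would like corrected, not a setting I have tested or that the playbook prescribes.

```python
import shlex


def llama_server_args(model_path, n_gpu_layers, ctx_size, threads, batch_size):
    """Build an argv list for llama.cpp's llama-server.

    Covers only the parameters asked about: GPU layers, context length,
    CPU threads and batch size.
    """
    return [
        "llama-server",
        "--model", str(model_path),
        "--n-gpu-layers", str(n_gpu_layers),  # layers offloaded to the 24 GB card
        "--ctx-size", str(ctx_size),          # context length in tokens
        "--threads", str(threads),            # CPU threads for the host-side layers
        "--batch-size", str(batch_size),      # prompt-processing batch size
    ]


# All values below are PLACEHOLDERS; the path is hypothetical.
args = llama_server_args(
    "models/gpt-oss-120b.gguf",
    n_gpu_layers=12,
    ctx_size=8192,
    threads=16,
    batch_size=512,
)
print(shlex.join(args))
```

If the playbook launches the model differently (for example through its own Docker Compose configuration), the equivalent settings there are what I'm after.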
I’m not trying to build a production multi-tenant service here; I just want a stable, realistic pilot environment where people at a booth can interact with the chatbot and RAG demo without embarrassing instability or 30+ second waits per answer.
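For context, this is how I translate that 30-second ceiling into a generation speed target. It is simple arithmetic rather than a benchmark: the 200-token reply length is the example from my question above, and the 5 seconds to first token is an assumed value for prompt processing plus RAG lookup.

```python
def min_tokens_per_sec(reply_tokens, max_wait_s, first_token_s):
    """Lowest generation rate that keeps a full reply under max_wait_s."""
    budget = max_wait_s - first_token_s  # seconds left for token generation
    if budget <= 0:
        raise ValueError("time to first token already exceeds the wait limit")
    return reply_tokens / budget


# 200-token reply, 30 s ceiling, ASSUMED 5 s until the first token.
print(min_tokens_per_sec(200, 30, 5))  # 8.0 tokens/sec
```

So anything well below that rate on the 120B model would already push a typical answer past what I consider acceptable for the booth.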
Any guidance or concrete numbers from people who’ve run this playbook (or similar 120B GGUF setups) on workstation-class hardware would be greatly appreciated.
Thanks in advance.