Build and Deploy a Multi-Agent Chatbot on a Workstation

Hi everyone,

I’m planning to test and demo the DGX Spark “Building and Deploying a Multi-Agent Chatbot” playbook on a single workstation, and I’d like to confirm realistic minimum and recommended hardware specs.

I’m referring to this playbook and models:

  • DGX Spark multi-agent chatbot playbook (GitHub: dgx-spark-playbooks)

  • Models:

    • gpt-oss-120B GGUF (~63 GB)

    • DeepSeek Coder 6.7B (GGUF)

    • Qwen3-Embedding-4B

Use case

  • Purpose: Proof-of-concept / pilot + live demo at an exhibition

  • Concurrency: Effectively single-user or 1–2 users at a time (controlled, not a public web service)

  • Workload:

    • Multi-agent chat orchestration (supervisor + tools)

    • RAG over a few uploaded PDFs

    • Code-related queries via DeepSeek Coder

  • Latency requirement: “Human acceptable” for a demo (a few seconds initial latency is fine). I don’t need cloud-level throughput, just something that doesn’t feel broken.

Candidate workstation

I have access to an HP Z8 workstation with:

  • CPU: Dual Xeon (plenty of cores/threads)

  • GPU: NVIDIA RTX A5000 (Ampere, 24 GB VRAM)

  • System RAM: 128 GB

  • Storage: NVMe SSD (≥ 2 TB)

  • OS plan: Ubuntu 22.04 LTS + latest NVIDIA driver + Docker + NVIDIA Container Toolkit

My questions

  1. Is an RTX A5000 (24 GB VRAM) + 128 GB RAM realistically enough to run:

    • gpt-oss-120B GGUF (~63 GB),

    • DeepSeek Coder 6.7B,

    • and Qwen3-Embedding-4B
      inside the multi-agent chatbot stack for a single interactive user (or at most 1–2 users at a time) without everything crawling?

  2. If yes, what kind of throughput / token rate / latency should I realistically expect on this class of GPU?
    Rough ballpark is fine: e.g., “~X tokens/sec for 120B GGUF” or “expect ~Y seconds for a ~200-token reply”.

  3. Are there any recommended configuration tweaks for this hardware?

    • Quantization level for gpt-oss-120B (Q4 vs Q5 vs others)?

    • Suggested llama.cpp parameters (GPU layers, batch size, threads) on a 24 GB card?

    • Limits on context length or max tokens to avoid thrashing RAM / swap?

  4. If VRAM is not sufficient and a significant portion of the 120B model has to stay in host RAM (offloading/swap):

    • How much system RAM would you consider a practical minimum to keep the pilot reliable for just one person at a time in a booth?

    • Any guidelines or rules-of-thumb for VRAM vs host RAM split (e.g., “X GB VRAM + Y GB RAM is okay for 120B GGUF, below that it becomes unstable or painfully slow”)?

  5. For this use case (workstation pilot, no DGX), would you recommend:

    • Sticking to Ubuntu 22.04 LTS, or is 24.04 LTS also officially fine for this toolchain (drivers, CUDA, NVIDIA Container Toolkit, DGX Spark playbooks)?

    • Any known pitfalls or version combinations to avoid?

  6. Finally, if you consider this setup borderline, what would you list as a “comfortable minimum” for:

    • VRAM (per GPU)

    • System RAM
      for running this specific multi-agent chatbot playbook with gpt-oss-120B for 1–2 concurrent users?

I’m not trying to build a production multi-tenant service here; I just want a stable, realistic pilot environment where people at a booth can interact with the chatbot and RAG demo without embarrassing instability or 30+ second waits per answer.

Any guidance or concrete numbers from people who’ve run this playbook (or similar 120B GGUF setups) on workstation-class hardware would be greatly appreciated.

Thanks in advance.

Are you asking about running a Spark playbook on a non-Spark platform? You are welcome to use the playbook as a basis for experimentation, but the assets and images it references were built for a different CPU platform (x86 vs. Arm) and GPU architecture (Ampere vs. GB10). Mixing GPU frame buffer and host memory for inference is non-trivial.
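
To make the frame buffer vs. host memory question concrete, here is a rough back-of-the-envelope sketch in Python. It uses the figures from the question (a ~63 GB GGUF, 24 GB VRAM, 128 GB system RAM). The reserved amounts for the other models, KV cache, and the OS are placeholder assumptions rather than measurements, and `plan_split` is a hypothetical helper, not part of the playbook or llama.cpp. It only checks capacity and says nothing about token rate.

```python
from dataclasses import dataclass


@dataclass
class MemoryPlan:
    vram_for_weights_gb: float   # weights that can stay in GPU memory
    host_for_weights_gb: float   # weights that must live in system RAM
    gpu_fraction: float          # share of the weights resident on the GPU
    fits: bool                   # True if the host share fits in usable RAM


def plan_split(model_gb, vram_gb, host_ram_gb,
               vram_reserved_gb=0.0, host_reserved_gb=0.0):
    """Estimate how a model's weights divide between VRAM and host RAM.

    The *_reserved_gb values are ASSUMED headroom (KV cache, other models,
    CUDA context, OS, containers); tune them for your own setup.
    """
    usable_vram = max(vram_gb - vram_reserved_gb, 0.0)
    usable_host = max(host_ram_gb - host_reserved_gb, 0.0)
    on_gpu = min(model_gb, usable_vram)
    on_host = model_gb - on_gpu
    return MemoryPlan(
        vram_for_weights_gb=on_gpu,
        host_for_weights_gb=on_host,
        gpu_fraction=on_gpu / model_gb,
        fits=on_host <= usable_host,
    )


# Figures from the question; the 4 GB and 16 GB reserves are assumptions.
plan = plan_split(model_gb=63, vram_gb=24, host_ram_gb=128,
                  vram_reserved_gb=4, host_reserved_gb=16)
print(f"GPU: {plan.vram_for_weights_gb:.0f} GB, "
      f"host: {plan.host_for_weights_gb:.0f} GB, "
      f"GPU share: {plan.gpu_fraction:.0%}, fits: {plan.fits}")
```

Under these assumed reserves, a bit under a third of the weights would sit in VRAM and the remainder in host RAM. That is feasible on capacity alone, but the resulting latency has to be measured on the actual machine.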