Can I use Ollama or vLLM on the GB10 to run multiple models simultaneously (embedding/vector models, language models, multimodal models, etc.), assuming they are small-parameter models and the GB10 has enough VRAM for their combined size? Since the GB10 does not support MIG, do I need to use MPS? Could you please provide a reference example? Thank you.
There’s a lab for that: Build and Deploy a Multi-Agent Chatbot | DGX Spark
These are the models being used in the lab:
Thanks for your reply.
Any related documentation or suggestions covering both Ollama and vLLM would be appreciated.
Thanks to eugr on this forum I got llama.cpp working with multiple models.
Installation instructions here: llama.cpp/docs/build.md at master · ggml-org/llama.cpp · GitHub
You can use llama-server to create multiple OpenAI-compatible endpoints on different ports; use a different terminal session for each:

```shell
~/llama.cpp/build/bin/llama-server -m ~/.cache/llama.cpp/creativewriter32B-GGUF/creative-writer-32b-preview-Q5_K_L.gguf --host 0.0.0.0 --port 8082 --ctx-size 0 --jinja -ub 8192 -b 8192 -ngl 999 --flash-attn on --no-mmap
```
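Since each llama-server instance is its own OpenAI-compatible endpoint, a small dispatcher can pick the right port by model name. A minimal sketch (the model names and ports below are placeholders, not from this thread):

```python
# Map each locally served model to the port its llama-server listens on.
# (Names and ports are examples; match them to your own launches.)
ENDPOINTS = {
    "creative-writer-32b": 8082,
    "embedding-model": 8083,
}

def chat_url(model: str, host: str = "127.0.0.1") -> str:
    """Return the OpenAI-compatible chat endpoint for a model;
    llama-server serves /v1/chat/completions on its own port."""
    return f"http://{host}:{ENDPOINTS[model]}/v1/chat/completions"
```

Any OpenAI-compatible client can then be pointed at the returned URL.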
Or you can use the llama-cpp-python library: GitHub - abetlen/llama-cpp-python: Python bindings for llama.cpp
```python
self._model = Llama(
    model_path=self.model_path,
    n_ctx=1536,
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_batch=256,
    cache_prompt=True,
    flash_attn=True,
    use_mmap=False,
    mul_mat_q=False,
    numa=False,
    seed=0,
    logits_all=False,
    embedding=False,
    verbose=False,
)
```
This is what I have found to work in my case.
You can also do it with Ollama; just use a different terminal for each `ollama run [model]`.
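Worth noting: unlike llama-server, Ollama serves every pulled model from a single endpoint (port 11434 by default) and selects the model per request, so running several models concurrently is mostly a matter of sending requests that name different models. A sketch of building such a request body (the model tag is illustrative):

```python
import json

def ollama_generate_body(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/generate endpoint; the single
    server dispatches to whichever model the request names."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

# Example: two different models behind the same server, chosen per request.
body = ollama_generate_body("llama3.2:1b", "Hello")
```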
Thanks.
So what about vLLM?
vLLM works fine. I have a playbook for Nemotron Nano VL, but you can use it for anything else and build your own Docker images.
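As with llama-server, each vLLM instance exposes its own OpenAI-compatible server, so several small models can run side by side on different ports. A sketch of composing the `vllm serve` command lines (model names, ports, and the GPU-memory split are assumptions you must tune so the instances fit together):

```python
# Compose `vllm serve` command lines, one per model/port pair. Each
# instance should get a slice of GPU memory via --gpu-memory-utilization
# so the combined weights and KV caches fit. (Models and fractions are
# illustrative, not recommendations.)
def vllm_serve_cmd(model: str, port: int, gpu_frac: float) -> list[str]:
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(gpu_frac),
    ]

cmds = [
    vllm_serve_cmd("Qwen/Qwen2.5-1.5B-Instruct", 8000, 0.3),
    vllm_serve_cmd("nomic-ai/nomic-embed-text-v1.5", 8001, 0.2),
]
```

Run each command in its own terminal (or wrap them in a Docker image, as above), then point clients at the matching port.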
Thanks
You’re welcome @haidij
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.