Can I use Ollama or vLLM on the GB10 to run multiple LLM models simultaneously

Can I use Ollama or vLLM on the GB10 to run multiple LLM models simultaneously—such as vector models, language models, multimodal models, etc.—assuming these are small-parameter models and the GB10 has sufficient VRAM to support their combined size? Since the GB10 does not support MIG, do I need to use MPS? Could you please provide a reference example? Thank you.

There’s a lab for that: Build and Deploy a Multi-Agent Chatbot | DGX Spark

These are the models being used in the lab:

Thanks for your reply

Any related documentation or suggestions for both Ollama and vLLM would be appreciated.

Thanks to eugr on this forum I got llama.cpp working with multiple models.

Installation instructions here: llama.cpp/docs/build.md at master · ggml-org/llama.cpp · GitHub

You can use llama-server to create multiple OpenAI-compatible endpoints on different ports; use a separate terminal session for each:

~/llama.cpp/build/bin/llama-server -m ~/.cache/llama.cpp/creativewriter32B-GGUF/creative-writer-32b-preview-Q5_K_L.gguf --host 0.0.0.0 --port 8082 --ctx-size 0 --jinja -ub 8192 -b 8192 -ngl 999 --flash-attn on --no-mmap
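As a minimal sketch of the client side (assuming two llama-server instances are already running, e.g. on ports 8082 and 8083; the ports and prompts are illustrative), each endpoint speaks the OpenAI chat-completions protocol, so the standard library is enough to talk to both:

```python
import json
import urllib.request

def build_chat_request(prompt):
    # Minimal OpenAI-compatible /v1/chat/completions payload;
    # llama-server serves whatever model it was started with.
    return {"messages": [{"role": "user", "content": prompt}]}

def chat(port, prompt, host="localhost"):
    # Each llama-server instance owns one model on its own port.
    req = urllib.request.Request(
        f"http://{host}:{port}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# e.g. chat(8082, "Write a haiku") and chat(8083, "Summarize this...")
# would hit two different models running side by side.
```

Since each server is a separate process holding its own weights, the GPU is shared by the driver without needing MIG or MPS configuration.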

Or you can use the llama-cpp-python library: GitHub - abetlen/llama-cpp-python: Python bindings for llama.cpp

self._model = Llama(
    model_path=self.model_path,
    n_ctx=1536,
    n_gpu_layers=-1,
    n_batch=256,
    cache_prompt=True,
    flash_attn=True,
    use_mmap=False,
    mul_mat_q=False,
    numa=False,
    seed=0,
    logits_all=False,
    embedding=False,
    verbose=False,
)

This is what I have found to work in my case.

You can also do it with Ollama: just run ollama run [model] in different terminals.
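With Ollama you only need one server, since it routes requests by model name over its local API (default port 11434). A small sketch, assuming Ollama is running and the model names are examples you have pulled; how many models stay resident in VRAM at once is governed by settings such as OLLAMA_MAX_LOADED_MODELS:

```python
import json
import urllib.request

def build_generate_payload(model, prompt):
    # Ollama /api/generate request; stream=False returns one JSON object.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model, prompt):
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_generate_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

# Example: two different models served by the same Ollama instance.
# generate("llama3.2", "Hello")
# generate("qwen2.5", "Bonjour")
```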

Thanks.

So what about vLLM?

vLLM works fine. I have a playbook for Nemotron Nano VL, but you can use it for anything else and build your own Docker images.
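For multiple models with vLLM, the usual pattern is one OpenAI-compatible server process per model, each on its own port. A sketch of launching two instances from Python, assuming vLLM is installed and the model names are illustrative; the real vLLM flag --gpu-memory-utilization caps each instance's share of VRAM so they can coexist on the single GB10 GPU:

```python
import subprocess

def serve_cmd(model, port, gpu_fraction):
    # Builds the `vllm serve` command line for one model instance.
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(gpu_fraction),
    ]

# Launch two instances, e.g. a chat model and a vision-language model
# (model names here are placeholders for whatever you actually use):
# procs = [subprocess.Popen(serve_cmd(m, p, 0.4))
#          for m, p in [("Qwen/Qwen2.5-1.5B-Instruct", 8000),
#                       ("some-org/some-vlm", 8001)]]
```

Keeping the per-instance gpu_fraction values summing below 1.0 leaves headroom for activations and avoids the two servers fighting over memory.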

Thanks

You’re welcome @haidij

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.