Can I use Ollama or vLLM on the GB10 to run multiple models simultaneously (embedding/vector models, language models, multimodal models, etc.), assuming they are small-parameter models and the GB10 has enough VRAM for their combined size? Since the GB10 does not support MIG, do I need to use MPS? Could you please provide a reference example? Thank you.
There’s a lab for that: Build and Deploy a Multi-Agent Chatbot | DGX Spark
These are the models being used in the lab:
Thanks for your reply.
Any related documentation or suggestions covering both Ollama and vLLM would be appreciated.
Thanks to eugr on this forum I got llama.cpp working with multiple models.
Installation instructions here: llama.cpp/docs/build.md at master · ggml-org/llama.cpp · GitHub
You can use llama-server to create multiple OpenAI-compatible endpoints on different ports; use a different terminal session for each:

```shell
~/llama.cpp/build/bin/llama-server -m ~/.cache/llama.cpp/creativewriter32B-GGUF/creative-writer-32b-preview-Q5_K_L.gguf --host 0.0.0.0 --port 8082 --ctx-size 0 --jinja -ub 8192 -b 8192 -ngl 999 --flash-attn on --no-mmap
```
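Since each llama-server instance is its own OpenAI-compatible endpoint, a small dispatcher can pick the right port by model name. A minimal sketch (the model names and ports below are placeholders, not from this thread):

```python
# Map each locally served model to the port its llama-server listens on.
# (Names and ports are examples; match them to your own launches.)
ENDPOINTS = {
    "creative-writer-32b": 8082,
    "embedding-model": 8083,
}

def chat_url(model: str, host: str = "127.0.0.1") -> str:
    """Return the OpenAI-compatible chat endpoint for a model;
    llama-server serves /v1/chat/completions on its own port."""
    return f"http://{host}:{ENDPOINTS[model]}/v1/chat/completions"
```

Any OpenAI-compatible client can then be pointed at the returned URL.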
Or you can use the llama-cpp-python library: GitHub - abetlen/llama-cpp-python: Python bindings for llama.cpp
```python
self._model = Llama(
    model_path=self.model_path,
    n_ctx=1536,
    n_gpu_layers=-1,  # offload all layers to the GPU
    n_batch=256,
    cache_prompt=True,
    flash_attn=True,
    use_mmap=False,
    mul_mat_q=False,
    numa=False,
    seed=0,
    logits_all=False,
    embedding=False,
    verbose=False,
)
```
This is what I have found to work in my case.
You can also do it with Ollama; just use a different terminal for each `ollama run [model]`.
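Worth noting: unlike llama-server, Ollama serves every pulled model from a single endpoint (port 11434 by default) and selects the model per request, so running several models concurrently is mostly a matter of sending requests that name different models. A sketch of building such a request body (the model tag is illustrative):

```python
import json

def ollama_generate_body(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/generate endpoint; the single
    server dispatches to whichever model the request names."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

# Example: two different models behind the same server, chosen per request.
body = ollama_generate_body("llama3.2:1b", "Hello")
```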
Thanks.
So what about vLLM?
vLLM works fine. I have a playbook for Nemotron Nano VL, but you can use it for anything else and build your own Docker images.
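As with llama-server, each vLLM instance exposes its own OpenAI-compatible server, so several small models can run side by side on different ports. A sketch of composing the `vllm serve` command lines (model names, ports, and the GPU-memory split are assumptions you must tune so the instances fit together):

```python
# Compose `vllm serve` command lines, one per model/port pair. Each
# instance should get a slice of GPU memory via --gpu-memory-utilization
# so the combined weights and KV caches fit. (Models and fractions are
# illustrative, not recommendations.)
def vllm_serve_cmd(model: str, port: int, gpu_frac: float) -> list[str]:
    return [
        "vllm", "serve", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(gpu_frac),
    ]

cmds = [
    vllm_serve_cmd("Qwen/Qwen2.5-1.5B-Instruct", 8000, 0.3),
    vllm_serve_cmd("nomic-ai/nomic-embed-text-v1.5", 8001, 0.2),
]
```

Run each command in its own terminal (or wrap them in a Docker image, as above), then point clients at the matching port.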
Thanks
You’re welcome @haidij
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.