I know that NVLink isn’t available on the Jetson platform, but I’ve read that I can load balance models using Ollama. I’ve also looked into Kubernetes GPU passthrough, but that doesn’t seem to work or even do what I want. Ideally, I want my Jetson Orin 64GB machines to act as Ollama helper nodes for my main system running an RTX 3080.

I’ve been experimenting with running open-webui and adding a connection to one (or both) of my Orin systems. However, that seems to simply put all of the load on the Orin, whereas I want Ollama to actually load balance. Unfortunately, running Ollama natively (outside of Docker) on my Orins doesn’t quite work when doing a simple ollama run modelname:latest, but it does run natively on my RTX 3080 system, which is why open-webui works wonderfully there. Ideally, I should be able to run Ollama on all three systems, run open-webui on the RTX 3080 only, and have all three machines share the Ollama workload. Any assistance would be greatly appreciated.

Below is example code that I try, but it’s running in Docker while I run Ollama natively on all machines. I would be curious to understand why Ollama fails to load a model natively on the Orins but runs without issue in a Docker container. The NVIDIA container abstraction is interesting, but I don’t fully understand it. Below is an example of what I attempt in order to achieve my goal (change hostnames as appropriate).
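Roughly, the pieces look something like this. This is only a sketch of the shape of it: orin-01, orin-02, and rtx3080-host are placeholder hostnames, and the images and flags are the standard ones from the Ollama and Open WebUI docs, so I may well have something wrong here:

```
# On each Jetson Orin: Ollama inside a container via the NVIDIA container runtime
# (this is the setup that works for me; a native "ollama run" fails on the Orins)
docker run -d --name ollama \
  --runtime nvidia \
  --network host \
  -e OLLAMA_HOST=0.0.0.0 \
  -v ollama:/root/.ollama \
  ollama/ollama

# On the RTX 3080 machine: Open WebUI pointed at all three Ollama endpoints.
# As I understand it, OLLAMA_BASE_URLS (semicolon-separated) is what lets
# Open WebUI spread requests across multiple Ollama backends.
docker run -d --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URLS="http://rtx3080-host:11434;http://orin-01:11434;http://orin-02:11434" \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```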
NOTE: I have OLLAMA_HOST=0.0.0.0 set in the /etc/systemd/system/ollama.service file, which allows Ollama to accept connections from Docker. I think I’m close, but I’m missing something and I’m not exactly sure what it is. Any assistance would be appreciated, and thank you in advance!
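For reference, the relevant part of /etc/systemd/system/ollama.service is just this (other lines omitted):

```
[Service]
# Listen on all interfaces so Docker containers and the other machines can reach Ollama
Environment="OLLAMA_HOST=0.0.0.0"
```

After editing it, I run sudo systemctl daemon-reload and sudo systemctl restart ollama so the change takes effect.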
Regards,
Jason Tutwiler