I know that NVLink isn’t available on the Jetson platform, but I’ve read that I can load balance models using Ollama. I’ve also looked into Kubernetes GPU passthrough, but that doesn’t seem to work or even do what I want. Ideally, I want my Jetson Orin 64GB machines to act as Ollama helper nodes for my main system running an RTX 3080.

I’ve been experimenting with running open-webui and adding a connection to one (or both) of my Orin systems. However, that seems to simply put all of the load on the Orin, whereas I want Ollama to actually load balance. Unfortunately, running Ollama natively (outside of Docker) on my Orins doesn’t quite work when doing a simple ollama run modelname:latest, but it does run natively on my RTX 3080 system, which is why open-webui works wonderfully there. Ideally, I should be able to run Ollama on all three systems, run open-webui on the RTX 3080 only, and have all three machines share the Ollama workload. Any assistance would be greatly appreciated.

Below is example code that I try, but it’s running in Docker while I run Ollama natively on all machines. I would be curious to understand why Ollama fails to load a model natively on the Orins but runs without issue in a Docker container. The NVIDIA container abstraction is interesting, but I don’t fully understand it. Below is an example of what I attempt in order to achieve my goal (change hostnames as appropriate).
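Roughly, the pieces look something like this. This is only a sketch of the shape of it: orin-01, orin-02, and rtx3080-host are placeholder hostnames, and the images and flags are the standard ones from the Ollama and Open WebUI docs, so I may well have something wrong here:

```
# On each Jetson Orin: Ollama inside a container via the NVIDIA container runtime
# (this is the setup that works for me; a native "ollama run" fails on the Orins)
docker run -d --name ollama \
  --runtime nvidia \
  --network host \
  -e OLLAMA_HOST=0.0.0.0 \
  -v ollama:/root/.ollama \
  ollama/ollama

# On the RTX 3080 machine: Open WebUI pointed at all three Ollama endpoints.
# As I understand it, OLLAMA_BASE_URLS (semicolon-separated) is what lets
# Open WebUI spread requests across multiple Ollama backends.
docker run -d --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URLS="http://rtx3080-host:11434;http://orin-01:11434;http://orin-02:11434" \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```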
NOTE: I have OLLAMA_HOST=0.0.0.0 set in the /etc/systemd/system/ollama.service file, which allows Ollama to accept connections from Docker. I think I’m close, but I’m missing something and I’m not exactly sure what it is. Any assistance would be appreciated, and thank you in advance!
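For reference, the relevant part of /etc/systemd/system/ollama.service is just this (other lines omitted):

```
[Service]
# Listen on all interfaces so Docker containers and the other machines can reach Ollama
Environment="OLLAMA_HOST=0.0.0.0"
```

After editing it, I run sudo systemctl daemon-reload and sudo systemctl restart ollama so the change takes effect.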
Regards,
Jason Tutwiler