I would like to add remote access to Ollama, and I’m trying to understand how to set the environment variables that allow cross-origin access. I tried sudo systemctl edit ollama, per the playbook, but I get “No files found for ollama.service”.
Ollama is located at /snap/bin/ollama
Also, where are the models and Modelfiles stored, and where is Ollama’s home directory?
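For reference, this is what I understood the playbook to be doing (the override from the Ollama FAQ, which assumes a non-snap install), plus my best guess at an equivalent for the snap binary - happy to be corrected:

```bash
# What the playbook assumes: an install that registers ollama.service, so that
#   sudo systemctl edit ollama
# lets you add an override like:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
#   Environment="OLLAMA_ORIGINS=*"
#
# With the snap binary there is no ollama.service, so my guess is the same
# variables have to be set on whatever runs the server, e.g. a quick manual test:
OLLAMA_HOST=0.0.0.0 OLLAMA_ORIGINS="*" /snap/bin/ollama serve
```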
Do yourself a service and don’t use Ollama. Use llama.cpp or LM Studio (if you prefer a GUI) instead. Ollama underperforms on Spark and introduces unnecessary complexity compared to llama.cpp.
@eugr Do you have a good guide that shows how to use llama.cpp on Spark with cu130? I’ve been looking for one, but no luck. Thanks in advance.
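For context, the only thing I have to go on is the generic CUDA build from the llama.cpp docs (nothing Spark- or cu130-specific here; it just assumes the CUDA toolkit is already on the box):

```bash
# Generic llama.cpp CUDA build from the upstream docs - not Spark-specific.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# Binaries (llama-server, llama-cli, ...) end up in build/bin/
```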
If you use Open WebUI with Ollama, you can easily connect it to Tailscale for free and access your Open WebUI/Ollama models from anywhere. It’s really easy to set up. With the DGX Spark hosting my larger LLMs and my 5090 handling the smaller ones, I also combined the two Ollama servers into the same Open WebUI instance; again, a really easy setup if you ask any of the AI assistants for help.
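A minimal sketch of what that looks like, in case it helps someone (the image name and OLLAMA_BASE_URL come from the Open WebUI docs; the Tailscale hostname is a made-up placeholder):

```bash
# Run Open WebUI and point it at an Ollama server reachable over Tailscale.
# Replace spark.your-tailnet.ts.net with your own tailnet hostname.
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://spark.your-tailnet.ts.net:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
# A second Ollama endpoint (the 5090 box, in my case) can be added afterwards
# under Admin Panel -> Settings -> Connections.
```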
Hi - thanks for all your posts; I’ve been reading your updates on different inference engines with interest. I’m trying to decide whether to prioritise llama.cpp or vLLM so that I can stick with one model format and stack for a bit.
For LLMs I want a local chat UI and API access - not a production scenario, but I want the best (fastest and highest-quality) performance without too many constraints, and to avoid downloading and managing multiple formats of the same model.
So I know this is a bit of an ‘it depends’ kind of question, but I’d love to hear your current point of view on it.
If you are the only user, I’d stick with llama.cpp - it will give you the best single-user performance without too much overhead. It will also give you a bigger variety of quants to choose from (e.g. Q6_K_XL). It can support multi-user too in a pinch.
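To make that concrete, a single-model launch looks something like this (the model path and quant are just placeholders) - llama-server gives you both a built-in web UI and an OpenAI-compatible API on the same port:

```bash
# Serve one GGUF with llama-server; the model path is a placeholder.
./build/bin/llama-server \
  -m /models/Qwen3-30B-A3B-Q6_K_XL.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99
# Web UI at http://<host>:8080, OpenAI-compatible API at http://<host>:8080/v1
```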
The only downside is that support for new models may take some time, especially for a new architecture. Some models get day-1 support, some take months (Qwen3-Next, Qwen3-VL). But you can use vLLM for those if needed.
My current personal setup is llama.cpp with llama-swap as a proxy to load models on demand. What’s great about llama-swap is that you can also plug vLLM or any other inference engine into it if needed.
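Roughly (from memory - check the llama-swap README for the exact schema; model names and paths below are placeholders), the config looks like this:

```bash
# Hypothetical llama-swap config: one entry per model, each with the command
# that launches it. llama-swap substitutes ${PORT} and starts/stops servers
# on demand based on the model name in incoming API requests.
mkdir -p ~/llama-swap
cat > ~/llama-swap/config.yaml <<'EOF'
models:
  "qwen3-30b":
    cmd: |
      /path/to/llama-server --port ${PORT}
      -m /models/Qwen3-30B-A3B-Q6_K_XL.gguf -ngl 99
  "gpt-oss-120b":
    cmd: |
      /path/to/llama-server --port ${PORT}
      -m /models/gpt-oss-120b.gguf -ngl 99
EOF
# Flag names from memory; clients then talk to llama-swap's port instead of
# a specific llama-server instance.
llama-swap --config ~/llama-swap/config.yaml --listen :8080
```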