Introducing Ollama Support for Jetson Devices

Ollama on Jetson is Here!

I am pleased to announce that Ollama now works on Jetson devices, with a minor caveat:

  • The Linux ARM64 binary available for download from their site (and installed by their install script) doesn’t currently work due to incompatibilities in how it’s compiled; we’re working on getting that resolved.

What is Ollama?

Ollama allows you to run LLMs almost anywhere, using llama.cpp as the backend, and provides a CLI front-end client as well as an API. It supports the standard OpenAI API and is compatible with most tools. It is written mostly in Go, with some CGo hooks to load the backend and the GPU drivers. They currently support Windows (native), Windows (WSL), Apple (Metal), and Linux (x64 and ARM64).
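For example, since the server listens on port 11434 by default, a plain curl call can hit the OpenAI-compatible endpoint. This is a minimal sketch; it assumes a server is already running locally and that the mistral model has been pulled:

```shell
# Query a local Ollama server through its OpenAI-compatible endpoint.
# Assumes the server is running on the default port and mistral is pulled.
REQUEST='{"model": "mistral", "messages": [{"role": "user", "content": "Hello"}]}'
curl -s http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$REQUEST" || echo "Ollama server not reachable"
```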

They have a built-in tool for downloading and customizing LLMs such as Llama 2, Mistral, and OpenHermes. See their GitHub for more information.
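As a quick sketch of that model-management workflow (the model name is just an example; the commands assume the ollama binary is on your PATH and a server is running):

```shell
# Download, run, and list models with Ollama's built-in model manager.
MODEL=mistral   # example choice; any tag from the Ollama library works
if command -v ollama >/dev/null 2>&1; then
  ollama pull "$MODEL"          # download the model weights
  ollama run "$MODEL" "Hello"   # one-shot prompt from the CLI
  ollama list                   # show downloaded models
else
  echo "ollama binary not found; install it first"
fi
```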

In the Meantime

  • @dusty_nv has graciously included a container build in his GitHub repo jetson-containers. There are currently containers for JetPack 5 and JetPack 6; see the repo for more information.
  • If you clone the Ollama repo and build the binary yourself, that binary works fantastically. NOTE: I normally recommend bypassing the generic CPU build by setting OLLAMA_SKIP_CPU_GENERATE=1; however, the build currently fails with that set due to a bug, so make sure it isn’t set for now. Ensure your CMake is up to date, gcc-10 and g++-10 are installed, and Golang is installed. More information is available in their repo.
  • CMake 3.22.1
  • Golang 1.22.1
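For reference, the source build boils down to roughly the following. This is a hedged sketch, not the authoritative steps; check the Ollama repo if it doesn’t match your checkout:

```shell
# Sketch of building the Ollama binary from source on Jetson.
# Assumes the ollama repo is already cloned and the prerequisites
# above (CMake, gcc-10/g++-10, Golang) are installed.
build_ollama() {
  unset OLLAMA_SKIP_CPU_GENERATE   # must stay unset until the build bug is fixed
  export CC=gcc-10 CXX=g++-10      # assumption: select the gcc-10 toolchain
  go generate ./...                # compiles the llama.cpp backend
  go build .                       # produces the ./ollama binary
}
# Run inside the cloned ollama/ directory:
# build_ollama
```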

Future plans

  • Update the docs on Jetson-containers
  • Enable tensorcore support (currently the build forces MMQ)
  • Enable CUDA FP16 (currently the build forces it to be disabled for compatibility)
  • Find and fix the compilation bug so the downloadable binary works on Jetson

Disclaimer: I am a contributor to the Ollama Github repo, but I am not an official member of their team and do not represent them in any official capacity.

5 Likes

Awesome work @remy415, thank you so much for all your contributions! Ollama support is great for the community to have, and it makes getting started easier. Here are the container images for reference:

dustynv/ollama:r35.4.1
dustynv/ollama:r36.2.0

You can automatically run these like this:

# get the container tools
git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh

# run the ollama server locally
jetson-containers run $(autotag ollama)

# in another terminal, run the ollama client
jetson-containers run $(autotag ollama) ollama run mistral

(demo video: ollama_20240411_short)

We will also have to make a tutorial on jetson-ai-lab.com for it. While Ollama doesn’t have the most optimized performance available because it uses llama.cpp underneath (you can expect about half of what is on the Benchmarks page), it may be good enough for text LLMs and is exceedingly easy to get started with 👍

4 Likes

Nice! I ran OpenWebUI on my host and connected it to Ollama running on Jetson. It’s really easy to try so many LLMs!


1 Like

Nice @tokada1 ! I would also be interested in adding a Jetson container for OpenWebUI or similar.

@dusty_nv Open WebUI does provide an arm64 container image, and it runs fine on Jetson AGX Orin:
docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main
If it’s better to use the l4t base image, I guess it would be easy to do.

1 Like

Awesome! I was able to host this web UI server on my Jetson without issue, and it connected to the local Ollama instance.

It’s not necessary, because the web UI server doesn’t need the GPU (only Ollama does, which we rebuilt against the CUDA base container). So it is very easy to use, since Open WebUI already provides arm64 images 👍

If I wanted to add .gguf LLMs to Ollama how would I do this?

Not sure where to put the Modelfile based on this tutorial for setting up Ollama.

Hello,

Ollama documentation has a guide for doing this here.

Ollama works by having its binary do two things:

  1. It runs in the background to manage requests and start servers
    ollama serve, the ollama container, or through a service (e.g. a systemd daemon, or a Windows/macOS daemon)

  2. It’s run on the command line to execute tasks:
    ollama run mistral
    ollama create <my model>

If you are using Ollama purely through containers, adding extra files to the mix can be a little confusing. The container mounts the data folder at /data/. Follow the instructions in the import tutorial I linked above: when it says to create a Modelfile and run ollama create example -f Modelfile, save your Modelfile as data/models/ollama/Modelfile. Then, when you execute the ollama create command, you’ll need to do it as a docker run command: docker run --runtime nvidia -it --rm --network=host dustynv/ollama:r35.4.1

ollama create example -f Modelfile = docker run --runtime nvidia -it --rm --network=host dustynv/ollama:r35.4.1 ollama create example -f /data/models/ollama/Modelfile

ollama run example = docker run --runtime nvidia -it --rm --network=host dustynv/ollama:r35.4.1 ollama run example

Alternatively, since the “client” portion of Ollama doesn’t require CUDA acceleration, you can download the binary directly from their site and run it outside of Docker (the binary works now, it wasn’t working previously).

In this scenario, you would just run the commands as-is, substituting /PATH/TO/jetson-containers/data/models/ollama for any references to /usr/share/ollama/.ollama

ollama create example -f /PATH/TO/jetson-containers/data/models/ollama/Modelfile
ollama run example
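For reference, a minimal Modelfile for importing a local GGUF might be written like this. The model filename and the parameter are hypothetical placeholders; adjust the paths to your setup:

```shell
# Write a minimal Modelfile for importing a local GGUF model.
# The model filename below is a placeholder; point FROM at your own .gguf.
mkdir -p data/models/ollama
cat > data/models/ollama/Modelfile <<'EOF'
FROM ./my-model.gguf
PARAMETER temperature 0.7
EOF
cat data/models/ollama/Modelfile
```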

Thanks for the info; my issue was not understanding the following:

(1) save your Modelfile in data/models/ollama/Modelfile

(2) doing the Jetson-containers command like the following:

jetson-containers run $(autotag ollama) ollama create example -f /data/models/ollama/Modelfile

Doing that, boom, LLM added.

1 Like

Awesome! I’m glad you were able to get it to work.

1 Like

Thanks @remy415, I merged your PR from today and rebuilt/tested/pushed the containers. Also put up a page for Ollama on Jetson AI Lab (along with the OpenWebUI from @tokada1)

3 Likes

Thanks Dusty! In my PR, I added an environment variable to set the models path (OLLAMA_MODELS=/data/models/ollama/models) and updated the -v mounts in the ollama serve command to remove the extra mounting, since the data directory is already mounted at /data. The commands in the guide need to be updated to reflect that, as it won’t be looking in /root/.ollama anymore.
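For anyone running the binary natively instead of in the container, the same redirection can be sketched like this (the path assumes the default jetson-containers layout in the current directory; it is an example, not gospel):

```shell
# Point a natively-run Ollama server at the jetson-containers model directory
# instead of the default ~/.ollama location.
export OLLAMA_MODELS="$PWD/jetson-containers/data/models/ollama/models"
mkdir -p "$OLLAMA_MODELS"
# ollama serve   # would now store and look up models under $OLLAMA_MODELS
echo "$OLLAMA_MODELS"
```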

Also, the binary downloaded straight from the site now works again with CUDA acceleration, and the OLLAMA_SKIP_CPU_GENERATE flag works properly now, so adding it to the Dockerfile should improve build times by a few minutes, as long as you don’t mind losing the ability to fall back to the CPU when the GPU is out of memory.

I submitted a PR to Ollama to add a flag that supports custom GPU definitions for CMake when compiling llama.cpp. It’s in final review and should be merged today or tomorrow. I played around with the flags: setting the CUDA architecture to 87, enabling tensor cores, and enabling cuda_f16c did nothing to change performance. I tested with a static seed and prompt using Mistral 7B, and the two builds performed nearly identically, with a 20 s generation time ±1 s over 10 runs. I need to research leveraging tensor cores in llama.cpp further; I would welcome any pointers.

1 Like

Are you testing on an AGX Orin or an Orin Nano?

@jamesnajera I have 4x Orin Nano 8GB in a Turing Pi 2 carrier board. The board is unfortunately not compatible with the AGX Orin.

1 Like

Oh right! I saw that you added OLLAMA_MODELS to the env but didn’t make the connection (probably because I was confused by still seeing files cached there, when in fact they were left over from a previous download). So I removed the cache folders on my host device, then tried this standalone docker command, which overrides $OLLAMA_MODELS:

docker run --runtime nvidia -d --rm --network=host -v ~/ollama:/ollama -e OLLAMA_MODELS=/ollama dustynv/ollama:r36.2.0

That worked as expected, and I just updated the docs on GitHub and Jetson AI Lab with it 👍

Since you have it working building from source in your Dockerfile, I would like to keep it that way (for now at least) while the ARM64+CUDA support settles down upstream. And then eventually we can keep around a separate ‘builder’ dockerfile to fall back to in the event that CUDA-related issues pop up in the upstream repo.

Historically I had just built llama.cpp with -DLLAMA_CUBLAS=on -DLLAMA_CUDA_F16=1 (like here) and that did the job, but it looks like they added newer flags which may warrant further investigation and profiling:

That said, llama.cpp and Ollama are “good enough” performance-wise for typical LLM chat, so there’s no need to spend too much time digging into it (also considering the pace at which llama.cpp changes/breaks).

From what I can tell, LLAMA_CUBLAS is deprecated in favor of LLAMA_CUDA. The F16 flag is still relevant. Neither turning on F16 nor enabling tensor core support (turning off MMQ) made a difference greater than 10% across 10 runs with the same seed and prompt.

I have also been digging through llama.cpp for Arm SIMD support, since AVX isn’t present, though on a CUDA iGPU this is somewhat moot.

It seems like LLAMA_LLAMAFILE contains code for NEON SIMD support; however, it was recently disabled until they iron out a bug on armv7 platforms, and it should be back again soon.

The other library that contains NEON support is the ggml-metal library, but I hardly think importing the entire Metal ecosystem into Jetson devices just to support NEON is worth it, or that it would even work. I will poke around more when they reactivate sgemm/llama_llamafile.

OK, thanks. No big deal on the CUDA flags then, and yeah, jumping through hoops to enable NEON is pretty moot, since with Jetson’s unified memory you aren’t going to be using CPU offloading (if you run out of GPU memory, you’re out of system memory too). Sounds like the llama.cpp perf is what it is.

1 Like

I added this /start_ollama script to the ollama container, which is now the default CMD. It starts the ollama server in the background and then returns control to the user, so they can run the chat client from the same terminal (without needing to start another container, etc.)

The docs have been updated to reflect this:

1 Like

@remy415 well, maybe off-topic, but I tried to run Ollama natively on AGX Orin with the JetPack 6.0 GA release. It got stuck, and the log shows:

...
May 03 20:31:59 orin6 ollama[3079]: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED
May 03 20:31:59 orin6 ollama[3079]:   current device: 0, in function ggml_cuda_mul_mat_batched_cublas at /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:1848
May 03 20:31:59 orin6 ollama[3079]:   cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), C>
May 03 20:31:59 orin6 ollama[3079]: GGML_ASSERT: /go/src/github.com/ollama/ollama/llm/llama.cpp/ggml-cuda.cu:60: !"CUDA error"
May 03 20:31:59 orin6 ollama[3079]: Could not attach to process.  If your uid matches the uid of the target
May 03 20:31:59 orin6 ollama[3079]: process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
May 03 20:31:59 orin6 ollama[3079]: again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
May 03 20:31:59 orin6 ollama[3079]: ptrace: Inappropriate ioctl for device.
May 03 20:31:59 orin6 ollama[3079]: No stack.
May 03 20:31:59 orin6 ollama[3079]: The program is not being run.

Ollama runs natively on JetPack 5.1.3.

I installed Ollama with: curl -fsSL https://ollama.com/install.sh | sh

I’ll try to debug it, but if you have any idea, please let me know.