Hi all,
When running gemma3:4b (or other models larger than 1B parameters) I can see that the GPU is barely used, and inference is very slow as a result. On the other hand, smaller models with 1B parameters such as gemma3:1b run well and make good use of the GPU, but they are not very accurate.
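For context, this is how I am watching GPU load while the model generates (tegrastats ships with JetPack; jtop comes from the jetson-stats package):

# GR3D_FREQ is the GPU utilization field; if it sits near 0% while
# tokens are being generated, the layers are running on the CPU
sudo tegrastats --interval 1000

# alternatively, with jetson-stats installed (sudo pip3 install jetson-stats):
jtop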
I believe I have followed all the steps in the 🚀 Initial Setup Guide - Jetson Orin Nano on the NVIDIA Jetson AI Lab, and in the Ollama tutorial there, to set up the Nano and run Ollama with Gemma and other LLMs.
I have also followed the demo at https://www.youtube.com/watch?v=jSKHeYVcAB8 and checked the accompanying repo, https://github.com/asierarranz/Google_Gemma_DevDay (the three demos from Google Gemma2 DevDay Tokyo, running Gemma2 on a Jetson Orin Nano). I have increased swap using the script in the repo and enabled the MAXN_SUPER power mode.
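Concretely, those tuning steps amounted to roughly the following (the 16 GB swap size follows the repo's script, and the nvpmodel mode index for MAXN_SUPER is an assumption that can differ between JetPack releases, so verify against /etc/nvpmodel.conf):

# create and enable a 16 GB swap file
sudo fallocate -l 16G /mnt/16GB.swap
sudo chmod 600 /mnt/16GB.swap
sudo mkswap /mnt/16GB.swap
sudo swapon /mnt/16GB.swap

# query the current/available power modes, then select MAXN_SUPER
# (mode 2 is an assumption; check /etc/nvpmodel.conf on your image)
sudo nvpmodel -q --verbose
sudo nvpmodel -m 2
sudo jetson_clocks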
I have also followed the steps here:
The docker command below also works, but with the 4B model inference is still very slow: I see only occasional GPU spikes, whereas with the smaller models the GPU is busy the whole time.
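In case it is useful, this is how I am checking whether the model actually got offloaded (standard Ollama CLI inside the container, plus the server log that the volume mount exposes on the host; <container> is a placeholder for the container name or ID):

# the PROCESSOR column shows the CPU/GPU split of the loaded model
docker exec -it <container> ollama ps

# llama.cpp logs an "offloaded N/M layers to GPU" line at load time
grep -i "offload" /mnt/nvme/cache/ollama/ollama.log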
Could anyone help? Thx!
# dusty-nv's Ollama image for JetPack r36.4, serving on port 9000,
# with the model and Hugging Face caches kept on the NVMe drive
docker run -it --rm \
-e OLLAMA_MODEL=gemma3:4b \
-e OLLAMA_MODELS=/root/.ollama \
-e OLLAMA_HOST=0.0.0.0:9000 \
-e OLLAMA_CONTEXT_LEN=4096 \
-e OLLAMA_LOGS=/root/.ollama/ollama.log \
-v /mnt/nvme/cache/ollama:/root/.ollama \
--gpus all \
-p 9000:9000 \
-e DOCKER_PULL=always --pull always \
-e HF_TOKEN=${HF_TOKEN} \
-e HF_HUB_CACHE=/root/.cache/huggingface \
-v /mnt/nvme/cache:/root/.cache \
dustynv/ollama:main-r36.4.0
main-r36.4.0: Pulling from dustynv/ollama
Digest: sha256:64a9e1ac0fe5b0fd7715c6c7457c340844fe05bb5f89245aaf781b07a0af1c82
Status: Image is up to date for dustynv/ollama:main-r36.4.0
Starting ollama server
OLLAMA_HOST 0.0.0.0:9000
OLLAMA_LOGS /root/.ollama/ollama.log
OLLAMA_MODELS /root/.ollama
Loading model gemma3:4b ...
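For reference, this is how I am timing generations against the server (plain Ollama REST API; durations in the response are in nanoseconds):

curl http://localhost:9000/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
# the JSON response includes eval_count (tokens generated) and
# eval_duration (ns), so tokens/sec = eval_count / eval_duration * 1e9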