Gemma3:4b not using the GPU while gemma3:1b does on Jetson Orin Nano Super

Hi all,

When running gemma3:4b (or other models larger than 1b parameters) I can see that the GPU is barely used, and inference is very slow as a result. Smaller 1b-parameter models such as gemma3:1b, on the other hand, run well and make good use of the GPU, but they are not very accurate.
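
For reference, this is roughly how I am checking GPU usage (ollama ps and tegrastats are the standard tools; jtop from jetson-stats also works):

# Show where Ollama placed the loaded model; the PROCESSOR column
# reports the CPU/GPU split (e.g. "100% GPU" vs. "48%/52% CPU/GPU").
ollama ps

# Watch live GPU load on the Jetson; GR3D_FREQ is the GPU utilization.
sudo tegrastats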

I believe I have followed all the steps in the 🚀 Initial Setup Guide - Jetson Orin Nano (NVIDIA Jetson AI Lab) and in the Ollama tutorial to set up the Nano and run Ollama with Gemma and other LLMs.

I have also followed the demo https://www.youtube.com/watch?v=jSKHeYVcAB8 and checked the asierarranz/Google_Gemma_DevDay repo on GitHub. I have increased swap using the script in the repo and enabled the MAXN_SUPER power mode.
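
For completeness, the power mode and swap can be double-checked with the standard JetPack/Ubuntu commands:

# Query the active power mode; it should report MAXN_SUPER as current.
sudo nvpmodel -q

# Confirm the enlarged swap is active.
swapon --show
free -h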

I have also followed the steps here:

The command below also runs fine, but when I use the model, inference is still very slow. I see only occasional GPU spikes, whereas with the smaller models the GPU is used constantly.

Could anyone help? Thx!

docker run -it --rm \
  -e OLLAMA_MODEL=gemma3:4b \
  -e OLLAMA_MODELS=/root/.ollama \
  -e OLLAMA_HOST=0.0.0.0:9000 \
  -e OLLAMA_CONTEXT_LEN=4096 \
  -e OLLAMA_LOGS=/root/.ollama/ollama.log \
  -v /mnt/nvme/cache/ollama:/root/.ollama \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_TOKEN=${HF_TOKEN} \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  dustynv/ollama:main-r36.4.0
main-r36.4.0: Pulling from dustynv/ollama
Digest: sha256:64a9e1ac0fe5b0fd7715c6c7457c340844fe05bb5f89245aaf781b07a0af1c82
Status: Image is up to date for dustynv/ollama:main-r36.4.0

Starting ollama server


OLLAMA_HOST   0.0.0.0:9000
OLLAMA_LOGS   /root/.ollama/ollama.log
OLLAMA_MODELS /root/.ollama

Loading model gemma3:4b ...
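
One thing worth inspecting is the server log, which with the volume mounts above lands on the host at /mnt/nvme/cache/ollama/ollama.log. I am assuming this build prints the usual llama.cpp-style offload lines; a partial offload (e.g. "offloaded 20/35 layers to GPU") would mean part of the model runs on the CPU, which would explain the slowdown:

# Look for the layer-offload report written when the model loads.
grep -i "offloaded" /mnt/nvme/cache/ollama/ollama.log

# Or follow the log live while the model is loading.
tail -f /mnt/nvme/cache/ollama/ollama.log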

Replying to my own post in case someone else has a similar problem.

Updating to the latest Ollama version (0.9.0) with the command below seems to have fixed it. Larger models with 4b parameters now use the GPU well too.

curl -fsSL https://ollama.com/install.sh | sh
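
A quick sanity check after the update (standard Ollama CLI output):

# Confirm the installed version.
ollama -v

# Load the model, then check from another terminal that the
# PROCESSOR column reads "100% GPU" rather than a CPU/GPU split.
ollama run gemma3:4b
ollama ps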