Gemma3:4b not using the gpu while gemma3:1b does on orin Jetson Nano super

gabrielet85 · May 28, 2025, 1:48pm

Hi all,

When running ollama3:4b (or other models larger than 1b parameters) I can see that the gpu is not being used much and inference is very slow as a result of this. On the other hand smaller models with 1b parameters such as ollama3:1b work well and make good use of the gpu, but they are not very accurate.

I believe have followed all the steps in 🚀 Initial Setup Guide - Jetson Orin Nano - NVIDIA Jetson AI Lab and in the llama tutorial to set up the nano and run ollama with gemma and other LLMs.

I have also followed the demo https://www.youtube.com/watch?v=jSKHeYVcAB8 and checked the repo GitHub - asierarranz/Google_Gemma_DevDay: This repo has the code of the 3 demos I presented at Google Gemma2 DevDay Tokyo, using Gemma2 on a Jetson Orin Nano device. . I have increased swap using the script in the repo and set up MAXN_SUPER.

I have also followed the steps here:

The code below also works but when I run the model it is still very slow. I see occasional GPU spikes, whereas with the smaller models it is constantly being used.

Could anyone help? Thx!

docker run -it --rm \
  -e OLLAMA_MODEL=gemma3:4b \
  -e OLLAMA_MODELS=/root/.ollama \
  -e OLLAMA_HOST=0.0.0.0:9000 \
  -e OLLAMA_CONTEXT_LEN=4096 \
  -e OLLAMA_LOGS=/root/.ollama/ollama.log \
  -v /mnt/nvme/cache/ollama:/root/.ollama \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_TOKEN=${HF_TOKEN} \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  dustynv/ollama:main-r36.4.0
main-r36.4.0: Pulling from dustynv/ollama
Digest: sha256:64a9e1ac0fe5b0fd7715c6c7457c340844fe05bb5f89245aaf781b07a0af1c82
Status: Image is up to date for dustynv/ollama:main-r36.4.0

Starting ollama server


OLLAMA_HOST   0.0.0.0:9000
OLLAMA_LOGS   /root/.ollama/ollama.log
OLLAMA_MODELS /root/.ollama

Loading model gemma3:4b ...

gabrielet85 · June 2, 2025, 8:34am

Replying to my own post if someone else has a similar problem.

Updating to the latest ollama version (0.9.0) with the line below should have fixed it somehow. Now also larger models with 4b parameters use the gpu well

curl -fsSL https://ollama.com/install.sh | sh

Topic		Replies	Views
Jetson Orin Nano Super: Error Running Gemma 3 4B Model Jetson Orin Nano generative_ai	8	418	April 2, 2025
Jetson Orin Nano Super Dev Kit Performance Jetson Orin Nano cudnn , gemma-2-9b-it , llama-31-8b-instruct , llama	6	663	January 28, 2025
Jetson orin nano local small models perform insanely slow Jetson Orin Nano generative_ai	2	639	June 6, 2024
Running llama3.3 or llama4 on Jetson AGX Orin Developer Kit (64 GB) Jetson AGX Orin generative_ai	8	193	May 12, 2025
Which container - jetson container or nvidia container? Jetson Orin Nano containers , generative_ai	9	67	March 13, 2025
Ollama timing out when attempting to use GPU instead of CPU Jetson AGX Orin cuda , jetson-inference , generative_ai	9	4625	August 27, 2024
Is the Jetson Nano Developer Kit capable of loading LLMs like LLaMA 3? Jetson Nano generative_ai , llama	6	1693	January 16, 2025
Free up more RAM for Ollama (Jetson Orin Nano Super) Jetson Orin Nano generative_ai	7	164	May 21, 2025
Ollama with nemotron-mini Jetson Orin Nano generative_ai	5	129	March 6, 2025
Ollama and Jetson issue Jetson Orin NX jetson-inference , generative_ai	12	5441	March 20, 2024

Gemma3:4b not using the gpu while gemma3:1b does on orin Jetson Nano super

Related topics