No luck with Gemma 4 on Jetson Nano Super

Hi guys - Are you running Gemma 4 locally on the Jetson Nano Super (JP 6.2.2)? This is the container i used:

docker run --rm -it --runtime nvidia --network host
-v ~/.cache/huggingface:/root/.cache/huggingface
ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin

But when i try to serve the model, this is the models i used: Gemma 4 - a unsloth Collection . This is the error message:

jcm@ubuntu:~$ docker run --rm -it --runtime nvidia --network host
-v ~/.cache/huggingface:/root/.cache/huggingface

  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-orin

root@ubuntu:/# vllm serve unsloth/gemma-4-E4B-it-unsloth-bnb-4bit
/opt/venv/lib/python3.10/site-packages/transformers/utils/hub.py:110: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
(APIServer pid=22) INFO 04-04 21:42:58 [utils.py:293]
(APIServer pid=22) INFO 04-04 21:42:58 [utils.py:293] █ █ █▄ ▄█
(APIServer pid=22) INFO 04-04 21:42:58 [utils.py:293] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.16.0rc2.dev479+g15d76f74e.d20260226
(APIServer pid=22) INFO 04-04 21:42:58 [utils.py:293] █▄█▀ █ █ █ █ model unsloth/gemma-4-E4B-it-unsloth-bnb-4bit
(APIServer pid=22) INFO 04-04 21:42:58 [utils.py:293] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=22) INFO 04-04 21:42:58 [utils.py:293]
(APIServer pid=22) INFO 04-04 21:42:58 [utils.py:229] non-default args: {‘model_tag’: ‘unsloth/gemma-4-E4B-it-unsloth-bnb-4bit’, ‘model’: ‘unsloth/gemma-4-E4B-it-unsloth-bnb-4bit’}
config.json: 6.41kB [00:00, 6.01MB/s]
(APIServer pid=22) Traceback (most recent call last):
(APIServer pid=22) File “/opt/venv/bin/vllm”, line 10, in
(APIServer pid=22) sys.exit(main())
(APIServer pid=22) File “/opt/venv/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py”, line 73, in main
(APIServer pid=22) args.dispatch_function(args)
(APIServer pid=22) File “/opt/venv/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py”, line 112, in cmd
(APIServer pid=22) uvloop.run(run_server(args))
(APIServer pid=22) File “/opt/venv/lib/python3.10/site-packages/uvloop/init.py”, line 69, in run
(APIServer pid=22) return loop.run_until_complete(wrapper())
(APIServer pid=22) File “uvloop/loop.pyx”, line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=22) File “/opt/venv/lib/python3.10/site-packages/uvloop/init.py”, line 48, in wrapper
(APIServer pid=22) return await main
(APIServer pid=22) File “/opt/venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py”, line 471, in run_server
(APIServer pid=22) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=22) File “/opt/venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py”, line 490, in run_server_worker
(APIServer pid=22) async with build_async_engine_client(
(APIServer pid=22) File “/root/.local/share/uv/python/cpython-3.10-linux-aarch64-gnu/lib/python3.10/contextlib.py”, line 199, in aenter
(APIServer pid=22) return await anext(self.gen)
(APIServer pid=22) File “/opt/venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py”, line 96, in build_async_engine_client
(APIServer pid=22) async with build_async_engine_client_from_engine_args(
(APIServer pid=22) File “/root/.local/share/uv/python/cpython-3.10-linux-aarch64-gnu/lib/python3.10/contextlib.py”, line 199, in aenter
(APIServer pid=22) return await anext(self.gen)
(APIServer pid=22) File “/opt/venv/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py”, line 122, in build_async_engine_client_from_engine_args
(APIServer pid=22) vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=22) File “/opt/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py”, line 1431, in create_engine_config
(APIServer pid=22) model_config = self.create_model_config()
(APIServer pid=22) File “/opt/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py”, line 1283, in create_model_config
(APIServer pid=22) return ModelConfig(
(APIServer pid=22) File “/opt/venv/lib/python3.10/site-packages/pydantic/_internal/_dataclasses.py”, line 121, in init
(APIServer pid=22) s.pydantic_validator.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=22) pydantic_core._pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=22) Value error, The checkpoint you are trying to load has model type gemma4 but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
(APIServer pid=22)
(APIServer pid=22) You can update Transformers with the command pip install --upgrade transformers. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command pip install git+https://github.com/huggingface/transformers.git [type=value_error, input_value=ArgsKwargs((), {‘model’: …rocessor_plugin’: None}), input_type=ArgsKwargs]
(APIServer pid=22) For further information visit Redirecting...

I have the exactly same issue on Jetpack 6.2 Jetson Nano Super, and also expecting any helps. Thanks !

By attaching to the container of current image (ghcr.io/ nvidia-ai-iot/ vllm : latest-jetson-orin), it is found that there is no gemma4 support files under directory /opt/venv/lib/python3.10/site-packages/vllm/model_executor/models/. So, it looks that gemma4 is, in fact, not supported yet in current latest-jetson-orin image. The version of vllm used in that image is 0.16.0rc2.dev479+g15d76f74e.d20260226.cu126.

ok, makes sense. and the llama container?

Hi,

Based on our tutorial below:

Gemma4 E4B model is tested with ghcr.io/nvidia-ai-iot/llama_cpp:gemma4-jetson-orin container.
Could you give it a try?

We are double-check this internally and will get back to you later.
Thanks.

Hi,

We have confirmed that Gemma4 E4B can work on Orin Nano with r36.5 (JetPack 6.2.2).

Please give it a try:

sudo docker run -it --rm --pull always --runtime=nvidia --network host -v $HOME/.cache/huggingface:/root/.cache/huggingface ghcr.io/nvidia-ai-iot/llama_cpp:gemma4-jetson-orin llama-server -hf ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M
...
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv  update_slots: all slots are idle

Thanks.

How did you got to this tutorial in the first place? I tried with the nornal Jetson AI Lab and the search for “GEMMA”, but no luck see:

Hello - what im doing wrong?

jcm@ubuntu:~$ sudo docker run -it --rm --pull always --runtime=nvidia --network host -v $HOME/.cache/huggingface:/root/.cache/huggingface ghcr.io/nvidia-ai-iot/llama_cpp:gemma4-jetson-orin llama-server -hf ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M --port 8081
 
gemma4-jetson-orin: Pulling from nvidia-ai-iot/llama_cpp
Digest: sha256:de16bf712e9614ae5fe6e7230c3807cbe6aba94a949a3aaf2294c2aee1b9ccd1
Status: Image is up to date for ghcr.io/nvidia-ai-iot/llama_cpp:gemma4-jetson-orin

ggml_cuda_init: found 1 CUDA devices (Total VRAM: 7619 MiB):
Device 0: Orin, compute capability 8.7, VMM: yes, VRAM: 7619 MiB
common_download_file_single_online: HEAD failed, status: 404
no remote preset found, skipping
Downloading mmproj-gemma-4-e4b-it-f16.gguf ————————————————————————— 100%
Downloading gemma-4-e4b-it-Q4_K_M.gguf ————————————————————————————— 100%
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
system info: n_threads = 6, n_threads_batch = 6, total_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CUDA : ARCHS = 870 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 

Running without SSL
init: using 8 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model '/data/models/huggingface/models--ggml-org--gemma-4-E4B-it-GGUF/snapshots/6b352c53e1d2e4bb974d9f8cafcf85887c224219/gemma-4-e4b-it-Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 5533 MiB of device memory vs. 5507 MiB of free device memory
llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 1049 MiB
llama_params_fit_impl: context size reduced from 131072 to 63744 -> need 1052 MiB less memory in total
llama_params_fit_impl: entire model can be fit by reducing context
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 5.78 seconds
llama_model_load_from_file_impl: using device CUDA0 (Orin) (0000:00:00.0) - 5496 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 720 tensors from /data/models/huggingface/models--ggml-org--gemma-4-E4B-it-GGUF/snapshots/6b352c53e1d2e4bb974d9f8cafcf85887c224219/gemma-4-e4b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma4
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 64
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                         general.size_label str              = 7.5B
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                       general.license.link str              = https://ai.google.dev/gemma/docs/gemm...
llama_model_loader: - kv   8:                               general.tags arr[str,1]       = ["any-to-any"]
llama_model_loader: - kv   9:                         gemma4.block_count u32              = 42
llama_model_loader: - kv  10:                      gemma4.context_length u32              = 131072
llama_model_loader: - kv  11:                    gemma4.embedding_length u32              = 2560
llama_model_loader: - kv  12:                 gemma4.feed_forward_length u32              = 10240
llama_model_loader: - kv  13:                gemma4.attention.head_count u32              = 8
llama_model_loader: - kv  14:             gemma4.attention.head_count_kv u32              = 2
llama_model_loader: - kv  15:                      gemma4.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  16:                  gemma4.rope.freq_base_swa f32              = 10000.000000
llama_model_loader: - kv  17:    gemma4.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  18:                gemma4.attention.key_length u32              = 512
llama_model_loader: - kv  19:              gemma4.attention.value_length u32              = 512
llama_model_loader: - kv  20:             gemma4.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  21:            gemma4.attention.sliding_window u32              = 512
llama_model_loader: - kv  22:          gemma4.attention.shared_kv_layers u32              = 18
llama_model_loader: - kv  23:    gemma4.embedding_length_per_layer_input u32              = 256
llama_model_loader: - kv  24:    gemma4.attention.sliding_window_pattern arr[bool,42]     = [true, true, true, true, true, false,...
llama_model_loader: - kv  25:            gemma4.attention.key_length_swa u32              = 256
llama_model_loader: - kv  26:          gemma4.attention.value_length_swa u32              = 256
llama_model_loader: - kv  27:                gemma4.rope.dimension_count u32              = 512
llama_model_loader: - kv  28:            gemma4.rope.dimension_count_swa u32              = 256
llama_model_loader: - kv  29:                       tokenizer.ggml.model str              = gemma4
llama_model_loader: - kv  30:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  31:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  32:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  33:                      tokenizer.ggml.merges arr[str,514906]  = ["\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n", ...
llama_model_loader: - kv  34:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  35:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  36:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  37:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  38:               tokenizer.ggml.mask_token_id u32              = 4
llama_model_loader: - kv  39:                    tokenizer.chat_template str              = {%- macro format_parameters(propertie...
llama_model_loader: - kv  40:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  41:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  42:               general.quantization_version u32              = 2
llama_model_loader: - kv  43:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  339 tensors
llama_model_loader: - type  f16:    1 tensors
llama_model_loader: - type q4_K:  336 tensors
llama_model_loader: - type q6_K:   44 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.95 GiB (5.66 BPW) 
load: 0 unused tokens
load: control-looking token:    212 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load:   - 1 ('<eos>')
load:   - 106 ('<turn|>')
load:   - 212 ('</s>')
load: special tokens cache size = 25
load: token to piece cache size = 1.9445 MB
print_info: arch                  = gemma4
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 131072
print_info: n_embd                = 2560
print_info: n_embd_inp            = 2560
print_info: n_layer               = 42
print_info: n_head                = 8
print_info: n_head_kv             = 2
print_info: n_rot                 = 512
print_info: n_swa                 = 512
print_info: is_swa_any            = 1
print_info: n_embd_head_k         = 512
print_info: n_embd_head_v         = 512
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = [512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024]
print_info: n_embd_v_gqa          = [512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024]
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 1.0e+00
print_info: n_ff                  = 10240
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 1000000.0
print_info: freq_scale_train      = 1
print_info: freq_base_swa         = 10000.0
print_info: freq_scale_swa        = 1
print_info: n_embd_head_k_swa     = 256
print_info: n_embd_head_v_swa     = 256
print_info: n_rot_swa             = 256
print_info: n_ctx_orig_yarn       = 131072
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = E4B
print_info: model params          = 7.52 B
print_info: general.name          = n/a
print_info: vocab type            = SPM
print_info: n_vocab               = 262144
print_info: n_merges              = 0
print_info: BOS token             = 2 '<bos>'
print_info: EOS token             = 1 '<eos>'
print_info: UNK token             = 3 '<unk>'
print_info: PAD token             = 0 '<pad>'
print_info: MASK token            = 4 '<mask>'
print_info: LF token              = 248 '<0x0A>'
print_info: EOG token             = 1 '<eos>'
print_info: EOG token             = 106 '<turn|>'
print_info: EOG token             = 212 '</s>'
print_info: max token length      = 93
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
str: cannot properly format tensor name output with suffix=weight bid=-1 xid=-1
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2868.05 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 3007371008
llama_model_load: error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/data/models/huggingface/models--ggml-org--gemma-4-E4B-it-GGUF/snapshots/6b352c53e1d2e4bb974d9f8cafcf85887c224219/gemma-4-e4b-it-Q4_K_M.gguf'
srv    load_model: failed to load model, '/data/models/huggingface/models--ggml-org--gemma-4-E4B-it-GGUF/snapshots/6b352c53e1d2e4bb974d9f8cafcf85887c224219/gemma-4-e4b-it-Q4_K_M.gguf'
srv    operator(): operator(): cleaning up before exit...
main: exiting due to model loading error
jcm@ubuntu:~$ free -m
               total        used        free      shared  buff/cache   available
Mem:            7619        1621        4887          38        1111        5711
Swap:          65249         110       65139

Hi,

In the screenshot you attached, E4B is listed as the second model under “Google Gemma4”.
You can find the details by clicking the “Details”.

Please also upgrade the system to r36.5 (JetPack 6.2.2) to fix the memory issue.
Thanks.

Hello - got it to run with this version:

sudo docker run -it --rm \
  --runtime=nvidia \
  --network host \
  -v /home/jcm/.cache/huggingface:/root/.cache/huggingface \
  ghcr.io/nvidia-ai-iot/llama_cpp:gemma4-jetson-orin \
  llama-server \
  -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S \
  --no-mmproj \
  -ngl 99 \
  -c 4096 \