Gemma4 e4b on Jetson Orin Nano fails due to CUDA out of memory issue

Hi,

I am running the command below to serve an already downloaded Gemma 4 e4b model. According to the jet-ai-lab reference page, Gemma 4 e4b is supposed to work with the llama.cpp inference engine, but it fails. Is there anything I missed? Has anyone had a similar problem?

sudo docker run -it --rm
--runtime=nvidia
--network host
-v ~/kbot_ws/models:/models

llama-server
-m /models/gemma4-e4b/gemma-4-e4b-it-Q4_K_M.gguf
--mmproj /models/gemma4-e4b/mmproj-gemma-4-e4b-it-f16.gguf
--n-gpu-layers 99
--port 8080

This fails, but the same command works when I add -e CUDA_VISIBLE_DEVICES="" (CPU-only mode).
So it works on the CPU but not with the GPU enabled.

Here is the full error:

-v ~/kbot_ws/models:/models \
Package llama_cpp · GitHub \
llama-server \
-m /models/gemma4-e4b/gemma-4-e4b-it-Q4_K_M.gguf \
--mmproj /models/gemma4-e4b/mmproj-gemma-4-e4b-it-f16.gguf \
--n-gpu-layers 99 \
--port 8080
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 7619 MiB):
Device 0: Orin, compute capability 8.7, VMM: yes, VRAM: 7619 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
system info: n_threads = 6, n_threads_batch = 6, total_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 6 | CUDA : ARCHS = 870 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |

Running without SSL
init: using 8 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/models/gemma4-e4b/gemma-4-e4b-it-Q4_K_M.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 5533 MiB of device memory vs. 6387 MiB of free device memory
llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 169 MiB
llama_params_fit_impl: context size reduced from 131072 to 120064 → need 172 MiB less memory in total
llama_params_fit_impl: entire model can be fit by reducing context
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 2.69 seconds
llama_model_load_from_file_impl: using device CUDA0 (Orin) (0000:00:00.0) - 6375 MiB free
llama_model_loader: loaded meta data with 44 key-value pairs and 720 tensors from /models/gemma4-e4b/gemma-4-e4b-it-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma4
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_k i32 = 64
llama_model_loader: - kv 3: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 4: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 5: general.size_label str = 7.5B
llama_model_loader: - kv 6: general.license str = apache-2.0
llama_model_loader: - kv 7: general.license.link str = https://ai.google.dev/gemma/docs/gemm
llama_model_loader: - kv 8: general.tags arr[str,1] = ["any-to-any"]
llama_model_loader: - kv 9: gemma4.block_count u32 = 42
llama_model_loader: - kv 10: gemma4.context_length u32 = 131072
llama_model_loader: - kv 11: gemma4.embedding_length u32 = 2560
llama_model_loader: - kv 12: gemma4.feed_forward_length u32 = 10240
llama_model_loader: - kv 13: gemma4.attention.head_count u32 = 8
llama_model_loader: - kv 14: gemma4.attention.head_count_kv u32 = 2
llama_model_loader: - kv 15: gemma4.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 16: gemma4.rope.freq_base_swa f32 = 10000.000000
llama_model_loader: - kv 17: gemma4.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 18: gemma4.attention.key_length u32 = 512
llama_model_loader: - kv 19: gemma4.attention.value_length u32 = 512
llama_model_loader: - kv 20: gemma4.final_logit_softcapping f32 = 30.000000
llama_model_loader: - kv 21: gemma4.attention.sliding_window u32 = 512
llama_model_loader: - kv 22: gemma4.attention.shared_kv_layers u32 = 18
llama_model_loader: - kv 23: gemma4.embedding_length_per_layer_input u32 = 256
llama_model_loader: - kv 24: gemma4.attention.sliding_window_pattern arr[bool,42] = [true, true, true, true, true, false,…
llama_model_loader: - kv 25: gemma4.attention.key_length_swa u32 = 256
llama_model_loader: - kv 26: gemma4.attention.value_length_swa u32 = 256
llama_model_loader: - kv 27: gemma4.rope.dimension_count u32 = 512
llama_model_loader: - kv 28: gemma4.rope.dimension_count_swa u32 = 256
llama_model_loader: - kv 29: tokenizer.ggml.model str = gemma4
llama_model_loader: - kv 30: tokenizer.ggml.tokens arr[str,262144] = [“”, “”, “”, “”, …
llama_model_loader: - kv 31: tokenizer.ggml.scores arr[f32,262144] = [-1000.000000, -1000.000000, -1000.00…
llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,262144] = [3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, …
llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,514906] = [“\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n”, …
llama_model_loader: - kv 34: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 36: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 37: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 38: tokenizer.ggml.mask_token_id u32 = 4
llama_model_loader: - kv 39: tokenizer.chat_template str = {%- macro format_parameters(propertie…
llama_model_loader: - kv 40: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 41: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 15
llama_model_loader: - type f32: 339 tensors
llama_model_loader: - type f16: 1 tensors
llama_model_loader: - type q4_K: 336 tensors
llama_model_loader: - type q6_K: 44 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.95 GiB (5.66 BPW)
load: 0 unused tokens
load: control-looking token: 212 ‘’ was not control-type; this is probably a bug in the model. its type will be overridden
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 1 (‘’)
load: - 106 (‘<turn|>’)
load: - 212 (‘’)
load: special tokens cache size = 25
load: token to piece cache size = 1.9445 MB
print_info: arch = gemma4
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2560
print_info: n_embd_inp = 2560
print_info: n_layer = 42
print_info: n_head = 8
print_info: n_head_kv = 2
print_info: n_rot = 512
print_info: n_swa = 512
print_info: is_swa_any = 1
print_info: n_embd_head_k = 512
print_info: n_embd_head_v = 512
print_info: n_gqa = 4
print_info: n_embd_k_gqa = [512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024]
print_info: n_embd_v_gqa = [512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024, 512, 512, 512, 512, 512, 1024]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 1.0e+00
print_info: n_ff = 10240
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: freq_base_swa = 10000.0
print_info: freq_scale_swa = 1
print_info: n_embd_head_k_swa = 256
print_info: n_embd_head_v_swa = 256
print_info: n_rot_swa = 256
print_info: n_ctx_orig_yarn = 131072
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = E4B
print_info: model params = 7.52 B
print_info: general.name = n/a
print_info: vocab type = SPM
print_info: n_vocab = 262144
print_info: n_merges = 0
print_info: BOS token = 2 ‘’
print_info: EOS token = 1 ‘’
print_info: UNK token = 3 ‘’
print_info: PAD token = 0 ‘’
print_info: MASK token = 4 ‘’
print_info: LF token = 248 ‘<0x0A>’
print_info: EOG token = 1 ‘’
print_info: EOG token = 106 ‘<turn|>’
print_info: EOG token = 212 ‘’
print_info: max token length = 93
load_tensors: loading model tensors, this can take a while… (mmap = true, direct_io = false)
str: cannot properly format tensor name output with suffix=weight bid=-1 xid=-1
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
NvMapMemAllocInternalTagged: 1075072515 error 12
NvMapMemHandleAlloc: error 0
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2868.05 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 3007371008
llama_model_load: error loading model: unable to allocate CUDA0 buffer
llama_model_load_from_file_impl: failed to load model
common_init_from_params: failed to load model '/models/gemma4-e4b/gemma-4-e4b-it-Q4_K_M.gguf'
srv load_model: failed to load model, '/models/gemma4-e4b/gemma-4-e4b-it-Q4_K_M.gguf'
srv operator(): operator(): cleaning up before exit…
main: exiting due to model loading error
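
For anyone reading the log above, the numbers from the fitting step can be sanity-checked with a few lines of Python. This is only a rough sketch: the figures are copied from the llama_params_fit_impl and alloc_tensor_range lines, and the helper name shortfall_mib is made up for illustration; it is not part of llama.cpp.

```python
MIB = 1024 * 1024

# Figures copied from the log above (MiB unless noted otherwise).
projected_mib = 5533             # projected device memory use
free_mib = 6387                  # free device memory seen by the fitter
target_free_mib = 1024           # free-memory target the fitter tries to keep
failed_alloc_bytes = 3007371008  # size of the CUDA0 buffer that failed to allocate


def shortfall_mib(projected, free, target):
    """MiB that must be trimmed so that `target` MiB stays free after loading.

    Hypothetical helper for illustration only.
    """
    return max(0, projected + target - free)


headroom = free_mib - projected_mib
shortfall = shortfall_mib(projected_mib, free_mib, target_free_mib)
failed_alloc_mib = failed_alloc_bytes / MIB

print(f"headroom before fitting: {headroom} MiB")
print(f"shortfall vs. target:    {shortfall} MiB")
print(f"failed allocation:       {failed_alloc_mib:.2f} MiB")
```

The computed shortfall is 170 MiB, while the log reports 169 MiB; the difference is presumably rounding of the underlying byte counts. What stands out is that the failed buffer (about 2868 MiB) is much smaller than the 6375 MiB the loader reported as free on CUDA0, so the fitter's estimate alone does not explain the failure. The NvMap "error 12" lines would match Linux ENOMEM if NvMap reports errno values, which is an assumption here.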

Hi,

Could you share which software version you are using?
Is it r36.5 from JetPack 6.2.2?

Thanks.

Hi,

We have confirmed that Gemma4 E4B can work on Orin Nano with r36.5 (JetPack 6.2.2).

Please give it a try:

sudo docker run -it --rm --pull always --runtime=nvidia --network host -v $HOME/.cache/huggingface:/root/.cache/huggingface ghcr.io/nvidia-ai-iot/llama_cpp:gemma4-jetson-orin llama-server -hf ggml-org/gemma-4-E4B-it-GGUF:Q4_K_M
...
main: model loaded
main: server is listening on http://127.0.0.1:8080
main: starting the main loop...
srv  update_slots: all slots are idle

Thanks.
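
Once the log shows "server is listening on http://127.0.0.1:8080", a quick way to confirm the server is ready from another terminal is to poll its health endpoint. The sketch below assumes llama-server's GET /health endpoint returns HTTP 200 once the model is loaded; the function name server_ready is made up for illustration.

```python
import urllib.error
import urllib.request


def server_ready(base_url="http://127.0.0.1:8080", timeout=2.0):
    """Return True if GET <base_url>/health answers with HTTP 200.

    Any connection error, timeout, or non-2xx status counts as "not ready".
    Hypothetical helper, not part of llama.cpp.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # HTTPError (e.g. 503 while loading) is a subclass of URLError.
        return False
```

Calling server_ready() with no arguments checks the default address from the log above; pass a different base_url if the server was started with another --port.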

My current JetPack version is r36.4.7. Does it have to be r36.5?

$ dpkg-query --show nvidia-l4t-core
nvidia-l4t-core 36.4.7-20250918154033

$ cat /etc/nv_tegra_release
# R36 (release), REVISION: 4.7, GCID: 42132812, BOARD: generic, EABI: aarch64, DATE: Thu Sep 18 22:54:44 UTC 2025
# KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

It works now, thanks!

Hi,

Yes, there is a known memory issue in r36.4.7, which has been fixed in r36.5.
Thanks.
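
For readers who want to check their own board, here is a minimal sketch that parses the first line of /etc/nv_tegra_release and compares it against r36.5, the release the reply above says contains the fix. The function names and the regular expression are illustrative assumptions based only on the sample output shown in this thread; other L4T releases may format the line differently.

```python
import re


def parse_l4t_release(text):
    """Return (major, minor, patch) parsed from /etc/nv_tegra_release content.

    Based on the format seen in this thread, e.g.
    '# R36 (release), REVISION: 4.7, ...' -> (36, 4, 7).
    """
    m = re.search(r"R(\d+) \(release\), REVISION: (\d+)(?:\.(\d+))?", text)
    if m is None:
        raise ValueError("unrecognized nv_tegra_release format")
    return int(m.group(1)), int(m.group(2)), int(m.group(3) or 0)


def has_memory_fix(version, fixed_in=(36, 5, 0)):
    """True if `version` is at least r36.5 (threshold taken from this thread)."""
    return version >= fixed_in


sample = ("# R36 (release), REVISION: 4.7, GCID: 42132812, BOARD: generic, "
          "EABI: aarch64, DATE: Thu Sep 18 22:54:44 UTC 2025")
version = parse_l4t_release(sample)
print(version, has_memory_fix(version))
```

On a Jetson, replace sample with open("/etc/nv_tegra_release").read(). The comparison is a plain tuple ordering, so it only makes sense within the r36 line discussed here.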