Errors in ggml-cuda running models in ollama on Thor

chad.mcquillen · October 1, 2025, 7:20pm

I'm experimenting with running vision models in ollama on the Thor developer kit (Jetson Linux 38.2 / JetPack 7.0) and am getting ggml-cuda errors frequently enough to make it unusable. I've tried both llava and qwen2.5vl. Ollama logs for the llava 7b, 13b, and 34b models are attached.

ollama-llava-7b.log (99.3 KB)
ollama-llava-13b.log (78.4 KB)
ollama-llava-34b.log (93.6 KB)

Command used to invoke container:

docker run -it --rm -e OLLAMA_DEBUG=1 -v /home/$USER/ollama:/data -p 11434:11434 --name ollama ghcr.io/nvidia-ai-iot/ollama:r38.2.arm64-sbsa-cu130-24.04

/etc/docker/daemon.json is configured with "default-runtime": "nvidia".
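
For anyone comparing setups, here is a minimal sketch for checking that setting programmatically. The function name default_runtime is only illustrative, and the sketch assumes daemon.json is plain JSON with a top-level "default-runtime" key, as described above.

```python
import json


def default_runtime(daemon_json_text):
    """Return the value of "default-runtime" from daemon.json text, or None if unset."""
    config = json.loads(daemon_json_text)
    return config.get("default-runtime")


def read_default_runtime(path="/etc/docker/daemon.json"):
    """Read the Docker daemon config from disk and report its default runtime."""
    with open(path, "r", encoding="utf-8") as f:
        return default_runtime(f.read())
```

Calling read_default_runtime() on the host should return "nvidia" if the configuration matches the one described in this post.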

jhendersonphd · October 1, 2025, 10:33pm

I have been having the same issues.

AastaLLL · October 2, 2025, 4:34am

Hi,

Thanks for reporting this issue.
We tested llava:7b several times but were not able to reproduce the CUDA launch failure.

Comparing the ollama logs, the kv_cache usage seems to differ between our setups.
In our testing, some layers are placed on the CPU (we ran this via ollama run llava:7b):

...
create_memory: n_ctx = 4096 (padded)
llama_kv_cache_unified: layer   0: dev = CPU
llama_kv_cache_unified: layer   1: dev = CPU
llama_kv_cache_unified: layer   2: dev = CPU
llama_kv_cache_unified: layer   3: dev = CPU
llama_kv_cache_unified: layer   4: dev = CPU
llama_kv_cache_unified: layer   5: dev = CPU
llama_kv_cache_unified: layer   6: dev = CPU
llama_kv_cache_unified: layer   7: dev = CPU
llama_kv_cache_unified: layer   8: dev = CPU
llama_kv_cache_unified: layer   9: dev = CPU
llama_kv_cache_unified: layer  10: dev = CPU
llama_kv_cache_unified: layer  11: dev = CPU
llama_kv_cache_unified: layer  12: dev = CPU
llama_kv_cache_unified: layer  13: dev = CPU
llama_kv_cache_unified: layer  14: dev = CPU
llama_kv_cache_unified: layer  15: dev = CPU
llama_kv_cache_unified: layer  16: dev = CPU
llama_kv_cache_unified: layer  17: dev = CPU
llama_kv_cache_unified: layer  18: dev = CPU
llama_kv_cache_unified: layer  19: dev = CPU
llama_kv_cache_unified: layer  20: dev = CPU
llama_kv_cache_unified: layer  21: dev = CPU
llama_kv_cache_unified: layer  22: dev = CUDA0
llama_kv_cache_unified: layer  23: dev = CUDA0
llama_kv_cache_unified: layer  24: dev = CUDA0
llama_kv_cache_unified: layer  25: dev = CUDA0
llama_kv_cache_unified: layer  26: dev = CUDA0
llama_kv_cache_unified: layer  27: dev = CUDA0
llama_kv_cache_unified: layer  28: dev = CUDA0
llama_kv_cache_unified: layer  29: dev = CUDA0
llama_kv_cache_unified: layer  30: dev = CUDA0
llama_kv_cache_unified: layer  31: dev = CUDA0
llama_kv_cache_unified:        CPU KV buffer size =   352.00 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified: size =  512.00 MiB (  4096 cells,  32 layers,  1/1 seqs), K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2328
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:      CUDA0 compute buffer size =   331.50 MiB
llama_context:  CUDA_Host compute buffer size =    24.01 MiB
llama_context: graph nodes  = 1126
llama_context: graph splits = 245 (with bs=512), 3 (with bs=1)
clip_model_loader: model name:   openai/clip-vit-large-patch14-336
clip_model_loader: description:  image encoder for LLaVA
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    377
clip_model_loader: n_kv:         19

clip_model_loader: has vision encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector:          mlp
load_hparams: n_embd:             1024
load_hparams: n_head:             16
load_hparams: n_ff:               4096
load_hparams: n_layer:            23
load_hparams: ffn_op:             gelu_quick
load_hparams: projection_dim:     768

--- vision hparams ---
load_hparams: image_size:         336
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     1
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  0
load_hparams: n_wa_pattern:       0

load_hparams: model size:         595.49 MiB
load_hparams: metadata size:      0.13 MiB
alloc_compute_meta:      CUDA0 compute buffer size =    32.88 MiB
alloc_compute_meta:        CPU compute buffer size =     1.30 MiB
time=2025-10-02T04:15:48.800Z level=INFO source=server.go:1272 msg="llama runner started in 0.69 seconds"
time=2025-10-02T04:15:48.800Z level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-10-02T04:15:48.800Z level=DEBUG source=sched.go:583 msg="evaluating already loaded" model=/data/models/ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868
time=2025-10-02T04:15:48.800Z level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
time=2025-10-02T04:15:48.800Z level=INFO source=server.go:1272 msg="llama runner started in 0.69 seconds"
time=2025-10-02T04:15:48.800Z level=DEBUG source=sched.go:485 msg="finished setting up" runner.name=registry.ollama.ai/library/llava:7b runner.inference=cuda runner.devices=1 runner.size="5.7 GiB" runner.vram="2.8 GiB" runner.parallel=1 runner.pid=80 runner.model=/data/models/ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 runner.num_ctx=4096
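
To compare KV cache placement between two logs without reading every line, something like the following sketch may help. It assumes the log lines have the "llama_kv_cache_unified: layer N: dev = X" shape shown above; the function name kv_cache_placement is only illustrative. For the log above it would count 22 layers on CPU and 10 on CUDA0, which is consistent with the 352 MiB / 160 MiB buffer split at 16 MiB per layer.

```python
import re
from collections import Counter

# Matches lines such as "llama_kv_cache_unified: layer  22: dev = CUDA0".
LAYER_RE = re.compile(r"llama_kv_cache_unified: layer\s+(\d+): dev = (\S+)")


def kv_cache_placement(log_text):
    """Count how many KV cache layers were assigned to each device."""
    counts = Counter()
    for match in LAYER_RE.finditer(log_text):
        device = match.group(2)
        counts[device] += 1
    return dict(counts)
```

Running this over each attached log and comparing the resulting dictionaries shows at a glance whether one setup offloads every layer to CUDA0 while the other splits them.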

Could you share your command for reproducing this issue so we can check it further?

Thanks.

chad.mcquillen · October 2, 2025, 1:20pm

The console log and the exact image used are attached.

running-ollama-llava-7b.log (11.7 KB)

Please note that it takes a number of iterations (10 - 20?) to see the error, although I believe it occasionally occurs much sooner. The qwen2.5vl models seem to exhibit the errors much more frequently, so make sure to try those as well.
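
Since the error only shows up after repeated runs, a loop like the sketch below may make reproduction less tedious. It is not the exact procedure used here: it assumes the container's port 11434 is reachable on localhost and uses Ollama's /api/generate REST endpoint with a base64-encoded image. The prompt, the iteration count, and the function names are placeholders.

```python
import base64
import json
import urllib.error
import urllib.request

# Assumption: the container from the docker command above is listening locally.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model, prompt, image_bytes):
    """Build a non-streaming generate request carrying one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def run_once(payload, url=OLLAMA_URL, timeout=300):
    """Send one request and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


def repro(model, image_path, iterations=30):
    """Repeat the same request; return the iteration that failed, or None."""
    with open(image_path, "rb") as f:
        payload = build_payload(model, "Describe this image.", f.read())
    for i in range(1, iterations + 1):
        try:
            run_once(payload)
        except (urllib.error.URLError, OSError, KeyError, ValueError) as exc:
            # A failed request is only a hint; check the ollama log for the CUDA error.
            print(f"iteration {i}: request failed: {exc!r}")
            return i
        print(f"iteration {i}: ok")
    return None
```

For example, repro("llava:7b", "test.jpg") repeats the request up to 30 times, where test.jpg stands in for whatever image is being tested. A failed request does not by itself prove a ggml-cuda error, so the ollama log still needs to be checked.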

AastaLLL · October 3, 2025, 6:31am

Hi,

Thanks for sharing the information.
We will test this and share an update with you.

AastaLLL · October 8, 2025, 3:27am

Hi,

Thanks a lot for your patience.

We also see the same error after running ollama around 10 times.
Our internal team is investigating the issue, and we will provide more information later.

Thanks.
