Errors in ggml-cuda when running models in ollama on Thor

I'm experimenting with running vision models in ollama on the Thor developer kit (Jetson Linux 38.2 / JetPack 7.0) and am getting errors in ggml-cuda frequently enough to make it unusable. I've tried both llava and qwen2.5vl. Ollama logs for the llava 7b, 13b, and 34b models are attached.

ollama-llava-7b.log (99.3 KB)
ollama-llava-13b.log (78.4 KB)
ollama-llava-34b.log (93.6 KB)

Command used to invoke container:

docker run -it --rm -e OLLAMA_DEBUG=1 -v /home/$USER/ollama:/data -p 11434:11434 --name ollama ghcr.io/nvidia-ai-iot/ollama:r38.2.arm64-sbsa-cu130-24.04

/etc/docker/daemon.json configured with "default-runtime": "nvidia"
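To confirm that setting on another machine before comparing results, a quick check along these lines may help. It is only a sketch: the helper name `default_runtime` is hypothetical, and it assumes the file is plain JSON with a top-level "default-runtime" key, as described above.

```python
import json


def default_runtime(daemon_json_text: str):
    """Return the "default-runtime" value from daemon.json content, or None if unset."""
    config = json.loads(daemon_json_text)
    return config.get("default-runtime")
```

For example, `default_runtime(open("/etc/docker/daemon.json").read())` should return "nvidia" on a host configured as described.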

I have been having the same issues.

Hi,

Thanks for reporting this issue.
We tested llava:7b several times but were not able to reproduce the CUDA launch failure.

Comparing the ollama logs, the KV cache placement appears to differ between your setup and ours.
In our testing, some layers are placed on the CPU (reproduced with ollama run llava:7b):

...
create_memory: n_ctx = 4096 (padded)
llama_kv_cache_unified: layer   0: dev = CPU
llama_kv_cache_unified: layer   1: dev = CPU
llama_kv_cache_unified: layer   2: dev = CPU
llama_kv_cache_unified: layer   3: dev = CPU
llama_kv_cache_unified: layer   4: dev = CPU
llama_kv_cache_unified: layer   5: dev = CPU
llama_kv_cache_unified: layer   6: dev = CPU
llama_kv_cache_unified: layer   7: dev = CPU
llama_kv_cache_unified: layer   8: dev = CPU
llama_kv_cache_unified: layer   9: dev = CPU
llama_kv_cache_unified: layer  10: dev = CPU
llama_kv_cache_unified: layer  11: dev = CPU
llama_kv_cache_unified: layer  12: dev = CPU
llama_kv_cache_unified: layer  13: dev = CPU
llama_kv_cache_unified: layer  14: dev = CPU
llama_kv_cache_unified: layer  15: dev = CPU
llama_kv_cache_unified: layer  16: dev = CPU
llama_kv_cache_unified: layer  17: dev = CPU
llama_kv_cache_unified: layer  18: dev = CPU
llama_kv_cache_unified: layer  19: dev = CPU
llama_kv_cache_unified: layer  20: dev = CPU
llama_kv_cache_unified: layer  21: dev = CPU
llama_kv_cache_unified: layer  22: dev = CUDA0
llama_kv_cache_unified: layer  23: dev = CUDA0
llama_kv_cache_unified: layer  24: dev = CUDA0
llama_kv_cache_unified: layer  25: dev = CUDA0
llama_kv_cache_unified: layer  26: dev = CUDA0
llama_kv_cache_unified: layer  27: dev = CUDA0
llama_kv_cache_unified: layer  28: dev = CUDA0
llama_kv_cache_unified: layer  29: dev = CUDA0
llama_kv_cache_unified: layer  30: dev = CUDA0
llama_kv_cache_unified: layer  31: dev = CUDA0
llama_kv_cache_unified:        CPU KV buffer size =   352.00 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   160.00 MiB
llama_kv_cache_unified: size =  512.00 MiB (  4096 cells,  32 layers,  1/1 seqs), K (f16):  256.00 MiB, V (f16):  256.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2328
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:      CUDA0 compute buffer size =   331.50 MiB
llama_context:  CUDA_Host compute buffer size =    24.01 MiB
llama_context: graph nodes  = 1126
llama_context: graph splits = 245 (with bs=512), 3 (with bs=1)
clip_model_loader: model name:   openai/clip-vit-large-patch14-336
clip_model_loader: description:  image encoder for LLaVA
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    377
clip_model_loader: n_kv:         19

clip_model_loader: has vision encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector:          mlp
load_hparams: n_embd:             1024
load_hparams: n_head:             16
load_hparams: n_ff:               4096
load_hparams: n_layer:            23
load_hparams: ffn_op:             gelu_quick
load_hparams: projection_dim:     768

--- vision hparams ---
load_hparams: image_size:         336
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     1
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  0
load_hparams: n_wa_pattern:       0

load_hparams: model size:         595.49 MiB
load_hparams: metadata size:      0.13 MiB
alloc_compute_meta:      CUDA0 compute buffer size =    32.88 MiB
alloc_compute_meta:        CPU compute buffer size =     1.30 MiB
time=2025-10-02T04:15:48.800Z level=INFO source=server.go:1272 msg="llama runner started in 0.69 seconds"
time=2025-10-02T04:15:48.800Z level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-10-02T04:15:48.800Z level=DEBUG source=sched.go:583 msg="evaluating already loaded" model=/data/models/ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868
time=2025-10-02T04:15:48.800Z level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
time=2025-10-02T04:15:48.800Z level=INFO source=server.go:1272 msg="llama runner started in 0.69 seconds"
time=2025-10-02T04:15:48.800Z level=DEBUG source=sched.go:485 msg="finished setting up" runner.name=registry.ollama.ai/library/llava:7b runner.inference=cuda runner.devices=1 runner.size="5.7 GiB" runner.vram="2.8 GiB" runner.parallel=1 runner.pid=80 runner.model=/data/models/ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 runner.num_ctx=4096
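To compare KV cache placement between two logs, the per-layer lines can be tallied with a short script. This is a minimal sketch and not part of ollama: the function name `kv_layer_devices` is hypothetical, and the pattern assumes the `llama_kv_cache_unified: layer N: dev = X` line format shown in the log above.

```python
import re
from collections import Counter

# Matches lines such as "llama_kv_cache_unified: layer  22: dev = CUDA0".
LAYER_RE = re.compile(r"llama_kv_cache_unified: layer\s+(\d+): dev = (\S+)")


def kv_layer_devices(log_text: str) -> Counter:
    """Count how many KV cache layers the log assigns to each device."""
    return Counter(match.group(2) for match in LAYER_RE.finditer(log_text))
```

For the log above, this should report 22 layers on CPU and 10 on CUDA0, which is consistent with the 352 MiB and 160 MiB KV buffer sizes at 16 MiB per layer.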

Could you share your command for reproducing this issue so we can check it further?

Thanks.

The console log and the exact image used are attached.

running-ollama-llava-7b.log (11.7 KB)

Please note that it takes a number of iterations (10 - 20?) to see the error, although I believe it occasionally occurs much sooner. The qwen2.5vl models seem to exhibit the errors much more frequently, so make sure to try those as well.
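Since the failure is intermittent, a loop that repeats the same request until one fails may make it easier to reproduce. The sketch below is not the procedure used for the attached logs. It assumes the container's port 11434 is reachable on localhost, that ollama's /api/generate endpoint is used with a base64-encoded image, and that a crashed runner surfaces as an HTTP error or dropped connection. The helper names (`build_payload`, `post_generate`, `first_failure`) are hypothetical.

```python
import base64
import json
import urllib.request

# Assumption: port 11434 is published on localhost, as in the docker command above.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(model: str, prompt: str, image_bytes: bytes) -> bytes:
    """Build a non-streaming /api/generate request body with one image."""
    body = {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }
    return json.dumps(body).encode("utf-8")


def post_generate(payload: bytes, url: str = OLLAMA_URL, timeout: float = 300.0) -> dict:
    """Send one request; urlopen raises on HTTP errors and timeouts."""
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def first_failure(payload: bytes, max_iterations: int = 50, send=post_generate):
    """Return the 1-based iteration at which a request first fails, or None."""
    for iteration in range(1, max_iterations + 1):
        try:
            send(payload)
        except Exception as error:  # e.g. HTTP 500, dropped connection, timeout
            print(f"iteration {iteration} failed: {error}")
            return iteration
    return None
```

A run might look like `first_failure(build_payload("llava:7b", "Describe this image.", open("test.jpg", "rb").read()))`, where test.jpg is a placeholder for any local image. Swapping in a qwen2.5vl model name should exercise the more frequent failure case.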

Hi,

Thanks for sharing the information.
We will test this and share an update with you.

Hi,

Thanks a lot for your patience.

We can now reproduce the same error after running ollama around 10 times.
Our internal team is investigating this issue, and we will share more information later.

Thanks.