I am experimenting with running vision models in ollama on the Thor developer kit (Jetson Linux 38.2 / JetPack 7.0) and am getting errors in ggml-cuda frequently enough to make it unusable. I’ve tried running both llava and qwen2.5vl. Ollama logs for the llava 7b, 13b, and 34b models are attached.
ollama-llava-7b.log (99.3 KB)
ollama-llava-13b.log (78.4 KB)
ollama-llava-34b.log (93.6 KB)
Command used to invoke container:
docker run -it --rm -e OLLAMA_DEBUG=1 -v /home/$USER/ollama:/data -p 11434:11434 --name ollama ghcr.io/nvidia-ai-iot/ollama:r38.2.arm64-sbsa-cu130-24.04
/etc/docker/daemon.json configured with "default-runtime": "nvidia"
I have been having the same issues
Hi,
Thanks for reporting this issue.
We tested llava:7b several times but were not able to reproduce the CUDA launch failure.
Comparing the ollama logs, the KV cache placement seems to differ between your setup and ours.
In our testing, some layers are placed on the CPU (reproduced via ollama run llava:7b):
...
create_memory: n_ctx = 4096 (padded)
llama_kv_cache_unified: layer 0: dev = CPU
llama_kv_cache_unified: layer 1: dev = CPU
llama_kv_cache_unified: layer 2: dev = CPU
llama_kv_cache_unified: layer 3: dev = CPU
llama_kv_cache_unified: layer 4: dev = CPU
llama_kv_cache_unified: layer 5: dev = CPU
llama_kv_cache_unified: layer 6: dev = CPU
llama_kv_cache_unified: layer 7: dev = CPU
llama_kv_cache_unified: layer 8: dev = CPU
llama_kv_cache_unified: layer 9: dev = CPU
llama_kv_cache_unified: layer 10: dev = CPU
llama_kv_cache_unified: layer 11: dev = CPU
llama_kv_cache_unified: layer 12: dev = CPU
llama_kv_cache_unified: layer 13: dev = CPU
llama_kv_cache_unified: layer 14: dev = CPU
llama_kv_cache_unified: layer 15: dev = CPU
llama_kv_cache_unified: layer 16: dev = CPU
llama_kv_cache_unified: layer 17: dev = CPU
llama_kv_cache_unified: layer 18: dev = CPU
llama_kv_cache_unified: layer 19: dev = CPU
llama_kv_cache_unified: layer 20: dev = CPU
llama_kv_cache_unified: layer 21: dev = CPU
llama_kv_cache_unified: layer 22: dev = CUDA0
llama_kv_cache_unified: layer 23: dev = CUDA0
llama_kv_cache_unified: layer 24: dev = CUDA0
llama_kv_cache_unified: layer 25: dev = CUDA0
llama_kv_cache_unified: layer 26: dev = CUDA0
llama_kv_cache_unified: layer 27: dev = CUDA0
llama_kv_cache_unified: layer 28: dev = CUDA0
llama_kv_cache_unified: layer 29: dev = CUDA0
llama_kv_cache_unified: layer 30: dev = CUDA0
llama_kv_cache_unified: layer 31: dev = CUDA0
llama_kv_cache_unified: CPU KV buffer size = 352.00 MiB
llama_kv_cache_unified: CUDA0 KV buffer size = 160.00 MiB
llama_kv_cache_unified: size = 512.00 MiB ( 4096 cells, 32 layers, 1/1 seqs), K (f16): 256.00 MiB, V (f16): 256.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
llama_context: max_nodes = 2328
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
llama_context: CUDA0 compute buffer size = 331.50 MiB
llama_context: CUDA_Host compute buffer size = 24.01 MiB
llama_context: graph nodes = 1126
llama_context: graph splits = 245 (with bs=512), 3 (with bs=1)
clip_model_loader: model name: openai/clip-vit-large-patch14-336
clip_model_loader: description: image encoder for LLaVA
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 377
clip_model_loader: n_kv: 19
clip_model_loader: has vision encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector: mlp
load_hparams: n_embd: 1024
load_hparams: n_head: 16
load_hparams: n_ff: 4096
load_hparams: n_layer: 23
load_hparams: ffn_op: gelu_quick
load_hparams: projection_dim: 768
--- vision hparams ---
load_hparams: image_size: 336
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 1
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 0
load_hparams: model size: 595.49 MiB
load_hparams: metadata size: 0.13 MiB
alloc_compute_meta: CUDA0 compute buffer size = 32.88 MiB
alloc_compute_meta: CPU compute buffer size = 1.30 MiB
time=2025-10-02T04:15:48.800Z level=INFO source=server.go:1272 msg="llama runner started in 0.69 seconds"
time=2025-10-02T04:15:48.800Z level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-10-02T04:15:48.800Z level=DEBUG source=sched.go:583 msg="evaluating already loaded" model=/data/models/ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868
time=2025-10-02T04:15:48.800Z level=INFO source=server.go:1234 msg="waiting for llama runner to start responding"
time=2025-10-02T04:15:48.800Z level=INFO source=server.go:1272 msg="llama runner started in 0.69 seconds"
time=2025-10-02T04:15:48.800Z level=DEBUG source=sched.go:485 msg="finished setting up" runner.name=registry.ollama.ai/library/llava:7b runner.inference=cuda runner.devices=1 runner.size="5.7 GiB" runner.vram="2.8 GiB" runner.parallel=1 runner.pid=80 runner.model=/data/models/ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 runner.num_ctx=4096
Could you share your command for reproducing this issue so we can check it further?
Thanks.
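For anyone comparing their own log against the one above, here is a minimal Python sketch that counts the KV cache layers per device and checks the counts against the reported buffer sizes. It assumes every layer has the same KV size (so the 512.00 MiB total splits evenly across the 32 layers). The helper names are made up for illustration and are not part of ollama or llama.cpp.

```python
import re
from collections import Counter

# The "layer N: dev = X" lines from the log above, regenerated here so the
# example is self-contained (layers 0-21 on CPU, layers 22-31 on CUDA0).
log_lines = [f"llama_kv_cache_unified: layer {i}: dev = CPU" for i in range(22)]
log_lines += [f"llama_kv_cache_unified: layer {i}: dev = CUDA0" for i in range(22, 32)]


def layers_per_device(lines):
    """Count how many KV cache layers the log assigns to each device."""
    pattern = re.compile(r"layer\s+(\d+): dev = (\S+)")
    counts = Counter()
    for line in lines:
        match = pattern.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def expected_buffers_mib(counts, total_mib):
    """Split the total KV size evenly per layer (assumption: equal-sized layers)."""
    per_layer = total_mib / sum(counts.values())
    return {dev: n * per_layer for dev, n in counts.items()}


counts = layers_per_device(log_lines)
buffers = expected_buffers_mib(counts, total_mib=512.0)
print(dict(counts))  # {'CPU': 22, 'CUDA0': 10}
print(buffers)       # {'CPU': 352.0, 'CUDA0': 160.0}
```

The result matches the "CPU KV buffer size = 352.00 MiB" and "CUDA0 KV buffer size = 160.00 MiB" lines, so in this run 22 of the 32 layers were kept on the CPU. Feeding the same function the lines from a failing log would show whether the placement differs.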
Console log and exact image used attached.
running-ollama-llava-7b.log (11.7 KB)
Please note that it takes a number of iterations (10 - 20?) to see the error, although I believe it occasionally occurs much sooner. The qwen2.5vl models seem to exhibit the errors much more frequently, so make sure to try those as well.
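Since the error only shows up after repeated runs, a loop like the following may help with reproducing it. This is only a sketch, not the exact procedure used for the attached logs. It assumes the container is reachable on localhost:11434 (the port mapped in the docker command), that the model has already been pulled, and that a runner crash surfaces to the client as an HTTP error or a dropped connection. The image path and prompt are placeholders.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # port from the docker command


def build_payload(model, prompt, image_bytes):
    """Build a non-streaming /api/generate request body with one base64 image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }


def run_once(payload, url=OLLAMA_URL, timeout=300):
    """Send one request and return the generated text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


def repeat(payload, iterations=30, send=run_once):
    """Send the same request repeatedly.

    Returns the 1-based iteration of the first failure, or None if all succeed.
    """
    for i in range(1, iterations + 1):
        try:
            send(payload)
        except Exception as exc:  # HTTP 5xx, connection reset, timeout, ...
            print(f"iteration {i} failed: {exc!r}")
            return i
    return None


def main(image_path="test.jpg", model="llava:7b"):
    """Placeholder entry point; image_path and model are illustrative defaults."""
    with open(image_path, "rb") as f:
        payload = build_payload(model, "Describe this image.", f.read())
    print("first failing iteration:", repeat(payload))
```

Calling main() with a local image, and again with model set to a qwen2.5vl tag, should give a rough count of how many iterations it takes before the failure appears. The ollama container log still needs to be checked to confirm that the failure is the ggml-cuda error.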
Hi,
Thanks for sharing the information.
We will test this and share an update with you.
Hi,
Thanks a lot for your patience.
We also see the same error after running ollama around 10 times.
Our internal team is checking this issue. We will provide more info to you later.
Thanks.