How to control amount of shared memory available to LLM on Jetson Thor?

Hi,

Sorry, this might be a total newbie question. Some similar threads came up when I searched the forum, but none with an answer I could replicate yet.

I’m trying to run openai/gpt-oss:120B on my Jetson Thor, but I’m seeing mixed results that seem to point to dynamic, not-quite-timely memory allocation. Consider the output below from Ollama (inside the official Docker container):

root@1b1b53f1d26a:/# ollama list
NAME                                  ID              SIZE     MODIFIED       
gpt-oss:120b                          f7f8e2f8f4e0    65 GB    49 minutes ago
root@1b1b53f1d26a:/# ollama run gpt-oss:120b
Error: 500 Internal Server Error: model requires more system memory (61.4 GiB) than is available (36.4 GiB)
root@1b1b53f1d26a:/# ollama run gpt-oss:120b
>>> ollama run gpt-oss:120b
Error: an error was encountered while running the model: CUDA error: out of memory
  current device: 0, in function evaluate_and_capture_cuda_graph at //ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:3015
  cudaGraphInstantiate(&cuda_ctx->cuda_graph->instance, cuda_ctx->cuda_graph->graph, __null, __null, 0)
//ml/backend/ggml/ggml/src/ggml-cuda/ggml-cuda.cu:84: CUDA error
root@1b1b53f1d26a:/# ollama run gpt-oss:120b
Error: 500 Internal Server Error: model requires more system memory (61.4 GiB) than is available (57.5 GiB)

Only once have I had the model actually load into memory, and that happened when sending a message to the LLM from Open WebUI. Other times it failed to load from Open WebUI as well.

Given the dynamic nature of available memory (see the logs above), is there any way to tell the system “please reserve this much memory for the LLM”? Preferably from the BIOS or CLI; otherwise I’m happy to run a script or write a short C program to do it.

For reference, I’m using Jetson Linux 38.2 and the firmware got updated after installing Linux.

Update: I also checked btop, and apparently something is using 65 GB of memory? The process list (on the right, in descending order of memory used) gets nowhere near that amount.
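For what it’s worth, one way to tell process memory apart from reclaimable page cache is free(1): its “buff/cache” column is cache the kernel can drop on demand, and “available” estimates what can still be allocated. A small sketch (the fallback to /proc/meminfo and the helper name are just illustrative):

```shell
# Show memory usage; "buff/cache" is reclaimable cache, "available" is
# the kernel's estimate of what can still be allocated without swapping.
show_mem() {
    if command -v free >/dev/null 2>&1; then
        free -h
    else
        # Fallback: raw counters from the kernel, values in kB
        grep -E '^(MemTotal|MemAvailable|Cached):' /proc/meminfo 2>/dev/null \
            || echo "meminfo unavailable"
    fi
}

show_mem
```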

What is reserving memory on the system?

Hi,

Please try the steps below, then restart the Ollama container to see if it helps.
They enable a cache cleaner in the background:

$ wget https://raw.githubusercontent.com/NVIDIA-AI-Blueprints/video-search-and-summarization/refs/heads/main/deploy/scripts/sys_cache_cleaner.sh
$ sudo sh sys_cache_cleaner.sh &
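For context, the core of such a cache cleaner is roughly the following (a sketch assuming the usual Linux drop_caches mechanism; the interval and DRY_RUN flag are illustrative, not taken from the NVIDIA script):

```shell
#!/bin/sh
# Periodically ask the kernel to drop clean page cache so that large
# contiguous allocations (e.g. model weights) are more likely to succeed.
# DRY_RUN=1 only prints the action; writing to drop_caches needs root.
INTERVAL="${INTERVAL:-5}"
DRY_RUN="${DRY_RUN:-1}"

drop_caches() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: sync && echo 3 > /proc/sys/vm/drop_caches"
    else
        sync
        echo 3 > /proc/sys/vm/drop_caches  # 3 = page cache + dentries + inodes
    fi
}

drop_caches
# The real script presumably loops:
#   while true; do drop_caches; sleep "$INTERVAL"; done
```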

Thanks

Worked perfectly, thank you so much! Could you also please explain why cleaning the system cache helps? I’d like to understand why this is happening as well, in case you know.

OK, after some more testing I think I need to update the topic again, sorry. The solution above does work, especially for the first message. However, with the cleaner running in the background and the conversation continuing, the attached errors come up in Ollama: basically there’s an illegal memory access, a stack trace, and then the model gets reloaded (sometimes successfully).

CUDA error: an illegal memory access was encountered

Also, by keeping an eye on watch free -h I notice a dip to 4 GB used memory after I send a follow-up question.

Does the script end up clearing the model itself out of the cache?

ollama.log (122.1 KB)

Hi,

The script only cleans up the cache; it should not impact data that has already been loaded into memory.

We would like to check this further. Would you mind sharing more details about the behavior you are seeing?
For example, which model did you use, what is the failure rate, and how many messages are required before the illegal memory access happens?

Thanks.

Hi,

Thanks for following this up. I’m using the vanilla gpt-oss:120b provided by Ollama. The number of messages sent varies, I’m afraid. Sometimes after the second message, other times it takes a number of messages to trigger it. Might be some correlation with the timing between messages, but I can’t really say 100%.

I have attached a video that should showcase what I’m talking about. Also, next to it is the full log from the Ollama container, showing both the API calls (stdout) and the errors and stack trace (stderr).

Hope this helps give you an idea. If I can do anything differently to help you replicate the behaviour, or if there are any other diagnostics I can send you, please let me know

ollama.log (76.6 KB)


Edit: For completeness’ sake, this is the compose.yaml I use to spin up Ollama:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ./data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities:
                - gpu
    ports:
      - 11434:11434
    restart: unless-stopped

And the Ollama version I’m running, in case it changes in the meantime:

root@1b038dc14dc6:/# ollama --version
ollama version is 0.12.2

Same problem here. After Ollama has been running for a few minutes, it tries to allocate a new memory chunk; for gpt-oss:120b it comes up 5 GB short, so it complains and fails to continue. A 20b model also fails after a few minutes.
Running Ollama without CUDA (CPU only), the issue does not occur.

Hi, both

Thanks a lot for this information.
We are checking this internally and will share more information with you.

The behavior is:

  1. Ollama tries to allocate a big chunk of memory (e.g. 5 GiB) during inference.
    This allocation might fail, but can be resolved by running the sys_cache_cleaner.sh shared in the comment above.

  2. But with the cleaner running in the background, Ollama eventually fails with an illegal memory access error.

Is our understanding correct?
Thanks.


Hi,

At least from my usage of it, yes, that is correct.

Thank you for following this up; looking forward to a workaround or fix.

Thanks for confirming the issue. In my case:

  1. Ollama tries to allocate another copy of the whole 56 GB+ of memory while it’s running, and then freezes.
  2. Running the cleaner doesn’t resolve the issue: the system froze and lost desktop response after cleaning, and shutting down over SSH took minutes.

Hi,

Since dropping the cache could cause the illegal memory access, please disable huge pages instead:

echo 0 | sudo tee /proc/sys/vm/nr_hugepages
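For reference, a quick way to verify the setting took effect, plus a hedged sketch of making it persistent across reboots (the sysctl drop-in file name below is an assumption, not from the JetPack docs):

```shell
# Read the current setting (read-only, no root needed).
current_hugepages() {
    cat /proc/sys/vm/nr_hugepages 2>/dev/null || echo "unknown"
}
echo "vm.nr_hugepages = $(current_hugepages)"

# To persist the change, the conventional sysctl drop-in mechanism works
# (run on the Jetson as root; the file name is illustrative):
#   echo 'vm.nr_hugepages=0' | sudo tee /etc/sysctl.d/99-disable-hugepages.conf
#   sudo sysctl --system
```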

Then run openai/gpt-oss:120B again.

Thanks.


Hi, I’m afraid that doesn’t help much. I set vm.nr_hugepages to 0 and ran Ollama; it responds to my first request, then fails on any follow-up with the error below, followed by a restart of the model.

panic: failed to sample token: sample: logits sum to NaN, check model output

Here’s the full log. I asked an initial question, got a response, then every panic entry marks the point where I’m asking a follow-up question:

time=2025-10-16T17:39:56.143Z level=INFO source=routes.go:1481 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-10-16T17:39:56.145Z level=INFO source=images.go:522 msg="total blobs: 5"
time=2025-10-16T17:39:56.145Z level=INFO source=images.go:529 msg="total unused blobs removed: 0"
time=2025-10-16T17:39:56.145Z level=INFO source=routes.go:1534 msg="Listening on [::]:11434 (version 0.12.5)"
time=2025-10-16T17:39:56.145Z level=INFO source=runner.go:80 msg="discovering available GPUs..."
time=2025-10-16T17:40:26.431Z level=INFO source=runner.go:507 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_jetpack5]" extra_envs=[] error="failed to finish discovery before timeout"
time=2025-10-16T17:40:26.864Z level=INFO source=types.go:112 msg="inference compute" id=GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 library=CUDA compute=11.0 name=CUDA0 description="NVIDIA Thor" libdirs=ollama,cuda_v13 driver=13.0 pci_id=01:00.0 type=iGPU total="122.8 GiB" available="118.4 GiB"
time=2025-10-16T17:41:11.623Z level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-16T17:41:11.624Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 41809"
time=2025-10-16T17:41:11.624Z level=INFO source=server.go:675 msg="loading model" "model layers"=37 requested=-1
time=2025-10-16T17:41:11.625Z level=INFO source=server.go:681 msg="system memory" total="122.8 GiB" free="118.6 GiB" free_swap="0 B"
time=2025-10-16T17:41:11.625Z level=INFO source=server.go:689 msg="gpu memory" id=GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 library=CUDA available="118.0 GiB" free="118.4 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-16T17:41:11.636Z level=INFO source=runner.go:1316 msg="starting ollama engine"
time=2025-10-16T17:41:11.639Z level=INFO source=runner.go:1352 msg="Server listening on 127.0.0.1:41809"
time=2025-10-16T17:41:11.647Z level=INFO source=runner.go:1189 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:37[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-16T17:41:11.716Z level=INFO source=ggml.go:133 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes, ID: GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2025-10-16T17:41:11.764Z level=INFO source=ggml.go:104 msg=system CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.LLAMAFILE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-10-16T17:41:36.042Z level=INFO source=runner.go:1189 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:37[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-16T17:41:37.496Z level=INFO source=runner.go:1189 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:37[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-16T17:41:37.496Z level=INFO source=ggml.go:477 msg="offloading 36 repeating layers to GPU"
time=2025-10-16T17:41:37.496Z level=INFO source=ggml.go:483 msg="offloading output layer to GPU"
time=2025-10-16T17:41:37.496Z level=INFO source=ggml.go:488 msg="offloaded 37/37 layers to GPU"
time=2025-10-16T17:41:37.496Z level=INFO source=device.go:206 msg="model weights" device=CUDA0 size="59.8 GiB"
time=2025-10-16T17:41:37.496Z level=INFO source=device.go:211 msg="model weights" device=CPU size="1.1 GiB"
time=2025-10-16T17:41:37.496Z level=INFO source=device.go:217 msg="kv cache" device=CUDA0 size="450.0 MiB"
time=2025-10-16T17:41:37.496Z level=INFO source=device.go:228 msg="compute graph" device=CUDA0 size="129.8 MiB"
time=2025-10-16T17:41:37.496Z level=INFO source=device.go:233 msg="compute graph" device=CPU size="5.6 MiB"
time=2025-10-16T17:41:37.496Z level=INFO source=device.go:238 msg="total memory" size="61.4 GiB"
time=2025-10-16T17:41:37.496Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-10-16T17:41:37.497Z level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
time=2025-10-16T17:41:37.497Z level=INFO source=server.go:1305 msg="waiting for server to become available" status="llm server loading model"
time=2025-10-16T17:41:57.340Z level=INFO source=server.go:1309 msg="llama runner started in 45.72 seconds"
panic: failed to sample token: sample: logits sum to NaN, check model output

goroutine 12 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0x40001601e0, {0xaaaae3fcc9c0, 0x400014a7d0})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:415 +0x2c8
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	github.com/ollama/ollama/runner/ollamarunner/runner.go:1330 +0x470
time=2025-10-16T17:43:58.117Z level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-16T17:43:58.118Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 38213"
time=2025-10-16T17:43:58.118Z level=INFO source=server.go:675 msg="loading model" "model layers"=37 requested=-1
time=2025-10-16T17:43:58.119Z level=INFO source=server.go:681 msg="system memory" total="122.8 GiB" free="58.0 GiB" free_swap="0 B"
time=2025-10-16T17:43:58.119Z level=INFO source=server.go:689 msg="gpu memory" id=GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 library=CUDA available="57.4 GiB" free="57.8 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-16T17:43:58.132Z level=INFO source=runner.go:1316 msg="starting ollama engine"
time=2025-10-16T17:43:58.136Z level=INFO source=runner.go:1352 msg="Server listening on 127.0.0.1:38213"
time=2025-10-16T17:43:58.141Z level=INFO source=runner.go:1189 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:37[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-16T17:43:58.197Z level=INFO source=ggml.go:133 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes, ID: GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2025-10-16T17:43:58.244Z level=INFO source=ggml.go:104 msg=system CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.LLAMAFILE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-10-16T17:43:58.510Z level=INFO source=runner.go:1189 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-16T17:43:58.631Z level=INFO source=runner.go:1189 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-16T17:43:59.201Z level=INFO source=runner.go:1189 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-16T17:43:59.201Z level=INFO source=ggml.go:477 msg="offloading 34 repeating layers to GPU"
time=2025-10-16T17:43:59.201Z level=INFO source=ggml.go:481 msg="offloading output layer to CPU"
time=2025-10-16T17:43:59.201Z level=INFO source=ggml.go:488 msg="offloaded 34/37 layers to GPU"
time=2025-10-16T17:43:59.201Z level=INFO source=device.go:206 msg="model weights" device=CUDA0 size="55.4 GiB"
time=2025-10-16T17:43:59.201Z level=INFO source=device.go:211 msg="model weights" device=CPU size="5.4 GiB"
time=2025-10-16T17:43:59.201Z level=INFO source=device.go:217 msg="kv cache" device=CUDA0 size="425.0 MiB"
time=2025-10-16T17:43:59.201Z level=INFO source=device.go:222 msg="kv cache" device=CPU size="25.0 MiB"
time=2025-10-16T17:43:59.201Z level=INFO source=device.go:228 msg="compute graph" device=CUDA0 size="139.1 MiB"
time=2025-10-16T17:43:59.201Z level=INFO source=device.go:233 msg="compute graph" device=CPU size="109.2 MiB"
time=2025-10-16T17:43:59.201Z level=INFO source=device.go:238 msg="total memory" size="61.5 GiB"
time=2025-10-16T17:43:59.201Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-10-16T17:43:59.202Z level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
time=2025-10-16T17:43:59.202Z level=INFO source=server.go:1305 msg="waiting for server to become available" status="llm server loading model"
time=2025-10-16T17:44:20.326Z level=INFO source=server.go:1309 msg="llama runner started in 22.21 seconds"
panic: failed to sample token: sample: logits sum to NaN, check model output

goroutine 11 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).run(0x40002370e0, {0xaaaac4a6c9c0, 0x4000698a50})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:415 +0x2c8
created by github.com/ollama/ollama/runner/ollamarunner.Execute in goroutine 1
	github.com/ollama/ollama/runner/ollamarunner/runner.go:1330 +0x470
time=2025-10-16T17:45:06.695Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.179344357 runner.size="61.5 GiB" runner.vram="56.0 GiB" runner.parallel=1 runner.pid=274 runner.model=/root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-10-16T17:45:06.945Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.429387351 runner.size="61.5 GiB" runner.vram="56.0 GiB" runner.parallel=1 runner.pid=274 runner.model=/root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-10-16T17:45:07.188Z level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-16T17:45:07.189Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 35825"
time=2025-10-16T17:45:07.189Z level=INFO source=server.go:675 msg="loading model" "model layers"=37 requested=-1
time=2025-10-16T17:45:07.189Z level=INFO source=server.go:681 msg="system memory" total="122.8 GiB" free="58.0 GiB" free_swap="0 B"
time=2025-10-16T17:45:07.189Z level=INFO source=server.go:689 msg="gpu memory" id=GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 library=CUDA available="57.4 GiB" free="57.9 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-16T17:45:07.195Z level=WARN source=sched.go:655 msg="gpu VRAM usage didn't recover within timeout" seconds=5.679553001 runner.size="61.5 GiB" runner.vram="56.0 GiB" runner.parallel=1 runner.pid=274 runner.model=/root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3
time=2025-10-16T17:45:07.204Z level=INFO source=runner.go:1316 msg="starting ollama engine"
time=2025-10-16T17:45:07.208Z level=INFO source=runner.go:1352 msg="Server listening on 127.0.0.1:35825"
time=2025-10-16T17:45:07.212Z level=INFO source=runner.go:1189 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:37[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-16T17:45:07.290Z level=INFO source=ggml.go:133 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes, ID: GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2025-10-16T17:45:07.363Z level=INFO source=ggml.go:104 msg=system CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.LLAMAFILE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-10-16T17:45:07.681Z level=INFO source=runner.go:1189 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-16T17:45:07.797Z level=INFO source=runner.go:1189 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-16T17:45:08.413Z level=INFO source=runner.go:1189 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-16T17:45:08.413Z level=INFO source=ggml.go:477 msg="offloading 34 repeating layers to GPU"
time=2025-10-16T17:45:08.413Z level=INFO source=ggml.go:481 msg="offloading output layer to CPU"
time=2025-10-16T17:45:08.413Z level=INFO source=ggml.go:488 msg="offloaded 34/37 layers to GPU"
time=2025-10-16T17:45:08.413Z level=INFO source=device.go:206 msg="model weights" device=CUDA0 size="55.4 GiB"
time=2025-10-16T17:45:08.414Z level=INFO source=device.go:211 msg="model weights" device=CPU size="5.4 GiB"
time=2025-10-16T17:45:08.414Z level=INFO source=device.go:217 msg="kv cache" device=CUDA0 size="425.0 MiB"
time=2025-10-16T17:45:08.414Z level=INFO source=device.go:222 msg="kv cache" device=CPU size="25.0 MiB"
time=2025-10-16T17:45:08.414Z level=INFO source=device.go:228 msg="compute graph" device=CUDA0 size="139.1 MiB"
time=2025-10-16T17:45:08.414Z level=INFO source=device.go:233 msg="compute graph" device=CPU size="109.2 MiB"
time=2025-10-16T17:45:08.414Z level=INFO source=device.go:238 msg="total memory" size="61.5 GiB"
time=2025-10-16T17:45:08.414Z level=INFO source=sched.go:481 msg="loaded runners" count=1
time=2025-10-16T17:45:08.414Z level=INFO source=server.go:1271 msg="waiting for llama runner to start responding"
time=2025-10-16T17:45:08.414Z level=INFO source=server.go:1305 msg="waiting for server to become available" status="llm server loading model"
time=2025-10-16T17:45:29.771Z level=INFO source=server.go:1309 msg="llama runner started in 22.58 seconds"

Hi @georgelpreput ,

Please provide the steps in detail (including run commands and the messages sent to the LLM) to help us reproduce the error, as we are unable to replicate it on our side.

Thanks

Sure, here’s a log that contains the timestamps of the chat messages, as well as the error. The screenshot shows the conversation itself. Do note that Open WebUI sends a few messages in quick succession for each user message, since it also asks the model for a conversation title and for tags.

What I did was soft-reboot Jetson Thor, then log in remotely via SSH on two separate connections, without an active Gnome session. In the first SSH connection, I ran btop to confirm that I’m starting with ~110 GB of free RAM. In the second SSH connection, I ran these commands:

cd /opt/stacks/ollama/
docker compose up -d
echo 0 | sudo tee /proc/sys/vm/nr_hugepages
docker logs ollama --follow

Here are the logs, including API call logs:

time=2025-10-21T20:18:09.735Z level=INFO source=routes.go:1511 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GGML_VK_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-10-21T20:18:09.737Z level=INFO source=images.go:522 msg="total blobs: 5"
time=2025-10-21T20:18:09.737Z level=INFO source=images.go:529 msg="total unused blobs removed: 0"
time=2025-10-21T20:18:09.737Z level=INFO source=routes.go:1564 msg="Listening on [::]:11434 (version 0.12.6)"
time=2025-10-21T20:18:09.738Z level=INFO source=runner.go:80 msg="discovering available GPUs..."
time=2025-10-21T20:18:10.552Z level=INFO source=types.go:112 msg="inference compute" id=GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 library=CUDA compute=11.0 name=CUDA0 description="NVIDIA Thor" libdirs=ollama,cuda_v13 driver=13.0 pci_id=01:00.0 type=iGPU total="122.8 GiB" available="119.6 GiB"
time=2025-10-21T20:20:06.661Z level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-21T20:20:06.661Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 36481"
time=2025-10-21T20:20:06.662Z level=INFO source=server.go:676 msg="loading model" "model layers"=37 requested=-1
time=2025-10-21T20:20:06.662Z level=INFO source=server.go:682 msg="system memory" total="122.8 GiB" free="119.1 GiB" free_swap="0 B"
time=2025-10-21T20:20:06.662Z level=INFO source=server.go:690 msg="gpu memory" id=GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 library=CUDA available="118.5 GiB" free="119.0 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-21T20:20:06.674Z level=INFO source=runner.go:1332 msg="starting ollama engine"
time=2025-10-21T20:20:06.679Z level=INFO source=runner.go:1367 msg="Server listening on 127.0.0.1:36481"
time=2025-10-21T20:20:06.684Z level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:37[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T20:20:06.765Z level=INFO source=ggml.go:134 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes, ID: GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2025-10-21T20:20:06.854Z level=INFO source=ggml.go:104 msg=system CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.LLAMAFILE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-10-21T20:20:07.186Z level=INFO source=runner.go:1205 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:37[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T20:20:08.707Z level=INFO source=runner.go:1205 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:37[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T20:20:08.707Z level=INFO source=ggml.go:480 msg="offloading 36 repeating layers to GPU"
time=2025-10-21T20:20:08.707Z level=INFO source=ggml.go:487 msg="offloading output layer to GPU"
time=2025-10-21T20:20:08.707Z level=INFO source=ggml.go:492 msg="offloaded 37/37 layers to GPU"
time=2025-10-21T20:20:08.707Z level=INFO source=device.go:206 msg="model weights" device=CUDA0 size="59.8 GiB"
time=2025-10-21T20:20:08.707Z level=INFO source=device.go:211 msg="model weights" device=CPU size="1.1 GiB"
time=2025-10-21T20:20:08.707Z level=INFO source=device.go:217 msg="kv cache" device=CUDA0 size="450.0 MiB"
time=2025-10-21T20:20:08.707Z level=INFO source=device.go:228 msg="compute graph" device=CUDA0 size="129.8 MiB"
time=2025-10-21T20:20:08.707Z level=INFO source=device.go:233 msg="compute graph" device=CPU size="5.6 MiB"
time=2025-10-21T20:20:08.707Z level=INFO source=device.go:238 msg="total memory" size="61.4 GiB"
time=2025-10-21T20:20:08.707Z level=INFO source=sched.go:482 msg="loaded runners" count=1
time=2025-10-21T20:20:08.707Z level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
time=2025-10-21T20:20:08.708Z level=INFO source=server.go:1306 msg="waiting for server to become available" status="llm server loading model"
time=2025-10-21T20:20:26.038Z level=INFO source=server.go:1310 msg="llama runner started in 19.38 seconds"
[GIN] 2025/10/21 - 20:20:28 | 200 | 22.474630397s |    192.168.2.18 | POST     "/api/chat"
[GIN] 2025/10/21 - 20:20:34 | 200 |  6.497649123s |    192.168.2.18 | POST     "/api/chat"
[GIN] 2025/10/21 - 20:20:40 | 200 |  5.527840289s |    192.168.2.18 | POST     "/api/chat"
[GIN] 2025/10/21 - 20:20:44 | 200 |  4.202008031s |    192.168.2.18 | POST     "/api/chat"
panic: failed to sample token

goroutine 981 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0x40002370e0, {0x1c5, {0xaaaae3b130e0, 0x40002e4000}, {0xaaaae3b1dfa8, 0x4000a76df8}, {0x4000b00008, 0x1, 0x1}, {{0xaaaae3b1dfa8, ...}, ...}, ...})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:735 +0x138c
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 38
	github.com/ollama/ollama/runner/ollamarunner/runner.go:432 +0x22c
[GIN] 2025/10/21 - 20:21:06 | 500 |  1.502073126s |    192.168.2.18 | POST     "/api/chat"

A screenshot (attached) shows the conversation from Open WebUI (not hosted on the Thor).

Logs from the Open WebUI container (on a different machine):

2025-10-21 20:20:04.645 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/tools/ HTTP/1.1" 200
2025-10-21 20:20:04.713 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200
2025-10-21 20:20:04.754 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/folders/ HTTP/1.1" 200
2025-10-21 20:20:04.770 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "POST /api/v1/chats/af1101e4-bad0-4fe7-adb3-0e15048ac09a HTTP/1.1" 200
2025-10-21 20:20:04.804 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200
2025-10-21 20:20:04.852 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/folders/ HTTP/1.1" 200
2025-10-21 20:20:05.865 | INFO     | httpx._client:_send_single_request:1025 - HTTP Request: GET http://qdrant:6333/collections/open-webui_memories/exists "HTTP/1.1 200 OK"
2025-10-21 20:20:05.867 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "POST /api/chat/completions HTTP/1.1" 200
2025-10-21 20:20:05.907 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200
2025-10-21 20:20:05.946 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/folders/ HTTP/1.1" 200
2025-10-21 20:20:24.249 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /_app/version.json HTTP/1.1" 200
2025-10-21 20:20:24.327 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/chats/af1101e4-bad0-4fe7-adb3-0e15048ac09a/tags HTTP/1.1" 200
2025-10-21 20:20:28.607 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "POST /api/chat/completed HTTP/1.1" 200
2025-10-21 20:20:28.649 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "POST /api/v1/chats/af1101e4-bad0-4fe7-adb3-0e15048ac09a HTTP/1.1" 200
2025-10-21 20:20:28.690 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200
2025-10-21 20:20:28.729 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/folders/ HTTP/1.1" 200
2025-10-21 20:20:40.790 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200
2025-10-21 20:20:40.845 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/folders/ HTTP/1.1" 200
2025-10-21 20:20:45.084 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/chats/af1101e4-bad0-4fe7-adb3-0e15048ac09a HTTP/1.1" 200
2025-10-21 20:20:45.134 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/chats/all/tags HTTP/1.1" 200
2025-10-21 20:21:04.815 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "POST /api/v1/chats/af1101e4-bad0-4fe7-adb3-0e15048ac09a HTTP/1.1" 200
2025-10-21 20:21:04.964 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200
2025-10-21 20:21:05.005 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/folders/ HTTP/1.1" 200
2025-10-21 20:21:05.136 | INFO     | httpx._client:_send_single_request:1025 - HTTP Request: GET http://qdrant:6333/collections/open-webui_memories/exists "HTTP/1.1 200 OK"
2025-10-21 20:21:05.137 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "POST /api/chat/completions HTTP/1.1" 200
2025-10-21 20:21:05.260 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200
2025-10-21 20:21:05.300 | INFO     | uvicorn.protocols.http.httptools_impl:send:476 - 172.22.0.1:0 - "GET /api/v1/folders/ HTTP/1.1" 200

Interestingly though, if I install Alpaca through Flatpak and connect it to the same Ollama instance (which I didn’t even stop in the meantime), I can hold a normal conversation from there (until Alpaca crashes due to its own bugs, which is not of interest for this topic). Here’s the Ollama log when using Alpaca:

time=2025-10-21T20:52:04.850Z level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-21T20:52:04.850Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 42217"
time=2025-10-21T20:52:04.851Z level=INFO source=server.go:676 msg="loading model" "model layers"=37 requested=-1
time=2025-10-21T20:52:04.851Z level=INFO source=server.go:682 msg="system memory" total="122.8 GiB" free="57.0 GiB" free_swap="0 B"
time=2025-10-21T20:52:04.851Z level=INFO source=server.go:690 msg="gpu memory" id=GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 library=CUDA available="56.5 GiB" free="56.9 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-21T20:52:04.863Z level=INFO source=runner.go:1332 msg="starting ollama engine"
time=2025-10-21T20:52:04.867Z level=INFO source=runner.go:1367 msg="Server listening on 127.0.0.1:42217"
time=2025-10-21T20:52:04.873Z level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:37[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T20:52:04.944Z level=INFO source=ggml.go:134 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes, ID: GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2025-10-21T20:52:05.001Z level=INFO source=ggml.go:104 msg=system CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.LLAMAFILE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-10-21T20:52:05.284Z level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T20:52:05.409Z level=INFO source=runner.go:1205 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T20:52:06.020Z level=INFO source=runner.go:1205 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T20:52:06.020Z level=INFO source=ggml.go:480 msg="offloading 34 repeating layers to GPU"
time=2025-10-21T20:52:06.020Z level=INFO source=ggml.go:484 msg="offloading output layer to CPU"
time=2025-10-21T20:52:06.020Z level=INFO source=ggml.go:492 msg="offloaded 34/37 layers to GPU"
time=2025-10-21T20:52:06.021Z level=INFO source=device.go:206 msg="model weights" device=CUDA0 size="55.4 GiB"
time=2025-10-21T20:52:06.021Z level=INFO source=device.go:211 msg="model weights" device=CPU size="5.4 GiB"
time=2025-10-21T20:52:06.021Z level=INFO source=device.go:217 msg="kv cache" device=CUDA0 size="425.0 MiB"
time=2025-10-21T20:52:06.021Z level=INFO source=device.go:222 msg="kv cache" device=CPU size="25.0 MiB"
time=2025-10-21T20:52:06.021Z level=INFO source=device.go:228 msg="compute graph" device=CUDA0 size="139.1 MiB"
time=2025-10-21T20:52:06.021Z level=INFO source=device.go:233 msg="compute graph" device=CPU size="109.2 MiB"
time=2025-10-21T20:52:06.021Z level=INFO source=device.go:238 msg="total memory" size="61.5 GiB"
time=2025-10-21T20:52:06.021Z level=INFO source=sched.go:482 msg="loaded runners" count=1
time=2025-10-21T20:52:06.022Z level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
time=2025-10-21T20:52:06.022Z level=INFO source=server.go:1306 msg="waiting for server to become available" status="llm server loading model"
time=2025-10-21T20:52:30.642Z level=INFO source=server.go:1310 msg="llama runner started in 25.79 seconds"
[GIN] 2025/10/21 - 20:52:33 | 200 | 29.751495779s |      172.20.0.1 | POST     "/api/generate"
ggml_nvml_get_device_memory NVML not supported for memory query, using system memory (total=131881750528, available=54037270528)
ggml_backend_cuda_device_get_memory utilizing NVML memory reporting free: 54037270528 total: 131881750528
time=2025-10-21T20:52:38.895Z level=INFO source=runner.go:545 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v13]" extra_envs=[] error="failed to finish discovery before timeout"
time=2025-10-21T20:52:38.895Z level=WARN source=runner.go:347 msg="unable to refresh free memory, using old values"
time=2025-10-21T20:52:39.308Z level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-21T20:52:39.309Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 36061"
time=2025-10-21T20:52:39.309Z level=INFO source=server.go:676 msg="loading model" "model layers"=37 requested=-1
time=2025-10-21T20:52:39.310Z level=INFO source=server.go:682 msg="system memory" total="122.8 GiB" free="57.2 GiB" free_swap="0 B"
time=2025-10-21T20:52:39.310Z level=INFO source=server.go:690 msg="gpu memory" id=GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 library=CUDA available="56.6 GiB" free="57.1 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-21T20:52:39.321Z level=INFO source=runner.go:1332 msg="starting ollama engine"
time=2025-10-21T20:52:39.325Z level=INFO source=runner.go:1367 msg="Server listening on 127.0.0.1:36061"
time=2025-10-21T20:52:39.332Z level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16384 KvCacheType: NumThreads:14 GPULayers:37[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T20:52:39.402Z level=INFO source=ggml.go:134 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes, ID: GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2025-10-21T20:52:39.458Z level=INFO source=ggml.go:104 msg=system CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.LLAMAFILE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-10-21T20:52:39.745Z level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16384 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T20:52:39.853Z level=INFO source=runner.go:1205 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16384 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T20:52:40.438Z level=INFO source=runner.go:1205 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:16384 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T20:52:40.439Z level=INFO source=ggml.go:480 msg="offloading 34 repeating layers to GPU"
time=2025-10-21T20:52:40.439Z level=INFO source=ggml.go:484 msg="offloading output layer to CPU"
time=2025-10-21T20:52:40.439Z level=INFO source=ggml.go:492 msg="offloaded 34/37 layers to GPU"
time=2025-10-21T20:52:40.439Z level=INFO source=device.go:206 msg="model weights" device=CUDA0 size="55.4 GiB"
time=2025-10-21T20:52:40.439Z level=INFO source=device.go:211 msg="model weights" device=CPU size="5.4 GiB"
time=2025-10-21T20:52:40.439Z level=INFO source=device.go:217 msg="kv cache" device=CUDA0 size="697.0 MiB"
time=2025-10-21T20:52:40.439Z level=INFO source=device.go:222 msg="kv cache" device=CPU size="41.0 MiB"
time=2025-10-21T20:52:40.439Z level=INFO source=device.go:228 msg="compute graph" device=CUDA0 size="147.1 MiB"
time=2025-10-21T20:52:40.439Z level=INFO source=device.go:233 msg="compute graph" device=CPU size="109.2 MiB"
time=2025-10-21T20:52:40.439Z level=INFO source=device.go:238 msg="total memory" size="61.8 GiB"
time=2025-10-21T20:52:40.439Z level=INFO source=sched.go:482 msg="loaded runners" count=1
time=2025-10-21T20:52:40.439Z level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
time=2025-10-21T20:52:40.439Z level=INFO source=server.go:1306 msg="waiting for server to become available" status="llm server loading model"
time=2025-10-21T20:53:04.047Z level=INFO source=server.go:1310 msg="llama runner started in 24.74 seconds"
[GIN] 2025/10/21 - 20:53:08 | 200 |          1m4s |      172.20.0.1 | POST     "/api/chat"
[GIN] 2025/10/21 - 20:53:25 | 200 |  165.391304ms |      172.20.0.1 | POST     "/api/show"
[GIN] 2025/10/21 - 20:53:36 | 200 | 11.169146607s |      172.20.0.1 | POST     "/api/chat"
[GIN] 2025/10/21 - 20:53:52 | 200 |  135.551585ms |      172.20.0.1 | POST     "/api/show"
[GIN] 2025/10/21 - 20:54:30 | 200 | 37.468109406s |      172.20.0.1 | POST     "/api/chat"
[GIN] 2025/10/21 - 20:54:43 | 200 |  160.722797ms |      172.20.0.1 | POST     "/api/show"
[GIN] 2025/10/21 - 20:54:54 | 200 |  10.49062772s |      172.20.0.1 | POST     "/api/chat"
[GIN] 2025/10/21 - 20:55:37 | 200 |  138.794007ms |      172.20.0.1 | POST     "/api/show"
[GIN] 2025/10/21 - 20:55:47 | 200 | 10.216652255s |      172.20.0.1 | POST     "/api/chat"

Do note that when I ran Alpaca on the Jetson Thor, I was connected to it through RDP over LAN via the Gnome Connections app. When running Open WebUI I had no active Gnome session (not sure whether it matters).
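Since the CPU and GPU share one pool of physical RAM on Jetson, one rough way to see how much the desktop session is holding is to watch /proc/meminfo over SSH while connecting and disconnecting RDP (a simple sketch, nothing Ollama-specific):

```shell
# MemAvailable is roughly what Ollama can claim as "gpu memory"
# on a unified-memory board like the Thor. Print the current figures:
grep -E 'MemTotal|MemAvailable' /proc/meminfo
# For a live view while opening/closing the RDP session:
#   watch -n 1 free -h
```

Comparing the MemAvailable figure with and without the Gnome session active should show how much the desktop stack is actually costing.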

Later, after I stopped messaging, disconnected RDP and basically left the system idle for some minutes, I got this in the Ollama logs:

ggml_nvml_get_device_memory NVML not supported for memory query, using system memory (total=131881750528, available=54210768896)
ggml_backend_cuda_device_get_memory utilizing NVML memory reporting free: 54210768896 total: 131881750528
time=2025-10-21T21:00:53.036Z level=INFO source=runner.go:545 msg="failure during GPU discovery" OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v13]" extra_envs=[] error="failed to finish discovery before timeout"
time=2025-10-21T21:00:53.036Z level=WARN source=runner.go:347 msg="unable to refresh free memory, using old values"

OK, so this is going to sound weird, but I might be onto something. I retried the LLM conversation through Open WebUI, this time while keeping the RDP connection to the Jetson Thor active but idle. While that connection was up, I could keep asking the LLM for limericks. As soon as I closed the RDP connection (and, I guess, by extension the Gnome session terminated), the next message produced that panic: failed to sample token error in the Ollama container logs (I kept them running in an SSH window).

Here are those Ollama logs; note that I closed the RDP connection to the Thor somewhere between timestamps 2025/10/21 - 21:07:04 and 2025/10/21 - 21:07:38. The LLM did not respond to the /api/chat call from 21:07:38.

time=2025-10-21T21:00:53.036Z level=WARN source=runner.go:347 msg="unable to refresh free memory, using old values"
time=2025-10-21T21:02:53.016Z level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-10-21T21:02:53.016Z level=INFO source=server.go:400 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-90a618fe6ff21b09ca968df959104eb650658b0bef0faef785c18c2795d993e3 --port 37909"
time=2025-10-21T21:02:53.017Z level=INFO source=server.go:676 msg="loading model" "model layers"=37 requested=-1
time=2025-10-21T21:02:53.017Z level=INFO source=server.go:682 msg="system memory" total="122.8 GiB" free="57.6 GiB" free_swap="0 B"
time=2025-10-21T21:02:53.017Z level=INFO source=server.go:690 msg="gpu memory" id=GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 library=CUDA available="57.0 GiB" free="57.5 GiB" minimum="457.0 MiB" overhead="0 B"
time=2025-10-21T21:02:53.029Z level=INFO source=runner.go:1332 msg="starting ollama engine"
time=2025-10-21T21:02:53.032Z level=INFO source=runner.go:1367 msg="Server listening on 127.0.0.1:37909"
time=2025-10-21T21:02:53.039Z level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:37[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:37(0..36)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T21:02:53.120Z level=INFO source=ggml.go:134 msg="" architecture=gptoss file_type=MXFP4 name="" description="" num_tensors=471 num_key_values=30
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA Thor, compute capability 11.0, VMM: yes, ID: GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v13/libggml-cuda.so
time=2025-10-21T21:02:53.209Z level=INFO source=ggml.go:104 msg=system CPU.0.NEON=1 CPU.0.ARM_FMA=1 CPU.0.LLAMAFILE=1 CPU.1.NEON=1 CPU.1.ARM_FMA=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=750,800,860,870,890,900,1000,1100,1200,1210 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-10-21T21:02:53.495Z level=INFO source=runner.go:1205 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T21:02:53.600Z level=INFO source=runner.go:1205 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T21:02:54.189Z level=INFO source=runner.go:1205 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType: NumThreads:14 GPULayers:34[ID:GPU-a7c66ad2-6dbb-0ab8-c1a2-37ba6dba3600 Layers:34(2..35)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-10-21T21:02:54.189Z level=INFO source=ggml.go:480 msg="offloading 34 repeating layers to GPU"
time=2025-10-21T21:02:54.189Z level=INFO source=ggml.go:484 msg="offloading output layer to CPU"
time=2025-10-21T21:02:54.189Z level=INFO source=ggml.go:492 msg="offloaded 34/37 layers to GPU"
time=2025-10-21T21:02:54.189Z level=INFO source=device.go:206 msg="model weights" device=CUDA0 size="55.4 GiB"
time=2025-10-21T21:02:54.189Z level=INFO source=device.go:211 msg="model weights" device=CPU size="5.4 GiB"
time=2025-10-21T21:02:54.189Z level=INFO source=device.go:217 msg="kv cache" device=CUDA0 size="425.0 MiB"
time=2025-10-21T21:02:54.189Z level=INFO source=device.go:222 msg="kv cache" device=CPU size="25.0 MiB"
time=2025-10-21T21:02:54.189Z level=INFO source=device.go:228 msg="compute graph" device=CUDA0 size="139.1 MiB"
time=2025-10-21T21:02:54.189Z level=INFO source=device.go:233 msg="compute graph" device=CPU size="109.2 MiB"
time=2025-10-21T21:02:54.189Z level=INFO source=device.go:238 msg="total memory" size="61.5 GiB"
time=2025-10-21T21:02:54.189Z level=INFO source=sched.go:482 msg="loaded runners" count=1
time=2025-10-21T21:02:54.190Z level=INFO source=server.go:1272 msg="waiting for llama runner to start responding"
time=2025-10-21T21:02:54.190Z level=INFO source=server.go:1306 msg="waiting for server to become available" status="llm server loading model"
time=2025-10-21T21:03:14.540Z level=INFO source=server.go:1310 msg="llama runner started in 21.52 seconds"
[GIN] 2025/10/21 - 21:03:24 | 200 | 32.650454145s |    192.168.2.18 | POST     "/api/chat"
[GIN] 2025/10/21 - 21:03:59 | 200 | 34.424267457s |    192.168.2.18 | POST     "/api/chat"
[GIN] 2025/10/21 - 21:04:08 | 200 | 22.861922029s |    192.168.2.18 | POST     "/api/chat"
[GIN] 2025/10/21 - 21:04:32 | 200 | 24.723650948s |    192.168.2.18 | POST     "/api/chat"
[GIN] 2025/10/21 - 21:04:49 | 200 |  9.754817773s |    192.168.2.18 | POST     "/api/chat"
[GIN] 2025/10/21 - 21:05:23 | 200 | 34.735305649s |    192.168.2.18 | POST     "/api/chat"
[GIN] 2025/10/21 - 21:06:09 | 200 | 11.905579288s |    192.168.2.18 | POST     "/api/chat"
[GIN] 2025/10/21 - 21:06:31 | 200 | 22.470565161s |    192.168.2.18 | POST     "/api/chat"
[GIN] 2025/10/21 - 21:07:04 | 200 | 17.563989115s |    192.168.2.18 | POST     "/api/chat"
[GIN] 2025/10/21 - 21:07:38 | 200 | 33.266232108s |    192.168.2.18 | POST     "/api/chat"
panic: failed to sample token

goroutine 4845 [running]:
github.com/ollama/ollama/runner/ollamarunner.(*Server).computeBatch(0x4000236f00, {0x825, {0xaaaab70430e0, 0x40002e2000}, {0xaaaab704dfa8, 0x400138d620}, {0x400007e078, 0x1, 0x1}, {{0xaaaab704dfa8, ...}, ...}, ...})
	github.com/ollama/ollama/runner/ollamarunner/runner.go:735 +0x138c
created by github.com/ollama/ollama/runner/ollamarunner.(*Server).run in goroutine 38
	github.com/ollama/ollama/runner/ollamarunner/runner.go:432 +0x22c
[GIN] 2025/10/21 - 21:07:56 | 500 |  7.621396938s |    192.168.2.18 | POST     "/api/chat"

Also, regarding my setup: I am running Jetson Linux (with Gnome) on the Thor, though I normally access it remotely via SSH and rarely through the GUI. Instead of a monitor, I keep an HDMI dummy plug (this one: https://www.amazon.nl/-/en/dp/B087R35V9Q) inserted in the HDMI port at all times.

Finally, I do see in the logs that only 34 out of 37 layers were offloaded to the GPU (timestamp 2025-10-21T21:02:54.189Z); maybe that also has something to do with it.
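If the partial offload turns out to matter, one thing that might be worth trying is pinning the layer count instead of letting Ollama estimate it. A sketch (untested on my side) using Ollama's num_gpu parameter in a derived model; 37 is the layer count from the logs above, and gpt-oss-120b-allgpu is just a name I made up:

```shell
# Derive a model that always offloads all 37 layers to the GPU.
cat > /tmp/Modelfile <<'EOF'
FROM gpt-oss:120b
PARAMETER num_gpu 37
EOF
# Create the derived model (skipped here if ollama is not installed):
if command -v ollama >/dev/null 2>&1; then
    ollama create gpt-oss-120b-allgpu -f /tmp/Modelfile
fi
```

With memory this tight, forcing all layers onto the GPU could of course just make the out-of-memory errors deterministic rather than fix them, but it would at least remove the variability.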

Hi,

We still cannot reproduce the error with the steps below:

  1. Run Ollama:
docker run -it --rm -e OLLAMA_DEBUG=1 -p 11434:11434 --name ollama ghcr.io/nvidia-ai-iot/ollama:r38.2.arm64-sbsa-cu130-24.04
  2. Run Open WebUI:
docker run -it --rm --network=host --add-host=host.docker.internal:host-gateway ghcr.io/open-webui/open-webui:main
  3. Start a conversation from Open WebUI.

Please remove the local Docker image and model, then reinstall them.

If the issue persists, please provide the whole log.

Thanks

Hi,

You’re trying to reproduce it with a different image: I was using the official Ollama image, while you’re using an image from nvidia-ai-iot dating back to August. The first time I tried your image, I got this:

georgelpreput@ai /e/k/r/d/s/h/ollama> docker run -it --rm -e OLLAMA_DEBUG=1 -p 11434:11434 --name ollama ghcr.io/nvidia-ai-iot/ollama:r38.2.arm64-sbsa-cu130-24.04
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices runtime.nvidia.com/gpu=all: unknown

I also got a system freeze, so I had to force a shutdown from the power button. After restarting, I was able to get the model to respond to multiple prompts over time without the original issue, though it runs at a lower 5-7 tokens per second, compared to ~11 tokens per second on the few occasions the official Ollama image worked.
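For the record, my understanding is that the "unresolvable CDI devices" error above means Docker couldn't find a generated CDI spec for the GPU. The NVIDIA Container Toolkit can regenerate one; a sketch (untested on my side, and the spec path is just the toolkit's documented default):

```shell
# Assumption: the CDI error means no CDI spec exists yet.
# /etc/cdi/nvidia.yaml is the toolkit's default spec location.
CDI_SPEC=/etc/cdi/nvidia.yaml
if command -v nvidia-ctk >/dev/null 2>&1 && [ ! -f "$CDI_SPEC" ]; then
    # Regenerate the spec so Docker can resolve the GPU device name.
    sudo nvidia-ctk cdi generate --output="$CDI_SPEC"
fi
```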

Even though I ran this image with OLLAMA_DEBUG=1, the only logs it produced were these:

georgelpreput@ai /e/a/sources.list.d> docker logs --follow ollama

Starting ollama server


OLLAMA_HOST   0.0.0.0:11434
OLLAMA_LOGS   /data/logs/ollama.log
OLLAMA_MODELS /data/models/ollama/models


ollama server is now started, and you can run commands here like 'ollama run gemma3'

@DavidDDD or @AastaLLL, given that this nvidia-ai-iot/ollama Docker image seems to keep working over time without crashing, I’ll keep running with it, but I do have a number of questions, if you’d be so kind:

  • What are the differences between this Docker image and the official Ollama image, i.e. why doesn’t the ollama/ollama image also work?
  • Any idea why nvidia-ai-iot/ollama seems less performant than ollama/ollama (on the few occasions that ollama/ollama actually works)?
  • Given that there’s only a single tag for the nvidia-ai-iot/ollama image, and that it dates back to August, are there any plans to keep up with official Ollama releases?
  • Alternatively, what can one do to replicate this image with a newer Ollama installation? I couldn’t find a Dockerfile for nvidia-ai-iot/ollama.

Thank you!