That did not come from you; it came from his question:
“Is the drive you attach externally powered? If not you might want try that.”
I am not actually running the workload from an externally attached drive in the critical setup. So external-drive power is probably not the main factor here.
Regarding manual reboots: I do not normally reboot Thor twice a day. In my old stable setup, that was not necessary. The system was used very heavily and was still stable.
Regarding “drop_caches”: I do use it, but not on a fixed schedule. I use it depending on the situation. If memory is full but I know the models are no longer active, I clear caches manually. That can be once a day, or it can be many times per day during heavy testing.
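For reference, the manual cache clearing described above could be scripted roughly as follows. This is only a sketch: the 10% threshold and the function names (`parse_meminfo`, `should_drop_caches`, `maybe_drop_caches`) are assumptions for illustration, not what I actually run, and the script cannot know whether the models are still active. Writing to `/proc/sys/vm/drop_caches` requires root.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict (values are kB for most entries)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def should_drop_caches(info, min_available_fraction=0.10):
    """True when MemAvailable is below the given fraction of MemTotal.

    The 10% default is an arbitrary illustrative threshold.
    """
    total = info.get("MemTotal", 0)
    available = info.get("MemAvailable", 0)
    return total > 0 and available / total < min_available_fraction


def drop_caches():
    """Flush dirty pages, then drop page cache, dentries and inodes (needs root)."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def maybe_drop_caches(meminfo_path="/proc/meminfo"):
    """Read current memory state and drop caches only if memory looks tight."""
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    if should_drop_caches(info):
        drop_caches()
        return True
    return False
```

Calling `maybe_drop_caches()` as root after a model has been unloaded would mirror the "only when it makes sense" approach, instead of clearing caches on a fixed schedule.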
The strange part is that this does not look like normal memory pressure, and I don’t think Thor is generally unable to handle these workloads. Before the first hard crash, my setup was extremely stable. I used it daily for visual AI tests in quality assurance at my main employer: analyzing large amounts of images, generating many images with Flux Schnell, testing 3D workflows, and running heavy workloads without issues.
Flux Schnell using 40–50 GB of unified memory was normal for my setup. I generated hundreds of images like that on Thor before, so the memory usage itself does not seem to be the real problem.
The issue seems to start after one hard crash / reboot event from a still unknown cause. After that, the system appears to remain permanently unstable: Ollama Vision starts crashing again, ComfyUI/Flux becomes dangerous to run, and the same kinds of workloads that were stable before suddenly trigger reboots.
After reinstalling JetPack 7.0 cleanly, the system seems to behave like the old stable setup again. I tested Ollama Vision with many full-size images across different models and so far there were no crashes.
So my current suspicion is that one specific container / CUDA / PyTorch / offloading path triggered a hard GPU/system failure, and after that the system state became unreliable. It may be related to aggressive pinned memory, async offloading, CUDA kernels, container runtime state, or something similar. It is very hard to isolate, though, because the actual crash does not leave useful logs.
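One way to narrow this down despite the missing logs might be a small heartbeat file that is fsynced after every write, so the last record before a hard reset shows roughly which workload stage was running. The sketch below is a best-effort idea under my own assumptions: the function names, the JSON-lines format, and the `label` field are hypothetical, and an fsync cannot guarantee survival of every kind of hardware failure.

```python
import json
import os
import time


def append_durable(path, record):
    """Append one JSON line and fsync it so it is likely to survive an abrupt reset."""
    line = json.dumps(record) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to push it to storage


def read_last_record(path):
    """Return the last complete JSON record in the file, or None if there is none."""
    last = None
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                last = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a torn line left by a crash mid-write
    return last


def heartbeat(path, label, interval_s=1.0, iterations=None):
    """Write a timestamped record every interval_s seconds.

    label is a free-form marker for the current workload stage.
    iterations=None runs until the process (or the machine) dies.
    """
    n = 0
    while iterations is None or n < iterations:
        append_durable(path, {"t": time.time(), "label": label, "seq": n})
        n += 1
        time.sleep(interval_s)
```

After an unexpected reboot, `read_last_record(path)` gives the last stage marker and timestamp that reached disk, which could at least show whether the crash happened during model load, offloading, or sampling.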
For my use case, rebooting twice a day or carefully limiting every workload is not really an acceptable solution. The system has to be production-ready. I eventually want this stack to support robotics/vision workloads, so it cannot behave like a fragile desktop experiment. A robot brain that randomly reboots under load is not useful.
My next step is to test a cleaner CUDA 13 / PyTorch 25.11 ComfyUI container built for Spark/GB10-class systems, because Spark and Thor are at least closer in architecture. If that fails as well, I may move to JetPack 7.1 and test the whole core stack again immediately: ComfyUI, Flux Schnell, WAN, and Ollama Vision.