Unable to load large models on Jetson Orin Nano Super despite sufficient RAM

This just started happening recently?

I’m encountering a GPU memory allocation issue on a Jetson Orin Nano Super. Previously, I could run multiple large models (up to ~8B parameters) successfully, but now even smaller models (e.g., 1.3B parameters) fail to load.

System Info:

Jetson Orin Nano Super Developer Kit

L4T r36.4.7 (JetPack 6), Kernel 5.15.148-tegra

Total RAM: 7.6 GB

Swap: ~19 GB

Storage: NVMe SSD

Docker container with shm_size set to 8 GB

Observed Behavior:

Attempting to load models results in errors such as:

error loading model: unable to allocate CUDA0 buffer
cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_model_load_from_file_impl: failed to load model

tegrastats shows free RAM, but the largest contiguous block (lfb) is very small (e.g., 14×4 MB).

Stopping the GUI (sudo systemctl stop gdm) does not improve the largest contiguous block.

Small models (~1B parameters) can still load, but anything larger fails.
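For reference, here is a quick way to check how fragmented physical memory is; the sample `tegrastats` line in the comment is illustrative, and field formats may vary slightly between releases:

```shell
# Per-order free block counts; the right-most columns are the largest
# contiguous blocks (order 10 = 4 MB on a 4 KB-page kernel).
cat /proc/buddyinfo

# Extract just the largest-free-block (lfb) field from tegrastats output,
# e.g. "lfb 14x4MB" from "RAM 3481/7620MB (lfb 14x4MB) ...":
tegrastats --interval 1000 | grep --line-buffered -o 'lfb [0-9]*x[0-9]*MB'
```

A healthy idle system typically reports lfb blocks of 4 MB each in the hundreds; a handful (like 14x4MB here) indicates heavy fragmentation.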

Additional Notes:

The Docker container runs Ollama with shm_size=8g.

Even after a fresh reboot, the system cannot allocate the required GPU memory for models that previously worked.

The issue appears related to contiguous GPU memory availability, not total RAM usage.

Questions:

  1. Could recent JetPack/kernel updates have changed memory reservation or fragmentation behavior?

  2. Is there a way to force larger contiguous blocks of GPU memory for CUDA allocations?

  3. Any recommended workarounds for loading models in the 1.3B–8B parameter range on this hardware?

*** Please note that this reply is generated by LLM automatically ***

The issue you’re experiencing with GPU memory allocation on your Jetson Orin Nano Super is likely related to the contiguous GPU memory availability, rather than the total RAM usage. This is a common problem in systems with limited GPU memory, where fragmentation can occur, making it difficult to allocate large contiguous blocks of memory.

Recent JetPack/kernel updates might have changed memory reservation or fragmentation behavior, which could be contributing to the issue. However, without more information about the specific updates and changes made, it’s difficult to pinpoint the exact cause.

To put the hardware in context, here is how the Jetson Orin Nano Super compares with similar devices (all Jetson modules use unified memory shared between the CPU and GPU):

  • Jetson Orin Nano Super Developer Kit: 8 GB LPDDR5 unified memory, NVMe SSD storage
  • Jetson Orin Nano: 4 GB or 8 GB unified memory, typically microSD storage
  • Jetson AGX Orin: 32 GB or 64 GB unified memory, eMMC storage

As you can see, the Orin Nano Super sits at the low end of this range, and because the memory is unified, everything the CPU side uses (desktop, containers, file cache) reduces what remains for CUDA allocations.

To force larger contiguous blocks of GPU memory for CUDA allocations, you can try the following workarounds:

  1. Reduce the shm_size: Try reducing the shm_size in your Docker container to a smaller value, such as 4 GB or 2 GB. This might reduce competing memory reservations and leave room for larger contiguous blocks of GPU memory.
  2. Use CUDA memory pooling: CUDA's stream-ordered allocator (cudaMallocAsync with a memory pool) reuses freed memory within a pool, which can reduce fragmentation and improve the chances of satisfying large contiguous allocations.
  3. Use CUDA debugging tools: Tools such as compute-sanitizer can help you identify memory allocation issues and provide insight into memory usage and fragmentation.
  4. Optimize your model: If possible, try optimizing your model to reduce its memory requirements. This could involve reducing the number of parameters, using more efficient data types, or applying model pruning techniques.
  5. Use different hardware: The Orin Nano's GPU is integrated and shares the 8 GB of system memory, so it cannot be upgraded; if none of the above workarounds help, a module with more memory (e.g., Jetson AGX Orin) may be necessary.
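As a sketch of workaround 1, the container could be recreated with a smaller shared-memory segment; the container name, volume, and port below follow the common Ollama setup and are examples to adjust to your own configuration:

```shell
# Remove the existing container (the model volume is kept) and
# recreate it with a 4 GB shared-memory segment instead of 8 GB:
docker rm -f ollama
docker run -d --runtime nvidia --shm-size=4g \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```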

In terms of loading models in the 1.3B–8B parameter range on your hardware, you could try the following:

  1. Use model parallelism: Model parallelism allows you to split your model across multiple GPUs, which can help reduce the memory requirements for each GPU.
  2. Use gradient checkpointing: Gradient checkpointing trades compute for memory by discarding intermediate activations during the forward pass and recomputing them during backpropagation, which reduces the memory required for large models.
  3. Use mixed precision training: Mixed precision training allows you to use lower precision data types for certain parts of the model, which can help reduce the memory requirements.
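On the inference side (Ollama), the closest analogue to mixed precision is pulling a more heavily quantized build of the same model. The tag below is only an example of the naming scheme; check the Ollama model library for tags that actually exist:

```shell
# A 4-bit quantized 8B model needs roughly 4-5 GB of memory,
# versus ~16 GB for the same model at FP16:
ollama pull llama3.1:8b-instruct-q4_0
ollama run llama3.1:8b-instruct-q4_0 "Hello"
```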

By trying out these workarounds and optimizations, you should be able to improve the contiguous GPU memory availability on your Jetson Orin Nano Super and load larger models successfully.


Try the steps here to clear memory cache.

osnava/learnJetson: Learning to use this device. You’ll find edge AI applications and useful commands to optimize your Jetson
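A minimal sketch of the cache-clearing step, plus a request for the kernel to compact free pages into larger contiguous blocks (both need root; the values written are the documented ones for `drop_caches` and `compact_memory`):

```shell
# Flush dirty pages, then drop the page cache, dentries, and inodes:
sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
# Ask the kernel to defragment free memory into larger contiguous blocks:
sudo sh -c 'echo 1 > /proc/sys/vm/compact_memory'
```

Re-checking `lfb` in tegrastats afterwards shows whether compaction actually recovered large blocks.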

Hi,

We observe this behavior after the system automatically upgrades to r36.4.7.
Could you try to remove all the ollama cache and re-download it to see if it can work?

Thanks.
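A sketch of that suggestion, assuming the `ollama` CLI is reachable from the shell (the model name is an example):

```shell
ollama list            # see which models are cached locally
ollama rm llama3.1:8b  # remove the cached blobs for a model
ollama pull llama3.1:8b  # re-download a fresh copy
```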

Hey, I have the same issue running vila-2.7b and vila1.5-3b. What should I do about it?

Thank you, everyone, for the responses. I finally got responses from my 3.3B model and TinyLlama 1.1B. I did a few things, and I'm not 100% sure about the exact solution, but it was a combo of JumikoSK, AastaLLL, and a bit of carolyuu that got me working. The steps I took were to:

  1. Remove the GUI by running sudo systemctl set-default multi-user.target. jtop with the newer code does not let me adjust additional settings.
  2. Remove and re-pull all Ollama models.
    By this point I was still struggling, but I saw that after reboot I also had an instance of Stable Diffusion Web UI running. I spun this down before spinning back up the Ollama / Open WebUI combo, and I finally started getting responses back.
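The steps above can be sketched as a command sequence; the model name and the Stable Diffusion container name are placeholders for whatever is running on your system:

```shell
# 1. Boot to text mode so the desktop doesn't hold GPU memory:
sudo systemctl set-default multi-user.target
sudo reboot

# 2. After reboot, re-pull the Ollama models (repeat per model):
ollama rm <model>
ollama pull <model>

# 3. Make sure no other GPU workload (e.g., Stable Diffusion Web UI)
#    is running before starting the Ollama / Open WebUI combo:
docker ps
docker stop <stable-diffusion-container>
```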

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.