Jetson Orin Nano Super insufficient GPU memory

Hi,

I have encountered an "Insufficient GPU memory" error when trying to run the DeepSeek R1 Qwen-1.5B model. The log follows:

root@0737e6e49070:~# free -h
               total        used        free      shared  buff/cache   available
Mem:           7.4Gi       3.1Gi       1.5Gi        76Mi       2.8Gi       4.0Gi
Swap:           15Gi       0.0Ki        15Gi
root@0737e6e49070:~# sudonim serve \
      --model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC \
      --quantization q4f16_ft \
      --max-batch-size 1 \
      --prefill-chunk 512 \
      --chat-template deepseek_r1_qwen \
      --host 0.0.0.0 \
      --port 9000

[05:37:13] sudonim | sudonim version 0.1.6

┌──────────────────────────┬─────────────────────────────┬──────────────────────────────┐
│ CUDA_VERSION   12.6      │ GPU 0                       │ CACHE_ROOT      /root/.cache │
│ NVIDIA_DRIVER  540.4.0   │  ├ name      Orin Nano 8GB  │ HAS_MLC         True         │
│ SYSTEM_ID      orin-nano │  ├ family    Ampere         │ HAS_HF_HUB      True         │
│ CPU_ARCH       aarch64   │  ├ cores     1024           │ HAS_NVIDIA_SMI  True         │
│ GPU_ARCH       sm87      │  ├ mem_free  [3.7 / 7.6 GB] │                              │
└──────────────────────────┴─────────────────────────────┴──────────────────────────────┘

[05:37:14] sudonim | Downloading model from HF Hub:  dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC -> /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Fetching 38 files: 100%|███████████| 38/38 [00:00<00:00, 904.37it/s]
[05:37:18] sudonim | Downloaded model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC to:  /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
[05:37:18] sudonim | Loading model 'DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC' from /root/.cache/mlc_llm/dusty-nv 

  mlc_llm serve --mode interactive --device cuda \
    --host 0.0.0.0 --port 9000 \
    --overrides='tensor_parallel_shards=1;prefill_chunk_size=512' \
    --model-lib /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC/aarch64-cu126-sm87.so \
    DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC

[2025-04-18 05:37:22] INFO auto_device.py:79: Found device: cuda:0
[2025-04-18 05:37:22] INFO engine_base.py:142: Using library model: /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC/aarch64-cu126-sm87.so
[2025-04-18 05:37:22] INFO engine_base.py:186: The selected engine mode is interactive. We fix max batch size to 1 for interactive single sequence use.
[2025-04-18 05:37:22] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2025-04-18 05:37:22] INFO engine_base.py:210: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:722: temp_buffer = 10616.000
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:723: kv_aux workspace = 104.813
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:724: model workspace = 48.034
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:725: logit processor workspace = 2.336
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:722: temp_buffer = 10616.000
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:723: kv_aux workspace = 104.813
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:724: model workspace = 48.034
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:725: logit processor workspace = 2.336
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:722: temp_buffer = 10616.000
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:723: kv_aux workspace = 104.813
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:724: model workspace = 48.034
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:725: logit processor workspace = 2.336
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (5) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(TVMFuncCall+0x68) [0xffff9e1e0f98]
  [bt] (4) /usr/local/lib/python3.10/dist-packages/mlc_llm/libmlc_llm_module.so(mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()+0x824) [0xffff77298054]
  [bt] (3) /usr/local/lib/python3.10/dist-packages/mlc_llm/libmlc_llm_module.so(mlc::llm::serve::ThreadedEngineImpl::EngineReloadImpl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x2cc) [0xffff772956fc]
  [bt] (2) /usr/local/lib/python3.10/dist-packages/mlc_llm/libmlc_llm_module.so(+0x303cfc) [0xffff77293cfc]
  [bt] (1) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x68) [0xffff9c31c7f8]
  [bt] (0) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff9e22bfc0]
  File "/opt/mlc-llm/cpp/serve/threaded_engine.cc", line 287
TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 6476.864 MB, which is less than the sum of model weight size (876.640 MB) and temporary buffer size (10771.183 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing `--tensor-parallel-shards $NGPU` to `mlc_llm gen_config` or use quantization.
3. If the temporary buffer size is too large, please use a smaller `--prefill-chunk-size` in `mlc_llm gen_config`.

I have tried adding a swapfile and setting the prefill-chunk-size, but it did not seem to work. Could you help me solve this issue?

jetson_release:
Software part of jetson-stats 4.3.2 - (c) 2024, Raffaello Bonghi
Model: NVIDIA Jetson Orin Nano Engineering Reference Developer Kit Super - Jetpack 6.2 [L4T 36.4.3]
NV Power Mode[2]: MAXN_SUPER
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - P-Number: p3767-0005
 - Module: NVIDIA Jetson Orin Nano (Developer kit)
Platform:
 - Distribution: Ubuntu 22.04 Jammy Jellyfish
 - Release: 5.15.148-tegra
jtop:
 - Version: 4.3.2
 - Service: Active
Libraries:
 - CUDA: Not installed
 - cuDNN: Not installed
 - TensorRT: Not installed
 - VPI: Not installed
 - Vulkan: 1.3.204
 - OpenCV: 4.5.4 - with CUDA: NO

Thanks!
richer.chan

This is probably a question for @dusty_nv or someone other than me, but I will start by adding that the GPU on the Jetson is integrated directly into the memory controller (an iGPU) and does not have its own memory. This iGPU shares system RAM. However, the only memory the GPU can use is physically addressed memory, meaning that virtual memory from swap cannot be used.

Along with requiring physical RAM, I think the memory block usually has to be contiguous and not fragmented (I'm not sure what exceptions there might be for that).

Thus adding swap won't help, at least not directly. It is possible that some user space process would swap out and release RAM, which could then be useful. However, if that RAM was not contiguous with the current RAM block, then it might still not help.

Sometimes people add a boot parameter to reserve a memory block specifically for the GPU, e.g., via an argument in the "/boot/extlinux/extlinux.conf" file. Someone else might have a suggestion for whether this would be useful or not, but it could mean you just need a leaner model.
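
If you want a quick (generic, not MLC-specific) way to see how much contiguous free RAM there actually is before loading the model, the "lfb" field that tegrastats prints is the largest free block, and the kernel's buddy allocator stats show free pages per order (higher orders mean larger contiguous chunks):

$ sudo tegrastats        # look at the "lfb NxSIZE" field next to RAM
$ cat /proc/buddyinfo    # free page counts per order for each zone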

Hi linuxdev,
Thank you for your analysis and suggestions. I saw on the Jetson official website that there is an introduction to swap, mainly about how to optimize memory (Memory optimization - NVIDIA Jetson AI Lab). I want to test the performance of the Jetson Orin Nano 8G according to this; the steps I plan to follow are sketched after the command below. Perhaps, as you suggested, a leaner model is needed. Strangely, the model was also downloaded from the official website, yet this insufficient-memory error occurred.
model: DeepSeek R1 Qwen-1.5B

docker run -it --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  dustynv/mlc:r36.4.0 \
    sudonim serve \
      --model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC \
      --quantization q4f16_ft \
      --max-batch-size 1 \
      --chat-template deepseek_r1_qwen \
      --host 0.0.0.0 \
      --port 9000
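
For reference, the steps from that memory-optimization page that I am following are roughly these (a sketch from my understanding; the swap path is just an example for an NVMe mount):

# disable the desktop GUI so it releases RAM
sudo systemctl set-default multi-user.target

# disable zram and use a swap file on NVMe instead (this only helps CPU-side processes, per the discussion above)
sudo systemctl disable nvzramconfig
sudo fallocate -l 16G /mnt/nvme/16GB.swap
sudo mkswap /mnt/nvme/16GB.swap
sudo swapon /mnt/nvme/16GB.swap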

Hi,

Have you followed the memory optimization steps (as you shared) and then run the command?
This is required for Orin Nano to run LLM models, given its relatively limited memory.

Thanks.

Hi AastaLLL,

I have followed the memory optimization steps and run the commands written in the web doc (Memory optimization - NVIDIA Jetson AI Lab).
I have now tested some models:
pass:

bartowski/DeepSeek-R1-Distill-Qwen-1.5B-GGUF/DeepSeek-R1-Distill-Qwen-1.5B-Q4_K_M.gguf
dusty-nv/Qwen2.5-7B-Instruct-q4f16_ft-MLC

fail with insufficient GPU memory:

dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
dusty-nv/DeepSeek-R1-Distill-Qwen-7B-q4f16_ft-MLC

I checked the logs, and it seems that whether a model runs depends on the temporary buffer size. I tried to modify the prefill-chunk-size, but it did not seem to work.
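
For reference, this is how I passed the prefill-chunk-size (same as in my first post). sudonim forwards it to mlc_llm serve as an override, but the temp_buffer in the log stays at 10616.000 MB, so my guess is that the serve-time override does not change the pre-built engine configuration:

sudonim serve \
      --model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC \
      --quantization q4f16_ft \
      --max-batch-size 1 \
      --prefill-chunk 512 \
      --chat-template deepseek_r1_qwen \
      --host 0.0.0.0 \
      --port 9000

# which sudonim turns into (see the earlier log):
# mlc_llm serve ... --overrides='tensor_parallel_shards=1;prefill_chunk_size=512' ...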

Hi AastaLLL,

The command where I modified the prefill-chunk-size is shown above.

Thanks in advance.

Hi,

Could you verify the memory usage with tegrastats in another console concurrently?

$ sudo tegrastats

Thanks.

Hi AastaLLL,
I ran the command you suggested; the log follows:

~$ tegrastats
04-24-2025 09:49:12 RAM 1603/7620MB (lfb 9x4MB) SWAP 0/16384MB (cached 0MB) CPU [0%@960,0%@960,1%@960,0%@960,0%@729,0%@729] GR3D_FREQ 0% cpu@48.437C soc2@47.531C soc0@48.062C gpu@48.375C tj@49.812C soc1@49.812C VDD_IN 4629mW/4629mW VDD_CPU_GPU_CV 522mW/522mW VDD_SOC 1406mW/1406mW
04-24-2025 09:49:13 RAM 1603/7620MB (lfb 9x4MB) SWAP 0/16384MB (cached 0MB) CPU [1%@729,0%@729,0%@729,1%@729,0%@729,0%@729] GR3D_FREQ 0% cpu@48.437C soc2@47.593C soc0@48.093C gpu@48.406C tj@49.937C soc1@49.937C VDD_IN 4629mW/4629mW VDD_CPU_GPU_CV 522mW/522mW VDD_SOC 1406mW/1406mW
04-24-2025 09:49:14 RAM 1708/7620MB (lfb 2x4MB) SWAP 0/16384MB (cached 0MB) CPU [12%@1728,11%@1728,7%@1728,45%@1728,7%@729,8%@729] GR3D_FREQ 0% cpu@48.656C soc2@47.593C soc0@48.031C gpu@48.781C tj@50.125C soc1@50.125C VDD_IN 5256mW/4838mW VDD_CPU_GPU_CV 963mW/669mW VDD_SOC 1484mW/1432mW
04-24-2025 09:49:15 RAM 1738/7620MB (lfb 2x4MB) SWAP 0/16384MB (cached 0MB) CPU [3%@1728,1%@1728,70%@1728,15%@1728,0%@729,0%@729] GR3D_FREQ 0% cpu@49.187C soc2@47.687C soc0@48.062C gpu@48.906C tj@49.937C soc1@49.937C VDD_IN 5818mW/5083mW VDD_CPU_GPU_CV 1400mW/852mW VDD_SOC 1482mW/1445mW
04-24-2025 09:49:16 RAM 1768/7620MB (lfb 1x4MB) SWAP 0/16384MB (cached 0MB) CPU [1%@1728,0%@1728,17%@1728,75%@1728,0%@729,1%@729] GR3D_FREQ 0% cpu@49.156C soc2@47.875C soc0@48.125C gpu@48.937C tj@50.031C soc1@50.031C VDD_IN 5889mW/5244mW VDD_CPU_GPU_CV 1480mW/977mW VDD_SOC 1482mW/1452mW
04-24-2025 09:49:17 RAM 1896/7620MB (lfb 1x4MB) SWAP 0/16384MB (cached 0MB) CPU [7%@1728,7%@1728,65%@1728,36%@1728,7%@729,6%@729] GR3D_FREQ 0% cpu@49.718C soc2@47.843C soc0@48.187C gpu@48.625C tj@50.062C soc1@50.062C VDD_IN 6009mW/5372mW VDD_CPU_GPU_CV 1560mW/1075mW VDD_SOC 1482mW/1457mW
04-24-2025 09:49:18 RAM 1928/7620MB (lfb 1x2MB) SWAP 0/16384MB (cached 0MB) CPU [0%@1728,0%@1728,100%@1728,0%@1728,0%@729,0%@729] GR3D_FREQ 0% cpu@49.593C soc2@47.906C soc0@48.062C gpu@48.781C tj@50.187C soc1@50.187C VDD_IN 6009mW/5463mW VDD_CPU_GPU_CV 1560mW/1144mW VDD_SOC 1482mW/1461mW
04-24-2025 09:49:19 RAM 1786/7620MB (lfb 2x4MB) SWAP 1/16384MB (cached 0MB) CPU [1%@1190,1%@1190,84%@1190,8%@1190,0%@729,0%@729] GR3D_FREQ 0% cpu@48.968C soc2@47.843C soc0@48.125C gpu@48.781C tj@50.093C soc1@50.093C VDD_IN 5849mW/5511mW VDD_CPU_GPU_CV 1320mW/1166mW VDD_SOC 1522mW/1468mW
04-24-2025 09:49:20 RAM 2040/7620MB (lfb 1x4MB) SWAP 1/16384MB (cached 0MB) CPU [4%@1728,4%@1728,0%@1728,87%@1728,1%@1267,0%@1267] GR3D_FREQ 0% cpu@49.187C soc2@48C soc0@48.156C gpu@48.781C tj@50.125C soc1@50.125C VDD_IN 5969mW/5562mW VDD_CPU_GPU_CV 1440mW/1196mW VDD_SOC 1522mW/1474mW
04-24-2025 09:49:21 RAM 2388/7620MB (lfb 4x4MB) SWAP 1/16384MB (cached 0MB) CPU [1%@729,1%@729,0%@729,18%@729,22%@729,0%@729] GR3D_FREQ 0% cpu@48.812C soc2@47.812C soc0@48.218C gpu@48.906C tj@50.187C soc1@50.187C VDD_IN 4903mW/5496mW VDD_CPU_GPU_CV 722mW/1149mW VDD_SOC 1446mW/1471mW
04-24-2025 09:49:22 RAM 2388/7620MB (lfb 4x4MB) SWAP 1/16384MB (cached 0MB) CPU [1%@729,0%@729,0%@729,0%@729,0%@729,0%@729] GR3D_FREQ 0% cpu@48.625C soc2@47.718C soc0@48.218C gpu@48.468C tj@50.187C soc1@50.187C VDD_IN 4629mW/5417mW VDD_CPU_GPU_CV 522mW/1092mW VDD_SOC 1406mW/1465mW
04-24-2025 09:49:23 RAM 2389/7620MB (lfb 4x4MB) SWAP 1/16384MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,0%@729] GR3D_FREQ 0% cpu@48.531C soc2@47.718C soc0@48.156C gpu@48.687C tj@50.062C soc1@50.062C VDD_IN 4629mW/5352mW VDD_CPU_GPU_CV 522mW/1044mW VDD_SOC 1406mW/1460mW
04-24-2025 09:49:24 RAM 2389/7620MB (lfb 4x4MB) SWAP 1/16384MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,0%@729] GR3D_FREQ 0% cpu@48.625C soc2@47.75C soc0@48.25C gpu@48.593C tj@50.031C soc1@50.031C VDD_IN 4629mW/5296mW VDD_CPU_GPU_CV 522mW/1004mW VDD_SOC 1406mW/1456mW
# sudonim serve \
      --model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC \
      --quantization q4f16_ft \
      --max-batch-size 1 \
      --chat-template deepseek_r1_qwen \
      --host 0.0.0.0 \
      --port 9000

[01:49:09] sudonim | sudonim version 0.1.6

┌──────────────────────────┬─────────────────────────────┬──────────────────────────────┐
│ CUDA_VERSION   12.6      │ GPU 0                       │ CACHE_ROOT      /root/.cache │
│ NVIDIA_DRIVER  540.4.0   │  ├ name      Orin Nano 8GB  │ HAS_MLC         True         │
│ SYSTEM_ID      orin-nano │  ├ family    Ampere         │ HAS_HF_HUB      True         │
│ CPU_ARCH       aarch64   │  ├ cores     1024           │ HAS_NVIDIA_SMI  True         │
│ GPU_ARCH       sm87      │  ├ mem_free  [6.2 / 7.6 GB] │                              │
└──────────────────────────┴─────────────────────────────┴──────────────────────────────┘

[01:49:09] sudonim | Downloading model from HF Hub:  dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC -> /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
Fetching 38 files:   0%|                                                                                                                                  | 0/38 [00:00<?, ?it/s]/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Fetching 38 files: 100%|██████████████████████████████| 38/38 [00:00<00:00, 1084.34it/s]
[01:49:13] sudonim | Downloaded model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC to:  /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
[01:49:13] sudonim | Loading model 'DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC' from /root/.cache/mlc_llm/dusty-nv

  mlc_llm serve --mode interactive --device cuda \
    --host 0.0.0.0 --port 9000 \
    --overrides='tensor_parallel_shards=1' \
    --model-lib /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC/aarch64-cu126-sm87.so \
    DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC

[2025-04-24 01:49:19] INFO auto_device.py:79: Found device: cuda:0
[2025-04-24 01:49:19] INFO engine_base.py:142: Using library model: /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC/aarch64-cu126-sm87.so
[2025-04-24 01:49:19] INFO engine_base.py:186: The selected engine mode is interactive. We fix max batch size to 1 for interactive single sequence use.
[2025-04-24 01:49:19] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2025-04-24 01:49:19] INFO engine_base.py:210: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
[01:49:20] /opt/mlc-llm/cpp/serve/config.cc:722: temp_buffer = 10616.000
[01:49:20] /opt/mlc-llm/cpp/serve/config.cc:723: kv_aux workspace = 104.813
[01:49:20] /opt/mlc-llm/cpp/serve/config.cc:724: model workspace = 48.034
[01:49:20] /opt/mlc-llm/cpp/serve/config.cc:725: logit processor workspace = 2.336
[01:49:20] /opt/mlc-llm/cpp/serve/config.cc:722: temp_buffer = 10616.000
[01:49:20] /opt/mlc-llm/cpp/serve/config.cc:723: kv_aux workspace = 104.813
[01:49:20] /opt/mlc-llm/cpp/serve/config.cc:724: model workspace = 48.034
[01:49:20] /opt/mlc-llm/cpp/serve/config.cc:725: logit processor workspace = 2.336
[01:49:20] /opt/mlc-llm/cpp/serve/config.cc:722: temp_buffer = 10616.000
[01:49:20] /opt/mlc-llm/cpp/serve/config.cc:723: kv_aux workspace = 104.813
[01:49:20] /opt/mlc-llm/cpp/serve/config.cc:724: model workspace = 48.034
[01:49:20] /opt/mlc-llm/cpp/serve/config.cc:725: logit processor workspace = 2.336
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
  File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
  File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
  File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
  File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
    raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
  [bt] (5) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(TVMFuncCall+0x68) [0xffff7a130f98]
  [bt] (4) /usr/local/lib/python3.10/dist-packages/mlc_llm/libmlc_llm_module.so(mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()+0x824) [0xffff531e8054]
  [bt] (3) /usr/local/lib/python3.10/dist-packages/mlc_llm/libmlc_llm_module.so(mlc::llm::serve::ThreadedEngineImpl::EngineReloadImpl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x2cc) [0xffff531e56fc]
  [bt] (2) /usr/local/lib/python3.10/dist-packages/mlc_llm/libmlc_llm_module.so(+0x303cfc) [0xffff531e3cfc]
  [bt] (1) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x68) [0xffff7826c7f8]
  [bt] (0) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff7a17bfc0]
  File "/opt/mlc-llm/cpp/serve/threaded_engine.cc", line 287
TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 6476.864 MB, which is less than the sum of model weight size (876.640 MB) and temporary buffer size (10771.183 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing `--tensor-parallel-shards $NGPU` to `mlc_llm gen_config` or use quantization.
3. If the temporary buffer size is too large, please use a smaller `--prefill-chunk-size` in `mlc_llm gen_config`.

Hi,

It looks like there is still some memory available in your system, so it should not trigger OOM.
We will give it a try and provide more info to you.

Thanks.


This is interesting. Did you try using ollama with DeepSeek R1 Qwen-1.5B?


I just tested running DeepSeek R1 7B and 8B with Ollama + OpenWebUI. The Jetson Orin Nano Dev Kit 8GB can handle it.
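
In case it is useful for comparison, this is roughly what I ran (assuming ollama is already installed, e.g. from jetson-containers; the model tags are the standard Ollama ones):

$ ollama run deepseek-r1:7b
$ ollama run deepseek-r1:8b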

I wasn't using ollama; I just followed the Microservices Intro - NVIDIA Jetson AI Lab tutorial, which uses MLC_LLM by default. I also tried llama.cpp, and that works.

So I think maybe the issue only occurs with the MLC-LLM sample?
Anyway, I really appreciate your assistance. Your explanation has been very helpful.

Hi AastaLLL,

I hope you're doing well.

Just following up to check if there are any updates regarding the issue we discussed earlier.
I understand these things can take time; I just wanted to see if there is any progress or anything you might need from my side.

Thanks a lot for your help!

Hi,

We can reproduce the same behavior internally.
Will check it further and update.

Thanks.

Hi AastaLLL,

Thank you very much for the clarification and for looking into this.
I'll wait for your further update on re-quantizing and compiling the engine on Orin Nano.

Thanks again for your support!

Hi,

We tested DeepSeek-R1-Distill-Qwen-1.5B, and it works correctly on Orin Nano.

Please find below the modification to sudonim (located at /opt/).
The change below forces the app to re-quantize and re-compile the engine with prefill-chunk-size=4096:

diff --git sudonim/runtimes/mlc.py sudonim/runtimes/mlc.py
index 8cb7497..f15eb51 100644
--- sudonim/runtimes/mlc.py
+++ sudonim/runtimes/mlc.py
@@ -74,10 +74,12 @@ class MLC:
         model_name = get_model_name(model)
         is_quant = quantization in model_name and '-mlc' in model_name.lower()

+        '''
         if not is_quant:
             quant_repo = MLC.find_quantized(model, quantization, **kwargs)
             if quant_repo:
                 model, is_quant = quant_repo, True
+        '''

         if not is_quant and not hf_hub_exists(model, warn=True, **kwargs):
             raise IOError(f"could not locate or access model {model}")
@@ -97,9 +99,9 @@ class MLC:

         weights = [x for x in Path(quant_path).glob('**/params_*.bin')]

-        if weights and cache_mode.quantization:
-            log.debug(f"Found existing quantized weights ({quant_path}), skipping quantization")
-            return quant_path #os.path.dirname(weights[0])
+        #if weights and cache_mode.quantization:
+        #    log.debug(f"Found existing quantized weights ({quant_path}), skipping quantization")
+        #    return quant_path #os.path.dirname(weights[0])

         cmd = [
             f'mlc_llm convert_weight --quantization {MLC.QuantizationMap.get(quantization, quantization)}',
@@ -114,13 +116,13 @@ class MLC:
     def config(model_path : str, quant_path : str, quantization: str=None, cache_mode=env.CACHE_MODE, **kwargs):
         config_path = os.path.join(quant_path, 'mlc-chat-config.json')

-        if os.path.isfile(config_path) and cache_mode.engine:
-            return config_path
+        #if os.path.isfile(config_path) and cache_mode.engine:
+        #    return config_path

         if 'chat_template' not in kwargs or not kwargs['chat_template']:
             kwargs['chat_template'] = MLC.get_chat_template(model_path)

-        cmd = [f'mlc_llm gen_config --quantization {MLC.QuantizationMap.get(quantization, quantization)}']
+        cmd = [f'mlc_llm gen_config --quantization {MLC.QuantizationMap.get(quantization, quantization)} --prefill-chunk-size 4096 ']
         cmd += MLC.overrides(packed=False, **kwargs)
         cmd += [f'--output {quant_path}', f'{model_path}']

@@ -131,9 +133,9 @@ class MLC:
     def compile(quant_path : str, cache_mode=env.CACHE_MODE, **kwargs):
         model_lib = MLC.find_model_lib(quant_path)

-        if model_lib and cache_mode.engine:
-            log.debug(f"Found existing model library ({model_lib}), skipping model builder")
-            return model_lib
+        #if model_lib and cache_mode.engine:
+        #    log.debug(f"Found existing model library ({model_lib}), skipping model builder")
+        #   return model_lib

         model_lib = os.path.join(quant_path, MLC.get_model_lib())

Test: the required memory is now 6476.847 MB.

# sudonim serve --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --quantization q4f16_ft --max-batch-size 1 --chat-template deepseek_r1_qwen --host 0.0.0.0 --port 9000

...

[09:32:29] sudonim | Loading model 'DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC' from /root/.cache/mlc_llm/deepseek-ai

  mlc_llm serve --mode interactive --device cuda \
    --host 0.0.0.0 --port 9000 \
    --overrides='tensor_parallel_shards=1' \
    --model-lib /root/.cache/mlc_llm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC/aarch64-cu126-sm87.so \
    DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC

[2025-04-28 09:32:34] INFO auto_device.py:79: Found device: cuda:0
[2025-04-28 09:32:34] INFO engine_base.py:142: Using library model: /root/.cache/mlc_llm/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC/aarch64-cu126-sm87.so
[2025-04-28 09:32:34] INFO engine_base.py:186: The selected engine mode is interactive. We fix max batch size to 1 for interactive single sequence use.
[2025-04-28 09:32:34] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2025-04-28 09:32:34] INFO engine_base.py:210: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
[09:32:35] /opt/mlc-llm/cpp/serve/config.cc:797: Under mode "local", max batch size will be set to 1, max KV cache token capacity will be set to 6928, prefill chunk size will be set to 4096.
[09:32:35] /opt/mlc-llm/cpp/serve/config.cc:797: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 6928, prefill chunk size will be set to 4096.
[09:32:35] /opt/mlc-llm/cpp/serve/config.cc:797: Under mode "server", max batch size will be set to 1, max KV cache token capacity will be set to 6928, prefill chunk size will be set to 4096.
[09:32:35] /opt/mlc-llm/cpp/serve/config.cc:878: The actual engine mode is "interactive". So max batch size is 1, max KV cache token capacity is 6928, prefill chunk size is 4096.
[09:32:35] /opt/mlc-llm/cpp/serve/config.cc:883: Estimated total single GPU memory usage: 6476.847 MB (Parameters: 876.640 MB. KVCache: 265.852 MB. Temporary buffer: 5334.355 MB). The actual usage might be slightly larger than the estimated number.
INFO:     Started server process [1491]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:9000 (Press CTRL+C to quit)

Thanks.


Hi AastaLLL,

Thank you for your update.

I'm currently verifying it, but since it requires downloading some files, it will take a bit more time.

Meanwhile, I have a quick question:
After modifying the code, do I also need to change the model path in the command from dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC to deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B?

Thank you again for your help!

Best regards,

Hi AastaLLL,

I have tested the solution you provided, and it is working correctly now.
Thank you very much for your support!

Hi,

Thanks for the verification.
The sample quantizes the engine from the original model, so please change the link to deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.

Thanks.

Hi AastaLLL,

OK, got it.

Thanks.