Hi,
I have encountered an "Insufficient GPU memory" error when I try to run the DeepSeek-R1 Distill Qwen-1.5B model. Here is the log:
root@0737e6e49070:~# free -h
               total        used        free      shared  buff/cache   available
Mem:           7.4Gi       3.1Gi       1.5Gi        76Mi       2.8Gi       4.0Gi
Swap:           15Gi       0.0Ki        15Gi
root@0737e6e49070:~# sudonim serve --model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC --quantization q4f16_ft --max-batch-size 1 --prefill-chunk 512 --chat-template deepseek_r1_qwen --host 0.0.0.0 --port 9000
[05:37:13] sudonim | sudonim version 0.1.6
┌─────────────────────────┬───────────────────────────┬───────────────────────────┐
│ CUDA_VERSION 12.6       │ GPU 0                     │ CACHE_ROOT /root/.cache   │
│ NVIDIA_DRIVER 540.4.0   │   name Orin Nano 8GB      │ HAS_MLC True              │
│ SYSTEM_ID orin-nano     │   family Ampere           │ HAS_HF_HUB True           │
│ CPU_ARCH aarch64        │   cores 1024              │ HAS_NVIDIA_SMI True       │
│ GPU_ARCH sm87           │   mem_free [3.7 / 7.6 GB] │                           │
└─────────────────────────┴───────────────────────────┴───────────────────────────┘
[05:37:14] sudonim | Downloading model from HF Hub: dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC -> /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
Fetching 38 files: 100%|███████████| 38/38 [00:00<00:00, 904.37it/s]
[05:37:18] sudonim | Downloaded model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC to: /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
[05:37:18] sudonim | Loading model 'DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC' from /root/.cache/mlc_llm/dusty-nv
mlc_llm serve --mode interactive --device cuda \
--host 0.0.0.0 --port 9000 \
--overrides='tensor_parallel_shards=1;prefill_chunk_size=512' \
--model-lib /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC/aarch64-cu126-sm87.so \
DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC
[2025-04-18 05:37:22] INFO auto_device.py:79: Found device: cuda:0
[2025-04-18 05:37:22] INFO engine_base.py:142: Using library model: /root/.cache/mlc_llm/dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC/aarch64-cu126-sm87.so
[2025-04-18 05:37:22] INFO engine_base.py:186: The selected engine mode is interactive. We fix max batch size to 1 for interactive single sequence use.
[2025-04-18 05:37:22] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2025-04-18 05:37:22] INFO engine_base.py:210: If you have high concurrent requests and want to maximize the GPU memory utilization, please select mode "server".
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:722: temp_buffer = 10616.000
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:723: kv_aux workspace = 104.813
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:724: model workspace = 48.034
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:725: logit processor workspace = 2.336
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:722: temp_buffer = 10616.000
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:723: kv_aux workspace = 104.813
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:724: model workspace = 48.034
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:725: logit processor workspace = 2.336
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:722: temp_buffer = 10616.000
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:723: kv_aux workspace = 104.813
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:724: model workspace = 48.034
[05:37:23] /opt/mlc-llm/cpp/serve/config.cc:725: logit processor workspace = 2.336
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "tvm/_ffi/_cython/./packed_func.pxi", line 339, in tvm._ffi._cy3.core.PackedFuncBase.__call__
File "tvm/_ffi/_cython/./packed_func.pxi", line 270, in tvm._ffi._cy3.core.FuncCall
File "tvm/_ffi/_cython/./packed_func.pxi", line 259, in tvm._ffi._cy3.core.FuncCall3
File "tvm/_ffi/_cython/./base.pxi", line 185, in tvm._ffi._cy3.core.CHECK_CALL
File "/usr/local/lib/python3.10/dist-packages/tvm/_ffi/base.py", line 481, in raise_last_ffi_error
raise py_err
tvm._ffi.base.TVMError: Traceback (most recent call last):
[bt] (5) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(TVMFuncCall+0x68) [0xffff9e1e0f98]
[bt] (4) /usr/local/lib/python3.10/dist-packages/mlc_llm/libmlc_llm_module.so(mlc::llm::serve::ThreadedEngineImpl::RunBackgroundLoop()+0x824) [0xffff77298054]
[bt] (3) /usr/local/lib/python3.10/dist-packages/mlc_llm/libmlc_llm_module.so(mlc::llm::serve::ThreadedEngineImpl::EngineReloadImpl(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x2cc) [0xffff772956fc]
[bt] (2) /usr/local/lib/python3.10/dist-packages/mlc_llm/libmlc_llm_module.so(+0x303cfc) [0xffff77293cfc]
[bt] (1) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::detail::LogFatal::Entry::Finalize()+0x68) [0xffff9c31c7f8]
[bt] (0) /usr/local/lib/python3.10/dist-packages/tvm/libtvm.so(tvm::runtime::Backtrace[abi:cxx11]()+0x30) [0xffff9e22bfc0]
File "/opt/mlc-llm/cpp/serve/threaded_engine.cc", line 287
TVMError: Check failed: (output_res.IsOk()) is false: Insufficient GPU memory error: The available single GPU memory is 6476.864 MB, which is less than the sum of model weight size (876.640 MB) and temporary buffer size (10771.183 MB).
1. You can set a larger "gpu_memory_utilization" value.
2. If the model weight size is too large, please enable tensor parallelism by passing `--tensor-parallel-shards $NGPU` to `mlc_llm gen_config` or use quantization.
3. If the temporary buffer size is too large, please use a smaller `--prefill-chunk-size` in `mlc_llm gen_config`.
I have tried adding a swap file and setting a smaller prefill chunk size, but neither seemed to help. Could you help me solve this issue?
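For reference, these are roughly the commands I used; the swap file path and size, and the smaller chunk value, are just the values I picked, quoted from memory:

root@0737e6e49070:~# fallocate -l 16G /swapfile
root@0737e6e49070:~# chmod 600 /swapfile
root@0737e6e49070:~# mkswap /swapfile
root@0737e6e49070:~# swapon /swapfile
root@0737e6e49070:~# sudonim serve --model dusty-nv/DeepSeek-R1-Distill-Qwen-1.5B-q4f16_ft-MLC --quantization q4f16_ft --max-batch-size 1 --prefill-chunk 256 --chat-template deepseek_r1_qwen --host 0.0.0.0 --port 9000

Even with --prefill-chunk lowered from 512 to 256 and the swap active (visible in the free -h output above), it still failed in the same way.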
jetson_release:
Software part of jetson-stats 4.3.2 - (c) 2024, Raffaello Bonghi
Model: NVIDIA Jetson Orin Nano Engineering Reference Developer Kit Super - Jetpack 6.2 [L4T 36.4.3]
NV Power Mode[2]: MAXN_SUPER
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
- P-Number: p3767-0005
- Module: NVIDIA Jetson Orin Nano (Developer kit)
Platform:
- Distribution: Ubuntu 22.04 Jammy Jellyfish
- Release: 5.15.148-tegra
jtop:
- Version: 4.3.2
- Service: Active
Libraries:
- CUDA: Not installed
- cuDNN: Not installed
- TensorRT: Not installed
- VPI: Not installed
- Vulkan: 1.3.204
- OpenCV: 4.5.4 - with CUDA: NO
Thanks!
richer.chan