TensorRT-LLM on Jetson Orin NX (16GB)

Has anyone tried using TensorRT-LLM on Jetson Orin NX (16GB)? I keep running into a “core dump” error when using trtllm-build, even with a small model (0.5B). The official tests were conducted on AGX Orin.

Hi,
Here are some suggestions for the common issues:

1. Performance

Please run the commands below before benchmarking a deep learning use case:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks
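
To verify the board is actually in the maximum-performance mode and to watch memory while a workload runs, the commands below may help (output formats differ between Jetson modules, so treat them as a quick sanity check rather than a benchmark):

$ sudo nvpmodel -q      # print the currently active power mode
$ sudo tegrastats       # live RAM/CPU/GPU utilization; stop with Ctrl+C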

2. Installation

Installation guide for deep learning frameworks on Jetson:

3. Tutorial

Startup deep learning tutorial:

4. Report issue

If these suggestions don’t help and you want to report an issue to us, please share the model, the commands/steps, and any customized app with us so we can reproduce the issue locally.

Thanks!

Hi,

Could you share how you set up the environment?
Are you following the instructions shared in the topic below:

Thanks.

Thanks a lot for your reply. Previously, I set up the TensorRT-LLM environment based on the link TensorRT-LLM Deployment on Jetson Orin.

Now I am using the pre-configured container, and I have successfully run inference on a small model (1.5B) using the TensorRT-LLM framework.
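
For reference, the working 1.5B flow roughly followed the standard TensorRT-LLM example pipeline; the paths and flags below are illustrative rather than my exact invocation:

python3 convert_checkpoint.py --model_dir Qwen2.5-1.5B-Instruct --output_dir qwen2.5_1.5b_ckpt --dtype float16
trtllm-build --checkpoint_dir qwen2.5_1.5b_ckpt --output_dir qwen2.5_1.5b_engine --gemm_plugin float16
python3 ../run.py --engine_dir qwen2.5_1.5b_engine --tokenizer_dir Qwen2.5-1.5B-Instruct --max_output_len 64 --input_text "Hello"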

However, I still encounter issues when attempting to convert a 7B model. The model files were downloaded from qwen2.5-7b. My conversion command is:

python3 convert_checkpoint.py --model_dir /home/models/qwen/qwen-2.5_7b --output_dir /home/models/engines/qwen2.5_7b_ckpt --dtype=float16 --use_weight_only --weight_only_precision int8

The error message received is as follows:

/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
0.12.0
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████| 4/4 [00:11<00:00,  2.86s/it]
[01/05/2025-14:08:12] Some parameters are on the meta device because they were offloaded to the cpu.
Traceback (most recent call last):
  File "/home/learn/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 308, in <module>
    main()
  File "/home/learn/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 300, in main
    convert_and_save_hf(args)
  File "/home/learn/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 256, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/home/learn/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 263, in execute
    f(args, rank)
  File "/home/learn/TensorRT-LLM/examples/qwen/convert_checkpoint.py", line 246, in convert_and_save_rank
    qwen = QWenForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/model.py", line 315, in from_hugging_face
    weights = load_weights_from_hf_model(hf_model, config)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 1221, in load_weights_from_hf_model
    weights = convert_hf_qwen(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 823, in convert_hf_qwen
    get_tllm_linear_weight(qkv_w, tllm_prex + 'attention.qkv.',
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/qwen/convert.py", line 502, in get_tllm_linear_weight
    v.cpu(), plugin_weight_only_quant_type)
NotImplementedError: Cannot copy out of meta tensor; no data!

Hi,

The 7B model cannot work on a 16GB device.
Please try a model with weight <= 4B.
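
As a rough back-of-the-envelope check (an estimate only): FP16 stores 2 bytes per parameter, so the 7B weights alone need roughly 13 GiB before activations, the KV cache, and the OS share of the 16GB unified memory, while a <= 4B model stays around 7.5 GiB:

$ python3 -c "print(7e9 * 2 / 2**30)"   # ~13.0 GiB of FP16 weights for 7B parameters
$ python3 -c "print(4e9 * 2 / 2**30)"   # ~7.5 GiB of FP16 weights for 4B parameters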

Thanks.

Thanks a lot.

Does “weight <= 4B” refer to the number of parameters in the model? The 1.5B model actually works on Jetson Orin NX (16GB). However, when I try a 3B model, the conversion process still fails. The model was downloaded from Qwen2.5-3B-Instruct · Model Library.

The conversion command is:

python3 convert_checkpoint.py --model_dir Qwen2.5-3B-Instruct --output_dir qwen2.5_3b_ckpt --dtype float16

The error message is:

/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
0.12.0
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:07<00:00,  3.86s/it]
Weights loaded. Total time: 00:00:05
[1]    1132242 killed     python3 convert_checkpoint.py --model_dir  --output_dir  --dtype float16
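
The “killed” line looks like the kernel OOM killer stopping the conversion, which holds the full FP16 weights in unified memory. To confirm that, and as a possible workaround, temporary swap can be added before retrying (the 8G size below is only an example):

sudo dmesg | grep -i -E "killed process|out of memory"
sudo fallocate -l 8G /swapfile && sudo chmod 600 /swapfile
sudo mkswap /swapfile && sudo swapon /swapfile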

When converting the model Llama-3.2-1B-Instruct · Model Library, I encounter the following error:

/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
0.12.0
117it [00:00, 177.11it/s]
Traceback (most recent call last):
  File "/home/learn/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 487, in <module>
    main()
  File "/home/learn/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 479, in main
    convert_and_save_hf(args)
  File "/home/learn/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 421, in convert_and_save_hf
    execute(args.workers, [convert_and_save_rank] * world_size, args)
  File "/home/learn/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 428, in execute
    f(args, rank)
  File "/home/learn/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 410, in convert_and_save_rank
    llama = LLaMAForCausalLM.from_hugging_face(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/llama/model.py", line 363, in from_hugging_face
    loader.generate_tllm_weights(model)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 329, in generate_tllm_weights
    tllm_weights.update(self.load(tllm_key))
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/model_weights_loader.py", line 271, in load
    v = sub_module.postprocess(tllm_key, v, **postprocess_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/layers/linear.py", line 380, in postprocess
    weights = weights.to(str_dtype_to_torch(self.dtype))
AttributeError: 'NoneType' object has no attribute 'to'
Exception ignored in: <function PretrainedModel.__del__ at 0xfffef0f240d0>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 449, in __del__
    self.release()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/models/modeling_utils.py", line 446, in release
    release_gc()
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py", line 469, in release_gc
    torch.cuda.ipc_collect()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 966, in ipc_collect
    _lazy_init()
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 338, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: 'NoneType' object is not iterable

CUDA call was originally invoked at:

  File "/home/learn/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 8, in <module>
    from transformers import AutoConfig
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/__init__.py", line 26, in <module>
    from . import dependency_versions_check
  File "<frozen importlib._bootstrap>", line 1078, in _handle_fromlist
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/dependency_versions_check.py", line 16, in <module>
    from .utils.versions import require_version, require_version_core
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 992, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/__init__.py", line 27, in <module>
    from .chat_template_utils import DocstringParsingException, TypeHintParsingException, get_json_schema
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/transformers/utils/chat_template_utils.py", line 39, in <module>
    from torch import Tensor
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/__init__.py", line 1955, in <module>
    _C._initExtension(_manager_path())
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 1539, in <module>
    _lazy_call(_register_triton_kernels)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 261, in _lazy_call
    _queued_calls.append((callable, traceback.format_stack()))

My runtime environment is the pre-configured container, and my conversion command is:

python3 convert_checkpoint.py --model_dir Llama-3.2-1B-Instruct --output_dir llama3.2_1b_ckpt --dtype float16
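
One detail that may or may not be related: Llama-3.2-1B-Instruct ships with tied input/output embeddings (tie_word_embeddings is true in its config), so there is no separate lm_head weight in the checkpoint; I suspect, but have not confirmed, that this is what leaves the weight as None in the loader. A quick way to inspect the config:

grep tie_word_embeddings Llama-3.2-1B-Instruct/config.json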