JetPack 6.2 + TensorRT OOM issue

Recently I saw in NVIDIA's announcement that JetPack 6.2 can boost the performance of the Orin NX and Orin Nano:
NVIDIA JetPack 6.2 Brings Super Mode to NVIDIA Jetson Orin Nano and Jetson Orin NX Modules | NVIDIA Technical Blog. It mentions that many LLMs can be run on the Nano (see "Table 4. Benchmark performance in tokens/sec for popular LLMs on Jetson Orin Nano 8GB" in that post).

However, when I tried to use the official TensorRT-LLM package for inference on Llama-2-7b, I ran into an OOM issue. Here are the steps to reproduce:

  1. Install JetPack 6.2 and set the power mode to MAX, checking with jtop
  2. Use the tensorrt_llm wheel from Jetson AI Lab (TensorRT-LLM - NVIDIA Jetson AI Lab); a rough install sketch is included after step 4
  3. I extracted the example shell script from the tensorrt_llm container (dustynv/tensorrt_llm:0.12-r36.4.0):
#!/usr/bin/env bash
set -ex

MODEL="/mnt/Llama-2-7b-chat-hf"
QUANT="/mnt/Llama-2-7B-Chat-GPTQ/model.safetensors"

LLAMA_EXAMPLES="/opt/TensorRT-LLM/examples/llama"
TRT_LLM_MODELS="/mnt/models/tensorrt_llm"

: "${FORCE_BUILD:=off}"


llama_fp16() 
{
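	# Convert the HF Llama-2-7B checkpoint to FP16 and build a TensorRT-LLM engine from it.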
	output_dir="$TRT_LLM_MODELS/$(basename $MODEL)-fp16"
	
	if [ ! -f $output_dir/*.safetensors ]; then
		python3 $LLAMA_EXAMPLES/convert_checkpoint.py \
			--model_dir $(huggingface-downloader $MODEL) \
			--output_dir $output_dir \
			--dtype float16
	fi

	trtllm-build \
		--checkpoint_dir $output_dir \
		--output_dir $output_dir/engines \
		--gemm_plugin float16
}

llama_gptq() 
{
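	# Build the INT4-GPTQ engine, run a short generation with run.py, then benchmark it.
	# (convert_checkpoint.py is commented out below; the conversion was done on another machine.)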
	output_dir="$TRT_LLM_MODELS/Llama-2-7b-chat-hf-gptq"
	engine_dir="$output_dir/engines"
	
	# if [ ! -f $output_dir/*.safetensors ] || [ $FORCE_BUILD = "on" ]; then
	# 	python3 $LLAMA_EXAMPLES/convert_checkpoint.py \
	# 		--model_dir $(huggingface-downloader $MODEL) \
	# 		--output_dir $output_dir \
	# 		--dtype float16 \
	# 		--quant_ckpt_path $(huggingface-downloader $QUANT) \
	# 		--use_weight_only \
	# 		--weight_only_precision int4_gptq \
	# 		--group_size 128 \
	# 		--per_group
	# fi
	
	if [ ! -f $engine_dir/*.engine ] || [ $FORCE_BUILD = "on" ]; then
	    trtllm-build \
		    --checkpoint_dir $output_dir \
		    --output_dir $engine_dir \
		    --gemm_plugin auto \
		    --log_level verbose \
		    --max_batch_size 1 \
		    --max_num_tokens 512 \
		    --max_seq_len 512 \
		    --max_input_len 128	    
    fi

    python3 $LLAMA_EXAMPLES/../run.py \
        --max_input_len=128 \
        --max_output_len=128 \
        --max_attention_window_size 256 \
        --max_tokens_in_paged_kv_cache=256 \
        --tokenizer_dir $MODEL \
        --engine_dir $engine_dir

    python3 /opt/TensorRT-LLM/benchmarks/python/benchmark.py \
        -m dec \
        --engine_dir $engine_dir \
        --quantization int4_weight_only_gptq \
        --batch_size 1 \
        --input_output_len "16,128;32,128;64,128;128,128" \
        --log_level verbose \
        --enable_cuda_graph \
        --warm_up 2 \
        --num_runs 3 \
        --duration 10  
}

#llama_fp16
llama_gptq

I ran llama_gptq() by calling FORCE_BUILD=on bash llama.sh, but skipped the convert_checkpoint.py part (commented out in the script) since I had already converted the checkpoint on an RTX 6000 Ada server.
  4. After I ran this script, the Nano hit OOM and restarted by itself. Here is the jtop capture when the Nano OOM'd.
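
For step 2, the wheel install looked roughly like this (only a sketch: the index URL and version below are assumptions, the exact command is given on the Jetson AI Lab page):

# Sketch of step 2: install the TensorRT-LLM wheel published by Jetson AI Lab.
# The index URL is an assumption; use the one shown on the Jetson AI Lab page.
pip3 install tensorrt_llm --extra-index-url https://pypi.jetson-ai-lab.dev/jp6/cu126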

And here is my question:

  1. Were the official benchmark numbers measured using TensorRT-LLM?
  2. Is there any difference between my environment and packages and the official setup?

Thanks.

Here is the TensorRT log before the OOM:
testlog.txt (1.2 MB)

Hi,

Did you follow the doc to enlarge the memory available on the Orin Nano 8GB?

Thanks

Hi, I tried the following commands:

sudo init 3                                      # drop to text mode to free the GUI's memory
sudo systemctl disable nvargus-daemon.service    # stop the camera daemon
sudo systemctl disable nvzramconfig              # disable zram so swap goes to disk instead
sudo fallocate -l 16G /ssd/16GB.swap             # create a 16 GB swap file on the SSD
sudo mkswap /ssd/16GB.swap
sudo swapon /ssd/16GB.swap

And here is the initial condition of the Nano, with SWAP extended to 19.7 GB:


After running the script, the Nano still hit OOM and restarted. Here is the jtop capture at the time of the OOM:

Only 589 MB of SWAP was used. Will SWAP be used by the GPU? The TensorRT logs separate memory usage into CPU and GPU, even though memory on the Nano is shared between the CPU and GPU.

[02/05/2025-15:46:45] [TRT] [V] [MemUsageChange] Subgraph compilation: CPU +0, GPU +2, now: CPU 1215, GPU 5538 (MiB)
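
A quick way to compare what the CPU side sees with what CUDA sees is something like this (a sketch; it assumes PyTorch with CUDA is available inside the container):

free -m   # CPU-side view: physical RAM plus the swap file
# CUDA-visible free/total memory. On Jetson this comes from the same physical LPDDR, and
# CUDA device allocations cannot be paged out to swap; swap only helps indirectly, by letting
# CPU-pageable pages move to disk so more physical RAM stays free for the GPU.
python3 -c "import torch; f, t = torch.cuda.mem_get_info(); print(f/2**30, t/2**30, 'GiB')"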

Hi,

No.

TensorRT-LLM is verified on AGX Orin.

If you want to run inference with Llama-2-7b, you could try other options, like:

Thanks

Okay, so was this table from NVIDIA JetPack 6.2 Brings Super Mode to NVIDIA Jetson Orin Nano and Jetson Orin NX Modules | NVIDIA Technical Blog also run with a small LLM?

The reason I ask is that I want to run Llama-3.1-8B; Llama-2-7B is only for testing the TensorRT-LLM framework.
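
For rough sizing (my own back-of-envelope, not a number from the blog), the INT4 weights alone already take a large share of the Nano's 8 GB:

# Back-of-envelope: 8B parameters with INT4 weight-only quantization (0.5 byte per param).
# Weights only, before KV cache, activations, and TensorRT workspace.
python3 -c "print(8e9 * 0.5 / 2**30)"   # ~3.7 GiB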

Hi,

For the benchmark, you can refer to the steps in this doc.

Thanks

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.