JetPack 6.2 + TensorRT-LLM OOM issue

Recently, I saw in NVIDIA's press release that JetPack 6.2 can boost the performance of the Orin NX and Orin Nano:
NVIDIA JetPack 6.2 Brings Super Mode to NVIDIA Jetson Orin Nano and Jetson Orin NX Modules | NVIDIA Technical Blog. It mentions that many LLMs can run on the Nano (see Table 4, "Benchmark performance in tokens/sec for popular LLMs on Jetson Orin Nano 8GB", in that post).

However, when I tried to run inference on Llama-2-7b with the official TensorRT-LLM package, I hit an OOM issue. Here are the steps to reproduce:

  1. Install JetPack 6.2 and set the power mode to MAX, checking with jtop.
  2. Install the tensorrt_llm wheel from the Jetson AI Lab: TensorRT-LLM - NVIDIA Jetson AI Lab
  3. I extracted the example shell script below from the tensorrt_llm container (dustynv/tensorrt_llm:0.12-r36.4.0):
#!/usr/bin/env bash
set -ex

MODEL="/mnt/Llama-2-7b-chat-hf"
QUANT="/mnt/Llama-2-7B-Chat-GPTQ/model.safetensors"

LLAMA_EXAMPLES="/opt/TensorRT-LLM/examples/llama"
TRT_LLM_MODELS="/mnt/models/tensorrt_llm"

: "${FORCE_BUILD:=off}"


llama_fp16() 
{
	output_dir="$TRT_LLM_MODELS/$(basename $MODEL)-fp16"
	
	# note: [ -f glob ] breaks when the glob matches more than one file, so use compgen
	if ! compgen -G "$output_dir/*.safetensors" > /dev/null; then
		python3 $LLAMA_EXAMPLES/convert_checkpoint.py \
			--model_dir $(huggingface-downloader $MODEL) \
			--output_dir $output_dir \
			--dtype float16
	fi

	trtllm-build \
		--checkpoint_dir $output_dir \
		--output_dir $output_dir/engines \
		--gemm_plugin float16
}

llama_gptq() 
{
	output_dir="$TRT_LLM_MODELS/Llama-2-7b-chat-hf-gptq"
	engine_dir="$output_dir/engines"
	
	# if [ ! -f $output_dir/*.safetensors ] || [ $FORCE_BUILD = "on" ]; then
	# 	python3 $LLAMA_EXAMPLES/convert_checkpoint.py \
	# 		--model_dir $(huggingface-downloader $MODEL) \
	# 		--output_dir $output_dir \
	# 		--dtype float16 \
	# 		--quant_ckpt_path $(huggingface-downloader $QUANT) \
	# 		--use_weight_only \
	# 		--weight_only_precision int4_gptq \
	# 		--group_size 128 \
	# 		--per_group
	# fi
	
	# note: [ -f glob ] breaks when the glob matches more than one file, so use compgen
	if ! compgen -G "$engine_dir/*.engine" > /dev/null || [ "$FORCE_BUILD" = "on" ]; then
		trtllm-build \
			--checkpoint_dir $output_dir \
			--output_dir $engine_dir \
			--gemm_plugin auto \
			--log_level verbose \
			--max_batch_size 1 \
			--max_num_tokens 512 \
			--max_seq_len 512 \
			--max_input_len 128
	fi

    python3 $LLAMA_EXAMPLES/../run.py \
        --max_input_len=128 \
        --max_output_len=128 \
        --max_attention_window_size 256 \
        --max_tokens_in_paged_kv_cache=256 \
        --tokenizer_dir $MODEL \
        --engine_dir $engine_dir

    python3 /opt/TensorRT-LLM/benchmarks/python/benchmark.py \
        -m dec \
        --engine_dir $engine_dir \
        --quantization int4_weight_only_gptq \
        --batch_size 1 \
        --input_output_len "16,128;32,128;64,128;128,128" \
        --log_level verbose \
        --enable_cuda_graph \
        --warm_up 2 \
        --num_runs 3 \
        --duration 10  
}

#llama_fp16
llama_gptq

I ran llama_gptq() by calling FORCE_BUILD=on bash llama.sh, but skipped the convert_checkpoint.py part (commented out in the script above) since I had already converted the checkpoint on an RTX 6000 Ada server.
4. After running the script, the Nano hit OOM and rebooted by itself; here is the jtop capture at the moment of the OOM.
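For reference, a back-of-the-envelope memory estimate (my own sketch, assuming Llama-2-7B's standard shape of 32 layers, 32 KV heads, head dim 128, an fp16 KV cache, and the batch/sequence settings from the build command above) suggests the weights and KV cache alone should fit in 8 GB:

```python
# Rough memory estimate for Llama-2-7B with int4 GPTQ weights.
# Assumptions (not from the post above): 7e9 params, ~0.5 byte/param for
# 4-bit weights (ignoring GPTQ scales/zeros), Llama-2-7B shape of
# 32 layers, 32 KV heads, head_dim 128, fp16 KV cache, batch 1, seq 512.
GIB = 2**30

params = 7e9
weights_gib = params * 0.5 / GIB          # 4-bit quantized weights

layers, kv_heads, head_dim = 32, 32, 128  # Llama-2-7B uses MHA (no GQA)
batch, seq_len, fp16_bytes = 1, 512, 2
# K and V per layer: batch * seq * heads * head_dim * bytes, times 2 for K+V
kv_gib = 2 * layers * batch * seq_len * kv_heads * head_dim * fp16_bytes / GIB

print(f"weights ~= {weights_gib:.2f} GiB")  # ~3.26 GiB
print(f"KV cache ~= {kv_gib:.2f} GiB")      # 0.25 GiB
```

On an 8 GB Orin Nano, where CPU and GPU share the same memory, roughly 3.5 GiB for the model leaves limited headroom once the OS, CUDA context, and TensorRT build-time workspace are added on top, which may be why the build/benchmark step tips the board into OOM.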
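Since the board reboots, the jtop view is lost at the moment of the crash. A small helper I used alongside jtop (my own sketch, not part of the original repro) appends MemAvailable from /proc/meminfo to a log file once per second, so the last entry before the reboot shows the headroom right before the OOM:

```python
#!/usr/bin/env python3
# Sketch: log MemAvailable (from /proc/meminfo) once per second to mem.log,
# so the last line written before the reboot shows the headroom at OOM time.
import time

def mem_available_kib():
    """Return MemAvailable in KiB as reported by /proc/meminfo."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

if __name__ == "__main__":
    # line-buffered append so entries survive an abrupt reboot
    with open("mem.log", "a", buffering=1) as log:
        while True:
            log.write(f"{time.time():.0f} {mem_available_kib()} KiB\n")
            time.sleep(1)
```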

And here are my questions:

  1. Were the official benchmark numbers measured using TensorRT-LLM?
  2. Is there any difference between my environment and packages and the ones used for the official benchmarks?

Thanks.