Recently, I saw NVIDIA's announcement that JetPack 6.2 can boost the performance of the Orin NX and Orin Nano:
NVIDIA JetPack 6.2 Brings Super Mode to NVIDIA Jetson Orin Nano and Jetson Orin NX Modules | NVIDIA Technical Blog
It mentions that many LLMs can run on the Nano (see Table 4, "Benchmark performance in tokens/sec for popular LLMs on Jetson Orin Nano 8GB", in that post).
However, when I tried to run Llama-2-7b inference with the official TensorRT-LLM package, I hit an OOM issue. Here are the steps to reproduce:
- Install JetPack 6.2 and set the power mode to the maximum, verifying with jtop (the commands I used are sketched after the script below).
- Install the tensorrt_llm wheel from the Jetson AI Lab: TensorRT-LLM - NVIDIA Jetson AI Lab
- Run the example shell script below, which I extracted from the tensorrt_llm container (dustynv/tensorrt_llm:0.12-r36.4.0):
#!/usr/bin/env bash
set -ex
MODEL="/mnt/Llama-2-7b-chat-hf"
QUANT="/mnt/Llama-2-7B-Chat-GPTQ/model.safetensors"
LLAMA_EXAMPLES="/opt/TensorRT-LLM/examples/llama"
TRT_LLM_MODELS="/mnt/models/tensorrt_llm"
: "${FORCE_BUILD:=off}"
llama_fp16()
{
    output_dir="$TRT_LLM_MODELS/$(basename $MODEL)-fp16"

    if [ ! -f $output_dir/*.safetensors ]; then
        python3 $LLAMA_EXAMPLES/convert_checkpoint.py \
            --model_dir $(huggingface-downloader $MODEL) \
            --output_dir $output_dir \
            --dtype float16
    fi

    trtllm-build \
        --checkpoint_dir $output_dir \
        --output_dir $output_dir/engines \
        --gemm_plugin float16
}
llama_gptq()
{
    output_dir="$TRT_LLM_MODELS/Llama-2-7b-chat-hf-gptq"
    engine_dir="$output_dir/engines"

    # if [ ! -f $output_dir/*.safetensors ] || [ $FORCE_BUILD = "on" ]; then
    #     python3 $LLAMA_EXAMPLES/convert_checkpoint.py \
    #         --model_dir $(huggingface-downloader $MODEL) \
    #         --output_dir $output_dir \
    #         --dtype float16 \
    #         --quant_ckpt_path $(huggingface-downloader $QUANT) \
    #         --use_weight_only \
    #         --weight_only_precision int4_gptq \
    #         --group_size 128 \
    #         --per_group
    # fi

    if [ ! -f $engine_dir/*.engine ] || [ $FORCE_BUILD = "on" ]; then
        trtllm-build \
            --checkpoint_dir $output_dir \
            --output_dir $engine_dir \
            --gemm_plugin auto \
            --log_level verbose \
            --max_batch_size 1 \
            --max_num_tokens 512 \
            --max_seq_len 512 \
            --max_input_len 128
    fi

    python3 $LLAMA_EXAMPLES/../run.py \
        --max_input_len=128 \
        --max_output_len=128 \
        --max_attention_window_size 256 \
        --max_tokens_in_paged_kv_cache=256 \
        --tokenizer_dir $MODEL \
        --engine_dir $engine_dir

    python3 /opt/TensorRT-LLM/benchmarks/python/benchmark.py \
        -m dec \
        --engine_dir $engine_dir \
        --quantization int4_weight_only_gptq \
        --batch_size 1 \
        --input_output_len "16,128;32,128;64,128;128,128" \
        --log_level verbose \
        --enable_cuda_graph \
        --warm_up 2 \
        --num_runs 3 \
        --duration 10
}
#llama_fp16
llama_gptq
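For step 1 above, these are roughly the commands I used to switch to the maximum power mode and lock the clocks before running anything. The nvpmodel mode index is just what jtop reported as the highest mode on my unit, so please treat the index as an assumption:

# Select the highest power mode (mode index assumed from jtop; it may differ per board/image)
sudo nvpmodel -m 2
# Lock clocks to their maximum for the selected mode
sudo jetson_clocks
# Confirm the active power mode
sudo nvpmodel -q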
I ran llama_gptq() by calling FORCE_BUILD=on bash llama.sh, but I did not run the convert_checkpoint.py part (commented out in the script) since I had already converted the checkpoint on an RTX 6000 Ada server.
- After I ran this script, the Nano hit OOM and restarted by itself. Here is the jtop capture at the moment of the OOM:
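If a text log would be more useful than the jtop screenshot, I can also record memory usage during the run with tegrastats, roughly like this (a sketch; tegrastats ships with JetPack):

# Log RAM/swap/GPU usage once per second to a file while the script runs
sudo tegrastats --interval 1000 --logfile /tmp/tegrastats_oom.log &
FORCE_BUILD=on bash llama.sh
sudo tegrastats --stop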
And here are my questions:
- Were the official benchmark numbers in that blog measured with TensorRT-LLM?
- Is there any difference between my environment and package and the ones used for the official benchmarks? (I can dump my version info with the commands below if that helps.)
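A sketch of how I would collect the version details on my Nano, in case that is needed:

# TensorRT-LLM wheel version
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# Related packages
pip3 list | grep -i -E "tensorrt|torch"
# L4T / JetPack release info
cat /etc/nv_tegra_release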
Thanks.