TensorRT-LLM for Jetson errors

Using AGX Orin 64GB, JetPack 6.1 [L4T 36.4.0]

Ran first:

jetson-containers run -e HUGGINGFACE_TOKEN=<YOUR_HF_TOKEN> -e FORCE_BUILD=on dustynv/tensorrt_llm:0.12-r36.4.0 /opt/TensorRT-LLM/llama.sh

Then ran the following, which produced the errors below:

jetson-containers run \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
  python3 /opt/TensorRT-LLM/examples/apps/openai_server.py \
    /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq
V4L2_DEVICES: --device /dev/video0 --device /dev/video1

DISPLAY environmental variable is already set: ":1"

localuser:root being added to access control list

  • docker run --runtime nvidia -it --rm --network host --shm-size=8g --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/paul/jetson-containers/data:/data -v /etc/localtime:/etc/localtime:ro -v /etc/timezone:/etc/timezone:ro --device /dev/snd -e PULSE_SERVER=unix:/run/user/1000/pulse/native -v /run/user/1000/pulse:/run/user/1000/pulse --device /dev/bus/usb -e DISPLAY=:1 -v /tmp/.X11-unix/:/tmp/.X11-unix -v /tmp/.docker.xauth:/tmp/.docker.xauth -e XAUTHORITY=/tmp/.docker.xauth --device /dev/video0 --device /dev/video1 --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-3 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-6 --device /dev/i2c-7 --device /dev/i2c-8 --device /dev/i2c-9 -v /run/jtop.sock:/run/jtop.sock --name jetson_container_20241123_115929 dustynv/tensorrt_llm:0.12-r36.4.0 python3 /opt/TensorRT-LLM/examples/apps/openai_server.py /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq
    /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
    warnings.warn(

[TensorRT-LLM] TensorRT-LLM version: 0.12.0
Loading Model: [1/2] Loading TRT checkpoints to memory
Time: 0.169s
Loading Model: [2/2] Build TRT-LLM engine
Time: 282.760s
Loading model done.
Total latency: 282.929s
[TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 10
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 10
[TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
[TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 512
[TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 512
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 5120
[TensorRT-LLM][INFO] TRTGptModel maxInputLen: 511 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
[TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
[TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
[TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
[TensorRT-LLM][INFO] Loaded engine size: 3943 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 415.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3938 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.81 MB GPU memory for decoder.
[TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 61.37 GiB, available: 43.44 GiB
[TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 1252
[TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 8
[TensorRT-LLM][INFO] Number of tokens per block: 64.
[TensorRT-LLM][INFO] [MemUsageChange] Allocated 39.12 GiB for max tokens in paged KV cache (80128).
[11/23/2024-12:04:36] [TRT-LLM] [E] Failed to load tokenizer from /tmp/tmpofbqzwkkllm-workspace/tmp.engine: Unrecognized model in /tmp/tmpofbqzwkkllm-workspace/tmp.engine. Should have a model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, idefics3, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zoedepth
Traceback (most recent call last):
  File "/opt/TensorRT-LLM/examples/apps/openai_server.py", line 451, in <module>
    entrypoint()
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/TensorRT-LLM/examples/apps/openai_server.py", line 441, in entrypoint
    hf_tokenizer = AutoTokenizer.from_pretrained(tokenizer or model_dir)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 877, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1049, in from_pretrained
    raise ValueError(
ValueError: Unrecognized model in /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq. Should have a model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, idefics3, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zoedepth
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/hlapi/utils.py", line 319, in __call__
    obj.shutdown()
AttributeError: 'LLM' object has no attribute 'shutdown'. Did you mean: '_shutdown'?

Original exception was:
Traceback (most recent call last):
  File "/opt/TensorRT-LLM/examples/apps/openai_server.py", line 451, in <module>
    entrypoint()
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/TensorRT-LLM/examples/apps/openai_server.py", line 441, in entrypoint
    hf_tokenizer = AutoTokenizer.from_pretrained(tokenizer or model_dir)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 877, in from_pretrained
    config = AutoConfig.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1049, in from_pretrained
    raise ValueError(
ValueError: Unrecognized model in /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq. Should have a model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, idefics3, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zoedepth

Hi,
Here are some suggestions for the common issues:

1. Performance

Please run the commands below before benchmarking a deep learning use case:

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

2. Installation

Installation guide of deep learning frameworks on Jetson:

3. Tutorial

Startup deep learning tutorial:

4. Report issue

If these suggestions don't help and you want to report an issue to us, please share the model, command/steps, and any customized app with us so we can reproduce it locally.

Thanks!

Thanks for the reply:
Is this information (2) not in the dustynv/tensorrt_llm container?

"If you wanted you could skip docker and use native Jetson tensorrt-llm/0.12.0 wheel"

(from dusty-nv's post on the implementation of tensorrt_llm on Jetson)

Hi,

Based on the error:

[11/23/2024-12:04:36] [TRT-LLM] [E] Failed to load tokenizer from /tmp/tmpofbqzwkkllm-workspace/tmp.engine: Unrecognized model in /tmp/tmpofbqzwkkllm-workspace/tmp.engine. 

Did you apply the "OpenAI API Endpoint" step mentioned in the link below?

Thanks.

Thanks for your reply.
Will give this a go. Stages 1.2 and 1.3 completed OK.
Will let you know how I get on.
Thanks for your suggestion.


Part 1.3
The error messages: (a) wrong version of NumPy (1.26.1), even though this was installed in part 1.2 of the prerequisites!

git clone https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ

python convert_checkpoint.py --model_dir Meta-Llama-3-8B-Instruct-GPTQ --output_dir tllm_checkpoint_1gpu_gptq --dtype float16 --use_weight_only --weight_only_precision int4_gptq --per_group

export PATH=/home/nvidia/.local/bin:$PATH
trtllm-build --checkpoint_dir tllm_checkpoint_1gpu_gptq --output_dir engine_1gpu_gptq --gemm_plugin float16
Cloning into 'Meta-Llama-3-8B-Instruct-GPTQ'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 15 (delta 2), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (15/15), 2.23 MiB | 1.86 MiB/s, done.
Encountered 1 file(s) that may not have been copied correctly on Windows:
model.safetensors

See: git lfs help smudge for more details.
python: can't open file '/home/paul/TensorRT-LLM/convert_checkpoint.py': [Errno 2] No such file or directory
/usr/lib/python3/dist-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.17.3 and <1.25.0 is required for this version of SciPy (detected version 1.26.1)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
[11/25/2024-18:06:49] [TRT-LLM] [W] A required package 'pynvml' is not installed. Will not monitor the device memory usages. Please install the package first, e.g, 'pip install pynvml>=11.5.0'.
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
[11/25/2024-18:06:49] [TRT-LLM] [I] Set bert_attention_plugin to auto.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set gpt_attention_plugin to auto.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set gemm_plugin to float16.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set gemm_swiglu_plugin to None.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set fp8_rowwise_gemm_plugin to None.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set nccl_plugin to auto.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set lookup_plugin to None.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set lora_plugin to None.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set moe_plugin to auto.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set mamba_conv1d_plugin to auto.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set context_fmha to True.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set bert_context_fmha_fp32_acc to False.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set paged_kv_cache to True.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set remove_input_padding to True.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set reduce_fusion to False.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set enable_xqa to True.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set tokens_per_block to 64.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set use_paged_context_fmha to False.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set use_fp8_context_fmha to False.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set multiple_profiles to False.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set paged_state to True.
[11/25/2024-18:06:49] [TRT-LLM] [I] Set streamingllm to False.
Traceback (most recent call last):
  File "/home/paul/.local/bin/trtllm-build", line 8, in <module>
    sys.exit(main())
  File "/home/paul/.local/lib/python3.10/site-packages/tensorrt_llm/commands/build.py", line 442, in main
    model_config = PretrainedConfig.from_json_file(config_path)
  File "/home/paul/.local/lib/python3.10/site-packages/tensorrt_llm/models/modeling_utils.py", line 301, in from_json_file
    with open(config_file) as f:
FileNotFoundError: [Errno 2] No such file or directory: 'tllm_checkpoint_1gpu_gptq/config.json'

I seem to be going further down a rabbit hole…

This might help.

The MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ repo has requirements, and AutoGPTQ is the only package not already in tensorrt-llm:

git clone https://github.com/AutoGPTQ/AutoGPTQ

If you aren't using conda, edit setup.py and change this line to: conda_cuda_include_dir = "/usr/local/cuda/include"

export BUILD_CUDA_EXT=1
export TORCH_CUDA_ARCH_LIST="8.7"
export COMPILE_MARLIN=1
MAX_JOBS=10 python -m pip wheel . --no-build-isolation -w dist --no-clean
pip install dist/auto_gptq-0.8.0.dev0+cu126-cp310-cp310-linux_aarch64.whl --user
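
As a hedged sanity check after installing the wheel (everything below is only illustrative), the package should import cleanly and PyTorch should still see the Orin GPU:

import torch
import auto_gptq

# Confirm the wheel is importable and show where it was installed from
print("auto_gptq loaded from:", auto_gptq.__file__)
# Confirm PyTorch can see the Orin GPU (compute capability 8.7)
print("CUDA available:", torch.cuda.is_available())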

Hi,

We will give it a try and let you know.
Thanks.

Any update on this? I also have the same error when running

python convert_checkpoint.py --model_dir Meta-Llama-3-8B-Instruct-GPTQ --output_dir tllm_checkpoint_1gpu_gptq --dtype float16 --use_weight_only --weight_only_precision int4_gptq --per_group
python: can't open file '/media/jetson/sensordata1/TensorRT-LLM/convert_checkpoint.py': [Errno 2] No such file or directory

Hi,
Sorry, I have got no further.
I have had to move on to something else for the moment.

Hi,

Sorry for the late update.

It should work with the steps below:
https://github.com/NVIDIA/TensorRT-LLM/blob/v0.12.0-jetson/README4Jetson.md#21-build-the-engine-with-int4-gptq

$ git clone https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ

$ python convert_checkpoint.py --model_dir Meta-Llama-3-8B-Instruct-GPTQ --output_dir tllm_checkpoint_1gpu_gptq --dtype float16 --use_weight_only --weight_only_precision int4_gptq  --per_group

$ export PATH=/home/nvidia/.local/bin:$PATH
$ trtllm-build --checkpoint_dir tllm_checkpoint_1gpu_gptq --output_dir engine_1gpu_gptq --gemm_plugin float16

Based on your output, could you verify if the HF model has been downloaded completely?
Since the model is relatively large, it’s expected to take several minutes to finish.
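
One rough way to check (assuming the clone is in the current directory; the path below is just an example): a fully downloaded GPTQ checkpoint should be several GB, while a git-lfs pointer file is only a few hundred bytes.

import os

# Hypothetical location of the cloned repo; adjust to where it was actually cloned
path = "Meta-Llama-3-8B-Instruct-GPTQ/model.safetensors"
size_gb = os.path.getsize(path) / 1e9
print(f"{path}: {size_gb:.2f} GB")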

Thanks.

Hi
I have tried to reinstall TensorRT-LLM as per the above and still get the same errors as in my last post on 25th November, Part 1.3.
This started with my original post, which has not been resolved, or am I missing a trick?
i.e. TensorRT-LLM (new) on the NVIDIA Jetson AI Lab.
I built the docker image successfully
then ran:

jetson-containers run \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
  python3 /opt/TensorRT-LLM/examples/apps/openai_server.py \
    /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq

which produced the errors in my original post.

I have got no further and have given up.
Thanks for your replies
Cheers

What about adding --tokenizer /data/models/huggingface/models--meta-llama--Llama-2-7b-chat-hf/snapshots/f5db02db724555f92da89c216ac04704f23d4590/ to your command?
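
For illustration, a rough sketch of why pointing --tokenizer at the original Hugging Face snapshot helps (the paths are the ones used elsewhere in this thread and may differ on your system): AutoTokenizer needs an HF-style config.json with a model_type key, which the built engine directory does not contain.

from transformers import AutoTokenizer

# Fails: the TRT-LLM engine directory has no HF config.json with a "model_type" key
# AutoTokenizer.from_pretrained("/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq")

# Works: load the tokenizer from the original Hugging Face snapshot instead
tok = AutoTokenizer.from_pretrained(
    "/data/models/huggingface/models--meta-llama--Llama-2-7b-chat-hf/"
    "snapshots/f5db02db724555f92da89c216ac04704f23d4590/"
)
print(tok("Where is New York?").input_ids)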

Hi

I added the --tokenizer … option to the python command inside the container and got the following result:

INFO: Started server process [31]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:47036 - "GET / HTTP/1.1" 404 Not Found
INFO: 127.0.0.1:47036 - "GET /favicon.ico HTTP/1.1" 404 Not Found

The 404 Not Found was from trying to open the link in a browser.
The "GET /favicon.ico HTTP/1.1" was when running the curl command below:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": <model_name>,
    "prompt": "Where is New York?",
    "max_tokens": 16,
    "temperature": 0
  }'

Thanks for your input, another step forward.
I don't understand why TensorRT-LLM is not as straightforward as most of the Jetson AI Lab container examples, most of which I have managed to get up and running.
I have found differences between JetPack 6.0 and JetPack 6.1, where some containers have worked in 6.0 but have errors in 6.1.
Thanks again for your help. Have a good Christmas.

The above example works fine if you run this command in another terminal:

jetson-containers run \
  --workdir /opt/TensorRT-LLM/examples/apps \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
    python3 openai_client.py --prompt "Where is New York?" --api chat


1) Building TensorRT-LLM Engine for Llama

jetson-containers run \
  -e HUGGINGFACE_TOKEN=YOUR_API_KEY \
  -e FORCE_BUILD=on \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
    /opt/TensorRT-LLM/llama.sh

2) OpenAI API Endpoint

jetson-containers run \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
  python3 /opt/TensorRT-LLM/examples/apps/openai_server.py \
    /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq --tokenizer /data/models/huggingface/models--meta-llama--Llama-2-7b-chat-hf/snapshots/f5db02db724555f92da89c216ac04704f23d4590/
 
3) OPEN ANOTHER TERMINAL

jetson-containers run \
  --workdir /opt/TensorRT-LLM/examples/apps \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
    python3 openai_client.py --prompt "Where is New York?" --api chat


Magic, it works!
Thanks again for your help.
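
For reference, a rough Python sketch of the same request without the helper script, assuming the server from step 2 is still listening on localhost:8000 and serves the standard OpenAI chat route (the model name is a placeholder; use whatever your server expects):

import json
import urllib.request

payload = {
    "model": "tensorrt_llm",  # placeholder; use the model name your server expects
    "messages": [{"role": "user", "content": "Where is New York?"}],
    "max_tokens": 16,
    "temperature": 0,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())
print(body["choices"][0]["message"]["content"])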
