Error in TensorRT-LLM's whl file

I downloaded the whl file from the Package Index.

As you can see, the file requires Python 3.12, and I installed it using Python 3.12.11, but it reported the error “No module named ‘distutils’”.

The “distutils” module was removed in Python 3.12, though it is still available in 3.10. If I use Python 3.10 to install the whl file, it reports that the wheel is not supported on this platform.

So how to fix this?

That package is not compatible with Jetson; it is built for CUDA SBSA (Grace).

NVIDIA says that TensorRT-LLM can run on Jetson AGX Orin.

Although I followed the steps presented on the website exactly, I still cannot get TensorRT-LLM working on my Jetson AGX Orin.

TensorRT-LLM/v0.12.0-jetson/README4Jetson.md

There should be a Jetson wheel (trt-llm v0.12.0-jetson) available at pypi.jetson-ai-lab.dev when it comes back online, or you can compile it yourself. Here’s how I compiled it a few months ago.

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.12.0-jetson

https://github.com/NVIDIA/TensorRT-LLM/blob/v0.12.0-jetson/README4Jetson.md omits the following lines, which may be needed:

git submodule update --init --recursive
git lfs pull
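
If you want to double-check that the submodules actually populated (a sanity check of my own, not part of the README), the cutlass tree is the one that tends to cause trouble later:

git submodule status 3rdparty/cutlass
ls 3rdparty/cutlass/python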

I installed the requirements manually, as I am not in a Python virtual environment and my first run of build_wheel.py (which invokes setup.py) errored on the requirements install.
pip install -r requirements-dev-jetson.txt --user

python3 scripts/build_wheel.py --clean --cuda_architectures 87 -DENABLE_MULTI_DEVICE=0 --build_type Release --benchmarks --use_ccache

pip install build/tensorrt_llm-*.whl
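
As a quick sanity check after the install (my own addition, not from the README), import the package and print its version to confirm the wheel matches your Python environment:

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"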


The following steps are needed only if you are running the examples.

huggingface-cli download --repo-type model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ

This huggingface.co repo has requirements (AutoGPTQ), so:
git clone https://github.com/AutoGPTQ/AutoGPTQ.git

If you aren't using conda, edit setup.py and change this line to:
conda_cuda_include_dir = "/usr/local/cuda/include"
export BUILD_CUDA_EXT=1
export TORCH_CUDA_ARCH_LIST="8.7"
export COMPILE_MARLIN=1
export MAX_JOBS=10
python -m pip wheel . --no-build-isolation -w dist --no-clean
pip install dist/auto_gptq-0.8.0.dev0+cu126-cp310-cp310-linux_aarch64.whl --user
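
The exact wheel filename depends on your CUDA and Python versions, so if the name above doesn't match what was built, list the dist directory and install with a wildcard instead:

ls dist/
pip install dist/auto_gptq-*.whl --user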


python convert_checkpoint.py --model_dir Meta-Llama-3-8B-Instruct-GPTQ --output_dir tllm_checkpoint_1gpu_gptq --dtype float16 --use_weight_only --weight_only_precision int4_gptq

export PATH=$HOME/.local/bin:$PATH

trtllm-build --checkpoint_dir tllm_checkpoint_1gpu_gptq --output_dir engine_1gpu_gptq --gemm_plugin float16
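
If the build succeeds, the output directory should contain the engine and its config (the same config.json and rank0.engine that come up later in this thread):

ls engine_1gpu_gptq/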

python3 ../run.py --max_output_len=50 --tokenizer_dir Meta-Llama-3-8B-Instruct --engine_dir=engine_1gpu_gptq --use_mmap

Thanks for your suggestion. But when I run the commands, everything goes well until this step:

python3 scripts/build_wheel.py --clean --cuda_architectures 87 -DENABLE_MULTI_DEVICE=0 --build_type Release --benchmarks --use_ccache

Some updates under “/3rdparty/cutlass/” make the build report errors, so now I cannot get the whl file.

Either drop

--use_ccache
or
sudo apt update && sudo apt install ccache

This might help with cutlass

pip install nvidia-cutlass==3.5.1.0
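
If the build still complains about 3rdparty/cutlass, it may also be worth re-syncing just that submodule; this is only a guess on my part, not something verified on Jetson:

git submodule sync 3rdparty/cutlass
git submodule update --init --recursive 3rdparty/cutlass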

Hi,

Please find the v0.12.0-jetson branch for Jetson support.
You can find the corresponding image below:
https://hub.docker.com/r/dustynv/tensorrt_llm/tags

The prebuilt package can be found on pypi.jetson-ai-lab.dev.
Unfortunately, the server has been down recently.

Thanks.

Thanks. Even after I install nvidia-cutlass, when I run the command to generate the whl file, it reports an error:

ERROR: file:///media/chosen/AEEC9ED8EC9E9A63/jetson/jetson-containers/TensorRT-LLM/3rdparty/cutlass/python does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.

Thanks for your suggestion.

Yes, I did follow the steps from

to pull tensorrt_llm:0.12-r36.4.0:

jetson-containers run \
  -e HUGGINGFACE_TOKEN=YOUR_API_KEY \
  -e FORCE_BUILD=on \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
    /opt/TensorRT-LLM/llama.sh

After that, I started the container with the server using this command:

jetson-containers run \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
  python3 /opt/TensorRT-LLM/examples/apps/openai_server.py \
    /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq

At the beginning, everything looks good; here is the output:

[TensorRT-LLM] TensorRT-LLM version: 0.12.0
Loading Model: [1/2]	Loading TRT checkpoints to memory
Time: 3.422s
Loading Model: [2/2]	Build TRT-LLM engine
Time: 281.236s
Loading model done.
Total latency: 284.659s

But later, it reports an error:

[07/14/2025-09:06:39] [TRT-LLM] [E] Failed to load tokenizer from /tmp/tmprsowp_eillm-workspace/tmp.engine: Unrecognized model in /tmp/tmprsowp_eillm-workspace/tmp.engine. Should have a model_type key in its config.json, or contain one of the following strings in its name:…

I have no idea what I can do now. Please advise. Thanks.

Find this directory:
/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq
Edit the config.json (or create it) and make sure it has the following line or lines:

{
    "model_type": "llama"
}
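
If you prefer to apply that edit non-interactively (just a sketch, assuming the model path from the command above; adjust it if your layout differs), something like this adds the key without disturbing the rest of the file:

python3 - <<'EOF'
import json, pathlib
# Assumed path: the model directory passed to openai_server.py above
p = pathlib.Path("/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq/config.json")
cfg = json.loads(p.read_text()) if p.exists() else {}
cfg.setdefault("model_type", "llama")  # add the key only if it is missing
p.write_text(json.dumps(cfg, indent=2))
EOF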

If that doesn’t fix it, please post the entire error.


Thanks. After editing the config.json, here is the error:

chosen@ubuntu:/media/chosen/AEEC9ED8EC9E9A63/jetson/jetson-containers$ jetson-containers run
dustynv/tensorrt_llm:0.12-r36.4.0
python3 /opt/TensorRT-LLM/examples/apps/openai_server.py
/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq
V4L2_DEVICES:

DISPLAY environmental variable is already set: “:1”

localuser:root being added to access control list

ARM64 architecture detected

Jetson Detected

SYSTEM_ARCH=tegra-aarch64

  • docker run --runtime nvidia --env NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics -it --rm --network host --shm-size=8g --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /media/chosen/AEEC9ED8EC9E9A63/jetson/jetson-containers/data:/data -v /etc/localtime:/etc/localtime:ro -v /etc/timezone:/etc/timezone:ro --device /dev/snd -e PULSE_SERVER=unix:/run/user/1000/pulse/native -v /run/user/1000/pulse:/run/user/1000/pulse --device /dev/bus/usb -e DISPLAY=:1 -v /tmp/.X11-unix/:/tmp/.X11-unix -v /tmp/.docker.xauth:/tmp/.docker.xauth -e XAUTHORITY=/tmp/.docker.xauth --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-3 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-6 --device /dev/i2c-7 --device /dev/i2c-8 --device /dev/i2c-9 --name jetson_container_20250715_080557 dustynv/tensorrt_llm:0.12-r36.4.0 python3 /opt/TensorRT-LLM/examples/apps/openai_server.py /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq
    /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
    warnings.warn(
    [TensorRT-LLM] TensorRT-LLM version: 0.12.0
    Loading Model: [1/2] Loading TRT checkpoints to memory
    Time: 3.290s
    Loading Model: [2/2] Build TRT-LLM engine
    Time: 262.181s
    Loading model done.
    Total latency: 265.472s
    [TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
    [TensorRT-LLM][INFO] Rank 0 is using GPU 0
    [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 10
    [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 10
    [TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
    [TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 512
    [TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
    [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 512
    [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
    [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
    [TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 5120
    [TensorRT-LLM][INFO] TRTGptModel maxInputLen: 511 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
    [TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
    [TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
    [TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
    [TensorRT-LLM][INFO] Loaded engine size: 3693 MiB
    [TensorRT-LLM][INFO] [MemUsageChange] Allocated 415.00 MiB for execution context memory.
    [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3688 (MiB)
    [TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
    [TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.81 MB GPU memory for decoder.
    [TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 61.37 GiB, available: 42.69 GiB
    [TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 1230
    [TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
    [TensorRT-LLM][INFO] Max KV cache pages per sequence: 8
    [TensorRT-LLM][INFO] Number of tokens per block: 64.
    [TensorRT-LLM][INFO] [MemUsageChange] Allocated 38.44 GiB for max tokens in paged KV cache (78720).
    [07/15/2025-08:11:05] [TRT-LLM] [E] Failed to load tokenizer from /tmp/tmpz09dp39allm-workspace/tmp.engine: Unrecognized model in /tmp/tmpz09dp39allm-workspace/tmp.engine. Should have a model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, idefics3, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zoedepth
    Traceback (most recent call last):
    File "/opt/TensorRT-LLM/examples/apps/openai_server.py", line 451, in <module>
    entrypoint()
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
    File "/opt/TensorRT-LLM/examples/apps/openai_server.py", line 441, in entrypoint
    hf_tokenizer = AutoTokenizer.from_pretrained(tokenizer or model_dir)
    File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 939, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2197, in from_pretrained
    raise EnvironmentError(
    OSError: Can't load tokenizer for '/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.
    Error in sys.excepthook:
    Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/hlapi/utils.py", line 319, in __call__
    obj.shutdown()
    AttributeError: 'LLM' object has no attribute 'shutdown'. Did you mean: '_shutdown'?

Original exception was:
Traceback (most recent call last):
File "/opt/TensorRT-LLM/examples/apps/openai_server.py", line 451, in <module>
entrypoint()
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/TensorRT-LLM/examples/apps/openai_server.py", line 441, in entrypoint
hf_tokenizer = AutoTokenizer.from_pretrained(tokenizer or model_dir)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 939, in from_pretrained
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2197, in from_pretrained
raise EnvironmentError(
OSError: Can't load tokenizer for '/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

Under “/data/models/Llama-2-7b-chat-hf-gptq” there are config.json, rank0.safetensors, and a subfolder “engines”.

Within “engines”, there are config.json and rank0.engine.

See if there is a directory with a name something like
models--Llama-2-7b-chat-hf-gptq in this directory:

~/.cache/huggingface/hub/

or see if this is set to point at that directory’s location:
echo $HF_HUB
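
Another way to see where the hub cache actually resolves (assuming huggingface_hub is installed; note that the variables the library usually reads are HF_HOME and HF_HUB_CACHE):

python3 -c "from huggingface_hub import constants; print(constants.HF_HUB_CACHE)"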


Thanks.

No, in my “/data/models/huggingface/” folder, I only have two subfolders and three files (“stored_tokens”, “token”, and “version.txt”).

The two subfolders are “models--meta-llama--Llama-2-7b-chat-hf” and “models--TheBloke--Llama-2-7B-Chat-GPTQ”.

I checked “llama.sh”; it only downloads the files and creates these two subfolders in the huggingface folder.

Thank you so much for your help.

Finally, I found the same error on the forum, posted last November (TensorRT-LLM for jetson errors - #17 by paulrrh).

Here is the solution:

jetson-containers run \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
  python3 /opt/TensorRT-LLM/examples/apps/openai_server.py \
    /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq --tokenizer /data/models/huggingface/models--meta-llama--Llama-2-7b-chat-hf/snapshots/f5db02db724555f92da89c216ac04704f23d4590/
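
Note that the snapshot hash in that tokenizer path is specific to the local download; list your own snapshots directory to find the right one:

ls /data/models/huggingface/models--meta-llama--Llama-2-7b-chat-hf/snapshots/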
 

I am surprised that this error has persisted for such a long time, yet NVIDIA keeps the instructions on their website without any notice.

I am not the first one to find it; they know about it but don’t want to fix it, even with just a small change to the website.


Try

ls ~/.cache/huggingface/hub/

and see if you see a directory named something like
models--Llama-2-7b-chat-hf-gptq

