Error in TensorRT-LLM's whl file

I downloaded the whl file from the Package Index.

As you can see, the file requires Python 3.12, and I installed it using Python 3.12.11, but it reported the error “No module named ‘distutils’”.

The “distutils” module was removed in Python 3.12, though it is still available in 3.10. If I use Python 3.10 to install the whl file, it reports that the wheel is not supported on this platform.

So how to fix this?

That package is not compatible with Jetson; it is built for CUDA SBSA (Grace).

NVIDIA says that TensorRT-LLM can run on Jetson AGX Orin.

Although I followed the steps presented on the website exactly, I still cannot get TensorRT-LLM working on my Jetson AGX Orin.

TensorRT-LLM/v0.12.0-jetson/README4Jetson.md

There should be a Jetson wheel (trt-llm v0.12.0-jetson) available at pypi.jetson-ai-lab.dev when it comes back online, or you can compile it yourself. Here’s how I compiled it a few months ago.

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.12.0-jetson

https://github.com/NVIDIA/TensorRT-LLM/blob/v0.12.0-jetson/README4Jetson.md omits the following lines, which may be needed:

git submodule update --init --recursive
git lfs pull
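
If you want to double-check that the submodules actually populated (a sanity check of my own, not part of the README), the cutlass tree is the one that tends to cause trouble later:

git submodule status 3rdparty/cutlass
ls 3rdparty/cutlass/python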

I installed the requirements manually, as I am not in a Python virtual environment and my first run of build_wheel.py (which invokes setup.py) errored on the requirements install.
pip install -r requirements-dev-jetson.txt --user

python3 scripts/build_wheel.py --clean --cuda_architectures 87 -DENABLE_MULTI_DEVICE=0 --build_type Release --benchmarks --use_ccache

pip install build/tensorrt_llm-*.whl
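
As a quick sanity check after the install (my own addition, not from the README), import the package and print its version to confirm the wheel matches your Python environment:

python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"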


The following steps are needed only if you are running the examples.

huggingface-cli download --repo-type model MaziyarPanahi/Meta-Llama-3-8B-Instruct-GPTQ

This huggingface.co repo has requirements (AutoGPTQ), so:
git clone https://github.com/AutoGPTQ/AutoGPTQ.git

If you aren't using conda, edit setup.py and change this line to:
conda_cuda_include_dir = "/usr/local/cuda/include"
export BUILD_CUDA_EXT=1
export TORCH_CUDA_ARCH_LIST="8.7"
export COMPILE_MARLIN=1
export MAX_JOBS=10
python -m pip wheel . --no-build-isolation -w dist --no-clean
pip install dist/auto_gptq-0.8.0.dev0+cu126-cp310-cp310-linux_aarch64.whl --user
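
The exact wheel filename depends on your CUDA and Python versions, so if the name above doesn't match what was built, list the dist directory and install with a wildcard instead:

ls dist/
pip install dist/auto_gptq-*.whl --user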


python convert_checkpoint.py --model_dir Meta-Llama-3-8B-Instruct-GPTQ --output_dir tllm_checkpoint_1gpu_gptq --dtype float16 --use_weight_only --weight_only_precision int4_gptq

export PATH=$HOME/.local/bin:$PATH

trtllm-build --checkpoint_dir tllm_checkpoint_1gpu_gptq --output_dir engine_1gpu_gptq --gemm_plugin float16
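
If the build succeeds, the output directory should contain the engine and its config (the same config.json and rank0.engine that come up later in this thread):

ls engine_1gpu_gptq/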

python3 ../run.py --max_output_len=50 --tokenizer_dir Meta-Llama-3-8B-Instruct --engine_dir=engine_1gpu_gptq --use_mmap

Thanks for your suggestion. But when I run the commands, everything goes well until this step:

python3 scripts/build_wheel.py --clean --cuda_architectures 87 -DENABLE_MULTI_DEVICE=0 --build_type Release --benchmarks --use_ccache

Some updates under “/3rdparty/cutlass/” make the build report errors, so now I cannot get the whl file.

Either drop

--use_ccache
or
sudo apt update && sudo apt install ccache

This might help with cutlass

pip install nvidia-cutlass==3.5.1.0
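
If the build still complains about 3rdparty/cutlass, it may also be worth re-syncing just that submodule; this is only a guess on my part, not something verified on Jetson:

git submodule sync 3rdparty/cutlass
git submodule update --init --recursive 3rdparty/cutlass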

Hi,

Please find the v0.12.0-jetson branch for Jetson support.
You can find the corresponding image below:
https://hub.docker.com/r/dustynv/tensorrt_llm/tags

The prebuilt package can be found on pypi.jetson-ai-lab.dev.
Unfortunately, the server has been down recently.

Thanks.

Thanks. Even after I install nvidia-cutlass, when I run the command to generate the whl file, it reports an error:

ERROR: file:///media/chosen/AEEC9ED8EC9E9A63/jetson/jetson-containers/TensorRT-LLM/3rdparty/cutlass/python does not appear to be a Python project: neither 'setup.py' nor 'pyproject.toml' found.

Thanks for your suggestion.

Yes, I did follow the steps from

to pull tensorrt_llm:0.12-r36.4.0:

jetson-containers run \
  -e HUGGINGFACE_TOKEN=YOUR_API_KEY \
  -e FORCE_BUILD=on \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
    /opt/TensorRT-LLM/llama.sh

After that, I started the container with the server using this command:

jetson-containers run \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
  python3 /opt/TensorRT-LLM/examples/apps/openai_server.py \
    /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq

At the beginning, everything looks good; here is the output:

[TensorRT-LLM] TensorRT-LLM version: 0.12.0
Loading Model: [1/2]	Loading TRT checkpoints to memory
Time: 3.422s
Loading Model: [2/2]	Build TRT-LLM engine
Time: 281.236s
Loading model done.
Total latency: 284.659s

But later, it reports an error:

[07/14/2025-09:06:39] [TRT-LLM] [E] Failed to load tokenizer from /tmp/tmprsowp_eillm-workspace/tmp.engine: Unrecognized model in /tmp/tmprsowp_eillm-workspace/tmp.engine. Should have a model_type key in its config.json, or contain one of the following strings in its name:…

I have no idea what I can do now. Please advise. Thanks.

Find this directory:
/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq
Edit the config.json (or create it) and make sure it has the following line or lines:

{
    "model_type": "llama"
}
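
If you prefer to apply that edit non-interactively (just a sketch, assuming the model path from the command above; adjust it if your layout differs), something like this adds the key without disturbing the rest of the file:

python3 - <<'EOF'
import json, pathlib
# Assumed path: the model directory passed to openai_server.py above
p = pathlib.Path("/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq/config.json")
cfg = json.loads(p.read_text()) if p.exists() else {}
cfg.setdefault("model_type", "llama")  # add the key only if it is missing
p.write_text(json.dumps(cfg, indent=2))
EOF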

If that doesn’t fix it, please post the entire error.


Thanks. After editing the config.json, here is the error:

chosen@ubuntu:/media/chosen/AEEC9ED8EC9E9A63/jetson/jetson-containers$ jetson-containers run
dustynv/tensorrt_llm:0.12-r36.4.0
python3 /opt/TensorRT-LLM/examples/apps/openai_server.py
/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq
V4L2_DEVICES:

DISPLAY environmental variable is already set: “:1”

localuser:root being added to access control list

ARM64 architecture detected

Jetson Detected

SYSTEM_ARCH=tegra-aarch64

  • docker run --runtime nvidia --env NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics -it --rm --network host --shm-size=8g --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /media/chosen/AEEC9ED8EC9E9A63/jetson/jetson-containers/data:/data -v /etc/localtime:/etc/localtime:ro -v /etc/timezone:/etc/timezone:ro --device /dev/snd -e PULSE_SERVER=unix:/run/user/1000/pulse/native -v /run/user/1000/pulse:/run/user/1000/pulse --device /dev/bus/usb -e DISPLAY=:1 -v /tmp/.X11-unix/:/tmp/.X11-unix -v /tmp/.docker.xauth:/tmp/.docker.xauth -e XAUTHORITY=/tmp/.docker.xauth --device /dev/i2c-0 --device /dev/i2c-1 --device /dev/i2c-2 --device /dev/i2c-3 --device /dev/i2c-4 --device /dev/i2c-5 --device /dev/i2c-6 --device /dev/i2c-7 --device /dev/i2c-8 --device /dev/i2c-9 --name jetson_container_20250715_080557 dustynv/tensorrt_llm:0.12-r36.4.0 python3 /opt/TensorRT-LLM/examples/apps/openai_server.py /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq
    /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
    warnings.warn(
    [TensorRT-LLM] TensorRT-LLM version: 0.12.0
    Loading Model: [1/2] Loading TRT checkpoints to memory
    Time: 3.290s
    Loading Model: [2/2] Build TRT-LLM engine
    Time: 262.181s
    Loading model done.
    Total latency: 265.472s
    [TensorRT-LLM][INFO] Engine version 0.12.0 found in the config file, assuming engine(s) built by new builder API.
    [TensorRT-LLM][INFO] Rank 0 is using GPU 0
    [TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 10
    [TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 10
    [TensorRT-LLM][INFO] TRTGptModel maxBeamWidth: 1
    [TensorRT-LLM][INFO] TRTGptModel maxSequenceLen: 512
    [TensorRT-LLM][INFO] TRTGptModel maxDraftLen: 0
    [TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 512
    [TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
    [TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
    [TensorRT-LLM][INFO] TRTGptModel maxNumTokens: 5120
    [TensorRT-LLM][INFO] TRTGptModel maxInputLen: 511 = min(maxSequenceLen - 1, maxNumTokens) since context FMHA and usePackedInput are enabled
    [TensorRT-LLM][INFO] TRTGptModel If model type is encoder, maxInputLen would be reset in trtEncoderModel to maxInputLen: min(maxSequenceLen, maxNumTokens).
    [TensorRT-LLM][INFO] Capacity Scheduler Policy: GUARANTEED_NO_EVICT
    [TensorRT-LLM][INFO] Context Chunking Scheduler Policy: None
    [TensorRT-LLM][INFO] Loaded engine size: 3693 MiB
    [TensorRT-LLM][INFO] [MemUsageChange] Allocated 415.00 MiB for execution context memory.
    [TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 3688 (MiB)
    [TensorRT-LLM][INFO] [MemUsageChange] Allocated 1.36 MB GPU memory for runtime buffers.
    [TensorRT-LLM][INFO] [MemUsageChange] Allocated 6.81 MB GPU memory for decoder.
    [TensorRT-LLM][INFO] Memory usage when calculating max tokens in paged kv cache: total: 61.37 GiB, available: 42.69 GiB
    [TensorRT-LLM][INFO] Number of blocks in KV cache primary pool: 1230
    [TensorRT-LLM][INFO] Number of blocks in KV cache secondary pool: 0, onboard blocks to primary memory before reuse: true
    [TensorRT-LLM][INFO] Max KV cache pages per sequence: 8
    [TensorRT-LLM][INFO] Number of tokens per block: 64.
    [TensorRT-LLM][INFO] [MemUsageChange] Allocated 38.44 GiB for max tokens in paged KV cache (78720).
    [07/15/2025-08:11:05] [TRT-LLM] [E] Failed to load tokenizer from /tmp/tmpz09dp39allm-workspace/tmp.engine: Unrecognized model in /tmp/tmpz09dp39allm-workspace/tmp.engine. Should have a model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, idefics3, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zoedepth
    Traceback (most recent call last):
    File "/opt/TensorRT-LLM/examples/apps/openai_server.py", line 451, in <module>
    entrypoint()
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
    File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
    File "/opt/TensorRT-LLM/examples/apps/openai_server.py", line 441, in entrypoint
    hf_tokenizer = AutoTokenizer.from_pretrained(tokenizer or model_dir)
    File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 939, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2197, in from_pretrained
    raise EnvironmentError(
    OSError: Can't load tokenizer for '/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.
    Error in sys.excepthook:
    Traceback (most recent call last):
    File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/hlapi/utils.py", line 319, in __call__
    obj.shutdown()
    AttributeError: 'LLM' object has no attribute 'shutdown'. Did you mean: '_shutdown'?

Original exception was:
Traceback (most recent call last):
File "/opt/TensorRT-LLM/examples/apps/openai_server.py", line 451, in <module>
entrypoint()
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/TensorRT-LLM/examples/apps/openai_server.py", line 441, in entrypoint
hf_tokenizer = AutoTokenizer.from_pretrained(tokenizer or model_dir)
File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 939, in from_pretrained
return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2197, in from_pretrained
raise EnvironmentError(
OSError: Can't load tokenizer for '/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure '/data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

Under “/data/models/Llama-2-7b-chat-hf-gptq” there are config.json, rank0.safetensors, and a subfolder “engines”.

Within “engines”, there are config.json and rank0.engine.

See if there is a directory with a name something like
models--Llama-2-7b-chat-hf-gptq in this directory:

~/.cache/huggingface/hub/

or see if this is set to point at that directory’s location:
echo $HF_HUB
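
Another way to see where the hub cache actually resolves (assuming huggingface_hub is installed; note that the variables the library usually reads are HF_HOME and HF_HUB_CACHE):

python3 -c "from huggingface_hub import constants; print(constants.HF_HUB_CACHE)"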


Thanks.

No, in my “/data/models/huggingface/” folder, I only have two subfolders and three files (“stored_tokens”, “token”, and “version.txt”).

The two subfolders are “models--meta-llama--Llama-2-7b-chat-hf” and “models--TheBloke--Llama-2-7B-Chat-GPTQ”.

I checked “llama.sh”; it only downloads the files and creates these two subfolders in the huggingface folder.

Thank you so much for your help.

Finally, I found the same error on the forum, posted last November (TensorRT-LLM for jetson errors - #17 by paulrrh).

Here is the solution:

jetson-containers run \
  dustynv/tensorrt_llm:0.12-r36.4.0 \
  python3 /opt/TensorRT-LLM/examples/apps/openai_server.py \
    /data/models/tensorrt_llm/Llama-2-7b-chat-hf-gptq --tokenizer /data/models/huggingface/models--meta-llama--Llama-2-7b-chat-hf/snapshots/f5db02db724555f92da89c216ac04704f23d4590/
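
Note that the snapshot hash in that tokenizer path is specific to the local download; list your own snapshots directory to find the right one:

ls /data/models/huggingface/models--meta-llama--Llama-2-7b-chat-hf/snapshots/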
 

I am surprised that this error has persisted for such a long time, yet NVIDIA keeps the instructions on their website without any notice.

I am not the first one to find it; they know about it but don’t want to fix it, even with just a small change to the website.


Try

ls ~/.cache/huggingface/hub/

and see if you see a directory named something like
models--Llama-2-7b-chat-hf-gptq

