VSS blueprint 2.2.0 - ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine


sudo microk8s kubectl get pods
NAME                                                   READY   STATUS             RESTARTS         AGE
etcd-etcd-deployment-997647859-89ch5                   1/1     Running            0                72m
milvus-milvus-deployment-7764df4d7c-sld8l              1/1     Running            0                72m
minio-minio-deployment-665bb7d8c4-zjxh6                1/1     Running            0                72m
nemo-embedding-embedding-deployment-59d77cdcc4-kjp89   1/1     Running            0                72m
nemo-rerank-ranking-deployment-55d7885b58-2fwnp        1/1     Running            0                72m
neo4j-neo4j-deployment-595cb69cc-lzv2d                 1/1     Running            0                72m
vss-blueprint-0                                        1/1     Running            0                72m
vss-vss-deployment-5f8c7b4fcc-vbbks                    0/1     CrashLoopBackOff   16 (2m23s ago)   72m

and the logs:
sudo microk8s kubectl logs vss-vss-deployment-5f8c7b4fcc-vbbks -f

ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine
Traceback (most recent call last):
  File "/tmp/via/via-engine/via_server.py", line 1211, in run
    self._stream_handler = ViaStreamHandler(self._args)
  File "/opt/nvidia/via/via-engine/via_stream_handler.py", line 373, in __init__
    self._vlm_pipeline = VlmPipeline(args.asset_dir, args)
  File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 965, in __init__
    raise Exception("Failed to generate TRT-LLM engine")
Exception: Failed to generate TRT-LLM engine

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/via/via-engine/via_server.py", line 2572, in <module>
    server.run()
  File "/tmp/via/via-engine/via_server.py", line 1213, in run
    raise ViaException(f"Failed to load VIA stream handler - {str(ex)}")
via_exception.ViaException: ViaException - code: InternalServerError message: Failed to load VIA stream handler - Failed to generate TRT-LLM engine
Killed process with PID 94
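
Note that the excerpt above begins at the handler failure; the underlying engine-generation error is printed earlier during startup. Since the pod is in CrashLoopBackOff, one way to capture the full log of the previously crashed container (standard kubectl, using the pod name from the listing above) is:

sudo microk8s kubectl logs vss-vss-deployment-5f8c7b4fcc-vbbks --previous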

I just ran into the same problem when trying to one-click deploy Launchables:

via-server-1 | 2025-03-03 14:51:22,391 INFO Initializing VIA Stream Handler
via-server-1 | 2025-03-03 14:51:22,392 INFO Initializing VLM pipeline
via-server-1 | 2025-03-03 14:51:22,395 INFO Using model cached at /root/.via/ngc_model_cache/nim_nvidia_vila-1.5-40b_vila-yi-34b-siglip-stage3_1003_video_v8_vila-llama-3-8b-lita
via-server-1 | 2025-03-03 14:51:22,395 INFO TRT-LLM Engine not found. Generating engines ...
via-server-1 | Selecting INT4 AWQ mode
via-server-1 | Converting Checkpoint ...
via-server-1 | [2025-03-03 14:51:25,547] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
via-server-1 | [TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025020400
via-server-1 | Traceback (most recent call last):
via-server-1 |   File "/opt/nvidia/via/via-engine/models/vila15/trt_helper/quantize.py", line 156, in <module>
via-server-1 |     quantize_and_export(
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 669, in quantize_and_export
via-server-1 |     hf_config = get_hf_config(model_dir)
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 265, in get_hf_config
via-server-1 |     return AutoConfig.from_pretrained(ckpt_path, trust_remote_code=True)
via-server-1 |   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1053, in from_pretrained
via-server-1 |     raise ValueError(
via-server-1 | ValueError: Unrecognized model in /tmp/tmp.vila.KhW5ddDT. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, idefics3, ijepa, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmo2, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zoedepth, intern_vit_6b, v2l_projector, llava_llama, llava_mistral, llava_mixtral
via-server-1 | ERROR: Failed to convert checkpoint
via-server-1 | 2025-03-03 14:51:29,388 ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine
via-server-1 | Traceback (most recent call last):
via-server-1 |   File "/opt/nvidia/via/via-engine/via_server.py", line 1211, in run
via-server-1 |     self._stream_handler = ViaStreamHandler(self._args)
via-server-1 |   File "/opt/nvidia/via/via-engine/via_stream_handler.py", line 373, in __init__
via-server-1 |     self._vlm_pipeline = VlmPipeline(args.asset_dir, args)
via-server-1 |   File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 965, in __init__
via-server-1 |     raise Exception("Failed to generate TRT-LLM engine")
via-server-1 | Exception: Failed to generate TRT-LLM engine
via-server-1 |
via-server-1 | During handling of the above exception, another exception occurred:
via-server-1 |
via-server-1 | Traceback (most recent call last):
via-server-1 |   File "/opt/nvidia/via/via-engine/via_server.py", line 2572, in <module>
via-server-1 |     server.run()
via-server-1 |   File "/opt/nvidia/via/via-engine/via_server.py", line 1213, in run
via-server-1 |     raise ViaException(f"Failed to load VIA stream handler - {str(ex)}")
via-server-1 | via_exception.ViaException: ViaException - code: InternalServerError message: Failed to load VIA stream handler - Failed to generate TRT-LLM engine
via-server-1 | Killed process with PID 96

Have you removed any stale TensorRT engines for VILA 1.5 by referring to our Guide?
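
For Docker Compose deployments, a minimal sketch of locating the stale engine files first, assuming the model cache path from the logs above is bind-mounted into the host's home directory (confirm the exact location in the Guide before deleting anything):

find ~/.via/ngc_model_cache -type f -name "*.engine"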

I also get the same error. I am running the example using Docker Compose on Brev.

[2025-03-04 08:11:17,532] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
df: /root/.triton/autotune: No such file or directory
[TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025020400
Traceback (most recent call last):
  File "/opt/nvidia/via/via-engine/models/vila15/trt_helper/quantize.py", line 156, in <module>
    quantize_and_export(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 669, in quantize_and_export
    hf_config = get_hf_config(model_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 265, in get_hf_config
    return AutoConfig.from_pretrained(ckpt_path, trust_remote_code=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1053, in from_pretrained
    raise ValueError(
ValueError: Unrecognized model in /tmp/tmp.vila.I9iY9eJB. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, idefics3, ijepa, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmo2, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zoedepth, intern_vit_6b, v2l_projector, llava_llama, llava_mistral, llava_mixtral
ERROR: Failed to convert checkpoint
2025-03-04 08:11:20,829 ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine
Traceback (most recent call last):
  File "/opt/nvidia/via/via-engine/via_server.py", line 1211, in run
    self._stream_handler = ViaStreamHandler(self._args)
  File "/opt/nvidia/via/via-engine/via_stream_handler.py", line 373, in __init__
    self._vlm_pipeline = VlmPipeline(args.asset_dir, args)
  File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 965, in __init__
    raise Exception("Failed to generate TRT-LLM engine")
Exception: Failed to generate TRT-LLM engine

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/nvidia/via/via-engine/via_server.py", line 2572, in <module>
    server.run()
  File "/opt/nvidia/via/via-engine/via_server.py", line 1213, in run
    raise ViaException(f"Failed to load VIA stream handler - {str(ex)}")
via_exception.ViaException: ViaException - code: InternalServerError message: Failed to load VIA stream handler - Failed to generate TRT-LLM engine
Killed process with PID 111
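
If it helps to narrow this down, here is a hypothetical diagnostic for the ValueError above: check whether the cached checkpoint's config.json actually contains a model_type key (an incomplete or broken download from NGC can leave it missing). The path is taken from the "Using model cached at ..." log line earlier in this thread and would be run inside the via-server container; adjust it for your deployment:

grep -o '"model_type"[^,}]*' /root/.via/ngc_model_cache/nim_nvidia_vila-1.5-40b_vila-yi-34b-siglip-stage3_1003_video_v8_vila-llama-3-8b-lita/config.json 2>/dev/null || echo "model_type missing (or config.json absent)"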

You mean this part?

I am running the Launchable and using the Notebook.


There is currently a bug with downloading the VILA 1.5 VLM from NGC. As a workaround, we recommend either:

  1. Using NVILA (see the documentation for the Helm chart)
  2. Downloading the model on the host machine using the latest NGC CLI and then mounting it into the VSS container/pod (see the sketch below)

The Launchable has been updated to use NVILA.
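
For option 2, a rough sketch rather than an exact recipe: the model path and version below are assumptions inferred from the cache directory name in the logs above, so verify the exact resource in the NGC catalog before running.

ngc registry model download-version "nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"
# Then bind-mount the downloaded directory into the container at the path the
# server expects (the target matches the "Using model cached at ..." log line);
# <downloaded_dir> is a placeholder for whatever directory the NGC CLI created:
#   -v <downloaded_dir>:/root/.via/ngc_model_cache/nim_nvidia_vila-1.5-40b_vila-yi-34b-siglip-stage3_1003_video_v8_vila-llama-3-8b-lita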


How do I get this update into my already-created Launchable instance?

I tried using NVILA and running the blueprint Docker containers, but the containers fail due to connection refused errors on ports 8000 and 9234.

local_deployment-via-server-1.log (32.8 KB)

I am not sure whether it is something I am doing wrong or whether it has something to do with the model change (there are some log lines related to LLM call exceptions).
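
One quick way to see whether the dependent services ever came up (a sketch; the ports are the ones from the connection-refused errors above):

ss -ltn | grep -E ':(8000|9234)\b' || echo "no listener on 8000/9234"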

Hi @shinen, have you tried the way @aryason advised in #7?

Yes, I did.

Since there was a bug with downloading the VILA 1.5 VLM from NGC, I switched to using NVILA.

That is when I ran into this error. I still have not gotten the containers to run properly.

OK. There are a lot of issues discussed in this topic, and they are not quite the same as yours. Can you describe your problem in detail and file a new topic? Thanks.

Understood. I will create a new topic. Thanks for the pointer.

Just adding the commands for editing and deleting the PVCs to remove any stale engine volume data:

kubectl get pvc -n videosearch
 
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
model-store-vss-blueprint-0   Lost     pvc-d92972b3-6709-4962-9fc2-e2386de9923d   0                         local-nfs      <unset>                 44d
vss-ngc-model-cache-pvc       Bound    pvc-dc846d62-0645-44b7-9d4c-ad2f6dd594ef   100Gi      RWO            local-nfs      <unset>                 44d

kubectl edit pvc -n videosearch model-store-vss-blueprint-0


Remove these two lines (the finalizer that blocks deletion):

finalizers:
- kubernetes.io/pvc-protection


kubectl delete pvc -n videosearch model-store-vss-blueprint-0
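
Alternatively, a non-interactive sketch with the same effect as the manual edit above, dropping the finalizer via kubectl patch:

kubectl patch pvc -n videosearch model-store-vss-blueprint-0 --type=merge -p '{"metadata":{"finalizers":null}}'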