Video Search and Summarization: models fail to download

I am trying to deploy the Video Search and Summarization blueprint (local_deployment variant) for local use.

docker-compose up hit errors before it could even start:

ERROR: Invalid interpolation format for "volumes" option in service "via-server": "${ASSET_STORAGE_DIR:-/dummy}${ASSET_STORAGE_DIR:+:/tmp/assets}"

Since I just wanted to get the blueprint running and the volume definitions did not seem critical, I commented out the offending volume lines.
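As far as I can tell, that interpolation is only meant to mount ASSET_STORAGE_DIR into the container when the variable is set. A quick shell check of the same expression (a sketch, assuming Compose follows bash-style substitution) behaves like this:

ASSET_STORAGE_DIR=/data/assets
echo "${ASSET_STORAGE_DIR:-/dummy}${ASSET_STORAGE_DIR:+:/tmp/assets}"
# -> /data/assets:/tmp/assets  (host dir mounted to /tmp/assets when the variable is set)
unset ASSET_STORAGE_DIR
echo "${ASSET_STORAGE_DIR:-/dummy}${ASSET_STORAGE_DIR:+:/tmp/assets}"
# -> /dummy  (harmless placeholder when it is not set)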

Now docker-compose runs and creates the containers, but via-server fails to start properly because it cannot download the models.

Initially I hit an NGC API key error, but then I figured out that I needed to put a valid NGC API key in the .env file. Even so, via-server still refuses to download the models.
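For reference, this is roughly what I have in .env now (NGC_API_KEY is my reading of the variable name expected by the compose file; adjust if yours differs):

# .env (sketch; NGC_API_KEY is the name I assume the compose file expects)
NGC_API_KEY=<valid key generated on ngc.nvidia.com>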

Container log:
via-server_1  | Starting VIA server in release mode
via-server_1  | 2025-03-04 10:19:43,657 INFO Initializing VIA Stream Handler
via-server_1  | 2025-03-04 10:19:43,657 INFO Initializing VLM pipeline
via-server_1  | 2025-03-04 10:19:43,957 INFO Downloading model nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8 ...
Getting files to download...
via-server_1  | (progress bar) Total: 29 - Completed: 0 - Failed: 29
via-server_1  |
via-server_1  | --------------------------------------------------------------------------------
via-server_1  |    Download status: FAILED
via-server_1  |    Downloaded local path model: /tmp/tmp5sw8ivy7/vila-1.5-40b_vvila-yi-34b-siglip-stage3_1003_video_v8
via-server_1  |    Total files downloaded: 0
via-server_1  |    Total transferred: 0 B
via-server_1  |    Started at: 2025-03-04 10:19:44
via-server_1  |    Completed at: 2025-03-04 10:19:49
via-server_1  |    Duration taken: 5s
via-server_1  | --------------------------------------------------------------------------------
via-server_1  | 2025-03-04 10:19:49,800 INFO Downloaded model to /root/.via/ngc_model_cache/nim_nvidia_vila-1.5-40b_vila-yi-34b-siglip-stage3_1003_video_v8_vila-llama-3-8b-lita
via-server_1  | 2025-03-04 10:19:49,801 INFO TRT-LLM Engine not found. Generating engines ...
via-server_1  | Selecting INT4 AWQ mode
via-server_1  | Converting Checkpoint ...
via-server_1  | [2025-03-04 10:19:52,856] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
via-server_1  | df: /root/.triton/autotune: No such file or directory
via-server_1  | [TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025020400
via-server_1  | Traceback (most recent call last):
via-server_1  |   File "/opt/nvidia/via/via-engine/models/vila15/trt_helper/quantize.py", line 156, in <module>
via-server_1  |     quantize_and_export(
via-server_1  |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 669, in quantize_and_export
via-server_1  |     hf_config = get_hf_config(model_dir)
via-server_1  |   File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 265, in get_hf_config
via-server_1  |     return AutoConfig.from_pretrained(ckpt_path, trust_remote_code=True)   
via-server_1  |   File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1053, in from_pretrained
via-server_1  |     raise ValueError(
via-server_1  | ValueError: Unrecognized model in /tmp/tmp.vila.oa7xbt3I. Should have a `model_type` key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, idefics3, ijepa, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmo2, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zoedepth, intern_vit_6b, v2l_projector, llava_llama, llava_mistral, llava_mixtral
via-server_1  | ERROR: Failed to convert checkpoint
via-server_1  | 2025-03-04 10:19:56,338 ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine
via-server_1  | Traceback (most recent call last):
via-server_1  |   File "/opt/nvidia/via/via-engine/via_server.py", line 1211, in run
via-server_1  |     self._stream_handler = ViaStreamHandler(self._args)
via-server_1  |   File "/opt/nvidia/via/via-engine/via_stream_handler.py", line 373, in __init__
via-server_1  |     self._vlm_pipeline = VlmPipeline(args.asset_dir, args)
via-server_1  |   File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 965, in __init__
via-server_1  |     raise Exception("Failed to generate TRT-LLM engine")
via-server_1  | Exception: Failed to generate TRT-LLM engine
via-server_1  |
via-server_1  | During handling of the above exception, another exception occurred:
via-server_1  |
via-server_1  | Traceback (most recent call last):
via-server_1  |   File "/opt/nvidia/via/via-engine/via_server.py", line 2572, in <module>  
via-server_1  |     server.run()
via-server_1  |   File "/opt/nvidia/via/via-engine/via_server.py", line 1213, in run
via-server_1  |     raise ViaException(f"Failed to load VIA stream handler - {str(ex)}")   
via-server_1  | via_exception.ViaException: ViaException - code: InternalServerError message: Failed to load VIA stream handler - Failed to generate TRT-LLM engine
via-server_1  | Killed process with PID 70
local_deployment_via-server_1 exited with code 1
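My reading of the log: all 29 files failed to download (0 B transferred), so the NGC model cache directory handed to the quantizer contains no config.json, which is why AutoConfig rejects it and engine generation fails. One way to rule out a key or permission problem would be to try the same download with the NGC CLI directly (a sketch; the model path is copied from the log, and I am assuming the CLI is installed and the model is visible to my org):

ngc config set                                   # paste the same API key that is in .env
ngc registry model list "nim/nvidia/vila-1.5-40b"
ngc registry model download-version "nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8"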

What seems to be the problem here? Am I doing something wrong?

NB 1: I also tried the remote_llm_deployment and remote_vlm_deployment variants, with the same results.
NB 2: I could not figure out how to get an NVIDIA_API_KEY (nvapi-***) from build.nvidia.com, which is required for the remote_llm_deployment and remote_vlm_deployment variants. Has that portal moved?

What version of docker-compose are you running? Recommended: v2.32.4.

The errors you are seeing look like a docker-compose version issue, so you should try to update. Check your current version with:

docker compose version

If that reports a 1.x version (or the command is not found), install the latest Compose v2 CLI plugin:
mkdir -p ~/.docker/cli-plugins
curl -SL https://github.com/docker/compose/releases/latest/download/docker-compose-linux-x86_64 -o ~/.docker/cli-plugins/docker-compose
chmod +x ~/.docker/cli-plugins/docker-compose
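Note that the plugin installed this way is invoked as docker compose (no hyphen); the old standalone docker-compose binary from the distro may still be on your PATH, so either remove it or make sure the blueprint scripts use the v2 form. You can compare the two:

which docker-compose && docker-compose version   # old standalone v1 from the distro package, if still installed
docker compose version                           # the v2 plugin installed above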

I was using the version that ships with the Ubuntu 22.04 package manager - v1.29.2 🫣 Thanks for pointing out that it might be too old.

After updating docker-compose to the latest version, the interpolation errors have gone away.


Now the remaining issue is the same as in VSS blueprint 2.2.0 - ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine. I will try the workarounds suggested there and see whether I can get the blueprint running successfully.
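Before retrying, I will also confirm that the VILA model actually downloaded this time (the earlier run reported 0 of 29 files), e.g. by looking inside the running container (a sketch; the container name is whatever docker compose ps reports for via-server, and the cache path is taken from the log above):

docker compose ps                                                          # find the via-server container name
docker exec -it <via-server-container> ls -l /root/.via/ngc_model_cache/   # cache path from the earlier log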