VSS blueprint 2.2.0 - processing, percentage complete is 0.00 forever

  • Hardware Platform (GPU model and numbers): AWS g6e.48xlarge (8x NVIDIA L40S)
  • Ubuntu Version: Ubuntu 22.04.5 LTS
  • NVIDIA GPU Driver Version (valid for GPU only): 570.86.10
  • Issue Type (questions, new requirements, bugs): bug. Image/video uploads succeed via both the UI and the API, but no summary is ever produced.

Logging into the pod and running tail -f via_engine.log shows:

2025-03-04 12:22:57,806 INFO Status for query f210a928-8fee-4333-a1a6-476b5f4b3281 is processing, percent complete is 0.00, size of response list is 0
2025-03-04 12:22:58,132 INFO Status for query 1b1efdc1-2f88-4f4c-854a-cbe810a47d56 is processing, percent complete is 0.00, size of response list is 0
2025-03-04 12:22:58,615 INFO Status for query 14ed9733-1476-4e5f-926f-03863d9b30a1 is processing, percent complete is 0.00, size of response list is 0
2025-03-04 12:22:58,807 INFO Status for query f210a928-8fee-4333-a1a6-476b5f4b3281 is processing, percent complete is 0.00, size of response list is 0
2025-03-04 12:22:59,133 INFO Status for query 1b1efdc1-2f88-4f4c-854a-cbe810a47d56 is processing, percent complete is 0.00, size of response list is 0
2025-03-04 12:22:59,173 INFO Status for query b0ba51e5-639c-4e8c-8e6d-1ced4e1a050b is processing, percent complete is 0.00, size of response list is 0

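The status lines above repeat indefinitely at 0.00. On the client side it can help to bound the wait instead of polling forever. Below is a minimal sketch; the get_status callable and the percent_complete/state field names are hypothetical stand-ins for whatever your VSS client returns, not the actual API:

```python
import time

def wait_for_summary(get_status, query_id, timeout_s=300, poll_s=2.0):
    """Poll a status callable until the query makes progress or times out.

    `get_status` is a stand-in for whatever returns the query status dict
    (e.g. a GET against the VSS REST API); field names here are assumed.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = get_status(query_id)
        # Treat any progress, or an explicit completion state, as success.
        if status.get("percent_complete", 0.0) > 0.0 or status.get("state") == "done":
            return status
        time.sleep(poll_s)
    raise TimeoutError(f"query {query_id} stuck at 0.00 for {timeout_s}s")
```

With a bounded wait like this, a deployment-level problem (such as the one in this thread) surfaces as a TimeoutError instead of a silent hang.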
sudo microk8s kubectl get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
etcd-etcd-deployment-997647859-szvj2                   1/1     Running   2          20h
milvus-milvus-deployment-7764df4d7c-k2tm2              1/1     Running   0          115m
minio-minio-deployment-665bb7d8c4-l2lwn                1/1     Running   2          20h
nemo-embedding-embedding-deployment-59d77cdcc4-r9bg9   1/1     Running   0          115m
nemo-rerank-ranking-deployment-55d7885b58-mnb7r        1/1     Running   0          115m
neo4j-neo4j-deployment-595cb69cc-g52p5                 1/1     Running   2          20h
vss-blueprint-0                                        1/1     Running   0          115m
vss-vss-deployment-6546dcbdc4-8jbzb                    1/1     Running   0          112m

Can you install GPU driver version 535 and try the deployment again?


nvidia-smi
Tue Mar 4 18:29:22 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01 Driver Version: 535.183.01 CUDA Version: 12.2 |
+-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA L40S Off | 00000000:9E:00.0 Off | 0 |
| N/A 49C P0 109W / 350W | 36489MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA L40S Off | 00000000:A0:00.0 Off | 0 |
| N/A 47C P0 101W / 350W | 37040MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA L40S Off | 00000000:A2:00.0 Off | 0 |
| N/A 45C P0 99W / 350W | 15498MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA L40S Off | 00000000:A4:00.0 Off | 0 |
| N/A 45C P0 98W / 350W | 15498MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA L40S Off | 00000000:C6:00.0 Off | 0 |
| N/A 47C P0 100W / 350W | 45162MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA L40S Off | 00000000:C8:00.0 Off | 0 |
| N/A 45C P0 98W / 350W | 43674MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA L40S Off | 00000000:CA:00.0 Off | 0 |
| N/A 45C P0 99W / 350W | 43678MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA L40S Off | 00000000:CC:00.0 Off | 0 |
| N/A 45C P0 97W / 350W | 43674MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 646646 M+C /usr/bin/python3 1538MiB |
| 0 N/A N/A 646653 C /usr/bin/python3 910MiB |
| 0 N/A N/A 648538 C nvidia-cuda-mps-server 28MiB |
| 0 N/A N/A 650322 C /usr/bin/python3 34002MiB |
| 1 N/A N/A 644733 M+C python3 422MiB |
| 1 N/A N/A 646649 C /usr/bin/python3 1538MiB |
| 1 N/A N/A 646654 C /usr/bin/python3 1038MiB |
| 1 N/A N/A 648538 C nvidia-cuda-mps-server 28MiB |
| 1 N/A N/A 680427 C /usr/bin/python3 34002MiB |
| 2 N/A N/A 23506 C tritonserver 15492MiB |
| 3 N/A N/A 23547 C tritonserver 15492MiB |
| 4 N/A N/A 22341 C /opt/nim/llm/.venv/bin/python3 43668MiB |
| 4 N/A N/A 22342 C /opt/nim/llm/.venv/bin/python3 490MiB |
| 4 N/A N/A 22343 C /opt/nim/llm/.venv/bin/python3 490MiB |
| 4 N/A N/A 22344 C /opt/nim/llm/.venv/bin/python3 490MiB |
| 5 N/A N/A 22342 C /opt/nim/llm/.venv/bin/python3 43666MiB |
| 6 N/A N/A 22343 C /opt/nim/llm/.venv/bin/python3 43670MiB |
| 7 N/A N/A 22344 C /opt/nim/llm/.venv/bin/python3 43666MiB |
+---------------------------------------------------------------------------------------+

2025-03-04 18:26:44,173 INFO Received summarize query, id - becf1844-fe14-4663-b985-057400ea1ffd (live-stream=0), chunk_duration=30, chunk_overlap_duration=0, media-offset-type=None, media-start-time=None, media-end-time=None, modelParams={"max_new_tokens": 512, "top_p": 1.0, "top_k": 100.0, "temperature": 0.4, "seed": 1}, summary_duration=0, stream=True num_frames_per_chunk=0 vlm_input_width = 0, vlm_input_height = 0, summarize_batch_size = None, rag_type = None, rag_top_k = None, rag_batch_size = None
2025-03-04 18:26:44 | INFO | stdout | INFO: 172.31.15.59:23067 - "GET /gradio_api/queue/data?session_hash=ui5as8vlope HTTP/1.1" 200 OK
INFO: 172.31.15.59:48136 - "GET /health/live HTTP/1.1" 200 OK
INFO: 172.31.15.59:48140 - "GET /health/ready HTTP/1.1" 200 OK
Guardrails process execution time = 1.491 sec
2025-03-04 18:26:45,778 INFO INTIALIZING CONTEXT MANAGER
2025-03-04 18:26:45,779 INFO INITIALIZING CONTEXT MANAGER PROCESS
2025-03-04 18:26:45,785 INFO Triggering oldest queued query 93e5b481-fd82-4bf2-947b-979533659577
File Split execution time = 5.719 millisec
2025-03-04 18:26:45,792 INFO Created video file query 93e5b481-fd82-4bf2-947b-979533659577 for videoId becf1844-fe14-4663-b985-057400ea1ffd
2025-03-04 18:26:45,792 INFO Waiting for results of query 93e5b481-fd82-4bf2-947b-979533659577
INFO: 127.0.0.1:32810 - "POST /summarize HTTP/1.1" 200 OK
Failed to query video capabilities: Invalid argument
Failed to query video capabilities: Invalid argument
2025-03-04 18:26:45,806 INFO Status for query 93e5b481-fd82-4bf2-947b-979533659577 is processing, percent complete is 0.00, size of response list is 0
Failed to query video capabilities: Invalid argument
Failed to query video capabilities: Invalid argument
Failed to query video capabilities: Invalid argument
Failed to query video capabilities: Invalid argument
Decode execution time = 965.793 millisec
Decode execution time = 965.879 millisec
Process VlmProcess-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/nvidia/via/via-engine/vlm_pipeline/process_base.py", line 277, in run
    item = self._queue.get()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor
    storage = storage_cls._new_shared_cuda(
  File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 1434, in _new_shared_cuda
    return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
RuntimeError: CUDA error: peer access is not supported between these two devices
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Decode execution time = 1.152 sec
Decode execution time = 1.141 sec
Decode execution time = 1.141 sec
Decode execution time = 1.145 sec
INFO: 172.31.15.59:57018 - "GET /health/ready HTTP/1.1" 200 OK
2025-03-04 18:26:49,840 INFO INTIALIZING CONTEXT MANAGER HANDLER
2025-03-04 18:26:49,858 INFO Using meta/llama-3.3-70b-instruct as the summarization llm
2025-03-04 18:26:49,907 INFO Using meta/llama-3.3-70b-instruct as the chat llm
2025-03-04 18:26:49,960 INFO Setting up Batcher with batch size 5
2025-03-04 18:26:49,960 INFO Setting up QnA, rag type: graph-rag
2025-03-04 18:26:49,971 INFO Setting up Batcher with batch size 2
2025-03-04 18:26:50,021 INFO Index already exist,Skipping creation.
2025-03-04 18:26:50,154 INFO Successfully retrieved Neo4jVector Fulltext index 'vector' and keyword index 'keyword'
2025-03-04 18:26:50,154 INFO Starting to create document retriever chain
2025-03-04 18:26:50,155 INFO Successfully created document retriever chain
2025-03-04 18:26:50,155 INFO Using meta/llama-3.3-70b-instruct as the notification llm
2025-03-04 18:26:50,319 INFO Updating context manager with config:
{'chat': {'embedding': {'base_url': 'http://nemo-embedding-embedding-deployment-embedding-service:8000/v1', 'model': 'nvidia/llama-3.2-nv-embedqa-1b-v2'}, 'llm': {'base_url': 'http://llm-nim-svc:8000/v1', 'max_tokens': 2048, 'model': 'meta/llama-3.3-70b-instruct', 'temperature': 0.2, 'top_p': 0.7}, 'params': {'batch_size': 2, 'top_k': 5}, 'rag': 'graph-rag', 'reranker': {'base_url': 'http://nemo-rerank-ranking-deployment-ranking-service:8000/v1', 'model': 'nvidia/llama-3.2-nv-rerankqa-1b-v2'}}, 'notification': {'enable': True, 'endpoint': 'http://127.0.0.1:60000/via-alert-callback', 'llm': {'base_url': 'http://llm-nim-svc:8000/v1', 'max_tokens': 2048, 'model': 'meta/llama-3.3-70b-instruct', 'temperature': 0.2, 'top_p': 0.7}}, 'summarization': {'embedding': {'base_url': 'http://nemo-embedding-embedding-deployment-embedding-service:8000/v1', 'model': 'nvidia/llama-3.2-nv-embedqa-1b-v2'}, 'enable': True, 'llm': {'base_url': 'http://llm-nim-svc:8000/v1', 'max_tokens': 2048, 'model': 'meta/llama-3.3-70b-instruct', 'temperature': 0.2, 'top_p': 0.7}, 'method': 'batch', 'params': {'batch_size': 5}, 'prompts': {'caption': 'You are a bridge inspection system. Describe the condition of the bridge. Start each event description with a start and end time stamp of the event', 'caption_summarization': 'You will be given captions from sequential clips of a video. Aggregate captions in the format start_time:end_time:caption based on whether captions are related to one another or create a continuous scene.', 'summary_aggregation': 'Based on the available information, generate a summary that describes the condition of the bridge. The summary should be organized chronologically and in logical sections. This should be a concise, yet descriptive summary of all the important events. The format should be intuitive and easy for a user to understand what happened.
Format the output in Markdown so it can be displayed nicely.'}}, 'api_key': 'NOAPIKEYSET', 'milvus_db_host': 'milvus-milvus-deployment-milvus-service', 'milvus_db_port': '19530'}
2025-03-04 18:26:50,319 INFO Setting up Batcher with batch size 5
INFO: 172.31.15.59:57024 - "GET /health/live HTTP/1.1" 200 OK
INFO: 172.31.15.59:57040 - "GET /health/ready HTTP/1.1" 200 OK
2025-03-04 18:26:55,919 INFO Status for query 93e5b481-fd82-4bf2-947b-979533659577 is processing, percent complete is 0.00, size of response list is 0
INFO: 172.31.15.59:39182 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 172.31.15.59:39186 - "GET /health/live HTTP/1.1" 200 OK
INFO: 172.31.15.59:39200 - "GET /health/ready HTTP/1.1" 200 OK
2025-03-04 18:27:06,031 INFO Status for query 93e5b481-fd82-4bf2-947b-979533659577 is processing, percent complete is 0.00, size of response list is 0
INFO: 172.31.15.59:43536 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 172.31.15.59:43538 - "GET /health/live HTTP/1.1" 200 OK
INFO: 172.31.15.59:43554 - "GET /health/ready HTTP/1.1" 200 OK
2025-03-04 18:27:16,144 INFO Status for query 93e5b481-fd82-4bf2-947b-979533659577 is processing, percent complete is 0.00, size of response list is 0
INFO: 172.31.15.59:36566 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 172.31.15.59:36574 - "GET /health/live HTTP/1.1" 200 OK
INFO: 172.31.15.59:36588 - "GET /health/ready HTTP/1.1" 200 OK
2025-03-04 18:27:26,256 INFO Status for query 93e5b481-fd82-4bf2-947b-979533659577 is processing, percent complete is 0.00, size of response list is 0
INFO: 172.31.15.59:34024 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 172.31.15.59:34038 - "GET /health/live HTTP/1.1" 200 OK
INFO: 172.31.15.59:34044 - "GET /health/ready HTTP/1.1" 200 OK
2025-03-04 18:27:36,365 INFO Status for query 93e5b481-fd82-4bf2-947b-979533659577 is processing, percent complete is 0.00, size of response list is 0
INFO: 172.31.15.59:48658 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 172.31.15.59:48672 - "GET /health/live HTTP/1.1" 200 OK
INFO: 172.31.15.59:48674 - "GET /health/ready HTTP/1.1" 200 OK
2025-03-04 18:27:46,469 INFO Status for query 93e5b481-fd82-4bf2-947b-979533659577 is processing, percent complete is 0.00, size of response list is 0
INFO: 172.31.15.59:53330 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 172.31.15.59:53342 - "GET /health/live HTTP/1.1" 200 OK
INFO: 172.31.15.59:53346 - "GET /health/ready HTTP/1.1" 200 OK
2025-03-04 18:27:56,577 INFO Status for query 93e5b481-fd82-4bf2-947b-979533659577 is processing, percent complete is 0.00, size of response list is 0
INFO: 172.31.15.59:57422 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 172.31.15.59:57438 - "GET /health/live HTTP/1.1" 200 OK
INFO: 172.31.15.59:57440 - "GET /health/ready HTTP/1.1" 200 OK
2025-03-04 18:28:06,685 INFO Status for query 93e5b481-fd82-4bf2-947b-979533659577 is processing, percent complete is 0.00, size of response list is 0
INFO: 172.31.15.59:59120 - "GET /health/ready HTTP/1.1" 200 OK
INFO: 172.31.15.59:59136 - "GET /health/live HTTP/1.1" 200 OK
INFO: 172.31.15.59:59138 - "GET /health/ready HTTP/1.1" 200 OK
2025-03-04 18:28:16,792 INFO Status for query 93e5b481-fd82-4bf2-947b-979533659577 is processing, percent complete is 0.00, size of response list is 0
INFO: 172.31.15.59:36762 - "GET /health/ready HTTP/1.1" 200 OK
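A quick way to confirm which queries never progress is to scan via_engine.log for repeated 0.00 status lines. The regex below is written against the exact log format shown above; the helper itself is just a convenience sketch, not part of VSS:

```python
import re

# Matches lines like:
# "... INFO Status for query <uuid> is processing, percent complete is 0.00, ..."
STATUS_RE = re.compile(
    r"Status for query (?P<qid>[0-9a-f-]+) is (?P<state>\w+), "
    r"percent complete is (?P<pct>[\d.]+)"
)

def stuck_queries(log_lines, min_repeats=3):
    """Return query ids that report 0.00 percent at least `min_repeats` times."""
    counts = {}
    for line in log_lines:
        m = STATUS_RE.search(line)
        if m and float(m.group("pct")) == 0.0:
            counts[m.group("qid")] = counts.get(m.group("qid"), 0) + 1
    return [qid for qid, n in counts.items() if n >= min_repeats]
```

Feeding it the tail above would flag query 93e5b481-fd82-4bf2-947b-979533659577 as stuck.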

Could you attach your source video?

I am using only a single picture for testing purposes, and it is in JPG format.
When the pod starts I get this message:
VILA TRT model load execution time = 60.839 sec
2025-03-05 09:12:02,208 INFO Initialized VLM pipeline
Process VlmProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/nvidia/via/via-engine/vlm_pipeline/process_base.py", line 277, in run
    item = self._queue.get()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/reductions.py", line 181, in rebuild_cuda_tensor
    storage = storage_cls._new_shared_cuda(
  File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 1434, in _new_shared_cuda
    return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
RuntimeError: CUDA error: peer access is not supported between these two devices
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
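The "peer access is not supported between these two devices" error is typical of cloud instances where some GPU pairs are connected only across the host bridge. A rough way to spot such pairs is to look at nvidia-smi topo -m: as a heuristic (not a guarantee), pairs whose link type is SYS usually cannot do CUDA peer-to-peer. The parser below is a sketch against the standard matrix layout; the sample input in the test is illustrative, not taken from this machine:

```python
def sys_linked_pairs(topo_matrix: str):
    """Return GPU pairs whose `nvidia-smi topo -m` link type is SYS,
    meaning traffic crosses the host bridge. This is a heuristic sign
    that CUDA peer-to-peer access between the pair is unavailable."""
    gpu_rows = []
    for line in topo_matrix.splitlines():
        tokens = line.split()
        # Data rows start with "GPUn" and contain the self-link marker "X";
        # this also skips the header row, which likewise begins with "GPU0".
        if tokens and tokens[0].startswith("GPU") and "X" in tokens:
            gpu_rows.append(tokens)
    n = len(gpu_rows)
    pairs = []
    for i, row in enumerate(gpu_rows):
        for j, link in enumerate(row[1:1 + n]):
            if i < j and link == "SYS":
                pairs.append((row[0], gpu_rows[j][0]))
    return pairs
```

On the box itself, the definitive per-pair check is torch.cuda.can_device_access_peer(i, j); the topology letters are only a hint.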

Here is the log from the pod:
vss-vss-deployment-8486d46b4b-zr6p.log (42.6 KB)

Here is my values.yaml (I needed to rename the file to be able to upload it):
values_yaml.txt (6.9 KB)

Could you try our default VLM model ngc:nim/nvidia/vila-1.5-40b:vila-yi-34b-siglip-stage3_1003_video_v8? Because you are using the L40S, we are not sure whether it will work properly on this GPU.

After making the proposed changes:

Starting VIA server in release mode
2025-03-05 11:02:42,031 INFO Initializing VIA Stream Handler
2025-03-05 11:02:42,032 INFO Initializing VLM pipeline
2025-03-05 11:02:42,036 INFO Using model cached at /tmp/via-ngc-model-cache/nim_nvidia_vila-1.5-40b_vila-yi-34b-siglip-stage3_1003_video_v8_vila-llama-3-8b-lita
2025-03-05 11:02:42,040 INFO TRT-LLM Engine not found. Generating engines 

Selecting INT4 AWQ mode
Converting Checkpoint 

[2025-03-05 11:02:45,486] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[TensorRT-LLM] TensorRT-LLM version: 0.18.0.dev2025020400
Traceback (most recent call last):
  File "/opt/nvidia/via/via-engine/models/vila15/trt_helper/quantize.py", line 156, in <module>
    quantize_and_export(
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 669, in quantize_and_export
    hf_config = get_hf_config(model_dir)
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/quantization/quantize_by_modelopt.py", line 265, in get_hf_config
    return AutoConfig.from_pretrained(ckpt_path, trust_remote_code=True)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/configuration_auto.py", line 1053, in from_pretrained
    raise ValueError(
ValueError: Unrecognized model in /tmp/tmp.vila.GW09JQvm. Should have a model_type key in its config.json, or contain one of the following strings in its name: albert, align, altclip, audio-spectrogram-transformer, autoformer, bark, bart, beit, bert, bert-generation, big_bird, bigbird_pegasus, biogpt, bit, blenderbot, blenderbot-small, blip, blip-2, bloom, bridgetower, bros, camembert, canine, chameleon, chinese_clip, chinese_clip_vision_model, clap, clip, clip_text_model, clip_vision_model, clipseg, clvp, code_llama, codegen, cohere, conditional_detr, convbert, convnext, convnextv2, cpmant, ctrl, cvt, dac, data2vec-audio, data2vec-text, data2vec-vision, dbrx, deberta, deberta-v2, decision_transformer, deformable_detr, deit, depth_anything, deta, detr, dinat, dinov2, distilbert, donut-swin, dpr, dpt, efficientformer, efficientnet, electra, encodec, encoder-decoder, ernie, ernie_m, esm, falcon, falcon_mamba, fastspeech2_conformer, flaubert, flava, fnet, focalnet, fsmt, funnel, fuyu, gemma, gemma2, git, glm, glpn, gpt-sw3, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gpt_neox_japanese, gptj, gptsan-japanese, granite, granitemoe, graphormer, grounding-dino, groupvit, hiera, hubert, ibert, idefics, idefics2, idefics3, ijepa, imagegpt, informer, instructblip, instructblipvideo, jamba, jetmoe, jukebox, kosmos-2, layoutlm, layoutlmv2, layoutlmv3, led, levit, lilt, llama, llava, llava_next, llava_next_video, llava_onevision, longformer, longt5, luke, lxmert, m2m_100, mamba, mamba2, marian, markuplm, mask2former, maskformer, maskformer-swin, mbart, mctct, mega, megatron-bert, mgp-str, mimi, mistral, mixtral, mllama, mobilebert, mobilenet_v1, mobilenet_v2, mobilevit, mobilevitv2, moshi, mpnet, mpt, mra, mt5, musicgen, musicgen_melody, mvp, nat, nemotron, nezha, nllb-moe, nougat, nystromformer, olmo, olmo2, olmoe, omdet-turbo, oneformer, open-llama, openai-gpt, opt, owlv2, owlvit, paligemma, patchtsmixer, patchtst, pegasus, pegasus_x, perceiver, persimmon, phi, phi3, phimoe, 
pix2struct, pixtral, plbart, poolformer, pop2piano, prophetnet, pvt, pvt_v2, qdqbert, qwen2, qwen2_audio, qwen2_audio_encoder, qwen2_moe, qwen2_vl, rag, realm, recurrent_gemma, reformer, regnet, rembert, resnet, retribert, roberta, roberta-prelayernorm, roc_bert, roformer, rt_detr, rt_detr_resnet, rwkv, sam, seamless_m4t, seamless_m4t_v2, segformer, seggpt, sew, sew-d, siglip, siglip_vision_model, speech-encoder-decoder, speech_to_text, speech_to_text_2, speecht5, splinter, squeezebert, stablelm, starcoder2, superpoint, swiftformer, swin, swin2sr, swinv2, switch_transformers, t5, table-transformer, tapas, time_series_transformer, timesformer, timm_backbone, trajectory_transformer, transfo-xl, trocr, tvlt, tvp, udop, umt5, unispeech, unispeech-sat, univnet, upernet, van, video_llava, videomae, vilt, vipllava, vision-encoder-decoder, vision-text-dual-encoder, visual_bert, vit, vit_hybrid, vit_mae, vit_msn, vitdet, vitmatte, vits, vivit, wav2vec2, wav2vec2-bert, wav2vec2-conformer, wavlm, whisper, xclip, xglm, xlm, xlm-prophetnet, xlm-roberta, xlm-roberta-xl, xlnet, xmod, yolos, yoso, zamba, zoedepth, intern_vit_6b, v2l_projector, llava_llama, llava_mistral, llava_mixtral
ERROR: Failed to convert checkpoint
2025-03-05 11:02:52,138 ERROR Failed to load VIA stream handler - Failed to generate TRT-LLM engine
Traceback (most recent call last):
  File "/tmp/via/via-engine/via_server.py", line 1211, in run
    self._stream_handler = ViaStreamHandler(self._args)
  File "/opt/nvidia/via/via-engine/via_stream_handler.py", line 373, in __init__
    self._vlm_pipeline = VlmPipeline(args.asset_dir, args)
  File "/opt/nvidia/via/via-engine/vlm_pipeline/vlm_pipeline.py", line 965, in __init__
    raise Exception("Failed to generate TRT-LLM engine")
Exception: Failed to generate TRT-LLM engine

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/via/via-engine/via_server.py", line 2572, in <module>
    server.run()
  File "/tmp/via/via-engine/via_server.py", line 1213, in run
    raise ViaException(f"Failed to load VIA stream handler - {str(ex)}")
via_exception.ViaException: ViaException - code: InternalServerError message: Failed to load VIA stream handler - Failed to generate TRT-LLM engine
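The "Unrecognized model" failure above comes from transformers' AutoConfig.from_pretrained not finding a model_type key in the checkpoint's config.json (here staged under a temp dir). A quick pre-flight check along these lines can tell whether a downloaded checkpoint will even be recognizable before the quantization step; the helper name is mine, nothing VSS-specific:

```python
import json
from pathlib import Path

def checkpoint_model_type(ckpt_dir):
    """Return the `model_type` declared in <ckpt_dir>/config.json, or None
    if the file or key is missing or unreadable. A missing `model_type` is
    exactly the situation that makes AutoConfig raise `Unrecognized model`."""
    cfg = Path(ckpt_dir) / "config.json"
    if not cfg.is_file():
        return None
    try:
        return json.loads(cfg.read_text()).get("model_type")
    except (json.JSONDecodeError, OSError):
        return None
```

If this returns None for the cached model directory, the checkpoint download or extraction is likely incomplete or corrupted, and re-fetching the model cache is a reasonable next step.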

Quick update: following the steps in the Release Notes — Video Search and Summarization Agent,

sudo microk8s kubectl delete pvc vss-ngc-model-cache-pvc

the pod stays in Pending and produces no logs:

microk8s kubectl get pods
NAME                                                   READY   STATUS    RESTARTS   AGE
etcd-etcd-deployment-997647859-pnww2                   1/1     Running   0          32m
milvus-milvus-deployment-7764df4d7c-v296b              1/1     Running   0          32m
minio-minio-deployment-665bb7d8c4-2dc4t                1/1     Running   0          32m
nemo-embedding-embedding-deployment-59d77cdcc4-wm4m8   1/1     Running   0          32m
nemo-rerank-ranking-deployment-55d7885b58-vtjq5        1/1     Running   0          32m
neo4j-neo4j-deployment-595cb69cc-cqcx2                 1/1     Running   0          32m
vss-blueprint-0                                        1/1     Running   0          32m
vss-vss-deployment-5f8c7b4fcc-bg4sz                    0/1     Pending   0          14m

It may be caused by insufficient GPU resources. Could you avoid limiting the number of GPUs allocated to the VSS? Just modify your config file as below.

  resources:
    limits:
      nvidia.com/gpu: 0