VSS Engine (vila-1.5): "Sorry, I don't see that in the video" response with 0 chunks processed

I’m testing the vila-1.5 (https://huggingface.co/Efficient-Large-Model/VILA1.5-7b) model with the vss-engine:2.3.0 container and encountering an issue where the response is always:

{
  "id": "7f9ba7e5-a15b-4d12-bfc6-dda0c30ee130",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Sorry, I don't see that in the video.",
        "tool_calls": [],
        "role": "assistant"
      }
    }
  ],
  "created": 0,
  "model": "vila-1.5",
  "media_info": {
    "type": "offset",
    "start_offset": 0,
    "end_offset": 4000000000
  },
  "object": "summarization.completion",
  "usage": {
    "query_processing_time": 0,
    "total_chunks_processed": 0
  }
}
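
The usage.total_chunks_processed field is the clearest failure signal here. For scripting against the API, a quick check for this failure mode might look like the following (a minimal sketch; jq and the response.json filename are my additions):

# Pull the chunk counter out of a saved response; 0 means the engine
# never ran the VLM over any chunk of the video.
jq '.usage.total_chunks_processed' response.json

# Or fail fast in a script:
test "$(jq '.usage.total_chunks_processed' response.json)" -gt 0 \
  || echo "no chunks processed - check the vss-engine logs"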

Here is the request I used:

curl -X POST http://localhost:8100/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "id": "9ff99617-42b1-4738-99bf-f9c33a1e3bed",
    "messages": [
      {
        "content": "How many people in the video are not wearing PPE helmets?",
        "role": "user",
        "name": "hoangnt66"
      }
    ],
    "model": "vila-1.5",
    "api_type": "internal",
    "response_format": { "type": "text" },
    "stream": true,
    "stream_options": { "include_usage": false },
    "max_tokens": 512,
    "temperature": 0.2,
    "top_p": 1,
    "top_k": 100,
    "seed": 10,
    "chunk_duration": 60,
    "chunk_overlap_duration": 10,
    "summary_duration": 60,
    "media_info": {
      "type": "offset",
      "start_offset": 0,
      "end_offset": 4000000000
    },
    "highlight": false,
    "user": "hoangnt66"
  }'

It seems the model fails to process any video chunks (total_chunks_processed: 0) even though the media_info covers the entire video duration. The response is always the fallback message: "Sorry, I don't see that in the video."
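
In case it helps with debugging, the engine logs can be pulled from the container like this (a sketch; the ancestor filter assumes the image tag from my .env, and the grep pattern is just a guess at relevant keywords):

# Find the vss-engine container, then scan its logs for chunking/VLM errors
docker ps --filter "ancestor=nvcr.io/nvidia/blueprint/vss-engine:2.3.0"
docker logs <container-id> 2>&1 | grep -iE "error|chunk|vlm" | tail -n 50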

Could you please share more details about your environment and setup, such as the GPU model, driver version, config YAML files, and the steps you are running?

I use an A30 GPU with driver version 570.153.02.
Here is my .env config file:

NVIDIA_API_KEY=
OPENAI_API_KEY=
NGC_API_KEY=
FRONTEND_PORT=9100
BACKEND_PORT=8100
GRAPH_DB_USERNAME=neo4j
GRAPH_DB_PASSWORD=password
CA_RAG_CONFIG=./config.yaml
VIA_IMAGE=nvcr.io/nvidia/blueprint/vss-engine:2.3.0
VLM_MODEL_TO_USE=vila-1.5 # for local vlm model
MODEL_PATH=git:https://huggingface.co/Efficient-Large-Model/VILA1.5-7b
TRT_LLM_MODE=int4_awq
ENABLE_AUDIO=false
DISABLE_GUARDRAILS=true
DISABLE_CV_PIPELINE=false
INSTALL_PROPRIETARY_CODECS=true
NVIDIA_VISIBLE_DEVICES=0
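
For reference, this is how I can confirm the GPU and driver are visible to the container (standard nvidia-smi only; <container> is a placeholder for whatever docker ps reports):

# Driver/GPU as seen from inside the container
docker exec <container> nvidia-smi
# Watch memory while a request is in flight; VILA1.5-7B in int4_awq
# should fit comfortably in the A30's 24 GB
nvidia-smi --query-gpu=name,driver_version,memory.used,memory.total --format=csv -l 2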

Here is my config.yaml file:

summarization:
  enable: true
  method: "batch"
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "https://integrate.api.nvidia.com/v1"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7
  embedding:
    model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
    base_url: "https://integrate.api.nvidia.com/v1"
  params:
    batch_size: 5
    batch_max_concurrency: 20
  prompts:
    caption: "Write a concise and clear dense caption for the provided warehouse video, focusing on irregular or hazardous events such as boxes falling, workers not wearing PPE, workers falling, workers taking photographs, workers chitchatting, forklift stuck, etc. Start and end each sentence with a time stamp."
    caption_summarization: "You should summarize the following events of a warehouse in the format start_time:end_time:caption. For start_time and end_time use . to separate seconds, minutes, hours. If during a time segment only regular activities happen, then ignore them, else note any irregular activities in detail. The output should be bullet points in the format start_time:end_time: detailed_event_description. Don't return anything else except the bullet points."
    summary_aggregation: "You are a warehouse monitoring system. Given the captions in the form start_time:end_time: caption, aggregate the following captions in the format start_time:end_time:event_description. If the event_description is the same as another event_description, aggregate the captions in the format start_time1:end_time1,...,start_timek:end_timek:event_description. If any two adjacent end_time1 and start_time2 are within a few tenths of a second, merge the captions in the format start_time1:end_time2. The output should only contain bullet points. Cluster the output into Unsafe Behavior, Operational Inefficiencies, Potential Equipment Damage and Unauthorized Personnel"

chat:
  rag: graph-rag # graph-rag or vector-rag; if using a small LLM model, vector-rag is recommended
  params:
    batch_size: 1
    top_k: 5
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "https://integrate.api.nvidia.com/v1"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7
  embedding:
    model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
    base_url: "https://integrate.api.nvidia.com/v1"
  reranker:
    model: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
    base_url: "https://integrate.api.nvidia.com/v1"

notification:
  enable: true
  endpoint: "http://127.0.0.1:60000/via-alert-callback"
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "https://integrate.api.nvidia.com/v1"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7

There are a few possible causes that you could investigate:

  1. The accuracy of the VILA 1.5-7B model may be insufficient. You can try our ngc:nvidia/tao/nvila-highres:nvila-lite-15b-highres-lita model instead (see the sketch below).
  2. Reduce the value of chunk_duration to improve accuracy (see the sketch below).
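
If it helps, a sketch of both changes (the VLM_MODEL_TO_USE value for NVILA is my assumption; please verify it against the VSS 2.3.0 documentation):

# .env: switch the local VLM to the suggested NVILA model
VLM_MODEL_TO_USE=nvila   # assumed value - check the VSS docs
MODEL_PATH=ngc:nvidia/tao/nvila-highres:nvila-lite-15b-highres-lita

# Request body: shorter chunks (values are illustrative, not tuned)
#   "chunk_duration": 10,
#   "chunk_overlap_duration": 2,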