VSS Engine (vila-1.5): "Sorry, I don't see that in the video" response with 0 chunks processed

I’m testing the vila-1.5 (https://huggingface.co/Efficient-Large-Model/VILA1.5-7b) model with the vss-engine:2.3.0 container and encountering an issue where the response is always:

{
  "id": "7f9ba7e5-a15b-4d12-bfc6-dda0c30ee130",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Sorry, I don't see that in the video.",
        "tool_calls": [],
        "role": "assistant"
      }
    }
  ],
  "created": 0,
  "model": "vila-1.5",
  "media_info": {
    "type": "offset",
    "start_offset": 0,
    "end_offset": 4000000000
  },
  "object": "summarization.completion",
  "usage": {
    "query_processing_time": 0,
    "total_chunks_processed": 0
  }
}
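
The usage.total_chunks_processed field is the clearest failure signal here. For scripting against the API, a quick check for this failure mode might look like the following (a minimal sketch; jq and the response.json filename are my additions):

# Pull the chunk counter out of a saved response; 0 means the engine
# never ran the VLM over any chunk of the video.
jq '.usage.total_chunks_processed' response.json

# Or fail fast in a script:
test "$(jq '.usage.total_chunks_processed' response.json)" -gt 0 \
  || echo "no chunks processed - check the vss-engine logs"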

Here is the request I used:

curl -X POST http://localhost:8100/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "id": "9ff99617-42b1-4738-99bf-f9c33a1e3bed",
    "messages": [
      {
        "content": "How many people in the video are not wearing PPE helmets?",
        "role": "user",
        "name": "hoangnt66"
      }
    ],
    "model": "vila-1.5",
    "api_type": "internal",
    "response_format": { "type": "text" },
    "stream": true,
    "stream_options": { "include_usage": false },
    "max_tokens": 512,
    "temperature": 0.2,
    "top_p": 1,
    "top_k": 100,
    "seed": 10,
    "chunk_duration": 60,
    "chunk_overlap_duration": 10,
    "summary_duration": 60,
    "media_info": {
      "type": "offset",
      "start_offset": 0,
      "end_offset": 4000000000
    },
    "highlight": false,
    "user": "hoangnt66"
  }'

It seems the model fails to process any video chunks (total_chunks_processed: 0) even though the media_info covers the entire video duration. The response is always the fallback message: "Sorry, I don't see that in the video."
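
In case it helps with debugging, the engine logs can be pulled from the container like this (a sketch; the ancestor filter assumes the image tag from my .env, and the grep pattern is just a guess at relevant keywords):

# Find the vss-engine container, then scan its logs for chunking/VLM errors
docker ps --filter "ancestor=nvcr.io/nvidia/blueprint/vss-engine:2.3.0"
docker logs <container-id> 2>&1 | grep -iE "error|chunk|vlm" | tail -n 50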

Could you please share more details about your environment and setup, such as the GPU model, driver version, config YAML files, and the steps you are running?

I use an A30 GPU with driver version 570.153.02.
Here is my .env config file:

NVIDIA_API_KEY=
OPENAI_API_KEY=
NGC_API_KEY=
FRONTEND_PORT=9100
BACKEND_PORT=8100
GRAPH_DB_USERNAME=neo4j
GRAPH_DB_PASSWORD=password
CA_RAG_CONFIG=./config.yaml
VIA_IMAGE=nvcr.io/nvidia/blueprint/vss-engine:2.3.0
VLM_MODEL_TO_USE=vila-1.5 # for local vlm model
MODEL_PATH=git:https://huggingface.co/Efficient-Large-Model/VILA1.5-7b
TRT_LLM_MODE=int4_awq
ENABLE_AUDIO=false
DISABLE_GUARDRAILS=true
DISABLE_CV_PIPELINE=false
INSTALL_PROPRIETARY_CODECS=true
NVIDIA_VISIBLE_DEVICES=0
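
For reference, this is how I can confirm the GPU and driver are visible to the container (standard nvidia-smi only; <container> is a placeholder for whatever docker ps reports):

# Driver/GPU as seen from inside the container
docker exec <container> nvidia-smi
# Watch memory while a request is in flight; VILA1.5-7B in int4_awq
# should fit comfortably in the A30's 24 GB
nvidia-smi --query-gpu=name,driver_version,memory.used,memory.total --format=csv -l 2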

Here is my config.yaml file:

summarization:
  enable: true
  method: "batch"
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "https://integrate.api.nvidia.com/v1"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7
  embedding:
    model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
    base_url: "https://integrate.api.nvidia.com/v1"
  params:
    batch_size: 5
    batch_max_concurrency: 20
  prompts:
    caption: "Write a concise and clear dense caption for the provided warehouse video, focusing on irregular or hazardous events such as boxes falling, workers not wearing PPE, workers falling, workers taking photographs, workers chitchatting, forklift stuck, etc. Start and end each sentence with a time stamp."
    caption_summarization: "You should summarize the following events of a warehouse in the format start_time:end_time:caption. For start_time and end_time use . to separate seconds, minutes, hours. If during a time segment only regular activities happen, then ignore them, else note any irregular activities in detail. The output should be bullet points in the format start_time:end_time: detailed_event_description. Don't return anything else except the bullet points."
    summary_aggregation: "You are a warehouse monitoring system. Given the captions in the form start_time:end_time: caption, aggregate the following captions in the format start_time:end_time:event_description. If the event_description is the same as another event_description, aggregate the captions in the format start_time1:end_time1,...,start_timek:end_timek:event_description. If any two adjacent end_time1 and start_time2 are within a few tenths of a second, merge the captions in the format start_time1:end_time2. The output should only contain bullet points. Cluster the output into Unsafe Behavior, Operational Inefficiencies, Potential Equipment Damage and Unauthorized Personnel"

chat:
  rag: graph-rag # graph-rag or vector-rag; if using a small LLM model, vector-rag is recommended
  params:
    batch_size: 1
    top_k: 5
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "https://integrate.api.nvidia.com/v1"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7
  embedding:
    model: "nvidia/llama-3.2-nv-embedqa-1b-v2"
    base_url: "https://integrate.api.nvidia.com/v1"
  reranker:
    model: "nvidia/llama-3.2-nv-rerankqa-1b-v2"
    base_url: "https://integrate.api.nvidia.com/v1"

notification:
  enable: true
  endpoint: "http://127.0.0.1:60000/via-alert-callback"
  llm:
    model: "meta/llama-3.1-70b-instruct"
    base_url: "https://integrate.api.nvidia.com/v1"
    max_tokens: 2048
    temperature: 0.2
    top_p: 0.7

There are a few possible causes that you could investigate:

  1. The accuracy of the VILA 1.5-7B model may be insufficient. You can try our ngc:nvidia/tao/nvila-highres:nvila-lite-15b-highres-lita model instead (see the sketch below).
  2. Reduce the value of chunk_duration to improve accuracy (see the sketch below).
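
If it helps, a sketch of both changes (the VLM_MODEL_TO_USE value for NVILA is my assumption; please verify it against the VSS 2.3.0 documentation):

# .env: switch the local VLM to the suggested NVILA model
VLM_MODEL_TO_USE=nvila   # assumed value - check the VSS docs
MODEL_PATH=ngc:nvidia/tao/nvila-highres:nvila-lite-15b-highres-lita

# Request body: shorter chunks (values are illustrative, not tuned)
#   "chunk_duration": 10,
#   "chunk_overlap_duration": 2,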