VSS Installation

Please provide the following information when creating a topic:

  • Hardware Platform (GPU model and numbers) H100 x4GPUs
  • System Memory
  • Ubuntu Version
  • NVIDIA GPU Driver Version (valid for GPU only)
  • Issue Type (questions, new requirements, bugs)
  • How to reproduce the issue? (This is for bugs. Including the command line used and other details for reproducing)
  • Requirement details (This is for new requirement. Including the logs for the pods, the description for the pods)

I would like to ask whether my config file for the Helm chart is set up correctly, because the VSS services cannot run. I am using GPU IDs 4, 5, 6, and 7.

sudo microk8s kubectl logs vss-vss-deployment-67f75b5894-bd99j 


config.txt (1.3 KB)

I have followed the link below to configure the Helm Chart

sudo microk8s kubectl get pods -A

Could you try to modify the value of the nemo-embedding and nemo-rerank to 7?

It is still the same. On our side, we debugged nemo-rerank-ranking-deployment-c857ccf6d-vvsfc and nemo-embedding-embedding-deployment-557c44b764-pl5gj and found the error below:

===================================
== NVIDIA NIM for Text Reranking ==
===================================

NVIDIA Release 1.3.0
Model: nvidia/llama-3.2-nv-rerankqa-1b-v2

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/).
Third Party Software Attributions and Licenses can be found under /opt/nim/NOTICE

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

libnvidia-ml.so.1 not found under /usr.

What’s the NVIDIA GPU Driver Version on your device? Could you install the recommended version according to our prerequisites?

This is the version of my NVIDIA GPU driver

And this is the CUDA version

And this is the Kubernetes version

Currently, I am having trouble installing the specific version of the NVIDIA GPU Operator required by the prerequisites shown below

Any help would be appreciated.

You can install the driver by following our install-the-nvidia-driver section, and then installing-a-kubernetes-cluster.
You should also follow obtain-ngc-api-key and set NGC_API_KEY first.

export NGC_API_KEY=<your_ngc_api_key>
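Before deploying, it may help to confirm that the variable is actually set in the shell that runs the deployment. The following is only a minimal sketch; `require_ngc_key` is a hypothetical helper name, not part of VSS or any NVIDIA tooling, and it only checks that the variable is non-empty, not that the key is valid.

```shell
# Hypothetical helper (not part of VSS): fail early when NGC_API_KEY is
# unset or empty. It does not validate the key against NGC.
require_ngc_key() {
    if [ -z "${NGC_API_KEY:-}" ]; then
        echo "NGC_API_KEY is not set; export it before deploying" >&2
        return 1
    fi
    # Print only the length so the secret itself never reaches the terminal.
    echo "NGC_API_KEY is set (${#NGC_API_KEY} characters)"
}
```

Running `require_ngc_key` right after the `export` line above should print the length of the key; a non-zero exit status means the variable did not reach the current shell.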

I followed the installation steps. These are now the details of the pods

Is it a concern that the GPU Operator pods show Error and CrashLoopBackOff?

That could be the problem. Below are the details of the pods on my side with 4xA100 (80G). Could you try to stop all the other services and just run VSS?

I can't stop the existing services running on my 4xH100 GPUs. I found out that there are some connection issues with the coredns and calico-kube-controllers pods.

k logs coredns-7896dbf49-kch6l -n kube-system
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration SHA512 = ec28a3f83e0dca1ea83a58874fa7333262cfeb8888208c6833633df2fc0bccf57999407c27f4a9d9cba2b95cae6a190fe5c881c4f592b2d34fa688d643aca662
CoreDNS-1.10.1
linux/amd64, go1.20, 055b2c3
[INFO] 127.0.0.1:44534 - 24674 "HINFO IN 812425296251136449.5340730100123542633. udp 56 false 512" - - 0 4.001907398s
[ERROR] plugin/errors: 2 812425296251136449.5340730100123542633. HINFO: read udp 10.1.166.91:52536->8.8.4.4:53: read: connection refused
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://127.0.0.1:16443/version": dial tcp 127.0.0.1:16443: connect: connection refused
[INFO] 127.0.0.1:60645 - 42291 "HINFO IN 812425296251136449.5340730100123542633. udp 56 false 512" - - 0 0.000158765s
[ERROR] plugin/errors: 2 812425296251136449.5340730100123542633. HINFO: read udp 10.1.166.91:56204->8.8.4.4:53: read: connection refused
[INFO] 127.0.0.1:49462 - 60997 "HINFO IN 812425296251136449.5340730100123542633. udp 56 false 512" - - 0 4.003796935s
[ERROR] plugin/errors: 2 812425296251136449.5340730100123542633. HINFO: read udp 10.1.166.91:60873->8.8.4.4:53: i/o timeout
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://127.0.0.1:16443/version": dial tcp 127.0.0.1:16443: connect: connection refused
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://127.0.0.1:16443/version": dial tcp 127.0.0.1:16443: connect: connection refused

k logs calico-kube-controllers-6457f58ccc-qvzdb -n kube-system
2025-02-14 02:40:47.035 [INFO][1] main.go 107: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0214 02:40:47.036457       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2025-02-14 02:40:47.036 [INFO][1] main.go 131: Ensuring Calico datastore is initialized
2025-02-14 02:40:47.037 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:40:47.037 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:40:52.038 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:40:52.038 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:40:57.042 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:40:57.042 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:02.046 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:02.046 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:07.050 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:07.050 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:12.054 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:12.054 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:17.059 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:17.059 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:22.063 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:22.063 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:27.066 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:27.066 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:32.071 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:32.071 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
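Both logs above repeat the same failure: the pods cannot reach the Kubernetes API server at 127.0.0.1:16443. A small filter can condense such repetitive output into one line per unreachable endpoint. This is only a sketch; `summarize_refused` is a hypothetical helper name, and it only matches the `dial tcp ... connect: connection refused` form seen in these logs (the CoreDNS `read udp` errors are not counted).

```shell
# Hypothetical helper: read pod logs on stdin and count how often each
# endpoint appears in a "dial tcp <ip>:<port>: connect: connection refused"
# message. Prints "<count> <ip>:<port>", most frequent first.
summarize_refused() {
    grep -o 'dial tcp [0-9.]*:[0-9]*: connect: connection refused' \
        | awk '{ sub(/:$/, "", $3); count[$3]++ }
               END { for (ep in count) print count[ep], ep }' \
        | sort -rn
}
```

For example, `k logs calico-kube-controllers-6457f58ccc-qvzdb -n kube-system | summarize_refused` would reduce the output above to a single line for 127.0.0.1:16443, which makes it easier to see that every error points at the same API server address.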

We just did a fresh install and want to solve the problems one by one. We will look at the GPU Operator pods once the kube-system pods are working. Any advice on getting the kube-system pods to run would be appreciated.

May I know what version of Docker you are using? I came across this:

I didn’t use the gpu-operator or calico kube controllers. All related software is the default installation version on Ubuntu 22.04.
My deployment process followed our Guide exactly.

Did you add the -A parameter after the kubectl get pod command?


That issue is now resolved. May I know why VSS is unable to start?

Deploying the VLM and LLM models takes some time. If there are no errors, just keep waiting until the deployment succeeds.
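While waiting, it can be useful to list only the pods that are not yet ready instead of reading the full pod table each time. The sketch below assumes the default column layout of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...); `list_not_ready` is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pods -A` output on stdin and print
# pods that are not fully ready. Completed pods are skipped; a pod is listed
# when its STATUS is not Running or its READY count is below the total.
list_not_ready() {
    awk 'NR > 1 && $4 != "Completed" {
             split($3, ready, "/")
             if ($4 != "Running" || ready[1] != ready[2])
                 print $1 "/" $2, $3, $4
         }'
}
```

Usage would be `sudo microk8s kubectl get pods -A | list_not_ready`. An empty result suggests that all pods are running and ready; pods stuck in states such as CrashLoopBackOff are worth inspecting with `kubectl logs`.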

Hi, the VSS is running but we cannot summarize our own video. The sample videos in the VSS can be used but the video that we uploaded shows NaN:


After uploading the video, when we click on the Summarize button, it shows no output from the models.

Below are the logs from the service:

2025-02-14 06:24:55 | INFO | gradio_web_server | summarize. ip: 172.18.6.2
2025-02-14 06:24:55,465 INFO Received add video file request - purpose Purpose.VISION, media_type MediaType.VIDEO have file None, filename - /tmp/gradio/499e748e72f4e72c9a0950b0f387d178a6fbb2f22a5b869de5279834dd52457d/bangsar.mp4
2025-02-14 06:24:55,466 INFO [AssetManager] Added file from path - asset-id: 88dc2d83-d645-4bae-b82e-5173544a357a original path: /tmp/gradio/499e748e72f4e72c9a0950b0f387d178a6fbb2f22a5b869de5279834dd52457d/bangsar.mp4
2025-02-14 06:24:55 | INFO | stdout | INFO:     172.18.6.2:20947 - "GET /gradio_api/queue/data?session_hash=tdjscy1h069 HTTP/1.1" 200 OK
Failed to query video capabilities: Invalid argument
INFO:     127.0.0.1:33692 - "POST /files HTTP/1.1" 200 OK
2025-02-14 06:24:55 | INFO | stdout | INFO:     172.18.6.2:20947 - "POST /gradio_api/queue/join HTTP/1.1" 200 OK
2025-02-14 06:24:55 | INFO | gradio_web_server | summarize. ip: 172.18.6.2
2025-02-14 06:24:55,919 INFO Received list models request. Responding with 1 models info
INFO:     127.0.0.1:33692 - "GET /models HTTP/1.1" 200 OK
2025-02-14 06:24:55,920 INFO Received summarize query, id - 88dc2d83-d645-4bae-b82e-5173544a357a (live-stream=0), chunk_duration=10, chunk_overlap_duration=0, media-offset-type=None, media-start-time=None, media-end-time=None, modelParams={"max_new_tokens": 512, "top_p": 1.0, "top_k": 100.0, "temperature": 0.4, "seed": 1}, summary_duration=0, stream=True
2025-02-14 06:24:55 | INFO | stdout | INFO:     172.18.6.2:20947 - "GET /gradio_api/file%3D/tmp/gradio/499e748e72f4e72c9a0950b0f387d178a6fbb2f22a5b869de5279834dd52457d/bangsar.mp4 HTTP/1.1" 206 Partial Content
Failed to query video capabilities: Invalid argument
2025-02-14 06:24:55 | INFO | stdout | INFO:     172.18.6.2:40033 - "GET /gradio_api/file%3D/tmp/gradio/499e748e72f4e72c9a0950b0f387d178a6fbb2f22a5b869de5279834dd52457d/bangsar.mp4 HTTP/1.1" 206 Partial Content
2025-02-14 06:24:55 | INFO | stdout | INFO:     172.18.6.2:10109 - "GET /gradio_api/queue/data?session_hash=tdjscy1h069 HTTP/1.1" 200 OK
Guardrails process execution time = 328.914 millisec
2025-02-14 06:24:56,615 INFO Using meta/llama-3.1-70b-instruct as the summarization llm
2025-02-14 06:24:56,716 INFO Using meta/llama-3.1-70b-instruct as the cypher llm
2025-02-14 06:24:56,816 INFO Setting up GraphRAG
2025-02-14 06:24:56,823 INFO Triggering oldest queued query a6644f0d-3e11-4531-8fe9-7ee5a67e39af
Failed to query video capabilities: Invalid argument
File Split execution time = 383.801 millisec
2025-02-14 06:24:57,222 INFO Created video file query a6644f0d-3e11-4531-8fe9-7ee5a67e39af for videoId 88dc2d83-d645-4bae-b82e-5173544a357a
2025-02-14 06:24:57,222 INFO Waiting for results of query a6644f0d-3e11-4531-8fe9-7ee5a67e39af
INFO:     127.0.0.1:33692 - "POST /summarize HTTP/1.1" 200 OK
2025-02-14 06:24:57,233 INFO Status for query a6644f0d-3e11-4531-8fe9-7ee5a67e39af is successful, percent complete is 100.00, size of response list is 0
2025-02-14 06:24:57 | INFO | stdout | INFO:     172.18.6.2:10109 - "GET /gradio_api/file%3D/tmp/gradio/499e748e72f4e72c9a0950b0f387d178a6fbb2f22a5b869de5279834dd52457d/bangsar.mp4 HTTP/1.1" 206 Partial Content
2025-02-14 06:24:57 | INFO | stdout | INFO:     172.18.6.2:41433 - "POST /gradio_api/queue/join HTTP/1.1" 200 OK
2025-02-14 06:24:57 | INFO | stdout | INFO:     172.18.6.2:41433 - "GET /gradio_api/queue/data?session_hash=tdjscy1h069 HTTP/1.1" 200 OK
INFO:     172.18.6.2:44910 - "GET /health/ready HTTP/1.1" 200 OK
sudo microk8s kubectl get pods -A

Hi @jason.cham, I tried transcoding your video from HEVC to H.264 with the ffmpeg command below. It works well on my side.

ffmpeg -i input.mp4 -vcodec libx264 -preset ultrafast -b:v 2000k output.mp4

As a workaround, you can transcode the video source before uploading it to VSS. In parallel, we will investigate why this type of video source is not supported, which may take a while.
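To avoid re-encoding files that are already H.264, the codec can be checked first with ffprobe (shipped with FFmpeg). The following is a rough sketch built around the ffmpeg command above; the function names are hypothetical, and it assumes ffprobe reports the codec of an HEVC stream as `hevc`.

```shell
# Hypothetical helper: decide from a codec name whether to transcode.
needs_transcode() {
    case "$1" in
        hevc|h265) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the codec name of the first video stream (requires ffprobe).
video_codec() {
    ffprobe -v error -select_streams v:0 \
        -show_entries stream=codec_name \
        -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Transcode to H.264 only when the source is HEVC, reusing the ffmpeg
# options from the command above. $1 is the input file, $2 the output file.
transcode_if_hevc() {
    if needs_transcode "$(video_codec "$1")"; then
        ffmpeg -i "$1" -vcodec libx264 -preset ultrafast -b:v 2000k "$2"
    else
        echo "$1 is not HEVC; no transcoding needed"
    fi
}
```

For example, `transcode_if_hevc input.mp4 output.mp4` would write an H.264 copy only when the input is HEVC. The `ultrafast` preset and 2000k bitrate are taken from the command above and trade quality for speed, so they may need adjusting for other footage.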

Thanks yuwei, we managed to upload the video now.