VSS Installation

ezahan.hilmi · February 12, 2025, 7:19am

Hardware Platform (GPU model and numbers): 4x H100 GPUs

I would like to ask whether my Helm chart config file is set up correctly, because the VSS services fail to run. I am using GPU IDs 4, 5, 6, and 7.

sudo microk8s kubectl logs vss-vss-deployment-67f75b5894-bd99j

config.txt (1.3 KB)

I have followed the link below to configure the Helm Chart

sudo microk8s kubectl get pods -A

yuweiw · February 12, 2025, 9:16am

Could you try changing the value for nemo-embedding and nemo-rerank to 7?

ezahan.hilmi · February 12, 2025, 9:39am

It is still the same. On our side, we debugged the nemo-rerank-ranking-deployment-c857ccf6d-vvsfc and nemo-embedding-embedding-deployment-557c44b764-pl5gj pods and found the error below:

===================================
== NVIDIA NIM for Text Reranking ==
===================================

NVIDIA Release 1.3.0
Model: nvidia/llama-3.2-nv-rerankqa-1b-v2

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/).
Third Party Software Attributions and Licenses can be found under /opt/nim/NOTICE

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

libnvidia-ml.so.1 not found under /usr.
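
The warning above says the container could not find the NVIDIA driver library. A first step is to confirm the library is at least registered on the host. The following is a minimal sketch, not an official check: `has_nvml` and `report_driver` are hypothetical helper names, and a positive result only shows that the host has the library, not that the container runtime exposes it.

```shell
# has_nvml: hypothetical helper. Reads `ldconfig -p` style output on stdin
# and succeeds if the NVIDIA management library is listed.
has_nvml() {
  grep -q 'libnvidia-ml\.so\.1'
}

# report_driver: hypothetical wrapper that runs the check against the host's
# dynamic linker cache and prints a one-line result.
report_driver() {
  if ldconfig -p | has_nvml; then
    echo "libnvidia-ml.so.1 found on host"
  else
    echo "libnvidia-ml.so.1 missing on host: install the NVIDIA driver first"
  fi
}
```

Running `report_driver` on the GPU node, followed by `nvidia-smi`, helps separate a missing host driver from a container runtime that is not passing the GPU through.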

yuweiw · February 12, 2025, 10:23am

What’s the NVIDIA GPU Driver Version on your device? Could you install the recommended version according to our prerequisites?

ezahan.hilmi · February 13, 2025, 6:29am

This is my NVIDIA GPU driver version

And this is the CUDA version

And this is the Kubernetes version

Currently, I am having trouble installing the specific version of the NVIDIA GPU Operator required by the prerequisites shown below

Any help would be appreciated

yuweiw · February 13, 2025, 7:12am

You can just install the driver by following install-the-nvidia-driver, then set up the cluster by following installing-a-kubernetes-cluster.
You should also follow obtain-ngc-api-key and set NGC_API_KEY first.

export NGC_API_KEY=<your_ngc_api_key>
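
Since later steps depend on this variable, it can help to verify it is actually set in the current shell before continuing. A small sketch, where `require_ngc_key` is a hypothetical helper name and the message text is only illustrative:

```shell
# require_ngc_key: hypothetical helper. Fails with a message on stderr when
# NGC_API_KEY is unset or empty; otherwise reports the key length only,
# so the secret itself is never printed.
require_ngc_key() {
  if [ -z "${NGC_API_KEY:-}" ]; then
    echo "NGC_API_KEY is not set; export it before creating secrets" >&2
    return 1
  fi
  echo "NGC_API_KEY is set (${#NGC_API_KEY} characters)"
}
```

Note that a variable exported in a normal shell is not inherited by `sudo` by default, so the check should be run in the same context as the commands that use the key.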

ezahan.hilmi · February 13, 2025, 7:33am

I followed the installation steps; these are the details of the pods now

Is it a concern that the gpu-operator pods are in Error and CrashLoopBackOff?

yuweiw · February 13, 2025, 8:11am

That could be the problem. Below are the details of the pods on my side with 4xA100(80G). Could you try to stop all the other services and just run VSS?
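
To see which GPUs other services are occupying before deciding what to stop, the memory usage per GPU can be listed with `nvidia-smi`. The sketch below assumes a hypothetical helper `busy_gpus` and an arbitrary default threshold of 1024 MiB; neither comes from the VSS documentation.

```shell
# busy_gpus: hypothetical helper. Reads "index, memory.used" CSV rows on stdin
# (memory in MiB, no units) and prints the index of every GPU whose used
# memory exceeds the threshold given as $1 (default 1024, an assumption).
busy_gpus() {
  awk -F', *' -v limit="${1:-1024}" '$2 + 0 > limit + 0 { print $1 }'
}

# Intended usage on the GPU node:
#   nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits | busy_gpus 1024
```

Any GPU ID printed here is already in use and is a poor candidate for the VSS Helm chart overrides.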

ezahan.hilmi · February 14, 2025, 2:48am

I can't stop the existing services running on my 4xH100 GPUs. I found out that there are some connection issues with the coredns and calico-kube-controllers pods.

k logs coredns-7896dbf49-kch6l -n kube-system
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration SHA512 = ec28a3f83e0dca1ea83a58874fa7333262cfeb8888208c6833633df2fc0bccf57999407c27f4a9d9cba2b95cae6a190fe5c881c4f592b2d34fa688d643aca662
CoreDNS-1.10.1
linux/amd64, go1.20, 055b2c3
[INFO] 127.0.0.1:44534 - 24674 "HINFO IN 812425296251136449.5340730100123542633. udp 56 false 512" - - 0 4.001907398s
[ERROR] plugin/errors: 2 812425296251136449.5340730100123542633. HINFO: read udp 10.1.166.91:52536->8.8.4.4:53: read: connection refused
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://127.0.0.1:16443/version": dial tcp 127.0.0.1:16443: connect: connection refused
[INFO] 127.0.0.1:60645 - 42291 "HINFO IN 812425296251136449.5340730100123542633. udp 56 false 512" - - 0 0.000158765s
[ERROR] plugin/errors: 2 812425296251136449.5340730100123542633. HINFO: read udp 10.1.166.91:56204->8.8.4.4:53: read: connection refused
[INFO] 127.0.0.1:49462 - 60997 "HINFO IN 812425296251136449.5340730100123542633. udp 56 false 512" - - 0 4.003796935s
[ERROR] plugin/errors: 2 812425296251136449.5340730100123542633. HINFO: read udp 10.1.166.91:60873->8.8.4.4:53: i/o timeout
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://127.0.0.1:16443/version": dial tcp 127.0.0.1:16443: connect: connection refused
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://127.0.0.1:16443/version": dial tcp 127.0.0.1:16443: connect: connection refused

k logs calico-kube-controllers-6457f58ccc-qvzdb  -n kube-system
2025-02-14 02:40:47.035 [INFO][1] main.go 107: Loaded configuration from environment config=&config.Config{LogLevel:"info", WorkloadEndpointWorkers:1, ProfileWorkers:1, PolicyWorkers:1, NodeWorkers:1, Kubeconfig:"", DatastoreType:"kubernetes"}
W0214 02:40:47.036457       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2025-02-14 02:40:47.036 [INFO][1] main.go 131: Ensuring Calico datastore is initialized
2025-02-14 02:40:47.037 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:40:47.037 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:40:52.038 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:40:52.038 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:40:57.042 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:40:57.042 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:02.046 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:02.046 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:07.050 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:07.050 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:12.054 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:12.054 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:17.059 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:17.059 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:22.063 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:22.063 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:27.066 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:27.066 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:32.071 [ERROR][1] client.go 290: Error getting cluster information config ClusterInformation="default" error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
2025-02-14 02:41:32.071 [INFO][1] main.go 138: Failed to initialize datastore error=Get "https://127.0.0.1:16443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 127.0.0.1:16443: connect: connection refused
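
Both logs above fail on the same thing: connections to the Kubernetes API server at 127.0.0.1:16443 are refused. A quick way to tell whether a pod is still hitting that error is to count those lines in its recent logs. This is a sketch only; `count_api_refused` is a hypothetical helper name.

```shell
# count_api_refused: hypothetical helper. Reads pod logs on stdin and prints
# how many lines report a refused connection to the API server on port 16443.
# grep -c exits non-zero when the count is 0, so `|| true` keeps the helper
# usable in scripts that run with `set -e`.
count_api_refused() {
  grep -c '127\.0\.0\.1:16443: connect: connection refused' || true
}

# Intended usage, with the pod name taken from the logs above:
#   sudo microk8s kubectl logs coredns-7896dbf49-kch6l -n kube-system --tail=50 | count_api_refused
```

If the count stays above zero, `sudo microk8s status --wait-ready` shows whether the MicroK8s services, including the API server, have come up at all.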

We just did a fresh install and want to solve the problems one by one. We will look at the gpu-operator pods once the kube-system pods are running. Any advice on getting the kube-system pods to run would be appreciated.

May I know which version of Docker you are using? I came across this:

yuweiw · February 14, 2025, 3:24am

I didn’t use the gpu-operator or calico kube controllers. All related software is the default installation version on Ubuntu 22.04.
My deployment process followed our Guide exactly.

1. Start from a new Ubuntu 22.04 OS
2. install-the-nvidia-driver
3. installing-a-kubernetes-cluster
4. obtain-ngc-api-key and set the env with export NGC_API_KEY=<your_ngc_api_key>
5. create-required-secrets
6. Reallocate the GPU resources in the yaml file and deploy-the-helm-chart

jason.cham · February 14, 2025, 3:30am

Did you add the -A parameter after the kubectl get pod command?

That issue is now resolved. May I know why VSS is unable to start?

yuweiw · February 14, 2025, 5:06am

Deploying the VLM and LLM models takes some time. If there are no errors, just keep waiting until the deployment succeeds.
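
While waiting, it is easier to track progress by listing only the pods that are not ready yet. A minimal sketch, assuming the default column layout of `kubectl get pods --no-headers` without `-A` (NAME, READY, STATUS, ...); `not_ready_pods` is a hypothetical helper name.

```shell
# not_ready_pods: hypothetical helper. Reads `kubectl get pods --no-headers`
# output on stdin and prints the name of each pod that is neither Completed
# nor fully ready (READY n/n with STATUS Running).
not_ready_pods() {
  awk '$3 != "Completed" {
    split($2, ready, "/")
    if (ready[1] != ready[2] || $3 != "Running") print $1
  }'
}

# Intended usage:
#   sudo microk8s kubectl get pods --no-headers | not_ready_pods
```

An empty result means every pod in the namespace is ready; pods that stay in the list with a growing restart count are the ones worth inspecting with `kubectl logs`.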

jason.cham · February 14, 2025, 6:28am

Hi, the VSS is running but we cannot summarize our own video. The sample videos in the VSS can be used but the video that we uploaded shows NaN:

After uploading the video, when we click on the Summarize button, it shows no output from the models.

Below are the logs from the service:

2025-02-14 06:24:55 | INFO | gradio_web_server | summarize. ip: 172.18.6.2
2025-02-14 06:24:55,465 INFO Received add video file request - purpose Purpose.VISION, media_type MediaType.VIDEO have file None, filename - /tmp/gradio/499e748e72f4e72c9a0950b0f387d178a6fbb2f22a5b869de5279834dd52457d/bangsar.mp4
2025-02-14 06:24:55,466 INFO [AssetManager] Added file from path - asset-id: 88dc2d83-d645-4bae-b82e-5173544a357a original path: /tmp/gradio/499e748e72f4e72c9a0950b0f387d178a6fbb2f22a5b869de5279834dd52457d/bangsar.mp4
2025-02-14 06:24:55 | INFO | stdout | INFO:     172.18.6.2:20947 - "GET /gradio_api/queue/data?session_hash=tdjscy1h069 HTTP/1.1" 200 OK
Failed to query video capabilities: Invalid argument
INFO:     127.0.0.1:33692 - "POST /files HTTP/1.1" 200 OK
2025-02-14 06:24:55 | INFO | stdout | INFO:     172.18.6.2:20947 - "POST /gradio_api/queue/join HTTP/1.1" 200 OK
2025-02-14 06:24:55 | INFO | gradio_web_server | summarize. ip: 172.18.6.2
2025-02-14 06:24:55,919 INFO Received list models request. Responding with 1 models info
INFO:     127.0.0.1:33692 - "GET /models HTTP/1.1" 200 OK
2025-02-14 06:24:55,920 INFO Received summarize query, id - 88dc2d83-d645-4bae-b82e-5173544a357a (live-stream=0), chunk_duration=10, chunk_overlap_duration=0, media-offset-type=None, media-start-time=None, media-end-time=None, modelParams={"max_new_tokens": 512, "top_p": 1.0, "top_k": 100.0, "temperature": 0.4, "seed": 1}, summary_duration=0, stream=True
2025-02-14 06:24:55 | INFO | stdout | INFO:     172.18.6.2:20947 - "GET /gradio_api/file%3D/tmp/gradio/499e748e72f4e72c9a0950b0f387d178a6fbb2f22a5b869de5279834dd52457d/bangsar.mp4 HTTP/1.1" 206 Partial Content
Failed to query video capabilities: Invalid argument
2025-02-14 06:24:55 | INFO | stdout | INFO:     172.18.6.2:40033 - "GET /gradio_api/file%3D/tmp/gradio/499e748e72f4e72c9a0950b0f387d178a6fbb2f22a5b869de5279834dd52457d/bangsar.mp4 HTTP/1.1" 206 Partial Content
2025-02-14 06:24:55 | INFO | stdout | INFO:     172.18.6.2:10109 - "GET /gradio_api/queue/data?session_hash=tdjscy1h069 HTTP/1.1" 200 OK
Guardrails process execution time = 328.914 millisec
2025-02-14 06:24:56,615 INFO Using meta/llama-3.1-70b-instruct as the summarization llm
2025-02-14 06:24:56,716 INFO Using meta/llama-3.1-70b-instruct as the cypher llm
2025-02-14 06:24:56,816 INFO Setting up GraphRAG
2025-02-14 06:24:56,823 INFO Triggering oldest queued query a6644f0d-3e11-4531-8fe9-7ee5a67e39af
Failed to query video capabilities: Invalid argument
File Split execution time = 383.801 millisec
2025-02-14 06:24:57,222 INFO Created video file query a6644f0d-3e11-4531-8fe9-7ee5a67e39af for videoId 88dc2d83-d645-4bae-b82e-5173544a357a
2025-02-14 06:24:57,222 INFO Waiting for results of query a6644f0d-3e11-4531-8fe9-7ee5a67e39af
INFO:     127.0.0.1:33692 - "POST /summarize HTTP/1.1" 200 OK
2025-02-14 06:24:57,233 INFO Status for query a6644f0d-3e11-4531-8fe9-7ee5a67e39af is successful, percent complete is 100.00, size of response list is 0
2025-02-14 06:24:57 | INFO | stdout | INFO:     172.18.6.2:10109 - "GET /gradio_api/file%3D/tmp/gradio/499e748e72f4e72c9a0950b0f387d178a6fbb2f22a5b869de5279834dd52457d/bangsar.mp4 HTTP/1.1" 206 Partial Content
2025-02-14 06:24:57 | INFO | stdout | INFO:     172.18.6.2:41433 - "POST /gradio_api/queue/join HTTP/1.1" 200 OK
2025-02-14 06:24:57 | INFO | stdout | INFO:     172.18.6.2:41433 - "GET /gradio_api/queue/data?session_hash=tdjscy1h069 HTTP/1.1" 200 OK
INFO:     172.18.6.2:44910 - "GET /health/ready HTTP/1.1" 200 OK

sudo microk8s kubectl get pods -A

yuweiw · February 14, 2025, 7:36am

Hi @jason.cham, I have tried to transcode your video from HEVC to H.264 with the ffmpeg command below. It works well on my side.

ffmpeg -i input.mp4 -vcodec libx264 -preset ultrafast -b:v 2000k output.mp4

As a workaround, you can transcode the video source before uploading it to VSS. In parallel, we will investigate why this type of video source is not supported, which could take a while.
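
To avoid re-encoding files that are already H.264, the codec can be checked with `ffprobe` first. The sketch below reuses the ffmpeg options from the command above; the helper names are hypothetical, and treating every non-H.264 codec as needing a transcode is an assumption, since this thread only establishes that HEVC failed and H.264 worked.

```shell
# video_codec: prints the codec name of the first video stream (needs ffprobe).
video_codec() {
  ffprobe -v error -select_streams v:0 \
    -show_entries stream=codec_name \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# needs_transcode: hypothetical helper. Succeeds when the codec is not h264.
needs_transcode() {
  [ "$1" != "h264" ]
}

# transcode_if_needed: hypothetical wrapper. Re-encodes $1 into $2 only when
# the source is not already H.264.
transcode_if_needed() {
  src="$1"
  dst="$2"
  codec="$(video_codec "$src")"
  if needs_transcode "$codec"; then
    ffmpeg -i "$src" -vcodec libx264 -preset ultrafast -b:v 2000k "$dst"
  else
    echo "$src is already h264; no transcode needed"
  fi
}
```

For example, `transcode_if_needed bangsar.mp4 bangsar_h264.mp4` would re-encode the uploaded file only if `ffprobe` reports a codec other than `h264`.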

jason.cham · February 14, 2025, 8:11am

Thanks yuwei, we managed to upload the video now.
