It is still the same. On our side, we debugged the nemo-rerank-ranking-deployment-c857ccf6d-vvsfc and nemo-embedding-embedding-deployment-557c44b764-pl5gj pods and found the error below:
===================================
== NVIDIA NIM for Text Reranking ==
===================================
NVIDIA Release 1.3.0
Model: nvidia/llama-3.2-nv-rerankqa-1b-v2
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/
A copy of this license can be found under /opt/nim/LICENSE.
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/).
Third Party Software Attributions and Licenses can be found under /opt/nim/NOTICE
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
libnvidia-ml.so.1 not found under /usr.
That could be the problem. Below are the details of the pods on my side with 4xA100 (80G). Could you try stopping all the other services and running only VSS?
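The `libnvidia-ml.so.1 not found under /usr` line in the log above suggests the driver libraries are not visible inside the container. One quick way to confirm this is a minimal sketch like the following, assuming a Python interpreter is available in the environment where you run it (inside the pod or on the node). The function name `nvml_available` is only illustrative.

```python
import ctypes


def nvml_available(lib_name: str = "libnvidia-ml.so.1") -> bool:
    """Return True if the named shared library can be loaded here."""
    try:
        # CDLL raises OSError when the dynamic loader cannot find the library.
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        return False


if __name__ == "__main__":
    if nvml_available():
        print("libnvidia-ml.so.1 loaded: driver libraries are visible here")
    else:
        print("libnvidia-ml.so.1 not loadable: driver libraries are not visible here")
```

If this reports the library as loadable on the node but not inside the pod, the problem is more likely in how the container runtime exposes the GPU than in the driver installation itself.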
I can't stop the existing services running on my 4xH100 GPUs. I found that there are some connection issues with the coredns and calico kube controllers pods.
k logs coredns-7896dbf49-kch6l -n kube-system
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[INFO] plugin/kubernetes: waiting for Kubernetes API before starting server
[WARNING] plugin/kubernetes: starting server with unsynced Kubernetes API
.:53
[INFO] plugin/reload: Running configuration SHA512 = ec28a3f83e0dca1ea83a58874fa7333262cfeb8888208c6833633df2fc0bccf57999407c27f4a9d9cba2b95cae6a190fe5c881c4f592b2d34fa688d643aca662
CoreDNS-1.10.1
linux/amd64, go1.20, 055b2c3
[INFO] 127.0.0.1:44534 - 24674 "HINFO IN 812425296251136449.5340730100123542633. udp 56 false 512" - - 0 4.001907398s
[ERROR] plugin/errors: 2 812425296251136449.5340730100123542633. HINFO: read udp 10.1.166.91:52536->8.8.4.4:53: read: connection refused
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://127.0.0.1:16443/version": dial tcp 127.0.0.1:16443: connect: connection refused
[INFO] 127.0.0.1:60645 - 42291 "HINFO IN 812425296251136449.5340730100123542633. udp 56 false 512" - - 0 0.000158765s
[ERROR] plugin/errors: 2 812425296251136449.5340730100123542633. HINFO: read udp 10.1.166.91:56204->8.8.4.4:53: read: connection refused
[INFO] 127.0.0.1:49462 - 60997 "HINFO IN 812425296251136449.5340730100123542633. udp 56 false 512" - - 0 4.003796935s
[ERROR] plugin/errors: 2 812425296251136449.5340730100123542633. HINFO: read udp 10.1.166.91:60873->8.8.4.4:53: i/o timeout
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://127.0.0.1:16443/version": dial tcp 127.0.0.1:16443: connect: connection refused
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://127.0.0.1:16443/version": dial tcp 127.0.0.1:16443: connect: connection refused
We just did a fresh install and want to solve the problems one by one. We will look at the GPU-operator pods once the kube-system pods are running. Any advice on getting the kube-system pods to run would be appreciated.
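The CoreDNS log above keeps reporting `connection refused` on `127.0.0.1:16443`, so a first step could be to check whether anything is listening on that port at all. Here is a minimal sketch, assuming it is run on the node itself. The helper `can_connect` is an illustrative name, and whether the same address is reachable from inside a pod depends on the pod's network setup.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Address taken from the "Kubernetes API connection failure" log lines.
    host, port = "127.0.0.1", 16443
    state = "reachable" if can_connect(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

If the port is not reachable on the node either, the API server is probably not running, and CoreDNS is only reporting a symptom of that.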
May I know what version of Docker you are using? I came across this:
I didn’t use the gpu-operator or calico kube controllers. All related software is the default installation version on Ubuntu 22.04.
My deployment process followed our Guide exactly.
As a workaround, you can transcode the video source before passing it to VSS. In parallel, we will investigate why this type of video source is not supported. This could take a while.
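A transcoding step could be sketched as below, assuming `ffmpeg` is installed and on the `PATH`. The target format (H.264 video and AAC audio in an MP4 container) is an assumption, since this thread does not state which formats VSS accepts, and the file names and function names are placeholders.

```python
import subprocess
from typing import List


def build_transcode_command(src: str, dst: str) -> List[str]:
    """Build an ffmpeg command that re-encodes src to H.264/AAC at dst."""
    return [
        "ffmpeg",
        "-y",                    # overwrite dst if it already exists
        "-i", src,               # input file
        "-c:v", "libx264",       # assumed target video codec
        "-pix_fmt", "yuv420p",   # widely compatible pixel format
        "-c:a", "aac",           # assumed target audio codec
        dst,
    ]


def transcode(src: str, dst: str) -> None:
    """Run the transcode and raise CalledProcessError if ffmpeg fails."""
    subprocess.run(build_transcode_command(src, dst), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so this is safe to try.
    print(" ".join(build_transcode_command("input.mp4", "output_h264.mp4")))
```

Calling `transcode("input.mp4", "output_h264.mp4")` would run the conversion. The resulting file can then be tried with VSS in place of the original source.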