Dear Nvidia experts,
Could you help me clarify the difference between the following Kubernetes operators, and when each should be used on an Nvidia InfiniBand fabric:
- Network Operator
- Nvidia Network Operator
- SR-IOV Network Operator
Also, when I run the command
kubectl get nodes -l feature.node.kubernetes.io/network-sriov.capable=true
It shows only one node:
NAME    STATUS                     ROLES    AGE    VERSION
node4   Ready,SchedulingDisabled   worker   6d4h   v1.28.
This is the output of the kubectl get nodes command:
NAME    STATUS                     ROLES           AGE    VERSION
node1   Ready                      control-plane   6d4h   v1.28.6
node2   Ready                      control-plane   6d4h   v1.28.6
node3   Ready                      control-plane   6d4h   v1.28.6
node4   Ready,SchedulingDisabled   worker          6d4h   v1.28.6
node5   Ready                      worker          6d4h   v1.28.6
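As a side note, the listing above can be filtered to see which worker nodes are cordoned. This is only a rough sketch: `cordoned_workers` is a hypothetical helper name I made up, the sample data is simply the output pasted above, and it assumes the default five-column layout of `kubectl get nodes`.

```shell
#!/bin/sh
# Sample data: the `kubectl get nodes` output quoted in this post.
nodes_output='NAME STATUS ROLES AGE VERSION
node1 Ready control-plane 6d4h v1.28.6
node2 Ready control-plane 6d4h v1.28.6
node3 Ready control-plane 6d4h v1.28.6
node4 Ready,SchedulingDisabled worker 6d4h v1.28.6
node5 Ready worker 6d4h v1.28.6'

# Hypothetical helper: read a node listing on stdin, skip the header row,
# and print the names of worker nodes whose STATUS contains SchedulingDisabled.
cordoned_workers() {
  awk 'NR > 1 && $3 == "worker" && $2 ~ /SchedulingDisabled/ { print $1 }'
}

printf '%s\n' "$nodes_output" | cordoned_workers
```

On a live cluster the same filter could be fed directly with `kubectl get nodes | cordoned_workers`. With the data above it reports node4, which is also the only node carrying the SR-IOV label, so a two-pod test across both DGX A100 workers would presumably need node4 to be schedulable again (for example via `kubectl uncordon node4`) and node5 to be labeled as well.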
My main goal is to test the InfiniBand network with the ib_write_bw tool by creating two pods on two different DGX A100 servers (the worker nodes). The network switch is a QM9700. I'm confused about which network operator must be installed and how it should be installed. I was following these amazing tutorials written by Nvidia engineers. I guess some points were missed there.
- RDG for Accelerated K8s Cluster over NVIDIA DGX A100 Servers and 200Gbps Ethernet Network Fabric
- RDG for Accelerating AI Workloads in Red Hat OCP with NVIDIA DGX A100 Servers and NVIDIA InfiniBand Fabric
Thanks in advance for your support.
Best regards,
Shakhizat