K8s test via InfiniBand network

Dear Nvidia experts,

Could you help me clarify the difference in usage between the following Kubernetes operators on an NVIDIA InfiniBand fabric:

  • Network Operator
  • Nvidia Network Operator
  • SR-IOV Network Operator

Also, when I run the command

 kubectl get nodes -l feature.node.kubernetes.io/network-sriov.capable=true

It shows only one node:

NAME    STATUS                     ROLES    AGE    VERSION
node4   Ready,SchedulingDisabled   worker   6d4h   v1.28.6

This is the output of the kubectl get nodes command:

NAME    STATUS                     ROLES           AGE    VERSION
node1   Ready                      control-plane   6d4h   v1.28.6
node2   Ready                      control-plane   6d4h   v1.28.6
node3   Ready                      control-plane   6d4h   v1.28.6
node4   Ready,SchedulingDisabled   worker          6d4h   v1.28.6
node5   Ready                      worker          6d4h   v1.28.6

My main goal is to test the InfiniBand network with the ib_write_bw tool by creating two pods on two different DGX A100 worker nodes. The network switch is a QM9700. I'm just confused about which network operator must be installed and how it should be installed. I was following these amazing tutorials written by NVIDIA engineers; I guess some points were missed there:

  1. RDG for Accelerated K8s Cluster over NVIDIA DGX A100 Servers and 200Gbps Ethernet Network Fabric - NVIDIA Docs
  2. RDG for Accelerating AI Workloads in Red Hat OCP with NVIDIA DGX A100 Servers and NVIDIA InfiniBand Fabric

Thanks in advance for your support.

Best regards,
Shakhizat

Hello Shakhizat, and thanks for writing to us.

To achieve your goal of testing the InfiniBand network using the ib_write_bw tool by creating two pods on two different DGX A100 worker nodes, you will need to install and configure the appropriate network operator. Here’s a breakdown of the different operators and their usage:

Network Operator

The Network Operator is a general term that can refer to any Kubernetes operator that manages network-related components within a cluster. It is not specific to Nvidia or InfiniBand.

Nvidia Network Operator

The Nvidia Network Operator is designed to manage networking components to enable the execution of RDMA and GPUDirect RDMA workloads in a Kubernetes cluster. It automates the deployment of necessary drivers, device plugins, and secondary network components for network-intensive workloads. To install the Nvidia Network Operator, you can use Helm, which is a package manager for Kubernetes.

helm install -n nvidia-network-operator --create-namespace --wait network-operator nvidia/network-operator --set psp=true

This command installs the Nvidia Network Operator in its own namespace and sets up the necessary permissions for the operator’s pods.
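Note that the nvidia/network-operator chart must be reachable by Helm first. If the NVIDIA Helm repository has not been added yet, something along these lines is typically needed beforehand (the repository URL below is the one published in the NVIDIA Network Operator documentation; adjust it if your environment mirrors charts elsewhere):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update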

SR-IOV Network Operator

The SR-IOV Network Operator manages SR-IOV network devices and network attachments. It is used to configure a high-speed data path for IO-intensive workloads on a secondary network in each cluster node. To install the SR-IOV Network Operator, you can use the OpenShift Container Platform CLI or the web console. The installation involves creating a namespace, an OperatorGroup CR, and a Subscription CR for the SR-IOV Network Operator.
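On OpenShift, that typically comes down to applying three small manifests. A minimal sketch, assuming the channel and catalog source names used in the Red Hat documentation (these can differ between OCP versions):

# Namespace for the operator
apiVersion: v1
kind: Namespace
metadata:
  name: openshift-sriov-network-operator
---
# OperatorGroup scoped to that namespace
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: sriov-network-operators
  namespace: openshift-sriov-network-operator
spec:
  targetNamespaces:
  - openshift-sriov-network-operator
---
# Subscription that actually installs the SR-IOV Network Operator
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: sriov-network-operator-subscription
  namespace: openshift-sriov-network-operator
spec:
  channel: stable
  name: sriov-network-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace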

Given that you are using an Nvidia InfiniBand fabric with DGX A100 servers, and your goal is to test the InfiniBand network, you should install the Nvidia Network Operator. This operator is specifically designed to work with Nvidia networking hardware and will enable the RDMA and GPUDirect capabilities required for your testing.

To install the Nvidia Network Operator, follow these steps (a short command sketch follows the list):

  1. Create a namespace for the Network Operator.
  2. Install the Network Operator using Helm or the OpenShift CLI, depending on your Kubernetes distribution.
  3. Label the namespace with the required Pod Security Admission level if necessary.
  4. Verify the installation by checking the status of the operator’s pods.
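
Put together, a minimal command sketch for a Helm-based installation might look like this (the namespace name and the Pod Security Admission level are only examples; adjust them to your environment and distribution):

# 1. Create a namespace for the operator
kubectl create namespace nvidia-network-operator

# 2. Label it with a Pod Security Admission level if your cluster enforces PSA
kubectl label namespace nvidia-network-operator pod-security.kubernetes.io/enforce=privileged

# 3. Install the operator chart with Helm
helm install -n nvidia-network-operator --wait network-operator nvidia/network-operator

# 4. Verify that the operator pods come up
kubectl -n nvidia-network-operator get pods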

Regarding the output of your kubectl get nodes commands: they show that only one node (node4) is labeled as SR-IOV capable, and that node is currently in the Ready,SchedulingDisabled state. This means the node is ready but will not schedule any new pods. If you intend to use this node for your InfiniBand testing, you will need to re-enable scheduling on it. Additionally, ensure that all the nodes you want to use are properly labeled to indicate their SR-IOV capability.
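If node4 is only cordoned, scheduling can normally be re-enabled with kubectl uncordon. Once two SR-IOV-capable workers can schedule pods, the bandwidth test itself follows the usual perftest pattern; a rough sketch, where the device name mlx5_0 and the server address are only placeholders for your actual setup:

# Re-enable scheduling on the cordoned node
kubectl uncordon node4

# Inside pod 1 (server side), start the listener
ib_write_bw -d mlx5_0

# Inside pod 2 (client side), connect to the server pod's InfiniBand address
ib_write_bw -d mlx5_0 <server-pod-ip>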

If you run into any issues while testing your setup, please feel free to open a case in the NVIDIA Enterprise Service Portal, where we can help you further.

Thanks and have a wonderful day!
Ilan.


Hello @ipavis, wow, thanks for your informative response. I really appreciate it. Now I have a much clearer picture of what is going on. Yep, we currently have active support for our DGX A100s in the NVIDIA Enterprise Service Portal; we mainly raise a case there for hardware issues, but I can raise one for this if you wish.

Meanwhile, we successfully installed the operator using Helm, as per your suggestion, with the values.yaml file below:

nfd:
  enabled: true
 
sriovNetworkOperator:
  enabled: true
 
deployCR: true
ofedDriver:
  deploy: false
 
nvPeerDriver:
  deploy: false
 
rdmaSharedDevicePlugin:
  deploy: false
 
sriovDevicePlugin:
  deploy: false
 
secondaryNetwork:
  deploy: true
  cniPlugins:
    deploy: true
  multus:
    deploy: true
  ipamPlugin:
    deploy: true

Do we need to change ofedDriver deploy from false to true for the InfiniBand network? Should we try the Network Operator Deployment with an SR-IOV InfiniBand Network that is described here?
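
If I read that page correctly, the SR-IOV InfiniBand deployment it describes ends up creating an SriovIBNetwork custom resource, roughly like the sketch below (the resource name, namespace, resourceName, and IPAM range are only placeholders from my side):

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovIBNetwork
metadata:
  name: example-sriov-ib-network   # placeholder name
  namespace: network-operator
spec:
  resourceName: mlnxnics           # must match the resource exposed by the SR-IOV device plugin
  networkNamespace: default        # namespace where workload pods will attach this network
  linkState: enable
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.5.0/24"
    }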

Here is the output of kubectl -n network-operator get pods -o wide

NAME                                                              READY   STATUS    RESTARTS        AGE     IP               NODE    NOMINATED NODE   READINESS GATES
cni-plugins-ds-65456                                              1/1     Running   1 (18m ago)     18m     10.6.254.74      node4   <none>           <none>
kube-multus-ds-n7pnk                                              1/1     Running   0               18m     10.6.254.74      node4   <none>           <none>
network-operator-68c9cf9f94-m9v4k                                 1/1     Running   0               18m     10.233.71.2      node3   <none>           <none>
network-operator-node-feature-discovery-master-6776877686-9lhw7   1/1     Running   0               18m     10.233.75.20     node2   <none>           <none>
network-operator-node-feature-discovery-worker-4wc7w              1/1     Running   7 (5m33s ago)   18m     10.233.75.19     node2   <none>           <none>
network-operator-node-feature-discovery-worker-9rbg4              1/1     Running   7 (5m31s ago)   18m     10.233.102.163   node1   <none>           <none>
network-operator-node-feature-discovery-worker-g8c8m              1/1     Running   7 (5m24s ago)   18m     10.233.71.63     node3   <none>           <none>
network-operator-node-feature-discovery-worker-gx772              1/1     Running   0               18m     10.233.74.108    node4   <none>           <none>
network-operator-sriov-network-operator-5c77bf848b-5zlgm          1/1     Running   0               18m     10.233.75.15     node2   <none>           <none>
sriov-device-plugin-lp8nq                                         1/1     Running   0               5m31s   10.6.254.74      node4   <none>           <none>
sriov-device-plugin-tltqp                                         0/1     Pending   0               4m28s   <none>           node5   <none>           <none>
sriov-network-config-daemon-d67bq                                 1/3     Unknown   0               18m     10.6.254.75      node5   <none>           <none>
sriov-network-config-daemon-sqg9r                                 3/3     Running   0               18m     10.6.254.74      node4   <none>           <none>
whereabouts-lxlsv                                                 1/1     Running   0               18m     10.6.254.74      node4   <none>           <none>

A weird observation: sometimes the command below works, and sometimes it does not. Maybe we are doing something wrong; please advise.

kubectl -n network-operator get sriovnetworknodestates.sriovnetwork.openshift.io node4 -o yaml

When it fails, it shows:

Error from server (NotFound): sriovnetworknodestates.sriovnetwork.openshift.io "node4" not found

The issue with Ready,SchedulingDisabled was fixed.

 kubectl get nodes -l feature.node.kubernetes.io/network-sriov.capable=true
NAME    STATUS   ROLES    AGE    VERSION
node4   Ready    worker   7d6h   v1.28.6
node5   Ready    worker   7d6h   v1.28.6

Best regards,
Shakhizat

Dear Shakhizat,

We also support software issues, not just hardware issues. :-)
It would be better to open a ticket for an issue such as this, since we need more information from the system, and debugging it will be more effective within a case.
So please open a case, and we will be more than happy to help.

Have a great day!
Ilan.
