Network Operator in RKE2 cluster for GPUDirect workloads

I have an RKE2-based K8s cluster in which I'm attempting to enable GPUDirect RDMA. I've deployed the GPU Operator and it seems to be up and running, but when I attempt to deploy the Network Operator I get the following from the NicClusterPolicy:
Name: state-OFED
State: notReady
Name: state-SRIOV-device-plugin
State: notReady
Name: state-RDMA-device-plugin
State: notReady

and I don't see any containers associated with those items being deployed (or even attempting to deploy).
I am following the example in the Kubernetes Cloud Orchestration Network Operator application notes, under the heading "Network Operator Deployment for GPUDirect Workloads".
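For context, this is roughly how I've been poking at the failing states (the namespace and release name match my playbook below; nic-cluster-policy is the CR name the chart created in my case, so adjust if yours differs):

  # pods the operator should have created for each state
  kubectl -n nvidia-net get pods -o wide
  # conditions reported on the policy itself
  kubectl describe nicclusterpolicy nic-cluster-policy
  # operator logs, filtered to the three notReady states
  kubectl -n nvidia-net logs deploy/nvidia-net-network-operator | grep -iE 'ofed|sriov|rdma'
  # recent scheduling/creation events
  kubectl -n nvidia-net get events --sort-by=.lastTimestamp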

Hardware:
1 Dell R440 – Ubuntu 22.04 5.15.0-75-generic
1 Microserver – Ubuntu 22.04 5.15.0-75-generic
4x Tesla V100
1x Mellanox ConnectX-5 100Gb, model CX516A
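One thing I wasn't sure how to verify: as I understand it, the OFED driver pods only get scheduled onto nodes that NFD has labeled as having a Mellanox NIC (PCI vendor ID 15b3), so I checked for that label. The jq filter below is just my own sketch:

  kubectl get nodes -o json \
    | jq '.items[] | {node: .metadata.name,
                      mellanox_labels: (.metadata.labels | keys | map(select(contains("15b3"))))}'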

This is the Ansible playbook I use to deploy both the GPU and Network Operators:

  - name: Install Nvidia GPU/Network operator
    hosts: portalgun
    become: false
    vars:
      ansible_python_interpreter: '/usr/bin/python3'
    tasks:
      - name: Adding Nvidia to helm repo
        kubernetes.core.helm_repository:
          name: nvidia
          repo_url: https://helm.ngc.nvidia.com/nvidia
      - name: installing Nvidia GPU operator helm chart
        kubernetes.core.helm:
          name: nvidia-gpu
          namespace: nvidia-gpu
          create_namespace: true
          chart_ref: nvidia/gpu-operator
          values_files:
            - gpu-values.yaml
      - name: installing Nvidia network operator helm chart
        kubernetes.core.helm:
          name: nvidia-net
          namespace: nvidia-net
          create_namespace: true
          chart_ref: nvidia/network-operator
          values_files:
            - net-values.yaml
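For reference, the toggles in my net-values.yaml that map to the three failing states follow the GPUDirect example in that doc. This is a trimmed sketch rather than my exact file (which is attached below), and the exact keys depend on the chart version:

  nfd:
    enabled: true               # NFD labels nodes with the Mellanox NIC
  deployCR: true                # have the chart create the NicClusterPolicy
  ofedDriver:
    deploy: true                # -> state-OFED
  rdmaSharedDevicePlugin:
    deploy: true                # -> state-RDMA-device-plugin
  sriovDevicePlugin:
    deploy: true                # -> state-SRIOV-device-plugin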

I've also attached files with my values.yaml for deploying the operators, plus the output of kubectl describe and the logs from the GPU operator.

Any assistance is welcome; I can't figure out where my config is wrong.

Thank you

Frank

net-values.yaml (8.5 KB)

Kubectl_logs_pod_nvidia-net-network-operator.txt (4.3 MB)

kubectl_describe_NicClusterPolicy.txt (5.1 KB)

kubectl_describe_HostDeviceNetwork.txt (1.6 KB)

Kubectl_describe_nodes.txt (19.0 KB)

Sorry for all the extra posts; the system would only let me add one file per post.
Kubectl_get_all.txt (11.9 KB)

Hello @francis.bethuy and welcome to the NVIDIA developer forums!

I am afraid I will not be able to help here, since I don't know much about this topic. If you are OK with it, I can move your post to the dedicated GPU RDMA category; that forum also has discussions about GPU RDMA and Mellanox setups.

Thanks!

Markus,

If that is the correct location for this, then please move it!

Frank

Thank you!

I moved it, but I'll keep this on my watchlist in case I was wrong. For now, this is the best place to start, in my opinion.