I have an RKE2-based K8s cluster in which I'm attempting to enable GPUDirect RDMA. I've deployed the GPU Operator and it seems to be up and running, but when I attempt to deploy the Network Operator I get the following from the NicClusterPolicy:
Name: state-OFED
State: notReady
Name: state-SRIOV-device-plugin
State: notReady
Name: state-RDMA-device-plugin
State: notReady
and I don't see any containers associated with those items being deployed (or even attempting to deploy).
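For reference, these are roughly the commands I'm running to inspect the policy and the operator itself (the namespace nvidia-net and the policy name nic-cluster-policy match my install; adjust to yours):

    # list NicClusterPolicy objects and their overall state
    kubectl get nicclusterpolicies.mellanox.com
    # per-state details, including the notReady entries above
    kubectl describe nicclusterpolicy nic-cluster-policy
    # the operator's own pods and logs
    kubectl -n nvidia-net get pods
    # deployment name follows the <release>-network-operator pattern on my install
    kubectl -n nvidia-net logs deploy/nvidia-net-network-operator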
I am following the example in the Kubernetes Cloud Orchestration Network Operator application notes, under the heading "Network Operator Deployment for GPUDirect Workloads".
Hardware:
1x Dell R440 – Ubuntu 22.04, kernel 5.15.0-75-generic
1x Microserver – Ubuntu 22.04, kernel 5.15.0-75-generic
4x Tesla V100
1x Mellanox ConnectX-5 100Gb (model CX516A)
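Before blaming the operators I've also been sanity-checking the host side. These checks assume the RDMA userspace tools (infiniband-diags) are installed on the node:

    # confirm the ConnectX-5 is visible on the PCI bus
    lspci | grep -i mellanox
    # RDMA port/link state
    ibstat
    # check whether the relevant kernel modules are loaded
    lsmod | grep -E 'nvidia_peermem|mlx5'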
This is the Ansible playbook I use to deploy both the GPU and Network Operators:
- name: Install Nvidia GPU/Network operator
  hosts: portalgun
  become: false
  vars:
    ansible_python_interpreter: '/usr/bin/python3'
  tasks:
    - name: Adding Nvidia to helm repo
      kubernetes.core.helm_repository:
        name: nvidia
        repo_url: https://helm.ngc.nvidia.com/nvidia

    - name: installing Nvidia GPU operator helm chart
      kubernetes.core.helm:
        name: nvidia-gpu
        namespace: nvidia-gpu
        create_namespace: true
        chart_ref: nvidia/gpu-operator
        values_files:
          - gpu-values.yaml

    - name: installing Nvidia network operator helm chart
      kubernetes.core.helm:
        name: nvidia-net
        namespace: nvidia-net
        create_namespace: true
        chart_ref: nvidia/network-operator
        values_files:
          - net-values.yaml
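For what it's worth, I run it the usual way (the playbook and inventory file names here are just what I use locally):

    ansible-playbook -i inventory.yaml deploy-operators.yaml
    # then confirm both releases actually installed
    helm -n nvidia-gpu status nvidia-gpu
    helm -n nvidia-net status nvidia-net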
I've also attached my values.yaml files for deploying the operators, along with the output of kubectl describe and the logs from the GPU operator.
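In case it saves anyone opening the attachments, the GPUDirect-relevant part of my gpu-values.yaml follows the snippet in that app note (abridged; the rest of the file is stock):

    # build/load nvidia-peermem for GPUDirect RDMA
    driver:
      rdma:
        enabled: true
        # per the docs, set useHostMofed: true instead when MOFED is preinstalled on the host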
Any assistance is welcome; I can't figure out where my config is wrong.
Thank you
Frank