Hi Guys,
I’ve recently joined the Morpheus EA. My first task is to install it on an AWS EC2 instance, but I’ve run into some issues. Can you advise?
I’m following the instructions in “Morpheus_Developer_Kit_on_AWS_0.1-062121-2.pdf”.
There, we’re instructed to install the EGX Stack for AWS by following these instructions:
I followed them up to the section “Validate the state of the GPU Operator”, skipping only “Adding additional node to EGX Stack”.
According to the installation guide, the expected output for the terminal command:
kubectl get pods --all-namespaces | grep -v kube-system
is
NAMESPACE               NAME                                                             READY  STATUS     RESTARTS  AGE
default                 gpu-operator-1590097431-node-feature-discovery-master-76578jwwt  1/1    Running    0         5m2s
default                 gpu-operator-1590097431-node-feature-discovery-worker-pv5nf      1/1    Running    0         5m2s
default                 gpu-operator-74c97448d9-n75g8                                    1/1    Running    1         5m2s
gpu-operator-resources  nvidia-container-toolkit-daemonset-pwhfr                         1/1    Running    0         4m58s
gpu-operator-resources  nvidia-dcgm-exporter-bdzrz                                       1/1    Running    0         4m57s
gpu-operator-resources  nvidia-device-plugin-daemonset-zmjhn                             1/1    Running    0         4m57s
gpu-operator-resources  nvidia-device-plugin-validation                                  0/1    Completed  0         4m57s
gpu-operator-resources  nvidia-driver-daemonset-7b66v                                    1/1    Running    0         4m57s
… But I get a different output:
NAMESPACE               NAME                                                             READY  STATUS     RESTARTS  AGE
default                 gpu-operator-1637173708-node-feature-discovery-master-78bdv66dz  1/1    Running    0         2m53s
default                 gpu-operator-1637173708-node-feature-discovery-worker-mjq2g      1/1    Running    0         2m53s
default                 gpu-operator-76fb8d5c55-g62x9                                    1/1    Running    0         2m53s
gpu-operator-resources  nvidia-container-toolkit-daemonset-d854w                         0/1    Init:0/1   0         2m20s
gpu-operator-resources  nvidia-driver-daemonset-qrmp7
i.e. the “nvidia-container-toolkit-daemonset” pod never gets past the init stage, and I don’t see the “nvidia-dcgm-exporter” pod or either of the two “nvidia-device-plugin” pods.
None of the validations listed in the install guide work past this point.
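In case it helps anyone reproduce this, here is a minimal sketch for filtering the pod listing down to the pods that are not healthy. It assumes the default column order shown above (NAMESPACE NAME READY STATUS RESTARTS AGE); `not_ready_pods` is a hypothetical helper name of my own, not something from the install guide.

```shell
# Print every pod whose STATUS column is neither Running nor Completed.
# Assumption: input uses the default `kubectl get pods --all-namespaces`
# column order, with a header on the first line.
not_ready_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" {
    # A row with no STATUS field is reported as "unknown".
    print $1 "/" $2 " status=" ($4 == "" ? "unknown" : $4)
  }'
}

# Usage against a live cluster (commented out so the snippet has no side effects):
# kubectl get pods --all-namespaces | grep -v kube-system | not_ready_pods
```

The helper only reads the listing from standard input, so it can also be run on saved output.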
Can you advise?
Thanks