Broker in CrashLoopBackOff

Hi All,

I’m looking to deploy Morpheus for demonstrations to customers and am following the setup guide (Setup — Morpheus documentation). Each time I deploy Morpheus, the Broker goes into a ‘CrashLoopBackOff’, and the logs show the following error as the cause:

[main-SendThread()] ERROR org.apache.zookeeper.client.StaticHostProvider - Unable to resolve address: zookeeper:2181
java.net.UnknownHostException: zookeeper: Name or service not known
at java.base/java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
at java.base/java.net.InetAddress$PlatformNameService.lookupAllHostAddr(InetAddress.java:929)
at java.base/java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1519)
at java.base/java.net.InetAddress$NameServiceAddresses.get(InetAddress.java:848)
at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1509)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1368)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1302)
at org.apache.zookeeper.client.StaticHostProvider$1.getAllByName(StaticHostProvider.java:92)
at org.apache.zookeeper.client.StaticHostProvider.resolve(StaticHostProvider.java:147)
at org.apache.zookeeper.client.StaticHostProvider.next(StaticHostProvider.java:375)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1137)

This resolution attempt repeats until the broker errors out and restarts. The setup I have is 2x DGX A100 servers running NVIDIA DeepOps with Kubernetes; the output of kubectl shows the following:

kubectl -n morpheus4 get all
NAME READY STATUS RESTARTS AGE
pod/ai-engine-86df47c77d-r5xb2 1/1 Running 0 47h
pod/broker-76f7c64dc9-k8274 0/1 CrashLoopBackOff 481 (55s ago) 47h
pod/zookeeper-87f9f4dd-4bf7t 1/1 Running 0 47h

NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/ai-engine ClusterIP 10.233.33.127 <none> 8000/TCP,8001/TCP,8002/TCP 47h
service/broker ClusterIP 10.233.14.84 <none> 9092/TCP 47h
service/zookeeper ClusterIP 10.233.44.234 <none> 2181/TCP 47h

NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/ai-engine 1/1 1 1 47h
deployment.apps/broker 0/1 1 0 47h
deployment.apps/zookeeper 1/1 1 1 47h

NAME DESIRED CURRENT READY AGE
replicaset.apps/ai-engine-86df47c77d 1 1 1 47h
replicaset.apps/broker-76f7c64dc9 1 1 0 47h
replicaset.apps/zookeeper-87f9f4dd 1 1 1 47h

kubectl get pod -o wide -n morpheus4
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ai-engine-86df47c77d-r5xb2 1/1 Running 0 47h 10.233.84.114 dgx-01
broker-76f7c64dc9-k8274 0/1 CrashLoopBackOff 482 (4m57s ago) 47h 10.233.84.115 dgx-01
zookeeper-87f9f4dd-4bf7t 1/1 Running 0 47h 10.233.98.125 dgx-02

I’ve connected to the ai-engine container and tested network connectivity: it can reach both the Cluster-IP range and the External-IP range, and it can also reach the internet:

wget --spider http://www.google.com

Spider mode enabled. Check if remote file exists.
--2022-04-19 10:05:18--  http://www.google.com/
Resolving www.google.com (www.google.com)... 142.250.76.100, 2a00:1450:4009:822::2004
Connecting to www.google.com (www.google.com)|142.250.76.100|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain further links,
but recursion is disabled -- not retrieving.

If I try to resolve ‘zookeeper’ from the ai-engine container, it can’t resolve the name. Has anyone else run into this type of issue?
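
For reference, this is roughly how I’ve been checking name resolution from inside the ai-engine container (just a sketch; getent/nslookup availability depends on the image):

kubectl -n morpheus4 exec -it ai-engine-86df47c77d-r5xb2 -- /bin/bash
cat /etc/resolv.conf                                   # confirm the cluster nameserver and the morpheus4.svc.cluster.local search path
getent hosts zookeeper                                 # short service name, relies on the search path
getent hosts zookeeper.morpheus4.svc.cluster.local     # fully-qualified service name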

Hi @loughlin, thanks for being a part of the Morpheus EA program!

Can you paste the output of “kubectl describe svc broker zookeeper” please?

Also, “kubectl get po -n kube-system”. Something seems wrong with your cluster DNS resolution.
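
A quick way to sanity-check cluster DNS from a throwaway pod would be something like this (just a sketch; adjust the image if needed):

kubectl -n morpheus4 run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup zookeeper.morpheus4.svc.cluster.local
kubectl -n morpheus4 run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup kubernetes.default.svc.cluster.local

If the second lookup also fails, the problem is cluster-wide DNS rather than anything Morpheus-specific.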

Thanks,
\Pete

Hi Pete,

Happy to be part of the program :)

Please see requested outputs below:

kubectl -n morpheus4 describe svc broker zookeeper
Name: broker
Namespace: morpheus4
Labels: app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=broker
Annotations: meta.helm.sh/release-name: morpheus4
meta.helm.sh/release-namespace: morpheus4
Selector: app.kubernetes.io/name=broker
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.233.14.84
IPs: 10.233.14.84
Port: default 9092/TCP
TargetPort: 9092/TCP
Endpoints:
Session Affinity: None
Events:

Name: zookeeper
Namespace: morpheus4
Labels: app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=zookeeper
Annotations: meta.helm.sh/release-name: morpheus4
meta.helm.sh/release-namespace: morpheus4
Selector: app.kubernetes.io/name=zookeeper
Type: ClusterIP
IP Family Policy: SingleStack
IP Families: IPv4
IP: 10.233.44.234
IPs: 10.233.44.234
Port: default 2181/TCP
TargetPort: 2181/TCP
Endpoints: 10.233.98.125:2181
Session Affinity: None
Events:
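
For completeness, the endpoints can also be listed directly; the broker’s empty endpoint list above is expected while its pod isn’t Ready:

kubectl -n morpheus4 get endpoints broker zookeeper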

kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-5788f6558-vd56q 1/1 Running 2 (16d ago) 17d
calico-node-fddfn 1/1 Running 0 17d
calico-node-qtglf 1/1 Running 1 (8h ago) 17d
calico-node-sfsjj 1/1 Running 1 (16d ago) 17d
calico-node-trrjn 1/1 Running 0 17d
calico-node-vcmdn 1/1 Running 0 17d
coredns-8474476ff8-7jgtt 1/1 Running 1 (8h ago) 17d
coredns-8474476ff8-lvfmq 1/1 Running 0 17d
dns-autoscaler-5ffdc7f89d-k7nvp 1/1 Running 0 17d
kube-apiserver-k8s-mgmt01 1/1 Running 2 (8h ago) 17d
kube-apiserver-k8s-mgmt02 1/1 Running 1 17d
kube-apiserver-k8s-mgmt03 1/1 Running 1 17d
kube-controller-manager-k8s-mgmt01 1/1 Running 2 (8h ago) 17d
kube-controller-manager-k8s-mgmt02 1/1 Running 1 17d
kube-controller-manager-k8s-mgmt03 1/1 Running 2 17d
kube-proxy-4kfsv 1/1 Running 1 (16d ago) 17d
kube-proxy-8jbrw 1/1 Running 0 17d
kube-proxy-bp68m 1/1 Running 0 17d
kube-proxy-wtktw 1/1 Running 0 17d
kube-proxy-z6qrc 1/1 Running 1 (8h ago) 17d
kube-scheduler-k8s-mgmt01 1/1 Running 2 (8h ago) 17d
kube-scheduler-k8s-mgmt02 1/1 Running 2 (146m ago) 17d
kube-scheduler-k8s-mgmt03 1/1 Running 1 17d
kubernetes-dashboard-6c96f5b677-lfdnj 1/1 Running 3 (16d ago) 17d
kubernetes-metrics-scraper-54b676c794-9t7dl 1/1 Running 0 17d
nginx-proxy-dgx-01 1/1 Running 0 17d
nginx-proxy-dgx-02 1/1 Running 1 (16d ago) 17d
nodelocaldns-bvrg2 1/1 Running 2 (8h ago) 17d
nodelocaldns-kv4hm 1/1 Running 9 (26h ago) 17d
nodelocaldns-x7nvs 1/1 Running 0 17d
nodelocaldns-x9g8n 1/1 Running 0 17d
nodelocaldns-z6xg2 1/1 Running 16 (17h ago) 17d
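
The nodelocaldns pods also stand out with 9 and 16 restarts on some nodes, so I’ll grab their logs as well, roughly like this (a sketch; pod names taken from the listing above):

kubectl -n kube-system logs nodelocaldns-kv4hm --previous
kubectl -n kube-system logs nodelocaldns-z6xg2 --previous
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50     # CoreDNS, for comparison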

Thanks,
Loughlin.

Hi @pmackinnon,

I was incorrect in one of my previous statements: the ZooKeeper pod IP, 10.233.98.125, isn’t reachable from within the ai-engine container; output below:

kubectl -n morpheus4 exec --stdin --tty ai-engine-86df47c77d-r5xb2 -- /bin/bash

I0419 15:26:40.675545 22729 request.go:665] Waited for 1.158287914s due to client-side throttling, not priority and fairness, request: GET:https://172.21.2.91:6443/apis/certificates.k8s.io/v1?timeout=32s

root@ai-engine-86df47c77d-r5xb2:/opt/tritonserver# ping 10.233.98.125

PING 10.233.98.125 (10.233.98.125) 56(84) bytes of data.

^C

--- 10.233.98.125 ping statistics ---

6 packets transmitted, 0 received, 100% packet loss, time 5107ms
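
Ping alone may not be conclusive if ICMP is filtered anywhere, so I’ll also try a TCP-level check against the ZooKeeper port, roughly like this (assuming bash’s /dev/tcp and the timeout utility are available in the container):

timeout 3 bash -c '</dev/tcp/10.233.98.125/2181' && echo reachable || echo unreachable    # ZooKeeper pod IP
timeout 3 bash -c '</dev/tcp/10.233.44.234/2181' && echo reachable || echo unreachable    # ZooKeeper ClusterIP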

In addition, the ZooKeeper container is deployed on a separate DGX from the Broker and AI-Engine containers; below are the ping outputs from the AI-Engine:

root@ai-engine-86df47c77d-r5xb2:/opt/tritonserver# ping 10.233.33.127

PING 10.233.33.127 (10.233.33.127) 56(84) bytes of data.

64 bytes from 10.233.33.127: icmp_seq=1 ttl=64 time=0.051 ms

64 bytes from 10.233.33.127: icmp_seq=2 ttl=64 time=0.019 ms

64 bytes from 10.233.33.127: icmp_seq=3 ttl=64 time=0.024 ms

64 bytes from 10.233.33.127: icmp_seq=4 ttl=64 time=0.025 ms

^C

--- 10.233.33.127 ping statistics ---

4 packets transmitted, 4 received, 0% packet loss, time 3060ms

rtt min/avg/max/mdev = 0.019/0.029/0.051/0.012 ms

root@ai-engine-86df47c77d-r5xb2:/opt/tritonserver# ping 10.233.14.84

PING 10.233.14.84 (10.233.14.84) 56(84) bytes of data.

64 bytes from 10.233.14.84: icmp_seq=1 ttl=64 time=0.052 ms

64 bytes from 10.233.14.84: icmp_seq=2 ttl=64 time=0.016 ms

64 bytes from 10.233.14.84: icmp_seq=3 ttl=64 time=0.017 ms

64 bytes from 10.233.14.84: icmp_seq=4 ttl=64 time=0.016 ms

64 bytes from 10.233.14.84: icmp_seq=5 ttl=64 time=0.011 ms

^C

--- 10.233.14.84 ping statistics ---

5 packets transmitted, 5 received, 0% packet loss, time 4101ms

rtt min/avg/max/mdev = 0.011/0.022/0.052/0.014 ms

I have a feeling the issue is going to be related to the tunl0 interface on Calico, and I can see that my Calico peering is not 100%:

sudo calicoctl node status
Calico process is running.

IPv4 BGP status
+------------------------+-------------------+-------+----------+-------------+
| PEER ADDRESS           | PEER TYPE         | STATE | SINCE    | INFO        |
+------------------------+-------------------+-------+----------+-------------+
| 172.21.10.151          | node-to-node mesh | up    | 06:32:37 | Established |
| 172.21.10.153          | node-to-node mesh | up    | 06:32:38 | Established |
| 172.21.2.92            | node-to-node mesh | up    | 06:32:38 | Established |
| 172.21.2.93            | node-to-node mesh | up    | 06:32:37 | Established |
| 172.21.10.151.port.179 | global            | start | 06:32:36 | Idle        |
+------------------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.

There is a management node missing from the Calico nodes on IP 172.21.2.91, which I will need to troubleshoot and bring online. I can see the BGP peering to the Top-of-Rack (SN2100 w/ Cumulus) is okay:

sudo calicoctl get bgppeer
NAME PEERIP NODE ASN
dgx01-tor 172.21.10.2 dgx-01 64713
dgx02-tor 172.21.10.3 dgx-02 64714
my-global-peer 172.21.10.151 (global) 64512

As a next step I’ll troubleshoot the Calico issues and get the topology working correctly, then report back to the thread with what I find and the steps used to resolve it.
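
A rough checklist of what I plan to work through (just a sketch; exact labels and interface names may differ on a DeepOps/Kubespray deployment):

calicoctl get ippool -o wide                                    # check IPIPMODE / VXLANMODE and NAT for the pod CIDR
ip addr show tunl0                                              # on dgx-01 and dgx-02, confirm the IPIP tunnel interface is up
sudo calicoctl node status                                      # repeat on each node once 172.21.2.91 is back in the mesh
kubectl -n kube-system logs -l k8s-app=calico-node --tail=50    # look for BGP / Felix errors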

Will keep you posted on the outcome.

Thanks,
Loughlin.

Sounds good @loughlin

The Morpheus service declarations were defined pretty generically and without consideration for DualStack.
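
If it helps to confirm how the API server defaulted them in your cluster, something like this should show the IP-family settings per service (just a sketch; field names match the describe output above):

kubectl -n morpheus4 get svc broker -o jsonpath='{.spec.ipFamilyPolicy} {.spec.ipFamilies[*]}{"\n"}'
kubectl -n morpheus4 get svc zookeeper -o jsonpath='{.spec.ipFamilyPolicy} {.spec.ipFamilies[*]}{"\n"}'

On your cluster these should come back as "SingleStack IPv4", matching the describe output earlier in the thread.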