When trying to create a pipeline: An error occurred forwarding 45807 -> 50051: error forwarding port 50051

Hello,
This is driving me nuts. I’m new to Kubernetes and Clara Deploy, so I am not sure how to debug this. Any help is appreciated!

I installed Clara Deploy with the Ansible installer. Everything seems to be in order.
I checked kubectl get pods and helm ls, and the output matches the “verify installation” section of the documentation.
I pulled the ai_hippocampus_segmentation pipeline and tried to create it, but then I get the following error:

$:~/.clara/pipelines/clara_ai_hippocampus_pipeline$ clara create pipeline -p hippocampus-pipeline.yaml --verbose
Starting create pipeline call
E1008 11:55:28.477173   41326 portforward.go:400] an error occurred forwarding 37117 -> 50051: error forwarding port 50051 to pod c696596501ad47d6730df0b0bdce6e1bc7318b553952ab8dcf45d4602a1a8dd6, uid : exit status 1: 2021/10/08 11:55:28 socat[41343] E connect(5, AF=2 127.0.0.1:50051, 16): Connection refused
Unable to create pipeline. Error:rpc error: code = Unavailable desc = connection closed

Checking kubectl get pods now reveals:

$:~/.clara/pipelines/clara_ai_hippocampus_pipeline$ kubectl get pods
NAME                                             READY   STATUS             RESTARTS   AGE
clara-clara-platformapiserver-75bc9b4867-xlmdn   0/1     CrashLoopBackOff   4          11m
clara-node-monitor-64vcn                         1/1     Running            0          11m
clara-resultsservice-f758699b-vhhcg              1/1     Running            0          11m
clara-ui-758d9645b7-479w7                        1/1     Running            0          11m
clara-workflow-controller-7c66d77f55-8r9nc       0/1     CrashLoopBackOff   6          11m
fluentd-tskrm                                    0/1     CrashLoopBackOff   5          11m

I also tried uninstalling, flushing iptables, and reinstalling, as I suspected an IP-collision problem in the Ansible install, where the default CIDR is set to 10.254.0.0/16 (or something like that). My machine has a static IP starting with 10.61. I doubt this was the cause of the problem, but it was worth a try.
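
For reference, the iptables flush I ran was roughly this (I am not sure I caught every table):

$ sudo iptables -F
$ sudo iptables -t nat -F
$ sudo iptables -t mangle -F
$ sudo iptables -X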

I sincerely hope someone knows a solution to this “port forwarding” issue, as it is driving me mad!
Regards M

Providing logs from the crashed containers:

:~/.clara/pipelines/clara_ai_hippocampus_pipeline$ kubectl logs clara-clara-platformapiserver-75bc9b4867-xlmdn
Fatal Exception: One or more errors occurred. (A task was canceled.)
System.AggregateException: One or more errors occurred. (A task was canceled.)
 ---> System.Threading.Tasks.TaskCanceledException: A task was canceled.
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at Nvidia.Clara.Platform.Server.Program.Run(Options options) in /Clara/src/Platform/Server/Program.cs:line 252
   at Nvidia.Clara.Platform.Server.Program.Main(String[] args) in /Clara/src/Platform/Server/Program.cs:line 445

and

~/.clara/pipelines/clara_ai_hippocampus_pipeline$ kubectl logs fluentd-tskrm                  
2021-10-08 10:14:17 +0000 [info]: parsing config file is succeeded path="/fluentd/etc/fluent.conf"
2021-10-08 10:14:17 +0000 [info]: gem 'fluent-plugin-concat' version '2.4.0'
2021-10-08 10:14:17 +0000 [info]: gem 'fluent-plugin-dedot_filter' version '1.0.0'
2021-10-08 10:14:17 +0000 [info]: gem 'fluent-plugin-detect-exceptions' version '0.0.13'
2021-10-08 10:14:17 +0000 [info]: gem 'fluent-plugin-elasticsearch' version '4.0.9'
2021-10-08 10:14:17 +0000 [info]: gem 'fluent-plugin-grok-parser' version '2.6.1'
2021-10-08 10:14:17 +0000 [info]: gem 'fluent-plugin-json-in-json-2' version '1.0.2'
2021-10-08 10:14:17 +0000 [info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '2.3.0'
2021-10-08 10:14:17 +0000 [info]: gem 'fluent-plugin-multi-format-parser' version '1.0.0'
2021-10-08 10:14:17 +0000 [info]: gem 'fluent-plugin-prometheus' version '1.6.1'
2021-10-08 10:14:17 +0000 [info]: gem 'fluent-plugin-record-modifier' version '2.0.1'
2021-10-08 10:14:17 +0000 [info]: gem 'fluent-plugin-rewrite-tag-filter' version '2.2.0'
2021-10-08 10:14:17 +0000 [info]: gem 'fluent-plugin-systemd' version '1.0.2'
2021-10-08 10:14:17 +0000 [info]: gem 'fluentd' version '1.11.0'
2021-10-08 10:15:17 +0000 [error]: config error file="/fluentd/etc/fluent.conf" error_class=Fluent::ConfigError error="Invalid Kubernetes API v1 endpoint https://10.96.0.1:443/api: Timed out connecting to server"

and

~/.clara/pipelines/clara_ai_hippocampus_pipeline$ kubectl logs clara-workflow-controller-7c66d77f55-8r9nc
Error: Get https://10.96.0.1:443/api/v1/namespaces/default/configmaps/clara-workflow-controller-configmap: dial tcp 10.96.0.1:443: i/o timeout

Hi mathiser,

Welcome to the forums and thanks for your interest in Clara Deploy.

First, which version of Clara Deploy are you using?

I’m not sure what would cause this, but both the workflow controller and the fluentd pods are attempting to reach the Kubernetes API on the 10.96 subnet and timing out. The default pod CIDR with the 0.8.1 Ansible installation should be 10.254.0.0.
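
For context, 10.96.0.1 is normally the in-cluster address of the Kubernetes API service itself (the first address of the default 10.96.0.0/12 service CIDR), which you can confirm with:

$ kubectl get svc kubernetes -o jsonpath='{.spec.clusterIP}'

So the failing pods are trying to reach the correct address; the traffic just isn’t getting through.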

Can you provide the output of the following:

kubectl describe node
kubectl describe pod fluentd-tskrm
kubectl describe pod clara-clara-platformapiserver-75bc9b4867-xlmdn

Thanks,
Kris

Thank you Kris, for always being helpful on the forum!

~$ clara version
Clara CLI version: 0.8.1-20693.a326d62b
E1009 06:44:55.804322  706330 portforward.go:400] an error occurred forwarding 34565 -> 50051: error forwarding port 50051 to pod ab07a6b0ea0c92dc01f9503a95f1a32bac2eed39f5a6a7b8f39ed6452938a360, uid : exit status 1: 2021/10/09 06:44:55 socat[706347] E connect(5, AF=2 127.0.0.1:50051, 16): Connection refused
Error reaching Clara Platform; rpc error: code = Unavailable desc = connection closed

And the kubectl output (the pod name hashes have changed since I restarted, but the same problem remains):

$ kubectl describe node
Name:               omen
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=omen
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"1e:2c:7f:19:a6:01"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.61.200.14
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Fri, 08 Oct 2021 14:20:44 +0200
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  omen
  AcquireTime:     <unset>
  RenewTime:       Sat, 09 Oct 2021 06:44:20 +0200
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Fri, 08 Oct 2021 14:53:58 +0200   Fri, 08 Oct 2021 14:53:58 +0200   FlannelIsUp                  Flannel is running on this node
  MemoryPressure       False   Sat, 09 Oct 2021 06:42:56 +0200   Fri, 08 Oct 2021 14:20:41 +0200   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Sat, 09 Oct 2021 06:42:56 +0200   Fri, 08 Oct 2021 14:20:41 +0200   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Sat, 09 Oct 2021 06:42:56 +0200   Fri, 08 Oct 2021 14:20:41 +0200   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Sat, 09 Oct 2021 06:42:56 +0200   Fri, 08 Oct 2021 14:21:17 +0200   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.61.200.14
  Hostname:    omen
Capacity:
  cpu:                16
  ephemeral-storage:  959200352Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32662744Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                16
  ephemeral-storage:  883999042940
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32560344Ki
  nvidia.com/gpu:     1
  pods:               110
System Info:
  Machine ID:                 0be87bed6df74e0bb14c3b4e3ee50b25
  System UUID:                3e179440-ad71-11eb-b34e-3c18a010ff18
  Boot ID:                    1d2a6ea5-a3f6-4ff0-8d05-c41265a0c5fd
  Kernel Version:             5.11.0-37-generic
  OS Image:                   Ubuntu 20.04.3 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://20.10.7
  Kubelet Version:            v1.19.4
  Kube-Proxy Version:         v1.19.4
PodCIDR:                      10.254.0.0/24
PodCIDRs:                     10.254.0.0/24
Non-terminated Pods:          (17 in total)
  Namespace                   Name                                                   CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                   ------------  ----------  ---------------  -------------  ---
  default                     clara-clara-platformapiserver-75bc9b4867-xkcms         0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  default                     clara-dicom-adapter-7fb4cc587b-cq9rd                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         92s
  default                     clara-node-monitor-f7lcz                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  default                     clara-render-server-clara-renderer-6b9d4799f6-kpr4b    0 (0%)        0 (0%)      0 (0%)           0 (0%)         82s
  default                     clara-resultsservice-f758699b-gwzxk                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  default                     clara-ui-758d9645b7-gtl6k                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  default                     clara-workflow-controller-7c66d77f55-zrdvt             0 (0%)        0 (0%)      0 (0%)           0 (0%)         8h
  default                     fluentd-6fnv5                                          100m (0%)     0 (0%)      200Mi (0%)       512Mi (1%)     8h
  kube-system                 coredns-f9fd979d6-4nb8v                                100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     16h
  kube-system                 coredns-f9fd979d6-fsxb4                                100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     16h
  kube-system                 etcd-omen                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
  kube-system                 kube-apiserver-omen                                    250m (1%)     0 (0%)      0 (0%)           0 (0%)         16h
  kube-system                 kube-controller-manager-omen                           200m (1%)     0 (0%)      0 (0%)           0 (0%)         16h
  kube-system                 kube-flannel-ds-h7kp5                                  100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      16h
  kube-system                 kube-proxy-rcpn7                                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
  kube-system                 kube-scheduler-omen                                    100m (0%)     0 (0%)      0 (0%)           0 (0%)         16h
  kube-system                 nvidia-device-plugin-sdrgz                             0 (0%)        0 (0%)      0 (0%)           0 (0%)         16h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                950m (5%)   100m (0%)
  memory             390Mi (1%)  902Mi (2%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
Events:              <none>

And the API server:

$ kubectl describe pod clara-clara-platformapiserver-75bc9b4867-xkcms
Name:         clara-clara-platformapiserver-75bc9b4867-xkcms
Namespace:    default
Priority:     0
Node:         omen/10.61.200.14
Start Time:   Fri, 08 Oct 2021 21:50:34 +0200
Labels:       app=clara
              name=apis
              pod-template-hash=75bc9b4867
              release=clara
Annotations:  <none>
Status:       Running
IP:           10.254.0.45
IPs:
  IP:           10.254.0.45
Controlled By:  ReplicaSet/clara-clara-platformapiserver-75bc9b4867
Containers:
  platformapiserver:
    Container ID:  docker://a33bfdd6a83590ed9b75bec417366bcc255bdf2c6ee0e62a47beaad57bf59e9e
    Image:         nvcr.io/nvidia/clara/platformapiserver:0.8.1-2108.1
    Image ID:      docker-pullable://nvcr.io/nvidia/clara/platformapiserver@sha256:a8a6eda218a300a00baa15cf848ebcb5594b6e217d58b57eeb6462ac48199487
    Port:          50051/TCP
    Host Port:     0/TCP
    Args:
      --storage
      disk:/clara/payloads
      --commonvolume
      clara-platformapiserver-common-volume-claim
      --default-cpu-limit
      1
      --default-memory-limit
      1024
      --servicevolume
      disk:/clara/service-volumes
      --podmanagerimage
      nvcr.io/nvidia/clara/podmanager
      --podmanagertag
      0.8.1-2108.1
      --pipeline-timeout-default
      600
      --pipeline-timeout-maximum
      1800
      --pipeline-timeout-grace
      15
      --inferenceserverimage
      nvcr.io/nvidia/tritonserver
      --inferenceservertag
      20.07-v1-py3
      --logsarchiverootpath
      /clara/job-logs-archive/
      --maxstorageusagepercentage
      80
      --modelsyncimage
      nvcr.io/nvidia/clara/model-sync-daemon
      --modelsynctag
      0.8.1-2108.1
      --modellimits
      8
      --maxtrtisinstances
      1
      --enablecleaner
      true
      --podcleanerbuffer
      1
      --podcleanerfrequency
      3
      --payloadcleanerenable
      true
      --payloadcleanerbuffer
      60
      --payloadcleanerfrequency
      10
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 09 Oct 2021 06:41:40 +0200
      Finished:     Sat, 09 Oct 2021 06:43:21 +0200
    Ready:          False
    Restart Count:  82
    Environment:
      CLARA_NAMESPACE:  default (v1:metadata.namespace)
    Mounts:
      /clara/job-logs-archive/ from clara-platformapiserver-job-logs-archive-volume (rw)
      /clara/payloads from clara-platformapiserver-storage (rw)
      /clara/service-volumes from clara-platformapiserver-service-volume (rw)
      /data/models from clara-platformapiserver-model-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from platformapiserver-service-account-token-9jz68 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  clara-platformapiserver-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  clara-platformapiserver-payload-volume-claim
    ReadOnly:   false
  clara-platformapiserver-service-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  clara-platformapiserver-service-volume-claim
    ReadOnly:   false
  clara-platformapiserver-model-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  clara-platformapiserver-model-volume-claim
    ReadOnly:   false
  clara-platformapiserver-job-logs-archive-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  clara-platformapiserver-job-logs-archive-volume-claim
    ReadOnly:   false
  platformapiserver-service-account-token-9jz68:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  platformapiserver-service-account-token-9jz68
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason   Age                  From     Message
  ----     ------   ----                 ----     -------
  Warning  BackOff  61s (x1827 over 8h)  kubelet  Back-off restarting failed container

And fluentd:

$ kubectl describe pod fluentd-6fnv5
Name:         fluentd-6fnv5
Namespace:    default
Priority:     0
Node:         omen/10.61.200.14
Start Time:   Fri, 08 Oct 2021 21:50:26 +0200
Labels:       app.kubernetes.io/instance=clara
              app.kubernetes.io/name=clara-log-collector
              controller-revision-hash=8697895b4d
              kubernetes.io/cluster-service=true
              pod-template-generation=1
Annotations:  <none>
Status:       Running
IP:           10.254.0.42
IPs:
  IP:           10.254.0.42
Controlled By:  DaemonSet/fluentd
Containers:
  fluentd:
    Container ID:   docker://1aa63b552640ddf37461c56e0e0d28608b2b699e1371bf696c411790f3a899fc
    Image:          fluent/fluentd-kubernetes-daemonset:v1.11.0-debian-elasticsearch7-1.0
    Image ID:       docker-pullable://fluent/fluentd-kubernetes-daemonset@sha256:692addd615674aacd1e5fac98dff53ff2851270414d62514921779b4aa5da1b8
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sat, 09 Oct 2021 06:42:49 +0200
      Finished:     Sat, 09 Oct 2021 06:43:50 +0200
    Ready:          False
    Restart Count:  91
    Limits:
      memory:  512Mi
    Requests:
      cpu:        100m
      memory:     200Mi
    Environment:  <none>
    Mounts:
      /fluentd/etc/fluent.conf from fluentconfig (rw,path="fluent.conf")
      /var/lib/docker/containers from varlibdockercontainers (ro)
      /var/log from varlog (rw)
      /var/log/fluent from clara-log-path (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from fluentd-token-s5kqt (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  varlibdockercontainers:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker/containers
    HostPathType:  
  fluentconfig:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      fluentd-configmap
    Optional:  false
  clara-log-path:
    Type:          HostPath (bare host directory volume)
    Path:          /clara/log-archive
    HostPathType:  
  fluentd-token-s5kqt:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  fluentd-token-s5kqt
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists
                 node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                 node.kubernetes.io/unreachable:NoExecute op=Exists
                 node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason   Age                    From     Message
  ----     ------   ----                   ----     -------
  Normal   Pulled   17m (x90 over 8h)      kubelet  Container image "fluent/fluentd-kubernetes-daemonset:v1.11.0-debian-elasticsearch7-1.0" already present on machine
  Warning  BackOff  2m10s (x2035 over 8h)  kubelet  Back-off restarting failed container

Would a quick fix be to just reinstall Clara with the pod CIDR set to 10.96.0.0?

Thanks in advance!
mathiser

Dear Kris,
I apologise for bumping this already. Can you think of a way to circumvent this problem? Can I change the IP directly in the fluentd config?
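
For example, I can see from the describe output that the fluentd config comes from a configmap, so I assume I could at least look at it with something like:

$ kubectl get configmap fluentd-configmap -o yaml

but I don’t know whether the API endpoint is even set there or picked up from the environment.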

I’m not sure where to continue reading. Kubernetes is quite overwhelming… especially because the whole framework is packaged in easy installers, I have no idea what is going on behind the scenes.

Thanks in advance!

Hi mathiser,

Can you take a look at

kubectl get svc

You should see something along the lines of this:

$ kubectl get svc
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                         AGE
clara                  NodePort    10.108.26.238    <none>        50051:30031/TCP                 10d
clara-console          NodePort    10.102.206.76    <none>        8080:32002/TCP,5000:32003/TCP   10d
clara-dicom-adapter    NodePort    10.107.188.120   <none>        104:30048/TCP,5000:31068/TCP    10d
clara-resultsservice   ClusterIP   10.101.104.97    <none>        8088/TCP                        10d
clara-ui               ClusterIP   10.100.105.126   <none>        80/TCP                          10d
kubernetes             ClusterIP   10.96.0.1        <none>        443/TCP                         10d

For whatever reason, the fluentd and Clara pods running on the pod CIDR 10.254 network are having trouble talking to the Kubernetes API on the 10.96 service network.
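
One quick way to confirm that from inside the pod network is to run a throwaway pod and try the API directly (assuming the node can pull a small image with curl; any such image would do):

$ kubectl run nettest --rm -it --restart=Never --image=curlimages/curl --command -- curl -k -m 5 https://10.96.0.1:443/healthz

If that times out as well, the problem is in the service routing (kube-proxy/iptables) rather than anything Clara-specific.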

You mentioned reinstalling: is it possible you have a residual Kubernetes config in ~/.kube or a CNI config in /etc/cni that’s breaking this new installation?
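
For example, to check whether anything is left over from the previous install:

$ ls -la ~/.kube /etc/cni/net.d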

You may want to try starting from scratch: doing a kubeadm reset, removing any residual config, flushing iptables, and reinstalling via Ansible.
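
A rough sketch of that cleanup on a single-node setup like yours (double-check each path before deleting anything):

$ sudo kubeadm reset -f
$ sudo rm -rf /etc/cni/net.d ~/.kube
$ sudo iptables -F && sudo iptables -t nat -F && sudo iptables -t mangle -F && sudo iptables -X
$ sudo ip link delete cni0 2>/dev/null; sudo ip link delete flannel.1 2>/dev/null

Then rerun the Ansible installer.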

-Kris