Fluentd elasticsearch clara-monitor-server pod restarting frequently

Hi. I noticed that the fluentd elasticsearch pod (part of the monitor server) is restarting frequently. Should I be concerned? If so, what should I do about it?

Hi @mastreips,

Thanks for reporting the issue!
Could you please check whether there are any useful debug messages about the restarts and share them?

First, please check the clara-monitor pods: kubectl get pods | grep clara-monitor

gbae@gbae:~$ kubectl get pods | grep clara-monitor
clara-monitor-server-fluentd-elasticsearch-8m2cm       1/1     Running   0          3m10s
clara-monitor-server-grafana-5f874b974d-x85m4          1/1     Running   0          3m10s
clara-monitor-server-monitor-server-669c6cb97f-gsn67   1/1     Running   0          3m10s

and check the description of the failing/restarting pod: kubectl describe pods/clara-monitor-server-fluentd-elasticsearch-8m2cm

gbae@gbae:~$ kubectl describe pods/clara-monitor-server-fluentd-elasticsearch-8m2cm
Name:           clara-monitor-server-fluentd-elasticsearch-8m2cm
Namespace:      default
Priority:       0
Node:           gbae.nvidia.com/10.110.28.98
Start Time:     Mon, 22 Jun 2020 10:48:55 -0700
Labels:         app.kubernetes.io/instance=clara-monitor-server
                app.kubernetes.io/managed-by=Tiller
                app.kubernetes.io/name=fluentd-elasticsearch
                app.kubernetes.io/version=2.7.0
                controller-revision-hash=64745c89f8
                helm.sh/chart=fluentd-elasticsearch-5.0.0
                kubernetes.io/cluster-service=true
                pod-template-generation=1
Annotations:    checksum/config: 743192195d716651b44374afd375eadaabd254b3d670dc9c65e409778c33ba5b
Status:         Running
IP:             10.244.0.20
Controlled By:  DaemonSet/clara-monitor-server-fluentd-elasticsearch
Containers:
  clara-monitor-server-fluentd-elasticsearch:
    Container ID:   docker://3524020017ec6cf33459e477c7573b13f820400c84933fff475fb7155f8a066b
    Image:          quay.io/fluentd_elasticsearch/fluentd:v2.7.0
    Image ID:       docker-pullable://quay.io/fluentd_elasticsearch/fluentd@sha256:9d97cc110835d29c77a335475b33a29735892378ff7441f79fad9b9b68cfa149
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Mon, 22 Jun 2020 10:49:13 -0700
    Ready:          True
    Restart Count:  0
    Liveness:       exec [/bin/sh -c LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300}; STUCK_THRESHOLD_SECONDS=${STUCK_THRESHOLD_SECONDS:-900}; if [ ! -e /var/log/fluentd-buffers ]; then
  exit 1;
fi; touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck; if [ -z "$(find /var/log/fluentd-buffers -type d -newer /tmp/marker-stuck -print -quit)" ]; then
  rm -rf /var/log/fluentd-buffers;
  exit 1;
fi; touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness; if [ -z "$(find /var/log/fluentd-buffers -type d -newer /tmp/marker-liveness -print -quit)" ]; then
  exit 1;
fi;
] delay=600s timeout=1s period=60s #success=1 #failure=3
    Environment:
      FLUENTD_ARGS:               --no-supervisor -q
      OUTPUT_HOST:                elasticsearch-master
      OUTPUT_PORT:                9200
      OUTPUT_PATH:                
      LOGSTASH_PREFIX:            logstash
      OUTPUT_SCHEME:              http
      OUTPUT_SSL_VERIFY:          true
      OUTPUT_SSL_VERSION:         TLSv1_2
      OUTPUT_TYPE_NAME:           _doc
      OUTPUT_BUFFER_CHUNK_LIMIT:  2M
      OUTPUT_BUFFER_QUEUE_LIMIT:  8
      OUTPUT_LOG_LEVEL:           info
      K8S_NODE_NAME:               (v1:spec.nodeName)
    Mounts:
      /etc/fluent/config.d from config-volume (rw)
      /usr/lib64 from libsystemddir (ro)
      /var/lib/docker/containers from varlibdockercontainers (ro)
      /var/log from varlog (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from clara-monitor-server-fluentd-elasticsearch-token-2gwk5 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  varlibdockercontainers:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker/containers
    HostPathType:  
  libsystemddir:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/lib64
    HostPathType:  
  config-volume:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      clara-monitor-server-fluentd-elasticsearch
    Optional:  false
  clara-monitor-server-fluentd-elasticsearch-token-2gwk5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  clara-monitor-server-fluentd-elasticsearch-token-2gwk5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type    Reason     Age    From                      Message
  ----    ------     ----   ----                      -------
  Normal  Scheduled  3m25s  default-scheduler         Successfully assigned default/clara-monitor-server-fluentd-elasticsearch-8m2cm to gbae.nvidia.com
  Normal  Pulling    3m25s  kubelet, gbae.nvidia.com  Pulling image "quay.io/fluentd_elasticsearch/fluentd:v2.7.0"
  Normal  Pulled     3m8s   kubelet, gbae.nvidia.com  Successfully pulled image "quay.io/fluentd_elasticsearch/fluentd:v2.7.0"
  Normal  Created    3m7s   kubelet, gbae.nvidia.com  Created container clara-monitor-server-fluentd-elasticsearch
  Normal  Started    3m7s   kubelet, gbae.nvidia.com  Started container clara-monitor-server-fluentd-elasticsearch

and see whether there are error logs in the pod: kubectl logs clara-monitor-server-fluentd-elasticsearch-8m2cm --all-containers, or add -f to follow the messages.
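
If the container has already restarted, the logs from the previous instance are usually more telling. Two additional standard kubectl commands worth trying, using the pod name from the output above:

$ kubectl logs clara-monitor-server-fluentd-elasticsearch-8m2cm --previous
$ kubectl get events --field-selector involvedObject.name=clara-monitor-server-fluentd-elasticsearch-8m2cm

The first prints the logs of the last terminated container instance; the second lists only the events for that pod.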

By the way, you can use Clara without the monitor service (run clara monitor stop to disable it).

Thanks gigony - here is the description of the failing pod:

Name:           clara-monitor-server-fluentd-elasticsearch-2gvcg

...

    Restart Count:  349

...

Events:
  Type     Reason     Age                      From                      Message
  ----     ------     ----                     ----                      -------
  Warning  Unhealthy  2m18s (x905 over 2d17h)  kubelet, virtualserver01  Liveness probe failed:
  Normal   Killing    78s (x302 over 2d17h)    kubelet, virtualserver01  Container clara-monitor-server-fluentd-elasticsearch failed liveness probe, will be restarted

The log file seems to be unreadable:

Log unreadable. It is excluded and would be examined next time.

I didn’t manually set up any monitoring services as described here.

Maybe that’s why I am seeing this error?

Hi @mastreips, I am still thinking about what could have caused the issue.
The log message ('Log unreadable. It is excluded and would be examined next time.') might not be the root cause, as I can also see such messages and they are only logged as warnings.
It might be a similar issue to https://github.com/helm/charts/issues/8519 (though that issue doesn't have a solution yet).
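
If you want to see which file fluentd is excluding, the warning in the fluentd logs normally names it, so something like this should narrow it down (pod name taken from your output above; the grep pattern is just an example):

$ kubectl logs clara-monitor-server-fluentd-elasticsearch-2gvcg | grep -i unreadable

Once you know the file, checking its permissions on the host (fluentd reads container logs from /var/lib/docker/containers) may show whether it is a permission problem.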

Since the liveness probe failed on your system, I am curious which system configuration you have. Could you please execute kubectl describe nodes and share the output so we can see the k8s node's status? For reference, here is the output from my machine:

$ kubectl describe nodes
Name:               gbae.nvidia.com
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=gbae.nvidia.com
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":"1e:d7:5a:a3:93:e6"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.110.28.98
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 22 Jun 2020 10:42:45 -0700
Taints:             <none>
Unschedulable:      false
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 24 Jun 2020 17:23:53 -0700   Mon, 22 Jun 2020 10:42:42 -0700   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 24 Jun 2020 17:23:53 -0700   Mon, 22 Jun 2020 10:42:42 -0700   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 24 Jun 2020 17:23:53 -0700   Mon, 22 Jun 2020 10:42:42 -0700   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 24 Jun 2020 17:23:53 -0700   Mon, 22 Jun 2020 10:43:14 -0700   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.110.28.98
  Hostname:    gbae.nvidia.com
Capacity:
 cpu:                12
 ephemeral-storage:  959200352Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             65537436Ki
 nvidia.com/gpu:     1
 pods:               110
Allocatable:
 cpu:                12
 ephemeral-storage:  883999042940
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             65435036Ki
 nvidia.com/gpu:     1
 pods:               110
System Info:
 Machine ID:                 88fc0090508a435b935bfd746a428dbf
 System UUID:                26ce57e0-d7da-11dd-aac5-40b0769f757d
 Boot ID:                    723603a2-35f1-4e17-9380-8a6a03bc2c24
 Kernel Version:             5.3.0-59-generic
 OS Image:                   Ubuntu 18.04.4 LTS
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://19.3.11
 Kubelet Version:            v1.15.4
 Kube-Proxy Version:         v1.15.4
PodCIDR:                     10.244.0.0/24
Non-terminated Pods:         (24 in total)
  Namespace                  Name                                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                                    ------------  ----------  ---------------  -------------  ---
  default                    clara-clara-platformapiserver-66ccc9c54-h6l26           0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  default                    clara-console-b5f64754b-ppqdm                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  default                    clara-console-mongodb-85f8bd5f95-jxxjl                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  default                    clara-dicom-adapter-66cfbf9c57-vjmsj                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  default                    clara-monitor-server-fluentd-elasticsearch-4fk2q        0 (0%)        0 (0%)      0 (0%)           0 (0%)         8m34s
  default                    clara-monitor-server-grafana-5f874b974d-fz5p6           0 (0%)        0 (0%)      0 (0%)           0 (0%)         8m34s
  default                    clara-monitor-server-monitor-server-669c6cb97f-s4w4v    0 (0%)        0 (0%)      0 (0%)           0 (0%)         8m34s
  default                    clara-render-server-clara-renderer-7b676c6b4b-5z6nt     0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  default                    clara-resultsservice-6f9c844db8-tkq9c                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  default                    clara-ui-6f89b97df8-tpx9t                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  default                    clara-workflow-controller-69cbb55fc8-6567n              0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  default                    dp-sample-75ck4                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         6m33s
  default                    elasticsearch-master-0                                  100m (0%)     1 (8%)      2Gi (3%)         2Gi (3%)       8m34s
  default                    elasticsearch-master-1                                  100m (0%)     1 (8%)      2Gi (3%)         2Gi (3%)       8m34s
  kube-system                coredns-5c98db65d4-84j57                                100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     2d6h
  kube-system                coredns-5c98db65d4-tg6n8                                100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     2d6h
  kube-system                etcd-gbae.nvidia.com                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  kube-system                kube-apiserver-gbae.nvidia.com                          250m (2%)     0 (0%)      0 (0%)           0 (0%)         2d6h
  kube-system                kube-controller-manager-gbae.nvidia.com                 200m (1%)     0 (0%)      0 (0%)           0 (0%)         2d6h
  kube-system                kube-flannel-ds-amd64-zlfn6                             100m (0%)     100m (0%)   50Mi (0%)        50Mi (0%)      2d6h
  kube-system                kube-proxy-j5bg4                                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  kube-system                kube-scheduler-gbae.nvidia.com                          100m (0%)     0 (0%)      0 (0%)           0 (0%)         2d6h
  kube-system                nvidia-device-plugin-daemonset-4bzl9                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
  kube-system                tiller-deploy-659c6788f5-f2xlk                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         2d6h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1050m (8%)   2100m (17%)
  memory             4286Mi (6%)  4486Mi (7%)
  ephemeral-storage  0 (0%)       0 (0%)
  nvidia.com/gpu     0            0
Events:              <none>
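
Also, for reference: the liveness probe from the pod description earlier boils down to roughly the following script once unwrapped from the kubectl describe output (same logic and variable names, reformatted for readability):

#!/bin/sh
# Thresholds come from the container environment, with these defaults.
LIVENESS_THRESHOLD_SECONDS=${LIVENESS_THRESHOLD_SECONDS:-300}
STUCK_THRESHOLD_SECONDS=${STUCK_THRESHOLD_SECONDS:-900}

# Fail immediately if the fluentd buffer directory is missing.
if [ ! -e /var/log/fluentd-buffers ]; then
  exit 1
fi

# If no buffer directory changed within STUCK_THRESHOLD_SECONDS,
# assume fluentd is stuck: wipe the buffers and fail the probe.
touch -d "${STUCK_THRESHOLD_SECONDS} seconds ago" /tmp/marker-stuck
if [ -z "$(find /var/log/fluentd-buffers -type d -newer /tmp/marker-stuck -print -quit)" ]; then
  rm -rf /var/log/fluentd-buffers
  exit 1
fi

# If no buffer directory changed within LIVENESS_THRESHOLD_SECONDS,
# fluentd is not making progress; fail the probe.
touch -d "${LIVENESS_THRESHOLD_SECONDS} seconds ago" /tmp/marker-liveness
if [ -z "$(find /var/log/fluentd-buffers -type d -newer /tmp/marker-liveness -print -quit)" ]; then
  exit 1
fi

In other words, the probe fails (and kubelet restarts the container) whenever fluentd stops touching /var/log/fluentd-buffers for longer than the thresholds, which matches the "failed liveness probe, will be restarted" events you shared.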

Here is the node description you requested:

Name:               virtualserver01
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=virtualserver01
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
Annotations:        flannel.alpha.coreos.com/backend-data: {"VtepMAC":""}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 
                    kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Sat, 13 Jun 2020 03:46:33 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  virtualserver01
  AcquireTime:     <unset>
  RenewTime:       Fri, 26 Jun 2020 13:25:32 +0000
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Fri, 26 Jun 2020 13:24:41 +0000   Sat, 13 Jun 2020 03:46:29 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 26 Jun 2020 13:24:41 +0000   Sat, 13 Jun 2020 05:02:54 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 26 Jun 2020 13:24:41 +0000   Sat, 13 Jun 2020 03:46:29 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 26 Jun 2020 13:24:41 +0000   Sat, 13 Jun 2020 03:47:08 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  
  Hostname:    virtualserver01
Capacity:
  cpu:                8
  ephemeral-storage:  25413004Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             61836120Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  23420624448
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             61733720Ki
  nvidia.com/gpu:     1
  pods:               110
System Info:
  Machine ID:                 
  System UUID:                
  Boot ID:                    
  Kernel Version:             4.15.0-88-generic
  OS Image:                   Ubuntu 18.04.4 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.11
  Kubelet Version:            v1.15.4
  Kube-Proxy Version:         v1.15.4
PodCIDR:                      
Non-terminated Pods:          (24 in total)
  Namespace                   Name                                                    CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                    ------------  ----------  ---------------  -------------  ---
  default                     clara-clara-platformapiserver-d4558785-pxp8x            0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d22h
  default                     clara-console-mongodb-85f8bd5f95-lmx4r                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d22h
  default                     clara-dicom-adapter-6fbf684fd4-f4svp                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         4d20h
  default                     clara-monitor-server-fluentd-elasticsearch-t2zd9        0 (0%)        0 (0%)      0 (0%)           0 (0%)         55m
  default                     clara-monitor-server-grafana-5f874b974d-zdfx9           0 (0%)        0 (0%)      0 (0%)           0 (0%)         55m
  default                     clara-monitor-server-monitor-server-668b464cff-np7jj    0 (0%)        0 (0%)      0 (0%)           0 (0%)         55m
  default                     clara-render-server-clara-renderer-5f7549dd66-9ftr6     0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d22h
  default                     clara-resultsservice-54c548c658-t77s6                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d22h
  default                     clara-ui-6f89b97df8-2cphm                               0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d22h
  default                     clara-ux-6f4dc59d4-twmhx                                0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d22h
  default                     clara-workflow-controller-69cbb55fc8-9mfr5              0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d22h
  default                     elasticsearch-master-0                                  100m (1%)     1 (12%)     2Gi (3%)         2Gi (3%)       55m
  default                     elasticsearch-master-1                                  100m (1%)     1 (12%)     2Gi (3%)         2Gi (3%)       55m
  default                     fd7219e9-trtis-clara-pipesvc-fbdb98567-z58x8            0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  kube-system                 coredns-5c98db65d4-b7tsq                                100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     13d
  kube-system                 coredns-5c98db65d4-g6ldm                                100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     13d
  kube-system                 etcd-virtualserver01                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         13d
  kube-system                 kube-apiserver-virtualserver01                          250m (3%)     0 (0%)      0 (0%)           0 (0%)         13d
  kube-system                 kube-controller-manager-virtualserver01                 200m (2%)     0 (0%)      0 (0%)           0 (0%)         13d
  kube-system                 kube-flannel-ds-amd64-nhqcx                             100m (1%)     100m (1%)   50Mi (0%)        50Mi (0%)      13d
  kube-system                 kube-proxy-lgnx4                                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         13d
  kube-system                 kube-scheduler-virtualserver01                          100m (1%)     0 (0%)      0 (0%)           0 (0%)         13d
  kube-system                 nvidia-device-plugin-daemonset-6pxh8                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         13d
  kube-system                 tiller-deploy-7bf78cdbf7-8ld49                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         13d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1050m (13%)  2100m (26%)
  memory             4286Mi (7%)  4486Mi (7%)
  ephemeral-storage  0 (0%)       0 (0%)
  hugepages-1Gi      0 (0%)       0 (0%)
  hugepages-2Mi      0 (0%)       0 (0%)
  nvidia.com/gpu     0            0
Events:              <none>

Thanks @mastreips for sharing the information and sorry for late reply!

The overall status (capacity) of your system looks good, and so far I couldn't find anything suspicious in the information you shared.

For reference, here are the two Elasticsearch instances it runs:

  default                     elasticsearch-master-0                                  100m (1%)     1 (12%)     2Gi (3%)         2Gi (3%)       55m
  default                     elasticsearch-master-1                                  100m (1%)     1 (12%)     2Gi (3%)         2Gi (3%)       55m

The Clara monitor server is deployed from publicly available Helm charts; the failing clara-monitor-server-fluentd-elasticsearch container uses the https://hub.helm.sh/charts/kiwigrid/fluentd-elasticsearch/5.0.0 chart.

It looks like the issue is in fluentd-elasticsearch itself; searching for "fluentd-elasticsearch failed liveness probe, will be restarted" turns up many reports of fluentd stability problems.
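
One workaround that may be worth trying (untested with Clara; the values here are just examples): since the probe script reads its thresholds from the LIVENESS_THRESHOLD_SECONDS and STUCK_THRESHOLD_SECONDS environment variables, you could relax them on the DaemonSet:

$ kubectl set env daemonset/clara-monitor-server-fluentd-elasticsearch \
    LIVENESS_THRESHOLD_SECONDS=600 STUCK_THRESHOLD_SECONDS=1800

Note that this only makes the probe more tolerant; it does not fix whatever is making fluentd stall.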

Can I ask whether you actually need the monitoring server?
We are going to make the monitor server optional in the next release because it consumes significant resources (it creates two Elasticsearch instances).

Please do not start the monitor service if you don't need it. You can stop it by executing clara monitor stop.
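
After stopping it, you can confirm the monitor pods are gone with the same command as before:

$ clara monitor stop
$ kubectl get pods | grep clara-monitor    # should print nothing once the pods have terminated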

Sorry I couldn't help more; please let us know if there is anything else I can do.

Thanks. Yes, I stopped the monitoring service completely, as I do not have a need for it.