12th July 2017

Overall Cluster State: Node failure causes Pod downtime

A node in the APPUiO production cluster suffered a software failure in its Docker daemon. We cordoned the node to prevent further pods from being scheduled on it and evacuated all pods. Due to the nature of the failure, not all pods could be terminated cleanly.

Update July 12, 2017 10:08 CEST: The failing node has been rebooted and pods are recovering. Some need their images downloaded again (e.g. because of imagePullPolicy: Always). According to our monitoring system, the cluster is healthy again.

Update July 12, 2017 10:17 CEST: We consider this incident to be resolved.