28th January 2019

Persistent Storage Service: Partial storage outage

We are currently investigating an issue with the Gluster storage service.

Update 18:39: We have identified the issue and are currently restoring the service.

Update 19:13: The service is completely restored. We are continuing to monitor the situation.

Update 2019-01-28: A more detailed post-mortem is now available:

Gluster Storage service outage 2019-01-28

On Monday, Jan. 28, the Gluster storage service suffered an outage. This led to some pods running on APPUiO Public that use Gluster Persistent Volumes (PVs) failing and becoming unavailable.

Timeline of events

All times are in CET and approximate.

  • 18:10 - The first Gluster monitoring checks timed out and the first apps went offline. We started investigating the issue.
  • 18:15 - At this point, pods failed to restart because they could not mount their storage volumes.
  • 18:20 - Two out of three Gluster servers were unreachable. Attempts to access their consoles directly also failed.
  • 18:30 - We decided to reset one of the offline Gluster servers. The server came back online and started recovering. We then reset the second failed Gluster server, which came back a few minutes later.
  • 18:50 - Both Gluster servers were back online.
  • 18:50 - Under the load of the recovery operation, the third server also became unresponsive.
  • 19:00 - Since the unresponsive third server now blocked cluster operations, we decided to reset it as well.
  • 19:10 - The third server came back online and the storage cluster started recovering. At that point, volumes could be mounted again.
  • 19:20 - Most volumes recovered on their own; the remaining ones were taken care of manually.
  • 19:30 - The Gluster storage service was completely restored.

Root cause

As described in the timeline, we were not able to access the affected systems during the outage, which greatly limited our ability to do a detailed analysis. However, we identified the following issue:

About one to two hours before the downtime, new Gluster volumes were provisioned. This is a routine operation, performed many times each week across our clusters. Adding new volumes is known to cause increased load on the Gluster servers.

Last week, one of the core components on our Gluster servers (journald) was updated to mitigate a security issue. Unfortunately, the fixed package provided by Red Hat introduced a memory leak in this component.

Analysis of our monitoring data clearly shows that memory usage on the affected systems increased rapidly after the new volumes were added. We assume that adding the new volumes triggered the memory leak in journald, which in turn led to the systems becoming completely unresponsive.

Next steps

Since the journald memory leak affects many of our systems, we had already been working on a mitigation and are now putting a workaround in place until a fix is available. We have also contacted Red Hat to get a timeline for a permanent fix and to support us in finding a workaround.
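
To illustrate what such a workaround can look like, the following is a minimal sketch of a watchdog that restarts journald before it exhausts a server's memory. The threshold and check interval are placeholder values, and this is a simplified illustration rather than the exact mitigation we run in production.

    #!/usr/bin/env python3
    """Illustrative journald watchdog (simplified sketch, not the exact
    mitigation in production). Restarts systemd-journald once its resident
    memory grows past a placeholder threshold."""

    import subprocess
    import time

    RSS_LIMIT_KB = 512 * 1024   # placeholder: restart above ~512 MiB RSS
    CHECK_INTERVAL_S = 60       # placeholder: check once per minute

    def journald_rss_kb() -> int:
        """Return the resident set size of systemd-journald in kB, or 0 if unknown."""
        out = subprocess.check_output(
            ["systemctl", "show", "--property=MainPID", "systemd-journald"],
            text=True,
        )
        pid = int(out.strip().split("=", 1)[1])
        if pid == 0:
            return 0
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])  # /proc reports VmRSS in kB
        return 0

    def main() -> None:
        while True:
            if journald_rss_kb() > RSS_LIMIT_KB:
                # Restarting journald releases the leaked memory; logging
                # resumes through its socket-activated units.
                subprocess.check_call(["systemctl", "restart", "systemd-journald"])
            time.sleep(CHECK_INTERVAL_S)

    if __name__ == "__main__":
        main()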

We have amended our operational procedures around provisioning new Gluster volumes to ensure that issues caused by it are identified earlier, and we are closely monitoring the situation. A simplified example of such a pre-provisioning check is sketched below.
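
The sketch below shows the kind of check we mean: before new volumes are provisioned, each Gluster server is verified to have enough memory headroom. The threshold is a placeholder and the actual procedure involves more than this single check.

    #!/usr/bin/env python3
    """Illustrative pre-provisioning check (simplified sketch with a
    placeholder threshold, not our exact procedure). Run on a Gluster server
    before new volumes are added: abort if memory headroom is too small."""

    import sys

    MIN_AVAILABLE_KB = 2 * 1024 * 1024  # placeholder: require ~2 GiB available

    def mem_available_kb() -> int:
        """Read MemAvailable from /proc/meminfo (reported in kB)."""
        with open("/proc/meminfo") as meminfo:
            for line in meminfo:
                if line.startswith("MemAvailable:"):
                    return int(line.split()[1])
        raise RuntimeError("MemAvailable not found in /proc/meminfo")

    def main() -> None:
        available = mem_available_kb()
        if available < MIN_AVAILABLE_KB:
            print(f"Only {available} kB available; postpone volume provisioning.",
                  file=sys.stderr)
            sys.exit(1)
        print(f"{available} kB available; safe to provision new volumes.")

    if __name__ == "__main__":
        main()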

Outlook

In the long run, we are looking for an alternative solution for providing storage to APPUiO users. We are already working with our infrastructure partners to provide native block storage with future versions of the APPUiO Public platform. This will not only reduce the operational risks associated with Gluster storage, but also provide much better performance for users.