We are currently investigating an issue with the Gluster storage service.
Update 18:39: We have identified the issue and are currently restoring the service.
Update 19:13: The service is completely restored. We are monitoring the situation.
Update 2019-01-28: A more detailed post-mortem is now available:
On Monday, Jan. 28, the Gluster storage service suffered an outage. This led to some pods running on APPUiO Public which used Gluster Persistent Volumes (PVs) failing and becoming unavailable.
All times are in CET and approximate.
As described in the timeline, we were not able to access the affected systems during the outage, which greatly limited our ability to do a detailed analysis. However, we identified the following issue:
About 1-2h before the downtime, new Gluster Volumes were provisioned. This is a routine operation that is performed many times each week across our clusters. Adding new volumes is known to cause an increased load on the Gluster servers.
Last week, one of the core components on our Gluster servers (journald) was updated to mitigate a security issue. Unfortunately, the fixed package provided by Red Hat caused a memory leak in said component.
From analyzing our monitoring data, we can clearly see that memory usage on the affected systems rapidly increased after adding the new volumes. We assume that the addition of the new volumes triggered the memory leak in journald. This in turn led to the systems becoming completely unresponsive.
Since the journald memory leak affects many of our systems, we were already working on a mitigation. We are now working on a workaround until a fix is available. We also contacted Red Hat to get a timeline for a permanent fix and to get their support in finding a workaround.
We have amended our operational procedures around provisioning new Gluster Volumes to ensure that any issues caused by provisioning are identified earlier. We are also closely monitoring the situation at the moment.
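As an illustration only, the following is a minimal sketch of how rapid memory growth after a provisioning run could be flagged early. It assumes memory samples for journald are already collected by some monitoring system. The function names, the sample format and the threshold are hypothetical and are not taken from the actual procedures or tooling.

```python
from typing import List, Tuple

# Hypothetical sample format: (unix timestamp in seconds, resident memory in bytes)
Sample = Tuple[float, int]


def growth_rate_bytes_per_sec(samples: List[Sample]) -> float:
    """Return the average memory growth between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0:
        # Samples are not in increasing time order; no meaningful rate.
        return 0.0
    return (m1 - m0) / (t1 - t0)


def grows_too_fast(samples: List[Sample], max_rate_bytes_per_sec: float) -> bool:
    """Flag a process whose memory grows faster than the given (assumed) threshold."""
    return growth_rate_bytes_per_sec(samples) > max_rate_bytes_per_sec


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Made-up example values: memory rises from 100 MiB to 700 MiB within 10 minutes.
    samples = [(0.0, 100 * MIB), (600.0, 700 * MIB)]
    print(grows_too_fast(samples, max_rate_bytes_per_sec=0.5 * MIB))
```

Comparing only the first and last sample keeps the check simple. A real setup would more likely rely on the alerting rules of the existing monitoring system.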
In the long run, we are looking for an alternative solution for providing storage to APPUiO users. We are already working with our infrastructure partners to provide native block storage with future versions of the APPUiO Public platform. This will not only reduce the operational risks related to Gluster storage, but also provide much better performance for users.