Database failure

Updates

Overview

On October 14, 2024, Glide experienced a significant service disruption that affected multiple features including Glide Big Tables, SQL data sources, BigQuery integrations, external data source syncs, Workflows, and certain integrations.

The incident began at 02:06 UTC on October 14 and was fully resolved by 12:40 UTC on October 14, lasting 10 hours, 34 minutes in total.
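
As a quick check, the stated duration follows directly from the two timestamps above. The sketch below is illustrative only; the variable names are ours and are not part of any Glide tooling.

```python
from datetime import datetime, timezone

# Start and end times as stated in this report (UTC).
start = datetime(2024, 10, 14, 2, 6, tzinfo=timezone.utc)
end = datetime(2024, 10, 14, 12, 40, tzinfo=timezone.utc)

duration = end - start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hours, {minutes} minutes")  # prints "10 hours, 34 minutes"
```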

We recognize that this was a severe disruption to your business and we sincerely apologize for its impact.

What Happened

During a planned infrastructure upgrade aimed at improving our service capacity and reliability, we encountered unexpected issues with one of the databases being migrated. The general sequence of events is as follows:

At 02:06 UTC, we initiated the migration of multiple Postgres High Availability (HA) clusters to a new set of dedicated compute resources (a new “node pool”). This was part of our ongoing efforts to increase the performance and scalability of the data layer of the Glide platform.
While most databases restarted successfully, the app-data database (which is primarily used for syncing with external data sources) began experiencing Out of Memory (OOM) errors. This was unexpected as there hadn’t been any specific changes to the database itself.
During one of these OOM-induced restarts, some Write-Ahead Log (WAL) entries were not written to disk. In PostgreSQL, WAL is crucial for maintaining data integrity and enabling point-in-time recovery.
The loss of these WAL entries prevented us from starting the standby replica, which is essential for our HA setup. Our cluster is configured to require a synchronous replica, meaning no transactions could commit on the primary database until the replica was restored. Basically, the failure of one of the database replicas to migrate to the new node pool brought the whole database down.
Multiple attempts to reinitialize the replicas failed due to the missing WAL entries, extending the duration of the outage.
After several unsuccessful attempts to restore the replicas, we ultimately had to take a new base backup of the primary database and restore it to the replicas. This process, while time-consuming, eventually allowed us to bring the entire cluster back online with no known or reported loss of data.
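
To illustrate why a required synchronous replica can take the whole database down, here is a minimal toy model in Python. It is a sketch under stated assumptions, not our actual configuration or PostgreSQL's implementation: the `Primary` and `Standby` classes and the `require_sync_replica` flag are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Standby:
    """Hypothetical stand-in for a standby replica."""
    healthy: bool = True
    received: list = field(default_factory=list)

    def replicate(self, record: str) -> bool:
        # Acknowledge a record only when the standby is up.
        if not self.healthy:
            return False
        self.received.append(record)
        return True


@dataclass
class Primary:
    """Hypothetical stand-in for a primary that requires a synchronous replica."""
    standby: Standby
    require_sync_replica: bool = True
    committed: list = field(default_factory=list)

    def commit(self, record: str) -> bool:
        acked = self.standby.replicate(record)
        if self.require_sync_replica and not acked:
            # A real database would typically block here until the standby
            # acknowledges; this toy model simply refuses the commit.
            return False
        self.committed.append(record)
        return True


standby = Standby()
primary = Primary(standby)

ok_before = primary.commit("tx-1")  # standby healthy: commit succeeds
standby.healthy = False             # standby cannot start (e.g. missing WAL)
ok_during = primary.commit("tx-2")  # no acknowledgement: commit refused
standby.healthy = True              # standby rebuilt from a new base backup
ok_after = primary.commit("tx-3")   # commits resume

print(ok_before, ok_during, ok_after)  # prints "True False True"
```

The trade-off shown here is deliberate in such setups: requiring an acknowledgement from the replica protects against losing committed data if the primary fails, at the cost of write availability whenever the replica is down.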

The technical nature of this issue, combined with gaps in our monitoring and alerting systems, led to a delay in fully understanding and communicating the extent of the service impact.

This, in turn, delayed our public-facing communication and diminished our customers’ trust in the Glide platform.

Root Causes

In summary, the root causes we identified in our review include:

Not fully understanding the resource constraints and limits of our new database infrastructure, or properly preparing for such a high-risk migration.
Not having a sufficient level of user impact monitoring and alerting.
Not having sufficiently robust incident response and communication protocols.

What We’re Doing to Prevent This from Happening Again

We will be actively pursuing the following corrective measures as a result of this incident:

Implementing more rigorous planning and team awareness for high-risk systems work.
Improving our monitoring and alerting to more clearly identify the extent and severity of any serious degradations to the user experience so we can more proactively communicate system status to our customers.
Revamping our incident response protocol to include more clearly defined roles and to better account for public awareness and progress reporting of any incident.

Conclusion

We understand the critical role Glide plays in your projects and businesses. This incident fell short of the high standards of reliability and transparency that we set for ourselves. We are committed to learning from this experience and implementing the necessary changes to prevent similar issues in the future.

As a fast-growing company (thank you!), we are constantly working to upgrade important aspects of our platform for increased functionality, faster performance, and higher levels of stability. This is inherently risky work, but we will make these changes to reduce the likelihood and severity of incidents like this one. We can both evolve the Glide platform AND provide an excellent level of uptime!

We appreciate your patience and understanding during this incident. If you have any questions or need further clarification, please don’t hesitate to reach out to the team.

Thank you for your continued trust in Glide.

October 17, 2024 · 15:47 UTC

Update

We have completed our maintenance reversion and all systems remain operational.

We will continue to post updates here as we finalize these infrastructure upgrades, which we plan to detail tomorrow (Eastern Time).

No further customer impact is expected, but we will be proactive with these informational updates.

October 15, 2024 · 01:10 UTC

Update

We will be performing database maintenance in the next 15 minutes to revert some of the changes we attempted in the previous 24 hours. This maintenance is NOT expected to result in any downtime or noticeable impact on customers and, for now, is purely informational in nature.

We will post an additional update when this maintenance has concluded.

October 15, 2024 · 00:52 UTC

Update

From late last night, Sunday, October 13th, into the morning of Monday, October 14th (Eastern Time), we attempted to move a cluster of databases onto new infrastructure as part of an ongoing effort to expand the capacity and reliability of our data layer.

For several months now we have been investing in our underlying data layer so we can serve the increasing needs of our customers. As part of that effort, we planned to move our various databases off of a general pool of compute resources to a dedicated pool so that we could continue to scale that layer of our infrastructure without being constrained by the needs of other parts of our system.

This was a move that had been in planning for several weeks and had been successfully executed in our various staging environments.
Unfortunately, as part of this particular change, we ran into issues with the new cluster of one of our primary databases running out of memory, and then failing to reach a healthy state due to missing replication data. It then took us several hours to restore the new cluster and bring its data back up-to-date.

During this time, almost all application functionality was either broken or severely degraded and we did not properly communicate our status, leaving many of our customers in the dark during a critical time.
We are still performing an internal investigation into the specific nature of this incident and the resulting downtime, and we’ll be following up here with our full postmortem. However, it is quite clear that our response to the issue was insufficient.

Incidents such as these are also very complex, having technical, human, and process-related origins. This incident exposed the flaws in our incident response process and external communication. We will be evaluating our approach and making changes as a result, and will have more details as part of the full postmortem.
We realize that Glide is central to many of your businesses, and we apologize for the impact this incident has had. We strive to do better, and will.

October 14, 2024 · 15:28 UTC

Resolved

This incident is now resolved.

October 14, 2024 · 12:53 UTC

Monitoring

We believe we have addressed the root cause. We are currently monitoring for the next 15 minutes to confirm Glide is healthy.

October 14, 2024 · 12:40 UTC

Update

We are continuing our mitigation efforts. We have identified that some integration runs, such as AI-based integrations, may be impacted.

October 14, 2024 · 12:14 UTC

Escalate

Glide Big Tables and SQL data sources are also failing to load. Glide has identified the problem and is working on mitigation.

October 14, 2024 · 11:35 UTC

Issue

Glide is experiencing a large-scale database failure affecting Glide Big Tables, SQL data sources, Automations, and some integration runs.

October 14, 2024 · 08:03 UTC
