On Friday 9th of November 2018 6.00 AM, a routinary update operation of a minor version of our Kubernetes infraestructure reported unexpected errors leading to passbolt.com and demo.passbolt.com being unaccessible. The cluster update process is automated as part of Google Kubernetes Engine service that we use. Our team contacted Google support team to fix the issue without success leading us to fail over a completely new cluster. At 11.27 AM passbolt.com comes back online.
What is the current status?
We are still working actively with Google Engineers to identify the root cause of the problem and will publish updates as soon as we have them.
Why was this happening?
It is still uncertain what happened. There are a few hypothesis on several kubernetes components we use were conflicting and preventing the cluster to update and locking it on a "repair" mode endlessly. GKE and GCP were experiencing service disruptions that could be related with our affected resources. Again we will post more information as we know it.
What does it mean for passbolt security?
Essentially nothing as it was an update problem.
What are we doing to prevent this to happen again?
We are improving our disaster recovery processes and architecture in order to reduce the downtime in events like these.
1. Identify root cause of the update problem
2. Improve DR processes and architecture to make it more robust
- 2018-11-09 6:00 AM CET: Unusual number of alerts
- 2018-11-09 6:08 AM CET: Pingdom services reporting sites down
- 2018-11-09 6:08 AM CET: On call engineer checks that the GKE update process is ongoing and waits for a few minutes until situation is stabilized
- 2018-11-09 6:10 AM CET: After a few a minutes on call engineer checks unusual spike of containers restarting and crashinglooping
- 2018-11-09 6.30 AM CET: the problem is escalated to SRE team
- 2018-11-09 7:30 AM CET: First strategy is to try to fix the current cluster
- 2018-11-09 8:00 AM CET: Start called google support engineers
- 2018-11-09 8:45 AM CET: Started failover process to a new cluster
- 2018-11-09 11:27 AM CET: Services back online