Cloud outage June 10th

Summary

On June 10th, 2024, a service degradation occurred on Passbolt Cloud between 4:00 PM and 5:15 PM UTC. This incident resulted in some users experiencing 404 errors and limited or no access to their organization workspace. This incident was triggered by a release performed on the same day that contained some performance improvements that proved to be problematic in conjunction with the underlying auto-scaling service configuration.

Impact

We understand that this service degradation may have disrupted your daily operations, and we sincerely apologize for the inconvenience caused.

Root Cause Analysis

A thorough investigation identified a combination of factors contributing to the incident:

Change of Resource Usage: The deployed applications were experiencing a change in usual resource consumption.
Aggressive Scaling Policy: The recently implemented scaling policy was overly aggressive in resource deallocation.
Cluster Topology: The specific configuration of the underlying cluster exacerbated the resource constraints.

These factors, when combined, triggered a cascading series of failures across various system components, leading to the service degradation.

Resolution

The engineering team promptly rolled back the new release and restarted the underlying services to restore service stability.

Next steps

We are committed to providing a reliable and high-performing cloud service. We will continue to monitor system performance closely and implement new preventive measures.

In that purpose, the following internal actions will take place:

Scaling Policy Review: Our team is actively reviewing and adjusting the scaling policy to ensure optimal resource allocation and prevent future resource contention.
Application Resource Optimization: We are continuing to investigate opportunities to optimize resource utilization within our applications.
Cluster Configuration Analysis: The underlying cluster configuration is being evaluated to identify any potential optimizations that could mitigate similar incidents in the future.

Thank you for your patience and understanding.

Timeline

June 10th, 2024:

15:30 CEST: Deployment of a new release that includes performance improvements specific to the cloud.
16:05 CEST: Service degradation began.
16:10 CEST: Service degradation reported by customers and monitoring tools.
16:20 CEST: Incident response team mobilized
16:30 CEST: Incident response team starts rolling back the release
16:45 CEST: Rollback of the release completed
17:10 CEST: Public communication about the incident on public channels
17:15 CEST: Remaining affected components are restored in working state
17:20 CEST: Service is back to nominal performance.
17:40 CEST: End of incident communicated on public channels.

June 11th, 2024:

18:00 CEST: Incident response team investigation ends

June 12th, 2024

12:00 CEST: Incident report published on the website.

Current status:

1. Acknowledge issue with reporter
2. Get a fix/patch prepared
3. Prepare a report about the issue
4. Feature the incident report