Post Incident Review (PIR) – Cosmos DB – Impacted multiple services in West Europe
What happened?
Between 19:03 UTC and 22:08 UTC on 10 February 2025, a Cosmos DB scale unit hosting Cosmos DB containers in the West Europe region experienced failures and was unable to respond to customer requests. Cosmos DB is a distributed database management system: data for a container is sharded and stored across multiple sets of machines, based on the partition key values of the items stored in the container. In this case, one of these sets of machines became unavailable, leading to unavailability for a subset of containers in the region.
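The sharding model described above can be illustrated with a minimal sketch. This is not the Cosmos DB implementation — the hash function, set count, and helper names are illustrative assumptions — but it shows why losing one set of machines affects only the subset of data whose partition keys map to it:

```python
# Hypothetical sketch of hash-based partitioning (not the actual Cosmos DB
# algorithm): each item is assigned to one partition set by its partition
# key, so the loss of a single set affects only the keys mapped to it.
import hashlib

NUM_PARTITION_SETS = 4  # illustrative; real scale units differ

def partition_for(partition_key: str) -> int:
    """Deterministically map a partition key to a physical partition set."""
    digest = hashlib.sha256(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITION_SETS

def is_key_available(partition_key: str, down_sets: set[int]) -> bool:
    """Data for a key is unavailable only if its partition set is down."""
    return partition_for(partition_key) not in down_sets
```

Under this model, requests for keys on healthy partition sets continue to succeed while requests for keys on the failed set do not — matching the partial unavailability described above.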
Cosmos DB accounts would have been impacted in different ways, depending on their configuration:
Other downstream services with dependencies on Cosmos DB were also impacted; these included:
What went wrong and why?
While performing platform maintenance, a set of metadata nodes was taken down for updates. Metadata nodes are system nodes that keep the scale unit in a healthy condition, and this type of maintenance has taken place regularly and reliably for many years to deliver security, performance, and availability improvements. During this time, an additional set of metadata nodes experienced an unexpected failure, so the total number of down metadata nodes exceeded the maximum allowed to maintain scale unit integrity, leaving the scale unit without the minimum number of metadata nodes it needs to remain functional. Ordinarily this transient state would not lead to failures, as the system is designed to tolerate such losses, but some of the failed nodes became stuck in their boot-up sequence and had to be restarted manually to re-establish the required number of metadata nodes. We determined that there was insufficient buffer in the number of metadata nodes left running during maintenance to absorb the additional losses. Had either the buffer of metadata nodes been larger, or had the failed nodes been able to self-recover, the scale unit would not have entered a failed state.
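The buffer condition described above reduces to simple arithmetic. The node counts and threshold below are illustrative assumptions, not figures from the incident, but they show how taking too many metadata nodes down for maintenance leaves no headroom for unexpected failures:

```python
# Illustrative sketch (node counts and thresholds are assumptions, not
# taken from the incident): a scale unit stays healthy only while the
# number of running metadata nodes meets the required minimum.
def scale_unit_healthy(total_nodes: int,
                       min_required: int,
                       down_for_maintenance: int,
                       unexpected_failures: int) -> bool:
    """The unit enters a failed state once running nodes drop below the minimum."""
    running = total_nodes - down_for_maintenance - unexpected_failures
    return running >= min_required
```

For example, with 7 metadata nodes and a minimum of 5, taking 2 down for maintenance leaves zero buffer: any single unexpected failure breaches the threshold, whereas taking only 1 down would have tolerated it.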
How did we respond?
Upon detection, Cosmos DB engineers determined that a set of impacted nodes had caused a single scale unit to become unavailable. Engineers found that a subset of machines in the scale unit had entered a stuck state and required a manual reboot to recover. Once manually rebooted, the nodes recovered and availability was restored at 20:03 UTC. A small subset of those machines required additional steps to recover, leading to the longer recovery time (until 22:08 UTC) for a small set of containers.
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/CSDL-1X8
Previous communications contained an inaccurate and incomplete list of impacted services. We have corrected the errors and updated the communication below. We apologize for any confusion this may have caused.
What happened?
Between 19:09 UTC and 22:08 UTC on 10 February 2025, a platform issue with Cosmos DB caused degraded service availability for subsets of the following services in the West Europe region:
What do we know so far?
We have identified a group of nodes in the region that became unhealthy, leading to the cluster served by those nodes becoming unavailable. This affected instances of Cosmos DB, on which the affected services rely to process requests. Because those requests could not be processed, subsets of those services became unavailable.
How did we respond?
What happens next?
Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers. To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts. For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs.

The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring. Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.
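The monitoring guidance above — measuring impact as seen by your own workload rather than relying on region-wide impact windows — can be sketched client-side. This is a hypothetical illustration, not an Azure API: in practice each recorded sample would be the outcome of a real request or health probe against your resources:

```python
# Hypothetical client-side availability monitor: tracks the success ratio
# of the most recent request outcomes to measure customer-specific impact.
from collections import deque

class AvailabilityMonitor:
    """Rolling availability over the last `window` request outcomes."""

    def __init__(self, window: int = 100):
        self.samples: deque = deque(maxlen=window)

    def record(self, success: bool) -> None:
        """Record one request outcome (True = succeeded)."""
        self.samples.append(success)

    def availability(self) -> float:
        """Fraction of successful requests in the window (1.0 if no data)."""
        if not self.samples:
            return 1.0
        return sum(self.samples) / len(self.samples)
```

Alerting on this ratio dropping below a threshold surfaces the actual start and end of impact for your specific resources, which may be narrower than the full incident duration reported here.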
What happened?
Between 19:09 UTC and 22:08 UTC on 10 February 2025, a platform issue resulted in an impact to the following services in West Europe:
This issue is now mitigated. An update with more information will be provided shortly.
Impact Statement: Starting at 19:09 UTC on 10 February 2025, customers in West Europe may experience degradation in service availability for these affected services:
Current Status: We are aware of this issue and are actively investigating potential contributing factors. The next update will be provided within 60 minutes, or as events warrant.