Post Incident Review (PIR) – Cosmos DB – Impacted multiple services in West Europe
What happened?
Between 19:03 UTC and 22:08 UTC on 10 February 2025, a Cosmos DB scale unit hosting Cosmos DB containers in the West Europe region experienced failures and was unable to respond to customer requests. Cosmos DB is a distributed database management system: data for a container is sharded and stored across multiple sets of machines, based on the partition key values of the items stored in the container. In this case, one of these sets of machines became unavailable, leading to unavailability for a subset of containers in the region.
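The sharding model described above can be illustrated with a minimal sketch. This is not the Cosmos DB implementation — the hash function, set count, and helper names are illustrative assumptions — but it shows why losing one set of machines affects only the subset of data whose partition keys map to it:

```python
# Hypothetical sketch of hash-based partitioning (not the actual Cosmos DB
# algorithm): each item is assigned to one partition set by its partition
# key, so the loss of a single set affects only the keys mapped to it.
import hashlib

NUM_PARTITION_SETS = 4  # illustrative; real scale units differ

def partition_for(partition_key: str) -> int:
    """Deterministically map a partition key to a physical partition set."""
    digest = hashlib.sha256(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITION_SETS

def is_key_available(partition_key: str, down_sets: set[int]) -> bool:
    """Data for a key is unavailable only if its partition set is down."""
    return partition_for(partition_key) not in down_sets
```

Under this model, requests for keys on healthy partition sets continue to succeed while requests for keys on the failed set do not — matching the partial unavailability described above.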
Cosmos DB accounts would have been impacted in different ways, depending on their configuration:
Other downstream services with dependencies on Cosmos DB were also impacted; these included:
What went wrong and why?
While performing platform maintenance, a set of metadata nodes was taken down for updates. Metadata nodes are system nodes that keep the scale unit in a healthy condition, and this type of maintenance has taken place regularly and reliably for many years to deliver security, performance, and availability improvements. During this time, an additional set of metadata nodes experienced an unexpected failure, so the total number of down metadata nodes exceeded the maximum allowed to maintain scale unit integrity, leaving the scale unit without the minimum number of metadata nodes it needs to remain functional. Ordinarily this transient state would not lead to failures, as the system is designed to tolerate such losses, but some of the failed nodes became stuck in their boot-up sequence and had to be restarted manually to re-establish the required number of metadata nodes. We determined that there was insufficient buffer in the number of metadata nodes left running during maintenance to absorb the additional losses. Had either the buffer of metadata nodes been larger, or had the failed nodes been able to self-recover, the scale unit would not have entered a failed state.
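The buffer condition described above reduces to simple arithmetic. The node counts and threshold below are illustrative assumptions, not figures from the incident, but they show how taking too many metadata nodes down for maintenance leaves no headroom for unexpected failures:

```python
# Illustrative sketch (node counts and thresholds are assumptions, not
# taken from the incident): a scale unit stays healthy only while the
# number of running metadata nodes meets the required minimum.
def scale_unit_healthy(total_nodes: int,
                       min_required: int,
                       down_for_maintenance: int,
                       unexpected_failures: int) -> bool:
    """The unit enters a failed state once running nodes drop below the minimum."""
    running = total_nodes - down_for_maintenance - unexpected_failures
    return running >= min_required
```

For example, with 7 metadata nodes and a minimum of 5, taking 2 down for maintenance leaves zero buffer: any single unexpected failure breaches the threshold, whereas taking only 1 down would have tolerated it.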
How did we respond?
Upon detection, Cosmos DB engineers determined that a set of impacted nodes had caused a single scale unit to become unavailable. Engineers found that a subset of machines in the scale unit had entered a stuck state and required a manual reboot to recover. Once manually rebooted, the nodes recovered and availability was restored at 20:03 UTC. A small subset of those machines required additional steps to recover, leading to the longer recovery time (until 22:08 UTC) for a small set of containers.
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/CSDL-1X8
Previous communications contained an inaccurate and incomplete list of impacted services. We have corrected the errors and updated the communication below. We apologize for any confusion this may have caused.
What happened?
Between 19:09 UTC and 22:08 UTC on 10 February 2025, a platform issue with Cosmos DB caused degraded service availability for subsets of the following services in the West Europe region:
What do we know so far?
We have identified a group of nodes in the region that became unhealthy, leading to the cluster served by those nodes becoming unavailable. This affected instances of Cosmos DB, on which the affected services rely to process requests. Because those requests could not be processed, subsets of those services became unavailable.
How did we respond?
What happens next?
Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers. To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts. For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs.

The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring. Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.
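The monitoring guidance above — measuring impact as seen by your own workload rather than relying on region-wide impact windows — can be sketched client-side. This is a hypothetical illustration, not an Azure API: in practice each recorded sample would be the outcome of a real request or health probe against your resources:

```python
# Hypothetical client-side availability monitor: tracks the success ratio
# of the most recent request outcomes to measure customer-specific impact.
from collections import deque

class AvailabilityMonitor:
    """Rolling availability over the last `window` request outcomes."""

    def __init__(self, window: int = 100):
        self.samples: deque = deque(maxlen=window)

    def record(self, success: bool) -> None:
        """Record one request outcome (True = succeeded)."""
        self.samples.append(success)

    def availability(self) -> float:
        """Fraction of successful requests in the window (1.0 if no data)."""
        if not self.samples:
            return 1.0
        return sum(self.samples) / len(self.samples)
```

Alerting on this ratio dropping below a threshold surfaces the actual start and end of impact for your specific resources, which may be narrower than the full incident duration reported here.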
What happened?
Between 19:09 UTC and 22:08 UTC on 10 February 2025, a platform issue resulted in an impact to the following services in West Europe:
This issue is now mitigated. An update with more information will be provided shortly.
Impact Statement: Starting at 19:09 UTC on 10 February 2025, customers in West Europe may experience degradation in service availability for these affected services:
Current Status: We are aware of this issue and are actively investigating potential contributing factors. The next update will be provided within 60 minutes, or as events warrant.