What happened?
Between 08:51 and 10:15 UTC on 01 April 2025, we identified customer impact resulting from a power event in the North Europe region which impacted Microsoft Entra ID, Virtual Machines, Virtual Machine Scale Sets, Storage, Azure Cosmos DB, Azure Database for PostgreSQL flexible servers, Azure ExpressRoute, Azure Site Recovery, Service Bus, Azure Cache for Redis, Azure SQL Database, Application Gateway, and Azure NetApp Files. We can confirm that all affected services have now recovered.
What do we know so far?
During a power maintenance event, a failure on a UPS system led to temporary power loss in a single datacenter in Physical Availability Zone 2 in the North Europe region, affecting multiple devices. Power has now been fully restored and all affected services have recovered.
How did we respond?
What happens next?
Summary of Impact: Between 08:51 and 10:15 UTC on 01 April 2025, we identified customer impact resulting from a power event in the North Europe region which impacted Virtual Machines, Storage, CosmosDB, PostgreSQL, Azure ExpressRoute and Azure NetApp Files. We can confirm that all affected services have now recovered.
A power maintenance event led to temporary power loss in a single datacenter, in Physical Availability Zone 2, in the North Europe region affecting multiple racks and devices. The power has been fully restored and services are seeing full recovery.
An update with additional information will be provided shortly.
Impact Statement: Starting at approximately 08:51 UTC on 01 April 2025, we received an alert of an issue impacting multiple Azure services across the North Europe region.
Current Status: All relevant teams are currently looking into this alert and are actively working on identifying any workstreams needed to mitigate all customer impact. The next update will be provided within 60 minutes, or as events warrant.
What happened?
Between 13:00 UTC on 10 March 2025 and 00:25 UTC on 18 March 2025, a platform issue resulted in an impact to the Application Insights service in the West Europe region. Customers may have experienced intermittent data gaps on custom metrics data and/or incorrect alert activation.
What went wrong and why?
Application Insights Ingestion is the service that handles ingesting and routing of Application Insights data from customers. One of its internal components is a cache that stores information about the customer's Application Insights resource configuration. This cache is deployed at the region level, so it is shared by multiple clusters in a region. When a deployment is done, some regions deploy to one cluster, then delay until the next business day before deploying to the remaining clusters. Feature work was underway that involved adding a new flag to the Application Insights resource configuration stored in the cache. The flag was supposed to default to true, in which case it wouldn't impact the behavior of Application Insights Ingestion. However, if the flag was set to false, it would stop the sending of custom metrics data to the Log Analytics workspace.
A recent incident in a separate cloud was caused by this flag becoming incorrectly set to false. As a response to this, it was decided the flag should be flipped to represent the opposite - so that defaulting to "false" would result in no-op behavior instead. As part of this, the original flag was removed from the contract used to serialize cache entries. The above change was then deployed. It started with the first cluster, then waited until the next business day to deploy to remaining clusters. During this time, the first cluster started serializing new cache entries that were missing a value for the original (default true) flag. This caused the remaining clusters (still running the old deployment) to read values from the cache with this flag set to false, and therefore stop routing custom metrics data to Log Analytics. When the deployment completed in a region, impact would resolve as all clusters would be running the new code with the correct default value for the flag.
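The contract mismatch described above can be sketched in a simplified form. This is an illustrative Python sketch, not the actual Application Insights Ingestion code; the field and function names are hypothetical:

```python
import json

def old_cluster_read(raw: str) -> bool:
    """Old deployment: expects the routing flag in every cache entry.

    Hypothetical sketch: when the field is absent, the deserializer
    falls back to the boolean type default (False) rather than the
    intended default (True).
    """
    entry = json.loads(raw)
    return entry.get("route_custom_metrics", False)

# New deployment: the flag was removed from the serialization contract,
# so freshly written cache entries no longer carry it.
new_entry = json.dumps({"resource_id": "ai-resource-1"})

# An old cluster reading a new-format entry sees the flag as False and
# stops routing custom metrics data to the Log Analytics workspace.
print(old_cluster_read(new_entry))  # False
```

This is the general hazard of a rolling deployment over a shared cache: the writer and reader must agree on the contract, including defaults for absent fields, for the entire duration of the rollout.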
We had no monitoring of data volume drops by data type, and no new monitor was added for the flag's operation, since the flag being active was considered normal. As a result, the deployment proceeded to an additional region before the issue was detected. The incident persisted for around 24 hours in the South Central US region before the deployment completed there. Because the issue was not caught by automated monitoring, the deployment then proceeded to the West Europe region, where it deployed to the first cluster. As this happened just before a weekend, the impact persisted for several days before the deployment finished. Eventually, several customers raised support tickets after noticing that their custom metrics data was missing. Throughout this incident, the flag was incorrectly read as false, causing the ingestion service to stop routing custom metrics data to Log Analytics.
How did we respond?
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/D_CP-JQ8
What happened?
Between 13:30 UTC on 13 March 2025 and 00:25 UTC on 18 March 2025, a platform issue resulted in an impact to the Application Insights service in the West Europe region. Customers may have experienced intermittent data gaps on custom metrics data and incorrect alert activation.
What do we know so far?
We identified that the issue was caused by a service deployment. A new version was deployed to a single cluster of the service in the West Europe region, introducing a change to the contract of a cache shared among all clusters in the region. This contract change was incompatible with the code running on the remaining clusters, leading to incorrect routing of custom metrics data.
How did we respond?
What happens next?
Impact Statement: Starting at 13:30 UTC on 13 March 2025, you have been identified as a customer using Application Insights in West Europe who may have experienced intermittent data gaps on custom metrics data and incorrect alert activation.
This issue is now mitigated, and more information will be shared shortly.
Impact Statement: Starting at 13:30 UTC on 13 March 2025, you have been identified as a customer using Application Insights in West Europe who may experience intermittent data gaps on custom metrics data and incorrect alert activation.
Current Status: This issue was raised to us by a customer report. Upon investigation, we determined that the bug was introduced as part of a deployment. Once we identified this, we assessed the possibility of rolling back. After further inspection, we determined that mitigation required applying the deployment to all clusters in the region, since the mismatch in deployment versions was causing this issue.
We are currently expediting the deployment to all remaining clusters in the region; this is expected to take one hour to complete. We have paused the broader deployment to any remaining regions, and we will reassess our deployment plan after we mitigate this issue in West Europe.
The next update will be provided within 2 hours, or as events warrant.
Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident, and get any questions answered by our experts: https://aka.ms/AIR/Z_SZ-NV8
What happened?
Between 13:37 and 16:52 UTC on 18 March 2025, and again between 23:20 UTC on 18 March and 00:30 UTC on 19 March, a combination of a third-party fiber cut and an internal tooling failure resulted in an impact to a subset of Azure customers with services in our East US region.
During the first impact window, immediately after the fiber cut, customers may have experienced intermittent connectivity loss for inter-zone traffic that included AZ03 – to/from other zones, or to/from the public internet. During this time, the traffic loss rate peaked at 0.02% for short periods of time. Traffic within AZ03, as well as traffic to/from/within AZ01 and AZ02, was not impacted.
During the second impact window, triggered by the tooling issue, customers may have experienced intermittent connectivity loss – primarily when sending inter-zone traffic that included AZ03. During this time, the traffic loss rate peaked at 0.55% for short periods of time. Traffic entering or leaving the East US region was not impacted, but there was some minimal impact to inter-zone traffic from both of the other Availability Zones, AZ01 and AZ02.
Note that the 'logical' availability zones used by each customer subscription may correspond to different physical availability zones. Customers can use the Locations API to understand this mapping and confirm which resources run in this physical AZ: https://learn.microsoft.com/rest/api/resources/subscriptions/list-locations?HTTP#availabilityzonemappings.
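As a hedged sketch of how a customer might use that mapping: the Locations API returns an availabilityZoneMappings array per region for the subscription, pairing each logical zone with a physical zone. The sample response below is illustrative only; the physical zone values shown are assumptions, not this incident's actual mapping for any subscription:

```python
import json

# Illustrative fragment of a Locations API response for one region
# (field names per the linked API reference; values are assumptions).
response = json.loads("""
{
  "name": "eastus",
  "availabilityZoneMappings": [
    {"logicalZone": "1", "physicalZone": "eastus-az3"},
    {"logicalZone": "2", "physicalZone": "eastus-az1"},
    {"logicalZone": "3", "physicalZone": "eastus-az2"}
  ]
}
""")

# Build a logical -> physical lookup, then check which logical zone in
# this subscription corresponds to the impacted physical zone (AZ03).
mapping = {m["logicalZone"]: m["physicalZone"]
           for m in response["availabilityZoneMappings"]}
impacted_logical = [lz for lz, pz in mapping.items() if pz == "eastus-az3"]
print(impacted_logical)  # ['1']
```

In this example, resources this subscription deployed to logical zone 1 would be the ones running in the impacted physical zone.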
What went wrong and why?
At 13:37 UTC on 18 March 2025, a drilling operation near one of our network paths accidentally struck fiber used by Microsoft, causing an unplanned disruption to datacenter connectivity within AZ03. When fiber cuts impact our networking capacity, our systems are designed to redistribute traffic automatically to other paths. In this instance, we had two concurrent failures happen – before the cut, a datacenter router in AZ03 was down for maintenance and was in the process of being repaired. This combination of multiple concurrent failures impacted a small portion of our diverse capacity within AZ03, leading to the potential for intermittent connectivity issues for some customers. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity – customers would have started to see their services recover at this time.
Additionally, after the fiber cut and the failed isolation, a linecard failed on another router at 14:16 UTC, further reducing overall capacity to AZ03. However, as traffic had been re-routed, this further reduction in capacity did not cause any additional customer impact.
During the initial mitigation efforts outlined above, our auto-mitigation tool encountered a lock contention problem that blocked commands on the impacted devices, so it failed to isolate all capacity connected to those devices. This left some of the impacted capacity un-isolated. Because that capacity was out of service from the fiber cut and not carrying production traffic, our systems did not immediately flag the failed isolation state.
At approximately 21:00 UTC, our fiber provider commenced recovery work on the damaged fiber. Because not all of the impacted fiber capacity had been isolated, as individual fibers were repaired during the capacity recovery process, at 23:20 UTC our recovery systems began re-sending traffic to the devices connected to the un-isolated capacity – bringing them back into service without safe levels of capacity. This caused traffic congestion that impacted customers as described above.
The traffic congestion within AZ03, due to the tooling failure, triggered an unplanned failure mode on a regional hub router that connects multiple datacenters. By design, our network devices attempt to contain congestive packet loss to capacity that is already impacted. Due to the failure mode encountered, this containment failed on a subset of routers – so congestion spread to neighboring capacity on the same regional hub router, beyond AZ03. This containment failure impacted a small subset of traffic from the regional hub router to AZ01 and AZ02.
At this stage, all capacity originally impacted by the third-party fiber cut was manually isolated from the network – mitigating all customer impact by 00:30 UTC on 19 March. At 01:52 UTC on 19 March, the underlying fiber cut was fully repaired. We then completed testing and restoration of all capacity to pre-incident levels by 06:50 UTC on 19 March.
How did we respond?
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/Z_SZ-NV8
This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details.
Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident, and get any questions answered by our experts: https://aka.ms/AIR/Z_SZ-NV8
What happened?
Between 13:37 and 16:52 UTC on 18 March 2025, and again between 23:20 UTC on 18 March and 00:30 UTC on 19 March, a combination of a fiber cut and tooling failure within our East US region resulted in an impact to a subset of Azure customers with services in that region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic to/from/within Availability Zone 3 (AZ3) in the East US region.
What do we know so far?
At 13:37 UTC, a drilling operation near one of our network paths accidentally struck fiber used by Microsoft, causing an unplanned disruption to datacenter connectivity within physical Availability Zone 3 (AZ3) only. With fiber cuts impacting capacity, our systems are designed to shift traffic automatically to other diverse paths. In this instance, we had two concurrent failures happen – before the cut, a large hub router was down due to maintenance (in the process of being repaired); and after the cut, a linecard failed on another router.
This combination of multiple concurrent failures impacted a small portion of our diverse capacity within AZ3, leading to the potential for retransmits or intermittent connectivity for some customers. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity – customers would have started to see their services recover at this time. The restoration of traffic was fully completed by 16:52 UTC, and the issue was noted as mitigated.
At approximately 21:00 UTC, our fiber provider commenced recovery work on the cut fiber. During the capacity recovery process, at 23:20 UTC, a tooling failure caused our systems to add devices back into the production network before safe levels of capacity had been restored on the impacted fiber path. As individual fibers were repaired and brought back into service, traffic was therefore restarted without safe levels of capacity – resulting in congestion that led to customer impact when sending traffic to/from/within AZ3 of the East US region. The impact was mitigated at 00:30 UTC on 19 March, after manually isolating the capacity affected by this tooling failure.
At 01:52 UTC on 19 March, the underlying fiber cut was fully repaired. We completed testing and restoration of all capacity to pre-incident levels by 06:50 UTC on 19 March.
How did we respond?
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/Z_SZ-NV8
What happened?
Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.
At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber; customers may have experienced the same intermittent connectivity loss and increased latency when sending traffic within, to, and from the East US region.
What do we know so far?
We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving the affected datacenters. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025, and the issue was mitigated.
At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure.
At 01:52 UTC on 19 March, the underlying fiber cut was fully repaired. We continued to test and restore all capacity to pre-incident levels; these tasks completed at 06:50 UTC on 19 March.
How did we respond?
What happens next?
Impact Statement: Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.
At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber; customers may have experienced the same intermittent connectivity loss and increased latency when sending traffic within, to, and from the East US region.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving the affected datacenters. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025, and the issue was mitigated.
At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure.
At 01:52 UTC on 19 March, the underlying fiber cut was fully repaired. We continue working to test and restore all capacity to pre-incident levels.
Our telemetry data shows that the customer impact has been fully mitigated. We are continuing to monitor the situation during our capacity recovery process before confirming complete resolution of the incident.
An update will be provided in 3 hours, or as events warrant.
Previous communications indicated an incorrect severity (changing from Warning to Informational). We have rectified the error and updated the communication below. We apologize for any confusion this may have caused.
Impact Statement: Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.
At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber; customers may have experienced the same intermittent connectivity loss and increased latency when sending traffic within, to, and from the East US region.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving the affected datacenters. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025, and the issue was mitigated.
At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure.
At 01:52 UTC on 19 March, the underlying fiber cut was fully repaired. We continue working to test and restore all capacity to pre-incident levels.
Our telemetry indicates that customer impact has been fully mitigated. We will continue to monitor during our capacity recovery process before confirming complete incident mitigation.
An update will be provided in 3 hours, or as events warrant.
Impact Statement: Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.
At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber; customers may have experienced the same intermittent connectivity loss and increased latency when sending traffic within, to, and from the East US region.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving the affected datacenters. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025, and the issue was mitigated.
At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure.
At 01:52 UTC on 19 March, the underlying fiber cut was fully repaired. We continue working to test and restore all capacity to pre-incident levels.
Our telemetry indicates that customer impact has been fully mitigated. We will continue to monitor during our capacity recovery process before confirming complete incident mitigation.
An update will be provided in 3 hours, or as events warrant.
Impact Statement: Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.
At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber; customers may have experienced the same intermittent connectivity loss and increased latency when sending traffic within, to, and from the East US region.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving the affected datacenters. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025, and the issue was mitigated.
At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure.
At 01:52 UTC on 19 March, the underlying fiber cut was fully repaired. We are now working to test and restore all capacity to pre-incident levels.
An update will be provided in 60 minutes, or as events warrant.
What happened?
Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.
At 23:21 UTC, another impact to network capacity occurred during the recovery of the underlying fiber; customers may have experienced the same intermittent connectivity loss and increased latency when sending traffic within, to, and from the East US region.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving the affected datacenters. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC, and the issue was mitigated.
At 23:20 UTC, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. We are actively mitigating the current impact to ensure no further incidents occur during the recovery process.
An update will be provided in 60 minutes, or as events warrant.
After further investigation, we determined that we are still working with our providers on fiber repairs; however, as previously communicated, no further impact should be experienced. We apologize for the inconvenience caused.
What happened?
Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving the affected datacenters. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC, and the issue was mitigated. While the fiber repairs will take time, there should be no further impact to customers, as impacted devices are isolated and traffic has shifted to healthier routes. Further updates will be provided in 6 hours, or as events warrant.
What happened?
Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.
What do we know so far?
We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving the affected datacenters. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC, and the issue was mitigated. While the fiber repairs will take time, there should be no further impact to customers, as impacted devices are isolated and traffic has shifted to healthier routes.
How did we respond?
What happens next?
What happened?
Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.
This issue is now mitigated. Additional information will be provided shortly.
Impact Statement: Starting at 13:09 UTC on 18 March 2025, a subset of Azure customers in the East US region may experience intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving the affected datacenters. We have mitigated the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity. Impacted customers should now see their services recover. In parallel, we are working with our providers on fiber repairs. We do not yet have a reliable ETA for repairs. We will continue to provide updates here as they become available.
Impact Statement: Starting at 13:09 UTC on 18 March 2025, a subset of Azure customers in the East US region may experience intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving the affected datacenters. We are actively re-routing traffic to mitigate high utilization and restore connectivity for impacted customers. Underlying fiber repairs have been initiated with our provider, but we do not yet have an ETA for those repairs. An update will be provided in 60 minutes, or as events warrant.
Impact Statement: Starting at 13:09 UTC on 18 March 2025, a subset of Azure customers in the East US region may experience intermittent connectivity loss and increased network latency in the region.
Current Status: We identified a networking issue affecting a subset of datacenters in the East US region. We are continuing to investigate the contributing factors of this issue, and in parallel, we are actively re-routing affected traffic to minimize the impact on networking-dependent services. The next update will be provided within 60 minutes, or as events warrant.
Impact Statement: Starting at 13:09 UTC on 18 March 2025, you have been identified as an Azure customer in the East US region who may experience intermittent connectivity loss and increased network latency in the region.
Current Status: We are aware of the issue and actively working on mitigation workstreams to reroute traffic and mitigate impact for customers. The next update will be provided within 60 minutes, or as events warrant.
What happened?
Between 17:24 and 18:45 UTC on 18 March 2025, a subset of customers may have experienced network connectivity issues with resources hosted in the West Europe region. We determined that only inter-region traffic - meaning traffic moving to or from other Azure regions and West Europe - was affected. ExpressRoute and Internet traffic were not impacted.
What went wrong and why?
From as early as 09:02 UTC, we were engaged on incidents where we were mitigating multiple failures of networking devices across multiple regions, in which Wide Area Network (WAN) routers were failing to renew their certificates. These issues were caused by a combination of a decommissioned certificate authority and a bug affecting agents on a subset of our devices.
Our automated systems create and renew certificates using Certificate Authority (CA) servers. During this event we unearthed an issue where a subset of our agents was susceptible to a renewal bug: when the primary CA server was unavailable, the agent corrupted its working certificate while performing a renewal against the secondary CA server. This caused the agent to restart continuously.
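The safeguard missing here can be illustrated with a small sketch: a renewal routine that only swaps out the working certificate after the newly fetched one passes validation cannot be corrupted by a bad response from a fallback CA. The function and helpers below are hypothetical, for illustration only, and do not represent the actual agent implementation:

```python
def renew_certificate(fetch_primary, fetch_secondary, validate, store):
    """Attempt renewal against the primary CA, falling back to the
    secondary CA. The working certificate in `store` is replaced only
    after the newly fetched certificate passes validation, so a bad
    fetch cannot corrupt the certificate the agent is running on."""
    for fetch in (fetch_primary, fetch_secondary):
        try:
            candidate = fetch()
        except ConnectionError:
            continue  # this CA is unavailable; try the next one
        if validate(candidate):
            store["active"] = candidate  # swap only after validation
            return True
    return False  # keep the existing certificate and retry later
```

In this sketch, a corrupt download from the secondary CA leaves the working certificate untouched, rather than replacing it and forcing the agent into a restart loop.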
In West Europe, we engaged on an issue where the agent running on a backbone router was failing to renew its certificate. A backbone router is a crucial component in networks that provides the primary path for data traffic across different segments or parts of a network.
At 17:22 UTC, automated mitigation initiated the isolation of the unhealthy backbone device from serving traffic. At 17:24 UTC, the failed renewal led to the agent restarting and putting the router's forwarding database in an inconsistent state. A forwarding database is a table maintained to efficiently forward packets to their intended destinations. At 17:29 UTC, the isolation process was cancelled, as the failure was believed to be related to the non-backbone devices that were failing across West Europe; consequently, this device was inadvertently brought back into rotation when, as a backbone device, it should have remained isolated. As a result of the inconsistent state, we observed that approximately 25% of inter-region traffic entering and exiting the West Europe region was being erroneously dropped, or blackholed.
Mitigation efforts were delayed by the ongoing response to the decommissioning of the certificate authority and the uncovered bug, which made it difficult for us to broadly assess our monitoring signals. Our efforts were primarily focused on patching devices that serve all Internet and ExpressRoute traffic.
How did we respond?
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/TL8S-TX8
What happened?
Between 17:24 and 18:45 UTC on 18 March 2025, a subset of customers may have experienced network connectivity issues with resources hosted in the West Europe region. We determined that only inter-region traffic - meaning traffic moving between other Azure regions and West Europe - was affected. ExpressRoute and internet traffic were not impacted.
What do we know so far?
The issue originated from a service running on the device downloading an incorrect certificate from an old certificate server. This led to the service restarting and putting the device's forwarding database in an inconsistent state. We observed that inter-region traffic entering and exiting West Europe was being erroneously discarded.
We identified a specific networking router as the source of the problem. The issue was resolved by updating the service configuration to point to the new certificate server.
How did we respond?
17:26 UTC on 18 March 2025 – Network connectivity issues were first observed.
17:26 UTC on 18 March 2025 – The issue was detected shortly after the impact started.
18:25 UTC on 18 March 2025 – We determined the issue was due to an incorrect certificate that was erroneously downloaded, leading to traffic black-holing.
18:45 UTC on 18 March 2025 – The mitigation was applied by fixing the configuration on the affected routers.
18:45 UTC on 18 March 2025 – Immediately after we made the change, we observed recovery and the network returned to normal operation.
What happens next?
Between 17:24 and 18:45 UTC on 18 March 2025, a subset of customers may have experienced network connectivity issues with resources hosted in the West Europe region. We determined that only inter-region traffic - meaning traffic moving between other Azure regions and West Europe - was affected. ExpressRoute and internet traffic were not impacted.
The issue originated from a service running on the device downloading an incorrect certificate from an old certificate server. This led to the service restarting and putting the device's forwarding database in an inconsistent state. We observed that inter-region traffic entering and exiting West Europe was being erroneously discarded.
We identified a specific networking router as the source of the problem. The issue was resolved by updating the service configuration to point to the new certificate server.
Starting at approximately 17:26 UTC on 18 March 2025, a subset of customers are experiencing network connectivity issues to resources hosted in West Europe. We have determined that the impact affects inter-region traffic, meaning traffic traversing between another Azure region and West Europe might be affected. We are starting to see recovery.
We can confirm that ExpressRoute and internet traffic are not impacted.
More information will be provided shortly.
An outage alert is being investigated. More information will be provided as it is known.
At 17:03 UTC on 18 March 2025, we received a monitoring alert for Application Gateway in West Europe which initiated an investigation and notified customers of a potential outage. We have concluded our investigation of the alert and confirmed that all services remained healthy, and a service incident did not occur. We will continue investigating to determine why alerts were triggered to pre-emptively avoid similar false alerts going forward. Apologies for any inconvenience caused.
Impact Statement: Starting at 17:48 UTC on 18 Mar 2025, you have been identified as a customer who may encounter data plane issues affecting your Application Gateway in West Europe. This may impact the performance and availability of your applications hosted behind application gateways in the region.
Visit the Impacted Resources tab in Azure Service Health for details on resources confirmed or potentially affected by this event.
Current Status: We are aware and actively working on mitigating the incident. This situation is being closely monitored and we will provide updates as the situation warrants or once the issue is fully mitigated.
Post Incident Review (PIR) – Azure Resource Manager – Timeouts or 5xx responses from ARM while calling an older API
What happened?
On 27 February 2025 a platform issue in Azure Resource Manager (ARM) caused inadvertent throttling that impacted different services:
What went wrong and why?
When Azure Resource Manager (ARM) receives a request for authentication and authorization, in some specific scenarios it leverages an older API. The backend system responsible for these API calls experienced an unexpected rise in traffic during this incident and, as a result, throttled some of those calls. Throttling is a common resiliency strategy designed to regulate the rate at which internal resources are accessed. This helps prevent the system from being overwhelmed by a large volume of requests, protecting the system while still allowing it to function for the majority of requests. During the impact window, we experienced an unusual rise in requests from an internal service in Azure. This led to internal throttling, resulting in a higher number of 504 errors.
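Throttling of this kind is often implemented as a token bucket: requests draw tokens that refill at a sustained rate, and requests that arrive when the bucket is empty are rejected rather than queued. The sketch below is a generic illustration of the technique, not ARM's actual mechanism:

```python
import time

class TokenBucket:
    """Minimal token-bucket throttle: requests beyond the sustained
    rate (plus an allowed burst) are rejected rather than queued,
    protecting the backend from being overwhelmed."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec          # sustained refill rate
        self.capacity = burst             # maximum burst size
        self.tokens = float(burst)        # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        # Refill tokens based on elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # Caller would surface a throttling error (e.g. 429 or 504).
        return False
```

Requests rejected here fail fast, which is the trade-off the text describes: a fraction of calls see errors so that the backend keeps functioning for the majority.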
How did we respond?
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/SV7D-DV0
Previous communications would have indicated an incorrect additional service being impacted. We have rectified the error and updated the communication below. We apologize for any confusion this may have caused.
What happened?
Between 07:30 and 11:11 UTC on 27 February 2025, a platform issue resulted in an impact to Azure Resource Manager operations in the West Europe region. A subset of customers may have experienced a temporary degradation in performance and latency when trying to access resources hosted in the region.
What do we know so far?
We determined that an increase in service traffic resulted in backend service components reaching an operational threshold. This led to service impact and manifested in the experience described above.
How did we respond?
What happens next?
What happened?
Between 07:30 and 11:11 UTC on 27 February 2025, a platform issue resulted in an impact to Azure Resource Manager operations in the West Europe region. A subset of customers may have experienced a temporary degradation in performance and latency when trying to access resources hosted in the region.
This issue is now mitigated. An update with more information will be provided shortly.
At 18:39 UTC on 26 February 2025, we received a monitoring alert for a possible issue with Managed Identities for Azure resources. Subsequently, communications were sent to customers, notifying them of this possible issue.
Upon further investigation during our post-incident review, we have determined that a significant percentage of those notified were not impacted by this event. We apologize for any confusion or inconvenience this may have caused.
If you were not impacted, please disregard the previous notification. We are committed to ensuring the accuracy of our communications and will continue to improve our processes and tooling to prevent such false notifications in the future.
For those customers who were impacted, you will receive subsequent messaging with the final Post Incident Review (PIR).
What happened?
Between 18:39 and 20:55 UTC on 26 February 2025, we experienced an issue which resulted in an impact for customers being unable to perform control plane operations related to Azure Managed Identity. This included impact to the following services: Azure Container Apps, Azure SQL, Azure SQL Managed Instance, Azure Front Door, Azure Resource Manager, Azure Synapse Analytics, Azure Databricks, Azure Chaos Studio, Azure App Services, Azure Logic Apps, Azure Media Services, Microsoft Power BI and Azure Service Bus.
What do we know so far?
We identified an issue with our Managed Identity infrastructure related to a key rotation. We performed manual steps to repair the key in each region, which resolved the issue.
How did we respond?
What happens next?
Impact Statement: Starting at 16:48 UTC on 26 February 2025, you have been identified as a customer using Managed Identities who may be unable to create, update, delete, or scale up Azure resources using Managed Identities, and/or request tokens in some cases. Azure Chaos Studio customers may also not have been able to create or run experiments.
Current Status: We have identified the issue and have begun to roll out a fix region-by-region. The regions where customers should see mitigation are Central US, North Europe, West US, UK West, West Europe, East US, East US 2, Korea Central, Canada Central, West US 2, Australia Central, Australia East, Japan East, Sweden Central, UK South, South Central US, Southeast Asia, West US 3, UAE Central, West Central US, Canada East, Brazil South, Central India, France Central, Germany West Central, North Central US, UAE North, Switzerland North, South India, Australia Southeast, Norway East, Italy North, Korea South, Switzerland West, Sweden South, South Africa North, Mexico Central, Norway West, South Africa West, Israel Central, Poland Central, Jio India West, West India, France South, Germany North, Brazil Southeast, Jio India Central.
Impact Statement: Starting at 16:48 UTC on 26 February 2025, you have been identified as a customer using Managed Identities who may be unable to create, update, delete, scale-up Azure resources using Managed Identities, and/or request tokens in some cases. Chaos customers may also not have been able to create or run experiments.
Current Status: We are currently investigating this issue and suspect it is related to a certificate. We will provide additional information as it becomes available. The next update will be provided in 60 minutes, or as events warrant.
Between 18:39 and 20:55 UTC on 26 February 2025, we experienced an issue which resulted in an impact for customers being unable to perform control plane operations related to Azure Managed Identity. This included impact to the following services: Azure Container Apps, Azure SQL, Azure SQL Managed Instance, Azure Front Door, Azure Resource Manager, Azure Synapse Analytics, Azure Databricks, Azure Chaos Studio, Azure App Services, Azure Logic Apps, Azure Media Services, Microsoft Power BI and Azure Service Bus.
Information on steps taken to mitigate this incident will be provided shortly.
Post Incident Review (PIR) – Cosmos DB – Impacted multiple services in West Europe
What happened?
Between 19:03 UTC and 22:08 UTC on 10 February 2025, a Cosmos DB scale unit in the West Europe region hosting Cosmos DB containers experienced failures and was unable to respond to customer requests. Cosmos DB is a distributed database management system, and data for a container is sharded and stored across multiple sets of machines, based on the partition key values of the items stored in the container. In this case, one of these sets of machines became unavailable, leading to unavailability for a subset of containers in the region.
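To illustrate why one unavailable set of machines affects only a subset of containers: each item's partition key is hashed to a physical partition, so only keys that map to the down partition lose availability. The hash scheme below is purely illustrative, not Cosmos DB's actual partitioning algorithm:

```python
import hashlib

def physical_partition(partition_key: str, partition_count: int) -> int:
    """Map a partition key to one of `partition_count` physical
    partitions via a stable hash (illustrative only)."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

def affected_keys(keys, partition_count, down_partition):
    """Keys whose data lives on the unavailable set of machines."""
    return [k for k in keys
            if physical_partition(k, partition_count) == down_partition]
```

With a uniform hash, losing one of N partitions makes roughly 1/N of keys unavailable while the rest continue to be served, which matches the "subset of containers" impact described above.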
Cosmos DB accounts would have been impacted in different ways, depending on their configuration:
Other downstream services which had dependencies on Cosmos DB were impacted, these included:
What went wrong and why?
While performing platform maintenance, a set of metadata nodes was taken down for updates. Metadata nodes are system nodes that maintain the scale unit in a healthy condition. This type of maintenance has taken place regularly and reliably for many years to provide security, performance, and availability improvements. During this time, a set of metadata nodes experienced an unexpected failure, and the total number of down metadata nodes exceeded the maximum allowed to maintain scale unit integrity, leaving the scale unit temporarily without the number of metadata nodes required to keep it up and functional. Ordinarily this transient state would not lead to failures, as the system is designed to handle them, but some of the nodes got stuck in a boot-up sequence and had to be restarted to re-establish the number of metadata nodes needed to maintain the health of the scale unit. We determined that there was insufficient buffer in the number of metadata nodes under maintenance to handle the additional loss of metadata nodes experienced. Had either the buffer of metadata nodes been larger, or the failed metadata nodes been able to self-recover, the scale unit would not have entered a failed state.
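The buffer logic described above can be sketched as a simple pre-maintenance check: nodes are only taken down if the remaining healthy count still leaves headroom for some number of unexpected failures. This is an illustrative sketch of the general technique, not the actual Cosmos DB maintenance policy, and the parameter names are hypothetical:

```python
def safe_to_take_down(total_nodes, required_healthy,
                      in_maintenance, failure_buffer):
    """Return True only if taking `in_maintenance` metadata nodes down
    still leaves headroom for `failure_buffer` additional unexpected
    failures before the scale unit drops below `required_healthy`."""
    healthy_after = total_nodes - in_maintenance
    return healthy_after - failure_buffer >= required_healthy
```

In the incident, the equivalent of `failure_buffer` was effectively zero: the maintenance count alone was within limits, but the unexpected failures pushed the down count past the maximum the scale unit could tolerate.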
How did we respond?
Upon detection, Cosmos DB engineers determined that a single scale unit had a set of nodes impacted that caused the scale unit to become unavailable. Engineers determined that a subset of machines in the scale unit had entered a stuck state and required a manual reboot to recover. Once manually rebooted, the nodes were able to recover and availability was restored at 20:03 UTC. A small subset of those machines required additional steps to recover, leading to the longer recovery time (until 22:08 UTC) for a small set of containers.
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/CSDL-1X8
Previous communications would have had an inaccurate/incomplete impacted services list. We have rectified the errors and updated the communication below. We apologize for any confusion this may have caused.
What happened?
Between 19:09 UTC and 22:08 UTC on 10 February 2025, a platform issue with Cosmos DB caused degraded service availability for subsets of the following services, in the West Europe region:
What do we know so far?
We have identified a group of nodes in the region that became unhealthy, leading to the cluster serving those nodes becoming unavailable. This affected instances of Cosmos DB, which the affected services rely on to process requests. Due to this inability to process requests, subsets of those services became unavailable.
How did we respond?
What happens next?
Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers. To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts. For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs. The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring. Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.
What happened?
Between 19:09 UTC and 22:08 UTC on 10 February 2025, a platform issue resulted in an impact to the following services in West Europe:
This issue is now mitigated. An update with more information will be provided shortly.
Impact Statement: Starting at 19:09 UTC on 10 February 2025, customers in West Europe may experience degradation in service availability for these affected services:
Current Status: We are aware of this issue and are actively investigating potential contributing factors. The next update will be provided within 60 minutes, or as events warrant.