Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident, and get any questions answered by our experts: https://aka.ms/AIR/Z_SZ-NV8
What happened?
Between 13:37 and 16:52 UTC on 18 March, and again between 23:20 UTC on 18 March and 00:30 UTC on 19 March, 2025, a combination of a third-party fiber cut, and an internal tooling failure resulted in an impact to a subset of Azure customers with services in our East US region.
During the first impact window, immediately after the fiber cut, customers may have experienced intermittent connectivity loss for inter-zone traffic that included AZ03 - to/from other zones, or to/from the public internet. During this time, the traffic loss rate peaked at 0.02% for short periods of time. Traffic within AZ03, as well as traffic to/from/within AZ01 and AZ02, was not impacted.
During the second impact window, triggered by the tooling issue, customers may have experienced intermittent connectivity loss – primarily when sending inter-zone traffic that included AZ03. During this time, the traffic loss rate peaked at 0.55% for short periods of time. Traffic entering or leaving the East US region was not impacted, but there was some minimal impact to inter-zone traffic from both of the other Availability Zones, AZ01 and AZ02.
Note that the 'logical' availability zones used by each customer subscription may correspond to different physical availability zones. Customers can use the Locations API to understand this mapping, to confirm which resources run in this physical AZ, see: https://learn.microsoft.com/rest/api/resources/subscriptions/list-locations?HTTP#availabilityzonemappings.
What went wrong and why?
At 13:37 UTC on 18 March 2025, a drilling operation near one of our network paths accidentally struck fiber used by Microsoft, causing an unplanned disruption to datacenter connectivity within AZ03. When fiber cuts impact our networking capacity, our systems are designed to redistribute traffic automatically to other paths. In this instance, we had two concurrent failures happen – before the cut, a datacenter router in AZ03 was down for maintenance and was in the process of being repaired. This combination of multiple concurrent failures impacted a small portion of our diverse capacity within AZ03, leading to the potential for intermittent connectivity issues for some customers. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity – customers would have started to see their services recover at this time.
Additionally, after the fiber cut and failed isolation, at 14:16 UTC a linecard failed on another router further reducing overall capacity to AZ03. However, as traffic had been re-routed, this further reduction in capacity did not cause any additional customer impact.
During the initial mitigation efforts outlined above, our auto-mitigation tool encountered a lock contention problem blocking commands on the impacted devices, failing to isolate all capacity connected to those devices. This failure left some of the impacted capacity un-isolated, and our system did not flag this failed isolation state. Due to some capacity being out of service from the fiber cut, this failed state was not immediately flagged in our systems as the down capacity was not carrying production traffic.
At approximately 21:00 UTC, our fiber provider commenced recovery work on the damaged fiber. During the capacity recovery process, at 23:20 UTC, as a result of the failure to isolate all the impacted fiber capacity, as individual fibers were repaired, our recovery systems begin re-sending traffic to the devices connected to the un-isolated capacity, therefore, bringing them back into service without safe levels of capacity. This caused traffic congestion that impacted customers as described above.
The traffic congestion within AZ03, due to the tooling failure, triggered an unplanned failure mode on a regional hub router that connects multiple datacenters. By design, our network devices attempt to contain congestive packet loss to capacity that is already impacted. Due to the encountered failure mode, this containment failed on a subset of routers – so congestion spread to neighboring capacity on the same regional hub router, beyond AZ03. This containment failure impacted a small subset of traffic from the regional hub router to AZ1 and AZ2.
At this stage, all originally-impacted capacity from the third-party fiber cut was manually isolated from the network – mitigating all customer impact by 00:30 UTC on 19 March. At 01:52 UTC on 19 March the underlying fiber cut was fully recovered. At that time, we completed the test and restoration of all capacity to pre-incident levels by 06:50 UTC on 19 March.
How did we respond?
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/Z_SZ-NV8
This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details.
Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident, and get any questions answered by our experts: https://aka.ms/AIR/Z_SZ-NV8
What happened?
Between 13:37 and 16:52 UTC on 18 March 2025, and again between 23:20 UTC on 18 March and 00:30 UTC on 19 March, a combination of a fiber cut and tooling failure within our East US region, resulted in an impact to a subset of Azure customers with services in that region. Customers may have experienced intermittent connectivity loss and increased network latency – when sending traffic to/from/within Availability Zone 3 (AZ3) within this East US region.
What do we know so far?
At 13:37 UTC, a drilling operation near one of our network paths accidentally struck fiber used by Microsoft, causing an unplanned disruption to datacenter connectivity within physical Availability Zone 3 (AZ3) only. With fiber cuts impacting capacity, our systems are designed to shift traffic automatically to other diverse paths. In this instance, we had two concurrent failures happen – before the cut, a large hub router was down due to maintenance (in the process of being repaired); and after the cut, a linecard failed on another router.
This combination of multiple concurrent failures impacted a small portion of our diverse capacity within AZ3, leading to the potential for retransmits or intermittent connectivity for some customers. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity – customers would have started to see their services recover at this time. The restoration of traffic was fully completed by 16:52 UTC, and the issue was noted as mitigated.
At approximately 21:00 UTC, our fiber provider commenced recovery work on the cut fiber. During the capacity recovery process, at 23:20 UTC, a tooling failure caused our systems to add devices back into the production network before the safe levels of capacity were recovered on the impacted fiber path. As a result, as individual fibers were repaired and brought back into service, our tooling incorrectly began adding devices back to the network. Due to the timing of this tooling failure, traffic was restarted without safe levels of capacity – resulting in congestion that led to customer impact, when sending traffic to/from/within AZ3 of the East US region. The impact was mitigated at 00:30 UTC on 19 March, after manually isolating the capacity affected by this tooling failure.
At 01:52 UTC on 19 March, the underlying fiber cut was fully recovered. We completed the test and restoration of all capacity to pre-incident levels by 06:50 UTC on 19 March.
How did we respond?
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/Z_SZ-NV8
What happened?
Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of East US Region.
At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber that customers may have experienced the same intermittent connectivity loss and increased latency sending traffic within, to and from East US Region.
What do we know so far?
We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025 and the issue was mitigated.
At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure.
At 01:52 UTC on 19 March, the underlying fiber cut has been fully restored. We continued to test and restore all capacity to pre-incident levels, these tasks completed at 6:50 UTC on 19 March.
How did we respond?
What happens next?
Impact Statement: Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of East US Region.
At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber that customers may have experienced the same intermittent connectivity loss and increased latency sending traffic within, to and from East US Region.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025 and the issue was mitigated.
At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure.
At 01:52 UTC on 19 March, the underlying fiber cut has been fully restored. We continue working to test and restore all capacity to pre-incident levels.
Our telemetry data shows that the customer impact has been fully mitigated. We are continuing to monitor the situation during our capacity recovery process before confirming complete resolution of the incident.
An update will be provided in 3 hours, or as events warrant
Previous communications would have indicated an incorrect severity (changing from Warning to informational). We have rectified the error and updated the communication below. We apologize for any confusion this may have caused.
Impact Statement: Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region.
At 23:21 UTC on 18th March 2025, another impact to network capacity occurred during the recovery of the underlying fiber that customers may have experienced the same intermittent connectivity loss and increased latency sending traffic within, to and from US East.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025 and the issue was mitigated.
At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure.
At 01:52 UTC on 19 March, the underlying fiber cut has been fully restored. We continue working to test and restore all capacity to pre-incident levels.
Our telemetry indicates that customer impact has been fully mitigated. We will continue to monitor during our capacity recovery process before confirming complete incident mitigation.
An update will be provided in 3 hours, or as events warrant.
Impact Statement: Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region.
At 23:21 UTC on 18th March 2025, another impact to network capacity occurred during the recovery of the underlying fiber that customers may have experienced the same intermittent connectivity loss and increased latency sending traffic within, to and from US East.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025 and the issue was mitigated.
At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure.
At 01:52 UTC on 19 March, the underlying fiber cut has been fully restored. We continue working to test and restore all capacity to pre-incident levels.
Our telemetry indicates that customer impact has been fully mitigated. We will continue to monitor during our capacity recovery process before confirming complete incident mitigation.
An update will be provided in 3 hours, or as events warrant.
Impact Statement: Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region.
At 23:21 UTC on 18th March 2025, another impact to network capacity occurred during the recovery of the underlying fiber that customers may have experienced the same intermittent connectivity loss and increased latency sending traffic within, to and from US East.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18th March 2025. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. At 13:55 UTC on 18th March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 UTC on 18th March 2025 and the issue was mitigated.
At 23:20 UTC on 18th March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19th March after isolating the capacity impacted by the tooling failure.
At 01:52 UTC on 19th March, the underlying fiber cut has been fully restored. We are now working to test and restore all capacity to pre-incident levels.
An update will be provided in 60 minutes, or as events warrant.
What happened?
Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region.
At 23:21 UTC, another impact to network capacity occurred during the recovery of the underlying fiber that customers may have experienced the same intermittent connectivity loss and increased latency sending traffic within, to and from US East.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 UTC and the issue was mitigated.
At 23:20 UTC, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. We are actively mitigating the current impact to ensure no further incidents occur during the recovery process.
An update will be provided in 60 minutes, or as events warrant.
After further investigation, we determined we are still working with our providers on fiber repairs, however, no further impact should be experienced as previously communicated. We apologize for the inconvenience caused.
What happened?
Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 and the issue was mitigated. While fixing fiber will take time, there should be no further impact to customers as impacted devices are isolated and the traffic is shifted to healthier routes. Further updates will be provided in 6 hours or as events warrant.
What happened?
Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region.
What do we know so far?
We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 and the issue was mitigated. While fixing fiber will take time, there should be no further impact to customers as impacted devices are isolated and the traffic is shifted to healthier routes.
How did we respond?
What happens next?
What happened?
Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region.
This issue is now mitigated. Additional information will now be provided shortly.
Impact Statement: Starting at 13:09 UTC on 18 March 2025, a subset of Azure customers in the East US region may experience intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. We have mitigated the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity. Impacted customers should now see their services recover. In parallel, we are working with our providers on fiber repairs. We do not yet have a reliable ETA for repairs at this time. We will continue to provide updates here as they become available.
Impact Statement: Starting at 13:09 UTC on 18 March 2025, a subset of Azure customers in the East US region may experience intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region.
Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber impacted capacity to those datacenters increasing the utilization for the remaining capacity of the affected datacenters. We are actively re-routing traffic to mitigate high utilization and restore connectivity for impacted customers. Underlying fiber repairs have been initiated with our provider, but we do not yet have an ETA for fiber repairs. An update will be provided in 60 minutes, or as events warrant.
Impact Statement: Starting at 13:09 UTC on 18 March 2025, a subset of Azure customers in the East US region may experience intermittent connectivity loss and increased network latency in the region.
Current Status: We identified a networking issue affecting a subset of datacenters in the East US region. We are continuing to investigate the contributing factors of this issue, and in parallel, we are actively re-routing affected traffic to minimize the impact on networking-dependent services. The next update will be provided within 60 minutes, or as events warrant.
Impact Statement: Starting at 13:09 UTC on 18 March 2025, you have been identified as an Azure customer in the East US region who may experience intermittent connectivity loss and increased network latency in the region.
Current Status: We are aware of the issue and actively working on mitigation workstreams to reroute traffic and mitigate impact for customers. The next update will be provided within 60 minutes, or as events warrant.