Microsoft Incident - Network Infrastructure - PIR – Network Connectivity – Availability issues in East US

Incident Report for Graphisoft

Resolved

This incident has been resolved.

Posted Apr 15, 2025 - 08:24 CEST

Update

Post Incident Review (PIR) – Network Connectivity – Availability issues in East US

Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident, and get any questions answered by our experts: https://aka.ms/AIR/Z_SZ-NV8

What happened?

Between 13:37 and 16:52 UTC on 18 March 2025, and again between 23:20 UTC on 18 March and 00:30 UTC on 19 March 2025, a combination of a third-party fiber cut and an internal tooling failure impacted a subset of Azure customers with services in our East US region.

During the first impact window, immediately after the fiber cut, customers may have experienced intermittent connectivity loss for inter-zone traffic that included AZ03 - to/from other zones, or to/from the public internet. During this time, the traffic loss rate peaked at 0.02% for short periods of time. Traffic within AZ03, as well as traffic to/from/within AZ01 and AZ02, was not impacted.

During the second impact window, triggered by the tooling issue, customers may have experienced intermittent connectivity loss – primarily when sending inter-zone traffic that included AZ03. During this time, the traffic loss rate peaked at 0.55% for short periods of time. Traffic entering or leaving the East US region was not impacted, but there was some minimal impact to inter-zone traffic from both of the other Availability Zones, AZ01 and AZ02.

Note that the 'logical' availability zones used by each customer subscription may correspond to different physical availability zones. Customers can use the Locations API to understand this mapping and to confirm which resources run in this physical AZ; see: https://learn.microsoft.com/rest/api/resources/subscriptions/list-locations?HTTP#availabilityzonemappings.
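As a rough illustration of reading that mapping, the sketch below looks up which logical zone corresponds to a given physical zone in a Locations API response. The sample payload is trimmed, and its zone values are illustrative placeholders rather than real data for any subscription; `logical_zones_for_physical` is a hypothetical helper name.

```python
import json

# Trimmed, illustrative response shape; the zone values below are placeholders.
SAMPLE_RESPONSE = json.loads("""
{
  "value": [
    {
      "name": "eastus",
      "availabilityZoneMappings": [
        {"logicalZone": "1", "physicalZone": "eastus-az3"},
        {"logicalZone": "2", "physicalZone": "eastus-az1"},
        {"logicalZone": "3", "physicalZone": "eastus-az2"}
      ]
    }
  ]
}
""")


def logical_zones_for_physical(response, region, physical_zone):
    """Return the logical zones in `region` that map to `physical_zone`."""
    zones = []
    for location in response.get("value", []):
        if location.get("name") != region:
            continue
        for mapping in location.get("availabilityZoneMappings", []):
            if mapping.get("physicalZone") == physical_zone:
                zones.append(mapping.get("logicalZone"))
    return zones


print(logical_zones_for_physical(SAMPLE_RESPONSE, "eastus", "eastus-az3"))
```

Because the mapping differs per subscription, the lookup would need to be repeated for each subscription that hosts zonal resources.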

What went wrong and why?

At 13:37 UTC on 18 March 2025, a drilling operation near one of our network paths accidentally struck fiber used by Microsoft, causing an unplanned disruption to datacenter connectivity within AZ03. When fiber cuts impact our networking capacity, our systems are designed to redistribute traffic automatically to other paths. In this instance, we had two concurrent failures happen – before the cut, a datacenter router in AZ03 was down for maintenance and was in the process of being repaired. This combination of multiple concurrent failures impacted a small portion of our diverse capacity within AZ03, leading to the potential for intermittent connectivity issues for some customers. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity – customers would have started to see their services recover at this time.

Additionally, after the fiber cut and failed isolation, a linecard failed on another router at 14:16 UTC, further reducing overall capacity to AZ03. However, as traffic had been re-routed, this further reduction in capacity did not cause any additional customer impact.

During the initial mitigation efforts outlined above, our auto-mitigation tool encountered a lock contention problem that blocked commands on the impacted devices, so it failed to isolate all capacity connected to those devices. This failure left some of the impacted capacity un-isolated. Because that capacity was already out of service from the fiber cut and was not carrying production traffic, our systems did not immediately flag the failed isolation state.
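The gap described above, an isolation that partly failed without being flagged, can be pictured as a missing comparison between what was meant to be isolated and what was confirmed as isolated. The sketch below is a generic illustration under that assumption, not a description of Microsoft's tooling; the link identifiers and the `unconfirmed_isolations` helper are hypothetical.

```python
def unconfirmed_isolations(intended, confirmed):
    """Return links that should be isolated but were never confirmed as isolated."""
    return sorted(set(intended) - set(confirmed))


# Hypothetical link identifiers, for illustration only.
intended = ["link-a", "link-b", "link-c"]
confirmed = ["link-a"]  # e.g. commands for the other links were blocked

missing = unconfirmed_isolations(intended, confirmed)
if missing:
    # Surface the gap even though the links carry no traffic right now.
    print("isolation incomplete:", ", ".join(missing))
```

The point of the check is that it runs regardless of whether the affected capacity is currently carrying traffic.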

At approximately 21:00 UTC, our fiber provider commenced recovery work on the damaged fiber. At 23:20 UTC, as individual fibers were repaired, our recovery systems began re-sending traffic to the devices connected to the un-isolated capacity, because the earlier isolation had not covered all of the impacted fiber capacity. This brought those devices back into service without safe levels of capacity, causing traffic congestion that impacted customers as described above.
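One way to picture the missing safeguard is a gate that only readmits a device once its isolation was verified and enough of its capacity is back. The sketch below is a simplified illustration, not Microsoft's recovery system; `DeviceState`, `safe_to_readmit`, and the 0.75 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class DeviceState:
    healthy_links: int         # links currently restored and passing checks
    total_links: int           # links the device normally has
    isolation_confirmed: bool  # whether the earlier isolation was verified


def safe_to_readmit(state, min_fraction=0.75):
    """Allow traffic only once isolation was verified and enough capacity is back.

    min_fraction is an assumed placeholder threshold, not a value from the PIR.
    """
    if not state.isolation_confirmed:
        return False
    if state.total_links <= 0:
        return False
    return state.healthy_links / state.total_links >= min_fraction


# One repaired fiber out of eight is not enough to carry production traffic.
print(safe_to_readmit(DeviceState(healthy_links=1, total_links=8, isolation_confirmed=True)))
```

Under these assumptions, a device whose isolation was never confirmed is held back for manual review rather than readmitted automatically.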

The traffic congestion within AZ03, due to the tooling failure, triggered an unplanned failure mode on a regional hub router that connects multiple datacenters. By design, our network devices attempt to contain congestive packet loss to capacity that is already impacted. Due to the encountered failure mode, this containment failed on a subset of routers, so congestion spread to neighboring capacity on the same regional hub router, beyond AZ03. This containment failure impacted a small subset of traffic from the regional hub router to AZ01 and AZ02.

At this stage, all originally-impacted capacity from the third-party fiber cut was manually isolated from the network – mitigating all customer impact by 00:30 UTC on 19 March. At 01:52 UTC on 19 March, the underlying fiber cut was fully recovered. We then completed the testing and restoration of all capacity to pre-incident levels by 06:50 UTC on 19 March.

How did we respond?

13:37 UTC on 18 March 2025 – Customer impact began, triggered by a fiber cut causing network congestion which led to customers experiencing packet drops or intermittent connectivity. Our monitoring systems identified the impact immediately, so our on-call engineers engaged to investigate.
13:45 UTC on 18 March 2025 – Our fiber provider was notified of the fiber cut and prepared for dispatch.
13:55 UTC on 18 March 2025 – Mitigation efforts began with identifying the impacted datacenters and redirecting traffic to healthier routes.
15:07 UTC on 18 March 2025 – All customers using the East US region were notified about connectivity issues, even if their services were not directly impacted.
16:52 UTC on 18 March 2025 – Mitigation efforts were successfully completed. All devices affected by the fiber cut were isolated, and all customer traffic was using healthy paths and not experiencing congestion.
23:20 UTC on 18 March 2025 – Customer impact recommenced, due to a tooling failure during the capacity repair process of the initial fiber cut.
00:30 UTC on 19 March 2025 – This impact was mitigated after isolating the capacity that was incorrectly added by the tooling failure as part of the recovery process. Customers and services would have experienced full mitigation.
01:52 UTC on 19 March 2025 – The underlying fiber cut was fully restored. We continued to monitor our capacity during the recovery process.
06:50 UTC on 19 March 2025 – Fiber restoration efforts were completed. The incident was confirmed as mitigated.
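
For reference, the lengths of the two customer impact windows in the timeline above can be worked out as follows. This is a small arithmetic sketch using only the timestamps stated in this report; the `utc` helper is a hypothetical convenience function.

```python
from datetime import datetime, timedelta, timezone


def utc(day, hhmm):
    """Build a UTC timestamp in March 2025 from a day number and 'HH:MM'."""
    hour, minute = map(int, hhmm.split(":"))
    return datetime(2025, 3, day, hour, minute, tzinfo=timezone.utc)


# Impact windows as stated in the timeline above.
windows = [
    (utc(18, "13:37"), utc(18, "16:52")),
    (utc(18, "23:20"), utc(19, "00:30")),
]

durations = [end - start for start, end in windows]
total = sum(durations, timedelta())

for (start, end), duration in zip(windows, durations):
    print(f"{start:%d %b %H:%M} to {end:%d %b %H:%M} UTC: {duration}")
print("total:", total)
```

The first window lasts 3 hours 15 minutes and the second 1 hour 10 minutes, for a combined 4 hours 25 minutes of potential impact.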

How are we making incidents like this less likely or less impactful?

We are fixing the tooling failure that caused devices under restoration to take traffic before they were production ready. (Estimated completion: May 2025)
We are expediting a capacity upgrade within the most impacted datacenter, ahead of a planned technology refresh for all datacenters within this region - to de-risk the impact of multiple concurrent failures. (Estimated completion: July 2025)
In the longer term, we are working to limit the scope of impact further – specifically, to prevent the failure of a device from spreading across availability zones. (Estimated completion: February 2026)

How can customers make incidents like this less impactful?

For mission-critical workloads, customers should consider a multi-region geodiversity strategy to avoid impact from incidents like this one that predominantly impacted a single region: https://learn.microsoft.com/training/modules/design-a-geographically-distributed-application and https://learn.microsoft.com/azure/architecture/patterns/geodes
More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
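
As a very small illustration of the multi-region idea above, the sketch below picks the first regional endpoint that passes a health probe. It shows simple priority-based failover, which is only one ingredient of a geodiversity strategy and is not the full pattern described in the linked guidance. The endpoint URLs, `pick_endpoint`, and `fake_probe` are hypothetical.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first endpoint whose health probe passes, else None."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # Treat a probe that raises as unhealthy and try the next region.
            continue
    return None


# Hypothetical endpoints in priority order.
ENDPOINTS = [
    "https://app-eastus.example.com",
    "https://app-westus.example.com",
]


def fake_probe(endpoint):
    """Stand-in probe that pretends the East US deployment is unreachable."""
    return "eastus" not in endpoint


print(pick_endpoint(ENDPOINTS, fake_probe))
```

In practice the probe would be a real health check, and the routing decision is usually delegated to a global load balancer rather than made in client code.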

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/Z_SZ-NV8

Posted Mar 31, 2025 - 22:59 CEST

Update

This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details.

Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident, and get any questions answered by our experts: https://aka.ms/AIR/Z_SZ-NV8

What happened?

Between 13:37 and 16:52 UTC on 18 March 2025, and again between 23:20 UTC on 18 March and 00:30 UTC on 19 March, a combination of a fiber cut and a tooling failure within our East US region impacted a subset of Azure customers with services in that region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic to/from/within Availability Zone 3 (AZ3) in the East US region.

What do we know so far?

At 13:37 UTC, a drilling operation near one of our network paths accidentally struck fiber used by Microsoft, causing an unplanned disruption to datacenter connectivity within physical Availability Zone 3 (AZ3) only. With fiber cuts impacting capacity, our systems are designed to shift traffic automatically to other diverse paths. In this instance, we had two concurrent failures happen – before the cut, a large hub router was down due to maintenance (in the process of being repaired); and after the cut, a linecard failed on another router.

This combination of multiple concurrent failures impacted a small portion of our diverse capacity within AZ3, leading to the potential for retransmits or intermittent connectivity for some customers. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity – customers would have started to see their services recover at this time. The restoration of traffic was fully completed by 16:52 UTC, and the issue was noted as mitigated.

At approximately 21:00 UTC, our fiber provider commenced recovery work on the cut fiber. During the capacity recovery process, at 23:20 UTC, a tooling failure caused our systems to add devices back into the production network before the safe levels of capacity were recovered on the impacted fiber path. As a result, as individual fibers were repaired and brought back into service, our tooling incorrectly began adding devices back to the network. Due to the timing of this tooling failure, traffic was restarted without safe levels of capacity – resulting in congestion that led to customer impact, when sending traffic to/from/within AZ3 of the East US region. The impact was mitigated at 00:30 UTC on 19 March, after manually isolating the capacity affected by this tooling failure.

At 01:52 UTC on 19 March, the underlying fiber cut was fully recovered. We completed the test and restoration of all capacity to pre-incident levels by 06:50 UTC on 19 March.

How did we respond?

13:37 UTC on 18 March 2025 – Customer impact began, triggered by a fiber cut causing network congestion which led to customers experiencing packet drops or intermittent connectivity. Our monitoring systems identified the impact immediately, so our on-call engineers engaged to investigate.
13:45 UTC on 18 March 2025 – Our fiber provider was notified of the fiber cut and prepared for dispatch.
13:55 UTC on 18 March 2025 – Mitigation efforts began with identifying the impacted datacenters and redirecting traffic to healthier routes.
15:07 UTC on 18 March 2025 – All customers using the East US region were notified about connectivity issues.
16:52 UTC on 18 March 2025 – Mitigation efforts were successfully completed. All devices affected by the fiber cut were isolated, and all customer traffic was using healthy paths and not experiencing congestion.
23:20 UTC on 18 March 2025 – Customer impact began, due to a tooling failure during the capacity repair process of the initial fiber cut.
00:30 UTC on 19 March 2025 – This impact was mitigated after isolating the capacity that was incorrectly added by the tooling failure as part of the recovery process. Customers and services would have experienced full mitigation.
01:52 UTC on 19 March 2025 – The underlying fiber cut was fully restored. We continued to monitor our capacity during the recovery process.
06:50 UTC on 19 March 2025 – Fiber restoration efforts were completed. Incident was confirmed as mitigated.

How are we making incidents like this less likely or less impactful?

We are fixing the tooling failure that caused devices being restored to take traffic before they were ready. (Estimated completion: TBD)
We are increasing the bandwidth within the East US region as part of a planned technology refresh, to de-risk the impact of multiple concurrent failures. (Estimated completion: May 2025)

How can customers make incidents like this less impactful?

Consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/Z_SZ-NV8

Posted Mar 22, 2025 - 02:29 CET

Update

What happened?

Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.

At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber. Customers may have experienced the same intermittent connectivity loss and increased latency when sending traffic within, to, and from the East US region.

What do we know so far?

We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025 and the issue was mitigated.

At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure.

At 01:52 UTC on 19 March, the underlying fiber cut was fully restored. We continued to test and restore all capacity to pre-incident levels; these tasks were completed at 06:50 UTC on 19 March.

How did we respond?

13:09 UTC on 18 March 2025 - Fiber cut in East US that caused packet drops. Our monitoring systems identified the impact.
13:55 UTC on 18 March 2025 - Mitigation efforts begin with identifying the impacted data centers and redirecting traffic to healthier routes.
15:07 UTC on 18 March 2025 - Outage declared; all East US customers notified of potential impact.
18:51 UTC on 18 March 2025 - Mitigation efforts have been successfully completed. All devices affected by the fiber cut have been isolated.
23:20 UTC on 18 March 2025 - An additional impact due to tooling failure was noted during the capacity repair process of the previous incident. It was anticipated that the capacity repair process would not impact customers.
00:28 UTC on 19 March 2025 - The second impact was mitigated after isolating the capacity resources impacted by the tooling failure. At this stage most customers and services would have seen full mitigation.
01:52 UTC on 19 March 2025 - The underlying fiber cut has been fully restored. We continued to monitor our capacity during the recovery process.
06:50 UTC on 19 March 2025 - All restoration efforts have been completed. Incident mitigation has been confirmed and declared.

What happens next?

Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded. After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings.
To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts .
For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs .
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring .
Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness

Posted Mar 19, 2025 - 08:36 CET

Update

Impact Statement: Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency when sending traffic within, into, and out of the East US region.

Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025 and the issue was mitigated.

At 01:52 UTC on 19 March, the underlying fiber cut was fully restored. We continue working to test and restore all capacity to pre-incident levels.

Our telemetry data shows that the customer impact has been fully mitigated. We are continuing to monitor the situation during our capacity recovery process before confirming complete resolution of the incident.

An update will be provided in 3 hours, or as events warrant.

Posted Mar 19, 2025 - 07:40 CET

Update

Previous communications would have indicated an incorrect severity (changing from Warning to Informational). We have rectified the error and updated the communication below. We apologize for any confusion this may have caused.

At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber. Customers may have experienced the same intermittent connectivity loss and increased latency when sending traffic within, to, and from the East US region.

At 01:52 UTC on 19 March, the underlying fiber cut was fully restored. We continue working to test and restore all capacity to pre-incident levels.

Our telemetry indicates that customer impact has been fully mitigated. We will continue to monitor during our capacity recovery process before confirming complete incident mitigation.

An update will be provided in 3 hours, or as events warrant.

Posted Mar 19, 2025 - 05:23 CET

Update

At 01:52 UTC on 19 March, the underlying fiber cut was fully restored. We continue working to test and restore all capacity to pre-incident levels.

Our telemetry indicates that customer impact has been fully mitigated. We will continue to monitor during our capacity recovery process before confirming complete incident mitigation.

An update will be provided in 3 hours, or as events warrant.

Posted Mar 19, 2025 - 05:12 CET

Update

Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18th March 2025. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. At 13:55 UTC on 18th March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 UTC on 18th March 2025 and the issue was mitigated.

At 23:20 UTC on 18th March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19th March after isolating the capacity impacted by the tooling failure.

At 01:52 UTC on 19th March, the underlying fiber cut was fully restored. We are now working to test and restore all capacity to pre-incident levels.

An update will be provided in 60 minutes, or as events warrant.

Posted Mar 19, 2025 - 03:47 CET

Update

What happened?

At 23:21 UTC, another impact to network capacity occurred during the recovery of the underlying fiber. Customers may have experienced the same intermittent connectivity loss and increased latency when sending traffic within, to, and from the East US region.

Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC. The fiber cut impacted capacity to those datacenters increasing the utilization for the remaining capacity serving the affected datacenters. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 UTC and the issue was mitigated.

At 23:20 UTC, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. We are actively mitigating the current impact to ensure no further incidents occur during the recovery process.

An update will be provided in 60 minutes, or as events warrant.

Posted Mar 19, 2025 - 02:19 CET

Update

After further investigation, we determined that we are still working with our providers on fiber repairs. However, as previously communicated, no further impact should be experienced. We apologize for the inconvenience caused.

What happened?

Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cut impacted capacity to those datacenters, increasing the utilization of the remaining capacity serving the affected datacenters. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 UTC and the issue was mitigated. While fixing the fiber will take time, there should be no further impact to customers, as impacted devices are isolated and traffic has been shifted to healthier routes. Further updates will be provided in 6 hours or as events warrant.

Posted Mar 19, 2025 - 00:54 CET

Update

What do we know so far?

We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cut impacted capacity to those datacenters, increasing the utilization of the remaining capacity serving the affected datacenters. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity; customers should have started to see service recover starting at this time. The restoration of traffic was fully completed by 18:51 UTC and the issue was mitigated. While fixing the fiber will take time, there should be no further impact to customers, as impacted devices are isolated and traffic has been shifted to healthier routes.

How did we respond?

At 13:09 UTC on 18 March 2025 - Fiber cut in East US that may have caused packet drops. Monitoring systems identified the impact.
At 13:55 UTC on 18 March 2025 - Mitigation began with identifying the impacted datacenters and shifting traffic to healthier routes.
At 15:07 UTC on 18 March 2025 - Outage declared; all customers in East US were informed of possible impact.
At 18:51 UTC on 18 March 2025 - Mitigation complete. All devices impacted by the fiber cut were isolated.

What happens next?

Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.
To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts .
For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs .
The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness .

Posted Mar 18, 2025 - 22:11 CET

Update

What happened?

This issue is now mitigated. Additional information will be provided shortly.

Posted Mar 18, 2025 - 21:47 CET

Update

Impact Statement: Starting at 13:09 UTC on 18 March 2025, a subset of Azure customers in the East US region may experience intermittent connectivity loss and increased network latency when sending traffic within, into, and out of Azure's East US region.

Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cut impacted capacity to those datacenters, increasing the utilization of the remaining capacity serving the affected datacenters. We have mitigated the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity. Impacted customers should now see their services recover. In parallel, we are working with our providers on fiber repairs. We do not yet have a reliable ETA for repairs. We will continue to provide updates here as they become available.

Posted Mar 18, 2025 - 18:40 CET

Update

Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cut impacted capacity to those datacenters, increasing the utilization of the remaining capacity of the affected datacenters. We are actively re-routing traffic to mitigate high utilization and restore connectivity for impacted customers. Underlying fiber repairs have been initiated with our provider, but we do not yet have an ETA for fiber repairs. An update will be provided in 60 minutes, or as events warrant.

Posted Mar 18, 2025 - 18:19 CET

Update

Current Status: We identified a networking issue affecting a subset of datacenters in the East US region. We are continuing to investigate the contributing factors of this issue, and in parallel, we are actively re-routing affected traffic to minimize the impact on networking-dependent services. The next update will be provided within 60 minutes, or as events warrant.

Posted Mar 18, 2025 - 17:15 CET

Investigating

Impact Statement: Starting at 13:09 UTC on 18 March 2025, you have been identified as an Azure customer in the East US region who may experience intermittent connectivity loss and increased network latency in the region.

Current Status: We are aware of the issue and actively working on mitigation workstreams to reroute traffic and mitigate impact for customers. The next update will be provided within 60 minutes, or as events warrant.

Posted Mar 18, 2025 - 16:47 CET