Microsoft Incident - Network Infrastructure - Post Incident Review (PIR) - Network connectivity - Issues impacting Azure services in West Europe

Incident Report for Graphisoft

Resolved

This incident has been resolved.

Posted Apr 15, 2025 - 08:23 CEST

Update

What happened?

Between 17:24 and 18:45 UTC on 18 March 2025, a subset of customers may have experienced network connectivity issues with resources hosted in the West Europe region. We determined that only inter-region traffic - meaning traffic moving to or from other Azure regions and West Europe - was affected. ExpressRoute and Internet traffic were not impacted.

What went wrong and why?

From as early as 09:02 UTC, we were engaged on incidents in which we were mitigating failures of networking devices across multiple regions, where Wide Area Network (WAN) routers were failing to renew their certificates. These issues were caused by a combination of a decommissioned certificate authority and a bug affecting agents on a subset of our devices.

Our automated systems create and renew certificates using Certificate Authority (CA) servers. During this event, we uncovered a bug in a subset of our agents: when the primary CA server was unavailable and the agent performed a renewal using the secondary CA server, it corrupted the working certificate. This caused the agent to restart continuously.
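
As an illustration only, and not the actual agent implementation, the failure mode described above can be avoided by never touching the working certificate until a replacement has been fetched and validated. The sketch below assumes a hypothetical `renew_certificate` helper, a list of hypothetical fetch callables (primary CA first, secondary CA second) and a hypothetical `is_valid` check. None of these names come from the incident report.

```python
import os
import tempfile
from typing import Callable, Iterable


def renew_certificate(
    cert_path: str,
    ca_fetchers: Iterable[Callable[[], bytes]],
    is_valid: Callable[[bytes], bool],
) -> bool:
    """Try each CA in order; replace cert_path only with a validated certificate.

    All parameter names are hypothetical and exist only for this sketch.
    Returns True if a new certificate was installed, False otherwise.
    """
    for fetch in ca_fetchers:
        try:
            candidate = fetch()
        except Exception:
            continue  # this CA is unavailable; fall back to the next one
        if not is_valid(candidate):
            continue  # never install a certificate that fails validation

        # Write the candidate to a temporary file in the same directory,
        # then swap it in. os.replace is atomic when source and destination
        # are on the same filesystem, so readers see the old file or the new
        # one, never a partially written file.
        directory = os.path.dirname(os.path.abspath(cert_path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(candidate)
                tmp.flush()
                os.fsync(tmp.fileno())
            os.replace(tmp_path, cert_path)
        except BaseException:
            if os.path.exists(tmp_path):
                os.remove(tmp_path)
            raise
        return True
    return False  # every CA failed; the working certificate is untouched
```

With this shape, a failed renewal through the secondary CA leaves the existing certificate in place, so the agent has no corrupted file to restart on.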

In West Europe, we engaged on an issue where the agent running on a backbone router was failing to renew its certificate. A backbone router is a crucial component in networks that provides the primary path for data traffic across different segments or parts of a network.

At 17:22 UTC, the automated mitigation initiated the isolation of the unhealthy backbone device from serving traffic. At 17:24 UTC, the failed renewal led to the agent restarting and putting the router’s forwarding database in an inconsistent state. A forwarding database is a table maintained to efficiently forward packets to their intended destinations. At 17:29 UTC, the isolation process was cancelled because it was believed to be related to the non-backbone devices that were failing across West Europe. As this was a backbone device, it should have remained isolated; instead, it was inadvertently brought back into rotation. As a result of the inconsistent state, we observed that approximately 25% of inter-region traffic entering and exiting the West Europe region was being erroneously dropped, or blackholed.
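
The distinction drawn above, that an unhealthy backbone device should stay isolated even when an isolation is cancelled, can be expressed as a simple guard. This is a minimal sketch under stated assumptions, not the actual mitigation tooling: the `Device` type, its role labels and `cancel_isolation` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device record used only in this sketch."""
    name: str
    role: str            # "backbone" or "non-backbone" (assumed labels)
    healthy: bool
    isolated: bool = False


def cancel_isolation(device: Device) -> bool:
    """Return True if the device was returned to rotation, False if refused."""
    if device.role == "backbone" and not device.healthy:
        # An unhealthy backbone device must stay out of rotation.
        return False
    device.isolated = False
    return True
```

A cancellation request for an unhealthy backbone router is refused and leaves `isolated` set, while the same request for a non-backbone device proceeds.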

Mitigation efforts were delayed by the ongoing response to the decommissioning of the certificate authority and the uncovered bug, which made it difficult for us to broadly assess our monitoring signals. Our efforts were primarily focused on patching devices that serve all Internet and ExpressRoute traffic.

How did we respond?

09:02 UTC on 18 March 2025 – Issues were detected across multiple regions due to WAN routers failing to renew their certificates.
17:10 UTC on 18 March 2025 – A backbone router in West Europe failed to renew its certificate.
17:22 UTC on 18 March 2025 – Automated mitigation efforts initiated to isolate the unhealthy backbone router in West Europe.
17:24 UTC on 18 March 2025 – Failure in the certificate renewal process for a backbone router led to the agent restarting.
17:29 UTC on 18 March 2025 – Unhealthy backbone router inadvertently brought back into service, exacerbating connectivity issues.
18:25 UTC on 18 March 2025 – We determined the issue was due to incorrect certificates being erroneously downloaded, leading to traffic black-holing.
18:40 UTC on 18 March 2025 – The mitigation was applied by fixing the configuration on the affected routers.
18:45 UTC on 18 March 2025 – Network recovery complete.
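
For readers who want the intervals implied by the timeline above, the arithmetic can be checked with a short script. The timestamps are taken from the list; the interval names (`impact`, `diagnosis`, `fix`, `recovery`) are labels chosen for this sketch and are not terms used in the review.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps (UTC)."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


impact = minutes_between("17:24", "18:45")     # agent restart to full recovery
diagnosis = minutes_between("17:24", "18:25")  # agent restart to cause determined
fix = minutes_between("18:25", "18:40")        # cause determined to mitigation applied
recovery = minutes_between("18:40", "18:45")   # mitigation applied to recovery complete

print(impact, diagnosis, fix, recovery)  # prints: 81 61 15 5
```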

How are we making incidents like this less likely or less impactful?

We have updated our configurations for the primary certificate authority across our WAN devices. (Completed)
We have built the capability to incorporate a broader set of monitoring signals. (Completed)
We have updated our playbooks to isolate unhealthy devices more robustly. (Completed)
We have rolled out patched versions of the agents that address certificate handling. (Completed)
We are expanding our automation to assess and mitigate forwarding database health, removing the need for manual intervention. (Estimated completion: September 2025)

How can customers make incidents like this less impactful?

For mission-critical workloads, customers should consider a multi-region geodiversity strategy to avoid impact from incidents like this one that only impacted a single region: https://learn.microsoft.com/training/modules/design-a-geographically-distributed-application and https://learn.microsoft.com/azure/architecture/patterns/geodes
Consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/azure/architecture/framework/resiliency
The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, ensure that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These alerts can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts. To receive Entra ID email notifications specifically, see: https://learn.microsoft.com/entra/identity/monitoring-health/howto-configure-health-alert-emails
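
As one possible illustration of the multi-region recommendation above, a client can be written to try a list of regional endpoints in order and use the first one that passes a health probe. This is a sketch under stated assumptions, not Azure guidance: `pick_endpoint`, the `probe` callable and the example URLs are hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str],
    probe: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint whose probe succeeds, or None if all fail."""
    for url in endpoints:
        try:
            if probe(url):
                return url
        except Exception:
            continue  # treat a probe error as an unhealthy endpoint
    return None


# Hypothetical regional endpoints, listed in order of preference.
REGIONS = [
    "https://westeurope.example.invalid/health",
    "https://northeurope.example.invalid/health",
]
```

In practice the probe would be an HTTP health check with a short timeout. It is left as a parameter here so that the selection logic can be exercised without network access.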

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/TL8S-TX8

Posted Mar 28, 2025 - 19:47 CET

Update

What happened?

Between 17:24 and 18:45 UTC on 18 March 2025, a subset of customers may have experienced network connectivity issues with resources hosted in the West Europe region. We determined that only inter-region traffic - meaning traffic moving between other Azure regions and West Europe - was affected. ExpressRoute and internet traffic were not impacted.

What do we know so far?

The issue originated when a service running on the device downloaded an incorrect certificate from an old certificate server. This led to the service restarting and putting the device's forwarding database in an inconsistent state. We observed that inter-region traffic entering and exiting West Europe was being erroneously discarded.

We identified a specific networking router as the problem. The issue was resolved by updating the service configuration to point to the new certificate server.

How did we respond?

17:26 UTC on 18 March 2025 – Network connectivity issues were first observed.

17:26 UTC on 18 March 2025 – The issue was detected shortly after the impact started.

18:25 UTC on 18 March 2025 – We determined the issue was due to an incorrect certificate being erroneously downloaded, leading to traffic black-holing.

18:45 UTC on 18 March 2025 – The mitigation was applied by fixing the configuration on the affected routers.

18:45 UTC on 18 March 2025 – Immediately after we made the change, we observed recovery and the network returned to normal operation.

What happens next?

After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings.
To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs
The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness

Posted Mar 18, 2025 - 21:53 CET

Update

Between 17:24 and 18:45 UTC on 18 March 2025, a subset of customers may have experienced network connectivity issues with resources hosted in the West Europe region. We determined that only inter-region traffic - meaning traffic moving between other Azure regions and West Europe - was affected. ExpressRoute and internet traffic were not impacted.

We identified a specific networking router as the problem. The issue was resolved by updating the service configuration to point to the new certificate server.

Posted Mar 18, 2025 - 20:44 CET

Update

Starting at approximately 17:26 UTC on 18 March 2025, a subset of customers is experiencing network connectivity issues with resources hosted in West Europe. We have determined that the impact affects inter-region traffic: traffic moving between another Azure region and West Europe might be affected. We are starting to see recovery.

We can confirm that ExpressRoute and internet traffic are not impacted.

More information will be provided shortly.

Posted Mar 18, 2025 - 20:22 CET

Investigating

An outage alert is being investigated. More information will be provided as it is known.

Posted Mar 18, 2025 - 19:49 CET