What happened?
Between 17:24 and 18:45 UTC on 18 March 2025, a subset of customers may have experienced network connectivity issues with resources hosted in the West Europe region. We determined that only inter-region traffic - that is, traffic moving between West Europe and other Azure regions - was affected. ExpressRoute and Internet traffic were not impacted.
What went wrong and why?
From as early as 09:02 UTC, we were engaged on incidents across multiple regions, mitigating failures in which Wide Area Network (WAN) routers were failing to renew their certificates. These failures stemmed from a combination of a decommissioned certificate authority and a bug affecting the agents on a subset of our devices.
Our automated systems create and renew certificates using Certificate Authority (CA) servers. During this event, we uncovered a bug in a subset of our agents: when the primary CA server was unavailable and the renewal was performed against the secondary CA server, the agent corrupted its working certificate. This caused the agent to restart continuously.
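As an illustration only - this is a hypothetical sketch, not the actual agent implementation, and all names (`CertStore`, `buggy_renew`, `safe_renew`) are our own - the failure mode described above resembles a renewal routine that overwrites the working certificate before validating the response from the fallback CA:

```python
# Hypothetical sketch of the renewal failure mode -- not Azure's agent code.
# Certificates are modeled as strings; a valid one starts with "CERT:".

class CertStore:
    def __init__(self, cert):
        self.cert = cert  # the working certificate

    def validate(self, cert=None):
        cert = self.cert if cert is None else cert
        return cert is not None and cert.startswith("CERT:")

def buggy_renew(store, renewed):
    """Overwrite first, validate after: a bad response from the
    secondary CA destroys the working certificate, and the agent
    restarts -- continuously, since every retry repeats the pattern."""
    store.cert = renewed
    if not store.validate():
        raise RuntimeError("working certificate corrupted; agent restarting")

def safe_renew(store, renewed):
    """Validate the candidate first, then swap: a bad response from
    either CA never touches the working certificate."""
    if store.validate(renewed):
        store.cert = renewed
    # otherwise keep the current certificate and retry later
```

The safer pattern - validate, then atomically swap - is why a failed renewal should degrade to "keep serving with the old certificate" rather than "lose the working certificate".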
In West Europe, we engaged on an issue in which the agent running on a backbone router was failing to renew its certificate. A backbone router is a crucial network component that provides the primary path for data traffic between different segments of a network.
At 17:22 UTC, automated mitigation began isolating the unhealthy backbone device from serving traffic. At 17:24 UTC, the failed renewal caused the agent to restart, leaving the router’s forwarding database in an inconsistent state. A forwarding database is the table a router maintains to efficiently forward packets to their intended destinations. At 17:29 UTC, however, the isolation process was cancelled, as the failure was believed to be related to the non-backbone devices failing across West Europe. Because this was a backbone device, it should have remained isolated; instead, it was inadvertently brought back into rotation. As a result of the inconsistent state, we observed that approximately 25% of inter-region traffic entering and exiting the West Europe region was being erroneously dropped, or blackholed.
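In simplified terms - this is an illustrative model, not real router code, and real routers program forwarding tables in hardware - a forwarding database maps destination prefixes to next hops. When the table loses entries that the routing layer still advertises, traffic for those destinations is silently dropped:

```python
# Illustrative model of a forwarding database (FIB) -- not router code.

def forward(fib, destination):
    """Return the next hop for a destination, or None if the FIB has
    no entry -- in which case the router drops the packet silently
    (a "blackhole"), with no error signaled back to the sender."""
    return fib.get(destination)

# Consistent FIB: every advertised destination has a next hop.
fib = {"region-a": "link-1", "region-b": "link-2"}

# After the agent restart, suppose some entries were lost, leaving the
# FIB inconsistent with what routing still advertises to peers.
inconsistent_fib = {"region-a": "link-1"}

assert forward(fib, "region-b") == "link-2"            # delivered
assert forward(inconsistent_fib, "region-b") is None   # blackholed
```

Because peers still see routes advertised toward the unhealthy device, they keep sending traffic into it - which is why blackholing is harder to detect than a link that fails outright.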
Mitigation efforts were delayed by the ongoing response to the decommissioned certificate authority and the newly uncovered bug, which made it difficult for us to broadly assess our monitoring signals. Our efforts were primarily focused on patching the devices that serve all Internet and ExpressRoute traffic.
How did we respond?
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/TL8S-TX8
What happened?
Between 17:24 and 18:45 UTC on 18 March 2025, a subset of customers may have experienced network connectivity issues with resources hosted in the West Europe region. We determined that only inter-region traffic - that is, traffic moving between West Europe and other Azure regions - was affected. ExpressRoute and internet traffic were not impacted.
What do we know so far?
The issue originated from a service running on the device that downloaded an incorrect certificate from an old certificate server. This led to the service restarting and putting the device's forwarding database into an inconsistent state. We observed that inter-region traffic entering and exiting West Europe was being erroneously discarded.
We identified a specific networking router as the source of the problem. The issue was resolved by updating the service configuration to point to the new certificate server.
How did we respond?
17:26 UTC on 18 March 2025 – Network connectivity issues were first observed.
17:26 UTC on 18 March 2025 – The issue was detected shortly after the impact started.
18:25 UTC on 18 March 2025 – We determined that the issue was due to an incorrect certificate being erroneously downloaded, leading to traffic black-holing.
18:45 UTC on 18 March 2025 – The mitigation was applied by fixing the configuration on the affected routers.
18:45 UTC on 18 March 2025 – Immediately after we made the change, we observed recovery and the network returned to normal operation.
What happens next?
Starting at approximately 17:26 UTC on 18 March 2025, a subset of customers are experiencing network connectivity issues with resources hosted in West Europe. We have determined that the impact affects inter-region traffic, meaning traffic moving between West Europe and other Azure regions might be affected. We are starting to see recovery.
We can confirm that ExpressRoute and internet traffic are not impacted.
More information will be provided shortly.
An outage alert is being investigated. More information will be provided as it is known.