Post Incident Review (PIR) – Multiple services – Single zone power issue in North Europe
What happened?
Between 08:40 and 12:55 UTC on 1 April 2025, we identified customer impact resulting from a power event in a single data hall within Availability Zone 02, one of the three Availability Zones that comprise the North Europe region.
Note that the 'logical' availability zones used by each customer subscription may correspond to different physical availability zones. Customers can use the Locations API to retrieve this mapping and confirm which resources run in which physical availability zone, see: https://learn.microsoft.com/rest/api/resources/subscriptions/list-locations?HTTP#availabilityzonemappings.
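For illustration, the following is a minimal Python sketch of how that mapping can be retrieved. It assumes the azure-identity and requests packages, the 2022-12-01 api-version, and the availabilityZoneMappings response fields described at the link above; treat it as a starting point under those assumptions rather than a definitive implementation.

# Minimal sketch: list logical-to-physical availability zone mappings for a
# subscription via the Locations API (assumptions noted above).
import requests
from azure.identity import DefaultAzureCredential

subscription_id = "<your-subscription-id>"  # placeholder, substitute your own
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

resp = requests.get(
    f"https://management.azure.com/subscriptions/{subscription_id}/locations",
    params={"api-version": "2022-12-01"},
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()

# Print the logical-to-physical zone mapping for North Europe.
for location in resp.json().get("value", []):
    if location.get("name") == "northeurope":
        for mapping in location.get("availabilityZoneMappings", []) or []:
            print(mapping.get("logicalZone"), "->", mapping.get("physicalZone"))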
Impacted downstream services included Azure Site Recovery, Application Gateway, Azure NetApp Files, Azure ExpressRoute, Service Bus, and Azure Cosmos DB.
What went wrong and why?
This site is powered by two main transformers which, via a series of voltage distribution boards, supply power to the A and B Uninterruptible Power Supply (UPS) systems. While Feed A was shut down for maintenance, Feed B was carrying the load at 'N' redundancy instead of our standard N+1 redundancy. A failure then occurred in the Feed B UPS system, tripping all related breakers, faulting the UPS, and dropping the load on Feed B. Under normal operation Feed A would have provided the redundancy, but it was unavailable because it was down for maintenance. After an in-depth analysis of the electrical terminal board within the UPS, we identified an incorrectly tightened fuse as the initial trigger of this event; the fuse terminal showed discoloration typical of severe overheating.
How did we respond?
When the UPS device experienced a major failure, the team acted swiftly, following an Emergency Operating Procedure (EOP) to place the system into 'External Manual Bypass' and restore service. Onsite technicians noticed smoke from the UPS supporting Feed B, extinguished a fire, and were then able to inspect the IT hardware. Because the Feed A maintenance had been paused before any maintenance operations actually started, power from Feed A could be restored. This meant that every server in the data hall was again backed by two power feeds - one with UPS and battery backup, and the other with raw utility power.
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey.