Microsoft Incident - Azure Cosmos DB - Post Incident Review (PIR) – Multiple services – Single zone power issue in North Europe

Incident Report for Graphisoft

Update

Post Incident Review (PIR) – Multiple services – Single zone power issue in North Europe




What happened?


Between 08:40 and 12:55 UTC on 1 April 2025, we identified customer impact resulting from a power event in a single data hall within Availability Zone 02, one of the three Availability Zones that comprise the North Europe region.


 


Note that the 'logical' availability zones used by each customer subscription may correspond to different physical availability zones. Customers can use the Locations API to understand this mapping and confirm which resources run in which physical AZ – see: https://learn.microsoft.com/rest/api/resources/subscriptions/list-locations?HTTP#availabilityzonemappings.
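
For reference, the following is a minimal sketch (in Python, using the azure-identity and requests packages – assumptions about your tooling) of retrieving this logical-to-physical zone mapping for a subscription. The api-version and placeholder subscription ID are assumptions; consult the linked documentation for the authoritative request and response schema.

    # Sketch: print logical-to-physical availability zone mappings per region.
    # Assumes Reader access to the subscription and the packages noted above.
    import requests
    from azure.identity import DefaultAzureCredential

    subscription_id = "<your-subscription-id>"  # placeholder
    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

    resp = requests.get(
        f"https://management.azure.com/subscriptions/{subscription_id}/locations",
        params={"api-version": "2022-12-01"},  # assumed api-version; adjust per the docs
        headers={"Authorization": f"Bearer {token}"},
    )
    resp.raise_for_status()

    for location in resp.json().get("value", []):
        for mapping in location.get("availabilityZoneMappings") or []:
            print(f"{location['name']}: logical zone {mapping['logicalZone']} "
                  f"-> physical zone {mapping['physicalZone']}")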


 


Impacted downstream services included:


 


  • Azure Cache for Redis: Between 08:51 and 12:33 UTC, some customers using the Azure Cache for Redis Non-Basic tier may have experienced availability issues while trying to use the cache.
  • Azure Database for PostgreSQL flexible servers: Between 08:40 and 10:44 UTC, some customers using the Azure Database for PostgreSQL flexible servers service may have encountered connectivity failures and/or timeouts when executing operations, as well as unavailability of resources hosted in this region.
  • Azure Databricks: Between 09:50 and 10:10 UTC, some customers may have experienced failures when attempting to launch jobs and compute resources, although retried requests may have succeeded.
  • Azure SQL Database: Between 09:33 and 11:33 UTC, some customers using the SQL Database service may have experienced intermittent issues performing service management operations. Retrieving information about servers and databases through the Azure Portal may have resulted in an error or timeout. Server and database create, drop, rename, and change edition or performance tier operations may not have completed successfully.
  • Azure Storage: Between 09:05 and 12:51 UTC, some customers using Storage may have experienced higher than expected latency, timeouts or HTTP 500 errors when accessing data stored on Storage accounts hosted in this region.
  • Azure Virtual Machines: Between 09:02 and 12:53 UTC, some customers may have experienced connection failures along with error notifications when performing service management operations - such as create, delete, update, restart, reimage, start, stop - for resources hosted in this region.
  • Azure Virtual Machine Scale Sets: Between 09:00 and 11:15 UTC, some customers may have received error notifications when performing service management operations - such as create, delete, update, scaling, start, stop - for resources hosted in this region.



Other impacted services included Azure Site Recovery, Application Gateway, Azure NetApp Files, Azure ExpressRoute, Service Bus, and Azure Cosmos DB.




What went wrong and why?


This site is powered by two main transformers which, via a series of voltage distribution boards, supply power to the A and B Uninterruptible Power Supply (UPS) systems. Following a shutdown of Feed A for maintenance, Feed B was carrying the load at 'N' redundancy instead of our standard N+1 redundancy. A failure then occurred in the UPS system of Feed B, tripping all related breakers and causing the UPS system to fault and drop the load on Feed B. Under normal operation Feed A would have provided the redundancy, but it was unavailable because it was down for maintenance. After an in-depth analysis of the electrical terminal board within the UPS, we identified an incorrectly tightened fuse as the initial trigger of this event. The fuse terminal showed a color change typical of severe overheating.




How did we respond?


When the UPS device experienced a major failure, the team acted swiftly and used an Emergency Operating Procedure (EOP) to put the system into an 'External Manual Bypass' and restore service. Onsite technicians noticed smoke from the UPS supporting Feed B, extinguished a fire, and were then able to inspect the IT hardware. The maintenance on Feed A was paused before any of the maintenance operations had started, so power could be restored through that feed. This meant that every server in the data hall was once again backed by two power feeds - one with UPS and battery backup, and the other with raw utility power.




  • 08:40 UTC on 1 April 2025 – Power event occurred.
  • 08:40 UTC on 1 April 2025 – Technicians from the datacenter operations team engaged.
  • 08:50 UTC on 1 April 2025 – Technicians noticed smoke from the UPS supporting Feed B.
  • 08:51 UTC on 1 April 2025 – Customer impact identified from the ongoing power maintenance event.
  • 08:52 UTC on 1 April 2025 – Fire extinguished in the UPS supporting Feed B.
  • 08:57 UTC on 1 April 2025 – Once the room was confirmed safe to enter, the EOP was started.
  • 08:59 UTC on 1 April 2025 – UPS was placed into manual bypass via the EOP.
  • 09:04 UTC on 1 April 2025 – UPS completely isolated and in external manual bypass.
  • 09:05 UTC on 1 April 2025 – Power was restored to affected devices; service-specific recovery began.
  • 09:20 UTC on 1 April 2025 – Outage declared, affected dependent services identified, customers notified via Azure Portal.
  • 10:10 UTC on 1 April 2025 – Azure Databricks confirmed as mitigated.
  • 11:33 UTC on 1 April 2025 – Azure SQL Database confirmed as mitigated.
  • 12:33 UTC on 1 April 2025 – Azure Cache for Redis confirmed as mitigated.
  • 12:51 UTC on 1 April 2025 – Azure Storage confirmed as mitigated.
  • 12:53 UTC on 1 April 2025 – Azure Virtual Machines confirmed as mitigated.
  • 12:55 UTC on 1 April 2025 – Full mitigation confirmed.


How are we making incidents like this less likely or less impactful? 

  • We are reviewing the UPS redundancy design and philosophy, to identify areas for improvement in this scenario. (Estimated completion: May 2025).
  • We are reviewing and updating our torque/tightening process to help ensure that we apply torque consistently, to minimize the risk of this class of incident in the future. (Estimated completion: May 2025).
  • We are reviewing and updating our processes for post-retrofit testing of our UPS battery backups, to ensure they adhere to our global standard and to help minimize the potential for incidents like this in the future. (Estimated completion: May 2025).
  • We have engaged our Forensics Engineering reliability team to do a deep dive on this power event and conduct a Failure Mode & Effects Analysis (FMEA) study to assess long-term resilience improvements. (Estimated completion: June 2025).


How can customers make incidents like this less impactful?

  • Consider using Availability Zones (AZs) to run your services across physically separate locations within an Azure region. Each AZ provides independent power, networking, and cooling, which helps services be more resilient to datacenter-level failures like this one. Many Azure services support zonal, zone-redundant, and/or always-available configurations: https://docs.microsoft.com/azure/availability-zones/az-overview (a sketch for checking how your existing compute resources are spread across zones follows this list).
  • For mission-critical workloads, customers should consider a multi-region geodiversity strategy to reduce the impact of incidents like this one, which affected only a single region: https://learn.microsoft.com/training/modules/design-a-geographically-distributed-application and https://learn.microsoft.com/azure/architecture/patterns/geodes 
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts (a sketch for creating such an alert programmatically follows this list).
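
To support the Availability Zones recommendation above, the following is a minimal sketch (in Python, using the azure-identity and azure-mgmt-compute packages – assumptions about your tooling) that inventories the virtual machines in a subscription and shows which zone, if any, each is pinned to. VMs reporting no zones are regional, so their placement relative to any given zone is not guaranteed.

    # Sketch: list VMs in a subscription with their availability zone placement.
    # Assumes Reader access and the packages noted in the text above.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.compute import ComputeManagementClient

    subscription_id = "<your-subscription-id>"  # placeholder
    compute = ComputeManagementClient(DefaultAzureCredential(), subscription_id)

    for vm in compute.virtual_machines.list_all():
        zones = ", ".join(vm.zones) if vm.zones else "regional (no zone pinned)"
        print(f"{vm.name} ({vm.location}): zones = {zones}")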
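
To support the Service Health alerts recommendation above, the following is a rough sketch of creating an activity log alert scoped to ServiceHealth events via the ARM REST API, in the same Python style. The resource group, alert name, action group ID, and api-version are placeholders/assumptions – see https://aka.ms/AzPIR/Alerts for the supported ways to configure these alerts and for the authoritative schema.

    # Sketch: create an Activity Log alert that fires on Service Health events
    # and routes notifications through an existing action group (placeholder ID).
    import requests
    from azure.identity import DefaultAzureCredential

    subscription_id = "<your-subscription-id>"
    resource_group = "<your-resource-group>"
    alert_name = "service-health-alert"
    action_group_id = "<resource-id-of-an-existing-action-group>"

    token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
    url = (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}/providers/Microsoft.Insights"
        f"/activityLogAlerts/{alert_name}"
    )
    body = {
        "location": "Global",
        "properties": {
            "enabled": True,
            "scopes": [f"/subscriptions/{subscription_id}"],
            # Fire on any Service Health event within the subscription scope.
            "condition": {"allOf": [{"field": "category", "equals": "ServiceHealth"}]},
            "actions": {"actionGroups": [{"actionGroupId": action_group_id}]},
        },
    }
    resp = requests.put(
        url,
        params={"api-version": "2020-10-01"},  # assumed api-version; adjust per the docs
        headers={"Authorization": f"Bearer {token}"},
        json=body,
    )
    resp.raise_for_status()
    print("Created/updated alert:", resp.json().get("id"))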


How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey.

Posted Apr 17, 2025 - 22:00 CEST
