Microsoft Incident – Azure Resource Manager – Post Incident Review (PIR): Timeouts or 5xx responses from ARM while calling an older API

Incident Report for Graphisoft

Update

Post Incident Review (PIR) – Azure Resource Manager – Timeouts or 5xx responses from ARM while calling an older API




What happened?


On 27 February 2025 a platform issue in Azure Resource Manager (ARM) caused inadvertent throttling that impacted multiple services:


  • Between 07:30 and 11:46 UTC on 27 February 2025, a subset of customers using Azure Resource Manager were impacted when some calls that retrieve authentication details via internal backend systems may have returned 504 responses in the West Europe region.
  • Between 09:28 and 11:07 UTC on 27 February 2025, a subset of customers using Azure Log Analytics or Application Insights Query APIs were impacted when some calls may have experienced transient failures or degraded performance in the West Europe region.
  • Between 09:39 and 11:06 UTC on 27 February 2025, a subset of customers using Azure Container Apps were impacted when some calls, via Azure Resource Manager, may have experienced errors when attempting to create application containers in the West Europe region.



What went wrong and why?


When Azure Resource Manager (ARM) receives a request for authentication and authorization, in some specific scenarios it leverages an older API. During this incident, the backend system responsible for these API calls experienced an unexpected rise in traffic and, as a result, throttled some of those calls. Throttling is a common resiliency strategy that regulates the rate at which internal resources are accessed: it prevents the system from being overwhelmed by a large volume of requests while still serving the majority of them. During the impact window, an unusual rise in requests from an internal Azure service triggered this internal throttling, resulting in a higher number of 504 errors.
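Rate-based throttling of the kind described above is commonly implemented as a token bucket. The following is a minimal illustrative sketch of that general technique, not ARM's actual implementation; all names are hypothetical:

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter: requests are allowed while
    tokens remain; the bucket refills at a fixed rate, so sustained
    traffic above that rate gets throttled."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                # a real service would respond 429/503/504 here

bucket = TokenBucket(rate=5, capacity=10)
results = [bucket.allow() for _ in range(15)]
print(results.count(True))   # in a rapid burst of 15, roughly 10 pass
```

The key property, as the incident illustrates, is that an unexpected traffic surge from even a single well-behaved caller can exhaust the bucket and cause rejections for other callers sharing the same backend.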




How did we respond?


  • 07:30 UTC on 27 February 2025 - Customer impact began. Our internal monitoring system alerted us to this issue, prompting us to initiate an investigation.
  • 10:13 UTC on 27 February 2025 - Engineers determined this issue was caused by an increase in service traffic.
  • 11:00 UTC on 27 February 2025 - The Azure platform self-healed the issue.
  • 11:11 UTC on 27 February 2025 - After a period of monitoring to validate the mitigation, we confirmed service functionality had been restored, and no further impact was observed at this time.
  • 11:30 UTC on 27 February 2025 - As a proactive measure, our internal teams redistributed workloads across other nearby regions to prevent future recurrences while we continue to perform deeper investigations.


How are we making incidents like this less likely or less impactful?

  • We are updating our code to make fewer requests to this API, reducing the chance of reaching the threshold at which throttling occurs. (Estimated completion: June 2025)
  • We are working with the respective back-end teams to improve resiliency, so that traffic patterns like the one seen during this incident are handled more gracefully in the future. (Estimated completion: June 2025)
  • Additionally, the impacted API is on a deprecation path. We have modernized that part of our system and have been gradually migrating workflows to the replacement system; most of our systems are already integrated with it. We will continue with the full deprecation of this API. (Estimated completion: November 2025)


How can customers make incidents like this less impactful?

  • While this instance was caused by internal back-end systems in Azure, we also recommend that customers implement appropriate retry logic so that applications handle transient failures effectively. See: https://learn.microsoft.com/azure/architecture/patterns/retry
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
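The retry pattern linked in the guidance above can be sketched as a small wrapper with exponential backoff and jitter. This is an illustrative sketch only (function and variable names are hypothetical); the official Azure SDKs ship with built-in retry policies that should be preferred where available:

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}  # transient HTTP status codes

def call_with_retry(operation, max_attempts=5, base_delay=1.0):
    """Retry a callable that returns an HTTP status code, backing off
    exponentially with full jitter on transient (429/5xx) failures."""
    for attempt in range(max_attempts):
        status = operation()
        if status not in RETRYABLE:
            return status           # success, or a non-retryable error
        if attempt < max_attempts - 1:
            # Full jitter: sleep a random time in [0, base * 2^attempt).
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
    return status                   # retries exhausted; surface the error

# Simulated endpoint that returns 504 twice before recovering.
responses = iter([504, 504, 200])
print(call_with_retry(lambda: next(responses), base_delay=0.01))  # 200
```

Jittered backoff matters here: if every client retried on a fixed schedule after a throttling event like this one, the synchronized retries themselves could re-trigger the throttling.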


How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: AzPIR/SV7D-DV0.

Posted Mar 18, 2025 - 12:45 CET

Update

Previous communications indicated an incorrect additional service as being impacted. We have rectified the error and updated the communication below. We apologise for any confusion this may have caused.




What happened?


Between 07:30 and 11:11 UTC on 27 February 2025, a platform issue resulted in an impact to Azure Resource Manager operations in the West Europe region. A subset of customers may have experienced a temporary degradation in performance and latency when trying to access resources hosted in the region.


 


What do we know so far?


We determined that an increase in service traffic resulted in backend service components reaching an operational threshold. This led to service impact and manifested in the experience described above.


 


How did we respond?


  • 07:30 UTC on 27 February 2025 – Internal monitoring thresholds were breached, alerting us to this issue and prompting us to start our investigation; customer impact began. 
  • At approximately 10:13 UTC on 27 February 2025 – We determined this issue was caused by an increase in service traffic.
  • At approximately 11:00 UTC on 27 February 2025 – While validating the health of our Azure Resource Manager services and network, the Azure Platform self-healed the issue.
  • 11:11 UTC on 27 February 2025 – After a period of monitoring to validate the mitigation, we confirmed service functionality had been restored, and no further impact was observed. 



What happens next?


  • Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.
  • To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs.
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring.
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.

Posted Feb 27, 2025 - 15:31 CET

Update

What happened?


Between 07:30 and 11:11 UTC on 27 February 2025, a platform issue resulted in an impact to Azure Resource Manager operations in the West Europe region. A subset of customers may have experienced a temporary degradation in performance and latency when trying to access resources hosted in the region.


 


What do we know so far?


We determined that an increase in service traffic resulted in backend service components reaching an operational threshold. This led to service impact and manifested in the experience described above.


 


How did we respond?


  • 07:30 UTC on 27 February 2025 – Internal monitoring thresholds were breached, alerting us to this issue and prompting us to start our investigation; customer impact began. 
  • At approximately 10:13 UTC on 27 February 2025 – We determined this issue was caused by an increase in service traffic.
  • At approximately 11:00 UTC on 27 February 2025 – While validating the health of our Azure Kubernetes Service services and network, the Azure Platform self-healed the issue.
  • 11:11 UTC on 27 February 2025 – After a period of monitoring to validate the mitigation, we confirmed service functionality had been restored, and no further impact was observed. 



What happens next?


  • Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.
  • To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs.
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring.
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.

Posted Feb 27, 2025 - 15:21 CET

Investigating

What happened?


Between 07:30 and 11:11 UTC on 27 February 2025, a platform issue resulted in an impact to Azure Resource Manager operations in the West Europe region. A subset of customers may have experienced a temporary degradation in performance and latency when trying to access resources hosted in the region.


 


This issue is now mitigated. An update with more information will be provided shortly.

Posted Feb 27, 2025 - 13:12 CET