Post Incident Review (PIR) – Azure Resource Manager – Timeouts or 5xx responses from ARM while calling an older API
What happened?
On 27 February 2025, a platform issue in Azure Resource Manager (ARM) caused inadvertent throttling that impacted multiple services.
What went wrong and why?
When Azure Resource Manager (ARM) receives a request for authentication and authorization, in some specific scenarios it leverages an older API. During the impact window, the backend system responsible for these API calls experienced an unexpected rise in traffic from an internal Azure service and, as a result, throttled some of those calls. Throttling is a common resiliency strategy that regulates the rate at which internal resources are accessed; it prevents the system from being overwhelmed by a large volume of requests while still allowing it to serve the majority of them. In this case, the internal throttling resulted in a higher number of 504 (Gateway Timeout) errors being returned to callers.
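Transient throttling and 5xx responses of this kind are typically absorbed on the client side with retries and exponential backoff. The sketch below is illustrative only and is not taken from this PIR or from official Azure guidance: the status codes, attempt count, and backoff parameters are assumptions, and `send` stands in for whatever function issues the ARM request.

```python
import random
import time

# Assumed set of retryable responses: throttling (429) and transient server errors,
# including the 504s described in this incident.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff with full jitter: a random delay in
    [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(send, max_attempts=5, sleep=time.sleep):
    """Call send() -> (status_code, body); retry on throttling or
    transient server errors, backing off between attempts."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE_STATUSES:
            return status, body
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    # All attempts exhausted; surface the last response to the caller.
    return status, body
```

The jitter spreads retries out in time, which matters during incidents like this one: synchronized retries from many clients can themselves produce the kind of traffic spike that triggers further throttling.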
How did we respond?
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: AzPIR/SV7D-DV0.
Previous communications incorrectly indicated that an additional service was impacted. We have rectified the error and updated the communication below. We apologise for any confusion this may have caused.
What happened?
Between 07:30 and 11:11 UTC on 27 February 2025, a platform issue resulted in an impact to Azure Resource Manager operations in the West Europe region. A subset of customers may have experienced a temporary degradation in performance and latency when trying to access resources hosted in the region.
What do we know so far?
We determined that an increase in service traffic resulted in backend service components reaching an operational threshold. This caused the service impact and manifested as the degraded performance and latency described above.
How did we respond?
What happens next?
What happened?
Between 07:30 and 11:11 UTC on 27 February 2025, a platform issue resulted in an impact to Azure Resource Manager operations in the West Europe region. A subset of customers may have experienced a temporary degradation in performance and latency when trying to access resources hosted in the region.
This issue is now mitigated. An update with more information will be provided shortly.