Post Incident Review (PIR) - Azure Cache for Redis - Failures in West Europe
What happened?
Between 09:17 UTC and 14:35 UTC on 10 June 2025, a platform issue resulted in an impact to the Azure Cache for Redis service in the West Europe region. Impacted customers experienced failures or delays when trying to perform service management operations on Azure Cache for Redis in this region. Retries would generally have been successful, since the failure rate for requests across the region during this period was approximately 10%.
What went wrong and why?
A small subset of clients in the West Europe region generated an abnormally high volume of API calls to the Azure Cache for Redis Resource Provider (RP) because of an aggressive retry pattern in their applications. Specifically, when these clients restarted, they first attempted to list all Redis resources in their subscription and then fetched secrets by connecting to the Redis service. Any connection failure immediately triggered repeated, rapid retries, creating a feedback loop that further amplified the load on the backend RP infrastructure. This spike in management operation requests, caused by both misconfigured client retry logic and unstable client-side servers, quickly exhausted the backend worker threads and drove CPU utilization to critical levels. As a result, the service began returning request timeouts and errors, ultimately impacting many customers in the region, not just those responsible for the excess traffic. A contributing factor was that the RP had no effective throttling or safeguards in place to detect and contain this abnormal traffic early, which allowed the cascading overload to persist.
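For illustration, the minimal sketch below shows a client-side retry pattern that avoids the immediate-retry feedback loop described above: a capped number of attempts with exponentially growing, jittered delays. The fetch_redis_keys function and TransientError type are hypothetical placeholders, not part of any Azure SDK; the pattern itself (bounded retries, randomized backoff, giving up rather than retrying forever) is the point.

```python
import random
import time

# Hypothetical placeholder for a management-plane call (e.g. fetching keys
# for a cache). Raises TransientError on timeouts or throttled/5xx responses.
class TransientError(Exception):
    pass

def fetch_redis_keys(cache_name: str) -> dict:
    raise TransientError("simulated transient failure")  # placeholder only

def call_with_backoff(cache_name: str,
                      max_attempts: int = 5,
                      base_delay_s: float = 1.0,
                      max_delay_s: float = 30.0) -> dict:
    """Retry a management call with capped exponential backoff and full jitter.

    Unlike an immediate-retry loop, the delay grows with each failure and is
    randomized, so a fleet of restarting clients does not hit the Resource
    Provider in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch_redis_keys(cache_name)
        except TransientError:
            if attempt == max_attempts:
                raise  # surface the error instead of retrying indefinitely
            # Full jitter: sleep a random amount up to the exponential cap.
            cap = min(max_delay_s, base_delay_s * (2 ** (attempt - 1)))
            time.sleep(random.uniform(0, cap))
```

Combined with caching previously retrieved connection information, a pattern like this keeps restarting clients from re-issuing the full list-and-fetch sequence against the Resource Provider all at once.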
How did we respond?
How are we making incidents like this less likely or less impactful?
We have already...
We are improving our…
In the longer term, we will…
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/8TGH-3XG
What happened?
Between 09:17 UTC and 14:35 UTC on 10 June 2025, a platform issue resulted in an impact to the Azure Cache for Redis service in the West Europe region. Impacted customers experienced failures or delays when trying to perform service management operations in this region. Retries may have been successful and there was about a 10% failure rate for requests across the region during this period.
What do we know so far?
We determined that a large, unexpected spike in usage resulted in backend Azure Cache for Redis components reaching an operational threshold. This led to a spike in service management operation failures in the region.
How did we respond?
What happens next?
What happened?
Between 09:17 UTC and 14:35 UTC on 10 June 2025, a platform issue resulted in an impact to the Azure Cache for Redis service in the West Europe region. Impacted customers experienced failures or delays when trying to perform service management operations in this region. Retries may have been successful and there was about a 10% failure rate for requests across the region during this period.
This issue is now mitigated. A further update will be provided shortly.
What happened?
Between 09:17 UTC and 14:35 UTC on 10 June 2025, a platform issue resulted in an impact to the Azure Cache for Redis service in the West Europe region. Impacted customers experienced failures or delays when trying to perform service management operations in this region. Retries may have been successful and there was about a 10% failure rate for requests across the region during this period.
Impact Statement: Starting at 09:17 UTC on 10 June 2025, you have been identified as a customer using the Azure Cache for Redis service in West Europe who may experience failures or delays when trying to perform service management operations in this region. Retries may be successful and there is about a 10% failure rate for requests across the region.
Current Status: Service monitoring alerted us to errors exceeding our thresholds, prompting us to begin an investigation. We have determined that a large, unexpected spike in usage has resulted in backend Azure Cache for Redis components reaching an operational threshold. This has led to a spike in service management operation failures in the region.
We are currently validating our mitigation efforts to confirm full service functionality. Customers may experience signs of recovery at this time.
The next update will be provided in 2 hours, or as events warrant.
Impact Statement: Starting at 09:17 UTC on 10 June 2025, you have been identified as a customer using the Azure Cache for Redis service in West Europe who may experience failures or delays when trying to perform service management operations in this region. Retries may be successful and there is about a 10% failure rate for requests across the region.
Current Status: Service monitoring alerted us to errors exceeding our thresholds, prompting us to begin an investigation. We have determined that a large, unexpected spike in usage has resulted in backend Azure Cache for Redis components reaching an operational threshold. This has led to a spike in service management operation failures in the region.
We are currently continuing to explore mitigation workstreams to alleviate the impact of this issue.
The next update will be provided in 2 hours, or as events warrant.
Impact Statement: Starting at 09:17 UTC on 10 June 2025, you have been identified as a customer using the Azure Cache for Redis service in West Europe who may experience failures or delays when trying to perform service management operations in this region. Retries may be successful and there is about a 10% failure rate for requests across the region.
Current Status: Service monitoring alerted us to errors exceeding our thresholds, prompting us to begin an investigation. We have determined that a large, unexpected spike in usage has resulted in backend Azure Cache for Redis components reaching an operational threshold. This has led to a spike in service management operation failures in the region.
We are currently exploring mitigation workstreams to alleviate the impact of this issue.
The next update will be provided in 60 minutes, or as events warrant.
Impact Statement: Starting at 09:17 UTC on 10 June 2025, you have been identified as a customer using the Azure Cache for Redis service in West Europe who may experience failures or delays when trying to perform service management operations in this region. Retries may be successful and there is about a 10% failure rate for requests across the region.
Current Status: Service monitoring alerted us to errors exceeding our thresholds, prompting us to begin an investigation. We are currently examining the contributing factors leading to the observed performance degradation.
The next update will be provided in 60 minutes, or as events warrant.
Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.