Microsoft Incident - Redis Cache - PIR - Azure Cache for Redis - Failures in West Europe

Incident Report for Graphisoft

Update

Post Incident Review (PIR) - Azure Cache for Redis - Failures in West Europe




What happened?


Between 09:17 UTC and 14:35 UTC on 10 June 2025, a platform issue resulted in an impact to the Azure Cache for Redis service in the West Europe region. Impacted customers experienced failures or delays when trying to perform service management operations on Azure Cache for Redis in this region. Retries would generally have been successful, since only approximately 10% of requests across the region failed during this period.




What went wrong and why?


A small subset of client applications generating high traffic in the West Europe region exhibited an aggressive retry pattern, resulting in an abnormally high volume of API calls to the Azure Cache for Redis Resource Provider (RP). Specifically, when these clients restarted, they first attempted to list all Redis resources in their subscription and then fetched secrets by connecting to the Redis service. Any connection failure immediately triggered repeated, rapid retries, creating a feedback loop that further amplified the load on the backend RP infrastructure. This spike in management operation requests, caused by both misconfigured client retry logic and unstable client-side servers, quickly exhausted the backend worker threads and drove CPU utilization to critical levels. As a result, the service began returning request timeouts and errors, ultimately impacting many customers in the region, not just those responsible for the excess traffic. Contributing to the incident, there were no effective throttling or safeguard mechanisms in place at the RP level to detect and contain this abnormal traffic early, which allowed the cascading overload to persist.
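To make the amplification mechanism concrete, the sketch below contrasts an immediate-retry loop with a bounded, jittered backoff around the same management-plane sequence (list caches, then fetch access keys). It is a schematic illustration only, written in Python against the publicly documented azure-mgmt-redis SDK, and does not represent the affected clients' actual code; SUBSCRIPTION_ID is a placeholder.

```python
import random
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.redis import RedisManagementClient

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder


def fetch_all_cache_keys(client: RedisManagementClient) -> dict:
    """List every cache in the subscription and fetch its access keys.

    Every call here is a management-plane (Resource Provider) request,
    which is why re-running the whole sequence on each failure is costly.
    """
    keys = {}
    for cache in client.redis.list_by_subscription():
        resource_group = cache.id.split("/")[4]  # ARM ID: /subscriptions/<sub>/resourceGroups/<rg>/...
        keys[cache.name] = client.redis.list_keys(resource_group, cache.name)
    return keys


# Problematic pattern (schematic): any failure immediately restarts the whole
# sequence with no delay and no cap, so many restarting clients create a
# feedback loop of management-plane calls:
#
#     while True:
#         try:
#             fetch_all_cache_keys(client); break
#         except Exception:
#             continue  # retry immediately, forever
#
# A safer pattern caps the attempts and spreads retries out with jitter:
def fetch_with_backoff(client, max_attempts=5, base_delay=1.0, max_delay=30.0):
    for attempt in range(max_attempts):
        try:
            return fetch_all_cache_keys(client)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with full jitter.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))


if __name__ == "__main__":
    client = RedisManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
    fetch_with_backoff(client)
```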




How did we respond?


  • 09:17 UTC on 10 June 2025 – Customer impact began. Service monitoring detected a spike in usage and subsequent failures in the region and our investigation started.
  • 09:53 UTC on 10 June 2025 – A large spike in traffic was identified, and service monitoring detected a drop in availability.
  • 11:30 UTC on 10 June 2025 – We identified that the issue was caused by client backend services that rely on Azure Cache for Redis placing increased load on the Azure Cache for Redis control plane in West Europe.
  • 13:00 UTC on 10 June 2025 – Our mitigation workstreams started, including applying a throttling limit to reduce the number of concurrent incoming requests, and scaling up the service to allow backend components to handle the incoming requests.
  • 14:35 UTC on 10 June 2025 – Service restored, and customer impact mitigated.
  • 15:46 UTC on 10 June 2025 – After a period of monitoring to validate the mitigation, we confirmed that full service functionality was restored and declared the incident mitigated.



How are we making incidents like this less likely or less impactful?


We have already...


  • Applied reactive throttling and scaling policies to quickly stabilize load during unexpected spikes. (Completed)
  • Engaged with high-traffic customers to fix aggressive retry patterns and unhealthy connection behaviors. (Completed)

We are improving our…

  • Bulk request management by introducing batching for operations that can consume high CPU. (Estimated completion: August 2025)
  • Ability to scale out more quickly during demand spikes, and operational workflows for scale-out, throttling, and reboots, to reduce the time and effort these actions require. (Estimated completion: August 2025)
  • Service Level Indicator (SLI) monitoring and customer communication processes, to detect and respond to anomalies more quickly and keep affected customers informed throughout. (Estimated completion: August 2025)

In the longer term, we will…

  • Implement more robust protections for APIs like “list keys” to prevent a small set of subscriptions from overwhelming the system. (Estimated completion: December 2025)
  • Introduce rate limiting and throttling mechanisms at the resource provider level, to de-risk incidents like this one; an illustrative sketch of this kind of limiter follows this list. (Estimated completion: December 2025)
  • Improve message handling efficiency by blocking or deduplicating provisioning and enqueueing requests for the same cache or subscription ID. (Estimated completion: December 2025)
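As a rough illustration of the kind of resource-provider-level rate limiting referred to above (a generic sketch, not the actual service implementation), a per-subscription token bucket could reject abnormal bursts before they consume shared backend worker threads:

```python
import time
from collections import defaultdict
from threading import Lock


class PerSubscriptionRateLimiter:
    """Token-bucket limiter keyed by subscription ID (illustrative only).

    Each subscription gets `capacity` tokens that refill at `refill_rate`
    tokens per second; requests that find the bucket empty are rejected
    early (e.g. with HTTP 429) instead of tying up backend worker threads.
    """

    def __init__(self, capacity: int = 20, refill_rate: float = 5.0):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self._buckets = defaultdict(lambda: [capacity, time.monotonic()])
        self._lock = Lock()

    def allow(self, subscription_id: str) -> bool:
        with self._lock:
            tokens, last = self._buckets[subscription_id]
            now = time.monotonic()
            # Refill tokens based on time elapsed, capped at bucket capacity.
            tokens = min(self.capacity, tokens + (now - last) * self.refill_rate)
            if tokens < 1:
                self._buckets[subscription_id] = [tokens, now]
                return False  # caller should respond 429 with a Retry-After header
            self._buckets[subscription_id] = [tokens - 1, now]
            return True


# Usage sketch: contain a burst from one subscription before it affects others.
limiter = PerSubscriptionRateLimiter(capacity=20, refill_rate=5.0)
if not limiter.allow("00000000-0000-0000-0000-000000000000"):
    pass  # reject the request instead of queuing it on shared workers
```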


How can customers make incidents like this less impactful?

  • Consider evaluating your client-side retry mechanism(s), for example by increasing timeouts, scoping retries, and adopting a 'circuit breaker' pattern when connections to the Redis server fail; an illustrative client configuration sketch follows this list. See: https://learn.microsoft.com/azure/azure-cache-for-redis/cache-best-practices-connection
  • Consider switching to Entra ID-based authentication instead of key-based authentication when connecting to the Redis server. See: https://learn.microsoft.com/azure/redis/entra-for-authentication
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
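To complement the retry guidance above, here is a minimal sketch, assuming a Python application using the redis-py client, of explicit timeouts, a bounded exponential-backoff retry policy, and a simple circuit breaker. Host names, keys, and thresholds are placeholders, and the linked best-practices documentation remains the authoritative reference.

```python
import time

import redis
from redis.backoff import ExponentialBackoff
from redis.retry import Retry

CACHE_HOST = "<your-cache>.redis.cache.windows.net"  # placeholder
CACHE_KEY = "<access-key>"                           # placeholder

# Bounded retries with exponential backoff instead of immediate, unlimited retries.
client = redis.Redis(
    host=CACHE_HOST,
    port=6380,
    ssl=True,
    password=CACHE_KEY,
    socket_connect_timeout=5,   # fail fast on connection attempts...
    socket_timeout=5,           # ...and on individual commands
    retry=Retry(ExponentialBackoff(cap=10, base=1), retries=3),
    retry_on_error=[redis.exceptions.ConnectionError, redis.exceptions.TimeoutError],
)


class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    skip calls for `cooldown` seconds instead of hammering the service."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.threshold:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: skipping call")
            self.failures = 0  # half-open: allow a trial call after the cooldown
        try:
            result = fn(*args, **kwargs)
            self.failures = 0
            return result
        except (redis.exceptions.ConnectionError, redis.exceptions.TimeoutError):
            self.failures += 1
            self.opened_at = time.monotonic()
            raise


breaker = CircuitBreaker()
value = breaker.call(client.get, "some-key")
```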


How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/8TGH-3XG

Posted Jun 23, 2025 - 23:18 CEST

Update

What happened?


Between 09:17 UTC and 14:35 UTC on 10 June 2025, a platform issue resulted in an impact to the Azure Cache for Redis service in the West Europe region. Impacted customers experienced failures or delays when trying to perform service management operations in this region. Retries may have been successful and there was about a 10% failure rate for requests across the region during this period.


What do we know so far?


We determined that a large, unexpected spike in usage resulted in backend Azure Cache for Redis components reaching an operational threshold. This led to a spike in service management operation failures in the region. 


How did we respond? 


  • 09:17 UTC on 10 June 2025 – Customer impact began. Service monitoring detected a spike in usage and subsequent failures in the region and our investigation started.
  • Shortly afterwards, a large spike in traffic was identified as a contributing factor.
  • 13:00 UTC on 10 June 2025 – Our mitigation workstreams started; these included applying a throttling limit to reduce the number of concurrent incoming requests, and scaling up the service to allow backend components to handle the incoming requests.
  • 14:35 UTC on 10 June 2025 – Service restored, and customer impact mitigated.
  • 15:46 UTC on 10 June 2025 – After a period of monitoring to validate the mitigation, we confirmed that full-service functionality was restored.

What happens next?


  • To request a Post Incident Review (PIR), impacted customers can use the “Request PIR” feature within Azure Service Health. 
  • To get notified if a PIR is published, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts  
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs  
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring  
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness

Posted Jun 10, 2025 - 18:01 CEST

Update

What happened?


Between 09:17 UTC and 14:35 UTC on 10 June 2025, a platform issue resulted in an impact to the Azure Cache for Redis service in the West Europe region. Impacted customers experienced failures or delays when trying to perform service management operations in this region. Retries may have been successful and there was about a 10% failure rate for requests across the region during this period.


This issue is now mitigated. A further update will be provided shortly.

Posted Jun 10, 2025 - 17:49 CEST

Update

What happened?


Between 09:17 UTC and 14:35 UTC on 10 June 2025, a platform issue resulted in an impact to the Azure Cache for Redis service in the West Europe region. Impacted customers experienced failures or delays when trying to perform service management operations in this region. Retries may have been successful and there was about a 10% failure rate for requests across the region during this period.

Posted Jun 10, 2025 - 17:47 CEST

Update

Impact Statement: Starting at 09:17 UTC on 10 June 2025, you have been identified as a customer using the Azure Cache for Redis service in West Europe who may experience failures or delays when trying to perform service management operations in this region. Retries may be successful, and there is about a 10% failure rate for requests across the region.


Current Status: Service monitoring alerted us to errors exceeding our thresholds, prompting us to begin an investigation. We have determined that a large, unexpected spike in usage has resulted in backend Azure Cache for Redis components reaching an operational threshold. This has led to a spike in service management operation failures in the region. 


We are currently validating our mitigation efforts to confirm full-service functionality. Customers may experience signs of recovery at this time.


The next update will be provided in 2 hours, or as events warrant.

Posted Jun 10, 2025 - 17:04 CEST

Update

Impact Statement: Starting at 09:17 UTC on 10 June 2025, you have been identified as a customer using the Azure Cache for Redis service in West Europe who may experience failures or delays when trying to perform service management operations in this region. Retries may be successful, and there is about a 10% failure rate for requests across the region.


Current Status: Service monitoring alerted us to errors exceeding our thresholds, prompting us to begin an investigation. We have determined that a large, unexpected spike in usage has resulted in backend Azure Cache for Redis components reaching an operational threshold. This has led to a spike in service management operation failures in the region. 


We are currently continuing to explore mitigation workstreams to alleviate the impact of this issue. 


The next update will be provided in 2 hours, or as events warrant.

Posted Jun 10, 2025 - 15:07 CEST

Update

Impact Statement: Starting at 09:17 UTC on 10 June 2025, you have been identified as a customer using the Azure Cache for Redis service in West Europe who may experience failures or delays when trying to perform service management operations in this region. Retries may be successful, and there is about a 10% failure rate for requests across the region.


Current Status: Service monitoring alerted us to errors exceeding our thresholds, prompting us to begin an investigation. We have determined that a large unexpected spike in usage has resulted in backend Azure Cache for Redis components reaching an operational threshold. This has led to a spike in service management operation failures in the region. 


We are currently exploring mitigation workstreams to alleviate the impact of this issue. 


The next update will be provided in 60 minutes, or as events warrant.

Posted Jun 10, 2025 - 14:15 CEST

Investigating

Impact Statement: Starting at 09:17 UTC on 10 June 2025, you have been identified as a customer using the Azure Cache for Redis service in West Europe who may experience failures or delays when trying to perform service management operations in this region. Retries may be successful, and there is about a 10% failure rate for requests across the region.


Current Status: Service monitoring alerted us to errors exceeding our thresholds, prompting us to begin an investigation. We are currently examining the contributing factors leading to the observed performance degradation.


The next update will be provided in 60 minutes, or as events warrant.

Posted Jun 10, 2025 - 13:51 CEST