Update -

Post Incident Review (PIR) - Azure Cache for Redis - Failures in West Europe




What happened?


Between 09:17 UTC and 14:35 UTC on 10 June 2025, a platform issue resulted in an impact to the Azure Cache for Redis service in the West Europe region. Impacted customers experienced failures or delays when trying to perform service management operations on Azure Cache for Redis in this region. Retries would generally have been successful, as only approximately 10% of requests across the region failed during this period.




What went wrong and why?


A small subset of client applications generating high traffic in the West Europe region exhibited an aggressive retry pattern, resulting in an abnormally high volume of API calls to the Redis Cache Resource Provider (RP). Specifically, when these clients restarted, they first attempted to list all Redis resources in their subscription and then fetched secrets by connecting to the Redis service. Any connection failure immediately triggered repeated, rapid retries, creating a feedback loop that further amplified the load on the backend RP infrastructure. This spike in management operation requests, caused by both misconfigured client retry logic and unhealthy client servers that were repeatedly restarting, quickly exhausted the backend worker threads and drove CPU utilization to critical levels. As a result, the service began returning request timeouts and errors, ultimately impacting many customers in the region, not just those responsible for the excess traffic. Compounding the incident, the RP had no effective throttling or safeguards in place to detect and contain this abnormal traffic early, which allowed the cascading overload to persist.
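For contrast with the immediate-retry loop described above, here is a minimal client-side sketch of exponential backoff with full jitter, in Python. It assumes a generic management-plane call; list_redis_resources is a hypothetical stand-in, not an actual SDK function.

import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Retry a management-plane call with exponential backoff and full jitter,
    rather than immediately and repeatedly, which amplifies load during an outage."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after a bounded number of attempts
            # Sleep a random duration in [0, min(max_delay, base_delay * 2^attempt)];
            # full jitter spreads retries out so restarting clients do not synchronize.
            time.sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))

# Hypothetical usage: fetch the resource list once per restart, with backoff.
# resources = call_with_backoff(list_redis_resources)

Because roughly 90% of requests were still succeeding, a bounded, jittered retry like this would generally have recovered on a later attempt without adding to the overload.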




How did we respond?


  • 09:17 UTC on 10 June 2025 – Customer impact began. Service monitoring detected a spike in usage and subsequent failures in the region and our investigation started.
  • 09:53 UTC on 10 June 2025 – A large spike in traffic was identified, and service monitoring detected a drop in availability.
  • 11:30 UTC on 10 June 2025 – We identified that the issue was caused by client backend services that rely on Azure Cache for Redis placing increased load on the Azure Cache control plane in West Europe.
  • 13:00 UTC on 10 June 2025 – Our mitigation workstreams started, including applying a throttling limit to reduce the number of concurrent incoming requests, and scaling up the service to allow backend components to handle the incoming requests.
  • 14:35 UTC on 10 June 2025 – Service restored, and customer impact mitigated.
  • 15:46 UTC on 10 June 2025 – After a period of monitoring to validate the mitigation, we confirmed that full service functionality was restored, and declared the incident mitigated.



How are we making incidents like this less likely or less impactful?


We have already…


  • Applied reactive throttling and scaling policies to quickly stabilize load during unexpected spikes. (Completed)
  • Engaged with high-traffic customers to fix aggressive retry patterns and unhealthy connection behaviors. (Completed)

We are improving our…

  • Bulk request management by introducing batching for operations that can consume high CPU (see the sketch after this list). (Estimated completion: August 2025)
  • Ability to scale out more quickly during demand spikes, and improving operational workflows to reduce the time taken for scale-out, throttling, and reboots. (Estimated completion: August 2025)
  • Service Level Indicator (SLI) monitoring and customer communication processes, to detect and respond to anomalies more quickly and keep affected customers informed throughout. (Estimated completion: August 2025)
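As a concrete illustration of the batching item above (not the service's actual implementation), the following Python sketch drains individual requests from a queue into batches, so a CPU-heavy code path runs once per batch rather than once per request; the batch size, wait window, and process_batch function are illustrative assumptions.

import queue
import time

def process_batch(batch):
    # Hypothetical stand-in for a CPU-heavy bulk operation.
    print(f"processing {len(batch)} requests in one pass")

def batch_worker(requests: queue.Queue, batch_size: int = 50, max_wait_s: float = 0.5):
    """Group incoming requests so expensive work is amortized across a batch."""
    while True:
        batch = [requests.get()]  # block until at least one request arrives
        deadline = time.monotonic() + max_wait_s
        while len(batch) < batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break
        process_batch(batch)

The wait window caps the extra latency any single request pays for being batched.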

In the longer term, we will…

  • Implement more robust protections for APIs like “list keys” to prevent a small set of subscriptions from overwhelming the system. (Estimated completion: December 2025)
  • Introduce rate limiting and throttling mechanisms at the resource provider level, to de-risk incidents like this one (sketched after this list, together with the deduplication item below). (Estimated completion: December 2025)
  • Improve message handling efficiency by blocking or deduplicating provisioning and enqueueing requests for the same cache or subscription ID. (Estimated completion: December 2025)
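The last two bullets above can be pictured together: a per-subscription token bucket for rate limiting, plus deduplication of provisioning requests that are already pending for the same cache. This Python sketch illustrates the general techniques under assumed limits (5 requests/second, burst of 20); it is not the resource provider's actual design.

import time
from collections import defaultdict

class TokenBucket:
    """Admit a request only if a token is available, so one noisy
    subscription cannot exhaust workers shared by the whole region."""
    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = float(burst), time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = defaultdict(lambda: TokenBucket(rate=5, burst=20))  # keyed by subscription
in_flight = set()  # (subscription_id, cache_name) pairs already queued

def admit(subscription_id: str, cache_name: str) -> bool:
    """Reject duplicates for the same cache, then throttle per subscription."""
    if (subscription_id, cache_name) in in_flight:
        return False  # deduplicate: an identical operation is already pending
    if not buckets[subscription_id].allow():
        return False  # throttle: subscription exceeded its request rate
    in_flight.add((subscription_id, cache_name))  # removed when the work completes
    return True

Rejected requests would typically receive an HTTP 429 response with a Retry-After hint, so well-behaved clients back off rather than retry immediately.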


How can customers make incidents like this less impactful?

  • Consider evaluating your client-side retry mechanism(s): increase timeouts, scope retries, and explore a 'circuit breaker' pattern for failures to connect to the Redis server (a sketch follows this list). See: https://learn.microsoft.com/azure/azure-cache-for-redis/cache-best-practices-connection
  • Consider switching to Entra ID based authentication instead of key based authentication when connecting to the Redis server. See: https://learn.microsoft.com/azure/redis/entra-for-authentication
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
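For the first bullet in this list, here is a minimal Python sketch of the circuit-breaker idea around a redis-py client: after a few consecutive connection failures the circuit opens and calls are skipped for a cooldown period, instead of hammering the server in a tight loop. The thresholds and host name are illustrative assumptions; consult the linked best-practices page for client-specific guidance.

import time
import redis  # pip install redis

class CircuitBreaker:
    """Track consecutive failures; once a threshold is hit, refuse
    attempts until a cooldown has elapsed."""
    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0):
        self.failure_threshold, self.cooldown_s = failure_threshold, cooldown_s
        self.failures, self.opened_at = 0, 0.0

    def can_attempt(self) -> bool:
        return (self.failures < self.failure_threshold
                or time.monotonic() - self.opened_at >= self.cooldown_s)

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()

breaker = CircuitBreaker()
# Generous socket timeouts, per the connection best-practices guidance.
client = redis.Redis(host="example.redis.cache.windows.net", port=6380, ssl=True,
                     socket_timeout=10, socket_connect_timeout=10)

def get_value(key: str):
    if not breaker.can_attempt():
        raise RuntimeError("circuit open: skipping Redis call during cooldown")
    try:
        value = client.get(key)
        breaker.record(success=True)
        return value
    except redis.ConnectionError:
        breaker.record(success=False)
        raise

During an incident like this one, an open circuit converts a flood of doomed connection attempts into a handful of probes per cooldown window.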


How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/8TGH-3XG


Jun 23, 2025 - 23:18 CEST
Update -

What happened?


Between 09:17 UTC and 14:35 UTC on 10 June 2025, a platform issue resulted in an impact to the Azure Cache for Redis service in the West Europe region. Impacted customers experienced failures or delays when trying to perform service management operations in this region. Retries may have been successful, as there was about a 10% failure rate for requests across the region during this period.


What do we know so far?


We determined that a large, unexpected spike in usage resulted in backend Azure Cache for Redis components reaching an operational threshold. This led to a spike in service management operation failures in the region. 


How did we respond? 


  • 09:17 UTC on 10 June 2025 – Customer impact began. Service monitoring detected a spike in usage and subsequent failures in the region and our investigation started.
  • Shortly afterwards, a large spike in traffic was identified as a contributing factor.
  • 13:00 UTC on 10 June 2025 – Our mitigation workstreams started, including applying a throttling limit to reduce the number of concurrent incoming requests, and scaling up the service to allow backend components to handle the incoming requests.
  • 14:35 UTC on 10 June 2025 – Service restored, and customer impact mitigated.
  • 15:46 UTC on 10 June 2025 – After a period of monitoring to validate the mitigation, we confirmed that full service functionality was restored.

What happens next?


  • To request a Post Incident Review (PIR), impacted customers can use the “Request PIR” feature within Azure Service Health. 
  • To get notified if a PIR is published, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts  
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs  
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring  
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness


Jun 10, 2025 - 18:01 CEST
Update -

What happened?


Between 09:17 UTC and 14:35 UTC on 10 June 2025, a platform issue resulted in an impact to the Azure Cache for Redis service in the West Europe region. Impacted customers experienced failures or delays when trying to perform service management operations in this region. Retries may have been successful, as there was about a 10% failure rate for requests across the region during this period.


This issue is now mitigated. A further update will be provided shortly.


Jun 10, 2025 - 17:49 CEST
Update -

What happened?


Between 09:17 UTC and 14:35 UTC on 10 June 2025, a platform issue resulted in an impact to the Azure Cache for Redis service in the West Europe region. Impacted customers experienced failures or delays when trying to perform service management operations in this region. Retries may have been successful, as there was about a 10% failure rate for requests across the region during this period.


Jun 10, 2025 - 17:47 CEST
Update -

Impact Statement: Starting at 09:17 UTC on 10 June 2025, you have been identified as a customer using the Azure Cache for Redis service in West Europe who may experience failures or delays when trying to perform service management operations in this region. Retries may be successful, as there is about a 10% failure rate for requests across the region.


Current Status: Service monitoring alerted us to errors exceeding our thresholds, prompting us to begin an investigation. We have determined that a large, unexpected spike in usage has resulted in backend Azure Cache for Redis components reaching an operational threshold. This has led to a spike in service management operation failures in the region. 


We are currently validating our mitigation efforts to confirm full service functionality. Customers may experience signs of recovery at this time.


The next update will be provided in 2 hours, or as events warrant.


Jun 10, 2025 - 17:04 CEST
Update -

Impact Statement: Starting at 09:17 UTC on 10 June 2025, you have been identified as a customer using the Azure Cache for Redis service in West Europe who may experience failures or delays when trying to perform service management operations in this region. Retries may be successful, as there is about a 10% failure rate for requests across the region.


Current Status: Service monitoring alerted us to errors exceeding our thresholds, prompting us to begin an investigation. We have determined that a large, unexpected spike in usage has resulted in backend Azure Cache for Redis components reaching an operational threshold. This has led to a spike in service management operation failures in the region. 


We are currently continuing to explore mitigation workstreams to alleviate the impact of this issue. 


The next update will be provided in 2 hours, or as events warrant.


Jun 10, 2025 - 15:07 CEST
Update -

Impact Statement: Starting at 09:17 UTC on 10 June 2025, you have been identified as a customer using the Azure Cache for Redis service in West Europe who may experience failures or delays when trying to perform service management operations in this region. Retries may be successful, as there is about a 10% failure rate for requests across the region.


Current Status: Service monitoring alerted us to errors exceeding our thresholds, prompting us to begin an investigation. We have determined that a large, unexpected spike in usage has resulted in backend Azure Cache for Redis components reaching an operational threshold. This has led to a spike in service management operation failures in the region.


We are currently exploring mitigation workstreams to alleviate the impact of this issue. 


The next update will be provided in 60 minutes, or as events warrant.


Jun 10, 2025 - 14:15 CEST
Investigating -

Impact Statement: Starting at 09:17 UTC on 10 June 2025, you have been identified as a customer using the Azure Cache for Redis service in West Europe who may experience failures or delays when trying to perform service management operations in this region. Retries may be successful, as there is about a 10% failure rate for requests across the region.


Current Status: Service monitoring alerted us to errors exceeding our thresholds, prompting us to begin an investigation. We are currently examining the contributing factors leading to the observed performance degradation.


The next update will be provided in 60 minutes, or as events warrant.


Jun 10, 2025 - 13:51 CEST
Update - Metric Alert for BI-SRV2-HQ - the service availability is less than or equal to 95%
Jun 19, 2025 - 00:25 CEST
Investigating - Metric Alert for BI-SRV2-HQ - the service availability is less than or equal to 95%
Jun 19, 2025 - 00:18 CEST
Update - Metric Alert for fd-global-prod-001 - Service Availability is less than or equal to 95%
Jun 16, 2025 - 21:37 CEST
Investigating - Metric Alert for fd-global-prod-001 - Service Availability is less than or equal to 95%
Jun 16, 2025 - 20:29 CEST
Update - Metric Alert for sb-fcs-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
Jun 12, 2025 - 20:10 CEST
Investigating - Metric Alert for sb-fcs-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
Jun 12, 2025 - 06:10 CEST
Update - Metric Alert for sb-fcs-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
Jun 10, 2025 - 08:50 CEST
Investigating - Metric Alert for sb-fcs-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
Jun 09, 2025 - 18:10 CEST
Investigating - Metric Alert for sb-idp-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
Jun 07, 2025 - 14:20 CEST
Investigating - Metric Alert for sb-idp-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
Jun 07, 2025 - 04:20 CEST
Update - Metric Alert for fd-global-prod-001 - Service Availability is less than or equal to 95%
Jun 06, 2025 - 00:10 CEST
Investigating - Metric Alert for fd-global-prod-001 - Service Availability is less than or equal to 95%
Jun 05, 2025 - 22:30 CEST
Update - Metric Alert for BI-SRV2-HQ - the service availability is less than or equal to 95%
Jun 04, 2025 - 01:31 CEST
Investigating - Metric Alert for BI-SRV2-HQ - the service availability is less than or equal to 95%
Jun 04, 2025 - 01:24 CEST
Update - Metric Alert for sb-fcs-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
Jun 03, 2025 - 21:47 CEST
Investigating - Metric Alert for sb-fcs-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
Jun 03, 2025 - 07:47 CEST
Investigating - Metric Alert for sb-idp-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
May 30, 2025 - 09:33 CEST
Investigating - Metric Alert for sb-idp-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
May 30, 2025 - 07:33 CEST
Investigating - Metric Alert for sb-idp-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
May 30, 2025 - 02:19 CEST
Investigating -

What happened?


At 20:00 UTC on 13 May 2025 we received a monitoring alert for a possible issue with the Azure Storage service in the East US and West Europe regions. We have concluded our investigation of the alert and confirmed that your resources are healthy, and you were not impacted by this issue. We apologize for any inconvenience caused.


Stay informed about Azure service issues by creating custom service health alerts: https://aka.ms/ash-videos for video tutorials and https://aka.ms/ash-alerts for how-to documentation.


May 30, 2025 - 02:15 CEST
Investigating - Metric Alert for sb-idp-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
May 29, 2025 - 17:33 CEST
Investigating - Metric Alert for sb-idp-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
May 29, 2025 - 15:33 CEST
Investigating - Metric Alert for sb-idp-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
May 29, 2025 - 12:00 CEST
Update - Metric Alert for sb-fcs-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
May 27, 2025 - 23:23 CEST
Investigating - Metric Alert for sb-fcs-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
May 27, 2025 - 09:23 CEST
Graphisoft ID - Degraded Performance (100.0% uptime over the past 90 days)
Graphisoft License Delivery - Degraded Performance (100.0% uptime over the past 90 days)
Graphisoft Store - Operational (100.0% uptime over the past 90 days)
Graphisoft Legacy Store - Degraded Performance (100.0% uptime over the past 90 days)
Graphisoft Legacy Webshop - Degraded Performance (100.0% uptime over the past 90 days)
GSPOS - Operational (100.0% uptime over the past 90 days)
Graphisoft BIM Components - Operational (100.0% uptime over the past 90 days)
Graphisoft BIMx Transfer - Operational (100.0% uptime over the past 90 days)
Graphisoft DevOps Components - Operational (100.0% uptime over the past 90 days)
Microsoft Incidents - Operational (100.0% uptime over the past 90 days)
Jul 5, 2025

No incidents reported today.

Jul 4, 2025

No incidents reported.

Jul 3, 2025

No incidents reported.

Jul 2, 2025

No incidents reported.

Jul 1, 2025

No incidents reported.

Jun 30, 2025

No incidents reported.

Jun 29, 2025

No incidents reported.

Jun 28, 2025

No incidents reported.

Jun 27, 2025

No incidents reported.

Jun 26, 2025

No incidents reported.

Jun 25, 2025

No incidents reported.

Jun 24, 2025

No incidents reported.

Jun 23, 2025

Unresolved incident: Microsoft Incident - Redis Cache - PIR - Azure Cache for Redis - Failures in West Europe.

Jun 22, 2025

No incidents reported.

Jun 21, 2025

No incidents reported.