Microsoft Incident - Azure Front Door - Mitigated – Azure Front Door traffic experienced 504 errors

Incident Report for Graphisoft

Update

What happened?


Between 12:05 UTC and 21:33 UTC on 15 September 2025, a platform issue impacted Azure Front Door (Standard, Premium, and Classic SKUs) and Azure CDN Standard from Microsoft. During this time, customers accessing services in the East US region may have experienced 504 (Gateway Timeout) errors. The impact was limited to two Points of Presence (PoPs) in East US and affected only cache-miss traffic.




What do we know so far?


Our investigation determined that the incident was caused by elevated CPU utilization in two Azure Front Door (AFD) environments. This condition led to intermittent 504 (Gateway Timeout) errors affecting cache-miss traffic. Customer retries were usually successful. The issue was identified through service monitoring and impacted approximately 0.25% of overall cache-miss requests in the affected environments. The elevated CPU load was traced to an increase in cumulative traffic, which exceeded the processing capacity of these environments.
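
Because retries were usually successful during this incident, a client that calls an endpoint behind Azure Front Door may be able to ride out intermittent 504 responses by retrying with backoff. The following is a minimal sketch in Python, not official guidance. It assumes a hypothetical `send` callable that performs one request and returns the HTTP status code; `call_with_retry`, `max_attempts`, `base_delay`, and `sleep` are illustrative names and not part of any Azure SDK.

```python
import random
import time
from typing import Callable


def call_with_retry(
    send: Callable[[], int],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a status other than 504 or attempts run out.

    `send` is a hypothetical callable that issues one request and returns
    the HTTP status code. Returns the last status code observed.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if status != 504:
            break
        # Exponential backoff with full jitter: wait a random time in
        # [0, base_delay * 2^(attempt - 1)] seconds before the next try.
        sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
        status = send()
    return status


if __name__ == "__main__":
    # Simulated responses: two gateway timeouts followed by a success.
    responses = iter([504, 504, 200])
    print(call_with_retry(lambda: next(responses), sleep=lambda s: None))
```

Retrying in this way is only safe for requests that are idempotent. The jitter is there so that many clients retrying at once do not add synchronized load to an environment that is already overloaded.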




How did we respond?


  • 12:05 UTC on 15 September 2025 – Customer impact began.
  • 12:07 UTC on 15 September 2025 – Issue was detected via service monitoring upon observing a gradual increase in request failures. An incident was created, and the service team was notified.
  • 16:00 UTC on 15 September 2025 – Service team identified that the issue was caused by increased load and high CPU usage on two Azure Front Door environments, leading to intermittent 504 (Gateway Timeout) errors for customer cache-miss requests.
  • 17:19 UTC on 15 September 2025 – Additional capacity was provisioned to address the overloaded environments. This reduced the error rate, but a lower volume of errors was still observed in our telemetry.
  • 20:12 UTC on 15 September 2025 – Additional throughput optimization was rolled out to further reduce system load.
  • 21:33 UTC on 15 September 2025 – Services were restored and customer impact was mitigated, as verified by service telemetry.
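
The intervals implied by the timeline above can be worked out directly from the listed timestamps. The sketch below is an illustration only: the dictionary keys and the `minutes_between` helper are hypothetical names introduced here, and the timestamps are copied from the list above.

```python
from datetime import datetime, timezone


def utc(hhmm: str) -> datetime:
    """Build a timezone-aware UTC datetime on 15 September 2025."""
    parsed = datetime.strptime(f"2025-09-15 {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


# Timestamps taken from the response timeline; key names are illustrative.
timeline = {
    "impact_began": utc("12:05"),
    "detected": utc("12:07"),
    "cause_identified": utc("16:00"),
    "capacity_added": utc("17:19"),
    "optimization_rolled_out": utc("20:12"),
    "mitigated": utc("21:33"),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named timeline events."""
    delta = timeline[end] - timeline[start]
    return int(delta.total_seconds() // 60)


if __name__ == "__main__":
    print(minutes_between("impact_began", "detected"))          # 2
    print(minutes_between("impact_began", "cause_identified"))  # 235
    print(minutes_between("impact_began", "mitigated"))         # 568
```

By this calculation, detection took 2 minutes, identifying the cause took 3 hours 55 minutes, and the full incident lasted 9 hours 28 minutes.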

What happens next?


  • Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.
  • To get notified if a PIR is published, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts  
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs  
  • The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact, see https://aka.ms/AzPIR/Monitoring
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness


Posted Sep 16, 2025 - 01:28 CEST

Investigating

What happened? 


Between 12:05 UTC and 21:33 UTC on 15 September 2025, a platform issue impacted customers using Azure Front Door (Standard, Premium, or Classic SKU) or Azure CDN Standard from Microsoft, who may have experienced HTTP 504 status code errors when using these services in the East US region. The impact was limited to two points of presence in the East US area and affected caching traffic.


This issue is now mitigated. An update with more information will be provided shortly.



Posted Sep 16, 2025 - 00:34 CEST