Microsoft Incident – PIR – Azure Front Door – Multi-region
Incident Report for Graphisoft
Resolved
This incident has been resolved.
Posted Sep 09, 2024 - 10:04 CEST
Investigating

What happened?


Between 07:30 UTC and 15:37 UTC on 22 April 2024, a subset of customers using Azure Front Door (AFD) experienced intermittent availability drops, connection timeouts, and increased latency in the UAE North and Qatar Central regions. Approximately 18% of the traffic served by AFD in these regions was affected during the impact window.


What went wrong and why?


The issue was triggered by a 10x regional traffic surge, which caused high CPU utilization within specific AFD Points of Presence (POPs) and, in turn, an availability loss for AFD customers near the UAE North and Qatar Central regions.


  • Normally, automatic traffic shaping and other mechanisms work to shed load away from overloaded AFD environments.
  • In this case, traffic shaping did activate and shifted normal traffic away, but some AFD environments remained overloaded because the surge traffic did not honor DNS record TTLs, so customer traffic stayed on these overloaded environments for much longer than the configured DNS TTL (see the sketch after this list).
  • In addition to traffic shaping, AFD also has mechanisms in place that protect the AFD platform from traffic surges. These platform protections were also overloaded during this incident due to an undiscovered regression and the geographically concentrated nature of the traffic.
  • While automated monitors identified the issue almost immediately, impact was prolonged because the alert was configured with too low a severity and because mitigation processes for floods of this type were lacking.
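
For illustration, the sketch below (a minimal, hypothetical client-side cache, not Azure's or the surge clients' actual implementation) shows what honoring a DNS record TTL looks like: the cached address is reused only until the TTL expires, after which the client re-resolves and picks up any address change made by traffic shaping. A client that resolves once and reuses the address indefinitely, as the surge traffic effectively did, stays pinned to the original, overloaded POP.

    import time

    # Hypothetical stand-in for a real DNS lookup; returns the current POP
    # address for a hostname plus the record's TTL in seconds.
    def resolve(hostname: str) -> tuple[str, int]:
        return "203.0.113.10", 30  # placeholder address, 30-second TTL

    class TtlHonoringCache:
        """Caches an address only until its TTL expires, then re-resolves."""

        def __init__(self) -> None:
            self._cache: dict[str, tuple[str, float]] = {}  # host -> (addr, expiry)

        def lookup(self, hostname: str) -> str:
            entry = self._cache.get(hostname)
            if entry is None or time.monotonic() >= entry[1]:
                # First lookup or TTL expired: re-resolve so any address change
                # made by traffic shaping takes effect.
                addr, ttl = resolve(hostname)
                self._cache[hostname] = (addr, time.monotonic() + ttl)
            return self._cache[hostname][0]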

How did we respond?


  • 07:20 UTC on 22 April 2024: An unexpected increase in customer traffic began near Azure Front Door's DOH point of presence.
  • 07:28 UTC: As traffic volume increased, AFD's automatic traffic shaping mechanisms activated to distribute the load. Platform protections also began rate limiting the surge traffic because it exceeded the default traffic quotas for AFD (see the sketch after this timeline).
  • 07:55 UTC: Internal monitoring triggered a low-level alert due to high CPU utilization in the region.
  • 11:33 UTC: Availability monitors crossed the thresholds required to trigger a high-level alert, and on-call engineers were engaged.
  • 12:18 UTC: The source of the surge traffic was identified, and engineers applied mitigations to reduce impact to other customers.
  • 15:37 UTC: Our telemetry confirmed that the issue was mitigated, and service functionality was fully restored.
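
To make the quota protection mentioned at 07:28 UTC concrete, the sketch below shows a token-bucket rate limiter, one common way a platform can cap request rates once traffic exceeds a default quota. It is a minimal illustration using assumed numbers, not AFD's actual protection mechanism or its real quota values.

    import time

    class TokenBucket:
        """Allows a sustained rate plus a bounded burst; requests over quota are rejected."""

        def __init__(self, rate_per_sec: float, burst: float) -> None:
            self.rate = rate_per_sec        # sustained requests per second
            self.capacity = burst           # maximum burst size
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill tokens for the elapsed time, capped at the burst capacity.
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False                    # over quota: reject (e.g. HTTP 429) or queue

    # Example with assumed numbers: 100 requests/second sustained, bursts up to 200.
    limiter = TokenBucket(rate_per_sec=100, burst=200)
    if not limiter.allow():
        print("429 Too Many Requests: surge traffic exceeds the configured quota")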

How are we making incidents like this less likely or less impactful?

  • We have already updated our platform to protect against traffic that does not respond to normal traffic shaping, and we have updated and clarified our processes for mitigating traffic floods of this type (Completed).
  • Adjust our regional availability instrumentation thresholds so that alerts reflect the true severity of issues, as sketched after this list (Estimated completion: August 2024).
  • In the longer term, we are working to improve our resilience testing for scenarios such as this and improve noisy-neighbor protections to isolate anomalous traffic (Estimated completion: CY24-Q4).
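
As a rough illustration of the threshold adjustment above, the sketch below maps a measured regional availability figure to an alert severity. The thresholds and severity labels are assumptions for illustration only, not Microsoft's actual monitoring configuration; the point of the repair item is to tune values like these so that an impact of this size raises a high-severity alert immediately instead of only a low-level one.

    def alert_severity(regional_availability_pct: float) -> str:
        """Map a regional availability measurement to an alert severity (assumed thresholds)."""
        if regional_availability_pct < 95.0:
            return "high severity: page on-call engineers immediately"
        if regional_availability_pct < 99.0:
            return "medium severity: open an incident for the service team"
        if regional_availability_pct < 99.9:
            return "low severity: log for investigation"
        return "healthy"

    # During this incident roughly 18% of regional traffic was affected, so a
    # measurement in this range should map to the highest severity.
    print(alert_severity(82.0))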

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey.

Posted Jul 02, 2024 - 21:20 CEST