What happened?
Between 23:52 UTC on 11 June 2024 and 02:10 UTC on 12 June 2024, and between 23:30 UTC on 13 June 2024 and 00:24 UTC on 14 June 2024, a platform update resulted in impact to the following services:
• Azure Front Door: Customers may have experienced intermittent issues with latency and timeouts in Japan East.
• Azure CDN: Customers may have experienced intermittent issues with latency and timeouts.
What went wrong and why?
To keep pace with the continuously evolving security and compliance landscape, we follow a continuous security patching lifecycle. These patches upgrade the versions of all components and libraries that power our infrastructure. Patches are qualified in lab environments and rolled out following safe deployment practices. During the patching process, nodes are taken out of rotation to prevent disruption while patches are applied, and are brought back into rotation only after synthetic health probes report a healthy signal.
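To illustrate the rotation process described above, the following sketch shows how nodes might be drained, patched, and returned to rotation only after synthetic probes pass. All names here (`Node`, `probe_healthy`, `patch_and_rotate`, the `pop-jpe-*` labels) are illustrative assumptions, not actual Azure Front Door internals:

```python
import time

class Node:
    """Minimal stand-in for an edge node (illustrative only)."""
    def __init__(self, name):
        self.name = name
        self.in_rotation = True
        self.patched = False

    def apply_patches(self):
        self.patched = True

    def check(self):
        # A real synthetic probe would issue requests against the node;
        # here, a patched node simply reports healthy.
        return self.patched

def probe_healthy(node, attempts=3):
    """A node must pass every synthetic probe before rejoining."""
    return all(node.check() for _ in range(attempts))

def patch_and_rotate(nodes):
    for node in nodes:
        node.in_rotation = False      # drain traffic before patching
        node.apply_patches()
        while not probe_healthy(node):
            time.sleep(1)             # re-probe until the node is healthy
        node.in_rotation = True       # healthy: back into rotation

nodes = [Node("pop-jpe-1"), Node("pop-jpe-2")]
patch_and_rotate(nodes)
print(all(n.in_rotation and n.patched for n in nodes))  # True
```

Note that the probe gates only node health, not capacity: a node can report healthy while its connection limits are reduced, which is the gap described below.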
Because Azure Front Door provides content caching and forwarding capabilities, it needs to maintain warm connections to other services within the system and to customer origins. The most recent security patches reduced the connection limit between systems; normally, when this happens, our load balancing mechanism shifts traffic to other systems. However, a previously unknown bug in this load balancing mechanism meant that it did not recognize that the systems being patched had exhausted their connection limits.
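A minimal sketch of the connection-aware selection logic that was missing can help make the bug concrete. This is not the actual load balancing mechanism; the backend structure and field names are assumptions for illustration:

```python
def pick_backend(backends):
    """Choose the least-loaded backend, skipping any backend that has
    exhausted its connection limit -- the check the faulty load
    balancing mechanism did not perform."""
    eligible = [b for b in backends
                if b["active_conns"] < b["conn_limit"]]
    if not eligible:
        raise RuntimeError("all backends at their connection limit")
    return min(eligible, key=lambda b: b["active_conns"])

backends = [
    {"name": "pop-a", "active_conns": 500, "conn_limit": 500},  # exhausted
    {"name": "pop-b", "active_conns": 120, "conn_limit": 500},
]
print(pick_backend(backends)["name"])  # pop-b
```

Without the eligibility filter, traffic would continue to be sent to "pop-a" even though it can accept no new connections, producing exactly the intermittent latency and timeouts described above.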
This bug was first observed on 11 June 2024 while we were performing patching, but we attributed the issues to a need for more capacity, and the problem was mitigated once capacity was added. Subsequently, on 13 June 2024 during a security patch rollout, we were notified that a customer was again experiencing intermittent latency and timeouts, and during this timeframe we discovered the bug noted above.
We also identified that our current patch qualification process does not simulate all load conditions needed to test peak limits, which is why we failed to detect the reduction in available connection limits.
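The qualification gap above can be sketched as a simple peak-load check: ramp connections until the system refuses one, then compare the observed ceiling against the expected limit. The `open_conn` callable and the specific numbers are hypothetical, used only to show the shape of the check:

```python
def qualify_connection_limit(open_conn, expected_limit):
    """Open connections until refused, then verify the observed
    ceiling meets the expected limit -- the peak-load test the
    qualification process lacked. `open_conn` returns True while
    the system under test accepts new connections."""
    observed = 0
    while open_conn():
        observed += 1
    return observed >= expected_limit, observed

# Simulated system whose patch silently lowered the limit to 300.
state = {"conns": 0, "limit": 300}
def open_conn():
    if state["conns"] < state["limit"]:
        state["conns"] += 1
        return True
    return False

ok, observed = qualify_connection_limit(open_conn, expected_limit=500)
print(ok, observed)  # False 300 -> the patch fails qualification
```

Had a check of this shape run during qualification, the reduced connection limit would have surfaced in the lab rather than in production.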
How did we respond?
• 00:46 UTC on 11 June 2024 – Japan East POP taken offline for security patching
• 23:00 UTC on 11 June 2024 – Japan East POP brought back online
• 23:52 UTC on 11 June 2024 – Customer impact started
• 00:18 UTC on 12 June 2024 – Issue detected via service monitoring alerts
• 00:26 UTC on 12 June 2024 – Source of the issue traced to scheduled maintenance
• 01:58 UTC on 12 June 2024 – Load management configuration adjusted to balance traffic in the region
• 02:10 UTC on 12 June 2024 – Service restored, and customer impact mitigated
• 04:47 UTC on 12 June 2024 – Customers notified of the outage (Mitigated)
• 21:00 UTC on 13 June 2024 – Additional capacity added in Japan East to address load concerns
• 21:30 UTC on 13 June 2024 – Traffic ramp-up started in the Japan East POP
• 23:30 UTC on 13 June 2024 – Customer impact started
• 23:37 UTC on 13 June 2024 – Issue detected via service monitoring alerts
• 00:02 UTC on 14 June 2024 – Contributing cause identified
• 00:24 UTC on 14 June 2024 – Mitigation workstream started
• 00:34 UTC on 14 June 2024 – Service restored, and customer impact mitigated
• 02:39 UTC on 14 June 2024 – Customers notified of the outage (Mitigated)
How are we making incidents like this less likely or less impactful?
We have identified several key repair items to prevent and mitigate similar issues, and we are confident these repairs will help prevent such incidents. Since it will take several months to implement them, as noted below, we have put interim processes in place to prevent issues during security patching.
• We have already identified the root cause and applied fixes to ensure all POPs have sufficient connection limits (Completed).
• We are improving our monitoring to add metrics for active connections in use (Estimated completion: Q4-CY24).
• We are adding intelligence in traffic load management to automatically shift traffic based on active connections in use (Estimated completion: Q4-CY24).
• We are improving our procedures to run load tests in patch qualification (Estimated completion: Q4-CY24).
• We are improving our notification procedures to proactively inform customers of availability impact (Estimated completion: Q4-CY24).
• In the longer term, we will address validation gaps in our procedures to bring POPs online (Estimated completion: Q4-CY24).
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/XKDP-DTG