Microsoft Incident - App Service - Post Incident Review (PIR) - App Service – Erroneous 404 failures

Incident Report for Graphisoft

Resolved

This incident has been resolved.

Posted Nov 26, 2024 - 11:38 CET

Update

What Happened?

Between 10:01 UTC on 28 October 2024 and 18:07 UTC on 01 November 2024, a subset of customers using App Service may have seen erroneous 404 failure notifications for the Microsoft.Web/sites/workflows API, both raised by alerts and recorded in their logs. This could have affected App Service customers in all Azure regions.
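As a rough aid for reviewing your own logs, the sketch below flags entries that match the impact described above: a 404 status, the Microsoft.Web/sites/workflows API, and a timestamp inside the stated impact window. The record shape (a dict with `timestamp`, `status`, and `operation` fields) and the function name are assumptions made for illustration, not the schema of any Azure log export, so adapt the field names to whatever your logs actually contain.

```python
from datetime import datetime, timezone

# Impact window as stated in this report (UTC).
IMPACT_START = datetime(2024, 10, 28, 10, 1, tzinfo=timezone.utc)
IMPACT_END = datetime(2024, 11, 1, 18, 7, tzinfo=timezone.utc)

# API named in this report; compared case-insensitively.
AFFECTED_API = "microsoft.web/sites/workflows"


def is_likely_erroneous(record):
    """Return True if a log record matches the incident's signature.

    `record` is a hypothetical dict with:
      - "timestamp": ISO 8601 string, e.g. "2024-10-30T12:00:00+00:00"
      - "status": HTTP status code as an int
      - "operation": string naming the API or operation that was called
    """
    ts = datetime.fromisoformat(record["timestamp"])
    if ts.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return (
        record["status"] == 404
        and AFFECTED_API in record["operation"].lower()
        and IMPACT_START <= ts <= IMPACT_END
    )


def filter_erroneous(records):
    """Keep only the records that match the incident's signature."""
    return [r for r in records if is_likely_erroneous(r)]
```

A match only indicates that an entry is consistent with this incident. It does not prove that the entry was caused by it.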

What went wrong and why?

The impact was on the 'App Service (Web Apps)' service and was caused by a change in a back-end service, 'Microsoft Defender for Cloud' (MDC). As part of its protection suite, MDC secures App Service resources by periodically scanning customers' environments using Azure APIs to detect potential issues. Previously, MDC only scanned the Microsoft.Web/sites API endpoints; with this change, MDC extended its protections to include internal endpoints. This change blocked Web Apps, causing the 404 responses, which in turn appeared in customers' environment logs.

How did we respond?

This incident was reported by our customers, and the App Services team began investigating the matter to pinpoint the cause. Once MDC was identified as the cause of the issue, the MDC team was brought in to support mitigation. The teams decided to throttle and block MDC’s requests to minimize the impact on the Microsoft.Web/sites/workflows API endpoint. This prevented customers from experiencing further 404 errors, mitigating the problem. The issue was fully resolved only after the MDC team identified the exact cause and rolled back the change, stopping the endpoint scan altogether.

10:01:00 UTC on 28 October 2024 – Customer Impact started.
15:52:10 UTC on 31 October 2024 – App Services team engaged and began their investigation.
17:28:12 UTC on 01 November 2024 – MDC team engaged and supported the investigation.
18:07:00 UTC on 01 November 2024 – Teams decided to throttle and block MDC’s requests to the endpoint, mitigating the problem.
20:19:13 UTC on 01 November 2024 – MDC rolled back the change, fully resolving the incident.
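The intervals between the timeline entries above can be worked out with a short calculation. This is only an illustrative sketch: the timestamps are copied from the timeline, while the dictionary keys and the `elapsed` helper are labels invented for this example.

```python
from datetime import datetime, timezone


def utc(year, month, day, hour, minute, second=0):
    """Build a timezone-aware UTC datetime."""
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)


# Timestamps from the timeline above; the key names are illustrative only.
TIMELINE = {
    "impact_started": utc(2024, 10, 28, 10, 1, 0),
    "app_service_engaged": utc(2024, 10, 31, 15, 52, 10),
    "mdc_engaged": utc(2024, 11, 1, 17, 28, 12),
    "mitigated": utc(2024, 11, 1, 18, 7, 0),
    "rolled_back": utc(2024, 11, 1, 20, 19, 13),
}


def elapsed(start_key, end_key):
    """Return the timedelta between two timeline events."""
    return TIMELINE[end_key] - TIMELINE[start_key]


time_to_engage = elapsed("impact_started", "app_service_engaged")
time_to_mitigate = elapsed("impact_started", "mitigated")
time_to_resolve = elapsed("impact_started", "rolled_back")
mitigation_to_rollback = elapsed("mitigated", "rolled_back")

for label, delta in [
    ("Impact start to App Services engagement", time_to_engage),
    ("Impact start to mitigation", time_to_mitigate),
    ("Impact start to rollback", time_to_resolve),
    ("Mitigation to rollback", mitigation_to_rollback),
]:
    print(f"{label}: {delta}")
```

By this calculation, impact lasted 4 days, 8 hours and 6 minutes before mitigation, and the rollback followed about 2 hours and 12 minutes later.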

How are we making incidents like this less likely or less impactful?

We are fixing the scanning approach MDC uses to retrieve its required information, to reduce errors. (Estimated completion: December 2024)
We will improve MDC monitors to detect errors earlier. (Estimated completion: December 2024)

How can customers make incidents like this less impactful?

Consider configuring geo-replication on your premium Container Registries. During this incident, customers with geo-replicated registries could have disabled the West Europe endpoint as a temporary workaround until this regional issue was mitigated: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-geo-replication.
More generally, consider evaluating the reliability of your mission-critical applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency.
The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability varied between customers and resources. For guidance on implementing monitoring to understand granular impact, see https://aka.ms/AzPIR/Monitoring.
Finally, consider ensuring that the right people in your organization will be notified about any future service issues by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey.

Posted Nov 21, 2024 - 12:15 CET

Update

What happened?

What do we know so far?

We identified a previous change to a backend service that caused backend operations to be called against apps incorrectly.

How did we respond?

We were alerted to this issue via customer reports and began investigating. We applied steps to limit the erroneous failures, reducing further erroneous alerts and log entries. Additionally, we’ve taken steps to revert the previous change.

What happens next?

To request a Post Incident Review (PIR), impacted customers can use the “Request PIR” feature within Azure Service Health. (Note: We're in the process of transitioning from "Root Cause Analyses (RCAs)" to "Post Incident Reviews (PIRs)", so you may temporarily see both terms used interchangeably in the Azure portal and in Service Health alerts.)
To get notified if a PIR is published, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.
For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs.
Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.

Posted Nov 01, 2024 - 21:31 CET

Investigating

Impact Statement: Starting at 10:01 UTC on 28 October 2024, a subset of customers using App Service may have seen erroneous 404 failure notifications for the Microsoft.Web/sites/workflows API, both raised by alerts and recorded in their logs.

Current Status: We have applied steps to limit the erroneous failures, reducing further erroneous alerts and log entries. Additionally, we’re taking steps to revert a previous change that caused backend operations to be called against apps incorrectly. We’ll provide an update within the next 2 hours or as events warrant.

Posted Nov 01, 2024 - 20:09 CET