What happened?
Between 10:01 UTC on 28 October 2024 and 18:07 UTC on 01 November 2024, a subset of customers using App Service may have experienced erroneous 404 failure notifications for the Microsoft.Web/sites/workflows API, surfaced through alerts and recorded in their logs. This could potentially have affected all App Service customers, in all Azure regions.
What went wrong and why?
The impact was to the 'App Service (Web Apps)' service and was caused by changes in a back-end service, 'Microsoft Defender for Cloud' (MDC). As part of its protection suite, MDC secures App Service related resources by periodically scanning customers' environments through Azure APIs to detect potential issues. Previously, MDC scanned only the Microsoft.Web/sites API endpoints; it is now advancing its protections to include internal endpoints. These new requests were blocked by Web Apps, causing the 404 responses that in turn appeared in customers' environment logs.
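For customers reviewing their own telemetry, the entries in question were 404 responses recorded against the Microsoft.Web/sites/workflows operation. The sketch below shows one way such entries could be isolated from an exported log file; it is illustrative only, and the file name and field names (operationName, httpStatusCode) are assumptions about a generic JSON-lines export rather than a documented schema.

```python
import json

# Illustrative sketch: filter exported log entries (JSON lines) for the
# erroneous 404s described above. The file name and field names are
# assumptions about a generic export, not a documented schema.

SUSPECT_OPERATION = "Microsoft.Web/sites/workflows"

def find_erroneous_404s(path: str):
    """Yield log entries that look like the spurious 404s from the MDC scans."""
    with open(path, encoding="utf-8") as log_file:
        for line in log_file:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if (SUSPECT_OPERATION in entry.get("operationName", "")
                    and entry.get("httpStatusCode") == 404):
                yield entry

if __name__ == "__main__":
    for hit in find_erroneous_404s("exported_logs.jsonl"):
        print(hit.get("time"), hit.get("operationName"), hit.get("httpStatusCode"))
```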
How did we respond?
This incident was reported by our customers, and the App Services team began investigating the matter to pinpoint the cause. Once MDC was identified as the source of the issue, the MDC team was engaged to support mitigation. The teams decided to throttle and block MDC's requests to the Microsoft.Web/sites/workflows API endpoint to minimize impact. This stopped customers from experiencing the 404 errors, mitigating the problem. The issue was fully resolved once the MDC team identified the exact cause and rolled back the change, stopping the endpoint scan altogether.
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey.
What happened?
Between 10:01 UTC on 28 October 2024 and 18:07 UTC on 01 November 2024, a subset of customers using App Service may have experienced erroneous 404 failure notifications for the Microsoft.Web/sites/workflows API, surfaced through alerts and recorded in their logs.
What do we know so far?
We identified a previous change to a backend service that caused backend operations to be incorrectly invoked against apps.
How did we respond?
We were alerted to this issue via customer reports and began investigating. We applied steps to limit the erroneous failures, reducing further erroneous alerts and log entries. Additionally, we have taken steps to revert the previous change.
What happens next?
Impact Statement: Starting at 10:01 UTC on 28 October 2024, a subset of customers using App Service may have experienced erroneous 404 failure notifications for the Microsoft.Web/sites/workflows API, surfaced through alerts and recorded in their logs.
Current Status: We have applied steps to limit the erroneous failures, reducing further erroneous alerts and log entries. Additionally, we're taking steps to revert a previous change that caused backend operations to be incorrectly invoked against apps. We'll provide an update within the next 2 hours, or as events warrant.
What happened?
Between 18:17 UTC on 10 October 2024 and 20:10 UTC on 15 October 2024, Azure Key Vault deployed changes which impacted backend authorization workflows on Role-Based Access Control (RBAC)-enabled vaults. Customers across multiple regions may have experienced intermittent authorization failures for cross-tenant access, resulting in 403 responses while attempting to perform Key Vault operations. An action taken to mitigate that issue led to increased load on the authorization service in West Europe, causing intermittent authorization failures for a subset of customers attempting to access RBAC-enabled vaults in that region.
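Because the failures were intermittent, an individual Key Vault call could fail with a 403 while an immediate retry succeeded. The sketch below illustrates a simple retry-with-backoff around a data-plane read, assuming the azure-identity and azure-keyvault-secrets Python packages; the vault URL, secret name, and retry policy are placeholders for illustration, not prescribed guidance.

```python
import time

from azure.core.exceptions import HttpResponseError
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Illustrative only: placeholder vault URL and secret name.
VAULT_URL = "https://example-vault.vault.azure.net"

def read_secret_with_retry(name: str, attempts: int = 4, base_delay: float = 1.0):
    """Read a secret, retrying on intermittent 403 authorization failures."""
    client = SecretClient(vault_url=VAULT_URL, credential=DefaultAzureCredential())
    for attempt in range(attempts):
        try:
            return client.get_secret(name)
        except HttpResponseError as err:
            # Only retry the intermittent authorization failures described above.
            if err.status_code != 403 or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # simple exponential backoff

if __name__ == "__main__":
    secret = read_secret_with_retry("example-secret")
    print(secret.name)
```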
What went wrong and why?
A configuration change was made to the authorization logic for Key Vault, which resulted in intermittent authorization failures for some RBAC-enabled vaults across multiple regions. Upon identifying the issue, our Key Vault team disabled the impacted code path via a feature flag, which shifted increased load onto downstream authorization services in West Europe, causing intermittent issues for a subset of customers in the region.
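As a conceptual illustration of why disabling the code path shifted load, the sketch below models a feature-flag guard whose 'off' branch falls through to a downstream authorization call. The flag, function names, and caching behaviour are invented for illustration and do not reflect Key Vault's internal implementation.

```python
# Conceptual illustration only: names and behaviour are invented and do not
# reflect Key Vault's internal implementation.

FEATURE_FLAGS = {"use_impacted_authorization_path": False}  # disabled during mitigation

def check_impacted_path(principal: str, vault: str) -> bool:
    return True  # stand-in for the code path that was disabled

def call_downstream_authorization_service(principal: str, vault: str) -> bool:
    return True  # stand-in for the remote authorization call

def authorize_request(principal: str, vault: str) -> bool:
    if FEATURE_FLAGS["use_impacted_authorization_path"]:
        # Impacted code path: handled without calling the downstream service.
        return check_impacted_path(principal, vault)
    # With the flag off, every request falls through to the downstream
    # authorization service, which is what concentrated load in West Europe.
    return call_downstream_authorization_service(principal, vault)

if __name__ == "__main__":
    print(authorize_request("example-principal", "example-vault"))
```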
How did we respond?
Upon being alerted to the issue, our Key Vault team rolled back to a previous build to mitigate the issue. Following the rollback, we monitored telemetry and concluded service health had recovered.
How are we making incidents like this less likely or less impactful?
Key Vault has updated mitigation guidelines to prevent disabling critical code paths and instructs engineers to roll back to the previous working version of the service in order to mitigate change related issues in the future. (Completed)
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/LT7P-9W0
Summary of Impact: Between 16:08 UTC on 15 Oct 2024 and 20:10 UTC on 15 Oct 2024, some customers using the Key Vault service in the West Europe region may have experienced issues accessing Key Vaults.
Next Steps: We will continue to investigate to establish the full root cause and help prevent future occurrences of this class of issue.
Impact Statement: Starting at 16:08 UTC on 15 Oct 2024, some customers using the Key Vault service in the West Europe region may experience issues accessing Key Vaults. This may directly impact operations on the Key Vault control plane or data plane, as well as supported scenarios where Key Vault is integrated with other Azure services.
Current Status: We are aware of the issue and are actively working to mitigate it. We are monitoring the situation closely and will provide updates as events warrant or once the issue is fully mitigated.
Post Incident Review (PIR) – Azure Portal – Slow initial page load and/or intermittent errors when accessing Azure Portal
What happened?
Between 08:00 and 10:30 UTC on 24 September 2024, we received alerts from platform telemetry and additional customer reports indicating increased latency and intermittent connectivity failures when attempting to access resources in the Azure Portal through the France Central, Germany West Central, North Europe, Norway East, Sweden Central, Switzerland North, UK South, UK West, and West Europe regions.
What went wrong and why?
During a standard deployment procedure, a subset of server instances within the Azure Portal were impacted, reducing available capacity during a period of peak traffic. This concentrated load onto fewer servers than are typically available, which led to the observed latency and intermittent errors. The service mitigated the issue automatically by refreshing the affected instances and restoring them to full functionality.
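To make the capacity effect concrete, the toy calculation below shows how per-instance load grows when part of a server pool drops out of rotation. The instance counts and request rate are hypothetical and are not measurements from this incident.

```python
# Hypothetical figures, purely to illustrate load concentration; they are not
# measurements from this incident.
total_instances = 20          # instances normally serving the region
impacted_instances = 8        # instances temporarily out of rotation
requests_per_second = 10_000  # aggregate peak traffic

healthy = total_instances - impacted_instances
normal_load = requests_per_second / total_instances
degraded_load = requests_per_second / healthy

print(f"Per-instance load rises from {normal_load:.0f} to {degraded_load:.0f} req/s "
      f"({degraded_load / normal_load:.2f}x) until the affected instances are refreshed.")
```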
How did we respond?
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/GMYD-LB8