What happened?
Between 18:17 UTC on 10 October 2024 and 20:10 UTC on 15 October 2024, Azure Key Vault deployed changes that impacted backend authorization workflows on Role-Based Access Control (RBAC)-enabled vaults. Customers across multiple regions may have experienced intermittent authorization failures for cross-tenant access, resulting in 403 responses when attempting to perform Key Vault operations. An action taken to mitigate that issue then increased load on the authorization service in West Europe, causing intermittent authorization failures for customers accessing RBAC-enabled vaults in that region.
What went wrong and why?
A configuration change to the authorization logic for Key Vault caused intermittent authorization failures on some RBAC-enabled vaults across multiple regions. After identifying the issue, our Key Vault team disabled the impacted code path via a feature flag. This increased the load on downstream authorization services in West Europe, causing intermittent issues for a subset of customers in that region.
How did we respond?
Upon being alerted, our Key Vault team mitigated the issue by rolling back to a previous build. Following the rollback, we monitored telemetry and concluded that service health had recovered.
How are we making incidents like this less likely or less impactful?
The Key Vault team has updated its mitigation guidelines to prevent the disabling of critical code paths. The guidelines now instruct engineers to mitigate change-related issues by rolling back to the previous working version of the service. (Completed)
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/LT7P-9W0
Summary of Impact: Between 16:08 UTC on 15 Oct 2024 and 20:10 UTC on 15 Oct 2024, some customers using the Key Vault service in the West Europe region may have experienced issues accessing Key Vaults.
Next Steps: We will continue to investigate to establish the full root cause to help prevent future occurrences for this class of issues.
Impact Statement: Starting at 16:08 UTC on 15 Oct 2024, some customers using the Key Vault service in the West Europe region may experience issues accessing Key Vaults. This may directly impact performing operations on the control plane or data plane for Key Vault or for supported scenarios where Key Vault is integrated with other Azure services.
Current Status: We are aware and actively working on mitigating the incident. This situation is being closely monitored and we will provide updates as the situation warrants or once the issue is fully mitigated.