Microsoft Incident - Key Vault - Post Incident Review (PIR) – Azure Key Vault – Service availability impacted in West Europe
Incident Report for Graphisoft
Update

What happened?


Between 18:17 UTC on 10 October 2024 and 20:10 UTC on 15 October 2024, Azure Key Vault deployed changes that impacted backend authorization workflows on Role-Based Access Control (RBAC)-enabled vaults. Customers across multiple regions may have experienced intermittent authorization failures for cross-tenant access, resulting in 403 responses while attempting to perform Key Vault operations. An action taken to mitigate that issue increased load on the authorization service in West Europe, causing intermittent authorization failures when accessing RBAC-enabled vaults in that region.




What went wrong and why?


A configuration change was made to the authorization logic for Key Vault, which resulted in intermittent authorization failures to some RBAC-enabled vaults, across multiple regions. Upon identifying the issue, our Key Vault team disabled the impacted code path via a feature flag, which led to increased load being placed on downstream authorization services in West Europe, leading to intermittent issues for a subset of customers in the region.




How did we respond?


Upon being alerted to the availability drop, our Key Vault team rolled back to a previous build to mitigate the issue. Following the rollback, we monitored telemetry and concluded that service health had recovered.


  • 17:16 UTC on 15 October 2024 – Alerted to availability drop in West Europe.
  • 17:30 UTC on 15 October 2024 – Conducted rollback to previous build.
  • 20:10 UTC on 15 October 2024 – Service health restored and customer impact fully mitigated.





How are we making incidents like this less likely or less impactful?


Key Vault has updated its mitigation guidelines to prevent disabling critical code paths. The guidelines now instruct engineers to roll back to the previous working version of the service to mitigate change-related issues in the future. (Completed)








How can customers make incidents like this less impactful?


  • In regions with a region pair, Key Vault automatically replicates your key vault to the secondary region. In the rare event that an entire Azure region is unavailable, the requests that you make to Key Vault in that region are automatically routed (failed over) to the secondary region. When the primary region is available again, requests are routed back (failed back) to the primary region. You don't need to take any action because this happens automatically.
  • Azure Key Vault availability and redundancy - Azure Key Vault | Microsoft Learn
  • In regions that don't support automatic replication to a secondary region, you must plan for the recovery of your key vaults in a region failure scenario. To back up and restore your Azure key vault to a region of your choice, refer to the steps that are detailed in the Azure Key Vault backup guidance.
  • Back up a secret, key, or certificate stored in Azure Key Vault | Microsoft Learn
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
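
As an illustration of the backup recommendation in the list above, the following is a minimal Python sketch, not an official procedure. It assumes a client object shaped like SecretClient from the azure-keyvault-secrets package, using its list_properties_of_secrets() and backup_secret() methods. The function name backup_all_secrets, the output directory, and the ".secretbackup" file suffix are hypothetical choices made for this example. The client is passed in rather than constructed, so the sketch can be tried without Azure access.

```python
from pathlib import Path
from typing import Any, List


def backup_all_secrets(client: Any, out_dir: Path) -> List[Path]:
    """Write one backup blob per secret into out_dir and return the file paths.

    `client` is assumed to behave like azure.keyvault.secrets.SecretClient.
    A real client could be created along these lines (placeholder vault URL):

        from azure.identity import DefaultAzureCredential
        from azure.keyvault.secrets import SecretClient
        client = SecretClient(
            vault_url="https://<your-vault>.vault.azure.net",
            credential=DefaultAzureCredential(),
        )
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for props in client.list_properties_of_secrets():
        # backup_secret returns an opaque, encrypted blob as bytes.
        blob = client.backup_secret(props.name)
        # Hypothetical naming scheme: one file per secret name.
        path = out_dir / f"{props.name}.secretbackup"
        path.write_bytes(blob)
        written.append(path)
    return written
```

Each saved blob could later be passed to the client's restore_secret_backup() method against a vault in the region of your choice. Check the backup guidance referenced above for the permissions required and for restrictions on where a backup can be restored.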



How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/LT7P-9W0

Posted Nov 09, 2024 - 02:22 CET
Update

Summary of Impact: Between 16:08 UTC on 15 Oct 2024 and 20:10 UTC on 15 Oct 2024, some customers using the Key Vault service in the West Europe region may have experienced issues accessing Key Vaults.





Next Steps: We will continue to investigate to establish the full root cause to help prevent future occurrences for this class of issues.

Posted Oct 15, 2024 - 22:10 CEST
Investigating

Impact Statement: Starting at 16:08 UTC on 15 Oct 2024, some customers using the Key Vault service in the West Europe region may experience issues accessing Key Vaults. This may directly impact performing operations on the control plane or data plane for Key Vault or for supported scenarios where Key Vault is integrated with other Azure services.




Current Status: We are aware and actively working on mitigating the incident. This situation is being closely monitored and we will provide updates as the situation warrants or once the issue is fully mitigated.

Posted Oct 15, 2024 - 20:55 CEST