Post Incident Review (PIR) – Azure Portal – Slow initial page load and/or intermittent errors when accessing Azure Portal
What happened?
Between 08:00 and 10:30 UTC on 24 September 2024, we received alerts from platform telemetry and additional customer reports indicating increased latency and intermittent connectivity failures when attempting to access resources in the Azure Portal through the France Central, Germany West Central, North Europe, Norway East, Sweden Central, Switzerland North, UK South, UK West, and West Europe regions.
What went wrong and why?
During a standard deployment procedure, a subset of server instances within the Azure Portal were impacted, which reduced capacity during a period of peak traffic. Load was therefore concentrated on fewer servers than are typically available, which led to the increased latency and intermittent failures that customers observed. The service automatically mitigated the issue by refreshing the affected instances and restoring them to full functionality.
How did we respond?
- 08:00 UTC on 24 September 2024 – Platform alerts fired.
- 08:30 UTC on 24 September 2024 – Customers reported impact.
- 10:00 UTC on 24 September 2024 – We observed a decrease in failures.
- 10:30 UTC on 24 September 2024 – Mitigation confirmed.
How are we making incidents like this less likely or less impactful?
- We have improved resiliency so that deployments should no longer strain server instances. (Completed)
- We improved our alerts to detect the issue earlier and improved our processes to include specific troubleshooting steps. (Completed)
- We are reviewing and adopting processes to avoid applying deployment procedures during peak traffic hours for the region. (Estimated completion: November 2024)
- We will be pushing further performance updates to ensure deployments aren't as disruptive to traffic. (Estimated completion: November 2024)
How can customers make incidents like this less impactful?
- Consider refreshing the Azure Portal page only when the page indicates a failure, and then only every couple of minutes rather than every few seconds. Often, only the initial loading operation is slowed down; once the page has loaded, the rest of the experience isn't affected by the same incident.
- The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability varied between customers and resources. For guidance on implementing monitoring to understand granular impact, see https://aka.ms/AzPIR/Monitoring
- Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
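The first suggestion above, retrying less aggressively while a service is degraded, can also be applied to scripts or tools that call a web endpoint on your behalf. The following is a minimal Python sketch of capped exponential backoff with jitter. It is illustrative only: the function names, the 5-second base, and the 120-second cap are assumptions chosen to mirror the "minutes instead of seconds" guidance, and are not part of the Azure Portal or any Azure SDK.

```python
import random
import time


def backoff_delays(base_seconds=5.0, cap_seconds=120.0, attempts=6, rng=None):
    """Return wait times (in seconds) using capped exponential backoff with full jitter.

    All defaults are hypothetical values for illustration.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The ceiling doubles on each attempt until it reaches the cap.
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        # Full jitter: pick a random wait between 0 and the ceiling so that
        # many clients do not all retry at the same moment.
        delays.append(rng.uniform(0, ceiling))
    return delays


def call_with_backoff(operation, delays, sleep=time.sleep):
    """Call operation() once immediately, then once after each delay.

    Returns the first successful result, or re-raises the last error
    if every attempt fails.
    """
    last_error = None
    for delay in [0.0] + list(delays):
        if delay:
            sleep(delay)
        try:
            return operation()
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
    raise last_error
```

A caller would pass its own request function, for example `call_with_backoff(fetch_page, backoff_delays())`, where `fetch_page` is a hypothetical function that raises an exception on failure. Spacing retries out this way avoids adding load to servers that are already running at reduced capacity.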
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/GMYD-LB8