Investigating - Failure Anomalies notifies you of an unusual rise in the rate of failed HTTP requests or dependency calls.
Nov 22, 2024 - 15:23 CET
Update -

What happened?


Between 10:01 UTC on 28 October 2024 and 18:07 UTC on 01 November 2024, a subset of customers using App Service may have received erroneous 404 failure notifications for the Microsoft.Web/sites/workflows API, surfaced through alerts and recorded in their logs. This could potentially have affected all App Service customers, in all Azure regions.




What went wrong and why?


The impact was on the 'App Service (Web Apps)' service and was caused by changes in a back-end service, 'Microsoft Defender for Cloud' (MDC). As part of its protection suite, MDC secures App Service related resources by periodically scanning customers' environments through Azure APIs to detect potential issues. Previously, MDC scanned only the Microsoft.Web/sites API endpoints; it has since been extending its protections to include internal endpoints. Requests resulting from this change were rejected by Web Apps, causing the 404 responses that in turn appeared in customers' environment logs.
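For customers reviewing their own telemetry, the following is a minimal sketch (not official tooling) of how the erroneous 404s could be surfaced from App Service diagnostic logs using the azure-monitor-query Python package. The workspace ID is a placeholder, and the AppServiceHTTPLogs table and its ScStatus/CsUriStem/CsHost columns are assumptions about how diagnostic settings are configured in your environment.

```python
# Illustrative sketch: find 404 responses on the workflows path during the
# impact window. Placeholders and table/column names are assumptions.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

WORKSPACE_ID = "<log-analytics-workspace-id>"  # hypothetical placeholder

QUERY = """
AppServiceHTTPLogs
| where ScStatus == 404
| where CsUriStem contains "workflows"
| summarize failures = count() by bin(TimeGenerated, 1h), CsHost
| order by TimeGenerated asc
"""

client = LogsQueryClient(DefaultAzureCredential())
response = client.query_workspace(
    workspace_id=WORKSPACE_ID,
    query=QUERY,
    timespan=timedelta(days=7),  # widen or narrow to cover the impact window
)

# Print hourly failure counts per host (assumes a fully successful query).
for table in response.tables:
    for row in table.rows:
        print(list(row))
```

Hits on the workflows path inside the impact window above would most likely correspond to the erroneous scan traffic rather than genuine application failures.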




How did we respond?


This incident was reported by our customers, and the App Services team began investigating the matter to pinpoint the cause. Once MDC was identified as the cause of the issue, the MDC team was brought in to support mitigation. The teams decided to throttle and block MDC's requests to minimize impact on the Microsoft.Web/sites/workflows API endpoint. This prevented customers from experiencing 404 errors, mitigating the problem. The issue was fully resolved only after the MDC team identified the exact cause and rolled back the change, stopping the endpoint scan altogether.


  • 10:01:00 UTC on 28 October 2024 – Customer Impact started.
  • 15:52:10 UTC on 31 October 2024 – App Services team engaged and began their investigation.
  • 17:28:12 UTC on 01 November 2024 – MDC team engaged and supported the investigation.
  • 18:07:00 UTC on 01 November 2024 – Teams throttled and blocked MDC requests to the endpoint, mitigating the problem.
  • 20:19:13 UTC on 01 November 2024 – MDC rolled back the change fully resolving the incident.



How are we making incidents like this less likely or less impactful?


  • We are fixing the scanning approach MDC uses to retrieve the required information, to reduce errors. (Estimated completion: December 2024)
  • We will improve MDC monitoring to detect errors earlier. (Estimated completion: December 2024)


How can customers make incidents like this less impactful?

  • Consider configuring geo-replication on your premium Container Registries. During this incident, customers with geo-replicated registries could have disabled the West Europe endpoint as a temporary workaround until this regional issue was mitigated: https://docs.microsoft.com/en-us/azure/container-registry/container-registry-geo-replication.
  • More generally, consider evaluating the reliability of your mission-critical applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency.
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring.
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.
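As a rough illustration of the last point, the sketch below creates a subscription-wide Service Health activity log alert by calling the ARM REST API directly from Python. The subscription ID, resource group, alert name, and action group are hypothetical placeholders; the Azure portal or CLI can be used to the same effect.

```python
# Illustrative sketch: create an activity log alert for Service Health events.
# All identifiers below are placeholders; adjust them for your environment.
import requests
from azure.identity import DefaultAzureCredential

SUBSCRIPTION = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
ALERT_NAME = "service-health-alert"
ACTION_GROUP_ID = (
    f"/subscriptions/{SUBSCRIPTION}/resourceGroups/{RESOURCE_GROUP}"
    "/providers/microsoft.insights/actionGroups/<action-group-name>"
)

token = DefaultAzureCredential().get_token("https://management.azure.com/.default")

url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
    f"/resourceGroups/{RESOURCE_GROUP}"
    f"/providers/Microsoft.Insights/activityLogAlerts/{ALERT_NAME}"
    "?api-version=2020-10-01"
)

body = {
    "location": "Global",
    "properties": {
        "enabled": True,
        "scopes": [f"/subscriptions/{SUBSCRIPTION}"],
        # Fire on any Service Health event in the subscription.
        "condition": {"allOf": [{"field": "category", "equals": "ServiceHealth"}]},
        # Route notifications (email, SMS, webhook, etc.) via the action group.
        "actions": {"actionGroups": [{"actionGroupId": ACTION_GROUP_ID}]},
    },
}

resp = requests.put(url, json=body, headers={"Authorization": f"Bearer {token.token}"})
resp.raise_for_status()
print("Created alert:", resp.json()["id"])
```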


How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey.


Nov 21, 2024 - 12:15 CET
Update -

What happened?


Between 10:01 UTC on 28 October 2024 and 18:07 UTC on 01 November 2024, a subset of customers using App Service may have received erroneous 404 failure notifications for the Microsoft.Web/sites/workflows API, surfaced through alerts and recorded in their logs.


What do we know so far?


We identified a previous change to a backend service that caused backend operations to be invoked against apps incorrectly.


How did we respond?


We were alerted to this issue via customer reports and began investigating. We applied steps to limit the erroneous failures, alleviating additional erroneous alerts and log entries. Additionally, we have taken steps to revert the previous change.


What happens next?


  • To request a Post Incident Review (PIR), impacted customers can use the “Request PIR” feature within Azure Service Health. (Note: We're in the process of transitioning from "Root Cause Analyses (RCAs)" to "Post Incident Reviews (PIRs)", so you may temporarily see both terms used interchangeably in the Azure portal and in Service Health alerts.)
  • To get notified if a PIR is published, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs.
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.

Nov 01, 2024 - 21:31 CET
Investigating -

Impact Statement: Starting at 10:01 UTC on 28 October 2024, a subset of customers using App Service may have received erroneous 404 failure notifications for the Microsoft.Web/sites/workflows API, surfaced through alerts and recorded in their logs.


Current Status: We have applied steps to limit the erroneous failures, alleviating additional erroneous alerts and log entries. Additionally, we're taking steps to revert a previous change that caused backend operations to be invoked against apps incorrectly. We'll provide an update within the next 2 hours or as events warrant.


Nov 01, 2024 - 20:09 CET
Investigating - Failure Anomalies notifies you of an unusual rise in the rate of failed HTTP requests or dependency calls.
Nov 20, 2024 - 09:12 CET
Update -

What happened?


Between 18:17 UTC on 10 October 2024 and 20:10 UTC on 15 October 2024, Azure Key Vault deployed changes which impacted backend authorization workflows on Role-Based Access Control (RBAC)-enabled vaults. Customers across multiple regions may have experienced intermittent authorization failures for cross-tenant access, resulting in 403 responses while attempting to perform Key Vault operations. An action taken to mitigate that issue led to increased load on the authorization service in West Europe, causing intermittent issues when attempting to access RBAC-enabled vaults, resulting in authorization failures.
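Because the authorization failures described above were intermittent rather than sustained, client-side retries would often have succeeded on a later attempt. The following is a purely illustrative Python sketch of that idea using the azure-keyvault-secrets SDK; the vault URL and secret name are hypothetical placeholders, and retrying 403s is only sensible in a scenario like this one where the failures are known to be transient.

```python
# Illustrative sketch: retry intermittent 403s with exponential backoff.
import time

from azure.core.exceptions import HttpResponseError
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

client = SecretClient(
    vault_url="https://<your-vault-name>.vault.azure.net",
    credential=DefaultAzureCredential(),
)


def get_secret_with_retry(name, attempts=5, base_delay=2.0):
    """Fetch a secret, retrying transient 403 responses with backoff."""
    for attempt in range(attempts):
        try:
            return client.get_secret(name)
        except HttpResponseError as err:
            # During this incident the 403s were intermittent, so a later
            # attempt would often have succeeded. Normally a 403 is not
            # retryable and indicates a genuine authorization problem.
            if err.status_code != 403 or attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


secret = get_secret_with_retry("<secret-name>")
print(secret.name)
```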




What went wrong and why?


A configuration change was made to the authorization logic for Key Vault, which resulted in intermittent authorization failures to some RBAC-enabled vaults across multiple regions. Upon identifying the issue, our Key Vault team disabled the impacted code path via a feature flag, which placed increased load on downstream authorization services in West Europe and led to intermittent issues for a subset of customers in the region.




How did we respond?


Upon being alerted to the issue, our Key Vault team rolled back to a previous build to mitigate the issue. Following the rollback, we monitored telemetry and concluded service health had recovered.


  • 17:16 UTC on 15 October 2024 – Alerted to availability drop in West Europe.
  • 17:30 UTC on 15 October 2024 – Conducted rollback to previous build.
  • 20:10 UTC on 15 October 2024 – Service health restored and customer impact fully mitigated.





How are we making incidents like this less likely or less impactful?


Key Vault has updated its mitigation guidelines to prevent disabling critical code paths, and now instructs engineers to roll back to the previous working version of the service to mitigate change-related issues in the future. (Completed)








How can customers make incidents like this less impactful?


  • In regions with a region pair, Key Vault automatically replicates your key vault to the secondary region. In the rare event that an entire Azure region is unavailable, the requests that you make to Key Vault in that region are automatically routed (failed over) to the secondary region, and when the primary region is available again, requests are routed back (failed back) to the primary region. You don't need to take any action because this happens automatically.
  • Azure Key Vault availability and redundancy - Azure Key Vault | Microsoft Learn
  • In regions that don't support automatic replication to a secondary region, you must plan for the recovery of your key vaults in a region failure scenario. To back up and restore your Azure key vault to a region of your choice, refer to the steps detailed in the Azure Key Vault backup guidance (a brief sketch follows this list).
  • Back up a secret, key, or certificate stored in Azure Key Vault | Microsoft Learn
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://docs.microsoft.com/en-us/azure/architecture/framework/resiliency
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
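To illustrate the backup guidance referenced above, here is a brief sketch using the azure-keyvault-secrets Python SDK; the vault URLs and secret name are hypothetical placeholders. Note that Key Vault backups can only be restored within the same Azure subscription and geography.

```python
# Illustrative sketch: back up a secret from one vault and restore it into
# another. Vault URLs and the secret name are placeholders.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()

source = SecretClient("https://<source-vault>.vault.azure.net", credential)
target = SecretClient("https://<target-vault>.vault.azure.net", credential)

# Download a protected backup blob of the secret (opaque bytes).
backup_blob = source.backup_secret("<secret-name>")

# Restore the blob into another vault (same subscription and geography).
restored = target.restore_secret_backup(backup_blob)
print("Restored:", restored.name)
```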



How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/LT7P-9W0


Nov 09, 2024 - 02:22 CET
Update -

Summary of Impact: Between 16:08 UTC on 15 Oct 2024 and 20:10 UTC on 15 Oct 2024, some customers using the Key Vault service in the West Europe region may have experienced issues accessing Key Vaults.





Next Steps: We will continue to investigate to establish the full root cause and to help prevent future occurrences of this class of issues.


Oct 15, 2024 - 22:10 CEST
Investigating -

Impact Statement: Starting at 16:08 UTC on 15 Oct 2024, some customers using the Key Vault service in the West Europe region may experience issues accessing Key Vaults. This may directly impact performing operations on the control plane or data plane for Key Vault or for supported scenarios where Key Vault is integrated with other Azure services.




Current Status: We are aware and actively working on mitigating the incident. This situation is being closely monitored and we will provide updates as the situation warrants or once the issue is fully mitigated.


Oct 15, 2024 - 20:55 CEST
Identified - The issue has been identified and a fix is being implemented.
Oct 04, 2024 - 13:00 CEST
Update - Logins are working again. We are still investigating the underlying issue.
Oct 01, 2024 - 09:50 CEST
Update - Users are experiencing login timeouts. Temporary scaling has been implemented as a workaround. We are investigating the root cause.
Oct 01, 2024 - 09:15 CEST
Investigating - Failure Anomalies notifies you of an unusual rise in the rate of failed HTTP requests or dependency calls.
Oct 01, 2024 - 09:00 CEST
Update - We are continuing to investigate this issue.
Oct 16, 2024 - 09:20 CEST
Investigating -

Post Incident Review (PIR) – Azure Portal – Slow initial page load and/or intermittent errors when accessing Azure Portal




What happened?


Between 08:00 and 10:30 UTC on 24 September 2024, we received alerts from platform telemetry and additional customer reports indicating increased latency and intermittent connectivity failures when attempting to access resources in the Azure Portal through the France Central, Germany West Central, North Europe, Norway East, Sweden Central, Switzerland North, UK South, UK West, and West Europe regions.


 


What went wrong and why?


During a standard deployment procedure, a subset of server instances within the Azure Portal were impacted, reducing capacity during a period of peak traffic. This concentrated load on fewer servers than are typically available, which led to the observed issues. The service automatically mitigated the issue by refreshing the affected instances and restoring them to full functionality.


 


How did we respond?


  • 08:00 UTC on 24 September 2024 – Platform alerts fired.
  • 08:30 UTC on 24 September 2024 – Customers reported impact.
  • 10:00 UTC on 24 September 2024 – We observed a decrease in failures.
  • 10:30 UTC on 24 September 2024 – Mitigation confirmed.

 


How are we making incidents like this less likely or less impactful?


  • We have improved resiliency, which should help avoid straining server instances during deployments. (Completed)
  • We improved our alerts to detect the issue earlier and improved our processes to include specific troubleshooting steps. (Completed)
  • We are reviewing and adopting processes to avoid applying deployment procedures during peak traffic hours for the region. (Estimated completion: November 2024)
  • We will be pushing further performance updates to ensure deployments aren't as disruptive to traffic. (Estimated completion: November 2024)

 

How can customers make incidents like this less impactful?

  • Consider refreshing the Azure Portal page only when the page indicates failure, and only every couple of minutes rather than every few seconds (see the sketch after this list). Oftentimes, only the initial loading operation is slowed down; once the page has loaded, the rest of the experience is not affected by the same incident.
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
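As a small illustration of the first point, the sketch below checks an endpoint every couple of minutes instead of every few seconds; the URL is a hypothetical placeholder and the approach applies to any page or endpoint you are watching during an incident.

```python
# Illustrative sketch: poll an endpoint gently rather than hammering it.
import time

import requests

URL = "https://<endpoint-you-are-watching>"  # hypothetical placeholder
INTERVAL_SECONDS = 120  # every couple of minutes, not every few seconds

while True:
    try:
        resp = requests.get(URL, timeout=30)
        if resp.ok:
            print(f"{URL} responded with {resp.status_code}; looks healthy")
            break
        print(f"{URL} responded with {resp.status_code}; will retry")
    except requests.RequestException as err:
        print(f"request failed ({err}); will retry")
    time.sleep(INTERVAL_SECONDS)
```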

 

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/GMYD-LB8


Oct 14, 2024 - 21:22 CEST
Graphisoft ID: Operational (100.0 % uptime over the past 90 days)
Graphisoft License Delivery: Operational (100.0 % uptime over the past 90 days)
Graphisoft Store: Operational (100.0 % uptime over the past 90 days)
Graphisoft Legacy Store: Operational (100.0 % uptime over the past 90 days)
Graphisoft Legacy Webshop: Operational (100.0 % uptime over the past 90 days)
GSPOS: Operational (100.0 % uptime over the past 90 days)
Graphisoft BIM Components: Operational (100.0 % uptime over the past 90 days)
Graphisoft BIMx Transfer: Operational (100.0 % uptime over the past 90 days)
Graphisoft DevOps Components: Operational (100.0 % uptime over the past 90 days)
Microsoft Incidents: Operational (100.0 % uptime over the past 90 days)
Past Incidents
Nov 25, 2024

No incidents reported today.

Nov 24, 2024

No incidents reported.

Nov 23, 2024

No incidents reported.

Nov 22, 2024

Unresolved incident: Failure Anomalies - licensemanager-appinsights.

Nov 21, 2024

Unresolved incident: Microsoft Incident - App Service - Post Incident Review (PIR) - App Service – Erroneous 404 failures.

Nov 20, 2024

Unresolved incident: Failure Anomalies - licensemanager-appinsights.

Nov 19, 2024

No incidents reported.

Nov 18, 2024

No incidents reported.

Nov 17, 2024

No incidents reported.

Nov 16, 2024

No incidents reported.

Nov 15, 2024

No incidents reported.

Nov 14, 2024

No incidents reported.

Nov 13, 2024

No incidents reported.

Nov 12, 2024

No incidents reported.

Nov 11, 2024

No incidents reported.