Update - Metric Alert for webshop-documentdb - the average normalized RU/s consumption is greater than or equal to 95%. The normalized RU consumption metric gives the maximum throughput utilization within a replica set.
Apr 01, 2025 - 15:16 CEST
Investigating - Metric Alert for webshop-documentdb - the average normalized RU/s consumption is greater than or equal to 95%. The normalized RU consumption metric gives the maximum throughput utilization within a replica set.
Apr 01, 2025 - 15:01 CEST
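The alert above fires when the average NormalizedRUConsumption metric meets or exceeds 95% over the evaluation window. As a purely illustrative sketch of that alert semantics (not the Azure Monitor implementation; the function name is hypothetical):

```python
def breaches_threshold(samples_percent, threshold=95.0):
    """Return True if the average of the metric samples in the
    evaluation window meets or exceeds the threshold (in percent)."""
    if not samples_percent:
        return False
    return sum(samples_percent) / len(samples_percent) >= threshold

# A window averaging below 95% does not fire; one at or above 95% does.
```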
Update -

What happened?


Between 08:51 and 10:15 UTC on 01 April 2025, we identified customer impact resulting from a power event in the North Europe region which impacted Microsoft Entra ID, Virtual Machines, Virtual Machine Scale Sets, Storage, Azure Cosmos DB, Azure Database for PostgreSQL flexible servers, Azure ExpressRoute, Azure Site Recovery, Service Bus, Azure Cache for Redis, Azure SQL Database, Application Gateway, and Azure NetApp Files. We can confirm that all affected services have now recovered. 


 


What do we know so far?


During a power maintenance event, a failure on a UPS system led to temporary power loss in a single datacenter in Physical Availability Zone 2 in the North Europe region, affecting multiple devices. Power has now been fully restored and all affected services have recovered.




How did we respond?


  • 08:51 UTC on 1 April 2025 – Customer impact identified from an ongoing power maintenance event.
  • 09:05 UTC on 1 April 2025 – Power was restored to affected devices.
  • 09:20 UTC on 1 April 2025 – Outage declared and customers notified via Azure Portal. Affected dependent services identified.
  • 09:40 UTC on 1 April 2025 – Dependent services report recovery.
  • 10:15 UTC on 1 April 2025 - Full mitigation confirmed.

 


What happens next?


  • Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.
  • To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs.
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring.
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness .
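The bullets above recommend configuring Azure Service Health alerts. As a hedged sketch of what that looks like programmatically, the helper below builds the ARM request body for a `Microsoft.Insights/activityLogAlerts` resource scoped to a subscription and filtered on the `ServiceHealth` category. Field names follow the documented activity log alert schema; the subscription and action group identifiers are placeholders, and this is an illustration rather than an official SDK call:

```python
def service_health_alert_body(subscription_id, action_group_id):
    """Build the ARM request body for a subscription-scoped
    Service Health activity log alert (illustrative sketch)."""
    return {
        "location": "Global",
        "properties": {
            "scopes": [f"/subscriptions/{subscription_id}"],
            "condition": {
                # Fire on any activity log event in the ServiceHealth category
                "allOf": [
                    {"field": "category", "equals": "ServiceHealth"},
                ]
            },
            "actions": {
                # Action groups fan out to email, SMS, push, webhooks, etc.
                "actionGroups": [{"actionGroupId": action_group_id}]
            },
            "enabled": True,
        },
    }
```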


Apr 01, 2025 - 14:48 CEST
Update -

Summary of Impact: Between 08:51 and 10:15 UTC on 01 April 2025, we identified customer impact resulting from a power event in the North Europe region which impacted Virtual Machines, Storage, CosmosDB, PostgreSQL, Azure ExpressRoute and Azure NetApp Files. We can confirm that all affected services have now recovered. 




A power maintenance event led to temporary power loss in a single datacenter, in Physical Availability Zone 2, in the North Europe region affecting multiple racks and devices. The power has been fully restored and services are seeing full recovery.




An update with additional information will be provided shortly.


Apr 01, 2025 - 13:01 CEST
Update -

Summary of Impact: Between 08:51 and 10:15 UTC on 01 April 2025, we identified customer impact resulting from a power event in the North Europe region which impacted Virtual Machines, Storage, CosmosDB, PostgreSQL, Azure ExpressRoute and Azure NetApp Files. We can confirm that all affected services have now recovered. 




A power maintenance event led to temporary power loss in a single datacenter, in Physical Availability Zone 2, in the North Europe region affecting multiple racks and devices. The power has been fully restored and services are seeing full recovery.




An update with additional information will be provided shortly.


Apr 01, 2025 - 13:00 CEST
Investigating -

Impact Statement: Starting approximately at 08:51 UTC on 01 April 2025, we received an alert of an issue impacting multiple Azure services across the North Europe region. 




Current Status: All relevant teams are currently looking into this alert and are actively working on identifying any workstreams needed to mitigate all customer impact. The next update will be provided within 60 minutes, or as events warrant.


Apr 01, 2025 - 12:12 CEST
Update - Metric Alert for webshop-documentdb - the average normalized RU/s consumption is greater than or equal to 95%. The normalized RU consumption metric gives the maximum throughput utilization within a replica set.
Apr 01, 2025 - 14:46 CEST
Investigating - Metric Alert for webshop-documentdb - the average normalized RU/s consumption is greater than or equal to 95%. The normalized RU consumption metric gives the maximum throughput utilization within a replica set.
Apr 01, 2025 - 14:26 CEST
Update - Metric Alert for webshop-documentdb - the average normalized RU/s consumption is greater than or equal to 95%. The normalized RU consumption metric gives the maximum throughput utilization within a replica set.
Apr 01, 2025 - 14:16 CEST
Investigating - Metric Alert for webshop-documentdb - the average normalized RU/s consumption is greater than or equal to 95%. The normalized RU consumption metric gives the maximum throughput utilization within a replica set.
Apr 01, 2025 - 13:56 CEST
Update - Metric Alert for webshop-documentdb - the average normalized RU/s consumption is greater than or equal to 95%. The normalized RU consumption metric gives the maximum throughput utilization within a replica set.
Apr 01, 2025 - 13:26 CEST
Investigating - Metric Alert for webshop-documentdb - the average normalized RU/s consumption is greater than or equal to 95%. The normalized RU consumption metric gives the maximum throughput utilization within a replica set.
Apr 01, 2025 - 13:11 CEST
Update -

What happened?


Between 13:00 UTC on 10 March 2025 and 00:25 UTC on 18 March 2025, a platform issue resulted in an impact to the Application Insights service in the West Europe region. Customers may have experienced intermittent data gaps on custom metrics data and/or incorrect alert activation.




What went wrong and why?


Application Insights Ingestion is the service that handles ingesting and routing of Application Insights data from customers. One of its internal components is a cache where it stores information about the customer's Application Insights resource configuration. This cache is deployed at a region-level, so it is shared by multiple clusters in a region. When a deployment is done, some regions deploy to one cluster, then delay until the next business day before deploying to remaining clusters. There was feature work being done that involved adding a new flag to the Application Insights resource configuration stored in the cache. The flag was supposed to default to true, in which case it wouldn't impact the behavior of Application Insights Ingestion. However, if the flag was set to false, it would stop sending custom metrics data to the Log Analytics workspace.


A recent incident in a separate cloud was caused by this flag becoming incorrectly set to false. As a response to this, it was decided the flag should be flipped to represent the opposite - so that defaulting to "false" would result in no-op behavior instead. As part of this, the original flag was removed from the contract used to serialize cache entries. The above change was then deployed. It started with the first cluster, then waited until the next business day to deploy to remaining clusters. During this time, the first cluster started serializing new cache entries that were missing a value for the original (default true) flag. This caused the remaining clusters (still running the old deployment) to read values from the cache with this flag set to false, and therefore stop routing custom metrics data to Log Analytics. When the deployment completed in a region, impact would resolve as all clusters would be running the new code with the correct default value for the flag.
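The failure mode described above can be illustrated with a minimal sketch (names such as `routeCustomMetrics` are hypothetical, not the actual Application Insights cache contract): the new writer omits the flag from serialized entries, and the old reader's fallback default determines whether data keeps flowing.

```python
import json

def old_reader(entry_json, default=False):
    """Old-cluster code path: a missing 'routeCustomMetrics' field
    silently falls back to the default, which decided whether custom
    metrics were routed to Log Analytics."""
    entry = json.loads(entry_json)
    return entry.get("routeCustomMetrics", default)

# New writer: the flag was removed from the contract, so it is absent.
new_entry = json.dumps({"resourceId": "ai-resource-1"})

# With default=False, old clusters stop routing custom metrics (the bug);
# a default of True would have preserved the no-op behavior.
```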


There was no monitoring of data volume drops by data type, and no new monitor for the flag's operation was added, since its activation was considered normal operation. As a result, the deployment proceeded to an additional region before the issue was detected. The incident persisted for around 24 hours in the South Central US region before the deployment completed there. Since the issue was not detected by automated monitoring, the deployment proceeded to the West Europe region, where it was applied to the first cluster. Because this deployment landed next to a weekend, the impact persisted for several days before the deployment finished. Eventually, several customers raised tickets after noticing that their custom metrics data was missing. Throughout this incident, the flag was incorrectly read as false, causing the ingestion service to stop routing custom metrics data to Log Analytics.




How did we respond?


  • 13:00 UTC on 10 March 2025 – Customer impact in South Central US began.
  • 23:30 UTC on 11 March 2025 – This issue was auto-mitigated in South Central US when the deployment finished, once all the clusters were on the same version.
  • 13:30 UTC on 13 March 2025 – Customer impact in West Europe began.
  • 16:11 UTC on 17 March 2025 – Issue was detected via customer reports, which prompted us to start our investigation.
  • 23:00 UTC on 17 March 2025 – We identified that the issue was caused by a recent deployment, as described above.
  • 23:01 UTC on 17 March 2025 – The deployment was triggered and expedited to ensure that all clusters were on the same version, mitigating the issue.
  • 00:25 UTC on 18 March 2025 – Service restored, and customer impact mitigated in all regions.



How are we making incidents like this less likely or less impactful?


  • We have added unit tests for backwards compatibility of the cache contract, which validate that new tests are added whenever the contract changes. (Completed)
  • We have added a new dedicated monitor on the new flag being activated, to help detect and mitigate related issues more quickly. (Completed)
  • We have improved our change review process, by requiring that risk assessments be completed on each change. (Completed)
  • We are improving our monitoring of data volume drops so that it covers every data type. (Estimated completion: April 2025)
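The last repair item can be sketched as a simple per-data-type check (hypothetical names and thresholds, not Microsoft's internal system): flag any data type whose ingested volume drops sharply versus its baseline.

```python
def detect_volume_drops(baseline, current, drop_ratio=0.5):
    """Return data types whose current ingested volume fell below
    drop_ratio * baseline volume (illustrative sketch)."""
    return sorted(
        dtype
        for dtype, base in baseline.items()
        if base > 0 and current.get(dtype, 0) < base * drop_ratio
    )
```

With per-type baselines, a drop confined to custom metrics is caught even when total ingestion volume looks healthy, which is the gap described above.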


How can customers make incidents like this less impactful?

  • There was nothing customers could have done to avoid or minimize impact from this specific service incident.
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts


How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/D_CP-JQ8


Apr 01, 2025 - 00:31 CEST
Update -

What happened?


Between 13:30 UTC on 13 March 2025 and 00:25 UTC on 18 March 2025, a platform issue resulted in an impact to the Application Insights service in the West Europe region. Customers may have experienced intermittent data gaps on custom metrics data and incorrect alert activation.


What do we know so far?


We identified that the issue was caused by a service deployment. A new version was deployed to a single cluster of the service in the West Europe region, introducing a change to the contract of a cache shared among all clusters in the region. This contract change was incompatible with the code running on the remaining clusters, leading to incorrect routing of custom metrics data.


How did we respond?


  • 13:30 UTC on 13 March 2025 – Customer impact began.
  • 16:11 UTC on 17 March 2025 – Issue was detected via a customer report, which prompted us to start our investigation.
  • 23:00 UTC on 17 March 2025 – We identified that the issue was caused by a recent deployment.
  • 23:01 UTC on 17 March 2025 – The deployment was triggered and expedited to ensure that all clusters were on the same version, mitigating the issue.
  • 00:25 UTC on 18 March 2025 – Service restored, and customer impact mitigated.

What happens next?


  • To request a Post Incident Review (PIR), impacted customers can use the “Request PIR” feature within Azure Service Health. (Note: We're in the process of transitioning from "Root Cause Analyses (RCAs)" to "Post Incident Reviews (PIRs)", so you may temporarily see both terms used interchangeably in the Azure portal and in Service Health alerts).
  • To get notified if a PIR is published, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs.
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring.
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.


Mar 18, 2025 - 02:25 CET
Update -

Impact Statement: Starting at 13:30 UTC on 13 March 2025, you have been identified as a customer using Application Insights in West Europe who may have experienced intermittent data gaps on custom metrics data and incorrect alert activation.




This issue is now mitigated, and more information will be shared shortly.


Mar 18, 2025 - 01:41 CET
Investigating -

Impact Statement: Starting at 13:30 UTC on 13 March 2025, you have been identified as a customer using Application Insights in West Europe who may experience intermittent data gaps on custom metrics data and incorrect alert activation.


 


Current Status: This issue was raised to us by a customer report. Upon investigation, we determined that this bug was introduced as part of a deployment. Once we identified this, we assessed the possibility of rolling back. After further inspection, we deemed that, to mitigate this, the deployment would need to be applied to all the clusters in the region, as it was the mismatch in deployment versions that was causing this issue.


 


We are currently expediting the deployment to all remaining clusters in the region; this is expected to take 1 hour to complete. We have paused the broader deployment going out to any remaining regions, and we will reassess our deployment plan after we mitigate this issue in West Europe.


 


The next update will be provided within 2 hours, or as events warrant.


Mar 18, 2025 - 01:31 CET
Update -



Post Incident Review (PIR) – Network Connectivity – Availability issues in East US


Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident, and get any questions answered by our experts: https://aka.ms/AIR/Z_SZ-NV8




What happened?


Between 13:37 and 16:52 UTC on 18 March, and again between 23:20 UTC on 18 March and 00:30 UTC on 19 March 2025, a combination of a third-party fiber cut and an internal tooling failure resulted in an impact to a subset of Azure customers with services in our East US region.


During the first impact window, immediately after the fiber cut, customers may have experienced intermittent connectivity loss for inter-zone traffic that included AZ03 - to/from other zones, or to/from the public internet. During this time, the traffic loss rate peaked at 0.02% for short periods of time. Traffic within AZ03, as well as traffic to/from/within AZ01 and AZ02, was not impacted.


During the second impact window, triggered by the tooling issue, customers may have experienced intermittent connectivity loss – primarily when sending inter-zone traffic that included AZ03. During this time, the traffic loss rate peaked at 0.55% for short periods of time. Traffic entering or leaving the East US region was not impacted, but there was some minimal impact to inter-zone traffic from both of the other Availability Zones, AZ01 and AZ02.


Note that the 'logical' availability zones used by each customer subscription may correspond to different physical availability zones. Customers can use the Locations API to understand this mapping and to confirm which resources run in this physical AZ – see: https://learn.microsoft.com/rest/api/resources/subscriptions/list-locations?HTTP#availabilityzonemappings.
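The Locations API response referenced above exposes an `availabilityZoneMappings` array per region. As an illustrative sketch (the sample payload below is abbreviated and hypothetical, though the field names follow the documented response shape), the helper extracts the logical-to-physical zone mapping for one region:

```python
def zone_mapping(locations_response, region_name):
    """Map logical zone -> physical zone for one region, from a
    Subscriptions - List Locations response body."""
    for loc in locations_response.get("value", []):
        if loc.get("name") == region_name:
            return {
                m["logicalZone"]: m["physicalZone"]
                for m in loc.get("availabilityZoneMappings", [])
            }
    return {}

# Abbreviated, illustrative sample response (not real subscription data).
sample = {
    "value": [
        {
            "name": "eastus",
            "availabilityZoneMappings": [
                {"logicalZone": "1", "physicalZone": "eastus-az2"},
                {"logicalZone": "2", "physicalZone": "eastus-az1"},
                {"logicalZone": "3", "physicalZone": "eastus-az3"},
            ],
        }
    ]
}
```

Because the mapping is randomized per subscription, two subscriptions' "zone 3" may be different physical zones; this helper shows how to resolve which subscriptions actually had resources in the impacted physical zone.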




What went wrong and why?


At 13:37 UTC on 18 March 2025, a drilling operation near one of our network paths accidentally struck fiber used by Microsoft, causing an unplanned disruption to datacenter connectivity within AZ03. When fiber cuts impact our networking capacity, our systems are designed to redistribute traffic automatically to other paths. In this instance, we had two concurrent failures happen – before the cut, a datacenter router in AZ03 was down for maintenance and was in the process of being repaired. This combination of multiple concurrent failures impacted a small portion of our diverse capacity within AZ03, leading to the potential for intermittent connectivity issues for some customers. At 13:55 UTC on 18 March 2025, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity – customers would have started to see their services recover at this time.


Additionally, after the fiber cut and failed isolation, a linecard failed on another router at 14:16 UTC, further reducing overall capacity to AZ03. However, as traffic had been re-routed, this further reduction in capacity did not cause any additional customer impact.


During the initial mitigation efforts outlined above, our auto-mitigation tool encountered a lock contention problem that blocked commands on the impacted devices, so it failed to isolate all capacity connected to those devices. This left some of the impacted capacity un-isolated. Because that capacity was out of service from the fiber cut and not carrying production traffic, our systems did not immediately flag the failed isolation state.


At approximately 21:00 UTC, our fiber provider commenced recovery work on the damaged fiber. During the capacity recovery process, at 23:20 UTC, as a result of the earlier failure to isolate all the impacted fiber capacity, our recovery systems began re-sending traffic to the devices connected to the un-isolated capacity as individual fibers were repaired, bringing those devices back into service without safe levels of capacity. This caused traffic congestion that impacted customers as described above.


The traffic congestion within AZ03, due to the tooling failure, triggered an unplanned failure mode on a regional hub router that connects multiple datacenters. By design, our network devices attempt to contain congestive packet loss to capacity that is already impacted. Due to the encountered failure mode, this containment failed on a subset of routers, so congestion spread to neighboring capacity on the same regional hub router, beyond AZ03. This containment failure impacted a small subset of traffic from the regional hub router to AZ01 and AZ02.


At this stage, all capacity originally impacted by the third-party fiber cut was manually isolated from the network, mitigating all customer impact by 00:30 UTC on 19 March. At 01:52 UTC on 19 March, the underlying fiber cut was fully recovered. We then completed the testing and restoration of all capacity to pre-incident levels by 06:50 UTC on 19 March.




How did we respond?


  • 13:37 UTC on 18 March 2025 – Customer impact began, triggered by a fiber cut causing network congestion which led to customers experiencing packet drops or intermittent connectivity. Our monitoring systems identified the impact immediately, so our on-call engineers engaged to investigate.
  • 13:45 UTC on 18 March 2025 – Our fiber provider was notified of the fiber cut and prepared for dispatch.
  • 13:55 UTC on 18 March 2025 – Mitigation efforts began identifying the impacted datacenters and redirecting traffic to healthier routes.
  • 15:07 UTC on 18 March 2025 – All customers using the East US region were notified about connectivity issues, even if their services were not directly impacted.
  • 16:52 UTC on 18 March 2025 – Mitigation efforts were successfully completed. All devices affected by the fiber cut were isolated, all customer traffic was using healthy paths and not experiencing congestion.
  • 23:20 UTC on 18 March 2025 – Customer impact recommenced, due to a tooling failure during the capacity repair process of the initial fiber cut.
  • 00:30 UTC on 19 March 2025 – This impact was mitigated after isolating the capacity that was incorrectly added by the tooling failure as part of the recovery process. Customers and services would have experienced full mitigation.
  • 01:52 UTC on 19 March 2025 – The underlying fiber cut was fully restored. We continued to monitor our capacity during the recovery process.
  • 06:50 UTC on 19 March 2025 – Fiber restoration efforts were completed. The incident was confirmed as mitigated.



How are we making incidents like this less likely or less impactful?


  • We are fixing the tooling failure that caused the devices to be restored to take traffic before they were production ready. (Estimated completion: May 2025)
  • We are expediting a capacity upgrade within the most impacted datacenter, ahead of a planned technology refresh for all datacenters within this region - to de-risk the impact of multiple concurrent failures. (Estimated completion: July 2025)
  • In the longer term, we are working to limit the scope of impact further – specifically, to prevent the failure of a device from spreading across availability zones. (Estimated completion: February 2026) 


How can customers make incidents like this less impactful?


How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/Z_SZ-NV8





Mar 31, 2025 - 22:59 CEST
Update -

This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details.


Join one of our upcoming 'Azure Incident Retrospective' livestreams discussing this incident, and get any questions answered by our experts: https://aka.ms/AIR/Z_SZ-NV8




What happened?


Between 13:37 and 16:52 UTC on 18 March 2025, and again between 23:20 UTC on 18 March and 00:30 UTC on 19 March, a combination of a fiber cut and tooling failure within our East US region, resulted in an impact to a subset of Azure customers with services in that region. Customers may have experienced intermittent connectivity loss and increased network latency – when sending traffic to/from/within Availability Zone 3 (AZ3) within this East US region.


What do we know so far?


At 13:37 UTC, a drilling operation near one of our network paths accidentally struck fiber used by Microsoft, causing an unplanned disruption to datacenter connectivity within physical Availability Zone 3 (AZ3) only. With fiber cuts impacting capacity, our systems are designed to shift traffic automatically to other diverse paths. In this instance, we had two concurrent failures happen – before the cut, a large hub router was down due to maintenance (in the process of being repaired); and after the cut, a linecard failed on another router.


This combination of multiple concurrent failures impacted a small portion of our diverse capacity within AZ3, leading to the potential for retransmits or intermittent connectivity for some customers. At 13:55 UTC, we began mitigating the impact of the fiber cut by load balancing traffic and restoring some of the impacted capacity – customers would have started to see their services recover at this time. The restoration of traffic was fully completed by 16:52 UTC, and the issue was noted as mitigated.


At approximately 21:00 UTC, our fiber provider commenced recovery work on the cut fiber. During the capacity recovery process, at 23:20 UTC, a tooling failure caused our systems to add devices back into the production network before the safe levels of capacity were recovered on the impacted fiber path. As a result, as individual fibers were repaired and brought back into service, our tooling incorrectly began adding devices back to the network. Due to the timing of this tooling failure, traffic was restarted without safe levels of capacity – resulting in congestion that led to customer impact, when sending traffic to/from/within AZ3 of the East US region. The impact was mitigated at 00:30 UTC on 19 March, after manually isolating the capacity affected by this tooling failure.


At 01:52 UTC on 19 March, the underlying fiber cut was fully recovered. We completed the test and restoration of all capacity to pre-incident levels by 06:50 UTC on 19 March.


How did we respond?


  • 13:37 UTC on 18 March 2025 – Customer impact began, triggered by a fiber cut causing network congestion which led to customers experiencing packet drops or intermittent connectivity. Our monitoring systems identified the impact immediately, so our on-call engineers engaged to investigate.
  • 13:45 UTC on 18 March 2025 – Our fiber provider was notified of the fiber cut and prepared for dispatch.
  • 13:55 UTC on 18 March 2025 – Mitigation efforts began identifying the impacted datacenters and redirecting traffic to healthier routes.
  • 15:07 UTC on 18 March 2025 – All customers using the East US region were notified about connectivity issues.
  • 16:52 UTC on 18 March 2025 – Mitigation efforts were successfully completed. All devices affected by the fiber cut were isolated, all customer traffic was using healthy paths and not experiencing congestion.
  • 23:20 UTC on 18 March 2025 – Customer impact began, due to a tooling failure during the capacity repair process of the initial fiber cut.
  • 00:30 UTC on 19 March 2025 – This impact was mitigated after isolating the capacity that was incorrectly added by the tooling failure as part of the recovery process. Customers and services would have experienced full mitigation.
  • 01:52 UTC on 19 March 2025 – The underlying fiber cut was fully restored. We continued to monitor our capacity during the recovery process.
  • 06:50 UTC on 19 March 2025 – Fiber restoration efforts were completed. Incident was confirmed as mitigated.

How are we making incidents like this less likely or less impactful?


  • We are fixing the tooling failure that caused devices being restored to take traffic before they were ready. (Estimated completion: TBD)
  • We are increasing the bandwidth within the East US region as part of a planned technology refresh, to de-risk the impact of multiple concurrent failures. (Estimated completion: May 2025)
  • This is our Preliminary PIR to share what we know so far. After our internal retrospective is completed (generally within 14 days) we will publish a Final PIR with additional details.

How can customers make incidents like this less impactful?

  • Consider ensuring that the right people in your organization will be notified about any future service issues - by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring

How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/Z_SZ-NV8


Mar 22, 2025 - 02:29 CET
Update -

What happened? 


Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within, as well as in and out of, the East US region. 


At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber, during which customers may have experienced the same intermittent connectivity loss and increased latency sending traffic within, to, and from the East US region. 


  


What do we know so far? 


We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cuts impacted capacity to those datacenters, increasing utilization of the remaining capacity serving the affected datacenters. At 13:55 UTC on 18 March 2025, we began mitigating the impact by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover from this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025, and the issue was mitigated. 


At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure. 


At 01:52 UTC on 19 March, the underlying fiber cut had been fully restored. We continued to test and restore all capacity to pre-incident levels; these tasks completed at 06:50 UTC on 19 March. 


  


How did we respond? 


  • 13:09 UTC on 18 March 2025 - Fiber cut in East US that caused packet drops. Our monitoring systems identified the impact. 
  • 13:55 UTC on 18 March 2025 - Mitigation efforts begin with identifying the impacted data centers and redirecting traffic to healthier routes. 
  • 15:07 UTC on 18 March 2025 - Outage declared; all East US customers notified of potential impact. 
  • 18:51 UTC on 18 March 2025 - Mitigation efforts were successfully completed. All devices affected by the fiber cut were isolated. 
  • 23:20 UTC on 18 March 2025 - An additional impact due to tooling failure was noted during the capacity repair process of the previous incident. It was anticipated that the capacity repair process would not impact customers. 
  • 00:28 UTC on 19 March 2025 - The second impact was mitigated after isolating the capacity resources impacted by the tooling failure. At this stage most customers and services would have seen full mitigation. 
  • 01:52 UTC on 19 March 2025 - The underlying fiber cut was fully restored. We continued to monitor our capacity during the recovery process. 
  • 06:50 UTC on 19 March 2025 - All restoration efforts have been completed. Incident mitigation has been confirmed and declared. 

  


What happens next? 


  • Our team will be completing an internal retrospective to understand the incident in more detail. We will publish a Preliminary Post Incident Review (PIR) within approximately 72 hours, to share more details on what happened and how we responded. After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings. 
  • To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness 


Mar 19, 2025 - 08:36 CET
Update -

Impact Statement: Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within, into, and out of the East US region. 


At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber; customers may have experienced the same intermittent connectivity loss and increased latency sending traffic within, to, and from the East US region. 


 


 


Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving them. At 13:55 UTC on 18 March 2025, we began mitigating the impact by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025 and the issue was mitigated. 


At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure.  


At 01:52 UTC on 19 March, the underlying fiber cut was fully repaired. We continue working to test and restore all capacity to pre-incident levels. 


Our telemetry data shows that the customer impact has been fully mitigated. We are continuing to monitor the situation during our capacity recovery process before confirming complete resolution of the incident. 


An update will be provided in 3 hours, or as events warrant. 


Mar 19, 2025 - 07:40 CET
Update -

Previous communications would have indicated an incorrect severity (changing from Warning to informational). We have rectified the error and updated the communication below. We apologize for any confusion this may have caused.






Impact Statement: Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region. 


At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber; customers may have experienced the same intermittent connectivity loss and increased latency sending traffic within, to, and from the East US region. 


 


Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving them. At 13:55 UTC on 18 March 2025, we began mitigating the impact by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025 and the issue was mitigated. 


At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure. 


At 01:52 UTC on 19 March, the underlying fiber cut was fully repaired. We continue working to test and restore all capacity to pre-incident levels. 


Our telemetry indicates that customer impact has been fully mitigated. We will continue to monitor during our capacity recovery process before confirming complete incident mitigation. 


An update will be provided in 3 hours, or as events warrant. 


Mar 19, 2025 - 05:23 CET
Update -

Impact Statement:  Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region. 


At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber; customers may have experienced the same intermittent connectivity loss and increased latency sending traffic within, to, and from the East US region. 


 


Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving them. At 13:55 UTC on 18 March 2025, we began mitigating the impact by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025 and the issue was mitigated. 


At 23:20 UTC on 18 March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19 March after isolating the capacity impacted by the tooling failure. 


At 01:52 UTC on 19 March, the underlying fiber cut was fully repaired. We continue working to test and restore all capacity to pre-incident levels. 


Our telemetry indicates that customer impact has been fully mitigated. We will continue to monitor during our capacity recovery process before confirming complete incident mitigation. 


An update will be provided in 3 hours, or as events warrant. 


Mar 19, 2025 - 05:12 CET
Update -

Impact Statement:  Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region. 


At 23:21 UTC on 18 March 2025, another impact to network capacity occurred during the recovery of the underlying fiber; customers may have experienced the same intermittent connectivity loss and increased latency sending traffic within, to, and from the East US region. 


 


Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC on 18 March 2025. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving them. At 13:55 UTC on 18 March 2025, we began mitigating the impact by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC on 18 March 2025 and the issue was mitigated. 


At 23:20 UTC on 18th March 2025, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. The impact was mitigated at 00:30 UTC on 19th March after isolating the capacity impacted by the tooling failure. 


At 01:52 UTC on 19 March, the underlying fiber cut was fully repaired. We are now working to test and restore all capacity to pre-incident levels. 


An update will be provided in 60 minutes, or as events warrant. 


Mar 19, 2025 - 03:47 CET
Update -

What happened? 


Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region.  


At 23:21 UTC, another impact to network capacity occurred during the recovery of the underlying fiber; customers may have experienced the same intermittent connectivity loss and increased latency sending traffic within, to, and from the East US region. 


 


Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region at 13:09 UTC. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving them. At 13:55 UTC, we began mitigating the impact by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC and the issue was mitigated. 


At 23:20 UTC, another impact was observed during the capacity repair process. This was due to a tooling failure during the recovery process that started adding traffic back into the network before the underlying capacity was ready. We are actively mitigating the current impact to ensure no further incidents occur during the recovery process.  


An update will be provided in 60 minutes, or as events warrant. 


Mar 19, 2025 - 02:19 CET
Update -

After further investigation, we determined that we are still working with our providers on fiber repairs; however, as previously communicated, no further impact should be experienced. We apologize for the inconvenience caused.




What happened?


Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region. 




Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving them. At 13:55 UTC, we began mitigating the impact by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC and the issue was mitigated. While fiber repairs will take time, there should be no further impact to customers, as impacted devices are isolated and traffic has been shifted to healthier routes. Further updates will be provided in 6 hours or as events warrant.


Mar 19, 2025 - 00:54 CET
Update -

What happened?


Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region. 




What do we know so far?


We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving them. At 13:55 UTC, we began mitigating the impact by load balancing traffic and restoring some of the impacted capacity; customers should have started to see services recover at this time. The restoration of traffic was fully completed by 18:51 UTC and the issue was mitigated. While fiber repairs will take time, there should be no further impact to customers, as impacted devices are isolated and traffic has been shifted to healthier routes. 




How did we respond?


  • At 13:09 UTC on 18 March 2025 - Fiber cuts in East US may have caused packet drops. Monitoring systems identified the impact. 
  • At 13:55 UTC on 18 March 2025 - Mitigation began: identifying the impacted datacenters and shifting traffic to healthier routes.
  • At 15:07 UTC on 18 March 2025 - Outage declared; all customers in East US informed of possible impact.
  • At 18:51 UTC on 18 March 2025 - Mitigation complete. All devices impacted by the fiber cuts were isolated. 



What happens next?


  • Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.
  • To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts .
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs .
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness .


Mar 18, 2025 - 22:11 CET
Update -

What happened?


Between 13:09 UTC and 18:51 UTC on 18 March 2025, a platform issue resulted in an impact to a subset of Azure customers in the East US region. Customers may have experienced intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region. 




This issue is now mitigated. Additional information will be provided shortly.


Mar 18, 2025 - 21:47 CET
Update -

Impact Statement: Starting at 13:09 UTC on 18 March 2025, a subset of Azure customers in the East US region may experience intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region. 




Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving them. We have mitigated the impact by load balancing traffic and restoring some of the impacted capacity. Impacted customers should now see their services recover. In parallel, we are working with our providers on fiber repairs. We do not yet have a reliable ETA for repairs. We will continue to provide updates here as they become available.




Mar 18, 2025 - 18:40 CET
Update -

Impact Statement: Starting at 13:09 UTC on 18 March 2025, a subset of Azure customers in the East US region may experience intermittent connectivity loss and increased network latency sending traffic within as well as in and out of Azure's US East Region. 




Current Status: We identified multiple fiber cuts affecting a subset of datacenters in the East US region. The fiber cuts reduced capacity to those datacenters, increasing utilization on the remaining capacity serving them. We are actively re-routing traffic to mitigate high utilization and restore connectivity for impacted customers. Underlying fiber repairs have been initiated with our provider, but we do not yet have an ETA. An update will be provided in 60 minutes, or as events warrant.




Mar 18, 2025 - 18:19 CET
Update -

Impact Statement: Starting at 13:09 UTC on 18 March 2025, a subset of Azure customers in the East US region may experience intermittent connectivity loss and increased network latency in the region. 




Current Status: We identified a networking issue affecting a subset of datacenters in the East US region. We are continuing to investigate the contributing factors of this issue, and in parallel, we are actively re-routing affected traffic to minimize the impact on networking-dependent services. The next update will be provided within 60 minutes, or as events warrant.




Mar 18, 2025 - 17:15 CET
Investigating -

Impact Statement: Starting at 13:09 UTC on 18 March 2025, you have been identified as an Azure customer in the East US region who may experience intermittent connectivity loss and increased network latency in the region.


 


Current Status: We are aware of the issue and actively working on mitigation workstreams to reroute traffic and mitigate impact for customers. The next update will be provided within 60 minutes, or as events warrant.


Mar 18, 2025 - 16:47 CET
Update -

What happened?


Between 17:24 and 18:45 UTC on 18 March 2025, a subset of customers may have experienced network connectivity issues with resources hosted in the West Europe region. We determined that only inter-region traffic - meaning traffic moving to or from other Azure regions and West Europe - was affected. ExpressRoute and Internet traffic were not impacted.




What went wrong and why?


From as early as 09:02 UTC, we were engaged on incidents mitigating multiple failures with some of our networking devices across multiple regions, in which Wide Area Network (WAN) routers were failing to renew their certificates. These issues were a combination of a decommissioned certificate authority and a bug affecting agents on a subset of our devices.


Our automated systems create and renew certificates using Certificate Authority (CA) servers. During this event we unearthed an issue where a subset of our agents was susceptible to a renewal bug: when the primary CA server was unavailable, the agent corrupted its working certificate while performing a renewal using the secondary CA server. This caused the agent to restart continuously.
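
For illustration only (this is not Azure's actual implementation, and all names are hypothetical), the failure class described above, and a safer renewal pattern that avoids it, can be sketched as follows: never discard the working certificate until a replacement has been fetched and validated.

```python
def renew_certificate(current_cert, fetch_new, validate):
    """Sketch of a fail-safe certificate renewal step.

    `fetch_new` and `validate` are hypothetical callables standing in for
    "request a certificate from the (primary or secondary) CA" and "check the
    response is a usable certificate". The bug class described above amounts
    to overwriting the working certificate before validating the replacement;
    here the old certificate is only replaced once the new one checks out.
    """
    try:
        candidate = fetch_new()
    except Exception:
        return current_cert   # CA unreachable: keep serving with the old cert
    if not validate(candidate):
        return current_cert   # corrupt/invalid response: keep the old cert
    return candidate          # only now swap in the new certificate
```

The key design choice is the write-then-swap order: a failed or corrupt renewal leaves the agent running on its previous certificate instead of restarting continuously.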


In West Europe, we engaged on an issue where the agent running on a backbone router was failing to renew its certificate. A backbone router is a crucial component in networks that provides the primary path for data traffic across different segments or parts of a network.


At 17:22 UTC, automated mitigation initiated the isolation of the unhealthy backbone device from serving traffic. At 17:24 UTC, the failed renewal led to the agent restarting, putting the router’s forwarding database in an inconsistent state. A forwarding database is a table maintained to efficiently forward packets to their intended destinations. However, at 17:29 UTC, the isolation process was cancelled, as the failure was believed to be related to the non-backbone devices that were failing across West Europe; as a backbone device, it should have remained isolated, but it was inadvertently brought back into rotation. As a result of the inconsistent state, we observed that approximately 25% of inter-region traffic entering and exiting the West Europe region was being erroneously dropped, or blackholed.
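
As a toy illustration of the forwarding-database concept above (hypothetical names, not Azure internals), a forwarding table maps destination prefixes to next hops via longest-prefix match; a destination with no consistent entry is silently dropped, i.e. blackholed.

```python
import ipaddress

class ForwardingTable:
    """Minimal forwarding database (FIB) sketch: prefix -> next hop."""

    def __init__(self):
        self.routes = {}  # ipaddress network -> next-hop name

    def add(self, prefix, next_hop):
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def lookup(self, dst):
        """Return the next hop for `dst` via longest-prefix match."""
        addr = ipaddress.ip_address(dst)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # no usable entry: the packet is blackholed
        return self.routes[max(matches, key=lambda net: net.prefixlen)]
```

In the incident described above, entries covering roughly a quarter of inter-region traffic were effectively in the "no usable entry" state, so matching packets were dropped rather than forwarded.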


Mitigation efforts were delayed by the ongoing response to the decommissioning of the certificate authority and the uncovered bug, which made it difficult for us to broadly assess our monitoring signals. Our efforts were primarily focused on patching devices that serve all Internet and ExpressRoute traffic.




How did we respond?


  • 09:02 UTC on 18 March 2025 – Issues were detected across multiple regions due to WAN routers failing to renew their certificates.
  • 17:10 UTC on 18 March 2025 – A backbone router in West Europe failed to renew its certificate.
  • 17:22 UTC on 18 March 2025 – Automated mitigation efforts initiated to isolate the unhealthy backbone router in West Europe.
  • 17:24 UTC on 18 March 2025 – Failure in the certificate renewal process for a backbone router led to the agent restarting.
  • 17:29 UTC on 18 March 2025 – Unhealthy backbone router inadvertently brought back into service, exacerbating connectivity issues.
  • 18:25 UTC on 18 March 2025 – We determined the issue was due to incorrect certificates being erroneously downloaded, leading to traffic black-holing.
  • 18:40 UTC on 18 March 2025 – The mitigation was applied by fixing the configuration on the affected routers.
  • 18:45 UTC on 18 March 2025 – Network recovery complete.





How are we making incidents like this less likely or less impactful?


  • We have updated our configurations for the primary certificate authority across our WAN devices. (Completed)
  • We have built the capability to incorporate a broader set of monitoring signals. (Completed)
  • We have updated our playbooks around isolating unhealthy devices more robustly. (Completed)
  • We have rolled out patched versions of the agents that address certificate handling. (Completed)
  • We are expanding our automation to assess and mitigate forwarding database health, reducing the need for human intervention. (Estimated completion: September 2025)



How can customers make incidents like this less impactful?



How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/TL8S-TX8


Mar 28, 2025 - 19:47 CET
Update -

What happened?


Between 17:24 and 18:45 UTC on 18 March 2025, a subset of customers may have experienced network connectivity issues with resources hosted in the West Europe region. We determined that only inter-region traffic - meaning traffic moving to or from other Azure regions to West Europe - was affected. ExpressRoute and internet traffic were not impacted.


What do we know so far?


The issue originated from a service running on the device downloading an incorrect certificate from an old certificate server. This led to the service restarting and putting the device's forwarding database in an inconsistent state. We observed that inter-region traffic entering and exiting West Europe was being erroneously discarded.


We identified a specific networking router as the source of the problem. The issue was resolved by updating the service configuration to point to the new certificate server.


How did we respond?


17:26 UTC on 18 March 2025 – Network connectivity issues were first observed.


17:26 UTC on 18 March 2025 – The issue was detected shortly after the impact started.


18:25 UTC on 18 March 2025 – We determined the issue was due to an incorrect certificate being erroneously downloaded, leading to traffic black-holing.


18:45 UTC on 18 March 2025 – The mitigation was applied by fixing the configuration on the affected routers.


18:45 UTC on 18 March 2025 – Immediately after we made the change, we observed recovery and the network returned to normal operation.


What happens next?


  • After our internal retrospective is completed, generally within 14 days, we will publish a Final Post Incident Review with any additional details and learnings.
  • To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness

 


Mar 18, 2025 - 21:53 CET
Update -

Between 17:24 and 18:45 UTC on 18 March 2025, a subset of customers may have experienced network connectivity issues with resources hosted in the West Europe region. We determined that only inter-region traffic—meaning traffic moving to or from other Azure regions to West Europe—was affected. ExpressRoute and internet traffic were not impacted.


The issue originated from a service running on the device downloading an incorrect certificate from an old certificate server. This led to the service restarting and putting the device's forwarding database in an inconsistent state. We observed that inter-region traffic entering and exiting West Europe was being erroneously discarded. 


We identified a specific networking router as the source of the problem. The issue was resolved by updating the service configuration to point to the new certificate server.


Mar 18, 2025 - 20:44 CET
Update -

Starting at approximately 17:26 UTC on 18 March 2025, a subset of customers are experiencing network connectivity issues to resources hosted in West Europe. We have determined that the impact is affecting inter-region traffic, meaning traffic traversing an Azure region to or from West Europe might be affected. We are starting to see recovery.


We can confirm that ExpressRoute and internet traffic are not impacted.


More information will be provided shortly.  


Mar 18, 2025 - 20:22 CET
Investigating -

An outage alert is being investigated. More information will be provided as it is known.


Mar 18, 2025 - 19:49 CET
Investigating - Failure Anomalies notifies you of an unusual rise in the rate of failed HTTP requests or dependency calls.
Mar 24, 2025 - 01:11 CET
Update -

At 17:03 UTC on 18 March 2025, we received a monitoring alert for Application Gateway in West Europe which initiated an investigation and notified customers of a potential outage. We have concluded our investigation of the alert and confirmed that all services remained healthy, and a service incident did not occur. We will continue investigating to determine why alerts were triggered to pre-emptively avoid similar false alerts going forward. Apologies for any inconvenience caused.


Mar 18, 2025 - 22:38 CET
Update -

At 17:03 UTC on 18 March 2025, we received a monitoring alert for Application Gateway in West Europe which initiated an investigation and notified customers of a potential outage. We have concluded our investigation of the alert and confirmed that all services remained healthy, and a service incident did not occur. We will continue investigating to determine why alerts were triggered to pre-emptively avoid similar false alerts going forward. Apologies for any inconvenience caused.


Mar 18, 2025 - 22:33 CET
Investigating -


Impact Statement: Starting at 17:48 UTC on 18 March 2025, you have been identified as a customer who may encounter data plane issues affecting your Application Gateway in West Europe. This may impact the performance and availability of your applications hosted behind application gateways in the region.





Visit the Impacted Resources tab in Azure Service Health for details on resources confirmed or potentially affected by this event.





Current Status: We are aware and actively working on mitigating the incident. This situation is being closely monitored and we will provide updates as the situation warrants or once the issue is fully mitigated.


Mar 18, 2025 - 19:02 CET
Update - Metric Alert for fd-global-prod-001 - Service Availability is less than or equal to 95%
Mar 18, 2025 - 19:08 CET
Investigating - Metric Alert for fd-global-prod-001 - Service Availability is less than or equal to 95%
Mar 18, 2025 - 18:53 CET
Update -

Post Incident Review (PIR) – Azure Resource Manager – Timeouts or 5xx responses from ARM while calling an older API




What happened?


On 27 February 2025 a platform issue in Azure Resource Manager (ARM) caused inadvertent throttling that impacted different services:


  • Between 07:30 and 11:46 UTC on 27 February 2025, a subset of customers using Azure Resource Manager were impacted when some calls via internal backend systems for authentication details may have experienced 504 responses in the West Europe region.
  • Between 09:28 and 11:07 UTC on 27 February 2025, a subset of customers using Azure Log Analytics or Application Insights Query APIs were impacted when some calls may have experienced transient failures or degraded performance in the West Europe region.
  • Between 09:39 and 11:06 UTC on 27 February 2025, a subset of customers using Azure Container Apps were impacted when some calls, via Azure Resource Manager, may have experienced errors when attempting to create application containers in the West Europe region.



What went wrong and why?


When Azure Resource Manager (ARM) receives a request for authentication and authorization, in some specific scenarios it leverages an older API. The backend system responsible for these API calls experienced an unexpected rise in traffic during this incident and, as a result, throttled some of those calls. Throttling is a common resiliency strategy designed to regulate the rate at which internal resources are accessed. This helps prevent the system from being overwhelmed by a large volume of requests, protecting it while still allowing it to function for the majority of requests. During the impact window, we experienced an unusual rise in requests from an internal service in Azure, which led to internal throttling and a higher number of 504 errors.
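
As a simple illustration of the throttling strategy described above (a generic sketch, not the actual backend implementation), a token-bucket limiter admits requests up to a sustained rate plus a burst allowance and rejects the excess, which callers then see as throttled (e.g. 429/504-style) responses.

```python
import time

class TokenBucket:
    """Generic token-bucket throttle sketch (illustrative, not Azure's code)."""

    def __init__(self, rate, burst):
        self.rate = rate              # tokens replenished per second
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)    # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        """Admit one request if a token is available, else throttle it."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller would surface a throttled error response
```

Under a sudden traffic spike, requests beyond the burst allowance are rejected immediately instead of queuing until the backend is overwhelmed, which is the trade-off the incident narrative describes.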




How did we respond?


  • 07:30 UTC on 27 February 2025 - Customer impact began. Our internal monitoring system alerted us to this issue, prompting us to initiate an investigation.
  • 10:13 UTC on 27 February 2025 - Engineers determined this issue was caused by an increase in service traffic.
  • 11:00 UTC on 27 February 2025 - The Azure platform self-healed the issue.
  • 11:11 UTC on 27 February 2025 - After a period of monitoring to validate the mitigation, we confirmed service functionality had been restored, and no further impact was observed at this time.
  • 11:30 UTC on 27 February 2025 - As a proactive measure, our internal teams redistributed workloads across other nearby regions to prevent future recurrences while we continue to perform deeper investigations.


How are we making incidents like this less likely or less impactful?

  • We will update our code to make fewer requests to this API, to minimize the chances of reaching the threshold at which throttling occurs. (Estimated completion: June 2025)
  • We are working with the respective back-end teams to improve resiliency for these and similar traffic patterns, so that traffic like that seen during this impact is handled more gracefully. (Estimated completion: June 2025)
  • Additionally, the impacted API is on the deprecation path. We have modernized that part of our system and have been gradually migrating workflows away from it to the new systems; most of our systems are now integrated with the replacement. We will continue with the full deprecation of this API. (Estimated completion: November 2025)


How can customers make incidents like this less impactful?

  • While we understand that this instance was caused by internal back-end systems in Azure, we recommend that customers consider implementing appropriate retry logic so that applications can handle transient failures effectively. See: https://learn.microsoft.com/azure/architecture/patterns/retry
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts
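
The retry guidance above can be sketched as a simple exponential-backoff wrapper. This is a minimal illustration; the `flaky` operation and `TransientError` exception are hypothetical stand-ins for any call that can fail with a throttled (429) or gateway (504) response:

```python
import random
import time

class TransientError(Exception):
    """Stands in for a throttled (429) or gateway (504) response."""

def with_retries(operation, max_attempts=4, base_delay=0.5):
    """Retry a transient-failure-prone operation with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts; surface the failure to the caller
            # Exponential backoff with full jitter spreads retries out,
            # so clients do not all hammer the recovering service at once.
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))

# Hypothetical operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError()
    return "ok"

result = with_retries(flaky, base_delay=0.01)
```

Production applications would typically use an established resilience library rather than hand-rolled logic, but the backoff-with-jitter shape is the same.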


How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/SV7D-DV0


Mar 18, 2025 - 12:45 CET
Update -

Previous communications indicated an incorrect additional service as impacted. We have rectified the error and updated the communication below. We apologize for any confusion this may have caused.




What happened?


Between 07:30 and 11:11 UTC on 27 February 2025, a platform issue resulted in an impact to Azure Resource Manager operations in the West Europe region. A subset of customers may have experienced a temporary degradation in performance and latency when trying to access resources hosted in the region.


 


What do we know so far?


We determined that an increase in service traffic resulted in backend service components reaching an operational threshold. This led to service impact and manifested in the experience described above.


 


How did we respond?


  • 07:30 UTC on 27 February 2025 – Internal monitoring thresholds were breached, alerting us to this issue and prompting us to start our investigation; customer impact began. 
  • At approximately 10:13 UTC on 27 February 2025 – We determined this issue was caused by an increase in service traffic.
  • At approximately 11:00 UTC on 27 February 2025 – While validating the health of our Azure Resource Manager services and network, the Azure Platform self-healed the issue.
  • 11:11 UTC on 27 February 2025 – After a period of monitoring to validate the mitigation, we confirmed service functionality had been restored, and no further impact was observed. 



What happens next?


  • Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.
  • To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs.
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring.
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.


Feb 27, 2025 - 15:31 CET
Update -

What happened?


Between 07:30 and 11:11 UTC on 27 February 2025, a platform issue resulted in an impact to Azure Resource Manager operations in the West Europe region. A subset of customers may have experienced a temporary degradation in performance and latency when trying to access resources hosted in the region.


 


What do we know so far?


We determined that an increase in service traffic resulted in backend service components reaching an operational threshold. This led to service impact and manifested in the experience described above.


 


How did we respond?


  • 07:30 UTC on 27 February 2025 – Internal monitoring thresholds were breached, alerting us to this issue and prompting us to start our investigation; customer impact began. 
  • At approximately 10:13 UTC on 27 February 2025 – We determined this issue was caused by an increase in service traffic.
  • At approximately 11:00 UTC on 27 February 2025 – While validating the health of our Azure Kubernetes Service services and network, the Azure Platform self-healed the issue.
  • 11:11 UTC on 27 February 2025 – After a period of monitoring to validate the mitigation, we confirmed service functionality had been restored, and no further impact was observed. 



What happens next?


  • Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.
  • To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs.
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring.
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.


Feb 27, 2025 - 15:21 CET
Investigating -

What happened?


Between 07:30 and 11:11 UTC on 27 February 2025, a platform issue resulted in an impact to Azure Resource Manager operations in the West Europe region. A subset of customers may have experienced a temporary degradation in performance and latency when trying to access resources hosted in the region.


 


This issue is now mitigated. An update with more information will be provided shortly.


Feb 27, 2025 - 13:12 CET
Investigating - Metric Alert for sb-idp-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
Mar 18, 2025 - 11:40 CET
Update -

At 18:39 UTC on 26 February 2025, we received a monitoring alert for a possible issue with Managed Identities for Azure resources. Subsequently, communications were sent to customers, notifying them of this possible issue.


Upon further investigation during our post-incident review, we have determined that a significant percentage of those notified were not impacted by this event. We apologize for any confusion or inconvenience this may have caused.


If you were not impacted, please disregard the previous notification. We are committed to ensuring the accuracy of our communications and will continue to improve our processes and tooling to prevent such false notifications in the future.


For those customers who were impacted, you will receive subsequent messaging with the final Post Incident Review (PIR).


Mar 11, 2025 - 19:06 CET
Update -

What happened?


Between 18:39 and 20:55 UTC on 26 February 2025, we experienced an issue which left customers unable to perform control plane operations related to Azure Managed Identity. This included impact to the following services: Azure Container Apps, Azure SQL, Azure SQL Managed Instance, Azure Front Door, Azure Resource Manager, Azure Synapse Analytics, Azure Databricks, Azure Chaos Studio, Azure App Services, Azure Logic Apps, Azure Media Services, Microsoft Power BI and Azure Service Bus.


 


What do we know so far?


We identified an issue with our Managed Identity infrastructure related to a key rotation. We performed manual steps to repair the key in each region, which resolved the issue.


 


How did we respond?


  • 18:39 UTC on 26 February 2025 – Customer impact began.
  • 18:49 UTC on 26 February 2025 – Engineering teams engaged on the incident.
  • 18:58 UTC on 26 February 2025 – Key rotation issue identified as the cause of the incident.
  • 20:05 UTC on 26 February 2025 – First set of regions successfully mitigated.
  • 20:55 UTC on 26 February 2025 – Services restored in all regions; customer impact mitigated.

 


What happens next?


  • Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.
  • To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness



Feb 27, 2025 - 00:58 CET
Update -

Impact Statement: Starting at 16:48 UTC on 26 February 2025, you have been identified as a customer using Managed Identities who may be unable to create, update, delete, or scale up Azure resources using Managed Identities, and/or request tokens in some cases. Azure Chaos Studio customers may also have been unable to create or run experiments.


 


Current Status: We have identified the issue and have begun to roll out a fix region-by-region. The regions where customers should see mitigation are Central US, North Europe, West US, UK West, West Europe, East US, East US 2, Korea Central, Canada Central, West US 2, Australia Central, Australia East, Japan East, Sweden Central, UK South, South Central US, Southeast Asia, West US 3, UAE Central, West Central US, Canada East, Brazil South, Central India, France Central, Germany West Central, North Central US, UAE North, Switzerland North, South India, Australia Southeast, Norway East, Italy North, Korea South, Switzerland West, Sweden South, South Africa North, Mexico Central, Norway West, South Africa West, Israel Central, Poland Central, Jio India West, West India, France South, Germany North, Brazil Southeast, Jio India Central.


Feb 26, 2025 - 23:37 CET
Update -

Impact Statement: Starting at 16:48 UTC on 26 February 2025, you have been identified as a customer using Managed Identities who may be unable to create, update, delete, or scale up Azure resources using Managed Identities, and/or request tokens in some cases. Azure Chaos Studio customers may also have been unable to create or run experiments.


 


Current Status: We are currently investigating this issue and suspect it is related to a certificate. We will provide additional information as it becomes available. The next update will be provided in 60 minutes, or as events warrant.


Feb 26, 2025 - 23:18 CET
Update -

Between 18:39 and 20:55 UTC on 26 February 2025, we experienced an issue which left customers unable to perform control plane operations related to Azure Managed Identity. This included impact to the following services: Azure Container Apps, Azure SQL, Azure SQL Managed Instance, Azure Front Door, Azure Resource Manager, Azure Synapse Analytics, Azure Databricks, Azure Chaos Studio, Azure App Services, Azure Logic Apps, Azure Media Services, Microsoft Power BI and Azure Service Bus.




Information on steps taken to mitigate this incident will be provided shortly.


Feb 26, 2025 - 23:03 CET
Investigating -

Impact Statement: Starting at 16:48 UTC on 26 February 2025, you have been identified as a customer using Managed Identities who may be unable to create, update, delete, or scale up Azure resources using Managed Identities, and/or request tokens in some cases. Azure Chaos Studio customers may also have been unable to create or run experiments.


 


Current Status: We are currently investigating this issue and suspect it is related to a certificate. We will provide additional information as it becomes available. The next update will be provided in 60 minutes, or as events warrant.


Feb 26, 2025 - 21:54 CET
Update - Metric Alert for fd-global-prod-001 - Service Availability is less than or equal to 95%
Mar 06, 2025 - 09:03 CET
Investigating - Metric Alert for fd-global-prod-001 - Service Availability is less than or equal to 95%
Mar 06, 2025 - 07:08 CET
Investigating - Metric Alert for sb-fcs-prod-euw-001 - the maximum Count of dead-lettered messages in a Queue/Topic is greater than or equal to 10.
Mar 05, 2025 - 14:54 CET
Update -

Post Incident Review (PIR) – Cosmos DB – Impacted multiple services in West Europe




What happened?


Between 19:03 UTC and 22:08 UTC on 10 February 2025, a Cosmos DB scale unit hosting Cosmos DB containers in the West Europe region experienced failures and was unable to respond to customer requests. Cosmos DB is a distributed database management system, and data for a container is sharded and stored over multiple sets of machines based on partition key values of items stored in the container. In this case one of these sets of machines became unavailable, leading to unavailability for a subset of containers in the region.


Cosmos DB accounts would have been impacted in different ways, depending on their configuration:


  • Database accounts with multiple read regions and a single write region outside West Europe maintained availability for reads and writes if configured with session or lower consistency.
  • Accounts using strong consistency with less than three regions or bounded staleness consistency may have experienced write throttling to preserve consistency guarantees until the West Europe region was either taken offline or recovered. This behavior is by design.
  • Active-passive database accounts with multiple read regions and a single write region in West Europe maintained read availability, but write availability was impacted until the West Europe region was taken offline or recovered.
  • Single-region database accounts in West Europe were impacted if any partition resided on the affected instances.
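
The configuration-dependent behaviors above can be summarized in a small sketch. This is a deliberate simplification for illustration; actual Cosmos DB failover semantics are more nuanced than this model:

```python
def availability_during_region_loss(write_regions, read_regions,
                                    consistency, lost_region):
    """Rough model of the account behaviors listed above when one
    region's replica set becomes unavailable (illustrative only)."""
    # Reads survive if any read region other than the lost one remains.
    reads_ok = any(r != lost_region for r in read_regions)
    if consistency == "bounded_staleness" or (
            consistency == "strong" and len(read_regions) < 3):
        # Writes throttle to preserve consistency guarantees until the
        # lost region is taken offline or recovers (by design).
        writes_ok = False
    else:
        writes_ok = lost_region not in write_regions
    return reads_ok, writes_ok

# Multiple read regions, single write region outside West Europe,
# session consistency: reads and writes both stay available.
r, w = availability_during_region_loss(
    write_regions=["northeurope"],
    read_regions=["northeurope", "westeurope"],
    consistency="session",
    lost_region="westeurope",
)
```

A single-region account in the lost region, by contrast, loses both reads and writes, matching the last bullet above.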



Other downstream services which had dependencies on Cosmos DB were impacted, these included:


  • Azure AD Fusion: Between 19:08 and 20:59 UTC on 10 February, customers using the AAD Fusion service would have experienced 504 or timeout errors.
  • Azure AD B2C: Between 19:19 and 20:02 UTC on 10 February, a subset of customers using Azure Active Directory B2C may have had their end users experience intermittent failures in Europe while trying to authenticate against B2C applications.
  • Azure Data Factory (ADF): Between 19:25 and 21:10 UTC on 10 February, customers using: Integration Runtime, Pipeline and Trigger CRUD operation; Pipeline and Trigger executions; Sandbox operations; Query pipeline and trigger run history; Query activity status; or Dataflow operations would have encountered error messages.
  • Azure Resource Manager (ARM): Between 19:05 and 19:55 UTC on 10 February, customers in Europe or working with resources in Europe may have been unable to view, create, update or delete resources.
  • Azure IoT Hub: Between 19:10 and 20:14 UTC on 10 February, customers using the device provisioning service would have seen issues registering and retrieving devices.
  • Azure Portal: Between 18:14 and 19:30 UTC on 10 February, customers in Europe (being serviced from West Europe and/or UK South) may have experienced increased latency and intermittent connectivity failures when attempting to access resources in the Azure Portal.
  • Azure Multi-Factor Authentication (MFA): Between 19:05 and 20:05 UTC on 10 February, a subset of customers may have experienced MFA failures while trying to authenticate with Phone App MFA methods. Some calls would have succeeded on retry.
  • Azure Synapse Job Service Email: Between 19:44 and 20:05 UTC on 10 February, customers in Europe or those working with European resources may have experienced issues viewing, creating, updating, or deleting resources such as OS disks or VMs.
  • Azure Synapse Platform Service: Between 19:14 and 21:06 UTC on 10 February, our Synapse customers experienced the Synapse resource provider (RP) provisioning failing due to timeouts from Azure Data Factory.
  • Microsoft Entra Identity Diagnostics: Between 19:37 and 20:23 UTC on 10 February, customers could not load the 'diagnose and solve problems' blade in Microsoft Entra ID.
  • Microsoft Entra Privileged Identity Management (PIM): Between 19:09 and 19:41 UTC on 10 February, customers using Microsoft Entra Privileged Identity Management for Azure resources may have been unable to view, create, update, delete, or activate role assignments.
  • Microsoft Entra Terms of Use: Between 19:08 and 20:02 UTC on 10 February, customers who configured terms of use for their users would see errors during sign-in if they have not accepted terms of use before.
  • Microsoft PowerApps: Between 19:08 and 20:03 UTC on 10 February, customers would have experienced intermittent issues playing power apps.
  • Microsoft Common Data Service (CDS): Between 19:03 and 20:48 UTC on 10 February, customers faced issues when performing CRUD (create/read/update/delete) operations on elastic entities. Audit retrieve and create operations were also affected; the create failures were retried internally, while customers could see problems when retrieving audit data.


What went wrong and why?

While performing platform maintenance, a set of metadata nodes was down for updates. Metadata nodes are system nodes that keep the scale unit in a healthy condition. This type of maintenance has taken place regularly and reliably for many years to provide security, performance, and availability improvements. During this maintenance, a further set of metadata nodes experienced an unexpected failure, and the total number of down metadata nodes exceeded the maximum allowed to maintain scale unit integrity, leaving too few metadata nodes to keep the scale unit up and functional.

Ordinarily this transient state would not lead to failures, as the system is designed to handle them, but some of the nodes got stuck in a boot-up sequence and had to be restarted to reestablish the number of metadata nodes needed to maintain the health of the scale unit. We determined that there was insufficient buffer in the number of metadata nodes under maintenance to absorb the additional loss of metadata nodes experienced. Had the buffer of metadata nodes been larger, or had the failed metadata nodes been able to self-recover, the scale unit would not have entered a failed state.
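
The buffer arithmetic described above can be illustrated with a small sketch. The node counts are hypothetical; the real scale-unit quorum rules are internal to Cosmos DB:

```python
def scale_unit_healthy(max_down: int, in_maintenance: int, failed: int) -> bool:
    """A scale unit stays functional only while the combined number of
    metadata nodes that are down (planned maintenance plus unexpected
    failures) does not exceed the maximum the quorum rules allow."""
    return (in_maintenance + failed) <= max_down

# Hypothetical scale unit where at most 2 metadata nodes may be down.
# Maintenance takes 2 nodes down, leaving no buffer for surprises:
ok_during_maintenance = scale_unit_healthy(max_down=2, in_maintenance=2, failed=0)

# One unexpected failure on top of maintenance tips the unit over:
after_unexpected_failure = scale_unit_healthy(max_down=2, in_maintenance=2, failed=1)

# With a larger buffer (only 1 node under maintenance at a time),
# the same unexpected failure would have been tolerated:
with_buffer = scale_unit_healthy(max_down=2, in_maintenance=1, failed=1)
```

This is exactly the shape of the repair items below: raise `max_down` (more system nodes) and cap `in_maintenance` (reduced maintenance concurrency) so that unexpected failures fit within the remaining buffer.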


How did we respond?

  • 19:03 UTC on 10 February 2025 – Customer impact began.
  • 19:14 UTC on 10 February 2025 – Service monitoring detected failed requests, alerting us to begin an investigation. Upon reviewing the failure logs, we were able to identify the requests were failing on one specific scale unit.
  • 20:03 UTC on 10 February 2025 – To resolve the issue, we brought the unhealthy nodes back to a healthy state. Most customers saw a partial or full recovery at this point.
  • 22:08 UTC on 10 February 2025 – Services fully restored, and all customer impact mitigated.


Upon detection, Cosmos DB engineers determined that a single scale unit had a set of nodes impacted that caused the scale unit to become unavailable. Engineers determined that a subset of machines in the scale unit had entered a stuck state and required a manual reboot to recover. Once manually rebooted, the nodes were able to recover and availability was restored at 20:03 UTC. A small subset of those machines required additional steps to recover, leading to the longer recovery time (until 22:08 UTC) for a small set of containers.


How are we making incidents like this less likely or less impactful?

  • We identified global configurations that will make scale units resilient to similar failures by increasing the number of system nodes, which increases the number of nodes that can be in a failed state. We are in the process of rolling out these changes. (Estimated completion: March 2025)
  • In addition, we are making configuration changes to reduce the concurrency of platform maintenance jobs, and to prevent maintenance jobs from starting without a sufficient buffer in the number of system nodes. (Estimated completion: March 2025)
  • In the longer term, we are reviewing critical-path dependencies on additional system services on metadata nodes. Our goal is to remove any unneeded dependencies in order to reduce potential failure points. This is not required to prevent this specific failure, but adds additional defense in depth. (Estimated completion: September 2025)


How can customers make incidents like this less impactful?


How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/CSDL-1X8


Feb 26, 2025 - 01:25 CET
Update -

Previous communications contained an inaccurate/incomplete list of impacted services. We have rectified the errors and updated the communication below. We apologize for any confusion this may have caused.




What happened?


Between 19:09 UTC and 22:08 UTC on 10 February 2025, a platform issue with Cosmos DB caused degraded service availability for subsets of the following services in the West Europe region:


  • Azure Cosmos DB
  • Azure Data Factory
  • Azure IoT Hub
  • Azure Resource Manager (ARM)
  • Azure Portal
  • Azure Synapse Analytics
  • Microsoft Entra ID Terms of Use (TOU)
  • Microsoft Entra Multi-Factor Authentication (MFA)
  • Microsoft Entra Privileged Identity Management (PIM)



What do we know so far?


We have identified a group of nodes in the region that became unhealthy, leading to the cluster serving those nodes becoming unavailable. This affected instances of Cosmos DB, which the affected services rely on to process requests. Due to this inability to process requests, subsets of those services became unavailable.




How did we respond?


  • 19:09 UTC on 10 February 2025 – Customer impact began.
  • 19:14 UTC on 10 February 2025 – Service monitoring detected failed requests, alerting us to begin investigation. Upon reviewing the failure logs, we were able to identify the requests were failing on a specific cluster.
  • 20:03 UTC on 10 February 2025 – To resolve the issue, we brought the unhealthy nodes back to a healthy state.
  • 22:08 UTC on 10 February 2025 – Services restored, and customer impact mitigated.


What happens next?

Our team will be completing an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers.

  • To get notified when that happens, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs.
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring.
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.


Feb 11, 2025 - 01:36 CET
Update -

What happened?


Between 19:09 UTC and 22:08 UTC on 10 February 2025, a platform issue resulted in an impact to the following services in West Europe:


  • Azure Cosmos DB
  • Azure Resource Manager
  • Azure IoT Hub
  • Microsoft Entra ID Terms of Use (TOU)
  • Microsoft Azure portal
  • Azure Data Factory
  • Identity and Access Management (IAM) Services
  • Microsoft Entra multifactor authentication (MFA)

 


This issue is now mitigated. An update with more information will be provided shortly.


Feb 11, 2025 - 00:08 CET
Update -

What happened?


Between 19:09 UTC and 22:08 UTC on 10 February 2025, a platform issue resulted in an impact to the following services in West Europe:


  • Azure Cosmos DB
  • Azure Resource Manager
  • Azure IoT Hub
  • Microsoft Entra ID Terms of Use (TOU)
  • Microsoft Azure portal
  • Azure Data Factory
  • Identity and Access Management (IAM) Services
  • Microsoft Entra multifactor authentication (MFA)

 


This issue is now mitigated. An update with more information will be provided shortly.


Feb 10, 2025 - 23:42 CET
Investigating -

Impact Statement: Starting at 19:09 UTC on 10 February 2025, customers in West Europe may experience degradation in service availability for these affected services:


  • Azure Cosmos DB
  • Azure Resource Manager
  • Azure IoT Hub
  • Microsoft Entra ID Terms of Use (TOU)
  • Microsoft Azure portal
  • Azure Data Factory
  • Identity and Access Management (IAM) Services
  • Microsoft Entra multifactor authentication (MFA)

 


Current Status: We are aware of this issue and are actively investigating potential contributing factors. The next update will be provided within 60 minutes, or as events warrant.


Feb 10, 2025 - 22:39 CET
Graphisoft ID Degraded Performance
90 days ago
100.0 % uptime
Today
Graphisoft License Delivery Degraded Performance
90 days ago
100.0 % uptime
Today
Graphisoft Store Operational
90 days ago
100.0 % uptime
Today
Graphisoft Legacy Store Degraded Performance
90 days ago
100.0 % uptime
Today
Graphisoft Legacy Webshop Degraded Performance
90 days ago
100.0 % uptime
Today
GSPOS Operational
90 days ago
100.0 % uptime
Today
Graphisoft BIM Components Operational
90 days ago
100.0 % uptime
Today
Graphisoft BIMx Transfer Operational
90 days ago
100.0 % uptime
Today
Graphisoft DevOps Components Operational
90 days ago
100.0 % uptime
Today
Microsoft Incidents Operational
90 days ago
100.0 % uptime
Today
Apr 3, 2025

No incidents reported today.

Apr 2, 2025

No incidents reported.

Apr 1, 2025

Unresolved incidents: Metric Alert - webshop-documentdb - NormalizedRUConsumption95, Metric Alert - webshop-documentdb - NormalizedRUConsumption95, Metric Alert - webshop-documentdb - NormalizedRUConsumption95, Metric Alert - webshop-documentdb - NormalizedRUConsumption95, Microsoft Incident - Azure Cosmos DB - Mitigated - Performance Degradation for multiple Azure services in North Europe, Microsoft Incident - Application Insights - Post Incident Review (PIR) – Application Insights – Intermittent data gaps on custom metrics data in West Europe.

Mar 31, 2025

Unresolved incident: Microsoft Incident - Network Infrastructure - PIR – Network Connectivity – Availability issues in East US.

Mar 30, 2025

No incidents reported.

Mar 29, 2025

No incidents reported.

Mar 28, 2025

Unresolved incident: Microsoft Incident - Network Infrastructure - Post Incident Review (PIR) - Network connectivity - Issues impacting Azure services in West Europe.

Mar 27, 2025

No incidents reported.

Mar 26, 2025

No incidents reported.

Mar 25, 2025

No incidents reported.

Mar 24, 2025

Unresolved incident: Failure Anomalies - gsitweb-p-euw-gsid-ai.

Mar 23, 2025

No incidents reported.

Mar 22, 2025
Mar 21, 2025

No incidents reported.

Mar 20, 2025

No incidents reported.