Microsoft Incident - Virtual Machines - Mitigated – Incorrect Resource Health Status for Azure Virtual Machines
Incident Report for Graphisoft
Update

What happened?


Between 19:51 UTC on 07 September 2024 and 11:30 UTC on 10 September 2024, a platform issue impacted the Azure Virtual Machines service. Customers may have seen resource health status displayed incorrectly as 'unknown' for resources that may in fact have been healthy. This discrepancy may have caused inaccuracies in activity logs, potentially leading to incorrect alerts (either false alerts or missed alerts).
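
For customers reviewing their own alert history, the following is a minimal Python sketch of how 'unknown' health records that fall inside the impact window could be set aside for manual review. The window boundaries come from this report; the record layout (the `resource`, `timestamp`, and `status` keys) and the sample data are hypothetical and do not reflect the actual Azure activity log schema. The sketch can only flag possibly false 'unknown' records; it cannot recover alerts that were missed.

```python
from datetime import datetime, timezone

# Impact window as stated in this incident report (UTC).
IMPACT_START = datetime(2024, 9, 7, 19, 51, tzinfo=timezone.utc)
IMPACT_END = datetime(2024, 9, 10, 11, 30, tzinfo=timezone.utc)


def is_suspect(event: dict) -> bool:
    """Return True for an 'unknown' status record inside the impact window.

    The event layout (keys 'timestamp' and 'status') is an assumption made
    for illustration only.
    """
    ts = datetime.fromisoformat(event["timestamp"])
    if ts.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    in_window = IMPACT_START <= ts <= IMPACT_END
    return in_window and event["status"].lower() == "unknown"


def split_events(events):
    """Split records into (suspect, trusted) lists, preserving order."""
    suspect = [e for e in events if is_suspect(e)]
    trusted = [e for e in events if not is_suspect(e)]
    return suspect, trusted


# Hypothetical sample records, not real log data.
SAMPLE_EVENTS = [
    {"resource": "vm-a", "timestamp": "2024-09-08T03:15:00+00:00", "status": "Unknown"},
    {"resource": "vm-b", "timestamp": "2024-09-08T03:20:00+00:00", "status": "Unavailable"},
    {"resource": "vm-c", "timestamp": "2024-09-11T09:00:00+00:00", "status": "Unknown"},
]

if __name__ == "__main__":
    suspect, trusted = split_events(SAMPLE_EVENTS)
    for event in suspect:
        print("review manually:", event["resource"], event["timestamp"])
```

Records flagged this way are candidates for review, not confirmed false alerts, since a resource may genuinely have been in an unknown state during the window.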


 


What do we know so far?


We determined that a recent change resulted in an increase in traffic for multiple backend clusters supporting Virtual Machines. This led to inaccuracies in the health status of some resources, which may in turn have caused incorrect or missed alerts.


 


How did we respond?


  • 19:51 UTC on 07 September 2024 – Customer impact began.
  • 02:08 UTC on 08 September 2024 – We received customer reports indicating that this issue was impacting multiple resources. We had initially been investigating isolated occurrences of the issue; once we had the additional reports, we began further investigation.
  • 07:08 UTC on 09 September 2024 – The recent change to the service was identified as a main contributing factor.
  • 09:28 UTC on 09 September 2024 – We performed scale-out operations to allow our service to handle increased load.
  • 09:45 UTC on 09 September 2024 – Our telemetry showed the service was healthy and we issued resolved communications.
  • 15:10 UTC on 09 September 2024 – We observed the issue recur. To achieve full mitigation, we executed a rollback of the problematic change in accordance with our Safe Deployment Practices (SDP).
  • 11:30 UTC on 10 September 2024 – Service restored, and customer impact mitigated.

 


What happens next?


Our team will complete an internal retrospective to understand the incident in more detail. Once that is completed, generally within 14 days, we will publish a Post Incident Review (PIR) to all impacted customers. To get notified when that happens, and to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts, which can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts . For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs . The impact times above represent the full incident duration, so they are not specific to any individual customer. Actual impact to service availability may vary between customers and resources; for guidance on implementing monitoring to understand granular impact, refer to https://aka.ms/AzPIR/Monitoring . Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness . 

Posted Sep 11, 2024 - 02:38 CEST
Update

What happened?


Between 19:51 UTC on 07 September 2024 and 11:30 UTC on 10 September 2024, a platform issue impacted the Azure Virtual Machines service. Customers may have seen resource health status displayed incorrectly as 'unknown' for resources that may in fact have been healthy. This discrepancy may have caused inaccuracies in activity logs, potentially leading to incorrect alerts (either false alerts or missed alerts).


 


This issue is now mitigated. An update with more information will be provided shortly.

Posted Sep 11, 2024 - 02:00 CEST
Update

Impact Statement: Starting at 19:51 UTC on 7 September 2024, you have been identified as a customer using Azure Virtual Machines whose resource health status may display incorrectly, indicating an 'unknown' status even though they may be healthy. This discrepancy in the health status may result in inaccuracies in your activity logs, potentially leading to incorrect alerts—either false alerts or missed alerts.


 


Current Status: We have correlated this issue to a recent change that caused an increase in traffic for multiple backend scale units supporting resource health metrics for Virtual Machines across several regions.


 


We have completed our efforts to increase capacity and scale for our mitigation operations, and are making good progress in executing a rollback of the recent change in accordance with our Safe Deployment Practices (SDP). Many customers may already be seeing signs of recovery when viewing the Resource Health status. We are progressing through affected regions, and expect the rollback to be complete within the next 24 hours, at which point all customer impact should be mitigated.


 


We will provide the next update in 12 hours, or sooner if events warrant.

Posted Sep 10, 2024 - 19:43 CEST
Update

Impact Statement: Starting at 19:51 UTC on 07 September 2024, you have been identified as a customer using Azure Virtual Machines whose resource health status may display incorrectly, indicating an 'unknown' status even though they may be healthy. This discrepancy in the health status may result in inaccuracies in your activity logs, potentially leading to incorrect alerts—either false alerts or missed alerts.


Current Status: In earlier communications, we indicated that this issue had been mitigated. However, our telemetry shows the issue has resurfaced. We have correlated this recurrence to a recent change that caused an increase in traffic for multiple backend scale units supporting resource health metrics for Virtual Machines across several regions. 


We have completed our efforts to increase capacity and scale for our mitigation operations, and some customers may already be experiencing signs of recovery when viewing the Resource Health status. 


To ensure full recovery, we are executing a rollback of the recent change in accordance with our Safe Deployment Practices (SDP), and progress is going well. Since this will take some time, we will provide the next update in 5 hours, or sooner if events warrant.

Posted Sep 10, 2024 - 15:51 CEST
Update

Impact Statement: Starting at 19:51 UTC on 07 September 2024, you have been identified as a customer using Azure Virtual Machines whose resource health status may display incorrectly, indicating an 'unknown' status even though they may be healthy. This discrepancy in the health status may result in inaccuracies in your activity logs, potentially leading to incorrect alerts—either false alerts or missed alerts.


Current Status: In earlier communications, we indicated that this issue had been mitigated. However, our telemetry shows the issue has resurfaced. We have correlated this recurrence to a recent change that caused an increase in traffic for multiple backend scale units supporting resource health metrics for Virtual Machines across several regions. 


We continue to increase capacity and scale for our mitigation operations, and some customers may already be experiencing improvements. 


To ensure full recovery, we are executing a rollback in accordance with our Safe Deployment Practices (SDP), and progress is going well. Since this will take some time, we will provide the next update in 10 hours, or sooner if events warrant.

Posted Sep 10, 2024 - 11:50 CEST
Update

Impact Statement: Starting at 19:51 UTC on 07 September 2024, you have been identified as a customer using Azure Virtual Machines whose resource health status may display incorrectly, indicating an 'unknown' status even though they may be healthy. This discrepancy in the health status may result in inaccuracies in your activity logs, potentially leading to incorrect alerts—either false alerts or missed alerts.




Current Status: In earlier communications, we indicated that this issue had been mitigated. However, our telemetry shows the issue has resurfaced. We have correlated this recurrence to a recent change that caused an increase in traffic for multiple backend scale units supporting resource health metrics for Virtual Machines across several regions. We continue to increase capacity and scale for our mitigation operations, and some customers may already be experiencing improvements. To ensure full recovery, we are executing a rollback in accordance with our Safe Deployment Practices (SDP), and progress is going well. Since this will take some time, we will provide the next update within 4 hours, or sooner if necessary.

Posted Sep 10, 2024 - 07:14 CEST
Update

Impact Statement: Starting at 19:51 UTC on 07 September 2024, you have been identified as a customer using Azure Virtual Machines whose resource health status may display incorrectly, indicating an 'unknown' status even though they may be healthy. This discrepancy in the health status may result in inaccuracies in your activity logs, potentially leading to incorrect alerts—either false alerts or missed alerts.


 


Current Status: In earlier communications, we indicated that this issue had been mitigated. However, our telemetry shows the issue has resurfaced. We have correlated this recurrence to a recent change that caused an increase in traffic for multiple backend scale units supporting resource health metrics for Virtual Machines across several regions. We continue to increase capacity and scale for our mitigation operations; as a result, customers should begin to notice some improvements. To ensure a full recovery, we are executing a rollback following our Safe Deployment Practices (SDP). Since this will take some time, we will provide the next update within 4 hours, or sooner if necessary.

Posted Sep 10, 2024 - 02:34 CEST
Update

Impact Statement: Starting at 19:51 UTC on 07 September 2024, you have been identified as a customer using Azure Virtual Machines whose resource health status may display incorrectly, indicating an 'unknown' status even though they may be healthy. This discrepancy in the health status may result in inaccuracies in your activity logs, potentially leading to incorrect alerts—either false alerts or missed alerts.




Current Status: In earlier communications, we indicated that this issue had been mitigated. However, our telemetry shows the issue has resurfaced. We have correlated this recurrence to a recent change that caused an increase in traffic for multiple backend scale units supporting resource health metrics for Virtual Machines across several regions. We continue to focus on increasing capacity as we work on validating a rollback in accordance with our Safe Deployment Practices (SDP) for a complete recovery. Since this will take some time, we will provide the next update within 4 hours, or sooner if necessary.

Posted Sep 09, 2024 - 23:23 CEST
Investigating

Impact Statement: Starting at 19:51 UTC on 07 September 2024, you have been identified as a customer using Azure Virtual Machines whose resource health status may display incorrectly, indicating an 'unknown' status even though they may be healthy. This discrepancy in the health status may result in inaccuracies in your activity logs, potentially leading to incorrect alerts—either false alerts or missed alerts.




Current Status: In earlier communications, we indicated that this issue had been mitigated. However, our telemetry shows the issue has resurfaced. We have correlated this recurrence to a recent change that caused an increase in traffic for multiple backend scale units supporting resource health metrics for Virtual Machines across several regions. We continue to focus on increasing capacity as we work on validating a rollback in accordance with our Safe Deployment Practices (SDP) for a complete recovery. Since this will take some time, we will provide the next update within 4 hours, or sooner if necessary.

Posted Sep 09, 2024 - 23:02 CEST