Microsoft Incident – Post Incident Review (PIR) – Application Insights – Intermittent data gaps on custom metrics data in West Europe

Incident Report for Graphisoft

Update

What happened?


Between 13:00 UTC on 10 March 2025 and 00:25 UTC on 18 March 2025, a platform issue impacted the Application Insights service in the South Central US and West Europe regions. Customers may have experienced intermittent data gaps on custom metrics data and/or incorrect alert activation.




What went wrong and why?


Application Insights Ingestion is the service that handles ingesting and routing Application Insights data from customers. One of its internal components is a cache that stores information about each customer's Application Insights resource configuration. This cache is deployed at a region level, so it is shared by multiple clusters in a region. During a deployment, some regions deploy to one cluster first, then wait until the next business day before deploying to the remaining clusters. Feature work was underway that added a new flag to the Application Insights resource configuration stored in the cache. The flag was intended to default to true, in which case it would not change the behavior of Application Insights Ingestion. If the flag was set to false, however, the service would stop sending custom metrics data to the Log Analytics workspace.
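As an illustration only (the names below are hypothetical, not the service's actual code), the cache entry described here could be sketched in Python along these lines, with the flag intended to default to true:

    import json
    from dataclasses import dataclass, asdict

    # Hypothetical cache-entry contract for an Application Insights resource
    # configuration; field names are illustrative only.
    @dataclass
    class ResourceConfigEntry:
        resource_id: str
        # Intended default: keep routing custom metrics to Log Analytics.
        route_custom_metrics: bool = True

    def write_entry(entry: ResourceConfigEntry) -> str:
        """Serialize an entry into the shared, region-level cache."""
        return json.dumps(asdict(entry))

    # write_entry(ResourceConfigEntry("ai-resource-1")) ->
    # '{"resource_id": "ai-resource-1", "route_custom_metrics": true}'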


A recent incident in a separate cloud was caused by this flag becoming incorrectly set to false. In response, it was decided to invert the flag's meaning, so that a default of false would result in no-op behavior instead. As part of this change, the original flag was removed from the contract used to serialize cache entries. The change was then deployed, starting with the first cluster and then waiting until the next business day before deploying to the remaining clusters. During this window, the first cluster began serializing new cache entries that were missing a value for the original (default true) flag. The remaining clusters, still running the old deployment, therefore read cache entries with this flag effectively set to false and stopped routing custom metrics data to Log Analytics. Once the deployment completed in a region, the impact resolved, because all clusters were then running the new code with the correct default value for the flag.
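Continuing the hypothetical sketch above, the mixed-version failure can be reproduced if the upgraded cluster writes entries without the old flag while a not-yet-upgraded cluster reads them with a deserializer that fills missing booleans with the type default (false) rather than the intended default (true). How the real service's deserializer handles missing fields is an assumption here:

    import json

    # v2 writer on the upgraded cluster: the original flag was removed from the
    # contract, so new cache entries simply omit it (illustrative names only).
    def write_entry_v2(resource_id: str) -> str:
        return json.dumps({
            "resource_id": resource_id,
            "disable_custom_metrics_routing": False,  # new, inverted flag
        })

    # v1 reader on a cluster still running the old deployment: a missing field
    # falls back to the type default (False) instead of the intended True.
    def read_entry_v1(raw: str) -> dict:
        data = json.loads(raw)
        return {
            "resource_id": data.get("resource_id"),
            "route_custom_metrics": data.get("route_custom_metrics", False),
        }

    entry = read_entry_v1(write_entry_v2("ai-resource-1"))
    assert entry["route_custom_metrics"] is False  # routing silently turns off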


There was no monitoring of data volume drops by data type, and no new monitor for the flag's operation was added, since the flag being active was considered normal operation. As a result, the deployment proceeded to an additional region before the issue was detected. The incident persisted for around 24 hours in the South Central US region before the deployment completed there. Because the issue was not detected by automated monitoring, the deployment then proceeded to the West Europe region, where it was applied to the first cluster. Because this rollout ran into a weekend, the impact persisted for several days before the deployment finished. Eventually, several customers raised tickets after noticing that their custom metrics data was missing. Throughout the incident, the flag was effectively read as false on clusters running the older version, causing the ingestion service to incorrectly stop routing custom metrics data to Log Analytics.
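For context on the kind of monitoring that was missing, a per-data-type volume-drop check could, in spirit, compare the latest ingestion volume for each data type against a trailing baseline and alert on a large relative drop. The sketch below is a generic illustration, not the service's actual monitoring pipeline:

    def find_volume_drops(hourly_volumes: dict[str, list[float]],
                          baseline_hours: int = 24,
                          drop_threshold: float = 0.5) -> list[str]:
        """Return the data types whose latest hourly volume fell below
        drop_threshold times their trailing-average volume (illustrative logic)."""
        alerts = []
        for data_type, series in hourly_volumes.items():
            if len(series) <= baseline_hours:
                continue  # not enough history to form a baseline
            baseline = sum(series[-baseline_hours - 1:-1]) / baseline_hours
            latest = series[-1]
            if baseline > 0 and latest < drop_threshold * baseline:
                alerts.append(data_type)
        return alerts

    # Example: custom metrics volume collapses while request volume stays flat.
    volumes = {
        "customMetrics": [100.0] * 24 + [3.0],
        "requests": [250.0] * 24 + [245.0],
    }
    print(find_volume_drops(volumes))  # ['customMetrics']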




How did we respond?


  • 13:00 UTC on 10 March 2025 – Customer impact in South Central US began.
  • 23:30 UTC on 11 March 2025 – This issue was auto-mitigated in South Central US when the deployment finished, once all the clusters were on the same version.
  • 13:30 UTC on 13 March 2025 – Customer impact in West Europe began.
  • 16:11 UTC on 17 March 2025 – Issue was detected via customer reports, which prompted us to start our investigation.
  • 23:00 UTC on 17 March 2025 – We identified that the issue was caused by a recent deployment, as described above.
  • 23:01 UTC on 17 March 2025 – The deployment was triggered and expedited to ensure that all clusters were on the same version, mitigating the issue.
  • 00:25 UTC on 18 March 2025 – Service restored, and customer impact mitigated in all regions.



How are we making incidents like this less likely or less impactful?


  • We have added unit tests for backwards compatibility of the cache contract, which validate that new tests are added whenever the contract is changed (see the test sketch after this list). (Completed)
  • We have added a dedicated monitor that fires when the new flag is activated, to help detect and mitigate related issues more quickly. (Completed)
  • We have improved our change review process by requiring that a risk assessment be completed for each change. (Completed)
  • We are improving our monitoring of data volume drops to cover every data type. (Estimated completion: April 2025)
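As a rough illustration of the first repair item (hypothetical module and names, not the actual test suite), a backwards-compatibility test can pin the payload shape written by the newer contract version and assert that an older reader still ends up with the intended default. With the buggy reader sketched earlier, a test like this fails, which is exactly the signal that would have flagged the incompatibility before rollout:

    import unittest

    # Hypothetical import of the v1 reader sketched earlier in this report.
    from ingestion_cache_sketch import read_entry_v1

    class CacheContractBackwardsCompatibilityTest(unittest.TestCase):
        """Illustrative only: golden v2 payload read back through the v1 reader."""

        # Entry exactly as the upgraded (v2) clusters write it: the old flag is absent.
        V2_PAYLOAD = '{"resource_id": "ai-resource-1", "disable_custom_metrics_routing": false}'

        def test_v1_reader_keeps_routing_enabled_for_v2_payloads(self):
            entry = read_entry_v1(self.V2_PAYLOAD)
            # The intended default is True; a reader that falls back to False
            # here breaks custom metrics routing during a mixed-version rollout.
            self.assertTrue(entry["route_custom_metrics"])

    if __name__ == "__main__":
        unittest.main()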


How can customers make incidents like this less impactful?

  • There was nothing customers could have done to avoid or minimize impact from this specific service incident.
  • More generally, consider evaluating the reliability of your applications using guidance from the Azure Well-Architected Framework and its interactive Well-Architected Review: https://aka.ms/AzPIR/WAF
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability varied between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring
  • Finally, consider ensuring that the right people in your organization will be notified about any future service issues – by configuring Azure Service Health alerts. These can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/AzPIR/Alerts


How can we make our incident communications more useful?

You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/D_CP-JQ8

Posted Apr 01, 2025 - 00:31 CEST

Update

What happened?


Between 13:30 UTC on 13 March 2025 and 00:25 UTC on 18 March 2025, a platform issue resulted in an impact to the Application Insights service in the West Europe region. Customers may have experienced intermittent data gaps on custom metrics data and incorrect alert activation.


What do we know so far?


We identified that the issue was caused by a service deployment. A new version was deployed to a single cluster of the service in the West Europe region, introducing a change to the contract of a cache shared among all clusters in the region. This contract change was incompatible with the code running on the remaining clusters, leading to incorrect routing of custom metrics data.


How did we respond?


  • 13:30 UTC on 13 March 2025 – Customer impact began.
  • 16:11 UTC on 17 March 2025 – The issue was detected via a customer report, which prompted us to start our investigation.
  • 23:00 UTC on 17 March 2025 – We identified that the issue was caused by a recent deployment.
  • 23:01 UTC on 17 March 2025 – The deployment was triggered and expedited to ensure that all clusters were on the same version, mitigating the issue.
  • 00:25 UTC on 18 March 2025 – Service restored, and customer impact mitigated.

What happens next?


  • To request a Post Incident Review (PIR), impacted customers can use the “Request PIR” feature within Azure Service Health. (Note: We're in the process of transitioning from "Root Cause Analyses (RCAs)" to "Post Incident Reviews (PIRs)", so you may temporarily see both terms used interchangeably in the Azure portal and in Service Health alerts).
  • To get notified if a PIR is published, and/or to stay informed about future Azure service issues, make sure that you configure and maintain Azure Service Health alerts – these can trigger emails, SMS, push notifications, webhooks, and more: https://aka.ms/ash-alerts.
  • For more information on Post Incident Reviews, refer to https://aka.ms/AzurePIRs.
  • The impact times above represent the full incident duration, so are not specific to any individual customer. Actual impact to service availability may vary between customers and resources – for guidance on implementing monitoring to understand granular impact: https://aka.ms/AzPIR/Monitoring.
  • Finally, for broader guidance on preparing for cloud incidents, refer to https://aka.ms/incidentreadiness.

Posted Mar 18, 2025 - 02:25 CET

Update

Impact Statement: Starting at 13:30 UTC on 13 March 2025, you have been identified as a customer using Application Insights in West Europe who may have experienced intermittent data gaps on custom metrics data and incorrect alert activation.




This issue is now mitigated, and more information will be shared shortly.

Posted Mar 18, 2025 - 01:41 CET

Investigating

Impact Statement: Starting at 13:30 UTC on 13 March 2025, you have been identified as a customer using Application Insights in West Europe who may experience intermittent data gaps on custom metrics data and incorrect alert activation.


 


Current Status: This issue was raised to us by a customer report. Upon investigation, we determined that the bug was introduced as part of a deployment. Once we identified this, we assessed the possibility of rolling back. After further inspection, we determined that mitigation required applying the deployment to all clusters in the region, as the mismatch in deployment versions was causing this issue.


 


We are currently expediting the deployment to all remaining clusters in the region; this is expected to take approximately one hour to complete. We have paused the broader deployment to any remaining regions, and we will reassess our deployment plan after we mitigate this issue in West Europe.


 


The next update will be provided within 2 hours, or as events warrant.

Posted Mar 18, 2025 - 01:31 CET