What happened?
Between 13:00 UTC on 10 March 2025 and 00:25 UTC on 18 March 2025, a platform issue resulted in an impact to the Application Insights service in the West Europe region. Customers may have experienced intermittent data gaps on custom metrics data and/or incorrect alert activation.
What went wrong and why?
Application Insights Ingestion is the service that handles the ingestion and routing of Application Insights data from customers. One of its internal components is a cache that stores information about the customer's Application Insights resource configuration. This cache is deployed at a region level, so it is shared by multiple clusters in a region. During a deployment, some regions deploy to one cluster first, then wait until the next business day before deploying to the remaining clusters. Feature work was underway to add a new flag to the Application Insights resource configuration stored in the cache. The flag was designed to default to true, in which case it would not change the behavior of Application Insights Ingestion. If the flag was set to false, however, the service would stop sending custom metrics data to the Log Analytics workspace.
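For illustration only, the following is a minimal sketch of how such a cached resource configuration and routing decision could look. All names (ResourceConfig, send_custom_metrics_to_law, route_custom_metrics) are hypothetical and are not the actual service contract; the point is that the flag defaults to true, so only an explicit false disables routing of custom metrics to Log Analytics.

    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class ResourceConfig:
        # Hypothetical cache entry for an Application Insights resource.
        resource_id: str
        # New flag: defaults to true, so existing behavior is unchanged
        # unless the flag is explicitly set to false.
        send_custom_metrics_to_law: bool = True

    def route_custom_metrics(config: ResourceConfig) -> bool:
        # Custom metrics are forwarded to the Log Analytics workspace
        # only while the flag is true.
        return config.send_custom_metrics_to_law

    entry = ResourceConfig(resource_id="example-resource")
    print(json.dumps(asdict(entry)))    # flag serialized as true by default
    print(route_custom_metrics(entry))  # True -> custom metrics keep flowing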
A recent incident in a separate cloud was caused by this flag becoming incorrectly set to false. In response, it was decided that the flag's meaning should be inverted, so that a default of false would result in no-op behavior instead. As part of this change, the original flag was removed from the contract used to serialize cache entries. The change was then deployed, starting with the first cluster and waiting until the next business day to deploy to the remaining clusters. During this time, the first cluster began serializing new cache entries that were missing a value for the original (default true) flag. As a result, the remaining clusters (still running the old deployment) read cache entries with this flag set to false, and therefore stopped routing custom metrics data to Log Analytics. Once the deployment completed in a region, the impact would resolve, as all clusters would then be running the new code with the correct default value for the flag.
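A minimal sketch of the resulting contract mismatch, under the assumption that cache entries are JSON-like documents and that the old reader materializes a missing boolean as false (the language default) rather than the intended default of true. Function and field names are hypothetical.

    import json

    # New contract (deployed to the first cluster): the original flag has
    # been removed, so serialized cache entries no longer contain it.
    def serialize_new(resource_id: str) -> str:
        return json.dumps({"resource_id": resource_id})

    # Old contract (still running on the remaining clusters): a missing
    # flag is read back as False instead of the intended default of True.
    def deserialize_old(payload: str) -> dict:
        data = json.loads(payload)
        data.setdefault("send_custom_metrics_to_law", False)  # missing -> False
        return data

    entry = deserialize_old(serialize_new("example-resource"))
    print(entry["send_custom_metrics_to_law"])  # False -> routing to Log Analytics stops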
There was no monitoring of data volume drops by data type, and no new monitor was added for the flag's operation, since the flag being active was considered normal behavior. As a result, the deployment proceeded to an additional region before the issue was detected. The incident persisted for around 24 hours in the South Central US region before the deployment completed there. Because the issue was not detected by automated monitoring, the deployment then proceeded to the West Europe region, where it was applied to the first cluster. As this happened just before a weekend, the impact persisted for several days before the deployment finished. Eventually, several customers raised support tickets after noticing that their custom metrics data was missing. During this incident, the flag was incorrectly read as false, causing the ingestion service to stop routing custom metrics data to Log Analytics.
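As an illustration of the kind of detection that was missing, a per-data-type volume monitor could compare current ingestion volume against a recent baseline and alert on a sharp drop for any single data type, such as custom metrics. This is a hedged sketch with hypothetical names and thresholds, not a description of the monitoring that has since been implemented.

    # Hypothetical per-data-type volume-drop check (names and thresholds
    # are illustrative only).
    DROP_THRESHOLD = 0.5  # alert if volume falls below 50% of baseline

    def check_volume_drop(current: dict[str, float], baseline: dict[str, float]) -> list[str]:
        """Return the data types whose ingestion volume dropped sharply."""
        alerts = []
        for data_type, baseline_volume in baseline.items():
            observed = current.get(data_type, 0.0)
            if baseline_volume > 0 and observed < baseline_volume * DROP_THRESHOLD:
                alerts.append(data_type)
        return alerts

    # Example: custom metrics volume collapses while other types are steady.
    baseline = {"customMetrics": 1_000_000, "requests": 5_000_000}
    current = {"customMetrics": 20_000, "requests": 4_900_000}
    print(check_volume_drop(current, baseline))  # ['customMetrics']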
How did we respond?
How are we making incidents like this less likely or less impactful?
How can customers make incidents like this less impactful?
How can we make our incident communications more useful?
You can rate this PIR and provide any feedback using our quick 3-question survey: https://aka.ms/AzPIR/D_CP-JQ8
What happened?
Between 13:30 UTC on 13 March 2025 and 00:25 UTC on 18 March 2025, a platform issue resulted in an impact to the Application Insights service in the West Europe region. Customers may have experienced intermittent data gaps on custom metrics data and incorrect alert activation.
What do we know so far?
We identified that the issue was caused by a service deployment. A new version was deployed to a single cluster of the service in the West Europe region, introducing a change to the contract of a cache shared among all clusters in the region. This contract change was incompatible with the code running on the remaining clusters, leading to incorrect routing of custom metrics data.
How did we respond?
What happens next?
Impact Statement: Starting at 13:30 UTC on 13 March 2025, you have been identified as a customer using Application Insights in West Europe who may have experienced intermittent data gaps on custom metrics data and incorrect alert activation.
This issue is now mitigated, and more information will be shared shortly.
Impact Statement: Starting at 13:30 UTC on 13 March 2025, you have been identified as a customer using Application Insights in West Europe who may experience intermittent data gaps on custom metrics data and incorrect alert activation.
Current Status: This issue was raised to us through a customer report. Upon investigation, we determined that this bug was introduced as part of a deployment. Once we identified this, we assessed the possibility of rolling back. After further inspection, we determined that mitigation would require applying the deployment to all clusters in the region, as the mismatch in deployment versions was causing this issue.
We are currently expediting the deployment to all remaining clusters in the region, which is expected to take approximately one hour to complete. We have paused the broader deployment to any remaining regions, and we will reassess our deployment plan after we mitigate this issue in West Europe.
The next update will be provided within 2 hours, or as events warrant.