On 27 November 2022, we were made aware of an incident in our backend infrastructure. An Atlassian Cloud Fortified check notified us that new customers were not able to install our app correctly. We were able to reproduce the issue in our own environment. We immediately started investigating and found that our main Redis instance, which handles asynchronous job processing and caching, had run out of available memory, so all calls to Redis were failing. Once the issue was located, we started a task to temporarily increase the Redis memory (from 3GB to 60GB) to stabilize the backend and investigate the root cause in more detail. Around 30 minutes after the increase, our services were back online.
We switched one of our apps (To Do for Jira) to our new backend infrastructure, as it was still relying on an older version of our API. In the process, we also switched the event processing (Jira webhooks) over to the new backend. During the code review, we failed to notice that our Redis queueing code was missing the flag that removes completed jobs from Redis.
Once this code was deployed to our application servers, it slowly increased the memory usage of the Redis instance over a period of three weeks. At 100% memory usage, the Redis instance started returning errors, which worsened the issue further and eventually exceeded even the overprovisioning AWS has in place.
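The fix is to make job cleanup the default rather than an opt-in flag. A minimal sketch of that "safe by default" behaviour (the wrapper, option names, and queue representation below are our own illustration, not our actual queueing library):

```python
# Hypothetical wrapper around a Redis-backed job queue. The lesson from the
# incident: removing completed jobs must be the default, so a forgotten flag
# cannot silently accumulate finished jobs in Redis for weeks.

DEFAULT_JOB_OPTIONS = {
    "remove_on_complete": True,   # delete completed jobs from Redis
    "remove_on_fail": False,      # keep failed jobs around for debugging
}

def enqueue(queue, name, payload, **overrides):
    """Enqueue a job with safe defaults; callers must opt out explicitly."""
    options = {**DEFAULT_JOB_OPTIONS, **overrides}
    job = {"name": name, "payload": payload, "options": options}
    queue.append(job)  # stand-in for the real Redis call
    return job
```

With this shape, a service that forgets to pass any options still gets cleanup behaviour, and keeping completed jobs becomes a deliberate, reviewable decision.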
This issue impacted all customers using our Teams app, which was completely unusable during the incident, as it relies on the availability of the Redis instance. In addition, other services, such as webhook processing and starting new trials / installing our app, were impacted.
We only learned about the issue from the Atlassian Cloud Fortified monitoring service, which alerted us to the failing installations of new apps.
On Sunday morning, after the second alert from Atlassian was received, we started investigating the issue. One team member quickly attributed it to an out-of-memory condition in Redis. Our logs clearly indicated the error, so thankfully we were able to locate it in under 30 minutes and start the recovery process.
Once the issue was located, we started a task to temporarily increase the Redis memory (from 3GB to 60GB) to get the backend service back up. After ~20 minutes, AWS had provisioned the resized instance, which restored our API availability.
Once recovery was confirmed, we started looking into the root cause of the issue. We identified it as related to a rework we did a few weeks earlier, which ported our To Do app functionality to our new backend stack. During that work, we failed to notice that a crucial flag was missing: the flag that removes processed jobs from Redis once they have been handled. Over the course of three weeks, the Redis cluster's memory usage slowly grew until it reached 100% and the instance started failing with out-of-memory errors. A few places in our backend services did not handle this failure correctly and stopped working completely (webhook processing, installation/uninstallation of apps, and the entire render function for our Teams app).
We will use this incident as an opportunity to improve in the following areas:
|a) Improve the defaults for new services relying on the Redis instance, namely removing completed jobs from Redis by default
|b) Introduce new monitoring for the memory usage of the Redis cluster, so we are notified of increasing memory usage in time
|c) Introduce a shared piece of code that offers a failure-resistant way of accessing Redis as a cache, so services do not fail because of a missing cache
|d) Measure the daily throughput of Redis and provision an appropriately sized instance, so we have at least 48 hours of leeway between being notified about a memory issue and it impacting production services
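The sizing rule in item d) is simple arithmetic. As a hypothetical sketch (the function name and parameters are ours, for illustration), the headroom an instance needs for a given throughput and reaction window could be computed as:

```python
def required_memory_gb(daily_throughput_gb: float,
                       leeway_hours: float,
                       baseline_gb: float = 0.0) -> float:
    """Memory the instance needs so that, if job cleanup fails entirely,
    it takes at least `leeway_hours` to reach 100% usage.

    baseline_gb is the steady-state memory used by caching and in-flight
    jobs; the throughput term is the headroom consumed by uncleaned jobs.
    """
    return baseline_gb + daily_throughput_gb * (leeway_hours / 24.0)
```

With roughly 1GB/day of queue throughput, a 48-hour reaction window requires at least 2GB of headroom on top of the baseline; a 3GB instance running near its baseline leaves far less time than that.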
2022-11-26 17:01 UTC
|First email received from Atlassian monitoring that our target SLA of 99% successful new installations went down to 90%
2022-11-27 00:02 UTC
|Second email received from Atlassian monitoring that our target SLA of 99% successful new installations went below 10%
2022-11-27 07:18 UTC
|On-call engineer noticed the issue and started investigating the failing trials and the non-functioning Teams app
2022-11-27 07:38 UTC
|The issue was detected to be an out-of-memory issue in the main Redis instance which is used for caching and asynchronous jobs. A manual resize of the Redis cluster was triggered to restore functionality quickly
2022-11-27 07:58 UTC
|Redis cluster was resized and accepting connections again, which immediately restored full functionality across our services
Issues created to prevent this class of incident in the future:
|Monitoring of Redis cluster was insufficient
|Introduce new monitoring to alert the responsible colleagues ahead of time, once the Redis memory usage passes a certain threshold (planned: 20%, 50%, 90%)
|Some of our services are too reliant on the uptime of the Redis cache instance, even though it’s not critical for the service
|Introduce circuit-breakers to avoid reliance on Redis cache availability, in cases where we can recover without Redis being available
|Define SLAs on how much leeway/buffer memory the Redis instance should have in case of a processing failure. E.g., if we regularly expect around 1GB/day of throughput in Redis, size the instance accordingly so we have at least 48-72 hours to react to processing issues
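The circuit-breaker item above can be sketched as follows. This is a minimal illustration, not our actual implementation: the class and parameter names are hypothetical, and `cache_get` stands in for the real Redis client call. After a number of consecutive cache failures, the breaker skips Redis for a cooldown period, so a Redis outage degrades to cache misses instead of hard errors.

```python
import time

class CacheCircuitBreaker:
    """Wraps cache reads so cache outages degrade to misses, not errors."""

    def __init__(self, cache_get, max_failures=3, reset_after=30.0,
                 clock=time.monotonic):
        self._cache_get = cache_get      # e.g. a Redis GET call
        self._max_failures = max_failures
        self._reset_after = reset_after  # seconds the breaker stays open
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def get(self, key, fallback):
        """Return the cached value, or `fallback()` if the cache is down."""
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                return fallback()        # breaker open: skip the cache
            self._opened_at = None       # cooldown over: try the cache again
            self._failures = 0
        try:
            value = self._cache_get(key)
            self._failures = 0
            return value if value is not None else fallback()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()
            return fallback()            # degrade to a cache miss
```

Had the Teams app render path used a wrapper like this, the Redis outage would have cost cache hit rates rather than taking the app down entirely.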