|Microsoft Teams for Jira - Smart Connect & Microsoft 365 for Jira (creating and viewing chats & conversations in Jira)
|2022/10/24 14:01 UTC
|55 minutes (resolved at 14:56 UTC)
|Incident response teams
🔮 Executive summary
On 24 October 2022, we were made aware of an incident in our Microsoft Teams for Jira app (also included as part of our Microsoft 365 for Jira app). We were able to reproduce the issue in our Jira production instance. Our investigation found a faulty backend commit that had introduced an unwanted change in production. We started rolling back to the last known good version right away; it took about 30 minutes until all affected app servers were back on the working version.
⛑ Postmortem report
|While troubleshooting an authentication issue that had been appearing in our log files for a few weeks, we added a log statement to give us more information about the error. Along with this change, we made a seemingly minor refactoring of a central piece of authentication code, which turned out to be faulty.
|Once the mentioned code was deployed to our fleet of application servers, Jira users trying to use our Teams functionality were presented with an error. Due to the change, our app servers classified all requests coming from the Teams app as unauthorized, resulting in an error shown for all Jira users, stating “The page has expired, please reload the page.”
|This issue impacted all our customers using the Teams app, which was completely unusable for about an hour. After ~15 minutes, the first support request was raised, followed by three other support requests in the following minutes.
|We only learned about the issue from the first incoming support request, since neither our static type checker nor our automated pre-deployment tests caught the issue.
|🙋♂️ Response
|Once the first support ticket came in, we started investigating right away. One team member quickly attributed the issue to a just-released change. We informed the first customer about this immediately and, once the recovery started, contacted all other customers with open support requests.
|We then started the rollback to the last known good version of our backend software. After a few minutes, the first restored app server brought back partial functionality, sometimes only after a page reload. Over the course of the next 30 minutes, all app servers were rolled back to the last known good version.
|🔎 Root cause identification
|A combination of human error and software issues resulted in this faulty change being deployed to production. The error did not come up in the dev environment. The code review for this change did not catch the error. The static type checker did not catch the error. The automated pre-deployment tests did not cover this specific area of code, letting the deployment continue.
|🤔 Lessons learned
|We will use this incident to improve in the following areas: improving and validating the static type checker's error detection; making sure changes to central points of failure (e.g. authorization-related code) are thoroughly tested in the staging environment and reviewed with utmost care; and adding better monitoring to catch this kind of production error before the first support request is even raised.
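The monitoring point could look roughly like the following sketch. It is not our actual monitoring setup; the class name, window size, and thresholds are illustrative assumptions. The idea is to track the share of unauthorized (HTTP 401) responses over the most recent requests and raise an alert when that share becomes unusually high, as it would have when every Teams request was rejected.

```python
from collections import deque


class UnauthorizedRateMonitor:
    """Hypothetical monitor: flags an unusually high share of 401 responses."""

    def __init__(self, window_size=200, threshold=0.5, min_samples=50):
        self.window = deque(maxlen=window_size)  # most recent status codes only
        self.threshold = threshold               # alert when the 401 share exceeds this
        self.min_samples = min_samples           # do not alert on very small samples

    def record(self, status_code):
        """Store the status code of one handled request."""
        self.window.append(status_code)

    def unauthorized_share(self):
        """Fraction of requests in the window that were answered with 401."""
        if not self.window:
            return 0.0
        return sum(1 for code in self.window if code == 401) / len(self.window)

    def should_alert(self):
        """True once enough samples exist and the 401 share is above the threshold."""
        return (len(self.window) >= self.min_samples
                and self.unauthorized_share() > self.threshold)
```

In such a setup, each app server would call `record()` for every response and page the on-call engineer when `should_alert()` returns true. The concrete thresholds would need tuning against normal traffic.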
⏱ Incident timeline
2022-10-24 13:49 UTC
|First app server is updated to the faulty version, no widespread outage yet
2022-10-24 14:01 UTC
|All app servers are updated to the faulty version, resulting in all customers having issues accessing the app
2022-10-24 14:17 UTC
|First support ticket documenting the incident is raised
2022-10-24 14:31 UTC
|Rollback to last known good version started
2022-10-24 14:33 UTC
|First app server is updated to the previous version, recovery for customers started
2022-10-24 14:56 UTC
|Last app server is updated to the previous version, recovery for all customers is completed
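The key intervals can be derived from the timestamps above. The following is a small sketch for that arithmetic; the dictionary keys and the helper name are illustrative labels, not part of any tooling we use.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps (UTC) copied from the incident timeline; the keys are illustrative.
timeline = {
    "first_server_faulty": "2022-10-24 13:49",
    "all_servers_faulty": "2022-10-24 14:01",
    "first_ticket": "2022-10-24 14:17",
    "rollback_started": "2022-10-24 14:31",
    "first_server_restored": "2022-10-24 14:33",
    "all_servers_restored": "2022-10-24 14:56",
}


def minutes_between(start, end):
    """Whole minutes between two named timeline events."""
    delta = (datetime.strptime(timeline[end], FMT)
             - datetime.strptime(timeline[start], FMT))
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between("all_servers_faulty", "first_ticket")           # 16
time_to_rollback_start = minutes_between("first_ticket", "rollback_started")     # 14
rollback_duration = minutes_between("rollback_started", "all_servers_restored")  # 25
full_outage = minutes_between("all_servers_faulty", "all_servers_restored")      # 55
```

The 55-minute full outage matches the incident duration stated at the top of this report. Detection took 16 minutes, and the rollback itself took 25 minutes from start to the last restored server.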
✅ Follow-up tasks
List the issues created to prevent this class of incident in the future.
|Static type checker did not catch the issue
|Validate and improve type checker correctness to avoid this type of issue in the future
|Missing automated test for central piece of code
|Implement tests for this part of the app's authentication logic to prevent regressions
|Rollback to working version could be faster
|Investigate whether we can reduce rollback time so that we can roll back to an earlier version more quickly
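As an illustration of the missing-test follow-up, a regression test for a central authorization check might look like the sketch below. The function `is_authorized`, the bearer-token scheme, and the token values are hypothetical stand-ins, not our actual authentication code. The point is that the test suite should include the positive case (a legitimate request is accepted), since that is the case that failed during this incident.

```python
# Hypothetical fixture: tokens the check should accept in this sketch.
VALID_TOKENS = {"teams-token-123"}


def is_authorized(headers):
    """Hypothetical stand-in for the app's central authorization check."""
    auth = headers.get("Authorization", "")
    scheme, _, token = auth.partition(" ")
    return scheme == "Bearer" and token in VALID_TOKENS


def test_valid_request_is_accepted():
    # The case that broke in the incident: legitimate requests must pass.
    assert is_authorized({"Authorization": "Bearer teams-token-123"})


def test_missing_header_is_rejected():
    assert not is_authorized({})


def test_unknown_token_is_rejected():
    assert not is_authorized({"Authorization": "Bearer not-a-real-token"})


def test_wrong_scheme_is_rejected():
    assert not is_authorized({"Authorization": "Basic teams-token-123"})
```

Test functions in this style can be collected by a runner such as pytest and run as part of the pre-deployment checks, so that a refactoring which rejects all requests fails the build instead of reaching production.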