Microsoft Teams for Jira - Smart Connect & Microsoft 365 for Jira (creating and viewing chats & conversations in Jira)
Incident date
2022/10/24 14:01 UTC
Incident duration
55 minutes (resolved at 14:56 UTC)
Incident response teams
Development support
Incident responders
Tobias Viehweger
🔮 Executive summary
On 24 October 2022, we were made aware of an incident in our Microsoft Teams for Jira app (also included as part of our Microsoft 365 for Jira app). We reproduced the issue in our production Jira instance, started investigating right away, and traced it to a faulty backend commit that had introduced an unwanted change in production. We then rolled back to the last known good version, which took about 30 minutes until all affected app servers were back on the working version.
⛑ Postmortem report
⚠️ Leadup
While troubleshooting an authentication issue that had been appearing in our log files for a few weeks, we added logging that would give us more information about the error. As part of this change, we made a seemingly minor refactoring to a central piece of authentication code, and that refactoring turned out to be faulty.
🙅♀️ Fault
Once this code was deployed to our fleet of application servers, Jira users trying to use our Teams functionality were presented with an error. Because of the change, our app servers classified all requests coming from the Teams app as unauthorized, so every Jira user saw the message “The page has expired, please reload the page.”
🥏 Impact
This issue impacted all our customers using the Teams app, which was completely unusable for about an hour. After ~15 minutes, the first support request was raised, followed by three other support requests in the following minutes.
👁 Detection
We only learned about the issue from the first incoming support request, since neither our static type checker nor our automated pre-deployment tests caught the issue.
🙋♂️ Response
Once the first support ticket came in, we started investigating right away. One team member quickly attributed the issue to a just-released change. We notified the first customer about this without delay and, once the recovery had started, contacted all other customers with open support requests.
🙆♀️ Recovery
We started the rollback to the last known good version of our backend software. After a few minutes, the first app server was restored, which partially restored functionality for users, sometimes only after a page reload. Over the course of the next 30 minutes, all app servers were rolled back to the last known good version.
🔎 Root cause identification
A combination of human error and software issues resulted in this faulty change being deployed to production. The error did not come up in the dev environment. The code review for this change did not catch the error. The static type checker did not catch the error. The automated pre-deployment tests did not cover this specific area of code, so the deployment was allowed to continue.
🤔 Lessons learned
We will use this incident to improve in the following areas: Improve and validate the error detection of the static type checker. Make sure changes to central points of failure (e.g. authorization-related code) are thoroughly tested in the staging environment and reviewed with the utmost care. Improve monitoring so that this kind of production error is caught before the first support request is raised.
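As an illustration of the monitoring goal, the following is a minimal sketch of an alert that fires when a large share of recent responses are unauthorized. It is not our actual monitoring setup: the class name, the window size, the thresholds, and the assumption that unauthorized requests surface as HTTP 401 responses are all hypothetical.

```python
from collections import deque


class UnauthorizedRateMonitor:
    """Tracks the share of 401 responses among the most recent requests.

    Hypothetical sketch: names and default values are assumptions.
    """

    def __init__(self, window_size=200, threshold=0.5, min_samples=50):
        # Only the last `window_size` status codes are kept.
        self.statuses = deque(maxlen=window_size)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, status_code):
        """Record the HTTP status code of one handled request."""
        self.statuses.append(status_code)

    def unauthorized_ratio(self):
        """Return the fraction of recorded responses that were 401."""
        if not self.statuses:
            return 0.0
        unauthorized = sum(1 for status in self.statuses if status == 401)
        return unauthorized / len(self.statuses)

    def should_alert(self):
        """Alert only once enough samples exist and the ratio is high."""
        enough_samples = len(self.statuses) >= self.min_samples
        return enough_samples and self.unauthorized_ratio() >= self.threshold
```

In a failure like this one, where every request is rejected, the ratio would reach 1.0 as soon as the window fills with requests handled by a faulty server, so an alert of this kind could fire well before a customer notices.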
⏱ Incident timeline
Time | What
2022-10-24 13:49 UTC | First app server is updated to the faulty version, no widespread outage yet
2022-10-24 14:01 UTC | All app servers are updated to the faulty version, resulting in all customers having issues accessing the app
2022-10-24 14:17 UTC | First support ticket documenting the incident is raised
2022-10-24 14:31 UTC | Rollback to last known good version started
2022-10-24 14:33 UTC | First app server is updated to the previous version, recovery for customers started
2022-10-24 14:56 UTC | Last app server is updated to the previous version, recovery for all customers is completed
✅ Follow-up tasks
Problem | Action items
Static type checker did not catch the issue | Validate and improve type checker correctness to avoid this type of issue in the future
Missing automated test for central piece of code | Implement tests for this part of the app's authentication logic to prevent regressions
Rollback to working version could be faster | Investigate whether we can shorten the time needed to roll back to an earlier version
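To illustrate the kind of regression test the second action item refers to, here is a minimal sketch. The app's real authentication scheme is not described in this report, so an HMAC signature check stands in for it; `SHARED_SECRET`, `sign`, and `is_authorized` are hypothetical names. The important part is the first test: a check that a valid request is accepted would fail if a refactoring caused all requests to be classified as unauthorized.

```python
import hashlib
import hmac

# Hypothetical secret, used for illustration only.
SHARED_SECRET = b"example-secret"


def sign(payload: bytes, secret: bytes = SHARED_SECRET) -> str:
    """Return the hex-encoded HMAC-SHA256 signature of the payload."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def is_authorized(payload: bytes, signature: str,
                  secret: bytes = SHARED_SECRET) -> bool:
    """Stand-in for the real check: accept only a matching signature."""
    expected = sign(payload, secret)
    # compare_digest avoids leaking timing information.
    return hmac.compare_digest(expected, signature)


def test_valid_request_is_accepted():
    payload = b'{"issue": "DEMO-1"}'
    assert is_authorized(payload, sign(payload))


def test_tampered_request_is_rejected():
    payload = b'{"issue": "DEMO-1"}'
    signature = sign(payload)
    assert not is_authorized(b'{"issue": "DEMO-2"}', signature)


def test_wrong_secret_is_rejected():
    payload = b'{"issue": "DEMO-1"}'
    assert not is_authorized(payload, sign(payload, b"other-secret"))
```

Tests of this shape can run with a runner such as pytest as part of the pre-deployment checks. Covering both the accepting and the rejecting path matters, because a change that rejects everything would still pass the negative tests.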
Posted Oct 24, 2022 - 22:47 CEST
Resolved
Using the Teams features in Jira currently only shows “The page has expired, please reload the page.”