On the 5th of November 2023, we were made aware of an incident with our Outlook Calendar for Confluence app. A customer notified us on Sunday, the 5th of November, that the calendar app could not be enabled in Confluence. We were able to reproduce this issue in our own environment. We immediately started investigating and found that, due to a change in the path of the translation files, which was originally made for our Jira apps, the app could no longer be enabled in Confluence. We released a fixed update on the 6th, hoping to re-enable the app for all customers. Unfortunately, once the faulty update had been rolled out to all customer instances, the app would not re-enable on its own, even after the fixed update was deployed. On Tuesday, the 7th, we sent out an email communication to all affected customers, explaining the need to perform a manual update of the app. In parallel, we worked with Atlassian Marketplace support to enable the app for all customers automatically. After about a week, on the 16th of November, we were able to resolve the issue fully for all customers.
Instructions | Report |
---|---|
⚠️ Leadup | We switched our apps' build process to use translation files from a different path. Unfortunately, while making and validating the change for our Jira apps, a cross-dependency on our Confluence app was not discovered, allowing the new translation paths to go live for the Confluence app as well. |
🙅♀️ Fault | Once the Atlassian Marketplace picked up the app update and started rolling it out to all customer instances automatically, the app became disabled for all customers. After we rolled out a second update on Monday, manually enabling the app again fixed the issue. Unfortunately, the second update did not re-enable the app automatically for the affected customers, so a manual action was necessary at first. |
🥏 Impact | The app was disabled for all customers, removing all UI entry points from Confluence and preventing users from accessing the app. |
👁 Detection | We learned about the issue a few hours after the update rolled out on the Marketplace on Nov. 5th. |
🙋♂️ Response | After we noticed the issue, we immediately began troubleshooting and located the problem in the translation paths section of our app's Atlassian Connect manifest file. Apps pointing to non-existent translations pass schema validation and also install, but fail to reach the “Enabled” state. Once an app is in the “Disabled” state, there is no way for a Marketplace vendor to get it back to enabled. |
🙆♀️ Recovery | Once we sent out responses to the support tickets and an email communication to all affected customers, we saw apps being manually re-enabled. After about a week, Atlassian confirmed that it had run a script which re-enabled the app for all affected customers, with the exception of a few instances which churned during that period. |
🔎 Root cause identification | The root cause was already identified during the development of the fix. A change to the build process for our Jira apps caused a path to a missing translation file to be introduced for our Confluence app. |
🤔 Lessons learned | We will use this incident to improve in the following area: improve our release process to validate a full install of the resulting Connect app manifest file, to prevent erroneous updates from being delivered to customer instances. This is especially important since the Atlassian Marketplace does not seem to validate all edge cases before installing the app in cloud instances. |
Time | What |
---|---|
2023-11-05 9:48 PM CET | First email received from customer notifying us about the issue |
2023-11-06 10:22 PM CET | Raised ticket with Atlassian, letting them know we cannot fix the issue on our own, due to a quirk in how the Marketplace installs updates in cloud instances |
2023-11-06 10:42 PM CET | PR merged with the fix and update released on the Atlassian Marketplace |
2023-11-07 3:00 PM CET | Sent out email communication to all affected customers, letting them know about the issue |
2023-11-16 3:30 PM CET | Atlassian notifies us that a script has been executed manually to enable the app again for all customers |
The following action items were created to prevent this class of incident in the future.
Problem | Action items |
---|---|
Relying on CI/CD alone was not sufficient to catch all issues with the Connect manifest. Schema validation and the Atlassian Marketplace do not catch all error cases, allowing erroneous app updates to ship to all customer instances | Introduce new pipeline checks that validate the Connect manifest install in a production environment before go-live |
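
As an illustration of what one such pipeline check could look like, here is a minimal sketch in Python. It is not our actual pipeline code. It assumes that the Connect manifest is a JSON file with a `translations.paths` object mapping locales to file paths, and that the app's static files are available in a local directory at build time. The function name `missing_translation_files` and the directory layout are hypothetical.

```python
import json
from pathlib import Path


def missing_translation_files(descriptor_path, static_root):
    """Return (locale, path) pairs whose translation file is missing.

    Assumption: the manifest looks like
    {"translations": {"paths": {"en-US": "/i18n/en_US.json"}}}
    and each path is served relative to `static_root`.
    """
    descriptor = json.loads(Path(descriptor_path).read_text(encoding="utf-8"))
    paths = descriptor.get("translations", {}).get("paths", {})

    missing = []
    for locale, rel_path in sorted(paths.items()):
        # Manifest paths start with "/", so strip it before joining
        # with the local directory that holds the built static files.
        candidate = Path(static_root) / rel_path.lstrip("/")
        if not candidate.is_file():
            missing.append((locale, rel_path))
    return missing
```

A build step could call this function on the generated manifest and fail the release if the returned list is not empty. A file-level check like this only covers the specific failure seen in this incident. It does not replace the full install validation described in the action item above.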