Applying DevOps principles to a 7-year-old “legacy” app
How modern DevOps practices helped our team get better outcomes for our stakeholders, customers and ourselves.
Introduction
In everything I do, I am always looking for ways to automate a process as soon as it becomes repetitive. Be it watering the garden, setting music to play as my alarm or, in my professional life, helping teams improve their software development process and lifecycle. One of the highlights of my time at Qoria (née Family Zone) was introducing some DevOps practices I had learned during my career to help a team that was struggling to release often and without fear of impacting customers. By adopting some good software engineering practices and creating team standards, we dramatically improved our ability to keep our stakeholders and customers happy. This had a significant impact on our DORA metrics, helping us improve our deployment frequency, lead time for changes, mean time to recovery, and change failure rate.
The Challenge
When I initially joined the team, we were struggling to keep up with the demands of our customers, both internal and external. We were releasing new features and bug fixes less frequently than we would have liked, and our change failure rate was higher than we were comfortable with.
The team was doing “trunk-based development”, but really it was just a glorified version of git flow. Each week, working with the product owner (PO) and quality assurance (QA), the team selected the tickets to be released. These tickets were then merged into the trunk and deployed to a testing environment, where the PO and QA could assess the validity of each change and decide whether to promote the ticket through the release process if the change was deemed “successful”.
Realistically, only three or four out of every ten tickets passed validation, which meant the unsuccessful work had to be rolled back and the successful tickets rechecked once the failing changes had been removed. Because a number of changes were going in together, it was hard to determine which change was the faulty one and whether it had knock-on effects on other tickets.
I won’t even try to quantify the amount of time lost to this process, but suffice it to say it was a significant portion of the team’s effort in each fortnightly sprint.
The Approach
After joining the team in May, I became the squad lead, as I felt I could help by providing technical guidance and uplifting my teammates’ skills with the experience I had gained over my career. Using a combination of my interest in DevOps and a solid understanding of developer tools, processes and what was useful to the team, we set to work to bring about change to our ways of working.
The first thing we decided to do was to introduce unit testing to our workflows and automation to our code repository, as a starting point to help the team get a better understanding of the code we were creating and the features we were shipping. Adding testing to the development process gave developers a more focused context on what was being built and meant more code paths were considered during development. Compared to writing some code, saving the file and refreshing the browser, developers now think more critically about the features they’re writing, spelling out the happy and unhappy paths in their tests and observing how the application responds to the different scenarios.
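To make that concrete, here is a minimal sketch of the style of test we started writing. It assumes Jest; the helper function and its names are hypothetical stand-ins rather than anything from our real codebase, but the shape, one happy path plus the unhappy paths, is what we aimed for on every ticket.

```typescript
// parse-duration.test.ts: illustrative only; names are made up for this post.
import { describe, expect, it } from '@jest/globals';

// A small helper of the kind these tests were written against.
function parseDurationMinutes(input: string): number {
  const minutes = Number(input);
  if (!Number.isFinite(minutes) || minutes < 0) {
    throw new Error(`Invalid duration: "${input}"`);
  }
  return Math.floor(minutes);
}

describe('parseDurationMinutes', () => {
  it('accepts a plain number of minutes (happy path)', () => {
    expect(parseDurationMinutes('90')).toBe(90);
  });

  it('rounds fractional minutes down', () => {
    expect(parseDurationMinutes('90.5')).toBe(90);
  });

  it('rejects negative and non-numeric input (unhappy paths)', () => {
    expect(() => parseDurationMinutes('-5')).toThrow('Invalid duration');
    expect(() => parseDurationMinutes('soon')).toThrow('Invalid duration');
  });
});
```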
The second step was to introduce status checks at the merge request (pull request) stage, which helps rule out the “works on my machine” problem for both builds and tests. At the same time, we also decided to adopt the Three Musketeers approach, using Docker containers to develop, build and test the application the same way on local and remote machines. Running these checks before code is merged helps us keep the main branch (or trunk) healthy, which is the number one priority.
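The Three Musketeers pattern is normally driven by Make, Docker and Docker Compose; the sketch below expresses the same idea as a TypeScript script purely for illustration. The service name, npm scripts and commands are assumptions, not our actual setup; the point is that the exact same commands run on a developer’s laptop and in the CI status check.

```typescript
// checks.ts: illustrative sketch of "same commands everywhere".
// Run locally with `npx tsx checks.ts` or call it from the CI pipeline.
import { execSync } from 'node:child_process';

// Each step runs inside the same Docker image locally and in CI,
// so a passing check on a laptop means a passing check on the build agent.
const steps: Array<[name: string, command: string]> = [
  ['install', 'docker compose run --rm app npm ci'],
  ['lint', 'docker compose run --rm app npm run lint'],
  ['test', 'docker compose run --rm app npm test'],
  ['build', 'docker compose run --rm app npm run build'],
];

for (const [name, command] of steps) {
  console.log(`--- ${name}: ${command}`);
  // stdio: 'inherit' streams output; a non-zero exit code throws and fails the check.
  execSync(command, { stdio: 'inherit' });
}
```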
One of the other benefits our team gained, which isn’t strictly part of the deployment process but rather a development tool, was the use of TypeScript (TS) in an application originally written in plain JS. As a team, we decided that new components would be written in TS and that any time an existing component was touched, it too would be converted (where it made sense). Save for some scenarios involving 1000+ line files (the horror!), it was generally agreed that this would form part of the ticket’s work. This aggressive strategy meant we got type-safe code for exactly the components undergoing change, so that when developers were working on similar parts of the application at the same time, conflicting changes surfaced as compile errors during the build process at the CI check.
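As a rough illustration (the module and field names here are invented for this post, and what would be separate files is collapsed into one snippet), this is the kind of protection the conversion buys: once a shared shape is typed, a conflicting change to it in one branch becomes a tsc error in the CI build for any converted component that consumes it, rather than an undefined value in the browser.

```typescript
// In the real repo these would live in separate modules; they are combined here
// for readability. All names are hypothetical.

// The shared shape another developer might be changing in their own branch.
export interface UserPreferences {
  theme: 'light' | 'dark';
  locale: string;
}

// A converted component (framework details omitted) that consumes the shared type.
// If someone renames `locale` to `language` in a parallel change, this function no
// longer compiles, and the CI status check fails before the two changes can meet
// in a release, instead of silently rendering "undefined" for customers.
export function formatPreferencesSummary(prefs: UserPreferences): string {
  return `${prefs.theme} theme, ${prefs.locale}`;
}
```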
You’ll remember I mentioned earlier that multiple changes were going in at once and we weren’t sure which one was breaking the application. This compile check, run on every build at the merge request stage, put a stop to that problem pretty quickly. And even when there were no build errors, a conflicting change was usually caught in the unit tests. The first time this happened was far sooner than I expected; I think it was in the second week after implementing the checks. A change was made to an internal component-library tool that is imported into the app discussed in this post, and at the status checks one of the tests failed for a seemingly unrelated component. That proved the value of unit testing our UI interactions in a very short time: the change would otherwise have broken a feature, causing a regression for our customers that went undetected until a complaint came in.
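For readers who haven’t written this kind of test, below is a sketch of the sort of UI interaction test that catches that class of regression. It assumes React, Jest and Testing Library, and the component and label are made-up stand-ins for our internal component library; the point is that if an upstream change stops an interaction working, the assertion fails at the pull request rather than in production.

```typescript
// toggle-row.test.tsx: illustrative only; assumes React, Jest and Testing Library.
import { expect, test } from '@jest/globals';
import { fireEvent, render, screen } from '@testing-library/react';
import React, { useState } from 'react';

// Stand-in for a component that wraps a control from the shared component library.
function ToggleRow({ label }: { label: string }) {
  const [enabled, setEnabled] = useState(false);
  return (
    <label>
      {label}
      <input
        type="checkbox"
        checked={enabled}
        onChange={() => setEnabled((value) => !value)}
      />
    </label>
  );
}

test('clicking the control toggles it on', () => {
  render(<ToggleRow label="Enable notifications" />);
  const checkbox = screen.getByRole('checkbox', {
    name: 'Enable notifications',
  }) as HTMLInputElement;

  // If an upstream component-library change stops the change event propagating,
  // this fails at the merge request status check instead of in front of customers.
  fireEvent.click(checkbox);
  expect(checkbox.checked).toBe(true);
});
```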
We also implemented a continuous delivery (CD) pipeline, which automatically deploys our code to the staging environment on every merge to the trunk. A longer-term goal is a system that runs integration and/or end-to-end tests against a live environment to validate each change and, should those checks pass, promotes it into production automatically without intervention. There are a number of complexities around multi-service dependencies and the need for feature flags, which I won’t go into here, but suffice it to say it is in the plan for DevOps nirvana. For now we have stage gates and a great QA process, but automated regression tests in a live environment remain the goal!
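As a sketch of what that promotion gate might eventually look like, here is the kind of shallow end-to-end check we have in mind, written with Playwright. The URL, route and assertion are placeholders, not our real environment; this is an aspiration rather than something running today.

```typescript
// smoke.e2e.ts: a placeholder sketch of a post-deploy check, not a real pipeline step.
import { expect, test } from '@playwright/test';

// The staging URL would be injected by the CD pipeline after the deploy completes.
const BASE_URL = process.env.STAGING_URL ?? 'https://staging.example.com';

test('freshly deployed build serves the dashboard', async ({ page }) => {
  await page.goto(`${BASE_URL}/dashboard`);

  // Deliberately shallow: we only want to know the deployed build boots and
  // renders its main heading before the pipeline promotes it to production.
  await expect(page.getByRole('heading', { level: 1 })).toBeVisible();
});
```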
The Results
So what has happened since we introduced testing and automation to our code repository? Let’s have a look at some fancy charts.
As the charts below show, our lead time to change (LTTC) dropped drastically and our deployment rate increased significantly. At the time, our team was undergoing significant change: we lost half of the developers in the team, who weren’t replaced; many different QAs came and went, along with three product owners and three engineering managers; and a huge amount of extra work came in from other teams needing review, approval and deployment. On top of all this, the fairly strict TypeScript enforcement and mandated unit testing we introduced across the repo was a big learning curve for many, so I felt this was a big accomplishment.
Lead Time to Change (LTTC)
Change Rate
The first chart shows the LTTC, along with the batch size. You can see that the team’s lead time to change ballooned out to 17.4 days at times (as an average!), but with a lot of work and effort we managed to drop this by 83% to a low of 3.07 days in early December. Our batch size is also 94% smaller: what used to be an average of 33 is now 2.1 (not a typo!). This means we are releasing a smaller number of changes per release and getting them out much, much faster.
The second graph speaks absolute volumes to me about the capability of the team. When I joined in very late May, the team was deploying twice a week on average. When we brought in automated testing and TS, as well as a bunch of CI status checks on the codebase, there was some initial learning about how to work with TS and the compiler and how to write effective unit tests. But as you can see, the deployments per day have tripled; we peaked at 9 deployments in a single day, and the average is now about 7 per week.
Finally, I want to share a small piece about automation and testing.
In the above image, you can see the several automated checks that run when a change is made. In this case, it’s a pull request created automatically by Dependabot, raised because merging it will resolve a critical vulnerability in the application. With this automation in place, along with tests that verify application behaviour, we can be more confident that bringing in the change will not have any adverse effects on our application or the customers using it.
Additional Success
In addition to the metrics I’ve discussed above, the team also saw specific successes in the following areas. We have been able to:
Reduce the number of bugs that are released to production.
Improve the quality of our code.
Reduce the amount of time that devs and QAs spend on testing.
Increase the confidence of our developers and stakeholders in the quality of our software.
Reduce the number of failed QA cycles and the number of regressions.
Recommendations
Every single piece of literature I can find about software engineering problems and how they were solved revolves around automation and the ability to validate a change through an automated process, i.e. testing. If you are considering introducing testing and automation to your code repository, I will add to the choir of other voices and highly recommend it. It is one of the most important things you can do to improve your software development process and deliver better software to your customers. However, it’s not just about delivering better software; it’s also about making sure the customers you already have remain happy and continue to use your product. Happy customers are the best source of new revenue, help build a strong brand and, in many cases, will advocate for you.
Here are a few tips for getting started:
Start by implementing a CI process. This will help you catch bugs early and prevent them from being merged into the trunk and released to production. Bugs merged to the trunk, which other developers then pull down when branching off, make the problem worse, as fixing them can become a headache.
Start writing tests and make sure they run at every step of the release process, from merge time to deploy time.
Start small and focus on the most important areas of your codebase. You don't need to automate all of your tests at once.
It’s important that this decision comes from the team, so get buy-in from everyone. The key to success is having the team understand the benefits of testing and automation and be willing to use these tools.
Conclusion
These improvements have allowed us to deliver software to our customers more quickly and reliably. We have also been able to reduce the number of failed changes (where a change is released that did not work correctly in all scenarios) and the impact of outages when they do occur.
Introducing testing, language tooling and automation to our code repository was one of the best decisions we've ever made. It has helped us to improve our software development process and deliver better software to our customers. It takes time and effort to implement and maintain a good testing and automation framework. The engineering cost may seem high early on, but as the project matures it will easily pay for itself thousands of times over.
Additionally, responding to change requests from product teams and customers is simple and quick. Just as importantly, when a dependency of your application is identified as vulnerable, a fix can be raised and validated by the automation already in place, so patching security vulnerabilities can take minutes, not weeks or months (or longer!).
Thanks for reading this longer form post, I hope you got something out of it. I would love any feedback you may have or stories of your own you’d like to share either via the comments section or by contacting me on socials if preferred.