- Don't overlook, trivialize, or concede the "small stuff".
- Acceptance of new technical debt signals tolerance of lower quality standards.
- Culture of excellence begins to erode.
- Future projects will multiply the debt further.
- Find the proper balance between high standards and perfectionism.
What is the Broken Windows Theory?
The Broken Windows Theory contends that visible signs of disorder and crime (such as broken windows & excess litter) signal that an area is not actively maintained and that crime will go undetected. Disorder increases. Locals become more likely to stay inside out of fear that the community is unsafe. Disorder increases further (public intoxication rises, gangs form). This creates a positive feedback loop where disorder and crime run rampant.
The crux of the theory is that although the broken windows themselves may be harmless, they send a signal that law and order is not maintained or enforced.
To translate the theory into a software development team setting: the criminals are your team (including yourself) and their crime is contributing to various degrees of technical debt & fueling bad habits.
A Broken Window Appears
Many years ago, my team had an initiative to switch our version control system from Perforce to Git. The switch was done quickly and had some initial tooling and process gaps.
With Perforce, we had custom tooling that enforced completed code reviews. Thanks to the formality of the process, the code was often clean, well designed, well documented, and backed by sufficient test coverage. Despite the formality, getting code reviewed and merged was quick, easy, and highly regarded by the team.
The initial Git setup had a trivial "2 reviewers" merge policy, as opposed to the custom tooling we had with Perforce. Occasionally pull requests would be merged with "rubber stamp" reviews from peers. And thus, our first window became broken.
The butterfly effect
To avoid making a mountain out of a molehill with the "rubber-stamped" pull requests, we would let these rare instances slide, especially since the author usually had good intentions (e.g., the reviewing SME was on vacation). We rationalized that the code review could always occur post-merge and any issues corrected in a subsequent PR. For major post-merge issues, this assumption held true. For minor issues (such as insufficient documentation), any updates that missed the original pull request were often left incomplete.
Over time, the frequency of rubber-stamped PRs grew. The quality standards for commits slowly became more relaxed as well. There were no consequences for opening (and merging) a pull request without documentation. Developers eventually caught on and started to omit unit tests as well.
Developers new to the team would see code being merged without any documentation or tests, and followed the example being set. Before long, code was getting checked in that hadn't even been tested locally. Customers were reporting bugs that we hadn't caught. More serious "crime" was now taking place, and the team was numb to it.
We now have many problems. Increased defect escape rate is causing customer pain. Automation test coverage is low. Our release cadence is slowing down. Missing documentation is causing features to take longer & bugs are becoming more frequent. We have strong engineers... how did this happen?
Enter Copy Paste
Before I go into how we resolved our problem, I want to highlight another risk involving technical debt. I like to call this "The Copy Paste Effect."
My team was building an internal dashboard (DASH) to expose troubleshooting information for use during customer support calls. This involved the creation of a few backend services. Fast-forward 3 months... We have 3 services deployed (DASH-A, DASH-B, DASH-C). Two of these are of solid construction: they follow development best practices, have fully automated regression pipelines, use shared libraries, etc. One of the services (DASH-C) wasn't quite up to par.
The DASH-C APIs were not particularly critical. While planning DASH, I considered the DASH-C service a good opportunity for an up-and-coming developer to get a taste of ownership. To cut to the chase - DASH-C was single instance (not HA), manually deployed, and missing unit/component tests. I didn't realize this until the service had already gone live. Oops...
My team set up a plan to uplift DASH-C in the upcoming quarter. It was also a good learning experience for both the developers and the managers. I thought we were in the clear, but I was wrong...
Another team had been really impressed with a demo involving DASH. This team wanted to extend DASH to include additional functionality. Before long, they delivered another 2 services (DASH-D, DASH-E), both of which used DASH-C as a template. Now 3 out of 5 services are sub-standard!!
Other examples of the Copy Paste Effect I have seen in the past year:
- Project Management: Cloning an incomplete RFI that was missing critical details.
- Alerting: Duplicating a noisy alert that was not indicative of an actionable problem.
- Data Analytics: Basing a new Splunk dashboard off a poorly written, non-optimized query.
- Unit testing: Copy/pasting non-thread-safe unit tests from another project, resulting in flaky tests that intermittently fail.
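That last one is worth a closer look, because the copied code often looks perfectly reasonable. A minimal sketch (the counter and function names are hypothetical, not from any real project) of the shared-mutable-state pattern that makes copy/pasted tests flaky under a parallel test runner, alongside the lock-guarded fix:

```python
import threading

counter = 0       # unguarded shared state: the source of the flakiness
safe_counter = 0  # same state, but every update below holds the lock
lock = threading.Lock()

def unsafe_increment(n):
    """Racy: counter += 1 is a read-modify-write, not atomic across threads."""
    global counter
    for _ in range(n):
        counter += 1

def safe_increment(n):
    """The fix: serialize each update so no increments are lost."""
    global safe_counter
    for _ in range(n):
        with lock:
            safe_counter += 1

threads = [threading.Thread(target=safe_increment, args=(10_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(safe_counter)  # 40000 on every run; the unsafe version can lose updates
```

Run the `unsafe_increment` variant the same way and the total will only *sometimes* come up short - which is exactly why the copied test passes on one machine and intermittently fails on another.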
Lay down the law
To be honest, it was challenging to dig ourselves out of this mess. We started by enhancing our tooling to enforce SME approval during the pull request process (back to what we previously had with Perforce). We fought a massive uphill battle to return the quality standard for code submissions back to its earlier level. We had to rely heavily on well-respected leaders within the team to "uphold the law." Many times I had to step in when a developer would claim "The sprint plan didn't factor in time for writing unit tests" or "I'll clean up the code before we hit production. I just want to get this merged now in order to demo it internally." I focused on letting my team know that we are committed to quality, that we should be proud of our work, and that I support taking the necessary time to get the project done right. Through several months of perseverance, high quality standards became habitual and second nature for the team.
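Our fix involved custom tooling, but modern Git hosts now offer this kind of enforcement out of the box. As a sketch (the paths and team names below are hypothetical), a GitHub CODEOWNERS file, combined with a branch protection rule that requires review from code owners, routes each change to its SME and blocks rubber-stamp merges:

```
# .github/CODEOWNERS - each pattern maps to the SMEs who must approve
/services/auth/      @example-org/auth-smes
/services/billing/   @example-org/billing-smes
*.sql                @example-org/db-smes
```

With "Require review from Code Owners" enabled on the main branch, a pull request touching these paths cannot merge on a peer rubber stamp alone.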
We managed to stop the bleeding, but we have accumulated a mountain of technical debt that we are - to this day - still resolving. While there were definitely other factors that led to our predicament, I can't help but wonder how things would have turned out differently had we maintained tooling parity from the get-go.
Perfectionism & Over Policing?
In the field of criminology, there has been some controversy over the Broken Windows Theory. It has been cited to defend overzealous policing, which can cause other problems, such as lack of cooperation and distrust in the police. This highlights the need for proper judgment and execution when handling minor issues. It is crucial that the manager remains trusted and respected by their team.
Broken windows can be anything and their severity varies from team to team. Some examples:
- Intermittent system test failures that go undiagnosed and unresolved
- Build execution times that grow uncontrollably
- New code merged without documentation or test automation
- API response times that grow unchecked
- Source code that is duplicated
If these types of issues are diagnosed and the estimates to fix them are very large, that hints at a more foundational problem, such as repository setup, build system, product architecture, lack of tooling, etc.
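Several of these windows can be boarded up mechanically before anyone has to argue about them. A minimal sketch (the metric names and budget values are hypothetical) of a CI guardrail that fails the build the moment a tracked metric, such as build time or API latency, exceeds its budget:

```python
# Hypothetical budgets; pick metrics and limits that matter to your team.
BUDGETS = {"build_seconds": 600, "api_p95_ms": 250}

def check_budgets(measured):
    """Return human-readable violations; an empty list means all clear."""
    return [
        f"{name}: measured {measured[name]} exceeds budget {limit}"
        for name, limit in BUDGETS.items()
        if measured.get(name, 0) > limit
    ]

# Example: the build slowed down, but the API is still within budget.
violations = check_budgets({"build_seconds": 720, "api_p95_ms": 180})
for v in violations:
    print(v)
# A real CI job would exit nonzero here, forcing the window to be
# repaired now instead of quietly drifting for another quarter.
```

The point is not the specific numbers; it is that a budget turns "the build feels slow lately" into a broken window the pipeline itself refuses to ignore.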
The general philosophy applies to many areas in the day-to-day life of a software engineer. A few guiding principles I keep in mind everyday:
- Maintaining a high standard is important for success - make it part of your team culture
- Small concessions are not doing anyone a favor
Identify the broken windows in your community and work to repair them before "crime" escalates.