What Is and Should Never Be

I have been banging on about the perils of the Great Rewrite in many previous posts. Huge risks regarding feature creep, lost requirements, hidden assumptions, spiralling cost, internal unhealthy competition where the new system can barely keep up with the evolving legacy system, et cetera.

I will in this post attempt to argue the opposite case. Why should you absolutely embark on a Great Rewrite? How do you easily skirt around the pitfalls? What rewards lie in front of you? I will start from my usual standpoint: what are invalid excuses for embarking on this type of journey?

Why NOT to give up on gradual refactoring?

If you analyse your software problems and notice that they are not technical in nature, but political, there is no point in embarking on a massive adventure, because the dysfunction will not go away. No engineering problems are unsolvable, but political roadblocks can be completely immovable under certain circumstances. You cannot make two teams collaborate where the organisation is deeply invested in making that collaboration impossible. This is usually a problem of P&L, where the hardest thing about a complex migration problem is the accounting and budgeting involved in having a key set of subject matter experts collaborating cross-functionally.

The most horrible instances of shadow IT or Frankenstein middleware have been created because the people that ought to do something were not available so some other people had to do something themselves.

Basically, if – regardless of size of work – you cannot get a piece of work funnelled through the IT department into production in an acceptable time, and the chief problem is the way the department operates, you cannot fix that by ripping out the code and starting over.

Why DO give up on gradual refactoring?

Impossible to enact change in a reasonable time frame.

Let us say you have an existing centralised datastore that has several small systems integrating across it in undocumented ways, and your largest legacy system is getting to the point where its libraries cannot be upgraded anymore. Every deployment is risky, performance characteristics are unpredictable for every change, and your business side, your customer in the lean sense, demands quicker adoption of new products. You literally cannot deliver what the business wants in a defensible time.

It may be better to start building a new system for the new products, and refactor the new system to bring older products across after a while. Yes, the risk of a race condition between new and old teams is enormous, so ideally teams should own the business function in both the new and the old system, so that the developers get some accidental domain knowledge which is useful when migrating.

Radically changed requirements

Has the world changed drastically since the system was first created? Are you laden with legacy code that you would just like to throw away, except the way the code is structured you would first need to do a great refactor before you can throw bits away, but the test coverage is too low to do so safely?

One example of radically changed requirements could be – you started out as a small site only catering to a domestic audience, but then success happens and you need to deal with multiple languages and the dreaded concept of timezones. Some of the changes necessary for such a change can be of the magnitude that you are better off throwing away the old code rather than touching almost every area of the code to use resources instead of hard coded text. This might be an example of amortising on well adjudicated technical debt. The time to market gain you made by not internationalising your application first time round could have been the difference that made you a success, but still – now that choice is coming back to haunt you.

Pick a piece of functionality that you want to keep, and write a test around both the legacy and the new version to make sure you cover all requirements you have forgotten over the years (this is Very Hard to Do). Once you have correctly implemented this feature, bring it live and switch off this feature in the legacy system. Pick the next keeper feature and repeat the process, until nothing remains that you want to salvage from the old system and you can decommission the charred remains.
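One way to catch the requirements you have forgotten is to run the same inputs through both the legacy and the new implementation and list every disagreement. The following is only a minimal sketch: the two fee functions, the Swedish discount and the test cases are made up for illustration, and in real life the hard part is collecting inputs that are representative of production.

```python
from typing import Any, Callable, Iterable


def legacy_onboarding_fee(country: str, employees: int) -> int:
    """Hypothetical stand-in for the legacy behaviour, quirks included."""
    fee = 100 + 5 * employees
    if country == "SE":
        fee -= 20  # a discount nobody remembered to write down
    return fee


def new_onboarding_fee(country: str, employees: int) -> int:
    """Hypothetical stand-in for the rewritten feature."""
    return 100 + 5 * employees


def find_mismatches(
    legacy: Callable[..., Any],
    new: Callable[..., Any],
    cases: Iterable[tuple],
) -> list:
    """Run both implementations on the same inputs and collect the differences."""
    mismatches = []
    for args in cases:
        expected, actual = legacy(*args), new(*args)
        if expected != actual:
            mismatches.append((args, expected, actual))
    return mismatches


cases = [("GB", 10), ("SE", 10), ("DE", 0)]
for args, expected, actual in find_mismatches(
    legacy_onboarding_fee, new_onboarding_fee, cases
):
    print(f"{args}: legacy={expected} new={actual}")
```

Every mismatch is either a bug in the new code or a requirement you had forgotten about, and somebody in the business has to decide which. Only when the list is empty, or every remaining entry is a deliberate change, is it safe to switch the feature off in the legacy system.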

Pitfalls

Race condition

Basically, you have a team of developers implement client onboarding in the new system: some internal developers and a couple of external boutique consultants from some firm near Old Street. They have meetings with the business, i.e. strategic sales and marketing are involved, and they have an external designer involved to make sure the visuals are top notch. Meanwhile, in the damp lower ground floor, the legacy team has Compliance in their ear about the changes that need to go live NOW or else the business risks being in violation of some treaty that enters into force next week.

I.e. as the new system is slowly polished, made accessible, perhaps being a bit bikeshedded as too many senior stakeholders get involved, the requirements for the actual behind-the-scenes criteria that need to be implemented are rapidly changing, and to the team involved in the rework it seems that the goalposts never stop moving, and most of the time they are never told, because compliance “already told IT”, i.e. the legacy team.

What is the best way to avoid this? Well, if legacy functionality seems to have high churn, move it out into a “neutral venue”, a separate service that can be accessed from both new and old systems, and remove the legacy remains to avoid confusion. Once the legacy system is fully decommissioned you can take a view and see if you want to absorb these halfway houses or if you are happy with how they are factored. The important thing is that key functionality only exists in one location at all times.

Stall

A brave head of engineering sets out to implement a new modern web front-end, replacing a server-rendered website communicating via SOAP with a legacy backend where all business logic lives. Some APIs have to be created to do processing that the legacy website did on its own before or after calling into the service. On top of that, a strangler fig pattern is implemented around the calls to the legacy monolith, primarily to isolate the use of SOAP away from the new code, but also to obviate some of the things that are deemed not to be worth taking the round trip over SOAP. Unfortunately, after the new website is live and complete, the strangler fig has not actually strangled the back-end service, and a desktop client app is still talking SOAP directly to the backend service with no intention of ever caring about or even acknowledging the strangler fig. Progress ceases and you are stuck with a half-finished API that in some cases implements the same features as the backend service, but in most cases just acts as a wrapper around SOAP. Some features live in two places, and nobody is happy.

How to avoid it? Well, things may happen that prevent you from completing a long term plan, but ideally, if you intend to strangle a service, make sure all stakeholders are bought into the plan. This can be complex if the legacy platform being strangled is managed by another organisation, e.g. an outsourcing partner.
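For reference, the shape of the strangler fig in the scenario above is roughly the following. This is a sketch under assumptions rather than anybody's real implementation: the class, the operation names and the fake SOAP backend are all hypothetical. The one thing I would add to the textbook pattern is a count of the calls that still fall through to legacy, so that a stall is at least visible.

```python
from typing import Any, Callable, Dict


class StranglerFacade:
    """Single entry point: each operation is served by new code or forwarded to legacy."""

    def __init__(self, legacy_call: Callable[[str, dict], Any]) -> None:
        self._legacy_call = legacy_call
        self._migrated: Dict[str, Callable[[dict], Any]] = {}
        self.legacy_hits: Dict[str, int] = {}

    def migrate(self, operation: str, handler: Callable[[dict], Any]) -> None:
        """Register a new implementation; from now on legacy is bypassed for it."""
        self._migrated[operation] = handler

    def call(self, operation: str, payload: dict) -> Any:
        handler = self._migrated.get(operation)
        if handler is not None:
            return handler(payload)
        # Not migrated yet: count the fall-through and forward to the monolith.
        self.legacy_hits[operation] = self.legacy_hits.get(operation, 0) + 1
        return self._legacy_call(operation, payload)


def fake_soap_backend(operation: str, payload: dict) -> dict:
    """Hypothetical stand-in for the SOAP call into the legacy monolith."""
    return {"served_by": "legacy", "operation": operation}


facade = StranglerFacade(fake_soap_backend)
facade.migrate("get_quote", lambda payload: {"served_by": "new", "operation": "get_quote"})

quote = facade.call("get_quote", {"customer": 1})
policy = facade.call("create_policy", {"customer": 1})
print(facade.legacy_hits)  # operations the fig has not strangled yet
```

Note what the counter cannot see: a client that talks to the backend directly, like the desktop app above, never passes through the facade at all. That is exactly why stakeholder buy-in matters more than the code.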

Reflux

Let's say you have a monolithic storage, the One Database. Over the years BI and financial ops have gotten used to querying directly into the One Database to capture reports. Since the application teams are never told about this work, the reports are often broken, but they persevere and keep maintaining these reports anyway. The big issue for engineering is the host of “batch jobs”, i.e. small programs run from a hand-built task scheduler from 2001 that does some rudimentary form of logging directly into a SchedulerLogs database. Nobody knows what these various programs do, or which tables in the One Database they touch, just that the jobs are Important. The source code for these small executables exists somewhere, probably… Most likely in the old CVS install on a snapshot of a Windows Server 2008 VM that is an absolute pain to start up, but there is a batch file from 2016 that does the whole thing, and it usually works.

Now, a new system is created. Finally, the data structure in the New Storage is fit for purpose, new and old products can be maintained and manipulated correctly because there are no secret dependencies. An entity relationship that was stuck as 1-1 due to an old, bad design that had never been possible to rectify – as it would break the reconciliation batch job that nobody wants to touch – can finally be put right, and several years' worth of poor data quality can finally be addressed.

Then fin ops and BI write an angry email to the CFO that the main product no longer reports data to their models, and how can life be this way, and there is a crisis meeting amongst the C-level execs and an edict is brought down to the floor, and the head of engineering gets told off for threatening to obstruct the fiduciary duties of the company, and is told to immediately make sure data is populated in the proper tables… Basically, automatically sync the new data to the old One Database to make sure that the legacy Qlik reports show the correct data, which also means that some of the new data structures have to be dismantled as they cannot be meaningfully mapped back to the legacy database.

How do you avoid this? Well, loads of things were wrong in this scenario, but my hobby-horse is about abstractions, i.e. make sure any reports pointing directly into an operational database do not do that anymore. Ideally you should have a data platform for all reporting data where people can subscribe to published datasets, i.e. you get contracts between producer and consumer of data so that the dependencies are explicit and can be enforced, but at minimum have some views or temporary tables that define the data used by the people making the report. That way they can ask you to add certain columns, and as a developer you are aware that your responsibility is to not break those views at any cost, but you are still free to refactor underneath and make sure the operational data model is always fit for purpose.
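To make the minimum option concrete, here is a small sketch using an in-memory SQLite database. The table, view and column names are invented for the example. The report only ever queries the view, so the operational tables can be split and renamed underneath it as long as the view keeps returning the same columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The original operational table, and the view that reporting is allowed to query.
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount_pence INTEGER);
    INSERT INTO orders VALUES (1, 'Acme', 1250), (2, 'Globex', 990);
    CREATE VIEW report_orders AS
        SELECT id AS order_id, customer AS customer_name, amount_pence FROM orders;
""")

REPORT_QUERY = (
    "SELECT order_id, customer_name, amount_pence FROM report_orders ORDER BY order_id"
)
before = conn.execute(REPORT_QUERY).fetchall()

# Refactor the operational model: customers move to their own table,
# and the view is redefined on top of the new structure.
conn.executescript("""
    DROP VIEW report_orders;
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customers (name) SELECT DISTINCT customer FROM orders;
    CREATE TABLE orders_v2 (id INTEGER PRIMARY KEY, customer_id INTEGER, amount_pence INTEGER);
    INSERT INTO orders_v2
        SELECT o.id, c.id, o.amount_pence FROM orders o JOIN customers c ON c.name = o.customer;
    DROP TABLE orders;
    CREATE VIEW report_orders AS
        SELECT o.id AS order_id, c.name AS customer_name, o.amount_pence
        FROM orders_v2 o JOIN customers c ON c.id = o.customer_id;
""")

after = conn.execute(REPORT_QUERY).fetchall()
print(before == after)  # the report query never changed
```

A comparison like the one on the last line is also a cheap automated test to keep around: if a refactoring changes what the view returns, you find out before fin ops does.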

Conclusion

You can successfully execute a great rewrite, but unless you are in a situation where the company has made a great pivot and large swathes of the features in the legacy system can just be deleted, you will always contend with legacy data and legacy features, so fundamentally it is crucial to avoid at least the pitfalls listed above (add more in the comments, and I’ll add them and pretend they were there all along). Things like how reporting will work must be sorted out ahead of time. There will be lack of understanding, shock and dismay, because what we see as hard coupling and poor cohesion, some folks will see as a single pane of glass, so some people will think it is ludicrous to not use the existing database structure forever. All the data is there already?!

Once there is a strategy and a plan in place for how the work will take place, the organisation will have to be told that although you were not of the opinion that we were moving quickly before, we shall actually for a significant time worsen our response times regarding new features as we dedicate considerable resources to performing a major upgrade to our platform into a state that will be more flexible and easy to change.

Then the main task is to only move forward at pace, and to atomically go feature by feature into the new world, removing legacy as you go, and use enough resources to keep the momentum going. Best of luck!
