Category Archives: DevOps

Further on how time is wasted

I keep going on about why software development should consist of empowered cross-functional teams, and a lot of actual experts have written very well – at length – about this. Within manufacturing, the Japanese showed why it matters in the 1980s, and the American automotive industry drew the wrong lessons from it, but that is a separate issue. For some light reading on these topics I recommend People Before Tech by Duena Blomstrom and Eli Goldratt’s The Goal.

Incorrect conclusions from the Toyota Production System

The emergence of Six Sigma was a perfect example of drawing the completely wrong conclusion from the TPS1. In manufacturing, as well as in other processes where you do the exact same thing multiple times, you do need to do a few sensible things, like examine your value chain and eliminate waste. Figuring out exactly how to fit a dashboard 20 seconds faster in a car, or providing automated power tools that let the fitter apply exactly the correct torque without any manual settings, creates massive value, and communicating and re-evaluating these procedures to gradually optimise further ties directly into value creation.

But transferring that way of working to an office full of software developers where you hopefully solve different problems every single day (or you would use the software you have already written, or license existing software rather than waste resources building something that already exists) is purely negative, causing unnecessary bureaucracy that actually prevents value creation.

Also – the exact processes that were developed at Toyota, or even at its highly successful joint venture with General Motors – NUMMI – were never the success factor. The success factor was the introspection, the empowerment to inspect and adapt processes at the leaf nodes of the organisation. The attempts by GM to bring the exact processes back to Detroit failed miserably. The clue is in the meta process, the mission and purpose, as discussed in Richard Pascale’s The Art of Japanese Management and in Tom Peters’s In Search of Excellence.

The value chain

The books I mentioned in the beginning explain how to look at work as it flows through an organisation as a chain, where pieces of work are handed off between different parts of the organisation as different specialists do their part. Obviously there needs to be authorisation and executive oversight to make sure nothing that runs contrary to the ethos of the company gets done in its name, and there are multiple regulatory and legal concerns that a company wants to safeguard – I am not proposing we remove those safeguards. But the total actual work that is done needs mapping out. Executives, and especially developers, rarely have a full picture of what is really going on: there can be workarounds created around perceived failings in the systems used that have never been reported accurately.

A more detailed approach, a bit like a DLC on the Value Stream Mapping process, is called Event Storming: you gather stakeholders in a room to map out exactly the actors and the information that make up a business process. This can take days, and may seem like a waste of a meeting room, but the knowledge that comes out of it is very real – as long as you make sure not to make this a fun exercise for architects only, but involve the very real people who run these processes day-to-day.

The waste

What is waste then? Well – spending time that does not even indirectly create value. Having people wait for approvals rather than shipping. Product creating tickets six months ahead of time that then need to be thrown away because the business needs to go in a different direction. Writing documentation nobody reads (it is the “that nobody reads” that is the problem there, so work on writing good, useful and discoverable documentation, and skip the rest). Having two or more teams work on solving the same problem without coordination – although there is a cost to coordination as well, so there is a tradeoff here. Sometimes it is less wasteful for two teams to independently solve the same problem, if it leads to faster time to market and the maintenance burden created is minimal.

On a value stream map it becomes painfully clear where you need information before it exists, or where dependencies flow in the wrong direction. With enough seniority in the room you can make decisions on what matters, and with enough individual contributors in the room you can agree practical solutions that make people’s lives easier. You can spot handoffs that are unnecessary, or approvals that could be determined algorithmically, find ways of making late decisions based on correct data, or find ways of implementing safeguards that do not cost the business time.

As a small cog in a big machine it is sometimes very difficult to know which parts of your daily struggles add or detract value from the business as a whole, and these types of exercises are very useful in making things visible. The organisation is also forced to resolve unspoken things like power struggles, so that business decisions are made at a sensible level with clear lines of authority. Businesses with a lot of informal authority or informal hierarchies especially can struggle to put into words how they do things day-to-day, but it is very important that what is documented is the current unvarnished truth – or else, as you learn repeatedly in The Goal, you end up optimising somewhere other than the constraint, which is useless.

But what – why is a handoff between teams “waste”?

There are some unappealing truths in The Goal – e.g. the ideal batch size is 1, and you need slack in the system – but when you think about them, they are true:

Slack in the system

Say, for instance, you are afraid of your regulator – with good reason – and you know from bitter experience that software developers are cowboys. You hire an ops team to gatekeep, to prevent developers from running around with direct access to production. Now the ops team relies on detailed instructions from the developers on how to install the freshly created software into production, yet the developers are not privy to exactly how production works. Hilarity ensues and deployments often fail. It becomes hard for the business to know when their features go out, because both ops and dev are flying blind. In addition, the ops team is 100% utilised – they are deploying and configuring things all day – so any failed deployment (or, let’s be honest, botched configuration change ops attempts on their own without any developer to blame) always leads to delays. The lead time for a deployment goes out to two weeks, and then further.

OK, so let’s say we solve that. Ops build – or commission – a pipeline that they accept is secure, with enough controls and reliable rollback capabilities to be trusted in the hands of a pair of developers. Bosh – we solve the deployment problem: developers can only release code that works, or it will be rolled back without them ever needing the actual keys to the kingdom. They have a button and observability, that’s all they need. Of course, us developers will find new ways to break production, but the fact remains, rollback is easy to achieve with this new magical pipeline.
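As a sketch of what such a pipeline boils down to – all function names here are hypothetical stand-ins, and a real pipeline would of course involve real infrastructure – the core loop is deploy, health-check, and automatic rollback on failure:

```python
# Illustrative sketch only: deploy_fn, health_check_fn and rollback_fn are
# hypothetical stand-ins for whatever your real pipeline actually does.
def release(deploy_fn, health_check_fn, rollback_fn):
    """Deploy a new version; roll back automatically if it is unhealthy."""
    deploy_fn()
    if health_check_fn():
        return "released"
    rollback_fn()  # developers never need direct production access
    return "rolled back"

# Example run with fakes standing in for real infrastructure:
log = []
result = release(
    deploy_fn=lambda: log.append("deployed v2"),
    health_check_fn=lambda: False,  # pretend the new version is unhealthy
    rollback_fn=lambda: log.append("rolled back to v1"),
)
print(result, log)
```

The point of the shape is that the rollback path is part of the pipeline itself, not a privilege the developers hold.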

Now this state of affairs is magical thinking: no overworked ops team is going to have the spare capacity to work on tooling. What actually tended to happen was that the business hired a “devops team”, which unfortunately wasn’t trusted with access to production either, so you might end up with separate automation for ops and dev (the “devops team” writes and maintains CI/CD tooling and some developer observability, while the ops team runs its own management platform and liveness monitoring), which does not really solve the problem. The ops team needs time to introspect and improve processes, i.e. slack in the system.

Ideal batch size is 1

Let us say you have iterations, i.e. the “agile is multiple waterfalls in a row” trope. You work for a bit, you push the new code to classic QA, who revise their test plans and then test your changes as much as they can before the release. You end up with 60 Jira tickets that need QA signoff on all browsers before the release, and you need to dedicate a middle manager to go around and hover behind the shoulders of devs and QA until all questions are straightened out and the various bugs and features have been verified in the dedicated test environment.

A test plan for the release is created, and perhaps a dry run is carried out. You warn support that things are about to go down, you bring half the estate down and install updates on one side of the load balancer, and you let the QAs test the deployed tickets. They will try to test as much as they can of the 60 tickets without taking the proverbial, given that the whole estate is currently serving prod traffic from only a subset of the instances. Once they are happy, prod is switched over to the freshly deployed side, and a couple of interested parties start refreshing the logs to see if something bad seems to be happening as the second half of the system is deployed, the firewall is reset to normal and monitoring is enabled.

So that is a “normal” release for your department, and it requires quite a few people to go to DEFCON 2 and dedicate their time to shepherding the release into production. A lot of the complexity of the release is the sheer size of the changes. If you were deploying a small service with one change, the work to validate that it is working would be minimal, and you would also immediately know what is broken, because you know exactly what you changed. With large change sets, if you start seeing an Internal Server Error becoming common, you have no exact clue as to what went wrong – unless you are lucky and the error message makes immediate sense, but unfortunately, if it was a simple problem, you would probably have caught it in the various test stages beforehand.

Now imagine that one month the marquee feature that was planned for the release is a little bit too big and cannot be released in its entirety, so the powers that be decide: hey, let’s just combine this sprint with the next one and push the release out another two weeks.

Come QA validation before the delayed release, there are 120 tickets to be validated – do you think that takes twice the time to validate, or more? Well, you only get the same three days to do the job – it’s up to the Head of QA to figure it out – but the test plan is longer, which makes the release take 10 hours, four of which are spent in limbo while the estate is running on half power.
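To see why the answer tends to be “more than twice”, here is a toy model – the cost function is an assumption for illustration, not data: suppose each ticket in a batch also interacts with everything already in the batch, so validating n tickets costs n · (1 + n/K) tester-hours for some arbitrary constant K.

```python
# Toy model, not data: assume validating a batch of n tickets costs
# n * (1 + n / K) tester-hours, because each extra ticket interacts with
# everything already in the batch. K = 60 is an arbitrary illustrative constant.
def validation_cost(n, K=60):
    return n * (1 + n / K)

one_big_release = validation_cost(120)        # 120 tickets validated at once
two_small_releases = 2 * validation_cost(60)  # 60 tickets, released twice
print(one_big_release, two_small_releases)
```

With these assumed numbers, the single 120-ticket release costs 360 tester-hours against 240 for two 60-ticket releases. Under any cost function that grows faster than linearly, splitting the batch wins – which is the intuition behind “ideal batch size is 1”.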

So yes, you want to make releases easy to roll back and fast to do, and rely heavily on automated validation to avoid needing manual verification – but most of all, you want to keep the change sets small. The automation makes that an easier choice, but you could choose to release often even with manual validation. It just seems to be human nature to prefer a traumatic delivery every two weeks over slight nausea every afternoon.

So what are the right conclusions to draw from TPS?

Well, the main things are go and see, and continuous improvement. Allow the organisation to learn from previous performance, and empower people to make decisions at the level where it makes sense – e.g. discussions about tabs vs spaces should not travel further up the organisation than the developer collective, and some discussions should not be elevated beyond the team. If you give teams accountability for cloud spend, plus the visibility and the authority to effect change, you will see teams waste less money; but if the cost is hard to attribute and only visible at a departmental level, your developers are going to shrug their shoulders, because they do not see how they can effect change. If you allow teams some autonomy in how they solve problems, whilst giving them clear visibility of what the Customer – in the agile sense – wants and needs, the teams can make correct small tradeoffs at their level without derailing the greater good.

So – basically – make work visible. Let teams know the truth about how their software is working and what their users feel about it. Let developers see how users use their software. Do not be afraid to show how much the team costs, so that they can make reasonable suggestions – like “we could automate these three things, and given our estimated running cost, that would cost the business roughly 50 grand, and we think that would be money well spent because a), b), c) […]” or alternatively “that would be cool, but there is no way we could produce that for you for a defensible amount of money, let us look at what exists on the market that we can plug into to solve the problem without writing code”.

Everyone at work is a grown-up, so if you think you are getting unrealistic suggestions from your development teams, consider whether you have perhaps hidden too much relevant information from them. If we figure out how to make relevant information easily accessible, we can give everyone a better understanding not only of what is happening right now, but more importantly of what reasonably should happen next. This also helps upper management understand what each team is doing. If you meet internal resistance to this, consider why, because that in itself could explain problems you might be having.

  1. The initialism TPS stands for Toyota Production System, as you may deduce from the proximity to the headline, but I acknowledge the existence of TPS reports in popular culture – i.e. Office Space. I do not believe they are related. ↩︎

Mob / Pair vs Solo and Speed

I have recently “thought led” on LinkedIn, claiming that the future of software development lies in mob programming. I think this take automatically flips my bozo bit in the minds of certain listeners, whilst for many people it is a statement about as revolutionary as saying that water is wet.

Some definitions (based on my vague recollection and lazy googling to verify – please let me know if you know better) of what I mean by mob and pair programming.

Solo development

This is what you think it is. Not usually completely alone in the dark wearing a hoodie like in the films, but at least, you sit at your own screen, pick up a ticket, write your code, open a PR, tag your mates to review it, get a coffee, go through and see if there are PRs from other developers for you to review. Rinse / repeat.

The benefit here is that you get your own keyboard shortcuts, your own Spotify playlist, and can respond immediately to chat messages from the boss. The downside is that regulators don’t trust developers, not alone, so you need someone else to check your work. We used to have a silo for testers, but like trying to season the food afterwards, it is impossible to retrofit quality, so we have modified our ways of working. Still, the queue of pull requests awaiting review remains a bottleneck, and if you are unlucky, you lose the “race” and need to resolve merge conflicts before your changes can be applied to the trunk of the source tree.

Pair programming

Origins

Pair programming is one of the OG practices of Extreme Programming (XP), developed in 1996 by Kent Beck, Ward Cunningham and Ron Jeffries, and later publicised in the book Extreme Programming Explained (Beck). It basically means one computer, two programmers. One person types – drives – while the other navigates. It makes it easier to retain context in the minds of both people, it is easier to retain state in case you get interrupted, and you spread knowledge incredibly quickly. There are limitations of course – if the navigator is disengaged, or if the two people have strong egos you get unproductive discussions over syntactic preference – but that would have played out in pull requests/code review anyway, so at least this is resolved live. In practical terms this is rarely a problem.

Having two people work on code is much more effective than reviewing after the fact. It is of course no complete guarantee against defects, but it is pretty close. The only time I have worked in a development team that produced literally zero defects, pair programming was mandatory, change sets were small, and releases were frequent. We recruited a pair of developers who had already adopted these practices at a previous job; in passing chat with some of our developers ahead of joining, they had mentioned that their teams had zero defects, and our people laughed – because surely that’s impossible. Then they showed us. Test first, pair program, release often. It works. There were still occasions where we had missed a requirement, but that was discovered before the code went live, and of course it led us to evolve our ways of working until that didn’t happen either.

Downsides?

The most obvious naive observation would be: 2 developers, one computer – surely you get half the output? Now, typing speed is not the bottleneck in software development, but more importantly – code has no intrinsic value. The value is in the software delivering the right features at the least possible investment of time and money (whether in creation or maintenance), so writing the right code – including writing only code that differentiates your business from the competition – is a lot more important than writing the most code. Most people in the industry are aware of this simple fact, so the “efficiency loss” of having two people operating one computer is generally understood to be outweighed by delivering the right code faster.

On the human level, people initially rarely love having a backseat driver when coding: either you are self-conscious about your typing speed or your rate of typos, or you feel like you are slowing people down. But by frequently rotating pairs and the driver/navigator roles, the ice breaks quickly. You need a situation where a junior feels safe to challenge the choices of a senior, i.e. psychological safety, but that is generally true of an innovative and efficient workplace, so if you don’t have that – start there. Another niggle is that I am still looking for a way to do it frictionlessly online. It is doable over Teams, but it isn’t ideal. I have had very limited success with the collab features in VS Code and Visual Studio, but if they work for you – great!

Overall

People who have given it a proper go seem to almost universally agree on the benefits, even if it began as a thing forced upon them by an engineering manager. It does take a lot of mental effort: the normal breaks to think as you type get skipped, because your navigator is completely on it, so you write the whole time; similarly, the navigator can keep the whole problem in mind and does not have to deal with browsing file trees or triggering compilations and test runs, so they can focus on the next thing. All in all this means that after about 6-7 hours, you are done. Just give up, finish off the day writing documentation, reporting time, doing other admin and checking emails – because thinking about code will have ceased. By this time in the afternoon you will probably have pushed a piece of code into production, so it’s also a fantastic opportunity to get a snack and pat yourself on the back as the monitoring is all green and everything is working.

Mob programming

Origins

In 2011, a software development team at Hunter Industries happened upon Mob Programming as an evolution of practicing TDD and Coding Dojos, applying those techniques to get up to speed on a project that had been put on hold for several months. A gradual evolution of practices, as well as a daily inspection and adaptation cycle, resulted in the approach that is now known as Mob Programming.

In 2014, Woody Zuill first described Mob Programming publicly, in an Experience Report at Agile2014 based on the experiences of his team at Hunter Industries.

Mob programming is next-level pair programming. Fundamentally, the team is seated together in one area. One person writes code at a time, usually on a projector or a massive TV so everyone can see. Other computers are available for research, looking at logs or databases, but everyone stays in the room, both physically and mentally – nobody sits at the table with their own laptop open, the focus is on the big screen. People talk out loud and guide the work forward. Communication is direct.

Downsides

I mean, it is hard to tell a manager that a whole team needs to book a conference room or secluded collaboration area and hang out all day, every day, going forward – it seems like a ludicrously expensive meeting, and you want to expense an incredibly large flatscreen TV as well – are the Euros coming up or what? Let me guess, you want Sky Sports with that? All joking aside, the optics can be problematic, just like it was problematic getting developers multiple big monitors back in the day. At some companies you have to let your back problems become debilitating before you are allowed to create discord by getting a fancier chair than the rest of the populace – those dynamics can play in as well.

The same problems of fatigue from being on 100% of the time can appear in a mob, and because there are more people involved, the complexities grow. Making sure the whole team buys in ahead of time is crucial; it is not something that can be successfully imposed from above. However, again, people who have tried it properly seem to agree on its benefits. A possible compromise can be to pair on tickets, but code review in a mob.

Overall

The big leap in productivity here lies in the advent of AI. If you can mob on code design and construction with the help of AI agents, yet with a team of expert humans still in the loop, you can avoid reviewing massive PRs, evade the ensuing complex merge conflicts, and instead safely deliver features often. I am convinced a mob approach to AI-assisted software development is going to be a game changer.

Whole-team approach – origins?

The book The Mythical Man-Month came out in 1975, a fantastic year, and addresses a lot of mistakes around managing teamwork. Most famously, the book shows how and why adding new team members to speed up development actually slows things down. The thing that fascinated me when I read it was essay 3, The Surgical Team: a proposal by Harlan Mills for something akin to a team of surgeons, with multiple specialised roles doing the work together. Remember, at the time Brooks was collating the ideas in this book, CPU time was absurdly expensive and terminals were not yet a thing, so you wrote code on paper and handed it off to be punched onto cards before handing that stack to an operator. Technicians wore a white coat when they went on site to service a mainframe so that people took them seriously. The archetypal Java developer in cargo shorts, t-shirt and a beard was still far away, at least from Europe.

The idea was basically to move from private art to public practice, and was founded on having a team of specialists that all worked together:

  • a surgeon, whom Mills calls a chief programmer – basically the most senior developer
  • a copilot – basically a chief programmer in waiting, who acts as a sounding board
  • an administrator – room bookings, wages, holidays, HR [..]
  • an editor – technical writer that ensures all documentation is readable and discoverable
  • two secretaries that handle all communication from the team
  • a program clerk – a secretary that understands code and can organise the work product, i.e. manages the output files and basically does versioning, as well as keeping notes and records of recent runs – again, this was pre-git, pre-CI.
  • the toolsmith – basically maintains all the utilities the surgeon needs to do his or her job
  • the tester – classic QA
  • the language lawyer – basically a Staff Programmer that evaluates new techniques in spikes and comes back with new viable ways of working. This was intended as a shared role where one LL could serve multiple surgeons.

So – why was I fascinated? This is clear lunacy – you think – who has secretaries anymore?! Yes, clearly several of these roles have been usurped by tooling, such as the secretaries, the program clerk and the editor (unfortunately – I’d love having access to a proper technical writer). Parts of the Administrator’s job are sometimes handled by delivery leads, and few developers have to line manage, as it is seen as a separate skill. Although it still happens, it is not a requirement for a senior developer, but rather a role a developer adopts in addition to their existing role, as a form of personal development.

No, I liked the way the concept accepts that you need multiple flavours of people to make a good unit of software construction.

The idea of a Chief Programmer in a team is clearly unfit for a world where CPU time is peanuts compared to human time and programmers themselves are cheap as chips compared to surgeons, and the siloing effect of having only two people in a team understand the whole system is undesirable.

But in the actual act of software development – having one person behind the keyboard and a group of people behind them constantly thinking about different aspects of the problem being solved, each with their own niche, proposing good tests to add, risks to consider and suitable mitigations – I think the concept really has legs, particularly in a future where a lot of the typing is done by an AI agent. The potential for quick feedback and immediate help is perfect, and the context disseminated across the whole team lets you remain productive even if the occasional team member goes on leave for a few days. The obvious differences in technical context aside, it seems there was an embryo there of what has, through repeated experimentation and analysis, developed into the Mob Programming of today.

So what is the bottleneck then?

I keep writing that typing speed is not the bottleneck, so what is? Why is everything so bad out there?

Fundamentally, code is text. Back in the day you would write huge files of text and struggle not to overwrite each other’s changes. Eventually, code versioning came along: you could “check out” code like a library book, and then only you could check that file back in. This was unsustainable when people went on annual leave and forgot to check their code back in, and eventually tooling improved to support merging code files automatically, with some success.

In some organisations you would have one team working on the next version of a piece of software, and another team working on the current version that was live. At the end of a year-long development cycle it would be time to spend a month integrating the new version with the various fixes that had been made to the old version over a whole year of teams working full time. Unless you have been involved in something like that, you cannot imagine how hard it is to do. Long-lived branches become a problem way before you hit a year – a couple of days is enough to make you question your life choices. And the time spent on integration is of literally zero value to the business. All you are doing is shoehorning changes already written into a state where the new version can be released; that whole month of work is waste. Not to mention the colossal load on testing of verifying a year’s worth of features before going live.

People came up with Continuous Integration, where you agree to continuously integrate your changes into a common area, making sure the source code is releasable and correct at all times. In practice this means you don’t get to have a branch live longer than a day: you have to merge your changes to the agreed integration area every day.

Now CI – like Behaviour Driven Development – has come to mean a tool. That is, “do we use continuous integration?” “Yeah, we have Azure DevOps”, the same way BDD has become “we use SpecSharp for acceptance tests”. But I believe it is important to understand what the words really mean. I loathe the work involved in setting up a good grammar for a set of cucumber tests in the true sense of the word, but I love giving tests names that adhere to the BDD style, and I find that testers can understand what the tests do even if they are in C# instead of English.
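To make the naming idea concrete – the teams I mention use C#, but here is the same style sketched in Python, with a made-up Basket class as the system under test; the test name itself spells out Given/When/Then:

```python
# The same BDD-naming idea as in the C# teams mentioned above, sketched in
# Python. The Basket class is made up purely for illustration.
class Basket:
    def __init__(self):
        self.items = []

    def add(self, price):
        self.items.append(price)

    def total(self):
        return sum(self.items)

# Given/When/Then encoded in the test name – a tester can read the intent
# without reading the body.
def test_given_an_empty_basket_when_an_item_is_added_then_total_is_its_price():
    basket = Basket()            # given an empty basket
    basket.add(42)               # when an item is added
    assert basket.total() == 42  # then the total is its price

test_given_an_empty_basket_when_an_item_is_added_then_total_is_its_price()
```

No Gherkin grammar, no step definitions to maintain – the readable specification lives entirely in the test name.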

The point is, activities like the integration of long-lived branches and code reviews of large PRs become more difficult just due to their size, and if you need to do any manual verification, working on a huge change set is disproportionately harder than dealing with several smaller ones.

But what about the world of AI? I believe the future will consist of programmers herding AI agents that do a lot of the actual typing and prototyping, and regulators deeply worried about what this means for accountability and auditability.

The solution from legislators seems to be Human-in-the-Loop, and the only way to avoid the pitfalls of large change sets, whilst giving the business the execution speed they have heard AI owes them, is to modify our ways of working so that the output of a mob of programmers can be equated with reviewed code – because, let’s face it, it has been reviewed by a whole team of people. Regulators worry about singular rogue employees being able to push malicious code into production, and if anything this holds up well from a security perspective: if an evildoer wants to bribe developers, rather than needing to bribe two, they would now have to bribe a whole team without getting exposed. Technically, of course, pushes would still need to be signed off by multiple people, for there to be accountability on record and to prevent malware from wreaking havoc, but that is a rather simple variation on existing workflows. The thing we are trying to avoid is an actual PR review queue holding up work – especially since reviewing a massive PR is what humans do worst.

Is this going to be straightforward? No, probably not, as with anything, we need to inspect and adapt – carefully observe what works and what does not, but I am fairly certain that the most highly productive teams of the future will have a workflow that incorporates a substantial share of mob programming.

Development Productivity

Why the rush to measure developer productivity?

A lot has been written on developer productivity – the best, indubitably, by Gergely Orosz and Kent Beck – but instead of methodical thinking and analysis I will just recall some anecdotes; it is Saturday, after all.

The reason a business even invests in custom software development is that it realises it could gain a competitive advantage by automating or simplifying the tasks used to generate revenue – and, I would say, only as a secondary concern to lower costs.

The main goal is to get the features, and to have some rapid response if there are any problems; as long as this happens, everyone is happy. Unfortunately, as we know, software development is notorious for having problems.

The customer-facing part of the organisation is measured relentlessly – sometimes incentivised based on generated profit – whilst the supplier part of the organisation it deals with seems to have no benchmarks for individual performance, which is bewildering when projects keep slipping.

Why can’t we just measure the outcome of an individual software developer? Because an individual developer’s outcome depends on too many external factors. Sure, one developer might type the line that, once in production, led to a massive reduction in losses – but who should get credit? The developer that committed the code? The other developer that reviewed it? The ops person that approved the release into production? The product owner that distilled wants and needs into requirements that could be translated into code?

The traditional solution is to attempt to measure effort: velocity, lines of code et cetera. Those metrics can easily yield problematic emergent behaviour in teams, and at no point do they measure whether the code written actually provides value to the business.

I would argue that the smallest unit of accountability is the development team: the motley crew of experts in various fields that work together to extract business requirements and convert them into running software. If they are empowered to talk to the business and trusted to release software responsibly into production, they can be held accountable for things like cloud spend, speed of delivery and reliability.

Unfortunately the above description of a development team and its empowerment sounds like a fairytale for many developers out there. There are decision gates, shared infrastructure, shared legacy code and domain complexities meaning that teams wait for each other, or wait for a specific specialist team to do some work only they are trained/allowed to do. I have likened it to an incandescent lightbulb before. A lot of heat loss, very little light. Most of the development effort is waste heat.

Why do software development teams have problems delivering features?

I will engage in some harmful stereotyping here to make a wider point. I have been around the block over the years and visited many organisations that exhibit the below issues to varying degrees, in different ways.

Getting new stuff into the pipeline

Fundamentally, it is difficult to get the internal IT department to build you a new thing. A new thing means a project plan has to be drawn up to figure out how much you are willing to spend, the IT department needs to negotiate priorities with other things currently going on, and there can be power plays between various stakeholders – not to mention some larger projects IT are engaging in on their own, because they mentioned them when their own budget was negotiated.

Some of this stuff is performative: a project literally needs to look “cool” to justify an increase in headcount. Now, I’m not saying upper management are careless – there are follow-ups and metrics on the department, and if you get a bunch of people in and the projected outcomes don’t materialise, you will be questioned – but the accountability does not change the fact that there is a marketing aspect to asking for a budget. It also means there is some work the IT department must do regardless of what the rest of the business wants, because they promised the CEO a specific shiny thing.

Gathering requirements

In some organisations, the business know “their” developers and can shoot questions via DM. There is a dark side to this, when one developer becomes someone’s “guy” and his Teams in-tray becomes a shadow support ticket system. This deterioration is why delivery managers or scrum masters or engineering managers step in to protect the developers – and thus their timelines – because otherwise none of the project work gets done: the developers end up working on pet projects for random people in the business, and all that budgeting and all those plans go out the window. The problem with this protectionism is that it removes direct feedback; the developers do not intuitively know how their users operate the software in anger. Many developers get anxiety pangs when they finally see the amount of workarounds people perform on a daily basis with hotkeys or various clipboard tricks – things that could have been an event in the source code, or a few function calls, saving literally thousands of hours of employee time in a year.

The funnel of requirements therefore goes through specific people: a product owner or business analyst with the unenviable task of gathering all kinds of requests into a slew of work that a representative in the business is authorised to approve. Instead of letting every Tom, Dick and Harry have a go at adding requirements, there is some control to prevent cost runaway and to offer a cohesive vision. Great advantages, yes, but it also means that people on the front line – who work against commission and feel constant pressure whilst battling the custom software – never see their complaints about annoying or difficult behaviour being addressed. When, every six months, some new feature is presented, they notice that their pet peeve is still there, unaddressed. This creates a groundswell of dissatisfaction, sometimes unfounded, but unfortunately sometimes not.

Sometimes the IT department introduces a dual pipeline: one pipe for long-term feature work and one for small changes that can be addressed “in the run of play”, to give people a sense of feedback being quickly acted upon. This adds the burden of a separate control mechanism – does this change make sense? is it small enough to qualify? – but some companies have had success with it.

The way to be effective here is to reduce gatekeeping but have transparent discussions about how much time can be spent addressing annoyances. Allowing teams to submit and vote for features works too, but generally, just showing developers how the software is used by its users is eye-opening.

Building the right thing

“We have poor requirements” is the main complaint developers have when stories overrun their estimates. In my experience this happens when stories are too big, and some more back and forth with the business is needed to sort them out. If developers and the business are organisationally close enough, a half-hour meeting can save a lot of time. It can be argued that estimates are a waste of time and should be replaced by budgets instead, but that is a separate blog post.

Developers have all had the experience of wasting time. Let us say a request comes down from the business with a vision of something; the team goes back and forth to figure out how to migrate from an existing system and how to put the new app in production. They write a first version to prove the concept, and then it never gets put into production, for some external reason. Probably a perfectly valid reason – it is just never made clear to the developers why the last three months of meetings, documents and code were binned, which grates on them. As implied above, the fact that it was three months of work can explain why the thing was never released; its time may very well have passed.

My proposed solution to the business remains to build smaller initial deliverables to test the waters; I have yet to be convinced that doesn’t work. It is hard, yes, it requires different discussions, and I will concede the IT department might already be at a disadvantage when trying to promise faster deliverables.

Also, requirements change because reality changes – the business is not always just messing with you – and because big or complex changes take time to get through to production, the problem of changing requirements gets exacerbated the slower delivery is. Also: don’t design everything up front. Figure out small chunks you can deliver and show people. This is difficult when you are building backend rails for something, but you can demo API payloads. Even semi-technical or non-technical people can understand fundamental concepts and spot omissions early: “why are you asking for x here? We don’t know that at this point of the customer journey”. Naming your API request and response objects the way the business would – i.e. ubiquitous language – makes this process a lot easier. Get feedback early. Keep a decision log. You don’t need diagrams of every single class, but you do need a glossary and a decision log.
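To make that concrete, here is a small sketch – in Python, with hypothetical names from an imagined credit-application domain – of a payload object named in the business’s ubiquitous language, the kind of thing a product owner can read and challenge:

```python
from dataclasses import dataclass, asdict

# Hypothetical payload for an imagined credit-check API, named in the
# business's own vocabulary rather than technical jargon, so a product
# owner can spot omissions at a glance.
@dataclass
class CreditCheckRequested:
    applicant_reference: str
    requested_credit_limit: int   # in whole pounds
    employment_status: str

# A business reviewer can read this and immediately ask "why do we need
# employment status at this point of the customer journey?"
example = asdict(CreditCheckRequested(
    applicant_reference="APP-1234",
    requested_credit_limit=5000,
    employment_status="self-employed",
))
```

The field names carry the conversation; a payload full of `val1`, `flags` and `meta` would invite no questions at all.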

Building the thing right

I have banged on about engineering practices on here before, so there is nothing really new here. Fundamentally, the main things missing are tests – anything from unit, to feature, to integration, to contract – not to mention performance tests. Now, ideally you write these tests and run them automatically. But with contract tests, for instance, the amount of ceremony you have to set up for your first automated contract test, plus its limited value until everything is contract tested, means that a fair trade-off can be to agree to test contracts manually. The point is, you will have these tests whether they are automated or not, or else you will release code that does not work. The later you have the epiphany that test-first is superior, the more legacy you will have that is hard to test and hard to change.
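To illustrate the trade-off, a minimal hand-rolled contract check could look something like this – a sketch only, with hypothetical field names, nowhere near a full tool like Pact, but runnable by either team before a release:

```python
# The consumer team records the fields and types it depends on; the
# provider team runs this against a real (or stubbed) response before
# releasing. All names are illustrative.
EXPECTED_CONTRACT = {
    "order_id": str,
    "status": str,
    "total_pence": int,
}

def check_contract(response: dict, contract: dict) -> list[str]:
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    for field, expected_type in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(f"wrong type for {field}")
    return violations

# The provider may add fields freely; the consumer only checks what it uses:
ok = check_contract(
    {"order_id": "A1", "status": "paid", "total_pence": 999, "extra": True},
    EXPECTED_CONTRACT,
)
# A renamed field is caught here, not in production:
broken = check_contract(
    {"id": "A1", "status": "paid", "total_pence": 999},
    EXPECTED_CONTRACT,
)
```

Note the asymmetry: extra fields pass, missing or renamed ones fail – which is exactly the tolerance you want between teams.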

Even if you are stuck with legacy code that is hard to test, you can always test manually. I prefer to have test specialists on a team that I can ask for help, because their devious minds come up with better tests, and then I just perform those tests when I think I’m done. You hate manual testing? Good, good! Use the hate, automate – but there is really no reason not to test.

If there are other parts of the business calling an API you are changing, never break an interface – always version. It doesn’t matter how you try to communicate; people are busy with their own stuff, and they will notice you broke them way too late. Of course, contract tests should catch this, but why tempt fate. Cross-team collaboration is hard; if you can put some of these contracts behind some form of validation, you will save a lot of heartache.
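A sketch of what “never break, always version” can look like – hypothetical routes and handlers, not tied to any particular framework:

```python
# Additive versioning: the old route keeps its exact old shape, and new
# consumers opt in to v2. Route paths and handlers are hypothetical.
def get_customer_v1(customer_id: str) -> dict:
    # Frozen contract: existing callers rely on this exact shape.
    return {"id": customer_id, "name": "Ada Lovelace"}

def get_customer_v2(customer_id: str) -> dict:
    # New shape with split name fields; v1 callers are unaffected.
    return {"id": customer_id, "given_name": "Ada", "family_name": "Lovelace"}

ROUTES = {
    "/v1/customers": get_customer_v1,
    "/v2/customers": get_customer_v2,
}
```

The v1 handler can be deleted only once its last caller has migrated – which is a conversation, not a deploy.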

Operation

I have addressed in previous posts the olden-day triumvirate of QA, Ops and Dev and how they were opposing forces: Ops never wanted to make any changes, QA would have preferred you made smaller changes, and Devs just churned out code, throwing it over the wall. Recently it is not as bad – the DevOps culture attempts to build unity and decentralise so that teams can responsibly operate their own software – but a lot of the time there is a specific organisation that handles deployments separately from the development team. Partly this can be interpreted as being required by ITIL, and it gives operations a final chance to protect the business from poor developer output, but like all gatekeepers and steps, it adds a column on the board where tickets gather. That makes for a bigger release when it finally hits production, and a bigger changeset means a bigger surface area and more problems.

The key problem with running a service is understanding its state and alerting on it. If it takes you a few hours to know that a service isn’t performing well, you are at a disadvantage. There is a trade-off between the amount of data you produce and the value it brings.

Once you can quickly detect that a release went bad and quickly roll it back – ideally automatically – you will save everyone a lot of time and improve the execution speed of the whole department. It is very important. If not, you will further alienate your users and their managers, which is even worse, politically.
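The detect-and-roll-back loop can be sketched as follows; `probe` and `rollback` are stand-ins for whatever your real health check and deployment tooling provide:

```python
# Toy sketch of an automated post-deploy check: probe the new release's
# health a few times and roll back if failures exceed a threshold.
def verify_release(probe, rollback, samples: int = 5, max_failures: int = 1) -> bool:
    """Return True if the release is kept, False if it was rolled back."""
    failures = sum(1 for _ in range(samples) if not probe())
    if failures > max_failures:
        rollback()
        return False
    return True

# A healthy release stays in place:
kept = verify_release(probe=lambda: True, rollback=lambda: None)

# A failing one triggers the rollback automatically, no human paged:
rolled_back_to = []
verify_release(probe=lambda: False, rollback=lambda: rolled_back_to.append("v41"))
```

Real systems key this off error rates and latency percentiles rather than a boolean probe, but the shape of the loop is the same.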

Technical advancements

People may argue that I have been telling you that lacking development productivity is the fault of basically everyone but the developers, but… come on – surely we can just sprinkle some AI on this and be done with it? Or use one of those fancy low-code solutions? Surely we can avoid the problem of producing software slowly by not actually writing code?

The only line of code guaranteed to be bug free is the one not written. I am all for focusing on your core business and writing only the code you need to solve the business problem at hand. Less code to write sounds like less code to review and maintain.

Now, I won’t speak to all low-code solutions, because I have tended to work in high-code(?) environments for the last decade and a bit, but the ones I have seen glimpses of look very powerful: slap a text box on a canvas and – bosh – you have a field stored in a table. The people writing applications with these platforms will become very skilled at producing them quickly.

But will all your software live on this platform? What happens to the legacy apps that slow you down today – will they not require changes in the future as well? Will the teams that currently struggle to avoid breaking shared APIs fare better with strings in the text box of a website? Are you responsible for hosting this platform, or is it SaaS, and how is data stored? Will the BI team try to break into a third-party database to ingest data, or will they accept some kind of API? And what about recently written microservices that have yet to pay back their cost of development – bin them?

Conclusion

My belief is that the quest for developer productivity comes from a desire to reduce the time between idea and code running well in production. If that lead time was drastically cut, nobody on the other side of the business is going to care what developers do with their time.

Although development productivity is affected by the care and attention applied by individual developers, given the complexities of a software development department and the constraints of the development workflow, productivity is a function of the execution of the whole department. If your teams are cross-functional and autonomous you can hold them accountable at team level, but that demands a transparency around cost that takes engineering effort to acquire.

The speed with which features come to production is only to a limited extent affected by the speed with which developers write code, and if you do not address political and organisational bottlenecks, you may not see any improvements if you go low-code or AI (“vibe coding”).

Our job as older people within a software development function is to do what we can to make features reach the business faster, regardless of how they are implemented. As always: measure first, optimise at the bottleneck, then measure again.

How small is small?

I have great respect for the professional agile coach and scrum master community. Few people seem to systematically care for both humans and business, maintaining profitability without ever sacrificing the humans. Now, however, I will alienate vast swathes of them in one post. Hold on.

What is work in software development?

Most mature teams do two types of work: they look after a system and make small changes to it – maintenance, keeping the lights on – and they build the new features the business claims it wants. It is common to get an army in to build a new platform and then allow the teams to naturally attrit as transformational project work fizzles out: contractors leave, and the most marketable developers either get promoted out of the team or get better offers elsewhere. A small stream of fine-adjustment changes keeps coming in to the core team of – effectively – maintenance developers that remains. Eventually this maintenance development work gets outsourced abroad.

A better way is to have teams of people that work together all day, every day. Don’t expand and contract or otherwise mess with teams; hire carefully from the beginning and keep new work flowing into teams rather than restructuring after a piece of work is complete. Contract experts to pair with the existing team if you need to teach the team new technology, but don’t get mercenaries to do the work. It might be slower, but if you have a good team that you treat well, the odds are better that they’ll stay, and they will develop new features so they can be maintained better in the future, with less temptation to cut corners, as any shortcuts will blow up in their own faces shortly after.

Why do we plan work?

When companies spend money on custom software development, a set of managers at very high positions within the organisation have decided that investing in custom software is a competitive advantage, and several other managers think they are crazy to spend all this money on IT.

To mollify the greater organisation, there is some financial oversight and budgeting. Easily communicated projects are sold to the business “we’ll put a McGuffin in the app”, “we’ll sprinkle some AI on it” or similar, and hopefully there is enough money in there to also do a bit of refactoring on the sly.

This pot of money is finite, so there is strong pressure to keep costs under control, don’t get any surprise AWS bills or middle managers will have to move on. Cost runaway kills companies, so there are legitimately people not sleeping at night when there are big projects in play.

How do we plan?

Problem statement

Software development is very different from real work. If you build a physical thing, anything from a phone to a house, you can make good use of a detailed drawing describing exactly how the thing is constructed and the exact properties of the components that are needed. These specifications are useful both for construction and for maintenance.

If you write the exact same piece of software twice, you have some kind of compulsive issue, you need help. The operating system comes with commands to duplicate files. Or you could run the compiler twice. There are infinite ways of building the exact same piece of software. You don’t need a programmer to do that, it’s pointless. A piece of software is not a physical thing.

Things change, a lot. Fundamentally, people don’t know what they want until they see it, so even if you did not have the problem of technology changing underneath your feet whilst developing software, you would still have the problem that people did not know what they wanted back when they asked you to build something.

The big issue, though, is technology change. Back in the day, computer manufacturers would have the audacity to evolve the hardware in ways that made you have to re-learn how to write code. High-level languages came along, and now instead we live with Microsoft UI frameworks or Javascript frameworks that are mandatory one day and obsolete the next. Things change.

How do you ever successfully plan to build software, then? Well… we have tried to figure that out for seven decades. The best general concept we have arrived at so far is iteration: deliver small chunks over time rather than trying to deliver all of it at once.

The wrong way

One of the most well-known but misunderstood papers is Managing The Development of Large Software Systems by Dr Winston W Royce, which launched the concept of Waterfall.

Basically, waterfall outlines the software development process in distinct phases:

  1. System requirements
  2. Software requirements
  3. Analysis
  4. Program design
  5. Coding
  6. Testing
  7. Operations

For some reason people took this as gospel for several decades, despite the fact that the core, fundamental problem that dooms the process to failure is outlined right below figure 2 – the pretty waterfall illustration of the phases above that people keep referring to. It says:

I believe in this concept, but the implementation described above is risky and invites failure. The problem is illustrated in Figure 4. The testing phase which occurs at the end of the development cycle is the first event for which timing, storage, input/output transfers, etc., are experienced as distinguished from analyzed. These phenomena are not precisely analyzable. They are not the solutions to the standard partial differential equations of mathematical physics for instance. Yet if these phenomena fail to satisfy the various external constraints, then invariably a major redesign is required. A simple octal patch or redo of some isolated code will not fix these kinds of difficulties. The required design changes are likely to be so disruptive that the software requirements upon which the design is based and which provides the rationale for everything are violated. Either the requirements must be modified, or a substantial change in the design is required. In effect the development process has returned to the origin and one can expect up to a 100-percent overrun in schedule and/or costs.

Managing The Development of Large Software Systems, Dr Winston W Royce

Reading further, Royce realises that a more iterative approach is necessary, as pure waterfall is impossible in practice. His legacy, however, was not that.

Another wrong way – RUP

Rational Rose and the Rational Unified Process were the ChatGPT of the late nineties and early noughties. Basically, if you would only make a UML drawing in Rational Rose, it would give you a C++ program that executed. It was magical. Before PRINCE2 and SAFe, everyone was RUP certified. You had loads of planning meetings, wrote elaborate Use Cases on index cards, and eventually you had code. It sounds like waterfall with better tooling.

Agile

People realised that when things are constantly changing, it was doomed to start with a fixed plan and stay on it even when you knew the original goal was unattainable or undesirable. Loads of attempts were made, but one day some people got together to have a proper go at defining what the true way forward should be.

In February 11-13, 2001, at The Lodge at Snowbird ski resort in the Wasatch mountains of Utah, seventeen people met to talk, ski, relax, and try to find common ground—and of course, to eat. What emerged was the Agile ‘Software Development’ Manifesto. Representatives from Extreme Programming, SCRUM, DSDM, Adaptive Software Development, Crystal, Feature-Driven Development, Pragmatic Programming, and others sympathetic to the need for an alternative to documentation driven, heavyweight software development processes convened.

History: The Agile Manifesto

So – everybody did that, and we all lived happily ever after?

Short answer: no. You don’t get to just spend cash – i.e. have developers do work – without making it clear what you are spending it on, why, and how you intend to know that it worked. Completely unacceptable, people thought.

The origins of tribalism within IT departments have been done to death in this blog alone, so for once they will not be rehashed. Suffice to say, staff are often organised according to their speciality rather than in teams that produce output together. Budgeting is complex, and there can be political competition that is counterproductive for IT as a whole or for the organisation as a whole.

Attempts at running a midsize to large IT department that develops custom software have been made in the form of the Scaled Agile Framework (SAFe), DevOps and SRE (where SRE addresses the problem backwards, from running black-box software using monitoring, alerts, metrics and tracing to ensure operability and reliability of the software).

As part of some of the original frameworks that came in with the Agile Manifesto, a bunch of practices became part of Agile even though they were not “canon”. User Stories, for instance, were said to be a few words on an index card, pinned to a noticeboard in the team office – just wordy enough to help you discuss a problem directly with your user. This of course eventually developed back into the verbose RUP Use Cases of yesteryear – but “agile, because they are in Jira” – and rules had to be created for the minimum amount of information needed to successfully deliver a feature. In the Toyota Production System that originated Scrum, Lean Software Development and Six Sigma (sadly, an antipattern), one of the key lessons is that the ideal batch size is 1, and that smaller changes are generally better. This explosion in the size of the user story is symptomatic of the remaining problems in modern software development.

Current state of affairs

So what do we do

As you can surmise from the previous paragraphs, we did not fix it for everybody; we still struggle to reliably make software.

The story and its size problems

The part of this blog post that will alienate the agile community is coming up. The units of work are too big. You can’t release something that is not a feature. Something smaller than a feature has no value.

If you work next to a normal human user, and they say – to offer an example – “we keep accidentally clicking on this button, so we end up sending a message to the customer too early, we are actually just trying to get to this area here to double-check before sending”, you can collaboratively determine the correct behaviour, make it happen, release in one day, and it is a testable and demoable feature.

Unfortunately, requirements tend to be much bigger and less customer-facing. Say department X wants to start seeing the reasons for turning down customer requests in their BI tooling; that is the feature, and a “product backlog item” could then be that service A and service B need to post messages on a message bus at various points of the user flow, identifying those reasons.

Iterating over and successfully releasing this style of feature to production is hard.

Years ago I saw Allen Holub speaking at SD&D in London, and his approach to software development is very pure. It is both depressing and enlightening to read the flame wars that erupt in his mentions on Twitter when he explains how to successfully make and release small changes. People scream and shout that it is not possible to do it his way.

In the years since, I have come to realise that nothing is more important than making smaller units of work. We need to make smaller changes. Everything gets better if – when – we succeed. It requires a mindset shift: a move away from big detailed backlogs to smaller changes, discussed directly with the customer (in the XP sense – probably some other person in the business, or another development team). To combat the uncertainty, it is possible to mandate some kind of documentation update (graph? chart?) as part of the definition of done. Yes, needless documentation is waste, but if we need to keep a map of how the software is built, it is useful – as long as people actually consult it. We don’t need any further artefacts of the story once the feature is live in production anyway.

How do we make smaller stories?

This is the challenge for our experts in agile software development. Teach us; be bothered; ignore the sighs of the developers that still do not understand, the ones raging in Allen Holub’s mentions. I promise, they will understand when they see it first hand: daily releases of bug-free code. They think people are lying when they hear us talk about it. When they experience it, though, they will love it.

When every day means a new story in production, you also get predictability. As soon as you are able to split incoming or proposed work into daily chunks, you gain the ability to forecast – roughly, but better than most other forms of estimate – and since you deliver the most important new thing every day, you give the illusion of value back to those that pay your salary.
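Once stories are day-sized, the forecasting arithmetic really is just counting – a toy sketch with illustrative numbers:

```python
# With day-sized stories, a forecast is remaining work divided by
# observed throughput. The numbers here are purely illustrative.
def forecast_days(remaining_stories: int, completed_last_10_days: int) -> float:
    throughput_per_day = completed_last_10_days / 10
    return remaining_stories / throughput_per_day

# A team that shipped 8 stories over the last 10 working days,
# with 24 stories left, is roughly 30 working days from done:
estimate = forecast_days(24, 8)
```

No story points, no planning poker – the slicing discipline does the estimating for you.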

Abstractions, abstractions everywhere

X, X everywhere meme template, licensed by imgflip.com

All work in software engineering is about abstractions.

Abstractions

All models are wrong, but some are useful

George Box

It began with assembly language: when people were tired of writing large-for-their-time programs in raw binary instructions, they made a language that basically mapped each binary instruction to a text value, along with an app that would translate the text to raw binary and punch cards. Not a huge abstraction, but it started there. Then came high-level languages, and off we went. Now we can conjure virtual hardware out of thin air with regular programming languages.

The magic of abstractions really gives you amazing leverage, but at the same time you sacrifice actual knowledge of the implementation details, meaning you often get exposed to obscure errors that you either have no idea what they mean or – even worse – understand exactly what is wrong but have no access to make the change, because the source is just a machine-translated piece of Go and there is no way to fix the translated C# directly, to take one example.

Granularity and collaboration across an organisation

Abstractions in code

Starting small

Most systems start small, solving a specific problem. This is done well, the requirements grow, people begin to understand what is possible, and features accrue. A monolith is built, and it is useful. For a while things will be excellent, features will be added at great speed, and developers might be added along the way.

A complex system that works is invariably found to have evolved from a simple system that worked

John Gall

Things take a turn

Usually, at some point some things go wrong – or auditors get involved, because regulatory compliance – and you prevent developers from deploying to production, hiring gatekeepers to protect the company from the developers. In the olden days – hopefully not anymore – you hired testers to do manual testing to cover a shortfall in automated testing. Now you have a couple of hand-offs within the team: people write code and give it to testers, who find bugs, so work goes the wrong way – backwards – for developers to clean up their own mess and try again. Eventually something will be available to release, and the gatekeepers will grudgingly allow a change to happen, at some point.

This leads to a slow-down in the feature factory. Some old design choices may cause problems that further slow the pace of change, or – if you’re lucky – you simply have too many developers in one team and somehow have to split them up into different teams, which means comms deteriorate and collaborating in one codebase becomes even harder. With the existing change prevention, struggles with quality and now poor cross-team communication, something has to be done to clear a path so that the two groups of people can collaborate effectively.

Separation of concerns

So what do we do? Well, every change needs to be covered by some kind of automated test, if only to guarantee, at first, that you aren’t making things worse. This way you can refactor the codebase to a point where the two groups have separate responsibilities and collaborate over well-defined API boundaries, for instance – separate deployable units, so that teams are free to deploy on their own schedule.
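That first step – pinning down what the code currently does before touching anything – is essentially a characterisation test. A minimal sketch, with an invented legacy routine standing in for the real code:

```python
# A characterisation test records what legacy code *currently* does,
# quirks and all, so a refactor can be verified behaviour-preserving.
# `legacy_price` is a hypothetical stand-in for some untested routine.
def legacy_price(quantity: int) -> int:
    # Quirky legacy behaviour: a flat 50 discount kicks in at exactly 10.
    price = quantity * 100
    return price - 50 if quantity >= 10 else price

# Record the current outputs as the contract before changing anything:
CHARACTERISATION = {1: 100, 9: 900, 10: 950, 11: 1050}

def test_legacy_behaviour_unchanged():
    for qty, expected in CHARACTERISATION.items():
        assert legacy_price(qty) == expected
```

Whether the discount rule is *right* is a separate conversation with the business; the test only asserts that the refactor did not silently change it.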

If we can get better collaboration with early test design, front-load test automation, and befriend the ops gatekeepers to wire in monitoring so that teams are fully plugged in to how their products behave in the live environment, we would be close to optimum.

Unfortunately, this is very difficult. Taking a pile of software and making sense of it, deciding how to split it up between teams, and gradually separating out features can be too daunting to really get started. You don’t want to break anything, and if you – as many are wont to do, especially when new in an organisation – decide to start over from scratch, you may run into one or more of the problems that occur when attempting a rewrite, one example being that you end up in a competition against a moving target. To stop that competition, the same team has to own a feature in both the old and the new codebase. For some companies it is simply worth the risk: they are aware they are wasting enormous sums of money, but they still accept the cost. You would have to be very brave.

Abstractions in Infrastructure

From FTP-from-within-the-editor to Cloud native IaC

When software is being deployed – and I am largely ignoring native apps now, focusing on web applications and APIs – there are a number of things actually happening that are at this point completely obscured by layers of abstraction.

The metal

The hardware needs to exist. This used to be a very physical thing: a brand new HP ProLiant howling in the corner of the office, onto which you installed a server OS and set up networking so you could deploy software on it, before plugging it into a rack somewhere – probably a cupboard, hopefully with cooling and UPS. Then VM hosts became a thing, so you provisioned apps using VMware or similar and got to be surprised at how expensive enterprise storage is per GB compared to commodity hardware. This could be done via the VMware CLI, but most likely an ops person pointed and clicked.

Deploying software

Once the VM was provisioned, things like Ansible, Chef and Puppet began to become a thing, abstracting away the stopping of websites, the copying of zip files, the unzipping, the rewriting of configuration and the restarting of the web app into a neat little script. Already here you see problems where “normal” issues, like a file being locked by a running process, show up as a very cryptic error message that the developer might not understand. You start to see cargo cult, where people blindly copy things from one app to another because they think two services are the same, without understanding the details. Most of the time that’s fine, but it can also be problematic with a bit of bad luck.
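To make it concrete, the kind of deploy those tools wrap up looks roughly like the sketch below – all paths, config values and the service name are made up for illustration, and temp directories stand in for the real ones so the sketch runs anywhere:

```shell
# A hand-rolled deploy of the kind Chef/Puppet/Ansible neatly abstract.
# Everything here is hypothetical; temp dirs stand in for real paths.
set -e

APP_DIR=$(mktemp -d)       # stand-in for e.g. /var/www/myapp
RELEASE_DIR=$(mktemp -d)   # stand-in for the unpacked release artefact

# 1. Stop the running app (skipped in this sketch):
#    systemctl stop myapp

# 2. Copy and unpack the release; here we fake one config template file:
echo 'db_server=PLACEHOLDER' > "$RELEASE_DIR/app.conf"
cp -r "$RELEASE_DIR"/. "$APP_DIR"/

# 3. Rewrite the configuration for this environment:
sed -i 's/PLACEHOLDER/prod-db.internal/' "$APP_DIR/app.conf"

# 4. Restart the app (skipped in this sketch):
#    systemctl start myapp

cat "$APP_DIR/app.conf"    # -> db_server=prod-db.internal
```

Every step that can fail here – a locked file, a permissions issue, a missing directory – is exactly the kind of “normal” problem that surfaces as a cryptic error once a tool hides the steps from you.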

Somebody else’s computer

Then cloud came, and all of a sudden you did not need to buy a server up front – you could instead rent as much server as you needed. Initially, all you had was VMs, so your Chef/Puppet/Ansible worked pretty much the same as before, and each cloud provider offered a different way of provisioning virtual hardware before you came to the point where the software deployment mechanism came into play. More abstractions to fundamentally do the same thing. Harder to analyse any failures – you sometimes have to dig out a virtual console just to see why or how an app is failing, because it’s not even writing logs. Abstractions may exist, but they often leak.

Works on my machine-as-a-service

Just like the London Pool and the Docklands were rendered derelict by containerisation, a lot of people’s accumulated skills in Chef and Ansible have been rendered obsolete as app deployments have become smaller: each app is simply unzipped on top of a brand new Linux OS, sprinkled with some configuration, and the resulting image is pushed to a registry somewhere. On one hand, it’s very easy. If you can build the image and run the container locally, it will work in the cloud (provided the correct access is provisioned – but at least AWS offers a fake service that lets you dry run the app on your own machine and test various role assignments to make sure IAM is also correctly set up). On the other hand, somehow the “metal” is locked away even further and you cannot really access a console anymore, just a focused log viewer that lets you see only events related to your ECS task, for instance.
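The whole inner loop fits in a handful of commands – the registry name and tag scheme below are invented for the example, and the docker commands themselves are left commented so the sketch stands alone:

```shell
# The build-run-push loop, with a hypothetical registry and tag scheme.
set -e

IMAGE="registry.example.com/myteam/webapp"            # made-up registry/repo
TAG="$(git rev-parse --short HEAD 2>/dev/null || echo dev)"  # git sha, or "dev"

echo "building ${IMAGE}:${TAG}"

# The actual commands (commented so this sketch runs without a docker host):
# docker build -t "${IMAGE}:${TAG}" .
# docker run --rm -p 8080:80 "${IMAGE}:${TAG}"   # try it locally first
# docker push "${IMAGE}:${TAG}"                  # then ship the exact same image
```

The point of the abstraction is that the image you ran locally is byte-for-byte the image the cloud runs – which is why “works on my machine” finally became a feature rather than an excuse.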

Abstractions in Organisations

The above tales of ops vs test vs dev illustrate the problem of structuring an organisation incorrectly. If you structure it per function you get warring tribes and very little progress, because one team doesn’t want any change at all in order to maintain stability, another gets held responsible for every problem customers encounter, and the third just wants to add features. If you instead structured the organisation for business outcome, everyone would be on the same team working towards the same goals with different skill sets – so the way you think of the boxes in an org chart can have a massive impact on real world performance.

There are no solutions, only trade-offs, so consider the effects of sprinkling people of various backgrounds across the organisation. If, instead of being kept in the cellar as usual, your developers start proliferating among the general population of the organisation, how do you ensure that every team follows the agreed best practices, and that no corners are cut even when a non-technical manager is demanding answers? How do you manage the performance of developers you have to go out of your way to see? I argue such things are solvable problems, but do ask your doctor if reverse Conway is right for you.

Conclusion

What is a good abstraction?

Coupling vs Cohesion

If a team can do all of their day-to-day work without waiting for another team to deliver or approve something – if there are no hand-offs – then they have high cohesion. All the things needed are to hand. If the rest of the organisation also understands what this team does, and there is no confusion about which team to go to with this type of work, even better. Cohesion is a good thing.

If, however, one team is constantly worrying about what another team is doing, or where certain tickets are in their sprint, in order to schedule their own work, then you have high coupling and time is wasted. Some work has to be moved between teams, or the interface between the teams has to be made more explicit, in order to reduce this interdependency.

In Infrastructure, you want the virtual resources associated with one application to be managed within the same repository/area to offer locality and ease of change for the team.

Single Responsibility Principle

While dangerous to over-apply within software development (you get more coupling than cohesion if you are too zealous), this principle is generally useful within architecture and infrastructure.

Originally meaning that one class or method should only do one thing – an extrapolation of the UNIX principles – it can more generally be said to mean that on a given layer of abstraction, a team, infrastructure pipe, app, program, class […] should have one responsibility. This usually means a couple of things happen, but they conceptually belong together. They have the same reason to change.

What – if any – pitfalls exist ?

The major weakness of most abstractions is when they fall apart, when they leak. Not having access to a physical computer is fine as long as the deployment pipeline is working and the observability is wired up correctly, but when it falls down you still need to be able to see console output, you need to understand how networking works, to some extent, and you need to understand what obscure operating system errors mean. Basically, when things go really wrong, you need to have already run that app on that operating system before, so that you recognise the error messages and have some troubleshooting steps memorised.
So although we try to save our colleagues from the cognitive load of having to know everything we were forced to learn over the decades, to spare them the heartache, they still need to know. All of it. So yes – with the proliferation of layers of abstraction, the trick is to pick the correct ones, and to keep the total bundle of layers as lean as possible, because otherwise someone will want to simplify or clarify these abstractions by adding another layer on top, and the cycle begins again.

GitHub Action shenanigans

When considering which provider to use to cut and polish the rough stones that are your deployable units of code into the stunningly clear diamonds they deserve to be, you have probably considered CircleCI, Azure DevOps, GitHub Actions, TeamCity and similar.

After playing with GitHub Actions for a bit, I’m going to comment on a few recent experiences.

Overall Philosophy

Unlike TeamCity, but like CircleCI and – to some extent – Azure DevOps, it’s all about what is in the yaml. You modify your code and the way it gets built in the same commit – which is the way God intended it.

There are countless benefits to this strategy, over that of TeamCity where the builds are defined in the UI. That means that if you make a big restructuring of the source repository but need to hotfix a pre-restructure version of the code, you had better have kept an archived version of the old build chain or you will have a bad day.

There is a downside, though. The artefact management and chaining in TeamCity is extremely intuitive, so if you build an artefact in one chain and deploy it in the next, it is really simple to make work like clockwork. You can achieve this easily with ADO too, but those are predictably the bits that require some tickling of the UI.

Now, is this a real problem? Should not – in this modern world – builds be small and self-contained? Build-and-push to a docker registry, [generic-tool] up, Bob’s your uncle? “Your artefact stuff and separate build/deployment pipelines smack of legacy – what are we, living in the past?!” you exclaim.

Sure, but… Look, the various hallelujah solutions that offer “build-and-push-and-deploy”, you know as well as I do that at some point they are going to behave unpredictably, and all you can tell is that the wrong piece of code is running in production with no evidence offered as to why.

“My kingdom for a logfile” as it is written, so – you want to separate the build from the deploy, and then you need to stitch a couple of things together and the problems start.

Complex scenarios

When working with ADO, you can name builds (in the UI) so that you can reference their output from the yaml, and move on from there, to identify the tag of the docker container you just built and reference it when you are deploying cloud resources.

What about GitHub Actions?

Well…

Allegedly, you can define outputs, or you can create reusable workflows so that your “let’s build cloud resources” bit of yaml can be shared; in case you have multiple situations (different environments?) that you want to deploy at the same time, you can avoid duplication.

There are a couple of gotchas, though. If you define a couple of outputs in a workflow for returning some docker image tags for later consumption, they … exist, somewhere? Maybe. You may first discover that your tags are disqualified from being used as output in a step because they contain a secret(!), which in the AWS case can be resolved by supplying an undocumented parameter to the AWS Login action, encouraging it to not mask the account number. The big showstopper imho is the scenario where you just want to grab some metadata from a historic run of a separate workflow file to identify which docker images to deploy – that doesn’t seem to be clearly supported.
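For reference, the mechanics of step outputs look roughly like this – simulated in plain shell, since in a real workflow the runner provides the GITHUB_OUTPUT file and the `${{ steps.<id>.outputs.* }}` plumbing; the image name below is invented:

```shell
# Simulating how a GitHub Actions step publishes an output.
# In a real job the runner sets GITHUB_OUTPUT; here we fake it with mktemp.
GITHUB_OUTPUT=$(mktemp)

# Inside a step with id "build" you would write something like:
echo "image_tag=123456789012.dkr.ecr.example/webapp:abc123" >> "$GITHUB_OUTPUT"

# A later step reads it as ${{ steps.build.outputs.image_tag }} -- unless
# the value contains something registered as a secret (such as an AWS
# account id), in which case the output is silently dropped.
cat "$GITHUB_OUTPUT"
```

The silent-drop behaviour is what makes the secret-masking gotcha so confusing to debug: nothing fails, the output simply never arrives.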

The idea for GitHub Actions workflows seems to be – at least at the time of writing – that you do all the things in one file, in one go, possibly with some flow control to pick which steps get skipped. There is no support for the legacy concept of “OK, I built it now, and deployed it to my test environment” – some manual testing happens – “OK, it was fine, of course, I was never worried” – you deploy the same binaries to live. “Ah HAH! You WERE just complaining about legacy! I KNEW IT!” you shout triumphantly. Fair cop, but society is to blame.

If you were to consider replacing Azure DevOps with GitHub Actions for anything even remotely legacy, please be aware that things could end up being dicey. Imho.

If I’m wrong, I’m hoping to leverage Cunningham’s Law to educate me, because googling and reading the source sure did not reveal any magic to me.

.NET C# CI/CD in Docker

Works on my machine-as-a-service

When building software in the modern workplace, you want to automatically test and statically analyse your code before pushing it to production. Rather than tens of test environments and an army of manual testers, you have a bunch of automation that runs as close as possible to when the code is written. Tests are run, the proportion of code not covered by automated tests is calculated, test results are published to the build server user interface (so that in the event that – heaven forbid – tests are broken, the developer gets as much detail as possible to resolve the problem), and static analysis of the built piece of software is performed to make sure no known problematic code has been introduced by ourselves, and to verify that the dependencies included are free from known vulnerabilities.

The classic Dockerfile added when an ASP.NET Core Web API project is created features a multi-stage build layout, where an initial layer includes the full .NET SDK and is where the code is built and published. The next layer is based on the lightweight .NET runtime; the output directory from the build layer is copied here and the entrypoint is configured, so that the website starts when you run the finished docker image.

Even tried multi

Multistage builds were a huge deal when they were introduced. You get one docker image that only contains the things you need; any source code is safely binned off in other layers that – sure – are cached, but don’t exist outside the local docker host on the build agent. If you then push the finished image to a repository, none of the source will come along. In the before times you had to solve this with multiple Dockerfiles, which is quite undesirable. You want high cohesion but low coupling, and fiddling with multiple Dockerfiles when doing things like upgrading versions does not give you a premium experience, and invites errors to an unnecessary degree.

Where is the evidence?

Now, when you go to Azure DevOps, GitHub Actions or CircleCI to find out what went wrong with your build, the test results are available because the test runner has produced output in a format that the particular build server can understand. If your test runner is not forthcoming with that information, all you will know is “computer says no” and you will have to trawl through console data – if that – and that is not the way to improve your day.

So – what – what do we need? Well we need the formatted test output. Luckily dotnet test will give it to us if we ask it nicely.

The only problem is that those files will stay on the image that we are binning – you know, multistage builds and all that – since we don’t want them to show up in the finished, supposedly slim, article.

Old world Docker

When a docker image is built, every relevant change will create a new layer, and eventually a final image will be created and published that is an amalgamation of all constituent layers. In the olden days, the legacy builder would cache all of the intermediate layers and publish a hash in the output, so that you could refer back to intermediate layers should you so choose.

This seems like the perfect way of forensically finding the test result files we need. Let’s add a LABEL so that we can find the correct layer after the fact, copy the test data output and push it to the build server.

FROM mcr.microsoft.com/dotnet/aspnet:7.0-bullseye-slim AS base
WORKDIR /app
FROM mcr.microsoft.com/dotnet/sdk:7.0-bullseye-slim AS build
WORKDIR /
COPY ["src/webapp/webapp.csproj", "/src/webapp/"]
COPY ["src/classlib/classlib.csproj", "/src/classlib/"]
COPY ["test/classlib.tests/classlib.tests.csproj", "/test/classlib.tests/"]
# restore for all projects
RUN dotnet restore src/webapp/webapp.csproj
RUN dotnet restore src/classlib/classlib.csproj
RUN dotnet restore test/classlib.tests/classlib.tests.csproj
COPY . .
# test
# install the report generator tool
RUN dotnet tool install dotnet-reportgenerator-globaltool --version 5.1.20 --tool-path /tools
RUN dotnet test --results-directory /testresults --logger "trx;LogFileName=test_results.xml" /p:CollectCoverage=true /p:CoverletOutputFormat=cobertura /p:CoverletOutput=/testresults/coverage/ /test/classlib.tests/classlib.tests.csproj
LABEL test=true
# generate html reports using report generator tool
RUN /tools/reportgenerator "-reports:/testresults/coverage/coverage.cobertura.xml" "-targetdir:/testresults/coverage/reports" "-reporttypes:HTMLInline;HTMLChart"
RUN ls -la /testresults/coverage/reports
 
ARG BUILD_TYPE="Release" 
RUN dotnet publish src/webapp/webapp.csproj -c $BUILD_TYPE -o /app/publish
# Package the published code as a zip file, perhaps? Push it to a SAST?
# Bottom line is, anything you want to extract forensically from this build
# process is done in the build layer.
FROM base AS final
WORKDIR /app
COPY --from=build /app/publish .
ENTRYPOINT ["dotnet", "webapp.dll"]

The way you would leverage this test output is by fishing the temporary layer out of the cache and assigning it to a new container, from which you can do plain file operations.

# docker images --filter "label=test=true"
REPOSITORY   TAG       IMAGE ID       CREATED          SIZE
<none>       <none>    0d90f1a9ad32   40 minutes ago   3.16GB
# export id=$(docker images --filter "label=test=true" -q | head -1)
# docker create --name testcontainer $id
# docker cp testcontainer:/testresults ./testresults
# docker rm testcontainer

All our problems are solved. Wrap this in a script and you’re done. I did, I mean they did, I stole this from another blog.

Unfortunately keeping an endless archive of temporary, orphaned layers became a performance and storage bottleneck for docker, so – sadly – the Modern Era began with some optimisations that rendered this method impossible.

The Modern Era of BuildKit

Since intermediate layers are mostly useless, letting them fall by the wayside and focusing on actual output was much more efficient according to the powers that be. The use of multistage Dockerfiles to additionally produce test data output was not recommended or recognised as a valid use case.

So what to do? Well – there is a newer command, docker bake, that lets you run docker build for multiple docker images, or – most importantly – build multiple targets from the same Dockerfile.

This means you can run one build all the way through to produce the final lightweight image, and also have a second run that saves the intermediary image full of test results. Obviously the docker cache will make sure nothing is actually run twice; the second run is just about picking the layer out of the cache and making it accessible.

The Correct way of using bake is to format a bake file in HCL format:

group "default" {
  targets = [ "webapp", "webapp-test" ]
}
target "webapp" {
  output = [ "type=docker" ]
  dockerfile = "src/webapp/Dockerfile"
}
target "webapp-test" {
  output = [ "type=image" ]
  dockerfile = "src/webapp/Dockerfile"
  target = "build"
} 

If you run this with docker buildx bake -f docker-bake.hcl, you will be able to fish out the historic intermediary layer using the method described above.

Conclusion

So – using this mechanism you get a minimal number of Dockerfiles, and all the build gubbins happen inside docker, giving you freedom from whatever limitations plague your build agent – yet the bloated mess that is the build process will be automagically discarded and forgotten as you march on into your bright future with a lightweight finished image.

Technical debt – the truth

Rationale

Within software engineering we often talk about technical debt. The term was coined by elder programmer and agilist Ward Cunningham, and likens the trade-offs made when designing and developing software to credit you can take on, but that has to be repaid eventually. The comparison further correctly implies that you have to service your debt every time you make new changes to the code, and that the interest compounds over time. Some companies literally go bankrupt over unmanageable technical debt, because the cost of development goes up as the speed of delivery plummets, and whatever use the software was intended to have is eventually completely covered by a competitor as yet unburdened by technical debt.

But what do you mean technical debt?

The shortcut taken can be a couple of things – usually the number of features, or the time spent per feature. It could mean postponing required features to meet a deadline or milestone, even though it will take longer to circle back and do the work later in the process. If there is no cost to this scheduling change, it’s just good planning; for it to count as technical debt there has to be a cost associated with the rescheduling.

It is also possible to sacrifice quality to meet a deadline. “Let’s not apply test driven development because it takes longer; we can write tests after the feature code instead.” That would mean that instead of iteratively writing a failing test first, followed by the feature code that makes that test pass, we get into a state of flow and churn out code as we solve the problem at varying levels of abstraction, retrofitting tests as we deem necessary. It feels fast, but the tests get big, incomplete, unwieldy and brittle compared to the plentiful, small and specific tests TDD brings. A debt you are taking on to be paid later.

A third kind of technical debt – which I suspect is the most common – is also the one that fits the comparison to financial debt the least. A common way to cut corners is to not continuously maintain your code as it evolves over time. It’s more akin to the cost of not looking after your house as it is attacked by weather and nature, more dereliction than anything else really.

Let’s say your business had a physical product it would sell back when a certain piece of software was written. Now the product sold is essentially a digital license of some kind, but in the source code you still have inventory, shipping et cetera, modified to handle selling a digital product in a way that kind of works – but every time you introduce a new type of digital product you have to write further hacks to make it appear like a physical product as far as the system knows.

The correct way to deal with this would have been to make a more fundamental change the first time digital products were introduced. Maybe copy the physical process at first and cut things out that don’t make sense whilst you determine how digital products work, gradually refactoring the code as you learn.

Interest

What does compound interest mean in the context of technical debt? Let’s say you have created a piece of software, and your initial tech debt in this story is that you are thin on unit tests but have tried to compensate by making more elaborate integration tests. Now the time comes to add an integration: a JSON payload needs to be posted to a third-party service over HTTP, with a bespoke authentication behaviour.

If you had applied TDD, you would most likely have a fairly solid abstraction over the REST payload, so that an integration test could be simple and small.

But in our hypothetical you have less than ideal test coverage, so you need to write a fairly elaborate integration test that needs to verify parts of the surrounding feature along with the integration itself to truly know the integration works.

Like with some credit cards, you have two options on your hypothetical tech debt statement: either build the elaborate new integration test at a significant cost – a day? Three? – or avert your eyes, choose the second, smaller, amount and increase your tech debt principal by not writing an automated test at all, vowing to test this area of code by hand every time you make a change. The technical debt equivalent of a payday loan.

Critique

So what’s wrong with this perfect description of engineering trade-offs? We addressed above how a common type of debt doesn’t fit the debt model very neatly, which is one issue, but I think the bigger problem is that – to the business – we just sound like cowboy builders.

Would you accept that a builder under-specified a steel beam for an extension you are having built? “It’s cheaper, and although it is not up to code, it’ll still take the weight of the side of the house and your kids and a few of their friends. Don’t worry about it.” No, right? Or an electrician getting creative with the earthing of the power shower as it’s Friday afternoon and he had promised to be done by now. Heck no, yes?

The difference of course is that within programming there is no equivalent of a Gas Safe register, no NICEIC et cetera. There are no safety regulations for how you write code, yet.

This means some people will offer harmful ways of cutting corners to people that don’t have the context to know the true cost of the technical debt involved.

We complain that product owners are unwilling to spend budget on necessary technical work, blaming product rather than taking some responsibility ourselves. The business expects us to flag up if there are problems. Refactoring as we go and upgrading third-party dependencies as we go should not be something the business has to care about – just add it to the tickets, as a cost of doing business.

Sure there are big singular incidents such as a form of authentication being decommissioned or a framework being sunset that will require big coordinated change involving product, but usually those changes aren’t that hard to sell to the business. It is unpleasant but the business can understand this type of work being necessary.

The stuff that is hard to sell is the bunched-up refactorings you should have done along the way but didn’t – and now you want to do them because it’s starting to hurt. Tech debt amortisation is very hard to sell, because things are not totally broken now – why do we have to eat the cost of this massive ticket when everything works and is making money? Are you sure you aren’t just gold plating something out of vanity? The budget is finite and product has other things on their mind. Leave it for now, we’ll come back to it (when it’s already fallen over).

The business expects you to write code that is reliable, performant and maintainable. Even if you warn them that you are offering to cut corners at the expense of future speed of execution, a non-developer may have no idea of the scale of the implications of what you are offering.

If they spent a big chunk out of their budget one year – the equivalent of a new house in a good neighbourhood – so that a bunch of people could build a piece of software with the hope that this brand new widget in a website or new line-of-business app will bring increased profits over the coming years, they don’t want to hear roughly the same group of people refer to it as “legacy code” already at the end of the following financial year.

Alternative

Think of your practices as regulations that you simply cannot violate. Stop offering solutions that involve sacrificing quality! Please even.

We are told that making an elaborate big-design-upfront is waterfall and bad – but how about some-design-upfront? Just enough thinking ahead to decide where to extend existing functionality and where to instead put a fork in the road and begin a more separate flow in the code, that you then develop iteratively.

If you have to make the bottom line more appealing to the stakeholders for them to dare invest in making new product through you and not through dubious shadow-IT, try and figure out a way to start smaller and deliver value sooner rather than tricking yourself into accepting work that you cannot possibly deliver safely and responsibly.

Transformers

What are the genuinely difficult aspects of transforming your software function?

It seems everybody intuitively understands what brings speed and short time to market, and how that in turn automatically allows for better innovation. People also seem to get that in the current stale market, with fast enough delivery, you could even forego smarts and just brute-force innovation – launching new concepts and tweaks until profits go up, then declaring a win as if you knew what you were doing the whole time. Secretly people also know that although you could rinse and repeat that naïve approach until retirement, you could instead exert minimum effort and measure a bit better, so that you know what you are doing and can focus your efforts.

So why isn’t everyone moving on this?

When you get a bunch of people in the same organisation you want to achieve some economies of scale and solve common problems once rather than once per team.

This means you delegate some functions into separate teams. Undoing this, or at least mitigating this, is difficult politically. Some people – with some cause – fear for their jobs when reorgs happen.

Sudden unexpected cost runaway is the biggest recurring nightmare of middle managers. Controls are therefore in place to prevent developer cloud spend from ballooning.

Taken together, however, this means teams are prevented from innovating independently: they cannot construct virtual infrastructure as needed because the cost has not been authorised, and they cannot play with new pieces of virtual infrastructure because these haven’t been approved by the central tech authority yet.

Automation and security

There has been a recent spate of sophisticated attacks on software delivery mechanisms, where cyber criminals have had massive success in breaching one organisation to get automatic access to hundreds of thousands of other organisations through the update mechanism the breached organisation provides.

Must consider security at design time

I think it needs reiterating that security needs to be built in by default, from the beginning. I haven’t gone back to check properly, but I know I deleted an old blog post because it had some dubious security practice in it. My new policy is that I would rather omit part of a process than show a dodgy sample. There are so many blog posts you find if you search for “login form asp.net” that don’t even hash passwords. Rather than pointing beginners to the built-in password hashing algorithms available in .NET, and the two lines of code you have to write, they leave some beginners thinking it’s all right, just this once – and breed the basic idea that security is optional; something you test for afterwards if you are building something “important”, not something you think about all the time.

The thing is, we developers have tools that help us do complicated things – like break bits of code out from other bits of code automatically, or rename every construct with a certain name, including surrounding text comments, without also incorrectly renaming unrelated constructs that share the name.

It turns out cyber criminals too have plenty of automation that helps them spend very little effort breaking in to companies, and exploit this access in a number of different ways.

There is maybe no “why”

This has a couple of implications. First off, attackers are probably not looking for you per se. You may be a nobody, but you will still be exposed to automated attacks that test your network for known vulnerabilities and apply automated suites of exploits to see what happens. This means that even if you don’t do anything that could conceivably have value to an attacker, you will still be probed.

The second thing is, to prevent data loss you need to make every step the attacker has to take a hardship. Don’t advertise what software versions your public facing servers are running, don’t let service accounts have access to things beyond what they need, do divide networks into segments so that – for example – one machine with ransomware cannot directly infect your entire network.
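As a trivial illustration of the first point – not advertising versions – a check like the following, with made-up response headers, is roughly what automated scanners do when they fingerprint your servers:

```shell
# A toy version of what automated probes look for: version banners.
# These response headers are invented for the example.
HEADERS='HTTP/1.1 200 OK
Server: Apache/2.4.49 (Unix)
X-Powered-By: PHP/7.4.3'

# Any header pairing a product name with a version number is a gift to an
# attacker matching known CVEs against your stack.
if printf '%s\n' "$HEADERS" | grep -Eqi '^(Server|X-Powered-By): .*[0-9]+\.[0-9]+'; then
  echo "version information leaked"
fi
```

In most web servers the banner can simply be suppressed or genericised in configuration, removing one free clue at next to no cost.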

Defend in depth

Change any business processes that require people to open e-mail attachments as part of their job. Offer services that help people do their job in a more convenient way that is also more secure. You cannot berate people for attempting to do their job. I mean, you can but it is not helpful.

Move backups off site and offline of course, for many reasons. But, do remember that having to recover a massive storage system from a backup can still be an extinction level event for a business even if you do have a working reliable off site backup solution. If you lose a large SAN you may be offline for days, and people will not be able to work, you may need to bring sites offline while storage recovers. When you procure a sophisticated storage solution, do not forget to design a recovery strategy ahead of time for how to rebuild a massive spinning rust storage array from absolute zero while new data is continuously generated. It is a non-trivial design challenge that probably needs tailoring to how your business operates. The point is, avoiding the situation where you need to actually restore your entire storage from tapes is always best.

Next level

Despite the intro, I have so far only mentioned things that any company needs to think about. There are of course organisations that are actually targeted. Financial institutions, large e-retailers or software supply chain companies run a greater risk of being manually targeted by evildoers.

Updates

Designing a secure process for delivering software updates is not trivial, and I am in no position to give direct advice beyond this: if you intend to do it, consider from the beginning how you will track vulnerabilities, how you will effectively pull versions that have been flagged as actively harmful, and how you will support users who have deployed something dodgy. If that day comes, you need to at least be able to help your users. It will still be awful, but if you treat your users right, you might still make it.
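The "pull harmful versions" part implies your update channel needs a kill switch from day one. This is a minimal sketch under assumed names – the denylist structure, versions and reasons are all hypothetical – of the check an update server would run before serving a release. (For a real design, The Update Framework specification covers this ground properly.)

```python
# Sketch of an update-channel kill switch: before serving a release,
# check it against a denylist of versions flagged as actively harmful.
# HARMFUL_VERSIONS and its contents are hypothetical illustrations.

HARMFUL_VERSIONS = {
    "2.3.1": "ships a credential-stealing dependency",
    "2.4.0": "corrupts user data on upgrade",
}

def can_serve(version: str):
    """Return (servable, reason) – reason is None when servable."""
    reason = HARMFUL_VERSIONS.get(version)
    return (reason is None, reason)
```

The useful property is that blocking a version is a data change, not a code deployment – when a release turns out to be poisoned, you want revocation to take minutes, not a release cycle.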

Humans

Your people will be exploited. Every company that has an army of customer service representatives will need to make a trade-off between customer convenience and security. Attacks on customer service reps are very common. If you have high-value clients, people will use you to get to your clients’ money. There is not much to say here, other than that obviously you will be working with the relevant authorities and regulatory bodies, as well as fine-tuning your authentication process so that you ask for confirmation information that is not readily available to an attacker.

Insiders

I don’t have any numbers on this, so I am unsure how big a problem it is, but it is mentioned often in security circles. Basically, humans can be exploited in a different way: employees can be coerced through intimidation, blackmail or bribery into acting maliciously on behalf of an attacker. My suspicion is that this is less common than employers think, and that in cases where an employee was merely stressed or distracted and fell for a phishing e-mail, the employer concluded “that is too obvious a phish, this guy must have been in on it”.

It makes me think of the time when a systemic failure on multiple levels meant that a cleaner accidentally started a commuter train that ran from the depot along the length of the Saltsjöbanan commuter railway – at maximum speed – eventually crashing through the buffers and into a building at the terminus. In addition to her injuries, she suffered headlines like “train stolen and crashed” until the investigation revealed the shocking institutional failings that had made the accident possible. I can’t remember all of them, but they ranged from the practices for how cleaners accessed the trains, to safety controls being disabled as a matter of course, to how trains were stabled, to the fact that points were left set so that a runaway train would actually leave the depot. A shambles. Yet the employer’s first reaction was to blame the cleaner.

Anyway, to return to the matter at hand – yes, although I cannot speculate on the prevalence, it is a risk. Presumably, if you hire right and look after your people, you can get them to come to you when they have messed up and gotten themselves into a compromised situation where they are being blackmailed, or when somebody is leaning on them. Breeding a culture of fear is counterproductive here – let people believe that you will help them rather than fire them and litigate, as long as they come forward voluntarily. If you work in a regulated industry, things are complicated further by law enforcement in various jurisdictions.