If you have seen the “code review” of Imperial College’s modelling code, after it was tidied up by Microsoft and others, and the reactions to it, I’d like to offer my unsolicited medium-temp take on code review, legacy code and the type of code written by people for whom code is not the goal but the means to an end. If you already have firm opinions here, you might want to skip this; it is an attempt to explain development stuff to people who don’t do software development for a living. Features an unnecessary recap of computer science history.
Background
Developers
Writing code is basically about solving a specific problem by expressing it as source code. Either a complex problem that you cannot keep fully in your head, or a simple but tedious one that you wish somebody else would just do for you. Or you are just exploring something you find curious or interesting, but that is perhaps not the most common situation in a professional setting.
Many people with various disparate backgrounds develop software today. Some start by being “in computers” but on the operations side, i.e. administering networks or servers. Some start because they want some thing in Excel to just do this little thing, falling into the worst Wikipedia hole ever, one that takes them to a whole new career – and of course some start out programming right from the beginning. Others go to university and learn computer science but stay away from academia and get a normal software engineering job. These backgrounds come into play when you read their code. If you are a mechanical engineer and your task is to make a combustion engine behave nicely (start in cold weather, use little fuel, have a pleasant throttle response, deal with less than ideal fuel quality and stay within emissions regulations), you look at the hardware you are dealing with, knowing what problem you want to solve, and then learn only as much as you can get away with about the various hardware specs, libraries and language quirks. Your code might not make sense to a nodejs back-end developer, but another mechanical engineer will at least know what you are on about and understand the variable names.
What is a program?
Batch
In the early days of business software, you ran batches. You would have a payroll program that would calculate people’s wages, fees and holiday balances; you would feed it a stack of employee records and it would output some printouts that could be given to accounting. Input => program => output. One go. Boom. Bob is your uncle. Some programs still work like this. If you remember BAT files on DOS, they were named that way because it was short for Batch. These programs have a start, a middle and an end. On Linux there are various shells that fulfil the same role but are more advanced. Usually, when something goes wrong, you will at some point discover that something has gone awry, abort mission and show some kind of error message to the user, hoping they know how to fix the problem. In most cases this type of error handling is not only sufficient but preferable in this situation, as the program usually has just one job, so it might as well fail spectacularly with loads of information dumped in the output if it cannot follow through, making life easier for the user when trying to make things work.
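To make that concrete, here is a minimal sketch of a batch program in Python (a made-up holiday-balance calculation, not the payroll system above; the file names are invented): it reads its input, produces its output and, if anything is off, fails loudly and stops.

```python
#!/usr/bin/env python3
# Minimal batch program: input => program => output, then exit.
# Run as: python holidays.py < employees.csv
import sys

def main():
    total = 0.0
    for line_no, line in enumerate(sys.stdin, start=1):
        try:
            name, days = line.rsplit(",", 1)
            total += float(days)
        except ValueError:
            # Fail spectacularly, with enough information to fix the input.
            sys.exit(f"line {line_no}: expected 'name,days', got: {line.strip()}")
    print(f"total holiday days: {total:.1f}")

if __name__ == "__main__":
    main()
```

Either it prints a total, or it tells you exactly which line of input it could not stomach and gives up. For a program with one job, that is exactly the behaviour you want.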
The smallest computers businesses would have before the PC revolution were minicomputers. After a while these became powerful enough that instead of running one batch job at a time, you could have multiple users time-sharing the computer, each using something called a teletype, an electric typewriter keyboard paired with a printer. You typed into the computer, and the computer responded onto the paper. It looked like a command line, but on paper. In 1969 the Internet was beginning to be a thing at universities and the Telnet program and protocol was invented. This meant that you could use your teletype to talk to computers far away over a network (!).
You can see this vestigially in Linux today: /dev/tty is a virtual device that is the command window you are currently typing in, TTY of course being short for teletype. The whole paper thing was deeply impractical, and soon the printer was replaced with a monitor and the “terminal” was born. For a decade or more, working on the computer meant using a terminal to interact with a mainframe or minicomputer.
Servers
The reason for bringing up Telnet and teletypes is that telnet is a different type of program, our next type. Or rather telnetd, the server end of it, is. Telnetd starts out on the command line, creates a new process and closes the standard “files” (stdin, stdout, stderr) that the command prompt uses to feed information in and out of a program, which makes it look to the shell as if the program has ended. In actual fact it is still running, with an open network socket, listening for network calls, ready to serve users who use the program telnet – without the d – to connect. This type of program, which detaches from its owning terminal, is called a daemon, and there are plenty of daemons on your average Linux machine. A similar concept in Windows is called a Windows Service. These programs are how servers are implemented: web servers, email servers, game servers. You start them, they perform a specific task, and they never finish until you explicitly terminate them. It is important that daemons are resilient to failure, so that one user connecting and experiencing a problem does not affect other users of the same computer. They use error codes or special protocol states to report problems back to the user, or disconnect the user, but the program must not itself exit unless explicitly told to stop. With these long-running programs you would start noticing compound problems: small memory leaks or file descriptor leaks could have severe consequences. These problems mattered less in batch programs; as long as the results were correct, all memory and file descriptors would be returned to the system when the program ended anyhow.
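As an illustration, here is a minimal sketch in Python of a long-running server in that spirit (a made-up echo service, not how telnetd actually works, and without the detach-from-terminal part): the crucial bit is that a problem with one connection is logged and that client is dropped, while the server itself keeps going until it is explicitly stopped.

```python
import socket

def serve_forever(port=9000):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", port))
    listener.listen()
    while True:                      # runs until explicitly terminated
        conn, addr = listener.accept()
        try:
            data = conn.recv(1024)
            conn.sendall(b"You said: " + data)
        except OSError as err:
            # One misbehaving client must not bring the whole server down.
            print(f"problem with {addr}: {err}")
        finally:
            conn.close()

if __name__ == "__main__":
    serve_forever()
```

Contrast this with the batch program above: there is no natural end, so every leaked byte and every unclosed file handle stays with the process for as long as it runs.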
You saw a similar paradigm shift in the mid noughties when web pages went from being generated on the server and rendered in the browser to being small programs that ran in the browser for a long time. Memory leaks and other inefficiencies that never used to matter back when the world was recreated every single time you requested a fresh page from the server all of a sudden led to real problems for users.
Loops and leaks
In the 1970s, computer games came into being. These pieces of software required ingenuity and engineering heroics perhaps beyond the scope of this post, but in terms of what type of program they were, they are more closely related to a server in that they do not terminate automatically. In their early guises they did not wait for network input, though; they ran in a loop that advanced time, moved players a little per iteration, reacted to input, determined whether objects had collided and updated the game state for the next go round, always trying to use as few resources as possible to cram in as much game as you could on limited hardware.
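Stripped of all the cleverness, the skeleton looks something like this (a sketch only, with placeholder functions; real games are far more sophisticated about timing and rendering):

```python
import time

TICK = 1 / 30  # advance the world 30 times per second

def read_input():            # placeholder: poll keyboard or joystick
    return None

def update(state, pressed):  # advance time, move objects, check collisions
    state["frame"] += 1
    return state

def render(state):           # placeholder: draw the current frame
    pass

state = {"frame": 0}
running = True
while running:               # like a daemon, it only stops when told to
    started = time.monotonic()
    pressed = read_input()
    state = update(state, pressed)
    render(state)
    if state["frame"] >= 300:    # stand-in for "the player pressed quit"
        running = False
    # sleep away whatever is left of the tick, to spare the hardware
    time.sleep(max(0.0, TICK - (time.monotonic() - started)))
```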
Meanwhile in the offices, the personal computer revolution happened, and first Apple and then Microsoft nicked the fourth type of program from the Xerox Palo Alto Research Centre: the graphical user interface, or GUI. This type of program is a bit like a game in that it runs an event loop, listening to events sent from the operating system, or specifically the window manager, telling the program that it has to redraw itself or similar. Because these message loops ran very often, any tiny bug in the event code could quickly cause big problems, and early Windows and Mac programs were notoriously hard to write and problems were common. Basically, so much code was needed to implement even a simple GUI program, known as boilerplate code, and people were reinventing the wheel. If only there were a way to reuse bits of code, so that if you were happy with a button abstraction, you could just use that button in other places?
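The shape of such a program has survived to this day. Here is a tiny sketch using Python’s built-in tkinter, purely to illustrate the idea (the Macs and PCs of the 1980s were of course written in C or Pascal, with far more boilerplate than this):

```python
import tkinter as tk

root = tk.Tk()
root.title("Event loop demo")

# The button abstraction somebody else already wrote - we just reuse it.
tk.Button(root, text="Say hello", command=lambda: print("hello")).pack()
tk.Button(root, text="Quit", command=root.destroy).pack()

# The event loop: wait for events from the window manager
# (clicks, redraws, resizes) and dispatch them to handlers.
root.mainloop()
```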
Because the world of computers is so new, you would think it would be quick to adopt new ideas as soon as they have been discovered, right? Anyway. Back in the 1960s, Simula 67, an ALGOL derivative, had already introduced object oriented programming. Even the source of the user interfaces Apple and Microsoft nicked, Xerox PARC, was working with OOP in a language called Smalltalk. This seemed like the holy grail to some.
Objects, bodging and garbage
Already back in 1985 Steve Jobs was working on a prototype computer nicknamed the Big Mac that ran a proper operating system, a UNIX system, and had more reasonable hardware than the fairly anaemic ur-Macintosh that had premiered a year earlier. When Jobs made himself impossible at Apple and had to be fired, he took the prototype and his gang with him. NeXT and the UNIX-based NeXTSTEP operating system came into being shortly thereafter. The language used to write this operating system was Objective-C, an attempt to weld object oriented features on top of C – a language which did not have these features, despite being developed in roughly the same era as Simula and ALGOL, but which had been successful enough to immediately become the systems programming language of choice once UNIX was rewritten in it in the early 1970s.
When Jobs was eventually brought back to Apple, the classic MacOS had reached the end of the road, and Apple has nothing but disdain for its customer base, so they basically replaced their old broken operating system wholesale with NeXTSTEP, badge-engineered into MacOS X, and their existing developers and customers were told to just deal with it. Given the paragraphs above I am sure you understand what an enormous disruption that was to a company that had been making a living writing software for the Mac. They had to start over, almost from scratch.
Honestly – I wish Microsoft had done that with one of the UI stacks they invented in the late noughties. Microsoft had come to the end of the road with Windows UI graphics (GDI, from 1985). It had problems with multiple users on the same computer, both in terms of security and performance, it was baffled by modern resolutions, and it could not offload any processing to modern graphics hardware. Microsoft too developed a stack that leveraged 3D processing hardware, but it had other failings and the Windows Division hated it, so they invented another, and another. Now they have UWP and seem happy with the performance. Ideally they should now cut the cord and let people deal with it, but that is not the Microsoft way.
Anyway, for NeXTSTEP, NeXT created Interface Builder – a broken, unstable piece of software that is still in use today for building user interfaces for the Mac and the iPhone. The beauty of it is that you draw the user interface in a graphical editor that shows your UI the way it will look when you run it. It would take Microsoft several years to come up with something even close. That thing became Visual Basic, and it was not properly object oriented, it didn’t encourage proper separation of UI code from the code that solves your problem, and on top of that it had stability issues – but – it was so easy to use for creating Windows programs that it too became a runaway success. It was just a tiny step up in complexity from writing Excel macros, so it was a common gateway drug into programming.
A Danish academic called Bjarne Stroustrup also got into the game of retrofitting object oriented features onto C, but his product, C++, became much more successful and quickly turned into the main language used for application development both in high performance computing and in the Windows world, which at the time was vastly larger than the NeXT/Objective-C realm. The coolest thing about C++ was that it was very nearly a superset of C, so almost any valid C was valid C++, which made it easy to gradually go more and more C++. Sadly, despite C++ now supporting many recent concepts inspired by newer languages as well as its own groundbreaking features from a couple of decades back, most C++ developers are C/C++ developers, writing basically C with some objects. The code then ends up unnecessarily unsafe, because the programmers are unaware of the newer, safer ways of writing code that C++ now supports.
Object orientation seemed very promising, and there was much rejoicing. Developers were still very much involved in the nitty gritty, and a lot of detailed knowledge was still needed to write a program for a specific computer. Also, C++ still made you manage your own memory, and getting memory management wrong had huge costs in terms of vulnerabilities and lost productivity. Workstation and server manufacturer Sun Microsystems decided to solve this problem by creating Java. This language was compiled to bytecode, an intermediate language that was not the machine code of any individual physical computer but rather the machine code of a well defined virtual machine, which also managed application memory with a concept called garbage collection – an idea that had existed before but had been improved quite a bit. The Java Virtual Machine was then implemented on very many computers, and Sun pitched it with the optimistic slogan “write once, run anywhere”. This was a runaway success, and many of the interesting developments in enterprise software, most cool databases and much of the Netflix networking stack are based on the Java Virtual Machine. Microsoft were dead jealous and created the .NET Framework and the C# language to try and crush Java. I mean, C# still lives and is arguably still superior, but – no, they did not manage to do so.
Determinism
If you go to proper programming school, i.e. you set out to be a developer or at least get an education on the subject – which, again, is only true for a subset of those who write code for a living – you will these days have been told about unit testing. This means writing tiny bits of code to check that the rest of the code is actually doing what it is supposed to be doing. I was part of a generation that was let loose upon the world without knowing about this kind of stuff, and let me tell you, it makes a difference to how your code works.
When you start by thinking about how to test things, you move out the things that you cannot replicate in a test. You do not check the local time directly; you have an abstraction in your code that provides the time, so that in the test you can replace the real clock with a fake one that tells the code exactly the time you need it to be. This means that tests are deterministic: they will work the same way every time you run them, as long as you provide the same data.
It may shock you, but this is not obvious to everybody. Loads of businesses have code that works differently every time you run it because it is hardwired to depend on the system time, on external sensors or on something similar to do its job. There are no seams where you can put a test dummy. Is it ideal? No, I would change it, but it does not mean the code is broken. Would I complain in a code review? Yes. But it may have been working fine for 30 years.
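To show what such a seam looks like, here is a small sketch (a made-up rule that routes late support tickets to a night-shift queue; all the names are invented for illustration):

```python
from datetime import datetime

# Hard to test: the answer changes depending on when you run it.
def queue_for_ticket_hardwired():
    now = datetime.now()
    return "night-shift" if now.hour < 8 or now.hour >= 18 else "day-shift"

# Testable: the clock is passed in, so a test can provide a fake one.
def queue_for_ticket(clock=datetime.now):
    now = clock()
    return "night-shift" if now.hour < 8 or now.hour >= 18 else "day-shift"

# The fake clock: frozen at 23:00, so the test gives the same answer
# every single time it is run.
def test_late_ticket_goes_to_night_shift():
    frozen = lambda: datetime(2020, 5, 18, 23, 0)
    assert queue_for_ticket(clock=frozen) == "night-shift"

test_late_ticket_goes_to_night_shift()
print("test passed")
```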
Code Quality
After several decades, as software became more complex and the new types of programs increased the need to avoid defects, the industry as well as academia had yet to answer how to write code with fewer defects. The military was worried, med-tech was concerned. A bug in the software of a radiation cannon meant to treat cancer had already killed patients. What are we to do?
Basically humans are bad at complexity and repetition. There are those among us that are more diligent than others, but you cannot rely solely on the individual diligence of your developers.
In the beginning you just wrote all the code in one place and hoped to keep track of it in your head.
Structured programming, which had become mainstream by the 1980s, taught us to write smaller functions and to divide programs by layers of abstraction into gradually more and more detailed implementation. The idea was that at every level of abstraction you could just read the code and understand what was going on, and if you needed more detail about how things were done you would scroll down to the implementation. Large code files were not seen as a big problem yet.
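In a small, made-up example (an imaginary monthly-report job, nothing to do with any real system), that style looks something like this: the top level reads almost like prose, and the detail lives further down.

```python
def produce_monthly_report(month, records):
    wages = total_wages(records)
    holidays = total_holiday_days(records)
    return format_report(month, wages, holidays)

def total_wages(records):
    return sum(r["hours"] * r["rate"] for r in records)

def total_holiday_days(records):
    return sum(r["holiday_days"] for r in records)

def format_report(month, wages, holidays):
    return f"{month}: wages {wages:.2f}, holiday days {holidays}"

print(produce_monthly_report("May", [
    {"hours": 160, "rate": 210.0, "holiday_days": 2},
    {"hours": 152, "rate": 195.0, "holiday_days": 0},
]))
```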
We have discussed object oriented programming. This is what truly started the sprawl. If you look at a pile of Java code, every tiny class is in its own file, in folders nested several levels deep, and provided you can find the file, it is astoundingly clear and focuses the mind. Luckily the rise of Java also meant the rise of the Integrated Development Environment (well, basically everybody wanted what Visual Basic had), which quickly gained enhanced editors that could make sense of the code and link you, like on a website, to other pieces of relevant code.
Basically, people came up with metrics for code quality. How many code paths go through each function? How many branches are there? How many levels of indentation? What percentage of the code is executed when the tests are run? The point is that the more different routes execution can take through your code, the harder it is to make sure you have verified that every code path actually works, and quantifying it helps when selling to the boss that you need to spend time sorting stuff out. “Ooh, that’s a five there, see? Can’t have that. That’s an MOT failure right there.” The truth is that these measurements are heuristics. We need them as a guide to make sure we constantly keep an eye on things, because quality deteriorates incrementally, and these metrics can help catch problems early. There is however nothing you can run to conclusively say the code is error free. The best you can do is write a set of tests that verifies that the code behaves like you expect – this goes a very long way – but you still cannot guarantee that the code is “right”, i.e. that it correctly handles scenarios outside of the tests you have devised.
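As a toy illustration of what the metrics are counting (hypothetical discount rules, not from any real code base):

```python
# Each `if` doubles the number of routes through the function, and every
# route is something a test would have to exercise.
def discount(is_member, order_total, has_coupon):
    rate = 0.0
    if is_member:              # branch 1
        rate += 0.05
    if order_total > 1000:     # branch 2
        rate += 0.05
    if has_coupon:             # branch 3
        rate += 0.10
    return rate

# Three independent branches => 2 * 2 * 2 = 8 distinct paths.
# A coverage tool tells you which of them your tests actually ran:
assert discount(True, 1500, False) == 0.10
assert discount(False, 100, True) == 0.10
# ...six more combinations remain untested, and the metric makes that
# visible long before a user trips over one of them.
```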
What about Open Source?
Open Source and Free Software are ways to release software where the user gets to see and change the source code. The usual quip is that Open Source is free as in free beer and Free Software is free as in free speech, although the real difference is more about philosophy than price.
The argument being made is that when thousands of people can see code, they can see problems and fix them; open source code is therefore automatically better. I only need one counterexample to refute this statement: OpenSSL. Simple bugs went unnoticed for years, despite the millions of eyes. The code is horrendous – or is it? I don’t know cryptography – maybe it’s fine?
Have you read the source for the Linux kernel, or Emacs? If you are overwhelmed by a sense of clarity, enlightenment and keep saying “of course! It all makes sense now!” to yourself, well, then you are better at reading code than me.
Greenfield or Legacy?
When a developer approaches writing some code, the approach differs depending on whether there is something there to begin with. If you are new to a language or framework, it is useful to start with some sample project that runs and that you can poke at to see what happens. This helps you see what is “idiomatic”, i.e. how you are supposed to write code beyond the rules that the language grammar prescribes, and beyond the syntax associated with a library.
Once you have a full grasp of a language and a set of tools, the ideal state of being is the revered Greenfield project, starting with a literal blank page. File -> New Project. Nobody else has muddled things up, only you and your crystal clear vision hold sway, and no arbitrary limitations are shackling your creativity. Truly, this shall be the greatest travel expenses management application (or whatever you are building) imagined by man.
The most likely thing you will encounter, though, is somebody else’s spaghetti code, where none of the abstractions make sense. Names are all wrong, describing business concepts from bygone days, and there are parts of the code you are dissuaded from looking at by elder colleagues. A shadow comes across their faces as they say “We tried to refactor that once, but…” and, after some silence, “Yeah, Jimmy didn’t make it”, and then you never speak of it again. This is called Legacy Code.
When you are young or hip you forget that the reason that storied, scary code is still around is that the rest of the company is making money from it. If that hadn’t been the case they would have stopped using it a long time ago. Should you let it stay scary and horrible? No, of course not. You must go where Jimmy went before, but with a bit more care. Gently refactor, rename, extract methods et cetera. But the important first step is to understand the code. This is not a fast process. I was a consultant many years ago, and back then you had to acquaint yourself with source code quickly, but even with practice it takes a while and a lot of domain knowledge, i.e. knowing what the software actually does, like the mechanical engineer above, to truly be able to safely refactor legacy code. You may even find that it wasn’t so crazy to begin with. Maybe a few automated renames to reflect changed nomenclature in the company and perhaps a few paragraphs of gasp! documentation. You will not know the full scope until you truly understand the code.
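To give a flavour of that “gently refactor, rename, extract method” step, here is a tiny made-up legacy function and one careful pass over it (any resemblance to real payroll code is coincidental):

```python
# Before: the names and magic numbers reflect nomenclature from bygone
# days, and everything happens in one lump.
def calc2(e):
    t = e["h"] * e["r"]
    if e["h"] > 160:
        t += (e["h"] - 160) * e["r"] * 0.5
    return t * 0.68

# After: the same calculation, but the names say what it does and the
# overtime rule is extracted so it can be tested (and questioned) alone.
MONTHLY_FULL_TIME_HOURS = 160
OVERTIME_BONUS = 0.5
NET_FACTOR = 0.68  # why 0.68? nobody remembers, but at least it has a name now

def overtime_pay(hours, hourly_rate):
    extra = max(0, hours - MONTHLY_FULL_TIME_HOURS)
    return extra * hourly_rate * OVERTIME_BONUS

def net_monthly_pay(hours, hourly_rate):
    gross = hours * hourly_rate + overtime_pay(hours, hourly_rate)
    return gross * NET_FACTOR

# Same behaviour, checked before anything else is touched:
assert net_monthly_pay(170, 100.0) == calc2({"h": 170, "r": 100.0})
```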
Take
My lukewarm take is therefore – given that there are so many different types of software out there, and so many people with so many different backgrounds writing code – that I am very sceptical of quick-fire judgements about code quality, especially if the people making these judgements do not have the domain knowledge to truly understand what is going on. Can professional developers identify problem areas and places that need to be changed for the sake of ease of maintenance? Sure, but – that will become clear over time. In summary: one man’s spaghetti code is another man’s Machine Learning.