Automation and security

There has been a recent spate of sophisticated attacks on software delivery mechanisms, where cyber criminals have had massive success breaching one organisation to get automatic access to hundreds of thousands of other organisations through the update mechanism the breached organisation provides.

Must consider security at design time

I think it needs reiterating that security needs to be built in by default, from the beginning. I haven’t gone back to check properly, but I know I went back and deleted an old blog post because it showed some dubious security practice. My new policy is that I would rather omit part of a process than show a dodgy sample. Search for “login form asp.net” and you will find plenty of blog posts that don’t even hash passwords. Rather than point beginners to the built-in password hashing algorithms available in .NET, and the two lines of code you have to write, they leave some beginners thinking it’s all right, just this once, and breed the basic idea that security is optional – something you test for afterwards if you are building something “important”, not something you think about all the time.
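For illustration, here is a minimal sketch of what those couple of lines look like with the password hasher that ships with ASP.NET Core Identity – the exact class matters less than the principle of never storing plaintext:

using Microsoft.AspNetCore.Identity;

// Minimal sketch, assuming the Microsoft.Extensions.Identity.Core package.
// The generic TUser argument only provides context to the hasher; any reference type will do here.
var hasher = new PasswordHasher<object>();
var user = new object();

// Store the hash, never the plaintext.
string hash = hasher.HashPassword(user, "correct horse battery staple");

// At login, verify the supplied password against the stored hash.
var result = hasher.VerifyHashedPassword(user, hash, "correct horse battery staple");
bool valid = result != PasswordVerificationResult.Failed;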

The thing is, we developers have tools that help us do complicated things – like automatically breaking bits of code out from other bits of code, or renaming a specific construct, including mentions in surrounding comments, without also incorrectly renaming unrelated constructs that share the same name.

It turns out cyber criminals also have plenty of automation that lets them spend very little effort breaking into companies and exploiting that access in a number of different ways.

There is maybe no “why”

This has a couple of implications. First off, attackers are probably not looking for you per se. You may be a nobody, but you will still be exposed to automated attacks that test your network for known vulnerabilities and apply automated suites of exploits to see what happens. This means that even if you don’t do anything that could conceivably have value to an attacker, you will still be probed.

The second thing is, to prevent data loss you need to make every step the attacker has to take a hardship. Don’t advertise what software versions your public facing servers are running, don’t let service accounts have access to things beyond what they need, do divide networks into segments so that – for example – one machine with ransomware cannot directly infect your entire network.

Defend in depth

Change any business processes that require people to open e-mail attachments as part of their job. Offer services that help people do their job in a more convenient way that is also more secure. You cannot berate people for attempting to do their job. I mean, you can but it is not helpful.

Move backups off site and offline of course, for many reasons. But do remember that having to recover a massive storage system from a backup can still be an extinction-level event for a business, even if you do have a working, reliable off-site backup solution. If you lose a large SAN you may be offline for days, people will not be able to work, and you may need to bring sites offline while storage recovers. When you procure a sophisticated storage solution, do not forget to design a recovery strategy ahead of time for how to rebuild a massive spinning-rust storage array from absolute zero while new data is continuously generated. It is a non-trivial design challenge that probably needs tailoring to how your business operates. The point is, avoiding the situation where you need to actually restore your entire storage from tapes is always best.

Next level

Despite the intro, I have so far only mentioned things that any company needs to think about. There are of course organisations that are actually targeted. Financial institutions, large e-retailers or software supply chain companies run a greater risk of being manually targeted by evildoers.

Updates

Designing a secure process for delivering software updates is not trivial, and I am not in any position to give direct advice beyond this: if you are intending to do it, consider from the beginning how to track vulnerabilities, how to effectively remove versions that have been flagged as actively harmful, and how to support your users if they have deployed something dodgy. If that day comes, you need to at least be able to help your users. It will still be awful, but if you treat your users right, you might still make it.

Humans

Your people will be exploited. Every company that has an army of customer service representatives will need to make a trade-off between customer convenience and security. Attacks on customer service reps are very common. If you have high-value clients, people will use you to get to your clients’ money. There is little to say here, other than that you will obviously be working with the relevant authorities and regulatory bodies, as well as fine-tuning your authentication process so that you ask for confirmation information that is not readily available to an attacker.

Insiders

I don’t have any numbers on this, so I am unsure how big of a problem it is, but it is mentioned often in security circles. Basically, humans can be exploited in a different way. Employees can be coerced through intimidation, blackmail or bribery to act maliciously on behalf of an attacker. My suspicion is that this is less common than employers think, and that in cases where an employee was stressed or distracted and fell for a phishing e-mail, the employer concluded “that is too obvious a phish, this guy must have been in on it”.

It makes me think of that one time when a systemic failure on multiple levels meant that a cleaner accidentally started a commuter train that ran from the depot along the length of the commuter railway Saltsjöbanan – at maximum speed – eventually crashing through the buffers and into a building at the terminus. In addition to her injuries, she suffered the headlines “train stolen and crashed” until the investigation revealed the shocking institutional failings that had made this accident possible. I can’t remember all of them, but they ranged from the practices for how cleaners accessed the trains, to safety controls being disabled as a matter of course, to how trains were stabled, to the fact that points were left set so that a runaway train would actually leave the depot. A shambles. Yet the first reaction from the employer was to blame the cleaner.

Anyway, to return to the matter at hand – yes, although I cannot speculate on the prevalence, it is a risk. Presumably, if you hire right and look after your people, you can get them to come to you if they have messed up and gotten themselves into a compromised situation where they are being blackmailed, or if somebody is leaning on them. Breeding a strong culture of fear is counterproductive here – let people believe that you will help them rather than fire them and litigate, as long as they voluntarily come forward. If you are working in a regulated industry, things are complicated further by law enforcement in various jurisdictions.

The Powershell and the Glory

In which I add a custom prompt by making a hack in the PowerShell profile.

As I have mentioned in previous posts, I use Oh My Posh to set the theme in PowerShell. While working with Pulumi to create deployment stacks, I thought I could use a way to see which stack is the current one, i.e. to effectively have the output of pulumi stack --show-name appear in the prompt automatically.

Back in the old world, the agnoster theme was the prettiest. In my terminal at least, it looked quite a lot worse after upgrading to Oh My Posh 3, so I did exactly what the documentation says: I used Get-PoshThemes to look at all of them, exported the one I liked best into a JSON file and went to work.

Command

The naïve implementation would be to add a new segment to the prompt using the segment type “command”, which does what it says on the tin: it allows you to call a command and display the output, like it works in Bash.

        {
          "type": "command",
          "style": "powerline",
          "foreground": "#000000",
          "background": "#ffff00",
          "properties": {
            "shell": "powershell",
            "command": "pulumi stack --show-name"
          }
        },  

They do warn you that there will be performance implications, and – yes – on my 16-core desktop it still takes forever to start a process in PowerShell, so that didn’t seem to be a workable way forward. The suggested approach is to “abuse environment variables”, so… let’s?

Environment variable

I have previously made hacks to set window titles in cmder to work around iffy built-in support for showing the path as the tab name. The idea is the same here: replace the built-in “cd” alias with a PowerShell function that also does dodgy stuff on the side apart from changing directory. In this case I would test whether a pulumi.yaml file exists in the new directory, and if so set the variable PULUMI_STACK to the output of pulumi stack --show-name, otherwise set the variable to empty.

# --- other stuff
function Change-Directory() {
    param(
        [string]
        $directory
    )
    Set-Location $directory
    $env:PULUMI_STACK = ""
    if (Test-Path "pulumi.yaml") {
        $env:PULUMI_STACK = & pulumi stack --show-name
    }
}
# --- other stuff
Set-Alias -name cd -Value Change-Directory -Option AllScope

I of course don’t want to change this variable globally, I explicitly only care about the current terminal session, so I’m not trying to update the registry or anything like that. To read this variable and show it in the prompt, we then modify the theme JSON file to leverage the envvar block and to contain the following:

{
    "type": "envvar",
    "style": "powerline",
    "foreground": "#000000",
    "background": "#ffff00",
    "properties": {
       "var_name": "PULUMI_STACK"
    }
},  

After this work, the prompt is much faster, beyond acceptable, maybe even pleasant.

You can have nice things

I have come across a few things that are legitimately pleasant to use, so I thought I should collate them here to aid my aging memory. Dear reader, I am not attempting to copy Scott Hanselman’s tools list, I am stealing the concept.

Github Actions

Yea, not something revolutionary I just uncovered that you never heard of before, but still. It’s pretty great. Out of all the yet-another-yet-another-markup-language-configuration-file-to-configure-a-thing tools that exist that help you orchestrate builds, I personally find Github Actions the least weirdly magical and easy to live with, but then I’ve only tried CircleCI, Azure DevOps/TFS and TeamCity.

Pulumi – Infrastructure as code

Write your infrastructure code in C# using Pulumi. It supports Azure, AWS, Google Cloud and Kubernetes, but – as I’ve ranted about before – this shouldn’t be taken as a way to support multi-cloud; the object hierarchy is still very bespoke to each cloud provider. That said, you can mix and match providers in a stack. Let’s say you have your DNS hosted in DNSimple but your cloud compute bits in Azure: you would otherwise be stuck doing a lot of bash scripting to make it work, but Pulumi lets you write one C# file that describes all of your infra, mostly.
You will recognise the feel of using it from Chef: basically you write code that describes the infrastructure, but the actual construction isn’t happening in the code. First the description is made, then the desired state is compared to the actual running state, and adjustments are made. It is a thin wrapper over Terraform, but it does what it says on the tin.
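To give a flavour of the programming model, here is a minimal sketch, assuming the Pulumi and Pulumi.AzureNative NuGet packages; the resource names are purely illustrative:

using System.Threading.Tasks;
using Pulumi;
using Pulumi.AzureNative.Resources;
using Pulumi.AzureNative.Storage;

class AppStack : Stack
{
    public AppStack()
    {
        // Declare the desired state; Pulumi diffs it against the running stack and applies the difference.
        var resourceGroup = new ResourceGroup("app-rg");

        var storage = new StorageAccount("appstorage", new StorageAccountArgs
        {
            ResourceGroupName = resourceGroup.Name,
            Kind = Kind.StorageV2,
            Sku = new Pulumi.AzureNative.Storage.Inputs.SkuArgs { Name = SkuName.Standard_LRS }
        });
    }
}

class Program
{
    // `pulumi up` runs this, builds the description and reconciles it against the stack's actual state.
    static Task<int> Main() => Deployment.RunAsync<AppStack>();
}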

MinVer – automagic versioning for .NET Core

At some point you will write your build chain hack to populate some attributes on your Assembly to stamp a brand on a binary so you can display a version on your site that you can track back to a specific commit. The simplest way of doing this, without needing to change branching strategy or write custom code, is MinVer.

It literally walks through your commits to find your version tags and then increments that version by the number of commits since the tagged commit. It is what I dreamed would be out there when I started looking. It is genius.
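If you want to surface that version in your app, something like this should do it – MinVer stamps the computed version into the assembly’s informational version attribute at build time:

using System;
using System.Reflection;

// Hedged sketch: read the version MinVer stamped into the assembly at build time.
var version = Assembly.GetEntryAssembly()?
    .GetCustomAttribute<AssemblyInformationalVersionAttribute>()?
    .InformationalVersion ?? "unknown";

Console.WriteLine($"Running version {version}");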

A couple of gotchas: it relies – duh – on having access to the git history, so you need to remember to remove .git from your .dockerignore file, or else your dotnet publish inside docker build will fail to locate any version information. Obviously, unless you intended to release all versions of your source code in the docker image, make sure you have a multi-stage docker build – this is the default in recent Visual Studio templates – but still. I encourage you in any case to mount your finished docker image using docker run -it --entrypoint sh imagename:tag to have a look that your docker image contains what you expect.

Also, in your GitHub Actions you will need to allow for a deeper fetch depth for your script to have enough data to calculate the version number, but that is mentioned in the documentation. I already used a tag prefix ‘v’ for my versions, so I had to add that to my project files. No problems, it just worked. Very impressed.

A cloud strategy

I’m going to rehash some learnings that I have made over the last decade and a bit of doing cloud in one way or another. I have recently read thought leaders report similar things – only better written and backed by more experience of course – which made me think “oh, I’m not totally crazy then” and to set about writing this down.

Basically, it is my medium tempered take on the whole cloud thing, in terms of getting on it.

Why cloud is cool

In the bad old days, if you had an idea for some software, you had to start by buying a server. I mean it was never on the scale of “I have an idea for a consumer product, let’s build a factory”, but still it was definitely a barrier to getting started.

The impetus for building out cloud was that Amazon needed compute for their little bookshop website (remember!?) and thought it was prohibitively expensive to buy high-end servers, so they decided to buy a metric faecal ton of low-end computers instead and use software to provision this aggregated computing horsepower, basically letting people choose between a bunch of virtual machine sizes depending on the oomph a certain department needed for their application.

This was of course extremely complicated – the software bit – but once they were done, they had accidentally created cloud compute and could make more money selling cloud compute than selling books. The ability to use relatively simple APIs to create and provision VMs allowed startups to acquire really pathetic servers for nearly nothing, which was amazing for trialling ideas and fed the software boom we have seen over the last decades. Other services were built on top, and competitors came around.

What makes cloud cool is the tight APIs that let you create and destroy infrastructure in an automated fashion. Yes, you can autoscale, but the rapid prototyping potential is also really beneficial, and arguably even more significant.

What to worry about

Security

Yes, of course. Public cloud – you can tell from the name. It is not the same as having your servers at home, at least psychologically. On the other hand, assuming you are secure because your servers are inside your own building is false as well; you are still on the internet. There is no getting around this. You already have network people and security people, and with all cloud providers there are ways of securing your network that they will be familiar with and know how to operate sensibly. I.e. your network and security people will know what to do.

Cost

Yes, it’s a big one. Not all cloud things are free or near free. Basically, running a VM 24/7 and block storage (as in a scalable pretend hard drive mapped to a VM) are usually the most expensive things you can do in a public cloud. Sadly, a VM tucked away in a virtual private network with elastic storage mapped to it seems like such an easy migration path if you are currently running your apps in VMs. You will need to migrate your apps over to cloudier solutions such as AWS Fargate/Lambda or Azure App Services to reduce cost eventually. For your in-house LOB apps you can in most cases (but not all) trivially replace file system storage with cloud-native blob storage such as AWS S3 or Azure Blob Storage for your file storage needs, but it does require code changes, even if they are small. As the cloud bills start to come in, this seems a good way to spend developer resources, as the returns in terms of cost savings can be quite significant. Be wary of giving developers the ability to create resources at will, as the cost of the odd developer accidentally leaving a VM running will quickly accumulate. There are ways of dealing with this kind of stuff, but do consider it.

What not to worry about

Multi cloud

There are many tools that provide abstractions over cloud APIs, and many tools that promise that they can offer you independence and warn of vendor lock-in. That is for most people just a waste.

You will need to choose one provider for your app stack. You can still have Google Apps for email and use AWS for cloud, or use DNSimple for DNS, Office 365 for email and AWS for apps, those are mostly orthogonal concerns. You will suffer outages. You will not – without incurring unfathomable cost – be able to load balance across cloud providers. If you really are that uptime sensitive, it would be cheaper for you to have georedundant datacentres and give the cloud thing a miss.

The problem with attempting to stay cloud agnostic is that you can only use the lowest common denominator of the tools you have available, rather than throwing yourself feet first into all the opportunities that exist with a given cloud provider.
Worst case, if the CEO gets angry enough at something and wants to switch just to make a point, it still will not be completely impossible to rewrite code in the seams. For instance, if you change your code to use AWS S3, it would be relatively trivial to change the code that calls S3 to use Azure BlobStorage in a pinch. No need to go choose a platform for it. Just like with ORMs and database providers (“with NHibernate it’s so easy, you can switch to Oracle much more easily”) people very rarely switch cloud providers. There would have to be a very compelling economic argument anyway.
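To make the point concrete, the “seam” can be as small as an interface you own – a hypothetical sketch, with names of my own invention:

using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical seam: the application talks to this interface, not to S3 or Blob Storage directly.
public interface IBlobStore
{
    Task UploadAsync(string key, Stream content, CancellationToken ct = default);
    Task<Stream> DownloadAsync(string key, CancellationToken ct = default);
}

// Today an S3-backed implementation sits behind it; if the CEO ever makes that point,
// an Azure Blob Storage implementation replaces it without touching the calling code.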

Why go through with it?

Rapid prototyping

You should test in production anyway, but if you insist on creating test environments, being able to copy/paste your prod environment exactly and test your changes is only possible in an environment where you aren’t poking at real metal. It would be ludicrous to buy overbuilt on-prem hardware “because sometimes I like to spawn up a few extra copies of prod”. The powers that be would be livid at the massive capital expense that would go underutilised most of the time. With cloud however you spawn, test and destroy in minutes. Merely a blip on the radar in terms of cost. To mitigate the risk of developers leaving stray instances around you can just use governance like you do anywhere else in a workplace, but ideally the concept of ephemeral instances should lend itself to clean up nicely.

Modern software development

Bringing the organisation to a place where it has autonomous engineering teams that can bring a feature from idea to production without hand-offs is the key driver for organisational performance. Moving to the cloud is going to help make that happen. You could achieve this on-prem as well, but it would probably mean buying more hardware than you will really use outside of short bursts. If that trade-off is worthwhile to you, then who am I to deny you your wish, but for most people cloud is the way to go.

What to do?

You probably need to get some help with this. Everybody in your organisation is already busy doing things. Taking on a cloud migration is going to be a massive effort for everybody, and you are still probably going to need an experienced external consultancy to help you. There are many out there that offer to architect a cloud migration for you. Not everyone is a charlatan, but given that the selection process for these types of gigs is “who did the CxO meet at a conference/play golf with/…”, I think the most important take-away is that the individual firm probably doesn’t really matter. It also probably doesn’t really matter which cloud provider you go with. Let a bunch of ops and devs benchmark the tools and APIs of various providers and come back with their feedback. There are probably going to be some budgets that can be negotiated between your consultancy (who most likely also has a VAR agreement with some cloud providers), the provider and yourself that will determine some kind of benefit for one over another, but that’s still only speculation at this point.

Eventually one provider will be declared the winner and work will start. It doesn’t matter, really, which one was chosen. Even if the engineers say there really is a show-stopper, do investigate, but most problems can probably be avoided through some development. If you are running some weird VM somewhere that needs specific hypervisor features or some curious networking, then of course you will have a challenge – not necessarily an impossible one, though. This is not going to be done in an afternoon anyway. There is time to make changes to code, and there will be a need to do so in order to fully leverage the public cloud, as mentioned above. Obviously non-blockers can be deferred, but unlike traditional tech debt there will of course be direct cost implications.

Enterprise IT, a tragedy?

In the beginning of time, Grace Hopper invented business software and COBOL. This meant that in addition to calculating the trajectory of an intercontinental ballistic missile, large computers could now be installed, literally built into the very lower basement of international enterprises, and be used to process payroll and do forecasting, gradually obsoleting armies of calculators and typists.

Earliest business applications

At this point, nobody knew how to actually write or maintain large software systems. I mean, granted, current paradigms like object oriented programming and functional programming were formalised around that time, but still, people had not written as much software back then as you now have to write to make a car set fire to petrol at an opportune time given crank angle, engine speed and temperature (I exaggerate, but not really).

From terminals to personal computers

Eventually Microsoft began its journey to global domination in the office by piggybacking onto IBM’s good name and getting installed into offices by default, since nobody got fired for buying IBM. Unfortunately, IBM’s own operating system guys who wrote time-sharing operating systems for mainframes had no influence over the people who created the CP/M knockoff that was to be PC DOS (and Microsoft’s MS-DOS), so it had no multi-user or networking security features built in – well, you were supposed to have it on your desk, and buy another computer for somebody else’s desk. No need for passwords, amirite? I mean, UNIX existed and was fairly widespread in the corporate environment at the time, and it had decent security built in, at least a fundamental understanding that even as a power user, you don’t want to have that power all the time.

The beginning of Enterprise IT was therefore maintaining and writing software for mainframe computers, and that work was so alien to most organisations that it quickly became outsourced to suppliers that could lower maintenance costs but overcharged for development work, thus lowering OPEX but moving software to CAPEX. This created one of the first negative incentives, driving the business to make fewer changes and making enterprise IT infamously slow to react – the department of No.

As technology changed, the IT department stayed in the cellars and basements where the business rarely ventured. As mainframes were phased out and replaced with banks of personal computers, the white coats disappeared. Gradually, typing stuff into computers at the IT department stopped being work exclusively for women, and men started taking over; salaries increased (not saying there is a causal relationship, just saying), but still no natural daylight.

Forces conspire to make organisations write and maintain code

There is a syndrome that happens to computer people called “Not Invented Here”, a form of exceptionalism which holds that no outsider could possibly understand our Very Special Requirements, so we need to write our own X, where X can be anything. While certain pieces of software became standardised, like payroll and accounting for small businesses, there is still big money in helping people shoot themselves in the foot by developing and maintaining their own adaptations of commercial ERP systems like SAP and, on a lesser scale, Dynamics NAV or GP. Microsoft Excel is the most successful way of combining both: letting people buy a commercial off-the-shelf application, but then making Excel sheets with bespoke maths that can be both the lifeblood of new business and its cause of death when the bespoke template turns out to have a bug and nobody knows how to fix it anymore.

An accounting quirk (see the bit about CAPEX above) means you count code written as value created rather than as a pure expense, and book development costs as having added capital value to your code base, despite the fact that, objectively, few if any of your competitors would buy your bespoke mess of a back-office system if you offered it on the open market. Also, organisations most often don’t account for the depreciation that naturally follows unless you make the code maintainable and easy to refactor when requirements change in future.

At this point we have an alienated IT department that has only two key metrics, to cut cost and to add features to internal software products, but no real way to directly discuss requirements with those that use the software.

Negative incentives

In the beginning you only have developers. As in, they develop, profess to test and deploy their code without any supervision. Then you have an outage that embarrasses somebody in management but no singular scapegoat could be found and all of a sudden you get ops guys that are there to protect the business from the devs. After another outage that embarrasses the management further, you may get a QA department. You can imagine how IT security comes to exist as a function within an IT department?

You now have one team that’s there to make changes that implement business requirements they have captured some way or another, one side that’s there to make sure those changes are valid, and one side that’s there to make sure there are no outages. The incentive becomes to make few releases: the QA guys want to be able to run a full regression before they green-light the release, and they are the ones getting a bollocking when bugs get out in the wild. The ops guys have enough to contend with without f^!£$g developers making changes ON PURPOSE – how are you ever going to maintain a stable system if you keep poking it with new software all the time?! So yea, at best one release per month, anything quicker would be irresponsible and you would be working your QA department to the bone.

From a business perspective this means you are never getting your change in. As more things go wrong, process and red tape are added, lead times grow longer, and change freezes are introduced periodically. Longer release cycles and bigger releases cause bigger problems.

Technically speaking, after mainframes people bought servers. At first they were just computers, beige like all the rest of them, and a server room was just a cupboard where they were shoved. You bought a physical computer from a supplier where you got a decent price, and you set it up, installed your OS and then installed your software on it. Or you bought the computer and shipped it to your software vendor and let them install their software on it. 19″ racks became a thing, and servers became distinct from PCs in that they became loud and flat. You were going to stick them in a room away from humans anyway, so noise was no longer a concern. You did want many computers per rack, and you wanted simple but effective cooling, so you would fit powerful high-RPM fans.

As things progressed, people realised that it is hard to find office space that allows you to fit redundant power, automatic fire suppression and redundant network connections, so instead of trying to fit that into your basement, they would go to a third party that offered co-located datacentres. There you could mount your rack servers and they would give you ways to remotely manage them, so you wouldn’t have to physically interact with the servers to run them. All the patching and other maintenance could be done over the internet, and the data centre would make sure no villains could get at your hardware.

After a while people realised that you could just buy a couple of massively overpowered servers and then divvy up the computing horsepower into virtual machines, pretend servers that would behave like separate physical machines. Carving out a bit of virtual compute and creating a new “server” was a lot faster than buying a physical server and having your colo provider plug it into your rack. You would still have to install the OS and configure the networking, but there would be templates and automation. Heck, you could even write command line scripts to new up servers.

Point is, in Enterprise IT there is no time to write scripts, and far be it from the mind of any ops person to collaborate with developers to explore things like version control and automated testing – I mean, the developers are the enemy, the cause of all our problems, why would we collaborate? VMs are therefore largely artisan creations with very bespoke installs; apart from possibly sharing a raw template with some antivirus or monitoring, not even instances 1 and 2 of a load-balanced pair have the same software on them.

As the blows keep coming, with outages, bugs in production and near misses that cause leadership to go on long-term sick leave, decrees can go out to create test environments that are “the same as production”.

Servers are bought (VM hosts, of course) and VMs are configured. Obviously, it is prohibitively expensive to make it exactly the same as production given that the load requirements will be different, so some corners are cut, but it should be close enough. As the old meme would go – narrator: it was not close enough. Also, since the server operating systems are still hand crafted, there are multiple places where differences between production and test can creep in.

Great expectations

In various countries, leaps were made at various times that caused people to adopt electronics at a vastly greater rate than before. In Sweden there was a push in the late nineties and early noughties where you would get a quite substantial tax rebate if you bought a modern computer at a certain cost, causing a lot of people to all of a sudden possess a modern computer. With various Covid stimulus cheques it seems a lot of Americans put money straight into a gaming computer (thus worsening the current silicon supply chain constraints). There have been similar schemes all over the world at one time or another to encourage adoption of technology and promote familiarity with it, I just can’t be bothered to google more examples. Basically, people have used the internet and are able to see what commercial software development can do. Another cause of people becoming aware of the wider world is how various Apple products have had a marked impact on Enterprise IT, and BYOD basically becomes a thing in businesses because the CFO one day comes back to the office with a MacBook Air and stares the IT department down until they “make it work” with their existing corporate software that only runs on a specific type of Dell laptop they imaged two years ago, requiring IE6.

Between the exposure to what computers can really do and the daily torture of using enterprise software to do their jobs, dissatisfaction among the people on the business side is rife. Eventually, some middle manager just takes their corporate expense account, hires some consultant off the books and builds some quick-win software to solve a specific problem. If we are unlucky, this is an all-out win: it works, it generates business and was executed in a timely fashion on, or just over, budget. A dangerous precedent is set, another wedge is driven in between the business and the IT department. Shadow IT is born.

Recap

To bring us to the final bit of the story, let’s assess where we are.
Enterprise IT is disgraced, underfunded and distinct from the core business, literally moved away from the rest of the organisation. A troll under a bridge or an ogre in a swamp. Sure, the CTO may report to the Board of Directors, but there is no real correlation between desired business outcomes and the metrics that the IT department measures its services against, and no work is done to ensure that IT services directly benefit the key outcomes that the business as a whole needs to achieve. Control of IT spend is instead largely project based: the business has an idea, it needs some IT support, a project is created with a budget and a final ship date before anybody with technical know-how has even assessed it, and then work commences. Contractors are probably brought in if there is a sizeable chunk of development, but normally the department is kept lean. Service desk and “maintenance development” are presumably outsourced to a country in a different timezone.

What Enterprise IT wants is to be a force for good within the organisation, but since the beginning of time IT has often been deemed a non-core support function, and that increased conceptual distance has made it more difficult to effectively be of use to the organisation. Cultural differences between IT and other parts of the organisation, and perhaps ineffective communication between stakeholders and the developer organisation, have caused upper management to silo the organisations further apart rather than agree on an effective way of working closer together. In cases where IT is not involved in other aspects of the business, obviously there will be no “osmotic” assimilation of domain knowledge, so the business may be shocked at how little the IT department actually knows about the bread-and-butter business that keeps the lights on.

Roadmap

How do we turn things around? What are the most important things moving forward? I will leave out the Practices bit, because I have rarely seen bad practitioners in organisations I have worked with. People will write automated tests and implement CI and CD if they are given circumstances not directly hostile to professional software development. The bigger lego pieces are usually where the problems lie, and why management buy-in is crucial.

Increase automation

The VMs are not family, or even pets; you should delete them all and start over. Not in one go, or in a way that causes another outage, but replace at pace the VMs you are currently running in production with new ones created through automation. When you can automatically deploy your core business application from nothing to a running instance without any manual intervention, you are done. Thinking you could replace your running VMs with automation but not actually having proven it has no value. Ideally make the deployment process be some version of “stand up new VMs with the app on them, run tests to make sure it’s working, route traffic to the new instances, destroy the old instances”, so that you can deploy without causing any loss of availability. This is crucial in building trust with the rest of the organisation. When I write VM, I’m not making a dig against containers or containerisation, just saying you can achieve this without moving to Kubernetes and a service mesh; you can do it with bog standard VMs and a bit of scripting.

Decouple systems

In order to be able to fearlessly deploy, you will have to decouple systems so that you can deploy small changes often, with small blast radius and short time to recover.

Organise your IT department after what you are supporting. Yes, after business functions, but also after the systems that you maintain. If you have a monolith that supports all business functions, then you need to split it up. The output of different teams needs to be independently deployable. This is not easy, but it is the only way to enable teams to make predictable progress. Consider it an investment in predictable outcomes.

Maintain products, don’t run projects

When you have finally managed to put together teams that match your customers and your work, don’t disband them again after you have delivered a certain milestone and the “project is done”. Budget for products rather than projects, and see your own software as a driver to increase profits and reduce cost. Set targets for what you want to achieve and measure outcomes. Let developers prototype things and show the business what’s possible. If you have built the right platform for people, it will be possible to securely bring new ideas to life, test them with real clients and get true feedback. This way you can delight customers faster, and there is less risk that you get saddled with maintaining some Shadow IT piece of software that suddenly became core IT after it was successful and the consultant who wrote it moved on to greener pastures.

Goal

By introducing a wider interface between the business and IT, and making teams that have the autonomy and platform support to independently iterate on features quickly, you will delight the business, delight the customers, make more money and have happier teams. There will be times when you have to organise and coordinate, but the lion’s share of work can be done independently.

SQL Server and cloud

Back in the day

You would have one database connection string, and that would be it. “The Database” would live in the darkest corner of the office and you would try not to interact with the physical machine that hosted the relational database management server. The Windows Forms app would open a connection and keep it alive forever, with nothing to really threaten it. It would enable Multiple Active Result Sets because there was no grown-up way to query the database; queries were being shoved down the shared connection seemingly randomly, depending on what the user was doing and what events forced data to be loaded. You would use the Windows credentials of the end user and specifically allow that user to access the server, which fit with the licensing model where the user paid for accessing SQL Server when buying Office products, whilst pure server deployments needed per-CPU-core licensing, which was hysterically expensive.

Cloud

The cloud revolution came and went, whilst we in the Microsoft / Windows / .NET bubble never really heard about it, until Microsoft decided to try their hand at hosting cloud servers with the Windows Azure project. I genuinely think this was extremely useful for Windows, as the prospect of hosting cloud services was so far beyond the capabilities of the bare OS, and forced some stark introspection and engineering effort to overcome some of the worst designs in the kernel.

For .NET Framework, a lot was too late already, and many misfeatures weren’t fixed until the .NET Core reboot effort.

One of the things that did change for the average programmer was that database connections became ephemeral, and pooled. You could no longer hog one connection for all of your needs for all eternity – you created a new connection when you needed something and the framework pooled the connections for you, but you had to allow for connections taking longer to get established, and for the fact that they could be evicted at any point. Relatively quickly database providers built in resilience so that you didn’t have to know or care, but in the early days even the happy-path Microsoft marketing slides, which usually never have error handling or security in them, had to feature retry mechanisms – or else people simply couldn’t successfully replicate the early samples.
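In practice the pattern became: open late, close early and let the pool do the heavy lifting – roughly like this (a sketch assuming Microsoft.Data.SqlClient, with the connection string and cancellation token coming from elsewhere):

// Sketch of the ephemeral-connection pattern; the provider pools connections under the hood,
// so creating one per unit of work is cheap.
await using var conn = new SqlConnection(connectionString);
await conn.OpenAsync(cancellationToken);
// ... run your queries ...
// Disposal here does not tear down the physical connection, it just hands it back to the pool.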

Security

As people started using the public cloud, eventually people figured out that security was a thing. Maybe we should not have overly powerful credentials lying around in config files on the webserver ready to be exploited by any nefarious visitors? People eventually started using things like Azure KeyVault or AWS Secrets Manager. Sure it’s bad if people break in and nick your stuff, but it’s worse if they also steal your car to carry away the loot.

Once people had all their secrets backed by hardened credentials services, features like autorotating credentials started becoming available. Since you are provided most of your authorisation by the general hosting environment anyway, and only really need credentials for systems that are incompatible with this approach, why don’t you just automatically periodically update the credentials?

Also – back in the day when the server lived back in the office, if somebody got a bit enthusiastic with the UPDATE statements and forgot to add a WHERE, you could send everybody out for coffee (read tea, dear UK readers) whilst the techiest person in the office would Restore to Point in Time and feel like a hero when the database stopped being in recovery mode and everything was back to the way it was before the Incident.

Today a “restore” means you get to create a new database that contains the world as you knew it, before the Incident. You then need to serially rename databases to achieve the effect of having restored the database. Not a big deal, just not what we used to call restore back in my day.

Back to retry mechanisms

For SQL Server running in Azure, you can now tell Microsoft.Data.SqlClient to connect to the database using the Managed Identity of the app, meaning you are completely free of faff – there are no credentials to store, and access is managed for you automatically.
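In connection string terms that looks something like the following sketch – the server and database names are illustrative, and the Authentication keyword assumes a reasonably recent Microsoft.Data.SqlClient and an app with a managed identity assigned:

using Microsoft.Data.SqlClient;

// Hedged sketch: no password anywhere, the token is acquired via the app's managed identity.
var connectionString =
    "Server=tcp:myserver.database.windows.net,1433;" +
    "Database=mydb;" +
    "Authentication=Active Directory Managed Identity;";

await using var conn = new SqlConnection(connectionString);
await conn.OpenAsync();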

With RDS on AWS you need to use legacy usernames and passwords unless you configure an Active Directory domain up there, but… no. These credentials can at least be auto-rotated in AWS Secrets Manager. Because of the cost of talking to Secrets Manager, it’s not something you want to do on every request, as that piles up over a month.

One of those retry mechanisms from the early Azure days starts to make sense again, and it is easily implemented as a factory method you can call instead of the await using var cn = new SqlConnection(...) you have littered all throughout the code. I mean, you’ll still await using that factory method, but it can do the setting up of the connection and validating the credentials, and only spend the dosh fetching the latest from the vault if you get the error code for invalid credentials. This means your bespoke factory method replaces both new SqlConnection and OpenAsync().

Naïve retry mechanism, featuring goto:

// https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-18456-database-engine-error?view=sql-server-ver15
// Error code that indicates invalid credentials, meaning we need to bite the bullet and fetch latest settings from Secrets Manager
private const int INVALID_CREDS_ERROR = 18456;
private const int INVALID_DB_NAME = 4060;

public async Task EnsureValidCredentialsAsync(SqlConnection conn, CancellationToken cancellationToken)
{
    var rdsCredential = GetRdsCredential(await _secretsCache.GetCachedSecretAsync(_siteConfiguration.DefaultDatabaseCredentialKey));
    var dbCatalog = await _secretsCache.GetCachedSecretAsync(_siteConfiguration.DefaultCatalogKey);
    int reconnectCount = 0;
reconnect:
    var connectionString = GetConnectionString(rdsCredential, dbCatalog);
    conn.ConnectionString = connectionString;
    try
    {
        await conn.OpenAsync(cancellationToken);
        conn.Close(); //  restore to known state
        return;
    }
    catch (SqlException sqlEx) when (sqlEx.Number == INVALID_CREDS_ERROR)
    {
        // Credentials are incorrect, double check with secrets manager to see what's what - potentially creds have been rotated
        rdsCredential = await _secretsCache.GetSecretAsync<AwsSecretsManagerCredential>( _siteConfiguration.DefaultDatabaseCredentialKey );
    }
    catch (SqlException sqlEx) when (sqlEx.Number == INVALID_DB_NAME)
    {
        // Database identifier is not valid, double check with secrets manager to see what's what (potentially restored db, deprecating old db name)
        dbCatalog =
            await _secretsCache.GetSecretAsync(_siteConfiguration.DefaultCatalogKey);
    }
    catch (SqlException sqlEx)
    {
        Log.Error(sqlEx, "Could not open default DB connection. Reattempting");
    }
    if (reconnectCount++ < 3) goto reconnect;
    // surrounding code expects an exception if the open fails 
    throw new InvalidOperationException("Could not open database connection");
}
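
The factory method itself then becomes a thin wrapper around this check – a sketch, where the name CreateOpenConnectionAsync is mine:

// Hypothetical factory method replacing `new SqlConnection(...)` + OpenAsync() at call sites.
public async Task<SqlConnection> CreateOpenConnectionAsync(CancellationToken cancellationToken)
{
    var conn = new SqlConnection();
    // Validates (and if necessary refreshes) the credentials, leaving the connection closed but configured.
    await EnsureValidCredentialsAsync(conn, cancellationToken);
    await conn.OpenAsync(cancellationToken);
    return conn;
}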

What about Entity Framework?

Looking at the way you officially configure Entity Framework – it seems you can’t get away from having a known connection string up-front, which again isn’t a problem in Azure as discussed earlier, but for me, I want to only hand credentials to the app that I know have been tested.

In my past life inviting speakers to Oredev I once invited Julie Lerman to speak about Entity Framework and DDD, so I pretend that I know her and that I can call upon her for help in matters of programming. I sent out a tweet linking to a Hail Mary Stack Overflow question I had created, where I asked how I would be able to dynamically handle connection retries in a similar way, or at least be able to call out and ask for credentials.

Surprisingly she had time to reply and pointed me to a video where she had addressed this subject, which taught me about something called a DbConnectionInterceptor that had all the things I wanted. This also introduced me to the super elegant solution natively supported by Microsoft.Data.SqlClient for handling this situation in Azure that I mentioned earlier.

Basically, I therefore created a class that inherits from DbConnectionInterceptor and overrides two methods – one sync and one async version – and calls something like the above function to test the connection EF Core is about to open.

public override async ValueTask<InterceptionResult> ConnectionOpeningAsync(DbConnection connection, ConnectionEventData eventData, InterceptionResult result,
    CancellationToken cancellationToken = new CancellationToken())
{
    var sqlConnection = (SqlConnection) connection;
    // try opening the connection, if it doesn't work - update its params - close it before returning to achieve same state
    await _dbConnectionFactory.EnsureValidCredentialsAsync(sqlConnection, cancellationToken);
    return await base.ConnectionOpeningAsync(connection, eventData, result, cancellationToken);
}     

Of course – registering this interceptor is easy as well, after I had googled some more. There is a (synchronous) override that allows you access to an instantiated IServiceProvider as follows:

services.AddSingleton<ResilientRdsConnectionInterceptor>();
services.AddDbContext<ADbContext>((provider, options) =>
            {
                var dbConnectionFactory = provider.GetRequiredService<DbConnectionFactory>();
                var connectionString = dbConnectionFactory.GetDefaultConnectionString().GetAwaiter().GetResult(); // So sorry, but there is no async override
                options.AddInterceptors( provider.GetRequiredService<ResilientRdsConnectionInterceptor>());
                options.UseSqlServer(connectionString);
            });

An aside on async/sync: it seems Microsoft will have you rewrite every other line of your codebase to properly support async or your app will fall over. But when you then want to reuse your async code and not have to duplicate everything into similar-but-not-reusable sync and async versions, the tiny changes needed for MS to support async everywhere are all of a sudden “not needed”, like in configuration or DI. It’s laziness, I tell you.

Anyway I have integration tests in place that verify that the mechanism for calling out for updated credentials actually works. Provided the credentials in the secrets manager/key vault actually work, this works like a charm.

A fitting song. Also a banger.

Why is everybody waterfalling “Agile”?

Just like that rebrand cycle years ago when RUP consultants transitioned over to scrum masters through staged re-titling on LinkedIn and liberal use of search / replace in their CV, scaled agile frameworks and certified project managers attempt to apply the agile manifesto to large organisations by bringing in processes and tools to manage the individuals and interactions, comprehensive documentation of the working software, to negotiate contracts to manage customer collaboration and make plans for how to respond to changes. You start seeing concepts like the Agile Release Train, which are – well – absurd.

Why? Do they not see what they are doing? Are they Evil?

No – I think it’s simple – and really really hard, at the same time.

You cannot respond to change quickly if you have delays in the system. These delays could be things like manual regression testing due to lack of automated test coverage, insufficient or badly configured tooling around the release or having a test stack that is an inverted pyramid, where you rely on a big stack of UI tests to cover the entire feature set of the application because faster, lower level tests aren’t covering all the features and you have undeniable experience of users finding bugs for you.

Obviously, if these tests are all you have, you need to run them before releasing or you would alienate customers. If your software stack is highly coupled, it would be irresponsible not to coordinate carefully when making changes to shared components with less-than-stellar contract test coverage. You are stuck with this, and it is easy to just give up. The actual solution is to first shorten the time it takes from deciding you have all the features you want to release until the software is actually live. This means automating everything that isn’t automated (the release itself, the necessary tests, etc.), which could very well be a “let’s just stop developing features and focus all our attention on this until it is in place” type of investment – the gains are that great. After this initial push you need to invest in decoupling software components into bits that can be released independently. This can be done incrementally whilst normal business is progressing.

Once you have reached the minimum bar of being able to release whatever you want at any time you want and be confident that each change is small enough that you can roll them back in the unlikely event that the automated tests missed something, then you are in a position to talk about an agile process, because now teams are empowered and independent enough that you only need to coordinate in very special cases, where you can bring in ad hoc product and technical leadership, but in the day to day, product and engineering together will make very quick progress in independent teams without holding each other up.

When you can release small changes, you can all of a sudden see the value in delivering features in smaller chunks behind feature flags. You can understand the value of making 20 small changes in trunk (main, for you zoomers) rather than a massive feature branch, as releases go live several times a day. And there is the benefit that your colleagues see your feature-flagged changes appear from beginning to end, so they can work with your massive refactor rather than be surprised when you open a 100-file PR at 16:45 on a Friday.

Auto-login after signup

If you have a website that uses Open ID Connect for login, you may want to allow the user to be logged in directly after having validated their e-mail address and having created their password.

If you are using IdentityServer 4 you may be confused by the hits you get on the interwebs. I was, so I shall – mostly for my own sake – write down what is what, should I stumble upon this again.

OIDC login flow primer

There are several Open ID Connect authentication flows depending on whether you are protecting an API, a mobile native app or a browser-based web app. Most flows basically work in such a way that you navigate to the site that you need to be logged in to access. It discovers that you aren’t logged in (most often, you don’t have the cookie set) and redirects you to its STS, IdentityServer4 in this case, and with this request it tells IdentityServer4 what site it is (client_id), the scopes it wants and how it wants to receive the tokens. IdentityServer4 will either just return the tokens (the user was already logged in elsewhere) or get the information it needs from the end user (username, password, biometrics, whatever you want to support), and eventually, if this authentication is successful, IdentityServer4 will return some tokens and the original website will happily set an authentication cookie and let you in.
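On the client side, that hand-over is configured roughly like this in ASP.NET Core – a sketch with illustrative values, assuming the Microsoft.AspNetCore.Authentication.OpenIdConnect package:

using Microsoft.AspNetCore.Authentication.Cookies;
using Microsoft.AspNetCore.Authentication.OpenIdConnect;

services.AddAuthentication(options =>
    {
        options.DefaultScheme = CookieAuthenticationDefaults.AuthenticationScheme;
        options.DefaultChallengeScheme = OpenIdConnectDefaults.AuthenticationScheme;
    })
    .AddCookie()
    .AddOpenIdConnect(options =>
    {
        options.Authority = "https://sts.example.com"; // the IdentityServer4 instance
        options.ClientId = "client";                   // who is asking (client_id)
        options.ResponseType = "code";                 // how the tokens should come back
        options.Scope.Add("openid");                   // the scopes it wants
        options.Scope.Add("profile");
    });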

The point is – you have to first go where you want, you can’t just navigate to the login screen, you need the context of having been redirected from the app you want to use for the login flow to work. As a sidenote, this means your end users can wreak havoc unto themselves with favourites/ bookmarks capturing login context that has long expired.

Registration

You want to give users a simple on-boarding procedure, a few textboxes where they can type in email and password, or maybe invite people via e-mail and let them set up their password and then become logged in. How do we make that work with the above flows?

The canonical blog post on this topic seems to be this one: https://benfoster.io/blog/identity-server-post-registration-sign-in/. Although brilliant, it is only partially helpful as it covers IdentityServer3, and the newer one is a lot different. Based on ASP.NET Core, for instance.

  1. The core idea is sound – generate a cryptographically random one-time access code (OTAC) and map it to the user after the user has been created on the registration page (in IdentityServer4).
  2. Create an anonymous endpoint in a controller in one of the apps the user will be allowed to use, in it, ascertain that you have been sent one of those codes, then Challenge the OIDC authentication flow, adding this code as an AcrValue as the request goes back to the IdentityServer4
  3. Extend the authentication system to allow these temporary codes to log you in.

To address the IdentityServer3-ness, people have tried all over the internet; here is somebody who gets it sorted: https://stackoverflow.com/questions/51457213/identity-server-4-auto-login-after-registration-not-working

Concretely you need a few things – the function that creates OTACs, which you can lift from Ben Foster’s blog post. As a sidenote, do remember that if you use a cooler password hashing algorithm, you have to use its specific validators rather than rely on applying the hash onto the same plaintext to validate. I.e. you need to fetch the hash from whatever storage you use and use the specific methods the library offers to validate that the hashes are equivalent.
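For reference, a minimal sketch of such an OTAC generator – the length and encoding are up to you, the class name is mine:

using System;
using System.Security.Cryptography;

public static class Otac
{
    // Hedged sketch: a cryptographically random, URL-safe one-time access code.
    public static string Create(int byteLength = 32)
    {
        var bytes = new byte[byteLength];
        using var rng = RandomNumberGenerator.Create();
        rng.GetBytes(bytes);
        // Make it safe to pass around in a query string.
        return Convert.ToBase64String(bytes).TrimEnd('=').Replace('+', '-').Replace('/', '_');
    }
}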

After the OTAC is created, you need to redirect to a controller action in one of the protected websites, passing the OTAC along.

The next job is therefore to create the action.

        [AllowAnonymous]
        public async Task LogIn(string otac)
        {
            if (otac is null)
            {
                Response.Redirect("/Home/Index");
                return;
            }
            var properties = new AuthenticationProperties
            {
                Items = { new KeyValuePair<string, string>("otac", otac) },
                RedirectUri = Url.Action("Index", "Home", null, Request.Scheme)
            };

            await Request.HttpContext.ChallengeAsync(ClassLibrary.Middleware.AuthenticationScheme.Oidc, properties);
        } 

After storing the OTAC in the HttpContext, it’s time to actually send the code over the wire, and to do that you need to intercept the calls when the authentication middleware is about to send the request over to IdentityServer. This is done where the call to AddOpenIdConnect happens (maybe yours is in Startup.cs?), where you get to configure options, among which are some event handlers.

OnRedirectToIdentityProvider = async n =>{
    n.ProtocolMessage.RedirectUri = redirectUri;
    if ((n.ProtocolMessage.RequestType == OpenIdConnectRequestType.Authentication) && n.Properties.Items.ContainsKey("otac"))
    {
        // Trying to autologin after registration
        n.ProtocolMessage.AcrValues = n.Properties.Items["otac"];
    }
    await Task.FromResult(0);
}

After this – you need to override the AuthorizeInteractionResponseGenerator, get the AcrValues from the request, and – if successful – log the user in, and respond accordingly. Register this class using services.AddAuthorizeInteractionResponseGenerator(); in Startup.cs

Unfortunately, I was still mystified as to how to log things in, in IdentityServer4, as I could not find a SignIn manager used widely in the source code, but then I found this Stack Overflow post:
https://stackoverflow.com/questions/56216001/login-after-signup-in-identity-server4, and it became clear that using an IHttpContextAccessor was “acceptable”.

    public override async Task<InteractionResponse> ProcessInteractionAsync(ValidatedAuthorizeRequest request, ConsentResponse consent = null)
    {
        var acrValues = request.GetAcrValues().ToList();
        var otac = acrValues.SingleOrDefault();

        if (otac != null && request.ClientId == "client")
        {
            var user = await _userStore.FindByOtac(otac, CancellationToken.None);

            if (user is object)
            {
                await _userStore.ClearOtac(user.Guid);
                var svr = new IdentityServerUser(user.SubjectId)
                {
                    AuthenticationTime = _clock.UtcNow.DateTime

                };
                var claimsPrincipal = svr.CreatePrincipal();
                request.Subject = claimsPrincipal;

                request.RemovePrompt();

                await _httpContextAccessor.HttpContext.SignInAsync(claimsPrincipal);

                return new InteractionResponse
                {
                    IsLogin = false,
                    IsConsent = false,
                };
            }
        }

        return await base.ProcessInteractionAsync(request, consent);
    }

Anyway, after ironing out the kinks the perceived inconvenience of the flow was greatly reduced. Happy coding!

WSL 2 in anger

I have previously written about the Windows Subsystem for Linux. As a recap, it comes in two flavours – one built on the concept of pico processes, marshalling the Linux ABI into Win32 API calls (WSL1), and an actual Linux kernel hosted in a lightweight Hyper-V installation (WSL2). Both types have file system integration and a fairly transparent command line interface to run Linux commands from Windows and Windows executables from the Linux command line. But beyond the headline stuff, how does it work in real life?

Of course with WSL1 there are compatibility issues, but the biggest problem is horrifyingly slow Linux file system performance, because it is Windows NTFS pretending to be ext4. Since NTFS is slow on small files, you can imagine that an operating system whose main feature is being an immense collection of small files working together would run slowly on top of a filesystem with those characteristics.

With WSL2, obviously kernel compatibility is 100% – well, it’s a Linux kernel – and the Linux file system stuff Just Works, as the file system is managed natively (although over the hypervisor). Ironically, though, the /mnt filesystem with the Windows drives mounted is prohibitively slow. This has been said to be a bug that has allegedly been fixed, but given that we are – at the end of the day – talking about accessing local PCIe gen 4 NVMe storage, managing to make file I/O this slow betrays plenty of room for improvement. To summarise – if you want to do Linuxy things in Linux under Windows, use WSL2; if you want to do Windowsy things in Linux under Windows, use WSL1. Do with that what you will. WSL2 being based on a proper VM means that despite huge efforts, the networking story is not super smooth; no proper mechanism exists to make things easier for you, and no hits on Google will actually address the fundamental problem.

That is to say, I can run a website I have built in docker in WSL2, but I need to do a lot of digging to figure out what IP the site got, and do a lot of firewall stuff to be able to reach it. Also, running X Window with the excellent X410 server requires a lot of bespoke scripting because there is no way of setting up the networking to just work on start-up. You would seriously think that a sensible bridging default could have been brought in to make things a lot more palatable? After all, all I want to do is road test my .NET Core APIs and apps in docker before pushing them. That doesn’t seem too extreme a use case.

To clarify – running or debugging a .NET Core Linux website from Visual Studio Code (with the WSL2 backend) works seamlessly, absolutely seamlessly. My only gripe is that because of the networking issue, I cannot really verify docker things in WSL2, which I surmised was the point of WSL2 over WSL1.

Put your Swagger UI behind a login screen

I have tried to put a piece of API documentation behind interactive authentication. I have various methods that are available to users of different roles. In an attempt to be helpful I wanted to hide the API methods that you can’t access anyway. Of course when the user wants to call the methods for the purpose of trying them out, I use the well documented ways of hooking up Bearer token authentication in the Swashbuckle UI.

I thought this was a simple idea, but it seems to be a radical concept that was only used back in Framework days. After reading a bunch of almost relevant google hits, I finally went ahead and did a couple of things.

  1. Organise the pipeline so that Authentication happens before the UseSwaggerUI call in the pipeline.
  2. Hook up an operation filter to tag the operations that are valid for the current user by checking the Roles in the User ClaimsPrincipal.
  3. Hook up a document filter to filter out the non-tagged operations, and also clean up the tags or you’ll get duplicates – although further experimentation here can also yield results (see the sketch after this list).
  4. Set up the API auth as if you are doing an interactive website so you have Open ID Connect middleware set up as a default Authentication Scheme, set up Cookie as Default Scheme and add Bearer as an additional scheme.
  5. Add the Bearer scheme to all API controllers (or via some other policy; the point is, you need to specify that the API controllers only accept Bearer auth).
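
As promised, a rough sketch of steps 2 and 3 rolled into one Swashbuckle document filter – the names and the role convention are illustrative, and the real thing needs to match however your controllers declare their roles:

using System.Linq;
using Microsoft.AspNetCore.Authorization;
using Microsoft.AspNetCore.Http;
using Microsoft.OpenApi.Models;
using Swashbuckle.AspNetCore.SwaggerGen;

// Hypothetical filter: removes paths whose [Authorize(Roles = "...")] requirements the current
// user does not meet. Register it with options.DocumentFilter<RoleBasedDocumentFilter>().
public class RoleBasedDocumentFilter : IDocumentFilter
{
    private readonly IHttpContextAccessor _httpContextAccessor;

    public RoleBasedDocumentFilter(IHttpContextAccessor httpContextAccessor)
        => _httpContextAccessor = httpContextAccessor;

    public void Apply(OpenApiDocument swaggerDoc, DocumentFilterContext context)
    {
        var user = _httpContextAccessor.HttpContext?.User;

        foreach (var apiDescription in context.ApiDescriptions)
        {
            // Assumed convention: roles are declared on the action or controller via [Authorize(Roles = "...")]
            var roles = apiDescription.CustomAttributes()
                .OfType<AuthorizeAttribute>()
                .Where(a => !string.IsNullOrEmpty(a.Roles))
                .SelectMany(a => a.Roles.Split(','))
                .Select(r => r.Trim())
                .ToList();

            if (roles.Count == 0 || roles.Any(r => user?.IsInRole(r) == true))
                continue;

            // Drop the whole path; a finer-grained version would remove individual operations instead.
            swaggerDoc.Paths.Remove("/" + apiDescription.RelativePath?.TrimEnd('/'));
        }
    }
}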