Monthly Archives: October 2025

The sky is falling?

Outage

If you tried to do anything online today you may have had more problems than usual. All kinds of services were failing, because some storage at AWS on the eastern seaboard was having problems.

Now, there are plenty of people that love to point out that the cloud has a lot of all-eggs-in-one-basket where one service being unreliable can knock out an insane percentage of the infrastructure of the internet, and they say we should go back to having our own servers in our own basement.

There is a lot of valid maths behind that kind of stance, as renting a big enough chunk of cloud infrastructure is incredibly expensive, even if you would replace it with really hot computers. Now, I remember back when installing a server meant an HP ProLiant 1U server would show up at your desk and you’d plug it in, annoy everyone else in the office with the fan noise, and stick some software on it. Of course, that’s not the time that people want to go back to. People want to go back to giant VMware clusters where you could provision a new VM conveniently from your desk. Except, of course, storage was always ridiculously expensive, with enterprise SAN storage costing per GB what 10 TB cost on the street.

Where did cloud come from?

Why did we end up where we are? Well, AWS offered people a chance to provision apps on virtual hardware without buying a bunch of servers first. This is an advantage that cloud still has to this day. You can just get started, gauge how much interest there is, what amount of hardware makes sense, what the costs look like, and then you can possibly decide to bring it all home to your basement. Of course, cloud providers will try to entice you with database systems and queueing systems that are vendor specific to prevent you from moving your apps home, but it is not insurmountable.

Also, I remember a time when companies would have server rooms in their offices where they stashed their electronic equipment, hopefully – but not mandatorily – arranging for improved cooling and redundant power. After a while people realised it would make more sense to rent space in a colocated datacentre, where your servers can socialise with other servers, all managed by a hosting partner that provides a certain level of physical security, climate control and fire suppression. At this point though, you are probably leasing your servers, leasing your rackspace and paying fees for this situation. Are you sure you are saving an enormous amount of money this way versus running cloud native apps in the cloud?

If your product is something like an email provider, of course, you will probably have network and storage needs on a scale that merits building your own datacentre, still reducing cost versus cloud hosting, but – and this may be hard to accept for some leaders – your company’s product is probably not Gmail. It is worth making the calculation though.

Why is US East 1 having problems enough to break half the internet?

So, yes, having multiple active copies of your infrastructure up and running globally is expensive, but the main reason businesses keep building their infrastructure in US East 1 is that there are very complex problems with consistency and availability as soon as you have multiple replicas out there being updated simultaneously. So if there is any way to just have one database instance, you do that, and a lot of American businesses prefer to keep their code in Virginia, or something. Or maybe it’s because US East 1 is the default region. This is not an inherent property of cloud apps: you are free to have your single copy of your infrastructure in other regions, or – heck – have a cold failover that you can spin up in another region.

“I hate sitting around, I want to never experience this again”

I hear you – you are looking for solutions, I like it.

Multi Cloud – no

Grifters are going to say “Multi cloud! They can’t all be down at the same time!”, and… sure, but I have yet to see a good multi cloud setup. There is no true cross platform IaC, so you’ll have to write a whole bunch of duplicate infra and pay for it to sit around waiting for the other clouds to go down, or if you run active-active you’ll pay egress and ingress to synchronise data across worlds and get a whole new class of problems with consistency and latency.

No – this is a bad option. You are spending loads of money on a solution that you cannot even fully use, since you are limited to the lowest common denominator.

Bring it in house – meh, maybe

If you are going to bring your software home… take the numbers for a spin again, because I doubt that they will make sense.

If you are going to do it – do it properly, i.e. use the tools that didn’t exist back when we built stuff for on-premise. Use containers and ephemeral compute instances. Unfortunately – if you don’t have enough money to lease rack space in multiple datacentres you still have a single point of failure, and if you do have enough money for that, then you will have that synchronisation problem again, so the hard engineering really doesn’t go away. Again, make sure that the contracts for your datacentres, plus the additional cost of hiring people to manage the on-prem apps you will need to replace that fancy managed infrastructure your cloud provider offered (like, yes, now you need to hire a couple of ZooKeeper and Kafka admins), don’t exceed the cloud cost, or at least that your expected uptime is better than what your cloud provider is offering.

Do nothing – my favourite option

Well… did you get away with the outage? Did you lose less money than it would cost to take decisive action? How many times can the cloud fall over before it’s worth it? Sure, some IT security experts say that when China goes to war with Taiwan, the cyber attack that will strike the US will probably take out large cloud providers, since that seems to be so effective in crippling infrastructure. Do you think that is likely? Will that hurt your business specifically?
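To make "how many times can the cloud fall over before it's worth it?" a bit more concrete, here is a minimal sketch of that break-even sum. Every name and number in it is a made-up placeholder for illustration (the loss per hour, the outage length, the yearly cost of a mitigation); plug in your own figures from the business.

```python
def expected_annual_outage_loss(outages_per_year: float,
                                hours_per_outage: float,
                                loss_per_hour: float) -> float:
    """Rough yearly cost of doing nothing: frequency x duration x hourly loss."""
    return outages_per_year * hours_per_outage * loss_per_hour


def breakeven_outages_per_year(hours_per_outage: float,
                               loss_per_hour: float,
                               mitigation_cost_per_year: float) -> float:
    """How many outages per year you need before the mitigation pays for itself.

    Assumes (optimistically) that the mitigation removes the whole loss.
    """
    return mitigation_cost_per_year / (hours_per_outage * loss_per_hour)


def mitigation_pays_off(expected_loss: float,
                        mitigation_cost_per_year: float) -> bool:
    """True when the expected loss is larger than what the fix costs."""
    return expected_loss > mitigation_cost_per_year


if __name__ == "__main__":
    # Hypothetical figures, not taken from any real business.
    loss = expected_annual_outage_loss(outages_per_year=1,
                                       hours_per_outage=8,
                                       loss_per_hour=5_000)
    fix_cost = 250_000  # e.g. a standby region plus the people to run it
    print(f"Expected yearly loss from outages: {loss:,.0f}")
    print(f"Mitigation pays off: {mitigation_pays_off(loss, fix_cost)}")
    print("Outages per year needed to break even:",
          breakeven_outages_per_year(8, 5_000, fix_cost))
```

With these invented figures you would need more than six day-long outages a year before the fix is cheaper than shrugging, which is the whole argument for doing nothing. The sketch ignores reputational damage and contractual penalties, so treat it as a conversation starter rather than a verdict.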

If you can get away with telling your users to “email you tomorrow when the cloud is back up” or words to that effect, you should probably take advantage of that and not spend more money than you need. On the other hand, if you need 100% uptime – as in no nines, 100% – there is an IBM mainframe that offers that, and you can configure it to behave like an insane number of Linux machines all in one trench coat, so you can run your existing apps on it, kind of.

Presumably, your system needs are somewhere on that continuum between “that’s OK, we’ll try again tomorrow” and “100% or else”, and I cannot make blanket guarantees, but if you chat to the business, they will probably have very specific ideas of what is acceptable and unacceptable downtime. If you agree with them about that, you are – I am guessing – going to be surprised at how OK people will be with staying in the cloud and taking your chances, as long as there is some observability and feedback.
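When you have that chat with the business, it helps to translate "nines" into hours people can picture. A small sketch, assuming a plain 365-day year and treating the availability targets below as example inputs rather than anything a provider has promised:

```python
HOURS_PER_YEAR = 365 * 24  # assumes a non-leap year


def allowed_downtime_hours_per_year(availability_percent: float) -> float:
    """Hours per year you may be down while still meeting the target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Example targets only; pick the ones your business actually cares about.
    for target in (99.0, 99.9, 99.99, 100.0):
        hours = allowed_downtime_hours_per_year(target)
        print(f"{target}% uptime allows about {hours:.2f} hours down per year")
```

Three nines works out to a little under nine hours a year, so a single bad day at your cloud provider can use up the whole budget. That is usually the number that decides whether "try again tomorrow" is acceptable.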

Heck, if I am wrong, buy that mainframe.