This is certainly not the first time that nedprod.com has suffered an outage. In the approximately 22 years that this website has existed, there have been multiple uptime calamities: some my fault, some bad luck, and some the malfeasance of the website hosting provider. However, this is the first time that I’ve experienced a catastrophic hardware failure on a rented server – it was working fine, I rebooted it for the first time in 485 days, and it never came back. All data on that server was lost.
This partially explains how long it took me to restore this website: whilst all irreplaceable data such as email was safely backed up, and none of that was lost, I did lose all my replaceable data, where ‘replaceable’ is defined as ‘anything recreatable using Niall’s extremely limited free time’. When I first decided not to back everything up to home, that definition assumed pre-children levels of free time.
My first priority was email; email receipt was restored onto a new, temporary server by Thursday 30th July. But I couldn’t reliably send email until ~2am on Sunday 9th August, with Gmail having to suffice in between. Then followed a process of restoring my various websites, until I had restored enough to use the fancy hand-written JavaScript post editor in which I am writing this now. Even so, I must still manually initiate rebuilds of the website, because the Docker plugin which I had written to do that has been completely lost.
Which brings me to the point of this post: irreplaceable data is obviously the most important data of all, and my automated backups worked a treat on it. But I hadn’t really considered deeply, until now, just how many hours of my time had been invested into my public server. As a conservative estimate, it’s many hundreds of hours. Normally, when I transition server providers, I take a complete copy of the preceding server onto the new server, so all the custom scripting, tweaks and so on from the preceding servers are never lost. But when you lose the whole server, all that accumulated investment goes with it. I know a lot of this stuff is trivial: for example, I had written a small Python script to grok the RTE Pulse page for the current show title, and use that to tell the streamripper doing the recording the name of the current show. Thus I could constantly record RTE Pulse, and play back specific shows at work. As much as I could rewrite that in a few hours, it is a few hours of my time to debug the thing, and my non-sleep non-work non-childcare hours are an exceptionally scarce resource. It is extremely likely that much of this lost infrastructure won’t be restored, because most of it was a convenience rather than a necessity – taking RTE Pulse again as an example, I know the shows I like the most, and they are all on Mixcloud, so I can just manually go there for each of them.
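For the curious, that lost script was only ever a small thing. Below is a minimal sketch of the idea, not the original code: the URL, the page markup being scraped, and the way the title gets handed to streamripper are all assumptions made for illustration.

```python
# Hypothetical sketch of a "what show is on RTE Pulse right now" scraper.
# The URL and the HTML being matched are assumptions for illustration only;
# the original (lost) script will have looked different.
import re
import urllib.request

PULSE_URL = "https://www.rte.ie/pulse/"  # assumed page that mentions the current show


def current_show_title() -> str:
    """Fetch the RTE Pulse page and pull out something resembling the current show title."""
    with urllib.request.urlopen(PULSE_URL, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # Assume the current show appears in markup like <h2 class="now-playing">Title</h2>.
    match = re.search(r'class="now-playing"[^>]*>([^<]+)<', html)
    return match.group(1).strip() if match else "RTE Pulse"


if __name__ == "__main__":
    # Print the title so a wrapper script can pass it to the streamripper
    # instance doing the recording, however that instance is configured.
    print(current_show_title())
```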
Anyway, obviously enough I have taken measures to prevent this ever happening again. This website is now being served from a €5/month dual-core Intel Atom C2338 @ 1.74GHz dedicated server with 4GB RAM and a 128GB SATA SSD. It is severely underpowered: it runs at a fraction of the speed of my preceding €11/month eight-core Intel Atom C2750 dedicated server. But here’s the key thing: I now have two of those servers, so for the same money I get failover redundancy, albeit with far less CPU grunt (half the total CPU cores running at two thirds the clock speed). Because these little servers are so underpowered, and I am making them run ZFS on root because I am a mean person, I’ve had to disable PHP processing entirely – this is now back to being a 100% static website, just like it was in the 1990s. You readers probably won’t notice the difference – the only missing bit is the visitor counter at the top, which used a bit of PHP and a SQLite database (also lost). I do feel that loss a bit: I had per-page visitor counts going back to the 1990s. But given that nobody has bothered with those since the 1990s, I doubt they will be missed.
Even with this now being a pure static website, ZFS is so much work for these tiny Atom CPUs that storage bandwidth is quite impacted. For incompressible data:
- Raw 128GB SATA SSD: ~470MB/sec read, 340MB/sec write (it’s a SanDisk X400 SSD, a four-year-old TLC design).
- Unencrypted LZ4 compressed: 348MB/sec read, 244MB/sec write (approx -35% versus raw, but most real data compresses well, in which case the compression yields a net gain in effective throughput).
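To spell out why that net gain happens: these are physical throughput figures for incompressible data, and for compressible data the application-visible throughput is the physical figure multiplied by the compression ratio. A quick back-of-envelope check, where the 2:1 ratio is purely an illustrative assumption rather than a measurement:

```python
# Back-of-envelope: the throughput figures are the measurements above (MB/s);
# the 2:1 compression ratio is an illustrative assumption, not a measurement.
raw_read, raw_write = 470, 340  # raw SSD, incompressible data
lz4_read, lz4_write = 348, 244  # through ZFS with LZ4, physical throughput

ratio = 2.0  # assumed compression ratio for typical text/HTML

# Each physical MB read or written carries `ratio` MB of logical data,
# so the application-visible throughput scales up by that factor.
print(f"effective read : {lz4_read * ratio:.0f} MB/s vs {raw_read} MB/s raw")
print(f"effective write: {lz4_write * ratio:.0f} MB/s vs {raw_write} MB/s raw")
```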
This reduced storage bandwidth in turn hurts what the 1Gbps NIC can deliver via nginx, tested from a nearby server:
- Raw network can achieve ~100MB/sec, i.e. RAM to RAM via nginx.
- Cached file content: 80MB/sec at 28% user, 37% system, 34% idle (approx -20% versus raw).
- Uncached file content requiring I/O and LZ4 decompression: 59MB/sec at 22% user, 41% system, 37% idle (approx -41% versus raw).
During that last benchmark, one of the two Atom cores is maxed out while the other is fairly idle, so the NIC is effectively being throttled by the lack of available single-core compute. In the end though, three fifths of a gigabit is probably enough for most people only wanting to pay ~€5/month. And, because web requests will be load balanced across both servers, that’s twelve tenths of a single gigabit server, i.e. +20% more available bandwidth for the same money.
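For what it’s worth, that arithmetic is just the two measurements above divided and then doubled:

```python
# Back-of-envelope for the load-balancing claim, using the measurements above.
nginx_ceiling = 100    # MB/s, RAM-to-RAM through nginx on the 1Gbps NIC
uncached_serving = 59  # MB/s, uncached content needing I/O plus LZ4 decompression

per_server = uncached_serving / nginx_ceiling  # ~0.6, i.e. "three fifths of a gigabit"
both_servers = 2 * per_server                  # ~1.2, i.e. "twelve tenths"

print(f"one server : ~{per_server:.0%} of what a gigabit-class server could serve")
print(f"two servers: ~{both_servers:.0%}, roughly +20% over a single full-speed server")
```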
Anyway, time for bed methinks. I hope y’all are doing well, and that you weren’t too worried by this place disappearing!