Post your IT redundancy tales here

David Gerard@awful.systems · 3 days ago

Post your IT redundancy tales here

bitofhope@awful.systems · 2 days ago

At a previous job a colleague and I used to take on most of the physical data center work. Many of the onprem customers were moving to public cloud, so a good portion of the work was removing hardware during decommissioning.

We wanted to optimize our use of the rack space, so the senior people decided we would move one of our storage clusters to the adjacent rack during a service break night. The box was built for redundancy, with dual PSUs and network ports, so we considered doing the move with the device live, with at least one half of the device connected at all times. We settled on a more conventional approach and one of the senior specialists live migrated the data to another appliance before the move.

Down in the DC the other senior showed us what to move and where and we started to carefully unplug the first box. He came to check on us just after we had taken out the first box.

Now I knew what a storage cluster appliance looked like, having carried our old one out of the DC not too long ago. You have your storage controller, with the CPU and OS and networking bits on it, possibly a bunch of disk slots too, and then you have a number of disk shelves connected to that. This one was quite a bit smaller, but that’s just hardware advancement for you. From four shelves of LFF SAS drives to some SSDs. Also the capacity requirements were trending downwards what with customers moving to pubcloud.

So we put the storage controller to its new home and started to remove the disk shelf from under it. There was a 2U gap between the controller and the shelf, so we decided to ask if that was on purpose and if we should leave a gap in the new rack as well.

“What disk shelf?”

Turns out the new storage appliance was even smaller than I had thought. Just one 2U box, which contained two entire independent storage controllers, not just redundant power and network. The thing we removed was not a part of the cluster we were moving, it was the second cluster, which was currently also handling the duties of the appliance we were actually supposed to move. Or would have, if we hadn’t just unplugged it and taken it out.

We re-racked the box in a hurry and then spent the rest of the very long night rebooting hundreds of VMs that had gone read only. Called in another specialist, told the on-duty admin to ignore the exploding alarm feed and keep the customers informed, and so on. Next day we had a very serious talk with the senior guy and my boss. I wrote a postmortem in excruciating detail. Another specialist awarded me a Netflix Chaos Monkey sticker.

The funny thing is that there was quite reasonable redundancy in place and so many opportunities to avert the incident, but Murphy’s law struck hard:

We had decomm’d the old cluster not long ago, reinforcing my expectation of a bigger system.
The original plan of moving the system live would have left both appliances reachable at all times. Even if we made a mistake, it would have only broken one cluster’s worth of stuff.
Unlike most of the hardware in the DC, the storage appliances were unlabeled.
The senior guy went back to his desk right before we started to unwittingly unplug the other system.
The other guy I was working with was a bit unsure about removing the second box, but thought I knew better and trusted that.

swlabr@awful.systems · 2 days ago

I have pretty much the exact opposite of redundancy right now, i.e. I am the single failure point for most of the IT in the company. Send help

PM_Your_Nudes_Please@lemmy.world · edit-2 1 day ago

Gotta love when your company’s bus number is 0, because one dude in the basement is single-handedly holding the entire business together. The entire c-suite could get hit by a bus and the company could continue to function while new leadership was selected… But that one dude in the basement gets hit, and the entire company’s core function is crippled for weeks.

Sailor Sega Saturn@awful.systems · 2 days ago

I’m now on two completely separate on-call rotations (as a programmer, rather than an IT person proper), and only being paid extra for a fraction of one of them. All this on top of being in charge of way too much code.

Haven’t quit yet because honkin’ big silicon valley mortgage and all, and if I’m driven off I want severance gosh darn it. Unfortunately my company also can’t fire me because like you I’m a single point of failure. I will quit if things end up too annoying though.

froztbyte@awful.systems · 2 days ago

isn’t it fun wearing all 17 hats…

swlabr@awful.systems · 2 days ago

~~no~~

David Gerard@awful.systems · 2 days ago

it’s times like these you need to show up at the office half an hour late wearing a suit

Mii@awful.systems · edit-2 3 days ago

Well, our company is trying really hard to make my whole department redundant at the moment.

Some months ago our CEO went to Silicon Valley to “talk to some people”, although I have no idea what he really did there. We’re also from Europe and don’t even sell anything in the Americas, so that trip was really unusual. And ever since he came back, he’s been completely AI-brained, and here’s some things that happened since then:

an executive order to “integrate AI into all layers of our business”
replacing all our laptops with new Thinkpads because they are apparently better for AI and have a Copilot button
activating Copilot for MS Teams even though it’s 100% not GDPR-compliant
dumping tens of thousands of euros into MS Fabrics in the hopes that it will somehow vomit out useful data for marketing; it hasn’t vomited out ANYTHING so far

Worst of all, though, our development department consists of two people including me, and the other guy mostly does organizational stuff so I am more or less alone responsible for the entirety of our production-critical code. We are understaffed and I am working on a project for which I was promised two juniors early next year … now however, I was being asked to evaluate whether we can do that with AI instead, and the hires have been shelved.

And I don’t think I’m allowed to submit “lol no”.

-dsr-@awful.systems · 1 day ago

My semi-serious suggestion:

“That sounds great. I’m going to need to take a course in how to best utilize AI, and the existing timeline will probably need to change. To really engage at expert level, I will go look at best practices from experts. You’ll sign off on reasonable expenses, right?”

Then book a trip to [interesting place] and get it expensed. Then look for a new job while promising great things in a few months’ time, maybe a year or so.

David Gerard@awful.systems · 2 days ago

if only there were whistleblower bounties for gross GDPR violation

Mii@awful.systems · edit-2 2 days ago

I actually flagged this with our DSO, still waiting for the results.

(Somehow MS Teams itself did go through years ago, which also surprised me.)

Christopher Wood@awful.systems · 3 days ago

Once upon a time I was, for employment reasons, part of a team providing customer support for police forces’ booking equipment including RHL, Solaris, HP-UX, and Tru64 servers. If you were arrested in some specific parts of the USA in the early 2000s it’s likely your PII travelled across a server I had logged into on its way to the FBI.

One specific police force called us about a red light. It turned out that half of their two-disk RAID1 array had failed. Then it transpired that they had not been rotating the backup tapes. Or even putting in the tape for backup. After some discussion it turned out that their server was in a grimy, dusty janitor’s closet instead of, say, under a desk or in a spare office. Which is why it had been out of sight and mind and getting clogged with filth.

I was asked to do a checkup on this server and see how it was. Of course this was after 3 PM on a Friday. The server seemed on the face of it fine, the RAID array was working on one disk, there were no errors on the box, and so on. Apart from the dead disk everything was fine.

(While I was being finicky with this host it got late and somebody turned off the lights and I yelled “I’m still here turn the goddamn lights back on!” or words to that effect and it turned out I had unintentionally cursed at the CEO with whom I had less than ideal relations. So it goes. He seemed more copacetic than usual. He left and I got on with things.)

Eventually I was finishing my little audit and my very junior self (job title: Technician) was wondering how little work I could get away with in my correctly lazy sysadmin style. For the first time I thought the thought that has guided my actions with systems ever since: “If I stopped here, would it be okay if a problem happened later?”

I called the police force and said given circumstances I needed to take a cold backup of their Oracle database and their booking equipment would be down for a bit. The response was that this was fine given arrest volume only picked up later on a Friday anyway. I merrily took down a whole police force’s ability to book suspects to cold-backup their Oracle database onto the third disk in the host (secondary backup mechanism or something, purchasing is a weird art). Then I turned them back on and had them do a smoke test and grabbed the bus home.

I had myself a happy little weekend in the era before cell phones and when I arrived a bit late on Monday morning my workplace was in a rather unprecedented uproar. Readers, the second disk in that police force’s RAID array had failed and taken with it their ability to book prisoners and their built up years of criminal intelligence data.

(In this situation the civil rights clock ticks and judges do not accept “well our computer systems were down” for slow-rolling delivery to bail hearings as much as the public thinks they do. So if this isn’t fixed a whole bunch of innocent and/or gormless and/or unpleasant people run wild and free.)

I was called over by the Executive Vice President of Operations and asked about the database. I said in my then-typical very guileless way “oh, I did a cold backup onto the third disk”. It was like everybody had just exhaled around me.

If I recall correctly the job of restoring the Oracle database was delegated to J. who was a great Oracle DBA among other talents. Well, as soon as a disk arrived. In the meantime the police force dug out their inkpads and paper from somewhere.

This was a superior lesson in the fragility of computer systems and why the extra mile is actually no more or less than all the required miles.

gerikson@awful.systems · 3 days ago

This was before I was made redundant, which happened for unrelated reasons and was ok.

I worked at a startup. Before I started, the company had an NT server running Exchange, which contained all customer relationship data, emails etc. The box was also a fileserver. One day, space was running out. The “admins” (== developers who only knew Linux) solved the problem by deleting all the “*.log” files cluttering up the filesystem, thereby effectively lobotomizing Exchange.

After some weeks a highly-paid consultant gave up. The company needed a new email server.

The devs decided to use qmail, because “secure”. To translate from the firstname.lastname@company.com address to the firstname_lastname directory, a Perl script was inserted between qmail and the mailboxes. As time went by, this Perl script metastasized to include email renames and even out of office replies. It grew to ~300 lines and was run on every single email that arrived.

After a while we got acquired by grownups who knew how to manage Exchange. I discovered that if an email was misplet, it wasn’t bounced, instead it was forwarded to root’s email account, which was a couple of gigs in size.

I swore to never touch email administration again.

noughtnaut@lemmy.world · edit-2 3 days ago

I absolutely love and support your use of misplet.

Jo Miran@lemmy.ml · edit-2 3 days ago

I have ~~two~~ three stories.

Company X: Our testbed server room was supported by redundant rooftop AC units, many yards apart. During a storm, a lightning bolt forked (split). One tip of the bolt hit AC unit one and the other hit AC unit two, killing both cooling units. To make things worse, the server manufacturer did not add a temperature safety shutdown to the units and instead configured them to run their fans faster the hotter they got. By the time I got there the cable management was warping and melting due to heat.

Company Y: The main datacenter was on tower 2 and the backup datacenter was on tower 1. Most IT staff was present when the planes hit.

EDIT:
Company Z: I started work at a company where they gave me access to a “test” BigIP (unit 3) to use as my own little playground. Prior to my joining, the company was run by devs doubling as IT. I wanted to delete the old spaghetti code rules so that I could start from scratch. So, after verifying that no automation was running on my unit (unit 3), I deleted the old rules. Unfortunately the devs/admins forgot to disengage replication on “unit 2” when they gave me “unit 3”. So production “unit 2” deleted its rules and told production “unit 1” to do the same. Poof…production down and units offline. I had to drive four hours to the datacenter and code the entire BigIP from scratch and under duress. I quit that job months after starting. Some shops are run so poorly that they end up fostering a toxic environment.

db0@lemmy.dbzer0.com · 3 days ago

Company Y: The main datacenter was on tower 2 and the backup datacenter was on tower 1. Most IT staff was present when the planes hit.

Well, that’s one dark “redundancy” story…

BearOfaTime · 3 days ago

I don’t understand why they had redundancy so physically close.

Whatever affects one has a high risk of affecting the other.

Different regions is a thing for a reason.

bitofhope@awful.systems · 2 days ago

There are tradeoffs to higher and higher grades of redundancy and the appropriate level depends on the situation. Across VMs you just need to know how to set up HA for the system. Across physical hosts requires procuring a second server and more precious Us on a rack. Across racks/aisles might sometimes require renting a whole second rack. Across fire door separated rooms requires a DC with such a feature. Across DCs might require more advanced networking, SDN fabrics, VPNs, BGP and the like. Across sites in different regions you might have latency issues, you might have to hire people in multiple locations or deal with multiple colo providers or ISPs, maybe even set up entire fiber lines. Across states or countries you might have to deal with regulatory compliance in multiple jurisdictions. Especially in 2001 none of this was as easy as selecting a different Availability Zone from a dropdown.

Running a business always involves accepting some level of risk. It seems reasonable for some companies to decide that if someone does a 9/11 to them, they have bigger problems than IT redundancy.

Christopher Wood@awful.systems · 2 days ago

It’s probably good to situate in time when thinking about these things. The twin towers were how a lot of companies became examples of what location redundancy really means. These days people are keeping that lesson well in mind, but back then, not so much.

db0@lemmy.dbzer0.com · edit-2 3 days ago

I think the OP was talking about the other “redundancy”, as in “your whole team has been made redundant”. In this context your story is very dark indeed :D

bitchkat@lemmy.world · 3 days ago

My old company has both data centers in the same metro area. A lot of handwaving is involved when you ask about events that can take out a metro area (natural or man made).

froztbyte@awful.systems · 3 days ago

“~~perverse incentives~~ budgets rule everything around me”

froztbyte@awful.systems · 3 days ago

company X sounds like the sort of bad shit I remember from a DC my side of the world, which was so frequently broken in various states that occasionally you couldn’t even touch the outside doorknobs (the heat would translate from the inside)

Y: oof.

Z: lol, wow. good on ya for leaving. no point sticking around in that kind of disasterzone. it ain’t ever gonna get fixed by your efforts.

froztbyte@awful.systems · 3 days ago

mine’s … not corp flavoured, and please excuse the vagueposting (bc reasons)

the investor behind the company flew me out to their private game farm, in a particular place that had specific kinds of legal import, to do some things. the listed tasks themselves were nonsense, and the entire plan for the trip was a razzle-dazzle “wow”, combined with some attempts to get me pliably drunk and milk me for information (which I deduced was to be used in a “making redundant” play)

the plan was flawed in multiple respects, but the one that’s most entertaining to me was the attempt to get me drunk. they picked (by chance) one of my regular things, and because I’m not a moron I made sure to make them drink at my pace. they very much lost on that mark, and the next day had 'em remarkably chill

o7___o7@awful.systems · 1 day ago

That started out positively sinister, glad it came out alright!

froztbyte@awful.systems · 1 day ago

it … it’s really wild

the vagueposting obscures a lot of detail

“there is currently an ongoing murder case, soon to go to trial” is, well. yeah. that’s a thing.