Incantations

We learned the hard way that during an incident, we should monitor and evaluate the severity of the situation and choose a mitigation path whose riskiness is appropriate for that severity.

Recovery mechanisms should be fully tested before an emergency

An emergency fire evacuation in a tall city building is a terrible opportunity to use a ladder for the first time.

Testing recovery mechanisms has a fun side effect of reducing the risk of performing some of these actions. Since this messy outage, we’ve doubled down on testing.

We were pretty sure that it would not lead to anything bad. But pretty sure is not 100% sure.

A “Big Red Button” is a unique but highly practical safety feature: it should kick off a simple, easy-to-trigger action that reverts whatever triggered the undesirable state to (ideally) shut down whatever’s happening.
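As a rough illustration of the idea, here is a minimal sketch of a kill switch that records an undo action before each risky change and reverts everything when pressed. The class name `BigRedButton`, the `register`/`press` methods, and the `new_scheduler_percent` setting are hypothetical names invented for this example, not anything described in the article.

```python
class BigRedButton:
    """Hypothetical kill switch that remembers how to undo each risky change."""

    def __init__(self):
        self._undo_actions = []

    def register(self, description, undo):
        # Record the revert action before the risky change is applied.
        self._undo_actions.append((description, undo))

    def press(self):
        # Revert the most recent change first, then work backwards.
        reverted = []
        while self._undo_actions:
            description, undo = self._undo_actions.pop()
            undo()
            reverted.append(description)
        return reverted


# Illustrative state: rollout percentage of a made-up feature.
config = {"new_scheduler_percent": 0}

button = BigRedButton()
button.register(
    "new_scheduler rollout",
    lambda: config.update(new_scheduler_percent=0),
)
config["new_scheduler_percent"] = 50  # the risky change

reverted = button.press()  # one simple action undoes it
```

The point of the design is that the revert path is decided and stored before the change goes out, so pressing the button during an incident requires no thinking.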

Unit tests alone are not enough; integration testing is also needed

This lesson was learned during a Calendar outage in which our testing didn’t follow the same path as real use, resulting in plenty of testing… that didn’t help us assess how a change would perform in reality.

Teams were expecting to be able to use Google Hangouts and Google Meet to manage the incident. But when 350M users were logged out of their devices and services… relying on these Google services was, in retrospect, kind of a bad call.

It’s easy to think of availability as either “fully up” or “fully down” … but being able to offer a continuous minimum functionality with a degraded performance mode helps to offer a more consistent user experience.
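One way to picture a degraded mode is a request path that falls back to a reduced answer instead of failing outright. The sketch below assumes a hypothetical `serve` helper and made-up calendar loaders; it only illustrates the shape of the idea.

```python
def serve(primary, fallback):
    """Try the full-featured path; on failure, return a reduced result."""
    try:
        return {"mode": "full", "data": primary()}
    except Exception:
        # Minimum functionality: stale or partial data beats an error page.
        return {"mode": "degraded", "data": fallback()}


def load_calendar_live():
    # Stand-in for a backend call that is currently failing.
    raise TimeoutError("backend unavailable")


def load_calendar_cached():
    # Stand-in for a cheap local source such as a cache.
    return ["09:00 standup (cached)"]


healthy = serve(lambda: ["09:00 standup", "14:00 review"], load_calendar_cached)
degraded = serve(load_calendar_live, load_calendar_cached)
```

Returning the mode alongside the data lets the caller tell the user that what they see may be incomplete.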

This next lesson is a recommendation to ensure that your last-line-of-defense system works as expected in extreme scenarios, such as natural disasters or cyber attacks, that result in loss of productivity or service availability.

Another useful activity is sitting your team down and working through how some of these scenarios could theoretically play out, tabletop-game style. This can also be a fun opportunity to explore those terrifying “What Ifs”, for example, “What if part of your network connectivity gets shut down unexpectedly?”

In such instances, you can reduce your mean time to resolution (MTTR) by automating mitigations that are currently done by hand. If there’s a clear signal that a particular failure is occurring, then why can’t that mitigation be kicked off in an automated way? Sometimes it is better to use an automated mitigation first and save the root-causing for after user impact has been avoided.
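A minimal sketch of that pattern follows, assuming the "clear signal" is an error rate that stays above a threshold for several consecutive samples. The `AutoMitigator` class, the 5% threshold, the window of three samples, and the "drain" action are all illustrative assumptions.

```python
from collections import deque


class AutoMitigator:
    """Hypothetical helper: fire a mitigation once the signal is unambiguous."""

    def __init__(self, mitigate, threshold=0.05, window=3):
        self.mitigate = mitigate
        self.threshold = threshold
        self.recent = deque(maxlen=window)
        self.triggered = False

    def observe(self, error_rate):
        self.recent.append(error_rate)
        # Require a full window of bad samples so one spike does not trigger.
        clear_signal = (
            len(self.recent) == self.recent.maxlen
            and all(rate > self.threshold for rate in self.recent)
        )
        if clear_signal and not self.triggered:
            self.triggered = True
            self.mitigate()  # mitigate first, root-cause later
        return self.triggered


actions = []
mitigator = AutoMitigator(lambda: actions.append("drain"))
for rate in [0.01, 0.2, 0.3, 0.01, 0.2, 0.3, 0.4]:
    mitigator.observe(rate)
```

In this run the isolated spikes are ignored, and the mitigation fires exactly once, on the final sample, when three consecutive readings exceed the threshold.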

Having long delays between rollouts, especially in complex, multi-component systems, makes it extremely difficult to reason about the safety of a particular change. Frequent rollouts, with the proper testing in place, lead to fewer surprises from this class of failure.

Having only one particular model of device to perform a critical function can make for simpler operations and maintenance. However, it means that if that model turns out to have a problem, that critical function is no longer being performed.

Latent bugs in critical infrastructure can lurk undetected until a seemingly innocuous event triggers them. Maintaining a diverse infrastructure, while incurring costs of its own, can mean the difference between a troublesome outage and a total one.

I eat words@group.lt · 15 days ago

Spots, stripes and more: Working out the logic of animal patterns

I eat words@group.lt · 15 days ago

update to 19.4, some downtime

I eat words@group.lt · 16 days ago

Startling differences between humans and jukeboxes

I eat words@group.lt · 22 days ago

iptables vs. GoXDP: The Ultimate Packet Filtering Benchmark Setup and Results

I eat words@group.lt · 22 days ago

The Cult of the Criterion Collection: The Company Dedicated to Gathering & Distributing the Greatest Films from Around the World

I eat words@group.lt · 22 days ago

How AI will change democracy

I eat words@group.lt · 22 days ago

2023 Biography of Marian Rejewski: “The First Enigma Codebreaker” | flyingpenguin

I eat words@group.lt · 23 days ago

The habits of effective remote teams - PostHog

I eat words@group.lt · 25 days ago

Paul Auster, American author of The New York Trilogy, dies aged 77

I eat words@group.lt · 25 days ago

“The Burnout Society”: why do we constantly feel overworked and unhappy?

I eat words@group.lt · 1 month ago

Google Patches Fourth Chrome Zero-Day in Two Weeks

I eat words@group.lt · 1 month ago

We need more calm companies

I eat words@group.lt · 1 month ago

Enhancing Open Source Security: Introducing Siren by OpenSSF – Open Source Security Foundation

I eat words@group.lt · 1 month ago

This is what you get when you are not sleeping during biology classes.

I eat words@group.lt · 1 month ago

I eat words@group.lt · 1 month ago

not a bug, but a feature :))

I eat words@group.lt · 2 months ago

a source code of a game ;))

I eat words@group.lt · 3 months ago

i am all for normalizing raiding embassies for [put the cause you support] as well

I eat words@group.lt · 3 months ago

woah, so nothing is sacred now? 😱🤔😐

I eat words@group.lt · 4 months ago

thank you, actually it seems that it is https://en.m.wikipedia.org/wiki/The_Sliced-Crosswise_Only-On-Tuesday_World , which has inspired Dayworld :)

I eat words@group.lt · 4 months ago

looks interesting, but not this one.

I eat words@group.lt · 4 months ago

from the logs it would seem that synapse went down not due to sheer volume of traffic, but special malformed usernames - so it seems a different pattern was used (if it was an attack)

I eat words@group.lt · 4 months ago

I am not sure if that is related, but technically Matrix uses a different protocol from ActivityPub, so it had to be targeted specifically

I eat words@group.lt · 4 months ago

Video debunking the report: https://yewtu.be/watch?v=7CD_Nl3iwhE

I eat words@group.lt · 4 months ago

can do, if you could provide the link to the debunking source - would be great!

I eat words@group.lt · 4 months ago

nice, thank you.

I eat words@group.lt · edit-2 4 months ago

This might be FUD, but… Vastaamo hacker traced via ‘untraceable’ Monero transactions, police says. (Edit) - A video debunking the police report - https://yewtu.be/watch?v=7CD_Nl3iwhE

I eat words@group.lt · 4 months ago

Yes, seems so from the article.

I eat words@group.lt · 5 months ago

Agree, but five nines are not 100% ;) Anyway - this discussion reminds me of Technical Report 85.7 - Jim Gray, which might be of interest to some of you.

I eat words@group.lt · 5 months ago

a lot of things are possible if you are lucky enough ;)

I eat words@group.lt · 5 months ago

well this is probably PR, as no system has 100% uptime, nor can one be built. not to mention that network engineers rarely work with servers :)

I eat words@group.lt · 5 months ago

sorry, missed the question - it is up!

I eat words

Lessons Learned from Twenty Years of Site Reliability Engineering
