Lemmy, what are some of your "oh shit" work stories?

Dio9sys@lemmy.blahaj.zone · 11 months ago

dan@upvote.au · edit-2 11 months ago

I broke the home page of a big tech (FAANG) company.

I added a call to an API created by another team. I did an initial test with 2% of production traffic + 50% of employee traffic, and it worked fine. After a day or two, I rolled out to 100% of users, and it broke the home page. It was broken for around 3 minutes until the deployment oncall found the killswitch I put in the code and turned it off. They noticed the issue quicker than I did.

What I didn’t realise was that only some of the methods of this class had Memcache caching. The method I was calling did not. It turns out it was running a database query on a DB with a single shard and only 4 replicas, which wasn’t designed for production traffic. As soon as my code rolled out to 100% of users, the DBs immediately fell over from tens of thousands of simultaneous connections.

Always use feature flags for risky work! It would have been broken for a lot longer if I didn’t add one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins.

jjjalljs@ttrpg.network · 11 months ago

Always use feature flags for risky work! It would have been broken for a lot longer if I didn’t add one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins

This reminds me of the old saying: everyone has a test environment. Some people are lucky enough to have a separate production environment, too.

Vendetta9076@sh.itjust.works · 11 months ago

I work on a SOC team and we’re really trying to hammer the usage of feature flags into our devs.

WhyAUsername_1@lemmy.world · 11 months ago

What are feature flags?

dan@upvote.au · edit-2 11 months ago

Feature flags are just checks that let you enable or disable code paths at runtime. For example, say you’re rewriting the profile page for your app. Instead of just replacing the old code with the new code, you’d do something like:

if (featureIsEnabled('profile_v2')) {
  // new code
} else {
  // old code
}

Then you’d have some UI to enable or disable the flag. If anything goes wrong with the new page after launch, flip the flag and it’ll switch back to the old version without having to modify the code or redeploy the site.
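To make that concrete, here’s a rough sketch of what the check and the "flip" could look like. The in-memory `flags` map, `setFlag`, and `renderProfile` are made-up names for illustration; a real system would keep flags in some config service or database so they can change without a deploy.

```javascript
// Hypothetical in-memory flag store (assumption: a real system would read
// these from a config service or database instead).
const flags = new Map([['profile_v2', true]]);

// Unknown flags default to off, so a typo in a flag name never enables
// new code by accident.
function featureIsEnabled(name) {
  return flags.get(name) === true;
}

// Stand-in for the admin UI: flips a flag at runtime.
function setFlag(name, enabled) {
  flags.set(name, enabled);
}

// Hypothetical page renderer gated by the flag.
function renderProfile() {
  if (featureIsEnabled('profile_v2')) {
    return 'new profile page';
  }
  return 'old profile page';
}
```

Calling `setFlag('profile_v2', false)` makes the very next `renderProfile()` call take the old path, which is the whole point: the rollback is a data change, not a code change.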

Fancier gating systems let you do things like roll out to a subset of users (e.g. a percentage of all users, 50% of a particular country, or 20% of people who use the site in English) and also let you create a control group so you can compare metrics between users in the test group and users in the control group.
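One common way to do the percentage part (not necessarily how any particular in-house system does it) is to hash the flag name plus the user ID into a stable bucket from 0 to 99 and compare it with the rollout percentage. The function names below are made up for illustration, and the hash is a simple FNV-1a-style hash over the string’s UTF-16 code units.

```javascript
// FNV-1a-style 32-bit hash over UTF-16 code units: small, dependency-free,
// and deterministic, which is all bucketing needs.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep it an unsigned 32-bit int
  }
  return hash;
}

// Maps (flag, user) to a stable bucket in [0, 100). Including the flag name
// means a user does not land in the same bucket for every flag.
function bucketFor(flagName, userId) {
  return fnv1a(`${flagName}:${userId}`) % 100;
}

// True if this user falls inside the first `percent` buckets.
function isInRollout(flagName, userId, percent) {
  return bucketFor(flagName, userId) < percent;
}
```

Because the bucket is stable, raising the percentage from 2 to 50 only adds users; nobody who already had the feature loses it. Dropping the percentage to 0 acts as the killswitch.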

Larger companies all have custom in-house systems for this, but I’m sure there are some libraries that make it easy too.

At my workplace, we don’t have any Git feature branches. Instead, all changes are merged directly to trunk/master, and new features are all gated using feature flags.

WhyAUsername_1@lemmy.world · 11 months ago

Wow that’s so effing smart!

Vendetta9076@sh.itjust.works · 11 months ago

Everything Dan said and more. They’re sometimes also called canaries, although that’s not quite the same thing. There have been a ton of times where services have been down for hours instead of minutes because a dev never built in a feature flag.

Hadriscus@lemm.ee · 11 months ago

Canaries, as in the ones used in mines?

Vendetta9076@sh.itjust.works · 11 months ago

That’s where the term derives from, yes.

CashewNut@lemmy.world · 11 months ago

What language? PHP, python?