Slack outage 2021

8/12/2023


Laura Nolan of Slack recently published an excellent write-up of their Jan. 4, 2021 outage on Slack’s engineering blog.

One of the things that struck me about this write-up is the contributing factors that aren’t part of this outage. There’s nothing about a bug that somehow made its way into production, or an accidentally incorrect configuration change, or how some corrupt data ended up in the database. On the other hand, it’s an outage story with multiple examples of saturation.

Saturation is a term often used by the safety science researcher David Woods: it refers to a system that is reaching the limit of what it can handle. If you’ve done software operations work, I bet you’ve encountered resource exhaustion, which is an example of saturation. Saturation plays a big role in Woods’s model of the adaptive universe. In particular, in socio-technical systems, people will adapt in order to reduce the risk of saturation. In this post, I’m going to walk through Laura’s write-up, highlighting all of the examples of saturation and how the system adapted to it. I’m going purely from the text of the original write-up, which means I’ll likely get some things wrong here.

In the beginning, they (like, I presume, all small companies) started with a single AWS account.

[Figure: In the beginning, Slack’s AWS footprint fit nicely into one account and VPC: that’s one happy cloud!]

However, as Slack grew, it encountered problems. Here’s a quote from Slack’s blog post Building the Next Evolution of Cloud Networks at Slack by Archie Gunasekara:

“As our customer base grew and the tool evolved, we developed more services and built more infrastructure as needed. However, everything we built still lived in one big AWS account. Having all our infrastructure in a single AWS account led to AWS rate-limiting issues, cost-separation issues, and general confusion for our internal engineering service teams.”

[Figure: That cloud isn’t looking so happy anymore]

The above quote makes reference to three different categories of saturation. The first is a traditional sort of limit we software folks think of: they were running into AWS rate limits associated with an individual AWS account. The other two limits are cognitive: the system made it harder for humans to deal with separating out costs, and it led to confusion for internal teams. I still see these as a form of saturation: as a system gets more difficult for humans to deal with, it effectively increases the cost of using the system, and it makes errors more likely.

And so, the Slack Cloud Engineering team adapted to meet this saturation risk by adopting AWS child accounts. Now the service teams could request their own AWS accounts and could even peer their VPCs with each other when services needed to talk to other services that lived in a different AWS account.

With continued growth, they eventually reached saturation again. Once again, this was the “it’s getting too hard” sort of saturation:

“Having hundreds of AWS accounts became a nightmare to manage when it came to CIDR ranges and IP spaces, because the mis-management of CIDR ranges meant that we couldn’t peer VPCs with overlapping CIDR ranges. This led to a lot of administrative overhead.”

[Figure: VPC peering with multiple VPCs increases the cognitive load on the networking engineers]

To deal with this risk of saturation, the cloud engineering team adapted again. They reached for new capabilities: AWS shared VPCs and AWS Transit Gateway Inter-Region Peering. By leveraging these technologies, they were able to design a network architecture that addressed their problems:

“This solved our earlier issue of constantly hitting AWS rate limits due to having all our resources in one AWS account. This approach seemed really attractive to our Cloud Engineering team, as we could manage the IP space, build VPCs, and share them with our child account owners. Then, without having to worry about managing any of the overhead of setting up VPCs, route tables, or network access lists, teams were able to utilize these VPCs and build their resources on top of them.”

[Figure: Example of a transit gateway connecting multiple VPCs. Obviously, this does not accurately depict Slack’s actual network architecture.]

Fast forward several months.

“On January 4th, one of our Transit Gateways became overloaded. The TGWs are managed by AWS and are intended to scale transparently to us. However, Slack’s annual traffic pattern is a little unusual: Traffic is lower over the holidays, as everyone disconnects from work (good job on the work-life balance, Slack users!). On the first Monday back, client caches are cold and clients pull down more data than usual on their first connection to Slack. We go from our quietest time of the whole year to one of our biggest days quite literally overnight. Our own serving systems scale quickly to meet these kinds of peaks in demand (and have always done so successfully after the holidays in previous years).”
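As an aside on the CIDR point quoted above (VPCs with overlapping CIDR ranges can’t be peered): that constraint is the kind of thing that’s tedious for humans to track across hundreds of accounts but easy to check mechanically. Here’s a minimal sketch using Python’s standard ipaddress module. The VPC names and ranges are made up for illustration, and this is not a description of how Slack actually manages its IP space.

```python
from ipaddress import ip_network
from itertools import combinations


def overlapping_pairs(vpc_cidrs):
    """Return (name, name) pairs of VPCs whose CIDR blocks overlap."""
    nets = {name: ip_network(cidr) for name, cidr in vpc_cidrs.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


# Hypothetical VPCs; names and ranges are assumptions for this example.
vpcs = {
    "team-a": "10.0.0.0/16",
    "team-b": "10.1.0.0/16",
    "team-c": "10.0.128.0/17",  # falls inside team-a's range
}

conflicts = overlapping_pairs(vpcs)
print(conflicts)
```

With only three VPCs you can spot the conflict by eye. The number of pairs to compare grows quadratically with the number of VPCs, which gives a feel for why doing this by hand across hundreds of accounts turns into administrative overhead.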
