Every time there’s an AWS outage, half the internet seems to go offline. Why is there such a heavy dependence on it, and can anything be done to reduce that?
Comments
Because it’s up most of the time, and the capital to start up such an endeavor is out of reach for 99.9 percent of companies. Data centers require years of planning and development to scale up to the size of something like AWS. You have to build extremely large buildings, have the ability to cool thousands of servers, have the infrastructure to support your operations, and have the workforce to roll all this out and develop software to run it and keep it secure. All of this while it doesn’t make you money for years. It’s really hard to pull off.
Amazon Web Services provides many of the services the web needs to function: servers, databases, authentication, etc. Google and Microsoft offer alternatives.
It’s a very powerful web hosting service, but it’s not the only one. Amazon’s offering is AWS, but there’s also Microsoft’s Azure and Google Cloud. Because these are specialized vendors, it’s cheaper and more efficient than setting up your own servers and having to manage them yourself.
The main way I know companies mitigate it is by paying for 2 or 3 of the big vendors, so if one fails the others keep working, and then it’s just load balancing and maybe some temporary slowness rather than a full service failure.
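A minimal sketch of that failover idea in Python, assuming the same app is deployed on two different clouds (the URLs and health-check paths here are made up):

```python
import urllib.request

# Hypothetical endpoints: the same app deployed on two different clouds.
ENDPOINTS = [
    "https://app.aws.example.com/health",    # primary
    "https://app.azure.example.com/health",  # fallback
]

def first_healthy(endpoints, timeout=2):
    """Return the first endpoint whose health check answers 200 OK."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status == 200:
                    return url
        except OSError:
            continue  # provider unreachable or timed out, try the next one
    raise RuntimeError("all providers are down")
```

In practice this routing usually lives in DNS (health-checked records with short TTLs) or in a load balancer rather than in client code, but the principle is the same: check the primary, fall back to the next provider.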
You can also try to host your own servers (remember when people were demanding dedicated servers for MW2?), but it’s really not worth it these days.
The problem is that people are too lazy or cheap to set up their applications to be resilient by deploying them across multiple availability zones and regions.
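For what it’s worth, spreading across availability zones isn’t much work. A rough sketch with boto3 (the AMI ID and instance type below are placeholders, not recommendations):

```python
import boto3  # AWS SDK for Python

AMI_ID = "ami-0123456789abcdef0"  # placeholder; use your own image
INSTANCE_TYPE = "t3.micro"        # placeholder

def launch_across_azs(region="us-east-1"):
    """Launch one instance per availability zone so a single-AZ
    outage doesn't take the whole application down."""
    ec2 = boto3.client("ec2", region_name=region)
    zones = ec2.describe_availability_zones(
        Filters=[{"Name": "state", "Values": ["available"]}]
    )["AvailabilityZones"]
    for zone in zones:
        ec2.run_instances(
            ImageId=AMI_ID,
            InstanceType=INSTANCE_TYPE,
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone["ZoneName"]},
        )
```

Multi-region is more involved (data replication, DNS failover), which is part of why so many people skip it.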
Everyone uses AWS ’cause it’s the easiest, fastest way to run websites and apps without owning tons of servers… and when AWS goes down, all the sites and services that rely on it basically black out together, making half the internet go dark.
When the internet was small, you could put up a website on your home computer, and it could handle the dozens to hundreds of visits it’d get. If your website got very popular, you bought a dedicated server to run it, and a fast internet connection. As the web grew, businesses sprang up that would do this for you. Very popular websites got several servers around the world to spread traffic and mitigate delays and outages. Nowadays this is really big business, and companies like Amazon will host websites for millions of customers in their huge data centers around the globe.
Which works great until there’s an outage that brings them all down. Such systems have lots of redundancy, so that very rarely happens, but it’s very hard to build a system with no single point of failure. It can be quite interesting to read the post-failure analyses of such events, as it’s often a chain of errors that led to the ultimate downtime.
All websites need servers. Once upon a time, this would be a single computer dedicated just to hosting the website. As websites got bigger and more data-intensive, they needed multiple computers: this got expensive.
Then virtualization came along, which let one physical computer “divvy up” its resources to act as many different servers in one, each hosting different sites or services.
AWS built virtualization at a massive scale, with massive data centers all over the world. This scale made it both cheap and relatively reliable (the redundancy of multiple data centers).
The reason everyone uses AWS is that everything else is more expensive, and you don’t get better results or performance. And at this point, it’s also like the IBM of yore: nobody ever got fired for choosing AWS.
How to reduce the single point of failure is a bit too complex for ELI5, but people would need a reason to use something other than Amazon (or Azure).
AWS outages are fairly rare, generally fairly contained, and good teams implement strategies to manage the impact of any outages. The company I worked for used AWS. We showed 99.98% uptime over the last year, and I don’t recall any outages that were due to AWS service unavailability. We switched over to different AZs once, IIRC, but there was no customer impact.
Basically, there are three large cloud computing companies globally: Amazon’s AWS, Microsoft’s Azure, and Google Cloud.
If a company wants to run an application, they can run it on their own servers and infrastructure, or just rent it from one of those three.
Currently, for many use cases, it is much simpler to host these things in the cloud than to set up your own infrastructure. The reason is that hosting an application is, in most cases, much more than just running the application: you need an app server, a database server, load balancers, backups, firewalls, failovers, and all kinds of microservices doing things in the background.
That makes the infrastructure complex, which makes a good business case for those cloud providers: they can deliver much of it very effectively and reliably. The application code itself is often the smallest piece, as the sketch below illustrates.
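To make that concrete, here’s a toy app using Flask (just an illustration; any web framework would do). The code is trivial, and everything in the comments is what the cloud providers actually sell you:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/health")
def health():
    # This handler is the "application". To run it for real you still need:
    # a database server, backups, a load balancer, a firewall, TLS certs,
    # monitoring, and failovers; the cloud rents all of that as managed
    # services instead of you racking hardware for each piece.
    return jsonify(status="ok")

if __name__ == "__main__":
    app.run(port=8080)
```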
But since only three such companies dominate the market, if one of them fails, a large part of the world will notice.