How a Cache Stampede Caused One of Facebook’s Biggest Outages

Last Updated : 23 Nov, 2023

On September 23, 2010, Facebook experienced one of its most severe outages up to that point, affecting hundreds of thousands of users worldwide. The site remained inaccessible for approximately 2.5 hours, causing widespread disruption and frustration.

The root cause of the outage was a cache stampede, a phenomenon that occurs when a large number of clients try to access the same cached resource at the same time, overloading the underlying data source and causing a cascading failure.

Table of Contents

How a cache stampede can cause problems
What happened in the case of Facebook?

A cache stampede, also known as dog-piling, occurs when a large number of requests try to access a specific resource that is not present in the cache. This usually happens when the cached object expires or is invalidated, and multiple requests attempt to repopulate the cache concurrently.


How a cache stampede can cause problems

Normal Operation: Facebook, like many large-scale web services, uses caching extensively to improve performance. Frequently accessed data is stored in a cache, reducing the need to retrieve it from the database every time.
Cache Expiration or Invalidation: Under routine cache expiration or invalidation policies, certain cached items need to be refreshed or recomputed.
Cache Stampede Trigger: When a popular item’s cache entry expires, many user requests try to access the resource at the same time.
Resource-Intensive Recomputation: Recomputing or regenerating the content for the cache can be resource-intensive, especially for popular items that many users are trying to access simultaneously.
Increased Load on Resources: The sudden surge in requests for the same resource can overload backend systems such as databases or compute resources.
Service Degradation or Outage: The increased load and resource contention can lead to service degradation or, in extreme cases, a service outage.
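
The sequence above can be reproduced in miniature. The following Python sketch is only an illustration under stated assumptions: `NaiveCache`, `simulate_expiry`, the 50 ms backend delay, and the key name are all hypothetical and not taken from any real system. It shows a check-then-fill cache with no coordination, where every client that sees the miss triggers its own recomputation.

```python
import threading
import time


class NaiveCache:
    """Hypothetical in-memory cache with no stampede protection."""

    def __init__(self, backend_delay=0.05):
        self.data = {}
        self.backend_calls = 0              # how many times the backend was hit
        self._count_lock = threading.Lock()
        self._delay = backend_delay

    def _recompute(self, key):
        # Stand-in for a slow database query or expensive computation.
        with self._count_lock:
            self.backend_calls += 1
        time.sleep(self._delay)
        return f"value-for-{key}"

    def get(self, key):
        # Check-then-fill: every caller that observes a miss recomputes.
        if key in self.data:
            return self.data[key]
        value = self._recompute(key)
        self.data[key] = value
        return value


def simulate_expiry(clients=20, key="popular-item"):
    """Release `clients` threads at once against an empty (expired) entry."""
    cache = NaiveCache()
    barrier = threading.Barrier(clients)

    def client():
        barrier.wait()                      # all clients arrive together
        cache.get(key)

    threads = [threading.Thread(target=client) for _ in range(clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cache.backend_calls


if __name__ == "__main__":
    print("backend calls for one expired key:", simulate_expiry())
```

Because the clients arrive while the first recomputation is still in progress, the number of backend calls is typically close to the number of clients rather than one, which is the surge described in the steps above.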

What happened in the case of Facebook?

In Facebook’s case, the trigger was a configuration mistake. The error made a large amount of cached data invalid all at once, so many clients tried to fetch fresh data from the main (origin) servers at the same time. Those servers could not handle so many simultaneous requests and became overwhelmed, bringing the whole system to a halt.

The stampede was made worse by an automated system that handled errors incorrectly. It treated an invalid cache entry as a genuine error in the underlying data, which set up a feedback loop: each invalid entry sent clients to the main servers, the overloaded servers failed to answer, the failures produced more invalid cache entries, and those in turn caused more requests. The cycle repeated until the whole system broke down.
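
To make the feedback loop concrete, here is a toy, single-threaded Python simulation. It is a sketch under assumptions, not Facebook’s code: the function `simulate`, the `"config"` key, the `delete_on_error` flag, and the numbers used are all hypothetical. Each round, every client sees the same cache snapshot, and the backend can serve only `capacity` queries per round. The only difference between the two runs is whether a failed backend query is treated as bad data and removes the cache entry.

```python
def simulate(rounds, clients, capacity, delete_on_error):
    """Count backend queries over several rounds of client reads."""
    cache = {"config": "INVALID"}           # one bad entry starts the incident
    total_queries = 0
    for _ in range(rounds):
        snapshot = cache.get("config")      # what every client sees this round
        served = 0                          # queries the backend handled so far
        for _ in range(clients):
            if snapshot == "ok":
                continue                    # cache hit, backend untouched
            total_queries += 1              # miss or bad entry: ask the backend
            if served < capacity:
                served += 1
                cache["config"] = "ok"      # successful refresh repairs the entry
            elif delete_on_error:
                # Flawed policy: a backend failure is handled like bad data,
                # so the entry is removed and the next round misses again.
                cache.pop("config", None)
            # Safer policy: on failure, leave the cache entry alone.
    return total_queries


if __name__ == "__main__":
    print(simulate(5, 100, 10, delete_on_error=True))   # 500
    print(simulate(5, 100, 10, delete_on_error=False))  # 100
```

With the flawed policy the load never drops, because every round ends with the entry missing. With the safer policy the first successful refresh survives, and later rounds are served from the cache.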

The Facebook outage highlighted the vital role of caching in modern web applications and the risks associated with cache stampedes.

To prevent such incidents from recurring, Facebook reportedly applied several measures, including:

Mutual Exclusion: Using mutexes or other synchronization mechanisms to prevent multiple clients from writing to the cache or invalidating cache entries at the same time.
Soft Invalidations: Marking cached data as stale while still allowing clients to use it until the updated data has been fetched. This reduces the load on the origin server and prevents cache stampedes.
Load Balancing: Applying effective load-balancing strategies to distribute requests across multiple origin servers, improving overall performance.
Monitoring and Alerts: Setting up robust monitoring and alerting systems to detect cache stampedes early and take corrective action promptly.
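
The first two measures can be combined in a few lines. The Python sketch below is a minimal illustration, assuming a single-process, in-memory cache. The class name `StampedeSafeCache`, the `loader` callback, and the `soft_ttl` parameter are hypothetical names chosen for this example and do not describe Facebook’s implementation. A per-key lock ensures that only one caller recomputes a value, and a soft TTL lets other callers keep using the stale value while that refresh is in progress.

```python
import threading
import time


class StampedeSafeCache:
    """Toy in-memory cache with a soft TTL and one refresher per key."""

    def __init__(self, loader, soft_ttl):
        self._loader = loader            # function: key -> value (slow origin call)
        self._soft_ttl = soft_ttl        # seconds before an entry counts as stale
        self._data = {}                  # key -> (value, stored_at)
        self._locks = {}                 # key -> lock guarding recomputation
        self._guard = threading.Lock()   # protects the _locks dictionary

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _is_fresh(self, entry):
        return entry is not None and time.monotonic() - entry[1] < self._soft_ttl

    def get(self, key):
        entry = self._data.get(key)
        if self._is_fresh(entry):
            return entry[0]              # fresh hit, no locking needed
        lock = self._lock_for(key)
        if entry is not None:
            # Soft invalidation: a stale value exists. If another caller is
            # already refreshing it, return the stale value instead of waiting.
            if not lock.acquire(blocking=False):
                return entry[0]
        else:
            lock.acquire()               # cold miss: wait for the single loader
        try:
            # Re-check: another caller may have refreshed while we waited.
            entry = self._data.get(key)
            if self._is_fresh(entry):
                return entry[0]
            value = self._loader(key)    # only the lock holder reaches the origin
            self._data[key] = (value, time.monotonic())
            return value
        finally:
            lock.release()
```

A caller would construct it with its own loader, for example `StampedeSafeCache(loader=fetch_from_db, soft_ttl=30.0)`, where `fetch_from_db` is whatever function queries the origin. The trade-off in this design is that clients may briefly read stale data; in exchange, an expired popular key results in one origin query instead of one per client.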

The Facebook outage served as a valuable lesson for the industry, emphasizing the importance of designing caching systems with potential failures in mind and of putting appropriate safeguards in place to prevent cache stampedes.
