Redundancy | System Design

Last Updated : 04 Jan, 2024

In Computer Science, redundancy means having backups or duplicates of things to make sure your computer systems keep working even if something breaks. Imagine you have important files on your computer. If you only have them in one place and your computer crashes or the files get deleted, you’ll lose everything. But if you also keep copies of those files on an external hard drive or in the cloud, that’s redundancy.

Redundancy helps prevent big problems when things go wrong. It can be applied to different parts of a computer system, like having extra computer servers, multiple copies of data, or backup internet connections. This way, if one part fails, the redundant one takes over, and everything keeps running smoothly.

Types of Redundancies

Hardware Redundancy

Hardware Redundancy involves duplicating critical hardware components to ensure system availability in the case of failure.

Example:

In a RAID (Redundant Array of Independent Disks) configuration, multiple hard drives are used to store data redundantly. If one drive fails, data can still be retrieved from the other drives.
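
One way parity-based RAID levels (such as RAID 5) make a lost drive recoverable is by storing an XOR parity block alongside the data. The sketch below is a simplified illustration of that idea, not a model of any real RAID controller; the block contents and the `xor_blocks` helper are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, each imagined to live on its own drive.
data_blocks = [b"disk", b"fail", b"safe"]

# The parity block would be stored on a fourth drive.
parity = xor_blocks(data_blocks)

# Simulate losing the second drive, then rebuild its block
# from the surviving data blocks plus the parity block.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields exactly the missing block.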

Software Redundancy

Software Redundancy relies on multiple instances of an application or service running simultaneously to ensure uninterrupted operation.

Example:

Web servers often use software load balancers to distribute incoming requests across multiple server instances. If one server fails, the load balancer redirects traffic to healthy servers.

Data Redundancy

Data Redundancy involves storing the same data in multiple locations or using replication techniques to ensure data availability.

Example:

Database replication creates redundant copies of a database across multiple servers. If one server fails, another can continue serving the same data.
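
The idea can be sketched with a toy in-memory store that copies every write to all replicas and reads from whichever replica is still online. This is an illustration under simplifying assumptions (synchronous writes, no catch-up for replicas that were offline during a write); the `ReplicatedStore` class is hypothetical and does not represent any real database.

```python
class ReplicatedStore:
    """Toy key-value store that copies every write to all online replicas."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.online = [True] * replica_count

    def write(self, key, value):
        # Synchronous replication: apply the write to every online replica.
        for replica, up in zip(self.replicas, self.online):
            if up:
                replica[key] = value

    def read(self, key):
        # Serve the read from the first replica that is still online.
        for replica, up in zip(self.replicas, self.online):
            if up:
                return replica[key]
        raise RuntimeError("no replica available")


store = ReplicatedStore()
store.write("user:1", "Alice")
store.online[0] = False        # simulate a server failure
value = store.read("user:1")   # still served by another replica
```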

Network Redundancy

Network Redundancy provides multiple network paths or connections to ensure network availability and fault tolerance.

Example:

BGP (Border Gateway Protocol) routing can reroute traffic over alternative network paths when a path fails, ensuring data can still flow.

Geographic Redundancy

Geographic Redundancy involves deploying redundant systems or data centers in different geographic locations to protect against disasters or region-specific outages.

Example:

A global cloud service provider maintains data centers on multiple continents to ensure service availability even in the event of a regional disaster.

Understanding Active and Passive Redundancy in System Design

Active Redundancy

Active Redundancy is when you have two or more components doing the same job at the same time. If one of them can’t do the job, the others step in right away to keep the system running smoothly.

Example:

Think of a website with two servers working together. They both serve the website to users. If one server has a problem, the other server quickly takes over to make sure the website keeps running without any issues.

Passive Redundancy

Passive Redundancy is like having a backup that doesn’t do anything until it is needed. It stays quiet in the background, ready to jump in and help only when there’s a problem.

Example:

In computer networks, you can have a spare or backup router. The backup doesn’t do any work until the main router has a problem. When the main one fails, the spare router starts working to keep the network connected.

Role of Load Balancing in Redundancy

Let’s first understand what load balancing is.

Load Balancing in System Design is used to distribute incoming network traffic or computational workloads among a group of servers or resources to ensure they work efficiently and reliably.

Load Balancing plays a crucial role in Redundancy by ensuring that multiple servers or resources are utilized effectively. This helps enhance reliability and ensures that if one server fails, the others can seamlessly take over, keeping the system operational and reducing downtime.

Examples:

  • A web application distributing user requests across multiple web servers.
  • A DNS server using a round-robin algorithm to distribute requests across multiple IP addresses for a single domain.
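
The round-robin approach mentioned above can be sketched as follows. This is a minimal illustration, assuming the balancer is told when a server goes down or comes back; the `RoundRobinBalancer` class and the server names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out servers in rotation, skipping any marked unhealthy."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        self.healthy.add(server)

    def next_server(self):
        # Try each server at most once per call, skipping unhealthy ones.
        for _ in range(len(self.servers)):
            server = next(self._ring)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers")


balancer = RoundRobinBalancer(["web-1", "web-2", "web-3"])
first_round = [balancer.next_server() for _ in range(3)]

balancer.mark_down("web-2")   # simulate a server failure
after_failure = [balancer.next_server() for _ in range(4)]
```

After `web-2` is marked down, requests keep rotating between the two remaining servers, which is the redundancy benefit described above.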

Failover Mechanisms:

Failover mechanisms are essential for ensuring uninterrupted service when a component within a redundant system fails. These mechanisms automatically detect failures and switch to a redundant component.

Examples include:

  • Server Failover: When a web server fails, a load balancer redirects traffic to a backup server.
  • Database Failover: In database clusters, a primary database server failure triggers the promotion of a standby server to the primary role.
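
The detect-and-switch behavior can be sketched with a toy primary/standby pair. This is a simplified illustration, assuming node health is reported through a plain dictionary rather than real network probes; the `FailoverCluster` class and node names are hypothetical.

```python
class FailoverCluster:
    """Toy primary/standby pair with health-check-driven promotion."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        # True means the node is currently reported healthy.
        self.status = {primary: True, standby: True}

    def health_check(self):
        """Promote the standby if the primary is down; return the active node."""
        primary_down = not self.status[self.primary]
        standby_ready = self.standby is not None and self.status[self.standby]
        if primary_down and standby_ready:
            # Failover: the standby takes over the primary role.
            self.primary, self.standby = self.standby, None
        return self.primary


cluster = FailoverCluster("db-primary", "db-standby")
before = cluster.health_check()

cluster.status["db-primary"] = False   # simulate a primary failure
after = cluster.health_check()
```

In a real deployment the promoted node would also need a fresh standby provisioned behind it; the sketch leaves that step out.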

Testing and Validation

Testing and Validation are critical to ensure that redundancy mechanisms work as expected. These include:

  • Redundancy Testing: Simulating failures to verify that redundant components and failover mechanisms function correctly.
  • Validation Testing: Ensuring that data synchronization and consistency are maintained in redundant systems.
  • Load Testing: Assessing how the system performs under heavy loads to identify potential bottlenecks and ensure that load balancing is effective.

Fault Tolerance

Fault Tolerance is the ability of a system to continue functioning even in the presence of failures. Redundancy is a key component of fault tolerance, but fault tolerance also includes error detection, error correction and graceful degradation. Systems with high fault tolerance can provide uninterrupted service despite failures.
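
Graceful degradation can be illustrated with a small sketch: when the primary service raises an error, the caller falls back to a possibly stale cached answer instead of failing outright. The function names, the cache contents, and the choice of `ConnectionError` as the failure signal are all assumptions made for the example.

```python
def fetch_recommendations(user_id, primary, cache):
    """Return live results when possible, otherwise a degraded answer."""
    try:
        return primary(user_id)
    except ConnectionError:
        # Error detected: degrade to cached data, or an empty list
        # if nothing is cached for this user.
        return cache.get(user_id, [])


def failing_service(user_id):
    # Stand-in for a dependency that is currently unavailable.
    raise ConnectionError("recommendation service unavailable")


cache = {42: ["item-a", "item-b"]}
result = fetch_recommendations(42, failing_service, cache)
```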

Metrics

Measuring the effectiveness of redundancy and fault tolerance is crucial. Common metrics include:

1. Mean Time Between Failures (MTBF):

Measures the average time between component failures.

MTBF = Total Operating Time / Number of Failures

Example:

Let’s say you have a server that has been running for 1,000 hours, and it has experienced 2 failures during that time.

MTBF = 1,000 hours / 2 failures = 500 hours per failure

So, the MTBF for this server is 500 hours per failure. This means that, on average, you can expect this server to operate for approximately 500 hours before it encounters a failure. It’s a measure of the system’s reliability. The higher the MTBF, the more reliable the system, because it can operate for a longer time without experiencing failures.

2. Mean Time to Recovery (MTTR):

Measures the average time it takes to recover from a failure.

MTTR = Total Downtime / Number of Failures

Example:

Suppose you have a network router that failed 2 times in a month, with a total downtime of 4 hours across both failures.

MTTR = 4 hours / 2 failures = 2 hours per recovery.

This means that, on average, it takes 2 hours to restore the network router to full operational status each time it encounters a failure. A lower MTTR indicates that the system can recover more quickly.
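
Both MTBF and MTTR can be derived from the same outage log. The sketch below assumes a hypothetical log of `(down_at, up_at)` timestamps within an observation period; the dates and durations are invented so that the results match the figures used in the two examples above.

```python
from datetime import datetime, timedelta


def mtbf_and_mttr(period_start, period_end, outages):
    """Return (MTBF, MTTR) in hours for a list of (down_at, up_at) outages."""
    if not outages:
        raise ValueError("at least one outage is required")

    def to_hours(delta):
        return delta.total_seconds() / 3600

    downtime = sum((up - down for down, up in outages), timedelta())
    operating = (period_end - period_start) - downtime
    return to_hours(operating) / len(outages), to_hours(downtime) / len(outages)


start = datetime(2024, 1, 1)
end = start + timedelta(hours=1004)

# Two hypothetical outages of 2 hours each.
outages = [
    (start + timedelta(hours=100), start + timedelta(hours=102)),
    (start + timedelta(hours=600), start + timedelta(hours=602)),
]

mtbf_hours, mttr_hours = mtbf_and_mttr(start, end, outages)
```

Here the 1,004-hour period minus 4 hours of downtime leaves 1,000 operating hours, giving an MTBF of 500 hours and an MTTR of 2 hours.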

3. Availability:

Represents the percentage of time a system is operational.

Availability = (Total Uptime / Total Time) * 100%

Example:

Over an observation period, a data center was operational for 8,760 hours and had 50 hours of downtime, for a total of 8,810 hours.

Availability = (8,760 hours / (8,760 hours + 50 hours)) * 100 % = 99.43%

So, the availability of the data center is approximately 99.43%. High availability is usually desirable for critical systems because it indicates that they are reliable and accessible to users for the majority of the time.
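
The availability formula translates directly into a small helper. This is a minimal sketch; the function name and the input validation are choices made for the example.

```python
def availability_percent(uptime_hours, downtime_hours):
    """Availability as a percentage of total observed time."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total * 100


data_center = availability_percent(8_760, 50)
print(f"{data_center:.2f}%")  # prints "99.43%"
```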

4. Response Time:

Measures how quickly the system responds to user requests.

Response Time = (Total Processing Time + Total Queue Time) / Number of Requests

Example:

For a web server, you recorded that each request took on average 5 seconds to process and spent 2 seconds in the queue. Over a day it handled 10,000 requests, so the total processing time was 50,000 seconds and the total queue time was 20,000 seconds.

Response Time = (50,000 seconds + 20,000 seconds) / 10,000 requests = 7 seconds per request.

The average response time for this web server is 7 seconds per request.
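
The response time formula can also be applied to per-request measurements. The sketch below assumes each request is recorded as a `(processing_ms, queue_ms)` pair; the sample values are invented for illustration and are unrelated to the example above.

```python
def average_response_time(samples):
    """Average of (processing + queue) time over all recorded requests."""
    if not samples:
        raise ValueError("at least one request sample is required")
    total_processing = sum(processing for processing, _ in samples)
    total_queue = sum(queue for _, queue in samples)
    return (total_processing + total_queue) / len(samples)


# Hypothetical (processing_ms, queue_ms) measurements for four requests.
samples = [(120, 30), (200, 50), (100, 20), (180, 20)]
avg_ms = average_response_time(samples)  # 180.0 milliseconds
```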

5. Resource Utilization:

Evaluates the efficiency of resource usage in redundant components.

Resource Utilization = (Resource Usage / Total Available Resources) * 100%

Example:

Let’s say a redundant set of servers collectively uses 200 GB out of 500 GB of available storage space.

Resource Utilization = (200 GB / 500 GB) * 100 % = 40%

The resource utilization for this storage system is 40%.

Real-life Applications of Redundancy

Redundancy is essential in many industries where reliability and safety are critical:

  • Finance:
    • Redundancy is crucial to the finance sector’s ability to maintain the security and availability of financial systems. Banks might, for instance, put hot standby systems in place so that core banking services can carry on even in the case of malfunctions or interruptions.
  • Healthcare:
    • Redundancy is also crucial in the healthcare sector to guarantee the accuracy and availability of patient data. Hospitals may, for instance, use data replication so that patient data is constantly accessible and can be promptly restored in the event of data loss or corruption.
  • Aviation:
    • Redundancy is essential in the aviation sector for guaranteeing the dependability and safety of aircraft systems. Aircraft engines, for instance, are built with redundant components such as backup ignition systems and fuel pumps, so that an engine can keep running even if one component fails.
  • Telecommunications:
    • Redundancy plays a vital role in the telecommunications sector in guaranteeing the dependability and availability of network services. Telecommunication companies, for instance, could put load balancing and redundant network channels in place to make sure that essential services still function during network outages or disturbances.

Conclusion:

In conclusion, redundancy is a key strategy to ensure the continuous operation of critical systems and data, even in the face of failures and unexpected challenges. It comes in various forms such as hardware, software, data, network and geographic redundancy. To make it all work smoothly, we use load balancing and failover mechanisms. Testing and fault tolerance ensure that our redundancy works as planned.
