System Design: When to Use Cassandra and When Not To

Apache Cassandra is a distributed NoSQL database designed to manage massive volumes of structured and semi-structured data across many commodity servers. It was developed at Facebook and open-sourced in 2008. Cassandra is known for its scalability, fault tolerance, and high availability, and it combines ideas from Amazon’s Dynamo and Google’s Bigtable.

Cassandra’s architecture is decentralized: every node in the cluster plays the same role and stores a portion of the data. Data is partitioned across the nodes using a distributed hash table, and each partition is replicated to several nodes for high availability and durability. Cassandra also offers configurable consistency, which lets users trade off availability against consistency.

Cassandra works well when we need to write and store a large amount of data in a distributed system with good performance and do not need full ACID guarantees.

When to use Cassandra and why?

Let’s look at some of the use cases of Cassandra:

Feature of Masterless Replication for high availability

Cassandra relies on a masterless model. All nodes behave the same way, and each stores a subset of the data; there is no master or slave. When we insert a new row, it goes to at least one of these nodes and is replicated to a configured number of nodes.
When we ask for a row, we can send the request to any node. That node acts as the coordinator: it works out which nodes hold replicas of the row, forwards the request to them, and returns the result to us. This is why every node can serve requests for any piece of data, even data it does not hold. All of this is managed by the nodes in the cluster with no intervention from us. The nodes know they are running in a distributed environment and constantly exchange cluster state, such as which nodes are up or down, with each other. This exchange is called gossiping.
We don’t have to worry about managing masters and slaves. The system is built from homogeneous nodes, so we mostly just decide how many nodes we want. There are no master elections and no single point of failure, which makes the cluster easier to operate.
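As a rough illustration of the masterless idea, here is a minimal Python sketch. It is a toy in-memory model, not Cassandra's implementation: the `Node` class, its method names, and the simplified placement rule (hash the key, then take the next nodes in list order) are all assumptions made for this example.

```python
import hashlib


class Node:
    """One peer in a toy masterless cluster; every node runs the same code."""

    def __init__(self, name):
        self.name = name
        self.rows = {}  # only the rows this node is a replica for

    def owners(self, key, cluster, replication_factor):
        # Every node can compute the replica set for a key on its own,
        # so no master has to be consulted. Placement is simplified here:
        # hash the key, then take the next nodes in list order.
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(cluster)
        return [cluster[(start + i) % len(cluster)]
                for i in range(replication_factor)]

    def coordinate_write(self, key, value, cluster, replication_factor=3):
        # The coordinator forwards the write to every replica of the key.
        for replica in self.owners(key, cluster, replication_factor):
            replica.rows[key] = value

    def coordinate_read(self, key, cluster, replication_factor=3):
        # The coordinator may not hold the row itself; it asks the replicas.
        for replica in self.owners(key, cluster, replication_factor):
            if key in replica.rows:
                return replica.rows[key]
        return None


cluster = [Node(f"node{i}") for i in range(5)]
cluster[0].coordinate_write("user:42", {"name": "Ada"}, cluster)

# Only three of the five nodes store the row...
holders = [n.name for n in cluster if "user:42" in n.rows]
# ...but any node can serve the read, including one that stores no copy.
results = [n.coordinate_read("user:42", cluster) for n in cluster]
```

The point of the sketch is that the client never needs to know which nodes own a row: any node it contacts can route the request.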

Figure: Masterless replication in Cassandra

Feature of Tunable Consistency:

Cassandra offers a tunable consistency model: developers choose the level of consistency they want for each operation. We can pick eventual consistency, or pay for strong consistency with some availability and latency.
Distributed database systems are often grouped into two categories: AP systems, which favor availability over consistency, and CP systems, which favor consistency over availability. Neither is right or wrong; the choice depends on what matters more to the application. Cassandra is usually described as an AP system and is often chosen for its high availability. However, we can tune the balance between availability and consistency. As the CAP theorem explains, no system can guarantee both during a network partition, so all we can do is trade one for the other.
We control consistency with the replication factor and the consistency level. The replication factor sets how many copies of the data are kept. The consistency level sets how many replicas must acknowledge a write before the response is returned to the user, and how many replicas must respond to a read. In short, Cassandra is usually run as an eventually consistent system, but it can be configured for stricter levels of consistency.
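The interplay between replication factor and consistency level comes down to simple arithmetic: a read is guaranteed to overlap with the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. The helper names below are made up for this sketch; ONE and QUORUM are Cassandra consistency levels.

```python
def quorum(replication_factor):
    """Replicas needed for QUORUM: a strict majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor, write_acks, read_acks):
    """True when the read and write replica sets must overlap (R + W > RF)."""
    return read_acks + write_acks > replication_factor


rf = 3
q = quorum(rf)

# QUORUM writes + QUORUM reads: 2 + 2 > 3, so every read overlaps the write.
strong = reads_see_latest_write(rf, write_acks=q, read_acks=q)

# ONE writes + ONE reads: 1 + 1 <= 3, so a read may hit a stale replica.
eventual = reads_see_latest_write(rf, write_acks=1, read_acks=1)
```

With a replication factor of 3, QUORUM still tolerates one replica being down, which is why QUORUM reads and writes are a common middle ground between ONE and ALL.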

Feature of High Write Performance:

Since a Cassandra cluster generally has many nodes and every node can accept writes, we get good write performance that grows with the cluster. A relational database with a single master and multiple slaves cannot match this, because every write has to go through the one master, and multi-master setups usually bring their own complexities.

When not to use Cassandra and why?

Let’s look at some of the scenarios where Cassandra is not that helpful:

When we need many different types of queries or cannot predict how the data will be used

While Cassandra has a lot of strong points, it’s far from perfect. It comes with its own set of headaches. If it didn’t, we would all be using Cassandra all the time. One of these headaches is its data modeling.
Data modeling in Cassandra is not simple, a trait shared by many other databases inspired by the famous Dynamo paper. In short, the queries we are allowed to execute depend on the data model: tables are designed around known queries, with the partition key chosen to match how the data will be looked up. Even when a query is allowed, its performance depends on the decisions we made when designing the table.
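To see why the data model dictates which queries are cheap, consider a toy model of a table partitioned by user. This is a plain Python dictionary standing in for a partitioned table, not Cassandra code; the table name `orders_by_user` and the helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical "orders_by_user" table: the partition key is user_id.
orders_by_user = defaultdict(list)


def insert_order(user_id, order_id, status):
    orders_by_user[user_id].append({"order_id": order_id, "status": status})


def orders_for_user(user_id):
    """The query the table was designed for: reads a single partition."""
    return orders_by_user.get(user_id, []), 1


def orders_with_status(status):
    """A query the model was not designed for: visits every partition."""
    hits = [order
            for rows in orders_by_user.values()
            for order in rows
            if order["status"] == status]
    return hits, len(orders_by_user)


for uid in range(1000):
    insert_order(f"user{uid}", f"order{uid}",
                 "shipped" if uid % 2 else "pending")

rows, partitions_read = orders_for_user("user7")
hits, partitions_scanned = orders_with_status("shipped")
```

Looking up by the partition key touches one partition, while filtering on another column has to scan all of them. In a real cluster those partitions live on different nodes, so the second kind of query gets more expensive as the data grows.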

When the system scale is low and a single node works fine

Distributed systems are complex and often require a lot more plumbing than a single-node system. If a single-node database works for us, Cassandra probably doesn’t bring much to the table.
If we only need more read performance and a single-node database can still handle the writes, chances are Cassandra isn’t the right fit either. Cassandra’s true use case, in a single word, is scale. The scale in question is on the order of thousands of requests per second (RPS) and hundreds of gigabytes of RAM.

When we want strong ACID compliance

Cassandra (and, in fact, most Dynamo-style databases) makes trade-offs with ACID. All of them, to some extent, offer higher performance and easier scaling but compromise on atomicity, consistency, and isolation.
It’s wrong to say that Cassandra offers no isolation, consistency, or atomicity; the picture is more nuanced. For example, we get partition-level atomicity (although without rollback), and we can get strong consistency if we really want it (at a heavy cost in availability, which is why it is rarely the default choice). So think of it as: Cassandra supports ACID, but terms and conditions apply. If we need full ACID compliance, we should probably look elsewhere.

When we want many-to-many mappings or joins between tables

Cassandra doesn’t support a relational schema with foreign keys and join tables. So if we want to write a lot of complex join queries, then Cassandra might not be the right database for us.
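Without joins, a many-to-many relationship is usually handled by denormalizing: the application keeps one table per lookup direction and writes to both. The sketch below models that with two Python dictionaries; the table names and the `enroll` helper are hypothetical and only illustrate the pattern.

```python
from collections import defaultdict

# Two hypothetical query tables, each keyed by the value we look up by.
courses_by_student = defaultdict(set)
students_by_course = defaultdict(set)


def enroll(student, course):
    # The application writes both copies itself; there is no join and no
    # foreign key to keep the two tables in sync.
    courses_by_student[student].add(course)
    students_by_course[course].add(student)


enroll("alice", "databases")
enroll("alice", "networks")
enroll("bob", "databases")

alice_courses = sorted(courses_by_student["alice"])
database_students = sorted(students_by_course["databases"])
```

Each read is a single-partition lookup, but every write is duplicated and the application is responsible for consistency between the copies. That cost grows quickly when many relationships have to be maintained this way.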

When we do not need a rigid schema

If individual items in our table need to be flexible and carry different columns, then perhaps we should look at another kind of database, such as a document database (e.g., MongoDB).

Problems and use cases where Cassandra helps with data transfer among servers:

Cassandra’s fault-tolerant architecture is built for high write throughput, which helps with the following issues and use cases when data packets move between servers:

Scalability: Cassandra scales horizontally by adding nodes to the cluster, so it can manage massive quantities of data without a significant performance hit. As data arrives from other servers, Cassandra distributes it evenly across the nodes, which keeps any one node from becoming a bottleneck.
High availability: Cassandra’s architecture keeps data accessible even when nodes fail. Incoming data is replicated to several nodes, which prevents data loss if one of them fails.
Performance: Cassandra is optimized for high write throughput, which suits use cases with heavy data transfer between servers. A sufficiently large cluster can handle millions of writes per second, so data is processed and stored quickly.
Flexibility: Cassandra tables have a defined schema, but the schema is easy to evolve: columns can be added without downtime, rows need not populate every column, and collection types such as lists, sets, and maps are supported. This helps when the transferred data comes in several shapes.
Real-time analytics: Because data can be queried as soon as it is written, Cassandra can back simple real-time analytics on data as it arrives. This enables better business intelligence and quicker decision-making.

Problems Cassandra helps solve:

Cassandra is a NoSQL database that is designed to solve a range of data management problems. Here are some of the main problems that Cassandra can help to solve:

Scalability: Cassandra is highly scalable and handles very large amounts of data. Its distributed architecture scales horizontally: more nodes can be added to the cluster as needed to increase capacity. This makes it a good fit for applications that must absorb sudden growth in data volume or user traffic.

High availability: Despite node failures, Cassandra is built to retain high availability. Data is replicated across numerous nodes using a decentralised architecture so that even if one node fails, access to the data is still possible from another node. Because of this, it is an excellent option for applications that need continuous availability.

Performance: Cassandra is designed for fast writes and quick access to individual rows. Its wide-column (column-family) model encourages denormalized data storage, which can improve query performance in some cases. Indexing and caching options can improve performance further.

Flexibility: Cassandra is adaptable and suits a variety of applications. It supports a range of data types, including text, numbers, and blobs, so it can store both structured data and opaque binary payloads. Depending on the requirements of the application, it can also be configured to offer different degrees of consistency and durability.

Geographic distribution: Cassandra handles data that is spread across several data centers. Multi-datacenter replication keeps copies of the data in different regions. This makes it a good fit for applications that must keep performance consistent across regions or comply with data residency regulations.

Internet of Things (IoT): IoT devices produce large volumes of data that must be ingested and analyzed quickly. Because it sustains high write throughput and makes new data available for querying right away, Cassandra is a strong option for this use case.

Fault tolerance: Cassandra is designed to be fault-tolerant, which means it keeps running even if some nodes or components fail. It achieves this through replication, load balancing across nodes, and automatically routing requests to the replicas that are still healthy.
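The geographic distribution and fault tolerance points above can be made concrete with a little arithmetic. In a multi-datacenter setup each data center has its own replication factor, and the LOCAL_QUORUM consistency level requires a majority of replicas in the local data center only. The data center names and replication factors below are assumptions for illustration.

```python
def local_quorum(local_rf):
    """Replicas in the local data center that must respond for LOCAL_QUORUM."""
    return local_rf // 2 + 1


def tolerated_local_failures(local_rf):
    """Replicas that can be down in one data center with LOCAL_QUORUM still met."""
    return local_rf - local_quorum(local_rf)


# Hypothetical layout: three replicas in each of two data centers.
replication = {"dc_east": 3, "dc_west": 3}

total_copies = sum(replication.values())
acks_needed = {dc: local_quorum(rf) for dc, rf in replication.items()}
failures_ok = {dc: tolerated_local_failures(rf) for dc, rf in replication.items()}
```

With three replicas per data center, a request at LOCAL_QUORUM waits for two local replicas and never for a cross-region round trip, and each data center can lose one replica without failing requests.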

Cassandra’s role in linear scaling:

Cassandra is built to scale close to linearly: as nodes are added to the cluster, its capacity and throughput grow roughly in proportion, so it can manage a massive volume of data while retaining good performance and availability.

Traditional databases are typically scaled vertically, by giving a single node more resources (such as CPU or RAM). This strategy is limited in both scalability and cost-effectiveness, because adding resources to a single node becomes increasingly difficult and expensive.

In contrast, Cassandra is built for horizontal scalability, which entails expanding the cluster’s number of nodes to improve its capacity. This method is more economical and scalable since it enables you to add additional nodes as needed to meet growing data and traffic volumes.

Cassandra achieves near-linear horizontal scaling through a number of key features:

Decentralized architecture: Cassandra uses a decentralized architecture with no single point of failure, in which every node in the cluster is treated equally. As a result, adding new nodes to the cluster is simple and doesn’t require a lot of configuration or maintenance work.

Peer-to-peer communication: The peer-to-peer communication paradigm used by Cassandra allows each node to directly communicate with any other node. This eliminates the need for a central coordinator and makes it simple to distribute and copy data throughout the cluster.

Partitioning and replication: Cassandra partitions data across the cluster’s nodes, which spreads storage and request load, and replicates each partition to several nodes for high availability and fault tolerance.

Consistent hashing: To distribute data evenly among the cluster’s nodes, Cassandra uses consistent hashing. This helps avoid hotspots that can hurt performance and limits how much data has to move when nodes are added or removed.

Linearizable consistency: Cassandra can provide linearizable consistency for operations that need it, through lightweight transactions (compare-and-set operations built on Paxos). These cost more than ordinary reads and writes, so they are meant for the few operations that require every client to observe the same latest value.
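The partitioning, replication, and consistent hashing items above can be sketched as a small hash ring. This is a simplified illustration rather than Cassandra's partitioner: the `Ring` class is hypothetical, MD5 stands in for the real token function, and the number of virtual nodes is an arbitrary choice.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 is an assumption here)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each node owns several virtual positions to even out the load.
        self.entries = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key, replication_factor=3):
        """Walk clockwise from the key's token, collecting distinct nodes."""
        start = bisect.bisect_right(self.tokens, token(key))
        found = []
        for step in range(len(self.entries)):
            node = self.entries[(start + step) % len(self.entries)][1]
            if node not in found:
                found.append(node)
            if len(found) == replication_factor:
                break
        return found


keys = [f"key{i}" for i in range(2000)]
before = Ring(["n1", "n2", "n3", "n4"])
after = Ring(["n1", "n2", "n3", "n4", "n5"])

# Keys whose primary owner changed after adding the fifth node.
moved_keys = [k for k in keys if before.replicas(k, 1) != after.replicas(k, 1)]
```

Adding a node only takes over the ranges that the new node's tokens claim, so the keys that move all move to the new node and the rest stay where they were. That is what keeps adding capacity cheap compared with rehashing everything.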

Overall, Cassandra is a good choice for applications that need to manage massive volumes of data while retaining high performance and availability, because it scales close to linearly as nodes are added. It can support a variety of use cases, including IoT devices, real-time analytics, and web and mobile apps.

Use cases where Cassandra should not be used:

While Cassandra is a powerful and flexible database system, there are certain use cases where it may not be the best choice. Here are some situations where Cassandra may not be the optimal solution:

Small datasets: Cassandra can be overkill if you are working with a relatively small dataset that fits in a single machine’s RAM. Cassandra is designed for huge datasets spread over many nodes, so a simpler database system is more appropriate for a small one.

Low-latency reads: Because of its distributed architecture, Cassandra may have higher read latencies than a conventional single-node relational database. If your application requires extremely low read latency, Cassandra might not be the ideal option, and a single-node relational database may be preferable.

Complex queries: Although Cassandra supports a variety of query types, sophisticated queries that call for joins or aggregations across numerous tables are not well suited for Cassandra. A relational database can be a better option if your application calls for complicated queries.

Transactions: Cassandra does not support conventional ACID transactions, which can be a drawback for applications requiring high levels of consistency assurance. A relational database or a distributed transaction system would be a better choice if your application needs transactions.

Heavy analytics workloads: Cassandra is not designed to handle large analytics workloads that demand sophisticated aggregations or machine learning techniques, even though it can enable simple analytics queries. A dedicated analytics database or data warehouse might be a preferable option in this situation.

Limited developer resources: Cassandra’s distributed architecture can be harder to administer and may require more developer resources to set up and maintain than a traditional database. A simpler database system might be a better option if your team has limited developer resources.

Conclusion:

In conclusion, the Cassandra database system is strong and adaptable and can be utilised to support a variety of use cases. But depending on the application, data quantity, query complexity, and consistency needs, it might not be the best option. Before selecting a database system, it’s crucial to carefully assess your needs and take into account the trade-offs and restrictions of each solution.
