HLD or High Level System Design of Apache Kafka Startup

Last Updated : 30 Mar, 2023

Apache Kafka is a distributed data store optimized for ingesting and lower latency processing streaming data in real time. It can handle the constant inflow of data sequentially and incrementally generated by thousands of data sources.

Why use Kafka in the first place?

Let’s look at the problem that inspired Kafka in the first place on Linkedin. The problem is simple: Linkedin was getting a lot of logging data, like log messages, metrics, events, and other monitoring/observability data from multiple services. They wanted to utilize this data in two ways:

Have an online near-real-time system that can process and analyze this data.
Have an offline system that can process this data over a longer period.

Most of the processing was done for analysis, for example, analyzing user behavior, how users use LinkedIn, etc.

Requirement Gathering

The problem is easy to understand, but the solution can seem pretty complex. This is because the problem itself has so many constraints and requirements. Here are some examples of the requirements that such a system needs:

The system should be highly scalable. Popular products can generate tens or hundreds of TBs of data in events, metrics, and logs daily. This requires an almost linearly scalable distributed system to handle such high throughput.
This is important because we need to support the extremely high traffic. Easily hundreds of thousands of messages per second.
It should allow “producers” to send messages and “consumers” to subscribe to certain messages. This is important since there can be multiple consumers(like the online and offline systems we discussed) to the same message, and messages are generally asynchronous.
Consumers should also be able to decide how and when to consume messages. For example, in the problem we discussed, we’d want one consumer to consume messages as soon as possible and the other to do it every few hours.
Messages can be immutable (there is no need to delete log data after all), transaction-like semantics and complex delivery guarantees aren’t important requirements.

Message Brokers vs Kafka

Maybe using message brokers such as RabbitMQ, and ActiveMQ, can solve the above problem, but they cannot, and let’s see why:

Message Batching: Since we are pulling a lot of messages on the consumer, it doesn’t make sense to pull messages one by one. Most of the time, you’d want to batch messages. Otherwise, most of your time would be wasted on-network calls.
Since message brokers aren’t really meant to support such high throughput, they generally don’t provide good ways to batch messages.
Different consumers with different consumption requirements: We discussed having two types of consumers, one online system which processes messages in real-time and the other an offline system that might want to read messages received in the past twelve or twenty-four hours.
This pattern doesn’t work with most message brokers or queues. This is because some message brokers, like RabbitMQ, use a push-based model, pushing messages from the broker to the consumer. This leads to lesser flexibility for the consumer since the consumer cannot decide how and when to consume messages.
Small and simple messages: Message sizes are generally larger in most message brokers. This isn’t a bug, but it’s by design. Message brokers often support many features, like different options for routing messages, message guarantees, being able to acknowledge every message individually, etc., which leads to large individual message headers.
Large messages are fine as long as you don’t have a lot of them and you don’t have to store them, but that is precisely what we want to do in our system.
Distributed high-throughput system: One of the most important requirements is very high throughput. We want to support hundreds of thousands of messages per second, even going up to millions per second. Running this system in a single node is infeasible.
We need a distributed system that can support this throughput, which many message brokers don’t.
Large queues: Message brokers often have varying support for large queue sizes. This depends on the message broker you are using and your configuration, but the internet is filled with people facing issues with message broker queue sizes.

So, let’s now understand what should be the architecture of the Kafka system with the above mentioned requirements.

High-Level Design Architecture of Kafka

High-level design of Apache Kafka

Various Components of the Above Design

Topics: Topics are simply a stream of messages. Producers send messages to topics, and consumers poll them for messages.
Consumers: A consumer is simply an application that wants to listen to a topic. It continuously polls the broker about any messages on the topic. With each polling request, the consumer specifies the last message it received and some other configurable parameters.
Producers: Producers are applications that produce a message and publish them to the queue. Publishing messages is pretty simple: specify a topic, a message, an optional key, and optional metadata, and send it to the broker.
Consumer groups:
- Consumers would typically be a part of a consumer group. Instead of a consumer listening to a topic, generally, a consumer group would listen to a topic. The consumer group comprises multiple consumers, and anyone will receive the message.
- Generally, a single consumer would not be able to process many messages, so you’d need multiple consumers to handle messages. That way, you can support a higher throughput of messages.
- We had the example previously, where various services publish events to a Kafka topic. The events could be related to user or organization activity, such as a user searching for a company or a new job posting. There are two types of consumers listening to this. One is a recommendation service that processes these events and updates data in its database about future recommendations that must be provided to the user. The other is a script that is run once every 24 hours to provide insights into how users use our platform.
- Then, we add a recommendation service that listens to this events topic in real-time. However, since we are getting many messages, a single consumer in the Recommendation service cannot cope, so we need to add more consumers.
- This is where consumer groups come in. Multiple consumers can be a part of the consumer group, and all the messages get divided into multiple consumers.

Partitions in topics for better scale

Partitions

Having a close look at topics, we see that every topic is divided into a configurable number of ‘partitions’. Every single message in a topic is sent to exactly one partition.

Depending on the configuration and the message, this can be either based on the message’s key or in a round-robin fashion. Regardless, what’s important is that a message sent to a topic eventually goes into a single partition.

And partitions aren’t very complex. They are an append-only-like system to store messages. Think of them like a log file and the message like lines in a log file.

Consumers from a consumer group aren’t directly listening to topics. Instead, they listen to zero, one, or more partitions of the topic. Every consumer gets messages only from the partitions it listens to.

Since every consumer is assigned its own partitions on startup, consumers don’t need to discuss which messages have already been consumed. This is also helpful as it helps to scale Kafka linearly since adding more partitions/nodes doesn’t increase the work or communication between existing partitions/nodes. These partitions are often in different brokers running on different machines.

Kafka storage layout

Kafka has a very simple storage layout. Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of approximately the same size (e.g., 1GB). Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file.

Suggest improvement

What is the goal of High-Level Design(HLD)?

Share your thoughts in the comments