Complete Tutorial on HyperLogLog in Redis

Redis HyperLogLog is a probabilistic data structure used for approximating the cardinality of a set. It efficiently estimates the number of unique elements in a large dataset, making it ideal for applications where memory efficiency and speed are crucial. In this article, we will explore what Redis HyperLogLog is, cover its syntax and commands, and walk through examples of how to use it in real-world scenarios.

What is Redis HyperLogLog?

Redis HyperLogLog estimates the number of unique elements in a set without storing each element. Unlike traditional data structures, which require memory proportional to the number of elements in the set, HyperLogLog uses a fixed amount of memory (at most about 12 KB per key in Redis), which makes it extremely memory-efficient for huge datasets. The trade-off is that it returns an approximate count: the Redis implementation has a standard error of 0.81%, so estimates are usually within 1-2% of the actual count.

How Does Redis HyperLogLog Work?

HyperLogLog is based on the observation that if we hash each element of the set and count the leading zeroes in the binary representation of each hash, the maximum number of leading zeroes seen gives an estimate of the cardinality. A hash that starts with k zeroes turns up about once in every 2^k distinct elements, so the longer the longest run of zeroes observed, the more distinct elements the set is likely to contain.
To achieve this, Redis hashes each element to a 64-bit integer. Because a single maximum would be a very noisy estimate, it uses 14 bits of the hash to select one of 16,384 registers and the remaining 50 bits to count zeroes. Each register keeps the longest run it has seen, and the register values are then combined to estimate the cardinality of the set.
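
To make the idea concrete, here is a minimal educational sketch of the technique in Python. It is not the Redis implementation: the class name ToyHyperLogLog, the use of SHA-1 as the hash function, and the simple small-range correction are all assumptions chosen for readability, and Redis's own hash function and estimator differ in detail.

```python
import hashlib
import math


class ToyHyperLogLog:
    """Educational HyperLogLog sketch (hypothetical class, not Redis's code)."""

    def __init__(self, index_bits: int = 14) -> None:
        self.p = index_bits
        self.m = 1 << index_bits           # number of registers (16384 for p = 14)
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper (valid for m >= 128).
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        # 64-bit hash taken from the first 8 bytes of SHA-1 (an arbitrary choice).
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        width = 64 - self.p
        index = h >> width                  # top p bits select a register
        rest = h & ((1 << width) - 1)       # remaining bits are scanned for zeroes
        # Rank = number of leading zeroes in `rest`, plus one.
        rank = width - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self) -> int:
        # Harmonic-mean estimate over all registers.
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            raw = self.m * math.log(self.m / zeros)
        return round(raw)


hll = ToyHyperLogLog()
for i in range(10_000):
    hll.add(f"user{i}")
print(hll.count())  # an estimate close to 10000, not necessarily exact
```

Adding the same element again never changes a register, which is why duplicates do not inflate the estimate.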

Syntax and Commands

Redis provides simple and intuitive commands to work with HyperLogLog:

PFADD key element [element ...]: Adds elements to the HyperLogLog data structure associated with the given key.
PFCOUNT key [key ...]: Returns the approximated cardinality of the HyperLogLog data structure associated with the given keys.
PFMERGE destkey sourcekey [sourcekey ...]: Merges multiple HyperLogLogs into a single one, stored in destkey.
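
As a rough illustration of how the three commands fit together, the sketch below records visitors in one HyperLogLog per day and merges the days into a weekly count. The helper names (record_visit, weekly_unique_visitors) and all key names are hypothetical; client is assumed to be a redis-py client such as redis.Redis, whose pfadd, pfmerge, and pfcount methods map to the commands above.

```python
def record_visit(client, day_key, visitor_id):
    """Add a visitor to the HyperLogLog for one day.

    PFADD returns 1 if the HyperLogLog's internal state changed, 0 otherwise.
    """
    return bool(client.pfadd(day_key, visitor_id))


def weekly_unique_visitors(client, day_keys, week_key):
    """Merge the per-day HyperLogLogs into week_key and return its estimate."""
    client.pfmerge(week_key, *day_keys)   # PFMERGE destkey sourcekey [sourcekey ...]
    return client.pfcount(week_key)       # PFCOUNT key
```

With redis-py this could be called as weekly_unique_visitors(redis.Redis(host="localhost", port=6379), ["visitors:mon", "visitors:tue"], "visitors:week"). If the destination key already exists, PFMERGE treats it as one of the sources, so a reused weekly key keeps accumulating.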

Examples

Let’s see some examples to understand how to use Redis HyperLogLog.

1. Counting Unique Website Visitors

Suppose we have a website and want to count the number of unique visitors.

Java

import redis.clients.jedis.Jedis;

// Connect to a Redis server running on localhost (default port 6379)
Jedis jedis = new Jedis("localhost");

// Add visitors to the HyperLogLog for the website; repeated visitors are not counted twice
jedis.pfadd("website:visitors", "user1", "user2", "user3");

// Get the approximate number of unique visitors
long uniqueVisitors = jedis.pfcount("website:visitors");
System.out.println("Approximate unique visitors: " + uniqueVisitors);

// Release the connection
jedis.close();

Output: Approximate unique visitors: 3

Explanation: This Java code uses the Jedis library to interact with a Redis server. It connects to the Redis server running on localhost and adds three unique visitors (“user1”, “user2”, and “user3”) to the HyperLogLog stored at the key “website:visitors” using jedis.pfadd. Finally, it calls jedis.pfcount to estimate the number of unique visitors in the “website:visitors” HyperLogLog, which is 3 in this case.

2. Counting Distinct User Logins

Let’s consider a scenario where we want to count the number of distinct logins for a user.

Python

import redis

# Connect to a Redis server on localhost (default port and database)
r = redis.Redis(host='localhost', port=6379, db=0)

# Add logins to the HyperLogLog for a user
r.pfadd("user:logins", "login1", "login2", "login3")

# Get the approximate number of distinct logins for the user
unique_logins = r.pfcount("user:logins")
print("Approximate distinct logins: ", unique_logins)

This Python code uses the redis library (redis-py) to interact with a Redis server. To run it, Redis must be running on your local machine or be reachable at the given host and port.

Assuming that Redis is running and the redis library is set up correctly, the output of the code will be:

 Approximate distinct logins:  3

This output indicates that three distinct logins, namely “login1,” “login2,” and “login3,” have been added to the HyperLogLog data structure in Redis. Just like in the previous example, the HyperLogLog data structure provides an approximate count of unique elements, which is generally very close to the true count but may not be exact.

Features and Uses of Redis HyperLogLog

Redis HyperLogLog offers several features and use cases:

Memory Efficiency: HyperLogLog consumes a fixed amount of memory (at most about 12 KB per key in Redis) no matter how many elements are added, making it suitable for large datasets with millions of elements.
Approximate Cardinality: It provides an estimated count of unique elements with an acceptable error rate, making it suitable for scenarios where exact counts are not critical.
Big Data Analytics: HyperLogLog is widely used in big data analytics, where counting distinct elements in massive datasets is a common task.
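Set Operations: The union of large sets can be counted without storing the sets themselves, either by merging HyperLogLogs with PFMERGE or by passing several keys to PFCOUNT. Redis has no built-in intersection for HyperLogLogs; an intersection can only be estimated indirectly from union counts, with a larger error.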
Log Analytics: HyperLogLog is used to analyze log data, counting unique IP addresses, user agents, or event occurrences.
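
Redis offers no intersection command for HyperLogLogs, but because PFCOUNT accepts several keys and returns the cardinality of their union, an intersection can be approximated with the inclusion-exclusion principle: |A ∩ B| ≈ |A| + |B| − |A ∪ B|. The sketch below illustrates this under stated assumptions: the function name estimate_intersection is hypothetical, and client is assumed to be a redis-py client such as redis.Redis.

```python
def estimate_intersection(client, key_a, key_b):
    """Estimate |A ∩ B| for two HyperLogLog keys via inclusion-exclusion."""
    count_a = client.pfcount(key_a)
    count_b = client.pfcount(key_b)
    # PFCOUNT with several keys returns the cardinality of their union.
    count_union = client.pfcount(key_a, key_b)
    # The three estimates carry independent errors, so clamp at zero.
    return max(0, count_a + count_b - count_union)
```

The errors of the three estimates add up, so the result is unreliable when the intersection is small compared with the two sets.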

Performance and Limits of Redis HyperLogLog

Writing with PFADD and reading a single key with PFCOUNT both take O(1) time, whereas PFMERGE, like PFCOUNT called with several keys, takes O(N) time, where N is the number of HyperLogLogs involved. The HyperLogLog can estimate the cardinality of sets with up to 2⁶⁴ members.
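
The fixed memory footprint and the error rate both follow from the number of registers. As a back-of-the-envelope check, assuming the 16,384 six-bit registers of Redis's dense encoding and the standard HyperLogLog error formula 1.04 / sqrt(m):

```python
import math

REGISTERS = 16384          # 2**14 registers (dense encoding)
BITS_PER_REGISTER = 6

# Register storage only; Redis adds a small header on top of this.
dense_bytes = REGISTERS * BITS_PER_REGISTER // 8
standard_error = 1.04 / math.sqrt(REGISTERS)

print(dense_bytes)                # 12288
print(f"{standard_error:.4%}")    # 0.8125%
```

This is where the figures of roughly 12 KB per key and a standard error of about 0.81% come from.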

Conclusion

Redis HyperLogLog is a valuable addition to Redis’ powerful data structures. It allows you to efficiently estimate the cardinality of large datasets with minimal memory usage. With its simplicity, speed, and accuracy, Redis HyperLogLog is an essential tool for developers and data scientists dealing with big data and counting distinct elements. By leveraging Redis HyperLogLog, you can process and analyze large datasets with ease and make informed decisions based on the approximate cardinality of the data.
