# Introduction to Universal Hashing in Data Structures

Hashing is a great practical tool, with an interesting and subtle theory too. In addition to its use as a dictionary data structure, hashing also comes up in many different areas, including cryptography and complexity theory.

This article discusses an important notion: **Universal Hashing** (also known as *universal hash function families*).

**Universal Hashing** refers to selecting a hash function at random from a family of hash functions with a certain mathematical property. This guarantees a low expected number of collisions, no matter which keys an adversary chooses.

A randomized algorithm **H** for constructing hash functions **h : U → {1, …, M}** is **universal** if for all **x ≠ y** in **U**,

**Pr_{h ∈ H}[h(x) = h(y)] ≤ 1/M**

(i.e., the probability that a randomly chosen **h** maps **x** and **y** to the same index is at most **1/M**).

A set **H** of hash functions is called a **universal hash function family** if the procedure *“choose h ∈ H at random”* is universal. (Here the key is identifying the set of functions with the uniform distribution over the set.)

**Theorem:** If **H** is a universal hash function family, then for any set **S ⊆ U** of size **N** and any **x ∈ U**, the expected number of collisions between **x** and the elements **y ∈ S** is at most **N/M**.

**Proof:** Each **y ∈ S (y ≠ x)** has at most a **1/M** chance of colliding with **x**, by the definition of “universal”. So,

- Let **C_{xy} = 1** if **x** and **y** collide and **0** otherwise.
- Let **C_{x}** denote the total number of collisions for **x**. So, **C_{x} = ∑_{y ∈ S, y ≠ x} C_{xy}**.
- We know **E[C_{xy}] = Pr[x and y collide] ≤ 1/M**.
- So, by the linearity of expectation, **E[C_{x}] = ∑_{y} E[C_{xy}] ≤ N/M**.
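To make the bound concrete, here is a small simulation sketch (not from the article) using the classic Carter–Wegman family **h_{a,b}(x) = ((ax + b) mod p) mod M**, a standard universal family; the names `P`, `M`, and `make_hash` are illustrative choices, not anything the article defines.

```python
import random

P = 10_000_019   # an illustrative prime larger than any key we use
M = 100          # table size, so the universal bound is 1/M = 0.01

def make_hash():
    """Draw h uniformly from the Carter–Wegman family h(x) = ((a*x + b) % P) % M."""
    a = random.randrange(1, P)   # a in {1, ..., P-1}
    b = random.randrange(0, P)   # b in {0, ..., P-1}
    return lambda x: ((a * x + b) % P) % M

# Estimate Pr_h[h(x) = h(y)] for one fixed pair x != y over many random draws of h.
x, y = 12345, 67890
trials = 50_000
collisions = sum(1 for _ in range(trials) if (h := make_hash())(x) == h(y))
print(collisions / trials)   # typically close to (or below) 1/M = 0.01
```

The point of the simulation is that the randomness is over the choice of **h**, not over the keys: the pair (x, y) is fixed and adversarial, yet the collision rate still stays near 1/M.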

**Corollary:** If **H** is a universal hash function family, then for any sequence of **L** insert, lookup, or delete operations in which there are at most **M** elements in the system at any time, the expected total cost of the **L** operations for a random **h ∈ H** is only **O(L)** (viewing the time to compute **h** as constant).

**Proof:** For any given operation in the sequence, its expected cost is constant by the above theorem. Therefore, the expected total cost of the **L** operations is **O(L)** by the linearity of expectation.
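The corollary can be sketched as a chained hash table that draws its hash function at random from a universal family when it is created. This is a minimal illustrative sketch, not the article's code; the class name, the prime `P`, and the assumption of integer keys below `P` are all mine.

```python
import random

P = 10_000_019  # illustrative prime; assumes integer keys smaller than P

class UniversalHashTable:
    """Chained hash table whose h is drawn at random from a universal family."""

    def __init__(self, m):
        self.m = m
        self.a = random.randrange(1, P)   # random draw defines h for this table
        self.b = random.randrange(0, P)
        self.buckets = [[] for _ in range(m)]

    def _h(self, key):
        return ((self.a * key + self.b) % P) % self.m

    def insert(self, key, value):
        bucket = self.buckets[self._h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite existing key
                return
        bucket.append((key, value))

    def lookup(self, key):
        for k, v in self.buckets[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        idx = self._h(key)
        self.buckets[idx] = [(k, v) for k, v in self.buckets[idx] if k != key]

t = UniversalHashTable(m=64)
for i in range(50):
    t.insert(i, i * i)
print(t.lookup(7))   # 49
```

Because **h** is chosen after the operation sequence is fixed, each operation touches an expected O(1)-length bucket, giving the O(L) total expected cost from the corollary.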

**Constructing a universal hash family using the matrix method:**

Let’s say keys are **u** bits long and the table size **M** is a power of 2, so an index is **b** bits long with **M = 2^{b}**. What we will do is pick **h** to be a random **b**-by-**u** binary matrix, and define **h(x) = hx**, where **hx** is calculated by adding some of the columns of **h** (doing vector addition mod 2): the **1** bits in **x** indicate which columns to add (e.g., if the 1^{st} and 3^{rd} bits of **x** are 1, the 1^{st} and 3^{rd} columns of **h** are added). These matrices are short and fat, since **b** is much smaller than **u**.

**Claim:** For any **x ≠ y**, **Pr_{h}[h(x) = h(y)] = 1/M = 1/2^{b}**.

**Proof:** Take an arbitrary pair of keys **x ≠ y**. They must differ someplace; let’s assume they differ in the **i**^{th} coordinate, and for concreteness say **x_{i} = 0** and **y_{i} = 1**. Imagine we first choose all of **h** but the **i**^{th} column. Over the remaining choices of the **i**^{th} column, **h(x)** is fixed (since **x_{i} = 0**, the **i**^{th} column is never added into **hx**). However, each of the **2^{b}** different settings of the **i**^{th} column gives a different value of **h(y)** (in particular, every time we flip a bit in that column, we flip the corresponding bit in **h(y)**). So there is exactly a **1/2^{b}** chance that **h(x) = h(y)**.