Count distinct substrings of a string using Rabin Karp algorithm

Given a string, return the number of distinct substrings using Rabin Karp Algorithm.

Examples:

Input  : str = “aba”
Output : 5
Explanation :
Total number of distinct substring are 5 - "a", "ab", "aba", "b" ,"ba" 
Input  : str = “abcd”
Output : 10
Explanation :
Total number of distinct substring are 10 - "a", "ab", "abc", "abcd", "b", "bc", "bcd", "c", "cd", "d"

Approach:

Prerequisite: Rabin-Karp Algorithm for Pattern Searching

Calculate the current hash value of the current character and store
in a dictionary/map to avoid repetition.

To compute the hash (rolling hash) as done in Rabin-Karp algorithm follow:

The hash function suggested by Rabin and Karp calculates an integer value. The integer value for a string is the numeric value of a string. For example, if all possible characters are from 1 to 10, the numeric value of “122” will be 122. The number of possible characters is higher than 10 (256 in general) and the pattern length can be large. So the numeric values cannot be practically stored as an integer. Therefore, the numeric value is calculated using modular arithmetic to make sure that the hash values can be stored in an integer variable (can fit in memory words). To do rehashing, we need to take off the most significant digit and add the new least significant digit in the hash value. Rehashing is done using the following formula.

hash( txt[s+1 .. s+m] ) = ( d ( hash( txt[s .. s+m-1]) – txt[s]*h ) + txt[s + m] ) mod q

hash( txt[s .. s+m-1] ) : Hash value at shift s.
hash( txt[s+1 .. s+m] ): Hash value at next shift (or shift s+1)
d: Number of characters in the alphabet
q: A prime number
h: d^(m-1)

The idea is similar as we evaluate a mathematical expression. For example, we have a string of “1234” let us compute the value of the substring “12” as 12 and we want to compute the value of the substring “123” this can be calculated as ((12)*10+3)=123, similar logic is applied here.

C++

#include <bits/stdc++.h>

using namespace std;
 
// Driver code

int main()
{

  int t = 1;
 
  // store prime to reduce overflow

  long long mod = 9007199254740881;
 
  for(int i = 0; i < t; i++)

  {
 
    // string to check number of distinct substring

    string s = "abcd";
 
    // to store substrings

    vector<vector<long long>>l;
 
    // to store hash values by Rabin Karp algorithm

    unordered_map<long long,int>d;
 
    for(int i=0;i<s.length();i++){

      int suma = 0;

      long long pre = 0;
 
      // Number of input alphabets

      long long D = 256;
 
      for(int j=i;j<s.length();j++){
 
        // calculate new hash value by adding next element

        pre = (pre*D+s[j]) % mod;
 
        // store string length if non repeat

        if(d.find(pre) == d.end())

          l.push_back({i, j});

        d[pre] = 1;

      }

    }
 
    // resulting length

    cout<<l.size()<<endl;
 
    // resulting distinct substrings

    for(int i = 0; i < l.size(); i++)

      cout << s.substr(l[i][0],l[i][1]+1-l[i][0]) << " ";

  }
}
 
// This code is contributed by shinjanpatra

Java

import java.util.*;
 
public class Main {

    public static void main(String[] args) {

        int t = 1;

        // store prime to reduce overflow

        long mod = 9007199254740881L;
 
        for (int i = 0; i < t; i++) {
 
            // string to check number of distinct substring

            String s = "abcd";
 
            // to store substrings

            List<List<Integer>> l = new ArrayList<>();
 
            // to store hash values by Rabin Karp algorithm

            Map<Long, Integer> d = new HashMap<>();
 
            for (int j = 0; j < s.length(); j++) {

                long suma = 0;

                long pre = 0;
 
                // Number of input alphabets

                int D = 256;
 
                for (int k = j; k < s.length(); k++) {
 
                    // calculate new hash value by adding next element

                    pre = (pre*D + (long)s.charAt(k)) % mod;
 
                    // store string length if non repeat

                    if (!d.containsKey(pre)) {

                        List<Integer> sublist = new ArrayList<>();

                        sublist.add(j);

                        sublist.add(k);

                        l.add(sublist);

                    }

                    d.put(pre, 1);

                }

            }
 
            // resulting length

            System.out.println(l.size());
 
            // resulting distinct substrings

            for (int j = 0; j < l.size(); j++) {

                int start = l.get(j).get(0);

                int end = l.get(j).get(1);

                System.out.print(s.substring(start, end+1) + " ");

            }

        }

    }
}

Python3

# importing libraries

import sys

import math as mt

t = 1
# store prime to reduce overflow

mod = 9007199254740881
 
for ___ in range(t):
 
    # string to check number of distinct substring

    s = 'abcd'
 
    # to store substrings

    l = []
 
    # to store hash values by Rabin Karp algorithm

    d = {}
 
    for i in range(len(s)):

        suma = 0

        pre = 0
 
        # Number of input alphabets

        D = 256
 
        for j in range(i, len(s)):
 
            # calculate new hash value by adding next element

            pre = (pre*D+ord(s[j])) % mod
 
            # store string length if non repeat

            if d.get(pre, -1) == -1:

                l.append([i, j])

            d[pre] = 1
 
    # resulting length

    print(len(l))
 
    # resulting distinct substrings

    for i in range(len(l)):

        print(s[l[i][0]:l[i][1]+1], end=" ")

using System;

using System.Collections.Generic;
 
class GFG {

static void Main()
{

int t = 1;
 
    // store prime to reduce overflow

    long mod = 9007199254740881;
 
    for (int i = 0; i < t; i++)

    {

        // string to check number of distinct substring

        string s = "abcd";
 
        // to store substrings

        List<List<long>> l = new List<List<long>>();
 
        // to store hash values by Rabin Karp algorithm

        Dictionary<long, int> d = new Dictionary<long, int>();
 
        for (int j = 0; j < s.Length; j++)

        {

            int suma = 0;

            long pre = 0;
 
            // Number of input alphabets

            long D = 256;
 
            for (int k = j; k < s.Length; k++)

            {

                // calculate new hash value by adding next element

                pre = (pre * D + s[k]) % mod;
 
                // store string length if non repeat

                if (!d.ContainsKey(pre))

                {

                    List<long> sub = new List<long>();

                    sub.Add(j);

                    sub.Add(k);

                    l.Add(sub);

                }

                d[pre] = 1;

            }

        }
 
        // resulting length

        Console.WriteLine(l.Count);

        // resulting distinct substrings

        for (int j = 0; j < l.Count; j++)

        {

            Console.Write(s.Substring((int)l[j][0], (int)l[j][1] + 1 - (int)l[j][0]) + " ");

        }

    }
}
}
//This code is contributed by rudra1807raj

Javascript

<script>
 
let t = 1
 
// store prime to reduce overflow
let mod = 9007199254740881
 
for(let i = 0; i < t; i++){

    // string to check number of distinct substring

    let s = 'abcd'
 
    // to store substrings

    let l = []
 
    // to store hash values by Rabin Karp algorithm

    let d = new Map()
 
    for(let i=0;i<s.length;i++){

        let suma = 0

        let pre = 0
 
        // Number of input alphabets

        let D = 256
 
        for(let j=i;j<s.length;j++){
 
            // calculate new hash value by adding next element

            pre = (pre*D+s.charCodeAt(j)) % mod
 
            // store string length if non repeat

            if(d.has([pre, -1]) == false)

                l.push([i, j])

            d.set(pre , 1)

        }

    }
 
    // resulting length

    document.write(l.length,"</br>")
 
    // resulting distinct substrings

    for(let i = 0; i < l.length; i++)

        document.write(s.substring(l[i][0],l[i][1]+1)," ")
}
 
// This code is contributed by shinjanpatra
 
</script>

Output

10
a ab abc abcd b bc bcd c cd d

Time Complexity: O(N²), N is the length of the string
Auxiliary Space: O(N*2) => O(N)

Another Approach:

Define an input string “s”.
Create an empty unordered set “substrings” to store the distinct substrings.
Use two nested loops to iterate over all possible substrings of “s”. The outer loop iterates over the starting index of each substring.
The inner loop iterates over the ending index of each substring, starting from the starting index.
Use the “substr” method of the input string “s” to extract each substring and insert it into the “substrings” set.
The “substr” method takes two arguments: the starting index of the substring and the length of the substring.
The length of the substring is computed as “j – i + 1”.
After all substrings have been added to the set, output the size of the set to get the number of distinct substrings.

Below is the implementation of the above approach:

C++

#include <iostream>
#include <unordered_set>
#include <string>
 
using namespace std;
 
int main() {

    // Input string

    string s = "abcd";
 
    // Set to store distinct substrings

    unordered_set<string> substrings;
 
    // Iterate over all possible substrings and add them to the set

    for (int i = 0; i < s.size(); i++) {

        for (int j = i; j < s.size(); j++) {

            substrings.insert(s.substr(i, j - i + 1));

        }

    }
//This code is contributed rudra1807raj

    // Output the number of distinct substrings

    cout << substrings.size() << endl;
}

Java

import java.util.*;
 
public class GFG {

    public static void main(String[] args) {

        // Input string

        String s = "abcd";
 
        // Set to store distinct substrings

        Set<String> substrings = new HashSet<>();
 
        // Iterate over all possible substrings and add them to the set

        for (int i = 0; i < s.length(); i++) {

            for (int j = i; j < s.length(); j++) {

                substrings.add(s.substring(i, j + 1));

            }

        }
 
        // Output the number of distinct substrings

        System.out.println(substrings.size());

    }
}
// This code is contributed by rudra1807raj

Python

def main():

    # Input string

    s = "abcd"
 
    # Set to store distinct substrings

    substrings = set()
 
    # Iterate over all possible substrings and add them to the set

    for i in range(len(s)):

        for j in range(i, len(s)):

            substrings.add(s[i:j + 1])
 
    # Output the number of distinct substrings

    print(len(substrings))
 
if __name__ == "__main__":

    main()

using System;

using System.Collections.Generic;
 
class GFG
{

    static void Main(string[] args)

    {   

        // Input string

        string s = "abcd";

        // Set to store distinct substrings

        HashSet<string> substrings = new HashSet<string>();

        // Iterate over all possible substrings and add them to the set

        for (int i = 0; i < s.Length; i++){

           for (int j = i; j < s.Length; j++){

              substrings.Add(s.Substring(i, j - i + 1));

            }

          }

        // Output the number of distinct substrings

        Console.WriteLine(substrings.Count);

    }
}

Javascript

<script>

function countDistinctSubstrings(s) {

    const substrings = new Set();
 
    // Iterate over all possible substrings and add them to the set

    for (let i = 0; i < s.length; i++) {

        for (let j = i; j < s.length; j++) {

            substrings.add(s.substring(i, j + 1));

        }

    }
 
    // Output the number of distinct substrings

    return substrings.size;
}
 
// Input string

const s = "abcd";
 
// Get the number of distinct substrings
const distinctSubstringsCount = countDistinctSubstrings(s);
document.write(distinctSubstringsCount);
</script>

Output

Time complexity: O(n^3), where n is the length of the input string “s”.
Auxiliary Space: O(n^2), where n is the length of the input string “s”. The space complexity is dominated by the number of distinct substrings that are stored in the unordered_set.

Article Tags :

Competitive Programming

DSA

Hash

Strings

frequency-counting

Java-HashMap

Python dictionary-programs

substring