Aho-Corasick Algorithm in Python

Given an input text and an array of k words, arr[], find all occurrences of all words in the input text. Let n be the length of text and m be the total number of characters in all words, i.e. m = length(arr[0]) + length(arr[1]) + … + length(arr[k-1]). Here k is the total number of input words.

Examples:

Input: text = "hello worldhello"
arr = ["hello", "world"]
Output: {'hello': [0, 10], 'world': [6]}
Explantion: In the given text "hello worldhello", the pattern "hello" appears at index 0 and 10, and the pattern "world" appears at index 6.

Input: text = "abxabcabcaby"
arr = ["ab", "abc", "aby"]
Output: {'ab': [0, 3], 'abc': [2, 5], 'aby': [9]}

Aho-Corasick Algorithm:

The Aho-Corasick algorithm is a string-searching algorithm that constructs a finite state machine representing all keywords to be searched for. It’s a kind of dictionary-matching algorithm that locates elements of a finite set of strings (the “dictionary”) within an input text. It matches all strings simultaneously.

Step-by-step explanation of the algorithm:

Build Trie (Keyword Tree):

Create a root node.
For each keyword in the given list, add it to the trie.
If a keyword ends at a node, add it to the output list of that node.

Below is the syntax of the above idea:

Python3

def build_trie(patterns):
    root = AhoCorasickNode(None)  # root node of the trie
    
    # Iterate over each pattern in the list of patterns
    for pattern in patterns:
        node = root
        # Iterate over each character in the pattern
        for char in pattern:
            # If the character is not in the children of the current node, add a new child node
            if char not in node.children:
                node.children[char] = AhoCorasickNode(char)
            # Move to the child node
            node = node.children[char]
        
        # Add the pattern to the output of the current node
        node.output.append(pattern)
    
    return root

Build Failure Links:

Use BFS to traverse the trie.
For each node, set its failure link to the longest suffix of the current keyword that is also a prefix of a keyword in the trie. If no such suffix exists, set the failure link to the root node.

Below is the syntax of the above idea:

Python3

from collections import deque

def build_failure_function(root):
    queue = deque()
    
    # Initialize failure function of the root's children to the root itself
    for node in root.children.values():
        node.failure = root
        queue.append(node)
    
    # Breadth-first traversal of the trie to compute the failure function
    while queue:
        current_node = queue.popleft()
        
        # For each child of the current node
        for char, child_node in current_node.children.items():
            queue.append(child_node)
            failure_node = current_node.failure
            
            # Traverse the failure function until a node is found with a matching child or the root is reached
            while failure_node and char not in failure_node.children:
                failure_node = failure_node.failure
            
            # Update the failure function of the child node
            child_node.failure = failure_node.children[char] if failure_node else root
            
            # Add the output of the failure node to the output of the current node
            child_node.output.extend(child_node.failure.output)

Search the Text:

Start at the root node of the trie.
For each character in the text:
- Follow the character along the trie.
- If a keyword is found, record its position in the text.
- If a character leads to a failure link, follow the failure link and continue searching.

Below is the syntax of the above idea:

Python3

def search(text, patterns):
    root = build_trie(patterns)
    build_failure_function(root)
    
    current_node = root
    results = {}  # Dictionary to store the indices of the found patterns
    
    # Iterate over each character in the text
    for i, char in enumerate(text):
        # Follow the failure function until a matching child is found or the root is reached
        while current_node and char not in current_node.children:
            current_node = current_node.failure
        
        # If a matching child is found, move to that child
        if current_node:
            current_node = current_node.children[char]
        
        # If the current node has any patterns that end at it, store the indices of those patterns
        for pattern in current_node.output:
            start_index = i - len(pattern) + 1
            if start_index not in results:
                results[start_index] = []
            results[start_index].append(pattern)
    
    return results

Implementation of Aho-Corasick Algorithm in Python:

Aho-Corasick Algorithm efficiently finds multiple patterns in a given text. Here's a Python implementation:

Python3

class TrieNode:
    def __init__(self):
        # Initialize TrieNode attributes
        self.children = {}
        self.output = []
        self.fail = None


def build_automaton(keywords):
    # Initialize root node of the trie
    root = TrieNode()

    # Build trie
    for keyword in keywords:
        node = root
        # Traverse the trie and create nodes for each character
        for char in keyword:
            node = node.children.setdefault(char, TrieNode())
        # Add keyword to the output list of the final node
        node.output.append(keyword)

    # Build failure links using BFS
    queue = []
    # Start from root's children
    for node in root.children.values():
        queue.append(node)
        node.fail = root

    # Breadth-first traversal of the trie
    while queue:
        current_node = queue.pop(0)
        # Traverse each child node
        for key, next_node in current_node.children.items():
            queue.append(next_node)
            fail_node = current_node.fail
            # Find the longest proper suffix that is also a prefix
            while fail_node and key not in fail_node.children:
                fail_node = fail_node.fail
            # Set failure link of the current node
            next_node.fail = fail_node.children[key] if fail_node else root
            # Add output patterns of failure node to current node's output
            next_node.output += next_node.fail.output

    return root


def search_text(text, keywords):
    # Build the Aho-Corasick automaton
    root = build_automaton(keywords)
    # Initialize result dictionary
    result = {keyword: [] for keyword in keywords}

    current_node = root
    # Traverse the text
    for i, char in enumerate(text):
        # Follow failure links until a match is found
        while current_node and char not in current_node.children:
            current_node = current_node.fail

        if not current_node:
            current_node = root
            continue

        # Move to the next node based on current character
        current_node = current_node.children[char]
        # Record matches found at this position
        for keyword in current_node.output:
            result[keyword].append(i - len(keyword) + 1)

    return result


# Example 1
text1 = "hello worldhello"
arr1 = ["hello", "world"]
result1 = search_text(text1, arr1)
print(result1)

# Example 2
text2 = "abxabcabcaby"
arr2 = ["ab", "abc", "aby"]
result2 = search_text(text2, arr2)
print(result2)

Output

{'hello': [0, 11], 'world': [6]}
{'ab': [0, 3, 6, 9], 'abc': [3, 6], 'aby': [9]}

Time Complexity:

Building the automaton: O(m+k)
Searching the text: O(n+z), where z is the total number of occurrences of all keywords in the text.

Auxiliary Space: O (m+k)

Article Tags :

DSA

Pattern Searching

python