API reference

class daachorse.DoubleArrayAhoCorasick(patterns, /, match_kind=0)

A byte-wise Aho-Corasick automaton using the compact double-array data structure.

Examples:
>>> import daachorse
>>> patterns = [b'bcd', b'ab', b'a']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> pma.find(b'abcd')
[(0, 1, 2), (1, 4, 0)]
Parameters:
  • patterns (list[bytes]) – List of bytes patterns.

  • match_kind (int) – A search option of the Aho-Corasick automaton.

Return type:

daachorse.DoubleArrayAhoCorasick

static deserialize(data, /)

Deserializes the automaton from the given slice.

Examples:
>>> import daachorse
>>> patterns = [b'bcd', b'cd', b'abc']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> b = pma.serialize()
>>> pma = daachorse.DoubleArrayAhoCorasick.deserialize(b)
Parameters:

data (bytes) – Bytes to deserialize.

Returns:

A deserialized automaton.

Return type:

daachorse.DoubleArrayAhoCorasick

Raises:

ValueError – if the automaton is invalid.

find(haystack, /)

Returns a list of non-overlapping matches in the given haystack.

This function searches from the beginning of the input bytes, adding a pattern immediately to the resulting list when a pattern is found. The next search resumes from the end of the previously found pattern. When the end of the input string is reached, it returns the resulting list.

Depending on the match_kind option specified during construction, the behavior differs for multiple possible matches, as follows.

  • If you set MATCH_KIND_STANDARD (default), it reports the match correspinding to the shortest pattern.

  • If you set MATCH_KIND_LEFTMOST_LONGEST, it reports the match corresponding to the longest pattern.

  • If you set MATCH_KIND_LEFTMOST_FIRST, it reports matches corresponding to the pattern that was registered earlier in the automaton.

Example 1: Standard semantics
>>> import daachorse
>>> patterns = [b'bcd', b'ab', b'a']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> pma.find(b'abcd')
[(0, 1, 2), (1, 4, 0)]
Example 2: Leftmost longest semantics
>>> import daachorse
>>> patterns = [b'ab', b'a', b'abcd']
>>> pma = daachorse.DoubleArrayAhoCorasick(
...     patterns,
...     daachorse.MATCH_KIND_LEFTMOST_LONGEST
... )
>>> pma.find(b'abcd')
[(0, 4, 2)]
Example 3: Leftmost first semantics
>>> import daachorse
>>> patterns = [b'ab', b'a', b'abcd']
>>> pma = daachorse.DoubleArrayAhoCorasick(
...     patterns,
...     daachorse.MATCH_KIND_LEFTMOST_FIRST
... )
>>> pma.find(b'abcd')
[(0, 2, 0)]
Parameters:

haystack (bytes) – Bytes to search in.

Returns:

A list of matches. Each match is a tuple consisting of the start position, end position, and pattern ID.

Return type:

list[tuple[int, int, int]]

find_overlapping(haystack, /)

Returns a list of overlapping matches in the given haystack.

This function follows the standard behavior of the Aho-Corasick algorithm. It searches from the beginning of the input bytes, and upon reaching a given position, it adds the patterns ending at that position to the resulting list in descending order of length. When the end of the input bytes is reached, it returns the resulting list.

If the pattern set contains duplicate patterns, they are added to the list in the order they were registered.

Examples:
>>> import daachorse
>>> patterns = [b'bcd', b'ab', b'a']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> pma.find_overlapping(b'abcd')
[(0, 1, 2), (0, 2, 1), (1, 4, 0)]
Parameters:

haystack (bytes) – Bytes to search in.

Returns:

A list of matches. Each match is a tuple consisting of the start position, end position, and pattern ID.

Return type:

list[tuple[int, int, int]]

Raises:

ValueError – if the automaton is built with a LEFTMOST option.

find_overlapping_no_suffix(haystack, /)

Returns a list of overlapping matches without suffixes in the given haystack.

The behavior of this function is similar to find_overlapping(), except that upon reaching a given position, it adds only the single longest pattern ending at that position to the resulting list.

Examples:
>>> import daachorse
>>> patterns = [b'bcd', b'cd', b'abc']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> pma.find_overlapping_no_suffix(b'abcd')
[(0, 3, 2), (1, 4, 0)]
Parameters:

haystack (bytes) – Bytes to search in.

Returns:

A list of matches. Each match is a tuple consisting of the start position, end position, and pattern ID.

Return type:

list[tuple[int, int, int]]

Raises:

ValueError – if the automaton is built with a LEFTMOST option.

serialize()

Serializes the automaton into a bytes.

Examples:
>>> import daachorse
>>> patterns = [b'bcd', b'cd', b'abc']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> b = pma.serialize()
Returns:

Serialized automaton.

Return type:

bytes

class daachorse.CharwiseDoubleArrayAhoCorasick(patterns, /, match_kind=0)

A character-wise Aho-Corasick automaton using the compact double-array data structure.

Examples:
>>> import daachorse
>>> patterns = ['bcd', 'ab', 'a']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> pma.find('abcd')
[(0, 1, 2), (1, 4, 0)]
Parameters:
  • patterns (list[str]) – List of string patterns.

  • match_kind (int) – A search option of the Aho-Corasick automaton.

Return type:

daachorse.CharwiseDoubleArrayAhoCorasick

static deserialize(data, /)

Deserializes the automaton from the given slice.

Examples:
>>> import daachorse
>>> patterns = ['bcd', 'cd', 'abc']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> b = pma.serialize()
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick.deserialize(b)
Parameters:

data (bytes) – Bytes to deserialize.

Returns:

A deserialized automaton.

Return type:

daachorse.CharwiseDoubleArrayAhoCorasick

Raises:

ValueError – if the automaton is invalid.

find(haystack, /)

Returns a list of non-overlapping matches in the given haystack.

This function searches from the beginning of the input string, adding a pattern immediately to the resulting list when a pattern is found. The next search resumes from the end of the previously found pattern. When the end of the input string is reached, it returns the resulting list.

Depending on the match_kind option specified during construction, the behavior differs for multiple possible matches, as follows.

  • If you set MATCH_KIND_STANDARD (default), it reports the match correspinding to the shortest pattern.

  • If you set MATCH_KIND_LEFTMOST_LONGEST, it reports the match corresponding to the longest pattern.

  • If you set MATCH_KIND_LEFTMOST_FIRST, it reports matches corresponding to the pattern that was registered earlier in the automaton.

Example 1: Standard semantics
>>> import daachorse
>>> patterns = ['bcd', 'ab', 'a']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> pma.find('abcd')
[(0, 1, 2), (1, 4, 0)]
Example 2: Leftmost longest semantics
>>> import daachorse
>>> patterns = ['ab', 'a', 'abcd']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(
...     patterns,
...     daachorse.MATCH_KIND_LEFTMOST_LONGEST
... )
>>> pma.find('abcd')
[(0, 4, 2)]
Example 3: Leftmost first semantics
>>> import daachorse
>>> patterns = ['ab', 'a', 'abcd']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(
...     patterns,
...     daachorse.MATCH_KIND_LEFTMOST_FIRST
... )
>>> pma.find('abcd')
[(0, 2, 0)]
Parameters:

haystack (str) – String to search in.

Returns:

A list of matches. Each match is a tuple consisting of the start position, end position, and pattern ID.

Return type:

list[tuple[int, int, int]]

find_overlapping(haystack, /)

Returns a list of overlapping matches in the given haystack.

This function follows the standard behavior of the Aho-Corasick algorithm. It searches from the beginning of the input string, and upon reaching a given position, it adds the patterns ending at that position to the resulting list in descending order of length. When the end of the input string is reached, it returns the resulting list.

If the pattern set contains duplicate patterns, they are added to the list in the order they were registered.

Examples:
>>> import daachorse
>>> patterns = ['bcd', 'ab', 'a']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> pma.find_overlapping('abcd')
[(0, 1, 2), (0, 2, 1), (1, 4, 0)]
Parameters:

haystack (str) – String to search in.

Returns:

A list of matches. Each match is a tuple consisting of the start position, end position, and pattern ID.

Return type:

list[tuple[int, int, int]]

Raises:

ValueError – if the automaton is built with a LEFTMOST option.

find_overlapping_no_suffix(haystack, /)

Returns a list of overlapping matches without suffixes in the given haystack.

The behavior of this function is similar to find_overlapping(), except that upon reaching a given position, it adds only the single longest pattern ending at that position to the resulting list.

Examples:
>>> import daachorse
>>> patterns = ['bcd', 'cd', 'abc']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> pma.find_overlapping_no_suffix('abcd')
[(0, 3, 2), (1, 4, 0)]
Parameters:

haystack (str) – String to search in.

Returns:

A list of matches. Each match is a tuple consisting of the start position, end position, and pattern ID.

Return type:

list[tuple[int, int, int]]

Raises:

ValueError – if the automaton is built with a LEFTMOST option.

serialize()

Serializes the automaton into a bytes.

Examples:
>>> import daachorse
>>> patterns = ['bcd', 'cd', 'abc']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> b = pma.serialize()
Returns:

Serialized automaton.

Return type:

bytes

MATCH_KIND_STANDARD: int

The standard match semantics, which enables find(), find_overlapping(), and find_overlapping_no_suffix(). Patterns are reported in the order that follows the normal behaviour of the Aho-Corasick algorithm.

MATCH_KIND_LEFTMOST_LONGEST: int

The leftmost-longest match semantics, which enables find(). When multiple patterns are started from the same positions, the longest pattern will be reported. For example, when matching patterns ab|a|abcd over abcd, abcd will be reported.

MATCH_KIND_LEFTMOST_FIRST: int

The leftmost-first match semantics, which enables find(). When multiple patterns are started from the same positions, the pattern that is registered earlier will be reported. For example, when matching patterns ab|a|abcd over abcd, ab will be reported.