API reference

class daachorse.DoubleArrayAhoCorasick(patterns, /, match_kind=0)

A byte-wise Aho-Corasick automaton using the compact double-array data structure.

Examples:
>>> import daachorse
>>> patterns = [b'bcd', b'ab', b'a']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> pma.find(b'abcd')
[(0, 1, 2), (1, 4, 0)]
Parameters:
  • patterns (list[bytes]) – List of bytes patterns.

  • match_kind (int) – A search option of the Aho-Corasick automaton.

Return type:

daachorse.DoubleArrayAhoCorasick

static deserialize(data, /)

Deserializes the automaton from the given slice.

Examples:
>>> import daachorse
>>> patterns = [b'bcd', b'cd', b'abc']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> b = pma.serialize()
>>> pma = daachorse.DoubleArrayAhoCorasick.deserialize(b)
Parameters:

data (bytes) – Bytes to deserialize.

Returns:

A deserialized automaton.

Return type:

daachorse.DoubleArrayAhoCorasick

Raises:

ValueError – if the automaton is invalid.

find(haystack, /)

Returns a list of non-overlapping matches in the given haystack.

According to the match_kind option you specified in the construction, the behavior is changed for multiple possible matches, as follows.

  • If you set MATCH_KIND_STANDARD (default), the automaton searches from the beginning of the input string, yielding a value immediately when a pattern is found.

  • If you set MATCH_KIND_LEFTMOST_LONGEST, the automaton reports matches corresponding to the longest pattern.

  • If you set MATCH_KIND_LEFTMOST_FIRST, the automaton reports matches corresponding to the pattern earlier registered to the automaton.

The next search resumes from the end of the previously found pattern.

Example 1: Standard semantics
>>> import daachorse
>>> patterns = [b'bcd', b'ab', b'a']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> pma.find(b'abcd')
[(0, 1, 2), (1, 4, 0)]
Example 2: Leftmost longest semantics
>>> import daachorse
>>> patterns = [b'ab', b'a', b'abcd']
>>> pma = daachorse.DoubleArrayAhoCorasick(
...     patterns,
...     daachorse.MATCH_KIND_LEFTMOST_LONGEST
... )
>>> pma.find(b'abcd')
[(0, 4, 2)]
Example 3: Leftmost first semantics
>>> import daachorse
>>> patterns = [b'ab', b'a', b'abcd']
>>> pma = daachorse.DoubleArrayAhoCorasick(
...     patterns,
...     daachorse.MATCH_KIND_LEFTMOST_FIRST
... )
>>> pma.find(b'abcd')
[(0, 2, 0)]
Parameters:

haystack (bytes) – Bytes to search for.

Returns:

A list of matches. Each match is a tuple consisting of the start position, end position, and pattern ID.

Return type:

list[tuple[int, int, int]]

find_overlapping(haystack, /)

Returns a list of overlapping matches in the given haystack.

The automaton follows the standard behavior of the Aho-Corasick algorithm. It searches from the beginning of the input string, and upon reaching a given position, it yields the patterns ending at that position in descending order of length.

If the pattern set contains duplicate patterns, they are yielded in the order they were registered.

Examples:
>>> import daachorse
>>> patterns = [b'bcd', b'ab', b'a']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> pma.find_overlapping(b'abcd')
[(0, 1, 2), (0, 2, 1), (1, 4, 0)]
Parameters:

haystack (bytes) – Bytes to search for.

Returns:

A list of matches. Each match is a tuple consisting of the start position, end position, and pattern ID.

Return type:

list[tuple[int, int, int]]

Raises:

ValueError – if the automaton is built with a LEFTMOST option.

find_overlapping_no_suffix(haystack, /)

Returns a list of overlapping matches without suffixes in the given haystack.

The behavior of the automaton is similar to find_overlapping(), except that upon reaching a given position, it yields only the single longest pattern ending at that position.

Examples:
>>> import daachorse
>>> patterns = [b'bcd', b'cd', b'abc']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> pma.find_overlapping_no_suffix(b'abcd')
[(0, 3, 2), (1, 4, 0)]
Parameters:

haystack (bytes) – Bytes to search for.

Returns:

A list of matches. Each match is a tuple consisting of the start position, end position, and pattern ID.

Return type:

list[tuple[int, int, int]]

Raises:

ValueError – if the automaton is built with a LEFTMOST option.

serialize()

Serializes the automaton into a bytes.

Examples:
>>> import daachorse
>>> patterns = [b'bcd', b'cd', b'abc']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> b = pma.serialize()
Returns:

Serialized automaton.

Return type:

bytes

class daachorse.CharwiseDoubleArrayAhoCorasick(patterns, /, match_kind=0)

A character-wise Aho-Corasick automaton using the compact double-array data structure.

Examples:
>>> import daachorse
>>> patterns = ['bcd', 'ab', 'a']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> pma.find('abcd')
[(0, 1, 2), (1, 4, 0)]
Parameters:
  • patterns (list[str]) – List of string patterns.

  • match_kind (int) – A search option of the Aho-Corasick automaton.

Return type:

daachorse.CharwiseDoubleArrayAhoCorasick

static deserialize(data, /)

Deserializes the automaton from the given slice.

Examples:
>>> import daachorse
>>> patterns = ['bcd', 'cd', 'abc']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> b = pma.serialize()
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick.deserialize(b)
Parameters:

data (bytes) – Bytes to deserialize.

Returns:

A deserialized automaton.

Return type:

daachorse.CharwiseDoubleArrayAhoCorasick

Raises:

ValueError – if the automaton is invalid.

find(haystack, /)

Returns a list of non-overlapping matches in the given haystack.

According to the match_kind option you specified in the construction, the behavior is changed for multiple possible matches, as follows.

  • If you set MATCH_KIND_STANDARD (default), the automaton searches from the beginning of the input string, yielding a value immediately when a pattern is found.

  • If you set MATCH_KIND_LEFTMOST_LONGEST, the automaton reports matches corresponding to the longest pattern.

  • If you set MATCH_KIND_LEFTMOST_FIRST, the automaton reports matches corresponding to the pattern earlier registered to the automaton.

The next search resumes from the end of the previously found pattern.

Example 1: Standard semantics
>>> import daachorse
>>> patterns = ['bcd', 'ab', 'a']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> pma.find('abcd')
[(0, 1, 2), (1, 4, 0)]
Example 2: Leftmost longest semantics
>>> import daachorse
>>> patterns = ['ab', 'a', 'abcd']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(
...     patterns,
...     daachorse.MATCH_KIND_LEFTMOST_LONGEST
... )
>>> pma.find('abcd')
[(0, 4, 2)]
Example 3: Leftmost first semantics
>>> import daachorse
>>> patterns = ['ab', 'a', 'abcd']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(
...     patterns,
...     daachorse.MATCH_KIND_LEFTMOST_FIRST
... )
>>> pma.find('abcd')
[(0, 2, 0)]
Parameters:

haystack (str) – String to search for.

Returns:

A list of matches. Each match is a tuple consisting of the start position, end position, and pattern ID.

Return type:

list[tuple[int, int, int]]

find_overlapping(haystack, /)

Returns a list of overlapping matches in the given haystack.

The automaton follows the standard behavior of the Aho-Corasick algorithm. It searches from the beginning of the input string, and upon reaching a given position, it yields the patterns ending at that position in descending order of length.

If the pattern set contains duplicate patterns, they are yielded in the order they were registered.

Examples:
>>> import daachorse
>>> patterns = ['bcd', 'ab', 'a']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> pma.find_overlapping('abcd')
[(0, 1, 2), (0, 2, 1), (1, 4, 0)]
Parameters:

haystack (str) – String to search for.

Returns:

A list of matches. Each match is a tuple consisting of the start position, end position, and pattern ID.

Return type:

list[tuple[int, int, int]]

Raises:

ValueError – if the automaton is built with a LEFTMOST option.

find_overlapping_no_suffix(haystack, /)

Returns a list of overlapping matches without suffixes in the given haystack.

The behavior of the automaton is similar to find_overlapping(), except that upon reaching a given position, it yields only the single longest pattern ending at that position.

Examples:
>>> import daachorse
>>> patterns = ['bcd', 'cd', 'abc']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> pma.find_overlapping_no_suffix('abcd')
[(0, 3, 2), (1, 4, 0)]
Parameters:

haystack (str) – String to search for.

Returns:

A list of matches. Each match is a tuple consisting of the start position, end position, and pattern ID.

Return type:

list[tuple[int, int, int]]

Raises:

ValueError – if the automaton is built with a LEFTMOST option.

serialize()

Serializes the automaton into a bytes.

Examples:
>>> import daachorse
>>> patterns = ['bcd', 'cd', 'abc']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> b = pma.serialize()
Returns:

Serialized automaton.

Return type:

bytes

MATCH_KIND_STANDARD: int

The standard match semantics, which enables find(), find_overlapping(), and find_overlapping_no_suffix(). Patterns are reported in the order that follows the normal behaviour of the Aho-Corasick algorithm.

MATCH_KIND_LEFTMOST_LONGEST: int

The leftmost-longest match semantics, which enables find(). When multiple patterns are started from the same positions, the longest pattern will be reported. For example, when matching patterns ab|a|abcd over abcd, abcd will be reported.

MATCH_KIND_LEFTMOST_FIRST: int

The leftmost-first match semantics, which enables find(). When multiple patterns are started from the same positions, the pattern that is registered earlier will be reported. For example, when matching patterns ab|a|abcd over abcd, ab will be reported.