Example usage

Daachorse provides several search options, ranging from standard Aho-Corasick matching to trickier variants. All of them run fast thanks to the double-array data structure and can be easily plugged into your application, as shown below.

Finding overlapped occurrences

To search for all occurrences of registered patterns, allowing positional overlap in the input text, use find_overlapping(). When you instantiate a new automaton, each pattern is assigned a unique identifier in input order. Each match result is a tuple of the start and end byte positions of the occurrence and the identifier of the matched pattern.

>>> import daachorse
>>> patterns = [b'bcd', b'ab', b'a']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> pma.find_overlapping(b'abcd')
[(0, 1, 2), (0, 2, 1), (1, 4, 0)]
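
Judging from the output above, each tuple reads as (start, end, pattern_id) with an exclusive end, so slicing the text with it recovers the matched pattern. The following sketch illustrates that reading. It does not call daachorse: the match list is copied from the example as a literal, and the helper name describe_matches is hypothetical.

```python
def describe_matches(text, patterns, matches):
    """Map each (start, end, pattern_id) tuple to the slice it covers."""
    described = []
    for start, end, pattern_id in matches:
        matched = text[start:end]  # end is treated as exclusive
        # The slice should be identical to the registered pattern.
        assert matched == patterns[pattern_id]
        described.append((pattern_id, matched))
    return described


patterns = [b'bcd', b'ab', b'a']
# Result list copied from the find_overlapping() example above.
matches = [(0, 1, 2), (0, 2, 1), (1, 4, 0)]
result = describe_matches(b'abcd', patterns, matches)
print(result)
```

The same slicing applies to the results of find() in the sections below.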

Finding non-overlapped occurrences with standard matching

If you do not want to allow positional overlap, use find() instead. It performs the search on the Aho-Corasick automaton and, in each iteration, reports the first pattern found.

>>> import daachorse
>>> patterns = [b'bcd', b'ab', b'a']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns)
>>> pma.find(b'abcd')
[(0, 1, 2), (1, 4, 0)]
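
To make the behaviour concrete, here is a brute-force model of this search in plain Python. It is only a sketch of the semantics described above, not how daachorse is implemented. The function name naive_find_standard is hypothetical, and the tie-break between patterns that end at the same position is an assumption.

```python
def naive_find_standard(patterns, text):
    """Brute-force model of non-overlapping standard matching."""
    results = []
    start = 0  # a match may not begin before this position
    for end in range(1, len(text) + 1):
        # Patterns that end exactly at `end` and begin at or after `start`.
        candidates = [
            (end - len(p), end, i)
            for i, p in enumerate(patterns)
            if p and end - len(p) >= start and text[end - len(p):end] == p
        ]
        if candidates:
            # Tie-break (an assumption): prefer the smallest start, i.e. the longest.
            results.append(min(candidates))
            start = end  # resume right after the reported match
    return results


found = naive_find_standard([b'bcd', b'ab', b'a'], b'abcd')
print(found)
```

In this model, ab is not reported because a ends first at position 1, and the search then resumes from there.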

Finding non-overlapped occurrences with longest matching

If you want to find the longest pattern at each search position without positional overlap, pass MATCH_KIND_LEFTMOST_LONGEST to the constructor.

>>> import daachorse
>>> patterns = [b'ab', b'a', b'abcd']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns, daachorse.MATCH_KIND_LEFTMOST_LONGEST)
>>> pma.find(b'abcd')
[(0, 4, 2)]

Finding non-overlapped occurrences with leftmost-first matching

If you want to find the earliest registered pattern among those starting from the search position, use MATCH_KIND_LEFTMOST_FIRST.

This is the so-called leftmost-first match, a somewhat tricky search option. For example, in the following code, ab is reported because it is the earliest registered pattern.

>>> import daachorse
>>> patterns = [b'ab', b'a', b'abcd']
>>> pma = daachorse.DoubleArrayAhoCorasick(patterns, daachorse.MATCH_KIND_LEFTMOST_FIRST)
>>> pma.find(b'abcd')
[(0, 2, 0)]
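
The difference between the two leftmost variants can be sketched with a brute-force model in plain Python. This is a rough illustration of the semantics described above, not daachorse's implementation. The function name naive_find_leftmost and its prefer parameter are hypothetical.

```python
def naive_find_leftmost(patterns, text, prefer):
    """Brute-force model of leftmost matching; prefer is 'longest' or 'first'."""
    results = []
    pos = 0
    while pos < len(text):
        # Identifiers of all patterns that start at the current position.
        ids = [i for i, p in enumerate(patterns) if p and text.startswith(p, pos)]
        if not ids:
            pos += 1
            continue
        if prefer == 'longest':
            best = max(ids, key=lambda i: len(patterns[i]))
        else:
            best = min(ids)  # earliest registered pattern
        end = pos + len(patterns[best])
        results.append((pos, end, best))
        pos = end  # continue after the match, so results never overlap
    return results


patterns = [b'ab', b'a', b'abcd']
longest = naive_find_leftmost(patterns, b'abcd', 'longest')
first = naive_find_leftmost(patterns, b'abcd', 'first')
print(longest, first)
```

All three patterns start at position 0, so the only difference is which one is chosen there: the longest (abcd) or the earliest registered (ab).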

Finding patterns in a string

To build an automaton for strings, use CharwiseDoubleArrayAhoCorasick instead.

>>> import daachorse
>>> patterns = ['全世界', '世界', 'に']
>>> pma = daachorse.CharwiseDoubleArrayAhoCorasick(patterns)
>>> pma.find('全世界中に')
[(0, 3, 0), (4, 5, 2)]
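
The output above suggests that positions are counted in characters rather than bytes. If you need byte offsets into the encoded text, one possible conversion is sketched below, assuming UTF-8. The helper name char_span_to_byte_span is hypothetical, and the match list is copied from the example rather than produced by daachorse.

```python
def char_span_to_byte_span(text, start, end, encoding='utf-8'):
    """Convert a (start, end) span in characters to a span in encoded bytes."""
    byte_start = len(text[:start].encode(encoding))
    byte_end = byte_start + len(text[start:end].encode(encoding))
    return byte_start, byte_end


text = '全世界中に'
# Result list copied from the find() example above.
char_matches = [(0, 3, 0), (4, 5, 2)]
byte_matches = [
    char_span_to_byte_span(text, start, end) + (pattern_id,)
    for start, end, pattern_id in char_matches
]
print(byte_matches)
```

Each character in this text takes three bytes in UTF-8, so the character span (4, 5) becomes the byte span (12, 15).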