Suffix arrays: How to compute them fast with Python

By Louis Abraham

You can download the notebook from https://louisabraham.github.io/notebooks/suffix_arrays.ipynb

You can comment there on HN. Yes, I'm using HN as a comment system.

Just give me the code!

In [1]:
from itertools import zip_longest, islice


def to_int_keys_best(l):
    """
    l: iterable of keys
    returns: a list with integer keys
    """
    seen = set()
    ls = []
    for e in l:
        if not e in seen:
            ls.append(e)
            seen.add(e)
    ls.sort()
    index = {v: i for i, v in enumerate(ls)}
    return [index[v] for v in l]


def suffix_matrix_best(s):
    """
    suffix matrix of s
    O(n * log(n)^2)
    """
    n = len(s)
    k = 1
    line = to_int_keys_best(s)
    ans = [line]
    while max(line) < n - 1:
        line = to_int_keys_best(
            [a * (n + 1) + b + 1
             for (a, b) in
             zip_longest(line, islice(line, k, None),
                         fillvalue=-1)])
        ans.append(line)
        k <<= 1
    return ans


def suffix_array_best(s):
    """
    suffix array of s
    O(n * log(n)^2)
    """
    n = len(s)
    k = 1
    line = to_int_keys_best(s)
    while max(line) < n - 1:
        line = to_int_keys_best(
            [a * (n + 1) + b + 1
             for (a, b) in
             zip_longest(line, islice(line, k, None),
                         fillvalue=-1)])
        k <<= 1
    return line

Disclaimer

This notebook is not intended as a course, a tutorial or an explanation on the suffix array algorithms, just practical implementations in Python.

This implementation is intended to be simple enough to be rewritten during programming contests, and yet be as efficient as possible. It is particularly optimised against random data.

Introduction

This week-end, I wanted to improve my skills on string algorithms and solve this kattis problem. I had an efficient (in complexity) algorithm to compute the suffix array but it hit the time limit (mostly because I'm using Python). When I implemented my last line skipping trick, my code passed with Python 2 because the server uses PyPy. However, I still had a timeout in Python 3. This is how I managed to have a fast solution with the CPython implementation of Python 3.

I had a couple resources on the topic, so I started with [Stanford].

This paper is really the best I know because out of 17 pages, there are 7 of theory and code, and 9 of exercises with solutions (last page is for references).

They use an algorithm called prefix doubling, and I'm going to describe some possible implementations and their practical performances.

Short reminder

Suffix array

Given a string s of length n, we want to compute the suffix array sa of same length.

sa[i] is the rank according to the lexicographic order of the suffix at index i s[i:] amongst all the suffixes.

They are strictly ordered (two suffixes cannot be equal since they are all of different lengths) thus the suffix array is always a permutation of range(n).

For example, the suffix array of 'banana' is [3, 2, 5, 1, 4, 0]. We can see that the first two suffixes (ranks 0 and 1) are the ones that begin with the letter 'a'.

There is an equivalent definition of suffix arrays: sa_alternative[i] is the index of the suffix at the i-th position in the sorted list of all suffixes.

In [2]:
def suffix_array_alternative_naive(s):
    return [rank for suffix, rank in sorted((s[i:], i) for i in range(len(s)))]

suffix_array_alternative_naive('banana')
Out[2]:
[5, 3, 1, 0, 4, 2]

As you can see, this is exactly the inverse permutation of sa, and it is trivial to invert it in $O(n)$.

In [3]:
def inverse_array(l):
    n = len(l)
    ans = [0] * n
    for i in range(n):
        ans[l[i]] = i
    return ans


def suffix_array_naive(s):
    return inverse_array(suffix_array_alternative_naive(s))

suffix_array_naive('banana')
Out[3]:
[3, 2, 5, 1, 4, 0]

Now the performance doesn't seem horrible:

In [4]:
from random import randint

constant_string = lambda length: 'a' * int(length)
random_string = lambda length: ''.join(chr(randint(ord('a'), ord('z')))
                                       for _ in range(int(length)))

s1 = constant_string(1e4)
s2 = random_string(1e4)

%timeit suffix_array_alternative_naive(s1)
%timeit suffix_array_alternative_naive(s2)
10 loops, best of 3: 37 ms per loop
10 loops, best of 3: 46.6 ms per loop

The time complexity is dominated by the generation of sorted((s[i:], i) for i in range(len(s))) which is $O(n^2log(n))$ in the worst case: $O(n*log(n))$ for the sort and $O(n)$ for one comparison.

In practice, it is not really bad because the strings start to be different really quickly, so the $O(n)$ factor becomes less significant.

But the worst is the memory cost $O(n^2)$. That makes the algorithm for length = 100000 use several Go of RAM.

Because string comparisons seem to be really optimised in Python, the trivial algorithm slightly outperforms the more efficient algorithms for the small values. However, it is really not usable for bigger values of n.

Here, one could think that the performance could be worse for the constant_string because of the comparisons. However, because the prefixes are already sorted and the constant behind the string comparison is really small, it appears faster. See my question on [StackOverflow].

One exercise for the reader

The constant_string is such that the number of letter comparisons between two suffixes is maximal (always the length of the longest). Find an algorithm to produce given an alphabet $A$ and a length $n$ an element of $A^n$ such that the maximum number of comparisons between any two suffixes is minimal.

Prefix doubling

The algorithm described in [Stanford] is called prefix doubling.

Let's say all the letters of your string are different. Then the order of the suffixes is only determined by their first letter, and you can do that in $O(n*log(n))$ time and $O(n)$ space complexity by sorting the letters.

Even if the letters are not different, grouping the suffixes according to their first letter is a good idea. It is the idea of prefix radix sort.

And here we can use an idea similar to pointer jumping to double at each step the length of the prefixes that are compared.

Say you have at step log(k) the orders of the partial prefixes s[i: i+k]. But s[i: i+2*k] == s[i: i+k] + s[i+k: 2*k] so you can merge at each step the order and the shifted order to obtain the new order. Hence the name prefix doubling. It is really a beautiful algorithm.

After at most $log(n)$ steps, you have totally sorted the prefixes. The complexity is $O(n)$ space (or $O(n*log(n))$ for the prefix matrix, see below) and $O(n*log^2n)$ time if you implement the merge operation in $O(n*log(n))$.

Prefix matrix

When you use prefix doubling, keeping the intermediate partial sorts allows us to compute the Longest Common Prefix (LCP) of two indexes i and j in $O(log(n))$. The strategy is described in [Stanford].

This is the reason why I prefer prefix doubling over others because it is a semilinear computation to make LCP queries in $O(log(n))$ time. If you think thoroughly, using the suffix matrix to compute the LCP is similar to the father pointer doubling algorithm for Lowest Common Ancestor (LCA) in the suffix tree, see [Topcoder].

Other algorithms

There are $O(n)$ algorithms to compute the suffix array, like building the suffix tree and converting it to suffix array, but the constant is high. A more efficient algorithms is the Skew algorithms: [Skew].

However, the Skew algorithm does not allow simple computation of the LCA.

I think there are more complicated algorithms using both the suffix array and the LCP array to be able to traverse the suffix tree without constructing it, and then you can use other classic algorithms for the LCA.

The basic implementation

In [5]:
from itertools import zip_longest, islice


def to_int_keys(l):
    """
    l: iterable of keys
    returns: a list with integer keys
    """
    index = {v: i for i, v in enumerate(sorted(set(l)))}
    return [index[v] for v in l]


def suffix_matrix(to_int_keys, s):
    n = len(s)
    k = 1
    line = to_int_keys(s)
    ans = [line]
    while k < n:
        line = to_int_keys(
            list(zip_longest(line, islice(line, k, None),
                             fillvalue=-1)))
        ans.append(line)
        k <<= 1
    return ans

suffix_matrix(to_int_keys, 'banana')
Out[5]:
[[1, 0, 2, 0, 2, 0],
 [2, 1, 3, 1, 3, 0],
 [3, 2, 5, 1, 4, 0],
 [3, 2, 5, 1, 4, 0]]

The to_int_keys function contains almost all the magic. It takes any iterable and returns integer indexes. If two elements were identical, their indexes will be equal.

Using positive integers allows us to set $-1$ as the smallest possible value.

The original implementation

I wrote a version almost identical to the C++ code of the pdf, but it has bad performances.

In [6]:
def suffix_matrix_stanford(s):
    n = len(s)
    k = 1
    line = to_int_keys(s)
    ans = [line]
    while k < n:
        L = sorted(
            (key, index)
            for index, key in enumerate(
                zip_longest(line, islice(line, k, None),
                            fillvalue=-1)
            ))
        line = [0] * n
        for i in range(n):
            line[L[i][1]] = (line[L[i - 1][1]]
                             if i and L[i][0] == L[i - 1][0]
                             else i)
        ans.append(line)
        k <<= 1
    return ans

suffix_matrix_stanford('banana')
Out[6]:
[[1, 0, 2, 0, 2, 0],
 [3, 1, 4, 1, 4, 0],
 [3, 2, 5, 1, 4, 0],
 [3, 2, 5, 1, 4, 0]]

You can see the indexes are different in some lines of the matrix but they are in the same order, which is sufficient.

In [7]:
for s in [constant_string(1e5),
          random_string(1e5)]:
    %timeit suffix_matrix(to_int_keys, s)
    %timeit suffix_matrix_stanford(s)
1 loop, best of 3: 1.06 s per loop
1 loop, best of 3: 1.72 s per loop
1 loop, best of 3: 4.29 s per loop
1 loop, best of 3: 7.14 s per loop

It shows that the Python builtins are very powerful.

We can however observe that using set might slow down sorted because Timsort performs well with partially sorted data, and sets are not ordered in Python. Can we fix that?

Small tricks

An optimisation based on Timsort

After a few steps, the array is almost sorted and the Timsort performs well on that data. In the random case, the suffix is computed after a few steps, and sorting is done in $O(n)$ because the Timsort is adaptative (and in general is "designed to perform well on many kinds of real-world data").

We test different to_int_keys functions implementing this trick:

In [8]:
def to_int_keys1(l):
    """
    l: iterable of keys
    returns: a list with integer keys
    """
    ls = []
    for e in sorted(l):
        if not ls or e != ls[-1]:
            ls.append(e)
    index = {v: i for i, v in enumerate(ls)}
    return [index[v] for v in l]


def to_int_keys2(l):
    """
    l: iterable of keys
    returns: a list with integer keys
    """
    cnt = 0
    lastKey = None
    index = {}
    for key in sorted(l):
        if key != lastKey:
            index[key] = cnt
            cnt += 1
        lastKey = key
    return [index[v] for v in l]


def to_int_keys3(l):
    """
    l: iterable of keys
    returns: a list with integer keys
    """
    seen = set()
    ls = []
    for e in l:
        if not e in seen:
            ls.append(e)
            seen.add(e)
    ls.sort()
    index = {v: i for i, v in enumerate(ls)}
    return [index[v] for v in l]

Faster sorting using integers

We can do a little hack to compare integers of range(n * (n + 1)) instead of tuples.

In [9]:
def suffix_matrix_int(to_int_keys, s):
    n = len(s)
    k = 1
    line = to_int_keys(s)
    ans = [line]
    while k < n:
        line = to_int_keys(
            [a * (n + 1) + b + 1
             for (a, b) in
             zip_longest(line, islice(line, k, None),
                         fillvalue=-1)])
        ans.append(line)
        k <<= 1
    return ans

Performance test

Now, let's compare everybody:

In [10]:
for length in [1e4, 1e5, 5e5]:
    print("length = %i\n" % length)
    for s in ['constant_string',
              'random_string']:
        print(s)
        s = eval(s)(length)
        for sm in ['suffix_matrix', 'suffix_matrix_int']:
            print(sm)
            sm = eval(sm)
            %timeit sm(to_int_keys, s)
            %timeit sm(to_int_keys1, s)
            %timeit sm(to_int_keys2, s)
            %timeit sm(to_int_keys3, s)
            print()
length = 10000

constant_string
suffix_matrix
10 loops, best of 3: 53.3 ms per loop
10 loops, best of 3: 48.3 ms per loop
10 loops, best of 3: 65 ms per loop
10 loops, best of 3: 48.2 ms per loop

suffix_matrix_int
10 loops, best of 3: 59.8 ms per loop
10 loops, best of 3: 77.3 ms per loop
10 loops, best of 3: 59.4 ms per loop
10 loops, best of 3: 63.9 ms per loop

random_string
suffix_matrix
10 loops, best of 3: 188 ms per loop
1 loop, best of 3: 233 ms per loop
1 loop, best of 3: 197 ms per loop
1 loop, best of 3: 212 ms per loop

suffix_matrix_int
1 loop, best of 3: 193 ms per loop
1 loop, best of 3: 185 ms per loop
10 loops, best of 3: 132 ms per loop
1 loop, best of 3: 169 ms per loop

length = 100000

constant_string
suffix_matrix
1 loop, best of 3: 829 ms per loop
1 loop, best of 3: 665 ms per loop
1 loop, best of 3: 583 ms per loop
1 loop, best of 3: 633 ms per loop

suffix_matrix_int
1 loop, best of 3: 857 ms per loop
1 loop, best of 3: 861 ms per loop
1 loop, best of 3: 846 ms per loop
1 loop, best of 3: 776 ms per loop

random_string
suffix_matrix
1 loop, best of 3: 4.57 s per loop
1 loop, best of 3: 4.16 s per loop
1 loop, best of 3: 3.87 s per loop
1 loop, best of 3: 4.7 s per loop

suffix_matrix_int
1 loop, best of 3: 2.93 s per loop
1 loop, best of 3: 2.98 s per loop
1 loop, best of 3: 2.82 s per loop
1 loop, best of 3: 3.34 s per loop

length = 500000

constant_string
suffix_matrix
1 loop, best of 3: 5.06 s per loop
1 loop, best of 3: 4.42 s per loop
1 loop, best of 3: 3.97 s per loop
1 loop, best of 3: 4.15 s per loop

suffix_matrix_int
1 loop, best of 3: 4.89 s per loop
1 loop, best of 3: 5.16 s per loop
1 loop, best of 3: 4.7 s per loop
1 loop, best of 3: 5.9 s per loop

random_string
suffix_matrix
1 loop, best of 3: 34.9 s per loop
1 loop, best of 3: 32.3 s per loop
1 loop, best of 3: 30.8 s per loop
1 loop, best of 3: 35.8 s per loop

suffix_matrix_int
1 loop, best of 3: 25.5 s per loop
1 loop, best of 3: 24.2 s per loop
1 loop, best of 3: 21.5 s per loop
1 loop, best of 3: 26.3 s per loop

Using integer keys helps speeding up the computations for random_string, slowing them a bit for the constant_string.

We can see that the three new functions perform very well, and the big winner is to_int_keys2.

The performance gained from applying the Timsort on partially sorted data is too complicated to be evaluated, it is only a rule of thumb that performs very well.

I was expecting to_int_keys3 to perform better because there is some kind of "equilibrium" between the number of different elements to sort and their partial order.

Indeed, to_int_keys3 has a $O(A*log(A) + n)$ time complexity where A is the number of different elements, so it should be faster on the first steps when the alphabet is small. However, the online operations done on set seem to be inefficient so the difference can only be seen on large values.

We can also observe that using suffix_matrix_int helps to_int_keys3, probably because set is more efficient with integers than with tuples.

Interlude: an outsider

On [AlgorithmicAlley], I found a surprisingly elegant algorithm (I modified it a bit):

In [11]:
from collections import defaultdict


def sort_bucket(s, bucket, order):
    d = defaultdict(list)
    for i in bucket:
        key = s[i + order // 2:i + order]
        d[key].append(i)
    result = []
    for k, v in sorted(d.items()):
        if len(v) > 1:
            result += sort_bucket(s, v, 2 * order)
        else:
            result.append(v[0])
    return result


def suffix_array_ManberMyers(s):
    return sort_bucket(s, range(len(s)), 1)

suffix_array_ManberMyers('banana')
Out[11]:
[5, 3, 1, 0, 4, 2]
In [12]:
s = random_string(1e5)

%timeit suffix_matrix_int(to_int_keys2, s)
%timeit suffix_array_ManberMyers(s)
1 loop, best of 3: 2.87 s per loop
1 loop, best of 3: 334 ms per loop

Yes, almost 10 times better. At first, I was impressed. Then I tested it on a constant example:

In [13]:
s = constant_string(5e4)

%timeit suffix_array_naive(s)
%timeit suffix_matrix(to_int_keys2, s)
%timeit suffix_array_ManberMyers(s)
1 loop, best of 3: 1.03 s per loop
1 loop, best of 3: 286 ms per loop
1 loop, best of 3: 2.04 s per loop

What happened?

When d.items() is sorted, it contains the real substrings. When you don't need to go really deep (like with a random string), it's not a problem. Because you don't compute the orders that are not useful, you limit the number of comparisons and you spare a lot of time. But the principle of prefix doubling is precisely to use the partial order that is already computed.

In fact, for constant_string, you have to store almost all the suffixes before sorting them, and because you use a defaultdict, it is even less space and time efficient than the naive algorithm, and is $O(n^2)$.

So this implementation is really not the same algorithm. And because of the method used, you don't have the suffix matrix that justifies to use a $O(n*log^2n)$ algorithm instead of the linear [Skew].

Awesome optimisation: skipping the last lines

The idea

We saw that when the suffixes can be compared fast, there are a lot of neat optimisations, either by exploiting the Timsort or by comparing the string themselves like in [AlgorithmicAlley].

We can observe that if it is the case, the last lines of the matrix are going to be the same. Then, we don't need them!

The stop condition is simply to check if we have all the integers. We only change one line.

For random strings, the mean time to compare two prefixes is $O(log(n))$ instead of $O(n)$ so we do only about $O(log(log(n)))$ steps instead of $O(log(n))$.

I have never seen this trick anywhere, if you know a reference, please tell me on HN!

In [14]:
def suffix_matrix_smart(to_int_keys, s):
    n = len(s)
    k = 1
    line = to_int_keys(s)
    ans = [line]
    while max(line) < n - 1:
        line = to_int_keys(
            list(zip_longest(line, islice(line, k, None),
                             fillvalue=-1)))
        ans.append(line)
        k <<= 1
    return ans

suffix_matrix_smart(to_int_keys2, 'banana')
Out[14]:
[[1, 0, 2, 0, 2, 0], [2, 1, 3, 1, 3, 0], [3, 2, 5, 1, 4, 0]]
In [15]:
for s in [constant_string(1e5), random_string(1e5)]:
    %timeit suffix_matrix(to_int_keys2, s)
    %timeit suffix_matrix_smart(to_int_keys2, s)
1 loop, best of 3: 704 ms per loop
1 loop, best of 3: 747 ms per loop
1 loop, best of 3: 4.35 s per loop
1 loop, best of 3: 639 ms per loop

That's really better for the random_string.

The reason is that we don't compute the last lines that are identical.

In [16]:
s = random_string(1e5)
sm = suffix_matrix(to_int_keys2, s)
sms = suffix_matrix_smart(to_int_keys2, s)

print(len(sm), len(sms))
print(sm[:len(sms)] == sms)
18 4
True

It is even better with integers

Let's test again the performances with integers:

In [17]:
def suffix_matrix_smart_int(to_int_keys, s):
    """
    suffix matrix of s
    O(n * log(n)^2)
    """
    n = len(s)
    k = 1
    line = to_int_keys(s)
    ans = [line]
    while max(line) < n - 1:
        line = to_int_keys(
            [a * (n + 1) + b + 1
             for (a, b) in
             zip_longest(line, islice(line, k, None),
                         fillvalue=-1)])
        ans.append(line)
        k <<= 1
    return ans
In [18]:
for length in [1e5, 5e5]:
    print("length = %i\n" % length)
    for s in ['constant_string',
              'random_string']:
        print(s)
        s = eval(s)(length)
        for sm in ['suffix_matrix_smart',
                   'suffix_matrix_smart_int']:
            print(sm)
            sm = eval(sm)
            %timeit sm(to_int_keys2, s)
            %timeit sm(to_int_keys3, s)
            print()
length = 100000

constant_string
suffix_matrix_smart
1 loop, best of 3: 761 ms per loop
1 loop, best of 3: 837 ms per loop

suffix_matrix_smart_int
1 loop, best of 3: 954 ms per loop
1 loop, best of 3: 967 ms per loop

random_string
suffix_matrix_smart
1 loop, best of 3: 618 ms per loop
1 loop, best of 3: 645 ms per loop

suffix_matrix_smart_int
1 loop, best of 3: 529 ms per loop
1 loop, best of 3: 536 ms per loop

length = 500000

constant_string
suffix_matrix_smart
1 loop, best of 3: 5.4 s per loop
1 loop, best of 3: 4.33 s per loop

suffix_matrix_smart_int
1 loop, best of 3: 5.5 s per loop
1 loop, best of 3: 5.49 s per loop

random_string
suffix_matrix_smart
1 loop, best of 3: 3.93 s per loop
1 loop, best of 3: 3.74 s per loop

suffix_matrix_smart_int
1 loop, best of 3: 2.9 s per loop
1 loop, best of 3: 2.45 s per loop

As I explained before, to_int_keys3 is really better for the first steps (and when we work with integers). Since we are stopping after less steps, to_int_keys3 provides the best results.

The to_int_keys_best, suffix_matrix_best and suffix_array_best functions of the beginning are just some refactoring of to_int_keys3 and suffix_matrix_smart_int.

Final remarks on the LCP computation

Discarding the last identical lines of the suffix matrix is not a problem to compute the LCP:

In [19]:
def lcp(sm, i, j):
    """
    longest common prefix
    O(log(n))

    sm: suffix matrix
    """
    n = len(sm[-1])
    if i == j:
        return n - i
    k = 1 << (len(sm) - 2)
    ans = 0
    for line in sm[-2::-1]:
        if i >= n or j >= n:
            break
        if line[i] == line[j]:
            ans ^= k
            i += k
            j += k
        k >>= 1
    return ans

print(lcp(suffix_matrix(to_int_keys, 'banana'), 2, 4))
print(lcp(suffix_matrix_smart(to_int_keys, 'banana'), 2, 4))
2
2

Plus Ultra

The prefix doubling algorithm uses $O(log(n))$ steps that cost each $O(n*log(n))$ (or $O(A*log(A) + n)$ with to_int_keys3).

We attacked the first $O(log(n))$ factor for random strings.

Now, if you use a radix sort, you can implement to_int_key in $O(n)$, and use it with suffix_matrix_smart or suffix_matrix_smart_int (the former seems more suitable to avoid moduli operations).

I implemented a radix sort based to_int_keys function that was not optimised and used it with suffix_matrix_smart. I tested it against suffix_matrix_best and it had similar results against random_string but performed really bad against constant_string. My radix sort was not optimised for almost sorted arrays, and was quite bad because it was not in place and had to reduce two elements counting arrays (see below).

In fact, the native Timsort is so efficient that it beats the radix sort when we provide almost sorted arrays. I don't know what would happen for really high values of $n$ (tested up to 5e6), but who would process longer arrays on a single computer with a Python code?

You can consult those two good references: