*By Louis Abraham*

You can download the notebook from https://louisabraham.github.io/notebooks/suffix_arrays.ipynb

You can comment there on HN. Yes, I'm using HN as a comment system.

In [1]:

```
from itertools import zip_longest, islice

def to_int_keys_best(l):
    """
    l: iterable of keys
    returns: a list with integer keys
    """
    seen = set()
    ls = []
    for e in l:
        if e not in seen:
            ls.append(e)
            seen.add(e)
    ls.sort()
    index = {v: i for i, v in enumerate(ls)}
    return [index[v] for v in l]

def suffix_matrix_best(s):
    """
    suffix matrix of s
    O(n * log(n)^2)
    """
    n = len(s)
    k = 1
    line = to_int_keys_best(s)
    ans = [line]
    while max(line) < n - 1:
        line = to_int_keys_best(
            [a * (n + 1) + b + 1
             for (a, b) in
             zip_longest(line, islice(line, k, None),
                         fillvalue=-1)])
        ans.append(line)
        k <<= 1
    return ans

def suffix_array_best(s):
    """
    suffix array of s
    O(n * log(n)^2)
    """
    n = len(s)
    k = 1
    line = to_int_keys_best(s)
    while max(line) < n - 1:
        line = to_int_keys_best(
            [a * (n + 1) + b + 1
             for (a, b) in
             zip_longest(line, islice(line, k, None),
                         fillvalue=-1)])
        k <<= 1
    return line
```
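As a quick sanity check, `suffix_array_best` can be cross-checked against the naive definition used later in this post (the cell repeats the definitions above so it runs on its own):

```python
# Cross-check suffix_array_best against a naive oracle on small random
# strings. The two functions above are repeated so the cell is
# self-contained.
from itertools import zip_longest, islice
from random import choice, seed

def to_int_keys_best(l):
    seen = set()
    ls = []
    for e in l:
        if e not in seen:
            ls.append(e)
            seen.add(e)
    ls.sort()
    index = {v: i for i, v in enumerate(ls)}
    return [index[v] for v in l]

def suffix_array_best(s):
    n = len(s)
    k = 1
    line = to_int_keys_best(s)
    while max(line) < n - 1:
        line = to_int_keys_best(
            [a * (n + 1) + b + 1
             for (a, b) in zip_longest(line, islice(line, k, None),
                                       fillvalue=-1)])
        k <<= 1
    return line

def naive_suffix_array(s):
    # rank of each suffix, obtained by sorting the suffixes directly
    ans = [0] * len(s)
    for rank, (_, i) in enumerate(sorted((s[i:], i) for i in range(len(s)))):
        ans[i] = rank
    return ans

assert suffix_array_best('banana') == [3, 2, 5, 1, 4, 0]

seed(0)
for _ in range(50):
    s = ''.join(choice('ab') for _ in range(30))
    assert suffix_array_best(s) == naive_suffix_array(s)
print('ok')
```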

This notebook is **not** intended as a course, a tutorial or an explanation on the suffix array algorithms, just practical implementations in Python.

This implementation is intended to be simple enough to be rewritten during programming contests, yet as efficient as possible. It is particularly optimised for **random** data.

This week-end, I wanted to improve my skills on string algorithms and solve this [Kattis] problem. I had an algorithm with efficient complexity to compute the suffix array, but it hit the time limit (mostly because I'm using Python). When I implemented my last-lines skipping trick, my code passed with Python 2 because the server uses PyPy; however, I still had a timeout in Python 3. This is how I managed to get a fast solution with the CPython implementation of Python 3.

I had a couple of resources on the topic, so I started with [Stanford].

This paper is really the best I know because out of 17 pages, there are 7 of theory and code and 9 of exercises with solutions (the last page is for references).

They use an algorithm called prefix doubling, and I'm going to describe some possible implementations and their practical performances.

Given a string `s` of length `n`, we want to compute the suffix array `sa` of the same length: `sa[i]` is the rank, in the lexicographic order, of the suffix starting at index `i`, `s[i:]`, amongst all the suffixes.

The suffixes are strictly ordered (two suffixes cannot be equal since they all have different lengths), thus the suffix array is always a permutation of `range(n)`.

For example, the suffix array of 'banana' is [3, 2, 5, 1, 4, 0]. We can see that the first two suffixes (ranks 0 and 1) are the ones that begin with the letter 'a'.

There is an equivalent definition of suffix arrays: `sa_alternative[i]` is the index of the suffix at the i-th position in the sorted list of all suffixes.

In [2]:

```
def suffix_array_alternative_naive(s):
    return [rank for suffix, rank in sorted((s[i:], i) for i in range(len(s)))]

suffix_array_alternative_naive('banana')
```

Out[2]:

The result is `[5, 3, 1, 0, 4, 2]`: this alternative representation is the inverse permutation of `sa`, and it is trivial to invert it in $O(n)$.

In [3]:

```
def inverse_array(l):
    n = len(l)
    ans = [0] * n
    for i in range(n):
        ans[l[i]] = i
    return ans

def suffix_array_naive(s):
    return inverse_array(suffix_array_alternative_naive(s))

suffix_array_naive('banana')
```

Out[3]:

Now the performance doesn't seem horrible:

In [4]:

```
from random import randint

constant_string = lambda length: 'a' * int(length)
random_string = lambda length: ''.join(chr(randint(ord('a'), ord('z')))
                                       for _ in range(int(length)))

s1 = constant_string(1e4)
s2 = random_string(1e4)

%timeit suffix_array_alternative_naive(s1)
%timeit suffix_array_alternative_naive(s2)
```

The time complexity is dominated by the generation of `sorted((s[i:], i) for i in range(len(s)))`, which is $O(n^2*log(n))$ in the *worst* case: $O(n*log(n))$ for the sort times $O(n)$ for one comparison.

In practice, it is not really bad because the strings start to be different really quickly, so the $O(n)$ factor becomes less significant.

But the worst part is the memory cost: $O(n^2)$. That makes the algorithm use several GB of RAM for `length = 100000`.

Because string comparisons seem to be really optimised in Python, the trivial algorithm slightly outperforms the more efficient algorithms for small values. However, it is really not usable for bigger values of `n`.

Here, one could think that the performance would be worse for the `constant_string` because of the comparisons. However, because the prefixes are already sorted and the constant behind the string comparison is really small, it turns out to be faster. See my question on [StackOverflow].

The `constant_string` is such that the number of letter comparisons between two suffixes is maximal (always the length of the shorter one). Find an algorithm that, given an alphabet $A$ and a length $n$, produces an element of $A^n$ such that the maximum number of comparisons between any two suffixes is minimal.

The algorithm described in [Stanford] is called *prefix doubling*.

Let's say all the letters of your string are different. Then the order of the suffixes is only determined by their first letter, and you can do that in $O(n*log(n))$ time and $O(n)$ space complexity by sorting the letters.

Even if the letters are not all different, grouping the suffixes according to their first letter is a good idea: it is the idea of a radix sort on the prefixes.

And here we can use an idea similar to pointer jumping to double at each step the length of the prefixes that are compared.

Say you have, at step `log(k)`, the orders of the partial prefixes `s[i: i+k]`. But `s[i: i+2*k] == s[i: i+k] + s[i+k: i+2*k]`, so at each step you can merge the order and the shifted order to obtain the new order. Hence the name *prefix doubling*. It is really a beautiful algorithm.
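To make this concrete, here is one doubling step carried out by hand on `'banana'` (a standalone illustration, not part of the actual implementations):

```python
# One prefix-doubling step on 'banana', by hand.
s = 'banana'
n = len(s)
# ranks of the length-1 prefixes: 'a' -> 0, 'b' -> 1, 'n' -> 2
rank1 = [1, 0, 2, 0, 2, 0]
# the rank of s[i:i+2] is obtained by sorting the pairs
# (rank1[i], rank1[i+1]), padding with -1 past the end of the string
pairs = [(rank1[i], rank1[i + 1] if i + 1 < n else -1) for i in range(n)]
order = {p: r for r, p in enumerate(sorted(set(pairs)))}
rank2 = [order[p] for p in pairs]
print(rank2)  # [2, 1, 3, 1, 3, 0] -- ranks of 'ba', 'an', 'na', 'an', 'na', 'a'
```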

After at most $log(n)$ steps, you have totally sorted the prefixes. The complexity is $O(n)$ space (or $O(n*log(n))$ for the prefix matrix, see below) and $O(n*log^2n)$ time if you implement the merge operation in $O(n*log(n))$.

When you use prefix doubling, keeping the intermediate partial sorts allows us to compute the Longest Common Prefix (LCP) of two indexes `i` and `j` in $O(log(n))$. The strategy is described in [Stanford].

This is the reason why I prefer prefix doubling over the others: a semilinear precomputation gives LCP queries in $O(log(n))$ time. If you think about it, using the suffix matrix to compute the LCP is similar to the parent pointer doubling algorithm for the Lowest Common Ancestor (LCA) in the suffix tree, see [Topcoder].

There are $O(n)$ algorithms to compute the suffix array, like building the suffix tree and converting it to a suffix array, but the constant is high. A more efficient one is the Skew algorithm: [Skew].

However, the Skew algorithm does not allow simple computation of the LCP.

I think there are more complicated algorithms that use both the suffix array and the LCP array to traverse the suffix tree without constructing it; then you can use other classic algorithms for the LCA.

In [5]:

```
from itertools import zip_longest, islice

def to_int_keys(l):
    """
    l: iterable of keys
    returns: a list with integer keys
    """
    index = {v: i for i, v in enumerate(sorted(set(l)))}
    return [index[v] for v in l]

def suffix_matrix(to_int_keys, s):
    n = len(s)
    k = 1
    line = to_int_keys(s)
    ans = [line]
    while k < n:
        line = to_int_keys(
            list(zip_longest(line, islice(line, k, None),
                             fillvalue=-1)))
        ans.append(line)
        k <<= 1
    return ans

suffix_matrix(to_int_keys, 'banana')
```

Out[5]:

The `to_int_keys` function contains almost all the magic. It takes any iterable and returns integer indexes; if two elements are identical, their indexes will be equal.

Since the keys are non-negative integers, we can use $-1$ as a filler that is smaller than every key.
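For instance (the function is repeated so the snippet runs on its own):

```python
# to_int_keys maps each element to its rank among the distinct values.
def to_int_keys(l):
    index = {v: i for i, v in enumerate(sorted(set(l)))}
    return [index[v] for v in l]

print(to_int_keys('banana'))         # [1, 0, 2, 0, 2, 0]
print(to_int_keys([10, -3, 10, 7]))  # [2, 0, 2, 1]
```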

I wrote a version almost identical to the C++ code of the PDF, but it performs badly.

In [6]:

```
def suffix_matrix_stanford(s):
    n = len(s)
    k = 1
    line = to_int_keys(s)
    ans = [line]
    while k < n:
        L = sorted(
            (key, index)
            for index, key in enumerate(
                zip_longest(line, islice(line, k, None),
                            fillvalue=-1)
            ))
        line = [0] * n
        for i in range(n):
            line[L[i][1]] = (line[L[i - 1][1]]
                             if i and L[i][0] == L[i - 1][0]
                             else i)
        ans.append(line)
        k <<= 1
    return ans

suffix_matrix_stanford('banana')
```

Out[6]:

In [7]:

```
for s in [constant_string(1e5),
          random_string(1e5)]:
    %timeit suffix_matrix(to_int_keys, s)
    %timeit suffix_matrix_stanford(s)
```

It shows that the Python builtins are very powerful.

We can however observe that using `set` might slow down `sorted`, because Timsort performs well on partially sorted data and sets are not ordered in Python. Can we fix that?

After a few steps, the array is almost sorted and Timsort performs well on that data. In the random case, the suffix array is computed after a few steps, and sorting is done in $O(n)$ because Timsort is adaptive (and in general is "designed to perform well on many kinds of real-world data").

We test different `to_int_keys` functions implementing this trick:

In [8]:

```
def to_int_keys1(l):
    """
    l: iterable of keys
    returns: a list with integer keys
    """
    ls = []
    for e in sorted(l):
        if not ls or e != ls[-1]:
            ls.append(e)
    index = {v: i for i, v in enumerate(ls)}
    return [index[v] for v in l]

def to_int_keys2(l):
    """
    l: iterable of keys
    returns: a list with integer keys
    """
    cnt = 0
    lastKey = None
    index = {}
    for key in sorted(l):
        if key != lastKey:
            index[key] = cnt
            cnt += 1
            lastKey = key
    return [index[v] for v in l]

def to_int_keys3(l):
    """
    l: iterable of keys
    returns: a list with integer keys
    """
    seen = set()
    ls = []
    for e in l:
        if e not in seen:
            ls.append(e)
            seen.add(e)
    ls.sort()
    index = {v: i for i, v in enumerate(ls)}
    return [index[v] for v in l]
```

We can do a little hack to compare integers in `range(n * (n + 1))` instead of tuples.
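A quick check that this encoding is order-preserving (`encode` is just a local helper for the illustration):

```python
# The first component a is a rank in range(n); the second component b is
# a rank or the -1 filler. The key a * (n + 1) + b + 1 orders exactly
# like the pair (a, b) and fits in range(n * (n + 1)).
from itertools import product

n = 8
pairs = list(product(range(n), list(range(n)) + [-1]))
encode = lambda p: p[0] * (n + 1) + p[1] + 1

# the integer keys sort exactly like the tuples
assert sorted(pairs) == sorted(pairs, key=encode)
# and every key fits in range(n * (n + 1))
assert all(0 <= encode(p) < n * (n + 1) for p in pairs)
```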

In [9]:

```
def suffix_matrix_int(to_int_keys, s):
    n = len(s)
    k = 1
    line = to_int_keys(s)
    ans = [line]
    while k < n:
        line = to_int_keys(
            [a * (n + 1) + b + 1
             for (a, b) in
             zip_longest(line, islice(line, k, None),
                         fillvalue=-1)])
        ans.append(line)
        k <<= 1
    return ans
```

Now, let's compare everybody:

In [10]:

```
for length in [1e4, 1e5, 5e5]:
    print("length = %i\n" % length)
    for s in ['constant_string',
              'random_string']:
        print(s)
        s = eval(s)(length)
        for sm in ['suffix_matrix', 'suffix_matrix_int']:
            print(sm)
            sm = eval(sm)
            %timeit sm(to_int_keys, s)
            %timeit sm(to_int_keys1, s)
            %timeit sm(to_int_keys2, s)
            %timeit sm(to_int_keys3, s)
        print()
```

Using integer keys helps speed up the computations for `random_string`, while slowing them down a bit for `constant_string`.

We can see that the three new functions perform very well, and the big winner is `to_int_keys2`.

The performance gained from applying Timsort to partially sorted data is too complicated to evaluate precisely; it is only a rule of thumb that works very well.

I was expecting `to_int_keys3` to perform better because there is some kind of "equilibrium" between the **number** of different elements to sort and their partial **order**. Indeed, `to_int_keys3` has an $O(A*log(A) + n)$ time complexity, where $A$ is the number of different elements, so it should be faster on the first steps, when the alphabet is small. However, the online operations done on the `set` seem to be inefficient, so the difference can only be seen on large values.

We can also observe that using `suffix_matrix_int` helps `to_int_keys3`, probably because `set` is more efficient with integers than with tuples.

On [AlgorithmicAlley], I found a surprisingly elegant algorithm (I modified it a bit):

In [11]:

```
from collections import defaultdict

def sort_bucket(s, bucket, order):
    d = defaultdict(list)
    for i in bucket:
        key = s[i + order // 2:i + order]
        d[key].append(i)
    result = []
    for k, v in sorted(d.items()):
        if len(v) > 1:
            result += sort_bucket(s, v, 2 * order)
        else:
            result.append(v[0])
    return result

def suffix_array_ManberMyers(s):
    return sort_bucket(s, range(len(s)), 1)

suffix_array_ManberMyers('banana')
```

Out[11]:

In [12]:

```
s = random_string(1e5)
%timeit suffix_matrix_int(to_int_keys2, s)
%timeit suffix_array_ManberMyers(s)
```

Yes, almost 10 times better. At first, I was impressed. Then I tested it on a constant example:

In [13]:

```
s = constant_string(5e4)
%timeit suffix_array_naive(s)
%timeit suffix_matrix(to_int_keys2, s)
%timeit suffix_array_ManberMyers(s)
```

What happened?

When `d.items()` is sorted, the keys are actual substrings. When you don't need to go really deep (as with a random string), that's not a problem: since you don't compute the orders that are not useful, you limit the number of comparisons and save a lot of time. But the principle of prefix doubling is precisely to reuse the partial order that is already computed.

In fact, for `constant_string`, you have to store almost all the suffixes before sorting them, and because you use a `defaultdict`, it is even less space and time efficient than the naive algorithm, and is $O(n^2)$.

So this implementation is really not the same algorithm. And because of the method used, you don't get the suffix matrix that justifies using an $O(n*log^2n)$ algorithm instead of the linear [Skew].

We saw that when the suffixes can be compared fast, there are a lot of neat optimisations, either by exploiting Timsort or by comparing the strings themselves as in [AlgorithmicAlley].

We can observe that if that is the case, the last lines of the matrix are going to be identical. Then, we don't need them!

The stop condition is simply to check whether we already have all the integers, i.e. `max(line) == n - 1`. We only change **one** line.

For random strings, the mean time to compare two prefixes is $O(log(n))$ instead of $O(n)$ so we do only about $O(log(log(n)))$ steps instead of $O(log(n))$.
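We can check the step count empirically (a standalone illustration using the simple `to_int_keys`; exact counts vary from run to run):

```python
# Count how many doubling steps are needed before the ranks become a
# full permutation of range(n), i.e. before max(line) == n - 1.
from itertools import zip_longest, islice
from random import randint

def to_int_keys(l):
    index = {v: i for i, v in enumerate(sorted(set(l)))}
    return [index[v] for v in l]

def doubling_steps(s):
    n = len(s)
    k, line, count = 1, to_int_keys(s), 0
    while max(line) < n - 1:
        line = to_int_keys(
            list(zip_longest(line, islice(line, k, None), fillvalue=-1)))
        k <<= 1
        count += 1
    return count

print(doubling_steps('banana'))  # 2
for length in (10**3, 10**4):
    s = ''.join(chr(randint(ord('a'), ord('z'))) for _ in range(length))
    print(length, doubling_steps(s))  # only a handful of steps
```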

I have never seen this trick anywhere, if you know a reference, please tell me on HN!

In [14]:

```
def suffix_matrix_smart(to_int_keys, s):
    n = len(s)
    k = 1
    line = to_int_keys(s)
    ans = [line]
    while max(line) < n - 1:
        line = to_int_keys(
            list(zip_longest(line, islice(line, k, None),
                             fillvalue=-1)))
        ans.append(line)
        k <<= 1
    return ans

suffix_matrix_smart(to_int_keys2, 'banana')
```

Out[14]:

In [15]:

```
for s in [constant_string(1e5), random_string(1e5)]:
    %timeit suffix_matrix(to_int_keys2, s)
    %timeit suffix_matrix_smart(to_int_keys2, s)
```

That's really better for the `random_string`.

The reason is that we don't compute the last lines that are identical.

In [16]:

```
s = random_string(1e5)
sm = suffix_matrix(to_int_keys2, s)
sms = suffix_matrix_smart(to_int_keys2, s)
print(len(sm), len(sms))
print(sm[:len(sms)] == sms)
```

Let's test the performance with integer keys again:

In [17]:

```
def suffix_matrix_smart_int(to_int_keys, s):
    """
    suffix matrix of s
    O(n * log(n)^2)
    """
    n = len(s)
    k = 1
    line = to_int_keys(s)
    ans = [line]
    while max(line) < n - 1:
        line = to_int_keys(
            [a * (n + 1) + b + 1
             for (a, b) in
             zip_longest(line, islice(line, k, None),
                         fillvalue=-1)])
        ans.append(line)
        k <<= 1
    return ans
```

In [18]:

```
for length in [1e5, 5e5]:
    print("length = %i\n" % length)
    for s in ['constant_string',
              'random_string']:
        print(s)
        s = eval(s)(length)
        for sm in ['suffix_matrix_smart',
                   'suffix_matrix_smart_int']:
            print(sm)
            sm = eval(sm)
            %timeit sm(to_int_keys2, s)
            %timeit sm(to_int_keys3, s)
        print()
```

As I explained before, `to_int_keys3` is really better for the first steps (and when we work with integers). Since we now stop after fewer steps, `to_int_keys3` provides the best results.

The `to_int_keys_best`, `suffix_matrix_best` and `suffix_array_best` functions at the beginning are just a refactoring of `to_int_keys3` and `suffix_matrix_smart_int`.

Discarding the last identical lines of the suffix matrix is not a problem to compute the LCP:

In [19]:

```
def lcp(sm, i, j):
    """
    longest common prefix
    O(log(n))
    sm: suffix matrix
    """
    n = len(sm[-1])
    if i == j:
        return n - i
    k = 1 << (len(sm) - 2)
    ans = 0
    for line in sm[-2::-1]:
        if i >= n or j >= n:
            break
        if line[i] == line[j]:
            ans ^= k
            i += k
            j += k
        k >>= 1
    return ans

print(lcp(suffix_matrix(to_int_keys, 'banana'), 2, 4))
print(lcp(suffix_matrix_smart(to_int_keys, 'banana'), 2, 4))
```

The prefix doubling algorithm uses $O(log(n))$ steps that each cost $O(n*log(n))$ (or $O(A*log(A) + n)$ with `to_int_keys3`).

We attacked the first $O(log(n))$ factor for random strings.

Now, if you use a radix sort, you can implement `to_int_keys` in $O(n)$ and use it with `suffix_matrix_smart` or `suffix_matrix_smart_int` (the former seems more suitable, to avoid modulo operations).
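Such a linear `to_int_keys` could look like the following counting-sort sketch (my illustration, not the implementation discussed in the text; it assumes the keys are integers in `range(m)`, which holds for the integer-key variants above):

```python
# A counting-sort based to_int_keys: O(len(l) + m) time, assuming the
# keys are integers in range(m).
def to_int_keys_counting(l, m):
    """
    l: list of integers in range(m)
    returns: a list with integer keys
    """
    present = [False] * m
    for v in l:
        present[v] = True
    # rank[v] = number of distinct present values smaller than v
    rank = [0] * m
    r = 0
    for v in range(m):
        rank[v] = r
        if present[v]:
            r += 1
    return [rank[v] for v in l]

print(to_int_keys_counting([5, 0, 5, 3], 6))  # [2, 0, 2, 1]
```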

I implemented a radix-sort-based `to_int_keys` function that was not optimised and used it with `suffix_matrix_smart`. I tested it against `suffix_matrix_best`: it gave similar results on `random_string` but performed really badly on `constant_string`. My radix sort was not optimised for almost sorted arrays, and was quite bad because it was not in place and had to merge two element-counting arrays.

In fact, the native Timsort is so efficient that it beats the radix sort when we provide almost sorted arrays. I don't know what would happen for really high values of $n$ (I tested up to `5e6`), but who would process longer arrays on a single computer in Python?

You can consult these references:

[Kattis]: https://open.kattis.com/problems/substrings

[Stanford]: http://web.stanford.edu/class/cs97si/suffix-array.pdf

[StackOverflow]: https://stackoverflow.com/questions/44866493/how-is-the-string-sort-optimised-in-cpython

[Skew]: http://www.cs.umd.edu/class/fall2011/cmsc858s/SuffixArrays.pdf

[AlgorithmicAlley]: http://algorithmicalley.com/archive/2013/06/30/suffix-arrays.aspx