Strings

Intro Strings + Rolling Hash

Many ideas in string problems can be extended to integer arrays and vice-versa. For that reason, in this section we stick to concepts which have a "string flavor."

Rolling Hash - Core idea

Often, when comparing substrings of a string, looping through the substring is too slow. After linear-time preprocessing, string hashing lets us compare substrings in $O(1)$ instead of spending time proportional to the length of the substring.

The main idea is we will map each string to a number such that the probability of two unequal strings mapping to the same number is quite small.

We can define the hash of a string $s$ of length $k$ as:

$$H(s) = s_0B^{k-1} + s_1B^{k-2} + \dots + s_{k-1}$$

usually modulo some large prime $M$. Here $B$ is some integer (usually more than the alphabet size) and we map characters to numbers.

We can use a trick similar to prefix sums to compute the hash of any substring in $O(1)$. Start by computing all the prefix hashes. For a string indexed from $0$:

powB[0] = 1
pref[0] = 0
for i in 0..n-1:
    powB[i+1] = powB[i] * B mod M
    pref[i+1] = (pref[i] * B + value(s[i])) mod M

Then:

$$\text{hash}(l,r) = pref[r] - pref[l]\cdot B^{r-l}$$

with the usual modulo fixup (reduce modulo $M$ and add $M$ if the result is negative). The power $B^{r-l}$ is read from powB[r-l].

This gives the hash of substring $s[l..r-1]$ in $O(1)$.

The pseudocode above follows the implementation in KACTL, where the prefix hashes are exclusive on the index. For the modulo, they use unsigned integers, which wrap around modulo $2^{64}$ on overflow, with some extra care for details. Additionally, they take hashes modulo two primes to reduce the probability of a hash collision.
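As a rough illustration, the prefix-hash pseudocode can be written in Python as below. This is a minimal sketch, not the KACTL code: the modulus `MOD`, the base `BASE`, and the function names are assumptions chosen for the example, and characters are mapped to numbers with `ord`.

```python
MOD = (1 << 61) - 1   # assumed modulus: a large Mersenne prime
BASE = 911382323      # assumed base, larger than the alphabet size


def build_prefix_hashes(s, base=BASE, mod=MOD):
    """Return (pref, pow_b), where pref[i] is the hash of s[0..i-1]."""
    n = len(s)
    pref = [0] * (n + 1)
    pow_b = [1] * (n + 1)
    for i, ch in enumerate(s):
        pow_b[i + 1] = pow_b[i] * base % mod
        pref[i + 1] = (pref[i] * base + ord(ch)) % mod
    return pref, pow_b


def substring_hash(pref, pow_b, l, r, mod=MOD):
    """Hash of s[l..r-1] (half-open range) in O(1).

    Python's % always returns a non-negative result, so no extra
    fixup is needed after the subtraction.
    """
    return (pref[r] - pref[l] * pow_b[r - l]) % mod


pref, pow_b = build_prefix_hashes("abracadabra")
# "abra" occurs at positions 0 and 7, so the two hashes agree.
same = substring_hash(pref, pow_b, 0, 4) == substring_hash(pref, pow_b, 7, 11)
print(same)  # True
```

Equal substrings always produce equal hashes; the probabilistic part is only that unequal substrings might collide.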

What is the probability of a hash collision modulo a certain prime?

What rolling hash is good for

  • checking if two substrings are equal
  • binary search + hash comparison to find the longest match starting at two given positions.
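The second use can be sketched as follows. This is one possible reading of "longest match": the longest common prefix of two suffixes of the same string. The names `prefix_hashes` and `longest_common_extension`, and the constants, are assumptions made for the example.

```python
MOD = (1 << 61) - 1  # assumed modulus (a Mersenne prime)
BASE = 131           # assumed base


def prefix_hashes(s):
    pref, pw = [0], [1]
    for ch in s:
        pw.append(pw[-1] * BASE % MOD)
        pref.append((pref[-1] * BASE + ord(ch)) % MOD)
    return pref, pw


def longest_common_extension(s, i, j):
    """Length of the longest common prefix of s[i:] and s[j:].

    Hashes are rebuilt on every call to keep the sketch short; in
    practice they would be built once and shared across queries.
    """
    pref, pw = prefix_hashes(s)

    def get(l, r):  # hash of s[l..r-1]
        return (pref[r] - pref[l] * pw[r - l]) % MOD

    lo, hi = 0, len(s) - max(i, j)  # the answer lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if get(i, i + mid) == get(j, j + mid):
            lo = mid       # first mid characters match, try longer
        else:
            hi = mid - 1   # mismatch within the first mid characters
    return lo


print(longest_common_extension("abcabcabd", 0, 3))  # 5
```

The binary search is valid because matching is monotone: if the first `mid` characters agree, so do all shorter prefixes. Each query costs $O(\log n)$ hash comparisons once the prefix hashes exist.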

How can we use string hashing to find the Longest Palindromic Substring of a string faster than before?

Practice:


KMP

KMP is built around the prefix function. This is one of the most important arrays in string algorithms.

Prefix function / LPS array

For each index $i$, let $\pi[i]$ be the length of the longest proper prefix of $s[0..i]$ that is also a suffix of $s[0..i]$.
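To make the definition concrete, here is a direct, deliberately slow translation into Python. It is only meant as an illustration of what the array contains; the function name is an assumption for this example.

```python
def prefix_function_naive(s):
    """Direct translation of the definition; cubic time, illustration only."""
    n = len(s)
    pi = [0] * n
    for i in range(n):
        # Try every proper prefix length k of s[0..i], longest first.
        for k in range(i, 0, -1):
            if s[:k] == s[i - k + 1 : i + 1]:
                pi[i] = k
                break
    return pi


print(prefix_function_naive("abcabcd"))  # [0, 0, 0, 1, 2, 3, 0]
print(prefix_function_naive("aabaaab"))  # [0, 1, 0, 1, 2, 2, 3]
```

For example, in "aabaaab" the value at the last index is 3 because "aab" is both a prefix and a suffix of the whole string, and no longer proper prefix is.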

Computing the prefix function in linear time

pi[0] = 0
for i in 1..n-1:
    j = pi[i-1]
    while j > 0 and s[i] != s[j]:
        j = pi[j-1]
    if s[i] == s[j]:
        j++
    pi[i] = j

Pseudocode from CP-Algos

Why this works:

  • suppose you want to extend the best border ending at $i-1$
  • if the next character mismatches, the next candidate border is the border of that border
  • so you “jump” through previously computed border lengths instead of restarting from scratch - (to be explained more clearly in class)

Since every iteration of the while loop decreases $j$ by at least one, and $j$ increases by at most one per step of the outer loop, the while loop runs at most $n$ times in total. The whole algorithm therefore runs in $O(n)$, that is, amortized $O(1)$ per index.
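The pseudocode translates almost line for line into Python. The sketch below assumes the string is 0-indexed and uses the hypothetical name `prefix_function`.

```python
def prefix_function(s):
    """Linear-time prefix function (LPS array) of s."""
    n = len(s)
    pi = [0] * n
    for i in range(1, n):
        j = pi[i - 1]              # best border ending at i-1
        while j > 0 and s[i] != s[j]:
            j = pi[j - 1]          # fall back to the border of that border
        if s[i] == s[j]:
            j += 1                 # the border extends by one character
        pi[i] = j
    return pi


print(prefix_function("abcabcd"))  # [0, 0, 0, 1, 2, 3, 0]
print(prefix_function("aabaaab"))  # [0, 1, 0, 1, 2, 2, 3]
```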

How can we use KMP to solve the string matching problem in linear time?

Z-function

The Z-function is another array computable in linear time that is often interchangeable with the prefix function from KMP. Many problems can be solved with either one and the algorithms are quite similar. In fact, the Z-function array and the LPS array can be computed from each other.

Definition

For each index $i$, let $z[i]$ be the length of the longest common prefix of:

  • the whole string s with
  • the suffix starting at i

So z[i] tells you how many characters match if you align s with s[i..].
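A quadratic-time version that follows the definition directly may help as a reference. The sketch assumes the common convention $z[0] = 0$; the function name is made up for this example.

```python
def z_function_naive(s):
    """Z-array straight from the definition; O(n^2), illustration only."""
    n = len(s)
    z = [0] * n  # z[0] is left as 0 by convention (an assumption here)
    for i in range(1, n):
        # Align s with s[i..] and count matching characters.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
    return z


print(z_function_naive("aaabaab"))  # [0, 2, 1, 0, 2, 1, 0]
```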

Linear-time construction

Maintain the rightmost segment $[l,r]$ that matches a prefix. When processing position $i$:

  • if $i > r$, start matching from scratch
  • otherwise, reuse information from the mirrored position inside the current "Z-box"
  • then extend greedily if possible (exact details to be discussed in class)

z[i] = 0 for all i
l = r = 0
for i in 1..n-1:
    if i <= r:
        z[i] = min(r - i + 1, z[i-l])
    while i + z[i] < n and s[z[i]] == s[i+z[i]]:
        z[i]++
    if i + z[i] - 1 > r:
        l = i
        r = i + z[i] - 1

Pseudocode from CP-Algos

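This is also $O(n)$: whenever $i \le r$ and the inner while loop succeeds, the match extends past $r$, so every successful iteration pushes $r$ one step to the right. Since $r$ never decreases and is at most $n-1$, the total number of iterations of the inner while loop is bounded by the size of the array.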
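In Python, the same construction might look like the sketch below. It keeps the inclusive $[l, r]$ segment from the pseudocode and assumes $z[0] = 0$; the name `z_function` is chosen for the example.

```python
def z_function(s):
    """Linear-time Z-array of s, with z[0] = 0 by convention."""
    n = len(s)
    z = [0] * n
    l = r = 0  # inclusive bounds of the rightmost Z-box found so far
    for i in range(1, n):
        if i <= r:
            # Reuse the mirrored position i - l, capped at the box end.
            z[i] = min(r - i + 1, z[i - l])
        # Extend the match character by character past what is known.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] - 1 > r:
            l, r = i, i + z[i] - 1
    return z


print(z_function("aaabaab"))  # [0, 2, 1, 0, 2, 1, 0]
print(z_function("abacaba"))  # [0, 0, 1, 0, 3, 0, 1]
```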

Relationship to KMP

  • KMP stores: best prefix match ending at each position
  • Z stores: prefix match length starting at each position

They capture the same string structure from different angles. The linked CP-Algos pages are really good references for understanding how and why the algorithms work.
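One direction of the conversion between the two arrays can be sketched in a few lines. The idea is that a match of length $z[i]$ starting at $i$ gives a border of length $j+1$ ending at position $i+j$ for every $j < z[i]$, and smaller $i$ gives the longer border. The function name `prefix_from_z` is an assumption for this example, and the input is assumed to use $z[0] = 0$.

```python
def prefix_from_z(z):
    """Build the prefix function from a Z-array (with z[0] = 0) in O(n)."""
    n = len(z)
    pi = [0] * n
    for i in range(1, n):
        # Walk the matched block from its right end towards i.
        for j in range(z[i] - 1, -1, -1):
            if pi[i + j] > 0:
                # Already set by a smaller i, which gives a longer border;
                # everything to the left of this position is set as well.
                break
            pi[i + j] = j + 1
    return pi


# Z-array of "aaabaab" is [0, 2, 1, 0, 2, 1, 0].
print(prefix_from_z([0, 2, 1, 0, 2, 1, 0]))  # [0, 1, 2, 0, 1, 2, 0]
```

Each position is written at most once and every inner loop stops at the first position that is already filled, so the conversion stays linear.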

Problems (KMP and/or Z-function):