Strings
Intro Strings + Rolling Hash
Many ideas in string problems can be extended to integer arrays and vice-versa. For that reason, in this section we stick to concepts which have a "string flavor."
Rolling Hash - Core idea
Often, when comparing substrings of a string, looping through the substring is too slow. String hashing lets us compare substrings in $O(1)$ (after linear preprocessing) instead of spending time roughly proportional to the length of the substring.
The main idea is to map each string to a number such that the probability of two unequal strings mapping to the same number is quite small.
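We can define the hash of a string $s$ of length $n$ as:
$$H(s) = \left(\sum_{i=0}^{n-1} \mathrm{value}(s[i]) \cdot B^{\,n-1-i}\right) \bmod M$$
where $M$ is usually some large prime. Here $B$ is some integer (usually more than the alphabet size) and $\mathrm{value}$ maps characters to numbers.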
We can use a trick similar to prefix sums to compute the hash of any substring in $O(1)$. Start by computing all the prefix hashes:
For a string $s$ of length $n$ indexed from $0$:
pref[0] = 0
powB[0] = 1
for i in 0..n-1:
    powB[i+1] = powB[i] * B mod M
    pref[i+1] = (pref[i] * B + value(s[i])) mod M
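Then:
$$\mathrm{hash}(l, r) = \left(\mathrm{pref}[r] - \mathrm{pref}[l] \cdot \mathrm{powB}[r-l]\right) \bmod M$$
with the usual modulo fixup (add $M$ if the result is negative).
This gives the hash of the substring $s[l..r)$, with $r$ exclusive, in $O(1)$.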
This pseudocode is from the implementation on KACTL, where the prefix hashes are exclusive on the index. For the modulo, they use unsigned 64-bit integers, which by default wrap modulo $2^{64}$ on overflow, with some extra care for details. Additionally, they hash subarrays modulo two primes to reduce the probability of a hash collision.
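The prefix-hash construction and the substring query above can be sketched in Python as follows. This is a minimal illustration, not the KACTL implementation: the modulus `M`, the base `B`, and the function names `build` and `substring_hash` are assumptions chosen for this example, and `ord` plays the role of `value`.

```python
# Assumed parameters for illustration only: a large prime modulus and a fixed base.
M = (1 << 61) - 1
B = 911382323

def build(s):
    """Return (pref, powB), where pref[i] is the hash of s[0:i] (exclusive index)."""
    n = len(s)
    pref = [0] * (n + 1)
    powB = [1] * (n + 1)
    for i, ch in enumerate(s):
        powB[i + 1] = powB[i] * B % M
        pref[i + 1] = (pref[i] * B + ord(ch)) % M
    return pref, powB

def substring_hash(pref, powB, l, r):
    """Hash of the half-open substring s[l:r]."""
    # Python's % already returns a non-negative value, so no extra fixup is needed.
    return (pref[r] - pref[l] * powB[r - l]) % M
```

For example, with `pref, powB = build("abracadabra")`, the two occurrences of "abra" compare equal: `substring_hash(pref, powB, 0, 4) == substring_hash(pref, powB, 7, 11)`. In a contest setting one would typically pick `B` at random to make targeted anti-hash tests harder.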
What is the probability of a hash collision modulo a certain prime?
What rolling hash is good for
- checking if two substrings are equal
- binary search + hash comparison to find the longest match starting from two given positions
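The second use can be sketched as follows: if the substrings of length $k$ starting at two positions match, then so do all shorter ones, so the match length can be binary searched. This is a rough sketch under the assumption that no hash collision occurs; the modulus `M`, base `B`, and helper names are illustrative choices, not taken from the notes.

```python
M = (1 << 61) - 1  # assumed modulus
B = 911382323      # assumed base

def prefix_hashes(s):
    """pref[i] is the hash of s[0:i]; powB[i] is B**i mod M."""
    pref, powB = [0], [1]
    for ch in s:
        powB.append(powB[-1] * B % M)
        pref.append((pref[-1] * B + ord(ch)) % M)
    return pref, powB

def get(pref, powB, l, r):
    """Hash of s[l:r] (r exclusive)."""
    return (pref[r] - pref[l] * powB[r - l]) % M

def lcp(s, pref, powB, i, j):
    """Longest common prefix of s[i:] and s[j:], assuming no hash collisions."""
    lo, hi = 0, len(s) - max(i, j)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if get(pref, powB, i, i + mid) == get(pref, powB, j, j + mid):
            lo = mid       # length mid matches, try longer
        else:
            hi = mid - 1   # length mid fails, so every longer length fails too
    return lo
```

Each query costs $O(\log n)$ hash comparisons after the $O(n)$ preprocessing.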
How can we use string hashing to find the Longest Palindromic Substring of a string faster than before?
Practice:
- Finding Borders (CSES)
- String Matching (CSES)
- Finding Periods (CSES)
- Good Substrings (CF 271D)
- Password (CF 126B)
- Who Says a Pun? (AtCoder ABC141 E)
KMP
KMP is built around the prefix function. This is one of the most important arrays in strings.
Prefix function / LPS array
For each index $i$, let $\pi[i]$ be the length of the longest proper prefix of $s[0..i]$ that is also a suffix of $s[0..i]$.
Computing the prefix function in linear time
pi[0] = 0
for i in 1..n-1:
    j = pi[i-1]
    while j > 0 and s[i] != s[j]:
        j = pi[j-1]
    if s[i] == s[j]:
        j++
    pi[i] = j
Pseudocode from CP-Algos
Why this works:
- suppose you want to extend the best border ending at $i-1$
- if the next character mismatches, the next candidate border is the border of that border
- so you “jump” through previously computed border lengths instead of restarting from scratch
(to be explained more clearly in class)
Since every iteration of the while loop decreases the current match length j by at least one, and j can increase by at most one per step of the outer loop, the while loop runs at most $n$ times in total, so the whole algorithm runs in $O(n)$.
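A direct Python transcription of the pseudocode above may help when experimenting. The function name `prefix_function` is an illustrative choice; the logic follows the pseudocode line by line.

```python
def prefix_function(s):
    """pi[i] = length of the longest proper prefix of s[0..i] that is also its suffix."""
    n = len(s)
    pi = [0] * n
    for i in range(1, n):
        j = pi[i - 1]                      # best border ending at i-1
        while j > 0 and s[i] != s[j]:
            j = pi[j - 1]                  # fall back to the border of that border
        if s[i] == s[j]:
            j += 1                         # the border extends by one character
        pi[i] = j
    return pi
```

For instance, `prefix_function("aabaaab")` gives `[0, 1, 0, 1, 2, 2, 3]`.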
How can we use KMP to solve the string matching problem in linear time?
Z-function
The Z-function is another array computable in linear time that is often interchangeable with the prefix function from KMP. Many problems can be solved with either one, and the algorithms are quite similar. In fact, the Z-function array and the LPS array can be computed from each other.
Definition
For each index $i$, let $z[i]$ be the length of the longest common prefix of:
- the whole string $s$
- the suffix of $s$ starting at $i$
So z[i] tells you how many characters match if you align s with s[i..].
Linear-time construction
Maintain the rightmost segment $[l, r]$ that matches a prefix of $s$. When processing position $i$:
- if $i > r$, start matching from scratch
- otherwise, reuse information from the mirrored position $i - l$ inside the current "Z-box"
- then extend greedily if possible (exact details to be discussed in class)
z[0..n-1] = 0
l = r = 0
for i in 1..n-1:
    if i <= r:
        z[i] = min(r - i + 1, z[i-l])
    while i + z[i] < n and s[z[i]] == s[i+z[i]]:
        z[i]++
    if i + z[i] - 1 > r:
        l = i
        r = i + z[i] - 1
Pseudocode from CP-Algos
This is also $O(n)$: every successful iteration of the inner while loop pushes $r$ one step to the right, so the total number of iterations is bounded by the size of the array.
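The construction above translates almost directly into Python. In this sketch the function name `z_function` is an illustrative choice, and `z[0]` is left at 0 by convention.

```python
def z_function(s):
    """z[i] = length of the longest common prefix of s and s[i:], with z[0] = 0."""
    n = len(s)
    z = [0] * n
    l = r = 0                              # [l, r] is the rightmost Z-box found so far
    for i in range(1, n):
        if i <= r:
            # Reuse the value at the mirrored position, clipped to the box.
            z[i] = min(r - i + 1, z[i - l])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1                      # extend the match greedily
        if i + z[i] - 1 > r:
            l, r = i, i + z[i] - 1         # the new box reaches further right
    return z
```

For instance, `z_function("aaabaab")` gives `[0, 2, 1, 0, 2, 1, 0]`.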
Relationship to KMP
- KMP stores: best prefix match ending at each position
- Z stores: prefix match length starting at each position
They capture the same string structure from different angles. The linked CP-Algos pages are really good references for understanding how and why the algorithms work.
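To make the connection concrete, here is one possible way to recover the prefix function from a Z-array without looking at the string again. It is a sketch, assuming the convention $z[0] = 0$; the name `prefix_from_z` is hypothetical. The idea: a match of length $z[i]$ starting at $i$ means that for each $j < z[i]$ there is a border of length $j+1$ ending at position $i+j$, and the smallest such $i$ gives the longest border.

```python
def prefix_from_z(z):
    """Build the prefix-function array from a Z-array (with z[0] == 0)."""
    n = len(z)
    pi = [0] * n
    for i in range(1, n):
        # Walk the match starting at i from its right end to its left end.
        for j in range(z[i] - 1, -1, -1):
            if pi[i + j] > 0:
                # Already set by an earlier (smaller) start, which gives a longer
                # border; everything further left in this match is set as well.
                break
            pi[i + j] = j + 1
    return pi
```

For example, the Z-array of "abacaba" is `[0, 0, 1, 0, 3, 0, 1]`, and `prefix_from_z` maps it to `[0, 0, 1, 0, 1, 2, 3]`. Each position is written at most once and every inner loop stops at the first position that is already filled, so the conversion is linear overall.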
Problems (KMP and/or Z-function):