程序代写案例-CSE 584A|学霸联盟

程序代写案例-CSE 584A

时间：2022-02-19

CSE 584A Algorithms for Biosequence Comparison February 26, 2018
Homework 1 Solutions
Note: this solution set is just for the theory problems.
1. (a) Consider a pattern P = AWA. The extended sp-values for this string are [0, 1, 2], since W
matches A. Now suppose we match P against the text T = ATAA. Clearly, there is a match
starting at T [1], which would cause KMP to shift forward by 3− 2 = 1 and resume comparisons
at T [4], generating a second match because T [4] = P [3]. However, this second match is incorrect,
since the allegedly matching string is actually TAA.
The underlying issue is that a residue class match does not imply transitivity in the same way as
a single-residue match. In particular, T [2] matches P [2] and P [2] matches P [1], but T [2] fails to
match P [1]. KMP assumes transitivity and so assumes a match where none exists.
(b) Suppose pattern P has a single residue class, at some position i. For concreteness, assume the
class is an R. We construct two sp arrays for P : one for each possible value of the class. Call
these arrays spA and spT. For example, if P = AWA, then we construct spA from the string
AAA and spT from ATA. (The two arrays share the same prefix through position i− 1.)
When we run KMP, if pattern P is partially matched starting at some text position j, and the
match aligns the residue class W to an A, then we use spA to determine the next shift; if it aligns
to a T , then we use spT to determine the shift. If we do not explicitly match the residue class at
a particular offset (either because the partial match does not reach it, or because it was matched
as part of the initial skip), we continue using whichever sp array we used previously.
To see that this scheme always induces a correct shift and initial skip of matching characters,
observe first that the shift and skip is surely correct if the current partial match causes us to
explicitly choose one of spA or spT. If a partial match is terminated without W being matched,
then the shift and skip are computed using the common prefix of spA and spT, so it does not
matter which one we use. Finally, if the residue class was matched as part of the skip, then it
matched the same character as it did at the last explicit choice, and so we may continue to use
the same sp array to compute the next shift and skip.
(c) If we use a similar scheme to the above on a pattern with k residue classes, then for a partial
match that includes all class positions, we may need to remember a value for each such position.
There are 2k possibilities for the values at the k class positions, each of which may require a
distinct array of sp values. Hence, we need Θ(2k|P |) space. (Even if we try to merge common
prefixes of the sp arrays, we still have 2k possibilities. A pattern with all k classes at the start
and arbitrarily many characters thereafter still achieves the asymptotic worst case.)
2. We begin by computing the simple failure links fi = spi−1+1. We then fill in the gi,c array in increasing
order of i as follows whenever P [i] 6= c:
g1,c = 0
gi,c =
{
fi if P [fi] = c
gfi,c otherwise.
1
To see correctness, proceed inductively on i. In the base case of i = 1, the initialization is correct by
definition of gi,c. Now assume that gi′,c is correct for i
′ < i, and consider how we proceed for i. The
first arm of the recursive case is surely correct by definition of gi,c. For the second arm, we know that
the naive KMP algorithm would fail again at position fi, then follow the same series of failure links as
if the mismatch had been at position fi. By the inductive hypothesis, the eventual target in the latter
case is gfi,c, so we may simply transfer this value to gi,c.
The algorithm spends Θ(|P |) time to compute fi and constant time per cell of the g array thereafter,
for a total cost of Θ(|P ||Σ|).
3. (a) We can simply mark as “accepting” each vertex of the Aho-Corasick trie that corresponds to
the end of some pattern. Each accepting vertex is labeled with the number of the pattern that
it accepts. Whenever the traversal reaches an accepting vertex labeled with pattern j, we emit
“match j” and continue matching the text as usual.
To see that this extension is correct, we first observe that, since a shorter pattern P can occur
only as a prefix of a longer pattern, P cannot label any subpath in the trie except for a single
path starting at the root. Our extension marks the end of this path. By the correctness of Aho-
Corasick’s transitions, if the last |P | characters seen match P , then the traversal must finish at
the end of this path in the trie, and so we will accept.
The cost of the additional marking of accepting states is surely O(|P |) per pattern P in the trie.
(b) This case is trickier because a pattern may occur as more than one subpath in the trie. To address
the issue, we introduce a new annotation of the trie, called output links.
For each component pattern Pj , we mark an accepting vertex for Pj starting from the root as
above. Then, for each vertex v of the trie, call it “quasi-accepting” iff there is a path of one or
more failure links from v to some accepting vertex u. We annotate each quasi-accepting vertex
v with an output link o(v) = u, where u is the first accepting vertex 6= v reached by following
failure links from v.
When the search algorithm reaches a vertex v, it proceeds as follows:
i. If v is labeled “accepting j,” emit “match j.”
ii. If v has an output link o(v) = u, then move to u and goto (i).
iii. Finally, return to the original node v to continue the traversal.
Claim 1: the above extension to Aho-Corasick will find every occurrence of every pattern in the
text.
Pf : Suppose that some pattern Pj occurs in the text ending at position i, and that Pj is the
longest such pattern for position i. When the Aho-Corasick algorithm reaches position i, the trie
traversal is at the end of some path whose label ends with Pj (though not necessarily the path
with this label that starts at the root!). Let v be the vertex at the end of this path.
If we traverse the path of failure links out of v, we will find every vertex in the trie whose label is
a suffix of v’s label, since any one of these suffixes could be a legitimate shift if we mismatch after
reaching node v. Moreover, we will find these vertices in descending order of their labels’ length.
Hence, the first accepting vertex on this path must be the vertex u for Pj , and so o(v) = u.
Conclude that the algorithm will emit “match j” when it reaches vertex v.
Now suppose that some other pattern Pk also occurs ending at position i. Since |Pk| < |Pj |, Pk
is a proper suffix of Pj . Again, let u be the vertex whose path is labeled Pj in the trie, and let
w be the vertex whose path is labeled Pk; by the same argument as above, there must be a path
2
of failure links from u to w. Hence, by following a path of one or more output links from u, we
eventually reach w and emit “match k.”
Finally, we note that, since Aho-Corasick’s traversal is correct, we never reach a vertex with an
output link unless we have seen a text substring corresponding to that vertex’s path; hence, there
are no false positives. QED
Claim 2: all output links for an Aho-Corasick automaton can be constructed in time O(
∑ |P |)
given the trie, its accepting states, and its failure links.
Pf : Set all the output links to be initially null, and then process the vertices of the trie in
breadth-first order. If f(v) = u, and u is accepting, set o(v) = u; otherwise, if o(u) is non-null,
set o(v) = o(u).
The proof of correctness is similar to that for setting the generalized failure links in KMP. QED
Claim 3: the cost of the search algorithm is proportional to the length of the text plus the
number of pattern matches.
Pf : the total cost of traversing the trie during a search, by moving forward and/or following
failure links, is unchanged, hence Θ(|T |). Every output link leads to an accepting state and
results in emitting a match; hence, the total cost of following output links is at most proportional
to the number of matches. QED
3