HELP QUICK_SUBSTRINGS David Young, Dec 1994 LIB * QUICK_SUBSTRINGS provides procedures for finding substrings of strings. Under certain conditions the procedures are faster than the corresponding system procedures. CONTENTS - (Use g to access required sections) 1 The procedures 2 When to use these procedures 3 The algorithm 3.1 The search strategy 3.2 The lookup table and the search algorithm 3.3 Building the lookup table 3.4 Implementation ----------------------------------------------------------------------- 1 The procedures ----------------------------------------------------------------------- It is sometimes important to search for substrings of very long strings efficiently (for instance in genetic algorithms). The following procedures facilitate this. Later sections describe when it is appropriate to use them, and how they work. The prefix ______quick_ is intended to indicate that the procedures are sometimes quicker than the standard string manipulation procedures (___REF * _______STRINGS). They are also sometimes slower. Unlike procedures with the prefix _____fast_, these procedures are not dangerous in any way. quick_issubstring(__________sub_string, _n, ______string) -> _m quick_issubstring(__________sub_string, ______string) -> _m The arguments and result are the same as for issubstring (___REF * _______STRINGS). That is, this searches the string ______string, starting from its _n'th character, for a substring equal to __________sub_string, and, if found, returns the subscript _m of ______string at which the matching substring begins; otherwise it returns . If _n is not given, it defaults to 1. Also works on words. Instead of being an actual string, __________sub_string may be a lookup table returned by quick_substring_table (see below). quick_issubstring_lim(__________sub_string, _n, ________startlim, ______endlim, ______string) -> _m The arguments and result are the same as for issubstring_lim (___REF * _______STRINGS). That is, it is the same as quick_issubstring, but the match is constrained to start on or before the subscript ________startlim, and to end on or before the subscript ______endlim. The ________startlim or ______endlim arguments may be disabled by supplying for either argument. Also works on words, and __________sub_string may be a lookup table returned by quick_substring_table (see below). quick_substrings(__________sub_string, _n, ______string) -> _____mlist quick_substrings(__________sub_string, ______string) -> _____mlist This is like quick_issubstring except that every occurrence of __________sub_string is found, rather than just the first. It returns a list _____mlist of the subscripts of ______string at which each matching substring begins, in reverse order. If there are no matches, _____mlist is the empty list. Matching substrings in ______string may overlap with one another - e.g. quick_substrings('aba', 'xyababaxy') returns the list [5 3]. This procedure is faster than making repeated calls to quick_issubstring, particularly if there are many occurrences of the __________sub_string in ______string. quick_substrings_lim(__________sub_string, _n, ________startlim, ______endlim, ______string) -> _____mlist This is like quick_substrings except that the search is limited by ______endlim and ________startlim, as for quick_issubstring_lim. quick_substring_table(__________sub_string) -> _____table This carries out the preprocessing of __________sub_string needed to speed up the search. Preprocessing is normally an overhead on each call of one of the search procedures. If the same substring is to be found in several different strings, the preprocessing can be carried out once by a call to quick_substring_table. The result _____table should then be used as the __________sub_string argument to any of the search procedures, instead of the original substring. ----------------------------------------------------------------------- 2 When to use these procedures ----------------------------------------------------------------------- If an application makes heavy use of string matching, it will be worth carrying out some timing tests with the kinds of strings involved. In general, it is worth considering the use of these routines if all of these apply: o the string to search in is a lot longer than the substring being searched for, or the same substring is searched for in a lot of different strings whose total length is large; o the string to search in contains a reasonable number of different characters; o the substring being searched for is longer than a few characters; o a little garbage creation is not a problem. To give some idea of the relative timings, the following results were obtained when the characters of ______string and __________sub_string were all randomly chosen from an alphabet of size _K, ______string was fairly long (a million characters), and the procedures were run on a SPARCstation under POPLOG version 14.5. Note that issubstring is coded in assembly language, whilst the quick_ procedures are written in ordinary Pop-11. Under these conditions, issubstring took about as long as quick_issubstring when _K = 10 and the __________sub_string had 25 characters. When _K = 256 and the length of __________sub_string was 256, issubstring took about 9 times as long as quick_issubstring. If the preprocessing was done in advance, and not counted in the search time, the speed-up ratio improved further to about 15. As an extreme (and rather artificial) example, when _K was only 4 for __________sub_string, but was 256 for ______string, and the length of __________sub_string was 1024, then the speed-up factor fell back to about 5.5 if preprocessing was counted; however, if preprocessing was omitted from the time taken, then quick_issubstring was about 90 times faster than issubstring. Because the procedures build lookup tables, some garbage is created on each call, unless a table created by quick_substring_table is supplied. ----------------------------------------------------------------------- 3 The algorithm ----------------------------------------------------------------------- The algorithm is a variant of the Boyer-Moore algorithm (R.S. Boyer & J.S. Moore, "A fast string searching algorithm", Communications of the ACM, Vol. 20 no. 10, pp 762-772, October 1977). Boyer & Moore mention this algorithm briefly towards the end of their paper, but did not implement it. Below, I use ___pat instead of __________sub_string, to conform to Boyer & Moore's terminology. 3.1 The search strategy ------------------------ As in all serial string-matching algorithms, the pattern ___pat is in effect slid from left to right along ______string looking for a match. At each position, the procedure starts checking for a possible match at the right hand end of ___pat, then when the check is finished, shifts as far as possible along ______string before checking again. Suppose a call to quick_issubstring has got to this point: v <- checking here a b c d e f g h i <- ______string b b c <- ___pat The procedure makes use of the fact that `e` is not in ___pat, and shifts ___pat along 3 places before the next test. It will then test the `h`, shift 3 more places, and so on. Unless a `b` or a `c` is encountered, each shift can be 3 places. If ___pat was 'ebd', and the situation looked like: v a b c d e f g h i e b d then it can only shift 2 places to align the `e`s before checking again. In general, if there is a partial match then it can shift along only until any other occurrences of the pattern already seen are brought into line. As a more general example, in this position: v <- checking here a b c d e f g h i g h k <- ______string f g h i g h <- ___pat the leftmost 'gh' in ______string has matched the rightmost 'gh' in ___pat, but the `f` then fails to match the `i`. The procedure can then shift 3 places to bring the 'fgh' earlier in ___pat into line with the 'fgh' found in ______string - in this case the match will then succeed. On the other hand, with v a b c d e f g h i g h k h g h i g h it can shift 5 places rather than 3, because 'hgh' will not match 'fgh'. The difference from standard Boyer-Moore is illustrated by this case: v a b c d e g g h i g h k h g h i g h Here, standard Boyer-Moore will only shift 3 places. Because `g` occurs to the right of the current position, the only information used about the current character in ______string is that it is not `i`. Therefore the 'gh's are brought into line. On the other hand, quick_issubstring uses the fact that 'ggh' does not occur in ___pat, so shifts 5 places. 3.2 The lookup table and the search algorithm ---------------------------------------------- The search is implemented by building a 2-D lookup table - call this _____Delta. _____Delta is indexed by _p, the number of characters at the right of ___pat already matched, and _c, the character of ______string currently being examined. _____Delta returns 0 if _c is a hit (i.e. the corresponding character in ___pat is also _c), and otherwise returns the distance to shift ___pat for the next test. (This is not quite how it is implemented, but serves for explanation.) The search algorithm could therefore be written as length(pat) -> i; until i > length(string) do for p from 0 to length(pat) - 1 do Delta(p, string(i - p)) -> shift; quitunless (shift == 0) ;;; quit if unmatched character endfor; quitif (shift == 0); ;;; quit if matched all characters i + shift -> i enduntil; if shift == 0 then return (i) else return (false) endif 3.3 Building the lookup table ------------------------------ The lookup table is built from ___pat in the preprocessing phase. This is the most intricate part of the algorithm. To work properly, _____Delta(_p, _c) must return the distance from the right of ___pat to the rightmost occurrence of '_cxxxx' in ___pat, where 'xxxx' means the _p rightmost characters of ___pat. If '_cxxxx' does not occur in ___pat, then _____Delta(_p, _c) must return the length of ___pat, minus the number of characters on the right of 'xxxx' which match the left hand end of ___pat. (This last part is equivalent to extending ___pat on the left with "wild card" characters which match anything, and simply applying the first rule.) The table is built by scanning ___pat from the left. From each scan position, the procedure backs up towards the left, matching characters with the right end of ___pat. When a non-matching character is found, a table entry is made. For example, suppose the string is '...axy...bxy' and the scan has reached the left-hand `y`. Two characters then match the end of the string, and after working backwards past these we have: scan position -> v ___pat -> . . . a x y . . . b x y non-matching character -> ^ At this point we make the table entry ______patlen - _s -> _____Delta(2, `a`) where ______patlen is the length of ___pat and _s is the index of the current scan position. Thus ______patlen - _s is the distance of 'axy' from the right of ___pat. Most characters will be different from the final character, of course, so will just generate an entry for _____Delta(0, _c) before the scan moves on, without any backtracking. If all the characters back to the left end of ___pat match the right end of ___pat, then a default entry is generated for all values of _p greater than or equal to the number of characters matched. For example, if the string is 'xy...xy' then the procedure will get to scan position -> v ___pat -> x y . . . x y start of string -> ^ Now it must record that if 2 __or ____more characters of ______string have matched ___pat, then the maximum shift is ______patlen - 2. In general, if _s characters on the left of ___pat match _s characters on the right, table entries are made as follows: for _p from _s to ______patlen - 1 do ______patlen - _s -> _____Delta(_p, *) endfor where _____Delta(_p, *) means the default value, to be returned for any character for which no specific entry has been made. The scan actually starts with _s = 0, so all the table entries initially default to ______patlen. As the scan of ___pat moves to the right, smaller defaults may overwrite larger ones in the table. The default for a given _p will be established before any specific entries are made for that _p. Smaller specific entries may later overwrite larger specific entries. This procedure stops with the second character from the right. The final step is to do 0 -> Delta(_p, ___pat(______patlen - _p)) for each _p, to set up the condition for a hit. 3.4 Implementation ------------------- The lookup table _____Delta is implemented as a list of properties. If the list is called _____Table, then _____Table(_p + 1) is a property such that _____Table(_p + 1)(_c) gives a shift corresponding to _____Delta(_p, _c). In Pop-11, this gives a good compromise between speed and memory usage - since most entries are the default, it would use excessive memory to store the table as an array or vector. The algorithm is modified in the following ways, for efficiency: o The lookup table returns instead of 0 to indicate a hit when there are still characters in ___pat to match. This simplifies a test in the search procedure. o The integer returned by _____Table(_p+1)(_c) is actually _____Delta(_p,_c) + _p. This simplifies moving the pointer in the search procedure - it jumps from the character being examined to the next position for the right hand end of ___pat. o When a complete match has been found, the final procedure in the table returns a large shift which would take the pattern off the right-hand end of ______string. This means that it is unnecessary to count the number of iterations in the inner loop in the algorithm, and also unnecessary to have separate tests in the main loop for success and failure. o So that repeated instances of ___pat can be found efficiently, the shift returned after a match carries the information needed to restart the search at the next possible location for ___pat. --- _____________________________________$poplocal/local/help/quick_substrings --- _________Copyright __________University __of ______Sussex _____1994. ___All ______rights _________reserved.