Boyer Moore Algorithm | Good Suffix heuristic

heuristic for pattern searching. Just like bad character heuristic, a preprocessing table is generated for good suffix heuristic.

Good Suffix Heuristic

t

be substring of text

T

which is matched with substring of pattern

P

. Now we shift pattern until : 1) Another occurrence of t in P matched with t in T. 2) A prefix of P, which matches with suffix of t 3) P moves past t

Case 1: Another occurrence of t in P matched with t in T

Pattern P might contain few more occurrences of

t

. In such case, we will try to shift the pattern to align that occurrence with t in text T. For example-

Explanation:

In the above example, we have got a substring t of text T matched with pattern P (in green) before mismatch at index 2. Now we will search for occurrence of t (“AB”) in P. We have found an occurrence starting at position 1 (in yellow background) so we will right shift the pattern 2 times to align t in P with t in T. This is weak rule of original Boyer Moore and not much effective, we will discuss a

Strong Good Suffix rule

Case 2: A prefix of P, which matches with suffix of t in T

It is not always likely that we will find the occurrence of t in P. Sometimes there is no occurrence at all, in such cases sometimes we can search for some

suffix of t

matching with some

prefix of P

and try to align them by shifting P. For example –

Explanation:

In above example, we have got t (“BAB”) matched with P (in green) at index 2-4 before mismatch . But because there exists no occurrence of t in P we will search for some prefix of P which matches with some suffix of t. We have found prefix “AB” (in the yellow background) starting at index 0 which matches not with whole t but the suffix of t “AB” starting at index 3. So now we will shift pattern 3 times to align prefix with the suffix.

Case 3: P moves past t

If the above two cases are not satisfied, we will shift the pattern past the t. For example –

Explanation:

If above example, there exist no occurrence of t (“AB”) in P and also there is no prefix in P which matches with the suffix of t. So, in that case, we can never find any perfect match before index 4, so we will shift the P past the t ie. to index 5.

Strong Good suffix Heuristic

q = P[i to n]

got matched with

t

c = P[i-1]

is the mismatching character. Now unlike case 1 we will search for t in P which is not preceded by character

c

. The closest such occurrence is then aligned with t in T by shifting pattern P. For example –

Explanation:

In above example,

q = P[7 to 8]

got matched with t in T. The mismatching character

c

is “C” at position P[6]. Now if we start searching t in P we will get the first occurrence of t starting at position 4. But this occurrence is preceded by “C” which is equal to c, so we will skip this and carry on searching. At position 1 we got another occurrence of t (in the yellow background). This occurrence is preceded by “A” (in blue) which is not equivalent to c. So we will shift pattern P 6 times to align this occurrence with t in T.We are doing this because we already know that character

c = “C”

causes the mismatch. So any occurrence of t preceded by c will again cause mismatch when aligned with t, so that’s why it is better to skip this.

Preprocessing for Good suffix heuristic

As a part of preprocessing, an array

shift

is created. Each entry

shift[i]

contain the distance pattern will shift if mismatch occur at position

i-1

. That is, the suffix of pattern starting at position

i

is matched and a mismatch occur at position

i-1

. Preprocessing is done separately for strong good suffix and case 2 discussed above.

1) Preprocessing for Strong Good Suffix

Before discussing preprocessing, let us first discuss the idea of border. A

border

is a substring which is both proper suffix and proper prefix. For example, in string

“ccacc”

“c”

“cc”

is a border because it appears in both end of string but

“cca”

is not a border. As a part of preprocessing an array

bpos

(border position) is calculated. Each entry

bpos[i]

contains the starting index of border for suffix starting at index i in given pattern P. The suffix

?

beginning at position m has no border, so

bpos[m]

m+1

m

is the length of the pattern. The shift position is obtained by the borders which cannot be extended to the left. Following is the code for preprocessing –

void preprocess_strong_suffix(int *shift, int *bpos,
char *pat, int m)

int i = m, j = m+1;
bpos[i] = j;
while(i > 0)

while(j

if (shift[j] == 0)
shift[j] = j-i;
j = bpos[j];
>
i--; j--;
bpos[i] = j;
>
>

Explanation:

P = “ABBABAB”, m = 7

  • The widest border of suffix “AB” beginning at position i = 5 is ?(nothing) starting at position 7 so bpos[5] = 7.
  • At position i = 2 the suffix is “BABAB”. The widest border for this suffix is “BAB” starting at position 4, so j = bpos[2] = 4.
  • We can understand
  • bpos[i] = j
  • using following example –
  • If character
  • #
  • Which is at position
  • i-1
  • is equivalent to character
  • ?
  • at position
  • j-1
  • , we know that border will be
  • ? + border of suffix at position i
  • which is starting at position
  • j
  • which is equivalent to saying that
  • border of suffix at i-1 begin at j-1
  • or
  • bpos[ i-1 ] = j-1
  • or in the code –

i--;
j--;
bpos[ i ] = j
  1. Border width will be smaller than the border starting at position j ie. smaller than x…?
  2. Border has to begin with # and end with ? or could be empty (no border exist).
while(j

j = bpos[j];
>
i--; j--;
bpos[i]=j;
pat[i-1] != pat[j-1]
if (shift[j] == 0 )