The design of the Colussi algorithm follows a tight analysis of the Knuth, Morris and Pratt algorithm.
in the first phase the comparisons are performed from left to right with text characters aligned with pattern position for which the value of the kmpNext function is strictly greater than -1. These positions are called noholes; |
the second phase consists in comparing the remaining positions (called holes) from right to left. |
when a mismatch occurs during the first phase, after the appropriate shift it is not necessary to compare the text characters aligned with noholes compared during the previous attempt; |
when a mismatch occurs during the second phase it means that a suffix of the pattern matches a factor of the text, after the corresponding shift a prefix of the pattern will still match a factor of the text, then it is not necessary to compare this factor again. |
For 0 i m-1: kmin[i]= d>0 iff x[0 .. i-1-d]=x[d .. i-1] and x[i-d] x[i], 0 otherwise.
When kmin 0 a periodicity ends at position i in x.
For 0 < i < m if kmin[i-1] 0 then i is a nohole otherwise i is a hole.
Let nd+1 be the number of noholes in x.
for 0 i nd, h[i] is a nohole and h[i] < h[i+1] for 0 i<nd; |
for nd < i < m, h[i] is a hole and h[i] > h[i+1] for nd < i < m-1. |
If i is a hole then rmin[i] is the smallest period of x greater than i.
The value of first[u] is the smallest integer v such that u h[v].
Then assume that x is aligned with y[j .. j+m-1]. If x[h[k]]=y[j+h[k]] for 0 k < r < nd and x[h[r]] y[j+h[r]]. Let j’ = j+kmin[h[r]].
Then there is no occurrence of x beginning in y[j .. j’] and x can be shifted by kmin[h[r]] positions to the right.
Moreover x[h[k]]=y[j’+h[k]] for 0 k < first[h[r]-kmin[h[r]]] meaning that the comparisons can be resume with x[h[first[h[r]-kmin[h[r]]]]] and y[j’+h[first[h[r]-kmin[h[r]]]]].
Figure 9.1: Mismatch with a nohole. Noholes are black circles and are compared from left to right.
In this situation, after the shift, it is not necessary to compare the first two noholes again.
If x[h[k]]=y[j+h[k]] for 0 k < r and x[h[r]] y[j+h[r]] with nd r < m. Let j’=j+rmin[h[r]]. Then there is no occurrence of x beginning in y[j .. j’] and x can be shifted by kmin[h[r]] positions to the right.
Moreover x[0 .. m-1-rmin[h[r]]]=y[j’ .. j+m-1] meaning that the comparisons can be resume with x[h[first[m-1-rmin[h[r]]]]] and y[j’+h[first[m-1-rmin[h[r]]]]].
To compute the values of kmin, a table hmax is used and defined as follows: hmax[k] is such that x[k .. hmax[k]-1]=x[0 .. hmax[k]-k-1] and x[hmax[k]] x[hmax[k]-k].
The value of nhd0[i] is the number of noholes strictly smaller than i.
shift[i]=kmin[h[i]] and next[i]=nhd0[h[i]-kmin[h[i]]] for i < nd; |
shift[i]=rmin[h[i]] and next[i]=nhd0[m-rmin[h[i]]] for nd i < m; |
shift[m]=rmin[0] and next[m]=nhd0[m-rmin[h[m-1]]]. |
Thus, during an attempt where the window is positioned on the text factor y[j .. j+m-1], when a mismatch occurs between x[h[r]] and y[j+h[r]] the window must be shifted by shift[r] and the comparisons can be resume with pattern position h[next[r]].
The preprocessing phase can be done in O(m) space and time. The searching phase can then be done in O(n) time complexity and furthermore at most n text character comparisons are performed during the searching phase.
int preColussi(char *x, int m, int h[], int next[], int shift[]) { int i, k, nd, q, r, s; int hmax[XSIZE], kmin[XSIZE], nhd0[XSIZE], rmin[XSIZE]; /* Computation of hmax */ i = k = 1; do { while (x[i] == x[i - k]) i++; hmax[k] = i; q = k + 1; while (hmax[q - k] + k < i) { hmax[q] = hmax[q - k] + k; q++; } k = q; if (k == i + 1) i = k; } while (k <= m); /* Computation of kmin */ memset(kmin, 0, m*sizeof(int)); for (i = m; i >= 1; --i) if (hmax[i] < m) kmin[hmax[i]] = i; /* Computation of rmin */ for (i = m - 1; i >= 0; --i) { if (hmax[i + 1] == m) r = i + 1; if (kmin[i] == 0) rmin[i] = r; else rmin[i] = 0; } /* Computation of h */ s = -1; r = m; for (i = 0; i < m; ++i) if (kmin[i] == 0) h[--r] = i; else h[++s] = i; nd = s; /* Computation of shift */ for (i = 0; i <= nd; ++i) shift[i] = kmin[h[i]]; for (i = nd + 1; i < m; ++i) shift[i] = rmin[h[i]]; shift[m] = rmin[0]; /* Computation of nhd0 */ s = 0; for (i = 0; i < m; ++i) { nhd0[i] = s; if (kmin[i] > 0) ++s; } /* Computation of next */ for (i = 0; i <= nd; ++i) next[i] = nhd0[h[i] - kmin[h[i]]]; for (i = nd + 1; i < m; ++i) next[i] = nhd0[m - rmin[h[i]]]; next[m] = nhd0[m - rmin[h[m - 1]]]; return(nd); } void COLUSSI(char *x, int m, char *y, int n) { int i, j, last, nd, h[XSIZE], next[XSIZE], shift[XSIZE]; /* Processing */ nd = preColussi(x, m, h, next, shift); /* Searching */ i = j = 0; last = -1; while (j <= n - m) { while (i < m && last < j + h[i] && x[h[i]] == y[j + h[i]]) i++; if (i >= m || last >= j + h[i]) { OUTPUT(j); i = m; } if (i > nd) last = j + m - 1; j += shift[i]; i = next[i]; } }
Preprocessing phase
Searching phase