close

Вход

Забыли?

вход по аккаунту

код для вставкиСкачать
Faster algorithms for string
matching problems: matching
the convolution bound
Piotr Indyk
報告人: 蕭志宣 田文錦 王弘倫
Outline




Introduction
Randomized Boolean convolution
Convolution over GF(2) in O(n)-time
Application
Pattern matching

Input:
two string t, p (text and patten)

Output:
A binary sequence o
o[i]=1 if p match t[i]
o[i]=0 otherwise
Approach

Brute-force O(mn) time algorithm compares p
with each of the string start at t(i), for i=1…n

A well-known algorithm KMP achieve O(m+n)
Fingerprint Approach

A fingerprint function Fp(Z)=Z mod p

Use F and compare F(p) and each of
fingerprints F(t(j))

The Monte Carlo algorithm for pattern
matching requires O(n+m) time and has a
probability of error O(1/n)
Use boolean convolution for string
matching?

Solve application problem
Boolean convolution(,)
1 1 0 0 0 0 0 0 0 1 1 0
1 0 0 0 1 1 1 1
1 1 1 1 0 0 0 1
1 0 0 0 1 1 1 1
1 1 1 1 1 0 0 1 1 1 1 1 1 1 0 0 1 1 0
Polynomial convolution(+,)(over GF(2))
1 1 0 0 0 0 0 0 0 1 1 0
1 0 0 0 1 1 1 1
1 1 1 1 0 0 0 1
1 0 0 0 1 1 1 1
1 0 0 0 1 0 0 1 1 1 0 0 0 1 0 0 1 1 0
String matching VS. Boolean Convolution
T
a
b
a
c
c
a
c
b
a
b
a
b
a
c
b
Ta
1
0
1
0
0
1
0
0
1
0
1
0
1
0
0
P
c
a
a
a
a
O
Pa
0
1
1
0
1
1
0
1
0
1
0
0
1
0
1
1
0
0
0
0
Pa not
Boolean convolution u*v
v
1
1
0
u
1
1
0
0
0
1
Can be done in
O(nlogn)
o
1
1
0
0
Convolution on GF(2) u*v
v
1
1
0
u
1
1
0
0
0
1
Can be done in
O(n)
o’
0
1
0
0
Different from boolean
convolution
Number of 1’s is….

Odd
output of convolution on GF(2) will be the
same with boolean convolution

Even
the position of 1 might be wrong…
Randomize?

u
n bit Hamming
space
Random choose r from Hn uniformly
and conjunct r and u ru
1
1
0
r
0
1
0
ru
0
1
0
Output
correct 1
v
1
1
0
ru
0
1
0
0
0
Correct
boolean
convolution
output
1
o
o’
1
0
0
0
Output wrong
0
1
1
0
0
Lemma 1

If(u*v)[i]=0 then o’[i]=0

If(u*v)[i]=1 then Pr[o’[i]=1]=1/2
Expected error probability of position i
after executer d times

o[i]=1 if o’[i]=1 in outcome of one execution

o[i]=0 with error probability of 1/2d
Error probability of convolution

Worst case
All position output wrong 0
1/2d per position
Probability of outputing a wrong convolution is
O(n/2d)
Outline




Introduction
Randomized Boolean convolution
Convolution over GF(2) in O(n)-time
Application
Naturally…


Convolution is like the multiplication of two
polynomial.
Therefore, the time complexity of FFT (Fast
Fourier Transform) O(nlogn) is an upper
bound of convolution.
GF(2) vs.



t
GF(2 )
An element of GF(2t) can be defined as a
polynomial of degree less than t over GF(2).
The operation of two elements over GF(2t)
corresponds to the operation of two
polynomials over GF(2).
e.g.- a, b  GF(2t), ab over GF(2t)
corresponds to a(x)b(x) mod u(x) over GF(2),
(u(x) is an irreducible polynomial).
O(n)-time algorithm for polynomial
multiplication over GF(2)


Step1: Reduce multiplication of p and q to a
multiplication of two polynomials p’ and q’
of degree n/t over GF(2t), such that n/t= 2t,
for t = O(logn).
Step2: Multiply p’ and q’ over GF(2t).

n=8, t=2
Elements
in GF(2)
1
0
1 0
0
1
0 1
0
0 1
1
1
1
1 1
Elements
in GF(22)
Step2




n
n
Using O( t log t ) operations.
Thus we only need to show each operation
can be done in O(1) time.
We can view a coefficients over GF(2t) as a
polynomial over GF(2).
Therefore, we need to consider addition,
multiplication, and modulus of polynomials of
degree t over GF(2).
Addition over

t
GF(2 )
Constant time since each element is of size
O(logn), thus the addition can be performed
in constant time (RAM model).
The more we need to know


Compute the product d(x) of polynomials a(x)
and b(x).
Compute d(x) mod u(x), where u(x) is an
irreducible polynomial of degree t (which can
be found in negligible time during
preprocessing)
Main idea



Shift the polynomial u by t/c (instead of 1)
position.
There are only c necessary steps.
For each step, we use a lookup table and
thus each operation can in constant time.
Multiplication



By FFT, each multiplication can be done in
t
t
O(c log c ). t
There are (2 c )2 possible products.
Thus, we need
2
t
t
t
n c
t
t
c 2
(2 ) O( c log c ) = (t ) O( c log c )
= O(n)
to build the lookup table.
Illustration
t/c
t/c
…
t/c
t/c
t/c
…
t/c
Division




Naturally, we have an O(t) algorithm (for d(x)
& u(x) of degree 2t & t, respectively).
For i = 2t-1…t
Step1: check if the ith coefficient of di is 1.
Step2: if so , assign di-1 = di – u; otherwise set
di-1 = di.
1100
11 10101
11
011
011
00001
Illustration

Each si has length
t/c
d(x) = u(x)s(x)+ k(x)
= u(x)(s1(x)+s2(x)+…+sc(x))+k(x)
Constant
time
d(x) – u(x)s1(x) = u(x)(s2(x)+…+sc(x))+k(x)
…
we can compute k(x) after c steps.
Lookup table




Each component needs t/c time.
There are O(2t/c) elements (since u is unique,
and d has length t/c).
t/c  2t/c ≤ t/c  2t = t/c  n/t = O(n).
Thus, we need O(n) time to build the table.
So far…


We have a O(n) algorithm to multiply two
polynomials of degree n over GF(2).
In other semiring, a convolution still needs
O(nlogn) time.
Outline




Introduction
Randomized Boolean convolution
Convolution over GF(2) in O(n)-time
Application
String matching with don’t cares
T
AA D C C G E D AA C D E A C A B AAA * C C
P
AA D * C
* = don’t cares
String matching VS. Boolean Convolution
T
a
b
a
c
c
a
c
b
a
b
a
b
a
c
b
Ta
1
0
1
0
0
1
0
0
1
0
1
0
1
0
0
P
c
a
a
a
a
O
Pa
0
1
1
0
1
1
0
1
0
1
0
0
1
0
1
1
0
0
0
0
Pa not
let d  c log n for a proper constant
Also, for a given sequence
function
f ,g :  H
 ra if a  
f (a )   d
 0 if a  *
and
 ra if a  
g (a )   d
 0 if a  *
d
as
c  0 specified
r0 ... rk 1 of element
of H
later
d
define
Algorithm
 choose ri ( i  0 ... k  1) independen
d
at random from H
 compute
tly and uniformly
ors p  f ( p [ 0 ])[ i ]... f ( p [ m  1])[ i ]
i
binary vet
and t  g ( t [ 0 ])[ i ]... g ( t [ n  1])[ i ] , for i  0 ... d  1
i
 compute
the randomized
boolean convolutio
for i  0 ... d  1
 compute
a   i 1 a
d
i
 output a as the occurrence
vector
n ai  t  p
i
i
Example
T
a
b
*
a
*
b
b
c
tP1
0
a
0*
0
b
0
0
0
0
1
1
1
0
1
0
1
1
0
0
1
0
0
0
1
1
1
t2
t3
a = 101
b = 100
c = 010
* = 000
a1
p1
1
0
1
p2
0
0
0
p3
1
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
1
0
0
0
1
a
0
1
0
0
0
1
anot
1
0
1
1
1
0
a2
a3
Analysis(1/2)

Lemma 2
For any α>0 there is a constant c>0 such that the occurrence
vector generated by the algorithm is correct with probability 11/nα
Proof:
Case 1. if p occurs in t at position i, then aj[i] = 0 with probability 1 for
any j = 0…d-1
a
*
a
*
*
a
a
*
Analysis(2/2)
Case 2. If p does not occur in t at position
j such that
a = 101
b = 100
c = 010
* = 000
i , then ther e exists
p [ j ]  a   ,t [ i  j ]  b   and a  b.
Pr [ ra [ k ]  rb [ k ]  1] 
we know that the
1
for any k  0 ...d- 1 (assuming
4
i th bit if the boolean convolutio
is non - zero with the
same probabilit
probabilit
that Pr [ a [ i ]  1]  ( 1-δ ) 
d
1
n
1
0
1
bnot 0
1
1
a
k
n of t and p
k
y.
Thus (by Lemma 1) there is a constant
which implies
a  b ),
α
y δ  0 that a [ i ]  1,
Time Complexity
 choose ri ( i  0 ... k  1) independen
d
at random from H
 compute
tly and uniformly
ors p  f ( p [ 0 ])[ i ]... f ( p [ m  1])[ i ]
i
binary vet
O(n)
and t  g ( t [ 0 ])[ i ]... g ( t [ n  1])[ i ] , for i  0 ... d  1
i
 compute
the randomized
boolean convolutio
for i  0 ... d  1
 compute
a   i 1 a
d
i
 output a as the occurrence
O(log n)
vector
n ai  t  p
i
i
Subset Matching


Input: A set-string T and a set-string P .
Output: All occurrences of P in T.
T=
a
b
c
P=
a
c
a
c
b
c
c
b
a
c
c
e
f
b
f
b
Tree Pattern Matching and Subset Matching in
Randomized O(nlog3m) Time,
Proc.STOC’97,1997
R.Cole, R. Hariharan
Give a very elegant O(nlog2n)-time randomized algorithm for
this problem.
We can replace the exact computation of boolean convolution
by the probabilistic one. => time complexity O(nlogn)
1/--страниц
Пожаловаться на содержимое документа