TY - GEN
T1 - Approximating text-to-pattern Hamming distances
AU - Chan, Timothy M.
AU - Golan, Shay
AU - Kociumaka, Tomasz
AU - Kopelowitz, Tsvi
AU - Porat, Ely
N1 - Publisher Copyright: © 2020 ACM.
PY - 2020/6/8
Y1 - 2020/6/8
N2 - We revisit a fundamental problem in string matching: given a pattern of length m and a text of length n, both over an alphabet of size σ, compute the Hamming distance (i.e., the number of mismatches) between the pattern and the text at every location. Several randomized (1+ϵ)-approximation algorithms have been proposed in the literature (e.g., by Karloff (Inf. Proc. Lett., 1993), Indyk (FOCS 1998), and Kopelowitz and Porat (SOSA 2018)), with running time of the form O(ϵ^{-O(1)} n log n log m), all using fast Fourier transform (FFT). We describe a simple randomized (1+ϵ)-approximation algorithm that is faster and does not need FFT. Combining our approach with additional ideas leads to numerous new results (all Monte-Carlo randomized) in different settings: (1) We design the first truly linear-time approximation algorithm for constant ϵ; the running time is O(ϵ^{-2} n). In fact, the time bound can be made slightly sublinear in n if the alphabet size σ is small (by using bit packing tricks). (2) We apply our approximation algorithms to design a faster exact algorithm computing all Hamming distances up to a threshold k; its runtime of O(n + min(nk√(log m)/√m, nk^2/m)) improves upon previous results by logarithmic factors and is linear for k ≤ √m. (3) We alternatively design approximation algorithms with better ϵ-dependence, by using fast rectangular matrix multiplication. In fact, the time bound is O(n polylog n) when the pattern is sufficiently long, i.e., m ≥ ϵ^{-c} for a specific constant c. Previous algorithms with the best ϵ-dependence require O(ϵ^{-1} n polylog n) time. (4) When k is not too small, we design a truly sublinear-time algorithm to find all locations with Hamming distance approximately (up to a constant factor) less than k, in O((n/k^{Ω(1)} + occ) n^{o(1)}) time, where occ is the output size. The algorithm leads to a property tester for pattern matching that costs O((δ^{-1/3} n^{2/3} + δ^{-1} n/m) polylog n) time and, with high probability, returns true if an exact match exists and false if the Hamming distance is more than δm at every location. (5) We design a streaming algorithm that approximately computes the Hamming distance for all locations with the distance approximately less than k, using O(ϵ^{-2} √k polylog n) space. Previously, streaming algorithms were known for the exact problem with O(k polylog n) space (which is tight up to the polylog n factor) or for the approximate problem with O(ϵ^{-O(1)} √m polylog n) space. For the special case of k = m, we improve the space usage to O(ϵ^{-1.5} √m polylog n).
KW - Hamming distance
KW - Pattern matching
KW - Property testing
KW - Sampling
KW - Streaming
KW - Sublinear
UR - https://www.scopus.com/pages/publications/85086761072
U2 - 10.1145/3357713.3384266
DO - 10.1145/3357713.3384266
M3 - Conference contribution
AN - SCOPUS:85086761072
T3 - Proceedings of the Annual ACM Symposium on Theory of Computing
SP - 643
EP - 656
BT - STOC 2020 - Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing
A2 - Makarychev, Konstantin
A2 - Makarychev, Yury
A2 - Tulsiani, Madhur
A2 - Kamath, Gautam
A2 - Chuzhoy, Julia
T2 - 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020
Y2 - 22 June 2020 through 26 June 2020
ER -