TY - GEN
T1 - Expected Density of Random Minimizers
AU - Golan, Shay
AU - Shur, Arseny M.
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2025.
PY - 2025
Y1 - 2025
N2 - Minimizer schemes, or just minimizers, are a very important computational primitive in sampling and sketching biological strings. Assuming a fixed alphabet of size σ, a minimizer is defined by two integers k, w ≥ 2 and a total order ρ on strings of length k (also called k-mers). A string is processed by a sliding window algorithm that chooses, in each window of length w+k-1, its minimal k-mer with respect to ρ. A key characteristic of the minimizer is the expected density of chosen k-mers among all k-mers in a random infinite σ-ary string. Random minimizers, in which the order ρ is chosen uniformly at random, are often used in applications. However, little is known about their expected density DR_σ(k,w) besides the fact that it is close to 2/(w+1) unless w ≫ k. We first show that DR_σ(k,w) can be computed in O(kσ^(k+w)) time. Then we attend to the case w ≤ k and present a formula that allows one to compute DR_σ(k,w) in just O(w log w) time. Further, we describe the behaviour of DR_σ(k,w) in this case, establishing the connection between DR_σ(k,w), DR_σ(k+1,w), and DR_σ(k,w+1). In particular, we show that DR_σ(k,w) < 2/(w+1) (by a tiny margin) unless w is small. We conclude with some partial results and conjectures for the case w > k.
AB - Minimizer schemes, or just minimizers, are a very important computational primitive in sampling and sketching biological strings. Assuming a fixed alphabet of size σ, a minimizer is defined by two integers k, w ≥ 2 and a total order ρ on strings of length k (also called k-mers). A string is processed by a sliding window algorithm that chooses, in each window of length w+k-1, its minimal k-mer with respect to ρ. A key characteristic of the minimizer is the expected density of chosen k-mers among all k-mers in a random infinite σ-ary string. Random minimizers, in which the order ρ is chosen uniformly at random, are often used in applications. However, little is known about their expected density DR_σ(k,w) besides the fact that it is close to 2/(w+1) unless w ≫ k. We first show that DR_σ(k,w) can be computed in O(kσ^(k+w)) time. Then we attend to the case w ≤ k and present a formula that allows one to compute DR_σ(k,w) in just O(w log w) time. Further, we describe the behaviour of DR_σ(k,w) in this case, establishing the connection between DR_σ(k,w), DR_σ(k+1,w), and DR_σ(k,w+1). In particular, we show that DR_σ(k,w) < 2/(w+1) (by a tiny margin) unless w is small. We conclude with some partial results and conjectures for the case w > k.
KW - Expected Density
KW - Minimizer
KW - Random Minimizer
UR - https://www.scopus.com/pages/publications/85219213225
U2 - 10.1007/978-3-031-82670-2_25
DO - 10.1007/978-3-031-82670-2_25
M3 - Conference contribution
AN - SCOPUS:85219213225
SN - 9783031826696
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 347
EP - 360
BT - SOFSEM 2025
A2 - Královič, Rastislav
A2 - Kůrková, Věra
PB - Springer Science and Business Media Deutschland GmbH
T2 - 50th International Conference on Current Trends in Theory and Practice of Computer Science, SOFSEM 2025
Y2 - 20 January 2025 through 23 January 2025
ER -