1. Introduction ................................................ 1
1.1. Ubiquity of Patterns .................................. 1
1.2. Motivations from Biology .............................. 2
1.3. The Need for Rigor .................................... 2
1.4. Who is a Reader of this Book? ......................... 3
1.4.1. About this book ............................... 4
I. THE FUNDAMENTALS ........................................... 7
2. Basic Algorithmics .......................................... 9
2.1. Introduction .......................................... 9
2.2. Graphs ................................................ 9
2.3. Tree Problem 1: Minimum Spanning Tree ................ 14
2.3.1. Prim's algorithm ............................. 17
2.4. Tree Problem 2: Steiner Tree ......................... 21
2.5. Tree Problem 3: Minimum Mutation Labeling ............ 22
2.5.1. Fitch's algorithm ............................ 23
2.6. Storing & Retrieving Elements ........................ 27
2.7. Asymptotic Functions ................................. 30
2.8. Recurrence Equations ................................. 32
2.8.1. Counting binary trees ........................ 34
2.8.2. Enumerating unrooted trees
(Priifer sequence) ........................... 36
2.9. NP-Complete Class of Problems ........................ 40
2.10. Exercises ............................................ 41
3. Basic Statistics ........................................... 47
3.1. Introduction ......................................... 47
3.2. Basic Probability .................................... 48
3.2.1. Probability space foundations ................ 48
3.2.2. Multiple events (Bayes' theorem) ............. 50
3.2.3. Inclusion-exclusion principle ................ 51
3.2.4. Discrete probability space ................... 54
3.2.5. Algebra of random variables .................. 57
3.2.6. Expectations ................................. 58
3.2.7. Discrete probability distribution
(binomial, Poisson) .......................... 60
3.2.8. Continuous probability distribution
(normal) ..................................... 64
3.2.9. Continuous probability space (O is K) ........ 66
3.3. The Bare Trnth abont Inferential Statistics .......... 69
3.3.1. Probability distribution invariants .......... 70
3.3.2. Samples & summary statistics ................. 72
3.3.3. The central limit theorem .................... 77
3.3.4. Statistical significance (p-value) ........... 80
3.4. Summary .............................................. 82
3.5. Exercises ............................................ 82
4. What Are Patterns? ......................................... 89
4.1. Introduction ......................................... 89
4.2. Common Thread ........................................ 90
4.3. Pattern Duality ...................................... 90
4.3.1. Operators on p ............................... 92
4.4. Irredundant Patterns ................................. 92
4.4.1. Special case: maximality ..................... 93
4.4.2. Transitivity of redundancy ................... 95
4.4.3. Uniqueness property .......................... 95
4.4.4. Case studies ................................. 96
4.5. Constrained Patterns ................................. 99
4.6. When is a Pattern Specification Nontrivial? .......... 99
4.7. Classes of Patterns ................................. 100
4.8. Exercises ........................................... 103
II. PATTERNS ON LINEAR STRINGS ............................... 111
5. Modeling the Stream of Life ............................... 113
5.1. Introduction ........................................ 113
5.2. Modeling a Biopolymer ............................... 113
5.2.1. Repeats in DNA .............................. 114
5.2.2. Directionality of biopolymers ............... 115
5.2.3. Modeling a random permutation ............... 117
5.2.4. Modeling a random string .................... 119
5.3. Bernoulli Scheme .................................... 120
5.4. Markov Chain ........................................ 121
5.4.1. Stationary distribution ..................... 123
5.4.2. Computing probabilities ..................... 127
5.5. Hidden Markov Model (HMM) ........................... 128
5.5.1. The decoding problem (Viterbi algorithm) .... 130
5.6. Comparison of the Schemes ........................... 133
5.7. Conclusion .......................................... 133
5.8. Exercises ........................................... 134
6. String Pattern Specifications ............................. 139
6.1. Introduction ........................................ 139
6.2. Notation ............................................ 140
6.3. Solid Patterns ...................................... 142
6.3.1. Maximality .................................. 144
6.3.2. Counting the maximal patterns ............... 144
6.4. Rigid Patterns ...................................... 149
6.4.1. Maximal rigid patterns ...................... 150
6.4.2. Enumerating maximal rigid patterns .......... 152
6.4.3. Density-constrained patterns ................ 156
6.4.4. Quorum-constrained patterns ................. 157
6.4.5. Large-|Σ| input ............................. 158
6.4.6. Irredundant patterns ........................ 160
6.5. Extensible Patterns ................................. 164
6.5.1. Maximal extensible patterns ................. 165
6.6. Generalizations ..................................... 165
6.6.1. Homologous sets ............................. 165
6.6.2. Sequence on reals ........................... 167
6.7. Exercises ........................................... 170
7. Algorithms & Pattern Statistics ........................... 183
7.1. Introduction ........................................ 183
7.2. Discovery Algorithm ................................. 183
7.3. Pattern Statistics .................................. 191
7.4. Rigid Patterns ...................................... 191
7.5. Extensible Patterns ................................. 193
7.5.1. Nondegenerate extensible patterns ........... 194
7.5.2. Degenerate extensible patterns .............. 196
7.5.3. Correction factor for the dot character ..... 197
7.6. Measure of Surprise ................................. 198
7.6.1. z-score ..................................... 199
7.6.2. χ-square ratio .............................. 199
7.6.3. Interplay of combinatorics & statistics ..... 200
7.7. Applications ........................................ 201
7.8. Exercises ........................................... 203
8. Motif Learning ............................................ 213
8.1. Introduction: Local Multiple Alignment .............. 213
8.2. Probabilistic Model: Motif Profile .................. 214
8.3. The Learning Problem ................................ 215
8.4. Importance Measure .................................. 216
8.4.1. Statistical significance .................... 216
8.4.2 Information content .......................... 219
8.5. Algorithms to Learn a Motif Profile ................. 220
8.6. An Expectation Maximization Framework ............... 222
8.6.1. The initial estimate ρ0 ..................... 222
8.6.2. Estimating z given ρ ........................ 223
8.6.3. Estimating ρ given z ........................ 224
8.7. A Gibbs Sampling Strategy ........................... 227
8.7.1. Estimating ρ given an alignment ............. 227
8.7.2. Estimating background probabilities
given Z ..................................... 228
8.7.3. Estimating Z given ρ ........................ 228
8.8. Interpreting the Motif Profile in Terms of ρ ........ 229
8.9. Exercises ........................................... 230
9. The Subtle Motif .......................................... 235
9.1. Introduction: Consensus Motif ....................... 235
9.2. Combinatorial Model: Subtle Motif ................... 236
9.3. Distance between Motifs ............................. 238
9.4. Statistics of Subtle Motifs ......................... 240
9.5. Performance Score ................................... 245
9.6. Enumeration Schemes ................................. 246
9.6.1. Neighbor enumeration (exact) ................ 246
9.6.2. Submotif enumeration (inexact) .............. 249
9.7. A Combinatorial Algorithm ........................... 252
9.8. A Probabilistic Algorithm ........................... 255
9.9. A Modular Solution .................................. 257
9.10. Conclusion .......................................... 259
9.11. Exercises ........................................... 260
III. PATTERNS ON META-DATA .................................... 263
10. Permutation Patterns ...................................... 265
10.1. Introduction ........................................ 265
10.1.1. Notation .................................... 266
10.2. How Many Permutation Patterns? ...................... 267
10.3. Maximality .......................................... 268
10.3.1. P=1: Linear notation & PQ trees ............. 269
10.3.2. P>1: Linear notation? ....................... 271
10.4. Parikh Mapping-based Algorithm ...................... 273
10.4.1. Tagging technique ........................... 275
10.4.2. Time complexity analysis .................... 275
10.5. Intervals ........................................... 278
10.5.1. The naive algorithm ......................... 280
10.5.2. The Uno-Yagiura RC algorithm ................ 281
10.6. Intervals to PQ Trees ............................... 294
10.6.1. Irreducible intervals ....................... 295
10.6.2. Encoding intervals as a PQ tree ............. 297
10.7. Applications ........................................ 307
10.7.1. Case study I: Human and rat ................. 308
10.7.2. Case study II: E. Coli K-12 and
B. Subtilis ................................. 309
10.8. Conclusion .......................................... 311
10.9. Exercises ........................................... 312
11. Permutation Pattern Probabilities ......................... 323
11.1. Introduction ........................................ 323
11.2. Unstructured Permutations ........................... 323
11.2.1. Multinomial coefficients .................... 325
11.2.2. Patterns with multiplicities ................ 328
11.3. Structured Permutations ............................. 329
11.3.1. P-arrangement ............................... 330
11.3.2. An incremental method ....................... 331
11.3.3. An upper bound on P-arrangements** .......... 336
11.3.4. A lower bound on P-arrangements ............. 341
11.3.5. Estimating the number of frontiers .......... 342
11.3.6. Combinatorics to probabilities .............. 345
11.4. Exercises ........................................... 346
12. Topological Motifs ........................................ 355
12.1. Introduction ........................................ 355
12.1.1. Graph notation .............................. 355
12.2. What Are Topological Motifs? ........................ 356
12.2.1. Combinatorics in topologies ................. 357
12.2.2. Input with self-isomorphisms ................ 358
12.3. The Topological Motif ............................... 359
12.3.1. Maximally ................................... 367
12.4. Compact Topological Motifs .......................... 369
12.4.1. Occurrence-isomorphisms ..................... 369
12.4.2. Vertex indistinguishability ................. 372
12.4.3. Compact list ................................ 373
12.4.4. Compact vertex, edge & motif ................ 373
12.4.5. Maximal compact lists ....................... 374
12.4.6. Conjugates of compact lists ................. 374
12.4.7. Characteristics of compact lists ............ 378
12.4.8. Maximal operations on compact lists ......... 380
12.4.9. Maximal subsets of location lists ........... 381
12.4.10.Binary relations on compact lists ........... 384
12.4.11.Compact motifs from compact lists ........... 384
12.5. The Discovery Method ................................ 392
12.5.1. The algorithm ............................... 393
12.6. Related Classical Problems .......................... 399
12.7. Applications ........................................ 400
12.8. Conclusion .......................................... 402
12.9. Exercises ........................................... 402
13. Set-Theoretic Algorithmic Tools ........................... 417
13.1. Introduction ........................................ 417
13.2. Some Basic Properties of Finite Sets ................ 418
13.3. Partial Order Graph G(S, E) of Sets ................. 419
13.3.1. Reduced partial order graph ................. 420
13.3.2. Straddle graph .............................. 421
13.4. Boolean Closure of Sets ............................. 423
13.4.1. Intersection closure ........................ 423
13.4.2. Union closure ............................... 424
13.5. Consecutive (Linear) Arrangement of Set Members ..... 426
13.5.1. PQ trees .................................... 426
13.5.2. Straddling sets ............................. 429
13.6. Maximal Set Intersection Problem (maxSIP) ........... 434
13.6.1. Ordered enumeration trie .................... 435
13.6.2. Depth first traversal of the trie ........... 436
13.7. Minimal Set Intersection Problem (minSIP) ........... 447
13.7.1. Algorithm ................................... 447
13.7.2. Minimal from maximal sets ................... 448
13.8. Multi-Sets .......................................... 450
13.8.1. Ordered enumeration trie of multi-sets ...... 451
13.8.2. Enumeration algorithm ....................... 453
13.9. Adapting the Enumeration Scheme ..................... 455
13.10.Exercises ........................................... 458
14. Expression & Partial Order Motifs ......................... 469
14.1. Introduction ........................................ 469
14.1.1. Motivation .................................. 470
14.2. Extracting (Monotone CNF) Boolean Expressions ....... 471
14.2.1. Extracting biclusters ....................... 475
14.2.2. Extracting patterns in microarrays .......... 478
14.3. Extracting Partial Orders ........................... 480
14.3.1. Partial orders .............................. 480
14.3.2. Partial order construction problem .......... 481
14.3.3. Excess in partial orders .................... 483
14.4. Statistics of Partial Orders ........................ 485
14.4.1. Computing Cex(B) ............................ 489
14.5. Redescriptions ...................................... 493
14.6. Application: Partial Order of Expressions ........... 494
14.7. Summary ............................................. 495
14.8. Exercises ........................................... 496
References .................................................... 503
Index ......................................................... 515
|