Internals
How treesearch works under the hood.
Architecture Overview
Treesearch treats pattern matching as a Constraint Satisfaction Problem (CSP):
- Query Parser (
src/query.rs) - Pest-based parser converts query strings to Pattern AST - CSP Solver (
src/searcher.rs) - DFS with forward checking and MRV heuristic - Tree Storage (
src/tree.rs) - String interning (lasso + FxHash) for memory efficiency - CoNLL-U Parser (
src/conllu.rs) - Streaming parser with automatic gzip detection - Parallelization (
src/iterators.rs) - File-level parallelism via rayon + channels - Python Bindings (
src/python.rs) - PyO3 with zero-copy Arc sharing
Search Algorithm
1. Order variables by constraint degree (MRV heuristic)
2. For each variable:
a. Get candidates satisfying node constraints
b. Check arc consistency (edge constraints)
c. Recursively assign remaining variables
d. Backtrack if no valid assignment
3. Yield all complete assignments
Forward checking typically reduces search space by 90%+.
Constraint Types
- Node constraints (lemma, upos, form, deprel): Direct string pool comparison
- Feature constraints (feats.X, misc.X): Iterate word features for key-value match
- Edge constraints: Check parent/children relationships
- Anonymous variables (
_): Create HasIncomingEdge/HasOutgoingEdge constraints without bindings
Parallelization
Files → Chunks (8) → rayon::par_iter → Process → Bounded channel (8) → Iterator
Results streamed through channels with backpressure. Use ordered=False for maximum throughput.
Design Decisions
Why CSP? Natural representation of structural constraints, well-studied optimizations, exhaustive search guarantee.
Why exhaustive search? Corpus linguistics requires all matches; researchers filter in post-processing.
Why Rust? 10-100x faster than pure Python, memory-efficient interning, safe parallelism.
Source Files
| File | Purpose |
|---|---|
src/query.rs |
Query parser (Pest grammar) |
src/pattern.rs |
Pattern AST and constraints |
src/searcher.rs |
CSP solver |
src/tree.rs |
Tree/Word data structures |
src/conllu.rs |
CoNLL-U parsing |
src/iterators.rs |
Parallel iteration |
src/python.rs |
Python bindings |