# nucleo-ext This is a fork of [helix-editor/nucleo](https://github.com/helix-editor/nucleo) that includes filtering and custom scoring built into the matching loop. ## Fork Changes ### Filter Callback Filter items before fuzzy matching. Useful for category/facet filtering (e.g., filter by directory, host, session). ```rust use std::sync::Arc; // Filter to only include items where category == 1 nucleo.set_filter(Some(Arc::new(|item: &MyItem| item.category == 1))); // Remove filter nucleo.set_filter(None); ``` Filters are applied: - During initial matching of new items - When rescoring existing matches after pattern changes - When resetting matches for empty patterns Setting a filter triggers a rescore on the next `tick()`. ### Scorer Callback Compute custom scores after fuzzy matching. The scorer receives the item and the fuzzy score, and returns the final score used for sorting. ```rust use std::sync::Arc; // Use item priority as the score (ignoring fuzzy score) nucleo.set_scorer(Some(Arc::new(|item: &MyItem, _fuzzy_score| { item.priority }))); // Combine fuzzy score with frecency nucleo.set_scorer(Some(Arc::new(|item: &MyItem, fuzzy_score| { fuzzy_score + item.frecency_score }))); // Remove scorer (use raw fuzzy score) nucleo.set_scorer(None); ``` The scorer output is stored in `Match::external_score` and used as the primary sort key. ### Match Struct Changes The `Match` struct now has an additional field: ```rust pub struct Match { pub score: u32, // Raw fuzzy match score pub external_score: u32, // Scorer output (used for sorting) pub idx: u32, // Item index } ``` Results are sorted by `external_score` (descending), with tie-breakers on item length and index. ### Use Case: Frecency Ranking This fork was created to support frecency-based ranking in [Atuin](https://github.com/atuinsh/atuin). The typical pattern is: 1. Store metadata (frecency scores, categories) in a separate data structure 2. Use filter callback to exclude items that don't match the current context 3. Use scorer callback to combine fuzzy score with frecency at query time ```rust // External data store let metadata: Arc> = /* ... */; // Filter: only show items used in current directory let dir = current_dir.clone(); let meta = metadata.clone(); nucleo.set_filter(Some(Arc::new(move |cmd: &String| { meta.get(cmd).map(|m| m.used_in_dir(&dir)).unwrap_or(false) }))); // Scorer: combine fuzzy score with frecency let meta = metadata.clone(); nucleo.set_scorer(Some(Arc::new(move |cmd: &String, fuzzy_score| { let frecency = meta.get(cmd).map(|m| m.frecency()).unwrap_or(0); fuzzy_score + (frecency * 10) }))); ``` --- The original Nucleo readme follows. --- `nucleo` is a highly performant fuzzy matcher written in Rust. It aims to fill the same use case as `fzf` and `skim`. Compared to `fzf` `nucleo` has a significantly faster matching algorithm. This mainly makes a difference when matching patterns with low selectivity on many items. An (unscientific) comparison is shown in the benchmark section below. > Note: If you are looking for a replacement of the `fuzzy-matcher` crate and not a fully managed fuzzy picker, you should use the [`nucleo-matcher`](https://crates.io/crates/nucleo-matcher) crate. `nucleo` uses the exact **same scoring system as fzf**. That means you should get the same ranking quality (or better) as you are used to from fzf. However, `nucleo` has a more faithful implementation of the Smith-Waterman algorithm which is normally used in DNA sequence alignment (see https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/gaps.pdf) with two separate matrices (instead of one like fzf). This means that `nucleo` finds the optimal match more often. For example if you match `foo` in `xf foo` `nucleo` will match `x__foo` but `fzf` will match `xf_oo` (you can increase the word length the result will stay the same). The former is the more intuitive match and has a higher score according to the ranking system that both `nucleo` and fzf. **Compared to `skim`** (and the `fuzzy-matcher` crate) `nucleo` has an even larger performance advantage and is often around **six times faster** (see benchmarks below). Furthermore, the bonus system used by nucleo and fzf is (in my opinion) more consistent/superior. `nucleo` also handles non-ascii text much better. (`skim`s bonus system and even case insensitivity only work for ASCII). Nucleo also handles Unicode graphemes more correctly. `Fzf` and `skim` both operate on Unicode code points (chars). That means that multi codepoint graphemes can have weird effects (match multiple times, weirdly change the score, ...). `nucleo` will always use the first codepoint of the grapheme for matching instead (and reports grapheme indices, so they can be highlighted correctly). ## Status Nucleo is used in the helix-editor and therefore has a large user base with lots of real world testing. The core matcher implementation is considered complete and is unlikely to see major changes. The `nucleo-matcher` crate is finished and ready for widespread use, breaking changes should be very rare (a 1.0 release should not be far away). While the high level `nucleo` crate also works well (and is also used in helix), there are still additional features that will be added in the future. The high level crate also need better documentation and will likely see a few API changes in the future. ## Benchmarks > WIP currently more of a demonstration than a comprehensive benchmark suit > most notably scientific comparisons with `fzf` are missing (a pain because it can't be called as a library) ### Matcher micro benchmarks Benchmark comparing the runtime of various patterns matched against all files in the source of the linux kernel. Repeat on your system with `BENCHMARK_DIR= cargo run -p benches --release` (you can specify an empty directory and the kernel is cloned automatically). Method | Mean | Samples -----------------------|-----------|----------- nucleo "never_matches" | 2.30 ms |2,493/2,500 skim "never_matches" | 17.44 ms | 574/574 nucleo "copying" | 2.12 ms |2,496/2,500 skim "copying" | 16.85 ms | 593/594 nucleo "/doc/kernel" | 2.59 ms |2,499/2,500 skim "/doc/kernel" | 18.32 ms | 546/546 nucleo "//.h" | 9.53 ms |1,049/1,049 skim "//.h" | 35.46 ms | 282/282 ### Comparison with fzf For example in the following two screencasts the pattern `///.` is pasted into `fzf` and `nucleo` (both with about 3 million items open). `fzf` takes a while to filter the text (about 1 second) while `nucleo` has barely any noticeable delay (a single frame in the screencast so about 1/30 seconds). This comparison was made on a very beefy CPU (Ryzen 5950x) so on slower systems the difference may be larger: [![asciicast](https://asciinema.org/a/600517.svg)](https://asciinema.org/a/600517) [![asciicast](https://asciinema.org/a/600516.svg)](https://asciinema.org/a/600516) # Future Work * [x] merge integration into helix * [ ] build a standalone CLI application * [ ] reach feature parity with `fzf` (mostly `--no-sort` and `--tac`) * [ ] add a way to allow columnar matching * [ ] expose C API so both the high level API and the matching algorithm itself can be used in other applications (like various nvim plugins) # Naming The name `nucleo` plays on the fact that the `Smith-Waterman` algorithm (that it's based on) was originally developed for matching DNA/RNA sequences. The elements of DNA/RNA that are matched are called *nucleotides* which was shortened to `nucleo` here. The name also indicates its close relationship with the *helix* editor (sticking with the DNA theme). # Implementation Details > This is only intended for those interested and will not be relevant to most people. I plan to turn this into a blog post when I have more time The fuzzy matching algorithm is based on the `Smith-Waterman` (with affine gaps) as described in https://www.cs.cmu.edu/~ckingsf/bioinfo-lectures/gaps.pdf (TODO: explain). `Nucleo` faithfully implements this algorithm and therefore has two separate matrices. However, by precomputing the next `m-matrix` row we can avoid storing the p-matrix at all and instead just store the value in a variable as we iterate the row. Nucleo also never really stores the `m-matrix` instead we only ever store the current row (which simultaneously serves as the next row). During index calculation a full matrix is however required to backtrack which indices were actually matched. We only store two bools here (to indicate where we came from in the matrix). By comparison `skim` stores the full p and m matrix in that case. `fzf` always allocates a full `mn` matrix (even during matching!). `nucleo`s' matrix is only width `n-m+1` instead of width `n`. This comes from the observation that the `p` char requires `p-1` chars before it and `m-p` chars after it, so there are always `p-1 + m-p = m+1` chars that can never match the current char. This works especially well with only using a single row because the first relevant char is always at the same position even though it's technically further to the right. This is particularly nice because we precalculate the m-matrix row. The m-matrix is computed from diagonal elements, so the precalculated values stay in the same matrix cell. Compared to `skim` nucleo does couple simpler (but arguably even more impactful) optimizations: * *Presegment Unicode*: Unicode segmentation is somewhat slow and matcher will filter the same elements quite often so only doing it once is nice. It also prevents a very common source of bugs (mixing of char indices which we use here and utf8 indices) and makes the code a lot simpler as a result. Fzf does the same. * *Aggressive prefiltering*: Especially for ASCII this works very well, but we also do this for Unicode to a lesser extent. This ensures we reject non-matching haystacks as fast as possible. Usually most haystacks will not match when fuzzy matching large lists so having fast path for that case is a huge win. * *Special-case ASCII*: 90% of practical text is ASCII. ASCII can be stored as bytes instead of `chars`, so cache locality is improved a lot, and we can use `memchar` for superfast prefilters (even case-insensitive prefilter are possible that way) * *Fallback for very long matches*: We fall back to greedy matcher which runs in `O(N)` (and `O(1)` space complexity) to avoid the `O(mn)` blowup for large matches. This is fzfs old algorithm and yields decent (but not great) results.