# Benchmarks
Distillery runs the LongMemEval retrieval bench nightly against its own pipeline. This page surfaces the most recent headline numbers, the configurations Distillery ships, and, equally important, the things these numbers do not claim.
This page was wired up before any number landed: each cell shows a placeholder until the first stable nightly run completes and the variance gate (5-seed back-to-back execution, stddev ≤ 0.5pp on R@5) confirms the bench is a useful regression signal. No number is displayed here until that gate is green.
## Headline

The pre-registered headline triplet — Recall@5, Recall@10, NDCG@10 — for the headline cell
(retrieval=hybrid, granularity=session, recency=on, embed=bge-small). Each value is the mean across seeds.

| Metric | Value |
|---|---|
| Recall@5 | 0.970 |
| Recall@10 | 0.990 |
| NDCG@10 | 0.890 |
## Configuration
The headline cell is pre-registered and immutable without an ADR. It does not change to chase a number.
| Axis | Headline value |
|---|---|
| Retrieval | hybrid (BM25 + vector via Reciprocal Rank Fusion) |
| Granularity | session (one document per haystack session) |
| Recency | on (90-day linear decay, recency_min_weight=0.5) |
| Embed model | bge-small (BAAI/bge-small-en-v1.5, 384-dim, fastembed) |
Full pre-registration rationale and change-control rules live in
bench/HEADLINE.md.
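To make the retrieval and recency axes concrete, here is a minimal sketch of Reciprocal Rank Fusion over BM25 and vector rankings, plus a 90-day linear decay floored at 0.5. The function names and the fusion constant are illustrative assumptions, not Distillery's internal API.

```python
def rrf_fuse(bm25_ranking: list[str], vector_ranking: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over the input rankings of 1 / (k + rank_of_d)."""
    scores: dict[str, float] = {}
    for ranking in (bm25_ranking, vector_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def recency_weight(age_days: float, window_days: float = 90.0, min_weight: float = 0.5) -> float:
    """Linear decay from 1.0 (fresh) down to min_weight at window_days and beyond,
    mirroring the recency=on axis (90-day decay, recency_min_weight=0.5)."""
    if age_days >= window_days:
        return min_weight
    return 1.0 - (1.0 - min_weight) * (age_days / window_days)
```

Where the recency weight is applied (to the fused ranking or to the vector score) is an implementation detail of the pipeline; the sketch only pins down the decay shape and the fusion rule.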
## What this number does NOT claim
Read this before citing any number on this page
- These numbers are retrieval metrics (R@k, NDCG@k). They are not comparable to LongMemEval QA-accuracy leaderboard entries — the LongMemEval paper's primary metric is GPT-4o-judged QA accuracy, which requires a generator stack Distillery does not ship.
- These numbers are Distillery vs. Distillery only. There are no competitor rows anywhere on this page, in the README, or in the auto-generated summaries under bench/results/. Cross-system retrieval-vs-QA comparisons are a known category error.
- Cross-granularity rows (session vs. turn) are non-comparable to one another — the corpus_id space differs, so R@k means different things in each row.
- Cross-embed-model rows carry an HNSW-construction caveat: the index is rebuilt per model, so insertion-order and seed effects are part of the score.
- The headline number is the mean of multiple seeds. The corresponding stddev lives alongside it in bench/results/variance_baseline.json.
Read the full limitations before citing this number → bench/LIMITATIONS.md
## Internal comparison table
Distillery configurations only. No competitor rows. Each row is a single Distillery configuration evaluated against the same LongMemEval-S question set with the same SHA-pinned dataset and embedding model.
| Configuration | R@5 | R@10 | NDCG@10 |
|---|---|---|---|
| hybrid + recency on (headline) | 0.970 | 0.990 | 0.890 |
| raw + recency on | 0.870 | 0.940 | 0.787 |
| hybrid + recency off | 0.970 | 0.990 | 0.892 |
| hybrid + granularity=turn | 0.980 | 1.000 | 0.681 |
The granularity=turn row is shown for ablation interest only; it is not directly
comparable to the session rows above (see the LIMITATIONS callout).
## Per-question-type breakdown
LongMemEval-S partitions questions into six types; the headline cell is scored on each type independently.
| Question type | R@5 | R@10 | NDCG@10 |
|---|---|---|---|
| knowledge-update | — | — | — |
| multi-session | 1.000 | 1.000 | 0.914 |
| temporal | — | — | — |
| single-session-user | 0.957 | 0.986 | 0.880 |
| single-session-preference | — | — | — |
| single-session-assistant | — | — | — |
## Graph features — Cell A regression gate, Cell B deferred
Issue #458 splits the bench's coverage of the graph-enabled retrieval path (PRs #422–#429, epic #147) into two cells. Only one produces a publishable number.
### Cell A — graph regression gate (DO)
Same config as the HEADLINE cell (hybrid / session / recency-on / bge-small, 500q × 5
seeds), re-run with --expand-graph enabled. Cell A asks: does enabling graph
features regress baseline recall when the entry graph is sparse? The pass criterion
is that Cell A's mean R@5 stays within the variance-gate threshold (default 0.5pp) of
the HEADLINE mean.
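As a minimal sketch of the gate arithmetic (the function and argument names below are illustrative, not the bench's actual implementation):

```python
def graph_regression_gate(cell_a_mean_r5: float, headline_mean_r5: float,
                          cell_a_questions: int, headline_questions: int,
                          threshold_pp: float = 0.5) -> bool:
    """Pass if Cell A's mean R@5 stays within threshold_pp percentage points of the
    HEADLINE mean, and only when both means cover the same question sample size."""
    if cell_a_questions != headline_questions:
        # sample_size_match must hold before the gate is computed at all
        raise ValueError("sample sizes differ; gate not computed")
    delta_pp = abs(cell_a_mean_r5 - headline_mean_r5) * 100.0
    return delta_pp <= threshold_pp


# With the values reported below: graph_regression_gate(0.972, 0.972, 500, 500) -> True, delta 0.0pp.
```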
Status: gate live — first 500q × 5-seed result lands at delta = 0.0pp
The graph retrieval PRs (#422–#429)
merged ahead of the 0.5.0 release, and Cell A's regression-gate semantics
are live. The first full-500q × 5-seed Cell A run on the v0.5.0 commit
landed at mean R@5 = 0.972 (stddev 0.000), exactly matching the
HEADLINE mean of 0.972 over the same 500q sample for a delta of 0.0pp
against the 0.5pp variance-gate threshold (gate_pass=true,
sample_size_match=true). Run:
actions/runs/25453787717.
Aggregate receipt: bench/results/graph_regression_cell_a.json.
Per the discipline in bench/LIMITATIONS.md §(f), this is a regression
result only — no value-add claim is implied. Cell A passing means
"enabling graph features did not regress baseline recall on
LongMemEval-S" and nothing more. The graph hypothesis (cross-user /
cross-session relations) is not exercised by LongMemEval and is deferred
to Cell B.
- Workflow. .github/workflows/bench-graph-regression-cell.yml runs nightly at 06:00 UTC, sequenced after the HEADLINE workflow at 05:00 UTC. Nightly samples 100q for trending; full-500q runs (gate-relevant) are workflow_dispatch only and require a sample-size match against the committed variance_baseline.json before the gate is computed.
- Aggregate. Cell A's 5-seed mean + delta vs HEADLINE lands at bench/results/graph_regression_cell_a.json (a small consumption sketch follows this list). Per-seed receipts live as workflow artifacts only (90-day retention) and are deliberately not committed — the repo must never accumulate a graph-receipt history that could be silently re-published as a HEADLINE claim.
- Default-off. Graph features remain default-off in production (the existing HEADLINE cell does not set expand_graph); Cell A exists as a separate axis and does not displace the public number.
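A hedged example of consuming the aggregate receipt in a downstream check. The field names gate_pass and sample_size_match are quoted from the run above; delta_pp is an assumed name for the delta field, so verify it against the committed JSON before relying on it.

```python
import json
from pathlib import Path

receipt = json.loads(Path("bench/results/graph_regression_cell_a.json").read_text())

# Fail loudly if the gate preconditions or the gate itself did not pass.
assert receipt.get("sample_size_match") is True, "full-500q sample did not match the committed baseline"
assert receipt.get("gate_pass") is True, "graph features regressed baseline R@5 beyond the threshold"
print("Cell A delta vs HEADLINE:", receipt.get("delta_pp", "n/a"), "pp")
```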
### Cell B — graph value-add (DEFER)
A claim of the form "graph features improve LongMemEval" is deferred to a fit-for-purpose eval. LongMemEval is a single-user benchmark — each question is scored against one user's haystack — and does not exercise the graph hypothesis (cross-user / cross-session entry relations) that motivates Distillery's graph features. Measuring graph value-add on LongMemEval would be a category error analogous to the retrieval-vs-QA comparison called out in the limitations above.
The deferred eval will be one of:
- a multi-hop QA dataset (questions whose answer requires traversing entry relations);
- a synthetic team-knowledge eval (multiple authors, cross-author lookups);
- an in-house /investigate or /pour synthesis eval that scores the value the graph adds to multi-document narrative answers.
Until that eval exists, no public surface (this page, the README, the blog,
the 0.5.0 release notes) may claim that graph features improve LongMemEval
scores. The 0.5.0 release notes claim "no regression with graph enabled" —
never "graph improves LongMemEval." Full discipline rationale is in
bench/LIMITATIONS.md §(f).
## Methodology
The bench instantiates an in-memory DuckDBStore per question, fixes the PRNG seed before
ingestion to neutralise HNSW insertion-order non-determinism, ingests haystack_sessions
as entries with metadata.session_id populated, then runs store.search(query, limit=50)
and maps the returned session_ids back to the gold answer_session_ids for scoring.
Scoring is a textbook dcg/ndcg/evaluate_retrieval reimplementation in
src/distillery/eval/scoring.py, adapted to the existing RetrievalMetrics shape from
src/distillery/eval/retrieval_scorer.py. Every JSONL line carries a SHA panel
(git_sha, dataset_revision_sha, embed_model_sha, python_version) so any number on
this page can be reproduced bit-for-bit.
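In sketch form, the per-question flow and the textbook metrics look like the following. The metric functions are standard binary-relevance definitions, not a copy of src/distillery/eval/scoring.py, and the result-to-session_id mapping in the outline comment is an assumption about the store's return shape.

```python
import math

def recall_at_k(retrieved: list[str], gold: list[str], k: int) -> float:
    """Fraction of gold answer sessions that appear in the top-k retrieved sessions."""
    return len(set(retrieved[:k]) & set(gold)) / len(gold) if gold else 0.0

def ndcg_at_k(retrieved: list[str], gold: list[str], k: int) -> float:
    """Binary-relevance NDCG@k: DCG of the top-k list divided by the ideal DCG."""
    relevant = set(gold)
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(retrieved[:k]) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

# Per-question outline (store construction and ingestion elided):
#   1. build a fresh in-memory store and fix the PRNG seed before ingestion
#   2. ingest each haystack session as an entry with metadata.session_id populated
#   3. results = store.search(question_text, limit=50)
#   4. retrieved = [r.metadata["session_id"] for r in results]   # return shape assumed
#   5. recall_at_k(retrieved, answer_session_ids, 5); ndcg_at_k(retrieved, answer_session_ids, 10)
```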
Full methodology and dataset citation:
bench/METHODOLOGY.md.
Dataset citation: Wu et al., LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory, ICLR 2025 — arxiv:2410.10813.
## Reproduction
The bench runs offline. fastembed downloads model weights once, the dataset loader pins the HuggingFace revision, and there is no API call in the hot loop.
pip install -e ".[dev,fastembed]"
distillery bench longmemeval \
--retrieval hybrid \
--granularity session \
--recency on \
--embed-model bge-small \
--seeds 1
Outputs land in bench/results/results_longmemeval_<mode>_<embed>_<UTC>.jsonl plus a
summary.json next to it. The canonical reproduction guide — including the 5-seed variance
characterisation procedure — is in bench/README.md.
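For a quick local variance check after a multi-seed run, the aggregation could be sketched as below; the glob pattern and the r_at_5 field name inside summary.json are assumptions, so treat bench/README.md as the canonical procedure.

```python
import json
import statistics
from pathlib import Path

# Collect per-seed R@5 values from the generated summaries (glob and field name assumed).
r5 = [json.loads(p.read_text())["r_at_5"]
      for p in sorted(Path("bench/results").glob("summary*.json"))]

mean, stddev = statistics.mean(r5), statistics.pstdev(r5)
print(f"R@5 mean={mean:.3f} stddev={stddev:.4f}")
# The variance gate requires stddev <= 0.5pp on R@5 (0.005 in absolute recall terms).
assert stddev <= 0.005
```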