LongMemEval (ICLR 2025) is an academic benchmark for evaluating long-term memory in chat assistants. It tests 5 core abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
- Dataset: LongMemEval-S (500 questions, ~48 sessions and ~115K tokens of history per question)
- Source: xiaowu0162/longmemeval-cleaned
- Metric: `recall_any@K` — does ANY gold session appear in the top-K retrieved results? (see the sketch after this list)
- Embedding model: `all-MiniLM-L6-v2` (384 dimensions, local, no API key)
- No LLM in the loop: pure retrieval evaluation, no answer generation or judge
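Concretely, `recall_any@K` is binary per question and averaged over the 500 questions. A minimal sketch (the function name and types are illustrative, not the benchmark's actual code):

```typescript
// recall_any@K for a single question: 1 if any gold session ID appears
// in the top-K retrieved results, else 0. The reported score is the mean
// of this value across all questions.
function recallAnyAtK(retrieved: string[], gold: Set<string>, k: number): number {
  return retrieved.slice(0, k).some((id) => gold.has(id)) ? 1 : 0;
}

recallAnyAtK(['s3', 's7', 's1'], new Set(['s7']), 2); // => 1 (gold 's7' is in top-2)
```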
Overall retrieval results on LongMemEval-S (500 questions):

| System | R@5 | R@10 | R@20 | NDCG@10 | MRR |
|---|---|---|---|---|---|
| agentmemory BM25+Vector | 95.2% | 98.6% | 99.4% | 87.9% | 88.2% |
| agentmemory BM25-only | 86.2% | 94.6% | 98.6% | 73.0% | 71.5% |
| MemPalace raw (vector-only) | 96.6% | ~97.6% | — | — | — |
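The NDCG@10 and MRR columns follow their standard definitions. A sketch assuming binary relevance (a session either is or is not a gold session), which may differ in detail from the benchmark script:

```typescript
// MRR: mean over questions of 1/rank of the first gold session (0 if none retrieved).
function reciprocalRank(retrieved: string[], gold: Set<string>): number {
  const idx = retrieved.findIndex((id) => gold.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Binary NDCG@K: discounted gain of gold sessions in the top K, normalized
// by the best achievable ordering (all gold sessions ranked first).
function ndcgAtK(retrieved: string[], gold: Set<string>, k: number): number {
  const dcg = retrieved
    .slice(0, k)
    .reduce((sum, id, i) => sum + (gold.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(gold.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```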
Per-question-type breakdown, BM25+Vector hybrid:

| Type | R@5 | R@10 | Count |
|---|---|---|---|
| knowledge-update | 98.7% | 100.0% | 78 |
| multi-session | 97.7% | 100.0% | 133 |
| single-session-assistant | 96.4% | 98.2% | 56 |
| temporal-reasoning | 95.5% | 97.7% | 133 |
| single-session-user | 90.0% | 97.1% | 70 |
| single-session-preference | 83.3% | 96.7% | 30 |
Per-question-type breakdown, BM25-only:

| Type | R@5 | R@10 | Count |
|---|---|---|---|
| knowledge-update | 92.3% | 98.7% | 78 |
| single-session-user | 91.4% | 95.7% | 70 |
| temporal-reasoning | 88.0% | 94.7% | 133 |
| multi-session | 86.5% | 96.2% | 133 |
| single-session-assistant | 80.4% | 91.1% | 56 |
| single-session-preference | 60.0% | 80.0% | 30 |
Key findings:

- BM25+Vector (95.2%) nearly matches pure vector search (96.6%), a gap of only 1.4pp. Both use the same embedding model (all-MiniLM-L6-v2).
- BM25 alone gets 86.2% — keyword search with Porter stemming and synonym expansion is surprisingly effective on conversational data.
- Adding vectors to BM25 gives +9pp (86.2% → 95.2%), the largest improvement from any single component (see the fusion sketch after this list).
- Preferences are the hardest category for both BM25 (60.0%) and hybrid (83.3%); these questions require understanding implicit or indirect statements.
- Multi-session and knowledge-update are the strongest categories (97.7%+ hybrid R@5). The hybrid approach excels when facts are distributed across sessions.
- Hybrid R@10 reaches 98.6% — nearly all gold sessions are found within the top 10 results.
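For intuition on why fusing the two signals helps: one common combination scheme is reciprocal rank fusion (RRF). This sketch is illustrative only and is not necessarily agentmemory's actual fusion logic:

```typescript
// Reciprocal rank fusion: a session ranked highly by EITHER signal gets a
// high combined score, which is how adding vectors can recover gold
// sessions that keyword matching alone ranks poorly.
function rrf(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

// const fused = rrf([bm25Ranking, vectorRanking]);
```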
Caveats:

- These are retrieval recall scores, not end-to-end QA accuracy. The official LongMemEval metric is QA accuracy (retrieve + generate answer + GPT-4o judge).
- Systems on the actual LongMemEval QA leaderboard score 60-95% depending on the LLM reader (Oracle GPT-4o gets ~82.4%).
- We do NOT claim these as "LongMemEval scores" — they are retrieval-only evaluations on the LongMemEval-S haystack.
- Each question builds a fresh index from its ~48 sessions, searches it with the question text, and checks whether any gold session ID appears in the results (sketched below).
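In code, the per-question loop looks roughly like this. The JSON field names (`haystack_sessions`, `answer_session_ids`) follow the LongMemEval release, but the index interface and function names here are assumptions, not the benchmark's actual implementation:

```typescript
interface Session { session_id: string; turns: unknown[]; }
interface Question {
  question: string;
  haystack_sessions: Session[];
  answer_session_ids: string[];
}
interface Index { search(query: string, k: number): Promise<string[]>; }

// Builds a fresh index per question, retrieves the top-K session IDs, and
// scores a hit if any gold session is present (recall_any@K).
async function evalDataset(
  questions: Question[],
  buildIndex: (sessions: Session[]) => Promise<Index>, // bm25, vector, or hybrid
  k: number,
): Promise<number> {
  let hits = 0;
  for (const q of questions) {
    const index = await buildIndex(q.haystack_sessions);
    const retrieved = await index.search(q.question, k);
    const gold = new Set(q.answer_session_ids);
    if (retrieved.some((id) => gold.has(id))) hits++;
  }
  return hits / questions.length;
}
```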
To reproduce:

```bash
# Download the dataset (264 MB)
pip install huggingface_hub
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id='xiaowu0162/longmemeval-cleaned', filename='longmemeval_s_cleaned.json', repo_type='dataset', local_dir='benchmark/data')
"

# Run BM25-only
npx tsx benchmark/longmemeval-bench.ts bm25

# Run BM25+Vector hybrid (requires @xenova/transformers)
npx tsx benchmark/longmemeval-bench.ts hybrid
```
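To sanity-check the local embedding path, this is roughly how @xenova/transformers produces all-MiniLM-L6-v2 embeddings (using the Xenova ONNX port of the model; the benchmark script's actual wiring may differ):

```typescript
import { pipeline } from '@xenova/transformers';

// Downloads the ONNX model on first use; runs fully locally, no API key.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Mean-pooled, L2-normalized sentence embedding.
const output = await extractor('When did I adopt my cat?', {
  pooling: 'mean',
  normalize: true,
});
console.log(output.dims); // [1, 384]
```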