LongMemEval (ICLR 2025) is an academic benchmark for evaluating long-term memory in chat assistants. It tests 5 core abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.
- Dataset: LongMemEval-S (500 questions, ~48 sessions and ~115K tokens of history per question)
- Source: xiaowu0162/longmemeval-cleaned
- Metric: `recall_any@K` — does ANY gold session appear in the top-K retrieved results? (see the sketch after this list)
- Embedding model: `all-MiniLM-L6-v2` (384 dimensions, local, no API key)
- No LLM in the loop: pure retrieval evaluation, no answer generation or judge
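Concretely, `recall_any@K` is binary per question and averaged over the 500 questions. A minimal sketch (the function name and types are illustrative, not the benchmark's actual code):

```typescript
// recall_any@K for a single question: 1 if any gold session ID appears
// in the top-K retrieved results, else 0. The reported score is the mean
// of this value across all questions.
function recallAnyAtK(retrieved: string[], gold: Set<string>, k: number): number {
  return retrieved.slice(0, k).some((id) => gold.has(id)) ? 1 : 0;
}

recallAnyAtK(['s3', 's7', 's1'], new Set(['s7']), 2); // => 1 (gold 's7' is in top-2)
```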
Overall retrieval results on LongMemEval-S (500 questions):

| System | R@5 | R@10 | R@20 | NDCG@10 | MRR |
|---|---|---|---|---|---|
| agentmemory BM25+Vector | 95.2% | 98.6% | 99.4% | 87.9% | 88.2% |
| agentmemory BM25-only | 86.2% | 94.6% | 98.6% | 73.0% | 71.5% |
| MemPalace raw (vector-only) | 96.6% | ~97.6% | — | — | — |
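The NDCG@10 and MRR columns follow their standard definitions. A sketch assuming binary relevance (a session either is or is not a gold session), which may differ in detail from the benchmark script:

```typescript
// MRR: mean over questions of 1/rank of the first gold session (0 if none retrieved).
function reciprocalRank(retrieved: string[], gold: Set<string>): number {
  const idx = retrieved.findIndex((id) => gold.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}

// Binary NDCG@K: discounted gain of gold sessions in the top K, normalized
// by the best achievable ordering (all gold sessions ranked first).
function ndcgAtK(retrieved: string[], gold: Set<string>, k: number): number {
  const dcg = retrieved
    .slice(0, k)
    .reduce((sum, id, i) => sum + (gold.has(id) ? 1 / Math.log2(i + 2) : 0), 0);
  let idcg = 0;
  for (let i = 0; i < Math.min(gold.size, k); i++) idcg += 1 / Math.log2(i + 2);
  return idcg === 0 ? 0 : dcg / idcg;
}
```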
Per-question-type breakdown, BM25+Vector hybrid:

| Type | R@5 | R@10 | Count |
|---|---|---|---|
| knowledge-update | 98.7% | 100.0% | 78 |
| multi-session | 97.7% | 100.0% | 133 |
| single-session-assistant | 96.4% | 98.2% | 56 |
| temporal-reasoning | 95.5% | 97.7% | 133 |
| single-session-user | 90.0% | 97.1% | 70 |
| single-session-preference | 83.3% | 96.7% | 30 |
Per-question-type breakdown, BM25-only:

| Type | R@5 | R@10 | Count |
|---|---|---|---|
| knowledge-update | 92.3% | 98.7% | 78 |
| single-session-user | 91.4% | 95.7% | 70 |
| temporal-reasoning | 88.0% | 94.7% | 133 |
| multi-session | 86.5% | 96.2% | 133 |
| single-session-assistant | 80.4% | 91.1% | 56 |
| single-session-preference | 60.0% | 80.0% | 30 |
Key findings:

- BM25+Vector (95.2%) nearly matches pure vector search (96.6%), a gap of only 1.4pp. Both use the same embedding model (all-MiniLM-L6-v2).
- BM25 alone gets 86.2% — keyword search with Porter stemming and synonym expansion is surprisingly effective on conversational data.
- Adding vectors to BM25 gives +9pp (86.2% → 95.2%), the largest improvement from any single component (see the fusion sketch after this list).
- Preferences are the hardest category for both BM25 (60.0%) and hybrid (83.3%); these questions require understanding implicit or indirect statements.
- Multi-session and knowledge-update are the strongest categories (97.7%+ hybrid R@5). The hybrid approach excels when facts are distributed across sessions.
- Hybrid R@10 reaches 98.6% — nearly all gold sessions are found within the top 10 results.
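For intuition on why fusing the two signals helps: one common combination scheme is reciprocal rank fusion (RRF). This sketch is illustrative only and is not necessarily agentmemory's actual fusion logic:

```typescript
// Reciprocal rank fusion: a session ranked highly by EITHER signal gets a
// high combined score, which is how adding vectors can recover gold
// sessions that keyword matching alone ranks poorly.
function rrf(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

// const fused = rrf([bm25Ranking, vectorRanking]);
```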
Caveats:

- These are retrieval recall scores, not end-to-end QA accuracy. The official LongMemEval metric is QA accuracy (retrieve + generate answer + GPT-4o judge).
- Systems on the actual LongMemEval QA leaderboard score 60-95% depending on the LLM reader (Oracle GPT-4o gets ~82.4%).
- We do NOT claim these as "LongMemEval scores" — they are retrieval-only evaluations on the LongMemEval-S haystack.
- Each question builds a fresh index from its ~48 sessions, searches it with the question text, and checks whether any gold session ID appears in the results (sketched below).
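In code, the per-question loop looks roughly like this. The JSON field names (`haystack_sessions`, `answer_session_ids`) follow the LongMemEval release, but the index interface and function names here are assumptions, not the benchmark's actual implementation:

```typescript
interface Session { session_id: string; turns: unknown[]; }
interface Question {
  question: string;
  haystack_sessions: Session[];
  answer_session_ids: string[];
}
interface Index { search(query: string, k: number): Promise<string[]>; }

// Builds a fresh index per question, retrieves the top-K session IDs, and
// scores a hit if any gold session is present (recall_any@K).
async function evalDataset(
  questions: Question[],
  buildIndex: (sessions: Session[]) => Promise<Index>, // bm25, vector, or hybrid
  k: number,
): Promise<number> {
  let hits = 0;
  for (const q of questions) {
    const index = await buildIndex(q.haystack_sessions);
    const retrieved = await index.search(q.question, k);
    const gold = new Set(q.answer_session_ids);
    if (retrieved.some((id) => gold.has(id))) hits++;
  }
  return hits / questions.length;
}
```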
To reproduce:

```bash
# Download the dataset (264 MB)
pip install huggingface_hub
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id='xiaowu0162/longmemeval-cleaned', filename='longmemeval_s_cleaned.json', repo_type='dataset', local_dir='benchmark/data')
"

# Run BM25-only
npx tsx benchmark/longmemeval-bench.ts bm25

# Run BM25+Vector hybrid (requires @xenova/transformers)
npx tsx benchmark/longmemeval-bench.ts hybrid
```
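To sanity-check the local embedding path, this is roughly how @xenova/transformers produces all-MiniLM-L6-v2 embeddings (using the Xenova ONNX port of the model; the benchmark script's actual wiring may differ):

```typescript
import { pipeline } from '@xenova/transformers';

// Downloads the ONNX model on first use; runs fully locally, no API key.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Mean-pooled, L2-normalized sentence embedding.
const output = await extractor('When did I adopt my cat?', {
  pooling: 'mean',
  normalize: true,
});
console.log(output.dims); // [1, 384]
```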