
# LongMemEval-S Benchmark Results

LongMemEval (ICLR 2025) is an academic benchmark for evaluating long-term memory in chat assistants. It tests 5 core abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention.

## Setup

- **Dataset:** LongMemEval-S (500 questions, ~48 sessions per question, ~115K tokens)
- **Source:** `xiaowu0162/longmemeval-cleaned`
- **Metric:** recall_any@K: does ANY gold session appear in the top-K retrieved results? (see the sketch after this list)
- **Embedding model:** all-MiniLM-L6-v2 (384 dimensions, local, no API key required)
- **No LLM in the loop:** pure retrieval evaluation; no answer generation or judge
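For concreteness, here is a minimal TypeScript sketch of the recall_any@K check; `isHitAtK` is a hypothetical helper name for illustration, not the benchmark's actual code:

```typescript
// A question counts as a hit if ANY of its gold session IDs
// appears among the top-K retrieved session IDs.
function isHitAtK(rankedIds: string[], goldIds: string[], k: number): boolean {
  const topK = new Set(rankedIds.slice(0, k));
  return goldIds.some((id) => topK.has(id));
}

// Example: a gold session at rank 3 counts as a hit for recall_any@5.
console.log(isHitAtK(['s12', 's07', 's33', 's41', 's02'], ['s33'], 5)); // true
```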

## Results

| System | R@5 | R@10 | R@20 | NDCG@10 | MRR |
|---|---|---|---|---|---|
| agentmemory BM25+Vector | 95.2% | 98.6% | 99.4% | 87.9% | 88.2% |
| agentmemory BM25-only | 86.2% | 94.6% | 98.6% | 73.0% | 71.5% |
| MemPalace raw (vector-only) | 96.6% | ~97.6% | n/a | n/a | n/a |

### By Question Type (BM25+Vector)

| Type | R@5 | R@10 | Count |
|---|---|---|---|
| knowledge-update | 98.7% | 100.0% | 78 |
| multi-session | 97.7% | 100.0% | 133 |
| single-session-assistant | 96.4% | 98.2% | 56 |
| temporal-reasoning | 95.5% | 97.7% | 133 |
| single-session-user | 90.0% | 97.1% | 70 |
| single-session-preference | 83.3% | 96.7% | 30 |

### By Question Type (BM25-only)

| Type | R@5 | R@10 | Count |
|---|---|---|---|
| knowledge-update | 92.3% | 98.7% | 78 |
| single-session-user | 91.4% | 95.7% | 70 |
| temporal-reasoning | 88.0% | 94.7% | 133 |
| multi-session | 86.5% | 96.2% | 133 |
| single-session-assistant | 80.4% | 91.1% | 56 |
| single-session-preference | 60.0% | 80.0% | 30 |

## Analysis

1. **BM25+Vector (95.2% R@5) nearly matches pure vector search (96.6% R@5)**, a gap of only 1.4pp. Both use the same embedding model (all-MiniLM-L6-v2).

2. **BM25 alone reaches 86.2% R@5**: keyword search with Porter stemming and synonym expansion is surprisingly effective on conversational data.

3. **Adding vectors to BM25 gives +9pp R@5 (86.2% → 95.2%)**, the largest improvement from any single component (one plausible fusion scheme is sketched after this list).

4. **Preferences are the hardest category** for both BM25 (60.0%) and hybrid (83.3%): these questions require understanding implicit or indirectly stated preferences.

5. **Multi-session and knowledge-update are the strongest categories** (97.7%+ R@5 with hybrid). The hybrid approach excels when facts are distributed across sessions.

6. **Hybrid R@10 reaches 98.6%**: nearly all gold sessions are found within the top 10 results.
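For intuition on point 3, here is one plausible way to fuse BM25 and vector scores in TypeScript: min-max normalize each retriever's scores, then combine with a fixed mixing weight. The helper names and the `alpha = 0.5` default are assumptions for illustration, not agentmemory's documented implementation:

```typescript
interface Scored { id: string; score: number }

// Min-max normalize one retriever's scores into [0, 1].
// Assumes a non-empty result list.
function normalize(results: Scored[]): Map<string, number> {
  const min = Math.min(...results.map((r) => r.score));
  const max = Math.max(...results.map((r) => r.score));
  const range = max - min || 1; // guard against all-equal scores
  const out = new Map<string, number>();
  for (const r of results) out.set(r.id, (r.score - min) / range);
  return out;
}

// Weighted sum over the union of candidates from both retrievers;
// a session missing from one list contributes 0 for that component.
function fuseScores(bm25: Scored[], vector: Scored[], alpha = 0.5): Scored[] {
  const b = normalize(bm25);
  const v = normalize(vector);
  const ids = new Set([...b.keys(), ...v.keys()]);
  return [...ids]
    .map((id) => ({
      id,
      score: alpha * (b.get(id) ?? 0) + (1 - alpha) * (v.get(id) ?? 0),
    }))
    .sort((x, y) => y.score - x.score);
}
```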

## Important Notes on Methodology

- These are **retrieval recall scores, not end-to-end QA accuracy**. The official LongMemEval metric is QA accuracy (retrieve + generate answer + GPT-4o judge).
- Systems on the actual LongMemEval QA leaderboard score 60-95% depending on the LLM reader (Oracle GPT-4o gets ~82.4%).
- We do NOT claim these as "LongMemEval scores"; they are retrieval-only evaluations on the LongMemEval-S haystack.
- Each question builds a fresh index from its ~48 sessions, searches with the question text, and checks whether any gold session ID appears in the results (sketched below).
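A hedged sketch of that per-question loop; `buildIndex` and `SearchIndex` are hypothetical stand-ins for the benchmark's indexing code, declared here only so the sketch type-checks:

```typescript
interface Session { id: string; text: string }

interface SearchIndex {
  search(query: string, k: number): Promise<string[]>; // ranked session IDs
}

// Hypothetical: whatever builds the BM25 or hybrid index for one question.
declare function buildIndex(sessions: Session[]): Promise<SearchIndex>;

interface QAItem {
  question: string;
  sessions: Session[];      // the ~48 haystack sessions for this question
  goldSessionIds: string[]; // sessions containing the needed evidence
}

async function evalRecallAnyAtK(items: QAItem[], k: number): Promise<number> {
  let hits = 0;
  for (const item of items) {
    const index = await buildIndex(item.sessions); // fresh index per question
    const topK = await index.search(item.question, k);
    if (item.goldSessionIds.some((id) => topK.includes(id))) hits += 1;
  }
  return hits / items.length; // recall_any@K over the dataset
}
```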

## Reproducibility

```bash
# Download the dataset (264 MB)
pip install huggingface_hub
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id='xiaowu0162/longmemeval-cleaned', filename='longmemeval_s_cleaned.json', repo_type='dataset', local_dir='benchmark/data')
"

# Run BM25-only
npx tsx benchmark/longmemeval-bench.ts bm25

# Run the BM25+Vector hybrid (requires @xenova/transformers)
npx tsx benchmark/longmemeval-bench.ts hybrid
```
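To sanity-check the embedding side locally, the model can be loaded through @xenova/transformers' standard feature-extraction pipeline (`Xenova/all-MiniLM-L6-v2` is the Transformers.js port of the model named in Setup; the benchmark's actual loading code may differ):

```typescript
import { pipeline } from '@xenova/transformers';

// Downloads and caches the model on first run; fully local thereafter.
const embed = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Mean-pooled, L2-normalized sentence embedding.
const output = await embed('What did I tell you about my travel plans?', {
  pooling: 'mean',
  normalize: true,
});
console.log(output.dims); // [1, 384], matching the 384-dimension setup above
```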