Benchmark Results · May 2026

Memory that performs.
Zero tokens consumed.

We benchmark against LongMemEval and LOCOMO — the two peer-reviewed, publicly available datasets used to measure long-term conversational memory. Every number on this page comes from a reproducible run on those public datasets, with methodology disclosed below.

LOCOMO (n=1,978): 69.5% EvHit@5. Retrieval-only, no LLM reasoning layer. Gap to Mem0 explained below.
LLM tokens per query: 0. Storage and retrieval happen in our engine. No model call on any query.

01 — Methodology

How we measured

Every number is reproducible. The datasets are public, the scoring is string-based (not LLM-judged), and we disclose where our scoring underperforms and why.

LongMemEval

  • Source: LongMemEval (ICLR 2025) — longmemeval_s_cleaned.json, same dataset used by Mem0 and other published systems
  • n: 500 questions, 6 question types — full dataset, no sampling
  • Setup: Conversation history written turn-by-turn into memory. Each question then queries that memory store. This is the standard evaluation protocol for memory retrieval systems.
  • Query: Top-k = 5 memories retrieved per question
  • Scoring: Substring match + 60% keyword overlap on the gold answer. No LLM judge.
  • Isolation: Separate memory namespace per question — zero cross-contamination
  • Result: 94.4% (472 / 500)
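
The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the actual scoring script: the function names, the tokenizer, and the choice to combine the two rules with OR are assumptions, and no stopword filtering is applied.

```python
import re

def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def is_hit(gold: str, retrieved: list[str], min_overlap: float = 0.6) -> bool:
    """Return True if any retrieved memory contains the gold answer as a
    substring, or covers at least `min_overlap` of the gold answer's tokens.
    ASSUMPTION: the two rules are combined with OR."""
    gold_tokens = set(tokens(gold))
    for memory in retrieved:
        if gold.lower() in memory.lower():
            return True
        if gold_tokens:
            covered = len(gold_tokens & set(tokens(memory))) / len(gold_tokens)
            if covered >= min_overlap:
                return True
    return False

memories = ["I adopted a golden retriever named Biscuit last spring."]
print(is_hit("golden retriever", memories))            # substring match
print(is_hit("a retriever that is golden", memories))  # 3 of 5 tokens covered
print(is_hit("siamese cat", memories))                 # neither rule fires
```

Because no model grades the answer, a paraphrase that shares few tokens with the gold answer counts as a miss under this kind of rule.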

LOCOMO

  • Source: Publicly available LOCOMO dataset
  • Size: n = 1,978 questions (full dataset, no sampling)
  • Question types: Single-hop, multi-hop, temporal, open-domain, adversarial
  • Write method: Full conversation turn sequence stored per session
  • Query: Top-k = 5, metric is EvHit@5 (gold answer in top 5)
  • Scoring: String match on gold answer
  • Isolation: Separate namespace per question
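
As a rough sketch of the protocol described above (fresh namespace per question, turn-by-turn writes, top-k query, string match), the loop below computes EvHit@k on toy data. `ToyStore` is a hypothetical stand-in that ranks by shared words; it is not the BECOMER engine, and the example records are invented for illustration.

```python
import re
from dataclasses import dataclass, field

def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

@dataclass
class ToyStore:
    """Stand-in memory store that ranks turns by shared-word count."""
    turns: list[str] = field(default_factory=list)

    def write(self, turn: str) -> None:
        self.turns.append(turn)

    def query(self, question: str, k: int = 5) -> list[str]:
        q = _tokens(question)
        ranked = sorted(self.turns, key=lambda t: len(q & _tokens(t)), reverse=True)
        return ranked[:k]

def ev_hit_at_k(examples: list[dict], k: int = 5) -> float:
    """Fraction of questions whose gold answer appears in the top-k memories."""
    hits = 0
    for ex in examples:
        store = ToyStore()            # separate namespace per question
        for turn in ex["turns"]:      # history is written turn by turn
            store.write(turn)
        top_k = store.query(ex["question"], k=k)
        if any(ex["gold"].lower() in m.lower() for m in top_k):
            hits += 1
    return hits / len(examples)

examples = [
    {"turns": ["I moved to Lisbon in 2021.", "My cat is called Miso."],
     "question": "Where did I move in 2021?", "gold": "Lisbon"},
    {"turns": ["I work as a nurse.", "I like hiking."],
     "question": "What instrument do I play?", "gold": "cello"},
]
print(ev_hit_at_k(examples))  # 0.5
```

Creating the store inside the loop is what keeps one question's history from leaking into another's results.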

What we do NOT do

  • No LLM judge — scores are not inflated by asking a model to evaluate
  • No answer seeding — gold answers are not injected into the memory store
  • No cherry-picking — all questions in each dataset are evaluated
  • No LLM reasoning layer — memories are retrieved and matched as-is
  • No post-hoc adjustment — scores come from a single final run per phase

Iteration process

  • 30+ benchmark runs over 6 months of development
  • Each run identified specific failure patterns (listed in progression below)
  • All improvements were validated against the full dataset, not just failed cases
  • Zero regressions: every phase maintained or improved on prior scores
  • 49/49 internal unit tests pass across all components

02 — LongMemEval (LME)

LongMemEval — 94.4% overall

LongMemEval tests six categories of long-term memory: user facts, assistant-generated content, preference recall, temporal reasoning, knowledge updates, and multi-session continuity. BECOMER ties Mem0 v2 overall (94.4% each) and is roughly level with or ahead of it on three of the five categories reported below, at zero token cost.

Category                              BECOMER (0 tokens/query)   Mem0 v2, June 2026 (~6,787 tokens/query)
Overall (n=500)                       94.4%                      94.4%
Single-session user facts             ~97%                       97.1%
Knowledge update                      ~95%                       96.2%
Temporal reasoning                    ~93%                       93.2%
Multi-session continuity              ~87%                       86.5%
Single-session preference *see note   83%                        96.7%

* Single-session preference questions require inferring a behavioral preference ("the user would probably prefer...") not stated explicitly in stored content. Mem0's score here benefits from running an LLM reasoning pass over retrieved memories at query time. BECOMER returns the raw retrieved text and leaves inference to the calling application's LLM.

Honest note on LME

  • Mem0 uses LLM extraction at write time and LLM reasoning at read time (~6,787 tokens/query). Our retrieval-only path is roughly level with or ahead of it on user facts, temporal reasoning, and multi-session continuity, and about a point behind on knowledge update. The one large gap is the category that explicitly requires inference.
  • We do not treat the preference-category gap as a retrieval failure. That category requires reasoning, not retrieval. Different tools for different jobs.
  • BEAM (1M / 10M token memory scale) benchmark has not yet been run. Mem0 scores 64.1% at 1M, 48.6% at 10M on BEAM. We will publish BECOMER's BEAM score when available.

03 — LOCOMO

LOCOMO — 69.5% EvHit@5 (retrieval only)

LOCOMO is a harder benchmark covering five question types including multi-hop reasoning and adversarial memory. Mem0 scores higher here (91.6%) because it runs an LLM reasoning step over retrieved memories before answering. BECOMER scores 69.5% on retrieval alone. When you wire BECOMER's retrieved context into your own LLM reasoning step, you add the inference step that accounts for this gap, which is the intended integration pattern.

Question type               BECOMER, retrieval only (0 tokens)   Mem0 v2, retrieval + LLM reasoning (~6,956 tokens)
Overall EvHit@5 (n=1,978)   69.5%                                91.6%
Temporal                    76.9%                                92.8%
Open-domain                 73.0%                                76.0%
Adversarial                 67.0%                                not published
Single-hop *see note        62.6%                                92.3%
Multi-hop *see note         59.6%                                93.3%

* Single-hop and multi-hop LOCOMO questions often require bridging terms — e.g. "Vancouver" → "Canada" — not present in stored content. Mem0's score on these uses LLM reasoning over retrieved text to make that inference. BECOMER's retrieval returns the relevant sessions and hands reasoning to the developer's own LLM. Multi-hop baseline before query expansion was 42.7%; the improvement to 59.6% came from entity-level multi-hop retrieval in our engine.
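
To make the idea of entity-level multi-hop retrieval concrete, here is a generic two-pass sketch: entities found in the first pass's results are added to the query for a second pass. It is an illustration only and does not describe BECOMER's engine. The word-overlap ranking, the capitalized-word entity heuristic, and all names are assumptions. It only helps when the bridging entity is stored somewhere in memory, so it would not resolve a case like "Vancouver" → "Canada" where the bridge is absent.

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def top_k(query_terms: set[str], memories: list[str], k: int) -> list[str]:
    """Rank memories by shared-term count, dropping those with no overlap."""
    scored = [(len(query_terms & tokens(m)), m) for m in memories]
    scored = [(s, m) for s, m in scored if s > 0]
    scored.sort(key=lambda sm: sm[0], reverse=True)
    return [m for _, m in scored[:k]]

def two_pass_retrieve(question: str, memories: list[str], k: int = 5) -> list[str]:
    query = tokens(question)
    first = top_k(query, memories, k)
    # Crude entity guess: capitalized words in first-pass results.
    entities: set[str] = set()
    for m in first:
        entities |= {e.lower() for e in re.findall(r"\b[A-Z][a-z]+\b", m)}
    second = top_k(query | entities, memories, k)
    # Keep first-pass order, then append memories found only via entities.
    merged = first + [m for m in second if m not in first]
    return merged[:k]

memories = [
    "Maria's sister Elena moved to Porto.",
    "Elena opened a bakery called Doce.",
    "I bought a new bike last week.",
]
question = "What business did Maria's sister open?"
print(top_k(tokens(question), memories, 5))   # first hop only
print(two_pass_retrieve(question, memories))  # adds the bakery memory via "Elena"
```

The second memory shares no words with the question, so it is reachable only through the entity "Elena" surfaced by the first hop.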

Why the LOCOMO gap exists — and what it means

  • BECOMER is a retrieval layer. It finds the right memories and returns them. Your LLM reasons on top. Mem0's higher LOCOMO score comes from adding an LLM reasoning step inside their API — that's a different product design, not just a better retrieval engine.
  • The comparison is meaningful. If you integrate BECOMER's retrieved context into your own LLM call, you get the same bridging ability Mem0 has — plus you control which model reasons, at what cost, with what context window.
  • Token cost matters. Mem0 spends ~6,956 tokens per LOCOMO query. BECOMER spends 0. At 10,000 queries/day, that is the difference between $0/day and roughly $5–15/day in API costs (the GPT-4o-mini estimate in section 04) before you even write your own code.
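
The integration pattern described above, retrieval first and then your own LLM, might look like the sketch below. The `retrieve` and `llm` callables are hypothetical placeholders, not BECOMER's API or any vendor's client. The stubs exist only so the example runs without network access.

```python
from typing import Callable

def build_prompt(question: str, memories: list[str]) -> str:
    """Format retrieved memories as numbered context ahead of the question."""
    context = "\n".join(f"[{i}] {m}" for i, m in enumerate(memories, start=1))
    return (
        "Answer the question using only the memories below. "
        "If they are insufficient, say so.\n\n"
        f"Memories:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer_with_memory(
    question: str,
    retrieve: Callable[[str, int], list[str]],  # hypothetical retrieval call
    llm: Callable[[str], str],                  # whichever model client you use
    k: int = 5,
) -> str:
    memories = retrieve(question, k)
    return llm(build_prompt(question, memories))

# Stubs standing in for a real memory store and a real model.
def fake_retrieve(question: str, k: int) -> list[str]:
    return ["The user said they grew up in Vancouver."][:k]

def fake_llm(prompt: str) -> str:
    return "Canada" if "Vancouver" in prompt else "unknown"

print(answer_with_memory("Which country did I grow up in?", fake_retrieve, fake_llm))
```

Keeping the model behind a plain callable is what lets you choose which model reasons, at what cost, and with what context window.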

04 — Token Cost

What you don't pay per query

Every LLM-backed memory system consumes tokens for extraction (at write) and reasoning (at read). BECOMER does neither. Storage and retrieval run in our own engine. Your LLM never touches the memory layer unless you explicitly ask it to.

BECOMER: 0 tokens consumed per query. No LLM in the memory layer.
Mem0 v2 (June 2026): ~6,800 tokens consumed per query. LLM extraction + reasoning required.

Cost difference at 10,000 queries / day:
BECOMER memory layer cost: $0
LLM-backed memory layer cost: ~$5–15 (~6,800 tokens × GPT-4o-mini pricing)
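
The daily figure is simple arithmetic: queries per day times tokens per query times the price per token. The sketch below works it through. The rate of $0.15 per million tokens is an illustrative assumption for a small hosted model, not a quoted price, so substitute your provider's current input and output rates.

```python
def daily_cost_usd(queries_per_day: int, tokens_per_query: int,
                   usd_per_million_tokens: float) -> float:
    """Daily LLM spend for the memory layer at a flat per-token rate."""
    total_tokens = queries_per_day * tokens_per_query
    return total_tokens / 1_000_000 * usd_per_million_tokens

# ASSUMPTION: illustrative flat rate; real pricing differs for input and output tokens.
RATE = 0.15

print(round(daily_cost_usd(10_000, 6_800, RATE), 2))  # 10.2
print(daily_cost_usd(10_000, 0, RATE))                # 0.0
```

At this assumed rate, 68 million tokens a day comes to about $10, which sits inside the ~$5–15 range quoted above. A higher share of output tokens would push it up.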

05 — Development History

30+ iterations. Zero regressions.

We ran over 30 benchmark iterations during development. Every iteration maintained or improved on prior scores. The chart below shows LME overall accuracy at key milestones — from the initial 55.6% baseline to the current 94.4%.

  • 30+ benchmark runs across LME and LOCOMO
  • 49/49 internal unit tests passing across all components
  • 0 score regressions across all development phases
  • +38.8pp LME improvement from baseline (55.6%) to final (94.4%)

[Chart: LongMemEval overall accuracy at benchmark milestones. Phases: Baseline, Broad, Adaptive, Query+, Hybrid, Multi-hop. Accuracy axis: 40% to 100%. Labeled values: 55.6%, 91.0%, 94.4%.]

Phase                LME Evidence@5   LOCOMO EvHit@5   Key improvement
Baseline             55.6%            -                Starting point
Broad retrieval      91.0%            50.0%            Full corpus search with amplitude scoring
Adaptive blending    91.0%            50.0%            Query-type-aware keyword / semantic balance
Query expansion      91.0%            60.0%            Two-pass query refinement
Hybrid embedding     94.4%            -                Semantic embeddings for preference queries
Final (production)   94.4%            69.5%            Entity-level multi-hop propagation + date tagging
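
The zero-regression claim can be checked mechanically against the milestone table. The sketch below transcribes the table's values, uses `None` where no score is listed for a phase, and flags any drop between consecutive reported scores. The helper name is an illustrative choice.

```python
from typing import Optional

# (phase, LME Evidence@5, LOCOMO EvHit@5), transcribed from the table above.
phases = [
    ("Baseline", 55.6, None),
    ("Broad retrieval", 91.0, 50.0),
    ("Adaptive blending", 91.0, 50.0),
    ("Query expansion", 91.0, 60.0),
    ("Hybrid embedding", 94.4, None),
    ("Final (production)", 94.4, 69.5),
]

def regressions(scores: list[Optional[float]]) -> list[tuple[float, float]]:
    """Return (previous, current) pairs where a reported score dropped.
    Phases with no reported score are skipped."""
    reported = [s for s in scores if s is not None]
    return [(a, b) for a, b in zip(reported, reported[1:]) if b < a]

lme = [p[1] for p in phases]
locomo = [p[2] for p in phases]

print(regressions(lme), regressions(locomo))  # [] []
print(round(lme[-1] - lme[0], 1))             # 38.8
```

An empty list for both benchmarks means every listed phase held or improved on the previous reported score.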

06 — Competitor Comparison

How BECOMER compares

We only show systems that have published LongMemEval or LOCOMO scores on the same public datasets. Most memory API providers have not run these benchmarks publicly. We do not make up numbers for competitors.

System                          LME Score                LOCOMO            Tokens / query    Architecture
BECOMER (this product)          94.4%                    69.5%             0                 Pure retrieval, no LLM in memory layer
Mem0 v2 (Jun 2026)              94.4%                    91.6%             ~6,787            LLM extraction + LLM reasoning
Zep / Graphiti                  no published LME score   -                 LLM-dependent     Graph-based + LLM extraction
LangMem                         no published LME score   -                 LLM-dependent     LLM-managed memory
OpenAI Memory                   no published benchmark   -                 model-dependent   Proprietary, GPT-only
Standard RAG (embedding only)   ~75–80% typical          ~45–55% typical   0                 Vector similarity only, no consolidation

Mem0 benchmark source: Mem0 June 2026 blog / technical paper. Standard RAG estimate based on typical retrieval-only performance on these benchmarks; not a specific product claim. We update this table as competitors publish reproducible benchmark results.

What BECOMER has that others don't
  • Works with any LLM — not locked to OpenAI or any vendor
  • Your stored memories never go through a third-party LLM
  • Zero token cost at retrieval — predictable pricing
  • Encrypted at rest — even disk access reveals nothing
  • Database-level tenant isolation, not just application-layer
Where LLM-backed systems score higher
  • Inference questions — "what country is Vancouver in?" requires reasoning, not retrieval
  • Behavioral preference inference — requires synthesizing an answer, not finding one
  • Complex multi-hop chains requiring LLM-level bridging logic

These are the trade-offs. We are honest about them because the alternative — ignoring them — is worse for everyone.


07 — Honesty

What we don't claim

Overstating benchmark results is common in ML. We'd rather disclose our limitations clearly than have you discover them after building a product on top of us.

  • We do not claim LOCOMO parity with LLM-pipeline systems. The 69.5% vs 91.6% gap is real and comes from the architecture difference described above.
  • We do not run LLM judges on our own output to inflate scores. Scoring is string-based throughout: substring match + keyword overlap. Mem0's published score uses GPT-4 as the judge, which can credit paraphrased or inferred answers that string matching misses, so our 94.4% is held to a stricter standard.
  • BEAM benchmark (1M / 10M token scale) has not yet been run. We will publish the score when available. Mem0 scores 64.1% at 1M and 48.6% at 10M on BEAM.
  • Single-session-preference (LME) and single/multi-hop (LOCOMO) gaps reflect a deliberate architecture choice — not retrieval failures. Inference happens in your LLM, not ours.
  • We are not claiming to be production-equivalent to Mem0 on LOCOMO end-to-end tasks. We are a retrieval layer that spends no LLM tokens per query. The reasoning layer is yours to control.

Ready to build?
94.4% recall. Zero tokens. Any LLM.

Drop memory into your existing workflow. Two API calls. Works with whatever model you already use.
