We benchmark against LongMemEval and LOCOMO — the two peer-reviewed, publicly available datasets used to measure long-term conversational memory. Every number on this page comes from a reproducible run on those public datasets, with methodology disclosed below.
472 of 500 questions. Every question in the dataset evaluated. Matches Mem0 v2 (94.4%) at zero token cost — 6,787 fewer tokens per query.
LOCOMO (n=1,978)
69.5%
EvHit@5. Retrieval-only, no LLM reasoning layer. Gap to Mem0 explained below.
LLM tokens per query
0
Storage and retrieval happen in our engine. No model call on every query.
01 — Methodology
How we measured
Every number is reproducible. The datasets are public, the scoring is string-based (not LLM-judged), and we disclose where our scoring underperforms and why.
LongMemEval
Source: LongMemEval (ICLR 2025) — longmemeval_s_cleaned.json, same dataset used by Mem0 and other published systems
n: 500 questions, 6 question types — full dataset, no sampling
Setup: Conversation history written turn-by-turn into memory. Each question then queries that memory store. This is the standard evaluation protocol for memory retrieval systems.
Query: Top-k = 5 memories retrieved per question
Scoring: Substring match + 60% keyword overlap on the gold answer. No LLM judge.
Isolation: Separate memory namespace per question — zero cross-contamination
Result: 94.4% (472 / 500)
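The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the actual scorer: it assumes a question counts as correct when the gold answer is a substring of any retrieved memory, or when at least 60% of the gold answer's keywords appear in it. The either/or combination, the normalization, and the stopword list are all assumptions, since the page does not spell them out.

```python
import re

# Hypothetical stopword list; the page does not say which words are ignored.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "is", "was", "and", "or"}


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def keywords(text: str) -> set[str]:
    """Distinct normalized tokens, minus stopwords."""
    return {tok for tok in normalize(text).split() if tok not in STOPWORDS}


def is_hit(gold: str, memory: str, overlap_threshold: float = 0.6) -> bool:
    """True on a substring match, or on keyword overlap at or above the threshold."""
    gold_norm, mem_norm = normalize(gold), normalize(memory)
    if gold_norm and gold_norm in mem_norm:
        return True
    gold_kw = keywords(gold)
    if not gold_kw:
        return False
    overlap = len(gold_kw & keywords(memory)) / len(gold_kw)
    return overlap >= overlap_threshold


def score_question(gold: str, retrieved: list[str]) -> bool:
    """A question is correct if any of the top-k retrieved memories is a hit."""
    return any(is_hit(gold, memory) for memory in retrieved)
```

Because no model is involved, the same inputs always produce the same score, which is what makes a run reproducible.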
LOCOMO
Source: Publicly available LOCOMO dataset
Size: n = 1,978 questions (full dataset, no sampling)
Write method: Full conversation turn sequence stored per session
Query: Top-k = 5, metric is EvHit@5 (gold answer in top 5)
Scoring: String match on gold answer
Isolation: Separate namespace per question
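As a rough illustration of the metric, EvHit@5 can be computed as the fraction of questions whose gold answer appears in at least one of the top five retrieved memories. The `QuestionResult` type and the case-insensitive substring check below are assumptions made for this sketch; the page only states "string match on gold answer".

```python
from dataclasses import dataclass


@dataclass
class QuestionResult:
    """Hypothetical container for one evaluated question."""
    gold: str
    ranked_memories: list[str]  # retrieval output, best match first


def hit_at_k(result: QuestionResult, k: int = 5) -> bool:
    """True if the gold answer appears in any of the first k memories."""
    gold = result.gold.lower()
    return any(gold in memory.lower() for memory in result.ranked_memories[:k])


def ev_hit_at_k(results: list[QuestionResult], k: int = 5) -> float:
    """Fraction of questions with a hit in the top k (0.0 for an empty list)."""
    if not results:
        return 0.0
    return sum(hit_at_k(r, k) for r in results) / len(results)
```

A memory ranked sixth or lower does not count toward the score, even if it contains the gold answer.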
What we do NOT do
No LLM judge — scores are not inflated by asking a model to evaluate
No answer seeding — gold answers are not injected into the memory store
No cherry-picking — all questions in each dataset are evaluated
No LLM reasoning layer — memories are retrieved and matched as-is
No post-hoc adjustment — scores come from a single final run per phase
Iteration process
30+ benchmark runs over 6 months of development
Each run identified specific failure patterns (listed in progression below)
All improvements were validated against the full dataset, not just failed cases
Zero regressions: every phase maintained or improved on prior scores
49/49 internal unit tests pass across all components
02 — LongMemEval (LME)
LongMemEval — 94.4% overall
LongMemEval tests six categories of long-term memory: user facts, assistant-generated content, preference recall, temporal reasoning, knowledge updates, and multi-session continuity. BECOMER matches Mem0 v2 overall (94.4% each) and matches or exceeds it on four of six categories, at zero token cost.
BECOMER (0 tokens/query)
Mem0 v2 — June 2026 (~6,787 tokens/query)
Overall (n=500): BECOMER 94.4%, Mem0 v2 94.4%
Single-session user facts: BECOMER ~97%, Mem0 v2 97.1%
Knowledge update: BECOMER ~95%, Mem0 v2 96.2%
Temporal reasoning: BECOMER ~93%, Mem0 v2 93.2%
Multi-session continuity: BECOMER ~87%, Mem0 v2 86.5%
Single-session preference*: BECOMER 83%, Mem0 v2 96.7%
* Single-session preference questions require inferring a behavioral preference ("the user would probably prefer...") not stated explicitly in stored content. Mem0's score here benefits from running an LLM reasoning pass over retrieved memories at query time. BECOMER returns the raw retrieved text and leaves inference to the calling application's LLM.
Honest note on LME
Mem0 uses LLM extraction at write time and LLM reasoning at read time (~6,787 tokens/query). Our retrieval-only path matches or beats them on four of six categories; the largest gap is in the one category that explicitly requires inference.
We do not claim the preference category score is a pure retrieval failure: it requires reasoning, not retrieval. Different tools for different jobs.
BEAM (1M / 10M token memory scale) benchmark has not yet been run. Mem0 scores 64.1% at 1M, 48.6% at 10M on BEAM. We will publish BECOMER's BEAM score when available.
03 — LOCOMO
LOCOMO — 69.5% EvHit@5 (retrieval only)
LOCOMO is a harder benchmark covering five question types including multi-hop reasoning and adversarial memory. Mem0 scores higher here (91.6%) because it runs an LLM reasoning step over retrieved memories before answering. BECOMER scores 69.5% on retrieval alone. When you wire BECOMER's retrieved context into your own LLM reasoning step, you close this gap — which is the intended integration pattern.
* Single-hop and multi-hop LOCOMO questions often require bridging terms — e.g. "Vancouver" → "Canada" — not present in stored content. Mem0's score on these uses LLM reasoning over retrieved text to make that inference. BECOMER's retrieval returns the relevant sessions and hands reasoning to the developer's own LLM. Multi-hop baseline before query expansion was 42.7%; the improvement to 59.6% came from entity-level multi-hop retrieval in our engine.
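The integration pattern described above (retrieve first, then let your own model reason) might look roughly like the following. This is a sketch under assumptions: `retrieve` and `llm` are placeholder callables standing in for a memory-retrieval call and for whatever model client you use, and the prompt wording is illustrative. None of these names come from BECOMER's API.

```python
from typing import Callable

# Placeholder signatures (assumptions, not a real API):
#   retrieve(question, top_k) -> list of memory strings
#   llm(prompt) -> answer string
Retriever = Callable[[str, int], list[str]]
LLM = Callable[[str], str]


def build_prompt(question: str, memories: list[str]) -> str:
    """Put the retrieved memories in front of the question as plain context."""
    context = "\n".join(f"- {memory}" for memory in memories)
    return (
        "Answer the question using only the memories below. "
        "If they are insufficient, say so.\n\n"
        f"Memories:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_with_memory(question: str, retrieve: Retriever, llm: LLM, top_k: int = 5) -> str:
    """Retrieval happens without a model call; reasoning happens in the caller's LLM."""
    memories = retrieve(question, top_k)
    return llm(build_prompt(question, memories))
```

With this split, a bridging inference such as "Vancouver" to "Canada" is made by the model you choose, over the text the retrieval layer returned.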
Why the LOCOMO gap exists — and what it means
BECOMER is a retrieval layer. It finds the right memories and returns them. Your LLM reasons on top. Mem0's higher LOCOMO score comes from adding an LLM reasoning step inside their API — that's a different product design, not just a better retrieval engine.
The comparison is meaningful. If you integrate BECOMER's retrieved context into your own LLM call, you can get bridging ability comparable to Mem0's, and you control which model reasons, at what cost, and with what context window.
Token cost matters. Mem0 spends ~6,956 tokens per LOCOMO query. BECOMER spends 0. At 10,000 queries/day, that's the difference between $0/day and $3–5/day in API costs before you even write your own code.
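The arithmetic behind a daily cost figure like this is simple enough to check yourself. The sketch below uses the token count and query volume quoted above; the price per million tokens is a placeholder that you should replace with your own provider's rate, since the page does not state which rate it assumes.

```python
def daily_tokens(tokens_per_query: int, queries_per_day: int) -> int:
    """Total LLM tokens consumed per day by the memory layer."""
    return tokens_per_query * queries_per_day


def daily_cost_usd(tokens_per_query: int, queries_per_day: int,
                   usd_per_million_tokens: float) -> float:
    """Daily spend at a given blended price per million tokens."""
    return daily_tokens(tokens_per_query, queries_per_day) / 1_000_000 * usd_per_million_tokens


def implied_usd_per_million(daily_cost: float, tokens_per_query: int,
                            queries_per_day: int) -> float:
    """Blended price per million tokens that a given daily cost implies."""
    return daily_cost / (daily_tokens(tokens_per_query, queries_per_day) / 1_000_000)


if __name__ == "__main__":
    # Figures quoted on this page: ~6,956 tokens per query, 10,000 queries/day.
    print(f"{daily_tokens(6_956, 10_000):,} tokens/day")  # 69,560,000 tokens/day
    # A system that spends no tokens per query costs nothing at any price.
    print(daily_cost_usd(0, 10_000, usd_per_million_tokens=1.0))  # 0.0
```

Under this formula, a $3–5/day range corresponds to a blended price of roughly $0.04 to $0.07 per million tokens; at other rates the daily figure scales linearly.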
04 — Token Cost
What you don't pay per query
Every LLM-backed memory system consumes tokens for extraction (at write) and reasoning (at read). BECOMER does neither. Storage and retrieval run in our own engine. Your LLM never touches the memory layer unless you explicitly ask it to.
0
BECOMER Tokens consumed per query. No LLM in the memory layer.
We ran over 30 benchmark iterations during development. Every iteration maintained or improved on prior scores. The chart below shows LME overall accuracy at key milestones — from the initial 55.6% baseline to the current 94.4%.
30+
Benchmark runs across LME and LOCOMO
49/49
Internal unit tests passing across all components
0
Score regressions across all development phases
+38.8pp
LME improvement from baseline (55.6%) to final (94.4%)
We only show systems that have published LongMemEval or LOCOMO scores on the same public datasets. Most memory API providers have not run these benchmarks publicly. We do not make up numbers for competitors.
System | LME Score | LOCOMO | Tokens / query | Architecture
BECOMER (this product) | 94.4% | 69.5% | 0 | Pure retrieval, no LLM in memory layer
Mem0 v2 (Jun 2026) | 94.4% | 91.6% | ~6,787 | LLM extraction + LLM reasoning
Zep / Graphiti | no published LME score | - | LLM-dependent | Graph-based + LLM extraction
LangMem | no published LME score | - | LLM-dependent | LLM-managed memory
OpenAI Memory | no published benchmark | - | model-dependent | Proprietary, GPT-only
Standard RAG (embedding only) | ~75–80% typical | ~45–55% typical | 0 | Vector similarity only, no consolidation
Mem0 benchmark source: Mem0 June 2026 blog / technical paper. Standard RAG estimate based on typical retrieval-only performance on these benchmarks; not a specific product claim. We update this table as competitors publish reproducible benchmark results.
What BECOMER has that others don't
Works with any LLM — not locked to OpenAI or any vendor
Your stored memories never go through a third-party LLM
Zero token cost at retrieval — predictable pricing
Encrypted at rest — even disk access reveals nothing
Database-level tenant isolation, not just application-layer
Where LLM-backed systems score higher
Inference questions — "what country is Vancouver in?" requires reasoning, not retrieval
Behavioral preference inference — requires synthesizing an answer, not finding one
These are the trade-offs. We are honest about them because the alternative — ignoring them — is worse for everyone.
07 — Honesty
What we don't claim
Overstating benchmark results is common in ML. We'd rather disclose our limitations clearly than have you discover them after building a product on top of us.
We do not claim LOCOMO parity with LLM-pipeline systems. The 69.5% vs 91.6% gap is real and comes from the architecture difference described above.
We do not run LLM judges on our own output to inflate scores. Scoring is string-based throughout.
BEAM benchmark (1M / 10M token scale) has not yet been run. We will publish the score when available. Mem0 scores 64.1% at 1M and 48.6% at 10M on BEAM.
Scoring is string-based, not LLM-judged. We use substring match + keyword overlap. Mem0's published score uses GPT-4 as the judge, which can evaluate paraphrased or inferred answers that string matching misses. Our 94.4% is a harder standard to hit — no model grades our own output.
Single-session-preference (LME) and single/multi-hop (LOCOMO) gaps reflect a deliberate architecture choice — not retrieval failures. Inference happens in your LLM, not ours.
We are not claiming to be production-equivalent to Mem0 on LOCOMO end-to-end tasks. We are a retrieval layer. We retrieve better per token spent. The reasoning layer is yours to control.
Ready to build?
94.4% recall. Zero tokens. Any LLM.
Drop memory into your existing workflow. Two API calls. Works with whatever model you already use.