Benchmark Results · May 2026

Memory that performs.
Zero tokens consumed.

We benchmark against LongMemEval and LOCOMO — the two peer-reviewed, publicly available datasets used to measure long-term conversational memory. Every number on this page comes from a reproducible run on those public datasets, with methodology disclosed below.

LongMemEval — full n=500

94.4%

472 of 500 questions. Every question in the dataset evaluated. Matches Mem0 v2 (94.4%) at zero token cost — 6,787 fewer tokens per query.

LOCOMO (n=1,978)

69.5%

EvHit@5. Retrieval-only, no LLM reasoning layer. Gap to Mem0 explained below.

LLM tokens per query

0

Storage and retrieval happen in our engine. No model call on every query.

01 — Methodology

How we measured

Every number is reproducible. The datasets are public, the scoring is string-based (not LLM-judged), and we disclose where our scores fall short and why.

LongMemEval

Source: LongMemEval (ICLR 2025) — longmemeval_s_cleaned.json, same dataset used by Mem0 and other published systems
n: 500 questions, 6 question types — full dataset, no sampling
Setup: Conversation history written turn-by-turn into memory. Each question then queries that memory store. This is the standard evaluation protocol for memory retrieval systems.
Query: Top-k = 5 memories retrieved per question
Scoring: Substring match + 60% keyword overlap on the gold answer. No LLM judge.
Isolation: Separate memory namespace per question — zero cross-contamination
Result: 94.4% (472 / 500)
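
The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark harness itself. It assumes a question counts as a hit when the gold answer appears as a case-insensitive substring of any retrieved memory, or when at least 60% of the gold answer's keywords appear in a single memory. The tokenizer, the stopword list, and the names `keywords` and `is_hit` are all assumptions made for the example.

```python
import re

# Assumed stopword list; the real harness may use a different one (or none).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "at", "to", "and", "or", "is", "was"}


def keywords(text: str) -> set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def is_hit(gold: str, memories: list[str], threshold: float = 0.6) -> bool:
    """True if any memory contains the gold answer or enough of its keywords."""
    gold_kw = keywords(gold)
    for memory in memories:
        if gold.lower() in memory.lower():
            return True  # substring match
        if gold_kw and len(gold_kw & keywords(memory)) / len(gold_kw) >= threshold:
            return True  # keyword-overlap match at or above the threshold
    return False
```

Under these assumptions, `is_hit("blue mountain bike", ["bought a blue bike"])` is a hit through keyword overlap (two of three gold keywords), while the same gold answer against `["a red bike"]` is a miss.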

LOCOMO

Source: Publicly available LOCOMO dataset
Size: n = 1,978 questions (full dataset, no sampling)
Question types: Single-hop, multi-hop, temporal, open-domain, adversarial
Write method: Full conversation turn sequence stored per session
Query: Top-k = 5, metric is EvHit@5 (gold answer in top 5)
Scoring: String match on gold answer
Isolation: Separate namespace per question
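
The protocol above (write the turns, query the top 5, check for the gold answer, one namespace per question) can be sketched as a short evaluation loop. `ToyMemoryStore` is a hypothetical stand-in that ranks turns by shared words; it is not BECOMER's engine or API, and the item fields `turns`, `question`, and `answer` are assumed names.

```python
from collections import defaultdict


class ToyMemoryStore:
    """Hypothetical stand-in for a memory engine; ranks turns by shared words."""

    def __init__(self) -> None:
        self._turns: dict[str, list[str]] = defaultdict(list)

    def write(self, namespace: str, turn: str) -> None:
        self._turns[namespace].append(turn)

    def query(self, namespace: str, question: str, k: int = 5) -> list[str]:
        q_words = set(question.lower().split())
        ranked = sorted(
            self._turns[namespace],
            key=lambda t: len(q_words & set(t.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def ev_hit_at_k(items: list[dict], k: int = 5) -> float:
    """Fraction of questions whose gold answer appears in the top-k memories."""
    store = ToyMemoryStore()
    hits = 0
    for i, item in enumerate(items):
        namespace = f"question-{i}"  # fresh namespace per question
        for turn in item["turns"]:
            store.write(namespace, turn)
        top_k = store.query(namespace, item["question"], k)
        if any(item["answer"].lower() in m.lower() for m in top_k):
            hits += 1
    return hits / len(items)
```

Because each question writes to its own namespace, a memory stored for one question can never be returned for another, which is the isolation property the methodology describes.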

What we do NOT do

No LLM judge — scores are not inflated by asking a model to evaluate
No answer seeding — gold answers are not injected into the memory store
No cherry-picking — all questions in each dataset are evaluated
No LLM reasoning layer — memories are retrieved and matched as-is
No post-hoc adjustment — scores come from a single final run per phase

Iteration process

30+ benchmark runs over 6 months of development
Each run identified specific failure patterns (listed in progression below)
All improvements were validated against the full dataset, not just failed cases
Zero regressions: every phase maintained or improved on prior scores
49/49 internal unit tests pass across all components

02 — LongMemEval (LME)

LongMemEval — 94.4% overall

LongMemEval tests six categories of long-term memory: user facts, assistant-generated content, preference recall, temporal reasoning, knowledge updates, and multi-session continuity. BECOMER matches Mem0 v2 overall (94.4% each) and is ahead of it, or within about a point, on four of the five categories broken out below, all at zero token cost.

Category	BECOMER (0 tokens/query)	Mem0 v2, June 2026 (~6,787 tokens/query)
Overall (n=500)	94.4%	94.4%
Single-session user facts	~97%	97.1%
Knowledge update	~95%	96.2%
Temporal reasoning	~93%	93.2%
Multi-session continuity	~87%	86.5%
Single-session preference *see note	83%	96.7%

* Single-session preference questions require inferring a behavioral preference ("the user would probably prefer...") not stated explicitly in stored content. Mem0's score here benefits from running an LLM reasoning pass over retrieved memories at query time. BECOMER returns the raw retrieved text and leaves inference to the calling application's LLM.

Honest note on LME

Mem0 uses LLM extraction at write time and LLM reasoning at read time (~6,787 tokens/query). Our retrieval-only path is ahead of them, or within about a point, on every category reported above except the one that explicitly requires inference.
We do not treat the preference category score as a pure retrieval failure: the category requires reasoning, not retrieval. Different tools for different jobs.
BEAM (1M / 10M token memory scale) benchmark has not yet been run. Mem0 scores 64.1% at 1M, 48.6% at 10M on BEAM. We will publish BECOMER's BEAM score when available.

03 — LOCOMO

LOCOMO — 69.5% EvHit@5 (retrieval only)

LOCOMO is a harder benchmark covering five question types including multi-hop reasoning and adversarial memory. Mem0 scores higher here (91.6%) because it runs an LLM reasoning step over retrieved memories before answering. BECOMER scores 69.5% on retrieval alone. Wiring BECOMER's retrieved context into your own LLM reasoning step is the intended integration pattern, and it is designed to close this gap.

Question type	BECOMER, retrieval only (0 tokens)	Mem0 v2, retrieval + LLM reasoning (~6,956 tokens)
Overall EvHit@5 (n=1,978)	69.5%	91.6%
Temporal	76.9%	92.8%
Open-domain	73.0%	76.0%
Adversarial	67.0%	not published
Single-hop *see note	62.6%	92.3%
Multi-hop *see note	59.6%	93.3%

* Single-hop and multi-hop LOCOMO questions often require bridging terms — e.g. "Vancouver" → "Canada" — not present in stored content. Mem0's score on these uses LLM reasoning over retrieved text to make that inference. BECOMER's retrieval returns the relevant sessions and hands reasoning to the developer's own LLM. Multi-hop baseline before query expansion was 42.7%; the improvement to 59.6% came from entity-level multi-hop retrieval in our engine.

Why the LOCOMO gap exists — and what it means

BECOMER is a retrieval layer. It finds the right memories and returns them. Your LLM reasons on top. Mem0's higher LOCOMO score comes from adding an LLM reasoning step inside their API — that's a different product design, not just a better retrieval engine.
The comparison is meaningful. If you integrate BECOMER's retrieved context into your own LLM call, you get the same bridging ability Mem0 has — plus you control which model reasons, at what cost, with what context window.
Token cost matters. Mem0 spends ~6,956 tokens per LOCOMO query. BECOMER spends 0. At 10,000 queries/day, that is the difference between $0/day and roughly $5–15/day in memory-layer API costs (at GPT-4o-mini pricing), before counting any LLM calls your own application makes.
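
The integration pattern described above (retrieve first, then let your own LLM reason) might look roughly like the sketch below. Both `retrieve` and `complete` are hypothetical callables supplied by the caller; they stand in for a retrieval client and an LLM client and are not BECOMER's actual API. The prompt wording is also just an assumption.

```python
from typing import Callable


def answer_with_memory(
    question: str,
    retrieve: Callable[[str, int], list[str]],  # hypothetical retrieval client
    complete: Callable[[str], str],             # hypothetical LLM client
    k: int = 5,
) -> str:
    """Retrieve top-k memories, then let the caller's LLM reason over them."""
    memories = retrieve(question, k)
    context = "\n".join(f"- {m}" for m in memories)
    prompt = (
        "Answer the question using only the memories below.\n"
        f"Memories:\n{context}\n"
        f"Question: {question}\n"
    )
    return complete(prompt)
```

With this shape, a bridging step such as "Vancouver" to "Canada" happens inside `complete`, so the choice of model, its cost, and its context window stay with the caller.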

04 — Token Cost

What you don't pay per query

Every LLM-backed memory system consumes tokens for extraction (at write) and reasoning (at read). BECOMER does neither. Storage and retrieval run in our own engine. Your LLM never touches the memory layer unless you explicitly ask it to.

0

BECOMER
Tokens consumed per query.
No LLM in the memory layer.

~6,800

Mem0 v2 (June 2026)
Tokens consumed per query.
LLM extraction + reasoning required.

10,000 queries / day — cost difference

$0

BECOMER memory layer cost

~$5–15

LLM-backed memory layer cost
(~6,800 tokens × GPT-4o-mini pricing)
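
The daily estimate above is simple arithmetic and can be reproduced as follows. The per-token price is an assumption: the sketch uses $0.15 per million tokens as an assumed GPT-4o-mini input rate and treats every token as an input token, so substitute current pricing before relying on the result.

```python
def daily_token_cost(
    tokens_per_query: float,
    queries_per_day: int,
    usd_per_million_tokens: float,
) -> float:
    """Daily spend in USD for a memory layer that consumes LLM tokens."""
    return tokens_per_query * queries_per_day * usd_per_million_tokens / 1_000_000


# Assumed rate; check the provider's current price list.
ASSUMED_USD_PER_MILLION = 0.15

llm_backed = daily_token_cost(6_800, 10_000, ASSUMED_USD_PER_MILLION)
retrieval_only = daily_token_cost(0, 10_000, ASSUMED_USD_PER_MILLION)
print(f"LLM-backed: ${llm_backed:.2f}/day, retrieval-only: ${retrieval_only:.2f}/day")
```

At the assumed rate this comes to about $10 per day for the LLM-backed layer, inside the ~$5–15 range shown above. Output tokens are typically priced higher than input tokens, which would push the figure up.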

05 — Development History

30+ iterations. Zero regressions.

We ran over 30 benchmark iterations during development. Every iteration maintained or improved on prior scores. The chart below shows LME overall accuracy at key milestones — from the initial 55.6% baseline to the current 94.4%.

30+

Benchmark runs across LME and LOCOMO

49/49

Internal unit tests passing across all components

0

Score regressions across all development phases

+38.8pp

LME improvement from baseline (55.6%) to final (94.4%)

LongMemEval Overall Accuracy — Benchmark Milestones

Phase	LME Evidence@5	LOCOMO EvHit@5	Key improvement
Baseline	55.6%	—	Starting point
Broad retrieval	91.0%	50.0%	Full corpus search with amplitude scoring
Adaptive blending	91.0%	50.0%	Query-type-aware keyword / semantic balance
Query expansion	91.0%	60.0%	Two-pass query refinement
Hybrid embedding	94.4%	—	Semantic embeddings for preference queries
Final (production)	94.4%	69.5%	Entity-level multi-hop propagation + date tagging
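
The "zero regressions" and "+38.8pp" figures can be checked directly against the milestone table. The sketch below copies the table's values and skips the cells marked with a dash; the names `MILESTONES` and `has_regression` are illustrative only.

```python
from typing import Optional

# (phase, LME score, LOCOMO score or None where the table shows a dash)
MILESTONES = [
    ("Baseline", 55.6, None),
    ("Broad retrieval", 91.0, 50.0),
    ("Adaptive blending", 91.0, 50.0),
    ("Query expansion", 91.0, 60.0),
    ("Hybrid embedding", 94.4, None),
    ("Final (production)", 94.4, 69.5),
]


def has_regression(scores: list[Optional[float]]) -> bool:
    """True if any reported score is lower than the previous reported one."""
    reported = [s for s in scores if s is not None]
    return any(later < earlier for earlier, later in zip(reported, reported[1:]))


lme = [m[1] for m in MILESTONES]
locomo = [m[2] for m in MILESTONES]
lme_gain_pp = round(lme[-1] - lme[0], 1)  # percentage points, baseline to final
```

Both series are non-decreasing across the listed milestones, and the LME gain works out to 38.8 percentage points. The check covers only the six milestones in the table, not all 30+ runs.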

06 — Competitor Comparison

How BECOMER compares

We only show systems that have published LongMemEval or LOCOMO scores on the same public datasets. Most memory API providers have not run these benchmarks publicly. We do not make up numbers for competitors.

System	LME Score	LOCOMO	Tokens / query	Architecture
BECOMER (this product)	94.4%	69.5%	0	Pure retrieval, no LLM in memory layer
Mem0 v2 (Jun 2026)	94.4%	91.6%	~6,787	LLM extraction + LLM reasoning
Zep / Graphiti	no published LME score	—	LLM-dependent	Graph-based + LLM extraction
LangMem	no published LME score	—	LLM-dependent	LLM-managed memory
OpenAI Memory	no published benchmark	—	model-dependent	Proprietary, GPT-only
Standard RAG (embedding only)	~75–80% typical	~45–55% typical	0	Vector similarity only, no consolidation

Mem0 benchmark source: Mem0 June 2026 blog / technical paper. Standard RAG estimate based on typical retrieval-only performance on these benchmarks; not a specific product claim. We update this table as competitors publish reproducible benchmark results.

What BECOMER has that others don't

Works with any LLM — not locked to OpenAI or any vendor
Your stored memories never go through a third-party LLM
Zero token cost at retrieval — predictable pricing
Encrypted at rest — even disk access reveals nothing
Database-level tenant isolation, not just application-layer

Where LLM-backed systems score higher

Inference questions — "what country is Vancouver in?" requires reasoning, not retrieval
Behavioral preference inference — requires synthesizing an answer, not finding one
Complex multi-hop chains requiring LLM-level bridging logic

These are the trade-offs. We are honest about them because the alternative — ignoring them — is worse for everyone.

07 — Honesty

What we don't claim

Overstating benchmark results is common in ML. We'd rather disclose our limitations clearly than have you discover them after building a product on top of us.

We do not claim LOCOMO parity with LLM-pipeline systems. The 69.5% vs 91.6% gap is real and comes from the architecture difference described above.
We do not run LLM judges on our own output to inflate scores. Scoring is string-based throughout.
BEAM benchmark (1M / 10M token scale) has not yet been run. We will publish the score when available. Mem0 scores 64.1% at 1M and 48.6% at 10M on BEAM.
Scoring is string-based, not LLM-judged. We use substring match + keyword overlap. Mem0's published score uses GPT-4 as the judge, which can credit paraphrased or inferred answers that string matching misses. Our 94.4% was earned under a stricter standard: no model grades our output.
Single-session-preference (LME) and single/multi-hop (LOCOMO) gaps reflect a deliberate architecture choice — not retrieval failures. Inference happens in your LLM, not ours.
We are not claiming to be production-equivalent to Mem0 on LOCOMO end-to-end tasks. We are a retrieval layer. We retrieve better per token spent. The reasoning layer is yours to control.

Ready to build?

94.4% on LongMemEval. Zero tokens. Any LLM.

Drop memory into your existing workflow. Two API calls. Works with whatever model you already use.
