Benchmark

LongMemEval-S Results

The standard benchmark for AI memory retrieval quality. We publish our full results, methodology, and raw data.

83.6%

Overall accuracy on LongMemEval-S, using gpt-4o-mini as the answer model

Ahead of Supermemory (81.6%), which requires a more expensive answer model to reach its score. That makes m3mory roughly 10x cheaper per query, letting you build better systems at lower cost and swap answer models with no loss of performance.

Category Breakdown

LongMemEval-S tests six categories of memory retrieval, each measuring a different aspect of how well a memory system understands, stores, and retrieves information. m3mory posts the highest overall score, and the most consistent score across categories, of any memory-as-a-service (MaaS) offering.

| Category | m3mory (gpt-4o-mini) | Supermemory (GPT-4o) | Zep (GPT-4o) | Mem0 (gpt-4.1-mini) | Full context (GPT-4o) |
|---|---|---|---|---|---|
| Overall | 83.6 | 81.6 | 71.2 | 66.4 | 60.2 |
| Single-Session User | 97.1 | 97.1 | 92.9 | 82.9 | 81.4 |
| Single-Session Assistant | 87.5 | 96.4 | 80.4 | 26.8 | 94.6 |
| Single-Session Preference | 86.7 | 70.0 | 56.7 | 90.0 | 20.0 |
| Knowledge Update | 79.5 | 88.5 | 83.3 | 66.7 | 78.2 |
| Temporal Reasoning | 80.5 | 76.7 | 62.4 | 72.2 | 45.1 |
| Multi-Session | 79.7 | 71.4 | 57.9 | 63.2 | 44.3 |

What each category measures:

  • Overall: Combined accuracy across all six categories
  • Single-Session User: Recall facts and details the user has shared in conversation
  • Single-Session Assistant: Recall information the AI agent itself provided earlier
  • Single-Session Preference: Recall the user's stated preferences, choices, and decisions
  • Knowledge Update: Track when information changed, without backups or version history, and present the latest fact
  • Temporal Reasoning: Answer questions that depend on when things happened
  • Multi-Session: Combine knowledge across multiple conversations to answer a question. The most important metric for agentic AI

Methodology

What is LongMemEval-S?

LongMemEval-S (Short) is a standardised benchmark for evaluating AI memory systems. It tests how accurately a system can store, update, and retrieve information from long-running conversations across six categories.

How we tested

Each system was given identical conversation histories and asked the same retrieval questions. m3mory used gpt-4o-mini for all LLM calls. Most competitors used GPT-4o for their answers, a significantly more expensive model.

Scoring and validation

We dogfood continuously: our own agents use our memory system and log their own issues. Scores are derived using GPT-4o as the judge model to evaluate answers produced by our gpt-4o-mini memory system (we found gpt-4o-mini was not a good judge, as it introduced too much variance). We average across at least 10 benchmark runs per parameter set. Our current parameters average 81.6% across runs, peaking at 83.6%.
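The run-averaging described above can be sketched as follows. This is a minimal sketch, not the actual harness: `answer_fn` and `judge_fn` are hypothetical stand-ins for the gpt-4o-mini answer pipeline and the GPT-4o judge call, neither of which is published here.

```python
# Sketch of judge-based scoring averaged over repeated runs.
# answer_fn(question) -> answer string (hypothetical answer pipeline)
# judge_fn(question, gold, answer) -> bool verdict (hypothetical judge call)
from statistics import mean

def score_run(questions, answer_fn, judge_fn):
    """Score one benchmark run: the judge marks each answer correct or not."""
    verdicts = [judge_fn(q["question"], q["gold"], answer_fn(q)) for q in questions]
    return sum(verdicts) / len(verdicts)  # fraction judged correct

def score_parameter_set(questions, answer_fn, judge_fn, runs=10):
    """Average several runs per parameter set to smooth out judge variance."""
    scores = [score_run(questions, answer_fn, judge_fn) for _ in range(runs)]
    return mean(scores), max(scores)  # (average, peak) as reported above
```

Reporting both the average and the peak is what distinguishes the 81.6% average from the 83.6% best run quoted above.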

We do not manually override incorrect judge verdicts, as we believe that is outside the spirit of the LongMemEval standard. If we did, our average would be 82.2% and our highest-scoring runs 84.2%.

Token efficiency

Across 500 benchmark queries, the average context returned was 688 tokens with a median of 524 tokens. The distribution tells the story:

  • 50% of queries need under 524 tokens
  • 75% need under 934 tokens
  • 95% need under 1,642 tokens

Maximum was 4,411 tokens out of a 4,500 budget. Almost nothing hits the cap. The retrieval pipeline is selective by design, returning only what matters rather than flooding the context window with noise.
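As a sketch of how distribution figures like these are derived, the snippet below computes nearest-rank percentiles over per-query context sizes. The `token_counts` data here is synthetic (a clamped log-normal stand-in), not the real benchmark measurements.

```python
# Sketch: deriving mean / p50 / p75 / p95 / max from per-query token counts.
import random
import statistics

random.seed(0)
# Synthetic stand-in for 500 measured context sizes, clamped to the 4,500 budget.
token_counts = [min(4411, max(60, int(random.lognormvariate(6.2, 0.6))))
                for _ in range(500)]

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

mean = statistics.mean(token_counts)
p50, p75, p95 = (percentile(token_counts, p) for p in (50, 75, 95))
print(f"mean={mean:.0f} p50={p50} p75={p75} p95={p95} max={max(token_counts)}")
```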

The cost advantage

We believe we are the most efficient memory system available today. m3mory delivers the highest quality context for the lowest possible cost, achieving top accuracy on gpt-4o-mini at roughly 1/10th the per-query cost of competitors that require at least GPT-4o to reach their published scores. Our proprietary retrieval pipeline does the heavy lifting, not the LLM. When you factor in our token efficiency, we are potentially up to 40x cheaper as an AI memory layer than competitors.
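A back-of-envelope sketch of that arithmetic. The per-token prices and the competitor context size below are illustrative assumptions, not measured or quoted figures; the point is how a cheaper model multiplied by a smaller context compounds.

```python
# Illustrative cost comparison: small context on a cheap model vs
# large context on an expensive model. Prices are assumed, not quoted.
GPT4O_MINI_PER_M = 0.15   # assumed USD per 1M input tokens, gpt-4o-mini
GPT4O_PER_M = 2.50        # assumed USD per 1M input tokens, GPT-4o

def query_cost(context_tokens, price_per_m):
    """Input-token cost of one query for a given context size and price."""
    return context_tokens / 1_000_000 * price_per_m

ours = query_cost(688, GPT4O_MINI_PER_M)    # our mean context on a cheap model
theirs = query_cost(1_650, GPT4O_PER_M)     # assumed competitor context on GPT-4o
print(f"${ours:.6f} vs ${theirs:.6f} per query -> {theirs / ours:.0f}x")
```

Under these assumed numbers the model swap alone gives the ~10x, and the smaller context compounds it toward the ~40x figure.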

Where we excel

With a median context of just 524 tokens per retrieval, m3mory is one of the most token-efficient memory systems available. Single-session user recall (97.1%), single-session preference (86.7%), temporal reasoning (80.5%), and multi-session (79.7%) are our strongest categories. We beat Supermemory overall (83.6% vs 81.6%) and on preference, temporal reasoning, and multi-session. Every category scores above 79%. These map directly to the most common real-world use cases. It doesn't matter which answer model you choose; our system still delivers.

Active development

We are continuously improving our retrieval pipeline. New modules are being added regularly to improve accuracy across all categories.