With the rise of large language models (LLMs), our exposure to benchmarks, not to mention their sheer number and variety, has surged. Given the opaque nature of LLMs and other AI systems, benchmarks have become the standard way to compare their performance.
These are standardized tests or data sets that measure how well models perform on specific tasks. As a result, each new model release brings updated leaderboard results, and embedding models are no exception.
Today, embeddings power the search layer of AI applications, yet choosing the right model remains difficult. The Massive Text Embedding Benchmark (MTEB), released in 2022, has become the standard for evaluating embeddings, but it's a broad, general-purpose benchmark covering many tasks unrelated to retrieval.
MTEB also uses public data sets, and while this promotes transparency, it can lead to overfitting, with models being trained on evaluation data. As a result, MTEB scores don't always reflect real-world retrieval accuracy.
The Retrieval Embedding Benchmark (RTEB), a new retrieval-first benchmark, addresses these limitations by focusing on real-world retrieval tasks and using both open and private data sets to better reflect true generalization to new, unseen data. Let's explore RTEB, its focus, its data sets and how to use it.
How Are Embeddings Benchmarked?
Before diving into RTEB, it's important to understand benchmarks and why they matter. Because AI models like embedding models are black boxes, assessing their quality is challenging. A benchmark is a standardized set of tasks used to evaluate those models. Benchmarks help measure performance, identify areas for improvement and compare results against a baseline, other models or past performance.

Figure 1. Example RTEB results of benchmarked embedding models.
Building effective benchmarks is not trivial. Data sets and task definitions must reflect real-world usage to enable meaningful comparisons. However, many benchmarks fail at this, using data sets that don't represent actual use cases, which leads to results that don't translate to real applications.
Another major issue is overfitting. Since benchmark data sets are usually public, models often end up trained, intentionally or not, on evaluation data. This leads to inflated benchmark scores that don't reflect actual generalization to unseen data.
Beyond these concerns, benchmark coverage is also crucial. For example, MTEB, the most popular benchmark for evaluating embedding model accuracy, spans eight distinct task categories. While this broad coverage is useful for general comparison, it can be misleading if you care about performance on specific use cases. In practice, you should focus on benchmarks or tasks that align closely with your intended applications.
RTEB: A New Retrieval-Focused Benchmark
While embedding models can be used for many tasks, their most common production use case today is retrieval: powering search, enabling Retrieval-Augmented Generation (RAG) systems and matching queries to relevant documents.
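As a rough illustration of that retrieval step, here is a minimal sketch that ranks documents against a query by cosine similarity of their embeddings. The `embed` function is a hypothetical stand-in for whatever embedding model you choose; everything else is plain NumPy.

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in for a real embedding model (e.g., an open or API-based
    encoder); returns one vector per input text. Random vectors here
    keep the sketch self-contained."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

def retrieve(query: str, documents: list[str], top_k: int = 3) -> list[tuple[str, float]]:
    """Rank documents by cosine similarity to the query embedding."""
    doc_vecs = embed(documents)
    query_vec = embed([query])[0]
    doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    query_vec /= np.linalg.norm(query_vec)
    scores = doc_vecs @ query_vec
    top = np.argsort(-scores)[:top_k]
    return [(documents[i], float(scores[i])) for i in top]
```

In a RAG system, the documents returned by a function like this would be passed to the LLM as context for answering the query.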
This is why the Retrieval Embedding Benchmark was created. RTEB is a new benchmark focused specifically on retrieval tasks. It builds on MTEB by providing a retrieval-focused evaluation framework designed to accurately measure the real retrieval accuracy of embedding models through:
- A hybrid approach: RTEB combines public data sets (some shared with MTEB) and private ones. This prevents overfitting, aka "teaching to the test," ensuring models aren't trained on evaluation data. The inclusion of private data sets provides a more accurate measure of generalization to unseen data.
- Real-world and multilingual coverage: RTEB spans key enterprise domains such as finance, healthcare and code, and evaluates retrieval in over 20 languages. These data sets better represent the use cases found in enterprises today.

Figure 2. RTEB overview.
The accuracy for each data set task, measured using Normalized Discounted Cumulative Gain at rank 10 (nDCG@10), is used to rank the models, producing a rank per task. This metric is preferred for measuring retrieval accuracy because it captures both relevance and ranking quality, aligning closely with how humans perceive search results.
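For intuition, here is a minimal sketch of how nDCG@10 can be computed for a single query from graded relevance judgments, using the common linear-gain formulation. The relevance values are illustrative; this is not RTEB's actual evaluation code.

```python
import math

def dcg(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances: list[float], k: int = 10) -> float:
    """nDCG@k: DCG of the model's ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of the documents a model returned, in the order it returned them.
print(ndcg_at_k([3, 0, 2, 1, 0, 0, 1, 0, 0, 0]))  # ~0.92 for this example
```

A perfect ranking scores 1.0; mistakes near the top of the list cost more than mistakes further down, which is why the metric tracks how users actually experience search results.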
These ranks are then combined using the Borda count to determine the final leaderboard ranking. The average of task scores is not used directly because raw scores differ across tasks; some data sets have larger or smaller score ranges, which can unbalance the average. The Borda count normalizes these scale differences and emphasizes relative performance, providing a fairer comparison across tasks.
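Here is a minimal sketch of a Borda-style aggregation over per-task scores. The model names and scores are made up, and this is not the leaderboard's actual code; it only shows why the raw score scale of any one task stops mattering once scores are converted to ranks.

```python
def borda_rank(per_task_scores: dict[str, dict[str, float]]) -> list[tuple[str, int]]:
    """Combine per-task nDCG@10 scores into one ranking via Borda count.

    per_task_scores maps task name -> {model name -> score}. On each task a
    model earns points equal to the number of models it outranks, so only
    relative order within a task matters.
    """
    points: dict[str, int] = {}
    for scores in per_task_scores.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        for position, model in enumerate(ranked):
            points[model] = points.get(model, 0) + (len(ranked) - 1 - position)
    return sorted(points.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores for three models on two tasks.
tasks = {
    "finance": {"model-a": 0.71, "model-b": 0.68, "model-c": 0.64},
    "code":    {"model-a": 0.62, "model-b": 0.60, "model-c": 0.41},
}
print(borda_rank(tasks))  # [('model-a', 4), ('model-b', 2), ('model-c', 0)]
```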
Navigating RTEB in MTEB
The RTEB leaderboard is available under the Retrieval section of the MTEB leaderboard on Hugging Face.

Figure 3. RTEB in MTEB.
In addition to the main ranking, a few other parameters are important to take into account when consuming the RTEB leaderboard:
- Embedding dimensions: This is the length of the embedding vector. Smaller embeddings offer faster inference and lower storage costs, while larger ones can capture more nuanced relationships in the data. The goal is to balance semantic depth with computational efficiency (see the storage sketch after this list).
- Max tokens: This is the maximum number of tokens that can be converted into a single embedding. The right value depends on your data's structure and chunking strategy. Larger token limits enable embedding longer text segments.
- Number of parameters (when available): This represents the model's size. More parameters generally correlate with higher accuracy, but also greater latency and resource needs. Proprietary models may not disclose exact sizes, but often provide options such as "small," "lite" or "large," with different pricing to match your needs.
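To make the dimension tradeoff concrete, here is a back-of-the-envelope sketch of the raw vector storage a corpus would need at different embedding sizes. It assumes float32 values and ignores index structures and metadata; the corpus size and dimensions are illustrative.

```python
def vector_storage_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw storage for float32 embeddings, excluding index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value / 1e9

# Storage for 10 million chunks at common embedding sizes.
for dims in (384, 768, 1536, 3072):
    print(f"{dims:>5} dims -> {vector_storage_gb(10_000_000, dims):.1f} GB")
# 384 dims -> 15.4 GB, 768 -> 30.7 GB, 1536 -> 61.4 GB, 3072 -> 122.9 GB
```

Doubling the embedding dimension doubles both storage and the per-query similarity compute, which is why a smaller model that scores well on the tasks you care about can be the better choice.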
Subsets of RTEB are available for different domains and language categories, offering focused insights into each model's performance in specific areas. These can be accessed under the Retrieval section of MTEB on Hugging Face.
RTEB is an important step forward in evaluating embedding models for retrieval. Its hybrid mix of public and private data sets to prevent overfitting, along with its focus on real-world enterprise domains and multilingual coverage, makes it a more accurate and practical tool for developers evaluating different embedding models.