Evaluating Embedding Strategies for Structured Knowledge Extraction

20 April 2026 · 3 min read

embeddings · knowledge-extraction · rag · evaluation · nlp

The Leaderboard Lie

I'll admit it: the first embedding model I ever used in production was whatever sat at the top of the MTEB leaderboard that week. Sorted by average score, picked the winner, deployed it, and genuinely believed I'd done my due diligence.

Retrieval quality was mediocre. Not terrible — just consistently underwhelming. Queries that seemed obvious to a human would return chunks that were adjacent to the right answer but never quite it. I spent two weeks blaming my chunking strategy before I finally looked at what MTEB actually benchmarks: clean, self-contained passages of natural language prose. Wikipedia paragraphs. News articles. Textbook sections.

My data was procurement records with nested tables, specification documents full of part numbers, and technical reports where "Section 4.2.1" carried meaning that no embedding model on that leaderboard had ever been trained to understand.

The leaderboard was answering a question I wasn't asking.

What Structured Data Does to Embeddings

When your knowledge base has structure — tables, hierarchies, key-value pairs, identifiers — embedding models start failing in specific, predictable ways.

Tables become word soup. Flatten a table to text and "Revenue: $4.2M, Q3 2024" and "Revenue: $3.8M, Q1 2023" produce nearly identical embeddings. The model sees similar tokens; the fact that they represent different time periods — the entire point — gets compressed away.

Hierarchical context vanishes. A chunk that says "Minimum: 8GB" means something completely different under the RAM section versus the storage section. The embedding captures the text. It has no idea where in the document that text lived.

Identifiers are invisible. Part numbers, postal codes, ISO standards — they carry enormous meaning for the humans searching for them and are essentially random noise to the embedding model.
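One partial fix for the table failure mode is to serialize each row into a self-contained sentence instead of flattening the whole table, so the fields that distinguish a row from its neighbors survive into the embedding. A minimal sketch — the function name and phrasing are my own, not from any particular library:

```python
def serialize_row(row: dict, table_context: str) -> str:
    """Turn one table row into a self-contained sentence.

    Each row carries its table's context plus explicit column names,
    so "Q3 2024" and "Q1 2023" rows no longer embed almost identically.
    """
    fields = ", ".join(f"{col} is {val}" for col, val in row.items())
    return f"In {table_context}: {fields}."


# Example: one row of a flattened financials table
sentence = serialize_row(
    {"Metric": "Revenue", "Value": "$4.2M", "Period": "Q3 2024"},
    "the quarterly financials table",
)
```

Whether this helps depends on your model and your queries, which is exactly the kind of thing the evaluation below the next heading is meant to settle.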

What to Do Instead of Trusting the Leaderboard

Build a small evaluation set that reflects your data. This isn't a research project — it's maybe two days of work.

The minimum viable evaluation: 50 queries pulled from your real (or realistic) query distribution, each mapped to the chunks that should be retrieved. Test 3–4 embedding models. Measure recall@5. Done.
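The recall@5 harness itself is only a few lines. This sketch assumes a `retrieve(query, k)` callable wrapping whatever model and vector store you're testing; the `EvalCase` name is illustrative:

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    query: str
    relevant_chunk_ids: set  # the chunks that *should* be retrieved


def recall_at_k(cases, retrieve, k=5):
    """Average, over queries, of the fraction of relevant chunks
    that appear in the top-k retrieved results."""
    scores = []
    for case in cases:
        retrieved = set(retrieve(case.query, k))
        found = len(retrieved & case.relevant_chunk_ids)
        scores.append(found / len(case.relevant_chunk_ids))
    return sum(scores) / len(scores)
```

Run it once per candidate model (swap the `retrieve` closure), and the comparison is a single number per model.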

But here's the thing that surprised me most: the embedding model matters less than how you chunk. I tested four models across three chunking strategies and the variance from chunking was larger than the variance from model choice every single time. A mediocre model on well-structured chunks — where tables stay intact and section headers travel with their content — outperformed the top-ranked model on naively split text.
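"Section headers travel with their content" can be as simple as prefixing every chunk with its header path while splitting on paragraph boundaries. A rough sketch, assuming you've already parsed the document into `(header_path, body)` pairs — the helper and input format are hypothetical, not from a specific library:

```python
def chunk_with_headers(sections, max_chars=800):
    """Split each section into chunks, prefixing every chunk with its
    header path so hierarchical context survives chunking.

    `sections` is a list of (header_path, body) pairs, e.g.
    ("Specs > Memory > RAM", "Minimum: 8GB\n\n...").
    """
    chunks = []
    for header_path, body in sections:
        prefix = f"[{header_path}] "
        budget = max_chars - len(prefix)
        current = ""
        # Split on blank lines only, so a paragraph (or an intact
        # table block) is never cut mid-way.
        for para in body.split("\n\n"):
            if current and len(current) + len(para) > budget:
                chunks.append(prefix + current.strip())
                current = ""
            current += para + "\n\n"
        if current.strip():
            chunks.append(prefix + current.strip())
    return chunks
```

With this scheme, "Minimum: 8GB" embeds as "[Specs > Memory > RAM] Minimum: 8GB" — a different vector from the same text under a storage header.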

The other dimension worth testing: how short queries interact with long chunks. Structured knowledge queries tend to be terse. "ISO 27001 requirements." "Component X thermal specs." Some models handle the asymmetry between a five-token query and a two-hundred-token chunk gracefully. Others collapse. You won't know until you test with your actual query lengths.
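One way to surface that asymmetry is to bucket your eval set by query length and report recall per bucket. A self-contained sketch — the threshold and names are illustrative, and `cases` here is a list of `(query, relevant_chunk_ids)` pairs:

```python
def recall_by_query_length(cases, retrieve, k=5, short_max_tokens=6):
    """Report recall@k separately for short and long queries.

    A model that looks fine on average can still collapse on terse
    queries like "ISO 27001 requirements"; this makes that visible.
    """
    def recall(subset):
        if not subset:
            return None
        per_query = []
        for query, relevant in subset:
            found = len(set(retrieve(query, k)) & relevant)
            per_query.append(found / len(relevant))
        return sum(per_query) / len(per_query)

    short = [(q, r) for q, r in cases if len(q.split()) <= short_max_tokens]
    long_ = [(q, r) for q, r in cases if len(q.split()) > short_max_tokens]
    return {"short": recall(short), "long": recall(long_)}
```

A large gap between the two buckets is a sign the model mishandles the query/chunk length asymmetry, independent of its overall average.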

The Uncomfortable Takeaway

There's no universal answer to "which embedding model for structured data." I know that's unsatisfying. But the teams I've seen get this right all did the same thing: they spent two days building a test set that reflected their actual data, tested a handful of models, and let the numbers decide. No leaderboard. No blog post recommendations. Just their queries, their documents, and recall@5.

The payoff isn't just picking the right model. It's having a baseline you can re-run every time your data changes, your queries shift, or a shiny new model drops on Hugging Face. That baseline is worth more than any benchmark someone else ran on data that looks nothing like yours.