Building a Knowledge Graph from Unstructured Text: A Practical Guide

24 April 2026 · 3 min read
knowledge-graphs · nlp · information-extraction · rag · neo4j

When Vector Search Hits a Wall

There's a class of questions that RAG just can't answer well, and it took me embarrassingly long to recognize the pattern.

"Which suppliers provide components used in products that failed testing in Q3?" The information exists — scattered across procurement records, test reports, and product specs. But no single chunk contains the answer. You need to traverse relationships: supplier → component → product → test result → date. Vector search doesn't traverse. It matches.

That's what finally pushed me toward knowledge graphs. Not because they're elegant (though they are), but because I had users asking relational questions and a retrieval pipeline that kept returning adjacent-but-useless chunks no matter how much I tuned it.

The Part Everyone Gets Wrong First

The temptation when you start building a knowledge graph is to extract everything. Every entity, every possible relationship, every piece of metadata. Cast a wide net, figure out what's useful later.

I did this. I ended up with a graph that had 47 entity types, relationships pointing in contradictory directions, and so much noise that querying it was slower and less reliable than just reading the original documents. Three weeks of work, essentially useless.

The thing that actually worked: I sat down with a sticky note and wrote the five queries my users cared about most. Then I worked backwards to the entity types and relationships those queries required. It fit on the sticky note. Three entity types, four relationship types. That was the schema.

The sticky note test: If your schema doesn't fit on a sticky note, it's too complex for a first pass. Start with 3–5 entity types and 3–5 relationship types. You can always expand. You can't un-extract garbage.
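To make the sticky-note test concrete, here's what such a schema might look like as data, with a check that every relationship connects declared entity types. The names (Supplier, Component, and so on) are my illustration for the Q3 question from the intro, not the author's actual schema, and it runs slightly over the three-and-four counts on their sticky note:

```python
# Hypothetical sticky-note schema: entity types plus typed relationships.
SCHEMA = {
    "entities": ["Supplier", "Component", "Product", "TestResult"],
    "relations": [
        ("Supplier", "SUPPLIES", "Component"),
        ("Component", "USED_IN", "Product"),
        ("Product", "HAS_RESULT", "TestResult"),
    ],
}

def validate_schema(schema):
    """Fail fast if a relation endpoint isn't a declared entity type."""
    entity_types = set(schema["entities"])
    for src, rel, dst in schema["relations"]:
        if src not in entity_types or dst not in entity_types:
            raise ValueError(f"relation {rel} uses undeclared type: {src} -> {dst}")
    return True
```

If the whole thing fits in a dict this small, you pass the test. If you need a second screen to read it, you don't.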

Extraction Is Easy Now. Resolution Is Still Hell.

LLMs made the extraction step almost trivial. Give the model a chunk of text, your schema, and a JSON output format, and it'll pull entities and relationships with maybe 80% accuracy out of the box. Two years ago this required custom NER models and weeks of annotation. Now it's a prompt.
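A sketch of what "it's a prompt" looks like in practice, assuming a hypothetical `PROMPT_TEMPLATE` and some LLM client you already have. The part worth showing is the validation on the way back in: the model will happily invent entity types, so anything outside the schema gets dropped before it touches the graph.

```python
import json

# Hypothetical prompt; the exact wording is mine, not from the post.
PROMPT_TEMPLATE = """Extract entities and relationships from the text below.
Allowed entity types: {entity_types}
Return only JSON of the form:
{{"entities": [{{"name": "...", "type": "..."}}],
  "relations": [{{"source": "...", "type": "...", "target": "..."}}]}}

Text:
{text}"""

def parse_extraction(raw_json, allowed_entity_types):
    """Parse the model's JSON output and discard anything off-schema."""
    data = json.loads(raw_json)
    entities = [e for e in data.get("entities", [])
                if e.get("type") in allowed_entity_types]
    # Keep only relations whose endpoints survived the entity filter.
    names = {e["name"] for e in entities}
    relations = [r for r in data.get("relations", [])
                 if r.get("source") in names and r.get("target") in names]
    return entities, relations
```

That filter is doing real work: a relation pointing at a rejected entity is exactly the kind of noise that made my 47-entity-type graph useless.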

The hard part — the part that still makes me want to close my laptop — is entity resolution. The same person shows up as "J. Smith," "John Smith," "Dr. John Smith," and "Smith, J. (2024)" across different documents. Are they the same person? Are there two John Smiths? Which publications belong to which one?

I use a layered approach: fuzzy string matching catches the obvious cases, embedding similarity (on the entity name plus surrounding context) catches the subtle ones, and for the genuinely ambiguous cases, I throw them at an LLM and ask "same entity or different?" It's not fast. It's not elegant. But it gets me to about 90% accuracy, and I've learned to stop there. A graph with some duplicate nodes is infinitely more useful than no graph because you spent three months on deduplication.
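The first layer of that stack can be sketched with nothing but the standard library. This is a minimal version of the fuzzy-match pass, with the embedding and LLM layers left as a fall-through; the threshold and names are my choices, not tuned values from the post:

```python
from difflib import SequenceMatcher

def fuzzy_score(a, b):
    """Cheap string similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve(name, canonical_names, threshold=0.8):
    """Layer 1: fuzzy match against known entities.

    Returns the best canonical match above the threshold, or None,
    meaning: hand this one to the embedding layer, and if that's
    still ambiguous, ask the LLM "same entity or different?".
    """
    best, best_score = None, 0.0
    for candidate in canonical_names:
        score = fuzzy_score(name, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

Note how the layering falls out naturally: "Dr. John Smith" clears the bar against "John Smith", while "J. Smith" scores just under it and falls through to the more expensive layers. That's the whole design: spend the cheap check everywhere, the expensive ones only where the cheap one shrugs.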

The Payoff

Once the graph exists and you wire it into your LLM pipeline — graph traversal for structural queries, vector search for semantic ones, both feeding context to the model — those impossible multi-hop questions just... work. "Which suppliers provide components used in products that failed testing in Q3?" becomes a Cypher query that returns in milliseconds and gives the LLM exactly the structured context it needs.
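For flavor, here is roughly what that query might look like in Cypher, assuming the hypothetical labels and relationship types from my earlier sketches (a real schema would differ, and the Q3 date range is illustrative):

```cypher
MATCH (s:Supplier)-[:SUPPLIES]->(:Component)
      -[:USED_IN]->(:Product)-[:HAS_RESULT]->(t:TestResult)
WHERE t.status = 'FAILED'
  AND t.date >= date('2025-07-01') AND t.date < date('2025-10-01')
RETURN DISTINCT s.name
```

Four hops, one query. No amount of chunk-size tuning gets vector search to do that.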

The gap between "I have documents" and "I have queryable knowledge" is smaller than it used to be. LLMs handle extraction. The hard parts are schema discipline, entity resolution, and the patience to start small. Pick fifty documents. Build the pipeline. Get it working on the sticky-note schema. Everything else is iteration.