Data Architecture & Engineering June 19, 2026 · 24 min read

Sourcing the Graph: Building Knowledge from Structured and Unstructured Data

Most enterprise data is split between structured systems with stable schemas and unstructured documents with no schema at all. A knowledge graph has to ingest both. This article gives the working design for the two construction tracks: deterministic mapping for relational and tabular sources (R2RML, Direct Mapping, RML, virtualization vs materialization), and probabilistic extraction for unstructured text (end-to-end relation extraction, LLM-assisted graph indexing, incremental construction, schema-first vs schema-emergent). It covers where the two tracks meet, the failure modes specific to construction, and the decision tree for picking a sourcing strategy. Part 6 of the Knowledge Graph Practitioner's Guide.

By Vikas Pratap Singh


Two Million Useless Triples

The bank in this section is an illustrative composite drawn from publicly documented enterprise KG failure patterns, and the figures are illustrative rather than measured. A regional commercial bank decides to build its first knowledge graph. The leadership ask is concrete: a relationship-banker chatbot that can answer “what is our exposure to this counterparty across loans, deposits, derivatives, and contingent commitments.” The data lives in three places. A relational CRM holds counterparty master records. A relational warehouse holds the transaction and exposure tables. Four years of credit memos, board minutes, and counterparty risk assessments live in a document store of PDFs.

The first attempt is what an enthusiastic engineer always proposes: point an LLM at every source, ask it to “extract all entities and relationships,” and load the result. After three weeks the team has on the order of two million triples. When the chatbot is wired to it, the answers are confidently wrong. The graph asserts that one customer “owns” a loan officer because the credit memo said the officer “owned the relationship.” It conflates the parent entity Acme Holdings with the subsidiary Acme Manufacturing because the LLM treated bare mentions of “Acme” as the same node. Counterparty A appears with three different IRIs (one per source), so exposure aggregations are wrong. Triple count is impressive. Usable coverage is a small fraction, on the order of ten percent.

The team backs up. They split the work into two tracks. For the structured sources they write a W3C R2RML mapping that produces RDF deterministically from the CRM and warehouse tables, with stable IRIs minted from internal master IDs. For the unstructured sources they design a tightly scoped LLM extraction pipeline with a fixed predicate vocabulary, an explicit entity-resolution step, and a SHACL validation pass before triples enter the graph. A few months later the graph is far smaller and the usable share is most of it. The chatbot’s exposure aggregations now match the warehouse to the cent.

This article is about the design choices that produce the second outcome instead of the first. The runtime layer in Part 5 covered identity, reference, and inference assuming the graph already exists. This article covers how the graph gets built. The vocabulary work in Part 4 gave you a way to describe what kinds of things exist. This article covers how the things get into the graph in the first place.

Two Tracks, Different Engineering

There is no single construction pipeline. Production knowledge graphs have two construction tracks because enterprise data has two shapes: structured sources with stable schemas, and unstructured sources with no schema at all. Each track has its own toolchain, its own failure modes, and its own provenance discipline. Treating them as one problem produces the low-coverage graph in the anecdote above.

Track | Source shape | Construction style | Output guarantee | Cost shape
--- | --- | --- | --- | ---
Structured to RDF | Relational tables, CSV, JSON, XML, APIs with schemas | Deterministic mapping (R2RML, RML, Direct Mapping) | Same input always produces same triples; auditable | Engineering cost upfront, near-zero per ingest
Unstructured to RDF | PDFs, emails, contracts, transcripts, web pages | Probabilistic extraction (NLP pipelines, LLM extraction, LLM-assisted graph indexing) | Output varies by model, prompt, run; requires validation | Per-document compute cost; scales with corpus size

The two tracks meet at three places: the entity resolution layer (the same Acme Corp must get one IRI regardless of source), the validation layer (SHACL shapes apply to both), and the provenance layer (Part 7 covers how every triple records where it came from). Until they meet they are independent engineering problems.

A diagram showing the two construction tracks of a knowledge graph. The left panel shows Track 1 (structured to RDF) with three source types (relational CRM, relational warehouse, JSON/CSV/API) flowing into a mapping layer (R2RML for relational, Direct Mapping for prototypes, RML for CSV/JSON/XML), then branching into materialize or virtualize architecture choices. The right panel shows Track 2 (unstructured to RDF) with three source types (credit memos PDF, contracts and board minutes, news and transcripts) flowing into an extraction layer (classic NLP, end-to-end relation extraction, LLM-assisted extraction pipelines), then branching into schema-first, schema-adaptive, or schema-emergent approaches. At the bottom both tracks converge into entity resolution, SHACL validation gate, provenance attachment, and a unified knowledge graph.

What this looks like in practice. A useful planning assumption is that a bank’s first KG releases are roughly two-thirds structured-track triples (master data, transactions, organizational hierarchy from HR systems) and one-third unstructured-track triples (credit memo claims, regulatory commitments, third-party news); treat the split as illustrative, not measured. The balance flips in pharma and legal where most institutional value is in documents. The right starting question is “what fraction of source value is structured vs unstructured” before any tool selection.

Track 1: Structured to RDF

The W3C ratified two RDB-to-RDF Recommendations in September 2012: Direct Mapping and R2RML. These have been the production standard for moving relational data into RDF for over a decade. The 2024 RML specification at rml.io extends the same mapping discipline to CSV, JSON, and XML sources. If a source has a schema, one of these languages is your tool.

Direct Mapping: the zero-config option

Direct Mapping is the trivial case. Every table becomes a class. Every row becomes a resource with an IRI derived from the primary key. Every column becomes a predicate with the column name. Foreign keys become typed relationships. No mapping file is written by hand; the algorithm is fully specified by the W3C and implementations are interchangeable.

Direct Mapping is correct, fast, and almost never what you want for a production KG. The output uses table and column names as predicate IRIs (http://example.com/db/CUST_TBL#FST_NM), which means your graph schema is your database schema. Refactor the database and the graph schema breaks. Use Direct Mapping for prototypes and for sanity checks. Do not use it as the production interface.
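
To make the coupling concrete, here is a minimal Python sketch of the row-level part of Direct Mapping. It is an illustration under stated assumptions, not a conforming implementation: foreign-key reference triples and datatype handling are omitted, and the sample row values are hypothetical. Note how the table and column names leak straight into the IRIs.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def direct_map_row(base, table, pk_cols, row):
    """Return (subject, predicate, object) triples for one row, Direct Mapping style."""
    t = quote(table, safe="")
    # Row IRI: <base><table>/<pk column>=<value>, with ";" between composite key parts.
    key = ";".join(f"{quote(c, safe='')}={quote(str(row[c]), safe='')}" for c in pk_cols)
    subject = f"{base}{t}/{key}"
    triples = [(subject, RDF_TYPE, f"{base}{t}")]      # the table becomes the class
    for col, value in row.items():
        if value is None:                              # NULL columns produce no triple
            continue
        triples.append((subject, f"{base}{t}#{quote(col, safe='')}", value))
    return triples

# Hypothetical sample row.
row = {"CUST_ID": 12345, "FST_NM": "Ada", "LST_NM": None}
for triple in direct_map_row("http://example.com/db/", "CUST_TBL", ["CUST_ID"], row):
    print(triple)
```

Renaming FST_NM in the database changes the predicate IRI in every emitted triple, which is exactly the coupling the paragraph above warns about.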

R2RML: the customized mapping

R2RML lets a mapping author define how relational tables turn into RDF using a target vocabulary the author controls. An R2RML document is itself an RDF graph (in Turtle), which is one of the language’s quietly elegant features: your transformation logic is queryable.

The atomic unit of R2RML is the TriplesMap: a declaration that says “from this SQL view, mint a subject IRI like this, and emit these predicates with these object terms.” A small but realistic mapping looks like this:

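@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix lk: <https://lakeside.com/kg/> .      # instance namespace (the template below spells it out in full)
@prefix lkv: <https://lakeside.com/vocab/> .  # team-curated target vocabulary

# One TriplesMap per logical table: each CRM_CUSTOMER row yields one lkv:Customer.
<#CustomerMap> a rr:TriplesMap ;
  rr:logicalTable [ rr:tableName "CRM_CUSTOMER" ] ;
  # The subject IRI is minted from the opaque key CUST_ID, never from the name.
  rr:subjectMap [
    rr:template "https://lakeside.com/kg/customer/{CUST_ID}" ;
    rr:class lkv:Customer
  ] ;
  # Each predicateObjectMap turns one column into one predicate with a literal object.
  rr:predicateObjectMap [
    rr:predicate lkv:legalName ;
    rr:objectMap [ rr:column "LEGAL_NAME" ]
  ] ;
  rr:predicateObjectMap [
    rr:predicate lkv:lei ;
    rr:objectMap [ rr:column "LEI_CODE" ]
  ] .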

A diagram showing how a single relational row becomes RDF triples through an R2RML mapping. The left panel shows a CRM_CUSTOMER table in a relational database with columns CUST_ID, LEGAL_NAME, LEI_CODE, COUNTRY, with one row highlighted (12345, Acme Corp, 529900T8B..., US). The middle panel shows the corresponding R2RML TriplesMap in Turtle, with a logical table reference, a subject map using a template that builds the IRI from CUST_ID, a class assignment of lkv:Customer, and predicate-object maps for legalName, lei, and country. The right panel shows the resulting Turtle triples for the highlighted row, with three annotations: stable IRI from opaque CUST_ID, controlled vocabulary from the team ontology, and reversible execution (materialize or virtualize). A bottom annotation explains why R2RML beats hand-coded ETL for KG construction (declarative, reversible, vocabulary-controlled, standardized).

Three properties of this mapping matter for production work.

First, the IRI template (/customer/{CUST_ID}) is the Part 5 IRI minting policy expressed in code. The mapping author chose CUST_ID as the stable opaque identifier; the customer name and address are properties, not part of the identifier. If the customer rebrands, the IRI does not move.

Second, the target vocabulary (lkv:Customer, lkv:legalName, lkv:lei) is the team’s curated ontology, not the database column names. Refactoring CRM_CUSTOMER does not require downstream consumers to relearn the schema. The mapping is the contract.

Third, the mapping is declarative, so it can be executed in two ways: an R2RML processor can translate each incoming SPARQL query into SQL against the original tables (virtualization), or it can precompute the RDF and store it (materialization). That choice is the single biggest architectural decision in the structured track and gets its own section below.
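
The first two properties can be seen by hand-executing the CustomerMap for one row. The sketch below is not an R2RML processor: it hard-codes the template and the two column-to-predicate pairs from the mapping, approximates the IRI-safe encoding of template values with percent-encoding, and uses a placeholder LEI value that is purely hypothetical.

```python
import re
from urllib.parse import quote

TEMPLATE = "https://lakeside.com/kg/customer/{CUST_ID}"
VOCAB = "https://lakeside.com/vocab/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# column -> predicate, mirroring the two predicateObjectMaps in the mapping
COLUMN_PREDICATES = {"LEGAL_NAME": VOCAB + "legalName", "LEI_CODE": VOCAB + "lei"}

def expand_template(template, row):
    """Fill {COLUMN} placeholders with percent-encoded column values."""
    return re.sub(r"\{(\w+)\}", lambda m: quote(str(row[m.group(1)]), safe=""), template)

def customer_triples(row):
    """Emit N-Triples-style lines for one CRM_CUSTOMER row."""
    subject = expand_template(TEMPLATE, row)
    lines = [f"<{subject}> <{RDF_TYPE}> <{VOCAB}Customer> ."]
    for column, predicate in COLUMN_PREDICATES.items():
        if row.get(column) is not None:
            literal = str(row[column]).replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'<{subject}> <{predicate}> "{literal}" .')
    return lines

# "LEI-PLACEHOLDER" is a made-up value, not a real LEI.
before = customer_triples({"CUST_ID": 12345, "LEGAL_NAME": "Acme Corp", "LEI_CODE": "LEI-PLACEHOLDER"})
after = customer_triples({"CUST_ID": 12345, "LEGAL_NAME": "Acme Global", "LEI_CODE": "LEI-PLACEHOLDER"})
print("\n".join(before))
print("subject unchanged after rebrand:", before[0] == after[0])
```

The rebrand changes one literal and nothing else: the subject IRI and the predicates are untouched, and no column name appears anywhere in the output.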

RML: the same discipline for non-relational sources

RML extends R2RML to cover CSV files, JSON documents, and XML files using the same TriplesMap structure with a richer rml:logicalSource definition. Per the RML specification, the language adds support for iterators that walk JSON paths, XPath expressions for XML, and column references for CSV, while keeping the rest of R2RML’s syntax intact.

If your source is a JSON API response or a stream of CSV exports, RML is the language. Tools like Morph-KGC, RMLMapper, and SDM-RDFizer process RML mappings against arbitrary tabular and tree-structured inputs to emit RDF.
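
The iterator idea is easier to see in miniature. The following Python sketch imitates what an RML processor does with a JSON source: an iterator selects the records, a template mints the subject, and field references supply the objects. The MAPPING dictionary is a hypothetical simplification invented for this example, not RML syntax, and the JSON document is made-up sample data.

```python
import json

DOC = json.loads("""
{"customers": [
  {"id": "12345", "legalName": "Acme Corp"},
  {"id": "67890", "legalName": "Acme Manufacturing"}
]}
""")

# Hypothetical, simplified stand-in for an RML TriplesMap over a JSON source.
MAPPING = {
    "iterator": ["customers"],   # plays the role of a JSONPath iterator such as $.customers[*]
    "subject_template": "https://lakeside.com/kg/customer/{id}",
    "predicate_objects": {"https://lakeside.com/vocab/legalName": "legalName"},
}

def iterate(doc, path):
    """Walk the key path and return the list of records to map."""
    node = doc
    for key in path:
        node = node[key]
    return node

def apply_mapping(doc, mapping):
    triples = []
    for record in iterate(doc, mapping["iterator"]):
        subject = mapping["subject_template"].format(**record)
        for predicate, field in mapping["predicate_objects"].items():
            if record.get(field) is not None:      # missing fields emit nothing
                triples.append((subject, predicate, record[field]))
    return triples

for triple in apply_mapping(DOC, MAPPING):
    print(triple)
```

In a real deployment the mapping would live in an RML document and one of the processors named above would execute it; the point here is only that the TriplesMap shape carries over unchanged from rows to records.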

Virtualization vs materialization

A mapping is a specification. The runtime choice is how to execute it. There are two postures.

Architecture | Mechanism | When to pick it
--- | --- | ---
Materialization | Run the mapping as a batch job; store the resulting triples in a triple store; serve queries from the store | Source data changes daily or less; query latency is critical; downstream needs derived inferences materialized
Virtualization | Translate incoming SPARQL queries into SQL (or JSON path, etc.) against the source at query time; never store the triples | Source data changes faster than batch refresh; storage cost matters; the source is authoritative and the KG layer is a semantic facade

The canonical open source option here is a SPARQL-to-SQL virtualization layer (Appendix A names the specific tools): a virtual knowledge graph (VKG) system that began as research software and has been deployed in industrial settings since 2018. Commercial triple stores offer the same capability; one vendor’s virtual graph documentation describes how SPARQL is rewritten to native source queries at runtime.

The architectural temptation is to virtualize everything because it sounds cheaper. The architectural reality is that virtualization is a great fit for slow-changing reference data over a small number of well-indexed sources and a poor fit when (a) the source schema is unstable, (b) federated joins span many sources with different join performance characteristics, or (c) the KG needs reasoning over derived facts that the source database cannot compute. Most production deployments end up hybrid: master data and large reference tables virtualized, derived inferences and resolved entities materialized.

How to build the check. Pick three of your most common KG queries and trace them by hand against the candidate architecture. If a virtualization plan generates a 14-way SQL join across three databases with no shared index, materialize. If the same query rewrites cleanly to a single indexed SQL hit, virtualize. The decision is usually obvious once you draw the query plan.
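
A toy version of that check, using Python's built-in sqlite3 module, is sketched below. It is a sketch under assumptions: the table contents are hypothetical, the "SPARQL rewrite" is a single hand-written function for one triple pattern, and a real virtualization layer does far more. It shows the two postures side by side, the staleness that materialization accepts, and how to ask the database whether the rewritten query is an indexed lookup.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CRM_CUSTOMER (CUST_ID INTEGER PRIMARY KEY, LEGAL_NAME TEXT)")
conn.executemany("INSERT INTO CRM_CUSTOMER VALUES (?, ?)",
                 [(12345, "Acme Corp"), (67890, "Beta LLC")])   # hypothetical rows

IRI_PREFIX = "https://lakeside.com/kg/customer/"
LEGAL_NAME = "https://lakeside.com/vocab/legalName"
LOOKUP_SQL = "SELECT LEGAL_NAME FROM CRM_CUSTOMER WHERE CUST_ID = ?"

def materialize(conn):
    """Batch posture: run the mapping once and keep the triples."""
    rows = conn.execute("SELECT CUST_ID, LEGAL_NAME FROM CRM_CUSTOMER")
    return {(f"{IRI_PREFIX}{cid}", LEGAL_NAME, name) for cid, name in rows}

def virtual_legal_name(conn, subject_iri):
    """Query-time posture: rewrite (<subject> lkv:legalName ?o) into one SQL lookup."""
    cust_id = int(subject_iri[len(IRI_PREFIX):])     # invert the IRI template
    row = conn.execute(LOOKUP_SQL, (cust_id,)).fetchone()
    return row[0] if row else None

store = materialize(conn)
conn.execute("UPDATE CRM_CUSTOMER SET LEGAL_NAME = 'Acme Global' WHERE CUST_ID = 12345")

subject = IRI_PREFIX + "12345"
stale = next(o for s, p, o in store if s == subject)   # what the batch copy still says
fresh = virtual_legal_name(conn, subject)              # what the source says now

# Ask the database how it would run the rewritten query.
plan = conn.execute("EXPLAIN QUERY PLAN " + LOOKUP_SQL, (12345,)).fetchall()
uses_index = any("SEARCH" in row[3] for row in plan)
print(stale, "|", fresh, "| indexed lookup:", uses_index)
```

If the plan for a rewritten query shows scans and multi-way joins instead of a keyed search, that is the signal to materialize.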

Track 2: Unstructured to RDF

Unstructured sources do not have a schema. Building a knowledge graph from them requires extracting entities and relationships from raw text, then normalizing the result into RDF or LPG triples. This was a research problem with hundreds of papers between 2010 and 2023. As of 2026 it is a production category with several plausible toolchains and one persistent failure pattern: triple explosion without usable coverage.

The classic NLP pipeline

The 2010s pattern is named entity recognition (NER) followed by relation extraction (RE) followed by entity linking (EL). NER finds spans of text that mention entities (Acme Corp, Janet Yellen, 0.5 percent rate cut). RE finds typed relationships between mentioned entities (Janet Yellen announced 0.5 percent rate cut). EL links each mention to a canonical IRI in a target KG (Janet Yellen → https://example.org/wikidata/Q187149). Pipelines like spaCy, Stanford CoreNLP, OpenNRE, and OpenIE all implement variants of this flow.

Classical pipelines are fast, cheap, and predictable. They are also bound by the predicates they were trained on. OpenIE will give you triples like (Janet Yellen, announced, 0.5 percent rate cut) where the predicate is just a verb phrase from the source text. Without a normalization step those predicates are not part of any controlled vocabulary, which makes querying brittle.
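
The normalization step can be as simple as a reviewed lookup table. The Python sketch below assumes a hypothetical vocabulary namespace and a hand-maintained map from surface verb phrases to controlled predicates; anything unmapped is set aside for review rather than loaded.

```python
VOCAB = "https://example.org/vocab/"   # hypothetical namespace

# Hypothetical, hand-curated map from surface verb phrases to controlled predicates.
SURFACE_TO_PREDICATE = {
    "announced": VOCAB + "announced",
    "has announced": VOCAB + "announced",
    "acquired": VOCAB + "acquired",
    "bought": VOCAB + "acquired",
}

def normalize(triples):
    """Split open-vocabulary triples into normalized ones and a review queue."""
    kept, unmapped = [], []
    for s, p, o in triples:
        key = " ".join(p.lower().split())          # case- and whitespace-insensitive
        if key in SURFACE_TO_PREDICATE:
            kept.append((s, SURFACE_TO_PREDICATE[key], o))
        else:
            unmapped.append((s, p, o))
    return kept, unmapped

raw = [
    ("Janet Yellen", "announced", "0.5 percent rate cut"),
    ("Janet Yellen", "Has  Announced", "0.5 percent rate cut"),
    ("Janet Yellen", "hinted at", "0.5 percent rate cut"),
]
kept, unmapped = normalize(raw)
print(len(kept), "normalized;", len(unmapped), "sent to review")
```

Two differently worded extractions now share one predicate and can be queried together, while the unmapped verb phrase stays out of the graph until someone decides what it means.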

REBEL and end-to-end relation extraction

REBEL (Relation Extraction By End-to-end Language generation) reframed extraction as a sequence-to-sequence problem: given a span of text, generate the triples directly using a fine-tuned BART model. REBEL gives you typed predicates from a fixed vocabulary (originally the Wikidata predicate set, customizable by retraining), entity boundaries, and relations in one pass. It is open source, runs on commodity GPUs, and is the bridge between classical RE and the LLM-extraction era.

REBEL’s discipline is its predicate vocabulary: it will not invent predicates that were not in its training set. That makes it more useful than OpenIE for production work, because the output joins cleanly into a controlled ontology. The cost is that domain-specific predicates require fine-tuning on domain data.

LLM extraction: the 2024-2026 pattern

The current production pattern uses an LLM as the extraction engine. The LLM reads a text chunk, is given the target ontology in the prompt (or asked to induce one), and emits triples in JSON. The implementations fall into a few capability shapes: an LLM-assisted extraction pipeline for general indexing, an incremental construction tool that deduplicates as it ingests, a plain-text-to-graph extractor, and an agent episodic-memory store for real-time KGs serving AI agents.

One widely cited LLM-assisted indexing pipeline is the canonical reference design for this pattern. Its standard method, per the published indexing documentation, proceeds in stages:

  1. Chunk the source corpus into text units of bounded length.
  2. Extract entities from each chunk: ask the LLM for named entities and a brief description of each.
  3. Extract relationships between entities the LLM identified in the same chunk, again with descriptions.
  4. Summarize entity descriptions across chunks where the same entity appears.
  5. Detect communities in the resulting entity graph using a graph algorithm like Leiden.
  6. Generate community summaries so high-level queries can use the cluster level instead of the node level.
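
The six stages above can be sketched as a skeleton in Python. Everything model-related here is a stand-in and should be read as an assumption: stub_extract replaces the LLM calls with a string lookup over a hypothetical entity list, the "summaries" are mention counts, and connected components replace Leiden community detection. The sketch shows only how the stages hand data to one another.

```python
from collections import defaultdict

KNOWN = ["Acme Holdings", "Acme Manufacturing", "Lakeside Trust Bank"]  # hypothetical

def chunk(text, size=12):
    """Stage 1: split into bounded text units (here, fixed word windows)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def stub_extract(chunk_text):
    """Stages 2 and 3: stand-in for the LLM; returns (entities, relations)."""
    entities = [e for e in KNOWN if e in chunk_text]
    relations = [(a, "co-mentioned", b)
                 for i, a in enumerate(entities) for b in entities[i + 1:]]
    return entities, relations

def communities(nodes, edges):
    """Stage 5: connected components via union-find, a stand-in for Leiden."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n
    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())

def build_index(corpus, extract=stub_extract):
    mentions, edges = defaultdict(list), set()
    for cid, text in enumerate(chunk(corpus)):
        entities, relations = extract(text)
        for e in entities:
            mentions[e].append(cid)
        edges.update((a, b) for a, _, b in relations)
    # Stage 4: merge per-chunk descriptions (here just a count of mentions).
    summaries = {e: f"{e}: mentioned in {len(c)} chunk(s)" for e, c in mentions.items()}
    # Stage 6: one report per community (here just the member list).
    reports = [", ".join(sorted(g)) for g in communities(list(mentions), edges)]
    return summaries, reports

CORPUS = ("Acme Holdings guarantees the term loan issued to Acme Manufacturing in March. "
          "Lakeside Trust Bank reviewed its own liquidity policy during the same quarter.")
summaries, reports = build_index(CORPUS)
print(sorted(reports))
```

In a real pipeline stages 2, 3, 4, and 6 are each LLM calls per chunk, per entity, or per community, which is where the indexing cost discussed next comes from.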

The 2024 paper behind that pipeline at arXiv:2404.16130 reported substantial improvements over a conventional vector RAG baseline on query-focused summarization tasks, specifically in the comprehensiveness and diversity of answers to corpus-wide questions. A November 2024 deferred-indexing variant cut indexing cost to 0.1 percent of the original by deferring summarization until query time, demonstrating that LLM extraction cost is a real constraint that the field is actively engineering against.

A diagram of the LLM extraction pipeline with three discipline points. The top row shows the six stages of the LLM-assisted graph-indexing pipeline: chunk, extract entities, extract relations, summarize, cluster, cluster summaries. The middle row shows the three discipline points needed to prevent triple explosion: fixed or seeded ontology applied at the extraction prompt, deduplication and entity resolution applied between extraction and assert, and a SHACL validation gate applied before triples enter the graph. The bottom row compares outcomes at scale on a 10,000-document corpus: without the discipline points, 3 million candidate triples become 1.2 million useful but indistinguishable triples mixed with duplicates, spurious assertions, hallucinations, and out-of-vocabulary triples; with the discipline points, the same 3 million candidates collapse cleanly to 1.2 million validated triples with provenance attached, and the chatbot can cite source spans for every assertion.

Schema-first vs schema-emergent

LLM extraction frameworks land on a spectrum.

Approach | What you provide | What the LLM does | When to choose
--- | --- | --- | ---
Schema-first | Fixed entity types and predicate vocabulary in prompt | Extract triples that conform; reject the rest | Regulated industries; KG joins existing master data; provenance and audit matter
Schema-adaptive | Initial seed ontology; rules for proposing new types | Extend the ontology cautiously while extracting | Domain is evolving; ontology team cannot keep up; you have a reviewer
Schema-emergent | No ontology at all | Induce the schema from the corpus | Exploratory; corpus is small; ontology team will refactor later

The schema-emergent approach is where most of the 2025 research excitement is. The AutoSchemaKG paper reports autonomous KG construction with simultaneous schema induction, which its authors frame as operating at web scale. KGGen and similar systems can produce surprisingly coherent KGs from text without any predefined types. The risk for enterprise use is that schema emergence at construction time is unpredictable: the same corpus run twice with the same model can produce different ontologies, so downstream applications and queries cannot rely on a stable schema.

For most enterprise KGs in 2026 the production answer is schema-first or schema-adaptive with a human-curated seed. The ontology you locked in Part 4 is your prompt content. The LLM’s job is to populate it, not to design it.
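
A minimal sketch of the schema-first posture in Python follows. The ontology, its predicate names, and the JSON shape of the candidate triples are all hypothetical; the model call itself is left out. The ontology does double duty: it is rendered into the prompt, and it is enforced again on whatever comes back, because a prompt instruction alone is not a guarantee.

```python
import json

# Hypothetical ontology fragment: predicate -> required subject and object types.
ONTOLOGY = {
    "entity_types": ["LegalEntity", "Loan"],
    "predicates": {
        "guarantees": {"domain": "LegalEntity", "range": "Loan"},
        "subsidiaryOf": {"domain": "LegalEntity", "range": "LegalEntity"},
    },
}

def build_prompt(chunk_text):
    """The locked ontology is the prompt content."""
    return (
        "Extract triples using ONLY this ontology. Skip anything that does not fit.\n"
        f"{json.dumps(ONTOLOGY, indent=2)}\n"
        "Return a JSON list of objects with keys: "
        "subject, subject_type, predicate, object, object_type.\n"
        f"Text:\n{chunk_text}"
    )

def conforms(t):
    """True only if the predicate is known and both types match its signature."""
    rule = ONTOLOGY["predicates"].get(t["predicate"])
    return bool(rule) and t["subject_type"] == rule["domain"] and t["object_type"] == rule["range"]

def schema_first_filter(candidates):
    accepted = [t for t in candidates if conforms(t)]
    rejected = [t for t in candidates if not conforms(t)]
    return accepted, rejected

# Hypothetical model output, including the "owns" mistake from the opening anecdote.
candidates = [
    {"subject": "Acme Holdings", "subject_type": "LegalEntity",
     "predicate": "guarantees", "object": "Loan-778", "object_type": "Loan"},
    {"subject": "Acme Corp", "subject_type": "LegalEntity",
     "predicate": "owns", "object": "J. Smith", "object_type": "Person"},
]
accepted, rejected = schema_first_filter(candidates)
print(len(accepted), "accepted;", len(rejected), "rejected")
```

The "customer owns a loan officer" triple never reaches the graph, because "owns" is not in the vocabulary and Person is not a declared type.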

What this looks like in practice. When evaluating a tool that claims “automatic KG construction,” the diagnostic question is “what does the schema look like after my third reingest.” If the tool cannot show stability across reingest, the system is closer to a one-shot exploratory tool than to a production graph. Stability across reingest is the price of admission to production.
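
One rough way to put a number on that diagnostic is to compare the predicate vocabularies produced by successive reingests. The Python sketch below uses Jaccard similarity over predicate sets and reports the worst pair; this metric is an assumption chosen for illustration (it ignores entity types and predicate frequency), and the three runs are made-up data.

```python
from itertools import combinations

def predicate_set(triples):
    """The emergent schema of one run, reduced to its set of predicates."""
    return {p for _, p, _ in triples}

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def reingest_stability(runs):
    """Minimum pairwise Jaccard similarity of predicate sets across runs (1.0 = identical)."""
    schemas = [predicate_set(r) for r in runs]
    return min((jaccard(a, b) for a, b in combinations(schemas, 2)), default=1.0)

# Hypothetical output of three ingests of the same corpus.
run1 = [("Acme Holdings", "guarantees", "Loan-778"), ("Acme Manufacturing", "subsidiaryOf", "Acme Holdings")]
run2 = [("Acme Holdings", "guarantees", "Loan-778"), ("Acme Manufacturing", "subsidiaryOf", "Acme Holdings")]
run3 = [("Acme Holdings", "guarantees", "Loan-778"), ("Acme Manufacturing", "ownedBy", "Acme Holdings")]

print(round(reingest_stability([run1, run2, run3]), 2))
```

A score that drops on the third run, as here when subsidiaryOf silently becomes ownedBy, means saved queries against the earlier schema will start returning nothing.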

The triple explosion failure mode

Every team that runs an unconstrained LLM extraction at scale produces too many triples. The pattern looks like this: 10 chunks per document, 30 triples per chunk, ten thousand documents, three million candidate triples (10 × 30 × 10,000). Of those, perhaps 20 percent are duplicates without a deduplication step, 20 percent are spurious “X mentions Y” assertions, 10 percent are wrong, and 10 percent are not in the target vocabulary. The remaining 40 percent might be useful, but you cannot easily tell which 40 percent.
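
The breakdown is worth working through once, starting from the three million candidates. The shares below are the illustrative percentages from the paragraph above, not measurements.

```python
CANDIDATES = 3_000_000

# Illustrative shares from the text, in percent of all candidate triples.
SHARES_PERCENT = {
    "duplicate": 20,
    "spurious mention": 20,
    "wrong": 10,
    "out of vocabulary": 10,
}

noise = {label: CANDIDATES * pct // 100 for label, pct in SHARES_PERCENT.items()}
useful = CANDIDATES - sum(noise.values())

for label, count in noise.items():
    print(f"{label:>18}: {count:>9,}")
print(f"{'possibly useful':>18}: {useful:>9,} ({100 * useful // CANDIDATES} percent)")
```

That leaves 1,200,000 possibly useful triples hidden among 1,800,000 that are not, with nothing in the raw output to tell the two groups apart.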

Triple explosion is a sign that the construction pipeline lacks three discipline points: (1) a fixed or seeded ontology that constrains output, (2) a deduplication and entity resolution step that runs as triples enter the graph, and (3) a SHACL validation gate that rejects triples that violate the ontology constraints. Without those three the unstructured track produces noise faster than it produces knowledge. With them, the same pipeline produces a usable graph less than half the size of the raw candidate set.
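
Discipline points (2) and (3) can be sketched in a few lines of Python. This is not SHACL: the SHAPE dictionary is a hypothetical stand-in for a shape with an exactly-one cardinality constraint, the canonical-ID map stands in for a real entity resolution step, and a production gate would run a SHACL engine over the RDF graph instead.

```python
# Hypothetical shape: every Customer must have exactly one legalName.
SHAPE = {"target_class": "Customer", "exactly_one": ["legalName"]}

def dedupe(triples, canonical):
    """Collapse surface forms to canonical IDs, then drop exact duplicates."""
    seen, out = set(), []
    for s, p, o in triples:
        key = (canonical.get(s, s), p, canonical.get(o, o))
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out

def violations(triples, shape):
    """Return (subject, predicate, value_count) for every cardinality violation."""
    targets = {s for s, p, o in triples if p == "type" and o == shape["target_class"]}
    found = []
    for subject in sorted(targets):
        for predicate in shape["exactly_one"]:
            values = {o for s, p, o in triples if s == subject and p == predicate}
            if len(values) != 1:
                found.append((subject, predicate, len(values)))
    return found

# Hypothetical extraction output: two spellings of one customer, one incomplete customer.
candidates = [
    ("Acme Corp", "type", "Customer"),
    ("ACME CORP.", "type", "Customer"),
    ("Acme Corp", "legalName", "Acme Corporation"),
    ("ACME CORP.", "legalName", "Acme Corporation"),
    ("Beta LLC", "type", "Customer"),
]
canonical = {"Acme Corp": "customer/12345", "ACME CORP.": "customer/12345"}

clean = dedupe(candidates, canonical)
problems = violations(clean, SHAPE)
print(len(candidates), "->", len(clean), "triples;", "violations:", problems)
```

The gate's job is to refuse the load, or quarantine the offending subjects, while the list of violations is non-empty, so that cleanup happens before the triples are in the graph.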

Where the Two Tracks Meet

Track 1 emits RDF with stable IRIs minted from primary keys. Track 2 emits RDF with extracted entity strings that may or may not match anything in Track 1. The same Acme Corp shows up in both. The graph is only useful if both Acme nodes collapse to the same IRI. This is where the 8-stage pipeline introduced in Part 5 reasserts itself.

Pipeline stage | Track 1 behavior | Track 2 behavior
--- | --- | ---
1. Ingest | R2RML or RML reads source rows | LLM or NLP pipeline reads source documents
2. Map | Mapping language declares triples directly | Extraction emits candidate triples in JSON, normalized to RDF
3. Resolve | Source primary key already maps to canonical IRI; no resolution needed for Track 1 internals | Entity strings must be resolved to canonical IRIs (deterministic + probabilistic ER from Part 5)
4. Mint | If a new entity is discovered, mint per IRI policy | If a new entity is discovered, mint per IRI policy after ER fails
5. Assert | Triples enter the graph with provenance metadata | Triples enter the graph with provenance metadata including source span and extraction confidence
6-8. Reason / Validate / Serve | Same for both | Same for both

The integration point is stage 3. Without entity resolution between tracks, the structured side asserts lk:customer/12345 and the unstructured side asserts lk:customer/auto-generated/acme-corp-from-memo-page-3. Both refer to the same Acme Corp. The graph treats them as different. Aggregations are wrong. The chatbot answers wrong.
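
A minimal sketch of stage 3 for Track 2 mentions, in Python, is shown below. It tries a deterministic match on a strong identifier first and falls back to string similarity on normalized names. The master records, the placeholder LEI values, the suffix list, and the 0.9 threshold are all assumptions for illustration; Part 5 describes the real design.

```python
from difflib import SequenceMatcher

# Track 1 master records keyed by canonical IRI. LEI values are placeholders.
MASTER = {
    "https://lakeside.com/kg/customer/12345": {"name": "Acme Corp", "lei": "LEI-0001"},
    "https://lakeside.com/kg/customer/67890": {"name": "Beta Logistics LLC", "lei": "LEI-0002"},
}
LEGAL_SUFFIXES = {"corp", "corporation", "inc", "llc", "ltd"}

def normalize(name):
    """Lowercase, strip punctuation, and drop generic legal-form suffixes."""
    tokens = "".join(c.lower() if c.isalnum() else " " for c in name).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def resolve(mention, lei=None, threshold=0.9):
    """Return (canonical IRI or None, how the decision was made)."""
    # 1. Deterministic: exact match on a strong identifier wins outright.
    if lei:
        for iri, rec in MASTER.items():
            if rec["lei"] == lei:
                return iri, "deterministic"
    # 2. Probabilistic: best string similarity on normalized names.
    best_iri, best = None, 0.0
    for iri, rec in MASTER.items():
        score = SequenceMatcher(None, normalize(mention), normalize(rec["name"])).ratio()
        if score > best:
            best_iri, best = iri, score
    if best >= threshold:
        return best_iri, "probabilistic"
    # 3. No match: the caller mints a new IRI per policy or queues a review.
    return None, "unresolved"

print(resolve("ACME Corp."))
print(resolve("the borrower", lei="LEI-0002"))
print(resolve("Acme Holdings"))
```

The last call matters as much as the first: "Acme Holdings" stays unresolved instead of being folded into Acme Corp, which is the parent-versus-subsidiary conflation from the opening anecdote.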

The agent-era restatement: AI agents make this integration the most expensive part of construction. A retrieval agent walking the graph from a structured exposure node to an unstructured credit memo claim only works if the entity nodes are unified. Per the Part 5 real-time ER framing, agent-serving KGs need ER on the write path between Track 1 and Track 2, not as a nightly batch.

A Diagnostic Table for the Construction Layer

Use this when assessing a KG construction pipeline you inherited or are evaluating. Each row is a yes/no question whose answer reveals the architecture.

Diagnostic question | If yes | If no
--- | --- | ---
Is there a written mapping from each source schema to the target ontology? | Track 1 is disciplined; refactor-safe | The mapping is implicit in code; refactors will break the graph
Are IRIs minted from stable opaque identifiers (not natural keys)? | Identity is durable across re-ingest | Identity will fail when the natural key changes
Does the unstructured-track extraction use a fixed or seeded ontology? | Output is constrained; downstream queries are tractable | Triple explosion likely; ontology emerged from the corpus and may differ next run
Is entity resolution explicit between Track 1 and Track 2 outputs? | Cross-source aggregations are correct | Same entity will appear under multiple IRIs; aggregations silently undercount
Is provenance attached to every triple (named graph or per-edge metadata)? | Audit trail exists; trust is differentiated | Trust is uniform across all triples; auditors will not be satisfied
Is there a SHACL validation gate before triples enter the graph? | Constraint violations are caught at ingest | Bad triples land in the graph; downstream cleanup becomes ongoing
Is the architecture materialization, virtualization, or hybrid (and is the choice deliberate)? | The team knows why; can defend the latency / cost trade-off | The architecture is whatever the first vendor demo showed

A “no” on three or more rows is a strong signal that the construction layer will produce confidently wrong knowledge at scale, regardless of how impressive the triple count looks.

Six Failure Modes Specific to Construction

These are the patterns where the layer this article covers breaks down. Recognize them on a system you inherit and you save quarters of effort.

Failure | Symptom | Root cause
--- | --- | ---
Triple explosion without coverage | Triple count grows faster than usable answers; downstream queries return noise | Unconstrained LLM extraction without fixed ontology, deduplication, or SHACL gate
Mapping rot | The graph silently goes wrong after a database schema change | R2RML/RML mappings are not in the same CI pipeline as the database schema; nothing alerts when the source moves
Schema-emergent without governance | Ontology drifts every reingest; downstream queries break | Schema-emergent extraction adopted because it was easy; no human curates emergent types
Single-pass LLM extraction | Random triples are correct, random triples are not; nobody can predict which | No validation step (LLM-as-judge, deduplication, ER, SHACL) between extraction and graph load
Bypass entity resolution at construction | Same entity in three forms; cross-source aggregations are wrong | The construction pipeline trusted the natural keys it received; ER was deferred to “later” and never built
No provenance attached | Audit fails; trust is uniform across all triples | Construction emitted triples without source attribution; named graphs or per-edge provenance never wired in
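
Mapping rot in particular is cheap to catch. The Python sketch below is one possible CI check, written under simplifying assumptions: it pulls column names out of the mapping text with regular expressions instead of parsing the Turtle, it handles a single table, it compares names case-sensitively, and it uses an in-memory SQLite database as a stand-in for the real source.

```python
import re
import sqlite3

# Abbreviated fragment of the CustomerMap mapping, as text.
MAPPING = '''
rr:logicalTable [ rr:tableName "CRM_CUSTOMER" ] ;
rr:template "https://lakeside.com/kg/customer/{CUST_ID}" ;
rr:objectMap [ rr:column "LEGAL_NAME" ]
rr:objectMap [ rr:column "LEI_CODE" ]
'''

def referenced_columns(mapping_text):
    """Columns the mapping depends on: rr:column values plus template placeholders."""
    cols = set(re.findall(r'rr:column\s+"([^"]+)"', mapping_text))
    for template in re.findall(r'rr:template\s+"([^"]+)"', mapping_text):
        cols.update(re.findall(r"\{([^}]+)\}", template))
    return cols

def missing_columns(conn, table, mapping_text):
    """Columns the mapping references that the live table no longer has."""
    actual = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    return sorted(referenced_columns(mapping_text) - actual)

# Simulate a source refactor: LEGAL_NAME was renamed to LEGAL_NM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CRM_CUSTOMER (CUST_ID INTEGER PRIMARY KEY, LEGAL_NM TEXT, LEI_CODE TEXT)")

missing = missing_columns(conn, "CRM_CUSTOMER", MAPPING)
print("missing columns:", missing)   # a CI job would fail the build if this is non-empty
```

Run against every schema migration, a check like this turns a silent graph error into a failed build on the day the column is renamed.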

Decision Tree for the Construction Layer

When you are designing the construction layer for a new KG, walk these questions in order.

  1. What fraction of source value is in structured vs unstructured form? This sets which track gets investment first. For a transaction-heavy domain, structured is usually 70-90 percent of value; for a documents-heavy domain (legal, pharma, intelligence), unstructured is usually 60-80 percent.
  2. For each structured source, what is the schema-stability profile? Stable means R2RML or RML once and forever. Unstable means R2RML or RML in a CI pipeline with the source schema.
  3. For each unstructured source, what is the ontology relationship? Use schema-first if the target ontology is locked; schema-adaptive if it is evolving with reviewers; schema-emergent only for exploratory or one-off corpora.
  4. Materialize, virtualize, or hybrid? Materialize for query latency and reasoning support. Virtualize for slow-changing references over indexed sources. Hybrid for the realistic enterprise mix.
  5. What is the entity-resolution strategy between tracks? Default to deterministic-first hybrid per Part 5; explicitly identify the entity types where Track 1 and Track 2 will collide.
  6. What is the SHACL validation gate? Define shapes from the target ontology; run them as a CI check on the construction pipeline.
  7. What is the provenance contract? Per-named-graph or per-edge. Either is fine; neither is not.
  8. What is the deduplication strategy in the unstructured track? Embedding-based clustering plus LLM-as-judge plus human review queue is the 2026 default.
  9. What is the cost ceiling for Track 2? LLM-extraction cost scales linearly with corpus size. Set a budget before scaling; consider deferred, query-time indexing if cost is a constraint.
  10. What is the change protocol when the ontology evolves? See Part 8.
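
Question 9 can be answered with a back-of-envelope model before any extraction runs. In the Python sketch below every number is hypothetical (chunk counts, token counts, and the price are placeholders, not any provider's actual pricing), and only input tokens are counted; the point is the linear shape and the habit of checking it against a budget.

```python
def extraction_cost(documents, chunks_per_doc, tokens_per_chunk, price_per_million_tokens):
    """Estimated cost of one full extraction pass; linear in every argument."""
    total_tokens = documents * chunks_per_doc * tokens_per_chunk
    return total_tokens * price_per_million_tokens / 1_000_000

def within_budget(budget, **corpus):
    return extraction_cost(**corpus) <= budget

# Hypothetical pilot and full-corpus figures.
pilot = dict(documents=100, chunks_per_doc=10, tokens_per_chunk=1_000, price_per_million_tokens=2.0)
full = dict(pilot, documents=10_000)

print("pilot:", extraction_cost(**pilot))
print("full: ", extraction_cost(**full))
print("full run fits a budget of 150:", within_budget(150, **full))
```

Because the model is linear, a measured 100-document pilot scales directly to the full corpus; if the full figure breaks the ceiling, that is the moment to look at deferred, query-time indexing.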

Most production KGs end up at: R2RML (or RML) with materialization for reference data and frequently joined sources, virtualization for slow-changing master data, schema-first LLM extraction with embedding-based deduplication for the documents corpus, deterministic-first hybrid ER between tracks, SHACL validation as a CI gate, and named graphs for provenance. That is not the only shape but it is the shape most teams converge to over their first eighteen months.

What You Should Now Be Able to Do

If you read this article cold, you should now be able to:

  • Decide whether a given source belongs to Track 1 (structured mapping) or Track 2 (extraction) and pick the appropriate toolchain.
  • Write a small R2RML or RML mapping that produces RDF with stable IRIs from a relational or tabular source.
  • Distinguish materialization, virtualization, and hybrid architectures and defend the choice for a given workload.
  • Choose between schema-first, schema-adaptive, and schema-emergent LLM extraction with the trade-offs articulated.
  • Diagnose triple explosion in an unstructured-track pipeline and add the three discipline points (fixed ontology, deduplication, SHACL gate) that fix it.
  • Recognize the six construction-layer failure modes on a system you inherit.

You now have the construction layer of the KG. The next two articles cover what happens after triples are in the graph: Part 7 covers quality, provenance, and trust; Part 8 covers operations, change, and versioning. The applications layer (Part 9 for agents, Part 10 for governance) follows. The Lakeside Trust Bank reference architecture in Part 11 instantiates everything end to end.

Do Next

Priority | Action | Why it matters
--- | --- | ---
This week | For your largest structured source, sketch a one-page R2RML mapping that produces three to five entity types with stable IRIs. The exercise reveals where your IRI policy is implicit and where the target vocabulary is unclear. | Most enterprise KGs are blocked not by tool selection but by the absence of an explicit source-to-vocabulary contract. Writing one mapping forces the contract.
This week | For your largest unstructured corpus, run a 100-document LLM-assisted graph-indexing extraction with a default open pipeline and inspect the output. The triple count, deduplication rate, and predicate variety set realistic expectations for the full corpus. | Tool demos understate the cost and the noise. A 100-document run with your data is the cheapest reality check.
This month | Lock the materialization vs virtualization decision in writing for the top three KG queries. Run the candidate architecture against representative source data and measure latency, cost, and refresh lag. | Most architectures default to materialization without measuring. The decision often flips for slow-changing reference data when measured.
This month | Define the entity-resolution contract between Track 1 and Track 2 for one bounded entity type (legal entities, customers, products). Build the deterministic-first hybrid ER per Part 5 and wire it into the construction pipeline, not as a downstream cleanup. | ER as a downstream cleanup never finishes. ER on the write path keeps the graph honest.
This quarter | Stand up a SHACL validation gate as a required CI step for any pipeline that writes triples to the graph. Author shapes from the target ontology in Part 4. | The gate is the single most effective discipline against triple explosion in the unstructured track. It also catches mapping rot in the structured track.
This quarter | Read Part 7 before designing the provenance layer. Provenance has to be designed at construction time, not bolted on after. | Retrofitting provenance into a graph that was constructed without it is one of the most expensive recovery projects in KG work.

Part 7 of this series, “Quality, Provenance, and Trust: Why Knowledge Graphs Fail Audits,” covers what construction has to record so the graph can be trusted, audited, and reasoned about. Read it next.

Sources & References

  1. W3C R2RML: RDB to RDF Mapping Language (2012)
  2. W3C A Direct Mapping of Relational Data to RDF (2012)
  3. RML: RDF Mapping Language Specification (2024)
  4. Knowledge Graphs (Hogan et al., ACM Computing Surveys, 2021)
  5. Construction of Knowledge Graphs: Current State and Challenges (2024)
  6. Microsoft GraphRAG: From Local to Global (2024)
  7. Microsoft GraphRAG documentation: Indexing methods (2025)
  8. LazyGraphRAG: Setting a New Standard for Quality and Cost (Microsoft Research, 2024)
  9. Stardog Virtual Graphs (2025)
  10. The Virtual Knowledge Graph System Ontop (Xiao et al., ISWC 2020)
  11. AutoSchemaKG: Autonomous KG Construction through Dynamic Schema Induction (2025)
  12. iText2KG / ATOM: Incremental KG construction from unstructured text (primary repository) (2025)
