Data Architecture & Engineering June 19, 2026 · 24 min read

Sourcing the Graph: Building Knowledge from Structured and Unstructured Data

Most enterprise data is split between structured systems with stable schemas and unstructured documents with no schema at all. A knowledge graph has to ingest both. This article gives the working design for the two construction tracks: deterministic mapping for relational and tabular sources (R2RML, Direct Mapping, RML, virtualization vs materialization), and probabilistic extraction for unstructured text (end-to-end relation extraction, LLM-assisted graph indexing, incremental construction, schema-first vs schema-emergent). It covers where the two tracks meet, the failure modes specific to construction, and the decision tree for picking a sourcing strategy. Part 6 of the Knowledge Graph Practitioner's Guide.

By Vikas Pratap Singh

#knowledge-graph #r2rml #rml #llm-extraction #graphrag #data-architecture #data-engineering

Executive Briefing

What this covers: How a knowledge graph is actually built from real source data. The two construction tracks: structured (relational, tabular, JSON, XML) via standardized mapping languages (W3C R2RML, Direct Mapping, the RML extension), and unstructured (text, PDF, transcripts) via NLP and LLM extraction (end-to-end relation extraction, LLM-assisted graph indexing, incremental construction, schema-adaptable approaches). The virtualization vs materialization architectural choice. Where the two tracks integrate.
Who should read it: Anyone designing the ingestion layer of a knowledge graph. Anyone who has been told an LLM can build the KG by itself. Anyone weighing whether to copy data into a triple store or query it in place. Anyone who has 80% of the company's value locked in PDFs and 20% in databases and is wondering which to attack first.
Key finding: Construction is a two-track problem with different failure modes per track, not a single pipeline. The core mapping languages for structured sources have been W3C Recommendations since 2012 and are deterministic, audit-friendly, and cheap to run. LLM extraction for unstructured sources is fast to stand up, expensive at scale, and probabilistic by nature. Treat them as separate engineering disciplines that meet at entity resolution.
For practitioners: when someone proposes building your KG by 'pointing GPT at the data lake,' ask three questions: (1) what fraction of source value lives in structured vs unstructured form, (2) what is your provenance contract per triple, (3) what is your strategy when the same entity appears in both tracks. If any answer is fuzzy, the construction layer will produce confidently wrong knowledge at scale.


Two Million Useless Triples

The bank in this section is an illustrative composite drawn from publicly documented enterprise KG failure patterns, and the figures are illustrative rather than measured. A regional commercial bank decides to build its first knowledge graph. The leadership ask is concrete: a relationship-banker chatbot that can answer “what is our exposure to this counterparty across loans, deposits, derivatives, and contingent commitments.” The data lives in three places. A relational CRM holds counterparty master records. A relational warehouse holds the transaction and exposure tables. Four years of credit memos, board minutes, and counterparty risk assessments live in a document store of PDFs.

The first attempt is what an enthusiastic engineer always proposes: point an LLM at every source, ask it to “extract all entities and relationships,” and load the result. After three weeks the team has on the order of two million triples. When the chatbot is wired to it, the answers are confidently wrong. The graph asserts that one customer “owns” a loan officer because the credit memo said the officer “owned the relationship.” It conflates the parent entity Acme Holdings with the subsidiary Acme Manufacturing because the LLM treated bare mentions of “Acme” as the same node. Counterparty A appears with three different IRIs (one per source), so exposure aggregations are wrong. Triple count is impressive. Usable coverage is a small fraction, on the order of ten percent.

The team backs up. They split the work into two tracks. For the structured sources they write a W3C R2RML mapping that produces RDF deterministically from the CRM and warehouse tables, with stable IRIs minted from internal master IDs. For the unstructured sources they design a tightly scoped LLM extraction pipeline with a fixed predicate vocabulary, an explicit entity-resolution step, and a SHACL validation pass before triples enter the graph. A few months later the graph is far smaller and the usable share is most of it. The chatbot’s exposure aggregations now match the warehouse to the cent.

This article is about the design choices that produce the second outcome instead of the first. The runtime layer in Part 5 covered identity, reference, and inference assuming the graph already exists. This article covers how the graph gets built. The vocabulary work in Part 4 gave you a way to describe what kinds of things exist. This article covers how the things get into the graph in the first place.

Two Tracks, Different Engineering

There is no single construction pipeline. Production knowledge graphs have two construction tracks because enterprise data has two shapes: structured sources with stable schemas, and unstructured sources with no schema at all. Each track has its own toolchain, its own failure modes, and its own provenance discipline. Treating them as one problem produces the low-coverage graph in the anecdote above.

Track	Source shape	Construction style	Output guarantee	Cost shape
Structured to RDF	Relational tables, CSV, JSON, XML, APIs with schemas	Deterministic mapping (R2RML, RML, Direct Mapping)	Same input always produces same triples; auditable	Engineering cost upfront, near-zero per ingest
Unstructured to RDF	PDFs, emails, contracts, transcripts, web pages	Probabilistic extraction (NLP pipelines, LLM extraction, LLM-assisted graph indexing)	Output varies by model, prompt, run; requires validation	Per-document compute cost; scales with corpus size

The two tracks meet in three places: the entity resolution layer (the same Acme Corp must get one IRI regardless of source), the validation layer (SHACL shapes apply to both), and the provenance layer (Part 7 covers how every triple records where it came from). Until they meet, they are independent engineering problems.

What this looks like in practice. A useful planning assumption is that a bank’s first KG releases are roughly two-thirds structured-track triples (master data, transactions, organizational hierarchy from HR systems) and one-third unstructured-track triples (credit memo claims, regulatory commitments, third-party news); treat the split as illustrative, not measured. The balance flips in pharma and legal where most institutional value is in documents. The right starting question is “what fraction of source value is structured vs unstructured” before any tool selection.

Track 1: Structured to RDF

The W3C ratified two RDB-to-RDF Recommendations in September 2012: Direct Mapping and R2RML. These have been the production standard for moving relational data into RDF for over a decade. The 2024 RML specification at rml.io extends the same mapping discipline to CSV, JSON, and XML sources. If a source has a schema, one of these languages is your tool.

Direct Mapping: the zero-config option

Direct Mapping is the trivial case. Every table becomes a class. Every row becomes a resource with an IRI derived from the primary key. Every column becomes a predicate with the column name. Foreign keys become typed relationships. No mapping file is written by hand; the algorithm is fully specified by the W3C and implementations are interchangeable.

Direct Mapping is correct, fast, and almost never what you want for a production KG. The output uses table and column names as predicate IRIs (http://example.com/db/CUST_TBL#FST_NM), which means your graph schema is your database schema. Refactor the database and the graph schema breaks. Use Direct Mapping for prototypes and for sanity checks. Do not use it as the production interface.
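
To make the coupling concrete, here is a minimal Python sketch of the Direct Mapping idea for a single row. It is not a conforming implementation: foreign keys, composite keys, and datatype handling are left out, and the function name and sample row are hypothetical. It only shows how table and column names leak straight into the IRIs.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def direct_map_row(base, table, pk_column, row):
    """Emit (subject, predicate, object) triples for one row, Direct Mapping style."""
    # The row IRI is derived from the table name and the primary key value.
    subject = f"{base}{table}/{pk_column}={quote(str(row[pk_column]), safe='')}"
    # The table becomes the class of the row resource.
    triples = [(subject, RDF_TYPE, f"{base}{table}")]
    for column, value in row.items():
        if value is None:
            continue  # NULL columns produce no triple
        # The predicate IRI is built from the table and column names.
        triples.append((subject, f"{base}{table}#{column}", value))
    return triples

# Hypothetical sample row for illustration only.
triples = direct_map_row(
    "http://example.com/db/", "CUST_TBL", "CUST_ID",
    {"CUST_ID": 42, "FST_NM": "Ada", "MID_NM": None},
)
for triple in triples:
    print(triple)
```

Renaming FST_NM in the database changes the predicate IRI in every emitted triple, which is the refactoring hazard described above.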

R2RML: the customized mapping

R2RML lets a mapping author define how relational tables turn into RDF using a target vocabulary the author controls. An R2RML document is itself an RDF graph (in Turtle), which is one of the language’s quietly elegant features: your transformation logic is queryable.

The atomic unit of R2RML is the TriplesMap: a declaration that says “from this SQL view, mint a subject IRI like this, and emit these predicates with these object terms.” A small but realistic mapping looks like this:

@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix lk: <https://lakeside.com/kg/> .
@prefix lkv: <https://lakeside.com/vocab/> .

# One TriplesMap per source table: each CRM_CUSTOMER row becomes one lkv:Customer.
<#CustomerMap> a rr:TriplesMap ;
  rr:logicalTable [ rr:tableName "CRM_CUSTOMER" ] ;
  # The subject IRI is minted from the stable master ID, not from the name.
  rr:subjectMap [
    rr:template "https://lakeside.com/kg/customer/{CUST_ID}" ;
    rr:class lkv:Customer
  ] ;
  # Column values become literals under predicates from the curated vocabulary.
  rr:predicateObjectMap [
    rr:predicate lkv:legalName ;
    rr:objectMap [ rr:column "LEGAL_NAME" ]
  ] ;
  rr:predicateObjectMap [
    rr:predicate lkv:lei ;
    rr:objectMap [ rr:column "LEI_CODE" ]
  ] .

Three properties of this mapping matter for production work.

First, the IRI template (/customer/{CUST_ID}) is the Part 5 IRI minting policy expressed in code. The mapping author chose CUST_ID as the stable opaque identifier; the customer name and address are properties, not part of the identifier. If the customer rebrands, the IRI does not move.

Second, the target vocabulary (lkv:Customer, lkv:legalName, lkv:lei) is the team’s curated ontology, not the database column names. Refactoring CRM_CUSTOMER does not require downstream consumers to relearn the schema. The mapping is the contract.

Third, the mapping is independent of how it is executed: an R2RML processor can answer a SPARQL query by translating it into SQL against the original tables (virtualization), or it can precompute the RDF and store it (materialization). That choice is the single biggest architectural decision in the structured track and gets its own section below.
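
The first two properties can be illustrated with a small Python sketch that applies the same template and vocabulary to a row. The dictionary below is a hypothetical in-memory stand-in for the CustomerMap above, not how an R2RML processor represents mappings, and the row values are invented.

```python
from urllib.parse import quote

LKV = "https://lakeside.com/vocab/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical stand-in for the CustomerMap TriplesMap shown above.
CUSTOMER_MAP = {
    "template": "https://lakeside.com/kg/customer/{CUST_ID}",
    "class": LKV + "Customer",
    "columns": {"LEGAL_NAME": LKV + "legalName", "LEI_CODE": LKV + "lei"},
}

def apply_map(mapping, row):
    # Percent-encode column values so the minted IRI stays valid.
    safe = {col: quote(str(val), safe="") for col, val in row.items()}
    subject = mapping["template"].format(**safe)
    triples = [(subject, RDF_TYPE, mapping["class"])]
    for column, predicate in mapping["columns"].items():
        if row.get(column) is not None:
            triples.append((subject, predicate, row[column]))
    return triples

# The same customer before and after a rebrand (invented values).
before = apply_map(CUSTOMER_MAP, {
    "CUST_ID": "C-1001", "LEGAL_NAME": "Acme Holdings", "LEI_CODE": "LEI-PLACEHOLDER"})
after = apply_map(CUSTOMER_MAP, {
    "CUST_ID": "C-1001", "LEGAL_NAME": "Acme Global Holdings", "LEI_CODE": "LEI-PLACEHOLDER"})

# The subject IRI is identical; only the legalName literal changed.
print(before[0][0] == after[0][0])
```

Because the subject depends only on CUST_ID, the rebrand changes a property value and nothing else.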

RML: the same discipline for non-relational sources

RML extends R2RML to cover CSV files, JSON documents, and XML files using the same TriplesMap structure with a richer rml:logicalSource definition. Per the RML specification, the language adds support for iterators that walk JSON paths, XPath expressions for XML, and column references for CSV, while keeping the rest of R2RML’s syntax intact.

If your source is a JSON API response or a stream of CSV exports, RML is the language. Tools like Morph-KGC, RMLMapper, and SDM-RDFizer process RML mappings against arbitrary tabular and tree-structured inputs to emit RDF.
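
As a rough illustration of the iterator-plus-reference idea, the Python sketch below walks a JSON payload and emits triples. It is not RML syntax and does not use any RML processor: the dotted path is a toy stand-in for a JSONPath iterator, and the payload and field names are assumptions.

```python
import json

LKV = "https://lakeside.com/vocab/"

# Hypothetical API payload; the field names are assumptions for illustration.
payload = json.loads("""
{"customers": [
  {"id": "C-1001", "name": "Acme Holdings"},
  {"id": "C-1002", "name": "Acme Manufacturing"}
]}
""")

def iterate(document, path):
    """Walk a dotted path such as 'customers' and yield each record found there."""
    node = document
    for key in path.split("."):
        node = node[key]
    yield from node

def map_records(document, iterator, template, references):
    """Apply one subject template and a set of field references to each record."""
    triples = []
    for record in iterate(document, iterator):
        subject = template.format(**record)
        for field, predicate in references.items():
            if field in record:
                triples.append((subject, predicate, record[field]))
    return triples

triples = map_records(
    payload,
    iterator="customers",
    template="https://lakeside.com/kg/customer/{id}",
    references={"name": LKV + "legalName"},
)
for triple in triples:
    print(triple)
```

The structure mirrors a TriplesMap: a source, an iterator that selects records, a subject template, and predicate-to-reference pairs.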

Virtualization vs materialization

A mapping is a specification. The runtime choice is how to execute it. There are two postures.

Architecture	Mechanism	When to pick it
Materialization	Run the mapping as a batch job; store the resulting triples in a triple store; serve queries from the store	Source data changes daily or less; query latency is critical; downstream needs derived inferences materialized
Virtualization	Translate incoming SPARQL queries into SQL (or JSON path, etc.) against the source at query time; never store the triples	Source data changes faster than batch refresh; storage cost matters; the source is authoritative and the KG layer is a semantic facade
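
The difference between the two postures shows up as soon as the source changes. The sketch below uses an in-memory SQLite table as a hypothetical source; the hand-written lookup stands in for what a real SPARQL-to-SQL rewriter would generate and handles only one triple pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CRM_CUSTOMER (CUST_ID TEXT PRIMARY KEY, LEGAL_NAME TEXT)")
conn.execute("INSERT INTO CRM_CUSTOMER VALUES ('C-1001', 'Acme Holdings')")

IRI_PREFIX = "https://lakeside.com/kg/customer/"
LEGAL_NAME = "https://lakeside.com/vocab/legalName"

def materialize(conn):
    """Batch posture: run the mapping once and keep the triples."""
    rows = conn.execute("SELECT CUST_ID, LEGAL_NAME FROM CRM_CUSTOMER")
    return {(IRI_PREFIX + cust_id, LEGAL_NAME, name) for cust_id, name in rows}

def virtual_lookup(conn, subject_iri):
    """Query-time posture: rewrite the pattern (subject, legalName, ?o) to SQL."""
    cust_id = subject_iri[len(IRI_PREFIX):]  # invert the IRI template
    rows = conn.execute(
        "SELECT LEGAL_NAME FROM CRM_CUSTOMER WHERE CUST_ID = ?", (cust_id,)
    )
    return [name for (name,) in rows]

store = materialize(conn)  # snapshot taken at batch time

# The source changes after the batch run.
conn.execute(
    "UPDATE CRM_CUSTOMER SET LEGAL_NAME = 'Acme Global Holdings' WHERE CUST_ID = 'C-1001'"
)

stale = [o for s, p, o in store if s == IRI_PREFIX + "C-1001"]
fresh = virtual_lookup(conn, IRI_PREFIX + "C-1001")
print(stale, fresh)
```

The materialized store answers from its snapshot until the next refresh, while the virtual lookup pays a source query on every request and always sees the current row.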

The canonical open-source option here is a SPARQL-to-SQL virtualization layer (see Appendix A for the specific tools): a research-grade virtual knowledge graph (VKG) system that has been deployed in industrial settings since 2018. Commercial entrants in this category include a leading RDF triple store, whose virtual graph documentation describes how SPARQL is rewritten to native source queries at runtime.

The architectural temptation is to virtualize everything because it sounds cheaper. The architectural reality is that virtualization is a great fit for slow-changing reference data over a small number of well-indexed sources and a poor fit when (a) the source schema is unstable, (b) federated joins span many sources with different join performance characteristics, or (c) the KG needs reasoning over derived facts that the source database cannot compute. Most production deployments end up hybrid: master data and large reference tables virtualized, derived inferences and resolved entities materialized.

How to build the check. Pick three of your most common KG queries and trace them by hand against the candidate architecture. If a virtualization plan generates a 14-way SQL join across three databases with no shared index, materialize. If the same query rewrites cleanly to a single indexed SQL hit, virtualize. The decision is usually obvious once you draw the query plan.

Track 2: Unstructured to RDF

Unstructured sources do not have a schema. Building a knowledge graph from them requires extracting entities and relationships from raw text, then normalizing the result into RDF triples or labeled property graph (LPG) nodes and edges. This was a research problem with hundreds of papers between 2010 and 2023. As of 2026 it is a production category with several plausible toolchains and one persistent failure pattern: triple explosion without usable coverage.

The classic NLP pipeline

The 2010s pattern is named entity recognition (NER) followed by relation extraction (RE) followed by entity linking (EL). NER finds spans of text that mention entities (Acme Corp, Janet Yellen, 0.5 percent rate cut). RE finds typed relationships between mentioned entities (Janet Yellen announced 0.5 percent rate cut). EL links each mention to a canonical IRI in a target KG (Janet Yellen → https://example.org/wikidata/Q187149). Toolkits like spaCy, Stanford CoreNLP, and OpenNRE, along with OpenIE systems, each implement one or more stages of this flow.

Classical pipelines are fast, cheap, and predictable. They are also bound by the predicates they were trained on. OpenIE will give you triples like (Janet Yellen, announced, 0.5 percent rate cut) where the predicate is just a verb phrase from the source text. Without a normalization step those predicates are not part of any controlled vocabulary, which makes querying brittle.
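
A normalization step can be as simple as a lookup from surface verb phrases to controlled predicates, with anything unmapped routed to review instead of loaded. The following Python sketch assumes a hypothetical mapping table and hypothetical lkv: predicate names.

```python
import re

# Hypothetical controlled vocabulary: surface verb phrases mapped to predicates.
PREDICATE_MAP = {
    "announced": "lkv:announced",
    "has announced": "lkv:announced",
    "unveiled": "lkv:announced",
    "acquired": "lkv:acquired",
}

def normalize_predicate(phrase):
    """Return a controlled predicate for a raw verb phrase, or None if unmapped."""
    key = re.sub(r"\s+", " ", phrase.strip().lower())
    return PREDICATE_MAP.get(key)

def normalize(triples):
    """Split raw OpenIE-style triples into normalized and unmapped lists."""
    kept, unmapped = [], []
    for subject, phrase, obj in triples:
        predicate = normalize_predicate(phrase)
        if predicate is None:
            unmapped.append((subject, phrase, obj))  # route to review, do not load
        else:
            kept.append((subject, predicate, obj))
    return kept, unmapped

raw = [
    ("Janet Yellen", "announced", "0.5 percent rate cut"),
    ("Janet Yellen", "Has  announced", "0.5 percent rate cut"),
    ("Janet Yellen", "hinted at", "0.5 percent rate cut"),
]
kept, unmapped = normalize(raw)
print(len(kept), len(unmapped))
```

A real vocabulary would need lemmatization and many more surface forms, but the contract is the same: only predicates from the controlled set reach the graph.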

REBEL and end-to-end relation extraction

REBEL (Relation Extraction By End-to-end Language generation) reframed extraction as a sequence-to-sequence problem: given a span of text, generate the triples directly using a fine-tuned BART model. REBEL gives you typed predicates from a fixed vocabulary (originally the Wikidata predicate set, customizable by retraining), entity boundaries, and relations in one pass. It is open source, runs on commodity GPUs, and is the bridge between classical RE and the LLM-extraction era.

REBEL’s discipline is its predicate vocabulary: it will not invent predicates that were not in its training set. That makes it more useful than OpenIE for production work, because the output joins cleanly into a controlled ontology. The cost is that domain-specific predicates require fine-tuning on domain data.

LLM extraction: the 2024-2026 pattern

The current production pattern uses an LLM as the extraction engine. The LLM reads a text chunk, is given the target ontology in the prompt (or asked to induce one), and emits triples in JSON. The implementations fall into a few capability shapes: an LLM-assisted extraction pipeline for general indexing, an incremental construction tool that deduplicates as it ingests, a plain-text-to-graph extractor, and an agent episodic-memory store for real-time KGs serving AI agents.

One widely cited LLM-assisted indexing pipeline is the canonical reference design for this pattern. Its standard method, per the published indexing documentation, proceeds in stages:

1. Chunk the source corpus into text units of bounded length.
2. Extract entities from each chunk: ask the LLM for named entities and a brief description of each.
3. Extract relationships between entities the LLM identified in the same chunk, again with descriptions.
4. Summarize entity descriptions across chunks where the same entity appears.
5. Detect communities in the resulting entity graph using a graph algorithm like Leiden.
6. Generate community summaries so high-level queries can use the cluster level instead of the node level.
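
The shape of these stages can be sketched in plain Python without any model calls. Everything here is a simplification under stated assumptions: the extractor is a rule-based stub where a real pipeline would prompt an LLM, connected components stand in for Leiden (they are not equivalent), and the summarization stages are omitted. The corpus and entity list are invented.

```python
from collections import defaultdict

def chunk_text(text, max_words=12):
    """Stage 1: split the corpus into bounded-length text units."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def stub_extract(chunk, known_entities):
    """Stages 2-3 stand-in: a real pipeline would prompt an LLM here.
    This stub links every pair of known entities mentioned in the same chunk."""
    found = [e for e in known_entities if e in chunk]
    edges = [(a, b) for i, a in enumerate(found) for b in found[i + 1:]]
    return found, edges

def connected_components(nodes, edges):
    """Stage 5 stand-in: connected components, NOT Leiden."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, communities = set(), []
    for node in sorted(nodes):
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbors[current] - group)
        seen |= group
        communities.append(sorted(group))
    return communities

corpus = ("Acme Holdings guarantees the loan to Acme Manufacturing . "
          "Separately Beta Bank syndicated a facility with Gamma Capital .")
entities = ["Acme Holdings", "Acme Manufacturing", "Beta Bank", "Gamma Capital"]

mentions = defaultdict(list)  # chunks per entity, the input to stage 4
all_edges = []
for chunk in chunk_text(corpus, max_words=10):
    found, edges = stub_extract(chunk, entities)
    for entity in found:
        mentions[entity].append(chunk)
    all_edges.extend(edges)

communities = connected_components(mentions.keys(), all_edges)
print(communities)
```

Each stage that the stub replaces is an LLM call per chunk or per community in a real pipeline, which is where the indexing cost accumulates.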

The 2024 paper behind that pipeline (arXiv:2404.16130) reported substantial improvements over a naive vector RAG baseline in the comprehensiveness and diversity of answers on query-focused summarization tasks. A November 2024 deferred-indexing variant cut indexing cost to 0.1 percent of the original by deferring summarization until query time, which shows that LLM extraction cost is a real constraint the field is actively engineering against.

Schema-first vs schema-emergent

LLM extraction frameworks land on a spectrum.

Approach	What you provide	What the LLM does	When to choose
Schema-first	Fixed entity types and predicate vocabulary in prompt	Extract triples that conform; reject the rest	Regulated industries; KG joins existing master data; provenance and audit matter
Schema-adaptive	Initial seed ontology; rules for proposing new types	Extend the ontology cautiously while extracting	Domain is evolving; ontology team cannot keep up; you have a reviewer
Schema-emergent	No ontology at all	Induce the schema from the corpus	Exploratory; corpus is small; ontology team will refactor later

The schema-emergent approach is where most of the 2025 research excitement is. The AutoSchemaKG paper reports autonomous KG construction with simultaneous schema induction, which its authors frame as operating at web scale. KGGen and similar systems can produce surprisingly coherent KGs from text without any predefined types. The risk for enterprise use is that schema emergence at construction time is unpredictable: the same corpus run twice with the same model can produce different ontologies, which leaves downstream applications and queries without a stable schema to depend on.

For most enterprise KGs in 2026 the production answer is schema-first or schema-adaptive with a human-curated seed. The ontology you locked in Part 4 is your prompt content. The LLM’s job is to populate it, not to design it.
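
In code, schema-first means two things: the locked vocabulary goes into the prompt, and anything that comes back outside it is rejected. The Python sketch below assumes a hypothetical two-predicate ontology and a simulated model response; no LLM is called, and the JSON shape is an assumption, not any particular tool's format.

```python
import json

# Hypothetical seed ontology: predicate -> (subject type, object type).
ONTOLOGY = {
    "lkv:guarantees": ("lkv:LegalEntity", "lkv:Loan"),
    "lkv:subsidiaryOf": ("lkv:LegalEntity", "lkv:LegalEntity"),
}

def build_prompt(chunk):
    """Put the locked vocabulary into the prompt so the model populates it."""
    allowed = "\n".join(f"- {p}: {d} -> {r}" for p, (d, r) in sorted(ONTOLOGY.items()))
    return (
        "Extract triples as a JSON list of objects with keys "
        "subject, subject_type, predicate, object, object_type.\n"
        f"Use only these predicates:\n{allowed}\n\nText:\n{chunk}"
    )

def conforming(llm_json):
    """Keep only triples whose predicate and types match the ontology."""
    accepted, rejected = [], []
    for triple in json.loads(llm_json):
        signature = ONTOLOGY.get(triple.get("predicate"))
        if signature == (triple.get("subject_type"), triple.get("object_type")):
            accepted.append(triple)
        else:
            rejected.append(triple)
    return accepted, rejected

# Simulated model output: the second triple uses an invented predicate.
response = json.dumps([
    {"subject": "Acme Holdings", "subject_type": "lkv:LegalEntity",
     "predicate": "lkv:guarantees", "object": "Loan 77", "object_type": "lkv:Loan"},
    {"subject": "Acme", "subject_type": "lkv:LegalEntity",
     "predicate": "lkv:owns", "object": "J. Smith", "object_type": "lkv:Person"},
])
accepted, rejected = conforming(response)
print(len(accepted), len(rejected))
```

The rejected list is worth keeping: a predicate that recurs there is a candidate for the ontology team to review, which is the schema-adaptive path.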

What this looks like in practice. When evaluating a tool that claims “automatic KG construction,” the diagnostic question is “what does the schema look like after my third reingest.” If the tool cannot show stability across reingest, the system is closer to a one-shot exploratory tool than to a production graph. Stability across reingest is the price of admission to production.

The triple explosion failure mode

Every team that runs an unconstrained LLM extraction at scale produces too many triples. The pattern looks like this: 100 chunks per document, 30 triples per chunk, ten thousand documents, thirty million triples. Of those, perhaps 20 percent are duplicates (absent a deduplication step), 20 percent are spurious “X mentions Y” assertions, 10 percent are wrong, and 10 percent are not in the target vocabulary. The remaining 40 percent might be useful, but you cannot easily tell which 40 percent.
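
Multiplying out the per-document figures from the paragraph above is a useful sanity check before any budget conversation. The shares are the illustrative percentages quoted there, not measurements.

```python
chunks_per_document = 100
triples_per_chunk = 30
documents = 10_000

# Raw triple count before any deduplication or validation.
total = chunks_per_document * triples_per_chunk * documents

# Illustrative noise shares from the paragraph above, as whole percentages.
shares = {"duplicate": 20, "spurious_mention": 20, "wrong": 10, "out_of_vocabulary": 10}
noise_percent = sum(shares.values())
useful_percent = 100 - noise_percent

breakdown = {label: total * pct // 100 for label, pct in shares.items()}
useful = total * useful_percent // 100

print(f"total={total:,} useful_at_most={useful:,} ({useful_percent}%)")
```

With these inputs the raw count is 30,000,000 triples and the possibly useful remainder is 12,000,000, which still cannot be separated from the noise without the checks described next.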

Triple explosion is a sign that the construction pipeline lacks three discipline points: (1) a fixed or seeded ontology that constrains output, (2) a deduplication and entity resolution step that runs as triples enter the graph, and (3) a SHACL validation gate that rejects triples that violate the ontology constraints. Without those three the unstructured track produces noise faster than it produces knowledge. With them, the same pipeline produces a usable graph an order of magnitude smaller.

Where the Two Tracks Meet

Track 1 emits RDF with stable IRIs minted from primary keys. Track 2 emits RDF with extracted entity strings that may or may not match anything in Track 1. The same Acme Corp shows up in both. The graph is only useful if both Acme nodes collapse to the same IRI. This is where the 8-stage pipeline introduced in Part 5 reasserts itself.

Pipeline stage	Track 1 behavior	Track 2 behavior
1. Ingest	R2RML or RML reads source rows	LLM or NLP pipeline reads source documents
2. Map	Mapping language declares triples directly	Extraction emits candidate triples in JSON, normalized to RDF
3. Resolve	Source primary key already maps to canonical IRI; no resolution needed for Track 1 internals	Entity strings must be resolved to canonical IRIs (deterministic + probabilistic ER from Part 5)
4. Mint	If a new entity is discovered, mint per IRI policy	If a new entity is discovered, mint per IRI policy after ER fails
5. Assert	Triples enter the graph with provenance metadata	Triples enter the graph with provenance metadata including source span and extraction confidence
6-8. Reason / Validate / Serve	Same for both	Same for both

The integration point is stage 3. Without entity resolution between tracks, the structured side asserts lk:customer/12345 and the unstructured side asserts lk:customer/auto-generated/acme-corp-from-memo-page-3. Both refer to the same Acme Corp. The graph treats them as different. Aggregations are wrong. The chatbot answers wrong.
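
A deterministic-first resolver for stage 3 can be sketched with the standard library alone. The master records, identifiers, suffix list, and 0.9 threshold below are all hypothetical, and difflib string similarity is a simple stand-in for whatever probabilistic matcher a production pipeline would use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical Track 1 master records keyed by canonical IRI.
MASTER = {
    "https://lakeside.com/kg/customer/12345": {
        "name": "Acme Holdings Inc.", "lei": "LEI-PLACEHOLDER-1"},
    "https://lakeside.com/kg/customer/67890": {
        "name": "Acme Manufacturing LLC", "lei": "LEI-PLACEHOLDER-2"},
}

SUFFIXES = re.compile(r"\b(inc|llc|ltd|corp|corporation|co)\b\.?")

def normalize(name):
    """Lowercase, drop legal-form suffixes and punctuation, collapse spaces."""
    name = SUFFIXES.sub("", name.lower())
    name = re.sub(r"[^a-z0-9 ]", "", name)
    return " ".join(name.split())

def resolve(mention, lei=None, threshold=0.9):
    """Return (iri, method), or (None, 'unresolved') so the caller can queue review."""
    # 1. Deterministic: a shared identifier wins outright.
    if lei:
        for iri, record in MASTER.items():
            if record["lei"] == lei:
                return iri, "identifier"
    # 2. Deterministic: exact match on the normalized name.
    target = normalize(mention)
    for iri, record in MASTER.items():
        if normalize(record["name"]) == target:
            return iri, "exact-name"
    # 3. Probabilistic: best fuzzy score, accepted only above the threshold.
    scored = [(SequenceMatcher(None, target, normalize(r["name"])).ratio(), iri)
              for iri, r in MASTER.items()]
    score, iri = max(scored)
    return (iri, "fuzzy") if score >= threshold else (None, "unresolved")

print(resolve("ACME Holdings, Inc"))
print(resolve("Acme"))
```

The bare mention "Acme" stays unresolved instead of being attached to either entity, which is the behavior that prevents the parent and subsidiary from being conflated.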

The agent-era restatement: AI agents make this integration the most expensive part of construction. A retrieval agent walking the graph from a structured exposure node to an unstructured credit memo claim only works if the entity nodes are unified. Per the Part 5 real-time ER framing, agent-serving KGs need ER on the write path between Track 1 and Track 2, not as a nightly batch.

A Diagnostic Table for the Construction Layer

Use this when assessing a KG construction pipeline you inherited or are evaluating. Each row is a yes/no question whose answer reveals the architecture.

Diagnostic question	If yes	If no
Is there a written mapping from each source schema to the target ontology?	Track 1 is disciplined; refactor-safe	The mapping is implicit in code; refactors will break the graph
Are IRIs minted from stable opaque identifiers (not natural keys)?	Identity is durable across re-ingest	Identity will fail when the natural key changes
Does the unstructured-track extraction use a fixed or seeded ontology?	Output is constrained; downstream queries are tractable	Triple explosion likely; ontology emerged from the corpus and may differ next run
Is entity resolution explicit between Track 1 and Track 2 outputs?	Cross-source aggregations are correct	Same entity will appear under multiple IRIs; aggregations silently undercount
Is provenance attached to every triple (named graph or per-edge metadata)?	Audit trail exists; trust is differentiated	Trust is uniform across all triples; auditors will not be satisfied
Is there a SHACL validation gate before triples enter the graph?	Constraint violations are caught at ingest	Bad triples land in the graph; downstream cleanup becomes ongoing
Is the choice among materialization, virtualization, and hybrid deliberate and documented?	The team knows why; can defend the latency / cost trade-off	The architecture is whatever the first vendor demo showed

A “no” on three or more rows is a strong signal that the construction layer will produce confidently wrong knowledge at scale, regardless of how impressive the triple count looks.

Six Failure Modes Specific to Construction

These are the patterns where the layer this article covers breaks down. Recognize them on a system you inherit and you save quarters of effort.

Failure	Symptom	Root cause
Triple explosion without coverage	Triple count grows faster than usable answers; downstream queries return noise	Unconstrained LLM extraction without fixed ontology, deduplication, or SHACL gate
Mapping rot	The graph silently goes wrong after a database schema change	R2RML/RML mappings are not in the same CI pipeline as the database schema; nothing alerts when the source moves
Schema-emergent without governance	Ontology drifts every reingest; downstream queries break	Schema-emergent extraction adopted because it was easy; no human curates emergent types
Single-pass LLM extraction	Some triples are correct and some are not; nobody can predict which	No validation step (LLM-as-judge, deduplication, ER, SHACL) between extraction and graph load
Entity resolution bypassed at construction	Same entity in three forms; cross-source aggregations are wrong	The construction pipeline trusted the natural keys it received; ER was deferred to “later” and never built
No provenance attached	Audit fails; trust is uniform across all triples	Construction emitted triples without source attribution; named graphs or per-edge provenance never wired in
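
Mapping rot in particular is cheap to detect mechanically. The Python sketch below compares the columns a mapping references against the live schema, using SQLite as a hypothetical source. The reference dictionary is written by hand here; in practice it would be derived from the mapping files, which this sketch does not attempt.

```python
import re
import sqlite3

# Hypothetical summary of what a mapping file references: table -> columns.
MAPPING_REFERENCES = {
    "CRM_CUSTOMER": {"CUST_ID", "LEGAL_NAME", "LEI_CODE"},
}

def live_columns(conn, table):
    """Read the current column names for a table from SQLite's catalog."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unexpected table name: {table!r}")
    # Each table_info row is (cid, name, type, notnull, dflt_value, pk).
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}

def mapping_rot(conn, references):
    """Return {table: missing columns} for references the source no longer satisfies."""
    problems = {}
    for table, columns in references.items():
        missing = columns - live_columns(conn, table)
        if missing:
            problems[table] = sorted(missing)
    return problems

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CRM_CUSTOMER (CUST_ID TEXT, LEGAL_NAME TEXT, LEI_CODE TEXT)")
clean = mapping_rot(conn, MAPPING_REFERENCES)

# Simulate a schema change that the mapping was never updated for.
conn.execute("DROP TABLE CRM_CUSTOMER")
conn.execute("CREATE TABLE CRM_CUSTOMER (CUST_ID TEXT, LEGAL_NAME TEXT, LEI TEXT)")
broken = mapping_rot(conn, MAPPING_REFERENCES)

print(clean, broken)
```

Run as a CI step on both the mapping repository and the database migration repository, a non-empty result fails the build before the graph goes silently wrong.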

Decision Tree for the Construction Layer

When you are designing the construction layer for a new KG, walk these questions in order.

1. What fraction of source value is in structured vs unstructured form? This sets which track gets investment first. For a transaction-heavy domain, structured is usually 70-90 percent of value; for a documents-heavy domain (legal, pharma, intelligence), unstructured is usually 60-80 percent.
2. For each structured source, what is the schema-stability profile? Stable means R2RML or RML once and forever. Unstable means R2RML or RML in a CI pipeline with the source schema.
3. For each unstructured source, what is the ontology relationship? Use schema-first if the target ontology is locked; schema-adaptive if it is evolving with reviewers; schema-emergent only for exploratory or one-off corpora.
4. Materialize, virtualize, or hybrid? Materialize for query latency and reasoning support. Virtualize for slow-changing references over indexed sources. Hybrid for the realistic enterprise mix.
5. What is the entity-resolution strategy between tracks? Default to deterministic-first hybrid per Part 5; explicitly identify the entity types where Track 1 and Track 2 will collide.
6. What is the SHACL validation gate? Define shapes from the target ontology; run them as a CI check on the construction pipeline.
7. What is the provenance contract? Per-named-graph or per-edge. Either is fine; neither is not.
8. What is the deduplication strategy in the unstructured track? Embedding-based clustering plus LLM-as-judge plus human review queue is the 2026 default.
9. What is the cost ceiling for Track 2? LLM-extraction cost scales linearly with corpus size. Set a budget before scaling; consider deferred, query-time indexing if cost is a constraint.
10. What is the change protocol when the ontology evolves? See Part 8.

Most production KGs end up at: R2RML (or RML) mappings, with materialization for derived inferences, resolved entities, and frequently joined sources, and virtualization for slow-changing master and reference data; schema-first LLM extraction with embedding-based deduplication for the documents corpus; deterministic-first hybrid ER between tracks; SHACL validation as a CI gate; and named graphs for provenance. That is not the only shape, but it is the shape most teams converge on over their first eighteen months.

What You Should Now Be Able to Do

If you read this article cold, you should now be able to:

Decide whether a given source belongs to Track 1 (structured mapping) or Track 2 (extraction) and pick the appropriate toolchain.
Write a small R2RML or RML mapping that produces RDF with stable IRIs from a relational or tabular source.
Distinguish materialization, virtualization, and hybrid architectures and defend the choice for a given workload.
Choose between schema-first, schema-adaptive, and schema-emergent LLM extraction with the trade-offs articulated.
Diagnose triple explosion in an unstructured-track pipeline and add the three discipline points (fixed ontology, deduplication, SHACL gate) that fix it.
Recognize the six construction-layer failure modes on a system you inherit.

You now have the construction layer of the KG. The next two articles cover what happens after triples are in the graph: Part 7 covers quality, provenance, and trust; Part 8 covers operations, change, and versioning. The applications layer (Part 9 for agents, Part 10 for governance) follows. The Lakeside Trust Bank reference architecture in Part 11 instantiates everything end to end.

Do Next

Priority	Action	Why it matters
This week	For your largest structured source, sketch a one-page R2RML mapping that produces three to five entity types with stable IRIs. The exercise reveals where your IRI policy is implicit and where the target vocabulary is unclear.	Most enterprise KGs are blocked not by tool selection but by the absence of an explicit source-to-vocabulary contract. Writing one mapping forces the contract.
This week	For your largest unstructured corpus, run a 100-document LLM-assisted graph-indexing extraction with a default open pipeline and inspect the output. The triple count, deduplication rate, and predicate variety set realistic expectations for the full corpus.	Tool demos understate the cost and the noise. A 100-document run with your data is the cheapest reality check.
This month	Lock the materialization vs virtualization decision in writing for the top three KG queries. Run the candidate architecture against representative source data and measure latency, cost, and refresh lag.	Most architectures default to materialization without measuring. The decision often flips for slow-changing reference data when measured.
This month	Define the entity-resolution contract between Track 1 and Track 2 for one bounded entity type (legal entities, customers, products). Build the deterministic-first hybrid ER per Part 5 and wire it into the construction pipeline, not as a downstream cleanup.	ER as a downstream cleanup never finishes. ER on the write path keeps the graph honest.
This quarter	Stand up a SHACL validation gate as a required CI step for any pipeline that writes triples to the graph. Author shapes from the target ontology in Part 4.	The gate is the single most effective discipline against triple explosion in the unstructured track. It also catches mapping rot in the structured track.
This quarter	Read Part 7 before designing the provenance layer. Provenance has to be designed at construction time, not bolted on after.	Retrofitting provenance into a graph that was constructed without it is one of the most expensive recovery projects in KG work.

Part 7 of this series, “Quality, Provenance, and Trust: Why Knowledge Graphs Fail Audits,” covers what construction has to record so the graph can be trusted, audited, and reasoned about. Read it next.