Data Architecture & Engineering June 19, 2026 · 23 min read

Identity, Reference, and Inference: How a Graph Becomes Knowledge

Identity is the load-bearing decision in a knowledge graph. IRIs are identifiers, not URLs. owl:sameAs is not as simple as it looks. Entity resolution is not optional. Inference is what turns stored facts into knowledge, and the choice between forward chaining (materialization) and backward chaining (query rewriting) is the second-most expensive design call after identity. This article gives the working design rules for all three and the W3C reasoning profiles (OWL 2 EL, QL, RL) that production KGs actually pick. Part 5 of the Knowledge Graph Practitioner's Guide.

By Vikas Pratap Singh

#knowledge-graph #entity-resolution #owl #inference #identity #reasoning #data-architecture

Executive Briefing

What this covers: The runtime semantics of a working knowledge graph. How IRIs make identity stable across systems and what naming patterns to follow. Why owl:sameAs is the most-misused construct in linked data and how to use it safely. Entity resolution as a first-class KG operation, with deterministic, probabilistic, and hybrid strategies. Inference via forward chaining (materialization) vs backward chaining (query rewriting). The OWL 2 reasoning profiles (EL, QL, RL) and which one to pick for which use case.
Who should read it: Anyone designing the identity layer of a knowledge graph. Anyone whose KG has thousands of entities resolved to millions of records and is feeling the seams. Anyone weighing whether to materialize derived facts or compute them on demand. Anyone choosing an OWL profile for production reasoning.
Key finding: Vocabulary mistakes can be refactored. Identity mistakes propagate forever. The two most expensive design decisions in a KG are how you mint IRIs and how you resolve entities. Get those wrong and every downstream consumer pays for years.
For practitioners: when a system claims to be a knowledge graph, ask three runtime questions: (1) what is your IRI minting policy and is it stable across re-ingest, (2) what is your entity resolution strategy and where is the provenance for every match, (3) does your reasoner materialize or query-rewrite, and which OWL profile does it implement. If any answer is fuzzy, the production behavior is going to surprise you.


The Two Records That Cost Five Million Dollars

A bank ingests counterparty data from three systems. The CRM has Acme Corp at 100 Main Street. The trade booking system has Acme Corp. (with a trailing period) at the same address. The risk system has ACME CORPORATION at a slightly different address (the same building, different floor). All three records are valid. None of the three identifiers carry a global anchor. The bank’s downstream applications treat these as three different counterparties.

When the AI risk agent computes total exposure to “Acme Corp” before approving a new credit facility, it sees one entity with $X exposure and approves the new line. The actual exposure is $X plus the two additional positions the agent never saw. Six months later, Acme files for bankruptcy. The realized loss is five million dollars more than the bank’s risk model thought possible.

This is the cost of getting identity wrong. It is not a theoretical problem. The vocabulary work we did in Part 4 lets you describe what an Acme Corp is. It does not, by itself, tell you that these three records refer to the same Acme Corp. That is the job of the layer this article covers.

By the end you will know how to mint stable identifiers, how to resolve entities responsibly, when to use the dangerous-but-useful owl:sameAs, and how to pick a reasoning profile that performs at production scale. The four-part lens introduced in Part 1 (entities, typed relationships, identity, inference) had two halves. Parts 3 and 4 covered the first half. This article covers the second.

Identity: What an IRI Is and Is Not

In the RDF paradigm, every entity has an IRI: an Internationalized Resource Identifier, the W3C’s name for a globally unique identifier with Unicode support. Per RDF 1.1 Concepts and Abstract Syntax, an IRI “identifies a resource, where a resource may be anything, including physical things, documents, abstract concepts, numbers and strings.”

Three properties matter.

Property	What it means	Why it matters
Globally unique	The IRI uniquely identifies one resource across all systems and all time	Two systems with the same IRI know they mean the same thing
Dereferenceable (optional but recommended)	Resolving the IRI as a URL returns useful information about the resource	Linked data and federated queries become possible
Opaque to humans	The structure of the IRI does not, by itself, encode meaning that consumers should rely on	Renaming the entity later does not invalidate the identifier

The biggest single misconception about IRIs is that they are URLs that must resolve to a web page. They are not URLs in that sense. An IRI like https://lakeside.com/kg/customer/12345 is a valid identifier whether or not anything is served at that address. It is convenient if dereferencing the IRI returns RDF describing the entity, but it is not required. As Tim Berners-Lee’s Linked Data design note puts it, the four principles of linked data are: use URIs as names; use HTTP URIs so people can look them up; when an HTTP URI is looked up, return useful information; include links to other URIs. Each principle is voluntary; together they make a graph that travels.

IRI design patterns to use

Use a stable namespace you control. A bank should mint customer IRIs under a domain it owns: https://lakeside.com/kg/customer/{stable-id}. This is not the customer’s marketing URL. It is a permanent identifier-space that survives marketing rebrands.

Make the local part opaque. The string /customer/12345 is fine. The string /customer/acme-corp-100-main-st is not, because it embeds details (name, address) that change. If Acme moves, the IRI must not change; the IRI is identity, not metadata.

Use one IRI scheme per entity type. All Customers under /customer/, all Loans under /loan/, all Accounts under /account/. This makes traversal patterns and access policies easier to reason about.

Plan for versioning before you need it. A common pattern is to mint a stable canonical IRI for the entity, plus dated versioned IRIs for snapshots: /customer/12345 for the entity itself, /customer/12345/version/2026-04-30 for the state of that customer at a point in time. Named graphs (covered in Part 7) hold the versioned snapshots.
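
As a minimal sketch of these patterns, assuming Python and a random UUID as the opaque local part: the examples in this article use short numeric IDs, so the UUID choice, the `ENTITY_TYPES` tuple, and the function names are illustrative assumptions rather than a prescribed implementation.

```python
import re
import uuid
from datetime import date

BASE = "https://lakeside.com/kg"                # namespace from the article's example
ENTITY_TYPES = ("customer", "loan", "account")  # one path segment per entity type


def mint_iri(entity_type: str) -> str:
    """Mint a canonical IRI whose local part is opaque (assumption: a random UUID)."""
    if entity_type not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {entity_type!r}")
    return f"{BASE}/{entity_type}/{uuid.uuid4()}"


def version_iri(canonical_iri: str, as_of: date) -> str:
    """Derive a dated snapshot IRI; the canonical IRI itself never changes."""
    return f"{canonical_iri}/version/{as_of.isoformat()}"


# Policy check: known type segment followed by a lowercase UUID, nothing else.
_UUID = r"[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}"
_POLICY = re.compile(re.escape(BASE) + "/(" + "|".join(ENTITY_TYPES) + ")/" + _UUID)


def follows_policy(iri: str) -> bool:
    """True if the IRI carries no meaning beyond its entity type."""
    return _POLICY.fullmatch(iri) is not None
```

A check like `follows_policy` is the kind of thing that can run in CI over newly minted identifiers, so that a descriptive local part such as `acme-corp-100-main-st` is rejected before it reaches the graph.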

IRI design anti-patterns to avoid

Do not use natural keys as IRIs. Email addresses, names, tax IDs, and account numbers in the local part of the IRI seem convenient and become problems when the natural key changes (the email rotates, the account number is reissued, the customer renames the business).

Do not embed meaning in the path. /loan/mortgage/30-year-fixed/CA/12345 looks descriptive and breaks the moment any segment turns out to be incorrect or the loan is reclassified.

Do not reuse retired IRIs. Once an IRI is minted, it identifies that entity forever. If the entity is deleted, the IRI is retired, not recycled.

Do not let the same entity have many IRIs from different sources without resolution. This is the root cause of the Acme example. We come back to it in the entity resolution section below.

Reference: How Identity Travels Across Systems

Once you have IRIs, you need a way to express that two IRIs minted by different parties refer to the same real-world entity. The RDF stack offers several constructs. They are not interchangeable.

Construct	Strength	Right use
`owl:sameAs`	Strict identity. Two IRIs refer to the same entity; everything stated about one is automatically true of the other (because of OWL inference)	Two systems’ IRIs that you have verified refer to the same entity, and you want full inference closure
`skos:exactMatch`	“These two concepts are interchangeable for indexing/retrieval.” The link itself is symmetric and transitive, but the two resources’ properties are not merged	Cross-vocabulary mapping where strict logical identity is too strong
`skos:closeMatch`	“These two concepts are very similar but not identical”	Loose mapping; explicitly hedged
`skos:relatedMatch`	“These two concepts are related”	Discovery and retrieval; not identity
`owl:differentFrom`	Explicit non-identity (the unique-name assumption is off in OWL by default)	When you need to assert that two IRIs do not refer to the same entity

The most-misused construct on this list is owl:sameAs. Its semantics are simple: if A owl:sameAs B, then every fact about A is a fact about B and vice versa. This is exactly what you want when you have two enterprise systems’ identifiers for the same Acme Corp and you want them treated identically. It is also exactly what you do not want when two records are similar but not the same entity.

Halpin, Hayes, McCusker, McGuinness, and Thompson published the canonical critique of owl:sameAs in 2010. Their analysis of how owl:sameAs was used across the linked data web at the time found that real-world usage almost always violates the strict logical semantics of identity the construct demands. In practice, people reach for it to connect resources that are very similar but not truly identical, sharing some but not all properties. Their finding generalizes: when ambiguous identity is asserted with strict semantics, downstream inference produces confidently wrong results.
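
To make the failure concrete, here is a small sketch of how strict identity propagates. It models owl:sameAs as an equivalence relation with a union-find structure. The record identifiers and exposure figures are invented for illustration, and a real reasoner does far more than this.

```python
class SameAsClosure:
    """Union-find over IRIs: owl:sameAs is reflexive, symmetric, and transitive."""

    def __init__(self):
        self.parent = {}

    def find(self, iri):
        self.parent.setdefault(iri, iri)
        while self.parent[iri] != iri:
            self.parent[iri] = self.parent[self.parent[iri]]  # path halving
            iri = self.parent[iri]
        return iri

    def same_as(self, a, b):
        """Assert a owl:sameAs b by merging their equivalence classes."""
        self.parent[self.find(a)] = self.find(b)

    def cluster(self, iri):
        """Every IRI that is now indistinguishable from the given one."""
        root = self.find(iri)
        return {x for x in list(self.parent) if self.find(x) == root}


# Hypothetical exposures in millions, keyed by source-system identifier.
exposure = {"crm:acme": 2.0, "trade:acme": 1.5, "risk:acme": 1.5,
            "crm:acme-holdings": 40.0}


def total_exposure(links, iri):
    return sum(exposure.get(x, 0.0) for x in links.cluster(iri))


links = SameAsClosure()
links.same_as("crm:acme", "trade:acme")   # verified match
links.same_as("trade:acme", "risk:acme")  # verified match
correct = total_exposure(links, "crm:acme")

# One "similar but not the same" link silently merges a different legal entity.
links.same_as("risk:acme", "crm:acme-holdings")
inflated = total_exposure(links, "crm:acme")
```

Nothing in the second result signals that it is wrong. The single loose link is indistinguishable from the verified ones once the closure has been computed, which is why the assertion itself needs provenance.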

The discipline for enterprise KGs is to use owl:sameAs only when you have:

A high-confidence resolution event that says these two IRIs are one entity.
Provenance attached to the assertion: who decided, when, with what evidence.
A reversal mechanism if the resolution turns out to be wrong.

For probabilistic or context-dependent links, prefer SKOS mappings or your own custom property with explicit semantics. A pattern that survives at scale is to make every coreference assertion a reified statement (a node), so that you can attach a confidence score, a method, a timestamp, and a reviewer to the assertion itself. This is heavier than a single triple but it is the price of operating safely.

What this looks like in practice: every cross-system identity assertion in your KG should answer three questions before it ships. Who decided this is a match? What evidence did they have? What happens if it is wrong? If you cannot answer all three, do not use owl:sameAs. Use a weaker SKOS construct and treat it as a hint, not a fact.
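
One way to hold those three answers next to the assertion is sketched below. The field names, the 0.99 threshold, and the rule that falls back to skos:closeMatch are assumptions made for illustration; the right policy depends on your review process.

```python
from dataclasses import dataclass, replace

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"
SAME_AS_THRESHOLD = 0.99  # hypothetical policy threshold


@dataclass(frozen=True)
class CoreferenceAssertion:
    """A coreference link carried as a record, so provenance travels with it."""
    subject: str
    obj: str
    confidence: float        # what evidence: score from the matcher
    method: str              # what evidence: rule or model that produced it
    decided_by: str          # who decided (empty string if nobody reviewed it)
    decided_at: str          # when, as an ISO 8601 timestamp
    retracted: bool = False  # what happens if it is wrong: flip this flag

    def predicate(self) -> str:
        """Strict identity only for reviewed, high-confidence matches."""
        if self.confidence >= SAME_AS_THRESHOLD and self.decided_by:
            return OWL_SAME_AS
        return SKOS_CLOSE_MATCH

    def triples(self):
        """Triples to publish; a retracted assertion publishes nothing."""
        if self.retracted:
            return []
        return [(self.subject, self.predicate(), self.obj)]


strong = CoreferenceAssertion(
    "https://lakeside.com/kg/customer/12345", "https://example.org/registry/acme",
    confidence=1.0, method="LEI exact match", decided_by="data-steward-1",
    decided_at="2026-04-30T09:00:00Z")
weak = replace(strong, confidence=0.82, method="name similarity", decided_by="")
reversed_match = replace(strong, retracted=True)
```

Because the record is immutable, a reversal is a new record with `retracted=True` rather than an edit in place, which keeps the history of the decision intact.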

Entity Resolution: Where Theory Meets Reality

Long before anyone in my world said “IRI,” I sat through master data management reviews arguing about exactly this problem: three records, one real-world party, and a survivorship rule deciding which attributes win. Entity resolution in a knowledge graph is the same discipline with higher stakes, because an agent will traverse the merge and act on it. The rules I learned in MDM still apply: deterministic keys first, probabilistic matching second, and always a log of why two records became one.

Entity resolution (ER) is the operation of deciding whether two records refer to the same real-world entity. It is the single hardest problem in KG construction and the place where most of the engineering effort goes once a project is past the vocabulary work of Part 4. The MDM heritage of this work is real; we covered the data shape side in our existing MDM and golden record article. The KG twist is that ER outputs become first-class assertions in the graph.

Three families of approach dominate.

Approach	How it works	Strength	Weakness
Deterministic (rules-based)	Exact-match rules over chosen identifiers (LEI, tax ID, customer number)	Fast, transparent, auditable	Fails when identifiers are missing or inconsistent across sources
Probabilistic (statistical/ML)	Compare attribute vectors; produce a match score; threshold	Handles fuzzy data, typos, transliteration	Requires labeled training data; less interpretable; tuning is ongoing
Hybrid	Deterministic first for high-confidence matches, probabilistic for the long tail	The production answer	More moving parts; pipeline complexity

The 2026 practitioner consensus, summarized in a property-graph store vendor’s entity resolution overview (see Appendix A for the specific tools), is to “use deterministic rules for certain matches, probabilistic or ML models for uncertain cases, and graph clustering to consolidate results, while balancing automation with oversight by automating clear cases but routing ambiguous ones to human review.” An entity-resolution engine vendor’s guidance on entity-resolved knowledge graphs makes the same point: ER is not a one-time job but an ongoing operation that has to be re-run as data changes.

A real production case from OpenCorporates’ 2025 work on legal-entity knowledge graphs makes the discipline concrete: “Match records using both deterministic IDs (registry numbers, LEIs) and probabilistic signals (names, addresses, officer overlaps). Always keep a log of why a match was made.” The log is not optional. It is the audit trail that lets you reverse a bad match without losing the rest of the graph.
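
The deterministic-first hybrid with a match log can be sketched in a few lines. This is an illustration only: the string similarity from Python's standard `difflib`, the equal weighting of name and address, the two thresholds, and the record layout are all assumptions, and a production matcher would use richer signals.

```python
import re
from difflib import SequenceMatcher

LEGAL_SUFFIXES = r"\b(corporation|corp|incorporated|inc|limited|ltd)\b"
match_log = []  # audit trail: why each pair was (or was not) merged


def normalize(text, strip_suffixes=False):
    """Lowercase, drop punctuation, optionally drop legal suffixes, collapse spaces."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    if strip_suffixes:
        text = re.sub(LEGAL_SUFFIXES, "", text)
    return " ".join(text.split())


def similarity(a, b, strip_suffixes=False):
    return SequenceMatcher(
        None, normalize(a, strip_suffixes), normalize(b, strip_suffixes)).ratio()


def resolve(a, b, auto_match=0.90, needs_review=0.70):
    """Return 'match', 'review', or 'no_match' and log the reason."""
    if a.get("lei") and a.get("lei") == b.get("lei"):
        decision, reason = "match", "deterministic: identical LEI"
    else:
        name = similarity(a["name"], b["name"], strip_suffixes=True)
        address = similarity(a["address"], b["address"])
        score = 0.5 * name + 0.5 * address  # equal weights are an assumption
        if score >= auto_match:
            decision = "match"
        elif score >= needs_review:
            decision = "review"  # route to the human review queue
        else:
            decision = "no_match"
        reason = f"probabilistic: name={name:.2f} address={address:.2f} score={score:.2f}"
    match_log.append({"a": a["id"], "b": b["id"], "decision": decision, "reason": reason})
    return decision


# The three Acme records from the opening example (addresses partly invented).
crm = {"id": "crm:1", "name": "Acme Corp", "address": "100 Main Street"}
trade = {"id": "trade:7", "name": "Acme Corp.", "address": "100 Main Street"}
risk = {"id": "risk:42", "name": "ACME CORPORATION", "address": "100 Main Street, 4th Floor"}
```

With these settings the CRM and trade records merge automatically, while the risk record lands in the review queue because its address differs. Either outcome leaves an entry in `match_log` that explains the decision.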

The shift in 2026 that affects KG architecture: ER is no longer a nightly batch process. AI agents that read from the KG need real-time entity resolution because every new interaction has to be resolved to the correct profile before the agent acts. This pushes ER from a back-office data engineering task into the latency-sensitive serving path of the KG, which has architectural implications for materialization, caching, and reasoning that we will cover in Part 8.

Inference: How a Graph Becomes Knowledge

A knowledge graph that only stores asserted facts is useful but limited. The leverage comes from inference: deriving new facts from existing ones using the ontology axioms we covered in Part 4. If subClassOf(CommercialCustomer, Customer) and type(:Acme, CommercialCustomer) are stated, a reasoner derives type(:Acme, Customer) without anyone explicitly asserting it. If subsidiaryOf is transitive, the chain composes automatically.

There are two mainstream approaches to running inference. The choice between them is the second-most expensive runtime decision in a KG, after identity.

Forward chaining and materialization

Forward chaining starts with stated facts and applies rules to derive everything that follows. The derived facts are stored alongside the asserted ones. The technical term is materialization: at write time, the reasoner pre-computes the inference closure and persists it.

The advantage is query speed: in the words of a reasoning guide on big knowledge graphs, it is “forward-chaining and materialization, which allows us to do efficient query evaluation on big datasets.” The disadvantage is up-front compute and storage cost. The closure of a non-trivial ontology is materially larger than the asserted graph; the same guide reports a British Museum dataset whose roughly 200 million asserted statements expanded about fourfold once inference ran, and with ontologies that include transitivity the multiplier can grow well beyond that.
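
A naive fixpoint loop shows the mechanics. The sketch below implements only three rules (subclass transitivity, type propagation along subclass links, and user-declared transitive properties) over plain tuples. The prefixed names are placeholders, and production reasoners use indexed, incremental algorithms rather than this quadratic scan.

```python
TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
SUBSIDIARY_OF = ":subsidiaryOf"  # declared transitive below


def materialize(asserted, transitive_props=frozenset()):
    """Apply the rules until no new triple appears; return asserted + derived."""
    closure = set(asserted)
    while True:
        derived = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                if o != s2:
                    continue  # the two triples do not chain
                if p == SUBCLASS and p2 == SUBCLASS:
                    derived.add((s, SUBCLASS, o2))  # subclass transitivity
                if p == TYPE and p2 == SUBCLASS:
                    derived.add((s, TYPE, o2))      # instance of a superclass
                if p == p2 and p in transitive_props:
                    derived.add((s, p, o2))         # transitive property
        if derived <= closure:
            return closure  # fixpoint reached
        closure |= derived


asserted = {
    (":CommercialCustomer", SUBCLASS, ":Customer"),
    (":Acme", TYPE, ":CommercialCustomer"),
    (":AcmeUK", SUBSIDIARY_OF, ":AcmeEU"),
    (":AcmeEU", SUBSIDIARY_OF, ":Acme"),
}
closure = materialize(asserted, {SUBSIDIARY_OF})
derived_only = closure - asserted
```

Here four asserted triples become six stored triples. The two derived facts are that `:Acme` is a `:Customer` and that `:AcmeUK` is a subsidiary of `:Acme`, and both can now be answered by a plain lookup.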

Forward chaining wins when:

Read-to-write ratio is high (many queries, fewer writes).
Inference latency at query time must be low (sub-second).
The ontology is stable enough that recomputing closure is acceptable.

Most enterprise KGs land here. Materialization is the default for production deployments.

Backward chaining and query rewriting

Backward chaining starts with a query and rewrites it to also retrieve facts implied by the rules but not stored. No materialization happens at write time. The reasoner does its work at query time.

A reasoning-engine vendor’s FAQ frames the trade-off: backward chaining “can be used in conjunction with virtual graphs and does not need to do any upfront computation when ingesting a rule.” The price is paid in query latency and the loss of cardinality estimates that query optimizers rely on.
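
The rewriting idea can be shown for the simplest case, a class-membership query under a subclass hierarchy. This is a sketch: the class names beyond CommercialCustomer are invented, the generated SPARQL assumes a default prefix declared elsewhere, and real rewriters handle many more axiom types.

```python
def subclasses_of(cls, subclass_axioms):
    """The class itself plus everything declared (transitively) beneath it."""
    found, frontier = {cls}, [cls]
    while frontier:
        parent = frontier.pop()
        for child, declared_parent in subclass_axioms:
            if declared_parent == parent and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def rewrite_type_query(cls, subclass_axioms):
    """Expand 'instances of cls' into a query over cls and all its subclasses."""
    classes = " ".join(sorted(subclasses_of(cls, subclass_axioms)))
    return f"SELECT ?x WHERE {{ VALUES ?c {{ {classes} }} ?x a ?c }}"


def instances_of(cls, asserted_types, subclass_axioms):
    """Evaluate the rewritten query against asserted facts only."""
    targets = subclasses_of(cls, subclass_axioms)
    return {subject for subject, c in asserted_types if c in targets}


axioms = [(":CommercialCustomer", ":Customer"),
          (":SmallBusinessCustomer", ":CommercialCustomer")]
asserted_types = [(":Acme", ":CommercialCustomer"),
                  (":Bolt", ":SmallBusinessCustomer"),
                  (":Alice", ":RetailCustomer")]  # no axiom links this class
```

Nothing is stored beyond the asserted types. The cost moves into the query, which grows with the size of the hierarchy below the requested class.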

Backward chaining wins when:

Write-to-read ratio is high (many writes, few reads).
The ontology changes frequently (recomputing materialization is too expensive).
The reasoning profile selected naturally maps to query rewriting (OWL 2 QL is designed for this).

Hybrid approaches exist: materialize the stable parts of the ontology forward, leave the volatile parts to backward chaining. Most production KGs end up with this hybrid in some form.

Dimension	Forward chaining (materialization)	Backward chaining (query rewriting)
Write cost	High (closure computed on every change)	Low (writes are direct)
Storage cost	High (closure persisted)	Low (only asserted facts stored)
Query latency	Low (everything is stored)	Higher (inference at query time)
Cardinality estimation	Available (statistics over closure)	Limited (query planner blind to derivations)
Best fit	Read-heavy KGs with stable ontology	Write-heavy KGs with volatile rules
Profile fit	OWL 2 RL	OWL 2 QL

Reasoning Profiles: OWL 2 EL, QL, RL

OWL 2 has multiple profiles, defined in the W3C OWL 2 Profiles specification. Each trades expressivity for tractability in a different way. Choosing the right profile is the runtime decision that determines whether your reasoning is fast enough to use.

Profile	Optimized for	Reasoning complexity	Real-world example
OWL 2 EL	Very large terminologies with class hierarchies; classification and subsumption	Polynomial time	SNOMED CT, Gene Ontology
OWL 2 QL	Lightweight ontologies over very large data; query rewriting against relational stores	AC0 data complexity for query answering (within LogSpace)	Ontology-based data access (OBDA) over SQL warehouses
OWL 2 RL	Rule-based forward chaining over RDF; many individuals with moderate ontology complexity	Polynomial time	Most enterprise KGs (FIBO, internal banking ontologies)
OWL 2 DL	Maximum expressivity that remains decidable	Worst-case 2NEXPTIME (avoid in production)	Research; small but very rich ontologies
OWL 2 Full	Full expressivity, no tractability guarantees	Undecidable in general	Avoid for production KGs

The selection rule of thumb: pick OWL 2 RL by default for an enterprise KG with rich data and a moderate-complexity ontology. Pick OWL 2 EL when the ontology is dominated by a very deep class hierarchy and you need scalable subsumption (clinical, biomedical, library science). Pick OWL 2 QL when your data lives in a relational warehouse and you want a virtual KG view over it without ETL.

This selection is consistent with what we already saw in Part 4: FIBO targets OWL 2 RL among others; SNOMED CT is OWL 2 EL. The choice of profile follows from the shape of the domain.

For practitioners: do not pick OWL 2 DL or Full for a production KG. The expressivity is real and the cost is unbounded. If you need a feature outside RL/EL/QL, the right move is almost always to model the requirement differently within a tractable profile, not to escalate to a more expensive one.

Open World vs Closed World, Revisited

We touched on this in Part 4 and it deserves a closer look here because it directly shapes what a reasoner concludes.

OWL is governed by the Open World Assumption (OWA). A missing fact is not false; it is unknown. If your KG does not state that Alice has a phone number, OWL does not conclude that Alice has no phone number. It concludes nothing.

SHACL is governed by the Closed World Assumption (CWA). A missing fact relative to a constraint is treated as a violation. If a SHACL shape says Customer must have minCount 1 phone, and Alice has no phone, the validation fails.

This split is intentional. OWL is a knowledge representation language; the world is bigger than what we have written down. SHACL is a validation language; the data we have is the data we are checking. The mistake teams make is to expect OWL to behave like SHACL, or vice versa. The mistake produces unexpected reasoner output (“why didn’t the reasoner conclude X?”) and unexpected validation failures (“why is SHACL flagging this?”).

The practical pattern: OWL for inference (what follows from what we know), SHACL for validation (what must be present in what we have). Use both. They are complementary.
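
The difference is easy to see side by side. The sketch below is plain Python rather than an OWL reasoner or a SHACL engine; the two function names and the tiny graph are illustrative assumptions that mimic how each assumption treats a missing phone number.

```python
def ask_open_world(graph, subject, predicate):
    """OWA: a stated fact is True; an absent fact is unknown (None), never False."""
    stated = any(s == subject and p == predicate for s, p, _ in graph)
    return True if stated else None


def validate_min_count(graph, cls, predicate, min_count=1):
    """CWA, in the spirit of sh:minCount: absence in the data at hand is a violation."""
    instances = sorted(s for s, p, o in graph if p == "rdf:type" and o == cls)
    return [
        inst for inst in instances
        if sum(1 for s, p, _ in graph if s == inst and p == predicate) < min_count
    ]


graph = {
    (":Alice", "rdf:type", ":Customer"),
    (":Bob", "rdf:type", ":Customer"),
    (":Bob", ":phone", "+1-555-0100"),
}

alice_has_phone = ask_open_world(graph, ":Alice", ":phone")    # unknown
violations = validate_min_count(graph, ":Customer", ":phone")  # Alice fails
```

The same missing triple produces "no conclusion" in one function and a reported violation in the other, which is the split between inference and validation described above.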

The Production Stack: Identity, ER, and Inference Together

Putting the layers together gives the standard eight-stage pipeline for a production KG: Ingest, Map and lift, Resolve, Mint, Assert, Reason, Validate, Serve.

Stage	What it does	Tooling concern
1. Ingest	Pull source data (databases, documents, APIs) into a staging area	ETL/ELT, change data capture, document parsers
2. Map and lift	Map source records to ontology classes and properties; produce candidate IRIs	R2RML, RML, custom mappers
3. Resolve	Run entity resolution; decide which records refer to the same entity	Deterministic rules + probabilistic matching + human review queue
4. Mint	Assign canonical IRIs to resolved entities; link source IRIs via owl:sameAs or skos:exactMatch with provenance	Identity-as-a-service component
5. Assert	Write the resolved facts (triples or property graph nodes/edges) into the store	Triple store or graph database
6. Reason	Run forward-chaining materialization (OWL 2 RL is typical), or set up backward-chaining for volatile parts	An OWL reasoning engine, whether standalone or embedded in the triple store (per Appendix A)
7. Validate	SHACL shapes check write-time data integrity	SHACL engines
8. Serve	Queries from BI, applications, and AI agents hit the KG; some hit materialized facts directly, some trigger backward inference	SPARQL endpoint, GraphQL gateway, agent-specific query layer

The Lakeside Trust Bank reference architecture in Part 11 instantiates this exact pipeline. The OpenCorporates legal-entity KG does it for legal entity data at planetary scale. A dedicated entity-resolution engine packages stages 3 and 4 as a specialized layer. Each production KG you will encounter in 2026 is some flavor of this pipeline.

Failure Modes Specific to Identity and Inference

These are the patterns where the layer this article covers breaks down. Recognizing them on a system you inherit saves quarters of effort.

Failure	Symptom	Root cause
`sameAs` explosion	Reasoner is slow; closure size is unmanageable; queries return surprising “linked” entities	Naive use of `owl:sameAs` to mean “kind of like” instead of “exactly the same”; transitive closure unbounded
Materialization without versioning	After an ontology change, derived facts are wrong but stored; downstream consumers consume stale inferences	No tracking of which facts came from which ontology version; no re-materialization discipline
Profile mismatch	Reasoner is too slow, sometimes hangs; queries time out under modest load	OWL 2 DL or Full chosen for a use case that needs RL; reasoner is doing exponential work
ER thresholds without humans	Confident wrong matches; or low-recall and missed matches; either way, nobody reviews	Probabilistic matching turned on with no review queue; thresholds tuned in isolation from business consequence
Identity collisions across re-ingest	Same entity gets multiple canonical IRIs over time; downstream consumers see duplication	Re-ingest pipeline mints fresh IRIs without checking the resolution layer
Forward chaining where backward fits	High write-amplification; sluggish ingest; reasoner consumes most compute	KG was built read-heavy, but data churn is high; materialization recomputes constantly
Backward chaining where forward fits	Surprisingly slow queries; query optimizer cannot estimate cardinality; agents time out	Inference deferred to query time on a read-heavy KG with low write volume
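
The identity-collision row is the easiest of these to guard against in code. A minimal sketch follows, assuming an in-memory registry keyed by source system and source key. The class and parameter names are hypothetical, and a real registry would be a persistent, transactional service.

```python
import uuid


class IriRegistry:
    """Maps (source system, source key) to a canonical IRI, stable across re-ingest."""

    def __init__(self, base="https://yourdomain.com/kg"):
        self.base = base
        self._by_source = {}

    def canonical_iri(self, entity_type, source, source_key, resolved_to=None):
        """Return the existing IRI for this source record, or register one.

        resolved_to is the canonical IRI the resolution layer matched this
        record to, if any; a fresh IRI is minted only when there is no match.
        """
        key = (source, source_key)
        if key in self._by_source:
            return self._by_source[key]  # re-ingest: never mint again
        iri = resolved_to or f"{self.base}/{entity_type}/{uuid.uuid4()}"
        self._by_source[key] = iri
        return iri


registry = IriRegistry()
first = registry.canonical_iri("customer", "crm", "C-001")
again = registry.canonical_iri("customer", "crm", "C-001")  # second ingest run
linked = registry.canonical_iri("customer", "trade", "T-77", resolved_to=first)
```

The ordering is the point: look up the source key, then consult the resolution result, and mint only as a last resort.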

A Decision Tree for the Runtime Layer

When you are designing the runtime layer of a KG, walk these questions in order.

1. What is the entity space? Bound it: customers, products, employees, transactions. The smaller the entity space, the easier identity is.
2. What is the canonical identifier source? An LEI, a customer master ID, a tax ID. Pick one source per entity type.
3. What is the IRI minting policy? https://yourdomain.com/kg/{type}/{stable-opaque-id}. Document this. Treat the IRI namespace as load-bearing infrastructure.
4. What is the ER strategy? Deterministic-first hybrid is the production default. Set thresholds. Set up the review queue.
5. Are there cross-system mappings to other organizations’ identifiers? Use owl:sameAs only with provenance. Use SKOS for fuzzy matches. Do not use owl:sameAs casually.
6. What OWL profile? RL by default. EL for very large terminologies. QL for relational pass-through. Avoid DL/Full.
7. Forward or backward chaining? Forward (materialization) by default. Backward for volatile or write-heavy parts. Hybrid is fine.
8. What is the reasoning trigger? On every write (online materialization), nightly batch, or on-demand. The choice depends on freshness requirements.
9. What is the validation layer? SHACL shapes for write-time integrity. Run them as a CI check on the ontology itself.
10. What is the change protocol when the ontology evolves? See Part 8.

Most production KGs end up at: deterministic-plus-probabilistic ER, OWL 2 RL with online materialization for the stable core, SHACL validation at ingest, named graphs for versioning. That is not the only shape, but it is the shape most teams converge on over their first two years.

What You Should Now Be Able to Do

If you read this article cold, you should now be able to:

Mint IRIs for a new entity type without embedding meaning that will rot.
Decide between owl:sameAs, skos:exactMatch, and a custom resolution property for any cross-system identity link, with provenance attached.
Lay out a deterministic-plus-probabilistic ER pipeline with a human review queue and explain when to add real-time matching for AI agents.
Choose between forward and backward chaining for a given KG workload based on read/write ratio, ontology stability, and latency requirements.
Pick an OWL 2 profile (EL, QL, RL) appropriate to the shape of your domain and avoid the DL/Full trap.
Recognize the seven runtime failure modes in this article on a system you inherit.

You now have the full conceptual core of the KG series: vocabulary (Part 4), identity, reference, and inference (this article). The remaining articles cover sourcing the graph from real data (Part 6), quality and provenance (Part 7), operations (Part 8), and the two flagship applications: KGs for AI agents (Part 9) and KGs for governance (Part 10).

Do Next

Priority	Action	Why it matters
This week	Audit your existing master data system. For three core entity types (customer, product, account, or your equivalent), document the IRI policy if there is one and the ER process if there is one. The audit usually reveals that one or both are implicit.	Implicit identity decisions are the single biggest source of cross-system data quality issues. Making them explicit is the first step to fixing them.
This week	Read Halpin et al., “When owl:sameAs Isn’t the Same”. Even if you never deploy a single `owl:sameAs` statement, the analysis sharpens your discipline around when “same” is appropriate and when it is not.	Almost every KG project misuses sameAs at first. Reading the canonical critique once saves quarters of cleanup later.
This month	Pick one bounded entity type (e.g., legal entities) and build the deterministic-first hybrid ER pipeline for it. Set thresholds. Set up a human review queue for ambiguous matches. Capture provenance for every match.	A working ER pipeline for one entity type is more valuable than a planned pipeline for ten. Per Part 2, bounded scope is the hedge against year-two collapse.
This month	If your KG is in design, lock the OWL profile choice in writing before any reasoner is selected. Justify the choice. Most teams pick the wrong profile by deferring the choice to the vendor evaluation.	Reasoner profile is a fundamental architectural decision. Vendors who bend OWL Full to be “almost as fast as RL” are selling future pain.
This quarter	Run a materialization vs backward-chaining benchmark on representative data for your KG. Measure write latency, query latency, storage size. The numbers usually surprise the team and shape the cache strategy.	Many teams default to materialization without measuring. When the write workload changes, the choice should be revisited.
This quarter	Read Part 6 before designing your ingestion layer. Sourcing decisions interact heavily with the identity and inference choices in this article.	The full pipeline only works if the ingestion stage is designed compatibly with identity, ER, and reasoning. Designing them in isolation produces seam pain.

Part 6 of this series, “Sourcing the Graph: Building Knowledge from Structured and Unstructured Data,” covers how to construct the KG from databases, documents, APIs, and LLM-extracted text. Read it next.