Identity, Reference, and Inference: How a Graph Becomes Knowledge
Identity is the load-bearing decision in a knowledge graph. IRIs are identifiers, not URLs. owl:sameAs is not as simple as it looks. Entity resolution is not optional. Inference is what turns stored facts into knowledge, and the choice between forward chaining (materialization) and backward chaining (query rewriting) is the second-most expensive design call after identity. This article gives the working design rules for all three and the W3C reasoning profiles (OWL 2 EL, QL, RL) that production KGs actually pick. Part 5 of the Knowledge Graph Practitioner's Guide.
The Two Records That Cost Five Million Dollars
A bank ingests counterparty data from three systems. The CRM has Acme Corp at 100 Main Street. The trade booking system has Acme Corp. (with a trailing period) at the same address. The risk system has ACME CORPORATION at a slightly different address (the same building, different floor). All three records are valid. None of the three identifiers carries a global anchor. The bank’s downstream applications treat these as three different counterparties.
When the AI risk agent computes total exposure to “Acme Corp” before approving a new credit facility, it sees one entity with $X exposure and approves the new line. The actual exposure is $X plus the two additional positions the agent never saw. Six months later, Acme files for bankruptcy. The realized loss is five million dollars more than the bank’s risk model thought possible.
This is the cost of getting identity wrong. It is not a theoretical problem. The vocabulary work we did in Part 4 lets you describe what an Acme Corp is. It does not, by itself, tell you that these three records refer to the same Acme Corp. That is the job of the layer this article covers.
By the end you will know how to mint stable identifiers, how to resolve entities responsibly, when to use the dangerous-but-useful owl:sameAs, and how to pick a reasoning profile that performs at production scale. The four-part lens introduced in Part 1 (entities, typed relationships, identity, inference) had two halves. Parts 3 and 4 covered the first half. This article covers the second.
Identity: What an IRI Is and Is Not
In the RDF paradigm, every entity has an IRI: an Internationalized Resource Identifier, the W3C’s name for a globally unique identifier with Unicode support. Per RDF 1.1 Concepts and Abstract Syntax, an IRI “identifies a resource, where a resource may be anything, including physical things, documents, abstract concepts, numbers and strings.”
Three properties matter.
| Property | What it means | Why it matters |
|---|---|---|
| Globally unique | The IRI uniquely identifies one resource across all systems and all time | Two systems with the same IRI know they mean the same thing |
| Dereferenceable (optional but recommended) | Resolving the IRI as a URL returns useful information about the resource | Linked data and federated queries become possible |
| Opaque to humans | The structure of the IRI does not, by itself, encode meaning that consumers should rely on | Renaming the entity later does not invalidate the identifier |
The biggest single misconception about IRIs is that they are URLs that must resolve to a web page. They are not URLs in that sense. An IRI like https://lakeside.com/kg/customer/12345 is a valid identifier whether or not anything is served at that address. It is convenient if dereferencing the IRI returns RDF describing the entity, but it is not required. As Tim Berners-Lee’s Linked Data design note puts it, the four principles of linked data are: use URIs as names; use HTTP URIs so people can look them up; when an HTTP URI is looked up, return useful information; include links to other URIs. Each principle is voluntary; together they make a graph that travels.
IRI design patterns to use
Use a stable namespace you control. A bank should mint customer IRIs under a domain it owns: https://lakeside.com/kg/customer/{stable-id}. This is not the customer’s marketing URL. It is a permanent identifier-space that survives marketing rebrands.
Make the local part opaque. The string /customer/12345 is fine. The string /customer/acme-corp-100-main-st is not, because it embeds details (name, address) that change. If Acme moves, the IRI must not change; the IRI is identity, not metadata.
Use one IRI scheme per entity type. All Customers under /customer/, all Loans under /loan/, all Accounts under /account/. This makes traversal patterns and access policies easier to reason about.
Plan for versioning before you need it. A common pattern is to mint a stable canonical IRI for the entity, plus dated versioned IRIs for snapshots: /customer/12345 for the entity itself, /customer/12345/version/2026-04-30 for the state of that customer at a point in time. Named graphs (covered in Part 7) hold the versioned snapshots.
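The patterns above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the helper names `mint_iri` and `version_iri` are hypothetical, and falling back to a random UUID for the opaque local part is an assumption (any stable surrogate key works).

```python
import uuid
from datetime import date
from typing import Optional

BASE = "https://lakeside.com/kg"  # namespace taken from the article's example

def mint_iri(entity_type: str, stable_id: Optional[str] = None) -> str:
    """Return a canonical IRI with an opaque local part.

    If no stable surrogate id is supplied, a random UUID is used
    (an assumption of this sketch, not a requirement).
    """
    local = stable_id if stable_id is not None else uuid.uuid4().hex
    return f"{BASE}/{entity_type}/{local}"

def version_iri(canonical_iri: str, as_of: date) -> str:
    """Return a dated snapshot IRI nested under the canonical IRI."""
    return f"{canonical_iri}/version/{as_of.isoformat()}"

customer = mint_iri("customer", "12345")
snapshot = version_iri(customer, date(2026, 4, 30))
print(customer)  # https://lakeside.com/kg/customer/12345
print(snapshot)  # https://lakeside.com/kg/customer/12345/version/2026-04-30
```

Nothing about the customer (name, address, product) appears in either string, so renaming or relocating the customer leaves both identifiers intact.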
IRI design anti-patterns to avoid
Do not use natural keys as IRIs. Email addresses, names, tax IDs, account numbers in the local part of the IRI seem convenient and become problems when the natural key changes (email rotates, account number reissues, the customer renames the business).
Do not embed meaning in the path. /loan/mortgage/30-year-fixed/CA/12345 looks descriptive and breaks the moment any segment is incorrect or the loan reclassifies.
Do not reuse retired IRIs. Once an IRI is minted, it identifies that entity forever. If the entity is deleted, the IRI is retired, not recycled.
Do not let the same entity have many IRIs from different sources without resolution. This is the root cause of the Acme example. We come back to it in the entity resolution section below.
Reference: How Identity Travels Across Systems
Once you have IRIs, you need a way to express that two IRIs minted by different parties refer to the same real-world entity. The RDF stack offers several constructs. They are not interchangeable.
| Construct | Strength | Right use |
|---|---|---|
| owl:sameAs | Strict identity. Two IRIs refer to the same entity; everything stated about one is automatically true of the other (because of OWL inference) | Two systems’ IRIs that you have verified refer to the same entity, and you want full inference closure |
| skos:exactMatch | “These two concepts are interchangeable for indexing/retrieval.” No automatic inference closure | Cross-vocabulary mapping where strict logical identity is too strong |
| skos:closeMatch | “These two concepts are very similar but not identical” | Loose mapping; explicitly hedged |
| skos:relatedMatch | “These two concepts are related” | Discovery and retrieval; not identity |
| owl:differentFrom | Explicit non-identity (the unique-name assumption is off in OWL by default) | When you need to assert that two IRIs do not refer to the same entity |
The most-misused construct on this list is owl:sameAs. Its semantics are simple: if A owl:sameAs B, then every fact about A is a fact about B and vice versa. This is exactly what you want when you have two enterprise systems’ identifiers for the same Acme Corp and you want them treated identically. It is also exactly what you do not want when two records are similar but not the same entity.
Halpin, Hayes, McCusker, McGuinness, and Thompson published the canonical critique of owl:sameAs in 2010. Their analysis of how owl:sameAs was used across the linked data web at the time found that real-world usage almost always violates the strict logical semantics of identity the construct demands. In practice, people reach for it to connect resources that are very similar but not truly identical, sharing some but not all properties. Their finding generalizes: when ambiguous identity is asserted with strict semantics, downstream inference produces confidently wrong results.
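To see why strict semantics punish loose usage, here is a small Python sketch of identity closure. Because owl:sameAs is symmetric and transitive, any chain of links collapses into one cluster, and every member inherits every other member's facts. The identifiers (`crm:acme`, `trade:acme`, `reg:acme-holdings`) and the fact values are hypothetical, and the union-find below is a stand-in for what a reasoner does, not any particular engine's implementation.

```python
from collections import defaultdict

def same_as_clusters(links):
    """Group IRIs into identity clusters with a simple union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for iri in parent:
        clusters[find(iri)].add(iri)
    return list(clusters.values())

def merged_facts(cluster, facts):
    """Under sameAs semantics, each member carries every member's facts."""
    return {f for iri in cluster for f in facts.get(iri, set())}

facts = {
    "crm:acme": {("exposure", 10)},
    "trade:acme": {("exposure", 4)},
    "reg:acme-holdings": {("jurisdiction", "DE")},  # a different legal entity
}
good = [("crm:acme", "trade:acme")]
bad = good + [("trade:acme", "reg:acme-holdings")]  # a "kind of like" link

clusters_good = same_as_clusters(good)
clusters_bad = same_as_clusters(bad)
print(merged_facts(clusters_bad[0], facts))
```

With only the verified link, the cluster holds the two Acme records. Adding one careless link pulls the holding company's jurisdiction onto both Acme records, and nothing in the output marks which facts arrived through the weak link.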
The discipline for enterprise KGs is to use owl:sameAs only when you have:
- A high-confidence resolution event that says these two IRIs are one entity.
- Provenance attached to the assertion: who decided, when, with what evidence.
- A reversal mechanism if the resolution turns out to be wrong.
For probabilistic or context-dependent links, prefer SKOS mappings or your own custom property with explicit semantics. A pattern that survives at scale is to make every coreference assertion a reified statement (a node), so that you can attach a confidence score, a method, a timestamp, and a reviewer to the assertion itself. This is heavier than a single triple but it is the price of operating safely.
What this looks like in practice: every cross-system identity assertion in your KG should answer three questions before it ships. Who decided this is a match? What evidence did they have? What happens if it is wrong? If you cannot answer all three, do not use owl:sameAs. Use a weaker SKOS construct and treat it as a hint, not a fact.
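One way to hold that discipline in code is to treat each coreference assertion as a record rather than a bare triple. The sketch below is a plain-Python illustration under stated assumptions: the class name `CoreferenceAssertion`, its fields, the 0.98 threshold, and the example identifiers are all hypothetical, and falling back to skos:closeMatch is one reasonable choice of weaker construct, not the only one.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

SAME_AS_THRESHOLD = 0.98  # hypothetical cut-off; set it from business risk

@dataclass
class CoreferenceAssertion:
    subject: str
    obj: str
    confidence: float
    method: str
    evidence: str = ""      # what evidence did they have?
    decided_by: str = ""    # who decided this is a match?
    decided_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    retracted_at: Optional[datetime] = None  # what happens if it is wrong?

    def predicate(self) -> Optional[str]:
        """Pick the link strength; a retracted assertion emits nothing."""
        if self.retracted_at is not None:
            return None
        answered = bool(self.decided_by and self.evidence)
        if answered and self.confidence >= SAME_AS_THRESHOLD:
            return "owl:sameAs"
        return "skos:closeMatch"

    def retract(self) -> None:
        """Reverse the assertion while keeping the record for audit."""
        self.retracted_at = datetime.now(timezone.utc)

verified = CoreferenceAssertion(
    "https://lakeside.com/kg/customer/12345", "crm:acme-001",
    confidence=0.995, method="registry id exact match",
    evidence="same registry number in both systems",
    decided_by="data-steward-1")
fuzzy = CoreferenceAssertion(
    "https://lakeside.com/kg/customer/12345", "risk:acme-corporation",
    confidence=0.91, method="name and address similarity")

print(verified.predicate())  # owl:sameAs
print(fuzzy.predicate())     # skos:closeMatch
```

Retraction flips a timestamp instead of deleting the record, so the history of the bad match survives the reversal.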
Entity Resolution: Where Theory Meets Reality
Long before anyone in my world said “IRI,” I sat through master data management reviews arguing about exactly this problem: three records, one real-world party, and a survivorship rule deciding which attributes win. Entity resolution in a knowledge graph is the same discipline with higher stakes, because an agent will traverse the merge and act on it. The rules I learned in MDM still apply: deterministic keys first, probabilistic matching second, and always a log of why two records became one.
Entity resolution (ER) is the operation of deciding whether two records refer to the same real-world entity. It is the single hardest problem in KG construction and the place where most of the engineering effort goes once a project is past Part 4. The MDM heritage of this work is real; we covered the data shape side in our existing MDM and golden record article. The KG twist is that ER outputs become first-class assertions in the graph.
Three families of approach dominate.
| Approach | How it works | Strength | Weakness |
|---|---|---|---|
| Deterministic (rules-based) | Exact-match rules over chosen identifiers (LEI, tax ID, customer number) | Fast, transparent, auditable | Fails when identifiers are missing or inconsistent across sources |
| Probabilistic (statistical/ML) | Compare attribute vectors; produce a match score; threshold | Handles fuzzy data, typos, transliteration | Requires labeled training data; less interpretable; tuning is ongoing |
| Hybrid | Deterministic first for high-confidence matches, probabilistic for the long tail | The production answer | More moving parts; pipeline complexity |
The 2026 practitioner consensus, summarized in a property-graph store vendor’s entity resolution overview (see Appendix A for the specific tools), is to “use deterministic rules for certain matches, probabilistic or ML models for uncertain cases, and graph clustering to consolidate results, while balancing automation with oversight by automating clear cases but routing ambiguous ones to human review.” An entity-resolution engine vendor’s guidance on entity-resolved knowledge graphs makes the same point: ER is not a one-time job but an ongoing operation that has to be re-run as data changes.
A real production case from OpenCorporates’ 2025 work on legal-entity knowledge graphs makes the discipline concrete: “Match records using both deterministic IDs (registry numbers, LEIs) and probabilistic signals (names, addresses, officer overlaps). Always keep a log of why a match was made.” The log is not optional. It is the audit trail that lets you reverse a bad match without losing the rest of the graph.
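A deterministic-first hybrid with a match log can be sketched with the standard library alone. Everything here is illustrative: the thresholds, the 0.6/0.4 weights, the suffix list, the record ids, and the floor detail in the risk record's address are assumptions, and `difflib.SequenceMatcher` stands in for a real probabilistic matcher.

```python
import re
from difflib import SequenceMatcher

AUTO_MATCH = 0.92  # hypothetical thresholds; tune against business consequence
REVIEW = 0.75
LEGAL_SUFFIXES = {"corp", "corporation", "inc", "ltd", "llc"}

def normalize(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def resolve(a, b, log):
    """Return 'match', 'review', or 'no_match' and log why."""
    if a.get("lei") and a.get("lei") == b.get("lei"):
        decision, score = "match", 1.0
        reason = "deterministic: LEI equal"
    else:
        name = similarity(normalize(a["name"]), normalize(b["name"]))
        addr = similarity(a["address"].lower(), b["address"].lower())
        score = 0.6 * name + 0.4 * addr  # illustrative weights
        if score >= AUTO_MATCH:
            decision = "match"
        elif score >= REVIEW:
            decision = "review"  # route to the human review queue
        else:
            decision = "no_match"
        reason = f"probabilistic: name={name:.2f}, address={addr:.2f}"
    log.append({"a": a["id"], "b": b["id"], "decision": decision,
                "score": round(score, 3), "reason": reason})
    return decision

match_log = []
crm = {"id": "crm-1", "name": "Acme Corp", "address": "100 Main Street"}
trade = {"id": "trade-7", "name": "Acme Corp.", "address": "100 Main Street"}
risk = {"id": "risk-3", "name": "ACME CORPORATION",
        "address": "100 Main Street, Floor 4"}

print(resolve(crm, trade, match_log))  # match
print(resolve(crm, risk, match_log))   # review
```

With these particular weights the risk-system record lands in the review band rather than auto-matching, which is the intended behavior for a near-miss address. Every call appends to `match_log`, so a reversal later can start from the recorded reason.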
The shift in 2026 that affects KG architecture: ER is no longer a nightly batch process. AI agents that read from the KG need real-time entity resolution because every new interaction has to be resolved to the correct profile before the agent acts. This pushes ER from a back-office data engineering task into the latency-sensitive serving path of the KG, which has architectural implications for materialization, caching, and reasoning that we will cover in Part 8.
Inference: How a Graph Becomes Knowledge
A knowledge graph that only stores asserted facts is useful but limited. The leverage comes from inference: deriving new facts from existing ones using the ontology axioms we covered in Part 4. If subClassOf(CommercialCustomer, Customer) and type(:Acme, CommercialCustomer) are stated, a reasoner derives type(:Acme, Customer) without anyone explicitly asserting it. If subsidiaryOf is transitive, the chain composes automatically.
There are two mainstream approaches to running inference. The choice between them is the second-most expensive runtime decision in a KG, after identity.
Forward chaining and materialization
Forward chaining starts with stated facts and applies rules to derive everything that follows. The derived facts are stored alongside the asserted ones. The technical term is materialization: at write time, the reasoner pre-computes the inference closure and persists it.
The advantage, as a reasoning guide on big knowledge graphs explains, is “forward-chaining and materialization, which allows us to do efficient query evaluation on big datasets.” The disadvantage is up-front compute and storage cost. The closure of a non-trivial ontology is materially larger than the asserted graph; the same guide reports a British Museum dataset whose roughly 200 million asserted statements expanded about fourfold once inference ran, and with ontologies that include transitivity the multiplier can grow well beyond that.
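A toy materializer makes the mechanics and the storage cost visible. This sketch applies three rules to a fixpoint: class membership propagates up subClassOf, subClassOf is transitive, and declared transitive properties compose. The triples and the `:AcmeUK`/`:AcmeEU` names are hypothetical, and the quadratic loop is for clarity only; production reasoners use indexed, incremental algorithms.

```python
SUBCLASS, TYPE = "rdfs:subClassOf", "rdf:type"

def materialize(asserted, transitive_props=frozenset()):
    """Return asserted triples plus everything the three rules derive."""
    closure = set(asserted)
    while True:
        new = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                if o != s2:
                    continue  # the two triples do not chain
                if p == TYPE and p2 == SUBCLASS:
                    new.add((s, TYPE, o2))  # membership moves up
                if p == p2 and (p == SUBCLASS or p in transitive_props):
                    new.add((s, p, o2))     # transitive composition
        if new <= closure:
            return closure  # fixpoint reached: nothing new derived
        closure |= new

asserted = {
    (":CommercialCustomer", SUBCLASS, ":Customer"),
    (":Acme", TYPE, ":CommercialCustomer"),
    (":AcmeUK", ":subsidiaryOf", ":AcmeEU"),
    (":AcmeEU", ":subsidiaryOf", ":Acme"),
}
closure = materialize(asserted, transitive_props={":subsidiaryOf"})
derived = closure - asserted
print(sorted(derived))
```

Four asserted triples become six stored triples. The two derived ones, `:Acme rdf:type :Customer` and `:AcmeUK :subsidiaryOf :Acme`, are now answerable by a plain lookup, which is the whole point of paying for the closure at write time.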
Forward chaining wins when:
- Read-to-write ratio is high (many queries, fewer writes).
- Inference latency at query time must be low (sub-second).
- The ontology is stable enough that recomputing closure is acceptable.
Most enterprise KGs land here. Materialization is the default for production deployments.
Backward chaining and query rewriting
Backward chaining starts with a query and rewrites it to also retrieve facts implied by the rules but not stored. No materialization happens at write time. The reasoner does its work at query time.
A reasoning-engine vendor’s FAQ frames the trade-off: backward chaining “can be used in conjunction with virtual graphs and does not need to do any upfront computation when ingesting a rule.” The price is paid in query latency and the loss of cardinality estimates that query optimizers rely on.
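The query-rewriting idea fits in a short Python sketch. Instead of storing derived types, the query "instances of Customer" is expanded at query time into "instances of Customer or any of its subclasses" and run against asserted triples only. The triples and function names are hypothetical, and real engines rewrite SPARQL or SQL rather than Python loops.

```python
SUBCLASS, TYPE = "rdfs:subClassOf", "rdf:type"

def subclasses_of(cls, asserted):
    """Collect cls and every class below it by walking subClassOf edges."""
    found, frontier = {cls}, [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in asserted:
            if p == SUBCLASS and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found

def instances_of(cls, asserted):
    """Answer a type query by rewriting it over the class hierarchy."""
    targets = subclasses_of(cls, asserted)  # the rewritten query
    return {s for s, p, o in asserted if p == TYPE and o in targets}

asserted = {
    (":CommercialCustomer", SUBCLASS, ":Customer"),
    (":SmallBusiness", SUBCLASS, ":CommercialCustomer"),
    (":Acme", TYPE, ":CommercialCustomer"),
    (":CornerCafe", TYPE, ":SmallBusiness"),
}
print(sorted(instances_of(":Customer", asserted)))
# [':Acme', ':CornerCafe']
```

The store still holds four triples after the query. The hierarchy walk happens on every call, which is where the query-latency cost comes from.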
Backward chaining wins when:
- Write-to-read ratio is high (many writes, few reads).
- The ontology changes frequently (recomputing materialization is too expensive).
- The reasoning profile selected naturally maps to query rewriting (OWL 2 QL is designed for this).
Hybrid approaches exist: materialize the stable parts of the ontology forward, leave the volatile parts to backward chaining. Most production KGs end up with this hybrid in some form.
| Dimension | Forward chaining (materialization) | Backward chaining (query rewriting) |
|---|---|---|
| Write cost | High (closure computed on every change) | Low (writes are direct) |
| Storage cost | High (closure persisted) | Low (only asserted facts stored) |
| Query latency | Low (everything is stored) | Higher (inference at query time) |
| Cardinality estimation | Available (statistics over closure) | Limited (query planner blind to derivations) |
| Best fit | Read-heavy KGs with stable ontology | Write-heavy KGs with volatile rules |
| Profile fit | OWL 2 RL | OWL 2 QL |
Reasoning Profiles: OWL 2 EL, QL, RL
OWL 2 has multiple profiles, defined in the W3C OWL 2 Profiles specification. Each trades expressivity for tractability in a different way. Choosing the right profile is the runtime decision that determines whether your reasoning is fast enough to use.
| Profile | Optimized for | Reasoning complexity | Real-world example |
|---|---|---|---|
| OWL 2 EL | Very large terminologies with class hierarchies; classification and subsumption | Polynomial time | SNOMED CT, Gene Ontology |
| OWL 2 QL | Lightweight ontologies over very large data; query rewriting against relational stores | AC0 in data complexity (within LogSpace) for query answering | Ontology-based data access (OBDA) over SQL warehouses |
| OWL 2 RL | Rule-based forward chaining over RDF; many individuals with moderate ontology complexity | Polynomial time | Most enterprise KGs (FIBO, internal banking ontologies) |
| OWL 2 DL | Maximum expressivity that remains decidable | Worst-case 2NEXPTIME (avoid in production) | Research; small but very rich ontologies |
| OWL 2 Full | Full expressivity, no tractability guarantees | Undecidable in general | Avoid for production KGs |
The selection rule of thumb: pick OWL 2 RL by default for an enterprise KG with rich data and a moderate-complexity ontology. Pick OWL 2 EL when the ontology is dominated by a very deep class hierarchy and you need scalable subsumption (clinical, biomedical, library science). Pick OWL 2 QL when your data lives in a relational warehouse and you want a virtual KG view over it without ETL.
This selection is consistent with what we already saw in Part 4: FIBO targets OWL 2 RL among others; SNOMED CT is OWL 2 EL. The choice of profile follows from the shape of the domain.
For practitioners: do not pick OWL 2 DL or Full for a production KG. The expressivity is real and the cost is unbounded. If you need a feature outside RL/EL/QL, the right move is almost always to model the requirement differently within a tractable profile, not to escalate to a more expensive one.
Open World vs Closed World, Revisited
We touched on this in Part 4 and it deserves a closer look here because it directly shapes what a reasoner concludes.
OWL is governed by the Open World Assumption (OWA). A missing fact is not false; it is unknown. If your KG does not state that Alice has a phone number, OWL does not conclude that Alice has no phone number. It concludes nothing.
SHACL is governed by the Closed World Assumption (CWA). A missing fact relative to a constraint is treated as a violation. If a SHACL shape says Customer must have minCount 1 phone, and Alice has no phone, the validation fails.
This split is intentional. OWL is a knowledge representation language; the world is bigger than what we have written down. SHACL is a validation language; the data we have is the data we are checking. The mistake teams make is to expect OWL to behave like SHACL, or vice versa. The mistake produces unexpected reasoner output (“why didn’t the reasoner conclude X?”) and unexpected validation failures (“why is SHACL flagging this?”).
The practical pattern: OWL for inference (what follows from what we know), SHACL for validation (what must be present in what we have). Use both. They are complementary.
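The Alice example can be made concrete with two functions over the same data. This is a plain-Python illustration of the two readings, not an OWL reasoner or a SHACL engine; the function names and the `:phone` property are hypothetical.

```python
from enum import Enum

class Truth(Enum):
    TRUE = "true"
    UNKNOWN = "unknown"

def open_world_has(graph, subject, prop):
    """OWL-style reading: a missing triple is unknown, not false."""
    present = any(s == subject and p == prop for s, p, _ in graph)
    return Truth.TRUE if present else Truth.UNKNOWN

def closed_world_min_count(graph, subject, prop, min_count=1):
    """SHACL-style reading: check the data we have against a constraint."""
    count = sum(1 for s, p, _ in graph if s == subject and p == prop)
    if count >= min_count:
        return []
    return [f"{subject}: {prop} has {count} value(s), "
            f"expected at least {min_count}"]

graph = {(":Alice", "rdf:type", ":Customer")}

print(open_world_has(graph, ":Alice", ":phone"))
# Truth.UNKNOWN
print(closed_world_min_count(graph, ":Alice", ":phone"))
# [':Alice: :phone has 0 value(s), expected at least 1']
```

Same graph, same missing phone number: the open-world reading declines to conclude anything, while the closed-world reading reports a violation.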
The Production Stack: Identity, ER, and Inference Together
Putting the layers together gives the standard eight-stage pipeline for a production KG: Ingest, Map, Resolve, Mint, Assert, Reason, Validate, Serve.
| Stage | What it does | Tooling concern |
|---|---|---|
| 1. Ingest | Pull source data (databases, documents, APIs) into a staging area | ETL/ELT, change data capture, document parsers |
| 2. Map and lift | Map source records to ontology classes and properties; produce candidate IRIs | R2RML, RML, custom mappers |
| 3. Resolve | Run entity resolution; decide which records refer to the same entity | Deterministic rules + probabilistic matching + human review queue |
| 4. Mint | Assign canonical IRIs to resolved entities; link source IRIs via owl:sameAs or skos:exactMatch with provenance | Identity-as-a-service component |
| 5. Assert | Write the resolved facts (triples or property graph nodes/edges) into the store | Triple store or graph database |
| 6. Reason | Run forward-chaining materialization (OWL 2 RL is typical), or set up backward-chaining for volatile parts | An OWL reasoning engine, whether standalone or embedded in the triple store (per Appendix A) |
| 7. Validate | SHACL shapes check write-time data integrity | SHACL engines |
| 8. Serve | Queries from BI, applications, and AI agents hit the KG; some hit materialized facts directly, some trigger backward inference | SPARQL endpoint, GraphQL gateway, agent-specific query layer |
The Lakeside Trust Bank reference architecture in Part 11 instantiates this exact pipeline. The OpenCorporates legal-entity KG does it for legal entity data at planetary scale. A dedicated entity-resolution engine packages stages 3 and 4 as a specialized layer. Each production KG you will encounter in 2026 is some flavor of this pipeline.
Failure Modes Specific to Identity and Inference
These are the patterns where the layer this article covers breaks down. Recognizing them on a system you inherit saves quarters of effort.
| Failure | Symptom | Root cause |
|---|---|---|
| sameAs explosion | Reasoner is slow; closure size is unmanageable; queries return surprising “linked” entities | Naive use of owl:sameAs to mean “kind of like” instead of “exactly the same”; transitive closure unbounded |
| Materialization without versioning | After an ontology change, derived facts are wrong but stored; downstream consumers consume stale inferences | No tracking of which facts came from which ontology version; no re-materialization discipline |
| Profile mismatch | Reasoner is too slow, sometimes hangs; queries time out under modest load | OWL 2 DL or Full chosen for a use case that needs RL; reasoner is doing exponential work |
| ER thresholds without humans | Confident wrong matches; or low-recall and missed matches; either way, nobody reviews | Probabilistic matching turned on with no review queue; thresholds tuned in isolation from business consequence |
| Identity collisions across re-ingest | Same entity gets multiple canonical IRIs over time; downstream consumers see duplication | Re-ingest pipeline mints fresh IRIs without checking the resolution layer |
| Forward chaining where backward fits | High write-amplification; sluggish ingest; reasoner consumes most compute | KG was built read-heavy, but data churn is high; materialization recomputes constantly |
| Backward chaining where forward fits | Surprisingly slow queries; query optimizer cannot estimate cardinality; agents time out | Inference deferred to query time on a read-heavy KG with low write volume |
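
The identity-collision row is the easiest of these to prevent in code: the mint step must consult the resolution layer before creating anything. A minimal sketch, assuming an in-memory dictionary in place of a durable registry; the class name `IdentityRegistry`, its methods, and the source ids are hypothetical.

```python
import uuid

class IdentityRegistry:
    """Get-or-mint: a source record always maps to the same canonical IRI."""

    def __init__(self, base="https://lakeside.com/kg"):
        self.base = base
        self._by_source_key = {}  # a real system needs durable storage

    def canonical_iri(self, entity_type, source_system, source_id):
        key = (entity_type, source_system, source_id)
        if key not in self._by_source_key:
            # Mint only when the resolution layer has never seen this record.
            iri = f"{self.base}/{entity_type}/{uuid.uuid4().hex}"
            self._by_source_key[key] = iri
        return self._by_source_key[key]

    def link(self, entity_type, source_system, source_id, canonical_iri):
        """Record that ER resolved another source record to a known entity."""
        key = (entity_type, source_system, source_id)
        self._by_source_key[key] = canonical_iri

registry = IdentityRegistry()
first = registry.canonical_iri("customer", "crm", "A-1")
registry.link("customer", "trade", "T-77", first)  # ER said: same entity

# A later re-ingest of either record returns the existing IRI.
again = registry.canonical_iri("customer", "crm", "A-1")
via_trade = registry.canonical_iri("customer", "trade", "T-77")
print(first == again == via_trade)  # True
```

The failure mode in the table is what happens when the re-ingest path skips this lookup and calls the minting function directly.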
A Decision Tree for the Runtime Layer
When you are designing the runtime layer of a KG, walk these questions in order.
- What is the entity space? Bound it: customers, products, employees, transactions. The smaller the entity space, the easier identity is.
- What is the canonical identifier source? An LEI, a customer master ID, a tax ID. Pick one source per entity type.
- What is the IRI minting policy? https://yourdomain.com/kg/{type}/{stable-opaque-id}. Document this. Treat the IRI namespace as load-bearing infrastructure.
- What is the ER strategy? Deterministic-first hybrid is the production default. Set thresholds. Set up the review queue.
- Are there cross-system mappings to other organizations’ identifiers? Use owl:sameAs only with provenance. Use SKOS for fuzzy matches. Do not use owl:sameAs casually.
- What OWL profile? RL by default. EL for very large terminologies. QL for relational pass-through. Avoid DL/Full.
- Forward or backward chaining? Forward (materialization) by default. Backward for volatile or write-heavy parts. Hybrid is fine.
- What is the reasoning trigger? On every write (online materialization), nightly batch, or on-demand. The choice depends on freshness requirements.
- What is the validation layer? SHACL shapes for write-time integrity. Run them as a CI check on the ontology itself.
- What is the change protocol when the ontology evolves? See Part 8.
Most production KGs end up at: deterministic-plus-probabilistic ER, OWL 2 RL with online materialization for the stable core, SHACL validation at ingest, named graphs for versioning. That is not the only shape but it is the shape most teams converge to over their first two years.
What You Should Now Be Able to Do
If you read this article cold, you should now be able to:
- Mint IRIs for a new entity type without embedding meaning that will rot.
- Decide between owl:sameAs, skos:exactMatch, and a custom resolution property for any cross-system identity link, with provenance attached.
- Lay out a deterministic-plus-probabilistic ER pipeline with a human review queue and explain when to add real-time matching for AI agents.
- Choose between forward and backward chaining for a given KG workload based on read/write ratio, ontology stability, and latency requirements.
- Pick an OWL 2 profile (EL, QL, RL) appropriate to the shape of your domain and avoid the DL/Full trap.
- Recognize the seven runtime failure modes in this article on a system you inherit.
You now have the full conceptual core of the KG series: vocabulary (Part 4) plus identity, reference, and inference (this article). The remaining articles cover sourcing the graph from real data (Part 6), quality and provenance (Part 7), operations (Part 8), and the two flagship applications: KGs for AI agents (Part 9) and KGs for governance (Part 10).
Do Next
| Priority | Action | Why it matters |
|---|---|---|
| This week | Audit your existing master data system. For three core entity types (customer, product, account, or your equivalent), document the IRI policy if there is one and the ER process if there is one. The audit usually reveals that one or both are implicit. | Implicit identity decisions are the single biggest source of cross-system data quality issues. Making them explicit is the first step to fixing them. |
| This week | Read Halpin et al., “When owl:sameAs Isn’t the Same”. Even if you never deploy a single owl:sameAs statement, the analysis sharpens your discipline around when “same” is appropriate and when it is not. | Almost every KG project misuses sameAs at first. Reading the canonical critique once saves quarters of cleanup later. |
| This month | Pick one bounded entity type (e.g., legal entities) and build the deterministic-first hybrid ER pipeline for it. Set thresholds. Set up a human review queue for ambiguous matches. Capture provenance for every match. | A working ER pipeline for one entity type is more valuable than a planned pipeline for ten. Per Part 2, bounded scope is the hedge against year-two collapse. |
| This month | If your KG is in design, lock the OWL profile choice in writing before any reasoner is selected. Justify the choice. Most teams pick the wrong profile by deferring the choice to the vendor evaluation. | Reasoner profile is a fundamental architectural decision. Vendors who bend OWL Full to be “almost as fast as RL” are selling future pain. |
| This quarter | Run a materialization vs backward-chaining benchmark on representative data for your KG. Measure write latency, query latency, storage size. The numbers usually surprise the team and shape the cache strategy. | Many teams default to materialization without measuring. When the write workload changes, the choice should be revisited. |
| This quarter | Read Part 6 before designing your ingestion layer. Sourcing decisions interact heavily with the identity and inference choices in this article. | The full pipeline only works if the ingestion stage is designed compatibly with identity, ER, and reasoning. Designing them in isolation produces seam pain. |
Part 6 of this series, “Sourcing the Graph: Building Knowledge from Structured and Unstructured Data,” covers how to construct the KG from databases, documents, APIs, and LLM-extracted text. Read it next.
Sources & References
- Hogan et al., Knowledge Graphs, ACM Computing Surveys (2021)
- W3C RDF 1.1 Concepts and Abstract Syntax (2014)
- W3C OWL 2 Profiles (EL, QL, RL) (2012)
- W3C OWL 2 Web Ontology Language Document Overview (2012)
- Tim Berners-Lee, Linked Data Design Issues (Cool URIs) (2006)
- Halpin et al., When owl:sameAs Isn't the Same, ISWC (2010)
- Ontotext: Reasoning with Big Knowledge Graphs (2024)
- Oxford Semantic Technologies: Backward Chaining (2024)
- Neo4j: What Is Entity Resolution? (2024)
- OpenCorporates: Legal-entity knowledge graphs (2025)
- Senzing: Entity-resolved knowledge graphs (2024)