Knowledge Graphs for Data Governance: Lineage, CDEs, and Master Data as a Graph
Most enterprise Data Governance has plateaued at a catalog plus a lineage tool plus a glossary plus a policy register, and four disconnected stores cannot answer the questions a regulator now asks. This article shows how a knowledge graph turns Data Lineage, Critical Data Elements, master data, and policy into one queryable substrate, with the OpenLineage-to-PROV-O bridge from Part 7 as the connective tissue. Worked patterns for BCBS 239, ECB RDARR attribute-level lineage, GDPR Article 30, and EU AI Act Article 10. Part 10 of the Knowledge Graph Practitioner's Guide.
The CDO Who Spent Three Weeks Answering A Five-Minute Question
A mid-size US bank we will call Northwind Bancshares, a composite drawn from publicly documented failure patterns rather than a real firm, had what most peers would consider a mature governance program. There was a data catalog with roughly forty thousand registered datasets, a lineage tool that crawled their warehouse and three of their five operational systems, a business glossary with about nine hundred entries, a policy register maintained by the Privacy Office, and a Critical Data Element inventory with three hundred CDEs that had been assembled over two years (the CDE meta-model shape was followed faithfully). The Chief Data Officer had presented the program at three industry conferences. By every internal metric, governance was working.
In late 2025, an examiner from the bank’s primary regulator asked a question during a routine review of the quarterly capital report. The question was simple. “The Counterparty Credit Exposure number on page eleven of your Pillar 3 disclosure: where does it come from, who owns each input, what quality checks were applied, when was it last revalidated, and which inputs have changed since the prior quarter.” The examiner expected an answer in the first ten minutes of the meeting.
The team produced a partial answer in three weeks. The lineage tool had the warehouse-to-report flow but not the upstream operational-system feeds. The catalog had the datasets but the column-level lineage was incomplete. The CDE inventory listed Counterparty Credit Exposure as a Critical Data Element with a defined owner, but the link from the CDE to the physical columns lived in a separate spreadsheet that was last reconciled four months earlier. The quality framework knew what rules were defined but not which had run on which executions of the report. The policy register listed the BCBS 239 principles the bank claimed to comply with but did not link them to specific lineage edges. Each of those four registries was, on its own, well-maintained. None of them shared identifiers. Reconciling them was an archeological project.
Northwind’s incident is the failure mode that the Data Catalog shelfware article and the Metadata Management article each described from a different angle. The bank had four governance stores. The regulator was asking a question that required all four to be one. Every Data Governance program past a certain scale arrives at this collision. The escape from it is not a fifth tool. It is a shared substrate. That substrate is a knowledge graph.
The Four Stores, And Why They Fragment
Most enterprise governance programs accrete in the same order. The catalog comes first because it is the most visible. The lineage tool follows because impact analysis is impossible without it. The business glossary lives in a third place because it was owned by a different team. The policy register lives in a fourth place because it was the Privacy Office’s project. The Critical Data Element inventory, when it exists, lives in a fifth (often a spreadsheet) because it was a cross-functional effort that did not have a natural home in any of the previous four.
| Store | What it knows | What it does not know |
|---|---|---|
| Data catalog | Datasets, columns, owners, descriptions, tags | Which dataset produced which other dataset; which CDE this column implements; which policy applies |
| Lineage tool | Edges between jobs and datasets, often column level | What each entity means semantically; whether the lineage is gold or extracted from heuristics; which lineage spans a CDE |
| Business glossary | Term definitions, term-to-term relationships, sometimes term-to-policy mappings | Which physical column implements each term; which jobs touched the term last week |
| Policy register | Policies, controls, regulatory mappings | Which datasets, which jobs, which CDEs the policies apply to in practice |
| CDE inventory (when separate) | Critical concepts, owners, quality rules, regulatory drivers | Which physical columns currently implement each CDE; which lineage edges carry the CDE; which incidents touched it |
The four-or-five-store pattern is not stupid. Each tool was a reasonable purchase at the time it was made. The fragmentation was emergent, not designed. The pattern fails because the questions a regulator, an auditor, an AI Act conformance assessor, or an internal incident reviewer now asks are questions that span every store at once. “Show me everything that touched our Counterparty Credit Exposure number in the last quarter, with owners, quality status, change events, and the BCBS 239 principles each step satisfies” is a single query, not five.
A knowledge graph is the data structure that makes that query possible without first reconciling four registries by hand. The argument of this article is that a governance KG is not “another tool.” It is the substrate that the four existing tools should write into and read from.
The Governance KG Has A Specific Shape
The governance KG is not a generic graph. It has a specific ontological shape that follows from what governance is for. Six entity types and the relationships between them carry most of the weight.
| Entity type | What it represents | Source |
|---|---|---|
| Dataset | A logical or physical data asset (table, view, file, topic) | Catalog ingest, OpenLineage events |
| Field | A column or property within a Dataset | Catalog schema crawl, OpenLineage schema facet |
| Job (Activity) | A pipeline run, query, transformation, or model training run that produced or consumed Datasets | OpenLineage run events; PROV-O Activity |
| CriticalDataElement | A business-level concept (e.g. Counterparty Credit Exposure) that maps one-to-many to Fields | CDE inventory; FIBO Business Entity references |
| Policy | A rule, control, or regulatory requirement (BCBS 239 Principle 3, GDPR Article 30, EU AI Act Article 10) | Policy register; regulatory cross-walks |
| Owner (Agent) | A team or individual accountable for a Dataset, Field, CDE, or Policy | Catalog; HRIS; PROV-O Agent |
The relationships between these entity types are where governance lives. A Job wasAssociatedWith an Agent. A Dataset wasGeneratedBy a Job. A Field implementsCDE a CriticalDataElement. A Job validatedAgainstShapes a SHACL ShapeGraph (the seven-field provenance contract introduced in Part 7 and completed in Part 8). A Policy applies to a CriticalDataElement and, transitively, to every Field that implements it and every Job that touches one of those Fields. A regulator’s question becomes a graph traversal across these typed relationships, not a join across spreadsheets.
The ontology is not invented from scratch. It reuses three existing W3C-backed vocabularies and one industry ontology. The W3C DCAT 3 Recommendation gives the catalog vocabulary (Catalog, Dataset, Distribution, DataService) and was promoted to a W3C Recommendation in 2024. The W3C PROV-O ontology gives Activity, Entity, Agent, and the relationship verbs (wasGeneratedBy, wasDerivedFrom, wasAttributedTo, used). The W3C SKOS recommendation gives the business glossary (Concept, prefLabel, broader, narrower) and lets the glossary attach to Datasets without forcing it to become an OWL ontology. FIBO supplies the regulated-finance type system that CDE definitions can point at (FIBO BE for legal entities, FIBO LOAN for credit, FIBO SEC for securities). The result is a governance ontology that is mostly imports, with a thin organization-specific module on top, exactly the modular pattern from Part 4.
For practitioners: The discipline that breaks the four-store pattern is not “build a graph.” It is “stop minting new identifiers.” Every Dataset has one IRI in the governance graph, and every catalog, lineage tool, glossary, and policy register references that IRI rather than creating its own. If your catalog says `dataset_id=4711` and your lineage tool says `dataset.urn=warehouse.public.fct_exposure`, those are two different identifiers for the same asset, and reconciling them later is the work that produced Northwind’s three-week delay. The IRI discipline from Part 5 is the foundation that keeps the governance graph honest.
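As a minimal sketch of that rule, the two tool-local identifiers above can be treated as aliases that resolve to one canonical IRI. The base IRI, the alias table, and the `resolve` helper are all hypothetical; a real program would follow its own minting rules and keep the aliases in the graph itself.

```python
# Hypothetical base IRI; the real minting scheme follows whatever rules your program adopts.
BASE = "https://data.northwind.example/dataset/"

# One canonical IRI per asset. Tool-local identifiers are aliases, never new identities.
ALIASES = {
    ("catalog", "dataset_id=4711"): BASE + "warehouse/public/fct_exposure",
    ("lineage", "dataset.urn=warehouse.public.fct_exposure"): BASE + "warehouse/public/fct_exposure",
}


def resolve(tool: str, local_id: str) -> str:
    """Return the canonical IRI for a tool-local identifier, or fail loudly."""
    try:
        return ALIASES[(tool, local_id)]
    except KeyError:
        # An unknown identifier becomes a stewardship task, not a freshly minted IRI.
        raise LookupError(f"no canonical IRI registered for {tool}: {local_id}") from None


catalog_iri = resolve("catalog", "dataset_id=4711")
lineage_iri = resolve("lineage", "dataset.urn=warehouse.public.fct_exposure")
print(catalog_iri == lineage_iri)
```

Failing on an unknown identifier, rather than minting a fallback IRI, is the design choice that matters here: the gap surfaces at ingest time instead of during a regulator's review.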
Lineage As A Graph: The OpenLineage To PROV-O Bridge
Lineage is the most natural fit for a graph substrate. The challenge is that the dominant lineage instrumentation today is event-shaped, not graph-shaped, and the bridge between the two is what most teams skip.
OpenLineage is the de facto open standard for lineage metadata collection in 2026. Apache Airflow, Apache Spark, dbt, Apache Flink, and Dagster can all emit OpenLineage events, natively or through an integration. The model is simple: every job run emits Run events (start, complete, fail, abort) that reference the Job that ran, the Inputs it consumed, and the Outputs it produced. Each Dataset and each Run carries Facets (extensible metadata blocks): the schema facet, the column-level lineage facet, the data quality facet, the run statistics facet. An OpenLineage metadata store, maintained as an LF AI and Data project, is the reference implementation of the OpenLineage backend (see Appendix A for the specific tools).
OpenLineage events are the firehose. The governance KG is the curated substrate. The pattern that connects them is the OpenLineage-to-PROV-O bridge that Part 7 introduced. The bridge is mechanical and worth stating explicitly.
| OpenLineage concept | PROV-O concept | Governance KG predicate |
|---|---|---|
| Run | Activity | prov:Activity; one node per run |
| Job | Plan (a kind of Entity) | prov:hadPlan, via a prov:qualifiedAssociation from the Run |
| Input dataset | Entity used by the Activity | prov:used |
| Output dataset | Entity generated by the Activity | prov:wasGeneratedBy |
| Producer (operator/team) | Agent | prov:wasAssociatedWith from the Run |
| Column-level lineage facet | Derivation between fields | prov:wasDerivedFrom at field granularity |
| Data quality facet (e.g. a data-quality validation result) | Quality assertion attached to Activity | dqv:hasQualityMeasurement |
| Schema facet | Field definitions on the Entity | dcat:schema/schema:propertyID |
The OpenLineage metadata store holds these events natively. A small ingestion service reads from that store (or from the OpenLineage HTTP transport directly), maps each event into the table above, and asserts the result into a designated named graph in the governance KG. The named graph is one per source system per release window, exactly the Part 8 versioning discipline. The seven-field provenance contract is satisfied automatically: the Run carries where (system), what process (job IRI), who (agent IRI), when (timestamp), trust level (set by source policy: gold for production warehouse jobs, silver for sandbox, bronze for ad hoc query exports), source hash (the dataset’s content hash from the schema facet), and validatedAgainstShapes (the SHACL shapes that approved the run’s outputs).
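The Run, Input, and Output rows of the mapping can be sketched as a small function over one event. This is an illustration under stated assumptions, not the Part 7 implementation: the `KG` base IRI, the IRI layout, and both function names are hypothetical; triples are plain tuples rather than an RDF library's objects; and the run-to-agent edge uses `prov:wasAssociatedWith`, which is PROV-O's Activity-to-Agent property. Job-as-Plan, facets, and named-graph assignment are left out.

```python
from urllib.parse import quote

PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
KG = "https://kg.northwind.example/"  # hypothetical base IRI


def dataset_iri(ds: dict) -> str:
    # Percent-encode namespace and name so values like "postgres://host:5432" stay one path segment.
    return f"{KG}dataset/{quote(ds['namespace'], safe='')}/{quote(ds['name'], safe='')}"


def run_event_to_triples(event: dict, agent_iri: str) -> list[tuple[str, str, str]]:
    """Map one COMPLETE run event to PROV-O triples; other event types are skipped here."""
    if event["eventType"] != "COMPLETE":
        return []
    run = f"{KG}run/{event['run']['runId']}"
    triples = [
        (run, RDF_TYPE, PROV + "Activity"),
        (run, PROV + "endedAtTime", event["eventTime"]),
        (run, PROV + "wasAssociatedWith", agent_iri),
    ]
    for ds in event.get("inputs", []):
        triples.append((run, PROV + "used", dataset_iri(ds)))
    for ds in event.get("outputs", []):
        triples.append((dataset_iri(ds), PROV + "wasGeneratedBy", run))
    return triples


# A trimmed event with made-up values; real events carry more fields and facets.
event = {
    "eventType": "COMPLETE",
    "eventTime": "2025-12-31T23:10:00Z",
    "run": {"runId": "0193a2c4-7b1e-7c3a-9d54-2f6b8e1a4c00"},
    "job": {"namespace": "warehouse", "name": "build_fct_exposure"},
    "inputs": [{"namespace": "warehouse", "name": "public.stg_trades"}],
    "outputs": [{"namespace": "warehouse", "name": "public.fct_exposure"}],
}
triples = run_event_to_triples(event, KG + "agent/risk-data-engineering")
for triple in triples:
    print(triple)
```

Deriving dataset IRIs deterministically from namespace and name is what lets two events about the same dataset land on the same node without a lookup.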
What the bridge buys is the ability to ask three classes of question from one substrate. Forward impact: “if Dataset X changes, which Datasets, Fields, CDEs, and Reports depend on it.” Backward provenance: “show every Activity, Agent, and Dataset that contributed to this number, transitively, with timestamps and quality assertions.” Cross-cutting: “every Job in the last quarter that touched a Field implementing the Counterparty Credit Exposure CDE, ordered by the most recent SHACL validation result, with the Owner and the Policy that applies to each step.” The third question is what Northwind’s examiner asked. With OpenLineage flowing into a governance KG via the PROV-O bridge, the third question is one SPARQL query, not three weeks.
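The forward-impact question is the simplest of the three to illustrate. The sketch below assumes derivation edges have already been extracted into `(downstream, upstream)` pairs; the dataset names are invented, and a production system would run the equivalent traversal as a SPARQL property path against the graph rather than in application code.

```python
from collections import defaultdict, deque

# (downstream, upstream) pairs, read as "downstream prov:wasDerivedFrom upstream".
# All dataset names are hypothetical.
DERIVED_FROM = [
    ("stg_trades", "src_trades"),
    ("fct_exposure", "stg_trades"),
    ("fct_exposure", "dim_counterparty"),
    ("pillar3_report", "fct_exposure"),
]


def forward_impact(changed: str, edges: list[tuple[str, str]]) -> set[str]:
    """Return everything that transitively derives from `changed`."""
    downstream = defaultdict(set)
    for child, parent in edges:
        downstream[parent].add(child)
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in downstream[node]:
            if child not in seen:  # guards against revisiting shared or cyclic paths
                seen.add(child)
                queue.append(child)
    return seen


print(sorted(forward_impact("src_trades", DERIVED_FROM)))
```

Backward provenance is the same walk with the edge direction reversed.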
The 2024 ECB Guide on effective risk data aggregation and risk reporting made attribute-level Data Lineage one of seven priority areas for European supervised banks. Attribute-level lineage is the column-level facet of OpenLineage assembled into a graph and queried against the CDE inventory. Banks that have OpenLineage emitting from every pipeline but no graph behind it have the firehose without the substrate, which is the same as having the substrate empty. The substrate is what makes the firehose answer the question.
Critical Data Elements As A Graph
The CDE meta model from Part 0 of the CDE series is already a graph in disguise. That article presented a three-layer structure: a business-level CDE (Counterparty Credit Exposure), a logical-level data element (per-system semantic mapping), and a physical-level column (each system’s actual column). The 1:N ratio (one CDE to roughly twenty columns at Northwind, per the TDAN finding that any attribute exists ten to seventy times across an enterprise) is exactly what a typed relationship implementsCDE captures.
In a governance KG, a CDE becomes a typed node with a small, fixed set of relationships.
| Relationship | What it links | Why it matters |
|---|---|---|
| `cde:implementsCDE` (inverse `cde:hasImplementation`) | Field to CDE | The 1:N mapping; quality rules attach at the CDE; aggregation rolls up to the CDE |
| `cde:typeReference` | CDE to FIBO/SNOMED/USCDI/Schema.org concept | Anchors the CDE definition in a recognized industry ontology, defending against the “what does Counterparty mean here” objection |
| `cde:hasOwner` (PROV-O `prov:wasAttributedTo`) | CDE to Owner | The single accountable party; not the same as the dataset owner |
| `cde:hasQualityRule` | CDE to QualityRule | A SHACL shape or DQ rule that any implementing Field inherits |
| `cde:appliesPolicy` (inverse of `policy:applies`) | CDE to Policy | The regulatory drivers (BCBS 239, GDPR, EU AI Act) attached at the CDE level rather than scattered across Fields |
| `cde:dependsOn` | CDE to CDE | Computational dependencies (Counterparty Credit Exposure depends on Counterparty Identifier, Exposure Amount, and Recovery Rate) |
The Northwind bank had three hundred CDEs and roughly six thousand physical columns, per the 1:20 ratio its CDE program had measured. In the governance KG, the three hundred CDE nodes become the load-bearing semantic anchors. Quality rules attach at the CDE node and inherit downward to every implementing Field. Lineage queries that ask “what changed in Counterparty Credit Exposure last quarter” traverse from the CDE node through cde:hasImplementation to the Fields, then through prov:wasGeneratedBy to the Activities, then through prov:wasAssociatedWith to the Agents that ran them, all in one query. Quality status rolls up the same path.
The pattern works because the CDE node is the right grain for governance. It is small enough to be human-curatable (three hundred at Northwind). It is large enough to organize the long tail of fields underneath. It is the node that owners, regulators, and policies all naturally attach to. The catalog and lineage tool can keep emitting field-level metadata at scale, and the CDE layer collapses that scale into a governable surface.
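The roll-up of quality status from Fields to the CDE node can be sketched in a few lines. The field names, the three status values, and the "worst result wins" policy are all assumptions made for illustration; the article does not prescribe a particular aggregation rule.

```python
# cde:hasImplementation edges: CDE -> implementing fields (hypothetical names).
IMPLEMENTATIONS = {
    "CounterpartyCreditExposure": [
        "warehouse.fct_exposure.exposure_amt",
        "risk_mart.cpty_exposure.ead",
    ],
}

# Latest validation outcome per field; a field with no entry has never been validated.
LATEST_RESULT = {
    "warehouse.fct_exposure.exposure_amt": "pass",
    "risk_mart.cpty_exposure.ead": "fail",
}


def cde_status(cde: str) -> str:
    """Roll field-level results up to the CDE: any failure or gap taints the CDE."""
    results = [LATEST_RESULT.get(field, "unvalidated") for field in IMPLEMENTATIONS[cde]]
    if "fail" in results:
        return "fail"
    if "unvalidated" in results:
        return "unvalidated"
    return "pass"


print(cde_status("CounterpartyCreditExposure"))
```

Treating a never-validated field as its own status, rather than as a pass, keeps a coverage gap visible at the CDE level.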
What this looks like in practice. The most common failure mode for CDE programs that try to live in a spreadsheet is that the field-to-CDE mapping rots within six months: schema changes, dataset renames, and new pipelines all break the spreadsheet links silently. When the same mapping lives in the graph as `cde:hasImplementation` edges, every OpenLineage schema-facet event is an opportunity to reconcile the mapping automatically. New columns in a tracked dataset trigger a stewardship task (“does this implement an existing CDE, or is it a new concept”), and renames update the edge atomically. The graph is what keeps the CDE program from rotting.
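A sketch of that reconciliation step, assuming the schema facet arrives as a dict with a `fields` list of `name`/`type` entries and that the graph's current knowledge has been read into a plain mapping. The `reconcile` function, the field-key format, and the sample data are hypothetical. Distinguishing a rename from a drop-plus-add needs more evidence than one facet carries, so this version only reports both sides of the difference.

```python
def reconcile(dataset: str, schema_facet: dict, known_fields: dict) -> dict:
    """Diff the columns in a schema facet against what the graph already knows.

    known_fields maps a field key to its CDE, or to None when a steward has
    already decided that the field implements no CDE.
    """
    prefix = dataset + "."
    current = {prefix + field["name"] for field in schema_facet["fields"]}
    known = {key: cde for key, cde in known_fields.items() if key.startswith(prefix)}
    return {
        # New columns: a steward decides "existing CDE or new concept".
        "needs_review": sorted(current - set(known)),
        # cde:hasImplementation edges whose column no longer exists (dropped or renamed).
        "dangling": sorted(key for key, cde in known.items() if cde and key not in current),
    }


facet = {"fields": [
    {"name": "cpty_id", "type": "VARCHAR"},
    {"name": "exposure_amt", "type": "DECIMAL"},
    {"name": "recovery_rate", "type": "DECIMAL"},
]}
known = {
    "warehouse.public.fct_exposure.cpty_id": "CounterpartyIdentifier",
    "warehouse.public.fct_exposure.exposure_amt": "CounterpartyCreditExposure",
    "warehouse.public.fct_exposure.ead_old": "CounterpartyCreditExposure",
    "warehouse.public.fct_exposure.load_ts": None,
}
result = reconcile("warehouse.public.fct_exposure", facet, known)
print(result)
```

Recording "reviewed, implements no CDE" as an explicit `None` keeps already-triaged columns from reappearing as stewardship tasks on every event.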
Master Data Golden Records As Nodes
The master data conversation has the same shape as the CDE conversation, but at the entity layer instead of the concept layer. The MDM in a Data Mesh World article made the case that a single golden record is in tension with domain ownership in a mesh, and that three patterns resolve the tension (federated MDM, domain-owned with cross-domain reconciliation, MDM as a thin platform service). All three patterns are easier to operate when the golden record is a node in the governance KG rather than a row in a separate MDM hub.
The pattern follows directly from Part 5’s identity discipline. Every source system contributes records about a real-world entity (a customer, a counterparty, a product, an employee). Entity resolution (the Part 5 ER pipeline) produces a stable IRI for the resolved entity. Each source record points at the resolved entity through owl:sameAs (when the match is identity-strength) or a SKOS mapping property (when the match is similarity-strength). The resolved entity is the golden record. It is a node in the governance KG, with provenance edges back to every source record and trust-tier labels per source.
| MDM concern | KG-as-substrate answer |
|---|---|
| Where does the canonical identity live | At the resolved-entity IRI; every source record points at it via owl:sameAs or a SKOS exact match; each match is reified with method, score, timestamp, reviewer (the Part 5 provenance-per-match discipline) |
| How does federated MDM not collapse into central MDM | Each domain mints its own source-side IRIs; the resolved-entity IRI is in a shared named graph that any domain can read; only ER updates the shared named graph; domains keep ownership of their source records |
| How do conflicts between domains get reported | A query for entities with multiple high-confidence source records that disagree on a property surfaces the conflict; the governance graph makes “two domains disagree about a customer’s address” a reportable triple, not a tribal-knowledge problem |
| How does the AI agent get the right golden record (the Part 9 trust-tier-aware retrieval pattern) | The agent retrieves through the resolved-entity IRI; trust-tier labels filter source records by tier before the agent sees them; the same graph that answers governance questions answers retrieval questions |
The federated MDM pattern from the existing article and the entity resolution pattern from Part 5 are the same pattern at different vocabulary levels. The MDM article speaks the language of operating models. Part 5 speaks the language of identity. Both reduce to a single architectural choice: keep the resolved entity in a shared named graph, use the IRI everywhere, never let any source system mint its own canonical identifier. When that discipline holds, the governance KG, the MDM golden record, and the Part 9 agent’s retrieval surface are the same store with three reading patterns.
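The conflict-reporting row of the table can be sketched as a grouping over source facts. The tuple layout, the identifiers, the scores, and the 0.9 confidence threshold are all invented for illustration; in the graph itself this would be a query over reified match records.

```python
from collections import defaultdict

# (source_record, resolved_entity, match_score, property, value); all values hypothetical.
SOURCE_FACTS = [
    ("crm:cust/881", "kg:party/42", 0.98, "address", "1 Main St"),
    ("loans:bor/17", "kg:party/42", 0.97, "address", "9 Oak Ave"),
    ("cards:acct/5", "kg:party/42", 0.61, "address", "77 Elm Rd"),
    ("crm:cust/900", "kg:party/43", 0.99, "address", "4 Pine Ct"),
]


def conflicts(facts, min_score=0.9):
    """Find (entity, property) pairs where high-confidence sources disagree."""
    values = defaultdict(lambda: defaultdict(set))
    for source, entity, score, prop, value in facts:
        if score >= min_score:  # low-confidence matches do not count as disagreement
            values[(entity, prop)][value].add(source)
    return {key: dict(by_value) for key, by_value in values.items() if len(by_value) > 1}


report = conflicts(SOURCE_FACTS)
for (entity, prop), by_value in report.items():
    print(entity, prop, by_value)
```

Each reported conflict names the disagreeing source records, which is what turns "two domains disagree" into something a steward can assign.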
Regulatory Cross-Walks: BCBS 239, GDPR Article 30, EU AI Act Article 10
The strongest single argument for a governance KG in 2026 is regulatory. Three regulations now ask graph-shaped questions: BCBS 239 with the 2024 ECB RDARR amplification, GDPR Article 30, and the EU AI Act Article 10. (Article 10’s data-governance obligations for stand-alone Annex III high-risk AI were deferred from 2 August 2026 to 2 December 2027 under the Digital Omnibus provisional agreement of 7 May 2026, with Annex I embedded systems moved to 2 August 2028, pending publication in the Official Journal.) Each regulation has been treated by most programs as a separate compliance project. Each is, structurally, a query against the same governance graph.
BCBS 239 and ECB RDARR
BCBS 239 (2013, the Basel Committee’s “Principles for effective risk data aggregation and risk reporting”) sets fourteen principles, but only three of them are operationally graph-shaped. Principle 2 (Data architecture and IT infrastructure) demands a documented architecture that maps risk data sources to risk reports. Principle 3 (Accuracy and Integrity) demands that risk data be reconciled, with documented controls at each stage. Principle 6 (Adaptability) demands the ability to produce ad hoc risk reports across stress scenarios.
The 2024 ECB RDARR Guide sharpens the ask. Among the seven priority areas, attribute-level Data Lineage is, in my reading, the one most clearly shaped like a graph problem, and the one most banks are least ready for. “Show every transformation, owner, validation, and quality measurement on every attribute that enters every risk report” cannot be answered by lineage at the table level or system level. It requires field-level edges with PROV-O metadata.
In the governance KG, BCBS 239 Principle 3 reduces to a SPARQL pattern that traverses from a designated risk report Dataset, through prov:wasGeneratedBy chains, to every contributing Field, with dqv:hasQualityMeasurement annotations attached at each step. The pattern is one query template, parameterized by the report. Every quarterly report runs the same query.
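The shape of that template can be illustrated with an in-memory walk, parameterized by the terminal node. Everything here is a stand-in: the three dictionaries mimic `prov:wasGeneratedBy`, `prov:used`, and `dqv:hasQualityMeasurement` edges, the names and pass rates are invented, and a real deployment would express the same traversal as a parameterized SPARQL query.

```python
GENERATED_BY = {  # entity -> activity (prov:wasGeneratedBy)
    "pillar3_report": "run:report-2025Q4",
    "fct_exposure": "run:build-exposure",
    "stg_trades": "run:stage-trades",
}
USED = {  # activity -> entities (prov:used)
    "run:report-2025Q4": ["fct_exposure"],
    "run:build-exposure": ["stg_trades", "dim_counterparty"],
    "run:stage-trades": ["src_trades"],
}
QUALITY = {  # activity -> dqv:hasQualityMeasurement value (hypothetical pass rates)
    "run:report-2025Q4": 1.0,
    "run:build-exposure": 0.97,
}


def provenance(terminal: str) -> list[dict]:
    """Walk backward from a report (or any entity) through its generating activities."""
    steps, stack, seen = [], [terminal], set()
    while stack:
        entity = stack.pop()
        activity = GENERATED_BY.get(entity)
        if activity is None or activity in seen:
            continue  # a source-system dataset, or an activity already visited
        seen.add(activity)
        inputs = USED.get(activity, [])
        steps.append({"output": entity, "activity": activity,
                      "inputs": inputs, "quality": QUALITY.get(activity)})
        stack.extend(inputs)
    return steps


for step in provenance("pillar3_report"):
    print(step)
```

A step whose `quality` is `None` is the useful finding: it marks a run on the report's path where no measurement was recorded.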
GDPR Article 30
GDPR Article 30 requires that controllers maintain a Record of Processing Activities (ROPA): a registry of every processing activity, the categories of data subjects and personal data, the recipients, the purpose, the retention period, and the security measures. ROPA is, structurally, a query against a graph: for each Activity in the governance KG that touches a Field tagged as gdpr:PersonalData, return the Activity, its purpose annotation, the recipient Datasets, the retention policy attached to the Field, and the controlling Agent.
The Golpayegani et al. open knowledge graph approach for AI governance (2024) demonstrates the same pattern for the EU AI Act, mapping its concepts and requirements to international standards via an open KG. The same architecture works for ROPA: tag Fields and CDEs with personal-data classifications via SKOS concepts, attach retention and purpose policies as Policy nodes, and the ROPA report is a fixed query that produces a fresh ROPA on demand. Compliance-automation tooling automates this pattern end-to-end when the underlying graph is in place; without the graph, automation tools degenerate into yet another disconnected store.
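The fixed ROPA query can be sketched as a filter and projection over Activities. The field names, purposes, retention periods, and the `ropa` function are hypothetical, and the record layout is a simplification of what the graph would hold as edges and Policy nodes.

```python
# Fields tagged gdpr:PersonalData in the graph (hypothetical names).
PERSONAL_DATA = {"crm.customer.email", "crm.customer.date_of_birth"}
# Retention policy per field, as ISO 8601 durations (made-up values).
RETENTION = {"crm.customer.email": "P2Y", "crm.customer.date_of_birth": "P7Y"}

ACTIVITIES = [
    {"id": "run:marketing-export", "purpose": "direct marketing", "controller": "team:crm",
     "used": ["crm.customer.email"], "generated": ["mkt.audience.segment"],
     "recipients": ["mkt.audience"]},
    {"id": "run:build-exposure", "purpose": "risk reporting", "controller": "team:risk-data",
     "used": ["warehouse.stg_trades.notional"],
     "generated": ["warehouse.fct_exposure.exposure_amt"],
     "recipients": ["warehouse.fct_exposure"]},
]


def ropa(activities: list[dict]) -> list[dict]:
    """One row per Activity that uses or generates a personal-data Field."""
    rows = []
    for act in activities:
        touched = sorted(PERSONAL_DATA & (set(act["used"]) | set(act["generated"])))
        if not touched:
            continue
        rows.append({
            "activity": act["id"],
            "purpose": act["purpose"],
            "controller": act["controller"],
            "recipients": act["recipients"],
            "personal_data_fields": touched,
            "retention": {field: RETENTION.get(field) for field in touched},
        })
    return rows


rows = ropa(ACTIVITIES)
for row in rows:
    print(row)
```

Because the report is derived on demand, a newly tagged Field changes the next ROPA without anyone editing a register.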
EU AI Act Article 10
Article 10 of the EU AI Act requires that high-risk AI systems be developed using high-quality training, validation, and test data with documented Data Governance, including provenance, collection methodology, bias detection, and quality checks. Its stand-alone Annex III high-risk obligations were deferred from 2 August 2026 to 2 December 2027 under the Digital Omnibus provisional agreement of 7 May 2026, pending publication in the Official Journal; if the Omnibus is not adopted before 2 August 2026 the original date snaps back, so the safest reading is that the deadline is in flux but no earlier than originally set. The Article 10 ask is, in graph terms, “for the model M, traverse from M back through prov:wasDerivedFrom to the training Dataset, then through prov:wasGeneratedBy to the Activities that produced it, then through prov:wasAssociatedWith to the Agents responsible, then surface the quality measurements, bias diagnostics, and policy validations attached at each step.”
The Article 10 traversal is structurally identical to the BCBS 239 traversal. Different terminal node (a model versus a risk report). Same edge structure, same graph, same provenance contract. A bank that has a governance KG for BCBS 239 has 90% of the architecture for EU AI Act Article 10 already in place. The remaining 10% is adding model-specific facets to the PROV-O bridge (training data hash, hyperparameters, evaluation results) and tagging Datasets used in training with the additional Article 10 quality flags.
The cross-walk view
The same governance KG answers all three regulators with three different fixed queries. The cross-walk between regulations becomes a graph problem: which Policies share which mapped requirements, which Fields fall under which combination of policies, which Agents are accountable across regulations.
| Regulation | What it asks | KG answer pattern |
|---|---|---|
| BCBS 239 Principle 3 / ECB RDARR | Where does this risk number come from, attribute by attribute, with quality and ownership | Backward traversal from a risk-report Dataset through PROV-O Activity chain; aggregate dqv:hasQualityMeasurement |
| BCBS 239 Principle 6 | Can you produce ad hoc cross-scenario risk reports | Forward traversal from CDEs to all Datasets that materialize them, parameterized by stress scenario |
| GDPR Article 30 | Records of every processing activity touching personal data | Filter Activities where any used or generated Field is tagged gdpr:PersonalData; project Activity attributes |
| EU AI Act Article 10 | Provenance and quality of training data for high-risk models | Backward traversal from Model through prov:wasDerivedFrom to training Datasets and their PROV-O chain |
| EU AI Act Article 12 (logging) | Audit logs of model behavior | Forward append into an immutable named graph for each model run; query by model IRI |
| Future regulations | (whatever the structure is) | New SPARQL templates against the same graph |
The point is not that the governance KG eliminates compliance work. It is that it eliminates the per-regulation reconciliation work. Each regulation becomes a query template. New regulations become new templates against the same graph. The Northwind incident did not happen because the bank lacked a regulator-specific tool. It happened because four governance stores could not be queried together.
What this looks like in practice. When a regulator’s question has a different shape from your governance graph’s edges, you have two choices: change the question, or extend the edges. Extend the edges. The governance graph should be the structure that any reasonable regulatory question can be expressed against, not a structure that fits your favorite report. If your team finds itself flattening a graph traversal into a spreadsheet to answer a regulator, the spreadsheet is not the answer; it is the symptom that an edge is missing from the graph.
Trust Tiers Meet Governance Reporting
The four-tier trust pattern from Part 7 (gold, silver, bronze, quarantine) extends naturally into governance reporting. The rules are simple and worth stating before they get bent in production.
| Reporting surface | Allowed tiers | Why |
|---|---|---|
| Regulatory submissions (Pillar 3, Y-9C, FFIEC, EBA) | Gold only | Regulators expect verified, owner-attested, validated content; submitting silver content as gold is a finding |
| Internal management reports | Gold or silver, with tier visible per row | Speed-of-availability matters; tier visibility lets readers calibrate trust |
| Operational dashboards | Gold or silver; bronze allowed only with a “preliminary” badge | Same reason as management reports; bronze acceptable where the consumer can see the tier |
| Exploratory analysis / data science / agent retrieval | All tiers, with explicit tier in the response | Bronze is signal; quarantining it from analysts is wrong; mislabeling it as gold is the failure |
| ROPA / Article 30 reports | All tiers; tier becomes a column | Article 30 wants the full picture, including tentative records; the tier column shows the regulator the maturity of the program |
| EU AI Act high-risk training data | Gold only for production training; silver allowed for evaluation; bronze quarantined | Article 10 requires high-quality training data; tier discipline is the operational implementation of “high quality” |
The discipline is the same as the Part 9 trust-tier-aware retrieval pattern: every consumer specifies the tier policy explicitly, the retrieval (or report-generation) layer enforces it structurally, and the response surface (a regulatory submission, a dashboard, an agent answer) cites tier alongside content. Northwind’s three-week delay would have been worse if the team had inadvertently submitted silver content as gold; the lack of a tier policy on the regulatory submission would have been the second incident behind the first.
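Enforcing the tier policy at report generation, rather than only at retrieval, can be sketched as a gate in front of the report builder. The surface keys, the row layout, and the `TierViolation` exception are hypothetical, and only the first three rows of the table are modeled.

```python
# Allowed tiers per reporting surface, taken from the first three rows of the table above.
ALLOWED_TIERS = {
    "regulatory_submission": {"gold"},
    "management_report": {"gold", "silver"},
    "operational_dashboard": {"gold", "silver", "bronze"},
}


class TierViolation(Exception):
    """Raised when a row's trust tier is not permitted on the target surface."""


def build_report(surface: str, rows: list[dict]) -> list[dict]:
    """Return rows with tier visible, or refuse to build the report at all."""
    allowed = ALLOWED_TIERS[surface]
    rejected = [row for row in rows if row["tier"] not in allowed]
    if rejected:
        # Failing loudly beats silently dropping an input to a regulatory number.
        raise TierViolation(f"{len(rejected)} row(s) not allowed on {surface}")
    # Bronze content is shown only with a "preliminary" badge.
    return [dict(row, badge="preliminary" if row["tier"] == "bronze" else None)
            for row in rows]


rows = [
    {"metric": "exposure_total", "value": 1250.0, "tier": "gold"},
    {"metric": "exposure_new_desk", "value": 40.0, "tier": "silver"},
]
print(build_report("management_report", rows))
```

Raising instead of filtering is a deliberate choice for this sketch: a submission that quietly omits a silver input is still wrong, just less visibly so.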
Seven Failure Modes For Governance Knowledge Graphs
Each of these has been observed in production governance-KG deployments reported in the 2024-2026 literature, including a metadata-graph platform survey and a data catalog and governance-metadata platform article on the same pattern.
| Failure | What it looks like | Root cause |
|---|---|---|
| The fifth-store anti-pattern | The governance KG was added alongside the catalog, lineage tool, glossary, and policy register without retiring or absorbing any of them; now there are five stores instead of four; no shared identifiers | The KG was treated as a tool, not as the substrate the other tools should write into |
| Unresolved entities at the dataset layer | Same dataset has three different IRIs because three crawlers (catalog, OpenLineage, manual entry) each minted their own; queries return duplicate results that look like inconsistencies | No IRI minting policy; the Part 5 identity discipline was skipped |
| Lineage without PROV-O | OpenLineage events were ingested as raw event records, not mapped to PROV-O Activities; cross-cutting queries that join lineage to provenance metadata cannot run | The OpenLineage-to-PROV-O bridge from Part 7 was treated as optional |
| CDE program rot | The field-to-CDE mapping was loaded once and never updated; six months later, half the mappings are stale; nobody trusts the CDE-derived metrics | No automated reconciliation against schema-facet events; mappings live as a one-time load instead of a maintained edge |
| Policy-as-document, not policy-as-node | Policy register exports PDFs; the governance KG has Policy nodes but they have no applies edges to CDEs, Fields, or Activities; cross-walks cannot run | Policies were tagged but not connected; the cross-walk discipline from this article was skipped |
| Trust-tier laundering at the report boundary | A regulatory submission inadvertently includes silver content because the tier policy was enforced only at retrieval, not at report generation | The Part 9 three-layer enforcement (planner, prompt, post-processing) was applied only to agents, not to reports |
| Silent schema drift | An ontology refactor (renamed CDE, removed field tag) ran without versioning; old policy queries return empty; downstream reports look complete but are not | The Part 8 versioning discipline was not extended to the governance KG; named graphs per release were not in use |
The failure pattern across all seven is the same as the failure pattern of the four-store governance program: the metadata that was needed downstream was not made first-class in the substrate. The fix in every case is to move the missing structure into the graph rather than working around its absence.
A Decision Tree For The Governance Layer
Use this when scoping a governance KG initiative or assessing an existing program for the gap that produces a Northwind-style incident.
- Can you produce, in one query, the answer to “where did this regulatory number come from, owned by whom, with what quality status, since when?” If yes, you have a governance graph. If no, you have a fragmented governance program; the rest of the questions diagnose where.
- Do all four (or five) governance stores reference the same IRI for each Dataset, Field, CDE, Policy, and Owner? If not, the fix is identifier discipline before any new tool. The Part 5 IRI rules are the foundation.
- Does your lineage tool emit OpenLineage, and are those events flowing into a PROV-O bridge? If lineage emission is partial (warehouse only, not operational systems), the regulator’s first question already exceeds your coverage. Extend OpenLineage emission to every system before extending the graph.
- Are your CDEs nodes in the graph or rows in a spreadsheet? Spreadsheet rot is the most reliable predictor of CDE program failure. Move CDEs into the graph as typed nodes with cde:hasImplementation edges; reconcile from OpenLineage schema-facet events.
- Are your golden master records nodes in the graph, with owl:sameAs edges from each source record and provenance-per-match metadata? If not, federated MDM is harder than it needs to be, and the Part 9 agent retrieval pattern cannot reuse the same store.
- Are your policies nodes with applies edges to CDEs, Fields, and Activities, or are they PDFs in a register? Policy-as-document means the cross-walk between regulations cannot run as a query.
- Does every Activity in the graph carry the seven-field provenance contract (Part 7 and Part 8 combined: where, what process, who, when, trust level, source hash, validatedAgainstShapes)? If any field is missing, regulators with detailed provenance asks (EU AI Act Article 10 in particular) will catch the gap.
- Does your tier policy extend to every reporting surface (regulatory submissions, management reports, dashboards, ROPA, AI Act conformance evidence)? Tier-at-retrieval-only is the failure mode that produces the second incident.
- Are your governance KG named graphs versioned per release, with consumer pinning, per the Part 8 versioning rules? Without per-release versioning, an ontology refactor breaks past reports silently.
- When a new regulation arrives, is your work a new SPARQL template or a new tool? The right answer is “template.” If your governance team’s first instinct is to procure another store, the substrate is not the substrate yet.
A governance program that has answers to all ten is a program that absorbs new regulations as templates. A program missing several is a program that will surface a Northwind-style incident the next time a regulator asks a question that spans more than one store.
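The provenance-contract question in the checklist lends itself to an automated check. The sketch below assumes Activity records have already been pulled out of the graph as plain dictionaries keyed by IRI; the field spellings in `REQUIRED_FIELDS` and the function names are illustrative stand-ins for the seven fields listed above, not the series' actual property names.

```python
# The seven contract fields from the checklist; key spellings are assumptions.
REQUIRED_FIELDS = (
    "where",
    "process",
    "who",
    "when",
    "trust_level",
    "source_hash",
    "validated_against_shapes",
)


def missing_provenance(activity):
    """Return the contract fields that are absent or empty on one Activity record."""
    return [f for f in REQUIRED_FIELDS if activity.get(f) in (None, "", [])]


def audit_activities(activities):
    """Map each Activity IRI to its missing fields, omitting complete records."""
    gaps = {iri: missing_provenance(record) for iri, record in activities.items()}
    return {iri: missing for iri, missing in gaps.items() if missing}
```

An empty result from `audit_activities` means every record carries all seven fields; any non-empty entry names the Activity and the exact fields a regulator would find missing.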
What This Article Did Not Cover
Three governance topics deserve more depth than fits here.
- Federated governance across business units: how a parent organization with semi-autonomous subsidiaries (the kind of structure common in financial services and healthcare) operates a governance graph that is partly shared and partly per-subsidiary. The named-graph model from Part 8 is the foundation; the operating model is its own problem and lives in the centralized-to-federated governance article.
- Cost modeling for a governance KG: the build-vs-buy and total-cost-of-ownership calculations specifically for a governance graph. Appendix B of this series covers the general KG cost shape; the governance-specific shape (where the heavy cost is the catalog crawler maintenance and the OpenLineage emission backfill, not the graph database itself) is its own piece.
- The governance graph during M&A integration: when two organizations merge, their governance graphs have to be reconciled the same way their MDM golden records do. The pattern is the same as the Part 5 entity-resolution pipeline applied at the ontology and policy level. A worked example belongs in a follow-up integration playbook.
This article has covered why the four-or-five-store governance pattern collapses under modern regulatory pressure, the specific shape of a governance KG (DCAT plus PROV-O plus SKOS plus FIBO plus a thin in-house module), the OpenLineage-to-PROV-O bridge as the lineage substrate, CDEs as typed nodes with hasImplementation edges, master data golden records as resolved-entity nodes, the regulatory cross-walk to BCBS 239 and ECB RDARR and GDPR Article 30 and EU AI Act Article 10, the trust-tier extension to governance reporting, and the seven failure modes that show up when any layer is treated as optional.
Do Next: Governance KG Discipline Tier List
| Priority | Action | Why It Matters |
|---|---|---|
| Now (this quarter) | Audit your governance program for the four-or-five-store pattern. Count how many distinct identifier spaces exist for the same Dataset, Field, CDE, and Policy across catalog, lineage, glossary, and policy register. Anything more than one is the problem. | The Northwind incident is downstream of identifier fragmentation. Fixing identifiers without a graph is hard; fixing them as part of building the governance graph is the same work. |
| Now (this quarter) | If OpenLineage is not yet emitting from every pipeline that touches a CDE-implementing dataset, scope the gap. Backfill emission before extending any other governance investment. | The 2024 ECB RDARR Guide demands attribute-level lineage today, and EU AI Act Article 10 will demand it for high-risk training data once the deferred Annex III deadline (now 2 December 2027) takes effect. Without OpenLineage emission, the graph cannot know what the pipelines did. |
| Next (next two quarters) | Stand up a governance KG with the DCAT+PROV-O+SKOS+FIBO ontology and the OpenLineage-to-PROV-O bridge. Move CDE inventory into the graph as typed nodes. Move policy register into the graph as Policy nodes with applies edges. | The substrate is what eliminates per-regulation reconciliation work. Every tool that does not write into the graph or read from the graph is a candidate for retirement. |
| Next (next two quarters) | Extend the Part 9 trust-tier policy from agent retrieval to every reporting surface (regulatory submissions, management reports, ROPA, AI Act evidence). Make tier a first-class field on every reportable triple. | Tier-at-retrieval-only is the second incident waiting to happen. Submitting silver content as gold is a regulatory finding regardless of how thorough your agent retrieval is. |
| Soon (next year) | Build SPARQL query templates for BCBS 239 Principle 3, BCBS 239 Principle 6, GDPR Article 30 ROPA, and EU AI Act Article 10. Run them quarterly as part of your standard control set. | Cross-walking regulations is what a governance graph is for. The work is template authoring, not store reconciliation. Each new regulation is a new template, not a new tool. |
| Soon (next year) | Move master data golden records into the same governance KG via resolved-entity IRIs and provenance-per-match edges. Retire any standalone MDM hub that does not write into the shared graph. | The same store that answers governance questions answers MDM questions and Part 9 agent retrieval. Three pipelines into three stores is three times the cost; one pipeline into one graph with three reading patterns is the architecture that survives. |
| Eventually (when stable) | Apply the Part 8 versioning discipline to the governance KG itself. Named graph per release. Consumer pinning. Schema-evolution playbook for ontology refactors. | A governance program that cannot reproduce a 2025 report against the 2025 ontology has the same archeology problem as Northwind, just delayed by two years. Versioning is what makes “show me what we knew on April 4” answerable. |
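The first "Now" action, counting identifier spaces, can be prototyped before any graph exists. This is a rough sketch under a strong assumption: someone has already hand-labeled which logical Dataset, Field, CDE, or Policy each store row refers to (the `entity_key` column). The function name, store names, and sample identifiers are all hypothetical.

```python
from collections import defaultdict


def fragmented_identifiers(records):
    """Report logical entities that carry more than one identifier across stores.

    records: iterable of (store, entity_key, identifier) tuples, where
    entity_key is a hand-curated label for the real-world entity
    (an assumption of this sketch, not something the stores provide).
    """
    by_entity = defaultdict(lambda: defaultdict(set))
    for store, entity_key, identifier in records:
        by_entity[entity_key][identifier].add(store)
    return {
        entity: {ident: sorted(stores) for ident, stores in idents.items()}
        for entity, idents in by_entity.items()
        if len(idents) > 1  # more than one identifier space is the problem
    }


sample = [
    ("catalog", "cde:counterparty-exposure", "CAT-00417"),
    ("lineage", "cde:counterparty-exposure", "dwh.risk.cpty_exposure"),
    ("glossary", "cde:counterparty-exposure", "CAT-00417"),
    ("catalog", "policy:retention", "POL-12"),
    ("policy_register", "policy:retention", "POL-12"),
]
print(fragmented_identifiers(sample))
```

The hand-labeling step is the expensive part, and it is the same reconciliation work the governance graph needs anyway, which is why the table treats the audit and the graph build as one effort.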
Up Next
Part 11 is the synthesis, split across three pieces because the capstone is too large to read in one sitting. Lakeside Trust Bank, the mid-size US bank introduced in Part 4, implements the full architecture end to end. Part 11a covers the foundation and the operational use case (customer 360 plus beneficial ownership plus real-time transaction risk). Part 11b takes the same graph and shows how it answers regulators (BCBS 239 and ECB RDARR exactly as cross-walked in this article, plus GDPR Article 30 and EU AI Act Article 10). Part 11c shows how the relationship-banker agent uses the same graph (the Part 9 trust-tier-aware retrieval pattern in production). One graph, one identity discipline, three use cases.
Sources & References
- OpenLineage: An Open Standard for lineage metadata collection (2024)
- OpenLineage Column-Level Lineage Dataset Facet (2024)
- Marquez Project: Reference Implementation of OpenLineage (LF AI & DATA) (2024)
- W3C PROV-O: The PROV Ontology (2013)
- W3C SKOS: Simple Knowledge Organization System (2009)
- W3C DCAT Version 3 Recommendation (2024)
- Basel Committee on Banking Supervision: BCBS 239 Principles for effective risk data aggregation and risk reporting (2013)
- ECB Guide on effective risk data aggregation and risk reporting (RDARR) (2024)
- EU AI Act Article 10: Data and Data Governance (2024)
- GDPR Article 30: Records of processing activities (2018)
- FIBO: Financial Industry Business Ontology (2025)
- DataHub: What Is a Metadata Knowledge Graph? (2025)
- Golpayegani et al.: An Open Knowledge Graph-Based Approach for Mapping Concepts and Requirements between the EU AI Act and International Standards (2024)
- TDAN: Business Glossaries and Metadata: Using the Glossary to Drive Your Quality Strategy (2024)
- Gibson Dunn: EU AI Act Omnibus Agreement: Postponed High-Risk Deadlines and Other Key Changes (2026)