TL;DR
Data lineage is the documented path that data takes from its source through transformations to its consumers — which tables feed which downstream tables, which columns flow into which metrics, and which reports depend on which models. Lineage is critical for impact analysis (what breaks if I change this?), governance (who has access to what?), debugging (where did this number come from?), and trust (can I rely on this metric?). Modern data-stack tools (dbt, OpenLineage, Datafold, Castor, Monte Carlo) make lineage capture mostly automatic.
What is data lineage?
Data lineage is the documented dependency graph of analytical data — tracking how raw source data flows through extraction, transformation, and modelling steps into final consumer-facing assets (dashboards, ML models, operational systems).
It answers questions analytics teams ask constantly: 'if I rename this column, what breaks?', 'which dashboards depend on this table?', 'where did this number on the executive report actually come from?', 'who has access to data derived from this PII source?'.
Levels of lineage
- Table-level lineage: which tables depend on which tables — the most common and easiest to capture
- Column-level lineage: which columns derive from which source columns — required for fine-grained impact analysis (renaming a column, deprecating a field)
- Metric-level lineage: which metric definitions depend on which columns and dimensions — required for trust assessment of dashboards and operational metrics
- Cross-system lineage: tracing flows across multiple systems (source DBs → warehouse → BI → operational tools) — hardest to capture but most operationally valuable
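To make the table-level vs column-level distinction concrete, here is a minimal sketch of a column-level lineage map used for PII tracking — every table and column name below is hypothetical, and real tools derive this map from parsed SQL rather than hand-written dicts:

```python
# Column-level lineage: map each derived column to the source columns
# it is computed from. All table/column names here are hypothetical.
COLUMN_LINEAGE = {
    "mart.customers.email_domain": ["staging.users.email"],
    "mart.customers.ltv": ["staging.orders.amount", "staging.users.id"],
    "mart.revenue_daily.total": ["staging.orders.amount"],
}

def derived_from(source_column: str) -> set[str]:
    """Every downstream column that (transitively) reads source_column."""
    hits: set[str] = set()
    changed = True
    while changed:  # iterate to a fixed point to follow chains of derivation
        changed = False
        for target, sources in COLUMN_LINEAGE.items():
            if target in hits:
                continue
            if any(s == source_column or s in hits for s in sources):
                hits.add(target)
                changed = True
    return hits

# Which columns carry data derived from the PII column staging.users.email?
print(sorted(derived_from("staging.users.email")))
```

Table-level lineage would only say "mart.customers depends on staging.users"; the column-level map pinpoints which fields actually carry the sensitive data.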
Why lineage matters
Without lineage, every downstream change becomes risk-laden. Renaming a column requires manual analysis of every dashboard and downstream model that might use it. A single broken metric in a board deck triggers hours of investigation to find which transform produced the wrong number. Compliance requirements (which datasets contain PII derived from EU sources?) become painful audit projects rather than queryable metadata.
With lineage, all of these become tractable. Impact analysis runs in seconds. Bug investigation traces the metric back to its source. Compliance reporting queries the lineage graph rather than auditing manually.
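As a sketch of why impact analysis "runs in seconds": table-level lineage is just an adjacency map from each table to its direct consumers, and "what breaks if I change this?" is a transitive graph traversal. All names below are hypothetical:

```python
from collections import deque

# Table-level lineage as an adjacency map: table -> direct consumers.
# All table, dashboard, and model names are hypothetical.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.revenue_daily", "mart.customers"],
    "mart.revenue_daily": ["dashboard.executive_kpis"],
    "mart.customers": ["dashboard.churn", "ml.ltv_model"],
}

def impacted(table: str) -> set[str]:
    """Breadth-first walk downstream: the blast radius of changing `table`."""
    seen: set[str] = set()
    queue = deque([table])
    while queue:
        for consumer in LINEAGE.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

# Everything that could break if staging.orders changes:
print(sorted(impacted("staging.orders")))
```

Lineage tools run the same traversal over a graph extracted from real code, so the answer stays current as the pipeline evolves.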
Tools and approaches (2025)
| Tool / approach | Strength |
|---|---|
| dbt (built-in lineage) | Free with dbt; covers transformation layer well |
| OpenLineage (open-source standard) | Cross-tool standard for lineage emission |
| Datafold | Column-level lineage with diff and impact analysis |
| Castor | Catalog-and-lineage with broad ecosystem coverage |
| Monte Carlo / Bigeye | Lineage paired with data observability |
| Atlan, Alation | Enterprise data catalogues with lineage features |
| Snowflake Object Dependencies | Native warehouse lineage for in-warehouse transforms |
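For a sense of what the OpenLineage standard actually exchanges, the sketch below builds a minimal RunEvent as plain JSON. The top-level field names follow the published spec; the job, dataset, and producer values are hypothetical, and in practice a client library or integration emits these events for you:

```python
import json
import uuid
from datetime import datetime, timezone

# A minimal OpenLineage RunEvent: one job run that read raw.orders and
# wrote staging.orders. Dataset/job names and the producer URI are
# hypothetical; field names follow the OpenLineage spec.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "dbt", "name": "staging.orders"},
    "inputs": [{"namespace": "postgres://prod", "name": "raw.orders"}],
    "outputs": [{"namespace": "snowflake://wh", "name": "staging.orders"}],
    "producer": "https://example.com/my-pipeline",  # hypothetical
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
}

# A lineage backend (e.g. Marquez) would receive this over HTTP;
# here we just serialize it to show the shape of the payload.
print(json.dumps(event, indent=2))
```

Because every tool emits the same event shape, a single backend can assemble cross-system lineage from dbt runs, Airflow tasks, and Spark jobs alike.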
Common pitfalls
1. Manual lineage maintenance. Keeping a hand-maintained lineage diagram up to date is impossible at any meaningful scale. Lineage must be auto-generated from actual code (dbt models, SQL queries) — tooling that requires manual upkeep produces stale lineage that misleads users.
2. Table-only lineage when column lineage is needed. Table-level lineage misses the precision needed for many use cases (column rename impact, PII flow tracking). Column-level lineage is meaningfully harder to capture but often necessary.
3. Lineage without consumer tracking. Lineage that ends at the warehouse misses the consumer-side flow into BI tools, AI agents, and operational systems. Cross-system lineage is harder but necessary for full impact analysis.
Related concepts
Data governance relies on lineage for policy enforcement. Data catalog provides the discoverability layer that lineage feeds. Data products are the lineage endpoints that warrant the most rigorous lineage discipline. ELT pipelines (especially dbt) auto-generate much of the lineage capture.
Frequently asked questions
How is lineage different from a data catalog?
A catalog is the inventory and discoverability layer (what datasets exist, who owns them, what they mean). Lineage is the dependency graph (how datasets flow into one another). They're complementary — modern data-catalog products typically include lineage as a core feature.
Do I need column-level lineage?
If you do impact analysis (who breaks when I change this column?), PII tracking (which downstream tables contain data derived from this sensitive column?), or precise debugging — yes. Table-level lineage is enough for high-level dependency understanding but misses the precision needed for these use cases.
How do you keep lineage current?
Auto-generation from code. Manual lineage maintenance fails at scale. Modern tools (dbt, OpenLineage, Datafold) parse SQL and pipeline definitions to produce lineage automatically — keeping it in sync with reality without manual upkeep.
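As an illustration of the auto-generation idea, the toy extractor below pulls table references out of a SQL statement with a regex. Real tools (dbt, Datafold, OpenLineage integrations) use full SQL parsers, so treat this purely as a sketch of the concept:

```python
import re

def referenced_tables(sql: str) -> set[str]:
    """Toy dependency extractor: names that follow FROM or JOIN.
    A regex misses CTEs, subqueries, quoting, and much else --
    production lineage tools parse the full SQL grammar instead."""
    pattern = r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)"
    return set(re.findall(pattern, sql, flags=re.IGNORECASE))

sql = """
    select o.customer_id, sum(o.amount) as revenue
    from staging.orders o
    join staging.users u on u.id = o.customer_id
    group by 1
"""
print(sorted(referenced_tables(sql)))
```

Run this over every model in a project and you have the edges of a table-level lineage graph, regenerated on every commit rather than maintained by hand.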
Sources
- OpenLineage standard documentation
- Datafold / Castor / Monte Carlo product documentation
- Modern Data Stack reports (2024–25)
Fairview is an operating intelligence platform that surfaces the lineage of every metric back to the underlying warehouse models and source data — so trust questions ('where did this number come from?') resolve via the dashboard rather than as a separate investigation. Start your free trial →
Siddharth Gangal is the founder of Fairview. He built the lineage-surface layer after watching three CFOs ask 'how is this metric calculated?' in different board meetings — each time triggering 30-minute investigation chains because the metric definitions were buried in dashboard logic that didn't expose lineage to consumers.
See it in Fairview
Track Data Lineage automatically.
14-day free trial. No credit card. First data source connected in 5 minutes.