TL;DR
Data lineage is the documented path that data takes from its source through transformations to its consumers — which tables feed which downstream tables, which columns flow into which metrics, and which reports depend on which models. Lineage is critical for impact analysis (what breaks if I change this?), governance (who has access to what?), debugging (where did this number come from?), and trust (can I rely on this metric?). Modern data-stack tools (dbt, OpenLineage, Datafold, Castor, Monte Carlo) make lineage capture mostly automatic.
What is data lineage?
Data lineage is the documented dependency graph of analytical data — tracking how raw source data flows through extraction, transformation, and modelling steps into final consumer-facing assets (dashboards, ML models, operational systems).
It answers questions analytics teams ask constantly: 'if I rename this column, what breaks?', 'which dashboards depend on this table?', 'where did this number on the executive report actually come from?', 'who has access to data derived from this PII source?'.
Levels of lineage
- Table-level lineage: which tables depend on which tables — the most common and easiest to capture
- Column-level lineage: which columns derive from which source columns — required for fine-grained impact analysis (renaming a column, deprecating a field)
- Metric-level lineage: which metric definitions depend on which columns and dimensions — required for trust assessment of dashboards and operational metrics
- Cross-system lineage: tracing flows across multiple systems (source DBs → warehouse → BI → operational tools) — hardest to capture but most operationally valuable
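To make the table-level vs column-level distinction concrete, here is a minimal sketch of a column-level lineage map used for PII tracking — every table and column name below is hypothetical, and real tools derive this map from parsed SQL rather than hand-written dicts:

```python
# Column-level lineage: map each derived column to the source columns
# it is computed from. All table/column names here are hypothetical.
COLUMN_LINEAGE = {
    "mart.customers.email_domain": ["staging.users.email"],
    "mart.customers.ltv": ["staging.orders.amount", "staging.users.id"],
    "mart.revenue_daily.total": ["staging.orders.amount"],
}

def derived_from(source_column: str) -> set[str]:
    """Every downstream column that (transitively) reads source_column."""
    hits: set[str] = set()
    changed = True
    while changed:  # iterate to a fixed point to follow chains of derivation
        changed = False
        for target, sources in COLUMN_LINEAGE.items():
            if target in hits:
                continue
            if any(s == source_column or s in hits for s in sources):
                hits.add(target)
                changed = True
    return hits

# Which columns carry data derived from the PII column staging.users.email?
print(sorted(derived_from("staging.users.email")))
```

Table-level lineage would only say "mart.customers depends on staging.users"; the column-level map pinpoints which fields actually carry the sensitive data.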
Why lineage matters
Without lineage, every downstream change becomes risk-laden. Renaming a column requires manual analysis of every dashboard and downstream model that might use it. A single broken metric in a board deck triggers hours of investigation to find which transform produced the wrong number. Compliance requirements (which datasets contain PII derived from EU sources?) become painful audit projects rather than queryable metadata.
With lineage, all of these become tractable. Impact analysis runs in seconds. Bug investigation traces the metric back to its source. Compliance reporting queries the lineage graph rather than auditing manually.
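As a sketch of why impact analysis "runs in seconds": table-level lineage is just an adjacency map from each table to its direct consumers, and "what breaks if I change this?" is a transitive graph traversal. All names below are hypothetical:

```python
from collections import deque

# Table-level lineage as an adjacency map: table -> direct consumers.
# All table, dashboard, and model names are hypothetical.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["mart.revenue_daily", "mart.customers"],
    "mart.revenue_daily": ["dashboard.executive_kpis"],
    "mart.customers": ["dashboard.churn", "ml.ltv_model"],
}

def impacted(table: str) -> set[str]:
    """Breadth-first walk downstream: the blast radius of changing `table`."""
    seen: set[str] = set()
    queue = deque([table])
    while queue:
        for consumer in LINEAGE.get(queue.popleft(), []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

# Everything that could break if staging.orders changes:
print(sorted(impacted("staging.orders")))
```

Lineage tools run the same traversal over a graph extracted from real code, so the answer stays current as the pipeline evolves.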
Tools and approaches (2025)
| Tool / approach | Strength |
|---|---|
| dbt (built-in lineage) | Free with dbt; covers transformation layer well |
| OpenLineage (open-source standard) | Cross-tool standard for lineage emission |
| Datafold | Column-level lineage with diff and impact analysis |
| Castor | Catalog-and-lineage with broad ecosystem coverage |
| Monte Carlo / Bigeye | Lineage paired with data observability |
| Atlan, Alation | Enterprise data catalogues with lineage features |
| Snowflake Object Dependencies | Native warehouse lineage for in-warehouse transforms |
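For a sense of what the OpenLineage standard actually exchanges, the sketch below builds a minimal RunEvent as plain JSON. The top-level field names follow the published spec; the job, dataset, and producer values are hypothetical, and in practice a client library or integration emits these events for you:

```python
import json
import uuid
from datetime import datetime, timezone

# A minimal OpenLineage RunEvent: one job run that read raw.orders and
# wrote staging.orders. Dataset/job names and the producer URI are
# hypothetical; field names follow the OpenLineage spec.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "dbt", "name": "staging.orders"},
    "inputs": [{"namespace": "postgres://prod", "name": "raw.orders"}],
    "outputs": [{"namespace": "snowflake://wh", "name": "staging.orders"}],
    "producer": "https://example.com/my-pipeline",  # hypothetical
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
}

# A lineage backend (e.g. Marquez) would receive this over HTTP;
# here we just serialize it to show the shape of the payload.
print(json.dumps(event, indent=2))
```

Because every tool emits the same event shape, a single backend can assemble cross-system lineage from dbt runs, Airflow tasks, and Spark jobs alike.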
Common pitfalls
1. Manual lineage maintenance. Keeping a hand-maintained lineage diagram up to date is impossible at any meaningful scale. Lineage must be auto-generated from actual code (dbt models, SQL queries) — tooling that requires manual upkeep produces stale lineage that misleads users.
2. Table-only lineage when column lineage is needed. Table-level lineage misses the precision needed for many use cases (column rename impact, PII flow tracking). Column-level lineage is meaningfully harder to capture but often necessary.
3. Lineage without consumer tracking. Lineage that ends at the warehouse misses the consumer-side flow into BI tools, AI agents, and operational systems. Cross-system lineage is harder but necessary for full impact analysis.
Related concepts
Data governance relies on lineage for policy enforcement. Data catalog provides the discoverability layer that lineage feeds. Data products are the lineage endpoints that warrant the most rigorous lineage discipline. ELT pipelines (especially dbt) auto-generate much of the lineage capture.
Frequently asked questions
How is lineage different from a data catalog?
A catalog is the inventory and discoverability layer (what datasets exist, who owns them, what they mean). Lineage is the dependency graph (how datasets flow into one another). They're complementary — modern data-catalog products typically include lineage as a core feature.
Do I need column-level lineage?
If you do impact analysis (who breaks when I change this column?), PII tracking (which downstream tables contain data derived from this sensitive column?), or precise debugging — yes. Table-level lineage is enough for high-level dependency understanding but misses the precision needed for these use cases.
How do you keep lineage current?
Auto-generation from code. Manual lineage maintenance fails at scale. Modern tools (dbt, OpenLineage, Datafold) parse SQL and pipeline definitions to produce lineage automatically — keeping it in sync with reality without manual upkeep.
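As an illustration of the auto-generation idea, the toy extractor below pulls table references out of a SQL statement with a regex. Real tools (dbt, Datafold, OpenLineage integrations) use full SQL parsers, so treat this purely as a sketch of the concept:

```python
import re

def referenced_tables(sql: str) -> set[str]:
    """Toy dependency extractor: names that follow FROM or JOIN.
    A regex misses CTEs, subqueries, quoting, and much else --
    production lineage tools parse the full SQL grammar instead."""
    pattern = r"\b(?:from|join)\s+([a-zA-Z_][\w.]*)"
    return set(re.findall(pattern, sql, flags=re.IGNORECASE))

sql = """
    select o.customer_id, sum(o.amount) as revenue
    from staging.orders o
    join staging.users u on u.id = o.customer_id
    group by 1
"""
print(sorted(referenced_tables(sql)))
```

Run this over every model in a project and you have the edges of a table-level lineage graph, regenerated on every commit rather than maintained by hand.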
Sources
- OpenLineage standard documentation
- Datafold / Castor / Monte Carlo product documentation
- Modern Data Stack reports (2024–25)
Fairview is an operating intelligence platform that surfaces the lineage of every metric back to the underlying warehouse models and source data — so trust questions ('where did this number come from?') resolve via the dashboard rather than as a separate investigation. Start your free trial →
Siddharth Gangal is the founder of Fairview. He built the lineage-surface layer after watching three CFOs ask 'how is this metric calculated?' in different board meetings — each time triggering 30-minute investigation chains because the metric definitions were buried in dashboard logic that didn't expose lineage to consumers.
See it in Fairview
Track Data Lineage automatically.
14-day free trial. No credit card. First data source connected in 5 minutes.