TL;DR
A data catalog is a centralised inventory of organisational datasets — typically with metadata (schema, ownership, classification, freshness), documentation (descriptions, business context), and discovery features (search, browse, lineage). Modern data catalogs (Atlan, Castor, OpenMetadata, DataHub, Alation) auto-populate metadata from connected data systems and turn the catalog into the front door for data discovery. Without a catalog, organisations rapidly accumulate data assets that nobody knows exist, let alone how to use.
What is a data catalog?
A data catalog is a centralised, searchable inventory of an organisation's data assets — datasets, dashboards, ML models, metrics — with metadata, ownership, documentation, lineage, and access information attached.
It is the discoverability layer of the modern data stack. Without a catalog, employees can't find the data they need; analytical work gets duplicated because existing datasets are invisible; and access governance breaks down because nobody knows what exists to govern.
What a catalog contains
- Datasets: tables, views, lakehouse files, with schema, sample values, and freshness info
- Ownership: who is accountable for each dataset, who consumes it, who originally produced it
- Documentation: business descriptions, definitions, glossary terms, business-context notes
- Classification: sensitivity tags (PII, confidential, public), regulatory tags (GDPR, HIPAA scope)
- Lineage: upstream sources and downstream consumers, ideally column-level
- Quality information: data-quality test results, freshness SLAs, recent incident history
- Usage analytics: who queries the dataset, how often, from which tools
- Access information: how to request access, current access controls, audit trail
Catalog products (2025 landscape)
| Product | Origin | Strength |
|---|---|---|
| Atlan | Independent (founded 2018) | Modern UX, broad ecosystem, governance integration |
| Castor | Independent | Lineage-first, lightweight onboarding |
| OpenMetadata | Open-source (Collate) | Open-source standard, broad community |
| DataHub | Open-source (LinkedIn-originated) | Powerful for engineering-led teams |
| Alation | Independent (founded 2012) | Enterprise-focused, mature compliance features |
| Collibra | Independent (founded 2008) | Enterprise governance heavyweight |
| Select Star | Independent | Lineage- and observability-leaning |
Why auto-population matters
First-generation data catalogs required manual entry — ownership, descriptions, classification all entered by humans. In practice this pattern produced catalogs that were largely incomplete and badly out-of-date within months of launch.
Modern catalogs auto-populate from connected data systems: schemas from warehouses, lineage from dbt and SQL parsing, quality from observability tools, usage from query logs. Humans add the parts machines can't determine — business descriptions, ownership decisions, classification edge cases — but the bulk of catalog content updates automatically.
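The mechanical part of auto-population is straightforward: query the source system's own metadata and write catalog entries from it, leaving the human-only fields blank. A minimal sketch, using Python's bundled sqlite3 as a stand-in for a warehouse — a real crawler would query the warehouse's information_schema (Snowflake, BigQuery, etc.) instead:

```python
import sqlite3

# Stand-in data source; a real connector would point at the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_email TEXT, total REAL)"
)

def harvest_schemas(conn):
    """Auto-populate catalog entries from the database's own metadata."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        catalog[table] = {
            "columns": {name: ctype for _, name, ctype, *_ in cols},
            "description": None,  # left for humans: machines can't infer business context
            "owner": None,        # likewise an ownership decision
        }
    return catalog

catalog = harvest_schemas(conn)
print(catalog["orders"]["columns"])
```

Schemas refresh on every crawl without human effort; the `None` fields are exactly the residue humans still have to fill in.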
Common pitfalls
1. Manual-only catalogs. Human-entered catalogs decay fast. Default to auto-population from systems, with humans filling the gaps machines can't.
2. Low-quality first round. Cataloguing every dataset at low fidelity produces clutter that hides high-value datasets. Better to catalogue the top 20% of consumer-facing datasets at high quality and tier the rest.
3. Catalog as ticket system. Catalogs that require ticketed access requests for every read produce friction and circumvention. Self-service access flows (with appropriate governance) keep usage high.
Related concepts
Data governance is the broader policy framework; the catalog is the operational layer that makes governance work. Data lineage is typically a catalog feature. Data products are the catalog entries that warrant the strongest documentation discipline.
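Lineage, mentioned above, is a directed graph from sources to consumers, and the question it answers — "what breaks if this table changes?" — is a downstream reachability query over that graph. A minimal sketch (the asset names are hypothetical):

```python
from collections import deque

# Lineage edges: upstream asset -> its direct downstream consumers.
edges = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["analytics.orders"],
    "analytics.orders": ["dash.revenue", "ml.churn_features"],
}

def downstream(asset, edges):
    """All assets reachable from `asset`: everything an upstream change can affect."""
    seen, queue = set(), deque([asset])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(downstream("raw.orders", edges)))
```

Column-level lineage is the same idea with (table, column) pairs as graph nodes, which is why catalogs that parse SQL can answer impact questions far more precisely than table-level tools.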
Frequently asked questions
When does an organisation need a data catalog?
When data assets exceed what one person can hold in their head — typically 50+ tables in regular use across multiple teams. Smaller organisations can manage with documentation in dbt and tribal knowledge. Past 50–100 tables, the cognitive cost of 'where does this metric come from?' starts justifying catalog investment.
Open-source or commercial catalog?
OpenMetadata and DataHub are mature open-source options with strong communities. Atlan and Castor are commercial products with smoother onboarding. The choice depends on engineering bandwidth (open-source needs internal maintainers) and budget.
What's the relationship between a catalog and a metric store?
A catalog is the broader inventory (all datasets, dashboards, ML models). A metric store is the specific layer for centralised metric definitions. Modern catalogs typically include or integrate with metric stores; the metric layer is one type of asset in the catalog.
Sources
- Atlan / Castor / OpenMetadata / DataHub documentation
- Modern Data Stack reports (2024–25)
- DAMA Data Management Body of Knowledge
Fairview is an operating intelligence platform that surfaces dataset metadata from connected catalogs — so operators see ownership, freshness, and quality status alongside the operating metrics, without needing to context-switch into the catalog tool. Start your free trial →
Siddharth Gangal is the founder of Fairview. He built the catalog-aware metadata layer after watching operators question dashboard numbers because they couldn't see at a glance whether the underlying data was fresh or stale — the catalog had the answer, but the dashboard didn't surface it where the question was being asked.