TL;DR
A data catalog is a centralised inventory of organisational datasets — typically with metadata (schema, ownership, classification, freshness), documentation (descriptions, business context), and discovery features (search, browse, lineage). Modern data catalogs (Atlan, Castor, OpenMetadata, DataHub, Alation) auto-populate metadata from connected data systems and turn the catalog into the front door for data discovery. Without a catalog, organisations rapidly accumulate data assets that nobody knows exist, let alone how to use.
What is a data catalog?
A data catalog is a centralised, searchable inventory of an organisation's data assets — datasets, dashboards, ML models, metrics — with metadata, ownership, documentation, lineage, and access information attached.
It is the discoverability layer of the modern data stack. Without a catalog, employees can't find the data they need; analytical work gets duplicated because existing datasets are invisible; and access governance breaks down because nobody knows what exists to govern.
What a catalog contains
- Datasets: tables, views, lakehouse files, with schema, sample values, and freshness info
- Ownership: who is accountable for each dataset, who consumes it, who originally produced it
- Documentation: business descriptions, definitions, glossary terms, business-context notes
- Classification: sensitivity tags (PII, confidential, public), regulatory tags (GDPR, HIPAA scope)
- Lineage: upstream sources and downstream consumers, ideally column-level
- Quality information: data-quality test results, freshness SLAs, recent incident history
- Usage analytics: who queries the dataset, how often, from which tools
- Access information: how to request access, current access controls, audit trail
Catalog products (2025 landscape)
| Product | Origin | Strength |
|---|---|---|
| Atlan | Independent (founded 2018) | Modern UX, broad ecosystem, governance integration |
| Castor | Independent | Lineage-first, lightweight onboarding |
| OpenMetadata | Open-source (Collate) | Open-source standard, broad community |
| DataHub | Open-source (LinkedIn-originated) | Powerful for engineering-led teams |
| Alation | Independent (founded 2012) | Enterprise-focused, mature compliance features |
| Collibra | Independent (founded 2008) | Enterprise governance heavyweight |
| Select Star | Independent | Lineage- and observability-leaning |
Why auto-population matters
First-generation data catalogs required manual entry — ownership, descriptions, classification all entered by humans. In practice this pattern produced catalogs that were largely incomplete and badly out-of-date within months of launch.
Modern catalogs auto-populate from connected data systems: schemas from warehouses, lineage from dbt and SQL parsing, quality from observability tools, usage from query logs. Humans add the parts machines can't determine — business descriptions, ownership decisions, classification edge cases — but the bulk of catalog content updates automatically.
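The mechanical part of auto-population is straightforward: query the source system's own metadata and write catalog entries from it, leaving the human-only fields blank. A minimal sketch, using Python's bundled sqlite3 as a stand-in for a warehouse — a real crawler would query the warehouse's information_schema (Snowflake, BigQuery, etc.) instead:

```python
import sqlite3

# Stand-in data source; a real connector would point at the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_email TEXT, total REAL)"
)

def harvest_schemas(conn):
    """Auto-populate catalog entries from the database's own metadata."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        catalog[table] = {
            "columns": {name: ctype for _, name, ctype, *_ in cols},
            "description": None,  # left for humans: machines can't infer business context
            "owner": None,        # likewise an ownership decision
        }
    return catalog

catalog = harvest_schemas(conn)
print(catalog["orders"]["columns"])
```

Schemas refresh on every crawl without human effort; the `None` fields are exactly the residue humans still have to fill in.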
Common pitfalls
1. Manual-only catalogs. Human-entered catalogs decay fast. Default to auto-population from systems, with humans filling the gaps machines can't.
2. Low-quality first round. Cataloguing every dataset at low fidelity produces clutter that hides high-value datasets. Better to catalogue the top 20% of consumer-facing datasets at high quality and tier the rest.
3. Catalog as ticket system. Catalogs that require ticketed access requests for every read produce friction and circumvention. Self-service access flows (with appropriate governance) keep usage high.
Related concepts
Data governance is the broader policy framework; the catalog is the operational layer that makes governance work. Data lineage is typically a catalog feature. Data products are the catalog entries that warrant the strongest documentation discipline.
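Lineage, mentioned above, is a directed graph from sources to consumers, and the question it answers — "what breaks if this table changes?" — is a downstream reachability query over that graph. A minimal sketch (the asset names are hypothetical):

```python
from collections import deque

# Lineage edges: upstream asset -> its direct downstream consumers.
edges = {
    "raw.orders": ["stg.orders"],
    "stg.orders": ["analytics.orders"],
    "analytics.orders": ["dash.revenue", "ml.churn_features"],
}

def downstream(asset, edges):
    """All assets reachable from `asset`: everything an upstream change can affect."""
    seen, queue = set(), deque([asset])
    while queue:
        node = queue.popleft()
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

print(sorted(downstream("raw.orders", edges)))
```

Column-level lineage is the same idea with (table, column) pairs as graph nodes, which is why catalogs that parse SQL can answer impact questions far more precisely than table-level tools.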
Frequently asked questions
When does an organisation need a data catalog?
When data assets exceed what one person can hold in their head — typically 50+ tables in regular use across multiple teams. Smaller organisations can manage with documentation in dbt and tribal knowledge. Past 50–100 tables, the cognitive cost of 'where does this metric come from?' starts justifying catalog investment.
Open-source or commercial catalog?
OpenMetadata and DataHub are mature open-source options with strong communities. Atlan and Castor are commercial products with smoother onboarding. The choice depends on engineering bandwidth (open-source needs internal maintainers) and budget.
What's the relationship between a catalog and a metric store?
A catalog is the broader inventory (all datasets, dashboards, ML models). A metric store is the specific layer for centralised metric definitions. Modern catalogs typically include or integrate with metric stores; the metric layer is one type of asset in the catalog.
Sources
- Atlan / Castor / OpenMetadata / DataHub documentation
- Modern Data Stack reports (2024–25)
- DAMA Data Management Body of Knowledge
Fairview is an operating intelligence platform that surfaces dataset metadata from connected catalogs — so operators see ownership, freshness, and quality status alongside the operating metrics, without needing to context-switch into the catalog tool. Start your free trial →
Siddharth Gangal is the founder of Fairview. He built the catalog-aware metadata layer after watching operators question dashboard numbers because they couldn't see at a glance whether the underlying data was fresh or stale — the catalog had the answer, but the dashboard didn't surface it where the question was being asked.