
Data Lake

2026-04-30 · 10 min read


TL;DR

A data lake is a centralised storage repository for raw structured, semi-structured, and unstructured data at any scale — typically built on cheap object storage (S3, GCS, Azure Blob). Unlike a data warehouse, a data lake stores data in its original format with minimal transformation; schema is applied on read rather than on write. Data lakes were the dominant analytical-storage pattern 2010–20 before being superseded by data lakehouses for most new builds.

What is a data lake?

A data lake is a centralised repository for storing data of any type at any scale — structured tables, semi-structured JSON/Avro, unstructured logs and media — in its raw or near-raw form. The defining characteristic is schema-on-read: data is stored without enforcing schema at write time, with schema interpretation deferred to query time.
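Schema-on-read can be shown in a few lines of Python. In this hypothetical sketch (field names and defaults are illustrative, not from any real pipeline), raw JSON lines land in the lake unchanged — including a record a schema-on-write store would reject — and types and defaults are applied only at read time:

```python
import json

# Raw events land in the lake exactly as produced. Note the third record:
# a string user_id, a null amount, and an unexpected extra field. A
# schema-on-write warehouse would reject it at load; a lake stores it anyway.
raw_events = [
    '{"user_id": 1, "amount": "19.99", "country": "DE"}',
    '{"user_id": 2, "amount": "5.00"}',
    '{"user_id": "3", "amount": null, "extra_field": true}',
]

def read_with_schema(lines):
    """Apply the schema at query time: coerce types, default missing fields."""
    for line in lines:
        rec = json.loads(line)
        yield {
            "user_id": int(rec["user_id"]),
            "amount": float(rec["amount"]) if rec.get("amount") else 0.0,
            "country": rec.get("country", "unknown"),
        }

rows = list(read_with_schema(raw_events))
```

The flexibility cuts both ways: every reader must carry its own coercion logic, which is exactly the downstream cost the "skipping data quality on write" pitfall below describes.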

The pattern emerged in the early 2010s with Hadoop and HDFS, then matured around object-storage-based architectures (S3, GCS) by the late 2010s. It enabled storing data orders of magnitude cheaper than warehouses, at the cost of weaker governance and slower query performance.

Why lakes existed

Data lakes existed because data warehouses had two limitations that mattered increasingly through the 2010s: they were expensive (warehouses charge for storage and compute together), and they required schema-on-write (data had to be transformed and conformed before storage). Both became binding constraints as data volumes grew and use cases extended beyond traditional BI.

Lakes solved both constraints: cheap object storage decoupled from compute, and schema-on-read flexibility for new use cases. The trade-off was governance — lakes routinely became 'data swamps' without disciplined organisation and metadata.

Lake vs warehouse vs lakehouse

| Property           | Data Lake                  | Data Warehouse                      | Data Lakehouse              |
|--------------------|----------------------------|-------------------------------------|-----------------------------|
| Storage cost       | Low                        | High                                | Low                         |
| Schema enforcement | On-read (weak)             | On-write (strong)                   | Configurable                |
| Governance         | Hard without metadata layer| Built-in                            | Strong (with table formats) |
| Query performance  | Slow without indexing      | Fast                                | Fast                        |
| Use cases          | ML, archival, exploration  | BI, reporting, operational analytics| Both                        |

When data lakes still make sense

For analytical use cases, the lakehouse pattern (lake + open table format) is now the default rather than the pure lake. Pure data lakes still make sense in a few niches:

  • Archival storage: long-term retention of raw data for compliance or possible future use
  • ML training data: particularly for unstructured data (images, audio, video)
  • Streaming-only landing zones: raw event-stream capture before downstream transformation
  • Cost-optimised cold storage: rarely-queried data where lake economics dominate query-performance needs
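For the streaming landing-zone case, one common convention is Hive-style date-partitioned object keys, which keep raw landings listable and prunable by downstream readers. The sketch below is a hypothetical illustration (source name, bucket layout, and batch-id scheme are all assumptions, not a prescribed standard):

```python
from datetime import datetime, timezone

def landing_key(source: str, event_time: datetime, batch_id: str) -> str:
    """Build a Hive-style partitioned key for a raw landing-zone object.

    year=/month=/day= partitioning lets query engines prune by date
    without reading file contents.
    """
    d = event_time.astimezone(timezone.utc)
    return (f"landing/{source}/year={d:%Y}/month={d:%m}/day={d:%d}/"
            f"{batch_id}.jsonl")

key = landing_key(
    "clickstream",
    datetime(2026, 4, 30, 9, 15, tzinfo=timezone.utc),
    "batch-0001",
)
```

The key design choice is partitioning on event time (normalised to UTC) rather than arrival time, so late-arriving data lands in the partition a reader would expect to find it in.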

Common pitfalls

  1. The 'data swamp' anti-pattern. Lakes without metadata, governance, and access control rapidly accumulate undocumented files that nobody can use. Disciplined cataloguing is required.
  2. Treating the lake as a primary BI substrate. Pure lakes (without lakehouse table formats) underperform for high-concurrency BI; a lakehouse table format is required for production BI workloads.
  3. Skipping data quality on write. Schema-on-read is flexible but pushes data-quality checks downstream. Without basic quality controls at landing, downstream consumers spend disproportionate time cleaning data.
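A quality control at landing does not have to be a full schema-on-write regime. One minimal sketch, assuming a hypothetical required-fields contract (field names and types are illustrative), is a check run before a record is written to the lake:

```python
# Hypothetical landing contract: fields every record must carry, with types.
REQUIRED = {"event_id": str, "ts": str}

def validate(record: dict) -> list[str]:
    """Return a list of contract violations; empty list means accept."""
    errors = []
    for field, expected_type in REQUIRED.items():
        if field not in record:
            errors.append(f"missing {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field} wrong type")
    return errors

good = {"event_id": "e1", "ts": "2026-04-30T09:00:00Z"}
bad = {"event_id": 42}  # wrong type, and no ts at all
```

Records that fail can still be kept — routed to a quarantine prefix rather than dropped — so the lake's store-everything property is preserved while clean consumers see only validated data.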

Data lakehouse is the modern successor that adds table formats. Data mart is the smaller analytical-store pattern. ETL and ELT are the data-loading patterns. Data governance is the discipline that prevents lakes from becoming swamps.


Frequently asked questions

Should we use a data lake in 2025?

For new builds, default to a data lakehouse rather than a pure data lake — lakehouse table formats (Iceberg, Delta) give you lake-class storage costs with warehouse-class query performance and governance. Pure lakes still make sense for archival, ML training data, and streaming landing zones.

What's the difference between a data lake and a data warehouse?

Lake: cheap object storage, schema-on-read, weak governance, slower BI queries. Warehouse: managed storage, schema-on-write, strong governance, fast queries. Lakehouse combines lake economics with warehouse properties via open table formats.

How do you prevent a 'data swamp'?

Three disciplines: (1) data cataloguing — every dataset has metadata, ownership, and documentation; (2) write-time data quality — basic schema validation at landing; (3) access governance — clear permissions on who can read what. Without all three, lakes degrade into swamps within 12–18 months.
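The cataloguing discipline can start very small: require one metadata record per dataset before anything lands. A hypothetical minimal catalog entry (every field name, path, and owner below is an assumption for illustration) might look like:

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """Minimum viable metadata per lake dataset: where it lives, who owns
    it, what schema it claims, and what it is for."""
    path: str
    owner: str
    schema_ref: str
    description: str
    tags: list[str] = field(default_factory=list)

entry = CatalogEntry(
    path="s3://lake/landing/clickstream/",
    owner="data-platform@example.com",
    schema_ref="schemas/clickstream/v2.json",
    description="Raw click events, JSONL, partitioned by day.",
    tags=["raw", "pii"],
)
```

In practice this record would live in a catalog service (e.g. a Glue- or Hive-metastore-style catalog) rather than in code, but the point stands: a dataset with no owner and no description is a swamp deposit, whatever store it sits in.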


Fairview is an operating intelligence platform that reads from lakehouse-format data (Iceberg, Delta) as well as legacy data lakes — surfacing operating-metric views without requiring teams to migrate raw data into a separate warehouse first.

Siddharth Gangal is the founder of Fairview. He built the lake-aware ingestion path after watching companies sit on years of operating data in S3 that they couldn't query at BI-class performance — the lake-to-warehouse migration was always 'next quarter' until lakehouse table formats made the migration unnecessary.
