TL;DR
- Data warehouse: structured, schema-on-write, fast BI queries, governed — best for reporting and business intelligence on known data models.
- Data lake: raw, schema-on-read, all data types, cheap storage — best for data science, ML, and storing data before its use case is defined.
- Lakehouse: combines lake storage economics with warehouse-grade ACID transactions and query performance — increasingly the default starting architecture in 2026.
- The split is converging: over 50% of organizations now implement lakehouse patterns; the lakehouse market is projected to grow from $14 billion to over $112 billion by 2035.
- Most mature stacks use all three in sequence: raw data lands in a lake, curated data is served through a lakehouse layer, and summary data is loaded into a warehouse for BI tools.
Three terms appear in almost every data infrastructure conversation: data warehouse, data lake, and data lakehouse. They are often used interchangeably, and they should not be. Each solves a distinct problem, has a different cost profile, and performs differently under different workloads. Choosing the wrong one — or building a stack that duplicates work across all three without clear boundaries — is one of the most expensive architectural mistakes a data team can make.
This post defines all three architectures clearly, explains the key differences between them, walks through the tools in each category, compares cost and performance, and gives you a decision framework for choosing which architecture fits your situation. It also covers what drove the emergence of the lakehouse as a concept, how modern data stacks typically combine all three, and what migration actually looks like when you decide to move between architectures.
For context on how data storage choices connect to downstream pipeline design, see our overview of what connected data means in practice and our comparison of ETL versus ELT — the two primary patterns for moving data into these systems.
What Is a Data Warehouse?
A data warehouse is a centralized repository of structured, processed data optimized for analytical queries and business reporting. Data enters a warehouse after it has been extracted from source systems, transformed into a consistent schema, and loaded in a form that query engines can process efficiently. This is the ETL pattern — transform before you store.
The defining characteristic of a warehouse is schema-on-write: every piece of data must conform to a predefined structure before it is accepted. A customer record must have defined fields in defined types. A transaction must carry a consistent set of attributes. If incoming data does not match the schema, it is rejected or transformed before entry. This strictness is not a limitation — it is what makes warehouses reliable for business intelligence. When a finance team runs a revenue report against a warehouse, they are querying data that has been validated, deduplicated, and organized precisely for that kind of query.
Warehouses use columnar storage and pre-computed indexes to execute aggregation queries — SUM, GROUP BY, window functions — at speeds that row-oriented transactional databases cannot match. A query that scans 10 billion rows to compute monthly revenue by product line runs in seconds against a warehouse, not minutes. This performance characteristic is what makes warehouses the correct choice for BI tools like Tableau, Looker, and Power BI.
The trade-offs are real. Warehouses require upfront schema design, which means decisions about what data to collect and how to represent it must be made before data enters the system. Data that does not fit the schema is either transformed at significant cost or left out entirely. Semi-structured data — JSON, XML, nested records — is awkward in traditional warehouses. Unstructured data — documents, images, audio — has no native place.
Warehouse Tools: Snowflake, BigQuery, Redshift
The three dominant cloud data warehouse platforms — Snowflake, Google BigQuery, and Amazon Redshift — share the same core architecture but differ in their pricing models, performance characteristics, and ecosystem integrations.
Snowflake separates compute from storage, meaning you can scale query processing independently of data volume. It stores data in a proprietary columnar format and supports multi-cloud deployments across AWS, Azure, and GCP. Snowflake's primary advantages are ease of administration, near-zero concurrency conflicts (each query gets its own virtual warehouse), and its growing ecosystem of native apps and data sharing capabilities. Its pricing model charges for compute time consumed — which is predictable for known workloads but can produce surprise costs for exploratory query patterns.
Google BigQuery is a serverless warehouse — there are no clusters to provision or warehouses to size. Compute scales automatically. Pricing is per-byte scanned by default (or flat-rate slots for predictable billing). BigQuery is natively integrated with the Google Cloud ecosystem, making it the natural choice for organizations already running on GCP. Its ML capabilities (BigQuery ML) allow training models directly in SQL without moving data to a separate environment.
Amazon Redshift is the oldest of the three major cloud warehouses and has the deepest integration with the AWS ecosystem — S3, Glue, SageMaker, and Kinesis. Redshift Serverless eliminates node management for variable workloads. Redshift Spectrum allows querying S3 data directly without loading it into the warehouse, which blurs the boundary between warehouse and lake functionality. Redshift is often the default choice for organizations already committed to AWS infrastructure.
What Is a Data Lake?
A data lake is a centralized repository that stores data in its raw, native format — structured, semi-structured, and unstructured — at any scale. Unlike a warehouse, a data lake applies no schema at write time. Data is accepted as-is and stored cheaply in object storage. Schema is applied at read time, when a data consumer queries the data and specifies how to interpret it. This is the ELT pattern — load first, transform when needed.
The defining characteristic of a data lake is flexibility. If you do not yet know what questions you will ask of a dataset, you cannot design a schema for it. A data lake allows you to ingest data now and define its structure later — when a data scientist wants to train a model on raw clickstream events, or when a new analytical use case emerges that requires combining data in ways that were not anticipated when it was first collected.
This flexibility comes at a governance cost. A data lake without active curation degrades into a data swamp — petabytes of raw data with no catalog, no quality guarantees, no access controls, and no documentation of what any file contains or whether it is still accurate. The joke in data engineering circles is that data lakes are where data goes to die. The joke is only partially unfair: the organizational discipline required to maintain a usable data lake is significant, and many organizations underinvest in it.
Lake Tools: S3, Azure Data Lake Storage, Google Cloud Storage
Amazon S3 is the most widely used object storage platform and the default foundation for AWS-based data lakes. S3 offers eleven nines of durability, lifecycle policies for cost-tiered storage, and fine-grained access controls via IAM. AWS Glue provides a managed catalog and ETL service on top of S3. Amazon Lake Formation adds governance capabilities — access controls, auditing, and data quality rules — that address the data swamp problem.
Azure Data Lake Storage Gen2 (ADLS Gen2) is Microsoft's managed lake service, built on Azure Blob Storage with a Hadoop-compatible filesystem interface. It integrates natively with Azure Synapse Analytics, Azure Databricks, and Azure Machine Learning. ADLS Gen2 supports hierarchical namespaces — directory-style organization of objects — which improves query performance for tools that rely on file system semantics.
Google Cloud Storage (GCS) is Google's object storage layer and the foundation for GCP-based data lakes. It integrates with BigQuery (via external tables and BigLake), Dataproc, and Vertex AI. GCS's multi-region and dual-region options provide redundancy without the operational overhead of managing replication manually.
What Is a Data Lakehouse?
A data lakehouse is an architecture that adds a transactional metadata layer on top of data lake storage, giving the lake the reliability, performance, and governance properties of a data warehouse — without requiring a separate system. The term was coined by Databricks researchers in a 2021 paper, but the problem it solves had been accumulating for years.
The core innovation is the open table format: a metadata layer that sits between raw object storage files and query engines, tracking which files belong to which table, which records have been inserted, updated, or deleted, and what the schema is at any point in time. This metadata layer is what enables ACID transactions on object storage — a capability that was previously only available in purpose-built warehouse systems.
With ACID transactions on lake storage, a lakehouse can guarantee that a concurrent read never sees a partially written result, that a failed write does not corrupt the table, and that a DELETE operation actually removes records rather than leaving orphaned files that query engines must filter around. These guarantees are what make lake storage safe enough to use directly for business intelligence — the use case that previously required a separate warehouse.
The practical implication: a lakehouse can serve both BI analysts running SQL reports and data scientists training ML models from the same data, without copying data between systems. This eliminates the synchronization lag, storage duplication, and governance inconsistency that plague two-tier architectures.
For a deeper understanding of how lakehouses fit into the broader landscape of data movement patterns, see our explanation of what reverse ETL is — the pattern for pushing processed data from a warehouse or lakehouse back to operational systems.
Lakehouse Tools: Databricks, Delta Lake, Apache Iceberg, Snowflake Hybrid
Databricks is the commercial platform most closely associated with the lakehouse concept. It runs on top of AWS, Azure, or GCP and provides managed Spark compute, the Delta Lake table format, Unity Catalog for governance, and a SQL warehouse product (Databricks SQL) that allows BI tool connections via JDBC/ODBC. Databricks coined the term "lakehouse" and its platform is the most complete implementation of the architecture end to end.
Delta Lake is the open source table format that underlies Databricks' lakehouse offering. It stores data as Parquet files with a transaction log that records all changes, enabling ACID transactions, time travel (querying a table as it existed at a prior point in time), and schema evolution. Delta Lake is natively supported in Apache Spark and has connectors for most major query engines.
Apache Iceberg was created by engineers at Netflix and Apple and has emerged as the most broadly adopted open table format as of 2026. It supports all major query engines — Spark, Flink, Trino, Presto, Hive, Dremio, Snowflake, and BigQuery — making it the most interoperable option. Iceberg's hidden partitioning feature allows partitioning logic to evolve without rewriting existing data. Its catalog abstraction supports multiple backend catalog implementations, including AWS Glue, Hive Metastore, Nessie, and REST catalogs.
Apache Hudi (Hadoop Upserts Deletes and Incrementals) was created at Uber for high-frequency streaming upserts. Where Iceberg optimizes for broad interoperability and Delta Lake optimizes for Spark, Hudi optimizes for write performance on continuously updated datasets — sensor data, log streams, event feeds. It is the table format of choice for organizations whose primary challenge is ingesting high-velocity updates, not querying historical data.
Snowflake's hybrid approach deserves specific mention. Snowflake has historically been a pure warehouse, but its Iceberg Table feature allows data stored in S3 or GCS as Iceberg files to be queried through Snowflake's compute engine without loading it into Snowflake storage. This hybrid model lets organizations use Snowflake for its SQL performance and governance while storing the underlying data in open formats on their own cloud storage — keeping exit flexibility while retaining the Snowflake experience that analysts expect.
The Rise of the Lakehouse: Why It Emerged as a Concept
The lakehouse did not emerge because warehouses and lakes were bad ideas. Both architectures were correct solutions for their original problem statements. The problem was that organizations outgrew those original problem statements.
Through the 2010s, most data teams followed the same trajectory. They built a data warehouse first — typically on-premises with Teradata or Netezza, then migrated to Redshift or BigQuery as cloud adoption accelerated. The warehouse handled BI reporting well. Then data science became a priority, and the warehouse was the wrong tool: it could not handle unstructured data, it was expensive for exploratory analysis on raw data, and its schema rigidity prevented the kind of ad-hoc data access that data scientists required.
The response was to add a data lake alongside the warehouse. Store raw data in S3 or ADLS for data science. Keep processed data in the warehouse for BI. The two-tier architecture seemed rational. In practice, it created three chronic problems that compounded over time.
First, data duplication. The same data existed in two systems — raw in the lake, processed in the warehouse. Storage costs doubled. ETL pipelines had to move data from lake to warehouse, adding latency and failure points. When a transformation rule changed, it had to be updated in two places.
Second, governance fragmentation. Access controls, data quality rules, and compliance policies applied to the warehouse did not automatically apply to the lake. The lake accumulated uncontrolled copies of sensitive data — customer PII, financial records — that sat outside the governance perimeter. This created compliance exposure that legal and security teams increasingly flagged as unacceptable.
Third, the freshness gap. BI reports consumed warehouse data. ML models consumed lake data. These two datasets were never perfectly synchronized. Analysts and data scientists had different views of the truth, with the gap between them determined by how frequently the ETL pipeline ran. For organizations making time-sensitive decisions, this gap was not acceptable.
The lakehouse addressed all three problems by collapsing the two tiers into one: open format storage that served both BI and ML workloads from the same physical data, with a transactional metadata layer providing the reliability guarantees that each workload required.
Understanding how data integration failures originate in multi-tier architectures is essential context for this discussion — our analysis of AI data integration challenges covers the specific failure modes that emerge when data pipelines span warehouse and lake systems without a unified governance layer.
Key Differences: Structure vs. Flexibility vs. Hybrid
| Dimension | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Schema | Schema-on-write (enforced before storage) | Schema-on-read (applied at query time) | Schema-on-write for curated tables; schema-on-read for raw zones |
| Data types | Structured only (rows and columns) | Structured, semi-structured, unstructured | Structured, semi-structured; unstructured via external tables |
| ACID transactions | Yes — native | No — files are append-only by default | Yes — via table format (Delta Lake, Iceberg, Hudi) |
| Query performance (BI) | Excellent | Poor without optimization | Good to excellent (with caching and optimization) |
| ML workloads | Limited — structured data only | Excellent — native access to raw features | Excellent — unified access for both SQL and ML frameworks |
| Governance | Strong — schema, access controls, audit logs built in | Weak by default — requires additional tooling | Strong — unified catalog covers all data |
| Storage cost | Higher ($20–$40/TB/month) | Lowest ($2–$3/TB/month) | Low ($2–$5/TB/month) |
| Vendor lock-in | High (proprietary formats) | Low (open formats on commodity storage) | Low (open formats like Iceberg) |
| Time travel | Limited (some vendors) | No — no native versioning | Yes — native via table format transaction log |
Cost Comparison in Depth
Storage pricing comparisons between the three architectures are commonly cited but misleading in isolation. The accurate comparison requires looking at total cost of ownership: storage, compute, egress, data preparation, and the engineering time required to make data usable for each workload type.
Storage costs: Object storage on AWS S3, Azure ADLS Gen2, and GCS runs approximately $0.02–$0.03 per GB per month in standard tiers, falling to $0.004–$0.01 per GB in infrequent access and archive tiers. Managed warehouse storage — Snowflake's internal storage, for example — is priced at $23–$46 per TB per month depending on region and plan tier. For large raw data volumes, this is a 10x to 20x storage cost difference per byte stored. The lake and lakehouse win on raw storage economics by a substantial margin for large-volume, low-query-frequency data.
Compute costs: Warehouse compute pricing is typically per-credit or per-slot (BigQuery). Lakehouse compute on Databricks is priced in DBUs (Databricks Units), which vary by cluster type and cloud. For heavy BI query workloads, warehouse compute is often more cost-efficient than running equivalent queries on a general-purpose Spark cluster. For ML training workloads, lake and lakehouse compute (with access to GPU clusters) is necessary — warehouses cannot serve this workload at all.
Data preparation costs: This is where lake-only architectures incur hidden costs. Raw lake data requires cataloging, profiling, quality checks, transformation, and documentation before it is usable by analysts. These activities require data engineering time and often additional tooling (data quality platforms, catalog tools, transformation layers). Organizations that treat a data lake as a plug-and-play BI solution consistently underestimate this cost. The lakehouse architecture reduces data preparation cost by providing governance infrastructure (catalog, access controls, schema enforcement) as part of the platform rather than as a separate implementation project.
Duplication costs: Two-tier architectures (separate lake and warehouse) store the same data twice for curated datasets — once in the lake as raw data, once in the warehouse after transformation. At petabyte scale, this doubles storage cost for the data that matters most. The lakehouse architecture eliminates this duplication by serving both workloads from the same physical storage.
Performance Comparison
Performance varies significantly by workload type, and no single architecture dominates across all dimensions.
BI query performance: Managed data warehouses (Snowflake, BigQuery, Redshift) consistently outperform lakehouses for concurrent, complex SQL queries run by many BI users simultaneously. Warehouses store data in highly optimized columnar formats with pre-computed statistics that query planners use to minimize bytes scanned. Lakehouse query performance has improved substantially with technologies like Z-ordering, data skipping, and liquid clustering in Delta Lake and Iceberg, but for organizations with large numbers of concurrent analysts running ad-hoc queries, a dedicated warehouse still has an edge.
ML training performance: Data lakes and lakehouses are superior to warehouses for ML training. Training a large model on petabytes of raw feature data requires distributed compute with direct access to files — a pattern that object storage and Spark are purpose-built for. Warehouses that offer ML capabilities (BigQuery ML, Snowflake Cortex) are suitable for standard regression and classification models but cannot serve deep learning workloads that require GPU clusters and frameworks like PyTorch or TensorFlow.
Streaming and real-time ingestion: Data lakes and lakehouses handle high-velocity streaming ingestion better than warehouses. Apache Kafka paired with Spark Structured Streaming or Apache Flink writes continuously to lake storage at throughput levels that would overwhelm warehouse ingestion APIs. Warehouses are typically refreshed in batch — hourly or daily — which creates freshness gaps that are unacceptable for real-time use cases like anomaly detection or fraud scoring.
Query latency for dashboards: Warehouses return sub-second results on well-modeled summary tables. Lakehouses can match this with caching and materialized views, but the default query latency on cold lakehouse tables is higher. Organizations building dashboards that require sub-second refresh rates for operational decisions are better served by a warehouse or a dedicated OLAP layer (Apache Druid, ClickHouse) downstream of the lakehouse.
Use Cases for Each Architecture
When a Data Warehouse Is the Right Choice
- Your primary consumers are BI analysts using SQL or BI tools, and their queries run on structured business data — revenue, pipeline, orders, churn, marketing attribution.
- You need sub-second dashboard refresh rates for operational monitoring.
- Your data volume is in the tens or hundreds of terabytes, not petabytes — storage cost differences between warehouse and lake storage are less material at this scale.
- Your team does not have data scientists running ML workloads, or those workloads are handled by a separate, specialized system.
- You need the tightest possible governance — row-level security, column-level masking, complete audit logs — and want that governance built into the platform rather than implemented on top of it.
When a Data Lake Is the Right Choice
- You need to store large volumes of raw data — logs, clickstreams, sensor data, media files — at the lowest possible cost, and the primary consumers are data scientists or ML engineers, not BI analysts.
- Your data includes unstructured types (images, documents, audio, video) that cannot be stored in a warehouse.
- You do not yet know what questions you will ask of the data. You want to preserve optionality by storing everything raw before committing to a schema.
- You are building a data platform from scratch and plan to implement a lakehouse architecture on top of the lake storage — meaning the lake is a foundation, not a destination.
When a Lakehouse Is the Right Choice
- You are currently maintaining separate lake and warehouse systems and paying the synchronization, duplication, and governance costs of a two-tier architecture.
- You need to serve both SQL/BI workloads and ML training workloads from the same data, with consistent governance across both.
- You are building a new data stack and want the most modern architecture as your starting point, with maximum openness and minimum vendor lock-in.
- You need time travel and audit capabilities — the ability to query your data as it existed at any prior point, or to roll back an accidental deletion.
- You work with real-time data streams and need fresh data available for both analytics and ML without maintaining two separate pipelines.
Decision Framework: How to Choose
| Your situation | Recommended starting architecture |
|---|---|
| Small-to-midsize team; primary use case is BI reporting on revenue, pipeline, and operations data; no ML requirements now | Data warehouse (Snowflake, BigQuery, or Redshift) |
| Large data science team; primary use case is ML training on raw, semi-structured, or unstructured data; BI is secondary | Data lake (S3/ADLS/GCS) with a lakehouse layer for curated tables |
| Mixed team of analysts and data scientists; both BI and ML workloads are important; building a modern stack from scratch | Lakehouse (Databricks with Delta Lake or Iceberg) |
| Currently running separate lake and warehouse; experiencing synchronization problems or governance gaps between them | Consolidate to a lakehouse; use Iceberg for maximum interoperability |
| Highly regulated industry (finance, healthcare); strict governance, audit, and data residency requirements | Data warehouse for production BI; evaluate Databricks Unity Catalog or Snowflake for lakehouse governance |
| Real-time streaming data (IoT, clickstream, event logs) at high volume; need sub-hour freshness for both analytics and ML | Lakehouse with streaming ingestion (Kafka + Spark Structured Streaming or Flink) |
| Existing investment in AWS ecosystem; want to leverage managed services with minimal operational overhead | Redshift + S3 with Lake Formation, or evaluate AWS Glue Iceberg tables |
How Modern Data Stacks Combine All Three
In practice, mature data organizations do not operate a single architecture in isolation. They use all three in a deliberate, layered sequence that assigns each architecture to the workload it handles best. This is not a sign of architectural confusion — it is a sign of architectural maturity.
The typical pattern looks like this:
- Raw ingestion layer (data lake): All incoming data — from SaaS APIs, event streams, operational databases, third-party data providers — lands in object storage as raw files. This is the append-only record of everything the organization has ever ingested. No transformation is applied. Data is partitioned by source and date. Storage cost is minimized with lifecycle policies that move old data to cheaper storage tiers.
- Curated lakehouse layer: Data engineers apply transformation, quality checks, deduplication, and schema enforcement to promote raw data into curated tables. These tables are written in an open format (typically Iceberg or Delta Lake) on the same object storage, governed by a unified catalog (Databricks Unity Catalog, AWS Glue, or Apache Polaris). Data scientists access these tables directly for ML training. Streaming data is processed continuously into near-real-time tables.
- BI warehouse layer: For the subset of data consumed by BI tools — dimensional models, summary aggregations, KPI tables — data is loaded into a warehouse (Snowflake or BigQuery) where query performance is optimized for concurrent analyst access and dashboard refresh rates. This layer contains far less data than the lake or lakehouse and is refreshed on a schedule appropriate for the reporting cadence.
This three-tier pattern reflects the different requirements of different data consumers. BI analysts get the fast, predictable performance of a warehouse on clean, modeled data. Data scientists get direct access to raw and curated lake data without going through a warehouse bottleneck. Streaming systems write to the lakehouse layer in near-real-time without conflicting with warehouse load jobs.
The Medallion Architecture
The three-tier pattern described above is frequently formalized as the Medallion Architecture, popularized by Databricks. It defines three data layers by quality and structure level:
- Bronze layer: Raw ingested data — unmodified, append-only. The source of truth for raw records.
- Silver layer: Cleaned, deduplicated, and standardized data — joined across sources, nulls handled, types cast. The foundation for analytics and ML features.
- Gold layer: Aggregated, business-ready tables — dimensional models, KPI metrics, summary tables. What BI tools and dashboards query.
Migration Considerations
Deciding to move between architectures — from a warehouse to a lakehouse, from a lake to a lakehouse, or from a two-tier stack to a unified lakehouse — requires a clear-eyed assessment of what migration actually involves. Many organizations underestimate the complexity and overestimate the speed.
Migrating from a warehouse to a lakehouse is the most common migration path in 2026. The typical triggers are cost (warehouse storage at scale is expensive) and workload expansion (adding ML or streaming capabilities that the warehouse cannot serve). The migration involves exporting warehouse data to Parquet files, converting to an open table format, validating query equivalence, migrating transformation pipelines to dbt or Spark, and cutover planning to avoid downtime for BI tools. For organizations with a mature dbt project, the transformation layer migrates cleanly. The harder work is re-pointing BI tools and verifying that query results match the warehouse exactly.
Migrating from a lake to a lakehouse is technically simpler than moving from a warehouse but organizationally complex. The technical work is adding a table format layer on top of existing raw files — converting Parquet directories to Delta Lake or Iceberg tables, registering them in a catalog, and applying governance policies. The organizational challenge is the data catalog work: inventorying what exists in the lake, documenting schemas, assigning ownership, and establishing quality standards for data that has been accumulating without governance for years. This catalog work typically takes longer than the technical conversion.
Migrating between lakehouse platforms — from Delta Lake to Iceberg, or from one cloud vendor's managed lakehouse to another — is made significantly easier by the open format standards. Because both formats store data as Parquet files with metadata logs, conversion utilities exist that can translate Delta Lake tables to Iceberg format without rewriting the underlying data. The migration risk is primarily in the query engine and catalog layers, not the storage layer.
The key migration risks to plan for:
- Query result discrepancies: Differences in how null values, timezone handling, rounding, and aggregation edge cases are computed between the old and new system. Run a parallel validation period where both systems produce outputs and compare them before cutover.
- BI tool reconnection: BI tools connected via JDBC/ODBC to a warehouse require driver updates and connection string changes to connect to a lakehouse. This is low-risk technically but time-consuming operationally if there are many reports and dashboards.
- Pipeline rebuild costs: If transformation pipelines were written in warehouse-specific SQL dialects (Snowflake SQL, BigQuery Standard SQL, Redshift SQL), they may require rewriting for the target platform. Test transformation equivalence before decommissioning the source system.
- Permission model differences: Warehouse row-level security and column masking policies do not automatically translate to lakehouse governance. Map every existing permission policy to the equivalent lakehouse governance construct before migration and verify that no data access gaps are introduced.
For a complete picture of how data movement patterns — ETL, ELT, and reverse ETL — fit into a lakehouse architecture, see our overview of ETL versus ELT: which pattern to choose and our explanation of what connected data means for operating intelligence.
What the Research and Industry Show
The direction of the market is unambiguous. Databricks' analysis of enterprise data architectures found that organizations maintaining separate lake and warehouse systems consistently report higher operational cost, more frequent data quality incidents, and slower time-to-insight compared to organizations that have consolidated to a unified lakehouse layer.
The open table format ecosystem has reached a level of maturity that removes the primary historical risk of lakehouse adoption: vendor lock-in. Apache Iceberg now has native support from Snowflake, BigQuery, AWS, Azure, Databricks, and over a dozen query engines. An organization that adopts Iceberg as its table format is not locked into any single compute vendor — they can switch query engines, add new consumers, and negotiate with multiple vendors from a position of portable, open data.
The practical implication for operators evaluating their data infrastructure: the question is no longer whether to adopt a lakehouse architecture. The question is how quickly to migrate existing two-tier stacks, and which open table format to standardize on. Independent analysis from MotherDuck and other data infrastructure vendors consistently points to Iceberg as the safest choice for organizations that prioritize interoperability and long-term exit optionality over depth of integration with a single vendor's platform.
Frequently Asked Questions
Key Takeaways
- A data warehouse is the right choice when your primary consumers are BI analysts running SQL queries on structured business data and you need fast, predictable query performance with strong governance.
- A data lake is the right choice when you need to store large volumes of raw, semi-structured, or unstructured data at minimum cost, or when data science and ML workloads are the primary use case.
- A data lakehouse eliminates the data duplication, synchronization lag, and governance fragmentation of two-tier architectures by providing warehouse-grade reliability on top of lake-cost storage.
- The key technologies enabling lakehouses are open table formats — Delta Lake, Apache Iceberg, and Apache Hudi — which add ACID transactions, time travel, and schema evolution to object storage.
- Apache Iceberg has emerged as the most interoperable open table format as of 2026, with native support from Snowflake, BigQuery, AWS, Azure, and Databricks.
- Most mature data stacks use all three architectures in sequence: raw ingestion to a lake, curation to a lakehouse layer, and summary data loaded to a warehouse for BI — the Medallion Architecture pattern.
- Migration from a two-tier warehouse-plus-lake stack to a unified lakehouse is the most common data infrastructure project in 2026; the primary risks are query result discrepancies and permission model gaps, not storage conversion.
- Organizations evaluating lakehouse adoption should standardize on open table formats to preserve query engine portability and avoid proprietary lock-in as the architecture evolves.
Siddharth Gangal is the founder of Fairview, an Operating Intelligence Platform that turns fragmented operating data into decisive action — so operators always know what is making money, what is leaking margin, and what to do next.