TL;DR
CDC (Change Data Capture) is the technique of identifying and tracking changes in source databases (inserts, updates, deletes) and propagating those changes to downstream systems incrementally. Compared to periodic full-table scans, CDC enables low-latency, low-load data pipelines. Modern CDC tools (Debezium, Fivetran's HVR, Airbyte, AWS DMS, GCP Datastream) read database transaction logs to capture changes with minimal impact on source-system performance. CDC is the backbone of modern <a href="/glossary/elt" class="text-brand-600 underline decoration-brand-200 underline-offset-2 hover:text-brand-700">ELT</a> at scale.
What is CDC?
CDC (Change Data Capture) is the technique of detecting and propagating data changes in source databases — typically by reading the database's transaction log (PostgreSQL WAL, MySQL binlog, SQL Server CDC tables) rather than running periodic full-table scans.
It is the foundational technique for low-latency, low-impact data extraction at scale. Without CDC, ELT pipelines either run periodic full-table scans (slow, expensive, high source-system load) or rely on application-level change tracking (incomplete, error-prone, manual).
How CDC works
Modern CDC reads the database transaction log directly:
- Source database writes every change (INSERT, UPDATE, DELETE) to a transaction log for crash recovery — Postgres WAL, MySQL binlog, SQL Server transaction log
- CDC connector reads the log incrementally, parsing change events without requiring schema changes or query overhead on the source
- Change events are written to a streaming platform (Kafka) or directly to the warehouse, preserving event ordering and transactional consistency
- Downstream consumers (warehouses, lakehouses, search systems, caches) apply the changes to keep their copies in sync
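The consumer step above can be sketched as a minimal apply loop. The event shape loosely follows Debezium's change-event envelope (`op`, `before`, `after`); the field and key names here are illustrative assumptions, not any tool's exact schema:

```python
# Minimal sketch: applying log-ordered CDC change events to a local replica.
# The envelope loosely mirrors Debezium's ("op", "before", "after");
# names are illustrative, not an exact wire format.

replica = {}  # primary key -> row dict, standing in for a downstream table

def apply_event(event):
    """Apply one change event in log order."""
    op = event["op"]  # "c" = insert, "u" = update, "d" = delete
    if op in ("c", "u"):
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":
        replica.pop(event["before"]["id"], None)

events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@x.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@x.com"},
     "after": {"id": 1, "email": "a@y.com"}},
    {"op": "d", "before": {"id": 1, "email": "a@y.com"}, "after": None},
]
for e in events:
    apply_event(e)

print(replica)  # {} -- the row was inserted, updated, then deleted
```

Because the log records the delete explicitly, the replica ends up empty; a query-based incremental pull over the same history would still hold the stale row.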
CDC vs alternative extraction patterns
| Pattern | Latency | Source load | Completeness |
|---|---|---|---|
| Full-table scan (periodic) | High (hours/days) | Very high | Complete (heavy) |
| Incremental query (where updated_at > X) | Medium (minutes) | Medium | Misses deletes |
| Application-level change tracking | Variable | Low | Often incomplete |
| CDC (transaction log) | Low (seconds) | Very low | Complete (incl. deletes) |
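The "misses deletes" row in the table is worth making concrete. A minimal sketch, simulating query-based incremental extraction over a hypothetical source table (column names and data are assumptions):

```python
from datetime import datetime

# Hypothetical source table with an updated_at column.
source = {
    1: {"id": 1, "name": "alice", "updated_at": datetime(2025, 1, 1)},
    2: {"id": 2, "name": "bob",   "updated_at": datetime(2025, 1, 2)},
}

def incremental_extract(last_sync):
    """Query-based incremental pull: WHERE updated_at > last_sync."""
    return [row for row in source.values() if row["updated_at"] > last_sync]

last_sync = datetime(2025, 1, 2)
del source[1]  # a hard delete leaves no row behind to query

changes = incremental_extract(last_sync)
print(changes)  # [] -- the delete is invisible to the incremental query
```

The deleted row simply stops appearing in query results, so the downstream copy keeps it forever; log-based CDC sees the delete as an explicit event.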
Common CDC tools (2025)
- Debezium: open-source CDC connector framework, widely regarded as the de facto standard for transaction-log-based CDC
- Fivetran (HVR / native log-based extraction): managed CDC for many warehouse-targeted use cases
- Airbyte: open-source connector ecosystem with growing CDC support
- AWS Database Migration Service (DMS): AWS-managed CDC with broad source support
- GCP Datastream: Google's managed CDC service
- Estuary Flow: streaming-first CDC platform
- Snowflake Streams: native CDC for data already in Snowflake
Common pitfalls
- 1. Conflating incremental queries with CDC. Application-level tracking (WHERE updated_at > X) misses hard deletes and is fragile. Log-based CDC is the right choice when both performance and completeness matter.
- 2. Ignoring schema evolution. Source schemas change; CDC pipelines need schema-evolution handling (for example, propagating added columns and flagging incompatible type changes). Without it, a source schema change can silently halt the pipeline or corrupt downstream data.
- 3. Underestimating ordering and exactly-once requirements. CDC preserves transactional ordering, but downstream consumers must apply changes in that order, and duplicates from at-least-once delivery must be handled idempotently for correctness. Some downstream targets handle this natively; others require careful integration.
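The ordering and duplicate-handling concern in the third pitfall can be sketched as an idempotent apply: track the highest applied log position per key and skip anything stale. A minimal sketch; the field names (`key`, `lsn`, `op`, `row`) are illustrative assumptions:

```python
# Sketch: turning at-least-once CDC delivery into effectively-once apply
# by tracking the last applied log position (LSN) per key.
# Field names are illustrative.

applied_lsn = {}  # key -> last applied LSN
table = {}        # key -> current row

def apply_idempotent(event):
    """Apply an event unless it is a duplicate or out of order."""
    key, lsn = event["key"], event["lsn"]
    if lsn <= applied_lsn.get(key, -1):
        return False  # duplicate or stale: skip
    if event["op"] == "d":
        table.pop(key, None)
    else:
        table[key] = event["row"]
    applied_lsn[key] = lsn
    return True

# The same event delivered twice (at-least-once) is applied only once.
e = {"key": 1, "lsn": 100, "op": "u", "row": {"id": 1, "v": 2}}
print(apply_idempotent(e))  # True
print(apply_idempotent(e))  # False
```

Warehouses that support MERGE on a unique key give you roughly this behaviour for free; targets without it need this bookkeeping in the pipeline.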
Related concepts
ELT pipelines often use CDC for the extract step. Reverse ETL can use CDC for efficient warehouse-to-operational syncs. Lakehouse table formats (Iceberg, Delta) handle CDC outputs natively via merge operations. Data products backed by CDC have lower freshness lag.
Frequently asked questions
Do I need CDC?
If your data volumes are large enough that full-table scans are slow or impact source-system performance, yes. If your pipelines need low latency (seconds, not hours), yes. For small-volume sources where full scans run quickly, CDC may be more complexity than the benefit warrants.
What's the difference between CDC and event streaming?
CDC captures database changes (INSERT/UPDATE/DELETE rows). Event streaming captures application events (user actions, system events). Both are streaming patterns; they're complementary. Mature data architectures often have both: CDC for database state, event streaming for application behaviour.
Can CDC handle deletes?
Log-based CDC handles deletes correctly — that's one of its key advantages over query-based incremental extraction (which has no way to detect deleted rows). When using CDC outputs in lakehouses, ensure the table format and merge logic preserve delete operations correctly.
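As an illustration of merge logic that preserves deletes, here is a minimal MERGE-style batch apply; the `op` flag values and column names are assumptions for the sketch, not any table format's API:

```python
# Sketch: MERGE-style batch apply for a lakehouse-like target.
# Each change row carries an op flag; deletes must remove rows
# rather than being dropped on the floor. Names are illustrative.

def merge_batch(target, batch):
    """Apply a CDC batch to target (key -> row), preserving deletes."""
    for change in batch:
        if change["op"] == "delete":
            target.pop(change["id"], None)
        else:  # inserts and updates both upsert
            target[change["id"]] = {
                k: v for k, v in change.items() if k != "op"
            }
    return target

target = {1: {"id": 1, "status": "active"}}
batch = [
    {"op": "update", "id": 1, "status": "churned"},
    {"op": "delete", "id": 1},
]
print(merge_batch(target, batch))  # {} -- the delete survived the merge
```

A merge that only upserts (ignoring the delete flag) would leave the churned row in place indefinitely, which is exactly the drift log-based CDC is meant to prevent.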
Fairview is an operating intelligence platform that consumes CDC-fed warehouse data — ensuring operating views reflect source-of-truth state with seconds of lag rather than hours, without requiring custom CDC integration per source. Start your free trial →
Siddharth Gangal is the founder of Fairview. He built the CDC-aware ingestion layer after watching companies invest in CDC pipelines for sub-minute warehouse freshness only to have downstream operating tools poll daily — defeating the latency advantage that the CDC stack was paying for.