
CDC (Change Data Capture)



TL;DR

CDC (Change Data Capture) is the technique of identifying and tracking changes in source databases — inserts, updates, deletes — and propagating those changes to downstream systems incrementally. CDC enables low-latency, low-load data pipelines compared to full-table scans. Modern CDC tools (Debezium, Fivetran's HVR, Airbyte, native AWS DMS / GCP Datastream) read database transaction logs to capture changes without impacting source-system performance. CDC is the backbone of modern ELT at scale.

What is CDC?

CDC (Change Data Capture) is the technique of detecting and propagating data changes in source databases — typically by reading the database's transaction log (PostgreSQL WAL, MySQL binlog, SQL Server CDC tables) rather than running periodic full-table scans.

It is the foundational technique for low-latency, low-impact data extraction at scale. Without CDC, ELT pipelines either run periodic full-table scans (slow, expensive, high source-system load) or rely on application-level change tracking (incomplete, error-prone, manual).
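
As an illustration of reading the log directly, here is a minimal sketch against PostgreSQL using psycopg2's logical replication support. It assumes wal_level = logical on the source; the DSN, slot name, and output plugin are placeholder choices for the example.

    # Minimal log-based CDC sketch: stream decoded changes from the Postgres WAL.
    import psycopg2
    import psycopg2.extras

    conn = psycopg2.connect(
        "dbname=appdb user=replicator",  # placeholder DSN
        connection_factory=psycopg2.extras.LogicalReplicationConnection,
    )
    cur = conn.cursor()

    # Create a replication slot so the server retains WAL until we confirm it.
    cur.create_replication_slot("cdc_demo", output_plugin="test_decoding")
    cur.start_replication(slot_name="cdc_demo", decode=True)

    def consume(msg):
        # msg.payload is one decoded change (INSERT/UPDATE/DELETE); acknowledging
        # the LSN lets the server recycle old WAL segments.
        print(msg.payload)
        msg.cursor.send_feedback(flush_lsn=msg.data_start)

    cur.consume_stream(consume)  # blocks, streaming changes as they commit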

How CDC works

Modern CDC reads the database transaction log directly:

  • The source database writes every change (INSERT, UPDATE, DELETE) to a transaction log for crash recovery — Postgres WAL, MySQL binlog, SQL Server transaction log
  • A CDC connector reads the log incrementally, parsing change events without requiring schema changes or adding query overhead on the source
  • Change events are written to a streaming platform (such as Kafka) or directly to the warehouse, preserving event ordering and transactional consistency
  • Downstream consumers (warehouses, lakehouses, search systems, caches) apply the changes to keep their copies in sync
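
In practice the connector step is usually an off-the-shelf component. As a hedged sketch, the snippet below registers a Debezium PostgreSQL connector through the Kafka Connect REST API; the host names, credentials, connector name, and table list are placeholders, while the config keys are standard Debezium properties.

    # Register a Debezium Postgres connector with Kafka Connect.
    import json
    import urllib.request

    connector = {
        "name": "orders-cdc",  # placeholder connector name
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "db.internal",
            "database.port": "5432",
            "database.user": "replicator",
            "database.password": "********",
            "database.dbname": "appdb",
            "topic.prefix": "appdb",                # Kafka topic namespace
            "table.include.list": "public.orders",  # capture only this table
        },
    }

    req = urllib.request.Request(
        "http://connect.internal:8083/connectors",  # Kafka Connect REST endpoint
        data=json.dumps(connector).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # Debezium starts streaming changes to Kafka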

CDC vs alternative extraction patterns

Pattern                                  | Latency           | Source load | Completeness
Full-table scan (periodic)               | High (hours/days) | Very high   | Complete (heavy)
Incremental query (WHERE updated_at > X) | Medium (minutes)  | Medium      | Misses deletes
Application-level change tracking        | Variable          | Low         | Often incomplete
CDC (transaction log)                    | Low (seconds)     | Very low    | Complete (incl. deletes)

Common CDC tools (2025)

  • Debezium: open-source CDC connector framework, widely regarded as the gold standard for transaction-log-based CDC
  • Fivetran (HVR / native log-based extraction): managed CDC for many warehouse-targeted use cases
  • Airbyte: open-source connector ecosystem with growing CDC support
  • AWS Database Migration Service (DMS): AWS-managed CDC with broad source support
  • GCP Datastream: Google's managed CDC service
  • Estuary Flow: streaming-first CDC platform
  • Snowflake Streams: native CDC for data already in Snowflake

Common pitfalls

  1. Treating CDC as 'incremental queries'. Application-level tracking (WHERE updated_at > X) misses deletes and is fragile. True log-based CDC is the right choice when both performance and completeness matter.
  2. Ignoring schema evolution. Source schemas change, and CDC pipelines need schema-evolution handling; without it, schema changes break pipelines silently.
  3. Underestimating ordering and exactly-once requirements. CDC preserves transactional ordering, so downstream consumers must apply changes in order with exactly-once semantics for correctness. Some targets handle this natively; others require careful integration (see the sketch after this list).
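
To make pitfalls 1 and 3 concrete, here is a toy apply function for Debezium-style change events: it propagates deletes explicitly and uses the source log position (LSN) to stay correct under replays, which is the usual way to get effective exactly-once behaviour from at-least-once delivery. The event shape, primary-key column, and in-memory target are assumptions for illustration.

    # Idempotent, ordered apply of CDC change events (Debezium-style envelopes).
    applied_lsn = {}  # primary key -> log position of last applied change

    def apply_change(event, target):
        """Apply one change event to a dict target; safe to replay."""
        row = event["after"] or event["before"]
        key, lsn = row["id"], event["source"]["lsn"]
        if applied_lsn.get(key, -1) >= lsn:
            return  # duplicate or out-of-order replay: already applied, skip
        if event["op"] == "d":
            target.pop(key, None)          # deletes arrive as explicit events
        else:
            target[key] = event["after"]   # create/update/snapshot ("c"/"u"/"r")
        applied_lsn[key] = lsn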

ELT pipelines often use CDC for the extract step. Reverse ETL can use CDC for efficient warehouse-to-operational syncs. Lakehouse table formats (Iceberg, Delta) handle CDC outputs natively via merge operations. Data products backed by CDC have lower freshness lag.
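
For the lakehouse case, applying a CDC micro-batch typically comes down to a MERGE keyed on the primary key. A hedged PySpark / Delta Lake sketch follows; the table path, key column, and the changes DataFrame (Debezium-style rows with an op column) are assumptions, and Iceberg's MERGE INTO is analogous.

    # Apply a CDC micro-batch to a Delta table with MERGE (PySpark sketch).
    # Assumes an active SparkSession `spark` and a `changes` DataFrame.
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/lake/orders")  # placeholder path

    (target.alias("t")
        .merge(changes.alias("c"), "t.id = c.id")
        .whenMatchedDelete(condition="c.op = 'd'")   # propagate deletes
        .whenMatchedUpdateAll()                      # apply updates
        .whenNotMatchedInsertAll()                   # apply inserts
        .execute())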


Frequently asked questions

Do I need CDC?

If your data volumes are large enough that full-table scans are slow or impact source-system performance, yes. If your pipelines need low latency (seconds, not hours), yes. For small-volume sources where full scans run quickly, CDC may be more complexity than the benefit warrants.

What's the difference between CDC and event streaming?

CDC captures database changes (INSERT/UPDATE/DELETE rows). Event streaming captures application events (user actions, system events). Both are streaming patterns; they're complementary. Mature data architectures often have both: CDC for database state, event streaming for application behaviour.

Can CDC handle deletes?

Log-based CDC handles deletes correctly — that's one of its key advantages over query-based incremental extraction (which has no way to detect deleted rows). When using CDC outputs in lakehouses, ensure the table format and merge logic preserve delete operations correctly.
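
For context, a log-based delete arrives as an explicit event rather than a silent absence. A simplified Debezium-style envelope (field names follow Debezium's conventions; values are illustrative) looks like this:

    # Shape of a delete event: "before" holds the deleted row, "after" is null.
    delete_event = {
        "op": "d",                                  # c=create, u=update, d=delete, r=snapshot
        "before": {"id": 42, "status": "active"},   # row state prior to deletion
        "after": None,                              # no post-image for a delete
        "source": {"table": "orders", "lsn": 987654},
        "ts_ms": 1735689600000,
    }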


Fairview is an operating intelligence platform that consumes CDC-fed warehouse data — ensuring operating views reflect source-of-truth state with seconds of lag rather than hours, without requiring custom CDC integration per source.

Siddharth Gangal is the founder of Fairview. He built the CDC-aware ingestion layer after watching companies invest in CDC pipelines for sub-minute warehouse freshness only to have downstream operating tools poll daily — defeating the latency advantage that the CDC stack was paying for.
