Data & Analytics 15 min read

How to Connect Multiple Data Sources: Architecture

Guide to connecting multiple data sources — ETL vs ELT, tool comparison (Fivetran, Airbyte, dbt), integration patterns, and a step-by-step setup process.

Siddharth Gangal · Founder, Fairview Published May 29, 2026 Updated May 31, 2026 Reviewed by Jordan Cole Editorial standards

Key takeaways

Guide to connecting multiple data sources — ETL vs ELT, tool comparison (Fivetran, Airbyte, dbt), integration patterns, and a step-by-step setup process.

Part of the Data Infrastructure topic hub.

TL;DR

ELT is the default architecture for connecting multiple data sources into a central warehouse — extract raw data, load it first, then transform with dbt.
Fivetran, Airbyte, and Stitch handle extraction and loading. dbt handles transformation. These are not competing tools — they work in sequence.
Reverse ETL is a separate pattern for pushing warehouse data back into operational tools (CRM enrichment, product activation).
Data virtualization queries data in place without moving it — useful for ad-hoc analysis, but not for high-volume dashboards.
Data mesh is a governance model, not a technology — relevant for organizations with 200+ people and multiple data-producing teams.
The most common failure is connecting sources without a canonical schema, then spending months debugging metric discrepancies that could have been prevented.

Most operators reach the same point: their CRM says one thing, their billing system says another, and their finance tool produces a third number. All three are pulling from real data. The problem is not accuracy at the source — it is that the sources were never connected in a way that makes them comparable.

Connecting multiple data sources is a solved problem at the architecture level. The patterns are well established, the tooling is mature, and the failure modes are documented. What trips operators and analysts is picking the wrong pattern for their use case, or skipping the foundational work — defining what the unified data should look like before building the pipeline to produce it.

This guide covers the four main integration architectures, how the tools map onto them, a comparison of the leading options, and a step-by-step process for the most common use cases. If you have already connected sources but are dealing with inconsistent metrics across them, the guide on data normalization from multiple business sources covers the transformation layer in detail.

The Four Data Integration Architectures

Before choosing a tool, choose an architecture. The tool selection follows from what you are trying to do — not the other way around.

ETL: Extract, Transform, Load

ETL is the original integration pattern. Data is extracted from source systems, transformed in an intermediate processing layer, and then loaded into a target destination. Transformation happens before the data lands anywhere persistent.

ETL was the standard when compute was expensive and data warehouses had limited storage. Transforming data before loading it kept warehouse size manageable. Today, cloud warehouses have made storage cheap enough that pre-load transformation is no longer necessary — and often counterproductive, because it makes reprocessing expensive when transformation logic changes.

ETL remains relevant for specific use cases: real-time streaming pipelines where data must be validated and shaped before reaching a downstream system, or compliance-sensitive environments where certain raw data cannot be stored at all.

ELT: Extract, Load, Transform

ELT is the current default for analytical data integration. Raw data is extracted from source systems, loaded into a data warehouse or data lake as-is, and then transformed in place using the warehouse's own compute.

ELT Data Flow

Source API → Extract (Fivetran / Airbyte) → Raw Tables → Transform (dbt) → Clean Tables → BI Tool

Raw data lands in a staging schema. dbt models build clean, business-logic-applied tables in a separate analytics schema. BI tools only query the clean layer.

ELT's advantages are significant. Raw data is preserved and reprocessable. Transformations are version-controlled SQL models, not black-box ETL scripts. The warehouse handles scale. When business definitions change — "what counts as a closed deal?" — you rerun the dbt model against the already-loaded raw data instead of re-extracting from the source.

The dominant ELT stack in 2026: Fivetran or Airbyte for extraction and loading, BigQuery or Snowflake or Redshift as the warehouse, dbt for transformation.

Reverse ETL

Reverse ETL is a pattern for pushing data from your warehouse back into operational tools. It runs the direction of ELT in reverse: clean, transformed warehouse data flows into Salesforce, HubSpot, Intercom, Marketo, or other SaaS applications your team works in daily.

Common reverse ETL use cases:

Enriching Salesforce account records with product usage data from your warehouse
Syncing customer health scores computed in your warehouse into your CS platform
Pushing segment membership (high-value customers, churn risk cohorts) into email marketing tools
Updating HubSpot contact properties with computed LTV or engagement scores

Census, Hightouch, and Polytomic are the main reverse ETL tools. They read from your warehouse and write to your operational tools on a schedule, without requiring custom API integrations for each destination.

Data Virtualization

Data virtualization queries data in place — at its source — rather than copying it to a central location. A virtualization layer sits between queries and source systems, translating SQL into API calls or native queries against each source and returning unified results.

Trino (formerly PrestoSQL), Dremio, and Denodo are the primary data virtualization platforms. They shine in specific scenarios: regulatory environments where data cannot be copied, ad-hoc analysis across sources that do not need continuous pipeline maintenance, or when source data changes too frequently for batch copies to stay current.

When not to use data virtualization:

Virtualization does not work well for high-volume analytical queries, BI dashboards that run on every page load, or complex multi-join transformations. Every query hits the source APIs. Response time is unpredictable. If your dashboard needs to load in under two seconds, virtualization is not the right pattern.

Data Mesh vs. Centralized Warehouse

Data mesh emerged as an architectural response to a specific scaling problem: when a central data engineering team becomes a bottleneck for every business unit that needs data products. Rather than centralizing all data in one warehouse managed by one team, data mesh distributes ownership — each business domain (sales, marketing, finance, product) owns and publishes its own data products.

The four data mesh principles are: domain ownership, data as a product, self-serve infrastructure, and federated governance. These are organizational design principles, not technical specifications. Data mesh does not tell you which tools to use — it tells you how to assign responsibility.

For most operators reading this guide, data mesh is not the right choice. The organizational complexity it solves only emerges at scale — multiple large teams, multiple data engineering squads, years of accumulated pipeline debt. A centralized warehouse with clear ownership is simpler, faster to set up, and easier to debug.

Approach	Best For	Not For	Organizational Prerequisite
Centralized Warehouse	Most companies under 200 people; unified analytics across sources	Orgs with many autonomous data teams	1–3 data engineers or an integrated platform
Data Mesh	Large orgs with multiple domain teams; data bottleneck problems	Early-stage companies or teams without domain data ownership	Mature data teams in each business domain
Data Lake	Unstructured data, ML training, event streams, long-term archival	Business metrics dashboards; analyst self-service	Data engineering team with ML or streaming requirements
Data Lakehouse	Combining analytics + ML workloads on one storage layer	Simple SaaS metric reporting	Engineering team comfortable with Delta Lake or Iceberg

Tool Comparison: Connecting Multiple Data Sources

The tooling landscape splits cleanly across the four integration layers: ingestion (extraction and loading), transformation, reverse ETL, and orchestration. Here is how the main options compare.

Ingestion Tools

Tool	Connectors	Pricing Model	Best For	Limitations
Fivetran	500+	Per MAR (rows synced)	Managed, low-maintenance pipelines; strong SLA guarantees	Expensive at high volume; limited custom transformation
Airbyte	300+ (open source)	Open source (self-host) or cloud credits	Custom connectors; cost-sensitive teams; community ecosystem	More ops overhead when self-hosted; cloud tier pricing can surprise
Stitch	130+	Per row / per source	Smaller stacks; quick setup; Singer-compatible sources	Fewer connectors; less flexible than Fivetran or Airbyte
Zapier / Make	5,000+ apps	Per task / per operation	Lightweight SaaS-to-SaaS automation; non-technical users	Not a data pipeline — breaks under volume; no warehouse loading
Custom scripts	Unlimited	Engineering time	Sources with no managed connector; highly custom extraction	Full maintenance burden; breaks silently when source APIs change

Transformation Tools

Tool	Approach	Best For	Limitations
dbt (data build tool)	SQL models + tests + docs in Git	Standard choice for ELT; version-controlled transformations; data testing	Requires SQL fluency; runs inside the warehouse (not a standalone compute layer)
Apache Spark	Distributed compute; Python/Scala/SQL	Large-scale transformations; ML feature engineering; streaming	Significant infrastructure overhead; steep learning curve
Pandas (Python)	In-memory DataFrame operations	Custom logic; one-off transformations; prototyping	Does not scale to large datasets; no version-control conventions
Warehouse SQL	Native SQL views and stored procedures	Simple transformations; teams without dbt	Hard to test, version, or document; logic embedded in the warehouse

Orchestration Tools

Orchestration tools schedule and monitor your pipelines — ensuring extraction runs before transformation, alerting when a step fails, and managing dependencies between jobs.

Tool	Best For	Trade-offs
Apache Airflow	Complex DAG-based pipelines; large engineering teams	Heavy infrastructure; Python-only DAG definitions; steep learning curve
Prefect	Python-first teams; simpler than Airflow; cloud or self-hosted	Smaller ecosystem; newer
dbt Cloud	Teams already on dbt; scheduling within the transformation layer	Does not orchestrate extraction — only the transformation phase
Fivetran built-in scheduling	Teams using Fivetran; no separate orchestration needed for ingestion	Limited to Fivetran-managed sources; no custom logic

Step-by-Step: How to Connect Multiple Data Sources

The following process applies whether you are connecting three sources or thirty. The steps are sequential — skipping or compressing the early ones is the primary reason integration projects fail six months after launch.

Step 1 — Define the Business Questions First

Do not start by connecting tools. Start by writing down the specific business questions the unified data needs to answer. This sounds obvious but most integration projects skip it — resulting in pipelines that pull all available data from every source without a clear picture of what the reporting layer actually needs.

Examples of well-defined starting questions:

What is our CAC by acquisition channel for the last four quarters?
Which customer segments have the highest 12-month LTV?
What is our gross margin by product line, net of returns and discounts?
How does pipeline coverage compare to quota across regions?

Each question implies a specific set of data sources, join conditions, and metrics. Documenting these up front gives you the scope for your integration — and a test for whether the finished pipeline is working correctly.

Step 2 — Inventory Your Sources and Identify the Joins

For each data source you plan to connect, document:

What data it contains that is relevant to your business questions
The API or export mechanism available (REST API, webhook, CSV export, database connection)
The update frequency — how often does data change?
The entity identifier — what key does this system use for customers, deals, transactions?
The join condition to other sources — is there a shared key (email, company domain, custom ID field), or will you need fuzzy matching?

The join condition question is the most important one. If your CRM and billing system use different customer identifiers with no shared key, your entity resolution problem needs solving before any pipeline is useful. Decisions made now about shared keys will save weeks of debugging later.

Step 3 — Choose Your Destination

For most analytical use cases, the destination is a cloud data warehouse. The three major options:

BigQuery (Google Cloud): Strong default for teams already in GCP. Serverless, per-query billing, excellent for infrequent large queries. dbt integrates natively.
Snowflake: Cloud-agnostic, strong performance for concurrent users, separate compute and storage scaling. Higher baseline cost than BigQuery for low-volume use cases.
Redshift (AWS): Good for teams already in AWS. Historically strong performance on large analytical queries. More infrastructure management than BigQuery or Snowflake.

For smaller stacks or operators without a dedicated data team, Postgres (via RDS or hosted providers like Neon or Supabase) handles moderate analytical workloads and has strong dbt support.

Step 4 — Set Up Ingestion

Once you have a destination warehouse, set up the ingestion connectors. For most sources, Fivetran or Airbyte will have a pre-built connector. The setup process is broadly the same across tools:

Authenticate with the source (API key, OAuth, service account)
Select the tables or objects to sync (do not sync everything — sync what your business questions require)
Configure sync frequency (hourly for transactional data; daily for ad spend data; weekly is rarely sufficient for operational metrics)
Map the destination schema (Fivetran and Airbyte create a raw staging schema automatically)
Run an initial historical sync — this can take hours for sources with years of history
Verify row counts against the source system before moving to transformation

On sync frequency:

Daily syncs are sufficient for most reporting use cases. Hourly syncs are necessary if you are tracking real-time pipeline or running same-day revenue reporting. Sub-hourly syncs require webhook-based ingestion rather than API polling, which is a more complex setup.

Step 5 — Define Your Canonical Schema

Before writing transformation code, define the target schema: every field that matters in your reporting layer, its canonical name, data type, unit convention, and which source is authoritative when the same concept appears in multiple systems.

Canonical Schema — Customer Entity Example

customer_id (canonical) → maps to:

Salesforce: AccountId (18-char)
Stripe: cus_xxx
HubSpot: hs_object_id
Authority: Salesforce AccountId for B2B; Stripe cus_id for transactional history
Resolution key: company email domain (normalized, lowercase, strip www)

This schema document becomes the contract for all downstream work. Store it in your git repository alongside your transformation code.

Step 6 — Build Transformations with dbt

With raw data in the warehouse and a canonical schema defined, write dbt models that transform raw source tables into clean, business-logic-applied tables. A standard dbt project structure:

Staging models (stg_*): One model per source table. Rename fields to canonical names, cast data types, apply currency conversions, filter soft-deleted records.
Intermediate models (int_*): Join staging models together, resolve entity relationships, apply business logic that spans multiple sources.
Mart models (mart_*): Wide, denormalized tables optimized for BI queries. One row per deal, per transaction, per cohort. All dimensions pre-joined.

Write dbt tests alongside each model: not-null tests on key fields, uniqueness tests on primary keys, referential integrity tests between tables, and accepted-values tests on enum fields. These tests run on every pipeline execution and fail loudly when source data changes unexpectedly.

Step 7 — Connect Your BI Layer and Validate

Connect your BI tool (Looker, Metabase, Tableau, Superset, or a purpose-built operating intelligence platform) to the mart layer in your warehouse. Build the specific reports that correspond to the business questions you defined in Step 1.

Validate against known-good numbers before declaring the pipeline production-ready:

Cross-check total revenue in the mart table against your billing system's revenue report for the same period
Verify customer counts match between your CRM and the unified entity table (net of entity resolution)
Confirm ad spend totals per channel match the native platform reports for a recent complete month
Check row-level records: spot-check five specific transactions end-to-end from source to mart

Do not skip this validation step. Pipelines that pass all automated tests can still produce wrong numbers if the transformation logic misunderstands a source field's semantics. Human validation against real business numbers is the final check.

Step 8 — Monitor Ongoing Pipeline Health

Production pipelines fail. Source APIs change. Authentication tokens expire. Webhooks stop firing. Volume anomalies indicate upstream problems. Set up monitoring for:

Sync failures: immediate alert when any connector fails to sync
Freshness: alert if the latest record timestamp in a table is more than N hours old
Volume anomalies: alert if daily row count from any source is more than 30% below 30-day average
dbt test failures: alert when any data quality test fails after a pipeline run
Schema drift: alert when source tables gain or lose columns unexpectedly

Fivetran and Airbyte both have built-in alerting for connector failures. dbt Cloud alerts on test failures. For custom volume and freshness monitoring, dbt's source freshness tests and tools like Elementary or Re:data provide automated data observability.

Common Integration Patterns by Use Case

Pattern 1: CRM + Billing for Revenue Analytics

Connect Salesforce or HubSpot (pipeline, deal stage, owner, segment) with Stripe or Recurly (actual revenue, MRR, churn events). The join key is typically customer email or a custom CRM ID field stored on the Stripe customer object. Output: revenue by deal source, sales rep, segment, and time period.

Pattern 2: Ad Platforms + CRM for Attribution

Connect Google Ads, Meta Ads, and LinkedIn Ads with your CRM. The join requires UTM parameters passed through from ad click to CRM lead creation. Output: cost per lead, cost per opportunity, cost per won deal by channel and campaign. This is the foundation for multi-touch attribution. The guide on multi-touch attribution implementation covers the reporting layer in detail.

Pattern 3: Product + CRM for Usage-Led Sales

Connect product analytics (Mixpanel, Amplitude, Segment, or custom events) with CRM records. The join key is user email or user ID mapped to a CRM contact ID. Output: feature usage by account, product engagement scores, expansion trigger signals. This data typically flows back into the CRM via Reverse ETL (Census or Hightouch) so sales and CS reps see usage data without leaving Salesforce.

Pattern 4: Finance + Billing for Margin Analytics

Connect QuickBooks or Xero (actual COGS, OpEx, headcount costs) with Stripe or Recurly (gross revenue, refunds, discounts). Output: gross margin by product, contribution margin by customer segment, operating efficiency ratios. This is the pattern that operating intelligence for SaaS companies depends on — you cannot compute real margin without both sides of the P&L connected.

The Most Common Integration Failures

Starting with the tool, not the data model

Teams sign up for Fivetran, connect everything available, and then try to figure out what to build. Without a defined canonical schema and clear business questions, you end up with dozens of raw tables and no path to clean, trustworthy metrics. Define the data model first.

Assuming shared keys exist

In most stacks, the CRM customer ID and the billing system customer ID are different. Connecting them requires either a field you deliberately populated to share keys across systems (set this up before your customer count grows past a few hundred) or email-domain matching as a fallback. Many pipelines silently fail to join customer records and produce inflated counts for months before anyone notices.

Treating pipeline setup as a one-time project

Source systems update their APIs. New fields appear. Old fields are renamed or removed. Authentication credentials expire. Volume patterns change as the business grows. A production data pipeline requires ongoing monitoring and maintenance — not just initial setup. Budget for it explicitly.

Embedding transformation logic in the BI tool

Putting business logic (revenue definitions, customer categorizations, deal stage mappings) inside Looker LookML, Tableau calculated fields, or Metabase custom expressions means the logic is invisible to the data engineering layer, untestable, and duplicated across reports. Move all transformation logic into dbt models. BI tools should query clean, already-transformed tables.

Not validating against source-of-truth numbers

Automated pipeline tests verify structural correctness — not semantic correctness. A pipeline can pass all tests and still misdefine revenue because it included refunded transactions, or double-counted multi-year deals recognized monthly. Manual cross-checks against source-system reports are non-negotiable before any pipeline is trusted in production.

Frequently asked

Questions about data & analytics

The most reliable approach depends on your use case. For centralizing data for analytics and reporting, an ELT pipeline using Fivetran or Airbyte into a cloud data warehouse (BigQuery, Snowflake, or Redshift) combined with dbt for transformations is the current standard. For syncing data back into operational tools, use Reverse ETL. For lightweight, no-code automations between SaaS apps, Zapier or Make work for simple use cases but break under volume.

ETL (Extract, Transform, Load) applies transformation logic before data lands in the warehouse — historically required when compute was expensive. ELT (Extract, Load, Transform) loads raw data first, then transforms it inside the warehouse using tools like dbt. ELT is now preferred because cloud warehouses are cheap and powerful, and keeping raw data enables reprocessing when transformation logic changes.

There is no hard ceiling. Fivetran supports 500+ connectors; Airbyte supports 300+. Practically, complexity grows with source count — more sources means more schema drift to monitor, more entity resolution challenges, and more transformation logic to maintain. Most operator data stacks start with 5–10 sources (CRM, billing, ad platforms, finance) and expand from there as reporting needs grow.

Data virtualization queries data in place — at its source — rather than copying it into a central warehouse. It eliminates data movement latency and avoids duplication. It is well suited for infrequent, ad-hoc queries across systems that already have reliable, low-latency APIs. It is not suited for high-volume analytical queries, complex transformations, or reporting dashboards that run continuously, because query performance degrades sharply when every query has to hit multiple source APIs.

Not always. Point-to-point integrations (like syncing Stripe data into Salesforce) do not need a warehouse. But if you want to run analytics that combine data from more than two systems — for example, calculating CAC by combining ad spend from Google, ad spend from Meta, and new customers from Stripe — a central warehouse is the most reliable architecture. Without it, cross-source analytics require brittle, hard-to-maintain custom queries.

A data mesh is a decentralized architecture where each business domain (marketing, sales, finance) owns and publishes its own data products rather than funneling everything into one central warehouse team. It solves the bottleneck of a single data engineering team serving all business units. Data mesh is most relevant for companies with 200+ people and multiple large data-producing teams. For most operators under that size, a centralized warehouse with a small data team is simpler, faster to implement, and easier to maintain.

Build schema change detection into your ingestion pipeline. Fivetran and Airbyte both handle schema evolution — adding new columns automatically and alerting on breaking changes. In your transformation layer (dbt), write tests that fail when expected fields disappear or change type. Set up alerting on pipeline failures and volume anomalies. Document your canonical schema in version control so any change is traceable.

Author

Siddharth Gangal

Founder, Fairview

Siddharth writes on operating intelligence, revenue operations, and the unbundling of business intelligence. Before Fairview, built revenue ops infrastructure across B2B SaaS and DTC.

View profile LinkedIn · Editorially reviewed by Jordan Cole

More from this cluster

Data Infrastructure hub

Defined in the glossary

Use cases & product

See data infrastructure in your data — book a 20-min demo

Book a demo Book a demo

Editorial standards

Sources & further reading

Fairview cites primary sources only. The references below underpin the benchmarks and frameworks discussed in our Data Infrastructure coverage. See our editorial standards.

1 State of Analytics Engineering 2025 — dbt Labs, 2025. View source .
2 Modern Data Stack Annual Report — a16z / Future, 2024. View source .
3 Snowflake Data Cloud Report — Snowflake, 2025. View source .

Fairview cites primary sources only — government data, academic research, industry benchmarks from named publishers, and official vendor documentation. See our editorial standards.