Data & Analytics 14 min read

Dimensional Modeling Explained: Star Schema, Facts, Dimensions & SCDs

Dimensional modeling explained: Kimball methodology, star vs. snowflake vs. data vault, fact and dimension tables, slowly changing dimensions, and modern dbt implementation.

Siddharth Gangal

TL;DR

  • Dimensional modeling structures warehouses for analytics: It separates measurable events (fact tables) from descriptive context (dimension tables), making business questions fast to ask and fast to answer.
  • The star schema is the standard default: A central fact table surrounded by dimension tables — fewer joins, simpler SQL, and excellent performance on columnar cloud warehouses.
  • Grain is the most important design decision: Declaring what one row in a fact table represents before you build it prevents a category of structural errors that are expensive to fix later.
  • Slowly changing dimensions (SCDs) require a deliberate strategy: Type 2 SCDs preserve full history and are the correct default for any attribute that changes and needs to be tracked over time.
  • Modern teams implement Kimball with dbt: dbt has become the standard transformation layer for building star schemas on BigQuery, Snowflake, and Redshift — adding testing, documentation, and version control to the modeling process.

Ralph Kimball published The Data Warehouse Toolkit in 1996. Thirty years later, his core ideas — fact tables, dimension tables, the star schema, conformed dimensions — are still the primary vocabulary of analytical data modeling. They survived the transition from on-premise Oracle to cloud-based BigQuery and Snowflake. They survived the rise of dbt, the lakehouse, and the modern data stack.

That durability is not nostalgia. Dimensional modeling solves a structural problem that every team building analytics infrastructure eventually runs into: how do you organize data so that the people who need to answer business questions can do so without needing a data engineer to rewrite the query every time? The answer Kimball gave — separate what happened from the context around what happened — is still the right answer.

This guide covers the fundamentals in enough depth to be useful whether you are a data analyst encountering the terminology for the first time, a RevOps analyst trying to understand why your data warehouse is structured the way it is, or a data engineer who needs a solid reference on slowly changing dimensions and schema design tradeoffs.

What Dimensional Modeling Is — and What It Solves

Operational databases — the kind that power your CRM, billing system, and product backend — are built for transactional efficiency. They are normalized, meaning data is broken into many small tables to eliminate redundancy and support fast reads and writes of individual records. This normalization is excellent for recording events. It is terrible for analyzing them. A query asking "what was revenue by product category and sales region last quarter?" might require joining eight or twelve tables in a normalized OLTP schema, producing SQL that is slow to write, hard to understand, and unpredictable in performance.

Dimensional modeling addresses this by deliberately denormalizing data into a structure optimized for analytical reads. The central insight is that any analytical question can be decomposed into two parts: a measurement (revenue, quantity, duration, cost) and the context around that measurement (which customer, which product, which date, which geography). Fact tables hold the measurements. Dimension tables hold the context. The join between them is always a single foreign key, which makes SQL simple and query planners efficient.

Core Structure

Analytical Query = Measure (fact table) + Context (dimension tables)

The Kimball methodology — also called bottom-up dimensional modeling — builds the warehouse one business process at a time, starting with the subject area that delivers the most immediate analytical value. This contrasts with the Inmon methodology, which builds a normalized enterprise data warehouse first and then derives analytical structures from it. In practice, Kimball's approach dominates for most analytics teams because it delivers usable results faster and maps more naturally to the way business users ask questions.

Fact Tables: What Happened

A fact table records individual business events at a specific level of detail. Each row is one occurrence of the event the table is designed to capture. Common examples include an order line, a payment, a customer support ticket, a product page view, or a sales activity log entry.

Declaring the Grain

The grain is the most important design decision you make when building a fact table. It is the precise, unambiguous statement of what one row represents. Before writing a single column definition, you must answer: what business event does each row record, and at what level of detail?

A common mistake is declaring grain imprecisely. "One row per order" is not a grain declaration — it is vague. "One row per order line" is a grain declaration. "One row per daily aggregate of product sales" is a grain declaration. "One row per customer per month" is a grain declaration. The grain determines which foreign keys belong in the table, which measures are meaningful, and which queries will return correct results.

Mixing grains in a single fact table — placing order-level rows alongside order-line-level rows, for example — produces a table that is almost impossible to query reliably without adding filter conditions to every query that touches it. Declare grain first. Build the table to match it.

A Worked Example: The Sales Fact Table

Consider a B2B SaaS company tracking deal closes. The grain is: one row per closed-won opportunity. The fact table looks like this:

Column Type Role
opportunity_key INTEGER Surrogate primary key
date_key INTEGER FK → dim_date
customer_key INTEGER FK → dim_customer
product_key INTEGER FK → dim_product
sales_rep_key INTEGER FK → dim_sales_rep
region_key INTEGER FK → dim_region
arr_amount DECIMAL(12,2) Measure: annual contract value
deal_cycle_days INTEGER Measure: days from created to close
discount_pct DECIMAL(5,2) Measure: discount applied
is_expansion BOOLEAN Flag: new logo vs. expansion

The fact table is narrow by design. It contains foreign keys that point to each dimension (date, customer, product, sales rep, region) and the measures that belong at the opportunity grain (ARR, cycle days, discount). Descriptive attributes — the customer's industry, the product's tier name, the sales rep's hire date — live in their respective dimension tables, not here.

Types of Facts

Not all measures behave the same way when aggregated. Understanding fact types prevents common analytical errors:

  • Additive facts can be summed across all dimensions — revenue, quantity, cost. You can sum revenue by date, by region, or by both, and the result is always meaningful.
  • Semi-additive facts can be summed across some dimensions but not all. Account balance is a classic example: summing balance across accounts for a given date is meaningful; summing balance across dates for a single account is not.
  • Non-additive facts cannot be meaningfully summed across any dimension — ratios, percentages, and unit prices. A discount percentage should never be summed; it should be averaged or recalculated from its components.

Dimension Tables: The Context Layer

Dimension tables are wide and descriptive. A typical dimension table has one row per entity (one row per customer, one row per product, one row per date) and many columns describing attributes of that entity. Dimension tables are what make fact tables queryable in human terms.

A dim_customer table for the sales example above might include: customer_key, customer_id (the source system natural key), company_name, industry, company_size_band, customer_tier, contract_start_date, csm_name, and geographic_region. A BI tool connecting to this schema lets a user drag "industry" onto a report and immediately see revenue broken down by industry — without knowing anything about how the underlying data is joined.

Surrogate Keys vs. Natural Keys

Dimension tables use surrogate keys — integer keys generated by the warehouse, with no business meaning — as their primary key, even though the source system has its own identifier (a natural key like a CRM account ID or a product SKU). The surrogate key serves two purposes: it insulates the warehouse from changes in source system IDs, and it is required to implement Type 2 slowly changing dimensions correctly, because a single entity may have multiple rows under different keys representing different historical states.

Conformed Dimensions

A conformed dimension is a dimension table shared across multiple fact tables in the warehouse. The date dimension is the canonical example — every fact table that records an event in time should reference the same dim_date table, which means every fact can be sliced by fiscal week, calendar month, or any other date attribute consistently. A customer dimension shared between a sales fact table and a support ticket fact table enables cross-domain questions like "how does deal size correlate with support volume in the first 90 days?"

Conformed dimensions are what allow a collection of separate star schemas — one per business process — to function as an integrated warehouse rather than a set of isolated silos.

Junk Dimensions

Fact tables often have low-cardinality flags and indicators — boolean fields, small lookup codes, status values — that do not belong in any existing dimension table but would clutter the fact table if left there. A junk dimension collects these miscellaneous attributes into a single dimension table. If an order fact table has four boolean flags (is_promotional, is_bundled, is_first_order, requires_approval), those four fields produce 16 possible combinations. A junk dimension holds all 16 combinations as rows and contributes a single surrogate key to the fact table, keeping the fact table clean and queryable without creating four separate one-bit dimension tables.

Slowly Changing Dimensions

Dimension attributes are not static. A customer changes their billing tier. A sales rep moves from one territory to another. A product is recategorized into a different business line. How you handle these changes determines whether you can accurately reconstruct history or whether historical analysis silently uses incorrect attribute values.

The three primary SCD types each make a different tradeoff between simplicity and historical fidelity.

SCD Type 1: Overwrite

Type 1 simply updates the dimension row in place. The old value is overwritten and no record of the change is kept. Every historical fact row that joined to this dimension now reflects the new value, retroactively.

customer_key company_name tier (before) tier (after Type 1)
1001 Acme Corp Growth Scale

Use Type 1 only when the old value is genuinely wrong (a data correction) or when the business explicitly does not need to report on historical values of that attribute.

SCD Type 2: Add a New Row

Type 2 is the correct default for most analytical attributes. When a dimension attribute changes, a new row is inserted for that entity with the new attribute values. The old row is marked with an end date (or a boolean is_current = false flag). The surrogate key for the new row is different from the old row's surrogate key, so fact rows that occurred before the change point to the old dimension key — preserving the historical value at the time of the event.

customer_key customer_id tier effective_start effective_end is_current
1001 cust-42 Growth 2025-01-01 2026-03-14 false
1087 cust-42 Scale 2026-03-15 9999-12-31 true

The same customer (cust-42) now has two dimension rows. Deals closed before March 15, 2026 join to key 1001 (tier = Growth). Deals closed after that date join to key 1087 (tier = Scale). Historical reports remain accurate.

SCD Type 3: Add a New Column

Type 3 retains the previous value by adding a column alongside the current value — current_territory and previous_territory on the same row. This approach is easy to implement and query but can only preserve one prior state. If the rep changes territory three times, Type 3 records only the most recent previous value. Use Type 3 when the business needs to compare "before and after a single planned change" — a territory realignment, a product line restructure — not as a general history mechanism.

Star Schema vs. Snowflake Schema vs. Data Vault

These three patterns represent fundamentally different philosophies about how a data warehouse should be organized. Understanding the tradeoffs is necessary to choose the right approach — or to understand why your existing warehouse is structured the way it is.

Dimension Star Schema Snowflake Schema Data Vault
Structure Denormalized: fact + flat dimension tables Normalized: dimensions split into sub-tables Hubs, Links, Satellites
Query complexity Low — one join per dimension Medium — multiple joins per dimension path High — requires mart layer for BI
Storage efficiency Lower (repeated strings) Higher (normalized) Lowest (very denormalized via Satellites)
Historical tracking Via SCD Type 2 in dimensions Via SCD Type 2 in sub-dimensions Built in via Satellite versioning
Schema flexibility Low — adding attributes requires migration Medium High — add Satellites without breaking existing models
Auditability Limited Limited Very strong — source system tracked on every record
Build speed Fast Medium Slow — significant upfront design investment
Best for Most analytics teams; BI and reporting Specific normalization needs on cloud warehouses Enterprise DWH with many sources and compliance needs

The Star Schema in Practice

The star schema's central fact table has direct foreign key relationships to each dimension. In the sales example: fct_opportunity links directly to dim_date, dim_customer, dim_product, dim_sales_rep, and dim_region. Each of those is a flat, wide table. A query that asks "total ARR by product tier and sales region in Q1 2026" requires exactly two joins — to dim_product and dim_region. On a columnar cloud warehouse, that query is fast and the SQL is readable.

Why Snowflake Schema Rarely Wins on Cloud Warehouses

The snowflake schema normalizes dimension tables by splitting sub-attributes into separate tables. dim_customer would no longer hold industry directly — instead it would hold an industry_key pointing to a dim_industry table. The original justification was storage: eliminating the repeated string "Financial Services" across thousands of customer rows saved disk space on spinning hard drives. On a modern columnar warehouse with column-level compression, that string is stored efficiently regardless. The storage argument is effectively gone. What remains are the extra joins — which slow down queries and make SQL harder to read. Unless you have a specific architectural reason that requires normalization (a sub-dimension that changes independently and is very large), use the star schema.

Data Vault: When Complexity Is Justified

Data Vault organizes data around three object types. Hubs store unique business keys — the customer ID, the product SKU, the order number — with no descriptive attributes. Links capture relationships between Hubs — which customer placed which order on which date. Satellites hold the descriptive attributes of either a Hub or a Link, along with load date and source system metadata, enabling full historical tracking of every attribute change from every source.

This architecture is exceptionally good at integrating many source systems without conflict, tracking the full lineage of every value, and handling schema changes without breaking existing models. It is genuinely poor for direct BI querying — most teams who use Data Vault build a Kimball-style mart layer on top of it for consumption. The design investment is substantial. Data Vault makes sense for enterprise warehouses with five or more heterogeneous source systems, regulatory environments that require data lineage to the record level, or organizations that have had painful experience with fact tables that broke when a source system changed its schema.

Dimensional Modeling in the Modern Data Stack

The tools have changed substantially since 1996, but Kimball's patterns translate directly to the modern stack.

How dbt Implements the Star Schema

dbt (data build tool) has become the standard transformation layer for implementing dimensional models on cloud warehouses. Most dbt projects follow a three-layer structure:

  • Staging layer: One model per source table. Clean and rename raw source columns, apply consistent data types, deduplication. No business logic yet.
  • Intermediate layer: Business logic transformations — resolving business keys, building customer-level aggregations, standardizing event taxonomy across sources.
  • Mart layer: The final dimensional models. dim_customer, dim_date, fct_opportunity, fct_support_ticket. These are the tables BI tools and analysts connect to directly.

dbt adds tests that enforce dimensional model integrity: not-null and unique constraints on surrogate keys, referential integrity tests ensuring every foreign key in a fact table has a corresponding row in the referenced dimension, and custom schema tests that validate business rules (discount percentage must be between 0 and 1, deal cycle days must be positive). These tests catch modeling errors before they reach BI consumers — a guarantee that was expensive to implement in pre-dbt SQL workflows.

Columnar Warehouses Change the Performance Picture

On traditional row-based databases, joins between large tables were expensive. Data modeling choices were heavily influenced by trying to minimize joins. BigQuery, Snowflake, and Redshift are columnar warehouses — they store data by column rather than by row, which means they only read the columns referenced in a query and handle joins between large dimension tables efficiently using distributed hash joins. This changes several modeling decisions:

  • Wide dimension tables with many attributes are cheap — reading 60 columns costs the same per-column as reading 6, and you only pay for the columns you reference.
  • The storage argument for snowflake normalization is effectively nullified by column-level compression on repeated string values.
  • Some teams building very high-volume analytical workloads on BigQuery use denormalized nested structures (STRUCT and ARRAY types) instead of traditional joins, trading some modeling purity for query performance on billion-row tables.

The Four-Step Kimball Design Process

For any new fact table, work through these four steps in order before writing any SQL:

  1. Identify the business process. What event or measurement are you capturing? Opportunity closes, product page views, support tickets created, invoices paid.
  2. Declare the grain. One row represents exactly what? Do not proceed until this is unambiguous.
  3. Identify the dimensions. What context attributes belong at this grain? These become your foreign keys to dimension tables.
  4. Identify the facts (measures). What numeric values are recorded for each event? Classify each as additive, semi-additive, or non-additive.

Teams that skip step two — grain declaration — almost always build fact tables they have to rebuild six months later. The discipline of writing down "one row per order line" before touching a keyboard is what separates analytical models that scale from ones that accumulate technical debt.

Frequently asked questions

What is dimensional modeling in simple terms?

Dimensional modeling is a technique for structuring data warehouse tables so that business questions are easy to ask and fast to answer. It organizes data into two categories: fact tables, which store measurable business events (a sale, a payment, a login), and dimension tables, which store the descriptive context around those events (which customer, which product, which date). The most common implementation is a star schema, where a central fact table links out to several surrounding dimension tables.

What is the difference between a fact table and a dimension table?

A fact table records individual business events and contains numeric, measurable values — revenue, quantity, duration, cost. Each row represents one event at a specific grain (for example, one order line). A dimension table provides the descriptive attributes that give a fact row its context — the customer's name and segment, the product's category and SKU, the date's week, month, and fiscal quarter. Fact tables are typically narrow and very tall (millions of rows); dimension tables are wider and shorter (thousands to hundreds of thousands of rows).

What is a slowly changing dimension (SCD) and which type should I use?

A slowly changing dimension is a dimension whose attribute values change over time — a customer moves pricing tiers, a sales rep switches territories, a product changes its category. Type 1 overwrites the old value (no history). Type 2 adds a new row for each change with effective date columns, preserving full history. Type 3 adds a "previous value" column alongside the current value (one prior state only). For most analytical use cases Type 2 is the correct default — it preserves the ability to reconstruct what was true at the time of any historical event without ambiguity.

When should I choose a star schema over a snowflake schema?

In a modern columnar cloud data warehouse (BigQuery, Snowflake, Redshift), the star schema is the right default in almost every situation. The snowflake schema's advantage — saving storage by normalizing repeated dimension strings into sub-tables — is largely irrelevant because columnar warehouses compress repeated values efficiently. The added joins in a snowflake schema increase SQL complexity and can hurt query performance with no compensating benefit. Reserve snowflake normalization for cases where a large dimension sub-attribute changes independently and updating it in a denormalized dimension table would be expensive or error-prone.

What is grain in dimensional modeling, and why does it matter?

The grain of a fact table is the precise definition of what one row represents. Declaring grain before building the table is the single most important decision in dimensional modeling. A poorly defined grain produces a fact table that mixes events at different levels of detail — which makes aggregate queries unreliable and joins ambiguous. A clearly declared grain (for example, "one row per order line") ensures that every column belongs at that level and that every consumer knows exactly what they are summing or averaging.

How does dbt fit into dimensional modeling?

dbt has become the standard tool for implementing dimensional models on cloud warehouses. It handles the SQL transformation layer that produces staging models, dimension tables, and fact tables from raw source data. dbt adds version control, automated testing (null checks on surrogate keys, referential integrity between fact and dimension tables, unexpected values in flag columns), and documentation. Most teams structure their dbt projects with a staging layer (raw source cleaning), an intermediate layer (business logic), and a mart layer (the final star schema models consumed by BI tools and analysts).

What is Data Vault and when does it make sense over a Kimball star schema?

Data Vault is a highly normalized modeling methodology built around three table types: Hubs (unique business keys), Links (relationships between Hubs), and Satellites (descriptive attributes with full change history). It is designed for enterprise data warehouses with many source systems, strict auditability requirements, and frequent schema changes. A Kimball star schema is faster to build, easier to query, and the right default for most analytics teams. Data Vault makes sense when you have five or more heterogeneous source systems, strong regulatory compliance requirements, or a team large enough that the structural overhead is justified by the long-term flexibility gains.