TL;DR
- Dimensional modeling structures warehouses for analytics: It separates measurable events (fact tables) from descriptive context (dimension tables), making business questions fast to ask and fast to answer.
- The star schema is the standard default: A central fact table surrounded by dimension tables — fewer joins, simpler SQL, and excellent performance on columnar cloud warehouses.
- Grain is the most important design decision: Declaring what one row in a fact table represents before you build it prevents a category of structural errors that are expensive to fix later.
- Slowly changing dimensions (SCDs) require a deliberate strategy: Type 2 SCDs preserve full history and are the correct default for any attribute that changes and needs to be tracked over time.
- Modern teams implement Kimball with dbt: dbt has become the standard transformation layer for building star schemas on BigQuery, Snowflake, and Redshift — adding testing, documentation, and version control to the modeling process.
Ralph Kimball published The Data Warehouse Toolkit in 1996. Thirty years later, his core ideas — fact tables, dimension tables, the star schema, conformed dimensions — are still the primary vocabulary of analytical data modeling. They survived the transition from on-premise Oracle to cloud-based BigQuery and Snowflake. They survived the rise of dbt, the lakehouse, and the modern data stack.
That durability is not nostalgia. Dimensional modeling solves a structural problem that every team building analytics infrastructure eventually runs into: how do you organize data so that the people who need to answer business questions can do so without needing a data engineer to rewrite the query every time? The answer Kimball gave — separate what happened from the context around what happened — is still the right answer.
This guide covers the fundamentals in enough depth to be useful whether you are a data analyst encountering the terminology for the first time, a RevOps analyst trying to understand why your data warehouse is structured the way it is, or a data engineer who needs a solid reference on slowly changing dimensions and schema design tradeoffs.
What Dimensional Modeling Is — and What It Solves
Operational databases — the kind that power your CRM, billing system, and product backend — are built for transactional efficiency. They are normalized, meaning data is broken into many small tables to eliminate redundancy and support fast reads and writes of individual records. This normalization is excellent for recording events. It is terrible for analyzing them. A query asking "what was revenue by product category and sales region last quarter?" might require joining eight or twelve tables in a normalized OLTP schema, producing SQL that is slow to write, hard to understand, and unpredictable in performance.
Dimensional modeling addresses this by deliberately denormalizing data into a structure optimized for analytical reads. The central insight is that any analytical question can be decomposed into two parts: a measurement (revenue, quantity, duration, cost) and the context around that measurement (which customer, which product, which date, which geography). Fact tables hold the measurements. Dimension tables hold the context. The join between them is always a single foreign key, which makes SQL simple and query planners efficient.
Core Structure
Analytical Query = Measure (fact table) + Context (dimension tables)
The Kimball methodology — also called bottom-up dimensional modeling — builds the warehouse one business process at a time, starting with the subject area that delivers the most immediate analytical value. This contrasts with the Inmon methodology, which builds a normalized enterprise data warehouse first and then derives analytical structures from it. In practice, Kimball's approach dominates for most analytics teams because it delivers usable results faster and maps more naturally to the way business users ask questions.
Fact Tables: What Happened
A fact table records individual business events at a specific level of detail. Each row is one occurrence of the event the table is designed to capture. Common examples include an order line, a payment, a customer support ticket, a product page view, or a sales activity log entry.
Declaring the Grain
The grain is the most important design decision you make when building a fact table. It is the precise, unambiguous statement of what one row represents. Before writing a single column definition, you must answer: what business event does each row record, and at what level of detail?
A common mistake is declaring grain imprecisely. "One row per order" is not a grain declaration — it is vague. "One row per order line" is a grain declaration. "One row per daily aggregate of product sales" is a grain declaration. "One row per customer per month" is a grain declaration. The grain determines which foreign keys belong in the table, which measures are meaningful, and which queries will return correct results.
Mixing grains in a single fact table — placing order-level rows alongside order-line-level rows, for example — produces a table that is almost impossible to query reliably without adding filter conditions to every query that touches it. Declare grain first. Build the table to match it.
A Worked Example: The Sales Fact Table
Consider a B2B SaaS company tracking deal closes. The grain is: one row per closed-won opportunity. The fact table looks like this:
| Column | Type | Role |
|---|---|---|
| opportunity_key | INTEGER | Surrogate primary key |
| date_key | INTEGER | FK → dim_date |
| customer_key | INTEGER | FK → dim_customer |
| product_key | INTEGER | FK → dim_product |
| sales_rep_key | INTEGER | FK → dim_sales_rep |
| region_key | INTEGER | FK → dim_region |
| arr_amount | DECIMAL(12,2) | Measure: annual contract value |
| deal_cycle_days | INTEGER | Measure: days from created to close |
| discount_pct | DECIMAL(5,2) | Measure: discount applied |
| is_expansion | BOOLEAN | Flag: new logo vs. expansion |
The fact table is narrow by design. It contains foreign keys that point to each dimension (date, customer, product, sales rep, region) and the measures that belong at the opportunity grain (ARR, cycle days, discount). Descriptive attributes — the customer's industry, the product's tier name, the sales rep's hire date — live in their respective dimension tables, not here.
Types of Facts
Not all measures behave the same way when aggregated. Understanding fact types prevents common analytical errors:
- Additive facts can be summed across all dimensions — revenue, quantity, cost. You can sum revenue by date, by region, or by both, and the result is always meaningful.
- Semi-additive facts can be summed across some dimensions but not all. Account balance is a classic example: summing balance across accounts for a given date is meaningful; summing balance across dates for a single account is not.
- Non-additive facts cannot be meaningfully summed across any dimension — ratios, percentages, and unit prices. A discount percentage should never be summed; it should be averaged or recalculated from its components.
Dimension Tables: The Context Layer
Dimension tables are wide and descriptive. A typical dimension table has one row per entity (one row per customer, one row per product, one row per date) and many columns describing attributes of that entity. Dimension tables are what make fact tables queryable in human terms.
A dim_customer table for the sales example above might include: customer_key, customer_id (the source system natural key), company_name, industry, company_size_band, customer_tier, contract_start_date, csm_name, and geographic_region. A BI tool connecting to this schema lets a user drag "industry" onto a report and immediately see revenue broken down by industry — without knowing anything about how the underlying data is joined.
Surrogate Keys vs. Natural Keys
Dimension tables use surrogate keys — integer keys generated by the warehouse, with no business meaning — as their primary key, even though the source system has its own identifier (a natural key like a CRM account ID or a product SKU). The surrogate key serves two purposes: it insulates the warehouse from changes in source system IDs, and it is required to implement Type 2 slowly changing dimensions correctly, because a single entity may have multiple rows under different keys representing different historical states.
Conformed Dimensions
A conformed dimension is a dimension table shared across multiple fact tables in the warehouse. The date dimension is the canonical example — every fact table that records an event in time should reference the same dim_date table, which means every fact can be sliced by fiscal week, calendar month, or any other date attribute consistently. A customer dimension shared between a sales fact table and a support ticket fact table enables cross-domain questions like "how does deal size correlate with support volume in the first 90 days?"
Conformed dimensions are what allow a collection of separate star schemas — one per business process — to function as an integrated warehouse rather than a set of isolated silos.
Junk Dimensions
Fact tables often have low-cardinality flags and indicators — boolean fields, small lookup codes, status values — that do not belong in any existing dimension table but would clutter the fact table if left there. A junk dimension collects these miscellaneous attributes into a single dimension table. If an order fact table has four boolean flags (is_promotional, is_bundled, is_first_order, requires_approval), those four fields produce 16 possible combinations. A junk dimension holds all 16 combinations as rows and contributes a single surrogate key to the fact table, keeping the fact table clean and queryable without creating four separate one-bit dimension tables.
Slowly Changing Dimensions
Dimension attributes are not static. A customer changes their billing tier. A sales rep moves from one territory to another. A product is recategorized into a different business line. How you handle these changes determines whether you can accurately reconstruct history or whether historical analysis silently uses incorrect attribute values.
The three primary SCD types each make a different tradeoff between simplicity and historical fidelity.
SCD Type 1: Overwrite
Type 1 simply updates the dimension row in place. The old value is overwritten and no record of the change is kept. Every historical fact row that joined to this dimension now reflects the new value, retroactively.
| customer_key | company_name | tier (before) | tier (after Type 1) |
|---|---|---|---|
| 1001 | Acme Corp | Growth | Scale |
Use Type 1 only when the old value is genuinely wrong (a data correction) or when the business explicitly does not need to report on historical values of that attribute.
SCD Type 2: Add a New Row
Type 2 is the correct default for most analytical attributes. When a dimension attribute changes, a new row is inserted for that entity with the new attribute values. The old row is marked with an end date (or a boolean is_current = false flag). The surrogate key for the new row is different from the old row's surrogate key, so fact rows that occurred before the change point to the old dimension key — preserving the historical value at the time of the event.
| customer_key | customer_id | tier | effective_start | effective_end | is_current |
|---|---|---|---|---|---|
| 1001 | cust-42 | Growth | 2025-01-01 | 2026-03-14 | false |
| 1087 | cust-42 | Scale | 2026-03-15 | 9999-12-31 | true |
The same customer (cust-42) now has two dimension rows. Deals closed before March 15, 2026 join to key 1001 (tier = Growth). Deals closed after that date join to key 1087 (tier = Scale). Historical reports remain accurate.
SCD Type 3: Add a New Column
Type 3 retains the previous value by adding a column alongside the current value — current_territory and previous_territory on the same row. This approach is easy to implement and query but can only preserve one prior state. If the rep changes territory three times, Type 3 records only the most recent previous value. Use Type 3 when the business needs to compare "before and after a single planned change" — a territory realignment, a product line restructure — not as a general history mechanism.
Star Schema vs. Snowflake Schema vs. Data Vault
These three patterns represent fundamentally different philosophies about how a data warehouse should be organized. Understanding the tradeoffs is necessary to choose the right approach — or to understand why your existing warehouse is structured the way it is.
| Dimension | Star Schema | Snowflake Schema | Data Vault |
|---|---|---|---|
| Structure | Denormalized: fact + flat dimension tables | Normalized: dimensions split into sub-tables | Hubs, Links, Satellites |
| Query complexity | Low — one join per dimension | Medium — multiple joins per dimension path | High — requires mart layer for BI |
| Storage efficiency | Lower (repeated strings) | Higher (normalized) | Lowest (very denormalized via Satellites) |
| Historical tracking | Via SCD Type 2 in dimensions | Via SCD Type 2 in sub-dimensions | Built in via Satellite versioning |
| Schema flexibility | Low — adding attributes requires migration | Medium | High — add Satellites without breaking existing models |
| Auditability | Limited | Limited | Very strong — source system tracked on every record |
| Build speed | Fast | Medium | Slow — significant upfront design investment |
| Best for | Most analytics teams; BI and reporting | Specific normalization needs on cloud warehouses | Enterprise DWH with many sources and compliance needs |
The Star Schema in Practice
The star schema's central fact table has direct foreign key relationships to each dimension. In the sales example: fct_opportunity links directly to dim_date, dim_customer, dim_product, dim_sales_rep, and dim_region. Each of those is a flat, wide table. A query that asks "total ARR by product tier and sales region in Q1 2026" requires exactly two joins — to dim_product and dim_region. On a columnar cloud warehouse, that query is fast and the SQL is readable.
Why Snowflake Schema Rarely Wins on Cloud Warehouses
The snowflake schema normalizes dimension tables by splitting sub-attributes into separate tables. dim_customer would no longer hold industry directly — instead it would hold an industry_key pointing to a dim_industry table. The original justification was storage: eliminating the repeated string "Financial Services" across thousands of customer rows saved disk space on spinning hard drives. On a modern columnar warehouse with column-level compression, that string is stored efficiently regardless. The storage argument is effectively gone. What remains are the extra joins — which slow down queries and make SQL harder to read. Unless you have a specific architectural reason that requires normalization (a sub-dimension that changes independently and is very large), use the star schema.
Data Vault: When Complexity Is Justified
Data Vault organizes data around three object types. Hubs store unique business keys — the customer ID, the product SKU, the order number — with no descriptive attributes. Links capture relationships between Hubs — which customer placed which order on which date. Satellites hold the descriptive attributes of either a Hub or a Link, along with load date and source system metadata, enabling full historical tracking of every attribute change from every source.
This architecture is exceptionally good at integrating many source systems without conflict, tracking the full lineage of every value, and handling schema changes without breaking existing models. It is genuinely poor for direct BI querying — most teams who use Data Vault build a Kimball-style mart layer on top of it for consumption. The design investment is substantial. Data Vault makes sense for enterprise warehouses with five or more heterogeneous source systems, regulatory environments that require data lineage to the record level, or organizations that have had painful experience with fact tables that broke when a source system changed its schema.
Dimensional Modeling in the Modern Data Stack
The tools have changed substantially since 1996, but Kimball's patterns translate directly to the modern stack.
How dbt Implements the Star Schema
dbt (data build tool) has become the standard transformation layer for implementing dimensional models on cloud warehouses. Most dbt projects follow a three-layer structure:
- Staging layer: One model per source table. Clean and rename raw source columns, apply consistent data types, deduplication. No business logic yet.
- Intermediate layer: Business logic transformations — resolving business keys, building customer-level aggregations, standardizing event taxonomy across sources.
- Mart layer: The final dimensional models.
dim_customer,dim_date,fct_opportunity,fct_support_ticket. These are the tables BI tools and analysts connect to directly.
dbt adds tests that enforce dimensional model integrity: not-null and unique constraints on surrogate keys, referential integrity tests ensuring every foreign key in a fact table has a corresponding row in the referenced dimension, and custom schema tests that validate business rules (discount percentage must be between 0 and 1, deal cycle days must be positive). These tests catch modeling errors before they reach BI consumers — a guarantee that was expensive to implement in pre-dbt SQL workflows.
Columnar Warehouses Change the Performance Picture
On traditional row-based databases, joins between large tables were expensive. Data modeling choices were heavily influenced by trying to minimize joins. BigQuery, Snowflake, and Redshift are columnar warehouses — they store data by column rather than by row, which means they only read the columns referenced in a query and handle joins between large dimension tables efficiently using distributed hash joins. This changes several modeling decisions:
- Wide dimension tables with many attributes are cheap — reading 60 columns costs the same per-column as reading 6, and you only pay for the columns you reference.
- The storage argument for snowflake normalization is effectively nullified by column-level compression on repeated string values.
- Some teams building very high-volume analytical workloads on BigQuery use denormalized nested structures (STRUCT and ARRAY types) instead of traditional joins, trading some modeling purity for query performance on billion-row tables.
The Four-Step Kimball Design Process
For any new fact table, work through these four steps in order before writing any SQL:
- Identify the business process. What event or measurement are you capturing? Opportunity closes, product page views, support tickets created, invoices paid.
- Declare the grain. One row represents exactly what? Do not proceed until this is unambiguous.
- Identify the dimensions. What context attributes belong at this grain? These become your foreign keys to dimension tables.
- Identify the facts (measures). What numeric values are recorded for each event? Classify each as additive, semi-additive, or non-additive.
Teams that skip step two — grain declaration — almost always build fact tables they have to rebuild six months later. The discipline of writing down "one row per order line" before touching a keyboard is what separates analytical models that scale from ones that accumulate technical debt.