---
title: "Denormalization 101"
date: 2026-06-23
tags: post
---

I have had exactly two conversations this week about denormalization, which is a sign that it is time to write a blog post. If you already know what denormalization is, you will likely find this blog post uninteresting. That is okay! It is meant more to be an artifact that I can point to when people who haven't been burnt by the fires of sad databases (or, perhaps, sad ETL pipelines) ask what I'm talking about.

## The well-behaved schema

Imagine the data behind an art gallery. Galleries run *calls* — open invitations for submissions — and artists submit *entries*. The application that captures all this stores it the way Codd would want: tidy, normalized, every fact in exactly one place.

```erd
Gallery
  id int pk
  name string
  url string
  state string
Call
  id int pk
  gallery_id int fk:Gallery
  title string
  deadline date
Entry
  id int pk
  call_id int fk:Call
  artist_id int fk:Artist
  title string
Artist
  id int pk
  name string
  email string
  state string
```

Here's what a handful of rows actually look like — the keys (`gallery_id`, `call_id`, `artist_id`) are the threads you pull to stitch it back together:

```tabs
=== Gallery
| id | name       | url             | state |
| -- | ---------- | --------------- | ----- |
| 1  | Pace       | pacegallery.com | NY    |
| 2  | Gagosian   | gagosian.com    | NY    |
| 3  | White Cube | whitecube.com   | CA    |
=== Call
| id | gallery_id | title           | deadline   |
| -- | ---------- | --------------- | ---------- |
| 11 | 1          | Spring Open     | 2026-03-15 |
| 12 | 2          | Summer Group    | 2026-06-01 |
| 13 | 3          | Emerging Voices | 2026-09-20 |
=== Entry
| id  | call_id | artist_id | title       |
| --- | ------- | --------- | ----------- |
| 101 | 11      | 5         | Field No. 4 |
| 102 | 11      | 6         | Marfa Light |
| 103 | 12      | 7         | Balloon Dog |
=== Artist
| id | name         | email             | state |
| -- | ------------ | ----------------- | ----- |
| 5  | Agnes Martin | agnes@example.com | NM    |
| 6  | Donald Judd  | donald@example.com| TX    |
| 7  | Jeff Koons   | jeff@example.com  | NY    |
```

Every fact lives once. An artist's state is on the artist; a call's deadline is on the call. This is *normalization*, and for the system that has to keep the data *correct* — accept a submission, change an address, never contradict itself — it is exactly right.

I honestly struggle to talk about normalization, and I think even the concept of database design (or at least data modeling) has become a little less in vogue because best practice has just sort of been absorbed into the collective subconscious. Everyone knows how to model stuff a little bit better now than they did 20 years ago. Most databases look something like the above.

## Asking questions

Sooner or later someone asks an actual question. Say whoever's running the gallery wants to know how much of their interest is *local* — concretely: **how many entries come from an artist based in the same state as the gallery running the call?**

That's one sentence. Against the normalized schema, it's four tables and three joins, because the two facts you need to compare — the artist's state and the gallery's state — live at opposite ends of the graph:

```sql
select count(*)
from entry e
join call    c on c.id = e.call_id
join gallery g on g.id = c.gallery_id
join artist  a on a.id = e.artist_id
where a.state = g.state;
```

You have to hold the whole shape in your head — entry to call to gallery on one side, entry to artist on the other — just to line up two `state` columns. And this is the toy version — real questions pile on: break it down by gallery, restrict to this year's calls, weight it by how many entries each artist submitted.

This, at least in an abstract sense, has two problems:

1. Meaningful insights come from combining disparate datasets, and normalization makes that _definitionally more cumbersome_.
2. Joins are slow.

## Simply eat the other tables

The answer is denormalization!

You build the thing that makes it easy:

```erd
EntryView
  id int pk
  title string
  artist_name string
  artist_email string
  artist_state string
  call_title string
  deadline date
  gallery_name string
  gallery_url string
  gallery_state string
```

```tabs
=== EntryView
| id  | title       | artist_name  | artist_state | call_title   | deadline   | gallery_name | gallery_state |
| --- | ----------- | ------------ | ------------ | ------------ | ---------- | ------------ | ------------- |
| 101 | Field No. 4 | Agnes Martin | NM           | Spring Open  | 2026-03-15 | Pace         | NY            |
| 102 | Marfa Light | Donald Judd  | TX           | Spring Open  | 2026-03-15 | Pace         | NY            |
| 103 | Balloon Dog | Jeff Koons   | NY           | Summer Group | 2026-06-01 | Gagosian     | NY            |
```

This is *denormalization*, and the warehouse crowd has a dozen names for the result — a wide table, a one-big-table model, the flat fact table at the center of a star schema, a *feature table* if you're about to train something on it. ETL, materialized views, whatever — the implementation is a detail; the point is that this pattern has a dozen names.

The artist's state and the gallery's state now sit in the same row. You've reintroduced, deliberately, the redundancy normalization warned you about — `Spring Open` and `Pace` repeat once per entry — because the join you used to pay on every query, you now pay once.

And the query becomes the question, almost verbatim:

```sql
select count(*)
from entry_view
where artist_state = gallery_state;
```

How lovely!

## [points at butterfly] is this a cache?

So obviously there's a reason we don't do this for everything: we now have to figure out how to build the table. This is a rich subject for which I am much less qualified or interested to opine; if you're reading this blog for answers, the answer is something like "run a daily cron which makes the big ol' join manually and reconstitutes the table".

At this point, you have yourselves a cached table with twenty four hours of freshness. You have traded (presumably) a bit of operational toil for a bit of... _something_. What is that something?

- Sometimes you're caching **knowledge** — you just want one clear, blessed definition of what an `EntryView` _is_, a shape everyone can point at and treat as the source of truth. If so, just through it in a view! Don't worry about materialization.
- Other times you're caching **reads** — the join itself is the problem, too slow or too expensive to pay for on every query. _That's_ the one that earns a materialized table and an ETL pipeline to keep it fresh, operational toil and all.

Both are valid reasons, but as someone in the business of [reducing his exposure to moving parts](/posts/no-more-redis) I am morally obligated to make sure you know which camp you're in.

## Postscript: I once again recommend a data greenhouse

I wrote [about building a data greenhouse](/posts/insourcing-your-data-warehouse) four months ago, and it's largely in service of this problem. Being able to trivially answer stuff like "how quickly do we solve bugs from US-based users vs. EU ones?" is wildly actionable.