Good Data is Normalized

Normalized data means:
1. The dataset has distinct "keys."
2. Variables in the dataset are at the key's unit level.

Storing normalized data means your data will be easier to understand and it will be harder to make costly mistakes.

There are multibillion-dollar firms whose entire business model depends on storing and accessing data, so it is worth seeing how they store it:

> **Long ago, smart people figured out a fundamental principle of database design: that the physical structure of a database should communicate its logical structure.**

Let's put that into practice. First we need to agree on three things:

  1. Every database contains "elements" or "observations" (the rows) and "variables" (the columns).
  2. Definition: The "key" is the variable(s) that uniquely identifies a row in the table.
    • If your dataset is about different firms and has one observation per firm, GM is the key for the observation about General Motors.
    • If your dataset contains info about firms over multiple years (a "firm-year" dataset), (GM, 2000) is the key for the observation for GM in 2000.
    • Notice that the "key" can be multiple variables, like the firm and the year in the prior sentence.
  3. Definition: The "unit" of a dataset refers to the level of aggregation of variables in the dataset. Examples:
    • Firm-level data: Industry of a firm
    • Firm-year data: Profits
    • Firm-day data: Firm returns
    • Daily data (or any time interval): Market returns
    • Country-year: GDP
    • Country-level data: Continent

Rules for storing data

  1. Rule: Keys should never be missing! (Rule 4.A) If one is, how can you say what that row is about?
  2. Rule: Keys should never be duplicated. If any keys are duplicated, there is a problem with your dataset or you misunderstand what the unit is and what variable(s) constitute the key.
  3. Rule: Variables in a normalized dataset are at the unit level of the dataset, and not other unit levels.
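Rules 1 and 2 are easy to check mechanically. Here is a minimal pandas sketch (the frame and its values are illustrative, not a real dataset):

```python
import pandas as pd

# A hypothetical firm-year dataset; (Firm, Year) is the key
df = pd.DataFrame({
    "Firm":    ["GM", "GM", "Ford"],
    "Year":    [2000, 2001, 2000],
    "Profits": [5, 5.5, -4],
})

key = ["Firm", "Year"]

# Rule 1: keys should never be missing
assert df[key].notna().all().all(), "A key value is missing!"

# Rule 2: keys should never be duplicated
assert not df.duplicated(subset=key).any(), "Duplicate keys found!"
```

If either assertion fails, either the dataset has a problem or you have misidentified the unit and the key.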

Rule 3 means that if the key is a county, and the unit is the county in 2010 (a cross section at a single point in time), then the variables you can have in that dataset are attributes of the county as of 2010.

If you have variables about state populations, they should be in a different database (the state database).

If you have variables about the county at many points in time (county GDP for each year), those should be in a different database (the county-year database).

Normalized data in practice

Try to create and store data that is normalized. But notice how Golden rule 4.B is phrased: "Keep data normalized as far into your code pipeline as you can."

I doubt you will ever run a serious regression or analysis on completely normalized data. For example, asset pricing tests require the firm's return that day (which is a variable whose unit is a firm-day) and the market return (which is a variable whose unit is a day).

So at some point you will combine datasets with different unit levels. Do your best to delay that (4.B) until the beginning of the analysis.

Example: Fixing non-normalized data

So, as a little example, here is a bad firm-year dataset:

| Firm | Year | Profits | First_year | IndustryDef | SubIndustryDef | Profits_Ind |
|:--- | :--- | :--- | :--- | :--- | :--- | :--- |
| GM | 2000 | 5 | 1908 | 1 | 1A | 1 |
| GM | 2001 | 5.5 | | 1 | 1A | 13 |
| Ford | 2000 | -4 | 1903 | 2 | 1A | 1 |
| Ford | 2001 | 7.5 | 1903 | 2 | 1A | 13 |
| Ford | 2002 | 8 | 1903 | 2 | 1A | 23 |
| | 2002 | 15 | 1903 | 2 | 1A | 23 |

What a mess!

  • What is the last observation about?
  • How is GM missing a "first year" in one observation?
  • How can GM and Ford be in the same sub-industry but different industries?
  • You'll have to keep track of what Profits and Profits_Ind mean. (And which industry level are profits at?)

Beyond those problems, this dataset has variables at THREE unit levels:

  1. Firm-year variables (profits in a given year for a given firm)
  2. Firm-level variables (first year the firm existed)
  3. Industry-year variables (industry profits)

To save this data in a normalized form, we need to save three datasets. Check out how I did that:

```{tabbed} Firm-year variables

| Firm | Year | Profits |
|:--- | :--- | :--- |
| GM | 2000 | 5 |
| GM | 2001 | 5.5 |
| Ford | 2000 | -4 |
| Ford | 2001 | 7.5 |
| Ford | 2002 | 8 |
```

```{tabbed} Firm-level variables

| Firm | First_year | IndustryDef | SubIndustryDef |
|:--- | :--- | :--- | :--- |
| GM | 1908 | 1 | 1A |
| Ford | 1903 | 2 | 1A |
```

```{tabbed} Industry-year variables

| IndustryDef | Year | Profits |
|:--- | :--- | :--- |
| 1 | 2000 | 1 |
| 1 | 2001 | 13 |
| 1 | 2002 | 23 |
```

Now, the physical structure of each dataset communicates its logical structure: Every firm-year has a profit level, and each firm has a founding year. In fact, these three databases are self-documenting! (Rule 6.C) Isn't that nice!?
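The split itself is mechanical. A pandas sketch, where the frame mirrors the messy table above after its obvious errors (GM's missing First_year and the ownerless last row) have been repaired:

```python
import pandas as pd

# The messy firm-year table, with the broken rows already repaired
messy = pd.DataFrame({
    "Firm":           ["GM", "GM", "Ford", "Ford", "Ford"],
    "Year":           [2000, 2001, 2000, 2001, 2002],
    "Profits":        [5, 5.5, -4, 7.5, 8],
    "First_year":     [1908, 1908, 1903, 1903, 1903],
    "IndustryDef":    [1, 1, 2, 2, 2],
    "SubIndustryDef": ["1A", "1A", "1A", "1A", "1A"],
    "Profits_Ind":    [1, 13, 1, 13, 23],
})

# One table per unit level; drop_duplicates collapses the repeats
firm_year     = messy[["Firm", "Year", "Profits"]]
firm          = messy[["Firm", "First_year", "IndustryDef",
                       "SubIndustryDef"]].drop_duplicates()
industry_year = messy[["IndustryDef", "Year",
                       "Profits_Ind"]].drop_duplicates()
```

A useful side effect: if `drop_duplicates` leaves more than one row per firm in the firm-level table, a "firm-level" variable was not actually constant within firm, and you have found another data error.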