```{important}
Normalized data means

1. The dataset has distinct "keys"
2. Variables in the dataset are at the key's unit level
```

Storing normalized data means your data will be easier to understand and it will be harder to make costly mistakes.
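Rule 1 is something you can enforce mechanically. Here is a minimal pandas sketch (firm names and values are made up for illustration) that declares a dataset's key and errors out if the key is not distinct:

```python
import pandas as pd

# A hypothetical firm-year dataset (values are made up):
df = pd.DataFrame({
    "Firm":    ["GM", "GM", "Ford", "Ford"],
    "Year":    [2000, 2001, 2000, 2001],
    "Profits": [5.0, 5.5, -4.0, 7.5],
})

# Rule 1: (Firm, Year) must identify every row exactly once.
# verify_integrity=True raises an error if any key combination repeats.
df = df.set_index(["Firm", "Year"], verify_integrity=True)
```

The `verify_integrity=True` flag turns "the dataset has distinct keys" from a convention into a check the computer runs for you.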
There are multi-billion-dollar firms whose entire business model depends on storing and accessing data. So let's see how they store data and follow their lead:
```{note}
**Long ago, smart people figured out a fundamental principle of database design: the physical structure of a database should communicate its logical structure.**
```
Let's put that into practice. First we need to agree on three things:

1. "GM" is the key for an observation about General Motors in a firm-level dataset.
2. (GM, 2000) is the key for the observation about GM in the year 2000 in a firm-year dataset.
3. Rule 3 means that if the key is a county, and the unit is the county in 2010 (a cross section at a single point in time), then the only variables you can have in that dataset are attributes of the county as of 2010.
   - Variables about state populations should be in a different database (the state database).
   - Variables about the county at many points in time (county GDP for each year, say) should be in a different database (the county-year database).
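To make the third point concrete, here is one way to spot a variable stored at the wrong unit level, using a hypothetical county-year dataset (all names and numbers below are made up):

```python
import pandas as pd

# Hypothetical county-year dataset. State_Pop is a state-level variable
# that does not belong at the county-year level:
df = pd.DataFrame({
    "County":     ["A", "A", "B", "B"],
    "Year":       [2009, 2010, 2009, 2010],
    "County_GDP": [1.0, 1.1, 2.0, 2.3],
    "State_Pop":  [5.0, 5.0, 5.0, 5.0],
})

# A county-year variable should vary within a county across years.
# State_Pop takes a single value per county here, which flags it as
# belonging in a different (higher-level) dataset:
print(df.groupby("County")[["County_GDP", "State_Pop"]].nunique())
```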
Try to create and store data that is normalized. But notice how Golden rule 4.B is phrased: "Keep data normalized as far into your code pipeline as you can."
I doubt you will ever run a serious regression or analysis on completely normalized data. For example, asset pricing tests require the firm's return that day (a variable whose unit is the firm-day) and the market return that day (a variable whose unit is the day).
So at some point you will combine datasets with different unit levels. Do your best to delay that (4.B) until the beginning of the analysis.
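As a sketch of what that looks like in practice, here is the asset pricing example in pandas, with made-up tickers, dates, and returns. The two datasets stay normalized right up until the merge that begins the analysis:

```python
import pandas as pd

# Two normalized datasets (tickers, dates, and returns are made up):
firm_day = pd.DataFrame({
    "Firm": ["GM", "GM", "Ford", "Ford"],
    "Date": pd.to_datetime(["2000-01-03", "2000-01-04"] * 2),
    "Ret":  [0.010, -0.020, 0.005, 0.012],
})
day = pd.DataFrame({
    "Date":    pd.to_datetime(["2000-01-03", "2000-01-04"]),
    "Mkt_Ret": [0.004, -0.007],
})

# Only now, at the start of the analysis, combine the two unit levels.
# validate="many_to_one" errors out if the day-level table has duplicate days.
sample = firm_day.merge(day, on="Date", how="left", validate="many_to_one")
```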
So, as a little example, here is a bad firm-year dataset:
| Firm | Year | Profits | First_year | IndustryDef | SubIndustryDef | Profits_Ind |
|:--- | :--- | :--- | :--- | :--- | :--- | :--- |
| GM | 2000 | 5 | 1908 | 1 | 1A | 1 |
| GM | 2001 | 5.5 | | 1 | 1A | 13 |
| Ford | 2000 | -4 | 1903 | 1 | 1A | 1 |
| Ford | 2001 | 7.5 | 1903 | 1 | 1A | 13 |
| Ford | 2002 | 8 | 1903 | 1 | 1A | 23 |
| | 2002 | 15 | 1903 | 1 | 1A | 23 |
What a mess!

- Why is GM's First_year blank in 2001?
- Which firm is the last row about?
- It's hard to tell what Profits and Profits_Ind mean. (And which industry level are profits at?)

Beyond those problems, this dataset has variables at THREE unit levels:

- Firm-year variables: Profits
- Firm-level variables: First_year, IndustryDef, SubIndustryDef
- Industry-year variables: Profits_Ind
To save this data in a normalized form, we need to save three datasets. Check out how I did that:
```{tabbed} Firm-year variables
| Firm | Year | Profits |
|:--- | :--- | :--- |
| GM | 2000 | 5 |
| GM | 2001 | 5.5 |
| Ford | 2000 | -4 |
| Ford | 2001 | 7.5 |
| Ford | 2002 | 8 |
```

```{tabbed} Firm-level variables
| Firm | First_year | IndustryDef | SubIndustryDef |
|:--- | :--- | :--- | :--- |
| GM | 1908 | 1 | 1A |
| Ford | 1903 | 1 | 1A |
```

```{tabbed} Industry-year variables
| IndustryDef | Year | Profits_Ind |
|:--- | :--- | :--- |
| 1 | 2000 | 1 |
| 1 | 2001 | 13 |
| 1 | 2002 | 23 |
```
Now, the physical structure of each dataset communicates its logical structure: Every firm-year has a profit level, and each firm has a founding year. In fact, these three databases are self-documenting! (Rule 6.C) Isn't that nice!?
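If you want the split itself to be reproducible, here is one pandas sketch of it. It assumes the messy table above has been loaded with its blank cells read as missing values; the `dropna`/`drop_duplicates` approach is one of several ways to do this:

```python
import pandas as pd

# The messy table from above, with blanks as missing values:
messy = pd.DataFrame({
    "Firm":           ["GM", "GM", "Ford", "Ford", "Ford", pd.NA],
    "Year":           [2000, 2001, 2000, 2001, 2002, 2002],
    "Profits":        [5, 5.5, -4, 7.5, 8, 15],
    "First_year":     [1908, pd.NA, 1903, 1903, 1903, 1903],
    "IndustryDef":    [1, 1, 1, 1, 1, 1],
    "SubIndustryDef": ["1A"] * 6,
    "Profits_Ind":    [1, 13, 1, 13, 23, 23],
})

# Keep only the columns at each unit level, then de-duplicate.
# Rows with an unknown firm (or unknown firm attributes) are dropped:
firm_year = messy[["Firm", "Year", "Profits"]].dropna(subset=["Firm"])
firm = (messy[["Firm", "First_year", "IndustryDef", "SubIndustryDef"]]
        .dropna()
        .drop_duplicates())
industry_year = (messy[["IndustryDef", "Year", "Profits_Ind"]]
                 .drop_duplicates())
```

Running this reproduces the three tables above, and merging them back together (firm-year to firm on Firm, then to industry-year on IndustryDef and Year) rebuilds the analysis sample without any of the original blanks and ambiguities.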