Dealing with Missing Data

  1. df.isnull().sum will report missing values by variable.
  2. With new datasets, look out for "missing values" that aren't missing. Some datasets use a certain number to indicate missing data (i.e. -99). Convert these to NaNs with replace.

These slides on missing data are quite good! This article has examples too.

You should focus on the whys and hows of dealing with missing data rather than mechanics. (You can look up mechanics later.)

For example, with firm level data that investment analysts deal with, the most common approach is to keep all valid data, and for each test you run, use all observations that have no missing values for the variables of interest. In the slides, this is called "Complete-Case Analysis".

It is less common in my field of research to impute missing values. However, "Deductive imputation" is common when the cost of doing so isn't high.

In general, when you can confidently deduce a value (my height this year is the same as last year because I'm a fully grown adult (mostly)), go ahead. That is valid data.

Options

These pandas functions can be very useful:

  • fillna - any value (strings included) you want, back fill, forward, fill, and more
  • dropna - might not be explicitly needed, some functions ignore NaNs, but some don't. I tend to drop only as needed for a given estimation, and only temporarily.
  • replace - some datasets use a certain number to indicate missing data (i.e. -99). Convert these to NaN with this function.
  • interpolate - e.g. use values in surrounding time periods to fill in gaps
  • Deduce. Suppose PPE this year is missing. $PPE_t = PPE_{t-1} + CAPX_t - DP_t $

Practice

Play around with each of those functions on this dataframe:

In [ ]:
import pandas as pd
import numpy as np

df = pd.DataFrame({"A":[12, 4, 5, None, 1], 
                   "B":[None, 2, 54, 3, None], 
                   "C":[20, 16, None, 3, 8], 
                   "D":[14, 3, None, None, 6]}) 

_df1 = df.copy()
_df1['firm'] = 1
_df1['date'] = _df1.index

_df2 = df.copy()
_df2['firm'] = 2
_df2['date'] = _df2.index

df2 = _df1.append(_df2)

Questions for df:

  1. Fill all missing values with -1
  2. Fill missing values for variable "B" with -1
  3. Fill all values with the mean for the variable
  4. Fill all values with the median for the variable
  5. Fill values by taking the most recent non-missing prior value

Questions for df2:

  • Carry missing values forward without carrying values from firm 1 to firm 2
  • Fill missing values with the average for firms on that date