This is US wind turbine data. The numeric fields use -9999 as a sentinel value for missing data.
Leaving -9999 in numeric fields will badly skew any summary statistic (totals, means, and so on), so we should replace it with something pandas understands as missing, such as np.nan, which the sum and mean methods automatically skip.
As you will see below, the means before and after replacing -9999 with np.nan are very different.
You can use Janitor's find_replace to easily replace them.
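As a quick illustration with made-up numbers (not from this dataset), compare a mean computed over a sentinel with one computed after swapping it for np.nan:

import numpy as np
import pandas as pd

# A tiny made-up series: two real values and one -9999 sentinel
s = pd.Series([100.0, 120.0, -9999.0])
s.mean()                           # ≈ -3259.67 -- the sentinel drags the mean down
s.replace(-9999.0, np.nan).mean()  # 110.0 -- NaN values are skipped by mean()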
import pandas as pd
import janitor
import numpy as np
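# Load the US wind turbine dataset from the TidyTuesday repository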
wind = pd.read_csv(
"https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2018/2018-11-06/us_wind.csv"
)
wind.head()
|   | case_id | faa_ors | faa_asn | usgs_pr_id | t_state | t_county | t_fips | p_name | p_year | p_tnum | ... | t_hh | t_rd | t_rsa | t_ttlh | t_conf_atr | t_conf_loc | t_img_date | t_img_srce | xlong | ylat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3073429 | missing | missing | 4960 | CA | Kern County | 6029 | 251 Wind | 1987 | 194 | ... | -9999.0 | -9999.0 | -9999.0 | -9999.0 | 2 | 3 | 1/1/2012 | NAIP | -118.360725 | 35.083778 |
| 1 | 3071522 | missing | missing | 4997 | CA | Kern County | 6029 | 251 Wind | 1987 | 194 | ... | -9999.0 | -9999.0 | -9999.0 | -9999.0 | 2 | 3 | 1/1/2012 | NAIP | -118.361168 | 35.081512 |
| 2 | 3073425 | missing | missing | 4957 | CA | Kern County | 6029 | 251 Wind | 1987 | 194 | ... | -9999.0 | -9999.0 | -9999.0 | -9999.0 | 2 | 3 | 1/1/2012 | NAIP | -118.360420 | 35.084709 |
| 3 | 3071569 | missing | missing | 5023 | CA | Kern County | 6029 | 251 Wind | 1987 | 194 | ... | -9999.0 | -9999.0 | -9999.0 | -9999.0 | 2 | 3 | 7/31/2016 | Digital Globe | -118.364029 | 35.079418 |
| 4 | 3005252 | missing | missing | 5768 | CA | Kern County | 6029 | 251 Wind | 1987 | 194 | ... | -9999.0 | -9999.0 | -9999.0 | -9999.0 | 2 | 3 | 11/23/2017 | Digital Globe | -118.354286 | 35.085594 |
5 rows × 24 columns
wind.t_hh.mean()
-1069.986537767466
The t_hh column appears to be affected by -9999 values. What are all the columns that are affected?
[col for col in wind.columns if -9999 in wind[col].values]
['usgs_pr_id', 'p_year', 'p_cap', 't_cap', 't_hh', 't_rd', 't_rsa', 't_ttlh']
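An equivalent vectorized check, as a sketch: compare the whole frame against the sentinel at once and keep the columns where any cell matches.

# Vectorized alternative: columns where any cell equals the sentinel
wind.columns[(wind == -9999).any()].tolist()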
Note: before replacing the -9999 values, you can make a copy of the dataframe to avoid modifying the original.
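For example (the variable name is just illustrative):

# Keep an untouched copy of the raw data around, just in case
wind_raw = wind.copy()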
At first glance, the mean hub height appears to be negative, but that is only an artifact of the bad values (-9999.0) scattered throughout the column. To get the correct mean, we need to replace them!

## Replace Bad Values with NaNs
mapping = {-9999.0: np.nan}
wind2 = (
wind
.find_replace(
usgs_pr_id=mapping,
        p_year=mapping,
p_cap=mapping,
t_cap=mapping,
t_hh=mapping,
t_rd=mapping,
t_rsa=mapping,
t_ttlh=mapping,
)
)
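For reference, the same replacement can be done in plain pandas with DataFrame.replace, restricted to the affected columns found above (a sketch, equivalent to the find_replace call; the variable names are illustrative):

# Plain-pandas equivalent of the find_replace call above
affected = ['usgs_pr_id', 'p_year', 'p_cap', 't_cap', 't_hh', 't_rd', 't_rsa', 't_ttlh']
wind2_alt = wind.copy()
wind2_alt[affected] = wind2_alt[affected].replace(-9999.0, np.nan)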
wind2.head()
|   | case_id | faa_ors | faa_asn | usgs_pr_id | t_state | t_county | t_fips | p_name | p_year | p_tnum | ... | t_hh | t_rd | t_rsa | t_ttlh | t_conf_atr | t_conf_loc | t_img_date | t_img_srce | xlong | ylat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3073429 | missing | missing | 4960.0 | CA | Kern County | 6029 | 251 Wind | 1987 | 194 | ... | NaN | NaN | NaN | NaN | 2 | 3 | 1/1/2012 | NAIP | -118.360725 | 35.083778 |
| 1 | 3071522 | missing | missing | 4997.0 | CA | Kern County | 6029 | 251 Wind | 1987 | 194 | ... | NaN | NaN | NaN | NaN | 2 | 3 | 1/1/2012 | NAIP | -118.361168 | 35.081512 |
| 2 | 3073425 | missing | missing | 4957.0 | CA | Kern County | 6029 | 251 Wind | 1987 | 194 | ... | NaN | NaN | NaN | NaN | 2 | 3 | 1/1/2012 | NAIP | -118.360420 | 35.084709 |
| 3 | 3071569 | missing | missing | 5023.0 | CA | Kern County | 6029 | 251 Wind | 1987 | 194 | ... | NaN | NaN | NaN | NaN | 2 | 3 | 7/31/2016 | Digital Globe | -118.364029 | 35.079418 |
| 4 | 3005252 | missing | missing | 5768.0 | CA | Kern County | 6029 | 251 Wind | 1987 | 194 | ... | NaN | NaN | NaN | NaN | 2 | 3 | 11/23/2017 | Digital Globe | -118.354286 | 35.085594 |
5 rows × 24 columns
wind2.t_hh.mean()
77.31203064391
And, now that the bad values were replaced by NaNs (which the mean() method ignores), the calculated mean is correct!
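You can confirm the effect by counting how many t_hh values are now missing versus present:

# How many t_hh values are missing vs. present after the replacement?
wind2.t_hh.isna().sum(), wind2.t_hh.count()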
If we look at the description of the data (see the README) we can find descriptions for our data values, for example:

p_year
: Year project became operational

t_hh
: Turbine hub height (meters)

xlong
: Longitude

Using our knowledge of the data, this gives us bounds we can use for these values. For example, the oldest electric wind turbine was built in 1887 and this document was written in 2018, so $1887 \leq \text{p\_year} \leq 2018$. Turbine hub height should be positive, and a value above 1 km would be silly, so $0 < \text{t\_hh} < 1000$. These are wind turbines in the United States, so $-161.76 < \text{xlong} < -68.01$.
(Note that the README actually gives us minima and maxima for the data, so we could get much tighter bounds from that.)
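Collecting those bounds in one place makes them easier to apply and audit (a sketch; the dict and variable names are just illustrative):

# Plausibility bounds derived above; values outside these are treated as bad
bounds = {
    'p_year': (1887, 2018),      # oldest electric turbine to publication year
    't_hh': (0, 1000),           # hub height in meters: positive and below 1 km
    'xlong': (-161.76, -68.01),  # longitude range for US turbines
}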
To filter out potential bad values, we will use update_where to remove values outside these ranges.
# Note that update_where mutates the original dataframe
(
wind.update_where(
(wind['p_year'] < 1887) | (wind['p_year'] > 2018), 'p_year', np.nan
)
.update_where((wind['t_hh'] <= 0) | (wind['t_hh'] >= 1000), 't_hh', np.nan)
.update_where((wind['xlong'] < -161.76) | (wind['xlong'] > -68.01), 'xlong', np.nan)
);
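The same filtering can be expressed in plain pandas with .loc and the bounds dict sketched above (a simplification: it treats every bound as inclusive, whereas the t_hh conditions above are strict):

# Plain-pandas equivalent: NaN out anything outside each column's bounds
for col, (lo, hi) in bounds.items():
    out_of_range = (wind[col] < lo) | (wind[col] > hi)
    wind.loc[out_of_range, col] = np.nan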
wind.t_hh.mean()
77.31203064391