import pandas as pd
import numpy as np
import random as rd
import seaborn as sns
rd.seed(42)
df = pd.read_feather("../datasets/attrition.feather")
df.head()
| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | ... | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 21 | 0.0 | Travel_Rarely | 391 | Research_Development | 15 | College | Life_Sciences | High | Male | ... | Excellent | Very_High | 0 | 0 | 6 | Better | 0 | 0 | 0 | 0 |
| 1 | 19 | 1.0 | Travel_Rarely | 528 | Sales | 22 | Below_College | Marketing | Very_High | Male | ... | Excellent | Very_High | 0 | 0 | 2 | Good | 0 | 0 | 0 | 0 |
| 2 | 18 | 1.0 | Travel_Rarely | 230 | Research_Development | 3 | Bachelor | Life_Sciences | High | Male | ... | Excellent | High | 0 | 0 | 2 | Better | 0 | 0 | 0 | 0 |
| 3 | 18 | 0.0 | Travel_Rarely | 812 | Sales | 10 | Bachelor | Medical | Very_High | Female | ... | Excellent | Low | 0 | 0 | 2 | Better | 0 | 0 | 0 | 0 |
| 4 | 18 | 1.0 | Travel_Frequently | 1306 | Sales | 5 | Bachelor | Marketing | Medium | Male | ... | Excellent | Very_High | 0 | 0 | 3 | Better | 0 | 0 | 0 | 0 |
5 rows × 31 columns
print(df.shape)
df["Age"].hist()
(1470, 31)
<Axes: >
df_simple_samp = df.sample(n=70, random_state=42)
df_simple_samp["Age"].hist()
<Axes: >
Systematic sampling has a pitfall: if the rows have been sorted, or the row order carries some pattern or meaning, the resulting sample may not be representative of the population. The problem can be fixed by shuffling the rows first, but then systematic sampling becomes equivalent to simple random sampling (see the sketch after the code below).
sample_size = 70
pop_size = len(df)
interval = pop_size // sample_size  # step between sampled rows
df_sys_samp = df.iloc[::interval]  # take every interval-th row
df_sys_samp["Age"].hist()
<Axes: >
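A minimal sketch of the shuffling fix mentioned above (the names df_shuffled and df_shuffled_sys_samp are illustrative): shuffle every row with sample(frac=1), reset the index, then apply the same systematic interval. After the shuffle, the systematic sample behaves like a simple random sample.
df_shuffled = df.sample(frac=1, random_state=42).reset_index(drop=True)  # shuffle all rows
df_shuffled_sys_samp = df_shuffled.iloc[::interval]  # same interval as before
df_shuffled_sys_samp["Age"].hist()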
Stratified (proportional) sampling: the proportion of each category or subgroup in the sample is similar to its proportion in the population.
df_strat_samp = df.groupby("Department", observed=False).sample(frac=0.1, random_state=42)
df_strat_samp["Age"].hist()
<Axes: >
df["Department"].value_counts(normalize=True)
Department
Research_Development    0.653741
Sales                   0.303401
Human_Resources         0.042857
Name: proportion, dtype: float64
df_strat_samp["Department"].value_counts(normalize=True)
Department
Research_Development    0.653061
Sales                   0.306122
Human_Resources         0.040816
Name: proportion, dtype: float64
Equal-count stratified sampling extracts the same number of rows (n) from each category, so every category is equally represented regardless of its size in the population.
df_eq_strat_samp = df.groupby("Department", observed=False).sample(n=15, random_state=42)
df_eq_strat_samp["Age"].hist()
<Axes: >
df["Department"].value_counts(normalize=True)
Department
Research_Development    0.653741
Sales                   0.303401
Human_Resources         0.042857
Name: proportion, dtype: float64
df_eq_strat_samp["Department"].value_counts(normalize=True)
Department
Human_Resources         0.333333
Research_Development    0.333333
Sales                   0.333333
Name: proportion, dtype: float64
Specify weights to adjust the relative probability of a row being sampled.
df_weight = df.copy()  # work on a copy so the weight column is not added to df
condition = df_weight["Department"] == "Sales"
df_weight["weight"] = np.where(condition, 2, 1)  # weight 2 for Sales rows, 1 otherwise => twice the chance of being picked
df_weight = df_weight.sample(frac=0.1, weights="weight", random_state=42)
df_weight["Age"].hist()
<Axes: >
df.value_counts("Department", normalize=True)
Department
Research_Development    0.653741
Sales                   0.303401
Human_Resources         0.042857
Name: proportion, dtype: float64
df_weight.value_counts("Department", normalize=True)
Department
Research_Development    0.537415
Sales                   0.435374
Human_Resources         0.027211
Name: proportion, dtype: float64
job_roles = list(df["JobRole"].unique())
job_roles_samp = rd.sample(job_roles, k=4)  # stage 1: randomly pick 4 clusters (job roles)
condition = df["JobRole"].isin(job_roles_samp)
df_filtered = df[condition].copy()  # copy to avoid SettingWithCopyWarning
df_filtered["JobRole"] = df_filtered["JobRole"].cat.remove_unused_categories()
df_clust_samp = df_filtered.groupby("JobRole", observed=True).sample(n=10, random_state=42)  # stage 2: sample rows within each cluster
df_clust_samp["Age"].hist()
<Axes: >
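Mirroring the proportion checks done for the other methods, the cluster sample's category shares can be compared with the population; a quick illustrative check (with n=10 drawn from each of the 4 selected roles, each selected role makes up 25% of the sample):
print(df["JobRole"].value_counts(normalize=True))  # population proportions
print(df_clust_samp["JobRole"].value_counts(normalize=True))  # cluster sample proportions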
attrition_srs100 = df.sample(n=100, random_state=42)
mean_attrition_srs100 = attrition_srs100["Attrition"].mean()
rel_error_pct100 = 100 * abs(df["Attrition"].mean() - mean_attrition_srs100) / df["Attrition"].mean()  # relative error (%) of the sample mean vs. the population mean
print(rel_error_pct100)
24.05063291139242
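The relative error of a point estimate generally shrinks as the sample size grows. An illustrative repeat of the same calculation with a larger sample (the size 1000 and the variable names are assumptions, not from the original):
attrition_srs1000 = df.sample(n=1000, random_state=42)
mean_attrition_srs1000 = attrition_srs1000["Attrition"].mean()
rel_error_pct1000 = 100 * abs(df["Attrition"].mean() - mean_attrition_srs1000) / df["Attrition"].mean()
print(rel_error_pct1000)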
Sampling: going from a population to a smaller sample.
Bootstrapping: building up a theoretical population from the sample by resampling with replacement.
Process:
1. Resample the dataset with replacement, keeping the same number of rows as the original.
2. Calculate the statistic of interest (e.g. the mean) on the resample.
3. Repeat many times to build the bootstrap distribution.
Bootstrap distribution mean: usually close to the sample mean, but it cannot correct for bias already present in the original sample.
Standard error: the standard deviation of the statistic of interest across the bootstrap resamples (see the sketch after the plots below).
df_resample = df.sample(frac=1, replace=True)  # a single bootstrap resample: same size as the original, drawn with replacement
means = []
for i in range(1000):
    # record the mean Age of each bootstrap resample
    means.append(np.mean(df.sample(frac=1, replace=True)["Age"]))
sns.histplot(means)
<Axes: ylabel='Count'>
sns.histplot(df["Age"])
<Axes: xlabel='Age', ylabel='Count'>
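The two summary quantities defined above can be read straight off the bootstrap distribution. A minimal sketch using the means list built in the loop (the variable name std_error_boot is illustrative):
print(np.mean(means), np.mean(df["Age"]))  # bootstrap distribution mean vs. original sample mean
std_error_boot = np.std(means, ddof=1)  # standard error: std of the statistic across resamples
print(std_error_boot)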
Ways to calculate a confidence interval:
- mean plus or minus one standard deviation (covers roughly 68% of a normal distribution, not 95%)
- the quantile method: take the 2.5% and 97.5% quantiles
- the standard error method: invert the normal CDF with norm.ppf, using a point estimate and a standard error
mean = np.mean(df["Age"])
c1 = mean - np.std(df["Age"], ddof=1)
c2 = mean + np.std(df["Age"], ddof=1)
print([c1, c2])
[27.78843603467279, 46.05918301294626]
q1 = np.quantile(df["Age"], 0.025)
q2 = np.quantile(df["Age"], 0.975)
print([q1, q2])
[21.0, 56.0]
from scipy.stats import norm
point_estimate = np.mean(df["Age"])
std_error = np.std(df["Age"], ddof=1)  # NOTE: this is the std of the raw Age values; the standard error method should use std(bootstrap distribution) - see the sketch below
lower = norm.ppf(0.025, loc=point_estimate, scale=std_error)
upper = norm.ppf(0.975, loc=point_estimate, scale=std_error)
print([lower, upper])
[19.018806499779515, 54.82881254783953]
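A hedged sketch of the standard error method done with the bootstrap distribution of the mean built earlier (means); the resulting interval is much narrower than the one printed above, because the standard error of the mean is far smaller than the spread of individual ages:
point_estimate = np.mean(means)
std_error = np.std(means, ddof=1)  # standard deviation of the bootstrap means
lower = norm.ppf(0.025, loc=point_estimate, scale=std_error)
upper = norm.ppf(0.975, loc=point_estimate, scale=std_error)
print([lower, upper])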