! pip install chart_studio
! pip install bqplot
! pip install pingouin
Successfully installed bqplot-0.12.31 traittypes-0.2.1
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
Successfully installed littleutils-0.2.2 outdated-0.2.1 pandas-flavor-0.2.0 pingouin-0.5.0 scipy-1.7.3 statsmodels-0.13.1
### Load relevant packages
import pandas as pd
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as sm
import chart_studio.plotly as py
# https://community.plot.ly/t/solved-update-to-plotly-4-0-0-broke-application/26526/2
import os
#%matplotlib inline
#plt.style.use('ggplot')
from bokeh.resources import INLINE
import bokeh.io
from bokeh import *
import pingouin
bokeh.io.output_notebook(INLINE)
In this case, we will establish a basic understanding of the statistics needed for linear regression and then introduce linear regression, starting with 2 parameters. We expect students to come away with a thorough understanding of the functional components of a linear regression model, such as interpreting the coefficients and understanding the various metrics used to evaluate model performance properly.
Business context. You are a data scientist at a large organization. Your company is undergoing an internal review of its hiring and employee compensation practices. In recent years, your company has had little success in converting the high-quality female candidates it wanted to hire. Management hypothesizes that this is due to possible pay discrimination and wants to find out what is causing it.
Business problem. As part of the internal review, the human resources department has approached you to investigate the following question specifically: "Overall, are men paid more than women in your organization? If so, what is driving this gap?"
Analytical context. The human resources department has provided you with an employee database containing information on various attributes such as performance, education, income, seniority, etc. We will use linear regression techniques on this dataset to solve the business problem described above. We will see how linear regression quantifies the correlation between the dependent variable (pay) and the independent variables (e.g., education, age, seniority, etc.).
The case is structured as follows: we will (1) perform an exploratory data analysis to investigate pay differences visually; (2) use the insights observed to fit regression models formally; and finally (3) address the question of pay discrimination.
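from google.colab import drive
import os

# Mount Google Drive so the notebook can read files stored there
drive.mount('/content/gdrive')

# Show the current working directory, then switch to the Drive folder that holds the data
print(os.getcwd())
os.chdir("/content/gdrive/My Drive")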
Mounted at /content/gdrive
/content
Data = pd.read_csv('glassdoordata.csv')
Data.head()
| | jobtitle | gender | age | performance | education | department | seniority | income | bonus |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Graphic Designer | Female | 18 | 5 | College | Operations | 2 | 42363 | 9938 |
| 1 | Software Engineer | Male | 21 | 5 | College | Management | 5 | 108476 | 11128 |
| 2 | Warehouse Associate | Female | 19 | 4 | PhD | Administration | 5 | 90208 | 9268 |
| 3 | Software Engineer | Male | 20 | 5 | Masters | Sales | 4 | 108080 | 10154 |
| 4 | Graphic Designer | Male | 26 | 5 | Masters | Engineering | 5 | 99464 | 9319 |
Data.shape
(1000, 9)
The available variables are: jobtitle, gender, age, performance, education, department, seniority, income, and bonus.
Since we are interested in total compensation, let's create a new column called pay:
Data['pay'] = Data['income'] + Data['bonus']  # Total compensation: base income plus bonus
Data.head()
| | jobtitle | gender | age | performance | education | department | seniority | income | bonus | pay |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Graphic Designer | Female | 18 | 5 | College | Operations | 2 | 42363 | 9938 | 52301 |
| 1 | Software Engineer | Male | 21 | 5 | College | Management | 5 | 108476 | 11128 | 119604 |
| 2 | Warehouse Associate | Female | 19 | 4 | PhD | Administration | 5 | 90208 | 9268 | 99476 |
| 3 | Software Engineer | Male | 20 | 5 | Masters | Sales | 4 | 108080 | 10154 | 118234 |
| 4 | Graphic Designer | Male | 26 | 5 | Masters | Engineering | 5 | 99464 | 9319 | 108783 |
sns.boxplot(x='gender', y = 'pay', data = Data)
#Data.boxplot(grid= False, column = ['pay'], by = ['gender'])
plt.title("Pay vs Gender");
plt.scatter(Data['age'],Data['pay'])
plt.title("Pay vs. Age", fontsize=20, verticalalignment='bottom');
plt.xlabel("Age");
plt.ylabel("Pay");
sns.boxplot(x='education', y = 'pay', data = Data)
plt.title("Pay vs. Education", fontsize=20, verticalalignment='bottom');
sns.boxplot(x='seniority', y = 'pay', data = Data)
plt.title("Pay vs. Seniority", fontsize=20, verticalalignment='bottom');
sns.boxplot(x='education', y = 'pay', hue = 'gender', data = Data)
plt.title("Pay vs. Education", fontsize=20, verticalalignment='bottom');
sns.boxplot(x='jobtitle', y = 'pay', hue = 'gender',data = Data)
plt.title("Pay vs. Jobtitle", fontsize=20, verticalalignment='bottom')
plt.xticks(rotation=90);
The linear model of pay versus age can be fitted as follows:
Data.columns
Index(['jobtitle', 'gender', 'age', 'performance', 'education', 'department', 'seniority', 'income', 'bonus', 'pay'], dtype='object')
model1 = 'pay~age'
lm1 = sm.ols(formula = model1, data = Data).fit()
print(lm1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    pay   R-squared:                       0.285
Model:                            OLS   Adj. R-squared:                  0.284
Method:                 Least Squares   F-statistic:                     397.5
Date:                Fri, 31 Dec 2021   Prob (F-statistic):           1.04e-74
Time:                        20:54:35   Log-Likelihood:                -11384.
No. Observations:                1000   AIC:                         2.277e+04
Df Residuals:                     998   BIC:                         2.278e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   6.206e+04   2062.885     30.085      0.000     5.8e+04    6.61e+04
age          939.2501     47.109     19.938      0.000     846.806    1031.694
==============================================================================
Omnibus:                        6.360   Durbin-Watson:                   1.905
Prob(Omnibus):                  0.042   Jarque-Bera (JB):                6.421
Skew:                           0.182   Prob(JB):                       0.0403
Kurtosis:                       2.853   Cond. No.                         134.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
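As a rough illustration of how to read these coefficients, the sketch below plugs the fitted intercept and age slope into the model equation pay = intercept + slope × age. The numbers are copied from the lm1 summary; because the intercept is printed there only to four significant figures (6.206e+04), the predictions are approximate. The function name predict_pay is a hypothetical helper, not part of the notebook.

```python
# Coefficients copied from the lm1 summary (pay ~ age).
# Assumption: the intercept is only known to four significant figures.
INTERCEPT = 62060.0
AGE_COEF = 939.2501


def predict_pay(age):
    """Approximate predicted total pay (USD) for an employee of the given age."""
    return INTERCEPT + AGE_COEF * age


pay_30 = predict_pay(30)
pay_31 = predict_pay(31)

print(f"Predicted pay at age 30: {pay_30:,.0f} USD")
print(f"Predicted pay at age 31: {pay_31:,.0f} USD")
# The difference between consecutive ages is exactly the age coefficient.
print(f"Extra pay per additional year of age: {pay_31 - pay_30:,.2f} USD")
```

With the fitted model object available, lm1.predict on a small DataFrame of ages should give the same values without the rounding of the intercept.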
Now that we have seen that age explains part of the variation in pay, let's consider a model that accounts for age and gender simultaneously. Age is a numeric variable (e.g., 26.5, 32). By contrast, gender takes only two values: male and female. Variables of this kind are called categorical variables. The way we interpret the coefficients of categorical variables in a linear model is slightly different from the way we interpret those of numeric variables, as the next model shows.
Data
| | jobtitle | gender | age | performance | education | department | seniority | income | bonus | pay |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Graphic Designer | Female | 18 | 5 | College | Operations | 2 | 42363 | 9938 | 52301 |
| 1 | Software Engineer | Male | 21 | 5 | College | Management | 5 | 108476 | 11128 | 119604 |
| 2 | Warehouse Associate | Female | 19 | 4 | PhD | Administration | 5 | 90208 | 9268 | 99476 |
| 3 | Software Engineer | Male | 20 | 5 | Masters | Sales | 4 | 108080 | 10154 | 118234 |
| 4 | Graphic Designer | Male | 26 | 5 | Masters | Engineering | 5 | 99464 | 9319 | 108783 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | Marketing Associate | Female | 61 | 1 | High School | Administration | 1 | 62644 | 3270 | 65914 |
| 996 | Data Scientist | Male | 57 | 1 | Masters | Sales | 2 | 108977 | 3567 | 112544 |
| 997 | Financial Analyst | Male | 48 | 1 | High School | Operations | 1 | 92347 | 2724 | 95071 |
| 998 | Financial Analyst | Male | 65 | 2 | High School | Administration | 1 | 97376 | 2225 | 99601 |
| 999 | Financial Analyst | Male | 60 | 1 | PhD | Sales | 2 | 123108 | 2244 | 125352 |
1000 rows × 10 columns
model2 = 'pay~age + gender'
lm2 = sm.ols(formula = model2, data = Data).fit()
print(lm2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    pay   R-squared:                       0.319
Model:                            OLS   Adj. R-squared:                  0.317
Method:                 Least Squares   F-statistic:                     233.2
Date:                Fri, 31 Dec 2021   Prob (F-statistic):           8.10e-84
Time:                        21:02:16   Log-Likelihood:                -11359.
No. Observations:                1000   AIC:                         2.272e+04
Df Residuals:                     997   BIC:                         2.274e+04
Df Model:                           2
Covariance Type:            nonrobust
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept       5.674e+04   2151.480     26.373      0.000    5.25e+04     6.1e+04
gender[T.Male]  9279.3180   1317.787      7.042      0.000    6693.364    1.19e+04
age              948.5266     46.022     20.610      0.000     858.216    1038.837
==============================================================================
Omnibus:                        9.898   Durbin-Watson:                   1.871
Prob(Omnibus):                  0.007   Jarque-Bera (JB):                9.345
Skew:                           0.197   Prob(JB):                      0.00935
Kurtosis:                       2.737   Cond. No.                         148.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The interpretation of the age coefficient is the same as before: if age increases by one year, pay is expected to increase by 948.5 USD, holding gender fixed. Now focus on the gender coefficient. Only the male level appears (gender[T.Male]) because the female category is taken as the default (reference) category. (Note that the choice of default category does not change the fitted model; we could just as easily have made male the default, in which case the coefficient would be reported as gender[T.Female] with the opposite sign.) The coefficient 9279.3180 is interpreted as follows: for employees of the same age, men earn on average 9279.32 USD more than women.
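The sketch below shows, under the assumption that the coefficients printed in the lm2 summary are used as-is (with the intercept rounded to 5.674e+04), how a categorical variable enters the prediction as a 0/1 indicator. It also re-expresses the same fit with Male as the reference category to show that the predictions do not change. The helper names are hypothetical.

```python
# Coefficients copied from the lm2 summary (pay ~ age + gender).
# Assumption: the intercept is only known to four significant figures.
INTERCEPT = 56740.0
AGE_COEF = 948.5266
MALE_COEF = 9279.3180  # gender[T.Male]; Female is the reference category


def predict_pay(age, gender):
    """Prediction with Female as the reference category."""
    is_male = 1 if gender == "Male" else 0
    return INTERCEPT + AGE_COEF * age + MALE_COEF * is_male


# Same fitted model, written with Male as the reference category:
# the intercept absorbs the male effect and the female coefficient flips sign.
INTERCEPT_MALE_REF = INTERCEPT + MALE_COEF
FEMALE_COEF = -MALE_COEF


def predict_pay_male_ref(age, gender):
    """Prediction with Male as the reference category."""
    is_female = 1 if gender == "Female" else 0
    return INTERCEPT_MALE_REF + AGE_COEF * age + FEMALE_COEF * is_female


# The male-female gap is the same at every age in this model.
gap_at_30 = predict_pay(30, "Male") - predict_pay(30, "Female")
gap_at_50 = predict_pay(50, "Male") - predict_pay(50, "Female")
print(f"Gap at age 30: {gap_at_30:,.2f} USD")
print(f"Gap at age 50: {gap_at_50:,.2f} USD")
```

In a statsmodels formula, the reference level can be chosen explicitly with patsy's treatment coding, for example 'pay ~ age + C(gender, Treatment(reference="Male"))'.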
Let's consider all the other factors that could explain the pay gap at once. What can you conclude?
model4 = 'pay ~ jobtitle + age + performance + education + department + seniority + gender'
lm4 = sm.ols(formula = model4, data = Data).fit()
print(lm4.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    pay   R-squared:                       0.841
Model:                            OLS   Adj. R-squared:                  0.838
Method:                 Least Squares   F-statistic:                     259.6
Date:                Fri, 31 Dec 2021   Prob (F-statistic):               0.00
Time:                        21:11:50   Log-Likelihood:                -10631.
No. Observations:                1000   AIC:                         2.130e+04
Df Residuals:                     979   BIC:                         2.141e+04
Df Model:                          20
Covariance Type:            nonrobust
===================================================================================================
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
Intercept                        2.203e+04   1933.534     11.392      0.000    1.82e+04    2.58e+04
jobtitle[T.Driver]              -3928.9812   1447.166     -2.715      0.007   -6768.886   -1089.076
jobtitle[T.Financial Analyst]    3417.7090   1388.276      2.462      0.014     693.370    6142.048
jobtitle[T.Graphic Designer]    -2457.6992   1420.886     -1.730      0.084   -5246.031     330.633
jobtitle[T.IT]                  -2149.7022   1427.414     -1.506      0.132   -4950.846     651.442
jobtitle[T.Manager]               3.16e+04   1471.445     21.476      0.000    2.87e+04    3.45e+04
jobtitle[T.Marketing Associate] -1.701e+04   1385.795    -12.277      0.000   -1.97e+04   -1.43e+04
jobtitle[T.Sales Associate]       263.4456   1435.261      0.184      0.854   -2553.096    3079.988
jobtitle[T.Software Engineer]    1.339e+04   1413.182      9.473      0.000    1.06e+04    1.62e+04
jobtitle[T.Warehouse Associate]  -564.0171   1452.967     -0.388      0.698   -3415.305    2287.271
education[T.High School]        -1435.1693    908.773     -1.579      0.115   -3218.536     348.198
education[T.Masters]             4717.9971    914.421      5.160      0.000    2923.546    6512.448
education[T.PhD]                 6026.2867    929.705      6.482      0.000    4201.842    7850.731
department[T.Engineering]        3267.9358   1036.509      3.153      0.002    1233.901    5301.970
department[T.Management]         2957.8290   1033.031      2.863      0.004     930.619    4985.039
department[T.Operations]         -481.4968   1014.616     -0.475      0.635   -2472.570    1509.577
department[T.Sales]              6193.4962   1020.679      6.068      0.000    4190.526    8196.466
gender[T.Male]                    392.3244    715.798      0.548      0.584   -1012.351    1797.000
age                               948.9464     22.527     42.126      0.000     904.740     993.152
performance                      1156.8797    228.188      5.070      0.000     709.085    1604.674
seniority                        9903.7103    231.490     42.782      0.000    9449.436    1.04e+04
==============================================================================
Omnibus:                        3.500   Durbin-Watson:                   1.982
Prob(Omnibus):                  0.174   Jarque-Bera (JB):                3.555
Skew:                          -0.130   Prob(JB):                        0.169
Kurtosis:                       2.866   Cond. No.                         453.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
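To see where the t statistic and confidence interval for gender[T.Male] come from, the sketch below recomputes them from the coefficient and standard error printed in the lm4 summary. It uses the normal approximation (1.96) for the 95% interval, which is an assumption made here for simplicity: statsmodels uses the t distribution with 979 residual degrees of freedom, so the reported bounds (-1012.351, 1797.000) differ slightly.

```python
# Values copied from the gender[T.Male] row of the lm4 summary.
coef = 392.3244
std_err = 715.798

# The t statistic is the coefficient measured in units of its standard error.
t_stat = coef / std_err

# Approximate 95% confidence interval using the normal critical value.
# Assumption: 1.96 stands in for the t critical value with 979 df.
z = 1.96
ci_low = coef - z * std_err
ci_high = coef + z * std_err

print(f"t statistic: {t_stat:.3f}")
print(f"Approximate 95% CI: ({ci_low:.1f}, {ci_high:.1f})")
# An interval that contains zero means the data cannot rule out a zero gap.
print("Interval contains zero:", ci_low < 0 < ci_high)
```

Because the interval straddles zero, the adjusted gender gap is not statistically distinguishable from zero at the 5% level, which matches the reported p-value of 0.584.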
We used linear regression techniques to determine whether or not gender-based pay discrimination existed within your organization. We modeled the effect of several independent variables (in this case, seniority, age, performance, and job title) to explain the observed variation in a dependent variable (in this case, pay). We looked at the $R$-squared statistic of our linear models to help us measure what percentage of the observed variation in pay is explained by the independent variables.
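As a minimal sketch of what the $R$-squared statistic measures, the code below computes it from scratch as 1 - SS_res / SS_tot on a tiny made-up example, and then applies the standard adjusted $R$-squared formula to the values reported for the full model (R-squared 0.841, 1000 observations, 20 predictors). The toy data and the function names are hypothetical and only serve to illustrate the formulas.

```python
def r_squared(y, y_hat):
    """Share of the variation in y explained by the predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total variation
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical toy data: observed values and model predictions.
y_toy = [1.0, 2.0, 3.0, 4.0]
y_hat_toy = [1.1, 1.9, 3.2, 3.8]
r2_toy = r_squared(y_toy, y_hat_toy)
print(f"Toy R-squared: {r2_toy:.2f}")

# Values reported in the lm4 summary: R-squared 0.841, n = 1000, Df Model = 20.
adj_r2_full = adjusted_r_squared(0.841, 1000, 20)
print(f"Adjusted R-squared for the full model: {adj_r2_full:.3f}")
```

The adjusted value agrees with the Adj. R-squared of 0.838 shown in the full-model summary.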
We saw that the difference in average pay between men and women is about 8500 USD in these data. However, this difference shrank to about 400 USD and is statistically indistinguishable from zero ($p$-value = 0.584) after controlling for the other factors correlated with pay. Nevertheless, a deeper exploration of the data suggested that women are disproportionately over-represented in lower-paying jobs, while men are disproportionately over-represented in higher-paying jobs.
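One possible way to check both claims is sketched below: a group mean for the raw gap, and a cross-tabulation of job title by gender for the job mix. To keep the example self-contained it uses only the ten rows displayed earlier in the notebook, so the numbers it prints are not the notebook's results and are far too few to be representative; on the real data the same function would be called with the full Data frame. The function name gap_and_job_mix is hypothetical.

```python
import pandas as pd

# Assumption: only the ten rows shown in the notebook output (head and tail of Data).
rows = [
    ("Graphic Designer", "Female", 52301),
    ("Software Engineer", "Male", 119604),
    ("Warehouse Associate", "Female", 99476),
    ("Software Engineer", "Male", 118234),
    ("Graphic Designer", "Male", 108783),
    ("Marketing Associate", "Female", 65914),
    ("Data Scientist", "Male", 112544),
    ("Financial Analyst", "Male", 95071),
    ("Financial Analyst", "Male", 99601),
    ("Financial Analyst", "Male", 125352),
]
sample = pd.DataFrame(rows, columns=["jobtitle", "gender", "pay"])


def gap_and_job_mix(df):
    """Return the raw male-female gap in mean pay and each gender's job mix."""
    mean_pay = df.groupby("gender")["pay"].mean()
    raw_gap = mean_pay["Male"] - mean_pay["Female"]
    # Share of each gender working in each job title (each column sums to 1).
    job_share = pd.crosstab(df["jobtitle"], df["gender"], normalize="columns")
    return raw_gap, job_share


raw_gap, job_share = gap_and_job_mix(sample)
print(f"Raw gap in mean pay on these ten rows: {raw_gap:,.0f} USD")
print(job_share.round(2))
```

Comparing the job_share columns against the job-title coefficients in the full model would indicate whether one gender is concentrated in the lower-paying titles.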
Therefore, an investigation into the hiring, promotion, and job placement practices for men and women is warranted. In your report to the human resources department, you should ask them to look into the following questions:
In this case, you learned how to leverage your exploratory data analysis skills to build an effective linear model that accounted for several factors related to the outcome of interest (pay). Fundamentally, we learned that:
These days, the media constantly highlight the most advanced machine learning algorithms, such as neural networks. It is important that you recognize the immense value of linear regression, particularly for its inference capabilities and interpretability. While neural networks can outperform linear regression on certain tasks, they are much more of a black box, and understanding how the data drive a model's behavior is extremely important in most business settings.