The objective of the study is to identify the factors that affect obesity the most. To achieve this, three approaches are used: PCA, classification, and regression.
- PCA is used to identify the factors that contribute the most to obesity.
- Classification is used to identify the most important predictors of obesity.
- Regression is used to identify the factors that contribute the most to an individual's BMI.
To interpret the results, the SHAP approach is used in both classification and regression. SHAP values are used to explain the contribution of each feature to the target variable, providing a deeper understanding of the contribution of each feature to obesity.
Overall, these three approaches work together to identify the most important factors that affect obesity. PCA identifies the most important features, classification identifies the most important predictors of obesity, and regression identifies the factors that contribute the most to an individual's BMI. The interpretation of the results using the SHAP approach provides a more detailed understanding of the contribution of each feature to obesity.
This dataset include data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition. 77% of the data was generated synthetically using the Weka tool and the SMOTE filter, 23% of the data was collected directly from users through a web platform.
The data contains 17 attributes and 2111 records, the records are labeled with the class variable NObesity (Obesity Level), that allows classification of the data using the values of:
Feature Name | Description | Label | Variable Type |
---|---|---|---|
Gender | The gender of the individual | Male or Female | Categorical |
Age | The age of the individual in years | Years | Numerical |
Height | The height of the individual in centimeters | Centimeters | Numerical |
Weight | The weight of the individual in kilograms | Kilograms | Numerical |
Family history with overweight | Whether the individual has a family history of overweight | Yes or No | Categorical |
FAVC (Frequent consumption of high caloric food) | Whether the individual frequently consumes high-calorie foods | Yes or No | Categorical |
FCVC (Frequency of consumption of vegetables) | How frequently the individual consumes vegetables | Never, 1-2 days, 3-4 days, or every day | Ordinal |
NCP (Number of main meals) | The number of main meals that the individual consumes in a day | 1, 2, 3, 4, or 5 | Ordinal |
CAEC (Consumption of food between meals) | Whether the individual consumes food between meals | No, Sometimes, or Frequently | Ordinal |
SMOKE (Smoking habit) | Whether the individual smokes or not | Yes or No | Categorical |
CH2O (Daily water consumption) | The daily water consumption of the individual in liters | Liters | Ordinal |
SCC (Calories consumption monitoring) | Whether the individual monitors their calorie consumption | Yes or No | Categorical |
FAF (Physical activity frequency) | How frequently the individual engages in physical activity | None, 1-2 days, 3-4 days, or 5 or more days | Ordinal |
TUE (Time using technology devices) | The amount of time the individual spends using technology devices | 0, 1-2 hours, 3-4 hours, 5 or more hours | Ordinal |
CALC (Consumption of alcohol) | Whether the individual consumes alcohol | No, Sometimes, or Frequently | Ordinal |
MTRANS (Transportation used) | The mode of transportation used by the individual | Automobile, Motorbike, Bike, Public transportation, Walking, or Other | Categorical |
NObeyesdad | The obesity level of the individual based on BMI calculation | Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, or Obesity Type III | Categorical |
#General
import pandas as pd
import numpy as np
#Preprocessing
from sklearn import feature_selection
from sklearn import model_selection
from sklearn.model_selection import train_test_split, KFold, cross_val_predict, StratifiedKFold, cross_validate, cross_val_score
from sklearn.feature_selection import RFECV
from sklearn.preprocessing import StandardScaler
#Models
import xgboost as xgb
import lightgbm as lgb
from sklearn import svm, tree, linear_model, neighbors, naive_bayes, ensemble, gaussian_process, neural_network
from sklearn.decomposition import PCA
#Model_Evaluation
import optuna
import shap
from optuna.pruners import MedianPruner
from sklearn.metrics import roc_curve, auc, make_scorer, accuracy_score, precision_score, recall_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import f1_score, plot_confusion_matrix, classification_report, average_precision_score
#Visualization
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.express as px
from plotly.subplots import make_subplots
#Setting_Parameters
import time
import warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
from IPython.display import display, Markdown
def bold(string):
display(Markdown(string))
simplefilter("ignore", category=ConvergenceWarning)
optuna.logging.set_verbosity(optuna.logging.WARNING)
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', 'Solver terminated early.*')
sns.set()
style.use('ggplot')
pd.set_option("display.max_rows", 100)
%matplotlib inline
plt.rcParams["figure.figsize"] = (8, 6)
The Shapely GEOS version (3.9.1-CAPI-1.14.2) is incompatible with the GEOS version PyGEOS was compiled with (3.10.4-CAPI-1.16.2). Conversions between both will be slow.
df = pd.read_csv('/kaggle/input/obesity-levels/ObesityDataSet_raw_and_data_sinthetic.csv')
df.head()
Gender | Age | Height | Weight | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | NObeyesdad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Female | 21.0 | 1.62 | 64.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 0.0 | 1.0 | no | Public_Transportation | Normal_Weight |
1 | Female | 21.0 | 1.52 | 56.0 | yes | no | 3.0 | 3.0 | Sometimes | yes | 3.0 | yes | 3.0 | 0.0 | Sometimes | Public_Transportation | Normal_Weight |
2 | Male | 23.0 | 1.80 | 77.0 | yes | no | 2.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 1.0 | Frequently | Public_Transportation | Normal_Weight |
3 | Male | 27.0 | 1.80 | 87.0 | no | no | 3.0 | 3.0 | Sometimes | no | 2.0 | no | 2.0 | 0.0 | Frequently | Walking | Overweight_Level_I |
4 | Male | 22.0 | 1.78 | 89.8 | no | no | 2.0 | 1.0 | Sometimes | no | 2.0 | no | 0.0 | 0.0 | Sometimes | Public_Transportation | Overweight_Level_II |
df.describe(include=np.object)
Gender | family_history_with_overweight | FAVC | CAEC | SMOKE | SCC | CALC | MTRANS | NObeyesdad | |
---|---|---|---|---|---|---|---|---|---|
count | 2111 | 2111 | 2111 | 2111 | 2111 | 2111 | 2111 | 2111 | 2111 |
unique | 2 | 2 | 2 | 4 | 2 | 2 | 4 | 5 | 7 |
top | Male | yes | yes | Sometimes | no | no | Sometimes | Public_Transportation | Obesity_Type_I |
freq | 1068 | 1726 | 1866 | 1765 | 2067 | 2015 | 1401 | 1580 | 351 |
df.describe()
Age | Height | Weight | FCVC | NCP | CH2O | FAF | TUE | |
---|---|---|---|---|---|---|---|---|
count | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 | 2111.000000 |
mean | 24.312600 | 1.701677 | 86.586058 | 2.419043 | 2.685628 | 2.008011 | 1.010298 | 0.657866 |
std | 6.345968 | 0.093305 | 26.191172 | 0.533927 | 0.778039 | 0.612953 | 0.850592 | 0.608927 |
min | 14.000000 | 1.450000 | 39.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 |
25% | 19.947192 | 1.630000 | 65.473343 | 2.000000 | 2.658738 | 1.584812 | 0.124505 | 0.000000 |
50% | 22.777890 | 1.700499 | 83.000000 | 2.385502 | 3.000000 | 2.000000 | 1.000000 | 0.625350 |
75% | 26.000000 | 1.768464 | 107.430682 | 3.000000 | 3.000000 | 2.477420 | 1.666678 | 1.000000 |
max | 61.000000 | 1.980000 | 173.000000 | 3.000000 | 4.000000 | 3.000000 | 3.000000 | 2.000000 |
cols = {"Age":40,"FCVC": 3, "NCP": 4, "CH2O": 3, "FAF": 4, "TUE": 3}
for col,uniqs in cols.items():
bold(f'**{col}** has **{len(df[col].unique())}** unique values instead of **{uniqs}** ')
Age has 1402 unique values instead of 40
FCVC has 810 unique values instead of 3
NCP has 635 unique values instead of 4
CH2O has 1268 unique values instead of 3
FAF has 1190 unique values instead of 4
TUE has 1129 unique values instead of 3
#Rounding floats to integers
for col, uniqs in cols.items():
df[col] = round(df[col]).astype('int')
# Checking again
for col,uniqs in cols.items():
bold(f'**{col}** has **{len(df[col].unique())}** unique values ')
Age has 40 unique values
FCVC has 3 unique values
NCP has 4 unique values
CH2O has 3 unique values
FAF has 4 unique values
TUE has 3 unique values
This section will inspect the distributions of all the features and their impact on the target variable.
hist_data = [df['Age']]
group_labels = ['Age']
fig = ff.create_distplot(hist_data, group_labels, show_hist=False, show_rug=False)
# Customize the plot
#fig.update_traces(line=dict(color='royalblue', width=2))
fig.update_layout(
title=dict(text='Age Distribution (KDE)', x=0.5, font=dict(size=24)),
xaxis=dict(title='Age', title_font=dict(size=18), tickfont=dict(size=14)),
yaxis=dict(title='Density', title_font=dict(size=18), tickfont=dict(size=14)),
template='plotly_white',
showlegend=False
)
fig.show()
attributes = df['NObeyesdad'].unique()
# Prepare the data
kde_data = []
for attr in attributes:
kde = neighbors.KernelDensity(kernel='gaussian', bandwidth=1)
data = df[df['NObeyesdad'] == attr]['Age'].values
kde.fit(data[:, np.newaxis])
X_plot = np.linspace(df['Age'].min(), df['Age'].max(), 1000)[:, np.newaxis]
log_dens = kde.score_samples(X_plot)
kde_data.append((attr,X_plot, np.exp(log_dens)))
# Create the plot
fig = go.Figure()
for attr, X_plot, density in kde_data:
fig.add_trace(go.Scatter(x=X_plot[:, 0], y=density, mode='lines', name=attr))
# Customize the plot
fig.update_layout(
title=dict(text='Age Distribution (KDE) by NObeyesdad', x=0.5, font=dict(size=24)),
xaxis=dict(title='Age', title_font=dict(size=18), tickfont=dict(size=14)),
yaxis=dict(title='Density', title_font=dict(size=18), tickfont=dict(size=14)),
template='plotly_white',
legend=dict(title='NObeyesdad', font=dict(size=14))
)
fig.show()
obesity_order = ['Insufficient_Weight', 'Normal_Weight', 'Overweight_Level_I',
'Overweight_Level_II', 'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III']
fig = px.box(df, x="NObeyesdad", y="Age", category_orders={'NObeyesdad': obesity_order})
# Update the color of the box and whiskers
fig.update_traces(#boxmean=True,
jitter=0,
#whiskerwidth=0.2,
marker=dict(color='#2a9d8f', size=3),
line=dict(width=1.2)
)
# Update the layout
fig.update_layout(
title=dict(text='Box Plot of Age by Obesity Status', x=0.5, font=dict(size=24)),
xaxis_title="Obesity Status",
yaxis_title="Age",
template='plotly_white',
width=800, # increase the width of the plot
height=500, # increase the height of the plot
)
fig.show()
We can clearly see that "AGE" can be an important factor in classifying obesity
columns = ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SMOKE',
'SCC', 'CALC', 'MTRANS', 'NObeyesdad']
for col in columns:
df[col] = df[col].astype('category')
# Define a list of colors for the bars
#colors = ['#264653', '#2a9d8f', '#e9c46a', '#f4a261', '#e76f51', '#023047', '#8ecae6', '#fefae0', '#dda15e']
# Create a 3x3 grid of subplots
fig = make_subplots(rows=3, cols=3, subplot_titles=columns)
# Add bar plots for each categorical column
for idx, col in enumerate(columns):
row, col_idx = divmod(idx, 3)
counts = df[col].value_counts()
fig.add_trace(go.Bar(x=counts.index, y=counts.values, name=col, showlegend=False, marker=dict(color='rgb(158,202,225)')), row=row+1, col=col_idx+1)
# Customize the plot
fig.update_layout(
title=dict(text='Frequency Plots of Categorical Columns', x=0.5, font=dict(size=24)),
template='plotly_white',
height=1000,
width=1000,
)
# Update axis titles and font size
for i in range(1, 10):
fig.update_xaxes(title_text=columns[i-1], title_font=dict(size=12), tickfont=dict(size=9), row=divmod(i-1, 3)[0]+1, col=divmod(i-1, 3)[1]+1)
fig.update_yaxes(title_text='Frequency', title_font=dict(size=12), tickfont=dict(size=9), row=divmod(i-1, 3)[0]+1, col=divmod(i-1,3)[1]+1)
# Show the plot
fig.show()
columns = ["FCVC", "NCP", "CH2O", "FAF", "TUE"]
# Create a 2x3 grid of subplots
fig = make_subplots(rows=2, cols=3, subplot_titles=columns)
# Add bar plots for each categorical column
for idx, col in enumerate(columns):
row, col_idx = divmod(idx, 3)
counts = df[col].value_counts()
fig.add_trace(go.Bar(x=counts.index, y=counts.values, name=col, showlegend=False), row=row+1, col=col_idx+1)
# Change the color of each bar plot
fig.update_traces(marker_color='rgb(158,202,225)', row=row+1, col=col_idx+1)
# Customize the plot
fig.update_layout(
title=dict(text='Frequency Plots of Ordinal Columns', x=0.5, font=dict(size=24)),
template='plotly_white',
height=800,
width=1000,
showlegend=True,
)
# Update axis titles and font size
for i in range(1, 6):
fig.update_xaxes(title_text=columns[i-1], title_font=dict(size=12), tickfont=dict(size=10), row=divmod(i-1, 3)[0]+1, col=divmod(i-1, 3)[1]+1)
fig.update_yaxes(title_text='Frequency', title_font=dict(size=12), tickfont=dict(size=10), row=divmod(i-1, 3)[0]+1, col=divmod(i-1,3)[1]+1)
# Show the plot
fig.show()
PCA (Principal Component Analysis) is a statistical technique used to reduce the complexity of high-dimensional data by identifying the most important variables that explain the variance in the dataset.
In the context of this study, PCA is used to identify the factors that contribute the most to obesity. By analyzing the relationships among the different features in the dataset, PCA can identify which features are most strongly correlated with obesity and which ones can be disregarded.
X = df.drop(["Weight", "Height", "NObeyesdad"], axis=1)
We firstly dropped our target variable NObeyesdad, and since it has been infered from the Height and Weight column using BMI calculation, we dropped them too
# Scale numerical variables
scaler = StandardScaler()
X_toScale = X.select_dtypes(include=['float64', 'int64'])
X_scaled = pd.DataFrame(scaler.fit_transform(X_toScale), columns=X_toScale.columns)
X_scaled.head()
Age | FCVC | NCP | CH2O | FAF | TUE | |
---|---|---|---|---|---|---|
0 | -0.521741 | -0.725454 | 0.385644 | -0.021330 | -1.124415 | 0.497717 |
1 | -0.521741 | 0.987559 | 0.385644 | 1.431202 | 2.226606 | -0.986295 |
2 | -0.207057 | -0.725454 | 0.385644 | -0.021330 | 1.109599 | 0.497717 |
3 | 0.422312 | 0.987559 | 0.385644 | -0.021330 | 1.109599 | -0.986295 |
4 | -0.364399 | -0.725454 | -2.085053 | -0.021330 | -1.124415 | -0.986295 |
We scaled the numerical variables of our dataset to improve the results of PCA.
dummies_cols = X[["Gender","family_history_with_overweight",
"FAVC","CAEC","SMOKE","SCC","CALC","MTRANS"]]
X_dummies = pd.get_dummies(dummies_cols,columns=["Gender","family_history_with_overweight",
"FAVC","CAEC","SMOKE","SCC","CALC","MTRANS"], drop_first=True)
X_dummies.head()
Gender_Male | family_history_with_overweight_yes | FAVC_yes | CAEC_Frequently | CAEC_Sometimes | CAEC_no | SMOKE_yes | SCC_yes | CALC_Frequently | CALC_Sometimes | CALC_no | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
As for our categorical variables, they have been encoded using one hot encoding and turned into dummy variables, it should be noted that we dropped one attribute from each feature to prevent Multicolinearity
# Combine the encoded categorical and scaled numerical features
X_final = pd.concat([X_scaled, X_dummies], axis=1)
X_final.head()
Age | FCVC | NCP | CH2O | FAF | TUE | Gender_Male | family_history_with_overweight_yes | FAVC_yes | CAEC_Frequently | ... | CAEC_no | SMOKE_yes | SCC_yes | CALC_Frequently | CALC_Sometimes | CALC_no | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.521741 | -0.725454 | 0.385644 | -0.021330 | -1.124415 | 0.497717 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | -0.521741 | 0.987559 | 0.385644 | 1.431202 | 2.226606 | -0.986295 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | -0.207057 | -0.725454 | 0.385644 | -0.021330 | 1.109599 | 0.497717 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 0.422312 | 0.987559 | 0.385644 | -0.021330 | 1.109599 | -0.986295 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | -0.364399 | -0.725454 | -2.085053 | -0.021330 | -1.124415 | -0.986295 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
5 rows × 21 columns
# Perform PCA
pca = PCA()
pca.fit(X_final)
PCA()
# Get the explained variance ratio for each component
explained_variance_ratio = pca.explained_variance_ratio_
# Print the explained variance ratio for each component
for i, ratio in enumerate(explained_variance_ratio):
print(f'Explained variance ratio of component {i+1}: {ratio:.2%}')
Explained variance ratio of component 1: 18.71% Explained variance ratio of component 2: 15.99% Explained variance ratio of component 3: 13.71% Explained variance ratio of component 4: 12.37% Explained variance ratio of component 5: 11.61% Explained variance ratio of component 6: 9.86% Explained variance ratio of component 7: 5.22% Explained variance ratio of component 8: 3.33% Explained variance ratio of component 9: 2.66% Explained variance ratio of component 10: 1.70% Explained variance ratio of component 11: 1.36% Explained variance ratio of component 12: 1.17% Explained variance ratio of component 13: 0.62% Explained variance ratio of component 14: 0.52% Explained variance ratio of component 15: 0.44% Explained variance ratio of component 16: 0.27% Explained variance ratio of component 17: 0.25% Explained variance ratio of component 18: 0.09% Explained variance ratio of component 19: 0.07% Explained variance ratio of component 20: 0.04% Explained variance ratio of component 21: 0.00%
# cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
# create trace for line with markers
trace = go.Scatter(x=list(range(1, len(cumulative_variance_ratio)+1)),
y=cumulative_variance_ratio,
mode='lines+markers',
marker=dict(size=8, color='#2a9d8f'),
line=dict(width=1.2)
)
# create layout
layout = go.Layout(title=dict(text='Cumulative Explained Variance Ratio', x=0.5, font=dict(size=24)),
xaxis=dict(title='Number of Components'),
yaxis=dict(title='Cumulative Explained Variance Ratio'),
template='plotly_white',
width=800, # increase the width of the plot
height=500, # increase the height of the plot
)
# create figure
fig = go.Figure(data=[trace], layout=layout)
# show figure
fig.show()
We can see that the first 5 components explain more than 70% of the variablility.
# Look at loadings for each component
loadings = pd.DataFrame(pca.components_, columns=X_final.columns)
# Determine components with highest explained variance ratios
variances = pca.explained_variance_ratio_
highest_variances = np.argsort(variances)[::-1][:5]
print("Highest variance components: ", highest_variances)
# Look at variables with highest loadings for each of the most important components
for component in highest_variances:
top_variables = loadings.iloc[component].sort_values(ascending=False)[:5]
print(f"\nComponent {component} - Top Variables:\n{top_variables}")
# Consider relationships between prominent variables and outcome variable
Highest variance components: [0 1 2 3 4] Component 0 - Top Variables: Age 0.652307 FCVC 0.077632 family_history_with_overweight_yes 0.036282 CALC_Sometimes 0.029018 CAEC_Sometimes 0.028276 Name: 0, dtype: float64 Component 1 - Top Variables: CH2O 0.565114 NCP 0.452785 FAF 0.425349 FCVC 0.344828 Age 0.181481 Name: 1, dtype: float64 Component 2 - Top Variables: NCP 0.331819 Age 0.281140 FAF 0.234547 Gender_Male 0.195991 CALC_no 0.036073 Name: 2, dtype: float64 Component 3 - Top Variables: CH2O 0.659154 FAF 0.086353 Gender_Male 0.073348 CAEC_no 0.035755 CALC_no 0.029883 Name: 3, dtype: float64 Component 4 - Top Variables: FAF 0.722431 FCVC 0.151864 CALC_no 0.149160 Age 0.052581 CAEC_Frequently 0.041599 Name: 4, dtype: float64
# Get the loadings of the significant components
loadings = pca.components_
# Create a DataFrame to store the loadings
loadings_df = pd.DataFrame(loadings, columns=X_final.columns)
# Identify the variables with the highest loadings in each component
component_vars = []
for i in range(len(loadings)):
component_vars.append(list(loadings_df.iloc[i].nlargest(5).index))
# Determine the cumulative explained variance ratio for each component
cumulative_var_ratio = np.cumsum(pca.explained_variance_ratio_)
# Create a summary table
summary_df = pd.DataFrame({
'Component': range(1, len(loadings)+1),
'Explained Variance Ratio': pca.explained_variance_ratio_,
'Cumulative Explained Variance Ratio': cumulative_var_ratio,
'Top Variables': component_vars
})
# Print the summary table
summary_df.head(6)
Component | Explained Variance Ratio | Cumulative Explained Variance Ratio | Top Variables | |
---|---|---|---|---|
0 | 1 | 0.187095 | 0.187095 | [Age, FCVC, family_history_with_overweight_yes... |
1 | 2 | 0.159892 | 0.346987 | [CH2O, NCP, FAF, FCVC, Age] |
2 | 3 | 0.137135 | 0.484122 | [NCP, Age, FAF, Gender_Male, CALC_no] |
3 | 4 | 0.123735 | 0.607857 | [CH2O, FAF, Gender_Male, CAEC_no, CALC_no] |
4 | 5 | 0.116068 | 0.723925 | [FAF, FCVC, CALC_no, Age, CAEC_Frequently] |
5 | 6 | 0.098592 | 0.822517 | [TUE, Age, FCVC, family_history_with_overweigh... |
Based on the top variables for each component, we can conclude the following:
These results suggest that Age, CH2O, FAF, FCVC, NCP, Gender_Male, family_history_with_overweight_yes are the most important variables in determining obesity status in the dataset. However, it is important to note that the interpretation of PCA results can be challenging and should be interpreted with caution. Further analysis and experimentation may be required to validate these findings.
Classification is a machine learning technique that involves predicting the class of an object based on its features.
In this study, multiple classification algorithms are used on the 'NObeyesdad' feature as the target variable. By training different classifiers on the dataset, the study aims to identify the most important features that predict obesity.
To interpret the results of the classification, the SHAP (SHapley Additive exPlanations) approach is used. SHAP values are used to explain the contribution of each feature to the classification decision. This helps to understand which features are most important in predicting obesity.
X = df.drop(["Weight", "Height", "NObeyesdad"], axis=1)
y = df["NObeyesdad"]
# Create a dictionary to map obesity status to a numerical value based on obesity severity
obesity_status_mapping = {
'Insufficient_Weight': 0,
'Normal_Weight': 1,
'Overweight_Level_I': 2,
'Overweight_Level_II': 3,
'Obesity_Type_I': 4,
'Obesity_Type_II': 5,
'Obesity_Type_III': 6
}
# Map the obesity status to numerical values based on obesity severity
y = y.map(obesity_status_mapping)
y.head()
0 1 1 1 2 1 3 2 4 3 Name: NObeyesdad, dtype: category Categories (7, int64): [0, 1, 4, 5, 6, 2, 3]
dummies_cols = X[["Gender","family_history_with_overweight",
"FAVC","CAEC","SMOKE","SCC","CALC","MTRANS"]]
X_dummies = pd.get_dummies(dummies_cols,columns=["Gender","family_history_with_overweight",
"FAVC","CAEC","SMOKE","SCC","CALC","MTRANS"], drop_first=True)
X_num = X.select_dtypes(include=['float64', 'int64'])
X = pd.concat([X_num, X_dummies], axis=1)
X.head()
Age | FCVC | NCP | CH2O | FAF | TUE | Gender_Male | family_history_with_overweight_yes | FAVC_yes | CAEC_Frequently | ... | CAEC_no | SMOKE_yes | SCC_yes | CALC_Frequently | CALC_Sometimes | CALC_no | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21 | 2 | 3 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 21 | 3 | 3 | 3 | 3 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 23 | 2 | 3 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 27 | 3 | 3 | 2 | 2 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 22 | 2 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
5 rows × 21 columns
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=23, stratify=y)
bold(f' X_train: {X_train.shape[0]} | y_train: {y_train.shape[0]}')
bold(f' X_test : {X_test.shape[0]} | y_test: {y_test.shape[0]}')
X_train: 1688 | y_train: 1688
X_test : 423 | y_test: 423
#Scaling the numerical columns of train set and test set separatly to prevent data leakage
cols = ["Age", "FCVC", "NCP", "CH2O", "FAF", "TUE"]
std_scaler = StandardScaler()
X_train[cols] = std_scaler.fit_transform(X_train[cols])
X_test[cols] = std_scaler.transform(X_test[cols])
bold(f' X_train: {X_train.shape[0]}')
bold(f' X_test : {X_test.shape[0]}')
X_train: 1688
X_test : 423
We chose to scale the train and test sets separately to prevent data leakage
%%time
#Initializing the Folds, the Classifier and the RFECV algorithm
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)
clf = ensemble.RandomForestClassifier(n_jobs=-1, random_state=23)
rfecv = RFECV(estimator=clf, step=1, cv=cv, scoring='accuracy', verbose=1, n_jobs=-1) #10-fold cross-validation
rfecv = rfecv.fit(X_train, y_train)
cols = X_train.columns[rfecv.support_]
bold(f'Optimal number of features is **{rfecv.n_features_}** from {X_train.shape[1]} available')
Fitting estimator with 21 features. Fitting estimator with 20 features. Fitting estimator with 19 features. Fitting estimator with 18 features. Fitting estimator with 17 features.
Optimal number of features is 16 from 21 available
CPU times: user 2.65 s, sys: 496 ms, total: 3.15 s Wall time: 48.7 s
We performed RFECV to chose the best feature subset for our classification problem, this resulted in keeping 16 features out of the 21 original features.
keep_cols = X_train.columns[rfecv.support_]
features_dropped = [x for x in X_train.columns if x not in keep_cols]
X_train = X_train[keep_cols]
X_test = X_test[keep_cols]
bold('**Features dropped** :\n')
for col in features_dropped:
print(col)
print('\n')
bold('**Features kept** :\n')
for col in keep_cols:
print(col)
Features dropped :
CAEC_no SMOKE_yes MTRANS_Bike MTRANS_Motorbike MTRANS_Walking
Features kept :
Age FCVC NCP CH2O FAF TUE Gender_Male family_history_with_overweight_yes FAVC_yes CAEC_Frequently CAEC_Sometimes SCC_yes CALC_Frequently CALC_Sometimes CALC_no MTRANS_Public_Transportation
%%time
random_state = 23
# create a list of classifiers
classifiers = [
tree.DecisionTreeClassifier(random_state=random_state),
ensemble.RandomForestClassifier(random_state=random_state),
linear_model.LogisticRegression(random_state=random_state),
ensemble.AdaBoostClassifier(random_state=random_state),
svm.SVC(random_state=random_state),
svm.LinearSVC(random_state=random_state),
neighbors.KNeighborsClassifier(),
lgb.LGBMClassifier(random_state=random_state),
xgb.XGBClassifier(random_state=random_state),
neural_network.MLPClassifier(random_state=random_state)
]
# create a list of classifiers to evaluate
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)
# define the metrics to be computed
scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
#create table to compare MLA metrics
MLA_columns = ['Name','Accuracy','Recall', 'Precision', 'F1_score', 'Accuracy_std', 'Fit Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)
# perform cross-validation and compute metrics
row_index = 0
cv_results_list = []
for clf in classifiers:
cv_results = cross_validate(clf, X_train, y_train, cv=cv, scoring=scoring, n_jobs=-1)
cv_results_list.append(cv_results)
MLA_name = clf.__class__.__name__
MLA_compare.loc[row_index, 'Name'] = MLA_name
MLA_compare.loc[row_index, 'Accuracy'] = round(cv_results['test_accuracy'].mean(), 3)
MLA_compare.loc[row_index, 'Recall'] = round(cv_results['test_recall_macro'].mean(), 3)
MLA_compare.loc[row_index, 'Precision'] = round(cv_results['test_precision_macro'].mean(), 3)
MLA_compare.loc[row_index, 'F1_score'] = round(cv_results['test_f1_macro'].mean(), 3)
MLA_compare.loc[row_index, 'Accuracy_std'] = round(cv_results['test_accuracy'].std(), 3)
MLA_compare.loc[row_index, 'Fit Time'] = round(cv_results['fit_time'].sum(), 3)
print(f'{MLA_name} is done in {round(cv_results["fit_time"].sum(), 3)}')
row_index+=1
MLA_compare.sort_values(by = ['Accuracy'], ascending = False, inplace = True)
MLA_compare
DecisionTreeClassifier is done in 0.106 RandomForestClassifier is done in 3.767 LogisticRegression is done in 1.689 AdaBoostClassifier is done in 1.844 SVC is done in 1.757 LinearSVC is done in 4.587 KNeighborsClassifier is done in 0.045 LGBMClassifier is done in 17.841 XGBClassifier is done in 59.182 MLPClassifier is done in 66.039 CPU times: user 732 ms, sys: 38.2 ms, total: 770 ms Wall time: 48.8 s
Name | Accuracy | Recall | Precision | F1_score | Accuracy_std | Fit Time | |
---|---|---|---|---|---|---|---|
1 | RandomForestClassifier | 0.805 | 0.803 | 0.802 | 0.8 | 0.039 | 3.767 |
7 | LGBMClassifier | 0.797 | 0.795 | 0.793 | 0.792 | 0.03 | 17.841 |
8 | XGBClassifier | 0.795 | 0.792 | 0.791 | 0.788 | 0.034 | 59.182 |
9 | MLPClassifier | 0.731 | 0.728 | 0.721 | 0.72 | 0.034 | 66.039 |
0 | DecisionTreeClassifier | 0.728 | 0.723 | 0.72 | 0.719 | 0.029 | 0.106 |
4 | SVC | 0.723 | 0.719 | 0.715 | 0.71 | 0.035 | 1.757 |
6 | KNeighborsClassifier | 0.678 | 0.672 | 0.666 | 0.653 | 0.027 | 0.045 |
2 | LogisticRegression | 0.591 | 0.583 | 0.576 | 0.56 | 0.026 | 1.689 |
5 | LinearSVC | 0.571 | 0.561 | 0.564 | 0.522 | 0.023 | 4.587 |
3 | AdaBoostClassifier | 0.401 | 0.393 | 0.407 | 0.368 | 0.038 | 1.844 |
We applied a 10 classification algorithms using 10 fold cross-validation on our training set, the results show clear superiority of ensemble methods over other methods.
We chose RandomForestClassifier and LGBMClassifier for further analysis.
def objective(trial, mod, X_train, y_train, scorer= 'accuracy', folds=5):
"""
THIS IS THE OBJECTIVE FUNCTION THAT RETURNS THE MEAN SCORES OF THE MODEL TO OPTUNA
FOR HYPERPARAMETER OPTIMIZATION
"""
cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=23)
X = X_train.copy()
y = y_train.copy()
if mod == 'RF ':
n_estimators = trial.suggest_int('n_estimators', 50, 500)
max_depth = trial.suggest_int('max_depth', 5, 50)
min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)
max_features = trial.suggest_int('max_features', 1, 8)
model = ensemble.RandomForestClassifier(max_depth=max_depth,
n_estimators=n_estimators,
min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
max_features=max_features,
random_state=23)
elif mod == 'LGBM':
max_depth = trial.suggest_int("max_depth", 6, 20)
n_estimators = trial.suggest_int("n_estimators", 100, 1000)
learning_rate = trial.suggest_loguniform('learning_rate', 1e-8, 1)
num_leaves = trial.suggest_int("num_leaves", 2, 3000)
min_child_samples = trial.suggest_int('min_child_samples', 3, 200)
min_child_weight = trial.suggest_loguniform('min_child_weight', 1e-5, 1)
colsample_bytree = trial.suggest_uniform('colsample_bytree', 0.1, 1)
subsample = trial.suggest_uniform('subsample', 0.1, 1)
random_state = 23
model = lgb.LGBMClassifier(learning_rate=learning_rate,
n_estimators=n_estimators,
max_depth=max_depth,
num_leaves=num_leaves,
min_child_samples=min_child_samples,
min_child_weight=min_child_weight,
colsample_bytree=colsample_bytree,
subsample=subsample,
random_state=23)
elif mod == 'XGB ':
max_depth = trial.suggest_int("max_depth", 2, 20)
n_estimators = trial.suggest_int("n_estimators", 50, 1000)
learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 1)
gamma = trial.suggest_loguniform("gamma", 1e-5, 1)
subsample = trial.suggest_uniform("subsample", 0.1, 1.0)
colsample_bytree = trial.suggest_uniform("colsample_bytree", 0.1, 1.0)
model = xgb.XGBClassifier(learning_rate=learning_rate,
n_estimators=n_estimators,
max_depth=max_depth,
gamma=gamma,
subsample=subsample,
colsample_bytree=colsample_bytree,
random_state=23,
use_label_encoder=False,
eval_metric='mlogloss')
scores = cross_val_score(model, X, y, cv=cv, scoring=scorer, n_jobs=-1,
error_score=float("-inf"))
if len(scores) < folds:
raise TrialPruned()
return scores.mean()
from optuna.samplers import TPESampler
from optuna import TrialPruned
def tune(objective, mod, X_train, y_train, scorer, n_trials=25, pruner=None):
objective_func = lambda trial: objective(trial, mod, X_train, y_train, scorer)
study = optuna.create_study(direction="maximize", pruner=pruner, sampler=TPESampler(seed=23))
study.optimize(objective_func, n_trials=n_trials, show_progress_bar=True)
params = study.best_params
best_score = study.best_value
return params, best_score
%%time
import datetime
params = {'RF ':'',
#'LGBM':'',
#'XGB ':''
}
pruner = optuna.pruners.MedianPruner(n_warmup_steps=5, interval_steps=1)
for mod, param in params.items():
start_time = time.time()
tuned_params, top_score = tune(objective, mod, X_train, y_train, scorer='accuracy', n_trials=250,
pruner=pruner)
params[mod] = tuned_params
end_time = time.time()
time_spent = datetime.datetime.utcfromtimestamp(end_time-start_time)
time_str = time_spent.strftime('%H:%M:%S')
print(f'Best score: {top_score:.5f}')
print(f'| {mod} |{"-"*90}| {time_str} \n')
0%| | 0/250 [00:00<?, ?it/s]
Best score: 0.80687 | RF |------------------------------------------------------------------------------------------| 00:16:15 CPU times: user 20.3 s, sys: 544 ms, total: 20.9 s Wall time: 16min 15s
Following the results of the hyperparameter tuning, we chose RandomForestClassifier.
bold(f"the best parameters for our random forest are :")
print(params['RF '])
the best parameters for our random forest are :
{'n_estimators': 261, 'max_depth': 42, 'min_samples_split': 4, 'min_samples_leaf': 1, 'max_features': 3}
random_state = 23
# create a list of classifiers
classifiers = [ensemble.RandomForestClassifier(random_state=23)]
# create a list of classifiers to evaluate
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=23)
# define the metrics to be computed
scoring = ['accuracy', 'precision_macro', 'recall_macro', 'f1_macro']
# perform cross-validation and compute metrics
cv_results = cross_validate(classifiers[0].set_params(**params['RF ']), X_train, y_train, cv=cv, scoring=scoring, n_jobs=-1)
cv_results_list.append(cv_results)
cv_accuracy = round(cv_results['test_accuracy'].mean(), 3)
cv_recall = round(cv_results['test_recall_macro'].mean(), 3)
cv_precision = round(cv_results['test_precision_macro'].mean(), 3)
cv_f1_score = round(cv_results['test_f1_macro'].mean(), 3)
best_model = classifiers[0].fit(X_train, y_train)
y_pred = best_model.predict(X_test)
test_accuracy = round(accuracy_score(y_test, y_pred), 3)
test_recall = round(recall_score(y_test, y_pred, average='macro'), 3)
test_precision = round(precision_score(y_test, y_pred, average='macro'), 3)
test_f1_score = round(f1_score(y_test, y_pred, average='macro'), 3)
# comparison
print(f" | cv | test |")
print(f"accuracy: | {cv_accuracy} | {test_accuracy} |")
print(f"recall: | {cv_recall} | {test_recall} |")
print(f"precision: | {cv_precision} | {test_precision} |")
print(f"f1_score: | {cv_f1_score} | {test_f1_score} |")
| cv | test | accuracy: | 0.808 | 0.806 | recall: | 0.805 | 0.803 | precision: | 0.805 | 0.804 | f1_score: | 0.802 | 0.802 |
%%time
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_train)
CPU times: user 1min 44s, sys: 570 ms, total: 1min 45s Wall time: 1min 43s
class_names = {value: key for key, value in obesity_status_mapping.items()}
class_names
{0: 'Insufficient_Weight', 1: 'Normal_Weight', 2: 'Overweight_Level_I', 3: 'Overweight_Level_II', 4: 'Obesity_Type_I', 5: 'Obesity_Type_II', 6: 'Obesity_Type_III'}
shap.summary_plot(shap_values, X_train, plot_type='bar', class_names=class_names, max_display=10)
Regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables.
In this study, a 'BMI' variable is created by dividing an individual's weight in kilograms by the square of their height in meters. This variable is then used as the target variable for regression analysis to identify the factors that contribute to obesity.
To interpret the results of the regression analysis, the SHAP approach is again used. SHAP values are used to explain the contribution of each feature to the BMI of an individual. This helps to understand which features are most important in contributing to an individual's BMI.
X = df.drop(columns=["NObeyesdad"], axis=1)
X['BMI'] = df['Weight'] / df['Height']**2
X.drop(columns=["Weight", "Height"], axis=1, inplace=True)
X.head()
Gender | Age | family_history_with_overweight | FAVC | FCVC | NCP | CAEC | SMOKE | CH2O | SCC | FAF | TUE | CALC | MTRANS | BMI | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Female | 21 | yes | no | 2 | 3 | Sometimes | no | 2 | no | 0 | 1 | no | Public_Transportation | 24.386526 |
1 | Female | 21 | yes | no | 3 | 3 | Sometimes | yes | 3 | yes | 3 | 0 | Sometimes | Public_Transportation | 24.238227 |
2 | Male | 23 | yes | no | 2 | 3 | Sometimes | no | 2 | no | 2 | 1 | Frequently | Public_Transportation | 23.765432 |
3 | Male | 27 | no | no | 3 | 3 | Sometimes | no | 2 | no | 2 | 0 | Frequently | Walking | 26.851852 |
4 | Male | 22 | no | no | 2 | 1 | Sometimes | no | 2 | no | 0 | 0 | Sometimes | Public_Transportation | 28.342381 |
We created a BMI (Body Mass Index) feature using the following formula:
We then dropped the Height and Weight features.
# Split the data into features and target
y = X["BMI"]
X = X.drop("BMI", axis=1)
dummies_cols = X[["Gender","family_history_with_overweight",
"FAVC","CAEC","SMOKE","SCC","CALC","MTRANS"]]
X_dummies = pd.get_dummies(dummies_cols,columns=["Gender","family_history_with_overweight",
"FAVC","CAEC","SMOKE","SCC","CALC","MTRANS"], drop_first=True)
X_num = X.select_dtypes(include=['float64', 'int64'])
X = pd.concat([X_num, X_dummies], axis=1)
X.head()
Age | FCVC | NCP | CH2O | FAF | TUE | Gender_Male | family_history_with_overweight_yes | FAVC_yes | CAEC_Frequently | ... | CAEC_no | SMOKE_yes | SCC_yes | CALC_Frequently | CALC_Sometimes | CALC_no | MTRANS_Bike | MTRANS_Motorbike | MTRANS_Public_Transportation | MTRANS_Walking | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 21 | 2 | 3 | 2 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
1 | 21 | 3 | 3 | 3 | 3 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
2 | 23 | 2 | 3 | 2 | 2 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 27 | 3 | 3 | 2 | 2 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
4 | 22 | 2 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
5 rows × 21 columns
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=23)
bold(f' X_train: {X_train.shape[0]} | y_train: {y_train.shape[0]}')
bold(f' X_test : {X_test.shape[0]} | y_test: {y_test.shape[0]}')
X_train: 1688 | y_train: 1688
X_test : 423 | y_test: 423
#Scaling the numerical columns of train set and test set separatly to prevent data leakage
cols = ["Age", "FCVC", "NCP", "CH2O", "FAF", "TUE"]
std_scaler = StandardScaler()
X_train[cols] = std_scaler.fit_transform(X_train[cols])
X_test[cols] = std_scaler.transform(X_test[cols])
bold(f' X_train: {X_train.shape[0]}')
bold(f' X_test : {X_test.shape[0]}')
X_train: 1688
X_test : 423
%%time
random_state = 23
# create a list of regressors
regressors = [linear_model.LinearRegression(),
linear_model.Ridge(),
linear_model.Lasso(),
linear_model.ElasticNet(),
svm.SVR(),
ensemble.RandomForestRegressor(random_state=random_state),
ensemble.GradientBoostingRegressor(random_state=random_state),
xgb.XGBRegressor(random_state=random_state),
lgb.LGBMRegressor(random_state=random_state),
neural_network.MLPRegressor(random_state=random_state)]
# create a list of regressors to evaluate
cv = KFold(n_splits=10, shuffle=True, random_state=23)
# define the metrics to be computed
scoring = ['neg_mean_absolute_error', 'neg_mean_squared_error', 'r2']
# create table to compare MLA metrics
MLA_columns = ['Name','MAE','MAE_std','MSE','MSE_std','RMSE','RMSE_std','R-squared','R-squared_std','Fit Time']
MLA_compare = pd.DataFrame(columns = MLA_columns)
# perform cross-validation and compute metrics
row_index = 0
cv_results_list = []
for clf in regressors:
cv_results = cross_validate(clf, X_train, y_train, cv=cv, scoring=scoring, n_jobs=-1)
cv_results_list.append(cv_results)
MLA_name = clf.__class__.__name__
MLA_compare.loc[row_index, 'Name'] = MLA_name
MLA_compare.loc[row_index, 'MAE'] = round(-cv_results['test_neg_mean_absolute_error'].mean(), 3)
MLA_compare.loc[row_index, 'MAE_std'] = round(cv_results['test_neg_mean_absolute_error'].std(), 3)
MLA_compare.loc[row_index, 'MSE'] = round(-cv_results['test_neg_mean_squared_error'].mean(), 3)
MLA_compare.loc[row_index, 'MSE_std'] = round(cv_results['test_neg_mean_squared_error'].std(), 3)
MLA_compare.loc[row_index, 'RMSE'] = round(np.sqrt(-cv_results['test_neg_mean_squared_error'].mean()), 3)
MLA_compare.loc[row_index, 'RMSE_std'] = round(np.sqrt(cv_results['test_neg_mean_squared_error'].std()), 3)
MLA_compare.loc[row_index, 'R-squared'] = round(cv_results['test_r2'].mean(), 3)
MLA_compare.loc[row_index, 'R-squared_std'] = round(cv_results['test_r2'].std(), 3)
MLA_compare.loc[row_index, 'Fit Time'] = round(cv_results['fit_time'].sum(), 3)
print(f'{MLA_name} is done in {round(cv_results["fit_time"].sum(), 3)}')
row_index+=1
MLA_compare.sort_values(by = ['RMSE'], ascending = True, inplace = True)
MLA_compare
LinearRegression is done in 0.086 Ridge is done in 0.07 Lasso is done in 0.068 ElasticNet is done in 0.067 SVR is done in 5.731 RandomForestRegressor is done in 15.837 GradientBoostingRegressor is done in 4.941 XGBRegressor is done in 18.179 LGBMRegressor is done in 2.302 MLPRegressor is done in 50.145 CPU times: user 632 ms, sys: 23.7 ms, total: 656 ms Wall time: 28.2 s
Name | MAE | MAE_std | MSE | MSE_std | RMSE | RMSE_std | R-squared | R-squared_std | Fit Time | |
---|---|---|---|---|---|---|---|---|---|---|
5 | RandomForestRegressor | 2.013 | 0.136 | 10.288 | 1.9 | 3.208 | 1.378 | 0.839 | 0.036 | 15.837 |
8 | LGBMRegressor | 2.254 | 0.111 | 11.057 | 1.742 | 3.325 | 1.32 | 0.828 | 0.032 | 2.302 |
7 | XGBRegressor | 2.295 | 0.172 | 11.782 | 1.875 | 3.432 | 1.369 | 0.815 | 0.041 | 18.179 |
6 | GradientBoostingRegressor | 3.262 | 0.196 | 18.271 | 2.173 | 4.274 | 1.474 | 0.716 | 0.043 | 4.941 |
4 | SVR | 3.394 | 0.179 | 20.237 | 2.561 | 4.499 | 1.6 | 0.685 | 0.05 | 5.731 |
9 | MLPRegressor | 3.516 | 0.178 | 20.519 | 2.648 | 4.53 | 1.627 | 0.681 | 0.049 | 50.145 |
1 | Ridge | 4.689 | 0.186 | 34.039 | 2.727 | 5.834 | 1.651 | 0.471 | 0.054 | 0.07 |
0 | LinearRegression | 4.69 | 0.185 | 34.078 | 2.713 | 5.838 | 1.647 | 0.471 | 0.054 | 0.086 |
3 | ElasticNet | 5.978 | 0.397 | 51.477 | 5.696 | 7.175 | 2.387 | 0.207 | 0.02 | 0.067 |
2 | Lasso | 6.004 | 0.424 | 52.45 | 5.893 | 7.242 | 2.428 | 0.192 | 0.019 | 0.068 |
We applied a 10 classification algorithms using 10 fold cross-validation on our training set, the results show clear superiority of ensemble methods over other methods.
We chose RandomForestRegressor and LGBMRegressor for further analysis.
from sklearn.metrics import mean_squared_error
def objective(trial, mod, X_train, y_train, scorer= 'neg_mean_squared_error', folds=10):
"""
THIS IS THE OBJECTIVE FUNCTION THAT RETURNS THE MEAN SCORES OF THE MODEL TO OPTUNA
FOR HYPERPARAMETER OPTIMIZATION
"""
cv = KFold(n_splits=folds, shuffle=True, random_state=23)
X = X_train.copy()
y = y_train.copy()
if mod == 'RF ':
n_estimators = trial.suggest_int('n_estimators', 50, 500)
max_depth = trial.suggest_int('max_depth', 5, 50)
min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)
max_features = trial.suggest_int('max_features', 1, 8)
model = ensemble.RandomForestRegressor(max_depth=max_depth,
n_estimators=n_estimators,
min_samples_split=min_samples_split,
min_samples_leaf=min_samples_leaf,
max_features=max_features,
random_state=23)
elif mod == 'LGBM':
max_depth = trial.suggest_int("max_depth", 6, 20)
n_estimators = trial.suggest_int("n_estimators", 100, 1000)
learning_rate = trial.suggest_loguniform('learning_rate', 1e-8, 1)
num_leaves = trial.suggest_int("num_leaves", 2, 3000)
min_child_samples = trial.suggest_int('min_child_samples', 3, 200)
min_child_weight = trial.suggest_loguniform('min_child_weight', 1e-5, 1)
colsample_bytree = trial.suggest_uniform('colsample_bytree', 0.1, 1)
subsample = trial.suggest_uniform('subsample', 0.1, 1)
random_state = 23
model = lgb.LGBMRegressor(learning_rate=learning_rate,
n_estimators=n_estimators,
max_depth=max_depth,
num_leaves=num_leaves,
min_child_samples=min_child_samples,
min_child_weight=min_child_weight,
colsample_bytree=colsample_bytree,
subsample=subsample,
random_state=23)
elif mod == 'XGB ':
max_depth = trial.suggest_int("max_depth", 2, 20)
n_estimators = trial.suggest_int("n_estimators", 50, 1000)
learning_rate = trial.suggest_loguniform("learning_rate", 1e-5, 1)
gamma = trial.suggest_loguniform("gamma", 1e-5, 1)
subsample = trial.suggest_uniform("subsample", 0.1, 1.0)
colsample_bytree = trial.suggest_uniform("colsample_bytree", 0.1, 1.0)
random_state = 23
model = xgb.XGBRegressor(max_depth=max_depth,
learning_rate=learning_rate,
n_estimators=n_estimators,
gamma=gamma,
subsample=subsample,
colsample_bytree=colsample_bytree,
random_state=23)
# Cross validation of the model
scores = cross_val_score(model, X, y, scoring=scorer, cv=cv, n_jobs=-1)
# Return mean score and variance
return scores.mean()
from optuna.samplers import TPESampler
from optuna import TrialPruned
def tune(objective, mod, X_train, y_train, scorer, n_trials=25, pruner=None):
objective_func = lambda trial: objective(trial, mod, X_train, y_train, scorer)
if scorer == 'neg_mean_squared_error':
direction = 'minimize'
else:
direction = 'maximize'
study = optuna.create_study(direction=direction, pruner=pruner, sampler=TPESampler(seed=23))
study.optimize(objective_func, n_trials=n_trials, show_progress_bar=True)
params = study.best_params
best_score = study.best_value
return params, best_score
%%time
import datetime
from optuna.pruners import MedianPruner
import optuna.logging
#optuna.logging.set_verbosity(optuna.logging.WARNING)
params = {'RF ':'',
#'LGBM':'',
#'XGB ':''
}
pruner = optuna.pruners.MedianPruner(n_warmup_steps=5, interval_steps=1)
for mod, param in params.items():
start_time = time.time()
tuned_params, top_score = tune(objective, mod, X_train, y_train, scorer='r2', n_trials=250,
pruner=pruner)
params[mod] = tuned_params
end_time = time.time()
time_spent = datetime.datetime.utcfromtimestamp(end_time-start_time)
time_str = time_spent.strftime('%H:%M:%S')
print(f'Best score: {top_score:.5f}')
print(f'| {mod} |{"-"*90}| {time_str} \n')
0%| | 0/250 [00:00<?, ?it/s]
Best score: 0.85611 | RF |------------------------------------------------------------------------------------------| 00:44:21 CPU times: user 27 s, sys: 1.03 s, total: 28 s Wall time: 44min 21s
Following the results of the hyperparameter tuning, the RandomForestRegressor model has improved.
bold(f"the best parameters for our random forest are :")
print(params['RF '])
the best parameters for our random forest are :
{'n_estimators': 418, 'max_depth': 15, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 6}
# create a list of regressors
regressors = [ensemble.RandomForestRegressor(random_state=random_state)]
# create a list of regressors to evaluate
cv = KFold(n_splits=10, shuffle=True, random_state=23)
# define the metrics to be computed
scoring = ['neg_mean_absolute_error', 'neg_mean_squared_error', 'r2']
# perform cross-validation and compute metrics
cv_results = cross_validate(regressors[0].set_params(**params['RF ']), X_train, y_train, cv=cv, scoring=scoring, n_jobs=-1)
name = '10 CV Score'
cv_mse = round(-cv_results['test_neg_mean_squared_error'].mean(), 3)
cv_mae = round(-cv_results['test_neg_mean_absolute_error'].mean(), 3)
cv_r2 = round(cv_results['test_r2'].mean(), 3)
# retrain the best model on the full training set using the tuned parameters
best_model = ensemble.RandomForestRegressor(random_state=random_state).set_params(**params['RF '])
best_model.fit(X_train, y_train)
# compute test set metrics using the best model
y_pred = best_model.predict(X_test)
name = 'Test Score'
test_mse = round(mean_squared_error(y_test, y_pred), 3)
test_mae = round(mean_absolute_error(y_test, y_pred), 3)
test_r2 = round(r2_score(y_test, y_pred), 3)
# comparison
print(f" | cv | test |")
print(f"MSE: | {cv_mse} | {test_mse} |")
print(f"MAE: | {cv_mae} | {test_mae} |")
print(f"R2 Score: | {cv_r2} | {test_r2} |")
| cv | test | MSE: | 9.21 | 7.909 | MAE: | 2.014 | 1.884 | R2 Score: | 0.856 | 0.867 |
# calculate SHAP values
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_train)
# generate summary plot
shap.summary_plot(shap_values, X_test, plot_type="bar", max_display=10)
After conducting PCA, Classification, and Regression analyses on the UCI Obesity Dataset, it can be concluded that several factors have been identified as significant predictors of obesity. The results show that family history with overweight, age, frequency of consumption of vegetables, and gender are the most important factors that affect and can predict obesity.
The family history with overweight, which indicates whether a person has a family history of being overweight, has a significant impact on obesity. This result suggests that individuals with a family history of obesity may have a higher risk of developing obesity themselves.
Age is another important predictor of obesity. The older an individual is, the higher the likelihood of them being obese. This result is consistent with previous studies that have identified age as a significant risk factor for obesity.
The frequency of consumption of vegetables (FCVC) is another factor that significantly affects obesity. This result suggests that individuals who consume vegetables more frequently have a lower risk of obesity compared to those who consume them less frequently.
Finally, the analysis showed that gender is also an important factor in predicting obesity. Females were found to be more likely to be obese than males. This result is consistent with previous studies that have identified gender as a significant risk factor for obesity.
In conclusion, by using the PCA, Classification, and Regression, we have identified several factors that significantly affect and predict obesity. These results can be useful in developing effective interventions and strategies to prevent and manage obesity.