The goal is to find the best set of hyper-parameters which maximizes the generalization performance on a training set.
import numpy as np
import pandas as pd
df = pd.read_csv(
    "https://www.openml.org/data/get_csv/1595261/adult-census.csv")
# Or use the local copy:
# df = pd.read_csv('../datasets/adult-census.csv')
target_name = "class"
target = df[target_name].to_numpy()
data = df.drop(columns=target_name)
from sklearn.model_selection import train_test_split
df_train, df_test, target_train, target_test = train_test_split(
    data, target, random_state=42)
TODO: create your machine learning pipeline
You should:
- use a OneHotEncoder to encode the categorical columns;
- use a StandardScaler to normalize the numerical data;
- use a LogisticRegression as a predictive model.

Start by defining the columns and the preprocessing pipelines to be applied on each column.
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
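As a starting point, here is a minimal sketch of this step. Selecting the categorical columns by their object dtype is an assumption about this dataset, and handle_unknown="ignore" is one possible choice for coping with categories unseen during training.

from sklearn.compose import make_column_selector as selector

# Split the columns by dtype: string columns are treated as categorical,
# the remaining ones as numerical (an assumption about this dataset).
categorical_columns = selector(dtype_include=object)(df_train)
numerical_columns = selector(dtype_exclude=object)(df_train)

# handle_unknown="ignore" avoids errors if the test set contains
# categories that were not seen during training.
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()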
Subsequently, create a ColumnTransformer to redirect each group of columns to its preprocessing pipeline.
from sklearn.compose import ColumnTransformer
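For instance, a sketch building on the column lists above. The transformer names "cat_preprocessor" and "num_preprocessor" are arbitrary choices, but they matter: they become part of the hyper-parameter names in the random search later on.

# Route each group of columns to its own preprocessor.
preprocessor = ColumnTransformer([
    ("cat_preprocessor", categorical_preprocessor, categorical_columns),
    ("num_preprocessor", numerical_preprocessor, numerical_columns),
])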
Finally, concatenate the preprocessing pipeline with a logistic regression.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
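One possible assembly, reusing the preprocessor from above; max_iter=500 is an arbitrary bump so that the solvers have room to converge on this dataset.

# make_pipeline derives the step names from the class names, so the steps
# will be called "columntransformer" and "logisticregression".
model = make_pipeline(preprocessor, LogisticRegression(max_iter=500))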
TODO: make your random search
Use a RandomizedSearchCV to find the best set of hyper-parameters by tuning the following parameters of the LogisticRegression model:
- C with values ranging from 0.001 to 10. You can use a reciprocal distribution (i.e. scipy.stats.reciprocal);
- solver with possible values being "liblinear" and "lbfgs";
- penalty with possible values being "l2" and "l1".

In addition, try several preprocessing strategies with the OneHotEncoder by always (or not) dropping the first column when encoding the categorical data.
Note: you can accept failures during a grid-search or a randomized-search by setting error_score to np.nan, for instance.
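Putting it together, here is one possible sketch, assuming the model pipeline built above. The double-underscore parameter names follow the step and transformer names chosen earlier, and error_score=np.nan lets invalid combinations (such as solver="lbfgs" with penalty="l1") be scored as NaN instead of aborting the whole search.

from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    # Log-spaced sampling of C between 0.001 and 10.
    "logisticregression__C": reciprocal(0.001, 10),
    "logisticregression__solver": ["liblinear", "lbfgs"],
    "logisticregression__penalty": ["l2", "l1"],
    # Toggle dropping the first category in the OneHotEncoder.
    "columntransformer__cat_preprocessor__drop": [None, "first"],
}

model_random_search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=20,  # an arbitrary budget for this sketch
    error_score=np.nan,  # failed fits get NaN instead of raising
    random_state=42,
    n_jobs=2,
)
model_random_search.fit(df_train, target_train)
print(model_random_search.best_params_)
print(f"Test score: {model_random_search.score(df_test, target_test):.3f}")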