MRMR Feature Selection by Maykon Schots & Matheus Rugollo
Experimenting fast is key to success in Data Science. When experimenting, we often run into huge datasets that demand special attention during feature selection and engineering. In a profit-driven context, it's important to quickly test the potential value of an idea rather than exploring the best possible way to use your data or parametrize a machine learning model: that exploration not only takes time we usually can't afford but also increases financial costs.
Here we describe an efficient way to reduce the dimensionality of your dataset: identify clusters of redundant features and keep only the most relevant feature from each. This can speed up your experimentation process and reduce costs.
You might be wondering how this applies to a real use case and why we had to come up with such a technique. Consider a project at a financial company where we try to understand, through machine learning, how likely a client is to buy a product. Besides profile features, we usually end up with many features derived from the clients' financial transaction history. We can assume that many of them are highly correlated: for example, in order to buy something of value x, the client probably received a value greater than x in the past, and since we extract aggregation features from such events, we end up with a lot of correlation between them.
The solution was an efficient, "automatic" way to wipe redundant features from the training set, which can vary from time to time, while maintaining model performance. With this, at the start of our pipeline we can always consider all of our "raw" features and select the most relevant ones that are not highly correlated at that given moment.
Based on a published article, we developed an implementation using feature_engine and sklearn. Follow the step-by-step to understand our approach.
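If you want to follow along, the dependencies are feature_engine, scikit-learn and pandas (we don't pin exact versions here; any reasonably recent release should work):

%pip install feature_engine scikit-learn pandas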
To demonstrate, we use the make_classification helper function from sklearn to create a set of features, making sure that some of them are redundant. We then convert both the X and y it returns into pandas DataFrames for compatibility with the sklearn API.
import warnings
import pandas as pd
from sklearn.datasets import make_classification
warnings.filterwarnings('ignore')
X, y = make_classification(
    n_samples=5000,
    n_features=30,
    n_redundant=15,
    n_clusters_per_class=1,
    weights=[0.50],
    class_sep=2,
    random_state=42
)
cols = [f"feat_{i}" for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=cols)
y = pd.DataFrame({"y": y})
X.head()
| | feat_0 | feat_1 | feat_2 | feat_3 | feat_4 | feat_5 | feat_6 | feat_7 | feat_8 | feat_9 | feat_10 | feat_11 | feat_12 | feat_13 | feat_14 | feat_15 | feat_16 | feat_17 | feat_18 | feat_19 | feat_20 | feat_21 | feat_22 | feat_23 | feat_24 | feat_25 | feat_26 | feat_27 | feat_28 | feat_29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.855484 | 0.147268 | 0.931779 | 0.514342 | -1.160567 | 1.548771 | -0.814841 | -0.551259 | -0.559991 | -0.588341 | -0.050275 | -0.418050 | -1.883618 | 0.627457 | -1.430858 | 2.414968 | 0.965552 | -0.427275 | 1.575164 | 1.386096 | 1.635452 | 0.171738 | -0.988298 | 2.129211 | 1.502658 | 1.451028 | -0.159674 | 1.999985 | -0.088617 | 2.173897 |
1 | -0.233143 | -0.253765 | 0.740147 | 0.816885 | 0.169952 | 1.594853 | -0.432744 | -0.118465 | -0.245680 | 0.318794 | -0.307883 | -0.391124 | 1.137396 | -1.150796 | -1.642437 | 2.108450 | 1.168464 | -0.840639 | 1.788039 | 0.134206 | -0.498645 | -0.199357 | -1.260340 | 2.202738 | 1.859376 | 1.392850 | -0.521572 | 1.733263 | 0.399896 | 2.562269 |
2 | -1.004689 | -2.525201 | -0.066370 | 1.708870 | -1.836983 | -0.200895 | 2.617520 | 0.494441 | 2.118569 | -1.165920 | -1.576650 | 0.674405 | -0.687477 | 0.020001 | -0.859114 | -2.652188 | 0.951480 | 0.646614 | 0.821866 | 1.885669 | -0.361392 | -2.347812 | -1.371678 | -0.213288 | 1.733786 | -0.814734 | 0.175997 | -2.276026 | 3.047572 | -1.306679 |
3 | -0.039887 | -2.002593 | -0.137059 | 1.352156 | -0.283016 | -0.166958 | 2.079097 | 0.316852 | 1.682286 | 0.208227 | -1.249645 | -0.460059 | -0.315972 | -0.031462 | -0.673953 | -2.114534 | 0.749490 | 0.440944 | 0.643708 | 0.300287 | -1.195257 | -1.862089 | -1.082490 | -0.179667 | 1.366995 | -0.653095 | 0.799519 | -1.814268 | 2.416413 | 0.303902 |
4 | 1.650810 | 0.594325 | 0.268430 | 0.154022 | -0.626083 | 1.439684 | -1.215922 | -1.373762 | -0.893985 | 0.043517 | 0.242237 | 0.462302 | -1.067515 | -0.677758 | -1.139388 | 2.671788 | 0.701012 | 1.002342 | 1.276911 | 0.194160 | 1.528878 | 0.584114 | -0.644920 | 1.967760 | 1.044779 | 1.463181 | 0.461469 | 2.227199 | -0.636540 | 0.948157 |
Now that we have our master table example set up, we can start by taking advantage of the SmartCorrelatedSelection implementation from feature_engine. Let's check its parameters:
from feature_engine.selection import SmartCorrelatedSelection
MODEL_TYPE = "classifier"  # or "regressor"
CORRELATION_THRESHOLD = .97
# Set up the smart selector (thanks, feature_engine!)
feature_selector = SmartCorrelatedSelection(
    variables=None,
    method="spearman",
    threshold=CORRELATION_THRESHOLD,
    missing_values="ignore",
    selection_method="variance",
    estimator=None,
)
feature_selector.fit_transform(X)
# Build a list of correlated clusters (as lists) and a list of uncorrelated features
correlated_sets = feature_selector.correlated_feature_sets_
correlated_clusters = [list(cluster) for cluster in correlated_sets]
correlated_features = [feature for cluster in correlated_clusters for feature in cluster]
uncorrelated_features = [feature for feature in X if feature not in correlated_features]
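Before moving on, it's worth a quick sanity check of what the selector found. This optional inspection snippet (not part of the pipeline itself) just prints the cluster structure:

# Optional inspection: how many redundant clusters did we get, and how big are they?
print(f"{len(correlated_clusters)} correlated clusters")
print(f"{len(uncorrelated_features)} features uncorrelated with all others")
for cluster in correlated_clusters:
    print(cluster)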
Now we're going to extract the best feature from each correlated cluster using SelectKBest from sklearn.feature_selection. Here we use the mutual_info_classif implementation as our score_func for this classifier example; there are other options, like mutual_info_regression, so be sure to pick one according to your use case.
The relevance of each selected feature comes from the mutual information between the samples and the target y; this matters because it keeps us from throwing away predictive power along with the redundancy.
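To make "relevance" concrete, here is an optional look at the raw mutual information scores against the target. This is just an inspection sketch, not a required step:

from sklearn.feature_selection import mutual_info_classif
# Mutual information of every feature against the target, highest first
mi_scores = pd.Series(
    mutual_info_classif(X, y["y"], random_state=42),
    index=X.columns,
)
print(mi_scores.sort_values(ascending=False).head(10))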
We end up with a set of selected features that, given our correlation threshold of .97, will probably perform similarly to the full set. In a context where you want to prioritize dimensionality reduction, you should check how the selection actually performs before committing to it.
I don't want to believe, I want to know.
from sklearn.feature_selection import (
    SelectKBest,
    mutual_info_classif,
    mutual_info_regression,
)
mutual_info = {
    "classifier": mutual_info_classif,
    "regressor": mutual_info_regression,
}
top_features_cluster = []
for cluster in correlated_clusters:
    # k=1 keeps the single top feature of the cluster, ranked by mutual information with the target
    selector = SelectKBest(score_func=mutual_info[MODEL_TYPE], k=1)
    selector = selector.fit(X[cluster], y)
    top_features_cluster.append(list(selector.get_feature_names_out())[0])
selected_features = top_features_cluster + uncorrelated_features
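A one-line check shows how much the dimensionality actually dropped:

print(f"Kept {len(selected_features)} of {X.shape[1]} features")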
Now that we have our feature set, it's time to decide whether to go with it or not. In this demonstration, the idea would be to use a grid search to find the best hyperparameters for a RandomForestClassifier, giving us the best possible estimator.
Fitting many grid searches in a robust way would take too long and be very costly. Since we're just experimenting, we can initially use a basic cross_validate with the chosen estimator and quickly discard selections gone wrong, especially when we lower the correlation threshold for the clusters.
It's an efficient way to approach experimentation with this method, although I highly recommend a more robust evaluation with grid searches or other approaches, plus a serious discussion of how the performance threshold impacts your use case; sometimes 1% can be a lot of $.
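For reference, such a grid-search evaluation could look like the sketch below; the parameter grid is purely illustrative, not a recommendation. The quicker cross_validate check we actually use follows after it.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
# Hypothetical grid, for illustration only
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10],
    "min_samples_leaf": [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",
    cv=StratifiedKFold(shuffle=True, random_state=42),
    n_jobs=-1,
)
search.fit(X[selected_features], y["y"])
print(search.best_params_, search.best_score_)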
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate
cv = StratifiedKFold(shuffle=True, random_state=42)
baseline_raw = cross_validate(
    RandomForestClassifier(
        max_samples=1.0,
        n_jobs=-1,  # use all available cores
        random_state=42
    ),
    X,
    y,
    cv=cv,
    scoring="f1",  # or any other metric that you want
    groups=None
)
baseline_selected_features = cross_validate(
    RandomForestClassifier(
        max_samples=1.0,
        n_jobs=-1,
        random_state=42
    ),  # same settings as the raw baseline, for a fair comparison
    X[selected_features],
    y,
    cv=cv,
    scoring="f1",
    groups=None,
    error_score="raise",
)
score_raw = baseline_raw["test_score"].mean()
score_selected = baseline_selected_features["test_score"].mean()
# Relative performance drop of the selected set vs. the raw set (positive = worse)
dif = round((score_raw - score_selected) / score_raw, 3)
# A 5% drop is our limit (ponder how it will impact your product $)
performance_threshold = 0.050
if dif <= performance_threshold:
    print("It's worth going with the selected set =D")
else:
    print("The performance reduction is not acceptable! >.<")
Going further on implementing a robust feature selection with MRMR: we can use the process explained above to iterate over a range of thresholds and choose what's best for our needs, instead of relying on a single score evaluation!
# Repeat df from example.
import warnings
import pandas as pd
from sklearn.datasets import make_classification
warnings.filterwarnings('ignore')
X, y = make_classification(
    n_samples=5000,
    n_features=30,
    n_redundant=15,
    n_clusters_per_class=1,
    weights=[0.50],
    class_sep=2,
    random_state=42
)
cols = [f"feat_{i}" for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=cols)
y = pd.DataFrame({"y": y})
# Functions to iterate over a range of accepted thresholds
import pandas as pd
from feature_engine.selection import SmartCorrelatedSelection
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import StratifiedKFold, cross_validate
def select_features_clf(X: pd.DataFrame, y: pd.DataFrame, corr_threshold: float) -> list:
    """Select a set of features with minimum redundancy and maximum relevance, based on the given correlation threshold."""
    # Set up the smart selector (thanks, feature_engine!)
    feature_selector = SmartCorrelatedSelection(
        variables=None,
        method="spearman",
        threshold=corr_threshold,
        missing_values="ignore",
        selection_method="variance",
        estimator=None,
    )
    feature_selector.fit_transform(X)
    # Build a list of correlated clusters (as lists) and a list of uncorrelated features
    correlated_sets = feature_selector.correlated_feature_sets_
    correlated_clusters = [list(cluster) for cluster in correlated_sets]
    correlated_features = [feature for cluster in correlated_clusters for feature in cluster]
    uncorrelated_features = [feature for feature in X if feature not in correlated_features]
    top_features_cluster = []
    for cluster in correlated_clusters:
        # k=1 keeps the single top feature of the cluster, ranked by mutual information with the target
        selector = SelectKBest(score_func=mutual_info_classif, k=1)
        selector = selector.fit(X[cluster], y)
        top_features_cluster.append(list(selector.get_feature_names_out())[0])
    return top_features_cluster + uncorrelated_features
def get_clf_model_scores(X: pd.DataFrame, y: pd.DataFrame, scoring: str, selected_features: list):
    """Cross-validate a RandomForestClassifier on the selected features and return the mean test score, fit time and score time."""
    cv = StratifiedKFold(shuffle=True, random_state=42)
    model_result = cross_validate(
        RandomForestClassifier(),
        X[selected_features],
        y,
        cv=cv,
        scoring=scoring,
        groups=None,
        error_score="raise",
    )
    return model_result["test_score"].mean(), model_result["fit_time"].mean(), model_result["score_time"].mean()
def evaluate_clf_feature_selection_range(X: pd.DataFrame, y: pd.DataFrame, scoring: str, corr_range: int, corr_starting_point: float = .98) -> pd.DataFrame:
    """Evaluate the feature selection for every .01 step down in the correlation threshold."""
    evaluation_data = {
        "corr_threshold": [],
        scoring: [],
        "n_features": [],
        "fit_time": [],
        "score_time": [],
    }
    for i in range(corr_range):
        current_corr_threshold = corr_starting_point - (i / 100)  # reduce corr_threshold by .01 each iteration
        selected_features = select_features_clf(X, y, corr_threshold=current_corr_threshold)
        score, fit_time, score_time = get_clf_model_scores(X, y, scoring, selected_features)
        evaluation_data["corr_threshold"].append(current_corr_threshold)
        evaluation_data[scoring].append(score)
        evaluation_data["n_features"].append(len(selected_features))
        evaluation_data["fit_time"].append(fit_time)
        evaluation_data["score_time"].append(score_time)
    return pd.DataFrame(evaluation_data)
evaluation_df = evaluate_clf_feature_selection_range(X, y, "f1", 15)
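With the evaluation table in hand, one simple decision rule (an illustrative sketch, not the only option) is to keep the smallest feature set whose score stays within your performance threshold of the best run:

# Among runs within 5% of the best f1, pick the one with the fewest features
best_f1 = evaluation_df["f1"].max()
acceptable = evaluation_df[evaluation_df["f1"] >= best_f1 * (1 - 0.05)]
print(acceptable.sort_values("n_features").iloc[0])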
%pip install hiplot
import hiplot
html = hiplot.Experiment.from_dataframe(evaluation_df).to_html()
displayHTML(html)
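Note that displayHTML is a Databricks notebook helper; in a plain Jupyter notebook you can call hiplot's .display() instead. If you prefer a static chart, a minimal matplotlib sketch (assuming matplotlib is installed) gives a similar overview:

import matplotlib.pyplot as plt
# f1 and feature count as a function of the correlation threshold
fig, ax1 = plt.subplots()
ax1.plot(evaluation_df["corr_threshold"], evaluation_df["f1"], marker="o", color="tab:blue")
ax1.set_xlabel("correlation threshold")
ax1.set_ylabel("f1", color="tab:blue")
ax2 = ax1.twinx()
ax2.plot(evaluation_df["corr_threshold"], evaluation_df["n_features"], marker="s", color="tab:gray")
ax2.set_ylabel("n_features", color="tab:gray")
plt.show()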