Intro¶

General¶

"Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making." (Wikipedia).

Data analysis is commonly organized into three paradigms: univariate, bivariate, and multivariate. We apply them to the features of the dataset (also called statistical variables, or the columns of the dataframe) to understand the data better.

Numeric features hold numbers on which mathematical operations are meaningful. They are further divided into discrete (countable values, typically integers) and continuous (any value within a range, including decimals).

Categorical features are columns with a limited number of possible values. Examples are sex, country, or age group.
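
As a rough illustration of the distinction above, the sketch below splits the columns of a small dataframe into numeric and categorical groups with pandas' `select_dtypes`. The toy data and column names are hypothetical and not part of this notebook's dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 41, 29],                # numeric, discrete
    "height_m": [1.62, 1.80, 1.75, 1.68],   # numeric, continuous
    "sex": ["F", "M", "M", "F"],            # categorical
    "country": ["GH", "KE", "NG", "GH"],    # categorical
})

# Columns with a numeric dtype versus everything else.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Note that dtype alone is only a first guess: a column of integer codes (for example, an age group stored as 1, 2, 3) has a numeric dtype but is categorical in meaning.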

Notebook overview¶

This notebook is a guide to start practicing Data Analysis.

Setup¶

Installation¶

Here is the section to install all the packages/libraries that will be needed to tackle the challenge.

In [ ]:

# !pip install -q <lib_001> <lib_002> ...

Importation¶

Here is the section to import all the packages/libraries that will be used throughout this notebook.

In [ ]:

# Data handling
import pandas as pd

# Visualization (Matplotlib, Plotly, Seaborn, etc.)
...

# EDA (pandas-profiling, etc. )
...

# Feature Processing (Scikit-learn processing, etc. )
...

# Machine Learning (Scikit-learn Estimators, Catboost, LightGBM, etc. )
...

# Hyperparameters Fine-tuning (Scikit-learn hp search, cross-validation, etc. )
...

# Other packages
import os

Data Loading¶

Here is the section to load the datasets (train, eval, test) and any additional files.

In [ ]:

# For CSV, use pandas.read_csv
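
A minimal sketch of loading a CSV with `pandas.read_csv` follows. To stay self-contained it reads from an in-memory string; in practice you would pass a file path instead. The name `train` and the column names are assumptions made for illustration.

```python
import io

import pandas as pd

# Stand-in for a file on disk. With a real dataset you would pass a path,
# e.g. pd.read_csv("train.csv"), where "train.csv" is a hypothetical file name.
csv_text = "id,age,sex\n1,23,F\n2,35,M\n3,,M\n"

train = pd.read_csv(io.StringIO(csv_text))

# Empty fields are parsed as NaN, which matters for the cleaning steps later.
print(train.shape)
print(train.isna().sum())
```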

Exploratory Data Analysis: EDA¶

Here is the section to inspect the datasets in depth, present them, make hypotheses, and plan the cleaning, processing, and feature creation.

Dataset overview¶

Have a look at the loaded datasets using the following methods: .head(), .info()

In [ ]:

# Code here

In [ ]:

# Code here
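
The two methods mentioned above could be used as in this sketch, which assumes a small hypothetical dataframe rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for a loaded dataset.
df = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "sex": ["F", "M", "M", "F"],
    "income": [1200.0, None, 3100.5, 2500.0],
})

# First rows of the dataframe (5 by default; 2 requested here).
preview = df.head(2)
print(preview)

# Column names, dtypes, non-null counts, and memory usage (printed, returns None).
df.info()
```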

Univariate Analysis¶

‘Univariate analysis’ is the analysis of one variable at a time. It can be done by computing statistical indicators with the pandas dataframe method .describe() and by plotting charts with one of the plotting libraries, such as Seaborn, Matplotlib, or Plotly.

Please read this article to learn more about the charts.

In [ ]:

# Code here
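
One possible starting point for a univariate analysis is sketched below: summary statistics for a numeric column, value counts for a categorical column, and a histogram drawn with Matplotlib. The data and column names are hypothetical, and Matplotlib is just one of the plotting libraries listed above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 35],
    "sex": ["F", "M", "M", "F", "M"],
})

# Numeric feature: count, mean, std, min, quartiles, max.
age_summary = df["age"].describe()
print(age_summary)

# Categorical feature: frequency of each category.
sex_counts = df["sex"].value_counts()
print(sex_counts)

# Distribution of the numeric feature as a histogram.
fig, ax = plt.subplots()
ax.hist(df["age"], bins=5)
ax.set_xlabel("age")
ax.set_ylabel("count")
```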

Multivariate Analysis¶

‘Multivariate analysis’ is the analysis of more than one variable at a time and aims to study the relationships among them. It can be done by computing statistical indicators such as the correlation and by plotting charts.

Please read this article to learn more about the charts.

In [ ]:

# Code here
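
As a sketch of the indicators mentioned above, the example below computes a Pearson correlation matrix between two numeric columns and a per-category mean. The dataframe and its column names are hypothetical; the values are chosen so that `score` is an exact linear function of `hours`.

```python
import pandas as pd

# Hypothetical toy data: score = 2 * hours + 50, so the two are perfectly correlated.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4],
    "score": [52, 54, 56, 58],
    "group": ["a", "a", "b", "b"],
})

# Numeric vs. numeric: pairwise Pearson correlation.
corr = df[["hours", "score"]].corr()
print(corr)

# Categorical vs. numeric: mean of the numeric feature within each category.
mean_score_by_group = df.groupby("group")["score"].mean()
print(mean_score_by_group)
```

Correlation only captures linear association, so a chart (for example, a scatter plot) is still worth drawing alongside it.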

Feature processing¶

Here is the section to clean and process the features of the dataset.

Missing/NaN Values¶

Handle the missing/NaN values using the scikit-learn SimpleImputer.

In [ ]:

# Code here
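
A minimal sketch of imputation with `SimpleImputer`, assuming a hypothetical dataframe with one numeric and one categorical column. The choice of strategies (median for numbers, most frequent value for categories) is an illustrative assumption, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 30.0],
    "sex": pd.Series(["F", "M", "M", np.nan], dtype=object),
})

# Numeric column: replace NaN with the median of the observed values.
num_imputer = SimpleImputer(strategy="median")
age_filled = num_imputer.fit_transform(df[["age"]])

# Categorical column: replace NaN with the most frequent category.
cat_imputer = SimpleImputer(strategy="most_frequent")
sex_filled = cat_imputer.fit_transform(df[["sex"]])

print(age_filled.ravel())
print(sex_filled.ravel())
```

With separate train/eval/test sets, the usual practice is to call `fit` on the training set only and `transform` on the others, so that no information leaks from the held-out data.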

Scaling¶

Scale the numeric features using the scikit-learn StandardScaler, MinMaxScaler, or another scaler.

In [ ]:

# Code here
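
The sketch below applies both scalers named above to the same hypothetical column so their effects can be compared: `StandardScaler` centers to mean 0 and unit variance, while `MinMaxScaler` maps values to the range [0, 1] by default.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric feature.
df = pd.DataFrame({"income": [1.0, 2.0, 3.0]})

# Standardization: (x - mean) / std, using the population standard deviation.
standardized = StandardScaler().fit_transform(df[["income"]])

# Min-max scaling: (x - min) / (max - min).
minmax = MinMaxScaler().fit_transform(df[["income"]])

print(standardized.ravel())
print(minmax.ravel())
```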

Encoding¶

Encode the categorical features using the scikit-learn OneHotEncoder.

In [ ]:

# Code here
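
A minimal sketch of one-hot encoding, assuming a hypothetical `country` column. Setting `handle_unknown="ignore"` is an illustrative choice that encodes categories unseen during `fit` as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature.
df = pd.DataFrame({"country": ["GH", "KE", "NG", "GH"]})

encoder = OneHotEncoder(handle_unknown="ignore")

# The encoder returns a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(df[["country"]]).toarray()

# One output column per category, in sorted order.
print(encoder.categories_[0])
print(encoded)
```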