"Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making." (Wikipedia).
Data analysis has three common paradigms: univariate, bivariate, and multivariate. We apply them to the features of the dataset (also called statistical variables, or the columns of the dataframe) to understand the data better.
Numeric features are features with numbers that you can perform mathematical operations on. They are further divided into discrete (countable integers with clear boundaries) and continuous (can take any value, even decimals, within a range).
Categorical features are columns with a limited number of possible values. Examples are sex, country, and age group.
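As a minimal sketch of the numeric/categorical distinction, the snippet below splits the columns of a small dataframe by dtype with `select_dtypes`. The dataframe and its column names (`age`, `height_m`, `country`) are made up for illustration and do not come from any dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data, made up purely for illustration
df = pd.DataFrame({
    "age": [23, 35, 41, 29],               # numeric, discrete
    "height_m": [1.62, 1.80, 1.75, 1.68],  # numeric, continuous
    "country": ["GH", "FR", "GH", "US"],   # categorical
})

# Columns with a numeric dtype (integers and floats)
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Everything else is treated here as categorical
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```

Note that dtype is only a first guess: a categorical feature stored as integer codes (for example an age group encoded as 1, 2, 3) would land in the numeric list and would need to be moved by hand.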
This notebook is a guide to start practicing Data Analysis.
Here is the section to install all the packages/libraries that will be needed to tackle the challenge.
# !pip install -q <lib_001> <lib_002> ...
Here is the section to import all the packages/libraries that will be used throughout this notebook.
# Data handling
import pandas as pd
# Visualization (Matplotlib, Plotly, Seaborn, etc. )
...
# EDA (pandas-profiling, etc. )
...
# Feature Processing (Scikit-learn processing, etc. )
...
# Machine Learning (Scikit-learn Estimators, Catboost, LightGBM, etc. )
...
# Hyperparameters Fine-tuning (Scikit-learn hp search, cross-validation, etc. )
...
# Other packages
import os
Here is the section to load the datasets (train, eval, test) and the additional files.
# For CSV, use pandas.read_csv
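A minimal sketch of loading a CSV with `pandas.read_csv` could look like the following. To keep it runnable without any file on disk, the CSV content is a made-up string read through `io.StringIO`; the variable name `train` and the columns are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real train file
csv_text = "id,age,country\n1,23,GH\n2,35,FR\n3,,US\n"

# With a real dataset, pass the file path to pd.read_csv instead of a StringIO
train = pd.read_csv(io.StringIO(csv_text))

print(train.shape)
print(train)
```

The empty `age` field on the last row is read as `NaN`, which is why that column ends up with a float dtype.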
Here is the section to inspect the datasets in depth, present them, make hypotheses, and plan the cleaning, processing, and feature creation.
Have a look at the loaded datasets using the following methods: .head(), .info()
# Code here
# Code here
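The two methods can be tried as in the sketch below. The dataframe is a hypothetical stand-in for a loaded dataset, so the values shown have no meaning beyond illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a loaded dataset
df = pd.DataFrame({
    "age": [23, 35, np.nan, 29, 52],
    "country": ["GH", "FR", "GH", None, "US"],
})

# First rows of the dataframe (5 by default, 3 requested here)
first_rows = df.head(3)
print(first_rows)

# Column names, non-null counts, dtypes, and memory usage, printed to stdout
df.info()
```

`.head()` returns a new dataframe, while `.info()` prints its summary and returns `None`. The non-null counts in that summary are a quick first hint at missing values.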
‘Univariate analysis’ is the analysis of one variable at a time. It can be done by computing statistical indicators with the pandas dataframe's .describe() method and by plotting charts with one of the plotting libraries such as Seaborn, Matplotlib, or Plotly.
Please read this article to learn more about the charts.
# Code here
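One possible univariate pass is sketched below: `.describe()` for a numeric column, `.value_counts()` for a categorical one, and a Matplotlib histogram. The data and column names are hypothetical, and Matplotlib is only one of the plotting libraries that could be used here.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, made up purely for illustration
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 35, 52],
    "country": ["GH", "FR", "GH", "US", "GH", "FR"],
})

# Statistical indicators for one numeric feature (count, mean, std, quartiles, ...)
age_stats = df["age"].describe()
print(age_stats)

# Frequency of each value for one categorical feature
country_counts = df["country"].value_counts()
print(country_counts)

# Histogram showing the distribution of the numeric feature
fig, ax = plt.subplots()
ax.hist(df["age"], bins=5)
ax.set_xlabel("age")
ax.set_ylabel("count")
ax.set_title("Distribution of age")
```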
‘Multivariate analysis’ is the analysis of more than one variable and aims to study the relationships among them. It can be done by computing statistical indicators such as the correlation and by plotting charts.
Please read this article to learn more about the charts.
# Code here
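As a small illustration of the correlation indicator, the sketch below computes a Pearson correlation matrix with `DataFrame.corr()`. The three columns are invented values chosen so that one pair is positively related and another negatively related.

```python
import pandas as pd

# Hypothetical toy data, made up purely for illustration
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 55, 61, 70, 74],
    "hours_slept": [9, 8, 8, 6, 5],
})

# Pairwise Pearson correlation between all numeric columns (the default method)
corr = df.corr()
print(corr.round(2))
```

Values close to 1 or -1 indicate a strong linear relationship and values near 0 a weak one. A correlation matrix like this is often drawn as a heatmap, for example with Seaborn, which is omitted here to keep the sketch dependency-free.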
Here is the section to clean and process the features of the dataset.
Handle the missing/NaN values using the Scikit-learn SimpleImputer.
# Code here
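A minimal sketch of mean imputation with `SimpleImputer` follows. The dataframe is hypothetical, and the `"mean"` strategy is just one choice; `"median"`, `"most_frequent"`, and `"constant"` are also available and may suit a given feature better.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with missing values, made up for illustration
df = pd.DataFrame({
    "age": [23.0, np.nan, 41.0],
    "income": [1000.0, 2000.0, np.nan],
})

# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(imputed)
```

When a separate eval or test set exists, the usual practice is to call `fit` on the train set only and reuse the fitted imputer with `transform` on the other sets.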
Scale the numeric features using the Scikit-learn StandardScaler, MinMaxScaler, or another Scaler.
# Code here
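The sketch below applies two of the scalers named above to the same hypothetical data so that their effects can be compared. `StandardScaler` centers each column to mean 0 and unit variance, while `MinMaxScaler` maps each column to the range [0, 1] by default.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric data, made up purely for illustration
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "income": [1000.0, 2000.0, 6000.0],
})

# Standardization: (x - column mean) / column standard deviation
standardized = StandardScaler().fit_transform(df)

# Min-max scaling: (x - column min) / (column max - column min)
rescaled = MinMaxScaler().fit_transform(df)

print(np.round(standardized, 2))
print(np.round(rescaled, 2))
```

Both calls return NumPy arrays, so the column names have to be reattached if a dataframe is needed afterwards.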
Encode the categorical features using the Scikit-learn OneHotEncoder.
# Code here