Introduction to Python for Data Sciences
Franck Iutzeler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Create some data
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 500)
y = np.cumsum(rng.randn(500, 3), 0)
plt.plot(x, y)
plt.legend('one two three'.split(' '));
Let us import seaborn and apply its default style to matplotlib with sns.set().
import seaborn as sns
sns.set()
# Same command but now seaborn is set
plt.plot(x, y)
plt.legend('one two three'.split(' '));
In addition to the standard histogram plt.hist, Seaborn provides smoothed density estimates of the data with sns.kdeplot, as well as histograms with sns.displot and sns.histplot.
data = np.random.multivariate_normal([0, 1.5], [[1, 0.2], [0.2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])
for col in 'xy':
    plt.hist(data[col], alpha=0.5)  # alpha=0.5 provides semi-transparent plots
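To make explicit what a histogram displays, here is a minimal sketch using np.histogram, which computes the bin counts that plt.hist draws. The sample, the seed, and the variable names are assumptions chosen for illustration; they are not the data used above.

```python
import numpy as np

rng_hist = np.random.RandomState(0)   # arbitrary seed, for reproducibility only
values = rng_hist.randn(2000)         # illustrative sample, not the notebook's data

# Raw counts per bin (10 equal-width bins, the default number used by plt.hist)
counts, edges = np.histogram(values, bins=10)

# Same bins, normalized so that the bars integrate to 1
density, _ = np.histogram(values, bins=10, density=True)
widths = np.diff(edges)

print(counts.sum())               # every sample falls in exactly one bin
print((density * widths).sum())   # total area of the normalized bars
```

The counts depend on the number and position of the bins, which is one reason a smoothed density estimate can be easier to read.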
kdeplot draws a density estimate from an array or a Series (fill=True draws a filled curve; older seaborn versions call this option shade).
sns.kdeplot(data['x'])
sns.kdeplot(data['y'], fill=True)
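The idea behind a kernel density estimate can be sketched by hand: each sample contributes a small Gaussian bump, and the estimate is the average of these bumps. The function name gaussian_kde_1d and the fixed bandwidth of 0.3 are assumptions made for this illustration; kdeplot selects its bandwidth automatically, so its curve will generally differ from this one.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth` centred on each sample."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance between every grid point and every sample,
    # shape (len(grid), len(samples))
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

rng_kde = np.random.RandomState(0)    # arbitrary seed
samples = rng_kde.randn(2000)         # illustrative sample
grid = np.linspace(-8, 8, 1601)       # evaluation points, step 0.01

density = gaussian_kde_1d(samples, grid, bandwidth=0.3)  # bandwidth is a hand-picked assumption

# A density should be non-negative and integrate to about 1 (Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(area)
```

A smaller bandwidth gives a wigglier curve that follows individual samples; a larger one gives a smoother but flatter curve.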
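displot draws a histogram by default and can overlay a density estimate with kde=True, which combines the two previous plots; histplot is the corresponding axes-level histogram function.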
sns.displot(data['x'])
sns.histplot(data['y'])
A two-dimensional dataset can be represented by the level sets of its estimated density with kdeplot.
sns.kdeplot(x=data['x'], y=data['y'], fill=True, thresh=0.05, cmap="Reds", cbar=True)
The joint distribution and the marginal distributions can be displayed together using jointplot.
sns.jointplot(x= "x", y= "y", data = data, kind='kde');
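As a numerical companion to the joint plot, the sketch below draws a fresh sample with the same mean and covariance as above and compares the empirical statistics with the specified ones. The seed and the variable names are assumptions for illustration; the numbers will not match the unseeded sample plotted above exactly.

```python
import numpy as np

# Parameters copied from the sampling call above
mean = np.array([0.0, 1.5])
cov = np.array([[1.0, 0.2], [0.2, 2.0]])

rng_check = np.random.RandomState(0)  # arbitrary seed, for reproducibility only
sample = rng_check.multivariate_normal(mean, cov, size=2000)

sample_mean = sample.mean(axis=0)          # should be close to `mean`
sample_cov = np.cov(sample, rowvar=False)  # should be close to `cov`

# Correlation between the two coordinates; a small positive value
# corresponds to slightly tilted contours in the joint plot
correlation = sample_cov[0, 1] / np.sqrt(sample_cov[0, 0] * sample_cov[1, 1])

print(sample_mean)
print(sample_cov)
print(correlation)
```

The diagonal of the covariance matrix sets the width of each marginal distribution, and the off-diagonal term sets how much the joint contours are tilted.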
With pairplot, Seaborn provides a quick way to explore the pairwise relationships between features, colored by class.
import pandas as pd
import numpy as np
iris = pd.read_csv('data/iris.csv')
print(iris.shape)
iris.head()
(150, 5)
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
sns.pairplot(iris, hue='species')
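catplot (called factorplot in older seaborn versions) draws categorical plots, such as box plots that show the spread of a feature within each class.
sns.catplot(x="species", y="sepal_length", data=iris, kind="box")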
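The box in a box plot spans the first to the third quartile, with a line at the median. The sketch below computes these quantities per class with pandas on a small hypothetical table; the column names and values are made up for illustration and are not taken from the iris data.

```python
import pandas as pd

# Hypothetical toy data: two classes, five measurements each
toy = pd.DataFrame({
    "species": ["a"] * 5 + ["b"] * 5,
    "length": [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 4.0, 6.0, 8.0, 10.0],
})

# Quartiles per class: one row per class, one column per quantile level
summary = toy.groupby("species")["length"].quantile([0.25, 0.5, 0.75]).unstack()

# Interquartile range, i.e. the height of the box
summary["iqr"] = summary[0.75] - summary[0.25]

print(summary)
```

Applied to the iris dataframe, the same groupby on "species" and "sepal_length" would give the numbers that the boxes above represent.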
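For displaying classification data, it is sometimes useful to melt dataframes, that is, to reshape them from a wide format (one column per feature) to a long format (one row per measurement).
The command pd.melt returns a dataframe with the following columns: the id, the variable name (the former column name), and the associated value.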
irisS = pd.melt(iris,id_vars="species",value_vars=["sepal_length","sepal_width","petal_length","petal_width"])
irisS.head()
| | species | variable | value |
|---|---|---|---|
| 0 | setosa | sepal_length | 5.1 |
| 1 | setosa | sepal_length | 4.9 |
| 2 | setosa | sepal_length | 4.7 |
| 3 | setosa | sepal_length | 4.6 |
| 4 | setosa | sepal_length | 5.0 |
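The reshaping is easier to follow on a very small table. The sketch below melts a hypothetical two-row dataframe; the names wide and long_df and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per flower, one column per feature
wide = pd.DataFrame({
    "species": ["setosa", "other"],
    "sepal_length": [5.1, 6.0],
    "petal_length": [1.4, 4.5],
})

# Long table: one row per (flower, feature) pair
long_df = pd.melt(
    wide,
    id_vars="species",
    value_vars=["sepal_length", "petal_length"],
)

print(long_df)
```

Each of the 2 rows yields one row per melted column, so the result has 2 × 2 = 4 rows, and the former column names now appear as values in the variable column.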
sns.catplot(x="species", y="value", col="variable", data=irisS, kind="box")