单变量或双变量的分布的可视化问题
绘制单变量分布图
绘制双变量分布图
可视化数据集的成对关系
import numpy as np
import pandas as pd
from scipy import stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
np.random.seed(sum(map(ord, "distributions")))
%matplotlib inline
sns.distplot()函数,默认同时绘制直方图(histogram)和核密度图(KDE)
x = np.random.normal(size=100)
sns.distplot(x)
<matplotlib.axes._subplots.AxesSubplot at 0x9ae4f28>
sns.distplot(x, kde=False, rug=True) # kde=False绘制直方图,rug=True为每个观察值添加一个tick
<matplotlib.axes._subplots.AxesSubplot at 0x9e8e470>
sns.distplot(x, kde=False, hist=False, rug=True) # 绘制rugplot,有单独的sns.rugplot()函数
<matplotlib.axes._subplots.AxesSubplot at 0xaeae7b8>
sns.distplot(x, bins=20, kde=False, rug=True) # bins参数,设置bin的个数
<matplotlib.axes._subplots.AxesSubplot at 0xb3d22b0>
sns.distplot(x, hist=False, rug=True) # 设置hist=False,绘制核密度图,有单独的sns.kdeplot()函数
<matplotlib.axes._subplots.AxesSubplot at 0xb8c4eb8>
each observation is first replaced with a normal (Gaussian) curve centered at that value
x = np.random.normal(0, 1, size=30)
bandwidth = 1.06 * x.std() * x.size ** (-1 / 5.)
support = np.linspace(-4, 4, 200)
kernels = []
for x_i in x:
kernel = stats.norm(x_i, bandwidth).pdf(support)
kernels.append(kernel)
plt.plot(support, kernel, color="r")
sns.rugplot(x, color=".2", linewidth=3)
<matplotlib.axes._subplots.AxesSubplot at 0xc0758d0>
these curves are summed to compute the value of the density at each point in the support grid. The resulting curve is then normalized so that the area under it is equal to 1
density = np.sum(kernels, axis=0)
density /= integrate.trapz(density, support)
plt.plot(support, density)
[<matplotlib.lines.Line2D at 0xc538cc0>]
相对于sns.distplot()能够设置更多选项
sns.kdeplot(x, shade=True) # 设置shade参数,填充核密度线下方区域
<matplotlib.axes._subplots.AxesSubplot at 0xc5c4f28>
The bandwidth (bw
) parameter of the KDE controls how tightly the estimation is fit to the data, much like the bin size in a histogram.
It corresponds to the width of the kernels we plotted above.
The default behavior tries to guess a good value using a common reference rule, but it may be helpful to try larger or smaller values.
sns.kdeplot(x) # 默认bw='scott'
sns.kdeplot(x, bw=.2, label="bw: 0.2")
sns.kdeplot(x, bw=2, label="bw: 2") # 设置bw参数
plt.legend()
<matplotlib.legend.Legend at 0xc93f080>
As you can see above, the nature of the Gaussian KDE process means that estimation extends past the largest and smallest values in the dataset.
It's possible to control how far past the extreme values the curve is drawn with the cut
parameter.
However, this only influences how the curve is drawn and not how it is fit.
sns.kdeplot(x, shade=True, cut=0)
sns.rugplot(x)
<matplotlib.axes._subplots.AxesSubplot at 0xd0bca90>
You can also use distplot() to fit a parametric distribution to a dataset and visually evaluate how closely it corresponds to the observed data.
x = np.random.gamma(6, size=200)
sns.distplot(x, kde=False, fit=stats.gamma) # 设置fit参数,拟合参数分布
<matplotlib.axes._subplots.AxesSubplot at 0xd0d49e8>
sns.distplot(x, fit=stats.gamma) # 绘制直方图和核密度图,与拟合参数分布图对比
<matplotlib.axes._subplots.AxesSubplot at 0xd686470>
sns.jointplot()函数
mean, cov = [0, 1], [(1, .5), (.5, 1)]
data = np.random.multivariate_normal(mean, cov, 200)
df = pd.DataFrame(data, columns=["x", "y"])
jointplot()默认绘制散点图
sns.jointplot(x="x", y="y", data=df) # 注意Seaborn与DataFrame联合使用,data参数指定DataFrame,x、y参数指定列名
<seaborn.axisgrid.JointGrid at 0xd9355f8>
六边形颜色的深浅,代表落入该六边形区域内观测点的数量,常应用于大数据集,与white主题结合使用效果最好
x, y = np.random.multivariate_normal(mean, cov, 1000).T
with sns.axes_style("white"): # hexbin plot与white主题结合使用效果最好
sns.jointplot(x=x, y=y, kind="hex", color="k") # kind参数设置六边形图,颜色设置与matplotlib相同
sns.jointplot(x="x", y="y", data=df, kind="kde")
<seaborn.axisgrid.JointGrid at 0xdf79160>
f, ax = plt.subplots(figsize=(6, 6))
sns.kdeplot(df.x, df.y, ax=ax) # ax参数选择图表绘制在哪个坐标系内
sns.rugplot(df.x, color="g", ax=ax)
sns.rugplot(df.y, vertical=True, ax=ax) # kdeplot()绘制的双变量核密度图,可以与其他图表叠加在同一个坐标系内
<matplotlib.axes._subplots.AxesSubplot at 0xdcfa470>
f, ax = plt.subplots(figsize=(6, 6))
cmap = sns.cubehelix_palette(as_cmap=True, dark=0, light=1, reverse=True)
sns.kdeplot(df.x, df.y, cmap=cmap, n_levels=60, shade=True) # 通过n_levels参数,增加轮廓线的数量,达到连续化核密度图的效果
<matplotlib.axes._subplots.AxesSubplot at 0xf663ba8>
sns.jointplot()绘制后返回JointGrid对象对象,可以通过JointGrid对象来修改图表,例如添加图层或修改其他效果
g = sns.jointplot(x="x", y="y", data=df, kind="kde", color="m") # 生成JointGrid对象
g.plot_joint(plt.scatter, c="w", s=30, linewidth=1, marker="+")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("$X$", "$Y$")
<seaborn.axisgrid.JointGrid at 0xfae12e8>
sns.pairplot()
iris = sns.load_dataset("iris")
sns.pairplot(iris) # 默认在对角线上绘制单变量的直方图
<seaborn.axisgrid.PairGrid at 0xe31d240>
g = sns.PairGrid(iris)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.kdeplot, cmap="Blues_d", n_levels=6)
C:\Anaconda2\lib\site-packages\matplotlib\axes\_axes.py:519: UserWarning: No labelled objects found. Use label='...' kwarg on individual plots. warnings.warn("No labelled objects found. "
<seaborn.axisgrid.PairGrid at 0x12aa1c88>