Introduction to Python and Data Processing

Special course of the Department of Theoretical Informatics, Faculty of Mechanics and Mathematics, MSU

Detailed course information: http://rmbk.me/mm_python/

Data Visualization in Python

Lecture plan:

  • Why visualization is needed and where it is used
  • Plotting libraries
  • Main ways to display information graphically
  • Scatter plots
  • Line plots
  • Histograms
  • Bar charts
  • Box plots
  • Homework

Why visualization is needed and where it is used

Plotting libraries

  • Matplotlib
  • Seaborn
  • PyQt
  • OpenCV

Environment setup

In [ ]:
# Install additional packages (sklearn is published on PyPI as scikit-learn)
import sys
!{sys.executable} -m pip install numpy matplotlib scikit-learn pandas requests seaborn --user
In [ ]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
import pandas as pd
import matplotlib as mpl

mpl.rcParams['figure.figsize'] = (12, 8)

Scatter Plots

In [29]:
temps = [57.56, 61.52, 53.42, 59.36, 65.3, 71.78, 66.92, 77.18, 74.12, 64.58, 72.68, 62.96]
icecream_sales = [215, 325, 185, 332, 406, 522, 412, 614, 544, 421, 445, 408]

plt.scatter(temps, icecream_sales)
plt.show()
In [ ]:
def scatterplot(x_data, y_data, x_label="", y_label="", title="", color="r", yscale_log=False):
    """Draw a scatter plot of y_data against x_data on a new figure."""
    _, ax = plt.subplots()

    ax.scatter(x_data, y_data, s=30, color=color)

    # Optionally switch the y-axis to a logarithmic scale
    if yscale_log:
        ax.set_yscale('log')

    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)
In [ ]:
scatterplot (temps, icecream_sales, "temperature", "sales", "icecream sales", color="orange")
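
The `yscale_log` option of `scatterplot` is not exercised above. A minimal sketch of when a logarithmic y-axis helps, assuming made-up exponential data (the arrays `x` and `y` are hypothetical and not part of the lecture):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: y grows exponentially with x (not from the lecture)
x = np.linspace(0, 10, 50)
y = np.exp(x)

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(12, 4))

# Linear axis: the small values are squashed against the x-axis
ax_lin.scatter(x, y, s=30, color="orange")
ax_lin.set_title("linear y-axis")

# Logarithmic axis: an exponential trend turns into a straight line
ax_log.scatter(x, y, s=30, color="orange")
ax_log.set_yscale("log")
ax_log.set_title("logarithmic y-axis")
```

Calling `scatterplot(x, y, yscale_log=True)` should give the same picture as the right-hand panel.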

Let's try it ourselves

In [ ]:
# plt.scatter(        
#         species_df['sepal length (cm)'],        
#         species_df['petal length (cm)'],
#         color=colours[i],        
#         alpha=0.5,        
#         label=species[i]   
#     )
In [ ]:
iris = datasets.load_iris()
iris_df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
iris_df['species'] = iris['target']

colours = ['red', 'orange', 'blue']
species = ['I. setosa', 'I. versicolor', 'I. virginica']

for i in range(3):
    species_df = iris_df[iris_df['species'] == i]
    # your code here: call plt.scatter(...) for this species (see the hint in the previous cell)

plt.xlabel('sepal length (cm)')
plt.ylabel('petal length (cm)')
plt.title('Iris dataset: petal length vs sepal length')
plt.legend(loc='lower right')

plt.show()

Line Plots

In [ ]:
plt.plot(temps, icecream_sales)
plt.show()

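Something went wrong! plt.plot connects the points in the order in which they are given, and the temperatures are not sorted, so the line jumps back and forth. Let's fix this by sorting the points by temperature.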

In [ ]:
plt.plot (sorted(temps), [x for _,x in sorted(zip(temps, icecream_sales))])
plt.xlabel ("temperature")
plt.ylabel ("sales")
plt.title ("Icecream sales")
plt.show ()
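
The same reordering can be written with NumPy. This is a sketch rather than the lecture's method: it assumes the two lists are converted to arrays, and the names `order`, `temps_sorted`, and `sales_sorted` are introduced here only for illustration.

```python
import numpy as np

temps = np.array([57.56, 61.52, 53.42, 59.36, 65.3, 71.78,
                  66.92, 77.18, 74.12, 64.58, 72.68, 62.96])
icecream_sales = np.array([215, 325, 185, 332, 406, 522,
                           412, 614, 544, 421, 445, 408])

# Indices that would put the temperatures in ascending order
order = np.argsort(temps)

# Apply the same permutation to both arrays so each pair stays together
temps_sorted = temps[order]
sales_sorted = icecream_sales[order]

print(temps_sorted[:3], sales_sorted[:3])
```

Passing `temps_sorted` and `sales_sorted` to `plt.plot` should draw the same line as the cell above.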

Let's try it ourselves

In [ ]:
def lineplot(x_data, y_data, x_label="", y_label="", title="", tics_step=1):
    """Draw a line plot, placing an x-axis tick at every tics_step-th point."""
    fig, ax = plt.subplots()

    fig.set_size_inches(19, 11)

    ax.plot(x_data, y_data, lw=2, color='green', alpha=1)

    ax.set_title(title)
    ax.set_xlabel(x_label)
    ax.set_ylabel(y_label)

    ax.xaxis.set_ticks(x_data[0::tics_step])
In [ ]:
import io
import requests
import pandas as pd

url = "https://datahub.io/core/sea-level-rise/r/csiro_alt_gmsl_mo_2015.csv"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
c = pd.read_csv(io.StringIO(response.content.decode('utf-8')))


# your code here (write a function that draws a line plot and call it on the data: x = c["Time"], y = c["GMSL"])
# hint: the x-axis ticks can be set with ax.xaxis.set_ticks(x_data[0::tics_step])
In [ ]:
lineplot(c["Time"], c["GMSL"], tics_step=20, x_label="Time", y_label="mean sea level", title="Sea level")

Histograms

In [3]:
uniform = np.random.uniform(0,1,1000)
normal = np.random.normal(0.5,0.1,1000)
In [8]:
def histogram(ax, data, bins, cumulative=False, x_label = "", y_label = "", title = ""):
    ax.hist(data, bins = bins, cumulative = cumulative)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    
fig, ax = plt.subplots(1,2)
fig.set_size_inches (15,5)
histogram (ax[0], normal, 100, title="Normal distribution", x_label="Value", y_label="Count")
histogram (ax[1], uniform, 10, title="Uniform distribution", x_label="Value", y_label="Count", cumulative=True)
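
To see the numbers behind the two panels, the bin counts can be computed without drawing anything. The sketch below assumes a fixed random seed (so the sample differs from the one plotted above) and uses `np.histogram` and `np.cumsum`. It illustrates what `cumulative=True` means and does not claim to reproduce Matplotlib's internals.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
uniform = rng.uniform(0, 1, 1000)

# counts[i] is the number of samples in [edges[i], edges[i + 1]);
# the last bin also includes its right edge
counts, edges = np.histogram(uniform, bins=10)

# A cumulative histogram is the running sum of the per-bin counts
cumulative = np.cumsum(counts)

print(counts)
print(cumulative)
```

The last cumulative value equals the sample size, which is why the right-hand panel climbs to 1000.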
In [30]:
def overlaid_histogram(data1, data2, n_bins = 0, data1_name="", data1_color="#539caf", data2_name="", data2_color="#7663b0", x_label="", y_label="", title=""):
    max_nbins = 10
    data_range = [min(min(data1), min(data2)), max(max(data1), max(data2))]
    binwidth = (data_range[1] - data_range[0]) / max_nbins

    if n_bins == 0:
        bins = np.arange(data_range[0], data_range[1] + binwidth, binwidth)
    else:
        bins = n_bins

    _, ax = plt.subplots()
    ax.hist(data1, bins = bins, color = data1_color, alpha = 1, label = data1_name)
    ax.hist(data2, bins = bins, color = data2_color, alpha = 0.75, label = data2_name)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    ax.legend(loc = 'best')
    
overlaid_histogram (normal, uniform, n_bins=30, data1_name="Normal", data2_name="Uniform")
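
When `bins` is an integer, each `ax.hist` call chooses its own bin edges from the range of the data it receives, so the bars of two overlaid samples need not line up. A minimal sketch of one way to share the edges, assuming `np.histogram_bin_edges` over the concatenated samples and a fixed seed (both are choices made here, not the lecture's code):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
normal = rng.normal(0.5, 0.1, 1000)
uniform = rng.uniform(0, 1, 1000)

# One set of 30 bins computed over both samples together
combined = np.concatenate([normal, uniform])
edges = np.histogram_bin_edges(combined, bins=30)

_, ax = plt.subplots()
# Passing the same edge array to both calls aligns the bars
ax.hist(normal, bins=edges, alpha=1, label="Normal")
ax.hist(uniform, bins=edges, alpha=0.75, label="Uniform")
ax.legend(loc="best")
```
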
In [31]:
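import seaborn as sns

# sns.distplot is deprecated since seaborn 0.11; histplot with kde=True and
# stat="density" draws a comparable histogram with a density curve
sns.histplot(normal, kde=True, stat="density", label="normal")
sns.histplot(uniform, kde=True, stat="density", label="uniform")
plt.legend()
plt.show()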

Let's try it ourselves

In [ ]:
# Needs `data` from the next cell; histplot replaces the deprecated sns.distplot
for i in data.target.unique():
    sns.histplot(data['alcohol'][data.target == i],
                 kde=True, stat="density", label=str(i))

plt.legend()
plt.show()
In [11]:
raw_data = datasets.load_wine()
features = pd.DataFrame(data=raw_data['data'],columns=raw_data['feature_names'])
data = features
data['target']=raw_data['target']
data['class']=data['target'].map(lambda ind: raw_data['target_names'][ind])

#your code here (x = data['alcohol'][data.target==i])

Bar Charts

In [16]:
def autolabel(rects, xpos='center', ax=None):
    """
    Attach a text label above each bar in *rects*, displaying its height.

    *xpos* indicates which side to place the text w.r.t. the center of
    the bar. It can be one of the following {'center', 'right', 'left'}.
    *ax* is the Axes to draw on; the current Axes is used when it is omitted.
    """
    # Do not rely on a global `ax`: fall back to the current Axes
    if ax is None:
        ax = plt.gca()

    xpos = xpos.lower()  # normalize the case of the parameter
    ha = {'center': 'center', 'right': 'left', 'left': 'right'}
    offset = {'center': 0.5, 'right': 0.57, 'left': 0.43}  # x_txt = x + w*off

    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()*offset[xpos], 1.01*height,
                '{}'.format(height), ha=ha[xpos], va='bottom')

def draw_diag ():
    men_means, men_std = (20, 35, 30, 35, 27), (2, 3, 4, 1, 2)
    women_means, women_std = (25, 32, 34, 20, 25), (3, 5, 2, 3, 3)

    ind = np.arange(len(men_means))  # the x locations for the groups
    width = 0.35  # the width of the bars

    fig, ax = plt.subplots()
    rects1 = ax.bar(ind - width/2, men_means, width, yerr=men_std,
                    color='SkyBlue', label='Men')
    rects2 = ax.bar(ind + width/2, women_means, width, yerr=women_std,
                    color='IndianRed', label='Women')


    # Add some text for labels, title and custom x-axis tick labels, etc.
    ax.set_ylabel('Scores')
    ax.set_title('Scores by group and gender')
    ax.set_xticks(ind)
    ax.set_xticklabels(('G1', 'G2', 'G3', 'G4', 'G5'))
    ax.legend()

    autolabel(rects1, "left")
    autolabel(rects2, "right")

    plt.show()
In [17]:
draw_diag ()
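
Matplotlib 3.4 and newer also ship a built-in helper for labelling bars, `Axes.bar_label`. The sketch below is an alternative to the hand-written `autolabel`, not a replacement for it: it assumes a recent enough Matplotlib and reuses only the `men_means` values from `draw_diag`.

```python
import numpy as np
import matplotlib.pyplot as plt

men_means = (20, 35, 30, 35, 27)
ind = np.arange(len(men_means))  # the x locations of the bars

fig, ax = plt.subplots()
bars = ax.bar(ind, men_means, width=0.35, color="SkyBlue", label="Men")

# bar_label writes each bar's height above it; padding is in points
labels = ax.bar_label(bars, padding=3)

ax.set_xticks(ind)
ax.set_xticklabels(("G1", "G2", "G3", "G4", "G5"))
ax.legend()
```

Unlike `autolabel`, this offers no left or right shift of the text, so the manual version is still useful for grouped bars with error bars.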
In [18]:
r = [0,1,2,3,4]
raw_data = {'greenBars': [20, 1.5, 7, 10, 5], 'orangeBars': [5, 15, 5, 10, 15],'blueBars': [2, 15, 18, 5, 10]}
df = pd.DataFrame(raw_data)
 
# From raw value to percentage
totals = [i+j+k for i,j,k in zip(df['greenBars'], df['orangeBars'], df['blueBars'])]
greenBars = [i / j * 100 for i,j in zip(df['greenBars'], totals)]
orangeBars = [i / j * 100 for i,j in zip(df['orangeBars'], totals)]
blueBars = [i / j * 100 for i,j in zip(df['blueBars'], totals)]
 
# plot
barWidth = 0.85
names = ('A','B','C','D','E')
In [19]:
# Create green Bars
plt.bar(r, greenBars, color='#b5ffb9', edgecolor='white', width=barWidth, label="group A")
# Create orange Bars
plt.bar(r, orangeBars, bottom=greenBars, color='#f9bc86', edgecolor='white', width=barWidth, label="group B")
# Create blue Bars
plt.bar(r, blueBars, bottom=[i+j for i,j in zip(greenBars, orangeBars)], color='#a3acff', edgecolor='white', width=barWidth, label="group C")
 
# Custom x axis
plt.xticks(r, names)
plt.xlabel("group")
 
# Add a legend
plt.legend(loc='upper left', bbox_to_anchor=(1,1), ncol=1)
 
# Show graphic
plt.show()
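
The conversion from raw values to percentages above uses three list comprehensions. The same normalization can be sketched as a single pandas operation. The name `percent` is introduced here for illustration, and the data is copied from the cell above.

```python
import pandas as pd

df = pd.DataFrame({
    "greenBars": [20, 1.5, 7, 10, 5],
    "orangeBars": [5, 15, 5, 10, 15],
    "blueBars": [2, 15, 18, 5, 10],
})

# Divide every row by its own total, then scale to percent
percent = df.div(df.sum(axis=1), axis=0) * 100

print(percent.round(1))
```

Each row of `percent` sums to 100, so its columns can be passed to `plt.bar` in place of `greenBars`, `orangeBars`, and `blueBars`.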

Let's try it ourselves

In [ ]:
# Needs objects, y_pos and performance from the next cell
plt.bar(y_pos, performance, align='center', alpha=0.5)
plt.xticks(y_pos, objects)
plt.ylabel('Usage')
plt.title('Programming language usage')

plt.show()
In [21]:
objects = ('Python', 'C++', 'Java', 'Perl', 'Scala', 'Lisp')
y_pos = np.arange(len(objects))
performance = [10,8,6,4,2,1]
 
#your code here

Box Plots

In [22]:
def boxplot(x_data, y_data, base_color="#539caf", median_color="#297083", x_label="", y_label="", title=""):
    _, ax = plt.subplots()

    # Draw boxplots, specifying desired style
    ax.boxplot(y_data
               # patch_artist must be True to control box fill
               , patch_artist = True
               # Properties of median line
               , medianprops = {'color': median_color}
               # Properties of box
               , boxprops = {'color': base_color, 'facecolor': base_color}
               # Properties of whiskers
               , whiskerprops = {'color': base_color}
               # Properties of whisker caps
               , capprops = {'color': base_color})

    # By default, the tick label starts at 1 and increments by 1 for
    # each box drawn. This sets the labels to the ones we want
    ax.set_xticklabels(x_data)
    ax.set_ylabel(y_label)
    ax.set_xlabel(x_label)
    ax.set_title(title)
    
    
tips = sns.load_dataset("tips")
desired_tips = [tips["total_bill"][tips["day"] == day] for day in np.unique(tips["day"])]
boxplot (np.unique(tips["day"]), desired_tips)
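
The elements of a box plot correspond to a handful of numbers. The sketch below computes them by hand for Matplotlib's default whisker rule (1.5 times the interquartile range). The `bills` array is hypothetical and is not taken from the tips dataset.

```python
import numpy as np

# Hypothetical bills, not taken from the tips dataset
bills = np.array([8.0, 10.5, 12.0, 13.5, 15.0, 16.5, 18.0, 21.0, 24.0, 48.0])

# The box spans q1..q3, with a line at the median
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points within 1.5 * IQR of the box
low_limit, high_limit = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = bills[(bills >= low_limit) & (bills <= high_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the whiskers are drawn individually as outliers
outliers = bills[(bills < low_limit) | (bills > high_limit)]

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

Here only the value 48.0 falls outside the whiskers, so it would be drawn as a separate point.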
In [ ]:
sns.set(style="whitegrid")
ax = sns.boxplot(x=tips["total_bill"])
In [ ]:
ax = sns.boxplot(x="day", y="total_bill", data=tips)
ax = sns.swarmplot(x="day", y="total_bill", data=tips, color=".25")
plt.show ()
In [ ]:
sns.boxplot(x="day", y="total_bill", hue="smoker", data=tips)
plt.show ()

Let's try it ourselves

In [ ]:
# your code here (sns.boxplot(..., hue="smoker", ...))

Homework No. 5

Plot the distribution functions of the following distributions:

  • Poisson distribution (np.random.poisson)
  • Pareto distribution (np.random.pareto)
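
One possible starting point is an empirical distribution function built from a sample. The sketch below uses a normal sample rather than the two distributions from the assignment. The helper name `ecdf` and the fixed seed are assumptions introduced here.

```python
import numpy as np
import matplotlib.pyplot as plt

def ecdf(sample):
    """Return the sorted values and the fraction of samples <= each value."""
    xs = np.sort(sample)
    ys = np.arange(1, len(xs) + 1) / len(xs)
    return xs, ys

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
sample = rng.normal(0.5, 0.1, 1000)

xs, ys = ecdf(sample)

_, ax = plt.subplots()
# A step plot, because the empirical function jumps at every sample value
ax.step(xs, ys, where="post")
ax.set_xlabel("Value")
ax.set_ylabel("Fraction of samples <= value")
ax.set_title("Empirical distribution function")
```
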

Plot how housing prices in Boston depend on various parameters. Choose the most suitable type of plot for each of them.

In [23]:
# Load the dataset (load_boston was removed in scikit-learn 1.2,
# so this cell needs an older scikit-learn version)
boston = datasets.load_boston()
boston_df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
# Add a column with the price
boston_df['PRICE'] = boston.target
# See which columns are available
print("Columns: " + " ".join(boston_df.columns))

plt.scatter(boston_df["PRICE"], boston_df["CRIM"])
plt.title("You think this is a bad neighborhood?")
plt.xlabel("Price")
plt.ylabel("Crime")
plt.show()
Columns: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE