#!/usr/bin/env python
# coding: utf-8

# # A Python Tour of Data Science: Data Visualization
#
# [Michaël Defferrard](http://deff.ch), *PhD student*, [EPFL](http://epfl.ch) [LTS2](http://lts2.epfl.ch)

# # Exercise
#
# Data visualization is a key aspect of exploratory data analysis.
# During this exercise we'll gradually build more and more complex visualizations. We'll do this by replicating plots. Try to reproduce not only the lines but also the axis labels, legends, and titles.
#
# * Goal of data visualization: clearly and efficiently communicate information through visual representations. While tables are generally used to look up a specific measurement, charts are used to show patterns or relationships.
# * Means: mainly statistical graphics for exploratory analysis, e.g. scatter plots, histograms, probability plots, box plots, residual plots, but also [infographics](https://en.wikipedia.org/wiki/Infographic) for communication.
#
# *Data visualization is both an art and a science. It should combine both aesthetic form and functionality.*

# # 1 Time series
#
# To start slowly, let's make a static line plot from some time series. Reproduce the plots below using:
# 1. The procedural API of [matplotlib](http://matplotlib.org), the main data visualization library for Python. Its procedural API is similar to MATLAB's and convenient for interactive work.
# 2. [Pandas](http://pandas.pydata.org), which wraps matplotlib around its DataFrame format and makes many standard plots easy to code. It offers many [helpers for data visualization](http://pandas.pydata.org/pandas-docs/version/0.19.1/visualization.html).
#
# **Hint**: to plot with pandas, you first need to create a DataFrame, pandas' tabular data format.

# In[1]:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

# Random time series.
n = 1000
rs = np.random.RandomState(42)
data = rs.randn(n, 4).cumsum(axis=0)


# In[2]:


# plt.figure(figsize=(15,5))
# plt.plot(data[:, 0])


# In[3]:


# df = pd.DataFrame(...)
# df.plot(...)


# # 2 Categories
#
# Categorical data is best represented by [bar](https://en.wikipedia.org/wiki/Bar_chart) or [pie](https://en.wikipedia.org/wiki/Pie_chart) charts. Reproduce the plots below using the object-oriented API of matplotlib, which is recommended for programming.
#
# **Question**: What are the pros and cons of each plot?
#
# **Tip**: the [matplotlib gallery](http://matplotlib.org/gallery.html) is a convenient starting point.

# In[4]:


data = [10, 40, 25, 15, 10]
categories = list('ABCDE')


# In[5]:


fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Right plot.
# axes[1].
# axes[1].

# Left plot.
# axes[0].
# axes[0].


# # 3 Frequency
#
# A frequency plot is a graph that shows the pattern in a set of data by plotting how often particular values of a measure occur. It often takes the form of a [histogram](https://en.wikipedia.org/wiki/Histogram) or a [box plot](https://en.wikipedia.org/wiki/Box_plot).
#
# Reproduce the plots with the following three libraries, which provide high-level declarative syntax for statistical visualization as well as a convenient interface to pandas:
# * [Seaborn](http://seaborn.pydata.org) is a statistical visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. Its advantage is that you can modify the produced plots with matplotlib, so you lose nothing.
# * [ggplot](http://ggplot.yhathq.com) is a (partial) port of the popular [ggplot2](http://ggplot2.org) for R. It has its roots in the influential book [the grammar of graphics](https://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html). Convenient if you know ggplot2 already.
# * [Vega](https://vega.github.io/) is a declarative format for statistical visualization based on [D3.js](https://d3js.org), a low-level JavaScript library for interactive visualization. [Vincent](https://vincent.readthedocs.io/en/latest/) (discontinued) and [altair](https://altair-viz.github.io/) are Python interfaces to Vega. Altair is quite new and does not provide all the needed functionality yet, but it is promising!
#
# **Hints**:
# * For Seaborn, look at `distplot()` and `boxplot()`.
# * For ggplot, we are interested in the [geom_histogram](http://ggplot.yhathq.com/docs/geom_histogram.html) geometry.

# In[6]:


import seaborn as sns
import os
df = sns.load_dataset('iris', data_home=os.path.join('..', 'data'))


# In[7]:


fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Your code for Seaborn: distplot() and boxplot().


# In[8]:


import ggplot

# Your code for ggplot.


# In[10]:


import altair

# altair.Chart(df).mark_bar(opacity=.75).encode(
#     x=...,
#     y=...,
#     color=...
# )


# # 4 Correlation
#
# [Scatter plots](https://en.wikipedia.org/wiki/Scatter_plot) are widely used to assess the correlation between two variables. Pair plots are then a useful way of displaying the pairwise relations between variables in a dataset.
#
# Use the seaborn `pairplot()` function to analyze how separable the iris dataset is.

# In[11]:


# One line with Seaborn.


# # 5 Dimensionality reduction
#
# Humans can only comprehend up to three spatial dimensions (beyond that, there are e.g. color or size), so [dimensionality reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction) is often needed to explore high-dimensional datasets. Analyze how separable the iris dataset is by visualizing it in a 2D scatter plot after reduction from 4 to 2 dimensions with two popular methods:
# 1. The classical [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis).
# 2. [t-distributed stochastic neighbor embedding (t-SNE)](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding).
#
# **Hints**:
# * t-SNE is a stochastic method, so you may want to run it multiple times.
# * The easiest way to create the scatter plot is to add columns to the pandas DataFrame, then use the Seaborn `swarmplot()`.

# In[12]:


from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


# In[13]:


# df['pca1'] =
# df['pca2'] =

# df['tsne1'] =
# df['tsne2'] =


# In[14]:


fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.swarmplot(x='pca1', y='pca2', data=df, hue='species', ax=axes[0])
sns.swarmplot(x='tsne1', y='tsne2', data=df, hue='species', ax=axes[1]);


# # 6 Interactive visualization
#
# For interactive visualization, look at [bokeh](http://bokeh.pydata.org) (we used it during the [data exploration exercise](http://nbviewer.jupyter.org/github/mdeff/ntds_2016/blob/with_outputs/toolkit/01_demo_acquisition_exploration.ipynb#4-Interactive-Visualization)) or [VisPy](http://vispy.org).

# # 7 Geographic map
#
# If you want to visualize data on an interactive map, look at [Folium](https://github.com/python-visualization/folium).
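
# # Appendix: a sketch of the PCA reduction from section 5
#
# To make the "reduction from 4 to 2 dimensions" of section 5 more concrete, here is a minimal sketch of PCA written with plain NumPy. It is one possible illustration, not the reference solution of the exercise. Assumptions: the helper name `pca_project` is hypothetical, and the data is a random stand-in with the same shape as iris (150 samples, 4 features), not the iris measurements themselves.

```python
import numpy as np

# Stand-in data (assumption): random values with the same shape as iris.
rs = np.random.RandomState(42)
X = rs.randn(150, 4) * np.array([3.0, 2.0, 1.0, 0.5])


def pca_project(X, n_components=2):
    """Project the rows of X onto their first principal components.

    Hypothetical helper written for this sketch.
    """
    # PCA works on centered data: subtract the mean of each feature.
    X_centered = X - X.mean(axis=0)
    # The rows of Vt are the principal directions,
    # sorted by decreasing singular value (i.e. decreasing variance).
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Coordinates of each sample along the first n_components directions.
    return X_centered @ Vt[:n_components].T


Z = pca_project(X)
print(Z.shape)  # (150, 2)
```

# With the iris DataFrame, the two columns of such a projection could be stored as `df['pca1']` and `df['pca2']`. The `PCA` class imported from scikit-learn in section 5 should give the same coordinates up to the sign of each component.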