datascience to pandas

Throughout this semester we have been working mostly with the datascience package, which was created specifically for the Berkeley Data8 course.
The package is very useful for the purposes of the course, but once you start programming in Python on your own, you will most likely want to use the pandas package, on top of which the datascience package is built. pandas is a versatile package that is very popular among Python users. To help you transition from datascience to pandas, today we are going to talk about the basics of pandas and how the commands we learned from the datascience package translate into pandas.
To start using pandas we need to import it. The common abbreviation for pandas is pd (similar to np for numpy).
from datascience import *
import pandas as pd
import numpy as np
import matplotlib
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
In Python's datascience module, we used Table to build our tables and used commands such as select(), where(), group(), column(), etc. In this section, we will go over some basic commands for working with tabular data in pandas. pandas introduces a data structure (the dataframe) that represents data as a table with columns and rows. In the datascience module used in Data8, this is how we created a table from scratch by extending an empty table:
ds_tb = Table().with_columns([
'letter', ['a', 'b', 'c', 'z'],
'count', [ 9, 3, 3, 1],
'points', [ 1, 2, 2, 10],
])
ds_tb
In pandas, we can use the function pd.DataFrame to initialize a dataframe from a dictionary or a list-like object. Refer to the documentation for more information.
What is a dictionary? A dictionary is a Python data structure that stores data as key:value pairs. Keys must be unique, and values can be more or less anything: an integer, a character, an array, a list, even another dictionary.
person = {'name':'Alice', 'age':27, 'occupation':'student'} # basic dictionary (avoid the name `dict`, which shadows the built-in type)
print(person)
person['name']
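As noted above, dictionary values can themselves be lists or even other dictionaries. A small illustration (the names and values here are made up for the example):

```python
# A dictionary whose values have mixed types
course = {
    'name': 'Data 8',
    'topics': ['tables', 'sampling', 'inference'],  # value is a list
    'staff': {'instructor': 'Alice', 'tas': 4}      # value is another dictionary
}

course['topics'][0]            # index into a list value -> 'tables'
course['staff']['instructor']  # index into a nested dictionary -> 'Alice'
```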
# Example: creating a dataframe from a dictionary
pd_df = pd.DataFrame({ 'letter' : ['a', 'b', 'c', 'z'],
'count' : [ 9, 3, 3, 1],
'points' : [ 1, 2, 2, 10]
})
pd_df
More often, we will need to create a dataframe by importing data from a .csv file. In datascience, this is how we read data from a csv:
ds_baby = Table.read_table('baby.csv')
ds_baby
In pandas, we use pd.read_csv() to read data from a csv file. Sometimes, depending on the data file, we may need to specify the parameters sep, header or encoding as well. For a full list of parameters, refer to the documentation.
# Reading baby.csv (located in current working directory)
pd_baby = pd.read_csv('baby.csv')
pd_baby # displays the first and last few rows
pd_baby.head() # display the first few rows of dataframe
pd_baby.tail(10) # display the last 10 rows of dataframe
# View summary of data
pd_baby.describe()
We can also read files from a URL, which is very handy. (This works for Table.read_table too.)
# Example: loading csv from URL
pd_galton = pd.read_csv('https://raw.githubusercontent.com/YData123/sds123-sp21/main/demos/lec30/galton.csv')
pd_galton.head()
# View information about dataframe
print(pd_galton.shape) # view dimensions (rows, cols)
print(pd_galton.columns.values) # view column names
In datascience, we can use column() to access the values in a particular column as follows:
# Access column 'letter', returns array
ds_tb.column('letter')
# We can also access the column using square bracket notation:
ds_tb['letter']
In pandas, we access a column using the square bracket notation:
pd_galton['gender']
type(pd_galton['gender']) # pandas type Series
If we want a numpy array of the column's values, it is available via the .values attribute:
pd_galton['gender'].values
type(pd_galton['gender'].values)
In datascience, we used take() to access rows in the Table:
# Selecting first two rows using Python's slicing notation
ds_tb.take[0:2]
In pandas, we can access rows and columns by their position using the iloc indexer. We specify the rows and columns we want with the syntax df.iloc[<rows>, <columns>]. For more information on indexing, refer to this guide.
# Selecting first two rows using iloc
pd_baby.iloc[0:2, :]
# Specifying row indices
pd_baby.iloc[[1, 4, 6], :]
# Selecting the first three rows and the second and third columns using iloc
pd_baby.iloc[0:3, 1:3]
We can also access a specific value in the dataframe by passing in the row and column indices:
# Get value in second row, third column
pd_baby.iloc[1, 2]
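Besides the position-based iloc, pandas also provides a label-based indexer, loc, which selects by index labels and column names rather than positions. A quick sketch on a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b', 'c'],
                   'count': [9, 3, 3]})

# iloc is positional; loc uses labels (here the default integer
# row labels and the column *names*)
df.iloc[0, 1]          # row position 0, column position 1 -> 9
df.loc[0, 'count']     # row label 0, column name 'count'  -> 9
df.loc[:, ['letter']]  # all rows, column selected by name (a dataframe)
```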
Adding columns
Adding a new column in datascience is done with the with_column() function as follows:
ds_tb = ds_tb.with_column('vowel', ['yes', 'no', 'no', 'no'])
ds_tb
In pandas, we can use the bracket notation and assign a list to a new column name as follows:
# Adding a new column
pd_df['newcol'] = [5, 6, 7, 8]
pd_df
We can also create a new column by performing an operation on an existing column:
# Adding count * 2 to the dataframe
pd_df['doublecount'] = pd_df['count'] * 2
pd_df
Selecting columns
In datascience, we used select() to subset the table by selecting columns:
ds_tb.select(['letter', 'points'])
In pandas, we use double bracket notation to select columns. This returns a dataframe, unlike the Series object returned by single bracket notation:
# Double bracket notation for new dataframe
pd_df[['count', 'doublecount']]
# Alternatively:
# pd_df.iloc[:, [1,4]]
Filtering rows conditionally
In datascience, we used where() to select rows according to a given condition:
ds_tb.where('points', 2) # rows where points == 2
ds_tb.where(ds_tb['count'] < 8) # rows where count < 8
In pandas, we can use bracket notation to subset the dataframe based on a condition. We first specify a condition and then subset using the bracket notation:
# Array of booleans
pd_baby['Maternal.Smoker'] == True
# Filter rows by condition Maternal.Smoker == True
pd_baby[pd_baby['Maternal.Smoker'] == True]
# Filtering with multiple conditions
pd_df[(pd_df['count'] < 8) & (pd_df['points'] > 5)]
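When combining conditions this way, each condition must be wrapped in parentheses, and the element-wise operators are & (and), | (or) and ~ (not) rather than Python's and/or/not. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'count': [9, 3, 3, 1],
                   'points': [1, 2, 2, 10]})

# Either condition may hold: count < 2 OR points > 5
low_or_high = df[(df['count'] < 2) | (df['points'] > 5)]

# Negating a condition with ~: rows where count is NOT below 4
not_low = df[~(df['count'] < 4)]
```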
Renaming columns
In datascience, we used relabeled() to rename columns:
# Rename 'points' to 'other name'
ds_tb.relabeled('points', 'other name')
pandas uses rename(), whose columns parameter takes a dictionary mapping each old name to its new name:
# Rename 'points' to 'other name'
pd_df.rename(columns = {"points" : "other name"})
pd_df
# Alternatively, we can rename in place by reassigning the full list of
# column names (mutating pd_df.columns.values directly is unreliable,
# since it bypasses the column index)
pd_df.columns = ['letter', 'count', 'other name', 'newcol', 'doublecount']
pd_df
pd_df.columns = ['letter', 'count', 'points', 'newcol', 'doublecount']
Sorting a Dataframe by column
In datascience, we used sort() to sort a Table according to the values in a column:
# Sort by count
ds_tb.sort('count')
In pandas, we use sort_values() to sort by column. We use the by parameter to specify the column we want to sort by, and the optional parameter ascending = False if we want to sort in descending order:
# Sort by count, descending
pd_df.sort_values(by = ['count'], ascending = False)
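The by parameter can also take a list of several columns, with a matching list for ascending to control the direction per column. A sketch of a two-column sort on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'count': [3, 9, 3, 1],
                   'points': [2, 1, 5, 10]})

# Sort by count descending, breaking ties by points ascending
df_sorted = df.sort_values(by=['count', 'points'],
                           ascending=[False, True])
```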
Grouping and aggregating
In datascience, we used group() and the collect argument to group a Table by a column and aggregate the values in another column:
# Group by count and aggregate by sum
ds_tb.select(['count', 'points']).group('count', collect = sum)
In pandas, we use groupby() to group the dataframe. This function returns a groupby object, on which we can then call an aggregation function to get a dataframe with aggregated values for the other columns. For more information, refer to the documentation.
# Selecting two columns for brevity
pd_df_subset = pd_df[['count', 'points']]
pd_df_subset
count_sums_df = pd_df_subset.groupby(['count']).sum()
count_sums_df
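One difference from datascience's group() is that groupby() moves the grouping column into the dataframe's index. If we want it back as an ordinary column (the way group() returns it), we can call reset_index(); a short sketch:

```python
import pandas as pd

df = pd.DataFrame({'count': [9, 3, 3, 1],
                   'points': [1, 2, 2, 10]})

grouped = df.groupby(['count']).sum()  # 'count' becomes the index
flat = grouped.reset_index()           # 'count' back as a regular column
```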
Pivot Tables
In datascience, we used the pivot() function to build contingency tables:
# Creating new Table
ds_cones = Table().with_columns(
'Flavor', make_array('strawberry', 'chocolate', 'chocolate',
'strawberry', 'chocolate', 'bubblegum'),
'Color', make_array('pink', 'light brown', 'dark brown',
'pink', 'dark brown', 'pink'),
'Price', make_array(3.55, 4.75, 5.25, 5.25, 5.25, 4.75)
)
ds_cones
# Pivoting on color and flavor
ds_cones.pivot("Flavor", "Color")
We can also pass in the parameter values to specify the values in the table and collect to specify the aggregation function.
# Setting parameters values and collect
ds_cones.pivot("Flavor", "Color", values = "Price", collect = np.sum)
In pandas, we use pd.pivot_table() to create a contingency table. The argument columns is similar to the first argument of datascience's pivot function and sets the column names of the pivot table. The argument index is similar to the second argument of datascience's pivot function and sets the first column of the pivot table, i.e. the keys to group on. For more information, refer to the documentation.
# Creating new dataframe
pd_cones = pd.DataFrame({"Flavor" : ['strawberry', 'chocolate', 'chocolate',
'strawberry', 'chocolate', 'bubblegum'],
"Color" : ['pink', 'light brown', 'dark brown',
'pink', 'dark brown', 'pink'],
"Price" : [3.55, 4.75, 5.25, 5.25, 5.25, 4.75]})
pd_cones
# Creating the pivot table
pd.pivot_table(pd_cones, columns = ["Flavor"], index = ["Color"])
If there is no data for a group, pandas will output NaN ("Not a Number") values.
We can also specify parameters like values (equivalent to values in datascience's pivot) and aggfunc (equivalent to collect in datascience's pivot).
# Additional arguments
pd.pivot_table(pd_cones, columns = ["Flavor"], index = ["Color"],
values = "Price", aggfunc = np.sum)
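If we would rather see zeros than NaN for empty groups, pivot_table also accepts a fill_value parameter. A minimal sketch using the same cones data:

```python
import pandas as pd
import numpy as np

pd_cones = pd.DataFrame({"Flavor": ['strawberry', 'chocolate', 'chocolate',
                                    'strawberry', 'chocolate', 'bubblegum'],
                         "Color": ['pink', 'light brown', 'dark brown',
                                   'pink', 'dark brown', 'pink'],
                         "Price": [3.55, 4.75, 5.25, 5.25, 5.25, 4.75]})

# Empty Flavor/Color combinations become 0 instead of NaN
pivot = pd.pivot_table(pd_cones, columns=["Flavor"], index=["Color"],
                       values="Price", aggfunc="sum", fill_value=0)
```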
Joining/merging
In datascience, we used join() to join two tables based on shared values in columns. We specify the column name in the first table to match on, the name of the second table, and the column name in the second table to match on.
# Creating new table
ds_ratings = Table().with_columns(
'Kind', make_array('strawberry', 'chocolate', 'vanilla'),
'Stars', make_array(2.5, 3.5, 4)
)
ds_ratings
# Joining cones and ratings
ds_cones.join("Flavor", ds_ratings, "Kind")
In pandas, we can use the merge() function to join two tables together. The first parameter is the second dataframe to join with. The parameters left_on and right_on specify the columns to use in the left and right tables respectively. There are more parameters, such as how, which specifies what kind of join to perform (inner (the default), outer, left, right). For more information, refer to this Kaggle Tutorial.
# Creating new ratings df
pd_ratings = pd.DataFrame({"Kind" : ['strawberry', 'chocolate', 'vanilla'],
"Stars" : [2.5, 3.5, 4]})
pd_ratings
# Merging cones and ratings
pd_cones.merge(pd_ratings, left_on = "Flavor", right_on = "Kind")
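To see the effect of the how parameter, here is a sketch of an outer join on similar data: flavors that appear in only one table are kept, with NaN filling the missing side.

```python
import pandas as pd

pd_cones = pd.DataFrame({"Flavor": ['strawberry', 'chocolate', 'bubblegum'],
                         "Price": [3.55, 4.75, 4.75]})
pd_ratings = pd.DataFrame({"Kind": ['strawberry', 'chocolate', 'vanilla'],
                           "Stars": [2.5, 3.5, 4]})

# Outer join: 'bubblegum' has no rating, 'vanilla' has no cone,
# but both rows survive with NaN on the unmatched side
outer = pd_cones.merge(pd_ratings, left_on="Flavor", right_on="Kind",
                       how="outer")
```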
In datascience, we learned to plot data using histograms, line plots, scatter plots and bar charts. The corresponding functions were hist(), plot(), scatter() and barh(). Plotting methods in pandas are nearly identical to those in datascience, since both build on the matplotlib library. In this section we will go through examples of such plots in pandas.
In datascience, we used hist() to create a histogram. In this example, we will be using data from baby.csv. Recall that the baby data set contains data on a random sample of 1,174 mothers and their newborn babies. The column Birth.Weight contains the birth weight of the baby, in ounces; Gestational.Days is the number of gestational days, that is, the number of days the baby was in the womb. There is also data on maternal age, maternal height, maternal pregnancy weight, and whether or not the mother was a smoker.
# Reading in the data
ds_baby = Table.read_table('baby.csv')
ds_baby
# Creating a histogram
ds_baby.hist('Birth.Weight')
In pandas, we use hist() to create histograms, just like in datascience. Refer to the documentation for a full list of parameters.
# Creating a histogram
pd_baby.hist('Birth.Weight');
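hist() also accepts the usual matplotlib options; for example, bins controls the number of bins. A sketch on synthetic data (the birth weights below are randomly generated stand-ins, not the real baby.csv values):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs without a display

# Synthetic stand-in for pd_baby
df = pd.DataFrame({'Birth.Weight':
                   np.random.default_rng(0).normal(120, 18, 500)})

# bins=20 draws one bar per bin; hist() returns an array of Axes
axes = df.hist('Birth.Weight', bins=20)
```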
In datascience, we used plot() to create a line plot of numerical values. In this example, we will be using census data and plot variables such as AGE in a line plot:
# Line plot in datascience
ds_census = Table.read_table("https://raw.githubusercontent.com/YData123/sds123-sp21/main/demos/lec07/nc-est2014-agesex-res.csv").select(['SEX', 'AGE', 'POPESTIMATE2014'])
ds_children = ds_census.where('SEX', are.equal_to(0)).where('AGE', are.below(19)).drop('SEX')
ds_children.plot('AGE')
In pandas, we can use plot.line() to create line plots. For a full list of parameters, refer to the documentation.
# Line plot in pandas
pd_census = pd.read_csv("https://raw.githubusercontent.com/YData123/sds123-sp21/main/demos/lec07/nc-est2014-agesex-res.csv")[["SEX", "AGE", "POPESTIMATE2014"]]
pd_children = pd_census[(pd_census.SEX == 0) & (pd_census.AGE < 19)].drop("SEX", axis = 1)
pd_children.plot.line(x = "AGE", y = "POPESTIMATE2014");
In datascience, we used scatter() to create a scatter plot of two numerical columns:
ds_football = Table.read_table('https://raw.githubusercontent.com/YData123/sds123-sp21/main/demos/lec19/deflategate.csv')
ds_football
ds_football.scatter('Blakeman', 'Prioleau')
In pandas, we use plot.scatter() to create a scatter plot. For a full list of parameters, refer to the documentation.
pd_football = pd.read_csv('https://raw.githubusercontent.com/YData123/sds123-sp21/main/demos/lec19/deflategate.csv')
pd_football.plot.scatter(x = "Blakeman", y = "Prioleau");
In datascience, we used barh() to create a horizontal bar plot:
ds_tb.barh("letter", "points")
In pandas, we use plot.barh() to create a horizontal bar chart. For a full list of parameters, refer to the documentation.
pd_df.plot.barh(x = 'letter', y = 'points');
Here is a list of useful Pandas resources: