datascience to pandas

Throughout this semester we have been working mostly with the datascience package, which was created specifically for the Berkeley Data8 course.
The package is very useful for the purposes of the course, but once you start programming in Python on your own, you will most likely want to use the pandas package, on top of which the datascience package is built. pandas is a versatile package that is very popular among Python users. To help you transition from datascience to pandas, today we are going to talk about the basics of pandas and how the commands we learned from the datascience package translate into pandas.
To start using pandas we need to import it. The common abbreviation for pandas is pd (similar to np for numpy).
from datascience import *
import pandas as pd
import numpy as np
import matplotlib
from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
In Python's datascience module, we used Table to build our tables and used commands such as select(), where(), group(), column(), etc. In this section, we will go over some basic commands for working with tabular data in pandas. pandas introduces a data structure (the dataframe) that represents data as a table with columns and rows. In the datascience module used in Data8, this is how we created a table from scratch by extending an empty table:
ds_tb = Table().with_columns([
'letter', ['a', 'b', 'c', 'z'],
'count', [ 9, 3, 3, 1],
'points', [ 1, 2, 2, 10],
])
ds_tb
In pandas, we can use the function pd.DataFrame to initialize a dataframe from a dictionary or a list-like object. Refer to the documentation for more information.
What is a dictionary? A dictionary is a Python data structure that stores data as key:value pairs. Keys must be unique, and values can be more or less anything: an integer, a character, an array, a list, even another dictionary.
person = {'name':'Alice', 'age':27, 'occupation':'student'} # basic dictionary (avoid the name `dict`, which shadows the built-in type)
print(person)
person['name']
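As noted above, dictionary values can themselves be lists or even other dictionaries. A small illustration (the names and values here are made up for the example):

```python
# A dictionary whose values have mixed types
course = {
    'name': 'Data 8',
    'topics': ['tables', 'sampling', 'inference'],  # value is a list
    'staff': {'instructor': 'Alice', 'tas': 4}      # value is another dictionary
}

course['topics'][0]            # index into a list value -> 'tables'
course['staff']['instructor']  # index into a nested dictionary -> 'Alice'
```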
# Example: creating a dataframe from a dictionary
pd_df = pd.DataFrame({ 'letter' : ['a', 'b', 'c', 'z'],
'count' : [ 9, 3, 3, 1],
'points' : [ 1, 2, 2, 10]
})
pd_df
More often, we will need to create a dataframe by importing data from a .csv file. In datascience, this is how we read data from a csv:
ds_baby = Table.read_table('baby.csv')
ds_baby
In pandas, we use pd.read_csv() to read data from a csv file. Sometimes, depending on the data file, we may need to specify the parameters sep, header or encoding as well. For a full list of parameters, refer to the documentation.
# Reading baby.csv (located in current working directory)
pd_baby = pd.read_csv('baby.csv')
pd_baby # displays the first and last few rows
pd_baby.head() # display the first few rows of dataframe
pd_baby.tail(10) # display the last 10 rows of dataframe
# View summary of data
pd_baby.describe()
We can also read files from a URL, which is very handy. (This works for Table.read_table too.)
# Example: loading csv from URL
pd_galton = pd.read_csv('https://raw.githubusercontent.com/YData123/sds123-sp21/main/demos/lec30/galton.csv')
pd_galton.head()
# View information about dataframe
print(pd_galton.shape) # view dimensions (rows, cols)
print(pd_galton.columns.values) # view column names
In datascience, we can use column() to access the values in a particular column as follows:
# Access column 'letter', returns array
ds_tb.column('letter')
# We can also access the column using square bracket notation:
ds_tb['letter']
In pandas, we access a column using the square bracket notation:
pd_galton['gender']
type(pd_galton['gender']) # pandas type Series
If we want a numpy array of the column's values, it is available via the .values attribute:
pd_galton['gender'].values
type(pd_galton['gender'].values)
In datascience, we used take() to access rows in the Table:
# Selecting first two rows using Python's slicing notation
ds_tb.take[0:2]
In pandas, we can access rows and columns by their position using the iloc indexer. We specify the rows and columns we want with the syntax df.iloc[<rows>, <columns>]. For more information on indexing, refer to this guide.
# Selecting first two rows using iloc
pd_baby.iloc[0:2, :]
# Specifying row indices
pd_baby.iloc[[1, 4, 6], :]
# Selecting the first three rows and the second and third columns using iloc
pd_baby.iloc[0:3, 1:3]
We can also access a specific value in the dataframe by passing in the row and column indices:
# Get value in second row, third column
pd_baby.iloc[1, 2]
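Besides the position-based iloc, pandas also provides a label-based indexer, loc, which selects by index labels and column names rather than positions. A quick sketch on a small made-up dataframe:

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b', 'c'],
                   'count': [9, 3, 3]})

# iloc is positional; loc uses labels (here the default integer
# row labels and the column *names*)
df.iloc[0, 1]          # row position 0, column position 1 -> 9
df.loc[0, 'count']     # row label 0, column name 'count'  -> 9
df.loc[:, ['letter']]  # all rows, column selected by name (a dataframe)
```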
Adding columns
Adding a new column in datascience is done with the with_column() function as follows:
ds_tb = ds_tb.with_column('vowel', ['yes', 'no', 'no', 'no'])
ds_tb
In pandas, we can use the bracket notation and assign a list to a new column name as follows:
# Adding a new column
pd_df['newcol'] = [5, 6, 7, 8]
pd_df
We can also create a new column by performing an operation on an existing column:
# Adding count * 2 to the dataframe
pd_df['doublecount'] = pd_df['count'] * 2
pd_df
Selecting columns
In datascience, we used select() to subset the table by selecting columns:
ds_tb.select(['letter', 'points'])
In pandas, we use double bracket notation to select columns. This returns a dataframe, unlike the Series object returned by single bracket notation:
# Double bracket notation for new dataframe
pd_df[['count', 'doublecount']]
# Alternatively:
# pd_df.iloc[:, [1,4]]
Filtering rows conditionally
In datascience, we used where() to select rows according to a given condition:
ds_tb.where('points', 2) # rows where points == 2
ds_tb.where(ds_tb['count'] < 8) # rows where count < 8
In pandas, we can use bracket notation to subset the dataframe based on a condition. We first specify a condition and then subset using the bracket notation:
# Array of booleans
pd_baby['Maternal.Smoker'] == True
# Filter rows by condition Maternal.Smoker == True
pd_baby[pd_baby['Maternal.Smoker'] == True]
# Filtering with multiple conditions
pd_df[(pd_df['count'] < 8) & (pd_df['points'] > 5)]
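When combining conditions this way, each condition must be wrapped in parentheses, and the element-wise operators are & (and), | (or) and ~ (not) rather than Python's and/or/not. A small sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'count': [9, 3, 3, 1],
                   'points': [1, 2, 2, 10]})

# Either condition may hold: count < 2 OR points > 5
low_or_high = df[(df['count'] < 2) | (df['points'] > 5)]

# Negating a condition with ~: rows where count is NOT below 4
not_low = df[~(df['count'] < 4)]
```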
Renaming columns
In datascience, we used relabeled() to rename columns:
# Rename 'points' to 'other name'
ds_tb.relabeled('points', 'other name')
pandas uses rename(), whose columns parameter takes a dictionary mapping each old name to its new name:
# Rename 'points' to 'other name'
pd_df.rename(columns = {"points" : "other name"})
pd_df
# Alternatively, we can rename in place by reassigning the full list of
# column names (mutating pd_df.columns.values directly is unreliable,
# since it bypasses the column index)
pd_df.columns = ['letter', 'count', 'other name', 'newcol', 'doublecount']
pd_df
pd_df.columns = ['letter', 'count', 'points', 'newcol', 'doublecount']
Sorting a Dataframe by column
In datascience, we used sort() to sort a Table according to the values in a column:
# Sort by count
ds_tb.sort('count')
In pandas, we use sort_values() to sort by column. We use the by parameter to specify the column we want to sort by, and the optional parameter ascending = False if we want to sort in descending order:
# Sort by count, descending
pd_df.sort_values(by = ['count'], ascending = False)
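The by parameter can also take a list of several columns, with a matching list for ascending to control the direction per column. A sketch of a two-column sort on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'count': [3, 9, 3, 1],
                   'points': [2, 1, 5, 10]})

# Sort by count descending, breaking ties by points ascending
df_sorted = df.sort_values(by=['count', 'points'],
                           ascending=[False, True])
```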
Grouping and aggregating
In datascience, we used group() and the collect argument to group a Table by a column and aggregate the values in another column:
# Group by count and aggregate by sum
ds_tb.select(['count', 'points']).group('count', collect = sum)
In pandas, we use groupby() to group the dataframe. This function returns a groupby object, on which we can then call an aggregation function to get a dataframe with aggregated values for the other columns. For more information, refer to the documentation.
# Selecting two columns for brevity
pd_df_subset = pd_df[['count', 'points']]
pd_df_subset
count_sums_df = pd_df_subset.groupby(['count']).sum()
count_sums_df
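One difference from datascience's group() is that groupby() moves the grouping column into the dataframe's index. If we want it back as an ordinary column (the way group() returns it), we can call reset_index(); a short sketch:

```python
import pandas as pd

df = pd.DataFrame({'count': [9, 3, 3, 1],
                   'points': [1, 2, 2, 10]})

grouped = df.groupby(['count']).sum()  # 'count' becomes the index
flat = grouped.reset_index()           # 'count' back as a regular column
```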
Pivot Tables
In datascience, we used the pivot() function to build contingency tables:
# Creating new Table
ds_cones = Table().with_columns(
'Flavor', make_array('strawberry', 'chocolate', 'chocolate',
'strawberry', 'chocolate', 'bubblegum'),
'Color', make_array('pink', 'light brown', 'dark brown',
'pink', 'dark brown', 'pink'),
'Price', make_array(3.55, 4.75, 5.25, 5.25, 5.25, 4.75)
)
ds_cones
# Pivoting on color and flavor
ds_cones.pivot("Flavor", "Color")
We can also pass in the parameter values to specify the values in the table and collect to specify the aggregation function.
# Setting parameters values and collect
ds_cones.pivot("Flavor", "Color", values = "Price", collect = np.sum)
In pandas, we use pd.pivot_table() to create a contingency table. The argument columns is similar to the first argument of datascience's pivot function and sets the column names of the pivot table. The argument index is similar to the second argument of datascience's pivot function and sets the first column of the pivot table, i.e. the keys to group on. For more information, refer to the documentation.
# Creating new dataframe
pd_cones = pd.DataFrame({"Flavor" : ['strawberry', 'chocolate', 'chocolate',
'strawberry', 'chocolate', 'bubblegum'],
"Color" : ['pink', 'light brown', 'dark brown',
'pink', 'dark brown', 'pink'],
"Price" : [3.55, 4.75, 5.25, 5.25, 5.25, 4.75]})
pd_cones
# Creating the pivot table
pd.pivot_table(pd_cones, columns = ["Flavor"], index = ["Color"])
If there is no data for a group, pandas will output NaN ("Not a Number") values.
We can also specify parameters like values (equivalent to values in datascience's pivot) and aggfunc (equivalent to collect in datascience's pivot).
# Additional arguments
pd.pivot_table(pd_cones, columns = ["Flavor"], index = ["Color"],
values = "Price", aggfunc = np.sum)
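If we would rather see zeros than NaN for empty groups, pivot_table also accepts a fill_value parameter. A minimal sketch using the same cones data:

```python
import pandas as pd
import numpy as np

pd_cones = pd.DataFrame({"Flavor": ['strawberry', 'chocolate', 'chocolate',
                                    'strawberry', 'chocolate', 'bubblegum'],
                         "Color": ['pink', 'light brown', 'dark brown',
                                   'pink', 'dark brown', 'pink'],
                         "Price": [3.55, 4.75, 5.25, 5.25, 5.25, 4.75]})

# Empty Flavor/Color combinations become 0 instead of NaN
pivot = pd.pivot_table(pd_cones, columns=["Flavor"], index=["Color"],
                       values="Price", aggfunc="sum", fill_value=0)
```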
Joining/merging
In datascience, we used join() to join two tables based on shared values in columns. We specify the column name in the first table to match on, the name of the second table, and the column name in the second table to match on.
# Creating new table
ds_ratings = Table().with_columns(
'Kind', make_array('strawberry', 'chocolate', 'vanilla'),
'Stars', make_array(2.5, 3.5, 4)
)
ds_ratings
# Joining cones and ratings
ds_cones.join("Flavor", ds_ratings, "Kind")
In pandas, we can use the merge() function to join two tables together. The first parameter is the second dataframe to join with. The parameters left_on and right_on specify the columns to use in the left and right tables respectively. There are more parameters, such as how, which specifies what kind of join to perform (inner (the default), outer, left, right). For more information, refer to this Kaggle Tutorial.
# Creating new ratings df
pd_ratings = pd.DataFrame({"Kind" : ['strawberry', 'chocolate', 'vanilla'],
"Stars" : [2.5, 3.5, 4]})
pd_ratings
# Merging cones and ratings
pd_cones.merge(pd_ratings, left_on = "Flavor", right_on = "Kind")
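To see the effect of the how parameter, here is a sketch of an outer join on similar data: flavors that appear in only one table are kept, with NaN filling the missing side.

```python
import pandas as pd

pd_cones = pd.DataFrame({"Flavor": ['strawberry', 'chocolate', 'bubblegum'],
                         "Price": [3.55, 4.75, 4.75]})
pd_ratings = pd.DataFrame({"Kind": ['strawberry', 'chocolate', 'vanilla'],
                           "Stars": [2.5, 3.5, 4]})

# Outer join: 'bubblegum' has no rating, 'vanilla' has no cone,
# but both rows survive with NaN on the unmatched side
outer = pd_cones.merge(pd_ratings, left_on="Flavor", right_on="Kind",
                       how="outer")
```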
In datascience, we learned to plot data using histograms, line plots, scatter plots and bar charts. The corresponding functions were hist(), plot(), scatter() and barh(). Plotting methods in pandas are nearly identical to those in datascience, since both build on the matplotlib library. In this section we will go through examples of such plots in pandas.
In datascience, we used hist() to create a histogram. In this example, we will be using data from baby.csv. Recall that the baby data set contains data on a random sample of 1,174 mothers and their newborn babies. The column Birth.Weight contains the birth weight of the baby, in ounces; Gestational.Days is the number of gestational days, that is, the number of days the baby was in the womb. There is also data on maternal age, maternal height, maternal pregnancy weight, and whether or not the mother was a smoker.
# Reading in the data
ds_baby = Table.read_table('baby.csv')
ds_baby
# Creating a histogram
ds_baby.hist('Birth.Weight')
In pandas, we use hist() to create histograms, just like in datascience. Refer to the documentation for a full list of parameters.
# Creating a histogram
pd_baby.hist('Birth.Weight');
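hist() also accepts the usual matplotlib options; for example, bins controls the number of bins. A sketch on synthetic data (the birth weights below are randomly generated stand-ins, not the real baby.csv values):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs without a display

# Synthetic stand-in for pd_baby
df = pd.DataFrame({'Birth.Weight':
                   np.random.default_rng(0).normal(120, 18, 500)})

# bins=20 draws one bar per bin; hist() returns an array of Axes
axes = df.hist('Birth.Weight', bins=20)
```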
In datascience, we used plot() to create a line plot of numerical values. In this example, we will be using census data and plot variables such as AGE in a line plot:
# Line plot in datascience
ds_census = Table.read_table("https://raw.githubusercontent.com/YData123/sds123-sp21/main/demos/lec07/nc-est2014-agesex-res.csv").select(['SEX', 'AGE', 'POPESTIMATE2014'])
ds_children = ds_census.where('SEX', are.equal_to(0)).where('AGE', are.below(19)).drop('SEX')
ds_children.plot('AGE')
In pandas, we can use plot.line() to create line plots. For a full list of parameters, refer to the documentation.
# Line plot in pandas
pd_census = pd.read_csv("https://raw.githubusercontent.com/YData123/sds123-sp21/main/demos/lec07/nc-est2014-agesex-res.csv")[["SEX", "AGE", "POPESTIMATE2014"]]
pd_children = pd_census[(pd_census.SEX == 0) & (pd_census.AGE < 19)].drop("SEX", axis = 1)
pd_children.plot.line(x = "AGE", y = "POPESTIMATE2014");
In datascience, we used scatter() to create a scatter plot of two numerical columns:
ds_football = Table.read_table('https://raw.githubusercontent.com/YData123/sds123-sp21/main/demos/lec19/deflategate.csv')
ds_football
ds_football.scatter('Blakeman', 'Prioleau')
In pandas, we use plot.scatter() to create a scatter plot. For a full list of parameters, refer to the documentation.
pd_football = pd.read_csv('https://raw.githubusercontent.com/YData123/sds123-sp21/main/demos/lec19/deflategate.csv')
pd_football.plot.scatter(x = "Blakeman", y = "Prioleau");
In datascience, we used barh() to create a horizontal bar plot:
ds_tb.barh("letter", "points")
In pandas, we use plot.barh() to create a horizontal bar chart. For a full list of parameters, refer to the documentation.
pd_df.plot.barh(x = 'letter', y = 'points');
Here is a list of useful Pandas resources: