Introduction to Python for Data Sciences
Franck Iutzeler
In a previous chapter, we explored some features of NumPy and notably its arrays. Here we will take a look at the data structures provided by the Pandas library.
Pandas is a newer package built on top of NumPy which provides an efficient implementation of DataFrames. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations.
Just as we generally import NumPy under the alias np, we will import Pandas under the alias pd.
import pandas as pd
import numpy as np
A Pandas Series is a one-dimensional array of indexed data.
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
The contents can be accessed in the same way as for NumPy arrays, with the difference that when more than one value is selected, the result remains a Pandas Series.
print(data[0],type(data[0]))
0.25 <class 'numpy.float64'>
print(data[2:],type(data[2:]))
2    0.75
3    1.00
dtype: float64 <class 'pandas.core.series.Series'>
The type Series wraps both a sequence of values and a sequence of indices, which we can access with the values and index attributes.
values are the contents of the series as a NumPy array
print(data.values,type(data.values))
[0.25 0.5 0.75 1. ] <class 'numpy.ndarray'>
index are the indices of the series
print(data.index,type(data.index))
RangeIndex(start=0, stop=4, step=1) <class 'pandas.core.indexes.range.RangeIndex'>
The main difference between NumPy arrays and Pandas Series is the presence of this index field. By default, it is set (as in NumPy arrays) to 0, 1, ..., size_of_the_series - 1, but a Series index can be explicitly defined. The indices may be numbers but also strings. The contents of the series are then accessed using these defined indices.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
print(data)
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
print(data['c'])
0.75
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 3, 4, 2])
print(data)
1    0.25
3    0.50
4    0.75
2    1.00
dtype: float64
print(data[2])
1.0
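With an integer index, `data[2]` is interpreted as a label rather than a position, which is easy to misread. A minimal sketch of the distinction, using the `loc` and `iloc` indexers (presented later in this chapter) to make the intent explicit; the names `by_label` and `by_position` are only illustrative:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 3, 4, 2])

by_label = data.loc[2]      # value whose index label is 2
by_position = data.iloc[2]  # value stored at position 2 (the third element)

print(by_label, by_position)
```

Writing `loc` or `iloc` explicitly avoids relying on how plain square brackets resolve integer keys.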
Pandas Series and Python dictionaries are semantically close: both map keys to values. However, a Series stores its values in a typed NumPy array, which usually makes it more efficient than a dictionary in the context of data science. Naturally, Series can be constructed from dictionaries.
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
print(population_dict,type(population_dict))
print(population,type(population))
{'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135} <class 'dict'>
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64 <class 'pandas.core.series.Series'>
population['California']
38332521
population['California':'Illinois']
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64
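Unlike a dictionary, a Series supports slicing and vectorized arithmetic. The sketch below rebuilds the same `population` Series and illustrates that a label slice includes its end point whereas a positional slice does not; the names `by_label`, `by_position` and `in_millions` are only illustrative.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
                        'Florida': 19552860, 'Illinois': 12882135})

by_label = population['Texas':'Florida']  # label slice: 'Florida' is included
by_position = population.iloc[1:3]        # positional slice: position 3 is excluded
in_millions = population / 1e6            # arithmetic applies to every value at once

print(list(by_label.index))
print(list(by_position.index))
print(in_millions['Texas'])
```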
DataFrames are a fundamental object of Pandas that mimics what can be found in R, for instance. A DataFrame can be seen as an array of Series: to each index (corresponding to an individual, for instance, or a line in a table), a DataFrame maps multiple values; these values correspond to the columns of the DataFrame, each of which has a name (typically a string).
In the following example, we will construct a DataFrame from two Series with common indices.
area = pd.Series( {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})
states = pd.DataFrame({'Population': population, 'Area': area})
print(states,type(states))
            Population    Area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995 <class 'pandas.core.frame.DataFrame'>
In Jupyter notebooks, DataFrames are displayed in a fancier way when the name of the DataFrame is typed on its own (instead of being passed to print).
states
| | Population | Area |
---|---|---|
California | 38332521 | 423967 |
Texas | 26448193 | 695662 |
New York | 19651127 | 141297 |
Florida | 19552860 | 170312 |
Illinois | 12882135 | 149995 |
DataFrames have index, columns, and values attributes:
print(states.index)
print(states.columns)
print(states.values,type(states.values),states.values.shape)
Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
Index(['Population', 'Area'], dtype='object')
[[38332521   423967]
 [26448193   695662]
 [19651127   141297]
 [19552860   170312]
 [12882135   149995]] <class 'numpy.ndarray'> (5, 2)
Warning: When accessing a DataFrame, dataframe_name[column_name] returns the corresponding column as a Series, whereas dataframe_name[index_name] raises a KeyError! We will see later how to access a specific index.
print(states['Area'],type(states['Area']))
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: Area, dtype: int64 <class 'pandas.core.series.Series'>
try:
    print(states['California'])
except KeyError as error:
    print("KeyError: ",error)
KeyError: 'California'
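One way to see why the second lookup fails is to check where each name lives: square brackets search the column labels, not the row labels. A small sketch, rebuilding `states` with the same data:

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
                        'Florida': 19552860, 'Illinois': 12882135})
states = pd.DataFrame({'Population': population, 'Area': area})

area_is_column = 'Area' in states.columns              # 'Area' is a column label
california_is_column = 'California' in states.columns  # not a column, hence the KeyError
california_is_row = 'California' in states.index       # it is a row label

print(area_is_column, california_is_column, california_is_row)
```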
DataFrames can be created in several ways, for instance from Series, as above:
print(population,type(population))
states = pd.DataFrame({'Population': population, 'Area': area})
states
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64 <class 'pandas.core.series.Series'>
| | Population | Area |
---|---|---|
California | 38332521 | 423967 |
Texas | 26448193 | 695662 |
New York | 19651127 | 141297 |
Florida | 19552860 | 170312 |
Illinois | 12882135 | 149995 |
A DataFrame can also be built from a NumPy array:
A = np.random.randn(5,3)
print(A,type(A))
dfA = pd.DataFrame(A)
dfA
[[-1.47908983  0.55834675 -0.68109792]
 [ 1.18023681  1.82871481  0.0944462 ]
 [-0.22391784  0.26061809  0.68857944]
 [-1.75644104  0.74439857 -0.45926716]
 [-0.90534641 -1.57246221  2.28871663]] <class 'numpy.ndarray'>
| | 0 | 1 | 2 |
---|---|---|---|
0 | -1.479090 | 0.558347 | -0.681098 |
1 | 1.180237 | 1.828715 | 0.094446 |
2 | -0.223918 | 0.260618 | 0.688579 |
3 | -1.756441 | 0.744399 | -0.459267 |
4 | -0.905346 | -1.572462 | 2.288717 |
A DataFrame can also be built from a list of dictionaries:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
print(data,type(data))
print(data[0],type(data[0]))
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}] <class 'list'>
{'a': 0, 'b': 0} <class 'dict'>
df = pd.DataFrame(data)
df
| | a | b |
---|---|---|
0 | 0 | 0 |
1 | 1 | 2 |
2 | 2 | 4 |
DataFrames can also be created from a file, typically a CSV file (for comma-separated values), possibly with the names of the columns as a first line.
col_1_name,col_2_name,col_3_name
col_1_v1,col_2_v1,col_3_v1
col_1_v2,col_2_v2,col_3_v2
...
For other file types (MS Excel, libSVM, any other separator), see the input/output part of the Pandas documentation.
!head -4 data/president_heights.csv # Jupyter bash command to see the first 4 lines of the file
order,name,height(cm)
1,George Washington,189
2,John Adams,170
3,Thomas Jefferson,189
data = pd.read_csv('data/president_heights.csv')
data
| | order | name | height(cm) |
---|---|---|---|
0 | 1 | George Washington | 189 |
1 | 2 | John Adams | 170 |
2 | 3 | Thomas Jefferson | 189 |
3 | 4 | James Madison | 163 |
4 | 5 | James Monroe | 183 |
5 | 6 | John Quincy Adams | 171 |
6 | 7 | Andrew Jackson | 185 |
7 | 8 | Martin Van Buren | 168 |
8 | 9 | William Henry Harrison | 173 |
9 | 10 | John Tyler | 183 |
10 | 11 | James K. Polk | 173 |
11 | 12 | Zachary Taylor | 173 |
12 | 13 | Millard Fillmore | 175 |
13 | 14 | Franklin Pierce | 178 |
14 | 15 | James Buchanan | 183 |
15 | 16 | Abraham Lincoln | 193 |
16 | 17 | Andrew Johnson | 178 |
17 | 18 | Ulysses S. Grant | 173 |
18 | 19 | Rutherford B. Hayes | 174 |
19 | 20 | James A. Garfield | 183 |
20 | 21 | Chester A. Arthur | 183 |
21 | 23 | Benjamin Harrison | 168 |
22 | 25 | William McKinley | 170 |
23 | 26 | Theodore Roosevelt | 178 |
24 | 27 | William Howard Taft | 182 |
25 | 28 | Woodrow Wilson | 180 |
26 | 29 | Warren G. Harding | 183 |
27 | 30 | Calvin Coolidge | 178 |
28 | 31 | Herbert Hoover | 182 |
29 | 32 | Franklin D. Roosevelt | 188 |
30 | 33 | Harry S. Truman | 175 |
31 | 34 | Dwight D. Eisenhower | 179 |
32 | 35 | John F. Kennedy | 183 |
33 | 36 | Lyndon B. Johnson | 193 |
34 | 37 | Richard Nixon | 182 |
35 | 38 | Gerald Ford | 183 |
36 | 39 | Jimmy Carter | 177 |
37 | 40 | Ronald Reagan | 185 |
38 | 41 | George H. W. Bush | 188 |
39 | 42 | Bill Clinton | 188 |
40 | 43 | George W. Bush | 182 |
41 | 44 | Barack Obama | 185 |
42 | 45 | Donald Trump | 188 |
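`read_csv` is not limited to comma-separated files. As a hedged illustration, the sketch below reads semicolon-separated text through the `sep` parameter. The data is a small in-memory sample (wrapped in `io.StringIO` so that no file is needed) that reuses the first three rows shown above; the semicolon layout and the names `raw` and `heights` are assumptions made for the example, not the actual file.

```python
import io
import pandas as pd

# In-memory stand-in for a file; the semicolon-separated layout is assumed for illustration
raw = (
    "order;name;height(cm)\n"
    "1;George Washington;189\n"
    "2;John Adams;170\n"
    "3;Thomas Jefferson;189\n"
)

heights = pd.read_csv(io.StringIO(raw), sep=';')  # sep sets the field delimiter

print(heights.shape)
print(heights['height(cm)'].mean())
```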
Notice there can be missing values in DataFrames.
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
| | a | b | c |
---|---|---|---|
0 | 1.0 | 2 | NaN |
1 | NaN | 3 | 4.0 |
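Missing entries are stored as NaN, which is a floating-point value; this is why columns a and c above are displayed as floats. A minimal sketch of how such gaps can be counted with `isna` and replaced with `fillna` (the replacement value 0 is an arbitrary choice for the example):

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

missing_per_column = df.isna().sum()  # number of NaN entries in each column
filled = df.fillna(0)                 # new DataFrame with NaN replaced by 0

print(missing_per_column)
print(filled)
```

Note that `fillna` returns a new DataFrame here and leaves `df` unchanged.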
You can also set the index and column names after the DataFrame has been created.
dfA.columns = ['a','b','c']
dfA.index = [i**2 for i in range(1,6) ]
dfA
| | a | b | c |
---|---|---|---|
1 | -1.479090 | 0.558347 | -0.681098 |
4 | 1.180237 | 1.828715 | 0.094446 |
9 | -0.223918 | 0.260618 | 0.688579 |
16 | -1.756441 | 0.744399 | -0.459267 |
25 | -0.905346 | -1.572462 | 2.288717 |
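The same labels can also be supplied when the DataFrame is created, through the `index` and `columns` parameters of the constructor. A short sketch with a fresh random array (so the values differ from those above); the name `dfB` is only illustrative:

```python
import numpy as np
import pandas as pd

A = np.random.randn(5, 3)

# Row and column labels are passed directly to the constructor
dfB = pd.DataFrame(A, index=[i**2 for i in range(1, 6)], columns=['a', 'b', 'c'])

print(dfB)
```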
area = pd.Series( {'California': 423967, 'Texas': 695662, 'New York': 141297, 'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127, 'Florida': 19552860, 'Illinois': 12882135})
states = pd.DataFrame({'Population': population, 'Area': area})
states
| | Population | Area |
---|---|---|
California | 38332521 | 423967 |
Texas | 26448193 | 695662 |
New York | 19651127 | 141297 |
Florida | 19552860 | 170312 |
Illinois | 12882135 | 149995 |
You may access columns directly by name; from the resulting Series, you can then access individuals by their index.
states['Area']
California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: Area, dtype: int64
states['Area']['Texas']
695662
To ease access, Pandas offers dedicated indexers: iloc selects rows and columns by integer position, while loc selects them by label.
states.iloc[:2]
| | Population | Area |
---|---|---|
California | 38332521 | 423967 |
Texas | 26448193 | 695662 |
states.iloc[:2,0]
California    38332521
Texas         26448193
Name: Population, dtype: int64
states.loc[:'New York']
| | Population | Area |
---|---|---|
California | 38332521 | 423967 |
Texas | 26448193 | 695662 |
New York | 19651127 | 141297 |
states.loc[:,'Population':]
| | Population | Area |
---|---|---|
California | 38332521 | 423967 |
Texas | 26448193 | 695662 |
New York | 19651127 | 141297 |
Florida | 19552860 | 170312 |
Illinois | 12882135 | 149995 |
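Both indexers accept a row selector and a column selector, so a single cell or a filtered subset can be retrieved in one step instead of chaining brackets. A sketch on the same `states` data; the threshold of 20 million and the names `texas_area`, `same_cell` and `large` are arbitrary choices for the example:

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127,
                        'Florida': 19552860, 'Illinois': 12882135})
states = pd.DataFrame({'Population': population, 'Area': area})

texas_area = states.loc['Texas', 'Area']  # single cell, by row and column labels
same_cell = states.iloc[1, 1]             # same cell, by row and column positions

# Boolean mask on the rows, combined with a column label
large = states.loc[states['Population'] > 20_000_000, 'Population']

print(texas_area, same_cell)
print(large)
```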