Pandas objects can be thought of as enhanced versions of NumPy structured arrays in which the rows and columns are identified with labels rather than simple integer indices.
Let's introduce these three fundamental Pandas data structures: the Series, DataFrame, and Index.
import numpy as np
import pandas as pd
A Pandas Series is a one-dimensional array of indexed data:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
0 0.25 1 0.50 2 0.75 3 1.00 dtype: float64
Series objects wrap both a sequence of values and a sequence of indices, which we can access with the values and index attributes.
The values are simply a familiar NumPy array:
data.values
array([0.25, 0.5 , 0.75, 1. ])
type(_)
numpy.ndarray
The index is an array-like object of type pd.Index:
data.index
RangeIndex(start=0, stop=4, step=1)
print(pd.RangeIndex.__doc__)
Immutable Index implementing a monotonic integer range.

RangeIndex is a memory-saving special case of Int64Index limited to
representing monotonic ranges. Using RangeIndex may in some instances
improve computing speed.

This is the default index type used by DataFrame and Series when
no explicit index is provided by the user.

Parameters
----------
start : int (default: 0), or other RangeIndex instance
    If int and "stop" is not given, interpreted as "stop" instead.
stop : int (default: 0)
step : int (default: 1)
dtype : np.int64
    Unused, accepted for homogeneity with other index types.
copy : bool, default False
    Unused, accepted for homogeneity with other index types.
name : object, optional
    Name to be stored in the index.

Attributes
----------
start
stop
step

Methods
-------
from_range

See Also
--------
Index : The base pandas Index type.
Int64Index : Index of int64 data.
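To see the memory savings concretely, here is a small sketch; the exact byte counts are implementation details and will vary by Pandas version:
s1 = pd.Series(np.ones(1000))                         # default RangeIndex
s2 = pd.Series(np.ones(1000), index=np.arange(1000))  # fully materialized integer index
s1.index.memory_usage(), s2.index.memory_usage()      # the RangeIndex stores only start, stop, and step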
Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:
data[1]
0.5
data[1:3]
1 0.50 2 0.75 dtype: float64
type(_)
pandas.core.series.Series
Series as generalized NumPy array
It may look like the Series object is basically interchangeable with a one-dimensional NumPy array.
The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data
a 0.25 b 0.50 c 0.75 d 1.00 dtype: float64
data['b'] # item access works as expected
0.5
We can even use non-contiguous or non-sequential indices:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])
data
2 0.25 5 0.50 3 0.75 7 1.00 dtype: float64
data[5]
0.5
You can think of a Pandas Series a bit like a specialization of a Python dictionary.
A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure that maps typed keys to a set of typed values:
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
population
California 38332521 Texas 26448193 New York 19651127 Florida 19552860 Illinois 12882135 dtype: int64
type(_)
pandas.core.series.Series
population.index
Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
population['California'] # typical dictionary-style item access
38332521
population['California':'Illinois'] # array-like slicing
California 38332521 Texas 26448193 New York 19651127 Florida 19552860 Illinois 12882135 dtype: int64
The next fundamental structure in Pandas is the DataFrame, which can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.
You can think of a DataFrame as a sequence of aligned Series objects.
Here, by aligned we mean that they share the same index:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
California 423967 Texas 695662 New York 141297 Florida 170312 Illinois 149995 dtype: int64
states = pd.DataFrame({'population': population, 'area': area})
states
population | area | |
---|---|---|
California | 38332521 | 423967 |
Texas | 26448193 | 695662 |
New York | 19651127 | 141297 |
Florida | 19552860 | 170312 |
Illinois | 12882135 | 149995 |
type(_)
pandas.core.frame.DataFrame
states.index
Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:
states.columns
Index(['population', 'area'], dtype='object')
type(_)
pandas.core.indexes.base.Index
Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.
Similarly, we can also think of a DataFrame as a specialization of a dictionary.
Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data:
states['area'] # dict-style access of a column
California 423967 Texas 695662 New York 141297 Florida 170312 Illinois 149995 Name: area, dtype: int64
A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:
pd.DataFrame(population, columns=['population'])
population | |
---|---|
California | 38332521 |
Texas | 26448193 |
New York | 19651127 |
Florida | 19552860 |
Illinois | 12882135 |
Any list of dictionaries can be made into a DataFrame:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
data
[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]
pd.DataFrame(data)
a | b | |
---|---|---|
0 | 0 | 0 |
1 | 1 | 2 |
2 | 2 | 4 |
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}]) # Pandas will fill missing keys with NaN
a | b | c | |
---|---|---|---|
0 | 1.0 | 2 | NaN |
1 | NaN | 3 | 4.0 |
Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names:
np.random.rand(3, 2)
array([[0.30282887, 0.48376433], [0.53588853, 0.97428136], [0.94756199, 0.46766408]])
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
foo | bar | |
---|---|---|
a | 0.759907 | 0.458958 |
b | 0.776779 | 0.767430 |
c | 0.131552 | 0.740137 |
A DataFrame can also be constructed from a NumPy structured array:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A
array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])
pd.DataFrame(A)
A | B | |
---|---|---|
0 | 0 | 0.0 |
1 | 0 | 0.0 |
2 | 0 | 0.0 |
The Pandas Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values).
ind = pd.Index([2, 3, 5, 7, 11])
ind
Int64Index([2, 3, 5, 7, 11], dtype='int64')
The Index object in many ways operates like an array:
ind[1]
3
ind[::2]
Int64Index([2, 5, 11], dtype='int64')
Index objects also have many of the attributes familiar from NumPy arrays:
ind.size, ind.shape, ind.ndim, ind.dtype
(5, (5,), 1, dtype('int64'))
One difference is that indices are immutable; that is, they cannot be modified via the normal means:
ind[1] = 0
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-60-906a9fa1424c> in <module> ----> 1 ind[1] = 0 ~/Developer/py-venvs/sphinx-venv/lib/python3.9/site-packages/pandas/core/indexes/base.py in __setitem__(self, key, value) 4275 @final 4276 def __setitem__(self, key, value): -> 4277 raise TypeError("Index does not support mutable operations") 4278 4279 def __getitem__(self, key): TypeError: Index does not support mutable operations
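If you do need a "modified" index, the idiom is to construct a new object rather than mutate in place; a minimal sketch using the Index.delete and Index.insert methods, both of which return new Index objects:
ind.delete(1).insert(1, 0)  # replace the value at position 1 with 0, returning a new Index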
The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])
indA.intersection(indB) # intersection
Int64Index([3, 5, 7], dtype='int64')
indA.union(indB) # union
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
indA.symmetric_difference(indB) # symmetric difference
Int64Index([1, 2, 9, 11], dtype='int64')
To access and modify values in NumPy arrays we use indexing (e.g., arr[2, 1]), slicing (e.g., arr[:, 1:5]), masking (e.g., arr[arr > 0]), fancy indexing (e.g., arr[0, [1, 5]]), and combinations thereof (e.g., arr[:, [1, 5]]).
Here we'll look at similar means of accessing and modifying values in Pandas Series and DataFrame objects.
If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.
A Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data
a 0.25 b 0.50 c 0.75 d 1.00 dtype: float64
data['b'] # mnemonic indexing
0.5
'a' in data # dictionary-like Python expressions...
True
data.keys() # ...and methods.
Index(['a', 'b', 'c', 'd'], dtype='object')
list(data.items())
[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]
Series objects can even be modified with a dictionary-like syntax:
data['e'] = 1.25
data
a 0.25 b 0.50 c 0.75 d 1.00 e 1.25 dtype: float64
This easy mutability of the objects is a convenient feature: under the hood, Pandas makes decisions about the memory layout and data copying that might need to take place, and the user generally does not need to worry about these issues.
A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays; that is, slices, masking, and fancy indexing:
data['a':'c'] # slicing by explicit index
a 0.25 b 0.50 c 0.75 dtype: float64
data[0:2] # slicing by implicit integer index
a 0.25 b 0.50 dtype: float64
data[(data > 0.3) & (data < 0.8)] # masking
b 0.50 c 0.75 dtype: float64
This masking works because the expression itself evaluates to a Boolean Series:
(data > 0.3) & (data < 0.8)
a False b True c True d False e False dtype: bool
type(_)
pandas.core.series.Series
data[['a', 'e']] # fancy indexing
a 0.25 e 1.25 dtype: float64
Notice that when slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.
If your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data
1 a 3 b 5 c dtype: object
data[1] # explicit index when indexing
'a'
data[1:3] # implicit index when slicing
3 b 5 c dtype: object
Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.
First, the loc attribute allows indexing and slicing that always references the explicit index:
data.loc[1]
'a'
data.loc[1:3]
1 a 3 b dtype: object
The iloc attribute allows indexing and slicing that always references the implicit Python-style index:
data.iloc[1:3]
3 b 5 c dtype: object
A third indexing attribute, ix, is a hybrid of the two, and for Series objects is equivalent to standard []-based indexing.
The purpose of the ix indexer will become more apparent in the context of DataFrame objects.
(Note that ix has been deprecated and removed in modern versions of Pandas, in favor of the explicit loc and iloc attributes.)
Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index.
area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data
area | pop | |
---|---|---|
California | 423967 | 38332521 |
Texas | 695662 | 26448193 |
New York | 141297 | 19651127 |
Florida | 170312 | 19552860 |
Illinois | 149995 | 12882135 |
data['area'] # columns can be accessed via dict-style indexing
California 423967 Texas 695662 New York 141297 Florida 170312 Illinois 149995 Name: area, dtype: int64
data.area # alternatively, use attribute-style access with column names
California 423967 Texas 695662 New York 141297 Florida 170312 Illinois 149995 Name: area, dtype: int64
This dictionary-style syntax can also be used to modify the object, in this case adding a new column:
data['density'] = data['pop'] / data['area']
data
area | pop | density | |
---|---|---|---|
California | 423967 | 38332521 | 90.413926 |
Texas | 695662 | 26448193 | 38.018740 |
New York | 141297 | 19651127 | 139.076746 |
Florida | 170312 | 19552860 | 114.806121 |
Illinois | 149995 | 12882135 | 85.883763 |
The DataFrame can also be viewed as an enhanced two-dimensional array:
data.values # examine the raw underlying data array
array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01], [6.95662000e+05, 2.64481930e+07, 3.80187404e+01], [1.41297000e+05, 1.96511270e+07, 1.39076746e+02], [1.70312000e+05, 1.95528600e+07, 1.14806121e+02], [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])
data.values.T
array([[4.23967000e+05, 6.95662000e+05, 1.41297000e+05, 1.70312000e+05, 1.49995000e+05], [3.83325210e+07, 2.64481930e+07, 1.96511270e+07, 1.95528600e+07, 1.28821350e+07], [9.04139261e+01, 3.80187404e+01, 1.39076746e+02, 1.14806121e+02, 8.58837628e+01]])
type(_)
numpy.ndarray
data.T # transpose the full DataFrame object
California | Texas | New York | Florida | Illinois | |
---|---|---|---|---|---|
area | 4.239670e+05 | 6.956620e+05 | 1.412970e+05 | 1.703120e+05 | 1.499950e+05 |
pop | 3.833252e+07 | 2.644819e+07 | 1.965113e+07 | 1.955286e+07 | 1.288214e+07 |
density | 9.041393e+01 | 3.801874e+01 | 1.390767e+02 | 1.148061e+02 | 8.588376e+01 |
type(_)
pandas.core.frame.DataFrame
data.values[0] # passing a single index to an array accesses a row
array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])
data['area'] # passing a single "index" accesses a column
California 423967 Texas 695662 New York 141297 Florida 170312 Illinois 149995 Name: area, dtype: int64
Using the iloc indexer, we can index the underlying array as if it were a simple NumPy array (using the implicit Python-style index):
data.iloc[:3, :2]
area | pop | |
---|---|---|
California | 423967 | 38332521 |
Texas | 695662 | 26448193 |
New York | 141297 | 19651127 |
Similarly, using the loc indexer we can index the underlying data in an array-like style, but using the explicit index and column names:
data.loc[:'Illinois', :'pop']
area | pop | |
---|---|---|
California | 423967 | 38332521 |
Texas | 695662 | 26448193 |
New York | 141297 | 19651127 |
Florida | 170312 | 19552860 |
Illinois | 149995 | 12882135 |
Any of the familiar NumPy-style data access patterns can be used within these indexers.
data.loc[data.density > 100, ['pop', 'density']]
pop | density | |
---|---|---|
New York | 19651127 | 139.076746 |
Florida | 19552860 | 114.806121 |
Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:
data.iloc[0, 2] = 90
data
area | pop | density | |
---|---|---|---|
California | 423967 | 38332521 | 90.000000 |
Texas | 695662 | 26448193 | 38.018740 |
New York | 141297 | 19651127 | 139.076746 |
Florida | 170312 | 19552860 | 114.806121 |
Illinois | 149995 | 12882135 | 85.883763 |
data['Florida':'Illinois'] # *slicing* refers to rows
area | pop | density | |
---|---|---|---|
Florida | 170312 | 19552860 | 114.806121 |
Illinois | 149995 | 12882135 | 85.883763 |
data[data.density > 100] # direct masking operations are also interpreted row-wise
area | pop | density | |
---|---|---|---|
New York | 141297 | 19651127 | 139.076746 |
Florida | 170312 | 19552860 | 114.806121 |
One of the essential pieces of NumPy is the ability to perform quick element-wise operations, both with basic arithmetic (addition, subtraction, multiplication, etc.) and with more sophisticated operations (trigonometric functions, exponential and logarithmic functions, etc.).
Pandas inherits much of this functionality from NumPy.
Pandas includes a couple useful twists, however: for unary operations like negation and trigonometric functions, these ufuncs will preserve index and column labels in the output, and for binary operations such as addition and multiplication, Pandas will automatically align indices when passing the objects to the ufunc.
Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas Series and DataFrame objects:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser
0 6 1 3 2 7 3 4 dtype: int64
rng.randint(0, 10, (3, 4))
array([[1, 7, 5, 1], [4, 0, 9, 5], [8, 0, 9, 2]])
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
df
A | B | C | D | |
---|---|---|---|---|
0 | 6 | 3 | 8 | 2 |
1 | 4 | 2 | 6 | 4 |
2 | 8 | 6 | 1 | 3 |
If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object with the indices preserved:
np.exp(ser)
0 403.428793 1 20.085537 2 1096.633158 3 54.598150 dtype: float64
type(_)
pandas.core.series.Series
np.sin(df * np.pi / 4) # a slightly more complex calculation
A | B | C | D | |
---|---|---|---|---|
0 | -1.000000e+00 | 0.707107 | -2.449294e-16 | 1.000000e+00 |
1 | 1.224647e-16 | 1.000000 | -1.000000e+00 | 1.224647e-16 |
2 | -2.449294e-16 | -1.000000 | 7.071068e-01 | 7.071068e-01 |
type(_)
pandas.core.frame.DataFrame
For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation.
Suppose we are combining two different data sources, and find only the top three US states by area and the top three US states by population:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662, 'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193, 'New York': 19651127}, name='population')
population / area
Alaska NaN California 90.413926 New York NaN Texas 38.018740 dtype: float64
The resulting array contains the union of indices of the two input arrays, which could be determined using standard Python set arithmetic on these indices:
area.index.union(population.index) # this does create a new index and doesn't modify in place.
Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')
area.index
Index(['Alaska', 'Texas', 'California'], dtype='object')
Any item for which one or the other does not have an entry is marked with NaN, or "Not a Number," which is how Pandas marks missing data.
Index matching like this is implemented for all of Python's built-in arithmetic expressions; any missing values are filled in with NaN by default:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B
0 NaN 1 5.0 2 9.0 3 NaN dtype: float64
If using NaN values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators:
A.add(B, fill_value=0)
0 2.0 1 5.0 2 9.0 3 5.0 dtype: float64
A similar type of alignment takes place for both columns and indices when performing operations on DataFrames:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)), columns=list('AB'))
A
A | B | |
---|---|---|
0 | 13 | 17 |
1 | 8 | 1 |
B = pd.DataFrame(rng.randint(0, 10, (3, 3)), columns=list('BAC'))
B
B | A | C | |
---|---|---|---|
0 | 1 | 5 | 5 |
1 | 9 | 3 | 5 |
2 | 1 | 9 | 1 |
A + B
A | B | C | |
---|---|---|---|
0 | 18.0 | 18.0 | NaN |
1 | 11.0 | 10.0 | NaN |
2 | NaN | NaN | NaN |
fill = A.stack().mean()
fill
9.75
A.add(B, fill_value=fill)
A | B | C | |
---|---|---|---|
0 | 18.00 | 18.00 | 14.75 |
1 | 11.00 | 10.00 | 14.75 |
2 | 18.75 | 10.75 | 10.75 |
The following table lists Python operators and their equivalent Pandas object methods:
Python Operator | Pandas Method(s)
---|---
+ | add()
- | sub(), subtract()
* | mul(), multiply()
/ | truediv(), div(), divide()
// | floordiv()
% | mod()
** | pow()
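As a quick check of this table, the operator form and the method form give identical results, but only the method form accepts the fill_value argument; a minimal sketch using the A and B Series from above:
(A * B).equals(A.mul(B))  # True: the * operator and the mul() method agree
A.mul(B, fill_value=1)    # unmatched indices are treated as 1 before multiplying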
When performing operations between a DataFrame and a Series, the index and column alignment is similarly maintained.
Operations between a DataFrame and a Series are similar to operations between a two-dimensional and a one-dimensional NumPy array.
A = rng.randint(10, size=(3, 4))
A
array([[3, 8, 2, 4], [2, 6, 4, 8], [6, 1, 3, 8]])
type(A)
numpy.ndarray
A - A[0]
array([[ 0, 0, 0, 0], [-1, -2, 2, 4], [ 3, -7, 1, 4]])
According to NumPy's broadcasting rules, subtraction between a two-dimensional array and one of its rows is applied row-wise.
In Pandas, the convention similarly operates row-wise by default:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]
Q | R | S | T | |
---|---|---|---|---|
0 | 0 | 0 | 0 | 0 |
1 | -1 | -2 | 2 | 4 |
2 | 3 | -7 | 1 | 4 |
If you would instead like to operate column-wise, you have to specify the axis keyword:
df.subtract(df['R'], axis=0)
Q | R | S | T | |
---|---|---|---|---|
0 | -5 | 0 | -6 | -4 |
1 | -4 | 0 | -2 | 2 |
2 | 5 | 0 | 2 | 7 |
The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing.
To make matters even more complicated, different data sources may indicate missing data in different ways.
To indicate the presence of missing data in a table or DataFrame we can use two strategies: using a mask that globally indicates missing values, or choosing a sentinel value that indicates a missing entry.
In the masking approach, the mask might be an entirely separate Boolean array, or it may involve appropriation of one bit in the data representation to locally indicate the null status of a value.
In the sentinel approach, the sentinel value could be some data-specific convention, such as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a more global convention, such as indicating a missing floating-point value with NaN (Not a Number).
None of these approaches is without trade-offs: use of a separate mask array requires allocation of an additional Boolean array. A sentinel value reduces the range of valid values that can be represented, and may require extra (often non-optimized) logic in CPU and GPU arithmetic.
The way in which Pandas handles missing values is constrained by its reliance on the NumPy package, which does not have a built-in notion of NA values for non-floating-point data types.
NumPy does have support for masked arrays – that is, arrays that have a separate Boolean mask array attached for marking data as "good" or "bad." Pandas could have derived from this, but the overhead in storage, computation, and code maintenance makes that an unattractive choice.
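For reference, here is a minimal sketch of the NumPy masked-array approach that Pandas chose not to build on:
m = np.ma.masked_array([1, 2, 3, 4], mask=[False, True, False, False])
m.sum()  # 8: the masked entry is skipped in computations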
With these constraints in mind, Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point NaN value, and the Python None object.
None: Pythonic missing data
The first sentinel value used by Pandas is None, a Python singleton object that is often used for missing data in Python code.
Because it is a Python object, None cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type 'object' (i.e., arrays of Python objects):
vals1 = np.array([1, None, 3, 4])
vals1
array([1, None, 3, 4], dtype=object)
Any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types:
for dtype in ['object', 'int']:
print("dtype =", dtype)
%timeit np.arange(1E6, dtype=dtype).sum()
print()
dtype = object 81.8 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) dtype = int 1.87 ms ± 34.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The use of Python objects in an array also means that if you perform aggregations like sum() or min() across an array with a None value, you will generally get an error:
vals1.sum()
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-94-30a3fc8c6726> in <module> ----> 1 vals1.sum() ~/Developer/venvs/py-ml/lib/python3.8/site-packages/numpy/core/_methods.py in _sum(a, axis, dtype, out, keepdims, initial, where) 45 def _sum(a, axis=None, dtype=None, out=None, keepdims=False, 46 initial=_NoValue, where=True): ---> 47 return umr_sum(a, axis, dtype, out, keepdims, initial, where) 48 49 def _prod(a, axis=None, dtype=None, out=None, keepdims=False, TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
NaN: Missing numerical data
The other missing data representation, NaN (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:
vals2 = np.array([1, np.nan, 3, 4])
vals2.dtype
dtype('float64')
Notice that NaN propagates through arithmetic: regardless of the operation, the result of arithmetic with NaN will be another NaN:
1 + np.nan, 0 * np.nan
(nan, nan)
This means that aggregates over the values are well defined (they don't result in an error) but not always useful:
vals2.sum(), vals2.min(), vals2.max()
(nan, nan, nan)
NumPy does provide some special aggregations that will ignore these missing values:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)
(8.0, 1.0, 4.0)
NaN and None both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:
pd.Series([1, np.nan, 2, None])
0 1.0 1 NaN 2 2.0 3 NaN dtype: float64
The following table lists the upcasting conventions in Pandas when NA values are introduced:
Typeclass | Conversion When Storing NAs | NA Sentinel Value
---|---|---
floating | No change | np.nan
object | No change | None or np.nan
integer | Cast to float64 | np.nan
boolean | Cast to object | None or np.nan
Keep in mind that in Pandas, string data is always stored with an object dtype.
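As a quick sketch of the integer row of this table, assigning a null value to an integer Series upcasts it to floating point (recent Pandas versions may warn about or change this behavior):
x = pd.Series(range(2), dtype=int)  # dtype: int64
x[0] = None                         # the None is converted to NaN...
x                                   # ...and the dtype is upcast to float64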
As we have seen, Pandas treats None and NaN as essentially interchangeable for indicating missing or null values.
To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures.
They are:
- isnull(): Generate a Boolean mask indicating missing values
- notnull(): Opposite of isnull()
- dropna(): Return a filtered version of the data
- fillna(): Return a copy of the data with missing values filled or imputed
Pandas data structures have two useful methods for detecting null data: isnull() and notnull().
Either one will return a Boolean mask over the data:
data = pd.Series([1, np.nan, 'hello', None])
data.isnull()
0 False 1 True 2 False 3 True dtype: bool
In addition to the masking used before, there are the convenience methods dropna() (which removes NA values) and fillna() (which fills in NA values):
data.dropna()
0 1 2 hello dtype: object
For a DataFrame, there are more options:
df = pd.DataFrame([[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]])
df
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 2 |
1 | 2.0 | 3.0 | 5 |
2 | NaN | 4.0 | 6 |
df.dropna() # drop all rows in which *any* null value is present
0 | 1 | 2 | |
---|---|---|---|
1 | 2.0 | 3.0 | 5 |
df.dropna(axis='columns') # drop all columns containing a null value
2 | |
---|---|
0 | 2 |
1 | 5 |
2 | 6 |
The default is how='any', such that any row or column (depending on the axis keyword) containing a null value will be dropped.
df[3] = np.nan
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 2 | NaN |
1 | 2.0 | 3.0 | 5 | NaN |
2 | NaN | 4.0 | 6 | NaN |
You can also specify how='all', which will only drop rows/columns that are all null values:
df.dropna(axis='columns', how='all')
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 2 |
1 | 2.0 | 3.0 | 5 |
2 | NaN | 4.0 | 6 |
The thresh parameter lets you specify a minimum number of non-null values for the row/column to be kept:
df.dropna(axis='rows', thresh=3)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
1 | 2.0 | 3.0 | 5 | NaN |
Sometimes rather than dropping NA values, you'd rather replace them with a valid value.
This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.
You could do this in-place using the isnull() method as a mask, but because it is such a common operation Pandas provides the fillna() method, which returns a copy of the array with the null values replaced.
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data
a 1.0 b NaN c 2.0 d NaN e 3.0 dtype: float64
data.fillna(0) # fill NA entries with a single value
a 1.0 b 0.0 c 2.0 d 0.0 e 3.0 dtype: float64
data.fillna(method='ffill') # specify a forward-fill to propagate the previous value forward
a 1.0 b 1.0 c 2.0 d 2.0 e 3.0 dtype: float64
data.fillna(method='bfill') # specify a back-fill to propagate the next values backward
a 1.0 b 2.0 c 2.0 d 3.0 e 3.0 dtype: float64
For DataFrames, the options are similar, but we can also specify an axis along which the fills take place:
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | NaN | 2 | NaN |
1 | 2.0 | 3.0 | 5 | NaN |
2 | NaN | 4.0 | 6 | NaN |
df.fillna(method='ffill', axis=1)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.0 | 1.0 | 2.0 | 2.0 |
1 | 2.0 | 3.0 | 5.0 | 5.0 |
2 | NaN | 4.0 | 6.0 | 6.0 |
Up to this point we've been focused primarily on one-dimensional and two-dimensional data, stored in Pandas Series and DataFrame objects, respectively.
Often it is useful to go beyond this and store higher-dimensional data–that is, data indexed by more than one or two keys.
A common pattern in practice is to make use of hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index.
In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.
Let's start by considering how we might represent two-dimensional data within a one-dimensional Series.
Suppose you would like to track data about states from two different years. Using the Pandas tools we've already covered, you might be tempted to simply use Python tuples as keys:
index = [('California', 2000), ('California', 2010),
('New York', 2000), ('New York', 2010),
('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
18976457, 19378102,
20851820, 25145561]
pop = pd.Series(populations, index=index)
pop
(California, 2000) 33871648 (California, 2010) 37253956 (New York, 2000) 18976457 (New York, 2010) 19378102 (Texas, 2000) 20851820 (Texas, 2010) 25145561 dtype: int64
If you need to select all values from 2010, you'll need to do some messy (and potentially slow) munging to make it happen:
pop[[i for i in pop.index if i[1] == 2010]]
(California, 2010) 37253956 (New York, 2010) 19378102 (Texas, 2010) 25145561 dtype: int64
Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas MultiIndex type gives us the type of operations we wish to have:
index = pd.MultiIndex.from_tuples(index)
index
MultiIndex([('California', 2000), ('California', 2010), ( 'New York', 2000), ( 'New York', 2010), ( 'Texas', 2000), ( 'Texas', 2010)], )
type(_)
pandas.core.indexes.multi.MultiIndex
A MultiIndex contains multiple levels of indexing–in this case, the state names and the years–as well as multiple labels for each data point which encode these levels.
If we re-index our series with this MultiIndex, we see the hierarchical representation of the data:
pop = pop.reindex(index)
pop
California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561 dtype: int64
Here the first two columns of the Series representation show the multiple index values, while the third column shows the data.
Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.
Now to access all data for which the second index is 2010, we can simply use the Pandas slicing notation:
pop[:, 2010]
California 37253956 New York 19378102 Texas 25145561 dtype: int64
The result is a singly indexed array with just the keys we're interested in. This syntax is much more convenient (and the operation is much more efficient!) than the home-spun tuple-based multi-indexing solution that we started with.
We could have stored the same data using a simple DataFrame with index and column labels; in fact, Pandas is built with this equivalence in mind.
The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:
pop_df = pop.unstack()
pop_df
2000 | 2010 | |
---|---|---|
California | 33871648 | 37253956 |
New York | 18976457 | 19378102 |
Texas | 20851820 | 25145561 |
type(pop_df)
pandas.core.frame.DataFrame
Naturally, the stack() method provides the opposite operation:
pop_df.stack()
California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561 dtype: int64
Seeing this, you might wonder why we would bother with hierarchical indexing at all.
The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame.
Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent.
Concretely, we might want to add another column of demographic data for each state at each year (say, population under 18); with a MultiIndex this is as easy as adding another column to the DataFrame:
pop_df = pd.DataFrame({'total': pop,
'under18': [9267089, 9284094,
4687374, 4318033,
5906301, 6879014]})
pop_df
total | under18 | ||
---|---|---|---|
California | 2000 | 33871648 | 9267089 |
2010 | 37253956 | 9284094 | |
New York | 2000 | 18976457 | 4687374 |
2010 | 19378102 | 4318033 | |
Texas | 2000 | 20851820 | 5906301 |
2010 | 25145561 | 6879014 |
In addition, all the ufuncs we saw earlier work with hierarchical indices as well; for example, we can compute the fraction of the population under 18 by year:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()
2000 | 2010 | |
---|---|---|
California | 0.273594 | 0.249211 |
New York | 0.247010 | 0.222831 |
Texas | 0.283251 | 0.273568 |
The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor:
df = pd.DataFrame(np.random.rand(4, 2),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=['data1', 'data2'])
df
data1 | data2 | ||
---|---|---|---|
a | 1 | 0.482545 | 0.352967 |
2 | 0.574280 | 0.063582 | |
b | 1 | 0.102271 | 0.569372 |
2 | 0.753026 | 0.194597 |
Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:
data = {('California', 2000): 33871648,
('California', 2010): 37253956,
('Texas', 2000): 20851820,
('Texas', 2010): 25145561,
('New York', 2000): 18976457,
('New York', 2010): 19378102}
pd.Series(data)
California 2000 33871648 2010 37253956 Texas 2000 20851820 2010 25145561 New York 2000 18976457 2010 19378102 dtype: int64
For more flexibility in how the index is constructed, you can instead use the class method constructors available in pd.MultiIndex.
You can construct the MultiIndex from a simple list of arrays giving the index values within each level:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])
MultiIndex([('a', 1), ('a', 2), ('b', 1), ('b', 2)], )
You can even construct it from a Cartesian product of single indices:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])
MultiIndex([('a', 1), ('a', 2), ('b', 1), ('b', 2)], )
Sometimes it is convenient to name the levels of the MultiIndex.
This can be accomplished by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact:
pop.index.names = ['state', 'year']
pop
state year California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561 dtype: int64
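Equivalently (a small sketch), the names can be passed directly to any of the constructors shown earlier:
pd.MultiIndex.from_product([['California', 'New York', 'Texas'], [2000, 2010]],
                           names=['state', 'year'])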
In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well:
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]], names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']], names=['subject', 'type'])
data = np.round(np.random.randn(4, 6), 1) # mock some data
data[:, ::2] *= 10
data += 37
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data # create the DataFrame
subject | Bob | Guido | Sue | ||||
---|---|---|---|---|---|---|---|
type | HR | Temp | HR | Temp | HR | Temp | |
year | visit | ||||||
2013 | 1 | 48.0 | 38.1 | 19.0 | 38.4 | 52.0 | 38.8 |
2 | 34.0 | 38.0 | 37.0 | 36.9 | 31.0 | 37.6 | |
2014 | 1 | 41.0 | 37.0 | 52.0 | 38.9 | 38.0 | 37.4 |
2 | 47.0 | 36.9 | 46.0 | 36.4 | 42.0 | 36.6 |
This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number; we can index the top-level column by the person's name and get a full DataFrame containing just that person's information:
health_data['Guido']
type | HR | Temp | |
---|---|---|---|
year | visit | ||
2013 | 1 | 19.0 | 38.4 |
2 | 37.0 | 36.9 | |
2014 | 1 | 52.0 | 38.9 |
2 | 46.0 | 36.4 |
Let's see how indexing and slicing work on the multiply indexed pop Series we constructed earlier:
pop
state year California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561 dtype: int64
pop['California', 2000] # access single elements by indexing with multiple terms
33871648
The MultiIndex also supports partial indexing, or indexing just one of the levels in the index.
The result is another Series, with the lower-level indices maintained:
pop['California']
year 2000 33871648 2010 37253956 dtype: int64
Other types of indexing and selection could be based either on Boolean masks:
pop[pop > 22000000]
state year California 2000 33871648 2010 37253956 Texas 2010 25145561 dtype: int64
or on fancy indexing:
pop[['California', 'Texas']]
state year California 2000 33871648 2010 37253956 Texas 2000 20851820 2010 25145561 dtype: int64
A multiply indexed DataFrame behaves in a similar manner:
health_data
subject | Bob | Guido | Sue | ||||
---|---|---|---|---|---|---|---|
type | HR | Temp | HR | Temp | HR | Temp | |
year | visit | ||||||
2013 | 1 | 48.0 | 38.1 | 19.0 | 38.4 | 52.0 | 38.8 |
2 | 34.0 | 38.0 | 37.0 | 36.9 | 31.0 | 37.6 | |
2014 | 1 | 41.0 | 37.0 | 52.0 | 38.9 | 38.0 | 37.4 |
2 | 47.0 | 36.9 | 46.0 | 36.4 | 42.0 | 36.6 |
Remember that columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the columns.
We can recover Guido's heart rate data with a simple operation:
health_data['Guido', 'HR']
year visit 2013 1 19.0 2 37.0 2014 1 52.0 2 46.0 Name: (Guido, HR), dtype: float64
Also, as with the single-index case, we can use the loc and iloc indexers (older versions of Pandas also provided ix):
health_data.iloc[:2, :2]
subject | Bob | ||
---|---|---|---|
type | HR | Temp | |
year | visit | ||
2013 | 1 | 48.0 | 38.1 |
2 | 34.0 | 38.0 |
These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc or iloc can be passed a tuple of multiple indices:
health_data.loc[:, ('Bob', 'HR')]
year visit 2013 1 48.0 2 34.0 2014 1 41.0 2 47.0 Name: (Bob, HR), dtype: float64
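Working with slices within these index tuples is not especially convenient; trying to build a slice inside a tuple, as in health_data.loc[(:, 1), (:, 'HR')], is a Python syntax error. One way around this is Pandas' IndexSlice object, which is built for precisely this situation; a short sketch:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]  # visit 1 of each year, HR readings for all subjects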
One of the keys to working with multiply indexed data is knowing how to effectively transform the data.
There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various computations.
We saw a brief example of this in the stack() and unstack() methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns.
Earlier, we briefly mentioned a caveat, but we should emphasize it more here.
Many of the MultiIndex slicing operations will fail if the index is not sorted.
We'll start by creating some simple multiply indexed data where the indices are not lexicographically sorted:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data
char int a 1 0.002105 2 0.280923 c 1 0.008604 2 0.631968 b 1 0.072270 2 0.273800 dtype: float64
try:
data['a':'b'] # try to take a partial slice of this index
except KeyError as e:
print(type(e))
print(e)
<class 'pandas.errors.UnsortedIndexError'> 'Key length (1) was greater than MultiIndex lexsort depth (0)'
This is the result of the MultiIndex not being sorted; in general, partial slices and other similar operations require the levels in the MultiIndex to be in sorted (i.e., lexicographical) order.
Pandas provides convenience routines to perform this type of sorting; an example is the sort_index() method of the DataFrame (the older sortlevel() method has since been deprecated and removed):
data = data.sort_index()
data
char int a 1 0.002105 2 0.280923 b 1 0.072270 2 0.273800 c 1 0.008604 2 0.631968 dtype: float64
With the index sorted in this way, partial slicing will work as expected:
data['a':'b']
char int a 1 0.002105 2 0.280923 b 1 0.072270 2 0.273800 dtype: float64
As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:
pop.unstack(level=0)
state | California | New York | Texas |
---|---|---|---|
year | |||
2000 | 33871648 | 18976457 | 20851820 |
2010 | 37253956 | 19378102 | 25145561 |
pop.unstack(level=1)
year | 2000 | 2010 |
---|---|---|
state | ||
California | 33871648 | 37253956 |
New York | 18976457 | 19378102 |
Texas | 20851820 | 25145561 |
The opposite of unstack() is stack(), which here can be used to recover the original series:
pop.unstack().stack()
state year California 2000 33871648 2010 37253956 New York 2000 18976457 2010 19378102 Texas 2000 20851820 2010 25145561 dtype: int64
Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the reset_index method.
Calling this on the population Series will result in a DataFrame with state and year columns holding the information that was formerly in the index.
pop_flat = pop.reset_index(name='population') # specify the name of the data for the column
pop_flat
state | year | population | |
---|---|---|---|
0 | California | 2000 | 33871648 |
1 | California | 2010 | 37253956 |
2 | New York | 2000 | 18976457 |
3 | New York | 2010 | 19378102 |
4 | Texas | 2000 | 20851820 |
5 | Texas | 2010 | 25145561 |
Often when working with data in the real world, the raw input data looks like this, and it's useful to build a MultiIndex from the column values.
This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame:
pop_flat.set_index(['state', 'year'])
population | ||
---|---|---|
state | year | |
California | 2000 | 33871648 |
2010 | 37253956 | |
New York | 2000 | 18976457 |
2010 | 19378102 | |
Texas | 2000 | 20851820 |
2010 | 25145561 |
We've previously seen that Pandas has built-in data aggregation methods, such as mean(), sum(), and max().
For hierarchically indexed data, these can be passed a level parameter that controls which subset of the data the aggregate is computed on.
health_data
subject | Bob | Guido | Sue | ||||
---|---|---|---|---|---|---|---|
type | HR | Temp | HR | Temp | HR | Temp | |
year | visit | ||||||
2013 | 1 | 48.0 | 38.1 | 19.0 | 38.4 | 52.0 | 38.8 |
2 | 34.0 | 38.0 | 37.0 | 36.9 | 31.0 | 37.6 | |
2014 | 1 | 41.0 | 37.0 | 52.0 | 38.9 | 38.0 | 37.4 |
2 | 47.0 | 36.9 | 46.0 | 36.4 | 42.0 | 36.6 |
Perhaps we'd like to average out the measurements in the two visits each year. We can do this by naming the index level we'd like to explore, in this case the year:
data_mean = health_data.mean(level='year')
data_mean
subject | Bob | Guido | Sue | |||
---|---|---|---|---|---|---|
type | HR | Temp | HR | Temp | HR | Temp |
year | ||||||
2013 | 41.0 | 38.05 | 28.0 | 37.65 | 41.5 | 38.2 |
2014 | 44.0 | 36.95 | 49.0 | 37.65 | 40.0 | 37.0 |
By further making use of the axis keyword, we can take the mean among levels on the columns as well:
data_mean.mean(axis=1, level='type')
type | HR | Temp |
---|---|---|
year | ||
2013 | 36.833333 | 37.966667 |
2014 | 44.333333 | 37.200000 |
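Note that in newer versions of Pandas the level argument to aggregations like mean() has been deprecated in favor of groupby; assuming a recent Pandas, an equivalent sketch is:
health_data.groupby(level='year').mean()    # same result as health_data.mean(level='year')
data_mean.T.groupby(level='type').mean().T  # column-wise: transpose, group, transpose back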
Some of the most interesting studies of data come from combining different data sources.
These operations can involve anything from very straightforward concatenation of two different datasets, to more complicated database-style joins and merges that correctly handle any overlaps between the datasets.
Series and DataFrames are built with this type of operation in mind, and Pandas includes functions and methods that make this sort of data wrangling fast and straightforward.
Here we'll take a look at simple concatenation of Series and DataFrames with the pd.concat function; later we'll dive into the more sophisticated in-memory merges and joins implemented in Pandas.
For convenience, we'll define this function, which creates a DataFrame of a particular form that will be useful below:
def make_df(cols, ind):
"""Quickly make a DataFrame"""
data = {c: [str(c) + str(i) for i in ind]
for c in cols}
return pd.DataFrame(data, ind)
# example DataFrame
make_df('ABC', range(3))
A | B | C | |
---|---|---|---|
0 | A0 | B0 | C0 |
1 | A1 | B1 | C1 |
2 | A2 | B2 | C2 |
In addition, we'll create a quick class that allows us to display multiple DataFrames side by side. The code makes use of the special _repr_html_ method, which IPython uses to implement its rich object display:
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
pd.concat
Pandas has a function, pd.concat(), which has a similar syntax to np.concatenate but contains a number of options that we'll discuss momentarily:
# Signature in Pandas v0.18
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False,
copy=True)
pd.concat() can be used for a simple concatenation of Series or DataFrame objects, just as np.concatenate() can be used for simple concatenations of arrays:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
1 A 2 B 3 C 4 D 5 E 6 F dtype: object
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')
df1
A | B | |
---|---|---|
1 | A1 | B1 |
2 | A2 | B2 |
df2
A | B | |
---|---|---|
3 | A3 | B3 |
4 | A4 | B4 |
pd.concat([df1, df2])
A | B | |
---|---|---|
1 | A1 | B1 |
2 | A2 | B2 |
3 | A3 | B3 |
4 | A4 | B4 |
By default, the concatenation takes place row-wise within the DataFrame (i.e., axis=0).
Like np.concatenate, pd.concat allows specification of an axis along which concatenation will take place:
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis=1)")
df3
A | B | |
---|---|---|
0 | A0 | B0 |
1 | A1 | B1 |
df4
C | D | |
---|---|---|
0 | C0 | D0 |
1 | C1 | D1 |
pd.concat([df3, df4], axis=1)
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
One important difference between np.concatenate and pd.concat is that Pandas concatenation preserves indices, even if the result will have duplicate indices:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')
x
A | B | |
---|---|---|
0 | A0 | B0 |
1 | A1 | B1 |
y
A | B | |
---|---|---|
0 | A2 | B2 |
1 | A3 | B3 |
pd.concat([x, y])
A | B | |
---|---|---|
0 | A0 | B0 |
1 | A1 | B1 |
0 | A2 | B2 |
1 | A3 | B3 |
Notice the repeated indices in the result.
While this is valid within DataFrames, the outcome is often undesirable, and pd.concat() gives us a few ways to handle it.
For example, specifying the verify_integrity flag will raise an exception if the resulting indices overlap:
try:
pd.concat([x, y], verify_integrity=True)
except ValueError as e:
print("ValueError:", e)
ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
Sometimes the index itself does not matter, and you would prefer it to simply be ignored.
This option can be specified using the ignore_index flag.
With this set to True, the concatenation will create a new integer index for the result:
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')
x
A | B | |
---|---|---|
0 | A0 | B0 |
1 | A1 | B1 |
y
A | B | |
---|---|---|
0 | A2 | B2 |
1 | A3 | B3 |
pd.concat([x, y], ignore_index=True)
A | B | |
---|---|---|
0 | A0 | B0 |
1 | A1 | B1 |
2 | A2 | B2 |
3 | A3 | B3 |
Another option is to use the keys option to specify a label for the data sources; the result will be a hierarchically indexed DataFrame containing the data:
display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")
x
A | B | |
---|---|---|
0 | A0 | B0 |
1 | A1 | B1 |
y
A | B | |
---|---|---|
0 | A2 | B2 |
1 | A3 | B3 |
pd.concat([x, y], keys=['x', 'y'])
A | B | ||
---|---|---|---|
x | 0 | A0 | B0 |
1 | A1 | B1 | |
y | 0 | A2 | B2 |
1 | A3 | B3 |
In practice, data from different sources might have different sets of column names, and pd.concat offers several options in this case.
Consider the concatenation of the following two DataFrames, which have some (but not all!) columns in common:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
df5
A | B | C | |
---|---|---|---|
1 | A1 | B1 | C1 |
2 | A2 | B2 | C2 |
df6
B | C | D | |
---|---|---|---|
3 | B3 | C3 | D3 |
4 | B4 | C4 | D4 |
pd.concat([df5, df6])
A | B | C | D | |
---|---|---|---|---|
1 | A1 | B1 | C1 | NaN |
2 | A2 | B2 | C2 | NaN |
3 | NaN | B3 | C3 | D3 |
4 | NaN | B4 | C4 | D4 |
By default, the join is a union of the input columns (join='outer'), but we can change this to an intersection of the columns using join='inner'.
Another option is to directly specify the index of the remaining columns using the join_axes argument, which takes a list of index objects (note that join_axes has been removed in later versions of Pandas).
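For completeness, here is a sketch of the inner join on these same objects, together with a reindex-based alternative to join_axes for newer Pandas versions:
pd.concat([df5, df6], join='inner')                 # keep only the shared columns B and C
pd.concat([df5, df6]).reindex(columns=df5.columns)  # keep exactly df5's columns, like join_axes=[df5.columns]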
The append() method
Because direct array concatenation is so common, Series and DataFrame objects have an append method that can accomplish the same thing in fewer keystrokes.
For example, rather than calling pd.concat([df1, df2]), you can simply call df1.append(df2):
display('df1', 'df2', 'df1.append(df2)')
df1
A | B | |
---|---|---|
1 | A1 | B1 |
2 | A2 | B2 |
df2
A | B | |
---|---|---|
3 | A3 | B3 |
4 | A4 | B4 |
df1.append(df2)
A | B | |
---|---|---|
1 | A1 | B1 |
2 | A2 | B2 |
3 | A3 | B3 |
4 | A4 | B4 |
Keep in mind that unlike the append() and extend() methods of Python lists, the append() method in Pandas does not modify the original object; instead it creates a new object with the combined data.
It also is not a very efficient method, because it involves creation of a new index and data buffer.
Thus, if you plan to do multiple append operations, it is generally better to build a list of DataFrames and pass them all at once to the concat() function.
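A minimal sketch of that recommended pattern (in the newest Pandas versions append() has been removed entirely, making this the standard approach):
frames = [make_df('AB', [2 * i, 2 * i + 1]) for i in range(3)]  # build up a list of DataFrames...
pd.concat(frames)                                               # ...and concatenate them once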
One essential feature offered by Pandas is its high-performance, in-memory join and merge operations.
If you have ever worked with databases, you should be familiar with this type of data interaction.
The main interface for this is the pd.merge function, and we'll see a few examples of how this can work in practice.
For convenience, we will start by redefining the display() functionality:
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
The behavior implemented in pd.merge() is a subset of what is known as relational algebra, which is a formal set of rules for manipulating relational data, and forms the conceptual foundation of operations available in most databases.
The strength of the relational algebra approach is that it proposes several primitive operations, which become the building blocks of more complicated operations on any dataset.
With this lexicon of fundamental operations implemented efficiently in a database or other program, a wide range of fairly complicated composite operations can be performed.
Pandas implements several of these fundamental building blocks in the pd.merge() function and the related join() method of Series and DataFrames.
The pd.merge() function implements a number of types of joins: one-to-one, many-to-one, and many-to-many joins.
All three types of joins are accessed via an identical call to the pd.merge() interface; the type of join performed depends on the form of the input data.
Perhaps the simplest type of merge expression is the one-to-one join, which is in many ways very similar to the column-wise concatenation we have already seen.
As a concrete example, consider the following two DataFrames, which contain information on several employees in a company:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')
df1
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df2
employee | hire_date | |
---|---|---|
0 | Lisa | 2004 |
1 | Bob | 2008 |
2 | Jake | 2012 |
3 | Sue | 2014 |
To combine this information into a single DataFrame, we can use the pd.merge() function:
df3 = pd.merge(df1, df2)
df3
employee | group | hire_date | |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
The pd.merge() function recognizes that each DataFrame has an "employee" column, and automatically joins using this column as a key.
The result of the merge is a new DataFrame that combines the information from the two inputs.
Notice that the order of entries in each column is not necessarily maintained: in this case, the order of the "employee" column differs between df1 and df2, and the pd.merge() function correctly accounts for this.
Additionally, keep in mind that the merge in general discards the index, except in the special case of merges by index (see the left_index and right_index keywords, discussed momentarily).
Many-to-one joins are joins in which one of the two key columns contains duplicate entries.
For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')
df3
employee | group | hire_date | |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
df4
group | supervisor | |
---|---|---|
0 | Accounting | Carly |
1 | Engineering | Guido |
2 | HR | Steve |
pd.merge(df3, df4)
employee | group | hire_date | supervisor | |
---|---|---|---|---|
0 | Bob | Accounting | 2008 | Carly |
1 | Jake | Engineering | 2012 | Guido |
2 | Lisa | Engineering | 2004 | Guido |
3 | Sue | HR | 2014 | Steve |
Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined.
If the key column in both the left and right arrays contains duplicates, then the result is a many-to-many merge.
Consider the following, where we have a DataFrame showing one or more skills associated with a particular group.
By performing a many-to-many join, we can recover the skills associated with any individual person:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Engineering', 'Engineering', 'HR', 'HR'],
'skills': ['math', 'spreadsheets', 'coding', 'linux',
'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")
df1
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df5
group | skills | |
---|---|---|
0 | Accounting | math |
1 | Accounting | spreadsheets |
2 | Engineering | coding |
3 | Engineering | linux |
4 | HR | spreadsheets |
5 | HR | organization |
pd.merge(df1, df5)
employee | group | skills | |
---|---|---|---|
0 | Bob | Accounting | math |
1 | Bob | Accounting | spreadsheets |
2 | Jake | Engineering | coding |
3 | Jake | Engineering | linux |
4 | Lisa | Engineering | coding |
5 | Lisa | Engineering | linux |
6 | Sue | HR | spreadsheets |
7 | Sue | HR | organization |
We've already seen the default behavior of pd.merge(): it looks for one or more matching column names between the two inputs, and uses this as the key.
However, often the column names will not match so nicely, and pd.merge() provides a variety of options for handling this.
The on keyword
Most simply, you can explicitly specify the name of the key column using the on keyword, which takes a column name or a list of column names:
display('df1', 'df2', "pd.merge(df1, df2, on='employee')")
df1
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df2
employee | hire_date | |
---|---|---|
0 | Lisa | 2004 |
1 | Bob | 2008 |
2 | Jake | 2012 |
3 | Sue | 2014 |
pd.merge(df1, df2, on='employee')
employee | group | hire_date | |
---|---|---|---|
0 | Bob | Accounting | 2008 |
1 | Jake | Engineering | 2012 |
2 | Lisa | Engineering | 2004 |
3 | Sue | HR | 2014 |
The left_on and right_on keywords
At times you may wish to merge two datasets with different column names; for example, we may have a dataset in which the employee name is labeled as "name" rather than "employee".
In this case, we can use the left_on and right_on keywords to specify the two column names:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')
df1
employee | group | |
---|---|---|
0 | Bob | Accounting |
1 | Jake | Engineering |
2 | Lisa | Engineering |
3 | Sue | HR |
df3
name | salary | |
---|---|---|
0 | Bob | 70000 |
1 | Jake | 80000 |
2 | Lisa | 120000 |
3 | Sue | 90000 |
pd.merge(df1, df3, left_on="employee", right_on="name")
employee | group | name | salary | |
---|---|---|---|---|
0 | Bob | Accounting | Bob | 70000 |
1 | Jake | Engineering | Jake | 80000 |
2 | Lisa | Engineering | Lisa | 120000 |
3 | Sue | HR | Sue | 90000 |
The result has a redundant column that we can drop if desired; for example, by using the drop() method of DataFrames:
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)
employee | group | salary | |
---|---|---|---|
0 | Bob | Accounting | 70000 |
1 | Jake | Engineering | 80000 |
2 | Lisa | Engineering | 120000 |
3 | Sue | HR | 90000 |
The left_index and right_index keywords
Sometimes, rather than merging on a column, you would instead like to merge on an index. For example, your data might look like this:
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')
df1a
group | |
---|---|
employee | |
Bob | Accounting |
Jake | Engineering |
Lisa | Engineering |
Sue | HR |
df2a
hire_date | |
---|---|
employee | |
Lisa | 2004 |
Bob | 2008 |
Jake | 2012 |
Sue | 2014 |
You can use the index as the key for merging by specifying the left_index and/or right_index flags in pd.merge():
display('df1a', 'df2a', "pd.merge(df1a, df2a, left_index=True, right_index=True)")
df1a
group | |
---|---|
employee | |
Bob | Accounting |
Jake | Engineering |
Lisa | Engineering |
Sue | HR |
df2a
hire_date | |
---|---|
employee | |
Lisa | 2004 |
Bob | 2008 |
Jake | 2012 |
Sue | 2014 |
pd.merge(df1a, df2a, left_index=True, right_index=True)
group | hire_date | |
---|---|---|
employee | ||
Bob | Accounting | 2008 |
Jake | Engineering | 2012 |
Lisa | Engineering | 2004 |
Sue | HR | 2014 |
For convenience, DataFrames implement the join() method, which performs a merge that defaults to joining on indices:
display('df1a', 'df2a', 'df1a.join(df2a)')
df1a
group | |
---|---|
employee | |
Bob | Accounting |
Jake | Engineering |
Lisa | Engineering |
Sue | HR |
df2a
hire_date | |
---|---|
employee | |
Lisa | 2004 |
Bob | 2008 |
Jake | 2012 |
Sue | 2014 |
df1a.join(df2a)
group | hire_date | |
---|---|---|
employee | ||
Bob | Accounting | 2008 |
Jake | Engineering | 2012 |
Lisa | Engineering | 2004 |
Sue | HR | 2014 |
If you'd like to mix indices and columns, you can combine left_index with right_on or left_on with right_index to get the desired behavior:
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")
df1a
group | |
---|---|
employee | |
Bob | Accounting |
Jake | Engineering |
Lisa | Engineering |
Sue | HR |
df3
name | salary | |
---|---|---|
0 | Bob | 70000 |
1 | Jake | 80000 |
2 | Lisa | 120000 |
3 | Sue | 90000 |
pd.merge(df1a, df3, left_index=True, right_on='name')
group | name | salary | |
---|---|---|---|
0 | Accounting | Bob | 70000 |
1 | Engineering | Jake | 80000 |
2 | Engineering | Lisa | 120000 |
3 | HR | Sue | 90000 |
We have glossed over one important consideration in performing a join: the type of set arithmetic used in the join. This comes up when a value appears in one key column but not the other:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'], 'food': ['fish', 'beans', 'bread']},
columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'], 'drink': ['wine', 'beer']},
columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')
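By default, pd.merge() performs an inner join, keeping only the keys present in both inputs (here, only Mary). The how keyword selects the set arithmetic used; a short sketch of the other options:
pd.merge(df6, df7, how='outer')  # union of keys; missing entries are filled with NaN
pd.merge(df6, df7, how='left')   # keep every key from df6 (Peter, Paul, Mary)
pd.merge(df6, df7, how='right')  # keep every key from df7 (Mary, Joseph)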